Lecture 4 "Curse of Dimensionality / Perceptron" -Cornell CS4780 SP17

Introduction to K Nearest Neighbor Classifier

Overview of the Lecture

  • The speaker reflects on previous lectures, emphasizing the importance of foundational knowledge in mathematics for understanding upcoming topics.
  • Projects 1 and 2 are discussed, with a reminder for students to stay on track and not fall behind due to overlapping project deadlines.

K Nearest Neighbor Classifier

  • Introduction to the K nearest neighbor (KNN) classifier, which operates under the assumption that similar points have similar labels.
  • The method identifies the k nearest training points to a test point and assigns the majority label among them (with k = 1, simply the label of the single nearest point).

Understanding Error Rates and Bayes Optimal Classifier

Performance Insights

  • Given sufficient data, the nearest-neighbor classifier achieves an error rate at most twice that of the Bayes optimal classifier, indicating its potential effectiveness.

Curse of Dimensionality

Concept Explanation

  • The curse of dimensionality is introduced as a significant challenge in high-dimensional spaces where data points become sparse.
  • A mathematical exploration begins: how large must a box inside the unit hypercube be to contain the k nearest neighbors of a point?

Mathematical Analysis

  • With n points uniformly distributed in the unit hypercube, a box of edge length L has volume L^D and is therefore expected to contain about n · L^D of the points.
  • Setting this count to k establishes the relationship between box size L, number of neighbors k, and total points n: L ≈ (k/n)^(1/D).
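The scaling of the box size with dimension can be checked numerically. This is my own sketch of the lecture's back-of-envelope estimate, assuming n points drawn uniformly in the unit hypercube:

```python
# Edge length of a sub-cube expected to contain k of n uniform points in D dims:
# n * l^D ≈ k  =>  l ≈ (k/n)^(1/D).
def edge_length(k, n, D):
    """Approximate edge length of a box holding k of n uniform points in [0,1]^D."""
    return (k / n) ** (1.0 / D)

for D in [1, 2, 10, 100, 1000]:
    print(f"D={D:5d}: l ≈ {edge_length(10, 1000, D):.3f}")
```

For 10 neighbors out of 1000 points, the box edge grows from 0.01 in one dimension to over 0.99 in a thousand dimensions, i.e. the "local" neighborhood spans almost the entire cube.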

Implications of High Dimensions

Size Comparisons Across Dimensions

  • For different values of D (dimensions), calculations reveal surprising results about how quickly box sizes increase with dimensionality.
  • In higher dimensions (e.g., D = 1000), even small numbers of neighbors require disproportionately large boxes, illustrating sparsity issues.

Challenges in High-Dimensional Spaces

  • The discussion highlights that in high-dimensional spaces, many data points are located far from each other, complicating clustering and classification tasks.
  • This phenomenon leads to difficulties in defining "closeness," as most points reside near edges rather than within dense clusters.

Conclusion: Understanding Data Distribution

Key Takeaways

  • The lecture concludes by stressing that high-dimensional spaces often contain empty interiors where few data points exist, challenging traditional intuitions about distance and similarity.

Understanding the Limitations of K-Nearest Neighbors

The Assumption of Proximity in K-Nearest Neighbors

  • The assumption that nearby points share the same label is flawed; in high-dimensional spaces, points are often equidistant from each other, making it unreasonable to rely on proximity for classification.
  • The argument emphasizes that two points can be close together yet have different labels, highlighting the inadequacy of using distance alone as a metric for similarity.

High-Dimensional Space and Point Distribution

  • When points are drawn uniformly at random within a cube, each coordinate is selected independently; as dimensionality grows, most points end up near the edges of the cube rather than in its interior.
  • Defining the interior as the region at least an epsilon margin away from every edge, the probability of a random point landing in this interior shrinks rapidly as the dimension increases.

Probability and Dimensionality Challenges

  • In D dimensions, the probability of remaining in the interior is (1 − 2ε)^D, the product of the per-coordinate probabilities; it decays exponentially and approaches zero as D increases.
  • This leads to a critical insight: while K-nearest neighbors may fail in high-dimensional spaces under uniform distribution assumptions, real-world data often does not follow such distributions.
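A quick numerical sketch of this argument (assuming, as in the lecture, uniform draws in the unit hypercube and "interior" meaning every coordinate stays at least ε away from 0 and 1):

```python
# P(point lands in the interior) = product over D coordinates of P(eps < x_i < 1 - eps)
# = (1 - 2*eps)^D, which vanishes exponentially as D grows.
def p_interior(eps, D):
    return (1 - 2 * eps) ** D

for D in [2, 10, 100, 1000]:
    print(f"D={D:5d}: P(interior, eps=0.05) = {p_interior(0.05, D):.2e}")
```

Even with a thin 5% margin, by D = 1000 essentially no probability mass remains in the interior.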

Intrinsic Dimensionality and Data Subspaces

  • Real-world data like images exist within lower-dimensional subspaces embedded in high-dimensional spaces. Thus, K-nearest neighbors can still function effectively if data lies within these intrinsic lower dimensions.
  • An example illustrates how a two-dimensional plane can be represented within a thousand-dimensional space without losing its low dimensionality characteristics.

Understanding Manifolds and Their Properties

  • A manifold is defined as a low-dimensional surface embedded within a higher-dimensional space; it remains locally Euclidean despite being part of a more complex global structure.

Understanding Manifolds and Dimensionality Reduction in Data Analysis

The Nature of Manifolds

  • Discussion on manifolds highlights that while they may appear Euclidean locally, they do not maintain this property globally. This means distances calculated using Euclidean metrics can be misleading.
  • An example illustrates how points on a manifold can be closer in Euclidean space than they are when considering the true manifold structure, emphasizing the importance of understanding the geometry of data.

K-Nearest Neighbors (KNN) and Local Distances

  • The KNN algorithm operates effectively by focusing on local distances rather than global ones, allowing it to classify points based on their nearest neighbors within a specific region.
  • Techniques like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) help estimate if data lies on a subspace or manifold by finding new coordinate systems that capture variance efficiently.
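The idea that a spectral decomposition reveals a low-dimensional subspace can be sketched as follows (the data here is synthetic and of my own construction: a noisy 2-D plane embedded in 10 dimensions):

```python
import numpy as np

# Generate data that lies (up to small noise) on a 2-D plane inside R^10.
rng = np.random.default_rng(0)
coords = rng.standard_normal((500, 2))            # 2-D latent coordinates
basis = rng.standard_normal((2, 10))              # random embedding into 10-D
X = coords @ basis + 0.01 * rng.standard_normal((500, 10))

X -= X.mean(axis=0)                               # center the data, as PCA requires
s = np.linalg.svd(X, compute_uv=False)            # singular values, descending
print(np.round(s, 2))                             # two dominant values, then a tiny noise floor
```

The sharp drop after the second singular value signals that the intrinsic dimensionality is two, even though the ambient space is 10-dimensional.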

Estimating Intrinsic Dimensionality

  • A method for estimating intrinsic dimensionality involves drawing spheres around data points and observing how the number of points increases as the sphere's size doubles. This helps determine whether data is two-dimensional or three-dimensional.
  • The discussion emphasizes that understanding the true dimensionality of data is crucial; for instance, facial images may have thousands of pixels but only require a few dimensions to accurately describe them.

Practical Applications and Preprocessing Steps

  • Reducing dimensionality through techniques like PCA is recommended as an essential preprocessing step before applying algorithms, especially since many algorithms struggle with high-dimensional datasets.

Demonstration of KNN Classifier

  • A demo showcases a KNN classifier where different distance metrics are applied to visualize classification regions. It demonstrates how varying parameters affect classification boundaries.

Understanding Decision Boundaries and Dimensionality

The Dynamics of Decision Boundaries

  • In the demo, each data point's circle of influence initially dominates its local area, but as the neighborhood size grows the decision boundary smooths out toward the locally averaged label.
  • Eventually the majority label (blue) predominates, so even regions containing some mislabeled points are still classified correctly because of the surrounding density.

Exploring the Curse of Dimensionality

  • A demonstration is introduced to illustrate the curse of dimensionality by drawing data points within a hypercube and computing distances between them.
  • A histogram shows how many pairs of points exist at various distances; maximum distance in two dimensions is √2 when points are at opposite corners.

Distance Distribution in Higher Dimensions

  • In higher dimensions, the histogram of pairwise distances shifts and concentrates; as dimensionality increases, fewer and fewer pairs of points remain close together.
  • By ten dimensions, very few pairs of points lie within distance 0.5 of each other, illustrating that neighbors become sparse and distant.
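The spirit of this in-class demo can be reproduced with a short simulation (my own sketch, assuming points drawn uniformly in the unit hypercube): as D grows, the ratio between the closest and farthest pair approaches 1, so "nearest" stops meaning "near".

```python
import numpy as np

rng = np.random.default_rng(2)
ratios = {}
for D in [2, 10, 100, 1000]:
    X = rng.uniform(size=(100, D))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T        # squared pairwise distances
    dist = np.sqrt(np.maximum(d2[np.triu_indices(100, k=1)], 0))  # unique pairs only
    ratios[D] = dist.min() / dist.max()
    print(f"D={D:5d}: min/max pairwise distance ratio = {ratios[D]:.3f}")
```

In 2 dimensions the closest pair is orders of magnitude nearer than the farthest; by 1000 dimensions the two are nearly indistinguishable.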

Implications for Nearest Neighbors

  • K-nearest neighbors (KNN) works well in lower dimensions but degrades as dimensionality increases, because all points become nearly equidistant from one another.
  • In extreme cases like 10,000 dimensions, no meaningful nearest-neighbor relationships exist; the lecture's analogy is farms spread so far apart that nobody really has neighbors.

Addressing Questions on Distance Metrics

  • The speaker confirms that the effect demonstrated with the L2 metric appears under other distance metrics as well, although specialized metrics may work better for particular datasets such as images.
  • It’s crucial to recognize that if data does not possess lower intrinsic dimensionality, analyzing it may yield uninteresting results.

Precision and Neighbor Relationships

  • Even with very high-precision computation (e.g., 1000-bit floating-point numbers), two points whose distances differ only minutely are still far apart in absolute terms and cannot be considered true neighbors.
  • The discussion emphasizes that similar labels do not guarantee proximity; thus assumptions about shared labels based on closeness can be misleading.

Understanding Dimensionality and K-Nearest Neighbors

The Nature of Data in High Dimensions

  • There is no uniform pattern in high-dimensional data; it often lacks a clear structure. Efforts are made to identify reasonable low-dimensional representations or clusters within the data.
  • The concept of dimensionality is illustrated using a hypercube, where maximum distances between points are constrained by the shape's geometry.
  • The "curse of dimensionality" implies that as dimensions increase, points concentrate near the edges of the space and end up nearly maximally distant from one another.

Challenges with K-Nearest Neighbors (KNN)

Assumptions and Limitations

  • KNN assumes that nearby points share similar labels. This can lead to inaccuracies when neighbors are not representative due to geographical or contextual differences.
  • An example illustrates how distant neighbors (e.g., from Cleveland and San Francisco) may not provide relevant information about a person’s characteristics (e.g., baseball team preference).

Practical Considerations

  • While KNN can be effective with large datasets, its performance diminishes near decision boundaries, making it less reliable in certain scenarios.
  • Collecting more data does not always solve issues inherent in KNN; challenges remain even with increased sample sizes.

Computational Complexity

Efficiency Concerns

  • As dataset size increases, the computational burden grows significantly. Each prediction requires calculating distances for all training points, which becomes impractical with large datasets.
  • The time complexity of inference is O(n · D): doubling the dataset size doubles the computation time per prediction, which poses significant limitations for web-scale problems like Google's classification tasks.

Introduction to the Perceptron Algorithm

Overview of the Perceptron

  • The perceptron, invented by Frank Rosenblatt in 1957 at Cornell University, is considered one of the first machine learning algorithms and serves as an alternative to k-nearest neighbors.

Concept of Hyperplanes

  • The perceptron assumes that there exists a hyperplane that can separate two classes of data points (e.g., noughts and crosses).
  • It posits that all points from one class lie on one side of this hyperplane while those from the other class lie on the opposite side.

Dimensionality Considerations

  • In high-dimensional spaces, data points tend to be far apart, making it easier for a hyperplane to exist that separates different classes.
  • The higher the dimensionality, the more likely such a separating hyperplane exists; in effectively infinite-dimensional spaces, separation is all but guaranteed.

Comparison with K-Nearest Neighbors

Differences in Approach

  • Unlike k-nearest neighbors, which is only practical in lower-dimensional settings, the perceptron operates effectively in high-dimensional ones.

Computational Efficiency

  • The speed of classification using a hyperplane remains constant regardless of the number of training points since it only requires evaluating whether a new point lies above or below the hyperplane.

Training and Testing with Perceptrons

Finding Hyperplanes

  • A hyperplane is mathematically defined by vector W and offset B; it separates classes based on their positions relative to this plane.

Classification Process

  • During testing, new data points are classified by determining which side of the hyperplane they fall on through simple calculations involving W and B.
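The classification rule can be sketched as follows (the weight vector, offset, and test points here are hypothetical values of my own, not from the lecture); note that the cost per test point depends only on the dimension D, not on the number of training points:

```python
import numpy as np

def classify(w, b, x):
    """Label a point by which side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if np.dot(w, x) + b > 0 else -1   # sign of the signed distance

w = np.array([1.0, -2.0])   # hypothetical learned weight vector
b = 0.5                     # hypothetical learned offset

print(classify(w, b, np.array([3.0, 1.0])))    # 3 - 2 + 0.5 = 1.5 > 0  -> +1
print(classify(w, b, np.array([0.0, 2.0])))    # 0 - 4 + 0.5 = -3.5 < 0 -> -1
```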

Limitations and Assumptions

Binary Label Requirement

  • The perceptron assumes binary labels (exactly two classes); multi-class or continuous-output problems require modifications before it applies.

Handling Edge Cases

  • If a test point lies exactly on the hyperplane during classification, a random decision (like flipping a coin) may be used to assign its label.

Addressing Non-Linearly Separable Data

Challenges with Separation

  • If data cannot be separated linearly (e.g., circles vs. crosses), adjustments must be made before applying the perceptron algorithm.

Mapping into Higher Dimensions

  • Techniques will be introduced later to map non-linearly separable data into higher dimensions where linear separation becomes feasible.

Formalizing Labels and Learning Parameters

Label Definitions

  • Labels are formalized as y ∈ {+1, −1}, so that a point x is classified correctly exactly when y(W^T x + B) > 0.
How to Simplify Learning by Eliminating Bias in Data Representation

Transforming Data for Simplicity

  • The speaker notes that learning two separate components (the weight vector W and the bias B) is harder than learning one, and that the problem simplifies if only a single vector must be learned.
  • A transformation is introduced: every x is extended with an extra constant dimension set to 1, and the bias B is appended to W, absorbing the offset into the weight vector.

Understanding Hyperplanes and Dimensions

  • The concept of defining a hyperplane W^T X = 0 is presented, emphasizing that the bias can be absorbed into the data through dimensional transformation.
  • By eliminating the offset (B), the hyperplane now passes through the origin, indicating that W represents the orientation of this hyperplane.
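This bias-absorbing trick can be verified in a few lines (the particular values of w, b, and x are hypothetical, chosen just for illustration):

```python
import numpy as np

w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])

x_aug = np.append(x, 1.0)    # x' = [x, 1]: add a constant dimension
w_aug = np.append(w, b)      # w' = [w, b]: fold the offset into the weights

# w'.x' = w.x + b, so the hyperplane w'.x' = 0 through the origin in D+1
# dimensions encodes the original offset hyperplane in D dimensions.
print(w_aug @ x_aug)         # 1*3 + (-2)*1 + 0.5*1 = 1.5
```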

Geometric Interpretation of Data Transformation

  • The geometric implications are explained: moving all data points into an additional dimension effectively shifts them while maintaining their relationships.
  • This shift allows for a new perspective on how data points are arranged; previously separate points are now aligned differently within this transformed space.

Video description

Cornell class CS4780. (Online version: https://tinyurl.com/eCornellML )
Official class webpage: http://www.cs.cornell.edu/courses/cs4780/2018fa/
Written lecture notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
Past 4780 exams: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
Past 4780 homeworks: https://www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0

If you want to take the course for credit and obtain an official certificate, there is now a revamped version with 118 new high-quality videos, made just for this course, offered through eCornell ( https://tinyurl.com/eCornellML ). Note, however, that eCornell does charge tuition for this version.