Lecture 5 "Perceptron" -Cornell CS4780 SP17

Curse of Dimensionality and Perceptron Discussion

Overview of the Curse of Dimensionality

  • The discussion begins with a recap on the curse of dimensionality, particularly in relation to K nearest neighbors and perceptrons.
  • A question arises about the implications of having points that are far apart in high-dimensional space, leading to an exploration of neighborhood sizes.
  • It is explained that as dimensions increase, the number of points required to fill a small neighborhood grows exponentially: roughly 10^D points when each of the D dimensions must be covered at a resolution of one tenth.
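A back-of-the-envelope sketch of this scaling (illustrative numbers, not from the lecture's code):

```python
# If a "small" neighborhood spans 1/10 of the range in each of D dimensions,
# a uniform grid has 10^D cells, so roughly 10^D points are needed to expect
# even one neighbor per cell.
for d in [1, 2, 10, 100]:
    print(f"D = {d:>3}: ~{10**d:.1e} points")
```

Already at D = 100 this exceeds the roughly 10^80 particles in the observable universe, which is the point made above.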

Implications of High Dimensions

  • The speaker emphasizes that needing more points than there are electrons in the universe renders data analysis practically useless.
  • Another question notes that a nearest neighbor can always be identified even when it is distant; however, a far-away neighbor provides no meaningful insight into a point's label.

Understanding Nearest Neighbor Limitations

  • The assumption that nearby points have similar labels is crucial; without understanding what lies between known labels (X's and O's), predictions become unreliable.
  • The speaker illustrates how one cannot infer labels accurately if there's uncertainty about transitions between classes.

Project Updates and Logistics

  • Transitioning to project logistics, common issues with submissions are discussed, particularly regarding saving changes before committing them.
  • Important deadlines for projects are highlighted: Project 0 is due today while Projects 1 and 2 have ongoing timelines.

Understanding Perceptron Algorithm

Basics of Perceptron

  • The perceptron algorithm operates under the assumption that a hyperplane can separate two classes (X's and O's).
  • In binary classification scenarios, it’s noted that finding such a hyperplane is often feasible in high-dimensional spaces, where linear separability is a less restrictive condition.

Mathematical Representation

  • The perceptron uses a mathematical representation involving a weight vector W, input features X, and a bias B.
  • A point X’s position relative to the hyperplane is determined by the sign of W^T X + B.
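A minimal sketch of this check (the function name and the numbers are illustrative):

```python
import numpy as np

# Classify a point by the sign of W^T X + B relative to the
# hyperplane {X : W^T X + B = 0}.
def predict(w: np.ndarray, b: float, x: np.ndarray) -> int:
    return 1 if w @ x + b > 0 else -1  # +1 side vs -1 side

w = np.array([1.0, -2.0])
b = 0.5
print(predict(w, b, np.array([3.0, 1.0])))  # 3 - 2 + 0.5 = 1.5 > 0, so +1
```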

Hyperplane Existence Assumption

  • The algorithm assumes a hyperplane exists for effective classification. It provides an efficient method for identifying this separating boundary through iterative adjustments.

Transformation Technique

Understanding the Perceptron and K-Nearest Neighbors

Decomposing Weight Vectors

  • The discussion begins with the concept of absorbing the bias B into a combined weight vector W bar (with each input X extended by a constant 1), emphasizing that this simplification makes derivations easier by eliminating the need to constantly update B separately.
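A sketch of this trick, assuming the standard construction of appending a constant 1 to each input so that W_bar^T X_bar = W^T X + B:

```python
import numpy as np

# Append a constant 1 to every input row so the bias becomes just one
# more weight, and only a single vector needs updating.
def absorb_bias(X: np.ndarray) -> np.ndarray:
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

X = np.array([[2.0, 3.0], [1.0, -1.0]])
X_bar = absorb_bias(X)
print(X_bar)  # each row gains a trailing 1
```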

K-Nearest Neighbors vs. Perceptron

  • Acknowledgment of the complexity in managing updates for bias in algorithms, leading to a preference for simpler methods.
  • Introduction of K-nearest neighbors (KNN), which allows decision boundaries to curve, unlike perceptrons that assume linear separability.

Decision Boundaries

  • In KNN, decision boundaries are formed based on proximity to training points, creating complex shapes between different classes (e.g., circles and crosses).
  • The perceptron is limited in that it requires the decision boundary to be a hyperplane, which restricts its applicability when the classes are not linearly separable.

Memory Efficiency and Trade-offs

  • While perceptrons only store one weight vector and a bias for memory efficiency, KNN must retain all training data points, making it slower and less efficient.

Limitations of Perceptron

  • Discussion on how KNN can adapt better than perceptrons when dealing with complex datasets where no hyperplane can separate classes effectively.

Error Rates and Classifier Performance

  • It’s noted that while there exists a theorem for KNN regarding error rates as sample size increases, such guarantees cannot be established for perceptrons due to their restrictive nature.

Challenges with Infinite Data Sets

  • Participants are prompted to consider why increasing dataset size does not improve perceptron's performance if data distribution is complex (e.g., sine wave patterns).

Misclassification Issues

  • An example illustrates how even with infinite data points distributed around a sine wave pattern, no hyperplane can accurately classify them without errors.
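The sine-wave example can be sketched numerically: label points by which side of a sine curve they lie on, and the perceptron (a minimal illustrative implementation here, not the lecture's code) never completes a mistake-free pass, because no straight line separates the two classes.

```python
import numpy as np

# Points just above the sine curve are +1, just below are -1.
rng = np.random.default_rng(0)
xs = np.linspace(0, 2 * np.pi, 40)
X = np.vstack([np.column_stack([xs, np.sin(xs) + 0.1]),
               np.column_stack([xs, np.sin(xs) - 0.1])])
y = np.concatenate([np.ones(40), -np.ones(40)])
X = np.hstack([X, np.ones((80, 1))])  # absorb the bias term

w = np.zeros(3)
converged = False
for epoch in range(200):
    mistakes = 0
    for i in rng.permutation(80):
        if y[i] * (w @ X[i]) <= 0:  # wrong side of the hyperplane
            w += y[i] * X[i]
            mistakes += 1
    if mistakes == 0:               # a clean pass would mean separation
        converged = True
        break
print("converged:", converged)  # False: the data is not linearly separable
```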

Overview of the Perceptron Algorithm

  • Introduction of the basic structure of the perceptron algorithm: with the weights initialized to zero, the inner product W^T X is zero for every point, so the first point encountered triggers an update; thereafter, adjustments occur upon misclassifications.

Iterative Learning Process

Understanding the Perceptron Algorithm

Introduction to the Perceptron Loop

  • The perceptron algorithm involves a loop where data points are classified based on their position relative to a hyperplane. If a point lies on the wrong side, adjustments are made.

Classifying Points and Adjusting Hyperplane

  • A point is classified by checking whether Y (W^T X) ≤ 0. This condition indicates that the classification is incorrect, prompting an adjustment of the hyperplane.

Understanding Misclassifications

  • The speaker explains that if Y (W^T X) ≤ 0, it signifies a misclassification. The relationship between labels and hyperplane positioning is crucial for understanding errors.

Aligning Signs for Correct Classification

  • For correct classifications, positive labels should align with positive outputs of W^T X, while negative labels should align with negative outputs. Misalignment indicates a classification error.

Updating Weights Based on Errors

  • When a misclassification occurs, the weights are updated via W ← W + Y X. Positive points are added to W, while negative points are subtracted, to improve future classifications.
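Why this update helps can be seen in one line: after the update, the score of the misclassified point moves in the correct direction, since

```latex
% One update W <- W + Y X increases the score Y (W^T X):
\left(W + Y X\right)^{\top} X \cdot Y
  \;=\; Y\,(W^{\top} X) + Y^{2}\,\|X\|^{2}
  \;=\; Y\,(W^{\top} X) + \|X\|^{2}
  \;\ge\; Y\,(W^{\top} X).
```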

Counting Mistakes and Iterating Over Data

  • A counter tracks mistakes during iterations over the dataset. The process continues until all data points are correctly classified without any updates needed.
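The loop described above can be sketched as follows (a minimal illustrative implementation, with the bias absorbed into W via a trailing 1 in each input row):

```python
import numpy as np

def perceptron(X: np.ndarray, y: np.ndarray, max_epochs: int = 1000) -> np.ndarray:
    w = np.zeros(X.shape[1])          # start at the zero vector
    for _ in range(max_epochs):
        mistakes = 0                  # counter for this pass over the data
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # wrong side of (or on) the hyperplane
                w += yi * xi          # update rule: W <- W + Y X
                mistakes += 1
        if mistakes == 0:             # clean pass: separating hyperplane found
            return w
    raise RuntimeError("no separating hyperplane found within max_epochs")

# Tiny separable example: positives on the right, negatives on the left.
X = np.array([[2.0, 1.0, 1.0], [3.0, -1.0, 1.0],
              [-2.0, 1.0, 1.0], [-3.0, -2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, y)
print(all(yi * (w @ xi) > 0 for xi, yi in zip(X, y)))  # True
```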

Convergence of the Algorithm

  • Once every data point is correctly classified without updates, it indicates that a separating hyperplane has been found. This convergence demonstrates the effectiveness of the perceptron algorithm.

Demonstration of Training Process

  • A demonstration illustrates how data points are drawn and classified using the perceptron algorithm. Initial conditions start with zero vectors before adjustments based on misclassifications occur.

Understanding Hyperplanes and Weight Vectors in Classification

The Concept of Hyperplanes

  • In the demonstration, the blue line represents the vector from the origin (0,0) to the misclassified point X; adding this vector to W illustrates how an update shifts the hyperplane.
  • A current hyperplane is established with a vector W; misclassification occurs when a positive point is incorrectly placed on the negative side, prompting an adjustment by adding this point to the vector.
  • Continuous adjustments are made to the hyperplane as more points are misclassified, demonstrating how each new data point influences the overall classification boundary.

Convergence of Algorithms

  • The algorithm converges towards a hyperplane that correctly classifies all positive and negative points. Questions arise about various algorithms and their effectiveness in achieving this goal.
  • Historical context reveals that while early algorithms were groundbreaking for finding any hyperplane, subsequent developments have led to better-performing alternatives.

Efficiency of Classification

  • An example highlights that larger margins between points allow for quicker convergence compared to scenarios with minimal separation, where overshooting can occur during updates.
  • The efficiency of classification is further emphasized through practical examples demonstrating varying speeds based on data distribution.

Visualizing Data and Weight Vectors

  • Handwritten digits (zeros and sevens) serve as training data within a 256-dimensional space. This setup allows visualization of weight vectors as images corresponding to pixel values.
  • Initially set at zero, the weight vector changes upon misclassification; adjustments involve adding or subtracting points based on whether they are classified positively or negatively.
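A sketch of this visualization, assuming 16 × 16 pixel images (256 = 16 · 16); random stand-in weights are used here, where the lecture demo would use the trained W:

```python
import numpy as np

# A 256-dimensional weight vector over pixel inputs can be viewed as a
# 16x16 image: each entry is the weight of one pixel.
w = np.random.default_rng(0).normal(size=256)
w_image = w.reshape(16, 16)
print(w_image.shape)  # (16, 16)
# With matplotlib available, plt.imshow(w_image) would render it as an image.
```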

Iterative Classification Process

  • Misclassifications lead to updates in the weight vector; for instance, classifying a '7' as '0' results in an adjustment reflecting this error.
  • As new examples are introduced, reclassification occurs based on updated weights. Initial classifications may still be incorrect until sufficient examples refine accuracy.
  • Continuous iterations reveal improvements in classification accuracy over time; however, some instances still require further adjustments due to persistent errors.

Final Adjustments and Clarifications

  • Further refinements illustrate how specific pixels influence classifications; ongoing updates ensure correct identification across multiple examples.
  • Clarification arises regarding label assignments: negative labels correspond with subtraction from weight vectors while positive labels result in additions.

Understanding the Perceptron Algorithm and Its Limitations

The Basics of Weight Adjustment

  • The weight vector W is updated by adding Y X, where the label Y is either +1 or -1. This adjustment is crucial for classification.
  • If the data is not linearly separable, the algorithm may loop indefinitely without finding a solution, highlighting a significant limitation in its design.

Historical Context and Evolution

  • The perceptron algorithm has evolved since its inception in 1957, with more efficient algorithms developed to handle similar tasks.
  • A quiz was introduced to visualize geometric intuition regarding hyperplanes and their relation to data points.

Geometric Intuition and Hyperplane Updates

  • When a point X is misclassified (it should be positive but is classified as negative), the hyperplane must be adjusted by updating W to W + X (since Y = +1).
  • There’s no guarantee that one update will correct the classification; repeated encounters with the same point could lead to continued misclassification.

Error Bound Analysis

  • Even if a point remains misclassified after multiple updates, each update makes measurable progress, which yields an upper bound on how many times it can be incorrectly classified before being corrected.
  • This analysis indicates that there exists a finite number of errors possible for any given data point before achieving correct classification.
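The finite error count alluded to here is the classic perceptron mistake bound, stated (under the rescaling conventions used in the convergence proof below) as:

```latex
% Standard perceptron mistake bound, assuming \|W^*\| = 1 and \|X_i\| \le 1:
% if the margin \gamma = \min_i Y_i \, (W^{*\top} X_i) is strictly positive,
% then the total number of updates M satisfies
M \;\le\; \frac{1}{\gamma^{2}}.
```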

Common Misconceptions in Implementation

  • If a data point lies exactly on the hyperplane, the condition Y (W^T X) ≤ 0 counts it as misclassified; this matters at initialization, when W = 0 places every point on the hyperplane.

Understanding the Perceptron Algorithm

High Dimensional Space and Data Representation

  • The discussion begins with the concept of high-dimensional space, emphasizing that vectors can exist in various dimensions. The speaker notes that even one-dimensional data can be represented in a two-dimensional space.

Convergence Proof Introduction

  • The speaker introduces the convergence proof of the perceptron algorithm, highlighting its significance in changing the world over half a century since Rosenblatt's work. They pose a question about why the perceptron algorithm guarantees a separating hyperplane if it exists.

Assumptions of the Perceptron Algorithm

  • It is assumed that there exists a weight vector W^* that separates the data points, i.e., every data point satisfies Y ((W^*)^T X) > 0.

Approach to Finding Separating Hyperplanes

  • While not claiming to learn W^* itself, the proof aims to show that W gets progressively closer to it, until all points are classified correctly. Since many separating hyperplanes exist, a scaling convention is needed.

Rescaling Weights and Data Points

  • To simplify calculations, it's proposed that W^* can be rescaled so that its norm equals one. This allows for easier manipulation without losing generality in finding separating hyperplanes.

Simplifying Assumptions for Proof Clarity

  • The speaker suggests rescaling all data points such that their norms are less than or equal to one. This approach simplifies later steps in the proof while maintaining validity.
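Both rescalings are harmless because scaling by a positive constant changes no classification:

```latex
% For any constant c > 0, sign(c \, W^{*\top} X) = sign(W^{*\top} X), so
% without loss of generality one may assume
\|W^*\| = 1 \quad\text{and}\quad \|X_i\| \le 1 \ \ \forall i,
% obtained by dividing W^* by its norm and every X_i by \max_j \|X_j\|.
```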

Visualization and Conceptual Understanding

  • A metaphorical device is introduced where an entire dataset can be shrunk and then expanded after finding a hyperplane, illustrating how scaling does not affect classification outcomes.

Final Thoughts on Rescaling Techniques

Video description

Cornell class CS4780. (Online version: https://tinyurl.com/eCornellML ) Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html