Machine Learning Lecture 14 "(Linear) Support Vector Machines" -Cornell CS4780 SP17

Logistics and Project Updates

Class Logistics

  • The instructor welcomes everyone and reminds them to pick up the handouts distributed earlier.
  • Project 4 is currently being tested, which may cause a slight delay in its release until Friday, ensuring students still receive their full two weeks for completion.
  • The CEO of Vocareum has acknowledged student complaints and suggestions via Piazza, indicating that the company is taking these issues seriously and working on fixes.

Exam Preparation

  • The exam is approaching in less than two weeks; students are urged to start preparing as it significantly impacts their overall grade.
  • Students who cannot attend should arrange a makeup in advance; the makeup will be an oral exam in the instructor's office.

Understanding Square Loss in Linear Regression

Key Concepts of Square Loss

  • The discussion begins with square loss related to linear regression, assuming data points lie along a line with Gaussian noise around them.
  • The loss function derived from this model is expressed as the sum of squared differences between predicted values (W^T X) and actual values (y).

Maximum Likelihood Estimation (MLE)

  • MLE involves minimizing the loss function using gradient descent while incorporating regularization through a term like λ||W||² on the weights for better performance.
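As a minimal illustration (not from the lecture), gradient descent on the regularized squared loss might look like the sketch below; the data, step size, and λ value are hypothetical choices:

```python
import numpy as np

# Hypothetical toy data: y ≈ X @ w_true plus small Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.01 * rng.normal(size=50)

# Gradient descent on L(W) = ||XW - y||^2 + lam * ||W||^2.
w = np.zeros(2)
lam, lr = 0.1, 0.005  # hypothetical regularization strength and step size
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) + 2 * lam * w  # gradient of the regularized loss
    w -= lr * grad
print(w)  # close to w_true, shrunk very slightly toward zero by the penalty
```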

Closed Form Solution vs. Gradient Descent

  • Although there exists a closed-form solution for finding minimum loss, it requires O(d²) memory and O(d³) complexity, making it impractical for high-dimensional data sets.

Deriving the Closed Form Solution

Matrix Representation of Loss Function

  • The instructor explains how to represent the loss function using matrix notation: L(W)=||XW - y||², where X contains input data points.

Completing the Square Technique

  • By expanding and rearranging terms within the quadratic form of the loss function, one can complete the square to identify critical points for minimization.

Finding Minimum Through Derivatives

Understanding Derivatives and Matrix Inversion in Regression

Key Concepts in Derivative Calculation

  • The process of taking derivatives can significantly reduce computation time, saving hours if one is familiar with the necessary notation and matrix operations.
  • The derivative yields a linear term involving Y^T X; setting the gradient to zero gives the equation to solve for W.
  • The closed-form solution for W is derived as W = (X^T X)^-1 X^T Y , indicating that once this computation is done, further steps are minimal.
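A quick sanity check of the closed-form formula on synthetic data (a sketch; `np.linalg.solve` is used on the system (X^T X) W = X^T Y rather than forming an explicit inverse, which is more numerically stable):

```python
import numpy as np

# Synthetic data following the linear-Gaussian model from the lecture.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Closed-form solution W = (X^T X)^{-1} X^T Y, computed by solving
# the linear system (X^T X) W = X^T Y instead of inverting explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true, atol=0.05))  # True: weights recovered
```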

Challenges with Matrix Inversion

  • Inverting large matrices can be computationally expensive and memory-intensive, making it impractical for large datasets.
  • Cubic complexity ( O(n^3) ) means that doubling the input size makes the computation roughly eight times slower, leading to significant slowdowns in processing time.

Space Complexity Considerations

  • Algorithms must be evaluated based on their space and time complexity as input sizes increase; this is crucial in high-dimensional spaces like those encountered in natural language processing.
  • High dimensionality (e.g., millions of dimensions from vocabulary size) makes traditional methods prohibitively expensive.

Introduction to Support Vector Machines (SVM)

Transitioning from Regression to Classification

  • The discussion shifts from regression techniques to classification using support vector machines, focusing on binary labels (+1 or -1).

Finding Hyperplanes with SVM

  • SVM aims to find a hyperplane that separates positive and negative points; unlike the perceptron, which returns an arbitrary separating hyperplane, SVM finds the optimal one.
  • There are infinitely many hyperplanes possible; SVM focuses on identifying the one that maximizes the margin between classes.

Importance of Margin in Classification

  • The concept of margin refers to the distance between the closest points of different classes; maximizing this margin enhances generalization capabilities.
  • A larger margin reduces the risk of misclassification by providing a buffer zone where new data points can fall without affecting classification accuracy.

Understanding Support Vector Machines and Hyperplanes

The Concept of Hyperplanes

  • A hyperplane can effectively separate training data, but misclassification may occur for test points that are close to the decision boundary. This raises questions about the effectiveness of such a separation.
  • The maximum-margin hyperplane is defined as the one with the maximum distance to its closest point on either side, ensuring equal margins on both sides. This equality is crucial for maintaining balance in classification.

Margin Definition and Importance

  • The margin is determined by the shortest distance to a data point; if one side's margin were smaller, it would indicate a potential improvement in hyperplane positioning. Thus, both distances must be equal for optimal classification.
  • Support Vector Machines (SVM) were introduced by Cortes and Vapnik in 1995 and remain highly popular due to their effectiveness in machine learning tasks despite being patented. Users must navigate patent issues carefully when implementing SVMs in products.

Formalizing Distance from Hyperplanes

  • To understand how to calculate the distance from a point X to a hyperplane defined by W and B, one can project X onto the hyperplane, denoting this projection as X_P. The distance between these two points represents this measure of separation.
  • By defining vector D as parallel to vector W, we can express it as a rescaled version of W. This relationship helps simplify calculations related to distances from points to hyperplanes.

Solving for Alpha

  • By substituting known values into equations involving projections onto the hyperplane, we derive an expression for alpha, which plays a critical role in determining distances within SVM frameworks. This process involves manipulating terms until reaching an equation that isolates alpha.
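The derivation sketched above can be written out explicitly. With D = αW and the projection X_P = X - D required to lie on the hyperplane:

```latex
W^\top (X - \alpha W) + B = 0
\;\Longrightarrow\;
W^\top X + B = \alpha\, W^\top W
\;\Longrightarrow\;
\alpha = \frac{W^\top X + B}{W^\top W}.
```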

Understanding the Margin of a Hyperplane

Deriving the Norm and Distance to Hyperplane

  • Since D = αW, the squared norm of D is D^T D = α² W^T W.
  • Substituting the projection into the hyperplane equation yields α = (W^T X + B) / (W^T W), emphasizing its role in calculating distances within the hyperplane context.
  • The norm of D therefore simplifies to |α| sqrt(W^T W), illustrating how distance calculations are structured.
  • Combining these, the distance from a point X to the hyperplane is |W^T X + B| / ||W||, connecting geometric concepts learned in high school with modern applications.
  • This formula applies across various dimensions, confirming its relevance not just in two-dimensional spaces but in any Euclidean space.
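The distance formula can be checked numerically; the hyperplane and point below are hypothetical examples:

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane {z : w^T z + b = 0}."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Hypothetical 2-D example: the line x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
b = -1.0
x = np.array([1.0, 1.0])
print(distance_to_hyperplane(x, w, b))  # |1 + 1 - 1| / sqrt(2) ≈ 0.7071
```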

Defining and Maximizing Margin

  • To find the smallest distance from data points to a hyperplane, one must compute distances for all data points and identify the minimum value, establishing a basis for defining margins.
  • The margin of a classifier (gamma) is defined as the smallest distance calculated previously. This sets up an optimization problem aimed at maximizing this margin.
  • A challenge arises when attempting to maximize this margin without constraints; naive maximization could lead to unrealistic solutions where hyperplanes are pushed infinitely away from data points.
  • Constraints must be introduced into the optimization problem: ensuring that all data points satisfy conditions related to their classification (i.e., positive or negative sides relative to the hyperplane).
  • These constraints ensure that while maximizing margin, we maintain sensible placements of hyperplanes between classes rather than allowing them to diverge excessively.
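Putting these pieces together, the constrained problem described above can be written, in the notation of this lecture, as:

```latex
\max_{W,\,B}\; \min_{i}\; \frac{\lvert W^\top X_i + B \rvert}{\lVert W \rVert}
\qquad \text{subject to} \qquad
y_i \left( W^\top X_i + B \right) \ge 0 \quad \text{for all } i.
```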

Formulating Constrained Optimization Problems

  • The constrained optimization problem requires minimizing/maximizing functions subject to specific conditions. Here, it involves ensuring correct classifications based on existing labels and their positions relative to the hyperplane.
  • Each constraint states that for every data point i, it must hold true that y_i (W^T X_i + B) ≥ 0. This ensures proper separation between classes by enforcing correct sign alignment with respect to their labels.
  • Satisfying these constraints guarantees that positive examples lie on one side of the hyperplane while negative examples lie on another side, reinforcing effective classification boundaries.
  • While discussing solving methods for these problems, it's noted that further exploration will reveal techniques applicable for finding optimal solutions under given constraints.
  • Understanding how minimization interacts with maximization leads into complex formulations but can be simplified through strategic manipulation of terms involved in both processes.

Understanding Hyperplanes and Rescaling in Optimization

The Concept of Hyperplanes

  • The discussion begins with the inner minimization of |W^T X_i + B| over the data points, which sits inside an outer maximization over W and B.
  • A hyperplane is defined by the equation W^T X + B = 0 . Rescaling both W and B by any positive constant does not alter the hyperplane's position.
  • This rescaling introduces a degree of freedom, allowing for multiple representations of the same hyperplane without affecting its properties.

Fixing a Scale for Simplification

  • To simplify calculations, a specific scale is chosen such that min_i |W^T X_i + B| = 1. This choice aids in managing constraints effectively.
  • The objective function then becomes simpler: maximizing the margin 1/||W|| reduces to minimizing W^T W, streamlining future calculations.

Constraints and Their Implications

  • It is possible to rescale values so that certain margins are achieved, ensuring that while solutions may differ, the hyperplane remains consistent.
  • A critical insight is presented: maximizing an expression can be transformed into minimizing its reciprocal. This equivalence simplifies optimization strategies.

Equivalence of Constraints

  • The speaker claims that two sets of constraints can be expressed interchangeably. Specifically, if one set holds true, it implies validity for another set involving conditions on all data points.
  • Participants are encouraged to discuss and verify this claim about constraint equivalence among themselves.

Analyzing Margin Conditions

  • The discussion shifts towards understanding how margin conditions relate to overall constraints. If all data points satisfy certain inequalities, implications arise regarding their minimum values.
  • Clarifications are made regarding potential concerns about margins exceeding specified limits; however, it’s emphasized that minimization processes will ensure compliance with required conditions.
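With the scale fixed so that the closest point satisfies |W^T X_i + B| = 1, the problem from the preceding sections reduces to the standard SVM quadratic program:

```latex
\min_{W,\,B}\; W^\top W
\qquad \text{subject to} \qquad
y_i \left( W^\top X_i + B \right) \ge 1 \quad \text{for all } i.
```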

Understanding Quadratic Programming in Machine Learning

The Basics of Optimization and Constraints

  • The discussion begins with the concept of minimizing an objective function, specifically mentioning that dividing both W and B by ten still satisfies constraints but results in a smaller objective value, indicating it cannot be optimal.
  • It is highlighted that the problem at hand has a quadratic objective (a paraboloid) with linear constraints, categorizing it as a quadratic program, a class studied extensively in mathematical optimization.

Practical Application of Quadratic Programming

  • A demo is introduced to illustrate the application of quadratic programming in machine learning. The speaker acknowledges previous skepticism about the demo's authenticity but confirms its relevance.
  • The speaker creates a dataset with linearly separable data points and runs an SVM (Support Vector Machine) solver to find the hyperplane that maximizes margin between classes.
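The lecture demo itself is not reproduced here, but the same idea can be sketched by handing the hard-margin quadratic program to SciPy's general-purpose SLSQP solver on a hypothetical toy dataset (a real SVM implementation would use a dedicated QP solver):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data: class +1 and class -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hard-margin SVM: minimize (1/2) W^T W subject to y_i (W^T X_i + B) >= 1.
# theta packs (W, B) as [w1, w2, b].
def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w

constraints = [
    {"type": "ineq",
     "fun": lambda theta, i=i: y[i] * (theta[:2] @ X[i] + theta[2]) - 1.0}
    for i in range(len(X))
]

# Start from a feasible guess (w = (1, 1), b = 0 separates this data).
res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("margin:", 1.0 / np.linalg.norm(w))  # distance to the closest points
```

All training points end up on or outside the margin lines y_i (W^T X_i + B) = 1, which is exactly what the parallel lines in the lecture's visualization show.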

Visualization and Results from SVM

  • The output from the SVM solver reveals the hyperplane along with parallel lines representing margins around closest points, demonstrating how effectively this method identifies optimal separation.
  • Emphasis is placed on the elegance of formulating machine learning problems as optimization tasks rather than relying solely on iterative algorithms, which was a significant shift in perspective within the field.

Impact of SVM on Machine Learning Practices

  • The introduction of SVM revolutionized approaches to machine learning by providing lower error rates across datasets compared to traditional methods. This led to widespread adoption due to its straightforward framework.
  • Questions arise regarding automatic minimization processes within SVM, clarifying that equal distance from both classes must be maintained for effective optimization.

Conclusion and Future Directions

Video description

Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html