Machine Learning Lecture 13 "Linear / Ridge Regression" -Cornell CS4780 SP17

Logistic Regression and Optimization Techniques

Project Updates

  • The instructor welcomes students and requests them to close their laptops.
  • Students are reminded about the ongoing Naive Bayes project, with an update that the next project is complete and will be tested on two TAs before shipping out tomorrow.
  • There will be no more homework after the midterm, but one additional project is due before it.

Introduction to Logistic Regression

  • The discussion begins with logistic regression, focusing on the probability of Y given X, represented as P(Y|X) .
  • The goal is to optimize this function through maximum likelihood estimation by ensuring that observed data points are as likely as possible.

Loss Function and Optimization

  • The loss function sums the logistic loss over all data points: sum_i log(1 + e^{-W^T X_i Y_i}).
  • It’s noted that there is no closed-form solution for minimizing this loss function; however, it is convex, differentiable, and continuous.

Gradient Descent Methodology

  • The instructor explains gradient descent as a method for finding local minima by evaluating the loss function at various points.
  • Step size in gradient descent can be adjusted adaptively using algorithms like AdaGrad to improve convergence speed.
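A minimal sketch of gradient descent on the logistic loss above (the toy data, step size, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def logistic_loss(w, X, y):
    # sum_i log(1 + exp(-y_i * w^T x_i)), computed with log1p for accuracy
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

def logistic_grad(w, X, y):
    # derivative of log(1 + e^{-m_i}) w.r.t. w is -y_i x_i / (1 + e^{m_i})
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs

def gradient_descent(X, y, step=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * logistic_grad(w, X, y)
    return w

# toy linearly separable data with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = gradient_descent(X, y)
```

A fixed step size is used here for simplicity; adaptive schemes like AdaGrad rescale the step per coordinate based on past gradients.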

Advanced Optimization Techniques

  • If second derivatives can be computed, optimization can utilize parabolic approximations for faster convergence.
  • Caution is advised regarding divergence; if the function value increases after a step, revert to smaller gradient steps until stability returns.
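The divergence safeguard described above can be sketched as a simple step-halving rule (an illustrative heuristic, not the lecture's exact procedure):

```python
import numpy as np

def safe_gradient_step(loss, grad, w, step=1.0, max_halvings=20):
    """Take one gradient step; if the loss increases, revert and halve
    the step size until an improving step is found (or give up)."""
    g = grad(w)
    current = loss(w)
    for _ in range(max_halvings):
        w_new = w - step * g
        if loss(w_new) < current:
            return w_new, step
        step *= 0.5  # step was too aggressive; shrink and retry
    return w, step  # no improving step found

# example on the simple quadratic f(w) = ||w||^2
loss = lambda w: float(w @ w)
grad = lambda w: 2 * w
w, step = safe_gradient_step(loss, grad, np.array([3.0, 4.0]), step=5.0)
```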

Generalization of Machine Learning Algorithms

  • Many machine learning algorithms can be expressed in terms of similar functions where lower values indicate better performance.
  • The logistic loss serves as a smooth surrogate for the zero-one loss; the zero-one loss counts misclassifications directly, but it cannot be minimized with gradient-based methods because it is neither continuous nor differentiable.
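The relationship between the two losses is easy to check numerically: writing m = W^T X_i Y_i for the margin, the zero-one loss is 1 when m ≤ 0 and 0 otherwise, while the base-2 logistic loss log2(1 + e^{-m}) is a smooth function that upper-bounds it everywhere (a standard fact, stated here as an assumption rather than quoted from the lecture):

```python
import numpy as np

margins = np.linspace(-5, 5, 101)            # m = y_i * w^T x_i

zero_one = (margins <= 0).astype(float)      # 1 if misclassified, else 0
logistic = np.log2(1 + np.exp(-margins))     # smooth, differentiable surrogate

# the surrogate upper-bounds the zero-one loss at every margin,
# and both losses vanish for confidently correct predictions
print(logistic.min(), zero_one.min())
```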

Discussion on Loss Functions

  • Students are encouraged to reflect on how the logistic loss behaves when predictions are correct versus incorrect.
  • Emphasis is placed on understanding how many examples are misclassified during training and how this impacts overall performance metrics.

Understanding Hyperplanes and Loss Functions in Machine Learning

The Role of Hyperplanes

  • A hyperplane is defined by the weight vector W . If a point X_i has a positive label Y_i , then W^T X_i is also positive, indicating that the point lies on one side of the hyperplane.
  • When a point is correctly classified and moves further from the hyperplane, the margin W^T X_i Y_i grows, so e^{-W^T X_i Y_i} decays exponentially and that point's contribution to the loss approaches zero.

Consequences of Misclassification

  • If a point is misclassified (e.g., a positive point lands on the negative side), then W^T X_i Y_i < 0, so the exponent -W^T X_i Y_i is positive and the loss term becomes large.
  • Because the exponential grows rapidly for positive inputs, e^{-W^T X_i Y_i} dominates the 1 inside the logarithm, and the penalty increases sharply as predictions become more wrong.

Understanding Loss Calculations

  • Adding 1 to a large number barely changes it, so for large exponents the 1 inside the logarithm can be neglected.
  • The log transformation then simplifies: log(1 + e^a) ≈ a when e^a ≫ 1.
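This approximation can be verified numerically; `np.logaddexp(0, a)` computes log(1 + e^a) in a numerically stable way, avoiding overflow when e^a is huge:

```python
import numpy as np

for a in [10.0, 50.0, 100.0]:
    exact = np.logaddexp(0.0, a)   # stable evaluation of log(1 + e^a)
    # for large a, e^a dominates the 1, so log(1 + e^a) is essentially a
    assert abs(exact - a) < 1e-4
    print(a, exact)
```

Note that computing `np.log(1 + np.exp(100.0))` naively would overflow; the stable form is what practical implementations of the logistic loss use.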

Implications of Distance from Hyperplane

  • Greater distances from the hyperplane result in larger penalties for misclassifications. This indicates that errors further away from correct classifications incur higher losses.
  • As predictions become increasingly wrong (further on the wrong side of the hyperplane), penalties grow steadily; for the logistic loss this growth is roughly linear in the distance, whereas other losses (such as the squared loss) grow quadratically.

Logistic Loss vs. Other Loss Functions

  • Logistic loss functions are beneficial in scenarios with mislabeled data because they do not overly penalize small errors compared to other functions.
  • Some loss functions may exacerbate issues with mislabeled data by focusing too heavily on outliers or incorrect labels.

Introduction to Ordinary Least Squares (OLS)

Clarifying Regression Terminology

  • Despite its name, logistic regression is a classification algorithm; linear regression, in contrast, predicts continuous outcomes.

Understanding Linear Relationships in Data Modeling

Introduction to Continuous Values

  • The discussion begins with the concept of predicting continuous values, such as house prices, using various features of a house. This is similar to what platforms like Zillow do.

Assumptions in Data Modeling

  • Two key assumptions are introduced for modeling the data:
  • The first is linearity: just as classification algorithms assumed a linear hyperplane separates the classes, here the relationship between input and output is assumed to be linear.
  • The setting shifts from classification to regression, where the target is a continuous variable.

Linear Correspondence Between Variables

  • The speaker explains that instead of two classes, we have a relationship between input (X) and output (Y), assuming this relationship is linear.
  • A second assumption is introduced regarding the distribution of data points around the line; they will not lie perfectly on it but will be scattered around it.

Noise and Distribution Models

  • It’s acknowledged that real-world observations often deviate from the ideal line due to noise.
  • The noise is modeled as a Gaussian distribution, indicating that most observations cluster around a mean value determined by X.

Mathematical Representation of Noise

  • The model can be expressed mathematically as Y_i = W^T X_i + epsilon_i , where epsilon_i represents noise drawn from a Gaussian distribution with zero mean and some variance.
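A minimal simulation of this generative model (the dimension, true weights, and noise level below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

d, n = 3, 1000
w_true = np.array([2.0, -1.0, 0.5])   # hypothetical "true" weight vector
sigma = 0.1                           # noise standard deviation

X = rng.normal(size=(n, d))
eps = rng.normal(0.0, sigma, size=n)  # epsilon_i ~ N(0, sigma^2)
y = X @ w_true + eps                  # y_i = w^T x_i + epsilon_i
```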

Understanding Gaussian Distribution in Context

  • An alternative representation states that Y values are drawn from Gaussians centered at W^T X_i . This emphasizes how predictions vary around the expected value based on features.

Justification for Assumptions

  • The speaker justifies using linear models because many datasets exhibit linear relationships, making them easier to work with mathematically.

Central Limit Theorem Relevance

  • Gaussian noise is a common assumption because of the Central Limit Theorem, which states that sums (and means) of many independent random variables tend toward a normal distribution.
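The Central Limit Theorem is easy to see empirically: averages of many independent uniform draws concentrate around the mean with an approximately Gaussian spread (sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# average 100 uniform[0, 1] draws, repeated 10,000 times
means = rng.uniform(0.0, 1.0, size=(10_000, 100)).mean(axis=1)

# CLT prediction: mean 0.5, std sqrt(1/12) / sqrt(100) ~= 0.0289
print(means.mean(), means.std())
```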

Predicting House Prices Using Features

  • With established assumptions, any given feature vector (describing aspects like square footage or proximity to schools) allows us to predict house prices through their corresponding distributions.

Final Thoughts on Finding Model Parameters

  • As part of concluding remarks, the speaker prepares for an exercise focused on determining optimal weights (W), emphasizing maximum likelihood estimation as one approach.

Approaches for Estimating Weights

  • Maximum likelihood estimation aims to maximize the probability of observed data given feature vectors by adjusting W accordingly.

Understanding Hyperplanes and Probability in Data Points

Introduction to Hyperplanes

  • The concept of hyperplanes is introduced, emphasizing their role in determining the likelihood of data points. If a data point is highly unlikely, it can significantly affect calculations by approaching zero.

Collaborative Learning Approach

  • Participants are encouraged to collaborate with neighbors to review derivations related to probability and features, enhancing understanding through peer explanation.

Deriving Probabilities from Data Sets

  • The process begins by calculating the probability of observing outcomes (Y) given feature vectors (X). This involves multiplying probabilities across all independent and identically distributed (i.i.d.) data points.

Logarithmic Transformation for Simplification

  • A logarithmic transformation is applied to simplify the product of probabilities into a summation format, facilitating easier computation and analysis.

Substituting Probability Expressions

  • The log expression incorporates specific probability functions, leading to a more manageable form that highlights key components like variance (σ²).

Maximizing Logarithmic Functions

  • The importance of maximizing log probabilities is discussed. Although different expressions may yield the same maximum value, they provide insights into parameter optimization.

Constants in Optimization Problems

  • It’s noted that constants do not affect maximization or minimization processes. Thus, certain terms can be disregarded without altering the outcome.

Transitioning from Maximization to Minimization

  • A shift from maximizing to minimizing is proposed: maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, and after dropping constants this amounts to minimizing the average squared loss.

Interpreting Loss Values

  • Average loss provides meaningful insights into model performance. For instance, an average prediction error quantifies how far off predictions are from actual values.

Characteristics of Loss Functions

  • The derived function resembles a parabola—characterized as convex and differentiable—which allows for precise minimum calculations using methods like Newton's method or gradient descent.
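Because the squared loss is a convex parabola in W, the minimum can also be found in closed form by setting the gradient to zero, which gives the normal equations X^T X W = X^T Y (a standard result; the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth weights
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(0.0, 0.05, size=n)

# minimizing sum_i (w^T x_i - y_i)^2: setting the gradient to zero
# yields the normal equations  X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Gradient descent reaches the same minimizer iteratively; the closed form is possible here precisely because the loss is a quadratic bowl.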

Maximum Likelihood Estimation and MAP

Introduction to Maximum Likelihood Estimation (MLE) and MAP

  • The speaker introduces the concept of maximum likelihood estimation (MLE) as a method for parameter estimation, noting that it is often faster to implement using gradient descent.
  • Transitioning from MLE, the speaker discusses maximum a posteriori (MAP) estimation, which aims to find the most likely parameters W given the observed data.

Understanding MAP Estimation

  • MAP assumes W is a random variable and estimates its probability based on prior beliefs about its distribution.
  • A common assumption for the prior distribution of W is Gaussian with zero mean, indicating uncertainty about its exact value.

Application of Prior Beliefs

  • The speaker illustrates how prior knowledge can be applied in practical scenarios like spam filtering, where individual preferences may differ from general trends.
  • By incorporating personal data into the model, one can adjust the Gaussian prior around an estimated value w_0 , reflecting individual differences.

Deriving MAP Using Bayes' Rule

  • The audience is encouraged to work through the derivation of MAP with their neighbors while clarifying any confusion regarding concepts presented.
  • The speaker explains using Bayes' rule to express the relationship between data and parameters: P(W | D) = P(D | W) P(W)/P(D).

Log-Likelihood Function and Independence Assumption

  • The discussion emphasizes that each observation's likelihood can be treated separately because the data points are independent and identically distributed (i.i.d.).
  • This independence allows simplification into a product form over all observations, focusing on how each output depends solely on its corresponding input.

Final Steps in Derivation

  • Taking logarithms simplifies calculations; this leads back to familiar expressions used in MLE.
  • Ultimately, combining terms results in an expression containing both the squared error and a squared-norm regularization term contributed by the Gaussian prior; this is the ridge regression objective.
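Carrying the derivation through, the Gaussian prior adds a squared-norm penalty, and the MAP estimate becomes ridge regression with a closed-form solution. A sketch under illustrative assumptions (λ corresponds to σ²/τ² for noise variance σ² and prior variance τ²; the value 1.0 below is arbitrary):

```python
import numpy as np

def ridge_map(X, y, lam):
    """MAP estimate under Gaussian noise and a zero-mean Gaussian prior:
    minimizes sum_i (w^T x_i - y_i)^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0.0, 0.1, size=100)

w_map = ridge_map(X, y, lam=1.0)
w_mle = ridge_map(X, y, lam=0.0)   # lam = 0 recovers plain least squares
```

Setting λ = 0 removes the prior and recovers the MLE solution; larger λ shrinks the weights toward the prior mean of zero.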

Understanding Probability in Context

The Relationship Between Variables

  • The discussion begins with the probability of a variable P(X | W) , where X represents certain variables and W is another variable that may influence them.
  • It is noted that the variables X do not depend on W , indicating that they are treated as constants in this context.
  • This leads to the conclusion that the probability of X , denoted as P(X) , remains unaffected by changes in W .
  • Consequently, it is suggested that one can simplify the expression by removing W , reinforcing the independence of these variables.
Video description

Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html