Machine Learning Lecture 16 "Empirical Risk Minimization" -Cornell CS4780 SP17

Exam Preparation and Key Concepts in Machine Learning

Exam Logistics and Importance

  • The exam is scheduled for next Tuesday, emphasizing the need for early preparation.
  • Students are encouraged to review all material covered in class, as everything will be included on the exam.
  • Previous years' exams are available on Piazza under resources, with a recommendation to attempt them without solutions first.
  • The exam constitutes 50% of the overall grade, highlighting its significance compared to project work.

Course Structure and Project Updates

  • The course balances theory and practical application in machine learning; both aspects are crucial for success.
  • A new project will be released today after resolving technical issues; students will have two weeks to complete it once released.
  • The project involves implementing various loss functions (e.g., hinge loss, logistic loss), serving as good preparation for the exam.

Understanding Empirical Risk Minimization

  • Empirical risk minimization focuses on finding parameters (weights) that minimize a defined loss over a dataset.
  • Loss functions measure how well predictions match true labels, incorporating regularization to avoid overly complex models.
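The objective described above can be sketched in a few lines; the squared loss and L2 penalty below are illustrative choices for a linear model, not the only combination the lecture covers:

```python
import numpy as np

def empirical_risk(w, X, y, lam):
    """Average squared loss over the dataset plus an L2 regularizer."""
    residuals = X @ w - y           # prediction errors for a linear model
    loss = np.mean(residuals ** 2)  # empirical (training) loss
    penalty = lam * (w @ w)         # discourages overly complex weight vectors
    return loss + penalty

# Tiny example: three points in two dimensions, fit exactly by w = (1, 2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(empirical_risk(w, X, y, lam=0.1))  # 0.5: perfect fit, only the penalty remains
```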

Loss Functions Overview

Hinge Loss and Other Classification Losses

  • Hinge loss is discussed alongside other classification losses like zero-one loss and exponential loss.

Introduction to Regression Loss Functions

  • Transitioning into regression, common scenarios include predicting continuous values such as height or house prices based on features.

Square Loss Function Characteristics

  • Square loss penalizes the squared difference between prediction and true value, so large errors are weighted far more heavily than small ones.
  • Because the error is squared, the penalty is always non-negative, whether the prediction lies above or below the actual value.

Optimal Solution with Square Loss

Understanding Income Prediction and Loss Functions

The Limitations of Average in Income Prediction

  • The average income may not accurately represent the distribution due to power law characteristics, where a small number of individuals hold significant wealth compared to the majority.
  • For instance, if Bill Gates increases his income by 10%, it skews the average but does not reflect meaningful changes for most people.

Square Loss vs. Absolute Loss

  • Square loss is differentiable everywhere, making it easy to optimize with gradient descent; minimizing it with a linear model yields ordinary least squares.
  • Absolute loss is minimized by the median rather than the mean, which can be more appropriate in contexts like income prediction where outliers exist.

Median as an Alternative Measure

  • In cases with extreme outliers (e.g., Bill Gates), median provides a better measure since it remains unaffected by such extremes, unlike the average.
  • When predicting values conditioned on features (like height based on gender), utilizing features can yield better predictions than relying solely on averages.
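The mean/median distinction can be checked numerically: the mean minimizes squared error and the median minimizes absolute error, so a single extreme value (the Bill Gates effect) drags the former but barely moves the latter. A small sketch with made-up income figures:

```python
import numpy as np

incomes = np.array([30_000.0, 40_000.0, 50_000.0, 60_000.0, 70_000.0])
with_outlier = np.append(incomes, 10_000_000_000.0)  # add one billionaire

# Minimizer of squared loss is the mean; of absolute loss, the median.
print(np.mean(incomes), np.median(incomes))            # 50000.0 50000.0
print(np.mean(with_outlier), np.median(with_outlier))  # mean explodes, median barely moves
```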

Challenges with Absolute Loss

  • While absolute loss is robust against outliers (e.g., predicting house prices), it poses optimization challenges due to non-differentiability at zero.
  • Square loss can disproportionately focus on extreme values (like expensive mansions), potentially sacrificing accuracy for more common data points.

Combining Strengths: Huber Loss

  • Huber loss offers a hybrid approach: it behaves like square loss near zero for small errors and transitions to absolute loss for larger errors, balancing sensitivity and robustness.
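A minimal sketch of the hybrid behavior just described: quadratic inside a band of width delta, linear outside. The constants follow the standard textbook form of the Huber loss, which may differ slightly from the lecture's notation:

```python
import numpy as np

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear (slope delta) beyond, with the
    constants chosen so the function and its derivative are continuous
    at the transition point."""
    small = np.abs(r) <= delta
    return np.where(small,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

print(huber(np.array([0.5, 3.0])))  # values: 0.125 and 2.5
```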

Understanding Loss Functions in Machine Learning

Overview of Loss Functions

  • The discussion begins with the concept of loss functions, emphasizing that using a millionaire's perspective on losses can be misleading. It highlights how many people overlook the importance of understanding different types of losses.

Types of Loss Functions

  • Introduction to the log-cosh loss, which behaves similarly to other robust loss functions. It is defined as L(x) = log(cosh(x)), where cosh(x) = (e^x + e^(-x))/2, prompting consideration of its behavior at extreme values.
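A sketch of the log-cosh loss, log(cosh(x)); the numerically stable rewrite below follows from cosh(x) = (e^x + e^(-x))/2 and avoids overflow for large |x|:

```python
import numpy as np

def log_cosh(x):
    """log(cosh(x)), computed stably: for large |x|, cosh(x) overflows,
    so use the identity log(cosh(x)) = |x| + log(1 + e^(-2|x|)) - log(2)."""
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

# Near zero it behaves like x^2/2; far from zero like |x| - log(2).
print(log_cosh(np.array([0.0, 1.0, 100.0])))
```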

Delta and Its Implications

  • The speaker discusses the choice of Delta (Δ), which represents how much square loss one wants to incorporate into their calculations. This choice affects sensitivity to outliers in data.

Characteristics of Cost Functions

  • A connection is made between quadratic functions and absolute loss, explaining how these functions behave differently based on their derivatives and points where they are evaluated.

Symmetry in Cost Functions

  • The cost function is symmetric in x: for very negative x the e^x term approaches zero and e^(-x) dominates, while for very positive x the e^x term grows exponentially. In both directions cosh grows exponentially, so log-cosh grows roughly linearly, like |x| minus log 2. This characteristic is crucial for understanding how the loss behaves far from zero.

Differentiability and Computational Considerations

  • The log-cosh function is noted for being differentiable everywhere, unlike the Huber loss at certain points. However, computing logarithmic functions can be slow when performed repeatedly on large datasets.

Practical Exercise: Graphing Loss Functions

  • An exercise prompts participants to graph four different loss functions (Huber, log-cosh, absolute, and square). Clarification is provided regarding the constant delta used in this context.

Identifying Different Loss Functions

  • Participants engage in identifying various plotted functions based on color coding. Each function's characteristics are discussed briefly during this identification process.

Comparing Huber and Log-Cosh Losses

  • A comparison reveals that Huber and log-cosh losses are quite similar; distinguishing them visually can be challenging due to their comparable shapes.
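The visual similarity can be quantified. With delta = 1 (an assumption; the lecture's constant may differ), the two curves stay within about 0.2 of each other over a wide range:

```python
import numpy as np

def huber(r, delta=1.0):
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

def log_cosh(x):
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

xs = np.linspace(-5, 5, 1001)
gap = np.max(np.abs(huber(xs) - log_cosh(xs)))
print(gap)  # roughly 0.19: for large |x| the curves differ by log(2) - 1/2
```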

Importance of Outlier Management

  • Emphasis is placed on how different loss functions handle outliers. Under a quadratic loss, the optimizer gains far more from correcting a point far from the prediction than from correcting nearby points, so it prioritizes outliers; linear (absolute) losses treat all errors proportionally.

Data Scientist's Choice

Understanding Loss Functions and Regularization in Machine Learning

Theoretical vs. Practical Aspects of Loss Functions

  • Discussions around loss functions highlight that while the theory is well understood, practical applications often face challenges due to data noise.
  • Historical debates in industry regarding the selection of appropriate loss functions indicate the complexity involved in these decisions.
  • Computational efficiency varies among operations; for instance, squaring a number is significantly faster than computing logarithms.

Ridge Regression Demonstration

  • A demo on Ridge regression illustrates how it minimizes square loss with L2 regularization, emphasizing its relevance in current discussions.
  • Visual representation shows how fitting a dataset with minimal regularization results in a line that closely follows data points.
  • Increasing regularization flattens the regression line, demonstrating how excessive regularization can lead to oversimplified models.
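The demo's behavior can be reproduced in a few lines: ridge regression has a closed-form solution minimizing squared loss plus lambda times the squared weight norm, and raising lambda shrinks the fitted slope toward zero (the flattening described above). The data here is a hypothetical example:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# 1-D data lying exactly on the line y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(ridge_fit(X, y, lam=0.0))    # ~[2.0]: no regularization, exact fit
print(ridge_fit(X, y, lam=100.0))  # slope shrunk well below 2
```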

Impact of Outliers on Predictions

  • The effect of outliers is illustrated by showing how one extreme point can distort overall predictions significantly.
  • This example emphasizes the limitations of square loss when dealing with noisy measurements or outlier data points.

Introduction to Regularization Techniques

  • Regularization introduces an additional term to minimize error effectively, as seen in various models like logistic regression and OLS.
  • Key types of classifiers discussed include Ridge regression, logistic regression with MAP estimation, and SVM (Support Vector Machines).

Understanding SVM Optimization Problems

  • Clarification on SVM optimization: minimizing w^T w subject to the constraint that every point is classified with margin at least one is equivalent to maximizing the margin, despite initial confusion about maximizing margins directly.
  • The discussion highlights the importance of understanding constraints within optimization problems for effective classification strategies.

Minimizing W: Understanding the Optimization Problem

The Objective of Minimizing W

  • The SVM objective minimizes w^T w, the squared norm of the weight vector.
  • Shrinking w changes the inner product: for a point x to still satisfy w^T x = 1, x must lie further from the hyperplane.
  • As w shrinks, the constraint boundaries w^T x + b = ±1 move apart; thus minimizing w^T w is precisely what achieves maximum margin.

Constraints and Hyperplane Adjustments

  • When a hyperplane reaches its limit (data points are already at one edge), there’s no more room to shrink W.
  • Rotating the hyperplane can create additional space for adjustment, allowing for a smaller possible value of W squared.
  • The optimization problem seeks to maximize margin while adhering to constraints that prevent exceeding data point boundaries.

Hinge Loss and Regularization

  • The concept of hinge loss is introduced as part of regularization in SVM (Support Vector Machine).
  • Regularization ensures that not just any hyperplane is found but specifically one that maximizes margin—leading towards simplicity in solutions.

Lagrangian Formulation

  • An introduction to Lagrangian formulation highlights how optimization problems with constraints can be reformulated.
  • It emphasizes minimizing loss subject to constraints on R(W), where R represents some regularization function.

Balancing Loss and Regularization

  • A larger lambda places more weight on minimizing R(W) relative to the loss, favoring simpler solutions.
  • For every constraint bound B there is a corresponding lambda that yields an equivalent solution, so the two formulations are interchangeable.

Interpretation of Regularizers

  • L2 regularization can be read as the constraint w^T w ≤ B: feasible solutions must lie within a ball of radius √B around the origin.

Understanding Regularization in Gradient Descent

The Concept of Minimizing Loss

  • The discussion begins with the visualization of a valley representing the loss function, where gradient descent is employed to find the minimum by adjusting weights (W).
  • A constraint is introduced: w^T w ≤ b, meaning the sum of squared weights must be at most a constant, which confines w to a hypersphere around the origin.

Constraints and Regularization

  • The goal is to minimize the loss function while ensuring that the solution remains within the defined hypersphere, which acts as a regularizer.
  • As parameter b increases, it becomes less restrictive; if b is very large, regularization has minimal effect on finding an optimal solution.

Impact of Regularization Strength

  • If regularization strength (lambda) is too high, it may prevent reaching an optimal solution by enforcing constraints that keep solutions within bounds.
  • To determine an appropriate lambda value in practice, one should test various values on a holdout set to identify which yields the best performance.
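The holdout procedure described can be sketched as a simple grid search. The ridge fit is repeated here for self-containment, and the candidate lambda values and synthetic data are illustrative assumptions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def holdout_select(X_tr, y_tr, X_ho, y_ho, lambdas):
    """Fit on the training split for each candidate lambda; keep the one
    with the lowest squared error on the holdout split."""
    def holdout_err(lam):
        w = ridge_fit(X_tr, y_tr, lam)
        return np.mean((X_ho @ w - y_ho) ** 2)
    return min(lambdas, key=holdout_err)

# Synthetic linear data with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
best = holdout_select(X[:70], y[:70], X[70:], y[70:], [0.01, 0.1, 1.0, 10.0])
print(best)
```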

Trade-offs in Model Performance

  • Lowering lambda can improve training performance but may negatively impact test performance due to overfitting. This trade-off will be explored further through bias-variance decomposition.
Video description

Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote10.html