Machine Learning Lecture 19 "Bias Variance Decomposition" -Cornell CS4780 SP17

Introduction to the Bias-Variance Tradeoff

Announcement and Importance of Today's Topic

  • The session begins with a brief announcement about robotic systems, inviting interested individuals to learn more.
  • The speaker emphasizes the significance of today's topic: the bias-variance tradeoff, which is crucial for understanding machine learning.
  • Understanding this concept distinguishes knowledgeable practitioners from those merely experimenting with algorithms.

Relevance to Upcoming Project

  • The upcoming project will focus on the bias-variance tradeoff, making it essential for students to grasp today’s lecture content.
  • Students are encouraged to pay close attention as mastering this topic will simplify their project work.

Understanding Generalization Error

Decomposing Generalization Error

  • The lecture aims to explore generalization error rather than just minimizing training error, providing deeper insights into model performance.
  • A regression setting is used for easier derivation; data points are drawn from a distribution P.

Data Distribution and Predictions

  • Data points (X, Y) are assumed to be independent and identically distributed (iid), forming the basis of many machine learning assumptions.
  • For any given input vector X, there may not be a unique output Y due to variations in underlying distributions.

Expected Label Prediction

Predicting Expected Values

  • In regression tasks, predicting the expected label ȳ(x) is a reasonable target when the same input x can yield different outputs.
  • It is computed by integrating over all possible values of y, weighted by their conditional probability: ȳ(x) = ∫ y · p(y | x) dy.
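
As a minimal sketch (with a made-up discrete conditional distribution, not from the lecture), the expected label is just the probability-weighted average of the possible labels for a fixed x:

```python
# Hypothetical discrete conditional distribution p(y | x) for one fixed x:
# the same input x can produce several different labels y.
p_y_given_x = {1.0: 0.2, 2.0: 0.5, 3.0: 0.3}  # label -> probability

# Expected label ybar(x) = sum over y of y * p(y | x)
# (the discrete form of the integral from the lecture).
y_bar = sum(y * p for y, p in p_y_given_x.items())
print(y_bar)  # approximately 2.1
```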

Setting Up Machine Learning Algorithms

Understanding Machine Learning Algorithms and Generalization Error

Overview of Machine Learning Algorithms

  • The discussion begins with an explanation of machine learning algorithms, specifically mentioning perceptrons and linear classifiers. The speaker emphasizes the process of inputting training data to generate a sound classifier.
  • The speaker highlights their daily work in developing these algorithms, indicating that they utilize a dataset D drawn from a distribution P to train models.

Generalization Error

  • A key focus is generalization error: the expected error on new data points not seen during training, as opposed to the training error measured on the dataset D.
  • The expected test error is defined as the anticipated performance of a classifier H trained on D, evaluated on unseen data.

Expected Test Error Calculation

  • To compute the expected test error, one draws new inputs from the distribution P and measures errors using the squared loss, which simplifies the calculation for regression tasks.
  • The speaker acknowledges that while squared loss is effective for regression, it may not be as straightforward for other types of problems but assures further clarification will follow.
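
As an illustration (with synthetic numbers invented here, not from the lecture), the expected test error under squared loss can be estimated by averaging squared residuals on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical held-out test labels and some regressor's predictions on them.
y_true = rng.normal(size=1000)
y_pred = y_true + rng.normal(scale=0.5, size=1000)

# The average squared loss on a test sample estimates the expected test error
# E_{(x, y) ~ P}[(h(x) - y)^2].
test_error = np.mean((y_pred - y_true) ** 2)
print(test_error)  # close to 0.25 = 0.5^2 for this prediction-noise level
```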

Random Variables and Distribution

  • Both the dataset D and the resulting classifier H are random variables, since they depend on sampling from the distribution P; each draw can yield a different outcome.
  • To estimate expected test error practically, one would sample multiple datasets and evaluate them; however, this approach has limitations in real-world applications.

Average Predictions Across Classifiers

  • The concept of averaging predictions across all possible classifiers is introduced: integrating over all potential datasets drawn from P yields an average prediction function, denoted H̄.
  • The weak law of large numbers suggests that, with sufficiently many sampled datasets, averaging the classifiers' outputs gives an accurate estimate of this expected value.
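
A Monte Carlo sketch of this idea (the toy distribution and the linear learner are both invented for illustration): train on many independently drawn datasets and average the resulting predictions at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=20):
    """Draw one training set D from a hypothetical distribution P."""
    x = rng.uniform(-1, 1, size=n)
    y = x ** 2 + rng.normal(scale=0.1, size=n)  # ybar(x) = x^2, plus noise
    return x, y

def train(x, y):
    """The learning algorithm: fit a degree-1 polynomial (linear regression)."""
    return np.polyfit(x, y, deg=1)

# Averaging the predictions of classifiers trained on many independent
# datasets approximates the expected classifier Hbar(x) = E_D[h_D(x)],
# by the weak law of large numbers.
x0 = 0.5
preds = [np.polyval(train(*sample_dataset()), x0) for _ in range(2000)]
h_bar_at_x0 = float(np.mean(preds))
print(h_bar_at_x0)
```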

How to Combine Multiple Classifiers?

Understanding Random Processes in Data Sets

  • The discussion begins with the concept of combining multiple classifiers, emphasizing that data sets can be viewed as random processes. A data set D is drawn from a distribution, which allows for variability in its composition.
  • It is noted that both sets and functions can be treated as random variables. This means one can sample multiple data sets to analyze their characteristics and expected outcomes.

Practical Considerations in Data Set Management

  • A key point raised is the practicality of using large data sets versus splitting them into smaller ones. If infinite data were available, it might be more efficient to use one comprehensive set rather than averaging many smaller ones.
  • The variance among different data sets provides insights into uncertainty, suggesting that while combining may seem practical, understanding individual contributions remains important.

Defining Expected Classifiers

  • The notion of an "expected classifier" is introduced: H̄ represents the function obtained by averaging the individual classifiers, weighted by how likely their training sets are.
  • Caution is advised regarding the nature of these functions; they could range from linear to highly nonlinear forms like decision trees.

Generalization Error and Algorithm Performance

  • Transitioning to algorithm performance: since h is a random variable, one can compute the expected error of an algorithm A by integrating over all possible classifiers trained on varying datasets.
  • The process includes drawing training datasets and evaluating how well each classifier performs against test points to determine generalization error effectively.

Decomposing Error Expressions

  • The lecture emphasizes decomposing expressions related to error calculations. Understanding this decomposition will be crucial for future discussions about algorithm selection and design based on performance metrics.

Understanding Error in Classifiers

Analyzing the Expression

  • The speaker discusses manipulating the expression by adding and subtracting H̄(x), the expected classifier; this is justified because it does not change the expression's overall value.
  • The speaker emphasizes that enclosing terms in parentheses allows for further analysis, particularly focusing on understanding where high error originates from.

Completing the Square

  • The process of completing the square is introduced, with a correction noted regarding notation. The speaker explains how to identify components A and B within the squared term.
  • It is claimed that one specific term becomes zero during this process, prompting audience engagement to confirm understanding of this simplification.

Expected Values and Independence

  • A discussion arises about why certain terms can be eliminated: their expected values are zero. The iid nature of the (x, y) data points is highlighted.
  • It is clarified that while x and y are not independent of each other, the expectations can still be manipulated to simplify the expression.

Variance Interpretation

  • The error of the algorithm splits into two main terms:
  • E_D[(h_D(x) − H̄(x))²], the expected squared deviation of the trained classifier from the average classifier
  • (H̄(x) − y)², the squared difference between the average classifier and the label
  • Notably, the first term is the variance, since it measures how much predictions deviate from their mean function value.
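
In symbols, writing H̄(x) = E_D[h_D(x)] for the expected classifier, this first split (for a fixed test pair (x, y)) reads:

```latex
\mathbb{E}_{D}\big[(h_D(x) - y)^2\big]
  = \underbrace{\mathbb{E}_{D}\big[(h_D(x) - \bar{H}(x))^2\big]}_{\text{variance}}
  + \big(\bar{H}(x) - y\big)^2
```

The cross term 2 · E_D[h_D(x) − H̄(x)] · (H̄(x) − y) vanishes because E_D[h_D(x)] = H̄(x), which is exactly the simplification discussed above.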

Decomposing Further

  • In analyzing the second term further, a similar approach is taken by adding and subtracting ȳ(x), leading to deeper insights into prediction errors relative to the labels.

Understanding the Components of Prediction Error in Machine Learning

Decomposing Prediction Error

  • The discussion begins with a mathematical expression involving H(X) and Y , where the squared differences are analyzed. The focus is on how these terms relate to each other.
  • The expected value of the product of two variables is examined, leading to cancellations that simplify the analysis. This sets up for further decomposition into expected values.
  • The expectation over the pair (x, y) is decomposed by iterating: first over x, then over y given x.
  • Since certain terms do not depend on y, the inner expectation applies only to the remaining factors.
  • It’s concluded that some terms vanish, reinforcing that they contribute zero to the overall error.
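
The whole decomposition can be checked numerically on a toy setup (the distribution, noise level, and linear learner are all invented for illustration): at a fixed test point, the expected squared error should match variance + bias² + noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: ybar(x) = x^2, Gaussian label noise, linear learner.
sigma = 0.1
def ybar(x):
    return x ** 2

def sample_dataset(n=20):
    x = rng.uniform(-1, 1, size=n)
    return x, ybar(x) + rng.normal(scale=sigma, size=n)

# Train many classifiers h_D; record their predictions at a test point x0.
x0 = 0.5
preds = np.array([np.polyval(np.polyfit(*sample_dataset(), deg=1), x0)
                  for _ in range(5000)])

variance = float(np.var(preds))                    # E_D[(h_D(x0) - Hbar(x0))^2]
bias_sq = float((np.mean(preds) - ybar(x0)) ** 2)  # (Hbar(x0) - ybar(x0))^2
noise = sigma ** 2                                 # E_y[(y - ybar(x0))^2]

# Monte Carlo estimate of the expected error E_{D, y}[(h_D(x0) - y)^2],
# pairing each trained classifier with an independent label draw at x0.
ys = ybar(x0) + rng.normal(scale=sigma, size=len(preds))
total = float(np.mean((preds - ys) ** 2))

print(total, variance + bias_sq + noise)  # the two should nearly agree
```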

Analyzing Variance and Bias

  • The speaker emphasizes understanding prediction error as composed of three distinct components: variance, bias, and noise.
  • Variance is defined as how much predictions from different classifiers vary when trained on different datasets. This reflects model stability across training sets.
  • The difference between the actual and expected labels, y − ȳ(x), indicates data difficulty; large discrepancies suggest a challenging problem due to noisy data.
  • Noisy data can lead to inconsistent label predictions for similar feature vectors, complicating accurate modeling efforts.

Exploring Noise and Bias

  • Bias squared represents systematic errors when predicting average outcomes under ideal conditions without noise. It captures inherent limitations in model assumptions or structure.
  • Examples illustrate bias: if a model assumes linearity while data exhibits non-linearity, it will consistently mispredict regardless of additional training data.
  • Clarification on bias indicates it can be positive or negative; squaring removes directional concerns but retains magnitude in error calculations.
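
A small sketch of the linearity example (toy data, invented for illustration): fit a linear model to noiseless quadratic data and observe that the error at x = 1 persists no matter how much data is used.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data follows y = x^2 exactly (no noise), but the model class is linear.
# Adding more data cannot remove the resulting systematic error (bias).
for n in (50, 500, 5000):
    x = rng.uniform(-1, 1, size=n)
    y = x ** 2
    slope, intercept = np.polyfit(x, y, deg=1)  # best linear fit
    pred = intercept + slope * 1.0              # prediction at x = 1
    print(n, pred)  # stays near 1/3, while the true value is 1
```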

Conclusion on Error Components

Understanding Bias and Variance in Classifiers

The Role of Bias and Variance in Error

  • The error of a classifier can be attributed to the variance of the classifier, the noise in the data, and the bias. The upcoming lectures will focus on methods to reduce these errors.
  • Data scientists must assess whether their classifier's issues stem from high bias or high variance, then apply appropriate measures to address the larger problem.
  • It's crucial for practitioners to accurately identify the source of error when training classifiers; misdiagnosing can lead to ineffective solutions.

Analogy: Throwing Darts

  • An analogy involving throwing darts is introduced to illustrate concepts of bias and variance. Participants are asked to visualize outcomes based on different conditions.
  • Low variance and low noise result in minimal error, with all darts landing at the center. This scenario represents ideal performance.

Exploring High Variance vs. High Bias

  • When experiencing high variance but low bias, throws may vary widely around a central point without systematic error; averaging results would still yield accuracy.
  • In contrast, high bias with low variance means consistent accuracy at an incorrect target; this leads to systematic errors despite precision.
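
The dart analogy can be simulated (all numbers invented): model each throw as the target plus a fixed offset (bias) and random scatter (variance), and compare the mean squared error across regimes.

```python
import numpy as np

rng = np.random.default_rng(3)
target = 0.0  # the bullseye

def throws(bias, spread, n=10000):
    """Dart landings: a systematic offset (bias) plus random scatter (variance)."""
    return target + bias + rng.normal(scale=spread, size=n)

# The MSE in each regime decomposes as bias^2 + variance.
for name, bias, spread in [("low bias / low variance",   0.0, 0.1),
                           ("low bias / high variance",  0.0, 1.0),
                           ("high bias / low variance",  2.0, 0.1),
                           ("high bias / high variance", 2.0, 1.0)]:
    hits = throws(bias, spread)
    mse = float(np.mean((hits - target) ** 2))
    print(f"{name}: MSE {mse:.2f}")
```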

Worst Case Scenario: High Bias and High Variance

  • With both high bias and high variance, the darts scatter widely around a point that is itself off-center; errors are large and systematic, combining the worst of both regimes.

Video description

Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html