Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
Introduction to Linear Regression
In this section, the instructor introduces linear regression as a supervised learning algorithm and motivates fitting linear models to data.
Linear Regression as a Learning Algorithm
- Linear regression is an algorithm for fitting a linear model to data.
- It is one of the simplest supervised learning algorithms.
- Supervised learning involves mapping inputs (X) to outputs (Y).
- Linear regression is commonly used for regression problems where the output Y is a continuous value.
Motivating Linear Regression with House Price Prediction
- The instructor uses the example of predicting house prices to motivate linear regression.
- A dataset of houses and their corresponding prices is collected.
- The goal is to fit a straight line to the data using linear regression.
Representing Hypothesis in Linear Regression
This section focuses on how to represent the hypothesis in linear regression and discusses important design choices in machine learning algorithms.
Representing Hypothesis
- In linear regression, the hypothesis represents a function that takes input features (X) and outputs a predicted value.
- The hypothesis can be represented as a linear function of the input features.
Design Choices in Machine Learning Algorithms
- Designing a learning algorithm involves making key decisions about workflow, dataset, and hypothesis representation.
- These design choices are crucial for every supervised learning algorithm.
Affine Functions in Linear Regression
This section explains affine functions and their relevance in representing hypotheses in linear regression.
Affine Functions
- A hypothesis with an intercept term is technically an affine function rather than a linear one, though "linear regression" is the standard name.
- In linear regression, an affine function represents the relationship between input features (X) and predicted values.
Multiple Input Features in Linear Regression
This section discusses how linear regression can handle multiple input features.
Handling Multiple Input Features
- Linear regression can handle cases where there are multiple input features.
- For example, in the house price prediction example, additional features like the number of bedrooms can be included.
Estimating the Price of a House
In this section, the speaker introduces a hypothesis for estimating the price of a house using parameters theta and features X1 and X2.
Hypothesis Notation
- The price of a house is estimated using the equation h(x) = θ0 + θ1X1 + θ2X2, where X1 is the size of the house and X2 is the number of bedrooms.
- To simplify notation, an alternative notation is introduced: h(x) = ∑θjXj, where j ranges from 0 to 2.
- X0 is defined as a dummy feature that always takes on the value of 1.
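The hypothesis notation above can be sketched in code; this is an illustrative sketch (the variable names and the example parameter values are assumptions, not from the lecture):

```python
import numpy as np

# Parameters theta = (theta_0, theta_1, theta_2); a house is described by
# x1 = size and x2 = number of bedrooms.
theta = np.array([10.0, 0.5, 2.0])

def hypothesis(x1, x2, theta):
    # Prepend the dummy feature x0 = 1 so that
    # h(x) = sum_j theta_j * x_j also covers the intercept term theta_0.
    x = np.array([1.0, x1, x2])
    return float(theta @ x)

print(hypothesis(2000.0, 3.0, theta))  # 10 + 0.5*2000 + 2*3 = 1016.0
```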
Introducing Hypothesis Notation
The speaker further explains the hypothesis notation and its compact form.
Compact Form of Hypothesis
- The hypothesis can be written as h(x) = ∑θjXj, where j ranges from 0 to 2.
- This compact form allows for concise representation of the hypothesis.
Three-Dimensional Parameters and Features
The speaker discusses how the parameters and features become three-dimensional in nature.
Three-Dimensional Parameters and Features
- In the compact form of the hypothesis, theta becomes a three-dimensional parameter (θ0, θ1, θ2).
- The features also become a three-dimensional feature vector (X0, X1, X2), where X0 is always equal to 1.
- X1 represents the size of the house and X2 represents the number of bedrooms.
Terminology and Notation
The speaker introduces terminology related to parameters, learning algorithms, training examples, and features.
Terminology
- Theta (θ) is referred to as the parameters of the learning algorithm.
- The learning algorithm's job is to choose parameters theta that allow for accurate predictions of house prices.
- M represents the number of training examples, which corresponds to the number of rows in the table.
- X denotes the inputs, often called input features or attributes.
- Y denotes the output, often called the target variable.
Training Examples and Indexing
The speaker explains how training examples are denoted and indexed using notation.
Training Examples and Indexing
- Each training example consists of an input feature vector (x) and an output value (y).
- The pair (x^(i), y^(i)) denotes the i-th training example.
- The superscript i serves as an index into the table of training examples.
- The superscript i ranges from 1 through m, representing the number of training examples.
Number of Features
The speaker introduces n as the number of features in a supervised learning problem.
Number of Features
- n represents the number of features in a supervised learning problem.
- In this example, n is equal to 2 since there are two features: size of house and number of bedrooms.
- The feature vector x and parameter vector theta become n+1 dimensional due to the addition of X0 and θ0.
Choosing Parameters Theta
The speaker discusses how parameters theta are chosen for accurate predictions.
Choosing Parameters Theta
- The goal is to choose values for parameters theta such that h(x) is close to y for all training examples.
- This allows for accurate predictions of house prices.
Notation for Hypothesis
The speaker explains the notation used for the hypothesis function.
Notation for Hypothesis
- h(x) is written as a function of the features of the house, emphasizing its dependence on parameters theta and input features x.
- The notation h_Theta(x) is used to denote this dependency.
Learning Parameters and Linear Regression
In this section, the instructor explains the concept of learning parameters in linear regression and introduces the cost function.
Minimizing Square Difference
- The goal is to choose parameter values (Theta) that make the learning algorithm output prices close to the correct prices for known houses.
- In linear regression, the square difference between the hypothesis output (h_Theta of x) and the correct price (y) needs to be minimized.
Cost Function J(Theta)
- The cost function J(Theta) is defined as the sum of squared differences between predicted prices and true prices for all training examples.
- A constant factor of one-half is often included in front of the sum for mathematical convenience.
- The objective is to find parameter values that minimize J(Theta).
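The cost function just defined can be sketched directly (the tiny dataset here is an illustrative assumption):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared differences between predictions
    and true values over all m training examples.

    X is the m x (n+1) design matrix whose first column is the dummy
    feature x0 = 1; y is the vector of m target values.
    """
    residuals = X @ theta - y
    return 0.5 * float(residuals @ residuals)

# Tiny illustrative dataset: two examples with one feature each.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0])
print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit -> 0.0
```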
Justification for Squared Error
- The choice of squared error as a measure instead of absolute error or other forms will be discussed later when generalizing linear regression.
- Squared error corresponds to assuming Gaussian noise, a connection made precise with generalized linear models.
Gradient Descent Algorithm
This section introduces gradient descent as an algorithm to find optimal parameter values that minimize the cost function.
Initializing Theta
- Theta represents parameter values in linear regression. It can be initialized randomly or with zeros.
- Gradient descent aims to iteratively update Theta to reduce the value of J(Theta).
Visualizing Gradient Descent
- Imagine a surface plot where Theta 0 and Theta 1 are on the horizontal axes, representing different combinations of parameter values.
- The goal is to find Theta 0 and Theta 1 that minimize the height (value) of J(Theta).
- Gradient descent involves taking small steps downhill by adjusting Theta based on its current position on the surface.
Descent Algorithm and Learning Rate
In this section, the lecturer discusses the gradient descent algorithm and the learning rate in linear regression.
Gradient Descent and Local Optima
- Gradient descent is used to find local optima in a function.
- The steepest direction downhill is chosen as the next step in gradient descent.
- Depending on where you initialize parameters, you can reach different local optima.
- Running gradient descent from different starting points can lead to different local minima.
Formalizing Gradient Descent Algorithm
- In each step of gradient descent, parameters are updated based on a learning rate (α) and the partial derivative of the cost function with respect to each parameter.
- The learning rate determines how big of a step is taken in each iteration.
- The update equation for parameter θj is: θj := θj − α · (∂J(θ)/∂θj), applied simultaneously for all j.
Determining the Learning Rate
- The learning rate (α) determines how quickly or slowly the algorithm converges.
- In practice, a common initial value for α is 0.01.
- If features are scaled between 0 and 1 or -1 and 1, trying different values of α can help find the best convergence.
Derivative Calculation
- The partial derivatives ∂J(θ)/∂θj together form the gradient of J, which points in the direction of steepest ascent.
- Moving in the negative gradient direction therefore goes downhill as steeply as possible toward a minimum.
For more detailed equations and derivations, refer to the lecture notes on the course website.
Applying the Chain Rule
In this section, the speaker explains the application of the chain rule of derivatives in a mathematical equation involving Theta and X.
Understanding the Chain Rule and Partial Derivatives (0:28:08 - 0:29:10)
- The equation involves multiplying two terms and applying the chain rule of derivatives.
- The partial derivative of each term with respect to Theta J is calculated.
- Only the term corresponding to J depends on Theta J, while all other terms have a partial derivative of 0.
- The partial derivative of the term depending on Theta J is equal to XJ.
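For a single training example, the chain-rule computation described above works out to:

```latex
\frac{\partial}{\partial \theta_j} \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial}{\partial \theta_j}
    \Bigl(\sum_{k=0}^{n} \theta_k x_k - y\Bigr)
  = \bigl(h_\theta(x) - y\bigr)\, x_j
```

Only the k = j term of the sum depends on θj, which is why its partial derivative reduces to xj.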
Summing over the Training Set
This section focuses on understanding how to calculate the partial derivative for a sum involving multiple training examples.
Calculating Partial Derivative for Multiple Training Examples (0:29:10 - 0:32:35)
- When considering multiple training examples, the correct formula for the partial derivative sums over all m training examples.
- Each training example contributes its input features x^(i) and target label y^(i) to the sum.
- By summing these contributions, we obtain an updated formula for the partial derivative with respect to Theta J.
- The gradient descent algorithm iteratively updates Theta J using this formula for j equals 0 up to n, where n is the number of features.
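Summed over the m training examples, the partial derivative and the resulting update rule are:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)},
\qquad
\theta_j := \theta_j - \alpha \sum_{i=1}^{m}
  \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
```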
Applying Gradient Descent
This section discusses how gradient descent can be used to find optimal parameter values for linear regression models.
Applying Gradient Descent Algorithm (0:31:20 - 0:33:44)
- Gradient descent updates each parameter θj by subtracting the learning rate times the sum over training examples of (h(x^(i)) − y^(i)) · xj^(i).
- This update is repeated until convergence is achieved.
- The algorithm iterates through all features (j equals 0 to n) and updates the corresponding Theta J values.
- By following this process, a good set of parameter values for Theta can be found.
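The process above can be sketched as batch gradient descent (a minimal sketch; the toy dataset, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Minimize J(theta) = 1/2 * ||X @ theta - y||^2 by batch gradient descent.

    Every step uses all m training examples to compute the gradient.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)            # initialize theta to zeros
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y)  # sum_i (h(x_i) - y_i) * x_i
        theta -= alpha * gradient         # step downhill
    return theta

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = batch_gradient_descent(X, y)
print(theta)  # approaches [1, 2]
```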
Properties of the Cost Function
This section highlights the behavior of the cost function J(Theta) for linear regression models and its relationship with local optima.
Understanding the Cost Function and Optimal Solutions (0:33:44 - 0:34:45)
- The cost function J(Theta) for linear regression, defined as the sum of squared terms, forms a quadratic function.
- Because it is quadratic and convex, J(Theta) has no bad local optima; its only stationary point is the single global minimum.
- Visualizing the cost function reveals contours resembling ellipses that represent different levels of error.
Gradient Descent Algorithm
In this section, the speaker explains the concept of gradient descent algorithm and its role in finding the global minimum of a function.
The Steps of Gradient Descent Algorithm
- The algorithm takes steps downhill to find the global minimum.
- The direction of steepest descent is always orthogonal to the contour lines.
- By taking steps downhill, the algorithm eventually converges to the global minimum.
Choosing the Learning Rate Alpha
- Setting Alpha too large can cause overshooting and missing the minimum.
- Setting Alpha too small makes the algorithm slow.
- It is common practice to try different values of Alpha and observe whether J(Theta) decreases or increases.
Trying Multiple Values of Learning Rate
- Trying multiple values of Alpha on an exponential scale (e.g., 0.1, 0.2, 0.4, roughly doubling each time) helps find an efficient learning rate.
Visualizing Gradient Descent
This section focuses on visualizing gradient descent using a dataset and hypothesis fitting.
Initializing Parameters Theta
- If parameters Theta are initialized as 0, the hypothesis starts with a horizontal line fit to data.
Changing Parameters Theta with Each Iteration
- With each iteration of gradient descent, different values of Theta are found that fit the data better.
- The goal is to minimize J(Theta), which is one-half the sum of squared errors.
Convergence to a Decent Fit
- After several iterations, gradient descent converges to a hypothesis that provides a decent straight line fit to data.
Subtracting vs Adding Alpha Times Gradient
This section explains why subtracting Alpha times gradient is used in gradient descent instead of adding it.
Going Downhill with Subtraction
- Subtracting Alpha times the gradient moves the parameters downhill toward the minimum.
- Adding Alpha times the gradient would instead move uphill, increasing the cost.
Conclusion
The speaker concludes by highlighting the significance of gradient descent and linear regression as widely used learning algorithms.
Importance of Gradient Descent
- Gradient descent is one of the most widely used learning algorithms.
- It can be implemented for various purposes and provides decent results.
Alternative Name for Gradient Descent
- The algorithm that calculates derivatives by summing over the entire training set is referred to as batch gradient descent.
Batch Gradient Descent and Stochastic Gradient Descent
In this section, the speaker discusses the differences between batch gradient descent and stochastic gradient descent. Batch gradient descent processes all data as a batch, while stochastic gradient descent updates parameters using one example at a time.
Batch Gradient Descent
- Batch gradient descent processes all data as a batch.
- It becomes slow when dealing with large datasets.
- Every step of gradient descent requires scanning through the entire dataset.
- The disadvantage is that it can be computationally expensive.
Stochastic Gradient Descent
- Stochastic gradient descent updates parameters using one example at a time.
- It takes a slightly noisy and random path but on average heads towards the global minimum.
- It allows for faster progress when dealing with large datasets.
- Parameters won't converge completely due to constantly looking at different examples.
Comparison and Usage
- Batch gradient descent requires scanning through the entire dataset for each update, making it slower for large datasets.
- Stochastic gradient descent is used more in practice for very large datasets due to its faster progress.
- Switching from stochastic to batch gradient descent is possible but not commonly done.
Stochastic Gradient Descent Algorithm
This section explains the algorithm for stochastic gradient descent and how it iteratively updates parameters using individual training examples.
Stochastic Gradient Descent Algorithm
- Initialize parameters somewhere on the contour plot of cost function J(theta).
- Loop through each training example (i = 1 to m) in an inner loop.
- Update theta_j (parameters) for every j using the derivative calculated with respect to one training example i.
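The inner loop just described can be sketched as follows (the toy dataset and step size are illustrative assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=500):
    """Update theta using one training example at a time.

    Each inner-loop step uses only example i, so the path is noisy but
    each update is cheap even when m is very large.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        for i in range(m):                 # i = 1 .. m in the lecture
            error = X[i] @ theta - y[i]    # h(x_i) - y_i
            theta -= alpha * error * X[i]  # update every theta_j
    return theta

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = stochastic_gradient_descent(X, y)
print(theta)  # noisy path, but close to [1, 2]
```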
Comparison of Convergence
The speaker discusses how stochastic gradient descent never fully converges compared to batch gradient descent due to constantly looking at different examples. However, it still makes progress towards the global minimum.
Comparison of Convergence
- Batch gradient descent converges to the global minimum and stops.
- Stochastic gradient descent never fully converges due to constantly looking at different examples.
- Stochastic gradient descent takes a slightly noisy and random path but on average heads towards the global minimum.
Advantages of Stochastic Gradient Descent
This section highlights the advantages of using stochastic gradient descent, especially when dealing with large datasets.
Advantages of Stochastic Gradient Descent
- Allows for faster progress when working with very large datasets.
- Makes implementation and algorithm more efficient.
- Provides a viable option for handling big data.
Comparison and Usage
The speaker discusses the usage of stochastic gradient descent in practice, particularly when dealing with large datasets. The possibility of switching from stochastic to batch gradient descent is also mentioned.
Comparison and Usage
- Stochastic gradient descent is used more in practice than batch gradient descent when dealing with very large datasets.
- Switching from stochastic to batch gradient descent is possible but not commonly done.
[t=48:35s] Stochastic Gradient Descent vs Batch Gradient Descent
In this section, the speaker discusses the differences between stochastic gradient descent and batch gradient descent, and when to use each method.
Stochastic Gradient Descent for Large Datasets
- When working with large datasets, batch gradient descent is rarely used due to its slow performance.
- For modern machine learning tasks with terabytes of data, it is expensive to scan through all the data using batch gradient descent.
- Stochastic gradient descent offers a significant advantage in terms of computational efficiency as it allows for faster iterations.
The Benefits of Stochastic Gradient Descent
- Even if stochastic gradient descent does not converge to the global minimum, the resulting parameter values are still effective for making accurate predictions.
- Trying out different parameters with stochastic gradient descent does not cause harm and can lead to satisfactory results.
- In practice, stochastic gradient descent is commonly used over batch gradient descent.
Decreasing Learning Rate
- Another common approach in training models with stochastic gradient descent is gradually decreasing the learning rate over time.
- By taking smaller steps as the learning rate decreases, the size of oscillations decreases and brings the model closer to a better solution.
- This technique is widely used in practice to improve convergence.
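One common decay schedule looks like the sketch below (the 1/(1 + kt) form is a standard choice, not necessarily the one used in the lecture):

```python
def decayed_learning_rate(alpha0, t, decay=0.01):
    """Shrink the step size as iteration t grows, so the oscillations
    of stochastic gradient descent get smaller over time."""
    return alpha0 / (1.0 + decay * t)

for t in (0, 100, 1000):
    print(t, decayed_learning_rate(0.1, t))  # 0.1, then 0.05, then ~0.0091
```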
[t=50:31s] Monitoring Convergence and Linear Regression
This section covers how to monitor convergence during training using cost function analysis. It also discusses convergence issues specific to linear regression.
Monitoring Convergence
- To determine when to stop training with stochastic gradient descent, plot the cost function J(theta) over time.
- If there is no significant decrease in j(theta), it indicates that further training may not be beneficial, and you can stop training at that point.
Convergence Issues in Linear Regression
- Linear regression has no local optima, which reduces convergence debugging issues compared to highly non-linear models like neural networks.
- However, convergence issues become more acute when training complex models like neural networks.
[t=52:47s] Choosing Between Batch Gradient Descent and Stochastic Gradient Descent
This section discusses the choice between batch gradient descent and stochastic gradient descent based on dataset size and computational efficiency.
Dataset Size Considerations
- For relatively small datasets (hundreds or thousands of examples), it is computationally efficient to use batch gradient descent.
- Batch gradient descent is preferred in such cases as it simplifies the training process by avoiding parameter oscillations.
- However, for large datasets where batch gradient descent becomes prohibitively slow, stochastic gradient descent or similar methods are commonly used.
[t=53:12s] Iterative Algorithms and the Normal Equation
This section introduces iterative algorithms like gradient descent and highlights a special case for linear regression called the normal equation.
Iterative Algorithms
- Gradient descent, both batch and stochastic, is an iterative algorithm that requires multiple steps to approach the global optimum.
- Many algorithms in machine learning rely on gradient descent, including generalized linear models and neural networks.
The Normal Equation for Linear Regression
- The normal equation provides a direct solution for finding optimal parameter values in linear regression without using an iterative algorithm.
- It applies only to linear regression and does not generalize to the other algorithms discussed later in the course.
- The derivation of the normal equation will be presented in subsequent content.
Derivation of the Normal Equation
In this section, the speaker introduces a matrix derivative notation that simplifies the derivation of the normal equation for linear regression. The notation allows for a concise representation and understanding of the process.
Matrix Derivative Notation
- The speaker describes a matrix derivative notation that simplifies the derivation of the normal equation.
- This notation enables solving problems more easily by providing a clear representation.
- The derivative of J with respect to theta is a vector with the same dimension as theta (three-dimensional in the running example).
- Theta represents the parameters in linear regression.
- The derivative is computed for each element of theta.
Function Mapping and Matrix Derivatives
- The speaker explains how function mapping works with matrices and real numbers.
- An example is given where a function maps from a 2x2 matrix to a real number.
- The derivative with respect to A is defined as another matrix with the same dimensions.
- Each element of this derivative matrix corresponds to the derivative with respect to individual elements of A.
Overview and Further Details
- The speaker emphasizes that detailed derivations will be provided in lecture notes.
- Students are encouraged to explore further details on their own through these notes.
- This matrix derivative notation proves convenient for deriving learning algorithms.
Deriving the Normal Equation
In this section, the speaker explains how to derive the normal equation by taking derivatives with respect to Theta and setting it equal to 0. The normal equation allows us to find the global minimum of the cost function J(Theta).
Derivation of the Normal Equation
- Take derivatives with respect to Theta and set it equal to 0.
- Solve for the value of Theta that makes the derivative 0.
- Maxima and minima of a differentiable function occur where the derivative equals 0; since J(Theta) is convex, setting the derivative to 0 yields the global minimum.
Using Matrix Notation
- J(Theta) maps from a vector to a real number.
- Take derivatives with respect to Theta, set them equal to 0, and solve for Theta.
- This gives us a formula for Theta that leads us directly to the global minimum of J(Theta).
Properties of Trace Operator
The speaker discusses properties of the trace operator, which is used in matrix calculations.
Trace Operator Definition
- The trace of a square matrix A is defined as the sum of its diagonal entries.
Properties of Trace Operator
- Trace(A) = Trace(A transpose)
- The trace operator has other useful properties:
  - Trace(AB) = Trace(BA)
  - Trace(A * B * C) = Trace(C * A * B), i.e., the trace is invariant under cyclic permutations
  - The derivative of Trace(AB) with respect to A is B transpose
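These identities are easy to check numerically; the sketch below verifies them on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# tr(A) = tr(A^T): transposing does not change the diagonal.
assert np.isclose(np.trace(A), np.trace(A.T))
# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# Cyclic permutation: tr(ABC) = tr(CAB)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
print("all trace identities hold")
```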
Expressing J(Theta) in Matrix Notation
In this section, we express J(Theta), which represents our cost function, using matrix-vector notation. This allows us to take derivatives with respect to Theta and solve for its value.
Definition of J(Theta)
- J(Theta) = (1/2) * sum over i = 1 to m of (h(x^(i)) - y^(i))^2, where h(x^(i)) is the predicted value and y^(i) is the actual value.
Design Matrix
- Define a matrix X by stacking up the training examples in rows.
- X times Theta represents the matrix-vector multiplication, where Theta is a column vector.
- Each row of X holds the feature vector of one training example, so the i-th entry of X times Theta is the prediction h(x^(i)).
Conclusion
The speaker concludes the discussion on expressing J(Theta) in matrix notation and emphasizes that we can now take derivatives with respect to Theta and solve for its value to find the global minimum of J(Theta).
Definition of J(Theta)
In this section, the speaker defines a vector y as the column vector containing all the labels from the training examples. The cost function J(Theta) is then defined as one-half of the transpose of (X * Theta - y) multiplied by (X * Theta - y).
Definition of J(Theta)
- The vector y is formed by stacking up all the labels from the training examples.
- The cost function J(Theta) is equal to one-half times the transpose of (X * Theta - y) multiplied by (X * Theta - y).
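In symbols, the matrix form of the cost function is:

```latex
J(\theta) = \frac{1}{2}\,(X\theta - y)^\top (X\theta - y)
  = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
```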
Derivation Outline
The speaker outlines the proof for deriving J(Theta), but does not go into great detail. They explain that X * Theta - y represents the errors made by the learning algorithm on each example.
Derivation Outline
- X * Theta - y represents the errors made by the learning algorithm on each example.
- The cost function J(Theta) is computed by taking the sum of squares of these error terms.
Derivative and Normal Equations
The speaker explains that to find the optimum value for Theta, we need to take the derivative of J(Theta) with respect to Theta and set it to zero. This leads to solving X^T * X * Theta = X^T * y, known as normal equations.
Derivative and Normal Equations
- To find optimum value for Theta, take derivative of J(Theta) with respect to Theta and set it to zero.
- Simplifying through matrix derivatives, we get X^T * X * Theta = X^T * y.
- This equation is known as normal equations.
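The derivation, in outline:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \tfrac{1}{2}\,(X\theta - y)^\top (X\theta - y)
  = X^\top X \theta - X^\top y = 0
\quad\Longrightarrow\quad
X^\top X \theta = X^\top y
```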
Optimum Value for Theta
The speaker explains that the optimum value for Theta can be obtained by solving the normal equations. They mention that if X^T * X is non-invertible, it usually indicates redundant or linearly dependent features.
Optimum Value for Theta
- The optimum value for Theta is given by Theta = (X^T * X)^(-1) * X^T * y.
- If X^T * X is non-invertible, it indicates redundant or linearly dependent features.
- In such cases, using the pseudo inverse can provide a solution.
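A sketch of the closed-form solution (the toy dataset is an illustrative assumption; `np.linalg.pinv` computes the pseudoinverse, which keeps the formula well-defined even when X^T * X is singular):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y in closed form.

    Using the pseudoinverse makes this well-defined even when X^T X
    is non-invertible (e.g., redundant or linearly dependent features).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(normal_equation(X, y))  # [1. 2.]
```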
Choosing a Learning Rate
The speaker briefly mentions that choosing a learning rate is often an empirical process where different values are tried and one is selected based on performance.
Choosing a Learning Rate
- Selecting a learning rate is often an empirical process.
- Different values are tried and one is chosen based on performance.
Conclusion
The transcript covers the definition of J(Theta), the derivation outline, the derivative and normal equations, obtaining the optimum value for Theta, and choosing a learning rate. It emphasizes understanding matrix derivatives and provides insights into handling non-invertible matrices.