Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)
Introduction to Linear Regression
In this section, the instructor introduces linear regression as a supervised learning algorithm and motivates fitting linear models to data.
Linear Regression as a Learning Algorithm
- Linear regression is an algorithm for fitting a linear model to data.
- It is one of the simplest supervised learning algorithms.
- Supervised learning involves mapping inputs (X) to outputs (Y).
- Linear regression is commonly used for regression problems where the output Y is a continuous value.
Motivating Linear Regression with House Price Prediction
- The instructor uses the example of predicting house prices to motivate linear regression.
- A dataset of houses and their corresponding prices is collected.
- The goal is to fit a straight line to the data using linear regression.
Representing Hypothesis in Linear Regression
This section focuses on how to represent the hypothesis in linear regression and discusses important design choices in machine learning algorithms.
Representing Hypothesis
- In linear regression, the hypothesis represents a function that takes input features (X) and outputs a predicted value.
- The hypothesis can be represented as a linear function of the input features.
Design Choices in Machine Learning Algorithms
- Designing a learning algorithm involves making key decisions about workflow, dataset, and hypothesis representation.
- These design choices are crucial for every supervised learning algorithm.
Affine Functions in Linear Regression
This section explains affine functions and their relevance in representing hypotheses in linear regression.
Affine Functions
- A hypothesis with an intercept term is technically an affine function rather than a linear one, though "linear regression" is the standard name.
- In linear regression, an affine function represents the relationship between input features (X) and predicted values.
Multiple Input Features in Linear Regression
This section discusses how linear regression can handle multiple input features.
Handling Multiple Input Features
- Linear regression can handle cases where there are multiple input features.
- For example, in the house price prediction example, additional features like the number of bedrooms can be included.
Estimating the Price of a House
In this section, the speaker introduces a hypothesis for estimating the price of a house using parameters theta and features X1 and X2.
Hypothesis Notation
- The price of a house is estimated using the equation h(x) = θ0 + θ1X1 + θ2X2, where X1 is the size of the house and X2 is the number of bedrooms.
- To simplify notation, an alternative notation is introduced: h(x) = ∑θjXj, where j ranges from 0 to 2.
- X0 is defined as a dummy feature that always takes on the value of 1.
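The hypothesis notation above can be sketched in code; this is an illustrative sketch (the variable names and the example parameter values are assumptions, not from the lecture):

```python
import numpy as np

# Parameters theta = (theta_0, theta_1, theta_2); a house is described by
# x1 = size and x2 = number of bedrooms.
theta = np.array([10.0, 0.5, 2.0])

def hypothesis(x1, x2, theta):
    # Prepend the dummy feature x0 = 1 so that
    # h(x) = sum_j theta_j * x_j also covers the intercept term theta_0.
    x = np.array([1.0, x1, x2])
    return float(theta @ x)

print(hypothesis(2000.0, 3.0, theta))  # 10 + 0.5*2000 + 2*3 = 1016.0
```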
Introducing Hypothesis Notation
The speaker further explains the hypothesis notation and its compact form.
Compact Form of Hypothesis
- The hypothesis can be written as h(x) = ∑θjXj, where j ranges from 0 to 2.
- This compact form allows for concise representation of the hypothesis.
Three-Dimensional Parameters and Features
The speaker discusses how the parameters and features become three-dimensional in nature.
Three-Dimensional Parameters and Features
- In the compact form of the hypothesis, theta becomes a three-dimensional parameter (θ0, θ1, θ2).
- The features also become a three-dimensional feature vector (X0, X1, X2), where X0 is always equal to 1.
- X1 represents the size of the house and X2 represents the number of bedrooms.
Terminology and Notation
The speaker introduces terminology related to parameters, learning algorithms, training examples, and features.
Terminology
- Theta (θ) is referred to as the parameters of the learning algorithm.
- The learning algorithm's job is to choose parameters theta that allow for accurate predictions of house prices.
- M represents the number of training examples, which corresponds to the number of rows in the table.
- X denotes the inputs, often called input features or attributes.
- Y denotes the output, often called the target variable.
Training Examples and Indexing
The speaker explains how training examples are denoted and indexed using notation.
Training Examples and Indexing
- Each training example consists of an input feature vector (x) and an output value (y).
- The pair (x^(i), y^(i)) denotes the i-th training example.
- The superscript i serves as an index into the table of training examples.
- The superscript i ranges from 1 through m, representing the number of training examples.
Number of Features
The speaker introduces n as the number of features in a supervised learning problem.
Number of Features
- n represents the number of features in a supervised learning problem.
- In this example, n is equal to 2 since there are two features: size of house and number of bedrooms.
- The feature vector x and parameter vector theta become n+1 dimensional due to the addition of X0 and θ0.
Choosing Parameters Theta
The speaker discusses how parameters theta are chosen for accurate predictions.
Choosing Parameters Theta
- The goal is to choose values for parameters theta such that h(x) is close to y for all training examples.
- This allows for accurate predictions of house prices.
Notation for Hypothesis
The speaker explains the notation used for the hypothesis function.
Notation for Hypothesis
- h(x) is written as a function of the features of the house, emphasizing its dependence on parameters theta and input features x.
- The notation h_Theta(x) is used to denote this dependency.
Learning Parameters and Linear Regression
In this section, the instructor explains the concept of learning parameters in linear regression and introduces the cost function.
Minimizing Square Difference
- The goal is to choose parameter values (Theta) that make the learning algorithm output prices close to the correct prices for known houses.
- In linear regression, the square difference between the hypothesis output (h_Theta of x) and the correct price (y) needs to be minimized.
Cost Function J(Theta)
- The cost function J(Theta) is defined as the sum of squared differences between predicted prices and true prices for all training examples.
- A constant factor of one-half is often included in front of the sum for mathematical convenience.
- The objective is to find parameter values that minimize J(Theta).
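The cost function just defined can be sketched directly (the tiny dataset here is an illustrative assumption):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared differences between predictions
    and true values over all m training examples.

    X is the m x (n+1) design matrix whose first column is the dummy
    feature x0 = 1; y is the vector of m target values.
    """
    residuals = X @ theta - y
    return 0.5 * float(residuals @ residuals)

# Tiny illustrative dataset: two examples with one feature each.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0])
print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit -> 0.0
```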
Justification for Squared Error
- The choice of squared error as a measure instead of absolute error or other forms will be discussed later when generalizing linear regression.
- Squared error corresponds to assuming Gaussian noise, a connection made precise with generalized linear models.
Gradient Descent Algorithm
This section introduces gradient descent as an algorithm to find optimal parameter values that minimize the cost function.
Initializing Theta
- Theta represents parameter values in linear regression. It can be initialized randomly or with zeros.
- Gradient descent aims to iteratively update Theta to reduce the value of J(Theta).
Visualizing Gradient Descent
- Imagine a surface plot where Theta 0 and Theta 1 are on the horizontal axes, representing different combinations of parameter values.
- The goal is to find Theta 0 and Theta 1 that minimize the height (value) of J(Theta).
- Gradient descent involves taking small steps downhill by adjusting Theta based on its current position on the surface.
Descent Algorithm and Learning Rate
In this section, the lecturer discusses the gradient descent algorithm and the learning rate in linear regression.
Gradient Descent and Local Optima
- Gradient descent is used to find local optima in a function.
- The steepest direction downhill is chosen as the next step in gradient descent.
- Depending on where you initialize parameters, you can reach different local optima.
- Running gradient descent from different starting points can lead to different local minima.
Formalizing Gradient Descent Algorithm
- In each step of gradient descent, parameters are updated based on a learning rate (α) and the partial derivative of the cost function with respect to each parameter.
- The learning rate determines how big of a step is taken in each iteration.
- The update equation for parameter θj is: θj := θj − α · (∂J(θ)/∂θj), applied simultaneously for all j.
Determining the Learning Rate
- The learning rate (α) determines how quickly or slowly the algorithm converges.
- In practice, a common initial value for α is 0.01.
- If features are scaled between 0 and 1 or -1 and 1, trying different values of α can help find the best convergence.
Derivative Calculation
- The partial derivatives ∂J(θ)/∂θj together form the gradient of J, which points in the direction of steepest ascent.
- Moving in the negative gradient direction therefore goes downhill as steeply as possible toward a minimum.
For more detailed equations and derivations, refer to the lecture notes on the course website.
Applying the Chain Rule
In this section, the speaker explains the application of the chain rule of derivatives in a mathematical equation involving Theta and X.
Understanding the Chain Rule and Partial Derivatives (0:28:08 - 0:29:10)
- The equation involves multiplying two terms and applying the chain rule of derivatives.
- The partial derivative of each term with respect to Theta J is calculated.
- Only the term corresponding to J depends on Theta J, while all other terms have a partial derivative of 0.
- The partial derivative of the term depending on Theta J is equal to XJ.
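For a single training example, the chain-rule computation described above works out to:

```latex
\frac{\partial}{\partial \theta_j} \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial}{\partial \theta_j}
    \Bigl(\sum_{k=0}^{n} \theta_k x_k - y\Bigr)
  = \bigl(h_\theta(x) - y\bigr)\, x_j
```

Only the k = j term of the sum depends on θj, which is why its partial derivative reduces to xj.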
Summing over the Training Set
This section focuses on understanding how to calculate the partial derivative for a sum involving multiple training examples.
Calculating Partial Derivative for Multiple Training Examples (0:29:10 - 0:32:35)
- When considering multiple training examples, the correct formula for the partial derivative sums over all m training examples.
- Each training example contributes its input features x^(i) and target label y^(i) to the sum.
- By summing these contributions, we obtain an updated formula for the partial derivative with respect to Theta J.
- The gradient descent algorithm iteratively updates Theta J using this formula for j equals 0 up to n, where n is the number of features.
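Summed over the m training examples, the partial derivative and the resulting update rule are:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)},
\qquad
\theta_j := \theta_j - \alpha \sum_{i=1}^{m}
  \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
```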
Applying Gradient Descent
This section discusses how gradient descent can be used to find optimal parameter values for linear regression models.
Applying Gradient Descent Algorithm (0:31:20 - 0:33:44)
- Gradient descent updates each parameter θj by subtracting the learning rate times the sum over training examples of (h(x^(i)) − y^(i)) · xj^(i).
- This update is repeated until convergence is achieved.
- The algorithm iterates through all features (j equals 0 to n) and updates the corresponding Theta J values.
- By following this process, a good set of parameter values for Theta can be found.
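The process above can be sketched as batch gradient descent (a minimal sketch; the toy dataset, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Minimize J(theta) = 1/2 * ||X @ theta - y||^2 by batch gradient descent.

    Every step uses all m training examples to compute the gradient.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)            # initialize theta to zeros
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y)  # sum_i (h(x_i) - y_i) * x_i
        theta -= alpha * gradient         # step downhill
    return theta

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = batch_gradient_descent(X, y)
print(theta)  # approaches [1, 2]
```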
Properties of the Cost Function
This section highlights the behavior of the cost function J(Theta) for linear regression models and its relationship with local optima.
Understanding the Cost Function and Optimal Solutions (0:33:44 - 0:34:45)
- The cost function J(Theta) for linear regression, defined as the sum of squared terms, forms a quadratic function.
- Because it is quadratic and convex, J(Theta) has no bad local optima; its only stationary point is the single global minimum.
- Visualizing the cost function reveals contours resembling ellipses that represent different levels of error.
Gradient Descent Algorithm
In this section, the speaker explains the concept of gradient descent algorithm and its role in finding the global minimum of a function.
The Steps of Gradient Descent Algorithm
- The algorithm takes steps downhill to find the global minimum.
- The direction of steepest descent is always orthogonal to the contour lines.
- By taking steps downhill, the algorithm eventually converges to the global minimum.
Choosing the Learning Rate Alpha
- Setting Alpha too large can cause overshooting and missing the minimum.
- Setting Alpha too small makes the algorithm slow.
- It is common practice to try different values of Alpha and observe whether J(Theta) decreases or increases.
Trying Multiple Values of Learning Rate
- Trying multiple values of Alpha on an exponential scale (e.g., 0.1, 0.2, 0.4, roughly doubling each time) helps find an efficient learning rate.
Visualizing Gradient Descent
This section focuses on visualizing gradient descent using a dataset and hypothesis fitting.
Initializing Parameters Theta
- If parameters Theta are initialized as 0, the hypothesis starts with a horizontal line fit to data.
Changing Parameters Theta with Each Iteration
- With each iteration of gradient descent, different values of Theta are found that fit the data better.
- The goal is to minimize J(Theta), which is one-half the sum of squared errors.
Convergence to a Decent Fit
- After several iterations, gradient descent converges to a hypothesis that provides a decent straight line fit to data.
Subtracting vs Adding Alpha Times Gradient
This section explains why subtracting Alpha times gradient is used in gradient descent instead of adding it.
Going Downhill with Subtraction
- Subtracting Alpha times the gradient moves the parameters downhill toward the minimum.
- Adding Alpha times the gradient would instead move uphill, increasing the cost.
Conclusion
The speaker concludes by highlighting the significance of gradient descent and linear regression as widely used learning algorithms.
Importance of Gradient Descent
- Gradient descent is one of the most widely used learning algorithms.
- It can be implemented for various purposes and provides decent results.
Alternative Name for Gradient Descent
- The algorithm that calculates derivatives by summing over the entire training set is referred to as batch gradient descent.
Batch Gradient Descent and Stochastic Gradient Descent
In this section, the speaker discusses the differences between batch gradient descent and stochastic gradient descent. Batch gradient descent processes all data as a batch, while stochastic gradient descent updates parameters using one example at a time.
Batch Gradient Descent
- Batch gradient descent processes all data as a batch.
- It becomes slow when dealing with large datasets.
- Every step of gradient descent requires scanning through the entire dataset.
- The disadvantage is that it can be computationally expensive.
Stochastic Gradient Descent
- Stochastic gradient descent updates parameters using one example at a time.
- It takes a slightly noisy and random path but on average heads towards the global minimum.
- It allows for faster progress when dealing with large datasets.
- Parameters won't converge completely due to constantly looking at different examples.
Comparison and Usage
- Batch gradient descent requires scanning through the entire dataset for each update, making it slower for large datasets.
- Stochastic gradient descent is used more in practice for very large datasets due to its faster progress.
- Switching from stochastic to batch gradient descent is possible but not commonly done.
Stochastic Gradient Descent Algorithm
This section explains the algorithm for stochastic gradient descent and how it iteratively updates parameters using individual training examples.
Stochastic Gradient Descent Algorithm
- Initialize parameters somewhere on the contour plot of cost function J(theta).
- Loop through each training example (i = 1 to m) in an inner loop.
- Update theta_j (parameters) for every j using the derivative calculated with respect to one training example i.
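The inner loop just described can be sketched as follows (the toy dataset and step size are illustrative assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=500):
    """Update theta using one training example at a time.

    Each inner-loop step uses only example i, so the path is noisy but
    each update is cheap even when m is very large.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        for i in range(m):                 # i = 1 .. m in the lecture
            error = X[i] @ theta - y[i]    # h(x_i) - y_i
            theta -= alpha * error * X[i]  # update every theta_j
    return theta

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = stochastic_gradient_descent(X, y)
print(theta)  # noisy path, but close to [1, 2]
```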
Comparison of Convergence
The speaker discusses how stochastic gradient descent never fully converges compared to batch gradient descent due to constantly looking at different examples. However, it still makes progress towards the global minimum.
Comparison of Convergence
- Batch gradient descent converges to the global minimum and stops.
- Stochastic gradient descent never fully converges due to constantly looking at different examples.
- Stochastic gradient descent takes a slightly noisy and random path but on average heads towards the global minimum.
Advantages of Stochastic Gradient Descent
This section highlights the advantages of using stochastic gradient descent, especially when dealing with large datasets.
Advantages of Stochastic Gradient Descent
- Allows for faster progress when working with very large datasets.
- Makes implementation and algorithm more efficient.
- Provides a viable option for handling big data.
Comparison and Usage
The speaker discusses the usage of stochastic gradient descent in practice, particularly when dealing with large datasets. The possibility of switching from stochastic to batch gradient descent is also mentioned.
Comparison and Usage
- Stochastic gradient descent is used more in practice than batch gradient descent when dealing with very large datasets.
- Switching from stochastic to batch gradient descent is possible but not commonly done.
[t=48:35s] Stochastic Gradient Descent vs Batch Gradient Descent
In this section, the speaker discusses the differences between stochastic gradient descent and batch gradient descent, and when to use each method.
Stochastic Gradient Descent for Large Datasets
- When working with large datasets, batch gradient descent is rarely used due to its slow performance.
- For modern machine learning tasks with terabytes of data, it is expensive to scan through all the data using batch gradient descent.
- Stochastic gradient descent offers a significant advantage in terms of computational efficiency as it allows for faster iterations.
The Benefits of Stochastic Gradient Descent
- Even if stochastic gradient descent does not converge to the global minimum, the resulting parameter values are still effective for making accurate predictions.
- Trying out different parameters with stochastic gradient descent does not cause harm and can lead to satisfactory results.
- In practice, stochastic gradient descent is commonly used over batch gradient descent.
Decreasing Learning Rate
- Another common approach in training models with stochastic gradient descent is gradually decreasing the learning rate over time.
- By taking smaller steps as the learning rate decreases, the size of oscillations decreases and brings the model closer to a better solution.
- This technique is widely used in practice to improve convergence.
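One common decay schedule looks like the sketch below (the 1/(1 + kt) form is a standard choice, not necessarily the one used in the lecture):

```python
def decayed_learning_rate(alpha0, t, decay=0.01):
    """Shrink the step size as iteration t grows, so the oscillations
    of stochastic gradient descent get smaller over time."""
    return alpha0 / (1.0 + decay * t)

for t in (0, 100, 1000):
    print(t, decayed_learning_rate(0.1, t))  # 0.1, then 0.05, then ~0.0091
```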
[t=50:31s] Monitoring Convergence and Linear Regression
This section covers how to monitor convergence during training using cost function analysis. It also discusses convergence issues specific to linear regression.
Monitoring Convergence
- To determine when to stop training with stochastic gradient descent, plot the cost function J(theta) over time.
- If there is no significant decrease in j(theta), it indicates that further training may not be beneficial, and you can stop training at that point.
Convergence Issues in Linear Regression
- Linear regression has no local optima, which reduces convergence debugging issues compared to highly non-linear models like neural networks.
- However, convergence issues become more acute when training complex models like neural networks.
[t=52:47s] Choosing Between Batch Gradient Descent and Stochastic Gradient Descent
This section discusses the choice between batch gradient descent and stochastic gradient descent based on dataset size and computational efficiency.
Dataset Size Considerations
- For relatively small datasets (hundreds or thousands of examples), it is computationally efficient to use batch gradient descent.
- Batch gradient descent is preferred in such cases as it simplifies the training process by avoiding parameter oscillations.
- However, for large datasets where batch gradient descent becomes prohibitively slow, stochastic gradient descent or similar methods are commonly used.
[t=53:12s] Iterative Algorithms and the Normal Equation
This section introduces iterative algorithms like gradient descent and highlights a special case for linear regression called the normal equation.
Iterative Algorithms
- Gradient descent, both batch and stochastic, is an iterative algorithm that requires multiple steps to approach the global optimum.
- Many algorithms in machine learning rely on gradient descent, including generalized linear models and neural networks.
The Normal Equation for Linear Regression
- The normal equation provides a direct solution for finding optimal parameter values in linear regression without using an iterative algorithm.
- It applies only to linear regression and does not generalize to the other algorithms discussed later in the course.
- The derivation of the normal equation will be presented in subsequent content.
Derivation of the Normal Equation
In this section, the speaker introduces a matrix derivative notation that simplifies the derivation of the normal equation for linear regression. The notation allows for a concise representation and understanding of the process.
Matrix Derivative Notation
- The speaker describes a matrix derivative notation that simplifies the derivation of the normal equation.
- This notation enables solving problems more easily by providing a clear representation.
- The derivative of J with respect to theta is a vector with the same dimension as theta (three-dimensional in the running example).
- Theta represents the parameters in linear regression.
- The derivative is computed for each element of theta.
Function Mapping and Matrix Derivatives
- The speaker explains how function mapping works with matrices and real numbers.
- An example is given where a function maps from a 2x2 matrix to a real number.
- The derivative with respect to A is defined as another matrix with the same dimensions.
- Each element of this derivative matrix corresponds to the derivative with respect to individual elements of A.
Overview and Further Details
- The speaker emphasizes that detailed derivations will be provided in lecture notes.
- Students are encouraged to explore further details on their own through these notes.
- This matrix derivative notation proves convenient for deriving learning algorithms.
Deriving the Normal Equation
In this section, the speaker explains how to derive the normal equation by taking derivatives with respect to Theta and setting it equal to 0. The normal equation allows us to find the global minimum of the cost function J(Theta).
Derivation of the Normal Equation
- Take derivatives with respect to Theta and set it equal to 0.
- Solve for the value of Theta that makes the derivative 0.
- Maxima and minima of a differentiable function occur where the derivative equals 0; since J(Theta) is convex, setting the derivative to 0 yields the global minimum.
Using Matrix Notation
- J(Theta) maps from a vector to a real number.
- Take derivatives with respect to Theta, set them equal to 0, and solve for Theta.
- This gives us a formula for Theta that leads us directly to the global minimum of J(Theta).
Properties of Trace Operator
The speaker discusses properties of the trace operator, which is used in matrix calculations.
Trace Operator Definition
- The trace of a square matrix A is defined as the sum of its diagonal entries.
Properties of Trace Operator
- Trace(A) = Trace(A transpose)
- The trace operator has other useful properties:
  - Trace(AB) = Trace(BA)
  - Trace(A * B * C) = Trace(C * A * B), i.e., the trace is invariant under cyclic permutations
  - The derivative of Trace(AB) with respect to A is B transpose
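These identities are easy to check numerically; the sketch below verifies them on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# tr(A) = tr(A^T): transposing does not change the diagonal.
assert np.isclose(np.trace(A), np.trace(A.T))
# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# Cyclic permutation: tr(ABC) = tr(CAB)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
print("all trace identities hold")
```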
Expressing J(Theta) in Matrix Notation
In this section, we express J(Theta), which represents our cost function, using matrix-vector notation. This allows us to take derivatives with respect to Theta and solve for its value.
Definition of J(Theta)
- J(Theta) = (1/2) * sum over i = 1 to m of (h(x^(i)) - y^(i))^2, where h(x^(i)) is the predicted value and y^(i) is the actual value.
Design Matrix
- Define a matrix X by stacking up the training examples in rows.
- X times Theta represents the matrix-vector multiplication, where Theta is a column vector.
- Each row of X holds the feature vector of one training example, so the i-th entry of X times Theta is the prediction h(x^(i)).
Conclusion
The speaker concludes the discussion on expressing J(Theta) in matrix notation and emphasizes that we can now take derivatives with respect to Theta and solve for its value to find the global minimum of J(Theta).
Definition of J(Theta)
In this section, the speaker defines a vector y as the column vector containing all the labels from the training examples. The cost function J(Theta) is then defined as one-half of the transpose of (X * Theta - y) multiplied by (X * Theta - y).
Definition of J(Theta)
- The vector y is formed by stacking up all the labels from the training examples.
- The cost function J(Theta) is equal to one-half times the transpose of (X * Theta - y) multiplied by (X * Theta - y).
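In symbols, the matrix form of the cost function is:

```latex
J(\theta) = \frac{1}{2}\,(X\theta - y)^\top (X\theta - y)
  = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
```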
Derivation Outline
The speaker outlines the proof for deriving J(Theta), but does not go into great detail. They explain that X * Theta - y represents the errors made by the learning algorithm on each example.
Derivation Outline
- X * Theta - y represents the errors made by the learning algorithm on each example.
- The cost function J(Theta) is computed by taking the sum of squares of these error terms.
Derivative and Normal Equations
The speaker explains that to find the optimum value for Theta, we need to take the derivative of J(Theta) with respect to Theta and set it to zero. This leads to solving X^T * X * Theta = X^T * y, known as normal equations.
Derivative and Normal Equations
- To find optimum value for Theta, take derivative of J(Theta) with respect to Theta and set it to zero.
- Simplifying through matrix derivatives, we get X^T * X * Theta = X^T * y.
- This equation is known as normal equations.
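The derivation, in outline:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \tfrac{1}{2}\,(X\theta - y)^\top (X\theta - y)
  = X^\top X \theta - X^\top y = 0
\quad\Longrightarrow\quad
X^\top X \theta = X^\top y
```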
Optimum Value for Theta
The speaker explains that the optimum value for Theta can be obtained by solving the normal equations. They mention that if X^T * X is non-invertible, it usually indicates redundant or linearly dependent features.
Optimum Value for Theta
- The optimum value for Theta is given by Theta = (X^T * X)^(-1) * X^T * y.
- If X^T * X is non-invertible, it indicates redundant or linearly dependent features.
- In such cases, using the pseudo inverse can provide a solution.
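A sketch of the closed-form solution (the toy dataset is an illustrative assumption; `np.linalg.pinv` computes the pseudoinverse, which keeps the formula well-defined even when X^T * X is singular):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y in closed form.

    Using the pseudoinverse makes this well-defined even when X^T X
    is non-invertible (e.g., redundant or linearly dependent features).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data generated from y = 1 + 2*x (first column is the dummy feature x0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(normal_equation(X, y))  # [1. 2.]
```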
Choosing a Learning Rate
The speaker briefly mentions that choosing a learning rate is often an empirical process where different values are tried and one is selected based on performance.
Choosing a Learning Rate
- Selecting a learning rate is often an empirical process.
- Different values are tried and one is chosen based on performance.
Conclusion
The transcript covers the definition of J(Theta), the derivation outline, the derivative and normal equations, obtaining the optimum value for Theta, and choosing a learning rate. It emphasizes understanding matrix derivatives and provides insights into handling non-invertible matrices.