Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)

Introduction to Supervised Learning

In this section, the instructor introduces the topics that will be covered in the lecture.

Topics Covered

  • Locally weighted regression
  • Probabilistic interpretation of linear regression
  • Logistic regression and Newton's method for logistic regression

Locally Weighted Regression

The instructor discusses locally weighted regression as a way to modify linear regression so that it can fit very non-linear functions.

Key Points

  • Locally weighted regression modifies linear regressions to fit non-linear functions.
  • This allows for fitting more complex data sets than straight lines.
  • Later in the course, feature selection algorithms will be discussed as a way to automatically decide which features are best suited for fitting data.

Notation Used in Linear Regression

The instructor reviews notation used in linear regression from the previous lecture.

Key Points

  • x^(i) is an (n+1)-dimensional training example, with the convention that x_0 is always set to 1.
  • y^(i) is always a real number in the case of regression.
  • J(theta) is the cost function you minimize to find the parameters theta for your straight-line fit to the data.
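Written out with the course's notation (m training examples), that cost function is:

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2,
\qquad h_\theta(x) = \theta^T x
```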

Fitting Nonlinear Functions with Linear Regression

The instructor discusses how linear regression can be used to fit nonlinear functions by defining new features.

Key Points

  • Defining new features such as x^2 or square root of x can allow linear regression machinery to fit nonlinear functions.
  • Feature selection algorithms can help determine which features are best suited for fitting data.
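To make this concrete, here is a minimal sketch of fitting a quadratic with "linear regression machinery" by hand-defining the features 1, x, and x^2; the dataset is hypothetical and noise-free for clarity:

```python
import numpy as np

# Hypothetical noise-free data drawn from a quadratic curve
x = np.linspace(0, 2, 20)
y = 1.0 + 2.0 * x - 1.5 * x**2

# Design matrix with hand-defined features [1, x, x^2]; x_0 = 1 is the intercept
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta, 3))  # recovers approximately [1.0, 2.0, -1.5]
```

The model is still linear in theta; only the features are nonlinear in x, which is why the ordinary linear regression machinery applies unchanged.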

Introduction to Locally Weighted Linear Regression

The instructor introduces locally weighted linear regression as an alternative approach for fitting nonlinear functions.

Key Points

  • Locally weighted linear regression is a different way of addressing the problem of fitting data that isn't fit well by a straight line.
  • It allows for fitting more complex data sets than straight lines.
  • The instructor will cover the key ideas of locally weighted regression and let students play with some of the ideas in problem set 1.

Introduction to Locally Weighted Regression

In this section, the instructor introduces locally weighted regression as a non-parametric learning algorithm and distinguishes it from parametric learning algorithms.

Parametric vs Non-Parametric Learning Algorithms

  • Machine learning distinguishes between parametric and non-parametric learning algorithms.
  • A parametric learning algorithm fits a fixed, finite set of parameters to the data, while a non-parametric learning algorithm keeps an amount of data/parameters that grows (here, linearly) with the size of the training set.
  • With a parametric learning algorithm, you can erase the training set from your computer memory and make predictions using only the fitted parameters. However, with a non-parametric learning algorithm, you need to keep all of the data around in computer memory or on disk just to make predictions.

Locally Weighted Regression

  • Locally weighted regression is an example of a non-parametric learning algorithm that can fit data well without manually fiddling with features.
  • To make a prediction using locally weighted regression at a query point x, you look at the training examples in a local neighborhood of x and fit a straight line focusing on those nearby examples.
  • To formalize this, theta is fitted by minimizing a modified cost function in which each squared-error term is multiplied by a weight w^(i). The default choice is w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)), so that if x^(i) - x is small, the weight is close to 1.
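The procedure above can be sketched as a weighted least-squares fit. This is a minimal illustration assuming 1-D inputs and the Gaussian weight function; the function name and the noisy-sine dataset are hypothetical:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at x_query.

    X: (m,) array of 1-D inputs; y: (m,) targets; tau: bandwidth.
    """
    # Gaussian weights: near 1 for examples close to x_query, near 0 far away
    w = np.exp(-(X - x_query) ** 2 / (2 * tau**2))
    A = np.column_stack([np.ones_like(X), X])   # intercept term x_0 = 1
    W = np.diag(w)
    # Weighted normal equations: theta = (A^T W A)^{-1} A^T W y
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta @ np.array([1.0, x_query])

# Hypothetical noisy sine-shaped dataset
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 100))
y = np.sin(X) + 0.1 * rng.normal(size=100)
print(lwr_predict(3.0, X, y))   # close to sin(3.0) ~ 0.141
```

Note that a fresh local fit is solved for every query point, which is why all the training data must be kept around.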

Locally Weighted Regression

In this section, the instructor explains how locally weighted regression works and how it differs from linear regression.

How Locally Weighted Regression Works

  • If |x^(i) - x| is small, the training example is close to where you want to make a prediction, and w^(i) is close to 1. Conversely, if |x^(i) - x| is large, then w^(i) is close to 0.
  • In effect, if an example x^(i) is far from the query point, its error term is multiplied by a constant very close to 0, whereas if it is close, its error term is multiplied by roughly 1.
  • The net effect is that the sum effectively includes only the squared-error terms for the examples close to the value of x where you want to make a prediction.
  • When fitting theta to minimize this cost, you end up paying attention only to the points close to the query point, fitting a local line (the green line in the instructor's drawing).

Choosing Width of Gaussian Density

  • The choice of bandwidth tau has an effect on overfitting and underfitting.
  • If tau is too large, you end up over-smoothing the data; if tau is too small, you end up with a very jagged fit to the data.

Locally Weighted Linear Regression

In this section, the instructor discusses locally weighted linear regression and its applications.

Using Locally Weighted Linear Regression

  • Locally weighted linear regression can still be used to extrapolate outside the range of the dataset, but its results may not be very good.
  • All the formulas still work, and you can try it yourself in the problem set to see what happens.
  • There are quite complicated ways to choose tau based on how many points there are in the local region. There is a huge literature on different formulas for this algorithm.
  • Locally weighted linear regression is usually used when you have a relatively low-dimensional dataset with a lot of data and don't want to think about what features to use. It's a pretty good algorithm for datasets that look like those in the drawing shown by the instructor.

Probabilistic Interpretation of Linear Regression

In this section, the instructor provides a probabilistic interpretation of linear regression and explains why squared error is used.

Justification for Squared Error

  • The true price of every house is assumed to be y^(i) = theta^T x^(i) + epsilon^(i), where epsilon^(i) captures unmodeled effects and random noise. That is, every house's price is a linear function of the size of the house and the number of bedrooms, plus an error term covering unmodeled effects, such as the seller being in an unusually good or bad mood that day, which pushes the price higher or lower.
  • epsilon^(i) is distributed Gaussian with mean 0 and variance sigma squared.
  • Squared error falls out very naturally under this assumption that epsilon^(i) is Gaussian with mean 0 and variance sigma squared.

Introduction to Gaussian Density

In this section, the instructor introduces the concept of Gaussian density and its properties.

Gaussian Density

  • The error term epsilon^(i) is assumed to follow a normal distribution with mean 0 and variance sigma squared.
  • The probability density of epsilon^(i) is the Gaussian density p(epsilon^(i)) = 1/(sqrt(2 pi) sigma) * exp(-(epsilon^(i))^2 / (2 sigma^2)).
  • Sigma controls the width of the Gaussian curve.
  • The error terms are assumed to be independently and identically distributed (IID).
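The density above is easy to evaluate directly; a small sketch illustrating how sigma controls the height and width of the curve:

```python
import math

def gaussian_density(eps, sigma):
    """p(eps) = 1/(sqrt(2*pi)*sigma) * exp(-eps^2 / (2*sigma^2))."""
    return math.exp(-eps**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

# Peak height at eps = 0 is 1/(sqrt(2*pi)*sigma):
# smaller sigma gives a taller, narrower bell curve.
print(gaussian_density(0.0, 1.0))   # ~ 0.3989
print(gaussian_density(0.0, 0.5))   # ~ 0.7979
```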

Probability Density Function

In this section, the instructor explains how housing prices are determined using a probability density function.

Probability Density Function

  • Given x^(i) and theta, the probability of a particular house's price y^(i) is a Gaussian with mean theta^T x^(i) and variance sigma squared.
  • In this view, the price of a house is generated by computing theta^T x (the modeled part of the price) and then adding noise with variance sigma squared.
  • Theta in this view is not a random variable.

Introduction to Linear Regression

In this section, Andrew Ng introduces the concept of linear regression and explains how it can be derived from certain assumptions.

The Random Variable Y Given X and Parameterized by Theta

  • The random variable y, given x and parameterized by theta, is distributed as a Gaussian with mean theta^T x and variance sigma squared.
  • Linear regression falls out almost naturally of the assumptions made.
  • The likelihood of the parameters theta is equal to the probability of the data.

Probability of Data

  • Under the assumptions made, the probability of all values of y in your training set is equal to the product of probabilities.
  • This is because we assumed that errors are independently and identically distributed (IID).
  • Plugging in the definition of p(y|x;theta), we get a product of probabilities.

Likelihood vs. Probability

  • Likelihood refers to viewing L(theta|x,y) as a function of theta holding x,y fixed.
  • Probability refers to viewing L(theta|x,y) as a function of x,y holding theta fixed.
  • We use likelihood when viewing data as fixed and varying parameters, and probability when viewing parameters as fixed and varying data.

Gaussian Error Distribution

  • Many error distributions are approximately Gaussian, due to the central limit theorem from statistics.
  • If there are many little noise sources that are not too correlated, then by the central limit theorem their sum will be approximately Gaussian.

Maximum Likelihood Estimation

In this section, the speaker discusses maximum likelihood estimation and how it relates to least squares.

Log-Likelihood and Maximum Likelihood Estimation

  • The log-likelihood is the logarithm of the likelihood.
  • The log of a product is equal to the sum of the logs.
  • To simplify algebra, it's easier to maximize the log-likelihood instead of maximizing the likelihood itself.
  • Because the log is a strictly increasing function, choosing a value of theta that maximizes the log-likelihood is equivalent to choosing a value of theta that maximizes the likelihood, i.e., the probability of the data.
  • Maximum likelihood estimation (MLE) means choosing theta to maximize the likelihood.

Cost Function for MLE

  • Working through the algebra, maximizing the log-likelihood is equivalent to minimizing J(theta) = (1/2) sum over i of (y^(i) - theta^T x^(i))^2, which is exactly the least-squares cost from linear regression.
  • Sigma squared is just a constant, so it drops out of the maximization: under the Gaussian noise assumption, maximum likelihood estimation gives you least squares.
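The chain of reasoning behind this can be written out explicitly:

```latex
\ell(\theta) = \log L(\theta)
= \sum_{i=1}^{m} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
   \exp\!\left( -\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2} \right) \right]
= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
   - \frac{1}{\sigma^2} \cdot \underbrace{\frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2}_{J(\theta)}
```

The first term and the factor 1/sigma^2 do not depend on theta, so maximizing l(theta) is exactly the same as minimizing J(theta).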

When to Use MLE

  • If you're willing to assume that error terms are Gaussian and IID and want to use MLE, then you should use least squares.
  • If you know your training set isn't IID, there are more sophisticated models you could build but often we wouldn't bother.

Introduction to Classification and Logistic Regression

In this section, the speaker introduces classification problems and explains why linear regression is not a good algorithm for classification. The speaker then introduces logistic regression as a commonly used classification algorithm.

Binary Classification Problem

  • A binary classification problem has two possible values for Y: 0 or 1.
  • Linear regression is not a good algorithm for classification because it can produce bad fits to the data, especially when there are outliers.
  • Logistic regression is a commonly used algorithm for classification that outputs values between 0 and 1.

Sigmoid Function

  • The sigmoid function, also known as the logistic function, is used in logistic regression to output values between 0 and 1.
  • The sigmoid function equals 0.5 at z = 0, asymptotes towards 1 as z increases, and asymptotes towards 0 as z decreases.
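A minimal sketch of the sigmoid and the behavior just described:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); outputs always lie strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # exactly 0.5: g crosses the vertical axis at 0.5
print(sigmoid(6))    # ~ 0.9975, approaching 1 as z grows
print(sigmoid(-6))   # ~ 0.0025, approaching 0 as z decreases
```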

Maximum Likelihood Estimation

  • Maximum likelihood estimation is used in logistic regression to find the parameters theta that maximize the likelihood of observing the training data.
  • Gradient descent can be used to optimize the cost function in logistic regression.

Multiclass Classification

  • Multiclass classification involves predicting one of three or more possible values for Y.
  • One-vs-all (OvA), also known as one-vs-rest (OvR), is a technique for extending binary classifiers to multiclass problems.

Introduction to Logistic Regression

In this section, the instructor introduces logistic regression and explains why it is used in machine learning.

Logistic Function

  • The hypothesis function for logistic regression uses a sigmoid or logistic function to output values between 0 and 1.
  • The choice of the logistic function is based on its ability to represent a broader class of algorithms called generalized linear models.

Probability Distribution

  • The probability distribution of y given x parameterized by theta is assumed to have two possible outcomes: y = 0 or y = 1.
  • The chance of y being equal to 1 given the feature x parameterized by theta is equal to the output of the hypothesis function h(x).
  • The chance of y being equal to 0 is equal to 1 minus the chance of y being equal to 1.

Compressing Two Equations into One

  • There is a nifty way to take two equations that describe p(y|x;theta) when y can only be either 0 or 1 and compress them into one equation using h(x).
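The compressed form referred to here is:

```latex
p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y},
\qquad y \in \{0, 1\}
```

Substituting y = 1 gives h_theta(x), and y = 0 gives 1 - h_theta(x), recovering the two original equations.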

Introduction to Maximum Likelihood Estimation

In this section, the speaker introduces a notational trick that simplifies data derivations and explains how maximum likelihood estimation is used to find the value of theta that maximizes the likelihood of parameters.

Maximum Likelihood Estimation

  • The speaker explains how maximum likelihood estimation will be used again.
  • The likelihood of the parameters is the probability of the data: the product of p(y^(i)|x^(i);theta) over the training set.
  • To make the algebra simpler, we take the log of the likelihood and compute it.
  • Taking the log results in a simplified equation for the log-likelihood.
  • Theta is chosen to maximize l(theta), using an algorithm such as batch gradient ascent.
  • Once theta has been chosen, h_theta(x) can be used to estimate whether a new tumor is malignant or benign.

Gradient Ascent

  • Gradient ascent is used to climb up a concave function, like walking up a hill, rather than descending it.
  • Two differences from linear regression's gradient descent: you optimize the log-likelihood instead of the squared-error cost function, and you maximize rather than minimize.

Batch Gradient Ascent

  • The definition of h_theta(x) is plugged into the log-likelihood, and derivatives are taken with respect to theta using calculus and algebra.
  • Batch gradient ascent updates each theta_j by adding the learning rate alpha times the partial derivative of the log-likelihood with respect to theta_j: theta_j := theta_j + alpha * sum over i of (y^(i) - h_theta(x^(i))) * x_j^(i).
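The update above can be sketched in vectorized form; the toy dataset, step size, and iteration count below are hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.01, iters=1000):
    """Maximize the logistic log-likelihood by batch gradient ascent.

    Vectorized over all parameters at once:
        theta := theta + alpha * X^T (y - sigmoid(X theta))
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Hypothetical toy data: label is 1 when the single feature exceeds 0
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])   # intercept term x_0 = 1
y = (x > 0).astype(float)
theta = batch_gradient_ascent(X, y)
print(theta)  # positive slope: larger x gives higher predicted probability
```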

Logistic Regression and Newton's Method

In this section, the instructor discusses logistic regression and Newton's method. The lecture covers the properties of the log-likelihood function for logistic regression, why it is different from linear regression, and how it avoids local optima problems. The instructor also introduces Newton's method as an alternative optimization algorithm to gradient ascent.

Properties of Log-Likelihood Function for Logistic Regression

  • The log-likelihood function for logistic regression is always a concave function with only one global maximum.
  • Choosing the logistic function guarantees that the likelihood function has only one global maximum.
  • Although the update equation looks similar to linear regression's on the surface, the definition of h_theta(x) is different.

Comparison of Linear Regression and Logistic Regression

  • Linear regression uses h_theta(x) = theta^T x, while logistic regression uses h_theta(x) = g(theta^T x), where g is the sigmoid function.
  • Linear regression is not suitable for classification problems because a single example can cause issues.

Introduction to Newton's Method

  • Newton's method allows you to take much bigger jumps than gradient ascent but each iteration will be more expensive.
  • Newton's method solves a problem where you have some function f and want to find a theta such that f(theta)=0.
  • Later on in the lecture, we will use Newton's method to maximize L(theta).

Introduction to Newton's Method

In this section, the speaker introduces Newton's method and explains how it can be used to find the point where the derivative of a function is equal to zero.

How Newton's Method Works

  • To explain Newton's method, the speaker first considers a simpler problem: you have a function f and want to find the value of theta where f(theta) = 0.
  • One iteration of Newton's method, illustrated with a picture: find the line tangent to f at the current theta, solve for where that tangent line crosses the horizontal axis, and move from the current value of theta to that point.
  • Repeating this process several times gets you very close to the point where f(theta) = 0.
  • The math behind one iteration: solving for the step delta with calculus gives delta = f(theta_t) / f'(theta_t), so the update rule is theta_{t+1} = theta_t - f(theta_t) / f'(theta_t).
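The iteration rule translates directly to code; the root-finding example (the root of theta^2 - 2, i.e. sqrt(2)) is an illustrative choice:

```python
def newton(f, f_prime, theta0, iters=10):
    """Scalar Newton's method: theta_{t+1} = theta_t - f(theta_t)/f'(theta_t)."""
    theta = theta0
    for _ in range(iters):
        theta -= f(theta) / f_prime(theta)
    return theta

# Illustrative example: find the root of f(theta) = theta^2 - 2
root = newton(lambda t: t**2 - 2, lambda t: 2 * t, theta0=1.0)
print(root)  # ~ 1.41421356..., reached in just a few iterations
```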

Applying Newton's Method in Logistic Regression

  • To apply Newton's method in logistic regression, we set f equal to l'(theta), since we want to find the place where the first derivative of the log-likelihood l is 0.

Newton's Method

This section covers the properties of Newton's method and its advantages and disadvantages.

Quadratic Convergence

  • Newton's method enjoys a property called quadratic convergence.
  • If on one iteration Newton's method has error 0.01, then after the next iteration the error can drop to around 0.0001, and after the one after that to around 0.00000001.
  • Under certain assumptions (that the function is not too far from quadratic near the optimum), the number of significant digits to which you have converged roughly doubles on each iteration.

Hessian Matrix

  • When theta is a vector, the update rule generalizes to theta^(t+1) = theta^(t) - H^{-1} grad_theta l(theta^(t)), where H is the Hessian matrix.
  • The Hessian matrix is the matrix of second partial derivatives of the log-likelihood.
  • In high-dimensional problems, each step of Newton's method is much more expensive, because you are either solving a linear system of equations or inverting a pretty big matrix.
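To make the vector update concrete, here is a minimal sketch of Newton's method applied to the logistic log-likelihood, using its standard gradient and Hessian; the function name and the small toy dataset are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton's method for the logistic log-likelihood (a sketch).

    grad l = X^T (y - h) and the Hessian is H = -X^T S X
    with S = diag(h * (1 - h)), so the update
    theta := theta - H^{-1} grad l becomes
    theta := theta + (X^T S X)^{-1} X^T (y - h).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        S = np.diag(h * (1 - h))
        theta += np.linalg.solve(X.T @ S @ X, X.T @ (y - h))
    return theta

# Small hypothetical dataset (not linearly separable, so the MLE is finite)
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
theta = newton_logistic(X, y)
print(theta)  # positive slope, found in just a few Newton steps
```

Each step solves a linear system whose size is the number of parameters, which is exactly the per-iteration cost the lecture warns about in high dimensions.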

Advantages and Disadvantages

  • If you have around 10 or 50 parameters, use Newton's method, because you will probably get convergence in roughly 10 iterations or fewer.
  • If you have a very large number of parameters, like 10,000 or more, use gradient descent (or ascent) instead, because each Newton step requires working with a Hessian on the order of 10,000 by 10,000, which is very expensive to solve or invert in high-dimensional problems.
  • On Wednesday there will be a lecture on generalized linear models.
Video description

For more information about Stanford's Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai

Andrew Ng, Adjunct Professor of Computer Science: https://www.andrewng.org/

To follow along with the course schedule and syllabus, visit: http://cs229.stanford.edu/syllabus-autumn2018.html

An outline of this lecture includes: Linear Regression Recap, Locally Weighted Regression, Probabilistic Interpretation, Logistic Regression, Newton's method

00:00 Introduction - recap discussion on supervised learning
05:38 Locally weighted regression
05:53 Parametric learning algorithms and non-parametric learning algorithms
21:32 Probabilistic Interpretation
46:18 Logistic Regression
1:05:57 Newton's method

#aicourse #andrewng