Lecture 7 "Estimating Probabilities from Data: Maximum Likelihood Estimation" -Cornell CS4780 SP17

Machine Learning Lecture Overview

Introduction and Course Logistics

  • The lecture begins with a light-hearted remark about the proximity of fall break, emphasizing the importance of focus during class.
  • Project one is due today, but students have two additional days to improve their leaderboard positions.
  • Issues regarding memory limitations on Julia have been resolved; memory has been quadrupled for grading purposes.
  • Previous problems with package installations delaying grading have also been fixed, ensuring smoother project evaluations.

Perceptron Review

  • A brief recap of the perceptron algorithm is provided, highlighting its assumption that data is linearly separable.
  • The proof discussed involves the norm of weight vector W and its relationship to updates made during training.
  • The conclusion drawn from the proof indicates that there’s a limit to how many updates can be made before reaching a contradiction.

Introduction to K Nearest Neighbor Algorithm

  • Transitioning from perceptrons, the lecture introduces the K nearest neighbor (KNN) algorithm, which assumes that points close together share similar labels.
  • Discussion on data distribution P(X,Y), emphasizing that optimal classification relies on understanding this distribution.

Bayes Optimal Classifier

  • If access to distribution P(X,Y) were available, one could use it for optimal classification via Bayes' theorem.
  • The concept of estimating distributions from data is introduced as a fundamental aspect of machine learning.

Discriminative vs. Generative Learning

  • Two approaches in machine learning are outlined: discriminative learning (estimating P(Y|X)) and generative learning (estimating P(X,Y)).

Discriminative Learning and Probability Estimation

Understanding Discriminative Learning

  • The speaker discusses the current trend in machine learning, emphasizing that most approaches are based on discriminative learning, which focuses on estimating functions from data.
  • It is noted that one can predict features given a label, allowing for insights into which features correspond to specific labels.

Importance of Probability in Predictions

  • The main goal is to estimate the distribution from data; once this distribution is known, making predictions becomes straightforward.
  • In applications like self-driving cars, understanding probabilities is crucial. For instance, distinguishing between a squirrel and a pedestrian can significantly impact decision-making.
  • Integrating information from various sensors (e.g., vision and radar) using probabilities helps make informed decisions when faced with conflicting data.

Applications of Probability in Speech Recognition

  • The speaker highlights the relevance of probability in speech recognition systems like Siri, where two algorithms work together: one predicts phonemes based on sounds heard, while another considers context to determine likely words.
  • Combining these signals, the acoustic evidence for what was just heard with the linguistic context of what came before, yields more accurate interpretations of spoken language.

Introduction to Maximum Likelihood Estimation (MLE)

  • The discussion transitions to how to estimate probabilities from data, introducing concepts such as maximum likelihood estimation (MLE).
  • A brief survey reveals that few participants are familiar with MLE or maximum a posteriori (MAP) estimation methods.

Simplifying Probability Distribution Concepts

  • The speaker presents a simple scenario involving coin flips to illustrate probability distributions without additional variables.
  • An example involves flipping a coin multiple times and recording outcomes (heads/tails), leading to questions about estimating the probability of heads or tails based on observed results.

Deriving Probabilities through Observations

  • After observing several flips resulting in heads and tails, participants are prompted for estimates regarding the likelihood of getting heads or tails based on their observations.
  • The method discussed involves calculating the ratio of heads observed over total flips as an approximation for true probability.
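The ratio estimate described above can be sketched in a few lines of Python (the function name and data encoding are illustrative, not from the lecture):

```python
# Estimate P(heads) as the fraction of heads among the observed flips.
def estimate_heads(flips):
    """flips: list of 'H'/'T' outcomes; returns the empirical estimate of P(heads)."""
    n_heads = sum(1 for f in flips if f == "H")
    return n_heads / len(flips)

print(estimate_heads(["H", "T", "H", "H"]))  # 0.75
```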

Maximum Likelihood Estimation and Bayesian Statistics

Understanding Parameters and Data

  • The discussion begins with the importance of parameters in statistical models, specifically focusing on the probability of observing heads in a binary outcome scenario.
  • The speaker introduces the concept of maximum likelihood estimation (MLE), explaining that for each parameter value (theta), a different probability of observing the collected data can be calculated.
  • MLE aims to find the parameter value that maximizes the likelihood of observing the given data, expressed mathematically as θ̂ = argmax_θ P(D | θ).

Binomial Distribution and Logarithmic Transformation

  • Clarification is made between "max" which finds maximum values and "Arg max" which identifies parameter values yielding those maxima.
  • The speaker explains that when dealing with independent events (like coin flips), a binomial distribution applies, leading to a specific formula for calculating probabilities.
  • To simplify calculations, maximizing log-likelihood is preferred since logarithmic functions are monotonically increasing, making it easier to work with.

Numerical Stability in Computation

  • Taking logarithms helps avoid issues with very small probabilities that computers struggle to represent accurately; instead of multiplying small numbers, logs allow addition which maintains precision.
  • The log-likelihood function derived includes terms involving counts of heads (NH) and tails (NT), facilitating further analysis.
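A minimal sketch of that log-likelihood (the binomial coefficient is dropped because it does not depend on theta; names are illustrative):

```python
import math

# Log-likelihood of coin-flip data under parameter theta:
# n_h * log(theta) + n_t * log(1 - theta), with the constant binomial term dropped.
def log_likelihood(theta, n_h, n_t):
    return n_h * math.log(theta) + n_t * math.log(1 - theta)
```

For 6 heads and 4 tails, this function is larger at theta = 0.6 than at neighboring values, consistent with the MLE solution derived below.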

Deriving Maximum Likelihood Estimates

  • To find the maximum likelihood estimate for theta, one must take derivatives and set them to zero. This leads to an equation where solving gives θ = NH / (NH + NT).
  • A reminder is given about checking whether this solution indeed represents a maximum rather than a minimum.
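Written out, the derivation is the standard calculation:

```latex
\log P(D \mid \theta) = n_H \log\theta + n_T \log(1-\theta) + \text{const}

\frac{\partial}{\partial\theta} \log P(D \mid \theta)
  = \frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\theta = \frac{n_H}{n_H + n_T}
```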

Bayesian vs Frequentist Statistics

  • Transitioning into Bayesian statistics, the speaker prompts audience engagement regarding their familiarity with both Bayesian and frequentist approaches.
  • A question arises about potential pitfalls in MLE; extreme cases like having very little data can lead to misleading conclusions due to over-reliance on limited observations.

Understanding Bayesian Smoothing Techniques

Introduction to Smoothing in Probability Estimation

  • The speaker discusses the limitations of probability estimation with small sample sizes (N), suggesting a workaround: pretend to have observed m additional heads and m additional tails before seeing the real data.
  • By adding hypothetical tosses, the speaker illustrates how one can adjust initial probabilities, using an example where they assume five heads and five tails to maintain a close estimate to 0.5.
  • This method is described as "cheating," where the speaker prefers trusting their prior belief over limited data, effectively hallucinating additional samples from their assumed distribution.
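The hallucinated-samples trick above can be sketched as follows (m = 5 matches the speaker's five heads and five tails; the function name is illustrative):

```python
# Smoothed estimate: pretend m extra heads and m extra tails were observed.
def smoothed_estimate(n_h, n_t, m=5):
    return (n_h + m) / (n_h + n_t + 2 * m)

print(smoothed_estimate(2, 0))  # 7/12, instead of the raw MLE of 1.0
```

With two real heads and no tails, the raw ratio would be 1.0; the smoothed estimate stays close to 0.5, reflecting the prior belief.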

The Concept of Smoothing

  • The technique being discussed is known as smoothing, which helps mitigate issues arising from zero probabilities in statistical estimates.
  • Frequentists have used this approach for years; it prevents estimates of exactly zero, which would otherwise make downstream products of probabilities collapse to zero or become undefined.
  • The speaker emphasizes that while smoothing can introduce bias, it is often necessary for practical applications, such as homework assignments related to project number three.

Historical Context and Applications

  • The concept was originally developed by Laplace centuries ago and remains relevant today in various fields including spam filtering.
  • In spam filtering, smoothing allows statisticians to account for rare words by assuming they have been observed at least once in both spam and non-spam contexts.
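A hedged sketch of that add-one idea in a spam-filter setting (the function and its parameters are hypothetical, not from the lecture):

```python
# Laplace (add-one) smoothing for a word's probability in spam.
# Pretend every vocabulary word was seen once, so no estimate is exactly zero.
def word_prob(count_in_spam, total_spam_words, vocab_size):
    return (count_in_spam + 1) / (total_spam_words + vocab_size)
```

Even a word never seen in spam gets a small nonzero probability, so multiplying word probabilities never zeroes out the whole message score.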

Distinction Between Bayesian and Frequentist Approaches

  • The discussion transitions into comparing Bayesian statistics with frequentist methods, highlighting that maximum likelihood estimation (MLE) represents frequentist statistics while maximum a-posteriori (MAP) approximation aligns with Bayesian approaches.
  • A significant divide exists between these two schools of thought within statistics; disagreements can be intense enough to lead to conflicts among practitioners.

Key Differences in Statistical Interpretation

  • The speaker outlines how frequentists focus on estimating the probability of data given parameters (theta), whereas Bayesians treat theta as a random variable conditioned on data.

Bayesian vs. Frequentist Statistics

Understanding Bayesian and Frequentist Perspectives

  • The concept of a random variable in Bayesian statistics allows parameters like theta to be treated as random variables, which contrasts with the frequentist view that sees theta merely as a parameter without an associated probabilistic event.
  • Frequentists argue that since there is no event linked to theta, it cannot be considered a random variable; they maintain that this distinction is fundamental and are dismissive of Bayesian interpretations.
  • In Bayesian statistics, the distribution over theta represents beliefs about its value, even if such distributions may seem nonsensical from a frequentist perspective due to the lack of events.
  • A key advantage of Bayesian methods is their ability to incorporate prior beliefs into estimates, especially when data is scarce. This contrasts with maximum likelihood estimation (MLE), which can yield poor estimates under limited data conditions.
  • Bayesians argue against extreme values derived from data alone by incorporating prior beliefs into their estimations, thus allowing for more reasonable conclusions based on subjective experience or intuition.

Differences in Statistical Approaches

  • While both Bayesian and frequentist approaches aim at estimating parameters, they differ fundamentally in how they treat uncertainty and incorporate prior knowledge into statistical models.
  • It's important to note that Bayesian statistics does not solely rely on Bayes' rule; rather, it encompasses broader methodologies beyond just applying this mathematical principle.
  • The likelihood function P(D | theta) , representing the probability of observing data given parameters, plays a crucial role in both MLE and MAP (Maximum A Posteriori estimation).
  • MAP estimation flips the question from "which parameter maximizes the probability of our data" to "given our data, what is the most likely parameter," highlighting different perspectives within statistical inference.

Practical Applications of Distributions

  • The relationship between MLE and MAP illustrates how both methods utilize probabilities but approach them differently: MLE focuses on maximizing likelihood while MAP incorporates priors into its calculations.
  • Bayes' rule connects posterior probabilities with prior distributions and likelihood functions; however, it's essential to recognize that all statisticians can apply Bayes' rule regardless of their methodological stance.
  • In the coin-toss scenario, the likelihood P(D | theta) follows a binomial distribution, since independent trials yield heads or tails outcomes; the question is what distribution to place on theta itself.
  • Beliefs about theta's value often lead practitioners to consider distributions like beta distributions for modeling probabilities effectively without venturing into negative values.

Understanding the Beta Distribution and Bayesian Inference

The Basics of the Beta Distribution

  • The beta distribution is a special case that provides values between 0 and 1, making it a well-defined probability distribution dependent on parameters alpha and beta.
  • The shape of the beta distribution can vary; it may be skewed towards low or high values or centered around a specific point, which is often desired in practical applications.
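The beta density can be evaluated with the standard library alone (a sketch; the gamma-function normalization is the textbook form, and the names are illustrative):

```python
import math

# Beta density on [0, 1]; parameters a (alpha) and b (beta) control the shape.
def beta_pdf(theta, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)
```

For a = b = 1 the density is uniform; larger equal parameters concentrate mass around 0.5, matching the shapes described above.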

Estimating Theta with Prior Beliefs

  • When estimating theta from data, we assume theta follows a beta distribution based on prior beliefs. This allows us to model our uncertainty about theta effectively.
  • According to Bayes' rule, the most likely value of theta given our data is proportional to the likelihood of observing the data given theta multiplied by the prior probability of theta.

Maximizing Likelihood for Parameter Estimation

  • To maximize this expression, we combine terms related to observed data (NH and NT), leading us to an equation that simplifies our estimation process.
  • By collecting terms together, we derive an expression for maximizing theta that resembles previous calculations but incorporates adjustments based on alpha and beta parameters.

Smoothing Techniques in Bayesian Statistics

  • The resulting formula, θ̂ = (NH + α − 1) / (NH + NT + α + β − 2), has a smoothing interpretation: α − 1 acts like hallucinated extra heads and β − 1 like hallucinated extra tails. This highlights similarities between frequentist smoothing and Bayesian priors that have emerged over time.
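The MAP estimate under a Beta(alpha, beta) prior can be sketched directly from that formula (names are illustrative):

```python
# MAP estimate of theta under a Beta(alpha, beta) prior:
# theta_hat = (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)
def map_estimate(n_h, n_t, alpha, beta):
    return (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)

print(map_estimate(2, 0, 6, 6))  # pulls the raw MLE of 1.0 toward 0.5
```

With alpha = beta = 1 (a uniform prior) this reduces to the MLE NH / (NH + NT), and alpha = beta = 6 reproduces the earlier "five hallucinated heads and tails" smoothing.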

Distinguishing Between Bayesian and Frequentist Approaches

  • It's crucial to understand that both Maximum A Posteriori (MAP) estimation (Bayesian approach) and Maximum Likelihood Estimation (MLE) (Frequentist approach) aim at estimating parameters but differ fundamentally in their methodologies.

Flexibility in Bayesian Methods

  • The Bayesian framework treats theta as a random variable, providing flexibility in modeling. This contrasts with traditional methods where maximization is necessary for parameter estimation.

Integrating Over Models for Predictions

  • A true Bayesian approach involves calculating probabilities across all possible models rather than relying solely on maximizing estimates. This leads to predictions that are independent of any single model assumption.
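For the coin with a Beta(alpha, beta) prior, integrating over all theta has a closed form, the posterior predictive (a standard result, not derived verbatim in this summary; names are illustrative):

```python
# Posterior predictive P(next flip = heads | data) under a Beta(alpha, beta) prior:
# integrating the beta posterior over theta gives (n_h + alpha) / (n_h + n_t + alpha + beta).
def predict_heads(n_h, n_t, alpha=1, beta=1):
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

print(predict_heads(0, 0))  # 0.5 before any data is seen
```

Note that no single theta is ever selected: the prediction averages over every model weighted by its posterior probability.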
Video description

  • Cornell class CS4780. (Online version: https://tinyurl.com/eCornellML )
  • Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote04.html
  • Past 4780 exams: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
  • Past 4780 homeworks: https://www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0
  • If you want to take the course for credit and obtain an official certificate, there is now a revamped version (with much higher quality videos) offered through eCornell ( https://tinyurl.com/eCornellML ). Note, however, that eCornell does charge tuition for this version.