Machine Learning Lecture 8 "Estimating Probabilities from Data: Naive Bayes" -Cornell CS4780 SP17

Understanding Machine Learning Distributions and Estimation

Introduction to Data Distribution

  • The discussion begins with a reference to a dataset D drawn from an unknown distribution P(X, Y), emphasizing the challenge of working with this elusive distribution.
  • An example is provided using images of faces, illustrating how the probability of capturing a specific face can be seen as part of this unknown distribution.

Approximating Unknown Distributions

  • The speaker suggests that if the unknown distribution isn't overly complex, it might be approximated by defining a more manageable distribution with parameters theta.
  • Once an approximation is established, one could theoretically use the Bayes optimal classifier for perfect predictions if access to the true distribution were possible.

Estimating Parameters

  • The goal is to estimate parameters theta such that our understood distribution closely matches the elusive one based on available training data.
  • Two methods for estimating these parameters are introduced: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP).

Maximum Likelihood Estimation (MLE)

  • MLE focuses on finding parameters theta that maximize the likelihood of observing the given data.
  • This method assumes that distributions derived from both true and estimated models will yield similar results when sampling data.
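In symbols (using notation not spelled out verbatim in the summary, and assuming i.i.d. draws for the product form), the MLE objective is:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; P(D \mid \theta)
= \arg\max_{\theta}\; \prod_{i=1}^{n} P(x_i, y_i \mid \theta)
```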

Maximum A Posteriori (MAP)

  • MAP estimation flips the perspective by maximizing the probability of theta given the observed data.
  • It treats theta as a random variable rather than just a parameter, introducing Bayesian concepts into parameter estimation.
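Written the same way, the MAP objective maximizes the posterior; by Bayes' rule the constant P(D) can be dropped from the maximization:

```latex
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; P(\theta \mid D)
= \arg\max_{\theta}\; P(D \mid \theta)\, P(\theta)
```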

Bayesian Approach to Parameter Estimation

  • In Bayesian statistics, prior distributions must be established for P(theta), allowing for fitting based on observed data.
  • The ultimate aim in estimating theta is prediction; once the parameters are fitted, we can determine probabilities like P(Y|X).

Fully Bayesian Predictions

  • The fully Bayesian approach integrates over all possible models to predict outcomes by averaging them weighted by their posterior probabilities.
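The averaging over models described above can be written as an integral over the posterior (a standard formulation, not quoted verbatim from the lecture):

```latex
P(Y \mid X, D) = \int_{\theta} P(Y \mid X, \theta)\, P(\theta \mid D)\, d\theta
```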

Understanding Maximum Likelihood Estimation and MAP

Introduction to Parameter Estimation

  • The discussion begins with the concept of estimating parameters from data, emphasizing the importance of maximum likelihood estimation (MLE) and its connection to Bayesian methods.
  • A quiz is mentioned, indicating a focus on understanding distributions in relation to parameter estimation.

Coin Toss Example

  • The speaker introduces a simple case involving a coin toss to illustrate MLE. The goal is to estimate the probability of getting heads based on observed data.
  • In this example, after tossing a coin ten times with three heads and seven tails, the challenge is to determine the probability of heads using MLE.

Deriving Probabilities

  • MLE results in calculating the probability of heads as the ratio of heads observed over total tosses. This straightforward calculation provides an initial estimate.
  • When applying Maximum A Posteriori (MAP), additional terms are included that account for prior beliefs about outcomes, such as adding "imaginary" tosses.
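The two estimates from the coin-toss example can be sketched in a few lines. This is an illustrative sketch, not code from the lecture; the "imaginary tosses" parameter m (m hallucinated heads plus m hallucinated tails) encodes the prior belief in a roughly fair coin.

```python
def mle(heads, tails):
    # MLE: fraction of observed heads among all tosses
    return heads / (heads + tails)

def map_estimate(heads, tails, m=1):
    # MAP with m hallucinated heads and m hallucinated tails,
    # pulling the estimate toward 0.5 when data is scarce
    return (heads + m) / (heads + tails + 2 * m)

p_mle = mle(3, 7)            # 3/10 = 0.3
p_map = map_estimate(3, 7)   # (3+1)/(10+2) = 1/3
```

With only ten tosses the single hallucinated pair already shifts the estimate noticeably; with thousands of tosses its influence vanishes.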

Prior Beliefs in Bayesian Analysis

  • The speaker explains how MAP incorporates prior beliefs about coin biases into calculations. Even if all observed results are heads, prior experience suggests skepticism towards extreme probabilities.
  • A distribution over theta (the parameter representing bias in coin tossing) is established before observing any data, shaping future estimates based on new information.

Visualization and Results

  • As more data is collected (e.g., 10,000 tosses), both MLE and MAP converge towards the true distribution (70% chance for heads).
  • After extensive trials, estimates stabilize around 0.7 for heads' probability due to sufficient sampling size reinforcing initial assumptions.
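The convergence claim can be checked with a quick simulation (an illustrative sketch, assuming a coin with true P(heads) = 0.7): after 10,000 tosses, the MLE and MAP estimates nearly coincide and both land close to 0.7.

```python
import random

random.seed(0)  # fixed seed for reproducibility
n = 10_000
# Simulate n tosses of a biased coin with P(heads) = 0.7
heads = sum(random.random() < 0.7 for _ in range(n))

m = 1  # one hallucinated head and one hallucinated tail
p_mle = heads / n
p_map = (heads + m) / (n + 2 * m)
# Both estimates are close to 0.7, and the prior's influence is negligible
```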

Impact of Prior Distributions

  • The speaker discusses how different priors affect MAP estimates; starting with uninformed priors leads to uniform distributions while informed priors can skew results toward specific outcomes.
  • An example illustrates how hallucinating one head and one tail alters belief distributions without affecting MLE but significantly impacts MAP estimations.

Conclusion: Understanding Graphical Representations

  • The left graph represents a Probability Density Function (PDF), showcasing how varying prior beliefs influence estimated probabilities through visual representation.

Understanding Probability Density Functions and Priors

The Concept of Priors in Bayesian Statistics

  • The discussion begins with the probability density function, emphasizing that smoothing is related to Maximum A Posteriori (MAP) estimation.
  • A prior peaked at 0.5 (the effect of hallucinating one head and one tail) indicates a belief that the coin is roughly fair, suggesting that extreme outcomes (0 or 1) are unlikely.
  • An uninformed prior results in a flatline distribution, indicating no bias or specific belief about the outcome. This contrasts with more informed priors which become sharper with more data.

Impact of Data on Prior Beliefs

  • As more coin tosses are observed, the MAP estimate adjusts from an initial belief of 0.5 towards reality as data accumulates; however, early estimates can be misleading if data is limited.
  • With insufficient data, hallucinated tosses dominate the estimation process, leading to inaccurate conclusions about probabilities.

Confidence Levels and Sample Size

  • When observing many samples (e.g., 100), confidence in the estimated probability increases significantly; however, this can lead to overconfidence if prior beliefs are incorrect.
  • The speaker humorously compares overconfident estimates to "confident teenagers," highlighting how they may not adjust their beliefs without substantial evidence.

Examples of Prior Influence on Estimates

  • MAP estimates perform well when priors align closely with true probabilities but can fail dramatically if initial assumptions are wrong.
  • In cases where the prior suggests a low probability for heads despite actual outcomes being favorable for heads (e.g., two heads and one tail), it takes considerable data for estimates to correct themselves.

Convergence Challenges with Incorrect Priors

  • If the prior is accurate (e.g., 7:3), MAP quickly converges to correct values; however, significant divergence occurs when starting from an incorrect assumption (e.g., believing there’s only a 20% chance of heads).
  • Even after extensive sampling (like 1,000 trials), poor priors can lead to persistent inaccuracies in estimation compared to the plain MLE approach.

Discussion on Bayesian vs Frequentist Approaches

  • The conversation touches upon ongoing debates within statistics regarding Bayesian methods versus frequentist approaches and whether human intuition aligns more closely with either methodology.

Understanding Bayesian Statistics and Machine Learning Concepts

Introduction to Bayesian Tendencies in Psychology

  • The discussion begins with how psychologists study whether humans exhibit Bayesian tendencies, suggesting that we often rely on prior beliefs.
  • This reliance on prior beliefs is linked to our prejudices, indicating that personal experiences can lead us to jump to conclusions.

Transitioning to Machine Learning

  • The speaker shifts focus to machine learning, proposing a more complex scenario involving joint distributions of variables Y and X.
  • Participants are encouraged to determine the probability P(Y|X), emphasizing the need for collaboration in estimating parameters θ for this distribution.

Maximum Likelihood Estimation (MLE)

  • A reminder is given about calculating P(X=x), where the probability is derived from counting occurrences of X in the dataset divided by total instances.
  • Students are tasked with finding P(Y|X=x), which involves conditioning on specific values of X while discussing similarities with previous coin-tossing examples.

Conditional Probability Insights

  • A volunteer explains that conditioning Y on X=x resembles earlier discussions about heads in coin tossing but focuses only on cases where X equals a specific value.
  • The concept of conditional probabilities is further clarified using set notation, illustrating how subsets are formed based on conditions applied to both variables.

Practical Application and Challenges

  • The definition of conditional probability is reiterated: P(Y=y|X=x)=P(Y=y,X=x)/P(X=x), leading into calculations similar to previous examples.
  • The process involves summing over data points meeting both conditions (X=x and Y=y), reinforcing understanding through practical application.
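The counting procedure above can be sketched directly from the definition P(Y=y|X=x) = P(Y=y, X=x) / P(X=x). The toy dataset and names below are illustrative, not from the lecture.

```python
# Toy dataset of (x, y) pairs
data = [("sunny", 1), ("sunny", 1), ("sunny", 0), ("rainy", 0)]

def p_y_given_x(data, x, y):
    # Filter to the data points where X = x (the conditioning step)
    matching = [yi for xi, yi in data if xi == x]
    if not matching:
        return None  # no data points with X = x: the estimate is undefined
    # Fraction of those points that also have Y = y
    return sum(yi == y for yi in matching) / len(matching)

p = p_y_given_x(data, "sunny", 1)  # 2 of 3 "sunny" points have y = 1
```

The `None` branch is exactly the failure mode discussed next: in high dimensions, almost every query x matches an empty set.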

High-Dimensional Data Considerations

  • When estimating P(Y|X=x), it’s noted that one must filter out all data points not equal to x, which becomes increasingly challenging as dimensionality increases.

Understanding Naive Bayes and Its Assumptions

The Challenge of Face Recognition Systems

  • In face recognition systems, determining whether an image depicts a student or a professor requires exact pixel matching with previously seen images. This approach is impractical due to the vast number of possible variations.

Limitations of Traditional Algorithms

  • Traditional algorithms struggle because they often operate on an empty set, making it difficult to estimate probabilities accurately.
  • Estimating P(X | Y) (the probability of features given a label) is complex since unique combinations of features are rarely encountered in practice.

Introduction to Naive Bayes

  • Naive Bayes simplifies the problem by applying Bayes' theorem, flipping the estimation from P(Y | X) to P(X | Y).
  • The prior probability P(Y) can be easily estimated; for example, the prior probability that an email is spam can be derived from the fraction of spam in one's personal inbox.

Key Assumption: Independence of Features

  • Naive Bayes assumes that features are independent given the label. This means that knowing one feature does not provide information about another when the label is known.
  • Mathematically, this translates to P(X | Y) = P(X_1 | Y) · P(X_2 | Y) · … · P(X_D | Y), where each feature contributes independently to the overall probability.
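The factorization can be sketched as a simple product of per-feature conditionals. The word probabilities below are made-up illustrative values, not estimates from any real dataset.

```python
from math import prod

# Hypothetical per-word conditionals P(word appears | spam)
p_word_given_spam = {"free": 0.8, "money": 0.6, "meeting": 0.05}

def p_x_given_spam(words):
    # Naive Bayes independence assumption: multiply per-feature terms
    return prod(p_word_given_spam[w] for w in words)

p = p_x_given_spam(["free", "money"])  # 0.8 * 0.6 = 0.48
```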

Practical Implications and Trade-offs

  • For instance, in spam detection, words in an email are treated as independent variables contributing to its classification as spam or not.

Understanding Independence in Machine Learning

The Concept of Independence

  • Discussion on scenarios where variables are independent, using sentiment analysis as an example. It highlights the complexity of determining independence in language processing.
  • Sentiment analysis is described as a billion-dollar industry where algorithms quickly assess public opinion on products like the iPhone watch to inform stock trading decisions.

Correlation and Language Processing

  • The speaker argues that even if sentiment is classified as positive or negative, terms used (like "Apple") may still be correlated with other terms (like "watch"), indicating a lack of true independence.
  • An analogy is made with pixels in images, explaining that adjacent pixels often share similar colors, demonstrating high correlation rather than independence.

Graphical Models and Causality

  • Introduction to graphical models, emphasizing how features can be dependent on labels. For instance, symptoms caused by a disease illustrate this dependency.
  • If one knows whether a person has a disease (the label), then symptoms (features) become independent of each other since they are all caused by the same underlying condition.

Assumptions in Estimation

  • The discussion shifts to assumptions made during estimation processes in machine learning, particularly regarding independence and identical distribution (IID).
  • Acknowledgment that while IID assumptions simplify calculations, more complex distributions can also be considered for better accuracy.

Bayes Classifier Overview

  • Explanation of the Bayes classifier: it outputs the most likely label based on input data. This involves maximizing probabilities given certain conditions.
  • The process includes applying Bayes' rule to derive necessary probabilities without needing to compute constants that do not affect outcomes directly.

Naive Bayes Assumption

  • Introduction of the naive Bayes assumption which allows for decomposing probability into manageable parts across multiple dimensions.
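Putting the pieces together, the classifier picks argmax_y P(y) · ∏_d P(x_d | y). A minimal sketch under assumed, made-up probabilities (log-probabilities are used to avoid numerical underflow when many per-feature terms are multiplied):

```python
from math import log

# Hypothetical priors P(y) and per-word conditionals P(word | y)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.7},
}

def classify(words):
    # Return the label maximizing log P(y) + sum_d log P(x_d | y);
    # the constant P(x) is dropped since it does not affect the argmax
    def log_score(y):
        return log(priors[y]) + sum(log(likelihoods[y][w]) for w in words)
    return max(priors, key=log_score)

label = classify(["free"])      # "spam" wins: 0.4*0.8 > 0.6*0.1
label2 = classify(["meeting"])  # "ham" wins: 0.6*0.7 > 0.4*0.1
```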

Video description

Cornell class CS4780. (Online version: https://tinyurl.com/eCornellML ) Lecture Notes: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote04.html