Machine Learning Lecture 10 "Naive Bayes continued" -Cornell CS4780 SP17

Lecture on Machine Learning: Naive Bayes Overview

Introduction and Logistics

  • The lecture begins with a reminder to students to close their devices, emphasizing focus on the topic.
  • An update is provided regarding homework three, which has faced delays due to issues with auto-grading. Testing is ongoing before rollout.
  • A new teaching assistant has joined Piazza, enhancing support for student inquiries.

Understanding Naive Bayes

  • Naive Bayes is introduced as a generative algorithm that estimates the distribution of data points based on labels (positive or negative).
  • The algorithm simplifies high-dimensional data by assuming independence among dimensions, allowing for easier modeling of distributions.

Modeling Distributions

  • The process involves projecting data onto each axis individually and modeling these one-dimensional distributions.
  • Predictions are made using Bayes' theorem after estimating the individual feature distributions.

Prediction Process

  • Once the model is built, predictions rely solely on the estimated distributions rather than raw data points.
  • For categorical features, probabilities are calculated based on frequency counts of features associated with each label.
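The counting procedure above can be sketched directly. This is a minimal toy example (the feature names, values, and labels are hypothetical, not from the lecture): the prior P(y) and each conditional P(x_α = v | y) come from frequency counts, and Bayes' rule combines them under the naive independence assumption.

```python
from collections import Counter

# Hypothetical toy data: each point is a dict of categorical features plus a label.
data = [
    ({"color": "red", "shape": "round"}, "+"),
    ({"color": "red", "shape": "square"}, "+"),
    ({"color": "blue", "shape": "round"}, "-"),
    ({"color": "blue", "shape": "square"}, "-"),
]

# Prior P(y): fraction of points carrying each label.
label_counts = Counter(label for _, label in data)
prior = {y: c / len(data) for y, c in label_counts.items()}

# Conditional P(x_alpha = v | y): frequency of each feature value within a label.
cond = Counter()
for features, label in data:
    for name, value in features.items():
        cond[(name, value, label)] += 1

def p_value_given_label(name, value, label):
    return cond[(name, value, label)] / label_counts[label]

def predict(features):
    # Bayes rule with the naive assumption:
    # P(y | x) is proportional to P(y) * product over alpha of P(x_alpha | y)
    scores = {}
    for y in prior:
        score = prior[y]
        for name, value in features.items():
            score *= p_value_given_label(name, value, y)
        scores[y] = score
    return max(scores, key=scores.get)
```

Note that once the counts are stored, the raw data points are no longer needed at prediction time, exactly as the bullet above states.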

Classifier Functionality

  • The lecture outlines how classifiers work by comparing likelihood between positive and negative point distributions for test points.
  • A decision boundary exists where classification shifts from one distribution to another; this boundary is identified as linear.

Additional Insights

  • Estimating probability density functions (PDFs) is straightforward through counting occurrences of labels in relation to total data points.

Naive Bayes and Generative Models in Text Classification

Understanding Naive Bayes

  • Naive Bayes is often used for text classification, particularly with high-dimensional data like text documents. Transformations may not be practical due to the expense of processing numerous features corresponding to words.
  • While typically a linear classifier, Naive Bayes can be adapted for various scenarios beyond linearity, though its primary application remains in text classification.

Multinomial Distribution and Data Generation

  • The discussion transitions to the multinomial distribution as a generative model, emphasizing the importance of understanding how data is generated.
  • In this context, the model estimates probabilities based on input labels (e.g., spam or not spam), generating an output vector that represents word counts in emails.

Process of Email Generation

  • The generative process involves starting with an empty document and sequentially adding words by simulating random selection (rolling a die). Each word's occurrence increases its count in the vector representation.
  • This method continues until a predetermined number of words (M) are added to form a complete email. The assumption is that this process mimics real email writing behavior.
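The die-rolling story above can be simulated in a few lines. The vocabulary and word probabilities below are invented for illustration; the only structural requirement is that the per-class word probabilities sum to one.

```python
import random

# Hypothetical vocabulary and per-class word probabilities (must sum to 1).
vocab = ["money", "free", "meeting", "report"]
p_word_given_spam = [0.4, 0.4, 0.1, 0.1]

def generate_email(word_probs, m):
    """Generate an email of m words by repeatedly 'rolling the die':
    each roll picks a word and increments its count in the vector."""
    counts = [0] * len(vocab)
    for _ in range(m):
        word = random.choices(vocab, weights=word_probs)[0]
        counts[vocab.index(word)] += 1
    return counts

email = generate_email(p_word_given_spam, m=10)  # a length-4 count vector summing to 10
```

The output is exactly the bag-of-words representation: a count vector whose entries sum to M, with all ordering information discarded.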

Probability Estimation

  • To estimate probabilities within this framework, one must determine the likelihood of selecting specific words given their class label. This probability must sum to one across all possible outcomes.
  • The formula for estimating probabilities incorporates word frequencies while disregarding order—this aligns with the bag-of-words model where sequence does not impact classification accuracy.

Clarifying Key Concepts

  • The ordering of words is deemed unimportant for tasks like spam detection; thus, frequency statistics suffice for effective classification.
  • A factorial component adjusts for permutations since word order does not affect meaning in this context. Questions arise about understanding these concepts among participants.
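The factorial component mentioned above is the multinomial coefficient. A minimal sketch of the resulting likelihood (function name is mine, not the lecture's):

```python
from math import factorial, prod

def multinomial_likelihood(counts, word_probs):
    """P(x | y) under the multinomial model. The coefficient
    m! / (x_1! * ... * x_d!) counts the orderings of the words,
    since word order is deliberately ignored."""
    m = sum(counts)
    coeff = factorial(m) / prod(factorial(c) for c in counts)
    return coeff * prod(p ** c for c, p in zip(counts, word_probs))
```

With two equally likely words and m = 2, the outcomes (2,0), (0,2), (1,1) get probabilities 0.25, 0.25, 0.5, summing to one as required.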


Understanding Maximum Likelihood Estimates in Spam Filtering

Estimating Probabilities for Words in Emails

  • The speaker discusses the process of estimating the likelihood of words appearing in spam emails using maximum likelihood estimates, which involves analyzing data to determine probable outcomes.
  • The focus is on estimating the probability of a specific word (alpha) given that an email belongs to a certain class (C), such as spam or not spam.
  • To calculate this probability, one must count how many times word alpha appears across all spam emails and divide by the total number of words in those spam emails.
  • This fraction represents the relative frequency of word alpha within spam, providing insight into its relevance as a feature for classification.

Analyzing Word Frequencies

  • The speaker prompts discussion among participants about understanding the significance of the calculated fraction related to word frequencies in spam detection.
  • Clarification is provided regarding summing over all emails and counting occurrences only within classified categories (spam vs. non-spam).
  • A cumulative approach is suggested where each unique word's occurrence across all relevant emails is tallied to create a comprehensive dataset for analysis.

Calculating Probabilities with Smoothing Techniques

  • The speaker explains that when calculating probabilities, especially for rare words, it's essential to account for variations in email lengths and total counts.
  • An example illustrates how often a common spam-related term like "Nigeria" might appear compared to total words across all analyzed spam emails, leading to an estimated probability calculation.
  • If one were to randomly select a word from their spam inbox, this calculated probability would represent their chances of selecting "Nigeria."

Addressing Rare Words and Naive Bayes Classifier

  • Discussion shifts towards handling rare words through techniques like Laplace smoothing (adding 1), ensuring no zero probabilities arise from unseen terms.
  • It’s emphasized that assuming every word has been seen at least once helps mitigate issues with rare or misspelled terms affecting overall classification accuracy.
  • Questions are raised about how these methods apply when encountering new or misspelled words during classification processes.
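The "+1" trick above changes the estimate only slightly. A minimal sketch (counts and vocabulary size are hypothetical): adding one phantom occurrence of every vocabulary word to the numerator forces one to add the vocabulary size to the denominator, so the probabilities still sum to one.

```python
def smoothed_word_prob(word_counts, word, vocab_size):
    """Laplace (+1) smoothing: pretend every vocabulary word was seen
    once, so unseen or misspelled words get a small nonzero probability
    instead of zeroing out the whole product."""
    total = sum(word_counts.values())
    return (word_counts.get(word, 0) + 1) / (total + vocab_size)
```

For counts {"money": 3, "win": 1} and a vocabulary of 4 words, a seen word like "money" gets (3+1)/(4+4) = 0.5, while a never-seen word still gets 1/8 rather than 0.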

Practical Applications and Efficiency

  • The efficiency of naive Bayes classifiers is highlighted; they allow quick updates without needing retraining when new emails arrive by simply adjusting existing counters based on observed words.
  • Participants discuss real-world implications where spammers adapt tactics (like misspelling common terms), impacting classifier effectiveness but also showcasing adaptive learning capabilities within naive Bayes systems.
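The quick-update property described above follows because the entire model is a set of counters. A minimal sketch, with label names ("spam"/"ham") and class name chosen by me for illustration:

```python
from collections import Counter

class OnlineNaiveBayesCounts:
    """Absorbing a newly labeled email only increments counters,
    so there is no retraining pass over old data."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.email_counts = Counter()

    def update(self, words, label):
        self.word_counts[label].update(words)  # bump per-word counters
        self.email_counts[label] += 1          # bump the class prior counter
```

All probability estimates (priors and smoothed conditionals) can then be recomputed on the fly from these counters whenever a prediction is needed.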

Hashing Functions and Text Classification

Introduction to Hashing in Text Processing

  • The speaker discusses the use of hashing functions instead of traditional dictionary lookups for mapping words to dimensions, particularly when dealing with unknown or new words.
  • A hashing function takes a word as input and outputs an index into the d dimensions of the feature vector, allowing for dynamic handling of new vocabulary without needing prior knowledge.

Handling New Language Trends

  • The approach allows for adaptation to evolving language trends, such as abbreviations (e.g., "lol") that emerge over time.
  • In cases where two different words hash to the same dimension (collisions), the speaker mentions strategies to mitigate this issue.

Collision Management Techniques

  • To reduce collision probability, the method involves hashing twice using two different hash functions, significantly lowering the chance of repeated collisions.
  • Concerns about counter overflow from excessive identical inputs are addressed; sufficient bits are used in counters to prevent rollover issues.
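The double-hashing idea can be sketched as follows. The salts, bucket count, and use of MD5 here are my choices for illustration, not the lecture's: any two independent hash functions work. A different word now collides with a given word only if it collides in both buckets, which roughly squares the already-small collision probability.

```python
import hashlib

D = 1024  # number of dimensions (hash buckets); hypothetical choice

def _hash(word, salt):
    # Salting the input gives a second, independent hash function.
    digest = hashlib.md5((salt + word).encode()).hexdigest()
    return int(digest, 16) % D

def word_to_dims(word):
    """Map a word to two bucket indices via two different hash
    functions; a word's count is then recorded in both buckets."""
    return _hash(word, "h1:"), _hash(word, "h2:")
```

Because hashing is deterministic, a new word like "lol" always lands in the same pair of dimensions, with no dictionary needed ahead of time.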

Application in Spam Classification

  • The discussion transitions into text classification applications, specifically spam filtering using multinomial classifiers like Naive Bayes.
  • Emphasis is placed on the fact that the order of words does not affect classification outcomes: the constant in the calculation (the factorial term counting word orderings) is the same for every class, so it cancels when classes are compared.

Continuous Variables in Classification

  • The conversation shifts towards handling continuous variables in classification problems, introducing Gaussian distributions as a common model choice.
  • Gaussian distributions are favored due to their mathematical properties; they allow modeling probabilities based on mean and variance for each feature across classes.

Estimating Probabilities with Gaussian Distributions

  • An example involving height illustrates how features can be modeled using Gaussian distributions with distinct parameters for different classes (e.g., male vs. female).

Maximum Likelihood Estimation and Project Overview

Understanding Maximum Likelihood Estimation

  • The process involves calculating the average value of features within a specific class, denoted as C. This is done by summing all points in class C and averaging them.
  • For example, to find the average height for men (the class mean μ_α), one would compute the average height from all male patient files, and similarly for women.
  • The concept of maximum likelihood estimation applies to Gaussian distributions; it can also accommodate both discrete and continuous features, allowing different models for each feature.
  • Variance is defined as the average squared difference from the mean, taken over the data points with a specific label. This definition is crucial for understanding how variance is calculated in this context.
  • Questions about independence assumptions are addressed; while they may seem nitpicky, they are important in statistical modeling.
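The two estimates above (class mean and class variance) can be written as a small helper, applied once per feature per class. A minimal sketch with made-up height data:

```python
def gaussian_mle(xs):
    """MLE for a 1-D Gaussian: the mean is the class average, the
    variance is the average squared deviation from that mean."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

# Hypothetical heights (cm) for two classes of one feature:
mu_men, var_men = gaussian_mle([178.0, 182.0, 174.0])
mu_women, var_women = gaussian_mle([162.0, 168.0, 165.0])
```

Each class then contributes its own (μ, σ²) pair per feature, and mixed models are possible: Gaussian for continuous features, frequency counts for discrete ones.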

Project Description: Classifying Baby Names

  • The project involves predicting whether a given name belongs to a boy or girl based on a dataset of baby names. Participants will extract features from names to aid classification.
  • A list of 500 boy names and 600 girl names will be provided. The task requires participants to classify new names not included in this list.
  • An example feature extraction method includes using a Naive Bayes classifier to predict gender based on name characteristics.

Demonstration of Classification Model

  • A live demo showcases how the classifier predicts gender based on input names like "Stephen" (boy), "Anna" (girl), and others, demonstrating its accuracy with various examples.
  • The model's robustness is tested with ambiguous names such as "Robin" and "Taylor," showing that it can handle cases where gender association isn't clear-cut.
  • Participants are encouraged to test adventurous or less common names during the demonstration, highlighting potential challenges in classification accuracy due to cultural variations in naming conventions.

Competition Aspect

  • There will be a competition aspect where participants' classifiers will be evaluated based on their accuracy against test data. However, there’s an acknowledgment that some names may inherently have ambiguous gender associations which could limit accuracy.

Understanding Naive Bayes and Perceptron Classifiers

Decision Boundaries and Classifier Types

  • The speaker discusses visualizing decision boundaries, assuming features are real-numbered and follow Gaussian distributions. This leads to the conclusion that the classifier is linear.
  • It is revealed that by assuming independence of features and their Gaussian nature, one implicitly assumes a linear classifier without realizing it.
  • The speaker explains that with two Gaussian distributions, the classification line must be a hyperplane, which can accurately separate points in this context.
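The linearity claim above can be made concrete. Assuming each feature α is Gaussian with class means μ_α⁺, μ_α⁻ and a variance σ_α² shared between the classes (if the variances differ per class, the boundary becomes quadratic instead of linear), the log-odds work out to:

$$
\log\frac{P(y=+1\mid \mathbf{x})}{P(y=-1\mid \mathbf{x})}
= \log\frac{\pi_+}{\pi_-}
+ \sum_{\alpha}\left[\frac{\mu_\alpha^+ - \mu_\alpha^-}{\sigma_\alpha^2}\,x_\alpha
- \frac{(\mu_\alpha^+)^2 - (\mu_\alpha^-)^2}{2\sigma_\alpha^2}\right],
$$

which has the form \( \mathbf{w}^\top \mathbf{x} + b \): setting it to zero gives a hyperplane, so the decision boundary is linear.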

Limitations of Naive Bayes

  • A distinction is made between perceptron classifiers, which always find a separating hyperplane if one exists, versus naive Bayes classifiers that may misclassify data points even when they are linearly separable.
  • The audience is prompted to think about how to create a dataset where naive Bayes would fail despite being linearly separable for perceptrons.

Dataset Discussion

  • An audience member volunteers a dataset example; however, it turns out to be incorrect. The speaker emphasizes understanding why naive Bayes misclassified certain points due to assumptions about point distribution.
  • The discussion highlights an assumption made regarding point projections being Gaussian when they are actually heavily concentrated in specific areas.

Misclassification Insights

  • The speaker illustrates how certain points might be more likely drawn from different distributions than assumed (Gaussian), leading to potential misclassification by naive Bayes.
  • A scenario with multiple clusters (O's vs. X's) demonstrates how naive Bayes could inaccurately model data due to insufficient representation of minority clusters skewing the estimated class means.

Performance Comparison

  • There’s a comparison between the performance of perceptrons and naive Bayes: while naive Bayes operates faster by counting probabilities, perceptrons require iterative adjustments over datasets which can slow convergence.