UMass CS685 S23 (Advanced NLP) #2: N-Gram Language Models
Language Modeling Overview
In this section, the instructor introduces the topic of language modeling, specifically focusing on n-gram models and discussing upcoming deadlines and additional opportunities for students.
Impending Deadlines and Final Project Groups
- Homework 0 and Piazza registration are due by February 17th; students not yet added to Gradescope or Piazza should email the instructor.
- Final project groups must also be formed by February 17th; submit group members' names and email addresses.
Project Proposals and Extra Credit Opportunity
- Project proposals due in about a month. Start considering topics for the final project.
- Weekly remote NLP seminar offers extra credit. Attend talks on Tuesdays from 11:30 to 12:30, submit summaries for credit up to three times.
Motivation for Studying Language Models
The instructor delves into the importance of understanding language models by revisiting sentiment analysis as an example, highlighting the evolution from basic sentiment analysis models to more advanced transfer learning approaches.
Importance of Language Models
- Language models are crucial for various applications like sentiment analysis in reviews to determine positivity or negativity.
- Traditional sentiment analysis involved supervised learning with limited capabilities restricted to sentiment analysis only.
Transfer Learning in Language Models
- Transfer learning involves pre-training a model on unlabeled data before fine-tuning it on specific tasks like sentiment analysis using smaller labeled datasets.
- Base models from transfer learning can be specialized for multiple tasks, leveraging shared information across different applications.
Prompt-Based Methods
In this section, the speaker discusses the concept of using prompt-based methods to solve tasks without further training on sentiment data.
Using Prompt-Based Methods for Tasks
- The speaker explains how a giant language model can be utilized for translation by providing natural language instructions and examples.
- Examples are given where specific words are paired with their translations to guide the model in predicting the next word accurately.
- The discussion delves into prompt-based methods, emphasizing the importance of providing clear prompts for the model to learn from examples effectively.
- The ease of using natural language prompts without deep knowledge of the model's workings is highlighted, showcasing the flexibility and adaptability of such methods.
- The evolution in understanding why prompt-based methods work is mentioned, with a promise to delve deeper into this topic later in the semester.
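The in-context translation setup described above can be sketched as a literal prompt string. The word pairs and the "=>" format below are illustrative assumptions, not examples from the lecture:

```python
# A hypothetical few-shot prompt for English-to-French translation.
# The demonstration pairs and arrow format are invented for illustration.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "hello => "
)
# Fed this text, a language model simply predicts the next tokens;
# the demonstrations steer it toward completing the French translation.
print(prompt)
```

No weights are updated here: the task is specified entirely through the input text.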
Prompting Without Weight Updates
This part focuses on prompting as a method that does not involve updating model weights but rather relies on input text for predictions.
Prompting Approach
- Prompting involves asking the model to predict the next word based on input text, demonstrating a task-oriented approach without specialized training updates.
- The question arises about finding optimal ways to formulate tasks through prompts, leading to discussions on different prompting strategies beyond natural language instructions.
- Exploring how models understand prompts and tasks showcases complexities in formulating effective prompts that align with desired outputs.
- Addressing whether a model can perform tasks in languages it hasn't been exposed to highlights the necessity of diverse training data for multilingual capabilities.
Language Models and Conditional Probability
In this section, the speaker discusses the use of conditional probability in language models to estimate the likelihood of the next word based on preceding words. This concept is crucial for tasks like machine translation and speech recognition.
Conditional Probability in Language Models
- Language models utilize conditional probability to rank candidate translations or recognize spoken phrases based on their likelihood.
- Speech recognition relies on determining the most probable sequence of words spoken, emphasizing the importance of accurate language models.
- Language models are essential for computing joint probabilities of text pieces and predicting subsequent words in applications like autocomplete features.
Decomposing Joint Probability with Chain Rule
The discussion delves into decomposing joint probabilities using the chain rule, a fundamental concept in probability theory that underpins language modeling.
Decomposition Using Chain Rule
- Joint probabilities can be broken down into products of conditional probabilities through the chain rule, enabling efficient computation in language modeling.
- Factorizing joint probability distributions allows for expressing complex relationships between random variables as a series of conditional probabilities.
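As a minimal sketch, the chain-rule factorization amounts to multiplying conditional probabilities left to right; the numbers below are invented toy values:

```python
# P(the, cat, sat) = P(the) * P(cat | the) * P(sat | the, cat)
# These conditional probabilities are made-up values for illustration.
cond_probs = [
    0.1,   # P(the)
    0.05,  # P(cat | the)
    0.2,   # P(sat | the, cat)
]
joint = 1.0
for p in cond_probs:
    joint *= p
# joint is approximately 0.001 = 0.1 * 0.05 * 0.2
```

Each factor conditions on everything to its left, so no independence assumption has been made yet; that comes with the Markov assumption below.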
Modeling Prefixes for Language Prediction
The focus shifts to understanding prefixes in language prediction models and extracting valuable information from them to enhance word prediction accuracy.
Importance of Modeling Prefixes
- Prefixes refer to preceding words in a text sequence, influencing predictions about subsequent words in language modeling tasks.
Producing Language Models
In this section, the speaker discusses the challenges of producing language models due to sparsity issues in data and introduces the concept of n-gram models as a solution.
Challenges in Count-Based Approaches
- The issue of sparsity arises when counting occurrences of prefixes, leading to a vast number of unobserved sequences.
- Sparse data poses a significant challenge in estimating probabilities accurately, even with extensive text data available.
Introducing N-Gram Models
- N-gram models address sparsity by utilizing the Markov assumption to simplify count-based approaches.
- Approximating probabilities by considering shorter prefixes helps mitigate sparsity issues but may lead to loss of information.
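The bigram version of the Markov assumption can be sketched as truncating the prefix to its last word; the probability table here is a toy assumption:

```python
# Under the bigram Markov assumption,
# P(word | w1 ... w_{n-1}) is approximated by P(word | w_{n-1}).
def bigram_approx(prefix, word, bigram_probs):
    """Approximate the full conditional with a bigram table lookup."""
    return bigram_probs.get((prefix[-1], word), 0.0)

# Toy table: only the pair ("cat", "sat") has an estimated probability.
bigram_probs = {("cat", "sat"): 0.2}
p = bigram_approx(("the", "hungry", "cat"), "sat", bigram_probs)
# Only "cat" influences the estimate; "the hungry" is discarded,
# which is exactly the information loss the notes mention.
```

The longer the discarded portion of the prefix, the more context is lost, but the easier the remaining probabilities are to estimate from counts.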
Trade-offs of Extended Prefixes
Extending the length of prefixes is explored as a way to sharpen predictions, though longer prefixes exacerbate the sparsity problem.
Extending Prefix Length
- Increasing prefix length provides more constraints for predicting the next word based on context.
- Longer prefixes offer greater precision, but their counts are sparser in the training data, making reliable estimation harder.
Unigram Model Analysis
The speaker delves into the limitations and implications of unigram models in language processing tasks.
Unigram Model Evaluation
- Unigram models simplify probability estimation by considering individual word frequencies without conditioning on any context.
Decoding and Text Generation
In this section, the speaker discusses the process of generating text from a language model through decoding and probability distributions.
Generating Text from Language Models
- Decoding is the process of generating text from a language model.
- Language models provide conditional probability distributions for predicting the next word based on a given prefix.
- The vocabulary defined in the language model determines the possible options for the next word.
- Higher probabilities assigned by the language model indicate more suitable words for prediction in a given context.
- Selecting words uniformly at random ignores the model's probabilities and yields incoherent text.
Sampling from the Output Distribution
This part delves into strategies for sampling words from probability distributions to enhance text generation quality.
Sampling Strategies in Text Generation
- Sampling words based on their probabilities can improve text generation quality by favoring higher-probability words.
- Inverse transform sampling (drawing a uniform random number and walking the cumulative distribution) selects words in proportion to their probabilities, improving coherence over uniform selection.
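This proportional sampling can be sketched in a few lines; the three-word distribution below is a toy assumption:

```python
import random

def sample_word(dist, rng=random):
    """Sample a word in proportion to its probability by walking the CDF."""
    r = rng.random()
    cum = 0.0
    for word, p in dist.items():
        cum += p
        if r < cum:
            return word
    return word  # fallback for floating-point rounding at the top of the CDF

# Toy distribution over a three-word vocabulary.
dist = {"the": 0.5, "cat": 0.3, "sat": 0.2}
counts = {w: 0 for w in dist}
random.seed(0)
for _ in range(10_000):
    counts[sample_word(dist)] += 1
# Empirical frequencies approximate the distribution:
# "the" near 50%, "cat" near 30%, "sat" near 20%.
```

High-probability words are drawn often but low-probability words still appear, unlike greedy selection.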
Greedy Decoding
Exploring alternative methods beyond simple word selection to enhance text generation processes.
Advanced Text Generation Strategies
- Always choosing the highest-probability word (greedy decoding) eliminates randomness, but in a unigram model it degenerates into repeating the single most frequent word, such as "the."
- Contrasting greedy and sampling-based decoding highlights how word-frequency effects shape the generated text.
Autoregressive Generation
Understanding how conditional probability distributions guide word generation and extend to generating longer pieces of text.
Conditional Probability Distribution in Text Generation
- Utilizing conditional probabilities guides word selection; updating prefixes with generated words enables continuous text creation.
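The sample-append-repeat loop can be sketched with a toy bigram table; the vocabulary and probabilities below are invented for illustration:

```python
import random

# Toy bigram model: maps the previous word to a distribution over next words.
bigram = {
    "<s>": {"I": 0.7, "Sam": 0.3},
    "I": {"am": 1.0},
    "Sam": {"is": 1.0},
    "am": {"Sam": 0.6, "</s>": 0.4},
    "is": {"here": 1.0},
    "here": {"</s>": 1.0},
}

def generate(bigram, rng):
    """Sample the next word given the last word, append, and repeat."""
    words = ["<s>"]
    while words[-1] != "</s>":
        dist = bigram[words[-1]]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(nxt)
    return words[1:-1]  # drop the start and end sentinels

out = generate(bigram, random.Random(0))
print(" ".join(out))
```

Each generated word becomes the new one-word "prefix" for the next step, which is exactly the continuous text creation described above.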
Language Models and N-gram Models
In this section, the speaker discusses language models and N-gram models, highlighting how different prefix lengths impact the fluency of generated text.
Understanding Language Models
- In a bigram model, only the last word of the prefix is considered when computing the next word's probability.
- The fluency of generated text improves as the prefix length in N-gram models increases.
- Unigram models are poor in coherence, while trigram models offer more grammatically correct sequences.
Optimal N in Language Models
This part delves into determining the optimal value for 'N' in language models based on data availability and model performance.
Finding Optimal 'N'
- The choice of 'N' depends on available data; larger datasets allow better estimates for longer prefixes.
- Google has trained models with prefixes as long as ten words; data size and computational cost make longer prefixes impractical.
Evaluating Language Model Quality
Evaluating language model quality involves understanding inherent limitations like handling long-distance dependencies.
Assessing Model Quality
- Long-distance dependencies pose challenges due to Markov assumption limitations.
- Higher-order N-gram models can better handle long-range dependencies compared to lower-order ones.
Computing Conditional Probabilities
This segment focuses on computing conditional probabilities using training datasets and counting occurrences of word sequences.
Computing Probabilities
- Training datasets involve counting occurrences of word sequences to compute conditional probabilities.
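A minimal count-based estimator can be sketched on the small "I am Sam" style corpus used in the lecture's examples (the exact toy sentences here are an assumption; for simplicity, bigrams spanning sentence boundaries are also counted):

```python
from collections import Counter

corpus = ("<s> I am Sam </s> "
          "<s> Sam I am </s> "
          "<s> I do not like green eggs and ham </s>").split()

# Count bigram occurrences and prefix (unigram) occurrences.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(word, prev):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# P(am | I) = count("I am") / count("I") = 2/3
# P(Sam | <s>) = count("<s> Sam") / count("<s>") = 1/3
```

Any bigram that never occurs in the corpus gets a count of zero, which is the sparsity problem smoothing will address later.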
Data Set Probabilities and Terminology
In this section, the speaker discusses data set probabilities and introduces key terminology related to word occurrences in a vocabulary.
Data Set Probabilities
- Bigrams observed in the toy data set receive high estimated probabilities.
- An example is provided where the probability of observing the end of sequence symbol after seeing the word "Sam" is discussed.
- Calculations for probabilities such as the likelihood of "Sam" given the start of a sequence are demonstrated.
Terminology Introduction
- Unseen bigrams or two-word phrases in the data set are assigned zero probability due to lack of occurrence data.
- Definitions for word types and tokens within a vocabulary are explained, emphasizing unique words versus their occurrences.
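The type/token distinction is a one-liner in code; the sentence below is an invented example:

```python
# "Tokens" count every occurrence; "types" count unique words.
text = "the cat sat on the mat".split()
tokens = len(text)       # 6: "the" is counted twice
types = len(set(text))   # 5: "the" is counted once
```

Vocabulary size is the number of types, while corpus size is the number of tokens.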
Google Ngram Viewer and Language Evolution
This segment delves into utilizing Google's Ngram Viewer to explore language evolution through historical text analysis.
Google Ngram Viewer Demonstration
- The speaker showcases Google's Ngram Viewer, allowing users to track phrase frequencies over time in digitized books.
- Examples like "ordered a pizza" reveal shifting language trends across different eras based on relative frequency changes.
Language Analysis Insights
- Researchers leverage Google's Ngram Corpus for digital Humanities studies, drawing connections between language shifts and societal contexts.
Ngram Modeling and Probability Tables
This part focuses on constructing probability tables from ngram models to compute conditional probabilities for sequences.
Ngram Model Construction
- Forming tables with prefixes and subsequent words aids in calculating conditional probabilities within ngram models.
- The exponential growth of table size with higher model orders is highlighted, illustrating increased complexity.
Probability Table Computation
- Converting count-based tables into probability tables involves normalizing cell counts by total prefix occurrences for accurate probabilistic assessments.
- Utilizing these probability tables enables computing joint probabilities for entire sentences through product calculations based on conditional probabilities.
Language Models and Evaluation
In this section, the speaker discusses the challenges of working with probabilities in language models and introduces the concept of log probabilities to address these issues. The importance of encoding knowledge within language models is also highlighted.
Handling Probabilities in Language Models
- Working with longer inputs leads to extremely small probabilities, which can underflow floating-point arithmetic.
- Replacing the product of probabilities with the sum of log probabilities makes calculations more manageable.
- Log probabilities are more tractable in implementations, aiding model performance evaluation.
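The underflow problem and the log-space fix can be demonstrated directly; the per-word probability of 0.1 is an arbitrary toy value:

```python
import math

# 400 words, each assigned a toy probability of 0.1.
probs = [0.1] * 400

# Multiplying directly underflows: 10^-400 is below the smallest
# representable double-precision float, so the product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing log probabilities stays finite: 400 * log(0.1).
log_prob = sum(math.log(p) for p in probs)
print(product, log_prob)
```

The sentence's log probability is recovered exactly where the raw product is lost, which is why toolkits work in log space.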
Encoding Knowledge in Language Models
- Language models encode not only language understanding but also world knowledge from training data.
- Examples from a restaurant corpus demonstrate how language models encode grammar and world knowledge.
- Language models exhibit elements of reasoning through this encoded knowledge about language and the world.
Evaluating Language Models
This section delves into evaluating language models by assessing their ability to assign high probabilities to unseen text, emphasizing the importance of generalization beyond training data.
Evaluating Model Performance
- Efficient toolkits exist for computing count tables and applying smoothing algorithms for language model evaluation.
- A good language model should assign high probability to unseen text, indicating effective generalization beyond training data.
Understanding Model Evaluation in Machine Learning
In this section, the speaker discusses the importance of having a validation set in machine learning, particularly for complex models like ChatGPT. The validation set is used to iterate over different configurations and improve model performance.
Importance of Validation Set
- "Pick the configuration that gives the highest probability on the validation set and evaluate it once on the test set."
- Explains how comparing probabilities of training data and held-out data helps assess model quality.
- Illustrates comparing language models using different n-gram models on specific texts.
- Emphasizes avoiding information leakage between training and test sets to prevent erroneous conclusions.
- Highlights the necessity of careful data splitting without overlap to ensure accurate evaluation.
Understanding Perplexity as an Evaluation Metric
This section delves into perplexity as an evaluation metric in language modeling, explaining its significance in assessing model performance based on test set probabilities.
Perplexity Definition and Significance
- Defines perplexity as the inverse probability of the test set normalized by the number of words, aiding comparison across different texts.
- Discusses minimizing perplexity to maximize test set probability and why perplexity is preferred over direct probability comparisons.
- Explores historical reasons for using perplexity and its information theory interpretations.
- Provides an example with random digits to illustrate perplexity as a branching factor: ten equally likely digits yield a perplexity of ten.
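The random-digits example can be verified with a short sketch: perplexity is the exponentiated average negative log probability per word, and the sequence length of 30 below is arbitrary:

```python
import math

def perplexity(log_probs):
    """PP = exp(-(1/N) * sum of per-word log probabilities)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A "model" of random digits: each of 10 digits is equally likely,
# so every word gets probability 0.1 regardless of sequence length.
digit_logps = [math.log(0.1)] * 30
# perplexity(digit_logps) is 10: the branching factor of the digits.
```

Because the metric is normalized per word, perplexities are comparable across test sets of different lengths, which raw sequence probabilities are not.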
Language Models and Perplexity
In this section, the speaker discusses language models and perplexity, emphasizing the evaluation of models based on perplexity scores derived from log probabilities.
Evaluating Language Models
- The negative log likelihood is commonly used as a training loss in language models, with perplexity being an exponentiated form of this metric.
- Lower perplexity indicates a better model performance. A unigram model tested on 1.5 million words not part of the training set yielded a perplexity of 962.
- Different n-gram models show varying perplexities; however, increasing to higher-order n-grams can lead to sparsity issues and subsequent performance degradation.
Dealing with Unseen Tokens in Language Models
This segment delves into challenges posed by unseen tokens in language models and introduces the concept of smoothing to address these issues effectively.
Addressing Unseen Tokens
- Unseen tokens in testing data pose significant challenges, especially in higher-order n-gram models due to sparsity issues.
- Shakespeare's works serve as an example where most possible bigrams were not utilized, leading to zero probability for many valid combinations.
Smoothing Techniques in Language Models
The discussion shifts towards smoothing techniques employed in n-gram models to mitigate problems associated with unseen tokens and improve model robustness.
Implementing Smoothing
- Smoothing redistributes probability mass from observed items to unobserved ones, enhancing the model's ability to handle unseen tokens effectively.
- Various methods exist for distributing probability mass across unobserved items through smoothing algorithms, contributing to improved model performance and generalization capabilities.
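One of the simplest such algorithms is add-one (Laplace) smoothing, sketched below on a toy corpus (the sentences are an assumption, and bigrams spanning sentence boundaries are counted for simplicity):

```python
from collections import Counter

corpus = "<s> I am Sam </s> <s> Sam I am </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(set(corpus))  # vocabulary size (word types), here 5

def p_laplace(word, prev):
    """Add-one smoothed estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# Seen bigram:   P(Sam | am) = (1 + 1) / (2 + 5) = 2/7
# Unseen bigram: P(I | am)   = (0 + 1) / (2 + 5) = 1/7, no longer zero
```

Adding one to every count shifts probability mass from observed bigrams to unobserved ones, so no valid sequence is assigned zero probability at test time.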
Evaluation Metrics for Language Models
Evaluation metrics such as assigning high probabilities on test sets are explored alongside discussions on training versus test perplexity calculations.
Model Evaluation Metrics
- Assigning high probabilities on test sets equates to low perplexity scores, indicating strong model performance.
Detailed Discussion on Language Models and Updating Information
In this segment, the speaker delves into the challenges faced by language models due to the dynamic nature of language, emphasizing the need for continuous updates to adapt to evolving vocabulary and meanings.
Challenges in Updating Language Models
- The speaker highlights the issue of language evolution, where new words or phrases may emerge, leading to a lack of access to updated information beyond the training data available.
- Discusses the impracticality of constantly re-estimating count tables for complex models like ChatGPT due to time and resource constraints.