Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer

Introduction to CME 295: Transformers and Large Language Models

Course Overview

  • The course is taught by twin brothers Afshine and Shervine, both with backgrounds in engineering from Centrale Paris.
  • Their professional experiences include working at Uber, Google, and Netflix, focusing on Large Language Models (LLMs).
  • The class aims to explore the mechanisms behind LLMs and their applications, especially following the rise of ChatGPT in 2022.

Target Audience

  • This course is suitable for individuals interested in pursuing careers as research or ML scientists or those wanting to develop personal projects involving LLMs.
  • It also caters to professionals from other fields seeking to understand AI and its applications.

Prerequisites

  • A foundational understanding of machine learning concepts such as model training and neural networks is required.
  • Basic knowledge of linear algebra, particularly matrix multiplication, is also recommended but not mandatory.

Logistics of the Course

Class Schedule

  • Classes are held every Friday from 3:30 PM to 5:20 PM for two units; students can choose letter grading or credit/no credit options.

Recordings and Materials

  • Lectures will be recorded and made available online shortly after each session. Slides will also be posted on the course website.
  • The primary textbook for the course is "Super Study Guide - Transformer LLMs," which aligns closely with class content.

Assessments

  • There will be two exams: a midterm on October 24 and a final exam during the week of December 8. Each exam contributes equally (50%) to the final grade.

Communication Channels

Announcements & Questions

  • Important announcements will be posted on Canvas; students can ask questions via an Ed tab on Canvas or through a mailing list provided by instructors.

Clarifications Regarding Exams

Exam Format

  • Exams will focus solely on conceptual understanding rather than coding skills. Following the class materials should adequately prepare students for the assessments.

Waitlist Information

  • Students who are waitlisted are encouraged to communicate with instructors about their status; historically, many waitlisted students gain admission as schedules finalize.

Introduction to NLP and Course Structure

Overview of the Course

  • The course will consist of a midterm and final exam, each contributing 50% to the overall grade.
  • Class sessions are limited to two hours per week over nine or ten weeks, necessitating focused content delivery.
  • Sources will be provided on slides for further exploration of topics discussed in class.

Understanding Abbreviations in NLP

  • Students are encouraged to familiarize themselves with common abbreviations used throughout the course.
  • The instructor aims for students to have a mental map of these terms by the end of the class.

What is Natural Language Processing (NLP)?

Definition and Classification of NLP Tasks

  • NLP stands for Natural Language Processing, focusing on manipulating and computing text data.

Categories of NLP Tasks

  1. Classification
  • Involves predicting a single outcome from input text, such as sentiment analysis (e.g., determining if a movie review is positive or negative).
  • Other examples include intent detection (e.g., identifying commands like "create an alarm") and language detection.
  2. Multi-classification
  • Predicting multiple outcomes from input text; an example is Named Entity Recognition (NER), which labels specific words in context (e.g., identifying locations or times).
  3. Generation
  • Involves generating output text based on input; tasks include machine translation, question answering, summarization, and creative generation like poetry or code.

Deep Dive into Classification Tasks

Sentiment Extraction Example

  • An example task involves predicting sentiment from sentences like "this teddy bear is so cute," where the expected output is positive sentiment.

Evaluation Metrics for Classification

  • Common metrics include accuracy (percentage of correct predictions), precision (correct positive predictions out of all positives predicted), recall (correct positives out of actual positives), and F1 score (harmonic mean of precision and recall).
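These four metrics can be computed directly from prediction counts. The sketch below (with toy labels, not course data) also illustrates the point made next: on an imbalanced dataset, a model that always predicts the majority class can have high accuracy yet zero recall.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Imbalanced toy example: 8 negatives, 2 positives. Always predicting the
# negative class scores 80% accuracy but has zero recall on the positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
print(classification_metrics(y_true, y_pred))  # (0.8, 0.0, 0.0, 0.0)
```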

Importance of Metrics

  • These metrics are crucial when dealing with imbalanced datasets where one class significantly outweighs another; relying solely on accuracy can be misleading.

Exploring Multi-classification with NER

Named Entity Recognition Task

  • NER focuses on identifying categories within given words in texts, requiring classification metrics evaluated at token or entity-type levels rather than sentence level.

Understanding Text Processing in Machine Translation

Evaluating Word Prediction in Categories

  • The discussion begins with the importance of evaluating word prediction metrics based on specific categories, such as location.
  • The focus shifts to machine translation, exemplified by translating text from English to French, highlighting the complexity of obtaining paired datasets for evaluation.

Challenges in Machine Translation Evaluation

  • Multiple valid translations exist for a single phrase, complicating performance evaluation.
  • Traditional metrics like BLEU (Bilingual Evaluation Understudy) and ROUGE are introduced; both require reference texts for comparison.
  • The high cost and time associated with obtaining labeled data is noted as a significant challenge in using these traditional metrics.

Alternative Metrics and Historical Context

  • Perplexity is mentioned as another metric that evaluates model output probabilities; lower perplexity indicates better performance.
  • A brief history of language models is provided, noting advancements from the 1980s through LSTMs and into modern large language models (LLMs).
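Perplexity can be computed from the probabilities the model assigned to the observed tokens. A minimal sketch (the probability values are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the model
    assigned to each observed token; lower means the model was less surprised."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token behaves as if it were
# choosing uniformly among 4 options, giving a perplexity of 4.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
print(perplexity([0.9, 0.8, 0.95]))    # close to 1 -- better
```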

Evolution of Language Models

  • Key developments include Word2Vec for meaningful embeddings and transformers introduced in 2017, which form the basis of current models.
  • The scaling up of models has been driven by increased computational power and larger training datasets.

Tokenization: Preparing Text for Models

  • To process text effectively, it must be quantifiable; tokenization is essential for breaking down sentences into manageable units.
  • Different methods of tokenization are discussed: arbitrary unit-based versus word-level approaches, each with pros and cons regarding representation accuracy.

Subword Tokenizers: Addressing Limitations

  • Word-level tokenization can lead to similar words being treated as distinct entities (e.g., "bear" vs. "bears").
  • Subword tokenizers leverage common roots within words to create more efficient representations while acknowledging potential increases in sequence length.
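One way to see how subword tokenizers exploit common roots is a greedy longest-match-first segmentation (a simplified WordPiece-style sketch; the vocabulary below is a toy assumption, and real tokenizers learn their vocabularies from data, e.g., with byte-pair encoding):

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first subword segmentation.
    Falls back to an <unk> token for characters not covered by the vocab."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest remaining piece first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append("<unk>")
            i += 1
    return tokens

# Hypothetical vocabulary: "bears" reuses the root "bear" instead of being
# a distinct entry, at the cost of a longer token sequence.
vocab = {"bear", "s", "teddy", "un", "happi", "ness"}
print(subword_tokenize("bears", vocab))        # ['bear', 's']
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```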

Tokenization Techniques in NLP

Overview of Tokenization

  • Tokenization involves breaking down text into smaller units (tokens), which can impact processing time; more tokens mean longer processing times.
  • Character-level tokenization captures misspellings better than subword tokenization but results in much longer sequences, increasing model processing time.
  • Word-level tokenization is simple but does not leverage word roots and can lead to unknown tokens at inference if a word wasn't seen during training.

Out Of Vocabulary (OOV) Issues

  • OOV refers to words that the model has not encountered during training, leading to them being marked as unknown during inference.
  • Subword tokenizers reduce the risk of OOV issues by leveraging word roots while still having some limitations compared to character-level approaches.

Representation of Tokens

  • After tokenizing text, each token needs a representation; this is often done using one-hot encoding (OHE).
  • One-hot encoding assigns a unique vector for each token, making it easy to represent but limiting similarity comparisons between tokens.

Similarity Measures

  • Cosine similarity is commonly used to measure how similar two vectors are based on their orientation in n-dimensional space.
  • One-hot encoded vectors are orthogonal, meaning they do not capture semantic similarities effectively; ideally, similar tokens should have higher similarity scores.
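The orthogonality problem is easy to verify numerically. The sketch below computes cosine similarity for one-hot vectors versus dense vectors (the dense embedding values are invented for illustration, not learned):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Distinct one-hot vectors are orthogonal: similarity is 0 for every pair,
# so related and unrelated tokens look equally dissimilar.
soft_ohe, teddy_ohe = [1, 0, 0], [0, 1, 0]
print(cosine_similarity(soft_ohe, teddy_ohe))  # 0.0

# Learned dense embeddings (hypothetical values) can place related tokens
# close together in the vector space.
soft_emb, teddy_emb = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2]
print(round(cosine_similarity(soft_emb, teddy_emb), 3))  # close to 1
```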

Understanding Norm and Vector Size

  • The goal is for semantically similar tokens (e.g., "soft" and "teddy bear") to have high cosine similarity while unrelated tokens should be closer to zero.
  • The choice of vocabulary size influences whether word or subword tokenizers are used; subword tokenizers are generally preferred for single-language tasks.

Understanding Subword Tokenization and Word Embeddings

The Importance of Subwords in Vocabulary Size

  • Subwords serve as a compromise between identifying words by their roots and minimizing out-of-vocabulary (OOV) risks, making them effective for language processing.
  • For English, a typical vocabulary size is tens of thousands, while multilingual models may require hundreds of thousands due to the complexity of different languages.

Learning Word Embeddings

  • One-hot encoding is ineffective for token representation; thus, learning embeddings from data is essential.
  • The concept of word embeddings gained popularity with the introduction of Word2Vec in 2013, which illustrated intuitive relationships between words (e.g., "king" to "queen" as "Paris" to "France").

Methods for Computing Embeddings

  • Two primary methods for computing embeddings are Continuous Bag of Words (CBOW) and Skip Gram. Both utilize context to predict target words.
  • These methods are considered proxy tasks aimed at understanding language structure rather than merely predicting the next word.

Proxy Tasks and Neural Networks

  • The goal is to create meaningful representations that reflect linguistic relationships (e.g., capital cities).
  • A simple neural network model processes input vectors through hidden layers to learn word representations via prediction tasks.

Training the Model

  • During training, one-hot encoded inputs are used to predict subsequent words based on previous context.
  • Predictions yield probabilities for potential next words; adjustments are made using backpropagation based on loss calculations (typically cross-entropy).

Iterative Learning Process

  • The model iteratively predicts next tokens, adjusting its weights after each prediction error, until it learns to anticipate subsequent words effectively.
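The training loop above can be sketched as a minimal next-token prediction model: a one-hot input selects an embedding row, a linear layer plus softmax scores the vocabulary, and cross-entropy gradients update both weight matrices. All sizes, data, and the learning rate below are toy assumptions; real Word2Vec uses context windows and tricks like negative sampling.

```python
import math, random

random.seed(0)
vocab = ["the", "teddy", "bear", "is", "so", "cute"]
V, D = len(vocab), 4                      # vocabulary size, embedding size
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # embeddings
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]  # output layer

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def step(ctx, target, lr=0.1):
    """One SGD step on cross-entropy loss; returns the loss before the update."""
    h = W_in[ctx]                          # one-hot @ W_in = row lookup
    logits = [sum(h[d] * W_out[d][v] for d in range(D)) for v in range(V)]
    p = softmax(logits)
    loss = -math.log(p[target])
    grad = [p[v] - (1.0 if v == target else 0.0) for v in range(V)]  # dL/dlogits
    dh = [sum(grad[v] * W_out[d][v] for v in range(V)) for d in range(D)]
    for d in range(D):                     # backpropagate into both matrices
        for v in range(V):
            W_out[d][v] -= lr * grad[v] * h[d]
        W_in[ctx][d] -= lr * dh[d]
    return loss

pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # (current token, next token)
first = sum(step(c, t) for c, t in pairs)
for _ in range(499):
    last = sum(step(c, t) for c, t in pairs)
print(round(first, 3), "->", round(last, 3))       # loss decreases over epochs
```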

Understanding Word Representations in Language Models

The Basics of Word Representation

  • The model learns word representations by starting from a one-hot encoding of each word, which is then multiplied by learned weights to obtain the word's embedding (shown in green on the slides).
  • In this example, the toy vocabulary contains only six possible words; real-world applications typically involve much larger vocabularies.

Vocabulary Challenges

  • Variations in words can lead to an extensive vocabulary when using a word-level tokenization approach, complicating language processing tasks.
  • To handle unseen words during inference, models reserve an "unknown token" representation for out-of-vocabulary tokens, which helps manage the limitations of word-level tokenizers.

Tokenization Levels and Their Implications

  • Subword level tokenization reduces the chances of encountering out-of-vocabulary tokens compared to word-level approaches. Character-level tokenization eliminates this issue entirely.

Training and Convergence in Language Models

Understanding Model Training

  • The goal during training is not just predicting the next word but also learning meaningful representations. Tracking loss functions over epochs helps determine when to stop training.

Generation Stopping Criteria

  • Special tokens like "end of sequence" signal when generation should stop, preventing infinite loops in text generation processes.

Hidden Layer Size Considerations

Trade-offs in Hidden Layer Dimensions

  • The size of hidden layers impacts how rich embeddings are for downstream tasks; larger vectors may be needed for complex tasks while simpler tasks might require smaller vectors.
  • Factors influencing hidden layer size include task complexity and sensitivity to latency or cost. Typical embedding sizes range from hundreds to thousands.

Contextualizing Words: Addressing Ambiguity

Distinguishing Contextual Meanings

  • Addressing words that are spelled identically but have different meanings requires methods that contextualize these words within their sentences.

Sequential Nature of Text Representation

Moving Beyond Simple Averages

  • Averaging word representations loses meaning and order; thus, models must capture the sequential nature of text effectively.

Introduction to RNN Architecture

  • Recurrent Neural Networks (RNNs) maintain a hidden state that represents the sentence so far while processing tokens sequentially, allowing them to consider the order of words effectively.

Understanding RNNs and Their Limitations

Overview of RNN Processing

  • The processing in Recurrent Neural Networks (RNNs) begins with dummy hidden states, often denoted as A or H, which represent the hidden state activation or context vector.
  • At each time step, the model considers both the current word and the accumulated meaning from previous words to produce an output vector aimed at predicting the next word.

Hidden States and Word Order

  • Hidden states serve as representations of the sequence processed so far, allowing RNNs to account for word order in a more natural manner.
  • The process involves tracking hidden states through multiple iterations to predict subsequent words based on both current input and past context.
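A single RNN update can be sketched as follows; the weights and toy dimensions are assumptions for illustration, and the same weights are reused at every time step, with only the hidden state carrying information forward:

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One Elman-RNN step: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)."""
    n = len(h_prev)
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(n))
            + sum(W_x[i][j] * x[j] for j in range(len(x)))
            + b[i]
        )
        for i in range(n)
    ]

# Toy sizes: hidden size 2, input size 3 (one-hot tokens). Values are made up.
W_h = [[0.1, 0.2], [0.0, 0.1]]
W_x = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.1]]
b = [0.0, 0.0]

h = [0.0, 0.0]                                   # dummy initial hidden state
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):      # tokens processed in order
    h = rnn_step(h, x, W_h, W_x, b)
print(h)                                         # summarizes the whole sequence
```

Because each step folds the new token into the running hidden state, feeding the same tokens in a different order produces a different final state, which is how the RNN accounts for word order.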

Applications of RNNs

  • For classification tasks like sentiment analysis, the last hidden state can be projected into a prediction space (e.g., positive or negative sentiment).
  • In multi-classification scenarios, representations of specific tokens are used for predictions; for generation tasks, a final context vector is utilized to decode outputs.

Challenges Faced by RNNs

  • Despite their utility, RNNs struggle with long-range dependencies due to encapsulating sentence meaning solely within hidden states.
  • This limitation led to the development of Long Short-Term Memory networks (LSTMs), designed to better track important information over longer sequences.

Vanishing Gradient Problem

  • A significant issue with RNN-based methods is forgetting past information due to vanishing gradients during backpropagation through time.
  • When updating weights based on predictions that depend on earlier hidden states, multiplying values less than one can lead to diminishing updates—resulting in ineffective learning.

Summary of Key Concepts

  • The goal is effective text representation. Initial methods like Word2Vec had limitations regarding context awareness and word order.
  • While newer methods consider sequential data better than earlier models, they still face challenges such as slow computations and difficulties managing long sequences due to vanishing gradients.

Understanding Attention Mechanisms in Neural Networks

The Need for Attention in Long Sequences

  • Training models to predict words requires computing hidden states from previous inputs, which becomes time-consuming with long sequences.
  • Traditional RNNs process words sequentially, making it difficult to access relevant past information when generating translations or predictions.
  • Attention mechanisms create direct links between the current prediction and relevant past tokens, addressing long-range dependency issues.

Introduction of Self-Attention

  • The transformer architecture introduced in 2017 relies heavily on attention mechanisms, as highlighted by its foundational paper titled "Attention is All You Need."
  • Self-attention allows models to connect all parts of the input text simultaneously rather than sequentially, improving efficiency and performance.
  • Each token's representation can be contextually unique due to self-attention, allowing for nuanced understanding based on surrounding tokens.

Key Concepts: Query, Key, and Value

  • In self-attention computations, terms "query," "key," and "value" (Q, K, V) are used to express relationships between tokens.
  • A query compares itself against keys from other tokens to determine similarity; corresponding values are then weighted based on this similarity.

Matrix Representation and Computational Efficiency

  • Expressing self-attention computations in matrix format leverages GPU capabilities for efficient processing across entire sequences.
  • The softmax function is utilized to assign weights based on the importance of each value relative to the query-key comparison.

Learning Mechanisms Behind Keys and Values

  • Keys and values are learned quantities that help determine similarities; they are not fixed but derived through model training.
  • The dot product between queries and keys helps establish weights for a weighted average of values during computation.
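The query/key/value computation described above can be written out directly: dot products between queries and keys give scores, softmax turns them into weights, and the output is a weighted average of the values. The matrices below are toy values for illustration; in practice Q, K, V come from learned projections of the token embeddings.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:                               # one query (row of Q) at a time
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)             # how much each token matters to q
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy self-attention over 3 tokens with d_k = 2 (values assumed).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])         # each row is a weighted mix of V
```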

Understanding the Architecture of Self-Attention Mechanisms

Overview of the Architecture

  • The architecture consists of two main components: an encoder (left side) and a decoder (right side), primarily used for translation tasks.
  • Input text in the source language is processed by the encoder, while the decoder predicts the target language output based on this input.

Encoder Functionality

  • The encoder computes meaningful embeddings from input text using a multi-head attention layer, allowing tokens to attend to one another for better representation.
  • After processing through multi-head attention, a feedforward layer helps learn additional projections, resulting in rich token representations.

Decoder Process

  • Translation begins with a beginning-of-sequence (BOS) token; subsequent predictions utilize representations from both the encoded input and previously decoded tokens.
  • The query from the decoder identifies which words from the input are significant for predicting the next word, while keys and values come from the encoder's output.

Attention Layers Explained

  • A masked self-attention layer in the decoder focuses only on previously translated tokens to predict subsequent words without looking ahead.
  • Two types of attention layers exist: one in the encoder that computes embeddings as functions of themselves, and another in the decoder that considers both previous outputs and inputs.
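The masking described above is implemented by setting the attention scores of future positions to negative infinity before the softmax, so they receive zero weight. A small sketch (the score values are made up):

```python
import math

def apply_causal_mask(scores):
    """Causal mask: position i may only attend to positions j <= i.
    Future positions get -inf, so softmax assigns them weight 0."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy 3x3 attention-score matrix (assumed values).
scores = [[0.5, 0.1, 0.3], [0.2, 0.4, 0.1], [0.3, 0.2, 0.5]]
weights = [softmax(row) for row in apply_causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
# Row 0 puts all weight on token 0; row 1 ignores token 2; row 2 sees all three.
```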

Position Encoding Importance

  • Unlike RNN models that process sequentially, this architecture lacks inherent order due to direct links between tokens; thus, position encodings are necessary to indicate word positions within sequences.

Summary of Translation Steps

  • The translation process involves tokenizing text into units, learning embeddings for these tokens, adding positional encoding, passing through an encoder with multi-head attention and feedforward networks before starting decoding with BOS token.

Understanding Self-Attention and Multi-Head Mechanisms in Neural Networks

Overview of Feedforward Neural Networks

  • The process involves using a feedforward neural network to generate a vector that is passed through a softmax function, which predicts the next word based on vocabulary size.
  • This method allows for flexibility in predicting the next word, raising questions about its implementation.

Clarifying the Concept of "Heads" in Self-Attention

  • The term "head" refers to projection matrices used to create queries, keys, and values during self-attention computations.
  • Multiple heads enable the model to learn different projections, enhancing its ability to identify various associations between vectors.

Multi-Head Attention Explained

  • Running self-attention computations in parallel with different projection matrices is akin to using multiple filters in convolutional networks from computer vision.
  • Typically, there are no constraints on how these projections differ; models often learn varied representations naturally.

Label Smoothing Technique

  • Label smoothing addresses the inherent ambiguity in language prediction by suggesting that predictions may not be absolute (e.g., predicting a word with some uncertainty).
  • Instead of a one-hot target like (1, 0, 0), label smoothing assigns 1 − ε to the true class and ε/(v − 1) to each of the other v − 1 classes, promoting uncertainty in predictions.
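This adjustment is a one-liner over the one-hot target vector (ε = 0.1 below is a typical value, used here as an assumption):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: the true class gets 1 - eps, and the remaining eps
    of probability mass is spread uniformly over the other v - 1 classes."""
    v = len(one_hot)
    return [(1 - eps) if y == 1 else eps / (v - 1) for y in one_hot]

# Vocabulary of 5 classes: the hard target (1, 0, 0, 0, 0) becomes softer.
print(smooth_labels([1, 0, 0, 0, 0]))  # [0.9, 0.025, 0.025, 0.025, 0.025]
```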

Impact of Label Smoothing on Model Performance

  • This technique generally leads to improved performance metrics like BLEU scores for translation tasks by making models less confident about their predictions.

Transitioning to Practical Examples

  • Following theoretical discussions, an end-to-end example will illustrate how transformers operate practically.

Tokenization Process

  • The example begins with tokenization where arbitrary decompositions convert text into tokens while indicating sequence start and end with BOS and EOS tokens.

Position-Aware Embeddings

  • Each token representation includes learned embeddings combined with positional information via sine and cosine functions added element-wise.
  • This results in position-aware embeddings structured as a matrix reflecting both embedding size and sequence length before being processed through the encoder.
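The sinusoidal scheme from the original Transformer paper can be sketched as follows; the embedding values below are toy assumptions, and only the positional-encoding formula itself comes from the paper:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd use cosine,
    with wavelengths increasing geometrically across dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
# Added element-wise to the token embeddings before entering the encoder.
embeddings = [[0.1] * 6 for _ in range(4)]            # toy token embeddings
position_aware = [[e + p for e, p in zip(er, pr)]
                  for er, pr in zip(embeddings, pe)]
print([round(x, 3) for x in pe[1]])                   # encoding for position 1
```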

Understanding Self-Attention Mechanism

Projection of Inputs

  • The self-attention mechanism involves projecting input embeddings into three distinct spaces: queries (Q), keys (K), and values (V) using learned projection matrices Wq, Wk, and Wv.

Computation Process

  • Each row in the query matrix Q represents a specific query, while the transposed key matrix K^T has columns representing key representations for each token. This setup allows for effective interaction between queries and keys.

Matrix Multiplication Insights

  • The multiplication of Q and K^T results in a matrix where each row indicates how a query relates to all keys, leading to a probability distribution after applying softmax.

Importance of Scaling

  • Scaling by the square root of dk is crucial as it normalizes dot products; this prevents large dimensions from skewing results when calculating attention scores.

Multi-head Attention Explained

  • Multi-head attention processes multiple sets of projections in parallel, concatenating them at the end. A final projection matrix Wo transforms these concatenated outputs back to the original embedding dimension.
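Putting the pieces together, multi-head attention runs one attention computation per head with its own projections, concatenates the results, and applies the final projection Wo. The sketch below uses toy 1-dimensional heads and hand-picked projection matrices purely for illustration:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    return matmul([softmax(row) for row in scores], V)

def multi_head_attention(X, heads, W_o):
    """Each head has its own (Wq, Wk, Wv); head outputs are concatenated along
    the feature axis, then projected back to d_model with W_o."""
    head_outs = [attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
                 for Wq, Wk, Wv in heads]
    concat = [sum((h[i] for h in head_outs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Toy setup: 2 tokens, d_model = 2, two heads of size 1 (all weights assumed).
X = [[1.0, 0.0], [0.0, 1.0]]
P1 = [[1.0], [0.0]]                    # head 1 attends via the first feature
P2 = [[0.0], [1.0]]                    # head 2 attends via the second feature
heads = [(P1, P1, P1), (P2, P2, P2)]
W_o = [[1.0, 0.0], [0.0, 1.0]]         # identity output projection
out = multi_head_attention(X, heads, W_o)
print([[round(v, 3) for v in row] for row in out])
```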

Gradient Descent's Role in Representation Learning

Variability in Representations

  • Gradient descent enables models to learn diverse representations rather than identical ones. The model's objective function drives it towards creating unique features that enhance learning efficiency.

Feedforward Networks and Encoder Structure

Hidden Layer Dimensions

  • In contrast to Word2Vec, the hidden layer in this architecture typically has a larger dimension than both input and output layers, allowing for more complex feature learning.

Stacked Encoder Architecture

  • The model consists of multiple encoder modules stacked together, producing context-aware embeddings that are subsequently fed into decoders for further processing.

Decoding Process Initiation

Beginning with BOS Token

  • Decoding starts with a Beginning Of Sequence (BOS) token which signals the model to predict subsequent words. Initially, this token attends only to itself during self-attention processing before interacting with other tokens later on.

Decoding Process in Language Models

Overview of the Decoding Mechanism

  • The decoding process involves multiple iterations (n times), culminating in a linear projection followed by a softmax layer, which produces a probability distribution over the vocabulary for the next word.
  • After determining the next token, represented as a one-hot encoding, this embedding is reintroduced into the decoder to continue generating subsequent tokens.
  • The iterative nature of this process allows for continuous prediction and refinement of outputs based on previous tokens generated.
  • The use of softmax ensures that the output probabilities sum to one, facilitating selection from among all possible vocabulary words.
  • This method highlights how language models can effectively generate coherent text by leveraging learned embeddings and probabilistic predictions.
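The feed-back loop above can be sketched as a greedy decoding routine. Here `fake_model` is an assumed stand-in that returns hard-coded next-token distributions; a real system would run the full decoder stack plus softmax at this step.

```python
vocab = ["<bos>", "<eos>", "le", "chat", "dort"]

def fake_model(tokens):
    """Stand-in for the decoder: returns a probability distribution over the
    vocabulary for the next token, given the tokens generated so far."""
    table = {
        ("<bos>",): [0.0, 0.0, 0.9, 0.05, 0.05],
        ("<bos>", "le"): [0.0, 0.0, 0.05, 0.9, 0.05],
        ("<bos>", "le", "chat"): [0.0, 0.05, 0.0, 0.05, 0.9],
        ("<bos>", "le", "chat", "dort"): [0.05, 0.9, 0.0, 0.05, 0.0],
    }
    return table[tuple(tokens)]

tokens = ["<bos>"]
while tokens[-1] != "<eos>" and len(tokens) < 10:   # EOS token stops generation
    probs = fake_model(tokens)
    next_id = max(range(len(vocab)), key=probs.__getitem__)  # greedy argmax
    tokens.append(vocab[next_id])                   # feed the token back in
print(tokens)  # ['<bos>', 'le', 'chat', 'dort', '<eos>']
```

Greedy argmax is the simplest selection rule; sampling or beam search can be swapped in at the same point in the loop.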
Video description

For more information about Stanford's graduate programs, visit: https://online.stanford.edu/graduate-education

September 26, 2025

This lecture covers:
  • Background on NLP and tasks
  • Tokenization
  • Embeddings
  • Word2vec, RNN, LSTM
  • Attention mechanism
  • Transformer architecture

To follow along with the course schedule and syllabus, visit: https://cme295.stanford.edu/syllabus/

Chapters:
00:00:00 Introduction
00:03:54 Class logistics
00:09:40 NLP overview
00:22:57 Tokenization
00:30:28 Word representation
00:53:23 Recurrent neural networks
01:06:47 Self-attention mechanism
01:13:53 Transformer architecture
01:29:53 Detailed example

Afshine Amidi is an Adjunct Lecturer at Stanford University. Shervine Amidi is an Adjunct Lecturer at Stanford University.

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rOCXd21gf0CF4xr35yINeOy