Build an LLM from Scratch 2: Working with text data
Introduction to Chapter 2: Working with Text Data
Overview of the Chapter
- The focus of this chapter is on preparing the dataset for training a Large Language Model (LLM).
- Key stages in building an LLM include setting up the dataset, coding the attention mechanism, and defining the LLM architecture. This chapter specifically addresses data preparation as the first stage before pre-training and fine-tuning.
Tokenization Process
- The process involves taking raw text, tokenizing it into smaller chunks, and converting these tokens into token IDs that can be encoded into vectors for processing by the LLM.
- Vectors serve as numeric representations of text, which are essential for optimizing weight parameters during pre-training. The current focus remains solely on preparing input data for the LLM rather than delving into pre-training details.
Understanding Tokenization
Importance of Tokenization
- Tokenization breaks down input text into individual components or tokens, which is crucial for further processing by the model. An example using a library called tiktoken will illustrate this process later in the video.
Example of Tokenization
- A demonstration shows how a simple phrase like "hello world" is broken down into tokens such as "hello", "world", and punctuation marks. Each token corresponds to a unique token ID that will be discussed later.
Dataset Preparation
Selecting a Dataset
- The chosen dataset for this tutorial is "the-verdict," a public domain short story by Edith Wharton. It was selected due to its simplicity and readability, making it suitable for educational purposes without copyright concerns.
Downloading the Dataset
- Instructions are provided on how to download this dataset using standard Python libraries like `os` and `urllib.request`. A GitHub repository link is also shared where users can access this dataset if needed in the future.
Working with Files in Python
Reading Dataset Files
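The loading step can be sketched in plain Python; the filename `the-verdict.txt` and the stand-in content are assumptions so the snippet runs on its own:

```python
from pathlib import Path

# "the-verdict.txt" is an assumed filename; a short stand-in text is
# written first so this sketch runs without downloading anything.
path = Path("the-verdict.txt")
if not path.exists():
    path.write_text("I had always thought Jack Gisburn rather a cheap genius.",
                    encoding="utf-8")

with open(path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:50])
```

With the real short story in place, the character count should come out near the 20,479 mentioned below.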
Tokenization and Text Processing in Python
Introduction to Tokenization
- The speaker discusses verifying the successful loading of a text dataset, emphasizing that it should contain approximately 20,479 characters if loaded correctly.
- An introduction to tokenization using regular expressions is presented as a preliminary step before utilizing the tiktoken library for more advanced tokenization techniques.
Regular Expressions for Tokenization
- The speaker acknowledges their limited expertise with regular expressions but notes that tools like ChatGPT can simplify understanding them. The goal is to split text into individual tokens.
- A demonstration of using a regular expression to tokenize simple text into words and whitespace characters is provided, highlighting the need for punctuation to be treated as separate tokens.
Advanced Tokenization Techniques
- A more sophisticated regular expression is introduced to include punctuation as separate tokens, improving the initial tokenization method.
- The speaker suggests an optional step of stripping whitespace characters from the output, showcasing how this can alter the results of tokenization.
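The two-step approach described above can be sketched as follows; the exact regular expression is one plausible choice for splitting on punctuation and whitespace, not the only one:

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace; the capture group
# keeps the delimiters as tokens instead of discarding them.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Optional step: strip whitespace-only entries from the output.
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
# → ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Note how the punctuation marks and the `--` now appear as separate tokens rather than sticking to the adjacent words.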
Handling Complex Cases
- The discussion shifts towards handling more complex cases in tokenization where special characters may cause issues. A refined approach with improved regular expressions is suggested.
- After preparing text by splitting it into individual tokens and punctuation characters, the speaker prepares to apply these methods on real text data.
Converting Tokens into Token IDs
- Transitioning from tokenizing raw text, the next focus is on converting these tokens into unique integer identifiers known as token IDs.
- The process begins with building a vocabulary that maps each unique word in the dataset to an integer. This involves breaking down text into tokens and sorting them alphabetically while removing duplicates.
Creating Vocabulary from Tokens
- Using a sample sentence ("the quick brown fox jumps over the lazy dog"), steps are outlined for creating a vocabulary by eliminating duplicates and sorting unique words.
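A minimal sketch of those steps, using the sample sentence above:

```python
sentence = "the quick brown fox jumps over the lazy dog"
words = sentence.split()

# Remove duplicates, sort alphabetically, then map each word to an integer.
vocab = {token: idx for idx, token in enumerate(sorted(set(words)))}
print(vocab)
# {'brown': 0, 'dog': 1, 'fox': 2, 'jumps': 3, 'lazy': 4,
#  'over': 5, 'quick': 6, 'the': 7}
```

Because "the" appears twice, the vocabulary ends up with eight entries instead of nine, matching the IDs discussed below (e.g. "the" → 7, "brown" → 0).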
Vocabulary Building and Tokenization Process
Understanding Vocabulary Size
- The concept of vocabulary size refers to the number of unique words present in a dataset, which is quantified as 1130 in this instance.
- To build the vocabulary, each token is assigned a unique integer. This process involves enumerating all unique words and mapping them to integers in ascending order.
Tokenizing Training Data
- Once the vocabulary is established, it is utilized to tokenize training data into token IDs. The vocabulary is sorted alphabetically for easy reference.
- For example, the word "the" corresponds to an integer ID (e.g., 7), while other words like "brown" and "dog" correspond to different integers (0 and 1 respectively).
Implementing a Simple Tokenizer Class
- A Python class named `Tokenizer` is introduced, with an `__init__` method that initializes the object by saving the vocabulary as a string-to-integer mapping.
- The class also includes an inverted mapping from integers back to strings, allowing for bidirectional conversion between tokens and their corresponding IDs.
Encoding Text into Token IDs
- The `encode` method processes input text using regular expressions to break it down into tokens. Each token is then converted into its respective token ID based on the established vocabulary.
- An example illustrates how passing a word like "Jack" through the tokenizer retrieves its corresponding ID from the vocabulary.
Decoding Token IDs Back into Text
- A `decode` method allows for reversing the encoding process by converting token IDs back into their original string representations.
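Putting the pieces together, a minimal sketch of such a class might look like this; the toy vocabulary and the exact regex are assumptions for illustration:

```python
import re

class Tokenizer:
    """Sketch of a simple vocabulary-based tokenizer."""
    def __init__(self, vocab):
        self.str_to_int = vocab                               # string -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}    # ID -> string

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the spaces inserted before punctuation characters
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

vocab = {"Jack": 0, "likes": 1, "tea": 2, ".": 3}  # toy vocabulary
tok = Tokenizer(vocab)
ids = tok.encode("Jack likes tea.")
print(ids)              # [0, 1, 2, 3]
print(tok.decode(ids))  # Jack likes tea.
```

The inverted mapping in `__init__` is what makes the round trip from text to IDs and back possible.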
SimpleTokenizer and Special Tokens
Introduction to SimpleTokenizer
- The concept of tokenization is introduced, illustrating how text is converted into integer representations.
- Upcoming discussions will focus on special token handling in the tokenizer.
Extending Vocabulary with Special Tokens
- The vocabulary created from training set tokens can be extended to include special tokens for unknown words or end-of-text indicators.
- This extension aims to enhance the tokenizer's ability to manage various text inputs effectively.
Handling Unknown Words
- A demonstration reveals a limitation of the current tokenizer when encountering unfamiliar words, resulting in a KeyError for "Hello".
- The issue arises because the small dataset used does not contain every common word, highlighting the importance of comprehensive training data.
Improving Tokenizer Functionality
- To address unknown words, an advanced algorithm can break them down into individual characters instead of causing errors.
- Suggestions are made to extend the list of tokens by adding new ones like end-of-text and unknown word tokens using Python's list methods.
Modifying SimpleTokenizer
- A simple modification allows the tokenizer to return an "unknown" label for unrecognized words, preventing crashes during processing.
- After reinitializing the modified tokenizer, it successfully handles previously problematic inputs by recognizing unknown tokens without errors.
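A minimal sketch of this modification with a toy vocabulary; the token names `<|unk|>` and `<|endoftext|>` follow the convention described above:

```python
import re

# Extend a toy token list with the two special tokens, then build the vocab.
tokens = ["a", "cheap", "genius"]
tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {tok: i for i, tok in enumerate(tokens)}

def encode(text):
    words = [t.strip() for t in re.split(r'([,.?!]|\s)', text) if t.strip()]
    # dict.get with a default maps out-of-vocabulary words to "<|unk|>"
    # instead of raising a KeyError.
    return [vocab.get(w, vocab["<|unk|>"]) for w in words]

print(encode("Hello a genius"))  # "Hello" is unknown → [4, 0, 2]
```

The only change from the crashing version is replacing the direct dictionary lookup with `vocab.get(..., vocab["<|unk|>"])`.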
Introduction to Byte Pair Encoding
Overview of Byte Pair Encoding (BPE)
- BPE is presented as a sophisticated algorithm that enhances tokenization capabilities beyond basic methods.
- It has been widely adopted in modern models like GPT series and Meta AI’s Llama 3 due to its effectiveness in managing tokenization challenges.
Addressing Shortcomings with BPE
Understanding Token Replacement in Language Models
The Challenge of Unknown Tokens
- In language models, unknown tokens are replaced with a special placeholder token when they are not found in the vocabulary. This approach leads to ambiguity as the model cannot differentiate between different unknown words.
- The limitation of this method is evident in real-world applications where specific names or terms not included in training data can confuse the model, rendering it ineffective for those instances.
Byte Pair Encoding (BPE) as a Solution
- The byte pair encoding algorithm allows for breaking down any word into smaller subtokens, enhancing the model's ability to handle previously unseen words without error.
- When encountering an unknown word, BPE decomposes it into known subwords or even individual characters if necessary, ensuring that the model does not fail due to unrecognized input.
Efficiency and Implementation of BPE
- Although using individual characters may be less efficient since one word can translate into multiple tokens, this method guarantees that processing will always succeed without errors.
- OpenAI has open-sourced their implementation of BPE within their GPT-2 GitHub repository. However, they do not provide details on the training methods used for these tokenizers.
Advanced Implementations and Resources
- For those interested in deeper insights into BPE, a personal implementation from scratch is available through a Jupyter notebook linked in supplementary materials. This includes step-by-step guidance on training and loading OpenAI weights.
- While exploring advanced topics like BPE could warrant its own book due to complexity, it remains optional for readers focused primarily on large language models (LLMs).
Practical Application with Tiktoken Library
- The tiktoken library from OpenAI will be utilized for practical implementations of tokenization throughout later chapters. It is noted for its efficiency compared to other implementations due to core functions being written in Rust.
Using the GPT-2 Tokenizer for Model Compatibility
Importing and Installing the Tiktoken Library
- To ensure compatibility with the model discussed in Chapter 5, import the `tiktoken` library. If your Python environment is set up correctly, this should work seamlessly.
- If you encounter an import error, it likely means that the `tiktoken` library isn't installed. You can install it using either `uv pip install` or `pip install tiktoken`.
- The installation command is commented out in the code cell since it's already installed on the speaker's system. It's a good practice to check which version of a library is being used to avoid discrepancies in future runs.
Version Control and Tokenization
- The speaker emphasizes checking library versions to prevent unexpected results due to version changes over time. Versions like 0.9, 0.7, or 0.6 are acceptable for use.
- A new tokenizer object will be instantiated using the GPT-2 tokenizer with its full vocabulary without needing additional training.
Encoding and Decoding Text
- The encoding process mirrors that of a simple tokenizer; methods such as `encode` and `decode` function similarly across both tokenizers.
- An advanced example demonstrates how arbitrary text can be encoded but raises an error due to special tokens not being enabled by default.
Special Tokens and Document Separation
- The end-of-text token is crucial for indicating where one document ends and another begins during dataset preparation for training LLMs (Large Language Models).
- This special token must be explicitly added when using the GPT tokenizer; otherwise, errors may occur during processing.
Understanding Byte Pair Encoding Algorithm
- The vocabulary size of this tokenizer is 50,257 entries, highlighting its capacity for diverse text representation.
- While understanding byte pair encoding can be complex, it's essential for breaking down text into subword tokens—an algorithm fundamental to LLM implementation.
Data Sampling with a Sliding Window
Efficient Token ID Management
- Focus shifts towards efficiently providing smaller chunks of token IDs to LLM models since they cannot process all tokens simultaneously.
Predictive Nature of LLM Training
Understanding LLMs: Predicting the Next Token
The Mechanism of Learning in LLMs
- Large Language Models (LLMs) learn to predict the next token in a sequence, making them efficient for training on large datasets without needing labeled data.
- By hiding parts of text and providing one token as input, LLMs can learn to predict the subsequent token, which serves as their target label.
Data Preparation for Training
- The "the-verdict" dataset by Edith Wharton is used to demonstrate how to encode text into token IDs using the tiktoken tokenizer.
- For visualization purposes, smaller chunks of four tokens are utilized; however, larger chunks (up to 8,000 tokens) are typically used in practice.
Input and Target Creation
- The dataset is prepared such that targets are derived from inputs shifted by one position, allowing the model to predict the next token effectively.
- An example illustrates how a sample text with 5095 tokens is processed with a context size of four tokens for training.
Visualizing Token Overlaps
- Overlapping tokens between inputs and targets help clarify which token should be predicted based on received input sequences.
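The shift-by-one relationship can be illustrated with plain Python lists; the token IDs below are made up for demonstration:

```python
token_ids = [290, 4920, 2241, 287, 257, 4489, 64, 319, 262]  # stand-in IDs
context_size = 4

x = token_ids[:context_size]        # inputs
y = token_ids[1:context_size + 1]   # targets, shifted by one position

# Each growing prefix predicts the token that follows it.
for i in range(1, context_size + 1):
    context = token_ids[:i]
    desired = token_ids[i]
    print(context, "---->", desired)
```

Every token in `y` is both a target for the preceding context and part of the input for the next prediction, which is the overlap being visualized here.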
Utilizing PyTorch for Efficiency
- PyTorch is introduced as a widely-used deep learning framework that will facilitate efficient data loading and processing during model training.
Installation and Versioning of PyTorch
Installing PyTorch with CUDA Support
- The default installation command for PyTorch on Linux is `pip install torch`, which includes CUDA 12.4 libraries for GPU support.
- Users can specify different versions of CUDA if needed, as new versions may be available by the time of installation.
Version Compatibility
- The speaker mentions using version 2.6 of PyTorch, noting that they began writing the book with version 2.0 two years prior.
- Testing has shown no significant differences in code functionality across versions from 2.0 to 2.6, allowing users to utilize older versions without issues.
Recommendations for New Users
- For those unfamiliar with PyTorch, it is described as a comprehensive deep learning framework primarily used here as a linear algebra library with some deep learning functions.
- An appendix (Appendix A) is provided in the book to cover essential concepts necessary for basic LLM training and development.
Learning Resources and Efficiency
Suggested Learning Path
- While Appendix A offers a compact overview sufficient for this book's purposes, readers are encouraged to explore full courses or books on PyTorch for deeper understanding.
- The aim is to prevent readers from spending excessive time learning before engaging with the material in the book.
Data Preparation Techniques
Dataset Creation Using Sliding Window Technique
- The discussion shifts back to data sampling methods, specifically preparing datasets by creating input chunks based on context size (e.g., size of 4).
Input and Target Generation
- Inputs are created alongside targets that are shifted by one position; this method allows simultaneous feeding into the LLM during training.
Utilizing PyTorch Dataset Class
Advantages of Dataset Class
- Employing a PyTorch Dataset class facilitates efficient batch creation, data shuffling, and multi-process data handling.
Tokenization Process
- The process involves tokenizing input text and generating chunks while maintaining an organized structure for inputs and targets.
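A sketch of such a Dataset class; it takes pre-tokenized IDs so the example stays self-contained, and the class name and parameters are illustrative rather than the chapter's exact code:

```python
import torch
from torch.utils.data import Dataset

class GPTDataset(Dataset):
    """Sliding-window dataset over a sequence of token IDs."""
    def __init__(self, token_ids, max_length, stride):
        self.input_ids, self.target_ids = [], []
        # Slide a window of max_length over the IDs, advancing by stride;
        # targets are the same window shifted by one position.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

ids = list(range(20))  # stand-in token IDs
ds = GPTDataset(ids, max_length=4, stride=4)
x, y = ds[0]
print(x, y)  # tensor([0, 1, 2, 3]) tensor([1, 2, 3, 4])
```

Implementing `__len__` and `__getitem__` is what lets PyTorch's `DataLoader` handle batching, shuffling, and multi-process loading on top of this class.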
Handling Large Datasets
Memory Management Considerations
Understanding Data Loaders for LLM Training
Overview of Data Loader Functionality
- The discussion begins with the importance of data loaders in training large language models (LLMs), emphasizing a simplified approach for educational purposes.
- A function is defined to create a data loader, allowing customization of parameters such as batch size, maximum context length, stride, shuffling options, and whether to drop the last batch.
Handling Batch Sizes and Loss Spikes
- An example illustrates how datasets are divided into batches; if the dataset size isn't divisible by the batch size, it can lead to small final batches that may cause loss spikes during training.
- It's recommended to drop the last batch when training over multiple epochs, avoiding instabilities caused by varying batch sizes.
Configuration Considerations
- The number of worker processes (`num_workers`) is discussed; it's set to zero in notebooks due to restrictions on spawning subprocesses. This choice ensures compatibility across different environments.
- While using larger numbers for `num_workers` can improve performance in scripts, setting it to zero is safer for notebook execution.
Tokenization and Dataset Creation
- The tokenizer used is identified as "tiktoken," which plays a crucial role in preparing text data. The dataset is created from raw text input.
- Demonstration involves passing a short story ("the-verdict") into the dataset and data loader setup while maintaining simplicity in configuration.
Iterating Over Datasets
- Initial settings include a batch size of 1 and context length/stride also set at 1. This allows observation of how inputs and targets are structured within each batch.
- During iteration over the dataset, it becomes clear that targets are shifted versions of inputs. However, this overlap can lead to overfitting issues.
Adjusting Stride for Unique Tokens
- Increasing the stride from 1 to 4 eliminates token overlap between iterations, ensuring unique tokens are presented without redundancy.
- A visual representation clarifies how adjusting stride affects token selection; moving by four positions prevents repeated exposure during training.
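The effect of the stride can be illustrated with plain Python, using stand-in token IDs:

```python
ids = list(range(12))  # stand-in token IDs
max_length = 4

def windows(stride):
    # Collect input chunks of max_length, advancing the start by stride.
    return [ids[i:i + max_length]
            for i in range(0, len(ids) - max_length, stride)]

print(windows(1)[:3])  # [[0,1,2,3], [1,2,3,4], [2,3,4,5]] -- heavy overlap
print(windows(4))      # [[0,1,2,3], [4,5,6,7]] -- no repeated tokens
```

With `stride=1` each token reappears in up to four windows, while `stride=4` (equal to `max_length`) partitions the sequence into non-overlapping chunks.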
Efficient Batch Size Utilization
Data Loader and Token Embeddings in LLMs
Setting Up the Data Loader
- The speaker discusses configuring a data loader with a batch size of 8, max length of 4, and stride of 4 to facilitate clearer labeling and organization of training examples.
- Each row in the data loader represents one training example, allowing for efficient iteration over multiple inputs and their corresponding targets.
- The implementation aims to prepare the data for loading into a large language model (LLM) efficiently.
Transitioning from Token IDs to Token Embeddings
- The discussion shifts towards creating token embeddings from token IDs, which involves converting integer values into embedding vectors containing real numbers that can be optimized.
- An example is provided using smaller token IDs represented as PyTorch tensors for better visualization during the explanation of embedding creation.
Understanding the Embedding Layer
- The speaker introduces an embedding layer typically integrated within LLM architectures, highlighting its role in mapping token IDs to vector representations based on vocabulary size and output dimensions.
- For demonstration purposes, a simplified vocabulary size of six and an output dimension of three are chosen to illustrate how embeddings work without overwhelming complexity.
Importance of Random Seeds in Initialization
- A random seed is initialized to ensure consistent results across different instantiations of the network layer due to varying random weights.
- The concept of weight parameters is introduced; these are adjusted during neural network training processes like LLM training.
Exploring Matrix Operations with Embedding Layers
- The speaker explains that an embedding layer serves as an efficient lookup mechanism related to linear layers and matrix multiplications but reassures those unfamiliar with these concepts that they will be explained intuitively later on.
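A sketch of the toy setup described above, using PyTorch's `torch.nn.Embedding`; the seed value is an assumption:

```python
import torch

torch.manual_seed(123)  # assumed seed, for reproducible random weights

vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# The weight matrix has one 3-dimensional row per token ID.
print(embedding_layer.weight.shape)  # torch.Size([6, 3])

# Looking up token ID 3 simply returns row 3 of the weight matrix.
token_id = torch.tensor([3])
print(embedding_layer(token_id))
```

This lookup behavior is why the embedding layer can be viewed as an efficient shortcut for multiplying a one-hot vector with a linear layer's weight matrix.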
Understanding Embeddings and Positional Information in LLMs
Introduction to Embedding Layers
- The embedding layer converts token IDs into vectors, with a total of 50,257 rows representing the vocabulary size.
- Initial weights in the embedding layer are random; these numbers will be optimized during training to improve model performance.
- An embedding is defined as a vector from the embedding matrix that will be trained later, emphasizing its role in language models.
Scaling Up Token Embeddings
- After creating token embeddings, positional information will be added for better context understanding.
- The tokenizer has a vocabulary size of 50,257 words; this is passed to the embedding layer for accurate representation.
- The output dimension for training is increased to 256 from smaller toy examples, making it more realistic for practical applications.
Data Loader and Batch Processing
- A data loader creates batches of size eight with a maximum length of four tokens per example.
- Token IDs are converted into embedding vectors using the data batch created earlier, resulting in three-dimensional tensors.
Converting Token IDs to Vectors
- Each example retains a batch size of eight while each token ID corresponds to a 256-dimensional vector.
- Visualization of high-dimensional vectors can be challenging; however, individual vectors can still be printed out for analysis.
Adding Positional Information
- Positional embeddings provide additional context by indicating where each word appears within an input sequence.
- Using identical token IDs results in identical vectors unless positional information is included; this highlights the need for differentiation based on position.
Understanding RoPE Embeddings and Tokenization in GPT Models
Overview of RoPE Embeddings
- The discussion begins with an introduction to RoPE (Rotary Position Embeddings), which are more complex than the traditional methods used in models like GPT-2.
- In contrast to RoPE, GPT-2 utilizes a simpler approach with two embedding layers: token embeddings and position embeddings.
Token and Position Embedding Layers
- The input consists of identical token embeddings for repeated words (e.g., "fox fox fox fox"), but positional information is added to differentiate their locations.
- Initial position embedding values start as random numbers but are optimized during the training process, similar to token embeddings.
Input Length and Tensor Creation
- A maximum length for inputs is set; in this example, it supports only four tokens. Larger values will be used during actual LLM training.
- The `torch.arange` function generates sequential order values (0, 1, 2, 3), creating a tensor that represents positions for the input tokens.
Broadcasting Mechanism in PyTorch
- When adding position embeddings to token embeddings, broadcasting occurs where dimensions align without duplication across batch sizes.
- This allows the model to efficiently combine input and positional information while maintaining consistent dimensions across examples.
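A sketch of this combination step; the layer sizes follow the values mentioned above (vocabulary 50,257, dimension 256, context length 4, batch size 8), while the random input IDs are stand-ins:

```python
import torch

torch.manual_seed(123)  # assumed seed
context_length, output_dim = 4, 256

# Token embeddings for a batch of 8 examples with 4 token IDs each.
token_embedding_layer = torch.nn.Embedding(50257, output_dim)
inputs = torch.randint(0, 50257, (8, context_length))     # stand-in batch
token_embeddings = token_embedding_layer(inputs)          # [8, 4, 256]

# One positional vector per position 0..3, via torch.arange.
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # [4, 256]

# Broadcasting adds the [4, 256] position tensor to every example in the
# batch without duplicating it across the batch dimension.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)  # torch.Size([8, 4, 256])
```

The resulting `input_embeddings` tensor is what gets fed into the LLM in later chapters.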
Summary of Input Pipeline
- The entire input pipeline includes breaking down text into tokens, converting them into IDs, generating token embeddings, creating position embeddings, and finally combining them into input embeddings.
Transitioning to Model Architecture