The $5 Million Reading Game Behind Every AI | Pretraining
Understanding the Complexity of Building Modern AI
The Basics of Training a Language Model
- Building a modern AI model demands extensive computational resources: 2,048 GPUs running for nearly two months to process 14.8 trillion tokens.
- The process is often oversimplified; it draws on three engineering disciplines: curating data, distributing the model across GPUs, and keeping the system stable during long training runs.
The Role of Mathematics in Language Models
- Mathematically, language modeling resembles a reader guessing the next word and then checking the guess against the text; this guess-and-check cycle repeats trillions of times during training.
- Over time, the model learns patterns in language without explicit instruction on grammar or logic, developing an understanding based solely on context.
Data Curation: Sourcing Quality Information
Filtering Raw Data from the Internet
- While the internet provides vast amounts of text data, much of it is unusable; effective filtering is crucial to obtain quality training data.
- A multi-step pipeline processes raw HTML from web sources: extracting text, filtering out low-quality content, and anonymizing personal information.
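The three stages named above (extract text, filter low-quality content, anonymize personal information) can be sketched in a few lines. This is a minimal illustration, not the pipeline any real dataset uses: the quality heuristic, its thresholds, and the email-only anonymizer are assumptions chosen to keep the example short, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Stage 1: turn raw HTML into plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def looks_low_quality(text: str, min_words: int = 5) -> bool:
    """Stage 2 (assumed heuristic): reject very short or highly repetitive text."""
    words = text.lower().split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > 0.5


# Stage 3 (assumed scope): only email addresses are masked in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL>", text)


def process_page(html: str):
    """Run all three stages; return None when the page is filtered out."""
    text = extract_text(html)
    if looks_low_quality(text):
        return None
    return anonymize(text)


good_page = (
    "<html><body><script>var x = 1;</script>"
    "<p>Contact the author at jane.doe@example.com for the full dataset.</p>"
    "</body></html>"
)
spam_page = "<p>buy buy buy buy buy now</p>"

print(process_page(good_page))
print(process_page(spam_page))
```

Real pipelines use far more elaborate filters (language identification, deduplication, learned quality classifiers), but the shape is the same: each stage either transforms a document or discards it.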
Final Dataset Preparation
- After rigorous filtering, only about 12 megabytes per gigabyte of raw web data (roughly 1.2%) are retained for training.
- Surprisingly, a smaller subset known as FineWeb-Edu (1.3 trillion tokens) performs better than the larger dataset (15 trillion tokens), showing that quality trumps quantity in data selection.
Training Mechanics: Understanding Model Predictions
Predicting Next Words
- During each training step, the model predicts the next word by processing input through transformer layers and generating logits for potential candidates.
- A softmax function converts these logits into probabilities; even the top candidate ("mat", at 32%) receives a fairly low probability, so the model is far from certain.
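The logits-to-probabilities step can be shown with a small softmax function. The candidate tokens and logit values below are made up for illustration; they are not taken from the video.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates and logits, chosen only for the example.
candidates = ["blue", "green", "red", "dark"]
logits = [2.0, 1.5, 1.2, 0.3]

probs = softmax(logits)
for token, p in zip(candidates, probs):
    print(f"{token}: {p:.1%}")
```

Because every candidate receives some probability mass, the top choice can win while still sitting well below 50%.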
Loss Calculation and Model Improvement
- The loss value indicates how well the model's prediction aligns with actual outcomes; lower loss signifies improved performance over time.
- Over the course of training (up to 600,000 steps), perplexity falls from around 50,257, the value a uniform guess over a 50,257-token vocabulary would give, to just 17, indicating far more accurate predictions.
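Loss and perplexity are two views of the same quantity: perplexity is the exponential of the average cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm) and that the starting figure of 50,257 corresponds to a uniform guess over a vocabulary of that size.

```python
import math


def cross_entropy(prob_of_correct_token: float) -> float:
    """Loss for one prediction: negative log of the probability given to the true token."""
    return -math.log(prob_of_correct_token)


def perplexity(mean_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss)."""
    return math.exp(mean_loss)


vocab_size = 50_257  # assumed vocabulary size, matching the starting perplexity

# An untrained model that spreads probability evenly over the vocabulary.
uniform_loss = cross_entropy(1 / vocab_size)
print(f"loss at start: {uniform_loss:.2f}, perplexity: {perplexity(uniform_loss):.0f}")

# The loss that corresponds to the final perplexity of 17.
final_loss = math.log(17)
print(f"loss at end:   {final_loss:.2f}, perplexity: {perplexity(final_loss):.0f}")
```

Read this way, a perplexity of 17 means the model is, on average, about as uncertain as if it were choosing evenly among 17 tokens rather than 50,257.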
Technical Challenges: GPU Memory Management
Managing Large Models Across Multiple GPUs
- A moderately sized language model needs over 1.12 TB of memory for its weights and gradients during training, far more than any single GPU can hold, so it must be spread across multiple GPUs.
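One set of numbers that reproduces the 1.12 TB figure is sketched below. Both inputs are assumptions, not values stated in this summary: a 70-billion-parameter model, and 16 bytes of training state per parameter (a common mixed-precision accounting: 2 bytes for the weight, 2 for the gradient, and 12 for a full-precision weight copy plus two optimizer moments). The 80 GB GPU capacity is likewise assumed.

```python
def training_memory_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Memory for weights, gradients, and optimizer state (activations excluded)."""
    return num_params * bytes_per_param


def min_gpus(total_bytes: int, gpu_capacity_bytes: int) -> int:
    """Smallest number of GPUs whose combined memory can hold total_bytes."""
    return -(-total_bytes // gpu_capacity_bytes)  # ceiling division on integers


NUM_PARAMS = 70_000_000_000        # assumption: 70B parameters
GPU_CAPACITY = 80_000_000_000      # assumption: 80 GB per GPU

total = training_memory_bytes(NUM_PARAMS)
print(f"training state: {total / 1e12:.2f} TB")            # 1.12 TB
print(f"GPUs needed:    {min_gpus(total, GPU_CAPACITY)}")  # 14
```

This count is a lower bound: activations, communication buffers, and fragmentation all add to it in practice.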
Parallelism Techniques for Efficient Training
- Three techniques make large models manageable:
  - Tensor parallelism divides weight matrices among GPUs.
  - Pipeline parallelism distributes layers across different sets of GPUs.
  - Data parallelism creates model copies that process different batches simultaneously.
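Of the three techniques, tensor parallelism is the least intuitive, so here is a toy version in plain Python. Each "GPU" is simulated by a list; the weight matrix is split by columns, each shard computes its slice of the output, and the slices are concatenated. The function names are hypothetical, and real frameworks do this with device-resident tensors and collective communication rather than lists.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split a matrix into num_shards equal column blocks, one per simulated GPU."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    size = cols // num_shards
    return [[row[k * size:(k + 1) * size] for row in w] for k in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w by letting each shard produce its own slice of the output."""
    shards = split_columns(w, num_shards)
    # Each partial product would run on a separate GPU, which holds only its shard.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenating the slices plays the role of the gather step between GPUs.
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

print(matmul(x, w))
print(tensor_parallel_matmul(x, w, num_shards=2))
```

Both calls produce the same vector; the difference is that in the parallel version no single device ever needs the whole weight matrix in memory.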
Optimal Model Size and Data Utilization
Finding Balance Between Parameters and Tokens
- Research shows that, for a fixed compute budget, there is an optimal ratio between parameters and training tokens: approximately 20 tokens per parameter.
Implications for Modern Models
- Recent models like Llama deliberately depart from this optimum, training on far more tokens per parameter; the extra training cost buys a smaller model that is cheaper to run at inference time.
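The 20-tokens-per-parameter rule and the departure from it can be put into numbers. The example model below is an assumption added for illustration: roughly 8 billion parameters trained on roughly 15 trillion tokens, which is in line with what was publicly reported for Llama 3 8B but does not come from this summary.

```python
OPTIMAL_TOKENS_PER_PARAM = 20  # the ratio quoted in the text


def compute_optimal_tokens(num_params: int) -> int:
    """Training tokens suggested by the 20-tokens-per-parameter rule."""
    return num_params * OPTIMAL_TOKENS_PER_PARAM


def tokens_per_param(num_params: int, num_tokens: int) -> float:
    """Actual ratio of training tokens to parameters."""
    return num_tokens / num_params


params = 8_000_000_000          # assumption: about 8B parameters
tokens = 15_000_000_000_000     # assumption: about 15T training tokens

optimal = compute_optimal_tokens(params)
ratio = tokens_per_param(params, tokens)
overtraining = ratio / OPTIMAL_TOKENS_PER_PARAM

print(f"rule-of-thumb tokens: {optimal / 1e9:.0f} billion")   # 160 billion
print(f"actual tokens/param:  {ratio:.0f}")                   # 1875
print(f"multiple of optimum:  {overtraining:.2f}x")           # 93.75x
```

Under these assumed figures the model sees nearly a hundred times more data than the rule suggests, which only makes sense when inference cost, not training cost, dominates the budget.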
Cost Analysis: Financial Implications of AI Development
Breakdown of Training Costs
- DeepSeek-V3's reported training cost was approximately $5.576 million, a significant reduction compared to earlier models like GPT-4, which were estimated at $50-100 million.
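The $5.576 million figure is a simple product of GPU time and a rental rate. The two inputs below come from DeepSeek's own technical report rather than from this summary: about 2.788 million GPU hours for the full training run (pre-training being by far the largest share), priced at an assumed $2 per GPU hour.

```python
def training_cost_usd(gpu_hours: int, usd_per_gpu_hour: int) -> int:
    """Marginal cost of a training run: rented GPU time only."""
    return gpu_hours * usd_per_gpu_hour


GPU_HOURS = 2_788_000      # from the DeepSeek-V3 technical report
USD_PER_GPU_HOUR = 2       # rental price assumed in that report

cost = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
print(f"${cost:,}")  # $5,576,000
```

The calculation covers only the final run; it excludes earlier experiments, hardware purchases, and staff.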
Comprehensive Cost Considerations
- While marginal training costs have fallen dramatically thanks to engineering advances, total organizational expenses, including salaries and infrastructure, remain high (an estimated $500 million annually).
Debunking Common Misconceptions About AI Training
Clarifying Misunderstandings
- Training definition:
  - Training is more than feeding data into a model; it also means curating datasets and keeping complex systems running over extended periods.
- Cost myths:
  - Contrary to popular belief, frontier model training has become more affordable in recent years, with DeepSeek-V3 costing significantly less than previous models.
- Data quantity vs. quality:
  - More data does not always mean better performance; carefully curated smaller datasets can outperform larger ones on knowledge benchmarks.
Conclusion: The Artistry Behind AI Pre-training
Summary Insights
- Every pre-training iteration follows the same pattern: a forward pass through the layers, then a loss calculation, then a weight update based on that loss. Repeated over and over, these steps build the model's language understanding.
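The forward-loss-update pattern can be seen end to end in a toy model. The sketch below is deliberately far simpler than a transformer: its only parameters are one logit per token, so it ignores context and merely learns how often each token appears. The token stream, learning rate, and step count are arbitrary choices for the example.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(tokens, vocab_size, steps=400, lr=0.1):
    """Run the guess-and-check loop and return the learned logits and loss history."""
    logits = [0.0] * vocab_size  # the model's only parameters
    losses = []
    for step in range(steps):
        target = tokens[step % len(tokens)]

        # 1. Forward pass: turn parameters into a probability for each token.
        probs = softmax(logits)

        # 2. Loss: penalize low probability on the token that actually came next.
        losses.append(-math.log(probs[target]))

        # 3. Update: the gradient of the loss w.r.t. each logit is
        #    (predicted probability - 1 if it was the target else 0).
        for i in range(vocab_size):
            grad = probs[i] - (1.0 if i == target else 0.0)
            logits[i] -= lr * grad
    return logits, losses


# Toy stream: token 0 is common, token 1 is rare, token 2 never appears.
stream = [0, 0, 0, 1]
logits, losses = train(stream, vocab_size=3)
final_probs = softmax(logits)

print("first loss:", round(losses[0], 3))
print("mean of last 40 losses:", round(sum(losses[-40:]) / 40, 3))
print("learned probabilities:", [round(p, 3) for p in final_probs])
```

A real model replaces the single logit vector with billions of weights and computes gradients by backpropagation through many layers, but the three numbered steps inside the loop are the same.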