The $5 Million Reading Game Behind Every AI | Pretraining

Name: The $5 Million Reading Game Behind Every AI | Pretraining
Uploaded: 2026-05-05T10:30:29.000Z
Duration: 30 min 53 s

Understanding the Complexity of Building Modern AI

The Basics of Training a Language Model

Building a modern AI model takes enormous computational resources: 2,048 GPUs running for nearly two months to process 14.8 trillion tokens.

The process is often oversimplified. In practice it draws on three engineering disciplines: curating the data, distributing the model across GPUs, and keeping the system stable during long training runs.

The Role of Mathematics in Language Models

The math behind language modeling resembles a reader guessing the next word before seeing it and then checking the guess; this guess-and-check cycle repeats trillions of times during training.

Over time, the model learns patterns in language without explicit instruction on grammar or logic, developing an understanding based solely on context.

Data Curation: Sourcing Quality Information

Filtering Raw Data from the Internet

While the internet provides vast amounts of text data, much of it is unusable; effective filtering is crucial to obtain quality training data.

A multi-step pipeline processes raw HTML from web sources: extracting text, filtering out low-quality content, and anonymizing personal information.
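The three stages named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the pipeline used for any real dataset: the function names, the quality heuristic (minimum word count and share of alphabetic characters), and the email-only anonymization rule are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Stage 1: pull the visible text out of raw HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_high_quality(text, min_words=5, min_alpha_ratio=0.6):
    """Stage 2: hypothetical heuristic, not a real classifier."""
    if len(text.split()) < min_words:
        return False
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(text) > min_alpha_ratio


# Stage 3 (assumption): only email addresses are masked in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize(text):
    return EMAIL_RE.sub("<EMAIL>", text)


def process(html):
    """Run all three stages; return None when the page is filtered out."""
    text = extract_text(html)
    if not looks_high_quality(text):
        return None
    return anonymize(text)
```

Calling `process` on a page returns either cleaned text or `None`. Real pipelines use far stronger filters (language detection, deduplication, learned quality classifiers), which this sketch leaves out.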

Final Dataset Preparation

After rigorous filtering, only about 12 megabytes of every gigabyte of raw web data (roughly 1.2%) is retained for training.

Surprisingly, a smaller subset known as FineWeb-Edu (1.3 trillion tokens) performs better than the full FineWeb dataset (15 trillion tokens), which shows that quality can matter more than quantity in data selection.

Training Mechanics: Understanding Model Predictions

Predicting Next Words

During each training step, the model predicts the next word by processing input through transformer layers and generating logits for potential candidates.

These logits are converted into probabilities using a softmax function; even the top candidate ("mat", at 32%) receives well under half of the probability mass, so the model's confidence remains relatively low.
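The logit-to-probability step can be sketched as follows. The candidate words and logit values are invented for illustration and are not taken from the video; only the softmax formula itself is standard.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates and logits, for illustration only.
candidates = ["floor", "couch", "roof", "table"]
logits = [2.0, 1.5, 1.0, 0.2]
probs = softmax(logits)

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.2%}")
```

A real model produces one logit for every token in its vocabulary, so the probability mass is spread over tens of thousands of candidates rather than four.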

Loss Calculation and Model Improvement

The loss measures how far the model's predicted probabilities are from the word that actually came next; a lower loss means better predictions.
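For next-word prediction, the loss is commonly the cross-entropy: the negative log of the probability the model assigned to the word that actually appeared. The sketch below assumes that definition; the function name is illustrative.

```python
import math


def cross_entropy(probs, target_index):
    """Loss for one prediction: -log of the probability given to the true word."""
    return -math.log(probs[target_index])


# If the correct word received 32% probability, the loss is about 1.14.
loss_uncertain = cross_entropy([0.32, 0.68], 0)

# If the correct word received 99% probability, the loss is close to zero.
loss_confident = cross_entropy([0.99, 0.01], 0)
```

The loss is zero only when the model puts all of its probability on the correct word, and it grows without bound as that probability approaches zero.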

As training progresses through numerous iterations (up to 600,000 steps), perplexity falls sharply, from around 50,257 (what uniform guessing over a 50,257-token vocabulary would give) down to just 17, indicating far better predictive accuracy.
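Perplexity is the exponential of the average cross-entropy loss, which is why a model that guesses uniformly starts at a perplexity equal to its vocabulary size. A short sketch, assuming natural-log loss and illustrative function names:

```python
import math


def perplexity(mean_loss):
    """Perplexity is exp(average cross-entropy loss in nats)."""
    return math.exp(mean_loss)


def uniform_loss(vocab_size):
    """Loss when every token in the vocabulary is equally likely."""
    return -math.log(1.0 / vocab_size)


start_loss = uniform_loss(50257)   # about 10.82 nats
end_loss = math.log(17)            # about 2.83 nats
print(perplexity(start_loss), perplexity(end_loss))
```

Read this way, a perplexity of 17 means the model is, on average, about as uncertain as if it were choosing among 17 equally likely words.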

Technical Challenges: GPU Memory Management

Managing Large Models Across Multiple GPUs

A moderately sized language model needs over 1.12 TB of memory for its weights and gradients during training, far more than a single GPU can hold, so the model must be spread across multiple GPUs.
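The 1.12 TB figure is consistent with one common back-of-the-envelope accounting, shown below. Everything in it is an assumption rather than something stated here: a 70-billion-parameter model, mixed-precision training with the Adam optimizer at 16 bytes per parameter, and 80 GB of memory per GPU. Activations are ignored.

```python
# Assumed per-parameter memory for mixed-precision Adam training (bytes).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def training_memory_bytes(num_params):
    """Memory for weights, gradients, and optimizer state (no activations)."""
    return num_params * sum(BYTES_PER_PARAM.values())


def min_gpus(total_bytes, gpu_bytes):
    """Smallest GPU count whose combined memory covers total_bytes."""
    return -(-total_bytes // gpu_bytes)  # ceiling division on integers


params = 70_000_000_000          # assumption: 70B parameters
gpu_memory = 80_000_000_000      # assumption: 80 GB per GPU

total = training_memory_bytes(params)
print(total / 1e12, "TB across at least", min_gpus(total, gpu_memory), "GPUs")
```

Under these assumptions the training state alone fills 14 GPUs before any activations or batches are stored, which is why the parallelism techniques matter.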

Parallelism Techniques for Efficient Training

To manage large models effectively:

Tensor parallelism divides weight matrices among GPUs.

Pipeline parallelism distributes layers across different sets of GPUs.

Data parallelism creates copies of the model that handle different batches simultaneously.
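The three techniques split different things: columns of a weight matrix, layers of the model, and examples in a batch. The toy sketch below shows each split in plain Python on a single machine. No real GPUs or communication are involved, the function names are illustrative, and the sizes are assumed to divide evenly.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Tensor parallelism: each shard holds a contiguous slice of columns."""
    step = len(w[0]) // num_shards  # assumes columns divide evenly
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def assign_layers(num_layers, num_stages):
    """Pipeline parallelism: each stage owns a contiguous block of layers."""
    per_stage = num_layers // num_stages  # assumes layers divide evenly
    return [list(range(s * per_stage, (s + 1) * per_stage)) for s in range(num_stages)]


def split_batch(batch, num_replicas):
    """Data parallelism: each model copy gets a different slice of the batch."""
    return [batch[r::num_replicas] for r in range(num_replicas)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "GPU" multiplies by its own column shard; results are concatenated.
partials = [matmul(x, shard) for shard in split_columns(w, 2)]
combined = [value for part in partials for value in part]
```

The concatenated result matches the unsharded `matmul(x, w)`, which is the property that lets a weight matrix too large for one device be computed piecewise.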

Optimal Model Size and Data Utilization

Finding Balance Between Parameters and Tokens

Research shows that, for a fixed compute budget, there is an optimal ratio between parameters and training tokens: approximately 20 tokens per parameter.
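The 20-tokens-per-parameter rule turns into simple arithmetic. The sketch below also uses the common approximation that training takes about 6 floating-point operations per parameter per token; that approximation and the 70-billion-parameter example are assumptions added for illustration, not figures from this summary.

```python
TOKENS_PER_PARAM = 20  # the ratio quoted above


def optimal_tokens(num_params):
    """Training tokens suggested by the 20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * num_params


def training_flops(num_params, num_tokens):
    """Rough training compute, using the ~6 * N * D rule of thumb (assumption)."""
    return 6 * num_params * num_tokens


params = 70_000_000_000               # hypothetical 70B-parameter model
tokens = optimal_tokens(params)       # 1.4 trillion tokens
flops = training_flops(params, tokens)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

Because compute grows with the product of parameters and tokens, doubling the model size under this rule roughly quadruples the training compute.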

Implications for Modern Models

Recent models like Llama deliberately depart from this optimum by training on far more tokens per parameter. The extra training compute yields a smaller model for a given level of quality, which lowers inference costs after training.
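To see how far a model sits from the 20-to-1 ratio, divide its training tokens by its parameter count. The figures below (an 8-billion-parameter model trained on about 15 trillion tokens) are approximate publicly reported numbers for one Llama 3 model, used here only as an illustrative assumption.

```python
REFERENCE_RATIO = 20  # tokens per parameter, the ratio discussed above


def tokens_per_parameter(num_tokens, num_params):
    return num_tokens / num_params


def overtraining_factor(num_tokens, num_params):
    """How many times more tokens were used than the reference ratio suggests."""
    return tokens_per_parameter(num_tokens, num_params) / REFERENCE_RATIO


# Illustrative assumption: 8B parameters, roughly 15T training tokens.
params = 8_000_000_000
tokens = 15_000_000_000_000

ratio = tokens_per_parameter(tokens, params)
factor = overtraining_factor(tokens, params)
print(f"{ratio:.0f} tokens per parameter, {factor:.1f}x the reference ratio")
```

The trade-off is intentional: training cost is paid once, while inference cost is paid on every request, so a smaller model trained for longer can be cheaper over its lifetime.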

Cost Analysis: Financial Implications of AI Development

Breakdown of Training Costs

DeepSeek-V3's pre-training cost was approximately $5.576 million, a significant reduction compared to earlier models like GPT-4, which were estimated at $50 to $100 million.
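The $5.576 million figure can be reproduced as GPU-hours multiplied by an hourly rental price. The inputs below (about 2.788 million GPU-hours at $2 per GPU-hour on a 2,048-GPU cluster) are the numbers generally attributed to DeepSeek's own technical report; they are not stated in this summary and should be treated as assumptions.

```python
def training_cost(gpu_hours, price_per_gpu_hour):
    """Marginal compute cost: rented GPU time only, no salaries or research."""
    return gpu_hours * price_per_gpu_hour


def wall_clock_days(gpu_hours, num_gpus):
    """Elapsed days if all GPUs run in parallel for the whole job."""
    return gpu_hours / num_gpus / 24


# Assumed inputs (see the note above).
GPU_HOURS = 2_788_000
PRICE_PER_GPU_HOUR = 2.0
NUM_GPUS = 2048

cost = training_cost(GPU_HOURS, PRICE_PER_GPU_HOUR)
days = wall_clock_days(GPU_HOURS, NUM_GPUS)
print(f"${cost:,.0f} over about {days:.0f} days")
```

The result is a rental-price estimate for a single training run. It excludes failed experiments, data work, staff, and hardware ownership, which is why total organizational spending is much higher.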

Comprehensive Cost Considerations

While marginal costs have fallen dramatically thanks to better engineering practices, total organizational expenses, including salaries and infrastructure, remain high (estimated at $500 million annually).

Debunking Common Misconceptions About AI Training

Clarifying Misunderstandings

Training Definition:

Training encompasses more than just feeding data into a model; it includes curating datasets and managing complex systems over extended periods.

Cost Myths:

Contrary to popular belief, frontier model training has become more affordable in recent years, with DeepSeek-V3 costing significantly less than previous models.

Data Quantity vs Quality:

More data does not always equate to better performance; examples show that carefully curated smaller datasets can outperform larger ones on knowledge benchmarks.

Conclusion: The Artistry Behind AI Pre-training

Summary Insights

Each pre-training iteration follows the same pattern: a forward pass through the layers, a loss calculation, and a weight update based on that loss. Repeated at scale, these steps build the model's language understanding.
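The forward pass, loss, and update cycle can be shown on a deliberately tiny stand-in for a model: a single vector of four logits trained to predict one fixed target word. This is a toy sketch under that assumption, with an illustrative learning rate; a real transformer has billions of parameters and computes gradients by backpropagation through many layers.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits, target, learning_rate=0.5):
    """One iteration: forward pass, loss, gradient, parameter update."""
    probs = softmax(logits)                      # forward pass
    loss = -math.log(probs[target])              # cross-entropy loss
    # Gradient of the loss with respect to each logit: probs minus one-hot.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    updated = [w - learning_rate * g for w, g in zip(logits, grads)]
    return updated, loss


logits = [0.0, 0.0, 0.0, 0.0]   # start with no preference among 4 words
losses = []
for _ in range(50):
    logits, loss = train_step(logits, target=2)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss equals the log of the vocabulary size (uniform guessing), and each step lowers it, which is the same behavior the full-scale training curve shows over hundreds of thousands of steps.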