Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
Lessons Learned from Pre-Training Small Models
Introduction to Small Models
- Maxime Labonne introduces himself and his role at Liquid AI, where he focuses on pre-training small models for edge deployment.
- Liquid AI's models range from 350 million to 24 billion parameters, with the emphasis on smaller models that fit on devices like phones and cars.
Characteristics of Small Models
- On edge hardware, small models are memory-bound, and their limited parameter budget gives them lower knowledge capacity than larger models.
- They excel in specific tasks rather than general-purpose applications, making them effective for focused functions like summarization.
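To make "memory-bound" concrete, here is a rough back-of-the-envelope sketch. It assumes a dense model whose weights are all read from memory once per generated token, so decode speed is capped by memory bandwidth divided by model size. The 50 GB/s bandwidth and the 8-bit weights are hypothetical values chosen for illustration, not figures from the talk.

```python
def decode_tokens_per_second(n_params: float, bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical edge device: 50 GB/s memory bandwidth, 8-bit (1 byte) weights.
# The two sizes are the ends of the parameter range mentioned above,
# treated here as dense models for simplicity.
for n_params in (350e6, 24e9):
    tps = decode_tokens_per_second(n_params, bytes_per_param=1.0, bandwidth_gb_s=50.0)
    print(f"{n_params / 1e9:.2f}B params: about {tps:.0f} tokens/s at most")
```

Under these assumptions the bound scales inversely with parameter count, which is one reason the size of the model matters so much on a phone or in a car.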
Model Architecture Insights
- Discusses Gemma 3 (270M parameters) and Gemma 2.5 (0.8B parameters), highlighting their hybrid architectures and the attention mechanisms they use.
- The embedding layer accounts for a significant portion of a small model's parameters, which leaves fewer parameters for the layers that carry its reasoning capacity.
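The embedding point can be checked with simple arithmetic: an embedding table holds `vocab_size * hidden_size` parameters. The sketch below uses an assumed configuration (a 262,144-token vocabulary, a hidden size of 640, about 270M total parameters, input and output embeddings tied). These numbers are illustrative assumptions, not values quoted in the talk.

```python
def embedding_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of all parameters that sit in a (tied) embedding table."""
    embedding_params = vocab_size * hidden_size
    return embedding_params / total_params


# Assumed, illustrative configuration for a ~270M-parameter model
# with a large vocabulary.
share = embedding_share(vocab_size=262_144, hidden_size=640, total_params=270_000_000)
print(f"Embedding table: {share:.0%} of all parameters")
```

With a large vocabulary and a narrow hidden size, the table can outweigh the transformer blocks themselves, which is the inefficiency the bullet above refers to.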
Training Methodologies
- LFM 2.5 was pre-trained on 28 trillion tokens, far more than traditional scaling laws would call compute-optimal, yet performance kept improving even at small scales.
- Compared with the Chinchilla scaling laws, additional pre-training tokens continue to improve results across benchmarks well past the compute-optimal token budget for a given parameter count.
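To give a sense of scale, the sketch below compares a 28-trillion-token budget with the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. The 20:1 ratio is an approximation, and the 350M parameter count is an assumption for illustration (the smallest size mentioned above); the summary does not say which model size the 28T figure applies to.

```python
# Commonly cited approximation of the Chinchilla compute-optimal ratio.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """How many times the Chinchilla-style token budget was actually used."""
    optimal_tokens = CHINCHILLA_TOKENS_PER_PARAM * n_params
    return n_tokens / optimal_tokens


# Assumed model size of 350M parameters, trained on 28T tokens.
factor = overtraining_factor(n_params=350e6, n_tokens=28e12)
print(f"Trained on about {factor:,.0f}x the rule-of-thumb token budget")
```

The rule of thumb optimizes training compute only. For edge models, where inference cost and memory dominate, training a small model far past that point can still be the better trade.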
Addressing Doom Looping Challenges
- Doom looping, where the model gets stuck repeating itself, is identified as a recurring issue in small language models, particularly during complex tasks or reasoning.
- Solutions include generating preference-alignment data from diverse rollouts, so that training can steer the model away from looping outputs.
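One way to picture the rollout-based approach is sketched below. It is a minimal illustration under stated assumptions, not the pipeline described in the talk: a simple repeated n-gram heuristic stands in for loop detection, and the helper names (`has_doom_loop`, `build_preference_pairs`) and thresholds are hypothetical. Rollouts that loop become "rejected" samples and clean rollouts become "chosen" samples, a common layout for preference datasets.

```python
from collections import Counter


def has_doom_loop(tokens: list[str], n: int = 4, max_repeats: int = 3) -> bool:
    """Heuristic: flag a rollout if any n-gram occurs more than max_repeats times."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > max_repeats for count in ngrams.values())


def build_preference_pairs(prompt: str, rollouts: list[str]) -> list[dict]:
    """Pair every clean rollout (chosen) with every looping rollout (rejected)."""
    clean = [r for r in rollouts if not has_doom_loop(r.split())]
    looping = [r for r in rollouts if has_doom_loop(r.split())]
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c in clean
        for r in looping
    ]


# Two hypothetical rollouts sampled for the same prompt.
rollouts = [
    "The answer is 42 because 6 times 7 equals 42.",
    "wait let me check " * 6,  # the same phrase repeated: a doom loop
]
pairs = build_preference_pairs("What is 6 times 7?", rollouts)
print(len(pairs), "preference pair(s) built")
```

Sampling several rollouts per prompt matters here: a pair can only be formed when at least one rollout loops and at least one does not.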
Future Directions and Applications
- Emphasis on treating edge models as distinct entities with unique challenges rather than merely scaled-down versions of larger models.
- Proposes giving small models access to web search tools, so they can look up knowledge they cannot store without growing in size.
Conclusion and Next Steps
- Highlights the potential of small edge models when combined with agentic tools, suggesting further exploration in this area could yield significant advancements.
Q&A Session Insights
- Discusses when to use small versus large models, depending on context such as internet connectivity and latency sensitivity.