Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

Lessons Learned from Pre-Training Small Models

Introduction to Small Models

  • Maxim Labonne introduces himself and his role at Liquid AI, focusing on pre-training small models for edge deployment.
  • Liquid AI's model range spans from 350 million to 24 billion parameters, emphasizing the focus on smaller models suitable for devices like phones and cars.

Characteristics of Small Models

  • Small models are memory-bound due to hardware limitations, leading to lower knowledge capacity compared to larger models.
  • They excel in specific tasks rather than general-purpose applications, making them effective for focused functions like summarization.

Model Architecture Insights

  • Discussion on Gemma 3 (270M parameters) and Gemma 2.5 (0.8B parameters), highlighting their hybrid architectures with unique attention mechanisms.
  • The embedding layer constitutes a significant portion of the model's parameters, indicating inefficiencies in reasoning capacity.

Training Methodologies

  • LFM 2.5 training involves extensive token usage (28 trillion tokens), challenging traditional scaling laws but showing performance improvements even at smaller scales.
  • Comparison with Chinchilla scaling laws reveals that more pre-training tokens enhance performance across various benchmarks despite initial assumptions about optimal parameter sizes.

Addressing Doom Looping Challenges

  • Doom looping is identified as a recurring issue in small language models, particularly during complex tasks or reasoning scenarios.
  • Solutions include preference alignment data generation techniques that utilize diverse rollouts to mitigate doom loops effectively.

Future Directions and Applications

  • Emphasis on treating edge models as distinct entities with unique challenges rather than merely scaled-down versions of larger models.
  • Proposes integrating web search tools into small models to enhance their knowledge capacity without increasing size constraints.

Conclusion and Next Steps

  • Highlights the potential of small edge models when combined with agentic tools, suggesting further exploration in this area could yield significant advancements.

Q&A Session Insights

  • Discusses practical applications of small vs. large models based on context such as internet connectivity and latency sensitivity.
Video description

A new class of small models is emerging with the ability to reliably follow instructions and call tools while running on-device under 1 GB of memory. In this talk, we'll break down how to post-train frontier small models using the LFM2.5 recipe: on-policy preference alignment, agentic reinforcement learning, and curriculum training with iterative model merging. We'll cover training challenges unique to the 1B scale, like doom loops, capability interference, and how to fix them. The goal is to give you a concrete playbook to fine-tune and deploy small models for your own use cases, from structured data extraction to multi-turn tool use. Speaker info: - https://x.com/maximelabonne - https://www.linkedin.com/in/maxime-labonne/ - https://github.com/mlabonne Timestamps: 0:00:00 - Start 0:00:14 - Introduction to frontier small models at Liquid AI 0:01:02 - Characteristics: memory-bound, task-specific, latency-sensitive 0:02:20 - Architecture: why large embedding layers are inefficient 0:04:01 - LFM2 architecture: using gated short convolutions for speed 0:06:09 - LFM 2.5 recipe: 28T tokens and post-training stages 0:08:34 - Post-training: SFT, preference alignment, and RL best practices 0:10:43 - Identifying "doom loops" in reasoning models 0:11:34 - Solutions: mitigating loops via preference alignment and RL 0:15:29 - Future focus: using agentic tools to overcome memory limits 0:17:58 - Q&A: real-world applications for small vs. large models