LLaMA 4 is HERE! Meta Just COOKED
Llama 4: A New Era in Multimodal AI
Overview of Llama 4 Models
- Meta has introduced Llama 4 in three versions (Scout, Maverick, and the upcoming Behemoth), with the smallest offering a remarkable 10 million token context window.
- All models are multimodal, capable of processing text, images, and other modalities. They use a mixture-of-experts (MoE) architecture rather than being dedicated reasoning models.
Details on Model Variants
- Llama 4 Scout: The smallest model with 109 billion total parameters; it has 17 billion active parameters and operates with 16 experts. It boasts an industry-leading context length of 10 million tokens.
- Llama 4 Maverick: A total of 400 billion parameters, of which only 17 billion are active, spread across 128 experts. It supports a 1 million token context length.
- Llama 4 Behemoth: An upcoming model with an astounding 2 trillion total parameters; it is positioned as a frontier model comparable to Anthropic's Claude and OpenAI's GPT models.
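The parameter figures above make the MoE efficiency easy to quantify. A quick back-of-the-envelope calculation, using only the numbers quoted in the bullets, shows what fraction of each model's weights is active for any single token:

```python
# Back-of-the-envelope arithmetic from the figures above: how much of
# each MoE model's weights is active for any single token.
variants = {
    # name: (total parameters, active parameters, expert count)
    "Scout":    (109e9, 17e9, 16),
    "Maverick": (400e9, 17e9, 128),
}

for name, (total, active, experts) in variants.items():
    frac = active / total  # Scout ~16%, Maverick ~4%
    print(f"{name}: {frac:.1%} of parameters active per token "
          f"({experts} experts)")
```

The takeaway: Maverick is nearly 4x larger than Scout in total size, yet both run the same 17 billion parameters per token.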
Performance Insights
- Llama 4 Scout is noted for outperforming previous-generation models while fitting on a single Nvidia H100 GPU, and it surpasses comparable Gemini models across various benchmarks.
- The introduction of Box AI will leverage Llama 4 to enhance document processing capabilities for businesses by automating workflows and extracting insights from unstructured data.
Cost Efficiency and Competitive Edge
- Llama 4 Maverick demonstrates a superior performance-to-cost ratio compared to competitors like GPT-4o and the Gemini models while keeping its active parameter count low.
- The Behemoth model is still under development but is expected to significantly enhance the capabilities of the existing Llama variants once released.
Llama 4: Innovations in AI Architecture
Mixture of Experts and Model Architecture
- Llama 4 models are the first in the Llama family to use a Mixture of Experts (MoE) architecture, an approach other frontier labs adopted earlier, so Meta is following the current trend rather than setting it.
- In each MoE layer, a token's hidden state is processed by a shared expert and simultaneously routed to one of 16 specialized experts; their outputs are combined to produce the layer's final output.
- Llama 4 has been pre-trained on 200 languages, significantly increasing multilingual token availability compared to Llama 3.
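The shared-expert-plus-routed-expert scheme described above can be sketched in a few lines. This is an illustrative toy, not Meta's implementation: the sizes are made up, each "expert" is a single linear map, and routing is plain top-1 argmax.

```python
import numpy as np

# Toy MoE layer: every token passes through a shared expert AND is
# routed to one of 16 specialized experts (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_experts = 64, 16

shared_W = rng.normal(size=(d_model, d_model)) * 0.02        # shared expert
expert_Ws = rng.normal(size=(n_experts, d_model, d_model)) * 0.02  # 16 experts
router_W = rng.normal(size=(d_model, n_experts)) * 0.02      # router

def moe_layer(x):
    """x: (tokens, d_model) -> (tokens, d_model)"""
    logits = x @ router_W            # router score per expert
    top1 = logits.argmax(axis=-1)    # top-1 routing decision per token
    out = x @ shared_W               # shared expert sees every token
    for t, e in enumerate(top1):     # routed expert: one per token
        out[t] += x[t] @ expert_Ws[e]
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)       # (4, 64)
```

Only the router, the shared expert, and one routed expert run per token, which is why active parameters stay far below total parameters.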
Efficient Training Techniques
- The model employs FP8 precision during training, allowing for efficient utilization of GPU resources without compromising quality.
- During pre-training on 32,000 GPUs, Llama 4 sustained an impressive 390 TFLOPs per GPU.
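Those two figures imply an enormous aggregate throughput; the arithmetic is straightforward:

```python
# Aggregate cluster throughput implied by the figures above.
gpus = 32_000
tflops_per_gpu = 390                          # per-GPU throughput, as quoted

total_flops = gpus * tflops_per_gpu * 1e12    # FLOP/s across the cluster
print(f"{total_flops:.3e} FLOP/s")            # 1.248e+19 FLOP/s
print(f"~{total_flops / 1e18:.2f} exaFLOP/s") # ~12.48 exaFLOP/s
```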
Cost Efficiency and Benchmarking
- Cost analysis shows that Llama 4 offers a competitive rate of $0.19 to $0.49 per million tokens, making it cheaper than competitors like Gemini 2.0.
- In image reasoning benchmarks, Llama 4 scored highly (73.4), outperforming other models such as DeepSeek v3.1 and GPT-4o.
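To make the quoted $0.19 to $0.49 per million token range concrete, here is a hypothetical monthly cost estimate; the 500M token per month workload is invented for illustration:

```python
# Hypothetical workload cost at the quoted $0.19-$0.49 per million
# token range (blended rate; the monthly volume is made up).
low_rate, high_rate = 0.19, 0.49   # USD per 1M tokens, as quoted above
monthly_tokens = 500e6             # hypothetical: 500M tokens per month

low = monthly_tokens / 1e6 * low_rate
high = monthly_tokens / 1e6 * high_rate
print(f"${low:.2f} to ${high:.2f} per month")  # $95.00 to $245.00 per month
```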
Context Window Capabilities
- The Scout variant of Llama 4 features a context window of up to 10 million tokens, enabling recall and reasoning over extremely long inputs.
- Despite some failures in specific tests, overall long-context performance remains strong, with high success rates when recalling information from extensive text inputs.
Licensing Issues and Future Developments
- Licensing limitations carry over from Llama 3: companies with over 700 million users must seek special permission from Meta.
- Jeremy Howard notes that even smaller versions of the model may not run on consumer-grade GPUs due to their size and complexity.
1.58 Bit for the Win
Insights from Emad Mostaque, Founder of Stability AI
- Emad Mostaque emphasizes the significance of "1.58 bit" as a winning strategy in AI development.
- He mentions that models will be run at a hyper-quantized level, indicating advances in efficiency and performance.
- Mostaque reveals that new models are on the horizon, including a reasoning model and one with an almost infinite context window.
- The upcoming model is described as "super fast," suggesting improvements in processing speed and capability.
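For context, "1.58 bit" refers to ternary weights: each weight takes one of the three values {-1, 0, +1}, which carries log2(3) ≈ 1.58 bits of information. Below is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58; it is illustrative only and not tied to any model mentioned here:

```python
import numpy as np

def quantize_ternary(W, eps=1e-8):
    """Round weights to {-1, 0, +1} with an absmean scale (BitNet b1.58 style)."""
    scale = np.abs(W).mean() + eps            # per-tensor absmean scale
    Wq = np.clip(np.round(W / scale), -1, 1)  # snap to ternary values
    return Wq, scale                          # dequantize as Wq * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
Wq, scale = quantize_ternary(W)
print(np.unique(Wq))   # values drawn from {-1, 0, 1}
print(np.log2(3))      # ~1.585 bits of information per ternary weight
```

Storing three states instead of 16-bit floats cuts weight memory by roughly an order of magnitude, which is the kind of hyper-quantization the quote alludes to.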