Qwen3: A New Era in Open-Source AI Models
Overview of the Qwen3 Model
- Qwen3 is an open-source model comparable to Gemini 2.5 Pro; the flagship version has 235 billion total parameters with 22 billion active parameters.
- Benchmark comparisons show that while Gemini 2.5 Pro leads slightly in some areas, Qwen3 performs competitively across various metrics, including Elo ratings.
Performance Benchmarks
- On the BFCL benchmark, Qwen3 shows superior function-calling ability, scoring 70.8 versus Gemini's 62.9.
- The model also outperforms previous generations and competitors such as GPT-4o across multiple benchmarks, showcasing its advanced capabilities.
Hybrid Thinking Model
- Qwen3 introduces a hybrid thinking model that lets users adjust a "thinking budget," trading token usage for performance.
- In non-thinking mode, the model gives quick responses for simpler tasks; in thinking mode, it spends more tokens on deeper reasoning for complex problems.
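The mode switch can be applied per turn. A minimal sketch: the `/think` and `/no_think` soft-switch tags follow the Qwen3 model card, while the helper function itself is ours, not part of any official SDK:

```python
def build_prompt(user_text: str, thinking: bool = True) -> list[dict]:
    """Build a chat message list, toggling Qwen3's thinking mode per turn.

    Qwen3 documents a soft switch: appending "/no_think" to a user turn
    suppresses the reasoning block for that turn, and "/think" re-enables
    it. Exact tag behavior is defined by the Qwen3 model card.
    """
    suffix = "" if thinking else " /no_think"
    return [{"role": "user", "content": user_text + suffix}]

# Quick question: skip the reasoning phase for a fast answer.
fast = build_prompt("What's the capital of France?", thinking=False)
# Hard problem: keep thinking mode on (the default).
slow = build_prompt("Prove that sqrt(2) is irrational.")
```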
Task-Specific Budget Control
- Users can configure task-specific budgets easily, optimizing the balance between cost-efficiency and inference quality.
- This flexibility is particularly beneficial for coding tasks where varying levels of thought are required depending on complexity.
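A per-task budget table makes this concrete. The sketch below uses illustrative token counts and hypothetical helper names, not values or parameters from the Qwen3 release:

```python
# Hypothetical per-task thinking budgets, in reasoning tokens.
# The specific numbers are illustrative only.
THINKING_BUDGETS = {
    "chitchat": 0,       # non-thinking mode: answer immediately
    "summarize": 1024,   # light reasoning
    "coding": 8192,      # complex tasks get a larger reasoning budget
}

def request_params(task: str) -> dict:
    """Pick inference settings for a task, balancing cost and quality."""
    budget = THINKING_BUDGETS.get(task, 2048)  # default for unknown tasks
    return {"enable_thinking": budget > 0, "thinking_budget": budget}
```

Routing requests this way keeps cheap tasks cheap while reserving the larger budget for code generation and other multi-step problems.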
Integration with MCP Tools
- Qwen3 is optimized for use with MCP tools through Zapier's new service, which connects AI to thousands of applications seamlessly.
- Users can set up automation without coding skills using Zapier's platform, making it accessible for various applications.
Model Variants and Specifications
- The Qwen3 family includes two Mixture-of-Experts models and six dense models; the flagship's parameter counts are designed for efficiency.
Introduction to Qwen3 Models
Overview of Model Parameters and Capabilities
- The dense Qwen3 models range from 600 million to 32 billion parameters, with varying context windows: 128K for the larger models (8B-32B) and 32K for the smaller ones (600M-4B).
- Notably, the model can make tool calls during its chain of thought, a feature previously seen in OpenAI's o3 and o4 series.
Demonstration of Tool Calling
- In a demo task, the model fetches GitHub stars and generates a bar chart while seamlessly switching between thinking and tool calls.
- Another example shows the model organizing desktop files by type through multiple tool calls within a single inference run.
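The loop behind such demos can be sketched as follows. The model interface and the tool-call dictionary shape here are stand-ins for illustration, not the actual Qwen3 or MCP API:

```python
import json

def run_tool_loop(model, tools, messages, max_steps=5):
    """Minimal agent loop: ask the model; if it requests a tool,
    execute it and feed the result back as a tool message; stop
    when the model returns a plain answer.

    `model` is any callable returning either
      {"tool": name, "args": {...}}  (a tool request) or
      {"content": text}              (a final answer).
    """
    for _ in range(max_steps):
        reply = model(messages)
        if reply.get("tool"):  # model asked for a tool mid-reasoning
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["content"]  # final answer
    raise RuntimeError("tool loop did not terminate")
```

Fetching GitHub stars, charting them, or moving files around the desktop all reduce to this pattern: the model alternates between reasoning turns and tool requests until it can answer.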
Pre-training Process of Qwen3
Data Collection and Training Tokens
- Qwen3 was trained on roughly double the tokens of its predecessor: approximately 36 trillion tokens spanning many languages.
- The dataset included diverse sources such as web content and PDF-like documents, utilizing previous models for text extraction.
Synthetic Data Generation
- To boost the representation of math and coding data, synthetic data was generated with Qwen2.5 models specialized for mathematics and coding tasks.
Training Stages Explained
Three Phases of Pre-training
- The first stage pre-trained on over 30 trillion tokens to establish basic language skills.
- The second stage improved the representation of knowledge-intensive data with an additional five trillion training tokens.
- The final stage extended the context window using long-context data.
Post-training Methodology
- A four-stage training pipeline was implemented post-pre-training to develop reasoning abilities alongside rapid response capabilities.
Stages Breakdown:
- Long Chain of Thought: Initial training focused on fundamental reasoning across various domains.
- Reinforcement Learning: Enhanced exploration-exploitation capabilities through rule-based rewards.
- Model Fusion: Integrated non-thinking capabilities via fine-tuning with instruction tuning data.
- General Reinforcement Learning: Strengthened general capabilities across numerous domain tasks.
Model Availability and Comparisons
Accessing the Model
- Users can download the model immediately through platforms such as LM Studio, Ollama, or MLX.
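For a local test drive, the commands look roughly like this; exact model tags vary by platform, so check each catalog before pulling:

```shell
# Pull and chat with a Qwen3 variant via Ollama (tag names are per
# Ollama's library and may differ from these examples).
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b "Explain MoE routing in one paragraph."

# On Apple silicon, MLX-converted weights can be run with mlx-lm:
pip install mlx-lm
mlx_lm.generate --model Qwen/Qwen3-4B --prompt "Hello"
```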
Benchmark Comparisons with Llama Models
- Qwen3's flagship model (235B parameters, 22B active) is compared against Llama 4's frontier model (402B parameters), noting differences in active-parameter usage despite Llama's size advantage.
Gemini 2.5 and Its Competitors
Overview of Model Performance
- Gemini 2.5 is identified as a substantial leader in the AI model landscape, achieving an impressive performance score of 84%.
- Second place is held by o3, closely followed by DeepSeek R1 and Llama 3.1 Nemotron Ultra.
- Notably, this Nvidia-tuned variant of the previous-generation Llama ranks just behind the leaders.
Analysis of Active Parameters
- A chart plots GPQA Diamond scores against each model's active parameter count.
- Qwen3-30B-A3B, with only 3 billion active parameters, sits lower on the performance scale but still shows promise as a reasoning model.
- On that chart, the X-axis shows active parameters (further left, i.e., fewer, is better), while the Y-axis shows the GPQA Diamond score (higher is better).