Qwen3: A New Era in Open-Source AI Models
Overview of the Qwen3 Model
- Qwen3 is an open-source model comparable to Gemini 2.5 Pro; the flagship version has 235 billion total parameters with 22 billion active parameters.
- Benchmark comparisons show that while Gemini 2.5 Pro leads slightly in some areas, Qwen3 performs competitively across various metrics, including Elo ratings.
Performance Benchmarks
- On the BFCL benchmark, Qwen3 shows superior function-calling ability, scoring 70.8 versus Gemini's 62.9.
- The model also outperforms previous generations and competitors such as GPT-4o across multiple benchmarks, showcasing its advanced capabilities.
Hybrid Thinking Model
- Qwen3 introduces a hybrid thinking model that lets users adjust a "thinking budget," trading token usage for performance.
- In non-thinking mode, the model gives quick responses for simpler tasks; in thinking mode, it spends more tokens on deeper reasoning for complex problems.
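The mode switch can be applied per turn. A minimal sketch: the `/think` and `/no_think` soft-switch tags follow the Qwen3 model card, while the helper function itself is ours, not part of any official SDK:

```python
def build_prompt(user_text: str, thinking: bool = True) -> list[dict]:
    """Build a chat message list, toggling Qwen3's thinking mode per turn.

    Qwen3 documents a soft switch: appending "/no_think" to a user turn
    suppresses the reasoning block for that turn, and "/think" re-enables
    it. Exact tag behavior is defined by the Qwen3 model card.
    """
    suffix = "" if thinking else " /no_think"
    return [{"role": "user", "content": user_text + suffix}]

# Quick question: skip the reasoning phase for a fast answer.
fast = build_prompt("What's the capital of France?", thinking=False)
# Hard problem: keep thinking mode on (the default).
slow = build_prompt("Prove that sqrt(2) is irrational.")
```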
Task-Specific Budget Control
- Users can configure task-specific budgets easily, optimizing the balance between cost-efficiency and inference quality.
- This flexibility is particularly beneficial for coding tasks where varying levels of thought are required depending on complexity.
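A per-task budget table makes this concrete. The sketch below uses illustrative token counts and hypothetical helper names, not values or parameters from the Qwen3 release:

```python
# Hypothetical per-task thinking budgets, in reasoning tokens.
# The specific numbers are illustrative only.
THINKING_BUDGETS = {
    "chitchat": 0,       # non-thinking mode: answer immediately
    "summarize": 1024,   # light reasoning
    "coding": 8192,      # complex tasks get a larger reasoning budget
}

def request_params(task: str) -> dict:
    """Pick inference settings for a task, balancing cost and quality."""
    budget = THINKING_BUDGETS.get(task, 2048)  # default for unknown tasks
    return {"enable_thinking": budget > 0, "thinking_budget": budget}
```

Routing requests this way keeps cheap tasks cheap while reserving the larger budget for code generation and other multi-step problems.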
Integration with MCP Tools
- Qwen3 is optimized for use with MCP tools through Zapier's new service, which connects AI to thousands of applications seamlessly.
- Users can set up automation without coding skills using Zapier's platform, making it accessible for various applications.
Model Variants and Specifications
- The Qwen3 family includes two Mixture-of-Experts models and six dense models; the flagship's parameter counts are designed for efficiency.
Introduction to Qwen3 Models
Overview of Model Parameters and Capabilities
- The dense Qwen3 models range from 600 million to 32 billion parameters, with varying context windows: 128K for the larger models (8B-32B) and 32K for the smaller ones (600M-4B).
- Notably, the model can make tool calls during its chain of thought, a feature previously seen in OpenAI's o3 and o4 series.
Demonstration of Tool Calling
- In a demo task, the model fetches GitHub stars and generates a bar chart while seamlessly switching between thinking and tool calls.
- Another example shows the model organizing desktop files by type through multiple tool calls within a single inference run.
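The loop behind such demos can be sketched as follows. The model interface and the tool-call dictionary shape here are stand-ins for illustration, not the actual Qwen3 or MCP API:

```python
import json

def run_tool_loop(model, tools, messages, max_steps=5):
    """Minimal agent loop: ask the model; if it requests a tool,
    execute it and feed the result back as a tool message; stop
    when the model returns a plain answer.

    `model` is any callable returning either
      {"tool": name, "args": {...}}  (a tool request) or
      {"content": text}              (a final answer).
    """
    for _ in range(max_steps):
        reply = model(messages)
        if reply.get("tool"):  # model asked for a tool mid-reasoning
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["content"]  # final answer
    raise RuntimeError("tool loop did not terminate")
```

Fetching GitHub stars, charting them, or moving files around the desktop all reduce to this pattern: the model alternates between reasoning turns and tool requests until it can answer.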
Pre-training Process of Qwen3
Data Collection and Training Tokens
- Qwen3 was trained on roughly double the tokens of its predecessor: approximately 36 trillion tokens spanning many languages.
- The dataset included diverse sources such as web content and PDF-like documents, utilizing previous models for text extraction.
Synthetic Data Generation
- To boost the representation of math and coding data, synthetic data was generated with Qwen2.5 models specialized for mathematics and coding tasks.
Training Stages Explained
Three Phases of Pre-training
- The first stage pre-trained on over 30 trillion tokens to establish basic language skills.
- The second stage improved the representation of knowledge-intensive data with an additional five trillion training tokens.
- The final stage extended the context window using long-context data.
Post-training Methodology
- A four-stage training pipeline was implemented post-pre-training to develop reasoning abilities alongside rapid response capabilities.
Stages Breakdown:
- Long Chain of Thought: Initial training focused on fundamental reasoning across various domains.
- Reinforcement Learning: Enhanced exploration-exploitation capabilities through rule-based rewards.
- Model Fusion: Integrated non-thinking capabilities via fine-tuning with instruction tuning data.
- General Reinforcement Learning: Strengthened general capabilities across numerous domain tasks.
Model Availability and Comparisons
Accessing the Model
- Users can download the model immediately through platforms such as LM Studio, Ollama, or MLX.
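For a local test drive, the commands look roughly like this; exact model tags vary by platform, so check each catalog before pulling:

```shell
# Pull and chat with a Qwen3 variant via Ollama (tag names are per
# Ollama's library and may differ from these examples).
ollama pull qwen3:30b-a3b
ollama run qwen3:30b-a3b "Explain MoE routing in one paragraph."

# On Apple silicon, MLX-converted weights can be run with mlx-lm:
pip install mlx-lm
mlx_lm.generate --model Qwen/Qwen3-4B --prompt "Hello"
```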
Benchmark Comparisons with Llama Models
- Qwen3's flagship model (235B parameters, 22B active) is compared against Llama 4's frontier model (402B parameters), noting differences in active-parameter usage despite Llama's size advantage.
Gemini 2.5 and Its Competitors
Overview of Model Performance
- Gemini 2.5 is identified as a substantial leader in the AI model landscape, achieving an impressive performance score of 84%.
- Second place is held by o3, closely followed by DeepSeek R1 and Llama 3.1 Nemotron Ultra.
- Notably, this Nvidia-tuned variant of the previous-generation Llama ranks just behind the leaders.
Analysis of Active Parameters
- A chart plots GPQA Diamond scores against each model's active parameter count.
- Qwen3-30B-A3B, with only 3 billion active parameters, sits lower on the performance scale but still shows promise as a reasoning model.
- On that chart, the X-axis shows active parameters (further left, i.e., fewer, is better), while the Y-axis shows the GPQA Diamond score (higher is better).