OpenAI Dropped a FRONTIER Open-Source Model
OpenAI Releases GPT-OSS: A Game-Changer in Open-Source AI
Overview of GPT-OSS
- OpenAI has launched GPT-OSS, a state-of-the-art open-weight model family, which may be the previously rumored Horizon Alpha.
- The model is available in two sizes: 120 billion parameters and 20 billion parameters, both classified as open-weight language models.
Benefits of Open Source
- Open-source models are significantly cheaper than closed-source alternatives and allow for customization through fine-tuning.
- Released under an Apache 2.0 license, these models provide a permissive framework for use and modification.
Performance Insights
- The 120B version of GPT-OSS performs comparably to OpenAI's o4-mini on reasoning benchmarks while being efficient enough to run on a single 80 GB GPU.
- The 20 billion parameter version can operate on edge devices with just 16 GB of memory, making it suitable for local inference.
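Those memory figures can be sanity-checked with a back-of-envelope VRAM estimate. The sketch below assumes the weights are stored in roughly 4-bit (MXFP4) quantization and adds a 20% overhead factor for activations and buffers; both numbers are assumptions for illustration, not official specifications:

```python
def estimated_vram_gb(total_params_billions: float,
                      bytes_per_param: float = 0.5,  # ~4-bit quantized weights (assumed)
                      overhead: float = 1.2) -> float:  # KV cache, activations, buffers (assumed)
    """Back-of-envelope estimate of the memory needed to hold a quantized model."""
    return total_params_billions * bytes_per_param * overhead

# Under these assumptions the 20B model fits in 16 GB and the 120B model in 80 GB:
print(estimated_vram_gb(20))   # 12.0
print(estimated_vram_gb(120))  # 72.0
```

The estimate lines up with the stated hardware targets: roughly 12 GB for the 20B model and 72 GB for the 120B model.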
Practical Applications
- Users are encouraged to download the model for offline access to knowledge, ensuring availability during internet outages or emergencies.
- Both models excel in tool use, function calling, chain-of-thought reasoning, and health-related question answering.
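The function-calling pattern these models support can be sketched as a dispatch loop on the host side: the model emits a structured tool call, the host executes it and feeds the result back. Everything below — the tool registry, the message shapes, the `get_weather` function — is a hypothetical illustration, not the actual API:

```python
import json

# Hypothetical tool registry: names the model is told about, mapped to local functions.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def handle_model_turn(message: dict) -> dict:
    """If the model emitted a tool call, execute it and return a tool message
    to append to the conversation; otherwise pass the assistant text through."""
    call = message.get("tool_call")
    if call is not None:
        result = TOOLS[call["name"]](**call["arguments"])
        return {"role": "tool", "name": call["name"], "content": json.dumps(result)}
    return {"role": "assistant", "content": message.get("content", "")}
```

After appending the returned tool message, the host re-invokes the model, which then composes its final answer using the tool result.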
Unique Features
- Users can adjust the reasoning depth during problem-solving tasks based on complexity requirements.
- Trained using advanced techniques focused on efficiency and usability across various deployment environments.
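In practice, the reasoning depth is selected with a directive in the system prompt. The sketch below uses a `Reasoning: low|medium|high` line, which matches how the release describes the knob, though the surrounding prompt text is a placeholder assumption:

```python
def build_system_prompt(effort: str = "medium") -> str:
    """Compose a system prompt selecting the model's reasoning effort.
    The 'Reasoning: <level>' line is the adjustable knob; the rest of the
    prompt is an illustrative placeholder."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return f"You are a helpful assistant.\nReasoning: {effort}"
```

A simple task might use `build_system_prompt("low")` for fast answers, while a hard math problem would warrant `"high"` at the cost of more reasoning tokens.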
Technical Specifications
- Each model is a transformer using a mixture-of-experts (MoE) design, which reduces the number of active parameters per token processed.
- The 120B model activates only about 5.1 billion parameters per token while maintaining high quality; the 20B version activates about 3.6 billion.
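The mixture-of-experts idea can be illustrated with a toy top-k router: a gate scores every expert, but only the top-k actually run, so the active parameter count per token is a small fraction of the total. The dimensions, expert count, and k below are illustrative, not gpt-oss's actual configuration:

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy MoE forward pass: score experts, keep the top-k, mix their outputs."""
    scores = x @ gate_w                          # one routing logit per expert
    top = np.argsort(scores)[-k:]                # indices of the k best-scoring experts
    w = np.exp(scores[top] - scores[top].max())  # softmax over the selected experts only
    w /= w.sum()
    # Only these k experts run; the other experts' parameters stay untouched this token.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map in this toy example.
experts = [lambda x, m=rng.normal(size=(d, d)): x @ m for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), gate_w, experts)
```

With k=2 of 4 experts active, only half the expert parameters are touched per token; gpt-oss applies the same principle at a much larger scale.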
Training Methodology
- The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3, for improved inference and memory efficiency.
- They were trained on high-quality text data emphasizing STEM, coding, and general knowledge, using an extended version of the tokenizer from previous OpenAI models.
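The alternating attention pattern can be pictured as two kinds of causal masks. This is a minimal sketch assuming even layers are dense and odd layers use a local band; the band width and layer assignment are illustrative, not the model's real hyperparameters:

```python
import numpy as np

def causal_mask(seq_len, layer_idx, band=4):
    """Dense causal mask on even layers; banded (local) causal mask on odd layers."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    if layer_idx % 2 == 0:
        return causal                 # dense layer: all past tokens visible
    return causal & (i - j < band)    # sparse layer: only the last `band` tokens visible
```

The banded layers cut attention cost from quadratic to linear in sequence length, while the interleaved dense layers preserve long-range information flow.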
Performance Benchmarks
- With tools, the 120 billion parameter version scored 2622 Elo in competitive programming (Codeforces), closely trailing the frontier o3 model's 2706 — a strong showing for an open-weight model.
- On expert-level questions, the 120 billion parameter version scored 19% while o3 reached 24.9%. Notably, the open-weight 120B version outperformed both o4-mini and o3-mini without tools.
- On the medical benchmark HealthBench, the scores were close: 57.6 for the 120B model versus 59.8 for the frontier model. The smaller model also posted impressive results, such as 96% on AIME 2024 (competition math) for the 20 billion parameter version.
- On GPQA Diamond, a benchmark of PhD-level science questions, the 120B model scored 80.1 and the smaller variant 71.5, indicating robust capability even in advanced academic contexts.
- MMLU results showed high accuracy, reaching about 90% for the larger model, with respectable performance from the smaller one, reinforcing their competitive standing.
Safety Considerations
- Monitoring a reasoning model's chain of thought can help detect misbehavior; for this reason, the chain of thought was deliberately left without direct supervision during training, so it remains a faithful signal of the model's reasoning.
- Developers are advised against displaying raw chains of thought to users due to potential hallucinations or harmful content; instead, summarization and filtering are recommended practices.
- Pre-training involved filtering out harmful data related to sensitive topics like chemical and biological threats; however, concerns remain about adversaries fine-tuning open-source models for malicious purposes.
- Testing indicated that even after extensive fine-tuning aimed at malicious use cases (e.g., biological weapons), these models did not achieve high capability levels according to OpenAI's preparedness framework.
- OpenAI is hosting a red-teaming challenge with a $500,000 prize fund, inviting researchers to surface safety issues in these models, with submissions judged by expert reviewers.
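The summarize-and-filter practice recommended above can be sketched as a small gate on the host side. Here `summarize` and `moderate` stand in for whatever summarization model and moderation check a deployment actually uses; both are hypothetical callbacks, not a real API:

```python
def displayable_reasoning(raw_cot: str, summarize, moderate) -> str:
    """Never surface the raw chain of thought to end users:
    run a moderation check first, then show only a summary (or withhold)."""
    if not moderate(raw_cot):          # moderation callback: False means unsafe content
        return "[reasoning withheld]"
    return summarize(raw_cot)          # users see a cleaned-up summary, not the raw trace
```

The key design choice is that the raw trace is never in the code path that reaches the user, which keeps hallucinated or harmful reasoning content out of the UI while still letting operators log and monitor it.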