OpenAI Dropped a FRONTIER Open-Source Model
OpenAI Releases GPT-OSS: A Game-Changer in Open-Source AI
Overview of GPT-OSS
- OpenAI has launched GPT-OSS, a state-of-the-art open-weight model family, which may be the previously rumored Horizon Alpha.
- The model is available in two sizes: 120 billion parameters and 20 billion parameters, both classified as open-weight language models.
Benefits of Open Source
- Open-source models are significantly cheaper than closed-source alternatives and allow for customization through fine-tuning.
- Released under an Apache 2.0 license, these models provide a permissive framework for use and modification.
Performance Insights
- The 120B version of GPT-OSS performs comparably to OpenAI's o4-mini on reasoning benchmarks while being efficient enough to run on a single 80 GB GPU.
- The 20 billion parameter version can operate on edge devices with just 16 GB of memory, making it suitable for local inference.
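Those memory figures can be sanity-checked with a back-of-envelope VRAM estimate. The sketch below assumes the weights are stored in roughly 4-bit (MXFP4) quantization and adds a 20% overhead factor for activations and buffers; both numbers are assumptions for illustration, not official specifications:

```python
def estimated_vram_gb(total_params_billions: float,
                      bytes_per_param: float = 0.5,  # ~4-bit quantized weights (assumed)
                      overhead: float = 1.2) -> float:  # KV cache, activations, buffers (assumed)
    """Back-of-envelope estimate of the memory needed to hold a quantized model."""
    return total_params_billions * bytes_per_param * overhead

# Under these assumptions the 20B model fits in 16 GB and the 120B model in 80 GB:
print(estimated_vram_gb(20))   # 12.0
print(estimated_vram_gb(120))  # 72.0
```

The estimate lines up with the stated hardware targets: roughly 12 GB for the 20B model and 72 GB for the 120B model.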
Practical Applications
- Users are encouraged to download the model for offline access to knowledge, ensuring availability during internet outages or emergencies.
- Both models excel in tool use, function calling, chain-of-thought reasoning, and health-related question answering.
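The function-calling pattern these models support can be sketched as a dispatch loop on the host side: the model emits a structured tool call, the host executes it and feeds the result back. Everything below — the tool registry, the message shapes, the `get_weather` function — is a hypothetical illustration, not the actual API:

```python
import json

# Hypothetical tool registry: names the model is told about, mapped to local functions.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def handle_model_turn(message: dict) -> dict:
    """If the model emitted a tool call, execute it and return a tool message
    to append to the conversation; otherwise pass the assistant text through."""
    call = message.get("tool_call")
    if call is not None:
        result = TOOLS[call["name"]](**call["arguments"])
        return {"role": "tool", "name": call["name"], "content": json.dumps(result)}
    return {"role": "assistant", "content": message.get("content", "")}
```

After appending the returned tool message, the host re-invokes the model, which then composes its final answer using the tool result.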
Unique Features
- Users can adjust the reasoning depth during problem-solving tasks based on complexity requirements.
- Trained using advanced techniques focused on efficiency and usability across various deployment environments.
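In practice, the reasoning depth is selected with a directive in the system prompt. The sketch below uses a `Reasoning: low|medium|high` line, which matches how the release describes the knob, though the surrounding prompt text is a placeholder assumption:

```python
def build_system_prompt(effort: str = "medium") -> str:
    """Compose a system prompt selecting the model's reasoning effort.
    The 'Reasoning: <level>' line is the adjustable knob; the rest of the
    prompt is an illustrative placeholder."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return f"You are a helpful assistant.\nReasoning: {effort}"
```

A simple task might use `build_system_prompt("low")` for fast answers, while a hard math problem would warrant `"high"` at the cost of more reasoning tokens.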
Technical Specifications
- Each model is a transformer using a mixture-of-experts (MoE) design, which reduces the number of active parameters per token processed.
- The 120B model activates only about 5.1 billion parameters per token while maintaining high quality; the 20B version activates about 3.6 billion.
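The mixture-of-experts idea can be illustrated with a toy top-k router: a gate scores every expert, but only the top-k actually run, so the active parameter count per token is a small fraction of the total. The dimensions, expert count, and k below are illustrative, not gpt-oss's actual configuration:

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy MoE forward pass: score experts, keep the top-k, mix their outputs."""
    scores = x @ gate_w                          # one routing logit per expert
    top = np.argsort(scores)[-k:]                # indices of the k best-scoring experts
    w = np.exp(scores[top] - scores[top].max())  # softmax over the selected experts only
    w /= w.sum()
    # Only these k experts run; the other experts' parameters stay untouched this token.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map in this toy example.
experts = [lambda x, m=rng.normal(size=(d, d)): x @ m for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), gate_w, experts)
```

With k=2 of 4 experts active, only half the expert parameters are touched per token; gpt-oss applies the same principle at a much larger scale.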
Training Methodology
- The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3, for improved inference and memory efficiency.
- They were trained on high-quality text data emphasizing STEM, coding, and general knowledge, using an extended version of the tokenizer from previous OpenAI models.
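The alternating attention pattern can be pictured as two kinds of causal masks. This is a minimal sketch assuming even layers are dense and odd layers use a local band; the band width and layer assignment are illustrative, not the model's real hyperparameters:

```python
import numpy as np

def causal_mask(seq_len, layer_idx, band=4):
    """Dense causal mask on even layers; banded (local) causal mask on odd layers."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    if layer_idx % 2 == 0:
        return causal                 # dense layer: all past tokens visible
    return causal & (i - j < band)    # sparse layer: only the last `band` tokens visible
```

The banded layers cut attention cost from quadratic to linear in sequence length, while the interleaved dense layers preserve long-range information flow.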
Performance Benchmarks
- With tools, the 120 billion parameter version scored 2622 Elo in competitive programming (Codeforces), closely trailing the frontier o3 model's 2706 — a strong showing for an open-weight model.
- On expert-level questions, the 120 billion parameter version scored 19% while o3 reached 24.9%. Notably, the open-weight 120B version outperformed both o4-mini and o3-mini without tools.
- On the medical benchmark HealthBench, the scores were close: 57.6 for the 120B model versus 59.8 for the frontier model. The smaller model also posted impressive results, such as 96% on AIME 2024 (competition math) for the 20 billion parameter version.
- On GPQA Diamond, a benchmark of PhD-level science questions, the 120B model scored 80.1 and the smaller variant 71.5, indicating robust capability even in advanced academic contexts.
- MMLU results showed high accuracy, reaching about 90% for the larger model, with respectable performance from the smaller one, reinforcing their competitive standing.
Safety Considerations
- Monitoring a reasoning model's chain of thought can help detect misbehavior; for this reason, the chain of thought was deliberately left without direct supervision during training, so it remains a faithful signal of the model's reasoning.
- Developers are advised against displaying raw chains of thought to users due to potential hallucinations or harmful content; instead, summarization and filtering are recommended practices.
- Pre-training involved filtering out harmful data related to sensitive topics like chemical and biological threats; however, concerns remain about adversaries fine-tuning open-source models for malicious purposes.
- Testing indicated that even after extensive fine-tuning aimed at malicious use cases (e.g., biological weapons), these models did not achieve high capability levels according to OpenAI's preparedness framework.
- OpenAI is hosting a red-teaming challenge with a $500,000 prize fund, inviting researchers to surface safety issues in these models, with submissions judged by expert reviewers.
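The summarize-and-filter practice recommended above can be sketched as a small gate on the host side. Here `summarize` and `moderate` stand in for whatever summarization model and moderation check a deployment actually uses; both are hypothetical callbacks, not a real API:

```python
def displayable_reasoning(raw_cot: str, summarize, moderate) -> str:
    """Never surface the raw chain of thought to end users:
    run a moderation check first, then show only a summary (or withhold)."""
    if not moderate(raw_cot):          # moderation callback: False means unsafe content
        return "[reasoning withheld]"
    return summarize(raw_cot)          # users see a cleaned-up summary, not the raw trace
```

The key design choice is that the raw trace is never in the code path that reaches the user, which keeps hallucinated or harmful reasoning content out of the UI while still letting operators log and monitor it.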