The Industry Reacts to o3-Pro! (It Thinks a LOT)

Name: The Industry Reacts to o3-Pro! (It Thinks a LOT)
Uploaded: 2025-06-12T00:52:42.000Z
Duration: 24 min 33 s

o3-pro Release Overview

Introduction to o3-pro

o3-pro is OpenAI's most powerful release to date, but its strengths are not well reflected in benchmarks.

The release coincided with an 80% price drop on the vanilla o3 model, raising questions about market strategy.

Features and Performance Insights

The model is available to all Pro users in ChatGPT and through the API, and reviewers preferred it over the previous version (o3) across a range of domains.

Key areas of improvement include science, education, programming, data analysis, and writing. The writing gains are notable because writing lacks the verifiable rewards that reinforcement learning normally relies on.

Performance Metrics

Evaluation Results

Reviewers rated o3-pro higher than the vanilla o3 model for clarity, comprehensiveness, instruction following, and accuracy.

Win rates against o3 show a consistent advantage: overall queries (64%), scientific analysis (64%), personal writing (66%), computer programming (62%), and data analysis (64%).

Benchmark Comparisons

On competitive programming (Codeforces), o3-pro achieved an Elo rating of 2748, over 200 points higher than the medium version of o3.

This score places it at rank 159 globally among competitive programmers.
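
For a rough sense of what a gap of about 200 rating points means, the classic Elo expected-score formula can be sketched as follows. This assumes the textbook logistic Elo model with a 400-point scale; Codeforces uses its own rating system, so the result is only illustrative. The function name and the exact 200-point gap are assumptions made for the example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Illustrative only: the o3-pro rating from the video versus a rating
# exactly 200 points lower (an assumed round number).
stronger = 2748
weaker = stronger - 200
print(round(elo_expected_score(stronger, weaker), 2))  # 0.76
```

Under that model, a 200-point advantage corresponds to an expected score of roughly 0.76 in a head-to-head comparison.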

Reliability Testing

Success Criteria

OpenAI also reports a stricter "4/4 reliability" evaluation, in which a question counts as solved only if the model answers it correctly on all four attempts. Scores are slightly lower under this criterion but still impressive.
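
The difference between ordinary per-attempt accuracy and an all-attempts-correct criterion can be sketched like this. The data below is made up for illustration and the function names are hypothetical; this is not OpenAI's evaluation code.

```python
from typing import Sequence


def per_attempt_accuracy(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual attempts that were correct, across all questions."""
    total = sum(len(question) for question in attempts)
    correct = sum(sum(question) for question in attempts)
    return correct / total


def all_attempts_reliability(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where every attempt was correct."""
    return sum(all(question) for question in attempts) / len(attempts)


# Hypothetical results: four questions, four attempts each.
results = [
    [True, True, True, True],
    [True, True, False, True],
    [True, True, True, True],
    [False, True, True, True],
]

print(per_attempt_accuracy(results))      # 0.875
print(all_attempts_reliability(results))  # 0.5
```

A single miss on a question zeroes it out under the stricter criterion, which is why reliability scores come out lower than per-attempt accuracy.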

Tool Capabilities

The model has access to tools for web search, file analysis, reasoning over image inputs, Python code execution, and memory.

Industry Reactions

Expert Opinions

Greg Kamradt of ARC Prize noted that o3-pro's initial performance is in line with the earlier models released in April; he expects greater robustness but no significant improvement in intelligence.

Comparative Analysis

Although some versions scored higher than o3-pro in the evaluations (e.g., o3-preview in the ARC evaluations), they were significantly more expensive.

Cost Considerations

Pricing Structure

Pricing ranges from $1 to $10 across the different tiers of the new model, while competitors are generally less expensive yet achieve similar win rates on benchmarks.

Flavio Adamo's Insights on o3-pro

Performance and Speed

Flavio Adamo, known for the rotating hexagon ball test, claims that o3-pro is "extremely cheaper, faster, and way more precise" than its predecessor, o1-pro.

The model handles realistic collisions between balls and walls almost perfectly; however, it exhibits slow response times with basic queries taking up to 20 minutes.
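
The core of a ball-and-wall collision like the one in that test is reflecting the ball's velocity about the wall's normal. Below is a minimal sketch, assuming a stationary wall, a unit-length normal, and a perfectly elastic bounce; a rotating hexagon would also need the wall's own velocity taken into account. The function name is hypothetical and this is not the code from the video.

```python
def reflect(velocity: tuple[float, float], normal: tuple[float, float]) -> tuple[float, float]:
    """Reflect a 2D velocity off a stationary wall with a unit-length normal.

    Uses v' = v - 2 (v . n) n, i.e. a perfectly elastic bounce.
    """
    vx, vy = velocity
    nx, ny = normal
    dot = vx * nx + vy * ny
    return (vx - 2 * dot * nx, vy - 2 * dot * ny)


# A ball moving down and to the right hits a floor whose normal points up.
print(reflect((3.0, -4.0), (0.0, 1.0)))  # (3.0, 4.0)
```

The tangential component of the velocity is unchanged and the normal component flips sign, which is what makes the bounce look physically plausible.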

Thinking Time Concerns

A user named Sam Alman noted that a simple prompt took over 13 minutes to process, with no clear explanation for the delay because the model's chain of thought is obfuscated.

McKay Wrigley reported long processing times (up to 26 minutes), raising the question of whether that much inference time is justified for simpler tasks.

Capabilities of o3-pro

Accuracy vs. Inference Time

Matt Shumer highlighted that while o3-pro can produce correct answers after lengthy thinking periods, the excessive time it spends on simple queries makes it impractical for them.

Despite being slow, the model is described as "smart as a whip," showcasing strong reasoning capabilities but also frustrating refusal mechanisms.

Jailbreaking and Strategic Use

Users have successfully jailbroken the model to bypass restrictions, demonstrating its potential for creative applications despite initial limitations.

Ben from Raindrop shared an experience in which feeding o3-pro internal company data led it to generate actionable strategic plans.

Real-world Applications of o3-pro

Innovative Problem Solving

Daria, an MD using o3-pro for immune system research, found its insights deeper and more thoughtful than those of previous models when analyzing the limitations of the natural immune system.

Creative Challenges

Ethan Mollick presented a word ladder puzzle, which the model solved correctly by changing one letter at a time to get from "earth" to "space."
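
The rule of a word ladder (each step changes exactly one letter) is easy to check mechanically. The sketch below validates a proposed ladder; it assumes only the one-letter rule and does not check that each step is a real dictionary word. The function name and the sample ladder are illustrative and are not the model's actual answer.

```python
def is_valid_ladder(words: list[str], start: str, end: str) -> bool:
    """Check that a ladder runs from start to end, changing one letter per step."""
    if not words or words[0] != start or words[-1] != end:
        return False
    for prev, curr in zip(words, words[1:]):
        if len(prev) != len(curr):
            return False
        # Exactly one position may differ between consecutive words.
        if sum(a != b for a, b in zip(prev, curr)) != 1:
            return False
    return True


# Illustrative ladder, unrelated to the puzzle in the video.
print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], "cold", "warm"))  # True
print(is_valid_ladder(["cold", "warm"], "cold", "warm"))  # False
```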

Technical Limitations in Code Generation

Rubik's Cube Simulation Attempt

A prompt asking for a Rubik's cube simulation produced only 328 lines of code from o3-pro, compared with over 1,200 lines from Gemini, and the o3-pro version ultimately failed because of a simple error in resolving references.

Final Thoughts on Model Performance