The Industry Reacts to o3-Pro! (It Thinks a LOT)

o3-pro Release Overview

Introduction to o3-pro

  • The o3-pro model is presented as OpenAI's most powerful release to date, but its strengths are not fully reflected in benchmarks.
  • The release coincided with an 80% price drop on the vanilla o3 model, raising questions about market strategy.

Features and Performance Insights

  • The model is available to all Pro users in ChatGPT and through the API. Reviewers favored it over the previous version because of improved performance across several domains.
  • Key areas of improvement include science, education, programming, data analysis, and writing. Writing stands out because, unlike the others, it lacks verifiable rewards for reinforcement learning.

Performance Metrics

Evaluation Results

  • Reviewers rated o3-pro higher than the vanilla o3 model for clarity, comprehensiveness, instruction following, and accuracy.
  • Win rates against the previous version show a consistent advantage: overall queries (64%), scientific analysis (64%), personal writing (66%), computer programming (62%), and data analysis (64%).

Benchmark Comparisons

  • On competitive programming (Codeforces), o3-pro achieved an Elo rating of 2748, over 200 points higher than the medium version.
  • This rating places it at rank 159 globally among competitive programmers.
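To give a rough sense of what a 200-point gap means, here is a minimal sketch using the standard Elo expected-score formula. It assumes the textbook Elo model applies; Codeforces uses its own Elo-like rating system, so treat the number as an approximation. The function name and the 200-point opponent are illustrative choices, not figures from the source beyond the 2748 rating.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical pairing: the reported 2748 rating against an opponent
# rated exactly 200 points lower (the source only says "over 200").
higher = 2748
lower = higher - 200
print(round(elo_expected_score(higher, lower), 2))  # 0.76
```

Under this assumption, a 200-point advantage corresponds to an expected score of roughly 76% in a head-to-head comparison.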

Reliability Testing

Success Criteria

  • OpenAI also reports results under a stricter four-out-of-four reliability criterion; scores under this criterion were slightly lower but still impressive.
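The difference between a single-attempt score and a four-out-of-four score can be sketched as follows. This assumes that "four-out-of-four" means a question counts as solved only when all four attempts are correct; the function names and the sample data are hypothetical and are not OpenAI's results.

```python
from typing import Dict, List


def first_attempt_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose first attempt is correct."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(attempts[0] for attempts in results.values()) / len(results)


def four_of_four_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly on all four attempts."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(len(a) == 4 and all(a) for a in results.values())
    return solved / len(results)


# Made-up data: four questions, four attempts each.
sample = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],
    "q3": [True, True, True, True],
    "q4": [False, False, True, False],
}
print(first_attempt_rate(sample))  # 0.75
print(four_of_four_rate(sample))   # 0.5
```

Because a single wrong attempt disqualifies a question, the stricter score can never exceed the single-attempt score, which is consistent with the slightly lower numbers noted above.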

Tool Capabilities

  • The model has access to tools including web search, file analysis, image input processing, Python code execution, and memory.

Industry Reactions

Expert Opinions

  • Greg Kamradt from ARC Prize noted that o3-pro's initial performance is in line with the earlier models released in April; he anticipates greater robustness rather than a significant improvement in intelligence.

Comparative Analysis

  • Although some versions scored higher than o3-pro during evaluations (e.g., the ARC preview), they were significantly more expensive.

Cost Considerations

Pricing Structure

  • Pricing ranges from $1 to $10 across the different tiers of the new model. Competitors are generally less expensive yet achieve similar benchmark results.

Flavio Adamo's Insights on o3-pro

Performance and Speed

  • Flavio Adamo, known for the rotating hexagon ball test, claims that o3-pro is "extremely cheaper, faster, and way more precise" than its predecessor, o1-pro.
  • The model handles realistic collisions between balls and walls almost perfectly; however, it is slow, with basic queries taking up to 20 minutes.

Thinking Time Concerns

  • A user named Sam Alman noted that a simple prompt took over 13 minutes to process, with no visible explanation for the delay because the model's chain of thought is obfuscated.
  • McKay Wrigley reported long processing times (up to 26 minutes), raising questions about whether that much inference time is justified for simpler tasks.

Capabilities of o3-pro

Accuracy vs. Inference Time

  • Matt Shumer highlighted that while o3-pro can produce correct answers after lengthy thinking periods, the excessive time it spends on simple queries makes it impractical for them.
  • Despite being slow, the model is described as "smart as a whip," showing strong reasoning capabilities but also frustrating refusal behavior.

Jailbreaking and Strategic Use

  • Users have successfully jailbroken the model to bypass its restrictions, demonstrating its potential for creative applications despite its initial limitations.
  • Ben from Raindrop shared an experience in which feeding o3-pro internal company data led it to generate actionable strategic plans.

Real-world Applications of o3-pro

Innovative Problem Solving

  • Derya, an MD using o3-pro for immune system research, found its insights deeper and more thoughtful than those of previous models when analyzing the limitations of the natural immune system.

Creative Challenges

  • Ethan Mollick presented a word ladder puzzle, which the model solved correctly by changing one letter at a time to get from "earth" to "space."
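The word ladder rule (each step changes exactly one letter) is easy to check mechanically. The sketch below is a hypothetical validator, not anything shown in the video; it checks only the one-letter rule and does not verify that each step is a real dictionary word. The "cold" to "warm" ladder is a standard illustrative example, not the model's actual solution.

```python
from typing import List


def is_valid_ladder(words: List[str]) -> bool:
    """True if every consecutive pair has equal length and differs in exactly one position."""
    if len(words) < 2:
        return False
    for prev, curr in zip(words, words[1:]):
        if len(prev) != len(curr):
            return False
        # Count positions where the two words disagree.
        if sum(a != b for a, b in zip(prev, curr)) != 1:
            return False
    return True


print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"]))  # True
print(is_valid_ladder(["earth", "space"]))  # False: all five letters differ
```

Jumping directly from "earth" to "space" fails the check because every position changes, which is why a correct solution needs several intermediate words.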

Technical Limitations in Code Generation

Rubik's Cube Simulation Attempt

  • Given a prompt to simulate a Rubik's cube, o3-pro produced only 328 lines of code, compared with more than 1,200 lines from Gemini, and the result ultimately failed because of a simple error in resolving references.

Final Thoughts on Model Performance

Video description

Links:
https://x.com/OpenAI/status/1932530409684005048
https://x.com/OpenAIDevs/status/1932532781457752533
https://x.com/GregKamradt/status/1932536315545100315
https://x.com/flavioAd/status/1932530860063961288
https://x.com/Yuchenj_UW/status/1932544842405720540
https://x.com/ns123abc/status/1932536816550220254
https://x.com/mattshumer_/status/1932600385295827259
https://x.com/elder_plinius/status/1932608359028391998
https://x.com/morqon/status/1932512050154279325
https://x.com/DeryaTR_/status/1932541350827774316
https://x.com/emollick/status/1932533635984355792