GPT-5.3 Codex VS Opus 4.6: I TESTED BOTH Models EXTENSIVELY & ONE IS A CLEAR WINNER!

New AI Model Releases: Opus 4.6 vs. GPT-5.3 Codex

Overview of New Models

  • Anthropic released Opus 4.6 and OpenAI launched GPT-5.3 Codex within an hour of each other, highlighting the competitive landscape in AI model development.
  • Opus 4.6 is a significant upgrade over its predecessor, Opus 4.5, particularly excelling in agentic coding tasks.

Features of Opus 4.6

  • The model offers a 1-million-token context window (in beta) and can output up to 128,000 tokens, which is useful for extensive coding projects.
  • Benchmark scores show improvements:
      • Terminal-Bench 2.0: 65.4% (up from 59.8%)
      • OSWorld: 72.7% (up from 66.3%)
      • ARC-AGI-2: 68.8%, nearly double its predecessor's score
  • Introduces adaptive thinking with adjustable effort levels that affect performance; pricing is unchanged at $5 per million input tokens and $25 per million output tokens (a minimal API sketch follows this list).
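To make those numbers concrete, here is a minimal sketch of calling the model through the Anthropic Messages API using the TypeScript SDK. The model ID, token budgets, and the thinking settings below are assumptions for illustration; Opus 4.6's adaptive-thinking/effort control may be exposed differently, so check Anthropic's docs before relying on them.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment.
const client = new Anthropic();

async function main() {
  const response = await client.messages.create({
    // Hypothetical model ID for illustration -- confirm the real one in Anthropic's docs.
    model: "claude-opus-4-6",
    // Opus 4.6 reportedly supports up to 128K output tokens; a smaller budget is used here.
    max_tokens: 16000,
    // Extended thinking as exposed by the Messages API today; the new adaptive
    // "effort" control may use a different parameter.
    thinking: { type: "enabled", budget_tokens: 8000 },
    messages: [
      { role: "user", content: "Plan a refactor of a large monorepo's build scripts." },
    ],
  });
  console.log(response.content);
}

main();
```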

Features of GPT-5.3 Codex

  • OpenAI claims GPT-5.3 Codex is its most capable agentic coding model yet, combining the strengths of previous models while being 25% faster.
  • Notably, it was developed using early versions that helped debug its own training and deployment processes—an unprecedented self-referential capability.

Benchmark Performance Comparison

  • On benchmarks:
      • Terminal-Bench 2.0: GPT-5.3 Codex scored 77.3%, significantly higher than both Opus 4.6 and GPT-5.2.
      • Cybersecurity CTF challenges: 77.6%, marking it as OpenAI's first high-capability model for cybersecurity tasks.

Availability and Efficiency

  • GPT-5.3 Codex has a context window of 400,000 tokens and an output limit of 128,000 tokens; it is available now through various platforms, including the Codex app, with one month of free access.

Personal Benchmark Results for Opus 4.6

  • In personal testing on KingBench:
      • Opus 4.6 achieved a perfect score (100%) across all questions, placing it at the top of the tested models.
      • Specific coding tasks included building complex structures such as a functional chessboard and detailed graphics in JavaScript frameworks (see the sketch after this list).
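Purely to illustrate the flavor of such a task (this is not the KingBench prompt or any model's answer), a minimal functional chessboard in TypeScript could be as small as the following:

```ts
// Minimal 8x8 chessboard model: standard starting position rendered as text.
// Illustrative only -- not the KingBench task or any model's output.
type Square = string | null; // piece letter (e.g. "P", "k") or empty

const backRank = ["R", "N", "B", "Q", "K", "B", "N", "R"];

function startingBoard(): Square[][] {
  const board: Square[][] = Array.from({ length: 8 }, () => Array<Square>(8).fill(null));
  board[0] = backRank.map((p) => p.toLowerCase()); // black back rank
  board[1] = Array(8).fill("p");                   // black pawns
  board[6] = Array(8).fill("P");                   // white pawns
  board[7] = [...backRank];                        // white back rank
  return board;
}

function render(board: Square[][]): string {
  return board
    .map((rank, r) =>
      rank
        .map((sq, f) => sq ?? ((r + f) % 2 === 0 ? "." : "#")) // checker the empty squares
        .join(" ")
    )
    .join("\n");
}

console.log(render(startingBoard()));
```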

Leaderboard Insights

  • Opus 4.6 currently ties with Gemini 3 Pro at the top of the leaderboard, with both scoring 100%, though Gemini 3 Pro costs less per run than Opus at its current pricing.

Testing Agentic Coding and App Performance

Overview of Agentic Tests

  • The speaker discusses upcoming API testing on KingBench, indicating plans for a separate video or update.
  • Agentic coding is introduced as a distinct area of focus, with seven questions addressed through app evaluations.

Evaluation of Mobile Movie Tracker Apps

  • The Opus mobile movie tracker app is praised for its functionality, featuring a homepage with watched movies and a calendar view.
  • In contrast, GPT-5.3 Codex's implementation is criticized for being poorly organized in a single file and lacking usability.

Calculator Application Assessment

  • The graphical calculator built in Go by Opus performs adequately but isn't exceptional; Codex's version, however, is riddled with bugs and non-functional.

Kanban App Performance Review

  • Opus excels, creating a fully functional Kanban app with no bugs, while Codex fails after the login page due to errors (a rough sketch of this kind of app's data model follows).
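As a rough illustration of the kind of structure such a Kanban test exercises (not the actual test app or either model's code), a minimal board model might look like this:

```ts
// Minimal Kanban board model -- illustrative only, not the test app or any model's output.
interface Card {
  id: string;
  title: string;
}

interface Column {
  name: string;
  cards: Card[];
}

// Move a card between columns immutably, as a drag-and-drop handler might.
function moveCard(board: Column[], cardId: string, toColumn: string): Column[] {
  const card = board.flatMap((c) => c.cards).find((c) => c.id === cardId);
  if (!card) return board;
  return board.map((col) => ({
    name: col.name,
    cards:
      col.name === toColumn
        ? [...col.cards.filter((c) => c.id !== cardId), card]
        : col.cards.filter((c) => c.id !== cardId),
  }));
}

const board: Column[] = [
  { name: "To Do", cards: [{ id: "1", title: "Design login page" }] },
  { name: "Doing", cards: [] },
  { name: "Done", cards: [] },
];

console.log(moveCard(board, "1", "Doing"));
```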

Nuxt App Analysis

  • The Stack Overflow clone created by Codex looks visually appealing but runs into authentication issues; Opus delivers a seamless experience instead.

Image Cropper App Comparison

  • Claude's Tauri image cropper works on the web but not as a desktop app; Codex's version fails entirely. Both models struggle with unresolved code issues.

Overall Model Performance Insights

  • Opus 4.6 ranks highest among the models tested, while Codex lags significantly behind due to performance issues and slow response times.
  • The speaker criticizes OpenAI for the lack of API access, raising concerns about its commitment to building functional products compared to competitors like Anthropic.

Conclusion on Current Models

  • The speaker expresses disappointment in Codex's capabilities despite it coming from a major company, favoring Opus 4.6 for its superior user experience.
  • The speaker plans to continue using Opus 4.6 across various coding platforms, citing its efficiency over other available models.

Video description

In this video, I break down the massive release day where Anthropic dropped Opus 4.6 and OpenAI released GPT-5.3 Codex within an hour of each other. I analyze their official capabilities, run them through my custom KingBench suite, and put them head-to-head in real-world agentic coding scenarios to see which model actually reigns supreme for developers.

Key Takeaways:
🔥 Opus 4.6 and GPT-5.3 Codex released simultaneously, bringing massive upgrades to agentic coding.
🤖 Opus 4.6 introduces adaptive thinking, a 1M token context window, and sustained agentic performance.
🛡️ GPT-5.3 Codex helped debug its own training and excels in cybersecurity, but currently lacks API access.
💯 Opus 4.6 scored a perfect 100% on KingBench, tying with Gemini 3 Pro for the top spot.
📉 GPT-5.3 Codex struggled in agentic tests, often using inefficient commands and failing to build functional apps.
🏆 Opus 4.6 dominated the agentic showdown, flawlessly building Svelte, Nuxt, and Expo apps.
💻 The verdict: Opus 4.6 offers a superior developer experience and is available now in Verdent and Kilo Code.