Anthropic and OpenAI released flagship models 27 minutes apart -- the AI pricing and capability gap is getting weird

February 7, 2026

So, here's something wild: Anthropic dropped Opus 4.6 and OpenAI released GPT-5.3-Codex just 27 minutes apart. Both are claiming top spots, but on different benchmarks. According to /u/prakersh, Opus 4.6 dominates reasoning tests, scoring 53.1% on Humanity's Last Exam and pulling ahead on GDPval-AA, while GPT-5.3-Codex crushes coding benchmarks like Terminal-Bench 2.0 at 75.1%. But here's where it gets tricky: pricing. Opus runs about $25 per million output tokens, roughly double Gemini's rate, while open-source options come in around 50 times cheaper. And get this -- 1 million tokens of context is becoming standard, with Opus showing strong retrieval quality at that scale. The market reacted fast: Thomson Reuters stock plummeted nearly 16%, and legal services companies are feeling the squeeze. The surprising twist? Opus's better reasoning may have cost it some writing quality, a sign that models are now specializing rather than dominating across the board. As /u/prakersh notes, this fragmentation is the new normal.

Anthropic shipped Opus 4.6 and OpenAI shipped GPT-5.3-Codex on the same day, 27 minutes apart. Both claim benchmark leads. Both are right -- just on different benchmarks.

Where each model leads

Opus 4.6 tops reasoning tasks: Humanity's Last Exam (53.1%), GDPval-AA (144 Elo ahead of GPT-5.2), BrowseComp (84.0%). GPT-5.3-Codex takes coding: Terminal-Bench 2.0 at 75.1% vs Opus 4.6's 69.9%.

The pricing spread is hard to ignore

Model           Input ($/M tokens)   Output ($/M tokens)
Gemini 3 Pro    $2.00                $12.00
GPT-5.2         $1.75                $14.00
Opus 4.6        $5.00                $25.00
MiMo V2 Flash   $0.10                $0.30

Opus 4.6 costs 2.5x Gemini on input and roughly 2x on output. Open-source alternatives cost about 50x less. At some point the benchmark gap has to justify the price gap -- and for many tasks it doesn't.
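
To make the spread concrete, here is a minimal sketch that prices a single mid-sized request under each model's listed rates. Only the per-million prices come from the table above; the workload size and the helper function are illustrative assumptions, not anything from the source.

```python
# Rough cost comparison using the per-million-token prices listed above.
# The workload size (80K input / 4K output) is an arbitrary illustrative choice.

PRICES = {  # model: (input $/M tokens, output $/M tokens), from the table above
    "Gemini 3 Pro":  (2.00, 12.00),
    "GPT-5.2":       (1.75, 14.00),
    "Opus 4.6":      (5.00, 25.00),
    "MiMo V2 Flash": (0.10, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed flat rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

for model in PRICES:
    print(f"{model:14s} ${request_cost(model, 80_000, 4_000):.4f}")
```

At that workload the Opus request costs roughly 2.5x the Gemini one and about 50x the MiMo one, which is the gap the benchmark numbers have to justify.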

1M context is becoming table stakes

Opus 4.6 adds 1M tokens of context (beta, 2x pricing past 200K). Gemini already offers 1M at standard pricing. The real differentiator is retrieval quality at that scale -- Opus 4.6 scores 76% on MRCR v2 (8-needle, 1M), the strongest result so far.
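
The post says "2x pricing past 200K" but doesn't spell out the billing mechanics, so here is a minimal sketch under one reading: tokens up to 200K bill at the standard input rate and tokens beyond it at double. The function name, the marginal-tier interpretation, and the example token counts are assumptions for illustration, not Anthropic's documented billing.

```python
# Sketch of tiered long-context input pricing for the Opus 4.6 1M-token beta.
# Assumption: "2x pricing past 200K" means only the tokens beyond 200K are
# billed at the doubled rate; the real billing rules may differ (e.g., the
# whole request could move to the higher tier once it crosses the threshold).

STANDARD_RATE = 5.00          # Opus 4.6 input $/M tokens, from the table above
LONG_CONTEXT_MULTIPLIER = 2   # "2x pricing past 200K"
THRESHOLD = 200_000

def long_context_input_cost(input_tokens: int) -> float:
    """Dollar cost of input tokens under the assumed tiered scheme."""
    base = min(input_tokens, THRESHOLD)
    extra = max(input_tokens - THRESHOLD, 0)
    return (base * STANDARD_RATE + extra * STANDARD_RATE * LONG_CONTEXT_MULTIPLIER) / 1_000_000

print(long_context_input_cost(150_000))    # under the threshold: standard rate only
print(long_context_input_cost(1_000_000))  # 200K at $5/M plus 800K at $10/M = $9.00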

Market reaction was immediate

Thomson Reuters stock fell 15.83% and LegalZoom dropped nearly 20%. Frontier model launches are now moving SaaS valuations in real time.

The tradeoff nobody expected

Opus 4.6 is drawing writing-quality complaints from early users. The theory: RL optimization for reasoning degraded prose output. Models are getting better at some things by getting worse at others.

No single model wins across the board anymore. The frontier is fragmenting by task type.

Source with full benchmarks and analysis: Claude Opus 4.6: 1M Context, Agent Teams, Adaptive Thinking, and a Showdown with GPT-5.3

submitted by /u/prakersh