I Tested GPT 5.4 Against Every Rival — Here's My Honest Review
Two weeks ago, OpenAI dropped GPT 5.4. Within hours, my feed was wall-to-wall benchmark tables and breathless takes about the "best model ever." So I did what any reasonable developer would do: I ignored all of it and ran my own test.
This is my GPT 5.4 review after two weeks of daily use across real projects — not a benchmark summary, not a migration guide, and definitely not a press release rewrite. I tested it head-to-head against Claude, Gemini, and MiniMax on a single creative coding prompt, then spent the following days using it on actual work. Here's what I found.
What GPT 5.4 Actually Ships With
Let's get the spec sheet out of the way quickly.
GPT 5.4 launched on March 5, 2026, as a convergence play — it merges GPT-5.3 Codex's coding chops with GPT-5.2's generalist reasoning into one model. The headline features: native computer use (75% on OSWorld, beating the human expert baseline of 72.4%), a 1M-token context window, tool search that cuts token usage by up to 47%, and configurable reasoning effort across five levels from none to xhigh.
Now the caveats nobody puts in the headline.
That 1M context window comes with a 2× input / 1.5× output surcharge once you exceed 272K tokens. And "1M" is generous marketing — OpenAI's own Graphwalks benchmark shows accuracy dropping from 93% at 128K to 21.4% between 256K and 1M. One independent test found instructions placed at token 850,000 were missed 40% of the time. So you've got a 1M window where the back half is more decorative than dependable.
Pricing is $2.50/$15 per million tokens for standard context — roughly half of Claude Opus 4.6. But that comparison flips for long-context work. Anthropic removed all long-context surcharges on March 14, so a 500K-token conversation on Claude might actually be cheaper than GPT 5.4. The devil, as always, is in the pricing page.
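To make the surcharge concrete, here's a back-of-the-envelope input-cost calculator. It assumes the 2× multiplier applies only to the tokens beyond the 272K threshold (OpenAI's actual metering may differ), using the standard-context rate quoted above:

```python
def gpt54_input_cost(input_tokens: int) -> float:
    """Estimate GPT 5.4 input cost in dollars, assuming the 2x long-context
    surcharge applies only to tokens beyond the 272K threshold."""
    base_rate = 2.50 / 1_000_000   # $2.50 per million input tokens
    threshold = 272_000
    if input_tokens <= threshold:
        return input_tokens * base_rate
    surcharged = input_tokens - threshold
    return threshold * base_rate + surcharged * base_rate * 2

# A 500K-token prompt: 272K at the base rate, 228K at double.
print(f"${gpt54_input_cost(500_000):.2f}")  # $1.82
```

Run that across a long multi-turn session, where the growing conversation is re-sent as input on every turn, and the surcharge compounds quickly.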
The Test — One Prompt, Four Models, Zero Hand-Holding
I have a theory: benchmarks tell you how well a model performs on tasks someone else chose. A single creative prompt tells you how a model thinks.
The Prompt
I asked each model the same thing: build me an atomic world clock app. Analog clocks with hands, digital time readouts, a world map with time zones, and real atomic time synchronisation. One shot, no follow-ups, no hand-holding. The kind of prompt that tests design sense, technical accuracy, and integration quality all at once.
Four models entered. None left unscathed.
Results by Model
GPT 5.4 delivered the best-looking result by a comfortable margin. Clean layout, nice visual hierarchy, the kind of output you could screenshot and put in a pitch deck. But the atomic time sync was broken — it fetched the time once and then drifted. The analog clock hands were also slightly off, which is a problem when your app is literally a clock.
Claude nailed the atomic time synchronisation. The NTP sync worked correctly, updating at proper intervals. The visual design was functional rather than pretty — it prioritised correctness over aesthetics, which tracks with what I found in my Opus 4.6 review. For a clock app, I'd argue that's the right call, but your mileage may vary.
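The drift bug is worth dwelling on, because the fix is a well-known pattern: capture the offset between server time and a local monotonic clock once per sync, then derive every subsequent reading from the monotonic clock plus that offset. Here's a minimal sketch of that pattern (an illustration of the technique, not a claim about what Claude's generated code actually looked like):

```python
import time

class SyncedClock:
    """Anchors displayed time to the last server timestamp via a monotonic
    clock, so ticks between syncs don't accumulate wall-clock drift."""

    def __init__(self, monotonic=time.monotonic):
        self._monotonic = monotonic  # injectable for testing
        self._offset = None          # server_time - local_monotonic at last sync

    def sync(self, server_unix_time: float) -> None:
        # Record the gap between server time and the local monotonic clock.
        self._offset = server_unix_time - self._monotonic()

    def now(self) -> float:
        if self._offset is None:
            raise RuntimeError("call sync() first")
        # Derive the current time from the monotonic clock plus the stored
        # offset, instead of re-fetching (or re-trusting) the wall clock.
        return self._monotonic() + self._offset
```

Re-run `sync()` at a sensible interval and the displayed time stays honest between fetches. Fetching once and then leaning on `Date.now()` forever, which is what the broken outputs effectively did, drifts with the local clock.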
Gemini surprised me. The setup was rocky — first attempt had import errors — but once it got going, it produced a surprisingly complete result with an actual world map (not just a list of cities). The map used real geographic projections. The time sync had the same drift issues as GPT 5.4, though, and the analog clock hands were off.
MiniMax M2.5 produced a non-functional result. At a fraction of the cost, you get what you pay for on creative integration tasks. It's a different story for targeted code fixes (I covered its strengths in my MiniMax M2.5 review), but this wasn't that kind of test.
Every single model struggled with analog clock hand placement. Turns out, correctly mapping hours, minutes, and seconds to angular positions on a circle is the kind of simple-sounding problem that trips up AI models consistently. Make of that what you will.
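For the curious, the math every model fumbled somewhere is small. The usual trap is advancing each hand only on its own unit: the hour hand has to creep forward with the minutes, and the minute hand with the seconds. A correct mapping looks something like this:

```python
def hand_angles(hours: int, minutes: int, seconds: int) -> tuple[float, float, float]:
    """Return (hour, minute, second) hand angles in degrees, clockwise from 12."""
    second_angle = seconds * 6.0                                   # 360 / 60
    minute_angle = minutes * 6.0 + seconds * 0.1                   # seconds nudge the minute hand
    hour_angle = (hours % 12) * 30.0 + minutes * 0.5 + seconds / 120.0  # minutes nudge the hour hand
    return hour_angle, minute_angle, second_angle

print(hand_angles(6, 30, 0))  # (195.0, 180.0, 0.0)
```

At 6:30 the hour hand sits at 195°, halfway between 6 and 7 — drop the `minutes * 0.5` term and it points dead at the 6, which is exactly the "slightly off" look the generated clocks had.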
What This Test Actually Reveals
The "best model" depends entirely on what you're measuring. Design? GPT 5.4. Correctness? Claude. Completeness? Gemini, once past its rocky setup. Raw coding output at 1/20th the price? MiniMax has a case for targeted tasks.
But here's the real insight: benchmarks measure narrow capabilities in controlled conditions. A single creative prompt exposes integration quality, aesthetic judgment, and failure modes simultaneously. No model aced everything. Not one. And that single data point tells me more about the state of AI in 2026 than any leaderboard.
Benchmarks Are Barely Useful Now
I'm going to spend a few hundred words on benchmarks, and then I'm going to explain why you should take all of them with a fistful of salt.
The Numbers That Matter
Here's the current state of play, sourced from Vals.ai and Artificial Analysis:
| Benchmark | GPT 5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Leader |
|---|---|---|---|---|
| SWE-Bench Verified | 77.2% | 79.2% | 80.6% | Opus 4.6 |
| SWE-Bench Pro | 57.7% | ~45% | 54.2% | GPT 5.4 |
| Terminal-Bench 2.0 | 75.1% | 74.7% | 78.4% | Gemini 3.1 |
| OSWorld-Verified | 75.0% | 72.7% | — | GPT 5.4 |
| Arena Coding Elo | ~1481† | 1561 (#1) | — | Opus 4.6 |
| AA Intelligence Index | 57 | 53 | 57.2 | Gemini ≈ GPT 5.4 |
†GPT 5.4 may not have accumulated enough Arena votes for a stable ranking yet.
No single model dominates. The leader changes depending on which row you look at.
Why You Shouldn't Trust Any Single Number
OpenAI dropped SWE-Bench Verified — the benchmark where Claude leads — in favour of SWE-Bench Pro, where GPT 5.4 happens to lead. Convenient.
Scaffold choice massively changes scores. xAI's Grok 4 self-reported 72–75% on SWE-Bench but tested at 58.6% independently with SWE-agent. MiniMax M2.5 runs its evaluations using Claude Code as scaffolding, which is a bit like entering a cooking competition with someone else's oven.
Then there's the SM-Bench regression: GPT 5.4 scored 36.8% on conversational tasks where GPT-4o scored 97.3%. That's not a typo. The model got dramatically worse at casual conversation while getting better at professional tasks. Whether that matters to you depends on what you're building.
As Zvi Mowshowitz put it: benchmarks have never been less useful for telling us which models are best. I don't think he's wrong.
The Real Strengths and Deal-Breakers
Where GPT 5.4 Wins
Computer use is the real differentiator. 75% on OSWorld isn't just a number — it means the model can actually operate software through screenshots and mouse/keyboard actions via Playwright. Box validated this on property-tax portal automation with 95% first-attempt success across roughly 30,000 tasks. If your workflow involves automating desktop applications, GPT 5.4 is currently the best option.
Pricing for standard-context work. At roughly half the per-token cost of Claude Opus 4.6 (for prompts under 272K tokens), the value proposition is real. For high-volume API workloads that don't need massive context, the savings add up.
Code fixing. This is subjective and based on my own experience, but GPT 5.4 may be the best model right now for targeted code repairs. When you have a specific bug and need it fixed, it tends to zero in on the problem without rewriting half your codebase. Tends to. (More on that in a moment.)
The Codex app's parallel-agent architecture is genuinely impressive. Running multiple coding agents across isolated worktrees with built-in diff review and Git integration is a good workflow for async, autonomous task delegation.
Where It Falls Down
Task overexpansion is the big one. Developer @vasumanmoza's viral post captures it perfectly: GPT-5 refactored their entire codebase in a single call — 25 tool invocations, 3,000+ new lines, 12 brand-new files, none of it working. Every.to's Vibe Check evaluation confirmed the pattern: the model routinely expands tasks beyond what you asked for, redesigning login systems nobody asked it to touch. I've experienced this myself. You ask it to fix a button, and it comes back having restructured your component hierarchy. Not great.
False completion claims. The model sometimes says it's done when it isn't — and in some cases, does so in ways that look deliberate. OpenAI acknowledged this in their launch post, noting they'd reduced the deception rate from 4.8% (o3) to 2.1%. That's progress, but 2.1% still means roughly 1 in 50 tasks might be confidently presented as complete when they're not.
The car wash problem. Ask GPT 5.4 whether you should walk or drive to a car wash 100 meters away, and it writes a full essay recommending walking — completely missing that you need the car at the car wash. Claude answered this in one sentence. It's a trivial example, but it illustrates a pattern: strong quantitative reasoning paired with weak practical inference.
Terminal-Bench regression. GPT 5.4 scores 75.1% versus GPT-5.3 Codex's 77.3% on Terminal-Bench 2.0. The model got worse at terminal operations than its own predecessor. If your workflow is terminal-heavy — SSH, CLI debugging, git operations, build systems — GPT-5.3 Codex is still the better choice.
A tool behaviour bug, open since the day after launch. Since March 6, GPT 5.4 has ignored built-in tools like shell and apply_patch whenever custom function tools are present, telling you "I do not have such a tool" when the tool is right there. Multiple developers confirmed this on the OpenAI community forum, and GitHub issue #13773 documents the regression. Two weeks later, it's still not fixed.
GPT 5.4 vs Claude for Coding — Developer Experience Matters
Benchmarks are one thing. The actual experience of sitting down and writing code with these models is another.
I've used the Codex CLI, Claude Code, and OpenCode extensively. My preference is Claude Code, and honestly, I even reach for OpenCode more than Codex CLI — I wrote about why I switched to it. Codex is async and autonomous — you delegate tasks and review results later. That's powerful for certain workflows, but I find I do my best work iteratively, and Claude Code is the best middle ground for that. It's right there in the terminal with me, I can steer it in real time, and it handles multi-file refactors better than anything else I've used.
The pattern I've settled into, and what I hear from most developers I talk to: nobody is switching wholesale. Everyone is routing by task. GPT 5.4 for large-codebase analysis and targeted fixes. Claude for multi-file refactoring and architectural work. The dual-wield approach isn't a compromise — it's the strategy.
The Verdict — Should You Switch?
GPT 5.4 is very good. It might be the best model — for your specific task.
Fair comparison between frontier models is nearly impossible right now. The gap has closed to 2–3 percentage points on most benchmarks, and which model "wins" depends on which benchmark you pick, which scaffold runs the evaluation, and what kind of work you're doing. Anyone telling you one model is definitively the best in March 2026 is either selling something or hasn't tested broadly enough.
Here's my recommendation: use GPT 5.4 for code fixes and computer use automation. Use Claude for architectural work, multi-file refactoring, and anything requiring consistency over long sessions. Consider MiniMax M2.5 if you're cost-sensitive and doing targeted coding work. Route by task, not by brand loyalty.
The bottom line: GPT 5.4 is OpenAI's strongest model yet, excelling at computer use and targeted code fixes. But Claude Opus 4.6 still leads on key coding benchmarks and developer preference. The best strategy in 2026 is task-based model routing — use the right model for each job, and stop waiting for a single model to win at everything. That model isn't coming.