
Claude Opus 4.6: What's Actually Better?

Anthropic dropped Claude Opus 4.6 on February 5th, and the internet did what it does — half the people called it a breakthrough, the other half said it was lobotomised. I've been using it since launch day, and the truth is somewhere in between. More interesting, actually.

Let me walk through what's genuinely new, what the benchmarks say versus what my fingers-on-keyboard experience tells me, and whether you should bother switching from 4.5.

What Opus 4.6 Brings to the Table

The headline is a 1M-token context window — a first for any Opus-class model. It's in beta and restricted to API users at tier 4 or above (sorry, Claude Max subscribers), but the implication is significant. You can feed it an entire codebase, a stack of legal documents, or a research corpus in a single pass. Output capacity doubles to 128K tokens, up from 64K.

The old binary extended-thinking mode is gone, replaced by adaptive thinking with four effort levels: low, medium, high (default), and max. Instead of a fixed token budget for reasoning, the model dynamically decides how deeply to think. Anthropic recommends dialling it down to medium for simple tasks, which is polite-speak for "this thing will burn through your token budget if you let it."
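If you call the API, that dial is worth wiring into your tooling from day one. Here's a minimal sketch of building a request with an effort setting; the `thinking` parameter shape and the value names are assumptions based on the description above, not confirmed API surface:

```python
# Hypothetical sketch: the "thinking" parameter shape and effort values are
# assumptions inferred from the feature description, not confirmed API surface.
# The idea: default routine work to "medium", reserve "high"/"max" for hard problems.

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a messages-API payload with an assumed adaptive-thinking knob."""
    allowed = {"low", "medium", "high", "max"}
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "adaptive", "effort": effort},  # assumed shape
        "messages": [{"role": "user", "content": prompt}],
    }

# Dial down for routine work, as Anthropic recommends:
payload = build_request("Summarise this changelog.", effort="medium")
```

Defaulting your own helpers to medium and opting up per call keeps the token burn deliberate rather than ambient.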

Then there's the stuff that actually changes how you work. Agent teams in Claude Code — still a research preview — let multiple Claude instances split tasks in parallel and coordinate results. Context compaction does server-side summarisation of older conversation context, enabling effectively infinite conversations. And there are new integrations for PowerPoint and upgraded Excel capabilities, if that's your world.
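Compaction happens server-side, but the idea is easy to illustrate client-side: once a conversation grows past a threshold, older turns collapse into a summary. A rough sketch, with `summarize()` as a stand-in for a real summarisation call:

```python
# Illustrative sketch of the compaction idea: older turns are replaced by a
# summary so total context stays bounded. Opus 4.6 does this server-side;
# summarize() here is a stand-in for a real summarisation call.

def summarize(messages: list[dict]) -> dict:
    # Stand-in: a real implementation would call the model itself.
    text = " ".join(m["content"] for m in messages)
    return {"role": "user",
            "content": f"[Summary of {len(messages)} earlier turns: {text[:80]}...]"}

def compact(history: list[dict], keep_recent: int = 4) -> list[dict]:
    """Collapse everything but the most recent turns into one summary message."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)  # 1 summary message + 4 recent turns
```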

This isn't a point release. Architecturally, it's a different beast from 4.5.

The Benchmarks Look Impressive — But Do They Matter?

Where Claude Opus 4.6 Clearly Wins

The numbers are hard to argue with. On GDPval-AA, Opus 4.6 scored 1606 Elo — a 190-point jump over 4.5 and 144 points ahead of GPT-5.2. On Terminal-Bench 2.0 (agentic coding), it leads at 65.4%. ARC AGI 2, which tests novel problem-solving, nearly doubled from 37.6% to 68.8%. That's not incremental.

The long-context performance is where things get genuinely impressive. On MRCR v2 with an 8-needle test at 1M context, Opus 4.6 scored 76% compared to 18.5% for Sonnet 4.5, roughly a 4× improvement. At 256K context the gap widened further, to 93% versus 10.8%. If you work with large codebases or document sets, those are the numbers that matter most.
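If you want to sanity-check long-context retrieval on your own data, an MRCR-style multi-needle harness is easy to sketch: bury key-value "needles" in filler text, then score how many an answer recovers. This is a simplified stand-in for the real benchmark, not a reproduction of it:

```python
import random

# Simplified MRCR-style setup: scatter N key-value "needles" through filler
# text, then score how many needle values a model's answer recovers.

def build_haystack(needles: dict[str, str], filler_paragraphs: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    chunks = [f"Filler paragraph {i}. " * 3 for i in range(filler_paragraphs)]
    for key, value in needles.items():
        # Insert each needle sentence at a random position.
        chunks.insert(rng.randrange(len(chunks) + 1), f"The code for {key} is {value}.")
    return "\n\n".join(chunks)

def recall_score(answer: str, needles: dict[str, str]) -> float:
    """Fraction of needle values that appear in the answer."""
    found = sum(1 for v in needles.values() if v in answer)
    return found / len(needles)

needles = {f"needle-{i}": f"X{i}Z" for i in range(8)}
prompt = build_haystack(needles, filler_paragraphs=200)
# Feed `prompt` plus a retrieval question to the model, then:
# recall_score(model_answer, needles)
```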

Oh, and during testing it discovered over 500 zero-day vulnerabilities in open-source code. Axios reported it could become a primary mechanism for securing open-source software. Not bad for a side effect.

Where It Doesn't Move the Needle

SWE-bench Verified is essentially flat — 80.8% versus 80.9% for 4.5. A prompt modification pushed it to 81.42%, which tells you these margins are within noise. GPT-5.2 still edges it on GPQA Diamond (93.2% vs 91.3%) and MCP Atlas tool coordination.

Here's the thing I keep coming back to: benchmarks improve every release. The numbers go up, the charts look good, the blog posts write themselves. But the gap between "good" and "better" shrinks perceptually even as the numbers climb. Going from 80% to 81% on SWE-bench doesn't feel like anything in your daily workflow. Going from 18.5% to 76% on long-context retrieval — that you feel.

Coding Got Better, Writing Got Worse — The Familiar Tradeoff

The partner testimonials read like a greatest-hits album. Cursor co-founder Michael Truell said it "excels on the hardest problems" with "greater persistence" and "stronger code review." GitHub's CPO highlighted its strength in complex multi-step coding work. Cognition's Scott Wu — the Devin people — said it "reasons through complex problems at a level we haven't seen before."

Real-world numbers back it up. Rakuten reported the model autonomously closed 13 issues and assigned 12 to the right team members in a single day across a 50-person org and six repositories. Norway's sovereign wealth fund found it produced the best results in 38 of 40 blind-ranked cybersecurity investigations against Claude 4.5 models.

The improvements are exactly what I feel as a coder. The model is more persistent. It holds context better across long agentic sessions. It doesn't give up and suggest workarounds as quickly.

But within hours of launch, Reddit lit up. Posts titled "Opus 4.6 lobotomized" and "Opus 4.6 nerfed?" gained traction on r/ClaudeCode and r/Anthropic. The complaint was consistent: writing quality regressed, particularly for technical documentation. The emerging community consensus was blunt — use 4.6 for coding, stick with 4.5 for writing.

This raises a question I think about more than I should: are we training models to be great at what's measurable — code passes tests, benchmarks have scores — at the expense of what's inherently subjective? Writing quality, tone, the feel of a well-crafted explanation — those don't have leaderboards. And maybe that's the problem.

Pricing Looks the Same But Isn't

Token pricing is identical to 4.5: $5 per million input tokens, $25 per million output tokens. Prompts exceeding 200K trigger premium rates of $10/$37.50. Batch processing still gets you a 50% discount, and prompt caching can save up to 90%.

Sounds fine on paper. In practice, early adopters report Opus 4.6 consumes roughly 5× more tokens per task than 4.5 due to adaptive thinking. The model thinks harder by default, which means it burns through your budget faster even though the per-token price hasn't changed. Anthropic's own evaluation cost via Artificial Analysis was $1,030.78 for a full Intelligence Index run.

                          Opus 4.5   Opus 4.6
Input (per 1M tokens)     $5         $5
Output (per 1M tokens)    $25        $25
Tokens per typical task   Baseline   ~5× baseline
Effective cost per task   $X         ~$5X

Budget the same per-token, but expect higher bills.
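A back-of-envelope model makes the difference concrete. One assumption, flagged plainly: this treats the reported ~5× as applying uniformly to a task's total token volume, when real workloads will skew it toward thinking and output tokens:

```python
# Back-of-envelope cost model from the figures above. The 5x token multiplier
# is an early-adopter report, not an official number, and applying it
# uniformly to a whole task is a simplification.

STANDARD = {"input": 5.00, "output": 25.00}   # $ per 1M tokens
PREMIUM  = {"input": 10.00, "output": 37.50}  # prompts over 200K input tokens

def task_cost(input_tokens: int, output_tokens: int, multiplier: float = 1.0) -> float:
    """Estimate dollars per task; `multiplier` models extra thinking-token burn."""
    rates = PREMIUM if input_tokens > 200_000 else STANDARD
    per_run = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
    return per_run * multiplier

cost_45 = task_cost(50_000, 8_000)                 # $0.45
cost_46 = task_cost(50_000, 8_000, multiplier=5)   # $2.25: same rates, ~5x tokens
```

Same rate card, very different invoice.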

Should You Switch from Opus 4.5?

Switch if your workload is code-heavy, you're doing long-context analysis, or you need the agentic capabilities — agent teams, context compaction, the deeper reasoning. If you're in security research, the vulnerability-finding capabilities alone might justify it. The model is available via the API (claude-opus-4-6), Amazon Bedrock, Google Cloud Vertex AI, and directly through claude.ai on Pro, Max, Team, and Enterprise plans.

Stay on 4.5 if writing quality matters more than coding performance for your use case, you're cost-sensitive and don't want the adaptive thinking overhead eating your budget, or you want the 1M context window but you're on Claude Max (it's API-only, tier 4+ for now).

Here's my honest take: I use Claude daily. It's my primary tool. Opus 4.6 works well and feels strong — but distinguishing it from 4.5 in everyday use is genuinely difficult. The improvements are real but incremental in feel, even when the benchmarks say otherwise. The long-context stuff is the exception — that's a qualitative shift you notice immediately. Everything else is the kind of improvement you'd struggle to identify in a blind test.

My recommendation: switch your coding workflows to 4.6 now. Keep 4.5 around for writing-heavy tasks until the regression gets addressed. And dial that adaptive thinking down to medium for anything that doesn't need deep reasoning — your wallet will thank you.
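That split is straightforward to encode in your tooling. A hypothetical routing helper follows; the 4.5 model id here is an assumption by analogy with the claude-opus-4-6 id, so check it against the current model list before relying on it:

```python
# Hypothetical router implementing the "4.6 for code, 4.5 for writing" split.
# The claude-opus-4-5 id is assumed by analogy with claude-opus-4-6; verify
# both against the provider's current model list.

CODING_TASKS = {"code", "debug", "refactor", "code-review", "agentic", "security"}

def pick_model(task_type: str) -> str:
    """Route coding work to Opus 4.6 and writing-heavy work to Opus 4.5."""
    if task_type in CODING_TASKS:
        return "claude-opus-4-6"
    return "claude-opus-4-5"  # assumed id for the previous Opus

model = pick_model("refactor")  # "claude-opus-4-6"
```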

The Bigger Picture — Model Releases as Market Events

One thing worth noting: this launch landed during a week where Bloomberg reported a $285 billion rout in software stocks, with Thomson Reuters down nearly 16% and LegalZoom dropping almost 20%. Goldman Sachs' basket of US software stocks sank 6% in its biggest single-day decline since April.

Model releases aren't just product updates anymore. They're macroeconomic events. When Opus 4.6's financial analysis capabilities — the ability to scrutinise filings, market data, and regulatory documents in one pass — hit the news cycle, investors didn't debate benchmarks. They repriced entire sectors.

If each model release triggers selloffs, the AI industry's release cadence becomes a macro factor. That's a sentence I never expected to write on a developer blog, but here we are.

Thomas Wiegold

AI Solutions Developer & Full-Stack Engineer with 14+ years of experience building custom AI systems, chatbots, and modern web applications. Based in Sydney, Australia.
