
Claude Opus 4.5 Review: Anthropic's New Coding Model Breaks Records

Anthropic · Claude · Opus · LLM

Another week, another model release. GPT-5.1 dropped, Gemini 3 Pro followed, and now Anthropic throws Claude Opus 4.5 into the ring. Normally, I'd roll my eyes at yet another "groundbreaking" announcement—but this time I'm genuinely paying attention.

Here's why: I use Claude constantly. Before Anthropic tightened the limits a few months back, Opus was my daily driver for serious coding work. Then the caps got restrictive enough that I downgraded to Sonnet, and honestly? Haiku 4.5 handled most of my quick tasks just fine. But Opus 4.5 changes the equation: 80.9% on SWE-bench Verified (the first model to crack 80%), a 67% price cut from Opus 4.1, and apparently increased limits for Max subscribers.

I haven't put it through the wringer yet, but first impressions are strong. Let me break down what matters.

What's New in Claude Opus 4.5

Hybrid Reasoning and the Effort Parameter

The headline feature is hybrid reasoning—a single model trained for both quick responses and extended chain-of-thought processing. You control how hard Claude thinks via a new effort parameter: low, medium, or high.

This is more useful than it sounds. Need a quick function signature? Low effort. Complex refactoring across multiple files? Crank it to high. The practical benefit is obvious: you're not burning tokens (and money) on trivial tasks, but you get the full reasoning depth when it matters.

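import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Effort launched as a beta: beta endpoint plus a beta flag (verify against current docs)
response = client.beta.messages.create(
    model="claude-opus-4-5-20251101",
    betas=["effort-2025-11-24"],
    max_tokens=4096,
    messages=[{"role": "user", "content": "Your prompt"}],
    output_config={"effort": "high"},  # "low", "medium", or "high"
)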
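If you want to pick the effort level programmatically, a small lookup is enough. This is a minimal sketch: the task categories and the mapping are my own illustrative assumptions, not Anthropic guidance, and the right settings still need testing.

```python
# Hypothetical task categories mapped to an effort level.
# The mapping mirrors the rule of thumb above; tune it for your own workload.
EFFORT_BY_TASK = {
    "function_signature": "low",
    "bug_fix": "medium",
    "multi_file_refactor": "high",
}


def pick_effort(task: str, default: str = "medium") -> str:
    """Return the effort level for a task category, falling back to a default."""
    return EFFORT_BY_TASK.get(task, default)


print(pick_effort("function_signature"))   # low
print(pick_effort("multi_file_refactor"))  # high
print(pick_effort("something_else"))       # medium
```

The returned string is what you would pass as the effort value in the request.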

Infinite-Length Conversations

Anyone who's hit the context wall mid-session knows the pain. Opus 4.5 introduces automatic context compaction—when you approach the 200K token limit, Claude summarizes earlier messages and keeps going. No more "sorry, we need to start a new conversation."

For extended coding sessions where you're iterating on architecture, this is a genuine workflow improvement.
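To make the idea concrete, here is a rough client-side sketch of what compaction does conceptually. It is not Anthropic's implementation: the characters-per-token estimate, the 80% threshold, the number of recent messages kept, and the `summarize` callback are all assumptions for illustration.

```python
from typing import Callable

Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    limit: int = 200_000,
    threshold: float = 0.8,
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older messages with one summary once the estimate nears the limit."""
    near_limit = estimate_tokens(messages) >= limit * threshold
    if not near_limit or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "user",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + recent


# Usage with a stand-in summarizer (a real one would call a model).
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "a" * 1000}
    for i in range(10)
]
compacted = compact(history, lambda old: f"{len(old)} earlier messages", limit=2000)
print(len(history), "->", len(compacted))  # 10 -> 5
```

The trade-off is the usual one for summarization: details from early in the session can be lost, which is why long-session behavior is on my list of things to test.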

Token Efficiency That Actually Matters

At medium effort, Opus 4.5 matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. That's not a typo. GitHub's Copilot team reported it "surpasses internal coding benchmarks while cutting token usage in half."

In practice, this means faster responses and lower API costs. Combined with the price cut, the economics of using Opus just got significantly better.
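A quick back-of-the-envelope check shows why this matters even though Opus costs more per token. The output prices come from the pricing table in this post; the 10,000-token baseline is a made-up figure, and the sketch only covers output tokens at the medium-effort setting the 76% claim refers to.

```python
def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a number of output tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


SONNET_OUTPUT_PRICE = 15.0  # USD per million output tokens
OPUS_OUTPUT_PRICE = 25.0    # USD per million output tokens

sonnet_tokens = 10_000                           # hypothetical task size
opus_tokens = round(sonnet_tokens * (1 - 0.76))  # 76% fewer output tokens

sonnet_cost = output_cost(sonnet_tokens, SONNET_OUTPUT_PRICE)
opus_cost = output_cost(opus_tokens, OPUS_OUTPUT_PRICE)

print(f"Sonnet 4.5: {sonnet_tokens} tokens, ${sonnet_cost:.3f}")  # 10000 tokens, $0.150
print(f"Opus 4.5:   {opus_tokens} tokens, ${opus_cost:.3f}")      # 2400 tokens, $0.060
```

Input tokens are still pricier on Opus, so the overall saving depends on how input-heavy your requests are.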

Other Notable Updates

  • 200K context window with 64K max output, the same output limit as Sonnet 4.5
  • Knowledge cutoff of March 2025—reasonably current
  • Thinking block preservation—previous reasoning persists across turns
  • Zoom tool for computer use—inspects fine UI details and small text

Benchmark Performance vs. Reality

Breaking Records on SWE-Bench

The numbers are impressive:

| Benchmark | Opus 4.5 | Sonnet 4.5 | GPT-5.1 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.9% | 77.2% | 76.3% | 76.2% |
| Terminal-Bench | 59.3% | 44.3% | 47.6% | 54.2% |
| ARC-AGI-2 | 37.6% | 13.6% | 17.6% | 31.1% |

Opus 4.5 is the first model to break 80% on SWE-bench Verified. It more than doubles GPT-5.1's ARC-AGI-2 score. These aren't marginal improvements.
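For anyone who wants to check the margins, this small script recomputes them from the table above. It uses only the scores listed there; the function name is mine.

```python
SCORES = {
    "SWE-bench Verified": {"Opus 4.5": 80.9, "Sonnet 4.5": 77.2, "GPT-5.1": 76.3, "Gemini 3 Pro": 76.2},
    "Terminal-Bench": {"Opus 4.5": 59.3, "Sonnet 4.5": 44.3, "GPT-5.1": 47.6, "Gemini 3 Pro": 54.2},
    "ARC-AGI-2": {"Opus 4.5": 37.6, "Sonnet 4.5": 13.6, "GPT-5.1": 17.6, "Gemini 3 Pro": 31.1},
}


def lead_over_runner_up(row: dict[str, float], model: str = "Opus 4.5") -> tuple[str, float]:
    """Return the best other model and the gap to it in percentage points."""
    others = {name: score for name, score in row.items() if name != model}
    runner_up = max(others, key=others.get)
    return runner_up, row[model] - others[runner_up]


for benchmark, row in SCORES.items():
    name, gap = lead_over_runner_up(row)
    print(f"{benchmark}: +{gap:.1f} points over {name}")

arc = SCORES["ARC-AGI-2"]
print(f"ARC-AGI-2 ratio vs GPT-5.1: {arc['Opus 4.5'] / arc['GPT-5.1']:.2f}x")
```

The gaps to the nearest competitor come out at 3.7, 5.1, and 6.5 points, and the ARC-AGI-2 ratio against GPT-5.1 is about 2.14x.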

The Simon Willison Reality Check

But benchmarks aren't everything. Simon Willison—someone whose testing I trust—ran Opus 4.5 through extensive refactoring: 20 commits, 39 files changed, over 3,000 lines modified. His honest take? He switched back to Sonnet 4.5 and "kept on working at the same pace."

That's worth acknowledging. For many tasks, the difference between Opus and Sonnet might not be dramatic. The gains show up on the hard problems—complex multi-file refactoring, architectural decisions, edge cases that trip up lesser models.

My Early Testing

I haven't pushed it hard enough yet to make definitive claims. What I can say: responses feel faster, the output quality on the tasks I've thrown at it has been solid, and I haven't hit the frustrating limits that made me abandon Opus months ago.

The reduced token consumption is real and noticeable. Whether that translates to meaningfully better code? I'll update after more testing.

Where Competitors Still Lead

Gemini 3 Pro beats Opus on knowledge-intensive benchmarks—91.9% on GPQA Diamond versus 87.0%. If you need multimodal reasoning or PhD-level domain knowledge, Gemini has an edge.

GPT-5.1 is significantly cheaper at $1.25 per million input tokens versus Opus's $5, and OpenAI's ecosystem integrations remain strong.

Pricing: Finally Reasonable

The 67% Price Cut

| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| Claude Opus 4.5 | $5 | $25 |
| Claude Opus 4.1 (old) | $15 | $75 |
| Claude Sonnet 4.5 | $3 | $15 |
| GPT-5.1 | $1.25 | $10 |
| Gemini 3 Pro | $2-4 | $12-18 |

Opus went from $15/$75 to $5/$25. That's massive. It's still the most expensive model in this comparison, but no longer absurdly so.

Add prompt caching (cache reads cost $0.50 per million tokens, a 90% saving on the $5 input price) and batch processing (50% discount), and the real-world costs become much more manageable.
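Here is a small estimator to see what those discounts do to a single request. The prices are the ones quoted above; the token counts are hypothetical, and the sketch makes two simplifying assumptions: it ignores any surcharge for writing to the cache, and it applies the batch discount to the whole request. Check the official pricing page before combining the two.

```python
# Prices in USD per million tokens, as quoted in this post.
OPUS_INPUT = 5.00
OPUS_OUTPUT = 25.00
OPUS_CACHE_READ = 0.50
BATCH_DISCOUNT = 0.50


def request_cost(fresh_input: int, cached_input: int, output: int, batch: bool = False) -> float:
    """Estimate the cost of one request in USD.

    Assumptions: cache-write surcharges are ignored, and the batch
    discount is applied to the whole request.
    """
    cost = (
        fresh_input * OPUS_INPUT
        + cached_input * OPUS_CACHE_READ
        + output * OPUS_OUTPUT
    ) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


# Hypothetical request: 50,000 input tokens and 2,000 output tokens.
plain = request_cost(50_000, 0, 2_000)
cached = request_cost(5_000, 45_000, 2_000)  # 45,000 input tokens read from cache
batched = request_cost(50_000, 0, 2_000, batch=True)

print(f"plain:   ${plain:.4f}")    # $0.3000
print(f"cached:  ${cached:.4f}")   # $0.0975
print(f"batched: ${batched:.4f}")  # $0.1500
```

For input-heavy workloads with a large shared prefix, caching does most of the work: in this example it cuts the request cost by roughly two thirds.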

Max Plan Changes That Actually Matter

Here's what got my attention: Max users now get "significantly more Opus usage than before—as much as they previously received for Sonnet." That's Anthropic effectively removing the tight Opus caps that pushed me to Sonnet in the first place.

If you're on Max, Opus is now a viable daily driver again. That alone changes my workflow.

Why Claude Remains Best for Professional Use

Here's my opinion, take it or leave it: Claude is the best model for professional use in a business context. I've tried them all extensively. Claude consistently delivers better first-attempt results, which means fewer retries, less time debugging AI-generated garbage, and lower total cost despite the higher per-token pricing.

Consistency and Reliability

Anthropic describes Opus 4.5 as producing work with "consistency, professional polish, and genuine domain awareness" for finance, legal, and precision-critical fields. That matches my experience. When I need code that works the first time without weird hallucinations or obvious bugs, Claude delivers more reliably.

The safety numbers tell a similar story: 4.7% attack success rate on prompt injection tests versus 21.9% for GPT-5.1. If you're building anything customer-facing, that matters.

API Features Worth Knowing

  • Tool Search (beta): 85% context reduction when working with 100+ tools
  • Prompt caching: I've been using this heavily—it's excellent
  • Computer use: Increasingly useful for automation tasks
  • Thinking block preservation: Multi-turn reasoning actually works
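Since prompt caching is the feature I lean on most, here is a sketch of the request shape. It marks a large, stable system prompt as cacheable with a `cache_control` block; the helper name and the placeholder texts are mine, and the network call is left as a comment so the snippet runs without an API key.

```python
def build_cached_request(reference_text: str, question: str) -> dict:
    """Build Messages API arguments with the system prompt marked as cacheable."""
    return {
        "model": "claude-opus-4-5-20251101",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": reference_text,
                # Marks everything up to and including this block as a cache prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


kwargs = build_cached_request(
    "<large style guide or codebase summary>",  # placeholder text
    "Which rule covers naming?",
)

# With the official SDK installed and an API key set, the call would be:
#   import anthropic
#   response = anthropic.Anthropic().messages.create(**kwargs)
print(kwargs["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Caching only pays off when the prefix is identical across requests and long enough to qualify, so it suits fixed instructions and reference material rather than per-request content.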

The Cost Reality

Yes, it's expensive. But "expensive per token" isn't the same as "expensive to use." Fewer retries and better first-attempt success often mean lower total cost. The token efficiency improvements close the gap further.
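The retry argument can be put into numbers. If each attempt succeeds independently with probability p, you need 1/p attempts on average, so the expected cost per usable result is the per-attempt cost divided by p. The costs and success rates below are purely hypothetical, chosen to show the mechanism rather than to describe any real model.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical figures for illustration only.
premium = expected_cost_per_success(cost_per_attempt=0.30, success_rate=0.90)
budget = expected_cost_per_success(cost_per_attempt=0.20, success_rate=0.50)

print(f"premium model: ${premium:.3f} per usable result")  # $0.333
print(f"budget model:  ${budget:.3f} per usable result")   # $0.400
```

This leaves out the cost of your own time spent reviewing failed attempts, which in practice is often the bigger number.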

For professional work where quality matters, the premium is worth it. For experimentation and quick questions, Haiku 4.5 at a fraction of the cost handles it fine.

Developer Experience: CLI and Desktop Updates

Claude Code Improvements

The CLI got some genuinely useful updates:

  • Plan mode creates more precise plans with editable plan.md files
  • Checkpoints save state before each change—double-tap Esc to rewind
  • Subagents for specialized tasks with independent permissions
  • VS Code extension (beta) with real-time inline diffs

The checkpoint feature deserves attention. Safe experimentation on ambitious refactoring tasks without fear of breaking everything? That's a meaningful workflow improvement.

Desktop Integration

Claude Desktop now runs multiple parallel coding sessions via integrated Claude Code. Each session operates as its own git worktree—fix bugs in one, research issues in another, update docs in a third, all simultaneously.

File creation expanded to include Word documents, Excel with pivot tables, PowerPoint with speaker notes, and PDFs. Whether you need these features depends on your workflow, but they're there.

First Impressions and What Needs More Testing

What I've tried so far:

  • General coding tasks: solid, responsive, no complaints
  • Token consumption: noticeably improved
  • Response speed at different effort levels: the granularity is useful

What needs more testing:

  • Complex multi-file refactoring—this is where Opus should shine
  • The effort parameter's optimal settings for different task types
  • Infinite conversations over extended sessions
  • Cost-benefit versus Sonnet 4.5 for production API usage

Worth Upgrading?

Another model release, but this one has substance. The pricing changes and increased limits make Opus viable again for daily use. The benchmark performance is state-of-the-art for coding. The token efficiency gains are real.

If you're building AI agents, autonomous workflows, or complex multi-tool systems, Opus 4.5 is the clear leader. If you're doing general tasks, the improvements over Sonnet 4.5 are more subtle, so evaluate whether the 67% price premium over Sonnet ($5 versus $3 per million input tokens) is justified for your use case.

For Max subscribers: definitely test the increased limits. Opus is back on the table.

I'll update with more detailed testing results once I've put it through its paces. For now, I'm cautiously optimistic, and that's more than I usually say about yet another model release.

Thomas Wiegold

AI Solutions Developer & Full-Stack Engineer with 14+ years of experience building custom AI systems, chatbots, and modern web applications. Based in Sydney, Australia.
