Claude Opus 4.5 Review: Anthropic's New Coding Model Breaks Records
Another week, another model release. GPT-5.1 dropped, Gemini 3 Pro followed, and now Anthropic throws Claude Opus 4.5 into the ring. Normally, I'd roll my eyes at yet another "groundbreaking" announcement—but this time I'm genuinely paying attention.
Here's why: I use Claude constantly. Before Anthropic tightened the limits a few months back, Opus was my daily driver for serious coding work. Then the caps got restrictive enough that I downgraded to Sonnet, and honestly? Haiku 4.5 handled most of my quick tasks just fine. But Opus 4.5 changes the equation: 80.9% on SWE-bench Verified (the first model to crack 80%), a 67% price cut, and apparently increased limits for Max subscribers.
I haven't put it through the wringer yet, but first impressions are strong. Let me break down what matters.
What's New in Claude Opus 4.5
Hybrid Reasoning and the Effort Parameter
The headline feature is hybrid reasoning—a single model trained for both quick responses and extended chain-of-thought processing. You control how hard Claude thinks via a new effort parameter: low, medium, or high.
This is more useful than it sounds. Need a quick function signature? Low effort. Complex refactoring across multiple files? Crank it to high. The practical benefit is obvious: you're not burning tokens (and money) on trivial tasks, but you get the full reasoning depth when it matters.
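import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Effort shipped as a beta feature; confirm the exact parameter shape in the current API docs.
response = client.beta.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Your prompt"}],
    betas=["effort-2025-11-24"],
    output_config={"effort": "high"},  # "low", "medium", or "high"
)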
Infinite-Length Conversations
Anyone who's hit the context wall mid-session knows the pain. Opus 4.5 introduces automatic context compaction—when you approach the 200K token limit, Claude summarizes earlier messages and keeps going. No more "sorry, we need to start a new conversation."
For extended coding sessions where you're iterating on architecture, this is a genuine workflow improvement.
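The compaction itself happens on Anthropic's side, and nothing here describes its internals. As a rough mental model only, the sketch below shows the general technique: once an estimated token count nears a budget, older turns are replaced by a single summary and the most recent turns are kept verbatim. The `estimate_tokens` heuristic (about four characters per token), the `naive_summarize` stub, and the threshold values are all assumptions made for illustration.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def estimate_tokens(messages: list[Message]) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return sum(len(m["content"]) for m in messages) // 4


def compact(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    limit: int = 200_000,
    threshold: float = 0.9,
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older turns with one summary message once the estimate nears the limit."""
    if estimate_tokens(messages) < limit * threshold or len(messages) <= keep_recent:
        return messages  # still under budget, or nothing old enough to fold away
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "user",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary] + recent


def naive_summarize(older: list[Message]) -> str:
    """Hypothetical stand-in; a real system would ask a model to write the summary."""
    return f"{len(older)} earlier messages omitted."
```

The trade-off to keep in mind is that anything folded into the summary is only as good as the summary: details from early in a long session may not survive, which is one reason the feature needs testing over extended sessions.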
Token Efficiency That Actually Matters
At medium effort, Opus 4.5 matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. That's not a typo. GitHub's Copilot team reported it "surpasses internal coding benchmarks while cutting token usage in half."
In practice, this means faster responses and lower API costs. Combined with the price cut, the economics of using Opus just got significantly better.
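To see how the token claim interacts with price, here is a back-of-the-envelope calculation. It uses the list output prices from the pricing table in this review ($15 per million for Sonnet 4.5, $25 per million for Opus 4.5) and the 76% figure quoted above. The one-million-token baseline is arbitrary, and the calculation assumes the 76% reduction carries over to your workload, which is not guaranteed since it was reported for SWE-bench at medium effort.

```python
def output_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` output tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


sonnet_tokens = 1_000_000                        # arbitrary baseline workload (assumption)
opus_tokens = round(sonnet_tokens * (1 - 0.76))  # 76% fewer output tokens, per the claim above

sonnet_cost = output_cost(sonnet_tokens, 15.0)   # Sonnet 4.5 output: $15 per million
opus_cost = output_cost(opus_tokens, 25.0)       # Opus 4.5 output: $25 per million

print(f"Sonnet 4.5: ${sonnet_cost:.2f}  Opus 4.5: ${opus_cost:.2f}")
```

Under those assumptions the output bill comes to roughly $6 for Opus against $15 for Sonnet, despite the higher per-token price. Input tokens are left out here and would narrow the gap.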
Other Notable Updates
- 200K context window with 64K max output—larger output than competitors
- Knowledge cutoff of March 2025—reasonably current
- Thinking block preservation—previous reasoning persists across turns
- Zoom tool for computer use—inspects fine UI details and small text
Benchmark Performance vs. Reality
Breaking Records on SWE-Bench
The numbers are impressive:
| Benchmark | Opus 4.5 | Sonnet 4.5 | GPT-5.1 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.9% | 77.2% | 76.3% | 76.2% |
| Terminal-Bench | 59.3% | 44.3% | 47.6% | 54.2% |
| ARC-AGI-2 | 37.6% | 13.6% | 17.6% | 31.1% |
Opus 4.5 is the first model to break 80% on SWE-bench Verified. It more than doubles GPT-5.1's ARC-AGI-2 score. These aren't marginal improvements.
The Simon Willison Reality Check
But benchmarks aren't everything. Simon Willison—someone whose testing I trust—ran Opus 4.5 through extensive refactoring: 20 commits, 39 files changed, over 3,000 lines modified. His honest take? He switched back to Sonnet 4.5 and "kept on working at the same pace."
That's worth acknowledging. For many tasks, the difference between Opus and Sonnet might not be dramatic. The gains show up on the hard problems—complex multi-file refactoring, architectural decisions, edge cases that trip up lesser models.
My Early Testing
I haven't pushed it hard enough yet to make definitive claims. What I can say: responses feel faster, the output quality on the tasks I've thrown at it has been solid, and I haven't hit the frustrating limits that made me abandon Opus months ago.
The reduced token consumption is real and noticeable. Whether that translates to meaningfully better code? I'll update after more testing.
Where Competitors Still Lead
Gemini 3 Pro beats Opus on knowledge-intensive benchmarks—91.9% on GPQA Diamond versus 87.0%. If you need multimodal reasoning or PhD-level domain knowledge, Gemini has an edge.
GPT-5.1 is significantly cheaper at $1.25 per million input tokens versus Opus's $5, and OpenAI's ecosystem integrations remain strong.
Pricing: Finally Reasonable
The 67% Price Cut
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| Claude Opus 4.5 | $5 | $25 |
| Claude Opus 4.1 (old) | $15 | $75 |
| Claude Sonnet 4.5 | $3 | $15 |
| GPT-5.1 | $1.25 | $10 |
| Gemini 3 Pro | $2-4 | $12-18 |
Opus went from $15/$75 to $5/$25. That's massive. Still the most expensive frontier model, but no longer absurdly so.
Add prompt caching (90% savings at $0.50/M for cache reads) and batch processing (50% discount), and the real-world costs become much more manageable.
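A small estimator makes those discounts concrete. It uses only the Opus 4.5 figures quoted in this section: $5 input, $25 output, $0.50 cache reads (all per million tokens), and a 50% batch discount. It ignores the cost of writing to the cache, and treating the batch discount as applying to the whole request, cached tokens included, is an assumption. The function name and the example token counts are hypothetical.

```python
INPUT_PRICE = 5.00        # $ per million input tokens
OUTPUT_PRICE = 25.00      # $ per million output tokens
CACHE_READ_PRICE = 0.50   # $ per million input tokens read from cache
BATCH_DISCOUNT = 0.50     # 50% off for batch processing


def request_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    batch: bool = False,
) -> float:
    """Estimate one request's cost; `cached_tokens` is the cached share of the input."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * INPUT_PRICE
        + cached_tokens * CACHE_READ_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


# Hypothetical request: 50K input tokens (40K of them a reused prefix), 2K output tokens.
plain = request_cost(50_000, 2_000)
cached = request_cost(50_000, 2_000, cached_tokens=40_000)
batched = request_cost(50_000, 2_000, batch=True)
print(f"plain ${plain:.2f}, cached ${cached:.2f}, batched ${batched:.2f}")
```

For this made-up request the estimate drops from about $0.30 to about $0.12 with caching, or about $0.15 with batching alone.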
Max Plan Changes That Actually Matter
Here's what got my attention: Max users now get "significantly more Opus usage than before—as much as they previously received for Sonnet." That's Anthropic effectively removing the tight Opus caps that pushed me to Sonnet in the first place.
If you're on Max, Opus is now a viable daily driver again. That alone changes my workflow.
Why Claude Remains Best for Professional Use
Here's my opinion, take it or leave it: Claude is the best model for professional use in a business context. I've tried them all extensively. Claude consistently delivers better first-attempt results, which means fewer retries, less time debugging AI-generated garbage, and lower total cost despite the higher per-token pricing.
Consistency and Reliability
Anthropic describes Opus 4.5 as producing work with "consistency, professional polish, and genuine domain awareness" for finance, legal, and precision-critical fields. That matches my experience. When I need code that works the first time without weird hallucinations or obvious bugs, Claude delivers more reliably.
The safety numbers tell a similar story: 4.7% attack success rate on prompt injection tests versus 21.9% for GPT-5.1. If you're building anything customer-facing, that matters.
API Features Worth Knowing
- Tool Search (beta): 85% context reduction when working with 100+ tools
- Prompt caching: I've been using this heavily—it's excellent
- Computer use: Increasingly useful for automation tasks
- Thinking block preservation: Multi-turn reasoning actually works
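Since prompt caching is the feature from this list I lean on most, here is a sketch of how a cacheable request is shaped in the Messages API: the large, reused part of the prompt goes in a `system` block carrying a `cache_control` marker of type `ephemeral`. The helper names and prompt text are hypothetical, and actually sending the request requires the `anthropic` package and an API key.

```python
def build_cached_request(shared_context: str, question: str) -> dict:
    """Build Messages API arguments with the large shared prefix marked as cacheable."""
    return {
        "model": "claude-opus-4-5-20251101",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": shared_context,
                # Marks the prompt prefix up to and including this block for caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }


def ask(shared_context: str, question: str) -> str:
    """Send the request; needs the `anthropic` package and ANTHROPIC_API_KEY set."""
    import anthropic  # imported lazily so the builder above works without the SDK

    client = anthropic.Anthropic()
    response = client.messages.create(**build_cached_request(shared_context, question))
    return response.content[0].text
```

Caching only pays off when the same prefix is reused across requests, and very short prefixes fall below the minimum cacheable length, so this fits best with long system prompts, codebases, or reference documents.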
The Cost Reality
Yes, it's expensive. But "expensive per token" isn't the same as "expensive to use." Fewer retries and better first-attempt success often mean lower total cost. The token efficiency improvements close the gap further.
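The retry argument can be put in numbers. If each attempt succeeds independently with probability p, the expected number of attempts is 1/p, so the expected cost of one usable result is the per-attempt cost divided by p. The per-attempt costs and success rates below are hypothetical placeholders, not measurements of any model.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one usable result, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical numbers for illustration only.
pricier_model = cost_per_success(cost_per_attempt=0.10, success_rate=0.80)
cheaper_model = cost_per_success(cost_per_attempt=0.06, success_rate=0.40)

print(f"pricier model: ${pricier_model:.3f} per success")
print(f"cheaper model: ${cheaper_model:.3f} per success")
```

With these placeholder inputs the model that costs more per attempt is cheaper per usable result. The calculation also leaves out the time spent reviewing and discarding failed attempts, which usually pushes further in the same direction.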
For professional work where quality matters, the premium is worth it. For experimentation and quick questions, Haiku 4.5 at a fraction of the cost handles it fine.
Developer Experience: CLI and Desktop Updates
Claude Code Improvements
The CLI got some genuinely useful updates:
- Plan mode creates more precise plans with editable plan.md files
- Checkpoints save state before each change; double-tap Esc to rewind
- Subagents for specialized tasks with independent permissions
- VS Code extension (beta) with real-time inline diffs
The checkpoint feature deserves attention. Safe experimentation on ambitious refactoring tasks without fear of breaking everything? That's a meaningful workflow improvement.
Desktop Integration
Claude Desktop now runs multiple parallel coding sessions via integrated Claude Code. Each session operates as its own git worktree—fix bugs in one, research issues in another, update docs in a third, all simultaneously.
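Claude Desktop sets up the worktrees itself, so none of this is required to use the feature. For readers unfamiliar with the underlying git mechanism, the sketch below builds the `git worktree add` commands that give each task its own checkout on its own new branch. The task names, the `../worktrees` directory, and both helper functions are hypothetical.

```python
import subprocess
from pathlib import Path


def worktree_commands(tasks: list[str], base_dir: str = "../worktrees") -> list[list[str]]:
    """Return one `git worktree add` command per task, each on its own new branch."""
    return [
        ["git", "worktree", "add", "-b", task, str(Path(base_dir) / task)]
        for task in tasks
    ]


def create_worktrees(tasks: list[str], repo: str = ".") -> None:
    """Run the commands inside an existing repository (requires git on PATH)."""
    for cmd in worktree_commands(tasks):
        subprocess.run(cmd, cwd=repo, check=True)


# Print the commands for three parallel tasks without running them.
for cmd in worktree_commands(["fix-login-bug", "research-timeout", "update-docs"]):
    print(" ".join(cmd))
```

Because each worktree is a separate directory sharing one repository, edits in one session cannot clobber uncommitted changes in another.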
File creation expanded to include Word documents, Excel with pivot tables, PowerPoint with speaker notes, and PDFs. Whether you need these features depends on your workflow, but they're there.
First Impressions and What Needs More Testing
What I've tried so far:
- General coding tasks: solid, responsive, no complaints
- Token consumption: noticeably improved
- Response speed at different effort levels: the granularity is useful
What needs more testing:
- Complex multi-file refactoring—this is where Opus should shine
- The effort parameter's optimal settings for different task types
- Infinite conversations over extended sessions
- Cost-benefit versus Sonnet 4.5 for production API usage
Worth Upgrading?
Another model release, but this one has substance. The pricing changes and increased limits make Opus viable again for daily use. The benchmark performance is state-of-the-art for coding. The token efficiency gains are real.
If you're building AI agents, autonomous workflows, or complex multi-tool systems, Opus 4.5 is the clear leader. If you're doing general tasks, the improvements over Sonnet 4.5 are more subtle, so evaluate whether paying 67% more per token than Sonnet 4.5 ($5/$25 versus $3/$15 per million tokens) is justified for your use case.
For Max subscribers: definitely test the increased limits. Opus is back on the table.
I'll update with more detailed testing results once I've put it through its paces. For now, I'm cautiously optimistic, and that's more than I usually say about yet another model release.