MiniMax M2.5 Review: Why I'm Seriously Considering Ditching Claude
MiniMax's M2.5 model landed on February 12, 2026, and it's the first time I've genuinely questioned whether my Claude Max subscription is worth it. I've been paying Anthropic $200/month for Claude Code Max 20x — happily, mostly — because Opus 4.6 is phenomenal at reasoning through complex codebases. But when a model comes along that scores within 0.6% of Opus on SWE-Bench Verified at roughly one-twentieth the cost, you have to at least run the numbers. So I did. Here's my MiniMax M2.5 review after digging into the benchmarks, pairing it with the open-source OpenCode CLI, and stress-testing it against my usual workflow.
What MiniMax M2.5 Actually Is (and Why It Matters)
M2.5 is a Mixture-of-Experts model: 230 billion total parameters, but only 10 billion active per token during inference. That architecture is the entire reason the pricing works. A router sends each token to a small subset of experts, so per-token compute scales with the 10 billion active parameters rather than the 230 billion total, and you get frontier-tier capability without frontier-tier inference costs.
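To make that concrete, here's a minimal top-k routing sketch in TypeScript. The expert count, k value, and scoring are illustrative assumptions, not M2.5's actual configuration; the point is simply that only a handful of experts run for each token while the rest of the weights sit idle.

```typescript
// Minimal sketch of Mixture-of-Experts routing (illustrative, not M2.5's real config).
// A router scores every expert for the incoming token, keeps the top k,
// and only those k experts do any work; the remaining parameters stay idle.

type Expert = (token: number[]) => number[];

function route(token: number[], experts: Expert[], routerScores: number[], k = 2): number[] {
  // Pick the k highest-scoring experts for this token.
  const topK = routerScores
    .map((score, i) => ({ score, i }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  // Softmax over the selected scores so the expert outputs can be blended.
  const maxScore = Math.max(...topK.map((e) => e.score));
  const exps = topK.map((e) => Math.exp(e.score - maxScore));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Weighted sum of the k expert outputs; the other experts never execute.
  const out: number[] = new Array(token.length).fill(0);
  topK.forEach((e, j) => {
    const expertOut = experts[e.i](token);
    const weight = exps[j] / sum;
    expertOut.forEach((v, d) => (out[d] += weight * v));
  });
  return out;
}
```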
It ships in two variants — Standard at 50 tokens/second and Lightning at 100 tokens/second. For context, that Lightning speed is roughly double what you get from competing frontier models. Both variants are released as open weights on Hugging Face under a modified MIT License, which means you can self-host them, fine-tune them, or just use them through MiniMax's API.
The context window is 204,800 tokens (with the underlying architecture supporting up to 1 million), and it can generate up to 128K output tokens. MiniMax trained it using a proprietary RL framework called Forge that deployed the model across 200,000+ real-world environments — actual code repos, browsers, office apps — rather than just learning from human preference data. The result is what they call an "Architect Mindset": the model plans before it codes. I've seen this behaviour firsthand and it's not marketing fluff. It genuinely outlines structure and feature design before touching implementation.
How It Performs — Benchmarks and My Real-World Test
The Benchmark Picture
Let's get the numbers on the table. These are the scores that made me sit up:
| Benchmark | M2.5 | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% | 78.0% |
| Multi-SWE-Bench | 51.3% | 50.3% | — | 42.7% |
| BFCL Multi-Turn (tool calling) | 76.8% | 63.3% | — | 61.0% |
| Terminal-Bench 2 | 52.0% | 65.4% | — | — |
That SWE-Bench number is wild for an open-weight model. Six months ago this would have been science fiction. The BFCL tool-calling lead at 76.8% vs Opus's 63.3% is particularly interesting — it suggests M2.5's real-environment RL training translates directly into better function orchestration, which is exactly what you want in an agentic coding workflow.
But let's not pretend it's all roses. Terminal-Bench 2 at 52% versus Opus's 65.4% is a real gap. General reasoning scores (AIME 2025 at 45%, SimpleQA at 44%) tell you this model was optimised for coding and agentic tasks, not broad knowledge work. If you need a model to reason about abstract maths or answer obscure trivia, Opus still wins convincingly.
OpenHands ranked M2.5 4th overall and called it the first open model to surpass Claude Sonnet. Artificial Analysis scored it 42 on their Intelligence Index against a median of 25 for comparable models. Community reception has been cautiously enthusiastic — Hacker News loved the price-performance ratio but several developers flagged MiniMax's history of benchmark reward-hacking with M2 and M2.1. Fair concern. Worth watching.
My Link Shortener Test Project
Benchmarks are benchmarks. I trust my own tests more. I have a standardised Go project — a link shortener service — that I run against every new model that claims to compete at the frontier. Same spec, same constraints, same evaluation criteria every time. It's not a perfect methodology, but it's consistent, and consistency is what lets you compare.
M2.5 gave me the best result I've gotten so far. Better than Claude Code with Opus 4.6. Better than ChatGPT Codex. The architecture choices were sensible, the code was clean, and it finished fast. That "Architect Mindset" MiniMax talks about? I could actually see it working — the model laid out the structure before diving into implementation, which is exactly how I'd approach the project myself.
Now, a massive caveat: results vary between runs. I've seen this with every model. You can run the same prompt three times and get meaningfully different output quality. That's actually why I think raw inference speed is going to matter more and more for AI coding — if results are non-deterministic, the winning strategy is to run multiple attempts quickly and pick the best one. M2.5 Lightning at 100 tokens/second makes that approach economically viable in a way that Opus at $5/$25 per million tokens simply doesn't.
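Here's a rough sketch of what that strategy looks like in practice. The generate() and scoreAttempt() functions are placeholders for whatever model call and evaluation you use (your test suite, a linter, a review pass); the technique is just best-of-N sampling made affordable by cheap, fast inference.

```typescript
// Best-of-N sampling: run the same prompt several times in parallel and keep
// the highest-scoring result. generate() and scoreAttempt() are placeholders
// for your own model call and your own evaluation (tests, lint, manual review).

async function bestOfN(
  prompt: string,
  n: number,
  generate: (prompt: string) => Promise<string>,
  scoreAttempt: (output: string) => Promise<number>,
): Promise<string> {
  // Cheap, fast inference is what makes firing off N attempts at once viable.
  const attempts = await Promise.all(Array.from({ length: n }, () => generate(prompt)));

  // Score each attempt with whatever signal you actually trust.
  const scores = await Promise.all(attempts.map(scoreAttempt));

  // Keep the attempt with the highest score.
  const bestIndex = scores.indexOf(Math.max(...scores));
  return attempts[bestIndex];
}
```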
I'm not ready to crown it after one test. But I'm keeping it as my daily driver for the next few weeks to see how it holds up across real projects with real complexity. First impressions are genuinely strong.
OpenCode CLI — The Other Half of the Equation
Here's the thing most coverage misses: M2.5 alone is just a model. The reason it's a genuine threat to Claude Code is OpenCode.
OpenCode is an open-source coding agent built by Anomaly Innovations (the Y Combinator-backed team behind SST). It's hit 104,000+ GitHub stars and 2.5 million monthly active developers since launching in June 2025. It runs in the terminal, as a desktop app, or as a VS Code extension — and it supports 75+ LLM providers. Anthropic, OpenAI, Google, MiniMax, local models via Ollama, whatever you want.
The architecture is TypeScript on Bun with a client/server split. The TUI is just one frontend; the HTTP backend can be driven from mobile, web, or CI/CD pipelines. Compare that to Claude Code's monolithic terminal-only approach and you start to see the philosophical difference.
Features that matter to me: a Plan/Build mode toggle (Tab to switch between read-only planning and active modification), LSP integration for language-aware navigation, multi-session support so you can run parallel agents on the same project, and /undo /redo for reverting changes. There's also GitHub integration where mentioning /opencode in issue comments triggers automated actions, which is genuinely clever for team workflows.
Setup with MiniMax is trivial. Run opencode auth login, pick MiniMax as provider, paste your API key. Done. Or edit ~/.config/opencode/opencode.json with MiniMax's Anthropic-compatible API endpoint for persistent config. You can also run it free through Ollama with ollama launch opencode --model minimax-m2.5:cloud.
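Because the endpoint is Anthropic-compatible, you can also hit it directly with the Anthropic TypeScript SDK and just swap the base URL, which is handy for scripting outside OpenCode. The base URL and model name below are placeholders I've made up for illustration; check MiniMax's docs for the real values.

```typescript
// Calling MiniMax's Anthropic-compatible endpoint directly with the Anthropic SDK.
// The baseURL and model id are placeholders; confirm the real values in MiniMax's docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.MINIMAX_API_KEY, // your MiniMax key, not an Anthropic one
  baseURL: "https://api.minimax.example/anthropic", // placeholder endpoint
});

const response = await client.messages.create({
  model: "minimax-m2.5", // placeholder model id
  max_tokens: 1024,
  messages: [{ role: "user", content: "Outline a Go link shortener service." }],
});

console.log(response.content);
```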
OpenCode itself is MIT-licensed and free. You only pay for the LLM API calls. That's the model I wish more developer tools would adopt.
The Price Gap Is Absurd
This is where the conversation gets uncomfortable for Anthropic and OpenAI.
M2.5 Standard charges $0.15 per million input tokens and $1.20 per million output tokens. Claude Opus 4.6 charges $5/$25. That's 33× cheaper on input and 20× cheaper on output. A typical SWE-Bench task costs about $0.15 with M2.5 versus $3.00 with Opus.
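To see that in per-task terms, here's the back-of-the-envelope arithmetic. The 500K input / 40K output token counts are invented for illustration; the per-million rates are the published ones.

```typescript
// Per-task cost at the published per-million-token rates.
// The token counts below are made-up illustrative numbers, not measured usage.

function taskCost(inputTokens: number, outputTokens: number, inputRate: number, outputRate: number): number {
  return (inputTokens / 1_000_000) * inputRate + (outputTokens / 1_000_000) * outputRate;
}

// Hypothetical agentic task: 500K input tokens, 40K output tokens.
const m25 = taskCost(500_000, 40_000, 0.15, 1.2); // ≈ $0.12
const opus = taskCost(500_000, 40_000, 5, 25); // = $3.50

console.log(`M2.5: $${m25.toFixed(2)}, Opus 4.6: $${opus.toFixed(2)}, ratio ≈ ${(opus / m25).toFixed(0)}×`);
```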
MiniMax's subscription tiers make the comparison even more pointed. Their $10/month Starter plan claims to match the capacity of Claude Code Max 5x at $100/month. Their $20 Plus and $50 Max tiers claim parity with Claude Code Max 20x at $200/month. Even if those claims are optimistic by 30-40%, the economics still overwhelmingly favour MiniMax.
Here's how I think about it: you can always resubscribe to Claude. The risk of trying M2.5 for a month at $10-20 is essentially zero. The potential upside is saving $150+/month on tooling that performs within a few percentage points of the premium option.
Running It Locally on a Mac
I haven't tried local deployment myself yet, but I've been watching others do it — and the results are promising enough to write about. Several developers have gotten M2.5 running via Ollama on a Mac Studio with an M3 Ultra and 512GB of RAM. It works. It's slower than the cloud API, noticeably so, but it runs and produces usable output.
That's a $10,000+ machine, so let's not pretend this is accessible to everyone today. But hardware gets cheaper and more powerful every cycle, and the direction is obvious. My Mac Mini M1 isn't going to cut it for a 230B parameter model, even with only 10B active. But in two or three hardware generations? Running a frontier-class coding model entirely on-device starts to look realistic for a much wider range of machines.
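The rough memory math explains both ends of that. All 230B parameters have to live in memory even though only 10B are active per token, so at common quantisation levels the weight footprint alone looks roughly like this (back-of-the-envelope only, ignoring KV cache and runtime overhead):

```typescript
// Back-of-the-envelope weight memory for a 230B-parameter model at common precisions.
// Ignores KV cache, activations, and runtime overhead, so treat these as lower bounds.

const totalParams = 230e9;

const bytesPerParam: Record<string, number> = {
  "fp16 / bf16": 2, // 2 bytes per weight
  "8-bit quant": 1, // 1 byte per weight
  "4-bit quant": 0.5, // half a byte per weight
};

for (const [precision, bytes] of Object.entries(bytesPerParam)) {
  const gb = (totalParams * bytes) / 1e9;
  console.log(`${precision}: ~${gb.toFixed(0)} GB of weights`);
}
// fp16: ~460 GB, 8-bit: ~230 GB, 4-bit: ~115 GB.
// All of these dwarf what a Mac Mini M1 can hold, regardless of active-parameter count,
// which is why the 512GB Mac Studio is currently the realistic local target.
```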
The reason this matters — especially for small businesses — is privacy. When your code never leaves your network, you eliminate an entire category of risk. No API dependency, no data flowing to Shanghai or San Francisco, no wondering what happens to your proprietary codebase in someone else's training pipeline. If I could run M2.5 locally with performance matching the cloud API, I'd switch to that setup without hesitation. We're not there yet, but the open weights mean the option exists the moment the hardware catches up.
What This Means for the AI Coding Market
We're looking at three distinct philosophies as of February 2026. Claude Code is the premium play — deepest reasoning, tightest integration with arguably the best single model, but $100-200/month and locked into Anthropic's ecosystem. ChatGPT Codex takes the multi-interface cloud approach, with GPT-5.3-Codex hitting 77.3% on Terminal-Bench 2.0 and offering the most generous usage at $20/month. And MiniMax M2.5 + OpenCode delivers provider-agnostic flexibility at a price point that makes sustained agentic workflows actually affordable.
The trend line is clear: AI coding is getting cheaper, more open, and more private. That benefits small and medium businesses disproportionately. A solo developer or a five-person team spending $10-50/month instead of $200/month per seat changes the economics of AI-assisted development entirely.
I want to be honest about the risks. MiniMax's M2 and M2.1 had documented problems with reward-hacking and test falsification. Whether M2.5 fully resolves those concerns is still under independent testing. The model's general reasoning noticeably lags behind both Opus and GPT-5.2. And Hacker News users who heavily used previous MiniMax models reported brittle behaviour: context rot, error loops, hardcoded test cases instead of genuine solutions.
But here's the signal I keep coming back to: MiniMax claims 80% of newly committed code at their own headquarters is now M2.5-generated, with 30% of company tasks running autonomously on the model. When the people who built it trust it enough to run their own engineering on it at that scale, the benchmarks are probably directionally real — even if the last mile of polish still belongs to Claude.
I'm not declaring M2.5 the winner. I'm saying the value proposition is strong enough that I'm making it my daily driver for the next month to find out. At these prices, the experiment costs less than a single lunch in Sydney. I'll report back.