MiniMax M3 Review: Finally Matching GPT-5.5 & Opus?

I don't really enjoy writing model reviews. There, I said it. After the tenth "this new model is faster and smarter than the last one" post, you start to feel like you're describing the same car with a fresh coat of paint. So when I tell you this MiniMax M3 review is one I actually wanted to write, take it as a signal. M3 is interesting. Not "interesting for a Chinese open-weights model." Just interesting, full stop.

I've been in the MiniMax corner for a while now. I liked M2.5 when it landed, and I liked M2.7 even more. But there was always the same asterisk in the back of my mind: genuinely good, just not GPT-or-Opus good. A gap you could feel. This time the gap might have closed. So I ran my usual battery of tests, watched the thing think for an uncomfortably long time, and came away mostly impressed. Here's the whole story.

What Is MiniMax M3?

MiniMax M3 is an open-weights, natively multimodal model (text, image, and video in, text out) that launched on June 1, 2026 with a 1 million token context window. It's the course-correction in the M-series: where the M2 generation deliberately ditched sparse attention over production worries, M3 brings it back as the headline feature.

That feature is called MiniMax Sparse Attention, or MSA. The short version for anyone who doesn't want the linear-algebra lecture: a lightweight index branch scans incoming tokens, picks which key-value blocks actually deserve attention, and only runs the expensive math on those. The clever bit is that it does this on the real, uncompressed key-values, so you don't pay the long-context precision tax that an approach like DeepSeek's latent attention pays. MiniMax claims a roughly 9x speedup on prefill and 15x on decode at 1M tokens, with quality holding steady in their ablations.
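If you'd rather see the idea than read about it, here's a toy sketch in plain Python. To be clear, this is not MiniMax's implementation. The scoring rule (the query dotted with each block's mean key), the function names, and the block_size and top_k parameters are all my own assumptions for illustration. It only shows the general shape: a cheap pass picks blocks, then ordinary softmax attention runs over the original keys and values inside those blocks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, top_k):
    """Toy index branch (an assumption, not MiniMax's rule): score each block
    by the query dotted with the block's mean key, keep the top_k blocks."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in chosen)


def sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Softmax attention restricted to the selected blocks. The keys and
    values used here are the originals, not a compressed copy."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) / scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, idx


# Four tokens in two blocks; the query lines up with the second block.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
output, used = sparse_attention([0.0, 1.0], keys, values, block_size=2, top_k=1)
```

Setting top_k to the total number of blocks turns this back into plain dense attention, which makes for a handy sanity check that the sparse path only changes which tokens are visited.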

Why should you care about that more than the benchmark numbers? Because a quadratic-attention model can technically hold a million tokens, but actually using them is miserable. Prefill alone can take minutes. If MSA's speedups hold up under real load, that's the difference between "1M context exists on the spec sheet" and "1M context is something you'd actually build an agent around." That's the part that matters.
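To put a rough number on "quadratic," here's some back-of-the-envelope arithmetic, assuming causal attention where each token looks at itself and everything before it. The per-token budget of 8,192 is a hypothetical figure I picked for illustration, not something MiniMax has published, and the count ignores the cost of the index pass itself, so don't read the ratio as their claimed speedup.

```python
def dense_pairs(n_tokens):
    """Query-key pairs for causal dense attention: token i attends to 0..i."""
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens, budget):
    """Query-key pairs when each token attends to at most `budget` tokens.
    `budget` is a hypothetical knob, used here only for illustration."""
    if n_tokens <= budget:
        return dense_pairs(n_tokens)
    # The first `budget` tokens behave like dense attention; the rest are capped.
    return dense_pairs(budget) + (n_tokens - budget) * budget


n = 1_000_000
dense = dense_pairs(n)
sparse = sparse_pairs(n, budget=8_192)
ratio = dense / sparse
```

At one million tokens, the dense count comes to 500,000,500,000 pairs, which is why prefill hurts. The capped version grows linearly with length once you're past the budget.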

On pricing, it's aggressive. Standard pay-as-you-go is $0.60 per million input tokens and $2.40 per million output, with a 50% launch promo for the first week. That's somewhere between a tenth and a twentieth of what closed frontier models cost. You can run it right now through the MiniMax API, OpenRouter (OpenAI-compatible, easiest path), and a handful of launch partners.
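For a feel of what those rates mean per request, here's a small calculator using the list prices quoted above. The discount parameter models the 50% launch promo; I'm assuming it applies equally to input and output tokens, which is worth checking against your actual bill.

```python
INPUT_USD_PER_M = 0.60   # list price per million input tokens, as quoted above
OUTPUT_USD_PER_M = 2.40  # list price per million output tokens, as quoted above


def request_cost(input_tokens, output_tokens, discount=0.0):
    """USD cost of one request. discount=0.5 models the launch promo
    (assumed here to apply to both input and output tokens)."""
    full = (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
    return full * (1.0 - discount)


# A long-context request: 200k tokens in, 8k tokens out.
list_price = request_cost(200_000, 8_000)
promo_price = request_cost(200_000, 8_000, discount=0.5)
```

That example works out to about 14 cents at list price and about 7 cents during the promo.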

One honest flag before we go further: at launch the parameter count was undisclosed, and the "open-weights" part was still a promise. The weights weren't on Hugging Face yet (MiniMax says "within 10 days"). So keep your enthusiasm calibrated. More on that later.

Putting MiniMax M3 Through My Usual Tests

Here's my process, which never changes, because that's the only way I can compare across releases instead of just vibing off first impressions. I run the same three tasks on every serious model: two website builds, a poker simulation terminal program, and a full code audit of my own site, thomas-wiegold.com. Same prompts, same expectations, every time.

Website one: the Sydney coffee roaster

This is one of those prompts I've run so many times I could recite the output styles in my sleep. Funny thing about it: every single model picks more or less the same color palette for a Sydney coffee roaster. GPT, Opus, Gemini, now MiniMax. There must be something deep in the training data that screams "warm browns and cream" the moment you say "coffee." I've stopped fighting it.

What separates the models is everything else, and M3 nailed everything else. The layout was clean and considered, the technical execution was solid, and honestly it was one of the best results I've gotten for this prompt to date. Right up there with the closed frontier models. That alone made me sit up, because this is exactly the kind of task where MiniMax used to be "fine, but."

Website two: the pop-culture online store

I push the complexity up here. More interactivity, more visual flair, more chances to fall apart. M3 handled it well. Nice animations, good structure, the sort of result you'd be happy to hand off as a starting point rather than a throwaway demo. Probably the second-best result I've ever gotten for this particular prompt.

Second-best, because Gemini still had a slight edge on the design polish. If you've read my take on Gemini 3.5 Flash in Google Antigravity, you'll know I rate Gemini's web design specifically while preferring GPT-5.5 and Opus for most other work. M3 didn't beat Gemini at its own game, but getting within arm's reach is a real result.

The poker simulation

And now the part where I stop gushing. The poker sim was a mixed bag.

First problem: it took forever. I'm talking 30 to 40 minutes of the model thinking and working. I sat there reading the reasoning output, partly out of curiosity and partly out of disbelief, and it was a parade of "actually..." and "oh but wait, maybe..." and then contradicting the thing it had just decided. It would talk itself into a corner, talk itself back out, and burn a frightening number of tokens doing it. At times it felt less like reasoning and more like brute-forcing its way to an answer by sheer persistence.

To be fair, this isn't a MiniMax-specific disease. A lot of the newer models do this now. They over-think, second-guess, and treat token budgets like they're free. M3 is just a particularly patient offender.

The result itself? Okay. Not a full success: it didn't completely nail the task. But I'll give it this much context: no model has ever one-shot this poker challenge to 100%. Not one. So "okay but slow" puts it in the same bucket as everything else, just with a longer coffee break in the middle.

The code audit

This is where M3 won me back. Auditing thomas-wiegold.com is a hard test, and I mean that genuinely. The site is already heavily optimized. I've run these audits over and over and fixed the problems, so finding something new and real is not easy. The bar is high precisely because the obvious stuff is long gone.

GPT-5.5 has been my favorite for audits for a while. MiniMax M3 got remarkably close. No filler, no padding, no inventing problems to look busy. Every finding made sense and was worth my time. Compare that to when I tested DeepSeek V4, which buried the good observations under a pile of non-issues that I had to wade through and dismiss. M3 didn't waste a single line on a fake problem. For a model at this price, that's genuinely impressive.

How It Stacks Up on Benchmarks

The numbers are good, and I'm going to tell you why you should still squint at them.

On MiniMax's own reporting, M3 hits 59.0% on SWE-Bench Pro, which puts it behind Claude Opus 4.7 (64.3%), a hair ahead of GPT-5.5 (58.6%), and ahead of Gemini 3.1 Pro (54.2%). On SWE-Bench Verified it's at 80.5%, and on Terminal-Bench 2.1 it sits at 66.0%, where the closed models pull ahead more clearly. The pattern is consistent with what I saw by hand: close to GPT and Opus on real coding, not quite past them.

Here's the squint. Every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked, often using Claude Code as the scaffolding. That's not an accusation of cheating, it's just how launch-day benchmarks work, and you should treat all of them with the same healthy suspicion you'd apply to any company grading its own homework. The independent scores from LMArena and Artificial Analysis were still pending at launch. When those land, that's the real test.

One known soft spot worth naming: abstract, fluid reasoning. The whole family of Chinese models has lagged here, and the ARC Prize ARC-AGI-2 results from earlier this year had the MiniMax line scoring in the low single digits. M3 is a strong coder and a strong agent. It is not, on the available evidence, a great abstract reasoner. Good to know before you point it at a problem that needs genuine novel reasoning rather than competent execution.

The Catches

Three things to keep in your head before you get too comfortable.

The "open-weights" label was a promise on launch day, not a fact. No weights on Hugging Face yet, and the license is the bigger worry. M2.7 shipped under a "Modified-MIT" license that blocked commercial use without written permission, which got roundly mocked as faux-open-source. M3 is expected to follow the same playbook. So if your plan involves self-hosting for commercial work, do not commit to anything until the actual weights ship and you've read the actual terms. Hope is not a deployment strategy.

The token-burning I hit in the poker test is a real cost factor, not just an annoyance. The headline price is cheap, but if the model wanders through 40 minutes of self-doubt on a hard problem, your effective cost-per-task climbs. Measure the whole task, not the per-token rate.
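Here's a minimal sketch of what "measure the whole task" can look like in practice: add up every attempt, failed ones included, and divide by the number of successes. The Run fields are my own naming, the token counts in the example are made up, and I'm assuming reasoning tokens are billed at the output rate, so check how your provider actually bills them.

```python
from dataclasses import dataclass


@dataclass
class Run:
    input_tokens: int
    reasoning_tokens: int  # assumed to be billed at the output rate
    answer_tokens: int
    succeeded: bool


def cost_per_success(runs, input_usd_per_m, output_usd_per_m):
    """Total USD spent across all attempts, divided by successful attempts."""
    total = 0.0
    for run in runs:
        billed_output = run.reasoning_tokens + run.answer_tokens
        total += (run.input_tokens * input_usd_per_m
                  + billed_output * output_usd_per_m) / 1_000_000
    successes = sum(run.succeeded for run in runs)
    return total / successes if successes else float("inf")


# Two attempts at a hard task (made-up numbers): the first fails,
# the second works, and both burn 500k tokens of thinking.
attempts = [
    Run(input_tokens=10_000, reasoning_tokens=500_000, answer_tokens=5_000, succeeded=False),
    Run(input_tokens=10_000, reasoning_tokens=500_000, answer_tokens=5_000, succeeded=True),
]
effective = cost_per_success(attempts, input_usd_per_m=0.60, output_usd_per_m=2.40)
```

In this made-up case nearly all of the spend is reasoning tokens, and the failed attempt doubles the bill for the one result you keep.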

And a 1M context window is wonderful, but it is not a memory system. For long-running agents you still want real persistence. A big window helps; it doesn't replace architecture.

The MiniMax M3 Review Verdict: Should You Use It?

Yes, with both eyes open.

I think M3 is a real winner. The results across my tests were strong, the pricing is excellent, and for the first time a MiniMax model genuinely sits in the conversation with GPT and Opus rather than a tier below them. I'm going to use it for coding and other work. I'm also keeping my Claude and ChatGPT subscriptions, because the smart move here is hybrid: route the bulk, cost-sensitive, long-context work to M3, and reserve a closed frontier model for the slice where the last few quality points actually matter.
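If you want to turn that hybrid idea into something concrete, a router can be very small. This is only a sketch of one possible rule. The model identifiers, the quality_critical flag, and the 200,000-token cutoff are hypothetical placeholders of mine, not real API names or published limits.

```python
from dataclasses import dataclass

CHEAP_MODEL = "minimax-m3"            # hypothetical identifier
FRONTIER_MODEL = "closed-frontier"    # hypothetical placeholder
FRONTIER_CONTEXT_LIMIT = 200_000      # hypothetical cutoff, in tokens


@dataclass
class Task:
    context_tokens: int
    quality_critical: bool  # set by you, per task; the sketch does not infer it


def route(task):
    """Send only short, quality-critical work to the frontier model.
    Everything else, including all long-context work, goes to the cheap model."""
    if task.quality_critical and task.context_tokens <= FRONTIER_CONTEXT_LIMIT:
        return FRONTIER_MODEL
    return CHEAP_MODEL


bulk_job = route(Task(context_tokens=900_000, quality_critical=False))
final_review = route(Task(context_tokens=50_000, quality_critical=True))
```

The design choice worth arguing about is the flag, not the code. Deciding which tasks really need the last few quality points is the part you have to do by hand.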

For a bit of field context, since "is it better than X" is the only question anyone really asks: I wasn't especially moved by Claude Opus 4.8. I honestly couldn't tell you with confidence that it beats its predecessor, and during testing it fumbled something as basic as setting up a project with linting and formatting, which is not a great look. Gemini 3.5 Flash remains my pick for web design specifically, while GPT-5.5 and Opus stay in rotation for most everything else. M3 doesn't dethrone any of them outright. It earns a seat at the table, and at this price that's the whole point.

If you want to try it without spending anything, it's free in OpenCode right now, and it'll be part of the OpenCode Go plan too. That's the cheapest way to form your own opinion, which, as always, is the only opinion that should actually drive your decision.

I came in skeptical because I usually do, and a MiniMax M3 review was not on my list of things I expected to enjoy writing. It turned out to be one of the more genuinely interesting models I've tested this year. Run it through your own tasks before you believe me, though. That's the entire job.

Thomas Wiegold
