DeepSeek V4 Review: I Tested It on Real Code
The wait is finally over. After months of silence from the lab that genuinely shook the AI world with V3 (and made half of Silicon Valley refresh their CapEx slides in early 2025), DeepSeek V4 dropped on April 24, 2026.
I'll be honest, I had started to wonder if a new DeepSeek model was even coming. They went so quiet that I assumed they were either cooking something extraordinary or had hit a wall. Turns out it was the first one. Mostly.
Short version up top, because that's what I'd want from a colleague: DeepSeek V4 is the best value AI model on the market right now, but it's not the best coder. If you've got volume, it's a no-brainer. If you've got a hard problem, you're still better off with Claude or GPT-5.5.
Here's what I found after running V4 through the three tests I now use for every new model release.
What DeepSeek V4 Actually Is
DeepSeek V4 isn't one model, it's two. Both are open weights, both are MIT-licensed, both ship with a 1 million token context window by default.
| | V4-Pro | V4-Flash |
|---|---|---|
| Total params | 1.6T | 284B |
| Active per token | 49B | 13B |
| Context | 1M | 1M |
| Input price (per 1M tokens) | $1.74 | $0.14 |
| Output price (per 1M tokens) | $3.48 | $0.28 |
| License | MIT | MIT |
V4-Pro at 1.6 trillion parameters is the largest open-weights model anyone has shipped to date, ahead of Kimi K2.6 and GLM-5.1. The architecture is genuinely new, not just V3 with more layers. DeepSeek built a hybrid attention system (CSA + HCA) that uses about 27% of the per-token compute of V3.2 at 1M context, plus they trained directly in FP4 instead of quantising afterwards. The full technical report is on the Hugging Face model card if you want the actual maths.
The reasoning model line is folded in too. Instead of picking between deepseek-chat and deepseek-reasoner like before, V4 has three reasoning modes: Non-Think, Think High, Think Max. And tool calls now work inside thinking mode, which R1 couldn't do.
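To make that concrete, here's roughly what selecting a mode looks like through an OpenAI-compatible client. The model id and the `reasoning_effort` mapping below are my assumptions about how the modes get exposed, not something lifted from the V4 docs, so check the API reference before copying it.

```typescript
// Sketch only: DeepSeek's API has been OpenAI-compatible so far, but the model
// id and the reasoning-mode field here are assumptions; verify against the docs.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.deepseek.com", // DeepSeek's OpenAI-compatible endpoint
  apiKey: process.env.DEEPSEEK_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-v4-pro",  // hypothetical V4 model id
  reasoning_effort: "high",  // hypothetical mapping of the "Think High" mode
  messages: [
    { role: "user", content: "Audit this hand evaluator for off-by-one errors." },
  ],
});

console.log(response.choices[0].message.content);
```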
What's missing
No multimodal. Text in, text out. If you need vision, look at Kimi K2.6 or Gemini.
Also, mildly annoying: the model card doesn't ship with a Jinja chat template, so plan for that in your tokenisation pipeline. Small thing, but it'll catch you out if you assume it works like every other recent release.
Hands-On Testing: My Three-Workload Rig
Benchmarks tell you about benchmarks. They don't tell you whether the thing will actually do your job. So I've settled on three tests I run on every new model, picked because they cover most of what I use AI for in a normal week.
- Codebase audit. I have it audit my own blog, a React Router 7 framework-mode app written in TypeScript. Real code, real complexity, things I genuinely care about being right.
- Logic-heavy terminal app. A poker simulation that runs thousands of hands and returns statistics. Tests reasoning, structure, edge cases.
- Web design from cold. Two different prompts to see how the model handles aesthetics and layout.
Here's how V4 did on each.
Test 1: Codebase Audit
Okay, but not great.
V4-Pro found a handful of real things, but it also flagged a bunch of stuff that wasn't actually a problem. Things like style nitpicks, or "consider extracting this" suggestions in places where extraction would make the code worse. Meanwhile it missed a couple of things that both GPT-5.5 and Claude caught the first time around.
GPT-5.5 is still my pick for code audits. It's the most thoughtful about what's actually a bug versus what's just different from how it would have written it. V4 tends toward over-flagging, which is exhausting when you've got a real codebase to triage.
This matches the benchmarks, by the way. On SWE-Bench Pro (the harder, more realistic coding eval), V4-Pro lands around 55%, behind Claude Opus 4.7 (64.3%), Kimi K2.6 (58.6%), and GLM-5.1 (58.4%). The headline SWE-Bench Verified number is essentially tied with Claude, but the harder benchmark tells the truer story.
Test 2: Poker Simulation
This one was closer. The code worked, the statistics came out right, the structure was reasonable. V4 didn't fall over.
But Claude and GPT-5.5 both did it better. Cleaner separation between the simulation core and the reporting layer, fewer iterations to get to working code, slightly more idiomatic Go. V4's version felt like a competent junior engineer's first pass that you'd then refactor. Theirs felt like something a senior would commit.
Not bad. Just not first.
Test 3: Web Design (Two Builds)
This is where it got interesting.
I gave V4 two prompts. First, a coffee roaster website. Second, a modern pop culture online shop.
The coffee roaster came out almost spookily similar to what Claude would produce. Same warm earth-tone palette, similar serif-and-sans pairing, that whole "we take our beans seriously" vibe. But the layout was cookie-cutter. Hero section, three feature cards, story block, footer. Boring. The kind of design you've seen a thousand times.
The pop culture shop, though, was genuinely good. Striking layout, confident typography, played with grid in interesting ways. I'd happily ship it as a starting point for a real project.
The takeaway I keep chewing on: V4 can clearly do great design. It just defaults to safe templates unless the prompt subject pulls it somewhere distinctive. Coffee roaster apparently lives in the boring-template region of its training data. Pop culture shop apparently doesn't. Worth knowing.
What I learned from testing
Pattern: V4 is competent on everything, outstanding on nothing. Which is exactly the right shape for a value model. You wouldn't pay Opus prices for "competent." But at $0.14 per million input tokens for Flash, "competent" is an absolute steal.
For coding specifically, Claude (Opus 4.7) and GPT-5.5 still win on quality. For everything else where the answer doesn't have to be perfect, V4 is hard to beat.
Benchmarks That Actually Matter
A quick run through the benchmarks worth paying attention to, because there's signal in here even if it doesn't override what I saw in testing.
- SWE-Bench Verified: V4-Pro 80.6%, basically tied with Claude Opus 4.6 (80.8%).
- SWE-Bench Pro (the hard one): V4-Pro ~55%, behind Opus 4.7 at 64.3%, Kimi K2.6 at 58.6%, GLM-5.1 at 58.4%.
- Artificial Analysis Intelligence Index: 52, which makes V4-Pro the second-best open-weights model behind Kimi K2.6 at 54. Full breakdown on Artificial Analysis.
- LiveCodeBench: 93.5, the highest reported number on this benchmark.
One footnote that doesn't get enough airtime. The US government's CAISI evaluation at NIST ran V4-Pro on held-out, non-public benchmarks and placed it closer to GPT-5 (about 8 months old) than to GPT-5.4 or Opus 4.6. Treat the headline benchmark equivalence as an upper bound. There's likely some public-benchmark overfitting going on, which is normal but worth knowing.
The other thing to flag: V4 hallucinates more than its peers when it doesn't know something. The AA-Omniscience eval clocks it at a 94% hallucination rate when uncertain. Translation: when V4 isn't sure, it doesn't tell you; it just answers. For RAG and research workflows, ground it explicitly.
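If you're putting V4 behind a retrieval pipeline, the cheapest mitigation is a system prompt that forces it to answer only from the supplied context and to admit when the context doesn't cover the question. A rough sketch of the shape I mean, with wording that's mine rather than anything DeepSeek recommends:

```typescript
// Grounding sketch for RAG: nothing DeepSeek-specific here, just a prompt
// contract that keeps an over-eager model from answering past its context.
const groundedMessages = (contextChunks: string[], question: string) => [
  {
    role: "system" as const,
    content: [
      "Answer using ONLY the context below.",
      'If the context does not contain the answer, reply exactly: "I cannot answer that from the provided documents."',
      "",
      "--- CONTEXT ---",
      ...contextChunks,
      "--- END CONTEXT ---",
    ].join("\n"),
  },
  { role: "user" as const, content: question },
];
```

Pass that to whichever client you're using; the contract matters more than the exact wording.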
How V4 stacks up against the open-weights pack
| Model | Open weights | Context | SWE-Bench Pro |
|---|---|---|---|
| DeepSeek V4-Pro | Yes (MIT) | 1M | ~55% |
| Kimi K2.6 | Yes (mod. MIT) | 256K | 58.6% |
| GLM-5.1 | Yes (MIT) | 200K | 58.4% |
| MiniMax M2.7 | Mixed | 200K | 56.2% |
| Claude Opus 4.7 | No | 200K | 64.3% |
Quick read: Kimi K2.6 is the smartest open model overall, GLM-5.1 wins long-horizon agentic work, V4 wins on price plus context length. They're all genuinely useful for different things.
Pricing: Where V4 Actually Wins
This is the part that matters most.
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 |
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Gemini 3.1 Pro | ~$2.00 | ~$12.00 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.7 | $5.00 | $25.00 |
V4-Flash is the cheapest input price you'll find on a frontier-tier model anywhere, beating even GPT-5.4 Nano. V4-Pro is the cheapest of the larger frontier models. And cache hits are 99% off, which is huge for agentic workflows that resend big system prompts.
To make this concrete: a Hacker News commenter on Simon Willison's V4 review ran a full layer-by-layer audit of a TypeScript endpoint (API, DTOs, service, database models) for $0.09 on V4-Pro. The same audit on Claude Opus 4.7 would have cost roughly $9 to $13. That's a 100x ratio.
One caveat, though: V4 is verbose. To complete the Artificial Analysis Intelligence Index, V4-Pro burned 4 to 5 times the median output tokens, so the headline per-token price flatters the real bill. Back-of-envelope: at 4.5x the tokens, the $3.48 output rate behaves more like $15 to $16 per million task-equivalent tokens. Still cheaper than the alternatives, just not by quite as wild a margin as the sticker suggests.
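One way to actually collect the 99%-off cache discount mentioned above: DeepSeek's context caching has historically matched on identical leading tokens, so agent loops should keep the long system prompt byte-for-byte identical and append only the changing material. I'm assuming V4 keeps that behaviour; the sketch below shows the shape, with names and model id as placeholders.

```typescript
// Cache-friendly request shape: the long, stable prompt always comes first and
// never changes, so repeated calls can hit the prefix cache (assuming V4's
// context caching works like earlier DeepSeek releases; verify in the docs).
const STABLE_SYSTEM_PROMPT = [
  "You are a code-review agent for this repository.",
  "Conventions: React Router 7, TypeScript strict mode, no default exports.",
  // ...several thousand tokens of project context, kept byte-identical...
].join("\n");

function buildReviewRequest(diff: string) {
  return {
    model: "deepseek-v4-flash",                                   // hypothetical id
    messages: [
      { role: "system" as const, content: STABLE_SYSTEM_PROMPT }, // cacheable prefix
      { role: "user" as const, content: diff },                   // varies per call
    ],
  };
}
```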
Easiest ways to actually use V4
If you just want to try it, two easy paths.
The simplest is the OpenCode Go subscription, which I reviewed a while back. V4 is included in the plan, no API keys to set up, and you get a proper terminal coding agent out of the box.
Otherwise, the DeepSeek API itself is genuinely cheap and trivial to set up. Point any Anthropic-compatible client (Claude Code, OpenCode, OpenClaw) at the DeepSeek base URL with your API key and it works as a drop-in. That's it.
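In code, the drop-in looks something like the sketch below with the Anthropic SDK. The base URL path and model id are assumptions carried over from how DeepSeek exposed Anthropic compatibility for earlier releases, so confirm them against the V4 docs.

```typescript
// Drop-in sketch: point an Anthropic SDK client at DeepSeek instead of Anthropic.
// The compatibility endpoint and model id below are assumptions, not V4 docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.deepseek.com/anthropic", // assumed compatibility endpoint
  apiKey: process.env.DEEPSEEK_API_KEY,
});

const message = await client.messages.create({
  model: "deepseek-v4-pro", // hypothetical model id
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarise the failing test output below." }],
});

console.log(message.content);
```

The terminal agents do the same thing under the hood; they pick up the base URL and key from their own config or environment variables.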
The Verdict: When to Use V4, When Not To
Use V4 for:
- High-volume API workloads where cost matters more than tail quality
- Agentic background work running 24/7
- Long-context tasks above 200K tokens
- Anywhere open weights or on-prem deployment is a hard requirement
Don't use V4 for:
- Code audits where missing a real bug is expensive (GPT-5.5 still wins here for me)
- Hard reasoning steps like research-grade math (GPT-5.4 or Gemini 3.1 Pro)
- Big-codebase production edits where SWE-Bench Pro matters (Claude Opus 4.7)
- Anything multimodal
V4 is the new default workhorse, not the new champion. The bar for "good enough cheap model" just got a lot higher, and that pulls the price-quality curve into a useful new shape. You can route 80% of your agentic and coding traffic to V4 and reserve Opus or GPT-5.5 for the genuinely hard sub-tasks where the 10x to 20x cost premium actually buys you something.
For my own work, the stack I'm settling into looks like: V4-Flash for bulk and background stuff, V4-Pro for medium-hard work, Claude Opus 4.7 or GPT-5.5 when the answer has to be right the first time.
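That split is simple enough to hard-code in a routing layer. A toy sketch, with the tier names and model ids as illustrative placeholders rather than anyone's real strings:

```typescript
// Toy model router: send each task to the cheapest model that's good enough.
// Tier names and model ids are illustrative placeholders.
type Tier = "bulk" | "medium" | "critical";

const MODEL_FOR_TIER: Record<Tier, string> = {
  bulk: "deepseek-v4-flash",   // background and high-volume work
  medium: "deepseek-v4-pro",   // harder tasks where cost still dominates
  critical: "claude-opus-4.7", // has to be right first time; pay the premium
};

function pickModel(tier: Tier): string {
  return MODEL_FOR_TIER[tier];
}

console.log(pickModel("bulk"));     // nightly lint sweep -> deepseek-v4-flash
console.log(pickModel("critical")); // production migration review -> claude-opus-4.7
```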
DeepSeek themselves admit in their technical report that V4 trails the absolute frontier by 3 to 6 months. CAISI's data suggests it might be more like 8. Honestly, for most of what most of us do, that gap doesn't matter. The price ratio matters more.
And selfishly, I'm just happy a new strong model shipped. More competition makes everything better for everyone shipping with these tools. Bring on V5.