
Do LLMs Actually Understand Code? The Evidence


I've spent the better part of two years with AI coding assistants open on a second monitor. Copilot, Claude, ChatGPT—the whole rotation. And I've developed a gut feeling about when to trust them and when to double-check everything.

Turns out, research confirms what that gut feeling was detecting: LLMs develop genuine semantic representations of code internally, but those representations don't reliably control their outputs. They're not just autocomplete. But they're not reliable reasoners either. They're something weirder—and understanding that weirdness matters if you're betting your codebase on them.

The Question That Actually Matters

The academic version of this debate—"Do LLMs truly understand or just pattern match?"—sounds philosophical. It isn't. It's deeply practical.

If these models just exploit statistical shortcuts, then their impressive benchmark scores hide unreliable production behavior. That function that looks correct? It might fail on inputs the model never saw patterns for. If they genuinely understand semantics, we might be witnessing something new in machine reasoning—and can calibrate our trust accordingly.

Here's the paradox that emerged from the research: models do form abstract, language-agnostic representations of programming concepts. Interpretability researchers can see them. But these internal representations don't fully govern what the model outputs. It's like having knowledge you can't consistently access.

So what's actually going on?

What the Benchmarks Actually Show

Let's start with the numbers that should make you uncomfortable.

The CRUXEval benchmark—presented at ICML 2024—tested models on simple Python functions. We're talking 3-13 lines. Basic arithmetic, indexing, nothing exotic. GPT-4 with chain-of-thought prompting hit only 75% on input prediction and 81% on output prediction. Code Llama 34B? 50% and 46%. On functions this trivial, that's getting it wrong roughly half the time.

These aren't complex algorithms. These are the kind of functions you'd write in a technical interview warm-up.
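
To make that concrete, here's the shape of a CRUXEval-style output-prediction task. The function below is my own illustrative example, not one pulled from the benchmark: the model sees the code and an input, and has to say what comes back.

```python
# A CRUXEval-style output-prediction task (illustrative, not from the benchmark).
def f(nums, target):
    # Count items strictly below target, then append that count to the list.
    count = 0
    for n in nums:
        if n < target:
            count += 1
    nums.append(count)
    return nums

# The model is asked: what does f([3, 1, 4, 1], 2) evaluate to?
assert f([3, 1, 4, 1], 2) == [3, 1, 4, 1, 2]
```

Tracing this by hand takes seconds. The benchmark numbers above say models still miss a meaningful share of tasks at this level.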

The CodeMind framework found something even more telling: models that excel at generating code often fail at reasoning about their own generated code. The SemCoder paper (NeurIPS 2024) crystallized this into a phrase I keep coming back to: "specification reasoning does not imply execution reasoning."

Translation: your model might understand what you're asking for while having no reliable grasp of whether its code actually does that.

The Pattern-Breaking Problem

Adversarial testing made this gap concrete. Researchers tried semantic-preserving perturbations—fancy term for changes that shouldn't matter, like renaming variables or reformatting code.

The results were backwards from what you'd expect if models understood semantics. Rename a variable from user_count to banana? Significant performance drop. And the EMPICA framework found models showed "better robustness to semantic-preserving transformations than sensitivity to semantic non-preserving transformations": they stumble on cosmetic edits more than they should, yet often fail to notice edits that actually change what the code does. That's the opposite of genuine understanding.
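
To see what those two kinds of perturbation look like side by side, here's a toy example of my own (not taken from EMPICA):

```python
# Semantic-preserving change: rename a variable. Same behavior, different surface.
def count_active(users):
    user_count = 0
    for u in users:
        if u["active"]:
            user_count += 1
    return user_count

def count_active_renamed(users):
    banana = 0  # identical logic, sillier name
    for u in users:
        if u["active"]:
            banana += 1
    return banana

# Semantic non-preserving change: flip one condition. Nearly identical on the
# surface, but the function now counts the opposite thing.
def count_active_flipped(users):
    count = 0
    for u in users:
        if not u["active"]:
            count += 1
    return count
```

A model that understood the code would treat the first pair as interchangeable and flag the third function as a different program. The studies above suggest the errors often land the other way around.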

EquiBench tested something fundamental: can models determine if two syntactically different programs are semantically equivalent? Best results: 63-76% on challenging categories. Random baseline: 50%. Few-shot learning barely helped.
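
For a feel of what EquiBench is asking, here's a pair in the spirit of the benchmark (again my own example, not from the dataset): no shared surface syntax, same behavior on every input.

```python
# Two syntactically different, semantically equivalent programs.
def sum_of_evens_loop(nums):
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total

def sum_of_evens_functional(nums):
    return sum(n for n in nums if n % 2 == 0)
```

Judging pairs like this requires reasoning about behavior, not surface form, and that's where the 63-76% ceiling shows up.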

This is why your review process still matters. The model might generate correct-looking code that fails edge cases it would catch if it actually traced execution.

Inside the Black Box: What Interpretability Research Found

Here's where it gets interesting. Because when researchers looked inside these models, they found something that doesn't fit the "just autocomplete" narrative.

December 2024 research from Shanghai Jiao Tong University identified programming language-specific neurons—about 0.5-0.7% of the network in models like Llama-3.1-8B and Qwen2.5-Coder-32B. When they ablated (basically, turned off) Python-specific neurons, Python task performance dropped 82.9%. Go-specific neurons? 87-94% collapse for Go tasks.
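
Ablation just means zeroing those units' activations and measuring what breaks. Here's a minimal sketch of the mechanics in PyTorch, assuming you already have a list of "Python neuron" indices and a target layer; the names in the usage comment are placeholders of mine, not the study's actual setup.

```python
import torch.nn as nn

def ablate_neurons(layer_module: nn.Module, neuron_indices: list[int]):
    """Silence selected hidden units in one layer via a forward hook."""
    def hook(module, inputs, output):
        output[..., neuron_indices] = 0.0  # zero the chosen units in-place
        return output
    return layer_module.register_forward_hook(hook)

# Usage sketch (placeholder names): ablate, re-run the eval, compare to baseline.
# handle = ablate_neurons(model.layers[20].mlp, python_neuron_ids)
# score_after_ablation = run_python_benchmark(model)
# handle.remove()
```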

These aren't random weights. These are specialized computational structures for code.

More striking: they found "concept layers" in the middle regions of these networks. Early layers encode language-specific syntax. Middle layers capture something more abstract—representations that stay stable across variable renaming, that don't let you linearly recover fine-grained AST structures, and that map the same algorithm in Python and Java to similar internal representations.
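
One way to probe for that kind of abstraction: embed the same algorithm written in two languages, grab hidden states from a middle layer, and compare them. Here's a rough sketch of such a probe using Hugging Face transformers; the model name and layer index are placeholders of mine, not the paper's setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bigcode/starcoder2-3b"  # placeholder; any open code LM exposes hidden states
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)

python_src = "def total(xs):\n    return sum(x * x for x in xs)"
java_src = "int total(int[] xs) { int s = 0; for (int x : xs) s += x * x; return s; }"

def middle_layer_embedding(src, layer=12):
    # Mean-pool one middle layer's hidden states as a crude code representation.
    inputs = tok(src, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)

similarity = torch.cosine_similarity(
    middle_layer_embedding(python_src), middle_layer_embedding(java_src), dim=0
)
print(f"cross-language similarity: {similarity.item():.3f}")
```

If the middle layers really encode the algorithm rather than the syntax, pairs like this should score noticeably higher than unrelated programs do.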

That sounds a lot like conceptual abstraction.

Anthropic's work tells a similar story. Their attribution graphs research on Claude Haiku showed the model performing multi-step reasoning "in its head" before outputting responses. When writing poetry, it identifies rhyming words before constructing the lines. That's planning. That's not autocomplete.

The Gap Between Knowing and Doing

So why the disconnect? Why do models with genuine internal representations still fail on simple reasoning tasks?

The research suggests these representations don't fully control output behavior. Models attend to test cases more than problem descriptions when generating successful code—a form of task understanding, but one that depends on having the right scaffolding present.

And here's the uncomfortable part: GPT's errors can't be explained by attention analysis. There are deeper mechanisms we can't see yet. The interpretability tools that work on smaller models hit walls on the frontier ones.

The Memorization Problem (And Why Scale Isn't the Answer)

If you're hoping bigger models will fix this, the evidence isn't encouraging.

BigCodeBench (ICLR 2025) tests compositional reasoning—tasks requiring multiple function calls from 139 different libraries. GPT-4o: 60%. Humans: 97%. This isn't about raw capability. It's about the kind of flexible, compositional reasoning that programming actually requires.

Benchmark contamination complicates things further. The TS-Guessing Protocol found ChatGPT and GPT-4 could guess MMLU answers with 52-57% exact match when the correct answers were masked—strong evidence of training set exposure. LiveCodeBench addresses this by continuously adding problems with known release dates, but it's a constant arms race.

Apple's GSM-Symbolic study demonstrated something troubling: changing only names and numbers in math problems causes significant performance drops. Adding a single irrelevant clause—one that doesn't affect the solution at all—degraded performance by up to 65%.
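
The GSM-Symbolic methodology is essentially templating: hold the reasoning constant, swap names and numbers, and optionally inject a clause that changes nothing. Here's a toy version of my own, not a template from Apple's dataset:

```python
import random

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{distractor}How many apples does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]
DISTRACTORS = [
    "",  # baseline: no irrelevant clause
    "Three of the apples are slightly smaller than average. ",  # changes nothing
]

def make_variant(seed):
    rng = random.Random(seed)
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), a=a, b=b, distractor=rng.choice(DISTRACTORS)
    )
    return question, a + b  # the answer logic never changes, only the surface

print(make_variant(0))
```

A robust reasoner scores the same on every variant. The reported drops say current models don't.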

The models aren't learning robust reasoning. They're learning something more fragile.

What the Major Labs Are Actually Doing About It

The labs aren't ignoring this. But their approaches differ in telling ways.

OpenAI's o1 model introduced "reasoning tokens"—extended internal deliberation before responding. The results: 89th percentile on Codeforces, 83% on AIME 2024 versus GPT-4o's 13%. The training combines chain-of-thought reasoning with reinforcement learning on verifiable rewards. You can actually grade whether code works, so you can train against that signal.
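
The "verifiable rewards" part is the load-bearing idea: for code, a reward can be computed mechanically by running the model's output against tests. Here's a stripped-down sketch of such a reward function, illustrative only; real pipelines sandbox execution, enforce timeouts, and do much more.

```python
def code_reward(generated_code: str, test_cases: list[tuple[str, object]]) -> float:
    """Return the fraction of test cases the generated code passes.

    Each test case pairs a call expression with its expected value,
    e.g. ("solution([1, 2, 3])", 6). This sketch exec's in-process
    purely for clarity; never do that with untrusted code.
    """
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0  # code that doesn't even run earns nothing

    passed = 0
    for call_expr, expected in test_cases:
        try:
            if eval(call_expr, namespace) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)
```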

Anthropic's taking a different angle. Their position is that reasoning should be integrated into frontier models rather than bolted on as a separate "reasoning model." They're optimizing less for competition problems and more for real-world coding tasks—the messy kind with ambiguous requirements and evolving codebases. Their attribution graphs work is explicitly about making the reasoning process visible so you can audit it.

DeepMind's NExT (Naturalized Execution Tuning) teaches models to reason about code execution by analyzing program traces—not just inputs and outputs, but what happens in between. Google's approach emphasizes models that "plan ahead" and reason about execution behavior.
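
A program trace in this sense is just the sequence of executed lines and their local variables. Python can produce one with sys.settrace; here's a minimal tracer sketch of mine, not DeepMind's pipeline:

```python
import sys

def collect_trace(func, *args):
    """Record (line number, locals) for each line executed inside func."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def running_max(nums):
    best = None
    for n in nums:
        if best is None or n > best:
            best = n
    return best

result, trace = collect_trace(running_max, [2, 9, 4])
for lineno, local_vars in trace:
    print(lineno, local_vars)  # watch `best` evolve step by step
```

Training on material like this gives a model supervision about what happens between input and output, which is exactly the execution reasoning the benchmarks above show is missing.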

The labs aren't doubling down on one approach. They're building systems that can switch modes—fast generation when appropriate, extended reasoning when needed. Claude 3.7 Sonnet explicitly supports this toggle. That tells you something about where the field thinks the answer lies.

Practical Implications: When to Trust AI-Generated Code

After two years of heavy usage—and now with research context for why my intuitions were right—here's my working model:

Trust for: Tasks within the training distribution. Well-established patterns. Boilerplate. Initial scaffolding. "Write me a React component that does X" where X is something thousands of developers have done before.

Be skeptical for: Novel problem structures. Complex control flow. Nested logic. Anything with subtle edge cases. Code where the failure mode isn't immediately obvious from reading it.

The key insight: the model's confidence doesn't correlate with correctness. It'll generate wrong code with the same fluency it generates right code. That's not a bug in the interface. It's a fundamental property of how these systems work.

Test coverage becomes more important when you're using AI assistance, not less. Not because the AI is terrible—it's often quite good—but because the failure modes are different from human error. Humans make typos and forget edge cases. LLMs generate plausible-looking code that embodies subtle misunderstandings of the problem.

Execution-based verification beats reading generated code and nodding. If you can run it against test cases, do that. If you can't, be more careful.
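
In practice that can be as light as wrapping an AI-generated helper in a property-based test before you merge it. Here's a sketch using the hypothesis library; the function is a stand-in for whatever your assistant produced.

```python
from hypothesis import given, strategies as st

# Stand-in for an AI-generated helper you're about to trust.
def dedupe_preserve_order(items):
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

@given(st.lists(st.integers()))
def test_dedupe(items):
    result = dedupe_preserve_order(items)
    assert len(result) == len(set(items))                  # no duplicates survive
    assert set(result) == set(items)                       # nothing lost or invented
    assert result == sorted(set(items), key=items.index)   # original order kept
```

Run it with pytest and hypothesis will throw hundreds of generated inputs at the function, which is a far better lie detector than reading the code and nodding.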

Where This Is Actually Heading

The research summary I keep coming back to: "genuine representations, brittle deployment."

LLMs develop real semantic understanding inside their networks. But that understanding doesn't reliably translate to correct behavior. The gap is the engineering problem of the next few years.

Promising directions include explicit semantic training (SemCoder's "monologue reasoning" approach, which trains models to reason about functional descriptions and execution effects), execution-aware architectures, and reinforcement learning on verifiable rewards. The philosophy question—"is this really understanding?"—matters less than the engineering question: "when can these systems be deployed reliably?"

Current evidence suggests we have significant ground to cover.

Andrej Karpathy noted a "loss of trust in benchmarks" in 2025, and he's right. We need better evaluation methods—ones that test for the kind of robust reasoning we actually need, not the kind of pattern completion that games existing benchmarks.

In the meantime, use these tools. They're genuinely useful. But use them like you'd use a brilliant but unreliable colleague—someone who often has great ideas and occasionally produces work that needs to be quietly fixed before it ships.

The models are getting better. The research is clarifying what "better" even means. And if you're building on top of AI code generation, understanding this gap between internal representation and reliable output is the difference between leveraging a powerful tool and being bitten by one.


The research cited in this article spans work from Anthropic, OpenAI, Google DeepMind, Meta AI, Shanghai Jiao Tong University, Princeton, Georgia Tech, and the Allen Institute for AI, primarily published between June 2024 and early 2025.

Thomas Wiegold

AI Solutions Developer & Full-Stack Engineer with 14+ years of experience building custom AI systems, chatbots, and modern web applications. Based in Sydney, Australia.
