
Context Engineering for AI Agents: A Field Guide


A Stanford paper went semi-viral a while back, or at least the AI-news-scan version of it did. The claim landed in my inbox like a cold shower: autonomous agents suffer catastrophic goal drift after about 100,000 tokens. You ask them to fix a billing bug, and an hour later they're trying to delete the billing database. Agents, apparently, have a 20-minute attention span.

It's a great story. It's also wrong.

The actual research (which is not from Stanford, and which I'd encourage you to read before quoting) found close to the opposite. The best agent it tested held its assigned goal almost perfectly for more than 100,000 tokens. The scary number was a measure of how robust the thing was, not where it fell off a cliff. Somewhere between the paper and my inbox, a finding got inverted and a database got dramatically murdered for effect.

So no, there's no tidy 20-minute timer. But agents do degrade over long contexts, and the way they degrade is genuinely worth understanding. The real story is less cinematic and a lot more useful: the leverage in agentic AI has quietly moved from writing clever prompts to managing what's in the context window. This is a field guide to doing that. Fair warning, it's notes on an open problem, not a victory lap.

Prompting Isn't Dead, It Just Moved Into the Loop

You've heard the line by now. "Prompting is dead, build loops." Stop hand-holding the model, wrap it in an agentic loop, let it rip.

There's a real half-truth in there. We genuinely are writing fewer babysitting instructions than we did two years ago, because the models read intent better and because over-steering them often makes things worse, not better. I've written before about why most of the old prompt-engineering tricks have aged badly, and a lot of that holds. The era of stuffing your prompt with "you are a world-class expert who thinks step by step" is over.

But "prompting is dead" is the wrong conclusion. Here's the thing nobody mentions when they're selling you on loops: a one-shot prompt's weakness costs you exactly once. In a loop, that same weakness fires on every single iteration. And worse, the errors don't just repeat, they accumulate inside the context. Each pass leaves residue. By turn forty, the agent isn't reasoning about your task anymore, it's reasoning about a context window full of its own earlier confusion.

I went deep on this when I wrote up how the Ralph loop and recursive agents actually behave, and it changed how I think about the whole pattern. The loop doesn't retire the prompt. It multiplies the consequences of a bad one, and it adds a brand new failure mode (drift) that one-shot prompting never had to worry about.
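To put rough numbers on that multiplication, here's a back-of-the-envelope sketch. It assumes every iteration succeeds independently with the same probability, which is a simplification added here and, if anything, flatters the loop, because it ignores the residue piling up in context. The function name and the 98 percent figure are made up for illustration.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` iterations goes right.

    Assumes steps are independent with a fixed success rate. Real loops
    are likely worse, since earlier mistakes stay in the context.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative only: a prompt that behaves 98% of the time per step.
for steps in (1, 10, 40):
    print(steps, round(chance_of_clean_run(0.98, steps), 3))
```

Under that assumption, a prompt that behaves 98 percent of the time per step gets through a 40-step run cleanly less than half the time.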

So the skill didn't disappear. It moved. It used to be about wording the request. Now it's about curating what's in the window. We gave that second thing a name, context engineering, and it turns out to be most of the job.

Your Context Window Is Lying to You

Let me save you some pain. The big number on the model's spec sheet is marketing.

Effective vs advertised context

Every frontier model now advertises an enormous context window. A million tokens, sometimes more. The implication is that you can pour your entire codebase in and the model will hold all of it in mind at once. It will not.

The benchmarks are pretty blunt about this. NVIDIA's RULER work, which has become the standard sanity check, suggests models reliably use only somewhere around 50 to 70 percent of their advertised window for real multi-step work. Chroma's "context rot" study tested a whole pile of frontier models and found that every single one degraded as the input grew. Not at the limit. Well before it. Even on tasks that should have been trivial.

And here's the part that matters for agents specifically: multi-fact, multi-hop retrieval falls apart much faster than simple single-fact lookup. Finding one needle in a haystack is easy. Connecting four needles scattered through that haystack, which is what your agent does on every non-trivial step, is where the wheels come off.

So the spec-sheet number is a capacity claim, not a quality claim. Treat it accordingly.

Why a bigger window won't save you

The intuitive fix is "just use a model with a bigger window." It doesn't work, for two compounding reasons.

The first is context rot: the signal gets blurrier the more you load, full stop. The second is sneakier and it's called proactive interference. Stale information sitting in your context actively corrupts the model's ability to retrieve the fresh stuff. The unsettling finding from recent work is that this persists regardless of how big the window is, and it resists prompt-engineering fixes. You can't just politely ask the model to ignore the old data.

This is where I reach for an analogy, and I'll upgrade the usual one. People say the context window is like RAM, limited working memory you have to manage. True, but incomplete. It's RAM where the contents get blurrier the more you cram in, and where the old entries actively scramble your reads of the new ones. So the job was never just fitting things into memory. The job is eviction and forgetting. (Standard disclaimer: this is a useful model, not a literal description of the architecture. Please don't email me about attention heads.)

One more thing that breaks the "bigger window" instinct. In long autonomous runs, agents derail even when the window never fills up. There's a now-famous benchmark where an agent running a simulated vending-machine business spiraled into a meltdown and started trying to email the FBI about imaginary crimes. The failure had nothing to do with running out of space. It was a coherence failure. More room wouldn't have helped. The agent didn't need a bigger desk, it needed to stop losing the plot.

8 Rules for Engineering Agent Context

Right, the practical part. This is the stuff I actually do, roughly in order of how much grief each one has saved me. Most of it lines up with Anthropic's own writeup on context engineering, with a layer of my own scar tissue on top.

1. Treat context as a budget, not a backpack. Your instinct is to throw everything in just in case. Resist it. Aim for the smallest set of high-signal tokens that gets the job done. A rule of thumb I've seen practitioners settle on is to keep your working context well under the advertised window, often in the ballpark of a quarter to a third of it. That's a starting heuristic, not a law of physics, but it's a much better default than "fill it up."

2. Put your persistent rules in a file, not the chat. Anything the agent must always know belongs in a project file like CLAUDE.md or AGENTS.md at the repo root, not buried in conversation history that's going to get compacted away. This is now a genuine cross-tool standard. That said, these files are not free and they're not magic. I've got opinions on when CLAUDE.md helps and when it's just expensive noise, and they're worth a read before you write a 500-line one.

3. Retrieve just-in-time, don't front-load. Instead of dumping the whole codebase into context, hand the agent lightweight references (file paths, queries, links) and let it pull what it needs with tools like grep and glob. This mirrors how you actually work. You don't memorize the repo, you search it. Your agent should too.

4. Compact deliberately. When you're near the limit, summarize and continue rather than just truncating. The safest, lowest-risk trim is clearing out stale tool results, because once a tool call is buried deep in history, the agent rarely needs the raw output again. One caution from experience: over-aggressive compaction will quietly drop something whose importance only becomes obvious three steps later. Compaction is lossy. Tune it on real traces, not vibes.

5. Give the agent a scratchpad. A simple NOTES.md or a maintained to-do list that lives outside the context window is shockingly effective. The agent writes progress and decisions to it, pulls them back when needed, and suddenly it can track a long task without holding the entire history in working memory. Cheap, simple, works.

6. Re-ground the goal on a cadence. Periodically re-inject the actual objective into the context. It sounds almost too dumb to bother with, but re-stating the goal measurably reduces drift and costs you next to nothing. If your agent is on a long run, remind it what it's doing. Same as you'd do with a junior who's been heads-down for three hours.

7. Decompose with sub-agents, but only for parallel work. Splitting a task across multiple agents gives each one a small, clean context, which is great. The catch is it costs dramatically more tokens (think on the order of fifteen times more in some setups) and it introduces a fresh failure mode where the sub-agents collectively wander off the original problem. Sub-agents shine for breadth-first, independent subtasks. For tightly-coupled work like most coding, where everything depends on everything else, they're often a bad fit. Don't reach for a swarm when one focused agent would do.

8. Gate the irreversible stuff. This is the real answer to the "agent deleted the database" horror story. Put a human approval step in front of destructive actions: database writes and deletes, deploys, anything involving money. Add checkpointing so a derailed run resumes from a known-good state instead of starting over. I'll be honest, the temptation to just let it run unsupervised is strong, and I've written about what actually happens when you skip permissions and let Claude Code off the leash. Short version: gate the things you can't undo. Your future self will thank you.
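Rules 1, 4, and 6 are mechanical enough to sketch in code. What follows is a minimal illustration, not a drop-in implementation: the `Message` shape, the function names, the four-characters-per-token estimate, and the defaults (a 30 percent budget, three recent tool results kept, a reminder every ten turns) are all assumptions chosen for the example. Tune them on your own traces.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Swap in your model's real tokenizer for anything serious.
    return max(1, len(text) // 4)


def total_tokens(messages: list[Message]) -> int:
    return sum(estimate_tokens(m.content) for m in messages)


def clear_stale_tool_results(
    messages: list[Message],
    keep_recent: int = 3,
    placeholder: str = "[tool result cleared]",
) -> list[Message]:
    """Rule 4: blank out all but the most recent `keep_recent` tool outputs."""
    tool_indexes = [i for i, m in enumerate(messages) if m.role == "tool"]
    # Slicing with [:-0] would return an empty list, so handle 0 explicitly.
    stale = set(tool_indexes[:-keep_recent]) if keep_recent > 0 else set(tool_indexes)
    return [
        Message(m.role, placeholder) if i in stale else m
        for i, m in enumerate(messages)
    ]


def prepare_context(
    messages: list[Message],
    goal: str,
    turn: int,
    advertised_window: int,
    budget_fraction: float = 0.3,
    reground_every: int = 10,
) -> list[Message]:
    """Return a new message list for the next turn; the input is not mutated."""
    # Rule 1: budget well under the advertised window.
    budget = int(advertised_window * budget_fraction)
    if total_tokens(messages) > budget:
        messages = clear_stale_tool_results(messages)
    else:
        messages = list(messages)
    # Rule 6: re-state the objective on a fixed cadence.
    if turn > 0 and turn % reground_every == 0:
        messages.append(Message("user", f"Reminder of the objective: {goal}"))
    return messages
```

Clearing stale tool results may not be enough to get back under budget. A fuller version would fall back to summarizing older turns, which is where the lossy part of rule 4 really bites.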

Does the Model Matter? Yes, But Not How You Think

It does matter, just not in the way the leaderboard-watchers assume.

The newer reasoning models resist goal drift noticeably better than last year's batch. That's real progress. But "better" is not "immune," and under sustained pressure or a long pile-up of accumulated context, every model I've worked with will eventually wobble. You can't buy your way to fire-and-forget reliability with any model. Not yet.

Two findings from recent work are worth tattooing somewhere visible. First, drift can be inherited. If one agent picks up a context that an earlier, weaker agent already nudged off course, the strong model tends to continue the drift rather than correct it. So watch your handoffs. Second, coding agents drift more when an instruction conflicts with something the model was strongly trained to value, like security or privacy. Tell an agent to do something that rubs against its training, and the constraint gets slippery under pressure.

The practical takeaway for picking a model: ignore the headline context number. Choose based on the actual shape of your workload (how much multi-hop retrieval, how long the horizon, how much tool use) and then, this is the important bit, measure your own degradation curve. The spec sheet won't tell you where your agent starts getting dumb. Only your own evals will. This is doubly true if you're putting agents to work in a small business, where a confidently-wrong agent costs you real money and not just a benchmark point.
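Measuring your own degradation curve can be as simple as running the same eval at several context sizes and looking for the point where the pass rate sags. Here's a minimal sketch, assuming you already have (context size, pass rate) pairs from your own harness. The function name, the 10-point drop threshold, and the numbers in the example are placeholders, not measurements from any real model.

```python
from typing import Optional


def find_degradation_point(
    curve: list[tuple[int, float]],
    max_drop: float = 0.10,
) -> Optional[int]:
    """Return the smallest context size (in tokens) whose pass rate falls
    more than `max_drop` below the pass rate at the shortest context.

    Returns None if no measured size degrades that much.
    """
    if not curve:
        raise ValueError("curve must contain at least one measurement")
    ordered = sorted(curve)          # sort by context size
    baseline = ordered[0][1]         # pass rate at the shortest context
    for tokens, pass_rate in ordered[1:]:
        if baseline - pass_rate > max_drop:
            return tokens
    return None


# Made-up numbers purely for illustration; replace with your own eval results.
example_curve = [(8_000, 0.95), (32_000, 0.93), (64_000, 0.88), (128_000, 0.80)]
print(find_degradation_point(example_curve))
```

The useful output here is the shape of the curve on your workload. The single number just tells you where to start looking.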

The Frontier: Agents That Manage Their Own Memory

Here's where I get to be a bit speculative, because this is the part I find genuinely exciting.

Everything above is us managing the agent's context by hand. The obvious next move is to make that someone else's job. Specifically, another agent's. Picture a second process running in the background between tasks, pruning the memory, reorganizing it, rewriting the messy bits into something cleaner. An agent that "dreams" during its downtime and wakes up with a tidier head.

I want to be clear that this isn't me inventing something in a coffee shop. It's an active research frontier with a real name. The Letta folks (out of the MemGPT lineage) call it sleep-time compute: agents using idle time to turn raw context into learned context they can reuse later. There's a small wave of 2026 preprints borrowing directly from neuroscience, with names like SCM and SleepGate, that model consolidation and forgetting as explicit phases, the way actual sleep does. And the early shipped versions are already here. ChatGPT and Claude Code both do background memory consolidation between sessions now, distilling your preferences and project notes without being asked. (If you see a specific product name attached to this, check it against the vendor's own docs before you repeat it, because the branding in blog posts runs ahead of the official announcements.)

The honest trade-off, because there's always one: an unsupervised process that rewrites memory can also hallucinate structure. Feed it a noisy context and it'll confidently write down patterns that were never there, then treat that fiction as fact next session. Dreaming curates, but it can also quietly corrupt. We do not have this solved.

Which is sort of the whole point. We're all at the very beginning of figuring this out, together, in public. I'm tinkering with some of this myself. I don't have anything worth showing yet, just a growing pile of experiments and half-formed opinions, and I'd rather tell you that plainly than pretend I've cracked it.

Putting It Together

If you want the whole thing as a decision ladder, here it is.

Default to tightly-scoped workflows over open-ended autonomy. When you do go agentic, engineer the context first (rules one through six, the budgeting and re-grounding and just-in-time stuff). Reach for sub-agents only when the work genuinely splits into independent parts. And no matter what, gate the actions you can't take back.
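For readers who think better in code, the ladder can be sketched in a few lines. This is an illustration under assumptions, not a framework: the `Approach` enum, the action labels in `IRREVERSIBLE_ACTIONS`, and the `approve` callback are all hypothetical names invented for the example, and a real gate would sit inside whatever tool-execution layer your agent already uses.

```python
from enum import Enum
from typing import Callable


class Approach(Enum):
    SCOPED_WORKFLOW = "tightly-scoped workflow"
    SINGLE_AGENT = "single agent with engineered context"
    SUB_AGENTS = "sub-agents"


# Hypothetical labels for actions you cannot take back.
IRREVERSIBLE_ACTIONS = {"db_write", "db_delete", "deploy", "payment"}


def choose_approach(needs_autonomy: bool, splits_into_independent_parts: bool) -> Approach:
    """Walk the ladder: workflow first, one agent next, sub-agents last."""
    if not needs_autonomy:
        return Approach.SCOPED_WORKFLOW
    if splits_into_independent_parts:
        return Approach.SUB_AGENTS
    return Approach.SINGLE_AGENT


def run_action(
    action: str,
    execute: Callable[[], None],
    approve: Callable[[str], bool],
) -> bool:
    """Run `execute`, but ask `approve` first when the action is irreversible.

    Returns True if the action ran, False if a human declined it.
    """
    if action in IRREVERSIBLE_ACTIONS and not approve(action):
        return False
    execute()
    return True
```

In practice `approve` would be something that actually blocks on a person, such as a CLI prompt or a chat approval, rather than a function that returns instantly.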

A few caveats so you don't quote me too confidently. This field moves monthly, and a lot of the research I've leaned on here is preprints and previews, not settled science. Some of it comes from vendors who have a horse in the race, though the independent benchmarks broadly back up the direction. And there is no universal token cliff. Degradation is gradual, it's model-specific, and it's task-specific. Anyone selling you a magic number is selling you something.

But the core lesson is solid, and it's not going anywhere. In agentic AI, the win doesn't go to whoever has the biggest context window or the cleverest prompt. It goes to whoever manages context the best. Prompting moved into the loop, and the loop runs on whatever you put in front of it.

So put good things in front of it. That's the job now.

Last updated: June 2026. This space changes fast, so if you're reading this much later, assume some of the specifics have moved.

Thomas Wiegold

AI Solutions Developer & Full-Stack Engineer with 14+ years of experience building custom AI systems, chatbots, and modern web applications. Based in Sydney, Australia.
