The Ralph Loop: How Recursive AI Agents Actually Work
Here's the entire technique, in one line of bash:
```bash
while :; do cat PROMPT.md | claude -p --dangerously-skip-permissions; done
```

That's it. That's a Ralph loop. The first time I saw it I assumed I was missing something, because surely a recursive AI agent had to be more complicated than four words and a pipe. It isn't.
What's actually happening here is genuinely strange. You've got a coding agent reading the same prompt over and over, modifying a codebase on disk, and using the file system as its memory instead of conversation history. Run it overnight on a well-specified task and you wake up to a working program, fifty git commits, and a journal of everything the model tried, broke, and fixed. It feels closer to science fiction than anything else I currently use as a developer.
In this article I'll walk you through what a Ralph loop is, how it works, how to actually run one in Claude Code or Codex, and (the part I find most fascinating) what you can read in its journal in the morning. I'll also be honest about where it falls over and where it's the wrong tool entirely.
Geoffrey Huntley, who coined the term, calls Ralph "deterministically bad in an undeterministic world." Once you understand why that's a feature rather than a bug, the rest of this clicks into place.
What is a Ralph loop?
A Ralph loop is a recursive AI agent pattern where a coding agent runs in an infinite shell loop, reading the same prompt file each iteration, modifying the codebase on disk, and using the file system instead of conversation history as its memory. Each iteration starts with a fresh context window. State survives between iterations through the codebase, a TODO file, and git history.
The technique was named by Huntley in his July 2025 blog post, which is still the canonical reference. The name comes from Ralph Wiggum, the cheerful Simpsons character who rams his head into doorframes and announces "I'm helping!" Huntley's framing is that this kind of dumb, persistent loop is surprisingly effective. As Dex Horthy puts it in his history of the technique, "dumb things can work surprisingly well."
(There's a second origin story for the name. "Ralph" is also slang for vomiting, and Huntley has said the realisation of how cheap autonomous code generation had become made him want to. So: Ralph the character, Ralph the verb, both apply.)
Why it's not just a while true loop
Here's the part that took me a few reads to get. The reason this works is fresh context every iteration. That's not a side effect, it's the point.
LLMs degrade as their context fills up. Past somewhere between 100k and 150k tokens, depending on the model, quality measurably drops. Practitioners call this the Dumb Zone. Long agent sessions inevitably drift there, and Claude's auto-compaction, when it kicks in, is lossy. Your specs can quietly get summarised into vagueness without you noticing.
A bash loop sidesteps all of that. Each iteration starts with the exact same allocated context: the same PROMPT.md, the same AGENTS.md, the same specs/*.md files. What changes between iterations is the codebase on disk and a small TODO file. The loop's input is stable while the world (the repo) converges toward the spec. That's why "deterministically bad" matters. The model is reliably mediocre at every iteration, and over time it grinds the codebase into shape.
How it actually works
A single iteration looks like this:
- Bash reads `PROMPT.md` and pipes it into the agent.
- The agent reads `fix_plan.md` and picks the single most important pending task.
- It searches the codebase, implements the change, runs the relevant tests.
- If tests pass, it commits with a structured message and updates `fix_plan.md`.
- If something useful was learned about the build or the project, it updates `AGENTS.md` briefly.
- The agent exits.
- Bash restarts the whole thing. Fresh context. Modified codebase.
The single most important rule, repeated across every Ralph implementation I've read, is one thing per iteration. Not "one thing plus a quick refactor while you're in there." One thing. Ask the agent to do exactly one task per loop, and trust it to decide what's most important from the plan file. Try to cram more in and you'll watch it pick the easiest item every time and ignore everything hard.
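To make that rule concrete, here's roughly the shape a minimal `PROMPT.md` might take. The wording and the `specs/overview.md` reference are illustrative, not a canonical template:

```bash
# Sketch of a minimal PROMPT.md; adapt the wording to your own project.
cat > PROMPT.md <<'EOF'
Read @fix_plan.md and @specs/overview.md.
Pick the SINGLE most important unfinished task. Do only that task this run.
Implement it, run the relevant tests, and commit with a message naming the
task, the tests that now pass, and anything the next iteration should know.
Update fix_plan.md. If you learned something operational, add one line to AGENTS.md.
Then stop.
EOF
```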
State survives between iterations through five places:
- The codebase. A green build is the strongest signal that previous work landed.
- `fix_plan.md` or `progress.txt`, the working TODO list.
- `AGENTS.md` or `CLAUDE.md`, operational learnings.
- `specs/*.md`, frozen requirements that don't change between loops.
- Git history. Structured commit messages double as a journal.
There are variations. You can run a bounded loop with --max-iterations 50. You can have the agent emit a sigil like <promise>COMPLETE</promise> and grep for it to break out. You can do a two-phase pattern where one prompt does gap analysis with no code changes, and a separate loop implements. You can cron it to "one small refactor every morning" instead of an overnight blitz. They're all Ralph, just different cadences.
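As a sketch, the bounded and sigil variants fit in a few lines of bash. The iteration cap and the sigil text are whatever you choose; this just mirrors the example above:

```bash
# Bounded run that also breaks early if the agent prints a completion sigil.
for i in $(seq 1 50); do
  # tee /dev/stderr streams the output to your terminal while we capture it.
  out=$(cat PROMPT.md | claude -p --dangerously-skip-permissions | tee /dev/stderr)
  echo "$out" | grep -q '<promise>COMPLETE</promise>' && break
done
```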
Running Ralph in Claude Code, Codex, and other tools
Claude Code
The bash one-liner I opened with is the canonical Claude Code Ralph. The two flags that matter are -p for headless mode (read prompt from stdin, write to stdout, exit) and --dangerously-skip-permissions to bypass approval prompts. Without the second flag, the loop just stops on the first file write.
The trade-off should be obvious. You're handing Claude Code unrestricted shell access to whatever directory you ran it in. Don't do this on your main machine without isolation. Use a Docker sandbox, a devcontainer, a fresh git worktree, or a remote VM. Huntley's framing is that it's not a question of if your loop gets compromised, it's when, and what the blast radius is. He's right.
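One way to get that isolation is a throwaway container that can only see a single worktree. A rough sketch, assuming your API key is in the environment; the base image is an arbitrary choice, anything that can run Node works for the CLI:

```bash
# Contain the blast radius: the loop only sees the mounted worktree.
docker run --rm -it \
  -v "$PWD":/work -w /work \
  -e ANTHROPIC_API_KEY \
  node:22-bookworm \
  bash -c 'npm install -g @anthropic-ai/claude-code && \
           while :; do cat PROMPT.md | claude -p --dangerously-skip-permissions; done'
```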
The supporting cast of files matters more than the bash. CLAUDE.md at the repo root gets auto-loaded into every session and is where you put operational rules. specs/*.md is for frozen requirements. fix_plan.md is the mutable TODO. Keep PROMPT.md short and reference the rest with @filename syntax instead of inlining everything.
Anthropic shipped an official ralph-wiggum plugin for Claude Code in December 2025. It's worth knowing about, but there's a real debate over whether it's the same thing. The plugin re-feeds the prompt inside one growing session via a Stop hook. That's not fresh context per iteration. Horthy's review is blunt: it misses the point. I'd pick the bash version. The plugin does lower the barrier, though, if you want to dip a toe in.
Cost reality: running Sonnet 4.5 in a bash loop with autonomous tool use lands around ten dollars an hour on metered API. If you're going to do this regularly, the Claude Max 20x plan at $200 a month is usually cheaper than running it metered.
OpenAI Codex CLI and the new /goal command
This is the newest piece of news in this space. On April 30, 2026 (three days ago, as of writing) OpenAI shipped Codex CLI 0.128.0 with a /goal command that is essentially Ralph as a first-class primitive.
The behaviour: you set a goal, Codex keeps looping until it self-evaluates the goal as complete, or until the configured token budget runs out. You don't write the bash. The internal templates that drive this are visible in the Codex repo if you want to see how OpenAI prompts the self-evaluation step.
Two ways to ralph in Codex now:
```bash
# The new way, since 0.128.0
codex /goal "Make all tests pass and commit each green checkpoint"

# The classic bash loop, still works
while :; do cat PROMPT.md | codex exec --yolo -; done
```

The /goal version is friendlier and probably what most people will reach for. The bash version is more robust for the same reason the Anthropic plugin is debated: /goal is in-session, one growing context window, while the bash version gets fresh context every iteration. If you're doing a long overnight run, I'd still pick bash. For shorter, well-bounded tasks, /goal is great.
Other tools, briefly
The pattern travels to almost every coding agent.
- Cursor has a cursor-agent headless CLI that ralphs cleanly. There's also a Cursor plugin that does the in-session Stop-hook variant.
- Aider doesn't read stdin natively, but you can wrap it: `bash -c 'aider --yes-always --message "$(cat PROMPT.md)"'`. Aider's --architect mode pairs a planner model with a coder model, which is a natural plan-build split.
- Goose (Block's open-source agent) ships a first-class Ralph tutorial with cross-model review built in. One model implements, a different model reviews. This is the closest thing to a Reflexion pattern out of the box, and it's the easiest way to ralph with a local LLM via Ollama.
- GitHub Copilot CLI works for short Ralph runs in programmatic mode.
- If you're building a product around this, the Vercel AI SDK has a ralph-loop-agent example that's a clean TypeScript starting point.
Why the journal is where it gets interesting
Most Ralph coverage focuses on what gets built. I think that's the wrong half of the story. The interesting half is what you can read in the morning.
When every iteration commits with a structured message and writes failures to a log, you end up with something I haven't seen from any other AI tooling: a readable, narrative record of what the model tried, what broke, and why it thought it broke. A successful commit is, frankly, boring. The interesting reads are the failures, where the agent spends three iterations chasing a wrong hypothesis, finally figures out the test was misnamed, fixes the test, and moves on.
Here's the logging stack I'd set up:
- Git commit per success, with a structured message: what changed, which test passed, iteration number. This becomes the primary journal.
- `.ralph/errors.log` per failure, with the agent's own reflection on the cause. Tell it explicitly in `PROMPT.md` to append to this file when it gives up on an approach, and to include why.
- Append-only `progress.md`, where the agent is allowed to be opinionated about what's hard or surprising. This is where the personality leaks through.
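To give a feel for what that stack produces, here's the kind of entry you'd be reading in the morning. Both examples are invented; the exact format is whatever your prompt asks for:

```bash
# Hypothetical structured commit for a green iteration:
git commit -m "iter 23: fix EU date parsing in invoice importer

tests: test_invoice_importer.py::test_eu_dates now passes
next: handle missing currency field (see fix_plan.md)"

# Hypothetical failure reflection, appended when the agent abandons an approach:
echo "$(date -u +%FT%TZ) iter 24: gave up on mocking the S3 client; \
fixture clashes with the test harness. Will refactor the uploader to take an injected client." \
  >> .ralph/errors.log
```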
One concrete tip lifted from Huntley's CURSED prompt: instruct the agent to write the why into test docstrings. Future iterations won't have its reasoning in their context, so the only way to teach the next Ralph anything is to make it readable from disk. After a few hundred iterations, your test suite ends up commenting itself with the agent's archaeology of past mistakes. That's pretty cool, and it's also the most AGI-adjacent thing on my laptop right now.
The morning experience is what makes me keep coming back to this. Coffee, scroll the commits, find three places the agent surprised me, two it embarrassed itself, and one bug it caught that I'd have missed. I won't oversell it as artificial general intelligence. But it's a process running on its own that produces a journal worth reading, and that's a category of experience I didn't have a year ago.
What Ralph is actually good for
The clearest case: measurable, mechanical work
The shared property of every Ralph success story I trust is that the success criterion is a number. Tests passing. Lint count down. Types clean. Coverage up. Build green.
In that mode it works, sometimes spectacularly. The repomirror team shipped six framework ports overnight at a YC hackathon, running six concurrent loops in git worktrees, racking up a thousand-plus commits and about $600 in API spend. Huntley shipped a $50,000 client MVP for $297 in API costs. The CURSED programming language, including a self-hosting compiler, was built over roughly three months of continuous Ralph runs.
The boring middle of the road still works fine: TypeScript strict-mode adoption across a monorepo, ESLint flag flips, dependency upgrades like Jest to Vitest or React 17 to 19, alt-text generation for product images, internal-link passes across a content site. Anywhere the work is repetitive and the verifier is mechanical, Ralph eats it.
The interesting case: fuzzy success, with a judge
Here's the part the standard Ralph coverage skips. The success signal doesn't have to be a function exit code. It can be:
- An LLM as judge. Slower and weaker, but works for prose, copy, and vague aesthetic targets.
- A data source. Lighthouse score, conversion rate, Web Vitals, search ranking.
- A human. You, with coffee, tapping thumbs-up or thumbs-down on yesterday's variant.
Imagine a Ralph loop iterating on your website's design. One small variant per night. In the morning you give it a thumbs up or thumbs down, and the result feeds back into progress.md. Over a month, you've made thirty small, judged design changes, each one a tiny convergent step.
The constraint that makes this work is the same as the coding case: small, focused changes per iteration. Telling the agent to redo the entire site every night is a coin flip. Telling it to nudge one component, one heading, one CTA, one section per night gives you compounding improvement. The same logic applies to copywriting, ad creative, and probably a dozen other places I haven't tried yet.
I'll be honest, though: this is harder to set up well than the measurable-result version. The judge loop has more moving parts. A way to deliver the variant. A way to capture your vote. A way to feed it back to the agent without poisoning context. Today, the function-as-judge case is much easier. The judge case is where I'd bet the interesting product opportunities live.
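That said, a minimal version of the human-judge loop can still be pretty small. The sketch below assumes you leave your verdict in a `verdict.txt` file before the next nightly run; all the file names are made up:

```bash
# If a verdict on last night's variant was left behind, feed it back first.
if [ -f verdict.txt ]; then
  echo "$(date +%F): human verdict on last variant: $(cat verdict.txt)" >> progress.md
  rm verdict.txt
fi

# One small, judged change per night. The prompt tells the agent to read
# progress.md for past verdicts and to nudge exactly one component.
cat PROMPT.md | claude -p --dangerously-skip-permissions
```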
When Ralph is the wrong tool
A short list, because being honest about this matters more than the tips.
- One-shot tasks (just use Claude or Cursor interactively).
- True exploration where you don't know what you want (Ralph optimises for a green build, not for taste or insight).
- Brownfield code with strict review processes (the bottleneck is human review of forty-thousand-line PRs, not API tokens).
- Anything irreversible: production database migrations, deletes you can't undo, financial transactions.
- UX copy with no judge wired in (the agent will mark itself complete and move on, confidently wrong).
The right mental shortcut, borrowed from Meag Tessmann's writing, is to ask: is the output machine-verifiable? If yes, loop. If no, get a human, or wire up a judge.
Tips so it doesn't waste your token budget
A few hard-won pieces of advice from people who've run thousands of iterations.
Sandbox always. Docker, devcontainer, git worktree, or remote VM. Don't argue yourself out of this. You're handing a non-deterministic process unrestricted shell access. The blast radius is whatever it can reach.
Keep the primary context under 100k tokens. Use subagents for anything that returns a lot of tokens you don't need to keep. A subagent grepping the codebase and returning "here's the function" is much better than the primary agent reading a 4,000-line file. Once the main context drifts past 100k, quality drops measurably.
Two-phase plan/build. Have a separate, one-shot prompt that refreshes fix_plan.md based on a gap analysis against specs/*.md. The loop only ever implements. This is the single biggest reliability upgrade you can make.
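In practice that's two commands instead of one. A sketch, assuming a separate `PLAN_PROMPT.md` that only does gap analysis (the file name is illustrative):

```bash
# Phase 1: one-shot planning pass; refreshes fix_plan.md, writes no code.
cat PLAN_PROMPT.md | claude -p --dangerously-skip-permissions

# Phase 2: the build loop only ever implements items from fix_plan.md.
while :; do cat PROMPT.md | claude -p --dangerously-skip-permissions; done
```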
Layer your circuit breakers. Max iterations. Per-iteration timeout (15 minutes is a reasonable default). Hourly token cap. Stuck detection (if the same test fails three iterations in a row, stop and notify). Cost cap if you're on the API.
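Here's one way those layers might compose, as a rough sketch rather than a hardened harness. It uses "no new commit for three iterations" as a simpler proxy for "the same test keeps failing", and leaves out the token and cost caps:

```bash
MAX_ITER=50
last_head=$(git rev-parse HEAD)
stalled=0

for i in $(seq 1 "$MAX_ITER"); do
  # Per-iteration timeout: kill any run that wanders past 15 minutes.
  timeout 15m bash -c 'cat PROMPT.md | claude -p --dangerously-skip-permissions'

  # Crude stuck detection: no new commit for three iterations means stop.
  if [ "$(git rev-parse HEAD)" = "$last_head" ]; then
    stalled=$((stalled + 1))
    [ "$stalled" -ge 3 ] && { echo "No progress for 3 iterations; stopping." >&2; break; }
  else
    stalled=0
    last_head=$(git rev-parse HEAD)
  fi
done
```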
Wire backpressure aggressively. Type-check, lint, tests, build. Anything that returns non-zero rejects the iteration. For dynamically typed languages, adding a static analyser (mypy or ruff for Python, the relevant equivalent for whatever you're using) is non-negotiable.
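A backpressure gate can be as small as a script the prompt tells the agent to run before every commit. This sketch assumes a Python project with ruff, mypy, and pytest; substitute your stack's equivalents:

```bash
#!/usr/bin/env bash
# check.sh: any non-zero exit rejects the iteration's work.
set -euo pipefail
ruff check .
mypy .
pytest -q
```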
Don't use --continue or --resume. They defeat the fresh-context guarantee. The whole point of a bash Ralph is that you start clean every iteration.
Set a notifier and walk away. ntfy.sh, a Slack webhook, a macOS notification, whatever you've already got wired up. Watching iteration 47 of 200 is somehow worse than watching paint dry. Tune iterations one through three carefully, then leave. Read the journal in the morning.
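The notification doesn't need infrastructure. A sketch using ntfy.sh, with a made-up topic name:

```bash
# Ping a phone-subscribed ntfy.sh topic when the run ends, however it ends.
trap 'curl -s -d "Ralph run stopped on $(hostname) at $(date)" ntfy.sh/my-ralph-runs' EXIT

for i in $(seq 1 50); do
  cat PROMPT.md | claude -p --dangerously-skip-permissions
done
```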
Should you try it?
Honest take: Ralph is a young technique. The original blog post is from July 2025. The first vendor support shipped in December 2025. Most of the splashy cost numbers I've quoted are self-reported by the technique's loudest advocates. Thoughtworks' Technology Radar has it as Trial, not Adopt, and they're right to be cautious.
That said: the cheapest possible experiment is a five-iteration human-in-the-loop Ralph on a well-specified, mechanical task. Pick something tiny. A TypeScript file you've been meaning to clean up. A small algorithm with a clear pass/fail. Run a single iteration manually in a sandboxed worktree, read the diff and the commit, then run another. You'll know in thirty minutes whether this fits how you work.
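If you want a copy-pasteable starting point for that experiment, something like this works, pausing for you between iterations (the worktree path is arbitrary):

```bash
# Five manually gated iterations in a throwaway worktree.
git worktree add ../ralph-experiment
cd ../ralph-experiment

for i in 1 2 3 4 5; do
  cat PROMPT.md | claude -p --dangerously-skip-permissions
  git log -1 --stat                      # read what this iteration actually did
  read -r -p "Iteration $i done. Enter to continue, Ctrl-C to stop... "
done
```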
The thing that keeps pulling me back isn't that it codes. It's that it keeps going, and the trail it leaves is genuinely interesting to read. That part feels like the future, and it's already here, in four words and a pipe.