Making Claude Code continuously improve itself
A good harness fits like a glove, but that fit doesn’t maintain itself. Every week brings new code, stronger models, and subtle misalignments. What if we wired an asynchronous reflection loop into the harness, one that quietly watches sessions and turns recurring friction into concrete improvements?
TLDR: github.com/jwa91/claude-continuous-improvement
The layers of influence
Working with coding agents, there are several layers of influence. The simplest and most obvious way to use an agent is to give it an instruction. The agent starts working on that instruction right away, and if the difficulty of the task matches the intelligence of the model, in most cases the agent will deliver. But sometimes an agent fails your task. If it makes the same mistake a couple of times in a row, instead of modifying the instruction, you might want to modify the agent’s “memory”. When you change, for example, an AGENTS.md file, you don’t only change the next instruction: you change every future instruction the agent will receive. As your project grows, you may reach a point where the list of instructions becomes too long. Adding more instructions no longer improves the quality of the agent’s output; often the effect is even worse.
But if you can’t put in more instructions and you don’t want to spend hours crafting a prompt each time you interact with an agent, how do you guide the agent towards the output you want? One solution used a lot these days is skills. Agent skills, just like AGENTS.md, are instructions and context that should help the agent, but they are organized as a folder, which gives us the chance for “progressive disclosure”: a design pattern borrowed from UX that shows the minimum upfront and reveals complexity only when relevant. At startup, agents only load a name and a short description; only when an agent decides to use a skill does it load the full markdown instruction, and potentially this is layered even deeper within a folder with references.
That being said, in the end, when you use either of these two, AGENTS.md or agent skills, you’re still relying on the model to follow your instructions as intended. There is no deterministic enforcement of whatever you’ve put in there. A while ago I wrote a blog post about RSVG, a little framework that gives me a bit more certainty when I configure an agent. The short version: don’t preload all those instructions, but validate them through the loop and fail quick and loud. The loop basically gives us few-shot learning for free: if the agent doesn’t make mistakes, no tokens are spent; if it does, you’re sure a validator catches it. Each time you decline input from an agent, use that rejection as a chance to optimize the validation framework around your agent.
Of course, that’s not the whole story. A validation framework that is too strict limits the potential and impact of your agent, and when a model gets stronger, or when your codebase changes, your validation framework has to change as well. If the validation framework is not aligned with the model’s capabilities, or with the most recent version of your codebase, its impact on the agent’s output will be negative instead of helpful.
Scopes
Before we continue, let me quickly define some scopes:
Scope 1: A single LLM call
Scope 2: The Agent
The core primitive: An agent runs tools in a loop to achieve a goal.
The agent loop - based on this Langchain diagram
Scope 3: The Harness
The harness is the infrastructure that sustains, governs, and connects the agent’s loop to the world. Claude Code offers us hooks to modify how the harness does this. Hooks are deterministic control points that fire regardless of what the model decides. PostToolUse and Stop are two hooks we will be using later on.
The harness layer with Claude Code hooks (source)
Starting at the LLM call, the further outwards we go, the more our influence increases and the more the LLM’s influence decreases. In layer 1 we only control the input. In layer 2, we control the input and the agent’s toolbox, but when those tools get called is still up to the agent. In layer 3 we decide both the when and the what, largely independent of what the agent decides. Working in an outer layer provides control at the cost of model autonomy and flexibility.
A self-improving harness
I think everybody working with code or coding agents is currently trying to optimize the balance described above: how do you build enough certainty into the harness without limiting the model’s potential, knowing that models will only get stronger over time?
What if the harness could learn on its own? Not by changing itself automatically, that would defeat the purpose. But by watching what happens during coding sessions, extracting insights from them, and presenting those insights as potential improvements for me to approve. A retrospective that runs in the background, after every session, without me having to think about it.
I have attempted to build something like that. A continuous improvement system for Claude Code that operates entirely in Scope 3: the harness layer. It uses hooks to trigger background analysis after sessions, collects findings over time, and then offers a manual refinement step where those findings get turned into concrete harness improvements.
What to extract
Before building anything, I had to decide what’s actually worth extracting from a coding session. Not everything that happens is a signal, most of it is just the agent doing its job. Besides that, just reprocessing entire threads again and again would be very expensive. I settled on three types of findings:
Corrections are the highest-signal finding. These are moments where the agent got something wrong and I had to correct it. The agent proposed Flask but I wanted FastAPI, it used a deprecated Prisma API, it tried to use bun in a pnpm workspace. Every correction represents a gap in the harness configuration. Importantly, I track whether the correction came after a tool call or after a message, as this influences how we design the improvement: corrections after tool calls often point to missing validators or hooks, while corrections after messages suggest gaps in AGENTS.md or skill instructions.
Decisions are conventions and technology choices that got established during a session. Sometimes explicitly (“let’s use Zod for validation”), sometimes implicitly by the code that was written. These are valuable because they should be consistent across sessions. If you chose Zod over io-ts on Tuesday, the agent shouldn’t suggest io-ts on Wednesday.
Learnings are architectural insights that go beyond the current task. Things like “the order service uses event sourcing” or “rate limiter middleware must be registered before auth”. Knowledge that is relevant across features, but that the agent has no way of knowing unless you tell it.
Each finding is a single JSON object - type, one-sentence summary, a few sentences of detail, and a timestamp - appended to a JSONL file. Append-only, never delete, so there is always a full history.
{
"type": "correction",
"summary": "Used deprecated findOne API instead of findUnique in Prisma",
"detail": "Agent used findOne which was removed in Prisma 3. User corrected to findUnique. This affected three files in the repository layer.",
"timestamp": "2026-02-24T09:15:00Z"
}
Two ways to extract
Next question. Claude Code gives you --resume with --fork-session, which lets you fork a session and continue from the same prompt cache. It also gives you agents, which you can invoke from a shell script as a completely separate process. These two mechanisms have very different tradeoffs, and I built an extraction method around each.
The fork approach resumes the session you just finished, forks it, and runs a reflective prompt against the last ~8 turns. Because it forks from the existing session, it gets a prompt cache hit, meaning the full conversation context is already cached and the only new tokens are the reflection prompt itself. (Prompt caching means the API stores the conversation prefix; when you resume from the same prefix, you only pay for the new tokens, not the full context.) This makes it cheap and fast for short conversations. The downside: it’s operating from within the same context window as the builder. It sees what the agent saw, which means it shares the same blind spots.
claude --resume "$SESSION_ID" --fork-session \
-p "$PROMPT" \
--output-format json \
--allowedTools "Read" "Write" "Edit" "Grep" "Glob" \
--max-turns 10
The window approach invokes a separate agent (retro-analyzer) that reads the session transcript as an external file. It maintains a byte offset index, so it only reads content that hasn’t been analyzed yet. This is a fresh context (the analyzer gets no prompt cache hit from the original session), so for short threads it will cost more. But it has a genuinely external perspective: it’s not trapped inside the original conversation’s framing. And because it reads from a file with an offset, at some point this method becomes cheaper. Another advantage is that you can switch models; Opus might be overqualified for the extraction job at hand.
claude --agent retro-analyzer \
-p '{"transcript":"...","offset":48230,"findings":"...","index":"..."}' \
--output-format json \
--no-session-persistence
I’m still not exactly sure when to use which method, so for now the method is configurable: fork, window, or both. When running both, they execute sequentially in a single background process to avoid concurrent writes to the findings file.
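The offset bookkeeping behind the window approach can be sketched in a few lines of shell. Everything below is illustrative: the file paths, the plain-text index format, and the hand-off point to the analyzer are assumptions, not the actual repo layout.

```shell
#!/bin/sh
# Illustrative sketch of the window approach's byte offset index.
# Demo files live under /tmp so the sketch runs standalone.
TRANSCRIPT=/tmp/ci-demo-transcript.jsonl
INDEX=/tmp/ci-demo-offset.txt

printf '%s\n' '{"turn":1}' '{"turn":2}' > "$TRANSCRIPT"
echo 11 > "$INDEX"   # first line (10 chars + newline) already analyzed

offset=$(cat "$INDEX" 2>/dev/null || echo 0)

# Slice only the bytes that haven't been analyzed yet.
# tail -c +N starts reading at byte N (1-based), hence offset+1.
unseen=$(tail -c +"$((offset + 1))" "$TRANSCRIPT")

# ...here the unseen slice would be handed to the retro-analyzer...

# Advance the index to the current end of the transcript.
wc -c < "$TRANSCRIPT" | tr -d ' ' > "$INDEX"
```

Because the index only ever moves forward, re-running the analyzer after a long session costs tokens proportional to the new content, not the whole transcript.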
When to trigger
Extraction runs in the background, triggered by hooks. Two hooks do the work:
A PostToolUse hook watches for git commits. After every Bash tool call, it checks if the command was a git commit and whether it succeeded. If so, it sets a committed flag in a small state file. Commits are natural breakpoints, they represent a coherent unit of work worth reflecting on. I also almost always start new threads after committing, so the Stop hook after a commit felt like a good trigger to run the retrospection.
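A minimal version of that PostToolUse check might look like the following. The payload field names (`tool_name`, `tool_input`) match Claude Code’s hook input, but the state file path and the substring matching are simplifications: the real hook would parse the payload with jq and also confirm from the tool response that the commit actually succeeded.

```shell
#!/bin/sh
# Simplified sketch of the PostToolUse hook that flags git commits.
STATE=/tmp/ci-demo-state.json

handle_post_tool_use() {
  payload="$1"   # in the real hook this JSON arrives on stdin
  case "$payload" in
    *'"tool_name":"Bash"'*)
      case "$payload" in
        *'git commit'*) printf '{"committed": true}\n' > "$STATE" ;;
      esac ;;
  esac
  return 0   # a PostToolUse hook should never block the builder
}

# demo: a Bash tool call that ran a git commit sets the flag
rm -f "$STATE"
handle_post_tool_use '{"tool_name":"Bash","tool_input":{"command":"git commit -m \"fix\""}}'
```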
Besides that, a Stop hook fires on every session end. It increments a turn counter and checks two trigger conditions: either the committed flag is set, or enough session stops have accumulated (default: 8). If either condition is met, it spawns the background reporter. If neither is, it just saves the updated counters and exits. This way threads don’t grow too long before being analyzed, while the retrospective process still only runs at reasonable moments.
There is no reason at all in this system to let the Stop hook block the builder. It does a few jq operations, maybe spawns a nohup background process, and exits. I didn’t want the detached Claude sessions to silently run for a long time either, so I kept them really lightweight.
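Sketched out, the Stop hook’s decision logic is just a counter and two checks. The state file paths, the threshold, and the reporter command are assumptions for the sake of the example.

```shell
#!/bin/sh
# Illustrative sketch of the Stop hook's trigger logic.
STATE=/tmp/ci-demo-stop
THRESHOLD=8

on_stop() {
  committed=$(cat "$STATE.committed" 2>/dev/null || echo false)
  stops=$(( $(cat "$STATE.stops" 2>/dev/null || echo 0) + 1 ))
  if [ "$committed" = "true" ] || [ "$stops" -ge "$THRESHOLD" ]; then
    echo false > "$STATE.committed"
    echo 0 > "$STATE.stops"
    # fire-and-forget; the builder never waits on the analysis:
    # nohup run-retrospective.sh >/dev/null 2>&1 &
    echo "trigger"
  else
    echo "$stops" > "$STATE.stops"
  fi
}

# demo: the first stop just counts, a stop after a commit triggers
rm -f "$STATE.committed" "$STATE.stops"
on_stop                      # no output: counter goes to 1
echo true > "$STATE.committed"
on_stop                      # prints "trigger" and resets the state
```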
Validating findings
The write-finding skill has a PreToolUse hook that validates every write to the findings file before it happens. It checks that each line is valid JSON, that the type is one of the three allowed values, that summary, detail, and timestamp are present, and that the timestamp is ISO 8601. If validation fails, the hook exits with code 2, which feeds the error message back to Claude for self-correction. In line with RSVG: don’t preload instructions, validate through the loop and let the agent self-correct.
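As a rough illustration, a shell validator in that spirit could look like the sketch below. It only approximates the real checks: actual JSON parsing would go through jq, and the field matching here is a loose pattern check that assumes compact JSON.

```shell
#!/bin/sh
# Simplified sketch of the write-finding validator. Exit code 2 is the
# Claude Code convention that blocks the action and feeds stderr back
# to the model for self-correction.
validate_finding() {
  line="$1"
  printf '%s' "$line" | grep -Eq '"type": ?"(correction|decision|learning)"' \
    || { echo "type must be correction, decision, or learning" >&2; return 2; }
  for field in summary detail timestamp; do
    printf '%s' "$line" | grep -q "\"$field\":" \
      || { echo "missing required field: $field" >&2; return 2; }
  done
  # rough ISO 8601 shape check, e.g. 2026-02-24T09:15:00Z
  printf '%s' "$line" | grep -Eq '"timestamp": ?"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}' \
    || { echo "timestamp is not ISO 8601" >&2; return 2; }
}

# demo: a well-formed finding passes the gate
validate_finding '{"type":"correction","summary":"s","detail":"d","timestamp":"2026-02-24T09:15:00Z"}'
```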
Closing the loop
Extraction on its own gives you a growing log of findings. Useful as a record, but the real value comes when you turn findings into harness changes. For this there’s a second agent (the retro-optimizer) that you invoke manually when you’re ready to refine.
The optimizer reads the findings log, reads the improvement log (which tracks past changes), scans the current harness, and does something I think is underused in agent systems: it evaluates whether past improvements actually worked. For each logged improvement, it checks if findings of the same pattern still appear in recent entries. If they don’t, the improvement worked. If they do, it flags the improvement as ineffective. This algorithm is not my final one, but a good starting point.
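The core of that effectiveness check can be sketched in plain shell. The substring pattern below is a naive stand-in (in the real system the optimizer agent judges pattern similarity semantically), and all paths, timestamps, and data are demo values.

```shell
#!/bin/sh
# Illustrative sketch: an improvement counts as effective if no finding
# matching its pattern appears after the improvement's timestamp.
FINDINGS=/tmp/ci-demo-findings.jsonl
IMPROVED_AT="2026-02-24T16:05:00Z"   # from the improvement log entry
PATTERN="bun"                        # naive stand-in for semantic matching

# demo data: the bun correction predates the improvement
cat > "$FINDINGS" <<'EOF'
{"type":"correction","summary":"Switched from bun to pnpm","timestamp":"2026-02-20T10:00:00Z"}
{"type":"correction","summary":"Used deprecated findOne API","timestamp":"2026-02-25T09:00:00Z"}
EOF

# ISO 8601 timestamps sort lexicographically, so string comparison works
recurrences=$(awk -v ts="$IMPROVED_AT" -v pat="$PATTERN" '
  index($0, pat) && match($0, /"timestamp":"[^"]*"/) {
    t = substr($0, RSTART + 13, RLENGTH - 14)
    if (t > ts) n++
  }
  END { print n + 0 }' "$FINDINGS")

if [ "$recurrences" -eq 0 ]; then verdict=effective; else verdict=ineffective; fi
echo "$verdict"
```

Because the improvement log stores when each change landed, this comparison only needs the findings file and the log entry; no session transcripts are reprocessed.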
Then it groups unaddressed findings by similarity, ranks them by frequency with corrections prioritized, and presents numbered proposals. Each proposal maps the finding pattern to a specific intervention at the right harness layer, for now including AGENTS.md files, though the blog post mentioned earlier might change the ruling on this a bit…
As I said earlier, I think harness improvements should pass through a human gate for now, so improvements are proposed and have to be approved before they are implemented. When you approve a proposal, it modifies the target file and logs the improvement with a reference back to the findings it addresses. This temporal correlation is what makes the evaluation loop possible in the next cycle.
{
"timestamp": "2026-02-24T16:05:00Z",
"type": "hook",
"target": ".harness/hooks/warn-bun-install.sh",
"action": "create",
"summary": "PreToolUse hook that warns when agent tries to use bun in a pnpm workspace",
"rationale": "The bun-to-pnpm correction happened twice across sessions. A deterministic guard prevents the mistake entirely.",
"findings_addressed": ["Switched from bun to pnpm after workspace resolution errors"]
}
The full picture
Put together, the system is a loop between two phases. Extraction is automatic and runs in the background during sessions. Refinement is manual and runs when I decide. The extraction phase produces findings, the refinement phase turns findings into harness changes, and the evaluation step inside refinement checks whether those changes actually helped. Over time, the harness gets tighter around the specific mistakes my agent makes in the codebase, at least, that’s the theory.
I’ve uploaded the setup to a sample repo on GitHub and I’m always open for feedback or suggestions.
- Jan Willem
Related articles
- Good Validation Enables Good Agents: A walkthrough on building a unified validation system with gates and transformers that works across Claude Code, Cursor, and git hooks.
- My Agent Rules: Guidelines for structuring and operating codebases to maximize safe autonomy, high-quality output, and efficient use of agent context.