pickuma.
AI & Dev Tools

Why AI Agents Forget: Memory Decay and Context Contamination Explained

How context-window limits, the lost-in-the-middle effect, and stale data cause long-running AI coding agents to lose track — and what you can do about it.

7 min read

You give your coding agent a clear objective — refactor an authentication module, add pagination to three endpoints, fix a flaky test suite. Forty tool calls and twenty minutes later it produces something that technically compiles, except it has silently forgotten a constraint you specified in turn two, contradicts a design decision from turn eight, and calls a helper function that it deleted three steps ago. The agent did not hallucinate in the usual sense. It ran out of usable memory.

This is not a rare edge case. It is a structural property of how transformer-based agents manage state, and it gets worse the longer an agent runs. Understanding the mechanics helps you design around the failure rather than being surprised by it.

The context window is RAM, not a database

Every token an LLM processes — your instructions, tool outputs, intermediate reasoning, code snippets — lives inside a fixed-size context window. Current frontier models vary widely: GPT-4o, Claude Sonnet, and Gemini each support context windows measured in hundreds of thousands of tokens, and some configurations extend into millions. That sounds enormous until you account for how fast a coding agent burns through it.

A realistic 50-step workflow, where each step involves a tool call with a moderately verbose output, can consume well over a million tokens in aggregate. Those tokens do not accumulate neatly; at each step the model must fit everything — prior turns, current state, and the new output — inside a single window. When the window fills, something has to give: either the agent truncates early context, or it hits a hard limit and fails.

The deeper problem is that the context window behaves more like RAM than persistent storage. Information is volatile. It degrades under load. And — unlike a database — you cannot efficiently index or retrieve specific entries from it. You put things in and hope the model attends to the right ones.

Three failure modes worth naming

Memory decay from positional bias

When a fact is stated early in a long context and many tokens accumulate after it, the model’s attention to that fact decreases. This is not a hypothesis; it is a measurable phenomenon tied to transformer architecture. Researchers first characterized what they called the “lost in the middle” effect in a 2023 paper from Stanford and UC Berkeley, which showed that retrieval accuracy follows a U-shaped curve: information at the very beginning and the very end of a context window gets attended to most reliably, while content in the middle is substantially deprioritized.

A 2026 study by Yeran Gamage quantified the downstream effect on agent behavior across 4,416 trials at six conversation depths: constraint compliance dropped from 73% at turn 5 to 33% at turn 16 without memory mitigation. That is not a subtle degradation — it means an agent halfway through a complex task is violating its own instructions roughly two times as often as it was at the start. Critically, the agent does not know this is happening. It keeps running, confidently producing output that ignores earlier constraints.

Context contamination

Even when information stays within the window, it can poison future reasoning. Two mechanisms drive this.

The first is stale data. If a tool returns a code listing that the agent then modifies, the old listing is still sitting in context. Subsequent reasoning now has two versions of the same function — the original and the revised one — competing for the model’s attention. The agent may reference the old version inadvertently, producing edits that target code that no longer exists.

The second is noise accumulation. Long agent traces include failed attempts, corrective feedback loops, half-completed thoughts, and tool errors. Anthropic’s engineering team describes this as a “context rot” dynamic: as traces grow, the ratio of useful signal to accumulated noise falls, and model performance degrades even when the window limit has not been reached. The architecture’s O(n²) attention scaling means every new token must interact with every prior token, and low-quality prior tokens drag on the computation.

Compaction hallucinations

When agents summarize prior context to make room for new work — a technique called compaction — they introduce a new failure mode. Summarization is lossy by design. If the model mis-summarizes even a small detail — a variable name, a constraint, an API signature — that error propagates forward as if it were ground truth. Minor inaccuracies in compaction output can contaminate the entire remainder of the session.

What mitigation actually looks like

There is no single fix, and any solution involves tradeoffs between context cost, latency, and the risk of information loss. The patterns below are not mutually exclusive — production agent systems typically combine several.

Just-in-time retrieval instead of pre-loading

Rather than dumping all relevant code, documentation, and prior state into the initial prompt, agents can use lightweight identifiers — file paths, function names, schema names — and retrieve the actual content only when they need it. This keeps the context window lean and pushes retrieval cost to tool calls, which are cheap relative to context-window real estate. Anthropic calls this pattern “just-in-time context retrieval” in their context engineering guidance.

The tradeoff: retrieval requires good tooling. The agent must know what to retrieve and when. If retrieval granularity is too coarse, you just re-introduce the noise problem via tool output instead of pre-loaded context.

Scoped sub-agents with narrow objectives

Rather than running one agent through an entire multi-hour task, you decompose the work into focused sub-tasks and spawn separate agents for each. Each sub-agent gets a clean context window, a narrow objective, and access only to the tools it needs. The parent agent receives only the final output — typically a condensed summary of 1,000–2,000 tokens — rather than the full trace.

This architecture prevents context explosion and makes individual steps easier to reason about. Claude Code’s sub-agent pattern works exactly this way: each spawned agent’s intermediate tool calls and reasoning stay isolated in its own window and never pollute the parent’s context. The isolation can go further — agents can be forked into separate git worktrees so their file edits do not interfere with the main checkout.

The tradeoff: decomposing a task correctly requires upfront design. Sub-tasks that are not truly independent will need to share state somehow, which reintroduces the coordination problem at a different layer.

External memory with selective promotion

Instead of relying on context window contents as the sole source of agent memory, you can maintain a persistent external store — a vector database, a structured key-value store, or even a flat file — and give the agent tools to read and write it. Important decisions, constraints, and intermediate results get written to external memory explicitly. At the start of each turn, the agent retrieves only what is relevant to the current step.

The “selective promotion” part matters. If every interaction gets stored, retrieval quality degrades as the store grows stale. Effective systems let the model evaluate each interaction for salience before persisting it, then periodically prune outdated facts and merge duplicates.

Structured compaction over naive truncation

When compaction is unavoidable, the quality of the summary prompt matters enormously. Good compaction separates what must be preserved verbatim (active constraints, unresolved errors, API signatures the agent is currently working with) from what can be summarized lossy (completed sub-tasks, exploratory paths that were abandoned). Some systems use hierarchical summarization: long traces are broken into chunks, each chunk is summarized independently, and then the chunk summaries are summarized into a top-level digest.

The goal is maximizing recall of critical facts while minimizing token cost. If your compaction prompt is generic (“summarize the conversation”), you will get generic output that drops edge cases. If it is specific (“list every constraint stated, every open error, and every file modified”), you preserve what the agent actually needs.

What this means for how you use agents today

The practical upshot is that long-running agents are not just a UX problem — they are an architecture problem. Treating a coding agent as a stateful assistant that remembers everything from session start sets you up for failures that are hard to diagnose because the agent does not announce what it has forgotten.

Short, scoped tasks with explicit constraints repeated at relevant points outperform long open-ended sessions on nearly every quality metric. When you do need multi-step workflows, check whether your tool exposes compaction summaries or memory state, and treat those as artifacts worth inspecting — not plumbing to ignore. The agent’s confidence at turn 30 is not evidence that it has an accurate model of turn 3.

Context engineering — the practice of deliberately curating what goes into the window, when, and in what form — is becoming as important as prompt engineering. The window is finite. Everything you put in it costs attention, and everything you omit is a retrieval problem waiting to happen.

FAQ

Does a larger context window fix memory decay? +
Partially. A larger window delays truncation, but the lost-in-the-middle effect still applies: information buried in the middle of a very long context gets less model attention than information at the edges. Research from 2025-2026 shows the U-shaped attention curve persists even in extended-context models. Larger windows buy time but do not eliminate the need for memory architecture.
How can I tell if context contamination is causing errors in my agent? +
Look for symptoms like the agent referencing a variable or function it deleted earlier, ignoring a constraint you stated early in the session, or contradicting a decision it made several steps ago. If you can inspect the agent trace, search for the point where the incorrect assumption first appears. In many cases, stale tool output or a compaction error is the root cause, not a raw model mistake.
Is there a token budget rule of thumb for long-running agents? +
There is no universal number, but a useful heuristic is to plan for compaction or sub-agent handoff before you reach 50-60% of the context window limit. This gives the summarization step enough room to work accurately without being rushed. Running an agent to 95% of its window and then compacting tends to produce lower-quality summaries because the compaction prompt itself must compete for the remaining space.

Related reading

See all AI & Dev Tools articles →

Get the best tools, weekly

One email every Friday. No spam, unsubscribe anytime.