pickuma.

δ-mem Explained: What Online Memory Means for LLM Agent Cost and Recall

A new arXiv preprint proposes δ-mem, an online memory mechanism for LLM agents. What it claims, what remains unverified, and how to decide whether a persistent memory layer fits your agent, RAG pipeline, or chat app.


A new arXiv preprint (2605.12357) introduces δ-mem — “delta-mem” — an online memory mechanism that promises persistent, low-overhead context for LLM agents across long-running sessions. If you build agents, RAG pipelines, or chat apps, that one sentence touches the three things you actually get paged about: latency, recall quality, and per-request token spend.

We mapped the preprint’s framing against the memory approaches developers run in production today. Here’s the problem δ-mem is positioned to solve, what the paper does and doesn’t commit to, and how to decide whether a memory layer like this belongs in your stack — without re-architecting anything on the strength of an abstract.

Why memory is the expensive part of every agent

The default memory strategy in most agent codebases is no strategy: append every message to the conversation array and resend the whole thing on each API call. That works until sessions get long. If your agent accumulates 2,000 tokens per turn, tool results included, the history alone reaches 100,000 tokens by turn 50, and you pay to reprocess all of it on every subsequent call. Because each call resends everything before it, cumulative input cost grows roughly quadratically with session length, and time-to-first-token degrades along with it.
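To make the arithmetic concrete, here is a minimal sketch of the full-replay cost model. It assumes a flat 2,000 tokens added per turn and ignores system prompts, output tokens, and any prompt caching your provider offers; the function names are ours, chosen for illustration.

```python
def history_tokens(tokens_per_turn: int, turn: int) -> int:
    """Tokens of accumulated history sent on the call for a given turn (1-indexed)."""
    return tokens_per_turn * turn


def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens across all calls when the full history is resent every turn."""
    return sum(history_tokens(tokens_per_turn, t) for t in range(1, turns + 1))


# Per-call history size grows linearly; the running total grows quadratically.
for turns in (10, 50, 100):
    print(turns, history_tokens(2_000, turns), cumulative_input_tokens(2_000, turns))
```

Under these assumptions a 50-turn session sends 2,550,000 input tokens in total, and a 100-turn session sends 10,100,000: doubling the session length roughly quadruples the bill.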

The standard mitigations each trade something away:

  • Sliding windows and summarization cap cost but are lossy. The constraint your user stated in turn 3 (“never touch the prod database”) gets compressed into a summary, then compressed again, then it’s gone — usually right before the agent needs it.
  • Vector-store RAG over chat history keeps everything but relocates the problem. You now run an embedding pass and a retrieval call per turn, you make chunking decisions that silently shape recall, and multi-hop questions (“what did we decide after we ruled out option B?”) sit exactly where similarity search is weakest.
  • Offline memory pipelines — batch jobs that distill transcripts into a user profile after the session ends — can’t help the session that’s currently running.

That last gap is what “online” means in the paper’s title: memory that updates during the session and is immediately usable, rather than reconstructed from logs afterwards. For a coding agent on hour two of a refactor, or a support bot on message 40 of an escalation, that distinction is the whole game.
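The lossy-window failure mode from the list above is easy to reproduce. This is a small sketch, assuming a window that keeps the last 20 messages and a 40-turn session; the message contents and the helper name are made up for illustration.

```python
from collections import deque


def windowed_history(messages: list[str], max_messages: int) -> list[str]:
    """Keep only the most recent max_messages messages; older ones are dropped."""
    return list(deque(messages, maxlen=max_messages))


# A 40-turn session where the user states a hard constraint in turn 3.
messages = [f"turn {i}: routine tool output" for i in range(1, 41)]
messages[2] = "turn 3: never touch the prod database"

window = windowed_history(messages, max_messages=20)

# The window now starts at turn 21, so the constraint is no longer in context.
constraint_survives = any("prod database" in m for m in window)
print(window[0])
print(constraint_survives)
```

Summarization softens this cliff but does not remove it: whether the constraint survives then depends on what each summarization pass chooses to keep.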

What the δ-mem preprint claims — and what to verify before you care

The preprint’s framing commits δ-mem to three properties:

  1. Persistence across sessions. Memory survives the end of a conversation, so the agent doesn’t restart from zero context tomorrow.
  2. Low overhead. Maintaining memory shouldn’t itself dominate your latency budget or token bill — the failure mode of several earlier memory systems, where the bookkeeping calls cost more than the history they replaced.
  3. Online operation. Updates happen incrementally as the session runs, not in a post-hoc batch job.

The δ in the name points at the mechanism: incremental updates (deltas) applied to a persistent memory state, instead of repeatedly reprocessing or re-summarizing the full history. That is our reading of the paper’s framing; for the implementation specifics, go to the PDF.
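To show what "deltas applied to a persistent state" can look like in general, here is a toy sketch. It is not the preprint's mechanism, which we have not reproduced; the `Delta` and `MemoryState` types, the key-value representation, and the example facts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Delta:
    """One incremental change: set a key, or delete it when value is None."""
    key: str
    value: Optional[str]


@dataclass
class MemoryState:
    facts: dict[str, str] = field(default_factory=dict)

    def apply(self, delta: Delta) -> None:
        """Apply one delta in place; the work done does not depend on history length."""
        if delta.value is None:
            self.facts.pop(delta.key, None)
        else:
            self.facts[delta.key] = delta.value

    def render(self) -> str:
        """Serialize the current state as the context string sent to the model."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


state = MemoryState()
state.apply(Delta("constraint", "never touch the prod database"))
state.apply(Delta("candidate", "option B"))
state.apply(Delta("candidate", None))  # option B was ruled out
state.apply(Delta("decision", "go with option A"))
print(state.render())
```

The point of the sketch is the shape of the cost: each turn pays for one small update, and the rendered context stays as large as the current state rather than the full transcript. How a real system decides what the deltas are is the hard part, and that is what to look for in the paper.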

And when you do read it, four questions separate a useful result from a benchmark-shaped one:

  • Which baselines? Beating full-context replay or a naive sliding window is table stakes. A tuned RAG-over-history setup or a hierarchical memory system in the MemGPT lineage is the comparison that matters for anyone with an existing pipeline.
  • Overhead measured in what? Tokens, wall-clock latency, or both — and at which session lengths. “Low overhead” at 20 turns says little about turn 500.
  • How is recall evaluated? Multi-session QA over long horizons stresses memory differently than single-needle retrieval. Check whether the evaluation includes questions whose answers require composing facts from separate, distant turns.
  • Is there code? A repo you can run against your own transcripts is worth more than any table in the paper.

Where a memory layer like this fits — and what to do this week

Here’s how the δ-mem class of system compares to what you’re likely running now:

Approach                    | Token cost as sessions grow         | Cross-session memory              | Extra infrastructure
Full history replay         | Grows every turn                    | None                              | None
Window + summarization      | Flat after the cap, lossy           | None, unless persisted separately | None
RAG over transcripts        | Roughly flat, plus retrieval tokens | Yes                               | Embeddings + vector store
Online memory (δ-mem class) | Claimed near-flat                   | Yes                               | Memory store + update path

Whether the “claimed” row earns a place in your stack depends on your numbers, not the paper’s. Two concrete moves:

1. Measure how much of your input is replayed history. Log input token counts per call and tag the share that is prior-turn content versus new instructions and retrieved documents. If history is a minor slice of your spend, a memory layer solves a problem you don’t have. If it dominates — common for tool-heavy agents, where every tool result gets dragged through every subsequent call — the ceiling on savings is large enough to justify the spike.
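A minimal sketch of that accounting follows. It assumes you can already count tokens per input segment, for example with your provider's tokenizer or usage report; the `CallLog` fields and the sample numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    """Input token counts for one model call, split by where the input came from."""
    history_tokens: int      # prior-turn messages and tool results
    instruction_tokens: int  # system prompt plus the new user message
    retrieved_tokens: int    # documents pulled in by retrieval


def history_share(calls: list[CallLog]) -> float:
    """Fraction of all input tokens across the calls that is replayed history."""
    total = sum(
        c.history_tokens + c.instruction_tokens + c.retrieved_tokens for c in calls
    )
    if total == 0:
        return 0.0
    return sum(c.history_tokens for c in calls) / total


# Three calls from one session: history grows while the other segments stay flat.
calls = [
    CallLog(history_tokens=0, instruction_tokens=500, retrieved_tokens=1_500),
    CallLog(history_tokens=2_000, instruction_tokens=500, retrieved_tokens=1_500),
    CallLog(history_tokens=4_000, instruction_tokens=500, retrieved_tokens=1_500),
]
print(f"{history_share(calls):.0%}")
```

In this made-up session, history is already half of the input after three calls. Run the same calculation over a day of real logs, grouped by session length, before deciding whether a memory layer is worth a spike.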

2. Put memory behind a seam. If your conversation state is assembled inline wherever you call the model, you can’t experiment with anything. A narrow interface — getContext(sessionId) on the way in, recordEvent(sessionId, event) on the way out — turns every future memory backend, δ-mem included, into a swappable implementation instead of a rewrite. Build the seam now; evaluate candidates as their code lands.
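One way to sketch the seam, using Python naming (`get_context` and `record_event` stand in for the getContext and recordEvent calls named above). The `MemoryBackend` protocol, both implementations, and `run_turn` are hypothetical; the model call itself is left out.

```python
from typing import Protocol


class MemoryBackend(Protocol):
    """The seam: anything with these two methods can be swapped in."""
    def get_context(self, session_id: str) -> list[str]: ...
    def record_event(self, session_id: str, event: str) -> None: ...


class FullReplayMemory:
    """Baseline backend: stores every event and returns all of them."""
    def __init__(self) -> None:
        self._events: dict[str, list[str]] = {}

    def get_context(self, session_id: str) -> list[str]:
        return list(self._events.get(session_id, []))

    def record_event(self, session_id: str, event: str) -> None:
        self._events.setdefault(session_id, []).append(event)


class LastNMemory(FullReplayMemory):
    """Alternative backend: same storage, but only the last n events are returned."""
    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        super().__init__()
        self._n = n

    def get_context(self, session_id: str) -> list[str]:
        return super().get_context(session_id)[-self._n:]


def run_turn(memory: MemoryBackend, session_id: str, user_message: str) -> list[str]:
    """Assemble the prompt through the seam, then record the new message."""
    prompt = memory.get_context(session_id) + [user_message]
    memory.record_event(session_id, user_message)
    return prompt


memory = LastNMemory(n=2)
for message in ["a", "b", "c"]:
    run_turn(memory, "s1", message)
prompt = run_turn(memory, "s1", "d")
print(prompt)
```

Because `run_turn` only depends on the two methods, replacing `LastNMemory(n=2)` with `FullReplayMemory()` or a future backend changes one line, and the same recorded transcripts can be replayed through each candidate for comparison.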

If you’re scaffolding that seam this week, an agent-capable editor shortens the loop considerably — generating the interface, a replay-based test harness, and a token-accounting script is exactly the kind of well-specified grunt work it’s good at.

The honest summary: δ-mem names a real and expensive problem, and “online, persistent, low-overhead” is the correct wishlist. Whether this particular mechanism delivers is unknowable from where we sit — but the instrumentation and the seam are worth building regardless of which memory system eventually wins.

FAQ

Can I use δ-mem in production today?
Not on the strength of the preprint alone. It's an unreviewed paper, and you should check whether the authors have released code before planning anything. Until you can run it against your own transcripts, treat it as research, but you can prepare by putting your memory logic behind a swappable interface now.

How is online memory different from RAG over chat history?
RAG retrieves chunks from stored transcripts at query time, which adds an embedding and retrieval step per turn and inherits the weaknesses of similarity search. An online memory system maintains a persistent, incrementally updated memory state during the session, so relevant context is available without a per-turn retrieval round trip. They can coexist: RAG for your document corpus, memory for the interaction itself.

Will a memory layer actually cut my token bill?
Only if replayed conversation history is a significant share of your input tokens. Measure that first by logging per-call token composition. Tool-heavy agents with long sessions usually have the most to gain; short chat sessions with heavy document retrieval usually don't.
