AI Coding Agents Must Reduce Maintenance Costs, Not Just Write Code
Why evaluating Copilot, Cursor, and Claude Code by lines generated misses the point — and how to measure whether your AI tooling is adding or removing technical debt.
A coding agent that drops 800 lines into your repo in 90 seconds feels productive. Six months later, when someone is paged at 2am to debug a layer of unused abstractions and inconsistent error handling, that same agent looks expensive. James Shore made the point bluntly in a recent essay: the value of an AI coding tool is not how fast it writes code — it is whether the resulting codebase costs less to keep alive.
We ran the same maintenance-heavy workflows through Copilot, Cursor, and Claude Code: refactors that touched 30+ files, bug fixes inside legacy modules, and incremental feature work on top of pre-existing patterns. The results are not what the lines-per-minute marketing suggests.
Maintenance is where software actually costs you
Industry estimates put maintenance at 60-80% of total software lifecycle cost, and that ratio has been remarkably stable across decades of research — from Lehman’s laws in the 1970s through Glass’s surveys in the 2000s. The reason is mechanical. Code gets read, debugged, and modified roughly an order of magnitude more often than it gets written. Anything that increases the read or modify cost — duplicated logic, inconsistent naming, unexplained branches, ill-fitting abstractions — compounds for the lifetime of the file.
An AI agent that doubles your typing speed but produces code that takes 1.5x longer to review and modify is, on net, a tax. The math is uncomfortable but unavoidable.
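To make the math concrete, here is the arithmetic under the roughly 10:1 read-to-write ratio cited above. The specific numbers are illustrative, not measured:

```ts
// Illustrative lifecycle-cost arithmetic, assuming code is read and
// modified ~10x as often as it is written (the ratio cited above).
const writeCost = 1; // baseline unit of effort to write a module
const readCost = 10; // lifetime read/modify effort for that module

const baseline = writeCost + readCost; // 11 units

// AI agent: writes twice as fast, but the output takes 1.5x longer
// to review and modify over its lifetime.
const withAgent = writeCost / 2 + readCost * 1.5; // 15.5 units

console.log({ baseline, withAgent }); // { baseline: 11, withAgent: 15.5 }
// Roughly a 41% increase in total lifecycle cost, despite faster generation.
```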
Where AI agents quietly add debt
The defaults of most AI coding tools are tuned for one metric: did the user accept the suggestion? That metric rewards generation that looks plausible at the diff level, not generation that fits the codebase.
Concrete failure patterns we saw repeatedly:
- Copy-paste at the function level. Rather than reuse an existing helper, the agent writes a new variant inline. The codebase ends up with three slightly different `formatCurrency` functions, all subtly diverging (see the first sketch after this list).
- Defensive code for impossible inputs. Try/catch around code that cannot throw, null checks for parameters required by the type signature, fallbacks for branches that are unreachable. Each one adds reading load and never fires.
- Misaligned conventions. The repo uses early returns; the agent writes nested ifs. The repo uses one logger; the agent imports a second. Reviewers either fix it (cost) or wave it through (debt).
- Verbose comments that restate the code. `// increment counter` above `counter++`. These age into noise that nobody trusts but nobody removes.
- Tests that exercise the mock. Generated tests often mock the thing under test, assert what the mock returns, and report green. Coverage goes up; confidence does not (see the second sketch below).
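Here is what the first two patterns look like in practice. The function names and currency details are invented for illustration; the shape of the problem is what shows up in real diffs:

```ts
// Pattern 1: a second "formatCurrency" generated inline instead of
// reusing the first. They already disagree on units and locale, and
// will keep drifting apart.
export function formatCurrency(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

// ...three folders away, generated months later:
function formatCurrencyV2(amount: number): string {
  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency: "USD",
  }).format(amount); // takes dollars, not cents; callers must know which
}

// Pattern 2: defensive code for impossible inputs. `cents` is a
// required number per the signature, and toFixed on a number cannot
// throw, so every branch below is dead weight a reviewer still reads.
export function formatCurrencySafe(cents: number): string {
  if (typeof cents !== "number") {
    return "$0.00"; // unreachable under the type signature
  }
  try {
    return `$${(cents / 100).toFixed(2)}`;
  } catch {
    return "$0.00"; // never fires
  }
}
```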
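And the last pattern: a test that mocks the unit under test, then asserts on the mock. This is a hypothetical Jest-style sketch; the module and names are invented, and the structure is the point:

```ts
import { jest, test, expect } from "@jest/globals";

// The module under test is replaced wholesale by a mock...
jest.mock("./priceService", () => ({
  getPrice: jest.fn(async () => 4200),
}));

import { getPrice } from "./priceService";

test("getPrice returns the price", async () => {
  // ...so this asserts that the mock returns what the mock was told
  // to return. Coverage goes up; nothing real is tested.
  await expect(getPrice("sku-123")).resolves.toBe(4200);
});
```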
None of these patterns are unique to AI. The issue is that AI produces them at scale, in PRs that look complete enough to merge.
How Copilot, Cursor, and Claude Code differ on maintenance
These tools sit on similar underlying models but make very different choices about context.
Copilot’s in-editor completion is optimized for the cursor’s immediate neighborhood. It rarely pulls in patterns from elsewhere in the repo, so it will happily reinvent a helper that already exists three folders away. For maintenance work in established codebases, that is exactly the wrong default.
Cursor moved early on full-repo context with its codebase indexing. When you @-reference files or let its agent mode plan a multi-file change, it tends to find and reuse existing abstractions rather than generate new ones. The cost: it is slower per suggestion, and you have to actively use the context features. If you only accept ghost-text completions, it falls back to Copilot-style behavior.
Claude Code runs as a CLI agent with a different operating mode. It reads files before writing them, often aggressively. On the same refactor task, we observed Claude Code reading roughly a dozen files of surrounding context before proposing a change, while Copilot effectively read one. The output is more conservative, more consistent with existing style, and easier to review.
This is not a blanket endorsement. Claude Code is slower and more expensive per task, and it can over-read on trivial changes. The point is that “AI coding agent” is not a uniform category. The maintenance footprint of the output depends heavily on how the tool gathers context before writing.
A better evaluation framework
Stop measuring AI tools by lines generated or suggestions accepted. Measure them by maintenance proxies you can actually track:
- Review iterations per AI-assisted PR. If reviewers consistently request changes, the tool is producing code that does not fit. Track this against non-AI PRs as a baseline.
- Time-to-merge. A PR that takes longer to merge than to write is a debt signal.
- Post-merge bug rate within 30 days. AI-generated code that ships clean but breaks later is the worst class of debt.
- Touch-count six months later. Code that gets rewritten or heavily modified within two quarters was never finished; it was just shipped. A measurement sketch follows this list.
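As a starting point for the touch-count metric, here is a minimal sketch that counts post-merge commits to a file within a window, using plain git from Node. The cohort paths and dates are placeholders; adapt it to however your team tags AI-assisted PRs:

```ts
// Sketch: "touch-count" is the number of commits that modify a file
// in the six months after it ships. Assumes it runs in a git checkout.
import { execSync } from "node:child_process";

function touchCount(file: string, mergedAt: Date, windowDays = 180): number {
  const since = mergedAt.toISOString();
  const until = new Date(
    mergedAt.getTime() + windowDays * 86_400_000,
  ).toISOString();
  // Count commits touching the file after merge, inside the window.
  const log = execSync(
    `git log --oneline --since="${since}" --until="${until}" -- "${file}"`,
    { encoding: "utf8" },
  );
  return log.split("\n").filter(Boolean).length;
}

// Placeholder cohort: swap in files from AI-assisted PRs and a
// human-written baseline, then compare the two distributions.
const aiCohort = ["src/billing/formatCurrency.ts"];
for (const file of aiCohort) {
  console.log(file, touchCount(file, new Date("2025-01-15")));
}
```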
These numbers are unforgiving. We have seen teams where AI-assisted PRs required noticeably more review iterations than human-written ones, and where 30-day bug rates trended higher on the AI cohort. The tools were “successful” by acceptance metrics and a net loss by every maintenance metric we cared about.
The honest version of the productivity claim is narrower than the marketing. AI coding agents are clearly net-positive for greenfield code, throwaway scripts, and well-isolated additions. The picture for established codebases is much more dependent on which tool, used how, with what review discipline. If your team’s review process treats AI output the same way it treats a junior engineer’s first PR — read carefully, push back on misfit code — the maintenance math can work. If your team treats AI output as “usually fine,” you are accumulating debt at machine speed.