pickuma.
Why AI Agents Fail Silently and How to Build an Observability Monitor

AI agents return 200s and exit cleanly while hallucinating, degrading under rate limits, and overrunning budgets. Here are the four silent failure modes and a minimal monitor that catches them in production.

A normal service fails loudly. The process crashes, the health check turns red, and your pager goes off. An LLM-powered agent fails differently. It returns a 200, exits with code 0, and hands you a confident answer that happens to be wrong. Nothing in your existing monitoring stack reacts, because by every metric it watches, nothing broke.

That gap is the problem. Uptime checks, error-rate dashboards, and latency alerts all watch the transport layer. An agent can keep that layer green while quietly producing garbage, burning your API budget, or looping for thirty steps where it used to take four. We ran a handful of agent workloads behind standard HTTP monitoring and watched the dashboard stay green through failures a human reviewer caught in seconds.

Four ways an agent fails without telling you

Hallucinated output. The agent invents an API parameter, a function name, or a citation. The response is still well-formed text or valid JSON, so a schema check passes it. The mistake only surfaces downstream — a failed deploy, a wrong number in a report, a support ticket.

Rate-limit degradation. When a provider returns a 429, a naive retry layer either hammers the endpoint in a retry storm or falls back to a smaller, cheaper model. The agent keeps running. Output quality drops, and unless you logged which model actually answered, nothing records that the run was degraded.
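
One way to leave that record is to have the fallback path report which model answered. This is a minimal sketch, not a drop-in retry layer: callWithFallback and callModel are hypothetical names, and it assumes the SDK error exposes the HTTP status as err.status.

```javascript
// Hypothetical helper: try the primary model, fall back on a 429,
// and always report which model produced the answer.
async function callWithFallback(callModel, primaryModel, fallbackModel) {
  try {
    const result = await callModel(primaryModel);
    return { result, model: primaryModel, degraded: false };
  } catch (err) {
    // Assumption: the SDK error carries the HTTP status as `err.status`.
    if (err.status !== 429) throw err;
    const result = await callModel(fallbackModel);
    return { result, model: fallbackModel, degraded: true };
  }
}
```

Logging the degraded flag alongside the answer is what lets you later separate "the agent got worse" from "the agent ran on the cheaper model that day."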

Cost overruns. A retry loop, a runaway tool call, or a prompt injection can multiply token usage. There is no exception thrown for “this run cost $4.10 instead of $0.03.” You find out on the monthly invoice.

Truncated responses. The model hits its output token ceiling and stops mid-sentence. The API tells you this — OpenAI returns finish_reason: "length", Anthropic returns stop_reason: "max_tokens" — but only if you read that field. Most agent code reads the content and ignores the stop reason entirely.
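
Reading that field takes a few lines. The sketch below checks both response shapes named above; isTruncated is a hypothetical helper name, and it assumes you pass it the raw response object from either provider.

```javascript
// Returns true when either provider reports that the output hit the token ceiling.
function isTruncated(response) {
  // Anthropic Messages API: top-level `stop_reason`.
  if (response.stop_reason === "max_tokens") return true;
  // OpenAI Chat Completions API: `finish_reason` on each choice.
  const choices = response.choices ?? [];
  return choices.some((choice) => choice.finish_reason === "length");
}
```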

What a monitor actually needs to watch

Because the transport layer stays green, a useful monitor has to watch one layer up: the semantics of what the model returned. Four signal categories cover most silent failures.

Cost. Track input and output tokens per call, per run, and cumulatively. A per-run token budget turns an invisible overrun into an alert.
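
A per-run budget can be as small as a counter. The sketch below is one possible shape, with RunBudget and the 10,000-token ceiling chosen purely for illustration.

```javascript
// Accumulates token usage for one agent run and reports when it exceeds a budget.
class RunBudget {
  constructor(maxTokens) {
    this.maxTokens = maxTokens; // hypothetical per-run ceiling, in tokens
    this.tokensIn = 0;
    this.tokensOut = 0;
    this.calls = 0;
  }

  // Record one call's usage; returns true while the run is still within budget.
  record(tokensIn, tokensOut) {
    this.tokensIn += tokensIn;
    this.tokensOut += tokensOut;
    this.calls += 1;
    return this.total() <= this.maxTokens;
  }

  total() {
    return this.tokensIn + this.tokensOut;
  }
}

const budget = new RunBudget(10000);
budget.record(1200, 300); // still within budget: 1,500 tokens so far
if (!budget.record(9000, 500)) {
  // 11,000 tokens now exceeds the 10,000 ceiling
  console.warn(`run over budget: ${budget.total()} tokens in ${budget.calls} calls`);
}
```

Counting tokens rather than dollars keeps the check independent of pricing changes; multiply by your provider's current rates if you want the alert in currency.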

Shape. Does the output parse? Does it match the schema the agent expects? Did the stop reason come back clean, or was it length / max_tokens? These are cheap, deterministic checks that need no model to evaluate.
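
As an illustration, here is a dependency-free version of the parse and schema checks. The schema format (field name mapped to an expected typeof result) and the ticketSchema fields are assumptions made for this sketch; a real project would more likely use a JSON Schema validator.

```javascript
// Hypothetical schema: field name -> expected `typeof` result.
const ticketSchema = { title: "string", priority: "number" };

// Returns a list of problems; an empty list means the output passed.
function checkShape(rawText, schema) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    return ["output is not valid JSON"];
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return ["output is not a JSON object"];
  }
  const problems = [];
  for (const [field, expectedType] of Object.entries(schema)) {
    if (!(field in parsed)) problems.push(`missing field: ${field}`);
    else if (typeof parsed[field] !== expectedType) problems.push(`wrong type for ${field}`);
  }
  return problems;
}
```

Returning a list instead of throwing lets the monitor log every problem in one event rather than stopping at the first.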

Behavior. Track tool-call success rate, retry count, fallback-model usage, and step count. An agent that suddenly takes thirty steps to finish a task it used to do in four is looping, even if it eventually returns something.
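
These counters reduce to a small summary per run. The step record shape and the 3x looping factor below are assumptions for the sketch, not measured values.

```javascript
// Summarizes one run from a list of step records.
// Assumed step shape: { toolOk: boolean, retries: number, usedFallback: boolean }.
function summarizeRun(steps) {
  const toolSuccesses = steps.filter((s) => s.toolOk).length;
  return {
    stepCount: steps.length,
    toolSuccessRate: steps.length === 0 ? 1 : toolSuccesses / steps.length,
    retries: steps.reduce((sum, s) => sum + s.retries, 0),
    fallbackCalls: steps.filter((s) => s.usedFallback).length,
  };
}

// Flags a run whose step count is far above what the task normally takes.
function looksLikeLooping(stepCount, typicalSteps, factor = 3) {
  return stepCount > typicalSteps * factor;
}
```

With a typical run of four steps, looksLikeLooping(30, 4) returns true, which matches the thirty-versus-four case described above.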

Drift. Track response length, refusal rate, and latency against a rolling baseline rather than a fixed threshold. This is the category that catches failures you did not predict. You cannot define in advance what a degraded output looks like, but you can detect that it does not look like last week’s.

Drift detection is the part teams skip and the part that pays off. Fixed thresholds only catch the failure modes you already imagined. A baseline catches the ones you didn’t.
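
A rolling baseline does not need statistics machinery to be useful. The sketch below keeps a fixed window of recent values for one metric and compares new values to the window's median. RollingBaseline, the window size, the tolerance, and the minimum sample count are all illustrative assumptions.

```javascript
// Keeps the last `size` observations of one metric and compares new values to their median.
class RollingBaseline {
  constructor(size = 200) {
    this.size = size;
    this.values = [];
  }

  add(value) {
    this.values.push(value);
    if (this.values.length > this.size) this.values.shift();
  }

  median() {
    if (this.values.length === 0) return null;
    const sorted = [...this.values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  // True when `value` differs from the median by more than `tolerance` (0.4 = 40%).
  // Returns false until `minSamples` observations exist, to avoid alerting on a cold start.
  hasDrifted(value, tolerance = 0.4, minSamples = 20) {
    const base = this.median();
    if (base === null || base === 0 || this.values.length < minSamples) return false;
    return Math.abs(value - base) / base > tolerance;
  }
}
```

Keep one instance per metric (response length, latency, refusal rate) so each is judged against its own recent history. The median is used here instead of the mean so that a single extreme run does not move the baseline much.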

Building a minimal monitor

You don’t need a new platform. Start with a wrapper around the LLM call itself:

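import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();
// Stand-in sink: swap in your log pipeline or a database insert.
const emit = (event) => console.log(JSON.stringify(event));

// Wrap every LLM call so it emits one structured event.
async function tracedCall(params) {
  const start = Date.now();
  const res = await client.messages.create(params);
  emit({
    model: params.model,
    tokensIn: res.usage.input_tokens,
    tokensOut: res.usage.output_tokens,
    stopReason: res.stop_reason, // "max_tokens" means the output was truncated
    latencyMs: Date.now() - start,
  });
  return res;
}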

Every call now emits a structured event. From there, the monitor is a set of small, boring rules:

  • Assert on the stop reason. If it is max_tokens, the response is truncated — flag the run instead of acting on a half-answer.
  • Validate the parsed output against a schema before the agent acts on it, not after.
  • Sum tokens per run against a budget. A reasonable starting alert is anything above three times your median run cost — tighten it once you have real data.
  • Store the events somewhere queryable: a Postgres table, your existing log pipeline, whatever you already operate.
  • Compute a rolling median of output length and alert when a run drops well below it. A drop of forty percent is a sane place to begin, not a measured constant.
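
Put together, the rules above can run as one function over a run's events. This sketch assumes the event fields emitted by tracedCall and a baseline object computed elsewhere from stored events. evaluateRun and the baseline field names are hypothetical. It also uses token counts as a stand-in for both cost and output length, and reads the forty percent figure as a 40% drop below the median.

```javascript
// Applies the rules above to the events emitted during one run.
// `baseline` is assumed to look like { medianRunTokens, medianOutputTokens }.
function evaluateRun(events, baseline) {
  const alerts = [];
  const runTokens = events.reduce((sum, e) => sum + e.tokensIn + e.tokensOut, 0);
  const outputTokens = events.reduce((sum, e) => sum + e.tokensOut, 0);

  // Any truncated call makes the whole run suspect.
  if (events.some((e) => e.stopReason === "max_tokens")) alerts.push("truncated");
  // Starting point from the text: more than 3x the median run cost.
  if (runTokens > 3 * baseline.medianRunTokens) alerts.push("over_budget");
  // Assumed reading of the 40% rule: output more than 40% below the median.
  if (outputTokens < 0.6 * baseline.medianOutputTokens) alerts.push("short_output");
  return alerts;
}
```

An empty array means the run passed every rule; anything else goes to whatever alerting channel you already use.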

None of those rules needs a model to evaluate it, so the monitor adds no inference cost per run and cannot hallucinate. The wrappers, schema validators, and alerting glue are mostly boilerplate, the kind of code an AI editor writes quickly while you focus on which signals matter for your agent.

A monitor like this won’t make your agent smarter. It will make its failures visible on the same day they happen instead of the day a user complains — which, for anything running unattended, is the difference between a quick fix and a quiet outage.

FAQ

Can't my existing APM tool (Datadog, Sentry) handle this?
Those tools watch the transport layer (status codes, latency, exceptions), which stays green during silent agent failures. Several APM vendors now sell LLM observability add-ons that capture token usage and traces, and those help. The schema and drift checks, though, are specific to your agent's expected output, so you still write those yourself.

How is a monitor different from running evals?
Evals run before deploy against a fixed test set and answer whether a version is good enough to ship. A monitor runs in production against live traffic and answers whether a run is failing right now. They catch different problems, and you want both.

What's the cheapest signal to start tracking?
The stop reason (finish_reason on OpenAI, stop_reason on Anthropic) and token counts. Both already come back in every API response, so capturing them adds no extra cost and immediately catches truncation and runaway spend.
