Why AI Agents Fail Silently and How to Build an Observability Monitor
AI agents return 200s and exit cleanly while hallucinating, degrading under rate limits, and overrunning budgets. Here are the four silent failure modes and a minimal monitor that catches them in production.
A normal service fails loudly. The process crashes, the health check turns red, and your pager goes off. An LLM-powered agent fails differently. It returns a 200, exits with code 0, and hands you a confident answer that happens to be wrong. Nothing in your existing monitoring stack reacts, because by every metric it watches, nothing broke.
That gap is the problem. Uptime checks, error-rate dashboards, and latency alerts all watch the transport layer. An agent can keep that layer green while quietly producing garbage, burning your API budget, or looping for thirty steps where it used to take four. We ran a handful of agent workloads behind standard HTTP monitoring and watched the dashboard stay green through failures a human reviewer caught in seconds.
Four ways an agent fails without telling you
Hallucinated output. The agent invents an API parameter, a function name, or a citation. The response is still well-formed text or valid JSON, so a schema check passes it. The mistake only surfaces downstream — a failed deploy, a wrong number in a report, a support ticket.
Rate-limit degradation. When a provider returns a 429, a naive retry layer either piles on retries until it creates a retry storm or falls back to a smaller, cheaper model. The agent keeps running. Output quality drops, and unless you log which model actually answered, nothing records that the run was degraded.
Cost overruns. A retry loop, a runaway tool call, or a prompt injection can multiply token usage. There is no exception thrown for “this run cost $4.10 instead of $0.03.” You find out on the monthly invoice.
Truncated responses. The model hits its output token ceiling and stops mid-sentence. The API tells you this — OpenAI returns finish_reason: "length", Anthropic returns stop_reason: "max_tokens" — but only if you read that field. Most agent code reads the content and ignores the stop reason entirely.
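A minimal sketch of reading that field, assuming the two response shapes named above: OpenAI chat completions carry finish_reason on each entry in choices, and Anthropic messages carry stop_reason at the top level. The helper name isTruncated is hypothetical, and the responses below are stubs rather than real API output.

```javascript
// Hypothetical helper: true when a response was cut off at the output
// token ceiling. Handles two response shapes:
//   OpenAI chat completion: { choices: [{ finish_reason: "length" }] }
//   Anthropic message:      { stop_reason: "max_tokens" }
function isTruncated(res) {
  if (res === null || typeof res !== "object") return false;
  // Anthropic-style: the stop reason sits at the top level.
  if (res.stop_reason === "max_tokens") return true;
  // OpenAI-style: each choice carries its own finish reason.
  if (Array.isArray(res.choices)) {
    return res.choices.some((choice) => choice.finish_reason === "length");
  }
  return false;
}

// Usage with stubbed responses (no network call involved).
console.log(isTruncated({ stop_reason: "max_tokens" })); // true
console.log(isTruncated({ choices: [{ finish_reason: "stop" }] })); // false
```

Calling a check like this before the agent acts on the content is what turns a half-answer into a flagged run.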
What a monitor actually needs to watch
Because the transport layer stays green, a useful monitor has to watch one layer up: the semantics of what the model returned. Four signal categories cover most silent failures.
Cost. Track input and output tokens per call, per run, and cumulatively. A per-run token budget turns an invisible overrun into an alert.
Shape. Does the output parse? Does it match the schema the agent expects? Did the stop reason come back clean, or was it length / max_tokens? These are cheap, deterministic checks that need no model to evaluate.
Behavior. Track tool-call success rate, retry count, fallback-model usage, and step count. An agent that suddenly takes thirty steps to finish a task it used to do in four is looping, even if it eventually returns something.
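One way to turn these behavior signals into numbers is to fold a run's events into a summary. The sketch below assumes each event is a plain object with hypothetical fields (type, ok, retries, requestedModel, model). None of these names come from a specific SDK, and the loop factor of 3 is a placeholder to tune.

```javascript
// Hypothetical event shapes (assumptions for this sketch):
//   { type: "llm_call", requestedModel, model, retries }
//   { type: "tool_call", ok }
function summarizeRun(events) {
  const llmCalls = events.filter((e) => e.type === "llm_call");
  const toolCalls = events.filter((e) => e.type === "tool_call");
  const toolOk = toolCalls.filter((e) => e.ok).length;
  return {
    steps: llmCalls.length,
    retries: llmCalls.reduce((sum, e) => sum + (e.retries ?? 0), 0),
    // Counted as a fallback when the answering model differs from the
    // requested one. Some providers return a resolved model id for an
    // alias, so normalize names before comparing in real use.
    fallbackCalls: llmCalls.filter((e) => e.model !== e.requestedModel).length,
    toolSuccessRate: toolCalls.length === 0 ? null : toolOk / toolCalls.length,
  };
}

// Flags a run whose step count exceeds a multiple of the usual count.
function looksLikeLoop(summary, typicalSteps, factor = 3) {
  return summary.steps > typicalSteps * factor;
}

// Usage with stubbed events.
const summary = summarizeRun([
  { type: "llm_call", requestedModel: "big", model: "big", retries: 0 },
  { type: "llm_call", requestedModel: "big", model: "small", retries: 2 },
  { type: "tool_call", ok: true },
  { type: "tool_call", ok: false },
]);
console.log(summary.fallbackCalls, summary.toolSuccessRate); // 1 0.5
console.log(looksLikeLoop({ steps: 30 }, 4)); // true
```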
Drift. Track response length, refusal rate, and latency against a rolling baseline rather than a fixed threshold. This is the category that catches failures you did not predict. You cannot define in advance what a degraded output looks like, but you can detect that it does not look like last week’s.
Drift detection is the part teams skip and the part that pays off. Fixed thresholds only catch the failure modes you already imagined. A baseline catches the ones you didn’t.
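A rolling baseline can be as small as a fixed-size window and a median. The sketch below is one possible shape, not a tuned detector: the class name RollingBaseline, the window size, the minimum sample count, and the 40 percent drop threshold are all assumptions chosen for illustration.

```javascript
// Median of a non-empty numeric array (does not mutate the input).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

class RollingBaseline {
  constructor({ windowSize = 200, minSamples = 20, maxDrop = 0.4 } = {}) {
    this.windowSize = windowSize; // how many recent values to keep
    this.minSamples = minSamples; // stay quiet until this many are seen
    this.maxDrop = maxDrop; // fraction below the median that counts as drift
    this.values = [];
  }

  // Compares a new value with the median of earlier values, then records it.
  // Returns { drifted, baseline }; baseline is null until enough samples exist.
  observe(value) {
    let result = { drifted: false, baseline: null };
    if (this.values.length >= this.minSamples) {
      const baseline = median(this.values);
      result = { drifted: value < baseline * (1 - this.maxDrop), baseline };
    }
    this.values.push(value);
    if (this.values.length > this.windowSize) this.values.shift();
    return result;
  }
}

// Usage: track output length per run.
const lengths = new RollingBaseline({ minSamples: 3 });
[100, 110, 90].forEach((n) => lengths.observe(n));
console.log(lengths.observe(50).drifted); // true (baseline 100, cutoff 60)
console.log(lengths.observe(95).drifted); // false
```

This version records flagged values in the window too, so a sustained shift eventually becomes the new baseline. Whether that is desirable depends on the workload.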
Building a minimal monitor
You don’t need a new platform. Start with a wrapper around the LLM call itself:
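    // Assumes `client` is an initialized Anthropic SDK client and `emit` is
    // your own function that ships an event to a log or metrics sink.
    async function tracedCall(params) {
      const start = Date.now();
      const res = await client.messages.create(params);
      emit({
        model: params.model, // requested model; res.model reports the one that answered
        tokensIn: res.usage.input_tokens,
        tokensOut: res.usage.output_tokens,
        stopReason: res.stop_reason,
        latencyMs: Date.now() - start,
      });
      return res;
    }

Every call now emits a structured event. From there, the monitor is a set of small, boring rules: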
- Assert on the stop reason. If it is max_tokens, the response is truncated, so flag the run instead of acting on a half-answer.
- Validate the parsed output against a schema before the agent acts on it, not after.
- Sum tokens per run against a budget. A reasonable starting alert is anything above three times your median run cost — tighten it once you have real data.
- Store the events somewhere queryable: a Postgres table, your existing log pipeline, whatever you already operate.
- Compute a rolling median of output length and alert when a run drops well below it. Forty percent is a sane place to begin, not a measured constant.
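The budget and validation rules above could be sketched roughly as follows. The token fields match the ones emitted by tracedCall, while runId is a hypothetical field added here to group calls into runs. The hand-rolled validator stands in for whatever schema library you already use, and the 2000-token budget is a placeholder.

```javascript
// Sums tokens per run and returns the runs that exceed the budget.
// Assumes each event has { runId, tokensIn, tokensOut }.
function runsOverBudget(events, maxTokensPerRun) {
  const totals = new Map();
  for (const e of events) {
    const used = (totals.get(e.runId) ?? 0) + e.tokensIn + e.tokensOut;
    totals.set(e.runId, used);
  }
  return [...totals]
    .filter(([, used]) => used > maxTokensPerRun)
    .map(([runId, used]) => ({ runId, used }));
}

// Minimal stand-in for a schema check: parses JSON text and verifies that
// each required key has the expected typeof. Returns a list of problems;
// an empty list means the output passed.
function validateOutput(text, requiredTypes) {
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch {
    return ["output is not valid JSON"];
  }
  if (parsed === null || typeof parsed !== "object") {
    return ["output is not a JSON object"];
  }
  return Object.entries(requiredTypes)
    .filter(([key, type]) => typeof parsed[key] !== type)
    .map(([key, type]) => `field "${key}" should be a ${type}`);
}

// Usage with stubbed events and output.
const events = [
  { runId: "a", tokensIn: 500, tokensOut: 200 },
  { runId: "a", tokensIn: 900, tokensOut: 700 },
  { runId: "b", tokensIn: 300, tokensOut: 100 },
];
console.log(runsOverBudget(events, 2000)); // [ { runId: 'a', used: 2300 } ]
console.log(
  validateOutput('{"action":"deploy"}', { action: "string", target: "string" })
); // [ 'field "target" should be a string' ]
```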
None of those rules needs a model to evaluate it, so the monitor adds no inference cost per run and cannot hallucinate. The wrappers, schema validators, and alerting glue are mostly boilerplate, the kind of code an AI editor writes quickly while you focus on which signals matter for your agent.
A monitor like this won’t make your agent smarter. It will make its failures visible on the same day they happen instead of the day a user complains — which, for anything running unattended, is the difference between a quick fix and a quiet outage.