AI-Powered Observability: Querying Telemetry in Plain English
Observability platforms now let you ask questions of logs, metrics, and traces in natural language. Here's how the translation layer works, what it genuinely helps with, and where it breaks.
For most of the last decade, asking a meaningful question of your observability data required knowing the query language of the platform it lived in. LogQL for Loki. PromQL for Prometheus. ES|QL or KQL (Kibana Query Language) for Elastic. Kusto Query Language, also abbreviated KQL, for Azure Monitor. Each has its own syntax, its own operator model, its own edge cases. The underlying data was often collected with OpenTelemetry (structured, semantically tagged, ready to answer detailed questions), but you still had to write the incantation yourself.
That is changing. Grafana Assistant, introduced at GrafanaCON 2025, can take a natural language description and generate PromQL or LogQL behind the scenes. Elastic’s AI Assistant for Observability can build ES|QL queries from a plain English prompt and execute them directly in the chat interface. Datadog’s Bits AI and its DDSQL editor accept natural language and produce SQL-like queries over your telemetry. The mechanics differ, but the surface is the same: type what you want to know, get a query (or an answer) back.
This article looks at how the translation layer works, what it genuinely speeds up during an incident, and where it fails — because it does fail, and the failure mode matters more than the success rate.
How the Translation Works
The core mechanism is straightforward. An LLM is given context about the target query language, the schema of the data available (field names, types, semantic conventions), and your natural language question. It produces a query string. The platform executes the query. The result comes back and is either shown to you raw or summarized by the same LLM.
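A minimal sketch of the first step, assembling what the model sees, may make this concrete. The function name, prompt wording, and schema below are assumptions for illustration only; no platform's actual prompt format is public in this article, and the model call itself is left out.

```python
def build_prompt(question: str, language: str, schema: dict[str, str]) -> str:
    """Assemble the text an LLM would receive: target language, schema, question."""
    field_lines = "\n".join(
        f"- {name}: {ftype}" for name, ftype in sorted(schema.items())
    )
    return (
        f"Target query language: {language}\n"
        f"Available fields:\n{field_lines}\n"
        f"Question: {question}\n"
        "Return only the query."
    )


# Hypothetical schema; the attribute names follow OTel semantic conventions,
# the type names are placeholders.
schema = {
    "service.name": "keyword",
    "http.response.status_code": "long",
}
prompt = build_prompt(
    "Which services returned 5xx in the last hour?", "ES|QL", schema
)
print(prompt)
```

Everything the model knows about your data is in that string. A field that is not listed does not exist as far as the generated query is concerned.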
The quality of this translation depends heavily on two things: how well the LLM knows the target language, and how much schema context it receives.
OpenTelemetry plays a specific role here. Because OTel-instrumented data follows published semantic conventions (http.response.status_code, service.name, db.system, span.kind, and hundreds of other standardized attribute names), the LLM has a predictable vocabulary to reason over. It can assume that http.response.status_code is an integer attribute on HTTP spans and that a span.kind of server identifies entry-point spans. When Elastic, Grafana, and others tell the LLM what field names exist in your index or data source, they are leaning on the fact that OTel-instrumented data is structured and labeled consistently.
Without that consistency, the translation breaks down faster. If your logs are semi-structured with inconsistent field names across services, the LLM has to guess. It usually guesses something plausible that is wrong.
The schema injection approach varies by platform. Elastic’s AI Assistant uses Retrieval Augmented Generation (RAG) against your index mappings — it fetches the relevant field definitions before passing them to the model. Grafana Assistant uses “careful context selection” to reduce ambiguity. Datadog’s natural language layer reasons over your infrastructure’s tag taxonomy. None of these approaches fully escape the underlying problem: the LLM has to work with what it is given, and what it is given is a compressed, sometimes incomplete description of your data.
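To see why "compressed, sometimes incomplete" matters, here is a toy version of schema selection. It uses plain keyword overlap as a stand-in for retrieval; real systems use richer retrieval than this, and nothing here reflects how Elastic, Grafana, or Datadog actually rank fields. The mapping and descriptions are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens, split on anything that is not a letter or digit."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def select_fields(question: str, mapping: dict[str, str], limit: int = 3) -> list[str]:
    """Rank fields by token overlap with the question; keep the top matches."""
    q = tokens(question)
    scored = [
        (len(q & tokens(name + " " + description)), name)
        for name, description in mapping.items()
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]


# Hypothetical index mapping: field name -> short description.
mapping = {
    "http.response.status_code": "HTTP response status code",
    "service.name": "logical name of the service",
    "db.system": "database management system",
    "k8s.pod.name": "name of the Kubernetes pod",
}

selected = select_fields("which service returned status 500 responses", mapping)
nothing = select_fields("failed checkouts", mapping)
print(selected, nothing)
```

The second question selects no fields at all, because nothing in the mapping says "checkout" or "failed". A model given that empty context still has to produce a query, which is where plausible guesses come from.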
Where It Genuinely Helps During Incidents
The most honest answer is: it helps people who would otherwise write nothing at all.
If you are an on-call engineer who knows your product domain but is not fluent in PromQL, the difference between “I have to ask someone who knows PromQL” and “I can type a question and get a starting point” is real. Elastic makes the same point, noting that queries that “took minutes of expert work” can become accessible to engineers without deep DSL knowledge. The benefit is not hypothetical: reducing the friction between a question and a first query is valuable during an incident, where every minute of confusion has a cost.
The second area where AI assistance helps is correlation. Manual cross-signal investigation — starting with a spike in a latency metric, then pivoting to traces, then pivoting to logs — involves multiple query rewrites across different syntaxes. An AI assistant that can hold that context across a conversation and rewrite each query as you narrow the scope reduces cognitive overhead even for engineers who know the query languages.
Grafana’s implementation is explicit about this: the Assistant can correlate across data sources in a single conversation, generating PromQL for Prometheus, LogQL for Loki, and TraceQL for Tempo without requiring you to manually switch contexts. That multi-language fluency is harder for a human to maintain under pressure.
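The context being carried across that conversation is small: a service and a time window, re-expressed in three syntaxes. A sketch of that idea follows. The metric name, the service_name label, and the helper names are assumptions about a hypothetical setup, not anything Grafana Assistant is known to emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    service: str
    window: str  # a range duration such as "5m"


def promql_p95(scope: Scope) -> str:
    # Assumes a histogram named http_server_request_duration_seconds
    # with a service_name label; both are placeholders.
    return (
        "histogram_quantile(0.95, sum by (le) (rate("
        f'http_server_request_duration_seconds_bucket{{service_name="{scope.service}"}}'
        f"[{scope.window}])))"
    )


def logql_errors(scope: Scope) -> str:
    # Assumes Loki streams carry a service_name label.
    return f'{{service_name="{scope.service}"}} |= "error"'


def traceql_slow(scope: Scope, min_duration: str = "500ms") -> str:
    return f'{{ resource.service.name = "{scope.service}" && duration > {min_duration} }}'


scope = Scope(service="payment", window="5m")
for query in (promql_p95(scope), logql_errors(scope), traceql_slow(scope)):
    print(query)
```

Narrowing the investigation means changing one Scope value and regenerating all three queries, which is the bookkeeping a human tends to get wrong under pressure.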
The Failure Modes You Should Understand
The canonical failure mode here is a query that is syntactically valid, executes without errors, and returns results — but answers a subtly different question than the one you asked. This is more dangerous than a query that throws a parse error.
Consider a natural language question like “show me all errors from the payment service in the last hour.” An LLM-generated LogQL query might filter on level="error" but miss application-layer errors logged at level="info" with an error field set to true — a pattern common in services where the log framework and the error taxonomy were built separately. The query runs. You see fewer errors than actually occurred. You close the incident. The actual errors were there; the query just did not find them.
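The gap is easy to reproduce on a handful of records. In this sketch the log records, field names, and the two filters are made up for illustration; the narrow filter mirrors a level-only query, the broad one also honors an explicit error flag.

```python
# Hypothetical parsed log records.
records = [
    {"service": "payment", "level": "error", "msg": "card declined upstream"},
    {"service": "payment", "level": "info", "error": True, "msg": "charge failed: db timeout"},
    {"service": "payment", "level": "info", "msg": "charge ok"},
    {"service": "cart", "level": "error", "msg": "cache miss storm"},
]


def narrow(record: dict) -> bool:
    """What a level-only generated query would match."""
    return record["service"] == "payment" and record["level"] == "error"


def broad(record: dict) -> bool:
    """Also matches records that flag an error at a non-error level."""
    return record["service"] == "payment" and (
        record["level"] == "error" or record.get("error") is True
    )


narrow_hits = [r for r in records if narrow(r)]
broad_hits = [r for r in records if broad(r)]
missed = [r for r in broad_hits if r not in narrow_hits]

print(len(narrow_hits), len(broad_hits), [r["msg"] for r in missed])
```

Both filters run cleanly and both return results. Only someone who knows the service logs failures at info level would think to write the second one.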
Elastic’s 2026 observability trends report is candid about this: 53% of organizations cite hallucinations as a concern with GenAI for observability, specifically the risk of AI generating “confident nonsense” that worsens incidents if acted upon without human verification. That number is notable because it comes from practitioners who are already using these tools.
A second failure mode is context contamination. If your OTel instrumentation is inconsistent — some services emit http.status_code, others use http.response.status_code, because semantic convention versions changed between when different teams instrumented their services — the LLM may generate a query using one field name and miss data from services using the other. The LLM cannot know about your team’s instrumentation inconsistencies unless you tell it, and no current platform surfaces that automatically.
A third failure is over-reliance on the generated summary rather than the raw result. When the platform returns 10,000 matching log lines and the AI assistant summarizes them as “the payment service had intermittent errors related to database timeouts,” you are reading a compression of the data, not the data. Summaries lose outliers. They can emphasize the most common pattern and hide a rare but critical event. If a summary does not mention something, that is not evidence that it did not happen.
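A quick frequency count shows how this happens. The messages and counts here are invented to match the 10,000-line scenario; the point is only that the dominant pattern and the rare tail are separate questions.

```python
from collections import Counter

# Hypothetical result set: 10,000 log messages with one dominant pattern.
messages = (
    ["db timeout"] * 9_990
    + ["connection reset"] * 9
    + ["ledger write rejected: duplicate charge"]
)

counts = Counter(messages)

# What a "most common pattern" summary would surface.
top = counts.most_common(1)

# What it leaves out: patterns seen ten times or fewer.
rare = sorted(message for message, n in counts.items() if n <= 10)

print(top, rare)
```

The single duplicate-charge line is 0.01% of the result set and may be the most important event in it. Looking at the tail of the raw result, not just the summary, is how you find it.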
Understanding Your Telemetry Still Matters
There is a tempting narrative that natural language querying removes the need to understand your data. It does not. You still need to know what fields exist on your spans, what values are plausible, how your services instrument errors, and what the typical distributions look like. Without that knowledge, you cannot evaluate whether a generated query is asking the right question. The LLM is fluent in syntax; you have to be fluent in semantics.
OpenTelemetry helps here, but only up to a point. The semantic conventions standardize attribute names, not values, and not the decision about what your application logs. A service that emits span.kind=server and a 200 status code on every request — because the team added try/catch blocks that swallow exceptions — looks healthy in any query language. The data is what it is; the query layer sits on top of it.
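The swallowed-exception case can be shown without any tracing library. In this sketch a plain dict stands in for a span, and the handler and function names are hypothetical; error.type is the semantic-convention attribute for describing a failure.

```python
def charge(amount: int) -> None:
    """Stand-in for a downstream call that fails."""
    raise TimeoutError("db timeout")


def handle_swallowing(amount: int) -> dict:
    """Catches the failure, records nothing, reports success."""
    span = {"span.kind": "server"}
    try:
        charge(amount)
    except Exception:
        pass  # the exception disappears here
    span["http.response.status_code"] = 200
    return span


def handle_recording(amount: int) -> dict:
    """Same failure, but the span carries it."""
    span = {"span.kind": "server"}
    try:
        charge(amount)
        span["http.response.status_code"] = 200
    except Exception as exc:
        span["http.response.status_code"] = 500
        span["error.type"] = type(exc).__name__
    return span


healthy_looking = handle_swallowing(100)
honest = handle_recording(100)
print(healthy_looking, honest)
```

No query, generated or hand-written, can find the failure in the first span, because the information was never emitted.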
The teams that get the most from AI-assisted querying are the ones who have invested in clean instrumentation: consistent field names, meaningful service names, structured log events with explicit error fields, and span attributes that map to actual business operations. In that environment, the natural language translation layer adds real speed. In a poorly instrumented environment, it adds speed to the process of generating wrong answers.
What This Looks Like in Practice
The platforms converging on this capability — Elastic, Grafana, Datadog — have each made somewhat different implementation choices. Elastic’s AI Assistant is tightly integrated with ES|QL and Kibana, and uses your actual index mappings as context. Grafana Assistant remains in limited preview and is explicit that accuracy is still a primary development focus. Datadog’s Bits AI extends beyond query generation into agentic workflows — it can investigate an alert, correlate signals, and propose a fix — but those capabilities carry proportionally higher risk of acting on a wrong conclusion.
What they share: they all expose the generated query. They all expect human validation before action. And they all work better on OTel-instrumented data than on ad hoc log formats, because the standardized schema gives the LLM a reliable map.
OTel’s own production adoption is accelerating — one industry survey put production usage nearly doubling year-over-year in 2025. The broader the OTel footprint in your stack, the more schema context these tools can work with, and the better the translation quality.
If you are evaluating whether to adopt AI-assisted querying for your team, the useful question is not “does it work” — it works often enough to be worth using. The useful question is “what process do I have for verifying what it produces?” Treating generated queries as hypotheses you confirm, rather than answers you act on, is the practice that separates teams that benefit from these tools from teams that get burned by them.
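One lightweight form of that verification is to keep a small hand-labeled sample of records and run every generated filter against it before trusting the result. The sketch below assumes the generated query can be expressed as a Python predicate for testing purposes; the function name and sample records are hypothetical.

```python
def check_generated_filter(generated, known_hits, known_misses) -> dict:
    """Test a generated filter against records whose correct answer is known."""
    missed = [r for r in known_hits if not generated(r)]
    false_hits = [r for r in known_misses if generated(r)]
    return {
        "missed": missed,
        "false_hits": false_hits,
        "ok": not missed and not false_hits,
    }


# Hand-labeled sample: records that must match and records that must not.
known_hits = [
    {"level": "error", "msg": "card declined upstream"},
    {"level": "info", "error": True, "msg": "charge failed: db timeout"},
]
known_misses = [{"level": "info", "msg": "charge ok"}]


def generated(record: dict) -> bool:
    """Predicate equivalent of a level-only generated query."""
    return record.get("level") == "error"


report = check_generated_filter(generated, known_hits, known_misses)
print(report["ok"], [r["msg"] for r in report["missed"]])
```

The sample encodes what your team knows about how its services report errors. A generated query that fails against it is a hypothesis that has been disproved cheaply, before anyone acted on it.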