AI-Powered Observability: Querying Telemetry in Plain English

For most of the last decade, asking a meaningful question of your observability data required knowing the query language of the platform it lived in. LogQL for Loki. PromQL for Prometheus. ES|QL or KQL (Kibana Query Language) for Elastic. KQL again, this time meaning Kusto Query Language, for Azure Monitor. Each has its own syntax, its own operator model, its own edge cases. The data underneath was often collected with OpenTelemetry (structured, semantically tagged, ready to answer detailed questions), but you still had to write the incantation yourself.

That is changing. Grafana Assistant, introduced at GrafanaCON 2025, can take a natural language description and generate PromQL or LogQL behind the scenes. Elastic’s AI Assistant for Observability can build ES|QL queries from a plain English prompt and execute them directly in the chat interface. Datadog’s Bits AI and its DDSQL editor accept natural language and produce SQL-like queries over your telemetry. The mechanics differ, but the surface is the same: type what you want to know, get a query (or an answer) back.

This article looks at how the translation layer works, what it genuinely speeds up during an incident, and where it fails — because it does fail, and the failure mode matters more than the success rate.

How the Translation Works

The core mechanism is straightforward. An LLM is given context about the target query language, the schema of the data available (field names, types, semantic conventions), and your natural language question. It produces a query string. The platform executes the query. The result comes back and is either shown to you raw or summarized by the same LLM.
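A minimal sketch of that loop, under stated assumptions: the names FieldInfo, build_prompt, translate, and fake_llm are hypothetical, the prompt layout is invented for illustration, and the LLM is a stub that returns a canned LogQL string. No vendor’s actual prompt or API is shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FieldInfo:
    """One schema entry handed to the model (hypothetical structure)."""
    name: str
    type: str
    description: str = ""


def build_prompt(language: str, schema: List[FieldInfo], question: str) -> str:
    """Combine query language, schema context, and the user's question."""
    field_lines = "\n".join(
        f"- {f.name} ({f.type}) {f.description}".rstrip() for f in schema
    )
    return (
        f"Target query language: {language}\n"
        f"Available fields:\n{field_lines}\n"
        f"Question: {question}\n"
        "Return only the query."
    )


def translate(question: str, language: str, schema: List[FieldInfo],
              llm: Callable[[str], str]) -> str:
    """Ask the model for a query string; the platform would then execute it."""
    return llm(build_prompt(language, schema, question)).strip()


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call; label names are assumptions.
    return '  {service_name="payment"} | json | level="error"  '


schema = [
    FieldInfo("service.name", "string", "logical service name"),
    FieldInfo("http.response.status_code", "int"),
]
query = translate(
    "show me all errors from the payment service in the last hour",
    "LogQL", schema, fake_llm,
)
print(query)
```

The point of the sketch is the shape of the inputs: the model only sees the fields that build_prompt lists, so anything missing from the schema context is invisible to it.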

The quality of this translation depends heavily on two things: how well the LLM knows the target language, and how much schema context it receives.

OpenTelemetry plays a specific role here. Because OTel-instrumented data follows published semantic conventions (http.response.status_code, service.name, db.system, span.kind, and hundreds of other standardized attribute names), the LLM has a predictable vocabulary to reason over. It can be expected to know that http.response.status_code is an integer attribute on HTTP spans, and that a span.kind of server marks the server side of a request, which is typically a service’s entry-point span. When Elastic, Grafana, and others tell the LLM what field names exist in your index or data source, they are leaning on the fact that OTel-instrumented data is structured and labeled consistently.

Without that consistency, the translation breaks down quickly. If your logs are semi-structured with inconsistent field names across services, the LLM has to guess, and its guesses tend to be plausible but wrong.

The schema injection approach varies by platform. Elastic’s AI Assistant uses Retrieval Augmented Generation (RAG) against your index mappings: it fetches the relevant field definitions before passing them to the model. Grafana Assistant uses “careful context selection” to reduce ambiguity. Datadog’s natural language layer reasons over your infrastructure’s tag taxonomy. None of these approaches fully escapes the underlying problem: the LLM has to work with what it is given, and what it is given is a compressed, sometimes incomplete description of your data.

Where It Genuinely Helps During Incidents

The most honest answer is: it helps people who would otherwise write nothing at all.

If you are an on-call engineer who knows your product domain well but is not fluent in PromQL, the difference between “I have to ask someone who knows PromQL” and “I can type a question and get a starting point” is real. Elastic’s own research on this notes that queries that “took minutes of expert work” can become accessible to engineers without deep DSL knowledge. The benefit is real: reducing the friction between a question and a first query is valuable during an incident, where every minute of confusion is costly.

The second area where AI assistance helps is correlation. Manual cross-signal investigation — starting with a spike in a latency metric, then pivoting to traces, then pivoting to logs — involves multiple query rewrites across different syntaxes. An AI assistant that can hold that context across a conversation and rewrite each query as you narrow the scope reduces cognitive overhead even for engineers who know the query languages.

Grafana’s implementation is explicit about this: the Assistant can correlate across data sources in a single conversation, generating PromQL for Prometheus, LogQL for Loki, and TraceQL for Tempo without requiring you to manually switch contexts. That multi-language fluency is harder for a human to maintain under pressure.

The Failure Modes You Should Understand

The canonical failure mode here is a query that is syntactically valid, executes without errors, and returns results — but answers a subtly different question than the one you asked. This is more dangerous than a query that throws a parse error.

Consider a natural language question like “show me all errors from the payment service in the last hour.” An LLM-generated LogQL query might filter on level="error" but miss application-layer errors logged at level="info" with an error field set to true — a pattern common in services where the log framework and the error taxonomy were built separately. The query runs. You see fewer errors than actually occurred. You close the incident. The actual errors were there; the query just did not find them.
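The gap described above can be reproduced on a handful of records. This is an illustrative sketch only: the log records, field names, and both filter functions are made up, and the LogQL in the comments assumes a service_name label and JSON-formatted log lines.

```python
# Hypothetical structured log records from the last hour.
logs = [
    {"service": "payment", "level": "error", "msg": "card processor unreachable"},
    {"service": "payment", "level": "info", "error": True, "msg": "charge failed: db timeout"},
    {"service": "payment", "level": "info", "msg": "charge ok"},
    {"service": "cart", "level": "error", "msg": "cache miss storm"},
]


def narrow_filter(rec: dict) -> bool:
    # Roughly: {service_name="payment"} | json | level="error"
    return rec["service"] == "payment" and rec["level"] == "error"


def broad_filter(rec: dict) -> bool:
    # Roughly: {service_name="payment"} | json | level="error" or error="true"
    return rec["service"] == "payment" and (
        rec["level"] == "error" or rec.get("error") is True
    )


narrow = [r for r in logs if narrow_filter(r)]
broad = [r for r in logs if broad_filter(r)]
missed = [r for r in broad if r not in narrow]

print(len(narrow), "errors found by the generated query")
print(len(missed), "application-layer errors it missed")
for rec in missed:
    print("missed:", rec["msg"])
```

Both filters are valid and both return results. Only someone who knows how the payment service logs failures can tell which one matches the question.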

Elastic’s 2026 observability trends report is candid about this: 53% of organizations cite hallucinations as a concern with GenAI for observability, specifically the risk of AI generating “confident nonsense” that worsens incidents if acted upon without human verification. That number is notable because it comes from practitioners who are already using these tools.

A second failure mode is schema drift. If your OTel instrumentation is inconsistent (some services emit http.status_code, others use http.response.status_code, because semantic convention versions changed between when different teams instrumented their services), the LLM may generate a query using one field name and miss data from services using the other. The LLM cannot know about your team’s instrumentation inconsistencies unless you tell it, and you should not assume the platform will surface them automatically.
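A small sketch of how the two attribute names split the data, and of one way to guard against it by checking both. The span records and the helper names (status_code, server_errors) are hypothetical; only the two attribute names come from the text above.

```python
from typing import Iterable, List, Optional

# Newer name first, older name second.
STATUS_FIELDS = ("http.response.status_code", "http.status_code")

# Hypothetical span attributes from three services instrumented at different times.
spans = [
    {"service.name": "checkout", "http.response.status_code": 500},
    {"service.name": "payment", "http.status_code": 503},
    {"service.name": "cart", "http.response.status_code": 200},
]


def status_code(attrs: dict, fields: Iterable[str]) -> Optional[int]:
    """Return the first status code found under any of the given field names."""
    for name in fields:
        if name in attrs:
            return int(attrs[name])
    return None


def server_errors(records: List[dict], fields: Iterable[str]) -> List[str]:
    """Names of services with a 5xx status, looking only at the given fields."""
    fields = tuple(fields)
    found = []
    for attrs in records:
        code = status_code(attrs, fields)
        if code is not None and code >= 500:
            found.append(attrs["service.name"])
    return found


# A generated query that only knows the newer field name misses "payment".
print(server_errors(spans, ["http.response.status_code"]))
# Checking both names recovers it.
print(server_errors(spans, STATUS_FIELDS))
```

The same idea applies in a query language: if you know two names are in use, say so in the prompt or add the second field to the generated filter by hand.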

A third failure mode is over-reliance on the generated summary rather than the raw result. When the platform returns 10,000 matching log lines and the AI assistant summarizes them as “the payment service had intermittent errors related to database timeouts,” you are reading a compression of the data, not the data. Summaries lose outliers. They can emphasize the most common pattern and hide a rare but critical event. If a summary does not mention an event, that is not evidence the event did not happen.
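To make the compression concrete, here is a sketch with invented data: 10,000 log messages where one pattern dominates. The message texts and the rarity threshold of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical result set: one dominant pattern and two rare ones.
messages = (
    ["database timeout"] * 9990
    + ["connection reset"] * 9
    + ["double charge detected"] * 1
)

counts = Counter(messages)

# What a summary tends to surface: the most common pattern.
dominant = counts.most_common(1)

# What it can hide: patterns that occur a handful of times.
RARE_THRESHOLD = 10  # arbitrary cut-off for this example
rare = sorted(msg for msg, n in counts.items() if n <= RARE_THRESHOLD)

print("total lines:", sum(counts.values()))
print("dominant:", dominant)
print("rare:", rare)
```

A summary that reports only the dominant pattern is accurate and still leaves out the single line that may matter most. Listing the low-count groups from the raw result is a cheap check to run next to any generated summary.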

Understanding Your Telemetry Still Matters

There is a tempting narrative that natural language query removes the need to understand your data. It does not. You still need to know what fields exist on your spans, what values are plausible, how your services instrument errors, and what the typical distributions look like. Without that knowledge, you cannot evaluate whether a generated query is asking the right question. The LLM is fluent in syntax; you have to be fluent in semantics.

OpenTelemetry helps here, but only up to a point. The semantic conventions standardize attribute names, not values, and not the decision about what your application logs. A service that emits span.kind=server and a 200 status code on every request — because the team added try/catch blocks that swallow exceptions — looks healthy in any query language. The data is what it is; the query layer sits on top of it.
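The swallowed-exception case can be sketched without any tracing library. FakeSpan below is a stand-in written for this example and is not the OpenTelemetry API; charge and both handlers are hypothetical too. The exception.message key mirrors the attribute name used in the OTel semantic conventions for exception events.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSpan:
    """Stand-in for a tracing span. NOT the OpenTelemetry API."""
    name: str
    status: str = "OK"
    events: List[dict] = field(default_factory=list)


def charge(amount: int) -> str:
    if amount <= 0:
        raise ValueError("amount must be positive")
    return "charged"


def handle_swallowing(amount: int) -> FakeSpan:
    span = FakeSpan("POST /charge")
    try:
        charge(amount)
    except ValueError:
        pass  # exception swallowed: the span still reports OK
    return span


def handle_recording(amount: int) -> FakeSpan:
    span = FakeSpan("POST /charge")
    try:
        charge(amount)
    except ValueError as exc:
        # Make the failure visible in the telemetry itself.
        span.status = "ERROR"
        span.events.append({"name": "exception", "exception.message": str(exc)})
    return span


print(handle_swallowing(-1).status)
print(handle_recording(-1).status)
```

No query, generated or hand-written, can find the failure in the first handler’s output, because the telemetry never recorded it.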

The teams that get the most from AI-assisted querying are the ones who have invested in clean instrumentation: consistent field names, meaningful service names, structured log events with explicit error fields, and span attributes that map to actual business operations. In that environment, the natural language translation layer adds real speed. In a poorly instrumented environment, it adds speed to the process of generating wrong answers.

What This Looks Like in Practice

The platforms converging on this capability — Elastic, Grafana, Datadog — have each made somewhat different implementation choices. Elastic’s AI Assistant is tightly integrated with ES|QL and Kibana, and uses your actual index mappings as context. Grafana Assistant remains in limited preview and is explicit that accuracy is still a primary development focus. Datadog’s Bits AI extends beyond query generation into agentic workflows — it can investigate an alert, correlate signals, and propose a fix — but those capabilities carry proportionally higher risk of acting on a wrong conclusion.

What they share: they all expose the generated query. They all expect human validation before action. And they all work better on OTel-instrumented data than on ad hoc log formats, because the standardized schema gives the LLM a reliable map.

OTel’s own production adoption is accelerating: one industry survey found that production usage nearly doubled year over year in 2025. The broader the OTel footprint in your stack, the more schema context these tools can work with, and the better the translation quality is likely to be.

If you are evaluating whether to adopt AI-assisted querying for your team, the useful question is not “does it work” — it works often enough to be worth using. The useful question is “what process do I have for verifying what it produces?” Treating generated queries as hypotheses you confirm, rather than answers you act on, is the practice that separates teams that benefit from these tools from teams that get burned by them.

FAQ

Does natural language querying work without OpenTelemetry instrumentation?

It works, but less reliably. OTel semantic conventions give the LLM a standardized vocabulary of field names and types. With ad hoc log formats or inconsistent field naming, the LLM has to guess at schema details and is more likely to produce queries that look correct but filter the wrong data.

What is the safest way to use AI-generated queries during an active incident?

Always read the generated query before treating its output as authoritative. Confirm that the time range, service filter, and field names match your intent. Use the raw result alongside any AI-generated summary, especially when looking for rare or anomalous events that a summary might compress away.
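Part of that check can be written down as a crude pre-flight test. The sketch below only does substring matching; it is not a PromQL parser, and the metric name, label names, and the audit_query helper are all hypothetical. It supplements reading the query and does not replace it.

```python
from typing import Dict, List


def audit_query(query: str, expected: Dict[str, str]) -> List[str]:
    """Return the names of expectations whose text does not appear in the query."""
    return [name for name, fragment in expected.items() if fragment not in query]


# A generated query (hypothetical metric and label names).
generated = (
    'sum(rate(http_server_request_duration_seconds_count'
    '{service_name="payments", http_response_status_code=~"5.."}[5m]))'
)

# What the question actually asked for.
expected = {
    "service filter": 'service_name="payment"',
    "time window": "[1h]",
    "status field": "http_response_status_code",
}

problems = audit_query(generated, expected)
print("mismatches:", problems)
```

Here the query would run and return data, but it targets a differently named service and a five-minute window instead of the last hour. Both mismatches are easy to overlook when the result looks plausible.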

Will AI assistants replace the need to learn PromQL, LogQL, or ES|QL?

Not in the near term, and probably not in principle. You need to understand the target query language well enough to audit what the LLM produces. Engineers who know the syntax can catch subtle errors — wrong label selectors, off-by-one time windows, missing cardinality constraints — that engineers relying purely on the natural language interface will miss.
