Judea Pearl's Ladder of Causation and the Limits of LLM Reasoning

You ask an LLM-powered agent to fix a flaky test. It reads the stack trace, notices the failure happens right after a database call, and patches it with a retry. The test still fails. The model saw a correlation — failure near a database call — and never checked whether that call caused the failure. That gap has a precise name. Judea Pearl, who won the 2011 Turing Award for formalizing probabilistic and causal reasoning, would say the agent never left the bottom rung of the Ladder of Causation.

This isn’t a prompt-engineering problem you can patch away. It’s a statement about what data-driven systems can and cannot compute — and it explains a lot of what you see go wrong with LLM tools.

The three rungs of the ladder

Pearl’s causal hierarchy, laid out for a general audience in The Book of Why (2018, co-written with Dana Mackenzie), sorts causal questions into three rungs, and each rung needs information the one below it cannot supply.

Rung 1 is association. “What does seeing X tell me about Y?” Written formally, it is the conditional probability P(Y | X). Correlation, pattern recognition, curve fitting, ordinary supervised learning, and next-token prediction all live here. Example: users who open the billing page churn at a higher rate.

Rung 2 is intervention. “What happens to Y if I do X?” Pearl gives this its own notation, P(Y | do(X)) — the do-operator — because acting is not the same as observing. Example: if we redesign the billing page, does churn drop? The Rung 1 correlation cannot tell you. Maybe confused users both visit billing and churn, and the page itself changes nothing.
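The gap between seeing and doing can be made concrete with a toy model of the billing example. The sketch below assumes a hidden "confused" trait that drives both billing visits and churn, while the visit itself has no effect on churn. Every probability and name here is hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_CONFUSED = 0.3
P_VISIT_GIVEN_CONFUSED = {True: 0.8, False: 0.2}
# Churn depends on confusion only; the billing visit has no causal effect.
P_CHURN_GIVEN_CONFUSED = {True: 0.5, False: 0.1}


def joint(do_visit=None):
    """Yield (confused, visit, churn, probability) for every outcome.

    If do_visit is set, the visit is forced to that value, which cuts
    the confused -> visit arrow (a do-intervention).
    """
    for confused, visit, churn in product([True, False], repeat=3):
        p = P_CONFUSED if confused else 1 - P_CONFUSED
        if do_visit is None:
            p_visit = P_VISIT_GIVEN_CONFUSED[confused]
            p *= p_visit if visit else 1 - p_visit
        else:
            p *= 1.0 if visit == do_visit else 0.0
        p_churn_c = P_CHURN_GIVEN_CONFUSED[confused]
        p *= p_churn_c if churn else 1 - p_churn_c
        yield confused, visit, churn, p


def p_churn(visit, do=False):
    """P(churn | visit) when do=False, P(churn | do(visit)) when do=True."""
    rows = list(joint(do_visit=visit if do else None))
    num = sum(p for _, v, y, p in rows if v == visit and y)
    den = sum(p for _, v, _, p in rows if v == visit)
    return num / den


print("seeing:", round(p_churn(True), 3), "vs", round(p_churn(False), 3))
print("doing: ", round(p_churn(True, do=True), 3), "vs",
      round(p_churn(False, do=True), 3))
```

Under these assumed numbers, visitors churn at roughly 0.35 against 0.14 for non-visitors, yet forcing or blocking the visit leaves churn at 0.22 either way. The correlation is real, and it still says nothing about what a redesign would do.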

Rung 3 is counterfactual. “Would this specific user have churned if they had not hit the broken page — given that they did hit it, and did churn?” This is reasoning about alternatives to events that already happened. It is what you do every time you say “that bug would not have shipped if we’d had a test for it.”
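A counterfactual query is usually computed in three steps: abduction (infer the hidden background factors from what was observed), action (change the variable in question), and prediction (replay the model). The sketch below applies those steps to a made-up structural model of the churn example. The mechanism, the two hidden traits, and their probabilities are all assumptions, and the model further assumes that hitting the broken page is independent of those traits.

```python
from itertools import product

# Hypothetical priors over two hidden user traits.
P_FRAGILE = 0.6   # would churn if they hit the broken page
P_LEAVING = 0.2   # would churn no matter what


def churn(hit_broken_page, fragile, leaving):
    """Assumed structural equation for churn."""
    return (hit_broken_page and fragile) or leaving


def counterfactual_churn(observed_hit, observed_churn, alt_hit):
    """P(churn had hit been alt_hit | observed hit and observed churn)."""
    num = den = 0.0
    for fragile, leaving in product([True, False], repeat=2):
        prior = ((P_FRAGILE if fragile else 1 - P_FRAGILE)
                 * (P_LEAVING if leaving else 1 - P_LEAVING))
        # Abduction: keep only hidden traits consistent with what happened.
        if churn(observed_hit, fragile, leaving) != observed_churn:
            continue
        den += prior
        # Action + prediction: replay the same user with the page changed.
        if churn(alt_hit, fragile, leaving):
            num += prior
    return num / den


# The user hit the broken page and churned. Would they have churned anyway?
print(round(counterfactual_churn(True, True, alt_hit=False), 3))
```

In this toy model the answer is 0.2 / 0.68, about 0.29, which is higher than the 0.2 churn rate you would get by shielding a random user from the page. The observed outcome carries information about this specific user, and that is what separates Rung 3 from Rung 2.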

The rungs are ordered for a reason. The Causal Hierarchy Theorem — formalized by Elias Bareinboim and colleagues building on Pearl’s work — makes the separation rigorous: in general, data from a lower rung cannot answer a question on a higher rung. No amount of Rung 1 observation settles a Rung 2 question on its own.

Why more data does not climb the ladder

The part developers miss is that this is a structural limit, not a sample-size limit. More rows do not help.

Here is the intuition. Two completely different causal worlds can produce the exact same observational distribution. Pearl’s stock example: a rooster crows every morning before sunrise. The data (crow, then sun, every single day, for years) is equally consistent with “the crow causes the sunrise” and “the approaching dawn causes the crow.” To pick the right one you need an assumption that does not come from the data: knowledge about how the world is actually wired. Strip that away and the dataset is mute, no matter how large it gets.
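The rooster argument can be checked by brute force. The sketch below builds two toy worlds, one where the crow drives the sun and one where the sun drives the crow, and compares them first under observation and then under a forced crow. The 0.9 agreement rate and all function names are hypothetical.

```python
AGREE = 0.9  # hypothetical: how often the effect matches its cause


def world_a(do_crow=None):
    """World A: crow -> sun. Returns {(crow, sun): probability}."""
    dist = {}
    for crow in (0, 1):
        p_crow = 0.5 if do_crow is None else float(crow == do_crow)
        for sun in (0, 1):
            p_sun_given_crow = AGREE if sun == crow else 1 - AGREE
            dist[(crow, sun)] = p_crow * p_sun_given_crow
    return dist


def world_b(do_crow=None):
    """World B: sun -> crow. Returns {(crow, sun): probability}."""
    dist = {}
    for sun in (0, 1):
        p_sun_prior = 0.5
        for crow in (0, 1):
            if do_crow is None:
                p_crow = AGREE if crow == sun else 1 - AGREE
            else:
                # Forcing the crow cuts the sun -> crow arrow.
                p_crow = float(crow == do_crow)
            dist[(crow, sun)] = p_crow * p_sun_prior
    return dist


def p_sun(dist):
    """Marginal probability that the sun comes up."""
    return sum(p for (_, sun), p in dist.items() if sun == 1)


obs_a, obs_b = world_a(), world_b()
same = all(abs(obs_a[k] - obs_b[k]) < 1e-12 for k in obs_a)
print("observational tables identical:", same)
print("P(sun | do(crow=1)) in A:", p_sun(world_a(do_crow=1)))
print("P(sun | do(crow=1)) in B:", p_sun(world_b(do_crow=1)))
```

Both worlds produce the same table of observations, so no number of extra rows drawn from that table can separate them. They only come apart when you make the rooster crow yourself, which is exactly the information a passive dataset lacks.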

This is the part that matters for your stack. Retrieval-augmented generation adds more Rung 1 evidence. It can genuinely cut hallucinations that come from missing facts — if the model never saw that your API returns a 429 under load, putting that in context fixes it. What retrieval does not do is hand the model a do-operator. You can index every incident postmortem your company has ever written, and the model still cannot compute what would happen if you changed the retry policy — unless something in that text already spells out the causal structure for it.

What this means for your LLM tools and agents

A language model trained to predict the next token is modeling P(next token | preceding text), which is Rung 1, scaled to a size no statistician of Pearl’s generation imagined. It does Rung 1 work genuinely well. The trouble starts when a prompt poses a causal or counterfactual question. The model does not run a causal computation; it generates the continuation most strongly associated with questions of that shape.

Sometimes that works. If the training corpus contains enough worked causal reasoning about a topic, and for well-trodden topics it does, the pattern-match lands on a correct answer, and it looks like reasoning. It breaks down when the situation has no close textual precedent: a novel system, your particular codebase, a chain of two or three interventions stacked on each other. Many production hallucinations fit that profile. The model is not lying. It is doing Rung 1 work on a Rung 2 question, and presenting the result with the same fluency either way.

Agents make the gap sharper. An agent acting in the world asks a Rung 2 question at every step — “if I run this command, what state results?” An agent whose only signal is logs of past runs has Rung 1 data about those runs. It performs well when the new situation matches the distribution it has seen, and degrades, often silently, when it does not. “Works in the demo, fails in production” is frequently this exact mismatch.

The practical move is to stop asking these tools to climb a rung they cannot, and to use them hard where Rung 1 is the job: autocomplete, boilerplate, format translation, summarizing a diff, surfacing a pattern across files. Then supply the causal model yourself — explicit constraints in the prompt, tests that encode your cause-and-effect expectations, and review that checks the reasoning rather than only the output.
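One way to encode a cause-and-effect expectation in a test is to intervene on a single factor while holding the rest fixed. The sketch below revisits the flaky test from the opening. The function under test, the hidden hour-based bug, and the stubs are all invented for illustration, not taken from any real codebase.

```python
def process_order(fetch_order, now_hour):
    """Hypothetical system under test.

    The failure shows up right after the fetch, but the real cause
    is the hour check, not the database call.
    """
    order = fetch_order()
    if now_hour == 23:
        raise RuntimeError("date rollover bug")
    return order["total"]


def fails(fetch_order, now_hour):
    """Run once and report whether the run failed."""
    try:
        process_order(fetch_order, now_hour)
        return False
    except RuntimeError:
        return True


def real_db():
    # Stand-in for the real database call.
    return {"total": 10}


def stub_db():
    # Intervention: replace the suspected cause with a trivial stub.
    return {"total": 10}


# Baseline: the failure reproduces late at night.
assert fails(real_db, now_hour=23)

# do(db := stub), hour held fixed: still fails, so the database
# call is not the cause and a retry around it cannot help.
assert fails(stub_db, now_hour=23)

# do(hour := 12), database held fixed: passes, so the hour is the cause.
assert not fails(real_db, now_hour=12)

print("intervention checks passed: the hour, not the database, drives the failure")
```

Each assertion changes one factor and nothing else, which turns a correlation in the stack trace into a small experiment. An agent given these checks has something causal to work from; an agent given only the trace does not.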

Cursor

An LLM-native code editor that is strong at exactly the Rung 1 work — completion, refactors, codebase-aware edits — where pattern prediction shines. Treat its output as fast drafts to verify, not causal conclusions to trust.

Free tier available; Pro at $20/month

Affiliate link · We earn a commission at no cost to you.

None of this is an argument against AI tooling. It is an argument for matching the tool to the rung. Pearl’s hierarchy gives you a fast check before you delegate a task: am I asking this model to recognize a pattern, or to reason about what a change would cause? The first is what it was built for. The second is still on you.

FAQ

Does this mean LLMs can never reason about cause?

Not quite. They can produce correct causal answers when the reasoning is well represented in their training text, and giving them explicit causal assumptions — a stated structure, a causal diagram, a tool call to a real model — helps. Pearl's claim is narrower and sharper: the causal assumptions have to come from somewhere outside the data, because observation alone underdetermines them.

Is a causal model just a knowledge graph?

No. A knowledge graph stores relationships between entities. A causal model additionally specifies which variables cause which, and supports the do-operator and counterfactual queries. They overlap, but a knowledge graph on its own does not let you compute the effect of an intervention.

Should I add a causal inference library to my stack?

Only if you are making real decisions under intervention — A/B test analysis, treatment effects, policy choices. For most application code the takeaway is lighter: identify which rung your problem sits on, and do not expect a correlational tool to answer an interventional question.
