Judea Pearl's Ladder of Causation and the Limits of LLM Reasoning
Judea Pearl's three-rung causal hierarchy — association, intervention, counterfactual — explains why data-driven ML and LLMs hit a structural wall at causal reasoning, and what that means for agents and RAG.
You ask an LLM-powered agent to fix a flaky test. It reads the stack trace, notices the failure happens right after a database call, and patches it with a retry. The test still fails. The model saw a correlation — failure near a database call — and never checked whether that call caused the failure. That gap has a precise name. Judea Pearl, who won the 2011 Turing Award for formalizing probabilistic and causal reasoning, would say the agent never left the bottom rung of the Ladder of Causation.
This isn’t a prompt-engineering problem you can patch away. It’s a statement about what data-driven systems can and cannot compute — and it explains a lot of what you see go wrong with LLM tools.
The three rungs of the ladder
Pearl’s causal hierarchy, laid out for a general audience in The Book of Why (2018, written with Dana Mackenzie), sorts the questions you can ask about a system into three rungs, and each rung needs information the one below it cannot supply.
Rung 1 is association. “What does seeing X tell me about Y?” Written formally, it is the conditional probability P(Y | X). Correlation, pattern recognition, curve fitting, ordinary supervised learning, and next-token prediction all live here. Example: users who open the billing page churn at a higher rate.
Rung 2 is intervention. “What happens to Y if I do X?” Pearl gives this its own notation, P(Y | do(X)) — the do-operator — because acting is not the same as observing. Example: if we redesign the billing page, does churn drop? The Rung 1 correlation cannot tell you. Maybe confused users both visit billing and churn, and the page itself changes nothing.
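To make the gap between seeing and doing concrete, here is a minimal sketch of the billing-page example as a toy causal model, computed exactly by enumeration. Every number and name in it is a made-up assumption for illustration: a hidden "confused" trait drives both billing visits and churn, and the page itself has no effect.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_CONFUSED = 0.3
P_VISIT_GIVEN_CONFUSED = {True: 0.8, False: 0.2}


def p_churn(confused, visit):
    """Assumed mechanism: churn depends on confusion only; the visit is ignored."""
    return 0.5 if confused else 0.1


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given boolean value."""
    return p if value else 1.0 - p


def joint(confused, visit, churn):
    """Observational joint probability P(confused, visit, churn)."""
    return (bern(P_CONFUSED, confused)
            * bern(P_VISIT_GIVEN_CONFUSED[confused], visit)
            * bern(p_churn(confused, visit), churn))


def p_churn_given_visit(visit):
    """Rung 1: P(churn | visit), read off the observational distribution."""
    num = sum(joint(c, visit, True) for c in (True, False))
    den = sum(joint(c, visit, ch) for c, ch in product((True, False), repeat=2))
    return num / den


def p_churn_do_visit(visit):
    """Rung 2: P(churn | do(visit)). Forcing the visit cuts the
    confused -> visit arrow, so confusion keeps its prior distribution."""
    return sum(bern(P_CONFUSED, c) * p_churn(c, visit) for c in (True, False))


print(f"P(churn | visit)     = {p_churn_given_visit(True):.3f}")
print(f"P(churn | no visit)  = {p_churn_given_visit(False):.3f}")
print(f"P(churn | do(visit)) = {p_churn_do_visit(True):.3f}")
print(f"P(churn | do(no visit)) = {p_churn_do_visit(False):.3f}")
```

In this toy world the observational churn rate is much higher among billing visitors, yet forcing or preventing the visit leaves churn unchanged. The do-version can only be computed because the code knows the wiring; the observational table alone does not contain it.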
Rung 3 is counterfactual. “Would this specific user have churned if they had not hit the broken page — given that they did hit it, and did churn?” This is reasoning about alternatives to events that already happened. It is what you do every time you say “that bug would not have shipped if we’d had a test for it.”
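A counterfactual is usually computed in three steps: abduction (infer the hidden background factors from what actually happened), action (change the input), and prediction (re-run the mechanism). The sketch below applies those steps to the churn example. The structural equation, the constants, and the idea that you can observe a "frustration" score are all assumptions invented for this illustration.

```python
PAGE_EFFECT = 3.0      # assumed frustration added by hitting the broken page
CHURN_THRESHOLD = 5.0  # assumed frustration level at which a user churns


def frustration(u, hit_broken_page):
    """Assumed structural equation: baseline frustration u plus the page effect."""
    return u + PAGE_EFFECT * hit_broken_page


def churns(u, hit_broken_page):
    return frustration(u, hit_broken_page) >= CHURN_THRESHOLD


def counterfactual_churn(observed_frustration, observed_hit, alt_hit):
    """Would this same user have churned under alt_hit instead of observed_hit?"""
    # 1. Abduction: recover this user's baseline from what actually happened.
    u = observed_frustration - PAGE_EFFECT * observed_hit
    # 2. Action: replace the observed input with the alternative one.
    # 3. Prediction: re-run the same mechanism with the same baseline.
    return churns(u, alt_hit)


# A user who hit the broken page (1), reached frustration 6.0, and churned:
print(counterfactual_churn(6.0, observed_hit=1, alt_hit=0))
# A user who was already fed up (frustration 9.0 after hitting the page):
print(counterfactual_churn(9.0, observed_hit=1, alt_hit=0))
```

The first user would have stayed without the broken page; the second would have churned anyway. Both answers depend entirely on the assumed mechanism, which is exactly the information Rung 1 and Rung 2 data do not carry.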
The rungs are ordered for a reason. The Causal Hierarchy Theorem — formalized by Elias Bareinboim and colleagues building on Pearl’s work — makes the separation rigorous: in general, data from a lower rung cannot answer a question on a higher rung. No amount of Rung 1 observation settles a Rung 2 question on its own.
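The phrase "on its own" matters. As a standard illustration (not something specific to this article's examples), suppose you are willing to assume that a set of observed variables Z accounts for all common causes of X and Y, which is Pearl's back-door criterion. Under that assumption, the interventional quantity can be rewritten in terms of observational ones:

```latex
% Rung 1: conditioning weights each stratum z by how often it co-occurs with x
P(Y = y \mid X = x) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z \mid X = x)

% Rung 2, valid only if Z satisfies the back-door criterion relative to (X, Y):
% intervening weights each stratum z by its overall frequency
P(Y = y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
```

The two expressions differ only in the weighting term, and the data cannot tell you which Z makes the second line valid. That choice is the causal assumption you bring from outside.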
Why more data does not climb the ladder
The part developers miss is that this is a structural limit, not a sample-size limit. More rows do not help.
Here is the intuition. Two completely different causal worlds can produce the exact same observational distribution. Pearl’s stock example: a rooster crows every morning before sunrise. The data — crow, then sun, every single day, for years — is equally consistent with “the rooster causes the sunrise” and “the sunrise causes the crow.” To pick the right one you need an assumption that does not come from the data: knowledge about how the world is actually wired. Strip that away and the dataset is mute, no matter how large it gets.
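The rooster example can be sketched as two tiny causal models with opposite wiring. This is a toy under stated assumptions: each time slot is either "morning" or not with probability 1/2, and the function names and the `do_crow` parameter are hypothetical. Exact fractions are used so the two distributions can be compared for equality.

```python
from collections import Counter
from fractions import Fraction

HALF = Fraction(1, 2)


def world_a(do_crow=None):
    """Crow causes sunrise: crow ~ Bernoulli(1/2), sunrise := crow.
    Returns the distribution over (crow, sunrise) pairs."""
    dist = Counter()
    for crow, p in ((0, HALF), (1, HALF)):
        if do_crow is not None:
            crow = do_crow          # intervention overrides the rooster
        sunrise = crow              # the sun follows the crow
        dist[(crow, sunrise)] += p
    return dict(dist)


def world_b(do_crow=None):
    """Sunrise causes crow: sunrise ~ Bernoulli(1/2), crow := sunrise.
    Returns the distribution over (crow, sunrise) pairs."""
    dist = Counter()
    for sunrise, p in ((0, HALF), (1, HALF)):
        # Without intervention the rooster follows the sun.
        crow = sunrise if do_crow is None else do_crow
        dist[(crow, sunrise)] += p
    return dict(dist)


print("observed A:", world_a())
print("observed B:", world_b())
print("do(crow=1) in A:", world_a(do_crow=1))
print("do(crow=1) in B:", world_b(do_crow=1))
```

Both worlds produce the same observational table, so no number of logged mornings separates them. They come apart only when you force the rooster to crow: in world A the sun always follows, in world B it rises half the time regardless.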
This is the part that matters for your stack. Retrieval-augmented generation adds more Rung 1 evidence. It can genuinely cut hallucinations that come from missing facts — if the model never saw that your API returns a 429 under load, putting that in context fixes it. What retrieval does not do is hand the model a do-operator. You can index every incident postmortem your company has ever written, and the model still cannot compute what would happen if you changed the retry policy — unless something in that text already spells out the causal structure for it.
What this means for your LLM tools and agents
A language model trained to predict the next token is modeling P(text): Rung 1, scaled to a size no statistician of Pearl’s generation imagined. It does Rung 1 work genuinely well. The trouble starts when a prompt looks like a causal or counterfactual question. The model does not run a causal computation; it reproduces the text patterns most associated with questions of that shape.
Sometimes that works. If the training corpus contains enough worked causal reasoning about a topic, as it does for well-trodden topics, the pattern-match lands on a correct answer, and it looks like reasoning. It breaks down when the situation has no close textual precedent: a novel system, your particular codebase, a chain of two or three interventions stacked on each other. That profile plausibly accounts for many of the hallucinations you see in production. The model is not lying. It is doing Rung 1 work on a Rung 2 question, and presenting the result with the same fluency either way.
Agents make the gap sharper. An agent acting in the world asks a Rung 2 question at every step — “if I run this command, what state results?” An agent whose only signal is logs of past runs has Rung 1 data about those runs. It performs well when the new situation matches the distribution it has seen, and degrades, often silently, when it does not. “Works in the demo, fails in production” is frequently this exact mismatch.
The practical move is to stop asking these tools to climb a rung they cannot, and to use them hard where Rung 1 is the job: autocomplete, boilerplate, format translation, summarizing a diff, surfacing a pattern across files. Then supply the causal model yourself — explicit constraints in the prompt, tests that encode your cause-and-effect expectations, and review that checks the reasoning rather than only the output.
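One way to encode a cause-and-effect expectation in a test is to intervene on the suspected cause and check whether the failure survives. The sketch below revisits the flaky-test scenario from the introduction. `parse_total`, its bug, and the helper name are hypothetical, invented for this illustration: the suspected database call is replaced with a stub that always succeeds.

```python
def parse_total(fetch_row):
    """Hypothetical code under test: fetch a row, then parse its total."""
    row = fetch_row()
    # The real bug in this toy: int() rejects values like "12.50".
    return int(row["total"])


def failure_survives_intervention(code, stubbed_dependency):
    """Rung 2 check: force the suspected cause to behave, then see
    whether the failure is still there."""
    try:
        code(stubbed_dependency)
    except Exception:
        return True
    return False


def healthy_db_decimal():
    """Stub for a database call that always succeeds and returns a decimal total."""
    return {"total": "12.50"}


def healthy_db_integer():
    """Stub for a database call that always succeeds and returns an integer total."""
    return {"total": "12"}


# The database call is forced to succeed and the failure is still there,
# so the call did not cause it and a retry cannot fix it.
print(failure_survives_intervention(parse_total, healthy_db_decimal))
# With the same healthy call and different data, the failure disappears.
print(failure_survives_intervention(parse_total, healthy_db_integer))
```

The design choice is that the test performs the intervention itself rather than asking the model to guess its outcome. The stack trace only shows that the failure sits near the database call; the stubbed run shows whether the call matters.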
Cursor
An LLM-native code editor that is strong at exactly the Rung 1 work — completion, refactors, codebase-aware edits — where pattern prediction shines. Treat its output as fast drafts to verify, not causal conclusions to trust.
Free tier available; Pro at $20/month
Affiliate link · We earn a commission at no cost to you.
None of this is an argument against AI tooling. It is an argument for matching the tool to the rung. Pearl’s hierarchy gives you a fast check before you delegate a task: am I asking this model to recognize a pattern, or to reason about what a change would cause? The first is what it was built for. The second is still on you.
FAQ
Does this mean LLMs can never reason about cause?
Is a causal model just a knowledge graph?
Should I add a causal inference library to my stack?