Judea Pearl's Ladder of Causation and the Limits of LLM Reasoning
Judea Pearl's three-rung causal hierarchy — association, intervention, counterfactual — explains why data-driven ML and LLMs hit a structural wall at causal reasoning, and what that means for agents and RAG.
You ask an LLM-powered agent to fix a flaky test. It reads the stack trace, notices the failure happens right after a database call, and patches it with a retry. The test still fails. The model saw a correlation — failure near a database call — and never checked whether that call caused the failure. That gap has a precise name. Judea Pearl, who won the 2011 Turing Award for formalizing probabilistic and causal reasoning, would say the agent never left the bottom rung of the Ladder of Causation.
This isn’t a prompt-engineering problem you can patch away. It’s a statement about what data-driven systems can and cannot compute — and it explains a lot of what you see go wrong with LLM tools.
The three rungs of the ladder
Pearl’s causal hierarchy — laid out for a general audience in his 2018 book The Book of Why — sorts every question you can ask into three rungs, and each rung needs information the one below it cannot supply.
Rung 1 is association. “What does seeing X tell me about Y?” Written formally, it is the conditional probability P(Y | X). Correlation, pattern recognition, curve fitting, ordinary supervised learning, and next-token prediction all live here. Example: users who open the billing page churn at a higher rate.
Rung 2 is intervention. “What happens to Y if I do X?” Pearl gives this its own notation, P(Y | do(X)) — the do-operator — because acting is not the same as observing. Example: if we redesign the billing page, does churn drop? The Rung 1 correlation cannot tell you. Maybe confused users both visit billing and churn, and the page itself changes nothing.
Rung 3 is counterfactual. “Would this specific user have churned if they had not hit the broken page — given that they did hit it, and did churn?” This is reasoning about alternatives to events that already happened. It is what you do every time you say “that bug would not have shipped if we’d had a test for it.”
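A counterfactual query can be made concrete with a toy structural model. The sketch below is illustrative only: the mechanism in `churn`, the hidden traits `sensitive` and `leaving`, and both probabilities are assumptions invented for this example, not anything measured. It follows the abduction, action, prediction recipe Pearl describes for counterfactuals.

```python
from itertools import product

# Hypothetical priors over hidden user traits (assumed for illustration).
P_SENSITIVE = 0.5  # would leave only if they hit the broken page
P_LEAVING = 0.2    # would leave no matter what


def churn(hit_broken_page, sensitive, leaving):
    """Assumed structural equation: the mechanism that produces churn."""
    return leaving or (hit_broken_page and sensitive)


def prior(sensitive, leaving):
    """Prior probability of one combination of hidden traits."""
    p_s = P_SENSITIVE if sensitive else 1 - P_SENSITIVE
    p_l = P_LEAVING if leaving else 1 - P_LEAVING
    return p_s * p_l


# Abduction: keep only the hidden states consistent with what actually
# happened (the user hit the page and churned), then renormalize.
consistent = {
    (sens, leav): prior(sens, leav)
    for sens, leav in product((True, False), repeat=2)
    if churn(True, sens, leav)
}
evidence = sum(consistent.values())
posterior = {state: p / evidence for state, p in consistent.items()}

# Action and prediction: replay each surviving state with the page visit
# forced to False and add up the probability that churn still occurs.
p_churn_anyway = sum(
    p for (sens, leav), p in posterior.items() if churn(False, sens, leav)
)

print(round(p_churn_anyway, 3))  # 0.333
```

Under these assumed numbers, this user would have churned anyway about one time in three. The answer depends entirely on the structural equation, which is knowledge that no table of visits and churn events contains by itself.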
The rungs are ordered for a reason. The Causal Hierarchy Theorem — formalized by Elias Bareinboim and colleagues building on Pearl’s work — makes the separation rigorous: in general, data from a lower rung cannot answer a question on a higher rung. No amount of Rung 1 observation settles a Rung 2 question on its own.
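The billing-page example can be simulated to show the gap between observing and intervening. This is a minimal sketch with assumed numbers: a hidden `confused` trait drives both billing visits and churn, and the page itself has no effect. The function names and probabilities are hypothetical. The Rung 2 estimate uses the standard backdoor adjustment, which only works because the code is told which variable is the confounder.

```python
import random


def simulate_users(n, seed=0):
    """Draw (confused, visited_billing, churned) rows from a toy model.

    Assumed wiring: confusion raises both billing visits and churn;
    visiting the billing page has no effect on churn.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        confused = rng.random() < 0.3
        visited = rng.random() < (0.8 if confused else 0.1)
        churned = rng.random() < (0.6 if confused else 0.1)
        rows.append((confused, visited, churned))
    return rows


def rate(rows, keep):
    """Churn rate among the rows selected by the predicate `keep`."""
    selected = [r for r in rows if keep(r)]
    return sum(r[2] for r in selected) / len(selected)


def adjusted(rows, visited):
    """Backdoor adjustment over the confounder:
    P(churn | do(visit)) = sum_c P(churn | visit, confused=c) * P(confused=c)
    """
    total = 0.0
    for c in (True, False):
        p_c = sum(1 for r in rows if r[0] == c) / len(rows)
        total += rate(rows, lambda r: r[0] == c and r[1] == visited) * p_c
    return total


rows = simulate_users(200_000)

# Rung 1: what seeing a billing visit tells you about churn.
obs_visit = rate(rows, lambda r: r[1])
obs_no_visit = rate(rows, lambda r: not r[1])

# Rung 2: what forcing (or preventing) a visit would do to churn.
do_visit = adjusted(rows, True)
do_no_visit = adjusted(rows, False)

print(f"observed: {obs_visit:.3f} vs {obs_no_visit:.3f}")
print(f"do():     {do_visit:.3f} vs {do_no_visit:.3f}")
```

In this toy world the observed churn rates differ sharply between visitors and non-visitors, while the two adjusted rates come out nearly equal. The adjustment is not free: it requires the causal assumption that `confused` is the only common cause, and that assumption comes from outside the data.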
Why more data does not climb the ladder
The part developers miss is that this is a structural limit, not a sample-size limit. More rows do not help.
Here is the intuition. Two completely different causal worlds can produce the exact same observational distribution. Pearl’s stock example: a rooster crows every morning before sunrise. The data (crow, then sun, every single day, for years) is equally consistent with “the rooster causes the sunrise” and “the approaching dawn causes the rooster to crow.” To pick the right one you need an assumption that does not come from the data: knowledge about how the world is actually wired. Strip that away and the dataset is mute, no matter how large it gets.
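The claim that two causal worlds can share one observational distribution can be checked exactly on a two-variable toy. In the sketch below, everything is an assumption made for illustration: binary `crow` and `sun` variables, a 50/50 root cause, and a `NOISE` rate of 0.1. World A wires crow to sun, world B wires sun to crow.

```python
from itertools import product

NOISE = 0.1  # assumed chance that an effect disagrees with its cause


def bernoulli(p, value):
    """Probability that a Bernoulli(p) variable takes `value` (0 or 1)."""
    return p if value == 1 else 1 - p


def follows(cause, effect):
    """P(effect | cause) when the effect copies its cause except for NOISE."""
    return 1 - NOISE if effect == cause else NOISE


def joint_a(crow, sun):
    """World A: crow is the root cause, sun follows crow."""
    return bernoulli(0.5, crow) * follows(crow, sun)


def joint_b(crow, sun):
    """World B: sun is the root cause, crow follows sun."""
    return bernoulli(0.5, sun) * follows(sun, crow)


def p_sun_do_crow_a(sun, crow=1):
    """World A under do(crow): sun still listens to the forced crow."""
    return follows(crow, sun)


def p_sun_do_crow_b(sun, crow=1):
    """World B under do(crow): sun ignores the forced crow entirely."""
    return bernoulli(0.5, sun)


# Rung 1: the two worlds agree on every observable probability.
same_observations = all(
    abs(joint_a(c, s) - joint_b(c, s)) < 1e-12
    for c, s in product((0, 1), repeat=2)
)

print(same_observations)   # True
print(p_sun_do_crow_a(1))  # 0.9
print(p_sun_do_crow_b(1))  # 0.5
```

Both worlds assign identical probabilities to every (crow, sun) pair, so no quantity of observed rows can separate them. They disagree only once the rooster is forced to crow, which is a Rung 2 query.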
This is the part that matters for your stack. Retrieval-augmented generation adds more Rung 1 evidence. It can genuinely cut hallucinations that come from missing facts — if the model never saw that your API returns a 429 under load, putting that in context fixes it. What retrieval does not do is hand the model a do-operator. You can index every incident postmortem your company has ever written, and the model still cannot compute what would happen if you changed the retry policy — unless something in that text already spells out the causal structure for it.
What this means for your LLM tools and agents
A language model trained to predict the next token is modeling P(text): Rung 1, scaled to a size no statistician of Pearl’s generation imagined. It does Rung 1 work genuinely well. The trouble starts when a prompt looks like a causal or counterfactual question. The model does not run a causal computation; it generates the continuation most strongly associated with questions of that shape.
Sometimes that works. If the training corpus contains enough worked causal reasoning about a topic (and for well-trodden topics it does), the pattern-match lands on a correct answer, and it looks like reasoning. It breaks down when the situation has no close textual precedent: a novel system, your particular codebase, a chain of two or three interventions stacked on each other. That profile plausibly accounts for a large share of production hallucinations. The model is not lying. It is doing Rung 1 work on a Rung 2 question, and presenting the result with the same fluency either way.
Agents make the gap sharper. An agent acting in the world asks a Rung 2 question at every step — “if I run this command, what state results?” An agent whose only signal is logs of past runs has Rung 1 data about those runs. It performs well when the new situation matches the distribution it has seen, and degrades, often silently, when it does not. “Works in the demo, fails in production” is frequently this exact mismatch.
The practical move is to stop asking these tools to climb a rung they cannot, and to use them hard where Rung 1 is the job: autocomplete, boilerplate, format translation, summarizing a diff, surfacing a pattern across files. Then supply the causal model yourself — explicit constraints in the prompt, tests that encode your cause-and-effect expectations, and review that checks the reasoning rather than only the output.
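One way to encode a cause-and-effect expectation in a test is to intervene on a single factor while holding the rest fixed, then check whether the outcome flips. The sketch below is hypothetical throughout: `run_job`, its two parameters, and the cold-cache failure are stand-ins for whatever system you are debugging, chosen to mirror the flaky-test story where the database call was only correlated with the failure.

```python
def run_job(db_ok: bool, cache_warm: bool) -> bool:
    """Hypothetical system under test; returns True when the job succeeds.

    In this toy version the failure comes from a cold cache, not from
    the database, even though both are bad in the failing run.
    """
    return cache_warm


def causes_failure(factor: str, baseline: dict) -> bool:
    """Flip one factor, hold the others fixed, report if the outcome changes."""
    flipped = {**baseline, factor: not baseline[factor]}
    return run_job(**baseline) != run_job(**flipped)


# The observed failing run: database flaky AND cache cold at the same time.
failing_run = {"db_ok": False, "cache_warm": False}


def test_database_is_not_the_cause():
    # Fixing only the database leaves the job failing.
    assert not causes_failure("db_ok", failing_run)


def test_cold_cache_is_the_cause():
    # Warming only the cache makes the job pass.
    assert causes_failure("cache_warm", failing_run)


test_database_is_not_the_cause()
test_cold_cache_is_the_cause()
```

A retry around the database call would not survive the first test. The point of the design is that each test performs a small controlled intervention instead of trusting which events appeared together in the log.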
None of this is an argument against AI tooling. It is an argument for matching the tool to the rung. Pearl’s hierarchy gives you a fast check before you delegate a task: am I asking this model to recognize a pattern, or to reason about what a change would cause? The first is what it was built for. The second is still on you.