pickuma.
Building a Linter for the Bugs AI Coding Agents Actually Make

AI coding agents produce a recognizable class of mistakes — hallucinated imports, dropped error handling, duplicate logic. Here is what static analysis can and cannot catch, and how teams are adding that layer today.

When developers started using AI coding agents at scale, the assumption was that the failure mode would look like human error — logic bugs, off-by-one mistakes, bad variable names. The reality is different. AI agents produce a recognizable and largely predictable class of bugs that human developers almost never write. They reference APIs that do not exist. They catch exceptions too broadly, or not at all. They duplicate logic because they have no memory of what they generated ten minutes ago. They write syntactically valid Python that reads like JavaScript.

These failures are systematic rather than random, which means static analysis tools can be tuned to catch a meaningful portion of them. But the tooling has not fully caught up, and the bugs that most urgently need catching are often the ones that slip through.

What AI agents actually get wrong

Research helps here. A survey catalogued eight primary bug categories in AI-generated code, with functional bugs — semantic errors, wrong logic, API misuse, type errors — appearing most frequently across the literature. A separate empirical analysis of LLM-generated code placed misinterpretations at 20.77% of bugs (code that deviates from what the prompt described), missing corner cases at 15.27%, and hallucinated objects at 9.57%. That last category — referencing functions and methods that simply do not exist — is the one most specific to AI generation. Human developers rarely call a function they have not first written or imported.

These bugs cluster into a few concrete patterns worth examining:

Hallucinated imports and non-existent APIs. An agent will confidently call df.to_markdown(index=False, bold_headers=True) when the bold_headers keyword does not exist in the installed version of pandas. Or it will write from utils.validators import validate_phone_e164 when that function does not exist anywhere in the repository. Research on open-source models found package hallucination rates around 21.7%; commercial models were better but still reached 5.2%. These are not typos. The names are plausible, which is exactly what makes them dangerous.
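
A minimal sketch of the cheapest version of this check, using only the standard library: parse a file, collect its absolute imports, and report top-level modules that cannot be found in the current environment. The function name unresolved_imports is an assumption made for illustration. The sketch only confirms that a module exists; it does not confirm that a specific name such as validate_phone_e164 is defined inside it, which is where a type checker takes over.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no finder can locate the module.
            if importlib.util.find_spec(top_level) is None and top_level not in missing:
                missing.append(top_level)
    return missing
```

Running this over agent-written files before review surfaces invented modules without executing any of the generated code.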

Dropped or swallowed error handling. When an agent generates code without explicit instructions about failure modes, it either produces no error handling or wraps everything in an overconfident bare except. Three failure modes appear repeatedly: incomplete exception handling (some code paths left unguarded), wrong exception type (catching Exception when only ValueError is relevant), and bare except clauses that swallow every error including KeyboardInterrupt and SystemExit.

    # What the agent wrote
    try:
        result = api_client.fetch(endpoint)
        return parse(result)
    except:
        return None

The bare except here means a timeout, a network error, and a malformed response all look identical downstream. You get None and no signal about what went wrong.
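
For contrast, here is one way the same logic could keep the three failures distinguishable. This is a sketch under assumptions: fetch is a hypothetical callable that returns a JSON string, load_payload is an illustrative name, and the real client in your codebase may raise different exception types.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_payload(fetch, endpoint):
    """Fetch and parse JSON, keeping each failure mode distinguishable.

    `fetch` is a hypothetical callable that returns a JSON string.
    """
    try:
        raw = fetch(endpoint)
    except TimeoutError:
        # Log and re-raise so callers can decide whether to retry.
        logger.warning("timeout fetching %s", endpoint)
        raise
    except ConnectionError:
        logger.warning("network error fetching %s", endpoint)
        raise
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # A malformed body is a different problem from a transport failure.
        raise ValueError(f"malformed response from {endpoint}") from exc
```

Nothing here catches KeyboardInterrupt or SystemExit, and a caller can tell a timeout from a bad payload by the exception type.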

Context blindness and duplicate logic. AI agents have no reliable memory of what they generated earlier in a session, and no awareness of what exists elsewhere in the codebase. The result is near-duplicate implementations: two format_currency functions that diverge subtly over time, validation logic that already exists in a shared utility being re-implemented inline, constants defined twice. The code compiles. Tests may pass. The divergence is discovered months later when one copy gets updated and the other does not.

Cross-language pattern leakage. Agents trained on code across many languages occasionally leak idioms from one language into another. Python code ends up with .push() on lists, .length instead of len(), .equals() for string comparison, or nil? checks. The attribute cases raise AttributeError the first time the line executes, and nil? is not even valid Python syntax, which makes these relatively easy to catch. But a linter that knows what to look for catches them at commit time, not after deployment.
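
A check like this is small enough to sketch with the standard ast module. The attribute list and the name find_leaked_idioms are illustrative assumptions, not the rule set of any particular linter, and a project whose own classes define .push() or .length would need to exclude them.

```python
import ast

# Attribute names idiomatic in JavaScript or Java but absent from Python built-ins.
# Illustrative only; a real rule set would be longer and configurable.
SUSPECT_ATTRIBUTES = {"push", "length", "equals", "forEach"}


def find_leaked_idioms(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, attribute) pairs for suspicious attribute accesses."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in SUSPECT_ATTRIBUTES:
            hits.append((node.lineno, node.attr))
    return sorted(hits)
```

Because the check is purely syntactic, it flags the access without knowing the receiver's type, so expect some false positives.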

Outdated APIs and version mismatches. An agent’s training data has a cutoff. It may generate code against a library version that shipped two years ago, using methods that have since been deprecated or removed. The code looks plausible. If the old method was removed, the call fails on first use. If it was only deprecated, the code passes import checks and runs, and the failure arrives when someone upgrades the dependency.
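
One way to surface the deprecated-but-still-present case early is to promote DeprecationWarning to an error while tests run. The sketch below uses the standard warnings module; old_api and call_strictly are stand-in names invented for the example.

```python
import warnings


def old_api():
    """Stand-in for a library function that is deprecated but not yet removed."""
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def call_strictly(func):
    """Call `func` with DeprecationWarning promoted to an exception."""
    with warnings.catch_warnings():
        # Inside this block, any DeprecationWarning is raised instead of printed.
        warnings.simplefilter("error", DeprecationWarning)
        return func()
```

The same effect is available for a whole test run through Python's -W error::DeprecationWarning option, which pytest also accepts.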

What existing tools catch today

The good news is that standard tooling already catches more of this than many teams realize — if they are actually running it.

Type checkers are the most effective first pass. Pyright and mypy catch hallucinated method calls and non-existent attributes because they resolve names against the type stubs of the installed packages. If the agent calls df.to_markdown(bold_headers=True) and the stub declares its keyword parameters explicitly, Pyright flags bold_headers as an unknown keyword argument; a signature that forwards arbitrary **kwargs gives the checker nothing to reject. Research on static analysis for library hallucinations found Pyright outperformed other methods, with detection rates between 14% and 85% depending on the library and hallucination type. That wide range reflects real variation: dynamically typed code, lambda functions inside dataframe operations, and complex type inference chains all reduce detection reliability.
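
The keyword check can be illustrated at runtime with inspect.signature. This is an analogy, not a description of how Pyright works internally, and the functions explicit and passthrough are hypothetical. It shows why a signature with explicit parameters gives a definite answer while one that accepts **kwargs does not.

```python
import inspect
from typing import Optional


def accepts_keyword(func, name: str) -> Optional[bool]:
    """Return True or False if the signature settles it, None if **kwargs hides the answer."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        # The function forwards unknown keywords somewhere else; the signature cannot say.
        return None
    return False


def explicit(data, index=True):
    """Hypothetical API with fully declared keywords."""


def passthrough(data, **kwargs):
    """Hypothetical API that forwards keywords to another library."""
```

Here accepts_keyword(explicit, "bold_headers") is False, while accepts_keyword(passthrough, "bold_headers") is None: the hallucinated keyword is only detectable in the first case.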

Semgrep is useful for pattern-based detection of deprecated APIs and known-bad patterns. You can write rules that flag calls to deprecated library functions, check that except clauses are not bare, or verify that specific dangerous patterns (like subprocess.run(..., shell=True) with user input) are absent. In 2026 Semgrep introduced a multimodal approach that pairs rule-based detection with LLM reasoning over control and data flow slices, which it claims finds substantially more true positives while cutting noise compared to standalone LLM analysis.
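
Semgrep rules are written in YAML, so to keep the examples in one language, the sketch below expresses two of those patterns with Python's ast module instead. It is not Semgrep's rule syntax, and find_banned_patterns is an illustrative name. It flags bare except clauses and any call that passes the literal shell=True.

```python
import ast


def find_banned_patterns(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, label) pairs for bare excepts and shell=True calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            # `except:` with no exception type is stored with type None.
            findings.append((node.lineno, "bare except"))
        elif isinstance(node, ast.Call):
            for keyword in node.keywords:
                if (
                    keyword.arg == "shell"
                    and isinstance(keyword.value, ast.Constant)
                    and keyword.value.value is True
                ):
                    findings.append((node.lineno, "shell=True"))
    return sorted(findings)
```

Unlike a data-flow-aware rule, this does not know whether user input reaches the command, so it over-reports by design.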

For Python specifically, tools like sloppylint were built explicitly for AI-generated code. It runs over 100 pattern checks across four categories: noise (debug print statements left in, redundant comments), hallucinations (packages that do not exist in the environment, bare pass placeholder functions), style issues (excessive nesting, single-method classes that accomplish nothing), and structural problems (bare except, star imports, mutable default arguments). It produces a “slop score” broken down by category and a verdict ranging from clean to severe. The cross-language leakage checks — detecting .push(), .forEach(), .equals() in Python files — catch a class of bugs that no standard Python linter targets.

Standard linters — Ruff, flake8, pylint — catch the tail of the distribution: unused imports the agent added, variables assigned and never read, unreachable code after an early return. These are real issues worth catching, but they are the easy end of the problem. They do not require any new understanding of AI-specific failure modes.

What static analysis cannot catch

The harder class of bugs sits outside what any static analysis tool can currently detect reliably.

Misinterpretations — code that implements the wrong algorithm for the stated requirement — do not have a static signature. The code is syntactically correct, type-checks, and does something. It just does the wrong something. A function that sorts descending when the spec required ascending, a filter that uses >= when it should use >, a date calculation that handles UTC inconsistently: none of these are visible to a linter. They require test coverage that exercises the intended behavior, or a human reviewer who understands the domain.
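
Two of those cases are cheap to pin down with behavioral tests. The functions below are hypothetical examples with an assumed spec written in their docstrings; the point is that each test asserts a specific output at the place where a misreading would show up.

```python
def top_scores(scores, n):
    """Return the n highest scores, highest first (assumed spec)."""
    return sorted(scores, reverse=True)[:n]


def over_limit(value, limit):
    """Return True only when value strictly exceeds limit (assumed spec)."""
    return value > limit


def test_top_scores_orders_descending():
    # An ascending sort would return [3, 5] here.
    assert top_scores([3, 9, 5], 2) == [9, 5]


def test_over_limit_excludes_boundary():
    # Using >= instead of > would make the first assertion fail.
    assert over_limit(10, 10) is False
    assert over_limit(11, 10) is True
```

A line-coverage report would show both functions as covered by almost any call; only the boundary and ordering assertions distinguish the right behavior from the plausible wrong one.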

Context blindness bugs — the duplicate logic case — are detectable in principle with tools that do semantic similarity analysis across a codebase, but no widely deployed linter does this well today. The practical catch rate for near-duplicate business logic is low.

Insecure defaults in scaffolding are a documented weak spot for static analysis. When an agent generates authentication boilerplate with verify=False on an HTTPS call, or hardcodes credentials, the pattern may not match any rule because the surrounding code is legitimate. The rule would need to understand that this particular variable flows into a security-relevant context.

Logic bugs in complex control flow are also hard. AI agents generate unusual branching patterns that cause traditional control flow analysis to either over-flag or under-flag. Static analysis tools were built around assumptions that code follows recognizable structural patterns. AI-generated code sometimes violates those assumptions in ways that defeat the analyzer’s heuristics.

Putting together a checking layer

A practical static analysis layer for AI-generated code today looks like a sequence of tools in CI rather than a single product:

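  1. Dependency validation before install. Check every package name the agent suggests against the actual registry (PyPI, npm, crates.io) before it goes into a manifest. This is automatable and catches names that do not exist at all. It is also the first defense against slopsquatting, where an attacker registers a commonly hallucinated name, although a name that does resolve still needs review before it is trusted.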

  2. Type checking on every commit. Pyright in strict mode catches hallucinated method calls, wrong argument types, and non-existent attributes. It does not catch everything, but it catches a meaningful slice quickly.

  3. Semgrep rules for known-bad patterns. Maintain a rule set for deprecated APIs in your specific dependencies, banned patterns (bare except, shell=True, hardcoded credentials), and any domain-specific invariants your codebase must maintain.

  4. AI-specific linters where the language has them. For Python, sloppylint in CI adds checks that no standard linter performs — particularly the cross-language leakage patterns and the placeholder-function detection.

  5. Test coverage that exercises behavior, not just lines. Static analysis cannot catch misinterpretations. Tests that assert specific outputs for specific inputs can. The investment in behavioral tests pays off more against AI-generated code than against hand-written code, because the bug distribution leans toward plausible-but-wrong logic rather than syntax errors.
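
Step 1 can be sketched in a few lines. This example assumes PyPI's JSON endpoint at https://pypi.org/pypi/<name>/json, which returns 404 for unknown projects; the function names are illustrative, and the exists parameter is there so the lookup can be swapped for a cache, an internal mirror, or a stub in tests.

```python
import urllib.error
import urllib.request


def exists_on_pypi(name: str) -> bool:
    """Return True if PyPI's JSON endpoint knows the project name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return False
        raise  # other HTTP errors are not evidence either way


def unknown_packages(names, exists=exists_on_pypi):
    """Return the subset of `names` that the registry lookup does not recognize."""
    return [name for name in names if not exists(name)]
```

A CI job could fail when unknown_packages returns anything for newly added dependencies. Existence is not the same as trustworthiness, so a name that resolves still deserves a look at its age, maintainers, and download history.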

The limit of this stack is real. You will still ship bugs that no tool caught, because the hardest class of AI-generated mistakes requires understanding intent. That is the part that remains a human problem — or a test problem. What static analysis adds is a filter on the systematic, predictable failures: the hallucinated API that would have blown up at runtime, the bare except that would have swallowed errors silently for weeks, the JavaScript idiom that would have failed the first time someone called the function. Those are worth catching automatically.

FAQ

Do standard linters like pylint or ESLint cover AI-specific bugs?
Partially. Standard linters catch style violations, unused imports, and some type errors — all of which appear in AI-generated code. But they were not designed for patterns specific to LLM output: hallucinated method calls, cross-language idiom leakage, placeholder functions, or swallowed exceptions. Tools like sloppylint layer AI-specific checks on top of, not instead of, standard linters.
Can type checkers reliably catch hallucinated API calls?
They catch a substantial fraction, especially for well-typed libraries with complete stubs. Research on library hallucination detection found Pyright detection rates ranging from 14% to 85% depending on the library. Dynamically typed code, lambda expressions, and complex type inference chains reduce coverage. Type checking is necessary but not sufficient.
Is there a way to catch hallucinated package names before they are installed?
Yes. Validating every AI-suggested package name against the actual registry (PyPI, npm, etc.) before running install is automatable and catches hallucinated names before any code is installed. Some dependency audit tools and supply-chain scanners are adding this check. The main gap is that it requires an explicit validation step rather than relying on the install failing gracefully.
