arXiv Bans Papers With Hallucinated LLM References for One Year
arXiv now imposes a one-year submission ban for papers with unchecked LLM errors like hallucinated citations. Here's the policy, why it exists, and the verification workflow that catches hallucinations before you submit.
arXiv changed the rules for how you can use a language model in a research paper. The preprint server now imposes a one-year submission ban when a paper contains incontrovertible evidence of unchecked LLM output — most commonly hallucinated citations or fabricated results that no one verified before the paper went live.
The policy doesn’t ban LLM-assisted writing. It punishes laziness. If you ran a draft through a model, accepted its made-up reference list, and submitted without checking, you’re now blocked from posting any preprint for twelve months. That’s a real cost, especially for grad students and early-career researchers who use arXiv as a timestamp for priority claims.
What the policy actually targets
The trigger isn’t AI use. It’s verifiable error left in the manuscript. Three patterns get papers flagged:
- Hallucinated citations: references that don’t exist, or that exist but say something different from what the paper claims they say. This is the most common failure mode for ChatGPT, Claude, and Gemini when asked for sources.
- Fabricated experimental results: numbers in tables that don’t appear in any code or dataset the authors can produce, or figures generated to illustrate a story rather than describe data.
- Phantom prior work: claims about what a competing paper does or doesn’t show, where the cited paper does no such thing.
You don’t get banned for clean LLM-assisted prose. You get banned when a moderator can open a citation, see it doesn’t exist, and conclude the author never opened it either.
Why hallucinated citations slipped past so many drafts
The reason this is a policy and not just a guideline is volume. Reviewers and moderators have been reporting submissions whose bibliographies have the right shape (plausible journal names, real-looking DOIs, author lists that include genuine researchers) but whose specific papers don’t exist. The format is correct because the model learned what citations look like. The content is wrong because the model has no retrieval guarantee for the specific work cited.
When you paste a related-work section into a chat model and ask it to “add citations,” you get strings that look like references. They are not references. The model is producing a sequence of tokens that match the statistical pattern of a bibliography. Some of those will be real. Some will be combinations of real authors with real-sounding titles attached to real journals — and they won’t exist anywhere.
Three habits make this worse:
- Copying entries from the model’s bibliography into your reference manager without resolving their DOIs. If your reference manager can’t find the DOI, the paper probably doesn’t exist.
- Trusting “I’ll check it later” for citation accuracy. Later is submission day. Submission day is when you ship the hallucination.
- Skipping the “open the PDF” step for every cited claim. If you can’t point to the paragraph in the cited paper that supports your claim, you can’t defend the citation in review.
A verification workflow that actually works
The fix isn’t a single tool. It’s a workflow that closes the loop between every claim in your draft and a verifiable source. Here’s what catches hallucinations before submission:
Step 1 — resolve every citation by DOI. Run your reference list through Crossref or a reference manager that resolves DOIs. Any citation that doesn’t resolve is suspect. If you can’t find it on Google Scholar, Semantic Scholar, and Crossref, treat it as hallucinated until proven otherwise.
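A minimal sketch of the DOI check in Step 1, assuming your references are plain strings with the DOI written somewhere in each entry. The function names (`extract_doi`, `crossref_has_doi`, `triage`) and the three status labels are illustrative, not part of any arXiv or Crossref tooling. The lookup uses the public Crossref REST API route `https://api.crossref.org/works/<doi>`. Crossref only knows about DOIs registered with Crossref, so a miss there means "check by hand", not "proven fake"; DOIs registered with other agencies such as DataCite won’t be found.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Matches the common modern DOI shape (10.<registrant>/<suffix>); not exhaustive.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(reference: str):
    """Return the first DOI-like string in a reference entry, or None."""
    match = DOI_PATTERN.search(reference)
    if match is None:
        return None
    # Drop sentence punctuation that often trails a DOI in a formatted entry.
    # (Assumption: a real DOI ending in one of these characters is rare.)
    return match.group(0).rstrip(".,;)")


def crossref_has_doi(doi: str, timeout: float = 10.0) -> bool:
    """Ask the Crossref REST API whether it holds a record for this DOI."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages are not evidence either way


def triage(references, resolver=crossref_has_doi):
    """Label each reference 'resolved', 'unresolved', or 'no-doi'."""
    report = {}
    for ref in references:
        doi = extract_doi(ref)
        if doi is None:
            report[ref] = "no-doi"
        elif resolver(doi):
            report[ref] = "resolved"
        else:
            report[ref] = "unresolved"
    return report
```

Anything labeled `unresolved` or `no-doi` goes to the manual search described above. The `resolver` parameter is there so you can swap in a different lookup, or a stub when testing offline.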
Step 2 — for every cited claim, link to the supporting passage. Use a research workspace that lets you attach annotated PDFs to each claim. Notion, Obsidian, and Zotero with annotation plugins all work — the point is the discipline, not the tool. If a cited passage doesn’t exist in the source, that’s the citation to delete.
Step 3 — run a separate model pass that questions citations rather than generates them. Feed your bibliography and your claims into a second model and ask: “for each citation, what evidence in the cited paper supports the claim?” If the model can’t answer, the citation is probably wrong, or your claim is overstated.
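One way to set up the questioning pass in Step 3 is to build a single prompt from your claim and citation pairs, then paste it into whichever second model you use. This is only a sketch: `CitedClaim` and `build_audit_prompt` are hypothetical names, and the extra instruction to quote the passage is an assumption about what makes the answers easy to check. No model API is called here.

```python
from dataclasses import dataclass


@dataclass
class CitedClaim:
    claim: str     # the sentence in your draft
    citation: str  # the reference that sentence leans on


# The question from Step 3, used verbatim as the prompt header.
QUESTION = "For each citation, what evidence in the cited paper supports the claim?"


def build_audit_prompt(pairs):
    """Format claim/citation pairs into one prompt for a second model pass."""
    lines = [
        QUESTION,
        "Quote the supporting passage, or answer 'no support found'.",
        "",
    ]
    for i, pair in enumerate(pairs, start=1):
        lines.append(f"{i}. Claim: {pair.claim}")
        lines.append(f"   Citation: {pair.citation}")
    return "\n".join(lines)
```

The second model can invent a supporting passage just as easily as the first invented the citation, so treat any quote it returns as a pointer to look up in the PDF, not as verification.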
Step 4 — diff your bibliography against your own pre-LLM search. If you searched for related work yourself before the model helped you write, compare what you found to what’s in the final bibliography. Citations that appeared only after the LLM touched the section get extra scrutiny.
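The diff in Step 4 can be as simple as a set comparison on normalized titles. A rough sketch, assuming you kept a list of titles from your own search; `normalize_title` and `added_after_llm` are illustrative names, and matching on titles alone is a simplification that will miss retitled preprints.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def added_after_llm(pre_llm_titles, final_titles):
    """Return final-bibliography titles that were absent from the pre-LLM search list."""
    known = {normalize_title(t) for t in pre_llm_titles}
    return [t for t in final_titles if normalize_title(t) not in known]
```

Whatever this returns is the list of references that entered the draft after the model touched it, which are the ones to open and read first.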
What this means for AI-assisted research writing
The policy shifts the accountability bar in a productive direction. You can still use models to draft, summarize, restructure, and edit. You cannot use them to produce references you haven’t read or numbers you haven’t computed. That distinction is straightforward to honor, and most researchers were already on the right side of it.
The hard cases are subtler: claims about what a cited paper “shows” that drift from what the paper actually argues, paraphrased findings that flip a sign, or summaries of method that omit the constraint that makes the comparison meaningful. Those errors don’t always trigger the ban — they’re invisible to automated checks. But they’re the failures the policy is gesturing at. Treat citation hygiene as a first-class part of your writing workflow, not an end-stage chore.