Ground the model, or it invents the evidence
Trusted, curated, subject-specific models are needed for academia
Earlier this year a thread from Tom Dietterich, who chairs arXiv’s computer science section, made the rounds. He was clarifying the consequences for unchecked LLM output on the platform. If a submission contains “incontrovertible evidence” that the authors did not check what the model produced (hallucinated references, or worse, model meta-comments still sitting in the manuscript like “here is a 200 word summary, would you like me to make any changes?”) the authors are banned from arXiv for a year.
In my opinion, the policy is needed and appropriate. The underlying problem is bigger than arXiv.

Figure 1. Citation-hallucination rates and monthly counts by repository, as of August 2025. Source: Zhao et al. (2026), arXiv:2605.07723.
A letter in The Lancet by Maxim Topaz and colleagues found that one in every 277 PubMed-indexed papers in early 2026 cited something that does not exist. In 2023 the figure was one in 2,828. A much larger audit by Zhao, Yin and colleagues at Cornell, Berkeley, Tsinghua and UCLA looked at 111 million references across arXiv, bioRxiv, SSRN and PubMed Central. Their conservative estimate is that at least 146,932 hallucinated citations entered the scientific record in 2025 alone. All four repositories climbed sharply from mid-2024. By August 2025 the hallucination rate ranged from 0.21% on bioRxiv to 1.91% on SSRN, and because PubMed Central is so much larger, it carried the highest monthly count even at a rate of 0.27%.
These numbers are largely about general-purpose language models being asked to do something they were not designed for. Generating prose is what they do well. Verifying that a citation refers to a real paper is not in their job description, and the architecture does not support it. They are statistical prediction engines that produce the most plausible-looking string given everything they have absorbed. This is a problem when it claims to build on the permanent record of research.
The hallucinations have a direction
When fabricated citations name real authors, they do not name them at random. This is likely true of existing research practices, but if we’re entering a new era of research, it would be cool if we got rid of some of the old problems. The Cornell team checked whether this bias was confined to the obviously fake citations, or whether it showed up in the perfectly real ones sitting beside them in the same bibliographies. It does. The real citations in papers that also contain hallucinations skew toward the same prominent authors.

Figure 2. How hallucinated citations differ from matched genuine ones. Source: Zhao et al. (2026), arXiv:2605.07723.
Compared with matched genuine citations, hallucinated citations credit scholars with 68.8% more prior publications and 58.3% more citations. The credited names also skew male, by 6.4 percentage points. Prominence and plausibility are entangled in the training data, since the most-cited authors appear in the most reference lists. The model has learned that their names belong in reference lists. When it fabricates, it fabricates those who would most likely turn up in a reference list.

Figure 3. The Matthew effect, automated.
The Matthew effect, Robert Merton’s name for the rich-get-richer dynamic in citation, has always existed. What is new is that we may be about to automate it at a scale and speed that no funding panel or promotion committee can keep up with.
Why a general model is the wrong tool for research
The pattern in the data points to a specific kind of mistake. A general-purpose language model has no representation of the world’s actual citation graph. Asking it whether a paper is real, or whether it supports a claim, is asking a question its architecture does not know how to answer correctly. The result is a confident-sounding answer that is roughly correlated with the truth and fails in predictable ways. This is something I saw when building https://datacitations.com/, A knowledge graph linking 9.3 million data citations across nearly 2,000 data repositories, that demonstrates how FAIR principles power trustworthy, traceable AI answers grounded in structured evidence.

The interesting observation, and the reason I think this problem is more tractable than the discourse suggests, is that science already has the infrastructure to do this properly. The infrastructure simply has not been wired up to the place where the generation is happening.

Figure 4. Two architectures, two failure modes.
Every published paper has a DOI. Every dataset can have one too. Every author can have an ORCID. Every institution can have a ROR identifier. Crossref will resolve any DOI in milliseconds. OpenAlex, DataCite and Semantic Scholar expose this as queryable open data. At Digital Science the Dimensions Knowledge Graph holds these relationships in typed, persistent form, connecting publications, grants, patents, clinical trials, datasets and people. Figshare hosts versioned datasets with their own DOIs. The Cornell team’s own validation pipeline matched 95% of references on first pass with off-the-shelf tools.
The reason 78% of fabricated citations still pass arXiv moderation comes down to one thing. The validation is not being run. The substrate exists. The plumbing connecting it to the systems that produce citations does not. A subject-specific model that consults this substrate before it speaks has a different failure mode than a general model that does not. It will sometimes say “this paper is not in my graph” rather than inventing one. It will sometimes return a smaller set of citations than the user expected. It will sometimes flag conflicts in the underlying evidence. None of these are exciting demo moments. All of them are what scientific reasoning is supposed to look like.
Will we move to a GitHub-like way of building?
The longer I sit with this, the more the analogy to software seems instructive. Software engineering went from a craft where you copy-pasted from forums to one where every dependency has a version number, a hash, an authorship trail, a license, and a public history of who changed what when. It is the reason agents can now reason about software at all, because the substrate is verifiable on every loop.
The equivalent in scholarly publishing would be a citation graph where every reference resolves to a persistent identifier, every claim has a recorded provenance trail back to source data, every researcher’s contribution is attached to their ORCID, and every dataset has a version on a repository like Figshare with its own DOI. None of these are technologically hard. Most of them already exist, it feels like we’re close, but research is not (yet) happening this way. arXiv’s new policy is one small step in that direction. So are journal requirements to validate references against Crossref at submission. So is the slow expansion of registered reports and data-availability requirements. None of these is a complete solution. They are pieces of a substrate.
The deadline
The thing that makes this urgent is the second-order effect, more than the absolute number of hallucinated citations today. Hallucinated citations are getting indexed in Google Scholar as standalone bibliographic entries. They are being ingested into the training corpora of the next generation of models. The contamination, as Maxim Topaz put it in the coverage of his paper, does not go away when the AI gets better. The longer the substrate is left ungoverned, the more of the next-generation models will be trained on something that includes things that were never true.
We have a deadline, because something is already busy filling the substrate with things that were never true.
I’m an optimist. This is a problem science is well-equipped to solve, because every piece of the infrastructure already exists. The problem is that research has a lot of problems that could be solved if cultural changes amongst the research community happened faster. In this new age of acceleration, maybe we’ll finally start optimising the research process.