Why Ghost References Still Haunt Us in 2025—And Why It's Not Just About LLMs
As early as late 2022, I understood that Retrieval Augmented Generation (RAG) would be the future. By grounding LLM responses in retrieved content, RAG should reduce or even eliminate certain types of hallucinations—including the dreaded ghost reference: citations to papers that simply do not exist.
To be clear about terms:
Ghost reference: a citation to a work that does not exist.
Citation unfaithfulness: a citation to a real work that doesn’t actually support the claim being made.
“Existing” or “real” reference: a work you can point to, access, or verify exists. Quality is not guaranteed—such a work might be a legitimate but retracted paper (a “zombie citation”) or a product of a paper mill, which these days often (but not always) means a fully AI-generated paper. Nevertheless, it still counts as “existing” in this context. Note, too, that an “existing” paper can itself contain ghost references in its own bibliography.
In the early days of ChatGPT (the GPT-3.5 era), ghost references were common because these tools (or at least the free versions) didn’t perform web searches to ground their outputs. That was akin to asking a human to cite papers from memory, with no way to check whether they “remembered” the reference correctly. But that’s no longer true for most modern systems, which search the web and perform Retrieval Augmented Generation (see image below, generated by NanoBanana Pro).
While ghost references no longer seem to be a serious issue at my institution, a quick scan of librarian blogs and social media suggests the problem remains widespread. Why is this?
I think the answer is more uncomfortable than “LLMs hallucinate.”
The answer lies in a troubling interaction between two vulnerabilities.
The first is architectural: Google Scholar creates [CITATION] records for references it cannot match to actual documents. These stubs—inferred from bibliographies rather than verified sources—allow fabricated references to accumulate citations and apparent legitimacy from humans and possibly LLMs alike.
The second is that RAG systems using general web search can be fooled by this and similar pollution. When an LLM searches the web to verify a citation, it may find pages that themselves cite the ghost reference and conclude the paper must be real.
In short, ghost references are not primarily an LLM/GenAI phenomenon. They have always existed, propagated by entirely human mechanisms—typos, careless copying, citing without reading. What LLM/GenAI has done is accelerate and amplify a pre-existing structural vulnerability in scholarly communication infrastructure. The web is already contaminated; RAG with web search inherits that contamination.
How Academic RAG Actually Prevents Ghost References
I noticed early on that academic RAG and Deep Research tools largely solved the ghost reference problem. A 2023 thesis demonstrated that even basic early academic RAG systems like Scopus AI, Elicit, and SciSpace do not fabricate references.
Minor citation errors still occur, but these typically originate from upstream metadata issues—think of how Google Scholar sometimes displays incorrect publication years due to merging preprints with the version of record.
Going deeper into how RAG systems generate citations
LLMs do not naturally cite the text chunks they retrieve, and even when instructed to, how do you guard against cases where they don’t follow instructions? However, there are architectural approaches that can virtually guarantee zero fabricated citations. Here is one such method (see image below, generated by Nano Banana Pro):
Step 1: Unique ID Assignment. The database assigns a unique identifier to each document chunk. When chunks are retrieved in response to a query, they carry these identifiers along with their text.
For example, given the query
“Which library is the oldest public branch library in Singapore?”
the retrieval step might return:
<text chunk 1> [UniqueID-XYAD]
<text chunk 2> [UniqueID-QASD]
First public branch library in Singapore was in Queenstown [UniqueID-XXXX]
Step 2: ID-Only Citation Generation. The LLM is prompted and fine-tuned to generate text with only the unique ID—not the full citation. For instance, it might produce the following:
Singapore’s oldest public branch library is in Queenstown [UniqueID-XXXX].
Step 3: Post-Hoc Verification. A non-LLM method verifies that each generated unique ID matches a retrieved document and exists in the database. This catches any hallucinated identifiers.
Step 4: Programmatic Citation Replacement. Finally, a deterministic process (such as regular expressions) replaces each unique ID with the full, verified citation or metadata. We do not trust the LLM to generate complete citations because it might hallucinate details.
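To make these four steps concrete, here is a minimal sketch in Python. To be clear, this is my own illustration, not any vendor’s actual implementation: the ID format, the placeholder citation store, and the helper names are all assumptions.

```python
import re

# Hypothetical store mapping chunk IDs to verified citation metadata (Step 1).
# The metadata here is placeholder text, not a real citation.
CITATIONS = {
    "UniqueID-XXXX": "National Library Board (2023). Queenstown Public Library.",
}

ID_PATTERN = re.compile(r"\[(UniqueID-[A-Z0-9]+)\]")

def verify_ids(generated_text: str) -> list[str]:
    """Step 3: a non-LLM check that every cited ID was actually retrieved.
    Returns the hallucinated IDs (empty list if everything is verified)."""
    return [cid for cid in ID_PATTERN.findall(generated_text) if cid not in CITATIONS]

def replace_ids(generated_text: str) -> str:
    """Step 4: deterministically swap each verified ID for its full citation."""
    return ID_PATTERN.sub(lambda m: f"({CITATIONS[m.group(1)]})", generated_text)

# Step 2: the LLM's draft cites IDs only, never full citation strings.
draft = "Singapore's oldest public branch library is in Queenstown [UniqueID-XXXX]."

bad_ids = verify_ids(draft)
if bad_ids:
    raise ValueError(f"LLM cited IDs that were never retrieved: {bad_ids}")
print(replace_ids(draft))
```

Because the final citation text comes from the database rather than the model, the worst the LLM can do is cite the wrong chunk, never a nonexistent one.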
This architecture virtually eliminates ghost references, because the LLM never gets a chance to invent citation metadata. That said, it cannot prevent unfaithful statements—claims that misrepresent what the cited work actually says. Various methods exist to detect unfaithful claims (e.g. entailment-based verification), but none are foolproof.
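For illustration, here is what a basic entailment check might look like: a minimal sketch using an off-the-shelf NLI model via Hugging Face transformers. The model choice, label name, and threshold are assumptions on my part, not recommendations.

```python
from transformers import pipeline

# Hypothetical model choice; label names vary across NLI models.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def is_supported(chunk_text: str, claim: str, threshold: float = 0.8) -> bool:
    """Treat the retrieved chunk as the premise and the generated sentence
    as the hypothesis; accept the claim only on confident entailment."""
    result = nli([{"text": chunk_text, "text_pair": claim}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```

Even a check like this only catches relatively blatant mismatches; paraphrase, partial support, and claims synthesised across multiple chunks remain hard cases.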
Critically, this approach works because academic RAG systems perform retrieval over a known, curated, and bounded corpus. General-purpose LLMs searching the open web cannot easily adopt this strategy—and therein lies part of the problem.
Ghost References Existed Long Before LLMs
Someone shared with me the following Bluesky post.
This Bluesky post highlighted a ghost reference that had accumulated 43 citations in Google Scholar. The “paper” appeared as a [CITATION] entry—a type of Google Scholar record we’ll examine in detail shortly.
While works in roughly the same topic by the purported authors exist (for instance, Williamson & Piattoeva published on related topics in 2019 and a 2020 book chapter), this specific reference does not correspond to any real publication.
I suspect most people initially assumed this was an LLM-hallucinated reference that had propagated through the system. But my response hinted that this might not necessarily be the case.
Regardless of the initial source of the error, Google Scholar’s practice of generating [CITATION] records is quite dangerous because, as we shall see, it can “pollute” the web.
Examining the citing articles revealed something important: citations to this ghost reference came from papers supposedly published in 2019 and 2021!
This was years before ChatGPT’s widespread adoption in late 2022, and while GPT-2 (2019) and GPT-3 (2020) existed, they were unlikely to be the cause of citations in papers from 2019 or even 2021.
This complicates the easy narrative that GenAI created the ghost reference problem. These early citations were likely produced by humans through entirely traditional mechanisms:
Typographical errors that propagated
Careless copying of reference lists without verification
Conflating two similar papers into one that doesn’t exist
Simply misremembering a paper
That said, one can easily see that the offending citing papers come from very poor-quality sources, so we can’t even be sure the publication dates are correct!
Still, I wouldn’t be surprised if at least some of them really are just human errors.
This shouldn’t surprise us. The empirical literature on citation practices has long documented these patterns. Simkin and Roychowdhury’s influential studies estimated that only about 20% of citers actually read the original papers they cite; the rest probably copy references from other papers’ bibliographies (Simkin & Roychowdhury, 2003). Related are studies of “academic urban legends,” in which claims such as “most papers are never cited” spread through misrepresentation of the cited papers. All of this was documented pre-GPT!
In the particular topic under discussion, it seems we also have evidence that ghost references are common (pre-GPT).
But as you will see later, more recent ghost references might perhaps be LLM/GenAI related.
Why “RAG + web search” doesn’t reliably prevent ghost references
When I asked the free version of ChatGPT (currently GPT-5.2 Instant) to locate the ghost reference from the Bluesky post, it confidently claimed the paper existed and pointed to what appeared to be a source. That source? A webpage from another journal that itself had cited the ghost reference.
This is perhaps part of the reason GPT still generates ghost references even with web search.
To be fair, the paid version of ChatGPT (GPT-5.2 with thinking) occasionally identified the paper as fake by locating the original Bluesky post from the authors. But results were inconsistent—different prompts yielded different conclusions.
It is unclear to me whether LLMs could be taught to verify papers the way we librarians and researchers do (imagine connecting to MCP sources like Crossref or PubMed), flagging things like “this reference looks fishy: the DOI doesn’t match, and there is no trace of the journal article (beyond its appearance in reference lists) where it should be easily found.”
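For what it’s worth, parts of that verification workflow are straightforward to automate. Here is a minimal sketch against the public Crossref REST API; the helper name and the 0.8 similarity threshold are my own illustrative choices.

```python
import requests
from difflib import SequenceMatcher

def looks_fishy(doi: str, claimed_title: str) -> bool:
    """Flag a reference if its DOI is unregistered, or if the title
    registered with Crossref doesn't resemble the title being cited."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return True  # DOI not registered with Crossref: suspicious
    titles = resp.json()["message"].get("title", [])
    if not titles:
        return True
    similarity = SequenceMatcher(
        None, claimed_title.lower(), titles[0].lower()
    ).ratio()
    return similarity < 0.8  # registered title doesn't match the citation
```

Passing the DOI and full title of the Small (1973) reference shown later in this post should come back clean; a ghost reference with a fabricated DOI gets flagged immediately.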
The broader lesson: LLMs with general web search can fail to reliably verify references because the web itself contains fake citations.
This creates a feedback loop. A ghost reference (either human or LLM generated) gets cited by a real paper. That paper appears online. An LLM finds the citation and concludes the reference must be real. The ghost becomes increasingly entrenched.
This is the “citogenesis” phenomenon famously illustrated by XKCD.
This occurs when Wikipedia errors propagate into published sources and back again. Reference list errors in influential papers get copied by subsequent authors who don’t verify their sources. A typo becomes a “paper” that dozens of researchers claim to have read.
Similarly, once a ghost reference appears (human or LLM generated), this citogenesis process can be accelerated by both LLMs and careless humans.
The scary bit is this: even if LLMs from here on out never hallucinate ghost references “on their own” and always back their claims with web sources, they may still find existing ghost references and cite them!
What’s changed is not the existence of ghost references but their potential scale and discoverability. GenAI can produce more content, faster, with confident-sounding citations. And when verification is attempted using LLMs with web search, those tools can be fooled by the pre-existing pollution that humans created over decades.
I want to be careful here: we don’t yet have good empirical data on how much GenAI has increased the rate of ghost reference creation. This is an important research question. What we can say is that the architectural conditions exist for acceleration, even if quantifying the actual increase remains difficult.
Why Google Scholar’s practice of creating [CITATION] records is dangerous in this day and age
To understand the structural vulnerability at play, we need to examine how citation indexes are constructed. I always recommend that librarians and researchers understand how citation indexes are created, because it explains many of the issues with citation counts.
Citation indexes follow a four-step pipeline:
1. Collect. Citation indexes cover a defined set of works. Even Google Scholar, as broad as it is, focuses on “scholarly work.” So step one is to gather that defined set—specific journals, conference proceedings, book chapters, etc. Most well-known citation indexes focus on journal articles and, to a lesser extent, conference proceedings, book chapters, and (for broader, more open indexes) preprints.
2. Extract. The system analyzes each collected work to identify its bibliography, pulling out reference strings from indexed works, e.g.:
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4), 265–269. https://doi.org/10.1002/asi.4630240406
3. Match. This is the most technically complex step. The system attempts to link extracted references to works already in its database. This is not straightforward, because cited references exhibit enormous variation. Common issues include:
Author name variants (initials, transliterations, name changes)
Journal title abbreviations (inconsistent or non-standard)
Typographical errors in the original references
Incomplete citations
Matching algorithms must balance precision and recall—too strict and you create false negatives (cited works lose citations); too loose and you create false positives (incorrect links). A toy sketch after this list illustrates the trade-off.
4. Total. The system aggregates links to quantify influence, producing the citation counts researchers use to gauge impact.
In short, citation indexes transform static reference lists into dynamic, countable links, converting individual documents into a networked web of scholarship.
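To make the matching trade-off in step 3 concrete, here is a toy sketch. Real systems use far richer signals (authors, year, DOI, venue); the two-work index and the threshold here are purely illustrative.

```python
from difflib import SequenceMatcher

# A toy "index" of two already-indexed (real) works.
INDEX = {
    "W1": "Co-citation in the scientific literature: A new measure of the "
          "relationship between two documents",
    "W2": "A general theory of bibliometric and other cumulative advantage processes",
}

def match_reference(ref_title: str, threshold: float = 0.9) -> str | None:
    """Link an extracted reference string to an indexed work by normalised
    title similarity; return None if nothing clears the threshold."""
    best_id, best_score = None, 0.0
    for work_id, title in INDEX.items():
        score = SequenceMatcher(None, ref_title.lower(), title.lower()).ratio()
        if score > best_score:
            best_id, best_score = work_id, score
    # Lowering `threshold` boosts recall but risks false positives
    # (the "loose matching" problem discussed later).
    return best_id if best_score >= threshold else None
```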
Step 3—matching—is the least understood yet most critical part of this process. What happens when an extracted reference cannot be matched to any existing work in the database?
The Google Scholar [Citation] Vulnerability
When an indexed work contains reference strings that don’t match anything in the index, one option is simply to do nothing.
But Google Scholar’s answer to unmatched references is instead to create [Citation] records—entries that appear when GS detects a reference in an indexed paper’s bibliography but cannot locate the actual source document.
These are essentially citation stubs: metadata records inferred from parsed reference lists rather than from direct indexing of primary documents.
Key characteristics include:
The title appears with a “[CITATION]” prefix and typically isn’t hyperlinked to full text.
Metadata is often incomplete or inaccurate (missing authors, incorrect dates, truncated titles) because it’s reconstructed from how other papers cited the work.
They lack abstracts since Google Scholar never accessed the original document.
They still accumulate citation counts—which is actually their primary function.
This mechanism serves legitimate purposes. An unmatched reference might represent a real work that simply isn’t indexed by Google Scholar: perhaps a print-only source, a work behind paywalls GS cannot penetrate, or simply a typo that prevented matching. For instance, referencing “Tay, C.H. (2020)” when the actual author is “Tay, C.L. (2020)” would cause matching to fail.
My own Medium posts have legitimate [CITATION] entries because they’ve been cited by indexed scholarly works despite not being “scholarly” content that GS would directly index.
But the [CITATION] mechanism has always been a structural vulnerability. It allows fabricated references—whether created by human error or GenAI hallucination—to enter the scholarly communication system with an appearance of legitimacy. Once a [CITATION] record exists and accumulates citations, it becomes increasingly difficult to distinguish from a real paper that simply isn’t available online.
I don’t believe Google Scholar creates a [CITATION] record for every unmatched reference—there’s likely some threshold, such as multiple indexed works citing the same reference.
But once a ghost reference clears that threshold and a Google Scholar [CITATION] record is created, the citogenesis cycle accelerates. Careless researchers (and LLMs with web search) see the [CITATION] record, assume the paper exists, and cite it themselves.
Google Scholar has extremely tight anti-bot features, so LLMs usually cannot search and “see” the [CITATION] record directly. But once that record exists on the web, there are numerous ways for it to “leak out”—through library discovery systems, reference management tools, and academic social networks that surface Google Scholar metadata.
A good example is a journal or ebook with ghost references hosted on a platform that offers link resolvers. The system will generate a link resolver link that, when clicked, brings the user to a library catalog record that looks real enough to request through ILL or Document Delivery!
This vulnerability existed long before GenAI. The 2019 citations to our example ghost reference prove that. What GenAI changes is the rate at which new ghost references can be generated and how quickly they can propagate across the web, making citation verification increasingly difficult.
Are Academic RAG Tools Safe from This Problem?
Given the problems with general web search, this is precisely why we turn to academic RAG and Deep Research tools like Elicit, Consensus, and Undermind. They don’t retrieve over the open web but instead query curated sources like Semantic Scholar and OpenAlex.
Similarly, LLMs with MCP connectors to trusted content sources like PubMed or Wiley avoid web pollution by querying authoritative databases directly.
But are these tools immune to the ghost reference problem?
Most AI search startups without their own data sources rely on Semantic Scholar, OpenAlex, or some combination of web scraping and partnerships. While these indexes are more inclusive than Scopus, we can be reasonably assured that the works they contain are “real” in some sense. OpenAlex (prior to the Walden update) only indexed works it could match with Crossref records.
Important caveats apply: existence says nothing about quality. A work could exist as a PDF, on a preprint server, even have a DOI, and yet still be a paper that is entirely AI-generated—potentially including references to papers that don’t exist. But at least the indexed work in OpenAlex itself does “exist”.
The critical architectural difference is how these indexes handle unmatched references. Unlike Google Scholar, OpenAlex and Semantic Scholar appear to only display references that can be matched against indexed works (with an OpenAlex ID or equivalent).
Two of the most popular commercial citation indexes—Scopus and Web of Science—actually allow searching of unmatched reference strings via the somewhat obscure “Cited Reference Search” and “Secondary documents” features respectively. Does OpenAlex or Semantic Scholar have something similar?
As far as I can tell, they do not, and neither do they create [Citation] records for unmatched references.
This is an important safeguard.
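You can see this safeguard directly in the OpenAlex API, which exposes a work’s references only as OpenAlex IDs of already-indexed works, never as raw reference strings. A minimal sketch (the work ID is the example used in the OpenAlex documentation):

```python
import requests

work = requests.get("https://api.openalex.org/works/W2741809807", timeout=10).json()

# `referenced_works` holds only IDs of matched, indexed works; reference
# strings that could not be matched are dropped rather than turned into stubs.
for ref in work["referenced_works"][:5]:
    ref_id = ref.rsplit("/", 1)[-1]  # "https://openalex.org/W..." -> "W..."
    cited = requests.get(f"https://api.openalex.org/works/{ref_id}", timeout=10).json()
    print(cited["display_name"])
```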
Leaving aside the possibility of LLM web search finding such records, as more tools such as Undermind.ai and Consensus Deep Search implement citation chaining in their Deep Search functionality, this design choice prevents them from surfacing and amplifying fabricated references that exist only as the equivalent of [CITATION] stubs.
The Loose Matching Problem
A warning remains: while these indexes won’t create new records for ghost references, their matching algorithms may incorrectly link ghost references to existing real works if the algorithms are too permissive.
Consider a ghost reference generated by LLM or human with slightly garbled metadata. An overly loose matching algorithm might incorrectly link it to a real paper with a similar title or author. The ghost reference doesn’t create a new stub—instead, it incorrectly inflates the citation count of an existing work and creates a false trail that future researchers might follow.
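As a tiny illustration, reusing the toy match_reference sketch from the earlier section:

```python
# A garbled ghost reference is close enough to the real Small (1973) title
# that a permissive threshold silently credits it to that real paper.
ghost = "Co-citation in science literature: measuring relationships between documents"
print(match_reference(ghost))                 # None: rejected at the strict default
print(match_reference(ghost, threshold=0.7))  # "W1": linked to the wrong (real) work
```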
The ghost reference that started this investigation had 43 citations, likely from a combination of:
humans finding the Google Scholar [Citation] record and citing it
humans copying references from papers containing the ghost reference
LLMs finding “evidence” of such records on webpages and citing it
LLMs hallucinating (in the traditional sense) variant forms of the ghost reference and being linked through loose matching algorithms.
It has been suggested that these human errors might be present in LLM pretraining data, leading models to generate such ghost references. I don’t think this is likely: such errors appear in a tiny fraction of training data—far too infrequently to reliably influence generation.
I haven’t seen empirical work directly testing this hypothesis. It would be interesting to check whether known historical ghost references appear at elevated rates in LLM outputs compared to base rates in the literature.
The Missing Layer: Editorial and Peer Review Responsibility
The infrastructural focus of this analysis is important, but we shouldn’t neglect the human gatekeeping failures that allow ghost references to propagate.
Reference verification should be part of editorial and peer review processes. In practice, it rarely is. Reviewers focus on methodology, argumentation, and contribution—not on whether every cited work actually exists. Copy editors check formatting consistency, not ontological validity.
This is understandable given time constraints, but it means that the scholarly communication system lacks a verification layer at the point where it might be most effective. By the time a ghost reference appears in a published paper, it has already gained legitimacy.
Some journals have begun using automated reference checking tools, but these typically verify formatting and DOI resolution rather than comprehensive existence checks. A reference to a non-existent paper without a DOI would sail through.
As ghost references potentially become more common, publishers and editors may need to invest in more robust verification infrastructure—or accept that the scholarly record will become increasingly polluted.
Conclusion: A Pre-Existing Condition, Now Acute
The persistence of ghost references in 2025 may not just be a story about LLMs randomly hallucinating references for no reason.
Academic RAG systems that retrieve from curated databases and use proper citation verification have largely solved this problem at the technical level.
The real story is older and more uncomfortable. Ghost references have always existed, created and propagated by human sloppiness: typos, careless copying, citing without reading, conflating similar papers. The scholarly communication infrastructure—particularly Google Scholar’s [Citation] mechanism—has long had structural vulnerabilities that allow these fabrications to persist and accumulate apparent legitimacy.
What GenAI changes is the scale and the difficulty of detection. LLMs can generate ghost references faster than humans ever could. And when we try to verify references using LLMs with web search, those tools are fooled by the pre-existing pollution that humans created over decades. The contamination feeds on itself.
This has several practical implications:
For researchers: Treat Google Scholar [Citation] records with heightened suspicion—but recognise this was always good practice, not just a GenAI-era precaution. If you cannot access the full text of a work, verify its existence through multiple authoritative sources before citing it. The existence of a Google Scholar entry—even one with many citations, even one that predates ChatGPT—does not guarantee that a work is real.
For librarians: When teaching information literacy, the traditional lesson was to prefer Google Scholar over general web search because it indexes “scholarly” content. This guidance needs to be updated with education about what [CITATION] records are and how to handle them.
For tool and citation index developers: The choice OpenAlex and Semantic Scholar have made—not to create placeholder records for unmatched references (or at least not to make them easily discoverable)—is a design decision with significant implications for research integrity. As citation chaining becomes more common in AI search tools, this architectural choice becomes increasingly important.
And even though Google Scholar lets you filter such records out, it might consider whether [CITATION] records need additional friction, warnings, or provenance indicators.
For researchers studying scholarly communication: We need better empirical data on this phenomenon. How has the rate of ghost reference creation changed post-2023? How do different citation indexes handle unmatched references, and what are the downstream effects? How do ghost references propagate through citation networks, and can we detect them algorithmically? The pre-GenAI literature on citation errors and “academic urban legends” provides a foundation, but the landscape may have shifted significantly.
The ghost reference problem is a chronic condition that has become acute. The infection predates GenAI; the technology has simply lowered our immune response while accelerating transmission. The cure lies not in blaming LLMs but in understanding—and shoring up—the structural vulnerabilities in scholarly infrastructure that have allowed ghost references to propagate for far longer than we’d like to admit.