10 Comments
Kukuh Noertjojo:

Aaron, this is very deep to me. Thank you.

Aaron, you mentioned that top-class academic deep research tools almost never fabricate references, but do they make unfaithful statements? How can one check on this issue in general? Is there any empirical data on unfaithful statements for each of these tools? Would you mind sharing?

Aaron Tay:

Yes they do.

Solving the faithfulness issue is much harder. It's a very difficult problem, because even humans miscite papers, though I think AIs are currently still worse than the citing in top-tier journals.

There are quite a lot of studies on faithfulness, or citation precision, for RAG and more recently for DR solutions like OpenAI/Gemini Deep Research.

In the best-case Deep Research scenarios, they are at most 80% faithful, based on the most current literature of around June-July 2025. Roughly speaking, for every cited generated statement/claim, 20% of the citations do not support it.*
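The arithmetic behind a figure like that can be sketched as follows; the claim/citation judgments below are made up for illustration (real evaluations use human annotators or an LLM/NLI judge, and, as noted, each study measures this a bit differently):

```python
# Hedged sketch of how "faithfulness" / citation precision is scored:
# the fraction of (claim, citation) pairs where the cited source actually
# supports the generated claim. All judgments here are invented.
judged_pairs = [
    ("claim 1", "source A", True),   # citation supports the claim
    ("claim 1", "source B", True),
    ("claim 2", "source C", False),  # citation does not support the claim
    ("claim 3", "source D", True),
    ("claim 4", "source E", True),
]

supported = sum(1 for _, _, ok in judged_pairs if ok)
precision = supported / len(judged_pairs)
print(f"citation precision: {precision:.0%}")  # 4/5 = 80%, i.e. 20% unsupported
```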

I am not sure if academic DR tools are more faithful than general DR; I don't think there have been any studies.

There are post-generation methods that try to detect and mitigate unfaithfulness, but they are slow and also not 100% reliable.
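A toy sketch of what such a post-generation check does, using a naive token-overlap heuristic purely as a stand-in for the NLI models or LLM judges real systems use (the function names, threshold, and examples are all illustrative):

```python
# Toy post-generation faithfulness check. Real pipelines use an NLI model
# or an LLM judge rather than word overlap; this only illustrates the shape
# of the step: compare each claim against its cited passage and flag mismatches.

def overlap_score(claim: str, passage: str) -> float:
    """Fraction of the claim's words that also appear in the cited passage."""
    claim_words = set(claim.lower().split())
    passage_words = set(passage.lower().split())
    return len(claim_words & passage_words) / len(claim_words)

def flag_unsupported(claims, threshold=0.5):
    """Return claims whose cited passage looks too unrelated to support them."""
    return [c["claim"] for c in claims
            if overlap_score(c["claim"], c["cited_passage"]) < threshold]

claims = [
    {"claim": "mice slept longer after treatment",
     "cited_passage": "treated mice slept significantly longer than controls"},
    {"claim": "the drug cured cancer in humans",
     "cited_passage": "in vitro results in mouse cell lines were inconclusive"},
]
print(flag_unsupported(claims))  # flags the second claim for review
```

Even this crude version shows why such checks are slow (every claim/citation pair must be re-examined) and imperfect (a judge model can itself be wrong).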

*Studies measure this a bit differently

Kukuh Noertjojo:

Aaron, thanks a lot! I am not an information specialist, and I've been learning a lot from your "musings".

MF (Aug 28, edited):

Great article, Aaron.

Would it be possible for you to post details of the papers you cite in your response above regarding the statistics of generation quality ('faithfulness')?

It would be really helpful to know the precise details of the DR system described in the paper; as you highlight, it is dependent on the measurement and the RAG/reranking system assemblage, particularly if the paper is a technical one from arXiv. Thanks!

Aaron Tay:

What do you mean? My post does link to three sources: "Various academic studies on Deep Research suggest faithfulness or citation accuracy/precision of the best Deep Research tools is around 80% (of cited claims actually supported by the citation) at best. [1][2][3]" They use different terminology, but they are roughly measuring the same thing (e.g. it is sometimes called citation precision).

MF (Aug 28, edited):

Thanks, Aaron.

A very gentle suggestion from a humble reader: it might be best to follow through on the numbered style and include the full bibliographic references somewhere, as a courtesy to the authors (it will help their altmetrics as well!) and to make the evidence base more apparent to readers, not to mention making the most of the valuable time you spent putting these sources together for an interested public.

The source you cite, Du et al. [2], has limited results reported on the associated GitHub (https://shorturl.at/IewEt); further, it reports only a few hundred academic journal URLs against thousands of general website URLs, with academic journal URLs constituting a smaller portion of the total for the limited Sci-Tech question topic domain. So this preprint doesn't focus exclusively on academic citations, and within that it doesn't appraise the journal quality / peer-review status of returned citations. Citation accuracy vs. abundance is a more nuanced story to report, as it better fits the utility of a result to the researcher, but without a clean set of academic citation results, broad statements on citation accuracy can unfortunately magnify the parallax by presenting stylized facts, even in a newsletter context.

It's a great article in terms of reasserting the utility of certain products' outputs (Undermind.ai etc.) in the research context. This contextualisation of the technology, tool 'recipes', and general advice around the technology is where librarians can excel!

Alfred Wallace:

I've found OpenAI Deep Research very useful for a lot of orientation-type tasks, especially in the humanities; a common prompt is "I'm beginning graduate-level research in _________. Can you provide an overview of the seminal and current work in the field, and propose an initial reading list to orient myself in the literature, and summarize any special skills I would need?"

Some tests I've done on that in humanities subjects have turned out especially well compared to the academic deep research tools, I think because the humanities are thin enough in Semantic Scholar that those tools don't reach enough of the literature, especially monographs.

For articles, Undermind is of course tremendous. I agree that if this can be paired with tools that can reach monographs and gray literature, we'll really be onto something.

Aaron Tay:

Fully agree. I wrote from a business/social science/STEM, journal-focused point of view.

Anything using the Semantic Scholar corpus, which is pretty much all academic DR, is pretty much automatically weak for the humanities.

I try to remember to lower expectations when I see law people in my sessions, for example, due to the poor indexing of journal titles and other legal content they need.

Frida Rosengren:

Thank you for the excellent content, as always. I am also a fan of deep (re)search, but my biggest concern right now is related to your last point, lack of access to papers behind paywalls. It doesn't matter if the AI search technique is excellent if the corpus has large gaps in it (both in full text and metadata). I have found it difficult to understand and compare the corpus of, for example, Undermind and SciSpace with Scopus and WoS. How big of a problem do you think this is? There must be a large open access bias in many AI search tools, and I suspect that coverage of older papers is also generally lacking.

Aaron Tay:

It is a problem, but smaller than you might think (except maybe in the humanities). I've written about the rise of open scholarly metadata for years (2015-2020), so I forget that not all librarians are familiar with it.

Maybe read this account by me, which covers up to 2020 and explains why by 2020 we suddenly had so many new large-scale academic search engines within striking range of Google Scholar's size (hint: the same sources that fuel those academic search engines are now used by the 2022+ generation of AI search):

https://medium.com/a-academic-librarians-thoughts-on-open-access/the-next-generation-discovery-citation-indexes-a-review-of-the-landscape-a-2020-i-afc7b23ceb32

Suffice it to say, due to factors like PID infrastructure (Crossref, ORCID) and organizations like CORE, NIH, OpenCitations, and Microsoft Academic Graph (which was almost Google Scholar-sized and made its metadata open until 2020), now succeeded by OpenAlex and Semantic Scholar, there is a ton more open scholarly metadata than you might expect if you are unaware.

Other factors, like the increase in open access and preprint culture in econ, social science, and even law, add to this.

Most of these tools, whether Undermind or SciSpace, either use the Semantic Scholar corpus or OpenAlex, or harvest their own from multiple sources, but roughly speaking they all converge on around 200M records (note this includes preprints and other non-journal content).

Scopus/Web of Science is estimated at around 80M. There are many different studies, but for now look at https://www.searchsmart.org/results?~() , which uses a calibrated method of estimation that fits many other studies I have seen.
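For readers who want to sanity-check these corpus sizes themselves, the open indexes expose live counts through public APIs. A minimal sketch against OpenAlex's public works endpoint (no API key needed; the endpoint and `meta.count` field come from OpenAlex's published API, and the "~200M+" expectation is the figure discussed above):

```python
# Query the public OpenAlex API for the total number of indexed works.
# Requires network access; no authentication needed.
import json
import urllib.request

url = "https://api.openalex.org/works?per-page=1"
with urllib.request.urlopen(url) as resp:
    meta = json.load(resp)["meta"]

# meta["count"] is the live total, which should be in the ~200M+ range.
print(f"OpenAlex currently indexes {meta['count']:,} works")
```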

There are various studies out there, such as https://link.springer.com/article/10.1007/s11192-020-03690-4 , showing that Microsoft Academic Graph holds its own, at least in coverage, versus Scopus.

Of course, in sheer size these open indexes win out (Scopus and Web of Science are selective), but studies generally show that even within the set of papers in, say, the Web of Science Core Collection, MAG/OpenAlex/Semantic Scholar/Lens.org will generally have most of them, though the metadata will be less accurate or complete.

Up to say 2024 (when some publishers began to close abstracts), I would say coverage, at least of title/abstract, probably isn't an issue in STEM and social science areas, and you can generally expect the abstract to be available because some source scraped it.

Full text might be increasingly available as a preprint.

PS: In fact, most *academic* deep research tools until recently only used the title and abstract.

>There must be a large open access bias in many AI search tools, and I suspect that the coverage when it comes to older papers is also generally lacking.

Yes, I speculated a while back that these AI search tools might lead to a new open access citation advantage, as they tend to favour open access papers. But in fact, as I mentioned at the beginning, many of these tools didn't even search full text, just abstracts.

I am not sure older papers are generally lacking, especially in STEM and the social sciences. Google any classic paper and you will likely see a free copy floating around somewhere... This is because most publishers don't really enforce copyright tightly on older articles anyway.

Again, the humanities are maybe an odd exception.
