[Hot Take] AI Academic Search and the Missing Middle of Literature Discovery
Is an excessive focus on undergraduate information literacy and evidence synthesis making us underestimate AI academic search tools?
This post is part of a “hot takes” series in which I make sharper claims than I usually do. I do not intend to offend, and I am not trying to tar every librarian with the same brush — the patterns I describe and perceive may be a function of my own local context. But I am sure some of what I write will resonate. Besides, I think librarianship advances by naming what could be better, not by pretending everything is fine.
TLDR: AI search tools may look unimpressive when judged mainly through the two most visible library search frameworks: undergraduate information literacy and evidence synthesis. But their strongest current use case may be a third one: ordinary narrative literature reviews (often called scholarly discovery), where librarianship’s voice is less visible than it should be.
I keep running into librarians, in conversation and online, who say they have tried AI search tools and come away unimpressed. They do not really see what the fuss is about.
Some of this reaction is fair. The category “AI search” is broad, and a fair amount of what gets sold under that label is underwhelming.
At its weakest, the category consists of thin LLM wrappers that translate a natural-language query into a Boolean string and then run it over the same lexical search engine underneath. I have previously called this the horseless carriage of AI search: using new technology to reproduce the shape of the old system rather than rethinking the system itself.
Without radically rethinking retrieval and ranking (e.g. semantic search via dense embeddings, iterative agentic search), the ceiling on improvement will be low.
Several library vendors now ship tools of this kind, including Scopus AI, Web of Science Research Assistant, Primo Research Assistant, and EBSCOhost Natural Language Search. They use a powerful new technology to reproduce an old artefact, without addressing the ranking-stage problems that make academic search frustrating in the first place [1]. If a librarian’s exposure to “AI search” is mostly tools of this sort, writing off the category is understandable.
But the reaction I want to focus on is different. It is the dismissal that extends even to the better AI search tools: Undermind, Elicit, Asta Paper Finder, Consensus, and others that are not simply trying to make Boolean search friendlier but are attempting more radical changes to retrieval and ranking.
I will keep talking about “better AI search tools” throughout this essay, and you might be wondering what I mean, since there are many types of academic AI search, with different “AI” techniques and architectures used to accomplish different functions.
Clearly, systems based ONLY on horseless carriage AI search techniques are unlikely to do much better, but can we say anything more positive here?
While there are no guarantees, I would suggest that academic AI search tools of the “Deep Search” and “Deep Research” type should, more often than not, give clearly superior retrieval and ranking for “difficult queries” compared to traditional Boolean retrieval with basic TF-IDF/BM25 ranking. But if you need one tool to point at as an example of what I consider a “better AI search tool”, I would pick Undermind.ai [2].
In my own testing, particularly for difficult queries, the difference is often visible in the first screen of results: the better AI search tools place relevant work high in the ranking, while conventional databases often require more query reformulation and screening before comparable papers appear.
For example, I recently ran a study on reproducibility, running the same query five times across multiple AI search engines. It was not even meant to compare the relevance of AI search engines, but even with a broad query that had 200+ possible relevant results, only Undermind maintained high precision down to rank 50.
Other respected AI search tools could barely manage this for the top 10 to 20, showing the variance in performance. For “hard queries” the difference was even more stark between AI search tools and conventional databases.
The diagram above is just illustrative
I would not be as confident in this if it were only my own assessment, but feedback from PhD students and faculty has been overwhelmingly positive, with heavy, sustained use of Undermind.ai [3]. Comments like “if you do not renew the subscription, I will pay for it myself” are common.
To be clear, we are not talking about Cochrane-grade recall here, and these tools do not exhaust the literature. Their strength is not formal exhaustiveness, but top-N relevance: for focused topics, they often put more useful papers into the first 10, 20, or 50 results than conventional database searches.
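For readers who want to pin down what “top-N relevance” means, here is a minimal sketch of precision and recall at rank k. The document IDs and relevance judgements are made up purely for illustration; this is not the data or the method from my reproducibility study.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top-k slots filled by relevant documents.

    Dividing by k (not by the number of results returned) means a tool
    that returns fewer than k results is penalised for the empty slots.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of all known relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical example: "p*" are relevant papers, "x*" are irrelevant ones.
relevant = {"p1", "p2", "p3", "p4", "p5", "p6"}
ranked = ["p1", "p2", "x1", "p3", "p4", "x2", "p5", "x3", "x4", "x5"]

for k in (5, 10):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Note that recall at k needs a list of all relevant papers, which in real searching you almost never have. That is one reason early precision is much easier to judge in practice than exhaustiveness.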
But this post isn’t just about Undermind.ai. In my view there are quite a few tools (pretty much any type of Deep Search or Deep Research) that are almost as good.
So the question is: given that the better AI search tools often deliver much stronger early precision and better practical discovery for focused scholarly questions, especially when judged by the relevance of the top 10 to 50 results, and given that many researchers plainly value this, why do some librarians still not see it?
Leaving aside the usual anti-AI reasons and focusing on the effectiveness angle, my best guess is that librarians who teach or support search often evaluate AI search tools through one of two existing professional lenses: undergraduate information literacy and evidence synthesis. Neither lens is well-suited to what the better AI search tools currently do best. The use case where these tools genuinely shine, the ordinary narrative review, sits in a third space that the profession has generally paid less attention to.
Lens 1: Information literacy for undergraduates
The first lens is the information literacy (IL) lens, often focused on undergraduate learning.
Information literacy specialists will rightly remind me that IL for undergraduates is not the reductive “teach students to find five peer-reviewed papers for an assignment.” At its best, IL teaching helps students understand how information systems work, how authority is constructed, how disciplinary knowledge is produced, how databases and search engines shape what becomes visible, and how inquiry develops over time. Searching is not just a mechanical act of retrieval (Searching as Strategic Exploration!). It is part of learning how knowledge is organised, contested, and evaluated.
That said, the stereotype exists for a reason. A large amount of undergraduate IL instruction is still tied to assignment support: helping students find a handful of credible scholarly sources for essays. The librarian may want to teach richer concepts, but the immediate student need is often narrow.
This is exactly the use case where higher precision and recall do not matter much. A student writing a 2,000-word essay does not need exhaustive recall. They need enough credible sources on a common topic (of which there are many) to support a basic argument, and almost any decent academic database can supply that. JSTOR, ProQuest, EBSCOhost, Scopus, Google Scholar, or the library discovery layer will usually produce something usable. Better ranking helps, but rarely transforms the outcome.
In short, if the dominant teaching use case is helping undergraduates find “some good enough sources,” then the value of a tool that produces a much stronger top-10 or top-20 ranked list may not be obvious. The existing tools already clear the bar.
The pedagogical argument
There’s a deeper issue.
If the goal is simply to help students obtain relevant sources quickly, then better ranking is an obvious win. But if, as an information literacy librarian, your goal is to help students understand the process of searching (e.g. how keywords fail, how databases differ, how controlled vocabularies work [4]), then an AI tool that quietly produces a good ranked list with no effort from the searcher is arguably even worse for this purpose!
From that perspective, poor relevance ranking is not really a bug to be designed away. In a sense it is closer to a feature, because it creates the conditions in which these skills become necessary and therefore teachable. A student whose first search returns exactly what they want does not learn anything transferable about searching. A tool that does the work for the user implicitly devalues the skills the librarian is trying to develop, and so its strengths might register as threats rather than wins.
After all, friction is pedagogically useful, and one could even argue that better search results remove that friction!
At this point, we can ask whether all friction is useful for learning, and whether the friction removed by better search ranking was “useful friction” to begin with. My next post will address this interesting topic.
This orientation may also help explain a pattern in some librarian-led AI search evaluation frameworks, which are often designed by IL librarians. Retrieval effectiveness tends to be folded into a long list of weighted criteria alongside ease of use, transparency, accessibility, currency, and ethics. As I have argued in a recent post on AI search rubrics, retrieval should instead be treated as a gating criterion: if a tool cannot find the relevant literature, the other criteria do not rescue it, because surfacing relevant literature is the thing the tool exists to do.
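The difference between folding retrieval into a weighted list and treating it as a gate can be sketched in a few lines. Everything here is an assumption for illustration: the weights, the 1 to 5 scale, the threshold of 3, and the two hypothetical tools are not drawn from any real rubric.

```python
from typing import Dict

# Hypothetical rubric weights (sum to 1.0); criteria scored on a 1-5 scale.
WEIGHTS: Dict[str, float] = {
    "retrieval": 0.25,
    "ease_of_use": 0.15,
    "transparency": 0.15,
    "accessibility": 0.15,
    "currency": 0.15,
    "ethics": 0.15,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Plain weighted sum: strong secondary criteria can offset weak retrieval."""
    return sum(weights[name] * scores[name] for name in weights)


def gated_score(
    scores: Dict[str, float],
    weights: Dict[str, float] = WEIGHTS,
    gate: str = "retrieval",
    threshold: float = 3.0,
) -> float:
    """Retrieval as a gate: below the threshold, nothing else counts."""
    if scores[gate] < threshold:
        return 0.0
    return weighted_score(scores, weights)


# Two hypothetical tools: one polished but poor at search, one the reverse.
polished_but_weak = {"retrieval": 1, "ease_of_use": 5, "transparency": 5,
                     "accessibility": 5, "currency": 5, "ethics": 5}
plain_but_strong = {"retrieval": 4, "ease_of_use": 3, "transparency": 3,
                    "accessibility": 3, "currency": 3, "ethics": 3}

for name, tool in [("polished_but_weak", polished_but_weak),
                   ("plain_but_strong", plain_but_strong)]:
    print(name, round(weighted_score(tool), 2), round(gated_score(tool), 2))
```

With these made-up numbers, the weighted sum ranks the tool that cannot search well above the one that can, while the gated version fails it outright. Where exactly to set the threshold is a judgement call that this sketch does not settle.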
Lens 2: Evidence synthesis
The second lens is evidence synthesis: systematic reviews, scoping reviews, and meta-analyses. From this vantage point, the gold standard is near-exhaustive recall with documented and reproducible search strategies. Most AI search tools, even the best ones, do not deliver this by design [5].
Why not just ask an LLM for a Boolean search strategy? Plenty of literature since 2023 shows that prompting LLMs to generate Boolean strings produces poor results. Fine-tuning helps moderately. My current view is that a better way to approach acceptable quality is to give an agentic LLM access to PubMed and the MeSH browser and let it pilot-test the search the way a human searcher does. You can see my early attempt at building a Claude skill along these lines.
Beyond that, LLM-driven search is not deterministic, and you cannot publish a systematic review whose search step is “I asked Undermind.” From this lens, AI search is at best a supplement and at worst a distraction.
There are of course specialised AI search tools designed specifically for evidence synthesis. On one hand are those built on pre-LLM machine learning techniques, such as ASReview and Covidence, which rely mostly on active learning and supervised classifiers for screening prioritisation.
On the other hand, a growing number of LLM-native entrants have appeared, including those highlighted in Cochrane's recent announcement of selected AI tools for its innovative platform study. The boundary is blurring, however. Established players like Covidence, Rayyan, and DistillerSR have been retrofitting LLM features onto what were originally classical ML workflows.
These newer LLM-native tools are very new and largely untested. Within the systematic review community, established tools like Rayyan and Covidence are well known, but the recent LLM-first entrants remain unfamiliar even to most specialists. Elicit is perhaps the most mainstream LLM-native tool to enter this space, and recently claimed support for "PRISMA 2020 guidelines, making it reproducible, traceable, and auditable at every step" [6].
But notice what Elicit and similar tools actually do. Even they apply LLMs mostly to screening, deduplication, data extraction, and maybe risk-of-bias assessment, while leaving the actual search step to standard Boolean queries on PubMed. That tells us something important: high-recall search is still a hard problem for modern retrieval techniques [7].
This is not irrational. For evidence synthesis, high recall is non-negotiable. If your professional frame is systematic review support, then a tool with excellent precision but uncertain recall will naturally look insufficient.
So if these two lenses are all you have, AI search tools look unimpressive twice over. They are unnecessary, or even harmful, for the basic undergraduate case. They are not good enough for the systematic review case.
A view from outside the profession
A recent video by John Frechette, the CEO of Moara, is worth watching here, partly because it reaches conclusions similar to mine but from an outsider’s vantage point. He makes two observations, both of which I think are largely correct, and the second of which I expect more librarians to push back on.
You might suspect bias, as he runs a commercial AI research platform. But Moara is not mainly a search engine. It is closer to a next-generation reference manager or an AI-infused evidence synthesis workspace, somewhat like Covidence with more aggressive LLM use, helping researchers collect, screen, and synthesise sources from anywhere: Google Scholar, Undermind, Claude, conventional databases, PDFs, reference managers. In that sense, Moara is source-agnostic, and John has little incentive to favour one search tool over any other (he tends to praise Undermind and the use of Claude search outright in his demos).
First, he criticises the standard framing from some libguides of literature search as “pick a database, run your search, you are done.” He is right that this is misleading. Top-10 and top-20 overlap across major academic search engines is much smaller than this framing implies, and evidence synthesis methodology has long acknowledged that no single source has comprehensive coverage of any non-trivial broad topic. Though in fairness, many libguides are aimed at undergraduates who are not doing the level of formal review John has in mind.
Second, and more controversially, he notes that many university-recommended discipline-specific databases produced worse results than Google Scholar or Undermind, and that even IDEAS, despite his topic being in economics, returned irrelevant results past the first ten hits. Some librarians will take offence at this. I will say it plainly: I think he is right.
For many difficult, exploratory, or poorly lexicalised scholarly queries, the top-ranked results in conventional academic databases can be noticeably weaker than those from Google Scholar or newer deep-search tools like Undermind, Asta Paper Finder, or Consensus. This is not a knock on the metadata quality of those discipline-specific databases, which is often excellent. But ranking is a separate problem from coverage, and ranking is increasingly where the bottleneck sits. Discipline-specific databases optimised for Boolean retrieval over rich metadata were not built around the powerful semantic or agentic/iterative pipelines that modern AI search uses, and it shows.
To his credit, John does not argue you should ditch every database for one single AI search, no matter how good. He explicitly says one source is never enough. The point is not “AI search wins, traditional databases lose.” The point is that ranking quality varies, and a tool that reliably surfaces relevant work in its top 10/20/50 is doing something traditional databases often fail at.
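The earlier point about low top-10 and top-20 overlap across search engines is easy to check for your own queries. Below is a minimal sketch using Jaccard overlap of the top-k result sets. The source names and result IDs are hypothetical, and in practice you would first need to normalise results to a shared identifier such as a DOI.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def top_k_overlap(results_a: Sequence[str], results_b: Sequence[str], k: int) -> float:
    """Jaccard overlap of two top-k result sets: shared items / all distinct items."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


def pairwise_overlap(results: Dict[str, Sequence[str]], k: int) -> Dict[Tuple[str, str], float]:
    """Top-k overlap for every pair of sources, keyed by sorted source names."""
    return {
        (a, b): top_k_overlap(results[a], results[b], k)
        for a, b in combinations(sorted(results), 2)
    }


# Hypothetical top-5 results for the same query from three sources.
results = {
    "database_a": ["d1", "d2", "d3", "d4", "d5"],
    "database_b": ["d3", "d9", "d1", "d8", "d7"],
    "ai_tool_c": ["d1", "d3", "d6", "d2", "d10"],
}

pairwise = pairwise_overlap(results, k=5)
for pair, score in pairwise.items():
    print(pair, round(score, 2))
```

Low overlap on its own does not tell you which source is better. It only shows that each one surfaces a different slice of the literature in its top ranks, which is the case for consulting more than one.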
The missing middle: narrative reviews
What both the information literacy and evidence synthesis lenses can miss is probably the most common literature search scenario in academia: the ordinary scholarly literature review, often known as scholarly discovery.
Most academics writing the literature review section of a paper, a thesis, a grant application, or a book chapter are not doing systematic reviews.
Don’t get me wrong, they still need to demonstrate command of the relevant literature, identify the key debates and findings, and situate their own contribution. They want practical coverage and recall: enough confidence that they have not missed the main debates, methods, authors, and findings, combined with high precision because their time is finite and reading irrelevant papers is expensive.
This is exactly the niche where the better AI search tools deliver the most value. Undermind, Asta Paper Finder, and Consensus (either alone or more likely in combination with other tools) in my testing tend to produce a tightly relevant result set with strong precision in the top ranks, at the cost of not exhausting the long tail. For a narrative review, that tradeoff is excellent. For a systematic review, it is insufficient on its own. For a freshman essay, it is overkill.
When I look at which researchers most enthusiastically adopt these tools, it is overwhelmingly people doing narrative reviews. That is not a coincidence.
The missing or less visible third camp
I am not saying no librarian cares about this use case (I exist!). The line between supporting the “undergraduate finding 5 peer-reviewed articles” and supporting researchers on capstone or thesis work is thin, and plenty of academic librarians do both. There is a long history of domain-informed, researcher-facing discovery support: the subject specialist (or reference librarian or digital scholarship librarian) helping a researcher navigate a field, identify important work, trace debates, follow citations, recognise key authors, and avoid obvious blind spots. Librarians in that role will definitely benefit from a strong AI search tool along these lines.
But my thesis is that in the library profession today, search support has become most publicly legible in, and focused on, two areas [8]. On one side is undergraduate IL. On the other is evidence synthesis, especially in health sciences. Between them sits the ordinary working researcher: the faculty member, PhD student, postdoc, policy researcher, or grant writer who is not doing a systematic review, but still needs better literature discovery than a basic database search provides.
If librarians do not make themselves visible in that space, researchers will not wait (they already aren’t). They will use Undermind, Elicit, Consensus, Asta Paper Finder [9], Google Scholar, Claude, ChatGPT, Semantic Scholar, Research Rabbit, Connected Papers, citation alerts, PDFs from colleagues, and whatever else works. They will build their own messy discovery workflows, with or without us.
A renewed third camp would not treat AI search as magic. People in this camp would also not dismiss it because it fails systematic review standards or complicates undergraduate pedagogy. They would evaluate these tools the way subject specialists have always evaluated search tools: by asking whether they can surface important, relevant, and useful literature for real research questions. That camp would be willing to say:
An AI search tool can be imperfect and still useful.
A tool can be unsuitable for systematic reviews and still valuable for researchers.
A tool can lack full transparency and still deserve attention if its retrieval performance is strong.
A tool should not be recommended just because it is easy to use, ethical, accessible, or institutionally licensed if it has sub-par retrieval and ranking.
A tool should first be judged by whether it finds relevant literature for the task at hand. Only after that should we ask the other questions.
This is not an argument for ignoring privacy, ethics, accessibility, transparency, or sustainability. Those matter. But they should not be used to avoid the retrieval question. If a search tool does not search well, it has failed at its primary task.
The deeper point is not that librarianship has never occupied this middle territory. It has. The point is that this role needs to become more visible again in the age of AI search. The third camp does not need to be invented from scratch. It needs to be recovered, updated, and have a larger voice at the table.
Conclusion
To summarise the use cases:
Freshman undergraduate work: mid-to-high precision, recall does not need to be high. Almost any database with the right coverage will do.
Evidence synthesis: high recall is non-negotiable, and you accept whatever precision you can get. Multiple sources, documented strategies, traditional databases plus maybe AI tools as supplements.
Narrative reviews: moderate recall for the time spent, high early precision, and strong top-ranked relevance. This is where the better AI search tools sit, and it is the use case where librarian voices are thinnest.
The smarter evidence synthesis librarians have already figured out that there is nothing preventing them from using Undermind as one source among many in a multi-source strategy, and I have seen several publicly endorse this approach. That is exactly right. But the deeper move the profession still needs to make is to recognise that the narrative review may be a use case underserved by the loudest professional voices…
[1] To be fair, some of these tools don’t just use an LLM to generate nested Boolean queries, but also use hybrid search and powerful reranking systems (e.g. see my reviews of Scopus AI, Primo Research Assistant), which can lead to bigger improvements. But in general quite a few tools rely solely or mainly on an LLM to generate Boolean search strings, which leaves limited room for improvement.
[2] I know it sounds like I am gushing too much about Undermind.ai. But it is good. Some other data points: Farhad Shokraneh, a highly respected evidence synthesis expert, has praised Undermind many times in his webinars for exploratory search. So has John Frechette, CEO of Moara. Most independent testing of AI search tools for pure relevancy that I have seen (which unfortunately tends to be rare and often of somewhat poor quality) shows Undermind on top (here, here, here) or, if not, second (when the tester has a dog in the fight, e.g. here, here). It has weaknesses of course, e.g. its index covers only articles, so even Primo Research Assistant can be competitive if the test query requires non-scholarly articles, and Claude can beat it by finding relevant grey literature.
[3] In case you are wondering, they (faculty and PhD students) do not react this way to all AI search tools (and we trialed a lot from 2024-2025).
[4] When I demo search tools, I don’t prep my queries in advance much. But I am not above showing certain queries, knowing that they typically give me opportunities to show off other features. In theory, a very good search result will reduce such opportunities.
[5] It seems to me that the traditional systematic review method (search broadly with Boolean strategies, within reason, then screen the retrieved set) remains hard to beat for high-recall evidence identification. It is essentially a controlled brute-force approach: instead of trusting the retrieval system to decide what matters, it shifts the burden to transparent over-retrieval and human screening.
In theory, advanced agentic search might match or beat it if you were willing to set aside reproducibility and interpretability.
The deeper issue is that the mainstream information retrieval field is not really working on this problem as much. Mainstream IR evaluation today is more focused on multi-hop QA and hard needle-in-haystack benchmarks like BEIR, HotpotQA, and BrowseComp-Plus, all of which reward finding the one document that answers the query rather than finding everything on a topic.
[6] Of course, PRISMA 2020 is a reporting guideline, not a search methodology…
[7] See footnote 5.
[8] I have the impression my blog is read mainly by information literacy librarians, librarians/researchers into evidence synthesis and the odd Library Systems person. Attendance at the workshops I have conducted splits about 50-50 between the first two camps.
[9] Two weeks after AI PaperFinder launched, a faculty member was telling me he was already using it because he had heard from his peers it was good. He’s an exception, an early adopter, but we should not underestimate how fast good research tools are adopted via word of mouth without librarians even hearing about them.













Hi Aaron - great post as always. I think that outside of academia there are many librarians working in this middle ground space, who are very visible within their own organisations. This is the bread and butter of the medical library world. We both teach and undertake searches within this space, supporting junior doctors, research offices, guideline creation, and clinical improvement projects. However, sadly we are also in a sector where publishing our work is not encouraged or actively supported. We are likely to need to source support for undertaking research into the quality/validity of AI tools within our safety-first area of the library world; much of the research in this space is currently being undertaken by clinical staff without our input. Thanks for the musings :)
I can't help feeling that there are many valuable insights here but that - as you point out - researchers are going to use some of these tools with little knowledge (and without reading your article). If the library's role is discovery in the broad sense, not just limited to carrying out systematic reviews, where its expertise is acknowledged, how can the library demonstrate its relevance? Maybe your workshops, or for those who cannot attend them, some practical guidance on using Undermind, Consensus, and so on.