“We’re Good at Search”… Just Not the Kind the AI Era Demands: A Provocation
Recently, at a conference, a librarian from a prestigious institution surprised me by confessing that he and his colleagues were struggling to grasp the issues surrounding the impact of AI. He added that my talk had helped clear much of the fog around how to think about AI’s impact on search.
His confession wasn’t an isolated one. Many librarians I speak with admit they struggle to keep up with the blizzard of new AI-powered search engines. More importantly, I sense that many of us lack the right mental models to properly discuss, analyze, and evaluate them.
The Pressure We Feel
Here’s the uncomfortable truth: we librarians have long held a self-image as masters of search—or at least, competent practitioners. This identity creates immense pressure to stay on top of “AI search” and project understanding. Yet many of us quietly feel inadequate, struggling to reconcile our traditional expertise with these emerging tools.
The reality is that we are good at searching—just in ways that differ from what may be needed now.
What We’ve Always Been Good At
Our traditional strengths in search are considerable. The best reference librarians among us possess remarkable resourcefulness, drawing from numerous sources and techniques to unearth answers that would elude even the most persistent Google user. In our domains of expertise, we can locate information—both online and offline—that others simply cannot find.
Many of us have mastered database searching within a specific paradigm: Boolean retrieval, with ranking by TF-IDF or BM25. We are expert users of proximity operators and filters, and the finest medical librarians know MeSH like the back of their hand. Others have equivalent expertise in LCSH or specialized thesauri.
Evidence synthesis librarians come closest to theoretical information retrieval expertise, with their knowledge of piloting searches, validating hedges, and understanding retrieval metrics like sensitivity, precision, specificity, and negative predictive value. Some even research stopping points for active learning tools like ASReview.
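For readers less familiar with those metrics, here is a minimal sketch of how they fall out of a labelled screening set. The counts below are hypothetical, purely for illustration:

```python
# Confusion-matrix metrics used to validate a search hedge.
# All counts below are hypothetical, for illustration only.
tp = 90    # relevant studies the search retrieved
fn = 10    # relevant studies the search missed
fp = 400   # irrelevant records retrieved
tn = 9500  # irrelevant records correctly excluded

sensitivity = tp / (tp + fn)  # a.k.a. recall: did we find the relevant studies?
precision   = tp / (tp + fp)  # how much of what we retrieved is relevant?
specificity = tn / (tn + fp)  # how well did we exclude irrelevant records?
npv         = tn / (tn + fn)  # if we excluded a record, how sure are we it's irrelevant?

print(f"sensitivity={sensitivity:.2f}, precision={precision:.3f}, "
      f"specificity={specificity:.3f}, NPV={npv:.4f}")
```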
[Added Nov 2025] I would say the knowledge needed to conduct the relatively new Study Within A Review (SWAR) is approximately the type of skillset we need, and it remains rare among librarians.
Another search-related area is how we teach users SIFT (which involves searching to do lateral reading). This too is likely to shift now that we have extremely powerful search tools like Google’s AI Mode to do the heavy lifting of searching to help the human with validation; see the work of Mike Caulfield, of SIFT fame, on this emerging and exciting new area.
The problem? These hard-won skills offer little preparation for the world we live in now.
We now inhabit a world defined by unfamiliar terminology: natural language search, semantic search, dense embeddings, vector embeddings, retrieval augmented generation, deep research, agentic search. These aren’t just buzzwords—they represent fundamentally different families of approaches to information retrieval with different trade-offs and implications as we move beyond just Boolean search.
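To make one of these terms concrete, here is a minimal sketch of semantic search over dense embeddings, assuming the open-source sentence-transformers library (the model name is just a common public checkpoint, not a recommendation):

```python
# Minimal dense-retrieval sketch: documents and query become vectors,
# and relevance is cosine similarity rather than keyword overlap.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Boolean retrieval with MeSH terms for systematic reviews",
    "Transformer embeddings for semantic literature search",
    "COUNTER usage statistics for library e-resources",
]
query = "finding papers by meaning rather than exact keywords"

doc_vecs = model.encode(docs)             # each document -> dense vector
query_vec = model.encode(query)
print(util.cos_sim(query_vec, doc_vecs))  # the semantic-search doc should score
                                          # highest despite sharing almost no
                                          # keywords with the query
```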
I’m skating over a crucial question here: are these AI advancements genuine improvements over traditional search methods, or are we caught up early in a hype cycle with few real benefits? For example, does semantic search really retrieve more relevant results for scholarly research, or does it just feel more modern? Even if these methods objectively deliver higher recall and precision, are they worth the costs in reduced interpretability, reduced reproducibility, unknown potential for bias, and the lack of exact search hits? There are also legitimate reasons why medical librarians still depend on Boolean search with MeSH: precision, control, and reproducibility matter, especially for systematic reviews and evidence synthesis.
I’ll address this tension directly in my next post. But for now, let me argue that regardless of where you stand on the “improvement vs. hype” question, librarians need to understand these technologies well enough to critically evaluate them.
Our Evaluation Blind Spot: A Provocation
This brings us to an odd thing I have noticed.
When librarians assess a new search engine or database (AI-powered or not), we often focus on familiar factors: the UI, the availability of filters, the sources covered, privacy terms, vendor support, and even niche library requirements like COUNTER support or link resolver integration.
Strangely, the effectiveness of the core relevancy system is, if not an afterthought, rarely the central focus.
I might be exaggerating slightly, but if you look at the few new evaluation matrices for AI-powered search circulating, “relevancy” is often just one of several categories, evaluated in a highly subjective and “I-know-it-when-I-see-it” manner.
This is baffling, given that a search engine (AI-powered or not) lives and dies on its ability to retrieve relevant results. Even generative tools that author reports are building on a house of cards if their underlying retrieval system fails to find the most relevant items.
I am not saying features like link resolver support aren’t important. But notice the asymmetry: it is far easier for a product with a proven relevancy model to add link resolver support than the reverse. I’ve also noticed that most feature “innovations”, like specific visualizations in Deep Research tools, are easily and quickly copied; relevancy models are much harder to copy. All this makes it even more important to double down on choosing products by how good they are at retrieval.
Why this blind spot? Let me advance a hypothesis.
Because for the last 15 or 20 years, academic databases have all functioned in fundamentally the same way.
Index content in an inverted index
Retrieve with Boolean
Rank with some variant of TF-IDF or BM25
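In code, that classic stack looks roughly like the toy sketch below. This is not any vendor’s actual implementation, just the shape of the idea:

```python
# Toy sketch of the classic pipeline: inverted index -> Boolean AND -> TF-IDF rank.
# Illustrative only; real systems (Lucene and friends) are far more sophisticated.
import math
from collections import defaultdict

docs = {
    1: "heart attack risk factors",
    2: "myocardial infarction treatment outcomes",
    3: "risk factors in cardiovascular treatment",
}

# 1. Index content in an inverted index (term -> set of doc ids)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# 2. Retrieve with Boolean (here: AND over all query terms)
query = ["risk", "factors"]
hits = set.intersection(*(index[t] for t in query))

# 3. Rank with TF-IDF (simplified: tf * idf summed over query terms)
N = len(docs)
def score(doc_id):
    words = docs[doc_id].split()
    return sum(
        (words.count(t) / len(words)) * math.log(N / len(index[t]))
        for t in query
    )

for doc_id in sorted(hits, key=score, reverse=True):
    print(doc_id, docs[doc_id], round(score(doc_id), 3))
```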
This is admittedly oversimplified (we glide past BM25F, proximity, citation/authority signals, and learning-to-rank in some academic systems), and things like Google and Google Scholar break the mold (which is also why we initially struggled with them). But in such a world, the main differentiator wasn’t usually the retrieval algorithm, which was a commodity. What mattered most was coverage (which journals do you have?) and, secondarily, the user-friendliness of the interface.
The Wild West of AI-Powered Search
Today’s landscape couldn’t be more different.
Many of the new AI search engines (like Elicit, Consensus, Undermind, etc.) draw from the same pool of open content (like Semantic Scholar or OpenAlex). They are competing almost entirely on differences in their retrieval and ranking capabilities.
Unlike a decade or two ago, there is little standardization, and we’re witnessing an explosion of search innovation fueled by transformer models, spanning embeddings, LLM-based reranking, agentic search, and more.
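As one concrete example of the reranking flavor of this innovation, here is a hedged sketch using a cross-encoder from the sentence-transformers library (the checkpoint name is a common public model, not an endorsement, and production systems differ):

```python
# Sketch of transformer-based reranking: the cross-encoder scores each
# (query, document) pair jointly, which is slower than embedding search
# but usually more accurate for ordering a candidate list.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "effect of exercise on depression in older adults"
candidates = [
    "Aerobic exercise and depressive symptoms in adults over 65",
    "Exercise equipment market trends 2023",
    "Depression screening instruments: a comparison",
]

scores = reranker.predict([(query, doc) for doc in candidates])
for doc, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(round(float(s), 3), doc)
```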
This is the wild west, and implementations vary wildly; the gap between the best and the merely flashy is large.
Proper independent, task-grounded evaluations of off-the-shelf AI-powered academic search tools are still rare, but the ones I have seen tend to place the usual suspects, Undermind.ai and Elicit.com (the paid version; the free research report flow is too limited), consistently at or near the top, showing that excellence in retrieval algorithms is not just subjective or random.
On the other end of the spectrum, I have unfortunately seen many lazy, horrible implementations, e.g. naively prompting an LLM to generate a Boolean search strategy and assuming it will produce good relevancy results.
A Glimpse of the Future
This trend will accelerate. In a potential future where content owners provide access via MCP (Model Context Protocol), like a Wiley AI Gateway, the content pool becomes even more equalized, assuming the agent has the same entitlements as the user. The only thing distinguishing one search agent from another will be its retrieval capability, its “secret sauce” for finding the best results.
In an MCP-dominated landscape, search agents wouldn’t directly control what’s returned—they’d specify queries while configured tools handle the actual retrieval. This would make agents more similar than in a world where each maintains centralized indexes, though they’d still distinguish themselves by strategic decisions about what to search and which sources to prioritize.
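For the curious, here is a hedged sketch of what such an agent-to-tool request might look like. The tool name and arguments are entirely hypothetical; only the “tools/call” JSON-RPC envelope comes from the actual MCP specification:

```python
# Hypothetical MCP tool call from a search agent to a publisher gateway.
# Only the JSON-RPC "tools/call" envelope follows the MCP specification;
# the tool name and arguments are invented for illustration.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_fulltext",   # hypothetical publisher-side tool
        "arguments": {
            "query": "CRISPR off-target effects review",
            "limit": 20,
        },
    },
}
print(json.dumps(request, indent=2))
# The agent decides WHAT to ask and WHICH source to ask;
# the publisher's tool controls HOW retrieval actually happens.
```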
The Path Forward
Theoretical knowledge without practical experience is sterile, but practical knowledge without theoretical understanding is blind.
My proposal is straightforward: librarians today should develop competency in information retrieval as a discipline.
That means understanding how vector embeddings work, what makes semantic search different, how retrieval augmented generation functions, and how to test retrieval performance formally. Incidentally, all this and more is discussed on this blog!
But theory alone isn’t enough—this knowledge needs to be paired with hands-on experimentation and testing. Run actual searches. Compare results. Break things in controlled ways to understand their failure modes. Only then can you bridge the gap between knowing why a system should work and seeing how it actually behaves with real queries and real collections.
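As a starting point for that kind of testing, here is a minimal sketch of formal retrieval evaluation with precision@k and recall@k. The identifiers are placeholders; in practice you would build the judged relevant set by hand, for queries you know well:

```python
# Evaluate a ranked result list against a hand-built set of known
# relevant items. All identifiers below are placeholders.
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

relevant = {"doi:10.1/a", "doi:10.1/b", "doi:10.1/c"}                 # judged set
retrieved = ["doi:10.1/a", "doi:10.1/x", "doi:10.1/c", "doi:10.1/y"]  # system output

print("P@4 =", precision_at_k(retrieved, relevant, 4))            # 0.5
print("R@4 =", round(recall_at_k(retrieved, relevant, 4), 2))     # 0.67
```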
Without theory, you’re reduced to blind trial-and-error with no ability to make connections. Without practice, you’re working from assumptions that may not survive contact with messy reality. Both matter.
While I cannot give you the practical experience of trying these tools out, I can try to give you the theoretical foundation. This is why we’ve developed a workshop on these topics.
The landscape has shifted. It’s time our understanding shifted with it.


Of all these, I do believe that learning and developing tools and methods to evaluate things like relevant retrieval and appropriate/useful summarization are crucial. One of the reasons I focus so much on contextualization and verification is that they are very amenable to the development of model responses/response rubrics that can be compared with a system’s output to better understand the weaknesses and strengths of various systems and, even more importantly, the specific conditions under which they fail.
I am not even sure this has to be done at the highest level of formality (unless you are looking to publish). But I find it odd when people in this space do *not* have at least a half dozen ready-to-go challenges to test the behavior of a given system along multiple dimensions.
That’s a strong message and call to action, or a call for learning. It sounds good.
I wonder: are there different evaluation methods for tools as used by librarians versus tools as used by end users? It seems to me that these are two quite different situations.