The Horseless Carriage of AI Search: Why Using LLMs to Generate Boolean Alone Is Likely of Little Benefit
Not exactly saying Boolean must die.....but... maybe less focus?
TL;DR: Turning natural language into Boolean is not the future of library AI search. It can help with some failed queries, but the bigger problem in most discovery systems is weak ranking, not the need for better Boolean. More radically, I conclude by arguing Boolean itself may no longer be the right foundation for retrieval in an era of BM25 (loose/fuzzy boolean), hybrid search, and agentic search.
There is a growing trend among library vendors and AI search startups to use LLMs to convert natural language queries into Boolean search strings (dubbed LLM-to-Boolean from now on). Examples that use this method as part or whole of their retrieval pipeline include Scopus AI, Web of Science Research Assistant, Web of Science Smart Search, Primo Research Assistant, Summon Research Assistant, EBSCOhost Natural Language Search, Scite.ai Assistant, and many more. Universities are also building their own versions, such as Stony Brook University’s SEARCH AI and San Diego State University’s OneSearch AI Assistant.
With this many vendors and institutions converging on the same approach, you might reasonably assume this is what “AI-powered search” means — that this is the important innovation, perhaps even the whole story of how AI will transform library search.
It is not. Not even close. Using an LLM to convert natural language to Boolean may be the horseless carriage of AI-powered search1. When the automobile was invented, the first instinct was to build something that looked like a horse-drawn carriage but with a motor bolted on. It took years before designers realised that the new technology demanded a fundamentally different form. LLM-to-Boolean is the same kind of thinking: it takes the most powerful text-understanding technology ever built and uses it to produce the exact same artefact — a Boolean query string — that librarians have been crafting by hand for decades. In the simplest implementations, the LLM is mainly a more elaborate query-construction layer placed in front of an otherwise conventional lexical search engine.
I am not saying this approach has zero value. For a novice user who types a full sentence into a strict Boolean search engine and gets zero results, having an LLM translate that into a workable query is a real improvement. But used alone, without modern reranking (e.g. neural ranking, embedding retrieval methods, or agentic search), I am skeptical LLM-to-Boolean does much to improve result quality for most users2. It is important for vendors and librarians to realise this is not the main event; it is a minor supporting act.
Retrieval versus ranking: the distinction that matters
If we are serious about improving search in library discovery systems, we need to stop tinkering with query construction alone and start addressing the real bottleneck: what happens after retrieval.
Information retrieval teaches that search engines do two things: they retrieve documents (deciding which ones make it into the result set) and they rank those documents (deciding the order you see them in). These are distinct stages, and improvements to one do not automatically improve the other.
Boolean only retrieves a candidate set of results. Strictly speaking, every document that matches the Boolean logic is treated as equally relevant. Ranking — typically based on lexical relevance functions like TF-IDF (term frequency - inverse document frequency) or its more sophisticated successor BM25 — is what determines the order you actually see results in. Most library search systems already have both stages, even if the ranking layer is not always well documented.
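To make the "retrieval only" point concrete, here is a minimal sketch of strict Boolean AND over a toy inverted index. The three documents and the helper names (`build_index`, `boolean_and`) are my own illustrations, not any vendor's implementation. Note that the output is a plain set: nothing in it says which match is better.

```python
import re

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "open access articles receive more citations",
    "d2": "citation counts and open peer review",
    "d3": "access to open data in biology",
}

def tokens(text):
    """Lowercase word tokens, with no stemming (so 'citations' != 'citation')."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokens(text):
            index.setdefault(tok, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Strict AND: a document is either in the result set or out of it."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
matches = boolean_and(index, ["open", "access"])
print(matches)  # an unordered set: d1 and d3 are treated as equally relevant
```

Everything a user experiences as "relevance" has to come from a separate ranking step applied to that set.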
Most library and academic search systems rely heavily on lexical retrieval paired with lexical, often proprietary, relevance ranking. That ranking is typically TF-IDF- or BM25-like: a combination of term frequency, document frequency, document length normalisation and field weighting, with some additional weighting for citation counts.
Scopus, for example, states: “The more often a term occurs in a document, the more likely that it is relevant to the topic of the article” and “Not every word is equally important. A term that occurs in nearly all documents will score less than something unusual. We use calculations based on Term Frequency/Inverse Document Frequency (TF/IDF) (a concept originally introduced by Karen Spärck Jones, 1972) to assign a weight to any particular word in any collection of documents.”
LLM-to-Boolean is entirely focused on the retrieval stage. It changes which documents get retrieved, which matters. But it does nothing to upgrade the ranking stage — the part that determines whether the best results appear on your first page or are buried on page five. In an era when users routinely face thousands of results, ranking is what determines what they actually see.
The problem is that BM25, which underpins ranking in most library databases, is decades-old technology3. It scores relevance by counting how often your search terms appear in a document and how rare those terms are across the corpus. It has no understanding of meaning, context, or user intent.
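For readers who want to see what "counting terms and rarity" amounts to, here is a minimal sketch of one common BM25 variant (the Lucene-style IDF, with the usual defaults k1=1.5 and b=0.75). The corpus is a hypothetical toy, and real library systems add field weights and proprietary tuning that this omits.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    df = Counter()                      # document frequency per term
    for toks in tokenized.values():
        df.update(set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)              # term frequency within this document
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue                # no literal match, no credit
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[d] = score
    return scores

# Hypothetical toy corpus.
docs = {
    "a": "citation advantage of open access",
    "b": "open data open science open source",
    "c": "library budgets and staffing",
}
scores = bm25_scores(["open", "access"], docs)
```

Document "a" outscores "b" because it matches the rarer term "access", while "c" scores zero. The function only ever sees strings: a document that discusses the same idea in different words gets no credit at all.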
Who does LLM-generated Boolean actually help?
There are broadly three types of users to consider.
Novice users have the following failure modes.
Too aggressive with keywords: ANDing too many specific terms and getting zero results. LLM-to-Boolean might help here by adding synonyms, though in my experience it often does not, because the underlying search is simply too strict.
Too broad a query: a single or overly general term, leaving the user overwhelmed by thousands of results. LLM-to-Boolean may actually make this worse by adding synonyms that broaden the set further. What this user needs is better ranking, not more synonyms.
Natural language as query: users who think they are working with a natural language search system type a full question and get zero or bizarre results because the system matches “what,” “is,” “the” literally. LLM-to-Boolean helps here by parsing the question into workable Boolean.
For the increasing number of users who expect natural language search and get zero results because the system cannot parse their sentence, LLM-to-Boolean is a genuine improvement over nothing. But this is a relatively narrow use case: even without it, many users quickly adjust and switch to keywords when they notice zero results.
Expert searchers who construct systematic review search strategies. These users gain nothing. They build lengthy, carefully piloted Boolean strategies using controlled vocabulary and proximity operators. As of 2026, LLM-based Boolean generation is still not reliable enough to replace expert searchers, although recent benchmark work — notably the AutoBool project using reinforcement learning — has narrowed the gap on some datasets. LLMs can be useful for brainstorming, but they cannot serve as an automated replacement. Expert searchers are not going to benefit from an LLM doing what they already do, and doing it less reliably4.
Users who type in reasonably simple keywords. This is the largest group and the one that matters most — researchers looking for relevant literature, students doing coursework, information literacy librarians conducting narrative reviews.
This group either types in 3-5 reasonable keywords or constructs simpler versions of the nested Boolean, of the form (A OR B) AND (C OR D), used by systematic review searchers.
Many of these users do not have a retrieval problem: their keywords already retrieve relevant documents. They have a ranking problem. The relevant documents are in the result set but buried beneath less relevant ones. What they need is not a fancier query; they need the search engine to put the best results at the top. LLM-to-Boolean does nothing about this — and, as I will argue, can actually make things worse.
The synonym expansion problem
To understand how LLM-to-Boolean can degrade results for that third group, we need to look at what it actually does to a query.
When an LLM converts a natural language query to Boolean, it almost always adds a large number of synonyms. Strip away the marketing language and what the LLM is doing — beyond dropping stop words and extracting key concepts — is essentially query expansion: transforming a user’s simple keywords into nested Boolean strings with multiple OR’d synonyms per concept5.
Query expansion is not new6. It can improve recall by retrieving more potentially relevant documents. But it also injects noise by retrieving many more marginal or incidental matches. Whether that trade-off is worthwhile depends entirely on whether the synonyms added fit the searcher’s intent exactly and whether the search system has a strong enough ranking layer to separate signal from noise.
In the next section, I offer three separate arguments for why LLM-generated Boolean does not necessarily improve things.
Argument 1: Nested Boolean is a search paradigm that is no longer useful in today’s search environment
I argued as early as 2014, well before the emergence of LLMs, that nested Boolean searching had become far less effective in many modern search environments and use cases (outside of evidence synthesis). The traditional nested Boolean strategy, decomposing a topic into concepts, enumerating synonyms for each, and combining them in the form (A1 OR A2 OR A3) AND (B1 OR B2 OR B3), was developed for a very different retrieval context: one characterised by limited full-text coverage, relatively small databases, and a significant risk of zero results. Under those conditions, extensive synonym expansion was often both rational and necessary.
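The (A1 OR A2 OR A3) AND (B1 OR B2 OR B3) pattern is mechanical enough to sketch in a few lines. The helper `nested_boolean` and the second concept group below are hypothetical, chosen only to show the shape of the string that a librarian or an LLM ends up producing.

```python
def nested_boolean(concepts):
    """Join synonyms with OR inside each concept, then join concepts with AND."""
    groups = []
    for synonyms in concepts:
        # Quote multi-word phrases so they are searched as phrases.
        quoted = [f'"{s}"' if " " in s else s for s in synonyms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

query = nested_boolean([
    ["heart attack", "myocardial infarction"],
    ["aspirin", "acetylsalicylic acid"],   # illustrative second concept
])
print(query)
```

Every synonym added inside a group widens the set, and every extra group narrows it. The structure itself says nothing about which of the resulting matches matter most.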
Contemporary search environments differ markedly. Full-text indexing is now widespread, collections routinely contain hundreds of millions of records, and stemming or related linguistic normalisation techniques handle many lexical variants automatically. In such contexts, straightforward keyword searches already tend to achieve acceptable recall and the focus is on precision. Adding long lists of OR-connected terms therefore often produces an explosion of incidental matches. A monograph that mentions “cardiac event” once in passing on page 247 may enter the result set alongside a systematic review centrally concerned with the topic7. Even in systems without full-text search, the average query now tends to retrieve far more material than in the past, which strengthens the case for improved ranking rather than more aggressive expansion.
Argument 2: LLMs today add poor synonyms that create more noise and over-stress outdated relevance ranking systems
LLM-to-Boolean systems amplify this longstanding problem. Worse still, inspection of Boolean strings generated by current systems, especially those relying on lower-cost bundled models such as GPT-4o mini in products such as Primo Research Assistant, suggests that the generated synonyms are often of uneven quality. They may include loosely related terms, overly broad variants, or, in some cases, plainly inappropriate substitutions. Each additional weak term expands the candidate set while simultaneously increasing the proportion of noise within it.
The central mechanism is straightforward. Query expansion may assist strict Boolean retrieval by increasing the number of documents admitted into the result set, but it can also impair relevance ranking under models such as TF-IDF and BM25 by diluting the discriminative value of the user’s original terms. Once a query is expanded from a small number of strong keywords into multiple concepts populated by weaker synonyms, the ranking function may assign undue weight to documents matching several marginal variants. As a result, documents that align only loosely with the user’s actual intent may be ranked above those that match it more precisely. Because many library discovery systems combine Boolean retrieval with TF-IDF or BM25-style ranking, LLM-generated synonym expansion can improve one stage of the retrieval process, at least in the narrow sense of increasing potential recall, while simultaneously degrading another. In the absence of a sufficiently strong reranking layer, the result is not simply a larger result set, but a larger and less coherent one.
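A toy example can show the dilution mechanism at work. The sketch below scores a three-document hypothetical corpus with a basic BM25 variant, first with the user's two keywords and then with an expanded term list of the kind an LLM might generate. The corpus, the expansion terms and the parameter values are all assumptions for illustration; production rankers are more elaborate.

```python
import math
from collections import Counter

# Hypothetical corpus: "x" is squarely on topic, "y" only uses related wording.
DOCS = {
    "x": "heart attack outcomes after aspirin therapy",
    "y": "chest pain and cardiac event or cardiovascular incident in athletes",
    "z": "library budgets and staffing trends",
}

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Return document ids ordered by a basic BM25 score, best first."""
    tokenized = {d: t.lower().split() for d, t in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        scores[d] = sum(
            math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            * tf[t] * (k1 + 1) / (tf[t] + length_norm)
            for t in query_terms if tf[t]
        )
    return sorted(scores, key=scores.get, reverse=True)

original = ["heart", "attack"]
expanded = original + ["cardiac", "event", "chest", "pain", "cardiovascular", "incident"]
rank_original = bm25_rank(original, DOCS)
rank_expanded = bm25_rank(expanded, DOCS)
```

With the original keywords, "x" is ranked first. After expansion, "y" overtakes it simply because it matches six of the weaker variants while "x" still matches only the two original terms. Nothing about "y" became more relevant; the query just stopped discriminating.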
Argument 3: There are search-resistant concepts that are better NOT represented in the Boolean search
There is also a further complication. Sometimes, somewhat counter-intuitively, trying to represent hard-to-search concepts, what Farhad Shokraneh calls search-resistant concepts, is a mistake. The reason is simple: if you do not know the right terms, forcing them into the query can exclude many relevant papers. In some cases, trying to be more precise actually makes the search worse. A good example is PICO-based searching. Outcome terms are often expressed in highly variable, non-standardised ways, so it is often better not to filter for outcomes at all, and instead accept a larger set for later screening. Do LLMs that expand searches recognise this distinction?
While evidence synthesis librarians are more willing to look through 100% of a much bigger result set, most other users will not look through every result. Here again, a much stronger ranker is where most of the value of “AI search” lies.
PubMed: what good query expansion looks like
Compare LLM-generated synonyms to a system that does query expansion well: PubMed. PubMed’s Automatic Term Mapping (ATM) also expands user queries with additional terms, but the quality difference is vast8.
Among other things, ATM maps user terms to MeSH (Medical Subject Headings) — a carefully curated, hierarchically structured controlled vocabulary maintained by domain experts over decades. When ATM expands "heart attack" to include "myocardial infarction," it draws on a tested thesaurus. When an LLM expands "heart attack" to include "cardiac event," "chest pain," and "cardiovascular incident," it generates plausible-sounding terms with no guarantee they map to the same concept.
PubMed also has architectural advantages. Its users often prioritise recall over precision, making aggressive expansion a more defensible trade-off. It searches citation and abstract records plus MeSH metadata but not full text, making every keyword match less likely to be incidental. It covers a well-defined domain (life sciences), reducing the risk that added terms have multiple unrelated meanings.
Most critically, PubMed has a superior ranking system. It uses a two-stage architecture with LambdaMART reranking on the top 500 results, which is far more sophisticated than generic BM25. So even when ATM broadens the result set, the reranking step ensures the most relevant documents rise to the top. The expansion and the reranking work together as a system. This is precisely the architecture most library database search systems lack.
What does the empirical evidence say?
Among the systems that use an LLM to convert input to Boolean, Primo Research Assistant is probably the best studied.
One of the more rigorous studies is Galbreath et al. (2025), which compared Ex Libris’ Primo Research Assistant (PRA), which at the time of the study used GPT-3.5 to convert natural language into Boolean, against Boolean searches crafted by instruction librarians at Washington State University.
The headline finding: no appreciable difference in topical relevance. PRA returned relevant sources 46.3 per cent of the time; librarian-built searches returned 45.6 per cent. I was mildly surprised at first at the results. But this result needs careful reading.
For each query, Primo Research Assistant generates ten Boolean variant strings and runs the search; the top five results are then sent to the LLM for summary generation.
Because Primo Research Assistant generates answers using retrieval augmented generation based only on the top 5 ranked results, it is critical that the relevant results make it into the top 5, which requires a very strong relevance ranking system.
Perhaps this is why PRA does more than pure LLM-to-Boolean and adds an additional reranking step beyond the usual Primo relevance ranking. It retrieves the top thirty matching records using the usual method and then reranks them using vector embeddings (semantic ranking).
Even though reranking the top 30 is still a relatively modest effort (many state-of-the-art systems rerank the top 100!), and the semantic reranking will still fail if the conventional ranking does not get relevant results into the top 30, it still helps a lot.
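The shape of that "retrieve, then rerank a window" step can be sketched as follows. This is not Ex Libris's code: the vectors are hand-made stand-ins for what an embedding model would produce, and `rerank`, `depth` and `keep` are names I have made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(query_vec, ranked_ids, doc_vecs, depth=30, keep=5):
    """Re-sort only the first `depth` lexical results by embedding similarity."""
    window = ranked_ids[:depth]   # anything ranked below `depth` is never seen
    by_similarity = sorted(
        window, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )
    return by_similarity[:keep]

# Hypothetical toy data: the lexical ranking put d4 last, yet it is close to the query.
lexical_order = ["d1", "d2", "d3", "d4"]
doc_vecs = {"d1": [1, 0], "d2": [1, 1], "d3": [0, 1], "d4": [1, 3]}
query_vec = [0, 1]

shallow = rerank(query_vec, lexical_order, doc_vecs, depth=3, keep=2)
deeper = rerank(query_vec, lexical_order, doc_vecs, depth=4, keep=2)
```

With `depth=3` the reranker never sees "d4", however similar it is to the query; with `depth=4` it is promoted into the top two. That is the dependency described above: a reranker can only rescue what the first-stage ranking lets into its window.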
You can actually roughly compare what the top 5 results would look like without the reranking step9. I have done some comparisons, and while I am not especially impressed with Primo Research Assistant’s relevancy ranking even with the reranking step, the results would be far worse without it!
The other issue is the sample of queries used. The study used zero-result patron queries — queries where users had typed natural language into Primo and received nothing back. These are precisely the cases where LLM-to-Boolean provides the clearest benefit. The results might look very different for the much larger population of queries that already return results.
More importantly, I strongly suspect that the queries tested were mostly “easy” queries with many potentially relevant matches. The most telling sign is the lack of overlap between the items retrieved by the librarian searches and by the PRA searches.
Despite both sets achieving roughly 50% precision10, only 7.21 per cent of PRA's citations matched the expert searches' results. Clearly, there were many relevant results and PRA was not finding better results; it was finding different results of comparable quality.
In short, the queries were not “skill-testing”. Harder queries with a much smaller set of relevant results would no doubt stress Primo Research Assistant even more, and this is where I see systems with more powerful relevancy ranking capabilities that go beyond simple lexical ranking, such as Elicit.com, Undermind.ai and Consensus, shine.
Not all LLM-to-Boolean tools are equal
The tools I listed at the outset sit on a spectrum, and the differences matter.
At one end, Web of Science Research Assistant (as tested in early 2025) and EBSCOhost NLS are pure LLM-to-Boolean: they generate a Boolean query and run it against the standard index with no changes to ranking. The homebrew implementations from Stony Brook and San Diego State work similarly11.
In the middle, as we have seen, Primo Research Assistant generates Boolean, retrieves the top 30 results, then reranks them using embeddings. Web of Science Smart Search runs Boolean and semantic search together and lets users toggle between combined, Boolean-only, and semantic-only results. (Do not confuse Web of Science Smart Search with Web of Science Research Assistant: the former is bundled with no RAG; the latter is a paid add-on with RAG.)
At the other end, Scopus AI describes a hybrid architecture in which it may use vector search, keyword search, or both depending on the query, combining and reranking the results.
Once you lay out this spectrum, you can make a prediction: tools that go beyond LLM-to-Boolean alone — those adding hybrid search, dense embedding reranking — should typically outperform those that do not and be favoured by serious researchers. And roughly speaking, this is what I found in my informal comparative testing and in various Katina reviews.
So far, we have been discussing “quick search”, where results are returned with very little delay (e.g. < 1 min). Agentic/deep search, which uses LLMs to do iterative searching and evaluation, gives you much superior results, but you will have to wait much longer (e.g. >10 minutes) for them. I dub this the difference between Deep Search/Research tools and “Quick Search”.
The LLM-to-Boolean approach does have one genuine strength: interpretability. When the system displays the generated Boolean, you can inspect it, critique it, modify it, and rerun it. That transparency in the retrieval stage is worth acknowledging. But it comes at the cost of reproducibility: because an LLM generates the query, the search strategy can change each time you run it. In my testing, Web of Science Research Assistant and Primo Research Assistant generated different Boolean strings roughly one in five times; Scopus AI changed nearly every other time.
Why our profession's focus on Boolean retrieval is understandable but limiting
Library training and practice have always centred on the retrieval side of search — constructing the right query to get the right documents into the result set. That is what we teach, what we are trained in, and where our expertise lies. Ranking — what happens after retrieval — has historically been the vendor’s domain, largely invisible to us and outside our direct control.
This is not a failing of individual librarians. It is a structural feature of how the profession developed. When library catalogues used strict Boolean with no relevance ranking at all — results sorted by accession number or date — retrieval was genuinely the whole game. Getting documents into the result set was the only thing you could influence. Evidence Synthesis Librarians obviously have the same mindset as they typically intend to screen all the retrieved results. The habits, pedagogy, and professional identity that developed around this reality are deeply embedded and entirely rational given that history.
But the search environment has changed. Users now routinely face thousands of results and scan only the first few pages. In that context, ranking determines what users actually encounter. The professional focus on retrieval, while understandable, has become a blind spot — and it shapes which AI innovations we notice and which we overlook.
LLM-to-Boolean is visible, inspectable, and maps directly onto existing expertise. It fits neatly into teaching practices and workflows. It is unsurprising that it has generated more enthusiasm than semantic reranking, which operates invisibly beneath the interface. But we need to be honest that familiarity is not the same as effectiveness. Our experience with Google and Google Scholar teaches us that users overwhelmingly judge search engines by the quality of what appears on the first page, not by whether they can inspect the query logic. A search you can fully explain but that returns mediocre results will be rejected in favour of one that is harder to explain but returns excellent results.
To be fair, since I started in the profession in 2007, I have seen less focus on teaching Boolean by librarians, but the rise of “AI search” may accidentally push this back onto the agenda.
Vendors have their own motivation to create the narrative that LLM-to-Boolean = AI search
Existing database vendors, for their part, have their own reasons for staying with lexical methods, and they have to do with cost. Boolean retrieval over inverted indexes is fast, well understood, and already built. The infrastructure exists and is paid for. Bolting an LLM onto the front end to generate Boolean queries requires no changes to the underlying index, no new data pipelines, and no reindexing of billions of records. It is, from an engineering and business perspective, the cheapest possible way to say you are doing "AI", especially if you bolt on the cheapest LLM you can find12.
Moving to dense embeddings is a different proposition entirely. It requires generating vector representations for every document in the index13. At the scale of a system like Ex Libris CDI, with over five billion records, that is a significant computational investment — not just a one-off cost but an ongoing one, since embeddings need to be regenerated whenever the model is updated or new documents are ingested. Dense retrieval is also slower than inverted index lookup at query time, particularly without heavy optimisation of approximate nearest neighbour search infrastructure.
Cross-encoder reranking or using LLMs directly for ranking, the most powerful option for second-stage ranking, is more expensive still. A cross-encoder scores each query-document pair individually, which means running the model once for every candidate document. Even reranking just the top 500 results per query, at the scale of millions of daily searches, adds up. Vendors looking at these numbers can easily conclude that the return on investment is uncertain — especially when librarians are not asking for it and seem satisfied with LLM-to-Boolean.
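A back-of-envelope calculation shows why the numbers look daunting. The query volume below is purely hypothetical (I do not have vendor traffic figures); the 500-candidate depth is the one used in the paragraph above.

```python
def model_calls_per_day(queries_per_day, candidates_per_query):
    """One forward pass per (query, candidate) pair."""
    return queries_per_day * candidates_per_query

QUERIES_PER_DAY = 2_000_000   # hypothetical traffic figure, for illustration only
RERANK_DEPTH = 500            # candidates reranked per query

# Cross-encoder: the model runs once for every query-document pair.
cross_encoder_calls = model_calls_per_day(QUERIES_PER_DAY, RERANK_DEPTH)

# Bi-encoder with precomputed document vectors: one query embedding per search.
bi_encoder_calls = model_calls_per_day(QUERIES_PER_DAY, 1)

print(f"{cross_encoder_calls:,} vs {bi_encoder_calls:,} model calls per day")
```

Under these assumptions the cross-encoder needs 500 times as many model invocations at query time as the bi-encoder, which moves most of its cost to indexing instead.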
The next level beyond these methods would be agentic/deep search, which uses LLMs to do iterative searching and evaluation: you get much superior results but at the cost of latency14.
This creates a reinforcing cycle. Librarians ask for LLM-to-Boolean retrieval because that is where their expertise lies. Legacy vendors invest in LLM-to-Boolean because it is cheap to implement and aligns with what librarians are asking for. Neither side pushes for the harder, more expensive, but ultimately more impactful investment in ranking. The result is that library search systems remain stuck with decades-old ranking technology while vendors market a superficial AI layer on top as though it were transformative.
What should AI search do?
Let me suggest two approaches.
1. The more conservative approach: maintaining interpretability
Keep Boolean retrieval for the first stage. Inverted indexes and Boolean logic remain the right tool for scalable initial retrieval. There is no need to abandon this.
Add modern reranking as a second stage. After Boolean retrieval and BM25 produce a candidate set, apply a “neural” reranker — whether a bi-encoder, cross-encoder, or LLM-based ranker — to re-sort those candidates before presenting results. This is not radical. Semantic Scholar uses LightGBM reranking that emphasises title matches and highly-cited recent papers. PubMed uses LambdaMART on the top 500 results. Both are academic search engines at massive scale, and both concluded that BM25 alone was insufficient. Even these comparatively modest reranking techniques, now almost a generation behind the state of the art, produce noticeably better results than BM25 for ranking.
Some librarians may object that these newer, usually neural, reranking methods are not as interpretable as lexical relevance ranking based on matching query terms. This concern is overblown, for a simple reason: the current system is not that interpretable either15.
The Ex Libris “Search and Ranking in CDI” documentation describes a relevance algorithm built on a “continuously tuned, proprietary” mix of dynamic rank factors, static rank factors, field boosting, personalised ranking by discipline, and various other components. I would wager that very few librarians who use Primo could explain in detail how CDI’s current ranking works. We have already accepted opaque ranking (think Google since 2000) — we just have not acknowledged it16.
In return, with Boolean still the first-stage retriever, you get exact hit counts and avoid the problem of hit counts becoming approximate, as happens when semantic search methods are used as the first-stage retriever.
2. Giving up interpretability for effectiveness
There is a stronger version of the argument I have been making that I want to put on the table: perhaps we should stop treating Boolean as the obvious choice for first-stage retrieval at all.
Throughout this post I have advocated a two-stage architecture that keeps Boolean retrieval as the first stage and adds modern reranking as the second. That is a pragmatic position — it preserves what vendors and librarians are comfortable with while addressing the ranking bottleneck. But pragmatism can obscure a deeper problem. Boolean retrieval is not just outdated at the ranking stage. It is limiting at the retrieval stage too.
The fundamental issue is that Boolean is binary. A document either matches the query logic or it is excluded entirely. There is no middle ground. If a highly relevant paper uses none of your specified terms or synonyms — because the authors used different terminology, because the concept is expressed implicitly, or because the relevant discussion appears in a section your metadata fields do not cover — Boolean will never surface it. It does not matter how good your ranking is downstream. A document that never enters the candidate set cannot be ranked.
Having worked with researchers, I can tell you that even in areas they know well, they tend to over-include terms, putting in keywords that end up excluding even the known relevant gold standard papers they give me!
The reason for this is that most researchers and even librarians are unfamiliar with what Farhad Shokraneh calls “search resistant concepts”, which he defines as “concepts that when added to the search, are more likely to miss the relevant records.” In such situations, if you want to ensure high recall, you should not even try to search for the concept!
He gives three reasons why concepts can be hard to search, of which the lack of standardised terminology in a field is the clearest.
In evidence synthesis scenarios, search resistant concepts typically are outcome concepts (from PICO) which are described in so many different ways, such that it is often better not to even try to search for them. Do LLMs know this?
This is precisely why LLM-to-Boolean tools resort to aggressive synonym expansion in the first place. The rigidity of Boolean matching creates a constant risk of missing relevant documents, so the system compensates by throwing in every plausible variant. But as I have argued, that expansion then degrades BM25 ranking by flooding the result set with noise. The root cause of both problems — missed documents and noisy results — is the same: Boolean’s binary matching model.
BM25, used as a first-stage retriever rather than just a ranker, avoids this entirely. BM25 does not hard-exclude documents. It scores every document against the query and returns the top-k by relevance score. A document that matches some but not all of your terms still appears — it is just ranked lower.
This means you do not need exhaustive synonym lists to compensate for rigid matching. A simple, well-chosen set of keywords will retrieve a broad and relevant candidate set because partial matches are included rather than discarded.
BM25 as first-stage retriever versus Boolean: what does this actually mean?
Using BM25 as a first-stage retriever rather than Boolean can be hard to grasp if you are used to Boolean as the default, so here is a concrete example.
When you type “open access citation advantage” into a strict Boolean search engine, most systems today apply an implied AND. Every result must contain all four terms. If a document contains “open access” and “citation” but not “advantage,” it is excluded entirely, regardless of how relevant it might be.
A BM25-based retriever works differently. It scores every document based on how well it matches the query terms, weighing term frequency and rarity, but it does not hard-exclude documents that are missing a term. A system using BM25 for first-stage retrieval (as Google does, or Scite.ai) might return a document matching only three of your four terms if its overall relevance score is high enough17. The missing term potentially lowers the score relative to documents with all four terms but does not eliminate the document from consideration18.
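The contrast is easy to see on a toy collection. In the sketch below, the four documents are hypothetical, and the scorer is a deliberately simplified stand-in for BM25: it sums an IDF weight for each matched term and ignores term frequency and length normalisation. The point is the gate-versus-score behaviour, not the exact formula.

```python
import math

# Hypothetical toy collection.
DOCS = {
    "p1": "the open access citation advantage revisited",
    "p2": "open access increases citation rates in a randomised trial",
    "p3": "competitive advantage in retail",
    "p4": "library budgets and staffing",
}
QUERY = "open access citation advantage".split()
TOKENS = {d: set(t.lower().split()) for d, t in DOCS.items()}

def boolean_and(query):
    """Implied AND: keep only documents containing every query term."""
    return {d for d, toks in TOKENS.items() if all(q in toks for q in query)}

def ranked_top_k(query, k):
    """Simplified BM25-style retrieval: score partial matches, return the best k."""
    n = len(TOKENS)

    def idf(term):
        df = sum(term in toks for toks in TOKENS.values())
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    scores = {d: sum(idf(q) for q in query if q in toks) for d, toks in TOKENS.items()}
    hits = [d for d in scores if scores[d] > 0]   # drop documents with no match at all
    return sorted(hits, key=lambda d: -scores[d])[:k]

strict = boolean_and(QUERY)
ranked = ranked_top_k(QUERY, k=3)
```

Strict Boolean returns only "p1". The scored retriever returns "p1" first, then "p2" (three of the four terms), then "p3", leaving it to a downstream reranker to decide that "p2" deserves attention and "p3" does not.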
In the information retrieval world, this is still lexical/keyword-based, as both retrieval and ranking are still based purely on term matching19.
This matters because strict Boolean can accidentally drop relevant documents through no fault of the searcher. In my review of EBSCOhost Natural Language Search (NLS), I showed an example where I asked the system to find papers that used randomised controlled trials to test for an open access citation advantage. NLS generated a Boolean query and ran it against the index. It failed to surface one of the few obviously relevant papers, despite that paper being indexed in the database. The reason was straightforward: this early paper did not use the phrase “open access citation advantage,” and the Boolean that NLS constructed did not expand broadly enough to capture the terminology the paper did use. Under strict Boolean, that single vocabulary mismatch was fatal — the paper was excluded from the result set entirely.
A BM25 retriever would not have automatically excluded that paper. It would have scored it lower for missing some query terms, but the paper would still have entered the candidate set. With a strong reranker as a second stage — one capable of recognising that the paper is conceptually about the same topic even without exact term matches — that paper could be pushed to the top of the results. Semantic search approaches would handle this even more naturally, since dense embeddings capture conceptual similarity rather than relying on shared vocabulary at all.
This is the core limitation of Boolean as a retrieval method. It treats term matching as a gate: you are in or you are out. Every term that is missing from the query or absent from the document is a potential point of failure. BM25 and semantic methods treat term matching as a signal — one input among many into a relevance score — which is far more forgiving and far less dependent on the searcher anticipating every possible way an author might express a concept.
The aggressive query expansion that causes so many problems under Boolean becomes largely unnecessary. Adding a strong reranker on top of first-stage BM25 is the setup that post-2019 information retrieval papers routinely use as a strong baseline for comparison.
But if we are willing to give up strict Boolean at the first stage, why not go further and use semantic search methods, either alone or as another component of the first stage? Dense embedding retrieval does not depend on term matching at all. It represents both queries and documents as vectors in a shared semantic space and retrieves documents by conceptual similarity. A paper about “myocardial infarction” can be retrieved by a query about “heart attack” without anyone, human or LLM, needing to specify that synonym. The vocabulary mismatch problem that has driven decades of query expansion work in library science is addressed at the architectural level rather than patched at the query level.
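A toy illustration of retrieval by vector similarity may help here. The three-dimensional vectors below are hand-assigned by me and purely hypothetical; a real system would obtain much higher-dimensional vectors from a trained embedding model. The point of the sketch is only the mechanism: the query and the top document share no words, yet they sit close together in the vector space.

```python
import math

# Hypothetical hand-assigned "embeddings" (assumed axes: cardiac event,
# treatment, library science). Real embeddings come from a trained model.
DOC_VECTORS = {
    "Outcomes after myocardial infarction": [0.9, 0.3, 0.0],
    "Aspirin therapy in cardiac patients": [0.5, 0.8, 0.0],
    "Query expansion in library catalogues": [0.0, 0.1, 0.9],
}
QUERY_VECTORS = {
    "heart attack": [0.95, 0.2, 0.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_search(query, top_k=2):
    """Rank documents by similarity to the query vector, ignoring vocabulary."""
    q = QUERY_VECTORS[query]
    ranked = sorted(DOC_VECTORS, key=lambda title: cosine(q, DOC_VECTORS[title]), reverse=True)
    return ranked[:top_k]

print(dense_search("heart attack"))
```

The myocardial infarction paper ranks first even though neither “heart” nor “attack” appears in its title, which is exactly the synonym case that Boolean needs explicit expansion to handle.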
The next level beyond semantic search would be agentic or deep search, which uses the LLM to search and evaluate results iteratively, often drawing on both lexical and semantic search methods.
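The iterative search-and-evaluate loop can be sketched in a few lines of Python. Everything here is a stand-in of my own: the tiny index is invented, `run_search` represents whatever lexical or semantic backend is available, and `propose_terms` is a crude placeholder for the LLM step (it just picks the longest word from a missed document). It is a skeleton of the control flow, not how any named product works.

```python
# Hypothetical index, invented for illustration.
INDEX = {
    "s1": "randomised trial of open access and citations",
    "s2": "free availability of articles increases citation impact",
    "s3": "open access citation advantage a critical review",
    "s4": "library budget survey",
}
KNOWN_RELEVANT = {"s1", "s2", "s3"}  # seed studies used to check recall

def run_search(terms):
    """Stand-in for a search backend: match documents containing any term."""
    return {doc_id for doc_id, text in INDEX.items()
            if any(term in text.split() for term in terms)}

def propose_terms(missed_docs):
    """Stand-in for the LLM step: inspect a missed document and suggest a term.
    Here it naively returns the longest word of the first missed document."""
    text = INDEX[sorted(missed_docs)[0]]
    return max(text.split(), key=len)

def iterative_search(initial_terms, recall_target=1.0, max_rounds=5):
    """Search, evaluate against known relevant items, reformulate, repeat."""
    terms = list(initial_terms)
    recall = 0.0
    for round_number in range(1, max_rounds + 1):
        found = run_search(terms) & KNOWN_RELEVANT
        recall = len(found) / len(KNOWN_RELEVANT)
        if recall >= recall_target:
            return terms, recall, round_number
        terms.append(propose_terms(KNOWN_RELEVANT - found))
    return terms, recall, max_rounds

print(iterative_search(["open"]))
```

Starting from the single term “open”, the first round misses `s2`, the loop adds a term drawn from that missed study, and the second round reaches full recall on the seed set. The difference from one-shot query generation is that the strategy is tested and revised rather than emitted once.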
The practical objections are real but not insurmountable. Librarians value Boolean as a first stage retriever because it is predictable, inspectable, and gives them direct control over what enters the result set. Abandoning it at the first stage means accepting that retrieval becomes probabilistic — you can no longer guarantee that a specific document will or will not appear for a given query. Hit counts become approximate or meaningless. The ability to teach students a logical, reproducible search process is diminished.
These are genuine losses. But we should weigh them against what is gained. The vocabulary mismatch problem — the single largest source of recall failure in academic search — is substantially mitigated. The need for aggressive synonym expansion, and all the ranking noise it creates, is eliminated. And the overall architecture becomes simpler: instead of Boolean retrieval followed by BM25 ranking followed by (hopefully) neural reranking, you can move to a hybrid first-stage retrieval using both BM25 and dense embeddings, followed by a single reranking step.
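One common way to combine a BM25 ranking and a dense ranking into a single candidate list is reciprocal rank fusion (RRF). I am using it here only as an example of how hybrid first-stage retrieval can be wired together; nothing in this post establishes which fusion method any particular vendor uses. The document IDs and the two input rankings are hypothetical, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) from every list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical first-stage outputs for the same query.
bm25_ranking = ["paper_a", "paper_b", "paper_c"]
dense_ranking = ["paper_d", "paper_b", "paper_a"]

candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(candidates)  # ['paper_a', 'paper_b', 'paper_d', 'paper_c']
```

Documents found by both retrievers rise to the top, while `paper_d`, which only the dense retriever found, still enters the candidate set that a reranker would then reorder.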
Most of the search industry reached this conclusion two decades ago. Google, Bing, and virtually every major web search engine abandoned strict Boolean for first-stage retrieval by the early 2000s in favour of BM25-style ranked retrieval. Among academic search systems, Semantic Scholar uses BM25 rather than Boolean as its first-stage retriever. The library world remains one of the last holdouts, not because Boolean is technically superior, but because our professional training, our teaching practices, and our vendor relationships are all built around it.
I am not suggesting this transition will be easy or that Boolean should vanish overnight. A hybrid approach, using both lexical methods like BM25 and semantic methods alongside or instead of strict Boolean, is the most realistic path. Scopus AI and other top-class tools are already moving in this direction. But we should be clear-eyed about what we are defending when we insist on Boolean as the foundation. We are defending a retrieval method whose core limitation, binary matching, is the root cause of many of the problems this post has described.
The bottom line
LLM-to-Boolean is not worthless. But it is a minor component being marketed as the whole solution. It solves a narrow problem — helping novices who cannot construct any Boolean at all — while doing nothing for the majority who need better-ranked results. As a first-stage query construction aid paired with modern reranking, it could play a useful supporting role. On its own, with the same BM25 ranking underneath, it changes almost nothing that matters.
The horseless carriage eventually gave way to the automobile — a machine designed from the ground up around the capabilities of the engine, not around the form of what came before. Library search needs the same transition. The real bottleneck is not retrieval — it is ranking. And for complex queries, it is not even ranking — it is the absence of iterative, evaluative search processes that can reason about what they find.
The infrastructure to do better already exists. Semantic Scholar and PubMed have proven that two-stage architectures work at scale. Agentic deep search tools have demonstrated what is possible when LLMs are used as reasoning engines rather than Boolean generators. Some libraries are already taking this path — Harvard Library, for instance, bypassed LLM-to-Boolean entirely for their special collections, building a discovery platform called Collections Explorer that relies on embedding models.
The question is whether library vendors will invest in these approaches, and whether we as a profession will demand they do — or whether we will keep admiring the horseless carriage because we can see how the reins work.
Librarians always worry about the environmental impact of LLM use, and in this case one might be skeptical of the cost-benefit analysis of using LLMs to convert queries into Boolean.
I do think LLMs will eventually be able to generate expert-searcher-level search strategies. The path forward is agentic search combined with tool use (e.g. a system to check MeSH): an LLM that can construct an initial query, run it, evaluate the results against inclusion criteria, identify gaps in coverage, reformulate and expand the search, test the revised strategy against known relevant studies, and iterate until recall targets are met. This is exactly how a professional searcher does it! Note: this is qualitatively different from generating a single Boolean string and hoping it works.
Some systems instruct LLMs to generate multiple phrase searches to be OR’d together (e.g. Primo Research Assistant) but the result is similar.
Query expansion, and more broadly query reformulation (which encompasses expansion, term substitution, and structural transformation), remains an area of active research in information retrieval. That said, complex nested Boolean query construction receives relatively little attention in mainstream IR research, with evidence synthesis and, to some extent, patent and legal search being notable exceptions. More common approaches in current research include pseudo-relevance feedback, learned sparse expansion methods such as SPLADE, and dense retrieval techniques like HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer document and uses its embedding as the query representation rather than expanding the query terms directly.
Technically, variants like BM25F weight where the hit comes from, so a hit from full-text would be worth less than a match in the title or abstract but this is difficult to get right.
I know everything is relative, there have been complaints about PubMed ATM of course.
You have to account for the fact that currently Primo Research Assistant doesn’t search the local collections and some content owners like Elsevier, JSTOR have opted out.
The study also tested PRA during its beta period with several major providers excluded, and PRA was judged on five returned items versus ten for conventional searches, so the comparison is suggestive rather than strictly like-for-like.
Stony Brook uses multiple concurrent agents for query construction, which is technically agentic but fundamentally different from agentic search that does iterative deep search, and/or evaluate results e.g. Undermind, Consensus Deep Search. The agents here assemble a single query in parallel, then hand it off to Primo.
How easy is it? Consider that even relatively small institutions are running similar experiments. Essentially, all you need to do is hook up the input to an LLM API and prompt it to give Boolean output!
There is a middle path that avoids the need to pre-index embeddings for the entire collection: generate embeddings at query time and rerank only a small number of retrieved results, as Primo Research Assistant does with the top 30. This sidesteps the cost of indexing billions of records into a vector store. The trade-off is speed — computing embeddings on the fly for every query adds latency that users will notice, and the small reranking window means you are only re-sorting a tiny fraction of the candidate set. Reranking 30 results is better than reranking none, but it is a long way from reranking the top 500 or 1,000, where the real quality gains appear.
There are studies that suggest Primo Research Assistant isn’t too far behind in capability to advanced modern systems like Elicit or Undermind but these studies often test with “Easy queries” and/or they test with queries where Primo Research Assistant has an advantage in terms of their index coverage (PRA includes monographs and other non-article type content).
I grant you that with BM25 ranking systems, most librarians could roughly convince themselves why a result was in the top 5 just by looking for stemmed term matches.
I am old school enough to remember the 2010s, when the Summon mailing list would be abuzz with librarians angry that, when comparing two queries, the results made no logical sense because one query gave more results when it should have given fewer. This could have multiple explanations; for example, some expansion rule might have been silently triggered when the number of results fell below some threshold.
Some systems would need a hard match of at least X out of Y terms for the document to be considered for ranking, others might only drop out of strict Boolean mode if the query is beyond a certain length etc.
One way to test whether a system uses BM25 as a retriever is to enter three normal terms and one made-up word. A strict Boolean system would give you zero results. One that uses BM25 as its retriever would still give you some results. Of course, these days, the system could also be using a non-lexical semantic search system.
There is a persistent myth among librarians that the Google of the 2000s and 2010s wasn’t a lexical/keyword search system just because it didn’t do strict Boolean. In fact, it was mostly a keyword search that simply did not implement strict Boolean, at least in the 2010s, and if you knew a couple of tricks (check the cached page for what the crawler actually indexed, look at link text, etc.), the search results were more interpretable than you might think. Still, it was definitely far less interpretable and predictable than typical library databases.

I read your post with great interest and agree that these are important questions and considerations. It may be that at some point we will need to rethink whether Boolean continues to be the best approach to searching, although, to be fair, the purpose of Boolean was never to rank results.
PubMed’s query expansion and ranking does seem to work better than other examples you have stated, but it is still problematic, and in the health sciences we routinely teach students how to bypass automatic mapping and advise caution when using Best Match ranking. This article discusses the ways in which PubMed’s Best Match can introduce bias into search results: https://pmc.ncbi.nlm.nih.gov/articles/PMC8830327/. We are also seeing an erosion of the quality of indexing since the adoption of automatic indexing in PubMed and other databases. This has a huge impact on the ability of human searchers, and “AI” driven search to retrieve relevant results, regardless of relevance ranking.
I can’t speak to other disciplines, but in the health sciences, librarians do routinely engage with what happens after retrieval. In the clinical context, limiting search results to the best type of evidence for the question being asked, for example limiting to randomized controlled trials to answer therapy questions or limiting to systematic reviews, is an effective way to quickly identify relevant results despite large retrievals. Biomedical databases have built in “clinical queries” limits that are useful in this regard. On the back end these limits are search queries that have been validated for sensitivity and specificity that are applied to the search query. In the context of knowledge synthesis, we routinely teach researchers how to develop and refine eligibility criteria to facilitate screening, and we use sample sets of articles known to meet the inclusion criteria to adjust and improve the search, among other techniques. When conducting scoping reviews, iterative searching to improve relevance of search results is built into the methodology. In the context of knowledge synthesis we routinely encounter “search resistant concepts” (though I was not familiar with this term, so thank you). It does seem like BM25 retrieval over Boolean in this context is something worth exploring, however it would need to allow for transparency and reproducibility which does not seem to be possible at this time.
Perhaps most importantly, in both contexts, we focus much attention on helping searchers ask clearly formulated questions. In my 20+ years of experience lack of clarity in the question is a more important factor in determining whether search results are relevant than issues in search structure or database relevancy ranking.
The ability to ask answerable questions and search iteratively is what makes human searchers more effective than AI. In my limited experience with LLMs and AI driven search this is where they fail.
The more I engage with the topic the more I come back to the idea that LLMs are like any other technological advance: we first think they will topple all that came before, and then we slowly realize they are a tool like any other, that they have their uses, but that the old technology also still has its uses. I suspect that BM25 versus Boolean is one of these things: we will still want to be able to design transparent, predictable and reproducible searches (I would not want my medical care to be based on a small and random set of relevant results!), but in some contexts, it will be useful to be able to quickly identify some relevant results. Google Scholar already serves that purpose, so why not other tools that maybe do it better?
PS there is an increasing body of evidence that shows that PICO based searching is problematic and you are correct to question whether LLMs are capable of recognising that outcomes should be screened for and not included in the search in most instances.
Hello Aaron,
My fellow librarians and I are having a hard time figuring out whether Web of Science Research Assistant still works exclusively with keyword searches using Boolean operators, or if they've introduced some form of semantic search. In a section of their website (which is already 8 months old!), there is a brief mention of semantic search: “Retrieving Articles: We start by retrieving articles that exhibit the highest degree of semantic similarity to the user’s query and complement them by adding articles with the utmost relevance through a keyword search.” without further explanation (https://webofscience.zendesk.com/hc/en-us/articles/31437630410129-Web-of-Science-Research-Assistant#h_01JGKZG2P1XN9FWQGW98CKWB18).
As for the Smart Search feature, it indicates that you can limit results to those from semantic search or keyword search. Have you tested the WoS Search Assistant again recently? Do you think it’s still based solely on keyword search and Boolean operators, as the tool seems to indicate when it provides a response (for example, in the “How are these results generated?” section)? If there is indeed a semantic search component, do you have more information about it?