The Case of the Vanishing Hit Count: Rethinking Query Craftsmanship in a Post-Boolean World — Reflections from Day 2 of my FSCI 2025 workshop on AI-powered search
Understanding the Shift from Exact Boolean Hits to the "Top-k" Results of Semantic Search and the Evaluated Hits of Deep Search.
One of the most interesting things about teaching is that the best questions come after I’ve finished my talk. Yesterday, during Day 2 of my three-hour crash course on AI search at FSCI 2025, a participant looked at our side-by-side demo of Scopus (not Scopus AI), SciSpace (in standard, non-deep search mode), and AI2 PaperFinder and asked (paraphrased):
“Why can standard search engines like Lens.org or Scopus tell me they found exactly 205 documents, while most ‘AI-powered,’ vector-embedding-based search systems like Elicit.com or SciSpace seem to always cap results at an arbitrary, fixed number (8 and 100 respectively), suggesting these are subsets of all relevant results? On the other hand, AI2 PaperFinder, which is also an ‘AI-powered’ deep search tool, has no issue showing the number of hits.”
Results page of Scopus (not Scopus AI) show exact number of hits
Results page of SciSpace (Standard, non deep search mode)
Results page of AI2 PaperFinder (Deep Search tool)
It’s the kind of deceptively simple question that I love—it forces us to unpack decades of retrieval theory, poke at the shiny marketing language, and, ultimately, rethink how we evaluate search quality in the post-Boolean era. I gave a rather good answer in the moment, but here is the blog-length version of what I said in class, plus a few things I wish I’d added.
[Adv] Missed FSCI 2025, and interested in joining another run of our workshops on understanding AI search?
You're in luck: Bella and I are running a new, improved version of the workshops at the preconference of Charleston Conference Asia in Bangkok on Jan 26, 2026.
The workshop, titled Course 3: AI-Powered Search in Libraries: A Crash Course on Understanding the Fundamentals for Library Professionals, will be a two-day online webinar, followed by a third, in-person session in Bangkok!
Why Boolean Hit Counts Are Trivial To Retrieve
The answer, in short, lies in the fundamental difference between how traditional and “AI-powered” search engines understand and execute a query.
Conventional academic databases, for the most part, operate on a Boolean logic model with which we are all familiar. When you type in your keywords, you're giving the system a set of precise instructions on what to match.
What some of us may not be aware of is how cheaply Boolean hits can be found and counted once an inverted index has been constructed.
How do you construct an inverted index? The following is from Zilliz’s Understanding Boolean Retrieval Models in Information Retrieval.
First, you generate a term list, listing each term and the corresponding docIDs (an ID that identifies each document). In practice, each document first undergoes preprocessing cleanup (e.g., removal of stop words), tokenization, and stemming before this.
Then, you sort the list alphabetically. If the same term appears in multiple documents, they are sorted based on the document ID.
From there, we create a “posting list.” For example, the posting list tells us that the word “data” appears in both doc1 and doc2, while “learning” appears only in doc2.
Because we are designing this inverted index to handle Boolean search, if the same term appears twice in a document, we can drop all but one of its instances. For example, even though “data” appears twice in doc1, we only need to show it once for doc1 in the posting list.
This is a binary inverted index list. If we want to support TF-IDF or BM25 for retrieval or ranking (which is common), our inverted index will need to include term frequencies for each document and the term’s position in the document (if we want to support phrase search or proximity operators).
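To make the construction concrete, here is a minimal sketch in Python. The two toy documents and the naive whitespace tokenizer are my own illustrative assumptions; as noted above, real systems also do stop-word removal and stemming first.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a binary inverted index: term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization; a set drops duplicate occurrences of a
        # term within a document, which is all a binary index needs
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort terms alphabetically and each posting list by docID
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {
    "doc1": "data mining uses data",
    "doc2": "machine learning and data mining",
}
index = build_inverted_index(docs)
# "data" appears twice in doc1 but is posted only once for it
```

As in the walkthrough above, the posting list for “data” covers both documents while “learning” points only to doc2.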
The main point is that with an inverted index, looking up and counting the number of Boolean hits is extremely fast and essentially "free."
For example, to find the number of documents that match “Term A” AND “Term B” AND “Term C,” you just need to look up three entries in the inverted index and find the overlap in the documents returned.
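As a toy illustration (the posting lists here are invented for the example), the AND query is just a set intersection, and the exact hit count falls out for free:

```python
# Hypothetical posting lists from an inverted index (term -> docIDs)
postings = {
    "term_a": {"doc1", "doc2", "doc3", "doc5"},
    "term_b": {"doc2", "doc3", "doc4"},
    "term_c": {"doc1", "doc2", "doc3"},
}

# "Term A" AND "Term B" AND "Term C": three lookups, one intersection
hits = postings["term_a"] & postings["term_b"] & postings["term_c"]
hit_count = len(hits)  # the exact Boolean hit count, essentially free
```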
Boolean search is incredibly fast, but what about ranking with TF-IDF or BM25? While calculating TF-IDF or BM25 scores is much faster than running embedding vector models (often called semantic search), it can still be somewhat slow over huge corpora, requiring optimization techniques like WAND and Block-Max WAND. Of course, if you perform a Boolean match first and then sort only the matches by BM25 score, as most academic databases do, the process is very fast.
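Here is a rough sketch of that common pipeline: Boolean match first, then BM25 ranking of only the matches. The scoring function is standard Okapi BM25; the toy corpus and the k1/b defaults are my own assumptions.

```python
import math

def bm25_score(query_terms, doc_id, docs_tokens, df, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag of query terms.

    docs_tokens: docID -> token list; df: term -> document frequency.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(toks) for toks in docs_tokens.values()) / n_docs
    tokens = docs_tokens[doc_id]
    score = 0.0
    for term in query_terms:
        tf = tokens.count(term)
        if tf == 0:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(tokens) / avgdl)
        )
    return score

docs_tokens = {
    "doc1": "data mining uses data".split(),
    "doc2": "machine learning and data mining".split(),
    "doc3": "deep learning for search".split(),
}
df = {"data": 2, "mining": 2, "learning": 2}

# Boolean match first, then rank only the matches by BM25
matched = [d for d, toks in docs_tokens.items() if "data" in toks]
ranked = sorted(matched,
                key=lambda d: bm25_score(["data"], d, docs_tokens, df),
                reverse=True)
```

Only the two documents that pass the Boolean filter ever get scored, which is why the combined approach stays fast.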
Why “Hit Counts” by Vector Similarity are Hard
“AI-powered search,” on the other hand, lives in a world of shades of gray. These systems, often built on vector embeddings, don't just look for keywords; they strive to understand the meaning and intent behind your query. Your search terms are translated into a high-dimensional vector—a mathematical representation of their semantic essence. The system then searches for documents with vectors that are "closest" to your query's vector in this vast conceptual space, typically by calculating the cosine similarity (the angle between the vectors).
Here's the challenge: there is no simple “match/no-match” boundary unless you set an arbitrary similarity threshold (technically known as a “range search”). Even if you do (e.g. relevant = Cosine Similarity > 0.8), scanning every vector in a 200-million-paper corpus just to count those that pass the threshold would be prohibitively expensive and slow.
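A brute-force range search makes the problem visible: to count the documents above a threshold, you must score every vector in the corpus. The random toy vectors below are my own stand-in for real embeddings; with unrelated random vectors, essentially nothing clears a 0.8 threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 unit-normalized toy "embeddings"; real corpora hold
# hundreds of millions, which is why this full scan doesn't scale
corpus = rng.normal(size=(10_000, 384))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=384)
query /= np.linalg.norm(query)

# Range search: every single vector must be scored to get the count
sims = corpus @ query  # cosine similarity (all vectors unit length)
hit_count = int((sims > 0.8).sum())
```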
Range searches are extremely expensive in large, high-dimensional vector databases. Many systems may simply run a top-k search first and then filter the results by cosine similarity. If that top-k search uses an approximate nearest neighbor (ANN) algorithm (see next note), it may miss matches that fall within the similarity range you want.
As such, this is where the "top-k" approach comes in. Instead of finding every possible relevant document within a certain similarity score, these systems are designed to efficiently retrieve a predetermined number, k, of the most similar results.
Implication Alert! In fact, exactly finding the documents with the highest cosine similarity is so slow that top-k methods typically use an approximate nearest neighbor (ANN) algorithm like HNSW (Hierarchical Navigable Small World), which introduces non-determinism: ANN is not guaranteed to return all the truly nearest neighbors within a given range or similarity threshold.
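For contrast, here is a minimal exact top-k in Python. Production systems replace this full scan with an ANN index such as HNSW, accepting that a few true neighbors may be missed; the near-duplicate query below is my own contrived setup to make the expected winner obvious.

```python
import numpy as np

def top_k_cosine(corpus, query, k=10):
    """Exact (brute-force) top-k by cosine similarity.

    ANN indexes like HNSW approximate this step to avoid scoring
    every vector, at the cost of occasionally missing true
    nearest neighbors.
    """
    sims = corpus @ query / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
    )
    top = np.argpartition(-sims, k)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-sims[top])]    # then rank just those k

rng = np.random.default_rng(1)
corpus = rng.normal(size=(5_000, 128))
query = corpus[42] + 0.01 * rng.normal(size=128)  # near-copy of doc 42
ranked = top_k_cosine(corpus, query, k=5)
```

Note that the function is asked for exactly k results, never for a total hit count; that is the design choice the whole section is about.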
Also, this is often all that's needed, especially in the context of Retrieval-Augmented Generation (RAG), where the goal is to feed a manageable number of relevant text snippets to a large language model to generate a summarized answer. Why find and rank a million documents when the language model will only ever "read" the top ten?
Some vector databases do make it technically feasible, for a small enough corpus, to show the exact number of documents within a specific range of cosine similarity. Deep search tools, which are not bound by the traditional latency budget of under 500 ms (0.5 s), may be able to do so as well.
This explains why SciSpace and Web of Science's Smart Search (semantic search) always show exactly 100 results: the system is simply designed to return the top 100.
In fact, many advanced systems use a hybrid, multi-stage approach often called a "retrieve and re-rank" architecture. Some use a fast retrieval method like BM25 followed by an embedding-based reranker. Others combine BM25 with fast but somewhat inaccurate ANN vector embedding search before reranking. Either way, for such hybrid pipelines the final results shown aren't drawn from one single, comprehensive count but are the top-ranked survivors of a sophisticated, multi-step filtering and evaluation process, making the idea of "hits" in the Boolean sense even more meaningless. See next section.
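A sketch of such a two-stage pipeline follows. Everything here, from the any-term candidate retrieval (standing in for BM25) to the toy vectors, is illustrative, not any vendor's actual implementation:

```python
import numpy as np

def retrieve_and_rerank(query_terms, query_vec, postings, doc_vecs,
                        k_retrieve=100, k_final=10):
    """Two-stage 'retrieve and re-rank' sketch.

    Stage 1: cheap lexical retrieval (any-term match, a stand-in
    for BM25). Stage 2: re-rank candidates by embedding similarity.
    """
    candidates = set()
    for term in query_terms:
        candidates |= postings.get(term, set())
    candidates = sorted(candidates)[:k_retrieve]  # cap the candidate pool

    # Only the capped candidate pool reaches the slower embedding stage
    sims = {d: float(doc_vecs[d] @ query_vec) for d in candidates}
    return sorted(sims, key=sims.get, reverse=True)[:k_final]

postings = {"data": {"doc1", "doc2"}, "search": {"doc2", "doc3"}}
doc_vecs = {
    "doc1": np.array([1.0, 0.0]),
    "doc2": np.array([0.9, 0.1]),
    "doc3": np.array([0.0, 1.0]),
}
top = retrieve_and_rerank(["data", "search"], np.array([1.0, 0.0]),
                          postings, doc_vecs, k_final=2)
```

The number of results the user sees (`k_final`) is fixed by design, and no stage of the pipeline ever computes a total hit count.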
Understanding that some tools like Elicit.com will always show the top 8 results is important to avoid flawed studies that compare Boolean search engines with vector embedding search engines (see Table 2, column "Average number of papers found," for Elicit.com and SciSpace vs Scopus and WOS).
Hit Counts in Deep Search: A Third Animal
So, what about "deep research" tools like AI2 PaperFinder and Undermind that provide a specific, and differing, number of results per search? Are these equivalent to the usual hit counts?
To recap, deep search or deep research tools generally perform iterative or agentic searching “on their own,” typically using an LLM to directly evaluate whether a paper is relevant.
The number of "hits" displayed in these tools is distinct from Boolean hits and can represent a few different things. It might be the number of papers the tool has actively evaluated before a built-in time or computational limit was reached.
The number could also represent the results that passed a certain relevance threshold before the system decided to stop searching further. If you were to prompt these systems to "try harder" or expand the search, it's conceivable they would find more relevant papers.
Systematic review librarians might think of this as the system using a kind of stopping rule (e.g., in active learning) to know when to cease screening. See, for example, Undermind’s white paper describing how they estimate the “convergence of search.”
For example, in the image below, a search in Undermind shows 66% convergence, which means its statistical model estimates it has found 66% of the relevant literature in the corpus. It makes sense here to ask Undermind to extend the search. If you are using AI2 PaperFinder, you can prompt the system to “try harder.”
Undermind showing number analyzed and number of relevant papers found so far and estimated convergence of 66%
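The stopping-rule idea can be caricatured in a few lines. The consecutive-irrelevant-papers heuristic below is my own toy simplification, not Undermind's actual statistical convergence model:

```python
def screen_with_stopping_rule(stream, patience=50):
    """Screen a ranked stream of (paper_id, is_relevant) judgments,
    stopping after `patience` consecutive irrelevant papers. This is
    a crude stand-in for active-learning stopping rules."""
    found, dry_streak = [], 0
    for paper_id, is_relevant in stream:
        if is_relevant:
            found.append(paper_id)
            dry_streak = 0
        else:
            dry_streak += 1
            if dry_streak >= patience:
                break  # evidence suggests convergence; stop screening
    return found

# Toy stream: the first 5 papers are relevant, the rest are not
stream = [(i, i < 5) for i in range(500)]
found = screen_with_stopping_rule(stream, patience=50)
# The reported "hit count" (5) reflects where the rule stopped,
# not an exhaustive Boolean count over the corpus
```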
Navigating a Post-Boolean World Without Hit Counts
This all has profound implications for how we, as information professionals and researchers, approach the search process. For generations, we've relied on the hit count to orient ourselves. A query with millions of results screamed for more specific terms, while a query with zero hits sent us back to the drawing board to broaden our search.
It is worth noting, however, that Google has almost always shown an approximate number of hits.
In the deep search paradigm, this navigational aid disappears. In this world, does it even make sense to judge whether your query is good, since you are letting an agent iterate the search for you? Perhaps the idea is that your initial query doesn't matter as much, because as long as the query is detailed enough, the results will eventually converge.
In other words, the goal is no longer to craft the perfect Boolean string to capture a precise set of documents but to formulate a detailed and nuanced prompt that allows the AI to converge on the most relevant concepts.
This seems to be the direction that tools like Undermind and Elicit are heading, each trying to evaluate or guide quality inputs in its own way. For example, Elicit evaluates “question strength,” while Undermind, like many deep research tools, asks clarifying questions.
Elicit.com “evaluating question strength”
While it may not always make sense to show the total number of potential matches in a "deep search" context, the current inability of these systems to efficiently show all documents that meet a specific similarity threshold, i.e., a range search (e.g., cosine similarity > 0.8), is troubling.
How do we teach “query craftsmanship” when information on the number of hits is unavailable (dense embedding matching) or simply inapplicable (deep search)?
Here’s a wild idea: Should hit counts come back? Vendors could expose estimated recall or confidence intervals, but would that mislead more than it helps?