Better AI Search Rubrics: Roles, Gates, and Retrieval Tests
Some suggestions on how to construct AI search evaluation frameworks
Whenever I give workshops or talks on AI search tools, someone eventually asks whether I have or recommend an evaluation framework or matrix for AI search tools.
I have been hesitant to give a firm answer. I know a thing or two about AI search, but I have much less experience, and frankly less interest, in building formal evaluation frameworks. Still, evaluation frameworks are a serious thread in information literacy, collection assessment, and procurement. The demand for them is real.
My worry is that many evaluation matrices try to do too much at once. They aim to be universal. They give every criterion a score. They weight too many things equally. They also rely heavily on qualitative impressions, including for the thing that matters most when evaluating search tools: whether the tool can retrieve and rank useful material.
To put my argument more clearly: AI search evaluation frameworks should not produce a single universal score, at least not for everybody. They should specify the user role and task, identify non-negotiable gates, and test the most important claims empirically. For AI search tools, retrieval quality should be a core task-performance gate. If a tool cannot reliably retrieve and rank useful material for realistic queries, its interface, citation export, and administrative features are secondary.
That does not mean retrieval is the only non-negotiable gate. For institutional procurement, there may be other hard gates: privacy, accessibility, licence terms, data retention, security, local compliance requirements, and so on. A tool that fails those may be unacceptable regardless of how good its retrieval is. My point is narrower. Once we are evaluating something as a search tool, retrieval should not be treated as just one ordinary category among many.
In my last post, I mentioned two librarian-led projects that implemented AI search evaluation frameworks, both coincidentally vibe-coded using Claude and both drawing at least partly on my work. One was by Wang Huajin and the team at Carnegie Mellon, and the other by Alfred Wallace at the University of North Dakota.
Both are good. More than good, actually. They show a relatively sophisticated understanding of how AI search tools differ. For example, they distinguish between tools that merely use an LLM to generate Boolean queries, tools that add reranking, tools that combine lexical and semantic retrieval, and tools that orchestrate multi-step agentic searches. That is already a step above treating “AI search” as a single homogeneous category.
Still, no framework is perfect. In this post, I want to suggest three improvements that could make AI search evaluation frameworks, including other existing ones like the REACT (Relevancy, Ease of Use, Assessing DEIA, Currency, Transparency & Accuracy) framework[1], more useful for libraries:
Be explicit about whose needs the framework serves.
Treat retrieval capability as a gating criterion, not just one category among many.
Replace some qualitative judgements with lightweight empirical tests.
None of these suggestions are novel in isolation. But I think putting them together would make these frameworks more practical and more honest.
1. Evaluation matrices should be clear about who they are for
One of the hardest issues in designing an evaluation matrix is deciding whose needs matter most.
An undergraduate does not need the same things from an AI search tool as a faculty researcher. An instructor does not necessarily care about the same things as an evidence synthesis librarian. A graduate student doing exploratory research has different priorities from a librarian supporting a systematic review.
For example, reproducibility and interpretability are crucial for evidence synthesis. If you are supporting a systematic review, you need to know what was searched, how it was searched, and whether the same search can be repeated. But for an undergraduate writing an essay on a fairly well-trodden topic, reproducibility may matter much less.
Similarly, an undergraduate may value ease of use, clear explanations, and help finding a few credible sources. A researcher working in an emerging area, where terminology has not yet stabilised, may care much more about retrieval depth and the ability to surface relevant papers that keyword search would miss.
Many evaluation frameworks seem to assume that there can be one universal matrix for AI search tools that applies to every task or user. They assign points to many criteria and then produce an overall score. I think this is usually a mistake.
Without a clearly stated viewpoint, the best case is that the framework becomes too diffuse to be useful. The worst case is that the designer unintentionally encodes their own priorities too heavily into the scoring system.
For example, a librarian might give substantial weight to administrative features such as COUNTER statistics support, authentication options, and vendor support speed, because those are the areas they know best and need most. Those matter for procurement and management. But they do not matter much to the actual user trying to find relevant literature.
This is why I like Alfred Wallace’s Evaluating AI Tools for Research matrix. It allows the evaluator to choose a role, such as faculty researcher, graduate student, undergraduate, instructor, or student. The selected role then changes which criteria are treated as priorities or key questions.
I played around and vibe-coded my own AI search evaluation framework, based on the AI-Powered Tool Assessment Framework and AI Rubric: Evaluating AI Tools for Research.
It is not very good currently and needs much more work, but I can show you a screencap below of roughly what I mean.
Figure: My draft role-based framework, vibe-coded based on AI-Powered Tool Assessment Framework and AI Rubric: Evaluating AI Tools for Research. This shows how different roles have different weightings for the same criteria.
For simplicity, I only define three roles: undergraduate, researcher, and librarian. Each criterion falls into one of three levels depending on the role: critical, key, or important. The defaults can be changed.
In my current system, answers to most criteria fall into four bands, with values from 0 to 3. These are weighted more heavily if the criterion is key rather than important. Critical criteria are handled differently as “gates”, which I will discuss below.
Figure: This criterion is defined as a key question and weighted more heavily, so this tool scores 2 x 10 = 20 points on this criterion.
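To make the arithmetic concrete, here is a minimal sketch of role-based weighting. The weight of 10 for key criteria comes from the example in the figure. Everything else is an assumption for illustration only: the weight of 5 for important criteria, the criterion names, the ratings, the role profiles, and the choice to leave critical criteria out of the sum because they act as gates.

```python
# Hypothetical weights: 10 for "key" matches the 2 x 10 = 20 example in the figure;
# 5 for "important" is an assumption made purely for illustration.
LEVEL_WEIGHTS = {"key": 10, "important": 5}

def weighted_score(ratings, levels):
    """Sum band ratings (0-3) multiplied by the weight of each criterion's level.

    Criteria marked "critical" are skipped here (an assumption): they are
    treated as gates rather than as contributors to the composite score.
    """
    total = 0
    for criterion, rating in ratings.items():
        level = levels[criterion]
        if level == "critical":
            continue
        if not 0 <= rating <= 3:
            raise ValueError(f"{criterion}: rating must be between 0 and 3")
        total += rating * LEVEL_WEIGHTS[level]
    return total

# Made-up criteria and ratings for one tool.
ratings = {"ease_of_use": 2, "citation_export": 3, "retrieval": 1}

# The same ratings scored under two hypothetical role profiles.
undergraduate = {"ease_of_use": "key", "citation_export": "important", "retrieval": "critical"}
librarian = {"ease_of_use": "important", "citation_export": "key", "retrieval": "critical"}

print(weighted_score(ratings, undergraduate))  # 2*10 + 3*5 = 35
print(weighted_score(ratings, librarian))      # 2*5 + 3*10 = 40
```

The same tool, with the same ratings, gets a different total depending on the role profile, which is the whole point of stating the role up front.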
We can quibble about the exact roles and weights. The principle is what matters. A framework should make clear whose interests it is optimising for.
A simplified version of what each role prioritizes might look something like this:
This matters because it helps libraries justify their decisions by making clear which group of users a tool is for. Instead of saying, “For all users, Tool A scored 82 and Tool B scored 76,” we can say something more meaningful: “Tool A is better for undergraduate discovery (scoring 82), while Tool B is more suitable for advanced research or evidence synthesis (scoring 92), taking into account what each group cares about.”
2. Think non-negotiable gates, not just composite scores
Many evaluation frameworks are extremely comprehensive. They try to account for everything: retrieval, usability, accessibility, privacy, sustainability, source coverage, citation handling, export options, transparency, administration features, and more.
They assign weights to each criterion (typically evenly, without much thought), expect evaluators to score each area, and then sum the scores to get a final score.
But this can cause you to miss the big picture. If accessibility is a mandatory requirement for legal reasons, or if you are an evidence synthesis researcher who cannot use tools whose results are not reproducible, you should treat these criteria as “gates” and test them first.
By a “gate”, I mean a criterion that a tool must pass before we continue evaluating it. If the tool fails that criterion, we stop. It does not matter how well it performs on secondary features.
In the context of AI search, your “gates” could relate to environmental impact, copyright, impact on learning and, of course, performance. For example, if your non-negotiable is that the AI search tool should not undermine learning by generating direct answers, your “gate” might admit only tools that return a list of results rather than answers (aka Deep Search or Quick search tools).
One way to handle this is simply to set an extremely high weight on the mandatory criterion. But I think a cleaner approach is to set a minimum threshold that each gated criterion must reach; otherwise the tool fails no matter what its overall score is.
Retrieval should be a core gate, not just another category
Besides legal requirements, I think retrieval capability should be one of those gates for AI search tools, and this is often overlooked.
A search tool with weak retrieval should not be rescued by good citation formatting, a nice interface, or convenient administrative features. Those things are useful, but only after the tool has passed the basic test of finding and ranking relevant material.
In many evaluation matrices, retrieval strength, even when combined with source coverage, ends up accounting for perhaps 20 to 25 per cent of the overall score. This often happens because frameworks have four or five broad categories and weight them equally.
That badly underweights retrieval. Finding relevant material is the central purpose of a search tool. If the retrieval is poor, the rest is secondary.
Again, I am not arguing that retrieval overrides every other concern. I am arguing that retrieval should be treated as at least ONE core task-performance gate for search tools.
For example:
If a tool is rated “poor” on retrieval capability, it fails the evaluation, regardless of its performance on other criteria.
This approach forces evaluators to decide what is non-negotiable. It also prevents a tool from getting an acceptable overall score by compensating for bad retrieval with peripheral strengths.
The screenshot below shows my example, where retrieval capability, measured through a simple precision test, is set as a critical criterion. The tool needs to be rated 2 or above, corresponding to 50 to 80 per cent precision. In this example, it is rated only 1, corresponding to 31 to 50 per cent precision, so it fails the gating criterion.
Figure: Example of a gate-based evaluation model. Retrieval capability is set as a critical criterion and requires a score of 2 or higher. The tool automatically fails here because it is rated only 1
Tools that fail the minimum retrieval threshold are rejected before secondary criteria such as usability, citation handling, export, or administration features are considered.
Figure: The summary screen highlights that, regardless of the overall score, the tool failed the critical criterion.
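The gate logic itself is simple enough to sketch. In this rough illustration, the 31 to 50 per cent and 50 to 80 per cent bands follow the example above, but the cut-offs for bands 0 and 3, the decision to treat exactly 50 per cent as band 1, and all function and field names are my assumptions.

```python
def precision_to_band(precision):
    """Map a precision value (0.0-1.0) to a 0-3 band.

    The 31-50% and 50-80% bands follow the example in the text; the cut-offs
    for bands 0 and 3, and treating exactly 50% as band 1, are assumptions.
    """
    if precision > 0.80:
        return 3
    if precision > 0.50:
        return 2
    if precision > 0.30:
        return 1
    return 0

def evaluate(gate_ratings, gate_minimums, composite_score):
    """Return a verdict in which any failed gate overrides the composite score."""
    failed = [name for name, minimum in gate_minimums.items()
              if gate_ratings[name] < minimum]
    return {
        "passed": not failed,
        "failed_gates": failed,
        "composite_score": composite_score,
    }

band = precision_to_band(0.40)  # 40% precision falls in band 1
verdict = evaluate({"retrieval": band}, {"retrieval": 2}, composite_score=82)
print(verdict)
```

Here the composite score of 82 is still reported, but the verdict is a fail because the retrieval band is below the minimum of 2. The design choice is that a threshold cannot be outvoted, whereas a very large weight can.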
Paying attention to retrieval is particularly important now because “AI search” is not a mature or stable category. Many vendors are under pressure to show that they are “doing AI”. Some products are genuinely rethinking retrieval. Others are merely bolting an LLM onto a conventional search system, using it to generate Boolean queries and then calling the result AI search.
That may or may not improve performance. Sometimes it may add little. Sometimes it may make things worse.
I have personally tested an “AI search” tool that was worse than standard lexical keyword search. It was not subtly worse. It was obviously broken. Librarians and researchers testing it could tell almost immediately that the retrieval quality was poor.
If I had treated retrieval as a gate, I could have saved everyone time before asking them to test it.
This is why a gate-based model matters. It prevents the evaluation from being diluted by a long list of attractive but secondary features and can save time.
3. Some criteria should be made harder and more testable
Another thing I notice about many librarian evaluation matrices is that they rely heavily on qualitative judgement.
Sometimes this is unavoidable. Some criteria really are soft. Usability, for example, is difficult to reduce to a single objective number unless you run proper usability testing, which takes time and expertise.
But some criteria can be made more empirical, especially the critical ones.
Take reproducibility. A simple test is to run the same query several times and record whether the same results appear in the same order. You might run the query five or ten times, possibly across different sessions or accounts to reduce caching effects, and see how stable the ranking is.
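As a rough sketch of how you might record this, the snippet below compares the top results from repeated runs of one query. It assumes you have already copied each run's results into a list of identifiers such as DOIs; the function name and the two summary measures (share of run pairs with identical order, and mean Jaccard overlap of the result sets) are my own choices, not an established standard.

```python
from itertools import combinations

def stability_report(runs, k=10):
    """Compare the top-k results of repeated runs of one query.

    `runs` is a list of ranked lists of document identifiers (e.g. DOIs),
    one list per run. Returns the share of run pairs whose top-k lists are
    identical in order, and the mean Jaccard overlap of the top-k sets.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    top = [run[:k] for run in runs]
    pairs = list(combinations(top, 2))
    identical = sum(1 for a, b in pairs if a == b)
    overlaps = []
    for a, b in pairs:
        union = set(a) | set(b)
        overlaps.append(len(set(a) & set(b)) / len(union) if union else 1.0)
    return {
        "identical_order_share": identical / len(pairs),
        "mean_jaccard_overlap": sum(overlaps) / len(overlaps),
    }

# Made-up results from three runs of the same query.
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d4"],
    ["d2", "d1", "d3", "d5"],
]
print(stability_report(runs, k=4))
```

A tool can return the same set of papers in a different order, so it helps to look at both numbers rather than only one.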
Take interpretability. If an AI search system converts a natural language query into a Boolean string, you can copy the generated Boolean query into the conventional search interface and check whether the results match. If they do not, then either the system is not doing what it claims, or the Boolean explanation does not fully represent the actual search process.
Take claim-source faithfulness. If a tool generates an answer and cites sources, evaluators can check how often the cited sources actually support the claims made.
Michael Gusenbauer’s work has influenced many of my thoughts here. His papers offer comparatively objective ways to assess academic search systems for evidence synthesis, including tests of search functionality, retrieval qualities, database size, and subject coverage. SearchSmart builds on this work by estimating coverage of academic databases using methods such as query hit counts and the Basket of Keywords approach, rather than relying only on vendor descriptions or self-reported coverage.
But the area where I think empirical testing matters most is retrieval capability. If retrieval is a core task-performance gate, then it should not be assessed only by impression.
This is where things get more complicated.
Alfred Wallace’s framework asks evaluators to assess the “AI search architecture spectrum”, distinguishing between LLM-generated Boolean, Boolean plus reranking, hybrid search, and agentic or deep search.
Figure: Alfred Wallace’s framework asks evaluators to assess the AI search architecture spectrum.
That is a useful first approximation. In general, I would expect a well-implemented agentic or deep search tool to outperform a tool that simply asks an LLM to generate a Boolean query and then runs it through a conventional retrieval system.
But retrieval is messy. Implementation matters. A poorly implemented “deep search” tool may perform worse than a simpler hybrid system. Marketing labels are not enough.
So we need empirical tests, even if they are rough.
How might librarians test retrieval strength?
Last May, I wrote two posts on testing AI academic search engines (I) and (II). I also drafted a third post focused specifically on retrieval testing, but it became too long and technical. It went from TREC-style ad hoc retrieval evaluation to modern RAG evaluation, and I eventually decided it was probably too much for the audience I had in mind.
The challenge is balance. We should not expect most librarians to run formal information retrieval evaluations. We are not trying to reproduce TREC methodology, nor are we trying to run a Study Within A Review such as the Cochrane Evaluation of (Semi-) Automated Review methods (CESAR).
But we can do better than impressions.
Below are two lightweight but far from perfect tests of retrieval for librarian evaluations of search tools that you can try[2].
This will not produce bullet-proof, publishable IR research. But it is often better than simply giving your feelings on whether the results “look good” in general.
Retrieval test 1: recall@K using a known set
One relatively simple test is to use a topic where you already know many of the relevant papers and are fairly confident you have most, if not all, of them.
For example, suppose you have worked extensively on a topic and already have a set of papers you consider relevant. You can treat that set as a rough gold standard. Then you run the same query in each AI search tool and check how many of those known relevant papers appear in the top K results.
This gives you recall@K.
For example, if your gold standard set has 20 relevant papers, and a tool retrieves 8 of them in the top 30 results, then its recall@30 is 8/20, or 40%.
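The calculation is easy to script once the results are in a list. This is a minimal sketch that reproduces the worked example above; the identifiers are placeholders, and in practice you would need to match papers on something stable such as a normalised DOI, which is an assumption on my part about how you record results.

```python
def recall_at_k(ranked_results, gold_set, k):
    """Share of the gold-standard set that appears in the top k results."""
    if not gold_set:
        raise ValueError("gold_set must not be empty")
    found = set(ranked_results[:k]) & set(gold_set)
    return len(found) / len(gold_set)

# Reproduce the worked example: 20 gold papers, 8 of them in the top 30.
gold = {f"paper{i}" for i in range(1, 21)}
results = [f"paper{i}" for i in range(1, 9)] + [f"other{i}" for i in range(1, 23)]
print(recall_at_k(results, gold, k=30))  # 0.4
```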
The question, of course, is what value of K to use. There is no universal answer. Common values might be 10, 20, or 50. Personally, I would often use 20 or 50 because that roughly matches the number of results I am likely to scan seriously.
One obvious source of gold standard sets is systematic reviews, meta-analyses, review articles, and survey papers. You can use their included studies or references as a benchmark.
There is a catch. If the review article has already been published, some AI search tools may find the review itself and mine its references. That makes the test less clean. Still, for practical library evaluation, it may be good enough, especially if the aim is not formal research but comparative assessment.
For more sophisticated testing, you could also use NDCG@K, or Normalised Discounted Cumulative Gain. This is rank-sensitive, meaning it gives more credit when highly relevant results appear near the top.
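For anyone curious what that looks like in practice, here is a small sketch of NDCG@K. It uses the linear-gain form of DCG (grade divided by log2 of rank plus one); other variants exist, and the grading scale, the treatment of unjudged documents as grade 0, and the document names are assumptions for illustration.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with linear gain: sum of grade / log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg_at_k(ranked_results, relevance, k):
    """NDCG@k for a ranked list, given a dict of graded relevance judgements.

    Documents missing from `relevance` are treated as grade 0 (an assumption).
    The ideal ranking is built from all judged documents, so relevant
    documents the tool missed still lower the score.
    """
    actual = [relevance.get(doc, 0) for doc in ranked_results[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0

relevance = {"a": 2, "b": 1, "c": 2}  # hypothetical grades on a 0-2 scale

best = ndcg_at_k(["a", "c", "b"], relevance, k=3)   # ideal ordering
worse = ndcg_at_k(["x", "b", "a"], relevance, k=3)  # best results lower down
print(best, worse)
```

The second ranking scores lower both because an irrelevant result sits at rank 1 and because one highly relevant paper is missing altogether.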
Retrieval test 2: precision@K when no gold standard exists
Not everyone has a gold standard set of papers lying around. In that case, a more realistic test is precision@K.
The process is simple:
Choose a realistic query.
Run it in each tool.
Look at the top K results, such as the top 10.
Judge how many are relevant.
Calculate the proportion of relevant results.
If 7 of the top 10 results are relevant, precision@10 is 70%.
This is easy to understand and relatively easy to run. It does not tell you whether the tool found all the relevant literature, but it does tell you whether the top results are useful.
For many users, especially undergraduates and those just doing quick exploratory searches, this may be a reasonable test. They are often not trying to find everything. They are trying to find enough good material quickly.
However, precision@K has a limitation: it is not order-sensitive. A tool that places the best result at rank 1 and another that places it at rank 10 may receive the same precision@10 score. If ranking quality matters, you could also consider Average Precision (AP@K).
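A short sketch shows the difference between the two measures. It assumes you have recorded your relevance judgements as a list of True/False values in rank order. AP@K has more than one definition; the version here normalises by the number of relevant results found in the top K, which is one common convention and suits the case where there is no gold standard.

```python
def precision_at_k(judgements, k):
    """Share of the top k results judged relevant.

    `judgements` is a list of booleans in rank order (True = relevant).
    Divides by k even if fewer than k results were returned.
    """
    return sum(judgements[:k]) / k

def average_precision_at_k(judgements, k):
    """Mean of precision@i over each rank i (<= k) that holds a relevant result.

    Normalising by the number of relevant results found in the top k is one
    common convention, used here because no gold standard is assumed.
    """
    hits = 0
    running = 0.0
    for rank, relevant in enumerate(judgements[:k], start=1):
        if relevant:
            hits += 1
            running += hits / rank
    return running / hits if hits else 0.0

# The worked example: 7 of the top 10 results are relevant.
print(precision_at_k([True] * 7 + [False] * 3, 10))  # 0.7

early = [True] + [False] * 9  # the one relevant result at rank 1
late = [False] * 9 + [True]   # the one relevant result at rank 10
print(precision_at_k(early, 10), precision_at_k(late, 10))                  # 0.1 0.1
print(average_precision_at_k(early, 10), average_precision_at_k(late, 10))  # 1.0 0.1
```

Both rankings have the same precision@10, but AP@10 rewards the tool that put the relevant result first.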
Again, though, I would not overcomplicate this at the start. A rough precision@10 test across several realistic queries is already useful. In fact, I suspect most librarians and researchers already do something like this informally; they just don’t keep track of the results.
Figure: Three lightweight ways to test retrieval strength.
Why formal information retrieval is not easy
Studying and evaluating information retrieval results requires a ton of expertise.
I don’t want to make the discussion too technical but here are several issues that complicate the story.
We need to distinguish corpus coverage from retrieval and ranking. A tool may fail to retrieve a paper because it is not in the corpus, because the query did not match it, or because the ranking pushed it too far down. For example, when Undermind does not find a known gold standard paper at, say, recall@50, you should run a known-item search in the source it uses to check whether the failure is due to the corpus or the search.
We need to be careful when using known sets from systematic reviews, meta-analyses, review articles, or survey papers. These can be useful, but they are not perfect gold standards.
Precision testing sounds simple, but formally you need clear inclusion and exclusion criteria before judging the results, and ideally multiple assessors, to avoid it becoming just another form of subjective impression[3].
We need to choose metrics that are understandable and appropriate. Precision@K, recall@K, known-item success, NDCG@K, and MAP@K answer slightly different questions.
We will have difficulty deciding what is a “normal” baseline recall/precision@K etc for an academic search. What could be the baseline? Google Scholar? Your library catalog/discovery service?
These are important, and I will discuss these issues and more in future posts. For this post, the simpler point is enough: if retrieval is central to the purpose of the tool, then retrieval should be tested, not merely described.
Choosing good test queries
Still, I want to briefly discuss a critical issue when testing search engines. Choosing which queries to test is obviously tricky.
But a very common mistake I see is testing only “easy” queries. An “easy” query is not always the same as a broad query, but the two often overlap. If your topic has many relevant papers that are obvious and easy to find (because the terminology is known and stable), most decent systems will return something plausible. That makes it hard to distinguish strong retrieval systems from merely adequate ones, and with “AI search” that is exactly what we want to do.
I occasionally see librarians say that the new class of AI search engines isn’t much better than conventional search, mostly because they make this mistake. Alternatively, they can’t tell the difference between AI search engines of obviously different quality because the test results look close.
You can often tell this is happening when multiple retrieval systems achieve high apparent precision but return very different sets of results with low overlap. Low overlap between systems may suggest that there are many relevant papers in the index, so the test is not very discriminating.
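If you want to check for this pattern rather than eyeball it, a rough sketch is to compute the overlap between the top results of each pair of tools. The tool names and identifiers below are placeholders, and Jaccard overlap is simply one convenient measure I have picked for illustration.

```python
from itertools import combinations

def pairwise_overlap(results_by_tool, k=10):
    """Jaccard overlap of the top-k result sets for every pair of tools."""
    overlaps = {}
    for (tool_a, res_a), (tool_b, res_b) in combinations(sorted(results_by_tool.items()), 2):
        a, b = set(res_a[:k]), set(res_b[:k])
        union = a | b
        overlaps[(tool_a, tool_b)] = len(a & b) / len(union) if union else 0.0
    return overlaps

# Made-up top-4 results from three hypothetical tools for the same query.
results_by_tool = {
    "Tool A": ["d1", "d2", "d3", "d4"],
    "Tool B": ["d3", "d4", "d5", "d6"],
    "Tool C": ["d7", "d8", "d9", "d1"],
}
for pair, overlap in pairwise_overlap(results_by_tool, k=4).items():
    print(pair, round(overlap, 2))
```

If every tool scores well on precision@K but the pairwise overlaps are close to zero, that is a hint the query is too easy to separate the tools.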
A better test set should include some harder queries. For example:
topics where terminology varies across disciplines;
emerging topics where vocabulary has not stabilised;
interdisciplinary questions;
questions involving very specific conditions for relevance;
known-item searches for papers that should be found but are not obvious from simple keywords.
Overall, if you know your query has very few relevant articles and those articles cannot be found easily with trivial keywords, the query is likely to be hard.
How many queries should you test? The glib answer is: as many as you can. You could even crowdsource them. But some testing is better than none.
I find even five well-chosen queries can reveal a lot, allowing you to distinguish the average from the best, especially if they include difficult cases.
The unanswered question about cost
Even if a tool passes the retrieval gate and satisfies role-specific criteria, there is one final pragmatic hurdle: the cost-to-performance ratio.
In the information retrieval research community, people often chase marginal gains in metrics such as Mean Average Precision (Average Precision across multiple queries) without caring about the cost (in terms of compute or latency). In an institutional library context, the question is different. A 5 per cent improvement in retrieval performance may or may not justify a much higher subscription cost, additional staff training, or even privacy risk.
So the question is not simply whether Tool B retrieves slightly better results than Tool A. The question is whether that improvement is worth the cost and trade-offs for the institution and for the users the library is trying to serve.
That is not an easy question. But at least a role-based, gate-based, test-informed framework gives us a better way to ask it.
What this means for AI search evaluation frameworks
Putting all this together, I would make three changes to most AI search evaluation frameworks.
First, I would scope the framework to a specific user role and task. A single universal score is usually misleading. The better question is: useful for whom, and for what purpose?
Second, I would treat retrieval capability as a core task-performance gate for search tools. This does not mean retrieval is the only possible hard gate. Privacy, accessibility, licence terms, data governance, and institutional requirements may also be non-negotiable. But if we are evaluating a search tool, it should first be able to search well.
Third, I would make at least some criteria more empirical. Reproducibility can be tested by rerunning queries. Interpretability can sometimes be tested by comparing generated search strings with actual results. Claim-source faithfulness can be tested by checking whether cited sources support generated claims. Retrieval can be tested, even lightly, using realistic queries and simple measures of usefulness.
The gating point matters most to me. A search tool with weak retrieval is not redeemed by good citation handling, a polished interface, or convenient administration features. Those things matter, but only after the tool has passed the basic test of finding useful material.
I am not planning to publish my own framework. I do not want the maintenance burden, and I do not have a strong personal use case for one.
But if you are building one, my advice is simple: decide who it is for, decide what is non-negotiable, and make the most important criteria as testable as you reasonably can.
In future posts, I will go deeper into retrieval testing itself.
[1] To be fair, this framework doesn’t seem focused just on AI search tools, but also covers other literature review tools, such as “citation-based literature mapping tools” like Research Rabbit, Connected Papers and Litmaps.
[2] “Retrieval capability” is multi-dimensional and there are many ways to measure it. Though I suggest just one lightweight test, formal testing would often measure retrieval capability in multiple ways to get the full picture. If you include search engines that generate RAG type answers, there are even more metrics to use for measurement.
[3] We need to define relevance criteria before judging results, though this does not mean building a full gold-standard set in advance. TREC-style pooling, for example, judges documents after they have been retrieved. But the topic, task, and judgement rules still need to be clear; otherwise a lightweight precision test can collapse back into subjective impression.

Hi Aaron, thanks for sharing your knowledge, always very useful! I liked the test methodologies very much. Actually, when I did my test on AI tools, back in mid-2025, I applied a methodology very similar to the one you described in 'Retrieval test 1: recall@K using a known set'. (https://www.researchgate.net/publication/396120553_Do_AI_research_assistants_live_up_to_their_hype_An_Exploratory_Study_of_Some_Freely_Available_Tools)