Testing AI Academic Search Engines - What to find out and how to test (2)
Following my recent talk for the Boston Library Consortium, many of you expressed a strong interest in learning how to test the new generation of AI-powered academic search tools. Specifically, evaluating systems using Retrieval-Augmented Generation (RAG) was the top request, surpassing interest in learning more about semantic search or LLMs alone.
This is a crucial topic, as these tools are rapidly entering our landscape. This post outlines my current thinking on practical ways librarians can evaluate and understand them, through a series of questions whose answers you might be able to obtain by reading the documentation, asking the vendor, or lightly testing the system directly.
What Are We Evaluating? AI Academic Search with RAG
First, let's clarify the type of tool we're discussing (different types of tools may require different approaches). I'm focusing on systems that:
1. Function as Search Engines: They use your query specifically to search an academic corpus (unlike general chatbots, which may search only optionally via "search as a tool" functionality).
2. Generate Summaries with Citations: They use RAG to produce a synthesized text answer (a paragraph or more) based on retrieved documents, including citations linking back to those sources.
Examples include Elicit, Scite Assistant, Consensus, Scopus AI, and the research assistants for Primo and Web of Science.
These tools often present results with a generated summary alongside the list of source documents, though the exact layout varies.

This is of course a simplified layout; there are still many UI decisions to be made. Given that recent studies still show RAG systems generating sentences that are not fully supported by valid citations, it is important for the system to consider carefully how citations are displayed in the generated answer and to make it easy to verify generated statements against their citations. For example, are they inline numbers [1] or hover-over highlights?
Below shows another possible layout from Scite Assistant with the references on the right panel.

Examples of what I consider AI academic search include Elicit.com, Scite Assistant*, Consensus, Scopus AI, Primo Research Assistant, Web of Science Research Assistant and many more (see list here).
But what should we test and how?
The RAG Challenge
RAG systems work in two main stages:
1. Retrieval: Finding relevant documents based on your query.
2. Generation: Using a Large Language Model (LLM) to synthesize an answer based only on the retrieved documents.

The special prompt built into the RAG system might look something like this:
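Here is a simplified, entirely hypothetical sketch in Python of how the two stages and the built-in prompt might fit together. The prompt wording and the search_index and call_llm helpers are placeholders I made up for illustration, not taken from any actual product.

```python
# Minimal, hypothetical sketch of a two-stage RAG pipeline for an academic search
# engine. The prompt wording and the retrieval/LLM calls are placeholders only.

RAG_PROMPT = """You are a research assistant. Answer the question using ONLY the
numbered documents below. Cite every claim with the matching number, e.g. [1].
If the documents do not contain the answer, say so.

Question: {question}

Documents:
{documents}
"""

def search_index(query: str, limit: int) -> list[dict]:
    # Stage 1 (Retrieval) placeholder: keyword, semantic, or hybrid search.
    return [{"title": "Example paper", "abstract": "An illustrative abstract."}][:limit]

def call_llm(prompt: str) -> str:
    # Stage 2 (Generation) placeholder: a call to whatever LLM the vendor uses.
    return "A generated summary with citations like [1]."

def answer_query(question: str, top_n: int = 8) -> str:
    docs = search_index(question, limit=top_n)
    context = "\n".join(f"[{i + 1}] {d['title']}: {d['abstract']}"
                        for i, d in enumerate(docs))
    return call_llm(RAG_PROMPT.format(question=question, documents=context))

print(answer_query("Does exercise improve cognition in older adults?"))
```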

The quality of the final answer heavily depends on Stage 1: Retrieval. If the system fails to find the right information, the LLM has nothing good to work with, leading to weak, incomplete, or even misleading answers. Therefore, evaluating the retrieval component is critical.
A Practical Approach for Librarians
While academic researchers have rigorous (and time-consuming) methods for evaluating information retrieval (like TREC), librarians need practical approaches. We need to understand these tools well enough to guide users and make informed decisions, without needing weeks of formal testing.
The preliminary "framework" presented here focuses specifically on questions designed to help librarians understand the functional performance and underlying mechanics of these RAG systems.
You can investigate the answers to these questions through checking the documentation and targeted testing. I've marked suggestions for hands-on testing with [TEST].
I tested this framework during a recent comparative review of Primo Research Assistant, Web of Science Research Assistant and Scopus AI (to be published in Katina resource reviews).
Key Questions for Evaluating RAG Academic Search Engines
To structure our evaluation, we can draw inspiration from comprehensive frameworks designed for assessing academic search systems. For instance, Gusenbauer & Haddaway (2020) provide extensive criteria for evaluating the suitability of search tools, particularly for demanding tasks like systematic reviews.
The questions below adapt and simplify those criteria to the elements most relevant to RAG systems, focusing on practical insights librarians can gain through documentation review and targeted testing.
It's also important to note that this practical testing framework deliberately excludes broader, though equally critical, evaluation areas such as:
Ethical Considerations: Questions around the copyright and IP implications of training and using LLMs.
Environmental Impact: Assessing the computational resources and energy consumption associated with running these often complex AI models.
Let's break down the evaluation into understanding the retrieval process, the generation process, and the user interface for verification.
Part 1: Understanding the Retrieval Component
What content is being searched (The Index)?
Why it matters: The RAG answer depends entirely on what's retrieved. Knowing the source index reveals potential coverage gaps.
Does it include only open scholarly metadata from Semantic Scholar or OpenAlex (like Undermind.ai and Elicit.com), or does it use a proprietary index (like the Scopus or Web of Science core collections)? Does it retrieve over metadata only, or over full text (open access only, or including some paywalled content)?
Does it use the whole source, or does it filter out subsets? This could be by choice (e.g. Semantic Scholar fuels both Undermind.ai and Elicit.com, yet they show different numbers of indexed items due to different ingestion criteria and updating strategies, while Scopus AI searches only over Scopus data from 2003 onwards) or due to opt-out by the content owner (like Elsevier in Primo Research Assistant).
Does it search only holdings licensed by the institution (e.g. the Web of Science Core Collection editions owned) or the entire index regardless of holdings (like Primo Research Assistant)?
How to check: Review documentation.
[TEST]: If unclear whether full text is used, design queries where the answer likely resides only in the full text and see if the system can answer accurately. Check and verify for known content opt-outs by purposely searching for such content.
How does the search actually work (Retrieval Mechanism)?
Why it matters: Most RAG systems encourage you to type your query in natural language. While keyword search will probably be part of the retrieval mechanism, the system is likely to use other methods as well. This affects the relevance performance, interpretability and reproducibility of the search.
Does it use an LLM to translate your natural language query into a Boolean string (like Scopus AI, WoS RA, Primo RA)?
Does it use "semantic" search (like dense vector embeddings or learned sparse retrieval like Elicit)?
Does it do any form of two-stage re-ranking? (e.g. Primo Research Assistant re-ranks the top 30 results with embedding search)
Or is it a hybrid approach combining multiple methods?
How to check: Check the documentation, though not all details are always mentioned. Still, vendors, particularly traditional providers, often highlight whether they use natural language processing or semantic search.
[TEST]: You might check the interface for clues (e.g., some tools show the generated Boolean query if an LLM is used to create Boolean search strategies; you can check, for example, whether Primo Research Assistant really creates Boolean search strategies the way the documentation describes). You can also check for additional reranking by comparing the default relevancy sort of the equivalent keyword search against the actual top results; if they differ, there is likely some type of additional reranking. A rough sketch of this kind of hybrid pipeline follows below.
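To make the moving parts concrete, here is a rough, hypothetical sketch of the kind of hybrid pipeline some vendors describe: an LLM converts the natural-language query into a Boolean string, a keyword index returns candidates, and an embedding model re-ranks the top hits. Every function body is a placeholder, not any vendor's actual implementation.

```python
# Hypothetical sketch of a hybrid retrieval pipeline: LLM-generated Boolean query,
# keyword search for candidates, then embedding-based re-ranking of the top hits.

import numpy as np

def nl_to_boolean(query: str) -> str:
    # Placeholder for the LLM step that rewrites natural language as Boolean.
    return '("large language model*" OR LLM) AND hallucinat*'

def keyword_search(boolean_query: str, limit: int = 30) -> list[dict]:
    # Placeholder for the lexical index (a Solr/Elasticsearch-style backend).
    return [{"id": i, "abstract": f"Candidate abstract {i}"} for i in range(limit)]

def embed(text: str) -> np.ndarray:
    # Placeholder for a sentence-embedding model; returns a toy unit vector here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def retrieve(query: str, top_n: int = 8) -> list[dict]:
    candidates = keyword_search(nl_to_boolean(query), limit=30)
    q_vec = embed(query)
    # Re-rank the top 30 keyword hits by cosine similarity to the query embedding.
    candidates.sort(key=lambda d: float(q_vec @ embed(d["abstract"])), reverse=True)
    return candidates[:top_n]
```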
Are the search results consistent and explainable (Reproducibility & Interpretability)?
Why it matters: Interpretability of search results is important for certain use cases (e.g. systematic reviews), and this is mostly a matter of the retrieval mechanism. As for reproducibility, if the same query returns different source documents each time, the generated RAG answer will of course also vary!
LLMs might also be used in other parts of the retrieval mechanism such as for query expansion which can also lead to more randomness.
How to check: Understand the retrieval mechanism (see above).
[TEST]: Assuming the RAG system uses an LLM to generate Boolean queries, run the exact same query multiple times (e.g., five times in quick succession) and look at the generated Boolean query in the interface (as in Scopus AI or WoS Research Assistant). How much does the generated Boolean change? (I found WoS Research Assistant fairly consistent, perhaps producing a different query one time in five, while Scopus AI varied more.)
[TEST]: Compare the list of top N source documents retrieved each time (where N is the number used for generation; see below). Does the order or composition change significantly? (Test five times in quick succession, and consider repeating each batch of five a few times.) A small script for comparing runs is sketched below.
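If you want to quantify the overlap between runs rather than eyeball it, a small script like the following can compute the average pairwise overlap. This is only a sketch; get_top_n_results is a stand-in for however you capture each run's results, whether by hand or via an API.

```python
# Rough sketch of the consistency test: run the same query several times and
# measure how much the set of top-N source documents overlaps between runs.

from itertools import combinations

def get_top_n_results(query: str, n: int = 8) -> set[str]:
    # Placeholder: return the set of DOIs/record IDs shown for one run.
    return {"10.1000/example1", "10.1000/example2"}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def consistency(query: str, runs: int = 5, n: int = 8) -> float:
    batches = [get_top_n_results(query, n) for _ in range(runs)]
    overlaps = [jaccard(x, y) for x, y in combinations(batches, 2)]
    return sum(overlaps) / len(overlaps)   # 1.0 = identical results every run

print(consistency("barriers to telehealth adoption in rural areas"))
```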
Does the natural language search understand search constraints (Metadata Parsing)?
Why it matters: Can you refine your search using natural language for common fields? For example, does "peer-reviewed articles on climate change from 2020-2024" correctly apply filters for publication type and publication date?
How to check: Documentation might list supported natural language commands. Support varies: Primo Research Assistant supports year of publication and a limited set of article types, while Web of Science Research Assistant supports a dazzling array of metadata queries in natural language.
[TEST]: Verify claimed support for metadata parsing by trying queries incorporating dates, author names, affiliations, publication types, citation counts, etc. (e.g. for Web of Science Research Assistant try "Papers by Patel from MIT on genomics since 2022" or "Review articles on solar panels"). See if the results reflect these constraints accurately; a few suggested probe queries are listed below.
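For convenience, here is one possible set of probe queries (my own suggestions, not an official test set), each paired with the constraint it is meant to exercise. Check the returned records manually against each constraint.

```python
# Suggested probe queries for checking metadata parsing, paired with the
# constraint each one exercises. Verify the constraint in the returned records.

probes = [
    ("Papers by Patel from MIT on genomics since 2022", "author + affiliation + year"),
    ("Review articles on solar panels", "document type"),
    ("Peer-reviewed articles on climate change from 2020-2024", "publication type + date range"),
    ("Highly cited papers on CRISPR", "citation count"),
]

for query, constraint in probes:
    print(f"{constraint:35} -> {query}")
```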
Does it handle non-English queries?
Why it matters: Can users query in languages other than English, even if the underlying corpus is primarily English? This is often possible with LLM-based query interpretation or multilingual embedding models.
How to check: Documentation might state language support.
[TEST]: Input a few queries in another language you understand (e.g., French, Spanish, Chinese) and see if the system retrieves relevant English-language documents.
Part 2: Understanding the Generation Component
What is the exact RAG method used?
Why it matters: RAG is a general term these days, and there are many different variants and techniques, such as GraphRAG and RAGFusion, that can lead to quite different results.
How to check: In general, the main way is to check the documentation or ask the vendor.
How many retrieved results feed the summary (Top N)?
Why it matters: This indicates the breadth of information the LLM uses to generate the answer. Is it fixed (e.g., top 5 for Primo RA, top 8 for Scopus AI) or dynamic? Note that the final answer might not cite all N documents if some weren't deemed relevant by the LLM.
How to check: Documentation usually states this.
[TEST]: Run a search and look at the interface, which will typically list all the retrieved references, even those that are not cited in the answer.
What LLM is used for generation?
Why it matters: Different LLMs have varying capabilities (though this is often a black box, not stated explicitly). Some tools offer a choice of models (like Scite Assistant).
How to check: Usually only knowable if stated in the documentation. Often, vendors don't disclose the specific model.
Can the generated answer be in non-English?
Why it matters: If a user queries in French, will the generated summary also be in French, or will it default to English? Systems vary (e.g., Scopus AI answers in English; Primo RA/WoS RA answer in the query language).
How to check: Documentation might specify.
[TEST]: Use a non-English query (as tested in Part 1) and observe the language of the generated summary.
Part 3: Evaluating the User Interface and Verification
How are citations displayed and linked?
Why it matters: Verification is key. How easily can a user connect and check a specific statement in the generated text back to the supporting evidence in the source document? Are citations inline numbers [1], hover-overs, linked phrases, or listed separately? Is it clear which part of the source supports the claim?
How to check: Examine the user interface.
[TEST]: Try to trace a few claims in the generated text back to the cited sources. How easy and accurate is this process? Can you quickly access the source abstract or full text?
Important Note: What We Haven't Covered Yet – Evaluating Output Quality
It's crucial to recognize that the questions and tests outlined in this post primarily help us understand how these AI academic search engines are built and function. We've focused on dissecting their components: the underlying index, the retrieval methods used, the generation process parameters, and the UI mechanisms for verification. This provides a foundational understanding of the system's mechanics and potential capabilities or limitations.
However, we have deliberately not yet addressed the direct evaluation of the quality of the final output – the generated RAG answer itself. Assessing critical aspects such as:
Accuracy and Faithfulness: Does the generated text correctly represent the information found in the cited sources? Are there factual errors or "hallucinations"?
Relevance: How well does the generated answer actually address the user's specific query?
Completeness and Conciseness: Does the summary capture the key information from the sources without being overly verbose or missing crucial points?
Usefulness: Is the generated summary genuinely helpful for the user's information need?
Evaluating these performance aspects requires different approaches, drawing from both long-established Information Retrieval (IR) evaluation techniques (like assessing the relevance of retrieved documents which heavily affects the RAG result) and newer methods specifically developed for evaluating RAG systems (such as measuring faithfulness, answer relevance, and guarding against hallucination).
Measuring this crucial aspect of RAG system performance will be the focus of the next part of this blog series.
Conclusion
Evaluating these AI academic search tools requires a blend of understanding their mechanics and practical testing. By asking the targeted questions detailed above and performing simple tests, librarians can gain a much deeper feel for a system's potential strengths, weaknesses, coverage gaps, and operational characteristics. This foundational knowledge is essential for guiding our users effectively and making informed decisions as we integrate these powerful, but still evolving, tools into our library services. Stay tuned for the next post where we will tackle the challenge of evaluating the quality of the answers these systems produce.
This blog post has been edited with the help of Gemini 2.5 Pro.
Technical note 1: Interpretability and reproducibility of search
Based on your understanding of how retrieval works, you should already have a sense of how non-deterministic or open to misinterpretation the search is. Roughly, you would expect keyword search to be the most interpretable and reproducible, and dense embedding search to be the least.

See this blog post for more information. Since retrieval systems may use a blend of such methods, things can be even less clear-cut; one common blending technique, reciprocal rank fusion, is sketched below.
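As a concrete illustration of blending, here is a short sketch of reciprocal rank fusion (RRF), a simple and widely used way to combine rankings from different retrieval methods. The two input rankings are toy examples of my own, not output from any real system.

```python
# Sketch of reciprocal rank fusion (RRF), which blends rankings from different
# retrieval methods (e.g. BM25 and dense embeddings) into a single ranking.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores the sum of 1 / (k + rank) over every ranking it
    # appears in; documents ranked highly by several methods float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
dense_ranking = ["doc_c", "doc_a", "doc_e", "doc_b"]
print(rrf([bm25_ranking, dense_ranking]))
```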
Technical note 2: Multilingual support
Even though your sources might be mostly in English, many RAG systems are capable of "understanding" your query even if it is entered in one of dozens of non-English languages.
Why do these new AI search systems generally work with non-English queries? This is actually one of the benefits of moving away from purely keyword-based search.
Firstly, if the system uses LLMs to generate Boolean search strategies, this clearly works even if you input your query in, say, Chinese, as most modern LLMs are multilingual and capable of "understanding" your query and creating an appropriate Boolean string in English.
How about the more common dense embedding models? While not guaranteed (there are monolingual embeddings like MiniLM), many of the dense embeddings used are built upon transformer models that are pretrained multilingually (e.g. multilingual BERT). They learn to map text from multiple languages into a shared vector space where similar concepts are represented close together, regardless of the language used.
You essentially need just one index to handle multiple languages. Compare this to lexical keyword-based methods like BM25, where you need a separate inverted index for each language!
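If you want to see this shared vector space in action, the short sketch below uses the sentence-transformers library with one of its multilingual models (any comparable multilingual embedding model would do; the model name is just one example, and the sentences are my own).

```python
# Quick illustration of a shared multilingual vector space using the
# sentence-transformers library and a multilingual embedding model.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = "Effects of exercise on cognitive decline in older adults"
chinese = "运动对老年人认知衰退的影响"      # the same topic, expressed in Chinese
unrelated = "Corrosion resistance of stainless steel pipelines"

embeddings = model.encode([english, chinese, unrelated])

# The English and Chinese sentences should be much closer in the shared
# vector space than either is to the unrelated engineering sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```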