Primo Research Assistant launches- a first look and some things you should know

New! Listen to a 10-minute autogenerated podcast discussing this blog post, created with Google NotebookLM
Ex Libris surprised us by suddenly releasing Primo Research Assistant to production on September 9, 2024 (the earlier timeline was Q4 2024, and some believed it might even be delayed). Even though there are many RAG (retrieval augmented generation) academic search systems today that generate answers from search results, this is still a significant enough event to be worth covering on my blog.
Why? Simply put, while Primo Research Assistant may not break any new ground in terms of features or functionality, it is likely to be the first academic search system using RAG to generate answers that many librarians and users will encounter. Primo, together with its cousin Summon (which will get its own Research Assistant in Q1 2025), are two of the big four default academic search engines used by academic libraries (the other two being EBSCO Discovery Service and WorldCat Discovery Service), and this feature is bundled in!
I estimate this is 20% or so of academic libraries in US/UK/Canada etc (likely higher for elite universities).
As such, here are some things you should know about Primo Research Assistant (Primo RA) and my preliminary thoughts (long). For more information, see "Getting started with Primo Research Assistant" and the "Meet your Research Assistant" session at IGeLU 2024.
#1 Primo Research Assistant is bundled in with your current subscription
Unlike AI offerings such as Elsevier's Scopus AI or the Web of Science Research Assistant from Clarivate (the parent of Ex Libris), there is no additional cost to libraries that already subscribe to Primo.
In fact, it is live right now in production, though you can (and probably should) turn it off.
Do note that as of the time of writing (15 September), librarians are reporting that you can only hide the link to Primo Research Assistant from the Primo interface; if you know the link, you can still access it.
That said, even if you hide it, it will still require user authentication to access.
It is somewhat hard to understand why Ex Libris chose to release this at no additional cost. Unlike most software features, the marginal cost of running RAG-enabled searches is not zero due to LLM inference costs, and the potential user base of Primo and Summon is likely much larger than that of most academic search systems currently implementing this technology. But as you will see, Ex Libris has taken care to mitigate this issue through a) UI design and b) technology choices.
Despite this, there have been suggestions (I forget where I saw it, but it was probably in a webinar) that Primo Research Assistant does have a threshold/throttle limit if your institution sends too many queries via the Research Assistant, and there might even be a premium service add-on in the future.
#2 By default your searches in Primo do not trigger the Primo Research Assistant
Even if you do opt in and turn on Primo Research Assistant, the default search does not trigger the Research Assistant to generate an answer. This is unlike popular tools like Elicit.com and SciSpace, but similar to tools from legacy players like Scopus AI and Statista Research AI.
Instead, you have to explicitly access the Research Assistant from the main menu or from the widget in the brief results.


This makes the Primo Research Assistant a bit more hidden, but it does ensure you won't get a flood of searches all using the Research Assistant by default.
Thinking about it a bit more, do we necessarily want the Research Assistant to run for all Primo queries? Past studies have indicated that a large percentage of queries in a typical library discovery search are actually known-item searches, possibly because students are looking for course material or other content they found mentioned elsewhere (discovery happens everywhere!).
A Research Assistant-generated answer, while interesting, might not be what is needed. Below is what the assistant generates when I search for Elements of Style, 4th edition.

I can imagine working around this issue with clever workarounds: since Primo RA is already using ChatGPT to look at the input (see later), perhaps it can be prompted to detect whether the input is likely a known-item search and, if so, ask whether a generated answer is required.
Still, for now, given all the worries about cost, it might be a good idea not to run it by default.
#3 Primo Research Assistant uses an LLM (large language model), currently GPT-3.5, to convert your input into keyword searches that are run against CDI to retrieve documents
Primo Research Assistant, like most of its peers, encourages you to try typing in natural language, e.g. "How does Vitamin D deficiency impact overall health?"
Under "how to formulate a good question" they state
To make the most of the Primo Research Assistant, it's essential to ask clear and detailed questions about academic or scientific topics. Be as specific as possible and phrase your query in the form of a question. Example queries can be found on the starting screen.
They also advise against
requests for materials of a particular type (e.g. “give me peer reviewed articles about bird migration”) or from a certain time period (e.g. “give me the newest research on climate change”).
as the system does not yet "understand" such requests with regard to metadata. This, by the way, is not an unsolvable issue: for example, scite.ai Assistant (which uses LLMs to construct keyword searches, similar to Primo Research Assistant) is able to interpret year ranges and understand queries like "Papers on X from 2015-2023".

Web of Science Research Assistant searches that can be understood
Instead for Primo Research assistant, try to keep your input simple.
Here I highly recommend you follow the examples given and refrain from doing complicated "prompt engineering" tricks you may have picked up.

The main reason is that most prompt engineering tricks are tested on pure LLMs (large language models), not retrieval augmented generation systems; see this blog post for more on why this is almost always a bad idea.
But how does Primo Research Assistant know what to do with your natural language query?
The conventional way, used by most systems, is to encode your query into an embedding (a long series of numbers) by running it through an embedding model, then do a vector similarity search against documents that were encoded the same way in advance at indexing time; the documents scoring highest in similarity are surfaced.

Conceptual model for information retrieval
This is a very popular method (particularly when used in hybrid fashion together with traditional keyword search), adopted by most "AI search" systems like Scopus AI, Elicit, SciSpace and more. Primo Research Assistant, however, goes a different, less common way.
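To make the contrast concrete, here is a minimal, hypothetical sketch of embedding-based retrieval. The `embed` function is a toy, deterministic stand-in (a real system would use a trained embedding model such as a sentence transformer), and all names and documents are illustrative, not anyone's actual implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: deterministically maps
    text to a unit vector. For structural illustration only."""
    rng = np.random.default_rng(sum(text.encode()) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

# Documents are embedded once, in advance, at indexing time.
docs = [
    "open access citation advantage: a meta-analysis",
    "bird migration patterns in a warming climate",
    "vitamin D deficiency and overall health",
]
doc_vecs = np.stack([embed(d) for d in docs])

def vector_search(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Embed the query and rank documents by cosine similarity
    (a plain dot product, since all vectors are unit length)."""
    q = embed(query)
    scores = doc_vecs @ q
    ranked = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]

results = vector_search("does open access increase citations?")
print(results)
```

The key property is that the expensive work (embedding the corpus) happens at indexing time; at query time only one embedding plus a similarity scan is needed.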
Instead, the large language model (GPT-3.5 at the time of writing) is prompted to take the user input and create 10 keyword variants of it. These variants are then connected with OR, together with the original input.

The Primo RA architecture and flow: Retrieval Augmented Generation
So for the input
Is there an open access citation advantage
It may come up with 10 keyword queries and combine them like this:
(open access citation advantage research) OR (impact of open access on citations) OR (open access publication citation rates) OR (benefits of open access for citations) OR (open access citation impact study) OR (open access citation advantage research) OR (impact of open access on citations) OR (advantages of open access publishing on citations) OR (open access citation benefits) OR (does open access lead to more citations) OR (is there an open access citation advantage?)
with the first 10 keyword searches generated by the LLM, and the bolded part being the original input.
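The assembly of such a query string can be sketched as follows; the variant list here is a hypothetical LLM output, and this is my own illustration of the OR-joining pattern shown above, not Primo's actual prompt or code.

```python
def build_boolean_query(original: str, variants: list[str]) -> str:
    """Wrap each LLM-generated variant and the original input in
    parentheses and join them all with OR."""
    parts = variants + [original]
    return " OR ".join(f"({p})" for p in parts)

# Hypothetical LLM output for the example input (the real system
# prompts GPT-3.5 for 10 such variants).
variants = [
    "open access citation advantage research",
    "impact of open access on citations",
    "open access publication citation rates",
]
query = build_boolean_query("is there an open access citation advantage?", variants)
print(query)
```

Note that the original input is always appended last, so even if the LLM's variants are poor, the user's literal phrasing still participates in the search.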
If you think such a method may not be very good, you might be right.
Studies examining whether LLMs like ChatGPT are good at formulating Boolean queries for systematic reviews have generally shown that they underperform humans, particularly in terms of recall. However, those studies are done in the context of systematic reviews; here we just need the top 5 relevant articles properly ranked, since only the top 5 are used to generate the answer.
In fact, Primo Research Assistant does an additional reranking of the top 30 results, using embeddings to optimize the ranking based on the query match.
The interesting question is why Primo Research Assistant uses this method instead of the more conventional method of doing an embedding search.
I have no doubt that the size and complexity of CDI (Central Discovery Index) is a big part of the reason. While most "AI academic search" systems like Elicit and Undermind, which use Semantic Scholar or similar corpora, have to deal with converting over 200 million documents to embeddings, the CDI index utterly dwarfs that at 5 billion items! (even though, as you will see later, a large part of the index is excluded by default)
And the difference is not just in size but in complexity: CDI consists of not just journal articles but all other resource types as well. Add the fact that they apply a match-and-merge process to combine different variants into one super composite record, and you have a fantastically large and complicated index that might even exceed Google Scholar's.
I imagine trying to convert all of CDI into embeddings and store them in a vector store could be a nightmare (especially as scaling can become difficult for vector stores).
Note that the reranking of the top 30 can be done on the fly with acceptable latency and without any pre-indexing, due to the small number of documents involved. I imagine this step uses a more powerful and computationally expensive cross-encoder model, or a multi-vector embedding model like ColBERT.
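Here is a minimal sketch of what such a reranking step could look like. The `cross_encoder_score` function is a crude token-overlap stand-in for a real cross-encoder (which would jointly encode the query and document with a neural model); everything here is illustrative rather than Ex Libris's actual implementation.

```python
import re

def cross_encoder_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder relevance scorer.
    Here: crude token overlap, for illustration only."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score the top candidates from the keyword search and keep top_k.
    Scoring only ~30 documents keeps latency acceptable with no pre-indexing."""
    return sorted(candidates,
                  key=lambda doc: cross_encoder_score(query, doc),
                  reverse=True)[:top_k]

candidates = [
    "a study of bird migration timing",
    "the open access citation advantage: a meta-analysis",
    "open access and download counts in biology journals",
]
print(rerank("open access citation advantage", candidates, top_k=2))
```

The design trade-off is exactly the one mentioned above: a cross-encoder is far too slow to score 5 billion documents, but entirely practical for 30.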
I am also somewhat disappointed that it does not transparently show the generated Boolean search query, though you can find it via the "View more results from your library search" button. They should definitely learn from the way scite Assistant displays it. Some of the innovations scite Assistant provides, such as the ability to edit and rerun the searches yourself and to apply various advanced filters, should ideally be included too.

Scite assistant transparently shows the searches used and allows you to edit the search
#4 Primo Research Assistant uses the abstracts of the top 5 ranked results to generate the answer
The whole idea of RAG is that you search and find the top-ranked documents or text chunks and then feed them as "context" to an LLM to try to generate an answer to the question with in-text citations. The hope is that by grounding the answer in documents that are found, it will reduce hallucinations and allow you to check the source, as opposed to just asking an LLM directly.
A simple RAG prompt might look something like this.
Please answer the question with the following context if relevant.
<context 1>
<context 2>
...
<context 5>
Because RAG systems only cite what is found in a search, they do not make up fake papers, unlike using an LLM alone. That said, they can "misinterpret" the items they cite; the technical term floating around for this is citation faithfulness. This is known to be particularly problematic when asking questions with no good answers (see later), where the system will still try to generate an answer by twisting the citations. This is the greatest challenge of RAG-generated answers!
That said, what I described above is indeed similar to what Primo Research Assistant does with the abstracts of the top 5 ranked results.
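A minimal sketch of how such a prompt might be assembled from the abstracts of the top-ranked results (illustrative only; this is not Primo's actual prompt, and the titles and abstracts are invented):

```python
def build_rag_prompt(question: str, results: list[dict]) -> str:
    """Assemble a minimal RAG prompt: number the abstracts of the
    top-ranked results so the LLM can cite them inline as [1]..[5]."""
    context = "\n\n".join(
        f"[{i}] {r['title']}\n{r['abstract']}" for i, r in enumerate(results, 1)
    )
    return (
        "Answer the question using only the sources below, citing them as [n]. "
        "Do not cite a source that is irrelevant to the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

top5 = [
    {"title": "The open access citation advantage: a meta-analysis",
     "abstract": "We find a small but significant citation advantage for OA articles."},
    {"title": "Open access and citation counts in biology",
     "abstract": "OA status correlates with higher citation counts in our sample."},
]
prompt = build_rag_prompt("Is there an open access citation advantage?", top5)
print(prompt)
```

Numbering the sources is what lets the generated answer carry checkable in-text citations back to the retrieved records.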

The Primo RA architecture and flow: Retrieval Augmented Generation

Retrieval augmented generation is currently an area of intense research, with an exploding number of techniques introduced in the literature since 2020; it has already spawned at least three survey/review papers! We do not know which particular technique Primo RA uses, but it is probably a lightweight one.
Note that while CDI can store up to the first 65K characters of the full text when available, only the abstract is used in the RAG step.

Just to head off a common misconception: even though the top 5 results shown will always be fed to the LLM to generate an answer, the LLM can "choose" not to use or cite all 5 results if it "judges" some to be irrelevant. In fact, this is a good thing, because you do not want the LLM to "force" citations to fit the answer!
#5 Primo Research Assistant uses the entirety of CDI metadata & abstracts (with exceptions)
While the mechanics of RAG are relatively easy to grasp, it is important to understand what exactly Primo Research Assistant is searching over.
With Scopus AI and Elicit, the answer is very simple: Scopus and Semantic Scholar respectively. With Primo RA it is a bit more complicated.
Firstly, it searches through the "entirety of CDI metadata & abstracts" (with exceptions).
This is quite hard to unpack.
At the first, most obvious level, because it is CDI-only, it excludes your individual local records (e.g. books or special collections that exist only in your local Alma).
More controversially, it includes the "entirety of CDI". This is roughly equivalent to "expand my search", so you will get answers that cite items your institution may not have access to.
"Roughly" because besides the exceptions for Primo Research Assistant (see later) institutions can choose between EasyActive Setting and Fully Flexible Settings which the later scopes what is shown when you expand your search. My guess is "expand your search" for institutions that have EasyActive Settings (the majority) are likely to be closer to what is searched in Primo Research Assistant (less exceptions)
Lastly, after all that, they also have a list of exceptions. As of 15 September, it looks like this:

It's hard to say whether the exceptions here are opt-outs by the content owners (either because they can't provide the rights or choose not to) or decisions made by Ex Libris for technical reasons.
For example exclusions like
Collections that are not marked as “free for search” in the CDI collection list (subscription A&I databases)
Sources with insufficient metadata and abstracts to effectively run the tool.
Documents marked as withdrawn or retracted; retraction notes.
it seems likely these are technical decisions, though some could go either way, such as the exclusion of news content,
while others seem like opt-outs, particularly
Any collections from the following content providers: APA, DataCite, Elsevier, JSTOR, Kogan Page, Conde Nast.
I've written at length about the pros and cons of content owners making the decision to be included or excluded from Primo's Research Assistant.
Here's a short summary.
The Landscape: How AI is Shaping Discovery
My immediate reaction? Publishers who choose to opt out are shooting themselves in the foot, much like the initial wave of publishers who refused to be indexed in Google Scholar or early discovery indexes like Summon, Primo, and EDS back in the early 2010s. Those who resisted eventually caved because being absent from these indexes meant losing discoverability—and by extension, usage—while their competitors became more visible.
Academic libraries soon put these discovery services front and center as the default search boxes, driving all the traffic there. Publishers quickly realized that being indexed meant a significant boost in COUNTER stats, while those who weren’t saw usage plummet. Google Scholar indexing turned out to be even more critical, as everyone knows from their traffic analysis.
But It’s Not Exactly the Same…
Ex Libris is being cautious with their CDI Research Assistant. It's not the default, and a generated answer isn’t produced for every search; users have to make a conscious effort to visit another site. So, the impact of being excluded here is less severe, at least for now.
This approach mirrors how major products like Scopus AI and Statista are handling it. But if this feature becomes popular, it wouldn’t be surprising if these systems switch to a default model where AI-generated answers, along with a list of relevant papers, are always shown—much like Elicit.com or SciSpace.
If we reach a point where AI-generated answers are the norm, choosing to be excluded could significantly hurt discoverability. The generated answer is highly prominent, and not being included there isn’t great.
The Case Against Inclusion
One could argue that if their content is cited in a RAG-generated answer, many users might not click through to the full text and instead rely solely on the AI summary. This could lead to a drop in usage stats, prompting librarians to consider canceling the subscription. This concern echoes the frustrations some web owners have with Google’s Knowledge Graph and featured snippets, which they believe steal content because users no longer click through to the source.
But how strong is this argument? It depends on your view of user behavior.
To make this argument work, users would need to be a specific type of lazy—lazy enough to rely on an AI-generated summary if it’s there, but diligent enough to click through to the full text if it’s not.
Still, despite my analysis, we continue to see more content owners choosing to be excluded. Since my last blog post on this, Elsevier, the first of the "big 5" publishers, has joined in! In a sense this isn't surprising, since Elsevier has always been protective of its content (e.g. it was the last of the big 5 publishers, and one of the last major publishers, to resist making its deposited references in Crossref open).
I am also not surprised by APA (which was on the list earliest), though I am puzzled about JSTOR (perhaps they don't have the rights) and especially DataCite (I guess they are just a DOI registration agency and don't own content?).
As a user and librarian, I would of course prefer content owners not to opt out, particularly since a) only abstracts are used, not full text, and b) especially when we have already paid for access!
On a parallel note, the rise of RAG enabled academic search may make academic content owners more protective of their titles and abstracts!
For example, it might be unrelated, but when using AI tools like Elicit or Undermind.ai that use Semantic Scholar's corpus, I've noticed more and more abstracts missing, with the message "Abstract of this article is not being displayed due to request from publisher".

Similarly, in this video, Undermind cofounder Josh says the following about Undermind results not showing abstracts:
"okay the issue we do have a problem with like some of these actually because it's Springer. I noticed as well right there's some weird thing with Springer where there's like we are missing the abstracts of some of the Springer journals because of some weird publishing thing."
Of course, all this might be unrelated or just a technical bug, but it makes me appreciate even more the push for open scholarly metadata and movements like I4OA, the Initiative for Open Abstracts, which would make such worries moot.
All in all, I think RAG systems are something that will be here to stay UNLESS content owners decide to block them from search systems. I personally hope this will not be the case.
#6 Primo Research Assistant was tested with a mix of human testing and auto-evaluation scoring on a variety of metrics

While human testing of RAG-generated answers is important, most RAG testing also involves automated evaluation, because human evaluations are too slow and expensive. The tricky bit about RAG-generated long-form answers is that even with a gold-standard "correct" answer, it is difficult to automatically assess whether the generated answer is close or similar to it, as there are many ways to express the "right" answer.
During RAG testing, besides assessing the final answer (answer correctness, factuality, etc.), we typically also want to assess how well the different components of the RAG system perform, so we can figure out what went wrong. In other words, we often want to assess the retrieval component separately from the generator, i.e. whether the retrieval managed to surface most of the relevant context.
This problem of auto-evaluating answers is generic to most NLP tasks, and traditional NLP studies tend to use lexical and (later) embedding-based metrics like ROUGE-L, BERTScore, BLEU and METEOR. But these generally do not correlate well with human judgement.
There is a lot of research on how to do this automatically and reliably (such that it correlates with human judgement), with frameworks like RAGAS, TruLens, ARES and more emerging that help automatically assess the quality of the answer on metrics such as:
Answer relevance - Does the generated answer really answer the question?
Citation faithfulness - Are the citation sentences in the generated answer faithful to the actual cited papers?
Context precision - Do the retrieved context help answer the question?
Context recall - Did all the context useful for answering the question get retrieved?
Other benchmarks or metrics test noise robustness (the ability of the LLM not to be distracted by irrelevant details in the context) and negative rejection (the ability of the LLM to respond "I do not know" when the retrieved context cannot answer the question).

Retrieval-Augmented Generation Benchmark (RGB)
The techniques to automate this are extremely clever. For example, to determine whether a generated answer is relevant to the question, these steps are done:
Take the generated answer X and prompt an LLM to generate a question that could be answered by X; call this generated question Y
Convert both the original question call it Q and the generated question Y into embeddings and compare how similar they are
The more semantically similar Q and Y are, the more likely the generated answer X is relevant to answering the question!
Similar methods work for other metrics like citation faithfulness etc.
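The three steps above can be sketched as follows. Both `embed` and `llm_reverse_question` are toy stand-ins for a real embedding model and a real LLM call, so the numeric score here is a structural illustration only, not a meaningful relevance judgement.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy deterministic stand-in for an embedding model."""
    rng = np.random.default_rng(sum(text.encode()) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def llm_reverse_question(answer: str) -> str:
    """Stand-in for prompting an LLM with: 'Write a question that the
    following text answers.' A real system would call the model here."""
    return f"What is known about the following: {answer}?"

def answer_relevance(question_q: str, answer_x: str) -> float:
    """Reverse-engineer question Y from answer X, then compare Q and Y
    by cosine similarity of their embeddings (steps 1-3 above)."""
    question_y = llm_reverse_question(answer_x)
    return float(embed(question_q) @ embed(question_y))

score = answer_relevance(
    "Is there an open access citation advantage?",
    "Several studies report that open access articles receive more citations [1].",
)
print(score)  # cosine similarity in [-1, 1]; higher = more relevant
```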
In fact, the latest popular methods use an approach known as "LLM-as-a-judge". It works exactly as it sounds: you directly ask LLMs to rate (or, in some cases, rank) the generated RAG answers based on whatever criteria you want!
You can ask the LLM to judge any of the above metrics, or even less well-defined ones you might ask humans about, like comprehensiveness or bias. I suspect they are popular now not just because this method yields judgements that may closely align with human judgement, but also because the cost of LLM calls keeps dropping.
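A minimal sketch of what an LLM-as-a-judge rubric prompt could look like; the criteria and wording are my own illustration, not any particular framework's actual prompt.

```python
JUDGE_TEMPLATE = """You are grading an AI-generated answer to a research question.

Question: {question}
Answer: {answer}
Retrieved sources:
{sources}

For each criterion, give an integer score from 1 (worst) to 5 (best):
- relevance: does the answer address the question?
- faithfulness: is every cited claim supported by its source?
- comprehensiveness: does the answer cover the main points in the sources?

Reply with JSON only, e.g. {{"relevance": 4, "faithfulness": 5, "comprehensiveness": 3}}.
"""

def build_judge_prompt(question: str, answer: str, sources: list[str]) -> str:
    """Fill in the rubric; the prompt is then sent to a (typically strong)
    judge LLM, whose JSON scores are parsed and averaged over a test set."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, sources="\n".join(sources)
    )

prompt = build_judge_prompt(
    "Is there an open access citation advantage?",
    "Yes, several studies report one [1].",
    ["[1] The open access citation advantage: a meta-analysis..."],
)
print(prompt)
```

Asking for structured JSON output is what makes the judgements easy to parse and aggregate automatically across thousands of test queries.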

Beta Testing Program Feedback – on screen
Of course, nothing beats human testing. Based on the above image, out of 4,997 queries, they had 339 (6.7%) thumbs up and 276 (5.5%) thumbs down; of the thumbs down, 40% were because the answer "didn't answer my question/request" and 30% because "sources didn't meet my expectations".
#7 Primo Research Assistant generated answers and features can be improved
Of course, Primo Research Assistant is still very new and raw, and we can expect rapid improvements as more feedback comes in.
I already mentioned some of Primo Research Assistant's feature limitations above: it can't parse metadata, so you can't search by year or journal; you can't apply pre-filters the way you can in scite.ai Assistant; etc.
But what about the actual quality of the generated answer? At this point, my vibes check tells me it clearly shows poorer results than other similar products in its class.
There are multiple ways to evaluate RAG systems (autoevaluation frameworks like RAGAS and TruLens and the use of LLM-as-a-judge are themselves a hot area of study), but one way is to study the two components that make up the system, the retrieval part and the generator part, separately, versus studying it end to end.
Let's consider the generation component. Here we are told it is currently using GPT-3.5, which is a very outdated model, but this is due to change.
While I have no doubt a more advanced model like GPT-4o or GPT-4o mini will help generate better answers, hopefully with sentences more faithful to the cited papers (perhaps together with more advanced prompting techniques like self-verification, though this is costly), one of the lessons of RAG research is that if the retrieval part cannot surface the most relevant documents, there is not much the generator can do.
Unfortunately, my vibe tests seem to indicate that retrieval is currently the weakest part of Primo Research Assistant.
While results aren't too bad for generic searches where many documents can be considered relevant, I find the assistant starts to fail when there are fewer qualifying relevant documents and the retrieval system struggles to find them and rank them in the top 5.
Take the following query, asking for papers on how well ChatGPT performs title or abstract screening for systematic reviews. Roughly 10-15 papers are considered relevant, and they are relatively easy to find with most state-of-the-art academic search systems.

In any case, since Primo Research Assistant uses only the top 5 results, the retrieval system just has to ensure the top 5 are relevant. Unfortunately, as you can see above, it fails.
Results #1 and #2 are fine, as far as they go.
#3 is a "systematic review" of how ChatGPT is used in medical research. At first glance it looks like it may be partly relevant, but based on the abstract (and even the full text), it does not include a single study on the topic!
#4 is a review paper on the use of LLMs for systematic reviews. This one at least includes studies on the topic requested, but it still isn't ideal, as the abstract says nothing about performance.
#5 is about using ChatGPT to pass medical exams, which is totally irrelevant!
So for this use case, the retrieval found only 3 (arguably 2.5) relevant documents out of 5!
Added note: in hindsight, for this search Primo might be confused by the use of "title/abstract"; changing it to "title or abstract" gives better results.
This is why the generated answer only cites result #2 (I'm not sure why it didn't cite #1), which is far from ideal! Compare against the same query in SciSpace or Elicit.

Result from SciSpace

Result from Elicit
While these systems might introduce hallucinations, or rather not be 100% faithful to the citations (the citations are to real papers, but the way they are cited does not reflect their content), at the very least we see that both Elicit and SciSpace can easily find papers that are on target.
To see the exact keyword search used, you can click on "View more results from your library search". In this particular case, it used
(ChatGPT performance in title screening for systematic reviews) OR (Effectiveness of ChatGPT in abstract screening for systematic reviews) OR (ChatGPT accuracy in systematic review screening) OR (ChatGPT for title and abstract screening in research articles) OR (Evaluation of ChatGPT for systematic review screening) OR (performance of ChatGPT in title/abstract screening for systematic reviews) OR (efficacy of ChatGPT in title/abstract screening for systematic reviews) OR (effectiveness of ChatGPT in title/abstract screening for systematic reviews) OR (ChatGPT for title/abstract screening in systematic reviews) OR (ChatGPT screening in systematic reviews) OR (how good is the performance of chatgpt in title/abstract screening for systematic reviews?)
At a glance this search string doesn't look too bad (Evidence Synthesis people reading this, please don't kill me!), but my guess is Primo's relevancy ranking isn't tuned for such searches, as nobody ever does such complicated nested Booleans in Primo. Most of its users are undergraduates doing simple keyword searches! (This isn't PubMed or Scopus, which systematic review librarians commonly use.)
In fact, looking at the results from clicking "View more results from your library search", I would say don't bother with this function, as the results are even worse. At least with the Primo Research Assistant top 5, there is an additional reranking step that isn't done here; without it, I have seen searches where even the #1 result is irrelevant!
Note that when you click on that button you get the whole of CDI without the exceptions that Primo Research Assistant has, so, for example, if your query is on current events you might get newspapers (which are excluded in Primo Research Assistant) in the top spots. What you won't get is your local Alma records (e.g. special collections), so calling it "from your library" is misleading...
I am not an information retrieval researcher, so I am not sure how to improve the retrieval of Primo Research Assistant, but maybe extend the reranking step to work on the top 50 or 100 instead of 30? That might lead to increased latency and cost, though.
That said, I don't think the approach Primo Research Assistant takes, using LLMs to generate search strategies, necessarily can't produce decent results compared to the more conventional embedding approach.
For example, scite.ai Assistant takes a similar approach, using an LLM to generate search strategies, and the results look okay.

Looking at the scite.ai Assistant search strategies used, we can see they tend to use only 3 strategies rather than 10, though the variety seems larger. More importantly, the conventional keyword search in scite.ai is not based on strict Boolean, which you can tell by clicking on the search. Also, scite.ai Assistant matches not just title and abstract but also citation contexts.
That said, the fact that Primo Research Assistant uses the LLM to come up with keywords to search has some benefits.
One common test I do on RAG systems is to ask "impossible questions", where the system should conclude it cannot find any documents to answer the question. (In the literature this is sometimes called testing for negative rejection.)
Most systems will fail and try to answer anyway, because the nature of embedding-based search is that some document must be "closest", so almost any query will return some ranked results. Add the fact that LLMs have a bias toward trying to answer, and they often end up generating a plausible answer with citations, but the citations are not faithful to the answer.
For Primo Research Assistant, I notice it sometimes passes this test when others fail. A large part of this is because the search is strict Boolean: the keyword search constructed by the LLM can simply return no results at all!
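This negative-rejection behavior can be sketched as follows, assuming a search step that can genuinely return zero hits; all function names here are hypothetical illustrations, not Primo's actual code.

```python
def generate_answer(query: str, sources: list[str]) -> str:
    """Placeholder for the normal RAG generation step (an LLM call)."""
    return f"Answer to '{query}' based on {len(sources)} source(s)."

def answer_with_rejection(query: str, search_results: list[str]) -> str:
    """Negative rejection via strict Boolean search: if the LLM-constructed
    keyword query matched nothing, say so rather than letting the LLM
    improvise an answer from nothing."""
    if not search_results:
        return "No sources were found that can answer this question."
    return generate_answer(query, search_results)

print(answer_with_rejection("impact of Lawrence Wong on housing prices", []))
print(answer_with_rejection("open access citation advantage", ["a meta-analysis"]))
```

An embedding-based search, by contrast, always returns *something* (whatever is "closest"), so the empty-result branch is rarely reachable, which is exactly why those systems tend to fail this test.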
In the example below, I ask about the newly chosen Prime Minister of Singapore, Lawrence Wong, and his impact on housing prices (I am pretty sure there is little to no literature on this), and Primo Research Assistant passes the test because its constructed keyword search just can't find anything!

Most others, such as Elicit, fail and try to answer anyway, and will either
a) give an answer that seems to answer the question but whose citations do not support it, or
b) give an answer that does not answer the question but has faithful citations.
Elicit's answer below is more of b), which I guess is the lesser of two evils, though it can do a) too.

However, this isn't a perfect defense.

The documents are matched because Primo matches on the full metadata of the record, so this includes authors, keywords, description, etc.
Record #2, for example, is picked up due to a mix of matches in the title, description, and even, I think, the full-text characters?! (though it doesn't seem to use the full text for the RAG answer).

Of course, the other advantage Primo Research Assistant has over others in the academic space is coverage: most of them, whether from legacy players like Scopus AI or startups like Elicit, focus on journal-type content only, while CDI has the potential to cover everything from books, ebooks, magazines and reports to videos and more. This makes it closer to Perplexity.ai and Bing Chat to some extent.
In practice though, I think many of the other content types in CDI are excluded, particularly newspapers.

Primo research assistant citing magazines
Conclusion
I've been studying AI search / retrieval augmented generation academic search systems since early 2022, and it's interesting to see them start to go mainstream.
I am thinking about how best to share what I know and educate users on this. I am currently mulling over writing about:
1. Common misconceptions of RAG academic search systems
2. How to test informally a RAG based academic search system

