What Do We Actually Mean by "AI-Powered Search"?
When we say "AI-powered search engine," we're conflating at least four different things—and your concerns about one may not apply to another.
I've been watching the reactions to Google Scholar Labs with considerable interest. The responses range from enthusiastic embrace to outright rejection. One response particularly intrigued me: someone mentioned they were initially reluctant to try it because they'd heard it was "AI-powered," but became more interested after reading my review and realising what that actually meant (they had expected it to generate answers to questions, when all it does is better ranking).
Another interesting puzzle arose when I noticed some library guides listing Semantic Scholar as “Semantic/Neural Search”, when a technical look at its retrieval method reveals the main search is still largely lexical.
While one can understand and even agree with listing Semantic Scholar as “AI powered” due to clear AI features like TLDR, doing the same for Lens.org and OpenAlex is a much harder sell: not only do they just do keyword search, they also lack the obvious AI features of Semantic Scholar.
It’s made me realise we might all be talking past each other because we haven’t actually defined what we mean.
“AI-powered search engine” is a handy catch-all term used by vendors, but it actually hides a diverse set of systems and functionality. In this post, I’ll dissect the different ways academic search can be “AI-powered” so you can decide which types actually cross the line for you.
I am going to argue that we often mean at least four different things when we call something “AI-powered search”:
Level 1: Post-Retrieval AI Features
Level 2: Going beyond Lexical Search with Semantic Search
Level 3: LLMs for Retrieval and/or Relevance Ranking
Level 4: Synthesis and Generation Across Papers
Level 0??: Use of AI to extract, cluster, or organise metadata used for retrieval
Levels here may not be the right framing, as the four categories are largely orthogonal (except that Level 3 is arguably a subset of Level 2), but they map to most common academic search products (e.g. Level 4 is usually Deep Research), and higher levels generally carry higher risk and draw greater pushback from librarians.
The Spectrum of AI in Search
Level 1: Post-Retrieval AI Features
Does “AI-powered search engine” mean the “AI” actually impacts the search process? Not necessarily.
There’s a whole category of AI features that don’t affect the search results you get at all. Things like optional summarisation of individual items (e.g., AI Insights on Ebscohost databases), translation tools, or text-to-speech features.
These are post-retrieval conveniences. You search, you get your results list (however that list was generated), and then you have the option to use AI to help you process what you found. The search itself? Unchanged.
In theory, if you don’t like these features, you can ignore them.
Level 2: Going beyond Lexical Search with Semantic Search
Now let’s get into actual search mechanisms. Suppose a search engine doesn’t do traditional lexical/keyword matching but instead uses “semantic search”—where queries are run through an encoder model to convert into embeddings and matched against the embeddings of indexed documents to get a relevance score.
“Semantic search” can refer to (1) meaning-aware retrieval (dense embeddings / neural retrievers), (2) sparse neural retrieval (not dense embeddings, but still “neural”, e.g. SPLADE), and (3) query expansion / thesauri / ontologies (no embeddings). (1) is by far the most popular today and is what I discuss here.
Semantic Search using dense embeddings (also known as vector/embedding/neural search/retrieval)
Is that “AI”?
A quick terminology note: when I say “lexical” or “keyword” search, I’m referring to methods that match on the actual words in your query and the documents. This includes both Boolean search (exact matching with AND/OR/NOT operators) and probabilistic methods like BM25 (which scores documents based on term frequency, document length, and how rare each term is across the corpus). These are related but distinct: Boolean gives you a set of matching documents, while BM25 gives you a ranked list. Most traditional academic databases like Scopus use Boolean first to retrieve a set of documents and then do relevance ranking with BM25 (or the related TF-IDF). Neither attempts to understand semantic meaning; they just match strings. That said, it is common to stack a reranker on top that is trained with supervised learning (a learning-to-rank model) on labelled click data. We will discuss this in the section on Semantic Scholar.
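To make the BM25 description above concrete, here is a toy, from-scratch sketch of Okapi-style BM25 scoring. Real systems use tuned implementations inside engines like Elasticsearch; the three-document corpus and default parameters here are purely illustrative:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this doc
        df = sum(1 for d in corpus if term in d)  # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarer terms weigh more
        # saturating term frequency, normalised by document length
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "open access citation advantage".split(),
    "deep learning for image classification".split(),
    "citation analysis of open access journals".split(),
]
query = "open access citation".split()
# Rank all documents by BM25 score, best first
ranked = sorted(range(len(corpus)), key=lambda i: bm25_score(query, corpus[i], corpus), reverse=True)
```

Note how the document with no query terms scores zero: BM25 only ever matches strings, exactly as described above.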
Let’s take for granted that traditional information retrieval methods like Boolean and BM25 aren’t “AI”; otherwise, by default, every search engine would be AI-powered!
So what we’re saying is that anything that goes beyond traditional keyword matching—such as semantic search via embedding search/match—is considered “AI-powered search.” An example is JSTOR beta’s semantic results function.
Note that all that changes is the relevance algorithm. You still get a list of results (though embedding-based search usually retrieves only the top K results, rather than all hits like a Boolean search). There is no generated answer.
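For contrast with the BM25 case, here is a toy sketch of the embedding retrieval just described: encode, compare vectors, keep the top K. The three-dimensional vectors are made up for illustration; a real system would obtain query and document embeddings from an encoder model such as SPECTER:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: the standard relevance score
    for dense embedding search."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-dimensional "embeddings" standing in for real encoder output
doc_embeddings = {
    "open access citation advantage":   [0.9, 0.2, 0.1],
    "paywalled journals and citations": [0.7, 0.3, 0.1],
    "image classification with CNNs":   [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # pretend encoder output for the query

# Dense retrieval returns only the top-K nearest documents, not "all hits"
K = 2
top_k = sorted(doc_embeddings,
               key=lambda d: cosine(query_embedding, doc_embeddings[d]),
               reverse=True)[:K]
```

Notice that the second result shares direction with the query vector despite sharing few exact words with it; that conceptual fuzziness is exactly the behaviour discussed next.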
Embedding search is fantastic at capturing broad conceptual similarity, but it can be ‘fuzzy’, sometimes retrieving papers that feel related but miss the specific keywords or strict criteria you need. That is why it is common to do hybrid search combining both methods.
A decade or two ago, more librarians would have heavily objected to non-keyword search methods because the results would be less explainable. (I suspect two decades of using Google’s soft or fuzzy Boolean search, and getting relevant results without all terms matching, has weakened the resistance!) These days I see more librarians (except systematic review specialists) accepting such retrieval methods, particularly if they improve recall or precision, even at the cost of interpretability and often reproducibility.
They’re often even more accepting when told the embeddings aren’t strictly “Large Language Models,” which now seems to be the bogeyman many people have in mind. But how different are they really?
Language models compute the conditional probability of a token given a context window, using these probabilities to sample and generate sequences. Embedding models, however, focus on representation learning, mapping input sequences to fixed-size vectors without an autoregressive generation component.
That said, these days the embedding models we use (e.g., BERT, SciBERT, SPECTER) are transformer-based encoder models, which are very close cousins of the GPT-style transformer-based decoder models. (I will ignore encoder-decoder models.)
How decoder models (LLMs) are trained: GPT-style models are trained on massive text corpora using “next token prediction” (also known as causal language modelling)—given a sequence of words, predict what comes next. This autoregressive training is what enables them to generate fluent text. The model learns statistical patterns across billions of documents, building an implicit representation of language, facts, and reasoning patterns.
How encoder models are trained: Models like BERT, SciBERT, and SPECTER use a different pretraining objective—”masked language modelling” or essentially a cloze test. Random words in a sentence are masked out, and the model learns to predict the missing words based on surrounding context. This bidirectional training (looking at context on both sides) produces rich representations of meaning, but the model isn’t designed to generate text—it outputs embeddings that capture semantic relationships.
Historically, only encoder models were used as embedding models, and while many modern embedding models (e.g., SBERT, E5, Gecko) are still based on transformer encoder architectures, recent top-performing embedding models, like OpenAI’s ada, Google’s Gemini embeddings, and E5-Mistral, are derived from large decoder-based transformers. Decoder models are not naturally designed to output embeddings (they are meant to predict the next token), but with proper fine-tuning they can generate embeddings effectively, often excelling thanks to their larger base model size (decoder-derived embedding models tend to have billions of parameters instead of the hundreds of millions typical of encoder models), though at higher energy cost.
The similarities are substantial. Both use the transformer architecture. Both are trained on large text corpora through self-supervised learning. Both learn distributed representations of language. The encoder/decoder distinction is primarily about the training objective and what the model outputs, not fundamental differences in how they learn from data.
This means that if your concern is intellectual property—where the training data comes from, whether it was used with permission, whether the model has “ingested” copyrighted works—the issues are largely the same for both model types. An encoder model trained on scraped academic papers raises similar provenance questions to a decoder model trained on the same corpus.
However, there is one meaningful difference: encoder models do not reproduce text. They convert text into numerical vectors and compare those vectors. A decoder model, by contrast, can regenerate text that closely resembles its training data; this is the basis of concerns about LLMs reproducing copyrighted passages or memorising private information. Encoder models used for semantic search can only say “these two texts are semantically similar”; they don’t generate passages, so the plagiarism-style risk is lower.
So if your concern is specifically about text reproduction and the potential for plagiarism or copyright-infringing outputs, encoder-based semantic search is genuinely lower risk. If your concern is about the ethics of training on data without consent, the distinction offers less comfort.
There are also practical differences: embedding models tend to be much smaller and less energy-hungry than modern LLMs. And because we’re not generating answers, just doing relevance ranking, we avoid issues like citation faithfulness and cognitive offloading.
Level 3: LLMs for Retrieval and/or Relevance Ranking
Okay, so maybe you’re comfortable with embeddings from encoder models, and you’re happy that they don’t generate answers.
What if the search engine uses an LLM to generate a Boolean search strategy and then runs keyword search as normal (e.g., Web of Science Research Assistant)? Or does hybrid search where it first runs LLM-generated Boolean before reranking using embeddings (e.g., Primo Research Assistant)? Or does both LLM-generated Boolean plus embedding search together and reranks (e.g., Scopus AI)?
Or what if the system retrieves using keywords (or hybrid methods) then uses GPT-style LLMs to judge relevancy and generate ranking of the top K results?
When using LLMs for relevancy ranking, you can ask the LLM to:
Give a score or categorise documents into relevancy tiers (point-wise comparison)
Compare relevancy of two documents (pair-wise comparison)
Provide a rank sort of multiple documents (list-wise comparison)
You can also prompt the LLM to give a reason for the score or ranking.
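The three prompting setups above can be sketched in a few lines. A mock word-overlap function stands in for the actual LLM call here (the real prompt wording, grading scale, and response parsing would vary by system):

```python
def mock_llm_relevance(query, doc):
    """Mock stand-in for an LLM call. A real system would prompt a GPT-style
    model with something like "How relevant is this document to the query?
    Give a grade and explain your reasoning." and parse the reply."""
    return len(set(query.split()) & set(doc.split()))  # pretend LLM grade

def pointwise_rank(query, docs):
    """Point-wise: grade each document independently, then sort by grade."""
    return sorted(docs, key=lambda d: mock_llm_relevance(query, d), reverse=True)

def pairwise_prefer(query, doc_a, doc_b):
    """Pair-wise: ask which of two documents is more relevant."""
    return doc_a if mock_llm_relevance(query, doc_a) >= mock_llm_relevance(query, doc_b) else doc_b

docs = [
    "open access citation advantage",
    "deep learning survey",
    "citation advantage of preprints",
]
query = "open access citation advantage"
reranked = pointwise_rank(query, docs)  # list-wise output from point-wise grades
```

List-wise prompting would instead hand the LLM all the candidates at once and ask for a full ordering; the point-wise version above is the easiest to parallelise.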
Regardless of method, we’re now using those big decoder models, the ones everyone associates with “generative AI”, but we’re still just using them to rank search results (typically after an earlier retrieval phase that narrows things down to promising candidates). No synthesis, no summarisation (except to explain the reasoning for relevance), just ordering.
Technical note: You can also use powerful but slow cross-encoder or late-interaction models like ColBERT, instead of outright GPT-style decoder models, to get results almost as good as directly using an LLM. With respect to ranking quality there is functionally little difference, but you do avoid the use of decoder LLMs and hence the IP risks.
Google Scholar Labs and AI2 Paper Finder (now Asta) use this type of approach: an LLM parses query intent, ranks results, and even generates a rationale explaining why each paper was ranked where it was. You’re still ultimately getting a ranked list, but now there’s AI-generated explanatory text attached to each result.
Does this change things, now that there is actual generation of text to explain relevancy? Or will your objection be that it uses actual LLMs (decoder models) rather than just encoder models?
Level 4: Synthesis and Generation Across Papers
Now we’re getting into what many find most objectionable. This level encompasses two related approaches:
Quick RAG tools like Elicit, scite assistant, and Primo Research Assistant synthesise information across multiple papers to generate answers with citations, typically using Retrieval Augmented Generation. You ask a question, the system reads through papers, extracts relevant information, and writes you a summary with references. It’s not just ranking anymore—it’s creating new text based on the literature.
Deep research tools like Undermind and Consensus’s deep search mode take this further—basically RAG on steroids. These systems don’t just synthesise a quick answer; they conduct extensive multi-step research processes, following leads, refining searches, and building comprehensive analyses that can run for minutes or even hours.
Both approaches share the same fundamental characteristic: the AI is generating novel text that synthesises across sources, not merely ranking or retrieving existing documents.
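A minimal sketch of the RAG pattern both kinds of tool share: retrieve passages, then assemble a grounding prompt for a generative model to answer from. The toy overlap-based retriever and the prompt template are my own illustrative stand-ins, not any specific product's implementation:

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank passages by word overlap, keep top-k.
    A real system would use the hybrid retrieval discussed earlier."""
    scored = sorted(corpus,
                    key=lambda p: len(set(query.split()) & set(p["text"].split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, passages):
    """Assemble the grounding prompt the LLM is asked to answer from."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (f"Answer the question using ONLY the sources below, citing them as [id].\n\n"
            f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:")

corpus = [
    {"id": "S1", "text": "open access articles receive more citations in some fields"},
    {"id": "S2", "text": "image classification benchmarks for deep learning"},
    {"id": "S3", "text": "the citation advantage of open access is contested"},
]
question = "is there an open access citation advantage"
prompt = build_rag_prompt(question, retrieve(question, corpus))
# `prompt` would then be sent to a generative LLM; its answer with [S1]/[S3]
# style citations is the synthesised response the user sees.
```

Deep research tools essentially loop this retrieve-and-generate step many times, letting the model refine its own queries between rounds.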
Leaving aside the use of LLMs, there are reasons to object to Level 4 tools:
LLMs can produce ghost references (though this can be mitigated)
Studies have reported LLMs citing retracted papers (here and here)
There are doubts whether LLMs can faithfully represent what each article says and properly weight findings when they disagree
Even if these performance issues were mitigated, people also worry about the adverse effects of using and relying on such tools - e.g. cognitive offloading from overuse of these functions.
Different Concerns, Different Red Lines
Here’s the thing—there’s no objectively “right” answer about what counts as unacceptable AI use. But it’s crucial to be clear with yourself about what you’re actually concerned about, because different concerns lead to very different red lines.
If you’re worried about generative AI making students lazy, you might be perfectly fine with AI assistance as long as it’s just ranking results. Let the algorithms find relevant papers, but make students read and synthesise themselves.
If you’re worried about environmental impact or IP issues of huge GPT-style LLMs, you might accept encoder embeddings but draw a hard line at any use of GPT-type models.
If you’re worried about reproducibility or interpretability—and regular readers know this is my particular obsession—you might object to any non-Boolean/lexical methods, and maybe even AI-generated metadata.
If you’re worried about accuracy and hallucination, the earlier levels might be fine but synthesis crosses the line. Ranking can be imperfect without being catastrophically wrong, but generation creates new opportunities for confident-sounding nonsense.
Where Do Common Tools Actually Sit?
Given this framework, it’s worth examining where familiar platforms actually fall—because the answer isn’t always obvious.
The “Semantic Scholar” Confusion
Here’s an irony worth noting: Semantic Scholar, the platform whose name suggests semantic search, doesn’t actually use vector embeddings for its main search retrieval, at least not for the search on its home page as of April 2025.
According to a preprint updated April 2025, the platform’s search operates in two stages: first, Elasticsearch retrieves up to 1,000 candidates using keyword matching (probably BM25); then these are reranked using a LightGBM model that emphasises direct title matches and highly-cited recent papers.
LightGBM is a gradient-boosted decision tree ranker—machine learning, but not a neural embedding model and not generative. It’s a feature-driven reranker sitting between BM25-style retrieval and modern transformer-heavy “AI search.”
More specifically, it’s a supervised (or semi-supervised) learning model trained on labelled data (e.g., relevance judgments or interaction signals) to learn a scoring function that combines feature-based signals, such as query/field match strength, citations, and publication year, into a final relevance score.
It’s typically used in a learning-to-rank–style setup (often with ranking objectives like LambdaRank), meaning it learns how to order candidate documents for a query rather than generate text.
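To show the shape of such a feature-based scoring function, here is a toy sketch. A real learning-to-rank model like the LightGBM ranker described above learns its scoring function (typically an ensemble of trees, trained with a ranking objective) from labelled data; the features and hand-picked weights below are purely illustrative:

```python
def features(query, doc):
    """Turn a (query, document) pair into the kind of signals a
    learning-to-rank model scores on."""
    return {
        "title_match": 1.0 if query.lower() in doc["title"].lower() else 0.0,
        "citations":   min(doc["citations"] / 1000, 1.0),      # capped, normalised
        "recency":     max(0.0, 1.0 - (2025 - doc["year"]) / 20),
    }

# Hand-picked weights for illustration; a real ranker LEARNS how to
# combine features from relevance labels or click data.
WEIGHTS = {"title_match": 2.0, "citations": 1.0, "recency": 0.5}

def ltr_score(query, doc):
    f = features(query, doc)
    return sum(WEIGHTS[name] * value for name, value in f.items())

candidates = [
    {"title": "Open access citation advantage", "citations": 800, "year": 2015},
    {"title": "A survey of deep learning",      "citations": 5000, "year": 2020},
]
query = "open access citation advantage"
ranked = sorted(candidates, key=lambda d: ltr_score(query, d), reverse=True)
```

Note there is no text generation anywhere: the model only learns how to order candidates, which is why this sits below the LLM levels despite being machine learning.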
So where do the embeddings come in? Semantic Scholar developed SPECTER and SPECTER2, sophisticated document embedding models trained on citation graphs. But these power auxiliary features: research feed recommendations, author disambiguation, paper clustering, and finding related papers. The embeddings exist; they’re just not driving the core search experience.
There is one exception: their newer “snippet search” API endpoint, designed for retrieving text passages from the S2ORC corpus, does use a genuine hybrid approach. Passages are embedded using mxbai-embed-large-v1, and results are retrieved using the union of embedding-based and keyword-based matches, ranked by a weighted sum of embedding similarity and BM25 scores (Kinney et al., 2023). But this is a specialised API for full-text passage retrieval—not the main search you get from the Semantic Scholar Search bar that most users encounter.
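A sketch of the weighted-sum hybrid just described, assuming min-max normalisation of each retriever's scores over the union of candidates (the normalisation scheme and the 0.6 weight are my illustrative choices, not necessarily what the snippet search actually uses):

```python
def minmax(scores):
    """Rescale a dict of scores to [0, 1] so BM25 and cosine are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_rank(bm25_scores, embed_scores, alpha=0.6):
    """Rank the UNION of candidates from both retrievers by a weighted sum
    of normalised embedding similarity and BM25 score."""
    candidates = set(bm25_scores) | set(embed_scores)
    b = minmax({d: bm25_scores.get(d, 0.0) for d in candidates})
    e = minmax({d: embed_scores.get(d, 0.0) for d in candidates})
    return sorted(candidates, key=lambda d: alpha * e[d] + (1 - alpha) * b[d], reverse=True)

# Toy scores: passage C is found only by the embedding retriever,
# yet it can still outrank a keyword match.
bm25  = {"A": 12.0, "B": 7.5}
embed = {"A": 0.82, "B": 0.55, "C": 0.79}
ranked = hybrid_rank(bm25, embed)
```

The point of the union is that each retriever can surface candidates the other misses, which is the main argument for hybrid search made earlier.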
You might be wondering why I am harping on this. After all, Semantic Scholar has many features that are clearly AI-powered, such as “TLDR”, “Ask this paper”, Citation sentiment (e.g., “Highly Influential Citations”), Research Feeds.
This matters because if you’re advising researchers, the name might reasonably lead you to assume you’re getting embedding-based semantic retrieval. In fact, I have seen quite a few libraries list Semantic Scholar in their libguides as “semantic/neural/vector search”.
You’re not—at least not in the primary interface.
This misunderstanding might then lead you to search using natural language rather than keywords, which typically works better in semantic retrieval systems. Unfortunately, this fails here. The query:
Is there an open access citation advantage
returns only 13 results (a big underestimate), while keyword search, even with quotes, gets a more reasonable 54.
Are OpenAlex or Lens.org AI-Powered?
Listing Semantic Scholar as “AI powered search” at least makes some sense because of its many AI features.
On the other hand, I also notice quite a few LibGuides classifying OpenAlex and Lens.org as “AI powered/LLM search” or, worse, “semantic/neural search”. This is a much less defensible move. Both are pure lexical search engines, with no LLM used in retrieval or ranking and no synthesis features.
New! In the latest Walden update, OpenAlex is planning a Vector search endpoint: find relevant works and other entities based on semantic similarity of free-form text. As of Dec 2025, I don’t believe this is accessible directly from the OpenAlex GUI.
This is probably just a misunderstanding: some librarians became aware of Lens.org and OpenAlex only recently (Lens.org actually started going beyond patent search with Scholarly search in Dec 2017) and mistakenly assumed they were in the same class as newer search tools. But is there a way to argue they are “AI-powered” search engines?
Perhaps you might argue that using AI to extract, cluster, or organise metadata makes a search engine “AI-powered”, and since we know OpenAlex uses machine learning for topic assignment and author disambiguation, it qualifies.
But the problem is that by this standard even traditional databases like Scopus would qualify: they use clustering algorithms for author disambiguation and citation parsing.
You can try to draw a line on which methods are “AI” and which are not (the same way we can with retrieval algorithms), deciding that you’re okay with good old-fashioned machine learning, or even embeddings, when they are used to extract or clean up metadata, but not okay when LLMs are used.
When you browse Dimensions to identify research funded by the NIH, when publications from an author who moved affiliations appear under the same profile, when a cluster gets labelled with a coherent name, this is artificial intelligence at work, although not all of it is generative AI.
But this whole thing sounds very contrived, and I think most would agree that the use of “AI” for metadata handling alone should not make something count as an AI-powered search engine.
What I’m Getting At
I’m not trying to tell you where to draw your line. What I am saying is: watch out for knee-jerk reactions (my own included!) that treat “AI” as a monolithic thing we either accept or reject wholesale.
The field is rapidly developing systems that incorporate different types of AI at different layers, in different ways, for different purposes. Some of those uses might align with your values and needs. Others might not. But you can’t make that determination until you actually understand what’s under the hood.
And unfortunately, that’s often the hardest part—because many of these systems are black boxes that don’t clearly explain which AI techniques they’re using, where, or why. Which brings me back to my recurring theme: methodological transparency matters.
We can’t make informed decisions about what AI search tools to use (or refuse to use) if we don’t actually know what they’re doing.
So the next time you see an “AI-powered search engine”, maybe the first question shouldn’t be “Is this AI?” but rather “What kind of AI, and where and how is it used in the pipeline?”
Appendix
Questions to ask or think about for an AI-powered search
A) Where is AI used in the pipeline?
Query formulation (query rewriting, Boolean drafting)
Retrieval (candidate generation)
Ranking (reranking / learning-to-rank / LLM judging)
Result enrichment (snippets, TLDRs, explanations)
Synthesis (multi-document answer)
B) What kind of model is it?
classic IR (Boolean/BM25)
“traditional” ML (e.g., boosted trees / LTR or Learning to Rank)
embeddings (encoder models; dense/sparse)
generative LLMs (decoder; possibly multimodal)
Images generated with help of Gemini 3 Pro