Scholar Labs Early Review: Google Scholar Finally Enters the AI Era
(Image generated by Nano-Banana Pro from the text of this blog post)
As “AI-powered academic search engines” began their rise in 2023, the biggest question on everyone’s mind was: Where is Google Scholar?
While Gemini Deep Research came and went, it relied primarily on the general web—similar to its rival, OpenAI Deep Research. Crucially, it failed to leverage Google’s greatest competitive advantages: the Google Scholar index and Google Books.
In a November 2024 piece for Nature (“Can Google Scholar survive the AI revolution?”), Anurag Acharya, a founder of Google Scholar, noted that they already utilized “AI” in ranking. He expressed skepticism regarding the accuracy of LLMs when summarizing multiple papers.
I believe that perspective missed the point. While traditional information retrieval algorithms are technically “AI,” the academic community was asking for the specific power of LLM-based methods. We wanted tools that could deliver significantly higher precision and recall.
Thought Experiment: Imagine Google Scholar with Deep Search: AI2 PaperFinder/Undermind-style iteration and LLM ranking across the full corpus. No long-form prose, no fabricated references—just outrageously good rankings. Academics don’t need sub-second latency; they need better ranking of top-k.
TL;DR: Google might have done exactly that (with K = 300) via their new product, Scholar Labs.
Google Scholar enters the “AI” fray
On November 18, 2025—while most of the internet was abuzz about the launch of Gemini 3—Google Scholar quietly dropped this blog post: Scholar Labs: An AI Powered Scholar Search.
(Note: The official name appears to be “Scholar Labs,” not “Google Scholar Labs,” which is an odd choice in my view.)
At 4 PM Singapore time, Monica Westin (a former member of the Google Scholar team) posted about the launch on LinkedIn, tagging me. Naturally, I couldn’t resist testing it.
There appears to be a waitlist; the criteria for access remain unclear as I immediately had access (you do need to be signed in). If you happen to be on the waitlist, you can try with your education Google account - I hear that works for some when their personal Google account doesn't.
How does Scholar Labs work? Here is the little they have said about it so far:
It analyzes your question to identify its key topics, aspects and relationships. It then searches for all of them on Scholar, and evaluates the results to identify papers that answer the overall research question. For each paper, it explains how the paper answers your question. And includes all the familiar Scholar features that you depend upon.
This unfortunately is so generic as to be almost useless, so read on to know more.
It has been 24 hours since I started testing this, and these are my early impressions. Everything I blog here may turn out to be inaccurate, or the features may change swiftly by the time you read this. Still, it is worth documenting these first thoughts.
Scholar Labs is “Deep Search” not “Deep Research”
The Verdict: Scholar Labs is an interesting first attempt at adding Generative AI into the search engine. However, they are doing it in a notably conservative way.
In short, Scholar Labs is what I classify as a “Deep Search” tool, not “Deep Research”.
The main differentiator is that a Deep Search tool is designed to find relevant papers and not generate direct answers. It differs from normal search by going “deeper,” running beyond the typical 1000ms latency you expect from Google or conventional databases.
What is that extra time/compute spent on? Typically, one or both of the following:
Iterative Strategy: The LLM may “decide” how and what to search in an iterative fashion.
Relevance Reasoning: The LLM is used directly to assess papers and generate a “rationale” for why a paper is relevant to the query.
Deep Search tools function like conventional search engines—they give you a list of results—but with the added bonus of generated text explaining why the AI thinks the paper is relevant. This usually results in much higher precision than traditional ranking methods.
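To make the pattern concrete, here is a minimal sketch of a Deep Search loop in Python. Everything here is a hypothetical placeholder: `search_fn` and `llm_judge` are stand-ins for a retrieval backend and an LLM screening call, not Scholar Labs internals.

```python
# Minimal sketch of a Deep Search loop: run (possibly expanded) queries,
# then use an LLM to screen each candidate and generate a rationale.
# `search_fn` and `llm_judge` are hypothetical stand-ins.
def deep_search(question, queries, search_fn, llm_judge, max_evaluated=300):
    seen, relevant = set(), []
    evaluated = 0
    for q in queries:                      # iterative search strategy
        for paper_id, text in search_fn(q):
            if paper_id in seen:
                continue                   # de-dupe across queries
            seen.add(paper_id)
            evaluated += 1
            # relevance reasoning: LLM screens the paper and explains why
            is_relevant, rationale = llm_judge(question, text)
            if is_relevant:
                relevant.append((paper_id, rationale))
            if evaluated >= max_evaluated:
                return relevant            # evaluation budget exhausted
    return relevant
```

The key design point: the output is still a ranked list of papers plus per-paper rationales, never a synthesized answer.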
AI2 Paper Finder (now under ASTA) was the clearest flag-bearer of this class of products; it is now joined by Scholar Labs.
This differs from simple RAG (Retrieval Augmented Generation) or Deep Research tools (e.g., Gemini Deep Research, Scopus Deep Research), which attempt to generate an answer by synthesizing across multiple papers. That is a much trickier task than the Deep Search paradigm of “screening” individual papers, as it requires balancing findings from multiple papers that may contradict one another or operate in different contexts.
Given the Google Scholar team’s comments in Nature (Nov 2024), it is perhaps unsurprising they chose the safer “Deep Search” paradigm for Scholar Labs, which evaluates papers one by one for the relatively clear task of judging relevance to the query.
Scholar Labs: The First Run
While I classify Scholar Labs as “Deep Search,” being Google, they bring their own twist to the formula.
Here’s a walkthrough if you don’t want to watch the video:
First, you enter your query. Like most modern AI tools, the input field encourages natural language. Once you enter your input, a sidebar appears showing the process, while the main column populates with papers it deems relevant.
Step One: Query Analysis. It begins by “analyzing your question” to decide how to search. Behind the scenes, it is likely performing query understanding, possibly using an LLM like Gemini or other information retrieval techniques.
Step Two: Query Expansion and Execution. The interface indicates it runs multiple queries (e.g., in one test, it ran 11 different queries). The results are likely combined, de-duplicated, and re-ranked.
Step Three: Evaluation. Scholar Labs then displays that it has “evaluated X top results,” providing a running count as the search proceeds.
Step Four: Display. Papers Scholar Labs considers relevant start to appear in the main column, with standard Scholar features below or to the right of each entry (PDF links, citation counts, Library Links, etc.).
One big difference: instead of the standard snippets for each entry, you get generated text explaining the paper’s relevance and summarizing it. Notably, it does this paper by paper and does not attempt to synthesize across multiple papers.
In a typical Deep Search tool (like AI2 PaperFinder), the process runs through successive papers until it hits a stopping rule. Scholar Labs appears to have a few specific thresholds:
In most cases, it seems Scholar Labs will stop once it thinks it has found 10 relevant papers. As seen below, it displays that it has “found 10 relevant results” (though unfortunately it does not show how many top results it evaluated to get there; clicking the “more results” button continues the count from where it stopped).
It’s not 100% clear what happens when you click more results (hopefully the Google Scholar team can add text to explain), but I think Scholar Labs just goes down the initial list of results to look for more relevant papers.
There is also the possibility that clicking “more results” issues additional iterative searches (very unlikely?), or that some sort of active-learning model re-ranks the results (also unlikely; I think it is cleaner just to use the LLM, perhaps Gemini 3, to screen). I noticed that the rate of finding relevant papers slows down over time, which is expected.
From most tests, it seems to pause at milestones of 10, 20, 30, and 40 relevant papers found, but the hard limit appears to be 50. The system cuts off immediately (regardless of how far down the top results it has reached) once it finds 50 relevant papers, and you can no longer ask for more results.
In the above example, the query was designed to have many relevant results, and Scholar Labs quickly hit the 50-relevant-paper mark without going through too many papers (the video shows it evaluating around the 96th top result).
But what happens with a tougher or more specific query: will it keep going until it finds 50 relevant papers (which may not exist)? Of course not. So far, my tests show there is a hard limit at 300: the tool stops evaluating once it has evaluated the 300th top result, and the “more results” button vanishes.
Below is an example of a query where Scholar Labs terminates the search once it goes past evaluating 300 top results, even though it has found only 26 relevant results.
The screenshot below shows another example: it stops after finding 30 relevant results (this round number seems to be coincidence), but if you were watching the interface before this text appeared, you could see it cut off once it evaluated the 300th top result.
As you will see later, there are some outliers where the search hard-stops before reaching 300 evaluated results, even if it has not found 50 relevant articles.
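Putting the observed thresholds together, my best guess at the stopping logic looks something like this. This is inferred purely from behavior, not documentation; the milestone and cap values are my observations from testing:

```python
# Inferred Scholar Labs stopping logic (my guess from testing, not documented):
# pause at each milestone of relevant papers found, hard-stop at 50 relevant
# papers found or 300 top results evaluated, whichever comes first.
def should_stop(n_relevant, n_evaluated, batch_target):
    if n_relevant >= 50 or n_evaluated >= 300:
        return "hard_stop"   # the "more results" button vanishes
    if n_relevant >= batch_target:
        return "pause"       # waits for the user to click "more results"
    return "continue"
```

Here `batch_target` would advance by 10 each time the user clicks “more results” (10, 20, 30, 40).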
Queries accepted by Scholar Labs
With “AI-powered search,” the natural question is: What can I ask (in natural language)?
At the most basic level, does its natural-language search “understand” commands to filter by metadata or fields? Even though Scholar Labs has a sample query suggesting you can ask “Find papers from past X years on topic Y,” I am finding it somewhat inconsistent when you specify a year range.
Moreover, with the rise of agentic search, we are starting to get used to asking tools to do complex tasks like, “Find paper X and look for papers it should have cited but didn’t.” Scholar Labs is clearly not that agentic. In fact, it is quite strict about allowed queries.
Many of my “power user” queries returned a message stating: “Scholar Labs is currently not designed for queries like this.”
Queries that failed:
Summarise main points of <paper x>
What is figure 1 of <paper x>
papers by <author x> on <topic y>
Papers referenced by <paper x> and papers cited by <paper x> (these work inconsistently)
Some queries work inconsistently or with errors, such as the query below to find “books”: it seems to work, showing entries marked [books] at least at first, yet others found later are not.
Incidentally, this implies the product has access to the Google Books content, like normal Google Scholar.
I have been asked whether this product includes case law; I think it does not, as this is about articles. I am also fairly certain it never shows patents, and even [citation] entries do not appear, presumably because there is not enough information to evaluate them.
Speculation: This strict filtering might be designed to protect content owner relationships. By blocking “interrogation” style queries, Google ensures users still need to access the full text, rather than getting all the details from the AI.
Implications for researchers
If my assessment of how Scholar Labs works is correct, this is a game-changer solely due to the scale of Google Scholar’s index.
Scholar’s index dwarfs OpenAlex, Semantic Scholar (used by many new “AI academic search” tools), and others. It includes full text from almost every major publisher (which allow crawls by Google Scholar). If Scholar Labs inherits this exact index (and isn’t walled off from paywalled full text), it should dominate in non-STEM disciplines where other databases are weaker.
Already when testing searches like
find me papers that mention Elicit.com
it outdoes my favourite Undermind.ai tool simply due to the size of the index.
Its size is so big it often surprises me.
For example, I tested it with “impossible queries”—topics I thought had no results—and it surprised me by surfacing a forgotten PowerPoint slide or an obscure paper that mentions the query.
As such, it is probably excellent for finding papers when you have forgotten the title, though it is not perfect.
Weakness of the tool
That said, it’s not the “one ring” to rule all of Academic AI search tools.
Firstly, it is just Deep Search, or a “paper finder.” I am not underselling how powerful and useful this is; in fact, I wrote a whole blog post saying I am more bullish on Deep Search than Deep Research.
Most AI-powered search tools are pivoting towards Deep Research, producing long reports and visualizations that synthesize results across multiple papers. This can be valuable for quickly getting an overview of the landscape, yet it is something Scholar Labs does not touch and likely will not for a while.
Secondly, the current interface is a bit awkward. I have already seen comments from people complaining that it is tiresome to keep clicking “more results.” Why not let it run longer (until it has evaluated 300 or so results)? I guess as a free product it needs to save compute; Google is of course rich and powerful, but Google Scholar is far more popular than AI2 Paper Finder…
Inherent Limitations for High-Recall Systematic Reviews
While Scholar Labs offers high precision—and will frankly impress anyone who hasn’t used this class of tools before—it is not sufficient on its own for evidence synthesis or systematic reviews, which demand super-high recall. This is because Scholar Labs inherits the inherent limitations of the underlying Google Scholar infrastructure.
To explain this, we must look at standard practice. A common guideline for evidence synthesis when using Google Scholar is to scan the first 200–300 results. Scholar Labs appears to automate this approach (ignoring the stop at 50 relevant rule). However, in traditional systematic reviews, this method is used only as a supplement, never as a replacement for the main search process that involves searching across multiple databases.
Why? Early papers from the 2010s by evidence synthesis experts demonstrated that while Google Scholar’s index is vast—often having near 100% coverage of relevant papers found in a review—findability is a different issue. You can prove this high coverage yourself by taking a completed systematic review and searching Google Scholar by title for the included papers; they are almost always there.
This raises the question: if coverage is so high, why not cut out the middleman and use Google Scholar alone? The answer is that while the papers are indexed (high coverage), finding all of them using only Google Scholar is impractical due to the poor precision caused by its limited search functionality.
Compared to structured databases like PubMed, Scopus, or Embase, Google Scholar suffers from significant limitations:
No complex search logic: Lack of nested Boolean support and no proximity operators. No official support for truncation for word endings.
Short queries: Stricter limits on search strategy length (no more than 256 characters). EDIT - Michael Gusenbauer’s great work at SearchSmart also detected a change in Google Scholar support of query length from 256 to 2048 character length! Likely related to Scholar Labs launch.
No controlled vocabulary: Limited filters and no MeSH/Emtree equivalents.
Limited field searching: In particular, you cannot limit searches to just “Title/Abstract.” Because you are searching full text, you get massive amounts of noise for some query terms.
Export limits: A lack of bulk export options and a hard cap of 1,000 results.
These limitations make it difficult to craft a single search strategy that captures all relevant results, even if all relevant papers are indexed. A very broad search strategy might work in theory, but in practice you end up with too many false drops (super low precision).
One example: when a traditional search strategy from a Cochrane Review was translated for Google Scholar, a 1,391-character strategy combining a large number of drug-specific names with OR had to be simplified. This simplified Google Scholar strategy probably led to missing relevant papers.
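Some back-of-envelope arithmetic shows why the old 256-character limit was so restrictive and what the reported 2,048-character limit buys. The 12-character average term length here is my own illustrative assumption, not a measured value:

```python
# How many OR'ed terms fit in a query-length limit, assuming an average
# term length of 12 characters joined by " OR " (4 characters)?
# The 12-character average is an illustrative assumption.
def max_terms(limit, term_len=12, sep_len=4):
    n = 0
    # n terms occupy n*term_len + (n-1)*sep_len characters
    while (n + 1) * term_len + n * sep_len <= limit:
        n += 1
    return n

print(max_terms(256))    # old Google Scholar limit -> 16 terms
print(max_terms(2048))   # newly reported limit -> 128 terms
print(max_terms(1391))   # a 1,391-character strategy holds ~87 such terms
```

Under these assumptions, the old limit forces a ~1,391-character strategy to shed roughly four-fifths of its terms.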
Worse, even if a search captured them, the lack of abstract-only matching and other features to create precise searches means you would be drowning in noise.
Finally, even if you were willing to spend more time screening, the 1,000-result hard cap acts as a final barrier.
This is not just theoretical; it has been studied multiple times.
One study illustrated this perfectly: Google Scholar had 97.2% coverage (verified via known-item search), but recall dropped to 72.6% when using a real search query (Original Query AND Title AND Author), and plummeted to 46.4% if restricted to the top 1,000 results.
Another study did better, reporting a recall of 92.9% instead of 72.6% (using a real search query but ignoring the 1,000-result limit), but noted that even if you could screen all the returns, you would face a precision of 0.13%, more than 20x worse than conventional databases. This means you would need to screen over 20 times as many papers to find each relevant one.
Note that this horrible precision assumes you screen everything; in reality, precision when screening up to 1,000 results is around 1%, which is not so bad, but you take a big hit in recall.
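To put those percentages in screening-effort terms, here is the simple arithmetic on the figures quoted above:

```python
# Screening effort implied by the precision figures above.
precision_scholar = 0.0013            # 0.13% precision, screening everything
precision_database = 0.0013 * 20      # a database ~20x more precise (2.6%)

# papers you must screen, on average, to find one relevant paper
print(round(1 / precision_scholar))   # -> 769
print(round(1 / precision_database))  # -> 38

# screening only the top 1,000 at ~1% precision yields roughly 10 relevant hits
print(round(1000 * 0.01))             # -> 10
```

That is the trade-off in a nutshell: screening the top 1,000 is tolerable work, but it caps how much of the relevant literature you can ever recover.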
While most 2010s studies in the medical sciences concluded Google Scholar is too blunt a tool to use alone, some recent studies in the 2020s (focusing on Social Science, Software Engineering, and even one in Otolaryngology) suggest otherwise. However, this success is often due to the specific, unique terminology used in those reviews, which prevents the result explosion common in broader searches.
Ultimately, even with near-100% coverage in the almighty Google Scholar index, failure occurs because:
The paper was in the database, but the search strategy failed to retrieve it.
The paper was retrieved by the search strategy but fell beyond the 1,000-result hard cap.
How does Scholar Labs change this? Unfortunately, Scholar Labs’ method of running a search and evaluating the top 300 results remains vulnerable to both issues. Even if the system is expert-level at screening, it can only evaluate what it “sees” in that top tier. A major improvement would be if Scholar Labs allowed users to evaluate the top 500 or even 1,000 results.
If AI is used for screening, in theory you could compensate for Google Scholar’s much poorer precision by “letting the AI do it,” but you would need to go way past 300 or even 1,000 results.
Scholar Labs does run multiple queries, which might mitigate the risk of relying on a single, length-limited search string. However, we need testing to see if these generated queries are diverse and effective enough.
If these multiple queries are simply generated by a prompted LLM, there is reason for pessimism; most studies attempting to use LLMs to generate Boolean search strategies for systematic reviews have shown unpromising results (though these focus mostly on PubMed searches).
Anecdotally, I have already seen Scholar Labs fail to find papers I know exist. Aside from the classic Gehanno et al. (2013) and Bramer et al. (2013) papers regarding coverage, I tried to find a more recent paper I knew existed - claiming that reviewing the top 500 Google Scholar results could yield 99% recall but I could not remember the title offhand. Scholar Labs failed to find it initially. I eventually surfaced it by manually altering the query to be more specific and restricting the date range to the last 5 years. This highlights that the tool remains highly sensitive to the specific query entered.
Finally, there is the concern of “outsourcing relevance.” While recent studies on using LLMs for screening are promising, results are heterogeneous and dependent on the model, prompting technique, and domain. Even with high correlation with human judgement, I worry about subtle biases—specifically, that the AI might consistently drop certain types of relevant papers.
In evidence synthesis, we generally tolerate false positives (noise) but fear false negatives (missing data). My initial skimming suggests Scholar Labs produces few false positives, which, paradoxically for a systematic reviewer, raises anxiety about potential false negatives.
As such, I expect evidence synthesis practitioners to be lukewarm about this tool. At best, it automates the supplementary “Google Scholar check” they already perform—perhaps offering slightly better recall via multiple queries—but it is unlikely to be groundbreaking. They will still need traditional databases to ensure the necessary recall.
Implications for discovery vendors
In a sense, Google Scholar deciding to play in the AI sandbox is bad news for discovery vendors. Google Scholar is the 500-pound gorilla in academic search, and anything they do will draw attention and be used by researchers.
That said, they currently still do not occupy the Deep Research space, so tools like Undermind and Consensus still have a place.
Implications for content owners
It is interesting that content owners are content (for now) with letting Google Scholar use their content, including full text, in an “AI search.” Granted, this is only AI Deep Search, and with certain query types filtered out, users still have to download the relevant found paper to read it, mitigating content owners’ greatest fear: that users armed with generative AI can benefit from full text without ever downloading the content.
Still, with the winds blowing towards distributing content via MCP servers (e.g., the Wiley AI Gateway), perhaps Scholar Labs is an anomaly, because only Google has the clout and visibility to be given access to full text?
Conclusion
This is a quick 24 hour review of a new tool that is likely to cause waves by simply being part of Google Scholar. No doubt much of what I have written is either foolish or totally wrong but that is the price of trying to be early.
I will end by stating my wish list:
Option to extend evaluation of the top 500/1,000 results (premium??)
Option to set it to stop after every 10/20/30/50 relevant results (premium??)
Allowing saving of Scholar Labs search sessions, and even one that can be shared
More transparency
show the keyword searches used in the multiple queries
show not just “found X relevant results” but also how many top results have been evaluated (maybe even a visualization of relevant found vs. evaluated) when the search stops
Could Scholar Labs do citation chasing of found relevant articles, by evaluating the citations of the relevant articles it found?
Will search alerts be based on this new paradigm of evaluation?