Google's AI Mode: It Can Out-Search You, But Can It Out-Evaluate You?
Introduction
This piece makes a simple claim: Google’s AI Mode may out-search you, but it won’t necessarily be able to out-evaluate you.
Launched as an experiment in March 2025 (US-wide in May, global in August), AI Mode pairs Gemini 2.5 Pro with aggressive fan-out: splitting a question into many sub-queries, sweeping hundreds of pages, and answering in seconds.
That more complete search is often underestimated and may even allow AI Mode to outperform humans in "needle in a haystack" tasks.
However, when a more comprehensive search returns a variety of conflicting sources, AI Mode in Google (or in most LLMs) may not be better than a skilled human at evaluating and weighing the evidence. This is not to say a human is always better; in scenarios where the AI fails, the average person might make the same mistake or not perform any better, as the example below illustrates.
Another catch is the opacity of Google's AI Mode: you can’t see the sub-queries it issues. This matters because LLM search (like normal search) can suffer from confirmation bias, and not seeing the sub-queries can leave you blind to the biases in your search.
Many of the ideas here build on Mike Caulfield's work, and one trick I picked up from reading him is that you can use “AI search” to surface candidates fast and then probe with prompts like, “give me evidence for and against <claim>.”
What is AI Mode?
Launched as an experiment in March 2025, opened to all users in the US in May, and rolled out globally in August 2025, AI Mode is pitched as the grown-up sibling of the much-mocked AI Overviews.
The promise is that it can tackle messier, reasoning-heavy queries with a multimodal Gemini 2.5 Pro under the hood and can handle follow-ups smoothly.
What we know about Google AI Mode
Unfortunately, we know relatively little about Google AI Mode except that it uses a version of Gemini 2.5 Pro with multimodal capabilities and can handle follow-up questions. It is designed for more complex questions that require reasoning, as compared to AI Overviews, which was meant to handle more straightforward factual questions. We are also told it uses a “query fan-out technique”:
> issuing multiple related queries concurrently across subtopics and multiple data sources. While responses are being generated, our advanced models are now able to identify and access even more supporting web pages than was previously possible.
Conceptually, that fan-out is familiar to anyone who’s played with RAG-style fusion: break the question apart, run several searches in parallel, then synthesize.
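To make the idea concrete, here is a toy sketch of fan-out plus reciprocal rank fusion in Python. The sub-query decomposition and the word-overlap "search" are crude stand-ins (a real system would use an LLM and a real index), and nothing here reflects Google's actual implementation:

```python
from collections import defaultdict

def fan_out(question):
    # Stand-in for LLM-driven decomposition: these sub-queries
    # are hypothetical illustrations, not what Google issues.
    return [question, question + " history", question + " criticism"]

def search(query, corpus):
    # Toy ranking: score documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )

def fuse(rankings, k=60):
    # Reciprocal rank fusion: documents ranked high across many
    # sub-query result lists float to the top of the final answer.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

corpus = [
    "wizard of oz snow scene used gypsum",
    "wizard of oz snow was asbestos claim",
    "history of special effects snow",
]
rankings = [search(q, corpus) for q in fan_out("wizard of oz snow")]
top = fuse(rankings)
print(top[0])
```

The synthesis step in a production system would then read the top fused documents and write a prose answer; the fusion above only decides which pages get that attention.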
Let’s use Google AI Mode itself to explain the concept.
Of course, Perplexity and other similar tools employ much the same technique.
But the main thing I noticed while “vibe testing” is how fast Google's AI Mode is, despite searching quite deeply.
I’ve personally seen Google AI Mode run up to six queries covering over 100 sites in a matter of seconds, which is far faster and more thorough than I’ve typically seen Perplexity do. (Perplexity Pro is capable of searching deeply, but it takes minutes, not seconds).
The main issue with Google AI Mode is that, unlike tools such as Perplexity, it does not actually show you which sub-queries were issued.
Empirical tests of Google AI Mode vs others
Google AI Mode is still very new, so there aren’t many formal tests. One of them is a piece from *The Washington Post* comparing free tools only. (As a bonus, librarians were involved in the judging and testing.)
You can check the full methodology here.
In this test, at least, Google AI Mode came out comfortably on top.
But it is important to note that this test favors tools that are very search-heavy, with categories on “Trivia,” “Sources,” and “Current Events” that benefit heavily from a deeper search. Arguably, the categories on “Images” and possibly “Built-in Bias” are also helped by a deeper search.
The free version of GPT-5 that was tested (it placed second) probably would have done better in its paid version, or with the user manually selecting the search option, because the free version often decides not to search and instead answers directly from its pre-training data (to save costs), leading to poor results.
Consider turning on web search in GPT-5 free for search-heavy tasks.
Strength of Google AI Mode
Is Google AI Mode considered “Deep Search”? In some sense, how much or how long something searches before it becomes a “Deep Search” is arbitrary. In fact, if you are a Google AI Pro or Ultra subscriber, you get an extra option for “Deep Search” for AI Mode.
This does the exact same fan-out technique but can go even deeper, to hundreds of results. I think about this as the non-academic version of Gemini Deep Research.
As compute gets cheaper, I foresee that “Deep Search” in all its forms will become common and expected. The ability to search and interpret hundreds or thousands of web pages in seconds is going to be a huge advantage that AI search has over humans. One can imagine that in "needle in a haystack" situations, where there is just one source with the required answer, AI search will trump human efforts.
But does that mean the days of human searching are numbered? Not quite.
Weakness of Google AI Mode
While Google AI Mode (and future AI searches) will be able to cover a ton more ground than humans at much faster speeds, they still need to be capable of properly interpreting, evaluating, and judging between sources if they contradict. In short, we need these AI search modes to be good at information literacy, fact-checking, and evaluating sources.
This isn’t an original observation. As far back as OpenAI’s now-forgotten WebGPT project, which could browse the web for sources, the paper's authors noted:
> We made a number of difficult judgment calls, such as how to rate the trustworthiness of sources (see Appendix C), which we do not expect universal agreement with. While WebGPT did not seem to take on much of this nuance, we expect these decisions to become increasingly important as AI systems improve, and think that cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound.
As mentioned earlier, I’ve mostly been thinking about this aspect through the work of Mike Caulfield (co-creator of SIFT).
At the risk of misinterpreting Mike, the main point of "Using AI Is a Process Too" is that while “AI search” can find a needle in a haystack and give you a direct answer, you must still engage with the process by looking at the sources the AI used and “read the room/results page,” as he puts it in his book *Verified*.
Another interesting trick I see in his video is that he likes to ask follow-up questions of the nature, “give me evidence for and against <claim>.”
For example, Mike himself has used the question of what the snow in the *Wizard of Oz* movie was made of. This is a very tricky question, because a typical Google or web search will return top results almost uniformly saying the snow was made of asbestos, which is wrong. Even Snopes got it wrong. Mike also notes that LLM search might have confirmation bias.
The trick here is to always follow up with, “give me evidence for and against <claim>.”
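As a workflow, the trick amounts to never stopping at the first synthesis. A minimal sketch, where `ask` is a hypothetical stand-in for any chat-model call and the responses are hard-coded purely for illustration:

```python
def ask(prompt):
    # Stand-in for a real LLM/search call; returns canned text here.
    canned = {
        "direct": "The snow was made of asbestos.",
        "probe": ("For: many secondary sources repeat the asbestos claim. "
                  "Against: a film historian cites a primary source "
                  "saying the snow was gypsum."),
    }
    if "evidence for and against" in prompt:
        return canned["probe"]
    return canned["direct"]

def check_claim(question, claim):
    # Step 1: get the direct answer the tool would normally give.
    answer = ask(question)
    # Step 2: always follow up with the for/against probe,
    # rather than accepting the first synthesis as settled.
    probe = ask(f"Give me evidence for and against the claim: {claim}")
    return answer, probe

answer, probe = check_claim(
    "What was the snow in The Wizard of Oz made of?",
    "the snow was made of asbestos",
)
print(probe)
```

The point of the pattern is the second call: it forces the model to surface dissenting sources that the first, consensus-shaped answer tends to bury.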
Mike correctly points out that, given there is a primary source where a person in a position to know states to a credible source—a film historian—that the snow was made of gypsum, it is more likely the asbestos story is wrong.
At this stage, we can say that AI-powered deep search might be good at "needle in a haystack" tasks, but even if it does find the right source, it may not prioritize it over other more common, contradicting sources.
A properly prompted system like DeepBackground GPT might help…
Of course, this does not mean Google AI Mode is useless. Chances are, most humans using Google manually would have made the same error, either because they didn’t search deep enough to find the primary source or, even if they saw it, they may have erroneously reasoned that this source is an outlier and should be ignored compared to the majority of sources, which includes the reputable Snopes site.
At this stage, these tools might out-search you, but their judgment and synthesis of sources are not at a stage where they are equivalent to a top human expert, so you should not trust the answer 100%.
Conclusion
Google’s AI Mode is a genuine step-change for web search: fast, aggressively “fanned-out,” and increasingly capable of stitching together answers from a very wide crawl. It will routinely out-search most of us with minimal effort, and with “Deep Search,” it can push even further when recall matters. But its opacity (no visible sub-queries) and still-developing judgment mean it can still make errors a skilled human might not.