Learning from Google Scholar and why a tool does not need to be flawless to be useful
What 2004 can teach us about 2024 — and the librarians who keep getting the lesson wrong
This post is part of a “hot takes” series in which I make sharper claims than I usually do. I do not intend to offend, and I am not trying to tar every librarian with the same brush — the patterns I describe and perceive may be a function of my own local context.
In my last hot takes post, I argued that while designing a search system to maximise learning gains may not always align with designing a search system that scores the best for relevancy, unplanned friction in learning aka poor relevancy is never a good idea.
In this post, I consider the idea of tools that are considered flawed because they can’t match the performance of an older tool in some way1.
Imagine you are a librarian in 2004, when Google Scholar launches in beta.
You have read the studies. The coverage gaps are real. The metadata is wretched. Nobody, including Google, can tell you exactly what is indexed2.
You decide it is the dumbest thing on the planet and declare that it could never be useful to anyone, despite a vocal minority of users insisting otherwise.
Fast forward to 2024. Google Scholar has become the default academic search starting point for many researchers and is widely regarded as the most comprehensive free academic search engine3. Your first assumption might be that Google fixed all the problems.
But that assumption would be mostly wrong.
Sure, Google Scholar improved, especially in coverage. Its metadata may have arguably also have improved. But you still cannot consult a stable, auditable list of indexed sources or records. Even Google’s own guidance effectively tells users to test coverage by sampling titles rather than by checking a definitive inventory. One of the fundamental weaknesses librarians diagnosed in 2004 is still there.
And yet the librarians and researchers of 2024 are not idiots for using Google Scholar heavily today.
What happened is simple. Users learnt to compensate. They used Scholar for what it was good at and routed around its flaws or used another tool for other use cases (e.g. Evidence synthesis). They learnt when its metadata could not be trusted, when its coverage was opaque, and when to complement with another database was needed. The tool did not become perfect. It became useful enough, along side other tools.
That trajectory is suprisingly common with new technology and one I think about when I read arguments thay say AI or any technology tools can never be useful because of some “fundamental flaw”.
The concession has already been made on usefulness
Some of the most prominent sceptics have already said as much.
Gary Marcus, cognitive scientist, author of Rebooting AI, and one of the most consistent public critics of AI hype, has repeatedly acknowledged that LLMs can be useful, especially for coding, brainstorming and writing, while arguing that they are unreliable and not a route to AGI alone.
Margaret Mitchell, a co-author of the influential “Stochastic Parrots” paper, has been even more explicit: LLMs can be “extremely useful”.
Emily M. Bender has likewise clarified that “stochastic parrot” is a description or metaphor, not an argument that these systems have no possible utility4.
Mike Caulfield, of SIFT fame, is actively studying and using these tools. These are not AI boosters. If even they concede that the tools can be useful, the basic question has already been settled.
You can oppose LLM use on environmental grounds, labour grounds, epistemic grounds, or any number of other defensible grounds.
But the claim that these tools can never be useful to anyone has moved past argument into something closer to an unfalsifiable position. No demonstration of utility, no improvement in the tools, and no evidence of successful use by researchers seems able to update it.
A note on scope
For the rest of this post, when I say “AI tools”, I mean AI-powered academic search tools. I work in this space, so I will stay in my lane. The argument may extend elsewhere, but I am not pretending to make that case here.
It is also worth being precise about what “AI search” means. The term covers several different things: changing what gets retrieved, reranking results, summarising content, and generating direct answers to questions. These are not the same capability and do not carry the same risks.
A librarian who objects to generative answer synthesis is making a different argument from one who objects to AI-assisted reranking. Conflating the two muddles the debate. Before objecting to “AI search”, it is worth saying which part concerns you, and why.
The Google Scholar analogy maps most cleanly onto AI-assisted retrieval and reranking: helping surface relevant results that users might otherwise miss. It also maps reasonably well onto “tip-of-the-tongue” search, one of the limited uses Bender has acknowledged as potentially useful.
It maps less directly onto generative answer synthesis, where hallucination risks are sharper. I am not arguing that all uses of AI in search carry equal risk. I am arguing that even the riskiest versions clear the “never useful” bar. The appropriate response to different risk profiles is differentiated teaching, not blanket rejection.
Back to the analogy
The people who insisted in 2004 that Google Scholar could never be useful until Google published full holdings lists were sure they were right. But they were eventually proven wrong to conclude that a tool could never be useful without that.
A tool need not be perfect to be useful. This sounds obvious, but it is the point that keeps getting lost.
One objection is that the Scholar analogy fails because LLM errors are different. Google Scholar had messy metadata and opaque coverage. LLMs produce overconfident hallucinations.
That objection has force, but notice what it actually supports. It supports teaching verification skills. It supports scaffolding. It supports appropriate scepticism. It does not support the early conclusion that the tool is useless.
A tool that requires careful handling is not the same as a tool that cannot be useful. The verification argument also cuts both ways. Uncritical acceptance of search results, including Google Scholar results, has always been the failure mode librarians teach against.
Perhaps the scaffolding still will not be enough. But I am sure many librarians who rejected Google Scholar in 2004 were pretty sure too.
Nor does it help that some of the most vocal sceptics seem not to have engaged seriously with these tools since 2023. They appear to underestimate how much the systems, and the harnesses around them, have changed even in just three years. But even that point is secondary. The Scholar parallel does not depend on the exact pace of improvement5.
I currently believe from some experience setting up and testing agents, that between the use of code to constrain the model and aggressive multi-validation checks, you can reduce the probability of error/hallucatins down to very low levels comparable to the human.
The lesson is not that tools improve. The lesson is that what counts as good enough is not always obvious at first.
Three objections
Some librarians argue that the Google Scholar comparison to AI breaks down on three grounds: librarians do not actively promote Scholar (but librarians actively promote AI); users genuinely want Scholar rather than having it pushed on them; and Scholar is free, whereas many AI tools are commercial products.
None of these objections does the work required.
On promotion, many librarians do promote Google Scholar. It appears in LibGuides, instruction sessions and one-on-one consultations. The claim that “we do not promote it” is a polite fiction. Beyond what is said publicly, plenty of academic librarians reach for Google, Scholar or Wikipedia first in their own work when the situation calls for it.
On user demand, users clearly want AI tools. Whatever one thinks of that demand, pretending it does not exist is not a serious position.
On cost, being free or paid is a separate question from whether a tool can be useful. There are genuine concerns about commercial AI: vendor lock-in, inequity of access, environmental cost, labour implications, surveillance, and the commercial capture of scholarly infrastructure. Those concerns deserve serious engagement.
But they are arguments about adoption, governance and institutional support. They are not arguments that the tools can never be useful.
And let us be honest about the comparison. Library databases have plenty of flaws: idiosyncratic interfaces, uneven indexing, opaque relevance ranking, and sometimes weak metadata. We still pay substantial sums for them and promote them as a matter of course. The objection to AI tools cannot simply be that they are commercial and imperfect, because by that standard half the collection budget becomes difficult to defend.
The “abusing trust” argument
There is a related claim that deserves a direct response: librarians who teach users how to use AI search tools are abusing professional trust because the tools are imperfect and can lead to errors.
This is a bad argument dressed up as an ethical one. It rests entirely on the premise that AI search tools are imperfect, as though that distinguished them from anything else we teach.
e.g. Name the flawless tool, we libraries promote?
Our job, properly understood, is to teach people to use imperfect tools well, with appropriate scepticism and a clear understanding of what each tool can and cannot do. Refusing to teach AI search tools because they are flawed is not an act of professional integrity. It is an abdication of the actual job.
It also leaves users to figure these tools out on their own. That is the worse outcome by every measure.
The badly understood “Stochastic Parrots” argument
A common argument among librarians goes something like this: Emily Bender says LLMs are stochastic parrots; therefore, LLMs can never be useful.
There is an immediate problem with that argument. Even if you accept the “stochastic parrot” description, it does not tell you whether LLMs combined with other technologies can be useful. It says nothing directly about retrieval-augmented generation, tool use, calculators, citation validators, structured workflows, human review, or other harnesses wrapped around the model.
The more damaging problem is that Bender herself has clarified that “stochastic parrot” is not an argument that LLMs are useless. In her account, it is not even an empirical hypothesis. It is a description or metaphor for systems that generate fluent linguistic form without grounding in communicative intent, a model of the world, or a model of the reader’s state of mind6.
This does not mean Bender thinks LLMs are broadly useful. Her position is far more sceptical than that. She has warned that synthetic text is not an information source, and that using chatbots as reliable sources of knowledge is a serious category mistake.
But she has acknowledged limited possible uses, including “tip-of-the-tongue” search, language-learning dialogue partners, non-player characters in games, and non-generative uses of language models in classification, speech recognition and machine translation. She treats summarisation more cautiously, because it can introduce material not present in the source.
Nor does this mean Bender has retreated from the stronger “form versus meaning” argument. In Bender and Koller’s 2020 paper, understanding is defined as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it has access only to patterns in text, not to the extra-linguistic world those texts are about.
That is a serious argument. But it should not be flattened into the much weaker claim that LLMs can never be useful.
So the better conclusion is not “stochastic parrots can never be useful” (though she is currently very skeptical). It is: do not mistake fluent synthetic text for grounded understanding or reliable information. That is a much narrower, stronger, and useful warning but does not address the question on usefulness.
But it leaves room to ask the question that actually matters for librarians: under what conditions, with what scaffolding, for which tasks, and with what verification, can LLM-based systems be made useful rather than misleading?
The lesson
The lesson from Google Scholar is not that librarians should embrace every flawed tool users like. It is that “flawed” and “useless” are not synonyms.
It is hard to compare like for like, but I think it is fair to say that the practical gains in LLM-powered tools from 2023 to 2026 have been faster and larger than Google Scholar’s improvements across its first decade. But the more important point is not the scale of improvement. It is that Google Scholar’s improvement did not fundamentally fix its transparency problem. Instead, users learnt that the flaw was either less fatal than it first appeared, or manageable with the right habits.
That is the lesson librarians need to take seriously now.
If the objection is environmental cost, make the environmental argument. If it is labour exploitation, make the labour argument. If it is vendor lock-in, inequity, surveillance, weak governance, or commercial capture, make those arguments. They are serious enough to stand on their own.
They do not need a backdoor return to the claim that AI tools cannot really be useful.
That move is increasingly unconvincing. It usually begins with a reluctant concession: of course AI can sometimes be useful. Then, when the discussion turns to teaching, adoption or institutional support, the old premise quietly reappears. The tools are flawed, so using them must be irresponsible.
But that is not how librarians treat tools.
We teach flawed systems all the time. We teach Google Scholar while warning about coverage and metadata. We teach Scopus and Web of Science while explaining their selectivity. We teach discovery layers while knowing their indexing and ranking are imperfect.
The professional act is not pretending tools are flawless. It is teaching people where they help, where they fail, and how to verify what matters.
So reject an AI tool because the cost is too high, the governance is too weak, the evidence is too thin, or the institutional incentives are wrong.
Just say that.
Do not dress those objections up as proof that the tool can never be useful. That argument has already lost.
The question now is not whether AI search tools can be useful. It is which uses are worth the cost, which are not, and what role librarians should play in helping users tell the difference.
There are many similarities to the Innovator’s Dilemma argument. Early users may value dimensions that experts discount, and a tool that performs poorly against established professional criteria may still become useful enough to reshape practice. But unlike the Innovator’s Dilemma, I refer to cases where the alternatives co-exist.
Google Scholar offered no stable, auditable list of indexed sources or records, and even Google’s own guidance effectively tells users to test coverage by sampling titles rather than consult a definitive inventory.
As confirmed by many studies. The amount of full-text indexed by Google Scholar is also believed by many to be unmatched.
To be clear, Bender is not being cited here as making the same claim as Marcus or Mitchell that LLMs are broadly useful. Her position is much narrower and more sceptical. In interviews, She has said that safe and beneficial uses of synthetic text are hard to identify, but has offered tentative examples especially “tip of the tongue” search, where a user describes something in order to recover the name of it and can then verify it through ordinary search. She also distinguishes text generation from other uses of language models, saying that language models can have positive uses in classification, automatic speech recognition, and machine translation, while treating summarisation as more borderline because it can introduce material not present in the source. The point here is therefore not that Bender endorses LLMs as broadly useful, but that “stochastic parrot”, in her own account, is not an empirical hypothesis or an argument that LLMs have no possible utility. It is a description or metaphor for language-mimicking systems, and the 2021 paper was about the risks and harms of pursuing ever-larger language models, not a general paper about “AI”.
The improvement of LLMs between 2022 to 2026 is far larger than from 2004 to 2024! This improvement comes from both improvements in the models as well as the use of harnesses like Claude Code to combine deterministic code with LLM flexibility.
To be clear, this does not mean Bender has retreated from the stronger “form versus meaning” argument. In her account, the closest thing to an argument in this area is Bender and Koller’s 2020 paper, which defines understanding as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it only has access to patterns in text, not to the extra-linguistic world those texts are about. This is separate from the “stochastic parrots” phrase itself, which Bender describes as a metaphor rather than an empirical hypothesis. She also notes that multimodal systems complicate the picture: image-text models may meet the Bender and Koller definition of understanding in a very thin sense, because they can map between linguistic strings and images. But she argues that the stochastic-parrot framing remains relevant to such systems and to systems built around them.





So many good points. Thanks for writing this.