Why EXACTLY is Google Scholar bad for evidence synthesis, systematic reviews?
Let's be clear here, Google Scholar is ill-designed for use in systematic reviews. I am not trying to argue otherwise. (Obligatory warning: I am not a real systematic review librarian.)
But why exactly? When I ask librarians and systematic review librarians, they tend to give a variety of reasons, from lacking particular features (e.g. it doesn't have features to support precision searching, you can't easily bulk export, it typically gives you a maximum of 1,000 ranked results etc.), to some specific aspect of operation that is disliked (e.g. its index is unstable because it is based on crawling), to higher-level complaints (e.g. it's not transparent, it's not reproducible, it's not explainable or, the catch-all, it's a "black box").
While all these reasons are mostly true, the question I am curious about is which features are the most fundamental blockers? (I will be mostly using Michael Gusenbauer's excellent SearchSmart.org to survey features of academic search.) This matters because if some of these supposed issues are in fact also found in databases librarians DO use or might use, this would make us look hypocritical (e.g. we are against Google Scholar because it isn't a tool we subscribe to). Moreover, if we fixate on the wrong issue, then another tool might be wrongly dismissed for a relatively minor point. I will give an example later.
While the appropriateness of a database for systematic reviews likely sits on a spectrum rather than being a straight binary decision (e.g. search features that allow precision searching, total records per batch export), it is still good to distinguish such features.
In classic "Musings about librarianship" overthinking fashion, I go through quite a bit of overthinking to conclude that, leaving aside the obvious practical concerns that make the process efficient and effective (e.g. powerful search features for precision searching, bulk export etc.), what we really want is reproducible boolean (as opposed to other lexical or even semantic type) searches (even though we may not be able to explain or interpret the order of results)....
Let's start from the most specific features and work our way up to more general, high level complaints.
1. Google Scholar ranks at most 1,000 results.
This is well known, but in case you are unaware...
Below is a search for "information retrieval"; Google Scholar claims "about 8,230,000 results".

However, as I jump forward, the furthest I can go is page 98 which is the 980th result.

Clicking on page 99 or 100 gets me a server error page.

This (which btw is the same limitation you find in Google) is no doubt a fundamental issue, and I guess you could argue that in theory you can work around it somewhat by breaking your queries into parts, but this is not an argument I care to make.
Why do search engines or databases do this? From what I understand, a common task in information retrieval is what is known as top-k retrieval. This means the search will try to retrieve and rank only the top k documents, where k can be set to, say, 1,000. This is of course far easier to do than trying to rank every document in your corpus. This makes sense because in the real world, you rarely want or need to rank everything.
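To make the top-k idea concrete, here is a minimal sketch in Python. The document ids and scores are made up for illustration, and this is not how Google Scholar or any real engine is implemented; it only shows why keeping the best k is cheaper than sorting everything.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    heapq.nlargest only tracks k candidates at a time, so the cost is
    roughly O(n log k) rather than the O(n log n) of sorting every document.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical relevance scores for a tiny corpus.
scores = {"doc_a": 0.91, "doc_b": 0.15, "doc_c": 0.78, "doc_d": 0.42, "doc_e": 0.66}

top3 = top_k(scores, 3)
for doc_id, score in top3:
    print(doc_id, score)
```

Anything ranked below position k is simply never returned, which is the behaviour you hit when you page past the last result Google Scholar will show.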
In fact, scoring or ranking algos (e.g. WAND) can do "approximate" ranking. "Unsafe" systems trade off accuracy for speed: you give up the guarantee that you will definitely retrieve all of the top k scored documents. Again, in the real world, this trades on the idea that small differences in relevancy scores do not map to real human preferences, e.g. a BM25 scoring algo that gives 0.025 vs 0.024 to two documents may not mean anything.
While this type of ranking has become more popular in the real world due to the rise of web-scale collections, this objection is a somewhat boring one, since as far as I know very few academic search engines or databases do this type of "rank top-k" search. Do you know of any? Offhand, I remember someone complaining to me years ago that this was happening in ProQuest Summon, but this might be fixed now.
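For readers who have never seen BM25, below is a rough sketch of one common textbook variant. The parameter defaults (k1=1.2, b=0.75), the toy documents and the function name are my own assumptions for illustration, not taken from any specific database. It shows how two documents that both match the query can end up with slightly different scores purely because of document length.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query (one common BM25 variant)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized slightly.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, already tokenized (hypothetical examples).
docs = [
    "information retrieval systems rank documents".split(),
    "information retrieval methods rank many documents".split(),
    "cooking recipes for dinner".split(),
]
scores = bm25_scores(["information", "retrieval"], docs)
print(scores)
```

The first two documents contain the same query terms the same number of times, yet the shorter one scores a little higher. Whether that gap reflects any real difference in usefulness to a human is exactly the question.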
So let's move on.
2. Google Scholar has no official bulk export function
Everyone knows there are unofficial ways to bulk export using the Zotero browser extension, Harzing's Publish or Perish tool, custom scrapers etc., but it is like pulling teeth, and eventually you will get the dreaded reCAPTCHA appearing.

You can work around this somewhat by changing IPs constantly etc., but it is too painful.

Again like #1, I feel this is a somewhat boring objection, because I can think of no academic database that lacks an official bulk export function.
Though I guess if you want to argue, up to fairly recently, Web of Science had pretty low limits on bulk exports if you wanted full metadata (I think it was 500), but certainly no academic database is similar to Google Scholar with NO official bulk export function.
3. Google Scholar lacks features needed for high recall, high precision searching.
Back in the 2010s, some people found that the index of Google Scholar was so big that for most systematic reviews, if you checked to see if the included papers were in Google Scholar by searching directly by the title, you would find pretty much all of them were indexed in Google Scholar, leading to hopes that one could just use Google Scholar alone for systematic reviews.
This swiftly led to responses showing that in practice you couldn't actually find those papers working from scratch, because Google Scholar had many limitations in search functionality, which included:
a) A limit on the length of the search string you could use (256 characters)
b) No support for parentheses for nested boolean
c) Limited field searching
d) No left/right truncation
e) Lack of controlled vocab etc
See SearchSmart's testing of Google Scholar functionality

This, coupled with limitation #1, meant it was extremely difficult to get a high enough recall search due to the lack of search functionality (many if not most relevant results would not be in the top 1,000).
Note: A more recent paper seems to have found otherwise, that they could craft search strings in Google Scholar that found 98% of gold standard papers from the first 200 GS results, and 99% if extended to 500 results, but this may be a special case, where the original SRs were not using very comprehensive searches and/or the subject "Otolaryngology Head and Neck Surgery" might be specialized enough that it was easy to find all the papers using just that keyword. (See discussion)
I suspect using multiple iterative searches of Google Scholar and combining what was found could in theory find a large number of the included papers, but the systematic review protocol requires just ONE search strategy per database.
Again, this clearly is a feature that isn't binary, but it is fair to say Google Scholar is on the wrong side of this one by far.
4. Google Scholar is not a database?!?! or Google Scholar crawls the web so its index is unstable
Here, we move to less clear objections. The problem with saying X isn't a database is that it leads to total confusion, as librarians don't even agree on the definition. Throughout my career, I have seen many librarians define "library database" in totally different ways, with different dimensions used for what counts as a database, such as:
Full text vs non-full-text (but what about databases with some full-text?)
Publisher journal portal vs aggregator database/platform (but what are repositories?)
Non citation index vs citation index
(the latter supposedly counts as a database, if you were wondering)
With Google Scholar, the "it's not a database" charge is hard to pin down, but I can assure you the idea that Google Scholar does not have an index is false (since you need one for search to be possible!).
What librarians get at with Google Scholar not being a database, I think stems from the correct idea that the Google Scholar index is constructed mostly by using crawlers to crawl the web (including journal sites, repositories and even individual pages) to index papers.
The argument seems to be that, unlike "real databases", it is constantly crawling the web and updating, which is very unstable, or as someone put it, it is "not holding it's content anywhere, it is searching". I've also seen some suggest these "academic search engines" are inherently unstable, with the implication that because of this, they provide fundamentally irreproducible results.
The problem with this argument is that plenty of new databases, or if you prefer academic search engines, do a lot of crawling. The most famous relatively recent example is OpenAlex (which incidentally can be used with a popular systematic review tool, EPPI-Reviewer). But I'm pretty sure everything from Semantic Scholar (which is fairly popular in systematic reviews and whose index is the source of many "AI tools") to Lens.org to CORE etc. has a not insignificant portion of its index based on crawling the web and repositories.
Is it really true that if your index is constructed by crawlers, your search is not reproducible? OpenAlex naturally disagrees.

I think OpenAlex is technically correct that instability isn't a mark against reproducibility, but practically speaking, without code (which is very common), you do want stability of results as some assurance (for example, Google and maybe Google Scholar may do random AB testing or change algos every few weeks). Still thinking about it.
5. Google Scholar is a black box, the algorithm is not transparent or known
As the above reply by OpenAlex says, the idea here is that Google Scholar does not publish its code, or even just the algorithm for how it decides what to show in its results in response to a query, so it cannot be trusted.
The immediate counter to this is to ask: besides a few rare examples (e.g. PubMed, we are told, uses BM25 for a first retrieval step and then reorders the top 500 results with learning to rank (LambdaMART)), do we really know how the algorithms used in databases work? Does Elsevier publish its algorithm for Scopus? Can we reimplement Scopus to test?
To be fair, Scopus does have help pages telling you how they match results (the usual boolean search), but it's totally not transparent on how it ranks when you sort by relevancy.
Pretty much every academic database trusted by librarians is like this, with the rare exceptions of PubMed and OpenAlex, which have technical papers on the ranking and reranking algorithm or make their code open for inspection (PubMed's code is here).
Is this a concern? Or is this a spectrum/degree thing again.
6. We do not understand how Google Scholar works, results are not interpretable or explainable
"Transparency is achieved through accessibility and comprehensibility. Sharing an algorithm's code makes it accessible, but if it is not comprehensible to most users, do the system owners achieve transparency?... PubMed also works toward comprehensibility by publishing explanations of their algorithms in a language that is understandable to a broader audience."
This is from the paper - "Artificial intelligence behind the scenes: PubMed's Best Match algorithm", so far so good.
Then it goes...
"However, considering the complexity of the algorithms, users might not understand how this algorithm actually affects their search results, bringing into question if transparency has been achieved. Furthermore, determining whether AI is used by a system can be nearly impossible for a user. Unless someone has the curiosity to dig into how the Best Match algorithm functions, they could have no idea that an AI algorithm ranks the results to their query and thus be oblivious to the possible implications, both positive and negative."
It also talks about how complicated algorithms are black boxes, such that even if a user understands the mathematical components of a particular algorithm, they might not be able to understand the specific ways in which algorithms interact with each other and manipulate inputs and outputs.
This gets worse if you include machine learning, which PubMed does via reranking of results using user clicks as a criterion for success.
This doesn't paint a very optimistic picture for people who suggest that librarians need to "interpret" or "explain" search results that appear. At the extreme level, they want to know why this result appears as #1 vs #10.
If this is the standard required, no database or search engine will pass.
Take PubMed, where the algorithm is public and even the code is available. This has 100% transparency, but how many people can even understand the code? Worse yet, even with full understanding of the code, the reranking step is pure machine learning; it's basically a black box from the explanation point of view, even with the code available.
Even an expert, when asked why a result ranked #1 vs #10, could do no better than to say that if you run the math, that is what happens.
And this is a relatively straightforward lexical BM25 search; once you throw in semantic type searching based on semantic similarity matching, all interpretability and explainability go out of the window.
One way around this is to relax our requirements and say that even though PubMed relevancy ranking is not fully interpretable (despite being 100% transparent), it doesn't matter, because what is included/matched before ranking/reranking is strictly boolean (things like stemming and limited query expansion are close enough), which IS explainable, and that is all we need.
I mean, we could just ignore the relevancy scores and export everything to screen. I am not sure I am convinced by this, since systematic reviews sometimes do sort by relevancy and take the top K records.
Note that this argument requires Boolean matching. For example, SearchSmart has this criterion - "Longer strings, fewer hits" - and some academic databases fail it, including Naver Academic, Semantic Scholar, Scinapse, Scite, WorldCat and SciFinder. While this might not be a guarantee that they are using semantic or embedding-based matching, the results are probably not boolean and hence explainability plunges.
The other way is to bite the bullet and say we don't need explainability at all, and reproducibility of searches is all we need.
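To illustrate why strict Boolean matching passes a "Longer strings, fewer hits" style check, here is a small sketch with a toy inverted index. The documents and function names are hypothetical, and this is not SearchSmart's actual test procedure; it just shows that with AND, every extra term can only shrink (or keep) the result set, and every hit can be explained by pointing at the matching words.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return ids of documents containing every term (strict Boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Toy corpus (hypothetical titles).
docs = {
    1: "systematic review of search methods",
    2: "google scholar search coverage",
    3: "systematic search in google scholar",
}
index = build_index(docs)

hits_short = boolean_and(index, ["search"])
hits_long = boolean_and(index, ["search", "systematic"])
hits_longer = boolean_and(index, ["search", "systematic", "scholar"])
print(len(hits_short), len(hits_long), len(hits_longer))
```

A search that returns more hits when you AND in another term is, by construction, not doing this, which is why failing the criterion is a hint (though not proof) that something other than Boolean matching is going on.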
7. Google Scholar is just not reproducible
Okay, let's then fall back to the fundamental argument: Google Scholar searches are not reproducible. The idea here is that for a reproducible search, if you rerun the search multiple times (either at different times or from different locations) you will get the same result (all things being equal, aka no new items indexed).
There's also another concept of reproducible, in the sense that I could recreate the system because the code is available and rerun it to confirm, but that isn't a realistic option here.
What are some sources of irreproducibility here? Personalization? (Though Google Scholar is on record as claiming not to do that, unlike Google.) Some sort of AB testing? Some type of trade-off for efficiency? Who knows...
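The first sense of reproducibility (same query, same results) is easy to express as a check. Below is a minimal sketch: the two "engines" are stand-ins I made up, not real APIs, and a real test would also need to control for time, location and newly indexed items.

```python
def is_reproducible(search, query, runs=3):
    """Run the same query several times and check the ordered results are identical."""
    baseline = search(query)
    return all(search(query) == baseline for _ in range(runs - 1))

def stable_search(query):
    # Hypothetical engine: always returns the same ranked ids.
    return ["paper_1", "paper_2", "paper_3"]

class DriftingSearch:
    """Hypothetical engine whose ranking shifts on every call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, query):
        ids = ["paper_1", "paper_2", "paper_3"]
        shift = self.calls % len(ids)
        self.calls += 1
        # Rotate the list so each call returns a different order.
        return ids[shift:] + ids[:shift]

stable_ok = is_reproducible(stable_search, "information retrieval")
drifting_ok = is_reproducible(DriftingSearch(), "information retrieval")
print(stable_ok, drifting_ok)
```

Note that this check treats the engine purely as a black box: it tells you whether results are stable, not why.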
The literature on Google Scholar seems to suggest it fails this. Some examples
Variation in number of hits for complex searches in Google Scholar. (observed huge fluctuations in results over time)
Irreproducibility in searches of scientific literature: A comparative analysis (did time-synchronized searches at different institutional locations in the world and found Google Scholar results varied, but Scopus and PubMed did not. Web of Science Core Collection varied as well, but this was due to different institutional holdings. Yes, even Core Collection for Web of Science varies across institutions by year coverage, in case you didn't know!)
SearchSmart, which tests "Reproducible queries (time)", flags Google Scholar, WorldCat (article/chapter/thesis search) and CNKI as failing this.
It seems to me testing if a search is reproducible across time can be tricky, since once a new item is added to the index results will be different and not many search indexes allow you to search by indexed time.
Given that Google Scholar has a more unstable index (see #4), it might be seen (unfairly?) to be less reproducible than others.
I also wonder if this idea that the databases we use give reproducible searches is often just based on faith. I believe most of them do, but given most are not transparent and do not make all their code available, how would we know? I guess if we try to update systematic reviews we would have noticed if there was any major divergence (plus most databases we use are fairly stable).
Note also that a reproducible search does not require explainability.
You could have a search that always gave the same results to a query but you had no idea how it works or can't predict in advance or interpret post-hoc the results.
Arguably PubMed is in this boat, because while you could explain why a certain result appeared (because it's mostly strict boolean), you can't easily predict or explain the order, even with the formula.
As an extreme example, let's imagine something like Elicit.com that uses cosine similarity between embeddings of the query and documents to find and rank results, like one of the bi-encoders discussed here. It is constructed to have no non-deterministic factor and always gives the same results and ranking for the same query.

Say it acts like PubMed and writes a technical paper on how it works and even offers its code on GitHub.
Such a search would be perfectly reproducible (gives the same results for the same query) but not interpretable at all, as the answer to why a document matches a query with a high score is simply that, after running cosine similarity on the query and the document, the score is high!
Would that be an issue for systematic review librarians? I bet yes!
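Here is a toy version of that thought experiment. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and come from a trained model), and this is not a description of how Elicit or any real tool works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Rank documents by similarity, best first; ties broken by id so the order is deterministic."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical embeddings; the numbers carry no human-readable meaning.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

ranking = rank(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Run it as many times as you like and the ranking never changes, so it is reproducible. But the only "explanation" for why doc_a beats doc_b is the arithmetic itself; nothing in the vectors tells you which words or concepts caused the match.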
Conclusion
It seems in the end, what we fundamentally want is a search that is reproducible: if you run it multiple times you get the same exact order and items returned (assuming nothing new is added).
This in fact is a fairly safe assumption (according to SearchSmart) met by most search tools, even if we do not know the exact details of the search or don't have the code (lack of transparency).
I am particularly cynical about people yelling that X is bad because it is not transparent, when most databases we use are not fully transparent about the mechanisms under the hood either.
The issue of whether search results are explainable or interpretable is tricky.
On one hand, it seems to make sense to say that reproducibility (either the code is available and/or it gives the same results for the same query) is what we want, and that full explainability in terms of not just what is surfaced but how it is ranked doesn't matter.
On the other hand, saying that opens you up to search engines that are non-boolean, e.g. based on embedding vector search, so you don't even know why a certain result appears because it does not necessarily match the query terms! Surely that is not what you want to use for searches in a systematic review (or do you?).
So it seems safer to say we want a search that is not just reproducible but ALSO explainable, if not in the order results appear then at least in WHY they might appear in the result set!
And since explainability and interpretability drop like a rock once you move out of the boolean realm, ultimately what you want is boolean searches that are reproducible, plus the usual hygiene factors of large bulk exports and features that support precise, high recall searches!

