9 Comments
Alan's avatar

Might this have the potential of making open access articles more discoverable than closed? https://www.science.org/content/article/open-access-papers-draw-more-citations-broader-readership

Kukuh Noertjojo's avatar

Aaron, thank you for this analysis. In the old days, I was taught that searching Medline would cover over 90% of the published literature. With regard to AI search tools, are there any articles that explain the data sources that underpin them? Thanks again, Aaron, for always educating us non-search specialists on issues that are relevant to our work in evidence-based practice, especially in medicine.

Aaron Tay's avatar

Hi Kukuh

Just to be clear, I am not a health or medical librarian, so take my answers with a pinch of salt. I can't say I have heard of this 90% figure for published literature. But I guess if you mean "reputable" biomedical journals, it is probably right.

AI search tools, unless they are based on established databases, tend to be vague about their sources. But when they do name a source, it will typically be the Semantic Scholar corpus, OpenAlex, or their own proprietary source (which tends to mean they index free sources themselves).

Then you can look up papers that analyse sources such as OpenAlex and Semantic Scholar and how they compare to Scopus, Web of Science, PubMed, etc.

Though note that even if two AI tools, e.g. Elicit and Undermind, both claim to use Semantic Scholar, they may choose to index different subsets (e.g. exclude certain types), so the indexes they derive will not be identical.

In general, OpenAlex and Semantic Scholar are much broader sources than typical databases, but the quality is a lot more mixed.

Kukuh Noertjojo's avatar

Aaron, thank you! I will do that.

Aster's avatar

Thanks for sharing! I'm curious about how many Elsevier articles are also indexed in PubMed. I did a quick search in OpenAlex and found that about 7.5M Elsevier articles are indexed in PubMed (0.9M between 2022 and 2024). So roughly 33% (both for all years and for 2022-2024). My guess is that, to maintain the “indexing” in PubMed and their articles' discoverability, Elsevier needs to at least share abstracts with PubMed? If that is the case, then this portion of abstracts can stay open, and since Semantic Scholar covers PubMed records, in theory they will have this data. Just my ideal guess…

Aaron Tay's avatar

It's a good question. It might be possible, even likely, that PubMed would still be allowed to display the abstracts of non-OA Elsevier articles (I haven't checked).

My understanding is that what PubMed displays is a superset of the NIH Open Citation Collection (NIH-OCC), which probably won't have the abstracts for non-open-access Elsevier papers.

I guess an aggregator like OpenAlex could still scrape abstracts from PubMed, or even from Elsevier directly, but it would be asked to take them down if it did that.

I guess my point is that publishers will no longer allow aggregators of such content to use it or share it with others, such as AI-powered search engines.

Individual AI-powered search engines could try to build their own indexes by scraping sources that are allowed to show abstracts, such as PubMed, or even the publisher sites directly, but they run the risk of getting takedown notices from Elsevier etc. once they become prominent enough.

Aster's avatar

Totally agree. I checked one journal, The Lancet, limiting the search to RCTs and systematic reviews, and found 5.4k records. 4.3k have abstracts and only 700+ are “free articles”. So many non-OA articles still have abstracts in PubMed. I think at this point Elsevier may not be able to identify whether these AI-powered search engines are getting abstracts from PubMed or scraping them directly from its site. I'm also not sure whether PubMed would agree to remove certain abstracts if Elsevier later requested it.

Aaron Tay's avatar

I think Elsevier doesn't need to identify how OpenAlex etc. are getting the abstracts. They just have to see that it is happening and ask them to take the abstracts down, because Elsevier is the copyright owner of the works.

I've heard ideas that such systems could evade this issue by extracting abstracts from *preprint* versions. I'm not sure about the legality of that, or whether it would be desirable, because abstracts do sometimes (rarely?) change from preprint to published version.
