Comparing disciplinary coverage of 56 databases, measuring ORCID adoption, measuring FAIRness in repositories and more
Note: This is based on my contribution to the Upstream blog (plus some major additions) - Aaron Tay is Keeping Tabs on Open Research
I tend to find interesting articles via Twitter or by following the references of such articles. For articles that look potentially interesting, I will usually put the link in Google Keep and tag it with “Professional Development”, a tag I try to clear every week.
I’ve been skimming some of these tagged links recently, and these are some that I found really intriguing and that have earned a place in my browser tabs because I can’t bear to close them, or because I am still thinking about the implications of what I read.
The articles are
1. Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases
2. Metadata Life Cycle: Mountain or Superhighway?
3. Measuring research information citizenship across ORCID practice
4. On NYT Magazine on AI: Resist the Urge to be Impressed
1) Gusenbauer, M. (2022). Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases. Scientometrics, 1-63.
As a practising academic librarian who fancies himself an observer of academic discovery trends, I try to keep abreast of the academic literature on the topic. Whether it is articles comparing the coverage of indexes (e.g. Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations' COCI: a multidisciplinary comparison of coverage via citations), articles that provide technical detail on discovery tools of interest (e.g. scite: A smart citation index that displays the context of citations and classifies their intent using deep learning), or just articles trying to understand academic user behavior, I will try to read them in detail and understand them in the context of other papers.
Hot off the press this month (May) is Michael Gusenbauer's Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases, which is a must-read if you have interests similar to mine.
It's a long, sprawling 62-page paper, but don't let that scare you off.
For most of those 62 pages, Michael goes into great detail on his newly proposed methodology, which he calls the BOK (Basket of Keywords) method, to estimate the absolute and relative coverage of 56 different bibliographic databases.

Why read this?
We know from multiple studies that Google Scholar has by far the largest absolute coverage, and in this study we find that across the 26 subject categories, Google Scholar is indeed ranked 1st in 19 of them and in the top 3 for the remaining categories.
But this is for absolute coverage. In terms of relative coverage by the subject Computer Science (COMP), the #1 spot is claimed by ACM Digital Library (ACM) (See Table 5), despite the fact that it is only 16th by absolute coverage (See Table 4), since most of the records in ACM Digital Library are in Computer Science.
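To make the distinction concrete, here is a toy calculation with made-up numbers (mine, not figures from the paper): absolute coverage is the estimated number of records a database holds in a subject, while relative coverage is that number as a share of the database's total size.

```python
# Toy illustration of absolute vs. relative coverage (made-up numbers, not from the paper)
databases = {
    # name: (estimated Computer Science records, estimated total records)
    "HugeMultidisciplinaryDB": (8_000_000, 300_000_000),
    "SmallSpecialistDB": (2_500_000, 3_000_000),
}

for name, (subject_records, total_records) in databases.items():
    absolute = subject_records                  # how many CS records you could potentially find there
    relative = subject_records / total_records  # how concentrated the database is in CS
    print(f"{name}: absolute = {absolute:,}, relative = {relative:.0%}")

# The huge database wins on absolute coverage (8M vs 2.5M records),
# but the specialist database wins on relative coverage (~83% vs ~3%).
```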
Up to now, there has been no easy way to determine absolute or relative coverage across databases, particularly for huge multi-disciplinary databases (e.g. Lens.org, Google Scholar, Dimensions, Crossref). This paper claims to provide a method and applies it to 56 databases.
I can imagine that some people, including librarians, are not too interested in the technical details and just want the results. For them, the meat of the paper is in
Table 1 - Which lists the 26 subjects of the ASJC category (pg. 2690)
Table 2 - Which lists the 56 databases covered (pg. 2692-2699)
which can be read together with
Table 4 - Which ranks the absolute coverage of the databases in the 26 subjects (pg 2708-2710)
Table 5 - Which ranks the relative coverage of the databases in the 26 subjects (pg 2715-2717)
You also won't want to miss Figure 1 (pg 2711) and Figure 2 (pg 2718).
In general, you will need the abbreviations for the subjects and databases from Table 1 and Table 2 respectively to interpret Table 4 and Table 5, e.g. DECI = Decision Sciences, PSY1 = APA PsycInfo (via Ovid).
The section "Search advice for each major academic search type" (pg 2725) also provides very practical advice to researchers on when and how to use relative and absolute coverage figures, as well as other considerations, to decide which databases to use.
In particular, they distinguish between the following major use cases:
Exploratory searching
- Prefer databases with high relative coverage in the discipline (Table 5) if the topic sits within just one discipline
- Prefer databases with high absolute coverage (Table 4) if the topic is multidisciplinary
- Also consider search and browse features (see Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources)
Systematic searching
- Prefer specialized databases with high relative coverage (Table 5) and/or multidisciplinary databases with high absolute coverage (Table 4)
- Good subject/controlled vocabulary features and citation searching are also recommended (see the same paper on search systems for systematic reviews linked above)
The method
I personally spent several hours on Friday and Saturday night slowly reading the paper, particularly the methods, since I was really curious.
The paper proposes using
"query results as a common denominator to compare a wide variety of search engines, repositories, digital libraries, and other bibliographic databases."
Given that this paper covered 56 databases, including
Preprint servers and aggregators of repositories and OA journals - e.g. arXiv, Bielefeld Academic Search Engine, CORE, OpenAIRE, DOAJ
Traditional subject indexes - e.g. APA PsycInfo (both via Ovid and EBSCOhost), CINAHL Plus, EconLit
Traditional citation indexes - e.g. Scopus, Arts & Humanities Citation Index (via Web of Science)
Publisher platforms - e.g. Wiley Online Library, SpringerLink
Huge aggregators - e.g. Google Scholar, Lens.org, Semantic Scholar, Crossref, Microsoft Academic
all of which vary substantially in terms of search functionality, records covered and size, I was skeptical that any single sampling method based on keyword searching could produce a fair estimate of coverage. For example, the now defunct Microsoft Academic didn't even technically support Boolean!
That said, after several readings, I think he may have something here. Below is my attempt to explain how the method works and how it is validated.
Firstly, the author selects a basket of keywords (BOK): 14 for each of the 26 subjects of Scopus's All Science Journal Classification (ASJC) system.
The paper goes into detail on how the keywords are chosen (basically keywords that have high recall for the subject AND are not shared much between subjects - akin to high term frequency and high inverse document frequency), but I think it makes more sense if I first explain the rough idea behind the method.
The keywords used are unigrams, which should hopefully behave similarly across all 56 databases, as opposed to bigrams etc., which may be interpreted differently by different search engines.
Scopus is used as the reference database, given that all its articles are already classified into the 26 ASJC subjects.
Say a keyword for the subject "Physics and Astronomy" is "nanotechnology". Imagine that of all the papers classified as Physics and Astronomy, 1% have nanotechnology in the title.
(In fact, the paper does adjust for the fact that in Scopus an article can have multiple subjects attached, but let's ignore that for now.)
The simplistic, naive way to use this is to run the same title keyword search in each of the 56 databases and look at the number of results, say 500. Then, assuming the recall is the same for all databases, the estimate is 500 / 1% = 50,000 papers in that subject for that database.
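To show the arithmetic, here is a minimal sketch of that naive single-keyword estimate using the toy numbers above; it is my own illustration of the idea, not code from the paper.

```python
# Naive single-keyword coverage estimate (toy numbers, my own illustration of the idea)

# From the reference database (Scopus): of all papers classified under
# "Physics and Astronomy", 1% have "nanotechnology" in the title.
recall_in_scopus = 0.01

# From the target database: a title search for "nanotechnology" returns 500 hits.
hits_in_target_db = 500

# Assuming the keyword's recall is the same in the target database,
# the estimated "Physics and Astronomy" coverage of the target database is:
estimated_subject_coverage = hits_in_target_db / recall_in_scopus
print(f"Estimated records in subject: {estimated_subject_coverage:,.0f}")  # 50,000
```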
Of course, if this were all there was to it, the paper would not be 60+ pages long!
What the paper does is far more sophisticated.
Firstly, we are using not 1 keyword per subject but 14, which helps smooth out outliers. As such, the study collects 14 keywords × 26 subjects × 56 databases worth of query counts.
Secondly, it recognises that not all keywords are equally good estimators. This is reflected in their precision scores. For example, the "nanotechnology" keyword above might find records of which 75% are indeed classified in the subject "Physics and Astronomy", while another keyword might have a precision of only 30%. This should be accounted for.
The preferred method was to search for the keyword in the title, but because this wasn't possible in all databases, they also collected Scopus query counts for "Abstract only" and "All metadata" as well as "Title" for those cases. This results in 14 keywords × 26 subjects × 3 field types (title only, abstract only, all metadata) worth of recall and precision scores.
Thirdly, it exploits the fact that you can map Scopus ASJC categories to the 254 subject categories in Web of Science, so this can be used as a control to see how well the proposed method does.
Given that they have Web of Science data to compare against (remember, they map ASJC categories to WOS categories, so they have a gold standard/ground truth for comparison), they tried different statistical models to combine the data into a reasonable estimate.
While it seems simple to just drop keywords with poor precision (since on paper they would be poor estimators), there's a trade-off between that and having fewer sample keyword query results to estimate from.
They finally settled on a "median of medians" method: they calculate the estimated coverage (itself based on a median) repeatedly, each time including only keywords whose precision is above a threshold, sweeping that threshold from above 17%, above 18%, above 19% and so on up to above 30%. The median of this range of estimates is used as the overall estimate.
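As I understand it, the procedure looks roughly like the sketch below: for each precision threshold from 17% to 30%, keep only the keywords above that threshold, take the median of their per-keyword coverage estimates, and then take the median of those medians as the final figure. This is my own simplified reading of the method, not the author's code.

```python
from statistics import median

def median_of_medians_estimate(keywords):
    """Rough sketch of the 'median of medians' idea as I understand it.

    `keywords` is a list of (precision, coverage_estimate) pairs for one
    subject in one database, where each coverage_estimate is the naive
    per-keyword estimate (hits / recall) shown earlier.
    """
    threshold_estimates = []
    for threshold in range(17, 31):  # precision cut-offs: above 17% ... above 30%
        kept = [est for precision, est in keywords if precision > threshold / 100]
        if kept:
            threshold_estimates.append(median(kept))
    return median(threshold_estimates)

# Toy data: (keyword precision, naive coverage estimate) for one subject/database pair
toy_keywords = [(0.75, 52_000), (0.40, 61_000), (0.30, 48_000),
                (0.22, 80_000), (0.18, 45_000), (0.15, 150_000)]
print(median_of_medians_estimate(toy_keywords))
```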
Lastly, they normalize the results above against the absolute database size (which itself is not always easy to get). While this procedure does not affect relative subject coverage, it may address systematic over- and under-estimations.
The results and validation
The paper also goes into great detail on how it tries to measure the validity of the results.
In terms of internal validity, they compared the same database across different platforms, e.g. MEDLINE via PubMed, Ovid, Web of Science, and EBSCOhost, and APA PsycInfo via both Ovid and EBSCOhost, and found high correlations.
The estimation method is expected to have issues when estimating the coverage of specialized databases whose subjects are narrower than the ASJC subject categories. Their calculation of the relative interquartile range (RIQR), which indicates the "level of homogeneity of the underlying estimates", finds this to be true for ERIC and SPORTDiscus, and for the Arts & Humanities Citation Index, whose content is obviously very diverse and difficult for keywords to capture.
As mentioned above, they ran not only title searches but also abstract searches and all-metadata searches. As expected, title searches were the most accurate. Still, the results were highly correlated, with r = 0.980 between title and abstract and r = 0.985 between title and all fields.
They also tested the impact of stemmed queries in Web of Science and Scopus (r = 0.999), though the accuracy of the estimate decreased when using stemmed queries.
In terms of external validity, as mentioned earlier they could map ASJC subjects to WOS subjects and use this as a control.
The method used in this study (i.e., the most restrictive field code and restrictive, verbatim queries) produced a mean accuracy of the WOS estimate of ±19.6%, with a maximum deviation of 46.6%.
In general, the results produced by the BOK method are also plausible. Take CINAHL Plus: even though it is estimated to have only 16% nursing content, it is still #1 by relative coverage in nursing. Why isn't this figure higher? Simply because the keywords used are considered in Scopus to belong to medicine rather than nursing; but because this is consistent across all sampled databases, CINAHL Plus still comes out #1 by relative coverage, as we would expect.
Limitations
Finally, note the limitations of this method.
The biggest limitation is that while BOK provides a fairly reliable way of classifying content into the 26 ASJC subjects, it makes no judgement on the quality of the content. Neither does it put a restriction on the type of content. If you believe the largest databases, namely the following (mostly easily over the 120 million record mark)
Google Scholar
Bielefeld Academic Search Engine
Microsoft Academic
Lens.org
CORE
Semantic Scholar
OpenAIRE
Crossref
Dimensions
are mostly surfacing poor-quality publications (preprints, predatory journals) or even non-relevant item types (e.g. Google Scholar and Microsoft Academic often surface library guides), this method doesn't provide a solution to that concern.
Also, because the current study uses only English keywords, it can only detect English-language records.
Lastly, the method may be sensitive to duplicate records. Web of Science and Scopus are known to have lower duplication error rates than, say, Google Scholar, which may therefore have its absolute coverage overestimated. Still, as the paper notes, the only databases close to Google Scholar in coverage (such as Bielefeld Academic Search Engine, Microsoft Academic, Lens.org, Semantic Scholar, and CORE) are likely to have the same issue, if not worse.
In fact, from the results, I was mildly surprised at the performance of Bielefeld Academic Search Engine, which was runner-up in 11 categories and 1st in 2 categories (NEUR, DECI).
This is almost as good as Microsoft Academic (considered by many to be the 2nd biggest), which was runner-up in 7 categories and 1st in three categories (ENER, HEAL, DENT).
My understanding is that CORE and Bielefeld Academic Search Engine roughly do the same thing in terms of indexing open access repositories, so one wonders if the difference is due to deduplication, with the latter perhaps being less aggressive in identifying duplicates, leading to higher absolute coverage.
Of course, it may also reflect Bielefeld Academic Search Engine really having more content!
2) Habermann, T. (2022, March 7). Metadata Life Cycle: Mountain or Superhighway? [Blog]. Metadata Game Changers.
I have been slowly working my way through Ted Habermann's blog posts, which often cover parts of his published research that I am excited to read, but this is the post that really made me sit up and take notice.
Habermann shows that most data repositories are minting DOIs with only the minimal set of metadata that needs to be deposited in DataCite to get a DataCite DOI. The fact that there isn't much metadata here is perhaps only somewhat surprising, as I know many repository managers are of the view that making too many fields mandatory will put users off depositing.
More astonishing to me is his finding that institutional repositories do indeed have substantial additional metadata, entered by the researcher when depositing or enhanced by other means, but these metadata simply do not make the climb up what he calls the “Metadata Mountain” into DataCite; they fall by the wayside.
Using the DRUM repository at the University of Minnesota as an example, he identified many metadata fields that had data collected by the repository but were not included or translated into the DataCite schema. These included fields such as abstract, keywords, funder, etc.
Of course, all these fields affect the FAIRness (findability, accessibility, interoperability and reusability) of the datasets. By working to extract such data from the records and updating the DataCite metadata, he managed to increase the completeness of these fields from 15% to 32%.
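Out of curiosity, here is a rough sketch of how a repository manager might spot-check which of these fields actually made it into the DataCite record for a given DOI, using the public DataCite REST API. The choice of endpoint and of the three fields checked is my own assumption about what to look at, not Habermann's code.

```python
import requests

# A few DataCite schema properties (as exposed by the REST API) that often get lost
# on the way from the repository to DataCite -- my own choice of examples.
FIELDS_TO_CHECK = ["descriptions", "subjects", "fundingReferences"]

def datacite_field_completeness(doi: str) -> dict:
    """Return which of the checked fields are populated for one DataCite DOI."""
    resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    resp.raise_for_status()
    attributes = resp.json()["data"]["attributes"]
    return {field: bool(attributes.get(field)) for field in FIELDS_TO_CHECK}

# Usage (hypothetical DOI -- replace with one minted by your own repository):
# print(datacite_field_completeness("10.12345/example-dataset"))
```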
There’s a lot more to his work, e.g. his approach to quantifying and visualizing FAIRness by mapping DataCite fields into four categories on a radar chart, and I intend to follow up on Ted’s recent papers, but I’m struck by the practical implications here for institutional repository managers. Why are we collecting metadata but not exposing it properly?
This is particularly so given that the blog post suggests this issue isn’t as prevalent in subject repositories such as Zenodo, Dryad and Dataverse.
On reflection, this issue of repositories capturing data and exposing it only in human-readable form, or at least in formats that are not harvested by the relevant machines, is not a new problem.
I have encountered similar issues in the past with our own repository. In one past incident, the repository manager had painstakingly added version information to each paper deposited in the repository, but despite that it was not picked up by the Unpaywall crawler. Another related incident involved trying to figure out why data citations from datasets in our data repository were not appearing via Scholix, despite us supposedly providing the information when creating the metadata at the time of deposit.
Perhaps we have a tendency to assume that if we as humans can see the metadata in the repository’s displays, our job is done.
3) Porter, S. J. (2022). Measuring research information citizenship across ORCID practice. Frontiers in Research Metrics and Analytics, 7, 779097.
More metadata fun! This time a study on ORCID. I was recently asked about Singapore’s efforts on a PID strategy. While I know very little about that, it did set me thinking. What was the ORCID penetration rate in Singapore and in my institution? How did we compare to others?
This is where this paper is timely. To determine ORCID use by authors, one needs a benchmark or gold standard to compare against, and this is where Dimensions comes in.
I intend to look closer at the methodology, but the essential point here is that this study measures ORCID use in two dimensions.
Firstly, there is what is called ORCID adoption, defined as
“the percentage of researchers in a given year who have at least one publication with a DOI linked to their ORCID iD either in ORCID directly or identified within the Crossref file”
However, a researcher might create an ORCID only because they are required to do so when submitting to journals or when submitting funding proposals to funders, and otherwise ignore the ORCID profile. As such, the study also measures the completeness of the ORCID profile, which they argue measures engagement.
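To make the two metrics concrete, here is a toy sketch of how one might compute adoption, and a simple completeness-style engagement measure, from a table of researcher-publication records. The column names and the exact completeness definition are my own simplifications for illustration, not the paper's implementation (which is built on Dimensions data).

```python
import pandas as pd

# Toy researcher-publication table for a single year; the columns are invented for illustration.
pubs = pd.DataFrame({
    "researcher_id":       ["r1", "r1", "r2", "r3", "r3"],
    "doi_linked_to_orcid": [True, False, False, True, True],   # DOI linked to the researcher's ORCID iD
    "doi_on_orcid_record": [True, False, False, False, True],  # DOI also present on the ORCID record itself
})

# Adoption: share of researchers with at least one publication DOI linked to their ORCID iD.
has_linked_doi = pubs.groupby("researcher_id")["doi_linked_to_orcid"].any()
adoption = has_linked_doi.mean()

# A crude completeness/engagement-style measure: of each researcher's DOIs,
# what share also appears on their ORCID record (i.e. the profile is actively maintained)?
completeness = pubs.groupby("researcher_id")["doi_on_orcid_record"].mean().mean()

print(f"Adoption: {adoption:.0%}, mean completeness: {completeness:.0%}")
```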
The paper is particularly interesting since they provide evidence on adoption and engagement using these metrics across
Countries
Research Categories
Funders
Publishers
The author of the paper, Simon Porter, even indulged me when I asked for a breakdown of Singapore institutions! Since then, he has created a publicly accessible dashboard based on the data in the paper for all to try out.
Feel free to filter by country, institution, institution type or field of result.

Dashboard of institutions by ORCID adoption and completeness
All in all, the paper has a lot of interesting findings and implications that are worth looking at. Some are even, on the surface, surprising. For example, they find that Humanities research tends to have lower adoption but higher engagement than, say, Medical and Health Sciences.
Part of this can be explained by the finding that while most publishers do accept ORCIDs, many still accept an ORCID for just one author. As Humanities research tends to have fewer authors per paper than medical research, this partly explains why, on average, the ORCID profiles of humanities researchers tend to be more complete.
4) Bender, E. (2022, April 18). On NYT Magazine on AI: Resist the Urge to be Impressed [Medium]. Emily M. Bender.
I’ve been reading about and playing with language models like GPT-3 for the last 2 years. More recently tools like Elicit.org are starting to employ language models for academic research and have set out their roadmaps here in Elicit: Language Models as Research Assistants.
The results for Elicit.org are quite impressive given it is a very early-stage tool, but what pitfalls are there? Are we too eager to employ such technologies? Can they really be used to support systematic reviews etc.? I know one of the problems of language models is that they can make up answers or “hallucinate”, but is that the only issue? See also - How to use Elicit responsibly.
I know there is a fairly recent technical paper - On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? - that heavily critiques language models and provides an overview of the risks of using them, that is making the rounds, but I haven’t had the time to look at it, and I fear it may be overly technical for me.
This is where I came across the piece entitled “On NYT Magazine on AI: Resist the Urge to be Impressed”. Given this post is by Emily Bender, one of the co-authors of the Stochastic Parrots paper, I suspect it will provide the viewpoint of an expert who is very skeptical and critical of language models.
This Medium post is itself a response to the NYT piece “A.I. Is Mastering Language. Should We Trust What It Says?”, which is worth a read.
5) Domain Repositories Enriching the Global Research Infrastructure. (2022, April 20). [Youtube].
I’m going to cheat by ending with yet another reference to Ted Habermann: this is a talk he gave recently that covers a good portion of his recent work.
It’s a fascinating talk with many ideas, including “identifier spreading” - the idea of enhancing repository records with identifiers by making reasonable inferences. E.g., a dataset record might not have ORCIDs in the data repository, but the dataset might be referenced by a journal article which itself has ORCIDs, so one can infer that the dataset’s creators share those ORCIDs (see the rough sketch below). You can do the same with affiliations/ROR.
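Here is a rough sketch of the identifier spreading idea as I understood it from the talk; the data structures, field names and the naive name matching are hypothetical simplifications on my part, not Habermann's implementation.

```python
# Rough sketch of "identifier spreading" (hypothetical data structures, naive name matching).

# A dataset record in a repository, missing ORCIDs for its creators.
dataset = {
    "doi": "10.12345/dataset-1",
    "creators": [{"name": "Jane Smith", "orcid": None}],
}

# A journal article known (e.g. via Scholix/Crossref links) to reference the dataset,
# and which carries ORCIDs for its authors.
citing_article = {
    "doi": "10.12345/article-1",
    "authors": [{"name": "Jane Smith", "orcid": "0000-0000-0000-0001"},
                {"name": "John Doe", "orcid": "0000-0000-0000-0002"}],
}

def spread_orcids(dataset, article):
    """Copy ORCIDs from article authors to dataset creators with matching names."""
    orcid_by_name = {a["name"]: a["orcid"] for a in article["authors"] if a["orcid"]}
    for creator in dataset["creators"]:
        if creator["orcid"] is None and creator["name"] in orcid_by_name:
            creator["orcid"] = orcid_by_name[creator["name"]]
    return dataset

print(spread_orcids(dataset, citing_article))
```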
How difficult is it to map repository affiliations to ROR? Many institutional repositories will have an easier time because they only need a small number of affiliations. A few general and subject repositories like Dryad/Zenodo will have a much more difficult task.
How do you measure the completeness of identifiers/connectivity? Ted proposes a metric and a visualization. He also talks about measuring FAIRness and visualizing DataCite metadata completeness, as mentioned earlier.

