What can you actually find in an open citation index like OpenAlex? Implications for citation-based tools built on it
I recently came across "Automated citation recommendation tools encourage questionable citations" (first brought to my attention by this blog post), an exceedingly thought-provoking article about bias in discovery tools, particularly new ones that can suggest what to reference based on the text of a paper.
Leaving aside the issues such recommenders might bring, you don't have to think very hard to realize such tools will be limited by what their underlying sources include. A recommender can't recommend what it doesn't know about, after all.
How do open citation indexes affect discovery?
In fact, this is a more general issue. Even if you search traditionally by keyword with academic search engines like Google Scholar, Lens.org, or Semantic Scholar, or use tools like ConnectedPapers, ResearchRabbit, or Litmaps to browse citations manually or rely on algorithms that leverage citation networks to surface items, you run into the limits of the sources these tools use as a base.
In the case of most tools that I dub "citation-based literature mapping services", this source is an open citation index like OpenAlex, Semantic Scholar Academic Graph, Crossref/OpenCitations' COCI, or Lens.org.
Take the following question (which I often get):
"I know these citation indexes the tools rely on are pretty good with journal articles. But what about other publication types like Books or even law cases? Are they well covered? Can I find/be recommended them using such tools?"
My usual answer, without thinking too hard, is to agree that these tools won't recommend books very well, but that at least you can be sure journal articles are mostly covered.
But is this true?
Do they truly cover most journal articles and exclude books?
In fact, I think that to answer this question, we need to consider two factors.
Firstly, does the source include or index the item you are looking for?
Secondly, how well does the source's citation index/network cover the references to and from those items?
The first is more or less a given for most journal articles (particularly those with a DOI); the tricky problem is the second. For books, both may be an issue.
I'm still exploring this, but here are my first attempts at an answer.
Fact or Myth - Open citation indexes like OpenAlex or those based on Crossref are mostly complete when it comes to journal articles
I would say most of the open citation indexes currently in existence are drawn from Crossref and/or the now-defunct Microsoft Academic Graph (MAG) dataset. Some, like Semantic Scholar and OpenCitations, add their own sources (typically via partnerships with publishers or by extracting from other OA sources like PMC), but traditionally even those two rely substantially on the Crossref and MAG datasets as well.
So, let's look at the Crossref open dataset.
In a sense, if all you care about is items with Crossref DOIs, any system using Crossref metadata as a base would include these items by definition.
But the mere existence of a record tells you nothing about how complete its metadata is, which affects discovery.
In particular, if you are using ConnectedPapers, ResearchRabbit, Litmaps, or similar tools, the bigger question is how well the citation network between these items is captured in the sources used: a record could exist for an item and yet the item may still be missed because of missing citations.
I know the general narrative in our corner of Twitter is to praise the overwhelming success of I4OC (Initiative for Open Citations), particularly after Elsevier caved in 2020.
With articles like "A tipping point for open citation data" and my own pieces talking about the era of open citations, you might think that, at least for journal articles, citation links to and from them (particularly those in the Crossref DOI set) are mostly available.
The truth is: sort of. The answer is in fact a little unclear.
The first issue with the completeness of references/citations is the obvious observation that not all academic and scholarly content has Crossref DOIs, as noted by this Nature news story:
However, the opening up of Crossref articles’ citations doesn’t mean that all the world’s scholarly content now has open references. Although most major international academic publishers, including Elsevier, Springer Nature (which publishes Nature) and Taylor & Francis, index their papers on Crossref, some do not. These often include regional and non-English-language publications.
But in fact, even if we care only about articles with Crossref DOIs, we still have problems.
For example, some have been surprised by the following graph from a recent paper (Van Eck & Waltman, 2022) on the completeness of metadata fields in Crossref records.
[Figure: percentage of Crossref journal articles with references, by publication year, from Van Eck & Waltman (2022)]
You can see that across all publication years, only roughly 50% to 60% of journal articles have references.
How can this be, if I4OC successfully made 100% of the references deposited in Crossref open? As the story goes, when I4OC first started in 2017, only 1% of deposited references were open.
Part of the confusion, I suspect, is that some have a bit of a misunderstanding of what I4OC really achieved with Crossref.
When content owners submit items to mint DOIs in Crossref (content registration is the technical term), they usually submit metadata for the item, which includes title, author, publication type, and other fields, including references.
In the past, publishers such as Elsevier could, and did in fact, deposit reference lists for their items and then keep them closed.
What I4OC achieved was to lobby Crossref to make such deposits open and publicly available to all under a CC0 license. This was so successful, with almost all publishers agreeing, that by June 2022 open references became mandatory.
However, note that while Crossref members cannot keep their deposited references closed, there is no requirement for them to deposit references at all, and in fact some do not. Even among members who do deposit references, it's unclear how comprehensive the data they deposit into Crossref is.
Below, for example, is a Crossref participation report for Wiley.
[Figure: Crossref participation report for Wiley]
You can see that of Wiley's journal articles registered with Crossref DOIs, 79% have references.
Why isn't it 100%? There are many possible reasons: perhaps some of this content has no references at all (e.g. some editorials and letters), or perhaps some journals in the Wiley stable just didn't submit references for various reasons.
How many items have no references in Crossref / OpenAlex?
But how bad is this issue?
There has been some analysis of this on Crossref data in the past, but let's try it on the new OpenAlex, which has inherited both Microsoft Academic Graph (MAG) data and Crossref data. While MAG, being mostly based on web scraping, can compensate for the lack of references deposited in Crossref, it also includes many more content/publication types which may not have references.
You can do a rough estimate of the number of works in OpenAlex that
a) have a DOI,
b) are journal articles, and
c) have at least one reference
using the amazingly easy-to-use OpenAlex API.
Run the following:
https://api.openalex.org/works?filter=has_doi:true,type:journal-article&group-by=has_references
You should be able to easily work out what the API call does: the first part filters to works that have DOIs and are journal articles, and the group-by then counts how many of these works have any references. (See the sketch below for a scripted version.)
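If you'd rather script this than eyeball the raw JSON, here's a minimal sketch in Python (assuming the requests library is installed; the group_by response fields are as I understand them from the OpenAlex docs):

```python
import requests

# Query OpenAlex for journal articles with DOIs, grouped by whether they
# have at least one reference recorded.
resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "filter": "has_doi:true,type:journal-article",
        "group-by": "has_references",
    },
)
resp.raise_for_status()

# Group-by results come back as a list of {key, key_display_name, count}.
counts = {g["key"]: g["count"] for g in resp.json()["group_by"]}
with_refs = counts.get("true", 0)
without_refs = counts.get("false", 0)

print(f"{with_refs:,} works with references, {without_refs:,} without")
print(f"Share with at least one reference: {with_refs / (with_refs + without_refs):.1%}")
```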
This is the JSON output, nicely formatted (hint: use browser extensions like this one to help).
[Figure: formatted JSON output of the OpenAlex group-by query]
Of all works in OpenAlex with a DOI (presumably Crossref DOIs), 48,692,635 (~48.7M) works have at least one reference and 49,170,452 (~49.2M) works have no references.
That is, 49.7% of all journal articles with a DOI in OpenAlex have at least one reference.
You can try to improve this estimate by excluding paratext, which OpenAlex defines as "front cover, back cover, table of contents, editorial board listing, issue information, masthead", and run the same query with the is_paratext:false filter added:
https://api.openalex.org/works?filter=has_doi:true,type:journal-article,is_paratext:false&group-by=has_references
This improves the ratio of articles with references to 51.1%.
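In script form, the only change is the extra filter (again a sketch, assuming the is_paratext filter behaves as documented):

```python
import requests

# Same query as before, but with paratext excluded via is_paratext:false.
resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "filter": "has_doi:true,type:journal-article,is_paratext:false",
        "group-by": "has_references",
    },
)
counts = {g["key"]: g["count"] for g in resp.json()["group_by"]}
share = counts.get("true", 0) / sum(counts.values())
print(f"Share with references, paratext excluded: {share:.1%}")  # ~51.1% as of Oct 2022
```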
You may be wondering why this figure is a bit lower than what was found in Van Eck & Waltman (2022), where the figure is closer to 60% (though only for the later years). James Thomas suggests, correctly I think, that OpenAlex only records references to works indexed in OpenAlex (typically, but not always, those with Crossref DOIs). Since Crossref lists all reference strings, even those that don't match any indexed Crossref work, this might explain the higher figure obtained by using Crossref directly rather than OpenAlex, which is a mix of various sources including Crossref and, formerly, MAG.
Also, an earlier analysis done directly on Crossref in 2021 suggests only 45% of items have references. See also similar results from analyses by COKI and Bianca Kramer.
Overall, estimates of works with no references seem to range from 40% to 55%, which sounds surprisingly high. Is this affecting the tools that rely heavily on the citation network?
The tricky thing here is that we suspect this 40-55% is an overestimate.
This is because even non-paratext in OpenAlex includes "research paper, dataset, letters to the editor, figures", some of which obviously won't have references at all.
Still, it's exceedingly difficult to estimate the real percentage of articles that should have references but whose references are not reflected in Crossref, Lens.org, and similar indexes.
We know for sure that some publishers (usually the smaller ones) don't submit any references at all, so presumably most of theirs are simply missing; but even for publishers who do submit references, we don't know whether they are depositing as many as they could.
Again, Van Eck & Waltman (2022) provide a nice overview of publishers versus the % of journal articles with references, as of Feb 2020.
[Figure: % of journal articles with references by publisher, from Van Eck & Waltman (2022)]
You can see that some small and mid-sized publishers, like APA and Project MUSE, clearly do not deposit any references. (IEEE is shown at 0%, but its references were opened later; as of Oct 2022, the figure stands at 75%.) There are others, like Springer Nature and ACS, where around 90% of journal articles have references, and we can guess that most of their remaining items without references truly have none.
However, many publishers, like OUP and Bentham, are somewhere in between, and it's harder to tell what is happening there.
Perhaps one feasible way to get a sense of how much is really missing is to do a sample test. Yet another is to compare with a commercial citation index like Web of Science or Scopus, which presumably extract references directly from full text or from publishers and do not rely on publishers depositing to Crossref.
But even if we could estimate the true % of missing references, it's hard to tell how much better tools based on citation networks would perform with those references included.
Fact or Myth - Open citation indexes like OpenAlex or those based on Crossref do not index much book content, and systems built on them that recommend items based on citation links are unlikely to recommend or surface books
For ConnectedPapers, ResearchRabbit, Litmaps, and similar tools, the obvious thinking is that books, particularly many older classic social science or humanities books (say, pre-2000) like Benedict Anderson's Imagined Communities, would not even be indexed, as they would not have DOIs. That's without even wondering how well the references in books are included in the indexes!
While citation indexes do not index most classic, famous books (aka non-indexed works), the references of the items they do index would include citations to such books. However, I think it is unlikely that systems like ConnectedPapers, ResearchRabbit, or Litmaps will or can point to such non-indexed works.
Still, how true is the idea that OpenAlex doesn't index books well?
Using the OpenAlex API again, we run the following call:
https://api.openalex.org/works?filter=type:book|monograph
As of writing in Oct 2022, OpenAlex has over 5 million books, and 24 million works if you include book chapters.
Of these 5 million books, 3.6 million have no DOIs. Sampling these, I found most have MAG IDs and were inherited by OpenAlex from MAG.
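You can reproduce these counts with the meta.count field of plain filter queries; a rough sketch, assuming the has_doi filter works as documented:

```python
import requests

def count_works(filter_str: str) -> int:
    """Return the total number of OpenAlex works matching a filter."""
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"filter": filter_str, "per-page": 1},  # we only need meta.count
    )
    resp.raise_for_status()
    return resp.json()["meta"]["count"]

total = count_works("type:book|monograph")  # the | means OR in OpenAlex filters
no_doi = count_works("type:book|monograph,has_doi:false")
print(f"Books/monographs: {total:,}; of these, {no_doi:,} have no DOI")
```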
Here's a fairly recent book with no assigned DOI that is nonetheless indexed in OpenAlex.
[Figure: OpenAlex record for a recent book with no DOI]
There isn't a lot in the record, but the work is indexed, and that is sufficient for it to be recommended/surfaced in systems like ResearchRabbit (which I verified).
That said, I sample-tested fewer than 10 classic social science texts and many were missing; more testing is needed here.
The metadata in systems like this is not the cleanest: you can often find book review articles in OpenAlex with the exact same name as the monograph title. Often, but not always, these are correctly labelled as journal articles; in any case, you should check carefully to see what is being referred to. Then again, might surfacing a review article of a book be almost as good as surfacing the book itself?
So, it's certainly not true that tools like ResearchRabbit will never surface books.
Grouping by year of publication, you can see OpenAlex has book/monograph records spanning back to the 1800s (500+ a year)! Sampling suggests a lot of the oldest ones are non-English.
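Here's a sketch of that grouping, again assuming the publication_year group-by returns {key, count} pairs as in the earlier queries:

```python
import requests

# Group book/monograph records by publication year to see how far back they go.
resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "type:book|monograph", "group-by": "publication_year"},
)
groups = resp.json()["group_by"]

# Print the ten earliest years with records, oldest first.
years = [g for g in groups if str(g["key"]).isdigit()]
for g in sorted(years, key=lambda g: int(g["key"]))[:10]:
    print(f"{g['key']}: {g['count']:,} records")
```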
All in all, I'm surprised by how many book records OpenAlex has. Leaving aside the possibility that these are due to Microsoft Academic Graph, an advantage that will diminish with time since OpenAlex doesn't have the crawlers Microsoft had, 5 million records makes it much bigger than Scopus (which now indexes books) and Web of Science's Book Citation Index. Both are probably in the hundreds of thousands range.
That said, defenders of Clarivate's Web of Science and Elsevier's Scopus will no doubt point out that while their products probably index fewer books, they better capture the citation relationships from the books they do index.
That is almost certainly correct.
Of the 5 million or so books in OpenAlex, only about 109k records have references; that is only around 2%! (It's slightly lower if you remove works with DOIs.)
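The same group-by pattern gives this figure directly (again a sketch, under the same assumptions as the earlier snippets):

```python
import requests

# Group books/monographs by whether they have any outgoing references.
resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "type:book|monograph", "group-by": "has_references"},
)
counts = {g["key"]: g["count"] for g in resp.json()["group_by"]}
with_refs = counts.get("true", 0)
print(f"Books with at least one reference: {with_refs:,} "
      f"({with_refs / sum(counts.values()):.1%})")  # ~2% as of Oct 2022
```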
Of course, not all books have references, and book reference styles are harder to extract, I would guess, but 2% feels too low. Web of Science and Scopus book records definitely have a much higher percentage of references included.
As such, if you care a lot about citations FROM books, OpenAlex probably does worse than the other two, despite having more book records indexed.
Of course, if all you care about is books, Google Scholar has all of these beat, which isn't surprising since it indexes Google Books content.
The closest competition for books, beyond those mentioned, is the very new Fatcat by the Internet Archive; more on this later.
Conclusion
There are still a lot of questions about how much indexes like OpenAlex cover.
But by playing around with the API, we can see that for journal articles, roughly 40-55% of indexed works have no references at all. It's unclear, though, how much of that 40-55% reflects truly missing references rather than items that genuinely have none.
In fact, even if an article has references deposited in Crossref, that doesn't mean all of its references were extracted, or even extracted correctly.
One of the things I like to do is go to various discovery services and citation indexes and look up papers authored by me. How well do these citation indexes include the references I made? Interestingly, there can be big differences. One major difference is whether these tools list references to unindexed works... Something for another post.
It is even harder to try to figure out what this means in terms of discoverability.
In terms of books, the problem is that many books aren't even indexed in the first place!
While such books may be cited by indexed works in OpenAlex, they won't be surfaced by systems relying on these indexes.
That said, we do find a surprisingly substantial number of books indexed in OpenAlex, around 5 million, and these works might even be well cited. But the quality of their indexing is poor, and only about 2% of them have references.
How do we improve the indexing of books and other items in open citation indexes?
The people at OpenCitations have a comprehensive piece on how to improve on academia's missing references.
For improving book coverage, I have seen ideas like crawling references and creating entries by linking to book sites like Open Library. Wikidata IDs could be another source for books.
In particular, Refcat from the Internet Archive has over 20 million book records linked to Open Library.