Open Access Indicators and the importance of optimising your Institutional repository for discovery in Unpaywall

https://www.leidenranking.com/ranking/2020/list
Recently the CWTS Leiden Rankings for 2020 were released and, like many institutional rankings, they used common research indicators such as
PP(top 1%) - the proportion of a university's publications that, compared with other publications in the same field and in the same year, belong to the top 1% most frequently cited.
More importantly, it included less commonly seen indicators like Open Access Indicators and Gender Indicators. In particular:
PP(OA) - proportion of open access publications of an institution
PP(Green OA) - proportion of Green open access publications of an institution
PP(Gold OA) - proportion of Gold open access publications of an institution
While such OA indicators look simple on the surface, they can be sensitive to a multitude of factors including:
Definition of OA used
Index used to select institutional output (e.g. Web of Science, Scopus, Microsoft Academic)
Method used to dedupe and combine affiliations (also affected by the cleanliness of the index)
Years of coverage used (e.g. all years or 2014-2017)
Date the study was done (a study run later on the exact same pool of articles will have higher OA rates due to the increased amount of time for papers to be made open)
Recently, we have started to see excellent papers like
Comparison of bibliographic data sources: Implications for the robustness of university rankings
Evaluating institutional open access performance: Sensitivity analysis
that try to disentangle the effects of using different indexes at the institutional and regional level when ranking institutions using citations and OA indicators.
Yet these analyses, like many other papers that measure OA levels, rely on the same OA detection method - the free and excellent Unpaywall - to determine the OA status of an article.
For those of us who are interested in ensuring accurately measured OA levels for our institutions, it is therefore important to ensure that Unpaywall accurately and reliably indexes our institutional repository contents, which are presumably a large source of OA papers for our institutions.
In this blog post, I will talk about how Unpaywall rose to become one of the most important sources of truth for determining the OA status of outputs, and I will share my personal experience verifying the accuracy of Unpaywall for my institution's output.
As it turned out, Unpaywall did not properly index my institutional repository's content, and by working with them on this issue, I increased the proportion of open access publications of my institution - PP(OA) - as measured by Unpaywall from 32% to 80%!
I also incidentally managed to answer a question that was on my mind about the value add of an institutional repository in terms of the uniqueness of its content.
If you are already familiar with Unpaywall, jump directly to the section on my experience testing and optimising our institutional repository for Unpaywall indexing.
A brief history of OA measurement research and the important role Unpaywall plays today
Until recently, research covering global OA levels was rare, basically because it was not easy to obtain OA status data in bulk.
Unfortunately, Google Scholar, which had and probably still has the best coverage of open access, is hampered by the lack of an API, which meant it could not be used for academic research to determine OA levels, except by limited manual sampling. While the study Evidence of Open Access of scientific publications in Google Scholar: a large-scale analysis showed that it was possible to scrape Google Scholar results in bulk, it required pretty sophisticated and tedious attempts to bypass CAPTCHAs.
As such, for a comprehensive study, you would need to sink quite a bit of resources into building your own crawler to look for open access copies on the web. A good example of this was the report Proportion of Open Access Peer-Reviewed Papers at the European and World Levels—2004-2011, commissioned by the EC; the crawler technology used there was eventually used by Eric Archambault to found the company 1Science, which provided search indexes like 1Findr (formerly Oafindr) before being acquired by Elsevier.

1Science - web crawling technology that was acquired by Elsevier
Unpaywall - a game changer
A new era in measuring OA rates began with the launch of Unpaywall in 2016. While the browser extension got a lot of attention, the service also offered very generous free access to the Unpaywall API and data dumps, which allowed researchers to do their own measurements of open access rates.
This totally changed the game!
It must be noted that alternatives like Open Access Button and Dissem.in existed a bit before or alongside Unpaywall, but it was the Unpaywall extension and API that took off (the service was called oaDOI at the time, which is why, if you look at the various Python wrappers and R libraries, it is still called that).
They grew quickly, and one year later Jason Priem, founder of the now renamed Our Research, noted that they had nearly 0.5 billion Unpaywall API calls. As I noted back then, a lot were probably due to Unpaywall being called in various places where it was deployed in the discovery infrastructure (e.g. link resolvers, discovery systems), but usage by researchers for research must have been part of it too.
Wow, it looks like the @unpaywall database has overtaken Sci-Hub in popularity! We had nearly half a billion calls to our API in 2017, vs ~150M downloads from Sci-Hub. #oa #openaccess https://t.co/cI0wBM1TaG
— Jason Priem (@jasonpriem) February 1, 2018
My various blog posts in the last 3-5 years have continued to chronicle the rising importance of Unpaywall and the accompanying rise of library access broker extensions that support redirection to both subscribed and open access items.
Last year, I even wrote
"I think it is to fair say that in the past 3 years, the influence of Impactstory's (now renamed as Our Research) Unpaywall looms large. Short of Google Scholar, I would say that Unpaywall has been one of the most significant developments in making the focus on discovery of open access content."
They have since leveraged the same data to launch a new product - Unsub (formerly Unpaywall Journals) - that helps institutions decide which journals to cancel subscriptions for.
As noted in this tweet and this blog post, Unpaywall's dominance in measuring and determining OA status is so complete that if you see something claiming to tell whether an article is OA, you can guess that it is probably using Unpaywall under the hood and have a good chance of being right.
The same is true for a lot of research on the topic of OA measurement, where Unpaywall - whether via the API or by downloading and extracting the data dumps - is seen as the default way of determining OA status.
Due credit has to be given to the developers of Unpaywall for generously making their service free with very high API limits, so you almost never need to pay anything for it (except for a very limited use case where you need constant quick updates via a data feed).
Prior to Unpaywall, doing a 100% bulk check on a large set of papers was something most researchers, much less librarians, could not easily do. By providing a checking service that was totally free with a very generous rate limit (even large institutions that called it whenever their discovery service loaded a search result page could barely tax it!), Unpaywall made it pretty much trivial for researchers and librarians like me to run such studies.
As such, since 2016 Lens.org has tracked over 72 studies that mention Unpaywall (or oaDOI) in the title, abstract or keywords; while not all of them are OA measurement studies, a large number are.

At the time of writing in mid-2020, on top of the CWTS Leiden rankings, the following three papers that I am aware of are being circulated that purport to measure the percentage of OA of institutions on a large scale.
Evaluating institutional open access performance: Methodology, challenges and assessment (companion piece on sensitivity analysis)
Open Access 2007 - 2017: Country and University Level Perspective
On a smaller scale, OA finding services are now increasingly used by libraries and consortiums to estimate the APCs paid by their researchers, of which the first step involves locating papers that potentially carry APCs, e.g. this NZ study.
In almost all these cases, Unpaywall is used. In only one exception is a completely different OA finding system - CORE Discovery - used. More on that later.
Evaluations of the reliability and accuracy of Unpaywall
It is clear from the above that Unpaywall currently occupies a very central position in both professional work as well as in OA measurement studies.
However, as far as I know, there have not been many studies evaluating the reliability of Unpaywall in detecting OA besides the paper The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles, which was coauthored by the developers of Unpaywall themselves and released as a preprint in 2017.
Back then, they estimated a recall of 77% and a precision of 96.6%, based on comparison against a gold standard of 500 manually checked articles.

In other words, if Unpaywall reports that an article is open, it is correct 96.6% of the time. On the other hand, out of all the articles that are truly OA, it will identify only 77% of them. This means it is still missing quite a lot of OA papers (leaving aside illegal copies).
Unpaywall states that they prefer to maximise precision at the cost of recall, which is understandable; but given the relatively low recall, we know that the overall OA rate is probably underestimated on average, which in principle allows you to adjust for it.
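To make that adjustment concrete, here is a minimal back-of-envelope sketch of my own, under the strong assumption that the 2017 precision and recall figures still hold:

```python
# Back-of-envelope sketch: estimate the true OA rate from the rate Unpaywall
# measures, assuming the 2017 precision/recall figures still hold.
# measured = TP + FP, TP = recall * true, precision = TP / measured
# => true = measured * precision / recall

def estimate_true_oa_rate(measured_rate, precision=0.966, recall=0.77):
    """Invert the measurement error to estimate the true OA rate."""
    return measured_rate * precision / recall

# e.g. if Unpaywall says 40% of a paper set is OA:
print(f"{estimate_true_oa_rate(0.40):.1%}")  # -> 50.2%
```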
Unfortunately, we do not know if this error is biased. For example, it seems likely to me that Unpaywall's low recall is due not to difficulty indexing big publisher sites and big subject repositories like arXiv, but rather to problems indexing smaller sites and repositories.
More importantly, does this low recall rate affect some institutions more than others?
Even if we accept these results as valid, the paper was first released as a preprint in 2017; how have things changed since then?
Unfortunately, since then I am not aware of any public large scale study on the accuracy of Unpaywall.
Evaluating institutional open access performance: Sensitivity analysis does not directly measure the accuracy of Unpaywall; instead, the authors carry out an interesting sensitivity analysis where they
proceed to analyse the levels of sensitivity associated with the use of different Unpaywall data dumps. For these analyses, we maintain the use of the combined dataset, but use Unpaywall data dumps from different dates to determine open access status.
Given a set of papers published in 2017 (combined dataset of WOS+Scopus+Microsoft academic), they study the impact of using 5 different Unpaywall snapshots available throughout 2018 and 2019.
They find that, comparing the results from the latest Unpaywall snapshot against earlier ones, there are some jumps and relatively larger Green OA differences (which lead to Total OA differences) compared to differences in other OA categories.
They interpret this as Unpaywall slowly
capturing a backfilling of historical data in particularly through the green and bronze routes and ... observe the effects of embargos on self-archiving through the sudden jumps in 2018 for green open access, as we move towards earlier data dumps.
Another way to interpret this, I think, is that the concentration of differences in Green OA points to Unpaywall's relatively weaker ability to detect Green OA.
My own personal experience with Unpaywall and our institutional repository
I've written in the past about the importance of making sure your institutional repository contents are well indexed in sources besides Google Scholar. Around the same time, University of Liège Library gave an informative talk, "Open Access discovery: ULiège experience with aggregators and discovery tools providers. Be proactive and apply best practices".

Sources indexing University of Liege's Institutional repository contents
Looking at the list above of tools/indexes where University of Liège optimises discovery of their institutional repository, you might be wondering where Unpaywall or oaDOI (its former name) was. Even at the time, it was obvious this was going to be an important source to optimise for.
The reason it wasn't listed is that, at the time in 2017, Unpaywall did not have its own crawler and relied on BASE, which made it important to get your institutional repository contents properly indexed in BASE and correctly identified as open access copies as opposed to mere metadata records (more on that later).
Since then, Unpaywall has started using its own crawler, and I would argue that in the list of sources worth optimising for, Unpaywall has leaped to the front of the pack in importance, second only to Google Scholar!

Based on personal experience using the Unpaywall browser extension, and later other tools that leveraged Unpaywall's OA detection capability, I always had a hunch that they were not indexing our institutional repository as well as they should. This was despite our institutional repository being extremely well indexed by Google Scholar (based on downloads and reports given to us by Digital Commons).
This hunch grew partly out of an understanding of the flaws of the OAI-PMH protocol, but at the time I comforted myself with the idea that in most cases it would not affect the overall OA% of our institution, because for every copy missed by Unpaywall in our repository, it would spot the other Green or Gold OA copies in large subject repositories like arXiv, SSRN or PMC, not to mention publisher hosted copies.
In other words, I did not believe that many of the OA copies in our institutional repository were unique, in which case the poor indexing of our institutional repository, while reducing the downloads our copies would receive via Unpaywall, would not affect OA rates.
The paper Open Access uptake by universities worldwide (2020) even provided some evidence of this, listing Singapore as the country with the 7th highest share of distinct green OA publications coming from PMC. While Singapore's output is dominated mostly by the two oldest universities (we are the third oldest), I believed at the time our output would display a similar pattern, albeit in other subject repositories like arXiv.

Open Access uptake by universities worldwide (2020)
But was this a true assumption?
A rude awakening
Earlier this year, I was asked how my institution, the Singapore Management University, fared when using a methodology like the one listed in the preprint at the time - Open Access uptake by universities worldwide.
Sidenote: our institution isn't listed in the CWTS Leiden Rankings because fractional counting is used, and our relatively small size coupled with high international collaboration rates knocked us below the 800-publication threshold for WOS from 2014-2017.
This isn't exactly their methodology, but it is close enough:
1. Go to Web of Science and use the core collection (Science Citation Index Expanded, Social Sciences Citation Index and Arts & Humanities Citation Index).
2. Restrict the search to output from my institution's authors.
3. Filter to 2014-2017.
4. Export the results as CSV with DOIs.
5. Run the DOIs through the Unpaywall API (a minimal sketch follows below) or use the Simple Query Tool.
6. Profit!
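For anyone who wants to reproduce step 5 in code, a minimal sketch might look like this (the CSV file name and the "DOI" column are assumptions; adjust them to match your actual Web of Science export):

```python
# Minimal sketch of step 5: check a list of DOIs against the Unpaywall API.
# The file name and "DOI" column are assumptions based on a typical WoS export.
import csv
import requests

EMAIL = "you@example.org"  # Unpaywall requires an email parameter; use your own

def unpaywall_is_oa(doi):
    """Return True if Unpaywall reports the DOI as open access."""
    r = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": EMAIL})
    r.raise_for_status()
    return r.json().get("is_oa", False)

with open("wos_export.csv", newline="") as f:
    dois = [row["DOI"] for row in csv.DictReader(f) if row.get("DOI")]

oa_count = sum(unpaywall_is_oa(doi) for doi in dois)
print(f"OA rate: {oa_count / len(dois):.1%}")
```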
Because it was so easy to do, I did the same for Web of Science and Scopus just for kicks.
For those of you wondering why you can't just use the OA filters in either Web of Science or Scopus directly, the reason is that those databases count only a subset of Unpaywall results as open access. For example, the former only includes versions marked as Accepted Version or Published Version, while the latter only includes publisher hosted versions (no Green OA!).
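To illustrate, here is a sketch based on my rough reading of the two databases' rules; the "host_type" and "version" field names come from the Unpaywall v2 record schema:

```python
# Sketch: the same Unpaywall record can count as OA or not depending on
# which subset of oa_locations you accept.

def classify(record):
    locs = record.get("oa_locations") or []
    any_oa = bool(locs)
    # Roughly WoS-style: only accepted or published versions count
    wos_style = any(
        l.get("version") in ("acceptedVersion", "publishedVersion") for l in locs
    )
    # Roughly Scopus-style: only publisher-hosted copies count (no Green OA)
    scopus_style = any(l.get("host_type") == "publisher" for l in locs)
    return any_oa, wos_style, scopus_style
```

A submitted version sitting only in an institutional repository would count under any_oa but under neither of the stricter filters, which is why the databases' built-in OA counts run lower.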

The trick is not to stop here after using the Unpaywall API.
Because I had a hunch that Unpaywall was missing some items in our institutional repository, I then decided to do additional work:
1. I filtered my DOIs to the ones where Unpaywall was unable to find a free copy.
2. I asked my institutional repository manager for all the PDFs in the repository with DOIs.
3. I matched #1 with #2 to see how many matches I could find (a minimal sketch of this matching follows below).
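Here is a minimal sketch of that matching step; the file names and the DOI normalisation are illustrative assumptions:

```python
# Sketch of steps 1-3: intersect the DOIs Unpaywall called closed with the
# DOIs attached to full text PDFs in the repository. File names are assumed.

def normalise(doi):
    """Lowercase and strip common DOI prefixes so the two lists compare cleanly."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

with open("unpaywall_closed_dois.txt") as f:
    missed = {normalise(line) for line in f if line.strip()}

with open("repository_fulltext_dois.txt") as f:
    repo = {normalise(line) for line in f if line.strip()}

found_in_repo = missed & repo
print(f"{len(found_in_repo)} of {len(missed)} 'closed' DOIs have full text in our IR")
```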
I expected some to be found, of course, leading to a higher OA rate, maybe as much as 10% more, but what I found shocked me.

As seen above, using Unpaywall alone (1st row in the table above), 32.6% of WOS DOIs were detected as OA (for 2014-17) and 30.6% (for all years).
Once I included copies in the institutional repository missed by Unpaywall (2nd row in the table above), it jumped to 81.5% (for 2014-17) and 74.1% (for all years)! I won't show you my results with Scopus, but they are similar.
When I first saw these results, I was convinced an error had been made, and I checked over and over.
I then recalled the existence of a free alternative to Unpaywall launched last year - JISC's CORE Discovery - and ran that API over our DOIs. I got pretty much the same results (3rd row of the table) as my manual matching, confirming that, for us at least, Unpaywall was indeed missing more than half the available full text in our repository!

Working with Unpaywall to fix the problem
While I have seen analyses implying that CORE Discovery has a higher recall than Unpaywall, I don't believe the differences are usually that big (usually in the range of 5%?). So what was going on in our repository (hosted on Digital Commons)?
The tricky part about working with Unpaywall is that, while they can provide you with a dashboard showing the number of full text matches with Crossref DOIs in their index for your repository content, it is hard to use this alone to tell how many full texts are actually missing from the match, unless your institutional repository manager has their own count of full texts that match a Crossref DOI.
Fortunately for us, our institutional repository manager has done a couple of cleanups to ensure the DOIs are quite accurate.

Institutional Repository Dashboard from Unpaywall
That said, I'm not an institutional repository manager, but my impression is that many institutional repositories do not provide DOIs. So how does Unpaywall do the match at all? Does it miss everything?
Fortunately not: in the absence of a DOI, Unpaywall tries to match based on title and author. You can see this in the JSON output under "evidence" when this occurs.

This is as opposed to matches in OA repositories using DOIs.
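If you want to check this across your own records, the snippet below tallies the evidence values from a set of Unpaywall responses. The example strings in the comments are the kind of values Unpaywall returns for OAI-PMH matches; the exact wording may differ:

```python
# Sketch: tally how Unpaywall matched each OA copy, via the "evidence" field
# on each oa_location. Example values (wording may vary):
#   "oa repository (via OAI-PMH doi match)"
#   "oa repository (via OAI-PMH title and first author match)"
from collections import Counter

def evidence_counts(records):
    """records: an iterable of parsed Unpaywall JSON responses."""
    return Counter(
        loc.get("evidence")
        for rec in records
        for loc in (rec.get("oa_locations") or [])
    )
```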

Of course, for greater accuracy in matching, exposing DOIs in your OAI-PMH feed sounds safer, but which field should you put them in?
In our case, when we enquired with Digital Commons support a while ago on where to put our DOIs, they recommended we put them in <dc:identifier.doi>, which is in the qualified Dublin Core feed.
Here's an example record.

Unfortunately, even if you do this, by default the Unpaywall OAI-PMH harvester will not see the DOIs, because it looks only at the oai_dc prefix (default simple Dublin Core) feed and not the qualified Dublin Core one.
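You can verify this yourself by requesting the same record under both metadata prefixes and checking where the DOI is actually visible. A minimal sketch follows; the endpoint, record identifier and qualified Dublin Core prefix are all placeholder assumptions (check your repository's ListMetadataFormats response for the real prefix):

```python
# Sketch: fetch the same record as simple DC (oai_dc) and qualified DC, and
# check which feed actually exposes the DOI. All values below are placeholders.
import requests

BASE = "https://repository.example.edu/do/oai/"        # hypothetical endpoint
IDENTIFIER = "oai:repository.example.edu:etd-1234"     # hypothetical record id
DOI = "10.1234/example.doi"                            # hypothetical DOI

for prefix in ("oai_dc", "dcq"):  # "dcq" is a guess; prefixes vary by platform
    r = requests.get(BASE, params={
        "verb": "GetRecord",
        "identifier": IDENTIFIER,
        "metadataPrefix": prefix,
    })
    print(f"{prefix}: DOI visible = {DOI in r.text}")
```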
However, it was easy to get the help of Unpaywall support staff member Richard Orr to change this.
In fact, according to Richard, they do look for DOIs in the identifier and relation fields (with identifier preferred), as well as identifier.doi, so these are all options.
Sidenote: my understanding is that, under Crossref standards, you should only assign the same DOI as the published version if the copy in your repository is the accepted or published version, while for other versions you should use relationships, so perhaps the relation field might be better?
An even more serious problem - indicating full text
While making Unpaywall harvest our DOIs will help with the occasional missed match, most institutional repositories do not list DOIs and yet are presumably doing fine. So the big difference for us was unlikely to be due to its failure to see our DOIs.
To understand the problem, we have to go back again to the flaws of OAI-PMH.
For historical reasons, I believe the OAI-PMH protocol did not have a standard field to indicate that an article's metadata record in an open repository is accompanied by an open full text copy.
While this seems absurd to us today, the context at the time was that open repositories were expected by definition to list only open full text items (like arXiv); the idea of a repository with a mix of metadata-only records and records with open full text was not quite a thing.
Today, of course, a lot of institutional repositories have a good mix of both types of records, and while some repositories and aggregators have adopted fields with appropriate values to indicate open access, there is no universal standard, and usage varies from repository to repository.
Because Unpaywall cannot be sure how a repository indicates (if at all) which of its metadata records have full text, it crawls all the links and PDFs in your repository, much like Google Scholar.
While this works for many repositories, some with restrictive security settings, like ours, will block such heavy crawling. This was the major reason why Unpaywall was unable to detect more than half of our full text!
So the solution for us was either to whitelist the Unpaywall bot or, as we chose, to inform Unpaywall of the exact metadata fields we use to indicate the OA version - <dc:rights.license> - and the version of the full text - <dc:type.version> - in the qualified Dublin Core feed.
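To sanity-check that such flags are actually exposed, you can scan the qualified Dublin Core feed for them. A rough sketch, with the same caveats about endpoint and prefix as above (and note it only reads the first page of results, ignoring resumption tokens):

```python
# Sketch: count records in the qualified DC feed that expose both the licence
# and version fields we pointed Unpaywall at. First page of results only.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

r = requests.get("https://repository.example.edu/do/oai/", params={
    "verb": "ListRecords", "metadataPrefix": "dcq"})
root = ET.fromstring(r.content)

with_flags = sum(
    1 for rec in root.iter(f"{OAI_NS}record")
    if any(el.tag.endswith("rights.license") for el in rec.iter())
    and any(el.tag.endswith("type.version") for el in rec.iter())
)
print(with_flags, "records expose both rights.license and type.version")
```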

Once all this was sorted out, Unpaywall correctly indexed all our full text!

The new results are a bit higher because of new full text deposited in the month or so after the first measurement.
Now to wait for the data to be updated and flow downstream to the various sources using Unpaywall, such as Scopus, Lens.org, etc.
Inclusion guidelines for Unpaywall?
It's a shame that Unpaywall does not list any inclusion guidelines, the way Google Scholar does.
Granted, the Unpaywall team is very small, but given the central importance of Unpaywall, I am sure many IR managers would be happy to work to optimise their repositories to work better with Unpaywall, which would be a win-win for both sides.
Researchers should also consider using alternative OA finding sources to help verify results. The CORE Discovery API in my testing is pretty much identical in use to Unpaywall, though its main weakness is that it can only find the best OA copy and does not list OA attributes, so at best it can be used to verify the total OA% and not the components.
Another possibility is the Open Access Button API, which I have not had much experience with, though I suspect it relies heavily on other sources like Unpaywall rather than its own crawler.
The only other OA finding source I can think of, by 1Science, is unfortunately not open.
Unique content - a measure of the value add of our institutional repository?
I've often mused about ways to measure the value add of an institutional repository. One of the candidates for this is to measure the amount of unique content you have in your repository.
This was inspired by realising years ago that many institutional repository managers were duplicating OA content found elsewhere (preprint servers, Gold OA journals, etc.). While such work has value for some objectives (and yes, institutional repositories have many goals beyond just promoting open access), it does not really advance the cause of open access much, since once an article has been made open access, creating additional copies only adds a bit more value from the additional redundancy.
In my mind, the number of unique journal articles you have in your repository is a lower bound on how much value your repository is adding. Why a lower bound? Because in theory, other repositories could clone your copies (ResearchGate is rumoured to do so) and your unique content would start to decline.
Not everyone agrees with this logic, but the main issue with this line of thinking is that it is extremely difficult to determine the unique content each repository holds across the board, though it is quite possible to do it for an individual repository with Unpaywall and some work (i.e. run the Unpaywall API as usual on your DOIs, then check the Unpaywall JSON output to see how many have only one location AND that location is in your repository's domain; a minimal sketch follows below).
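Here is a minimal sketch of that check; REPO_DOMAIN is a placeholder, and records are assumed to be parsed Unpaywall JSON responses as in the earlier sketches:

```python
# Sketch: count OA records whose only oa_location lives on your repository's
# domain, i.e. copies unique to your IR as far as Unpaywall can see.
REPO_DOMAIN = "repository.example.edu"  # placeholder: your IR's domain

def unique_to_repo(record, repo_domain=REPO_DOMAIN):
    locs = record.get("oa_locations") or []
    return len(locs) == 1 and repo_domain in (locs[0].get("url") or "")

# e.g. given oa_records, a list of Unpaywall responses where is_oa is True:
# unique_share = sum(unique_to_repo(r) for r in oa_records) / len(oa_records)
```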
That said, a side effect of this exercise is that it indirectly answers one of the questions on my mind about the amount of unique content in our institutional repository.
Assuming that Unpaywall has decent coverage of other OA sources like Gold OA and subject repositories, and that our repository is the only one impacted, the fact that adding our institutional repository pushes our OA rate from 32% to 81% implies that our institutional repository alone contributes (81 - 32)/81 ≈ 60% of all the OA content! Sounds too good to be true?
But it is important to note that just because something is unique in Unpaywall doesn't mean it is the only copy out there on the web! I suspect part of what is driving this is that our librarians have spent the last 10 years systematically searching Google and Google Scholar for versions of papers to "rescue", picking up legal copies from sources that Unpaywall cannot reach (author homepages) or does not want to index and display (ResearchGate, Academia.edu).
Since these are sites that tend to have dodgy status and uncertain futures, curating such copies into the repository is quite a bit of value add by librarians.
Sidenote: I asked this question on Twitter about the average uniqueness of content in institutional repositories, and some researchers in the area of OA measurement tried to answer the question with their existing data. Below, Alberto Martín (@albertomartin) tries to answer with a dataset he scraped from Google Scholar.
Quick calculation: nº of OA articles in each repository, limited to those for which GS did not find any other free versions in other sources (first positions are not IRs but the rest are) pic.twitter.com/y2sySZpZ51
— Alberto Martín (@albertomartin) June 18, 2020
Below, @CameronNeylon attempts the same using Unpaywall data dumps.
Ok quick and dirty approach gives me this. Of roughly 15M objects found in repositories, roughly half are found in only one repo that is not (Eu)PMC or arxiv. Semantic scholar accounts for a lot of the remainder tho, not sure what to make of that. pic.twitter.com/r3eQWvxLKx
— CⓐmeronNeylon (@CameronNeylon) June 19, 2020
Are Open Access indicators mature yet?
I have recently been re-reading the excellent blog posts of librarian and research evaluation expert Elizabeth Gadd and came upon the following post, "Measuring openness: should we be careful what we wish for?", where she wonders if measuring openness is a good idea.
One of the main reasons I feel concerned about an additional layer of openness metrics is that I’m not convinced that openness is yet at a mature enough state to be measured.
She lists a host of pitfalls around measuring openness, from incentive traps to equity issues. While I do not engage with such issues here, my experience looking into the relatively simple PP(OA), the proportion of institutional output that is open access, does indeed remind me that measuring things can be a lot trickier than it looks.
Conclusion
How typical is our experience? It's hard to tell, since I don't have access to the full text listings of other institutions' repositories, though the fact that CORE Discovery and Unpaywall tend to differ only slightly most of the time gives some indication that our experience might be unusual.
Still, if you are an institutional repository manager, this seems a fairly simple yet important test you can do, particularly as OA indicators are starting to emerge.

