It's the journal usage, stupid! A loophole in COUNTER? EZproxy with EzPAARSE, and Springer Nature syndicates with ResearchGate
In this week's blog entry, I cover three new stories of interest that have one common thread - issues affecting the measurement of journal usage.
The three stories are:
1. A loophole in COUNTER? Are some publisher platforms double counting downloads?
2. More granular analysis of EZproxy logs? OCLC partners with EzPAARSE
3. Syndication on ResearchGate - Springer Nature works with ResearchGate to present content directly and (probably) track usage for institutional users.
I conclude by noting that, paradoxically, the rise of OA has led publishers and libraries to pay more, not less, attention to institutional journal downloads.
1. A loophole in COUNTER? Are some publisher platforms double counting downloads?
If you are like many academic librarians who aren't deeply involved in the ins and outs of electronic resources evaluation, you would probably assume that content providers abiding by the COUNTER (Counting Online Usage of Networked Electronic Resources) standard allow libraries to reliably compare usage across journal platforms.
While this is true to a point, a recent C&RL article "Do Download Reports Reliably Measure Journal Usage? Trusting the Fox to Count Your Hens?" pointed out a loophole that I wasn't aware of.
In a nutshell, if a publisher automatically serves up an HTML article (instead of just a landing page with an abstract and a choice of PDF or HTML versions of the article), it can lead to double counting, as users may click a second time to get the PDF. This is something COUNTER did not handle until the launch of COUNTER Release 5 in 2019.
But let's follow the argument in the paper.
The authors (who incidentally include economists from the University of California, Santa Barbara) use data obtained from the California Digital Library (CDL) (which includes data from all 10 campuses of the University of California system) to try to predict downloads of a journal title from journal characteristics, including:
journal citations
impact factor (this scales for the number of articles in the journal)
journal discipline
year of download for the observed downloads
publisher
The dataset consists of 4.25 million downloads from about 7,000 journals from 7 publishers from 2010-2016.
Essentially, they fit an equation that can explain/predict downloads of each journal title (by download year).
"We estimate the parameters α + β, β and the coefficients Yy, Fj, and Pj, corresponding to indicator variables for year of download, journal discipline, and journal publisher."

The paper goes into depth on the effects of impact factor, number of articles, download year and journal discipline on expected downloads, but the most important point the authors make is that once you control "for each journal’s number of articles, impact factor, major field, and the year in which downloads occurred", there is still quite a lot of "publisher effect" left over.
Normalizing the Elsevier effect to 1, they produce the following table, which has three models that break discipline effects into 5, 27 and 334 categories respectively; they show that the publisher effects are robust regardless of the granularity of the discipline modelling.

They explain that it makes sense for Nature-branded journals to have higher downloads even after controlling for discipline, year, citations, etc., because these include "commissioned summaries of recent research called News and Views, which are written by prominent scholars and intended for nonspecialists" - pieces that are downloaded and read but not cited much.
That said, their study does seem to show that Elsevier, Nature Publishing Group (non-Nature-branded) and ACS have a stronger publisher effect than Wiley, Taylor & Francis and Springer. For example, they find Elsevier journals draw 40% more downloads than Wiley journals even after controlling for discipline, citations, size of journal, etc.
So what is their explanation for this? They refer to literature that points to a loophole in COUNTER Release 4 and prior versions (this has just been superseded by COUNTER Release 5 in 2019 - more on this later).
The loophole is this. Imagine a publisher A that serves up a landing page to each user with the title and abstract, giving them a choice to download either the HTML or PDF version of the article. This counts as one download (either HTML or PDF).
Now assume publisher B instead automatically serves up an HTML article, and then allows the user the choice of getting the PDF version. Obviously, this can result in double counting and more downloads in total.
Essentially, while journal COUNTER downloads are counted fairly consistently, the way publisher platforms serve up articles can make downloads hard to compare across platforms for this reason. While you can see total PDF and HTML downloads separately, you can't tell which were downloaded in the same session.

Sample Journal Download COUNTER report you get
The authors then use the COUNTER statistics to calculate a total downloads (PDF + HTML) to total PDF downloads ratio, and show that journals published by Nature and by Elsevier, which have the highest publisher effects, also have the highest total downloads to PDF downloads ratios.
This provides some circumstantial evidence that the publisher effect could be due to the loophole, but it doesn't line up perfectly, as the American Chemical Society (third-highest publisher effect) has a relatively low total downloads/total PDF ratio.
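For concreteness, the ratio is just (HTML + PDF) / PDF per journal. A toy calculation with made-up numbers (not the paper's data) shows the intuition: a ratio near 1 means most users took only one version, while a ratio approaching 2 is consistent with many users being served the HTML and then also clicking for the PDF.

```python
# Made-up JR1-style annual totals for two hypothetical journals
journals = {
    "Journal A (landing-page platform)": {"html": 100, "pdf": 900},
    "Journal B (auto-HTML platform)":    {"html": 850, "pdf": 700},
}

for name, d in journals.items():
    ratio = (d["html"] + d["pdf"]) / d["pdf"]
    print(f"{name}: total/PDF ratio = {ratio:.2f}")
# Journal A's ratio is ~1.11; Journal B's is ~2.21, the pattern you
# would expect if Journal B's platform were double counting sessions.
```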
I'm not fully convinced that the publisher effect can be totally explained by this possible double counting, as there could be other factors at play not controlled for by the model.
In any case, this loophole is known and is addressed in the COUNTER Release 5 standard released in 2019: libraries can use the Unique_Item_Requests metric, which solves this issue.
Not every publisher supports R5 yet, so we will face the double counting problem for some time, but eventually I suspect every library will use the Unique_Item_Requests metric to close this loophole.
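To see why the R5 metric closes the loophole, here is a toy sketch (my own illustration, not the actual COUNTER processing rules) counting the same events both ways: Total_Item_Requests counts every request, while Unique_Item_Requests counts each article at most once per user session, so an auto-served HTML page followed by a PDF click collapses to one.

```python
# Each event: (session_id, article_id, format)
events = [
    ("s1", "doi:10.1000/a1", "html"),  # platform auto-serves the HTML page
    ("s1", "doi:10.1000/a1", "pdf"),   # same user then clicks for the PDF
    ("s2", "doi:10.1000/a1", "pdf"),   # a different user, PDF only
    ("s2", "doi:10.1000/a2", "html"),  # same user, a second article
]

# R4-style counting: every request is a download
total_item_requests = len(events)

# R5-style Unique_Item_Requests: one count per (session, article) pair
unique_item_requests = len({(s, a) for s, a, _ in events})

print(total_item_requests, unique_item_requests)  # 4 vs 3
```

The first session's two requests for the same article count once under the unique metric, removing the advantage of auto-serving HTML.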
2. OCLC EZproxy tie-up with EzPAARSE
There's quite a bit of literature over the years on this double counting loophole, but it seems to me that the ratio of total downloads to PDF downloads at best provides circumstantial evidence, since you can't tell directly whether the double counting is really happening just by looking at totals from COUNTER reports.
Ideally what you want is individual detailed logs to see exactly what is happening, and while there are some studies that managed to get logs from publishers, this is rare.
This is where the next development comes in. What if you could parse EZproxy logs in a consistent and easy way to see exactly what is happening at the transaction level?
OCLC's latest tie-up with EzPAARSE provides an opportunity to do exactly that (though you might miss on-campus use).
I blogged about the open source EzPAARSE tool back in 2016; it is basically a tool loaded with rules for dozens of databases/platforms that parses logs to give you detailed statistics on user actions on each database/platform.
For instance, with the right rules coded in, for some platforms it can tell you whether a user downloaded a PDF or HTML copy, whether it's a view of a TOC (table of contents), etc.
It also queries sources like Crossref to further supplement the data it can show, e.g. DOIs identified in the HTTP request can be used to determine which article/journal is being downloaded.
Below is a sample output I got while running EzPAARSE over EZproxy logs in 2016.

CSV generated by ezPAARSE
The tricky bit, of course, is that one needs to maintain a huge list of rules for each platform (and content providers might change their platforms without warning), and some platforms might just not provide enough information to parse fine details.
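To give a flavour of what such a rule does, here is a sketch of a parser for one imaginary platform. The URL pattern, regex and field names are all hypothetical (actual EzPAARSE rules are maintained per platform in its own rule format); the point is just that a rule maps a raw log line to an event type and an identifier:

```python
import re

# Hypothetical rule: this imaginary platform serves articles at
# /content/pdf/<doi>.pdf and /content/html/<doi>
RULE = re.compile(
    r'GET\s+/content/(?P<fmt>pdf|html)/(?P<doi>10\.\d{4,9}/[^\s"?]+?)(?:\.pdf)?\s'
)

def parse(log_line: str):
    """Return (event_type, doi) for a matching request line, else None."""
    m = RULE.search(log_line)
    if not m:
        return None
    rtype = "ARTICLE_PDF" if m.group("fmt") == "pdf" else "ARTICLE_HTML"
    return rtype, m.group("doi")

line = '192.0.2.1 user1 [01/Jul/2019] "GET /content/pdf/10.1234/abc123.pdf HTTP/1.1" 200'
print(parse(line))  # ('ARTICLE_PDF', '10.1234/abc123')
```

Once requests are classified like this, the DOI can be resolved via Crossref to a journal title, which is how the tool builds up its per-platform statistics.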
Until recently, EzPAARSE was used mostly by the Couperin consortium (which contributes most of the rule base), but in Feb 2019 OCLC, which owns EZproxy, announced a tie-up with Couperin's EzPAARSE.
As I write this, details are still scant, but OCLC will apparently roll it out first for customers of the EZproxy hosted solution this year for an additional fee (pricing not available yet).
It will combine output from running logs through EzPAARSE with visualization dashboards using Kibana.
Recent webinar on EZproxy analytics at ALA Annual 2019
See also EZproxy analytics: experiences from pilot libraries (May 2019).
My reactions to this development...
Firstly, if you configure your EZproxy logs to capture user identity, then that, coupled with running the logs through EzPAARSE and visualizing the output, makes certain types of library impact studies on student success (e.g. those that track students' electronic resource use as a predictor) almost trivial. Though deciding whether to do this is another story...
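As a sketch of what that configuration involves: EZproxy's LogFormat directive uses Apache-style tokens, and including %u records the authenticated username with each request. The exact lines below are illustrative, not a recommendation - check OCLC's EZproxy documentation and, more importantly, your institution's privacy policies before logging identities:

```
## ezproxy.cfg - include the authenticated username (%u) in each log line
LogFormat %h %u %t "%r" %s %b
LogFile -strftime ezproxy.%Y%m%d.log
```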
Secondly, my impression is that OCLC has not done much with EZproxy in the last couple of years, and the rise of a new generation of proxy competitors like RemoteXs with better built-in analytics (it has even made inroads here in Singapore, with Nanyang Technological University replacing EZproxy with it) means this is a very timely development.
Thirdly, it seems to me that the fight for information about user downloads is heating up. Given the stated aim of RA21 to eventually end IP authentication and proxy server systems, it's curious to see this development of providing better data via EZproxy.
3. ResearchGate and Springer Nature tie-up
If you are a regular reader of the Scholarly Kitchen, it is hard to miss posts that talk about "leakage" and the proposed solution, "syndication" - ideas popularized and reported on by Lisa Janicke Hinchliffe and Roger Schonfeld.
What do those terms mean and why syndication?
Essentially, publishers are worried that users are now downloading less from publisher websites (which record COUNTER download statistics) and more from other sources that do not leave any download traces.
Given that they are aware that libraries judge the value of subscriptions and negotiate based on download figures (hence the gaming of the COUNTER loophole by some platforms, as seen above), this "undercounting" or "leakage" is dangerous to them.
As I understand it, "leaks" include:
illegal downloads from Sci-Hub
downloads from copies uploaded to platforms some call "scholarly collaboration networks", such as ResearchGate, Academia.edu and Mendeley groups
possibly even downloads of Green OA copies in repositories
Not everyone likes this terminology!
Sidenote: interestingly enough, aggregator copies are not counted as leakage, probably because they are captured indirectly at the aggregator level and the business contracts are not with the publishers directly anyway. Still, I wonder: if a library only has access to a journal via an aggregator and not the publisher directly, leakage will still occur even with publisher syndication.
So how do publishers propose to capture usage from these use cases? Obviously they would love to have downloads captured on as many of these platforms as possible, to be aggregated with the downloads they normally capture on publisher sites.
Firstly, they needed to modify the COUNTER standard to allow other platforms to contribute COUNTER stats to be aggregated with the normal ones, and indeed they have done so.
In my recent post on the release of COUNTER R5, I talked about a new feature called Distributed Usage Logging (DUL).

What is DUL?
As noted in that post, Elsevier's Mendeley now supports this, which means that if your institution's users download a journal article via a closed Mendeley group, it will be captured in your COUNTER statistics.
DUL usage statistics are built into the COUNTER standard itself; while not part of a standard view, they can be obtained from the Master Report. As you can see below, there is a line for "DUL Mendeley".

Sample COUNTER R5 showing Mendeley downloads
Given Mendeley is an Elsevier-owned company, it is no surprise it adopted DUL. What other platforms might also do this?
Interestingly enough, while I have seen Kopernio's logo on some presentations on DUL, a representative from Kopernio has confirmed on Twitter that this was just an example and Kopernio currently does not support DUL.
Not to the best of my knowledge. The original PDF retrieval should be captured in standard COUNTER stats - we don't report 'downstream' events like subsequent reopens etc, which I believe is what DUL gets from eg Mendeley.
— Paul Tavner (@PTavner) July 17, 2019
Another publisher that seems to be mostly on board with DUL is Springer Nature, which controls the platform ReadCube.
Most interestingly, however, Springer Nature parts company with fellow publisher Elsevier in one important respect - its position towards ResearchGate.
ResearchGate is well known for hosting a lot of illegal content. While Elsevier, as part of the Coalition for Responsible Sharing (CRS), has chosen to sue ResearchGate, Springer Nature and a few other publishers have come to terms with it. In particular, Springer Nature has chosen instead to partner with ResearchGate to syndicate content, with a first pilot in March 2019.
If you are like me, this reads like jargon.
So to clear it up, let's talk about the recently announced Springer Nature syndication pilot, which is now in its second phase.
As far as I understand it, Springer Nature will be pushing content to ResearchGate, and ResearchGate will be selectively showing full-text content to its users based on their institutional entitlements.
Currently, the pilot provides the PDF to download if it detects you are from an institution that has access to the article; otherwise it provides a limited, view-only version.
Take for example the following Nature paper on ResearchGate which I am trying to access using my ResearchGate profile.
This is what I see
It picks up on my institutional affiliation and shows "access to this full-text is provided by Singapore Management University".

ResearchGate grants inline viewer access to Nature article by recognising access via my institution
If I try to download it, I get the following popup.

ResearchGate popup when I download
Interestingly enough, ResearchGate is able to identify that I have entitlements via Singapore Management University (SMU) even though I'm not in the SMU IP range and have not registered an SMU email with my account. Could it simply be looking at the self-declared institution in my profile?

Is ResearchGate using my declared institution to check for access?
Lisa and Roger tried various configurations to understand how ResearchGate determines the affiliation, and hence the entitlements, of each user, and conclude that entitlements can be gamed - though no doubt adjustments will be made.
I would add that RA21 seems a ready fit for the challenge of identifying institutional users.
On the other hand, this is what I see when I am looking at a paper on ResearchGate that my institution does not have access to. I get only the inline version (though it is the published version) and can't download a PDF.

What ResearchGate shows when I have no institutional access - just a view only version
Though we don't have confirmation of this yet, it is logical that ResearchGate will be providing usage statistics to Springer Nature via DUL, closing the leakage.
Still, it seems to me that Springer Nature is being surprisingly generous by providing read-only articles via ResearchGate even if your institution doesn't normally have access. It's unclear if this will continue, and/or whether libraries will think this feature gives them the freedom to assign lower priority to subscribing to such journals.
Implications if syndication catches on
It is still early days, but one wonders: if the syndication model catches on, what would the impact be?
Imagine if it is adopted by most major publishers, not just on ResearchGate and Mendeley, but also on Academia.edu and even institutional repository networks like the Elsevier-owned Digital Commons.
For the latter, there have been pilots in the past, with models floated where visitors to Digital Commons institutional repositories might be given full-text versions from ScienceDirect if they had institutional access, and if not, an inline view-only version (of the accepted manuscript rather than the published version) reminiscent of the Springer Nature-ResearchGate tie-up.
The most obvious effect, I think, is that this would further reduce the usefulness of web scale discovery services like Summon, Primo, EDS, etc. Arguably they have never quite succeeded in drawing users away from the mighty Google Scholar, and access browser extensions like Kopernio also have the potential to chip away at the main strength of such tools - delivery...
Technically speaking, and leaving aside aggregators, I suspect a syndication model also provides more stable linking to content than getting at it via discovery services.
Publishers choosing to syndicate content to popular research sites might be the beginning of the end for such tools, reducing their roles even further.
After all, in a future where you can instantly tell, on most popular sites you visit, whether a piece of content is accessible to you (via institutional access rights or OA), there is much less need to go back to a library discovery tool to check, or to install clunky browser extensions that may or may not always link properly.
See also the metaphor of a "supercontinent of content" that Roger Schonfeld has been promoting for a while.
Conclusion
As open access rates increase, we paradoxically see an increased focus by publishers on measuring institutional downloads.
Part of it is the need for better analytics to show the value of journal packages in a world where more and more downloading occurs off publisher platforms.
Lisa also points out that certain flipping-to-open arrangements, like Subscribe to Open, rely on accurate usage statistics.
Lastly, many companies like Elsevier have an ambition to shift from a content business to an analytics company, so being able to track users more reliably fits their value proposition even if OA plateaus at a relatively low level - think, for example, of more accurate altmetrics that can aggregate downloads across more platforms.

