Of open infrastructure & the flood of openly licensed data - are they good enough?
As the battle for open access - or, more accurately, over the route taken to open access - rages on, I have become aware of the rise of a new type of "open": the quest for "open infrastructure".
My understanding of it is very limited, but I first seriously took note of it when I saw people tweeting about the Joint Roadmap for Open Science Tools (JROST).
One of the reasons why open access is in such a mess is that we have given up the publishing infrastructure to commercial interests, and it is now almost impossible to regain control of the system.
As we now start talking about Open Science, Research Commons and open tools, are we heading down the same path? Commercial interests have already begun acquiring and owning different parts of the research workflow and infrastructure.
As I understand it, part of the reason for this movement is to counter the encroachment of commercial interests, in particular Elsevier, who are making big moves into the research workflow. By owning multiple pieces of the workflow, they are able to create synergies that are hard to match.
Librarians are of course very familiar with the Elsevier juggernaut - by acquiring and owning PURE/Bepress/SSRN/Mendeley/Scopus and much more, they have their hands in every part of the scientific production line and are poised to unify their offerings into a very compelling system for researchers, thanks to analytics and frictionless hand-offs. For a small taste of this, they are starting to force single sign-on across their different offerings.
Worse yet, they might build barriers that make it difficult to interoperate with tools outside the Elsevier stable of products, or to export the data to other tools. For example, Mendeley 1.19 recently started encrypting the Mendeley library database, making it difficult to do a full export of data to Zotero.
In comparison, while there has been a rise in open science / open source tools like DSpace, EPrints, OSF, Zotero, OpenCitations etc., often run by non-profits, such systems tend to be isolated and their development fragmented. Can they compete against the slick, integrated yet proprietary systems that Elsevier, Digital Science and Clarivate are bringing to the table?

Jeroen Bosman & Bianca Kramer's handy tool to filter down tools by properties - this picture shows tools which are open source and free to use across Discovery, Analysis, Writing and …
As such, JROST can be understood as an attempt to counter this by inviting such organizations "to come together, compare notes, and identify areas of cooperation and integration."
The hope is that non-profit organizations have a "shared sense of mission and a willingness to collaborate openly between these teams that are often lacking in their for-profit counterparts."
Open data and open infrastructure are becoming serious competitors
When it comes to open source solutions, leaving aside sustainability, the question on my mind is: can such open infrastructure and tools really compete with commercial offerings in terms of quality, particularly when some of the companies involved have the resources of Elsevier?
Take the example of discovery and citation indexes: can a free, open index run by a non-profit really compete with Scopus, Web of Science or even the newer Dimensions?
Even two years ago, I would say no way. But now I'm not so sure.
If you have been reading my blog closely for the last 2 years, you will have seen me mention the treasure trove of open data released by Crossref, Unpaywall, ORCID etc.
OpenAIRE's DOIBoost project takes the next logical step: why not combine all this open data together?

First use the Crossref metadata as a base, then enrich it further with data from ORCID and Microsoft Academic (for additional abstracts and affiliations), add links to open versions with Unpaywall data, and you have a pretty nice database of scholarly material. The Python scripts to do this are publicly available.
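To make the enrichment idea concrete, here is a toy sketch - my own illustration, not the actual DOIBoost scripts - that pulls base metadata for one DOI from the Crossref REST API and attaches an open-access link from the Unpaywall API. The email address and DOI are placeholders.

```python
import requests

EMAIL = "you@example.org"  # placeholder - Unpaywall asks callers to identify themselves

def enrich_doi(doi):
    """Base record from Crossref, enriched with an OA link from Unpaywall."""
    work = requests.get(f"https://api.crossref.org/works/{doi}").json()["message"]
    record = {
        "doi": doi,
        "title": (work.get("title") or [None])[0],  # Crossref returns titles as a list
    }
    oa = requests.get(f"https://api.unpaywall.org/v2/{doi}",
                      params={"email": EMAIL}).json()
    best = oa.get("best_oa_location") or {}  # may be null if no OA copy is known
    record["oa_url"] = best.get("url")
    return record

print(enrich_doi("10.1234/example"))  # placeholder DOI - substitute a real one
```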
I'm not sure what the intention of this project is, but could a non-profit organization like SHARE, OpenAIRE etc. use all this data to offer an open citation index?
Combining more sources?
Looking closely at the metadata model and schema of DOIBoost, it looks to me as if citations of the paper are not included. One can imagine harvesting the open references in the Crossref dataset and merging them with the Microsoft Academic Graph (MAG) data to plug this hole.
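In sketch form the merge itself is trivial - a union of (citing DOI, cited DOI) pairs. The hard part is extracting those pairs from each source in the first place (Crossref's "reference" fields, MAG's paper-references data mapped to DOIs). A toy illustration with made-up DOIs:

```python
# Citation links as (citing DOI, cited DOI) pairs, with made-up DOIs.
crossref_edges = {("10.1/a", "10.1/b")}
mag_edges = {("10.1/a", "10.1/b"), ("10.1/a", "10.1/c")}  # MAG adds a link Crossref lacks

combined = crossref_edges | mag_edges  # set union de-duplicates shared links
print(len(combined))  # 2 distinct links, one contributed only by MAG
```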
The OpenCitations Corpus, which currently extracts citations from PMC open access papers (it also does matching against DOIs and PMIDs) - could that be used as well?
Since we are talking about citation metrics, what about throwing in some altmetrics via Crossref Event Data?
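As a rough sketch of what that could look like - based on my reading of the Event Data API documentation, so check the current docs before relying on it - counting the events recorded against one DOI is a single query:

```python
import requests

def event_count(doi, mailto="you@example.org"):  # placeholder contact address
    """Total Crossref Event Data events (tweets, Wikipedia cites, etc.) for one DOI."""
    resp = requests.get(
        "https://api.eventdata.crossref.org/v1/events",
        params={"obj-id": doi, "mailto": mailto},
    )
    resp.raise_for_status()
    return resp.json()["message"]["total-results"]

print(event_count("10.1234/example"))  # placeholder DOI - substitute a real one
```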

Crossref Event Data
What about processing the full-text corpus harvested by CORE?
Others wondered about adding more sources like Wikidata.
The key thing is that most of the sources I mentioned are, I believe, "open" to the degree that they can be used like this.
As the DOIBoost paper states, "This work could be delivered thanks to the Open Science policies enacted by Microsoft, Unpaywall, ORCID, and CrossRef, which are allowing researchers to openly collect their metadata records for the purpose of research under CC0 and CC-BY licenses."
What surprised me was this note about MAG: "The MAG dataset is available with ODC-BY license thanks to the Azure4research sponsorship signed between Microsoft Research and KMi"
Of course, being open, all this data can be used by commercial companies as well. Digital Science's Dimensions is probably one of the earliest and best known users of the Crossref dataset, something they proudly mention as the base they built upon. They have also integrated Unpaywall data, as well as their own proprietary sources. As generous as they are in promising free access to researchers who need the data for bibliometrics research, plus a freemium search system, they unfortunately would not be considered "Open".
Another interesting tool that uses Crossref data is Scilit, but it seems to me that given the gaps and limitations of Crossref data - in particular the citation holes left by holdouts like Elsevier, IEEE, ACS, Kluwer etc. refusing to open their references - the availability of Microsoft Academic Graph data to plug the gap is going to be particularly impactful.

CWTS study comparing open references from Crossref against Scopus and Web of Science
Lens - Microsoft Academic Graph, Patents, Crossref and more
Because DOIBoost brings together data from different sources, a record can have multiple identifiers and values. For example, in the record below, some authors have multiple Perm IDs, and you can see from the provenance that they come from ORCID and Microsoft Academic Graph.

DOIBoost schema
Another organization that not only uses MAG with Crossref data but also works closely with Microsoft is the non-profit Lens.org. See for example the announcement of the tie-up between Lens and Microsoft Research on their blog.
I will probably do a more in-depth review of Lens.org eventually, but for the purposes of this post it is enough to note that, like DOIBoost, Lens.org combines data from Microsoft Academic Graph, Crossref and other sources to form a combined "meta" record.
As someone who used to study ProQuest's (now Ex Libris's) Summon, this concept reminds me a lot of Summon's "super record", where different versions of records from publishers, aggregators and A&I databases are merged to get a more complete record than any one source provides.
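The merging logic behind such a "super record" can be sketched in a few lines. This is my own toy illustration with made-up field names, not Summon's or Lens's actual code:

```python
def merge_records(*records):
    """Later sources fill in fields the earlier ones left empty."""
    merged = {}
    for rec in records:
        for key, value in rec.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged

crossref_rec = {"doi": "10.1/x", "title": "A title", "abstract": None}
mag_rec = {"doi": "10.1/x", "abstract": "An abstract...", "affiliations": ["Leiden"]}
print(merge_records(crossref_rec, mag_rec))  # one record, more complete than either
```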

As you can see from the above, this record in Lens includes:
Scholarly citations (from Crossref)
Patent citations (from Lens itself)
Scholarly references (from Crossref, MAG, PubMed?)
At the top right, you can see identifiers from Crossref, Microsoft Academic, PMID and PMC, as well as Lens's own identifiers.
My understanding is that it will soon include, or at least use, the Unpaywall API to show links to free-to-read articles. Throw in a couple more of the free data sources I mentioned above and it could be a powerful force - possibly even comparable to Dimensions.
There are obviously some differences in details from the DOIBoost project, but I trust you see the similarity in approach - leveraging free data to create a service or tool.
Can free open infrastructure compete with commercial tools?
On paper it seems plausible that there is a sufficient volume of free data out there for new entrants to compete with incumbents like Scopus and Web of Science in the citation index business. But anyone with even the slightest familiarity with citations knows the data is often very dirty.
Take, for example, a very common task: benchmarking the total output of researchers at one institution against another. This is surprisingly hard to do (author disambiguation is a related problem) and is an area most article and citation indexes are currently competing on. In fact, as I have noted in the past, with the exception of Microsoft Academic, most freemium services like Dimensions do not offer an affiliation filter/facet.
Of course there is always ORCID, but adoption is currently nowhere near widespread enough to make a dent in the problem. So the affiliation data in Microsoft Academic seems to be the main "open" source out there that will be carrying the load. But how good is it, particularly compared to incumbents like Web of Science?
Web of Science vs Microsoft Academic - affiliation accuracy
Unfortunately the news isn't good, at least from the first studies that are appearing.
Take a recent study in which a researcher examined the accuracy of author affiliations for Leiden University.
They extracted the publications that, according to i) Web of Science (WOS) and ii) Microsoft Academic, were affiliated with the university.
They then matched the two sets of articles and checked the amount of overlap.
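In outline, the comparison looks like this - my own sketch with placeholder data, not the study's code:

```python
import random

wos_dois = {"10.1/a", "10.1/b", "10.1/c"}            # placeholder DOI sets
mag_dois = {"10.1/b", "10.1/c", "10.1/d", "10.1/e"}

overlap = wos_dois & mag_dois       # found by both
wos_only = wos_dois - mag_dois      # found only by Web of Science
mag_only = mag_dois - wos_dois      # found only by Microsoft Academic
print(len(overlap), len(wos_only), len(mag_only))

# The study then hand-checked a random sample (100 papers) from each
# non-overlapping subset against the affiliation stated in the PDF.
sample = random.sample(sorted(mag_only), k=min(100, len(mag_only)))
```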

As you can see below, the amount of overlap isn't huge. While it might not be surprising that, of all the papers attributed to Leiden University by Microsoft Academic, slightly less than half were also found by Web of Science (Microsoft Academic covers far more than Web of Science, which has a more restrictive scope), it looks really odd that more than 40% of the papers that Web of Science found are not found by Microsoft Academic.
To investigate this, the researcher sampled 100 papers each from the two subsets: the papers found by Microsoft Academic but not Web of Science, and vice versa. The samples were then checked for accuracy to see if the affiliation was mentioned in the PDF.

As you can see from above, of the 100 additional papers found by Microsoft Academic but not WOS, 29 were wrong - or at the very least, there was no mention of the affiliation in the PDF. It strikes me that this doesn't necessarily mean Microsoft Academic was wrong, though it probably was.
Of the 64 papers where the affiliation was correctly stated, 41 were missed by WOS because WOS does not index those papers at all.
23, however, had their affiliation stated correctly in the paper and were indexed in WOS, yet their affiliation was still missed by WOS. So we can see WOS misses things too.

On the flip side, when we look at the 100 papers found by WOS but missed by Microsoft Academic, MA misses 40 of them that are correctly identified by WOS (affiliations stated in the paper) even though those papers are indexed in Microsoft Academic!
This probably shows that Microsoft Academic still needs a lot of work.
As a sidenote, 44 papers found by WOS with the affiliation were missed by Microsoft Academic simply because Microsoft Academic doesn't have the paper indexed at all, while WOS does. While this isn't exactly an accuracy problem, it does show that the assumption that the much bigger Microsoft Academic contains almost everything in Web of Science isn't quite true (due to non-articles? older publications?).
The remaining 16 were apparently errors in the matching process (which created the data used for the overlap diagram); Microsoft Academic did in fact find those papers and correctly attribute the affiliations.
The interesting thing is that of the 100 papers sampled from the WOS set (but initially not overlapping with Microsoft Academic), it seems 100% were correctly attributed, showing WOS had perfect precision for that sample.
On the other hand, as we saw earlier, of the 100 that Microsoft Academic claimed were affiliated (and WOS did not find), 29 were in fact wrong!
In total, of the 200 manually checked papers, Microsoft Academic had it wrong 69 times (29 wrong attributions plus 40 missed affiliations), which is a fairly high error rate. In comparison, WOS was wrong only 23 times.
So it seems the fancy algorithms that Microsoft Research likes to boast about on its blog can only go so far. I would guess that Web of Science and Scopus, as established players, have had the benefit of years if not decades of librarians and institutions helping to correct errors. In particular, as librarians use SciVal and InCites for benchmarking, and Research Information Management Systems like PURE and Converis, it is likely that the data from such systems helps validate the data sitting in Scopus and Web of Science (particularly affiliation data).
Scopus vs Dimensions vs Crossref - completeness of references
In the next example, we compare the completeness of references in Scopus vs Dimensions (based on Crossref data plus other proprietary data) vs Crossref itself.
While we know Dimensions definitely indexes more than Scopus, what is studied below is the completeness of citations when both the citing and cited items are in the index, hence controlling for the differences in index size.
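The control can be sketched like so - my own illustration with toy identifiers, not the CWTS code. Restrict each index's citation links to those whose citing and cited items appear in both indexes, then compare the resulting link sets:

```python
# Items and citation links per index, with made-up identifiers.
items_scopus = {"a", "b", "c"}
items_dims = {"a", "b", "c", "d"}
edges_scopus = {("a", "b"), ("a", "c")}
edges_dims = {("a", "b"), ("a", "d")}

def links_within(edges, pool):
    """Keep only links whose citing AND cited items are in the shared pool."""
    return {(src, dst) for (src, dst) in edges if src in pool and dst in pool}

shared = items_scopus & items_dims                  # items both indexes cover
scopus_links = links_within(edges_scopus, shared)   # {("a","b"), ("a","c")}
dims_links = links_within(edges_dims, shared)       # {("a","b")}
print(len(scopus_links & dims_links),               # found by both
      len(scopus_links - dims_links),               # unique to Scopus
      len(dims_links - scopus_links))               # unique to Dimensions
```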

CWTS study comparing citation overlaps where citing and cited sources are indexed
At a glance we can see that Crossref's citation coverage is much poorer than Scopus's: Crossref misses 305 million of the 445 million citation links found by Scopus!
I also find it interesting that while there is a big overlap in citations when comparing Scopus to Dimensions (around 87% of the total), Scopus has far more unique citations than Dimensions (43.5 million vs 17.9 million). This implies Scopus has more complete references (assuming they are correct).
Based on this, I suspect Scopus still has a lead for now in citation completeness, even against Dimensions, which is another commercial offering but a much newer one.
Conclusion
In the past 2 years or so, there has been a sudden rush of new indexes from Microsoft, Digital Science, Crossref, 1Science and more, with data that is amenable to analysis. This has created a gold mine of data for researchers in the bibliometrics field to study and compare, beyond the big 3 - Google Scholar, Web of Science and Scopus.
Of course it is still early days, and the newer sources like Lens, Microsoft Academic, Dimensions etc. might yet catch up with or even surpass the more established players.
Even if they do not, in terms of open tools and open infrastructure, a non-profit creating and maintaining an open citation index is probably at the high end of the resource spectrum, and not all open tools or infrastructure require that much effort to be competitive.
For instance, in the realm of reference managers, the open source Zotero is a match for EndNote, Mendeley and other commercial offerings, at least for now. How many of the open tools fall closer to the Zotero end of the spectrum in terms of resources needed, and how many fall closer to the open-citation-index end, is something worth thinking about.
Lastly, I have talked about "open" tools as if the term were settled. It is not. Depending on the definition, "open" could mean open source, run by a non-profit, or built on openly licensed data.

Jeroen Bosman & Bianca Kramer's handy tool to filter down tools by properties

