Is open article data too big to ignore (for Text data mining)?
Four years ago, in 2014, I wrote about the coming disruption to academic libraries due to Open Access.
In How academic libraries may change when Open Access becomes the norm, I wrote:
"The trend I am increasingly convinced that is going to have a great impact on how academic libraries will function is the rise of Open Access. As Open Access takes hold and eventually becomes the norm in the next 10-15 years, it will disrupt many aspects of academic library operations and libraries will need to rethink the value-add they need to provide to universities.
The events of the past year have convinced me that the momentum for open access is nearly unstoppable and the tipping point for open access has or will occur soon."
How do we know the tipping point has been reached? Simple: discovery tools that had for years ignored or given only cursory attention to free-to-read articles started to get serious about open access discovery, as the pool of open access articles became too large to ignore and being able to reliably detect and point to them became a critical issue.
From 2017 onwards, whether it was moves from existing players in the discovery space like Summon/Primo (Proquest/Exlibris), Scopus (Elsevier) and Web of Science (Clarivate), or from new disrupters like Kopernio, Unpaywall, Lean Library browser, Anywhere Access (Digital Science), 1Findr (1Science), Open Access Button etc., we saw this happen. This is a fast-moving space, but here's a fairly recent summary by me at CNI Spring 2018 on some of the issues.
So what's next? While the first phase of open access benefited human researchers, who could access and read articles, the next phase I think might belong to the machines.
When the level of open access was still low, individual researchers could still benefit from reading individual articles, but applying text data mining (TDM) and machine learning (ML) techniques was perhaps still less effective, as what could be mined was spotty, mostly limited to the life sciences (via PubMed Central) and preprint-heavy fields like physics (via arXiv).
But I believe we are reaching yet another tipping point, where the amount of free full text available has reached a level at which text data mining and the application of the latest machine learning and AI techniques will increasingly yield dividends.
Articles are meant for more than humans
Due to my past background, I've always had a blind spot here. For a long while, I saw the main benefit of open access as being for human readers.
But as more and more articles become open access, this also means a bigger pool of articles that can be crunched using text data mining (TDM) and machine learning techniques, and it is a truism in machine learning circles that, beyond a certain point, more data trumps better algorithms.
If you look at the posts I have done in the last two years or so, such as my coverage of the latest trends and innovations in the discovery space and my coverage of tools like Dimensions, Scholarcy and Citation Gecko, a common thread is that the techniques in most of them rely at least partly on crunching the full text of articles (or use the outputs of such work, or would become more effective if the pool of full text were bigger) to achieve their magic.
Take projects like Open Knowledge Maps, which currently uses BASE or PubMed metadata to cluster articles. It could start using full text instead, to much greater effect (a rough sketch of what text-based clustering looks like follows the screenshot below).

Open Knowledge Maps currently uses only metadata
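To make the idea concrete, here is a minimal sketch of clustering articles by their own text, assuming you already have full texts (or abstracts) as plain strings. This is an illustration of the general technique using scikit-learn, not Open Knowledge Maps' actual pipeline, and the sample documents are invented.

```python
# Minimal sketch: cluster articles by TF-IDF similarity of their text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Deep learning methods for protein structure prediction ...",
    "Randomised controlled trial of a new antihypertensive drug ...",
    "Convolutional neural networks for image classification ...",
    # ... many more full texts or abstracts
]

# Turn each document into a TF-IDF vector, ignoring very common English words.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(documents)

# Group the articles into a small number of topical clusters.
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
for doc, label in zip(documents, kmeans.labels_):
    print(label, doc[:60])
```

With only metadata you would be clustering titles and subject terms; with full text the vectors capture far more of what each paper is actually about.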
Interesting projects like OpenCitations and the EXCITE Project (Extraction of Citations from PDF Documents) would be able to freely and easily obtain PDFs in bulk to do their work of producing open citations in linked data format.
Powerful tools like Digital Science's Dimensions not only rely on open citations available through Crossref and other open sources like PMC as a base, but because those citations are incomplete (not all publishers participate), they also have to rely on full-text processing of PDFs from partners to supplement that base. Both Dimensions and Microsoft Academic also use the latest NLP (Natural Language Processing) methods on article full text to assign Fields of Research and Fields of Study respectively, instead of relying on subject assignments based on the journals the articles appear in, which fails in an increasingly multi-disciplinary world.
Imagine a world where most article full text is available for TDM: creating an undertaking similar to what Dimensions is doing, without partner relationships, would be slightly less daunting, though the technical expertise barrier would still exist of course.
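As a toy illustration of what "assigning fields from the text itself" means, here is a hedged sketch of a supervised text classifier. The training examples and labels are invented, and Dimensions and Microsoft Academic of course use far more sophisticated NLP pipelines than this; the point is only that the signal comes from the article's own words rather than from the journal it sits in.

```python
# Toy sketch: predict a field of research from article text, not journal title.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "We evaluate a convolutional neural network on image benchmarks ...",
    "Patients were randomised to treatment and placebo arms ...",
    "We prove a new bound on the mixing time of the Markov chain ...",
    "The trial measured blood pressure reduction over 12 weeks ...",
]
train_labels = ["computer science", "medicine", "mathematics", "medicine"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_texts, train_labels)

# Classify a new, multi-disciplinary article by its own text.
print(model.predict(["Machine learning models for predicting patient outcomes ..."]))
```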
Projects like the Open Citation Index COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, could potentially start to compete with paid citation indexes without needing any pre-existing partnerships if the pool of open access articles explodes. We might even see the beginnings of an open discovery index of articles becoming a reality, competing with commercial indexes like Summon/EDS/Primo Central.
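COCI's data is already openly queryable. The sketch below shows one way to pull DOI-to-DOI references from the OpenCitations COCI REST API; the endpoint shape follows the OpenCitations documentation, but the example DOI is arbitrary and the exact field names should be checked against the current docs.

```python
# Sketch: list the open DOI-to-DOI references of one article via COCI.
import requests

doi = "10.1371/journal.pone.0154702"  # arbitrary example DOI
resp = requests.get(f"https://opencitations.net/index/coci/api/v1/references/{doi}")
resp.raise_for_status()

for ref in resp.json():
    # Each record links a citing DOI to a cited DOI.
    print(ref.get("citing"), "->", ref.get("cited"))
```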

Scholarcy processes full text to extract findings, summaries etc
What else can one do with a huge corpus of articles? Crunch citation contexts to improve discovery? Use machine learning to train a system to type citations using CiTO, the Citation Typing Ontology? Extract scientific facts? Tools like Scholarcy that crunch full text to summarise results and extract references currently work best on open access articles and will benefit as more and more articles become open access.
The rising tide of full text available for TDM is also why I suspect linked data will finally become mainstream, as being able to crunch free full text can aid projects like Wikidata and WikiCite.
I'm still new to the potential of TDM, and more knowledgeable readers can probably list more examples of what might be possible in a world where more and more full text is available, but that is not the purpose of this post.
Until recently, most large-scale projects that relied on TDM and ML of full text were focused on the life sciences, for the very obvious reason that most of the free full text was available in that area via repositories like PMC, or were run by big companies that had the clout to negotiate access (e.g. Google, Microsoft, Proquest, Ebscohost). This is starting to change however....
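To give a feel for one of these tasks, here is a rough sketch of extracting citation contexts: the sentences in a full text that contain an in-text citation marker. Real tools like Scholarcy use far more robust parsing; this regex-based version, with an invented snippet of text, is purely illustrative.

```python
# Rough sketch: pull out sentences that contain (Author, year) or [n] citation markers.
import re

full_text = (
    "Deep learning has transformed the field (Smith et al., 2017). "
    "We build on earlier clustering work [12]. "
    "Our evaluation follows standard practice."
)

# Naive sentence split, then keep sentences containing a citation marker.
sentences = re.split(r"(?<=[.!?])\s+", full_text)
citation_marker = re.compile(r"\([A-Z][^)]*\d{4}\)|\[\d+\]")

contexts = [s for s in sentences if citation_marker.search(s)]
print(contexts)
```

Run over a large open corpus, contexts like these are exactly the raw material for typing citations with CiTO or for improving discovery ranking.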
A caveat
Of course, I've been writing as if free to read or open access automatically gives you the right to data mine. This isn't necessarily true. There could be articles made free to read for humans but with more restrictive licenses that prevent data mining.
It's also important to note that ideally TDM requires more than publishers simply putting out HTML pages, or worse PDFs, without paywalls. A well-structured XML corpus in data dumps or via well-documented APIs would be nice, for example, but even without that, being able to freely download full text without having to navigate the perils of publisher paywalls and permissions seems like a win to me.
Even nicer would be for some aggregator to do the hard work of drawing full text from multiple sources, cleaning and enriching the corpus for others to build their work on top of. As we shall see, this is starting to happen....
What is the current size of the article corpus that can be mined? Who has collected them?
While there have been many aggregators of open repository content, such as BASE, most of them only collect metadata and point to the full text.
A major exception is Jisc's CORE (COnnecting REpositories). While the CORE search looks like any other aggregator you can use to search, the main thing you notice is that CORE doesn't just point to a target such as your institutional repository; in most cases it points to an archived PDF on its own servers.

Link in CORE goes to a PDF harvested and stored on CORE, not the source repository
In fact, CORE crawls and harvests the PDFs it finds in repositories and "provides machine-to-machine access to the content, enabling the development of new applications, including those based on text mining."
How much is in CORE? In June 2018, CORE released a blog post claiming to be the world's biggest aggregator, including a comparison table of some alternatives. They write:
"As of May 2018, CORE has aggregated over 131 million article metadata records, 93 million abstracts, 11 million hosted and validated full texts and over 78 million direct links to research papers hosted on other websites. Our dataset of full text papers has reached 49TB"
This led to some debate in the comments section about whether this was really true (deduplication could change the 131 million article metadata number, etc.), but I think the main point is not just that CORE's index has X article metadata records but that "CORE is unique in its endeavour to aggregate and expose not only metadata, but also full texts of open access research papers. No other service in our list provides this capability."
On top of this, CORE also offers the CORE Publisher Connector, software that provides access to roughly 1.8 million Gold and Hybrid articles from various major publishers.
All in all, this provides a strong base for building third-party services over preprocessed, enriched and validated CORE content using APIs or data dumps. Obviously this is way easier than trying to pull data from thousands of repositories around the world or using diverse publisher APIs from scratch.
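For a sense of what "machine-to-machine access" looks like in practice, here is a minimal sketch of searching CORE via its API (v2 at the time of writing). You need a free API key from CORE, and the endpoint, parameters and response field names shown here are based on my reading of CORE's API documentation, so treat them as assumptions to verify against the current docs.

```python
# Minimal sketch: search CORE and check whether hosted full text is available.
import requests

API_KEY = "YOUR_CORE_API_KEY"  # placeholder; register with CORE for a free key
query = "text mining"

resp = requests.get(
    f"https://core.ac.uk/api-v2/articles/search/{query}",
    params={"page": 1, "pageSize": 10, "fulltext": "true", "apiKey": API_KEY},
)
resp.raise_for_status()

for record in resp.json().get("data", []):
    title = record.get("title")
    full_text = record.get("fullText")  # present when CORE has harvested the full text
    print(title, "-", "full text available" if full_text else "metadata only")
```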
Is CORE the only source of such aggregated full text data?
Possibly not.
Another interesting development has been the http://gettheresearch.org/ project announced by the team behind Impactstory and Unpaywall, with partners like Open Knowledge Maps, the British Library and the Internet Archive.

See the press release here, but it seems the aim is to build an "AI-powered scholarly search engine which aims to help the public find and understand research". The site also talks about an "AI-powered Explanation Engine" and references things like concept maps, automated plain-language translations (think automatic Simple Wikipedia), structured abstracts, topic guides, and more.
All this is very hazy to me, and as I write there is some debate on mailing lists about whether this is a realistic goal, but the thrust of it is to take all the open access papers and apply TDM and the latest machine learning techniques to see what can be done.
And where does the data come from? In case you have not been paying attention, this is the team behind Unpaywall, and
"We’ve already finished the database of OA papers. So that's good. With the free Unpaywall database, we’ve now got 20 million OA articles from 50k sources, built on open source, available as open data, and with a working nonprofit sustainability model."
From what I understand this corpus will eventually be made available in some form as well for others to use.
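Even today, anyone can already query that Unpaywall database record by record. Here is a minimal sketch of looking up a DOI via the free Unpaywall REST API; replace the email with your own (Unpaywall asks for it in the request), and note the example DOI is arbitrary. Field names follow the Unpaywall data format documentation.

```python
# Minimal sketch: look up the open access status of one DOI in Unpaywall.
import requests

doi = "10.1038/nature12373"  # arbitrary example DOI
resp = requests.get(
    f"https://api.unpaywall.org/v2/{doi}",
    params={"email": "you@example.com"},  # use your own email address
)
resp.raise_for_status()
record = resp.json()

print("Open access?", record.get("is_oa"))
best = record.get("best_oa_location") or {}
print("Best OA copy:", best.get("url_for_pdf"), "| version:", best.get("version"))
```

The full database is also available as a data dump, which is what makes corpus-scale work, rather than one-DOI-at-a-time lookups, feasible.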
While some doubters wonder if this project is too ambitious with just an 850k grant from the Arcadia Fund, my view is that's not entirely the point. If the people behind this project can provide a large, easily accessed corpus of articles for analysis and research at low cost, the payoff could be huge in terms of what others could do with it.
Questions I have
When search systems began to surface multiple versions of the same paper, librarians like myself started to debate the problem of which version to show users. As Lisa Hinchliffe puts it, when there are multiple copies and versions, which version link should we privilege? Open Access, Version of Record, or Let the User Decide?
I wonder if there is an equivalent question for TDM?
When doing TDM and running Machine learning techniques over the text, could we face the same problem?
Does it matter that my algorithm to extract facts is run over copies that are not the versions of record (VORs)? Surely if I'm using the full text to extract citations and references I should only use the VOR? What if I'm using the text to determine facts and findings? I suppose one way would be just to exclude non-VORs, but depending on the type of open access world we live in, that might mean excluding a lot of the full text.
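Mechanically, the "exclude non-VORs" option is just a filter on version metadata. The sketch below assumes Unpaywall-style location records, where "version" is one of submittedVersion, acceptedVersion or publishedVersion; your corpus may label versions differently, and the records here are invented.

```python
# Sketch: keep only records whose metadata marks them as the published version (VOR).
corpus = [
    {"doi": "10.1234/a", "version": "publishedVersion", "text": "..."},
    {"doi": "10.1234/b", "version": "acceptedVersion", "text": "..."},
    {"doi": "10.1234/c", "version": "submittedVersion", "text": "..."},
]

vor_only = [rec for rec in corpus if rec["version"] == "publishedVersion"]
print(f"{len(vor_only)} of {len(corpus)} records are versions of record")
# Depending on how much of the open access corpus consists of author-accepted
# manuscripts, this filter may discard a large share of the available full text.
```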
Also, as more and more open data emerges, commercial organizations will start building products on top of it. Currently, open citations from Crossref are a major example of data being used in services by Digital Science (Dimensions), Exlibris (Primo) and more.
As more and more data becomes open, many projects will start becoming larger and more comprehensive rather than focused just on the life sciences.
What types of business models will these new services based on open data use (depending, perhaps, on the licensing details of the open data)?
Implications for libraries
Even though in 2014 I wrote that "Libraries will have greater focus on value add expertise services such as information literacy, data management services, GIS etc to replace the diminishing 'buyer' role", I had only a vague idea at the time of what all these expertise services meant.
I don't claim to be an expert now, but I am slowly learning and getting to grips with areas such as Digital Scholarship and Research Data Management. Like any new area, it opens up new strategy and management questions, such as: what is the right amount of investment in manpower? What signals do we watch for to decide when we should shift our manpower to these new areas? Traditionally, we libraries have been slow to shift. For instance, in the 90s to 2000s it was generally observed that while we spent most of our budgets on electronic resources, the bulk of our manpower was still focused on managing print. Are we heading towards a similar situation today with data services and Scholarly Communication?
On the technical front, it's an opportunity for individuals who are seeking new horizons to wade into new areas to learn: text data mining, machine learning, Open Scholarship/Science/Research. If you intend to remain an academic librarian for the next 10-20 years, it might be a good idea to start learning a bit in those areas. It also helps as a liaison, since many of our researchers are starting to do TDM and apply the latest machine learning techniques in their own fields, so having a basic understanding of APIs and TDM will be extremely helpful. For example, I was just sent this manual on TDM of earnings conference transcripts.
Here are some sites worth looking at that offer self-learning courses or tutorials.
In particular, if you are brand new to text data mining and want to learn both the theory and get your feet wet, I recommend the following six-hour course. I have personally gone through it; while it won't make you an expert, and I suspect you need a bit of Python knowledge and an understanding of the point of Jupyter notebooks (you will be using Google's cloud-based Colaboratory) to fully appreciate the practical portions, it's still a pretty nice starter course.

