COVID-19 and text data mining - a grand experiment around the CORD-19 dataset?

One thing the COVID-19 crisis has put a spotlight on is how well our scientific publishing system works. Preprints, which allow speedier dissemination of scientific results but at the cost of quality control, have come under scrutiny. The impact of perverse incentives that lead researchers to delay the release of data in order to publish in top journals is yet another problem that has become more salient.
The tweet below links to a news story describing how some of the first experts in Wuhan, China with data on COVID-19 refused to share it until their paper was published, despite calls from other researchers who needed the data urgently.
Well, here we have it. People are literally dying because of researchers' obsession with getting their work into "good" journals. Chasing prestige is costing lives: maybe hundreds, possibly thousands. Not inconcievably millions. https://t.co/HEfg68HV1M
— Mike Tⓐylor 🏴 🇬🇧 🇪🇺 (@MikeTaylor) March 13, 2020
But of course the biggest target is the fact that a lot of peer-reviewed research is locked behind paywalls, and someone who needs it might not be able to access it. After all, we are told, "paywalls kill people".
Thanks,
It should be fun.
Except for publisher paywalls.
PAYWALLS MEAN PEOPLE DIE.
That's a bit long winded, so lets write:
PAYWALLS KILL PEOPLE— Peter Murray-Rust (@petermurrayrust) April 23, 2020
For example, the Ebola outbreak in Liberia might have been averted if people had had access to papers locked up behind paywalls.
In fact, paywalls might kill people not just in the sense of denying access to a human reader who needs a particular paper, but also because locking papers behind paywalls makes it very hard, if not impossible, to access all the papers needed for text and data mining (TDM), preventing TDM applications from emerging that might speed up research.
Still, as someone with very limited knowledge of TDM, particularly of journal articles, I wonder what exactly the benefits are that we will reap when/if we live in a fully OA world where most scientific journal text is readily available for TDM.
How much would we have benefited if we lived in an alternative universe where every scientific research paper on any topic we wished was easily available for TDM?
Might we be better informed because our information retrieval tools become better? Perhaps because search results from a search engine enhanced with a better language model (e.g. SciBERT, or language models trained on COVID-19 papers) give us more relevant results, or because we might be able to build tools that handle Q&A tasks by extracting text from abstracts (not full papers) to answer questions. Might such tools also suggest better directions to explore, e.g. by automatically extracting relevant entities as facets?
The promise of TDM often lies in the automatic extraction of entities (e.g. genes, proteins, diseases, chemicals, symptoms, treatments) and the relationships between them. So might we see knowledge-network graphs like COVID Graph and SciSight that allow medical researchers to see the forest for the trees and spot relationships they would otherwise have taken longer to see?
Might we be able to extract citation relationships between fast-moving preprints, to quickly judge which papers are being contradicted or supported? Or perhaps even check whether certain claims are backed up by papers?
Might all these and more applications of TDM combine to lead to higher-quality decisions overall with less effort? Or perhaps even directly or indirectly help with the holy grail - allowing someone to figure out a promising treatment much faster than we otherwise would?
But how can we tell, when we live in a world where it is not easy to access, collect and do bulk TDM on journal articles?
What follows is my attempt to understand the potential and issues around the difficulty of doing TDM of journal articles. I am a novice in this topic, and besides self-learning from reading and watching webinars, I have benefited tremendously from conversations with Phil Gooch (Scholarcy). However, all of the many errors and misunderstandings in this piece remain fully my own.
The "grand experiment" going on around TDM of COVID-19 papers
As I write this, a grand experiment is going on that I reckon will help with the question of the value of TDM of journal articles.
Yes, I speak of the move by major publishers like Elsevier and Springer Nature to open access to journal papers relating to COVID-19 since the start of the current emergency. But that alone might not be enough.
In an effort to give a boost to TDM efforts, the COVID-19 Open Research Dataset (CORD-19) was released by the Allen Institute for AI (AI2) (Semantic Scholar), and indeed this is the dataset that most machine learning and natural language processing (NLP) experts are focusing their efforts on.
This large dataset combines metadata and full text aggregated from diverse sources including
PMC
medRxiv
bioRxiv
WHO paper list
Direct contributions from publishers such as Elsevier (interestingly, it is unclear how much this adds, given that 30 leading publishers have committed to making coronavirus-related publications, and the available data supporting them, immediately accessible in PubMed Central)
Full text identified using Unpaywall
At the time the preprint on the dataset was written (22 April 2020), it had 73k metadata records (28.6k from PMC, about a thousand each from medRxiv, bioRxiv and the WHO list, with the remainder from publishers), and about 80% of deduplicated records had full text available.
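Exploring the released metadata usually starts with tallying records by source. Here is a minimal sketch, assuming a `metadata.csv` layout with a `source_x` column as in the released file (the column names are my assumption from that release and may vary between versions; the sample rows below are invented):

```python
# Sketch: tallying CORD-19 metadata records by aggregation source.
import csv
import io
from collections import Counter

# A few fake rows standing in for the real metadata.csv download.
sample = io.StringIO(
    "cord_uid,source_x,title,has_full_text\n"
    "ug7v899j,PMC,Example paper one,True\n"
    "02tnwd4m,medRxiv,Example preprint,True\n"
    "ejv2xln0,PMC,Example paper two,False\n"
)

def count_by_source(fh):
    """Count metadata records per aggregation source."""
    return Counter(row["source_x"] for row in csv.DictReader(fh))

counts = count_by_source(sample)
print(counts)  # Counter({'PMC': 2, 'medRxiv': 1})
```

The same few lines, pointed at the real download, reproduce the per-source breakdown quoted above.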
A very impressive undertaking, made possible by a collaboration of organizations including the White House Office of Science and Technology Policy (OSTP), the National Library of Medicine (NLM), the Chan Zuckerberg Initiative (CZI, you may have heard of Meta), Microsoft Research (Microsoft Academic), and Kaggle.
Besides the CORD-19 dataset, they also provide other supplementary files including
Mapping to Microsoft Academic Graph
CAS COVID-19 Anti-Viral Candidate Compounds -(drawn from CAS REGISTRY of chemical substances)
SPECTER embeddings of the dataset (Scientific Paper Embeddings using Citation-informed TransformERs - a language model like BERT, but one that also takes into account citations between documents)
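The SPECTER embeddings file makes paper-level similarity search almost trivial: rank papers by cosine similarity of their vectors. A toy sketch with tiny stand-in vectors (the real embeddings are 768-dimensional, and the paper identifiers here are invented):

```python
# Sketch: ranking papers by similarity using document embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny stand-in embeddings keyed by a made-up paper id.
embeddings = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.0, 0.1, 0.9],
}

query = embeddings["paper_a"]
ranked = sorted(
    ((uid, cosine(query, vec)) for uid, vec in embeddings.items() if uid != "paper_a"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # paper_b is closest to paper_a
```

Because the embeddings are citation-informed, papers that cite similar literature end up close together even when their wording differs.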
The importance of the work done to generate the CORD-19 dataset
My understanding of the importance of their work was greatly enhanced by watching the webinar below where Lucy Lu Wang and Kyle Lo of AI2 talk about the work they do.
It also made me curious enough to read the preprint on CORD-19 by Lucy Lu Wang and Kyle Lo et al. for more details and my blog post will quote liberally from it.
Essentially, they had to take all the data from these diverse sources and:
Dedupe based on metadata
They take a conservative approach, creating clusters based on persistent identifiers (PIDs) - DOI, PMC ID, PubMed ID - and, as new papers are added, trying to merge them in. If there are conflicts in PIDs, they will not merge
Once clusters have been created, they harmonize the metadata of the papers in each cluster by selecting a canonical/main metadata record
The main metadata record selected is the one that has an associated PDF and the most permissive license; if there are missing fields in this record, they are supplemented with values from other records in the cluster
Process the full text
If it is a PDF, they parse it into TEI XML using GROBID, then convert that into JSON and clean it up
Some sources (PMC, and now bioRxiv too!) provide JATS XML as well, so they parse that too, providing an additional JSON file for TDM.
They also do some filtering of content that isn't really papers (e.g. tables of contents, indices), but beyond that the help documents warn users of this dataset not to assume the data quality is totally clean, and to take responsibility for the output they generate using the dataset.
A lot of this reminds me of the work (particularly the metadata dedupe and harmonization) that the people at ProQuest/Ex Libris and EBSCO (EDS) do to create and maintain the central indexes that library discovery systems use.
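The conservative PID-based merging described above can be sketched roughly like this. Two records merge only if they share at least one identifier and none of their identifiers conflict (the function and field names are mine, not from the CORD-19 code):

```python
# Sketch of conservative dedup: merge records only on agreeing PIDs.

def compatible(a, b):
    """True if records share at least one PID and none of their PIDs conflict."""
    shared = False
    for key in ("doi", "pmcid", "pubmed_id"):
        va, vb = a.get(key), b.get(key)
        if va and vb:
            if va != vb:
                return False  # conflicting PIDs: never merge
            shared = True
    return shared

def cluster(records):
    """Greedily assign each record to the first fully compatible cluster."""
    clusters = []
    for rec in records:
        for c in clusters:
            if all(compatible(rec, member) for member in c):
                c.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

records = [
    {"doi": "10.1/abc", "pmcid": "PMC1"},
    {"doi": "10.1/abc"},                   # same DOI, no conflict -> merges
    {"doi": "10.1/abc", "pmcid": "PMC2"},  # conflicting PMC ID -> new cluster
]
print(len(cluster(records)))  # 2 clusters
```

The point of being conservative is visible in the third record: even though the DOI matches, the conflicting PMC ID keeps it in its own cluster rather than risking a bad merge.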
Why parsing of PDF into JSON is important
A lot of this might sound boring and technical to some, but it is essential work that gives data scientists and researchers a baseline to start from. In particular, the parsing of PDF to structured JSON is critical.
As Lucy Lu Wang and Kyle Lo note in the preprint
"The full text is presented in a JSON schema designed to preserve most relevant paper structures such as paragraph breaks, section headers, and inline references and citations. The JSON schema is simple to use for many NLP tasks, where character level indices are often employed for annotation of
relevant entities or spans"
In other words, by providing the content in a structured format with context (a simple example: marking up the part of a document that is the abstract with a tag), machine learning tasks become much easier. In the webinar they talk about adding more features and annotations, e.g. parsing of tables and figures (which was eventually done in the latest release).
If you are curious, you can look up the JSON schema here, where you can see they tag parts like the abstract, sections and paragraphs.

Here's a simple example. This is a paper available for free via Pubmed Central.
This is how it looks to a human.

This is how part of the parsed JSON looks like.

I have expanded only the part for the body text and the part relating to the first paragraph in the body text.
You can make out how the tags provide context by telling you that this part of the body text is the first paragraph, and which parts of the body text refer to an inline citation or to a table.
"character level indices" refers to when a certain annotation say inline citation starts in terms of characters count.
The key point to understand is that PDF, which is optimised for reading and printing by humans, is not a suitable format for doing TDM, and converting it into structured formats like XML or JSON isn't straightforward. (The example above is a PMC article, one of the few sources that already provide structured XML that can be more easily converted to JSON, but most sources won't.)
But once you have the text in a structured format such as JSON, it is far easier to do analysis.
In the preprint, under "Call for action", they call out the practice of providing PDFs only as a challenge to doing TDM.
"First, the primary distribution format of scientific papers – PDF – is not amenable to text processing. The PDF file format is designed to share electronic documents rendered faithfully for reading and printing, not for automated analysis of document content. Paper content (text, images, bibliography) and metadata extracted from PDF are imperfect and require significant cleaning before they can be used for analysis"
I suppose there will be some TDM experts who prefer to parse the PDFs themselves, but I suspect the vast majority will prefer the data already properly parsed so they can get on with the exciting bits of building TDM applications.
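For those who do parse PDFs themselves, GROBID's intermediate output is TEI XML, and pulling text out of it is a short exercise in namespace-aware XML parsing. A minimal sketch (the TEI fragment below is hand-made, not real GROBID output):

```python
# Sketch: extracting paragraph text from a (hand-made) TEI XML fragment
# of the kind GROBID produces when it parses a PDF.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <p>First paragraph of the paper.</p>
      <p>Second paragraph with a citation.</p>
    </div>
  </body></text>
</TEI>"""

root = ET.fromstring(tei)
# TEI puts everything in its own namespace, hence the tei: prefix.
paragraphs = [p.text for p in root.findall(".//tei:body//tei:p", TEI_NS)]
print(len(paragraphs))  # 2
```

Real GROBID output is far richer (inline `<ref>` elements, coordinates, header metadata), which is exactly why the CORD-19 team's clean conversion to JSON saves everyone so much work.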
The response of the community to CORD-19
I guess by most measures CORD-19 has been a hit with the ML/NLP/TDM community. Researchers have responded, and the preprint notes the CORD-19 dataset was viewed 1.5 million times and downloaded 75k times in one month! It is so well known that I saw a faculty member from my institution (in accounting analytics) comment on it on LinkedIn.
Part of it is because the team has been actively engaging the community on Discord and Kaggle, and there is even a TREC-COVID challenge (Google Group), currently in its second round (the first round received 143 runs from 56 participating teams).

What have people been doing with CORD-19 dataset?
Despite all the interest in the dataset and the amount of brainpower devoted to it, it is still early days: the CORD-19 dataset is less than 2 months old, so we can't really expect a lot.
Still, we can see a wide variety of tools and approaches already out there (see also this part of the webinar), which are the fruits of CORD-19 combined with other sources.
As a layperson with little medical or biological background, it is hard for me to tell how useful these tools are. Most of the information retrieval search tools don't seem much better than a Google or Google Scholar search to me.

But then again, tools like Google and Google Scholar in particular probably do benefit from access to full text and do apply state-of-the-art ML/NLP for information retrieval (e.g. Google uses BERT at least for search queries), so could a layman like me even tell whether we have been benefiting from this all along?
But to be fair, someone like me does not really have an information need and these tools may not be designed for me anyway.
Rather, they are designed for medical researchers in the field, who have nuanced understanding and domain knowledge and can take advantage of the various tools on offer. In particular, tools that focus on information extraction, which can potentially help researchers see patterns, will mean nothing to an uninformed person like me.
To put it as a rough analogy: if you are lost in the woods and have a tool that can help you see the forest for the trees, but you have no domain knowledge about trees at all, it isn't going to help you.
Take SciSight, which extracts protein/gene/cell entities and shows the relationships between them across papers. For someone not trained in STEM, the visualization tells me precisely nothing.

SciSight
As Kyle Lo notes, it is important to keep medical experts in the loop to provide feedback on what is useful. So someone like me isn't really the audience.
They highlight the following project to create a semi-automated living literature review. Findings are extracted as with the usual Kaggle questions, but each finding is extracted into a schema that is useful for researchers, who in turn provide feedback on what other items need to be extracted.

https://www.kaggle.com/covid-19-contributions#Hypertension
Overall though, the potential of both open-source tools like CERMINE, GROBID and ContentMine and non-open-source tools like Scholarcy to extract entities and concepts, and to parse PDF and text into structure such as references, opens up a whole world of possibilities for building useful TDM applications.
Apply them over a big enough set of metadata and full text and you get pretty much CORD-19 - a dataset you can build on.
Challenges faced in TDM of journal articles
Despite the existence of CORD-19, we are still a bit away from the ideal state of affairs where we can unleash the full potential of TDM.
The fact that most papers are supplied as PDF rather than directly in structured XML by publishers is a challenge that has already been mentioned.
The other problem is that while we have a good set of historical and current papers on coronaviruses available, there is a lack of transparency from publisher sources on what criteria were used to supply the full text. Might a paper cited in or by the CORD-19 dataset actually be related research that was simply not made available?
And even if we are convinced that we have a fairly good set of papers on coronaviruses, CORD-19 does not directly cover topics of interest like ventilators, PPE, etc. As the preprint notes:
"there is a clear need for more scientific content to be made easily accessible to researchers. Though many publishers have generously made COVID-19 papers available during this time, there are still bottlenecks to information access. For example, papers describing research in related areas (e.g., on other infectious diseases, or relevant biological pathways) are not necessarily open access, and are therefore not available in CORD-19"
Another issue with the current dataset relates to the licenses for the full text released. While a lot of the full text is licensed under CC BY licenses, quite a few items (from publishers) are under other, less open licenses. For example, Elsevier has a special publisher license.
One question is whether such licenses have an expiry period - for example, would use still be allowed after the end of the COVID-19 crisis?
See the analysis by Dutch librarian Bianca Kramer of how much of the full text in the CORD-19 dataset falls under each license.

https://www.semanticscholar.org/cord19/download (datasets as of 10 May 2020; the latest version has combined them and no longer separates by license type)
The last issue pointed out by the CORD-19 preprint relates to metadata. More precisely, they state that there is no appropriate metadata schema with widespread adoption, so a lot of effort must be spent just harmonizing the metadata.
"Lastly, there is no standard format for representing paper metadata. Existing schemas like the NLM’s JATS XML NISO standard, Crossref’s bibliographic field definitions, or library science standards like BIBFRAME or Dublin Core have been adopted as representations for paper metadata. However, there are issues with these standards; they can be too coarse-grained to capture all necessary paper metadata elements, or lack a strict schema, causing representations to vary greatly across publishers who use them. Additionally, metadata for papers relevant to CORD-19 are not all readily available in one of these standards. There is therefore neither an appropriate, well-defined schema for representing paper metadata, nor consensus usage of any particular schema by different publishers and archives. To improve metadata coherence across sources, the community must come together to define and agree upon an appropriate standard of representation"
Conclusion
It is perhaps fair to say that, despite other interesting datasets like this Lens.org one on patents cited by works, CORD-19 has been the go-to dataset used by machine learning and NLP researchers to showcase the potential of TDM. It provides a readily available source of parsed data for anyone to try their hand at TDM without doing most of the "boring bits" (e.g. aggregation, harmonization of data, parsing of PDFs).
I would speculate that the amount of effort and brainpower now focused on doing TDM on this dataset of journal articles is probably nearly unprecedented.
While drafting this blog post, I wondered whether the value of being able to do TDM on journal articles might be a bit overstated. But now that I think about it, even Elsevier boasts about the power of its TDM services in STEM, so there is no doubt there are gains to be made, though they might sometimes be hard to quantify (e.g. would TDM be able to predict promising treatments the way it can predict promising material applications to try in materials science, in advance of the actual work?). I suspect the gains such applications bring might be somewhat indirect and difficult to pin down.
Still, one wonders: if CORD-19 is indeed the scientific world's best chance to realise and show the value of bulk TDM of journal articles, will this be recognized? Perhaps someone should start recording and showing evidence of this?
While I can imagine many TDM and open access advocates shaking their heads at how dense I am for wondering about this, I would imagine that many if not most librarians, and even researchers who are not as enlightened, would benefit from a clearer specification of the benefits of TDM; having something concrete to point to would be very much appreciated.
That would perhaps give a final shot in the arm to the open access movement by showing the actual gains of bulk TDM.

