Some data discovery updates - Figshare update, Querying PID graphs & DataCite Commons launches
Two years ago, in 2018, I started to realise the importance and value of data discovery and slowly began studying the various challenges of making datasets discoverable.
Google eventually launched Google Dataset Search, while Crossref and DataCite began working together to enable linking between journal articles with Crossref DOIs and datasets with DataCite DOIs.
Since then, many more institutions have started to launch data repositories, and some of them, including my own, have chosen to go with Figshare for their data repository.
Note: As I write this, there is some hand-wringing due to a blog post by 4TU.ResearchData where they set out their reasons for moving from open source software to a commercial hosted solution - Figshare, rather than choosing an open source solution like Zenodo or Dryad. Zenodo and Dryad have since responded.

Here are some updates on this area of data discovery that caught my eye this year so far.
1. Figshare updates linking to resource DOI field (now no longer mandatory).
2. Querying PID graph with Jupyter Notebooks
3. DataCite Commons, a single search interface for Works, People, and Organizations, launches
1. Figshare updates linking to resource DOI field (now no longer mandatory).
This takes a bit of explanation, but the "too long; didn't read" summary is that Figshare has now made the fields "Resource title" and "Resource DOI" non-mandatory in Figshare institutional installs.
These two fields allow links or relationships between the DOIs of data deposits and other DOIs (notably Crossref DOIs of articles) to be captured. While these fields have been available for a while, in the past they could only be turned on as mandatory fields, which is a problem for many universities using Figshare, since not all deposits into data repositories have a related DOI to link to. If you need a bit more detail, read on.

In my past blog posts, I talked about the importance of linking the datasets that support a paper with the paper itself. This allows users to move from a dataset to the papers that use it, or vice versa from a paper to a dataset, which supports use cases like showing the value of a dataset or helping researchers find suitable datasets to use.
Part of the solution involves giving each dataset or paper a PID (persistent identifier), so one can be precise about what one is referring to.
Sidenote: Though most people will be most familiar with DOIs (Digital Object Identifiers) for articles and datasets, other non-DOI PIDs do exist and are also applied to datasets.
On paper, co-ordinating linkages among all these different PIDs sounds hard to do. Enter the Scholix framework, which makes it possible to create such data-literature links.
Without going into details, instead of expecting every article and data repository out there to implement the same standard, Scholix envisions that standardisation be done by existing major hubs and global aggregators. Examples of such hubs include DataCite, Crossref, Figshare, OpenAIRE, SHARE, ICPSR, Australian National Data Service (ANDS) etc.
The advantage of this is that individual repositories just have to ensure their contents are aggregated by one of these major hubs, and can then expect their content to be discoverable.

http://www.dlib.org/dlib/january17/burton/01burton.html
The importance of such data-literature links
As noted above, being able to track datasets to articles or vice versa is extremely valuable. In particular, Scopus was an early (2017!) adopter of Scholix, and this allows it to surface related datasets on its article detail pages.

A recent paper - Identifying Data Sharing and Reuse with Scholix: Potentials and Limitations - also studied how useful the Scholix data was (via the Scholexplorer API) for libraries.
They noted
"institutions like the University of Manchester, Durham University, and the University of Illinois at Urbana-Champaign have explored the Scholexplorer data and developed individual processes to incorporate it into their systems"
Some of these use cases include pulling datasets to display in Bento searches, or for identifying, linking and publicising research output in Social Media when publicising newly published papers.

UIUC Library's bento box shows datasets found by querying Scholix's API
The paper itself found that using this system they were able to identify previously unknown dataset-article links for University of Bath-affiliated datasets.
At the time, only 48 such datasets were known (found or reported by researchers or data librarians). However, using the Scholexplorer API to query Scholix links, they found a lot more!
"The Scholexplorer API identified 1,501 unique research outputs with at least one University of Bath author linked to at least one data-set, a 31-fold increase. In total 5,002 datasets were associated with these 1,501 research outputs, where one output is linked to one or more datasets."
They also attempted to find cases of data reuse (as opposed to primary use by the author of the paper and dataset) and found 10 cases. Unfortunately:
"(7 out of 10) were not included in the reference section of the articles, six studies reused data from ICPSR and mentioned it in the methods section, and two cited the associated survey websites but not the datasets."
They concluded that such links, in particular those involving ICPSR, were probably made manually by staff at ICPSR and not by the authors of the papers who reused the datasets.
At this time, this is the first and only paper I'm aware of that studies data discovery via Scholix in depth, and I recommend you give it a read.
The importance of Crossref DOIs and DataCite DOIs
So clearly, creating such links using the Scholix framework can be very useful.
As an institutional repository manager, you can in theory make your institutional repository into a Scholix hub by registering with the service and exposing your data-literature links using the Scholix schema, but in practice this can be complicated.

In reality, most scholarly publishers register their papers with Crossref DOIs, and datasets are often (but not always) deposited in repositories like Zenodo, Figshare, Dryad and other institutional repositories that mint DOIs from DataCite. This makes things much easier because Crossref and DataCite are themselves Scholix hubs.
Because agreements are already in place between Crossref and DataCite, one can just register the right metadata in the Crossref or DataCite metadata schema to create such a link.
I went into quite a bit of detail on the exact changes needed, and the various relationship types that can be recorded in the Crossref and DataCite records to make the link, in another blog post, but for this post it doesn't really matter.
All that you need to know is that linkages between journal articles and datasets can be done from either end - either from the journal article doi to dataset doi or vice versa.
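To make the "either end" point a little more concrete, here is a minimal Python sketch of how the same link can be stated from the dataset side or the article side. The entries are shaped like DataCite relatedIdentifier metadata and the relation type names are taken from the DataCite metadata schema. The helper names are hypothetical, and the choice of "IsSupplementTo" for the example pair (the dataset and article DOIs discussed later in this post) is an assumption for illustration. The Crossref side uses its own schema, which is not shown here.

```python
# A few pairs of inverse relation types from the DataCite metadata schema.
INVERSE = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
    "Cites": "IsCitedBy",
    "IsCitedBy": "Cites",
    "References": "IsReferencedBy",
    "IsReferencedBy": "References",
}


def related_identifier(doi, relation_type):
    """Build one DataCite-style relatedIdentifier entry (hypothetical helper)."""
    if relation_type not in INVERSE:
        raise ValueError(f"unsupported relation type: {relation_type}")
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


def inverse_link(source_doi, entry):
    """Express the same link as seen from the other end (hypothetical helper)."""
    return related_identifier(source_doi, INVERSE[entry["relationType"]])


dataset_doi = "10.25440/smu.12062943.v1"
article_doi = "10.3982/ecta12560"

# Statement recorded on the dataset's record: "this dataset supplements the article".
from_dataset = related_identifier(article_doi, "IsSupplementTo")
# The equivalent statement if it were recorded from the article's end instead.
from_article = inverse_link(dataset_doi, from_dataset)

print(from_dataset)
print(from_article)
```

Either statement on its own is enough to describe the link, which is why a deposit form on the repository side can compensate when the publisher side records nothing.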
Unfortunately, as we have seen in the paper mentioned earlier, data citation by authors isn't common yet, and currently not all journal publishers bother to create a link between the journal article and dataset DOIs from the journal end (from Crossref DOI to DataCite DOI). So a lot of such linkages have to be made on the data repository side when the dataset is deposited.
Figshare fixes a "tiny" bug
Of course, if you are a researcher, you don't care about all this. All you want when depositing your dataset in a repository is a simple field where you can indicate the relationship of that dataset to a journal article.
In Zenodo, this is indeed possible.

Zenodo interface for researchers to add related identifiers and their relationships with the submitted dataset
The same applies for Mendeley Data.

Mendeley Data interface for researchers to add related identifiers and their relationships with the submitted dataset
I noted way back in 2018 that the other popular data repository, Figshare (at least the free version), did not have such an option.
The closest it had was a "References" field, which unfortunately was for display purposes only and had no other impact.

What I did not mention at the time, though I was aware of it, is that the institutional version of Figshare had in fact had metadata fields for "Resource Title" and "Resource DOI" for quite a while already.

Filling in those fields gives you a nice call-out box with a link from the dataset deposited in Figshare to the published paper.
In the example below, you can see that the dataset deposited in Figshare with DataCite DOI (10.25440/smu.12062943.v1) is related to/supplements the publication with Crossref DOI (10.3982/ecta12560).

https://researchdata.smu.edu.sg/articles/dataset/Data_from_Identifying_latent_structures_in_panel_data/12062943
But this is not purely cosmetic; it also affects the metadata in DataCite, which, as you have seen above, enables a link between the dataset and the article.

JSON-LD of dataset DOI - 10.25440/smu.12062943.v1
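If you want to check this for your own deposits, here is a minimal Python sketch that lists the related identifiers recorded for a DOI. It assumes the public DataCite REST API at https://api.datacite.org/dois/ and that the record exposes a relatedIdentifiers list under data.attributes. The function names are hypothetical, and the sample record is illustrative only, not a copy of the real metadata.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public DataCite REST API.
DATACITE_API = "https://api.datacite.org/dois/"


def fetch_record(doi):
    """Fetch one DOI record from the DataCite REST API (needs network access)."""
    url = DATACITE_API + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def related_links(record):
    """Return (relationType, identifierType, identifier) tuples from a record."""
    attrs = record.get("data", {}).get("attributes", {})
    return [
        (r.get("relationType"), r.get("relatedIdentifierType"), r.get("relatedIdentifier"))
        for r in attrs.get("relatedIdentifiers") or []
    ]


# Offline sample shaped like an API response; the values are illustrative only.
sample = {
    "data": {
        "attributes": {
            "relatedIdentifiers": [
                {
                    "relationType": "IsSupplementTo",
                    "relatedIdentifierType": "DOI",
                    "relatedIdentifier": "10.3982/ecta12560",
                }
            ]
        }
    }
}

for link in related_links(sample):
    print(link)

# With network access, the live record could be inspected the same way:
# print(related_links(fetch_record("10.25440/smu.12062943.v1")))
```

An empty list from related_links would suggest the "Resource DOI" field was never filled in for that deposit.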
So you might be wondering: if this feature is already there, why am I blogging about it now?
The reason is that while this feature has been available to institutions using Figshare, pretty much none of them have turned on these fields for researcher use.
Why not? Because the fields, if turned on, had to be made mandatory. This is obviously not doable for the many data deposits made to institutional repositories that have no use for these fields.
All this has changed with the latest Figshare release - 03082020, where they are now usable as optional fields.
"For some time we have had the option to have a prominent link out to the published version of an article but to enable this feature it must be present for every object within the repository. Whilst this worked fine for our publisher clients, this does not work for institutions. We have now enabled this as an optional metadata field available to all."
Update 3 months later in Jan 2021
The linking between the publication and our dataset on Figshare has finally been picked up by Scholix and Scopus. As such, if you go to the Scopus page of the article - 10.3982/ecta12560, you will find the link to the dataset in our repository!

Scopus picks up Scholix link with a link to the Figshare dataset under "Related Research Data"
The next thing I am excited to hear about is Figshare's support of Make Data Count.
If I understand this correctly, this will enable usage data from Figshare to be standardised and made available via DataCite.

2. Querying PID graph with Jupyter Notebooks
Being able to track relationships between Datasets and articles using PIDs is great. But here's a problem.
Say you wanted to find all items linked to outputs by authors affiliated with an institution. How would you do it, since Scholix does not have affiliation search?
As noted by the already mentioned paper, the way around it was to first collect all the DOIs of papers by authors affiliated with the institution somehow, and then check each one.
"This was required because the Scholexplorer API does not yet support affiliation search, and to be able to use the University of Bath’s Research Data Archive (UBRDA) as a benchmark to compare against the Scholexplorer API output"
But is there an easier way? What if there is a way to query by affiliation directly?
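For comparison, the DOI-by-DOI workaround described above might look roughly like the Python sketch below. It is not the paper's actual code. The endpoint URL, the sourcePid parameter and the response field names (result, target, Type, Identifier, ID) are assumptions about the Scholexplorer v2 Links API, so check them against the current documentation before relying on them. The helper names, the example DOIs and the sample records are made up so that the sketch runs offline.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Scholexplorer v2 Links API.
SCHOLEXPLORER_LINKS = "https://api.scholexplorer.openaire.eu/v2/Links"


def links_url(doi):
    """Build the query URL for links whose source is the given DOI."""
    return SCHOLEXPLORER_LINKS + "?" + urllib.parse.urlencode({"sourcePid": doi})


def fetch_links(doi):
    """Fetch the link records for one DOI (needs network access)."""
    with urllib.request.urlopen(links_url(doi), timeout=30) as resp:
        return json.load(resp).get("result", [])


def dataset_links(dois, lookup=fetch_links):
    """Map each DOI to the identifiers of the datasets linked to it."""
    found = {}
    for doi in dois:
        targets = []
        for link in lookup(doi):
            target = link.get("target", {})
            # Keep only links that point at datasets (assumed "Type" value).
            if str(target.get("Type", "")).lower() != "dataset":
                continue
            targets.extend(i.get("ID") for i in target.get("Identifier", []))
        if targets:
            found[doi] = targets
    return found


# Made-up records standing in for API responses, so the sketch runs offline.
SAMPLE = {
    "10.1234/example-article": [
        {"target": {"Type": "dataset",
                    "Identifier": [{"ID": "10.1234/example-dataset", "IDScheme": "doi"}]}},
        {"target": {"Type": "publication",
                    "Identifier": [{"ID": "10.1234/other-paper", "IDScheme": "doi"}]}},
    ]
}


def fake_lookup(doi):
    """Stand-in for fetch_links that reads from SAMPLE instead of the network."""
    return SAMPLE.get(doi, [])


institution_dois = ["10.1234/example-article", "10.1234/no-links"]
print(dataset_links(institution_dois, lookup=fake_lookup))
```

The hard part is not this loop but building the list of institutional DOIs in the first place, which is exactly the step an affiliation-aware query would remove.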
Enter the PID graph.
Ideally, authors have an author PID - ORCID.
The organizations they are from would be identified by an organization PID - ROR.
A PID graph links up all these different PIDs, which allows useful queries like "show me all article PIDs that are from authors (ORCID) from a certain organization (ROR)".

https://blog.datacite.org/introducing-the-pid-graph/
But how does one really do this? GraphQL is the query language used, but it can be challenging to work from scratch if you are not a coder.
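To give a flavour of what such a query looks like, here is a minimal Python sketch that builds a GraphQL request for one organization. It assumes the DataCite GraphQL endpoint at https://api.datacite.org/graphql, and the field names in the query (organization, name, works, totalCount) are assumptions about its schema, so treat the query as a starting point rather than a reference. The function names are hypothetical, and the ROR ID used is the one for Oxford University that appears later in this post.

```python
import json
import urllib.request

# Assumed endpoint of the DataCite GraphQL API.
GRAPHQL_ENDPOINT = "https://api.datacite.org/graphql"


def build_payload(ror_id):
    """Build the JSON body for a query about one organization.

    The field names are assumptions about the schema; json.dumps is used
    only to quote the ROR ID safely as a string literal.
    """
    query = "{ organization(id: %s) { name works { totalCount } } }" % json.dumps(ror_id)
    return {"query": query}


def run_query(payload):
    """POST the payload to the endpoint and return the decoded JSON (needs network)."""
    req = urllib.request.Request(
        GRAPHQL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


payload = build_payload("https://ror.org/052gg0110")
print(payload["query"])

# With network access:
# print(run_query(payload))
```

Swapping in another institution is then just a matter of passing a different ROR ID to build_payload.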
A recent FREYA webinar shared Jupyter notebooks that provide code for 10 common use cases drawn from user stories, which you can use as a base to modify.
https://www.youtube.com/watch?v=I7MUTFvjPzo&t=73s
Some use cases covered in the 10 notebooks include
The cool thing is these Jupyter notebooks on GitHub are set up to support Binder, so you can run them in the cloud with a single click without installing any software locally.

Click on "Launch Binder" button to launch Notebook for this user story
I ran the default code for User Story 3 just to test it out. This gives you "Counts of citations, views and downloads metrics, aggregated across all of the organization's outputs", and "all outputs" here means both publications and datasets.
In theory I could have replaced the ROR ID for Oxford University in the code with another institution's ROR ID (search for it in the ROR registry), but for now I just ran it directly.

Search for your institution's ROR ID

Change the ROR to your institution.
Running the notebook produces various interactive graphs.


In reality, if you try to run a lot of these notebooks for your institution, you might get very little output. (Also, it can be quite slow.)
This is because quite a few of these PIDs are quite new, in particular the organization ID ROR, and even ORCID, which is a decade old, isn't used consistently with publications or datasets to identify authors.
Of course, if your query relies on relationships between datasets and journal articles being recorded, you will again miss quite a bit if the correct relationships between them have not been made, which is why the Figshare update is so important.
Hopefully, as time goes by it will get better.
3. DataCite Commons, a single search interface for Works, People, and Organizations, launches

If querying the PID graph is still a bit hard to wrap your head around because Jupyter notebooks are new to you, you might be interested in the related DataCite Commons search function.
"DataCite Commons describes works, people, and organizations, and their connections and allows users to search for them. They are identified by persistent identifiers (PIDs): works (DOI), people (ORCID ID), and organizations (ROR ID), and have standard metadata that describe them and the connections to each other. Together they form the PID Graph, which is powered by the DataCite GraphQL API. DataCite Commons provides a public web search interface to the PID Graph."
In other words, one can search by DOIs, ORCIDs or RORs and crosslink to other PIDs, much like querying a PID graph, just by searching.
Below I run a search using https://ror.org/052gg0110, which is the ROR for Oxford University. Clicking on the result brings me to a "This page" tab that shows data about works linked with Oxford University.

https://commons.datacite.org/ror.org/052gg0110
In another example, I search for myself in the "people" tab, select my ORCID profile, and on the "this page" tab I get the following:

https://commons.datacite.org/orcid.org/0000-0003-0159-013X
More savvy readers might notice patterns in the URL string used to do the querying, but for more details refer to the documentation here - https://support.datacite.org/docs/datacite-commons
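The pattern visible in the two URLs above is simply the PID with its "https://" stripped, appended to the Commons address. Here is a minimal Python sketch of that observation. The function name is hypothetical, and the pattern is inferred only from the two example pages in this post, so it is limited to ROR and ORCID identifiers rather than being a documented rule.

```python
COMMONS = "https://commons.datacite.org/"
SUPPORTED_HOSTS = ("ror.org", "orcid.org")


def commons_url(pid):
    """Turn a ROR or ORCID URL into the matching DataCite Commons page URL."""
    # Drop the scheme so that only "host/identifier" remains.
    for prefix in ("https://", "http://"):
        if pid.startswith(prefix):
            pid = pid[len(prefix):]
            break
    host = pid.split("/", 1)[0]
    if host not in SUPPORTED_HOSTS:
        raise ValueError(f"only ROR and ORCID URLs are handled in this sketch: {pid}")
    return COMMONS + pid


print(commons_url("https://ror.org/052gg0110"))
print(commons_url("https://orcid.org/0000-0003-0159-013X"))
```

This makes it easy to generate a list of Commons pages for, say, all the ORCIDs in a department without searching for each one by hand.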
Conclusion
Data Discovery is a very tough problem and we are only in the very early stages of trying to figure this beast out, but it's nice to see some progress being made.

