Datasets as a first class entity - preliminary musings
When I first came to my current institution, I remember meeting a faculty member who kindly explained to me how he worked, and what struck me was how unimportant journal articles were to him, and how important, relatively speaking, datasets were.
He said, and I paraphrase: "I don't really need that many articles (I usually only read and cite a few top journals), and those I can easily find using Google Scholar, but if I don't have the data, I can't do my research!"
Granted, this is just one faculty member in one specific area (accounting/finance), but I think he has a point. As an academic librarian who has been in the field for ten years with some interest in discovery, I've seen great improvement in the discovery of articles.
For example, when I was a newbie librarian, we didn't have discovery services, and to find a specific article you had to search by journal title to figure out which databases had what you needed. Nowadays few people, even librarians, do this (exceptions are in law and certain rare areas in the humanities where older journals are not digitised).
Some will take exception to the claim that discovery of articles is mostly a solved problem, and they are not wrong. Over the years, your author has blogged about quite a few remaining issues, and there are exciting new innovations on the horizon thanks to the combination of full text availability, text and data mining, and semantic search/linked data techniques.
Still, I think it would be fair to say that discovery of articles, if not a solved problem, has advanced to such a state that it's no longer the biggest pain point for the majority of our users. Instead, the focus should now be on discovery of datasets, which is in a dismal state and, I believe, has barely advanced in the last ten years.
My thoughts in this area are at a very early stage and still developing, but here is a think-out-loud piece on my understanding of the infrastructure that currently exists to support discovery of datasets, and my dream of what could be possible when, or if, datasets are treated as a first class entity (as articles currently are).
The importance of datasets
For many fields, datasets are the lifeblood of research. As I like to tell students starting off on academic research, the key to research is being able to spot an interesting research question. While anyone can come up with interesting ideas, for empirical research (which is most common) whether the question is answerable depends hugely on your ability to get data.
You can collect the data yourself (surveys, scraping, APIs, etc.) or get it from some other source - typically by buying it from commercial providers or reusing existing data (from closed sources or open repositories).
The easier it is to get a dataset (open datasets first, followed by open APIs), the more likely it is that people have already extracted most of the value from it, and the harder you need to think about using the dataset in non-obvious ways or combining it with other datasets.
Given this, it seems criminal how hard it is to figure out what datasets are already out there. This is often true for both open datasets and paid datasets that the library subscribes to.
In the area I support - finance and accounting - I get the impression that researchers search Google Scholar, look at what datasets are generally being used by papers that interest them, and then email me to ask if we have those datasets or similar data series.
They enquire because knowing whether we have a certain data series - say, the gender of board members of all Russell 1000 companies from 2000-2010 - is not simply a search away in a typical library's discovery service.
The finance and accounting area is dominated by commercial datasets owned by companies like Bloomberg, Reuters and other specialised firms. Unfortunately, these don't usually work with typical library discovery infrastructure (discovery services, authentication services) - something that is true to a lesser degree of business-related resources generally - so discovery is non-standard, and it takes effort to set it up even at the dataset level, much less at the data series level.
How do you know if a dataset is worth buying?
One of the most difficult problems I face is justifying the value of expensive datasets that researchers want me to buy. It's very difficult to quantify how useful a dataset is. Will it be used just by this one researcher? How do we even show the value or ROI (return on investment) of the dataset?
We tend to fall back on the policy that individual specialised datasets of use to only one researcher should come out of that researcher's funds, but how do we know whether a dataset will only ever be used by one researcher? And assuming we do buy the dataset, how can we then show it was used?
These were the questions that led me down this rabbit-hole of thinking about the problem of data.
Firstly, whenever I am asked to purchase a dataset, I obviously search for it in Google Scholar to see which articles mention it. These days I also try other big full-text mega-indexes like Dimensions and Microsoft Academic because they provide better faceting, e.g. I can filter down to an area like accounting.
The idea here is to see how people are using the dataset and in what areas, and most importantly to get a sense of whether the dataset has only recently begun to be used. For instance, in the area of CEO and director compensation, are people moving away from ExecuComp and BoardEx towards Equilar datasets?
If we do buy the dataset, I make it a point to share with faculty not just that the dataset is available but also a quick link to Google Scholar, Dimensions or Microsoft Academic showing how other researchers are using it.

In the example above, we just subscribed to Dealscan, a dataset on loan packages available on the WRDS (Wharton Research Data Services) platform. Doing a search in Dimensions restricted to the accounting and finance fields, we get a set of papers that mention Dealscan.
We can see which journals have articles mentioning Dealscan, the years of publication (comparing with similar datasets might give you an idea of which ones are hot), the researchers using it, and more.
Another interesting one to search is Mendeley Data. It's Elsevier's new research data repository and, as you will see below, it has quite a few interesting features.
A somewhat similar move is a recent tie-up between SSRN and WRDS, where researchers who use datasets from WRDS are encouraged to publish in a special SSRN research paper series for that purpose.
Showing the value of datasets
In the first decade of the 2000s, there was a brief surge of interest in libraries showing return on investment. Many of these studies were in the area of public libraries rather than academic libraries, using econometric techniques like contingent valuation and value of next best alternative, before the tide shifted towards the current trend of showing how library services result in improved student and faculty outcomes.
The main issue with showing ROI is that you need to quantify the benefit of library services in monetary terms. Various techniques borrowed from econometrics were used, such as surveys asking users hypothetical questions like "Would you give up the use of libraries for $X per year?" (the contingent valuation approach), or calculating how much it would cost users if the library didn't exist (value of next best alternative). An example of the latter approach is a library value calculator; some libraries even printed the price of books borrowed on loan receipts.

http://www.swissarmylibrarian.net/2012/05/08/highlighting-the-value-of-library-use/
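The next-best-alternative idea can be sketched as a tiny calculator. The service names and unit prices below are purely illustrative assumptions, not figures from any real library:

```python
# A toy "library value calculator" in the next-best-alternative style:
# price each service at what it would cost the user on the open market.
# These unit prices are invented for illustration only.
PRICES = {"book_loan": 25.00, "article_download": 35.00, "interlibrary_loan": 30.00}

def session_value(usage):
    """Sum of (assumed market price x count) over the services a user consumed."""
    return sum(PRICES[service] * count for service, count in usage.items())

print(session_value({"book_loan": 3, "article_download": 2}))  # 3*25 + 2*35 = 145.0
```

The hard part, of course, is not the arithmetic but defending the assumed prices.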
One study I saw tried to solve the problem of giving a dollar value to library services by looking at grant proposals. I believe it counted the number of citations in each proposal, calculated what percentage of them came via the library, added a couple of other assumptions (e.g. the percentage of researchers who claimed the citations were important to their proposal), and then apportioned some of the grant amount as the monetary benefit of the library.
To be honest, I never felt comfortable with these calculations. My experience is that most citations aren't really critical (see also Semantic Scholar's feature for identifying meaningful citations, which I covered in a previous post), and the numbers in the surveys seemed way too high to me.
But it led me to think: if a dataset is used in a paper, the default assumption must be that it was critical!
"A scholar is just a library's way of making another library" - Daniel Dennett
So we could consider showing value by tracking the datasets the library has purchased and seeing how many papers used them.
I'm not suggesting we start doing things like calculating dataset cost per data citation, but perhaps one could do simple things like tagging institutional repository records of articles that use the datasets.
Create an "articles that exist because the library bought the dataset" collection?
This pairs very well with an old idea I tried at my prior institution: mining acknowledgements to librarians in Google Books and other sources. Using instructions generously shared by Jacque Hettel, then at Stanford, I was able to scrape a few hundred acknowledgements to the library.

Combining both ideas you could have a collection entitled "Books or articles that exist because of libraries".
The reality of things
In reality, things are not as simple. Though in recent years there has been a push towards encouraging researchers to manage their datasets properly - making them findable, accessible, interoperable and reusable (FAIR) - we are still in the early stages of that.
Besides the administrative burden involved, we of course know that researchers are reluctant to make datasets open for a myriad of reasons, so this seems a tough nut to crack. Still, it seems to me that even without sharing of datasets, proper citation of the datasets used would open up a lot of use cases.
Is free text search enough?
As it stands, we are currently restricted to doing free-text searches on dataset names. This usually works fine if the name is distinctive enough, like Equilar or BoardEx, but runs into issues when the name is generic. And even with distinctive names you will get false drops where a dataset is mentioned but not actually used, though such mentions can still be useful.
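A minimal sketch of why free-text matching is fragile. The abstracts are invented for illustration:

```python
# Naive free-text dataset detection: fine for distinctive names,
# noisy for generic ones, and blind to "mentioned" vs "actually used".
abstracts = [
    "We use BoardEx to measure director networks.",
    "Compustat and CRSP data were merged in the usual way.",
    "Our panel regression controls for firm fixed effects.",  # no dataset named
]

def mentioning(name, texts):
    """Case-insensitive substring match over abstracts."""
    return [t for t in texts if name.lower() in t.lower()]

# A distinctive name matches cleanly...
print(len(mentioning("BoardEx", abstracts)))  # 1
# ...but a generic word (imagine a dataset simply called "Panel")
# produces a false drop from an abstract that uses no dataset at all.
print(len(mentioning("panel", abstracts)))    # 1
```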
Given this problem, the Microsofts and Googles of the world will instinctively think: let's not make humans tag datasets, but use computers to solve the problem with entity extraction methods. That's nice if doable, but currently not on the cards for libraries and related industries.
In an ideal world, we could reliably identify datasets used (or even just mentioned) in an article and link directly to the dataset record - to the actual dataset if open; if not, to metadata and instructions on how to acquire it.
Aggregating, ranking and displaying datasets - a tougher challenge
I can imagine two search approaches when looking for datasets: a) searching from articles to datasets, or b) searching directly for datasets.
We considered the first above; now we consider the second.
Even if datasets were properly tagged and identified, we would run into the aggregation issue. How easy is it to search for datasets? Is there a one-stop shop for them?
With articles, anything from Google Scholar, Microsoft Academic and Dimensions to various library discovery services and indexes essentially solves this problem in 99% of cases.

With datasets, this seems harder, as datasets aren't as standardised as papers, and things are currently fragmented across disciplinary repositories, commercial ones (Figshare, Mendeley Data, Dryad, etc.) and various institutional ones. There are thousands of data repositories listed in the REgistry of REsearch Data REpositories (re3data.org). How does one search effectively across all of them? And how does one do relevancy ranking of datasets even if they were all aggregated? Surely the methods current discovery systems use to rank and display text-based articles aren't appropriate. Lots of scope for creativity here, I think.
For example, I am currently impressed by the way Mendeley Data serves up results.

Search query for Dealscan in Mendeley Data
While it searches through datasets uploaded by researchers, in practice what you currently get is mostly tables in Elsevier journals that have been identified as containing data (and it allows you to jump to the parts of the table that match your search string).
Perhaps a one-stop-shop approach isn't suitable for datasets, and a bento-style result page recommending results from the most promising identified data repositories might be the way to go?
Scholix - a first step

In a sense this is the same problem as with articles (and to some extent books and related material types), but with the difficulty dialled up: articles are relatively standardised, while primary sources like datasets are extremely varied and heterogeneous, so the discovery issue is tougher.
The current partial solution seems to be Scholix: "The Scholix Framework (SCHOlarly LInk eXchange) is a high level interoperability framework for exchanging information about the links between scholarly literature and data, as well as between datasets."
Without going into details: instead of expecting every data repository out there to implement the same standard, Scholix envisions that standardisation be done by existing major hubs and global aggregators. Examples of such hubs include DataCite, Crossref, Figshare, OpenAIRE, SHARE, ICPSR, ANDS, etc.
The advantage of this is that individual repositories just have to ensure their contents are aggregated by one of these major hubs, and can then expect their content to be discoverable.
Scholix currently concerns itself with just literature-data links; for a real-life example you can refer to the Data Literature Interlinking service (DLI Service).
I haven't really studied the details, but you can read more here:

http://www.dlib.org/dlib/january17/burton/01burton.html
Currently Scholix seems to be taking off. One major supporter is Elsevier's Scopus, which uses the Scholix API to check whether a DOI has an associated dataset and, if so, displays it on the article page, allowing users to quickly locate the dataset. This is very promising indeed.
Hopefully other search indexes will follow suit rather than rely on closed proprietary methods.
For the other case - searching directly for a specific dataset - you will have to turn to the Scholix DLI directly, or the Crossref/DataCite event data APIs.
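To make this concrete, here is a hedged sketch of working with Scholix-style literature-data links. The endpoint, query parameter and record shape below are assumptions modelled loosely on the Scholix framework's Source/Target/RelationshipType information model, and the DOIs are invented; check the live ScholeXplorer API documentation before relying on any of this:

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the ScholeXplorer (Scholix aggregator) links API.
SCHOLIX_API = "https://api.scholexplorer.openaire.eu/v2/Links"

def links_query_url(article_doi):
    """Build a URL asking: what links does this article DOI have?
    (the 'sourcePid' parameter name is an assumption)."""
    return SCHOLIX_API + "?" + urlencode({"sourcePid": article_doi})

# A simplified, invented Scholix-style link record for illustration.
sample_links = json.loads("""[{
    "RelationshipType": "References",
    "Source": {"ID": "10.1000/fake.article", "Type": "literature"},
    "Target": {"ID": "10.5061/fake.dataset", "Type": "dataset"}
}]""")

def linked_datasets(links):
    """Pull out the dataset identifiers an article links to."""
    return [l["Target"]["ID"] for l in links if l["Target"]["Type"] == "dataset"]

print(linked_datasets(sample_links))  # ['10.5061/fake.dataset']
```

The point is the shape of the workflow - article identifier in, dataset identifiers out - not the exact field names.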
Below are some examples of commercial datasets that appear when I search the Scholix DLI.


An ideal world
Imagine a world where all this takes off. Researchers cite datasets as a matter of routine, similar to how they cite articles. Not everything is open data, but at the very least a record exists even when the dataset itself is behind a login. Datasets are not only cited but also properly linked; for example, dataset X could be recorded with "derived from" relationships to datasets Y and Z.
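Such "derived from" links (DataCite's metadata schema has an IsDerivedFrom relation type along these lines) would let us walk a dataset's provenance. The records and identifiers below are invented for illustration:

```python
# Toy provenance graph: each record lists the datasets it was derived from.
records = {
    "dataset-X": {"derived_from": ["dataset-Y", "dataset-Z"]},
    "dataset-Y": {"derived_from": ["dataset-W"]},
    "dataset-Z": {"derived_from": []},
    "dataset-W": {"derived_from": []},
}

def provenance(dataset, records):
    """All ancestors a dataset is (transitively) derived from."""
    seen, stack = set(), list(records[dataset]["derived_from"])
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(records[d]["derived_from"])
    return sorted(seen)

print(provenance("dataset-X", records))  # ['dataset-W', 'dataset-Y', 'dataset-Z']
```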
In such a world what could be done?
Imagine a researcher running a search and being able to quickly see which datasets are being cited in his area. He can see which datasets are rising in popularity and which datasets tend to be used together, and recommenders could start working on datasets, e.g. "people who use dataset X also tend to use dataset Y".
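Such a co-usage recommender falls out almost for free once dataset citations exist. A toy sketch, with the papers-to-datasets data invented for illustration:

```python
from collections import Counter

# Each paper is represented as the set of datasets it cites (invented data).
papers = [
    {"ExecuComp", "BoardEx"},
    {"ExecuComp", "Equilar"},
    {"ExecuComp", "BoardEx"},
    {"Dealscan", "Compustat"},
]

def co_used_with(dataset, papers):
    """Datasets most often cited alongside the given one, with counts."""
    counts = Counter()
    for cited in papers:
        if dataset in cited:
            counts.update(cited - {dataset})
    return counts.most_common()

print(co_used_with("ExecuComp", papers))  # [('BoardEx', 2), ('Equilar', 1)]
```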
With paid datasets, publishers and librarians could get a better sense of what is popular (a slight drawback is that this could also be used to track whether researchers are abiding by contracts - far too many data providers have restrictive contracts). Tools designed to map the structure of scientific research and identify hot or "prominent" areas might start clustering research based on datasets.
In the open dataset arena, researchers would start to realise the benefits of sharing data if the datasets they share become highly cited and surfaced via recommendation systems.
When we talk about citations, it is natural to think of a data citation index, and indeed Clarivate has had one for a while now, but I'm not sure the take-up rate is high compared to the "core" Web of Science.
While a closed citation index was acceptable for articles for many decades, the tide is slowly turning. Open citations are the way to go, and I believe and hope that efforts to track data citations will follow a more open model, similar to the Crossref and DataCite event data APIs.
Of course, being able to easily locate and find datasets would also improve the efficiency of science and help with reproducibility efforts.
Another thing to think about is the potential role of blockchain technology in this, with solutions like ARTiFACTS platform.
Conclusion
A world where datasets are treated as a first class entity similar to articles is decades away, I suspect, but the gains from it would be great. So how do we get from here to there?
*Title inspired by Open Citations Corpus - Citations as a first class entity

