3 "new" library discovery challenges - open access detection and versioning, article to dataset linking & general data search discovery
In the heyday of Web Scale Discovery (2009 to the early-to-mid 2010s), library discovery was a big issue, front and center in our profession.
It wasn't, of course, that our profession didn't write about OPACs and discovery layers before then, but the rapid adoption of Web Scale Discovery in academic libraries in the age of Google and Google Scholar intensified the conversation, and in those years you saw presentations and papers on every aspect of library discovery you could think of.
Of course, by the mid-2010s most academic libraries had implemented their first such system, and as the hype cycle goes, the product class began to mature and interest began to die down. That said, Twitter and mailing lists still occasionally erupt into debates on whether this was an overall positive move, usually sparked by users who felt they were losers in this shift (e.g. advanced power users, users who lament the rise of full-text indexing and the decline in focus on quality indexing, and humanities scholars who generally suffer from poorer relevancy due to the flood of articles).
I'm not going to rehash this argument, and I suspect I'm seen as too pro-discovery anyway.
Instead I'm going to talk about 3 "new" challenges in Discovery that we now need to face.
1. How do we ensure our open access papers (and in particular the versions they are in) are properly discovered and represented?
2. How do we ensure the datasets we archive in our institutional data repositories can be properly linked from the articles they supplement?
3. How do we ensure the datasets we archive in our institutional data repositories can be properly discovered by data aggregators? (hidden assumption: who or what is going to aggregate the datasets?)
The astute reader will note that none of these challenges is new, but in my view recent developments have made them even more salient. I will outline each challenge and the solutions I'm aware of that have emerged to try to meet them. Some of these are in a more nascent state than others.
As usual, I will be thinking aloud and I welcome responses and corrections of any misunderstandings I might have.
1. How do we ensure our open access papers (and in particular the versions they are in) are properly discovered and represented?
This is a drum I have beaten in recent years, with "Getting serious about open access discovery — Is open access getting too big to ignore?" in 2017 getting the most attention, because by then it was obvious that the pool of open access content was becoming significant.
So what can institutional repository (IR) managers do? I wrote an exploratory piece on the importance of ensuring discovery of IR contents not just in Google Scholar but also in BASE, CORE, Unpaywall etc., but it seems I shouldn't have bothered.
Days after I blogged that, the innovative and trend-setting University of Liège Library released "Open Access discovery: ULiège experience with aggregators and discovery tools providers. Be proactive and apply best practices", which covers the same ground and more, and which I highly recommend.

University of Liège Library Institutional Repository - ORBi
Another point of open access discovery that I suspect is less appreciated is the versioning problem. As open access discovery tools like the Google Scholar button, Lazy Scholar, Unpaywall, Open Access Button, Kopernio, Lean Library browser, Anywhere access and other tools start to surface more open access papers, users are starting to run into different versions of papers.
Lisa Hinchliffe, yet another person I file under "always interesting", notes that there are information literacy issues to consider now. This is timely, as Scopus is now joining Web of Science and Dimensions (together the big 3 citation indexes outside of tech giants Google and Microsoft) in integrating Unpaywall results natively into their search.
Here's an announcement that helps demonstrate why students (and others) are going to be encountering preprints *in* library subscription databases and thus in our #infolit sessions. https://t.co/05pAIzEKTj https://t.co/w2VEmHXAd6
— Lisa Hinchliffe (@lisalibrarian) July 26, 2018
I suspect not many academic librarians discuss discovery of versions that are not the version of record (VoR)/published version, particularly with first-years.
While I'm not naive and I know users Google things, I think many librarians focus on teaching subscribed databases that, until recently, pointed mostly to published versions, so the issue could be avoided.
Until recently, a librarian could maintain the polite fiction that our users would only ever see the VoR, because for the most part, if they were using our databases, they would rarely encounter other versions.
Users and libraries are also increasingly beginning to support tools like Kopernio and the Lean Library browser (e.g. Stanford is encouraging use of the latter), which not only bring users to subscribed resources but further increase the chances of users running into OA versions that are not the version of record.

Stanford's support of the Lean Library browser extension
How should we display variants? Can we even detect variants reliably?
Related to this issue is how and what to display when there are multiple versions of the same paper. Should we let the user decide? If so, which is the default? Competing agendas among stakeholders conspire to make this a tricky question.
But beyond the information literacy implications, there are also questions about how well OA discovery tools can detect the variants in our repositories, and the implications when they can't do it properly.
Again, this is an issue I've worried about before, in "Open access and the versioning issue".
The key implication is this: while search and discovery services like Scopus, Web of Science, Dimensions and Europe PMC all use the Unpaywall data feed, they can use it in different ways. There is no requirement that they display all the links found, and they can also differ in the way they offer facets.
What Unpaywall tries to tell you about an OA paper
The Unpaywall API tells you where the free-to-read version of a paper is hosted: either publisher-hosted or repository-hosted. This maps roughly to Gold OA (including hybrid) and Green OA, of course. The newly trendy "Bronze" category (free to read but with no license) is almost always publisher-hosted.

https://unpaywall.org/data-format
Unpaywall also tries to tell you the version of the paper: whether it is the submitted, accepted or published version.

https://unpaywall.org/data-format
It can also tell you the license (a Creative Commons license, a publisher license, or implied OA).
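To make this concrete, here is a minimal sketch of querying the Unpaywall REST API with Python. The DOI and email below are placeholders (the API requires a contact email); the fields shown (oa_status, host_type, version, license) come from the documented data format at https://unpaywall.org/data-format.

```python
import requests

# Placeholder DOI and contact email -- the Unpaywall API requires an
# email parameter identifying the caller.
DOI = "10.7717/peerj.4375"
EMAIL = "you@example.org"

resp = requests.get(f"https://api.unpaywall.org/v2/{DOI}",
                    params={"email": EMAIL})
resp.raise_for_status()
record = resp.json()

# Overall colour: "gold", "hybrid", "bronze", "green" or "closed".
print("oa_status:", record.get("oa_status"))

# Each OA location records where the copy is hosted and which version it is.
for loc in record.get("oa_locations", []):
    print(loc.get("host_type"),   # "publisher" or "repository"
          loc.get("version"),     # "submittedVersion" / "acceptedVersion" / "publishedVersion"
          loc.get("license"),     # e.g. "cc-by", or None for bronze
          loc.get("url_for_pdf"))
```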
How Dimensions and Web of Science use the Unpaywall data

Facets and filters in Dimensions
First off, based on some tests, I think Dimensions uses the full Unpaywall dataset, so it provides links to all known versions, whether publisher- or repository-hosted and regardless of version, which is good news for IR managers. So if you are an IR manager, it becomes important to ensure your OA content is appearing in Unpaywall.
So far so good, but let's look at Web of Science.

Facets and filters in Web of Science
The filters imply, and the documentation confirms, that Web of Science only shows links to published and accepted versions. (Gold or Bronze does not show versions, presumably because they are all assumed to be published versions?)
Here lies the problem: is Unpaywall able to correctly detect the variant of your paper? If it thinks a copy is not a published or accepted version, that copy will not appear!
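To make the implication concrete, here is a minimal sketch, assuming (as the filters above imply) that a service surfaces only accepted and published versions from the Unpaywall feed. The example locations are hypothetical.

```python
# Assumption: a service (as Web of Science's filters imply) keeps only
# accepted and published versions from the Unpaywall feed.
SHOWN_VERSIONS = {"acceptedVersion", "publishedVersion"}

def visible_locations(oa_locations):
    """Return only the OA locations such a service would display."""
    return [loc for loc in oa_locations if loc.get("version") in SHOWN_VERSIONS]

# A repository copy that Unpaywall labels "submittedVersion", rightly or
# wrongly, is filtered out, so users never see the IR link at all.
locations = [
    {"host_type": "publisher",  "version": "publishedVersion"},
    {"host_type": "repository", "version": "submittedVersion"},  # your IR copy
]
print(visible_locations(locations))  # only the publisher copy survives
```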
We have reason to believe that without metadata, reliable version detection is hard. While detecting the published version is relatively easy, distinguishing between submitted and accepted versions is almost impossible, even for humans, without metadata.
To appreciate the magnitude of the problem, consider that even the mighty Google Scholar only groups variants together, with some simple rules to choose one version as the main version (typically the publisher copy), and does not try to distinguish between the other versions. To be fair, though, this could simply be a lack of interest in distinguishing them.
In theory, Web of Science claims to find "article[s] located in a subject-based repository such as PubMed Central or in an institutional repository"; in practice, you will tend to find links to the former rather than the latter.
This could be partly because Unpaywall's algorithm might prioritise locations other than IRs when choosing the best_oa_location, or it could simply be the difficulty of properly detecting the version of a paper, with repository copies defaulting to the submitted rather than the accepted version.
Could the tradition of poor and inconsistent metadata in many IRs be causing this?
In any case, I've been trying to see if there are any patterns in how the version is detected, but Unpaywall has promised to set out explicit guidelines for this, which should improve matters if you act on them.
hi guys. these are great questions. rather than you guessing, let us come up with a standard set of things you could put on a cover page to help us determine version most accurately. stay tuned.
— Heather Piwowar (@researchremix) July 30, 2018
Definitely stay tuned.
2. How do we ensure the datasets we archive in our institutional data repositories can be properly linked from the articles they supplement?
As we race towards setting up research data repositories, we need to consider their discovery. This is obviously different from the discovery of text articles, and a much bigger challenge.
So let's focus on a smaller subset of the issue: how do we link datasets to the articles that use them?
A standard already exists: Scholix.

"The goal of the Scholix initiative is to establish a high level interoperability framework for exchanging information about the links between scholarly literature and data. It aims to enable an open information ecosystem to understand systematically what data underpins literature and what literature references data"

http://www.dlib.org/dlib/january17/burton/01burton.html
This standard has been adopted by Europe PMC and, of all people, Scopus. So when you search in Scopus for an article, Scopus will use Scholix to find and link to the dataset supporting it.

Link from Scopus to dataset via Scholix
As I've recently started looking at data repositories, the discovery of datasets has been on my mind a lot, and is probably the next most important thing after user experience (in workflow).
I must admit the more I explore, the more I realise I don't understand how this works. For example, I was trying to determine which data repository solutions with institutional versions (e.g. Figshare, Dataverse, Mendeley Data, TIND, DSpace etc.) support Scholix, such that if I deposit a dataset it will be findable via Scholix.
I tried playing around with the Scholix Explorer, but the more I think about it, the less I get it.
I get the impression that if a DataCite/Crossref relationship of, say, IsSupplementTo or IsSupplementedBy is made between the DOIs of the dataset and the article, Scholix will work (see the sketch below).
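As a rough sketch of what that looks like in practice, the snippet below queries the public DataCite REST API for a dataset DOI and prints any relatedIdentifiers; these IsSupplementTo/IsSupplementedBy relations are, as I understand it, what Scholix-based aggregators harvest. The DOI is a placeholder, not a real dataset.

```python
import requests

# Placeholder dataset DOI -- substitute one from your own repository.
DATASET_DOI = "10.5061/dryad.example"

resp = requests.get(f"https://api.datacite.org/dois/{DATASET_DOI}")
resp.raise_for_status()
attrs = resp.json()["data"]["attributes"]

# relatedIdentifiers carry the article-dataset links, e.g. a
# relationType of "IsSupplementTo" pointing at the article's DOI.
for rel in attrs.get("relatedIdentifiers", []):
    print(rel.get("relationType"),
          rel.get("relatedIdentifierType"),  # usually "DOI"
          rel.get("relatedIdentifier"))
```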
But I'm still learning. You can read the paper on Scholix or, better yet, watch the video below for a high-level understanding of why Scholix exists and how it works.
An idea I mean to explore is how these article-dataset links/citations can be leveraged in projects like WikiCite/Wikidata and OpenCitations.
Edit: Just 2 days after I blogged this, Dimensions announced another article-to-dataset linkage. It works only for Figshare but displays the data live in a viewer.
Open datasets in Figshare are now in @DSDimensions Wish I knew about this earlier when I blogged about Article-dataset links in search. There is now a "Associated data" section which not only links to it like in Scopus but directly displays https://t.co/4a1b4REckN pic.twitter.com/Yd1YxSd025
— Aaron Tay (@aarontay) August 3, 2018
3. How do we ensure the datasets we archive in our institutional data repositories can be properly discovered by data aggregators? (hidden assumption: who or what is going to aggregate the datasets?)
This is a big question; I perceive this area to be mostly wild west, and I am going to speculate wildly.
In the first place, are there even data aggregators out there that aggregate outside of silos? We can't optimize without knowing what is going to aggregate our content.
Currently, research data repositories are mostly siloed by data repository solution. The REgistry of REsearch Data REpositories (re3data.org) shows thousands of data repositories. If you search Figshare you get only Figshare results, Dataverse gets Dataverse results, and it gets worse when you consider individual DSpace implementations that store datasets. (For what it's worth, the parallel to article repositories is weaker, because the first movers for data repositories tend to be deployed as software-as-a-service, so the level of aggregation is higher than with the EPrints/DSpaces of the past.)
If we learnt anything from our experience with article repositories, this could be a recipe for disaster. Imagine if a ResearchGate-style data repository (or ResearchGate itself) makes a play for the research data space and achieves success by leveraging its mass to encourage researchers to put their data there, while individual data repositories remain hopelessly siloed, with inconsistent standards making them useless to aggregate. Or will Google save us from our messes like in the past, creating a Google Scholar for datasets that instantly makes our solutions look dumb while we scramble to figure out how to make the almighty Google index our datasets?
Could these solutions lead to new ways of doing literature review, e.g. studying papers that used a certain dataset, or looking for ideas by dataset?
I'm not aware of any specialised research data aggregator currently focused on discovery of data, except for Mendeley Data. (I'm unsure if the Scholix Explorer counts, and I wonder if it only shows datasets with article relationships.)

Mendeley Data aggregates data from Zenodo, arXiv, Dataverse and more
Mendeley Data could be an early innovator here: not only does it aggregate different repositories (both article and data) like Zenodo, Dryad, Dataverse, certain DSpace installs and more, it also integrates whatever data it can, such as tables or figures, from ScienceDirect articles.

Mendeley Data showing tables from ScienceDirect articles
This is seriously impressive.
Of course, we are still in the early days of this dataset discovery game.
More questions to consider
How does one do relevancy ranking of datasets if they were all aggregated? Surely the methods used by current discovery systems to rank text-based articles aren't appropriate.
Can we leverage linked data for recommendations? What about studying how Google recommends we tag datasets (see the sketch after these questions)? Already they work with data journalists to surface data in tables.
Or should one even bother to aggregate, since datasets, unlike articles, vary so much by discipline?
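On that Google point: Google's structured-data guidelines ask publishers to describe datasets with schema.org/Dataset markup embedded in the landing page. Below is a minimal, hypothetical sketch in Python that emits such JSON-LD; every value is a placeholder, not a real record.

```python
import json

# Hypothetical dataset record -- all values are placeholders.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example survey responses 2018",
    "description": "Anonymised survey responses collected for an example study.",
    "url": "https://repository.example.edu/datasets/1234",
    "identifier": "https://doi.org/10.0000/example.1234",
    "creator": {"@type": "Person", "name": "Jane Researcher"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# A repository landing page would embed this inside
# <script type="application/ld+json"> ... </script>
print(json.dumps(dataset_jsonld, indent=2))
```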
It is interesting to note that web-scale discovery services like Summon already have over 2 million items marked as datasets...

Summon restricted to dataset type has over 2 million entries
Conclusion
So there you have it: 3 things to think about in discovery. Who says discovery is played out?

