My look at some interesting discovery ideas, trends and features for academic search
2009 to 2013 was an exciting time for library discovery, as university libraries started rolling out so-called web scale discovery services. Much of my blog was focused on the area in those years, but eventually this technology began to mature and things started to settle down.
Still, if you think there is nothing new or interesting in the field of academic search and discovery these days, you may be in for a surprise.
What follows is a mix of technologies and actual search systems that I've been looking at that have the potential to go beyond what current web scale or index-based discovery offers. Some merely offer a new feature or two that might be incorporated fairly easily; others are based on fundamentally different frameworks and paradigms that supplement or might eventually replace current systems.
Looking at the commonalities between them, I see four main themes:
1. Enhancing metadata (open and closed) with full-text processing
2. Doing more with citation relationships
3. Semantic relationships + visualization
4. Doing more with open data
1. Enhancing metadata with full-text processing
Traditional academic databases focused on creating and indexing metadata. As time went by, full-text databases started to appear, and this culminated in the current web scale discovery or index-based discovery services (EDS, Summon, Primo etc.) with "mega indexes" that allow users to search through both metadata and full text when available. Still, the focus was mostly on human-created metadata, supplemented by keyword searching in the full text.
In contrast, commercial systems like Google Scholar turned the paradigm on its head. Google Scholar famously scrapes full text and uses algorithms to determine metadata such as the title of the article, the author and affiliations (sometimes leading to humorous results). This is not to say Google doesn't use metadata when it can find it (e.g. it recommends and supports Highwire Press tags and bepress tags), but unlike conventional databases it is also game to index material based on full text alone.
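To make this concrete, here is a minimal sketch of the metadata side of that bargain: publishers who want to be indexed well by Google Scholar embed Highwire Press-style meta tags in article landing pages, and a crawler can pick them up with a few lines of Python. The tag names (citation_title, citation_author, citation_publication_date) are the ones Google Scholar's inclusion guidelines recommend; the HTML snippet itself is a made-up example (the title is borrowed from a post of mine quoted later).

```python
from html.parser import HTMLParser

class CitationTagParser(HTMLParser):
    """Collects Highwire Press-style <meta> tags (citation_title, citation_author, ...)."""
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name", "")
        if name.startswith("citation_"):
            # author tags can repeat, so store every tag as a list
            self.tags.setdefault(name, []).append(attrs.get("content", ""))

html = """
<head>
  <meta name="citation_title" content="Are search results in library discovery really more trust-worthy?">
  <meta name="citation_author" content="Tay, Aaron">
  <meta name="citation_publication_date" content="2010">
</head>
"""

parser = CitationTagParser()
parser.feed(html)
print(parser.tags["citation_title"][0])
```

When a page carries no such tags, this is exactly where Scholar falls back to guessing metadata from the full text, with the humorous results mentioned above.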
Microsoft Academic - following the footsteps of Google Scholar?
Microsoft Academic, which was relaunched in late 2017, follows a similar playbook to Google Scholar by focusing on creating metadata from full text. In fact, at a Crossref Live17 event, a speaker from Microsoft started by saying he had a "philosophical difference" with the way metadata is done by Crossref. Essentially, he felt that it makes more sense for machines to generate metadata than to rely mostly on human manual labour, the way Crossref members do.
Of course the line isn't that clean and there is probably a spectrum of approaches but you get the point.
A more interesting example is the recently launched Dimensions by Digital Science. I will be doing a full review in another post, and you can read coverage about it by Library Journal, but in this blog post I will focus on the unique points of Dimensions.
We are told Dimensions starts off with a publication metadata backbone, aggregating metadata from various open sources such as PubMed Central, arXiv, RePEc and of course Crossref (metadata and open citations), which is probably the largest part of the backbone.

Dimensions goes beyond this by further enriching the index by full text processing of 50 million records from its 100+ partners.

We are also told "This Step includes deriving reference/citation data from the full-text and mining acknowledgements sections to identify links to patents, research funders and funded projects"
It's unclear to me how many, if any, of the usual suspects in the academic/library discovery space do full-text processing to generate metadata, but I suspect that if they do, none do it to this extent.
And yes, when you search Dimensions you search by default through full text, not just metadata. This makes it similar to Google Scholar, Microsoft Academic and web scale discovery services, but different from Scopus and Web of Science.
It's no coincidence that Microsoft Academic and Dimensions have a very similar feel when it comes to article discovery, with similar facets (journal/researcher/field/year of publication/source title etc.), and Microsoft Academic even one-ups the free version of Dimensions by allowing filtering by affiliation for free.

Microsoft Academic Search interface

Dimensions search interface
To be fair, the free version of Dimensions does offer at least three facets (open access status, publication type and journal list) that Microsoft does not.
The similarity of Dimensions to Google Scholar and Microsoft Academic is further enhanced when you realise that, unlike Scopus or Web of Science, Dimensions takes an "inclusive" approach to what journals to index rather than only indexing results from a carefully selected list of journals - an approach very similar to Google Scholar and Microsoft Academic.
To bypass the issue of unreliability of the content surfaced by the search (code for so-called predatory journals), Dimensions includes a clever workaround that allows you to filter to journal sets that have been whitelisted.

Filter down results to journals on whitelists such as DOAJ
Dimensions vs Web Scale Discovery
Dimensions isn't the only search system used in the academic world that searches full text with an inclusive approach, of course. Web scale discovery services like Summon/EDS/Primo, with their huge mega-indexes, take a similar approach (in terms of inclusion) and can have the same reliability issues, as I noted in "Are search results in library discovery really more trust-worthy?"
Might these systems benefit from a similar ability to filter to whitelists?
It's also interesting to speculate whether web scale discovery services, which are rich in both metadata and full text, could follow Dimensions' lead, but thus far they have been content to use the existing citation indexes of Scopus and Web of Science to enhance their services instead of trying to develop their own citation index.
There is a small hint of this possibility in the Primo citation trails feature, which is based on open citations, though to become a serious competitor they will need to process full text to extract metadata just as Dimensions has done. Will they want to wade into this new area?
What is the potential for Dimensions to compete with web scale discovery?
Conversely, some have speculated on whether Dimensions will take away market share from the web scale discovery systems.
In theory, with a Dimensions API, one could combine Dimensions results with one's OPAC catalogue results in systems like VuFind and possibly FOLIO in the future. Already, cutting-edge libraries have done so with APIs from Summon, Primo and EDS, and presumably Dimensions could take a similar place.

VuFind instructions on integration of Primo, Summon and EDS - might Dimensions be next?
That said, these central indexes are huge and go beyond just book chapters/journal articles/preprints etc. They include ebooks, newspaper articles, periodicals, image databases, video streaming databases and more.
Dimensions would have to seriously expand its content types to match this. This might not be easy given the dominant positions ProQuest and EBSCO hold in the content industry (e.g. both own ebook databases and newspaper/periodical database providers).
2. Doing more with citation relationships
Another avenue for innovation in discovery is in the handling of citations.
For instance, how about tracking citations beyond just article-to-article links?
With so much interest now in making data open, the obvious question is how one exposes the data. The equally obvious idea is to surface links between articles and the data that supports them.
The Scholix (Scholarly Link eXchange) protocol, supported by Scopus (among others), allows users to find linkages between publications and datasets - exactly what is needed. Institutional repository managers who collect datasets should investigate the possibility of supporting Scholix.

Scopus search result with links to Datasets via Scholix
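For a sense of what a Scholix link looks like under the hood, here is a small Python sketch that walks one link record. The field names follow my reading of the Scholix metadata schema (a source, a target and a relationship type, each object identified by a DOI or similar persistent identifier); the DOIs themselves are made-up placeholders, and real services may nest things slightly differently.

```python
# A single Scholix-style link record. Field names are my best reading of the
# Scholix metadata schema, not verbatim output from any particular service;
# both DOIs are made-up placeholders.
link = {
    "RelationshipType": {"Name": "IsSupplementedBy"},
    "Source": {
        "Identifier": {"ID": "10.1000/example.article", "IDScheme": "doi"},
        "Type": "publication",
    },
    "Target": {
        "Identifier": {"ID": "10.5061/dryad.example", "IDScheme": "doi"},
        "Type": "dataset",
    },
}

def describe_link(link):
    """Render a Scholix-style link record as a one-line summary."""
    src = link["Source"]["Identifier"]["ID"]
    tgt = link["Target"]["Identifier"]["ID"]
    rel = link["RelationshipType"]["Name"]
    return f"{src} --{rel}--> {tgt}"

print(describe_link(link))
```

The useful point for repository managers is that the model is symmetric: the same record answers both "what datasets supplement this article?" and "what articles does this dataset support?".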
But what about inter-links between other types of material?
Various databases track, say, patent citations of articles, but Dimensions is the first I'm aware of that is all-encompassing, using text processing to generate all sorts of relationships not just between articles but also to other entities like grants, patents, policy papers and even altmetrics.

Can we tell why something is being cited?
While this is interesting, all these citation links still run into a problem: we can't tell the exact nature of the relationship between the linked items, and in the case of a citation between publications we are told there are 13 different reasons to cite. Can we innovate in discovery by exposing the nature of the citations between items and allowing users to filter this way?
Imagine if one could look at a highly cited paper with thousands of cites and filter down to only the citing papers that were positive and had a certain keyword in the citing statement.
Searching and matching within the citation statement
Sciride is a little-known bioscience discovery tool that includes just such an intriguing idea. When you search Sciride, you are not searching the usual metadata or the full text or both. Instead, you are searching a limited subset of the full text that Sciride calls the "citation statement".

What is a citation statement? It is "sentences from scientific publications, supported by citing other peer-reviewed manuscripts".
In other words, a citation statement would be something like this:
"Google Scholar is shown to have high recall but low precision." (Tay, 2010)
Sciride allows you to do a keyword search of the citation statements. So in this example I search for the terms "Google Scholar high recall" and I get....

I'm sure you can think of many uses (e.g. looking for citation statements around a certain software, practice or even person, or searching for something you know exists but whose title you've forgotten), but currently Sciride is of limited use outside the domain it covers (life sciences).
It seems to me that web scale discovery services, with all the full text they have, might be able to implement something like this with some effort. Presumably work would be needed to reliably identify citation statements and index them.
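As a rough illustration of why that identification step is non-trivial, here is a naive Python sketch that pulls out candidate citation statements with regular expressions. It only recognises "(Author, year)" and "[12]"-style markers and splits sentences crudely; a production system would need proper sentence segmentation and matching of markers back to the reference list.

```python
import re

# Very naive markers: "(Tay, 2010)"-style parenthetical cites and
# "[12]"-style numeric cites. Real citation parsing needs far more than this.
CITE_PATTERN = re.compile(r"\([A-Z][A-Za-z'-]+(?: et al\.)?,\s*\d{4}\)|\[\d+\]")

def citation_statements(text):
    """Return the sentences that contain at least one citation marker."""
    # crude sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CITE_PATTERN.search(s)]

text = ("Google Scholar is shown to have high recall but low precision (Tay, 2010). "
        "We collected our own data. Prior work used a similar pipeline [12].")
print(citation_statements(text))
```

Here the middle sentence is dropped because it cites nothing; the other two would be indexed as citation statements.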
Arguably Sciride could have gone further. After all, all it does is allow keyword search over the citation statement. Can we do some kind of sentiment analysis to see if the citation statement is positive or negative?
Sentiment analysis of citations
http://rfactor.verumanalytics.io/ goes further than Sciride and tells you if a paper supports or refutes the paper it is citing.
How does it do this? Apparently this is via manual tagging which limits the ability for this feature to scale.
But automatic sentiment analysis methods do exist for telling if a citation is a positive or negative cite.
For example, there is the very interesting proposal in CiTO (the Citation Typing Ontology), which aims "to enable characterization of the nature or type of citations, both factually and rhetorically, and to permit these descriptions to be published on the Web."
Factual typing of citations could include properties like "is cited by" or "has quotation", while rhetorical typing is divided into three subclasses: positive (e.g. "supports"), negative (e.g. "disputes") and neutral (e.g. "reviews"). See more here.
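To give a flavour of what CiTO annotations look like as linked data, here is a stdlib-only Python sketch that writes two rhetorical citation triples in N-Triples syntax (a real application would use an RDF library such as rdflib). cito:supports and cito:disputes are genuine CiTO properties; the DOIs are made-up placeholders.

```python
# Build a tiny set of CiTO triples by hand in N-Triples syntax.
# The namespace is the real CiTO one; both DOIs are invented examples.
CITO = "http://purl.org/spar/cito/"

def cito_triple(citing_doi, property_name, cited_doi):
    """Return one N-Triples line: citing paper --cito:property--> cited paper."""
    return (f"<https://doi.org/{citing_doi}> "
            f"<{CITO}{property_name}> "
            f"<https://doi.org/{cited_doi}> .")

triples = [
    cito_triple("10.1000/citing.paper", "supports", "10.1000/cited.paper"),
    cito_triple("10.1000/citing.paper", "disputes", "10.1000/other.paper"),
]
print("\n".join(triples))
```

Once citations are published in this form, "show me only the papers that dispute X" becomes a straightforward query rather than a text-mining project.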

The main problem with this, of course, is who is going to code all the citations? This paper reviews some of the author annotation tools like Chrome extensions and other writing tools, but I doubt this is enough without an automated or semi-automated coding system created by machine learning - see for example CiTalO or the CiTO algorithm.
Can we determine which citations are influential in a paper?
If telling whether a citation is a positive or negative cite is hard, how about telling if a cite is important or critical to the paper? We know that a lot of the cites people make are not really critical to the paper, but what if we could identify the cites that are significant?
In fact, yes we can, and this is a feature in Semantic Scholar - another fairly new niche search engine limited to the domain of computer science.

Semantic Scholar not only shows cites but also tries to identify influential citations.
How does that work? In a fascinating paper entitled Identifying Meaningful Citations, the authors describe the work they do to identify which citations to a given paper are important and which are not.
Using a hand-coded set of citations, they use machine learning to train the system to recognise important citations. Impressively, it is designed to catch not just direct citations but also "indirect citations".
"Some citations are direct, i.e., the citation follows an established proceedings format; others are indirect, where the work is cited by mentioning the name of an author, typically the first author, the name of the cited algorithm, or a description of the algorithm"
For instance,

Some indirect citations it is trained to recognise
They tested a bunch of features, but it turns out the important ones are the number of times a citation appears in the paper (both in total and per section), the section it appears in (e.g. an appearance in the methods section usually matters more than one in the related-work section) and author overlap.
Their system has a high recall for recognising important citations but moderate precision (0.65).
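To make those features concrete, here is a toy, hand-weighted stand-in for their classifier. The real system learns its weights from hand-labeled data; the weights, threshold and section names below are invented purely for illustration.

```python
# Toy scorer mimicking the feature set from "Identifying Meaningful Citations".
# All weights and the threshold are made up for illustration only.

def citation_features(mentions, citing_authors, cited_authors):
    """mentions: list of section names where the citation appears in the citing paper."""
    return {
        "total_mentions": len(mentions),
        "methods_mentions": sum(1 for s in mentions if s == "methods"),
        "related_work_only": all(s == "related work" for s in mentions),
        "author_overlap": len(set(citing_authors) & set(cited_authors)) > 0,
    }

def is_influential(feats, threshold=2.0):
    """Weighted sum of the features, thresholded into important / not important."""
    score = (0.5 * feats["total_mentions"]
             + 1.0 * feats["methods_mentions"]
             + (1.0 if feats["author_overlap"] else 0.0)
             - (1.0 if feats["related_work_only"] else 0.0))
    return score >= threshold

# A citation mentioned three times, twice in the methods section:
feats = citation_features(["methods", "methods", "results"], ["Tay"], ["Smith"])
print(is_influential(feats))  # score 0.5*3 + 1.0*2 = 3.5 -> True
```

A single passing mention in the related-work section scores 0.5 - 1.0 = -0.5 and is rejected, which matches the intuition that such cites are rarely critical to the paper.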
Another interesting bit about Semantic Scholar is that it can identify surveys and reviews using heuristics.
3. Semantic similarity and relationships + visualization
While analysing citations is traditional, the main drawback is that many items never get cited, and even those that do take a while to accumulate citations, so citation-based features might not tell the full story, particularly when an item is new.
How about analysing the semantic relationships between items beyond just citations? How about showing relationships between items that are similar based on their content, or at least their metadata?
This is what Open Knowledge Maps does.

Using BASE or PubMed as a base, you can do a search and Open Knowledge Maps will pull out the top 100 articles from either of them and cluster them based on the similarity of their metadata.

Clustering of papers in Open Knowledge Maps using the search "Digital Education"
"Open Knowledge Maps: Creating a Visual Interface to the World’s Scientific Knowledge Based on Natural Language Processing" explains how it works.
"We compile a bag-of-words corpus using article title, journal name, author names, subject keywords and the abstract... Documents are preprocessed by removing punctuation, filtering stopwords, transforming to lower-case and stemming. Thereby, we reduce the dimensionality of the term-document matrix generated from this corpus. We then proceed to calculate the cosine similarity between papers using the R tm package"
The above may make some sense if you know a bit about how R and text mining work.
Though Open Knowledge Maps is no doubt interesting, the technique used doesn't strike me as particularly novel, and worse yet it is limited to metadata only and doesn't use full text, so the results can be hit or miss.
Concept mapping with Yewno Discover?
The current darling among Ivy League university libraries is Yewno. It is described as such:
"Yewno Discover is built on our unique core technology, which maps 600 million semantic connections among concepts extracted from full-text academic resources. Those connections link to over 120 million scholarly articles, books, and database assets — and the number of resources at your fingertips continues to grow every day."
Pitched as a true discovery tool, it lets the researcher see links between concepts and drill down to items related to those concepts, creating almost a mind-map of the area on the fly.
There are some really interesting researcher and student stories, but it's a very different way of searching compared to keyword searching, so only time will tell if this technology is the future.
Another interesting one, but with a different angle, is Knowtro, which is capable of extracting research findings from papers.

Knowtro simple view
In the advanced view, you can see more details.
What are the outcome variables? The predictor variables? The R-squared, the significance level, the sample size etc.?

Knowtro advanced view
Linked data - Wikidata , Bibframe
Yewno works by processing full text to extract semantic meaning. However, there is an older and perhaps more established way for documents to be linked semantically (though mostly labelled manually): linked data technology.
I know it seems like we have been talking about linked data for ages, but linked data pops up everywhere. Remember we were talking about typing citations using CiTO (the Citation Typing Ontology)? Guess what? That's an ontology in linked data.
If you want a quick overview of what linked data is from the librarian's viewpoint, you can look at this paper or Enabling your catalogue for the semantic web. For doubters and/or commentaries on the viability of linked data for libraries, see "Soylent Semantic web is people" or "linked data caution".
Regardless of the reason why linked data took ages to take off (this deserves a post of its own), there do now seem to be signs of life.
Ex Libris Alma now incorporates and supports BIBFRAME/linked data in the February release, and even more importantly Primo will support exposing data via Schema.org in May!


Querying linked data - easier methods?
Through the years, I've had HUGE problems trying to understand linked data: the technicalities of RDF (various serializations), ontologies, SPARQL and more. More importantly, it seemed like so much work to understand something with no real gain.
Wikidata has almost single-handedly changed my mind on this. It's such a useful store of data that you can answer all sorts of interesting questions, like:
Today's delightful query: members of the current Parliament who have ancestors in @wikidata who are identified as possibly mythical. (There are two - guess who?) https://t.co/pOhcGrEeOU
— Andrew Gray (@generalising) February 3, 2018

Want more questions you can answer using Wikidata? See Wikidata:Request a query.
Making querying linked data easier
One of the problems with linked data, I feel, is that querying linked data in RDF is not intuitive.
You can try this lesson, or follow the basic SPARQL tutorial on YouTube below, but it's a tough haul for many.
Are there ways to make this easier?
The Wikidata Query Service (WDQS) (see the query on Wikidata above) is a good attempt to simplify matters, but are there better methods?
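For the curious, here is roughly what talking to WDQS from a script looks like. The sketch only builds the GET URL (the endpoint takes a query parameter and can return JSON); the example SPARQL asks for items that are instances of (P31) what I believe is the "academic journal" item (Q737498) - treat that Q-number as an assumption to check before relying on it.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are instances of (P31) academic journal (Q737498, I believe),
# with English labels - a simple starter query.
query = """
SELECT ?journal ?journalLabel WHERE {
  ?journal wdt:P31 wd:Q737498 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def wdqs_url(sparql):
    """Return a GET URL for the Wikidata Query Service, asking for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})

url = wdqs_url(query)
print(url[:60])
```

An actual lookup is then just an HTTP GET of that URL, which is exactly the plumbing that visual query builders like the ones below try to hide from the user.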
This paper reviews some of the work in this area such as the Visual SPARQL builder.

The latest entry is OSCAR: A customisable tool for free-text search over SPARQL endpoints.

4. Doing more with open data
I've been blogging for a while about the momentum open access has and how discovery systems are working to incorporate it as the level of open access rises. The Unpaywall API has been a leading force in this, as databases such as Web of Science and Dimensions have started to incorporate it to find open access versions of articles. 1Science's 1findr and Kopernio are other players in this space, trying to improve on the traditional open access aggregators like BASE (all about BASE) and CORE.
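For illustration, the Unpaywall lookup that these systems build on is a single GET per DOI against api.unpaywall.org/v2, with a required email parameter; the JSON response includes an is_oa flag and a best_oa_location pointing at the open copy when one is known. The sketch below just builds that URL; the DOI is a placeholder.

```python
from urllib.parse import quote, urlencode

def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL: the DOI goes in the path,
    the (required) contact email goes in the query string."""
    return f"https://api.unpaywall.org/v2/{quote(doi)}?{urlencode({'email': email})}"

url = unpaywall_url("10.1000/xyz123", "you@example.org")
print(url)
```

A discovery system can fire this off for every DOI on a results page and decorate hits that come back with is_oa set to true.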
Open citations also seem to be taking hold, but still more needs to be done here.
If you look at the search engines mentioned on this list, most of them, such as
Semantic Scholar
Open Knowledge Maps
Sciride
only cover the life sciences or computer science. While some of this is due to scale or the choices of the authors, arguably innovative features like these can only be easily piloted on open data, and the life sciences and computer science are the areas with a sufficient level of open data to work on.
Currently, for comprehensive all-subject search engines, commercial offerings like Dimensions (which builds on open data) and Yewno are the main choices.
Conclusion
This has been a mind-boggling tour of some interesting features in discovery that might hold promise. Thinking about all this has made me appreciate how much open data (papers, citations) levels the playing field and allows more innovation.


