More new search discovery apps - Fatcat, OpenAIRE Research Graph, Open Research Knowledge Graph, Ebsco Discovery Concept Maps, Orion search & more
As always, I am on the lookout for interesting new discovery applications and search engines, and in this blog post I will briefly describe (not a full-blown review) the following new ones that caught my eye.
Internet Archive's Fatcat & Internet Scholar Archive
OpenAIRE Research Graph
Open Research Knowledge Graph (ORKG)
Ebsco Discovery Service's Concept Map
Orion Search - open source, funded by Mozilla
Interestingly, a lot of the above applications, such as Orion, I first spotted while doing a delayed overview of The 2nd Workshop on Open Citations and Open Scholarly Metadata 2020, a very interesting conference indeed.
Note: As these are very new applications (many of which are still in beta), I am still figuring them out. So what follows will be more of a "stream of consciousness" review of them and not any concrete assessment. Take all this with a (bigger) grain of salt than usual.
Fatcat & Internet Scholar Archive by Internet Archive
As I have noted, we are seeing the rise of many large-scale academic discovery services, due to the availability of "open" data (both metadata and full text) on articles from sources such as Crossref, Pubmed/PMC, Microsoft Academic Graph, Semantic Scholar Open Research Corpus, ORCID, Unpaywall, JISC CORE and more.
Between Lens.org, Scinapse, NAVER, Scilit, etc., not to mention the native interfaces for Microsoft Academic, Google Scholar, and Semantic Scholar, do we need one more academic discovery service that blends all this data together?
Internet Archive's newly announced Fatcat & Internet Scholar Archive (currently alpha testing at https://scholar-qa.archive.org/ ) attempts to answer this question by blending in archived data from the Internet Archive as well as datasets on digital preservation from the Keepers Registry and other sources (the usual suspects like Crossref, Unpaywall, PubMed, etc.)
Fatcat five-minute intro for the Workshop on Open Citations and Open Scholarly Metadata 2020
As I write this, the preprint, Open is not forever: a study of vanished open access journals is making some waves as it showed that
176 OA journals that, through lack of comprehensive and open archives, vanished from the web between 2000-2019, spanning all major research disciplines and geographic regions of the world.
This made it timely for the Internet Archive to release How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles, announcing the launch of Fatcat & Internet Scholar Archive (currently alpha at https://scholar-qa.archive.org/ )
Let's look at Fatcat first.

First off, what's in Fatcat? The guide says:
The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.
On another page it talks about including non-traditional digital works (web-native and "grey literature"), though if this means preprints and conference proceedings, it won't be that different from existing collections, e.g. Microsoft Academic Graph?
Note that Fatcat also
does have verified hyperlinks to fulltext content, and includes file-level metadata (hashes and fingerprints) to help identify content from any source.
But that does not mean every record in Fatcat will have links to full-text content (some records are metadata only).
What datasets does it use?


https://guide.fatcat.wiki/sources.html
How big is Fatcat now? Here are some stats

As interesting as it is to know what is in Fatcat, what is it envisioned to do?
Looking at the Project Goals and Ecosystem Niche, we can see that, like many similar projects, it aims to be open and non-commercial, which lends itself to many possible uses for research and discovery purposes. They argue:
We do not know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.
I am not quite sure I agree, but what distinguishes it, I think, is its focus on preservation data.
It is clear that one major use of Fatcat is tracking the preservation status of scholarly content at the work level. It can track and answer questions like:
what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservation networks don't provide granular metadata
The other aspect of Fatcat that is somewhat unusual is that it is envisioned to be collaborative and user-editable, somewhat similar to Wikipedia and Wikidata, though Fatcat has a more specific focus on capturing works and relationships (I think somewhat like WikiCite)?
Some quick tests with Fatcat
Fatcat is editable like Wikipedia or Wikidata, and there is a whole data model with entities and ontology to look at. I'm not really good at such things, but I did have a glance at the entities used.
Also, even if you don't intend to edit Fatcat, you will need to look at the data model to know which field searches you can do (and what values you can use; some are free-text strings, some have a controlled vocabulary you must use), as the Fatcat interface does not provide any inline user guidance, such as an advanced search screen, on what can be done.

https://guide.fatcat.wiki/entity_container.html
I've been in correspondence with the main developer of Fatcat, and the search functions can be quite powerful but complicated, as there are in fact quite a lot of fields you can use.
As I was curious about the preservation status of journals, the entity "container" was of most interest to me.
For testing purposes, I tried looking at an important long-running local journal in Singapore, the Singapore Medical Journal (SMJ).

https://fatcat.wiki/container/oozaw2pnfvawnhapy7svkgjwye
Fatcat knows of 8,124 "releases" (which are basically articles). This compares to Microsoft Academic with roughly 7.8k, which is in the ballpark. Additional releases might be alternative versions?

The most interesting data to me is on the right hand side of the page that provides interesting data on the preservation status of "releases" in the journal.
The bad news is that, according to Fatcat, over 6,250 (76.9%) of the "releases" have no known preservation.
Fatcat also has a listing of "work types", including 1 "retraction", which is curious. Where is the data coming from? Crossref? The Retraction Watch database?
How do you filter down to that? Looking at the data model, I spot a "release_stage" field which includes the value "retracted".

For more details on retraction/withdrawn, see the section in the guide; the guide currently notes this section is invented and experimental.
It also provides additional journal-level information like the ISSN-L, Wikidata QID, and whether it is in the DOAJ or ISSN ROAD list. Incidentally, while this journal is currently open access, it is not currently in DOAJ, as it was removed after the DOAJ cleanup in 2016/2017. I suspect they did not reapply.

Further down, you can look up the journal in sites like Sherpa Romeo, the Keepers Registry, etc., or edit the metadata.
Clicking on the coverage tab at the top of the screen gets you more visualizations, like preservation coverage by year, issue, or release type.

https://fatcat.wiki/container/oozaw2pnfvawnhapy7svkgjwye/coverage
If Fatcat is to be believed, the digital preservation of SMJ is quite bad. While the earlier years are covered by some dark archives (it seems to be HathiTrust, roughly the 70s to mid-90s; if you press the metadata tab you can see that), and there are some "bright archives" for the more recent years (particularly from 2013 onwards), the rest is not preserved digitally.
Definitions of bright vs dark archives
I've been trying to figure this one out, but I have practically zero knowledge of digital preservation.
Roughly, we can say "Bright" means "publicly accessible via Internet Archive".
They also try to archive links to full text found in sources like CORE, Semantic Scholar, Microsoft Academic, and Unpaywall, but not all the links can be crawled, for various technical reasons (e.g. errors in the links, publisher-side blocking of bots, lack of DOIs or PIDs for matching, etc.).
Dark archives are generally the sources listed in the Keepers Registry (e.g. LOCKSS, CLOCKSS) as well as JSTOR, PMC, or arXiv. Most of PMC (after embargo) and almost all of arXiv are publicly open and are also crawled by the Internet Archive, so they are technically in the bright archive as well.

https://fatcat.wiki/container/oozaw2pnfvawnhapy7svkgjwye/coverage
How accurate is this?
Interestingly, when I check the SMJ site, it seems to have PDFs of most articles from issues from 1993 onwards, by sample check.
I would have assumed the Internet Archive would have gotten most of them and that the "bright" portion would extend backwards to cover them, but it seems to be unaware of them and they are not archived.
Limited sample checks of older articles (i.e. before 1993) with full text available on the SMJ site show that, yes, they are not in the Internet Archive, nor in CORE, Microsoft Academic, Semantic Scholar, or Unpaywall. Example here .
So clearly, the journal page isn't too discoverable... which should be fixed at the publisher level.
But this seems like something either the editors of SMJ or medical librarians in Singapore might want to work on, to get it archived in the Internet Archive. While one can manually archive pages or files to the Internet Archive, the number of files here would probably require a script, or perhaps the bookmarklet if only semi-automation is needed.
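If someone does want to script this, the Wayback Machine's "Save Page Now" feature can be triggered by fetching https://web.archive.org/save/ followed by the target URL. Here is a minimal sketch; the article URLs are made-up placeholders, and the save endpoint's behaviour, rate limits, and any authentication requirements should be checked against the Internet Archive's documentation before running anything like this at scale.

```python
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def wayback_save_url(target_url):
    """Build the Save Page Now URL that asks the Wayback Machine to archive target_url."""
    return SAVE_ENDPOINT + target_url

# Hypothetical list of article PDF URLs gathered from the journal site
article_urls = [
    "https://example.org/smj/1993/v34n1/article1.pdf",
    "https://example.org/smj/1993/v34n1/article2.pdf",
]

for url in article_urls:
    save_url = wayback_save_url(url)
    print(save_url)
    # Uncomment to actually trigger archiving (politely, one request at a time):
    # urllib.request.urlopen(save_url)
    # time.sleep(10)
```

The loop is deliberately sequential with a pause; hammering the save endpoint in parallel is likely to get you rate-limited.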
Checking preservation status of topics and countries
One of the things you can do in Fatcat that isn't obvious is that you can do this preservation check on anything, not just at the journal level.
There is a Fatcat - Preservation Coverage Visualizer, where you can enter any keyword and field and see a similar visualization.

Fatcat - Preservation Coverage Visualizer
From here you can do any type of search, including field searches.
For example, if I'm interested in the preservation status of anything with "Singapore" in the title, I just search title:Singapore in the Preservation Coverage Visualizer. You can also try fields like publisher, release_type (e.g. book, chapter, dataset), and more.
Please refer to the guide
Librarians who are looking at journals in their country can use the country field with an ISO country code value.
Here's an example for containers with the country code for Singapore, which turn out to be decently well preserved digitally.

https://fatcat.wiki/coverage/search?q=country_code%3Asg
Fatcat currently has an API (though take note of the warning that the schema is not yet stable).
I repeat my earlier statement that Fatcat is pretty complicated and I am having problems wrapping my mind around it. It includes abstracts, acknowledgements, and even references, so you can track citations (though currently, I am told, the reference text itself is not indexed in Fatcat).
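To make the "metadata-only vs has full text" distinction concrete, here is a hypothetical sketch of how you might check a release entity fetched from the API for attached files with accessible URLs. The dict shape below is my assumption based on the data model in the guide (releases can have associated file entities, each with hashes and URLs); verify the actual JSON against the API before relying on it.

```python
def has_fulltext(release):
    """Return True if a release entity has at least one file with a URL we could fetch.

    `release` is assumed to be a dict with an optional "files" list, where each
    file entity has a "urls" list -- an assumption based on Fatcat's data model.
    """
    for f in release.get("files", []):
        if f.get("urls"):
            return True
    return False

# Two made-up release records for illustration
metadata_only = {"title": "Some article", "files": []}
with_pdf = {"title": "Another article",
            "files": [{"sha1": "abc123",
                       "urls": [{"url": "https://example.org/a.pdf"}]}]}

print(has_fulltext(metadata_only))  # False
print(has_fulltext(with_pdf))       # True
```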
If you do not have much interest in preservation issues, do not want metadata-only records, and are just in it for discovery as an end user, the other tool Internet Archive launched, Internet Scholar Archive (currently alpha at https://scholar-qa.archive.org, now officially launched at https://scholar.archive.org/ ), is what you probably should try.

https://scholar-qa.archive.org/search?q=abstract%3A%22Singapore%22
My understanding of Internet Scholar Archive is that, unlike Fatcat, it shows only what is available for download. This might be interesting as a secondary check to find scholarly material that has disappeared from the web except as preserved in the Internet Archive?
OpenAIRE Research Graph

https://www.openaire.eu/blogs/the-openaire-research-graph
In my recent blog post, I focused on Project FREYA's PID Graph, which blends PIDs from Crossref, DataCite, ROR, and ORCID.
However, various other research graphs linked by PIDs exist. One major one is OpenAIRE's Research Graph (direct link to try it).

OpenAIRE Research graph presentation at 2nd Workshop on Open Citations 2020
It aims to create an open (CC0) "Global Open Science Graph" which
includes metadata and links between scientific products (e.g. literature, datasets, software, and "other research products"), organizations, funders, funding streams, projects, communities, and (provenance) data source
from over ~10,000 trusted sources, further enhanced with
metadata and links provided by OpenAIRE end-users, Full-text mining algorithms and Research Infrastructure
And indeed it truly draws from a lot of sources, from publication sources


to research data sources

to research software sources

Funder and grant sources

And identifiers like GRID, ORCID, etc.
As of Nov 2019,
the Graph aggregates around 450Mi metadata records with links, which after deduplication, cleaning, and classification narrow down to ~110Mi publications, ~10Mi datasets, ~180K software research products, ~7Mi other products, with 480Mi (bi-directional) semantic links between them.
It is still in beta; you can try the web interface here, and the raw data dumps are available on Zenodo.

https://beta.explore.openaire.eu/
The OpenAIRE Research Graph aims to be an open resource that can contextualize research, which will benefit the public (funders, researchers, publishers) in many ways; one of the envisioned uses is the monitoring of science via dashboards. Currently this isn't available yet, but you can play around with the web interface to get a feel of its coverage.
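Besides the web interface, OpenAIRE exposes an HTTP search API over the graph. A minimal sketch of building a publications query URL follows; the parameter names (keywords, page, size) are my reading of the OpenAIRE API documentation, so double-check them there before building anything on top of this.

```python
from urllib.parse import urlencode

def openaire_publications_url(keywords, page=1, size=10):
    """Build an OpenAIRE search API URL for publications matching some keywords."""
    params = urlencode({"keywords": keywords, "page": page, "size": size})
    return f"https://api.openaire.eu/search/publications?{params}"

print(openaire_publications_url("author disambiguation"))
# https://api.openaire.eu/search/publications?keywords=author+disambiguation&page=1&size=10
```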
Open Research Knowledge Graph - capturing contributions?
When you read the page on the OpenAIRE Research Graph, they mention access to data dumps not just of the OpenAIRE Research Graph but also of other related OpenAIRE datasets such as ScholeXplorer and DOIBoost (a merging of sources from Unpaywall, Crossref, Microsoft Academic, ORCID), both of which have been mentioned before on this blog.
Even more interestingly, they talk about attempts by other organizations to create a Global Open Science Graph. Some, like FREYA's PID Graph and the open citation graphs by OpenCitations, are not new to you if you have been following my blog.
Two other research graphs are also mentioned: one is the Australian-based Research Graph

and the other is the one we are going to discuss here, the Open Research Knowledge Graph (ORKG) by the German National Library of Science and Technology (TIB).
My reading of the Open Research Knowledge Graph is that, on top of the usual metadata in the graph such as title, abstract, and author, it also aims to capture the approach/method, materials/content, and results/contributions of each paper in a structured way using volunteer effort. Think of it as a more specialised Wikidata.
Assuming such data exists for papers, it opens up a whole spectrum of ideas. For example, one could compare papers along specific dimensions: by problem, materials, method, or results.
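To make the idea concrete, here is a toy sketch of what such a structured comparison could look like. The papers and property names below are entirely made up; ORKG's actual predicates and data model will differ.

```python
# Made-up structured "contributions" for two hypothetical papers
papers = {
    "Paper A": {"problem": "author disambiguation",
                "method": "random forest",
                "dataset": "ORCID dump",
                "f1_score": 0.89},
    "Paper B": {"problem": "author disambiguation",
                "method": "rule-based matching",
                "dataset": "PubMed",
                "f1_score": 0.81},
}

def compare(papers, properties):
    """Build a comparison table: property -> {paper name: value} (None if missing)."""
    return {prop: {name: data.get(prop) for name, data in papers.items()}
            for prop in properties}

table = compare(papers, ["method", "dataset", "f1_score"])
for prop, row in table.items():
    print(prop, row)
```

Once papers share the same predicates, generating a comparison table like ORKG's is essentially this kind of pivot over properties.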
Below is a demo comparing papers that have contributions related to implementing visualization software.
Below I show one comparison table between 2 papers I found on author disambiguation.

Here's another

This is definitely very intriguing. A tool that I highlighted a few years ago, Knowtro, summarised the findings (variables used, p-values, etc.) of papers in a structured way and drew quite a bit of attention when I blogged about it, but unfortunately it is no longer available.
The Open Research Knowledge Graph, if properly populated with data, might be an excellent replacement.
Graphically, each paper can be visualized with relationships such as "is author" and "is title", very similar to PID graphs and the OpenAIRE Research Graph, except for the contribution data. (This also looks very similar to Open Knowledge Graph's beta Concept Map, except that one doesn't have contribution data.)

Let's expand the contribution node, and you can see the paper has contributions describing the datasets used, the geographic area covered (Wuhan, China), and observed R0 estimates of the COVID-19 virus with a literal string of 1.58.

Of course, all this by default is at depth=1; increasing the depth to, say, 3 allows you to see more connections (more work is needed to make this readable).

Adding the data into ORKG
The problem with ORKG, of course, is that it needs data to be included on each paper beyond the usual metadata like title and author that can be mass-imported.
Also, currently not many papers seem to be available in ORKG yet.

And here is the rub: for ORKG to show its potential, many if not most papers will need to be properly labelled with the right values and properties/predicates. So, for example, if you want a comparison table comparing p-values or confidence intervals, you would need to use those predicates to label the papers.

So the success of this project depends on extracting sufficient human labour to enhance each paper with additional metadata, particularly on the contributions.
Unfortunately, when I tried to study the ontology and data model, I had problems wrapping my mind around how to add "contributions" for my papers. I have no doubt that with some effort I could figure it out, just as I more or less figured out Wikidata, but at first glance it looks complicated.
For example, I don't quite understand the idea and use of "predicate templates" and "resources". The YouTube video I found below didn't help either.
However, they are currently trying an ORKG Paper Annotator tool that might help users mark up their papers with contributions, etc. I've given it a go, and while it doesn't 100% solve the problem, it does seem promising.

Ebsco Discovery Concept Map
So far, we have covered mostly open applications of research graphs. But well before these applications, one of the original promises of network maps and graphs was that they would help with searching and discovery.
Currently I'm aware of two such commercial products. The first is Yewno, developed by an AI startup, which has been in the market for a few years.
A more recent challenger in this space, from a traditional library vendor, is Ebsco's Concept Map feature, which is currently in beta.
The idea here is to map your normal keyword search terms to concepts and related concepts that exist in Ebsco Discovery Service.
You can then build a boolean search by clicking on the concepts you want, which adds them to the search builder.
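Conceptually, the search builder is just assembling a boolean query from the concepts you click. Here is a toy sketch of that idea (not Ebsco's actual implementation; the concept labels are made up), combining alternatives with OR within a group and AND across groups:

```python
def build_boolean_query(concept_groups):
    """Combine concept groups into a boolean query: OR within a group, AND across groups."""
    clauses = []
    for group in concept_groups:
        joined = " OR ".join(f'"{c}"' for c in group)
        clauses.append(f"({joined})" if len(group) > 1 else joined)
    return " AND ".join(clauses)

# e.g. searching for Italy together with any of the groups it is a member of
query = build_boolean_query([["Italy"], ["European Union", "NATO"]])
print(query)  # "Italy" AND ("European Union" OR "NATO")
```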

Like any linked-data graph, you can use relationships to see and display other concepts related to the one you are looking at.
In the example below, the search started with the concept "Italy", but you can choose relationships like "shares a border with" or "member of" to show concepts that have these relationships with Italy.
So for example, choosing "member of" displays international groups of which Italy is a member, which you can add to the search builder.

Searching terms directly within the concept map search box, provides you with a dropdown of possible terms with descriptions which can help with keyword disambiguation (e.g. Java as the island vs coffee).
I haven't had a chance to try a live demo of it, but watching the YouTube video and reading the tutorial page gives me flashes of using Wikidata (a simplified version).
On the other hand, while this seems impressive, it is hard to gauge the richness of the concept map and its relationships without actually trying it out.
Is there reason to expect this might be useful? Perhaps. After all, Ebsco has focused a lot in their literature on differentiating themselves from the competition (Proquest's Summon, for example) as having better relevancy rankings due to using top-quality "thick metadata" (various subject controlled vocabularies and thesauri).
To be more exact, they have been working on mapping commonly used natural-language search terms to these controlled vocabularies (I believe this is called the enhanced subject precision feature, which is on by default since 2019).
Concept maps seem to be the next step, allowing browsing of the actual underlying controlled vocabularies they have, which are, I guess, created by crosswalking different such systems.
Orion Search
Orion has one of the fanciest search interfaces I have ever seen. The visualization interface reminds me of playing a 4X space strategy game where you can pan and zoom in 3D space.

Orion also has unusual metrics, with axes for "research diversity" and "gender diversity".

Do note that the Orion demo is restricted to only a certain set of papers and cannot be used as a cross-disciplinary database: "Using this Orion deployment as an example, we queried MAG with a journal name, bioRxiv, to collect all of the papers published on its platform"
Conclusion
This has been quite an unfocused look at various new discovery tools out there. Many if not most of these might not pan out, but it is still useful to keep an eye out...

