Diversity of Scholarly record + Push to Open + Deep learning
In late 2018, I gave a talk at the OCLC Asia Pacific Regional Conference Meeting 2018, sharing my thoughts on how I saw the game changing for libraries in the years to come, based on three fundamental trends.
Earlier this year, in January 2022, I was invited to give a talk reflecting on how I view things three years on....
My thesis - Diversity of Scholarly record + Push to Open + Deep learning = CHANGE
Let me first quote from what I wrote in 2018.

"This slide above is in the nutshell is the way I think about the biggest drivers of change in academic libraries for the next decade or so.
Firstly, the primary duty of academic libraries - the collection of final published material and outputs for their community to use is diminishing in importance and focus has moved toward collecting artifacts across the whole workflow including collecting inputs, processes and outputs. OCLC calls this the evolving Scholarly Record.
Secondly, not only has the scholarly record become more diverse, but there is now an almost default presumption or at least an expection that these records be made open in some way. We are no longer talking about just making published papers open (open access), but also open data (raw files, code, protocols, peer review and more), open education resources, open citations and now even open infrastructure.
Lastly, it's perhaps a cliche to say we live in a time where AI in particularly machine learning , deep learning, are changing jobs. But given the first two trends where more and more about the scholarly communication system is collected and made open it is no surprise I'm seeing more and more new applications emerge based on blending these technologies with the data collected from the Scholarly Communication system."
It is three years on, and the me in 2022 has learnt a lot more than I knew in late 2018. So how did I do? In a nutshell, I think the trends I identified have continued to hold, if not accelerate.
Let's consider these trends one by one and see what has changed since 2018.
1. The evolving Scholarly record
This is based on the 2014 OCLC report of the same title, and even in 2018 I was struck by how prescient it was.
Essentially, it noted that traditionally "we" (which includes libraries, publishers and funders) generally cared about providing access, discovery and preservation for only a very limited part of the academic or scholarly record.

At the risk of oversimplifying, I would say that academic libraries before the 2000s mostly spent their efforts on published works like books and journals (plus special collections).
And because these were the research objects we "cared" about, we had metrics on them, essentially in terms of citations, usage (downloads) and occasionally altmetrics like tweets.
The evolving scholarly record trend points out that we now "care" about a lot more than just the final outputs of the scholarly record (books & articles); we also want to store and look at the complete workflow and the objects accompanying it.
As the OCLC report put it, we now also care about the evidence behind the output and the method that produced the output, and we even want to capture the discussion around the whole process, which spans beyond just peer review.

Above shows just some of what the OCLC report was describing.
Today we collect and provide access to:
Protocols (e.g. protocols.io, SearchRxiv, OSF)
Code (e.g. Code Ocean, Executable Research Articles)
Data - raw, processed (e.g. Zenodo, Figshare)
And even when it comes to the actual journal article, where we once mainly cared about or provided access to the publisher's Version of Record, it is now common to find other versions, from preprints to accepted manuscripts, everywhere.

Journal Article Versions (JAV): Recommendations of the NISO/ALPSP JAV Technical Working Group
Experts such as Herbert Van de Sompel, Bianca Kramer, Jeroen Bosman and Lisa Hinchliffe have been talking about moving from a "version of record" to a "record of versions", though the origins of the phrase seem to go back as far as 2008.
It's all well and good to have so many different outputs, or even versions of the same output, but how do we connect them together and disambiguate them?
In Persistent Identifiers Connect a Scholarly Record with Many Versions, it is argued:
"Whereas the published, printed version of the research article was once the authoritative source of research, new modes of publishing and the publishing of other research outputs (postprints, protocols, data, code, etc.) have made the term “version of record” all but irrelevant. The scholarly communications landscape has already moved into what Herbert Van de Sompel, Bianca Kramer, and Jeroen Bosman call a “record of versions,” where persistent identifiers (PIDs) enhance the discoverability and linking of research outputs regardless of where those outputs are housed."
As an example, they give:
For example, a preprint or postprint may be available through an institutional repository (IR), a related data set may be published in a discipline’s or funder’s data repository, and related code may be available on GitLab (ideally backed up in an IR). The distributed nature of the assets is actually key for ensuring that each output is properly curated, findable, and preserved. When they are distributed in specialized repositories, research assets are more likely to have digital object identifiers (DOIs) minted, metadata created and shared, and deposits checked and preserved.
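To make this concrete, here is a minimal sketch of how one might follow such PID links programmatically via the public DataCite REST API, which exposes a relatedIdentifiers field in each DOI's metadata (the DOI below is a made-up placeholder):

```python
import requests

# Hypothetical dataset DOI, used purely for illustration
DOI = "10.5061/dryad.example123"

# DataCite REST API: returns JSON:API metadata for DOIs it has minted
resp = requests.get(f"https://api.datacite.org/dois/{DOI}")
resp.raise_for_status()
attributes = resp.json()["data"]["attributes"]

# relatedIdentifiers link this record to preprints, articles, code etc.
for rel in attributes.get("relatedIdentifiers", []):
    print(rel.get("relationType"),           # e.g. "IsSupplementTo"
          rel.get("relatedIdentifierType"),  # e.g. "DOI", "URL"
          rel.get("relatedIdentifier"))
```

Crossref exposes similar relation metadata for the DOIs it registers, so in principle a "record of versions" can be walked across registries.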
Evaluation and assessment of Scholarly work is also changing
To add on further, the way we assess and discuss scholarly work has also changed.
Firstly, while traditional double-blind peer review is still dominant, we are starting to see journals experiment with other, more open peer review models.
For example, QSS practices open peer review, where all peer review reports are made open and available with a minted DOI, which aids discovery as the report is linked to the paper. Peer reviewers can also choose to have their names attached to their reports, but this is voluntary.
While there are dozens of experiments with peer review models, I suspect there is a growing sentiment that making the peer review report open and available just makes sense in most cases. Particularly with concerns about predatory journals and the near impossibility of proving whether a journal actually did proper peer review under traditional models, making peer review reports open goes some way toward solving this issue.
We are also slowly starting to see the uptake of both pre-publication and post-publication reviews, in the form of models like PCI (Peer Community In) and PubPeer respectively (Publons offers both).
PubPeer in particular seems to be getting a reputation as a post-publication site for reporting serious flaws in papers, and the presence of comments on PubPeer is sometimes correlated with retractions.
If you have been following the story so far, it means the scholarly ecosystem now needs to collect, keep track of, and make accessible and discoverable all these objects beyond journals and books.
And once we do that, we also run into the thorny problem of how to measure and assess the impact of such new objects.
If you think measuring the impact of journal articles with metrics is tricky, now think how tricky things become when we want to do the same for all these additional objects and their interactions with each other.

Take, say, the push to ensure that research data are "first class entities" like journal articles.
This leads to a whole series of problems, and in turn to organizations & standards created to address them. How should citations be captured (particularly between different research objects)? How do we aggregate downloads for datasets when copies exist all over the world in different repositories?
While persistent IDs are definitely part of the solution, as suggested above, it's not as simple as that.
Organizations and groups such as FORCE11, RDA, Make Data Count, Crossref and DataCite have emerged, resulting in standards and guidelines from FAIR to the COUNTER Code of Practice for Research Data to Scholix (linking literature to datasets) and more.
Even with standards in place, this isn't trivial, as it requires cooperation between the different players in the whole scholarly ecosystem.
See, for example, Susan Borda's attempt to track a citation made by a researcher, through a journal, to a dataset.
My experience following a citation to one of our datasets in the "Refs" section of an article in @theAGU And a diff data set mentioned in "Data Avail" statement from @Nature https://t.co/ee9VafEMaV @ShelleyStall @CrossrefOrg @DataCite @makedatacount #datalibs pic.twitter.com/cZ6YR70ooZ
— susan borda 🥌 (@mutanthumb) January 24, 2022
Unfortunately, she ran into many problems, including a bug....
I've also personally struggled with a similar problem, though in my case it was in the opposite direction, as I was linking from the data repository to the journal....
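For what it's worth, the Scholix framework mentioned above has aggregators that expose these article-dataset links via an API. Here is a rough sketch against the ScholeXplorer service; the endpoint, parameter name and response fields are my reading of its v2 documentation, and the DOI is a placeholder:

```python
import requests

# Placeholder article DOI, purely for illustration
ARTICLE_DOI = "10.1234/example-article"

# ScholeXplorer aggregates Scholix article <-> dataset links
resp = requests.get(
    "https://api.scholexplorer.openaire.eu/v2/Links",
    params={"sourcePid": ARTICLE_DOI},
)
resp.raise_for_status()
payload = resp.json()

print("Links found:", payload.get("totalLinks"))
for link in payload.get("result", []):
    # Each result is a Scholix link object; field casing follows the Scholix schema
    rel = link.get("RelationshipType") or {}
    target = link.get("target") or {}
    print(rel.get("Name"), "->", target.get("Title"))
```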
One last note: the citations discussed so far are based on traditional academic impact, where citations from other researchers are what matters.
In fact, there is increasing recognition that we may need to go beyond scholarly or academic impact, and that sometimes other types of impact might be worth tracking.
Be it impact on industry via patent citations (e.g. Lens.org), on society and government (e.g. Overton), or on education (e.g. Open Syllabus), this is a trend I see us heading toward.

This somewhat aligns with Anne-Wil Harzing's Disambiguating Impact, where she considers research impact by academic role.

For those unfamiliar with the idea of citations in textbooks/syllabi, the amazing Open Syllabus project is worth a look. Similarly, there have been proposals to extract similar data from Ex Libris's reading list product, Leganto, which is growing in popularity, among other products in the same class.
2. Push to open
Each of the "opens" - Open Access, Open Education, Open Data, Open Science, Open Citations and Open Infrastructure - is a huge topic on its own.
Some, like Open Access, are further along than, say, Open Data or Open Science, but each, if it becomes the norm, has numerous impacts on the research ecosystem and the way we librarians operate. For example, Open Science would imply that librarians need to become knowledgeable about reproducibility issues in a host of different disciplines.
I would say that of all these "opens", we academic libraries as a whole are probably most prepared to live in a world of Open Access, though it's all relative of course.
Skeptics and cynics might say these "opens" won't arrive anytime within our professional careers, and that might be so.
But here's a reminder that things might change very quickly. When I first wrote about open citations in 2018, they were fighting an uphill battle, with maybe 1% of citations deposited in Crossref made open.
Fast forward to early 2021, and we actually do live in a world where the movement towards open citations has mostly won out, to the point that all major publishers, even Elsevier, make the citations they deposit in Crossref open, and it is no longer an option to keep them closed.
There are now over 1 billion open citations, and we are approaching parity with Scopus and Web of Science.
The work is never finished of course, and we can always improve, but if you had told me back in 2018 that I4OC would be so successful just 4 years on, I wouldn't have believed you.
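You can verify this yourself: because publishers' deposited references are now open by default, anyone can pull them for free from the Crossref REST API. A minimal sketch (the DOI is just an illustrative placeholder):

```python
import requests

# Placeholder DOI, purely for illustration
DOI = "10.1000/example.doi"

# Crossref REST API: openly deposited references appear in "reference"
resp = requests.get(f"https://api.crossref.org/works/{DOI}")
resp.raise_for_status()
work = resp.json()["message"]

for ref in work.get("reference", []):
    # Not every deposited reference is matched to a DOI
    print(ref.get("DOI", "(unmatched)"), "-", ref.get("unstructured", ""))
```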
3. Technology affecting academia
In the past 10 years, we have grown used to breakthroughs in machine learning in image recognition, game playing, NLP/speech generation and now image generation. We hear about technologies, models and techniques like word embeddings, reinforcement learning, attention/transformers, CNNs, RNNs and huge language models, and are wowed by AlphaZero, GPT-3 and DALL·E 2.
Even when I gave the OCLC talk in 2018, it was clear deep learning was on the rise. Seminal papers like "Attention Is All You Need", which proposed the transformer model, were already out. Humanity's last stand in the game of Go was just over, and DeepMind had already released AlphaZero, a reinforcement learning model that learned totally from scratch, without human input beyond the rules. The first version of OpenAI's GPT was just out. But clearly things have accelerated since then.
Still, a lot of the potential benefits of these technologies seem to have barely touched our academic ecosystem, or at least their use does not seem apparent.
I believe we are starting to see products and services using the latest ML/DL technologies become widespread at all stages of research. A lot of this I was unfamiliar with, or it did not exist, in 2018.
Using the famous 101 Innovations in Scholarly Communication workflow tool categories, I argue that such tools are becoming increasingly used in all phases, from discovery all the way to assessment.

For a taste of such ideas, imagine if one could combine the summarization powers of GPT-3 with the image generation skills of DALL·E 2 to generate visual abstracts or posters.
What about fine-tuning GPT-3 to select, rank and extract findings and other characteristics of papers (e.g. population, research method, country of study) into a research matrix of papers? This already exists - Elicit.org.
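As a rough illustration of the underlying idea (this is my own sketch, not how Elicit actually works), one could prompt a GPT-3 model through the OpenAI API of the time to pull such fields out of an abstract; the prompt, field names and sample abstract here are all invented:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # assumes an OpenAI API account

# Invented abstract text, purely for illustration
abstract = ("We surveyed 1,200 undergraduates in Singapore using an "
            "online questionnaire to examine library anxiety...")

prompt = (
    "Extract the following fields from the abstract as JSON: "
    "population, research_method, country_of_study.\n\n"
    f"Abstract: {abstract}\n\nJSON:"
)

# text-davinci-002 was a then-current GPT-3 model
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=150,
    temperature=0,  # deterministic output suits extraction tasks
)
print(response["choices"][0]["text"].strip())
```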
Companies like UNSILO that provide machine learning solutions to publishers and vendors in the academic world already exist, and among other offerings provide solutions to help screen and speed up manuscript submissions that are starting to be used in many common manuscript submission systems.
The idea of using NLP to extract citation sentiment/intent/context now seems on firmer ground, given large-scale deployments by Semantic Scholar and scite, with Clarivate experimenting with a version of this idea.
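Semantic Scholar, for instance, exposes its machine-classified citation intents and the citing sentences through its public Graph API. A minimal sketch (using the BERT paper's DOI as an example):

```python
import requests

# Semantic Scholar accepts IDs in the form "DOI:<doi>"; this is the BERT paper
PAPER_ID = "DOI:10.18653/v1/N19-1423"

resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}/citations",
    params={"fields": "intents,contexts,title", "limit": 10},
)
resp.raise_for_status()

for item in resp.json().get("data", []):
    print(item.get("citingPaper", {}).get("title"))
    print("  intents:", item.get("intents"))    # e.g. ["methodology"]
    print("  contexts:", item.get("contexts"))  # the citing sentences
```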
There are more and more use cases emerging each day, but the main argument I set out in my recent talks is the simple observation that for such machine learning and deep learning techniques to succeed, a prerequisite is the ready availability of data for training.
And guess what? Trend 2, the push to Open, pretty much solves this issue.
Implications for information literacy
The rise of advanced ML/DL technologies in everyday use also raises implications for librarians trying to teach information literacy, or who fancy their role on the front lines of teaching users to identify disinformation.
Most librarians by now would have heard terms like algorithmic bias/literacy and deepfakes, but are we ready and able to talk intelligently about the capabilities and dangers of using tools like Elicit.org that use language models to extract data from papers?
Back in Feb 2019, OpenAI was initially worried about the dangers of releasing GPT-2, for fear that the technology in the wrong hands could be misused for nefarious purposes, and refused to release the largest 1.5 billion parameter model. They eventually relented.
In 2020, they released GPT-3, which at 175 billion parameters was over 100x bigger and, as you might expect, even better; this and larger models are now available to some extent.
What are the dangers of being able to easily access such capable text generation models?
For instance, there have been interesting attempts to see how easily one could use GPT-3 and language models like it to generate disinformation.
The answer, from the CSET report Truth, Lies, and Automation: How Language Models Could Change Disinformation, is yes, quite easily.
The ease of using GPT-3 is such that even someone like me could easily duplicate this and create my own credible "fake news" using the suggested prompts but adapting them to my local context! (see future blog post).
Conclusion - Implications for libraries and librarians
The first obvious implication is that the ecosystem is a lot more complicated, and users could use a guide who is familiar with some, if not most, of these changes.
Take preprints: there's clearly a big debate raging on how best to interpret them. Analysis has shown that Wikipedia uses preprints sparingly as references, following the Wikipedia policy WP:PREPRINT.
But more importantly, are librarians today teaching users how to deal with non-VoR copies of papers, which are so easily found thanks to Google Scholar and the pervasive use of the Unpaywall service in link resolvers, academic search engines and browser extensions?
I have asked before and haven't really found anyone doing much on helping students identify what they are looking at if it isn't a publisher PDF VoR.
— Lisa Janicke Hinchliffe (@lisalibrarian) January 10, 2022
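One small practical starting point: the Unpaywall API itself reports which version each open copy is, so a librarian (or a tool) can at least label what a student has found. A minimal sketch (the DOI is a placeholder, and Unpaywall asks for a contact email as a parameter):

```python
import requests

DOI = "10.1000/example.doi"   # placeholder; any article DOI works
EMAIL = "you@example.org"     # Unpaywall requires a contact email parameter

resp = requests.get(f"https://api.unpaywall.org/v2/{DOI}",
                    params={"email": EMAIL})
resp.raise_for_status()
record = resp.json()

best = record.get("best_oa_location") or {}
# "version" distinguishes submittedVersion / acceptedVersion / publishedVersion
print("Is OA:", record.get("is_oa"))
print("Best OA copy:", best.get("url"))
print("Version:", best.get("version"))
```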
Do you as an academic librarian have sufficient knowledge to guide a researcher who wants to properly cite a dataset (from either the repository or the journal side), not just for mere compliance with citation styles, but also to ensure such citations are captured properly so that they appear in aggregated systems like DataCite badges or Scopus (which displays Scholix links)?
Are you adept enough to discover, navigate and access all the different research objects out there, or even know what platforms exist?
Beyond education, academic librarians also need to ponder their role in providing access to and preservation of these new objects.
Take research datasets: should academic libraries start collecting them at the institutional level? The consortium level? Or even the national level?
Should libraries even be involved, or should other players like publishers or vendors do this?
Lastly, storing and providing access in a stewardship role costs money. What business model should we adopt?
Of course, the current trend is towards "open", in which case the questions about business models, ownership of source data, etc. become even trickier.
If you think this doesn't matter, see this clause about ownership of reviews....
In terms of the increased use of deep learning technology, the age-old question rears its head: will librarians know enough to guide users through the complicated world of deep learning and machine learning techniques embedded in the tools they use? Or has that ship already sailed since 2000, when Google and internet search engines became the dominant info-finding tools of choice?

