Learning about information retrieval and "AI search" in 2025
I recently gave a one-hour talk to the Chinese American Librarians Association (CALA) Asia Pacific Chapter on my favourite topic: how "AI" is changing academic search.
I have been giving versions of this talk, both internally in my institution (to librarians) and externally, for almost two years, and it has always been very tricky to do well.
This is because I believe that for librarians to really appreciate and engage with the changes that what many call "AI search" brings, we need some understanding of two major topics:
a) a basic grasp of information retrieval techniques (and what is different now), and
b) some amount of understanding of deep learning (machine learning and how neural networks work).
As such, I have been struggling both to understand it myself (a lot of areas are still evolving, though I think I have the basics down) and to explain the right amount of it to others.
My past attempts to explain this on the blog (e.g. here, here, here, here) are actually quite bad, owing to a mix of incomplete understanding and not knowing what was worth explaining.
This talk
I am relatively happy with this talk because it has been refined quite a bit, and it's probably much better than other talks I've given in the past, but it's still far from perfect.
I probably could have done better with 30-60 minutes more (I would have spent it explaining the basics of what a large language model is and going deeper into TF-IDF and BM25). Even now I am still trying to find the right mix between what is merely "good or interesting to know" and what is critical for an academic librarian, with real implications for how they work with information retrieval.
Possible course for librarians?
Due to a lot of interest from librarians, I am currently toying with the idea of running a longer online workshop or course covering something like this:
1. Definition of "AI search"
1.1 What is a language model
1.2 What is not included in this talk (citation-based mapping tools like Connected Papers and ResearchRabbit, and use in evidence synthesis)
2. AI search impact 1 - Improved relevance of search results from contextual embeddings, aka semantic search
2.1 Live examples of what semantic search can accomplish that standard lexical search can't
2.2 Understanding how contextual embeddings work aka nearest neighbor match and its implications for search
2.3 Explaining why semantic search alone sometimes fails and needs to be combined with lexical search
2.4 A discussion of three main input styles for these systems - keyword-based vs natural language vs prompt-engineering style
2.5 A discussion of common issues in new generation semantic search and ways to mitigate (e.g. use post filters, exact matches)
3. AI search impact 2 - Generate direct answer with citations (Retrieval augmented generation type implementation)
3.1 Explain what RAG is and the benefits of RAG (e.g. grounded answers)
3.2 Review common failure cases of RAG systems, such as indirect citations and unfaithful citations
3.3 Tips on trying to reduce RAG failure
3.4 A discussion of three main input styles for these systems - keyword-based vs natural language vs prompt-engineering style in RAG systems
4. AI search impact 3 - Extract data from abstract & full-text* (e.g. create a research/synthesis matrix table of studies)
4.1 General accuracy rates and tips on using this feature
4.2 Copyright implications
5. Misc
5.1 How to evaluate and test AI search systems (informal vibe test vs gold standard tests vs formal information retrieval TREC tests)
5.2 Content providers' attempts to restrict usage of even titles/abstracts in RAG systems, e.g. Elsevier content in Primo Research Assistant
5.3 Rise of AI + search writing tools, e.g. Keenious, Jenni.ai, SciSpace (AI writer function), GPT-4o with Canvas
5.4 Where and how to keep up
5.5 Future of AI and search and how I see this further developing
The above assumes you have some basic understanding of machine learning and deep learning, which can be acquired by going through, say, Google's Machine Learning Crash Course.
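To give a flavour of what section 2.2 would cover, here is a toy sketch of the difference between lexical and semantic search. The document titles are made up, and the "embeddings" are tiny hand-picked vectors purely for illustration; real systems use vectors with hundreds of dimensions produced by a trained language model.

```python
import math

# Hypothetical document titles with hand-made toy "embedding" vectors.
docs = {
    "Risk factors for myocardial infarction": [0.9, 0.8, 0.1],
    "A history of public libraries":          [0.1, 0.2, 0.9],
}

# The user searches for "heart attack"; we assume its embedding lands
# near the cardiology document in vector space.
query = "heart attack"
query_vec = [0.85, 0.75, 0.15]

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Lexical search: exact keyword overlap finds nothing, because neither
# title contains the word "heart".
lexical_hits = [t for t in docs if query.split()[0] in t.lower()]

# Semantic search: nearest-neighbour match on embedding vectors still
# surfaces the myocardial infarction paper.
best_match = max(docs, key=lambda t: cosine(query_vec, docs[t]))

print(lexical_hits)  # []
print(best_match)    # Risk factors for myocardial infarction
```

This also hints at section 2.3: because matching happens in vector space rather than on exact terms, semantic search can return "close but wrong" results, which is why production systems usually blend it with lexical signals like BM25.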
More learning via Youtube
For those who want to learn more about information retrieval and the impact of AI, I have created a short playlist of videos by experts that I felt were the clearest at explaining what is going on while assuming less prior knowledge.

I focus more on the search part, but if you really want to go in depth into Large Language Models (LLMs), I highly recommend the following book. It teaches you the high-level concepts of what goes into training a modern transformer-based LLM, it is very practical, with chapters on different NLP tasks like classification and sentiment analysis, and the visualizations are amazingly clear.

Do I as a librarian really need to know so much?
There are many who feel librarians don't need to be AI scientists to be able to teach AI literacy, and I can't say they are wrong, as there are different levels of being AI literate and different areas of AI literacy to specialize in.
If you are aiming to be a generalist who can talk broadly, you can't be expected to go deep in any particular area. Or you could be like me and focus specifically on learning the ins and outs of just AI search tools...
It's also really hard to draw the line at how much you need to know.
For example, I think most people would say knowing the difference between supervised learning, unsupervised learning and semi-supervised learning is far from being an "AI scientist", but what about knowing the difference between an encoder model, a decoder model and an encoder-decoder model? Or knowing, even at a basic level, how gradient descent via backpropagation is essentially an application of the chain rule from calculus?
Or knowing at a high level how ColBERT or SPLADE (used by Elicit.com) differs from a typical bi-encoder? Do you need to know how contrastive loss and triplet loss work to fine-tune dense embeddings?
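The backpropagation-as-chain-rule point above can be made concrete with a toy one-parameter example (all numbers chosen arbitrarily for illustration). For a single linear "neuron" y = w * x with squared loss L = (y - target)^2, the chain rule gives dL/dw = dL/dy * dy/dw = 2 * (y - target) * x, and gradient descent just nudges w against that derivative:

```python
# One weight, one input, squared loss. Real networks do exactly this,
# just with millions of weights chained through many layers.
w, x, target, lr = 0.5, 2.0, 3.0, 0.1

for step in range(50):
    y = w * x                      # forward pass
    grad = 2 * (y - target) * x    # backward pass via the chain rule
    w -= lr * grad                 # gradient descent update

print(round(w, 3))  # converges to target / x = 1.5
```

Whether a librarian needs to be able to write this out is exactly the kind of line-drawing question I mean.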
I guess the line I would draw is: does knowing this fact allow you to understand how something works, and hence change the way you use it?