AI academic search needs better frameworks for understanding and evaluation. These three librarian projects are a start
What it looks like when the conversation about AI search, and our understanding of it, gets serious
Most librarian commentary on AI search still operates at the level of warnings and impressions. Tools hallucinate. Sources are opaque. Students should be cautious. None of this is wrong, but none of it helps a colleague decide whether Undermind is appropriate for a scoping review or whether SciSpace’s summaries can be trusted in a literature class.
What would help is work that treats these systems as understandable and evaluable: layered architectures with retrieval components, generation components, interfaces, and trade-offs that can be named and weighed.
… stop thinking of the result as an answer, and start thinking of it as a search result with some synthesis on top. When all the sources are in alignment that can of course present as an answer — but it still is downstream from available sources, it still has made decisions about weighting, and so forth. If you think of it as a dressed up search result instead of an intelligence (albeit one that can synthesize in impressive ways) you’ll be better able to process the weirdness that sometimes results.
Over the last week, three librarians, none of them coordinating with me or with each other, have published work on AI search that leans in this more productive direction. Aster Zhao at HKUST, Wang Huajin at Carnegie Mellon, and Alfred Wallace at the University of North Dakota each built something concrete: an interface, a framework, a rubric (vibe coded with Claude). All three happen to cite my work, which is flattering.
Taken together, these three projects make a point I keep returning to. The conversation around AI search needs to happen at a better level. We need fewer sweeping claims and more attempts to clarify how systems differ, what assumptions they encode, and how they should be assessed for particular tasks.
1. Aster Zhao’s (Hong Kong University of Science and Technology) snapshot of GenAI tools for research
The first example is by Aster Zhao from HKUST Library: Snapshot of GenAI Tools for Research.
What I like about this one is that it does more than list tools (though the list itself is very comprehensive, with 40+ tools). Besides the table view, it includes a strong timeline showing the evolution of AI search and the different “roles” that AI plays in search tools. That visual layer makes a messy landscape much easier to understand.
The work also cites some of my own frameworks and distinctions, including:
The Horseless Carriage of AI Search, where I critique vendors who equate “AI search” with simplistic uses of LLMs, such as Boolean query construction and similar bolt-on approaches, while ignoring more productive approaches like semantic search, better reranking, and agentic search.
Overall, this is a strong and current map of the space, helping people to see that AI search is not one thing. I have been planning to update my own list of such tools here but Aster’s work has reduced my incentive to do so.
2. Wang Huajin’s (Carnegie Mellon University) AI-Powered Tool Assessment Framework
The second example is AI-Powered Tool Assessment Framework by Wang Huajin of CMU.
This project is especially interesting because I often get asked during my talks or workshops for an evaluation framework for AI search tools, and in truth I have not been fully happy with many of the existing ones (more on that later). This framework gets closer to something I could actually endorse.
It asks evaluators to consider questions across four areas: retrieval, generation, output, and usability. That already puts it ahead of a great deal of discussion, because it treats these systems as layered and evaluable rather than as black boxes that either “work” or “hallucinate”.
It appears to draw substantially on my older Testing AI Academic Search Engines series (parts I and II), adapting many of the questions I put in them, though I do not want to overstate my influence here. There is at least one section, on sustainability, that does not come from me, and that is a good thing.
Huajin will also be presenting at FORCE2026 in June in Singapore on “Developing an assessment framework to support critical evaluation of AI-powered academic search engines”, which suggests this is not just a quick web experiment but part of a serious and wider conversation within her institution. I am looking forward to her presentation.
3. Alfred Wallace’s (University of North Dakota) adjustable rubric for AI search tools
One reason I have resisted producing a neat evaluation matrix with fixed weights, a catchy name, and a polished PDF is that evaluation is rarely one-size-fits-all. Different users, tasks, and contexts demand different priorities. A tool that helps undergraduates find a few relevant papers may be unsuitable for searches that demand high recall. A tool that is fast and convenient may be unacceptable if transparency or source traceability matters.
The third project, Evaluating AI Tools for Research by Alfred Wallace at the University of North Dakota, handles this better than most. Its structure is fairly traditional, focusing on sources, models, and wrappers, and the categories themselves are not ground-breaking.
The framework incorporates the provenance of the tool (established vendor, startup, or user-built), which is essential for maintaining a macro-level, long-term perspective beyond isolated feature sets.
Finally, it has one important feature: users can adjust weights of criteria when calculating a final score. That is a much more sensible approach than pretending there is a universal rubric that works equally well for everyone.
As it now stands, you select a scenario based on whether you are a faculty researcher, graduate student, undergraduate, instructor, or student. The scenario automatically flags different criteria as either “priority” or “key question” categories, which affects the weighting. While you cannot change which criteria¹ are considered “priority” or “key question”, you can change the size of the multiplier for criteria in either category.
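To make the mechanism concrete, here is a minimal sketch of how scenario flags and adjustable multipliers could combine into a final score. This is not Alfred's actual formula, which I have not inspected. The criterion names, the 0-5 rating scale, the scenario table, and the weighted-mean calculation are all assumptions made for illustration.

```python
# Hypothetical scenario table: which criteria each scenario flags, and how.
# In the real tool these flags are fixed per scenario; names here are invented.
SCENARIO_FLAGS = {
    "undergraduate": {
        "ease_of_use": "priority",
        "source_transparency": "key_question",
    },
    "faculty_researcher": {
        "recall": "priority",
        "source_transparency": "priority",
    },
}


def weighted_score(scores, scenario, multipliers):
    """Return the weighted mean of criterion ratings for one scenario.

    scores      -- criterion name -> evaluator rating (assumed 0-5 scale)
    scenario    -- key into SCENARIO_FLAGS
    multipliers -- flag name -> weight multiplier (the user-adjustable part)
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    flags = SCENARIO_FLAGS[scenario]
    weighted_sum = 0.0
    total_weight = 0.0
    for name, rating in scores.items():
        flag = flags.get(name)  # None when the scenario does not flag it
        # Unflagged criteria keep a baseline weight of 1.0 (an assumption).
        weight = multipliers.get(flag, 1.0) if flag else 1.0
        weighted_sum += weight * rating
        total_weight += weight
    return weighted_sum / total_weight


# Same tool, same ratings, two different scenarios.
scores = {"recall": 2, "ease_of_use": 5, "source_transparency": 3}
multipliers = {"priority": 2.0, "key_question": 1.5}
for scenario in SCENARIO_FLAGS:
    print(scenario, round(weighted_score(scores, scenario, multipliers), 2))
```

With these made-up numbers, the same tool scores about 3.67 for the undergraduate scenario and 3.0 for the faculty researcher scenario, because low recall counts for more once recall is flagged as a priority. That is the whole argument for adjustable weights in miniature.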
The criteria listed also reflect an informed understanding of how AI search systems differ architecturally. Alfred cites my writing (specifically the Horseless Carriage post) to support his “AI search architecture spectrum”, including the distinctions between LLM-only approaches, Boolean generation, Boolean plus reranking, hybrid search, and agentic or deep search.
Again, what matters here is not whether every category or weighting is perfect. It is that the tool reflects a more informed understanding of how AI search systems differ under the hood.
This project ends with a fascinating section on “Visions of the future” including the struggle between a future where users work on explicit platforms (both existing and new ones) vs ones where “literature research lives in the same tool that does data analysis, coding…”.
Conclusion
To be clear, I do not want to overstate my influence on any of these projects. Nor do I necessarily endorse every detail in all three.
In fact, I have many things to say about assessment frameworks and metrics for AI search tools in my next post.
But I do think they are all amazing attempts to engage with AI search by librarians.
That said, three projects that all cite me carry an obvious selection effect: the librarians already engaging at this level are precisely the ones likely to read what I write². So this is not the strongest evidence that the wider conversation has shifted. Still, anecdotally, looking at recorded talks and presentations by academic librarians, there seems to be much more understanding of how AI search works under the hood than there was in, say, 2023/2024.
If AI search tools are going to become part of academic workflows, we need more work of this kind: work that helps us distinguish between kinds of systems, evaluate them in context, and make their assumptions more visible. Blog posts can start those conversations. Interfaces, frameworks, and rubrics make the ideas usable for others.
Bonus
As a bonus example, I would also point to my recent SMU Libraries piece on MCP for researchers. It is doing something slightly different from the three projects above, but in a complementary way.
Rather than offering a framework or rubric, it tries to explain in concrete terms what happens when tools like Scite and Consensus are connected to Claude or ChatGPT via MCP, so that the model can search across sources, compare results, and carry out multi-step academic workflows on the user’s behalf.
In that sense, it is less about evaluating AI search and more about making the underlying infrastructure and possibilities legible to researchers. It is also, I think, one of my clearer pieces on the topic, partly because it was written for an institutional audience rather than as one of my longer blog posts where I am often thinking aloud.
1. This is something easily changed with simple vibe-coding, of course.
2. Of the three librarians, I have probably had the most contact with Aster, who attended the first run of my intensive three-session “AI-powered search for librarians” course in July 2025.

Another great piece, Aaron! Thanks for the call out to our work -- it's actually a team effort :) You are correct that we are doing more than a web app. Among other things (see our program for more: https://www.library.cmu.edu/service/ai-research), we've piloted using the framework to vet our library subscriptions with AI features embedded, and making decisions on whether to subscribe or turn on/off certain features, which led to some interesting interactions with vendors. Happy to share some of those insights when we meet in June. And look forward to your next post!
Hello nice piece. I’m a long time follower from twitter era.