4 Comments
Mike Caulfield

Of all these, I do believe that learning and developing tools and methods to evaluate things like relevant retrieval and appropriate/useful summary are crucial. One of the reasons I focus so much on contextualization and verification is that they are very amenable to the development of model responses/response rubrics, which can be compared with output to better understand the weaknesses and strengths of various systems and, even more importantly, the specific conditions under which they fail.

I am not even sure this has to be done at the highest level of formality (unless you are looking to publish). But I find it odd when people in this space do *not* have at least a half dozen ready-to-go challenges to test the behavior of a given system along multiple dimensions.

Frank Norman

That’s a strong message and call to action, or at least a call for learning. It sounds good.

I wonder, are there different evaluation methods for tools as used by librarians, vs tools as used by end users? It seems to me that these are two quite different situations.

Kristin Whitman

Thank you for this post, Aaron. I have been wrestling with these same questions, and your course at FSCI was extremely helpful in orienting me to these tools. I have a very self-serving question: you mention having seen a number of task-based evaluations for Elicit, Undermind, etc. Do you have a bibliography of these publications you could share? (I assume at least some of them are blog posts rather than peer-reviewed publications.)

Aaron Tay

Unfortunately, a lot of them are privately done tests that I don't have permission to share publicly (I have asked). But let's just say that with most properly done tests (and there are a lot of horrible ones), I am rarely surprised by the results. Of course, things are always changing: some tools may have improved a lot since I last looked at them, and I don't know every tool, so there might be better tools than the ones I know.

I can say I have been generally disappointed by the "AI powered" features of, shall we say, the "big boys," and I am even burnt out enough to consider a policy of not testing such tools until at least a year after they are publicly available. Sadly, from my understanding, many such implementations are lazy and cheap (using extremely old and cheap models), leading to clearly inferior performance.

The sad thing is that many librarians are already predisposed to be against AI, and when they see poor implementations it just adds fuel to the fire: "stochastic parrots" = cannot be useful for retrieval.
