I might be exaggerating slightly, but in the few new evaluation matrices circulating for AI-powered search, "relevancy" is often just one of several categories, evaluated in a highly subjective, "I-know-it-when-I-see-it" manner. This is baffling, given that a search engine (AI-powered or not) lives and dies on its ability to retrieve relevant results. Even generative tools that author reports are built on a house of cards if their underlying retrieval system fails to find the most relevant items.
Of all these, I do believe that learning to develop tools and methods for evaluating things like relevant retrieval and appropriate/useful summarization is crucial. One of the reasons I focus so much on contextualization and verification is that they are very amenable to the development of model responses/response rubrics that can be compared against a system's output to better understand the strengths and weaknesses of various systems and, even more importantly, the specific conditions under which they fail.
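To make this concrete, here is a minimal Python sketch of what I mean by a response rubric. Everything in it (the `Rubric` fields, `score_against_rubric`, the example question and DOIs) is a hypothetical illustration of the idea, not any particular tool's API:

```python
# A minimal sketch of the rubric idea, assuming a hand-written "model response"
# per question. All names here are hypothetical illustrations.

from dataclasses import dataclass, field

@dataclass
class Rubric:
    question: str
    must_cite: list[str] = field(default_factory=list)         # sources a good answer should retrieve
    must_mention: list[str] = field(default_factory=list)      # claims/terms a good answer should contain
    must_not_mention: list[str] = field(default_factory=list)  # known traps, e.g. a retracted paper

def score_against_rubric(answer: str, cited_ids: set[str], rubric: Rubric) -> dict:
    """Compare a system's output with the rubric and report where it fell short."""
    text = answer.lower()
    return {
        "missing_citations": [s for s in rubric.must_cite if s not in cited_ids],
        "missing_points": [m for m in rubric.must_mention if m.lower() not in text],
        "traps_hit": [t for t in rubric.must_not_mention if t.lower() in text],
    }

# Example with a fabricated question and system answer:
rubric = Rubric(
    question="What does the evidence say about X?",
    must_cite=["doi:10.1234/abcd"],
    must_mention=["effect size", "replication"],
    must_not_mention=["retracted 2019 study"],
)
report = score_against_rubric("The effect size is small...", {"doi:10.9999/zzzz"}, rubric)
print(report)  # names the specific conditions the system failed
```

The point is less the scoring code than the output: the report names the specific conditions a system missed, which is exactly what a single subjective "relevancy" grade hides.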
I am not even sure this has to be done at the highest level of formality (unless you are looking to publish). But I find it odd when people in this space do *not* have at least a half dozen ready-to-go challenges to test the behavior of a given system along multiple dimensions.
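A hedged sketch of what such a challenge set could look like, with each challenge probing a different dimension. The dimensions and queries here are illustrative assumptions, not a standard taxonomy, and `search_fn` is just whatever callable wraps the tool under test:

```python
# Half a dozen ready-to-go challenges, one per dimension (all illustrative).
CHALLENGES = [
    {"dimension": "known-item retrieval",     "query": "find the 2020 paper by Smith on topic Y"},
    {"dimension": "precision / boolean",      "query": "papers on A but NOT in context B"},
    {"dimension": "recall on a niche topic",  "query": "everything on rare method Z"},
    {"dimension": "summary faithfulness",     "query": "summarize the consensus on X"},
    {"dimension": "citation grounding",       "query": "which sources support claim C?"},
    {"dimension": "trap / negative case",     "query": "a query with no good answer in the corpus"},
]

def run_challenges(search_fn):
    """Run every challenge through a system and collect outputs for grading.
    `search_fn` is an assumed wrapper around the tool being tested."""
    return {c["dimension"]: search_fn(c["query"]) for c in CHALLENGES}
```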
That’s a strong message and a call to action, or a call for learning. It sounds good.
I wonder: are there different evaluation methods for tools as used by librarians versus tools as used by end users? It seems to me that these are two quite different situations.
Thank you for this post, Aaron. I have been wrestling with these same questions, and your course at FSCI was extremely helpful in orienting me to these tools. I have a very self-serving question: you mention having seen a number of task-based evaluations for Elicit, Undermind, etc. Do you have a bibliography of these publications you could share? (I assume at least some of them are blog posts rather than peer-reviewed publications?)
Unfortunately, a lot of them are privately done tests that I don't have permission to share publicly (I have asked). But let's just say that with most properly done tests (and there are a lot of horrible ones), I am rarely surprised by the results. Of course, things are always changing: some tools might have improved a lot since I last looked at them, and I don't know every tool, so there may be better tools out there than the ones I know.
I can say I have been generally disappointed by the "AI-powered" features of, shall we say, the "big boys," and I am burnt out enough to consider a policy of not testing such tools until at least a year after they become publicly available. Sadly, from my understanding, many such implementations are lazy and cheap (using extremely old and cheap models), leading to clearly inferior performance.
The sad thing is that many librarians are already predisposed to be against AI, and when they see poor implementations it just adds fuel to the fire: "stochastic parrots" cannot possibly be useful for retrieval.