I test Elicit, Scite Assistant, SciSpace, Primo Research Assistant, Undermind, Ai2 ScholarQA, and many more to see how they handle retracted papers.
Evaluation is the right question. Anthropic recently published how they evaluated RAG against plain grep for Claude Code's codebase understanding. Grep won. The model understood context it found itself better than pre-retrieved chunks. Makes you wonder how many RAG evaluations are measuring the wrong things entirely. Covered it here: https://reading.sh/anthropic-revealed-how-they-build-claude-codes-brain-11e48e75fd01?sk=6662727c70ed637cd1692a81f33139e2