Are AI Tools Killing Review Articles? Two Failure Modes Suggest Otherwise
arXiv recently restricted review article submissions in computer science, requiring journal or conference acceptance before deposit. The announcement specifically cited an “unmanageable influx” of submissions, noted that LLMs have made review/position papers “fast and easy to write,” and observed that many were “little more than annotated bibliographies.”
Separately, arXiv has also tightened its author endorsement policy.
It’s hard not to see the shared pressure behind both: scholarly infrastructure is facing a surge of low-quality, high-volume submissions, and AI has made certain genres—especially “survey-like” writing—cheap to produce.
So: are we witnessing the death of the review article?
Galli et al. (2024), writing in Learned Publishing, argue “no” with a useful distinction. They separate descriptive reviews (primarily summarising and aggregating) from reflexive reviews (interpretation, theorisation, agenda-setting). Their claim is that AI will commoditise descriptive reviews, while reflexive reviews, which require “human imagination, creativity and abstraction”, will retain value.
That framing is plausible. But I think it underestimates something important: even “mere” descriptive reviews are hard to do well, because they depend on two upstream bottlenecks that today’s AI literature tools still haven’t fully solved.
Failure Mode 1: The corpus construction gap (retrieval and exhaustiveness)
Can these tools actually find ALL the relevant papers, even when the inclusion criteria are clear?
At SMU Libraries, we’ve been early institutional subscribers to Undermind (since 2024). It’s among the earliest and strongest off-the-shelf “deep research” tools we’ve seen, and it’s heavily used and genuinely appreciated by faculty and PhD students.
But when we test it against published systematic review corpora, using each SR’s included-studies list as a “gold standard” target set, Undermind retrieves only ~30–80% of those target papers after running for around 8 minutes.
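For readers who want to replicate this kind of check, here is a minimal sketch of the recall calculation, assuming you have the SR’s included-studies DOIs and an export of the tool’s results. The DOIs, function names, and matching-by-DOI approach are illustrative assumptions on my part, not a description of Undermind’s internals.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase and strip common DOI URL prefixes so records match reliably."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def recall_against_gold_standard(gold_dois, retrieved_dois) -> float:
    """Fraction of the SR's included studies that appear in the tool's results."""
    gold = {normalize_doi(d) for d in gold_dois}
    retrieved = {normalize_doi(d) for d in retrieved_dois}
    return len(gold & retrieved) / len(gold) if gold else 0.0

# Illustrative stand-ins for a published SR's included-studies list
# and an export of a deep-research tool's results.
gold = ["10.1000/a1", "10.1000/a2", "10.1000/a3", "10.1000/a4", "10.1000/a5"]
found = ["https://doi.org/10.1000/A2", "doi:10.1000/a4", "10.1000/zzz"]
print(f"Recall: {recall_against_gold_standard(gold, found):.0%}")  # Recall: 40%
```

In practice, matching only on DOIs undercounts slightly (older included studies sometimes lack them), so a title-based fallback is worth adding before drawing conclusions.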
Systematic reviews we benchmark against are often narrow, with clear inclusion/exclusion criteria and established terminology. If these are “easy mode,” why does retrieval still fall short?
A big part of the answer, I believe, is architectural and product-driven. Tools like Undermind, Elicit1, and Consensus tend to be tuned for high precision and “reasonable recall”, not exhaustive coverage. They’re designed to help a researcher get to a useful working set quickly, not to support Cochrane/JBI/Campbell-style comprehensiveness.
This aligns with what others (mostly evidence synthesis practitioners) said in response to my LinkedIn post (which this post is built on). John Frechette articulated this: “When it comes to retrieving hundreds or thousands of papers on a topic, current tools aren’t really set up for it.” Howard White agreed: “Finding papers is a weak point.”
And this matters because the descriptive/reflexive distinction quietly assumes access to a representative corpus. If your corpus is incomplete in systematic ways, your synthesis (descriptive or reflexive) inherits those blind spots.
To be clear: for many purposes, “good enough” retrieval is good enough. Tools that cite (almost all of) the seminal works, surface the major strands, and produce coherent overviews quickly can absolutely save a ton of time when producing reviews.
But for methods where exhaustive recall is the point, current tools are still unreliable. The obvious fix is to “throw more compute at it.” More time and more queries do help, but so does attention to less glamorous issues: corpus coverage (including paywalled content), indexing scope, metadata quality, transparent search strategy control, deduplication, and screening workflows.
Even so, if I had to bet, I’d say this first gap is the more solvable one. For example, METR’s task-horizon measurements currently put frontier models at a 50% success rate on tasks that take humans over six hours; a tool designed specifically for systematic reviews might be able to close the gap.
The harder gap is the next issue.
Failure Mode 2: The conceptualisation gap (expert judgement starts before retrieval)
The reflexive review problem isn’t only about clever synthesis after you’ve gathered the papers. It often starts earlier.
You need expert judgement to find the papers in the first place.
Over the 1.5 years of our institutional subscription to Undermind, I have made it a point to ask the few disappointed users why they don’t like the tool and to collect their failed Undermind queries.
The failed queries cluster in a very recognizable pattern: broad, often cross-disciplinary questions that require interpretive leaps—connecting subdomains that don’t share vocabulary, or asking for relationships that haven’t been named explicitly in the literature.
Users report that the tools interpret queries narrowly and conservatively. They stay close to established terminology and mainstream frameworks. They struggle to “bridge” conceptual gaps—especially when the question is exploratory (“what research exists?”) rather than confirmatory (“what does this well-defined literature say?”)2.
This shouldn’t be unexpected. These systems are trained to reflect patterns in existing literature, not to make bold interpretative moves. They’re designed to reproduce consensus: great at aggregating what a field already recognises, weaker at proposing new framings or deliberately hunting for minority subdomains and boundary cases.
Neal Haddaway, another evidence-synthesis expert, makes a closely related observation: in broad evidence maps, where the question is essentially “what research exists?”, AI tends to miss important minority subdomains, not because it can’t describe them, but because it can’t systematically search across diverse topics without collapsing toward the dominant cluster.
“… my experience has been that AI fails for broad evidence maps where the question is “what research exists”, not because AI can’t describe those concepts, but because it can’t systematically search for diverse topics, and tends towards aggregation, missing important minority subdomains.”
The OpenScholar paper (Asai et al., 2025) is interesting too. Nature headlined that experts preferred OpenScholar’s outputs 51–70% of the time over PhD-written answers. But what did it excel at? “Comprehensive coverage, organisation, breadth of sources”: classic descriptive synthesis. That’s impressive, and genuinely useful. But it doesn’t magically solve the deeper issue: will the system make the right conceptual jumps, notice what’s missing, and generate non-obvious observations that advance the field?
Even if you patiently iterate—rephrase, expand, steer—today’s tools still tend to reproduce the existing shape of the literature rather than challenge it. And for reflexive, field-shaping work, that’s the whole game.
It’s important to note that most AI literature review tools use the equivalent of USD 20-tier models. There are hints that USD 200-tier models such as ChatGPT Pro and Gemini Deep Think can surpass these limits and come up with truly original scientific insights, but we will only find out once their costs drop enough for off-the-shelf AI literature review tools to use them.
Conclusion
Review articles—descriptive included—won’t die as long as incentives persist; as librarian Alex Carroll points out, reviews are cited at higher rates than primary research. More importantly, for graduate students, writing a review chapter has long been a pedagogical exercise: you write to build the conceptual scaffolding that lets you do original research.
The question is not merely:
“Can AI write competent reviews?”
It’s:
“What happens to researcher development if AI does the cognitive work that reviews were historically for?”
A well-written review becomes Chapter 1 of a dissertation because writing it forces the author to wrestle with definitions, boundaries, tensions, and absences—exactly the things that later shape novel contributions. If we offload that struggle too early, we may get more polished text and fewer well-formed thinkers.
So yes: increasingly, AI makes descriptive review prose cheap. But judgement—what to include, how to frame, what’s missing, what matters—remains scarce. And in many disciplines, the only reliable way people learn to produce that judgement is still the slow, slightly painful process of doing the review themselves.
An earlier version of this argument appeared as a LinkedIn post, and this post benefited a lot from reader comments and pushback.
Of these, only Elicit has pivoted to focusing on systematic reviews. Elicit just reported an 85% recall rate using the cutting-edge Opus 4.6, as of Feb 2026.
This is an example of a (lightly altered) query that Undermind struggles with for many reasons (e.g. a humanities topic, lack of monograph coverage, extreme breadth, plus the additional difficulty of asking how works are cited): I want to find a few articles, books, or chapters in the subfield of science and technology studies—published since 2000—that cite Bruno Latour’s Reassembling the Social specifically for his concepts of actor-network theory, translation, or non-human agency (across any substantive domain), and provide a basic description of how Latour is cited in each.