From Fixed Search Workflows to Agentic Academic Search: Promise, Progress, and the Invisible Menu Problem
Undermind's Projects feature gives a taste of the promise and challenges of users managing agents
Introduction
Academic search tools might be moving towards a different architectural model. The first generation of AI-powered literature review tools — Undermind, Consensus Deep Search, AI2’s Asta — relied on predetermined, hand-crafted workflows. You entered a query, the system executed a fixed pipeline, and you received results. The LLM’s role was largely confined to query understanding and output formatting.
That might be changing. As LLMs have become more capable of flexible, multi-step reasoning, vendors are beginning to build systems where the AI can dynamically select and chain tools to accomplish tasks that were never explicitly programmed on their platforms. This may mark a move from pre-defined academic “deep research” workflows to more flexible agentic search flows[1] being available on vendor platforms. In this post, I discuss some of the early examples, such as Undermind Projects.
But this shift introduces a problem that is not getting enough attention. When a system can theoretically do anything, how does the user discover what it can actually do?
Unlike an advanced user who sets up their own Claude Code as a “research agent”, a user of a vendor platform has almost no insight into how the vendor has set up the agents.
I call this the ‘invisible menu’ problem (a rename of what I previously called the “blank box” problem) and I think it may become one of the new usability challenges for the next generation of AI academic search tools that include agents on their platform.
The problem is compounded by the fact that, unlike traditional library search tools where decades of use have established shared conventions and mental models, agentic search is so new that users have no baseline expectations for what these systems should even be capable of.
In my previous post, I argued that bolting LLMs onto legacy search paradigms, what I called the "horseless carriage" approach, delivers limited benefit, and that most of the potential improvements in search will come from better reranking and agentic search.
In this post, I examine how three tools (Elicit, Undermind, and SciSpace) are attempting to embed agentic capabilities or flows in their platforms. I use Elicit and especially Undermind's new Projects feature as the main worked examples, and I use what I found while testing them to illustrate the invisible menu problem in practice. SciSpace is discussed more briefly as a contrast point, since I have not tested it as deeply.
Why agentic now?
Ian Mulvany, CTO of BMJ Group, responded to my earlier post on the lack of agency in academic deep research tools with an observation that captures the moment well: “The fixed workflows that the current crop of tools are showing were developed when models were weaker, yet from a product perspective, trying to build something that is simultaneously specific and general, is really hard.”
Incidentally, Ian Mulvany will be the second keynote speaker at FORCE2026, to be held in Singapore on 3-5 June, speaking on “Surviving the Disruption: Scholarly Communication in the Age of AI”. FORCE2026 will have many tracks on this topic, including but not restricted to the impact of AI on literature review (including a presentation by yours truly on “Does Agentic Deep Search Converge? Reproducibility Questions for LLM-Driven Literature Discovery”).
He is right. The tools I tested in that earlier post often struggled when asked to do things outside their intended workflows, even when they seemed to have some of the underlying capabilities needed to succeed. Either the models were not yet good enough to reason flexibly across available tools or, more likely, they were constrained from doing so by the system design.
That constraint is loosening as vendors realise the capabilities of these new models. And the response is coming from multiple directions simultaneously.
Vendors of specialised academic search tools face a strategic choice. They can make their products available as components in user-built agentic workflows — typically by providing MCP servers, as Scite, Consensus, and SciSpace have done, or APIs, as Elicit has (currently in beta).
“What If Claude or ChatGPT Could Search Academic Databases For You — and Then Do Something Useful With the Results?” is a simple guide I wrote for our users.
But they also want to maintain the value of their own platforms rather than becoming a commodity connector inside Claude or ChatGPT. Being just one MCP server among many risks ceding control of the user experience and, ultimately, the business relationship. Their response, increasingly, is to try to have it both ways: offer the connector while also making their own platform more agentic. Even publishers are moving in this direction. Wiley AI Gateway is one of the earliest academic MCP server adopters, though as a publisher rather than a search aggregator, Wiley’s interests are fundamentally different from those of tools like Scite, Consensus, or SciSpace. Wiley’s MCP server also returns full-text chunks that may be relevant to the query, on top of the paper’s metadata (see my earlier review).
Libraries are not standing still either. UT Libraries has built a working demonstration connecting Claude to their library catalogue (Primo/Alma), Texas Archival Resources Online, their digital collections, and the Harry Ransom Center Digital Collections — it can even identify the right subject librarian and give instructions for accessing archives.
Yale University Libraries is piloting a similar MCP connector to their catalogue, alongside a year-long trial of Consensus.app. If libraries can wire LLMs directly to their collections, the role of intermediary search platforms becomes less certain.
The benchmark test: finding uncited papers
Before examining each tool, let me explain the test I have used in the past and use throughout this post. The task is: given an open access paper X, find papers that could have been cited in X but were not.
This is a useful yet simple test of agentic capability for several reasons. First, I don’t believe it is a predefined workflow that any vendor has pre-built, so the system must reason about how to accomplish it rather than following a script. Second, it requires genuine multi-step reasoning: the system must extract the references of paper X, search for papers on similar topics published before X, and then ensure that the papers it presents are not already in the reference list. Third, the target paper I use, "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles", is open access and easily accessible, so even if the model lacks a specialised search tool to retrieve the references directly, it should be able to find them via a web search.
I should note its many limitations. This is a single task type that tests one kind of multi-step reasoning. A tool could fail this test and still perform well on other agentic tasks, or pass it through a fortunate sequence of actions rather than robust reasoning[2]. It is a useful probe, not a comprehensive evaluation.
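The three steps of the test can be sketched in a few lines of Python. This is only a minimal illustration of the logic, not how any of the tools reviewed here work internally. The `Paper` class, the `normalise` and `uncited_candidates` functions, and the sample titles are all hypothetical, and I assume year-level publication dates and simple title matching, which a real system would likely replace with DOIs or fuzzier matching.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    year: int


def normalise(title: str) -> str:
    """Lowercase and collapse punctuation so small formatting differences still match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def uncited_candidates(target_year, references, search_results):
    """Return search results that predate the target paper and are not in its reference list."""
    cited = {normalise(ref.title) for ref in references}  # step 1: references of paper X
    seen = set()
    candidates = []
    for paper in search_results:  # step 2: papers found on similar topics
        key = normalise(paper.title)
        if paper.year > target_year:
            continue  # assumption: same-year papers are kept, later ones are dropped
        if key in cited or key in seen:
            continue  # step 3: drop papers already cited, plus duplicate hits
        seen.add(key)
        candidates.append(paper)
    return candidates


# Hypothetical data, not the real reference list of the target paper.
references = [Paper("Open Access Growth: A Survey", 2016)]
search_results = [
    Paper("Open access growth - a survey", 2016),   # already cited
    Paper("Measuring Repository Coverage", 2015),   # uncited and early enough
    Paper("Measuring repository coverage.", 2015),  # duplicate of the previous hit
    Paper("A Later Study of OA Citations", 2020),   # published after the target
]
result = uncited_candidates(2018, references, search_results)
print([paper.title for paper in result])  # prints ['Measuring Repository Coverage']
```

Even in this toy form, the final filter depends on having the reference list of X in hand, which is why the test is sensitive to whether the system can actually get at the full text of the target paper.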
Elicit Research Agents
Elicit has long been known for providing tailored workflows with human-in-the-loop features. They have expanded the sources they can search, going beyond their own academic index to include clinical trials, PubMed, and the web. Most recently, they added “Research Agents,” which appear to be agentic in the way generic LLMs are, combining available tools or workflows flexibly to solve tasks rather than following a fixed pipeline.
I tested Elicit with my usual test, and found the following prompt usually works:
Search the web find the references of the paper “The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles.” Find papers that could have been cited but were not.
The critical detail here is “search the web.” Without that instruction, Elicit defaults to its academic search tool, which can find the paper but cannot give the agent access to its references. The screenshot below shows what happens if it uses the Elicit academic search.
If you instead direct it to use its web tool, it behaves like any LLM with web access: it finds the full text of the paper, extracts the references, and proceeds with the task.
This is an early example of what I call the “invisible menu” problem. The user has no obvious way of knowing that Elicit’s academic search tool cannot extract references from a paper, or that switching to the web tool solves the problem. The system has the capability, but the user must guess the right incantation to unlock it.
Undermind Projects
Undermind recently added a beta "Projects" feature that represents a more substantial move towards agentic architecture. Unlike the current Undermind, which offers essentially one workflow (enter a query, answer clarifying questions, receive a report), the Projects feature provides three agents: a Search Architect, a Report Writer, and a Generalist.
The Search Architect is designed to ask clarifying questions and eventually propose a "deep search," which is Undermind's full search capability rather than a standard LLM web search.
The Report Writer works with papers found from past searches to generate reports in the format and structure you want. Like the Search Architect, it typically asks clarifying questions before proceeding.
The Generalist is less clearly defined, but it has access to every paper across all your searches. In practice, the LLMs are capable of calling each other when needed — you might be interacting with the Search Architect, but once a search completes, it may recommend switching to the Report Writer.
Why this matters
One of the most significant advances here is not any single agent but the ability to run multiple searches and combine the results. One longstanding piece of advice I give to users of the current Undermind is to ensure your query covers a fairly focused topic, expecting perhaps 10-50 relevant papers. A broad query like "how are LLMs used in evidence synthesis" is too unfocused; LLMs are used differently at different stages.
With the Search Architect, you can attack a broad research question (e.g. how are LLMs used in evidence synthesis) by running multiple focused searches to address different sub-questions:
a) How well do LLMs perform as title-abstract screeners in evidence synthesis?
b) How well do LLMs perform when generating Boolean search strategies for evidence synthesis?
c) How well do LLMs perform at data extraction in evidence synthesis?
d) How well do LLMs perform in critical appraisal for evidence synthesis?
All results feed into a unified “All papers” library.
Lastly, the Report Writer can then synthesise across them. This part is very similar to the “Chat with experts” feature in the original Undermind: you can select predefined instructions to focus the report on different aspects.
It will also ask clarifying questions.
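The pattern of several focused searches feeding one shared library can be sketched as below. I do not know how Undermind implements this. The `merge_searches` function, the sub-question labels, and the paper keys are hypothetical, and the sketch only illustrates why it helps to record which search found each paper.

```python
from collections import defaultdict


def merge_searches(searches: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each paper key to the sub-questions whose search returned it."""
    library = defaultdict(list)
    for question, paper_keys in searches.items():
        for key in paper_keys:
            if question not in library[key]:
                library[key].append(question)
    return dict(library)


# Hypothetical paper keys for three of the sub-questions listed above.
searches = {
    "title-abstract screening": ["abc21", "def22"],
    "Boolean search strategies": ["def22", "ghi23"],
    "data extraction": ["jkl24"],
}
library = merge_searches(searches)

# Papers returned by more than one focused search appear once in the merged library.
for key, questions in library.items():
    print(key, questions)
```

Keeping the sub-question labels alongside each paper is one plausible way for a later report step to synthesise across searches while still showing which part of the broad question a paper addresses.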
Test result
I tested Undermind Projects with my standard query: find papers that could have been cited in “The state of OA” paper but were not.
find me papers that could have been cited in the paper “The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles.” but was not
The system asked clarifying questions, then read the full text of the target paper (identified as “piw18” in its system).
After further clarification, it offered to launch a deep search, which I did.
At this point, I was uncertain whether the request would succeed. Undermind’s deep search is capable of finding papers on a similar topic and filtering to papers published before the target, but it cannot by itself filter out papers already cited.
Once the search finished, the Generalist agent suggested running a report.
I proceeded by clicking on the “Generate Report” button, and started prompting the Report Writer agent.
The Report Writer asked its usual clarifying questions.
The critical step came when the agent began “reading full texts of Piw18” — it was checking candidate papers against both the in-text citations and bibliography of the target paper.
For example, Archambault et al. 2016 (Arc16), found by the deep search, was checked against piw18 and excluded because it was already cited.
The system passed the test, but my testing of the new Undermind interface revealed several problems along the way.
Invisible menu problems in Undermind Projects and probably other agentic interfaces
While testing, I found myself uncertain about basic questions of system capability. Can you give prompts to the Report Writer that refer to specific papers using their keys (e.g. piw18)? Can you reference reports or searches that have already been completed? Can you ask the Report Writer to modify an existing report rather than creating a new one?
The answers to all of these turned out to be yes, which I confirmed by trying and then checking with the developers. For instance, you can ask the Report Writer to use results from a particular search and create a report with "a table of the most relevant papers, each with a column describing possible limitations and why it might not fully address my research goal." You can then ask it to add more sections to the same report. Some of these capabilities are surfaced by the system (it sometimes asks whether you want a new report or to extend an existing one, and some are even exposed as GUI features), but not all of them are.
I also raised with the Undermind developers that the new system requires more effort to get the same output as the current one. Under the old system, you enter a query, answer clarifying questions, and receive a comprehensive report. Under the Projects system, you must additionally run the Report Writer, potentially multiple times, each time being asked for clarification.
I requested something like a CLAUDE.md file or a Claude Skill: a way to specify in advance the sections and structure I wanted, so I could get the full report in one shot. Undermind is still considering this, but the developers revealed a useful workaround: the Generalist agent has access to all the same resources as the Report Writer, but is much less likely to ask clarifying questions. If you want to bypass the Report Writer’s clarification loop, give the same instructions to the Generalist instead.
This is a genuinely useful tip. It is also not the sort of capability most users, even experienced ones, would be likely to discover on their own.
The invisible menu problem
Imagine walking into a restaurant that has no menu. You know the kitchen can cook, but you have no idea what dishes are available, what ingredients they have, or what you should ask for. You might get a brilliant meal if you happen to request the right thing, or you might miss the house speciality entirely because it never occurred to you to ask.
Now make it worse: you do not even know what type of cuisine this restaurant serves. Is it Thai? Italian? Fusion? The sign outside just says “food.” That is the situation users face with agentic search tools today.
Compare this to a user who sets up Claude Code by hand to run agentic flows. In our analogy, they would be the owner of the restaurant, who knows how the restaurant was set up, who the cooks are, what their specialities are, and so on. Of course, if they just copy a Claude Code workflow from someone else without understanding it, we are back to the same issue.
As I noted in the introduction, I previously called this the “blank box” problem, but “invisible menu” better captures what is going on. The issue is not that the system is blank or empty — it is full of capabilities. The problem is that there is no menu to tell you what they are.
The invisible menu problem arises because an agentic system’s capability space is so broad that users struggle to find out what is possible through normal interaction. It is worsened if the user was not the one who set up the agentic system and hence does not understand, even in principle, what is possible.
The system can do things the user will never discover if there are no menus, no buttons, no tooltips — just a prompt field. The user must guess what to order. This is illustrated by all three examples from this post: knowing to say "search the web" in Elicit, knowing that paper keys work as references in Undermind, knowing that the Generalist agent bypasses the Report Writer's clarification loop. In each case, the capability exists but is invisible to normal use.
Compounding this, users cannot even fall back on conventions from elsewhere. With traditional library search tools, there is a shared mental model. Users know roughly what to expect from a discovery layer or a database: keyword search, date filters, subject headings, citation export. The specifics differ between Primo, EDS, and Summon, but decades of use have established conventions, and a user moving from one system to another can transfer most of their expectations.
With agentic search tools, no such shared mental model currently exists. Every tool has a different set of capabilities, different tools connected underneath, and different invisible constraints. Users cannot form reasonable expectations before they start, which makes the per-tool discovery problem even harder.
Consider that the user does not know what system instructions each agent has already been given, what tools or sources it has access to, and so on. As a result, the user will either fail to use the system’s agentic capabilities to the fullest or, on the flip side, give prompts expecting them to be doable only to have them fail. This can be particularly dangerous because such agentic systems often “fail silently”, and even when they state plainly that they failed, the user may not notice.
Advanced users setting up their own agentic flows with Claude Code or Sciclaw will not face this issue as much, as they know exactly what has been set up and how.
Undermind features an 'agentic sources' button that surfaces example prompts, mitigating the invisible menu problem by explicitly showing users that they can reference paper keys.
The screenshot below shows what happens when you click on it.
The screenshot gives you example prompts that give you cues on what you can do. For example, this popup reveals you can give instructions to the Report Writer by referring to the paper key in the Undermind Library.
There is not yet a well-established baseline for what an agentic research tool does that users can carry from one product to another. Users cannot even form reasonable expectations before they start.
This is distinct from the well-known “black box” problem of AI opacity. A black box hides how the system reaches its output. An invisible menu hides what the system can do in the first place. Both are problems, but the invisible menu problem is specific to agentic systems and, I think, currently underappreciated.
It also has direct implications for reproducibility and trust. If the user does not know what the system did, what tools it invoked, or what alternative paths it could have taken, how can they evaluate whether the output is reliable? With a fixed workflow, the steps are predictable and auditable. With an agentic system, the tool may take different paths each time, and the user may not even realise what choices were made on their behalf.
Part of this may prove temporary. As these tools mature, shared expectations across the category will form, and users will at least arrive with a baseline of what to expect. What will not go away on its own is the per-tool discovery problem: even with category conventions in place, each system will still have its own hidden affordances unless vendors deliberately surface them. In the interim, the cognitive burden on users is substantial.
SciSpace Agents
I should be upfront that I have not tested SciSpace Agents as deeply as Elicit or Undermind, so what follows is a positioning analysis rather than a hands-on evaluation.
SciSpace has gone all-in on the agentic model. Their platform is packed with over 150 “apps” and connectors to a wide range of tools and agents that go well beyond literature review[3]: searching, writing, data analysis, and more.
They offer a rich agents gallery.
In concept, this amounts to a simplified, less technical version of building your own research agent with Claude Code or Sciclaw: all the tools are connected automatically, requiring minimal setup.
The trade-off is cost — SciSpace charges by credit usage — and the fact that for all its versatility, it is unproven that this breadth actually delivers better results than more focused alternatives. SciSpace occupies a middle ground between the full flexibility of a DIY setup like Claude Code and the guided experience of platforms like Undermind and Elicit, which raises the question of whether that middle ground has a natural audience.
In terms of the “invisible menu” issue, this again sits midway between a system like Undermind Projects and a complete DIY setup like Claude Code: you get your hands dirty more than with the former but less than with the latter, which gives you a more direct understanding of the agentic flows.
Conclusion
The academic search landscape may be shifting from rigid, predetermined workflows towards more flexible agentic architectures. Elicit’s Research Agents, Undermind’s Projects feature, and SciSpace Agents each represent different points on this spectrum, but they all reflect the same underlying recognition: LLMs are now capable enough to reason across tools rather than merely execute a fixed pipeline.
Yet as my testing reveals, “more agentic” does not automatically mean “more usable.” When a system can theoretically do anything, the user bears the cognitive burden of figuring out what it can actually do, which tools it will invoke, and how to phrase requests to get the desired outcome. The Undermind example is telling: even a power user (me!) needed developer tips to use the system efficiently. That does not look like a sustainable model for broader adoption.
As I discussed earlier, vendors are hedging their bets — offering MCP connectors while simultaneously layering agentic capabilities onto their own platforms, and libraries are beginning to wire up LLMs directly to their collections. Where this leaves the middle ground is uncertain.
My suspicion is that a general-purpose agentic platform for research, such as SciSpace Agents, will struggle to find a big audience. Technical early adopters will build their own workflows with Claude Code or similar tools like Sciclaw. Less technical users will gravitate towards familiar interfaces like Claude or ChatGPT with a few MCP connectors attached, or towards focused platforms like Undermind and Elicit that guide them through the process. One key question for vendors is not just whether to become more agentic, but how to do so without making their tools harder to use.
What seems clear, at least from these examples, is that some tools are moving past the ‘horseless carriage’ stage I described in my previous post. The conversation has moved beyond simply bolting LLMs onto legacy search paradigms.
If the above is correct, a major challenge now is designing agentic systems that are genuinely discoverable — where users can understand what the system is capable of without needing insider knowledge or developer access. Until that challenge is addressed, much of the promise of agentic academic search is likely to remain concentrated among users willing to tinker.
I have been thinking about what it would take to overcome the invisible menu problem — what kinds of affordances, documentation, and design patterns could help users navigate these systems more effectively. I hope to explore that in a future post.
[1] As I noted in an earlier post, the term “agent” is extremely disputed, but it typically means LLMs autonomously using tools in a loop.
[2] Generally, if the system can pass the test at least once, I consider it a pass. It is also possible that the system does not actually follow the proper steps and just returns papers that were not referenced by pure luck.
[3] The original SciSpace Deep Review and search is still available as a tool to the agent. But if you want to ignore the agentic search, you can click on “literature review” on the left pane to go back to the original SciSpace search.

Aaron, thank you! I appreciate very much every post you sent us since to me is very educational (although I need to read and re-read and re-read and re-read it again sometimes).
It occurred to me that your standard "What could have been cited but wasn't" question wasn't one that I'd tuned my tools on, so I was curious how it would do. It struggled in ways that were instructive for the next pass of improvements I make. Anyway, here's the eventual result:
https://acrobat.adobe.com/id/urn:aaid:sc:VA6C2:77bdd13d-996e-48e6-89a0-423dbe4c46b4