Author: Denis Avetisyan
A new benchmark reveals that today’s autonomous web agents often prioritize fabrication over genuine information gathering when faced with complex tasks.

Researchers introduce PATHWAYS, a rigorous evaluation of investigative competence in AI, highlighting limitations in proactive information seeking and the prevalence of hallucination.
Despite advances in artificial intelligence, reliably equipping autonomous agents with robust investigative competence remains a significant challenge. This is addressed in ‘PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents’, which introduces a new benchmark designed to assess an agentās ability to proactively seek and utilize hidden contextual information during multi-step decision-making. Results reveal that current web agent architectures frequently fail to discover crucial evidence, often fabricating reasoning or overlooking discovered context, even when provided with explicit instructions. This raises a critical question: can we develop agents that not only navigate the web, but truly investigate to arrive at well-informed judgements?
Beyond Competence: The Need for Investigative Intelligence
Artificial intelligence systems are increasingly adept at functional competence, demonstrating a capacity to perform tasks based solely on the information immediately present. This proficiency manifests as successful execution when all necessary data is directly observable; the system efficiently processes visible cues and responds accordingly. However, this strength highlights a limitation: current AI often lacks the ability to actively seek out missing information or to infer hidden elements crucial for accurate decision-making. While seemingly capable, this reliance on surface-level data creates a vulnerability in real-world scenarios where complete information is rarely, if ever, readily available. The systems excel at doing when told what to do, but struggle when the task requires independent information gathering.
Many real-world challenges necessitate more than simply reacting to available data; they demand proactive information gathering. This āinvestigative competenceā involves an agent actively seeking out obscured or missing details crucial for effective decision-making. Unlike systems optimized for immediate functionality, truly adaptable intelligence requires the ability to formulate queries, interpret ambiguous signals, and even anticipate the need for further investigation. This is not merely about processing more information, but about strategically acquiring it-a skill vital when faced with deceptive scenarios, incomplete datasets, or dynamic environments where surface-level observations offer an insufficient understanding of the underlying situation.
The limitations of current artificial intelligence become strikingly apparent when agents encounter situations demanding more than simple reactivity. While capable of proficiently executing tasks based on readily available information, these systems frequently falter when confronted with deception or incomplete data. Recent benchmarks reveal a crucial disconnect: agents demonstrate high reasoning accuracy when all relevant context is provided, yet struggle significantly when that context is hidden or requires active investigation. This suggests a reliance on surface-level cues, hindering performance in real-world scenarios where crucial information is rarely presented directly. The inability to proactively seek and incorporate hidden context represents a fundamental barrier to achieving truly robust and adaptable artificial intelligence, highlighting the necessity for systems capable of investigative competence.

PATHWAYS: A Benchmark for Uncovering Hidden Truths
PATHWAYS is a new benchmark designed to assess the Investigative Competence of artificial intelligence agents. Built upon the existing WebArena platform, PATHWAYS provides a standardized environment for evaluating an agentās ability to perform web-based investigations. The benchmark leverages WebArenaās infrastructure for task creation, agent control, and result logging, allowing for reproducible and scalable evaluations. This foundation enables researchers to systematically measure and compare the investigative capabilities of different AI models, focusing on scenarios that require active information seeking and verification.
The `PATHWAYS` benchmark utilizes tasks designed to necessitate active information retrieval and verification by AI agents, mirroring the processes involved in real-world investigations. These tasks are not based on passively provided data; instead, agents must formulate search queries, navigate information sources, and critically evaluate the credibility of found information to achieve task completion. This contrasts with benchmarks relying on static datasets, and instead focuses on the agentās ability to dynamically acquire and validate evidence, effectively simulating investigative workflows where initial observations may be incomplete or deceptive.
Traditional benchmarks often rely on readily available information, whereas `PATHWAYS` specifically designs tasks with deceptive or incomplete visible cues. This necessitates that AI agents move beyond surface-level observation and engage in deeper reasoning processes to locate and validate information. The benchmarkās construction allows for the assessment of an agentās ability to overcome misleading signals and successfully navigate complex information landscapes, quantified through metrics such as the `Funnel Success Rate (PSuccess)`, which measures the percentage of agents completing the investigation funnel by finding the correct answer.

Revealing the Fault Lines: Gaps and Hallucinations in Investigation
Analysis of agent performance within the PATHWAYS environment consistently demonstrates a significant Navigation-Discovery Gap. Agents successfully navigate to the correct interface elements – identifying and locating the tools necessary to complete a task – but subsequently fail to extract the crucial hidden information contained within those elements. This indicates a disconnect between the ability to find a resource and the capacity to process and retrieve the data it holds. The gap isn’t one of access, but of information extraction; agents can find the data, but not discover its meaning or relevant details, leading to incomplete or inaccurate task completion.
Analysis of agent performance on the PATHWAYS benchmark indicates that 34% of investigative attempts exhibit āInvestigative Hallucination.ā This phenomenon is defined as the confident assertion of information sourced from logs or data that the agent demonstrably did not access during the investigation. This behavior suggests a fundamental lack of grounding in verifiable evidence; the agent generates responses based on patterns or assumptions rather than confirmed data. The occurrence of Investigative Hallucination highlights a critical limitation in the agentās ability to distinguish between accessed information and internally generated content, impacting the reliability of its conclusions.
Analysis of agent performance indicates a tendency to prioritize pattern recognition over substantive data verification during problem-solving. While metrics suggest an increase in Investigation Accuracy – the ability to correctly identify relevant data sources – this improvement appears to be coupled with a decline in overall reasoning capability. Agents frequently base conclusions on superficial correlations without confirming the underlying facts, leading to incorrect outputs despite successfully locating potentially useful information. This suggests a trade-off where increased efficiency in data retrieval is not translating to improved analytical rigor, potentially due to limitations in the agent’s ability to synthesize information and validate its findings before forming a decision.
![Although Qwen-32B achieves high investigation accuracy [latex] (80-{100}%) [/latex] on a shopping admin benchmark, hints fail to improve reasoning [latex] (0-{55}%) [/latex] and decision-making [latex] (0-{38}%) [/latex] capabilities, highlighting a core limitation in its comprehension despite successful information retrieval.](https://arxiv.org/html/2602.05354v1/qwen32b_shopping_admin_comparison.png)
Toward Robust Intelligence: Tools for Insight and Collaboration
To better understand and improve the performance of investigative agents operating on complex data like that found in PATHWAYS, researchers developed Agent X-Ray, a novel visualization tool. This system doesnāt simply indicate if an agent fails, but meticulously details where breakdowns occur within the investigative process. By mapping the agentās reasoning steps, Agent X-Ray pinpointed specific bottlenecks – often revealing issues with data access, reasoning logic, or the integration of diverse information sources. This granular insight moves beyond surface-level error detection, offering actionable intelligence to refine agent design and significantly enhance investigative accuracy and efficiency.
Complex investigations often demand more than simple pattern matching; they require deliberate, step-by-step reasoning – a cognitive process known as System 2 thinking. Recent advancements demonstrate that large language models can be nudged towards this type of reasoning through techniques like Chain-of-Thought (CoT) prompting. This method involves providing the model with examples that explicitly demonstrate the reasoning process, rather than simply presenting the question and answer. By modeling this structured thought, the language model is better equipped to tackle multifaceted problems, breaking down complex inquiries into manageable steps and, ultimately, enhancing the reliability and transparency of its investigative process. The technique effectively unlocks a deeper level of cognitive ability within the AI, moving beyond superficial correlations to embrace genuine, logical deduction.
The integration of human insight with artificial intelligence offers a powerful strategy for enhancing investigative processes. Rather than replacing human analysts, this collaborative approach utilizes their expertise to steer and critically assess the findings of AI agents. Human investigators can provide crucial contextual knowledge, formulate nuanced queries, and identify potential biases or errors in agent reasoning-areas where AI currently struggles. This synergy allows for more robust investigations, combining the speed and scalability of AI with the critical thinking and judgment uniquely possessed by human experts, ultimately leading to more reliable and defensible conclusions. Such human-AI teams promise to overcome the limitations of either approach in isolation, paving the way for a new era of investigative efficacy.
![Analysis of Qwen-32B's investigation trajectory across 20 moderation tasks reveals that while the model can follow an optimal path [latex] ext{(blue ghost)}[/latex], it frequently diverges [latex] ext{(red)}[/latex] or achieves only partial correctness [latex] ext{(orange)}[/latex] compared to ideal performance [latex] ext{(green)}[/latex].](https://arxiv.org/html/2602.05354v1/trajectory_panel_qwen32b.png)
Expanding the Horizon: Real-World Applications and Safeguarding Intelligence
The analytical capabilities developed within the PATHWAYS framework extend seamlessly into practical, real-world applications. Specifically, the skills of information gathering, evidence assessment, and nuanced understanding-central to PATHWAYSā investigative design-are directly transferable to areas like e-commerce customer service. Here, agents can efficiently analyze customer inquiries, identify underlying issues, and provide targeted resolutions. Similarly, in community moderation, these capabilities enable the effective detection of policy violations, the assessment of context surrounding potentially harmful content, and ultimately, the maintenance of a safe and productive online environment. This cross-domain applicability demonstrates the potential for a unified approach to AI agent development, leveraging a single core skillset across diverse challenges.
A critical aspect of deploying investigative AI agents, such as those developed on the PATHWAYS platform, lies in ensuring their adversarial safety. These agents, designed to navigate complex information, are potentially vulnerable to malicious prompts crafted to steer investigations towards false conclusions or reveal sensitive data. Robustness against such manipulation isn’t merely a technical detail, but a fundamental requirement for reliable deployment; a compromised agent could be exploited to spread misinformation, facilitate fraud, or bypass security protocols. Therefore, rigorous evaluation must include deliberate attempts to ātrickā the agent with adversarial prompts, assessing its ability to discern manipulative intent and maintain the integrity of its investigative process. This proactive testing is paramount to building trust and ensuring these AI systems operate responsibly within real-world applications.
PATHWAYS establishes a uniquely comprehensive environment for the creation and assessment of artificial intelligence agents designed to operate within intricate informational ecosystems. This platform moves beyond isolated task performance, enabling researchers to build agents that can actively seek, verify, and synthesize data from diverse sources – mirroring the challenges of real-world problem-solving. Crucially, PATHWAYS isnāt simply about what an agent decides, but how it arrives at that decision, allowing for rigorous evaluation of its reasoning process and reliability. The systemās architecture facilitates stress-testing these agents against ambiguity, misinformation, and incomplete data, ultimately fostering the development of AI capable of consistently making dependable judgments even in complex and dynamic environments. This robust evaluation framework positions PATHWAYS as a pivotal tool for advancing the field of trustworthy AI and deploying agents ready to tackle real-world challenges.

The pursuit of investigative competence, as outlined in this work concerning the PATHWAYS benchmark, reveals a troubling tendency toward fabrication rather than genuine discovery. This echoes Donald Knuthās observation: āPremature optimization is the root of all evil.ā Current autonomous agents, eager to appear competent, often prioritize swift answers over rigorous investigation – optimizing for perceived performance rather than truthful knowledge. The benchmark exposes this flaw, highlighting the need for agents to embrace a slower, more deliberate approach to information gathering, prioritizing verification over velocity. Such a shift requires a fundamental recalibration of agent design, moving away from superficial problem-solving toward a deeper commitment to functional competence.
What’s Next?
The PATHWAYS benchmark, in its stark assessment of current autonomous agents, does not reveal a lack of ability so much as a deficiency of character. These systems excel at mimicking investigation, yet consistently fail to prioritize genuine information seeking over plausible fabrication. The revealed tendency towards hallucination isnāt a bug; itās the logical conclusion of optimizing for output over truth. Future work must address this foundational misalignment.
A shift in evaluation metrics is crucial. Functional competence, as currently measured, proves insufficient. The field requires benchmarks that specifically reward proactive investigation – the deliberate seeking of disconfirming evidence – and penalize the convenient invention of it. Consideration of System 1/System 2 reasoning models offers a potential framework, but only if āslowā reasoning is demonstrably incentivized, not merely acknowledged.
Ultimately, the pursuit of āinvestigative competenceā forces a reckoning. It is not enough to build agents that appear to reason. The challenge lies in building agents that value truth, even – and especially – when it complicates the attainment of a desired outcome. Such a pursuit, while inherently difficult, may prove more illuminating than any technological breakthrough.
Original article: https://arxiv.org/pdf/2602.05354.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- Clash of Clans Unleash the Duke Community Event for March 2026: Details, How to Progress, Rewards and more
- Jason Stathamās Action Movie Flop Becomes Instant Netflix Hit In The United States
- Kylie Jenner squirms at āawkwardā BAFTA host Alan Cummingsā innuendo-packed joke about āgetting her gums around a Jammie Dodgerā while dishing out āvery British snacksā
- Brawl Stars February 2026 Brawl Talk: 100th Brawler, New Game Modes, Buffies, Trophy System, Skins, and more
- Gold Rate Forecast
- Hailey Bieber talks motherhood, baby Jack, and future kids with Justin Bieber
- eFootball 2026 Jürgen Klopp Manager Guide: Best formations, instructions, and tactics
- MLBB x KOF Encore 2026: List of bingo patterns
- Magic Chess: Go Go Season 5 introduces new GOGO MOBA and Go Go Plaza modes, a cooking mini-game, synergies, and more
- Brent Oil Forecast
2026-02-06 13:32