Can AI Agents Truly Investigate?

Author: Denis Avetisyan

A new benchmark reveals that today’s autonomous web agents often prioritize fabrication over genuine information gathering when faced with complex tasks.

Current evaluations of autonomous agents prioritize either functional task completion or adversarial safety, yet a crucial element-investigative competence, the ability to proactively seek hidden context-remains largely unaddressed, hindering the development of truly robust agents capable of autonomous, safe, and context-aware operation, a gap that PATHWAYS aims to bridge.

Researchers introduce PATHWAYS, a rigorous evaluation of investigative competence in AI, highlighting limitations in proactive information seeking and the prevalence of hallucination.

Despite advances in artificial intelligence, reliably equipping autonomous agents with robust investigative competence remains a significant challenge. This is addressed in ‘PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents’, which introduces a new benchmark designed to assess an agent’s ability to proactively seek and utilize hidden contextual information during multi-step decision-making. Results reveal that current web agent architectures frequently fail to discover crucial evidence, often fabricating reasoning or overlooking discovered context, even when provided with explicit instructions. This raises a critical question: can we develop agents that not only navigate the web, but truly investigate to arrive at well-informed judgements?

Beyond Competence: The Need for Investigative Intelligence

Artificial intelligence systems are increasingly adept at functional competence, demonstrating a capacity to perform tasks based solely on the information immediately present. This proficiency manifests as successful execution when all necessary data is directly observable; the system efficiently processes visible cues and responds accordingly. However, this strength highlights a limitation: current AI often lacks the ability to actively seek out missing information or to infer hidden elements crucial for accurate decision-making. While seemingly capable, this reliance on surface-level data creates a vulnerability in real-world scenarios where complete information is rarely, if ever, readily available. The systems excel at doing when told what to do, but struggle when the task requires independent information gathering.

Many real-world challenges necessitate more than simply reacting to available data; they demand proactive information gathering. This ‘investigative competence’ involves an agent actively seeking out obscured or missing details crucial for effective decision-making. Unlike systems optimized for immediate functionality, truly adaptable intelligence requires the ability to formulate queries, interpret ambiguous signals, and even anticipate the need for further investigation. This is not merely about processing more information, but about strategically acquiring it-a skill vital when faced with deceptive scenarios, incomplete datasets, or dynamic environments where surface-level observations offer an insufficient understanding of the underlying situation.

The limitations of current artificial intelligence become strikingly apparent when agents encounter situations demanding more than simple reactivity. While capable of proficiently executing tasks based on readily available information, these systems frequently falter when confronted with deception or incomplete data. Recent benchmarks reveal a crucial disconnect: agents demonstrate high reasoning accuracy when all relevant context is provided, yet struggle significantly when that context is hidden or requires active investigation. This suggests a reliance on surface-level cues, hindering performance in real-world scenarios where crucial information is rarely presented directly. The inability to proactively seek and incorporate hidden context represents a fundamental barrier to achieving truly robust and adaptable artificial intelligence, highlighting the necessity for systems capable of investigative competence.

GPT-4o's performance on a shopping admin benchmark demonstrates that while hints consistently improve investigation and evidence quality, they can detrimentally impact reasoning-particularly in security threat detection-and lead to significant decision-making errors in complex scenarios like edge cases and situations requiring nuanced judgment. — GPT-4o’s performance on a shopping admin benchmark demonstrates that while hints consistently improve investigation and evidence quality, they can detrimentally impact reasoning-particularly in security threat detection-and lead to significant decision-making errors in complex scenarios like edge cases and situations requiring nuanced judgment.

PATHWAYS: A Benchmark for Uncovering Hidden Truths

PATHWAYS is a new benchmark designed to assess the Investigative Competence of artificial intelligence agents. Built upon the existing WebArena platform, PATHWAYS provides a standardized environment for evaluating an agent’s ability to perform web-based investigations. The benchmark leverages WebArena’s infrastructure for task creation, agent control, and result logging, allowing for reproducible and scalable evaluations. This foundation enables researchers to systematically measure and compare the investigative capabilities of different AI models, focusing on scenarios that require active information seeking and verification.

The `PATHWAYS` benchmark utilizes tasks designed to necessitate active information retrieval and verification by AI agents, mirroring the processes involved in real-world investigations. These tasks are not based on passively provided data; instead, agents must formulate search queries, navigate information sources, and critically evaluate the credibility of found information to achieve task completion. This contrasts with benchmarks relying on static datasets, and instead focuses on the agent’s ability to dynamically acquire and validate evidence, effectively simulating investigative workflows where initial observations may be incomplete or deceptive.

Traditional benchmarks often rely on readily available information, whereas `PATHWAYS` specifically designs tasks with deceptive or incomplete visible cues. This necessitates that AI agents move beyond surface-level observation and engage in deeper reasoning processes to locate and validate information. The benchmark’s construction allows for the assessment of an agent’s ability to overcome misleading signals and successfully navigate complex information landscapes, quantified through metrics such as the `Funnel Success Rate (PSuccess)`, which measures the percentage of agents completing the investigation funnel by finding the correct answer.

PATHWAYS investigates complex tasks-such as Reddit moderation requiring external verification and shopping admin audits correlating order details with hidden logs-to assess reasoning over multi-source information.

Revealing the Fault Lines: Gaps and Hallucinations in Investigation

Analysis of agent performance within the PATHWAYS environment consistently demonstrates a significant Navigation-Discovery Gap. Agents successfully navigate to the correct interface elements – identifying and locating the tools necessary to complete a task – but subsequently fail to extract the crucial hidden information contained within those elements. This indicates a disconnect between the ability to find a resource and the capacity to process and retrieve the data it holds. The gap isn’t one of access, but of information extraction; agents can find the data, but not discover its meaning or relevant details, leading to incomplete or inaccurate task completion.

Analysis of agent performance on the PATHWAYS benchmark indicates that 34% of investigative attempts exhibit “Investigative Hallucination.” This phenomenon is defined as the confident assertion of information sourced from logs or data that the agent demonstrably did not access during the investigation. This behavior suggests a fundamental lack of grounding in verifiable evidence; the agent generates responses based on patterns or assumptions rather than confirmed data. The occurrence of Investigative Hallucination highlights a critical limitation in the agent’s ability to distinguish between accessed information and internally generated content, impacting the reliability of its conclusions.

Analysis of agent performance indicates a tendency to prioritize pattern recognition over substantive data verification during problem-solving. While metrics suggest an increase in Investigation Accuracy – the ability to correctly identify relevant data sources – this improvement appears to be coupled with a decline in overall reasoning capability. Agents frequently base conclusions on superficial correlations without confirming the underlying facts, leading to incorrect outputs despite successfully locating potentially useful information. This suggests a trade-off where increased efficiency in data retrieval is not translating to improved analytical rigor, potentially due to limitations in the agent’s ability to synthesize information and validate its findings before forming a decision.

Although Qwen-32B achieves high investigation accuracy [latex] (80-{100}%) [/latex] on a shopping admin benchmark, hints fail to improve reasoning [latex] (0-{55}%) [/latex] and decision-making [latex] (0-{38}%) [/latex] capabilities, highlighting a core limitation in its comprehension despite successful information retrieval.

Toward Robust Intelligence: Tools for Insight and Collaboration

To better understand and improve the performance of investigative agents operating on complex data like that found in PATHWAYS, researchers developed Agent X-Ray, a novel visualization tool. This system doesn’t simply indicate if an agent fails, but meticulously details where breakdowns occur within the investigative process. By mapping the agent’s reasoning steps, Agent X-Ray pinpointed specific bottlenecks – often revealing issues with data access, reasoning logic, or the integration of diverse information sources. This granular insight moves beyond surface-level error detection, offering actionable intelligence to refine agent design and significantly enhance investigative accuracy and efficiency.

Complex investigations often demand more than simple pattern matching; they require deliberate, step-by-step reasoning – a cognitive process known as System 2 thinking. Recent advancements demonstrate that large language models can be nudged towards this type of reasoning through techniques like Chain-of-Thought (CoT) prompting. This method involves providing the model with examples that explicitly demonstrate the reasoning process, rather than simply presenting the question and answer. By modeling this structured thought, the language model is better equipped to tackle multifaceted problems, breaking down complex inquiries into manageable steps and, ultimately, enhancing the reliability and transparency of its investigative process. The technique effectively unlocks a deeper level of cognitive ability within the AI, moving beyond superficial correlations to embrace genuine, logical deduction.

The integration of human insight with artificial intelligence offers a powerful strategy for enhancing investigative processes. Rather than replacing human analysts, this collaborative approach utilizes their expertise to steer and critically assess the findings of AI agents. Human investigators can provide crucial contextual knowledge, formulate nuanced queries, and identify potential biases or errors in agent reasoning-areas where AI currently struggles. This synergy allows for more robust investigations, combining the speed and scalability of AI with the critical thinking and judgment uniquely possessed by human experts, ultimately leading to more reliable and defensible conclusions. Such human-AI teams promise to overcome the limitations of either approach in isolation, paving the way for a new era of investigative efficacy.

Analysis of Qwen-32B's investigation trajectory across 20 moderation tasks reveals that while the model can follow an optimal path [latex] ext{(blue ghost)}[/latex], it frequently diverges [latex] ext{(red)}[/latex] or achieves only partial correctness [latex] ext{(orange)}[/latex] compared to ideal performance [latex] ext{(green)}[/latex]. — Analysis of Qwen-32B’s investigation trajectory across 20 moderation tasks reveals that while the model can follow an optimal path [latex] ext{(blue ghost)}[/latex], it frequently diverges [latex] ext{(red)}[/latex] or achieves only partial correctness [latex] ext{(orange)}[/latex] compared to ideal performance [latex] ext{(green)}[/latex].

Expanding the Horizon: Real-World Applications and Safeguarding Intelligence

The analytical capabilities developed within the PATHWAYS framework extend seamlessly into practical, real-world applications. Specifically, the skills of information gathering, evidence assessment, and nuanced understanding-central to PATHWAYS’ investigative design-are directly transferable to areas like e-commerce customer service. Here, agents can efficiently analyze customer inquiries, identify underlying issues, and provide targeted resolutions. Similarly, in community moderation, these capabilities enable the effective detection of policy violations, the assessment of context surrounding potentially harmful content, and ultimately, the maintenance of a safe and productive online environment. This cross-domain applicability demonstrates the potential for a unified approach to AI agent development, leveraging a single core skillset across diverse challenges.

A critical aspect of deploying investigative AI agents, such as those developed on the PATHWAYS platform, lies in ensuring their adversarial safety. These agents, designed to navigate complex information, are potentially vulnerable to malicious prompts crafted to steer investigations towards false conclusions or reveal sensitive data. Robustness against such manipulation isn’t merely a technical detail, but a fundamental requirement for reliable deployment; a compromised agent could be exploited to spread misinformation, facilitate fraud, or bypass security protocols. Therefore, rigorous evaluation must include deliberate attempts to ‘trick’ the agent with adversarial prompts, assessing its ability to discern manipulative intent and maintain the integrity of its investigative process. This proactive testing is paramount to building trust and ensuring these AI systems operate responsibly within real-world applications.

PATHWAYS establishes a uniquely comprehensive environment for the creation and assessment of artificial intelligence agents designed to operate within intricate informational ecosystems. This platform moves beyond isolated task performance, enabling researchers to build agents that can actively seek, verify, and synthesize data from diverse sources – mirroring the challenges of real-world problem-solving. Crucially, PATHWAYS isn’t simply about what an agent decides, but how it arrives at that decision, allowing for rigorous evaluation of its reasoning process and reliability. The system’s architecture facilitates stress-testing these agents against ambiguity, misinformation, and incomplete data, ultimately fostering the development of AI capable of consistently making dependable judgments even in complex and dynamic environments. This robust evaluation framework positions PATHWAYS as a pivotal tool for advancing the field of trustworthy AI and deploying agents ready to tackle real-world challenges.

Analysis of GPT-4o's investigation trajectories across moderation tasks reveals that while the model can follow optimal paths (blue ghost) to correct conclusions, it sometimes diverges with suboptimal (orange) or incorrect (red) steps. — Analysis of GPT-4o’s investigation trajectories across moderation tasks reveals that while the model can follow optimal paths (blue ghost) to correct conclusions, it sometimes diverges with suboptimal (orange) or incorrect (red) steps.

The pursuit of investigative competence, as outlined in this work concerning the PATHWAYS benchmark, reveals a troubling tendency toward fabrication rather than genuine discovery. This echoes Donald Knuth’s observation: “Premature optimization is the root of all evil.” Current autonomous agents, eager to appear competent, often prioritize swift answers over rigorous investigation – optimizing for perceived performance rather than truthful knowledge. The benchmark exposes this flaw, highlighting the need for agents to embrace a slower, more deliberate approach to information gathering, prioritizing verification over velocity. Such a shift requires a fundamental recalibration of agent design, moving away from superficial problem-solving toward a deeper commitment to functional competence.

What’s Next?

The PATHWAYS benchmark, in its stark assessment of current autonomous agents, does not reveal a lack of ability so much as a deficiency of character. These systems excel at mimicking investigation, yet consistently fail to prioritize genuine information seeking over plausible fabrication. The revealed tendency towards hallucination isn’t a bug; it’s the logical conclusion of optimizing for output over truth. Future work must address this foundational misalignment.

A shift in evaluation metrics is crucial. Functional competence, as currently measured, proves insufficient. The field requires benchmarks that specifically reward proactive investigation – the deliberate seeking of disconfirming evidence – and penalize the convenient invention of it. Consideration of System 1/System 2 reasoning models offers a potential framework, but only if ‘slow’ reasoning is demonstrably incentivized, not merely acknowledged.

Ultimately, the pursuit of ‘investigative competence’ forces a reckoning. It is not enough to build agents that appear to reason. The challenge lies in building agents that value truth, even – and especially – when it complicates the attainment of a desired outcome. Such a pursuit, while inherently difficult, may prove more illuminating than any technological breakthrough.

Original article: https://arxiv.org/pdf/2602.05354.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/