Can AI Truly Use the Web?

Author: Denis Avetisyan


A new benchmark reveals that today’s AI agents struggle with surprisingly simple online tasks that humans perform daily.

The evaluation pipeline assesses agent performance through interaction with a real browser, capturing detailed behavioral data across five layers to compare the agent’s trajectory against human-authored ground truth, ultimately delivering a pass/fail verdict substantiated by step-level justifications.

ClawBench, a comprehensive evaluation suite, exposes a significant gap between AI performance on existing benchmarks and real-world web interaction.

Despite advances in artificial intelligence, reliably automating everyday online tasks remains a significant challenge. To address this gap, we introduce ClawBench, a new evaluation framework designed to assess AI agents on 153 realistic web-based tasks spanning diverse platforms and requiring complex interactions beyond those captured by existing benchmarks. Our evaluations reveal that even state-of-the-art models struggle, achieving limited success on ClawBench’s demanding workflow and information-extraction requirements. Will progress on such benchmarks be sufficient to unlock truly general-purpose AI assistants capable of navigating the complexities of the modern web?


The Illusion of Control: Why Benchmarks Fail

Many current artificial intelligence assessments occur within carefully constructed, digital simulations – environments that, while controlled, drastically underestimate the unpredictable nature of real-world web interactions. These benchmarks often prioritize easily quantifiable metrics within simplified scenarios, neglecting the messy realities of dynamic websites, variable loading times, and the constant evolution of online content. Consequently, an agent that performs flawlessly in a synthetic setting may struggle significantly when confronted with the complexities of a live website, where subtle changes in page layout, ambiguous phrasing, or unexpected server responses can derail even the most sophisticated algorithms. This disconnect between benchmark performance and real-world capability highlights a critical need for evaluation methods that more accurately reflect the inherent challenges of navigating and interacting with the open web.

Assessing artificial intelligence agents designed for real-world tasks – such as completing online forms or making reservations – necessitates interaction with live web platforms, a practice that introduces considerable safety and reliability hurdles. Unlike evaluations conducted within controlled, simulated environments, operating on actual websites exposes systems to unpredictable states, potential security vulnerabilities, and the risk of unintended consequences like accidental purchases or data breaches. These ‘write-heavy’ tasks demand careful consideration of agent behavior, as even minor errors in data entry or navigation can lead to significant issues for both the agent and the platform it interacts with. Ensuring the robustness of these agents, therefore, requires sophisticated monitoring, robust error handling, and mechanisms to prevent malicious or disruptive actions while maintaining realistic evaluation conditions.

Evaluating artificial intelligence on tasks demanding extensive text input – often termed ‘write-heavy tasks’ – presents a unique challenge beyond simply assessing output correctness. These evaluations require a framework capable of tracking a cascade of state changes within the target platform; a single, slightly incorrect input can alter subsequent steps, rendering a seemingly correct final answer meaningless. Unlike closed-form problems with definitive solutions, these tasks necessitate monitoring the process of interaction – did the agent correctly populate each field, handle dynamic content, and adapt to platform-specific requirements? A truly robust evaluation, therefore, moves beyond simple pass/fail metrics and focuses on a granular understanding of the agent’s behavioral trajectory, acknowledging that success isn’t solely determined by the ultimate outcome but by the validity of each intermediate action within a complex, evolving digital environment.

The agentic evaluator determines successful task completion by cross-referencing the agent’s trajectory – analyzed through session replay, screenshots, HTTP traffic, browser actions, and messages – against a reference trajectory, resulting in a PASS or FAIL decision based on both task execution and adherence to behavioral rules.

ClawBench: A Necessary, Though Imperfect, Stress Test

ClawBench is designed as a comprehensive evaluation benchmark for web-based AI agents, consisting of 153 distinct tasks modeled after common user activities. These tasks are executed across 144 currently active and publicly accessible websites, reflecting real-world online platforms. The selection of tasks and platforms aims to provide a realistic and challenging environment for assessing an agent’s ability to navigate, interact with, and complete objectives on the contemporary web, going beyond synthetic or simplified environments. This approach enables evaluation of agent performance in scenarios involving complex website layouts, dynamic content, and diverse interaction requirements.
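The paper does not publish ClawBench’s task schema, but a task of this shape can be sketched as a small data structure. The field names below are assumptions for illustration only, not the benchmark’s actual format:

```python
from dataclasses import dataclass, field

@dataclass
class ClawBenchTask:
    """Hypothetical schema for one benchmark task (field names are illustrative)."""
    task_id: str
    platform_url: str          # one of the 144 live, publicly accessible websites
    instruction: str           # the user-style goal the agent must accomplish
    reference_steps: list[str] = field(default_factory=list)  # human-authored ground truth

# A suite is then simply a collection of such tasks.
suite = [
    ClawBenchTask(
        task_id="example-001",
        platform_url="https://example.com",
        instruction="Fill out the contact form and submit it.",
        reference_steps=["open form", "fill fields", "submit"],
    )
]
```

Keeping a human-authored reference trajectory alongside each task is what later enables step-level comparison rather than a bare pass/fail check on the final page state.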

ClawBench utilizes OpenClaw, a software framework designed for automating interactions with Chromium-based web browsers. This allows the benchmark to programmatically control a browser instance, simulating user actions such as navigating to websites, filling out forms, and clicking buttons. By directly controlling the browser, ClawBench can execute the 153 tasks comprising the benchmark in a fully automated fashion, enabling repeatable and scalable evaluation of web agent performance. OpenClaw handles the complexities of browser automation, including managing browser windows, handling JavaScript execution, and interacting with web page elements, providing a stable and reliable foundation for the benchmark’s execution.

ClawBench utilizes a Chrome Extension and Chrome DevTools Protocol (CDP) Server to facilitate safe and controlled operation of AI agents. The Chrome Extension intercepts all HTTP requests made by the agent within the browser environment, providing a crucial layer of monitoring and allowing for request modification or blocking. The CDP Server enables programmatic control of the Chromium browser, but coupled with the extension, prevents agents from performing unintended actions such as accessing sensitive data or initiating unauthorized transactions. This system ensures that all agent interactions are observable and can be halted if necessary, mitigating potential risks associated with autonomous web interaction.

The evaluation protocol assesses agent performance by comparing its actions and payloads to human references using a Claude Code sub-agent and a fixed rubric, resulting in a binary completion verdict supported by schema-level justification.

Five Layers of Data: Because Something Will Go Wrong

ClawBench utilizes a ‘Five-Layer Recording’ system to comprehensively document agent interactions. This system captures five distinct data streams: full session replay for visual review of agent activity, action screenshots providing static records of specific interface elements, complete HTTP traffic logs detailing all data transmitted, agent-generated messages for contextual understanding, and records of all browser actions undertaken by the agent. The concurrent capture of these layers allows for cross-validation and a detailed reconstruction of each agent session, facilitating thorough analysis of performance and identification of potential issues.
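The five concurrent streams can be pictured as one record per session. The class below is an illustrative sketch of that idea, not ClawBench’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class SessionRecording:
    """One agent session, captured across five layers (names are illustrative)."""
    session_replay: list[bytes] = field(default_factory=list)     # frames for visual review
    action_screenshots: list[bytes] = field(default_factory=list) # static records of UI state
    http_traffic: list[dict] = field(default_factory=list)        # full request/response log
    agent_messages: list[str] = field(default_factory=list)       # the agent's own narration
    browser_actions: list[dict] = field(default_factory=list)     # clicks, typing, navigation

# All layers are appended to as the session runs, so any one stream
# can be cross-checked against the other four during analysis.
rec = SessionRecording()
rec.browser_actions.append({"type": "click", "selector": "#submit"})
rec.http_traffic.append({"method": "POST", "url": "https://example.com/form"})
rec.agent_messages.append("Submitting the completed form.")
```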

The comprehensive data captured by ClawBench’s five-layer recording system – encompassing session replay, action screenshots, HTTP traffic, agent messages, and browser actions – facilitates detailed analysis of agent performance. This multi-layered approach allows for reconstruction of complete agent sessions, identifying both successful task completion pathways and points of failure. By correlating data across these layers, analysts can pinpoint the specific actions or system interactions leading to errors, inefficiencies, or unexpected outcomes. This granular level of insight extends beyond simple pass/fail metrics, enabling investigation into how tasks are performed and why certain issues arise, ultimately supporting targeted improvements to agent training and system design.

HTTP request interception within ClawBench operates as a security measure to prevent agents from executing actions that could have external, real-world impacts. This functionality monitors and controls all outgoing HTTP requests made during a session, allowing for modification or blocking of sensitive operations. Specifically, it safeguards against unintended consequences like unauthorized financial transactions or data modifications by intercepting requests before they reach external servers. Interception enables validation of request parameters, ensuring adherence to pre-defined constraints and preventing actions outside of the intended testing scope, thereby mitigating potential risks associated with agent-driven interactions.
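As a sketch of the kind of gate such interception enables, a filter might allow reads freely but confine state-changing requests to the task’s own domain. The rule below is invented for illustration and is not ClawBench’s actual policy:

```python
from urllib.parse import urlparse

# Illustrative policy: writes (POST/PUT/PATCH/DELETE) may only reach the
# task's target domain; read-only methods pass through unmodified.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def allow_request(method: str, url: str, task_domain: str) -> bool:
    """Return True if the outgoing request may reach the network."""
    if method.upper() not in WRITE_METHODS:
        return True  # reads are treated as safe
    return urlparse(url).netloc == task_domain
```

A guard of this shape is what prevents an agent mid-task on one site from, say, POSTing a payment request to an unrelated one.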

The Illusion of Intelligence: A Gap Remains

An innovative evaluation framework leverages an ‘Agentic Evaluator’ powered by Claude Code to rigorously assess the performance of AI agents. This system moves beyond simple pass/fail metrics by comparing the trajectory of an agent’s actions – the specific steps taken to complete a task – against a ‘Human Reference Trajectory’. These reference trajectories, created by human experts, serve as a benchmark for optimal task completion, allowing for a nuanced understanding of where an agent excels or falters. By meticulously analyzing deviations from human behavior, the evaluator provides detailed insights into an agent’s reasoning and execution, facilitating targeted improvements and a more accurate measure of its capabilities. This approach ensures a comprehensive ‘Task Completion Assessment’ that reflects real-world performance expectations.

Trajectory comparison serves as a crucial analytical technique, enabling a detailed examination of an agent’s path against a demonstrably successful human approach to the same task. This method doesn’t simply assess whether a task is completed, but how it’s completed, revealing subtle deviations in strategy and execution. By quantifying these differences – perhaps through metrics like path length, smoothness, or the number of corrective actions – researchers can pinpoint specific areas where the agent’s behavior diverges from human intuition and efficiency. Such granular analysis facilitates targeted improvements to the agent’s algorithms, allowing developers to refine its decision-making processes and ultimately bridge the gap between artificial and human performance. This approach moves beyond simple success/failure metrics to offer a nuanced understanding of an agent’s capabilities and limitations.
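A toy version of such a step-level comparison might align the agent’s actions against the human reference and report a per-step justification alongside the verdict. This is invented for illustration; the actual evaluator is an LLM judge reasoning over far richer data with a fixed rubric:

```python
def compare_trajectories(agent_steps: list[str], reference_steps: list[str]) -> dict:
    """Naive exact-match comparison producing a verdict plus step-level justifications."""
    justifications = []
    matched = 0
    for i, ref in enumerate(reference_steps):
        if i < len(agent_steps) and agent_steps[i] == ref:
            matched += 1
            justifications.append(f"step {i}: matched '{ref}'")
        else:
            got = agent_steps[i] if i < len(agent_steps) else "<missing>"
            justifications.append(f"step {i}: expected '{ref}', got '{got}'")
    verdict = "PASS" if matched == len(reference_steps) else "FAIL"
    return {"verdict": verdict, "steps": justifications}

result = compare_trajectories(
    ["open form", "fill fields", "submit"],
    ["open form", "fill fields", "submit"],
)
```

Even this toy version makes the key point: the verdict is substantiated by which individual steps diverged, not just by the final outcome.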

Evaluation hinged on a clear measure of achievement: success rate. Results from the ClawBench benchmark reveal a significant disparity in performance between language models. Claude Sonnet 4.6 completed 33.3% of the tasks – the strongest showing, yet one that still leaves two out of every three tasks unfinished. In contrast, GPT-5.4 achieved a markedly lower 6.5% success rate, suggesting limitations in its ability to navigate and complete the challenges presented by the benchmark. This gap highlights Claude Sonnet 4.6’s stronger agent-based task completion while providing a quantifiable basis for further development and refinement of both models.

The pursuit of seamless web automation, as demonstrated by ClawBench, feels less like progress and more like polishing the gears of an inevitable breakdown. This benchmark meticulously exposes the chasm between lab-perfected agents and the messy reality of everyday online tasks. It’s a predictable outcome; the system will always encounter edge cases the developers hadn’t foreseen. As Henri Poincaré observed, “Mathematics is the art of giving reasons.” But even the most rigorous mathematical framework can’t account for CAPTCHAs designed by someone having a bad day. The agents might navigate the tests initially, but production will always find a way to break elegant theories. One suspects future iterations of ClawBench will simply measure how spectacularly these agents fail, rather than if they succeed. The benchmark isn’t about building perfect agents; it’s about leaving detailed notes for the digital archaeologists who will dissect the wreckage.

What’s Next?

The proliferation of benchmarks rarely solves the underlying problem; it simply relocates the illusion of progress. ClawBench, by demonstrating the chasm between synthetic success and genuine web interaction, joins a long line of tests that reveal more about the testing process than the intelligence being measured. The observed failures aren’t necessarily flaws in large language models themselves, but rather the predictable consequences of forcing brittle automation onto a platform designed for human ambiguity. Every successful trajectory analysis will inevitably encounter a CAPTCHA update, a redesigned button, or a website actively hostile to scraping.

Future work will undoubtedly focus on increasingly elaborate simulation and reward functions, attempting to anticipate every edge case. This feels… familiar. A more honest approach might involve accepting that perfect generalization is unattainable, and focusing instead on building agents that fail gracefully: agents that flag uncertainty, request human assistance, or, crucially, don’t accidentally order three tons of industrial lubricant.

The true test won’t be whether these agents can complete tasks, but whether they can avoid catastrophic errors with sufficient frequency to justify their deployment. Tests are a form of faith, not certainty. The real metric isn’t accuracy, but resilience: the ability to remain operational when production inevitably breaks things.


Original article: https://arxiv.org/pdf/2604.08523.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
