Author: Denis Avetisyan
A new study reveals that despite the hype, deployed AI agents often rely on constrained designs and human intervention to achieve acceptable reliability in production systems.

This paper presents the first large-scale systematic evaluation of agentic AI in production, highlighting a tension between desired autonomous capabilities and the practical need for robust, dependable performance.
Despite the increasing prevalence of AI agents in real-world applications, a systematic understanding of how they are successfully deployed remains surprisingly limited. This paper, ‘Measuring Agents in Production’, presents the first large-scale study of agent deployment, surveying practitioners and conducting in-depth case studies across diverse industries. Our findings reveal that current production agents prioritize reliability through constrained designs and substantial human oversight, rather than relying on complex autonomous capabilities. As agentic systems continue to evolve, can we bridge the gap between research innovation and practical deployment to unlock their full potential across increasingly complex domains?
The Inevitable Ascent: Beyond Static Response Systems
While large language models demonstrate remarkable proficiency in generating human-quality text, their capabilities are fundamentally limited by a lack of sustained reasoning and independent action. These models operate primarily as sophisticated pattern matchers, excelling at predicting the next word in a sequence but unable to formulate goals, devise plans, or execute them over extended periods. Essentially, they respond to prompts rather than proactively addressing problems; a single query elicits a single response, after which the system returns to a passive state. This contrasts sharply with human cognition, which involves continuous planning, memory recall, and iterative refinement of strategies to achieve complex objectives. Consequently, traditional models require constant human intervention to manage multi-step tasks, hindering their potential in scenarios demanding true autonomy and persistent problem-solving.
Agentic AI represents a fundamental departure from conventional artificial intelligence, moving beyond the limitations of static responses to embrace sustained, autonomous action. This new paradigm integrates powerful foundation models – the engines of text generation – with external tools and robust memory systems. Rather than simply reacting to prompts, an agentic system can decompose complex goals into a sequence of manageable steps, utilizing tools like search engines or APIs to gather information and execute actions. The inclusion of memory allows the agent to learn from past experiences, refine its strategies, and maintain context across multiple interactions, enabling it to tackle tasks requiring long-term planning and adaptation – a capability previously unattainable with traditional language models. This synergistic combination doesn’t just generate text; it performs tasks, opening doors to applications ranging from automated research to personalized assistance and beyond.
The evolution from static language models to agentic AI signifies a leap toward genuinely intelligent systems. Rather than merely responding to individual queries, these agents are engineered to pursue defined objectives through iterative reasoning and action. This capability is achieved by integrating large language models with external tools – such as search engines, calculators, or specialized APIs – and equipping them with a form of memory to retain context and learn from past experiences. Consequently, agentic AI doesn’t just answer questions; it formulates plans, breaks down complex tasks into manageable steps, and proactively executes those steps to achieve a desired outcome, effectively moving beyond reactive responses toward autonomous problem-solving and intelligent action.

Architectural Foundations: Designing for Autonomous Action
Agent architectures define the structural organization of an agentic system, specifying the relationships and communication pathways between its constituent components. These components typically include a planning module, a memory system, a tool selection mechanism, and an execution engine. The architecture dictates how the agent perceives its environment, formulates plans to achieve objectives, retrieves relevant information, utilizes tools to overcome limitations, and ultimately enacts those plans. A well-defined architecture ensures modularity, allowing for independent development and maintenance of individual components, and facilitates scalability to handle increasingly complex tasks. Common architectural patterns include hierarchical, reactive, and deliberative designs, each offering trade-offs between responsiveness, flexibility, and computational cost. The choice of architecture fundamentally impacts the agent’s ability to effectively reason, learn, and adapt within a given environment.
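To make those component roles concrete, the sketch below wires a planner, a memory store, a tool registry, and an execution loop into a single agent class. It is a minimal illustration assuming a planner callable that returns the next step; none of the names come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Append-only record of past observations the agent can recall."""
    events: list = field(default_factory=list)

    def remember(self, event: str) -> None:
        self.events.append(event)

    def recall(self, n: int = 5) -> list:
        return self.events[-n:]

class Agent:
    def __init__(self, planner, tools: dict, memory: Memory):
        self.planner = planner      # planning module: proposes the next step
        self.tools = tools          # tool selection: name -> callable
        self.memory = memory        # memory system

    def run(self, goal: str, max_steps: int = 10) -> str:
        for _ in range(max_steps):
            step = self.planner(goal, self.memory.recall())
            if step["action"] == "finish":
                return step["result"]
            tool = self.tools[step["action"]]        # select a tool by name
            observation = tool(**step["args"])       # execution engine acts
            self.memory.remember(f"{step['action']} -> {observation}")
        return "step budget exhausted; escalate to a human"
```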
Agent architectures routinely integrate external tools and data sources to overcome limitations present in foundational models. Tool integration allows agents to perform actions – such as API calls, database queries, or web searches – that are outside the scope of their pre-trained capabilities, effectively expanding the range of solvable problems. Simultaneously, data integration enriches the agent’s knowledge base, providing access to current, specific, or proprietary information not included in the model’s original training data; this is typically achieved through retrieval-augmented generation (RAG) or similar techniques, ensuring responses are informed by the most relevant available data and reducing reliance on potentially outdated or inaccurate internal knowledge.
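Continuing the sketch above, the snippet below registers an external API call as a named tool the planner can select. The endpoint, function name, and payload are placeholders for whatever internal systems a given deployment actually exposes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_tickets(customer_id: str) -> list:
    """Illustrative tool: query an internal ticketing API that the foundation
    model has never seen during pre-training. The endpoint is a placeholder."""
    url = "https://support.example.internal/api/tickets?" + urlencode({"customer": customer_id})
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())

# Exposing the tool by name lets the planner select it like any other action.
TOOLS = {"search_tickets": search_tickets}
```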
Effective system integration for agentic systems requires a phased approach encompassing API compatibility checks, data format standardization, and security protocol alignment with existing infrastructure. This includes verifying that agent inputs and outputs conform to expected schemas, ensuring data consistency across systems, and implementing appropriate authentication and authorization mechanisms. Furthermore, monitoring tools must be integrated to track agent performance within the broader workflow, enabling proactive identification and resolution of potential conflicts or bottlenecks. Successful integration minimizes disruption to current operations and maximizes the return on investment from the agentic system’s deployment.
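One concrete piece of that phased approach is checking agent outputs against an expected schema before they reach downstream systems. The sketch below assumes an illustrative schema; a real deployment would reuse whatever contract its existing infrastructure already defines.

```python
# Illustrative output contract; field names and types are assumptions.
EXPECTED_FIELDS = {"ticket_id": str, "action": str, "confidence": float}

def validate_output(output: dict) -> list:
    """Return a list of schema violations; an empty list means the payload is
    safe to hand to existing infrastructure."""
    errors = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in output:
            errors.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    return errors
```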

Prompt Engineering and Evaluation: Measuring Autonomous Performance
Autonomous Task Execution is fundamentally dependent on the quality of prompts provided to Foundation Models. These prompts serve as the primary method of directing the model’s behavior and defining the desired outcome of a given task. Effective Prompt Engineering involves carefully constructing inputs that clearly articulate the task requirements, constraints, and expected format of the output. The precision and detail within the prompt directly influence the model’s ability to accurately interpret the request and generate a relevant, useful response. Insufficiently defined prompts can lead to ambiguous outputs, task failures, or unintended behaviors, necessitating iterative refinement of the prompt structure and content to achieve reliable autonomous operation.
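The example below shows one way to encode task requirements, constraints, and an output format in a reusable template; the wording and fields are illustrative rather than drawn from the study.

```python
# Hypothetical prompt template for a support triage task.
PROMPT_TEMPLATE = """You are a support triage agent.

Task: classify the customer message and draft a one-sentence reply.

Constraints:
- Use only the categories: billing, technical, account, other.
- Do not promise refunds or timelines.

Output format (JSON): {{"category": "...", "reply": "..."}}

Customer message:
{message}
"""

def build_prompt(message: str) -> str:
    """Fill the template so requirements, constraints, and format are explicit."""
    return PROMPT_TEMPLATE.format(message=message)
```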
Effective assessment of autonomous agent performance necessitates a combination of both automated and human-in-the-loop evaluation methods. While automated metrics provide scalable quantitative data, human review remains crucial for assessing nuanced qualities and identifying areas for improvement not captured by algorithms. Current deployment trends reflect this need; data indicates that 74.2% of deployed agents leverage human-in-the-loop evaluation as part of their performance monitoring and refinement processes, suggesting a continued reliance on human oversight in real-world applications of autonomous systems.
The implementation of LLM-as-a-Judge represents a growing trend in autonomous agent evaluation, offering a scalable alternative to exclusively human-based assessment. This technique leverages the capabilities of another large language model to assess the outputs of an agent, providing automated feedback on performance. Current data indicates a preference for controlled autonomy, with 68% of deployed agents requiring human intervention after executing fewer than 10 steps. This suggests a focus on limiting potential errors and maintaining oversight, even as agents are designed for increasingly complex tasks, and highlights the continued need for human-in-the-loop evaluation alongside automated methods like LLM-as-a-Judge.
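A minimal sketch of the LLM-as-a-Judge pattern is shown below: a second model scores an agent's answer against a rubric, and low scores are routed to human review rather than shipped. The `call_model` function, rubric, and threshold are assumptions for illustration, not a specific vendor's API.

```python
# `call_model` is a placeholder for whatever LLM client the deployment uses.
JUDGE_PROMPT = """Rate the ANSWER to the QUESTION on a 1-5 scale for factual
accuracy and completeness. Reply with only the number.

QUESTION: {question}
ANSWER: {answer}
"""

def judge(question: str, answer: str, call_model) -> int:
    """Ask a second model to score the agent's output."""
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())

def accept(question: str, answer: str, call_model, threshold: int = 4) -> bool:
    """Ship high-scoring answers; route the rest to human review."""
    return judge(question, answer, call_model) >= threshold
```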

Scaling Autonomous Systems: Addressing Practical Limitations
The expansion of agentic systems beyond controlled research environments introduces substantial scalability challenges. Simply replicating a functional prototype isn’t sufficient; a system capable of handling numerous concurrent users and complex requests demands careful resource allocation. Efficient architectures, potentially leveraging distributed computing and specialized hardware acceleration, become essential to manage computational load. Furthermore, maintaining consistent performance as the system scales requires optimized data handling strategies and intelligent task scheduling. The core issue isn’t merely processing power, but orchestrating resources to prevent bottlenecks and ensure each agent operates effectively within a shared infrastructure, a task that necessitates innovative approaches to system design and deployment.
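One common way to keep concurrent agent runs from overwhelming shared resources, such as model quota or downstream APIs, is to bound them with a semaphore, as in the sketch below; the concurrency limit is an arbitrary example.

```python
import asyncio

MAX_CONCURRENT_AGENTS = 8                      # arbitrary example limit
semaphore = asyncio.Semaphore(MAX_CONCURRENT_AGENTS)

async def run_agent_task(agent, goal: str) -> str:
    async with semaphore:                      # excess requests wait their turn
        # agent.run is blocking, so push it onto a worker thread
        return await asyncio.to_thread(agent.run, goal)

async def handle_batch(agent, goals: list) -> list:
    return await asyncio.gather(*(run_agent_task(agent, g) for g in goals))
```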
Reliability stands as a cornerstone for the successful deployment of agentic systems, as unpredictable or inconsistent behavior quickly undermines user confidence and restricts practical application. In the study's survey data, 16% of practitioners cite achieving robust core technical performance as their foremost challenge, highlighting the difficulty of consistently delivering dependable results. This isn't merely a matter of avoiding errors; it requires agents to exhibit predictable reasoning, maintain state consistency across interactions, and gracefully handle unexpected inputs or environmental changes. Without a foundation of reliability, even the most innovative agentic capabilities remain largely theoretical, hindering widespread adoption and limiting their potential impact on real-world problems.
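A small guardrail in this spirit, sketched below, retries transient tool failures and then falls back to a flagged, human-reviewable result instead of failing silently; the function name and retry budget are illustrative.

```python
def call_tool_reliably(tool, args: dict, retries: int = 2):
    """Retry transient tool errors, then surface a reviewable fallback."""
    for _ in range(retries + 1):
        try:
            return tool(**args)
        except Exception as exc:        # unexpected input or environment change
            last_error = exc
    return {"status": "needs_human_review", "error": str(last_error)}
```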
While swift responses are generally considered vital for positive user experience, current research indicates that latency isn’t always the primary obstacle in deploying agentic systems. Though immediate reaction times are desirable, a surprising tolerance for delay exists; the majority of agents can function effectively even with response times measured in minutes. This suggests that focusing solely on minimizing latency might be a misallocation of resources for many applications. Instead, developers can prioritize robust reasoning and accurate results, knowing that a degree of delay is often acceptable, particularly in tasks not requiring real-time interaction. This finding challenges the conventional wisdom surrounding agent performance and opens possibilities for exploring architectures that prioritize quality of output over sheer speed.
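Where minutes of delay are acceptable, agent work can be handled as background jobs rather than synchronous calls. The queue-and-poll sketch below illustrates that pattern; it is not the architecture described in the paper.

```python
import queue
import threading
import uuid

jobs: queue.Queue = queue.Queue()
results: dict = {}

def worker(agent) -> None:
    """Drain the job queue; each run may take minutes, and callers poll later."""
    while True:
        job_id, goal = jobs.get()
        results[job_id] = agent.run(goal)
        jobs.task_done()

def submit(goal: str) -> str:
    """Enqueue a goal and return an id the caller can check in `results`."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, goal))
    return job_id

# Usage (assuming an `agent` instance exists):
# threading.Thread(target=worker, args=(agent,), daemon=True).start()
```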

The Future of Intelligence: RAG and Beyond
Agentic Retrieval-Augmented Generation (RAG) signifies a substantial leap in the evolution of intelligent agents, moving beyond pre-programmed responses to dynamically incorporate external knowledge. Traditionally, agents were limited by the data they were initially trained on, hindering their ability to address novel or nuanced queries. Agentic RAG overcomes this limitation by enabling the agent to actively retrieve relevant information from external sources – be it a vast database, the internet, or specialized knowledge repositories – and integrate this information into its reasoning process. This capability dramatically enhances both the accuracy and the scope of the agent’s responses, allowing it to tackle complex problems requiring up-to-date or highly specific information. The agent doesn’t simply recall information; it researches and synthesizes it, resulting in more informed and contextually appropriate outputs and establishing a pathway toward genuinely adaptable and insightful artificial intelligence.
Intelligent agents are increasingly capable of extending their knowledge base through the integration of retrieval-augmented generation. Rather than being limited to the information embedded during their initial training, these agents can now dynamically access and process external data sources – be it a vast digital library, a specialized database, or even the live internet. This process allows the agent to retrieve relevant information in response to a query, then synthesize that retrieved knowledge with its pre-existing understanding to formulate a more informed and accurate response. The ability to incorporate external knowledge not only enhances the agent’s factual grounding but also equips it to tackle novel situations and adapt to evolving information landscapes, moving beyond rote memorization towards genuine understanding and reasoning.
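The sketch below shows the basic shape of such a retrieval-augmented answer: fetch passages, ground the prompt in them, and ask the model to cite what it used. Here `retrieve` and `call_model` stand in for a vector store and an LLM client; they are assumptions for illustration, not a specific library's API.

```python
def answer_with_rag(question: str, retrieve, call_model, top_k: int = 3) -> str:
    """Ground the model's answer in retrieved passages and ask for citations."""
    passages = retrieve(question, top_k)                  # external knowledge lookup
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    prompt = (
        "Using only the numbered context passages, answer the question and "
        "cite passage numbers.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```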
The synergistic combination of retrieval-augmented generation (RAG) and agentic AI systems is poised to redefine the capabilities of artificial intelligence. This convergence isn’t simply about improving existing models; it’s about creating systems that dynamically learn and adapt, drawing upon vast external knowledge sources to inform their reasoning. Such agents move beyond pre-programmed responses, exhibiting a capacity for nuanced understanding and problem-solving previously unattainable. The result is a shift towards genuinely autonomous assistants – entities capable of handling complex tasks, providing informed guidance, and proactively offering solutions without constant human intervention. This evolution promises to unlock applications ranging from personalized education and sophisticated healthcare to streamlined scientific research and efficient business operations, ultimately fostering AI companions that are not just intelligent, but truly helpful and integrated into daily life.

The study meticulously details a landscape where practical deployment of agentic AI prioritizes demonstrable reliability over theoretical autonomy. This echoes Carl Friedrich Gauss's sentiment: “Few things are more deceptive than the appearance of certainty.” Practitioners, as observed in the research, aren't striving for generalized intelligence but for predictable outcomes. The paper highlights a reliance on constrained designs and human oversight, a pragmatic approach rooted in the need for verifiable results rather than speculative potential. This focus on rigorous validation mirrors Gauss's dedication to mathematical proof; a system's 'correctness' isn't assumed but demonstrated through careful analysis and observation, even if that means sacrificing the elegance of a fully autonomous design.
The Road Ahead
The findings presented here compel a reassessment of current trajectories in agentic AI. The observed reliance on constrained designs and human-in-the-loop systems suggests that the pursuit of ‘autonomous’ agents has, for the moment, largely yielded systems that are merely assisted – and at a cost. Optimization without analysis reveals itself as self-deception; the gains in raw task completion are frequently offset by the complexity of maintaining reliability in these brittle constructions. A purely empirical approach, focused solely on scaling existing methods, risks entrenching these limitations.
Future work must prioritize formal verification and provable guarantees of agent behavior. The field needs less emphasis on clever prompting and more on foundational mathematical principles. The current metrics, focused on task success, are insufficient. A robust evaluation framework must incorporate measures of systematic error, quantifying the conditions under which an agent is likely to fail – and, crucially, why.
The aspiration to build truly general agents demands a shift from heuristic engineering to principled design. Until agentic systems can demonstrably reason about their own limitations, and operate predictably within defined boundaries, the promise of widespread, reliable deployment remains a mirage. The challenge, then, is not simply to build agents that appear intelligent, but to build systems whose intelligence can be understood.
Original article: https://arxiv.org/pdf/2512.04123.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/