Building with Brains: How Developers Really Work with AI Agents

Author: Denis Avetisyan


A new study reveals the practical challenges and emerging patterns in how software developers are integrating large language model agents into their projects.

An analysis of AI agent frameworks reveals potential failure points throughout the software development life cycle, highlighting vulnerabilities at each stage of implementation.

This research presents the first large-scale empirical investigation of developer practices within AI agent frameworks, identifying key challenges throughout the software development lifecycle and providing insights to improve framework design and adoption.

Despite the rapid proliferation of large language model (LLM) agents, a systematic understanding of developer experiences building with agent frameworks remains surprisingly limited. This research, ‘An Empirical Study of Agent Developer Practices in AI Agent Frameworks’, presents the first large-scale empirical investigation into how developers actually utilize these frameworks throughout the software development lifecycle. Our analysis of nearly 12,000 developer discussions across ten prominent frameworks reveals significant disparities in development efficiency, maintainability, and overall usability. How can these findings inform the design of future agent frameworks and better support developers harnessing the power of LLM-driven agents?


The Evolving Landscape of Autonomous Agents

Large Language Model (LLM)-based agent systems signify a notable departure in artificial intelligence, moving beyond simple task execution toward systems capable of independent problem-solving. However, this newfound autonomy is not without caveats; these agents are constructed upon intricate architectures and a reliance on the probabilistic nature of LLMs, making them inherently fragile. While appearing to reason and act purposefully, their internal processes are often opaque, and seemingly minor variations in input can lead to unpredictable outcomes or outright failures. This brittleness stems from the LLM’s tendency to “hallucinate” information or drift off-topic, coupled with the challenges of reliably translating high-level goals into concrete actions within complex environments. Consequently, despite their potential, robust deployment necessitates careful monitoring, rigorous testing, and the development of techniques to enhance their reliability and prevent unexpected behavior.

Agent Frameworks are foundational to the current wave of large language model-based agents, serving as the crucial infrastructure for translating abstract goals into concrete actions. These toolkits aren’t simply libraries of functions; they provide a structured environment for defining an agent’s memory, planning capabilities, and crucially, its interaction with the external world. By abstracting away the complexities of prompt engineering, tool selection, and output parsing, these frameworks empower developers to focus on defining the logic of the agent rather than the minutiae of its implementation. Popular options like LangChain and AutoGen offer pre-built components for common tasks, but also allow for extensive customization, enabling the creation of agents tailored to specific domains and challenges. The effectiveness of an agent, therefore, is often inextricably linked to the robustness and flexibility of the underlying framework used to orchestrate its behavior.
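To make that abstraction concrete, the following is a minimal, framework-agnostic sketch in Python of the loop such toolkits manage on a developer's behalf: a tool registry, prompt assembly, and dispatch of the model's chosen action. The class and function names here are hypothetical and do not correspond to any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    """A callable the agent may invoke, plus a description shown to the LLM."""
    name: str
    description: str
    run: Callable[[str], str]

class MiniAgent:
    """Toy agent loop: the framework owns prompting, parsing, and dispatch."""

    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Tool]):
        self.llm = llm          # any text-in/text-out model client
        self.tools = tools      # registry the planner can draw on

    def step(self, goal: str) -> str:
        # 1. The framework builds the prompt (tool descriptions + goal).
        tool_list = "\n".join(f"- {t.name}: {t.description}" for t in self.tools.values())
        prompt = f"Tools available:\n{tool_list}\n\nGoal: {goal}\nReply as 'tool_name: input'."
        # 2. The LLM proposes an action; the framework parses it.
        reply = self.llm(prompt)
        tool_name, _, tool_input = reply.partition(":")
        tool = self.tools.get(tool_name.strip())
        # 3. The framework dispatches to the tool and returns the observation.
        return tool.run(tool_input.strip()) if tool else f"Unknown tool: {reply}"
```

Everything above the `step` method is exactly the plumbing that frameworks such as LangChain or AutoGen take off the developer's hands, which is why their design choices matter so much for reliability.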

Effective implementation of large language model-based agents hinges on their ability to interact with the world beyond their training data, demanding robust integration with external systems and APIs. This isn’t simply a matter of technical connectivity; it requires a standardized approach to data exchange and contextual understanding. The emergence of protocols like the Model Context Protocol addresses this need by defining a common language for agents to access tools, share information, and maintain state across interactions. By establishing clear guidelines for how agents formulate requests and interpret responses, these standards minimize ambiguity and reduce the likelihood of errors, ultimately fostering more reliable and scalable agent deployments. This standardization isn’t just about enabling communication; it’s about building trust and predictability in a new era of autonomous systems, allowing developers to focus on building sophisticated agent behaviors rather than wrestling with integration complexities.
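As a rough illustration, the snippet below constructs a JSON-RPC-style tool-call request of the kind an MCP-like protocol exchanges between an agent and a tool server. The method and field names are illustrative assumptions for this sketch, not a normative rendering of the Model Context Protocol specification.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 style tool-call request, as MCP-like protocols do."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",          # illustrative method name
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)

# Agent and tool server agree on this envelope up front, so neither side
# has to guess how requests are framed or how results come back.
print(make_tool_call(1, "search_docs", {"query": "dependency pinning"}))
```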

The ten agent frameworks are categorized by their functional roles and areas of application.

Points of Failure: Understanding Agent Fragility

Version failures and dependency conflicts represent a substantial source of instability in autonomous agents, accounting for 23.53% of all reported failures. These issues typically arise when agents rely on external libraries or tools that undergo updates incompatible with the agent’s existing codebase. Specifically, changes in API structures, function signatures, or required data formats can introduce runtime errors or unexpected behavior. Furthermore, conflicting dependencies, where multiple components require different, incompatible versions of the same library, can lead to unpredictable outcomes and system crashes. Careful version control, dependency management practices, and robust testing procedures are therefore critical to mitigating these risks and ensuring agent reliability.
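One lightweight mitigation is to pin the versions an agent was tested against and detect drift at startup. The sketch below uses Python's standard importlib.metadata; the package names and version numbers in the pin table are placeholders, not recommendations.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical pins for the packages an agent project was tested against.
PINNED = {"langchain": "0.2.16", "openai": "1.40.0"}

def check_pins(pins: dict) -> list:
    """Report packages whose installed version drifts from the pinned one."""
    problems = []
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{name}: installed {installed}, pinned {expected}")
    return problems

if __name__ == "__main__":
    for issue in check_pins(PINNED):
        print("dependency drift:", issue)
```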

Tool failures represent 14% of all agent failures and are commonly caused by limitations inherent in the APIs used by those tools. These limitations can include rate limits, incomplete functionality, unexpected data formats, or changes in API specifications without corresponding agent adaptation. Such failures disrupt agent workflows by preventing access to necessary information or functionality, forcing the agent to halt execution or return incomplete results. Addressing these issues requires careful API selection, robust error handling, and potentially the implementation of fallback mechanisms or alternative tools.
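A common defensive pattern is to wrap tool calls in retries with exponential backoff and fall back to an alternative tool when the primary API remains unavailable. The following is a minimal sketch; the primary and fallback callables are placeholders for whatever tools an agent actually uses.

```python
import random
import time

def call_with_fallback(primary, fallback, payload, retries=3, base_delay=1.0):
    """Try the primary tool with exponential backoff, then fall back."""
    for attempt in range(retries):
        try:
            return primary(payload)
        except Exception as exc:  # in practice, catch the API's specific error types
            # Back off before retrying, with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"primary tool failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
    # All retries exhausted: degrade gracefully instead of halting the agent.
    return fallback(payload)
```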

Logic failures represent the most substantial category of agent failures, accounting for 25.6% of all observed instances. These failures stem from inadequacies within the agent’s internal control mechanisms, which are responsible for governing decision-making processes and ensuring consistent, valid outputs. Specifically, deficiencies can manifest as errors in reasoning, incorrect state management, or the inability to effectively handle unexpected inputs or edge cases. This category does not encompass external tool failures or API limitations, but rather issues originating from the agent’s core logical architecture and its implementation of control flow.
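One way to harden the control layer is to validate every action the model proposes before executing it, and to feed rejections back to the planner rather than crashing. The sketch below assumes a hypothetical action schema (an "action" name plus an "args" mapping) purely for illustration.

```python
ALLOWED_ACTIONS = {"search", "summarize", "write_file"}  # hypothetical action set

def validate_action(proposed: dict) -> dict:
    """Reject malformed or out-of-policy actions before the agent executes them."""
    action = proposed.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if not isinstance(proposed.get("args"), dict):
        raise ValueError("action arguments must be a mapping")
    return proposed

def safe_execute(proposed: dict, handlers: dict) -> dict:
    """Run a proposed action only after it passes validation; otherwise recover."""
    try:
        checked = validate_action(proposed)
    except ValueError as err:
        # Return the error to the planner instead of executing blindly.
        return {"status": "rejected", "reason": str(err)}
    return {"status": "ok", "result": handlers[checked["action"]](**checked["args"])}
```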

Performance failures in agents are commonly observed as memory management errors, constituting 16.03% of all reported failures. These errors arise when agents exceed allocated memory limits or fail to efficiently store and retrieve information necessary for task completion. Insufficient memory capacity or inefficient data handling directly hinders the agent’s ability to effectively process information, reason through complex scenarios, and ultimately achieve desired outcomes. This can manifest as slow response times, incomplete task execution, or outright system crashes, significantly impacting operational reliability and requiring robust monitoring and optimization strategies.
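A simple guard against unbounded context growth is to cap the agent's working memory and evict the oldest entries once a budget is exceeded. The sketch below uses a crude character budget as a stand-in for real token accounting.

```python
from collections import deque

class BoundedMemory:
    """Keep the most recent messages within a rough character budget."""

    def __init__(self, max_chars: int = 8000):
        self.max_chars = max_chars   # stand-in for a real token budget
        self.messages = deque()

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Evict the oldest messages once the budget is exceeded, so the
        # context passed to the model never grows without bound.
        while sum(len(m) for m in self.messages) > self.max_chars and len(self.messages) > 1:
            self.messages.popleft()

    def context(self) -> str:
        return "\n".join(self.messages)
```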

Agent frameworks can fail due to unintended dependencies, as demonstrated in these illustrative examples.

Architectural Solutions: Building Resilient Agent Systems

Agent frameworks such as LangChain, AutoGen, LlamaIndex, and CrewAI each address a distinct facet of agent development. LangChain provides a modular framework for chaining language model components, prompts, and tools into complex, multi-step workflows. AutoGen emphasizes multi-agent collaboration, supporting systems in which several agents communicate and coordinate toward a common goal. LlamaIndex specializes in retrieval-augmented generation (RAG), efficiently indexing and retrieving external data to extend the knowledge available to language models. CrewAI focuses on structured workflows, defining roles and responsibilities for individual agents so that their collective behavior remains organized and predictable. Together, these complementary strengths form the building blocks from which developers assemble more capable agent systems.

Analysis of current agent-based project implementations indicates a strong trend towards framework integration, with 96% of projects employing more than one agent framework. This suggests that single frameworks often lack the complete functionality required for complex applications, necessitating the combination of modular components. Specifically, developers are leveraging the strengths of different frameworks – such as LangChain’s chaining capabilities, AutoGen’s multi-agent support, LlamaIndex’s retrieval features, and CrewAI’s workflow management – to construct more robust and effective agent systems capable of addressing a wider range of tasks and exhibiting increased resilience to limitations inherent in individual frameworks.
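In practice, this mixing tends to happen behind thin adapter interfaces, so a retrieval component from one stack can feed a planner or orchestrator from another. The sketch below illustrates the idea with structural typing in Python; the Retriever and Planner protocols are hypothetical and do not mirror any framework's actual classes.

```python
from typing import List, Protocol

class Retriever(Protocol):
    """Minimal contract a retrieval component must satisfy."""
    def retrieve(self, query: str) -> List[str]: ...

class Planner(Protocol):
    """Minimal contract a planning/orchestration component must satisfy."""
    def plan(self, goal: str, context: List[str]) -> List[str]: ...

def run_pipeline(goal: str, retriever: Retriever, planner: Planner) -> List[str]:
    """Compose a retrieval component from one stack with a planner from another."""
    context = retriever.retrieve(goal)   # e.g. an index-backed retriever wrapped in an adapter
    return planner.plan(goal, context)   # e.g. a multi-agent or workflow orchestrator
```

Keeping each framework behind such a narrow seam is also what limits the blast radius when one of them ships a breaking change.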

Growth trajectories of GitHub projects reveal performance differences across the ten agent frameworks.

The study meticulously dissects the developer experience within LLM agent frameworks, revealing a landscape often burdened by unnecessary complexity. It’s a testament to the core finding – developers grapple not solely with novel AI concepts, but with the intricacies of integrating these into existing software development lifecycles. This echoes David Hilbert’s sentiment: “One must be able to say at all times what one knows and what one does not.” The research diligently maps what developers do know – traditional software engineering – and what remains unclear regarding the efficient orchestration of LLM agents, effectively illuminating the gaps in current practices and framework design. The pursuit of streamlined development, free from superfluous layers, aligns with a fundamental principle: simplicity isn’t a limitation, but evidence of profound understanding.

What Remains?

This investigation into developer practices surrounding large language model agents has, predictably, revealed more questions than answers. The identified challenges – primarily concerning context management and lifecycle integration – are not novel failings of this particular technology, but rather symptoms of a recurring pattern: the rush to deploy before understanding the underlying engineering demands. Simplicity is intelligence, not limitation; the current proliferation of agent frameworks demonstrates a preference for additive complexity over reductive clarity.

Future work must abandon the pursuit of ‘more features’ and concentrate on foundational elements. A singular, well-defined model context protocol, universally adopted, would offer immediate, measurable improvement. Further research should prioritize the distillation of agent development into established software engineering principles, rather than treating it as a unique, exceptional endeavor. If it can’t be explained in one sentence, it isn’t understood.

Ultimately, the true test will not be the sophistication of the agents themselves, but the degree to which their development can be rendered unremarkable. The goal is not to build intelligence, but to manage complexity; to create tools that fade into the background, allowing practitioners to focus on the problem, not the machinery.


Original article: https://arxiv.org/pdf/2512.01939.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
