Agents That Build Agents: Introducing OpenSage

Author: Denis Avetisyan

A new framework empowers artificial intelligence to autonomously design and construct complex agent systems, surpassing the capabilities of current agent development kits.

Overview of the OpenSage framework: AI dynamically constructs and manages diverse agent topologies drawn from a unified agent pool. A hierarchical tool structure provides tool-specific sandboxing, state management, and asynchronous execution, and graph-based short- and long-term memory, accessed through a dedicated memory agent, supports complex reasoning and adaptation.

OpenSage is an Agent Development Kit enabling AI-driven construction of agents, tools, hierarchical memory, and topologies using Large Language Models.

Current agent development kits (ADKs) often require substantial human effort to design agent topologies, toolsets, and memory systems, which limits their generalizability and performance. To address this, we introduce OpenSage (Self-programming Agent Generation Engine), a framework that lets large language models autonomously construct agents from self-generated components and a hierarchical memory system. Extensive experiments show that OpenSage outperforms existing ADKs across multiple benchmarks, with the gains attributed to automated design and optimized memory management. Could this mark a paradigm shift toward fully AI-centered agent development, and a new era of autonomous problem-solving?

The Inherent Limitations of Manual Agent Design

Building intelligent agents by conventional methods demands considerable time and specialized skill. Engineers and domain experts typically handcraft each component, from perception and decision-making to action execution, much like assembling a complex machine part by part. This requires a deep understanding of both the target environment and the underlying AI algorithms, along with extensive coding, testing, and iterative refinement. As a result, even moderately sophisticated agents are slow and expensive to develop, which hinders wider adoption and progress toward truly autonomous systems. Reliance on expert knowledge also limits scalability: adapting an agent to a slightly different scenario often requires a complete redesign rather than a simple adjustment.

Purely manual design also struggles in dynamic real-world settings. Every change to an agent’s behavior, whether to handle a novel obstacle or an unforeseen event, requires skilled developers to revise the code and re-test it. This iterative loop becomes a bottleneck that limits how well the agent operates in unpredictable environments. As tasks grow more complex, with more variables and interactions to manage, the manual effort grows rapidly, making it harder to keep agents responsive and adaptable. Consequently, handcrafted agents often fail to generalize beyond the scenarios they were designed for, which limits their potential for autonomous operation and intelligent problem-solving.

Together, these limits on flexibility and scalability hamper the pursuit of genuinely autonomous systems. Existing approaches often rely on carefully crafted rule-based systems or narrowly trained algorithms that prove brittle in novel situations or unpredictable environments. Extending their capabilities calls for more than incremental improvements: it requires methods that adapt to unforeseen challenges and integrate new knowledge readily. The difficulty of scaling these systems, that is, of increasing their complexity and operational scope without a disproportionate rise in development effort, remains a critical bottleneck on the path to versatile AI that can tackle real-world problems with human-like adaptability and efficiency.

Agents built with the OpenSage framework and powered by Gemini 3 Pro, GPT-5 Mini, or a collaborative Gemini 3 Pro + GPT-5 Mini setup perform comparably to GPT-5 on Terminal-Bench 2.0, as measured by resolved rate and cost.

An AI-Centered Paradigm for Agent Construction

OpenSage implements an AI-centered paradigm in which AI itself generates agents, defines their operational topologies, and selects the tools they need. This bypasses traditional development, in which human developers design and integrate each component individually. Given specified objectives and constraints, the large language models driving OpenSage construct agent architectures, determine the inter-agent communication pathways that form the topology, and provision the software tools (APIs, models, and data sources) required for the task. This automated construction covers every agent component, from perception and planning modules to action execution, yielding a functional system without significant human intervention in the design phase.
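
As a rough illustration of what an AI-generated topology can look like in practice, the Python sketch below validates an agent and topology proposal that an LLM might return as JSON, before anything is instantiated. The JSON schema and the names `AgentSpec`, `Topology`, and `parse_topology` are hypothetical, chosen for this example rather than taken from OpenSage’s API, and the proposal is hand-written rather than produced by a model.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    name: str
    role: str
    tools: list[str] = field(default_factory=list)


@dataclass
class Topology:
    agents: dict[str, AgentSpec]
    edges: list[tuple[str, str]]  # (sender, receiver) communication links


def parse_topology(llm_output: str, available_tools: set[str]) -> Topology:
    """Validate a topology proposal in a hypothetical JSON schema emitted by an LLM."""
    data = json.loads(llm_output)
    agents: dict[str, AgentSpec] = {}
    for a in data["agents"]:
        requested = a.get("tools", [])
        # Reject proposals that reference tools the runtime cannot provide.
        unknown = set(requested) - available_tools
        if unknown:
            raise ValueError(f"agent {a['name']!r} requests unknown tools {sorted(unknown)}")
        agents[a["name"]] = AgentSpec(a["name"], a["role"], list(requested))
    edges = [(src, dst) for src, dst in data.get("edges", [])]
    # Every communication link must connect two agents defined in the proposal.
    for src, dst in edges:
        if src not in agents or dst not in agents:
            raise ValueError(f"edge {src}->{dst} references an undefined agent")
    return Topology(agents, edges)


# A hand-written example of what the model might be prompted to return.
proposal = json.dumps({
    "agents": [
        {"name": "planner", "role": "decompose the task"},
        {"name": "coder", "role": "edit and run code", "tools": ["shell", "editor"]},
    ],
    "edges": [["planner", "coder"]],
})
topo = parse_topology(proposal, available_tools={"shell", "editor", "browser"})
print(sorted(topo.agents), topo.edges)  # ['coder', 'planner'] [('planner', 'coder')]
```

Checking a model’s proposal against the tools that actually exist is one cheap guard; a complete system would still have to instantiate the agents and wire up their message passing.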

Traditional agent development requires manually defining agent behaviors, topologies, and tool integrations, which consumes substantial engineering resources and time. OpenSage automates these steps by having large language models generate agent configurations, including the choice of agent architecture, the interaction protocols between agents, and the integration of the necessary tools. This shortens development cycles and reduces engineering overhead compared with conventional methods.

By removing manual design and configuration, OpenSage’s automated agent creation enables rapid deployment of agents tailored to specific, user-defined requirements without extensive development cycles. Because the system constructs agent topologies and integrates tools dynamically, the time from initial concept to working deployment could potentially shrink from weeks or months to hours or days. This matters most in dynamic environments where rapid adaptation is critical, letting organizations respond quickly to evolving needs and new opportunities.

An ablation study of the SageAgent within the OpenSage framework on a CyberGym subset demonstrates that both agent topology and the tooling system significantly impact performance.

Memory and Tooling: The Foundations of Autonomous Operation

The OpenSage Memory System is a core component that enables persistent learning and knowledge retention in autonomous agents. It stores interaction history, observed states, and derived insights in a structured, indexed form, so that agents can retrieve them efficiently and build on past experience when making decisions. The system handles both episodic memory (specific events and their contexts) and semantic memory (generalized knowledge and concepts). This lets agents adapt to changing environments and improve over time without explicit reprogramming for each new scenario.
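
A minimal sketch of the episodic/semantic split, assuming a simple in-process store in which a word-level inverted index stands in for OpenSage’s actual indexing, which the article does not specify. `Episode`, `AgentMemory`, `record`, `learn`, and `recall` are illustrative names only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    step: int
    observation: str
    action: str


class AgentMemory:
    """Toy store with an episodic log, semantic facts, and a keyword index (assumed design)."""

    def __init__(self) -> None:
        self.episodes: list[Episode] = []   # episodic: what happened, in order
        self.facts: dict[str, str] = {}     # semantic: concept -> generalized knowledge
        self._index: defaultdict[str, set[int]] = defaultdict(set)  # word -> episode steps

    def record(self, observation: str, action: str) -> None:
        episode = Episode(len(self.episodes), observation, action)
        self.episodes.append(episode)
        for word in observation.lower().split():
            self._index[word].add(episode.step)

    def learn(self, concept: str, knowledge: str) -> None:
        self.facts[concept] = knowledge

    def recall(self, query: str) -> list[Episode]:
        """Return episodes sharing at least one word with the query, oldest first."""
        hits: set[int] = set()
        for word in query.lower().split():
            hits |= self._index.get(word, set())  # .get avoids inserting empty keys
        return [self.episodes[i] for i in sorted(hits)]


mem = AgentMemory()
mem.record("build failed missing openssl header", "install libssl-dev")
mem.record("tests passed", "commit")
mem.learn("libssl-dev", "provides OpenSSL headers on Debian-based systems")
print([e.action for e in mem.recall("build failed again")])  # ['install libssl-dev']
```

Keyword overlap is the crudest possible retrieval rule; it is used here only to show where an index sits between storage and decision-making.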

OpenSage organizes memory hierarchically. Short-term memory, optimized for recent interactions, is paired with long-term memory for persistent knowledge. Both are backed by a graph-based memory that represents information as nodes and relationships, enabling semantic retrieval beyond simple keyword search. Linking concepts and experiences in a graph lets the agent draw inferences and apply learned knowledge to novel situations, while the hierarchy keeps retrieval latency low for frequently accessed data and provides scalable storage for large knowledge bases.
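
One way such a hierarchy could work is sketched below, under loudly stated assumptions: short-term memory is a bounded queue of (subject, relation, object) triples, triples evicted from it are consolidated into a long-term adjacency-list graph, and retrieval is a breadth-first search over a few hops. In OpenSage a dedicated memory agent manages memory; here consolidation is a fixed rule, and `HierarchicalMemory` is a hypothetical class.

```python
from collections import deque


class HierarchicalMemory:
    """Sketch: bounded short-term buffer that spills into a long-term graph."""

    def __init__(self, short_term_size: int = 3) -> None:
        self.capacity = short_term_size
        self.short_term: deque[tuple[str, str, str]] = deque()
        self.graph: dict[str, set[tuple[str, str]]] = {}  # node -> {(relation, neighbor)}

    def add(self, subj: str, rel: str, obj: str) -> None:
        self.short_term.append((subj, rel, obj))
        # Oldest triple moves to long-term storage once the buffer overflows.
        if len(self.short_term) > self.capacity:
            self._consolidate(*self.short_term.popleft())

    def _consolidate(self, subj: str, rel: str, obj: str) -> None:
        # Store both directions so either endpoint can serve as a query entry point.
        self.graph.setdefault(subj, set()).add((rel, obj))
        self.graph.setdefault(obj, set()).add((f"inverse:{rel}", subj))

    def related(self, node: str, hops: int = 2) -> set[str]:
        """Breadth-first search of the long-term graph only: nodes within `hops` edges."""
        seen, frontier = {node}, {node}
        for _ in range(hops):
            next_frontier = set()
            for current in frontier:
                for _relation, neighbor in self.graph.get(current, ()):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.add(neighbor)
            frontier = next_frontier
        return seen - {node}


memory = HierarchicalMemory(short_term_size=2)
memory.add("auth.py", "imports", "crypto.py")
memory.add("crypto.py", "defines", "hash_password")
memory.add("login.py", "calls", "hash_password")
memory.add("tests.py", "covers", "login.py")
print(sorted(memory.related("auth.py")))  # ['crypto.py', 'hash_password']
```

Multi-hop traversal is what lets a query about `auth.py` surface `hash_password`, which never appeared in the same record; a keyword lookup on `auth.py` alone would miss it.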

Dynamic tool creation lets OpenSage agents move beyond predefined toolsets by generating new tools programmatically when a task or an unforeseen circumstance demands it. This goes beyond simple scripting: an agent can define a tool’s specification, including its input parameters, expected outputs, and execution logic. On-demand tool generation thus extends an agent’s capabilities to tasks outside its initial configuration, without external intervention or pre-programmed responses. The system supports a variety of tool types and programming languages, giving it flexibility across diverse operational requirements.
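
To make "tool specification plus execution logic" concrete, here is a hypothetical registry in which an agent submits a `ToolSpec` (name, typed parameters, return type, and source code) and the registry compiles it and type-checks every call. `ToolSpec` and `ToolRegistry` are assumptions for illustration, not OpenSage’s interface. Because `exec` runs model-written code with full interpreter privileges, this should only ever happen inside an isolated sandbox such as the containerized execution described in the next paragraph.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    name: str
    description: str
    params: dict[str, type]  # parameter name -> expected Python type
    returns: type
    source: str              # execution logic, e.g. written by an LLM


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, tuple[ToolSpec, Callable[..., Any]]] = {}

    def create_tool(self, spec: ToolSpec) -> None:
        # Compile the generated source and look up the function named after the tool.
        # WARNING: exec on model-written code is unsafe outside an isolated sandbox.
        namespace: dict[str, Any] = {}
        exec(spec.source, namespace)
        fn = namespace.get(spec.name)
        if not callable(fn):
            raise ValueError(f"source does not define a function named {spec.name!r}")
        self._tools[spec.name] = (spec, fn)

    def call(self, name: str, **kwargs: Any) -> Any:
        spec, fn = self._tools[name]
        # Enforce the declared input contract before running the tool.
        if set(kwargs) != set(spec.params):
            raise TypeError(f"{name} expects parameters {sorted(spec.params)}")
        for key, expected in spec.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        result = fn(**kwargs)
        # Enforce the declared output contract as well.
        if not isinstance(result, spec.returns):
            raise TypeError(f"{name} returned {type(result).__name__}, expected {spec.returns.__name__}")
        return result


registry = ToolRegistry()
registry.create_tool(ToolSpec(
    name="count_lines",
    description="Count non-empty lines in a block of text",
    params={"text": str},
    returns=int,
    source="def count_lines(text):\n    return sum(1 for line in text.splitlines() if line.strip())\n",
))
print(registry.call("count_lines", text="a\n\nb\nc"))  # 3
```

Declaring inputs and outputs alongside the code gives the agent a contract it can check cheaply, so a malformed generated tool fails loudly at the call site rather than silently corrupting later steps.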

OpenSage’s Tooling System aims for operational stability and reproducibility through containerized execution. Each tool, regardless of its complexity, is packaged in a self-contained container (for example, with Docker), which isolates it from system-level dependencies and conflicts and helps it behave consistently across development, testing, and production. The system also supports version control of tools, enabling rollback to earlier versions and auditability. Encapsulating tools and their dependencies in this way reduces runtime errors, simplifies deployment, and improves the reliability of autonomous agent operations.
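
A sketch of the containerized-execution idea, assuming tools ship as Docker images: the hypothetical helper `sandbox_command` builds a `docker run` argument list that runs one tool call in a throwaway container, with memory and CPU caps, only the task directory mounted, and networking disabled by default. The flags are standard `docker run` options; the image name `example/rg-tool:1.2.0` is made up. Pinning an image tag (or, more strictly, an image digest) is one way to obtain the versioning and rollback described above.

```python
import shlex


def sandbox_command(
    image: str,
    tool_argv: list[str],
    host_dir: str,
    memory: str = "512m",
    cpus: float = 1.0,
    allow_network: bool = False,
) -> list[str]:
    """Build a `docker run` argv that executes one tool call in a throwaway container."""
    argv = [
        "docker", "run", "--rm",            # remove the container when the call finishes
        f"--memory={memory}", f"--cpus={cpus}",
        "-v", f"{host_dir}:/workspace",     # only the task directory is visible
        "-w", "/workspace",
    ]
    if not allow_network:
        argv.append("--network=none")       # no network unless the tool needs it
    argv.append(image)                      # a pinned tag or digest keeps reruns reproducible
    return argv + tool_argv


cmd = sandbox_command("example/rg-tool:1.2.0", ["rg", "-n", "TODO", "."], "/tmp/task-42")
print(shlex.join(cmd))
# docker run --rm --memory=512m --cpus=1.0 -v /tmp/task-42:/workspace -w /workspace --network=none example/rg-tool:1.2.0 rg -n TODO .
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it on a host with Docker installed; building the command as a list rather than a shell string avoids quoting bugs.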

An ablation study of the SageAgent, leveraging the OpenSage framework on the SWE-Bench Pro dataset, demonstrates the importance of agentic memory for performance compared to a zero-shot baseline and a non-agentic memory approach.

Validation and Demonstrable Broad Applicability

OpenSage was evaluated on a suite of demanding benchmarks that probe diverse capabilities: SWE-Bench Pro, a standard measure of software engineering ability; LOCOMO, which tests whether an agent can maintain coherent conversational memory over extended interactions; and Terminal-Bench 2.0, a suite of complex tasks requiring reasoning and execution in a terminal environment. This breadth indicates that OpenSage is not limited to a single domain but can solve problems and adapt across several.

On Terminal-Bench 2.0, OpenSage handles complex, realistic terminal tasks well: the agent outperforms previously reported agents on this benchmark and places at the top of the associated leaderboard. The improvement is substantial rather than incremental, with OpenSage completing more of the intricate tasks that require sustained interaction and problem-solving in the terminal. These results point to robust reasoning and planning, and mark a significant step toward autonomous agents that operate effectively in dynamic, unpredictable digital environments.

In realistic, dynamic environments, OpenSage achieved a more than 20% improvement in resolution rate over OpenHands on CyberGym while using the same foundation model, indicating more efficient problem-solving on cybersecurity challenges. OpenSage also outperforms the established SWE-agent baseline on SWE-Bench Pro, demonstrating its capability on complex software engineering tasks and suggesting potential for automating and assisting coding workflows. Together, these results show OpenSage meeting or exceeding existing performance standards across diverse, demanding scenarios.

OpenSage’s capabilities also extend beyond software engineering. On LOCOMO, a benchmark for long-context conversational memory, it achieves results comparable to those of specialized memory-augmented systems such as Mem0 and Mem0g. This shows that OpenSage can manage and use information across extended dialogues without a coding-specific focus, suggesting that its architecture generalizes to diverse, context-rich conversational tasks beyond code generation.

SageAgent, built on OpenSage, performs competitively against state-of-the-art agents and agent development kits (ADKs) across three widely used agentic benchmarks, rivaling Gemini 3 (“G.3”).

Toward Truly Autonomous Agents: A Vision for the Future

OpenSage supports flexible multi-agent topologies, specifically vertical and horizontal configurations. Vertical topologies arrange agents hierarchically and suit tasks that need specialized expertise and efficient information flow: think of a team in which each member has a distinct skill and reports to a central coordinator. Horizontal topologies distribute work across a network of equivalent agents and suit parallel computation and robust problem-solving, much like a swarm of robots working together. Beyond the architecture itself, OpenSage can switch between these configurations dynamically, or combine them in hybrids, to match the demands of the task at hand. The result is a system that can address a wider range of challenges more efficiently than traditional fixed-topology multi-agent systems.
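
The difference between the two topologies can be sketched with plain functions standing in for LLM-backed agents. In this hypothetical example, `run_vertical` has a coordinator route each subtask to the specialist registered for it, while `run_horizontal` gives the same task to equivalent peers in parallel and takes a majority vote. Majority voting is just one assumed aggregation rule, not necessarily the one OpenSage uses, and nesting a horizontal group inside a vertical slot gives a simple hybrid.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]  # stand-in for an LLM-backed agent: prompt in, answer out


def run_vertical(subtasks: dict[str, str], specialists: dict[str, Agent]) -> dict[str, str]:
    """Coordinator routes each subtask to the specialist registered for that skill."""
    return {skill: specialists[skill](task) for skill, task in subtasks.items()}


def run_horizontal(task: str, peers: list[Agent]) -> str:
    """Equivalent peers attempt the same task in parallel; the most common answer wins."""
    if not peers:
        raise ValueError("need at least one peer")
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        answers = list(pool.map(lambda agent: agent(task), peers))
    return Counter(answers).most_common(1)[0][0]


specialists: dict[str, Agent] = {
    "recon": lambda t: f"recon:{t}",
    "exploit": lambda t: f"exploit:{t}",
}
print(run_vertical({"recon": "scan service", "exploit": "craft input"}, specialists))
# {'recon': 'recon:scan service', 'exploit': 'exploit:craft input'}

peers: list[Agent] = [lambda t: "42", lambda t: "42", lambda t: "41"]
print(run_horizontal("compute answer", peers))  # 42

# Hybrid: one slot of a vertical topology is filled by a horizontal group.
hybrid: dict[str, Agent] = {"solve": lambda t: run_horizontal(t, peers)}
print(run_vertical({"solve": "compute answer"}, hybrid))  # {'solve': '42'}
```

The vertical form trades parallelism for specialization, while the horizontal form spends extra calls to gain robustness against any single agent’s mistake.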

The current framework is an important stepping stone, and ongoing research aims to extend its capabilities to increasingly intricate problems. Future development focuses on scaling the architecture so that it can process larger datasets and manage more complex interactions without sacrificing efficiency. In parallel, integrating advanced reasoning modules, incorporating techniques such as symbolic AI and probabilistic inference, should let agents not merely react to stimuli but proactively plan, form hypotheses, and learn from experience. Together, these directions aim to move beyond task-specific performance toward genuine adaptability, so that agents can handle open-ended challenges with greater robustness and ingenuity.

AI research increasingly centers on genuinely autonomous agents: systems that are not simply programmed to react but can solve problems independently in unpredictable scenarios. The goal goes beyond task-specific AI toward agents with a general capacity for adaptation, able to function effectively across a wide range of environments and challenges. Such agents would not need explicit reprogramming for each new situation; instead, they would use advanced reasoning and learning to analyze circumstances, formulate strategies, and execute solutions with minimal human intervention. Building such adaptable agents would be a pivotal step toward realizing the full potential of AI, with intelligent systems that proactively address complex problems and contribute to innovation across many areas of life.

The development of OpenSage, as detailed in the paper, reflects a pursuit of provable systems. Its capacity for autonomous agent construction, tool creation, and hierarchical memory design is not only about achieving functional results; it aims at a framework in which an agent’s behavior follows logically from its underlying architecture. This resonates with Tim Berners-Lee’s observation: “The Web is more a social creation than a technical one.” OpenSage is technical, but its real power may lie in democratizing agent development: providing a transparent, verifiable foundation on which others can build, and fostering a collaborative ‘social creation’ of increasingly complex AI systems. If emergent behavior feels magical, one must reveal the invariant, the underlying logic, and OpenSage strives to do just that.

What’s Next?

The advent of OpenSage and systems like it raises a fundamental question: as the machinery of agent construction becomes increasingly automated, what truly remains the province of the architect? Current benchmarks demonstrate performance gains, yet benchmarks are, by their nature, finite. Let N approach infinity: what remains invariant? The core challenge is not merely building agents that perform well on predefined tasks but establishing formal guarantees about their behavior in genuinely novel situations. The focus must shift from empirical validation to provable correctness.

Existing Agent Development Kits largely treat memory as an external construct, a data store to be accessed. OpenSage’s hierarchical memory is a step forward, but true intelligence may lie in the agent’s ability to self-model its own memory – to understand not just what it remembers, but how and why. This requires a move beyond purely associative recall towards a system capable of metacognition, of reasoning about its own knowledge representation.

Ultimately, the pursuit of self-programming agents risks becoming an exercise in sophisticated pattern matching. The elegance of a solution isn’t measured by its ability to pass a test suite, but by its adherence to fundamental principles. The next generation of ADKs must prioritize formal verification, robust error handling, and a commitment to building agents that are not merely clever, but demonstrably correct.

Original article: https://arxiv.org/pdf/2602.16891.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/