Author: Denis Avetisyan
Researchers have developed a novel framework that allows artificial intelligence to autonomously enhance its reasoning capabilities without relying on pre-labeled training data.

Agent0 utilizes a curriculum learning approach with tool-integrated reinforcement learning to enable self-evolution in large language models.
Despite advances in large language models, scaling AI reasoning capabilities remains constrained by reliance on human-labeled data and limited capacity for complex problem-solving. This paper introduces ‘Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning’, a novel framework that autonomously evolves high-performing agents through a symbiotic co-evolution of curriculum and executor agents, leveraging tool use to drive continuous improvement. Empirically, Agent0 demonstrates substantial gains in reasoning, boosting performance on benchmarks by up to 24% without external data. Could this approach unlock a pathway towards truly self-improving AI systems capable of surpassing human-level reasoning?
The Erosion of Static Systems: Introducing Agent0
The conventional trajectory of artificial intelligence development is significantly hampered by its reliance on extensive, human-provided datasets for training and validation. This dependence creates substantial bottlenecks, as the process of data annotation is both time-consuming and expensive, limiting the speed at which AI systems can be refined and deployed. Moreover, the scalability of these systems is inherently restricted; as models grow in complexity and demand more nuanced understanding, the need for ever-larger, meticulously labeled datasets escalates exponentially. This presents a critical challenge, particularly in domains where data is scarce, expensive to obtain, or requires specialized expertise for accurate annotation, ultimately hindering the broader advancement and practical application of artificial intelligence.
The Agent0 Framework represents a significant departure from conventional artificial intelligence development, introducing a system where Large Language Model (LLM) agents evolve independently through a process of co-evolution. This innovative approach bypasses the traditional reliance on extensive human-annotated datasets, effectively removing a major bottleneck in AI scalability and deployment. Instead of human guidance, agents are pitted against each other in a dynamic environment, fostering adaptation and improvement through iterative competition and collaboration. This self-directed evolution allows the agents to refine their reasoning capabilities without external intervention, leading to increasingly sophisticated problem-solving skills and a demonstrated enhancement in both mathematical and general reasoning performance. The framework’s success highlights the potential for creating AI systems that can learn and improve autonomously, paving the way for more robust and adaptable artificial intelligence.
The Agent0 framework fundamentally relies on the inherent capabilities of Large Language Models to drive autonomous improvement. By building upon these models, the system isn’t simply programmed with solutions, but rather equipped with the capacity for complex reasoning and adaptive learning. This allows agents within the framework to iteratively refine their problem-solving strategies without human guidance. Recent evaluations demonstrate the effectiveness of this approach, revealing an 18% performance increase in mathematical reasoning tasks and a significant 24% improvement in broader general reasoning abilities, showcasing the potential of LLM-driven co-evolution to surpass traditional, manually-tuned AI systems.

Dynamic Curricula: Forging Intelligence Through Competition
Agent0 utilizes a Curriculum Agent to iteratively create tasks tailored to the capabilities of its Executor Agent, a process designed to exceed existing performance limits. This dynamic task generation differs from static datasets or pre-defined curricula by adapting to the Executor Agent’s current skill level; the Curriculum Agent assesses performance and subsequently generates tasks of increasing complexity. This approach facilitates continuous learning and avoids premature saturation by consistently presenting challenges just beyond the Executor Agent’s current competence, thereby actively ‘pushing its boundaries’ and driving performance improvements.
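As a rough illustration, the sketch below shows how such a co-evolution loop might be wired together in Python. The agent objects and their methods (propose_task, attempt, estimate_success, update) are hypothetical placeholders, not the framework's actual interfaces.

```python
# Rough sketch of the curriculum/executor co-evolution loop. All agent methods
# here are hypothetical placeholders, not the paper's actual interfaces.

def co_evolve(curriculum_agent, executor_agent, n_rounds: int = 100):
    history = []  # recent (task, success_rate) pairs summarizing executor skill
    for _ in range(n_rounds):
        # The curriculum agent conditions on recent performance so that new tasks
        # sit just beyond the executor's current competence.
        task = curriculum_agent.propose_task(history)

        # The executor attempts the task several times; agreement across attempts
        # gives a rough signal of how challenging the task was.
        attempts = [executor_agent.attempt(task) for _ in range(8)]
        success_rate = executor_agent.estimate_success(task, attempts)

        # Both agents update from the interaction (RL updates in the framework).
        curriculum_agent.update(task, success_rate)
        executor_agent.update(task, attempts)
        history.append((task, success_rate))
    return executor_agent
```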
A Repetition Penalty was implemented within the task generation process to mitigate stagnation and promote diversity in the training curriculum. This penalty functions by decreasing the probability of the Curriculum Agent selecting tasks similar to those recently presented to the Executor Agent. Specifically, the penalty is applied during task selection, reducing the likelihood of repeating previously generated prompts and encouraging the exploration of novel problem spaces. This mechanism prevents the Executor Agent from overspecializing on a narrow range of tasks and fosters more robust and generalized reasoning capabilities.
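The sketch below illustrates one way such a penalty could be computed, assuming a simple token-overlap (Jaccard) similarity to recently generated tasks; the framework's actual similarity measure may differ.

```python
# Minimal sketch of a repetition penalty using token-overlap (Jaccard) similarity.
# The similarity measure is an assumption chosen for illustration.

def repetition_penalty(candidate_task: str, recent_tasks: list[str], weight: float = 1.0) -> float:
    """Return a penalty in [0, weight]; larger when the candidate resembles recent tasks."""
    cand = set(candidate_task.lower().split())
    if not recent_tasks or not cand:
        return 0.0

    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    max_similarity = max(jaccard(cand, set(t.lower().split())) for t in recent_tasks)
    return weight * max_similarity

# The curriculum agent would subtract this penalty from a candidate task's score,
# lowering the chance of re-selecting near-duplicates of recent prompts.
```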
The Curriculum Agent’s task generation is optimized through the implementation of two reward signals: Uncertainty Reward and Tool Use Reward. Uncertainty Reward incentivizes the generation of tasks where the Executor Agent exhibits low confidence in its predictions, thus focusing learning on areas of weakness. Tool Use Reward specifically encourages tasks that require the Executor Agent to utilize available tools to arrive at a solution. The combined effect of these reward signals resulted in a 24% improvement in general reasoning performance, demonstrating a quantifiable benefit to incentivizing both challenge and effective problem-solving strategies within the generated curriculum.
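The following sketch shows one plausible way the two signals could be combined into a single curriculum reward. The specific shaping, with the uncertainty term peaking at 50% executor agreement, is an assumption for illustration rather than the paper's exact formulation.

```python
# Sketch of combining the two curriculum reward signals. The shaping below is an
# illustrative assumption, not the paper's exact formula.

def curriculum_reward(executor_agreement: float, used_tool: bool,
                      w_uncertainty: float = 1.0, w_tool: float = 0.5) -> float:
    """executor_agreement: fraction of sampled executor answers that agree (0.0-1.0)."""
    # Uncertainty reward: highest when the executor is maximally unsure, i.e. its
    # sampled answers agree only about half the time.
    uncertainty = 1.0 - abs(2.0 * executor_agreement - 1.0)

    # Tool-use reward: bonus when solving the task required the code interpreter.
    tool_bonus = 1.0 if used_tool else 0.0

    return w_uncertainty * uncertainty + w_tool * tool_bonus
```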

Reinforcement Learning: A System Adapting to its Own Imperfections
Both the Curriculum and Executor Agents within the system are trained using Reinforcement Learning (RL), a machine learning paradigm where agents learn to make sequential decisions by maximizing a cumulative reward. This approach allows the agents to improve their performance over time through repeated interactions with an environment, receiving feedback in the form of rewards or penalties for each action taken. The agents are not explicitly programmed with a fixed set of rules, but rather discover optimal strategies through trial and error, adapting their behavior based on the consequences of their actions. This iterative process of exploration and exploitation is central to the RL methodology and enables the agents to learn complex tasks without requiring labeled training data, although reward signals are necessary to guide learning.
The Executor Agent mitigates the impact of inaccurate labels generated during autonomous operation by employing a pseudo-labeling technique. This involves the agent generating multiple responses to a given input and then assigning a pseudo-label based on the majority vote of these responses. Specifically, if the Executor Agent produces $n$ outputs, the label receiving more than 50% of the votes is adopted as the pseudo-label. This self-generated label is then used in subsequent training iterations, effectively filtering out potentially erroneous labels and improving the robustness of the learning process in the presence of noisy data.
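A minimal sketch of this majority-vote scheme, following the description above; the answer-normalization step is a simplifying assumption.

```python
from collections import Counter

# Sketch of majority-vote pseudo-labeling: sample n answers and adopt one as the
# pseudo-label only if it wins more than half the votes.

def pseudo_label(answers: list[str]) -> str | None:
    """Return the majority answer, or None if no answer exceeds 50% of the votes."""
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count > len(normalized) / 2 else None

# Example: pseudo_label(["42", "42", "41", "42"]) -> "42"
#          pseudo_label(["42", "41", "40", "39"]) -> None (too ambiguous to train on)
```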
Ambiguity-Dynamic Policy Optimization (ADPO) builds upon Group Relative Policy Optimization (GRPO) to improve training stability and performance in scenarios with noisy labels. ADPO dynamically adjusts the policy optimization process based on the ambiguity present in the pseudo-labels generated by the Executor Agent. Specifically, it introduces a mechanism to down-weight actions with high ambiguity, effectively focusing training on more confident predictions. This approach mitigates the negative impact of inaccurate pseudo-labels and accelerates learning. Experimental results demonstrate that the implementation of ADPO contributes to an 18% improvement in mathematical reasoning capabilities, as measured by performance on relevant benchmark datasets.
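The sketch below illustrates the general idea of down-weighting ambiguous samples within a GRPO-style group advantage. The precise weighting rule used by ADPO is not reproduced here; the linear form is an assumption.

```python
import numpy as np

# Hedged sketch: scale down advantages for samples whose pseudo-labels were
# ambiguous before the policy update. The weighting rule is an assumption.

def ambiguity_weighted_advantages(rewards: np.ndarray, label_confidence: np.ndarray) -> np.ndarray:
    """rewards: per-sample rewards within one group; label_confidence: majority-vote share in [0, 1]."""
    # GRPO-style group-relative advantage: reward minus the group mean, normalized.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Down-weight samples whose pseudo-label was uncertain (vote share near 0.5),
    # so that noisy labels contribute less to the gradient.
    weights = np.clip(2.0 * label_confidence - 1.0, 0.0, 1.0)
    return advantages * weights
```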
Beyond Calculation: Augmenting Reasoning with External Tools
The Agent0 framework significantly extends the capabilities of artificial intelligence through the incorporation of tool integration, most notably a dedicated code interpreter. This allows the agent to move beyond simple textual responses and actively execute code, effectively transforming it into a computational problem-solver. By running code and receiving direct feedback on its outputs, the agent can iteratively refine its approach and achieve solutions previously unattainable. This dynamic process bypasses the limitations of purely language-based reasoning, enabling the agent to handle tasks requiring numerical calculation, data analysis, or algorithmic logic – fostering a level of performance that mimics a more versatile and adaptive intelligence.
The integration of computational tools significantly expands the Executor Agent’s capabilities, moving beyond purely linguistic problem-solving to encompass tasks demanding numerical calculation, data analysis, and algorithmic execution. This agent can now address challenges previously inaccessible, such as complex mathematical problems, data-driven inquiries, and tasks requiring external information processing. By leveraging a code interpreter, the Executor Agent doesn’t simply process information; it actively computes solutions, generating results with a level of precision and complexity unattainable through language models alone. This enhanced functionality represents a substantial advancement, allowing the agent to tackle a far wider spectrum of real-world problems and paving the way for more sophisticated autonomous reasoning systems.
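The sketch below shows what a single tool call of this kind might look like: the executor emits Python code, the framework runs it, and the output is returned for the next reasoning step. Sandboxing details, timeouts, and the prompt format are assumptions.

```python
import subprocess
import tempfile
import textwrap

# Minimal sketch of one tool-integrated reasoning step: execute model-emitted
# Python in a subprocess and return the result for the next prompt.

def run_code_tool(code: str, timeout_s: int = 10) -> str:
    """Execute model-emitted Python and return its stdout (or the error text)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(code))
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout_s)
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "ERROR: execution timed out"

# The returned output would be appended to the conversation so the agent can revise
# its reasoning based on concrete computational feedback.
```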
The Agent0 system’s Executor Agent incorporates a technique called Self-Consistency to rigorously evaluate the reliability of its own outputs. Rather than simply providing a single answer, the agent generates multiple independent responses to the same prompt, then assesses the degree to which those responses align. This internal scrutiny allows the agent to quantify its own uncertainty – flagging potentially unreliable conclusions when inconsistencies arise. By prioritizing consistent responses and learning from discrepancies, the system achieves notable performance gains; evaluations demonstrate a 24% improvement in general reasoning capabilities and an 18% increase in accuracy when tackling mathematical problems, effectively building a more robust and self-correcting artificial intelligence.
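A minimal sketch of such a self-consistency check, assuming a generic sampling interface for the executor:

```python
from collections import Counter

# Sketch of a self-consistency check: sample several independent answers to the
# same prompt and use their agreement as a confidence estimate. The sampling
# callable is a hypothetical stand-in for the executor's generation interface.

def self_consistency(sample_answer, prompt: str, k: int = 8) -> tuple[str, float]:
    """Return the most common answer and the fraction of the k samples that agree with it."""
    answers = [sample_answer(prompt) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    confidence = count / k  # low agreement flags a potentially unreliable conclusion
    return best, confidence
```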
The pursuit of autonomous improvement, as demonstrated by Agent0, echoes a fundamental truth about all complex systems. Any gains in reasoning ability, achieved through self-evolution and tool integration, are inherently temporary; they age faster than expected, demanding continuous refinement. This framework, leveraging a curriculum agent and executor agent, embodies a journey not towards perfection, but towards graceful decay: a constant recalibration against the inevitable entropy. As Blaise Pascal observed, “All of humanity’s problems stem from man’s inability to sit quietly in a room alone.” Agent0, in its tireless self-improvement, attempts to fill that quiet room, not with stillness, but with the dynamic process of ongoing evolution, a testament to the fact that the arrow of time necessitates constant adaptation.
What Lies Ahead?
The architecture presented here, while demonstrating a capacity for autonomous refinement, merely postpones the inevitable entropy inherent in all complex systems. Agent0 constructs a scaffolding of improvement, but the very act of scaling this scaffolding introduces new vectors for decay. The elegance of self-evolution rests on the assumption that the environment remains, if not static, at least predictably chaotic. A truly robust system must account for unforeseen shifts in the landscape of available tools and the evolving logic of those tools themselves.
The reliance on a curriculum, however dynamically generated, hints at a fundamental limitation. It’s a structured approach to an unstructured problem. The system learns how to learn within predefined boundaries. The next iteration might explore methods to dismantle those boundaries, allowing for genuinely emergent behavior, even if that behavior occasionally manifests as spectacular failure. Stability, after all, is often just a temporary deferral of systemic collapse.
Ultimately, the pursuit of self-improving agents isn’t about achieving perfect intelligence, but about understanding the limits of adaptation. The framework offers a compelling, if provisional, answer to the question of how to bootstrap reasoning ability. The true test lies not in the initial ascent, but in observing how gracefully – or not – the system ages.
Original article: https://arxiv.org/pdf/2511.16043.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/