Author: Denis Avetisyan
Current AI systems struggle with true autonomy, relying on pre-defined pathways – this review proposes a new framework inspired by how humans and animals learn.

A novel System A/B/M architecture and an evolutionary-developmental approach address limitations in autonomous learning through meta-control, curriculum learning, and intrinsic motivation.
Despite decades of progress, artificial intelligence struggles with the flexible, adaptive learning characteristic of even simple organisms. This limitation is addressed in ‘Why AI systems don’t learn and what to do about it: Lessons on autonomous learning from cognitive science’, which proposes a novel learning architecture, integrating observational learning (System A) and active exploration (System B) under meta-cognitive control (System M), inspired by evolutionary and developmental principles. This framework moves beyond reliance on human-designed curricula by enabling agents to autonomously construct and refine their own learning pathways. Could embracing these biologically plausible mechanisms finally unlock truly autonomous and robust artificial intelligence?
The Loom of Prediction: Building Internal Worlds
For an autonomous agent to navigate and thrive, a comprehensive understanding of its surroundings is paramount – this is achieved through the construction of an internal ‘World Model’. This model isn’t merely a static map, but a dynamic, predictive representation of the environment, allowing the agent to anticipate the consequences of its actions and formulate effective plans. Essentially, the agent learns to simulate reality within its own system, forecasting future states based on current observations and past experiences. This predictive capability is crucial for tasks ranging from simple obstacle avoidance to complex strategic decision-making, enabling the agent to act proactively rather than reactively and to generalize its knowledge to novel situations. Without such an internal representation, an agent would be limited to immediate sensory input, severely hindering its ability to operate independently and achieve long-term goals.
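A minimal sketch of the idea, with invented names (`WorldModel`, `plan`) that are illustrative rather than taken from the paper: the agent records what followed its past actions, then uses those predictions to choose actions by simulation instead of trial and error.

```python
class WorldModel:
    """Tabular predictive model: maps (state, action) -> predicted next state."""
    def __init__(self):
        self.transitions = {}  # (state, action) -> next_state

    def observe(self, state, action, next_state):
        # Learn from experience: remember what followed this state-action pair.
        self.transitions[(state, action)] = next_state

    def predict(self, state, action):
        # Anticipate the consequence of an action without executing it.
        return self.transitions.get((state, action))

def plan(model, state, actions, goal):
    """Pick the action whose *predicted* outcome matches the goal, if any."""
    for a in actions:
        if model.predict(state, a) == goal:
            return a
    return None

# The agent first explores, building its internal model...
model = WorldModel()
model.observe("at_door", "push", "inside")
model.observe("at_door", "wait", "at_door")

# ...then acts proactively by simulating outcomes instead of acting blindly.
best = plan(model, "at_door", ["wait", "push"], goal="inside")
```

Real world models are learned neural predictors over continuous observations; the lookup table here only illustrates the predict-before-acting loop.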
The development of truly autonomous agents faces a significant hurdle: the need for extensive, labeled datasets to understand and interact with the world. Traditional machine learning methods demand painstakingly annotated examples for every conceivable scenario, a process that becomes exponentially more difficult – and ultimately impractical – in complex, dynamic environments. Consequently, a crucial shift is underway towards self-supervised learning, where agents learn by predicting aspects of their own sensory input. Rather than requiring external labels, the agent generates its own training signals by attempting to understand the inherent structure and patterns within the raw data stream. This allows for the acquisition of knowledge through exploration and interaction, enabling the agent to build a robust internal model of its surroundings without the limitations of human-provided annotations, and paving the way for adaptability in previously unseen situations.
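The self-supervised principle can be sketched in a few lines: the "label" is just the agent's own input stream shifted by one step, so no human annotation is needed. The linear predictor and the hidden halving pattern below are toy assumptions for illustration.

```python
# Self-supervised signal: predict the next observation from the current one.
# The training target is the stream itself, shifted by one step.

def fit_next_step_predictor(stream, lr=0.01, epochs=200):
    """Fit x[t+1] ~= w * x[t] by minimising squared prediction error."""
    w = 0.0
    for _ in range(epochs):
        for t in range(len(stream) - 1):
            pred = w * stream[t]
            error = pred - stream[t + 1]     # self-generated training signal
            w -= lr * error * stream[t]      # gradient step, no human labels
    return w

# A raw sensory stream with hidden structure: each value is half the previous.
stream = [8.0, 4.0, 2.0, 1.0, 0.5]
w = fit_next_step_predictor(stream)
```

The learned weight converges to the stream's underlying structure (0.5) purely from prediction error, which is the essence of label-free learning from raw data.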
The efficacy of an autonomous agent’s learning process is deeply intertwined with the sequencing of tasks it undertakes; a principle known as Curriculum Learning suggests that introducing challenges in a carefully orchestrated progression markedly improves performance and speed of knowledge acquisition. Rather than confronting an agent with the full complexity of an environment from the outset, this approach begins with simpler tasks that build foundational skills. As the agent demonstrates mastery, the difficulty incrementally increases, fostering robust generalization and preventing the agent from becoming overwhelmed or stuck in suboptimal strategies. This mirrors the way humans learn – building competence through manageable steps – and offers a pathway toward creating agents capable of navigating increasingly complex and dynamic scenarios with greater efficiency and adaptability.
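One common way to implement this progression, sketched here with invented names and thresholds (the paper does not prescribe this exact scheduler): promote the agent to the next difficulty level only after a sustained run of success on the current one.

```python
class CurriculumScheduler:
    """Advance to harder tasks only after the agent shows mastery."""
    def __init__(self, levels, mastery_threshold=0.9, window=5):
        self.levels = levels              # ordered easy -> hard
        self.level = 0
        self.threshold = mastery_threshold
        self.window = window
        self.recent = []                  # recent success flags, current level

    def current_task(self):
        return self.levels[self.level]

    def report(self, success):
        # Track a sliding window of outcomes; promote on sustained success.
        self.recent.append(1.0 if success else 0.0)
        self.recent = self.recent[-self.window:]
        if (len(self.recent) == self.window
                and sum(self.recent) / self.window >= self.threshold
                and self.level < len(self.levels) - 1):
            self.level += 1
            self.recent = []              # fresh evidence for the new level

sched = CurriculumScheduler(["1-room maze", "3-room maze", "full house"])
for _ in range(5):
    sched.report(success=True)            # agent masters the easy maze
```

The windowed threshold prevents a single lucky episode from triggering promotion, which is what keeps the agent from being thrown into tasks it cannot yet handle.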
![This evolutionary-developmental framework builds autonomous agents by simultaneously optimizing the agent's architecture through environmental interaction and evolving meta-parameters to maximize a lifecycle fitness function [latex]\mathcal{L}[/latex].](https://arxiv.org/html/2603.15381v1/figures/evolution4.png)
The Meta-Controller: Orchestrating a Symphony of Learning
System M functions as a meta-controller designed to address the complexities inherent in learning within multifaceted systems. Its primary role is the coordination of individual learning processes occurring in subordinate systems, such as System A, by providing a centralized point of control and oversight. This architecture allows for the decomposition of a complex learning task into smaller, more manageable sub-problems, each handled by a dedicated system. System M doesn’t directly perform the learning itself, but instead modulates the learning parameters and exploration strategies of these subordinate systems to achieve a global learning objective. This approach is critical when dealing with systems where independent learning agents might exhibit conflicting behaviors or inefficient resource allocation.
Bilevel optimization, as utilized within System M, involves solving two nested optimization problems concurrently. The outer problem focuses on optimizing the control policies of System M itself, treating the learning process of subordinate systems – such as System A – as a constraint. Simultaneously, the inner problem optimizes the parameters of System A’s learning process, aiming to maximize its performance given the control signals received from System M. This approach differs from traditional single-level optimization by explicitly accounting for the influence of System M’s actions on System A’s learning dynamics, enabling a coordinated optimization of both control and learning processes. The objective function for System M incorporates not only its immediate rewards but also the anticipated future performance of System A, effectively creating a feedback loop that drives improved overall system behavior.
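The nested structure can be illustrated with toy quadratic objectives standing in for the real learning dynamics; the function names and numeric targets below are invented for this sketch, not taken from the paper. The inner loop plays System A (learning under a fixed control signal), the outer loop plays System M (choosing the control signal by evaluating where A's learning ends up).

```python
def inner_optimize(control, steps=100, lr=0.1):
    """System A's role: given M's control signal c, minimise (a - c)^2 over a."""
    a = 0.0
    for _ in range(steps):
        a -= lr * 2 * (a - control)       # gradient descent on A's toy loss
    return a

def outer_objective(control):
    """System M's objective depends on where A's learning settles."""
    a_star = inner_optimize(control)      # A's learning acts as a constraint
    return (a_star - 3.0) ** 2            # M wants A to end up near 3.0

def outer_optimize(steps=50, lr=0.2, eps=1e-3):
    """System M's role: finite-difference descent on the nested objective."""
    c = 0.0
    for _ in range(steps):
        grad = (outer_objective(c + eps) - outer_objective(c - eps)) / (2 * eps)
        c -= lr * grad
    return c

c_star = outer_optimize()                 # M's learned control signal
a_star = inner_optimize(c_star)           # A's behaviour under that signal
```

Note how every outer step re-runs the inner optimization: this is exactly the "control actions shape learning dynamics" coupling the paragraph describes, and also why bilevel methods are computationally expensive in practice.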
A hierarchical learning approach, as implemented within System M, enhances exploration and adaptation by decoupling the optimization of control policies from the learning processes of subordinate systems. This separation allows the meta-controller to strategically direct learning efforts towards areas of high informational gain, rather than relying on random exploration. Consequently, System A and similar systems benefit from a more focused learning trajectory, reducing the need for extensive trial-and-error and conserving computational resources. This targeted approach improves sample efficiency and accelerates convergence towards optimal performance, particularly in complex environments where exhaustive exploration would be impractical.
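One simple way a meta-controller can "direct learning toward high informational gain" is to allocate effort to whichever subordinate task shows the steepest recent learning progress. The heuristic and the task names below are an illustrative assumption, not the paper's algorithm.

```python
def pick_task(progress_history, window=3):
    """Meta-control heuristic: direct effort where recent learning progress
    (improvement over the last few attempts) is highest, instead of
    exploring at random."""
    def recent_progress(scores):
        if len(scores) < 2:
            return float("inf")           # untried tasks get priority
        tail = scores[-window:]
        return tail[-1] - tail[0]         # crude slope over the window
    return max(progress_history, key=lambda t: recent_progress(progress_history[t]))

history = {
    "grasping":   [0.2, 0.5, 0.8],       # still improving fast
    "navigation": [0.9, 0.9, 0.9],       # plateaued: little left to gain
}
task = pick_task(history)
```

Allocating by progress slope rather than absolute performance naturally abandons both mastered tasks and hopeless ones, concentrating samples where they change the agent most.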

Evo/Devo: A Blueprint for Emergent Intelligence
The Evo/Devo Framework integrates principles of evolutionary computation and developmental robotics to create autonomous agents capable of complex learning. This approach moves beyond traditional reinforcement learning by combining global search via evolutionary algorithms – which optimize agent architectures and high-level behaviors – with local, plasticity-based developmental processes. These developmental strategies allow agents to refine their skills and adapt to changing environments through self-organization and experience, mirroring biological development. By evolving both the agent’s ‘genome’ – defining its potential – and the ‘developmental program’ that maps genotype to phenotype, the framework facilitates the emergence of robust and adaptable behaviors without requiring explicit programming of every detail.
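The two nested timescales can be sketched as follows, with a single number standing in for the genome and a target-tracking rule standing in for plasticity-based development; all names and constants are invented for this toy, not drawn from the paper.

```python
import random

random.seed(0)

def develop(genome, experience_steps=20, plasticity=0.1):
    """Developmental program: genotype -> phenotype, refined by 'experience'.
    The phenotype is nudged toward what the environment rewards during the
    agent's lifetime (a stand-in for local, plasticity-based learning)."""
    phenotype = genome
    target = 5.0
    for _ in range(experience_steps):
        phenotype += plasticity * (target - phenotype)
    return phenotype

def fitness(genome):
    # Fitness is measured on the *developed* phenotype, not the raw genome.
    return -abs(develop(genome) - 5.0)

def evolve(pop_size=20, generations=30, sigma=0.5):
    """Outer evolutionary search over genomes; inner development refines them."""
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]          # select the fittest half
        children = [p + random.gauss(0, sigma) for p in parents]  # mutate
        population = parents + children
    return max(population, key=fitness)

best = evolve()
```

Because selection acts on developed phenotypes, evolution only has to place the genome in the right basin; lifetime plasticity finishes the job. That division of labor is the core Evo/Devo bet.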
System A within the Evo/Devo Framework utilizes intrinsic motivation as a primary driver for agent behavior and learning. This motivation isn’t derived from external rewards, but rather from internal signals generated by the agent itself, specifically novelty and progress. The agent is programmed to seek out and explore novel states or situations, and to pursue actions that demonstrate measurable progress towards internal goals, even in the absence of predefined tasks. This self-directed exploration allows the agent to autonomously discover potentially valuable solutions and behaviors without requiring extensive external training or pre-programmed knowledge, effectively broadening the search space for optimal strategies.
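A minimal sketch of an intrinsic reward combining the two signals the paragraph names, novelty and progress; the count-based novelty bonus and the weighting scheme are common choices in the intrinsic-motivation literature, assumed here for illustration rather than quoted from the paper.

```python
from collections import defaultdict

class IntrinsicReward:
    """Self-generated reward: novelty (rarely-visited states) plus progress
    (improvement on an internal score), with no external reward at all."""
    def __init__(self, novelty_weight=1.0, progress_weight=1.0):
        self.visits = defaultdict(int)
        self.best_score = None
        self.nw = novelty_weight
        self.pw = progress_weight

    def reward(self, state, internal_score):
        self.visits[state] += 1
        novelty = 1.0 / self.visits[state]    # rarer states feel better
        if self.best_score is None:
            progress = 0.0
        else:
            progress = max(0.0, internal_score - self.best_score)
        if self.best_score is None or internal_score > self.best_score:
            self.best_score = internal_score
        return self.nw * novelty + self.pw * progress

motiv = IntrinsicReward()
r1 = motiv.reward("room_a", internal_score=0.1)   # first visit: high novelty
r2 = motiv.reward("room_a", internal_score=0.1)   # repeat visit: novelty decays
```

Because the novelty term decays with each revisit while the progress term only pays for genuine improvement, the agent is pushed outward into unexplored states without any predefined task.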
The incorporation of critical periods into autonomous agent learning frameworks capitalizes on the biological phenomenon of limited-time windows of heightened neural plasticity. During these periods, agents exhibit increased sensitivity to environmental stimuli and a greater capacity for acquiring and solidifying specific skills or behaviors. This approach contrasts with continuous, uniform learning rates, and allows for more efficient adaptation by prioritizing learning during defined developmental stages. By strategically timing the introduction of learning tasks or environmental challenges to coincide with these critical periods, the framework can significantly reduce training time and improve the robustness of learned behaviors, as the agent’s capacity for change is maximized during those specific windows.
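Mechanically, a critical period can be as simple as a learning rate that is high inside a developmental window and low outside it. The step function and the toy association rule below are illustrative assumptions; real schedules would open and close more gradually.

```python
def plasticity(age, window_start, window_end, base=0.01, peak=0.5):
    """Learning-rate schedule with a critical period: plasticity is high
    inside the developmental window and low outside it."""
    if window_start <= age <= window_end:
        return peak
    return base

def learn(signal_strength, steps, rate):
    """Toy association: connection strength after repeated exposure."""
    w = 0.0
    for _ in range(steps):
        w += rate * (signal_strength - w)
    return w

# The same stimulus, presented inside vs. outside the critical window.
in_window = learn(1.0, steps=10, rate=plasticity(12, 10, 20))
out_window = learn(1.0, steps=10, rate=plasticity(30, 10, 20))
```

The identical exposure produces a near-complete association inside the window and almost none outside it, which is exactly why timing task introduction to the window saves training time.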

The Horizon of Adaptability: Lifelong Learning and Robustness
Facing unpredictable real-world conditions requires more than pre-programmed responses; therefore, System A incorporates a mechanism for ‘Test-Time Adaptation’. This allows the agent to continuously refine its actions based on immediate feedback received during operation. Unlike traditional systems that remain fixed after training, System A actively adjusts its internal parameters as it encounters new situations, effectively learning on the fly. This adaptation isn’t random; it’s guided by a carefully designed process that prioritizes successful strategies and discards ineffective ones, leading to improved performance and resilience in dynamic environments. The ability to learn and adjust during deployment is crucial for navigating unforeseen challenges and maintaining robust functionality across a variety of conditions.
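The contrast with a frozen system can be shown in a few lines: a controller keeps nudging its parameter against the error it observes during deployment. The single-parameter setup and the drifted target are toy assumptions for illustration.

```python
class AdaptiveController:
    """Keeps adjusting a parameter from feedback *during deployment*,
    reinforcing what works instead of staying frozen after training."""
    def __init__(self, param=0.0, lr=0.3):
        self.param = param
        self.lr = lr

    def act(self):
        return self.param

    def feedback(self, error):
        # Online correction: nudge the parameter against the observed error.
        self.param -= self.lr * error

# Deployed into a drifted environment where the right setting is 2.0, not the
# trained value of 0.0 -- the controller closes the gap from live feedback.
ctrl = AdaptiveController()
target = 2.0
for _ in range(30):
    error = ctrl.act() - target           # immediate feedback at test time
    ctrl.feedback(error)
```

A frozen controller would sit at 0.0 forever in this scenario; the adaptive one converges to the new optimum without any retraining phase.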
An agent’s ability to navigate complex and ever-changing environments is significantly enhanced through the implementation of episodic memory. This system functions much like long-term recollection, enabling the agent to store specific experiences – including both successful strategies and detrimental errors – as discrete episodes. Crucially, these stored experiences aren’t merely archived; the agent can actively replay them, effectively revisiting past scenarios to reinforce positive behaviors and avoid repeating mistakes. This replay mechanism facilitates a form of offline learning, allowing the agent to refine its decision-making processes and improve performance without requiring continuous interaction with the external world. By leveraging past successes and failures, the agent exhibits a marked improvement in adaptability and resilience when confronted with novel challenges, ultimately bolstering its overall robustness.
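A bare-bones sketch of the store-and-replay loop, with invented episode fields and a trivial preference update standing in for the agent's real learning rule:

```python
import random

class EpisodicMemory:
    """Stores discrete episodes (state, action, outcome) and replays them
    offline so the agent can learn without new interaction."""
    def __init__(self, capacity=1000):
        self.episodes = []
        self.capacity = capacity

    def store(self, state, action, outcome):
        if len(self.episodes) >= self.capacity:
            self.episodes.pop(0)          # forget the oldest episode
        self.episodes.append((state, action, outcome))

    def replay(self, batch_size):
        # Revisit past successes and failures without touching the world.
        k = min(batch_size, len(self.episodes))
        return random.sample(self.episodes, k)

# Offline learning from replayed experience: prefer actions that succeeded.
memory = EpisodicMemory()
memory.store("cliff_edge", "step_forward", "fell")
memory.store("cliff_edge", "turn_back", "safe")

preferences = {}
for state, action, outcome in memory.replay(batch_size=2):
    preferences[(state, action)] = 1.0 if outcome == "safe" else -1.0
```

The key property is that the preference update runs entirely on stored episodes: one dangerous encounter at the cliff is enough, because the mistake can be relived in memory instead of in the world.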
To cultivate genuinely adaptable intelligence, the research leverages procedural generation to create a virtually limitless stream of training environments. This technique doesn’t simply increase the quantity of practice scenarios, but crucially, their diversity. By algorithmically constructing unique challenges – varying terrain, obstacle arrangements, and resource distributions – the agent is exposed to a far broader range of situations than any hand-designed curriculum could offer. This constant novelty forces the agent to develop generalized problem-solving skills, rather than memorizing solutions to specific layouts. Consequently, when confronted with genuinely new environments during deployment, the agent exhibits heightened robustness and a superior capacity to learn and thrive, demonstrating a key step towards artificial general intelligence.
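The generation side is straightforward to sketch: each seed deterministically expands into a unique layout varying exactly the dimensions the paragraph mentions (terrain, obstacles, resources). The field names and value ranges are invented for this example.

```python
import random

def generate_environment(seed):
    """Procedurally build a training environment from a seed: terrain,
    obstacle placement, and resource counts all vary per seed."""
    rng = random.Random(seed)             # seeded RNG => reproducible layouts
    return {
        "terrain": rng.choice(["plains", "swamp", "canyon", "ice"]),
        "obstacles": [(rng.randint(0, 9), rng.randint(0, 9))
                      for _ in range(rng.randint(3, 8))],
        "resources": rng.randint(1, 5),
    }

# An endless curriculum: every seed yields a fresh yet reproducible challenge.
envs = [generate_environment(seed) for seed in range(100)]
```

Seeding gives the best of both worlds: unlimited variety for training, plus exact reproducibility of any environment where the agent failed, so hard cases can be studied or revisited.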

The pursuit of autonomous learning, as detailed in this exploration of System A-B-M architecture, reveals a fundamental truth: stability is often a prelude to unforeseen complications. This mirrors Andrey Kolmogorov’s observation, “The most important thing in science is not knowing, but knowing what you don’t know.” The article posits that current AI systems falter because they lack the capacity for self-directed exploration and adaptation – a rigid adherence to pre-defined curricula. Such systems, though initially stable, are vulnerable to shifts in environment or task. The System A-B-M framework, with its emphasis on intrinsic motivation and meta-control, doesn’t aim to prevent failure, but to embrace it as a catalyst for evolutionary growth, acknowledging the inherent unpredictability of complex systems.
The Long Growth
The proposition of System A-B-M, and the broader framing of autonomous learning as an evolutionary-developmental process, feels less like a solution and more like a carefully considered relocation of the problem. It acknowledges what seasoned observers have always known: these systems do not learn; they accrete, they differentiate, they sometimes collapse under the weight of their own becoming. The architecture proposed isn’t a blueprint for intelligence, but a scaffolding for growth, and every line of code is a provisional restraint on what might otherwise emerge.
The core challenge remains, of course. Building for adaptation necessitates surrendering control – and the temptation to impose pre-defined ‘curricula’ or ‘intrinsic motivations’ will be strong. Each such imposition is a prophecy of eventual brittleness, a pre-determined point of failure. True autonomy will not be designed; it will be observed, coaxed, and occasionally mourned. The field will likely spend the next decade entangled in the paradox of how to build a system that explicitly rejects being built.
The true measure of success won’t be benchmark scores, but the character of the failures. A system that fails creatively, unexpectedly, revealing novel pathways of breakdown – that is a system truly exploring the space of possibility. The goal is not to prevent collapse, but to understand the shape of the ruins.
Original article: https://arxiv.org/pdf/2603.15381.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/