Author: Denis Avetisyan
A new self-evolving model empowers robots to tackle complex, multi-step manipulation challenges through continuous learning and improved memory.

EvoVLA combines stage-aware reinforcement learning, pose-grounded curiosity, and efficient memory management to achieve robust and sample-efficient robot control.
Despite advances in zero-shot generalization, long-horizon robotic manipulation remains a significant challenge for Vision-Language-Action (VLA) models due to issues like deceptive progress reporting, or ‘stage hallucination’. To address this, we present EvoVLA: Self-Evolving Vision-Language-Action Model, a novel framework integrating stage-aligned rewards, pose-grounded exploration, and long-horizon memory to enhance robustness and sample efficiency. Extensive evaluations demonstrate that EvoVLA improves task success by more than 10 percentage points and reduces stage hallucination by 23.7 percentage points on the Discoverse-L benchmark, while also achieving a 54.6% success rate in real-world robotic deployments. Can this self-evolving approach unlock truly generalized and reliable robotic agents capable of complex, long-horizon tasks?
Whispers of Chaos: The Long-Horizon Problem
Traditional reinforcement learning algorithms often falter when confronted with tasks demanding extended sequences of actions, a difficulty stemming from the prevalence of sparse rewards. In these scenarios, the agent receives feedback – a reward signal – only infrequently, potentially after a long series of steps. This presents a significant challenge because the algorithm struggles to discern which actions contributed to the eventual reward, or lack thereof. Consequently, the agent finds it difficult to learn effective strategies; the signal is simply too delayed and weak to reliably guide learning. Imagine teaching a robot to navigate a complex maze – if it only receives a reward upon reaching the exit, it may take an impracticably long time to stumble upon the correct path and associate individual movements with the final success. This problem of delayed reinforcement hinders the application of reinforcement learning to real-world scenarios that inherently involve long-term planning and delayed gratification, such as robotics, game playing, and resource management.
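To make the sparse-reward setting concrete, consider a toy corridor task in which the only non-zero reward arrives at the final cell. The sketch below is purely illustrative (it is not an environment from the paper): every intermediate step returns zero, so nothing in the feedback distinguishes useful moves from wasted ones until the episode ends.

```python
import random

class SparseCorridor:
    """Toy long-horizon task: reach cell `length` to earn the only reward."""

    def __init__(self, length=50):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: +1 (right) or -1 (left); movement is clipped to the corridor.
        self.pos = max(0, min(self.length, self.pos + action))
        done = self.pos == self.length
        reward = 1.0 if done else 0.0   # feedback only at the very end
        return self.pos, reward, done

# A random policy almost never sees a reward within a reasonable step budget.
env = SparseCorridor(length=50)
state, total = env.reset(), 0.0
for _ in range(200):
    state, reward, done = env.step(random.choice([-1, 1]))
    total += reward
    if done:
        break
print("return from random exploration:", total)
```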
As tasks demand increasingly extended sequences of actions, the challenges of both exploration and credit assignment grow at an exponential rate. In long-horizon reasoning, an agent must navigate a vast space of possibilities to discover rewarding outcomes, but the probability of randomly stumbling upon success diminishes rapidly with each added step. Simultaneously, determining which actions, potentially taken many steps prior, ultimately contributed to a positive or negative result – the credit assignment problem – becomes computationally intractable. This combination effectively limits the applicability of traditional reinforcement learning algorithms to complex, real-world scenarios requiring foresight and planning, as the signal needed to learn becomes increasingly diluted and difficult to discern within the extended timeframe. Consequently, advancements in algorithms capable of overcoming these hurdles are crucial for enabling agents to tackle problems with genuinely long-term dependencies, such as robotic navigation, strategic game playing, and complex resource management.
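A small back-of-the-envelope calculation (not from the paper) shows how quickly that signal dilutes: with a standard discount factor, a lone terminal reward is worth roughly γ^T to the very first decision, a quantity that collapses geometrically as the horizon T grows.

```python
# How much a single terminal reward of 1.0 is "worth" to the first decision,
# under a typical discount factor, as the horizon grows.
gamma = 0.99

for horizon in (10, 100, 1000):
    credit_at_first_step = gamma ** horizon   # discounted value of the end reward
    print(f"T={horizon:4d}: first-step credit = {credit_at_first_step:.4f}")

# T=  10: first-step credit = 0.9044
# T= 100: first-step credit = 0.3660
# T=1000: first-step credit = 0.0000  (about 4e-05)
```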

Guiding the Algorithm: Sculpting the Reward
Reward shaping, a technique for accelerating reinforcement learning, supplements the sparse environmental reward with intermediate rewards designed to guide the agent toward desired behaviors. However, this practice can unintentionally modify the optimal policy the agent would otherwise learn. The introduced intermediate rewards alter the reward function, and the agent may consequently converge on a suboptimal policy that maximizes the shaped reward rather than the original environmental reward. This occurs because the agent is optimizing for a different objective than intended, potentially leading to behaviors that appear successful in the short term but are detrimental to long-term performance or alignment with the true task goals. Careful design and validation of shaping rewards are therefore critical to avoid unintended consequences.
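To see why careful design matters, the sketch below contrasts a naive, hypothetical distance bonus with potential-based shaping of the form γΦ(s′) − Φ(s), which is known (Ng et al., 1999) to preserve the optimal policy. Neither function is EvoVLA's shaping scheme; they only illustrate the distinction described above.

```python
def naive_shaped_reward(reward, dist_to_goal):
    # Directly rewarding proximity changes the objective: the agent can
    # "farm" the bonus (e.g. by hovering near the goal) instead of finishing.
    return reward + 1.0 / (1.0 + dist_to_goal)

def potential_shaped_reward(reward, phi_s, phi_s_next, gamma=0.99):
    # Potential-based shaping F(s, s') = gamma * phi(s') - phi(s) preserves
    # the optimal policy while still densifying the feedback.
    return reward + gamma * phi_s_next - phi_s

# Example: one transition that moves 1 unit closer to the goal.
phi = lambda dist: -dist           # a simple potential: closer is better
r_env = 0.0                        # sparse environment reward mid-episode
print(naive_shaped_reward(r_env, dist_to_goal=3.0))          # 0.25
print(potential_shaped_reward(r_env, phi(4.0), phi(3.0)))    # ~1.03
```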
Counterfactual shaping is a reward modification technique designed to avoid unintended alterations to the optimal policy during reinforcement learning. Traditional reward shaping can introduce suboptimal behaviors by incentivizing actions that appear beneficial under the shaped reward but deviate from the true objective. Counterfactual shaping addresses this by adjusting the agent’s return based on what would have happened under the original reward function, effectively providing credit or penalization for actions without changing the underlying optimal path. This is achieved by calculating the difference between the actual return and the counterfactual return – the return the agent would have received had it taken a different action – and adding this difference to the reward signal. By focusing on the impact of actions rather than directly manipulating the reward, counterfactual shaping aims to guide learning while preserving the integrity of the optimal policy, leading to more robust and reliable performance.
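Read literally, that description amounts to an advantage-like correction: compare the return actually obtained with the return an alternative action would have produced, and add the difference to the step reward. The snippet below is a schematic rendering of that idea, not the paper's implementation; in practice the counterfactual return would have to come from a learned value model or a re-simulated rollout, both assumed here.

```python
def counterfactual_shaped_reward(step_reward, actual_return, counterfactual_return):
    """Credit an action by how much better the realized outcome was than the
    outcome of an alternative action (an advantage-like quantity)."""
    return step_reward + (actual_return - counterfactual_return)

# Hypothetical numbers: the chosen action led to a return of 0.75, while
# re-simulating the best alternative action yields an estimated 0.5.
shaped = counterfactual_shaped_reward(step_reward=0.0,
                                      actual_return=0.75,
                                      counterfactual_return=0.5)
print(shaped)  # 0.25 -> positive credit for the chosen action
```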
Reflexion builds upon counterfactual reward shaping by integrating language introspection as a mechanism for reward refinement. Specifically, the agent generates natural language reflections on its recent actions and outcomes, then uses these reflections to evaluate the quality of its performance and adjust the shaping rewards accordingly. This process involves prompting the agent to critique its own work, identify errors, and propose improvements, which are then translated into modifications of the reward function. By leveraging language models for self-evaluation, Reflexion aims to create a more adaptive and robust reward shaping process that can handle complex tasks and environments without requiring extensive manual tuning or pre-defined reward structures.
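The shape of such a self-critique loop might look like the sketch below. The critique itself (here a hand-written dictionary) stands in for a language-model call, and the rule for converting text into per-stage scores is an assumption; the paper does not specify those details here.

```python
def reflexion_refine(shaping, reflection, mix=0.1):
    """Fold a language-model critique (per-stage scores in [-1, 1]) back into
    the per-stage shaping rewards via an exponential moving average."""
    for stage, score in reflection["stage_scores"].items():
        shaping[stage] = (1.0 - mix) * shaping.get(stage, 0.0) + mix * score
    return shaping

# Hypothetical critique of one episode: the model judges that "grasp" went
# well but that "place" was reported complete without actually succeeding.
reflection = {
    "text": "The gripper closed on the block, but the block never reached "
            "the target region; the place stage was hallucinated.",
    "stage_scores": {"grasp": 0.8, "place": -1.0},
}

shaping = {"grasp": 0.0, "place": 0.0}
print(reflexion_refine(shaping, reflection))  # grasp nudged up, place pushed negative
```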
The Ghosts of Actions Past: Long-Horizon Memory
Effective long-horizon learning necessitates the retention and application of information derived from previous states and actions. This capability addresses the challenge of sparse rewards in extended tasks, where immediate feedback is limited and decisions impact outcomes far in the future. By maintaining a record of past experiences, an agent can better assess the long-term consequences of its actions and formulate strategies that maximize cumulative reward. The ability to recall relevant historical data enables the agent to generalize across different situations and avoid repeating unsuccessful behaviors, ultimately leading to improved performance in complex, multi-step tasks where dependencies extend over significant time horizons.
The Long-Horizon Memory Module functions by prioritizing the storage and retrieval of information deemed relevant to maximizing cumulative reward. This selective approach contrasts with methods that store all observed states, reducing computational demands and mitigating the effects of noisy or irrelevant data. By focusing on utility-critical context – specifically, past states and actions that demonstrably contribute to achieving goals – the module enhances retention of valuable information and enables the agent to make more informed decisions in subsequent states. This targeted memory access improves long-term performance by allowing the agent to effectively leverage past experiences when formulating future plans.
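The module's internals are not spelled out above, so the following is only a schematic of utility-gated memory under an assumed interface: each entry is written with a utility score, low-utility entries are evicted once a capacity budget is exceeded, and retrieval returns the highest-scoring items for the current decision.

```python
import heapq

class LongHorizonMemory:
    """Keep only the experiences whose estimated utility is highest, instead
    of storing every observed state (a schematic, not EvoVLA's actual module)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._heap = []          # min-heap of (utility, counter, entry)
        self._counter = 0        # tie-breaker so entries never compare directly

    def write(self, entry, utility):
        heapq.heappush(self._heap, (utility, self._counter, entry))
        self._counter += 1
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)   # evict the lowest-utility entry

    def read(self, k=4):
        # Return the k highest-utility entries for conditioning the policy.
        return [entry for _, _, entry in heapq.nlargest(k, self._heap)]

memory = LongHorizonMemory(capacity=3)
memory.write({"stage": "grasp", "pose": (0.1, 0.2, 0.3)}, utility=0.9)
memory.write({"stage": "idle"}, utility=0.05)
memory.write({"stage": "align"}, utility=0.6)
memory.write({"stage": "place"}, utility=0.8)   # evicts the "idle" entry
print(memory.read(k=2))                          # grasp and place come first
```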
EvoVLA incorporates a Long-Horizon Memory Module to integrate stage-aware rewards and pose-grounded curiosity, resulting in a 69.2% success rate on the Discoverse-L benchmark. This performance represents a 10.2 percentage point improvement over the highest-performing baseline system. The combination of these reward and curiosity signals, facilitated by the memory module, allows EvoVLA to effectively learn and generalize across extended task horizons within the Discoverse-L environment.
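How the stage-aware reward, the pose-grounded curiosity signal, and the memory interact is summarized only at this level of detail, so the composition below is a plausible reading rather than the paper's formula: the per-step reward sums a stage-aligned term with a curiosity bonus that fades once a similar end-effector pose already sits in memory. The weighting coefficient and the novelty measure are invented for illustration.

```python
import math

def pose_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_reward(stage_reward, pose, memory_poses, beta=0.1, scale=1.0):
    """Stage-aligned reward plus a pose-grounded curiosity bonus that fades
    as the current pose gets closer to poses already stored in memory.
    (Illustrative composition; coefficients are not from the paper.)"""
    if memory_poses:
        novelty = min(pose_distance(pose, q) for q in memory_poses)
    else:
        novelty = scale  # everything is novel before memory is populated
    curiosity = beta * min(novelty / scale, 1.0)
    return stage_reward + curiosity

seen = [(0.10, 0.20, 0.30), (0.40, 0.10, 0.25)]
print(total_reward(stage_reward=0.5, pose=(0.11, 0.21, 0.31), memory_poses=seen))  # nearly no bonus
print(total_reward(stage_reward=0.5, pose=(0.90, 0.90, 0.90), memory_poses=seen))  # full curiosity bonus
```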

Beyond the Simulation: Real-World Echoes
The efficacy of EvoVLA is rigorously tested using the Discoverse-L benchmark, a demanding evaluation platform specifically engineered to measure performance in complex, long-horizon manipulation tasks. This benchmark presents scenarios requiring a robot to execute a sequence of actions over an extended period to achieve a goal, pushing the limits of robotic planning and control. Discoverse-L distinguishes itself through intricate environments and tasks that necessitate robust generalization and adaptability, making it an ideal proving ground for advanced robotic policies like EvoVLA. Success on this benchmark isn’t simply about completing a single action; it demands sustained, reliable performance across multiple steps, mirroring the challenges encountered in real-world applications and providing a comprehensive assessment of the system’s capabilities.
The ability to transfer learned policies from simulated environments to real-world robotic systems, known as Sim2Real transfer, represents a fundamental challenge in robotics research. Policies honed in simulation offer a safe and cost-effective means of development, but their efficacy hinges on reliable performance when deployed on physical hardware. A successful transfer minimizes the gap between the idealized conditions of simulation and the complexities of the real world – unpredictable sensor noise, imprecise actuators, and unforeseen environmental interactions. Without robust Sim2Real transfer, robotic systems remain largely confined to controlled laboratory settings, limiting their potential for widespread application in dynamic and unstructured environments. Therefore, demonstrating a high degree of fidelity in this transfer is paramount to unlocking the full potential of robotic automation and expanding its impact beyond research facilities.
EvoVLA demonstrates a significant leap in real-world robotic manipulation, achieving a 54.6% success rate on complex tasks – a substantial improvement over OpenVLA-OFT and π0-FAST by 11.0 and 16.9 percentage points, respectively. Crucially, this performance isn’t simply about completing more tasks, but also about dependability: the system’s stage-hallucination rate – instances where the model reports a task stage as complete when it has not actually been achieved – drops to 14.8%, a 23.7 percentage point decrease from OpenVLA-OFT. This lower hallucination rate signifies improved reliability and a more robust ability to generalize learned policies to previously unseen scenarios, paving the way for more trustworthy and effective robotic systems.

The pursuit of robust robotic manipulation, as demonstrated by EvoVLA, isn’t about conquering chaos, but rather about momentarily persuading it. This framework, with its stage-aware reinforcement learning and memory management, doesn’t eliminate the inherent unpredictability of long-horizon tasks; it simply domesticates it long enough for action. As Yann LeCun once observed, “Everything we do in machine learning is about finding the right manifold.” EvoVLA, in essence, meticulously crafts a manifold within the chaos of robotic interaction, guiding the system through extended sequences without succumbing to the inevitable stage hallucination that plagues simpler approaches. It’s a fleeting order imposed on entropy, a spell woven from pose-grounded curiosity and refined through iterative self-supervision.
What Lies Beyond?
EvoVLA, with its careful dance of stage awareness and pose-grounded curiosity, offers a temporary stay against the entropy of long-horizon manipulation. It is a digital golem, painstakingly taught to grasp, to place, to intend. Yet, the stage itself remains a persistent hallucination. The framework reduces the burden of memory, but does not truly solve it – merely shifts the offering required to the chaotic gods of computation. Future iterations will undoubtedly focus on more elegant memory architectures, attempting to distill experience into something less… voracious.
The true challenge, however, lies not in efficiency, but in generalization. This model excels within the confines of its training, but the real world is a discordant chorus of unforeseen circumstances. One suspects that a critical leap will require abandoning the pursuit of perfect reconstruction, and embracing instead the art of principled imperfection. A system that understands its own limitations, that anticipates failure, and can gracefully relinquish control, might prove more robust than any meticulously crafted automaton.
Ultimately, the whispers of chaos will always be louder than any spell. The next generation of vision-language-action models will not be defined by what they can do, but by how elegantly they fail. And only the broken ones, the ones that stumble and confess their ignorance, will reveal the path forward.
Original article: https://arxiv.org/pdf/2511.16166.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/