Author: Denis Avetisyan
Researchers have created a challenging new testbed to evaluate how well robots can learn and retain information for complex, long-duration tasks.

RoboMME, a large-scale benchmark, assesses vision-language-action models and highlights the crucial role of tailored memory representations in robotic generalist policies.
Effective robotic manipulation in complex, long-horizon tasks requires robust memory capabilities, yet evaluation of memory-augmented vision-language-action (VLA) models remains fragmented and lacks standardized benchmarks. To address this, we introduce ‘RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies’, a large-scale benchmark comprising 16 manipulation tasks designed to systematically evaluate temporal, spatial, object, and procedural memory. Our experiments with 14 memory-augmented VLA variants reveal that the effectiveness of different memory representations is highly task-dependent, highlighting the need for tailored approaches. Can we develop cognitive architectures that dynamically adapt memory mechanisms to maximize performance across a broader spectrum of robotic challenges?
Beyond Reactive Control: Architecting Robotic Systems with Memory
Conventional robotic control architectures, such as the widely utilized SAM2Act framework, fundamentally operate on a principle of immediate reactivity. These systems process sensory information and generate motor commands in a continuous, but ultimately short-sighted, loop. While effective in static or highly predictable environments, this approach struggles with complexity because it lacks the capacity to retain or leverage past experiences. Each action is dictated solely by the current sensory input, preventing the robot from anticipating future needs, adapting to changing conditions based on prior outcomes, or developing a nuanced understanding of its surroundings – essentially, the robot continually relearns tasks instead of building upon previous knowledge. This reliance on present stimuli limits performance in dynamic scenarios demanding sustained interaction and procedural understanding, hindering the development of truly intelligent and autonomous robotic behavior.
Robotic systems constrained by immediate sensory input struggle significantly when faced with environments demanding ongoing interaction and a grasp of sequential procedures. Unlike humans, who leverage past experiences to anticipate changes and refine actions, these robots repeatedly recalculate responses to each new stimulus, hindering efficiency and robustness. This proves particularly problematic in dynamic scenarios – such as navigating cluttered spaces or collaborating with people – where conditions are constantly evolving and require a robot to not just react to the present, but to understand the context built through prior interactions. Consequently, performance degrades rapidly as task complexity increases, limiting their ability to operate autonomously in real-world settings that aren’t meticulously pre-programmed or static.
The pursuit of genuinely intelligent robotics necessitates a shift from purely reactive control systems to those incorporating robust memory capabilities. Current robots often operate on a stimulus-response basis, executing pre-programmed actions triggered by immediate sensory input; however, this approach falters when faced with the nuances of real-world complexity. By integrating memory, a robot transcends this limitation, gaining the ability to store, recall, and reason about past experiences. This allows for the development of procedural understanding – the capacity to anticipate consequences, adapt to changing circumstances, and refine actions over time. Such a system doesn’t merely react to its environment, but instead interprets it through the lens of accumulated knowledge, paving the way for truly adaptable and autonomous behavior.

Vision-Language-Action Models: A Necessary, Yet Insufficient, Step Forward
Vision-Language-Action (VLA) models establish a framework for robotic control by integrating visual perception, natural language understanding, and action execution. These models accept natural language instructions as input, process corresponding visual data from onboard sensors, and then generate appropriate control signals to manipulate the robot’s environment. This differs from traditional robotic control methods which often rely on pre-programmed behaviors or complex manual control schemes. By bridging the gap between human language and robotic action, VLA models offer the potential for more intuitive and flexible human-robot interaction and increased autonomy in dynamic environments. Current implementations typically employ deep learning architectures, including transformers, to encode language and visual inputs, and then decode these representations into executable actions.
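As a toy illustration of this pipeline, the sketch below fuses a pooled language embedding with pooled visual features and projects the result to a continuous action. All weights, the vocabulary, and the 7-DoF action space are hypothetical stand-ins for a trained transformer, not the internals of any specific VLA model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table and projection weights (hypothetical stand-ins
# for a transformer's learned parameters).
VOCAB = {"pick": 0, "up": 1, "the": 2, "red": 3, "block": 4}
EMBED = rng.normal(size=(len(VOCAB), 8))   # language token embeddings
W_VIS = rng.normal(size=(16, 8))           # visual patch projection
W_ACT = rng.normal(size=(8, 7))            # decoder to a 7-DoF action

def vla_step(instruction: str, image_patches: np.ndarray) -> np.ndarray:
    """Map a language instruction plus visual input to one robot action."""
    tokens = [EMBED[VOCAB[w]] for w in instruction.split()]
    lang = np.mean(tokens, axis=0)                # pooled language feature
    vis = np.mean(image_patches @ W_VIS, axis=0)  # pooled visual feature
    fused = lang + vis                            # crude multimodal fusion
    return fused @ W_ACT                          # continuous action vector

action = vla_step("pick up the red block", rng.normal(size=(4, 16)))
print(action.shape)  # (7,)
```

A real VLA replaces each stage with learned components: a tokenizer and transformer encoder for language, a vision backbone for images, and an autoregressive or diffusion head for action decoding.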
Current Vision-Language-Action (VLA) models, including MemoryVLA, demonstrate limitations in tasks demanding the integration of information across extended sequences. These models frequently exhibit decreased performance when required to recall events or states from distant past timesteps to inform present actions. Specifically, the capacity to maintain and effectively utilize long-term dependencies (relationships between events separated by many intervening steps) is often insufficient. This is attributable to challenges in preserving relevant historical information within the model’s memory and accurately retrieving it when needed for decision-making, resulting in errors or inefficient task completion in complex, temporally extended scenarios.
Developing robotic control models capable of sustained performance necessitates more than immediate sensory processing; a robust memory component is crucial. Current Vision-Language-Action (VLA) models frequently exhibit limitations when confronted with tasks demanding sequential reasoning or the application of previously learned information. The core difficulty resides in constructing an internal state representation that effectively encodes past observations and actions, allowing the system to retrieve and utilize this historical data to inform present behavior. This requires models to move beyond simply processing current inputs and instead build and maintain a coherent, accessible record of past experiences, enabling them to address tasks with temporal dependencies and complex, multi-step requirements.
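One minimal way to sketch such an internal state is a bounded buffer of past observations whose summary is concatenated onto the current input. This assumes mean-pooled history rather than the learned memory any particular VLA uses, and the class and method names are illustrative:

```python
from collections import deque

import numpy as np

class MemoryAugmentedPolicy:
    """Wrap a reactive policy with a bounded episodic buffer.

    `base_policy` is any callable mapping a feature vector to an action;
    the interface here is a hypothetical sketch, not the paper's API.
    """
    def __init__(self, base_policy, obs_dim: int, capacity: int = 32):
        self.base_policy = base_policy
        self.memory = deque(maxlen=capacity)   # oldest entries evicted
        self.obs_dim = obs_dim

    def act(self, obs: np.ndarray) -> np.ndarray:
        # Summarise history as a mean over stored observations; a real
        # system would use attention or a learned recurrent state.
        if self.memory:
            context = np.mean(self.memory, axis=0)
        else:
            context = np.zeros(self.obs_dim)
        action = self.base_policy(np.concatenate([obs, context]))
        self.memory.append(obs)
        return action

# Usage with a trivial base policy that reads the first two features.
policy = MemoryAugmentedPolicy(lambda x: x[:2], obs_dim=4)
a1 = policy.act(np.ones(4))      # first step: zero history context
a2 = policy.act(np.ones(4) * 2)  # second step conditions on step one
```

The design choice being illustrated is the separation of concerns: the reactive policy stays unchanged, while the wrapper decides what history reaches it and in what form.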

RoboMME: A Rigorous Benchmark for Evaluating Memory-Augmented Robotic Manipulation
RoboMME is a large-scale benchmark designed for the rigorous evaluation of robotic policies that utilize external memory. It consists of 16 distinct manipulation tasks, encompassing a range of object interactions and environmental complexities. This scale allows for a more comprehensive assessment of a policy’s ability to learn and generalize compared to benchmarks with fewer tasks. The diversity of these tasks is intended to challenge policies across different manipulation skill requirements and memory demands, providing a robust platform for comparing and analyzing the performance of various memory-augmented architectures and learning algorithms. The benchmark aims to move beyond isolated skill evaluations towards a more holistic understanding of robotic manipulation with memory.
RoboMME’s evaluation is structured around four task suites designed to isolate and assess specific memory capabilities. ImitationSuite requires policies to learn from demonstrated actions, testing procedural memory. ReferenceSuite tasks involve manipulating objects based on visual references, focusing on object memory. The CountingSuite presents scenarios demanding the tracking of object counts over time, evaluating temporal memory. Finally, the PermanenceSuite assesses spatial memory through tasks requiring agents to remember object locations and maintain arrangements even after distractions or partial observations.
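The suite structure above can be summarized as a small lookup table; the field contents below are paraphrased from the descriptions here, not the benchmark's actual metadata schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSuite:
    name: str
    memory_type: str  # memory capability the suite isolates
    probe: str        # what success requires the policy to retain

# Mapping taken from the suite descriptions in the text.
ROBOMME_SUITES = [
    TaskSuite("ImitationSuite", "procedural", "demonstrated action sequences"),
    TaskSuite("ReferenceSuite", "object", "visual reference identities"),
    TaskSuite("CountingSuite", "temporal", "object counts tracked over time"),
    TaskSuite("PermanenceSuite", "spatial", "object locations under occlusion"),
]
```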
RoboMME’s training regimen consists of 770,000 timesteps, a volume of data designed to facilitate statistically significant evaluation of robotic manipulation policies with memory components. This quantity of data enables differentiation between policies based on nuanced performance characteristics, moving beyond simple success/failure metrics. The large dataset allows researchers to assess how effectively policies utilize and maintain information across extended interaction sequences, identify failure modes specific to memory limitations, and pinpoint areas where algorithmic improvements would yield the greatest gains in task completion and robustness. Furthermore, the scale of the dataset supports the training of more complex models and the investigation of data efficiency in memory-augmented reinforcement learning.
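To see why trial volume matters for telling policies apart, consider a standard Wilson score interval on a task success rate; this is generic statistics for illustration, not a procedure taken from the benchmark itself:

```python
import math

def success_rate_ci(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# With only 100 trials, a 70% success rate is statistically
# indistinguishable from anything between roughly 60% and 78%.
lo, hi = success_rate_ci(70, 100)
print(round(lo, 3), round(hi, 3))
```

Narrowing that interval enough to separate two similarly performing policies requires far more evaluation episodes, which is the kind of resolution a large timestep budget buys.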

The Future of Intelligent Robotics: Systems Empowered by Persistent Memory
The advent of MemoryAugmentedPolicies signifies a crucial advancement in the pursuit of truly intelligent robotics. Evaluated through the rigorous RoboMME benchmark, these policies move beyond the limitations of conventional approaches by integrating a robust long-term memory component. This allows robotic systems to not merely react to immediate stimuli, but to actively recall and utilize past experiences when navigating complex tasks and dynamic environments. Unlike traditional methods which struggle with temporal dependencies, MemoryAugmentedPolicies demonstrate an enhanced capacity for reasoning and adaptation, effectively bridging the gap between rote execution and genuine understanding – a capability essential for robots operating with greater autonomy and reliability in real-world scenarios.
Recent advancements in robotic intelligence have yielded policies capable of tackling complex tasks with a level of sophistication previously unattainable. These MemoryAugmentedPolicies don’t simply react to immediate stimuli; they actively integrate past experiences to inform present actions, exhibiting a form of reasoning crucial for navigating long-term dependencies. Evaluations using the RoboMME benchmark reveal performance levels approaching human success rates – approximately 70 to 80 percent – in scenarios demanding sustained interaction and adaptation. This represents a substantial leap beyond conventional robotic control methods, suggesting a future where robots can reliably operate in unstructured environments and effectively manage tasks requiring foresight and contextual awareness.
Traditional robotic policies, such as DiffusionPolicy, often struggle with tasks demanding recollection of past states or planning over extended horizons due to inherent limitations in their ability to retain and utilize long-term information. This creates difficulties in dynamic environments where conditions change and robots must adapt strategies based on previous experiences. However, recent advancements are circumventing these challenges, paving the way for genuinely autonomous operation. By integrating mechanisms for persistent memory, robotic systems can now effectively store, retrieve, and reason about past events, enabling them to navigate complex scenarios, learn from mistakes, and proactively adjust behavior, capabilities previously unattainable. This breakthrough promises to unlock applications ranging from prolonged search-and-rescue missions to flexible manufacturing processes and truly independent exploration of unstructured spaces.

The pursuit of robust robotic systems, as detailed in this work concerning RoboMME, necessitates a holistic understanding of interconnected components. If a system survives on duct tape, it’s likely overengineered – a sentiment echoed by Bertrand Russell, who observed, “The point of the system is to make things easier, not to prove that you can build a complicated system.” This benchmark highlights that modularity, while conceptually appealing, is an illusion of control without careful consideration of how memory representations influence performance across long-horizon tasks. The emphasis on tailored memory isn’t merely about storing information; it’s about structuring the very foundation upon which the robotic agent operates, ensuring coherence and adaptability.
Where Do We Go From Here?
The RoboMME benchmark, by forcing a confrontation with long-horizon, history-dependent tasks, predictably reveals that current vision-language-action architectures are, shall we say, optimistic in their assumptions about the past. A system that cannot meaningfully compress and retrieve relevant experience is not so much ‘intelligent’ as it is briefly operational. The emphasis on how memory is represented, and the demonstrated task dependence of the optimal representation, suggests a coming era of specialized, rather than general, memory modules. One suspects the pursuit of a universal memory is a charming, if ultimately futile, exercise.
A crucial, and largely unaddressed, challenge lies in the inherent trade-off between memory capacity, retrieval speed, and the ability to discern signal from noise. If the system looks clever, it’s probably fragile, built on brittle correlations that will fail spectacularly when faced with even minor perturbations. True robustness will require an architecture that embraces uncertainty, actively seeking out and incorporating evidence that disconfirms its current understanding.
Architecture, after all, is the art of choosing what to sacrifice. Perfect recall is a fantasy; the real problem is deciding what to forget, and, more importantly, when. Future work must move beyond simply augmenting models with memory and begin to explore the meta-cognitive mechanisms that govern its use. The long road to robotic generalisation is paved not with bigger models, but with more principled compromises.
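The idea of principled forgetting can be sketched as a fixed-capacity store that evicts its least-relevant entry on overflow. Relevance here is caller-supplied; the meta-cognitive mechanisms the text calls for would have to learn it:

```python
import heapq

class RelevanceMemory:
    """Fixed-capacity store that forgets the least-relevant entry first.

    A toy illustration of 'choosing what to sacrifice', not any
    published memory architecture.
    """
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []      # min-heap of (relevance, insertion_order, item)
        self._counter = 0

    def store(self, item, relevance: float):
        entry = (relevance, self._counter, item)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif relevance > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict least relevant

    def recall(self):
        # Return stored items, most relevant first.
        return [item for _, _, item in sorted(self._heap, reverse=True)]

# With capacity 3, the lowest-scored memory is sacrificed.
mem = RelevanceMemory(capacity=3)
for item, score in [("grasp pose", 0.9), ("wall colour", 0.1),
                    ("drawer state", 0.5), ("goal object", 0.7)]:
    mem.store(item, score)
print(mem.recall())  # ['grasp pose', 'goal object', 'drawer state']
```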
Original article: https://arxiv.org/pdf/2603.04639.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-07 02:36