Author: Denis Avetisyan
Researchers have developed a new framework that equips robots with a more human-like understanding of visual scenes, allowing for more adaptable and reliable manipulation skills.
This work introduces STORM, a multi-phase learning system that leverages frozen visual foundation models to create task-aware, object-centric representations for improved robotic manipulation and generalization.
While visual foundation models offer promising perceptual capabilities for robotics, their dense representations lack the explicit object-level structure necessary for robust and adaptable manipulation. To address this, we introduce ‘STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation’, a lightweight module that adapts frozen foundation models by augmenting them with semantic-aware slots. This approach utilizes a multi-phase training strategy to create task-aware object-centric representations, improving generalization and control performance in simulated manipulation tasks. Does this multi-phase adaptation represent a broadly applicable pathway for efficiently tailoring generic foundation models to specific robotic control challenges?
The Illusion of Pixels: Why Robots See Only Chaos
Conventional robotic vision systems typically process images as complete, undivided arrays of pixel data, a method that presents significant challenges when operating in cluttered environments. This ‘monolithic’ approach struggles to differentiate individual objects, meaning a robot perceives a complex scene as a single visual element rather than a collection of manipulable parts. Consequently, even minor changes – such as an object being partially hidden, or viewed from a different angle – can drastically alter the robot’s perception, leading to failed grasps or collisions. The system’s inability to isolate and track specific objects severely limits its adaptability and reliability, hindering performance in dynamic, real-world scenarios where objects frequently move, overlap, and change appearance.
Current robotic vision systems, reliant on processing images as complete pictures, often falter when faced with the complexities of the real world. A significant challenge arises from occlusion – when parts of an object are hidden from view – and changes in viewpoint, where the same object appears drastically different from another angle. These limitations impede a robot’s ability to reliably identify and interact with objects in dynamic environments. Furthermore, these systems struggle to generalize; a robot trained to grasp an object in one configuration may fail entirely when presented with a slightly altered arrangement. This inability to adapt to novel object configurations represents a core bottleneck in deploying robots beyond highly structured, controlled settings and into the unpredictable nature of everyday life.
The development of truly intelligent robotic systems hinges on a fundamental shift in how machines perceive the world. Current methods often treat visual input as a continuous stream of pixels, a process that proves brittle when faced with the complexities of real-world environments. Instead, a more effective approach involves recognizing scenes as composed of distinct, individual objects. By segmenting a visual field into these discrete entities, a robot can move beyond simply seeing an image to understanding what it contains. This object-centric perception allows for more robust manipulation, as the system can focus on interacting with specific items rather than attempting to process the entirety of the visual input. It also facilitates generalization to novel situations; recognizing a chair as a chair, regardless of its orientation, lighting, or surrounding clutter, is crucial for adaptability and reliable performance in dynamic environments.
Building truly adaptable robotic systems therefore means moving beyond pixel-based perception to representations that explicitly model individual objects and their connections within a scene. Rather than processing a visual field as a unified whole, such systems must deconstruct the image into discrete elements – identifying each object’s boundaries, properties, and spatial relationships with others. This disentangled representation allows for targeted manipulation; a robot can then interact with specific objects, even amidst clutter or partial occlusion, without being misled by background noise or irrelevant visual features. Such an approach fosters greater robustness to changes in viewpoint and lighting, and crucially, enables generalization to novel object arrangements – a prerequisite for deploying robots in the dynamic and unpredictable environments of the real world.
STORM: Constructing Order from the Visual Flood
STORM utilizes a slot-based scene representation wherein the environment is decomposed into a discrete set of object ‘slots’. These slots function as containers for identified objects, providing a structured format for perception. This decomposition allows the system to focus manipulation efforts on specific, relevant objects within the scene, rather than processing the entire visual input. Each slot encapsulates information about an object’s presence, pose, and semantic category, enabling targeted interaction and reducing computational complexity associated with grasping and manipulation tasks. The slot representation facilitates efficient reasoning about object relationships and affordances within the robot’s workspace.
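To make the slot abstraction concrete, the sketch below shows a minimal Slot-Attention-style module in which a fixed number of learned slots compete for a grid of patch features. This is an illustrative sketch rather than STORM’s actual implementation; the module name, slot count, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class SlotDecomposition(nn.Module):
    """Minimal Slot-Attention-style grouping of patch features into object slots (illustrative)."""
    def __init__(self, num_slots=7, slot_dim=64, feat_dim=384, iters=3):
        super().__init__()
        self.num_slots, self.iters = num_slots, iters
        self.scale = slot_dim ** -0.5
        # Learned Gaussian initialisation for the slots.
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, slot_dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, num_slots, slot_dim))
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(feat_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(feat_dim, slot_dim, bias=False)
        self.gru = nn.GRUCell(slot_dim, slot_dim)
        self.norm_feats = nn.LayerNorm(feat_dim)
        self.norm_slots = nn.LayerNorm(slot_dim)

    def forward(self, feats):                          # feats: (B, N_patches, feat_dim)
        B = feats.shape[0]
        feats = self.norm_feats(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(
            B, self.num_slots, self.slots_mu.shape[-1], device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
            attn = attn.softmax(dim=1) + 1e-8          # slots compete for each patch
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, slots.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).view(B, self.num_slots, -1)
        return slots                                   # (B, num_slots, slot_dim)
```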
STORM utilizes DINOv2, a self-supervised visual transformer, as its primary visual backbone for feature extraction from image data. This choice provides robust and generalizable visual representations without requiring labeled training data. Complementing DINOv2, the Contrastive Language-Image Pre-training (CLIP) model is integrated to provide semantic understanding of the visual features. CLIP maps images and text into a shared embedding space, allowing STORM to associate visual observations with semantic concepts and improve the accuracy of object recognition and scene understanding. The combination of DINOv2 and CLIP enables the system to extract meaningful features relevant to task execution, providing a strong foundation for subsequent perception and manipulation stages.
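As a rough illustration of how the two frozen backbones might be queried, the snippet below pulls dense patch features from a DINOv2 checkpoint via torch.hub and semantic embeddings from CLIP via the open_clip package. The specific checkpoints, preprocessing, and task prompt are assumptions and may differ from the ones used in the paper.

```python
import torch
import open_clip
from PIL import Image
from torchvision import transforms

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Frozen DINOv2 backbone for dense patch features (assumed ViT-S/14 checkpoint).
dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').to(device).eval()

# Frozen CLIP model for semantic image/text embeddings (assumed ViT-B/32 weights).
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Standard ImageNet normalisation, with a resolution divisible by the 14-pixel patch size.
dino_preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open('scene.png').convert('RGB')   # hypothetical observation

with torch.no_grad():
    # Dense per-patch features used downstream for slot grouping.
    patch_tokens = dino.forward_features(
        dino_preprocess(image).unsqueeze(0).to(device))['x_norm_patchtokens']
    # Semantic embeddings linking the observation to language concepts.
    img_emb = clip_model.encode_image(clip_preprocess(image).unsqueeze(0).to(device))
    txt_emb = clip_model.encode_text(tokenizer(['pick up the red block']).to(device))

print(patch_tokens.shape, img_emb.shape, txt_emb.shape)
```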
The Task-Aware Representation within STORM functions by conditioning the perceptual output on the specific task the robot is intended to perform. This is achieved through the incorporation of task embeddings – vector representations of the desired goal – into the feature extraction and slot assignment processes. By explicitly considering the task context, the system prioritizes the perception of objects and features relevant to successful task completion, effectively filtering out irrelevant information. This targeted perception reduces computational load and improves the efficiency of downstream robotic control and manipulation, as the system focuses resources on the most pertinent aspects of the environment for the given objective.
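One plausible way to realize such conditioning is FiLM-style modulation, in which the task embedding produces a scale and shift applied to every slot. The module below is a hedged sketch of that idea; the paper’s exact conditioning mechanism may differ.

```python
import torch
import torch.nn as nn

class TaskConditionedSlots(nn.Module):
    """FiLM-style modulation of object slots by a task embedding (illustrative)."""
    def __init__(self, slot_dim=64, task_dim=32):
        super().__init__()
        # Task embedding -> per-dimension scale and shift shared across slots.
        self.film = nn.Sequential(nn.Linear(task_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2 * slot_dim))

    def forward(self, slots, task_emb):
        # slots: (B, K, slot_dim), task_emb: (B, task_dim)
        gamma, beta = self.film(task_emb).chunk(2, dim=-1)     # (B, slot_dim) each
        # Broadcast the task signal over all K slots, emphasising task-relevant features.
        return slots * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```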
STORM is designed with a modular architecture to facilitate the integration of diverse perception and manipulation components. This modularity is achieved through clearly defined interfaces between modules responsible for visual feature extraction, semantic understanding, task representation, and object manipulation. The framework supports the substitution of individual modules – for example, swapping DINOv2 for an alternative visual backbone or integrating a different task planner – without requiring substantial code modification. This flexibility enables adaptation to a wide range of robotic applications, including bin picking, assembly, and in-home manipulation, and allows for easy experimentation with different perception and control strategies. Furthermore, the modular design simplifies the process of porting STORM to new robotic platforms and sensors.
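The kind of modularity described above can be captured with narrow interfaces; the protocol names below are hypothetical and simply illustrate how a backbone, scene representation, or policy could be swapped without touching the rest of the stack.

```python
from typing import Protocol
import torch

class VisualBackbone(Protocol):
    def extract(self, image: torch.Tensor) -> torch.Tensor:
        """Return dense patch features for one RGB observation."""
        ...

class SceneRepresentation(Protocol):
    def to_slots(self, feats: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        """Group features into task-aware object slots."""
        ...

class Policy(Protocol):
    def act(self, slots: torch.Tensor) -> torch.Tensor:
        """Map slots to a robot action (or an action distribution)."""
        ...

def control_step(backbone: VisualBackbone, scene: SceneRepresentation, policy: Policy,
                 image: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
    # Any module can be replaced as long as it honours the interface above.
    return policy.act(scene.to_slots(backbone.extract(image), task_emb))
```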
Learning to See What Matters: A Multi-Phase Approach
Multi-Phase Learning, as implemented in STORM, is a training strategy designed to incrementally develop an object-centric representation of the environment. This process avoids simultaneous learning of object representation and task-specific policies by decoupling the training into distinct phases. Initially, the framework focuses on establishing stable object “slots” – identifying and isolating individual objects within the scene – without regard for their functional role. Subsequent phases then refine these slots, associating them with semantic meanings and aligning them with the requirements of downstream manipulation tasks. This progressive refinement allows for a more robust and generalizable object representation compared to end-to-end training approaches, as the system first learns what objects are present before learning how to interact with them.
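A minimal sketch of such a phase schedule is shown below, assuming a reconstruction-style objective during slot stabilization and additional task-aware objectives afterwards. The loss callables, phase lengths, and batch keys are placeholders rather than the paper’s exact recipe.

```python
import torch

def train_multi_phase(slot_module, policy, loader,
                      recon_loss, align_loss, bc_loss,
                      phase1_epochs=20, phase2_epochs=30, lr=3e-4):
    """Phase 1: stabilise object slots; Phase 2: align them with the manipulation task."""
    opt = torch.optim.Adam(list(slot_module.parameters()) + list(policy.parameters()), lr=lr)

    for epoch in range(phase1_epochs + phase2_epochs):
        task_phase = epoch >= phase1_epochs            # switch objectives after phase 1
        for batch in loader:
            slots = slot_module(batch['feats'])

            # Phase 1 objective: keep slots consistent and object-centric.
            loss = recon_loss(slots, batch['feats'])

            if task_phase:
                # Phase 2 objectives: tie slots to semantics and to expert actions.
                loss = loss + align_loss(slots, batch['task'])
                loss = loss + bc_loss(policy, slots, batch['action'])

            opt.zero_grad()
            loss.backward()
            opt.step()
```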
Semantic Alignment within the STORM framework functions by establishing a correspondence between raw visual features extracted from images and higher-level semantic concepts representing object identity and attributes. This process utilizes learned embeddings to project visual data into a semantic space, enabling the model to recognize objects irrespective of variations in viewpoint, lighting, or occlusion. By explicitly linking visual perception with semantic understanding, the model enhances its ability to generalize to novel scenes and unseen object configurations, improving performance on downstream manipulation tasks by facilitating more robust object recognition and reasoning.
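As one concrete (and assumed) instantiation of this idea, the loss below projects slot features into the embedding space of frozen CLIP text vectors and applies a cross-entropy objective over object categories; it presumes per-slot category targets, which the actual framework may not require.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(slots, text_embs, slot_labels, projector, temperature=0.07):
    """Encourage each slot to match its (assumed) semantic category embedding.

    slots:       (B, K, slot_dim)  object slots
    text_embs:   (C, text_dim)     frozen CLIP embeddings of C object categories
    slot_labels: (B, K)            integer category target per slot (assumed available)
    projector:   nn.Module mapping slot_dim -> text_dim (the only trained part here)
    """
    z = F.normalize(projector(slots), dim=-1)              # (B, K, text_dim)
    t = F.normalize(text_embs, dim=-1)                     # (C, text_dim)
    logits = torch.einsum('bkd,cd->bkc', z, t) / temperature
    return F.cross_entropy(logits.flatten(0, 1), slot_labels.flatten())
```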
STORM’s performance gains are achieved through a two-stage training process focusing on object representation. Initially, the framework prioritizes stabilizing the formation of ‘slots’, which represent individual objects within a scene, ensuring consistent identification and segmentation. Following slot stabilization, the training aligns these established object slots with the specific objectives of downstream manipulation tasks. This sequential approach – first defining objects, then relating them to actions – enables the model to learn a more robust and transferable object-centric representation, leading to improved generalization and task completion rates compared to methods that attempt simultaneous learning of both representation and control.
Imitation Learning forms a core component of the training process, leveraging expert demonstrations to rapidly establish a functional policy. This approach bypasses the need for extensive random exploration by directly learning from provided examples of successful task completion. The policy is trained to mimic the actions taken by the expert in given states, effectively transferring knowledge and accelerating convergence. This is achieved through supervised learning techniques, minimizing the difference between the agent’s actions and the expert’s actions, and subsequently reducing the time required to achieve proficient performance on the defined manipulation tasks.
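A minimal behavior-cloning step of the kind described might look as follows, assuming continuous expert actions and a deterministic policy head; the probabilistic GMM head discussed later would replace the MSE term with a negative log-likelihood.

```python
import torch
import torch.nn.functional as F

def behaviour_cloning_step(policy, optimizer, obs_repr, expert_actions):
    """One supervised imitation step: make the policy mimic the demonstrated action."""
    pred = policy(obs_repr)                         # (B, action_dim) predicted action
    loss = F.mse_loss(pred, expert_actions)         # match the expert action exactly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```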
A Glimpse Beyond: Benchmarking and the Path Forward
Rigorous evaluation across diverse benchmarks confirms STORM’s leading capabilities in robotic manipulation. Performance was systematically assessed using established datasets including MetaWorld, a challenging suite of simulated robotic tasks; LIBERO, designed to test generalization with visual distractors; and standard computer vision benchmarks like PASCAL VOC and COCO. These evaluations demonstrate that STORM not only achieves state-of-the-art results on existing tasks but also exhibits a significant capacity to maintain performance even when confronted with novel and disruptive visual elements, positioning it as a robust solution for real-world robotic applications. The consistent success across these varied benchmarks underscores the effectiveness of STORM’s underlying principles and its potential for broad applicability.
The architecture of STORM incorporates a Transformer Decoder within its policy network to facilitate flexible action prediction, allowing the system to effectively map observations to a distribution over possible actions. This decoder processes the learned task and object representations, enabling the robot to anticipate and execute complex manipulation sequences. Crucially, a Gaussian Mixture Model (GMM) is employed to model the probabilistic nature of robotic actions; instead of predicting a single, deterministic action, the GMM outputs a probability distribution over potential movements. This approach is vital for handling uncertainty in the environment and allows the robot to explore diverse action possibilities, improving robustness and adaptability during task execution. The combination of the Transformer Decoder’s sequence modeling capabilities and the GMM’s probabilistic outputs contributes significantly to STORM’s improved performance in challenging robotic manipulation scenarios.
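A hedged sketch of such a GMM action head, built from standard torch.distributions components, is shown below; the number of mixture components, action dimensionality, and decoder interface are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GMMActionHead(nn.Module):
    """Predict a mixture-of-Gaussians distribution over continuous robot actions."""
    def __init__(self, in_dim=256, action_dim=7, num_modes=5):
        super().__init__()
        self.num_modes, self.action_dim = num_modes, action_dim
        self.logits = nn.Linear(in_dim, num_modes)                  # mixture weights
        self.means = nn.Linear(in_dim, num_modes * action_dim)      # component means
        self.log_stds = nn.Linear(in_dim, num_modes * action_dim)   # component scales

    def forward(self, h):                                           # h: (B, in_dim) decoder output
        B = h.shape[0]
        mix = D.Categorical(logits=self.logits(h))
        means = self.means(h).view(B, self.num_modes, self.action_dim)
        stds = self.log_stds(h).view(B, self.num_modes, self.action_dim).clamp(-5, 2).exp()
        comp = D.Independent(D.Normal(means, stds), 1)              # diagonal Gaussians per mode
        return D.MixtureSameFamily(mix, comp)

# Training minimises the negative log-likelihood of the expert action,
# while execution samples from the predicted mixture:
#   dist = head(decoder_output)
#   loss = -dist.log_prob(expert_action).mean()
#   action = dist.sample()
```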
Evaluations on the LIBERO benchmark, specifically when incorporating novel visual distractors, demonstrate the robust performance of STORM. The system achieved an impressive 89.6% success rate in completing tasks under these challenging conditions, signifying a substantial advancement in robotic manipulation. This figure represents a notable 19.3% improvement over existing baseline methods, highlighting STORM’s ability to maintain task performance even amidst increased visual complexity. The results suggest that STORM’s object-centric approach effectively filters out irrelevant visual information, allowing the robot to focus on the essential elements for successful task completion and showcasing a significant leap towards more reliable robotic systems in real-world environments.
Evaluations within the in-distribution MetaWorld environments demonstrate STORM’s robust performance, achieving a 74.8% success rate in completing assigned manipulation tasks. Critically, the system’s adaptability is highlighted by a significant 12.7% improvement in success rate when faced with novel visual distractors, measured against a strong frozen DINOv2 baseline. This substantial gain indicates STORM’s ability to effectively filter irrelevant visual information and maintain task focus, even as the environment becomes more complex and challenging – a crucial step toward real-world robotic applications where unpredictable conditions are the norm.
The demonstrated performance of STORM underscores the benefits of focusing on individual objects and their relation to the task at hand within robotic manipulation. By constructing a task-aware representation, the system moves beyond treating the environment as a monolithic visual input, instead discerning and acting upon specific objects crucial to achieving a desired outcome. This object-centric approach allows STORM to generalize more effectively to new situations, including those with visual distractions, as it can isolate relevant objects and maintain task focus. The resulting improvements in success rates, evidenced by gains on benchmarks like LIBERO and MetaWorld, suggest that prioritizing object understanding and task-specific reasoning is a promising pathway toward more robust and adaptable robotic systems capable of operating in complex, real-world environments.
Continued development of STORM aims to transcend current limitations by addressing more intricate, real-world scenarios demanding higher-level cognitive abilities. Researchers intend to integrate robust reasoning and planning modules, allowing the system to not simply react to visual input but to proactively strategize and anticipate future states. Crucially, future iterations will prioritize lifelong learning capabilities, enabling STORM to continuously refine its skills and adapt to novel environments without catastrophic forgetting – effectively building upon past experiences to improve performance over time. This ongoing research seeks to move beyond task-specific performance and establish a foundation for truly adaptable and intelligent robotic systems capable of operating autonomously in dynamic and unpredictable settings.
The pursuit of robust robotic manipulation, as demonstrated by STORM, echoes a fundamental principle of complex systems. The framework doesn’t build a representation so much as cultivate one, adapting existing visual foundation models through multi-phase learning. This mirrors the growth of a garden – initial conditions are shaped, but the emergent behavior – the ability to generalize to new visual scenarios – arises from the interaction of components. Ada Lovelace observed that “The Analytical Engine has no pretensions whatever to originate anything.” Similarly, STORM doesn’t seek to create intelligence from scratch, but rather to coax forth latent capabilities within pre-trained models, fostering a system where resilience lies not in isolation, but in forgiveness between components. The architecture anticipates adaptation, recognizing that a truly robust system isn’t one that avoids failure, but one that gracefully accommodates it.
What’s Next?
STORM, in its attempt to graft task awareness onto the frozen architectures of visual foundation models, highlights a fundamental truth: representation is not discovery, but postponement. The system doesn’t understand objects; it merely delays the inevitable encounter with novelty. Each learned ‘slot’ is a carefully constructed boundary against the chaos of an unmodeled world, a temporary reprieve. The performance gains demonstrated are not a triumph over complexity, but an exercise in managing its arrival.
The claim of generalization, while promising, warrants scrutiny. New visual scenarios are not simply variations on a theme; they are breaches in the carefully constructed walls of the representation. The system will inevitably encounter configurations that expose the limitations of its object-centric priors. There are no best practices – only survivors: those architectures that degrade most gracefully under unanticipated stress. The multi-phase learning approach, while effective, merely refines the method of postponement, not the underlying problem.
Future work will not be about achieving perfect representation, but about designing systems that are exquisitely sensitive to their own failures. The true challenge lies not in building models that appear to understand, but in creating architectures that can diagnose, and adapt to, the inevitable moment when order collapses – when the cache between outages is finally exhausted.
Original article: https://arxiv.org/pdf/2601.20381.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/