Author: Denis Avetisyan
This research introduces a method for robots to efficiently learn and represent how objects relate to each other, improving their ability to manipulate the world.

Researchers present SlotVLA, a framework that uses compact object- and relation-centric slots to achieve efficient visuomotor control for robotic manipulation with reduced token usage, demonstrated on the LIBERO+ dataset.
Existing robotic manipulation models often rely on dense visual embeddings that conflate object identity with background clutter, hindering both efficiency and interpretability. The paper ‘SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation’ addresses this limitation with compact, object- and relation-centric representations: SlotVLA sharply reduces the number of visual tokens required for visuomotor control while maintaining competitive generalization performance, and the authors release the LIBERO+ dataset to support this line of research. Could this approach pave the way for more robust and explainable robotic systems capable of complex manipulation tasks?

Beyond Pixels: An Object-Centric View
Traditional Vision-Language-Action (VLA) models struggle with complex scenes because they rely on pixel-level processing, which obscures object interactions and relationships. The result is weak generalization and inaccurate action prediction. Because these models never explicitly represent individual objects, they lack the compositional understanding that accurate VLA tasks demand. Shifting toward object-centric learning, which represents scenes as collections of discrete objects, offers a promising alternative: better performance, interpretability, and robustness. Just as understanding a system requires understanding each component’s role, object-centric learning provides a clearer, more functional picture of a scene.

SlotVLA: Disentangling Scenes for Robust Reasoning
SlotVLA introduces a framework for extracting object-centric representations with slot attention, decomposing a scene into discrete slots that each represent a distinct object. This moves beyond pixel-level processing to a structured, interpretable representation. A relation encoder then captures how objects interact, supplying contextual understanding, while task-aware filtering prioritizes the objects relevant to the current instruction, focusing computational resources where they matter. Combining these elements, SlotVLA achieves robust scene understanding while reducing token count by an order of magnitude and maintaining or improving performance on visual reasoning tasks.
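
For readers curious about the mechanism, below is a minimal sketch of slot attention in the style of Locatello et al. (2020), the technique named above: a fixed set of slots iteratively competes for input tokens via attention normalized across slots. The module name, dimensions, and slot count are illustrative assumptions, not SlotVLA’s actual implementation.

```python
# Minimal slot-attention sketch (after Locatello et al., 2020).
# Illustrative only: sizes and names are assumptions, not SlotVLA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Learned Gaussian from which initial slots are sampled.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):  # inputs: (batch, num_tokens, dim)
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis, so slots compete for each token.
            attn = F.softmax(torch.einsum('bid,bjd->bij', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)
            updates = torch.einsum('bij,bjd->bid', attn, v)   # weighted mean
            slots = self.gru(updates.reshape(-1, d),
                             slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (batch, num_slots, dim): one compact token per object

# Usage: compress 256 patch tokens into 7 object slots.
features = torch.randn(2, 256, 64)   # e.g. frozen ViT patch features
slots = SlotAttention()(features)    # -> (2, 7, 64)
```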

LIBERO+: Validating Object-Relation Reasoning
The LIBERO+ benchmark poses a fine-grained challenge for object-relation reasoning in robotic manipulation, demanding a precise understanding of how objects interact. SlotVLA demonstrates superior performance on LIBERO+, validating its ability to model these interactions. With Object-Relation-Centric Slots (ORC), the framework reaches success rates of 0.86 on L-Goal and 0.91 on L-Object, aided by an action decoder and LoRA fine-tuning. With Object-Centric Slots (OC) alone, SlotVLA achieves 0.77 on L-Goal and 0.90 on L-Object, confirming the efficacy of its object-centric approach.

Efficient VLA: Compression and Consistency are Key
To tame the computational demands of large vision-language models operating on video, token-compression techniques such as PruMerge and TokenPacker remove redundant visual tokens, lowering memory footprint and communication costs. Temporal consistency is equally important: the Action Decoder must track objects and their relationships accurately across frames. Integrating state-of-the-art vision encoders such as DINOv2 and SigLIP further improves feature extraction, enabling a more accurate understanding of video content. Good architecture is unseen until it fails; only then is the true cost of design decisions apparent.
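
To make the compression idea concrete, here is a simplified, PruMerge-inspired sketch: keep the visual tokens that receive the most [CLS] attention, and fold the remainder into their nearest surviving token. The fixed keep ratio and the cosine-similarity merge rule are assumptions for illustration; the published method uses an adaptive selection criterion.

```python
# PruMerge-inspired token reduction (simplified sketch, not the exact
# published algorithm): keep the patch tokens that receive the most
# [CLS] attention, then merge each discarded token into its most
# similar survivor so the pruned information is not simply lost.
import torch
import torch.nn.functional as F

def prune_and_merge(tokens, cls_attn, keep_ratio=0.25):
    """tokens: (n, d) patch features; cls_attn: (n,) [CLS]->patch attention."""
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    if k >= n:
        return tokens
    keep = cls_attn.topk(k).indices            # indices of informative tokens
    mask = torch.ones(n, dtype=torch.bool)
    mask[keep] = False
    kept, dropped = tokens[keep], tokens[mask]
    # Route each dropped token to its nearest kept token by cosine similarity.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    owner = sim.argmax(dim=-1)                 # (n - k,)
    merged = kept.clone()
    for j in range(k):
        members = dropped[owner == j]
        if len(members):                       # average owner with its members
            merged[j] = (kept[j] + members.sum(0)) / (1 + len(members))
    return merged                              # (k, d) compressed token set

# Usage: compress 576 visual tokens to 144 before they reach the LLM.
feats, attn = torch.randn(576, 1024), torch.rand(576)
compact = prune_and_merge(feats, attn)         # -> (144, 1024)
```
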
Scaling and Generalization: The Path Forward
Future research will focus on scaling SlotVLA to more complex environments and tasks, handling increased object numbers, intricate relationships, and dynamic scenes. Investigating Q-Former for enhanced visual representation learning is a key area, alongside advancements in architectures like π0, HPTs, and ECoT. The goal is to create VLA systems capable of seamlessly integrating visual cues, language instructions, and real-world actions to solve robotic challenges, requiring not only component improvements but also a holistic understanding of system interaction.
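
As a point of reference for that direction, the sketch below illustrates the core Q-Former idea from BLIP-2: a small set of learned query vectors cross-attends to frozen image features and emits a fixed-size visual summary. All names and dimensions here are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the Q-Former idea (BLIP-2): learned queries
# cross-attend to frozen image features, producing a compact visual
# summary of fixed size. Illustrative dimensions; not SlotVLA code.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, heads=8, layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "cross": nn.MultiheadAttention(dim, heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
                "norm2": nn.LayerNorm(dim),
            }) for _ in range(layers)
        ])

    def forward(self, image_feats):  # (batch, num_patches, dim), frozen
        q = self.queries.expand(image_feats.size(0), -1, -1)
        for blk in self.blocks:
            attn_out, _ = blk["cross"](q, image_feats, image_feats)
            q = blk["norm1"](q + attn_out)     # queries read from the image
            q = blk["norm2"](q + blk["ffn"](q))
        return q  # (batch, num_queries, dim): fixed-size tokens for the LLM

# Usage: 196 ViT patch tokens -> 32 query tokens.
out = TinyQFormer()(torch.randn(2, 196, 768))  # -> (2, 32, 768)
```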

SlotVLA’s architecture mirrors a fundamental principle of systemic design: understanding the whole through its constituent parts. The framework efficiently models object-relation representations, reducing token usage while preserving performance—a testament to elegant design arising from simplicity. As Alan Turing observed, “Sometimes it is the people who no one imagines anything of who do the things that no one can imagine.” This sentiment resonates with SlotVLA’s innovative approach to robotic manipulation; by focusing on compact, interpretable slots, the system achieves complex control with surprising efficiency, challenging conventional wisdom in the field and demonstrating that impactful solutions often emerge from unexpected angles. The system’s ability to distill complex scenes into meaningful object-centric representations highlights the power of structural understanding.

What’s Next?
The pursuit of object-centric representations in robotic manipulation often feels like assembling a clock from its constituent parts, then being surprised when it doesn’t immediately tell time. SlotVLA offers a compelling reduction in complexity, which is a welcome sign. However, the system’s efficacy remains tethered to the LIBERO+ dataset; a broader test against more ambiguous or actively deceptive environments is crucial. If the system survives only through dataset-specific tuning, it is probably overengineered. The elegance of slot attention lies in its potential for generalization, but generalization demands a rigorous accounting of the inevitable noise of the real world.
Future work must move beyond simply identifying objects and relations. A truly intelligent system will understand the affordances embedded within those relationships – what can be done, and with what consequence. The current framework treats relations as static features; a dynamic model of relational change, incorporating concepts of force, momentum, and stability, would represent a significant advancement. Modularity without such contextual understanding is an illusion of control.
Ultimately, the field needs to confront the fundamental question of representation itself. Are these slots merely convenient computational tools, or do they approximate something akin to an internal ‘physics engine’ – a predictive model of how the world should behave? The answer, predictably, likely resides somewhere in the messy intersection of the two.
Original article: https://arxiv.org/pdf/2511.06754.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/