Author: Denis Avetisyan
A new framework empowers robots to replicate human manipulation skills by fusing visual perception with tactile feedback and 3D understanding.

This research introduces OCRA, an object-centric learning approach combining multi-view 3D reconstruction, tactile sensing, and diffusion policies for improved human-to-robot action transfer.
Despite advances in robotic manipulation, transferring complex skills from human demonstration remains challenging due to the difficulty of discerning relevant task information. This paper introduces OCRA (Object-Centric learning with 3D and Tactile Priors for Human-to-Robot Action Transfer), a novel framework that learns robust manipulation policies by reconstructing object-centric 3D scenes from multi-view video and integrating tactile feedback. By leveraging a diffusion policy and fusing visual and tactile priors through a multimodal module, OCRA significantly outperforms existing methods in learning from human demonstrations. Could this approach pave the way for more intuitive and adaptable robot assistants capable of seamlessly replicating human dexterity?
The Fragility of Perception: Beyond Conventional Robotics
Conventional robotic systems often falter when confronted with the unpredictable nature of real-world environments. These machines are typically engineered with meticulously defined parameters and pre-programmed sequences, excelling in highly structured settings like factory assembly lines. However, this reliance on precise instructions proves problematic when faced with variations in object placement, unexpected obstacles, or deformable materials. The inability to generalize beyond their programmed parameters significantly limits their adaptability, demanding constant human intervention or re-programming for even minor deviations from the expected. Consequently, tasks requiring nuanced manipulation, such as sorting recycling or assisting in a home setting, remain substantial challenges, highlighting the need for more robust and flexible robotic solutions that can navigate uncertainty and learn from experience.
Despite remarkable advancements in computer vision, robotic systems relying solely on visual input often falter when confronted with the ambiguities of the real world. While algorithms can now accurately identify objects under controlled conditions, incomplete scene understanding remains a critical limitation. This stems from the inherent challenges of interpreting three-dimensional spaces from two-dimensional images – issues like occlusion, varying lighting, and subtle textural differences can mislead even sophisticated systems. Consequently, robots struggle with tasks requiring nuanced manipulation or adaptation to unexpected changes, highlighting the need for supplementary sensory data to ensure reliable performance beyond carefully curated environments. The inability to fully ‘understand’ what is being observed, rather than simply ‘seeing’ it, ultimately restricts the robustness and versatility of vision-only robotic applications.
Despite advancements in robotic vision, current systems frequently struggle with nuanced object manipulation because they underutilize tactile sensing. While cameras excel at identifying what an object is and where it resides, they often fail to provide critical information about its texture, firmness, or slippage, details crucial for a secure and adaptable grip. This limitation necessitates a more holistic approach, integrating tactile sensors into robotic “skin” to provide feedback on contact forces and surface properties. Such integration allows for real-time adjustments during manipulation, preventing dropped objects or damaged goods, and greatly improving the robot’s ability to discern between objects with similar visual characteristics. Effectively incorporating tactile data enables finer motor control and allows robots to not just ‘see’ an object, but truly ‘feel’ it, ultimately unlocking more robust and reliable performance in complex, real-world scenarios.

Beyond Pixels: An Object-Centric View of Intelligence
Object-Centric Learning (OCL) represents a departure from traditional robot perception systems that process data at the pixel level. Instead of analyzing raw pixel data, OCL focuses on identifying and representing individual objects within a scene as discrete entities, each with associated properties such as shape, size, color, and pose. This approach allows robots to build a more structured and interpretable understanding of their environment. By reasoning about objects rather than pixels, OCL facilitates improved generalization, transfer learning, and robustness to changes in viewpoint, lighting, and occlusion. The core principle is to decompose a visual scene into its constituent objects, enabling the robot to manipulate and interact with the world in a more intuitive and efficient manner.
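To make the idea concrete, here is a minimal sketch of what an object-centric scene description might look like. The `SceneObject` and `Scene` classes and all their field names are illustrative assumptions for exposition, not OCRA's actual data structures.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneObject:
    """One discrete entity in an object-centric scene description.

    Field names are illustrative, not OCRA's actual schema."""
    mask: np.ndarray       # HxW boolean segmentation mask
    pose: np.ndarray       # 4x4 homogeneous transform (object -> world)
    features: np.ndarray   # latent descriptor (shape, texture, ...)

@dataclass
class Scene:
    objects: list = field(default_factory=list)

    def manipulated(self, idx: int) -> SceneObject:
        return self.objects[idx]

# Build a toy scene with a single object.
obj = SceneObject(
    mask=np.zeros((4, 4), dtype=bool),
    pose=np.eye(4),
    features=np.zeros(8),
)
scene = Scene(objects=[obj])
```

Reasoning over a small list of such entities, rather than a raw pixel grid, is what gives the approach its robustness to viewpoint and lighting changes.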
Effective object-centric learning relies on the accurate identification and segmentation of individual objects within a visual scene. Recent advancements in this area include Segment Anything Model 2 (SAM2) and GroundingDINO. SAM2 is a promptable segmentation model capable of generating high-quality masks for diverse objects given any input prompt, while GroundingDINO focuses on grounding language descriptions to objects in images, enabling the identification of specific objects mentioned in natural language. Both techniques contribute to robust object perception by providing the foundational step of isolating and delineating objects, which is critical for subsequent representation and manipulation within an object-centric framework.
Object representation in robotic systems is improved through the concurrent identification of both the Manipulated Object Mask and the Context Object Mask. The Manipulated Object Mask isolates the specific object directly undergoing interaction – such as being grasped or moved – providing precise boundaries for action planning. Complementing this, the Context Object Mask defines the surrounding objects and surfaces relevant to the manipulation. This contextual awareness is critical for understanding spatial relationships, predicting potential collisions, and enabling more robust and adaptable robotic behaviors. By explicitly defining both the actor and the surrounding environment, the system gains a more complete and nuanced understanding of the interaction space than is achievable through single-object segmentation alone.
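The two-mask idea can be sketched with plain boolean arrays; the `split_masks` helper below is hypothetical, not code from the paper.

```python
import numpy as np

def split_masks(all_masks: np.ndarray, manipulated_idx: int):
    """Given per-object boolean masks of shape (N, H, W), return the mask
    of the manipulated object and the union mask of all remaining
    (context) objects. A simplified stand-in for the two-mask scheme."""
    manipulated = all_masks[manipulated_idx]
    context = np.delete(all_masks, manipulated_idx, axis=0).any(axis=0)
    # Context pixels never include the manipulated object itself.
    context = context & ~manipulated
    return manipulated, context

# Toy scene: three 4x4 object masks, object 0 is being manipulated.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 0:2, 0:2] = True   # manipulated object
masks[1, 2:4, 2:4] = True   # context object A
masks[2, 1:3, 1:3] = True   # context object B, overlapping the others
manip, ctx = split_masks(masks, manipulated_idx=0)
```

Keeping the two masks disjoint ensures downstream modules see an unambiguous actor/environment split even when segmentations overlap.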

The Symbiosis of Sensation: Fusing Vision and Touch
Tactile sensing complements visual perception by providing data regarding object characteristics not readily discernible through vision alone. While vision excels at identifying objects and their broad spatial relationships, it struggles with detailed surface properties and contact forces. Tactile sensors, when applied to an object’s surface, directly measure parameters such as local curvature, surface roughness – defining texture – and the magnitude and distribution of applied forces. This information is critical for robust manipulation, particularly in scenarios involving deformable objects, fine motor control, or occluded features where visual data is incomplete or ambiguous. The integration of tactile data therefore enhances an agent’s ability to understand and interact with its environment effectively, going beyond what is possible with vision alone.
The Tactile Encoder component processes raw tactile imagery to generate a feature vector representing contact information. This encoder is initially pre-trained on a large-scale dataset of tactile observations, enabling it to learn generalizable representations of surface characteristics. Subsequent refinement is achieved through Masked Autoencoder (MAE) techniques, where portions of the tactile input are masked and the encoder is trained to reconstruct the missing data. This process forces the encoder to learn robust and informative features, capturing essential details from incomplete tactile scans. The resulting feature vector provides a condensed, meaningful representation of the tactile input, suitable for integration with other sensory modalities.
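The masking step at the heart of MAE-style pretraining can be illustrated in a few lines. The patch size, mask ratio, and helper names below are illustrative assumptions, and the all-zero reconstruction stands in for a real learned decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=4, mask_ratio=0.75, rng=rng):
    """Split a square tactile image into non-overlapping patches and hide
    a random subset, as in Masked Autoencoder (MAE) pretraining.
    Returns the flattened patches and a boolean keep-mask."""
    h, w = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    n = patches.shape[0]
    keep = np.zeros(n, dtype=bool)
    n_visible = max(1, int(n * (1 - mask_ratio)))
    keep[rng.choice(n, size=n_visible, replace=False)] = True
    return patches, keep

def mae_loss(patches, keep, reconstruction):
    """MAE computes reconstruction error only on the *hidden* patches."""
    hidden = ~keep
    return float(np.mean((reconstruction[hidden] - patches[hidden]) ** 2))

# Toy 16x16 tactile frame; a real encoder would predict the hidden patches.
img = rng.standard_normal((16, 16))
patches, keep = mask_patches(img)
zero_recon = np.zeros_like(patches)   # trivial baseline "decoder"
loss = mae_loss(patches, keep, zero_recon)
```

Because the loss is taken only over hidden patches, the encoder is forced to infer surface structure from partial contact information, which is exactly the robustness the tactile pipeline needs.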
The OCRA framework employs ResFiLM (Residual Feature-wise Linear Modulation) to fuse tactile and visual features, creating a combined representation used for downstream tasks. Specifically, tactile features extracted from tactile images are modulated by visual features, allowing the system to leverage complementary information from both modalities. This unified representation facilitates both 3D reconstruction of manipulated objects and enables precise control of robotic manipulation by providing a more complete understanding of object properties and pose than either modality could provide independently. The ResFiLM approach allows for adaptive weighting of visual features based on tactile input, and vice-versa, optimizing the representation for each specific manipulation task.
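FiLM-style modulation with a residual connection is simple to write down. The sketch below assumes linear gamma/beta predictors and is a generic ResFiLM layer, not OCRA's exact module.

```python
import numpy as np

rng = np.random.default_rng(1)

def res_film(tactile, visual, W_gamma, W_beta):
    """Residual FiLM: visual features predict a per-channel scale (gamma)
    and shift (beta) applied to the tactile features, with a skip
    connection so the unmodulated tactile signal is preserved."""
    gamma = visual @ W_gamma                  # (D,) per-channel scale
    beta = visual @ W_beta                    # (D,) per-channel shift
    return tactile + gamma * tactile + beta   # residual + FiLM modulation

D = 8
tactile = rng.standard_normal(D)
visual = rng.standard_normal(D)
W_gamma = rng.standard_normal((D, D)) * 0.1
W_beta = rng.standard_normal((D, D)) * 0.1
fused = res_film(tactile, visual, W_gamma, W_beta)
```

The residual path means that when the predicted gamma and beta are near zero, the layer reduces to an identity over the tactile features, which makes the fusion easy to train.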
Object pose estimation within the system utilizes the Iterative Closest Point (ICP) algorithm to refine the alignment between sensed point clouds and a known model or prior. ICP iteratively minimizes the distance between corresponding points to determine the optimal [latex]SE(3)[/latex] transformation – a combination of rotation and translation – representing the object’s position and orientation in 3D space. The resulting [latex]SE(3)[/latex] transformation, expressed as a 4×4 homogeneous transformation matrix, provides a complete description of the object’s pose, enabling accurate 3D reconstruction and manipulation planning.
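A minimal point-to-point ICP loop, using brute-force nearest neighbours and the closed-form SVD (Kabsch) alignment, illustrates the refinement described above. Production systems use k-d trees and robust matching, and the paper's exact ICP variant may differ.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares SE(3) transform aligning matched points src -> dst
    (Kabsch algorithm via SVD). Returns a 4x4 homogeneous matrix."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = cd - R @ cs
    return T

def icp(src, dst, iters=20):
    """Minimal point-to-point ICP: alternate nearest-neighbour matching
    and closed-form rigid alignment."""
    T_total = np.eye(4)
    cur = src.copy()
    for _ in range(iters):
        # Brute-force nearest neighbour in dst for every point in cur.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(1)]
        T = best_rigid_transform(cur, matched)
        cur = cur @ T[:3, :3].T + T[:3, 3]
        T_total = T @ T_total
    return T_total

# Toy check: recover a known rotation + translation of a random cloud.
rng = np.random.default_rng(2)
cloud = rng.standard_normal((50, 3))
angle = 0.2
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
target = cloud @ Rz.T + np.array([0.1, -0.05, 0.2])
T_est = icp(cloud, target)
```

The returned 4×4 matrix is exactly the homogeneous [latex]SE(3)[/latex] description of pose referred to above: rotation in the upper-left 3×3 block, translation in the last column.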

From Mimicry to Mastery: Transferring Skills with OCRA
The OCRA framework centers on efficiently transferring human skill to robotic systems through learned behavior. It accomplishes this by utilizing demonstrations – examples of a human performing a desired task – to train a Diffusion Policy. This policy doesn’t simply mimic actions; instead, it learns to predict the optimal robotic movements based on an integrated understanding of the environment, specifically focusing on the objects involved. By representing the scene in an object-centric manner, the system can better generalize to new situations and variations, allowing the robot to perform tasks with greater flexibility and robustness than traditional methods. The Diffusion Policy effectively learns a probability distribution over possible actions, enabling it to navigate complex scenarios and achieve successful outcomes based on the learned object relationships and desired task goals.
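A toy DDPM-style reverse loop shows the mechanics of sampling an action by iterative denoising. The `oracle_eps` function below cheats with the ground-truth action purely to make the sketch runnable; a trained diffusion policy would instead use a network conditioned on the fused object-centric observation. The schedule and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear noise schedule, as in DDPM.
T_steps = 50
betas = np.linspace(1e-4, 0.02, T_steps)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def oracle_eps(x_t, t, x0):
    """Stand-in for the learned noise-prediction network: returns the
    exact noise that would map x0 to x_t at step t."""
    ab = alpha_bars[t]
    return (x_t - np.sqrt(ab) * x0) / np.sqrt(1.0 - ab)

def sample_action(x0, eps_model=oracle_eps):
    """DDPM reverse process: start from Gaussian noise and iteratively
    denoise it into an action vector."""
    x = rng.standard_normal(x0.shape)
    for t in reversed(range(T_steps)):
        eps = eps_model(x, t, x0)
        ab, a = alpha_bars[t], alphas[t]
        mean = (x - betas[t] / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
        noise = rng.standard_normal(x0.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

target_action = np.array([0.3, -0.1, 0.25, 0.0])  # e.g. end-effector delta
action = sample_action(target_action)
```

With a perfect noise predictor the loop recovers the demonstrated action exactly; with a learned, observation-conditioned predictor it instead samples from the distribution of plausible actions, which is what gives diffusion policies their multimodal flexibility.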
The OCRA framework demonstrates a significant advancement in robotic manipulation through the integration of diverse sensory inputs. By fusing data from multi-view RGB cameras, detailed 3D reconstructions, and tactile sensors, the system cultivates a robust understanding of the environment and the objects within it. This multi-modal approach allows OCRA to achieve a remarkable 90% success rate on visuo-tactile tasks – a substantial improvement over existing methodologies. The combined sensory information provides redundancy and accuracy, enabling the robot to confidently grasp and manipulate objects even in challenging conditions or with limited visibility. This performance highlights the efficacy of OCRA’s object-centric representation and its ability to translate sensory data into effective action.
The refinement of object geometry understanding is central to the OCRA framework’s manipulation capabilities, achieved through a dedicated Point Cloud Encoder. This encoder processes 3D point cloud data – generated from multi-view RGB input and 3D reconstruction – to create a detailed representation of object shapes and surfaces. By directly analyzing this geometric information, the system moves beyond simple visual cues, enabling it to accurately predict how a robot’s actions will affect an object. This detailed understanding is crucial for tasks requiring precise manipulation, significantly improving the reliability and success rate of robotic interactions with the environment, particularly when dealing with complex or deformable objects. The encoder’s contribution goes beyond mere object recognition; it allows for a nuanced grasp of an object’s physical properties, informing the robot’s planning and execution of manipulation strategies.
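A PointNet-style encoder (a shared per-point MLP followed by symmetric max-pooling) is one common way to realize such a Point Cloud Encoder. The sketch below assumes this design, which may differ from OCRA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

def point_cloud_encoder(points, W1, W2):
    """PointNet-style encoder: a shared per-point MLP followed by a
    permutation-invariant max-pool yields one global shape descriptor."""
    h = np.maximum(points @ W1, 0.0)   # shared MLP layer + ReLU
    h = np.maximum(h @ W2, 0.0)        # second shared layer + ReLU
    return h.max(axis=0)               # symmetric pooling over points

pts = rng.standard_normal((128, 3))    # one object's point cloud
W1 = rng.standard_normal((3, 32)) * 0.3
W2 = rng.standard_normal((32, 64)) * 0.3
feat = point_cloud_encoder(pts, W1, W2)

# Permutation invariance: shuffling the points leaves the feature unchanged.
feat_shuffled = point_cloud_encoder(pts[rng.permutation(128)], W1, W2)
```

The max-pool makes the descriptor invariant to point ordering, which is essential because a point cloud has no canonical ordering of its points.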
A key indicator of the OCRA framework’s efficacy lies in its performance on vision-only manipulation tasks, where it achieves an impressive 85% success rate. This result highlights the substantial benefits derived from representing the environment in terms of discrete objects rather than raw pixel data. By focusing on object-level understanding, the system demonstrates enhanced robustness and generalization capabilities, allowing it to successfully manipulate objects even with limited sensory input. This approach bypasses the need for precise, frame-by-frame visual servoing, enabling more reliable and adaptable robotic manipulation in complex, real-world scenarios where visual clutter or occlusion may be present. The high success rate confirms that an object-centric paradigm is a valuable asset for developing more intelligent and versatile robotic systems.
The efficiency of skill acquisition in robotic systems has been significantly improved through a novel demonstration method. Traditionally, teaching a robot a new task via teleoperation required approximately 20 seconds per demonstration, a time-consuming process limiting the complexity of learned behaviors. However, recent advancements have enabled the collection of human demonstrations in just 8 seconds, a 2.5-fold speedup. This accelerated data acquisition is crucial for rapidly training robots to perform intricate manipulation tasks, as it allows for a greater volume of training data to be gathered in a comparable timeframe. The resulting increase in data density directly contributes to improved performance and adaptability of the robotic system, paving the way for more agile and versatile automation.

The pursuit of robust action transfer, as exemplified by OCRA, acknowledges an inherent truth: systems are not static entities, but rather evolve within the currents of time and interaction. OCRA’s object-centric learning, fusing multi-view 3D reconstruction with tactile sensing, represents a considered attempt to build a system that anticipates and gracefully accommodates change. As Donald Davies observed, “The real skill in system design is not to build something that works, but to build something that can be adapted.” OCRA’s architecture, focusing on object representation rather than pixel-level imitation, embodies this principle – a structure designed not for present perfection, but for future resilience, acknowledging that the arrow of time inevitably points toward refinement and adaptation.
What Lies Ahead?
The architecture presented in this work, like all architectures, will inevitably succumb to the pressures of refinement and, ultimately, obsolescence. OCRA establishes a foothold in object-centric learning, fusing modalities with an elegance that feels, momentarily, complete. Yet, the very success of such frameworks highlights a recurring challenge: improvements age faster than they can be understood. The reliance on 3D reconstruction, while currently effective, introduces a fragility tied to sensor limitations and computational cost, a debt that will accrue as tasks become more complex and environments less structured.
The true test will not be in replicating demonstrated actions, but in generalizing beyond them. Current systems excel at imitation, but struggle with the unpredictable nuances of the physical world. Future iterations must address the inherent uncertainty of tactile feedback and the difficulty of representing object affordances in a manner that transcends specific instances. A deeper exploration of disentangled representations, separating what an object is from how it can be manipulated, feels particularly critical.
This line of inquiry inevitably leads to questions of embodied intelligence and the very nature of skill. The goal is not merely to build robots that perform actions, but that understand them. Such understanding, however, is not a destination but a continual process of adaptation, a cycle of refinement and decay, played out across generations of robotic systems. The framework established here is a single point along that trajectory, a fleeting moment of order in an otherwise entropic universe.
Original article: https://arxiv.org/pdf/2603.14401.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-18 04:28