Author: Denis Avetisyan
Researchers have developed a system that allows robots to acquire complex manipulation skills by learning directly from human demonstrations, bridging the gap between human and robotic embodiment.

This work introduces HoMMI, a system leveraging egocentric vision and 3D representations to enable cross-embodiment transfer for whole-body mobile manipulation tasks.
Learning complex robotic skills remains challenging because human expertise is difficult to transfer to machines. This paper introduces HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations, a framework that lets robots acquire whole-body mobile manipulation skills directly from human demonstrations, with no robot-specific data collection required. HoMMI bridges the embodiment gap through egocentric sensing, a relaxed head action representation, and a 3D visual representation, allowing for scalable and portable learning. Could this approach unlock more intuitive and adaptable robots capable of seamlessly integrating into human environments?
Bridging the Embodiment Disparity in Robotic Perception
The successful integration of robots into human environments faces a fundamental obstacle: the ‘visual embodiment gap’. This disparity arises from the significant differences in how robots and humans perceive and interact with the world through vision. Humans naturally interpret visual cues based on shared bodily experiences and expectations, effortlessly understanding actions performed by others with similar physiques. Robots, however, often struggle with this interpretation due to their distinct morphology and sensor configurations. This mismatch hinders the robot’s ability to accurately recognize, predict, and respond to human actions, creating challenges for intuitive human-robot collaboration and limiting the deployment of robotic solutions in real-world settings where seamless interaction is paramount. Bridging this gap requires innovative approaches to visual processing and robotic control that account for the fundamental differences in embodiment between humans and machines.
The challenge of imparting human skills to robots is fundamentally complicated by disparities in physical form and perceptual perspective. Humans and robots rarely share equivalent bodies or viewpoints, creating a significant hurdle for intuitive control transfer. A robot attempting to mimic a human action must effectively translate instructions designed for a different morphology – different limb lengths, joint arrangements, and ranges of motion. Furthermore, a human demonstrator’s line of sight and spatial understanding, naturally aligned with their own body, must be reinterpreted for a robotic system with potentially vastly different sensor placements and perspectives. This mismatch necessitates sophisticated algorithms capable of bridging the gap between human intention and robotic execution, ensuring actions are performed not just accurately, but also in a manner that aligns with human expectations for a seamless, natural interaction.
Closing the kinematic gap – the fundamental difference in how a robot and a human body are structured and move – represents a critical step towards seamless skill transfer. This disparity extends beyond simple anatomical differences; it encompasses the number of independent movements, or degrees of freedom, each possesses. A human hand, with its intricate arrangement of bones and muscles, boasts a far greater range of motion than most robotic grippers. Consequently, directly replicating human actions on a robot often fails because the robot lacks the necessary physical flexibility. Researchers are actively exploring methods to bridge this gap, including developing adaptable robotic designs, employing sophisticated motion planning algorithms that account for kinematic differences, and utilizing machine learning to map human movements onto the robot’s capabilities – ultimately striving for a system where a robot can interpret and execute human-intended actions regardless of structural variations.

Establishing a Framework for Multi-Modal Data Acquisition
The UMI Framework is designed for capturing data in real-world environments, prioritizing portability and ease of deployment. It employs wrist-mounted cameras as the primary visual sensors, enabling first-person observation of task execution. Crucially, the system utilizes relative end-effector control, meaning actions are defined in relation to the user’s hand or tool, rather than absolute world coordinates. This approach simplifies data annotation and facilitates the transfer of learned behaviors to new contexts, as the system focuses on the relationship between the user’s actions and the observed outcomes, independent of specific environmental setups. The wrist-mounted configuration minimizes obstruction of the user’s workspace and provides a natural point of view for recording demonstrations.
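The payoff of relative end-effector control can be made concrete with a small sketch. Assuming poses are represented as 4x4 homogeneous transforms (a common convention, not something the paper specifies), an action recorded as the next pose expressed in the current gripper frame replays identically from any starting pose:

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_action(T_current, T_next):
    """Express the next end-effector pose in the frame of the current one.

    Actions defined this way are independent of where in the world the
    demonstration happened, which is what makes them transferable.
    """
    return np.linalg.inv(T_current) @ T_next

def apply_action(T_current, T_rel):
    """Replay a relative action from an arbitrary starting pose."""
    return T_current @ T_rel

# A recorded motion: 10 cm along the gripper's local x-axis.
T0 = pose_matrix(np.eye(3), np.array([0.5, 0.2, 0.3]))
T1 = pose_matrix(np.eye(3), np.array([0.6, 0.2, 0.3]))
a = relative_action(T0, T1)

# Replaying from a different start pose reproduces the same local motion.
T0_new = pose_matrix(np.eye(3), np.array([1.0, -0.4, 0.9]))
T1_new = apply_action(T0_new, a)
```

Because the action lives in the gripper frame, the same demonstration data remains valid regardless of where the scene sits in world coordinates.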
The HoMMI Data Collection System builds upon the existing UMI Framework by integrating a head-mounted camera to supplement wrist-mounted capture. This addition enables the acquisition of multi-view video data, providing a more complete visual record of observed actions. Critically, the system also records precise six-degrees-of-freedom (6-DoF) pose data, detailing the head’s position and orientation in space. This combined video and pose information facilitates detailed analysis of operator technique and allows for the reconstruction of the operator’s viewpoint during task execution.
The Apple ARKit framework provides the necessary tools for precise temporal alignment of video streams and 6-DoF pose data captured from multiple devices – specifically, wrist-mounted and head-mounted cameras within the HoMMI Data Collection System. ARKit’s capabilities include robust timestamping and inter-device synchronization protocols, enabling the creation of a consistently aligned multi-view dataset. This synchronization is essential for accurately reconstructing human demonstrations and facilitating the training of machine learning models that require correlated visual and kinematic information; discrepancies in timing would introduce errors in pose estimation and action recognition. The framework’s reliance on visual inertial odometry further contributes to data consistency by providing a shared coordinate frame for all captured data streams.
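The alignment problem itself is simple to state: for each wrist-camera frame, find the head pose sample whose timestamp is closest, and discard pairs that disagree by more than a tolerance. The sketch below illustrates that matching step only; the tolerance value and stream rates are illustrative assumptions, not figures from the paper:

```python
import numpy as np

def align_streams(wrist_ts, head_ts, tol=0.02):
    """Pair each wrist-camera frame with the nearest head-pose sample.

    Returns (wrist_index, head_index) pairs whose timestamps agree
    within `tol` seconds; unmatched frames are dropped, not interpolated.
    """
    pairs = []
    for i, t in enumerate(wrist_ts):
        j = int(np.argmin(np.abs(head_ts - t)))
        if abs(head_ts[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

wrist = np.arange(0.0, 1.0, 1 / 30)      # 30 Hz video stream
head = np.arange(0.005, 1.0, 1 / 60)     # 60 Hz pose stream, small offset
matched = align_streams(wrist, head)
```

In practice the hard part, which ARKit handles, is putting both devices on a shared clock in the first place; once timestamps are comparable, nearest-neighbor matching like this suffices.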

Implementing Visuomotor Policies with Geometric Representation
A Diffusion Policy learns a visuomotor policy directly from demonstrations, bypassing the need for explicitly defined reward functions or complex reinforcement learning procedures. This approach treats robot control as a diffusion process, where the policy learns to reverse the diffusion of action sequences given visual observations. By training on observed human actions, the robot can then generate similar behaviors, effectively mimicking the demonstrated skills. The policy is conditioned on visual inputs, allowing the robot to react to and interact with its environment based on what it observes, and outputs a distribution over possible actions, enabling stochastic and more natural movement.
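The reverse-diffusion sampling loop at the heart of such a policy can be sketched as follows. The noise-prediction network here (`eps_model`) is a hypothetical stand-in, not the paper's learned model, so only the DDPM-style sampling machinery is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(actions, t, obs):
    """Hypothetical noise predictor; a trained network conditioned on
    visual tokens would go here."""
    return actions - obs

def ddpm_sample(obs, horizon=8, dim=2, steps=50):
    """Sample an action sequence by reversing a diffusion process.

    Starts from Gaussian noise and iteratively removes the noise the
    model predicts, conditioned on the observation embedding `obs`.
    """
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    x = rng.standard_normal((horizon, dim))
    for t in reversed(range(steps)):
        eps = eps_model(x, t, obs)
        x = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

actions = ddpm_sample(np.zeros(2))
```

The stochastic noise injected at each intermediate step is what yields a distribution over action sequences rather than a single deterministic trajectory.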
To address variations in visual input due to changes in viewpoint, lighting, or object appearance, the system employs a 3D visual representation that encodes egocentric observations as geometry-aware tokens. This method transforms raw visual data into a format explicitly representing 3D geometric information, allowing the policy to focus on the spatial relationships between objects and the robot, rather than pixel-level details. These tokens capture information about the 3D structure of the scene as perceived from the robot’s perspective, providing a more robust and generalizable input to the visuomotor policy compared to direct image inputs. The resulting tokenized representation facilitates consistent policy performance across diverse visual conditions and environments.
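One simple way to obtain geometry-aware patch tokens, offered here as an illustrative stand-in for the paper's 3D representation rather than its actual method, is to back-project a depth image through pinhole intrinsics and average each patch into a 3D token:

```python
import numpy as np

def depth_to_tokens(depth, fx, fy, cx, cy, patch=4):
    """Turn a depth image into geometry-aware patch tokens.

    Each token carries the mean 3D position (camera frame) of the
    pixels in its patch, obtained by pinhole back-projection.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)              # (h, w, 3)
    # Average non-overlapping patches into one 3D token each.
    tokens = pts.reshape(h // patch, patch, w // patch, patch, 3).mean(axis=(1, 3))
    return tokens.reshape(-1, 3)

depth = np.full((8, 8), 1.0)                            # flat wall 1 m away
tok = depth_to_tokens(depth, fx=100, fy=100, cx=4, cy=4)
```

Tokens built this way encode where scene content sits relative to the camera, so a policy consuming them can reason about spatial layout rather than pixel appearance.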
The system utilizes DINO-v3, a self-supervised vision transformer, to extract features from image patches within the robot’s egocentric view. DINO-v3 is pre-trained on a large dataset and excels at identifying salient visual features, providing robust representations even with changes in lighting, texture, or viewpoint. These extracted features are then used as input to the policy network, enabling it to learn a more generalized visuomotor mapping. By leveraging DINO-v3’s learned representations, the policy demonstrates improved performance and adaptability when presented with novel environments and objects not encountered during training, as it relies on semantic understanding of visual inputs rather than memorizing specific pixel configurations.
The implementation utilizes a gripper-centric frame of reference to enhance the robot’s spatial understanding of its environment and target objects. This frame is coupled with a relaxed, look-at-point action representation: instead of directly predicting joint velocities, the policy predicts a 3D look-at point and a desired gripper orientation. Decoupling the action space from specific joint configurations in this way bridges the kinematic gap between visual observation and robot actuation, simplifies learning, and improves generalization, since the robot reasons about where to look and how to grasp rather than how to move each joint to reach the desired pose.
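A look-at point determines an orientation with standard geometry: the forward axis points from the gripper toward the target, and an up-vector fixes the roll. A minimal sketch (conventions assumed, not taken from the paper; degenerate when the view direction is parallel to the up-vector):

```python
import numpy as np

def look_at_rotation(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Rotation matrix whose z-axis points from `eye` toward `target`.

    The look-at point plus an up-vector fully determine orientation,
    which is the sense in which the action space is 'relaxed': no
    joint-level quantities need to be predicted.
    """
    z = target - eye
    z = z / np.linalg.norm(z)
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)        # fails if z is parallel to up
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)  # columns are the frame axes

R = look_at_rotation(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
```

The resulting matrix is orthonormal by construction, so it can be handed directly to a downstream controller as a target orientation.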

Towards Robust and Coordinated Robotic Manipulation
Robotic systems are increasingly challenged with tasks demanding not just single actions, but sustained, multi-step execution – a capability termed ‘Long-Horizon Manipulation’. This necessitates a departure from traditional robotic control, which often focuses on immediate movements, towards a framework that anticipates and adapts to extended sequences. Recent advancements demonstrate the ability to successfully navigate these complex scenarios, enabling robots to perform tasks like folding laundry or setting a table without pre-programmed routes. This is achieved by integrating predictive models and real-time feedback loops, allowing the robot to plan several steps ahead and recover from unexpected disturbances. The success of these systems relies on robust perception, accurate state estimation, and control algorithms capable of maintaining stability and precision throughout the entire manipulation sequence, ultimately bringing robots closer to autonomously performing real-world tasks requiring sustained, coordinated action.
The ability to deftly manipulate objects with two hands, termed bimanual coordination, is significantly enhanced through the implementation of whole-body control. This approach moves beyond simply directing the robot’s arms, instead orchestrating the movements of the entire body – including the torso, legs, and head – to maintain balance and achieve precise positioning. By considering the robot’s full kinematic and dynamic capabilities, whole-body control enables the generation of coordinated motions that are both stable and efficient. This holistic strategy allows for subtle adjustments throughout the manipulation process, compensating for disturbances and ensuring that both end-effectors remain accurately on their intended trajectories. Consequently, complex tasks demanding synchronized two-handed actions become more reliable and robust, paving the way for robots capable of intricate and delicate manipulations in unstructured environments.
Robotic manipulation relies heavily on the ability to accurately guide tools – the end-effectors – along desired paths, and this is achieved through a sophisticated ‘Whole-Body Controller’. This controller doesn’t just focus on the arms; it coordinates the entire robot body, anticipating and compensating for movements needed to maintain balance and stability. Crucially, the system incorporates ‘Constraint-Aware Control’, a technique that actively enforces physical limitations – preventing collisions, respecting joint limits, and ensuring the robot operates within safe boundaries. By continuously monitoring and adjusting to these constraints, the controller guarantees not only precise trajectory tracking, but also a robust and secure performance, even during complex and dynamic manipulations.
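At its simplest, constraint-aware tracking can be sketched at the velocity level: solve a least-squares inverse-kinematics step, then clip against velocity and joint limits. Real whole-body controllers solve a constrained optimization (typically a QP) over the full body; the toy Jacobian and limits below are illustrative assumptions only:

```python
import numpy as np

def constrained_step(J, x_dot, q, q_min, q_max, qd_max, dt=0.01):
    """One velocity-level control step with simple constraint handling.

    A pseudoinverse solution tracks the desired end-effector velocity;
    joint velocities are clipped to speed limits, and the integrated
    position is clipped to joint limits.
    """
    qd = np.linalg.pinv(J) @ x_dot               # task-space tracking
    qd = np.clip(qd, -qd_max, qd_max)            # respect velocity limits
    q_next = np.clip(q + qd * dt, q_min, q_max)  # respect joint limits
    return q_next

J = np.array([[1.0, 0.5]])                       # toy 1-DoF task, 2 joints
q_next = constrained_step(
    J, np.array([1.0]), np.array([0.0, 0.0]),
    q_min=np.array([-1.0, -1.0]), q_max=np.array([1.0, 1.0]),
    qd_max=np.array([0.5, 0.5]))
```

Clipping after the least-squares solve sacrifices some tracking accuracy when a limit is active; a proper QP formulation instead redistributes the motion to unconstrained joints, which is one reason full whole-body controllers are preferred.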
The system incorporates an active perception component, deliberately moving the robot’s head to gain crucial visual information that enhances manipulation success. This isn’t simply reactive vision; instead, the robot proactively seeks out relevant details – like the precise location of an object’s handle or the presence of obstructions – before attempting a manipulation. By strategically positioning its “viewpoint,” the system reduces uncertainty and improves its ability to plan and execute complex movements, effectively allowing the robot to ‘look’ before it ‘acts’. This intentional gathering of task-relevant information significantly boosts performance on tasks requiring fine motor skills and adaptability to dynamic environments.
The HoMMI system represents a notable step forward in robotic dexterity, consistently achieving success rates of up to 90% when performing complex, real-world tasks. This performance extends to challenges like laundry handling and intricate tablescape manipulation – activities demanding a high degree of coordination, planning, and adaptability. Such a high success rate indicates a substantial improvement over prior robotic manipulation systems, suggesting HoMMI’s architecture effectively bridges the gap between simulated environments and the unpredictable nature of physical interactions. The ability to reliably perform these tasks highlights the system’s potential for integration into human environments, paving the way for robots capable of assisting with daily living and collaborative work.
The HoMMI system demonstrates a remarkable capacity for real-world application, achieving an 85% success rate in delivery tasks and an 80% success rate in complex tablescape manipulation. These results highlight not only the system’s precision in executing individual actions, but also its robustness in adapting to the inherent uncertainties of dynamic environments. Successfully navigating delivery scenarios, which demand obstacle avoidance and careful object handling, alongside the nuanced coordination required for setting a table, indicates a versatile manipulation skillset. This performance suggests a significant step towards robots capable of reliably assisting with everyday household chores, moving beyond controlled laboratory settings and into practical, unpredictable human environments.

The pursuit of robust mobile manipulation, as demonstrated by HoMMI, necessitates a focus on invariant properties within complex systems. Dijkstra aptly stated, “Program testing can be effective as a means of finding errors, but it is hopelessly inadequate in confirming the absence of errors.” This echoes the core challenge of transferring skills learned from human demonstration: merely replicating observed actions does not guarantee generalization. HoMMI’s approach, by leveraging 3D representations and relaxed action spaces, attempts to distill the invariant principles governing whole-body coordination, moving beyond superficial imitation and towards a provably more robust system. The system seeks to establish what remains constant, the underlying geometric and kinematic relationships, even as the embodied agent navigates varying conditions.
Beyond Mimicry: Charting a Course for Embodied Intelligence
The pursuit of mobile manipulation via learned demonstration, as exemplified by HoMMI, reveals a fundamental tension. Successfully replicating observed actions is not, in itself, intelligence. The system addresses the ‘embodiment gap’ with commendable ingenuity, yet sidesteps the deeper question of understanding. A demonstrator doesn’t merely execute a sequence of motor commands; they possess an internal model of the world, a predictive capacity allowing for adaptation to unforeseen circumstances. A proof of system performance on a pre-defined set of demonstrations, while necessary, is demonstrably insufficient to establish generalizable competence.
Future work must prioritize the development of robust, verifiable internal representations. The reliance on egocentric vision, while pragmatic, introduces inherent limitations regarding scale and generalization. The true measure of progress lies not in flawlessly recreating a task, but in a system’s capacity to reason about it: to formulate and test hypotheses regarding the consequences of its actions. A formal, mathematically rigorous framework for action planning, grounded in principles of control theory and information geometry, remains a critical, and largely unmet, challenge.
Ultimately, the field should strive to move beyond the creation of sophisticated ‘mimics’ and towards the construction of truly embodied agents: systems capable of independent learning, adaptation, and, crucially, provable correctness. The elegance of an algorithm is not judged by its empirical success, but by the logical necessity of its conclusions.
Original article: https://arxiv.org/pdf/2603.03243.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-04 20:54