Humanoid Robots Learn by Watching: A New Approach to Complex Tasks

Author: Denis Avetisyan


Researchers have developed a system that allows humanoid robots to learn coordinated manipulation skills through demonstration and a novel ‘Choice Policy’ framework.

The system decomposes teleoperation control into independent modules (arm, hand, head, and locomotion) that share a single joystick for both thumb manipulation and omnidirectional movement, a design choice intended to streamline data acquisition while maintaining coordinated whole-body control across stationary and fully mobile humanoid platforms.

This work presents a modular teleoperation system and learning approach for efficient imitation of multi-modal behaviors in complex loco-manipulation tasks.

Achieving robust, whole-body coordination remains a key challenge in deploying humanoid robots in real-world environments. This paper, ‘Coordinated Humanoid Manipulation with Choice Policies’, introduces a system leveraging modular teleoperation and a novel imitation learning approach, Choice Policy, to address this limitation. Choice Policy efficiently learns multi-modal behaviors from demonstrations, enabling faster inference and improved performance in complex manipulation tasks like dishwasher loading and whiteboard wiping. Could this framework represent a scalable path toward truly versatile humanoid robots operating seamlessly in unstructured human-centric spaces?


The Illusion of Adaptability: Why Robots Still Struggle

Conventional robot programming often falters when confronted with the ambiguities of real-world scenarios. Unlike the precisely defined environments of factory assembly lines, tasks such as cleaning or organizing require adaptability and nuanced decision-making, qualities difficult to instill through explicit code. Each step, however intuitive for a human, demands meticulous, manual specification for a robot, a process that is both time-consuming and prone to failure when faced with unexpected variations. This reliance on detailed instructions creates “brittle” systems, easily disrupted by even minor changes in the environment or object placement, ultimately limiting a robot’s usefulness beyond highly structured applications and necessitating constant human intervention.

Conventional robotic systems often falter when confronted with everyday tasks that humans perform with ease. Attempting to explicitly program a robot to wipe a whiteboard, for instance, requires detailing every motion, pressure application, and edge coverage – a process prone to failure with even slight variations in the environment. Similarly, loading a dishwasher demands precise object recognition, grip adjustments, and placement strategies that are difficult to anticipate and codify. This approach proves remarkably brittle; a misplaced plate or a slightly different whiteboard surface can disrupt the entire sequence, leading to errors and necessitating constant reprogramming. The inherent inefficiency stems from the robot’s inability to generalize from a limited set of pre-defined instructions, highlighting the need for more adaptable and intuitive control methods.

The difficulty in creating truly adaptable robots stems not from a lack of mechanical sophistication, but from the chasm between human task understanding and robotic instruction. People effortlessly grasp concepts like “wipe the whiteboard until clean” – a goal defined by subjective assessment and flexible execution. Robots, however, require precise, step-by-step directives, a process that necessitates explicitly defining what “clean” means in terms of sensor data, motor commands, and acceptable error margins. This translation – converting intuitive human goals into quantifiable robotic actions – presents a significant hurdle, demanding new approaches to robot programming that prioritize learning from demonstration, incorporating contextual understanding, and enabling robots to generalize beyond narrowly defined scenarios. Bridging this gap is crucial for deploying robots in dynamic, real-world environments where predictability is limited and adaptability is paramount.

Mimicking Intelligence: The Limits of Learned Behavior

Learning from Demonstration (LfD) represents a significant approach to robot skill acquisition by leveraging human expertise directly. This method bypasses the need for explicitly programmed behaviors and instead relies on capturing and replicating actions performed by a human operator controlling the robot. Typically, a human demonstrates the desired task – such as assembly, manipulation, or navigation – while the robot records the operator’s states and actions. These recorded demonstrations then serve as training data for a learning algorithm, enabling the robot to learn a policy that maps states to actions, effectively mimicking the human’s behavior and facilitating autonomous task execution. This contrasts with traditional robotic programming which requires detailed, manual specification of every step.
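As a rough illustration of what such demonstration data looks like in practice, the sketch below logs synchronized (observation, action) pairs during a teleoperation session. The interface names and array shapes are assumptions for illustration, not the paper's actual data format.

```python
# Minimal sketch of logging teleoperation demonstrations as (observation, action)
# pairs. Field names and dimensions are illustrative assumptions.
import numpy as np

class DemoRecorder:
    def __init__(self):
        self.observations = []   # e.g. camera features + joint states
        self.actions = []        # e.g. commanded whole-body joint targets

    def record_step(self, observation: np.ndarray, action: np.ndarray) -> None:
        self.observations.append(observation.copy())
        self.actions.append(action.copy())

    def save(self, path: str) -> None:
        np.savez(path,
                 observations=np.stack(self.observations),
                 actions=np.stack(self.actions))

# Hypothetical usage inside the teleoperation loop:
# recorder.record_step(robot.get_observation(), operator.get_command())
```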

Behavior Cloning (BC) and Diffusion Policy represent direct imitation learning approaches where a robot learns to map observations directly to actions by observing an expert. BC typically utilizes supervised learning to train a policy that predicts the expert’s actions given the same input observations, effectively creating a behavioral replica. Diffusion Policy extends this by modeling the policy as a diffusion process, allowing it to generate diverse and potentially more robust actions. Both methods bypass the need for explicitly defining reward functions, instead leveraging demonstrations as the primary learning signal, and serve as foundational techniques for initializing more complex reinforcement learning algorithms or providing a baseline level of autonomous behavior.
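A minimal behavior-cloning baseline can be written as straightforward supervised regression from observations to expert actions. The PyTorch sketch below illustrates the idea generically; the network sizes and hyperparameters are assumed, not taken from the paper.

```python
# Behavior-cloning sketch: a policy network regresses expert actions from
# observations with an MSE loss. Dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 64, 14   # assumed observation/action sizes

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(obs_batch: torch.Tensor, act_batch: torch.Tensor) -> float:
    """One supervised update pulling predicted actions toward expert actions."""
    pred = policy(obs_batch)
    loss = loss_fn(pred, act_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```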

Behavior Cloning and Diffusion Policy, while effective initial learning strategies, demonstrate limited generalization capabilities when applied to real-world scenarios. Evaluations of these methods on complex tasks, such as dishwasher loading, consistently reveal a 50% success rate, indicating a significant failure mode when faced with even minor deviations from the training environment or task parameters. This lack of robustness stems from the algorithms’ reliance on directly replicating observed actions without developing an underlying understanding of the task’s dynamics or the ability to adapt to unforeseen circumstances. Consequently, these methods require substantial amounts of demonstration data covering a wide range of possible conditions to achieve acceptable performance, and even then, performance degrades rapidly outside of the training distribution.

The Choice Policy efficiently addresses the limitations of diffusion policies and behavior cloning by generating multiple candidate actions in a single pass and selecting the optimal one via a learned score, achieving both fast inference and robust multi-modal behavior.

Expanding the Possibilities: A More Flexible Action Space

The Choice Policy utilizes an Action Proposal Network to generate a diverse set of potential action trajectories. Rather than predicting a single action, the network outputs multiple candidate trajectories, effectively expanding the robot’s action space. This is achieved through a probabilistic process where the network samples from a distribution of possible actions, creating a set of alternatives for subsequent evaluation. The number of proposed trajectories is a configurable parameter, allowing for a trade-off between computational cost and exploration of potential solutions. This contrasts with traditional approaches that rely on deterministic action prediction, which limits adaptability to unforeseen circumstances.
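One simple way to realize such a proposal network is to attach several output heads to a shared backbone, so that a single forward pass yields K candidate trajectories. The sketch below follows that pattern; the architecture, horizon, and dimensions are illustrative assumptions rather than the paper's design.

```python
# Sketch of an action-proposal head emitting K candidate trajectories in one
# forward pass. K, the horizon, and all sizes are assumptions for illustration.
import torch
import torch.nn as nn

class ActionProposalNetwork(nn.Module):
    def __init__(self, obs_dim=64, act_dim=14, horizon=16, num_proposals=5):
        super().__init__()
        self.horizon = horizon
        self.act_dim = act_dim
        self.num_proposals = num_proposals
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # One output head per proposal encourages the heads to specialize on
        # different modes of the demonstrated behavior.
        self.heads = nn.ModuleList(
            nn.Linear(256, horizon * act_dim) for _ in range(num_proposals)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.backbone(obs)
        proposals = torch.stack([head(h) for head in self.heads], dim=1)  # (B, K, H*A)
        return proposals.view(obs.shape[0], self.num_proposals,
                              self.horizon, self.act_dim)
```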

The Score Prediction Network assesses the viability of proposed action trajectories quantitatively. Evaluation utilizes Mean Squared Error (MSE) as a primary metric, comparing predicted outcomes to desired states. Beyond MSE, a Winner-Takes-All paradigm is also implemented, where the trajectory receiving the highest score is selected for execution, effectively suppressing consideration of lower-scoring alternatives. This scoring system enables the agent to differentiate between potential actions and prioritize those most likely to achieve successful task completion, contributing to improved performance and robustness.
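A plausible reading of this setup is a score head trained alongside a winner-takes-all regression loss: only the proposal closest to the demonstration (by MSE) receives the regression gradient, and the score head learns to predict which proposal wins. The sketch below implements that interpretation; the paper's exact losses and weighting may differ.

```python
# Score head plus winner-takes-all selection over K proposals. At inference the
# highest-scoring proposal is executed. Details are assumptions based on the
# description in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreNetwork(nn.Module):
    def __init__(self, obs_dim=64, num_proposals=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, num_proposals),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)          # (B, K) unnormalized proposal scores

def choice_loss(proposals, scores, expert_action):
    """Winner-takes-all regression plus a classification loss on the scores.

    proposals:     (B, K, H, A) candidate trajectories
    scores:        (B, K)       predicted proposal scores
    expert_action: (B, H, A)    demonstrated trajectory
    """
    per_proposal_mse = ((proposals - expert_action.unsqueeze(1)) ** 2).mean(dim=(2, 3))
    winner = per_proposal_mse.argmin(dim=1)                            # (B,)
    wta_loss = per_proposal_mse.gather(1, winner.unsqueeze(1)).mean()  # best proposal only
    score_loss = F.cross_entropy(scores, winner)                       # predict the winner
    return wta_loss + score_loss
```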

The implementation of a Choice Policy demonstrably improves robotic task completion rates by enabling action selection based on predicted quality. In dishwasher loading trials, this approach achieved a 70% success rate, representing a significant performance gain over both Behavior Cloning and Diffusion Policy methods, which each attained a 50% success rate. This improvement is attributed to the system’s ability to evaluate multiple potential action trajectories and choose the most promising one, thereby increasing robustness to variations in object position and orientation and enhancing adaptability to unforeseen circumstances during task execution.

During dishwasher loading, the policy strategically selects from K=5 specialized action proposals to maintain high precision across different task phases, demonstrating effective skill specialization and switching.

Seeing is Not Always Understanding: The Importance of Perception

Effective action proposal generation is fundamentally dependent on accurate environmental perception, achieved through the integration of RGB and depth cameras. RGB cameras capture color and texture information, providing data for object recognition and scene understanding. Complementary depth cameras measure the distance to objects, generating a 3D representation of the environment. This combined RGB-D data provides a comprehensive understanding of the surrounding space, enabling the system to identify potential interaction points, assess object affordances, and plan appropriate actions. The robustness of this perception pipeline directly impacts the quality and feasibility of generated action proposals, as inaccuracies in environmental understanding can lead to failed or inefficient movements.

The system utilizes a DINOv3 Feature Encoder to process incoming visual data from RGB and depth cameras, extracting relevant features for environmental understanding. Complementing this external perception is proprioception, which provides internal state awareness by tracking the robot’s own joint angles, velocities, and body pose. This internally-derived data, representing the robot’s current configuration and motion, is crucial for accurate state estimation and effective control, allowing the system to differentiate between perceived external changes and its own movements.

The Action Proposal Network (APN) receives processed visual data from the DINOv3 Feature Encoder – encompassing RGB and depth information – and proprioceptive data representing the robot’s internal state. This integrated information stream allows the APN to generate a set of feasible action proposals, effectively predicting potential movements and interactions with the environment. The network then evaluates these proposals based on learned criteria, selecting the optimal action to achieve a desired goal. Crucially, this process facilitates precise whole-body coordination by simultaneously considering the robot’s kinematic constraints and the external environment, ensuring stable and accurate execution of the chosen action.
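Putting the pieces together, inference reduces to encoding the RGB-D and proprioceptive inputs, proposing K trajectories, scoring them, and executing the highest-scoring one. The sketch below shows this selection step under assumed interfaces; the encoder, robot API, and tensor shapes are placeholders, not the system's actual components.

```python
# End-to-end inference sketch: encode camera images and proprioception, propose
# K trajectories, score them, and execute the winner. Encoder and shapes are
# placeholders for illustration.
import torch

@torch.no_grad()
def select_action(encoder, proposal_net, score_net, rgb, depth, proprio):
    visual = encoder(rgb, depth)                 # e.g. DINOv3-style visual features
    obs = torch.cat([visual, proprio], dim=-1)   # fuse external and internal state
    proposals = proposal_net(obs)                # (B, K, H, A) candidate trajectories
    scores = score_net(obs)                      # (B, K) predicted proposal quality
    best = scores.argmax(dim=1)                  # winner-takes-all selection
    idx = best.view(-1, 1, 1, 1).expand(-1, 1, *proposals.shape[2:])
    return proposals.gather(1, idx).squeeze(1)   # (B, H, A) chosen trajectory
```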

Hand-eye coordination, as demonstrated by the clear view of the dishrack from the head camera, significantly improves visual feedback during plate insertion compared to uncoordinated views which suffer from occlusion.

The Illusion of Collaboration: A Glimpse of Future Potential

Modular teleoperation represents a significant shift in how humans control robots, moving away from monolithic control schemes to a system built on specialized, independent submodules. This decomposition breaks down intricate tasks – such as grasping, moving, and manipulating objects – into manageable functional units. Each submodule handles a specific aspect of the overall control, allowing for greater flexibility and adaptability. By isolating functions, the system simplifies the teleoperation process, reducing cognitive load on the operator and enabling more intuitive control. This approach not only streamlines the operation but also facilitates easier troubleshooting, upgrades, and the integration of new capabilities, paving the way for more robust and versatile human-robot interactions.

Effective hand-eye coordination is central to advanced teleoperation systems, enabling robots to perform delicate manipulation tasks with precision. This capability isn’t simply about visual tracking; it requires a tightly integrated feedback loop where visual input directly informs and adjusts the robot’s movements in real-time. Demonstrations of this principle involve seemingly simple objects, such as a whiteboard eraser, which demands nuanced control over both position and orientation. Successful manipulation of such tools highlights the system’s ability to bridge the gap between human intention and robotic action, allowing for tasks that require a high degree of dexterity and spatial awareness. This precise coordination is foundational for applications extending beyond basic object handling, paving the way for complex collaborative tasks in environments like kitchens or assembly lines.

The culmination of recent advances in teleoperation lies in a demonstrably more intuitive and effective control system, fostering genuine human-robot collaboration. Recent trials showcase this potential, with the integrated system achieving a 90% success rate in transferring objects between a human and a robot – a ‘handover’ – and a 70% success rate in completing the more complex task of loading a dishwasher. These figures highlight a significant leap towards seamless interaction, suggesting the technology is moving beyond specialized applications and toward broader use in environments requiring adaptable, collaborative robotic assistance. This level of reliability and efficiency promises to unlock new possibilities in manufacturing, healthcare, and even domestic settings, redefining the relationship between humans and robots.

The pursuit of elegant control policies, as demonstrated in this work on coordinated humanoid manipulation, invariably feels like building a sandcastle against the tide. This paper’s ‘Choice Policy’ attempts to capture multi-modal behaviors, a commendable effort, but one can’t help but anticipate the inevitable edge cases production will unearth. As Carl Friedrich Gauss observed, “If other objects are equally possible, then no one should be preferred over another.” This seems applicable here; while the Choice Policy aims for optimality, the sheer complexity of loco-manipulation ensures countless other ‘equally possible’ outcomes, many of which will likely involve dropped objects or strained actuators. It’s not a failure of the approach, simply a recognition that even the most sophisticated systems eventually succumb to the chaos of the real world. One suspects future roboticists will view these demonstrations as charmingly naive, much like digital archaeologists examining ancient code.

Sooner or Later, It Breaks

This work, predictably, addresses the ‘hard’ problem of making robots not fall over while simultaneously attempting tasks humans find trivial. The ‘Choice Policy’ is, at its core, a more organized way to store failure modes. It elegantly encapsulates the multi-modal behaviors… until it encounters a situation not present in the demonstrations. Then, the robot will default to something resembling a controlled collapse. The illusion of intelligence is, after all, remarkably fragile.

The inevitable next step, naturally, will be scaling this to more complex scenarios. More demonstrations. More edge cases. A larger dataset of near-disasters carefully curated into a ‘robust’ policy. The field will chase ever-diminishing returns on demonstration data, all while ignoring the fundamental issue: simulation is a lie, and production is the ultimate QA.

One anticipates a future where the primary metric of success won’t be task completion, but time to first failure. Perhaps then, the focus will shift from imitating human behavior – messy, unpredictable, and prone to error – to designing robot behaviors that are deliberately, boringly, and reliably stable. Everything new is old again, just renamed and still broken.


Original article: https://arxiv.org/pdf/2512.25072.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
