Author: Denis Avetisyan
Researchers have developed a system that allows humanoid robots to learn coordinated manipulation skills through demonstration and a novel ‘Choice Policy’ framework.

This work presents a modular teleoperation system and learning approach for efficient imitation of multi-modal behaviors in complex loco-manipulation tasks.
Achieving robust, whole-body coordination remains a key challenge in deploying humanoid robots in real-world environments. This paper, ‘Coordinated Humanoid Manipulation with Choice Policies’, introduces a system leveraging modular teleoperation and a novel imitation learning approach, Choice Policy, to address this limitation. Choice Policy efficiently learns multi-modal behaviors from demonstrations, enabling faster inference and improved performance in complex manipulation tasks like dishwasher loading and whiteboard wiping. Could this framework represent a scalable path toward truly versatile humanoid robots operating seamlessly in unstructured human-centric spaces?
The Illusion of Adaptability: Why Robots Still Struggle
Conventional robot programming often falters when confronted with the ambiguities of real-world scenarios. Unlike the precisely defined environments of factory assembly lines, tasks such as cleaning or organizing require adaptability and nuanced decision-making, qualities difficult to instill through explicit code. Each step, however intuitive for a human, demands meticulous, manual specification for a robot, a process that is both time-consuming and prone to failure when faced with unexpected variations. This reliance on detailed instructions creates “brittle” systems, easily disrupted by even minor changes in the environment or object placement, ultimately limiting a robot’s usefulness beyond highly structured applications and necessitating constant human intervention.
Conventional robotic systems often falter when confronted with everyday tasks that humans perform with ease. Attempting to explicitly program a robot to wipe a whiteboard, for instance, requires detailing every motion, pressure application, and edge coverage – a process prone to failure with even slight variations in the environment. Similarly, loading a dishwasher demands precise object recognition, grip adjustments, and placement strategies that are difficult to anticipate and codify. This approach proves remarkably brittle; a misplaced plate or a slightly different whiteboard surface can disrupt the entire sequence, leading to errors and necessitating constant reprogramming. The inherent inefficiency stems from the robot’s inability to generalize from a limited set of pre-defined instructions, highlighting the need for more adaptable and intuitive control methods.
The difficulty in creating truly adaptable robots stems not from a lack of mechanical sophistication, but from the chasm between human task understanding and robotic instruction. People effortlessly grasp concepts like “wipe the whiteboard until clean” – a goal defined by subjective assessment and flexible execution. Robots, however, require precise, step-by-step directives, a process that necessitates explicitly defining what “clean” means in terms of sensor data, motor commands, and acceptable error margins. This translation – converting intuitive human goals into quantifiable robotic actions – presents a significant hurdle, demanding new approaches to robot programming that prioritize learning from demonstration, incorporating contextual understanding, and enabling robots to generalize beyond narrowly defined scenarios. Bridging this gap is crucial for deploying robots in dynamic, real-world environments where predictability is limited and adaptability is paramount.
Mimicking Intelligence: The Limits of Learned Behavior
Learning from Demonstration (LfD) represents a significant approach to robot skill acquisition by leveraging human expertise directly. This method bypasses the need for explicitly programmed behaviors and instead relies on capturing and replicating actions performed by a human operator controlling the robot. Typically, a human demonstrates the desired task – such as assembly, manipulation, or navigation – while the system records the robot’s states and the operator’s commanded actions. These recorded demonstrations then serve as training data for a learning algorithm, enabling the robot to learn a policy that maps states to actions, effectively mimicking the demonstrator’s behavior and facilitating autonomous task execution. This contrasts with traditional robotic programming, which requires detailed, manual specification of every step.
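To make the shape of this training data concrete, the sketch below defines a minimal demonstration record as it might be logged during teleoperation; the field names and types are illustrative assumptions, not the paper’s actual format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Step:
    """One timestep of a teleoperated demonstration (illustrative fields)."""
    rgb: np.ndarray       # camera image, e.g. (H, W, 3) uint8
    depth: np.ndarray     # depth map, e.g. (H, W) float32, in metres
    proprio: np.ndarray   # joint angles, velocities, body pose
    action: np.ndarray    # operator command recorded at this step

@dataclass
class Demonstration:
    """A full demonstration: an ordered sequence of observation-action steps."""
    task: str
    steps: List[Step] = field(default_factory=list)

# A policy trained from such data learns the mapping
# (rgb, depth, proprio) -> action, rather than following hand-written rules.
```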
Behavior Cloning (BC) and Diffusion Policy represent direct imitation learning approaches where a robot learns to map observations directly to actions by observing an expert. BC typically utilizes supervised learning to train a policy that predicts the expert’s actions given the same input observations, effectively creating a behavioral replica. Diffusion Policy extends this by modeling the action distribution with a conditional denoising diffusion process, allowing it to capture multi-modal demonstrations and generate diverse, potentially more robust actions. Both methods bypass the need for explicitly defining reward functions, instead leveraging demonstrations as the primary learning signal, and serve as foundational techniques for initializing more complex reinforcement learning algorithms or providing a baseline level of autonomous behavior.
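As a rough illustration of the behavior-cloning baseline, the snippet below trains a small supervised regressor from observations to expert actions; the network size, data, and optimizer settings are placeholder assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in demonstration data: 512 recorded steps, 64-dim observations, 7-dim actions.
obs = torch.randn(512, 64)
act = torch.randn(512, 7)
loader = DataLoader(TensorDataset(obs, act), batch_size=32, shuffle=True)

# Behavior cloning: supervised regression from observation to the expert's action.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 7))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for o, a in loader:
        loss = nn.functional.mse_loss(policy(o), a)  # penalize deviation from the demo
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A diffusion-based policy replaces this direct regression with an iterative denoising step over actions, which is what lets it represent several valid behaviors for the same observation.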
Behavior Cloning and Diffusion Policy, while effective initial learning strategies, demonstrate limited generalization capabilities when applied to real-world scenarios. Evaluations of these methods on complex tasks, such as dishwasher loading, consistently reveal a 50% success rate, indicating a significant failure mode when faced with even minor deviations from the training environment or task parameters. This lack of robustness stems from the algorithms’ reliance on directly replicating observed actions without developing an underlying understanding of the task’s dynamics or the ability to adapt to unforeseen circumstances. Consequently, these methods require substantial amounts of demonstration data covering a wide range of possible conditions to achieve acceptable performance, and even then, performance degrades rapidly outside of the training distribution.

Expanding the Possibilities: A More Flexible Action Space
The Choice Policy utilizes an Action Proposal Network to generate a diverse set of potential action trajectories. Rather than predicting a single action, the network outputs multiple candidate trajectories, effectively expanding the robot’s action space. This is achieved through a probabilistic process where the network samples from a distribution of possible actions, creating a set of alternatives for subsequent evaluation. The number of proposed trajectories is a configurable parameter, allowing for a trade-off between computational cost and exploration of potential solutions. This contrasts with traditional approaches that rely on deterministic action prediction, which limits adaptability to unforeseen circumstances.
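A minimal sketch of such a proposal head is shown below: it pairs the observation embedding with freshly sampled noise so that each forward pass yields a distinct candidate trajectory. The dimensions, architecture, and number of proposals are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ActionProposalNet(nn.Module):
    """Sketch of a proposal head: observation embedding plus sampled noise
    mapped to K candidate action trajectories (horizon x action_dim each)."""
    def __init__(self, obs_dim=256, noise_dim=32, horizon=16, act_dim=7):
        super().__init__()
        self.horizon, self.act_dim, self.noise_dim = horizon, act_dim, noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )

    def forward(self, obs_emb: torch.Tensor, num_proposals: int = 8) -> torch.Tensor:
        # Repeat the observation once per proposal and pair it with fresh noise,
        # so each candidate trajectory is a distinct sample.
        rep = obs_emb.unsqueeze(1).expand(-1, num_proposals, -1)
        noise = torch.randn(obs_emb.shape[0], num_proposals, self.noise_dim)
        out = self.net(torch.cat([rep, noise], dim=-1))
        return out.view(obs_emb.shape[0], num_proposals, self.horizon, self.act_dim)

proposals = ActionProposalNet()(torch.randn(1, 256), num_proposals=8)  # (1, 8, 16, 7)
```

Raising `num_proposals` widens the search over behaviors at the cost of extra scoring work, which is the trade-off mentioned above.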
The Score Prediction Network assesses the viability of proposed action trajectories quantitatively. Evaluation utilizes Mean Squared Error (MSE) as a primary metric, comparing predicted outcomes to desired states. Beyond MSE, a Winner-Takes-All paradigm is also implemented, where the trajectory receiving the highest score is selected for execution, effectively suppressing consideration of lower-scoring alternatives. This scoring system enables the agent to differentiate between potential actions and prioritize those most likely to achieve successful task completion, contributing to improved performance and robustness.
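The sketch below pairs such a score head with winner-takes-all selection, and shows one common way an MSE-based, winner-takes-all training signal could look, in which only the proposal closest to the demonstrated trajectory receives gradient. The paper’s exact formulation is not reproduced here, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Sketch: assigns a scalar score to each (observation, proposal) pair."""
    def __init__(self, obs_dim=256, horizon=16, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs_emb, proposals):
        B, K = proposals.shape[:2]
        rep = obs_emb.unsqueeze(1).expand(-1, K, -1)
        flat = proposals.reshape(B, K, -1)
        return self.net(torch.cat([rep, flat], dim=-1)).squeeze(-1)  # (B, K)

obs_emb = torch.randn(1, 256)
proposals = torch.randn(1, 8, 16, 7)               # K = 8 candidate trajectories
scores = ScoreNet()(obs_emb, proposals)            # one score per candidate
best = proposals[torch.arange(1), scores.argmax(dim=-1)]  # winner takes all

# Illustrative winner-takes-all loss: only the proposal nearest the demonstrated
# trajectory contributes its MSE, leaving the other proposals free to cover
# alternative modes of the behavior.
demo = torch.randn(1, 16, 7)
errs = ((proposals - demo.unsqueeze(1)) ** 2).mean(dim=(-1, -2))   # (B, K)
wta_loss = errs.gather(1, errs.argmin(dim=1, keepdim=True)).mean()
```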
The implementation of a Choice Policy demonstrably improves robotic task completion rates by enabling action selection based on predicted quality. In dishwasher loading trials, this approach achieved a 70% success rate, representing a significant performance gain over both Behavior Cloning and Diffusion Policy methods, which each attained a 50% success rate. This improvement is attributed to the system’s ability to evaluate multiple potential action trajectories and choose the most promising one, thereby increasing robustness to variations in object position and orientation and enhancing adaptability to unforeseen circumstances during task execution.

Seeing is Not Always Understanding: The Importance of Perception
Effective action proposal generation is fundamentally dependent on accurate environmental perception, achieved through the integration of RGB and depth cameras. RGB cameras capture color and texture information, providing data for object recognition and scene understanding. Complementary depth cameras measure the distance to objects, generating a 3D representation of the environment. This combined RGB-D data provides a comprehensive understanding of the surrounding space, enabling the system to identify potential interaction points, assess object affordances, and plan appropriate actions. The robustness of this perception pipeline directly impacts the quality and feasibility of generated action proposals, as inaccuracies in environmental understanding can lead to failed or inefficient movements.
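As a small worked example of how depth measurements become a 3D representation, the function below back-projects a depth map into camera-frame points with the standard pinhole model; the intrinsics are placeholder values, and real ones come from calibration.

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map (metres) into camera-frame 3D points using the
    pinhole model: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d."""
    v, u = np.indices(depth.shape)
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

# Placeholder intrinsics for a 640x480 sensor.
points = depth_to_points(np.random.uniform(0.5, 3.0, (480, 640)), 600.0, 600.0, 320.0, 240.0)
# Pairing `points` with the aligned RGB image yields a colored 3D view of the scene,
# the kind of representation the action proposal stage can reason over.
```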
The system utilizes a DINOv3 Feature Encoder to process incoming visual data from RGB and depth cameras, extracting relevant features for environmental understanding. Complementing this external perception is proprioception, which provides internal state awareness by tracking the robot’s own joint angles, velocities, and body pose. This internally-derived data, representing the robot’s current configuration and motion, is crucial for accurate state estimation and effective control, allowing the system to differentiate between perceived external changes and its own movements.
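Since the paper’s exact encoder interface is not reproduced here, the sketch below stands a placeholder backbone in for the frozen DINOv3 encoder and simply concatenates its pooled visual features with the proprioceptive vector; the dimensions and fusion scheme are assumptions.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Sketch: fuse pooled visual features with proprioception. `backbone` stands
    in for a frozen pretrained encoder such as DINOv3 (interface assumed)."""
    def __init__(self, backbone: nn.Module, vis_dim: int, proprio_dim: int, out_dim: int = 256):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # keep the pretrained encoder frozen
        self.proj = nn.Linear(vis_dim + proprio_dim, out_dim)

    def forward(self, image: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)         # (B, vis_dim) pooled visual features
        return self.proj(torch.cat([feats, proprio], dim=-1))

# Tiny stand-in backbone with the same (image -> pooled vector) contract.
toy_backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 384))
enc = ObsEncoder(toy_backbone, vis_dim=384, proprio_dim=32)
emb = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 32))  # (1, 256) fused embedding
```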
The Action Proposal Network (APN) receives processed visual data from the DINOv3 Feature Encoder – encompassing RGB and depth information – and proprioceptive data representing the robot’s internal state. This integrated information stream allows the APN to generate a set of feasible action proposals, effectively predicting potential movements and interactions with the environment. The network then evaluates these proposals based on learned criteria, selecting the optimal action to achieve a desired goal. Crucially, this process facilitates precise whole-body coordination by simultaneously considering the robot’s kinematic constraints and the external environment, ensuring stable and accurate execution of the chosen action.
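Putting those pieces together, a receding-horizon control loop in this style would encode the current observation, draw proposals, score them, and execute the start of the winning trajectory before re-planning. The functions below are trivial stubs standing in for the trained networks, included only to show the structure of the loop.

```python
import numpy as np

# Trivial stubs standing in for the trained encoder, proposal, and score networks.
def encode(rgb, depth, proprio):
    return np.concatenate([rgb.mean(axis=(0, 1)), [depth.mean()], proprio])

def propose(emb, k=8, horizon=16, act_dim=7):
    rng = np.random.default_rng(0)
    return rng.normal(size=(k, horizon, act_dim))        # K candidate trajectories

def score(emb, proposals):
    return -np.abs(proposals).mean(axis=(1, 2))           # placeholder preference

def control_step(rgb, depth, proprio):
    emb = encode(rgb, depth, proprio)                      # perceive
    candidates = propose(emb)                              # generate alternatives
    best = candidates[int(np.argmax(score(emb, candidates)))]  # winner takes all
    return best[0]                                         # execute the first action, then re-plan

action = control_step(np.zeros((480, 640, 3)), np.ones((480, 640)), np.zeros(32))
```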

The Illusion of Collaboration: A Glimpse of Future Potential
Modular teleoperation represents a significant shift in how humans control robots, moving away from monolithic control schemes to a system built on specialized, independent submodules. This decomposition breaks down intricate tasks – such as grasping, moving, and manipulating objects – into manageable functional units. Each submodule handles a specific aspect of the overall control, allowing for greater flexibility and adaptability. By isolating functions, the system simplifies the teleoperation process, reducing cognitive load on the operator and enabling more intuitive control. This approach not only streamlines the operation but also facilitates easier troubleshooting, upgrades, and the integration of new capabilities, paving the way for more robust and versatile human-robot interactions.
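One hypothetical way to picture this decomposition is a set of submodules behind a common interface, each translating its own slice of operator input into commands; the module names below are illustrative and not the paper’s actual breakdown.

```python
from abc import ABC, abstractmethod
from typing import Dict
import numpy as np

class TeleopModule(ABC):
    """Common interface for an independent teleoperation submodule."""
    @abstractmethod
    def command(self, operator_input: Dict) -> np.ndarray:
        """Map raw operator input to this module's command vector."""

class ArmModule(TeleopModule):     # hypothetical decomposition, for illustration only
    def command(self, operator_input):
        return np.asarray(operator_input["wrist_pose"])      # e.g. a 6-DoF wrist target

class HandModule(TeleopModule):
    def command(self, operator_input):
        return np.asarray(operator_input["finger_angles"])   # finger or gripper targets

def teleop_step(modules: Dict[str, TeleopModule], operator_input: Dict) -> Dict[str, np.ndarray]:
    """Each submodule handles its own part of the task independently."""
    return {name: m.command(operator_input) for name, m in modules.items()}

cmds = teleop_step({"arm": ArmModule(), "hand": HandModule()},
                   {"wrist_pose": [0, 0, 0.3, 0, 0, 0], "finger_angles": [0.2] * 5})
```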
Effective hand-eye coordination is central to advanced teleoperation systems, enabling robots to perform delicate manipulation tasks with precision. This capability isn’t simply about visual tracking; it requires a tightly integrated feedback loop where visual input directly informs and adjusts the robot’s movements in real-time. Demonstrations of this principle involve seemingly simple objects, such as a whiteboard eraser, which demands nuanced control over both position and orientation. Successful manipulation of such tools highlights the system’s ability to bridge the gap between human intention and robotic action, allowing for tasks that require a high degree of dexterity and spatial awareness. This precise coordination is foundational for applications extending beyond basic object handling, paving the way for complex collaborative tasks in environments like kitchens or assembly lines.
The culmination of recent advances in teleoperation lies in a demonstrably more intuitive and effective control system, fostering genuine human-robot collaboration. Recent trials showcase this potential, with the integrated system achieving a 90% success rate in transferring objects between a human and a robot – a ‘handover’ – and a 70% success rate in completing the more complex task of loading a dishwasher. These figures highlight a significant leap towards seamless interaction, suggesting the technology is moving beyond specialized applications and toward broader use in environments requiring adaptable, collaborative robotic assistance. This level of reliability and efficiency promises to unlock new possibilities in manufacturing, healthcare, and even domestic settings, redefining the relationship between humans and robots.
The pursuit of elegant control policies, as demonstrated in this work on coordinated humanoid manipulation, invariably feels like building a sandcastle against the tide. This paper’s ‘Choice Policy’ attempts to capture multi-modal behaviors, a commendable effort, but one can’t help but anticipate the inevitable edge cases production will unearth. As Carl Friedrich Gauss observed, “If other objects are equally possible, then no one should be preferred over another.” This seems applicable here; while the Choice Policy aims for optimality, the sheer complexity of loco-manipulation ensures countless other ‘equally possible’ outcomes, many of which will likely involve dropped objects or strained actuators. It’s not a failure of the approach, simply a recognition that even the most sophisticated systems eventually succumb to the chaos of the real world. One suspects future roboticists will view these demonstrations as charmingly naive, much like digital archaeologists examining ancient code.
Sooner or Later, It Breaks
This work, predictably, addresses the ‘hard’ problem of making robots not fall over while simultaneously attempting tasks humans find trivial. The ‘Choice Policy’ is, at its core, a more organized way to store failure modes. It elegantly encapsulates the multi-modal behaviors… until it encounters a situation not present in the demonstrations. Then, the robot will default to something resembling a controlled collapse. The illusion of intelligence is, after all, remarkably fragile.
The inevitable next step, naturally, will be scaling this to more complex scenarios. More demonstrations. More edge cases. A larger dataset of near-disasters carefully curated into a ‘robust’ policy. The field will chase ever-diminishing returns on demonstration data, all while ignoring the fundamental issue: simulation is a lie, and production is the ultimate QA.
One anticipates a future where the primary metric of success won’t be task completion, but time to first failure. Perhaps then, the focus will shift from imitating human behavior – messy, unpredictable, and prone to error – to designing robot behaviors that are deliberately, boringly, and reliably stable. Everything new is old again, just renamed and still broken.
Original article: https://arxiv.org/pdf/2512.25072.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/