Author: Denis Avetisyan
Researchers have unveiled a new model that bridges the gap between visual understanding and physical action in humanoid robots, enabling more natural and versatile loco-manipulation capabilities.
![Dexterous manipulation, whole-body motion, and locomotion are integrated across eight diverse, long-horizon tasks to evaluate [latex]\Psi_{0}[/latex], with task instructions and sub-task markers overlaid for clarity and policy rollout videos available in supplementary materials.](https://arxiv.org/html/2603.12263v1/figures/PSI-Tasks-v3.png)
This work introduces Ψ0, an open foundation model trained on extensive video and robot data to achieve real-time control for complex humanoid tasks.
Despite advancements in robotics, achieving generalizable loco-manipulation skills in humanoids remains challenging due to disparities between human and robot kinematics and the inefficiencies of simply scaling data collection. This work introduces [latex]\Psi_{0}[/latex] (Psi-Zero), an open foundation model designed to address this fundamental problem by decoupling learning through a staged training paradigm. Specifically, we demonstrate that pre-training a visual-language model on high-quality egocentric human videos, followed by post-training on real-world humanoid robot trajectories, yields superior performance compared to approaches relying on larger, more heterogeneous datasets. Can this data-centric approach unlock truly versatile and adaptable humanoid robots capable of complex tasks in dynamic environments?
The Fragility of Embodied Systems
Conventional robotic systems frequently encounter difficulties when transitioning from controlled laboratory settings to the unpredictable nature of real-world environments. This struggle stems from inherent limitations in their ability to generalize learned behaviors to novel situations and adapt to unforeseen circumstances. Unlike humans, who intuitively adjust to changing conditions, these robots often rely on precisely programmed responses, making them brittle in the face of even minor variations. A robot trained to grasp a specific object under ideal lighting, for example, may fail completely when presented with the same object in a different pose or under altered illumination. This lack of robust generalization and adaptation significantly hinders their performance on complex tasks requiring dexterity, problem-solving, and nuanced interaction with the physical world, ultimately restricting their practical application beyond highly structured environments.
While recent advancements in foundation models have yielded impressive results in areas like natural language processing and image recognition, these models frequently struggle when applied to robotics, specifically in the realm of loco-manipulation – the coordinated movement and object interaction required for robots to navigate and operate in the real world. The core limitation lies in a lack of embodied understanding; these models are trained on vast datasets of static images or text, lacking the crucial sensorimotor experience of physical interaction. Consequently, they often fail to generalize to novel situations, exhibiting brittle performance when faced with unpredictable environments or slight variations in object properties. A robot guided by such a model might successfully grasp an object in a controlled laboratory setting, but falter when encountering a cluttered table or an object with an unexpected texture. Bridging this gap necessitates imbuing these models with a deeper comprehension of physical forces, spatial relationships, and the dynamic interplay between a robot’s actions and its environment.
A fundamental hurdle in creating truly intelligent robots lies in unifying how they perceive the world, formulate plans, and execute actions – a process currently fragmented in most systems. Researchers are exploring novel learning paradigms that move beyond sequential training of these components, instead advocating for integrated approaches where perception directly informs planning, and action provides feedback to refine both. This necessitates methods capable of learning representations that capture the relationships between sensory input, desired outcomes, and the motor commands needed to achieve them, effectively creating a closed-loop system. Such advancements promise robots that don’t just react to their environment, but proactively understand it and adapt their behavior in real-time, mirroring the seamless coordination observed in biological systems and unlocking the potential for truly versatile and autonomous machines.
The practical implementation of advanced robotic systems is significantly hampered by the sheer volume of data currently required for effective training. Current machine learning paradigms often demand extensive, meticulously labeled datasets – a process that is both time-consuming and expensive, particularly when adapting robots to novel environments or tasks. This reliance on ‘big data’ creates a substantial bottleneck, limiting the scalability of robotic solutions and hindering their deployment in dynamic, real-world scenarios where acquiring and annotating such vast datasets is impractical or impossible. Consequently, researchers are actively exploring methods to improve data efficiency, such as leveraging simulation, transfer learning, and few-shot learning, to enable robots to learn and adapt with significantly less training data and thereby broaden their applicability beyond controlled laboratory settings.

PsiZero: A Foundation for Robust Loco-Manipulation
PsiZero employs a multi-stage training paradigm to facilitate robust loco-manipulation capabilities. This process begins with pre-training the model on a large dataset of human egocentric videos, establishing a foundational understanding of visuomotor relationships. Subsequently, the pre-trained model is fine-tuned using real-robot interaction data, bridging the embodiment gap between human demonstrations and the robot and allowing for adaptation to the specifics of the robotic platform and environment. This staged approach optimizes data efficiency by leveraging the broad knowledge gained from video pre-training and refining it with targeted robotic experience, ultimately enabling the robot to perform complex loco-manipulation tasks with improved generalization and robustness.
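The staged paradigm can be sketched in a few lines. This is a minimal illustration only: `ToyModel`, `video_loss`, and `action_loss` are hypothetical stand-ins, not the actual Ψ0 training code.

```python
# Minimal sketch of the two-stage training schedule described above.
# All names here are illustrative placeholders.

class ToyModel:
    """Stand-in model that simply counts gradient updates per stage."""
    def __init__(self):
        self.video_updates = 0
        self.robot_updates = 0

    def video_loss(self, frames, text):
        self.video_updates += 1      # stage-1 update on human video
        return 0.0

    def action_loss(self, obs, instruction, actions):
        self.robot_updates += 1      # stage-2 update on robot trajectories
        return 0.0


def staged_training(model, human_video_data, robot_data):
    # Stage 1: pre-train on human egocentric video (visuomotor priors).
    for frames, text in human_video_data:
        model.video_loss(frames, text)
    # Stage 2: post-train on real robot trajectories (embodiment-specific).
    for obs, instruction, actions in robot_data:
        model.action_loss(obs, instruction, actions)
    return model
```

The key design point is the ordering: the broad, cheap data source shapes the representation first, and the scarce robot data only has to adapt it.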
PsiZero demonstrates improved data efficiency and generalization capabilities through the combined use of human egocentric videos and real-robot interaction data during training. This approach allows the model to learn from both demonstrated human behaviors and direct experience with its environment, requiring less real-world robot data to achieve comparable or superior performance to existing methods. Quantitative results indicate state-of-the-art performance on loco-manipulation tasks, as evidenced by benchmarks against prior models requiring significantly more training data. The incorporation of human demonstrations provides a strong prior for task understanding, while real-robot interactions refine the model’s policies for successful execution in dynamic environments.
PsiZero employs a vision-language backbone (VLMBackbone) to process both visual input and natural language task instructions. This VLMBackbone is responsible for encoding the combined information into a unified representation, enabling the model to interpret desired goals expressed in language and correlate them with observed environmental states. Specifically, the VLMBackbone maps the input – consisting of visual observations and text prompts – into a latent space where task-relevant features are extracted and contextualized. These features are then utilized to predict appropriate robot actions, effectively bridging the semantic gap between high-level instructions and low-level motor control.
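As a rough picture of that fusion step, the toy backbone below projects concatenated image and text features into one latent vector. Real VLM backbones use tokenizers and cross-attention; the linear projection here is purely an assumed illustration, not the Ψ0 architecture.

```python
import numpy as np

class TinyVLMBackbone:
    """Toy stand-in for a vision-language backbone: fuses image and text
    features into a shared latent task representation."""
    def __init__(self, img_dim, txt_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection from concatenated features to the latent.
        self.W = 0.1 * rng.standard_normal((latent_dim, img_dim + txt_dim))

    def encode(self, img_feat, txt_feat):
        fused = np.concatenate([img_feat, txt_feat])
        return np.tanh(self.W @ fused)  # bounded latent representation
```

Downstream components, such as the action head, would consume this latent instead of raw pixels or text.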
The Action Expert within PsiZero is a neural network module dedicated to translating high-level task objectives into a temporally coherent sequence of robot joint configurations. This component receives inputs from the vision-language backbone and generates predicted joint angles for each degree of freedom of the robot arm and base. The Action Expert is trained to prioritize physically plausible trajectories, avoiding jerky movements or configurations that would exceed joint limits or lead to collisions. Crucially, it outputs a sequence of actions, rather than a single pose, enabling the robot to execute complex manipulations over time and maintain stability during operation. The training process incorporates reinforcement learning to refine the action sequences based on successful task completion and reward signals.
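The output contract described above, a temporally coherent sequence of joint configurations respecting limits and velocity bounds, can be sketched with a simple planner. This is an assumed illustration of the interface, not the learned Action Expert itself.

```python
import numpy as np

def plan_action_chunk(q0, q_target, horizon, q_min, q_max, max_step):
    """Produce a sequence of joint configurations moving toward a target
    while enforcing a per-step speed bound and hard joint limits."""
    q = np.asarray(q0, dtype=float)
    chunk = []
    for _ in range(horizon):
        delta = np.clip(q_target - q, -max_step, max_step)  # velocity bound
        q = np.clip(q + delta, q_min, q_max)                # joint limits
        chunk.append(q.copy())
    return np.stack(chunk)
```

Because each step is bounded, the resulting trajectory is free of the jerky jumps a single-pose predictor can produce.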

Action Chunking: Smoothing the Trajectory of Robotic Motion
PsiZero utilizes a FlowBasedActionExpert to produce ActionChunks, which are discrete segments of coordinated robot movement. This system generates these chunks by modeling the dynamic transitions between states, effectively smoothing the overall motion trajectory. By predicting subsequent actions based on the current and preceding states, the FlowBasedActionExpert minimizes abrupt changes in velocity or direction – the root cause of motion jitter. The resulting ActionChunks are then executed sequentially, providing a more fluid and natural appearance to the robot’s movements during complex tasks.
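Assuming the expert follows the usual flow-matching recipe (the article does not spell out its internals), chunk generation amounts to integrating a learned velocity field from Gaussian noise toward an action chunk. The linear field used below is a toy stand-in for the learned network.

```python
import numpy as np

def sample_action_chunk(velocity_field, horizon, act_dim, n_steps=10, seed=0):
    """Flow-style sampling sketch: Euler-integrate a velocity field
    from noise to a (horizon x act_dim) action chunk."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, act_dim))  # start from noise
    for k in range(n_steps):
        t = k / n_steps
        x = x + velocity_field(x, t) / n_steps   # Euler integration step
    return x
```

Because the whole chunk is denoised jointly, consecutive actions come out correlated rather than independently sampled, which is what smooths the trajectory.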
The Action Expert utilizes the MMDiT (Multimodal Diffusion Transformer) architecture, a sequence modeling framework designed for predicting future states based on observed inputs. MMDiT leverages the transformer architecture to process multi-modal data, including proprioceptive feedback, visual observations, and task goals, to generate a distribution over potential action sequences. This allows the system to predict not just the next immediate action, but a coherent series of actions extending into the future, enabling long-horizon planning and mitigating the need for reactive control loops. The model is trained through imitation learning, learning to replicate demonstrated behaviors and generalize to novel situations by predicting plausible action sequences given a specific context.
Real-time chunking is essential for generating fluid robot motion during inference because it allows the system to predict and execute sequences of actions as a cohesive unit, rather than as a series of discrete, individually-calculated movements. This process mitigates the accumulation of small errors that can lead to jitter or unnatural transitions. By pre-planning several steps of the action sequence, the system anticipates the required motions and can execute them with greater stability and responsiveness, which is particularly critical for complex, long-horizon tasks where even minor inconsistencies can compound over time and destabilize the overall movement.
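A toy simulation makes the scheduling idea concrete: the control loop pops one action per tick, and a fresh chunk is requested once only a threshold number of actions remain, so the buffer never empties. In the real system the `get_chunk()` call would run asynchronously; here it is synchronous for clarity, and the names are illustrative.

```python
from collections import deque

def run_chunked_control(get_chunk, n_ticks, s_min):
    """Consume one action per control tick; once at most s_min actions
    remain buffered, fetch the next chunk so execution never stalls."""
    buffer = deque(get_chunk())
    executed = []
    for _ in range(n_ticks):
        if len(buffer) <= s_min:
            buffer.extend(get_chunk())  # would overlap with execution
        executed.append(buffer.popleft())
    return executed
```

Choosing the threshold larger than the worst-case inference latency (measured in control ticks) is what guarantees the 30 Hz loop never waits on the model.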
The implemented approach achieved state-of-the-art results in complex, long-horizon humanoid loco-manipulation tasks. Quantitative analysis indicates a significant performance improvement over existing methodologies; specifically, the system attained superior results despite being trained on less than one-tenth the data used by competing systems. This efficiency is attributed to the combined benefits of action chunking and the flow-based action expert, enabling effective learning and generalization from limited data in challenging robotic control scenarios.
![This real-time action chunking system utilizes a [latex]30Hz[/latex] control loop coordinating observation and action with an asynchronous inference loop (triggered when time exceeds a minimum threshold, [latex]t\geq s_{\text{min}}[/latex]) to ensure continuous action execution without inference-related delays.](https://arxiv.org/html/2603.12263v1/x8.png)
The Teleoperation Pipeline: Harvesting Data for Embodied Intelligence
The TeleoperationPipeline is a system designed to gather real-world robotic data through direct human control and precise motion tracking. This pipeline enables a human operator to control the robot’s movements while simultaneously capturing data related to those actions, including joint angles, end-effector positions, and sensor readings. The captured data is then stored and utilized for training and validating robotic algorithms, particularly in the areas of imitation learning and reinforcement learning. The system’s architecture prioritizes efficient data acquisition, allowing for the collection of substantial datasets with limited operational time.
The teleoperation pipeline utilizes WholeBodyControl to manage the robot’s degrees of freedom during human-guided demonstrations, enabling complex, coordinated motions. This is achieved through integration with MultiTargetIK, a system that solves for multiple end-effector positions simultaneously, rather than treating each joint individually. This combined approach allows for precise control of the robot’s posture and movement, facilitating the capture of data reflecting a wider range of physically plausible and coordinated actions during the teleoperation process. The system dynamically adjusts joint angles to achieve desired positions for hands, feet, and other relevant body parts, resulting in smoother and more natural-looking movements for data collection.
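The core of solving for several end-effector targets at once can be sketched as gradient descent on a summed tracking cost. The actual MultiTargetIK solver is not described in detail, so everything below, including the numerical-gradient approach and the toy forward-kinematics function supplied by the caller, is an assumed illustration.

```python
import numpy as np

def multi_target_ik(q, forward_kin, targets, steps=200, lr=0.1, eps=1e-5):
    """Drive joint vector q to minimize the summed squared distance of
    several end-effectors (hands, feet, ...) to their targets at once."""
    q = np.asarray(q, dtype=float)

    def cost(qv):
        ee = forward_kin(qv)  # dict: effector name -> position
        return sum(np.sum((ee[k] - np.asarray(t)) ** 2)
                   for k, t in targets.items())

    for _ in range(steps):
        grad = np.zeros_like(q)
        for i in range(len(q)):          # central finite differences
            dq = np.zeros_like(q)
            dq[i] = eps
            grad[i] = (cost(q + dq) - cost(q - dq)) / (2 * eps)
        q = q - lr * grad
    return q, cost(q)
```

Because all targets enter one shared cost, the solver trades them off jointly instead of settling each limb independently, which is what yields coordinated whole-body postures.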
The LocomotionRLPolicy utilizes reinforcement learning to enhance the lower-body locomotion capabilities of the robotic system. This policy operates subsequent to initial control provided by the TeleoperationPipeline and WholeBodyControl, serving to optimize and refine movement patterns. Specifically, the RL policy is trained to improve metrics such as stability, speed, and efficiency of gait, allowing the robot to adapt to varied terrains and dynamic situations. Training leverages data collected through the teleoperation process, enabling the policy to learn from human demonstrations and real-world interactions, ultimately resulting in more robust and naturalistic locomotion behaviors.
The system demonstrated state-of-the-art performance utilizing a relatively limited dataset comprised of 800 hours of human egocentric video and 30 hours of data collected directly from the robot. This data efficiency represents a significant advancement, as comparable systems typically require substantially larger datasets for training and validation. The achievement highlights the effectiveness of the teleoperation pipeline and associated algorithms in extracting maximal learning from minimal real-world robotic interaction, reducing the cost and time associated with data acquisition.

Towards Truly Adaptive and Generalizable Robotic Systems
PsiZero marks a notable advancement in robotics by demonstrating the potential for creating robots that aren’t limited to pre-programmed tasks or highly structured settings. This model achieves performance across a diverse range of activities – from manipulating objects to navigating complex terrains – without requiring task-specific engineering. Its architecture enables the robot to learn from raw sensory input, allowing it to adapt to unforeseen circumstances and generalize its skills to entirely new environments. This represents a shift away from specialized robots designed for singular purposes, and towards more versatile machines capable of operating autonomously in the unpredictable conditions of the real world, ultimately paving the way for robots that can truly assist humans in a wider array of tasks and settings.
A longstanding hurdle in robotics involves creating systems that can learn new skills quickly and adapt to unforeseen circumstances; PsiZero directly confronts these challenges through remarkably data-efficient learning and robust generalization. Traditional robotic systems often require massive datasets and painstaking fine-tuning for each new task or environment; however, this model demonstrates the ability to acquire proficiency with significantly less data, effectively reducing the time and resources needed for deployment. This capability stems from its capacity to learn underlying principles rather than simply memorizing specific instances, allowing it to generalize learned behaviors to novel situations and environments it has never encountered during training. Consequently, PsiZero represents a move away from brittle, task-specific robots towards more versatile and adaptable machines capable of operating effectively in the real world’s inherent unpredictability.
PsiZero demonstrates a marked advancement in robotics by effectively integrating large-scale human egocentric video with real-world robot interactions, a crucial step towards deploying robots beyond controlled laboratory settings. Models trained solely on human demonstrations often struggle with the inherent discrepancies between human and robot bodies, a mismatch known as the embodiment gap. PsiZero narrows this gap by learning from a blended dataset, allowing it to generalize its skills more effectively to novel, unpredictable situations. The combined approach not only accelerates learning, since abundant human video provides a safe and cost-effective source of visuomotor experience, but also enhances the robot’s robustness and adaptability when confronted with the complexities of real-world tasks. Consequently, PsiZero represents a tangible pathway from theoretical research to practical robotic applications, promising more versatile and reliable automation solutions.
Continued development of the PsiZero framework prioritizes expanding its capabilities to tackle increasingly intricate challenges, moving beyond current limitations to address real-world scenarios demanding greater adaptability. Researchers are actively investigating methods to facilitate lifelong learning, enabling the model to continuously refine its skills and acquire new competencies without catastrophic forgetting. This includes exploring techniques such as meta-learning and continual adaptation, allowing the robot to build upon prior experience and generalize to previously unseen tasks with minimal retraining. The ultimate goal is to create a robotic system capable of autonomous skill acquisition and sustained performance in dynamic, unpredictable environments, paving the way for truly versatile and intelligent machines.
The development of Ψ0 represents a fascinating stage in the life cycle of robotic architecture. It’s not merely about achieving loco-manipulation; it’s about building a system capable of adaptation and continuous learning from diverse data streams. As systems evolve, their initial elegance can be obscured by layers of complexity, yet the core principles remain. This echoes Marvin Minsky’s observation: “You can’t really understand something unless you’ve tried to build it.” The creation of Ψ0, a foundation model integrating real-world robot data with pre-trained models, demonstrates this perfectly. The model isn’t a final solution, but rather a stepping stone – an architecture living its life, showcasing how improvements age and reveal previously unseen challenges within the broader field of humanoid robotics.
What Lies Ahead?
The emergence of models like Ψ0 suggests a transient equilibrium. The ambition to unify locomotion and manipulation within a single framework is laudable, yet sidesteps the fundamental entropy inherent in embodied systems. Each interaction, each successful grasp, is merely a localized reduction in disorder, postponed but not prevented. The current reliance on vast datasets of human demonstrations highlights a dependence on past solutions – a scaffolding destined to decay as environments inevitably diverge from those previously observed.
Future work will likely focus on extending the model’s lifespan through continual learning, yet this addresses symptoms rather than the core issue. True robustness will demand a shift from imitation to intrinsic motivation – systems that don’t simply replicate observed behaviors, but actively explore and adapt to unforeseen circumstances. The challenge isn’t merely to achieve impressive benchmarks today, but to design architectures that degrade gracefully over time, accepting the inevitability of operational drift.
The pursuit of ‘universal’ humanoid control remains a fascinating, if ultimately quixotic, endeavor. Uptime, in this context, is a rare phase of temporal harmony, a temporary suspension of the second law. The lasting contribution of this work may not be the achievement of perfect loco-manipulation, but the illumination of the inherent limits – the unavoidable technical debt – within all complex, embodied intelligence.
Original article: https://arxiv.org/pdf/2603.12263.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- CookieRun: Kingdom 5th Anniversary Finale update brings Episode 15, Sugar Swan Cookie, mini-game, Legendary costumes, and more
- Call the Midwife season 16 is confirmed – but what happens next, after that end-of-an-era finale?
- Taimanin Squad coupon codes and how to use them (March 2026)
- Robots That React: Teaching Machines to Hear and Act
- Heeseung is leaving Enhypen to go solo. K-pop group will continue with six members
- Gold Rate Forecast
- PUBG Mobile collaborates with Apollo Automobil to bring its Hypercars this March 2026
- Alan Ritchson’s ‘War Machine’ Netflix Thriller Breaks Military Action Norms
- Marilyn Manson walks the runway during Enfants Riches Paris Fashion Week show after judge reopened sexual assault case against him
- How to get the new MLBB hero Marcel for free in Mobile Legends
2026-03-14 02:57