Running with Robots: A Step Towards Truly Autonomous Humanoids

Author: Denis Avetisyan


Researchers have developed a new approach to controlling humanoid robots that combines optimized human motion data with reinforcement learning, resulting in more natural and robust running gaits.

The controller achieves human-like locomotion on both constrained and open terrains by synthesizing optimized human motion data with a reward-shaping control strategy, effectively translating commanded speeds into naturalistic gaits.

This work presents a dynamic retargeting and Control Lyapunov Function-Reinforcement Learning pipeline for performant and controllable humanoid running, integrated with an autonomy stack for real-world obstacle avoidance.

While humanoid robots promise human-like locomotion, achieving both speed and robust autonomy remains a significant challenge. This is addressed in ‘Chasing Autonomy: Dynamic Retargeting and Control Guided RL for Performant and Controllable Humanoid Running’, which presents a pipeline leveraging dynamically optimized human motion data and Control Lyapunov Function-Reinforcement Learning to enable fast, stable running. The approach demonstrates successful hardware deployment, achieving speeds up to 3.3 m/s and hundreds of meters of traversal, and integrates seamlessly into a full autonomy stack for real-world obstacle avoidance. Could this combination of dynamic retargeting and control-guided RL unlock truly versatile and adaptable locomotion for future generations of humanoid robots?


The Inevitable Fallibility of Prediction

The pursuit of truly lifelike movement in humanoid robots consistently pushes the boundaries of conventional robotics. Traditional control methods, designed for predictable, static environments, frequently falter when confronted with the inherent dynamism of bipedal locomotion and real-world unpredictability. These systems struggle to account for the complex interplay of forces, the robot’s momentum, and external disturbances – issues that humans resolve intuitively. Consequently, achieving robust and agile movement – the ability to quickly adapt to uneven terrain, recover from stumbles, and execute complex maneuvers – demands a paradigm shift beyond pre-programmed sequences and reactive adjustments, necessitating entirely new control architectures capable of anticipating and responding to dynamic changes in real-time.

Traditional control methods for humanoid robots frequently falter when confronted with the inherent complexities of dynamic systems. These systems, characterized by constant change and interconnected variables, demand precise calculations and rapid adjustments that often exceed the capacity of conventional algorithms. Consequently, engineers often resort to extensive manual tuning – painstakingly adjusting parameters through trial and error – or rely on simplified models that sacrifice accuracy for computational efficiency. While these approaches can yield limited success in controlled environments, they struggle to generalize to unpredictable real-world scenarios, hindering the development of truly agile and robust robotic locomotion. The reliance on simplification, while pragmatic, ultimately limits a robot’s ability to react effectively to disturbances or navigate uneven terrain, creating a significant barrier to achieving natural and fluid movement.

Addressing the inherent instability of bipedal movement requires control systems capable of real-time adaptation, a feat recently demonstrated through the development of a dynamic control strategy for humanoid robots. This approach allows for stable and efficient locomotion even when faced with environmental uncertainties or during complex maneuvers like running. Researchers successfully implemented this control system on a physical robot, achieving a peak running speed of 3.3 meters per second – a significant advancement indicating progress towards more agile and responsive humanoid robots capable of navigating unpredictable terrains and performing intricate tasks.

Utilizing lidar odometry and a proportional controller, the robot successfully executed long-range outdoor navigation, exceeding 150 meters, by tracking global yaw targets, a target position for lateral motion, and a feedforward velocity.
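The proportional tracking loop described above can be sketched in a few lines; the gains and function names here are illustrative assumptions, not values from the paper:

```python
import math

def nav_command(yaw, yaw_target, y, y_target, v_ff,
                k_yaw=1.5, k_lat=0.8):
    """Proportional navigation command. Gains k_yaw, k_lat and the
    interface are illustrative, not the paper's actual values."""
    # Wrap the yaw error to [-pi, pi] so the controller turns the short way.
    yaw_err = math.atan2(math.sin(yaw_target - yaw),
                         math.cos(yaw_target - yaw))
    yaw_rate_cmd = k_yaw * yaw_err       # commanded turn rate from yaw error
    v_lat_cmd = k_lat * (y_target - y)   # lateral velocity toward the target line
    v_fwd_cmd = v_ff                     # feedforward forward speed, passed through
    return v_fwd_cmd, v_lat_cmd, yaw_rate_cmd
```

The feedforward term is what lets the commanded speed translate directly into forward progress while the two proportional terms only correct heading and lateral drift.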

The Illusion of Control Through Observation

Reinforcement Learning (RL) distinguishes itself from traditional control methodologies by learning optimal policies through interaction with an environment and subsequent reward maximization, eliminating the requirement for pre-defined, explicit models of system dynamics or kinematics. This data-driven approach allows RL algorithms to discover complex control strategies directly from observed data, such as sensor readings and action sequences. The learned policies represent mappings from states to actions, optimized to achieve a specified goal without relying on human-engineered rules or assumptions about the underlying system. This is particularly advantageous in scenarios where modeling the system is difficult, inaccurate, or computationally expensive, enabling robots to adapt to complex and uncertain environments.

Reference tracking Reinforcement Learning (RL) establishes a learning framework where a robotic system acquires control policies by imitating demonstrated motions. This is achieved by capturing human movements – serving as the desired behavioral guidance – and using them as training data for the RL agent. The robot learns to map its own actions to those observed in the demonstrations, effectively replicating the demonstrated behavior. This approach contrasts with traditional methods requiring explicit trajectory planning or hand-engineered controllers, instead leveraging the richness and naturalness of human motion data to shape the robot’s learned policy.
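A common form for such a reference-tracking reward, familiar from DeepMimic-style imitation, multiplies exponentiated tracking errors; the weights below are illustrative assumptions rather than the paper's actual values:

```python
import numpy as np

def mimic_reward(q, q_ref, v, v_ref, w_pose=2.0, w_vel=0.1):
    """Exponentiated tracking reward (DeepMimic-style sketch).
    Weights w_pose, w_vel are illustrative, not the paper's."""
    pose_err = np.sum((q - q_ref) ** 2)   # squared joint-position error
    vel_err = np.sum((v - v_ref) ** 2)    # squared joint-velocity error
    # Each factor lies in (0, 1]; perfect tracking yields reward 1.
    return np.exp(-w_pose * pose_err) * np.exp(-w_vel * vel_err)
```

The exponential keeps the reward bounded and smooth, so the policy is pulled toward the demonstration without any explicit trajectory planner.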

Human motion data serves as the foundational input for defining desired robotic behaviors and achieving natural, efficient locomotion. Comparative experiments have demonstrated the superior performance of dynamically optimized human data over kinematic retargeting combined with goal conditioning. Specifically, dynamic optimization allows the robot to learn control policies that more accurately replicate the velocity profiles observed in the human demonstrations, resulting in improved velocity tracking capabilities and more realistic movement patterns. This suggests that leveraging the full dynamics captured in human motion, rather than relying solely on positional data, is critical for training high-performance robotic controllers.

Several methodologies demonstrate the efficacy of human demonstration data in training robust robotic controllers. DeepMimic employs a deep neural network trained with a policy gradient approach, directly mapping raw, high-dimensional sensory inputs to control torques, and achieving complex acrobatic motions. Zest utilizes a data-driven approach to learn a stochastic policy, allowing the robot to adapt to variations in the environment and recover from disturbances. KungfuBot, similarly, leverages deep learning to learn complex, dynamic movements from human demonstrations, focusing on imitation and generalization to new scenarios. These methods consistently demonstrate improved performance and robustness compared to traditional control techniques by capitalizing on the richness and efficiency inherent in human movement data.

This pipeline leverages state-constrained dynamic optimization and a control-guided cost function to retarget human motion data, creating a reusable reference library for zero-shot transfer to robotic hardware and autonomous control.

The Futile Pursuit of Absolute Certainty

Control Lyapunov Functions (CLFs) are integrated into the reinforcement learning (RL) reward structure to create a method termed CLF-RL, specifically designed to enhance operational safety. CLFs are mathematical functions that provide a rigorous framework for guaranteeing the stability of a dynamical system; by incorporating the CLF as a component of the RL reward, the agent is incentivized to learn policies that not only achieve the desired task but also maintain stable system behavior. This approach contrasts with traditional RL methods by explicitly addressing stability concerns during the learning process, preventing the robot from converging to or exploring unstable states. Experimental results demonstrate that CLF-RL consistently outperforms methods relying solely on mimic rewards across various reference tracking scenarios, resulting in reduced tracking error and improved overall performance.

Control Lyapunov Functions offer a mathematically grounded route to stability guarantees: the controller ensures that an energy-like function of the system state consistently decreases during operation, preventing divergence into unsafe states. Because CLF-RL enforces this decrease condition directly within the learning process, it guides the robot toward safe and accurate trajectory execution, which accounts for the lower reference tracking error observed relative to mimic-reward baselines.
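One way a CLF can be folded into an RL reward, sketched here as a generic instance of the technique rather than the paper's exact formulation, is to penalize violations of the decrease condition dV/dt + lam * V <= 0, with the derivative estimated by finite differences:

```python
import numpy as np

def clf_reward_term(V_prev, V_curr, dt, lam=1.0, w_clf=0.5):
    """Penalize violation of the CLF decrease condition
    dV/dt + lam * V <= 0 (finite-difference estimate). The exact
    form and weights in the paper may differ; this is a sketch."""
    V_dot = (V_curr - V_prev) / dt              # finite-difference dV/dt
    violation = max(0.0, V_dot + lam * V_curr)  # positive only when destabilizing
    return -w_clf * violation                   # zero penalty when condition holds

def lyapunov_candidate(x, x_des, P):
    """Quadratic candidate V(x) = (x - x_des)^T P (x - x_des)."""
    e = x - x_des
    return float(e @ P @ e)
```

When the learned policy already drives V down fast enough, the term contributes nothing, so it shapes exploration toward stable behavior without overriding the task reward.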

Model Predictive Control (MPC) is utilized in conjunction with Control Barrier Functions (CBFs) to guarantee safe robot operation during trajectory generation. CBFs define admissible sets of states, ensuring the system remains within safe boundaries; these functions are incorporated as constraints within the MPC optimization problem. MPC then predicts future system behavior over a finite horizon, selecting control inputs that minimize a cost function while simultaneously satisfying the CBF constraints. This formulation allows for proactive safety enforcement, generating trajectories that avoid potentially hazardous states and maintain stability even in dynamic environments. The combination of MPC and CBFs effectively addresses safety-critical applications by explicitly considering and enforcing safety constraints during the control design process.
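For a single affine constraint, the CBF condition admits a closed-form projection of the nominal command onto the safe set; the one-step sketch below illustrates the core idea and is an assumption about the general technique, not the paper's full MPC formulation:

```python
import numpy as np

def cbf_filter(u_nom, a, b):
    """Closed-form solution of  min ||u - u_nom||^2  s.t.  a @ u >= b.
    In a CBF setting, a @ u >= b encodes h_dot(x, u) >= -gamma * h(x),
    which keeps the safe set {x : h(x) >= 0} forward-invariant. A real
    MPC+CBF controller stacks many such constraints over a prediction
    horizon; this is a one-constraint, one-step sketch."""
    slack = b - a @ u_nom
    if slack <= 0:              # nominal command already satisfies the constraint
        return u_nom
    # Otherwise project u_nom onto the constraint boundary along a.
    return u_nom + (slack / (a @ a)) * a
```

The filter leaves the nominal controller untouched whenever it is already safe, intervening minimally only near the constraint boundary, which is the usual appeal of the CBF formulation.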

FAST-LIO, a computationally efficient LiDAR-inertial odometry algorithm, provides the Unitree G1 humanoid robot with the robust state estimation and odometry critical for accurate control and navigation. The system tightly fuses LiDAR scans with IMU measurements through an iterated extended Kalman filter, estimating the robot's 3D pose, position and orientation, in real time on the robot's onboard hardware while incrementally building a map of the surrounding environment. The resulting state estimates are essential for trajectory planning, obstacle avoidance, and maintaining stability during locomotion, and remain accurate in the dynamic, fast-motion conditions where traditional SLAM approaches often degrade.

Utilizing lidar-based perception and CBF-constrained Model Predictive Control, the robot autonomously avoids collisions while maintaining locomotion in real-world environments.

The Illusion of Novelty, and the Inevitability of Emergence

The synthesis of controllers for complex dynamic maneuvers relies increasingly on the convergence of data-driven methodologies and safety-critical control principles. This approach moves beyond traditional, hand-tuned control systems by allowing robots to learn from data, adapting to intricate movements and unpredictable environments. Through techniques like reinforcement learning and imitation learning, controllers are developed that can execute demanding tasks, such as running, jumping, and navigating challenging terrain, while maintaining stability and avoiding potential hazards. Rigorous verification and validation processes, incorporating formal methods and runtime monitoring, are essential to guarantee the safety and reliability of these learned controllers, ensuring predictable performance even in unforeseen circumstances. The result is a new generation of robotic systems capable of executing complex motions with a level of agility and robustness previously unattainable.

Generating the precise movements required for complex robotic tasks relies heavily on a process called Dynamic Optimization, often implemented using a technique known as Multiple Shooting. This computational method effectively breaks down the desired motion into a series of shorter segments, optimizing each individually while ensuring seamless transitions between them. By formulating the robotic control problem as a mathematical optimization, researchers can define constraints – such as joint limits or ground contact – and an objective function that prioritizes smoothness, speed, or energy efficiency. Multiple Shooting enhances this process by simultaneously solving for the robot’s state and control inputs across these segments, improving the accuracy and feasibility of the resulting trajectory. The generated, optimized motions then serve as references for the robot to learn from, allowing it to replicate and adapt these complex maneuvers in real-world scenarios.
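The structure of multiple shooting, segment-local integration glued together by defect constraints, can be illustrated on a toy double integrator. The horizon, dynamics, and target below are illustrative assumptions and far simpler than the paper's full-order humanoid problem:

```python
import numpy as np
from scipy.optimize import minimize

# Multiple-shooting sketch on p' = v, v' = u: three segments with
# piecewise-constant controls, node states as decision variables, and
# "defect" constraints forcing segment ends to match the next node.
N_SEG, DT, STEPS = 3, 0.5, 10   # segments, segment duration, Euler substeps

def integrate(x0, u):
    """Integrate one segment with forward Euler under constant control u."""
    p, v = x0
    h = DT / STEPS
    for _ in range(STEPS):
        p, v = p + h * v, v + h * u
    return np.array([p, v])

def unpack(z):
    # z = [u_0..u_2, x_1, x_2 (2 entries each)]; x_0 is fixed at the origin.
    u = z[:N_SEG]
    nodes = [np.zeros(2)] + [z[N_SEG + 2 * i: N_SEG + 2 * i + 2]
                             for i in range(N_SEG - 1)]
    return u, nodes

def defects(z):
    """Continuity residuals: each segment's end must match the next node."""
    u, nodes = unpack(z)
    res = []
    for i in range(N_SEG - 1):
        res.extend(integrate(nodes[i], u[i]) - nodes[i + 1])
    return np.array(res)

def terminal(z):
    u, nodes = unpack(z)
    # Toy "reference": reach p = 1 with v = 0 at the end of the horizon.
    return integrate(nodes[-1], u[-1]) - np.array([1.0, 0.0])

z0 = np.zeros(N_SEG + 2 * (N_SEG - 1))
sol = minimize(lambda z: np.sum(z[:N_SEG] ** 2), z0, method="SLSQP",
               constraints=[{"type": "eq", "fun": defects},
                            {"type": "eq", "fun": terminal}])
```

Because each segment is integrated only locally, the sensitivity of late constraints to early decision variables stays bounded, which is exactly why multiple shooting scales better than single shooting on long, dynamic maneuvers.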

The creation of robust robotic control systems often necessitates a careful trade-off between the fidelity of motion planning and the computational resources required for implementation. Researchers address this challenge by employing both Full Order Dynamic Trajectories and Reduced Order Models during the training process. Full Order models, which account for all degrees of freedom and system dynamics, provide highly accurate reference motions, but demand significant processing power. Conversely, Reduced Order Models simplify the system representation, enabling faster computation and real-time control, though potentially sacrificing some precision. By strategically utilizing both approaches – leveraging the accuracy of Full Order models for initial learning and transitioning to the efficiency of Reduced Order models for deployment – robotic systems can achieve complex maneuvers while maintaining responsiveness and practicality in diverse environments.
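A standard example of a reduced-order model for bipedal locomotion is the linear inverted pendulum (LIP); whether this paper uses the LIP specifically is not stated here, so the sketch below is representative of the class rather than a description of the method:

```python
def lip_step(x, v, x_foot, z0=0.8, dt=0.01, g=9.81):
    """One Euler step of the linear inverted pendulum (LIP): a point-mass
    center of mass at constant height z0 is accelerated away from the
    stance foot at x_foot. Parameters are illustrative defaults."""
    omega2 = g / z0                 # squared LIP natural frequency
    a = omega2 * (x - x_foot)       # CoM acceleration from the pendulum dynamics
    return x + dt * v, v + dt * a   # next CoM position and velocity
```

Because the LIP is linear, foot placements that stabilize a commanded speed can be computed in closed form, which is what makes reduced-order models cheap enough for real-time deployment even when the full-order model supplied the training references.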

The system demonstrates a significant advancement in robotic locomotion through the implementation of a Diffusion Policy, enabling the generation of entirely new and varied movements, rather than simply replicating learned examples. This approach moves beyond the limitations of pre-programmed routines, allowing the robot to adapt to unforeseen circumstances and navigate complex terrains with greater flexibility. Validated through extensive outdoor trials – including runs exceeding 250 meters in length – the Diffusion Policy proves capable of synthesizing robust and dynamic motions, effectively extending the robot’s operational range and establishing a pathway toward truly autonomous navigation and manipulation in unstructured environments.

Multiple shooting trajectory optimization solves for motion by dividing the problem into hybrid domains connected at reset nodes, optimizing to track a human reference trajectory.

The pursuit of autonomy, as demonstrated by this work on dynamic retargeting and CLF-RL for humanoid running, echoes a fundamental truth: systems reveal themselves through operation, not design. The paper’s success isn’t merely in achieving performant locomotion, but in integrating this capability into a functional autonomy stack capable of real-world navigation. This echoes Liskov’s sentiment: “Programs must be right first before they are fast.” The robustness gained through Control Lyapunov Functions isn’t about preventing failure, but about building a system that gracefully reveals its limitations through operation, allowing for iterative refinement and a deeper understanding of the underlying dynamics. Monitoring, in this context, becomes the art of fearing consciously – anticipating the inevitable revelation of a system’s boundaries.

The Horizon Recedes

This work, predictably, does not solve locomotion. It merely shifts the boundaries of what is predictable. The integration of optimized trajectories with CLF-RL offers a temporary reprieve from the inevitable chaos of real-world interaction, but stability is merely an illusion that caches well. Each successful stride is, implicitly, a catalog of failures avoided – a record of the infinite configurations the system did not encounter. The autonomy stack, lauded as a triumph of integration, is simply a more elaborate mechanism for delaying the inevitable encounter with a configuration it cannot handle.

Future efforts will undoubtedly focus on expanding the scope of ‘handled’ configurations. Yet, a guarantee of comprehensive coverage is a contract with probability. The true challenge isn’t building more robust controllers, but accepting that the system will always be fundamentally incomplete. The emphasis should move from minimizing deviation from a planned trajectory to maximizing the capacity to recover from unexpected deviations.

Ultimately, this is not an engineering problem, but an ecological one. These robots aren’t tools to be built, but ecosystems to be grown. The pursuit of ‘autonomy’ isn’t about achieving perfect control, but fostering resilience. Chaos isn’t failure – it’s nature’s syntax. The next generation of research will recognize this, and focus on cultivating systems that can not only run, but also fall gracefully.


Original article: https://arxiv.org/pdf/2603.25902.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-30 08:13