Speak to Move: Teaching Robots to Understand Your Commands

Author: Denis Avetisyan


Researchers have developed a new framework that allows humanoid robots to be controlled through natural language, bridging the gap between human intention and robotic action.

Humanoid-LLA decomposes high-level commands, such as instructions for a figure-eight walk, into a sequence of unified motion tokens (a vocabulary bridging natural language and action-level control) to produce physically realistic, full-body movements, demonstrating a pathway from abstract goals to embodied robotic behaviors.

This work introduces Humanoid-LLA, a system leveraging large language models and a unified motion vocabulary for open-vocabulary, language-conditioned whole-body control of humanoid robots.

Despite advances in robotics, enabling humanoids to reliably follow open-ended language commands remains a significant challenge. This paper introduces ‘Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary’, presenting Humanoid-LLA, a novel framework that bridges the gap between natural language and physically plausible robot actions. By integrating large language models with a unified motion vocabulary and reinforcement learning, the authors demonstrate improved language generalization, motion fidelity, and successful real-world deployment on a humanoid robot. Could this approach unlock truly intuitive and versatile human-robot collaboration in complex environments?


The Illusion of Fluidity: Why Robots Still Can’t Move Like Us

Despite advancements in robotic control, replicating the fluidity and complexity of human movement in humanoid robots remains a significant hurdle. Current techniques, such as motion diffusion models (MDM) and imitation policies like OmniH2O and LangWBC, frequently fall short when attempting to translate observed human actions into stable and realistic robot behaviors. These methods often struggle with the inherent differences in morphology – the varying joint configurations, limb lengths, and degrees of freedom – between humans and their robotic counterparts. Consequently, robots may exhibit jerky, unnatural motions or fail to adapt to unforeseen circumstances, highlighting a critical gap in transferring learned behaviors from the biological realm to artificial systems. The difficulty lies not simply in mimicking what humans do, but in understanding and recreating how they do it, factoring in subtle adjustments and dynamic balance that are easily achieved through biological control but prove elusive for even the most sophisticated algorithms.

Current humanoid control systems, while demonstrating impressive feats of locomotion, frequently stumble when tasked with replicating the subtlety of human movement. The issue isn’t necessarily a failure to move, but rather a deficiency in mirroring the intricate adjustments and seamless transitions that characterize natural human behavior. This manifests as jerky motions, awkward postures, and a general lack of fluidity – qualities vital for tasks demanding dexterity or social interaction. The resulting instability isn’t always catastrophic, but it frequently necessitates conservative, simplified movements, limiting the robot’s capabilities and hindering its ability to navigate complex real-world scenarios with the same grace and efficiency as a human. Consequently, these systems often prioritize completing an action over performing it naturally, highlighting a significant gap between functional capability and genuine embodiment.

Successfully transferring human motion to humanoid robots requires more than simply replicating joint angles; a fundamental difficulty lies in representing movement itself in a way that isn’t tied to the unique skeletal structure of humans. Current approaches often struggle because they attempt to directly map human kinematics – the specific relationships between bones and joints – onto robots with drastically different anatomy. This leads to brittle systems, easily disrupted by even minor variations in robot morphology or environment. Researchers are actively exploring motion representations that focus on the intent of the movement – the desired outcome, such as reaching for an object or maintaining balance – rather than the precise path taken by human joints. By abstracting away from human-specific details and focusing on these higher-level goals, it becomes possible to develop control policies that can generalize across different robotic platforms and adapt to unforeseen circumstances, ultimately bridging the gap between natural human movement and robust humanoid control.

Humanoid-LLA learns diverse, physically realistic behaviors by first building a motion vocabulary from paired human and humanoid data, then distilling a controller that directly leverages this vocabulary within a physics simulation.

Stripping Away the Illusion: A Universal Motion Language

The proposed framework utilizes a ‘Unified Motion Vocabulary’ as its central component, implemented as a discrete codebook. This codebook serves to represent fundamental motion primitives – the basic, reusable building blocks of complex movements. Each entry within the codebook corresponds to a specific motion primitive, allowing for the decomposition of arbitrary motions into combinations of these discrete units. This discrete representation facilitates both motion analysis and synthesis, and provides a standardized means of comparing and transferring motions across different embodiments. The vocabulary is not pre-defined, but rather learned directly from motion data, enabling adaptation to the specific characteristics of the dataset.
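As a rough illustration of what such a discrete codebook does, the Python sketch below maps continuous motion latents to token indices by nearest-neighbor lookup and back again. The sizes are invented and the codebook is random for the example; in the actual framework the vocabulary is learned from motion data.

```python
import numpy as np

# Invented sizes and a random codebook; in the framework the codebook is learned.
CODEBOOK_SIZE = 512  # number of discrete motion primitives
LATENT_DIM = 64      # dimensionality of each primitive embedding

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def tokenize(latents):
    """Map continuous motion latents (frames, LATENT_DIM) to discrete token
    ids via nearest-neighbor lookup in the codebook."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (frames,) token ids

def detokenize(tokens):
    """Recover the quantized latents a decoder would consume."""
    return codebook[tokens]

motion_latents = rng.normal(size=(120, LATENT_DIM))  # e.g. 120 encoded frames
tokens = tokenize(motion_latents)
print(tokens[:10])  # the motion is now a sequence of vocabulary indices
```

Once a motion is a sequence of indices, comparing, composing, and transferring movements reduces to operations on token sequences rather than raw joint trajectories.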

The Unified Motion Vocabulary is learned through a Vector-Quantized Variational Autoencoder (VQ-VAE) architecture incorporating Implicit Partitioning. Standard VQ-VAE implementations utilize a fixed codebook size, limiting representational capacity. Implicit Partitioning dynamically adjusts the codebook’s granularity during training, effectively increasing its capacity without a proportional increase in parameters. This is achieved by learning a distribution over codebook embeddings, allowing for finer-grained representation of motion primitives and improved expressiveness in reconstructing and generating complex movements. The resulting vocabulary benefits from a larger effective capacity, enabling it to capture a wider range of motion data and generalize more effectively to novel scenarios.
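For readers who want the mechanics, here is a minimal PyTorch sketch of the standard VQ bottleneck such an architecture builds on, using the usual straight-through gradient estimator. It deliberately omits Implicit Partitioning itself, whose details are not described in this article, and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck with a straight-through gradient estimator.
    Sizes are placeholders; Implicit Partitioning is not reproduced here."""

    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, dim) encoder outputs
        dists = torch.cdist(z_e, self.codebook.weight)  # (batch, num_codes)
        ids = dists.argmin(dim=1)
        z_q = self.codebook(ids)
        # Codebook term pulls codes toward encodings; commitment term does the reverse.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through: gradients pass to the encoder as if quantization were identity.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, ids, loss

vq = VectorQuantizer()
z_q, ids, vq_loss = vq(torch.randn(8, 64))
print(ids.shape, vq_loss.item())
```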

During training, cross-embodiment reconstruction techniques enforce consistency between human and humanoid motions by requiring the model to accurately reconstruct one modality given the other, both represented within the unified motion vocabulary. Specifically, a human motion sequence is encoded into the discrete vocabulary space, then decoded back into a humanoid motion, and vice versa. The reconstruction loss, calculated as the mean squared error between the original and reconstructed motions, penalizes discrepancies and encourages the learning of a shared, consistent representation. This process effectively bridges the kinematic differences between human and humanoid forms, enabling transfer of motion primitives across embodiments and improving the generalization capability of the model.
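A minimal sketch of this objective, assuming simple linear encoders and decoders as stand-ins for the real sequence models and invented pose dimensions, might look like the following:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Invented pose dimensions; real encoders/decoders would be sequence models.
HUMAN_DIM, ROBOT_DIM, LATENT_DIM = 63, 29, 64

enc_human = nn.Linear(HUMAN_DIM, LATENT_DIM)  # human pose -> shared latent
enc_robot = nn.Linear(ROBOT_DIM, LATENT_DIM)  # robot pose -> shared latent
dec_human = nn.Linear(LATENT_DIM, HUMAN_DIM)
dec_robot = nn.Linear(LATENT_DIM, ROBOT_DIM)
codebook = torch.randn(512, LATENT_DIM)       # stands in for the learned vocabulary

def quantize(z):
    """Nearest-code lookup with a straight-through gradient."""
    ids = torch.cdist(z, codebook).argmin(dim=1)
    return z + (codebook[ids] - z).detach()

def cross_embodiment_loss(human_pose, robot_pose):
    """Encode one embodiment, quantize into the shared vocabulary,
    decode as the other, and penalize reconstruction error both ways."""
    z_h = quantize(enc_human(human_pose))
    z_r = quantize(enc_robot(robot_pose))
    loss_h2r = F.mse_loss(dec_robot(z_h), robot_pose)  # human -> humanoid
    loss_r2h = F.mse_loss(dec_human(z_r), human_pose)  # humanoid -> human
    return loss_h2r + loss_r2h

human = torch.randn(32, HUMAN_DIM)  # paired frames (invented data)
robot = torch.randn(32, ROBOT_DIM)
print(cross_embodiment_loss(human, robot).item())
```

Because both directions pass through the same codebook, the loss can only be driven down if the vocabulary captures structure common to both embodiments.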

Demonstrating robust language understanding, this humanoid robot successfully executes complex, unseen instructions, including those referencing concepts like ‘soldier’ or ‘martial arts’, through free-form language-conditioned whole-body control.

From Words to Walkways: Translating Intent into Action

The Humanoid-LLA framework establishes a direct correspondence between natural language commands and robot actions through the use of discrete motion tokens. These tokens represent fundamental movement primitives and are defined within a unified vocabulary encompassing both language and action space. Incoming language instructions are processed and translated into a sequence of these tokens, effectively converting a high-level command into a series of executable steps for the humanoid robot. This token-based approach allows for a structured representation of desired movements, enabling the system to interpret and execute complex instructions by composing these discrete actions.
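One plausible way to realize such a unified vocabulary, sketched below with a toy text vocabulary and invented ids, is to append the motion codebook’s indices after the language tokens so a single model can read instructions and emit actions in one stream. The paper’s actual tokenization scheme may differ.

```python
# A toy unified vocabulary: motion-codebook indices are appended after the
# text vocabulary so one model can consume instructions and emit actions.
# All ids and the tiny text vocabulary are invented for illustration.

TEXT_VOCAB = {"walk": 0, "in": 1, "a": 2, "figure-eight": 3}
NUM_MOTION_TOKENS = 512
MOTION_OFFSET = len(TEXT_VOCAB)  # motion ids start after the text ids

def motion_token(code_id: int) -> int:
    """Map a motion-codebook index to its id in the joint vocabulary."""
    assert 0 <= code_id < NUM_MOTION_TOKENS
    return MOTION_OFFSET + code_id

# Encode an instruction as text ids; pretend the model answered with motion ids.
instruction_ids = [TEXT_VOCAB[w] for w in "walk in a figure-eight".split()]
predicted_ids = [motion_token(c) for c in (17, 3, 3, 88)]  # dummy model output
codebook_indices = [t - MOTION_OFFSET for t in predicted_ids]
print(instruction_ids, codebook_indices)  # [0, 1, 2, 3] [17, 3, 3, 88]
```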

The Humanoid-LLA framework employs Large Language Models (LLMs) to convert natural language instructions into discrete sequences of action tokens. To enhance the logical flow and consistency of these action sequences, a technique called Motion Chain-of-Thought reasoning is integrated. This involves prompting the LLM to explicitly articulate intermediate steps or reasoning before generating the final action sequence, effectively simulating a planning process. By explicitly modeling the relationship between instruction, reasoning, and action, Motion Chain-of-Thought improves the coherence and interpretability of the generated actions, leading to more reliable execution by the humanoid robot.
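A hypothetical prompt template illustrates the idea; the wording below is an assumption made for demonstration, not the prompt used by Humanoid-LLA.

```python
# A hypothetical Motion Chain-of-Thought prompt; the wording is an assumption
# for illustration, not the template used by Humanoid-LLA.
PROMPT_TEMPLATE = """Instruction: {instruction}

Reasoning: break the movement into ordered phases, noting balance,
footwork, and arm coordination for each phase.

Plan:
1. ...

Action tokens:"""

def build_prompt(instruction: str) -> str:
    """Wrap a free-form command in the reasoning-first template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

print(build_prompt("walk in a figure-eight, then wave with the right hand"))
```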

The Vocabulary-Directed Controller functions as the final stage in translating linguistic commands into physical robot actions. This controller utilizes a pre-defined, discrete vocabulary of motor commands, mapping each token generated by the LLM to a specific trajectory for the humanoid robot’s joints. This token-to-trajectory mapping enables precise control over the robot’s degrees of freedom, facilitating complex movements. By constraining the action space to this vocabulary, the controller minimizes ambiguity and promotes stable, natural-looking motion, while also simplifying the control problem and reducing computational demands compared to continuous control methods. The discrete nature of the vocabulary also allows for efficient planning and execution of action sequences.
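To make the token-to-trajectory idea concrete, the sketch below expands each token into a short joint-space segment and cross-fades the seams so targets stay continuous. This lookup-and-blend scheme is a simplification of my own: the actual Vocabulary-Directed Controller is learned, and the segment data here is random.

```python
import numpy as np

# Invented numbers: each motion token expands to a short joint-space segment.
NUM_TOKENS, SEG_LEN, NUM_JOINTS = 512, 10, 29

rng = np.random.default_rng(1)
token_trajectories = rng.normal(size=(NUM_TOKENS, SEG_LEN, NUM_JOINTS))

def tokens_to_trajectory(tokens, blend=3):
    """Concatenate per-token segments, linearly cross-fading `blend` frames
    at each seam so the joint targets stay continuous."""
    out = token_trajectories[tokens[0]].copy()
    for t in tokens[1:]:
        seg = token_trajectories[t]
        w = np.linspace(0.0, 1.0, blend)[:, None]  # fade-in weights
        out[-blend:] = (1 - w) * out[-blend:] + w * seg[:blend]
        out = np.concatenate([out, seg[blend:]], axis=0)
    return out  # (frames, NUM_JOINTS) targets for the low-level controller

trajectory = tokens_to_trajectory([17, 3, 88])
print(trajectory.shape)  # (24, 29): three segments joined with blended seams
```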

Quantitative evaluation of the Humanoid-LLA framework in physical robot execution environments demonstrates statistically significant performance gains compared to existing methods. Specifically, the framework achieves a higher success rate in completing instructed tasks, coupled with reductions in both positional and kinetic errors. Mean per-joint position error is demonstrably lower, indicating improved accuracy in reaching target configurations. Furthermore, measured velocity error and acceleration error are reduced, signifying smoother and more natural robot movements. These improvements are consistently observed across a range of tasks and environmental conditions, validating the efficacy of the language-to-action control system.
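These metrics are straightforward to compute. Below is a sketch assuming 3D joint positions sampled at 30 fps, with both the frame rate and the synthetic data chosen arbitrarily for the example:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and reference joint positions. Shapes: (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def derivative_error(pred, gt, dt, order=1):
    """Velocity (order=1) or acceleration (order=2) error computed with
    finite differences along the time axis."""
    dp, dg = pred, gt
    for _ in range(order):
        dp = np.diff(dp, axis=0) / dt
        dg = np.diff(dg, axis=0) / dt
    return np.linalg.norm(dp - dg, axis=-1).mean()

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 29, 3))            # reference motion (invented)
pred = gt + 0.01 * rng.normal(size=gt.shape)  # slightly perturbed prediction
dt = 1 / 30                                   # assumed 30 fps
print(mpjpe(pred, gt))
print(derivative_error(pred, gt, dt, order=1))  # velocity error
print(derivative_error(pred, gt, dt, order=2))  # acceleration error
```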

The Illusion of Intelligence: Towards Robust and Adaptable Humanoids

The Humanoid-LLA framework benefits from a refinement process leveraging reinforcement learning, enabling more sophisticated and adaptable movement capabilities. This approach allows the humanoid to not only execute pre-programmed motions, but also to dynamically adjust to unforeseen circumstances and complex environmental interactions. By employing reinforcement learning, the framework learns through trial and error, optimizing its control policies to achieve desired behaviors in varied and challenging scenarios. This results in a system capable of generating more fluid, natural, and robust movements, moving beyond static animations to achieve truly dynamic and responsive humanoid locomotion and manipulation.

The training of the Humanoid-LLA framework benefits from the implementation of Group Relative Policy Optimization, a technique designed to address common challenges in reinforcement learning. This method improves both the stability of the learning process and its sample efficiency – meaning the framework can learn effectively from fewer experiences. By grouping similar actions and optimizing policies relative to these groups, the algorithm reduces variance during training and accelerates convergence. This approach allows for more robust learning, particularly crucial when dealing with the complex, high-dimensional state spaces inherent in humanoid locomotion and manipulation.
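The heart of the method is easy to state: instead of relying on a learned value baseline, each rollout is scored against the statistics of a group of rollouts sampled for the same command. A minimal sketch of that group-relative scoring:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Score each rollout against the mean and spread of its own group,
    removing the need for a learned value baseline.
    rewards: (num_groups, rollouts_per_group)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

# E.g. 4 language commands, 8 policy rollouts per command (invented numbers).
rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=0.5, size=(4, 8))
advantages = group_relative_advantages(rewards)
print(advantages.mean(axis=1))  # ~0 within each group by construction
```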

The successful execution of complex humanoid motions hinges on carefully designed reward functions that guide the learning process. This framework utilizes a tiered reward system, beginning with Tracking Reward, which incentivizes accurate adherence to desired trajectories and poses. Complementing this is Physical Fidelity Reward, a critical component that penalizes unrealistic or physically implausible movements, promoting natural and balanced gaits. Finally, Distributional Reward broadens the scope of acceptable behaviors, encouraging the humanoid to explore a range of plausible actions rather than converging on a single, potentially brittle solution, ultimately leading to more robust and adaptable performance in dynamic scenarios.
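As a sketch of how such a tiered reward might be composed, the weights, shaping functions, and physical limits below are invented for illustration; the paper’s actual terms are not reproduced here.

```python
import numpy as np

# Invented weights and limits; the paper's actual coefficients are not given here.
W_TRACK, W_PHYS, W_DIST = 1.0, 0.5, 0.1

def tracking_reward(joint_pos, ref_pos, sigma=0.25):
    """Exponential of the negative tracking error, a common shaping choice."""
    err = np.linalg.norm(joint_pos - ref_pos, axis=-1).mean()
    return np.exp(-err / sigma)

def physical_fidelity_reward(joint_acc, torque, a_max=50.0, t_max=80.0):
    """Penalize implausible accelerations and torques, clipped to [0, 1]."""
    penalty = (np.abs(joint_acc) / a_max).mean() + (np.abs(torque) / t_max).mean()
    return max(0.0, 1.0 - penalty)

def total_reward(joint_pos, ref_pos, joint_acc, torque, dist_bonus):
    """Weighted sum of the three tiers; dist_bonus stands in for the
    Distributional Reward term described above."""
    return (W_TRACK * tracking_reward(joint_pos, ref_pos)
            + W_PHYS * physical_fidelity_reward(joint_acc, torque)
            + W_DIST * dist_bonus)

rng = np.random.default_rng(0)
pos, ref = rng.normal(size=(29, 3)), rng.normal(size=(29, 3))
acc, tau = rng.normal(size=29), rng.normal(size=29)
print(total_reward(pos, ref, acc, tau, dist_bonus=0.5))
```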

The culmination of integrating reinforcement learning with the Humanoid-LLA framework yields demonstrably superior performance across a range of challenging motion tasks. Rigorous evaluation, detailed in Tables 1 and 2, reveals substantial gains in key metrics when compared to established baseline methods. These improvements aren’t merely incremental; the framework exhibits a heightened capacity for both accurately tracking desired movements and generating physically plausible humanoid actions. This is achieved through a carefully constructed reward system that incentivizes not just successful completion of a task, but also the naturalness and stability of the resulting motion, ultimately paving the way for more realistic and adaptable virtual humanoids.

The pursuit of seamless language-conditioned control, as demonstrated by Humanoid-LLA, inevitably invites future complications. It’s a familiar pattern: elegant architectures built upon tokenized motion vocabularies will, with sufficient production load, reveal unforeseen edge cases. As Grace Hopper observed, “It’s easier to ask forgiveness than it is to get permission.” This sentiment resonates deeply; the framework’s initial success doesn’t guarantee sustained robustness. Every optimization, every carefully curated vocabulary, will eventually require patching, refactoring, or outright replacement. The system isn’t a finished product, but a temporary truce negotiated with the chaos of real-world deployment. It’s a compromise that has survived, for now, the relentless pressure of physics and unpredictable user input.

The Road Ahead

Humanoid-LLA neatly packages a familiar ambition – robots understanding natural language – into a more scalable architecture. The claim of an ‘open vocabulary’ will, predictably, encounter the limitations of embodiment. The model doesn’t understand “carefully” any more than the bug tracker understands “critical.” It simply tokenizes, and physical reality has a knack for generating tokens not present in the training data. Expect a proliferation of edge-case failures, each more absurd than the last, and a corresponding escalation in the complexity of failure modes.

The distillation process, while elegant, feels like a sophisticated form of planned obsolescence. Reinforcement learning is still, at its core, expensive trial and error. The real cost won’t be computational, but the accumulated wear and tear on actuators and the inevitable need to replace hardware prematurely stressed by the pursuit of linguistic fidelity. One suspects the next iteration will focus less on vocabulary and more on damage control.

The promise isn’t control, not really. It’s the illusion of control, a convincing enough performance to justify further investment. The system doesn’t deploy; it’s released. And the resulting data – the crumpled limbs, the near misses, the unexpected interactions – will, as always, be the most valuable contribution.


Original article: https://arxiv.org/pdf/2511.22963.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
