Robots That Groove: Bridging Language and Movement

Author: Denis Avetisyan


Researchers have developed a new framework enabling humanoid robots to seamlessly translate diverse inputs, from spoken commands to musical rhythms, into fluid, real-time motion.

UniAct demonstrates a versatile capacity for translating diverse instructional cues – including sequential text, curved trajectories, musical rhythms, and even internet-sourced human movements – into natural and coordinated humanoid motion, achieving zero-shot transfer of complex actions without requiring task-specific training.

UniAct unifies motion generation and streaming using large language models and discrete motion tokens for robust humanoid control.

Achieving truly versatile humanoid robots requires bridging the gap between high-level, multimodal perception and real-time, whole-body control, a challenge complicated by the heterogeneity of input signals. This paper introduces UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots, a novel framework that unifies language, music, and trajectories through discrete motion tokenization and a fine-tuned large language model. By streaming actions through a physically grounded manifold, UniAct achieves sub-500ms latency and a 19% improvement in zero-shot motion tracking success. Could this approach unlock a new era of responsive, general-purpose humanoid assistants capable of seamless interaction across diverse real-world scenarios?


Bridging the Gap: From Intention to Fluid Motion

Achieving truly natural and responsive movement in humanoid robots presents a formidable engineering challenge, largely due to the intricate nature of the ‘Gait Cycle’ – the sequence of steps and shifts in balance required for locomotion. This cycle isn’t simply a mechanical repetition; it’s a dynamically adjusted process influenced by terrain, speed, and unforeseen disturbances. Furthermore, translating abstract, high-level commands – such as “walk quickly to the door” – into the precise orchestration of dozens of actuators proves incredibly difficult. Current robotic systems often struggle to bridge this gap between instruction and action, frequently relying on pre-defined motion libraries or computationally intensive real-time calculations to maintain stability and responsiveness. The complexity arises not just from coordinating joint movements, but also from managing the robot’s center of gravity, predicting future states, and reacting to external forces – all while striving for the fluid, adaptable motion that characterizes human movement.

Current methodologies for humanoid robot locomotion frequently encounter limitations in achieving truly fluid and reactive movement. Many systems depend on meticulously crafted, pre-programmed sequences of actions, proving inflexible when faced with unexpected disturbances or changing environments. Alternatively, some approaches utilize computationally intensive optimization algorithms to calculate appropriate motor commands, but these often struggle to operate in real-time, hindering the robot’s ability to respond swiftly to dynamic situations. This reliance on either rigid pre-planning or extensive on-the-fly calculations presents a significant obstacle to creating robots capable of navigating complex, real-world scenarios with the grace and adaptability of a human.

Despite significant advancements in robotics, translating intentions expressed through language or other modalities – such as gestures or demonstrations – into fluid, natural human-like movement remains a considerable hurdle. Current ‘Text-to-Motion’ systems and related multimodal control schemes are fundamentally limited by their capacity to efficiently capture and decode the intricate dynamics inherent in complex actions. Representing the nuanced interplay between joint angles, velocities, and accelerations across the entire body, and then swiftly interpreting those representations in real-time, demands substantial computational resources and sophisticated algorithms. The challenge isn’t simply recognizing the intent of a command, but accurately predicting the precise sequence of muscle activations and body configurations required to execute it smoothly and responsively, a task that requires a more compact and expressive method of representing movement than is currently available.

Human motion guidance effectively controls a humanoid robot, demonstrating successful imitation of complex movements.

UniAct: A Unified Framework for Real-Time Responsiveness

UniAct introduces a new methodology for humanoid robot control by combining a multimodal large language model (MLLM) with two core technical components: precise whole-body tracking and a discrete motion representation utilizing ‘Motion Tokens’. This integration allows the system to interpret high-level instructions and translate them into robot actions. The MLLM serves as the central processing unit, receiving commands and generating a sequence of Motion Tokens. These tokens, representing pre-defined, quantized motion primitives, are then used to construct and execute complex movements. Robust whole-body tracking provides the necessary state estimation for accurate execution and feedback, ensuring the robot maintains stability and correctly performs the intended actions as dictated by the MLLM and token sequence.
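
To make that data flow concrete, the following is a minimal Python sketch of how such a pipeline might be wired together. All component names and methods here (`generate_tokens`, `decode`, `track`) are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Illustrative sketch of a UniAct-style control loop as described above.
# Every interface shown here is a hypothetical placeholder.

class UniActPipeline:
    def __init__(self, mllm, decoder, tracker):
        self.mllm = mllm        # multimodal LLM: instruction -> motion tokens
        self.decoder = decoder  # causal decoder: tokens -> DoF positions
        self.tracker = tracker  # low-level whole-body tracking controller

    def step(self, instruction, robot_state):
        # 1. High-level planning: map the instruction (text, music features,
        #    or a trajectory) to a short chunk of discrete motion tokens.
        tokens = self.mllm.generate_tokens(instruction, robot_state)

        # 2. Decode the tokens into continuous joint (DoF) position targets.
        dof_targets = self.decoder.decode(tokens, robot_state)

        # 3. Hand the targets to the tracking controller, which produces
        #    motor commands while keeping the robot balanced.
        return self.tracker.track(dof_targets, robot_state)
```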

Finite Scalar Quantization (FSQ) is utilized to convert continuous motion data – representing joint angles, velocities, and accelerations – into a discrete set of ‘Motion Tokens’. This process involves mapping continuous values onto a finite number of scalar levels, effectively reducing the dimensionality and complexity of the motion data. By representing motion as a sequence of these discrete tokens, computational efficiency is gained in both processing and generation; the quantized representation requires less memory and allows for faster calculations compared to directly manipulating continuous data streams. The number of scalar levels used in FSQ directly impacts the trade-off between compression efficiency and the fidelity of the reconstructed motion.
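
As a rough illustration of the idea, the sketch below quantizes latent motion vectors with a simple bounded-rounding scheme and packs the per-dimension codes into a single token id. The number of levels and the latent dimensionality are illustrative assumptions; the paper's actual tokenizer configuration is not specified in this article.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization sketch: bound each latent dimension, then
    round it to one of `levels[d]` evenly spaced values.  Returns the
    quantized latents and one integer token id per latent vector."""
    levels = np.asarray(levels)                # e.g. [8, 8, 8, 8]
    z = np.tanh(z)                             # bound to (-1, 1)
    half = (levels - 1) / 2.0
    codes = np.round((z + 1.0) * half)         # integer code per dimension
    z_q = codes / half - 1.0                   # snap back onto the grid
    # Combine per-dimension codes into a single token id (mixed-radix encoding).
    radix = np.cumprod(np.concatenate(([1], levels[:-1])))
    token_ids = (codes * radix).sum(axis=-1).astype(int)
    return z_q, token_ids

# Example: 4-D latents with 8 levels per dimension give 8**4 = 4096 tokens.
latents = np.random.randn(10, 4)
z_q, tokens = fsq_quantize(latents, levels=[8, 8, 8, 8])
```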

The Causal Decoder within UniAct functions by translating the discrete Motion Tokens – representing quantized motion data – into continuous Degrees of Freedom (DoF) positions for robot control. This conversion is achieved through a decoder architecture designed for temporal coherence, ensuring smooth and natural movements. The system is engineered to maintain a latency of less than 500ms, measured from input to final DoF position output, which is critical for real-time responsiveness and stable control of the humanoid robot. This performance level is achieved through optimizations in the decoder architecture and efficient processing of the discrete token stream.
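
The sketch below shows the general shape of such a streaming loop: only tokens received so far are decoded, and each step is checked against the latency budget. The `decoder.decode_step` and `robot.apply` interfaces are hypothetical placeholders, not the framework's real API.

```python
import time
from collections import deque

LATENCY_BUDGET_S = 0.5   # sub-500 ms target quoted in the text

def stream_actions(token_stream, decoder, robot, history_len=32):
    """Causally decode incoming motion tokens into DoF position targets."""
    history = deque(maxlen=history_len)   # only past tokens are visible
    for token in token_stream:
        t0 = time.monotonic()
        history.append(token)
        # Causal decoding: each output depends only on tokens received so
        # far, which is what makes streaming execution possible.
        dof_positions = decoder.decode_step(list(history))
        robot.apply(dof_positions)
        elapsed = time.monotonic() - t0
        if elapsed > LATENCY_BUDGET_S:
            # A real system would fall back to a safe behaviour here
            # (e.g. holding the last stable pose) rather than printing.
            print(f"warning: step took {elapsed * 1000:.0f} ms")
```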

The UniAct framework leverages the ‘BeyondMimic’ low-level tracking controller as a foundational element for maintaining stability and ensuring precise motion execution. BeyondMimic utilizes a model predictive control (MPC) approach, incorporating real-time state estimation and feedback to counteract disturbances and maintain desired joint positions and velocities. This controller is pre-trained on a diverse dataset of robot motions and is capable of adapting to varying dynamic conditions and payload changes. By providing a robust and reactive base, BeyondMimic allows the higher-level UniAct components – the MLLM, Motion Tokens, and Causal Decoder – to focus on planning and generating complex behaviors without being constrained by low-level control challenges, ultimately contributing to the system’s real-time performance and accuracy.
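
BeyondMimic itself is not reproduced here; as a deliberately simplified stand-in, the snippet below shows a generic joint-space PD tracking law, which illustrates only the basic feedback idea of driving measured joint states toward desired ones.

```python
import numpy as np

def pd_track(q_des, qd_des, q, qd, kp=80.0, kd=2.0):
    """Generic joint-space PD tracking law: torques that push measured joint
    positions `q` and velocities `qd` toward the desired trajectory.
    Illustrative only; not the BeyondMimic controller."""
    return kp * (np.asarray(q_des) - np.asarray(q)) \
         + kd * (np.asarray(qd_des) - np.asarray(qd))

# Example: hold a 29-DoF humanoid at a reference pose (gains are illustrative).
q_ref = np.zeros(29)
tau = pd_track(q_ref, np.zeros(29), q=np.random.randn(29) * 0.05, qd=np.zeros(29))
```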

UniAct synthesizes real-time humanoid motion by leveraging a server-side multimodal large language model (MLLM) that autoregressively generates DoF position tokens from text, music, and trajectory inputs, which are then decoded into continuous DoF positions and executed via a tracking controller after being tokenized using Finite Scalar Quantization (FSQ).

Expanding Control Modalities and Demonstrating Robustness

UniAct facilitates robot control through multiple input modalities beyond standard text prompts. Specifically, the framework accepts spatial trajectories as direct control signals, converting defined paths into robot movements via ‘Trajectory-to-Motion’. Additionally, UniAct incorporates ‘Music-to-Motion’ functionality, allowing the robot to synchronize its actions with musical beats and rhythms. This cross-modal capability expands the range of possible robot interactions and enables control schemes that are not reliant on linguistic commands.
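
As one plausible preprocessing step for the music modality, the sketch below extracts tempo and beat timestamps with librosa; the audio features UniAct actually consumes are not specified in this article, so this is only an illustrative conditioning signal.

```python
import librosa

def beat_conditioning(audio_path):
    """Extract tempo and beat timestamps from an audio file as one possible
    conditioning signal for music-to-motion generation."""
    y, sr = librosa.load(audio_path)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    # Generated motion keyframes could then be aligned to `beat_times`.
    return tempo, beat_times
```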

UniAct’s performance was validated through extensive testing utilizing the UA-Net multimodal dataset, a resource specifically designed for evaluating robotic control systems across diverse input types. This dataset comprises a broad spectrum of motion capture data paired with corresponding text, spatial trajectory, and musical beat inputs, enabling a comprehensive assessment of the framework’s capabilities. Evaluations on UA-Net demonstrate UniAct’s ability to generate a wide range of natural and responsive robot motions, effectively translating various input modalities into coherent and physically plausible actions. The dataset’s diversity allows for testing across different motion styles, speeds, and complexities, confirming the framework’s generalization capabilities.

UniAct exhibits significant robustness when processing imperfect input data, specifically demonstrating a 19% performance increase in zero-shot motion tracking when compared to existing state-of-the-art methods. This improvement was measured using low-quality reference motions, indicating the framework’s ability to generate acceptable outputs even with noisy or incomplete initial data. The metric focuses on the system’s capacity to accurately follow a target motion without prior training on similar examples, highlighting its adaptability and resilience to imperfect inputs.

UniAct demonstrates strong performance in zero-shot transfer learning, indicating its capability to generalize to novel instructions without requiring task-specific training data. This adaptability is achieved through the framework’s design, allowing it to effectively interpret and execute commands it has not previously encountered. Evaluations confirm that UniAct can successfully apply learned motion generation principles to entirely new prompts, surpassing the performance of existing methods in scenarios requiring generalization to unseen instruction types. This zero-shot capability is a key feature for real-world robotic applications where pre-training on every possible command is impractical.

This demonstration showcases successful cross-modal control of a humanoid robot, enabling manipulation based on diverse input modalities.

Towards Versatile and Intelligent Robots: A Vision for the Future

The UniAct control framework presents a significant advancement in humanoid robotics, poised to broaden the scope of robotic application across multiple sectors. This efficient and adaptable system moves beyond pre-programmed movements, enabling robots to respond dynamically to unforeseen circumstances and perform complex tasks in real-world settings. Consequently, opportunities arise in assistive robotics, where robots can provide personalized support to individuals with varying needs, and in the entertainment industry, where lifelike robotic performers can captivate audiences. Furthermore, the framework’s versatility extends to the realm of exploration, facilitating the deployment of robust and agile robots in challenging environments – from disaster zones to distant planets – where human presence is either impractical or dangerous. By streamlining robotic control and promoting adaptability, UniAct promises to accelerate the integration of humanoid robots into everyday life and unlock previously unattainable capabilities.

The UniAct control framework distinguishes itself by liberating robotic movement from the constraints of pre-recorded motion capture. Traditionally, robots relied on mimicking specific human movements, limiting their ability to adapt or improvise. This framework, however, allows for the generation of entirely new and original motions, as the control system isn’t tethered to a database of existing actions. Consequently, robots can respond to unforeseen circumstances and exhibit behaviors that appear more fluid, spontaneous, and – crucially – natural. This capacity is expected to significantly enhance human-robot interactions, creating experiences that are not simply functional, but also engaging and intuitive, paving the way for robots that can truly collaborate and connect with people in meaningful ways.

The convergence of UniAct with ‘Motion Matching’ techniques promises a significant leap in robotic movement quality. Motion Matching leverages extensive databases of human motion capture data, allowing robots to seamlessly blend pre-recorded movements to create remarkably natural and fluid actions. When integrated with UniAct’s adaptable control framework, this approach transcends the limitations of pre-programmed motions, enabling robots to respond to unforeseen circumstances with believable and dynamic behaviors. The system doesn’t simply play motions; it intelligently selects and combines them, resulting in a wider spectrum of possible actions and a heightened sense of realism in robotic performance. This synergy opens doors to more engaging human-robot interactions and expands the application of humanoid robots into scenarios demanding nuanced and adaptable movements, such as collaborative work environments or complex navigation tasks.
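
At its core, motion matching is a nearest-neighbour search over clip start frames. The sketch below illustrates that step with illustrative feature names and weights; it is not the exact cost function used by any particular system.

```python
import numpy as np

def motion_match(query, database, weights=(1.0, 1.0, 1.0)):
    """Pick the database clip whose starting frame best matches the current
    character state.  Feature vectors combine pose, velocity, and a short
    predicted root trajectory; the weights are illustrative."""
    w_pose, w_vel, w_traj = weights
    cost = (w_pose * np.linalg.norm(database["pose"] - query["pose"], axis=1)
            + w_vel * np.linalg.norm(database["velocity"] - query["velocity"], axis=1)
            + w_traj * np.linalg.norm(database["trajectory"] - query["trajectory"], axis=1))
    return int(np.argmin(cost))   # index of the clip to blend toward next
```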

Ongoing development prioritizes enhancing the framework’s capacity for robust learning, even with sparse datasets – a crucial step towards real-world applicability. Current research investigates techniques allowing the system to extrapolate learned behaviors effectively, enabling adaptation to previously unseen and increasingly intricate environments. This includes exploring methods for efficient data augmentation and transfer learning, aiming to minimize the need for extensive, painstakingly-collected motion capture data. Ultimately, the goal is to create a system capable of autonomously refining its movements and responding intelligently to unpredictable external factors, paving the way for truly versatile robotic agents capable of operating reliably in complex, dynamic settings.

Motion matching expands the UA-Net dataset’s walking motion capture from 20 minutes to over 10 hours by segmenting sequences into clips and seamlessly blending transitions based on pose, velocity, and predicted trajectory similarity.

The pursuit of seamless humanoid control, as demonstrated by UniAct, echoes a fundamental design principle: elegance through unification. The framework’s ability to interpret diverse inputs – text, music, trajectories – and translate them into fluid motion speaks to a harmonious system where each element occupies its rightful place. As Yann LeCun aptly stated, “Everything we are trying to do in AI is to build systems that can reason, learn, and adapt.” UniAct’s discrete motion tokenization and large language model integration showcase this adaptability, representing a step toward robots that don’t just perform actions, but understand and respond to the nuances of complex instructions, embodying a refined and cohesive control mechanism.

Beyond the Immediate Steps

The elegance of UniAct lies not merely in its unification of input modalities, but in the implicit argument it makes for a more compositional approach to robot control. Current systems often treat motion planning and execution as a monolithic block – a tangled web of parameters adjusted through iterative refinement. This framework, however, hints at a future where complex behaviors emerge from the seamless arrangement of discrete, reusable motion primitives – a library of actions, not a cascade of commands. The challenge, naturally, isn’t just tokenizing motion, but ensuring that this vocabulary is sufficiently rich, and the grammar flexible enough to handle the inherent messiness of the real world.

Real-time performance, while demonstrated, remains a brittle achievement. Scaling this system beyond carefully curated demonstrations – expanding the repertoire of actions, increasing environmental complexity, and accommodating the inevitable sensor noise – will demand a ruthless pruning of redundancy. The true test won’t be generating a dance to a pop song, but reliably navigating a cluttered room while simultaneously responding to unexpected requests.

Ultimately, the question isn’t whether large language models can control robots, but whether they should. The pursuit of ever-more-humanlike robots risks conflating imitation with intelligence. A truly robust system won’t strive to mimic human movement, but to achieve goals efficiently, safely, and with a distinct mechanical grace. Beauty, after all, scales; clutter does not.


Original article: https://arxiv.org/pdf/2512.24321.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
