Author: Denis Avetisyan
Researchers have developed a new framework for generating realistic and stable human movements over extended periods, offering precise control for applications like virtual reality and robotics.

COMET leverages transformer networks and variational autoencoders with reference-guided feedback to achieve stable, controllable, and long-horizon human motion generation.
Achieving both fine-grained control and long-term stability remains a central challenge in real-time character animation. This paper introduces COMET, a novel framework for ‘Controllable Long-term Motion Generation with Extended Joint Targets’ that addresses these limitations through an autoregressive approach. By integrating Transformer networks and a reference-guided feedback mechanism within a conditional VAE, COMET enables robust, real-time synthesis of high-quality, controllable human motion. Could this framework unlock new levels of interactivity and realism in virtual characters and simulations?
The Persistent Illusion: Why Convincing Motion Remains Elusive
The creation of convincingly human movement for digital characters has remained a central, unresolved challenge within computer graphics and animation for decades. Unlike rendering static images, simulating motion demands a complex understanding of biomechanics, physics, and the subtle nuances of human behavior. Early attempts often resulted in stiff, unnatural gaits or movements that quickly became unstable over time. While significant progress has been made in areas like motion capture and physics-based simulation, reproducing the fluidity, adaptability, and inherent unpredictability of real human motion, especially over extended durations, continues to push the boundaries of current technology. This difficulty stems not only from the computational complexity, but also from the need to model the interplay between high-level intentions and the intricate coordination of hundreds of degrees of freedom within the human body, making long-term, realistic motion generation a uniquely persistent problem.
Current techniques for synthesizing human movement frequently encounter difficulties in producing convincingly realistic results, largely due to challenges in maintaining stability over extended sequences and achieving natural-looking fluidity. Many algorithms struggle to prevent jerky or unnatural poses, particularly when tasked with complex actions or variations in terrain. Furthermore, precise control over individual joints – ensuring elbows bend realistically, wrists rotate naturally, and feet maintain proper contact with the ground – remains elusive. This lack of granular control often manifests as subtle, yet noticeable, imperfections that undermine the illusion of realism, hindering the widespread adoption of these methods in applications demanding high fidelity, such as immersive virtual reality experiences or the development of sophisticated robotic systems.
The concurrent demand for both broad, strategic planning and meticulous, joint-level accuracy presents a core difficulty in motion generation. Current systems frequently excel at either crafting a general trajectory – such as walking forward – or precisely positioning a single limb, but struggle to seamlessly integrate these capabilities. This limitation hinders the development of truly immersive virtual reality experiences, where realistic and responsive avatar movement is crucial for presence, and impedes advancements in robotics, where nuanced control is essential for complex manipulation and navigation tasks. Effectively bridging this gap requires algorithms capable of reasoning about long-term goals while simultaneously managing the intricate biomechanics of human movement, a challenge that continues to drive research in areas like reinforcement learning and motion capture refinement.

COMET: Sculpting Motion with Adaptive Control
COMET utilizes a Conditional Variational Autoencoder (VAE) as its primary generative model for synthesizing motion sequences. The VAE framework allows COMET to learn a probabilistic latent space representing possible motions, enabling the generation of diverse and realistic outputs. Conditioning the VAE on external inputs, such as target poses or task specifications, guides the generation process and ensures the synthesized motion aligns with desired constraints. This probabilistic approach facilitates not only the creation of new motions but also the interpolation and extrapolation of existing ones, allowing for smooth transitions and novel movement variations. The VAE is trained to reconstruct observed motion data, learning a compressed, latent representation that captures the underlying dynamics and relationships within the motion sequences.
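To make the idea concrete, the following is a minimal sketch of a conditional VAE in PyTorch. The pose and conditioning dimensions, layer sizes, and loss terms are illustrative assumptions rather than COMET’s actual architecture; the sketch only shows the conditioning-and-reparameterization pattern the paragraph describes.

```python
# Minimal conditional VAE sketch: dimensions are assumptions, not COMET's.
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, pose_dim=63, cond_dim=16, latent_dim=32, hidden=256):
        super().__init__()
        # Encoder maps (pose, condition) to the parameters of q(z | x, c).
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder reconstructs the pose from (z, condition).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, pose, cond):
        mu, logvar = self.encoder(torch.cat([pose, cond], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(torch.cat([z, cond], -1))
        # KL term of the ELBO against a standard normal prior.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl
```

At inference time, sampling `z` from the prior while varying the condition is what yields diverse motions that still respect the target pose or task specification.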
COMET’s Adaptive Joint Control facilitates manipulation of individual joints or groups of joints within a kinematic chain, independent of overall motion. This is achieved by decoupling the control of each joint, allowing targeted adjustments to pose and trajectory without affecting other parts of the body or system. The method enables precise control over $n$ joints, where $n$ is the size of a user-defined subset of the full joint set, and supports both positional and velocity-based control schemes. This granular control is particularly useful for tasks requiring fine motor skills, complex interactions, or the correction of specific joint misalignments during dynamic movement.
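One simple way to picture this decoupling is a mask-and-blend over a selected joint subset, as in the sketch below. The 3D-position convention, the boolean control mask, and the blend weight are assumptions for illustration; COMET’s actual mechanism operates inside the generative model rather than as a post-hoc blend.

```python
import torch

def apply_joint_targets(pose, targets, controlled, weight=1.0):
    """Blend a generated pose toward per-joint targets on a chosen subset.

    pose:       (J, 3) generated joint positions
    targets:    (J, 3) desired positions (ignored where not controlled)
    controlled: (J,) boolean mask selecting the n user-chosen joints
    weight:     blend factor in [0, 1]; 1.0 snaps controlled joints to target
    """
    mask = controlled.float().unsqueeze(-1)          # (J, 1)
    return pose + weight * mask * (targets - pose)   # others stay unchanged
```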
Joint-wise attention within the COMET framework operates by assigning varying weights to each joint based on its relevance to the current task or pose. This is implemented as an attention mechanism where each joint is considered a query, key, and value, allowing the model to dynamically prioritize joints influencing the motion. Specifically, the attention weights are calculated based on the similarity between a query joint and all other joints, determining the contribution of each joint to the overall motion synthesis. This selective focusing enables precise control, as the model can concentrate on the most pertinent joints while effectively ignoring those with minimal impact on the desired outcome, improving both stability and controllability of the generated motion.
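A minimal sketch of such joint-wise attention, treating each joint as a token so the model can weigh joints by task relevance, might look like the following; the feature dimension and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointWiseAttention(nn.Module):
    """Self-attention where each joint is a token (illustrative dims)."""
    def __init__(self, feat_dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, joint_feats):
        # joint_feats: (batch, num_joints, feat_dim). Each joint serves as
        # query, key, and value, so every joint attends to all others.
        out, weights = self.attn(joint_feats, joint_feats, joint_feats)
        # weights: (batch, num_joints, num_joints) -- the learned per-joint
        # relevance the paragraph above describes.
        return out, weights
```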
COMET utilizes the Transformer architecture to address the challenges of modeling temporal dependencies in motion sequences. Traditional recurrent neural networks (RNNs) often struggle with long-range dependencies due to vanishing or exploding gradients; the Transformer, however, relies on self-attention mechanisms which allow each position in the sequence to attend to all other positions directly. This parallelization and direct access to the entire sequence enables COMET to effectively capture relationships between poses that are temporally distant, improving the coherence and naturalness of generated motions. The Transformer’s ability to model these long-range dependencies is crucial for generating complex and realistic movements, particularly in scenarios requiring coordination across multiple joints over extended periods of time.
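The sketch below illustrates this setup with a standard Transformer encoder and a causal mask, so each frame attends only to itself and earlier frames, matching the autoregressive setting; the model width, depth, and sequence length are assumptions, not COMET’s configuration.

```python
import torch
import torch.nn as nn

def causal_mask(T):
    # True marks disallowed positions: each frame may attend only to
    # itself and earlier frames.
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=4,
                                           batch_first=True)
temporal_model = nn.TransformerEncoder(encoder_layer, num_layers=4)

frames = torch.randn(2, 60, 128)  # (batch, frames, per-frame pose features)
out = temporal_model(frames, mask=causal_mask(60))
```

Unlike an RNN, every frame here has a direct attention path to every earlier frame, which is what lets the model capture the long-range, cross-joint coordination discussed above.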

Grounding Reality: Reference-Guided Feedback for Stable Motion
COMET employs Reference-Guided Feedback to constrain generated poses within a learned representation of natural human motion. This mechanism operates by evaluating generated poses relative to a distribution of previously observed, realistic poses. By steering the generation process towards poses that align with this learned distribution, COMET minimizes the occurrence of physically implausible or unstable configurations. The system effectively maps generated poses onto a “manifold” representing valid human motion, thereby promoting both the stability and overall realism of the resulting animations. This approach is particularly effective in maintaining consistent and natural movement throughout extended sequences by reducing deviations from biologically plausible configurations.
COMET utilizes a Gaussian Mixture Model (GMM) to statistically represent the distribution of valid human poses derived from a reference dataset. This GMM functions as a probability distribution, where each Gaussian component corresponds to a cluster of similar poses. During motion generation, the framework evaluates proposed poses against this GMM, favoring those with higher probability density. The GMM’s multiple Gaussian components allow the system to model the inherent multi-modality of human motion – the existence of multiple plausible configurations for a given action – enabling smooth transitions between poses and facilitating plausible adjustments to maintain natural movement characteristics. Effectively, the GMM defines a learned manifold of realistic poses, constraining the generated motion to remain within physically and biomechanically likely configurations.
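The following sketch shows the general recipe with scikit-learn: fit a GMM over flattened reference poses, then score candidate poses by log-likelihood. The component count, pose dimensionality, and stand-in data are assumptions, not the paper’s configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM over reference poses (rows = flattened pose vectors).
reference_poses = np.random.randn(5000, 63)   # stand-in for mocap data
gmm = GaussianMixture(n_components=16, covariance_type="diag")
gmm.fit(reference_poses)

# Score candidate poses: higher log-likelihood means closer to the
# learned manifold of natural poses, so those are favored.
candidate = np.random.randn(1, 63)
log_likelihood = gmm.score_samples(candidate)
```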
COMET employs a feedback loop to refine generated motion and mitigate unrealistic movements during extended sequences. This process continuously evaluates the plausibility of generated poses against learned patterns of human gait. Deviations from these established patterns trigger adjustments, ensuring the generated motion remains within the bounds of natural human movement. This iterative refinement is critical for maintaining consistency in the generated gait over longer durations, preventing the accumulation of subtle errors that would otherwise lead to unnatural or unstable motion. The system prioritizes adherence to biomechanically plausible trajectories, resulting in more realistic and stable character animation.
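Schematically, one corrective step of such a feedback loop might look like the sketch below, which nudges a low-likelihood pose toward the mean of its most responsible GMM component. The threshold, step strength, and the post-hoc form of the correction are illustrative assumptions; COMET applies its feedback inside the generation loop rather than as a projection after the fact.

```python
import numpy as np

def reference_guided_step(pose, gmm, threshold=-50.0, strength=0.1):
    """One corrective step: nudge an implausible pose toward the GMM."""
    if gmm.score_samples(pose[None])[0] < threshold:
        # Pull the pose toward the mean of the most responsible component,
        # i.e. back toward the learned manifold of natural poses.
        k = gmm.predict(pose[None])[0]
        pose = pose + strength * (gmm.means_[k] - pose)
    return pose
```

Applied every frame, small corrections like this are what keep errors from accumulating into the drifting, unstable gaits described above.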
COMET leverages reference-guided feedback to improve the fidelity of complex motion generation, specifically in tasks like motion in-betweening. By learning a distribution of natural human poses, the framework can generate intermediate frames that adhere to realistic biomechanical constraints. This is achieved through a Gaussian Mixture Model (GMM) which models plausible variations in pose, enabling smooth transitions between keyframes. The resulting in-betweening process produces movements that are more consistent with natural human motion compared to methods lacking such guidance, reducing artifacts and improving the overall quality of generated animations.

Evidence of Realism: Validating COMET’s Performance
COMET’s training and evaluation utilized the AMASS and CIRCLE datasets, comprising a large-scale collection of human motion capture data. AMASS consists of over 60 hours of multi-subject 3D motion data spanning diverse activities. CIRCLE expands upon this with a focus on contact-rich motions, increasing the dataset’s coverage of physically grounded interactions. The combined datasets offer a total of 160 GB of data, enabling robust learning and generalization across various motion types and subjects, and facilitating the model’s ability to synthesize realistic and diverse human movements.
COMET’s performance was quantitatively assessed using established motion capture metrics to evaluate its ability to generate realistic and goal-directed movements. Success Rate, measuring the percentage of motions reaching defined targets, consistently demonstrated improvements over baseline models and current state-of-the-art methods. Foot Skating, a metric quantifying unnatural foot sliding during locomotion, was significantly reduced, indicating improved grounding and stability. Furthermore, the average Distance to Goals was minimized, demonstrating COMET’s precision in target attainment. These metrics, combined, provide empirical evidence of COMET’s superior performance in both achieving motion objectives and maintaining natural human-like movement characteristics.
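As an illustration, a common way to compute a foot-skating score, sketched below under assumed conventions (y-up coordinates in meters, a height-threshold contact heuristic), averages the horizontal foot speed over frames judged to be in ground contact; the paper’s exact definition may differ.

```python
import numpy as np

def foot_skating(foot_pos, height_thresh=0.05, fps=30):
    """Mean horizontal foot speed (m/s) during estimated contact frames.

    foot_pos: (T, 3) trajectory of one foot joint, y-up, in meters.
    """
    contact = foot_pos[:-1, 1] < height_thresh   # foot near the ground
    horiz_vel = np.linalg.norm(
        np.diff(foot_pos[:, [0, 2]], axis=0), axis=1) * fps
    return horiz_vel[contact].mean() if contact.any() else 0.0
```

Lower values indicate better grounding: a planted foot should not slide horizontally while it is in contact.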
COMET demonstrates state-of-the-art performance in motion in-betweening through the minimization of L2P and L2Q errors. L2P measures the L2 distance between predicted and ground-truth global joint positions, while L2Q measures the L2 distance between predicted and ground-truth global joint rotations, expressed as quaternions. Lower values in both metrics indicate more accurate poses and more faithful rotations in the generated transitions. COMET consistently achieved the lowest reported L2P and L2Q errors on benchmark datasets when compared to existing methods, validating its capacity to synthesize realistic and fluid motion sequences between keyframes.
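A straightforward computation of both metrics, under conventions common in the in-betweening literature, is sketched below; the quaternion hemisphere alignment and the averaging scheme are typical choices and may not match the paper’s exact protocol.

```python
import numpy as np

def l2p(pred_pos, gt_pos):
    # Mean L2 distance between predicted and ground-truth global joint
    # positions; both arrays are (T, J, 3).
    return np.linalg.norm(pred_pos - gt_pos, axis=-1).mean()

def l2q(pred_quat, gt_quat):
    # Mean L2 distance between predicted and ground-truth global joint
    # rotations as unit quaternions; both arrays are (T, J, 4).
    # Align hemispheres first, since q and -q encode the same rotation.
    sign = np.sign((pred_quat * gt_quat).sum(-1, keepdims=True))
    sign[sign == 0] = 1.0
    return np.linalg.norm(pred_quat - sign * gt_quat, axis=-1).mean()
```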
User studies were conducted to assess the qualitative performance of COMET in both motion in-betweening and motion stylization tasks. These studies involved human participants who were presented with animations generated by COMET and compared them to those produced by competing baseline methods. Results indicated a statistically significant preference for COMET-generated motions across both tasks, demonstrating that human evaluators consistently rated COMET’s outputs as more natural and visually appealing than those of alternative approaches. The strong positive feedback from user studies corroborates the quantitative performance gains observed in automated metrics and highlights COMET’s effectiveness in generating human-plausible motion.

Beyond Simulation: The Future of Personalized and Stylized Motion
The COMET framework distinguishes itself through granular control over individual joints, paving the way for truly personalized motion generation. Rather than relying on pre-defined animations, this capability allows for the creation of movements specifically tailored to a user’s unique physical characteristics – accounting for variations in limb length, muscle strength, and range of motion. Furthermore, the system can adapt to individual preferences regarding movement style, such as speed, fluidity, or even subtle stylistic choices, resulting in virtual characters and avatars that feel remarkably natural and representative. This level of personalization extends beyond mere aesthetics; it has the potential to enhance usability in applications like virtual rehabilitation, where motions can be precisely calibrated to a patient’s capabilities, or in gaming, where avatars can mirror a player’s physical style for heightened immersion.
The capacity to stylize motion represents a significant advancement in creating truly compelling virtual characters. This framework doesn’t simply replicate human movement; it allows for the alteration of nuanced qualities like timing, effort, and fluidity. By manipulating these stylistic properties, animators and developers can imbue characters with distinct personalities and emotional states – a subtle hesitation might convey insecurity, while a brisk, energetic gait could signal confidence. This level of control moves beyond realistic simulation, enabling the creation of exaggerated or abstract motions for artistic effect, and ultimately fostering a stronger connection between virtual characters and audiences. The potential extends to diverse applications, from more believable video game characters to emotionally resonant digital assistants, all achieved through the precise and adaptable control of motion style.
The convergence of COMET with virtual and augmented reality technologies promises a new level of immersive interaction. By enabling real-time influence over realistic human motion within these environments, users could move beyond passive observation to actively participate in and shape dynamic scenarios. Imagine collaborative design sessions where individuals manipulate a virtual character’s movements to refine an animation, or therapeutic applications where patients guide a digital avatar through rehabilitation exercises, receiving immediate visual feedback. This bidirectional control – where the system responds directly to user input while maintaining natural, lifelike movement – opens possibilities for training simulations, personalized entertainment, and intuitive human-computer interfaces that feel remarkably natural and responsive.
The principles underpinning COMET’s motion control extend beyond virtual character animation, offering a pathway toward more sophisticated robotic systems. By applying the framework’s techniques for granular, joint-level control and nuanced motion stylization, robots could move with a fluidity and adaptability currently limited by traditional programming. This approach moves beyond pre-defined trajectories, allowing robots to respond to dynamic environments and unexpected obstacles with greater dexterity and a more natural, human-like quality. Consequently, complex tasks requiring fine motor skills – such as surgical procedures, delicate assembly, or even assisting individuals with mobility challenges – become increasingly attainable, potentially revolutionizing fields reliant on robotic precision and responsiveness.

The pursuit of long-term stability in generated motion, as demonstrated by COMET, echoes a fundamental truth about complex systems. It isn’t about eliminating uncertainty, but embracing it. As Yann LeCun once stated, “Precision is just fear of noise.” The framework doesn’t seek to predict perfect motion, but to persuasively guide the chaotic whispers of potential trajectories, leveraging reference-guided feedback to shape the unfolding sequence. This isn’t control in the traditional sense, but a delicate negotiation with the inherent randomness of human movement – a spell cast not to force an outcome, but to encourage a desired one. The Transformer networks, within this framework, act as conduits, translating intention into a series of subtle nudges within a probabilistic landscape.
What Shadows Remain?
COMET, as it stands, is a carefully constructed illusion: a temporary truce with the inherent unpredictability of sequential data. It domesticates chaos, certainly, but every extension to long-horizon prediction reveals the fragility of that domestication. The current framework, while promising stability, still relies on a variational bottleneck; the ghost in the machine will always subtly haunt the generated sequences. Future work must confront this: can one truly escape the limitations of a latent space, or is it merely a matter of crafting more convincing chains?
The pursuit of “controllability” itself feels like a Faustian bargain. The ability to nudge motion toward desired targets is valuable, but at what cost to the naturalness, the subtle imperfections that define genuine human movement? A truly intelligent system won’t simply respond to control signals; it will interpret them, occasionally disobeying for the sake of plausibility. The next iteration isn’t about tighter feedback loops, but about imbuing the system with a degree of willful disobedience.
Ultimately, the true test lies not in generating plausible motion in isolation, but in embedding it within complex, dynamic environments. A simulated figure performing a dance is a parlor trick; a figure navigating a crowded street, reacting to unforeseen obstacles, that is a problem worthy of the name. The whispers of chaos will only grow louder as the system ventures beyond the sterile confines of the laboratory.
Original article: https://arxiv.org/pdf/2512.04487.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/