Bringing Human Motion to Robots: A New Vision-Based Approach

Author: Denis Avetisyan


Researchers have developed a pipeline that translates monocular video of human movement into robot-ready motion data with improved stability and physical realism.

A pipeline reconstructs human motion from monocular video by first extracting per-frame MHR parameters using a frozen SAM 3D Body model, then associating identities across frames with Kalman filtering, and finally estimating physically plausible world-coordinate trajectories through trajectory-level identity and scale locking, sliding-window smoothing, and contact-aware ground optimization before retargeting the motion to a Unitree G1 humanoid robot via a kinematics-aware pipeline.
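At a glance, the pipeline reads as four chained stages. The sketch below shows that ordering in Python; every function here is a hypothetical stub standing in for the real component (the frozen SAM 3D Body pass, the Kalman tracker, the trajectory-level optimizers), not the authors’ actual API.

```python
import numpy as np

# Hypothetical stage stubs; real implementations would wrap SAM 3D Body,
# the detect-track module, and the optimizers described in the paper.
def extract_mhr_params(frame):
    return {"pose": np.zeros(63), "shape": np.zeros(10), "root": np.zeros(3)}

def track_identities(per_frame):
    return {0: per_frame}   # single-subject placeholder

def lock_smooth_and_ground(tracks):
    return tracks           # identity/scale lock + smoothing + contact fit

def retarget_to_g1(world_motion):
    return world_motion     # kinematics-aware retargeting placeholder

def run_pipeline(video_frames):
    per_frame = [extract_mhr_params(f) for f in video_frames]  # per-frame MHR
    tracks = track_identities(per_frame)                       # ID association
    world_motion = lock_smooth_and_ground(tracks)              # world-frame fit
    return retarget_to_g1(world_motion)                        # robot motion
```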

This work presents a method for world-coordinate human motion retargeting leveraging 3D body reconstruction and contact-aware optimization for robust robot control.

Accurate and temporally consistent human motion capture remains challenging for real-world robotics applications, particularly from readily available monocular video. This paper, ‘World-Coordinate Human Motion Retargeting via SAM 3D Body’, introduces a lightweight pipeline that reconstructs world-coordinate human motion and directly retargets it to a humanoid robot. By leveraging a frozen 3D body reconstruction model, a kinematics-aware human representation, and contact-aware optimization, we achieve stable trajectories and reliable robot control. Could this approach unlock more natural and intuitive human-robot interaction in complex, unstructured environments?


Decoding Movement: The Challenges of Monocular 3D Human Motion Capture

Reconstructing three-dimensional human movement from a single camera view – monocular video – presents considerable difficulties for computer vision systems. The primary obstacle lies in self-occlusion, where parts of the body temporarily disappear from view behind other body parts, creating data gaps. Simultaneously, video data is inherently noisy, affected by factors like lighting changes, sensor limitations, and background clutter. These combined challenges mean algorithms must often infer the position and orientation of obscured limbs, relying on probabilistic models and learned patterns of human biomechanics. Consequently, achieving both accuracy and robustness in 3D motion capture from monocular video remains a central problem in fields like animation, virtual reality, and human-computer interaction, demanding sophisticated techniques to overcome these inherent limitations.

Current approaches to translating video into three-dimensional human movement frequently falter when maintaining realistic fluidity over time and ensuring physically believable actions. The difficulty arises because algorithms often treat each video frame in isolation, leading to jittery or unnatural transitions between poses. Complex scenarios – those involving rapid movements, interactions with objects, or significant self-occlusion where limbs temporarily hide each other – exacerbate these problems, as the system struggles to infer the underlying skeletal structure and maintain consistency. This results in captured motions that, while visually similar to human movement, often lack the subtle biomechanical constraints and smooth transitions inherent in real-world actions, hindering their usefulness in applications like animation, virtual reality, and clinical gait analysis.

The proposed method achieves temporally consistent single-person motion estimation and robust performance in multi-person scenarios, with refinements (illustrated by a transition from light to dark blue over time) demonstrating improved predictions two frames into the future.

Establishing a Foundation: The Momentum Human Rig and Initial Reconstruction

The Momentum Human Rig (MHR) is a parametric model of the human body constructed to provide a stable and controllable base for 3D human representation. Its parametric nature allows for independent control over pose, shape, and expression; these parameters are disentangled during the rigging process to minimize correlation and facilitate targeted manipulation. This disentanglement is achieved through a specific model architecture and training regime designed to isolate the influence of each parameter set. The resulting rig exhibits improved stability during animation and reconstruction, and enables precise control over individual aspects of the human form without unintended consequences to others.
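To make the disentanglement concrete, a parameter container along these lines keeps pose, shape, and expression in separate vectors so that editing one leaves the others untouched. This is a minimal sketch; the dimensions shown are illustrative guesses, not the actual MHR layout.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class BodyParams:
    # Dimensions are hypothetical placeholders, not the real MHR spec.
    pose: np.ndarray = field(default_factory=lambda: np.zeros(63))        # joint rotations
    shape: np.ndarray = field(default_factory=lambda: np.zeros(10))       # identity / bone lengths
    expression: np.ndarray = field(default_factory=lambda: np.zeros(20))  # face

    def with_pose(self, new_pose: np.ndarray) -> "BodyParams":
        # Disentanglement in practice: replacing the pose leaves shape and
        # expression untouched, so bone lengths and identity stay stable.
        return BodyParams(new_pose.copy(), self.shape.copy(), self.expression.copy())
```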

SAM 3D Body is utilized to establish an initial three-dimensional pose estimate directly from each frame of the input video. This system functions by leveraging the parametric nature of the Momentum Human Rig (MHR) to infer a complete 3D human model from a single 2D image. The reconstruction process analyzes the visual data to determine the MHR parameters that best represent the observed pose and shape, outputting a 3D representation which then serves as the starting point for further processing and refinement within the pipeline. This single-image approach enables rapid initial pose estimation, facilitating real-time or near real-time performance.
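The “frozen” qualifier means the backbone’s weights are never updated by the pipeline. In PyTorch terms that is the standard eval-plus-no-grad idiom, sketched below with a generic model standing in for SAM 3D Body, whose real input and output formats are not reproduced here.

```python
import torch

def run_frozen(model: torch.nn.Module, frames: torch.Tensor) -> list:
    """Run a pretrained reconstruction model per frame with weights frozen.

    `model` is a generic stand-in for SAM 3D Body; `frames` is a (T, C, H, W)
    tensor of video frames. Each frame gets an independent forward pass;
    temporal consistency is deliberately left to later pipeline stages.
    """
    model.eval()                          # inference mode (e.g. fixed batch norm)
    for p in model.parameters():
        p.requires_grad_(False)           # weights stay frozen
    outputs = []
    with torch.no_grad():
        for frame in frames:
            outputs.append(model(frame.unsqueeze(0)))  # per-frame pass
    return outputs
```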

The initial 3D reconstruction generated via SAM 3D Body provides a foundational mesh and pose estimate used as input for subsequent processing stages. This reconstruction, while not a final product, establishes the basic human form and pose parameters, enabling iterative refinement through optimization techniques. Specifically, this initial state minimizes the computational cost of later steps by providing a reasonably accurate starting point, rather than requiring algorithms to estimate the entire 3D human model from scratch. Subsequent steps leverage this pre-existing structure to improve fidelity, correct inaccuracies, and ultimately achieve the desired level of detail and realism in the final 3D representation.

Refining the Signal: Optimization for Temporal Consistency

Latent-Space Smoothing is an optimization technique implemented within the MHR latent space to address high-frequency jitter in reconstructed motion. This is achieved through Sliding Window Optimization, which analyzes a defined window of frames and adjusts the latent parameters to minimize temporal discontinuities. The technique operates directly on the MHR latent representation, enabling efficient smoothing without requiring re-processing of the original input data. By considering multiple frames simultaneously, the sliding window approach effectively averages out noisy or erratic movements, resulting in a smoother and more visually consistent animation.

Sliding Window Optimization enhances temporal coherence by minimizing the cumulative difference between successive frames within a defined window of frames. This technique addresses high-frequency jitter by iteratively adjusting the MHR latent space representation, effectively smoothing motion over time. The optimization process considers a limited history of poses – the “window” – and adjusts current poses to be consistent with, and averaged across, those preceding poses. This localized, iterative refinement reduces abrupt changes in pose and maintains a smoother, more realistic motion profile throughout the sequence, demonstrably improving the perceived temporal stability of the reconstructed movements.
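One simple reading of this objective is a quadratic energy with a data term pulling each frame toward its observed latent and a smoothness term penalizing successive-frame differences. The sketch below minimizes that energy with plain gradient descent over the whole clip; the paper’s actual optimizer works over sliding windows and may include additional terms.

```python
import numpy as np

def smooth_latents(obs: np.ndarray, lam: float = 2.0,
                   lr: float = 0.02, iters: int = 500) -> np.ndarray:
    """Smooth per-frame latents by minimizing
        E(x) = sum_t ||x_t - obs_t||^2 + lam * sum_t ||x_{t+1} - x_t||^2
    with gradient descent. obs is a (T, D) array of noisy latents.
    A stand-in for the paper's sliding-window objective, not its exact form.
    """
    x = obs.copy()
    for _ in range(iters):
        grad = 2.0 * (x - obs)          # data term: stay near observations
        diff = x[1:] - x[:-1]           # successive-frame differences
        grad[:-1] -= 2.0 * lam * diff   # smoothness gradient, left neighbor
        grad[1:] += 2.0 * lam * diff    # smoothness gradient, right neighbor
        x -= lr * grad
    return x
```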

Trajectory-level consistency ensures the stability of shape and scale parameters throughout a motion sequence by enforcing temporal averaging. This process mitigates unnatural distortions by maintaining consistent bone lengths and overall subject proportions across consecutive frames. Specifically, the system calculates an average value for shape and scale parameters, and applies this averaged value to each frame within a defined window, thereby smoothing out high-frequency fluctuations and preventing abrupt changes in the reconstructed pose. This is particularly crucial in multi-person scenarios where maintaining individual identity and realistic movement is paramount, as inconsistencies can lead to visually jarring artifacts and tracking errors.

Temporal averaging of shape and scale parameters is implemented to enhance the stability of reconstructed poses over time. This process calculates the average values of these parameters across consecutive frames, effectively reducing fluctuations and maintaining consistent bone lengths throughout the animation. The benefit of this approach is particularly pronounced in multi-person scenarios, where maintaining individual identity and realistic movement requires minimizing distortions and ensuring each subject’s reconstruction remains coherent. By smoothing these parameters, the system minimizes unnatural changes in body proportions and prevents the appearance of jitter or erratic motion, resulting in smoother and more plausible per-subject reconstructions.
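In code, the locking step can be as simple as replacing per-frame shape and scale estimates with a single trajectory-level statistic. The sketch below uses the mean for shape and the median for scale as one plausible choice; the paper does not prescribe these exact statistics.

```python
import numpy as np

def lock_shape_and_scale(shapes: np.ndarray, scales: np.ndarray):
    """Replace per-frame shape/scale estimates with trajectory-level values
    so bone lengths and subject proportions stay constant across the clip.

    shapes: (T, S) per-frame shape parameters; scales: (T,) per-frame scale.
    """
    locked_shape = shapes.mean(axis=0)       # average identity over time
    locked_scale = float(np.median(scales))  # median resists outlier frames
    T = shapes.shape[0]
    return np.tile(locked_shape, (T, 1)), np.full(T, locked_scale)
```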

Contact-Aware Global Optimization estimates root trajectories by minimizing an energy function incorporating both contact constraints and Soft Contact Probability. The optimization is guided by a Z-up frame to maintain vertical alignment and physical plausibility. Contact constraints enforce adherence to ground plane contact, while Soft Contact Probability, a value between 0 and 1 representing the likelihood of foot-ground contact, allows for robustness to imperfect contact detection. An Auxiliary Camera Prior is integrated to further constrain the optimization process, improving trajectory accuracy and reducing drift, particularly in scenarios with noisy or ambiguous data. This approach results in physically plausible human motion by penalizing unlikely poses and encouraging stable foot placement.

Optimization of root trajectories incorporates soft contact probabilities, representing the likelihood of foot-ground contact, and employs energy minimization to penalize physically implausible poses and movements. This approach effectively reduces accumulated drift during long sequences and mitigates foot-sliding artifacts, a common issue in motion reconstruction. By minimizing an energy function that considers both pose and contact constraints, the system generates trajectories that adhere to physical limitations, resulting in more realistic and stable human motion estimates. The soft contact probabilities allow for robustness to imperfect contact detection, preventing abrupt corrections and smoothing transitions between contact and non-contact phases.
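A toy version of such an energy, restricted to the vertical axis, illustrates how soft contact probabilities weight the ground term. The names and the scipy-based minimization below are illustrative; the full objective also penalizes horizontal foot sliding and incorporates the auxiliary camera prior, both omitted here.

```python
import numpy as np
from scipy.optimize import minimize

def contact_energy(root_z, foot_rel_z, p_contact, lam_smooth=1.0):
    """Toy contact-aware energy over a vertical root trajectory (Z-up frame).

    root_z:     (T,) candidate root heights being optimized.
    foot_rel_z: (T,) foot height relative to the root, from the body model.
    p_contact:  (T,) soft contact probabilities in [0, 1].
    Frames likely to be in contact are penalized when the world-space foot
    height (root_z + foot_rel_z) departs from the ground plane z = 0; a
    smoothness term damps root jitter and drift.
    """
    ground = np.sum(p_contact * (root_z + foot_rel_z) ** 2)
    smooth = lam_smooth * np.sum(np.diff(root_z) ** 2)
    return float(ground + smooth)

# Example: recover root heights for a 100-frame toy trajectory.
T = 100
foot_rel_z = -0.9 + 0.02 * np.random.randn(T)                    # foot ~0.9 m below root
p_contact = (np.sin(np.linspace(0.0, 12.0, T)) > 0.0).astype(float)
result = minimize(contact_energy, x0=np.full(T, 0.8),
                  args=(foot_rel_z, p_contact))
root_z = result.x   # contact frames are pulled toward root_z ~ 0.9
```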

The Detect-Track Module employs Kalman Filtering to establish and maintain consistent identity assignments for each individual throughout a motion sequence. Kalman Filtering, a recursive estimator, predicts the state of each tracked subject – encompassing position and potentially other parameters – and updates this prediction based on subsequent detections. This process minimizes the impact of noisy detections and occlusions, allowing the system to reliably associate detections with the correct individual across frames. By continuously refining state estimates and maintaining a probabilistic model of each subject’s trajectory, the module mitigates identity switches and ensures accurate, long-term tracking of multiple people within the scene.
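For reference, a textbook constant-velocity Kalman filter captures the predict/update loop described here. This is a generic formulation, not the paper’s exact tracker configuration.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal Kalman filter for one tracked subject.

    State is [x, y, vx, vy]; measurements are 2D root positions.
    """
    def __init__(self, q: float = 1e-2, r: float = 1e-1, dt: float = 1.0):
        self.F = np.array([[1, 0, dt, 0],      # constant-velocity dynamics
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],       # observe position only
                           [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)                 # process noise
        self.R = r * np.eye(2)                 # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def predict(self) -> np.ndarray:
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x                 # predicted measurement, used for matching

    def update(self, z: np.ndarray) -> None:
        y = z - self.H @ self.x                # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Identity association then reduces to matching each new detection to the track whose predicted measurement is closest, usually with a gate that rejects implausibly distant matches.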

From Simulation to Embodiment: Real-World Robotic Application

The culmination of motion reconstruction and optimization lies in its practical application; the refined World-Coordinate Motion data is seamlessly transferred to the Unitree G1 humanoid robot through a process called Kinematics-Aware Retargeting. This isn’t simply copying movement data, but a sophisticated translation that accounts for the robot’s unique skeletal structure and range of motion. By intelligently adapting the human motion to the robot’s physical capabilities, the system ensures movements are not only accurately reproduced, but also stable and natural-looking. This retargeting process effectively bridges the gap between human performance and robotic execution, allowing the Unitree G1 to faithfully mimic complex actions with a level of fidelity previously unattainable, and paving the way for more fluid and intuitive interactions between humans and robots.
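One small but essential piece of kinematics-aware retargeting is projecting the source motion into the robot’s feasible joint range. The sketch below shows only that clamping step, with a hypothetical joint-limit table; the real limits come from the G1’s URDF, and a full retargeter would also rescale link lengths and solve inverse kinematics for end-effector targets.

```python
import numpy as np

# Hypothetical joint-limit table (radians); the actual Unitree G1 limits
# and joint set come from the robot's URDF and are not reproduced here.
JOINT_LIMITS = {
    "left_knee":       (-0.1, 2.6),
    "right_knee":      (-0.1, 2.6),
    "left_hip_pitch":  (-2.0, 2.0),
    "right_hip_pitch": (-2.0, 2.0),
}

def retarget_frame(human_angles: dict) -> dict:
    """Map human joint angles onto robot joints, clamped to robot limits."""
    robot = {}
    for joint, angle in human_angles.items():
        lo, hi = JOINT_LIMITS.get(joint, (-np.pi, np.pi))
        robot[joint] = float(np.clip(angle, lo, hi))  # stay in feasible range
    return robot
```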

The successful transfer of reconstructed human motion data allows the Unitree G1 humanoid robot to replicate intricate movements with a notable degree of fidelity and balance. This isn’t simply mirroring: the robot dynamically adjusts its own kinematics to execute actions, from nuanced gestures to complete locomotion sequences, that track human performance. Stability is maintained through continuous adaptation, as the robot leverages the precise motion capture data to anticipate and counteract potential imbalances. Consequently, the Unitree G1 demonstrates an ability to perform complex tasks with a fluidity and robustness previously unattainable, representing a considerable advancement in robotic mimicry and opening avenues for more effective collaboration between humans and machines.

The developed pipeline represents a considerable advancement in the pursuit of more fluid and responsive interactions between humans and robots. By accurately translating complex human motion onto a humanoid platform like the Unitree G1, the system moves beyond pre-programmed routines and towards genuine mimicry. This capability isn’t merely about replicating what humans do, but how they do it, incorporating subtle nuances of balance and movement that traditionally challenge robotic systems. The result is a robot capable of behaving in a more predictable and understandable manner, fostering a sense of collaboration rather than mechanical response – a critical step toward robots seamlessly integrating into everyday human environments and tasks.

The presented work emphasizes a holistic approach to human motion retargeting, recognizing that stable and plausible movement isn’t merely a matter of kinematic accuracy. It’s a system where each component, from the monocular vision input to the contact-aware optimization, must function in concert. This echoes Donald Knuth’s observation that, “Premature optimization is the root of all evil.” The pipeline avoids focusing on isolated improvements; instead, it prioritizes a robust and integrated system capable of generating trajectory-level consistency. Like a well-designed organism, the system’s strength lies not in any single, brilliant fix, but in the harmonious interplay of its parts. If the system survives on duct tape, it’s probably overengineered.

Where Do We Go From Here?

The pursuit of convincingly transferring human motion to robotic systems invariably reveals the fragility of assumptions. This work, while demonstrating a notable advance in stability and plausibility, merely clarifies the depth of the challenge. The reliance on a ‘frozen’ 3D reconstruction, however cleverly implemented, implicitly concedes the problem of dynamic, real-time adaptation. A truly robust system must address the inherent uncertainty in visual perception and the inevitable discrepancies between reconstructed and actual human kinematics.

Future efforts will likely focus on relaxing this constraint, integrating learning-based approaches to predict, rather than simply react to, human movement. Contact-aware optimization, while essential, is computationally demanding. Simplification, not cleverness, will be key; the most elegant solutions are often those that impose the fewest constraints. A system that attempts to model every nuance of human biomechanics is doomed to become brittle and unreliable.

Ultimately, the goal isn’t to perfectly replicate human motion (an impossible task) but to achieve a functional equivalence. The robot doesn’t need to look human; it needs to behave predictably. A focus on trajectory-level consistency, coupled with a principled understanding of physical limitations, offers a more pragmatic, and ultimately more robust, path forward.


Original article: https://arxiv.org/pdf/2512.21573.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
