From Human Moves to Robot Actions: Mitty Learns by Watching

Author: Denis Avetisyan


A new diffusion-based framework, Mitty, directly translates human demonstration videos into robot control sequences, simplifying the process of teaching robots complex tasks.

Mitty leverages a Diffusion Transformer to translate human demonstrations into robotic action, employing an in-context learning approach where noisy robot video latents are refined alongside observed human movements through bidirectional attention, a process that effectively teaches the system to mimic complex operations.

This work introduces a Diffusion Transformer that generates robot execution videos from human demonstrations using in-context learning and a novel paired data synthesis pipeline.

Directly translating human demonstrations into robotic action remains challenging due to information loss inherent in intermediate representations and the scarcity of paired data. This limitation motivates ‘Mitty: Diffusion-based Human-to-Robot Video Generation’, which introduces a Diffusion Transformer enabling end-to-end video generation from human examples via in-context learning, bypassing traditional abstractions and leveraging strong visual-temporal priors. By compressing demonstrations into condition tokens and employing a novel paired data synthesis pipeline, Mitty achieves state-of-the-art results and strong generalization. Could this approach unlock more scalable and intuitive robot learning paradigms based on direct human observation?


Whispers of Embodiment: The Human-Robot Disconnect

The prospect of robots learning directly from human example is hampered by fundamental discrepancies in how humans and robots interact with the world. A human’s capacity for nuanced movement and intuitive understanding of physics doesn’t readily translate to a robot’s rigid mechanics and discrete action space. While a person can effortlessly reach for an object, a robot requires precise instructions detailing joint angles, velocities, and force control – a significant leap in complexity. This difference in embodiment – the physical form and capabilities – coupled with the distinct action space – the range of possible movements – creates a substantial challenge for imitation learning. Consequently, robots often struggle to accurately replicate human actions, especially in dynamic or unpredictable environments, demanding sophisticated algorithms to bridge this gap between human demonstration and robotic execution.

Historically, imparting new skills to robots has proven remarkably labor-intensive. Conventional robotic systems often demand significant, bespoke engineering for each new task, a process that necessitates painstakingly coding every nuance of movement and interaction. This reliance on hand-crafted solutions severely limits a robot’s adaptability; even slight variations in the environment or task requirements can necessitate a complete overhaul of the existing code. Consequently, these systems struggle with generalization, failing to effectively transfer learned skills to even moderately novel scenarios. The inherent inflexibility of these traditional methods presents a major obstacle in the pursuit of truly versatile and autonomous robotic agents, hindering their deployment in dynamic, real-world environments.

The core difficulty in replicating human actions with robots stems from a fundamental translation problem: converting what a robot sees a human do into the precise motor commands its own body understands. This isn’t simply a matter of recognizing the action, but of deciphering the nuanced details – timing, force, trajectory – and mapping them onto a vastly different mechanical system. Current systems often struggle because human movements are inherently visual and contextual, while robots operate on discrete, numerical instructions. Bridging this gap demands algorithms capable of interpreting the semantics of an action from visual input and then generating a corresponding sequence of robot joint angles, velocities, and forces – a complex process that requires robust perception and sophisticated motion planning. The challenge isn’t just about identifying what is being done, but accurately determining how it is being done in a way that the robot can physically reproduce.

The capacity for robots to learn from scarce examples and generalize to unforeseen circumstances represents a pivotal advancement in robotics. Current research emphasizes the development of models that move beyond reliance on massive datasets, instead focusing on efficient learning strategies like meta-learning and transfer learning. These approaches enable robots to quickly adapt to new tasks with minimal retraining, mirroring human aptitude for improvisation and problem-solving. By leveraging prior knowledge and identifying underlying patterns, these models can extrapolate from limited observations, significantly reducing the engineering effort required for deployment in dynamic and unpredictable environments. This shift towards data efficiency not only expands the range of applicable robotic systems but also facilitates their integration into real-world scenarios where obtaining extensive training data is impractical or impossible.

A pipeline leveraging Detectron2 and Segment Anything automatically generates over 6,000 synthetic human-robot interaction videos by detecting and segmenting human limbs, mapping hand keypoints to robot arm poses, and refining the results with human-in-the-loop filtering to train the Mitty model.

Mitty: Weaving Action from Observation

Mitty employs an end-to-end framework for video generation, directly translating human demonstrations into corresponding robot actions. This is achieved through a Diffusion Transformer architecture, which integrates the capabilities of diffusion models and transformer networks. Unlike traditional methods requiring separate stages for motion planning, control, and rendering, Mitty learns a direct mapping from human input to robot video frames. The framework accepts human demonstrations, processes them through the Diffusion Transformer, and outputs a complete video sequence depicting the robot performing the demonstrated task. This eliminates the need for intermediate representations and allows for a streamlined video generation process, reducing complexity and potential error propagation.
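
To make that flow concrete, the sketch below shows what such an end-to-end loop could look like in PyTorch-style code: the human demonstration is compressed into latent condition tokens, robot latents are denoised from pure noise, and the result is decoded back into video. The `dit` and `vae` interfaces, the noise schedule, and the shared latent shape are illustrative assumptions, not the paper’s actual implementation.

```python
import torch

@torch.no_grad()
def generate_robot_video(dit, vae, human_video, num_steps=50):
    """Hypothetical end-to-end loop: denoise robot-video latents conditioned
    on a human demonstration, then decode them into frames. `dit` and `vae`
    are assumed interfaces, not Mitty's actual modules."""
    # Compress the human demonstration into latent condition tokens.
    cond = vae.encode(human_video)

    # Assume the robot clip shares the latent shape; start from pure noise.
    z = torch.randn_like(cond)

    # Toy linear noise schedule, for illustration only.
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    for t in reversed(range(num_steps)):
        # The transformer jointly attends over human and noisy robot tokens.
        eps = dit(noisy_robot=z, human_cond=cond, timestep=t)

        # Predict the clean latents, then take a deterministic step to t-1.
        x0 = (z - (1.0 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
        if t > 0:
            z = alpha_bars[t - 1].sqrt() * x0 + (1.0 - alpha_bars[t - 1]).sqrt() * eps
        else:
            z = x0

    # Decode the denoised latents back into robot video frames.
    return vae.decode(z)
```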

In-Context Learning is a key component of Mitty’s functionality, enabling the model to generalize to new robotic tasks with limited training examples. Rather than requiring extensive fine-tuning for each new skill, Mitty leverages a few demonstrations – typically between 5 and 20 – to condition the Diffusion Transformer. This is achieved by incorporating the demonstrated action sequences directly into the attention mechanism, allowing the model to infer the desired behavior without altering its pretrained weights. Consequently, Mitty demonstrates a significant reduction in data requirements compared to traditional supervised learning approaches for robot video generation, facilitating rapid prototyping and deployment in novel environments.
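
A rough sketch of how such in-context conditioning can be wired up follows: demonstration latents and noisy robot latents are packed into one token sequence, and only the target positions would be denoised. The shapes and names are assumptions for illustration, not Mitty’s actual data layout.

```python
import torch

def build_in_context_sequence(demo_latents, noisy_robot_latents):
    """Pack human-demonstration tokens and noisy robot tokens into one
    sequence so conditioning happens inside attention, with no weight
    updates. Shapes are illustrative: (N_demo, D) and (N_robot, D)."""
    tokens = torch.cat([demo_latents, noisy_robot_latents], dim=0)

    # Mark which positions are clean context and which are noisy targets;
    # only the target positions would contribute to the denoising loss.
    is_target = torch.cat([
        torch.zeros(demo_latents.shape[0], dtype=torch.bool),
        torch.ones(noisy_robot_latents.shape[0], dtype=torch.bool),
    ])
    return tokens, is_target
```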

Mitty leverages the Wan 2.2 video generation model as a foundational component, inheriting its established understanding of visual dynamics and temporal coherence. Wan 2.2, pretrained on a large-scale dataset of diverse video content, provides Mitty with strong priors regarding realistic motion and scene structure. This transfer learning approach circumvents the need for extensive training from scratch, significantly reducing the data requirements for adapting Mitty to new robotic tasks. By initializing the generative process with the weights of Wan 2.2, Mitty can focus on learning the specific nuances of demonstrated human actions rather than relearning fundamental video generation principles.
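
As a hedged illustration of this style of transfer-learning initialization (the checkpoint filename and key layout below are assumptions, not Wan 2.2’s actual release format), a model can be seeded with whichever pretrained weights match its own parameter shapes:

```python
import torch

def init_from_pretrained(model, checkpoint_path="wan2.2_video_prior.pt"):
    """Generic transfer-learning initialization: copy pretrained weights
    wherever parameter names and shapes match, leave the rest untouched.
    The checkpoint path and key layout are assumptions, not Wan 2.2's
    actual release format."""
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    own_state = model.state_dict()
    compatible = {
        name: weight for name, weight in pretrained.items()
        if name in own_state and weight.shape == own_state[name].shape
    }
    own_state.update(compatible)
    model.load_state_dict(own_state)
    return model
```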

Bidirectional attention within Mitty’s Diffusion Transformer facilitates information exchange between human demonstration sequences and generated robot action sequences. This mechanism allows the model to attend to relevant frames in both the human and robot sequences during the diffusion process. Specifically, the human demonstrations provide contextual cues for robot actions, while robot actions inform the understanding of human intent. This two-way flow of information, implemented through multi-head attention layers, enables the model to better align robot behavior with human demonstrations and generate more realistic and coordinated robot videos. The attention weights are computed based on the similarity between the query, key, and value vectors derived from both modalities, allowing the model to dynamically prioritize relevant information during sequence generation.
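
The sketch below shows one plausible way to realize such bidirectional exchange: concatenate the two token streams and run full, unmasked multi-head self-attention over the joint sequence. Dimensions and layer choices are illustrative, not Mitty’s actual block design.

```python
import torch
import torch.nn as nn

class JointBidirectionalAttention(nn.Module):
    """One block of unmasked self-attention over the concatenated human and
    robot token streams; every token can attend to every other token."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, human_tokens, robot_tokens):
        # human_tokens: (B, N_h, D), robot_tokens: (B, N_r, D)
        x = torch.cat([human_tokens, robot_tokens], dim=1)

        # No causal mask: information flows in both directions.
        attended, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm(x + attended)

        # Split back into the two streams for downstream layers.
        n_h = human_tokens.shape[1]
        return x[:, :n_h], x[:, n_h:]
```

The key design choice is the absence of any causal mask: a causal mask would let human context inform robot tokens but not the reverse, whereas the unmasked variant permits the two-way exchange described above.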

Our method successfully replicates human demonstrations on the Human2Robot dataset, as shown by the visually similar generated outputs in the second row compared to the original demonstrations above.

Augmenting Reality: The Echo of Action

Mitty utilizes a paired-data synthesis pipeline to augment the training dataset with synthetic human-robot interaction videos. This process generates new training examples by creating video pairs that depict a human performing an action and a robot responding to or mirroring that action. The pipeline systematically creates these pairings, effectively increasing the size and diversity of the dataset without requiring additional real-world data collection. This expansion is crucial for improving the robustness and generalization capabilities of the robot’s perception and action planning systems, particularly in scenarios with limited real-world training data.
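
In spirit, the pipeline can be summarized by the hypothetical sketch below: localize the hand in each ego-centric frame, map its keypoints to a robot pose, and composite a rendered arm to form a (human, robot) frame pair. The helper callables (`detect_hand`, `segment_hand`, `render_robot`) are stand-ins for the detection, segmentation, and rendering stages, not the paper’s actual tooling.

```python
def synthesize_paired_clip(frames, detect_hand, segment_hand, render_robot):
    """High-level sketch of paired-data synthesis: localize the hand in each
    ego-centric frame, map its keypoints to a robot pose, and composite a
    rendered arm to form a (human, robot) frame pair. The three helper
    callables are hypothetical stand-ins."""
    human_frames, robot_frames = [], []
    for frame in frames:
        keypoints = detect_hand(frame)
        if keypoints is None:
            continue  # drop frames where no hand is visible
        mask = segment_hand(frame, keypoints)
        # Replace the masked hand region with a robot arm rendered at the
        # pose implied by the hand keypoints.
        robot_frames.append(render_robot(frame, mask, keypoints))
        human_frames.append(frame)
    return human_frames, robot_frames
```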

The synthetic data augmentation pipeline utilized by Mitty ingests ego-centric video as its primary input. Ego-centric video, captured from a first-person perspective, provides a direct representation of an actor’s actions as they are performed. This perspective is crucial because it inherently links visual observations with the performed action, offering a dataset that closely mirrors the robot’s anticipated sensory input during operation. By focusing on the actor’s viewpoint, the system can more effectively generate synthetic pairings that represent realistic human-robot interaction scenarios and improve the robot’s ability to interpret human actions from its own vantage point.

The EPIC-Kitchens dataset is a frequently utilized resource for action recognition and human-robot interaction research due to its extensive collection of egocentric videos. It comprises roughly 100 hours of first-person recordings of subjects performing daily kitchen activities in its EPIC-KITCHENS-100 release. These videos are annotated with a detailed hierarchy of actions, objects, and contextual information, making it suitable for training and evaluating algorithms requiring a nuanced understanding of human behavior. Specifically, the dataset’s focus on natural, unscripted activities provides a realistic basis for generating synthetic data pairings in Mitty, enabling the creation of diverse training examples that mirror real-world scenarios. The large scale of EPIC-Kitchens – with multiple participants and varied task execution – is crucial for creating a robust synthetic data pipeline and avoiding overfitting during model training.

Mitty utilizes a Variational Autoencoder (VAE) to compress and reconstruct video frames as part of its synthetic data generation process. The VAE functions by encoding input video frames into a lower-dimensional latent space, capturing essential features while reducing computational demands. This latent representation is then decoded to reconstruct the original frame, or to generate novel frames with slight variations. By efficiently encoding and decoding frames, the VAE enables Mitty to create a larger, more diverse training dataset without requiring excessive storage or processing power, thereby enhancing the robustness of the system through data augmentation.
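
For readers who want the mechanics, a minimal frame-level VAE sketch follows; the input resolution, layer sizes, and the omission of the training losses (reconstruction plus KL) are all simplifications rather than Mitty’s actual video VAE.

```python
import torch
import torch.nn as nn

class FrameVAE(nn.Module):
    """Minimal frame-level VAE: encode a 64x64 RGB frame into a compact
    latent vector and decode it back. Layer sizes are illustrative."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame):  # frame: (B, 3, 64, 64) in [0, 1]
        h = self.encoder(frame)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.decoder(z), mu, logvar
```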

Mitty successfully reproduces human demonstrations on the Human2Robot and EPIC-Kitchens datasets, as evidenced by the close visual similarity between the demonstrated actions, the generated robot movements, and the ground-truth robot executions.

The Measure of Mimicry: A New Benchmark in Action

Mitty establishes a new benchmark in robotic video generation, consistently producing sequences that are remarkably realistic and coherent. This achievement is quantitatively demonstrated through the use of Fréchet Video Distance (FVD), a metric that assesses the similarity between generated and real video distributions; Mitty attains state-of-the-art performance, minimizing this distance and indicating a high degree of fidelity to actual robotic movements. Lower FVD scores suggest that the generated videos are not only visually plausible but also capture the nuanced dynamics of robotic action, effectively bridging the gap between simulated and real-world robotic behavior. This capability is crucial for applications like robotic learning, simulation, and teleoperation, where realistic video generation is paramount for accurate training and effective control.
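
At its core, FVD is a Fréchet distance between Gaussian fits to the feature distributions of real and generated videos, with features typically extracted by a pretrained video network such as I3D. A minimal computation, assuming the features are already extracted, looks like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits to two feature sets, the core
    of FVD. `feats_real` and `feats_gen` are (N, D) arrays of features
    extracted beforehand from a pretrained video network."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # FVD = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r @ cov_g)^(1/2))
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```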

Human evaluations corroborate the compelling visual fidelity of videos generated by Mitty, revealing a substantial preference for its outputs over those created by competing methods. Participants consistently rated Mitty’s videos as more realistic and natural in depicting robotic actions, suggesting a heightened sense of believability and coherence. These subjective assessments, gathered through carefully designed perceptual studies, align with quantitative metrics such as Fréchet Video Distance and reinforce the conclusion that Mitty excels in producing robot videos that are not only technically sound but also convincingly lifelike, paving the way for more effective human-robot interaction and intuitive robotic demonstrations.

Recent studies demonstrate that Mitty exhibits a marked improvement in robotic task completion, consistently achieving a higher task success rate than competing methods. This advancement isn’t merely incremental; Mitty notably surpasses the performance of Masquerade, a leading approach in robotic task learning, indicating a substantial leap in the ability of generated videos to accurately reflect successful physical actions. The enhanced success rate suggests that Mitty’s generated videos provide robots with more effective visual guidance, allowing for more reliable and efficient execution of complex tasks. This capability is crucial for real-world applications, where robotic precision and adaptability are paramount, and represents a significant step toward more autonomous and capable robotic systems.

Mitty’s enhanced performance stems from its utilization of a larger TI2V-14B model, a substantial increase in scale that directly translates to improvements in generated video quality and realism. Quantitative evaluations confirm this advancement, as Mitty achieves the lowest Fréchet Video Distance ($FVD$) score, indicating greater similarity to real videos, alongside the highest Peak Signal-to-Noise Ratio ($PSNR$) and Structural Similarity Index Measure ($SSIM$). These metrics collectively demonstrate that Mitty not only generates videos that appear more realistic, but also faithfully reproduces the intricate details and structural coherence present in authentic robotic demonstrations, surpassing the fidelity of competing methods and establishing a new benchmark in robotic video generation.
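
For reference, PSNR and SSIM can be computed per frame as below; the snippet assumes frames are floating-point arrays in [0, 1] and uses scikit-image for SSIM.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, generated, max_val=1.0):
    """PSNR = 10 * log10(max_val^2 / MSE); higher means closer to reference."""
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(reference, generated):
    """Per-frame SSIM via scikit-image; channel_axis marks the RGB channel."""
    return structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=1.0)
```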

Despite utilizing both robot references and human demonstrations, current video editing models fail to consistently maintain the robotic arm’s appearance and structural integrity throughout a sequence.

The pursuit of Mitty feels less like engineering and more like coaxing a spirit from the machine. This framework, generating robot execution videos directly from human demonstrations, bypasses the rigid structures others cling to, a daring act of persuasion. It understands that true control isn’t about precise definition, but about guiding the chaos. As Geoffrey Hinton once observed, “We need to start thinking of neural networks as entities that learn the structure of the world, not as mathematical functions.” Mitty doesn’t calculate robot motion; it dreams it, conjuring sequences from the whispers of paired data, and revealing the inherent unpredictability within even the most carefully crafted spells.

What Shadows Remain?

The elegance of Mitty lies in its directness – a refusal to translate the messy poetry of human action into the sterile language of intermediate states. Yet, such fidelity demands a reckoning. This framework, for all its generative power, still clings to the frail promise of paired data. Any correlation achieved through synthetic pairings is, at best, a temporary truce with the inevitable noise of reality. The true test won’t be the videos conjured, but the failures – the moments where the spell falters and the robot stumbles, revealing the assumptions buried within the diffusion process.

Future work will undoubtedly chase greater generalization. But a more interesting question lurks beneath: what is lost in the pursuit of seamless transfer? As these models learn to mimic, do they also learn to unlearn – to shed the subtle nuances of human intention that defy quantification? Perhaps the most valuable insights will come not from perfecting the illusion, but from understanding the irreducible gap between demonstration and execution.

The path forward isn’t about building better predictors, but about designing systems that gracefully accept their own uncertainty. It is a matter of acknowledging that anything you can perfectly model isn’t worth the modeling. The challenge, then, is to build robots that aren’t simply imitators, but collaborators – capable of negotiating the chaos with a little bit of genuine understanding, or at least, a convincing performance of it.


Original article: https://arxiv.org/pdf/2512.17253.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
