Author: Denis Avetisyan
A new approach leverages the power of large-scale video understanding to dramatically improve how robots learn complex tasks from limited demonstrations.

Mimic-video decouples action planning from control using generative video pretraining and flow matching to achieve generalizable robot control beyond vision-language-action models.
Despite advances in robotic manipulation, current vision-language-action models struggle with sample efficiency due to a reliance on inferring physical dynamics from limited robot trajectories. This work introduces mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs, a novel approach leveraging large-scale pretrained video models to ground policies in rich visual priors and decouple action planning from low-level control. By pairing video understanding with a flow matching-based action decoder, we demonstrate significant improvements in both sample efficiency, achieving a 10x gain, and convergence speed. Could this paradigm shift unlock truly generalizable robotic control, moving beyond the limitations of purely language-guided approaches?
The Illusion of Creation: Generative Models and Their Discontents
Generative modeling represents a paradigm shift in machine learning, moving beyond simple recognition to the creation of new data instances that convincingly mimic a given dataset. At its core, this field aims to capture the underlying probability distribution – the complex statistical blueprint – governing a collection of observations. Successfully learning this distribution allows algorithms to sample new data points that adhere to the same patterns and characteristics as the original data, whether it be realistic images, compelling text, or intricate musical compositions. This isn’t merely about copying existing examples; it’s about understanding the rules that govern the data and then using those rules to generate novel, yet plausible, variations. The challenge lies in the inherent complexity of real-world data distributions, which are often high-dimensional and contain intricate dependencies, demanding increasingly sophisticated modeling techniques to accurately represent and reproduce them.
Generative Adversarial Networks (GANs), while initially promising for creating realistic data, frequently suffered from significant training challenges. A core issue was instability; the delicate balance between the generator – tasked with creating data – and the discriminator – responsible for distinguishing real from fake – often failed, leading to oscillations and non-convergence. Equally problematic was ‘mode collapse’, where the generator learned to produce only a limited variety of samples, effectively ignoring large portions of the desired data distribution. This resulted in a lack of diversity in the generated output, as the model fixated on a few easily reproducible examples rather than capturing the full complexity of the training data. These limitations spurred research into alternative generative modeling techniques, paving the way for approaches like diffusion models that offered greater stability and diversity in data generation.
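For concreteness, the delicacy of that balance is visible in the standard GAN objective, the minimax game the generator $G$ and discriminator $D$ play against one another (the original formulation, not anything specific to this paper):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

If $D$ wins too decisively, the generator's gradients vanish; if $G$ finds a handful of samples that reliably fool $D$, it has little incentive to cover the rest of the distribution, which is exactly the mode collapse described above.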
Denoising Diffusion Models represent a significant advancement in generative modeling by shifting the paradigm from directly creating data to learning to reverse a gradual noising process. These models begin with pure noise – random data – and iteratively refine it, step-by-step, into structured outputs resembling training data. This process, inspired by non-equilibrium thermodynamics, involves learning to predict and remove a small amount of noise at each iteration, gradually revealing underlying patterns. Unlike Generative Adversarial Networks (GANs), which can suffer from training instability, diffusion models offer a more stable training process and often achieve superior sample quality. The iterative refinement allows for a nuanced control over the generation process and has demonstrated remarkable success in generating high-resolution images, audio, and even molecular structures, establishing diffusion models as a leading technique in the field of generative artificial intelligence.
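As a rough sketch of this noising-and-denoising recipe, a generic DDPM-style training step corrupts clean data at a random timestep and asks the network to predict the injected noise. This is an illustrative PyTorch fragment, not the video model discussed later; `model` and `alphas_cumprod` are assumed inputs.

```python
import torch

def diffusion_training_step(model, x0, alphas_cumprod):
    """One DDPM-style training step: corrupt clean data x0 with noise
    at a random timestep t, then train the model to predict that noise."""
    batch = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)

    # Forward (noising) process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t].view(batch, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The network learns to reverse the corruption by predicting eps
    eps_pred = model(x_t, t)
    return torch.nn.functional.mse_loss(eps_pred, eps)
```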

Flow Matching: A Direct Route (If It Exists)
Flow Matching utilizes a Continuous Normalizing Flow (CNF) as the foundation for generative model training. CNFs define a continuous transformation between a simple probability distribution, typically Gaussian noise, and the data distribution through an ordinary differential equation. This allows for the definition of a velocity field that describes how a point moves under the flow. By learning this velocity field, the model learns a direct mapping from noise to data, effectively constructing a generative process. The framework relies on defining a time-dependent transformation $x_t$ where $x_0$ is the noise and $x_1$ is the generated sample, and training involves estimating the velocity field that governs this transformation.
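A minimal sketch of the training side, assuming the common linear interpolation path between noise $x_0$ and data $x_1$ (illustrative PyTorch with an assumed `velocity_net`, not the paper's code):

```python
import torch

def flow_matching_loss(velocity_net, x1):
    """Conditional flow matching with a linear path x_t = (1 - t) x0 + t x1.
    The regression target is the constant velocity x1 - x0 along that path."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # flow time in [0, 1]
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))

    x_t = (1.0 - t_b) * x0 + t_b * x1              # point on the path
    target_v = x1 - x0                             # velocity of the linear path

    v_pred = velocity_net(x_t, t)
    return torch.nn.functional.mse_loss(v_pred, target_v)
```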
Flow Matching establishes a direct mapping from a noise distribution to the data distribution by learning a time-dependent vector field, $f(\mathbf{x}_t, t)$, where $\mathbf{x}_t$ represents a noisy data point at time step $t$. This contrasts with iterative refinement approaches; instead of gradually denoising, the model directly predicts the displacement needed to move from a noisy sample towards the data manifold. This direct learning paradigm significantly improves training efficiency because the model optimizes the vector field to directly transform noise into data, reducing the number of steps and computational resources required compared to methods that rely on iterative sampling or score-based approaches. The learned vector field allows for single-step generation, potentially offering substantial speedups during inference.
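Generation then reduces to integrating the learned field from $t=0$ (noise) to $t=1$ (data). A plain Euler integrator makes the trade-off between step count and speed explicit; again this is an illustrative sketch, with `velocity_net` assumed from the previous example:

```python
import torch

@torch.no_grad()
def sample(velocity_net, shape, steps=8, device="cpu"):
    """Generate samples by Euler-integrating dx/dt = f(x_t, t) from t=0 (noise)
    to t=1 (data). Fewer steps trade accuracy for speed; steps=1 is the
    single-step limit mentioned above."""
    x = torch.randn(shape, device=device)  # start from pure noise (t = 0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * velocity_net(x, t)    # follow the learned velocity field
    return x
```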
Flow Matching achieves faster generation and improved sample efficiency by directly learning a vector field that maps noise to data, in contrast to the long iterative refinement chains of diffusion models. Because the learned field can be integrated in far fewer steps, sample generation requires substantially less computation. Empirical results demonstrate that Flow Matching can achieve an order-of-magnitude improvement in sample efficiency over traditional diffusion models, requiring substantially fewer training samples to reach comparable or superior generation quality. This gain is particularly valuable when data is limited or compute is constrained.

Partial Denoising: Tweaking the Knobs (Because Perfection is Overrated)
Flow Matching’s efficacy is directly linked to the precise manipulation of the denoising trajectory. Unlike traditional generative models that often employ a fixed denoising schedule, this approach allows for dynamic control over the noise removal process. By strategically adjusting the extent of denoising at each step, the model can navigate the generative space more efficiently and converge on desired outputs with greater stability. This control is achieved through parameters that govern the denoising process, enabling the model to prioritize relevant features and suppress noise effectively, ultimately leading to improved sample quality and faster convergence rates, as demonstrated by a 2x improvement over standard vision-language-action (VLA) baselines.
Partial denoising extracts intermediate representations from the video model by utilizing a noisy visual plan. This method operates by intentionally introducing noise to the initial video data and then progressively denoising it through the model. The resulting intermediate states, representing partially denoised versions of the input, are then leveraged as feature maps for downstream tasks or further refinement. This approach differs from complete denoising, which aims to fully reconstruct the original clean data, and allows for more granular control over the generative process by accessing representations at various stages of noise reduction.
The Partial Denoising strategy uses a parameter, Flow Time ($\tau$), to regulate the degree of noise reduction applied during the generative process. Specifically, $\tau$ controls the extent to which the model denoises a noisy input, thereby directly shaping the characteristics of the generated output. Empirical results show that this approach yields a 2x improvement in convergence speed over standard Vision-Language-Action (VLA) baselines, indicating a more efficient training process and faster arrival at strong model performance.
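A hedged sketch of how partial denoising might be wired up: noise the visual plan to flow time $\tau$, take a few integration steps with the video model's velocity field, and stop well short of a full reconstruction. Every name here (`video_velocity_net`, `plan_latent`, the step sizes) is an illustrative assumption, not the mimic-video interface.

```python
import torch

@torch.no_grad()
def partial_denoise_features(video_velocity_net, plan_latent, tau=0.7, steps=2, dt=0.05):
    """Illustrative partial-denoising pass (hypothetical interface).
    1) Corrupt the clean visual plan to flow time tau along the linear path.
    2) Integrate the video model's velocity field for a few small steps,
       stopping well short of t = 1 (full reconstruction).
    The partially denoised latent serves as the conditioning feature map
    for the downstream action decoder."""
    noise = torch.randn_like(plan_latent)
    # Noisy visual plan at flow time tau: x_tau = (1 - tau) * noise + tau * plan
    x = (1.0 - tau) * noise + tau * plan_latent

    t = tau
    for _ in range(steps):
        t_b = torch.full((plan_latent.shape[0],), t, device=plan_latent.device)
        x = x + dt * video_velocity_net(x, t_b)  # move toward the data manifold
        t += dt
    return x  # intermediate representation, not a fully denoised video
```

Under this convention a larger $\tau$ injects less noise into the plan, so $\tau$ trades how much of the video prior survives against how much the model must reconstruct; the paper's exact noise schedule may differ.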
The Video-Action Model (VAM) demonstrates significant data efficiency in action decoder training. Empirical results indicate that VAM achieves competitive performance using only 2% of the action data typically required by standard VLA baselines, a 98% reduction in the volume of action decoder training data needed to reach comparable performance, suggesting substantial gains in training resource efficiency and scalability.

The pursuit of generalizable robot control, as demonstrated by mimic-video’s decoupling of action planning from control, feels predictably optimistic. It’s a valiant attempt to build a framework that resists the entropy of production environments. Andrey Kolmogorov observed, “The most important thing in science is not to be afraid of making mistakes.” This resonates; mimic-video, with its reliance on generative video pretraining and latent representations, will undoubtedly encounter edge cases and unforeseen failures. The model’s initial elegance will, inevitably, be compromised by the realities of deployment. Every optimization, even those striving for robustness, will eventually require a counter-optimization. It’s not a flaw, simply a characteristic of complex systems – architecture isn’t a diagram, it’s a compromise that survived deployment.
What’s Next?
The decoupling of action planning and control, as demonstrated by this work, feels less like a breakthrough and more like a return to first principles. For years, the field chased end-to-end learning, believing the network would ‘figure it out’. It rarely did. The reliance on large-scale, pretrained video models is, predictably, the new bottleneck. While these models provide richer priors, the transfer function to robotic control remains fragile. One anticipates a future consumed by adversarial examples crafted not from image noise, but from subtly altered kinematic sequences: the robot’s equivalent of a misspelled word.
The promise of improved sample efficiency is, as ever, contingent on the definition of ‘efficient’. The pretraining phase requires datasets orders of magnitude larger than any currently available for complex robotic tasks. The notion of ‘generalizable’ control also requires scrutiny. A robot successfully navigating a simulated kitchen is still remarkably poor at opening a jar of pickles in the real world. It is safe to predict a renewed focus on domain adaptation, and likely a proliferation of increasingly elaborate simulation environments, each of which will inevitably fail to capture some critical aspect of reality.
Ultimately, this work, like many others, pushes the problem one step further down the line. The elegant diagrams of decoupled action and control will, in time, become tangled monoliths of error handling and special cases. If all tests pass, it simply means the tests are testing for the wrong thing.
Original article: https://arxiv.org/pdf/2512.15692.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/