Robots Learn by Watching: Skill Abstraction from Passive Video

Author: Denis Avetisyan


A new framework enables robots to learn complex skills simply by observing videos, without needing explicit action labels.

This work introduces SOF, a method leveraging optical flow to extract reusable skills from action-free videos, improving performance across diverse robotic tasks and platforms.

Despite advances in robot learning, acquiring generalizable skills remains challenging due to the limitations of existing datasets and the difficulty of translating visual observations into effective action. This work, ‘Learning Skills from Action-Free Videos’, introduces Skill Abstraction from Optical Flow (SOF), a novel framework that learns reusable robotic skills directly from large collections of passively observed videos. By leveraging optical flow as an intermediate representation, SOF bridges the gap between visual perception and actionable skills, enabling improved performance in complex, multi-task scenarios. Could this approach unlock a new paradigm for robot learning, allowing machines to acquire expertise simply by watching the world around them?


The Illusion of Control: Why Robots Need More Than Pixels

Conventional robotic learning methods are often hampered by a critical dependency on meticulously labeled action data – a process that proves exceptionally time-consuming and costly when transitioning robots from controlled laboratory settings to the complexities of the real world. This reliance necessitates a human expert to not only perform the desired action but also to precisely annotate every relevant detail for the robot to learn from, creating a significant bottleneck in deployment. The need for extensive labeled datasets limits a robot’s ability to quickly adapt to new environments or tasks, and scaling these systems becomes impractical as the demand for increasingly versatile robotic applications grows. Consequently, research is shifting toward methods that minimize or eliminate the need for manual labeling, exploring techniques like imitation learning from raw sensory input or reinforcement learning with carefully designed reward functions, to overcome this fundamental limitation.

The direct interpretation of visual data, such as video, poses substantial difficulties for robotic learning due to the sheer volume of information contained within each frame. Each pixel represents a single data point, resulting in extraordinarily high-dimensional input spaces that are computationally expensive to process and prone to the ‘curse of dimensionality’. Furthermore, raw pixel data is inherently ambiguous; a single object can appear in countless variations of pose, lighting, and occlusion, making it difficult for algorithms to discern relevant features and establish consistent representations. This ambiguity necessitates complex and often brittle algorithms to filter noise and infer underlying structure, hindering the robot’s ability to generalize learned behaviors to novel situations or even slight variations in the environment. Consequently, systems struggle to reliably extract meaningful actions and goals from the continuous stream of visual input, demanding more robust and efficient methods for visual understanding.

Current robotic systems often struggle when tasked with applying learned skills to robots with different physical characteristics – a limitation stemming from an over-reliance on embodiment-specific data. A skill mastered by one robot – such as grasping an object or navigating a simple maze – doesn’t automatically translate to a robot with different arm lengths, joint configurations, or even a different number of legs. This lack of generalization hinders widespread robotic deployment, as it necessitates re-training for each new platform, effectively negating the benefits of prior learning. Researchers are actively exploring methods for skill abstraction – techniques that allow robots to learn the underlying principles of a task, independent of its specific physical realization – to overcome this challenge and unlock truly adaptable robotic intelligence. Ultimately, the ability to transfer skills across diverse embodiments represents a crucial step toward creating robots that can operate reliably in unpredictable, real-world environments.

Optical Flow: A Glimpse Beyond the Algorithm

Skill Abstraction from Optical Flow (SOF) represents a departure from traditional reinforcement learning and imitation learning approaches by utilizing optical flow as an intermediary step in skill acquisition. Rather than requiring labeled actions – such as “grasp,” “walk,” or “turn” – SOF directly analyzes the apparent motion of visual elements within video data. This motion, quantified as optical flow, provides information about how an agent interacts with its environment, independent of what those interactions are labeled as. By learning directly from this relative pixel motion, SOF aims to discover reusable skills without the limitations and biases inherent in predefined action spaces or manual annotation, enabling learning from unlabeled video streams and potentially generalizing to novel situations.

SOF operates on the principle that fundamental skills can be derived directly from visual motion without requiring labeled actions. This is achieved by analyzing the apparent displacement of pixels between consecutive video frames – the optical flow – to identify patterns indicative of specific skills. The method utilizes readily available, unlabeled video data, eliminating the need for costly and time-consuming manual annotation of actions. By focusing on the relative motion of pixels, SOF can discern skills even in the absence of explicit action recognition, as the visual changes associated with a skill are inherently present in the optical flow regardless of how that skill is categorized. This approach allows for the discovery of a broad repertoire of skills from diverse visual inputs, effectively treating motion as a primary signal for skill identification.
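To make the input signal concrete, the sketch below computes the kind of dense optical flow SOF abstracts skills from: a per-pixel displacement field between consecutive frames. The Farnebäck estimator from OpenCV and its parameters are stand-in assumptions for illustration; the paper does not prescribe this particular flow method.

```python
# Illustrative only: dense optical flow between consecutive frames, the raw
# motion signal from which SOF-style skills are abstracted. The Farneback
# estimator and its parameters are assumptions, not the paper's exact setup.
import cv2
import numpy as np

def flow_sequence(frames):
    """Return per-pixel (dx, dy) displacement fields for consecutive frames."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # args: prev, next, flow, pyr_scale, levels, winsize, iterations,
        #       poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)  # shape (H, W, 2): motion vector for every pixel
        prev = curr
    return flows
```

Each entry in the returned list is exactly the high-dimensional motion field that the next stage must compress into something a policy can work with.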

Autoencoders provide a dimensionality reduction technique crucial for processing optical flow data. Optical flow, representing the apparent motion of image pixels between frames, generates high-dimensional data streams. Autoencoders, a type of neural network, learn efficient codings of this data by compressing it into a lower-dimensional latent space and then reconstructing the original input. This compression reduces computational demands and noise, facilitating more effective learning of skills from the reduced representation. The latent space captures essential features of the motion, discarding irrelevant details and creating a more manageable input for subsequent skill discovery algorithms. Specifically, the autoencoder minimizes the reconstruction loss, ensuring the compressed representation retains sufficient information to accurately recreate the original optical flow field.
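A minimal sketch of that compression step is shown below: a convolutional autoencoder that maps a two-channel flow field (assumed resized to 64×64 here) down to a small latent vector and reconstructs it. The layer sizes and latent width are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (assumed architecture) of an autoencoder over optical flow:
# encode a (2, 64, 64) flow field into a compact latent vector, then decode it
# back, training with reconstruction loss so the latent keeps the motion content.
import torch
import torch.nn as nn

class FlowAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> (32, 32, 32)
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> (64, 16, 16)
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),     # -> (2, 64, 64)
        )

    def forward(self, flow: torch.Tensor):
        z = self.encoder(flow)       # compact representation of the motion
        recon = self.decoder(z)      # reconstructed flow field
        return recon, z

# Training minimizes reconstruction error so z retains the essential motion:
#   recon, z = model(flow); loss = nn.functional.mse_loss(recon, flow)
```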

From Motion to Meaning: Constructing a Skill Policy

The Skill Policy employs discrete Skill Tokens as a representational framework for learned skills, facilitating policy learning within a defined skill space. This approach discretizes the continuous space of possible actions into a finite set of skills, each represented by a unique token. By operating on these tokens rather than directly on low-level actions, the policy can learn more efficiently and generalize better to new situations. This token-based representation enables the use of sequence modeling techniques, such as Transformers, to predict and execute complex behaviors as a sequence of discrete skills, improving sample efficiency and scalability compared to methods that directly map observations to actions.
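One way to picture this discretization is a vector-quantization-style codebook, as in the sketch below: each continuous latent motion vector is assigned the index of its nearest learned code, and that index serves as the skill token. The codebook size, latent width, and quantization scheme are assumptions for illustration rather than the paper's exact design.

```python
# Minimal sketch of skill tokens as codebook indices (a VQ-style quantizer is
# assumed here; the paper's discretization may differ). A latent motion vector
# is mapped to the id of its nearest codebook entry.
import torch
import torch.nn as nn

class SkillTokenizer(nn.Module):
    def __init__(self, num_tokens: int = 256, latent_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, latent_dim)

    def tokenize(self, z: torch.Tensor) -> torch.Tensor:
        """Map latent vectors (B, latent_dim) to discrete skill-token ids (B,)."""
        dists = torch.cdist(z, self.codebook.weight)  # (B, num_tokens)
        return dists.argmin(dim=-1)                   # nearest code = skill token

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Recover the continuous code associated with a skill token."""
        return self.codebook(token_ids)
```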

The Skill Policy employs a Transformer architecture to forecast sequences of discrete skill tokens based on visual input from image observations and contextual information regarding the current task. This architecture utilizes self-attention mechanisms to weigh the relevance of different parts of the input, enabling it to model long-range dependencies within the observation and task context. By predicting a series of skills, rather than directly outputting actions, the system achieves a level of abstraction that facilitates high-level planning and generalization to novel situations. The Transformer receives image embeddings and task embeddings as input, processes them through multiple layers of self-attention and feed-forward networks, and outputs a probability distribution over the skill token vocabulary, predicting the next skill in the sequence.
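The sketch below shows a minimal version of such a policy, assuming pre-computed image and task embeddings of fixed width; the embedding sizes, layer counts, and the absence of causal masking are simplifications for illustration, not the paper's configuration.

```python
# Minimal sketch (assumed shapes and sizes) of a Transformer skill policy: it
# consumes an image embedding, a task embedding, and the skill tokens emitted
# so far, and outputs logits over the next skill token.
import torch
import torch.nn as nn

class SkillPolicy(nn.Module):
    def __init__(self, num_tokens=256, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.skill_embed = nn.Embedding(num_tokens, d_model)
        self.img_proj = nn.Linear(512, d_model)    # assumed image-encoder width
        self.task_proj = nn.Linear(128, d_model)   # assumed task-embedding width
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_tokens)

    def forward(self, img_emb, task_emb, skill_history):
        """img_emb: (B, 512), task_emb: (B, 128), skill_history: (B, T) token ids."""
        tokens = torch.cat(
            [self.img_proj(img_emb).unsqueeze(1),
             self.task_proj(task_emb).unsqueeze(1),
             self.skill_embed(skill_history)], dim=1)   # (B, T+2, d_model)
        h = self.backbone(tokens)
        return self.head(h[:, -1])                       # logits over the next skill token
```

At inference time the predicted token is appended to the history and the model is queried again, yielding a sequence of skills rather than a single low-level action.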

The Flow-to-Action module serves as the interface between the predicted skill sequence and the robot’s actuators, converting discrete skill tokens into continuous control signals. Evaluations demonstrate this module achieves a task success rate of 0.49, a substantial improvement over methods that map skills directly to actions: the two direct-mapping baselines reached success rates of only 0.15 and 0.21, highlighting the efficacy of the intermediate skill-translation step in improving overall task performance and robustness.
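As a rough picture of what such a module might look like, the sketch below conditions a small network on the current observation features and the embedding of the selected skill and emits one continuous command. The dimensions, MLP structure, and 7-DoF action space are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a flow-to-action decoder: observation features plus the
# chosen skill's embedding go in, a continuous control command comes out.
# Sizes and the 7-DoF action space are assumptions for illustration.
import torch
import torch.nn as nn

class FlowToAction(nn.Module):
    def __init__(self, obs_dim=64, skill_dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + skill_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g. joint velocities or end-effector deltas
        )

    def forward(self, obs: torch.Tensor, skill_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, skill_emb], dim=-1))
```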

Beyond Imitation: Towards a Foundation of Competence

Standard behavior cloning in robotics often struggles with adapting to new situations because it merely replicates observed actions without understanding the underlying principles. In contrast, SOF takes a different approach, focusing on discovering reusable, fundamental skills instead of directly mimicking expert trajectories. This emphasis on skill acquisition allows a robot to generalize its knowledge to previously unseen scenarios and environments. By learning how to perform actions – such as grasping, pushing, or navigating – rather than simply copying what actions to take, SOF enables robots to overcome the limitations of imitation and achieve more robust and adaptable intelligence. The system effectively builds a foundation of core competencies that can be combined and modified to address a wide range of tasks, ultimately leading to a more flexible and capable robotic system.

A significant advancement in robotic intelligence lies in the development of reusable skills, moving beyond systems trained for single, specific tasks. Skill-oriented frameworks enable robots to learn fundamental abilities – grasping, navigating, manipulating – which can then be combined and adapted to address a wide range of challenges. This approach drastically reduces the need for extensive, task-specific training; a robot proficient in basic manipulation, for example, can more readily learn to assemble novel objects or operate unfamiliar tools. Furthermore, these learned skills aren’t limited to a single robotic platform; they can be transferred and implemented across diverse hardware configurations, fostering adaptability and accelerating deployment in varied environments. The result is a more versatile and efficient robotic system, capable of responding to unforeseen situations and tackling complex problems with minimal retraining.

Recent investigations into robotic intelligence demonstrate a significant performance leap when employing skill-based learning over direct perception from pixel data. Specifically, a novel approach termed SOF achieved a 13% improvement in complex scenarios by focusing on learning reusable, fundamental skills rather than solely relying on visual input. This advancement suggests that abstracting away from raw pixel data – which is prone to noise and variability in lighting and viewpoint – allows robots to generalize more effectively to unseen environments and tasks. The resulting system exhibits enhanced robustness and adaptability, indicating that learning underlying skills is crucial for building truly intelligent robotic agents capable of navigating and interacting with the world beyond the limitations of simple imitation.

The Future of Skill Learning: Towards Autonomous Exploration

Current robotic skill learning often relies on imitation, where robots learn by replicating demonstrated behaviors – a technique exemplified by Diffusion Policy. However, this approach is limited by the need for extensive, labeled data and struggles to generalize to unseen situations. SOF offers a distinct and complementary pathway: rather than mimicking action-labeled demonstrations, it distills useful skills from passive observation of unlabeled video. This allows the agent to develop a repertoire of fundamental actions – such as reaching, grasping, or navigating – which can then be combined and adapted to address a wide range of tasks, even those not explicitly programmed or demonstrated. Consequently, SOF holds particular promise for building robots capable of independent learning and robust performance in dynamic, real-world environments, complementing imitation-based methods and pushing the boundaries of autonomous behavior.

Robotic systems traditionally acquire skills in the context of narrowly defined tasks, limiting their ability to adapt to novel situations. However, a shift towards decoupling skill learning from specific tasks promises to unlock continuous learning capabilities. This approach centers on enabling robots to first master fundamental, reusable skills – such as grasping, locomotion, or manipulation primitives – independently of any particular objective. These foundational skills then become building blocks that can be rapidly combined and refined to address a wide variety of new challenges. The result is a system capable of lifelong learning, where experience doesn’t simply improve performance on a single task, but actively expands the robot’s repertoire of abilities, fostering resilience and adaptability in complex, real-world environments. Such a paradigm moves beyond programmed behavior toward genuine autonomy, allowing robots to learn, evolve, and thrive through continuous interaction with their surroundings.

The development of skill learning decoupled from specific tasks represents a fundamental shift towards genuinely autonomous agents. Traditionally, robots have been programmed for pre-defined actions in structured environments; however, this new paradigm enables machines to explore, discover, and generalize skills applicable across a multitude of unforeseen situations. This capacity for continuous adaptation is crucial for navigating the inherent complexities and unpredictability of real-world scenarios, allowing robots to operate effectively even when confronted with novel obstacles or changing conditions. The result is not simply improved performance on known tasks, but the emergence of agents capable of independent problem-solving and sustained operation in dynamic, unstructured environments – a critical step towards realizing the full potential of robotics and artificial intelligence.

The pursuit of skill abstraction, as demonstrated by SOF, isn’t about imposing order on chaos, but recognizing its inherent structure. The framework leverages optical flow to distill reusable skills from action-free videos, acknowledging that robust systems aren’t built, they emerge. This mirrors a fundamental principle: stability is merely an illusion that caches well. Donald Davies observed, “A system’s true value lies not in its flawless execution of known tasks, but in its capacity to gracefully degrade in the face of the unknown.” SOF embodies this, creating skills adaptable to multi-task, long-horizon, and cross-embodiment settings – a testament to the power of emergent behavior over rigid design.

What Seeds Will Sprout?

The notion of distilling skills from passive observation – from action-free video – feels less like a technical achievement and more like a necessary surrender. The system, as it stands, captures echoes of movement, a ghostly impression of intent. But intent is brittle. It fractures against the unpredictable geometry of the real world. The framework presented here, SOF, is a promising graft, yet every refactor begins as a prayer and ends in repentance. The true test will not be replication of demonstrated actions, but improvisation in the face of the novel.

Optical flow serves as an elegant intermediary, a translation from visual noise to something resembling a command. However, it is a translation fraught with loss. The system’s success in multi-task settings hints at a deeper, underlying structure, but this structure is almost certainly an artifact of the training data, a comforting illusion of generality. It’s just growing up. The next iteration will inevitably involve a reckoning with the unobservable – the subtle forces, the contingent circumstances, the sheer luck that separates competence from failure.

The promise of cross-embodiment learning remains largely unfulfilled. Transferring skills between drastically different physical forms is not merely a matter of recalibrating parameters; it demands a fundamental re-evaluation of what constitutes a “skill” in the first place. Perhaps the ultimate destination is not intelligent robotics, but a deeper understanding of the embodied intelligence inherent in all systems – biological and artificial alike.


Original article: https://arxiv.org/pdf/2512.20052.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2025-12-24 18:59