Robots Learn From Watching: A Few Examples Unlock Real-World Skills

Author: Denis Avetisyan


Researchers have developed a new framework that allows robots to learn complex tasks from just a handful of demonstrations, paving the way for more adaptable and intuitive robotic systems.

A learned reward function, distilled from five demonstrations and focused on behavioral invariants rather than visual details, successfully guides a robotic arm across manipulation tasks (including peg insertion, box opening, and bulb unscrewing) and generalizes to unseen variations in position, viewpoint, and object appearance without requiring further training.

The FLORA framework combines motion flow representations, symbolic regression, and reinforcement learning to discover robust reward functions from limited data, enabling zero-shot generalization.

Designing reward functions that generalize beyond controlled settings remains a central challenge in robotics reinforcement learning. The work ‘Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations’ introduces FLORA, a framework that learns robust, symbolic reward functions from as few as five demonstrations by distilling behavioral invariants from motion flow representations. This approach enables zero-shot generalization to unseen positions, viewpoints, and object variations, significantly accelerating downstream policy learning and improving process alignment. Could this decoupling of reward design from specific visual instantiations unlock truly versatile and reusable robotic skills?


The Inevitable Friction of Sparse Signals

Traditional reinforcement learning algorithms often falter when faced with the realities of complex tasks, largely due to the issue of sparse rewards. Unlike simplified games where frequent feedback guides learning, many real-world scenarios – such as robotic navigation or intricate manipulation – provide only infrequent and delayed signals of success. This scarcity presents a significant challenge; the agent must explore extensively, attempting numerous actions without confirmation if it’s progressing toward the goal. Consequently, the algorithm struggles to discern effective strategies from random chance, leading to slow learning or complete failure. The problem isn’t a lack of potential reward, but rather the difficulty in associating early actions with eventual success, hindering the agent’s ability to build a useful understanding of the environment and the task at hand.

Robotic manipulation is a particularly acute case. Unlike game-playing scenarios with constant scoring, manipulating objects – assembling parts, for instance, or handling items in clutter – typically yields a reward only when the entire sequence is complete. With no intermediate signal, the robot must stumble upon a successful strategy through random exploration and cannot tell which of its actions contributed to the outcome. Bridging this ‘sparse reward’ gap therefore requires some additional source of guidance.

A significant limitation of current reinforcement learning approaches lies in their susceptibility to brittle reward specifications, hindering effective generalization. Often, rewards are meticulously engineered for a specific environment or task, inadvertently encoding assumptions that do not hold in novel situations. This leads to agents that perform well under limited conditions but fail dramatically when faced with even slight variations – a change in lighting, object pose, or the introduction of a new obstacle can completely derail learned behavior. The agent, having optimized for a narrow definition of success, lacks the robustness to adapt and continues pursuing the initially rewarded behavior even when it’s no longer appropriate or effective. Consequently, achieving true adaptability requires moving beyond precisely defined rewards towards methods that can infer underlying goals and learn more flexible, transferable representations of success.

This framework synthesizes robust reward signals from raw visual input through a pipeline comprising flow generation, progress estimation via a symbolic potential function, and a reward shaping module, enabling efficient policy learning via a bi-level optimization loop and accelerating real-world reinforcement learning in complex manipulation tasks.

Emergent Structure: Learning the Language of Reward

Traditional reinforcement learning often requires meticulously designed reward functions, a process that is both time-consuming and prone to bias. FLORA diverges from this approach by learning a symbolic reward function directly from task demonstrations. This learned function encapsulates the inherent structure of the task, allowing the agent to assess progress without explicit, pre-defined rewards. The system identifies key task elements and their relationships, translating observed states into scalar values representing task completion. By abstracting the underlying principles, FLORA achieves greater generalization and reduces the need for extensive reward engineering, enabling effective learning from limited demonstration data.

The Flow-Generator component within FLORA processes raw visual input, specifically RGB images, to create object-centric motion flows. This transformation involves identifying and tracking salient objects within the scene and representing their movement as vector fields. These flows differ from pixel-wise optical flow by focusing on object-level changes, resulting in a more compact and informative representation of the environment’s dynamics. By decoupling motion from background clutter and representing it relative to objects, the framework reduces dimensionality and improves the signal-to-noise ratio, facilitating the subsequent learning of reward functions from limited demonstration data. The resulting motion flows are then used as input to the Symbolic Potential Function.
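As a rough illustration of what an object-centric flow could look like as data, the sketch below averages the displacement of tracked keypoints per object between two frames. The keypoint dictionaries, the object names, and the `object_flows` helper are all hypothetical; the article does not describe how FLORA's flow generator is implemented, and the real component works from RGB images rather than ready-made keypoints.

```python
from statistics import fmean
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def object_flows(
    prev: Dict[str, List[Point]],
    curr: Dict[str, List[Point]],
) -> Dict[str, Point]:
    """Return one mean (dx, dy) displacement per object tracked in both frames.

    `prev` and `curr` map a (hypothetical) object name to its tracked
    keypoints in the previous and current frame, in the same order.
    """
    flows: Dict[str, Point] = {}
    for name, prev_pts in prev.items():
        curr_pts = curr.get(name)
        # Skip objects that were lost or whose tracks do not line up.
        if not curr_pts or len(curr_pts) != len(prev_pts):
            continue
        dx = fmean(c[0] - p[0] for p, c in zip(prev_pts, curr_pts))
        dy = fmean(c[1] - p[1] for p, c in zip(prev_pts, curr_pts))
        flows[name] = (dx, dy)
    return flows
```

The point of the sketch is the shape of the output: a handful of vectors keyed by object, rather than one vector per pixel, which is what makes the representation compact and insensitive to background appearance.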

FLORA’s reward function is defined by a Symbolic Potential Function which leverages a Large Language Model (LLM) to interpret object-centric motion flows and assign scalar progress values. This function effectively creates a reward landscape without requiring manual reward engineering. Crucially, the LLM is trained to map observed states, represented by these motion flows, to a single numeric value indicating progress towards the task goal, and it achieves functional reward landscapes with only five demonstration examples. This few-shot learning capability significantly reduces the data requirements for task specification compared to traditional reinforcement learning methods.
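To make the idea of a symbolic potential function more concrete, here is a toy closed-form expression that maps two flow-derived features to a progress value in [0, 1]. The features (`dist_to_goal`, `alignment`), the constants `d0` and `w`, and the formula itself are invented for illustration; the article does not give the expressions FLORA actually produces.

```python
def potential(
    dist_to_goal: float,
    alignment: float,
    d0: float = 0.30,
    w: float = 0.7,
) -> float:
    """Toy progress estimate in [0, 1] (hypothetical, not FLORA's function).

    dist_to_goal: remaining distance to the goal, in metres.
    alignment:    cosine between the object's motion and the goal direction.
    d0:           distance at or beyond which closeness counts as zero.
    w:            weight of the closeness term; the rest goes to alignment.
    """
    # Clamp both features so the result stays inside [0, 1].
    closeness = 1.0 - min(max(dist_to_goal / d0, 0.0), 1.0)
    aligned = (min(max(alignment, -1.0), 1.0) + 1.0) / 2.0
    return w * closeness + (1.0 - w) * aligned
```

Because the expression is a short formula over motion features rather than a network over pixels, its numerical parameters (here `d0` and `w`) are few and interpretable, which is the kind of structure that makes later parameter tuning tractable.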

FLORA consistently distinguishes between successful, partially successful, and random trajectories on the Lever-Pull task, even under out-of-distribution viewpoints, while baseline methods exhibit performance degradation, as evidenced by the clear separation of reward curves.

Stabilizing the System: Potential-Based Reward Shaping with Milestones

FLORA addresses the difficulties inherent in reward shaping through the implementation of PBRS-MS, or Potential-Based Reward Shaping with Milestone augmentation. Traditional reward shaping methods can introduce suboptimal policies or instability during the learning process; PBRS-MS mitigates these issues by leveraging a carefully constructed potential function. This function guides the agent towards desired behaviors without altering the optimal policy. The augmentation aspect involves incorporating milestones – intermediate goals – into the reward structure, providing more frequent and informative signals to accelerate learning and improve robustness, particularly in complex environments where sparse rewards are common.

Potential-Based Reward Shaping (PBRS) traditionally guarantees policy invariance, meaning the optimal policy remains unchanged, but can still suffer from instability during learning. PBRS-MS extends this approach by incorporating milestone augmentation, which introduces a series of intermediate rewards based on achieving pre-defined states or milestones within the task. This augmentation addresses the instability issue by providing more frequent reward signals, guiding the agent through the learning process and improving convergence. The resulting framework maintains the optimality guarantees of standard PBRS while demonstrably improving the stability of the learning process, allowing for more reliable policy optimization in complex environments.
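The policy-invariance property of standard PBRS can be checked numerically. The sketch below implements the textbook shaping term, r' = r + γΦ(s') − Φ(s), on a toy trajectory; the scalar states, the identity potential used in the example, and the helper names are assumptions made for illustration. The milestone augmentation is not reproduced here, since the article does not specify its form.

```python
from typing import Callable, List, Sequence


def shaped_rewards(
    states: Sequence[float],
    env_rewards: Sequence[float],
    phi: Callable[[float], float],
    gamma: float = 0.99,
) -> List[float]:
    """Apply r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t) to each transition."""
    if len(states) != len(env_rewards) + 1:
        raise ValueError("need exactly one more state than rewards")
    return [
        r + gamma * phi(s_next) - phi(s)
        for s, s_next, r in zip(states, states[1:], env_rewards)
    ]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    gamma = 0.9
    states = [0.0, 0.2, 0.5, 1.0]   # toy progress values along one episode
    env = [0.0, 0.0, 1.0]           # sparse reward: only the last step pays
    shaped = shaped_rewards(states, env, lambda s: s, gamma)
    gap = discounted_return(shaped, gamma) - discounted_return(env, gamma)
    print(shaped, gap)
```

The shaping terms telescope, so the gap between shaped and unshaped return equals γ^T Φ(s_T) − Φ(s_0) and depends only on the first and last states. Intermediate steps receive a dense signal, but the shaping cannot change which trajectories between the same endpoints score higher, which is the intuition behind the invariance guarantee.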

To optimize the symbolic potential function within the PBRS-MS framework, Bayesian Optimization is used to tune its numerical parameters automatically. This approach systematically explores the parameter space, balancing exploration and exploitation to identify configurations that maximize task success. Empirical evaluation reports a 42% improvement in success rates on challenging tasks compared with existing reward shaping techniques and baseline algorithms. The method’s effectiveness stems from its ability to efficiently find parameter settings that guide the learning agent toward good policies, even in complex environments.

Our proposed PBRS-MS effectively resolves potential collapse issues observed under standard PBRS conditions.

Beyond Benchmarks: A Glimpse of Adaptable Intelligence

The FLORA framework performs strongly on the Meta-World benchmark, consistently achieving a 96% success rate across a suite of simple tasks. Even these simpler tasks require precise motor control and adaptive behavior in varied environments. By learning from limited examples, FLORA demonstrates a practical approach to robotic skill acquisition that avoids extensive and costly data collection. Its success on Meta-World underscores its potential to accelerate the development of robust, versatile robotic systems that handle real-world manipulation tasks reliably.

A significant advantage of the FLORA framework lies in its capacity to effectively learn from a limited number of demonstrations, addressing a core challenge in robotics and reinforcement learning where acquiring extensive datasets can be prohibitively expensive or time-consuming. This efficiency stems from the system’s ability to extrapolate generalized reward functions from relatively few examples of desired behavior, unlike methods requiring numerous trials and errors. Such capability is particularly valuable in real-world applications – like complex manipulation tasks or personalized robotic assistance – where gathering large datasets is impractical, and the cost of physical experimentation is high. By minimizing the need for extensive data collection, FLORA facilitates the rapid deployment of robotic systems in diverse and resource-constrained environments, paving the way for more adaptable and user-friendly automation.

The FLORA framework distinguishes itself through the enhanced generalization capabilities of its learned reward functions, a significant advancement over traditional, manually engineered reward systems. This improvement is particularly pronounced when combined with techniques like LLM Reflection, which allow the system to adapt to unforeseen circumstances with greater robustness. FLORA is reported to be the only method that completes the intricate Bulb-Unscrew task, a benchmark of complex robotic manipulation, and it maintains high success rates in out-of-distribution (OOD) scenarios, performing reliably beyond the conditions it was trained on.

The trained reward models successfully generalize to three out-of-distribution tasks, achieving consistent success rates across multiple random seeds.

Towards a Future of Autonomous Growth

The development of FLORA signifies a crucial advancement in the pursuit of lifelong learning for artificial intelligence. Unlike traditional reinforcement learning approaches that require task-specific reward engineering, FLORA focuses on learning reward functions themselves. This enables the agent to generalize its skills across a variety of environments and tasks, avoiding the need for constant retraining with each new challenge. By identifying core principles of reward – what constitutes ‘good’ behavior – FLORA doesn’t just solve individual problems; it builds a foundational understanding that facilitates rapid adaptation. This learned reward function acts as a compass, guiding the agent towards successful outcomes even in unfamiliar situations, ultimately fostering a level of autonomy previously unattainable and opening doors to truly versatile and resilient AI systems.

Ongoing research aims to significantly broaden the applicability of FLORA by merging it with techniques in preference-based reward learning and vision-language models. This integration promises to move beyond explicitly defined rewards, allowing the system to learn from high-level human preferences – such as “solve the puzzle quickly” – rather than precise, low-level actions. Simultaneously, connecting FLORA to vision-language models will enable it to interpret instructions and understand environments through natural language and visual input, fostering a more intuitive and flexible interaction with the world. This synergistic approach is expected to unlock the potential for truly adaptable agents capable of tackling a wider range of complex tasks with minimal human guidance, ultimately accelerating progress towards generalized artificial intelligence.

The development of adaptable autonomous agents receives a significant boost from this research, which showcases a system capable of mastering new challenges with minimal human guidance. Through innovative reward function learning, the system achieves a demonstrable 2x increase in policy learning speed when compared to the current state-of-the-art baseline, ReWiND-CT, specifically on the complex robotic tasks of Peg-Insert and Box-Open. This enhanced efficiency isn’t merely a numerical improvement; it represents a crucial stride toward creating agents that can independently acquire skills and operate effectively in dynamic, real-world scenarios, reducing the need for constant reprogramming and supervision.

Robustness to out-of-distribution (OOD) variations, including viewpoint and position shifts as well as novel object instances, was evaluated using manipulation tasks performed with a Franka arm.

The pursuit of a flawless robotic system, as demonstrated by FLORA’s approach to reward learning, is a fascinating, yet ultimately limited, endeavor. A system that never breaks is, in effect, a dead system, incapable of adapting to the inherent unpredictability of the real world. FLORA’s ability to generalize from limited demonstrations – learning invariant rewards despite novelty – isn’t about achieving perfection, but about building resilience. As Arthur C. Clarke observed, “Any sufficiently advanced technology is indistinguishable from magic.” This ‘magic’ isn’t about flawless execution, but about the system’s capacity to become something new through interaction and, inevitably, through failure, a purification process inherent in the growth of any complex ecosystem. The framework doesn’t simply solve a robotic task; it establishes the conditions for ongoing evolution.

What Lies Ahead?

The pursuit of invariant rewards, as exemplified by FLORA, is less about solving a technical problem and more about acknowledging an inherent instability. The system does not find a reward; it tentatively inhabits a potential energy surface, hoping the local minima prove durable. The framework’s success with limited demonstrations merely postpones the inevitable confrontation with novelty: each new scenario is a stress test, revealing the brittleness of any learned abstraction. Monitoring, therefore, is not verification, but the art of fearing consciously.

The reliance on motion flow, while elegant, introduces a particular class of failure. What happens when the environment subtly alters the meaning of motion itself? The system, optimized for predictable trajectories, will stumble, not due to a miscalculation, but a conceptual mismatch. These are not bugs – they are revelations, glimpses into the limitations of representing agency as a purely kinematic exercise.

True resilience begins where certainty ends. The next stage is not about increasing the number of demonstrations, or refining the symbolic regression. It demands a shift in perspective: from learning what to do, to learning how to learn what to do, and, crucially, when to abandon the attempt altogether. The future lies in systems that gracefully degrade, that recognize their own incompleteness, and that prioritize adaptability over absolute optimization.


Original article: https://arxiv.org/pdf/2605.22123.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-24 00:58