Seeing, Understanding, and Acting: A Leap Forward in Robotic Manipulation

Author: Denis Avetisyan


Researchers have significantly improved a robot’s ability to perform complex, long-horizon tasks by combining advanced vision-language understanding with a novel reinforcement learning technique.

The BEHAVIOR Challenge dataset, designed for the NeurIPS 2025 competition, exhibits a distribution of skill occupancy across video frames and reveals that trajectories vary in length and complexity, averaging a specific number of frames and unique skills per instance, characteristics inherent to any system navigating a finite state space before inevitable decay.

This work details Openpi Comet, a system achieving strong performance on the BEHAVIOR Challenge through extensive pre-training, skill composition, and Rejection Sampling Fine-tuning.

Despite advances in robotic manipulation, reliably executing long-horizon tasks in realistic environments remains a significant challenge. This paper details “Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge,” our approach to the BEHAVIOR benchmark, a competition focused on everyday household tasks for embodied agents. By leveraging a vision-language-action foundation model, extensive pre-training, and a novel post-training refinement technique called Rejection Sampling Fine-tuning, we achieved near state-of-the-art performance. How can these findings inform the development of more robust and adaptable embodied AI systems capable of seamlessly integrating into human environments?


The Inevitable Drift: Confronting the Challenges of Long-Horizon Manipulation

Robotic systems attempting to navigate and interact within the complexities of everyday environments, such as a typical home, currently face substantial limitations when employing reinforcement learning techniques. These challenges stem from the inherent unpredictability and high dimensionality of real-world spaces, demanding robots to process vast amounts of sensory information and adapt to constantly changing conditions. Unlike controlled laboratory settings, homes present cluttered scenes, variable lighting, and a diverse range of objects with differing physical properties – all of which significantly increase the difficulty of learning effective manipulation policies. Consequently, even seemingly simple tasks, like grasping a specific object from a cluttered table, require exceptionally robust and adaptable algorithms that can overcome these perceptual and motor control hurdles. The gap between performance in simulated environments and real-world success highlights the need for advancements in areas like sim-to-real transfer and robust perception to enable truly autonomous robotic manipulation.

Current robotic manipulation techniques often falter when confronted with tasks demanding foresight and sustained action. The difficulty arises from the “long-horizon” problem – the need for a robotic policy to accurately predict and account for consequences unfolding over many sequential steps. Unlike scenarios with immediate rewards, tasks like preparing a meal or tidying a room require the robot to maintain a coherent plan across extended timeframes. Existing reinforcement learning algorithms frequently struggle with this temporal credit assignment – determining which actions, taken far in the past, ultimately contributed to a distant outcome. Consequently, developing robust policies capable of effectively planning and executing over these longer sequences remains a central challenge in advancing robotic autonomy, necessitating innovative approaches to address the complexities of long-term dependencies and delayed gratification.
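To see why credit assignment degrades over long horizons, consider how a discounted return propagates a single sparse reward back through time: with a discount factor of 0.99, the learning signal reaching the first action of a 500-step task is vanishingly small. A minimal sketch (values illustrative, not drawn from the paper):

```python
# Illustration of temporal credit assignment over a long horizon: with
# a single sparse reward at the end, early actions receive almost no
# discounted learning signal. All values are illustrative.

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 500-step task rewarded only on success at the final step.
returns = discounted_returns([0.0] * 499 + [1.0])
print(f"return at t=0:   {returns[0]:.4f}")   # ~0.0066: faint signal early on
print(f"return at t=499: {returns[-1]:.4f}")  # 1.0000: full signal at the end
```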

The policy robustly executes complex, long-horizon household tasks, including navigation, manipulation, and tool use, across a diverse set of activities on the BEHAVIOR-1K benchmark.

Unifying Perception and Action: The Promise of Vision-Language-Action Models

Vision-Language-Action (VLA) models represent a shift in robotic manipulation by consolidating traditionally separate components – visual perception, natural language processing, and motor control – into a unified neural network architecture. This integration allows robots to interpret high-level linguistic commands, such as “pick up the red block,” and directly translate them into actionable movements. Prior approaches often required explicit intermediate steps, like manually defining object affordances or pre-programming trajectories; VLA models aim to learn these mappings directly from data, increasing adaptability and reducing the need for task-specific engineering. The core benefit lies in enabling robots to perform complex manipulation tasks described in natural language without requiring detailed, low-level instructions, potentially streamlining human-robot interaction and expanding the range of tasks robots can autonomously address.

The base policy, denoted as π0.5, integrates three core components to facilitate robotic manipulation. Visual encoders process image data to extract relevant features from the robot’s environment. Simultaneously, language encoders interpret natural language instructions, converting them into a contextual understanding of the desired task. These encoded visual and linguistic inputs are then fed into a transformer-style action expert, which generates a continuous control signal. This architecture allows the policy to not only perceive the environment and understand commands, but also to dynamically refine manipulation strategies based on both inputs, resulting in a unified perception-to-action pipeline.
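A minimal sketch of this perception-to-action pipeline is given below, assuming toy encoder dimensions and a hypothetical interface; module names and sizes are illustrative stand-ins, not the π0.5 implementation:

```python
# Sketch of a VLA-style policy: vision and language encoders feed a
# transformer "action expert" that emits continuous control. Module
# names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, d_model=256, action_dim=7, horizon=16):
        super().__init__()
        self.vision_proj = nn.Linear(512, d_model)  # stand-in for ViT features
        self.lang_proj = nn.Linear(384, d_model)    # stand-in for text tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.action_expert = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)
        self.horizon = horizon
        # Learned query tokens from which a chunk of future actions is read.
        self.action_queries = nn.Parameter(torch.randn(horizon, d_model))

    def forward(self, vision_feats, lang_feats):
        v = self.vision_proj(vision_feats)                 # (B, Nv, d)
        l = self.lang_proj(lang_feats)                     # (B, Nl, d)
        q = self.action_queries.expand(v.size(0), -1, -1)  # (B, H, d)
        out = self.action_expert(torch.cat([v, l, q], dim=1))
        return self.action_head(out[:, -self.horizon:])    # (B, H, action_dim)

policy = ToyVLAPolicy()
actions = policy(torch.randn(1, 196, 512), torch.randn(1, 20, 384))
print(actions.shape)  # torch.Size([1, 16, 7])
```

The learned action queries give the transformer fixed slots from which to read off a chunk of future actions, one simple way to couple multimodal context to continuous control.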

The Transformer network integrated within the π0.5 policy plays a crucial role in refining continuous control signals for robotic manipulation. Unlike discrete action spaces, continuous control requires precise motor commands; however, these commands are often subject to noise from both perception and the policy itself. The Transformer architecture, through its self-attention mechanism, effectively filters this noise by identifying and mitigating inconsistencies within the generated action sequence. This denoising process is achieved by weighting the influence of each action component based on its relevance to the overall manipulation goal, resulting in smoother trajectories and improved positional accuracy during task execution. The network’s capacity to model temporal dependencies within the action space is key to generating coherent and stable control signals, especially in complex, multi-step manipulation scenarios.
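One common way to realize such denoising is iterative refinement of a noisy action chunk, in the spirit of diffusion and flow-matching action heads. The toy sketch below uses an untrained stand-in network and Euler integration; it illustrates the mechanism only and does not reproduce the paper's objective:

```python
# Toy iterative refinement of a continuous action chunk: starting from
# Gaussian noise, a stand-in network predicts a correction at each step
# and Euler integration gradually produces a trajectory. This mirrors
# diffusion/flow-matching action heads in spirit only.
import torch
import torch.nn as nn

horizon, action_dim, n_steps = 16, 7, 10
denoiser = nn.Sequential(  # untrained stand-in for the action expert
    nn.Linear(horizon * action_dim + 1, 256), nn.ReLU(),
    nn.Linear(256, horizon * action_dim),
)

actions = torch.randn(1, horizon * action_dim)  # start from pure noise
for step in range(n_steps):
    t = torch.full((1, 1), step / n_steps)      # scalar "time" conditioning
    velocity = denoiser(torch.cat([actions, t], dim=-1))
    actions = actions + velocity / n_steps      # one Euler integration step

print(actions.view(1, horizon, action_dim).shape)  # denoised action chunk
```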

The robotic policy was initially trained on a diverse dataset of over 1,500 hours of demonstrated and algorithmically generated trajectories, then refined through a post-training process of iterative data augmentation based on successful rollouts from perturbed initial states.

Sculpting Competence: Pre-training and Refinement as Pathways to Robustness

Pre-training on the BEHAVIOR dataset was investigated using both single-task and multi-task approaches to establish a strong foundation for subsequent policy learning. Single-task pre-training involved training a policy specifically on individual tasks within the dataset, while multi-task pre-training utilized data from all available tasks concurrently. This initial pre-training phase aimed to improve the policy’s initial performance and accelerate learning during downstream tasks, effectively providing a learned prior. The dataset’s diversity allowed for the creation of models capable of generalizing to new situations more effectively than training from random initialization.
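The distinction between the two regimes reduces to how the pre-training set is assembled; a minimal sketch with hypothetical task names and trajectory handles:

```python
# Sketch of single-task vs. multi-task pre-training set construction.
# Task names and trajectory handles are hypothetical placeholders.
demos_by_task = {
    "set_table": ["traj_a", "traj_b"],
    "fold_laundry": ["traj_c"],
    "load_dishwasher": ["traj_d", "traj_e", "traj_f"],
}

def make_pretraining_set(mode, task=None):
    if mode == "single":  # train on one task's demonstrations only
        return list(demos_by_task[task])
    if mode == "multi":   # pool demonstrations across every task
        return [t for trajs in demos_by_task.values() for t in trajs]
    raise ValueError(f"unknown mode: {mode}")

print(len(make_pretraining_set("single", task="set_table")))  # 2
print(len(make_pretraining_set("multi")))                     # 6
```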

To address the data scarcity inherent in real-world robotic learning, offline rollouts were generated using a pre-existing motion planner. This technique allowed for the creation of a synthetic dataset without requiring any online interaction with the environment, thereby circumventing the limitations and potential risks associated with direct physical experimentation. The motion planner produced trajectories which were then used to augment the existing training data, effectively increasing the dataset size and providing a broader range of scenarios for policy training. This approach is particularly valuable when online data collection is costly, time-consuming, or potentially damaging to the robotic system.
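Schematically, the augmentation amounts to sampling start and goal configurations, invoking the planner offline, and merging the resulting trajectories into the training set. The sketch below substitutes a trivial straight-line interpolator for the actual motion planner:

```python
# Sketch of offline data augmentation with a motion planner: sample
# start/goal configurations, plan trajectories without environment
# interaction, and merge them into the training set. The straight-line
# interpolator below is a trivial stand-in, not the actual planner.
import numpy as np

def plan_trajectory(start, goal, n_waypoints=50):
    """Stand-in planner: linear interpolation in joint space."""
    alphas = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    return (1 - alphas) * start + alphas * goal  # (n_waypoints, dof)

rng = np.random.default_rng(0)
synthetic = []
for _ in range(100):                    # 100 offline rollouts
    start = rng.uniform(-1, 1, size=7)  # random 7-DoF configuration
    goal = rng.uniform(-1, 1, size=7)
    synthetic.append(plan_trajectory(start, goal))

print(len(synthetic), synthetic[0].shape)  # 100 (50, 7)
```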

Rejection Sampling Fine-Tuning (RFT), implemented using a defined RFT Procedure, was instrumental in policy refinement. This process selectively retained successful trajectories and retrained the policy using only those samples. The application of RFT yielded a post-training validation Q-score of 0.22, a measurable gain in policy performance as evaluated by the Q-function, which represents the expected cumulative reward for following the learned policy.
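The core loop of RFT can be summarized as: roll out the current policy, discard failures, fine-tune on the survivors, and repeat. A schematic sketch in which `rollout` and `fine_tune` are hypothetical stand-ins:

```python
# Schematic of Rejection Sampling Fine-tuning: sample rollouts from the
# current policy, keep only the successes, fine-tune on the survivors,
# and repeat. `rollout` and `fine_tune` are hypothetical stand-ins.
import random

def rollout(policy, task):
    """Stand-in: returns a trajectory and whether it succeeded."""
    return f"traj<{task}>", random.random() < policy["success_rate"]

def fine_tune(policy, trajectories):
    """Stand-in: behavior cloning on the kept rollouts nudges the policy."""
    policy["success_rate"] = min(1.0, policy["success_rate"] + 0.002 * len(trajectories))
    return policy

policy = {"success_rate": 0.30}
tasks = ["set_table", "fold_laundry"]

for rft_round in range(3):
    kept = []
    for _ in range(100):                              # sample rollouts
        traj, success = rollout(policy, random.choice(tasks))
        if success:                                   # rejection step
            kept.append(traj)
    policy = fine_tune(policy, kept)
    print(f"round {rft_round}: kept {len(kept)}, "
          f"success rate {policy['success_rate']:.2f}")
```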

Selecting the optimal post-trained policy model for each task yields an aggregated validation Q-score of 0.31, representing the upper limit of the model’s performance.

Discerning Signal from Noise: Insights from Ablation and Analysis

Investigations into the model’s reliance on different input modalities revealed a surprising robustness to the absence of proprioceptive data – information detailing the robot’s own joint angles and positions. Ablation studies, where this data stream was systematically removed, demonstrated no significant drop in performance across a suite of manipulation tasks. This finding suggests the model effectively compensates for the lack of internal state awareness by heavily prioritizing visual input and language commands. Rather than relying on ‘feeling’ its way through a task, the system primarily utilizes what it ‘sees’ and ‘hears’, indicating a strong capacity for learning policies driven by external observations and instructions. This reliance highlights the potential for developing robotic systems that can operate effectively even with limited or noisy internal sensing capabilities.
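Such an ablation typically amounts to zeroing one input stream and re-running evaluation; a minimal sketch with a hypothetical harness and a toy vision-only policy:

```python
# Sketch of a proprioception ablation: evaluate the same policy with the
# proprioceptive stream zeroed out. The harness and the toy vision-only
# policy below are hypothetical.
import numpy as np

def evaluate(policy_fn, episodes, drop_proprio=False):
    successes = 0
    for obs in episodes:
        if drop_proprio:  # ablate the modality by zeroing it
            obs = dict(obs, proprio=np.zeros_like(obs["proprio"]))
        successes += policy_fn(obs)
    return successes / len(episodes)

# Toy policy that relies on vision alone, ignoring proprioception.
policy_fn = lambda obs: int(obs["image"].mean() > 0)
episodes = [{"image": np.random.randn(8, 8), "proprio": np.random.randn(7)}
            for _ in range(50)]

print(f"full inputs: {evaluate(policy_fn, episodes):.2f}")
print(f"no proprio:  {evaluate(policy_fn, episodes, drop_proprio=True):.2f}")
```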

Investigations into action parameterization strategies revealed a significant drawback to employing a relative approach. While intuitively appealing, representing actions as changes from a current state impeded the development of robust manipulation skills. The policy struggled to accurately accumulate actions over time, leading to imprecise movements and an inability to consistently achieve desired outcomes. This suggests that absolute action parameters – defining movements directly without reference to the current state – are crucial for establishing a stable and effective control system, allowing the model to learn and execute complex household tasks with greater fidelity and reliability.
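The failure mode is easy to reproduce in one dimension: small per-step command errors integrate when actions are expressed as deltas, but stay bounded when absolute targets are commanded. A toy illustration (noise scale and horizon are arbitrary):

```python
# Toy illustration of drift with relative (delta) actions: per-step
# command noise integrates over time, while absolute targets keep the
# error bounded. Noise scale and horizon are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 200)      # desired 1-D end-effector path
noise = rng.normal(0.0, 0.01, size=200)  # small per-step command error

# Relative parameterization: execute noisy deltas and integrate them.
relative_path = np.cumsum(np.diff(target, prepend=0.0) + noise)

# Absolute parameterization: command noisy absolute targets directly.
absolute_path = target + noise

print(f"max relative-action deviation: {np.abs(relative_path - target).max():.3f}")
print(f"max absolute-action deviation: {np.abs(absolute_path - target).max():.3f}")
```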

A detailed analysis of the BEHAVIOR-1K dataset’s skill distribution revealed a surprising spectrum of complexity embedded within everyday household tasks. The research team discovered that while some actions, such as grasping a simple object, were relatively straightforward, others – like meticulously folding laundry or skillfully arranging items on a shelf – demanded a nuanced interplay of fine motor control, spatial reasoning, and adaptive planning. This understanding directly informed the training process, allowing for a curriculum that progressed from simpler manipulations to more intricate sequences, ultimately enhancing the robot’s ability to generalize to a wider range of real-world scenarios and achieve more human-like dexterity. The dataset’s granularity proved essential for identifying critical skill gaps and tailoring the learning algorithms accordingly, fostering a more robust and adaptable robotic system.
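One way such an analysis feeds into training is a complexity-ordered curriculum, ranking tasks by a simple proxy such as the number of unique skills each requires. A sketch with hypothetical counts, not the dataset's actual statistics:

```python
# Sketch of a complexity-ordered curriculum: rank tasks by a simple
# proxy (unique skills required) and train easy-to-hard. The counts
# below are hypothetical, not the dataset's statistics.
task_skill_counts = {
    "pick_object": 2,
    "arrange_shelf": 5,
    "fold_laundry": 8,
}

curriculum = sorted(task_skill_counts, key=task_skill_counts.get)
for stage, task in enumerate(curriculum, start=1):
    print(f"stage {stage}: train on '{task}' ({task_skill_counts[task]} skills)")
```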

The Long Arc of Automation: Towards Truly Adaptable Robotic Systems

Recent advances in robotic manipulation are increasingly focused on Vision-Language-Action (VLA) models, which offer a pathway towards systems capable of complex, long-horizon tasks. This work highlights the potential of these models when paired with carefully designed pre-training and refinement strategies. The approach begins with broad pre-training on extensive datasets, equipping the model with a foundational understanding of object affordances and task sequences. Subsequent refinement then focuses this knowledge, tailoring the model to specific manipulation skills and improving its ability to generalize to novel situations. By bridging the gap between visual perception, natural language instructions, and robotic action, this methodology offers a significant step toward creating robots that can reliably perform a diverse range of tasks in unstructured environments, moving beyond the limitations of traditional, narrowly defined automation.

Recent advancements in robotic manipulation have yielded a system capable of successfully completing 22 out of 50 benchmark household tasks, marking a significant stride towards broadly capable robots. This achievement wasn’t simply about accomplishing individual actions, but rather demonstrating a degree of generalization across varied scenarios – from preparing simple meals to cleaning up common messes. The successful completion rate indicates the system’s ability to adapt to new, unseen tasks without requiring extensive retraining, a critical step beyond task-specific programming. While challenges remain in reaching a 100% success rate and tackling more complex domestic activities, this result provides compelling evidence that robots are steadily progressing towards seamlessly integrating into and assisting within human living spaces, offering a glimpse into a future where robotic assistance in daily life is commonplace.

Advancing robotic manipulation capabilities hinges on overcoming current limitations in training data acquisition and algorithm scalability. Researchers are actively investigating methods to move beyond reliance on massive, manually annotated datasets by exploring techniques like self-supervised learning and simulation-to-reality transfer. Simultaneously, the integration of hierarchical reinforcement learning offers a promising pathway toward tackling complex, long-horizon tasks. This approach breaks down overarching goals into smaller, more manageable sub-tasks, enabling robots to learn reusable skills and generalize more effectively across diverse environments. By combining these strategies – data efficiency and hierarchical learning – the field anticipates significant progress in creating robotic systems capable of robust and adaptable performance in real-world scenarios.

The ambition to create robots that genuinely understand and interact with the world as humans do hinges on integrating language and vision. These modalities aren’t simply data streams for a robotic system; they are crucial for contextual understanding and adaptable action. By processing visual information – recognizing objects, spatial relationships, and dynamic changes – in conjunction with natural language instructions or environmental descriptions, robots move beyond pre-programmed routines. This fusion enables them to interpret ambiguity, generalize learned skills to novel situations, and ultimately operate with the flexibility required in unpredictable, real-world settings. The potential extends beyond simple task completion, paving the way for robots capable of collaborative problem-solving and intuitive interaction within the complex tapestry of daily life.

The pursuit of robust foundation policies, as demonstrated in this work, echoes a sentiment akin to Paul Erdős’s observation: “A mathematician knows a lot of things, but a physicist knows a lot more.” While the OpenPI Comet system doesn’t operate within the realm of physics, it similarly builds upon a broad base – the extensive pre-training and VLA backbone – to achieve complex long-horizon manipulation. Each iteration of Rejection Sampling Fine-tuning refines this foundation, acknowledging that even the strongest system requires constant adaptation to gracefully handle the inevitable decay of performance over extended tasks. The challenge isn’t simply reaching a solution, but building a system that ages well, maintaining competence across a vast solution space.

What Lies Ahead?

The pursuit of long-horizon manipulation, as exemplified by this work, inevitably encounters the limitations inherent in any system attempting to predict and control a complex world. Openpi Comet demonstrates competence, but competence is merely a temporary stay against entropy. The challenge isn’t simply about achieving successful task completion; it’s about graceful degradation when, inevitably, the unexpected occurs. Systems must learn to age gracefully, adapting to imperfect observations and unforeseen circumstances, a quality not easily quantified by benchmark scores.

Future work will likely focus on more robust pre-training methodologies and on expanding the scope of foundation policies. However, a potentially more fruitful avenue lies in accepting the inherent uncertainty. Rejection Sampling Fine-tuning is a step towards acknowledging failure modes, but further exploration of methods that explicitly model and incorporate error is crucial. Sometimes observing the process – the subtle shifts in policy, the types of failures encountered – is better than trying to speed it up.

The BEHAVIOR Challenge, and benchmarks like it, serve as useful pressure tests. Yet, the true measure of progress won’t be found in surpassing these metrics, but in the development of systems that can continue to function, and even learn, in the face of inevitable decay. The goal isn’t to build a perfect agent, but one that knows how to endure.


Original article: https://arxiv.org/pdf/2512.10071.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
