Author: Denis Avetisyan
Researchers have developed a new method for translating AI-generated video into real-world robotic actions, bypassing the need for extensive task-specific training.
Dream2Flow reconstructs 3D object flow from generated videos to enable open-world robotic manipulation without requiring labeled data for each new task.
While generative models excel at predicting plausible physical interactions, translating these simulations into actionable robotic control remains a significant challenge. This work introduces Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow, a framework that leverages reconstructed 3D object motions as an intermediate representation to connect video generation with robotic manipulation. By decoupling desired state changes from low-level actuator commands, Dream2Flow enables zero-shot transfer of skills to diverse object categories (rigid, articulated, deformable, and granular) without task-specific demonstrations. Could this 3D object flow representation serve as a universal interface for adapting pre-trained vision models to real-world robotic control?
The Illusion of Control: Why Robots Still Struggle
Robotic systems designed for controlled laboratory settings often encounter significant difficulties when deployed in unpredictable real-world environments. Unlike the precisely calibrated conditions of a research lab, everyday spaces present a multitude of challenges, including uneven terrain, changing lighting, and dynamic obstacles. These unstructured settings demand a level of adaptability and robustness that traditional robotic control architectures frequently lack. Consequently, robots may struggle with basic navigation, object manipulation, and task completion, highlighting a critical gap between robotic potential and practical application. The complexity stems from the need for robots not only to perceive their surroundings accurately, but also to interpret ambiguous data and react appropriately in situations that were never explicitly programmed – a feat easily accomplished by humans, yet remarkably difficult for machines.
The translation of abstract, human-level task instructions – such as “fetch the blue block” or “clear the table” – into the sequence of low-level motor commands a robot can execute presents a formidable challenge. This isn’t simply a matter of programming each step; it requires a system capable of interpreting the intention behind the command and dynamically adjusting to unforeseen circumstances. Researchers are exploring methods like hierarchical reinforcement learning and behavior trees to decompose these high-level goals into manageable sub-actions, while simultaneously incorporating feedback loops that allow the robot to refine its movements based on sensory input. The difficulty lies in creating a system robust enough to handle the inherent ambiguity of natural language and the unpredictable nature of physical interaction, demanding sophisticated algorithms for planning, perception, and control that bridge the gap between cognitive command and physical execution.
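To make that decomposition concrete, here is a minimal, hypothetical sketch of a behavior-tree-style breakdown of a “clear the table” command into sub-actions. The node classes and the `robot` interface are illustrative stand-ins, not components of Dream2Flow or any specific library.

```python
# Hypothetical sketch: decomposing a high-level command into sub-actions
# with a behavior-tree-style structure. The node classes and the `robot`
# interface are illustrative, not part of Dream2Flow.

class Action:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def tick(self):
        return self.fn()  # True on success, False on failure


class Sequence:
    """Runs children in order and fails as soon as one child fails."""
    def __init__(self, name, children):
        self.name, self.children = name, children

    def tick(self):
        return all(child.tick() for child in self.children)


def clear_table_tree(robot):
    # "Clear the table" -> perceive, grasp, place; a caller re-ticks the tree
    # until no objects remain, so feedback enters through each action's result.
    return Sequence("clear_table", [
        Action("detect_objects", robot.detect_objects),
        Action("grasp_next_object", robot.grasp_next),
        Action("place_in_bin", robot.place_in_bin),
    ])
```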
The persistent discrepancy between robotic simulation and real-world performance stems from what researchers term the ‘Embodiment Gap’. This gap isn’t merely a matter of imperfect sensors or algorithms; it fundamentally arises from the differences in a robot’s physical form – its morphology – and how it moves – its kinematics – between the digital and physical realms. A robot that functions flawlessly in a simulated environment can falter dramatically when deployed in reality due to unmodeled friction, subtle variations in joint mechanics, or even minor discrepancies in the robot’s physical dimensions. These seemingly small differences accumulate, leading to inaccuracies in motion planning and control, and highlighting the need for methods that explicitly address the challenges of transferring learned behaviors from simulation to the complexities of physical embodiment. Consequently, bridging this gap is crucial for unlocking the full potential of robotic autonomy and enabling robots to operate reliably in unstructured, real-world settings.
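A toy calculation illustrates how quickly such discrepancies compound: propagating a one-degree calibration error per joint through a planar three-link arm already shifts the end-effector by a few centimetres. The link lengths, pose, and error magnitude below are arbitrary assumptions chosen only for illustration.

```python
import math

# Toy illustration (not from the paper): a 1-degree modelling error at each
# joint of a planar 3-link arm accumulates into centimetre-scale drift at
# the end-effector. All numbers are arbitrary.

link_lengths = [0.4, 0.3, 0.2]        # metres
true_angles = [0.5, -0.3, 0.8]        # radians
angle_error = math.radians(1.0)       # per-joint error

def forward_kinematics(angles, lengths):
    """End-effector (x, y) of a planar serial chain."""
    x = y = theta = 0.0
    for a, l in zip(angles, lengths):
        theta += a
        x += l * math.cos(theta)
        y += l * math.sin(theta)
    return x, y

x0, y0 = forward_kinematics(true_angles, link_lengths)
x1, y1 = forward_kinematics([a + angle_error for a in true_angles], link_lengths)
print(f"end-effector drift: {math.hypot(x1 - x0, y1 - y0) * 100:.1f} cm")
```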
Dream2Flow: Another Layer of Abstraction
Dream2Flow employs contemporary video generation models – including, but not limited to, Kling 2.1, Wan2.1, and Veo 3 – to synthesize videos of the desired robotic behavior. These models produce synthetic clips depicting the instructed task being performed, and the generated footage serves as the primary input for subsequent stages of the Dream2Flow framework. This circumvents the need for real-world data collection and provides a scalable, controllable source of training and operational data for robotic control.
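Conceptually, this stage maps an initial observation and a language instruction to a generated clip. The wrapper below is a hypothetical sketch of that interface; the actual APIs of Kling 2.1, Wan2.1, and Veo 3 differ, so `backend.sample`, the prompt template, and the field names are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical interface sketch: (current observation, instruction) -> video.
# `backend.sample` stands in for whichever generator is plugged in; the real
# model APIs and conditioning details differ from this.

@dataclass
class GeneratedVideo:
    frames: list      # list of HxWx3 image arrays
    prompt: str

def generate_task_video(initial_frame, instruction, backend):
    """Ask the chosen video model to 'imagine' the instructed manipulation."""
    prompt = f"A robot manipulator {instruction}, static camera, realistic physics."
    frames = backend.sample(image=initial_frame, prompt=prompt)  # assumed API
    return GeneratedVideo(frames=frames, prompt=prompt)
```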
Dream2Flow introduces a ‘3D Object Flow’ representation as an intermediate step in robotic control, bridging the gap between high-level instructions and physical actions. This representation is not direct control data; instead, it is reconstructed from the synthetic videos produced by models such as Veo 3. The 3D Object Flow encodes the movement and transformations of key objects within the scene, providing a more robust and interpretable signal than directly mapping instructions to robot joint velocities. This intermediate layer allows the system to reason about the desired outcome in terms of object manipulation, improving generalization and success rates, particularly in complex tasks such as the Open Oven benchmark.
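One plausible way to encode such a flow is as per-object sets of 3D point trajectories across the frames of the generated video. The structure below is an assumed schema for illustration, not the paper’s exact representation.

```python
from dataclasses import dataclass
import numpy as np

# Assumed schema for a "3D object flow": for each tracked object, the 3D
# positions of its sampled points across the T frames of the generated video.
# Shapes and field names are illustrative.

@dataclass
class ObjectFlow:
    object_name: str      # e.g. "oven door", taken from the text prompt
    points: np.ndarray    # (T, N, 3) point positions in the camera frame

    def displacement(self) -> np.ndarray:
        """Net 3D motion of each point from the first to the last frame."""
        return self.points[-1] - self.points[0]

    def centroid_path(self) -> np.ndarray:
        """(T, 3) centroid trajectory, a coarse summary of the object's motion."""
        return self.points.mean(axis=1)
```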
The Dream2Flow framework is compatible with current state-of-the-art video generation models, including Kling 2.1, Wan2.1, and Veo 3, allowing for adaptability across diverse video synthesis techniques. Empirical evaluation using the Open Oven task demonstrates an 80% success rate when paired with the Veo 3 video generation model; this indicates the framework’s capacity to translate generated visual data into effective robotic control signals and highlights performance gains achievable through integration with high-fidelity video generation systems.
Decoding the Visual Noise
The 3D Object Flow pipeline utilizes vision foundation models, specifically SpatialTrackerV2, to establish accurate per-frame depth estimation as a critical initial step. SpatialTrackerV2 provides dense depth maps, converting 2D image data into 3D spatial information for each frame of the generated video. This depth information is then used to reconstruct the 3D positions of objects and surfaces within the scene, enabling consistent object tracking and realistic motion reconstruction. The accuracy of SpatialTrackerV2 directly impacts the fidelity of the final 3D scene representation, making it a foundational component of the entire process.
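The geometric core of that step is back-projecting each depth map into a per-pixel 3D point map with the pinhole camera model. The snippet below sketches this, assuming the depth comes from SpatialTrackerV2 or a comparable estimator and that camera intrinsics are known or estimated.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) point map in camera coordinates
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Example with a synthetic depth map and guessed intrinsics (assumptions only):
depth = np.full((480, 640), 1.2)                      # everything 1.2 m away
point_map = backproject_depth(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```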
Object identification and tracking within generated videos are achieved through the integration of Grounding DINO and Segment Anything Model 2 (SAM 2). Grounding DINO serves as a foundational model for detecting and localizing objects based on textual prompts, providing initial bounding box detections. SAM 2 then refines these detections by generating high-quality segmentation masks, precisely outlining the boundaries of each identified object. This two-stage process enables robust object tracking across frames, even with complex scenes and occlusions, by associating segmented regions with individual object instances.
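Glue code for this two-stage step might look like the sketch below. `detect_boxes` and `segment_from_boxes` are hypothetical stand-ins for Grounding DINO and SAM 2 respectively; the real libraries expose different entry points, and SAM 2 in particular can propagate masks across video frames rather than re-segmenting each frame independently.

```python
# Hypothetical detect-then-segment loop. `detect_boxes` and `segment_from_boxes`
# stand in for Grounding DINO and SAM 2; the real APIs differ, and SAM 2 can
# propagate masks through a video instead of segmenting frame by frame.

def track_object_masks(frames, text_prompt, detect_boxes, segment_from_boxes):
    """Return one segmentation mask (or None) per frame for the prompted object."""
    masks = []
    for frame in frames:
        boxes = detect_boxes(frame, text_prompt)        # e.g. "the oven door"
        if not boxes:
            masks.append(None)                          # object not detected
            continue
        best = max(boxes, key=lambda b: b["score"])     # highest-confidence box
        masks.append(segment_from_boxes(frame, [best["xyxy"]])[0])
    return masks
```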
CoTracker3 facilitates consistent point tracking throughout video sequences by establishing correspondences between points in consecutive frames. This is achieved through a learned association mechanism that minimizes drift and maintains identity over time, even with significant viewpoint changes or occlusions. The system uses a cost-volume approach to identify the most likely corresponding point in the subsequent frame, based on feature similarity and spatial proximity. By accurately linking points across frames, CoTracker3 enables the reconstruction of smooth, realistic motion trajectories, which is critical for recovering coherent, physically plausible object motion from the generated video.
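Given per-frame 2D tracks and the per-frame 3D point maps from the depth stage, the tracks can be lifted into 3D trajectories by sampling the point map at each tracked pixel. The function below is a sketch under assumed array layouts; the tracker output format is modeled loosely on what point trackers like CoTracker3 typically return, not on the paper’s exact implementation.

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d, visibility, point_maps):
    """Combine 2D point tracks with per-frame 3D point maps.

    tracks_2d:  (T, N, 2) pixel coordinates from a point tracker
    visibility: (T, N) boolean flags marking points the tracker considers visible
    point_maps: (T, H, W, 3) per-frame 3D points from depth back-projection
    Returns (T, N, 3) trajectories, NaN where the point is occluded.
    """
    T, N, _ = tracks_2d.shape
    trajectories = np.full((T, N, 3), np.nan)
    for t in range(T):
        u = np.clip(np.round(tracks_2d[t, :, 0]).astype(int), 0, point_maps.shape[2] - 1)
        v = np.clip(np.round(tracks_2d[t, :, 1]).astype(int), 0, point_maps.shape[1] - 1)
        vis = visibility[t]
        trajectories[t, vis] = point_maps[t, v[vis], u[vis]]
    return trajectories
```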
From Simulated Flow to Tentative Action
The 3D Object Flow, representing the desired movement of objects in the environment, directly informs the ‘Robotic Control’ module. This flow provides a high-level directive for the robot’s actions, specifying the intended manipulation or interaction with the objects. The module interprets this flow as a sequence of desired states, translating it into specific motor commands. These commands then control the robot’s actuators, enabling it to execute the planned manipulation as defined by the 3D Object Flow. The accuracy and efficiency of the robotic action are therefore directly dependent on the fidelity of the extracted 3D Object Flow and the module’s ability to accurately translate it into executable commands.
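A deliberately simple reading of “flow informs control” is: once the object is grasped, drive the end-effector along the object’s reconstructed centroid path. The toy sketch below illustrates only that idea; `robot.move_to` is a hypothetical Cartesian interface, and the framework’s actual controller (trajectory optimization and reinforcement learning, discussed next) is considerably more involved.

```python
import numpy as np

# Toy interpretation of "flow -> commands": after grasping, follow the
# object's centroid trajectory from the reconstructed flow. `robot.move_to`
# is a hypothetical Cartesian command; the real controller is more involved.

def follow_object_flow(robot, flow_points, grasp_offset=None):
    """flow_points: (T, N, 3) reconstructed object points over time."""
    offset = np.zeros(3) if grasp_offset is None else grasp_offset
    centroids = flow_points.mean(axis=1)       # (T, 3) desired object path
    for target in centroids:
        robot.move_to(target + offset)         # assumed blocking Cartesian move
```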
Trajectory Optimization within the robotic control system functions by generating a series of robot configurations – positions, velocities, and accelerations – that satisfy the constraints imposed by the environment and the desired 3D object flow. This process involves defining a cost function that quantifies the efficiency and smoothness of potential trajectories, typically incorporating factors such as path length, execution time, and energy consumption. Algorithms then iteratively refine the trajectory by minimizing this cost function, subject to constraints on joint limits, obstacle avoidance, and the maintenance of desired object manipulation characteristics. The resulting optimized trajectory provides a precise set of control commands for the robot to execute, enabling smooth and efficient movement aligned with the goals indicated by the 3D object flow data.
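A minimal numerical sketch of this idea is shown below: waypoints are pulled toward the flow-derived targets while a second-difference penalty discourages jerky motion. The weights, step size, and the absence of joint-limit and collision constraints are all simplifications relative to what the text describes.

```python
import numpy as np

# Minimal trajectory-optimization sketch: trade off tracking the flow-derived
# targets against smoothness (penalized discrete acceleration). Real systems
# add joint limits, collision avoidance, and timing; weights here are arbitrary.

def optimize_trajectory(targets, w_track=1.0, w_smooth=2.0, iters=1000, lr=0.02):
    """targets: (T, 3) desired end-effector positions derived from the object flow."""
    traj = targets.copy()
    for _ in range(iters):
        # Gradient of 0.5 * w_track * ||traj - targets||^2
        grad = w_track * (traj - targets)
        # Gradient of 0.5 * w_smooth * sum ||x_t - 2 x_{t+1} + x_{t+2}||^2
        acc = traj[:-2] - 2.0 * traj[1:-1] + traj[2:]
        grad[:-2] += w_smooth * acc
        grad[1:-1] -= 2.0 * w_smooth * acc
        grad[2:] += w_smooth * acc
        traj -= lr * grad
    return traj
```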
The robotic control system utilizes Reinforcement Learning (RL) to optimize its manipulation policy. In this framework, the 3D Object Flow data is directly integrated as a reward signal within the RL algorithm, guiding the robot’s learning process. Specifically, experiments using a particle dynamics model to simulate the ‘Push-T’ task have demonstrated a 60% success rate, indicating the effectiveness of this approach in translating perceived object flow into actionable robotic behavior. This suggests the RL agent successfully learns to predict and influence object movement based on the 3D Object Flow input.
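The reported setup suggests a reward that measures how well the simulated object motion matches the reconstructed flow. The function below is one hedged guess at such a shaping term, with arbitrary scaling; it assumes the simulator exposes object points in the same ordering as the flow, which may not match the paper’s exact formulation.

```python
import numpy as np

# Hypothetical flow-matching reward for RL: compare object points achieved in
# the particle-based simulator against the positions prescribed by the
# reconstructed 3D object flow at the same timestep. Scale is arbitrary.

def flow_matching_reward(sim_points, flow_points, t, scale=10.0):
    """sim_points:  (N, 3) current simulated object points
    flow_points: (T, N, 3) target flow, assumed to share point ordering
    t:           current timestep index into the flow"""
    error = np.linalg.norm(sim_points - flow_points[t], axis=-1).mean()
    return float(np.exp(-scale * error))   # approaches 1.0 as the flow is matched
```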
The Illusion of Adaptability
The development of Dream2Flow represents a notable advancement in robotic manipulation, enabling machines to interact with a surprisingly diverse range of objects in unstructured settings. Unlike systems constrained by specific object types or controlled environments, this framework facilitates successful grasping and manipulation of both rigid items – like tools or containers – and more challenging cases such as articulated mechanisms, deformable fabrics and sponges, or even granular substances like rice. This broad applicability stems from its approach, which sidesteps the need for precise 3D models or pre-programmed routines, allowing the robot to adapt its movements based on visual input and simulated physics. Consequently, Dream2Flow moves beyond the limitations of traditional robotics, paving the way for more versatile and adaptable machines capable of functioning effectively in the complexities of everyday life and unpredictable real-world scenarios.
Traditional robotic manipulation often struggles with the inherent unpredictability of real-world objects and scenes. The Dream2Flow framework overcomes these limitations by ingeniously combining video generation with 3D reconstruction techniques. Rather than relying on precise, pre-programmed movements or static environmental models, this approach learns from simulated interactions, effectively ‘dreaming’ up potential manipulation sequences. These sequences are then translated into realistic 3D representations, allowing the robot to anticipate and adapt to variations in object shape, pose, and material properties. This synthesis of generative AI and 3D vision empowers robots to handle a broader range of objects – from rigid tools to deformable fabrics and even granular materials – with greater robustness and dexterity than conventional methods.
Evaluations demonstrate that this novel approach to robotic manipulation achieves remarkably high success rates when tested on a diverse array of objects – encompassing rigid, articulated, deformable, and granular materials. Critically, performance consistently surpasses that of established trajectory baselines. These alternatives rely on either dense optical flow analysis – which struggles with complex scenes – or rigid pose transformations, inherently limited in handling deformable objects. The observed improvements suggest a significant advancement in robotic dexterity, enabling more robust and adaptable performance in unstructured, real-world environments where traditional methods often falter. This capability promises to broaden the scope of tasks robots can reliably undertake, moving beyond carefully controlled settings and towards genuine autonomy.
The pursuit of seamless video generation and robotic control, as demonstrated by Dream2Flow, feels predictably optimistic. Reconstructing 3D object flow from generated videos (a neat trick) will inevitably reveal the limitations of even the most sophisticated generative models when confronted with the unpredictable physics of the real world. It’s a familiar pattern: elegant theory meeting messy production. As Alan Kay observed, “The best way to predict the future is to invent it,” but inventing doesn’t guarantee it will work when someone actually tries to use it. The bug tracker, one suspects, is already compiling a comprehensive list of edge cases. They don’t deploy – they let go.
So, What Breaks First?
Dream2Flow, with its elegant reconstruction of 3D object flow, neatly sidesteps the need for task-specific training. A commendable feat, certainly. But the history of robotics is paved with algorithms that ‘just worked’ in simulations. Production, as always, will have the final say. The question isn’t if an edge case will emerge – some unforeseen interaction, a strangely shaped object, a lighting condition not accounted for – but when. The promise of open-world manipulation hinges on anticipating the infinite ways things can go wrong, and that’s a losing battle.
The current reliance on generated videos is… optimistic. Generative models are, at their core, sophisticated pattern-matching exercises. They excel at plausible falsehoods. Expect a divergence between the reconstructed 3D flow and actual physics when confronted with novel scenarios. Future work will inevitably involve bridging this ‘reality gap’, likely with increasingly complex – and brittle – correction mechanisms. Everything new is old again, just renamed and still broken.
Perhaps the most intriguing path forward isn’t refinement of the 3D flow reconstruction itself, but a graceful degradation strategy. A system that knows its limitations, that can signal uncertainty, and that reverts to safer, pre-programmed behaviors when faced with ambiguity. Because when a robot inevitably misinterprets reality, it’s not the elegance of the algorithm that matters, it’s the severity of the ensuing chaos.
Original article: https://arxiv.org/pdf/2512.24766.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/