Author: Denis Avetisyan
New research demonstrates that large-scale simulation, powered by procedurally generated data, can unlock zero-shot transfer for robotic manipulation tasks.

Large-scale simulation data enables robots to perform complex manipulation tasks in the real world without any prior real-world training, achieving competitive performance with models trained on real data.
A prevailing assumption in robotics is that bridging the reality gap requires some degree of real-world data collection or fine-tuning; however, the work presented in ‘MolmoBot: Large-Scale Simulation Enables Zero-Shot Manipulation’ challenges this notion by demonstrating effective zero-shot transfer from simulation to the real world for both static and mobile manipulation tasks. Through the creation of MolmoBot-Engine and the release of MolmoBot-Data, a dataset of 1.8 million trajectories, the authors show that sufficiently large-scale and diverse procedural simulation can yield robust policies without any real-world adaptation. Specifically, their MolmoBot policies achieve a 79.2% success rate on tabletop pick-and-place tasks, outperforming existing methods and generalizing to unseen objects and environments. Could this approach unlock a new era of robot learning, reducing reliance on costly and time-consuming real-world data acquisition?
The Illusion of Robotic Dexterity
Conventional robotic manipulation strategies face significant hurdles when confronted with the sheer variety and unpredictability of real-world settings. These approaches, often reliant on precisely defined parameters and controlled environments, struggle to accommodate the infinite combinations of object shapes, sizes, textures, and lighting conditions encountered outside of laboratory settings. This limitation stems from the difficulty in creating algorithms capable of generalizing across this vast “state space” – the complete set of possible scenarios a robot might encounter. Consequently, robots frequently exhibit brittle behavior, failing to adapt to even slight deviations from their training data and hindering their ability to perform reliably in dynamic, unstructured environments. The challenge isn’t simply recognizing objects, but understanding their potential interactions and responding effectively to unforeseen circumstances, a task demanding far greater adaptability than most current systems possess.
Effective manipulation of complex objects – items possessing internal degrees of freedom such as hinges, drawers, or articulated limbs – demands a departure from traditional robotic control strategies. These systems require not simply the ability to grasp and move an object, but to understand and influence its internal configuration. Robust generalization is paramount; a robot trained to manipulate one instance of a complex object must reliably adapt to variations in its geometry, weight distribution, and even the presence of friction or wear. This necessitates advanced control algorithms capable of inferring the object’s state, predicting the effects of actions on its internal mechanisms, and executing nuanced movements that achieve desired configurations – a capability far exceeding the precision of simple pick-and-place routines. Ultimately, achieving truly versatile manipulation hinges on developing robots that can ‘reason’ about the mechanics of complex objects and proactively adjust their actions to ensure successful interaction.
A persistent obstacle in robotics is the notorious “sim-to-real” gap, where algorithms trained in meticulously crafted simulations struggle to perform reliably in unpredictable real-world conditions. This discrepancy arises from inherent differences in dynamics, sensor noise, and unmodeled physical interactions – factors easily controlled in a virtual environment but pervasive in reality. Consequently, a robot capable of flawlessly assembling a structure in simulation may falter with even minor disturbances when operating with physical objects. This lack of transferability significantly hinders the practical deployment of robotic manipulation systems, demanding substantial effort and resources for real-world adaptation and fine-tuning, effectively slowing the progress from laboratory demonstration to widespread application.
![Mobile manipulation evaluations of our RBY1 robot were conducted in diverse, real-world environments to assess both articulated and rigid object handling capabilities.](https://arxiv.org/html/2603.16861v1/figures/rby1_pick.png)
Generating Worlds to Avoid Reality
MolmoBot utilizes procedural generation to construct a wide variety of simulated environments, termed MolmoSpaces, for robotic training. This approach involves algorithmically defining environment parameters – including object properties, lighting conditions, and scene layouts – to automatically generate numerous unique scenarios. By varying these parameters, MolmoBot creates a dataset encompassing diverse conditions that would be impractical to manually design or collect in the real world. The system’s ability to rapidly generate these environments significantly accelerates the data collection process, providing a scalable means to obtain the large datasets necessary for training robust manipulation policies.
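The core idea of procedural generation – algorithmically varying scene parameters to mass-produce unique training environments – can be sketched in a few lines. This is an illustrative toy, not the authors’ MolmoSpaces API; the parameter names and ranges below are invented for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical scene parameters in the spirit of the article's description
# (object properties, lighting conditions, scene layout). All fields and
# ranges are assumptions made for illustration.
@dataclass
class SceneConfig:
    object_scale: float    # per-object size multiplier
    light_intensity: float # scalar brightness of the scene light
    table_height: float    # meters
    n_distractors: int     # clutter objects placed around the target

def sample_scene(seed: int) -> SceneConfig:
    """Deterministically map a seed to one unique scene configuration."""
    rng = random.Random(seed)  # seeded RNG makes every scene reproducible
    return SceneConfig(
        object_scale=rng.uniform(0.8, 1.2),
        light_intensity=rng.uniform(200.0, 1000.0),
        table_height=rng.uniform(0.70, 0.85),
        n_distractors=rng.randint(0, 6),
    )

# Each seed yields a distinct, reproducible environment; scaling to
# millions of scenes is just a larger range of seeds.
scenes = [sample_scene(s) for s in range(3)]
```

Seeding per scene is the design choice that makes such datasets both diverse and exactly regenerable, which matters when debugging a policy failure traced back to a specific training environment.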
The MolmoBot-Engine facilitates the automated creation of MolmoBot-Data, a dataset comprising 1.8 million expert trajectories used for training manipulation policies. These trajectories are generated through simulated robotic interactions within the MolmoSpaces environments. Each trajectory represents a successful completion of a manipulation task, providing examples of optimal control sequences. The scale of MolmoBot-Data is critical, as it provides sufficient data diversity for robust policy learning and generalization. Data points include robot joint angles, end-effector positions, and object states, providing a comprehensive record of successful manipulations. This dataset is formatted to be directly consumable by various reinforcement learning algorithms, streamlining the policy training process.
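The fields the article lists for each data point (joint angles, end-effector positions, object states) suggest a simple per-step schema. The layout below is a guess at such a record for illustration only; the actual MolmoBot-Data format is not specified here.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed schema for one timestep of an expert trajectory, covering only
# the fields named in the article; quaternion pose layout is an assumption.
@dataclass
class Step:
    joint_angles: List[float]          # radians, one entry per joint
    ee_position: List[float]           # end-effector [x, y, z] in meters
    object_states: List[List[float]]   # per-object pose [x, y, z, qw, qx, qy, qz]

@dataclass
class Trajectory:
    task: str                          # e.g. a pick-and-place instruction
    steps: List[Step] = field(default_factory=list)
    success: bool = True               # only successful demos are kept
```

A flat record like this is straightforward to serialize and batch, which is what makes a corpus of 1.8 million trajectories directly consumable by downstream policy-learning code.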
A significant challenge in robotics is the development of policies capable of robust performance in novel environments. The MolmoBot system addresses this through the generation of a large and varied dataset of robotic interactions within procedurally generated environments. By training policies on this diverse data – comprising 1.8 million trajectories – the resulting models demonstrate improved generalization capabilities to previously unseen scenarios. This contrasts with traditional robotics training methods often limited by the constraints of real-world data collection or the biases of manually designed simulations, and enables more adaptable and reliable robotic systems.
![MolmoBot iteratively replans trajectories within a randomized, pre-built environment (MolmoSpaces) populated with task-relevant objects to achieve successful completion.](https://arxiv.org/html/2603.16861v1/x6.png)
Teaching Robots to Pretend They Understand
MolmoBot facilitates the training of vision-language models (VLMs) for robotic manipulation by directly grounding language instructions in visual observations and translating them into robot actions. This approach contrasts with traditional methods relying on predefined skills or complex state estimation. By leveraging VLMs, MolmoBot enables robots to interpret natural language commands – such as “pick up the red block” – and execute the corresponding manipulation tasks without requiring explicit programming for each scenario. The system utilizes a learned mapping between visual input, language instructions, and robot control signals, providing a flexible and adaptable control framework suitable for a range of manipulation tasks and environments.
The MolmoBot policy incorporates a DiT (Diffusion Transformer)-based flow matching action head to improve robotic control. Flow matching operates by training a model to predict the velocity field that transforms a distribution of states into a desired target state, enabling more accurate action generation. Utilizing a DiT architecture within this framework allows for effective modeling of complex, multi-modal distributions encountered in robotic manipulation tasks. This approach contrasts with traditional methods that often rely on discrete action spaces or direct regression, and results in smoother, more natural robot movements and increased success rates in achieving desired goals by better handling uncertainty in state estimation and action execution.
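The flow-matching objective described above can be made concrete with a toy example: the model is regressed onto the velocity that carries a noise sample to a target along a straight-line interpolation. This NumPy sketch uses a trivial stand-in model, not a DiT, and exists only to show the training target.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_loss(model, x0, x1, t):
    """Rectified-flow-style objective: regress the model onto the constant
    target velocity x1 - x0 along x_t = (1 - t) * x0 + t * x1."""
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1  # interpolated states
    v_target = x1 - x0                            # ground-truth velocity field
    v_pred = model(xt, t)
    return np.mean((v_pred - v_target) ** 2)      # MSE over the batch

# Toy stand-in for the DiT action head: predicts a fixed velocity vector.
w = np.zeros(2)
model = lambda xt, t: np.broadcast_to(w, xt.shape)

x0 = rng.normal(size=(64, 2))           # noise samples
x1 = np.tile([1.0, -1.0], (64, 1))      # toy "action" targets
t = rng.uniform(size=64)                # interpolation times in [0, 1]
loss = fm_loss(model, x0, x1, t)
```

At inference time one integrates the learned velocity field from noise toward an action, which is what gives flow-matching heads their smooth, multi-modal action generation.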
Evaluation of the MolmoBot policy on real-world tabletop pick-and-place tasks yielded a 79.2% success rate. This performance metric represents the proportion of attempts where the robot successfully grasped and relocated target objects within a defined workspace. Comparative analysis demonstrates that MolmoBot significantly outperforms existing robotic manipulation methods on the same benchmark tasks, indicating an improvement in both task completion and overall efficiency. The success rate was determined through a standardized testing procedure involving a diverse set of object configurations and initial robot poses.

The Illusion of Intelligence: Real-World Performance
MolmoBot exhibits a compelling ability to generalize learned skills to entirely new situations without the need for additional training. This zero-shot transfer capability stems from the agent’s foundational training on a diverse range of simulated environments and tasks, allowing it to effectively adapt to unseen scenarios. Unlike conventional robotic systems requiring task-specific retraining for each new environment, MolmoBot leverages its pre-existing knowledge to quickly understand and execute tasks in novel settings. This adaptability is particularly evident in benchmark tests, where the agent consistently outperforms prior models – such as π0.5-DROID – by substantial margins, demonstrating a significant leap toward more versatile and robust robotic intelligence capable of operating effectively in the real world.
Rigorous validation of MolmoBot’s capabilities was conducted using the DROID benchmark, a challenging platform for robotic manipulation. Results demonstrate a substantial performance advantage in the ‘Pick MSProc’ task, where MolmoBot achieved a 92.0% success rate. This figure represents a marked improvement over the 47.0% success rate attained by π0.5-DROID under identical conditions. This significant difference highlights MolmoBot’s enhanced ability to reliably grasp and retrieve objects in complex, real-world scenarios, suggesting a robust and adaptable approach to robotic control and a considerable advancement over existing methodologies.
Evaluations on the challenging Pick Classic and PnP-Next-To benchmarks reveal MolmoBot’s robust manipulation capabilities. The system consistently achieves success rates of 60-66% on Pick Classic and 60-67% on PnP-Next-To, demonstrating a marked improvement over the π0.5-DROID baseline. Specifically, π0.5-DROID attained success rates ranging from 13-29% on Pick Classic and 31.3% on PnP-Next-To, highlighting MolmoBot’s significantly enhanced ability to generalize and execute complex pick-and-place tasks in dynamic environments. This performance underscores the effectiveness of the underlying approach in enabling reliable robotic manipulation beyond the limitations of prior methods.
![MolmoBot policies (F=2) demonstrate robust zero-shot transfer to real-world DROID evaluations, achieving a higher mean success rate (with 95% confidence intervals shown as error bars) than policies trained on large-scale real-world demonstrations.](https://arxiv.org/html/2603.16861v1/x10.png)
The pursuit of zero-shot transfer, as demonstrated by MolmoBot, feels… predictable. Another layer of abstraction built on procedural generation, hoping to bridge the sim-to-real gap. It’s a clever workaround, certainly, leveraging scale to compensate for inherent inaccuracies. But one suspects this ‘revolution’ will soon reveal its own costly compromises. As Claude Shannon observed, “The most important thing in communication is to minimize errors.” MolmoBot minimizes the appearance of errors, shifting the burden onto the dataset itself. The system achieves performance parity with real-world data, but that parity is maintained through an immense investment in synthetic data. It’s an expensive way to complicate everything, and someone, somewhere, will be debugging those procedurally generated edge cases for months.
What’s Next?
The claim of zero-shot transfer is, predictably, not a destination. It is merely a shifting of the problem. The elegance of procedural generation buys a temporary reprieve from the data bottleneck, but introduces a new one: the simulation itself. Each added fidelity, each attempt to mirror the chaotic mess of reality, is another layer of abstraction that will inevitably fail in unpredictable ways. The production environment will, as always, discover edge cases the simulations never conceived. Consider it a feature, not a bug.
The true challenge isn’t creating more data, but building systems robust enough to ignore most of it. A future direction lies not in scaling simulation, but in minimizing its reliance. Models that can learn from fewer, more carefully curated examples, or that can actively request information when uncertain, will prove more valuable than those drowning in procedurally generated noise. The current emphasis on vision-language models feels particularly fragile; semantic understanding is easily disrupted by unexpected lighting, occlusions, or the simple indignity of a slightly warped object.
One can anticipate a proliferation of specialized simulation tools, each tailored to a narrow domain. This will create interoperability nightmares and a new form of technical debt. CI is the temple – and everyone prays nothing breaks when those components finally meet. Documentation, of course, remains a myth invented by managers.
Original article: https://arxiv.org/pdf/2603.16861.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-18 19:57