Author: Denis Avetisyan
A new framework empowers robots to autonomously build accurate simulations of the physical world through interactive perception and intelligent behavior planning.

This work introduces an autonomous Real2Sim system leveraging vision-language models and behavior trees to estimate physical parameters and generate digital twins without manual intervention.
Constructing accurate physics-based simulations remains challenging due to the difficulty of reliably estimating real-world physical properties. This paper introduces ‘Real2Sim based on Active Perception with automatically VLM-generated Behavior Trees’, a novel framework that autonomously learns these parameters through robotic interaction. By leveraging vision-language models to generate task-specific behavior trees, the system actively explores the environment and estimates object mass, geometry, and friction without manual intervention or pre-defined routines. Could this approach unlock fully autonomous pipelines for creating digital twins and bridging the gap between perception and physically grounded robotic control?
The Fragile Mirror: Bridging Reality and Simulation
The escalating integration of robotics and artificial intelligence hinges on the capacity to train and plan within simulated environments; however, a fundamental challenge lies in faithfully replicating the intricacies of the physical world. While simulation offers a cost-effective and safe alternative to real-world experimentation, achieving sufficient realism proves remarkably difficult. Factors such as nuanced material properties, unpredictable interactions between objects, and the inherent ‘messiness’ of sensor data are often drastically simplified, leading to a discrepancy between simulated performance and real-world outcomes. This fidelity gap necessitates continuous advancements in simulation techniques to accurately model the complexities of reality, enabling robotic systems to generalize learned behaviors and operate reliably in unstructured and dynamic environments.
Robotic systems trained exclusively in simulation frequently encounter difficulties when transitioned to real-world applications due to discrepancies between the virtual and physical realms. Conventional simulations often simplify complex phenomena – approximating physics, neglecting subtle sensor imperfections, and failing to account for the inherent randomness of environments. These simplifications, while computationally efficient, introduce biases; a robot might learn to navigate a perfectly smooth virtual floor, but struggle with the unevenness and unpredictable debris of a real warehouse. Consequently, performance gaps arise, necessitating costly and time-consuming re-training or manual intervention to bridge the fidelity gap and ensure robust, reliable operation in unstructured settings. Addressing these limitations is paramount for the successful deployment of autonomous systems in increasingly complex and dynamic environments.
The development of truly capable robotic systems hinges on the creation of simulations that mirror the intricacies of the physical world. High-fidelity simulations aren’t merely about visual realism; they demand accurate modeling of physics, material properties, and the unpredictable nature of sensor data. Without this level of detail, robots trained exclusively in simulation often falter when deployed in real-world scenarios, exhibiting unexpected behaviors or failing to complete tasks. Consequently, a robust simulation environment becomes paramount not only for efficient training and testing, but also for ensuring the safety and reliability of robotic operations, especially in dynamic and unstructured environments where adaptability is essential. This pursuit of realism is therefore fundamental to unlocking the full potential of robotics and artificial intelligence.
Robotic systems frequently encounter scenarios where visual perception is incomplete – objects are obscured, lighting is poor, or sensors have limited range. Current simulation techniques struggle to accurately replicate these conditions of partial observability, creating a significant disconnect between training in the virtual world and performance in reality. A robot trained solely on complete data may fail to recognize or react appropriately to partially hidden objects, leading to errors in navigation, manipulation, or interaction. This challenge stems from the difficulty of modeling the complexities of light transport, sensor limitations, and the robot’s inherent uncertainty about the unseen portions of its environment. Consequently, advancements in simulation must prioritize realistic modeling of these perceptual limitations to ensure robust and reliable robotic behavior when faced with the ambiguities of the real world.

Automated Genesis: Constructing Simulations from Observation
The Real2Sim pipeline automates simulation creation by directly utilizing data acquired from real-world environments. This process involves robotic interaction to actively perceive and map the environment, coupled with multi-modal scene understanding – processing data from multiple sensors like cameras, depth sensors, and potentially others – to build a comprehensive digital representation. The system doesn’t rely on manually designed simulations; instead, it dynamically constructs a simulation model based on real-world observations, enabling the creation of environments tailored to specific robotic tasks and facilitating data-driven simulation design.
The Real2Sim pipeline employs a Vision-Language Model (VLM) to translate high-level user directives and observed environmental characteristics into parameters for simulation creation. This VLM analyzes both visual input from the real-world environment and natural language instructions to determine the desired simulation scenario and relevant object properties. Specifically, the VLM identifies objects, their relationships, and intended robotic interactions, then uses this understanding to instantiate a corresponding Simulation Model with appropriate physical properties, object placements, and task goals. The model’s output directly influences the generation of the simulation environment, ensuring alignment between user intent, the perceived scene, and the constructed virtual world.
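To make this step concrete, the sketch below shows one plausible shape for the VLM interface: the perceived objects and the user's instruction are packed into a prompt, and the model is asked to reply with a machine-readable scene specification. The prompt wording, the `query_vlm` stub, and the JSON schema are illustrative assumptions for this article, not the paper's actual interface.

```python
import json

def build_scene_prompt(instruction: str, detected_objects: list[str]) -> str:
    """Pack the user goal and perceived objects into a VLM prompt that
    requests a machine-readable simulation specification."""
    return (
        "You are configuring a physics simulation.\n"
        f"User goal: {instruction}\n"
        f"Detected objects: {', '.join(detected_objects)}\n"
        "Reply with JSON: {\"objects\": [{\"name\", \"geometry\", \"pose\", "
        "\"needs_mass_estimate\", \"needs_friction_estimate\"}]}"
    )

def query_vlm(prompt: str) -> str:
    """Stand-in for a real VLM call; returns a canned response so the
    sketch runs offline."""
    return json.dumps({
        "objects": [{
            "name": "box_1",
            "geometry": "cuboid",
            "pose": [0.4, 0.0, 0.05, 0, 0, 0],
            "needs_mass_estimate": True,
            "needs_friction_estimate": True,
        }]
    })

prompt = build_scene_prompt("estimate the box's mass", ["box_1", "table"])
scene_spec = json.loads(query_vlm(prompt))
for obj in scene_spec["objects"]:
    print(obj["name"], obj["geometry"], "mass?", obj["needs_mass_estimate"])
```

In a real deployment the canned response would be replaced by an actual VLM call, with validation of the returned JSON before it is allowed to parameterize the simulator.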
Behavior Trees (BTs) provide a hierarchical framework for organizing complex robotic behaviors within the Real2Sim pipeline. These trees decompose high-level goals into a sequence of tasks and conditions, enabling structured and reactive control. Crucially, BT nodes do not directly implement actions; instead, they rely on Atomic Actions – fundamental, indivisible control primitives – to interact with the simulated environment. These Atomic Actions represent low-level commands such as moving a joint, applying a force, or activating a sensor, forming the foundation upon which more complex behaviors are built and executed within the simulation. The modularity of this BT/Atomic Action system allows for flexible behavior design and easy adaptation to different robotic platforms and environments.
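The structure is easy to see in code. Below is a minimal, self-contained behavior-tree sketch in Python – a sequence node delegating to atomic-action leaves – with the node and action names invented for illustration; the paper's actual node set is not reproduced here.

```python
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Action:
    """Leaf node wrapping a single atomic action (one control primitive)."""
    def __init__(self, name: str, primitive: Callable[[], bool]):
        self.name, self.primitive = name, primitive

    def tick(self) -> str:
        return SUCCESS if self.primitive() else FAILURE

class Sequence:
    """Composite node: runs children in order, fails as soon as one fails."""
    def __init__(self, children: List):
        self.children = children

    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

# Atomic actions stubbed as plain functions for the sketch.
probe_push = Sequence([
    Action("move_to_contact", lambda: True),   # approach the object
    Action("apply_push_force", lambda: True),  # controlled push while logging force/torque
    Action("retreat", lambda: True),           # break contact safely
])
print(probe_push.tick())  # SUCCESS
```

A full implementation would also support a RUNNING status, condition nodes, and fallback (selector) nodes for reactivity; this stripped-down version only shows how composite nodes decompose behavior into atomic primitives.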
The Real2Sim pipeline addresses the challenge of transferring policies learned in simulation to real-world robotic systems by explicitly grounding simulation construction in real-world data. This data-driven approach involves using sensor input – including visual and potentially other modalities – from a target environment to inform the creation of the simulated environment. By replicating real-world characteristics in simulation, such as object properties, lighting conditions, and sensor noise, the discrepancy between the simulated and real domains is reduced. Consequently, policies trained in this more realistic simulation exhibit improved transferability, requiring less adaptation or fine-tuning when deployed on a physical robot operating in the corresponding real-world environment.
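One simple instance of such grounding is noise matching: estimate a real sensor's noise statistics and inject the same statistics into its simulated counterpart. The snippet below does this for a depth sensor under a Gaussian noise assumption – an assumption of this sketch, not something the paper specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeated real depth readings of a static target (synthetic stand-in here).
true_depth = 0.50  # known target distance in meters
real_readings = true_depth + 0.002 + rng.normal(0.0, 0.003, size=200)

# Calibrate: estimate bias and spread from the real data...
bias = real_readings.mean() - true_depth
sigma = real_readings.std(ddof=1)

# ...then reproduce them on top of the simulator's noiseless depth.
def simulated_depth(depth: float) -> float:
    return depth + bias + rng.normal(0.0, sigma)

print(f"bias={bias:.4f} m, sigma={sigma:.4f} m, sample={simulated_depth(0.75):.4f} m")
```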

Physical Echoes: Estimating Properties Through Interaction
Physical Parameter Estimation is a core element of the Real2Sim Pipeline, enabling the determination of object properties essential for accurate simulation. This process focuses on quantifying characteristics such as mass and the static and kinetic coefficients of friction. Estimation is achieved through direct robotic interaction with real-world objects, leveraging the robot’s ability to apply controlled forces and measure the resulting object response. The determined parameters are then used to create more realistic and accurate simulations, improving the transfer of learned behaviors from simulation to the physical world. Accurate estimation of these physical properties is critical for tasks requiring precise manipulation and interaction with diverse objects.
Physical Parameter Estimation within the Real2Sim Pipeline relies on data acquired during physical interaction with objects. Specifically, force measurements, typically obtained from force/torque sensors integrated into the robot’s end-effector, provide information about the interaction forces between the robot and the object. Simultaneously, object pose data, detailing the object’s position and orientation in space, is captured using vision systems or other tracking technologies. The combination of these force and pose measurements allows the pipeline to infer physical properties; changes in force readings correlated with known positional changes enable the estimation of parameters like mass and the coefficient of friction. Data is gathered during controlled interactions, such as pushing or tapping, to ensure accurate and repeatable measurements for the estimation process.
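The underlying identification problem is well illustrated by a horizontal push at measured acceleration: Newton's second law gives F = m·a + μ_k·m·g, which is linear in the acceleration, so regressing the measured force on the measured acceleration yields the mass as the slope and the kinetic friction term as the intercept. The least-squares sketch below, on synthetic data, shows the idea; the paper's actual estimator may differ.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

# Synthetic push data: true mass 1.2 kg, true kinetic friction 0.35.
rng = np.random.default_rng(1)
true_m, true_mu = 1.2, 0.35
accel = np.linspace(0.1, 1.5, 30)                  # from object pose tracking
force = true_m * accel + true_mu * true_m * G      # from the force/torque sensor
force += rng.normal(0.0, 0.05, size=force.shape)   # measurement noise

# F = m*a + mu*m*g is linear in a: slope = m, intercept = mu*m*g.
A = np.column_stack([accel, np.ones_like(accel)])
(slope, intercept), *_ = np.linalg.lstsq(A, force, rcond=None)

m_est = slope
mu_est = intercept / (m_est * G)
print(f"mass ~ {m_est:.3f} kg, kinetic friction ~ {mu_est:.3f}")
```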
The Real2Sim pipeline demonstrates successful mass estimation of objects even when partially occluded from view. This capability was validated through experimentation where the system accurately determined the mass of an object despite limited visual information. This success indicates the pipeline’s robustness and ability to infer physical properties using interaction data, even in complex environments presenting challenges to perception and planning. The system achieves this by integrating force measurement data with object pose estimation, allowing it to effectively model the object’s dynamics and infer its mass during interaction, enabling the planning of complex behaviors in uncertain conditions.
Current implementations of the Real2Sim pipeline estimate static and kinetic friction coefficients during physical parameter estimation; however, these estimations exhibit significant variance across repeated trials. Analysis indicates this variance is not solely attributable to sensor noise or limitations in force measurement. Further investigation is required to determine the root cause, potentially involving refinement of the estimation algorithms, improved modeling of contact mechanics, or enhancements to the robotic control strategies employed during interaction. Reducing this variance is critical for improving the accuracy and reliability of the simulated environments generated by the pipeline, and is a primary focus of ongoing development efforts.
The Real2Sim pipeline employs both Torque Control and Cartesian Impedance Control to ensure stable and accurate robotic interaction during physical parameter estimation. Torque Control directly manages joint torques, enabling precise force application necessary for probing object properties. Complementing this, Cartesian Impedance Control regulates the robot’s end-effector position and orientation in Cartesian space, allowing it to comply with external forces and maintain safe contact. This combination is critical for minimizing unintended forces during interaction, preventing damage to both the robot and the target object, and facilitating reliable data acquisition for parameter estimation.
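The impedance law itself is compact: the commanded force is a spring–damper response to deviation from a desired pose, F = K(x_d − x) − D·ẋ. The toy one-dimensional simulation below, with made-up gains, illustrates the compliant behavior this buys: the end-effector tracks the target but yields predictably under a sustained external force rather than fighting it rigidly.

```python
import numpy as np

# Impedance gains and effective end-effector mass (illustrative values).
K, D, M = 200.0, 30.0, 2.0   # stiffness N/m, damping N*s/m, mass kg
dt, x_d = 0.001, 0.10        # timestep s, desired position m

x, v = 0.0, 0.0
for step in range(4000):
    f_ext = -5.0 if step >= 2000 else 0.0   # sustained external push after 2 s
    f_cmd = K * (x_d - x) - D * v           # Cartesian impedance control law
    a = (f_cmd + f_ext) / M
    v += a * dt
    x += v * dt

# At steady state the arm settles at x_d + f_ext/K = 0.10 - 0.025 = 0.075 m.
print(f"final position: {x:.4f} m (compliant offset {x - x_d:+.4f} m)")
```

The steady-state offset under load is f_ext/K – exactly the bounded compliance that protects both the robot and the object during probing contacts.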

Toward Resilient Systems: Digital Twins and Enhanced Capabilities
The creation of accurate Digital Twins – virtual replicas of physical assets – is now significantly advanced through the Real2Sim Pipeline. This innovative framework doesn’t simply generate a visual model; it establishes a high-fidelity representation capable of mirroring the behavior and dynamics of its physical counterpart with remarkable precision. This is achieved through a sophisticated process of data acquisition and algorithmic refinement, allowing for nuanced simulations that capture complex interactions and subtle variations. Consequently, engineers and researchers can now virtually prototype, test, and optimize designs with a level of realism previously unattainable, reducing reliance on costly physical prototypes and accelerating innovation across diverse fields like manufacturing and logistics.
The creation of highly accurate virtual replicas through the Real2Sim pipeline dramatically alters traditional engineering and operational practices. Previously, rigorous testing and optimization demanded physical prototypes and extensive real-world trials – a process that was both time-consuming and expensive. Now, engineers can conduct countless simulations within the digital realm, identifying potential flaws and refining designs before a single physical component is manufactured. This capability extends beyond simple design validation; virtual environments allow for the optimization of entire systems, such as logistics networks or manufacturing processes, by experimenting with different configurations and parameters. Furthermore, the pipeline facilitates predictive maintenance strategies; by continuously monitoring the virtual twin and comparing it to real-world data, anomalies can be detected early, allowing for proactive interventions and minimizing downtime – a particularly valuable asset in complex industrial settings.
The creation of truly versatile robotic systems demands more than just advanced algorithms; it requires a deep understanding of how a robot will interact with the unpredictable realities of the physical world. This pipeline addresses this challenge by establishing a seamless connection between virtual simulations and physical execution, allowing robots to be trained and refined in a risk-free digital environment before deployment. Through iterative cycles of virtual testing and real-world validation, robotic controllers can be optimized for robustness against sensor noise, unexpected obstacles, and dynamic conditions. Consequently, robots developed using this approach demonstrate enhanced adaptability, improved performance in complex scenarios, and a greater capacity for autonomous operation – effectively moving beyond pre-programmed tasks to genuine intelligent behavior.
Continued development centers on broadening the applicability of this framework to increasingly intricate and dynamic environments – moving beyond controlled laboratory settings to real-world scenarios like unstructured warehouses or outdoor construction sites. Crucially, this expansion is paired with the integration of sophisticated artificial intelligence algorithms, specifically those focused on reinforcement learning and predictive modeling. This synergy aims to move beyond mere simulation and enable fully autonomous robotic operation, where robots can learn, adapt, and execute tasks without constant human intervention or pre-programmed instructions, ultimately fostering a new generation of intelligent and resilient robotic systems capable of navigating and interacting with complex surroundings.
The pursuit of accurate simulations, as detailed in this work concerning Real2Sim, inevitably confronts the realities of system decay. The framework’s reliance on automatically generated behavior trees, while innovative, operates within the constraints of imperfect perception and estimation – a transient state before inevitable adaptation or obsolescence. As Claude Shannon observed, “The most important thing in communication is to convey the meaning, not the symbols.” This holds true for simulation as well; the fidelity of the digital twin is less critical than its ability to meaningfully represent the physical world, even as that representation ages and requires refinement. The system’s capacity for autonomous parameter estimation acknowledges that perfect mirroring is unattainable, but continuous learning offers a path toward graceful aging, extending the lifespan of the simulation’s utility.
The Long Calibration
This work, while demonstrating an autonomous pathway to simulation fidelity, merely postpones the inevitable divergence. Every failure is a signal from time; the gap between the physical and the modeled will invariably widen as entropy asserts itself. The presented framework effectively automates the initial calibration, but the true challenge lies not in achieving a snapshot of accuracy, but in maintaining it – in designing systems that gracefully accommodate the decay of correspondence.
Future efforts should not focus solely on refining the estimation of physical parameters, but on developing mechanisms for continuous recalibration. The reliance on robotic interaction, while elegant, introduces its own vulnerabilities. Consider the implications of actuator drift, sensor degradation, or even subtle environmental shifts. Refactoring is a dialogue with the past; the system must be capable of learning from its own obsolescence.
Ultimately, the pursuit of perfect digital twins is a paradox. The real is, by definition, transient. The value may not lie in mirroring reality, but in creating simulations that are usefully inaccurate – models that capture essential dynamics even as they diverge from the specifics of a fleeting moment. The long calibration, then, is not about freezing time, but about anticipating its passage.
Original article: https://arxiv.org/pdf/2601.08454.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/