Learning to Open Doors: A Mobile Manipulator Masters the Task

Author: Denis Avetisyan


Researchers have developed a diffusion-based policy enabling a dual-arm mobile robot to reliably open and navigate through doors, even with unexpected disturbances.

A diffusion-based policy enables a mobile manipulator to autonomously open and navigate through damped pull doors, coordinating perception, dual-arm manipulation, and base navigation to execute complex sequences of actions (reaching, twisting, pulling, and passing) while remaining robust to external disturbances, which is crucial for real-world applications.

This work presents an end-to-end imitation learning approach for coordinated control of a non-holonomic mobile base and dual arms in complex door-opening scenarios.

Despite advances in robotic manipulation, reliably coordinating complex, multi-stage behaviors like opening a standard door remains a significant challenge. This paper, ‘Diffusion Policy for Coordinated Control of a Nonholonomic Mobile Base and Dual Arms in Door Opening and Passing’, introduces a diffusion-based visuomotor policy that enables a mobile manipulator to autonomously open and traverse damped pull doors. Through end-to-end learning from demonstrations, the resulting policy achieves a high success rate and demonstrates robustness to external disturbances, capabilities often lacking in traditional state-machine-based approaches. Could this framework unlock more adaptable and resilient robotic solutions for complex, real-world manipulation tasks?


The Challenge of Embodied Intelligence

Successfully grasping and manipulating objects in real-world settings, even seemingly simple tasks like opening a damped pull door, presents a considerable hurdle for robotics. The difficulty doesn’t stem from a lack of mechanical capability, but from the intricate interplay between perception and control required in cluttered environments. Robots must accurately perceive the position and state of the door handle, accounting for visual obstructions and varying lighting conditions, then precisely modulate force and movement to overcome the damping mechanism – a resistance designed for human interaction. This demands more than just pre-programmed sequences; it necessitates robust algorithms that can interpret noisy sensor data, predict object behavior, and adapt to unexpected disturbances, a challenge that continues to drive innovation in robotic dexterity and intelligence.

Conventional robotic control strategies frequently falter when confronted with the unpredictable nature of real-world tasks. These methods typically rely on precise models of the environment and robot dynamics, assumptions that rarely hold true amidst the constant variability of everyday objects and situations. Minor deviations – a slightly askew door handle, an unexpected obstruction, or even subtle changes in lighting – can disrupt these pre-programmed sequences, leading to failed grasps or collisions. The inherent uncertainty stems not only from imperfect sensors and actuators, but also from the difficulty in anticipating the full range of possible interactions, making robust, reliable manipulation in dynamic, cluttered spaces an ongoing challenge for robotic systems.

Robust robotic manipulation hinges on the development of visuomotor policies that transcend pre-programmed responses. These policies must not only correlate visual input with motor commands, but also generalize effectively to previously unseen environments and object configurations. A key characteristic of these advanced systems is their ability to adapt in real-time to dynamic conditions – such as unexpected disturbances or changes in object position – without requiring explicit re-programming. This necessitates incorporating learning algorithms that allow the robot to refine its control strategies through experience, enabling it to navigate the inherent uncertainties of the physical world and perform tasks reliably even in complex, cluttered scenes. Ultimately, the success of real-world robotic manipulation rests on creating systems capable of ‘seeing’ a situation, ‘understanding’ its implications, and ‘adapting’ its actions accordingly.

Despite manual interference during door opening, the policy effectively halts, re-adjusts, and resumes the task, demonstrating robust adaptation to external disturbances.

Diffusion Policies: A Pathway to Embodied Control

A Diffusion Policy directly maps raw visual inputs, such as images from a camera, to robot actions without requiring intermediate state estimation or the design of specific, pre-defined features. Traditional robotic control often relies on perceiving the environment to create an internal representation of its state – including object positions, robot pose, and velocities – before determining appropriate actions. This process is computationally expensive and susceptible to errors introduced by imperfect perception. In contrast, the Diffusion Policy learns a direct mapping from pixels to control signals, effectively bypassing the need for explicit state representation. This approach simplifies the control pipeline and allows the robot to react directly to visual information, potentially improving robustness and adaptability in dynamic or uncertain environments.

The Diffusion Policy leverages a U-Net architecture, a convolutional encoder-decoder design originally known for its efficacy in image segmentation and reconstruction tasks. Here it takes the form of a 1D U-Net that operates along the time axis of the action sequence rather than on images: an encoder downsamples the noisy action sequence into a coarser, lower-dimensional representation, and a decoder upsamples this representation back to the original sequence length. Skip connections between corresponding layers in the encoder and decoder preserve fine-grained details across the encoding and decoding stages. The camera images themselves are processed by separate ResNet-18 visual encoders, whose features condition the U-Net. This structure allows the network to capture both long-range context and local detail in the action sequence, which is crucial for modeling the high-dimensional relationship between perception and control.

FiLM (Feature-wise Linear Modulation) conditioning is implemented to modulate the U-Net’s convolutional layers, enabling the diffusion policy to incorporate visual information directly into its action selection process. This technique scales and shifts the feature maps within each layer using parameters derived from the observed scene; specifically, a learned affine transformation [latex] y = \gamma(c) \odot x + \beta(c) [/latex] is applied, where [latex] x [/latex] is the input feature map, [latex] c [/latex] is the conditioning vector computed from the visual input, [latex] \gamma(c) [/latex] and [latex] \beta(c) [/latex] are the scale and bias parameters generated from it, and [latex] \odot [/latex] denotes element-wise multiplication. By conditioning the network in this manner, the policy dynamically adjusts its internal representations, and consequently its actions, based on the specific details of the observed environment without requiring retraining or explicit state estimation.
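The FiLM modulation can be illustrated with a minimal NumPy sketch, with c denoting the conditioning vector derived from the visual input. The linear maps that produce the scale and shift, the array shapes, and the function name `film` are assumptions made for illustration; the paper's actual layer sizes and parameterization are not given here.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of a 1D feature map.

    features: (channels, length) feature map x
    cond:     (cond_dim,) conditioning vector c, e.g. an image embedding
    The four weight arguments are hypothetical learned parameters.
    """
    gamma = w_gamma @ cond + b_gamma  # (channels,) per-channel scale
    beta = w_beta @ cond + b_beta     # (channels,) per-channel shift
    # Broadcast along the length axis: each channel gets one scale and one shift.
    return gamma[:, None] * features + beta[:, None]

rng = np.random.default_rng(0)
channels, length, cond_dim = 4, 16, 8
x = rng.normal(size=(channels, length))
c = rng.normal(size=cond_dim)
w_gamma = rng.normal(size=(channels, cond_dim))
w_beta = rng.normal(size=(channels, cond_dim))

y = film(x, c, w_gamma, np.ones(channels), w_beta, np.zeros(channels))
print(y.shape)  # (4, 16)
```

Because the scale and shift depend only on the conditioning vector, the same convolutional weights can produce different behavior for different scenes.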

The diffusion policy utilizes three ResNet-18 visual encoders and a 1D U-Net with FiLM conditioning to transform Gaussian noise into an action sequence [latex]A_t[/latex] through [latex]K[/latex] denoising steps.
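The K-step denoising process can be sketched as a standard DDPM-style reverse loop. This is only an illustration under stated assumptions: the linear noise schedule, the function names, and the toy noise predictor are all hypothetical, and the toy predictor is an analytic stand-in for the trained U-Net (it assumes every demonstration is the same fixed action sequence, so the correct noise estimate has a closed form).

```python
import numpy as np

def make_schedule(num_steps):
    """Assumed linear beta schedule; returns betas, alphas, cumulative alphas."""
    betas = np.linspace(1e-4, 0.2, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def sample_actions(eps_model, obs, shape, num_steps=10, seed=0):
    """Turn Gaussian noise into an action sequence with num_steps denoising steps."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    actions = rng.normal(size=shape)  # start from pure Gaussian noise
    for k in reversed(range(num_steps)):
        eps_hat = eps_model(actions, k, obs)  # predicted noise at step k
        mean = (actions - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        # No fresh noise is added on the final step.
        noise = rng.normal(size=shape) if k > 0 else np.zeros(shape)
        actions = mean + np.sqrt(betas[k]) * noise
    return actions

def make_toy_eps_model(target, num_steps):
    """Stand-in for the trained network: exact noise estimate for a single fixed target."""
    _, _, alpha_bars = make_schedule(num_steps)
    def eps_model(noisy, k, obs):
        return (noisy - np.sqrt(alpha_bars[k]) * target) / np.sqrt(1.0 - alpha_bars[k])
    return eps_model

target = np.linspace(0.0, 1.0, 16 * 3).reshape(16, 3)  # 16 time steps, 3 action dims
eps_model = make_toy_eps_model(target, num_steps=10)
plan = sample_actions(eps_model, obs=None, shape=target.shape, num_steps=10)
print(np.allclose(plan, target))
```

In a real policy, `eps_model` would be the FiLM-conditioned U-Net and `obs` the image features; only the loop structure carries over from this sketch.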

Bridging the Reality Gap: Sim2Real Transfer with Domain Randomization

Domain Randomization, implemented within the MuJoCo physics engine, is utilized to generate a broad range of simulated environments for training the diffusion policy. This technique involves randomizing various physical parameters and visual characteristics during simulation, including parameters governing mass, friction, and actuator strength, as well as textures, lighting conditions, and camera viewpoints. By exposing the policy to this variability, the training process encourages the development of features that are invariant to these simulated perturbations, ultimately improving generalization performance when deployed in real-world scenarios where these parameters are unknown or differ from the training environment.

Domain randomization achieves sim-to-real transfer by systematically varying simulation parameters during training. Specifically, the training process introduces changes to visual characteristics like lighting conditions and surface textures, as well as alterations to physical dynamics such as friction coefficients, object masses, and joint damping. This deliberate introduction of noise and uncertainty compels the learning policy to develop features that are not specific to a single simulated environment, but rather generalize across a distribution of possible conditions. Consequently, the resulting policy demonstrates increased robustness and improved performance when deployed in the real world, where unforeseen variations are common.
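A per-episode sampler along the lines described above might look like the following sketch. The parameter names, ranges, and texture list are hypothetical placeholders rather than values from the paper, and the step that writes the sampled values into the simulator is deliberately left out.

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodeParams:
    door_mass_kg: float
    hinge_damping: float
    handle_friction: float
    light_intensity: float
    door_texture: str

# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "door_mass_kg": (15.0, 40.0),
    "hinge_damping": (0.5, 5.0),
    "handle_friction": (0.4, 1.2),
    "light_intensity": (0.3, 1.0),
}
TEXTURES = ("wood", "metal", "painted_white")

def sample_episode_params(rng):
    """Draw one set of physical and visual parameters for a training episode."""
    return EpisodeParams(
        door_mass_kg=rng.uniform(*RANGES["door_mass_kg"]),
        hinge_damping=rng.uniform(*RANGES["hinge_damping"]),
        handle_friction=rng.uniform(*RANGES["handle_friction"]),
        light_intensity=rng.uniform(*RANGES["light_intensity"]),
        door_texture=rng.choice(TEXTURES),
    )

rng = random.Random(0)
episodes = [sample_episode_params(rng) for _ in range(5)]
for params in episodes:
    print(params)
```

Resampling at the start of every episode means the policy never sees exactly the same door twice, which is what pushes it toward features that hold across the whole range.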

In simulation, demonstrations are generated by a state-based controller rather than by the learned policy. A Task-Space Scheduler defines desired end-effector trajectories in Cartesian space, and Inverse Kinematics (IK) translates these task-space goals into joint-space commands for the arms by solving for the joint angles that achieve each desired end-effector pose. Model Predictive Control (MPC) handles base locomotion by optimizing a sequence of control inputs over a finite time horizon. Together, the scheduler, IK, and MPC supply the feasible, accurate trajectories from which the diffusion policy learns.
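The task-space to joint-space step that IK performs can be shown on a deliberately small case: the closed-form solution for a two-link planar arm. The robot in the article has far more degrees of freedom and presumably uses a numerical solver, so the link lengths and function names below are purely illustrative assumptions.

```python
import math

def forward_kinematics(q1, q2, l1, l2):
    """End-effector position of a planar two-link arm with joint angles q1, q2."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def inverse_kinematics(x, y, l1, l2):
    """Joint angles that place the end effector at (x, y); one of the two elbow solutions."""
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0:
        raise ValueError("target is outside the reachable workspace")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

l1, l2 = 0.4, 0.3     # hypothetical link lengths in meters
target = (0.5, 0.3)   # desired end-effector position
q1, q2 = inverse_kinematics(*target, l1, l2)
reached = forward_kinematics(q1, q2, l1, l2)
print(reached)
```

Feeding the IK result back through forward kinematics, as done in the last lines, is a cheap sanity check that the joint angles really reach the requested pose.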

Simulated data collection utilizes inverse kinematics to control door manipulation alongside model predictive control for base locomotion, with randomized door and handle appearances introducing visual variability across episodes.
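To give a feel for MPC on a nonholonomic base, the sketch below uses a unicycle model and random-shooting optimization. Random shooting is a simple stand-in chosen for brevity, not necessarily the optimizer used in the paper, and the cost function, limits, and horizon are hypothetical.

```python
import math
import random

def step_unicycle(state, v, omega, dt):
    """Nonholonomic base: it drives along its heading and turns, but cannot slide sideways."""
    x, y, theta = state
    return (x + v * math.cos(theta) * dt, y + v * math.sin(theta) * dt, theta + omega * dt)

def rollout_cost(state, controls, goal, dt):
    """Sum of distances to the goal along a simulated rollout."""
    cost = 0.0
    for v, omega in controls:
        state = step_unicycle(state, v, omega, dt)
        cost += math.hypot(goal[0] - state[0], goal[1] - state[1])
    return cost

def mpc_step(state, goal, rng, horizon=10, samples=256, dt=0.1, v_max=0.5, omega_max=1.0):
    """Sample control sequences, keep the cheapest, return its first control and cost."""
    best_controls = [(0.0, 0.0)] * horizon  # standing still is always a candidate
    best_cost = rollout_cost(state, best_controls, goal, dt)
    for _ in range(samples):
        controls = [(rng.uniform(0.0, v_max), rng.uniform(-omega_max, omega_max))
                    for _ in range(horizon)]
        cost = rollout_cost(state, controls, goal, dt)
        if cost < best_cost:
            best_cost, best_controls = cost, controls
    return best_controls[0], best_cost

rng = random.Random(0)
state, goal = (0.0, 0.0, 0.0), (1.0, 0.0)
for _ in range(20):
    (v, omega), _ = mpc_step(state, goal, rng)
    state = step_unicycle(state, v, omega, dt=0.1)  # apply only the first control, then replan
print(state)
```

Only the first control of the best sequence is executed before replanning, which is the receding-horizon idea that lets the controller correct for errors as it goes.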

Real-World Validation and the Promise of Efficient Intelligence

The culmination of this research involved validating the diffusion policy through physical deployment on the ‘RealMan Platform’, a sophisticated dual-arm mobile manipulator. This robotic system was tasked with a complex manipulation challenge – successfully opening a ‘Damped Pull Door’. This particular door presents difficulties due to its resistance and the precise coordination required between the robot’s arms. Successful completion of this task on a physical platform demonstrates the policy’s robustness and ability to generalize beyond simulated environments, highlighting its potential for real-world robotic applications and providing a critical step toward adaptable and intelligent manipulation systems.

The foundation of a robust robotic manipulation policy lies in the quality of its training data, and a dedicated ‘Teleoperation Kit’ was instrumental in acquiring this crucial resource. This kit allowed researchers to remotely guide a robotic arm through the desired manipulation tasks, effectively creating a dataset of expert demonstrations. By leveraging human intuition and precision through teleoperation, the system gathered high-quality examples of successful task completion. This approach bypassed the challenges of randomly exploring the robot’s action space, instead providing the diffusion policy with a strong starting point and accelerating the learning process. The resulting dataset ensured the policy was trained on realistic and effective strategies, directly contributing to its subsequent performance and reliability.

To accelerate the practical application of the diffusion policy, a technique called Low-Rank Adaptation (LoRA) was implemented. LoRA enables parameter-efficient fine-tuning by freezing the pre-trained model weights and introducing trainable low-rank matrices, dramatically reducing the computational demands and memory footprint of training. This approach allows for swift adaptation to new robotic platforms and tasks without the extensive resources typically required for full fine-tuning. Critically, the integration of LoRA didn’t compromise performance; simulations demonstrated a consistent 100% task success rate, showcasing its effectiveness in maintaining robust manipulation capabilities while significantly lowering the cost of fine-tuning and improving accessibility.
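The mechanism behind LoRA can be sketched for a single linear layer in NumPy. The class name, layer size, rank, and scaling value are illustrative assumptions; which layers of the policy are adapted, and at what rank, is not stated here.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus low-rank path; with B = 0 the layer matches the original.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
weight = rng.normal(size=(512, 512))
layer = LoRALinear(weight, rank=8, alpha=16.0, rng=rng)
x = rng.normal(size=(2, 512))

print(np.allclose(layer(x), x @ weight.T))      # True: zero-initialized B changes nothing
print(layer.trainable_params() / weight.size)   # 0.03125
```

With rank 8 on a 512 by 512 layer, only about 3% as many parameters are trained as in full fine-tuning, which is where the savings in memory and compute come from.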

The developed diffusion policy demonstrates a substantial gain in computational efficiency without sacrificing performance. Traditional diffusion models often require a large number of denoising steps – typically 100 or more – to generate accurate results. However, this policy achieves comparable levels of success with a drastically reduced requirement of only 10 denoising steps. This simplification represents a significant advancement, enabling faster inference times and opening possibilities for real-time robotic applications where timely decision-making is crucial. By maintaining performance with fewer steps, the policy offers a pathway to deploying complex robotic behaviors on platforms with limited computational resources.

Data collection was performed both on a two-arm RealMan platform using a teleoperation kit and in simulation with a state-based controller integrating inverse kinematics and model predictive control.

The presented work emphasizes a holistic approach to robotic manipulation, prioritizing robustness and adaptability over overly complex solutions. This aligns with the principle that simplicity scales, as the diffusion policy demonstrably handles variations in door dynamics and external disturbances. The system’s ability to generalize from a limited set of demonstrations suggests an underlying elegance in its design – good architecture is invisible until it breaks, and in this case, the policy consistently performs despite unforeseen circumstances. The focus on end-to-end learning, allowing the system to implicitly model complex relationships, reflects an understanding that structure dictates behavior, and a well-designed learning framework can yield surprisingly capable results. As G.H. Hardy stated, ‘The essence of mathematics is its economy.’ This diffusion policy embodies that economy, achieving coordinated control with a minimal set of assumptions and a focus on core principles.

What’s Next?

The demonstrated success of a diffusion policy in coordinating a mobile manipulator through a constrained task (opening a door, no less) should not be mistaken for a general solution. The elegance of end-to-end learning often obscures the brittleness inherent in systems trained on narrow distributions. This work, while a clear step forward, highlights the continuing challenge of sim-to-real transfer, and the unspoken assumption that ‘robustness to disturbances’ simply means ‘doesn’t fall over immediately.’ The real world, predictably, offers more creative failures.

Future efforts will likely focus on disentangling the learned policy. A single diffusion model, however effective, offers limited insight into why a particular action sequence succeeds. Explicitly representing affordances (what the environment permits) and integrating those into the planning process remains a largely open problem. One suspects that true generality will require a shift away from purely reactive policies, towards systems capable of anticipating the unexpected or, at least, gracefully recovering from it.

Ultimately, this line of research forces a familiar reckoning: architecture is the art of choosing what to sacrifice. The current approach sacrifices interpretability for performance. The next iteration will necessitate a more thoughtful trade-off, acknowledging that a system which appears clever is, more often than not, fragile. The pursuit of truly adaptable manipulation demands nothing less.


Original article: https://arxiv.org/pdf/2605.15352.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-19 00:05