Seeing and Feeling: Robots Learn to Blend Vision and Touch

Author: Denis Avetisyan


New research demonstrates how robots can intelligently combine visual perception with force/torque sensing to perform more nuanced and reliable manipulation tasks.

A robotic manipulation policy that relies solely on visual input falters in contact-rich scenarios, while a naive fusion of force-torque signals can hinder performance in free space. A method that interpretably weights force guidance (increasing it during contact and decreasing it during free motion) achieves an 82% success rate across three challenging tasks, significantly surpassing existing approaches and demonstrating a nuanced understanding of dynamic interaction.

An adaptive vision-torque fusion approach, leveraging diffusion policies, enhances contact awareness in robotic manipulation by selectively integrating tactile feedback.

While vision-based robotic manipulation has advanced significantly, relying solely on visual data proves insufficient in contact-rich scenarios demanding nuanced force awareness. The work ‘Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation’ proposes an approach to multimodal sensor fusion that adaptively integrates force/torque signals with visual observations within diffusion-based policies. Experimental results demonstrate a 14% improvement in success rate from selectively utilizing torque data during contact phases, highlighting the importance of contact-aware integration. Could this adaptive strategy pave the way for more robust and versatile robotic manipulation capabilities in complex, real-world environments?


The Fragility of Sight: Navigating the Limits of Visual Dependence

Conventional robotic manipulation strategies frequently prioritize visual feedback, creating systems susceptible to failure when confronted with real-world unpredictability. This dependence on vision proves problematic as even slight variations in lighting, object appearance, or unexpected contact forces can drastically degrade performance. A robot relying solely on what it sees struggles with tasks requiring fine motor control or adaptability – imagine assembling delicate parts or grasping an object obscured from direct view. This ‘brittleness’ arises because visual data alone often fails to capture the nuanced physical interactions essential for successful manipulation, limiting a robot’s ability to respond effectively to disturbances or maintain a stable grip in dynamic environments.

Contemporary robotic manipulation frequently exhibits a curious fragility stemming from an over-dependence on visual information. This reliance creates a phenomenon termed ‘Modality Collapse’, wherein crucial feedback from force and torque sensors – even when present – is effectively disregarded by the control system. While a camera might confirm an object’s presence, it provides limited insight into the forces actually exerted during interaction. Consequently, even minor disturbances, such as an unexpected slip or a slight obstruction, can derail a manipulation task because the robot isn’t adequately responding to the tactile world. This dismissal of vital haptic data undermines the robot’s ability to adapt and maintain a secure grasp, ultimately limiting its robustness and real-world applicability.

Successfully equipping robots with truly robust manipulation skills hinges on overcoming a significant challenge: the effective integration of tactile sensing into their control systems. Current approaches often prioritize visual data, creating a performance bottleneck when dealing with real-world uncertainties like slippery objects or unexpected contact forces. The difficulty isn’t simply acquiring tactile information, but rather, translating those crucial force and texture readings into a dynamic, adaptive control policy. This requires developing algorithms that can seamlessly blend visual and tactile inputs, allowing the robot to ‘feel’ its way through a task and adjust its actions in real-time. Such a policy would move beyond pre-programmed movements, enabling the robot to respond intelligently to unforeseen circumstances and maintain a secure grasp, even when visual information is incomplete or misleading – ultimately paving the way for more reliable and versatile robotic manipulation.

The proposed method utilizes ResNet-encoded RGB images and MLP-encoded torque signals, modulated by a contact-gated mechanism and a learned scale predictor, to blend modality-specific noise predictions and improve denoising performance within a diffusion process.

A Framework for Sensory Fusion: Beyond the Limitations of Single Modalities

The Diffusion Policy Framework utilized in this work provides a mechanism for learning policies directly from offline, multimodal datasets without requiring explicit trajectory optimization or reward functions. This framework operates by modeling the policy as a diffusion process, effectively learning to reverse a noise process applied to actions conditioned on states. By framing policy learning as a diffusion process, the system can generate diverse and plausible actions, and crucially, accommodate data originating from heterogeneous sources such as vision and tactile sensing. This approach contrasts with traditional reinforcement learning methods, enabling effective learning from complex, high-dimensional, and potentially incomplete datasets commonly found in robotic manipulation tasks.
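
As a rough illustration of this idea, the sketch below shows a denoising training objective of the kind diffusion policies use: a small network learns to predict the noise that was mixed into an action, conditioned on an observation embedding. The network architecture, the linear noise schedule, and all tensor dimensions are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch of a diffusion-policy training objective (assumed details:
# network shape, a simple linear noise schedule, illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action, conditioned on observations."""
    def __init__(self, act_dim=7, obs_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, noisy_action, obs_emb, t):
        t_feat = t.float().unsqueeze(-1) / T           # normalized timestep
        return self.net(torch.cat([noisy_action, obs_emb, t_feat], dim=-1))

def diffusion_loss(model, action, obs_emb):
    """Standard denoising objective: predict the noise mixed into the action."""
    t = torch.randint(0, T, (action.shape[0],))
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, obs_emb, t), noise)
```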

The system employs a dual-encoding architecture to process multimodal sensory input. Visual data is processed via a ResNet-18 convolutional neural network, a pre-trained model commonly used for image feature extraction. Simultaneously, force and torque signals are processed by a Multi-Layer Perceptron (MLP), a fully connected neural network capable of learning complex relationships from the tactile data. The outputs of both the ResNet-18 and the MLP are then fused to provide a comprehensive representation of the robot’s sensory environment, allowing for informed decision-making during task execution.
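
A minimal sketch of such a dual encoder follows, assuming a standard torchvision ResNet-18 backbone, an illustrative MLP for a six-axis wrench signal, and fusion by simple concatenation; the paper’s actual fusion is more involved, as described below.

```python
# Dual-encoder sketch: ResNet-18 for RGB, an MLP for force/torque, concatenated.
# Layer sizes and the concatenation step are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DualEncoder(nn.Module):
    def __init__(self, torque_dim=6, torque_emb=64):
        super().__init__()
        backbone = resnet18()
        backbone.fc = nn.Identity()          # keep the 512-d visual feature
        self.vision = backbone
        self.torque = nn.Sequential(         # MLP over the force/torque signal
            nn.Linear(torque_dim, 128), nn.ReLU(),
            nn.Linear(128, torque_emb),
        )

    def forward(self, rgb, wrench):
        v = self.vision(rgb)                 # (B, 512) visual embedding
        f = self.torque(wrench)              # (B, 64) tactile embedding
        return torch.cat([v, f], dim=-1)     # fused observation embedding

enc = DualEncoder()
obs = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 6))
print(obs.shape)  # torch.Size([2, 576])
```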

The system utilizes a Scale Predictor network to modulate the influence of tactile data during multimodal fusion. This predictor dynamically assigns a weighting factor to the force/torque signal, effectively scaling its contribution relative to the visual input from the ResNet-18 encoder. The output of the Scale Predictor is a scalar value that determines the magnitude of the tactile data’s impact, allowing the system to adaptively prioritize force feedback based on the current state and input characteristics. This dynamic weighting facilitates robust performance in scenarios with varying levels of visual clarity or tactile signal strength, enabling the system to intelligently rely on the most reliable sensory information.
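
A hedged sketch of what such a Scale Predictor could look like, assuming it reads only the raw wrench and squashes its output to [0, 1] with a sigmoid; the architecture and input choice are illustrative, not the paper’s exact design.

```python
# Scale Predictor sketch: map the current force/torque reading to a scalar
# weight in [0, 1] that controls how much the tactile branch contributes.
import torch
import torch.nn as nn

class ScalePredictor(nn.Module):
    def __init__(self, torque_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(torque_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, wrench):
        # Sigmoid keeps the weight in [0, 1]: near 0 in free space, near 1 in contact.
        return torch.sigmoid(self.net(wrench))

w = ScalePredictor()(torch.tensor([[0.1, 0.0, 4.5, 0.2, 0.0, 0.1]]))
```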

An ablation study reveals that our approach predicts higher weights for contact scenarios (represented by orange dots) compared to the Mixture of Experts (MoE) model, indicating improved sensitivity to contact forces.
An ablation study reveals that our approach predicts higher weights for contact scenarios (represented by orange dots) compared to the Mixture of Experts (MoE) model, indicating improved sensitivity to contact forces.

Counteracting Sensory Collapse: Strategies for Adaptive Control

The FACTR methodology utilizes a learning curriculum based on progressively applied visual corruption to promote increased dependence on force inputs during robotic learning. This approach introduces noise and distortions to the visual data received by the robot, effectively reducing its reliance on vision for task completion. By forcing the robot to compensate for the degraded visual information, FACTR encourages the development of a policy that prioritizes and more effectively utilizes force feedback for accurate and robust control, ultimately improving performance in scenarios with limited or unreliable visual data.
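
As an illustration of the curriculum idea, the sketch below assumes Gaussian pixel noise as the corruption and a linear schedule; FACTR’s actual corruption type and schedule may differ.

```python
# Visual-corruption curriculum sketch: images are degraded more aggressively
# as training progresses, pushing the policy toward force/torque feedback.
import torch

def corrupt_image(rgb, step, total_steps, max_sigma=0.5):
    """Add progressively stronger Gaussian noise as training advances (assumed schedule)."""
    sigma = max_sigma * min(step / total_steps, 1.0)
    return rgb + sigma * torch.randn_like(rgb)

# Early in training the image is nearly clean; late in training it is heavily degraded.
noisy_early = corrupt_image(torch.rand(3, 224, 224), step=100, total_steps=10_000)
noisy_late = corrupt_image(torch.rand(3, 224, 224), step=9_000, total_steps=10_000)
```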

The FoAR method addresses modality collapse by integrating a predictive model of future contact events into the control policy. This future contact predictor estimates the likelihood and magnitude of upcoming contact forces and modulates the contribution of force-derived features accordingly. By selectively weighting force features based on anticipated contact, FoAR enables the robot to dynamically prioritize haptic information when it is most relevant, during periods of expected interaction, and to rely more heavily on visual inputs when contact is unlikely. This adaptive modulation of the force-feature contribution improves robustness to sensory noise and enables more efficient use of haptic sensing during complex manipulation tasks.
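
A rough sketch of this anticipation-based modulation, with an assumed one-layer contact predictor operating on the observation embedding; the predictor architecture and feature shapes are illustrative.

```python
# Anticipation-based modulation sketch: predicted contact probability scales
# the force features before they reach the policy.
import torch
import torch.nn as nn

class ContactAnticipator(nn.Module):
    def __init__(self, obs_dim=512):
        super().__init__()
        self.head = nn.Linear(obs_dim, 1)

    def forward(self, obs_emb):
        return torch.sigmoid(self.head(obs_emb))   # estimated P(contact soon)

def modulate_force_features(force_feat, obs_emb, anticipator):
    p_contact = anticipator(obs_emb)               # (B, 1)
    return p_contact * force_feat                  # down-weight when contact is unlikely
```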

Techniques such as TA-VLA and ForceVLA enhance robotic policy learning by directly incorporating torque data into the control policy. These methods utilize force-aware mixture-of-experts (MoE) architectures, allowing the policy to selectively weight contributions from different expert networks based on sensed forces. By integrating torque information and employing MoE layers, the resulting representations become more physically grounded, enabling improved performance and robustness in contact-rich manipulation tasks. This approach facilitates learning policies that are sensitive to external forces and can adapt to varying interaction scenarios.
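
The gating idea can be sketched as follows, with an assumed expert count, layer sizes, and a softmax gate driven directly by the sensed wrench; the real architectures are considerably larger.

```python
# Force-aware mixture-of-experts sketch: the measured wrench drives a softmax
# gate over several expert MLPs, so the sensed forces select the representation.
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    def __init__(self, in_dim=576, out_dim=256, torque_dim=6, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU()) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(torque_dim, n_experts)

    def forward(self, x, wrench):
        weights = torch.softmax(self.gate(wrench), dim=-1)        # (B, E) gate weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, out_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # force-weighted mixture
```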

Implicit Representation Disentanglement and Pairing (ImplicitRDP) addresses modality collapse by explicitly aligning the visual and force encoders within a robotic learning system. This alignment is achieved through a contrastive learning objective that encourages the encoders to produce similar embeddings for corresponding visual and force data points, and dissimilar embeddings for mismatched pairs. By learning a shared representation space, ImplicitRDP facilitates adaptive modality weighting; the system can dynamically prioritize information from either the visual or force sensors based on data reliability and task relevance, mitigating over-reliance on a single modality and improving robustness in noisy or uncertain environments. This approach enables the robot to effectively integrate visual and haptic feedback for improved performance in physical interaction tasks.
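
A minimal sketch of such a contrastive alignment objective, assuming an InfoNCE-style loss over a batch of paired vision/force embeddings; the temperature and normalization choices are illustrative assumptions.

```python
# Contrastive alignment sketch: matched vision/force pairs are pulled together
# in embedding space while mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def alignment_loss(vision_emb, force_emb, temperature=0.1):
    v = F.normalize(vision_emb, dim=-1)
    f = F.normalize(force_emb, dim=-1)
    logits = v @ f.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.shape[0])            # the i-th vision row matches the i-th force row
    return F.cross_entropy(logits, targets)
```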

Distracting torque features lead to failure cases, indicating the importance of robust feature selection in the control policy.

Demonstrating Resilience: Complex Tasks and Robust Performance

Evaluations conducted with a Franka Research 3 Robot demonstrate the framework’s capacity to reliably execute complex manipulation tasks. Performance was rigorously tested across three distinct challenges: the precise extraction of a ‘Twisty Connector’, the delicate opening of an ‘Egg Boiler Lid’, and the nuanced placement of bottles based on perceived weight. Successful completion of these tasks, which require both visual perception and tactile feedback, highlights the system’s robustness and adaptability in real-world scenarios. The framework consistently outperformed existing methods in navigating the intricacies of each task, indicating a significant advancement in robotic manipulation capabilities and a potential for broader application in automated systems.

Reliable robotic manipulation in real-world scenarios demands robust data processing, particularly regarding force and torque sensing. This system incorporates ‘Contact Gating’, a filtering mechanism specifically designed to mitigate noise inherent in force/torque data acquisition. By discerning and attenuating spurious signals that arise during contact, this technique significantly enhances the stability and precision of robotic control. The result is a system less susceptible to external disturbances and sensor inaccuracies, leading to improved task completion rates and more consistent performance across complex manipulation challenges. This proactive noise reduction is crucial for achieving dependable automation in unstructured environments where unpredictable contact forces are common.
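
A simplified sketch of contact gating, assuming a hard threshold on the measured force magnitude; the mechanism described in the paper is likely smoother and learned rather than a fixed cutoff.

```python
# Contact-gating sketch: torque features pass through only when the measured
# force magnitude exceeds a threshold, suppressing free-space sensor noise.
import torch

def contact_gate(torque_feat, wrench, force_threshold=2.0):
    """Attenuate torque features in free space, where the signal is mostly noise."""
    force_mag = wrench[:, :3].norm(dim=-1, keepdim=True)   # magnitude of the force component
    gate = (force_mag > force_threshold).float()           # 1 in contact, 0 otherwise
    return gate * torque_feat
```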

The system achieves robust control through a ‘CFG-Style Fusion’ technique, intelligently integrating visual and tactile information. Rather than relying on fixed weighting, this approach utilizes a ‘Scale Predictor’ to dynamically assess the relative importance of each sensory input during task execution. This means the robot doesn’t simply blend vision and touch equally; instead, it learns to prioritize whichever sense is most reliable for the current situation. For example, during initial object localization, visual data might dominate, while during manipulation requiring precise force control, tactile feedback becomes paramount. This dynamic balancing, unlike static methods that assign fixed weights such as 0.7634 to vision and 0.2366 to torque, allows for more adaptable and resilient performance across complex tasks, contributing to the observed 14% improvement in overall task success rate.
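
The blend itself can be sketched in the style of classifier-free guidance, with the predicted scale interpolating between a vision-only noise prediction and a vision-plus-torque prediction; the variable names and shapes here are illustrative.

```python
# CFG-style fusion sketch: the Scale Predictor's output interpolates between a
# vision-only noise prediction and one conditioned on both vision and torque.
import torch

def cfg_style_fusion(eps_vision, eps_fused, scale):
    """scale near 0 in free space (trust vision), larger in contact (add force guidance)."""
    return eps_vision + scale * (eps_fused - eps_vision)

eps_v = torch.randn(2, 7)          # noise prediction conditioned on vision only
eps_vf = torch.randn(2, 7)         # noise prediction conditioned on vision + torque
w = torch.tensor([[0.1], [0.9]])   # per-sample scale from a predictor like the one above
blended = cfg_style_fusion(eps_v, eps_vf, w)
```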

The robotic system demonstrably excels in complex manipulation tasks, achieving an overall success rate of 82%. This represents a significant 14% performance increase compared to the most effective existing methods. A key feature enabling this improvement is the system’s ability to dynamically adjust its reliance on visual and tactile feedback during contact. Unlike baseline models – which maintained static weights of 0.7634 for vision and 0.2366 for torque throughout contact – this approach intelligently prioritizes information sources based on task demands, offering greater adaptability and a more interpretable control strategy.

The agent successfully learns to manipulate objects across three distinct tasks: opening an egg boiler lid, placing bottles by weight, and extracting a twisty connector.

The pursuit of robust robotic manipulation, as detailed in this work, necessitates acknowledging the inevitable entropy of any system. Just as a perfectly polished lens will eventually gather dust, even the most sophisticated algorithms encounter limitations when confronted with the messy realities of physical interaction. This research addresses that decay through adaptive integration, selectively prioritizing force/torque feedback during contact, effectively gating noise and preserving performance. This echoes Carl Friedrich Gauss’s sentiment: “Few things are more deceptive than a simple appearance.” The apparent simplicity of vision-based manipulation belies the complexities introduced by contact; a truly graceful system, like that proposed, acknowledges and adapts to these inherent imperfections, rather than striving for an impossible ideal of pristine, unwavering accuracy.

What’s Next?

The presented work offers a temporary reprieve from the inevitable: the difficulty of robust robotic interaction. Selective integration of force/torque data, while demonstrably effective, merely postpones the fundamental challenge: that all sensory input degrades over time, and contact, by its nature, accelerates that decay. Systems do not fail due to accumulated errors, but because time operates without preference. The current approach refines the signal, but does not address the entropy inherent in the interaction itself.

Future iterations will likely focus on anticipating the loss of fidelity in both visual and tactile sensing. The question is not how to better see and feel, but how to build systems that gracefully degrade, accepting that complete awareness is a fleeting illusion. An interesting direction lies in exploring meta-policies: algorithms that learn when to trust, or distrust, any given sensor, rather than attempting to perfectly fuse their outputs.

Ultimately, the field may need to shift its focus from precise control to resilient adaptation. Sometimes stability is just a delay of disaster. A truly robust system won’t strive for perfect perception, but for the ability to continue functioning, even, and especially, when perception fails.


Original article: https://arxiv.org/pdf/2604.01414.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
