Author: Denis Avetisyan
A new vision-based approach enables accurate, real-time 3D shape reconstruction of soft robots in unstructured environments, eliminating the need for external markers or extensive training data.

This work introduces Appearance-Based Feature Tracking (AFT) for markerless, training-free, and robust shape reconstruction using RGB-D sensing and hierarchical reconstruction.
Accurate and reliable shape reconstruction remains a significant challenge for the widespread deployment of soft robots in unstructured environments. This paper introduces AFT: Appearance-Based Feature Tracking, a novel vision-based framework for markerless and training-free soft robot shape reconstruction that leverages inherent surface appearance as implicit visual markers. By decoupling local feature alignment from global kinematic optimization, AFT achieves real-time tracking with robustness to occlusions and viewpoint changes, demonstrating sub-3% tip error in experimental validation. Could this approach unlock more adaptable and cost-effective control strategies for soft robots operating in complex, real-world scenarios?
The Inevitable Drift: Addressing the Challenge of Embodied Intelligence
Conventional robotics often necessitates painstakingly detailed models of the robot and its environment, demanding meticulous calibration before deployment. This reliance on pre-programmed precision hinders a robot’s ability to function effectively when confronted with the unpredictable nature of real-world scenarios. While successful in structured settings like factory assembly lines, these systems frequently falter when exposed to the ambiguities of unstructured environments – think navigating cluttered homes, conducting search and rescue operations, or exploring disaster zones. The rigid design and control schemes struggle to accommodate unexpected obstacles, varying terrains, or dynamic changes, ultimately limiting the robot’s adaptability and overall robustness. This constraint highlights a crucial need for robotic systems capable of learning and adjusting to unforeseen circumstances, rather than strictly adhering to pre-defined parameters.
The pursuit of truly adaptable robots has led researchers toward designs incorporating significant flexibility, often manifesting as “soft robots.” However, controlling these highly deformable systems introduces a fundamental difficulty: their virtually infinite degrees of freedom. Unlike rigid robots with a defined set of joints, a soft robot’s shape is determined by an infinite number of points along its body, making traditional control algorithms – which rely on precise joint angles and positions – inadequate. This presents a significant computational hurdle; accurately predicting the robot’s behavior and exerting precise control requires modeling and accounting for the continuous interplay of forces and deformations across its entire structure. Consequently, novel control strategies are needed, often drawing from areas like optimization, machine learning, and even concepts borrowed from fluid dynamics, to navigate this immense and continuous control space and achieve reliable, predictable movement.
Effective control of soft robots hinges on a precise understanding of their configuration – their pose and the shape they assume in any given moment. However, traditional state estimation techniques, designed for rigid-bodied robots with well-defined joints, falter when applied to these highly deformable systems. The infinite degrees of freedom inherent in soft robot kinematics create a computational challenge; determining the robot’s state requires modeling an enormous, continuous configuration space. Researchers are actively exploring novel approaches, including sensor fusion with flexible sensors, learning-based methods that directly map sensor data to control actions, and computationally efficient models that approximate the robot’s complex deformation, all in pursuit of reliable state estimation and, ultimately, precise and adaptable control of these promising machines.

The Gaze of the Machine: Vision-Based Shape Sensing as a New Paradigm
Vision-based shape sensing presents a departure from conventional robotic state estimation, which typically relies on encoders, IMUs, or force/torque sensors. This approach instead infers a robot’s configuration – its shape and pose in three-dimensional space – directly from visual data. By processing images captured by cameras, the system reconstructs the robot’s geometry and determines its position and orientation without relying on direct physical measurements of joint angles or external forces. This offers potential advantages in scenarios where traditional sensors are impractical, unreliable, or too costly to implement, and allows for sensing of the robot’s complete shape, including deformations that may not be directly measurable with internal sensors.
RGB-D cameras function by combining standard color (RGB) imagery with depth information, typically acquired via structured light, time-of-flight, or stereo vision. The RGB component provides texture and visual details, while the depth data represents the distance from the camera to each point in the scene. This combined data allows for the creation of a 3D point cloud representing the robot’s surface. The depth information, measured in units such as millimeters or meters, is crucial for reconstructing the robot’s geometry and pose, enabling accurate shape sensing. Specifically, the depth map, a grayscale image where pixel intensity corresponds to distance, is often used in conjunction with the color image to generate a complete 3D model of the robot’s structure.
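As a concrete illustration, the sketch below back-projects a depth map into a point cloud under a pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) and their values are illustrative assumptions, not those of any particular sensor or of the AFT pipeline.

```python
import numpy as np

def depth_to_point_cloud(depth_m, fx, fy, cx, cy):
    """Back-project a depth map (in meters) into a 3D point cloud
    using the pinhole camera model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx   # X grows to the right of the optical axis
    y = (v - cy) * z / fy   # Y grows downward in image coordinates
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Example: a synthetic 480x640 depth map with assumed intrinsics
depth = np.full((480, 640), 1.2)          # every pixel 1.2 m away
cloud = depth_to_point_cloud(depth, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```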
Structure from Motion (SfM), together with its widely used implementation COLMAP, is a key technique in vision-based shape sensing for generating 3D models of robots from 2D image sequences. SfM algorithms identify and track features across multiple camera views, estimating camera poses and sparse 3D point clouds simultaneously. COLMAP builds upon SfM by providing a robust and scalable implementation, incorporating techniques for feature extraction, matching, bundle adjustment, and dense reconstruction. Bundle adjustment is a non-linear optimization process that refines both the 3D point cloud and the camera poses to minimize reprojection error, the difference between where a 3D point projects into an image and where it was actually observed. The resulting dense 3D reconstruction provides a detailed representation of the robot’s shape, enabling accurate pose estimation and shape sensing without requiring a pre-built model or precise kinematic parameters.
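A hedged sketch of the residual that bundle adjustment drives toward zero, using a bare pinhole projection without lens distortion; the function names and toy numbers are illustrative and this is not COLMAP’s actual implementation:

```python
import numpy as np

def reproject(point_3d, R, t, fx, fy, cx, cy):
    """Project a 3D world point into a camera with rotation R, translation t."""
    p_cam = R @ point_3d + t                 # world -> camera frame
    u = fx * p_cam[0] / p_cam[2] + cx        # perspective division
    v = fy * p_cam[1] / p_cam[2] + cy
    return np.array([u, v])

def reprojection_residual(point_3d, observed_uv, R, t, K):
    """Residual that bundle adjustment minimizes for every
    (camera, point, observation) triple."""
    fx, fy, cx, cy = K
    return reproject(point_3d, R, t, fx, fy, cx, cy) - observed_uv

# Toy check: a point 2 m in front of an identity-pose camera
K = (600.0, 600.0, 320.0, 240.0)
res = reprojection_residual(np.array([0.1, -0.05, 2.0]),
                            np.array([350.0, 225.0]),
                            np.eye(3), np.zeros(3), K)
print(res)  # [0. 0.] here, since the observation matches the projection
```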

Implicit Markers: Appearance-Based Feature Tracking for Robust Localization
Appearance-based Feature Tracking (AFT) represents a departure from traditional visual odometry and simultaneous localization and mapping (SLAM) techniques, which frequently rely on explicitly defined markers or on additional external sensors such as LiDAR or motion capture systems. Instead, AFT utilizes the inherent texture and visual characteristics of the robot’s own surface as intrinsic, implicit markers for localization and tracking. This is achieved by processing visual data – typically from cameras observing the robot – to identify and monitor stable features directly on the observed surfaces. By removing the dependency on external infrastructure or specialized hardware, AFT aims to create a more adaptable and cost-effective solution for robot state estimation in dynamic and unstructured environments.
Appearance-based Feature Tracking (AFT) relies on deep learning architectures to identify and monitor salient points within visual data. Specifically, the ResNet-50 convolutional neural network is employed for its proven feature extraction capabilities, providing a robust initial representation of the observed surface. Complementing this, the Segment Anything Model (SAM) is integrated to perform instance segmentation, enabling the isolation and tracking of individual features even under conditions of partial occlusion or significant viewpoint change. Because both models come pretrained on large, diverse datasets, they generalize across lighting conditions and surface textures without task-specific training, providing a reliable basis for pose estimation and localization without external markers.
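A hedged sketch of the backbone side of such a pipeline, using torchvision’s pretrained ResNet-50 with the classification head removed as a dense feature extractor; the SAM segmentation step is omitted, and the layer choice and preprocessing shown here are assumptions rather than AFT’s exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Pretrained ResNet-50 with the classification head removed, so the
# output is a dense feature map rather than class logits.
weights = ResNet50_Weights.DEFAULT
backbone = nn.Sequential(*list(resnet50(weights=weights).children())[:-2])
backbone.eval()

preprocess = weights.transforms()  # resize + ImageNet normalization

# A dummy RGB frame standing in for a camera view of the robot surface.
frame = torch.rand(3, 480, 640)
with torch.no_grad():
    features = backbone(preprocess(frame).unsqueeze(0))

print(features.shape)  # e.g. torch.Size([1, 2048, 7, 7]) after resizing
# Each spatial cell is a descriptor of a surface patch; matching such
# descriptors across frames provides the implicit "markers" AFT tracks.
```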
Multi-scale feature extraction improves the robustness of Appearance-Based Feature Tracking (AFT) by processing input images at multiple resolutions. This approach involves extracting features from the image at varying scales, allowing the system to capture both fine-grained details and broader contextual information. Utilizing features from multiple scales enables the tracking algorithm to maintain lock even when faced with partial occlusions, changes in lighting, or significant viewpoint shifts. The system effectively combines information from these different scales, providing a more complete and reliable representation of the tracked surface and improving overall tracking accuracy and stability in dynamic environments.
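One common way to realize such multi-scale extraction, sketched here as an assumption rather than AFT’s exact scheme, is to run a single backbone over an image pyramid and keep the feature map from every level:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multiscale_features(backbone, image, scales=(1.0, 0.5, 0.25)):
    """Run one backbone over several resolutions of the same image;
    coarse levels capture context, fine levels capture texture detail."""
    maps = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s,
                                mode="bilinear", align_corners=False)
        with torch.no_grad():
            maps.append(backbone(resized))
    return maps

# Stand-in backbone (a single conv) so the sketch runs on its own;
# in practice this would be the ResNet-50 feature extractor above.
toy_backbone = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
frame = torch.rand(1, 3, 240, 320)
for level in multiscale_features(toy_backbone, frame):
    print(level.shape)  # one feature map per scale
```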
Differentiable rendering optimizes the reconstruction process in appearance-based feature tracking by formulating the rendering pipeline as a fully differentiable function. This allows gradients to be backpropagated from the observed image space directly to the 3D scene parameters – including pose and geometry – enabling gradient-based optimization. Traditional rendering pipelines are often non-differentiable due to discrete operations like ray intersection and shading, necessitating approximations or heuristics for optimization. By utilizing differentiable approximations of these operations, the system can directly minimize the difference between the rendered image and the observed image, leading to more accurate and robust reconstruction and pose estimation. This approach eliminates the need for separate optimization steps and allows for end-to-end learning of the reconstruction process.
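The sketch below illustrates the underlying principle with a deliberately simplified “renderer”, a differentiable pinhole projection of a few surface points: an image-space loss is backpropagated directly to a pose parameter and minimized by gradient descent. A real differentiable renderer also handles rasterization, visibility, and shading, so this is an analogy rather than the paper’s implementation.

```python
import torch

# Assumed intrinsics and a tiny set of surface points (illustrative values).
fx = fy = 600.0
cx, cy = 320.0, 240.0

points = torch.tensor([[0.0, 0.0, 1.0],
                       [0.1, 0.0, 1.0],
                       [0.0, 0.1, 1.0]])            # robot surface points
observed = torch.tensor([[380., 240.], [440., 240.], [380., 300.]])

translation = torch.zeros(3, requires_grad=True)     # pose parameter to recover
optimizer = torch.optim.Adam([translation], lr=0.01)

for step in range(200):
    p = points + translation                          # apply the candidate pose
    u = fx * p[:, 0] / p[:, 2] + cx                   # differentiable projection
    v = fy * p[:, 1] / p[:, 2] + cy
    loss = ((torch.stack([u, v], dim=1) - observed) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()                                   # image-space error -> pose gradient
    optimizer.step()

print(translation.detach())  # converges toward the offset that explains the observations
```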

Adaptive Control and Robust Performance: The Promise of Soft Robotics Realized
Accurate and robust shape estimation is fundamental to enabling truly adaptive control in soft robotics, and the AFT system directly addresses this need. By providing a reliable understanding of a soft robot’s configuration, AFT facilitates closed-loop control, allowing these robots to dynamically adjust to unforeseen circumstances and varying task demands. Unlike traditional open-loop systems where pre-programmed movements are executed regardless of external factors, AFT-driven closed-loop control continuously senses the robot’s state and modifies its actions accordingly. This capability is particularly crucial in unstructured environments where obstacles, surface irregularities, or unexpected interactions are common, allowing soft robots to maintain performance and navigate complex scenarios with greater resilience and precision.
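A minimal sense-estimate-act sketch of such a closed loop, where estimate_tip and send_actuation are hypothetical placeholder interfaces standing in for the AFT reconstruction and the robot’s actuation layer:

```python
import numpy as np

def estimate_tip(frame_rgbd):
    """Placeholder for the shape reconstruction step: returns an
    estimated 3D tip position from one RGB-D frame (hypothetical)."""
    return np.array([0.10, 0.02, 0.30])

def send_actuation(command):
    """Placeholder for the robot's actuation interface (hypothetical)."""
    print("actuation command:", np.round(command, 4))

target_tip = np.array([0.12, 0.00, 0.30])
gain = 0.5  # proportional gain; a real controller needs tuning and limits

# One iteration of a sense -> estimate -> act loop; in deployment this
# would run continuously at the reconstruction update rate.
frame = None                      # stands in for a captured RGB-D frame
error = target_tip - estimate_tip(frame)
send_actuation(gain * error)
```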
A novel hierarchical reconstruction strategy significantly enhances the accuracy of soft robot pose estimation by addressing the challenges of complex deformations. This approach separates the reconstruction process into distinct local and global stages; initially, individual segments of the robot are reconstructed through localized partition matching, focusing on immediate visual data. Subsequently, a global kinematic optimization refines these local reconstructions, ensuring overall consistency and adherence to the robot’s mechanical constraints. By decoupling these stages, the system avoids the computational burden and potential inaccuracies of simultaneously optimizing both local details and the overall robot pose, leading to a more efficient and robust solution for tracking soft robots in dynamic environments.
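The two-stage idea can be caricatured as follows; the per-segment centroids and the fixed-length chain refinement below are simplified stand-ins for the paper’s localized partition matching and global kinematic optimization, chosen only to show how the local and global stages decouple:

```python
import numpy as np

def local_stage(surface_points, labels, n_segments):
    """Local stage: estimate each segment's position independently from
    the surface points assigned to it (here, simple centroids)."""
    return np.array([surface_points[labels == i].mean(axis=0)
                     for i in range(n_segments)])

def global_stage(centroids, segment_length):
    """Global stage: refine the chain so consecutive segments respect a
    fixed length, a stand-in for full kinematic optimization."""
    refined = [centroids[0]]
    for c in centroids[1:]:
        direction = c - refined[-1]
        direction /= np.linalg.norm(direction)
        refined.append(refined[-1] + segment_length * direction)
    return np.array(refined)

# Toy data: noisy surface points along a gently bending robot body.
rng = np.random.default_rng(0)
n_segments, pts_per_seg = 5, 40
labels = np.repeat(np.arange(n_segments), pts_per_seg)
centerline = np.stack([np.linspace(0, 0.4, n_segments),
                       0.1 * np.linspace(0, 0.4, n_segments) ** 2,
                       np.zeros(n_segments)], axis=1)
points = centerline[labels] + 0.01 * rng.standard_normal((labels.size, 3))

shape = global_stage(local_stage(points, labels, n_segments), segment_length=0.1)
print(shape)  # reconstructed segment positions, locally fit then globally refined
```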
A significant advancement in soft robotics control is demonstrated through a system capable of achieving a relative tip error of just 2.6±1.3% – a level of precision typically requiring the complex and expensive infrastructure of optical motion capture. This performance is attained utilizing a single, readily available RGB-D camera, offering a substantial reduction in both cost and setup complexity. The accuracy suggests the system can reliably guide soft robots in intricate tasks and dynamic environments, opening possibilities for applications where precise positioning is crucial, such as minimally invasive surgery or delicate manipulation of fragile objects. This achievement underscores the potential for vision-based systems to provide high-fidelity control without the limitations of traditional tracking methods.
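Relative tip error is typically the Euclidean tip position error normalized by the robot’s length; the snippet below assumes that definition (the paper’s exact normalization may differ) to show how a millimetre-scale error maps to the reported percentage range:

```python
import numpy as np

def relative_tip_error(tip_estimated, tip_ground_truth, robot_length):
    """Tip position error as a fraction of robot length (assumed definition)."""
    return np.linalg.norm(tip_estimated - tip_ground_truth) / robot_length

# Example: a 5 mm tip error on a 200 mm soft arm -> 2.5% relative error,
# in the same range as the reported 2.6 +/- 1.3%.
err = relative_tip_error(np.array([0.100, 0.050, 0.201]),
                         np.array([0.100, 0.045, 0.201]),
                         robot_length=0.200)
print(f"{100 * err:.1f}%")  # 2.5%
```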
The accuracy and reliability of the adaptive control system are rigorously validated against optical motion capture technology. This established technique serves as a crucial source of ground truth data for the shape estimation and control algorithms. By comparing the system’s output with the highly precise measurements from the motion capture system, researchers can identify and address potential inaccuracies, refine the reconstruction process, and optimize the overall performance of the soft robot control scheme. This comparative analysis not only ensures the robustness of the system under various conditions but also facilitates ongoing improvements to its adaptive capabilities, pushing the boundaries of precision in soft robotics.
Real-time performance is critical for effective soft robot control, and this system achieves an update rate of 2.5 Hz on a high-performance computing setup. This means the shape estimation and reconstruction process essential for adaptive control is completed and refreshed roughly every 400 milliseconds. The system’s responsiveness is enabled by an Intel Core i9-12900 processor and an NVIDIA RTX A4000 GPU, demonstrating the computational resources necessary to process visual data and maintain closed-loop control with minimal latency. This processing speed allows the soft robot to react quickly to changes in its environment or task, contributing to its robustness and adaptability in dynamic scenarios.
The system demonstrates a noteworthy capacity to maintain accurate shape estimation even when portions of the soft robot are temporarily obscured. Performance remains stable with occlusions covering up to 55% of the robot along its length and up to 25% of its width, a critical feature for real-world applications where self-occlusion and external interference are common. This resilience stems from the algorithm’s ability to interpolate missing data based on the observed segments and prior kinematic models, ensuring continuous and reliable control despite incomplete visual information. Such robustness minimizes the need for extensive sensor redundancy and facilitates deployment in cluttered or dynamic environments where consistent tracking is paramount.

The pursuit of robust shape reconstruction, as demonstrated in this work, echoes a fundamental truth about all systems. They are, by their very nature, transient. This paper addresses the challenge of tracking soft robots in unstructured environments without relying on pre-defined markers or extensive training data – a pragmatic acknowledgment of real-world entropy. As Claude Shannon observed, “Communication is the process of conveying meaning, but it’s always subject to noise.” Similarly, visual tracking isn’t about perfect fidelity, but about extracting meaningful information despite inherent ambiguity and change. The hierarchical reconstruction strategy presented here isn’t about stopping decay, but about building a system resilient enough to adapt and maintain functionality within it. Versioning, in this context, is a form of memory, allowing the system to recall prior states and navigate the arrow of time toward continued operation.
What Lies Ahead?
The presented work achieves a notable decoupling of reconstruction from the typical demands of pre-trained models or rigid markers. This is not, however, a victory over entropy, but a momentary stay of execution. All systems, even those built on clever algorithms, are subject to the inevitable degradation of signal in complex environments. The robustness demonstrated here simply postpones the moment when accumulated noise overwhelms the appearance-based tracking – a delay, not a denial, of eventual failure.
Future efforts will undoubtedly focus on expanding the scope of ‘unstructured environments’ – a phrase that tacitly acknowledges the inherent structurelessness of reality. Yet, a more fundamental challenge remains: the limitations of appearance itself. Visual features, by their nature, are transient indicators of deeper, often obscured, kinematic states. Relying solely on what is seen invites a constant struggle against occlusion, lighting variations, and the subtle shifts in material properties that betray the robot’s internal configuration.
It is conceivable that future iterations will integrate this visual tracking with alternative sensing modalities – tactile feedback, perhaps, or even subtle electromagnetic field analysis. But such additions are merely attempts to gather more data before the inevitable decay of information. The question is not whether this system will eventually fail, but how gracefully it will age, and what insights its eventual disintegration might offer about the limits of perception and control.
Original article: https://arxiv.org/pdf/2511.18215.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/