Seeing Beyond Obstructions: Vision-Based Robot Control with Image Inpainting

Author: Denis Avetisyan


This review details a markerless vision system that enables robotic manipulators to reliably detect keypoints and perform control tasks even in cluttered or partially obscured environments.

Despite visual obstructions inherent in dynamic scenes, a robotic manipulator achieves continuous cup-stacking through adaptive visual servoing, reconstructing occluded views via inpainting to reliably track body keypoints (visual features leveraged for precise, real-time control) and demonstrating that robust manipulation necessitates not simply overcoming occlusion, but actively growing around it.

A novel pipeline leverages image inpainting, keypoint detection, and adaptive control techniques for robust vision-based manipulation.

Achieving robust vision-based control of robotic manipulators remains challenging in dynamic environments lacking precise models or external markers. This paper, ‘Utilizing Inpainting for Keypoint Detection for Vision-Based Control of Robotic Manipulators’, introduces a novel pipeline leveraging image inpainting to generate automatically labeled training data for keypoint detection, enabling markerless control. By strategically removing synthetic ArUco markers and reconstructing occluded regions, the framework trains a robust keypoint detector and employs a Kalman filter for stable, adaptive control, eliminating dependencies on accurate calibration or robot models. Could this approach unlock more adaptable and efficient robotic systems capable of operating reliably in truly unstructured settings?


The Inevitable Blind Spot: Perception in a Noisy World

Robotic systems traditionally depend on precise environmental perception to execute tasks effectively, but real-world settings present considerable obstacles to this accuracy. A primary challenge is occlusion – the blocking of one object by another – which frequently disrupts a robot’s ability to fully ‘see’ its surroundings. This isn’t merely a matter of incomplete data; it fundamentally impacts the algorithms used for navigation and manipulation. Keypoint detection, a cornerstone of robotic vision, becomes unreliable when crucial visual cues are hidden, potentially leading to miscalculations in distance, orientation, and even object identification. Consequently, even seemingly simple actions can become problematic, demanding a shift towards vision systems capable of functioning despite these inherent perceptual limitations.

The efficacy of robotic manipulation and navigation hinges on the precise identification of keypoints – distinct features within a visual scene that allow a robot to understand its surroundings. However, real-world environments are rarely pristine; objects frequently obscure one another, creating occlusions that severely disrupt these keypoint detection algorithms. When a robot’s ‘eyes’ lose track of critical features due to blockage, its internal models of the world become inaccurate. This can manifest as jerky, unstable movements as the robot attempts to compensate for the lost information, or, in more severe cases, lead to complete control failure and collisions. Consequently, the susceptibility of keypoint detection to occlusion represents a major bottleneck in achieving truly robust and reliable robotic systems, demanding innovative approaches to perception that can ‘see’ beyond what is immediately visible.

A truly resilient robotic vision system transcends simple object recognition, instead functioning as an active inferential engine. When portions of a scene are obscured – by other objects, lighting changes, or incomplete sensor data – the system doesn’t merely register a loss of information; it proactively predicts what lies hidden. This predictive capability isn’t based on guesswork, but on learned models of the environment and the physical properties of objects within it. By leveraging prior knowledge and contextual understanding, the system can effectively “fill in the gaps,” maintaining a consistent and accurate representation of the scene even when faced with substantial occlusion. This ability to infer missing data is crucial for reliable robotic manipulation, navigation, and interaction, allowing the robot to continue functioning effectively in dynamic and unpredictable real-world settings and avoid the pitfalls of incomplete perception.

Despite successful convergence in all tested scenarios, increasing levels of occlusion introduce noticeable noise into the robot’s trajectory as it navigates towards the goal configuration (red keypoints).

Reconstructing the Ghost in the Machine

The reconstruction of occluded robot parts is achieved through an Attention U-Net, a convolutional neural network architecture designed for image inpainting. This network utilizes an encoder-decoder structure combined with attention mechanisms that allow it to selectively focus on relevant features within the visible portions of an image when reconstructing missing regions. The U-Net architecture facilitates the preservation of fine-grained details, while the incorporated attention modules enable the model to prioritize contextual information crucial for accurate inpainting of the robot’s obscured components. This approach effectively fills in missing visual data, producing a complete representation of the robot despite partial occlusions.

The Attention U-Net architecture incorporates attention mechanisms to selectively focus on pertinent visual features during the reconstruction process. Specifically, these mechanisms allow the model to weigh the importance of different image regions, prioritizing those most relevant to the occluded areas being inpainted. This is achieved through the calculation of attention weights, which modulate the feature maps at various layers of the U-Net, enabling the network to dynamically adjust its focus. By emphasizing informative features and suppressing irrelevant ones, attention mechanisms mitigate the impact of noise and ambiguity, leading to improved reconstruction quality and more realistic inpainting results, particularly in complex scenes with multiple objects and textures.
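
The attention-gating idea described above can be illustrated in a few lines. The following is a minimal, single-gate NumPy sketch of the additive attention mechanism commonly used in Attention U-Nets (not the paper's actual implementation; the projection matrices `W_x`, `W_g` and vector `psi` are illustrative stand-ins for learned parameters):

```python
import numpy as np

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate (simplified NumPy sketch).

    x    : skip-connection features, shape (H, W, C)
    g    : gating signal from the decoder path, shape (H, W, C)
    W_x, W_g : (C, C_int) learned projection matrices (random here)
    psi  : (C_int,) vector collapsing intermediate features to one score
    """
    # Project both inputs into a shared intermediate space and combine.
    q = np.maximum(x @ W_x + g @ W_g, 0.0)       # ReLU(W_x x + W_g g)
    # One scalar attention coefficient per spatial location, in (0, 1).
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))     # sigmoid
    # Re-weight the skip features: informative regions pass, others shrink.
    return x * alpha[..., None]

rng = np.random.default_rng(0)
H, W, C, C_int = 8, 8, 4, 6
x = rng.normal(size=(H, W, C))
g = rng.normal(size=(H, W, C))
out = attention_gate(x, g, rng.normal(size=(C, C_int)),
                     rng.normal(size=(C, C_int)), rng.normal(size=C_int))
print(out.shape)  # (8, 8, 4)
```

Because each attention coefficient lies in (0, 1), the gate can only attenuate skip-connection features, never amplify them, which is what lets the decoder suppress regions irrelevant to the occluded area.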

The Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN_GP) training framework was implemented to address instabilities commonly encountered when training Generative Adversarial Networks (GANs). Traditional GANs often suffer from mode collapse and vanishing gradients, hindering convergence and the quality of generated samples. WGAN_GP enforces a Lipschitz constraint on the critic, or discriminator, using a gradient penalty term, promoting more stable training dynamics and improved sample diversity. This approach replaces the original discriminator’s classification task with a regression problem, estimating the Wasserstein distance between the generated and real data distributions, and consequently allows for a more meaningful loss function and reliable convergence during the training of the Attention U-Net.
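
For reference, the standard WGAN-GP critic objective (Gulrajani et al.) that this training framework builds on can be written as:

```latex
\mathcal{L}_{\text{critic}} =
  \underbrace{\mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\!\big[D(\tilde{x})\big]
            - \mathbb{E}_{x\sim\mathbb{P}_r}\!\big[D(x)\big]}_{\text{Wasserstein distance estimate}}
  \;+\; \lambda\,
  \underbrace{\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}
     \Big[\big(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1\big)^{2}\Big]}_{\text{gradient penalty}}
```

where $\hat{x}$ is sampled uniformly along straight lines between real and generated samples, and $\lambda$ is commonly set to 10. The penalty term softly enforces the Lipschitz constraint mentioned above; the paper's exact hyperparameters are not stated here.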

The inpainting model’s training data relies on the precise localization provided by ArUco markers. These fiducial markers are strategically placed on the robot during data acquisition, enabling the creation of accurately segmented images with known occlusions. By systematically obscuring portions of the robot visible in images containing ArUco markers, paired datasets of occluded and complete views are generated. This approach ensures the ground truth for the inpainting task is consistently and reliably defined, facilitating the model’s ability to learn robust reconstruction capabilities. The use of ArUco markers minimizes ambiguity in data labeling and allows for the automated generation of a large, high-quality training set necessary for effective model performance.
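
The paired-data idea reduces to a simple recipe: keep the marker-localized image as ground truth, punch a hole in a copy, and record the mask. A hedged NumPy sketch of one such pair (the function name and rectangular-patch occlusion are illustrative; the paper also uses YCB-object and real-scene occluders):

```python
import numpy as np

def make_training_pair(image, rng, max_frac=0.4):
    """Generate one (occluded, complete, mask) training triple.

    A hypothetical stand-in for the paper's data pipeline: the 'complete'
    view is the image itself; the occluded view has a random rectangular
    patch zeroed out, with a binary mask marking the hole.
    """
    h, w = image.shape[:2]
    # Random rectangle covering up to max_frac of each dimension.
    ph = rng.integers(1, int(h * max_frac) + 1)
    pw = rng.integers(1, int(w * max_frac) + 1)
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    occluded = image.copy()
    occluded[y:y + ph, x:x + pw] = 0          # punch the hole
    mask = np.zeros((h, w), dtype=bool)
    mask[y:y + ph, x:x + pw] = True
    return occluded, image, mask

rng = np.random.default_rng(1)
img = rng.random((64, 64, 3))
occ, gt, mask = make_training_pair(img, rng)
print(mask.sum() > 0, np.allclose(occ[~mask], gt[~mask]))  # True True
```

Because the complete view is available by construction, the inpainting loss can be supervised densely inside the mask without any manual labeling.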

The training dataset for the Attention U-Net GAN model includes synthetic occlusions created with YCB objects and random patches, as well as realistic occlusions derived from real-world scene fragments to improve robustness to visual clutter.

Pinpointing the Ephemeral: Keypoint Localization Under Duress

A KeypointRCNN model forms the primary component for detecting keypoints on the robot’s body using visual data. This model, a convolutional neural network architecture, is specifically trained to identify and localize these keypoints within image frames captured from onboard cameras. The selected architecture allows for robust performance in varying lighting conditions and viewpoints. Input images are processed through the network to generate bounding box detections and associated confidence scores for each keypoint. The output provides the 2D pixel coordinates of each identified keypoint, which are then used for downstream tasks such as pose estimation and motion planning.

The KeypointRCNN model demonstrates high accuracy in robot body keypoint detection, achieving 98.75% accuracy when identifying 44 keypoints and 97.69% accuracy with 88 keypoints. These accuracy metrics are specifically defined as the percentage of correctly detected keypoints within a 5-pixel margin of the ground truth location. This tolerance accounts for minor localization errors inherent in visual detection systems and provides a robust measure of performance under typical operating conditions.
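
The quoted metric is straightforward to compute: a keypoint counts as correct when its predicted pixel coordinate falls within 5 pixels of ground truth. A minimal sketch (the toy coordinates are illustrative, not the paper's data):

```python
import numpy as np

def keypoint_accuracy(pred, gt, tol=5.0):
    """Fraction of keypoints within `tol` pixels of ground truth.

    pred, gt: (N, 2) arrays of (x, y) pixel coordinates.
    Mirrors the metric quoted above (correct = within a 5-pixel margin).
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= tol))

gt = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 20.0], [30.0, 70.0]])
pred = gt + np.array([[1.0, 2.0], [4.0, 3.0], [0.5, -0.5], [6.0, 0.0]])
print(keypoint_accuracy(pred, gt))  # 0.75 -- the 6-pixel miss fails the margin
```

Note that the margin is applied to the Euclidean distance, so an error of (4, 3) pixels, exactly 5 pixels away, still counts as correct.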

An Unscented Kalman Filter (UKF) is integrated into the keypoint detection pipeline to mitigate the effects of temporary inaccuracies and missing data resulting from self-occlusion or external obstructions. The UKF operates by predicting the next state of each keypoint based on a dynamic model, and then updating this prediction using the latest visual detection. This process leverages a sigma-point approach to approximate the probability distribution of the keypoint state, enabling more robust estimation than traditional Kalman filters, particularly in non-linear or high-dimensional spaces. The filter effectively smooths noisy detections and infers keypoint positions even when visual data is incomplete, thereby maintaining tracking consistency and improving overall system reliability.
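
The sigma-point machinery at the heart of the UKF can be sketched compactly. Below is a minimal unscented-transform prediction step for one keypoint under a constant-velocity model; the state layout, frame rate, and `kappa` scaling are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Generate 2n+1 sigma points and weights (basic unscented transform)."""
    n = mean.size
    S = np.linalg.cholesky((n + kappa) * cov)       # matrix square root
    pts = [mean] + [mean + S[:, i] for i in range(n)] \
                 + [mean - S[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

# Constant-velocity model for one keypoint: state = (x, y, vx, vy).
dt = 1.0 / 30.0                                     # assumed 30 Hz camera
F = np.eye(4); F[0, 2] = F[1, 3] = dt
mean = np.array([100.0, 50.0, 3.0, -1.5])
cov = np.eye(4)
pts, w = sigma_points(mean, cov)
# Propagate each sigma point through the dynamics, then re-average.
pred_pts = pts @ F.T
pred_mean = w @ pred_pts
print(pred_mean)  # position advanced by one step: x=100.1, y=49.95
```

In the full filter, the same sigma points also yield a predicted covariance, and the update step fuses the prediction with the latest visual detection; when a keypoint is occluded, the update is simply skipped and the prediction carries the track.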

Keypoint localization accuracy was quantified using a dataset of 55 keypoints, resulting in an average localization error of 1.04 pixels. This metric indicates a high degree of precision in the keypoint detection system. Furthermore, implementation of an Unscented Kalman Filter (UKF) demonstrably improved keypoint detection rates, particularly in scenarios involving partial or complete occlusion of keypoints. The UKF effectively mitigated the impact of these challenging conditions by leveraging prior state estimates and sensor noise modeling to refine keypoint positions and maintain robust tracking.

Keypoint prediction accuracy and visual stability under occlusion are improved by combining real-time detection with the Unscented Kalman Filter (UKF) and image inpainting, as demonstrated by comparing keypoint detections (blue) and corrections (red) to ground truth (green) in occluded and inpainted images.

The Illusion of Control: Adapting to an Unpredictable Reality

The robotic system leverages Adaptive Visual Servoing (AVS) as its primary control mechanism, enabling autonomous movement guided directly by visual feedback. This approach bypasses the need for a precise, pre-built model of the environment or the robot itself; instead, AVS continuously analyzes detected keypoints within the camera’s field of view. These keypoints, representing significant features of the target object, are not merely identified but actively corrected for distortions and occlusions, ensuring robust tracking even in challenging conditions. The corrected keypoint data then directly informs the robot’s motion commands, creating a closed-loop system where visual perception drives action and allows for real-time adjustments to maintain accurate and stable operation throughout the task.
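
One common way to realize such a model-free visual-servoing loop is to estimate the image Jacobian online, for example with a rank-one Broyden update, rather than deriving it from a calibrated model. The toy loop below sketches that idea on a linear system (the Broyden estimator and gains here are a generic illustration; the paper's exact adaptive law may differ):

```python
import numpy as np

def broyden_update(J, dq, ds, alpha=0.5):
    """Rank-one Broyden update of the estimated image Jacobian.

    J maps joint displacements to keypoint-feature displacements; dq is
    the last joint step, ds the observed feature change. A common
    model-free ingredient of adaptive visual servoing.
    """
    denom = dq @ dq
    if denom < 1e-12:
        return J
    return J + alpha * np.outer(ds - J @ dq, dq) / denom

# Toy linear servo loop: drive 3 feature values to a reachable goal.
rng = np.random.default_rng(2)
J_true = np.eye(3) + 0.3 * rng.normal(size=(3, 3))  # unknown true mapping
q = np.zeros(3)
s = J_true @ q
s_goal = J_true @ np.array([0.5, -0.3, 0.8])        # reachable by construction
J = np.eye(3)                                       # crude initial estimate
gain = 0.3
for _ in range(300):
    e = s_goal - s
    dq = gain * np.linalg.pinv(J) @ e               # IBVS-style control law
    s_new = J_true @ (q + dq)
    J = broyden_update(J, dq, s_new - s)            # refine from observation
    q, s = q + dq, s_new
print(np.linalg.norm(s_goal - s) < 1e-3)  # True: the feature error contracts
```

The point of the sketch is the feedback structure: the controller never consults a robot or camera model, only the observed relationship between commanded motion and resulting keypoint motion, which is what makes the scheme robust to calibration error.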

The system’s adaptability stems from its model-free control scheme, a design choice that liberates it from the limitations of pre-programmed expectations about the environment. Unlike traditional robotic control relying on precise, often unattainable, models of the operating space, this approach allows the robot to react and adjust to unforeseen changes in real-time. This characteristic proves particularly advantageous in dynamic environments – those subject to unpredictable movement, lighting shifts, or partial obstructions – where a rigid, model-dependent system would falter. By foregoing the need for an explicit environmental model, the control scheme achieves a notable degree of robustness, sustaining stable and accurate operation even when faced with significant disturbances or incomplete data. This inherent flexibility represents a substantial advancement in robotic vision, allowing for deployment in complex, real-world scenarios previously inaccessible to less adaptable systems.

The developed robotic vision system demonstrates markedly improved control performance, achieving stable and accurate operation even when faced with substantial occlusions in the visual field. This resilience stems from the synergistic integration of adaptive visual servoing, which dynamically adjusts robot motion based on available keypoints, and a robust inpainting algorithm that effectively reconstructs missing visual information. Rather than relying on a pre-defined model of the environment, the system continuously adapts to changing conditions, maintaining a lock on the target object even as it becomes partially obscured. Rigorous testing confirms the system’s ability to consistently track and manipulate objects despite significant interruptions to the visual input, showcasing its potential for deployment in complex and unpredictable real-world scenarios where reliable performance is paramount.

A resilient robotic vision system has been realized through the synergistic integration of three key technologies. Initially, inpainting techniques reconstruct obscured portions of the visual field, effectively mitigating the impact of partial occlusions. This reconstructed imagery then feeds into an accurate keypoint detection algorithm, pinpointing critical features with high precision. Finally, these detected keypoints are leveraged by an adaptive control scheme, enabling the robot to dynamically adjust its movements and maintain stable, reliable operation even in challenging and unpredictable environments. The resulting system demonstrates a marked improvement in robustness and performance, offering a significant advancement in robotic perception and control.

This pipeline adaptively controls visual servoing by leveraging keypoint predictions from completed (inpainted) images.

The pursuit of robust robotic manipulation, as detailed in this work, inherently embraces the inevitability of system failure. A perfectly predictable environment, devoid of occlusion or dynamic change, is a theoretical dead end. This research, by cleverly employing inpainting to reconstruct obscured keypoints, doesn’t prevent failure; it adapts to it. As Marvin Minsky observed, “You can’t solve problems using the same kind of thinking they were created with.” The presented pipeline, utilizing adaptive control and an unscented Kalman filter, exemplifies this principle. It doesn’t attempt to build a flawless system, but rather a resilient one capable of gracefully degrading in the face of real-world complexities, a system that, in its very adaptability, demonstrates a form of life.

The Horizon Recedes

This pursuit of markerless control, achieved through the clever marriage of keypoint detection and inpainting, offers a temporary reprieve from the tyranny of precise modeling. Yet every such advance merely shifts the locus of failure. The system doesn’t solve uncertainty; it absorbs it, redistributing the burden onto the robustness of the adaptive control and the ever-vigilant Unscented Kalman Filter. These filters, too, are prophecies written in code: promises of stability that will inevitably demand sacrifices in processing power, and ultimately, in real-time responsiveness.

The true challenge isn’t detecting keypoints, but understanding what those keypoints mean in the face of unpredictable environments. The pipeline, however elegant, remains brittle without a deeper integration of semantic understanding. Future work will inevitably focus on imbuing these systems with a rudimentary form of ‘common sense’ – a daunting task, as it requires translating the messy ambiguity of the real world into the rigid logic of algorithms.

One anticipates a proliferation of increasingly sophisticated inpainting techniques, attempting to fill not just visual gaps, but also gaps in knowledge. Yet order is merely a temporary cache between failures. The ecosystem will always find a way to introduce new, unforeseen perturbations. The goal, then, isn’t to eliminate chaos, but to design systems that can gracefully absorb it: to build not fortresses, but reeds that bend in the wind.


Original article: https://arxiv.org/pdf/2604.13309.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-17 03:07