Seeing is Manipulating: AI-Powered Vision for Smarter Robotics

Author: Denis Avetisyan


Researchers have developed a new system that uses artificial intelligence to intelligently choose the best viewpoints for robots, dramatically improving their ability to grasp and manipulate objects in 3D space.

The system demonstrates predictive capabilities for real-world actions, suggesting an ability to anticipate and potentially navigate dynamic environments throughout its operational lifespan.

VERM leverages the GPT-4o foundation model for task-adaptive viewpoint selection, enabling efficient and accurate 3D robotic manipulation through coarse-to-fine refinement and depth-aware perception.

Executing 3D manipulation tasks with robotic systems is often hampered by computational cost and occlusion arising from redundant multi-camera data. To address this, we introduce VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation, a novel framework that leverages the visual reasoning capabilities of foundation models, specifically GPT-4o, to synthesize task-adaptive viewpoints from existing 3D point clouds. This approach effectively filters irrelevant information, enabling faster training and inference while maintaining high accuracy in robotic manipulation. Could this virtual eye paradigm unlock new levels of efficiency and adaptability in complex robotic systems operating in real-world environments?


The Inevitable Decay of Static Systems

Conventional robotic systems often falter when navigating intricate three-dimensional spaces because of inherent difficulties in processing spatial information and anticipating the consequences of their actions. These machines typically rely on pre-programmed maps or painstakingly detailed sensor data to build a representation of their surroundings, a process that becomes computationally expensive and prone to error in dynamic or unstructured environments. Furthermore, predicting how a physical action will unfold – such as grasping an object or maneuvering around an obstacle – requires sophisticated modeling of physics and material properties, exceeding the capabilities of many existing robotic platforms. This limitation hinders their ability to adapt to unforeseen circumstances or perform tasks requiring nuanced manipulation and spatial reasoning, ultimately restricting their deployment in real-world applications beyond highly controlled settings.

Contemporary robotic systems, despite advances in processing power, frequently demonstrate a brittle quality when confronted with real-world scenarios. This inflexibility stems from a reliance on pre-programmed responses and limited capacity for generalization. While a robot might flawlessly execute a task within a highly structured laboratory setting, even minor deviations – an unexpected obstacle, altered lighting, or a slightly different object – can disrupt performance. The core issue isn’t necessarily a lack of computational ability, but rather a deficiency in the capacity to learn and adapt in real-time, continuously refining strategies based on sensory input and experience. Consequently, these systems struggle with the inherent ambiguity and dynamism of unstructured environments, highlighting the need for more robust and versatile approaches to robotic intelligence that prioritize adaptability over rigid adherence to pre-defined parameters.

VERM accurately predicts future actions within the RLBench environment, demonstrating effective anticipatory control.

The Emergence of Spatial Understanding

Large foundation models, such as GPT-4o, integrate spatial reasoning through extensive pre-training on diverse datasets encompassing visual and linguistic information. This allows the models to learn relationships between objects, their positions, and the actions required to manipulate them within an environment. Specifically, these models utilize transformer architectures to process and encode spatial data, enabling them to predict the consequences of actions and plan complex trajectories. The capacity to represent and reason about 3D spaces, object affordances, and physical constraints is achieved by processing data representing environments as sequences or graphs, effectively translating perceptual input into actionable insights for robotic systems. This eliminates the need for task-specific training data, allowing robots to generalize to novel scenarios and adapt to previously unseen environments.

Foundation models leverage extensive datasets to establish correlations between environmental inputs and effective actions, enabling predictive capabilities in complex scenarios. By processing multi-modal data – including visual, tactile, and proprioceptive information – these models construct internal representations of the environment and anticipate the consequences of different actions. This allows for the generation of action sequences designed to maximize task success, even in situations with incomplete or noisy data. The predictive accuracy is directly related to the scale and diversity of the training data, as well as the model’s capacity to generalize learned patterns to novel environments and objects. Consequently, these models don’t simply react to stimuli, but proactively estimate the optimal course of action based on their learned understanding of the world.

Effective robotic action predicated on natural language instruction necessitates a comprehensive understanding of spatial relationships and environmental context. Robots must accurately interpret linguistic cues referencing object locations, orientations, and distances to successfully execute commands. This requires processing not only the semantic meaning of the instruction but also its spatial implications; for example, distinguishing “place the red block on the blue block” from “place the red block next to the blue block”. Failure to robustly model spatial information leads to inaccurate action planning and execution, even with semantically correct language processing. Consequently, integrating advanced spatial reasoning capabilities is crucial for bridging the gap between human instruction and robotic behavior.
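
To make that distinction concrete, the sketch below turns a parsed spatial relation into a target placement position. It is purely illustrative and not part of VERM; the relation vocabulary, the offsets, and the helper name `target_position` are assumptions introduced here.

```python
import numpy as np

# Illustrative offsets for a few spatial relations. The vocabulary and the
# axis conventions are assumptions for this sketch, not part of VERM.
RELATION_OFFSETS = {
    "on":      np.array([0.0, 0.0, 1.0]),   # stack along +z
    "next to": np.array([1.0, 0.0, 0.0]),   # place alongside +x
    "behind":  np.array([0.0, 1.0, 0.0]),   # place along +y
}

def target_position(reference_center, reference_size, relation, moved_size, gap=0.01):
    """Compute where to place the moved object relative to a reference object.

    reference_center, reference_size, moved_size: (3,) arrays in metres.
    relation: one of the keys in RELATION_OFFSETS.
    gap: small clearance so the objects do not interpenetrate.
    """
    direction = RELATION_OFFSETS[relation]
    # Offset by half of each object's extent along the chosen axis, plus a gap.
    extent = 0.5 * (reference_size + moved_size) * np.abs(direction)
    return reference_center + direction * (np.linalg.norm(extent) + gap)

# "place the red block on the blue block" vs. "... next to the blue block"
blue_center = np.array([0.40, 0.10, 0.05])
block_size = np.array([0.04, 0.04, 0.04])
print(target_position(blue_center, block_size, "on", block_size))       # lifted above
print(target_position(blue_center, block_size, "next to", block_size))  # shifted sideways
```

Running it shows that "on" lifts the placement above the reference block while "next to" shifts it sideways: exactly the spatial difference a purely semantic reading of the instruction would miss.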

GPT-4o enables querying virtual camera poses through prompt-based instructions.

VERM: A Framework for Anticipatory Perception

The VERM framework utilizes GPT-4o to dynamically select optimal camera viewpoints during 3D robotic manipulation tasks. This task-adaptive view selection process moves beyond fixed camera perspectives, allowing the system to prioritize visual information most relevant to the action being planned. GPT-4o analyzes the current state of the environment and the desired manipulation goal to determine the most informative camera pose, effectively focusing the robot’s visual perception. This intelligent viewpoint control directly improves the accuracy and efficiency of action prediction, enabling more robust and adaptable robotic behavior compared to systems relying on static or pre-defined camera angles.
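
Since the paper's prompts are not reproduced here, the following is only a hedged sketch of what a prompt-based viewpoint query could look like: a compact scene summary is sent to GPT-4o, which replies with a virtual camera pose as JSON. The prompt wording, the JSON schema, and the helper `query_virtual_camera` are assumptions; the chat-completions call itself is standard OpenAI SDK usage.

```python
import json
from openai import OpenAI  # standard OpenAI Python SDK

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_virtual_camera(task_instruction, object_summaries):
    """Ask GPT-4o for a task-adaptive virtual camera pose.

    object_summaries: list of dicts like {"name": "red block", "center": [x, y, z]}
    extracted from the fused point cloud. The prompt text and JSON schema below
    are illustrative assumptions, not the prompts used by VERM.
    """
    prompt = (
        f"Task: {task_instruction}\n"
        f"Scene objects (workspace coordinates, metres): {json.dumps(object_summaries)}\n"
        "Propose one virtual camera pose that best reveals the task-relevant objects. "
        'Reply as JSON: {"position": [x, y, z], "look_at": [x, y, z]}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example call (hypothetical scene):
# pose = query_virtual_camera("put the red block on the blue block",
#                             [{"name": "red block", "center": [0.40, 0.10, 0.05]},
#                              {"name": "blue block", "center": [0.55, 0.20, 0.05]}])
```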

The VERM framework utilizes intelligent selection of virtual camera poses to improve robotic perception and environmental understanding. By dynamically adjusting the viewpoint from which the robot observes its surroundings, VERM ensures that relevant visual information is prioritized during action planning. This is achieved through GPT-4o, which analyzes the scene and determines optimal camera angles to maximize the visibility of key objects and features. This focused visual input facilitates more accurate object recognition, pose estimation, and scene interpretation, ultimately leading to improved performance in 3D robotic manipulation tasks. The system doesn’t rely on a fixed camera position but actively selects views that provide the most informative data for successful task completion.

The Depth-Aware Module within VERM integrates depth information, obtained from depth sensors, to enhance the accuracy and robustness of action planning for robotic manipulation. This module processes depth data to create a 3D understanding of the scene, enabling the system to more effectively estimate object poses, distances, and potential collisions. By incorporating depth, the framework moves beyond relying solely on RGB images, mitigating issues caused by lighting variations or occlusions and improving the robot’s ability to generalize to novel environments and object configurations. The resulting depth maps are utilized in subsequent stages of the framework, specifically within the coarse-to-fine adjustment process, to refine action predictions and ensure precise execution of manipulation tasks.
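
As a rough illustration of depth-aware perception from a chosen virtual viewpoint, the sketch below z-buffers a fused point cloud into a depth map with a pinhole camera model. It is a stand-in under simple assumptions (known intrinsics `K`, a 4x4 camera-to-world pose) rather than VERM's actual module.

```python
import numpy as np

def render_depth(points_world, cam_pose, K, height=128, width=128):
    """Project a point cloud into a depth map seen from a virtual camera.

    points_world: (N, 3) points in workspace coordinates.
    cam_pose: 4x4 camera-to-world transform for the virtual viewpoint.
    K: 3x3 pinhole intrinsics. A simple z-buffer keeps the nearest point per pixel.
    Illustrative stand-in for VERM's depth-aware rendering, not its code.
    """
    world_to_cam = np.linalg.inv(cam_pose)
    homog = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam_pts = (world_to_cam @ homog.T).T[:, :3]
    cam_pts = cam_pts[cam_pts[:, 2] > 1e-6]          # keep points in front of the camera

    pix = (K @ cam_pts.T).T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    z = cam_pts[:, 2]

    depth = np.full((height, width), np.inf)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        if zi < depth[vi, ui]:                        # keep the closest surface (z-buffer)
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0                      # mark empty pixels
    return depth

# Example: a flat patch of points 0.5 m in front of a camera at the origin.
K = np.array([[100., 0., 64.], [0., 100., 64.], [0., 0., 1.]])
pts = np.column_stack([np.random.uniform(-0.2, 0.2, (500, 2)), np.full(500, 0.5)])
print(render_depth(pts, np.eye(4), K).max())          # 0.5
```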

The Coarse-to-Fine Adjustment with Multi-Resolution Voxels (C2F-ARM) strategy employed by VERM refines action predictions through a two-stage process. Initially, a low-resolution voxel representation is utilized for rapid, global action planning. This is followed by a refinement stage leveraging higher-resolution voxel grids to precisely adjust the predicted actions, ensuring accurate execution. The multi-resolution approach balances computational efficiency with precision; the coarse representation accelerates initial planning, while the fine-grained adjustments enable accurate manipulation in complex environments. This method allows VERM to predict robotic actions with improved accuracy and granularity compared to single-resolution approaches.
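
The two-stage idea can be illustrated with plain voxel counting: a coarse grid localizes a region of interest, and a finer grid is then built only inside that region. The sketch below uses raw point density as a stand-in for the learned features the real system relies on, so treat it as a schematic of coarse-to-fine refinement, not the authors' implementation.

```python
import numpy as np

def voxelize(points, lo, hi, resolution):
    """Count points per voxel inside the axis-aligned box [lo, hi)."""
    size = (hi - lo) / resolution
    idx = np.floor((points - lo) / size).astype(int)
    keep = np.all((idx >= 0) & (idx < resolution), axis=1)
    grid = np.zeros((resolution,) * 3)
    np.add.at(grid, tuple(idx[keep].T), 1)
    return grid

def coarse_to_fine_center(points, lo, hi, coarse_res=16, fine_res=16):
    """Two-stage localization: pick the densest coarse voxel, then refine inside it."""
    coarse = voxelize(points, lo, hi, coarse_res)
    cidx = np.array(np.unravel_index(coarse.argmax(), coarse.shape))
    vsize = (hi - lo) / coarse_res
    sub_lo = lo + cidx * vsize                       # bounds of the winning coarse voxel
    sub_hi = sub_lo + vsize

    fine = voxelize(points, sub_lo, sub_hi, fine_res)
    fidx = np.array(np.unravel_index(fine.argmax(), fine.shape))
    return sub_lo + (fidx + 0.5) * (sub_hi - sub_lo) / fine_res

# Example: localize the densest cluster in a synthetic cloud.
rng = np.random.default_rng(0)
cloud = rng.normal(loc=[0.5, 0.2, 0.1], scale=0.02, size=(2000, 3))
print(coarse_to_fine_center(cloud, lo=np.array([0., -0.5, 0.]), hi=np.array([1., 0.5, 0.5])))
```

The coarse pass covers the whole workspace cheaply; the fine pass spends its resolution only where it matters, which is the trade-off the multi-resolution strategy exploits.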

Performance evaluations demonstrate that the VERM framework significantly improves computational efficiency. Specifically, VERM achieved a 1.89x speedup in training time and a 1.54x speedup in inference speed when benchmarked against the previous state-of-the-art method, RVT-2. These gains are attributed to the framework’s intelligent selection of task-adaptive views and optimized data processing pipeline, allowing for faster model convergence and reduced latency in action prediction. The speedup was measured using standardized datasets and hardware configurations to ensure a fair comparison.

The proposed VERM utilizes a policy network to govern its behavior.

Beyond Parallel Processing: A Comparative View

Recent advancements in robotic vision leverage multi-view transformers, exemplified by the RVT and RVT-2 architectures, to enhance environmental perception. These systems ingest visual data from multiple synchronized cameras, enabling a more comprehensive understanding of the robot’s surroundings than single-camera approaches. By processing these multiple viewpoints concurrently within the transformer network, RVT and RVT-2 can construct a richer feature representation of the scene. This simultaneous processing allows the system to infer spatial relationships and object characteristics with greater accuracy, contributing to improved performance in downstream tasks such as action prediction and robotic manipulation.
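
The sketch below gives a schematic of that joint processing in PyTorch: patches from several views share one embedding, receive a per-view tag, and are fused by a standard transformer encoder that attends across all views at once. Layer sizes and the module name `MultiViewEncoder` are arbitrary choices for illustration, not the published RVT architecture.

```python
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    """Toy multi-view fusion: shared patch embedding + joint self-attention."""
    def __init__(self, patch=16, dim=128, views=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # shared across views
        self.view_embed = nn.Parameter(torch.zeros(views, 1, dim))        # which-camera tag
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images):                  # images: (B, V, 3, H, W)
        b, v = images.shape[:2]
        tokens = []
        for i in range(v):
            t = self.embed(images[:, i])        # (B, dim, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
            tokens.append(t + self.view_embed[i])
        tokens = torch.cat(tokens, dim=1)       # all views in one token sequence
        return self.encoder(tokens)             # joint attention across views

# Four 128x128 views of the same scene, fused into one feature sequence.
feats = MultiViewEncoder()(torch.randn(2, 4, 3, 128, 128))
print(feats.shape)   # torch.Size([2, 256, 128])
```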

Multi-view transformers demonstrate improvements in action prediction by leveraging information from multiple camera perspectives concurrently. Traditional methods typically process visual data from a single viewpoint or require sequential analysis of multiple views, introducing computational bottlenecks and potential inaccuracies. These transformers, however, enable parallel processing of multi-view data, reducing latency and improving the model’s ability to infer actions. Benchmarks indicate a quantifiable increase in prediction accuracy – for example, VERM achieves 83.6% success, exceeding RVT-2’s performance by 1.4% – alongside gains in computational efficiency due to the parallelization capabilities of the transformer architecture.

PerAct diverges from multi-view transformer approaches like RVT by utilizing voxel maps to represent the environment. These voxel maps, which discretize 3D space into volumetric pixels, provide a complete scene representation to the Perceiver transformer architecture. The Perceiver, known for its efficient attention mechanism, processes this voxel-based scene understanding to predict future actions. This combination of voxel maps and Perceiver transformers allows PerAct to effectively encode spatial relationships and temporal dynamics for improved action prediction performance without relying on traditional frame-by-frame video processing.
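
In skeletal form, the pattern looks like the sketch below: the voxel grid is flattened into tokens and a small set of learned latents cross-attends to them, so attention cost scales with the latent count rather than the voxel count. The sizes and the class name `LatentVoxelReader` are illustrative assumptions, not PerAct's implementation.

```python
import torch
import torch.nn as nn

class LatentVoxelReader(nn.Module):
    """Perceiver-style read-out: fixed latents cross-attend to voxel tokens."""
    def __init__(self, voxel_feat=8, dim=128, num_latents=256):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.proj = nn.Linear(voxel_feat, dim)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, voxels):                        # voxels: (B, D, H, W, C)
        b = voxels.shape[0]
        tokens = self.proj(voxels.reshape(b, -1, voxels.shape[-1]))  # (B, D*H*W, dim)
        latents = self.latents.expand(b, -1, -1)
        out, _ = self.cross(latents, tokens, tokens)  # latents query the whole scene
        return out                                    # (B, num_latents, dim)

reader = LatentVoxelReader()
summary = reader(torch.randn(1, 32, 32, 32, 8))       # 32^3 feature grid -> 256 latents
print(summary.shape)                                  # torch.Size([1, 256, 128])
```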

Evaluations demonstrate that the VERM model achieves an average task success rate of 83.6%. This performance represents a 1.4% improvement over the RVT-2 model, indicating a measurable increase in efficacy. The reported success rate is an aggregate metric derived from testing VERM across a defined set of tasks, establishing a quantitative benchmark for its capabilities in action prediction and robotic manipulation.

Towards Adaptive Intelligence and Broader Impact

The convergence of foundation models and vision-based action prediction frameworks, such as VERM, marks a pivotal advancement in the pursuit of genuinely intelligent robotics. Traditionally, robots relied on meticulously programmed sequences for specific tasks; however, integrating large-scale, pre-trained models allows for a leap in generalization and adaptability. These models, trained on vast datasets of text and images, provide robots with a form of “common sense” reasoning, enabling them to anticipate consequences and plan actions in dynamic, real-world scenarios. By leveraging vision to interpret surroundings and foundation models to predict likely outcomes, systems like VERM move beyond simple reaction and toward proactive, goal-oriented behavior – a crucial step toward robots capable of independent operation and complex problem-solving in unstructured environments.

The convergence of foundation models and predictive robotics promises transformative advancements across diverse sectors. In manufacturing, these technologies could enable robots to perform intricate assembly tasks with greater precision and adaptability, responding dynamically to unforeseen variations. Within healthcare, robots equipped with these capabilities might assist in complex surgeries, provide personalized patient care, and automate repetitive tasks, freeing up medical professionals to focus on critical decision-making. Furthermore, the application of these systems extends to the realm of exploration, allowing robots to navigate challenging environments – from deep sea trenches to distant planets – with increased autonomy and efficiency, collecting data and performing tasks previously inaccessible to humans. This potential for increased capability suggests a future where robots are not merely automated tools, but intelligent partners capable of tackling some of the world’s most pressing challenges.

Continued development centers on enhancing the resilience, speed, and flexibility of these robotic systems, paving the way for solutions to increasingly intricate problems. Current research prioritizes strategies to improve performance in unpredictable environments and with imperfect data, focusing on techniques like transfer learning and reinforcement learning to allow robots to generalize skills across diverse scenarios. Increasing computational efficiency is also crucial, with efforts directed toward model compression and optimized algorithms to enable deployment on resource-constrained platforms. Ultimately, these advancements aim to move beyond controlled laboratory settings and equip robots with the capacity to reliably operate in real-world complexities, from dynamic manufacturing floors to challenging search-and-rescue operations.

The convergence of foundation models and vision-based action prediction frameworks promises a future where robots move beyond pre-programmed routines to engage with the world in a truly fluid manner. By leveraging the broad knowledge embedded within these large models, robots can begin to anticipate human intentions, understand complex environmental cues, and adapt their actions in real-time, fostering interactions that feel natural and intuitive. This isn’t simply about improved efficiency; it’s about creating machines capable of genuine collaboration, whether assisting in delicate surgical procedures, navigating dynamic disaster zones, or working alongside humans in complex manufacturing processes. Continued refinement of these combined techniques will ultimately unlock a new era of robotics, characterized by seamless integration into everyday life and a capacity for adaptable, intelligent behavior previously confined to the realm of science fiction.

Recent evaluations of VERM showcase a remarkable capacity for robotic task completion. When integrated with the Qwen2.5 foundation model, VERM achieves an 80.3% success rate in performing designated actions, while pairing it with Claude 3.5 Sonnet elevates performance to 81.2%. These figures represent a substantial advancement in the field of embodied AI, demonstrating the potential for large language models to effectively ground reasoning in visual perception and translate that understanding into successful physical actions. This level of performance suggests a viable pathway toward robots capable of navigating complex environments and executing tasks with a high degree of reliability, paving the way for broader applications across various industries.

The pursuit of robust robotic manipulation, as demonstrated by VERM, inherently acknowledges the inevitability of imperfect perception. Systems, even those leveraging powerful foundation models like GPT-4o, will encounter ambiguous data and unforeseen circumstances. This aligns with the notion that incidents are not failures, but integral steps toward maturity. As Alan Turing observed, “Sometimes people who are unkind are unkind because they are unkind to themselves.” This sentiment mirrors the system’s need to ‘learn’ from each imperfect viewpoint selection, refining its coarse-to-fine approach. VERM’s ability to adapt and improve through successive refinements isn’t about eliminating errors, but gracefully accommodating them within the medium of time and experience, ensuring the system ages with increasing resilience.

What Lies Ahead?

The VERM framework, while demonstrating a compelling initial step toward viewpoint-adaptive robotic manipulation, merely postpones the inevitable reckoning with systemic complexity. The reliance on foundation models – GPT-4o in this instance – introduces a dependency that is not a solution, but a transfer of computational burden. Each abstraction carries the weight of the past; the model’s reasoning, however efficient, remains opaque, a black box inheriting the biases and limitations of its training data. Future iterations must confront the question of interpretability – not to understand the reasoning, perhaps, but to anticipate its failures.

The coarse-to-fine refinement strategy is a pragmatic concession to computational reality, yet it highlights a fundamental truth: complete, instantaneous perception is an illusion. The system trades accuracy for speed, an exchange that will become increasingly untenable as tasks demand finer granularity and more complex interactions. The true challenge lies not in selecting the optimal viewpoint, but in designing systems capable of gracefully degrading performance when faced with incomplete or ambiguous data.

Ultimately, the longevity of this approach depends on its ability to embrace slow change. Robustness will not be found in ever-larger models or faster algorithms, but in architectures that prioritize resilience over efficiency. Only systems that anticipate their own decay, and plan for it, can hope to endure beyond the current wave of technological enthusiasm.


Original article: https://arxiv.org/pdf/2512.16724.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
