Author: Denis Avetisyan
Researchers have developed a new framework that allows robots to seamlessly integrate traditional grasping with more complex, non-grasping manipulation techniques for improved adaptability in real-world scenarios.

AdaptPNP leverages vision-language models and a digital twin to enable robots to plan and execute complex manipulation tasks requiring both prehensile and non-prehensile actions.
While robust robotic manipulation often relies on stable grasping, many real-world scenarios demand adaptability beyond prehensile actions. This need motivates AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation, a novel framework that synergistically combines grasping with non-prehensile manipulation—like pushing or sliding—through a vision-language model and digital twin. This approach enables robots to proactively plan and refine complex manipulation sequences, seamlessly transitioning between grasping and non-grasping actions based on environmental context. Could this hybrid approach represent a crucial step towards truly general-purpose, human-level robotic dexterity?
Beyond the Grip: Rethinking Robotic Manipulation
For decades, robotic manipulation has been largely defined by prehensility – the ability to grasp and hold onto objects. This approach, while effective for simple tasks in structured environments, fundamentally limits a robot’s adaptability. By prioritizing stable grasps, systems struggle with scenarios demanding nuanced interaction, such as re-orienting an object without releasing it, inserting a part into a tight space, or delicately adjusting a fragile item. This reliance on grasping creates a bottleneck, requiring robots to constantly cycle between grasping, re-positioning, and re-grasping – a process that is both inefficient and incapable of handling the variability present in real-world settings. Consequently, advancements in robotic dexterity require moving beyond this prehensile paradigm and embracing a broader range of manipulation strategies.
Successfully navigating real-world scenarios often requires more than simply picking up and moving objects; it demands a nuanced ability to manipulate them while they are in contact with surfaces or other objects. Consider tasks like assembling furniture, turning a key in a lock, or even peeling fruit – these actions aren’t defined by stable grasps, but by controlled slips, rotations, and forces applied without fully securing an object. This challenges traditional robotics, which frequently prioritizes firm prehension, and highlights the necessity for systems capable of exploiting contact dynamics – leveraging friction, momentum, and compliant motions – to perform complex manipulations. Such an approach allows for greater adaptability, reduced reliance on precise positioning, and the potential to handle a wider range of objects and tasks in unstructured environments.
The future of robotic dexterity hinges on moving beyond simple grasping actions and embracing a more versatile approach to object manipulation. Current robotic systems often struggle in unstructured environments precisely because they prioritize stable prehension – the firm holding of an object – over the dynamic interplay of forces needed for truly complex tasks. Researchers are now focused on developing unified frameworks that integrate both prehensile and non-prehensile manipulation primitives – actions like pushing, rolling, or scooping – allowing robots to strategically influence objects without necessarily maintaining a secure grip. This integration isn’t merely about adding new tools to a robot’s repertoire; it demands a fundamental shift in control algorithms, enabling seamless transitions between grasping and non-grasping strategies and fostering a more fluid, adaptable, and human-like interaction with the physical world. Ultimately, such a framework promises robots capable of tackling intricate assembly tasks, navigating cluttered spaces, and responding to unpredictable changes in their surroundings with greater efficiency and resilience.

Unifying Intelligence: AdaptPNP’s Framework for Action
AdaptPNP establishes a unified framework for task and motion planning by integrating high-level planning of prehensile and non-prehensile (P&NP) skills with low-level robotic execution. This integration eliminates the traditional separation between task planning, which defines what needs to be done, and motion planning, which determines how to achieve it. The framework enables the system to reason about both the sequence of skills required to fulfill a task and the specific robot motions needed to execute each skill, all within a single planning process. This unified approach facilitates more efficient and robust task completion by allowing for real-time adaptation and refinement of both the task plan and the robot’s movements based on environmental feedback and constraints.
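To make the idea of interleaved task and motion planning concrete, the following minimal Python sketch shows one plausible refinement loop in which a motion-level failure feeds straight back into task-level replanning. The `task_planner` and `motion_planner` callables, the `blocked` feedback field, and the control flow are illustrative assumptions, not AdaptPNP’s actual implementation.

```python
from typing import Any, Callable

def plan_and_refine(task_planner: Callable, motion_planner: Callable,
                    goal: Any, state: dict, max_iters: int = 10):
    """Unified task-and-motion planning sketch: symbolic steps and robot
    motions are refined together, so a motion-level failure immediately
    triggers task-level replanning instead of aborting the task."""
    for _ in range(max_iters):
        steps = task_planner(goal, state)          # candidate skill sequence
        trajectories = []
        for step in steps:
            traj = motion_planner(step, state)     # None if kinematically infeasible
            if traj is None:
                state = dict(state, blocked=step)  # report the failing step...
                break                              # ...and replan at the task level
            trajectories.append(traj)
        else:
            return steps, trajectories             # every step has a feasible motion
    raise RuntimeError("no feasible plan found within the iteration budget")
```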
The AdaptPNP framework employs a Vision-Language Model (VLM)-based Task Planner to decompose high-level goals into a sequence of executable robotic primitives. This planner accepts natural language instructions and processes scene understanding data to identify relevant objects and their affordances. It then generates plans consisting of both prehensile actions – those requiring grasping an object – and non-prehensile actions, such as pushing, sliding, or simply navigating to a location. The VLM facilitates the selection and ordering of these primitives, creating a task plan that specifies the desired sequence of robot actions to achieve the given objective. This allows the system to handle complex tasks requiring a combination of manipulation and motion without explicit, hand-coded task specifications.
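The sketch below illustrates what such a VLM-driven decomposition step could look like. The `Primitive` structure, the prompt format, and the `query_vlm` callable are hypothetical stand-ins; only the primitive vocabulary (Push, Rotate, Moveto, Release) follows the article, with Grasp added as an assumed prehensile counterpart.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Primitive:
    """One executable step in the task plan."""
    name: str      # e.g. "Grasp", "Push", "Rotate", "Moveto", "Release"
    target: str    # object or location the primitive acts on
    params: dict   # primitive-specific parameters (distances, angles, poses)

def plan_task(instruction: str,
              scene_objects: list,
              query_vlm: Callable) -> list:
    """Decompose a natural-language goal into an ordered primitive sequence.

    query_vlm stands in for the actual vision-language model call; the
    prompt and response schema here are illustrative assumptions."""
    prompt = (
        f"Task: {instruction}\n"
        f"Objects: {', '.join(scene_objects)}\n"
        "Reply with an ordered list of primitives "
        "(Grasp, Push, Rotate, Moveto, Release) as JSON."
    )
    return [Primitive(**step) for step in query_vlm(prompt)]

# Toy stand-in showing the expected shape of a plan for a blocked pick:
fake_vlm = lambda _prompt: [
    {"name": "Push",    "target": "box",  "params": {"dx": 0.05, "dy": 0.0}},
    {"name": "Grasp",   "target": "mug",  "params": {}},
    {"name": "Moveto",  "target": "tray", "params": {}},
    {"name": "Release", "target": "mug",  "params": {}},
]
plan = plan_task("put the mug on the tray", ["mug", "box", "tray"], fake_vlm)
```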
The AdaptPNP framework incorporates a Digital Twin as a core component for planning robustness. This Digital Twin functions as a simulated environment where proposed robotic primitives are virtually rehearsed before physical execution. Through simulation, the system generates predicted 6D Object Poses – representing an object’s position and orientation – for each primitive. These predicted poses are then used to evaluate the feasibility and potential outcomes of each action, allowing AdaptPNP to identify and mitigate potential failures before they occur in the real world. This rehearsal process facilitates robust planning by enabling proactive collision avoidance and ensuring successful task completion, even in complex or uncertain environments.
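A rough sketch of how a single primitive might be rehearsed in the digital twin before physical execution follows. Poses are stored as 4x4 homogeneous transforms, the `simulate_step` callable stands in for the twin’s physics, and the workspace bound is an arbitrary illustrative check; none of these details are taken from the paper.

```python
import numpy as np

def rehearse(primitive, twin_state: dict, simulate_step):
    """Virtually execute one primitive in the digital twin and check the result.

    twin_state maps object names to predicted 6D poses stored as 4x4
    homogeneous transforms; simulate_step stands in for the twin's physics."""
    predicted_state = simulate_step(dict(twin_state), primitive)
    # Crude feasibility check: reject the primitive if any predicted pose
    # leaves an assumed 1 m workspace (a real check would also test collisions).
    for name, pose in predicted_state.items():
        if not np.all(np.abs(pose[:3, 3]) < 1.0):
            return False, twin_state        # keep the previous state, flag failure
    return True, predicted_state            # accept the rehearsed outcome
```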
AdaptPNP employs the Planning Domain Definition Language (PDDL) to formally specify task requirements and environmental constraints. PDDL allows for the unambiguous definition of initial states, goals, preconditions, and effects of actions, enabling a robot to reason about task feasibility and generate valid plans. This symbolic representation facilitates automated planning by providing a standardized format for describing problems to planning algorithms and ensures that generated plans adhere to the specified constraints. The use of PDDL supports complex task decomposition and allows for the integration of various robotic skills within a unified planning framework, contributing to robust and reliable task execution.
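For readers unfamiliar with PDDL, the snippet below shows the general shape of such a specification, wrapped in Python for consistency with the other sketches. The `push` action, its predicates, and the applicability check are invented for illustration and are not the paper’s actual domain definition.

```python
# Illustrative PDDL-style action for a non-prehensile push; the predicate and
# parameter names are invented, not taken from AdaptPNP's domain file.
PUSH_ACTION = """
(:action push
  :parameters (?obj ?from ?to)
  :precondition (and (at ?obj ?from) (clear ?to) (handempty))
  :effect (and (not (at ?obj ?from)) (at ?obj ?to) (clear ?from)))
"""

def applicable(precondition: set, state: set) -> bool:
    """A grounded action is applicable when all its precondition facts hold."""
    return precondition <= state

state = {("at", "box", "table"), ("clear", "tray"), ("handempty",)}
push_pre = {("at", "box", "table"), ("clear", "tray"), ("handempty",)}
print(applicable(push_pre, state))  # True: the push can be scheduled
```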

Closing the Loop: Real-World Validation and Resilience
AdaptPNP’s Closed-Loop Reflection Mechanism functions by continuously evaluating the execution of the planned task and using the resulting feedback to adjust the plan in real-time. This iterative refinement process is critical for resolving multi-modal ambiguities, where multiple interpretations of sensory data are possible. By comparing predicted outcomes with observed results, the system identifies discrepancies and modifies the task plan to improve performance and ensure successful completion. This feedback loop allows AdaptPNP to adapt to unforeseen circumstances and correct errors during execution, significantly enhancing its robustness and reliability in dynamic environments.
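One way such a reflection loop could be structured is sketched below: after each primitive, the observed object pose is compared against the digital twin’s prediction, and the remaining plan is revised when the deviation exceeds a tolerance. The tolerances, the pose-error metric, and the `execute`/`observe`/`replan` interfaces are assumptions, not the published mechanism.

```python
import numpy as np

def pose_error(T_pred: np.ndarray, T_obs: np.ndarray):
    """Translation (m) and rotation (rad) error between two 4x4 poses."""
    dt = float(np.linalg.norm(T_pred[:3, 3] - T_obs[:3, 3]))
    R = T_pred[:3, :3].T @ T_obs[:3, :3]
    dr = float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))
    return dt, dr

def reflect(plan: list, execute, observe, replan, t_tol=0.02, r_tol=0.15):
    """Closed-loop reflection sketch: execute, compare prediction with
    observation, and revise the remaining plan when they diverge."""
    while plan:
        primitive = plan.pop(0)
        T_pred = execute(primitive)           # twin's predicted outcome pose
        T_obs = observe(primitive)            # perceived pose after execution
        dt, dr = pose_error(T_pred, T_obs)
        if dt > t_tol or dr > r_tol:          # deviation beyond tolerance
            plan = replan(primitive, T_obs)   # ask the planner for a revision
```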
AdaptPNP utilizes a Digital Twin, a virtual replica of the physical environment, enabled by the FoundationPose system. FoundationPose provides real-time six-dimensional (6D) pose estimates – representing both 3D position and orientation – for objects within the scene. These 6D pose estimations are critical for aligning the virtual and real environments, allowing the system to accurately map objects observed in the physical world to their corresponding representations in the simulation. This alignment is not merely for visualization; it’s fundamental to the closed-loop reflection mechanism, enabling the system to predict the outcome of actions in simulation before executing them in the real world and to correct for discrepancies between the simulated and real environments.
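Assuming each 6D estimate arrives as a translation plus a unit quaternion, aligning the twin amounts to rebuilding each object’s transform from the latest estimate, as in this sketch. The representation and the `sync_twin` interface are assumptions for illustration and do not reflect FoundationPose’s actual output format.

```python
import numpy as np

def pose_to_matrix(t, q) -> np.ndarray:
    """4x4 homogeneous transform from a translation and a unit quaternion (w, x, y, z)."""
    t = np.asarray(t, dtype=float)
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def sync_twin(twin_state: dict, estimates: dict) -> dict:
    """Overwrite each twin object's pose with its latest real-world estimate."""
    updated = dict(twin_state)
    for name, (t, q) in estimates.items():
        updated[name] = pose_to_matrix(t, q)
    return updated
```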
The AdaptPNP framework achieves robustness in manipulation tasks through continuous adaptation to unforeseen circumstances. By integrating real-time 6D pose estimation via FoundationPose and a closed-loop reflection mechanism, the system can dynamically revise its task plan based on execution feedback. This allows the framework to correct for discrepancies between the simulated and real environments, as well as respond to unexpected object positions or disturbances during manipulation. Experimental results demonstrate that removing the 6D pose representations causes significant performance degradation, while disabling the closed-loop reflection component leads to near-total task failure, confirming the critical role of both features in maintaining operational reliability.
AdaptPNP achieves state-of-the-art performance in robotic manipulation, consistently surpassing the success rates of established methods including Model Predictive Control (MPC), Proximal Policy Optimization (PPO), MoKA, and OpenVLA. This outperformance has been demonstrated across a benchmark of eight simulated tasks and four real-world tasks. Quantitative analysis reveals a significant reduction in performance when 6D pose representations are removed from the framework, and near-complete task failure when the closed-loop reflection mechanism is disabled, indicating that both components are critical to the system’s functionality and robustness.

Expanding the Horizon: Implications for a More Agile Robotics
AdaptPNP represents a significant advancement in robotic manipulation by unifying grasping – traditionally a core focus – with a broader range of interaction methods such as pushing, rotating, and making precise adjustments without grasping. This integration isn’t simply about adding more tools to a robot’s skillset; it’s about enabling a fundamentally more adaptable approach to handling objects. By treating these actions as interchangeable ‘primitives’, the system allows robots to dynamically select the most effective method for a given situation, even if that means foregoing a direct grasp. Consequently, robots can now address tasks requiring delicate adjustments, navigating cluttered scenes, or manipulating objects beyond their immediate reach, opening doors to more complex and nuanced interactions in real-world environments.
The capacity for robotic adaptability hinges on a fundamental building block: a diverse repertoire of basic actions, or primitives. Recent advancements demonstrate that equipping robots with primitives like Push, Rotate, Moveto, and Release transcends simple task execution; it enables a flexible response to unpredictable environments. These actions aren’t merely isolated movements, but foundational components that can be chained together to solve complex manipulation challenges. By mastering these core skills, a robot gains the ability to adjust its approach based on object characteristics, unforeseen obstacles, or changes in its surroundings, moving beyond pre-programmed sequences and towards genuine, dynamic problem-solving. This primitive-based approach allows for the creation of robust and versatile robotic systems capable of handling a wider array of tasks with greater efficiency and reliability.
The development of AdaptPNP signifies a crucial step toward deploying robots effectively in real-world, unstructured environments. Historically, robotic manipulation has been largely confined to highly controlled settings with predictable arrangements; however, this framework enables robots to navigate and interact with dynamic, cluttered spaces more effectively. By combining prehensile actions—like grasping—with non-prehensile maneuvers—such as pushing or rotating objects—robots can now solve tasks requiring greater adaptability and problem-solving skills. This capability unlocks a broad spectrum of assistive applications, ranging from collaborative manufacturing and logistics to in-home assistance for the elderly or individuals with disabilities, and even disaster relief operations where navigating unpredictable terrain and manipulating unfamiliar objects are paramount.
The AdaptPNP framework is poised for continued development, with ongoing research dedicated to bolstering its learning algorithms and broadening the scope of tasks it can effectively address. Future iterations will explore more sophisticated methods for robots to autonomously acquire and refine their manipulation skills, potentially leveraging techniques like reinforcement learning and imitation learning to achieve greater robustness and efficiency. This includes tackling tasks requiring intricate coordination, delicate force control, and real-time adaptation to unpredictable environmental changes. Ultimately, the goal is to move beyond pre-defined scenarios and enable robots to generalize their abilities to a truly diverse range of real-world applications, from complex assembly operations to collaborative human-robot workflows in dynamic and unstructured settings.
AdaptPNP, as detailed in the study, doesn’t simply program robotic action—it allows the robot to interpret a dynamic environment and react accordingly. This resonates with Barbara Liskov’s insight: “Programs must be correct with respect to their specification.” The framework achieves this ‘correctness’ not through rigid pre-programming, but through a digital twin that simulates potential actions, validating them against the desired outcome. The digital twin essentially ‘reads the code’ of the physical world, anticipating consequences before they occur. By integrating vision-language models, AdaptPNP creates a system where the robot doesn’t just do as instructed, but understands why and can adjust its approach, embodying a proactive, rather than reactive, methodology to manipulation.
Beyond the Grasp: Future Directions
The framework detailed within deliberately blurs the lines between traditional prehensile and non-prehensile manipulation, a useful disruption. However, the reliance on a digital twin, while currently practical, invites a critical question: how much of the ‘adaptation’ is merely sophisticated pre-programming within the simulation? True autonomy demands a system that gracefully degrades when the simulated world diverges from reality – and reality, predictably, always wins. The current approach feels like an elegant extension of known constraints, not a genuine escape from them.
Future work must address the brittleness inherent in any vision-language pipeline. Current models excel at recognizing what is, but struggle with predicting what could be, especially concerning unforeseen contact dynamics. Exploring methods to imbue the system with a sense of ‘physical intuition’ – perhaps through differentiable physics or reinforcement learning operating directly on raw sensor data – is crucial. The goal isn’t just to plan a sequence of actions, but to anticipate their consequences with a degree of robustness that approaches, however distantly, genuine understanding.
Ultimately, the true test of AdaptPNP, and systems like it, won’t be their ability to replicate human manipulation, but their capacity to discover entirely new forms of it. The challenge isn’t to build a robot that can grasp a cup, but one that can invent a better way to move it – a task requiring not just intelligence, but a willingness to dismantle the very assumptions upon which the system is built.
Original article: https://arxiv.org/pdf/2511.11052.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/