Author: Denis Avetisyan
Researchers have successfully combined the mobility of a quadruped robot with the dexterity of a robotic arm, enabling autonomous navigation and object manipulation in complex environments.

This work details a task-level control system achieving a 75% grasp success rate for a quadruped robot performing object manipulation tasks.
While quadruped robots excel at locomotion across challenging terrains, their utility remains limited without robust object manipulation capabilities. This paper, ‘Autonomous Grasping On Quadruped Robot With Task Level Interaction’, details the development of an integrated system enabling a quadruped robot to autonomously navigate, detect, and grasp objects through a task-level control framework. Demonstrating a 75% grasp success rate in real-world trials, the research showcases a functional mobile manipulation platform. Could this approach pave the way for more versatile and adaptable quadruped robots in complex service applications?
The Inevitable Complexity of Embodied Intelligence
The pursuit of robotic manipulation within real-world, unstructured environments presents a significant hurdle for the field. Unlike the controlled conditions of a factory floor, everyday spaces are filled with unpredictable layouts, varying lighting, and a diverse range of objects with differing shapes, sizes, and material properties. This complexity demands that robots not only possess the mechanical dexterity to grasp and move objects, but also the perceptual and cognitive abilities to identify, localize, and plan around obstacles – all while maintaining balance and stability. Successfully navigating these challenges is crucial for realizing the full potential of robots in assisting humans with tasks in homes, workplaces, and beyond, requiring advancements in areas such as computer vision, motion planning, and robust control systems.
Effective mobile manipulation demands more than simply attaching a robotic arm to a moving platform; it necessitates a seamless integration of locomotion and dexterity. A robust mobile base provides the ability to navigate complex, unstructured environments and approach tasks from optimal angles, while a dexterous arm enables intricate object interactions. The true challenge lies in coordinating these two systems – ensuring the base moves in a way that supports the arm’s manipulations, and vice versa. This tight integration allows for complex tasks like opening doors, assembling objects, or retrieving items from cluttered spaces, all while maintaining balance and stability. Without this coordination, even the most capable arm or base is limited in its ability to perform useful work in real-world scenarios.
To tackle the complexities of object manipulation within dynamic, real-world settings, this research introduces a tightly integrated robotic system composed of the Lite3 quadrupedal robot and the OpenManipulator-X robotic arm. This pairing offers a unique advantage by combining stable, adaptable locomotion with precise, dexterous manipulation capabilities. Lite3 provides a robust mobile base capable of navigating uneven terrain and maintaining balance during manipulation tasks, while the OpenManipulator-X extends the system’s reach and allows for intricate interactions with objects. The synergistic combination of these platforms enables the robot not only to reach a target object but also to grasp, lift, and reposition it effectively, opening possibilities for complex tasks in unstructured environments and establishing a foundational platform for advanced mobile manipulation research.

Mapping the Present: Spatial Awareness Through SLAM
Effective robotic manipulation is predicated on comprehensive environmental understanding, which is achieved through Simultaneous Localization and Mapping (SLAM) techniques. SLAM enables the creation of a map of the surroundings while concurrently determining the robot’s location within that map. This process involves the integration of sensor data, typically from lidar and cameras, with algorithms such as Kalman filters or particle filters to estimate both the robot’s pose ($x$, $y$, $\theta$) and the positions of landmarks or features in the environment. Accurate SLAM is crucial for path planning, obstacle avoidance, and ultimately, successful object manipulation, as it provides the necessary spatial information for the robot to interact with its surroundings.
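To make the pose notation concrete, the following minimal Python sketch implements the prediction step of a planar motion model, the kind of update a Kalman or particle filter applies to the pose $(x, y, \theta)$ between sensor corrections. The velocities and time step are illustrative values, not parameters from the paper.

```python
import numpy as np

def predict_pose(pose, v, omega, dt):
    """One prediction step of a planar motion model: the robot pose
    (x, y, theta) is advanced using linear velocity v and angular
    velocity omega over a time step dt."""
    x, y, theta = pose
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta = (theta + omega * dt + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return np.array([x, y, theta])

# Example: drive forward at 0.5 m/s while turning at 0.1 rad/s for 0.1 s
pose = np.array([0.0, 0.0, 0.0])
pose = predict_pose(pose, v=0.5, omega=0.1, dt=0.1)
print(pose)
```

In a full filter this prediction would be paired with a correction step that fuses lidar or camera observations of landmarks, which is where the mapping half of SLAM enters.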
The robot utilizes the `hdl_graph_slam` package for creating a consistent map of its surroundings while simultaneously determining its location within that map. `hdl_graph_slam` employs a graph-based approach to SLAM, representing the environment as a network of nodes connected by edges that reflect spatial relationships. This allows for robust mapping, even in environments with loops or limited visibility. The `hdl_localization` package then leverages this generated map to accurately estimate the robot’s pose (its position and orientation) within the known environment, providing the necessary spatial awareness for navigation and manipulation tasks. Both packages are designed for real-time operation and integration with sensor data, primarily from LiDAR devices.
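A minimal sketch of how a downstream ROS node might consume the localization output is shown below. The `/odom` topic name and `nav_msgs/Odometry` message type are assumptions; the actual topic depends on how `hdl_localization` is configured in the launch files.

```python
#!/usr/bin/env python
# Minimal ROS node consuming the pose estimate from a localization
# stack such as hdl_localization. The topic name is an assumption;
# check the actual launch configuration.
import rospy
from nav_msgs.msg import Odometry
from tf.transformations import euler_from_quaternion

def pose_callback(msg):
    p = msg.pose.pose.position
    q = msg.pose.pose.orientation
    _, _, yaw = euler_from_quaternion([q.x, q.y, q.z, q.w])
    rospy.loginfo("pose: x=%.2f y=%.2f yaw=%.2f rad", p.x, p.y, yaw)

if __name__ == "__main__":
    rospy.init_node("pose_listener")
    rospy.Subscriber("/odom", Odometry, pose_callback)  # assumed topic name
    rospy.spin()
```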
The robot utilizes real-time object detection capabilities to perceive and interact with its surroundings. This is achieved through the integration of the `YOLOv8n` neural network and the `Object Detection` software package. `YOLOv8n` performs the core task of identifying objects within camera images, outputting bounding box coordinates and associated confidence scores. The `Object Detection` package then processes these outputs, filtering results based on confidence thresholds and providing the robot with information regarding the location and classification of detected objects in its operational environment.
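A minimal sketch of this detection step using the `ultralytics` Python API is shown below; the weights file, input frame, and 0.5 confidence threshold are illustrative choices, not values reported in the paper.

```python
# Run YOLOv8n inference on a single camera frame and print the
# detections that survive the confidence threshold.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # pretrained nano model
results = model("frame.jpg", conf=0.5)  # illustrative threshold

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name}: conf={float(box.conf):.2f}, "
          f"bbox=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```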

From Perception to Action: Orchestrating the Grasp
The system employs GraspNet to generate a set of potential grasp poses for each detected object. This process prioritizes both grasp stability (ensuring the object can be securely held) and accessibility, which considers the robot’s ability to physically reach the proposed grasp location. GraspNet utilizes a probabilistic model trained on a large dataset of 3D objects and grasp annotations to predict feasible, high-quality grasp poses. The output is a set of 3D poses, each defining the position and orientation of the robotic gripper relative to the object; these candidates are subsequently evaluated and refined by the grasp planning module.
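The paper does not expose GraspNet’s internal data structures, but a hypothetical representation of its output, a scored set of gripper poses filtered for reachability, might look like the following sketch; the `reach_limit` value is an assumed workspace bound, not a figure from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GraspCandidate:
    position: np.ndarray   # gripper position in the robot frame (x, y, z)
    rotation: np.ndarray   # 3x3 gripper orientation matrix
    score: float           # network-predicted grasp quality

def select_grasps(candidates, reach_limit=0.35, top_k=5):
    """Keep candidates the arm can plausibly reach, ranked by predicted
    quality. reach_limit approximates the arm's workspace radius in
    meters and is an assumption, not a value from the paper."""
    reachable = [g for g in candidates
                 if np.linalg.norm(g.position) < reach_limit]
    return sorted(reachable, key=lambda g: g.score, reverse=True)[:top_k]
```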
Grasp planning refines initial grasp candidates generated by GraspNet by evaluating object geometry and potential collisions. This process involves analyzing the 3D model of the object to ensure the proposed grasp pose does not intersect with the object itself, and subsequently checking for collisions with the environment. Collision detection utilizes bounding volume hierarchies to improve computational efficiency, and grasp poses are iteratively adjusted to maximize stability and accessibility while minimizing the risk of interference. The refined set of grasp candidates then serves as input for the motion planning stage.
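The cheap test at the heart of a bounding volume hierarchy is the axis-aligned bounding box (AABB) overlap check, sketched below with illustrative geometry; only candidates that pass this coarse filter would proceed to exact collision checking.

```python
import numpy as np

def aabb_overlap(min_a, max_a, min_b, max_b):
    """Axis-aligned bounding boxes overlap iff their intervals overlap
    on every axis -- the cheap test applied at each node of a bounding
    volume hierarchy before any exact geometry check."""
    return bool(np.all(min_a <= max_b) and np.all(min_b <= max_a))

# Example: gripper sweep volume vs. an obstacle box (meters, robot frame;
# coordinates are illustrative)
gripper_min = np.array([0.20, -0.05, 0.00])
gripper_max = np.array([0.35,  0.05, 0.12])
obstacle_min = np.array([0.30, 0.02, 0.00])
obstacle_max = np.array([0.40, 0.20, 0.30])
print(aabb_overlap(gripper_min, gripper_max,
                   obstacle_min, obstacle_max))  # True: prune this grasp
```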
The robotic system employs motion planning to generate collision-free trajectories for the robotic arm, enabling it to reach and grasp target objects once a grasp pose has been selected. Across twelve real-world trials, this pipeline achieved a 75% grasp success rate, indicating effective semi-autonomous task-level control. Quantitative analysis of approach accuracy revealed final position errors ranging from 24 to 27 centimeters along the X-axis and 5 to 16 centimeters along the Y-axis, and post-approach measurements showed the arm stopping between 24.52 and 33.42 centimeters from the target object.
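The paper does not name its motion-planning stack; assuming a MoveIt-style interface, commanding a collision-free reach to a selected grasp pose might look like the following sketch, where the planning group name and target pose are hypothetical.

```python
# Sketch of commanding a collision-free reach with moveit_commander.
# MoveIt and the planning group name "arm" are assumptions for
# illustration, not details from the paper.
import sys
import rospy
import moveit_commander
import geometry_msgs.msg

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_motion")
arm = moveit_commander.MoveGroupCommander("arm")  # assumed planning group

target = geometry_msgs.msg.Pose()
target.position.x, target.position.y, target.position.z = 0.30, 0.00, 0.15
target.orientation.w = 1.0  # identity orientation, illustrative

arm.set_pose_target(target)
success = arm.go(wait=True)  # plan and execute; returns False on failure
arm.stop()
arm.clear_pose_targets()
```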

Towards Intuitive Agency: Bridging the Gap Between Intent and Action
A Robot Operating System (ROS) WebSocket Bridge forms the core of the system’s control interface, allowing users to interact with the robot through high-level task commands rather than complex low-level motor controls. This bridge establishes a real-time communication pathway, translating user instructions – such as “navigate to the kitchen” or “retrieve the blue object” – into actionable parameters for the robot’s motion planning and manipulation algorithms. By abstracting away the intricacies of robotic control, the system prioritizes user experience and enables intuitive operation, even for individuals without specialized robotics expertise. The architecture facilitates a streamlined workflow, allowing users to focus on what the robot should achieve, rather than how it should achieve it, paving the way for more sophisticated and adaptable robotic applications.
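On the client side, issuing such a task-level command over a rosbridge WebSocket can be sketched with the `roslibpy` library; the `/task_command` topic and plain-string payload below are assumptions about the system’s interface, not details from the paper.

```python
# Publish a high-level task command over a rosbridge WebSocket.
# Topic name and payload format are assumptions for illustration.
import roslibpy

client = roslibpy.Ros(host="localhost", port=9090)  # default rosbridge port
client.run()  # blocks until connected

task = roslibpy.Topic(client, "/task_command", "std_msgs/String")
task.publish(roslibpy.Message({"data": "grasp bottle"}))  # intent, not motor commands

client.terminate()
```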
The system’s interface prioritizes operational simplicity, moving beyond complex robotic controls to enable users to assign tasks with unprecedented flexibility. This design streamlines interaction, allowing for quick adaptation to changing environments and demands without requiring specialized expertise. Consequently, the robot isn’t limited to pre-programmed sequences; it dynamically responds to high-level instructions, facilitating a broader range of applications and enhancing its utility in diverse scenarios. The resulting ease of use lowers the barrier to entry for robotic automation, potentially extending its benefits to new users and industries.
Future development anticipates a significant leap in robotic control through the integration of Large Language Models (LLMs). This approach moves beyond traditional programming interfaces, envisioning a system where the robot responds directly to natural language instructions. By leveraging the power of LLMs, the robot will be capable of interpreting complex commands, understanding context, and adapting to unforeseen circumstances, effectively translating human intent into physical action. Such integration promises not only a more intuitive user experience, but also a substantial expansion of the robot’s operational scope, allowing it to tackle a wider range of tasks with minimal direct supervision and opening possibilities for truly collaborative human-robot interaction.
The pursuit of autonomous systems, as demonstrated by this integration of quadrupedal locomotion and robotic manipulation, inevitably confronts the realities of entropy. While a 75% grasp success rate represents a significant achievement, it inherently acknowledges the existence of failure – a temporary state in the system’s ongoing decay. This echoes Linus Torvalds’ observation: “Talk is cheap. Show me the code.” The demonstrable functionality, the working system, is what matters, not theoretical perfection. The code, and by extension, the robot’s actions, reveal the system’s present state, knowing full well that even successful grasps contribute to the eventual wear and tear, the accumulating latency inherent in all physical processes. The system doesn’t strive for immortal stability, but for graceful degradation, adapting and continuing operation within the bounds of its inevitable decline.
What Lies Ahead?
The successful integration of locomotion and manipulation, as demonstrated in this work, isn’t an arrival, but a version update. A 75% grasp success rate suggests a system still negotiating with entropy: a high watermark, certainly, but one acutely aware of its own eventual decay. The true challenge isn’t simply doing a task, but the graceful handling of inevitable failure states. Each grasp, each navigation correction, is a temporary stay against the arrow of time, pointing always toward refactoring, adaptation, and ultimately, redesign.
Future iterations will likely focus less on isolated success and more on the metadata of failure. Understanding why a grasp fails – the subtle interplay of perception error, dynamic instability, and environmental uncertainty – will prove more valuable than incrementally improving success percentages. The system’s memory – its ability to learn from these negative examples – will define its longevity. A truly robust system won’t merely react to the present; it will anticipate the future based on the ghosts of attempts past.
The pursuit of task-level control is, at its core, a quest to abstract away complexity. But abstraction is never perfect. The inherent messiness of the real world will always seep through the carefully constructed layers of software and hardware. The next generation of quadrupedal manipulators will need to embrace this imperfection, not fight against it, acknowledging that elegance often resides in the artful management of controlled instability.
Original article: https://arxiv.org/pdf/2512.01052.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/