Author: Denis Avetisyan
A new system combines large language models with robotic platforms, allowing robots to understand instructions, map environments, and autonomously locate and grasp objects.

This work presents an integrated robotic system leveraging large language models for agentic exploration, semantic mapping, and skill orchestration within ROS.
Despite advances in robotics, enabling robots to autonomously navigate complex environments and fulfill high-level requests remains a significant challenge. This paper, ‘LLM-Based Agentic Exploration for Robot Navigation & Manipulation with Skill Orchestration’, presents an integrated system where large language models drive agentic exploration, semantic mapping, and skill orchestration for indoor shopping tasks. By combining LLM-generated plans with a ROS-based robotic framework and AprilTag localization, the robot successfully executes multi-store navigation and object retrieval from natural language instructions. Could this approach pave the way for more intuitive and adaptable robotic assistants in real-world scenarios?
The Algorithmic Imperative: Navigating Unstructured Environments
For robotic agents venturing into real-world tasks like shopping, simple movement isn’t enough; they require sophisticated exploration strategies to navigate unpredictable environments and achieve complex goals. Unlike pre-programmed routines, effective exploration demands adaptability, allowing the agent to actively seek out relevant information and overcome unforeseen obstacles. This necessitates moving beyond basic path planning and incorporating methods that prioritize information gain – essentially, the agent must intelligently decide where to look next to maximize its understanding of the surroundings and efficiently locate desired items. A robust exploration strategy isn’t merely about covering ground, but about building a semantic map of the environment, recognizing objects, and anticipating potential challenges – a crucial step towards true autonomous operation in human-centered spaces.
Conventional robotic exploration strategies often falter when confronted with the unpredictable nature of real-world settings. These methods, frequently reliant on pre-programmed maps or meticulously defined routes, struggle to adapt to moving obstacles, unexpected changes in layout, or the introduction of novel objects. Furthermore, humans rarely issue precise, step-by-step instructions; instead, commands tend to be ambiguous, relying on shared context and the ability to infer intent. A robot equipped with only traditional algorithms, therefore, faces a significant challenge in deciphering vague requests like “find something for dinner” or “go to the living room,” requiring a leap beyond simple spatial awareness towards a more nuanced understanding of language and context to successfully operate in a dynamic and often uncertain world.
Effective navigation for robotic agents transcends simple path planning, demanding instead a nuanced semantic understanding of the surrounding environment. A robot doesn’t merely register obstacles; it must interpret what those obstacles are – a table indicating a dining area, a shelf suggesting a storage space, or a doorway leading to another room. This requires integrating visual data with knowledge about object affordances – what actions can be performed with or around them. Consequently, successful agents leverage this understanding to infer goals, predict outcomes, and adapt to unforeseen circumstances, moving beyond pre-programmed routes to dynamically construct navigable maps based on meaning rather than mere geometry. This semantic awareness allows for flexible problem-solving, enabling robots to accomplish complex tasks in dynamic, real-world settings where rigid adherence to a pre-defined path would inevitably fail.

Constructing a Semantic Framework: Visual Landmarks and Spatial Reasoning
The system constructs a navigable map by identifying and utilizing junction signboards as primary visual landmarks. These signboards, containing directional information such as road names and arrow indicators, are detected through image processing. The identified signboards are then incorporated into the map as nodes, with directional data defining the edges connecting these nodes. This approach allows the system to represent the environment’s connectivity and facilitate path planning based on observable signage, enabling autonomous navigation and location awareness without relying on pre-existing maps or GPS signals.
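A minimal sketch of how such a signboard graph might be assembled is shown below. The node fields, direction labels, and destination names are illustrative assumptions, not the paper's actual data structures; the idea is simply that each junction signboard becomes a node and each arrow icon becomes a directed edge toward the place it points to.

```python
from dataclasses import dataclass, field

@dataclass
class SignboardNode:
    tag_id: int                                 # AprilTag ID observed at this junction
    position: tuple                             # (x, y) in the map frame
    edges: dict = field(default_factory=dict)   # direction label -> destination label

class SignboardMap:
    """Toy graph of junction signboards and the directions printed on them."""

    def __init__(self):
        self.nodes = {}

    def add_signboard(self, tag_id, position):
        self.nodes[tag_id] = SignboardNode(tag_id, position)

    def add_direction(self, tag_id, direction, destination):
        # e.g. add_direction(3, "left", "grocery_store")
        self.nodes[tag_id].edges[direction] = destination

    def destinations_from(self, tag_id):
        return self.nodes[tag_id].edges

# Usage: a tiny two-junction map.
m = SignboardMap()
m.add_signboard(3, (1.2, 0.0))
m.add_direction(3, "left", "grocery_store")
m.add_direction(3, "straight", "junction_7")
print(m.destinations_from(3))
```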
AprilTag markers are fiducial markers utilized for robust and accurate localization within the robotic system. These markers, characterized by a unique binary encoding, provide a known reference point detectable under varying lighting and viewing angles. The system employs these markers as stable anchor points, allowing for precise pose estimation – determining the robot’s position and orientation – through computer vision algorithms. Unlike natural features, AprilTags remain consistent across sessions and are computationally efficient to detect, contributing to the reliability of the mapping and navigation processes. Their detection facilitates accurate data association between sensor readings and the global coordinate frame.
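As a rough illustration of tag-based pose estimation, the sketch below uses the pupil_apriltags Python bindings together with OpenCV; the camera intrinsics and tag size are placeholder values, and the library choice is an assumption rather than the paper's stated implementation.

```python
import cv2
from pupil_apriltags import Detector  # assumed binding; any AprilTag detector works similarly

# Placeholder intrinsics and tag size -- replace with the real calibration.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
TAG_SIZE_M = 0.16

detector = Detector(families="tag36h11")

def localize_from_tags(frame_bgr):
    """Return (tag_id, R, t) giving the camera pose relative to each detected tag."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detections = detector.detect(
        gray,
        estimate_tag_pose=True,
        camera_params=(FX, FY, CX, CY),
        tag_size=TAG_SIZE_M,
    )
    poses = []
    for det in detections:
        R, t = det.pose_R, det.pose_t     # pose of the tag in the camera frame
        # Inverting gives the camera (robot) pose relative to the known tag anchor.
        poses.append((det.tag_id, R.T, -R.T @ t))
    return poses
```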
YOLO Object Detection serves as the primary mechanism for identifying junction signboards within the captured visual data. This real-time object detection system analyzes images to locate and classify instances of relevant signage. Upon detection, the system extracts and stores graphical icons present on each signboard, associating them with the signboard’s spatial location as determined by AprilTag localization. These stored icons, representing directional information, collectively form the foundational elements of the robot’s environmental representation, enabling it to build and navigate a semantic map based on visually identified landmarks and their associated directional cues.
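A detection sketch along these lines, written against the Ultralytics YOLO API, is shown below; the custom weights file and the "signboard" class label are assumptions introduced for illustration.

```python
from ultralytics import YOLO

# Hypothetical weights fine-tuned on junction signboards.
model = YOLO("signboard_yolo.pt")

def extract_signboard_crops(frame_bgr):
    """Detect signboards in a frame and return their image crops for icon parsing."""
    results = model(frame_bgr, verbose=False)
    crops = []
    for box in results[0].boxes:
        cls_name = model.names[int(box.cls[0])]
        if cls_name != "signboard":              # assumed class label
            continue
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crops.append(frame_bgr[y1:y2, x1:x2])    # crop holding the directional icons
    return crops
```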

Orchestrating Perception and Action: The Robotic Infrastructure
The Robot Operating System (ROS) Stack functions as the primary integration framework for the robotic system, facilitating communication and data exchange between distinct functional modules. Specifically, the ORB-SLAM3 visual odometry system provides perceptual input, which is then used by the localization component to estimate the robot’s pose within its environment. This localized pose data is subsequently fed into the control components, enabling autonomous navigation and manipulation. The ROS Stack utilizes a publish-subscribe messaging system, allowing these modules to operate asynchronously and independently while maintaining a coherent and responsive overall system behavior. This modular architecture promotes code reusability, simplifies debugging, and enables future expansion with additional functionalities.
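The publish-subscribe pattern at the heart of this stack can be summarized with a minimal rospy node; the topic names and the trivial control rule below are illustrative assumptions, not the system's actual interfaces.

```python
#!/usr/bin/env python
import rospy
from geometry_msgs.msg import PoseStamped, Twist

class NavBridge:
    """Illustrative node: subscribes to a localization pose, publishes velocity commands."""

    def __init__(self):
        rospy.init_node("nav_bridge")
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
        rospy.Subscriber("/orb_slam3/pose", PoseStamped, self.on_pose)  # assumed topic name

    def on_pose(self, msg):
        # React asynchronously to each new pose estimate from the SLAM module.
        cmd = Twist()
        cmd.linear.x = 0.1 if msg.pose.position.x < 2.0 else 0.0
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    NavBridge()
    rospy.spin()
```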
The robot’s navigation and task execution are managed by a Finite-State MainController, which decomposes complex operations into discrete, executable behaviors. This controller utilizes a state machine architecture to sequentially activate modules responsible for specific actions, including wall avoidance, obstacle negotiation, and store entry procedures. Transitions between states are triggered by sensor data and task completion signals, enabling dynamic adaptation to changing environmental conditions. The modular design allows for the easy addition or modification of behaviors without impacting the overall system functionality, promoting robust and efficient movement through the target environment. This hierarchical control system prioritizes safe operation by continuously evaluating sensor input and adjusting behavior accordingly.
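A toy version of such a state machine, with state names borrowed from the behaviors listed above (the transition rules themselves are assumptions), might look like this:

```python
from enum import Enum, auto

class State(Enum):
    EXPLORE = auto()
    AVOID_WALL = auto()
    AVOID_OBSTACLE = auto()
    ENTER_STORE = auto()

class MainController:
    """Toy finite-state controller: transitions are driven by boolean sensor flags."""

    def __init__(self):
        self.state = State.EXPLORE

    def step(self, near_wall, obstacle_ahead, store_detected):
        if self.state == State.EXPLORE:
            if near_wall:
                self.state = State.AVOID_WALL
            elif obstacle_ahead:
                self.state = State.AVOID_OBSTACLE
            elif store_detected:
                self.state = State.ENTER_STORE
        elif self.state in (State.AVOID_WALL, State.AVOID_OBSTACLE):
            if not (near_wall or obstacle_ahead):
                self.state = State.EXPLORE
        elif self.state == State.ENTER_STORE:
            pass  # hand off to the store-level entry behavior
        return self.state

# One control tick with simulated sensor flags.
ctrl = MainController()
print(ctrl.step(near_wall=False, obstacle_ahead=True, store_detected=False))
```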
The Local Costmap is a grid-based representation of the robot’s immediate surroundings, updated continuously with data from onboard sensors. Each cell within the costmap contains a value representing the estimated obstacle density; higher values indicate a greater probability of collision. This dynamic map facilitates real-time path planning by allowing the robot’s navigation algorithms to identify traversable space and avoid obstacles. The costmap is not a static model of the environment, but rather a probabilistic assessment that accounts for sensor uncertainty and allows the robot to react to changes in its surroundings, enabling robust collision avoidance even in dynamic environments. The resolution and size of the costmap are configurable parameters, balancing computational cost with the level of detail required for safe navigation.
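A bare-bones robot-centred costmap can convey the idea; the grid size, resolution, decay factor, and update increments below are placeholder parameters rather than the system's tuned values.

```python
import numpy as np

class LocalCostmap:
    """Robot-centred occupancy grid; cell values approximate obstacle likelihood."""

    def __init__(self, size_m=4.0, resolution=0.05, decay=0.9):
        self.res = resolution
        self.n = int(size_m / resolution)
        self.grid = np.zeros((self.n, self.n), dtype=np.float32)
        self.decay = decay  # forget stale obstacles so the map stays dynamic

    def update(self, points_robot_frame):
        """points_robot_frame: iterable of (x, y) obstacle hits in metres."""
        self.grid *= self.decay
        half = self.n // 2
        for x, y in points_robot_frame:
            i = int(round(x / self.res)) + half
            j = int(round(y / self.res)) + half
            if 0 <= i < self.n and 0 <= j < self.n:
                self.grid[i, j] = min(1.0, self.grid[i, j] + 0.5)

    def is_free(self, x, y, threshold=0.5):
        half = self.n // 2
        i = int(round(x / self.res)) + half
        j = int(round(y / self.res)) + half
        return self.grid[i, j] < threshold

cm = LocalCostmap()
cm.update([(0.5, 0.0), (0.5, 0.05)])   # a wall segment half a metre ahead
print(cm.is_free(0.5, 0.0), cm.is_free(1.0, 0.0))
```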
The Grasping Controller facilitates robotic interaction with the environment through precise object manipulation. This is achieved via a layered control architecture incorporating force/torque sensing and visual servoing. The controller computes a sequence of motions (approach, grasp, lift, and place) based on object pose estimation from perception systems. It employs admittance control to regulate contact forces during interaction, mitigating potential damage to both the robot and manipulated objects. The controller supports multiple grasp types, selectable based on object geometry and task requirements, and utilizes a Jacobian-based inverse kinematics solver to translate desired end-effector poses into joint commands.
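The phase sequence can be sketched as below. This is a simplification: the `arm` and `gripper` driver objects, their method names, and the single force threshold standing in for full admittance control are all assumptions made for illustration.

```python
import numpy as np

def grasp_sequence(arm, gripper, object_pose, place_pose, max_force_n=15.0):
    """Sketch of the approach -> grasp -> lift -> place phases.

    `object_pose` and `place_pose` are np.array([x, y, z]) positions in the base frame.
    `arm` and `gripper` are placeholder drivers assumed to expose
    move_to(pose), measured_force(), close(), and open().
    """
    approach = object_pose.copy()
    approach[2] += 0.10                 # hover 10 cm above the object
    arm.move_to(approach)

    arm.move_to(object_pose)            # descend onto the grasp pose
    if arm.measured_force() > max_force_n:
        arm.move_to(approach)           # back off instead of pressing harder
        return False

    gripper.close()
    arm.move_to(approach)               # lift
    arm.move_to(place_pose)             # transport
    gripper.open()                      # release at the target
    return True
```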

From Intent to Action: The Logic of Autonomous Decision-Making
The system’s core functionality resides within the LLM Decision Layer, a sophisticated component designed to bridge the gap between human intention and robotic execution. This layer receives goals articulated in natural language – such as “fetch the blue block” or “explore the living room” – and decomposes them into a sequence of actionable commands for the robot. Rather than relying on pre-programmed routines, the LLM utilizes its understanding of language and context to formulate a plan, effectively converting abstract objectives into concrete motor instructions. This process allows for a level of flexibility and adaptability previously unattainable, enabling the robot to respond to novel requests and navigate dynamic environments with greater autonomy. The result is a system that doesn’t merely react to instructions, but actively interprets them, paving the way for more intuitive and effective human-robot interaction.
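A minimal sketch of such an instruction-to-plan step is given below, assuming an OpenAI-style chat client; the skill vocabulary, the model name, and the JSON plan format are placeholders rather than the paper's actual prompt or API.

```python
import json
from openai import OpenAI  # any chat-style LLM client would do; this choice is an assumption

SKILLS = ["navigate_to(store)", "search_shelf(item)", "grasp(item)", "place(basket)"]

SYSTEM = (
    "You control a shopping robot. Reply ONLY with a JSON list of skill calls "
    f"drawn from: {SKILLS}"
)

def plan_from_instruction(instruction, model="gpt-4o-mini"):
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": instruction},
        ],
    )
    # Expected shape: ["navigate_to('grocery')", "search_shelf('milk')", ...]
    return json.loads(resp.choices[0].message.content)

# plan = plan_from_instruction("Buy a carton of milk and a loaf of bread.")
# for step in plan:
#     dispatch_skill(step)   # dispatch_skill is a hypothetical executor hook
```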
The system’s decision-making process hinges on a sophisticated integration of spatial awareness and experiential learning. A Semantic Map Representation provides the robot with a detailed understanding of its surroundings, categorizing elements not just by their location, but also by their function – identifying, for instance, a ‘kitchen’ versus a ‘hallway’. Crucially, this isn’t a static assessment; an Action History Log records every movement and interaction, preventing the robot from repeating unsuccessful maneuvers or revisiting already explored areas. By cross-referencing the map with this log of past actions, the system anticipates potential obstacles, optimizes routes, and ultimately makes more informed choices, leading to efficient and reliable navigation even in dynamic environments. This combination ensures that each action builds upon previous experiences, fostering a form of robotic ‘common sense’ that minimizes wasted effort and maximizes task completion.
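The pairing of map and log can be illustrated with a small memory structure; the field names and the "visited/failed" queries below are assumptions meant only to show how past actions filter future choices.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative pairing of a semantic map with an action history log."""
    places: dict = field(default_factory=dict)      # label -> (x, y) in the map frame
    history: list = field(default_factory=list)     # (action, target, success) tuples

    def record(self, action, target, success):
        self.history.append((action, target, success))

    def unvisited(self):
        visited = {t for a, t, ok in self.history if a == "visit" and ok}
        return [p for p in self.places if p not in visited]

    def failed_before(self, action, target):
        return any(a == action and t == target and not ok for a, t, ok in self.history)

mem = AgentMemory(places={"grocery": (2.0, 1.0), "bakery": (5.0, 3.0)})
mem.record("visit", "grocery", True)
print(mem.unvisited())                    # -> ['bakery']
print(mem.failed_before("grasp", "milk")) # -> False
```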
The system’s ability to correlate environmental mapping with a record of prior actions fundamentally enhances navigational performance. By referencing the ‘Action History Log’ in conjunction with the ‘Semantic Map Representation’, the robot doesn’t simply react to the present environment, but anticipates potential challenges and leverages previously successful strategies. This allows for more efficient path planning, avoiding redundant movements and enabling the robot to adapt quickly to unforeseen obstacles or changes within complex spaces. Essentially, past experiences are woven into the decision-making process, fostering a level of reliability and adaptability that transcends simple reactive navigation and moves toward proactive, informed exploration.
Rigorous testing across both the simulated environment of Gazebo and a physical, real-world setup has confirmed the robustness and adaptability of this robotic decision-making system. These trials weren’t merely isolated checks; they demonstrated a complete, functional pipeline – from receiving natural language goals to executing corresponding actions in a dynamic space. Performance parity between simulation and reality highlights the system’s capacity to generalize beyond controlled conditions, suggesting it can reliably navigate unforeseen obstacles and complex scenarios. This validation is crucial, signifying a move beyond theoretical frameworks towards a practical, deployable solution for autonomous navigation and task completion.

The presented work meticulously crafts a system where robotic action stems from provable, high-level directives. This approach resonates with a core tenet of computer science, as articulated by Edsger W. Dijkstra: “Program testing can be a useful effort, but it can never prove the absence of errors.” The system’s reliance on semantic mapping and LLM-driven task planning isn’t merely about achieving functionality; it’s about constructing a demonstrably correct framework for navigation and manipulation. The robot doesn’t simply react to stimuli; it operates based on a logically derived plan, mirroring the pursuit of mathematical purity in algorithmic design. The focus on provability, rather than empirical testing alone, establishes a robust foundation for reliable robotic autonomy.
Future Trajectories
The presented integration of large language models into robotic control, while demonstrating a functional system, ultimately highlights the enduring chasm between statistical correlation and genuine understanding. The robot navigates and manipulates, but does it know what it is doing? The semantic map, a collection of labels attached to sensor data, remains precisely that – a collection. A truly robust system demands formal verification of the LLM’s reasoning, a process currently absent and arguably beyond the scope of purely empirical evaluation. The successful retrieval of an object, observed across multiple trials, does not constitute proof of generalizable intelligence.
Future work must address the inherent ambiguity of natural language. The system currently relies on human-provided instruction; a truly autonomous agent requires the ability to resolve ambiguity internally, or to actively seek clarification. More critically, the reliance on AprilTag localization, while pragmatic, introduces a single point of failure. A truly elegant solution would necessitate a system capable of building and maintaining a consistent world model independent of external markers – a feat demanding breakthroughs in simultaneous localization and mapping, grounded in provable algorithms, not merely probabilistic estimations.
The field now faces a choice: continue refining the existing paradigm of statistical learning, accepting its inherent limitations, or pursue a more rigorous, mathematically grounded approach to artificial intelligence. The former offers incremental improvements; the latter, though far more challenging, holds the promise of genuine autonomy – a distinction that, in the final analysis, matters profoundly.
Original article: https://arxiv.org/pdf/2601.00555.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/