Robots That Remember: Smarter Object Search in Cluttered Spaces

Author: Denis Avetisyan


A new framework combines probabilistic planning with neural networks to help robots efficiently locate objects even as environments become more complex.

The systematic expansion of a state space proceeds through iterative refinement, building complexity from fundamental components to enable comprehensive system analysis and control.

This work introduces GNPF-kkCT, a POMDP-based approach utilizing belief tree reuse, neural processes, and adaptive action refinement for robust robot object search with growing state spaces and hybrid action domains.

Efficiently locating objects in complex, real-world environments remains a challenge for mobile robots due to perceptual limitations and expansive state spaces. This paper introduces a novel framework, ‘POMDP-based Object Search with Growing State Space and Hybrid Action Domain’, which tackles this problem by formulating object search as a Partially Observable Markov Decision Process (POMDP) and employing a growing neural process filtered k-center clustering tree (GNPF-kkCT) to manage complexity. Through innovations in belief tree reuse, adaptive action refinement, and a modified Monte Carlo Tree Search (MCTS), the approach demonstrates improved performance over existing POMDP and non-POMDP baselines, including large language model (LLM)-based methods. Could this combination of techniques pave the way for more robust and efficient robotic search capabilities in dynamic, unstructured environments?


The Challenge of Intelligent Robotic Search

Robotic object search in realistic settings presents a significant challenge because of inherent uncertainties and substantial computational needs. Unlike controlled laboratory environments, the real world is filled with unpredictable factors – imperfect sensor data, cluttered scenes, and objects that can appear in various poses and lighting conditions. These ambiguities demand that robots process vast amounts of information to estimate object locations and orientations, a process that quickly becomes computationally expensive. Traditional algorithms, often reliant on precise models and calculations, struggle to cope with this complexity, leading to slow search times and frequent failures. The combination of perceptual uncertainty and the need for real-time processing creates a demanding task for robotic systems attempting to reliably locate and retrieve objects in dynamic, unstructured spaces.

Successful robotic object search isn’t simply about finding an item; it demands integrated planning of both where to look and how to acquire it. A truly robust system must simultaneously address the uncertainty inherent in object localization – estimating the object’s position amidst sensor noise and visual clutter – and devise a feasible manipulation strategy to actually grasp or retrieve it. This necessitates algorithms that can anticipate potential obstacles, adjust to changes in the environment, and select appropriate grasping points or movement trajectories. Rather than treating localization and manipulation as separate problems, advanced approaches emphasize a synergistic planning process where the search for an object is intrinsically linked to the robot’s ability to physically interact with and secure it, ultimately leading to more efficient and reliable performance in complex, real-world scenarios.

Robotic object search systems frequently falter when confronted with the unpredictable nature of real-world environments. Existing methodologies, while effective in controlled settings, demonstrate limited resilience to dynamic changes – such as moving obstacles, altered lighting, or the unexpected appearance of new objects. This inflexibility stems from a reliance on pre-programmed plans and static environmental maps, hindering a robot’s ability to replan and adjust its search strategy on the fly. Consequently, robots struggle to maintain efficient search performance in cluttered or constantly evolving workspaces, necessitating the development of more robust and adaptable algorithms capable of real-time reasoning and reactive behavior to overcome unforeseen challenges.

Using feature matching and without prior knowledge of object properties, a Fetch robot successfully locates and removes a pink snack box from a table by actively adjusting its base, lift, and head.

Modeling Uncertainty: The Power of POMDPs

Partially Observable Markov Decision Processes (POMDPs) are a mathematical framework for modeling decision-making when the agent cannot fully observe the state of its environment. Unlike standard Markov Decision Processes (MDPs), POMDPs account for this uncertainty through a belief state: a probability distribution over possible states given the history of observations and actions. A POMDP is defined by a tuple [latex](S, A, O, T, R, \Omega)[/latex], where S is the set of states, A the set of actions, O the set of observations, T the transition function giving the probability of the next state given the current state and action, R the reward function, and [latex]\Omega[/latex] the observation function, which gives the probability of receiving a particular observation given the resulting state and action. This allows POMDPs to represent scenarios where sensors are noisy or incomplete, making them suitable for applications such as robotics, dialogue systems, and healthcare.
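To make the belief state concrete, here is a minimal Bayes-filter belief update for a hypothetical two-state search problem ("object-left" vs. "object-right"); all numbers and names are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical 2-state POMDP with a noisy sensor.
# T[a][s, s'] : transition probabilities, Z[a][s', o] : observation probabilities.
T = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]])}
Z = {"stay": np.array([[0.8, 0.2],    # sensing while object is left
                       [0.3, 0.7]])}  # sensing while object is right

def belief_update(b, a, o, T, Z):
    """Bayes filter: b'(s') ~ Z(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[a].T @ b              # prediction step over the transition model
    updated = Z[a][:, o] * predicted    # weight by the observation likelihood
    return updated / updated.sum()      # normalize to a probability distribution

b0 = np.array([0.5, 0.5])                # uniform prior over the two states
b1 = belief_update(b0, "stay", 0, T, Z)  # observe o=0 ("looks left")
print(b1)  # belief shifts toward the "object-left" state
```

One observation suffices to move the belief from 50/50 to roughly 73/27, which is exactly the quantity a POMDP planner reasons over instead of the unknown true state.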

Object-oriented POMDPs (OOP-POMDPs) represent a refinement of the standard POMDP framework by structuring the state space around distinct objects, each defined as a class with specific attributes and associated actions. This approach contrasts with traditional POMDPs which often define the state as a flat vector of variables. By encapsulating state information within object classes, OOP-POMDPs facilitate modularity and reusability, particularly in complex scenarios involving numerous interacting entities. This modularity reduces the overall state-space size and simplifies model construction, as common object characteristics and behaviors can be defined once and instantiated multiple times. Furthermore, the object-oriented structure enables efficient belief update and action selection by focusing computations on relevant object attributes and interactions, thereby improving computational tractability compared to monolithic state representations.
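The object-factored state can be sketched with plain dataclasses; the class names and attributes below are hypothetical, chosen only to show how an update stays local to one object rather than touching a monolithic joint state:

```python
from dataclasses import dataclass, field

# Hypothetical object-oriented state factorization: the joint state is a
# collection of object instances, each carrying only its own attributes.
@dataclass(frozen=True)
class BoxState:
    pose: tuple       # (x, y) cell on the table
    occluded: bool    # whether another object blocks the view

@dataclass
class OOState:
    objects: dict = field(default_factory=dict)  # object_id -> object state

    def update(self, obj_id, new_state):
        """State/belief updates touch only the affected object, not the
        full joint state - the source of the OO-POMDP's modularity."""
        self.objects[obj_id] = new_state

s = OOState({"box_1": BoxState((2, 3), True)})
s.update("box_1", BoxState((2, 3), False))  # an observation resolved the occlusion
print(s.objects["box_1"].occluded)  # False
```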

The computational complexity of solving Partially Observable Markov Decision Processes (POMDPs) stems from the exponential growth of the reachable belief space with increasing numbers of observations, actions, and states. Exact solution methods, such as value iteration, require storing and updating the value function for each possible belief point, leading to intractable memory and processing demands for even moderately sized problems. Approximation algorithms, while mitigating this issue, introduce error and often require significant parameter tuning. The time required to find an approximate solution also scales poorly with problem size, frequently necessitating heuristics and parallelization to achieve results within a reasonable timeframe. Specifically, the number of belief nodes reachable within a planning horizon [latex]h[/latex] grows as [latex]O((|A||O|)^h)[/latex], where [latex]|A|[/latex] is the number of actions and [latex]|O|[/latex] the number of observations.
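One way to make this blow-up concrete is to count the belief nodes a planner could expand: every belief branches over each action, and each action over each possible observation. This is a standard back-of-the-envelope calculation with toy numbers, not a figure from the paper:

```python
# Each belief node branches into |A| actions, and each action into |O|
# observations, so depth d of the belief tree holds (|A|*|O|)**d nodes.
def belief_tree_size(num_actions, num_observations, horizon):
    branching = num_actions * num_observations
    return sum(branching ** d for d in range(horizon + 1))

print(belief_tree_size(3, 4, 2))  # 1 + 12 + 144 = 157
print(belief_tree_size(3, 4, 5))  # already 271,453 nodes by horizon 5
```

Even three actions and four observations push the tree past a quarter-million nodes at horizon 5, which is why approximate, sampled search is unavoidable.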

POMDP-based methods demonstrate varying performance in terms of discounted cumulative reward, steps to completion, and success rate, all measured within a 50-step horizon.

GNPF-kkCT: An Efficient Solver for Robotic Search

GNPF-kkCT accelerates Partially Observable Markov Decision Process (POMDP) solving for object search by integrating three core techniques. Neural processes are employed to predict the feasibility of actions, effectively reducing the dimensionality of the continuous action space and focusing search on viable options. k-means clustering further refines the action space by partitioning it into discrete clusters, enabling more efficient exploration and planning. Finally, belief tree reuse leverages previously computed belief trees from past experiences, avoiding redundant computations and significantly improving the speed of the planning process. This combination of techniques allows GNPF-kkCT to achieve substantial performance gains in autonomous object search scenarios.

GNPF-kkCT employs neural processes to predict the feasibility of actions within a continuous action space, effectively reducing its dimensionality for planning. This prediction mechanism assesses the likelihood of successful action execution based on the current state of the environment. Complementing this, k-means clustering is utilized to partition the remaining action space into a discrete set of representative actions. By grouping similar actions, the system reduces the computational burden of exploring the entire continuous space, enabling more efficient exploration and accelerating the search for optimal policies. This combination of neural prediction and discrete action partitioning significantly improves the scalability and speed of the POMDP solver.
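A minimal sketch of the action-partitioning step, assuming Lloyd's k-means over a batch of sampled continuous actions (the 2-D "base placement" actions, the sample sizes, and every name here are illustrative, not the paper's implementation):

```python
import numpy as np

# Lloyd's k-means over sampled continuous actions; the planner then explores
# only the k cluster centers instead of the full continuous space.
def k_means(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each sampled action to its nearest center
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        # move each center to the mean of its cluster (keep empty clusters fixed)
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels

rng = np.random.default_rng(1)
actions = rng.uniform(0, 1, size=(300, 2))  # e.g. candidate 2-D base placements
centers, labels = k_means(actions, 5)       # 5 representative actions
print(centers.shape)  # (5, 2)
```

In this sketch, 300 sampled actions collapse to 5 representatives, turning a continuous action domain into a small discrete set the tree search can branch over.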

Belief tree reuse in GNPF-kkCT functions by storing and re-applying previously computed belief trees to similar states encountered during the search process. This avoids redundant computations inherent in traditional POMDP planning algorithms, which would recalculate optimal policies from scratch for each new state. Specifically, the system maintains a repository of belief trees, and when a current state is determined to be sufficiently similar to a previously encountered state – based on a defined similarity metric – the corresponding stored belief tree is retrieved and adapted, rather than rebuilt. This adaptation typically involves updating the root node of the retrieved tree to reflect the current observation, significantly reducing the computational cost associated with policy evaluation and improvement, and ultimately accelerating the planning speed for object search tasks.
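The reuse mechanism can be sketched as a similarity-gated cache keyed by beliefs. The L1 metric and the threshold below are assumptions for illustration, not the paper's actual similarity measure:

```python
import numpy as np

# Sketch of similarity-gated belief tree reuse: a stored subtree is retrieved
# when the L1 distance between beliefs falls under a threshold; otherwise the
# planner builds a fresh tree.  Names and the threshold are illustrative.
class TreeCache:
    def __init__(self, threshold=0.1):
        self.entries = []            # list of (belief, tree) pairs
        self.threshold = threshold

    def lookup(self, belief):
        for stored_belief, tree in self.entries:
            if np.abs(belief - stored_belief).sum() < self.threshold:
                return tree          # reuse: re-root and adapt instead of rebuilding
        return None                  # miss: plan from scratch and store the result

    def store(self, belief, tree):
        self.entries.append((belief.copy(), tree))

cache = TreeCache()
cache.store(np.array([0.5, 0.5]), {"root": "subtree-A"})
print(cache.lookup(np.array([0.52, 0.48])))  # {'root': 'subtree-A'}
print(cache.lookup(np.array([0.9, 0.1])))    # None
```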

GNPF-kkCT incorporates a World Model to simulate the environment and predict state transitions, enabling more informed decision-making during the planning process. This World Model is coupled with a Large Language Model (LLM) which is utilized for two key functions: generating prompts that guide the agent’s search behavior and predicting the feasibility of potential actions. The LLM assesses the likelihood of successful action execution based on the current state and the agent’s capabilities, effectively narrowing the action space and improving the efficiency of the planning algorithm. This integration allows the system to leverage the LLM’s reasoning abilities to enhance both the quality of prompts and the accuracy of action selection, contributing to improved performance in object search tasks.

Evaluations show that GNPF-kkCT achieves measurable performance gains in autonomous object search when contrasted with established classical POMDP solving methods. Specifically, the system exhibits improvements in metrics including search success rate, time to solution, and cumulative reward. Quantitative results indicate a statistically significant reduction in planning time, enabling faster response and adaptation in dynamic environments. These advancements are attributed to the system’s ability to efficiently navigate the state space and prioritize promising actions, as facilitated by the integration of neural processes, clustering, and belief tree reuse. Comparative analyses against baseline algorithms consistently demonstrate GNPF-kkCT’s superior performance across a range of simulated and real-world object search scenarios.

Our proposed approach, GNPF-kkCT, iteratively refines a grasp pose through alternating steps of grasp planning and contact-consistent trajectory optimization.

System Validation and Real-World Integration

Rigorous validation of the proposed robotic system occurred within the Gazebo simulator, a widely-used platform for testing robotic algorithms. These simulations assessed the system’s ability to autonomously locate and manipulate objects within a variety of complex, dynamically-changing environments. Results consistently demonstrated effective object search strategies, highlighting the system’s capacity to navigate obstacles and accurately identify target objects. Furthermore, the manipulation capabilities were thoroughly tested, confirming reliable grasping and precise object placement, thereby establishing the system’s potential for real-world application in tasks requiring both perception and dexterous action.

The robotic system’s ability to perceive its surroundings relies on a suite of critical perception modules, beginning with object detection accomplished through the implementation of YOLO – a highly efficient algorithm enabling real-time identification of target objects within the environment. Complementing this visual recognition is point cloud segmentation, a technique that processes three-dimensional data captured by sensors to differentiate between individual objects and the background, creating a detailed understanding of the scene’s geometry. This combined approach allows the robot to not only see what is present, but also to accurately delineate and categorize objects, forming the foundation for successful navigation and manipulation tasks in complex and dynamic settings.
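As a toy version of the segmentation step, the sketch below fits the dominant table plane by least squares and keeps points well above it as object candidates; a real pipeline would use RANSAC plane fitting plus Euclidean clustering on the remaining points. The synthetic cloud and all names are illustrative:

```python
import numpy as np

# Fit the dominant plane z = a*x + b*y + c by least squares, then keep points
# clearly above it as object candidates (toy stand-in for plane segmentation).
def segment_above_plane(cloud, margin=0.02):
    A = np.c_[cloud[:, 0], cloud[:, 1], np.ones(len(cloud))]
    coeffs, *_ = np.linalg.lstsq(A, cloud[:, 2], rcond=None)  # fit a, b, c
    residuals = cloud[:, 2] - A @ coeffs    # signed height above the plane
    return cloud[residuals > margin]        # candidate object points

rng = np.random.default_rng(0)
# Synthetic scene: a flat table at z ~ 0.70 m plus a small box on top of it.
table = np.c_[rng.uniform(0, 1, (500, 2)), rng.normal(0.70, 0.002, 500)]
box = np.c_[rng.uniform(0.4, 0.5, (50, 2)), rng.uniform(0.75, 0.80, 50)]
cloud = np.vstack([table, box])
objects = segment_above_plane(cloud)
print(len(objects))  # roughly the 50 box points
```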

Accurate spatial understanding is paramount for robotic navigation and manipulation, and RTAB-SLAM provides a robust solution for creating and continually updating maps of unstructured environments. This algorithm uniquely combines range and visual data, allowing the system to simultaneously build a map and localize itself within it – a process known as Simultaneous Localization and Mapping (SLAM). Unlike methods reliant on pre-existing maps or highly structured spaces, RTAB-SLAM excels in dynamic and visually complex scenarios by efficiently managing large datasets and correcting for accumulated errors. The resulting map serves as a critical foundation for the robot’s ability to plan paths, identify object locations, and successfully execute tasks within its workspace, even as the environment changes over time.

Robust robotic object search in challenging environments becomes possible through the synergistic integration of several key components, all orchestrated by the GNPF-kkCT framework. This system unifies perception modules – including real-time object detection via YOLO and point cloud segmentation – with the mapping and localization capabilities of RTAB-SLAM. By effectively combining visual understanding of the environment with precise spatial awareness, GNPF-kkCT facilitates a seamless transition from language command to physical action. This improved Vision-Language-Action integration allows the robot to not only ‘see’ and ‘understand’ an object request, but also to reliably locate and retrieve the target object, even within complex and dynamic scenes, representing a significant step towards more intuitive and effective human-robot collaboration.

This research introduces GNPF-kkCT, a new framework and solver built upon Partially Observable Markov Decision Processes (POMDPs) designed to significantly enhance robotic object search capabilities. Through rigorous testing, GNPF-kkCT demonstrably outperforms traditional methods, achieving higher cumulative rewards during task execution. This improvement stems from a more efficient task planning process, evidenced by a reduction in the number of steps required to locate and manipulate objects. Crucially, the framework also exhibits a notably increased success rate in complex environments, suggesting a more robust and reliable approach to vision-language-action integration for autonomous robotic systems. The framework’s performance indicates a substantial advancement in the field of robotic task planning and execution.

Our method consistently yields reliable observations even with challenging robot configurations.

The pursuit of efficient robotic object search, as detailed in this framework, echoes a fundamental principle of systemic design. The GNPF-kkCT method’s adaptive refinement of the action space, and its capacity to manage a growing state space intelligently, demonstrates that optimizing individual components isn’t enough. As John von Neumann observed, “It is impossible to be certain of anything.” This resonates with the probabilistic nature of POMDPs and the inherent uncertainty in cluttered environments. The system must account for incomplete information, continuously updating its beliefs and adapting its search strategy, much like a living organism responding to its surroundings. A holistic approach, considering the interplay between belief tree reuse, neural processes, and action selection, is crucial for robust performance.

Beyond the Search

The elegance of the presented framework, GNPF-kkCT, lies not in its complexity, but in its attempt to bridge the gap between theoretical optimality, a hallmark of POMDPs, and the messy reality of robotic perception. The current instantiation, however, reveals the inherent limitations of any system attempting to distill infinite state spaces. Scalability, as always, is the true test. The adaptive refinement of the action space, while promising, merely delays the inevitable expansion; a truly robust system demands a principled understanding of which actions, and therefore which states, are worth considering.

Future work must address this core issue. The reliance on k-center clustering, while effective for initial partitioning, ultimately represents a static view of a dynamic environment. A more organic approach – one that mirrors the way an organism adapts to its surroundings – is required. The potential of neural processes to encode uncertainty is clear, but their integration with belief tree reuse remains largely unexplored. The ecosystem of components must evolve in concert, not as isolated improvements.

Ultimately, the pursuit of efficient object search is a proxy for a much larger question: how do we build systems that can reason effectively under uncertainty, and adapt gracefully to the unexpected? The answer, it seems, will not be found in faster algorithms or more powerful hardware, but in a deeper appreciation for the fundamental principles of simplicity, clarity, and holistic design.


Original article: https://arxiv.org/pdf/2604.14965.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-19 14:30