Robots That Understand What You Want: A New Leap in Object Manipulation

Author: Denis Avetisyan


Researchers have developed a system that allows robots to better interpret human language and interact with the world around them through a novel approach to understanding object affordances.

The system constructs an embodied memory from environmental observations, then prioritizes retrieval candidates based on assessed affordances to enable successful manipulation of objects following free-form instructions, a process reflecting an approach to graceful degradation as complexity increases.

This work introduces Affordance RAG, a hierarchical multimodal retrieval framework leveraging affordance-aware embodied memory to enhance open-vocabulary mobile manipulation capabilities.

While robots increasingly navigate complex environments, reliably executing open-vocabulary manipulation tasks remains a significant challenge. This paper introduces Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation, a novel framework designed to enhance robotic understanding of natural language instructions for object manipulation. By integrating affordance reasoning into a hierarchical multimodal retrieval system, the approach enables robots to identify executable actions with improved accuracy and efficiency. Could this represent a key step towards more adaptable and intuitive human-robot collaboration in real-world settings?


The Inevitable Drift: Bridging Perception and Action

Conventional robotic systems often falter when confronted with tasks they were not explicitly programmed to perform, a limitation stemming from their reliance on pre-defined actions and a lack of robust generalization. These robots excel within tightly controlled environments performing repetitive motions, but struggle with the variability inherent in real-world scenarios; a slight change in object position, lighting, or even the phrasing of an instruction can disrupt performance. This rigidity arises because traditional approaches focus on mapping specific inputs to specific outputs, effectively memorizing solutions rather than understanding the underlying principles of manipulation. Consequently, robots are unable to adapt to novel situations or apply learned skills to new, yet related, tasks – a significant barrier to their deployment in dynamic and unstructured environments where open-ended problem-solving is essential. The inability to extrapolate beyond pre-programmed parameters restricts their utility, highlighting the need for more flexible and adaptable robotic architectures.

For robots to truly assist humans with complex tasks, they must move beyond pre-programmed routines and interpret instructions given in natural language. This necessitates a system capable of deciphering ambiguous phrasing, understanding contextual cues, and dynamically adjusting actions based on the environment. Successfully linking language to action requires robots to not only hear a command like “bring me the red block,” but to visually identify the correct object amidst clutter, plan a collision-free path, and execute the manipulation with appropriate force, all while adapting to unforeseen obstacles or changes in the scene. The challenge lies in creating a cohesive system where linguistic understanding directly informs perceptual analysis and motor control, enabling robots to operate effectively in the unpredictable reality of human environments.

Robotic manipulation frequently falters not due to a lack of individual component capabilities, but rather the difficulty in seamlessly coordinating vision, language, and action. Existing systems often treat these as separate pipelines – a robot might see an object, understand a command referencing it, and then attempt a pre-programmed action – but struggle when faced with even minor deviations from training data. This fragmented approach leads to “brittle” performance, meaning the robot fails catastrophically when encountering novel situations, unexpected object poses, or ambiguous instructions. The inability to dynamically integrate perceptual input with semantic understanding prevents robots from adapting their plans and executing robust, generalizable manipulation skills, hindering their deployment in real-world, unstructured environments.

The ability for a robot to reliably interact with the world hinges on its capacity to not merely see objects, but to understand what those objects allow it to do. This necessitates a shift towards systems that can ground natural language – instructions like “pick up the red block” – directly in visual perception and, crucially, reason about object affordances. Affordances define the potential actions possible with an object – a handle affords grasping, a button affords pressing, and a flat surface affords placing. Without this understanding, robots struggle to generalize beyond pre-programmed scenarios, failing when faced with novel objects or slight variations in their environment. Developing robotic systems capable of inferring these affordances from visual data and linking them to language represents a fundamental step toward truly versatile and adaptable manipulation, allowing robots to move beyond rigid automation and engage in more flexible, human-like interaction with the world.

The Affordance RAG framework enables robots to leverage pre-exploration to build an embodied memory and then efficiently retrieve relevant affordances through hierarchical multimodal search to fulfill given instructions.

Constructing an Embodied Memory: Affordance RAG

Affordance Mem is a hierarchical memory structure designed to integrate visual perception with robotic action capabilities. Constructed from visual observations of the environment, it organizes information across multiple levels of abstraction. This hierarchy facilitates the representation of robotic affordances – the possible actions an agent can perform with objects – at varying granularities. Lower levels focus on instance-specific affordances derived from individual object detections, while higher levels aggregate these into regional or scene-level affordances. This multi-level representation allows the system to not only identify objects but also understand their potential uses within a given context, effectively bridging the gap between perception and action planning.
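As a rough sketch of what such a hierarchy might look like in code (the node fields and names below are illustrative assumptions, not the paper's actual schema), the memory can be modeled as nested scene, region, and instance nodes:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InstanceNode:
    """Leaf node: one detected object and its predicted affordances."""
    label: str                  # e.g. "mug"
    affordances: List[str]      # e.g. ["graspable", "fillable"]
    embedding: List[float]      # visual-language feature used for retrieval

@dataclass
class RegionNode:
    """Mid-level node: a spatial region summarizing several instances."""
    description: str            # e.g. "kitchen counter near the sink"
    embedding: List[float]      # aggregated regional feature
    instances: List[InstanceNode] = field(default_factory=list)

@dataclass
class SceneMemory:
    """Root node: the whole explored environment."""
    regions: List[RegionNode] = field(default_factory=list)

# A tiny memory built from one region containing two objects.
memory = SceneMemory(regions=[
    RegionNode(
        description="kitchen counter",
        embedding=[0.1, 0.3, 0.2],
        instances=[
            InstanceNode("mug", ["graspable", "fillable"], [0.2, 0.1, 0.4]),
            InstanceNode("cabinet door", ["openable"], [0.0, 0.5, 0.1]),
        ],
    )
])
```

Queries can then be answered at whichever level of this tree best matches the instruction: a region's description for coarse localization, an instance's affordances for action planning.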

The system constructs its environmental representation through a multi-level approach, utilizing Vision-Language Models (VLMs) to capture both regional and visual semantics. This involves processing visual input to create representations at different scales: instance-level details and broader, regional understandings of areas within the environment. VLMs are employed to encode these visual features and associate them with corresponding language-based descriptions, enabling the system to understand not just what objects are present, but also their relationships and the characteristics of the surrounding space. This hierarchical structure allows for efficient retrieval of relevant information during reasoning, as queries can be focused on specific instances or broader regional contexts as needed.

The Affordance Mem architecture relies on two core modules for environmental representation: the Area Summarizer and the Affordance Proposer. The Area Summarizer processes multi-view visual features, aggregating them to construct regional nodes within the memory; this allows for a spatially-organized understanding of the environment beyond individual viewpoints. Complementing this, the Affordance Proposer operates at the instance level, predicting potential affordances – the possible actions – associated with each identified object. These predicted affordances are then linked to the corresponding object instances and integrated into the hierarchical memory structure, enabling the system to not only perceive objects but also understand their functional potential.
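A minimal sketch of these two modules, assuming a simple mean-pooling summarizer and a lookup table standing in for the learned affordance predictor (both are illustrative stand-ins, not the paper's implementation):

```python
from statistics import fmean
from typing import Dict, List

def summarize_area(view_features: List[List[float]]) -> List[float]:
    """Area Summarizer (sketch): pool multi-view visual features into a
    single regional embedding by element-wise averaging."""
    return [fmean(dim) for dim in zip(*view_features)]

# Affordance Proposer (sketch): a lookup table in place of a learned model
# that predicts candidate affordances for each detected instance.
AFFORDANCE_TABLE: Dict[str, List[str]] = {
    "mug": ["graspable", "fillable"],
    "drawer": ["openable", "pullable"],
    "button": ["pressable"],
}

def propose_affordances(label: str) -> List[str]:
    return AFFORDANCE_TABLE.get(label, ["graspable"])  # fallback guess

# Example: build a regional embedding from three camera views of one area.
views = [[0.2, 0.4, 0.1], [0.3, 0.5, 0.0], [0.1, 0.6, 0.2]]
print(summarize_area(views))            # roughly [0.2, 0.5, 0.1]
print(propose_affordances("drawer"))    # ['openable', 'pullable']
```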

Traditional object recognition systems identify what an object is, while Affordance RAG focuses on what can be done with that object. This is achieved through explicit representation of affordances – the action possibilities linked to an object based on its properties and the agent’s capabilities. Instead of solely relying on visual features for classification, the system predicts potential interactions, such as “graspable,” “pushable,” or “openable.” This allows for reasoning about the functional role of objects within a scene and informs action planning; the system doesn’t just see a doorknob, it understands it can be turned to open a door, enabling more effective task completion and goal-oriented behavior.
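The contrast can be made concrete with a toy filtering step (the verb-to-affordance mapping and object list here are hypothetical): candidates are selected by the affordance an instruction requires rather than by object category alone.

```python
# Hypothetical mapping from an instruction's action verb to the affordance
# that action requires.
REQUIRED_AFFORDANCE = {"open": "turnable", "pick up": "graspable", "press": "pressable"}

objects = [
    {"label": "doorknob", "affordances": ["turnable", "graspable"]},
    {"label": "door hinge", "affordances": []},
]

def executable_candidates(verb: str, objects: list) -> list:
    """Keep only objects whose predicted affordances support the requested action."""
    needed = REQUIRED_AFFORDANCE.get(verb)
    return [o["label"] for o in objects if needed in o["affordances"]]

print(executable_candidates("open", objects))   # ['doorknob']
```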

Sensitivity analysis reveals that the hyperparameter α controls the balance between regional and visual semantic contributions in Multi-Level Fusion.

Navigating Complexity: Efficient Retrieval and Reasoning

Recursive Top-Down Traversal, as implemented in Affordance RAG, operates by initiating a search at the root of the hierarchical memory structure. The system then progressively descends through the hierarchy, evaluating candidate regions based on their relevance to the given instruction. At each level, the algorithm recursively applies this process to relevant child nodes, effectively pruning irrelevant branches and focusing the search on potentially informative regions. This approach contrasts with a breadth-first or flat search, offering improved efficiency by minimizing the number of regions that require evaluation and accelerating the identification of instruction-relevant data within the hierarchical memory.
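The pruning idea can be illustrated with a small sketch; the tree layout, dot-product relevance score, and threshold below are assumptions made for illustration, not the scoring used by Affordance RAG itself.

```python
def relevance(query_emb, node_emb):
    """Toy relevance score: dot product between instruction and node embeddings."""
    return sum(q * n for q, n in zip(query_emb, node_emb))

def traverse(node, query_emb, threshold=0.2, hits=None):
    """Recursive top-down traversal (sketch): score each child region against
    the instruction and descend only into branches above the threshold; leaves
    that survive pruning become retrieval candidates."""
    if hits is None:
        hits = []
    if not node["children"]:                 # leaf: an object instance
        hits.append(node["name"])
        return hits
    for child in node["children"]:
        if relevance(query_emb, child["embedding"]) >= threshold:
            traverse(child, query_emb, threshold, hits)
    return hits

# Toy memory: root -> two regions -> one instance each.
root = {"name": "house", "embedding": [0.0, 0.0], "children": [
    {"name": "kitchen", "embedding": [0.9, 0.1], "children": [
        {"name": "mug", "embedding": [0.8, 0.2], "children": []},
    ]},
    {"name": "garage", "embedding": [0.1, 0.9], "children": [
        {"name": "wrench", "embedding": [0.2, 0.8], "children": []},
    ]},
]}

print(traverse(root, query_emb=[1.0, 0.0]))  # ['mug']; the garage branch is never visited
```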

Multi-Level Fusion within the Affordance RAG framework integrates regional and visual semantics to improve instruction-grounded retrieval. Specifically, the system processes both the textual description associated with each memory region and the visual features extracted from the corresponding image. These features are then combined at multiple levels of abstraction, allowing the model to capture complex relationships between the instruction, the regional context, and the visual content. This fusion process enables the system to better identify relevant memory regions, even when the instruction does not explicitly mention specific visual details or regional attributes, resulting in more robust and accurate retrieval performance.
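One simple way to realize such a fusion is a convex combination of the regional and visual match scores, consistent with the sensitivity analysis above in which the hyperparameter α balances the two contributions; the exact fusion rule and the value of α below are assumptions for illustration.

```python
def fused_score(region_match: float, visual_match: float, alpha: float = 0.6) -> float:
    """Multi-Level Fusion (sketch): blend the instruction-to-region-text match
    and the instruction-to-visual-feature match with one hyperparameter.
    alpha near 1 trusts regional semantics; alpha near 0 trusts visual semantics."""
    return alpha * region_match + (1 - alpha) * visual_match

# A region whose text summary matches the instruction well but whose raw
# visual features match only moderately still scores highly overall.
print(fused_score(region_match=0.9, visual_match=0.4))  # ~0.7
```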

Affordance-aware reranking utilizes Large Language Models (LLMs) to refine initial retrieval results by assessing the alignment between potential actions and predicted affordances. Following the retrieval of candidate regions, the LLM evaluates each region’s suitability for performing actions relevant to the given instruction. This evaluation considers the predicted affordances – the possible actions an agent can take within a given environment – and assigns a higher rank to regions where the predicted affordances directly support the instruction’s requirements. This process effectively prioritizes regions that are not only semantically relevant but also enable the execution of the desired actions, improving the precision and utility of the retrieved information.
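A minimal sketch of this reranking step, assuming a keyword check in place of the actual LLM judgment and an arbitrary weighting between retrieval and affordance scores (a real system would issue an LLM prompt here):

```python
def llm_affordance_score(instruction: str, candidate: dict) -> float:
    """Stand-in for the LLM call that judges whether a candidate's predicted
    affordances support the instruction; returns a score in [0, 1]."""
    needed = {"deliver": "placeable", "pick": "graspable"}
    verb = instruction.split()[0]
    return 1.0 if needed.get(verb) in candidate["affordances"] else 0.2

def rerank(instruction: str, candidates: list, weight: float = 0.5) -> list:
    """Affordance-aware reranking (sketch): combine the initial retrieval score
    with the affordance-alignment score and sort candidates accordingly."""
    return sorted(
        candidates,
        key=lambda c: (1 - weight) * c["retrieval_score"]
                      + weight * llm_affordance_score(instruction, c),
        reverse=True,
    )

candidates = [
    {"name": "decorative tray", "affordances": [],            "retrieval_score": 0.9},
    {"name": "desk",            "affordances": ["placeable"], "retrieval_score": 0.8},
]
print([c["name"] for c in rerank("deliver the cup to the desk", candidates)])
# ['desk', 'decorative tray']: the affordance term overturns the raw retrieval order
```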

Standard Retrieval-Augmented Generation (RAG) systems typically focus on semantic similarity between a query and stored documents. This framework extends RAG by integrating affordance information – the potential actions an object or environment supports – into the retrieval process. Instead of solely relying on semantic matching, the system identifies regions within the knowledge base that relate to possible actions given the query. This is achieved by explicitly representing and querying for affordances associated with visual and semantic content, allowing the system to retrieve information relevant not just to what is being asked, but also to how a user might interact with the environment or objects described within the knowledge base. This affordance-based retrieval complements semantic search and improves the relevance of retrieved content for action-oriented tasks.

Real-world experiments demonstrate that incorporating affordance score reranking successfully identifies the correct target object and receptacle (highlighted in green) for instruction following, such as delivering a cup to a desk with coffee powder, outperforming a variant lacking this feature (less suitable options highlighted in orange).

Toward Robust Intelligence: Empirical Validation and Future Trajectories

Recent evaluations utilizing the challenging WholeHouse-MM Benchmark reveal that Affordance RAG establishes a new state-of-the-art in open-vocabulary mobile manipulation. This framework demonstrates an impressive overall task success rate of 85%, signifying a substantial advancement in robotic task completion. The system’s performance indicates a robust ability to understand and execute commands involving diverse objects and environments without requiring pre-defined categories. This achievement is particularly noteworthy given the benchmark’s complexity, which aims to mimic the unpredictable nature of real-world domestic settings, thereby validating the system’s potential for practical application and highlighting its capacity to navigate and interact with previously unseen scenarios effectively.

Evaluations demonstrate substantial performance gains with the developed system, notably in its ability to retrieve relevant information and successfully complete tasks. Specifically, the system achieves a Recall@5 score of 94%, a significant 15 percentage point improvement over previous methods, indicating a markedly enhanced capacity to identify the most pertinent data within a given context. This heightened recall directly translates to a Task Success Rate of 85%, a substantial 40 percentage point leap over baseline performance, suggesting the system is not only better at finding the right information, but also at effectively utilizing it to achieve desired outcomes in mobile manipulation tasks.

Detailed evaluation reveals Affordance RAG’s precision in identifying relevant environmental elements crucial for successful manipulation. The system achieves a Recall@10 score of 49.9% when locating the target object, indicating nearly half of the relevant objects are retrieved within the top ten results. Furthermore, it demonstrates a 24.3% Recall@10 for identifying suitable receptacles, and an overall Recall@10 of 37.1%. These scores represent significant improvements over previous state-of-the-art methods, exceeding baseline performance by 7.8, 4.5, and 8.4 percentage points respectively, and highlighting the framework’s ability to effectively pinpoint the objects and containers necessary for task completion.
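For reference, the kind of Recall@K figure reported above can be computed as follows; the toy queries below are hypothetical and not taken from the benchmark.

```python
def recall_at_k(ranked_results: list, relevant_items: list, k: int = 10) -> float:
    """Recall@K (sketch): fraction of relevant items appearing in the top-K results."""
    top_k = set(ranked_results[:k])
    hits = sum(1 for item in relevant_items if item in top_k)
    return hits / len(relevant_items)

# Two toy queries, each with a single relevant target object.
queries = [
    (["mug", "plate", "cup"],  ["cup"]),   # target retrieved within the top-3
    (["vase", "book", "lamp"], ["cup"]),   # target missed
]
scores = [recall_at_k(ranked, relevant, k=3) for ranked, relevant in queries]
print(sum(scores) / len(scores))           # 0.5, i.e. Recall@3 of 50% over these queries
```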

The demonstrated capacity of this affordance-based retrieval-augmented generation (RAG) framework to generalize beyond its training data signifies a crucial step towards practical application in dynamic, real-world settings. Unlike systems reliant on pre-defined scenarios, this approach exhibits robust performance even when confronted with previously unseen objects, environments, and task variations. This adaptability stems from the system’s ability to reason about object affordances – the potential actions an object enables – and retrieve relevant information accordingly, rather than simply memorizing specific solutions. Consequently, the framework isn’t limited by the constraints of static datasets and holds promise for deployment in homes, workplaces, and other complex environments where consistent, reliable mobile manipulation is essential, paving the way for more versatile and intelligent robotic assistants.

Task success rate on the WholeHouse-MM benchmark, measured as the percentage of times both the target object and receptacle are retrieved within the top-K results, demonstrates performance across different approaches.

The pursuit of robust robotic systems, as demonstrated by Affordance RAG, echoes a fundamental truth about complex creations: they are not static achievements but evolving processes. This framework, with its hierarchical multimodal retrieval and emphasis on affordance reasoning, doesn’t simply solve the problem of language-guided manipulation; it establishes a system capable of learning and adapting within dynamic environments. As Barbara Liskov observed, “It’s one thing to program a computer to do what you want it to do; it’s another thing to design a system that can do what you want it to do, without you telling it.” The elegance of Affordance RAG lies in its capacity to move beyond direct instruction, fostering a more graceful and resilient form of robotic intelligence. Observing this development suggests that prioritizing adaptability is often more valuable than seeking immediate, rigid solutions.

The Long Cycle

The presented framework, while a step toward more robust open-vocabulary mobile manipulation, highlights an enduring truth: systems do not achieve competence; they accumulate corrections. Affordance RAG addresses the immediate challenge of grounding language in action, but the inherent ambiguity of both language and the physical world guarantees a continuous influx of edge cases. The elegance of hierarchical retrieval merely distributes the burden of error; it doesn’t eliminate it. Future work will inevitably focus on refining the retrieval mechanisms, expanding the embodied memory, and improving the affordance representation, but these are all local optimizations within a larger, inescapable cycle.

A critical, and often overlooked, limitation lies in the implicit assumption of a static ‘world’ against which affordances are judged. Real environments are not fixed; they degrade, evolve, and surprise. A truly mature system will not simply react to these changes, but anticipate them, modeling not just object affordances, but also the affordances of decay and transformation. This requires shifting from a focus on ‘correct’ actions to a probabilistic understanding of likely failures, and toward graceful degradation, an engineering philosophy that prioritizes longevity over pristine performance.

The field progresses not by solving problems, but by revealing more nuanced ones. Affordance RAG offers a valuable tool for navigating the present, but the true test lies in its ability to adapt, learn from its inevitable failures, and extend its operational lifespan, not in achieving a mythical state of perfect execution. Time, after all, is not a metric of progress, but the medium in which all systems are ultimately tested.


Original article: https://arxiv.org/pdf/2512.18987.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
