Robots Gain Long-Term Memory with Graph-Based Retrieval

Author: Denis Avetisyan


Researchers have developed a new system that allows robots to efficiently store, retrieve, and utilize past experiences to better understand and interact with their surroundings.

An embodied agent constructs an environmental memory through exploration, subsequently enabling it to address user queries, whether spatial, temporal, or descriptive, by retrieving relevant information and providing guidance to specific locations.

This work introduces EmbodiedLGR, an architecture integrating lightweight graph representations with vector databases for real-time semantic-spatial memory in robotic agents.

Efficiently building and retrieving memories remains a core challenge for embodied artificial intelligence navigating complex environments. This limitation motivates the work presented in ‘EmbodiedLGR: Integrating Lightweight Graph Representation and Retrieval for Semantic-Spatial Memory in Robotic Agents’, which introduces a novel architecture, EmbodiedLGR-Agent, that combines graph-based semantic memory with vector database retrieval for real-time performance. By leveraging this hybrid approach, the agent achieves state-of-the-art inference speeds and competitive accuracy on embodied question answering tasks, while also demonstrating practical utility through successful deployment on a physical robot. Could this efficient memory management system unlock more natural and responsive human-robot interactions in real-world scenarios?


Beyond Simple Response: The Necessity of Embodied Understanding

Conventional robotics often functions on a foundation of meticulously pre-programmed responses, a methodology that proves increasingly inadequate when confronted with the unpredictable nature of real-world environments. These systems, while capable of performing repetitive tasks with precision, exhibit a marked fragility when faced with novelty or deviation from their expected parameters. The inherent limitation stems from a reliance on explicitly defined instructions, leaving little room for adaptation or improvisation. Consequently, robots designed in this manner struggle with even minor disturbances – an unexpected obstacle, a shift in lighting, or a previously unseen configuration – frequently leading to operational failure or the need for human intervention. This dependence on static programming highlights a critical gap between the capabilities of current robotic technology and the fluid, dynamic demands of authentic interaction and problem-solving in unstructured settings.

Truly effective Human-Robot Interaction necessitates a shift from robots as mere command-followers to agents capable of genuine learning and contextual reasoning. Current robotic systems often struggle when confronted with situations deviating from their pre-programmed parameters; a robot that can learn from experience, however, adapts and problem-solves in dynamic environments. This isn’t simply about accumulating data, but building an internal model of the world, allowing the robot to anticipate consequences, understand nuanced requests, and collaborate more seamlessly with humans. The capacity to reason – to infer, deduce, and apply past experiences to novel situations – is therefore paramount, moving robotics beyond automated task completion towards genuine, intelligent interaction.

Existing robotic systems often falter when confronted with the unpredictable nature of real-world environments because they struggle to create and retain detailed, contextual memories. Unlike humans, who seamlessly integrate past experiences with present perceptions to inform decision-making, robots typically rely on pre-programmed instructions or limited short-term memory. This deficiency hinders their ability to navigate complex spaces, adapt to changing conditions, or effectively solve problems requiring nuanced understanding. A truly intelligent agent requires more than just data storage; it demands a system capable of associating sensory input with spatial and temporal context, allowing it to not only recall past events but also reason about their relevance to the current situation – a capability crucial for robust and adaptable performance in dynamic environments.

Running on a Jetson Orin and integrated with a ROS 2 environment, the EmbodiedLGR-Agent successfully learned an environment representation using ZED X cameras and Velodyne LiDAR, enabling autonomous navigation to previously memorized objects.

A Semantic Map: The Architecture of Embodied Memory

The EmbodiedLGR-Agent employs a Semantic Memory Graph as its central knowledge representation. This graph stores information not as discrete data points, but as interconnected nodes representing objects, locations, and their relationships within an environment. Nodes are linked by edges defining spatial and semantic associations – for example, the relation “on top of” or “adjacent to.” This graph structure allows the agent to represent complex environmental layouts and retrieve information based on semantic similarity, rather than exact matches. The architecture is designed to facilitate reasoning about space, object permanence, and the agent’s own experiences within the environment, enabling more robust and adaptable behavior.
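The paper does not publish its graph schema, but the structure described above can be sketched with NetworkX (the library the system is reported to use). The node names and relations below are illustrative placeholders, not taken from the paper:

```python
import networkx as nx

# Toy semantic memory graph: nodes are objects and locations,
# directed edges carry spatial/semantic relations as attributes.
memory = nx.DiGraph()
memory.add_node("kitchen", kind="location")
memory.add_node("table", kind="object")
memory.add_node("mug", kind="object")
memory.add_edge("table", "kitchen", relation="located_in")
memory.add_edge("mug", "table", relation="on_top_of")

def related(graph, node, relation):
    """Targets reachable from `node` via edges carrying the given relation."""
    return [v for _, v, d in graph.out_edges(node, data=True)
            if d["relation"] == relation]

print(related(memory, "mug", "on_top_of"))  # ['table']
```

Answering "where is the mug?" then reduces to a short edge traversal rather than a scan over raw observations, which is the efficiency argument for a graph-shaped memory.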

The Semantic Memory Graph within the EmbodiedLGR-Agent relies on a Vector Database to provide efficient storage and retrieval of experiential data. This database utilizes vector embeddings – numerical representations of semantic information – allowing for similarity searches based on meaning rather than exact keyword matches. By indexing experiences as vectors, the system can rapidly identify relevant past interactions based on the similarity of their embeddings to the current perceptual input. This approach significantly improves search speed and scalability compared to traditional database methods, enabling the agent to access and utilize a large corpus of experiences for planning and decision-making. The Vector Database facilitates both nearest neighbor and approximate nearest neighbor searches, balancing retrieval accuracy with computational efficiency.
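A dependency-free sketch of the retrieval idea follows. Production systems use learned embeddings and an ANN index rather than brute force, but the ranking logic is the same; the three-dimensional vectors here are hand-picked toys standing in for model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings" of stored experiences (stand-ins for real model output).
experiences = {
    "saw a red mug on the kitchen table": [0.9, 0.1, 0.2],
    "charged at the docking station":     [0.1, 0.8, 0.3],
    "passed a chair in the hallway":      [0.3, 0.2, 0.9],
}

def retrieve(query_vec, store, k=1):
    """Return the k stored experiences most similar to the query vector."""
    ranked = sorted(store, key=lambda text: cosine(query_vec, store[text]),
                    reverse=True)
    return ranked[:k]

# A query vector close to the "mug" embedding retrieves that experience first.
print(retrieve([0.85, 0.15, 0.25], experiences))
```

Because ranking is by vector similarity rather than keyword match, a query about "the cup in the kitchen" can still land on the mug observation, which is the point of meaning-based retrieval.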

The EmbodiedLGR-Agent employs Florence-2, a Visual Language Model, to process visual input and generate embeddings for semantic representation. Florence-2 functions as the primary perceptual component, analyzing incoming images to create vector representations of observed objects and scenes. These embeddings are then directly integrated into the Semantic Memory Graph, establishing a link between the agent’s visual experiences and its stored knowledge. This process allows the agent to semantically understand its surroundings and retrieve relevant past experiences based on visual similarity, facilitating informed decision-making and action planning. The selection of Florence-2 prioritizes computational efficiency, enabling real-time perception on resource-constrained robotic platforms.

The EmbodiedLGR-Agent operates in two phases: a memory-building phase that processes visual frames with the Florence-2 VLM to create a vector database and memory graph informed by robot pose and time, and a querying phase where an LLM retrieves relevant information from these sources to answer user requests.
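The two-phase flow can be sketched as follows. The VLM captioning and the vector/graph retrieval are mocked with plain Python (a keyword match stands in for the actual similarity search), so the data structures and names are assumptions, not the paper's implementation:

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    caption: str     # text the VLM produced for the frame
    pose: tuple      # robot (x, y) pose when the frame was captured
    timestamp: float # capture time

@dataclass
class EnvironmentMemory:
    entries: list = field(default_factory=list)

    # Phase 1: memory building — store each captioned frame with pose and time.
    def observe(self, caption, pose):
        self.entries.append(MemoryEntry(caption, pose, time.time()))

    # Phase 2: querying — naive keyword retrieval standing in for the
    # vector-database + graph lookup described above.
    def query(self, keyword):
        return [e for e in self.entries if keyword in e.caption]

env = EnvironmentMemory()
env.observe("a mug on the kitchen table", pose=(1.2, 0.4))
env.observe("an empty hallway", pose=(3.0, 1.1))
print(env.query("mug")[0].pose)  # (1.2, 0.4)
```

Storing pose and timestamp alongside each caption is what lets the agent answer not just "what did you see?" but "where and when did you see it?", and then navigate back to that location.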

Reasoning with Context: Retrieval-Augmented Generation in Action

The ReMEmbR system enhances the capabilities of the EmbodiedLGR-Agent through the integration of a Retrieval-Augmented Generation (RAG) pipeline. This pipeline allows the agent to access and utilize external knowledge stored in its semantic memory during the response generation process. Specifically, incoming queries trigger a retrieval step where relevant information is extracted from the semantic memory. This retrieved context is then combined with the original query and fed into a large language model to produce a more informed and contextually relevant response, effectively augmenting the model’s inherent knowledge with external data.

The agent’s ability to query its semantic memory is central to its reasoning process; relevant information is retrieved from the knowledge graph and utilized to contextualize both action selection and natural language responses. This retrieval process involves formulating queries based on the current state and task, searching the semantic memory for matching nodes and relationships, and then incorporating the retrieved information into the agent’s decision-making pipeline. The system utilizes the retrieved data to refine its understanding of the environment, identify appropriate actions, and formulate coherent and informed responses, effectively grounding its behavior in stored knowledge.
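The contextualization step can be illustrated with a minimal prompt builder. The template wording and snippet format are invented for illustration; only the retrieve-then-ground pattern comes from the text above:

```python
def build_prompt(question, retrieved):
    """Combine retrieved memory snippets with the user question into a
    single context-grounded prompt for the language model."""
    context = "\n".join(f"- {snippet}" for snippet in retrieved)
    return (
        "Answer using only the retrieved observations below.\n"
        f"Observations:\n{context}\n"
        f"Question: {question}\n"
    )

# Hypothetical snippets as they might come back from memory retrieval.
retrieved = [
    "10:02 saw a red mug on the kitchen table at pose (1.2, 0.4)",
    "10:05 passed a chair in the hallway at pose (3.0, 1.1)",
]
prompt = build_prompt("Where did you last see the mug?", retrieved)
print(prompt)
```

Constraining the model to the retrieved observations is what grounds the answer in the agent's actual experience rather than in the LLM's parametric knowledge.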

The ReMEmbR system employs NetworkX, a Python package, for the construction, manipulation, and traversal of its Semantic Memory Graph, enabling efficient representation of knowledge and relationships. To facilitate semantic similarity searches within this graph, all-MiniLM-L6-v2, a sentence-transformer model, is utilized to generate vector embeddings from textual data. These embeddings serve as numerical representations of concepts, allowing the system to identify and retrieve relevant information based on semantic proximity within the vector database and graph structures. The choice of all-MiniLM-L6-v2 balances embedding quality with computational efficiency, critical for real-time operation.

Milvus serves as the Vector Database within the ReMEmbR system, responsible for the persistent storage and high-performance retrieval of semantic vectors generated from knowledge graph entities and user queries. These vectors, produced using all-MiniLM-L6-v2, are indexed within Milvus, enabling efficient similarity searches to identify the most relevant information for Retrieval-Augmented Generation (RAG). The system utilizes Milvus’s approximate nearest neighbor (ANN) search capabilities to quickly locate semantically similar vectors, facilitating rapid access to the knowledge required for informed decision-making and response generation. This implementation prioritizes scalability and speed, allowing the system to handle a large volume of semantic data and maintain low latency during retrieval operations.

Performance evaluations demonstrate a response latency of 23.73 seconds when utilizing the Florence-2-large vision-language model within the ReMEmbR system. This represents a modest increase compared to the 19.79-second latency achieved when the system relies exclusively on vector database queries for information retrieval. The observed increase is likely attributable to the additional processing required for graph traversal and information integration within the RAG pipeline, despite optimizations implemented within the system architecture.

Analysis of query routing within the ReMEmbR system demonstrates a strong preference for vector database retrieval, though this reliance shifts with the language model utilized. When employing Florence-2-base, the system defaults to querying the Milvus vector database for 93.34% of incoming requests. However, with the larger Florence-2-large model, this percentage decreases to 80.83%, indicating a greater capacity for effective graph-based retrieval. Consequently, 19.17% of queries are resolved exclusively through information accessed within the Semantic Memory Graph when using Florence-2-large, suggesting enhanced reasoning capabilities and reduced dependence on the vector database for complete responses.

Towards Robust Intelligence: Validation and Future Directions

The EmbodiedLGR-Agent has achieved notable success in evaluating environmental understanding through performance on the OpenEQA and NaVQA datasets. These benchmarks present complex, visually-grounded questions requiring the agent to not only ‘see’ its surroundings but also to reason about relationships between objects and actions within those surroundings. The agent’s ability to accurately answer questions concerning spatial arrangements, object properties, and potential interactions demonstrates a significant step towards robust environmental intelligence. This performance suggests the system possesses a capacity for contextual awareness, moving beyond simple object recognition to encompass a deeper, more nuanced comprehension of the world around it – a crucial element for effective interaction and navigation in real-world scenarios.

The agent’s capacity to function effectively in unfamiliar situations stems from its robust semantic memory, a system allowing it to store and retrieve information not as raw sensory data, but as abstract concepts and relationships. This allows the agent to move beyond rote memorization of specific environments and instead understand the underlying principles governing them. Consequently, when presented with a novel scenario, the agent doesn’t require retraining; it can leverage its existing knowledge to interpret the new situation, identify relevant information, and formulate appropriate responses. Furthermore, the semantic memory facilitates adaptation to changing environments, as the agent can continuously update its understanding based on new experiences, refining its internal model of the world and ensuring continued effective interaction.

The development of agents like EmbodiedLGR-Agent signifies a considerable leap toward truly interactive robotics. By effectively bridging the gap between perception, reasoning, and action, this approach paves the way for robots that don’t simply execute programmed tasks, but genuinely understand and respond to dynamic, real-world situations. Such advancements are crucial for enabling robots to collaborate with humans in complex environments – assisting in homes, workplaces, or even disaster relief scenarios – requiring not just physical dexterity but also the capacity to interpret instructions, learn from experience, and adapt to unforeseen circumstances. Ultimately, this line of research promises a future where robots are not merely tools, but intelligent partners capable of seamless and intuitive interaction.

Continued development of the EmbodiedLGR-Agent centers on significantly expanding its capacity for both memory and complex reasoning. Researchers aim to move beyond current limitations by implementing continuous learning strategies, allowing the agent to refine its understanding of the environment through ongoing interaction and data acquisition. This involves not simply storing more information, but also developing more sophisticated algorithms for knowledge organization and retrieval, enabling the agent to draw nuanced inferences and adapt to unforeseen circumstances. Exploration will be a key component, pushing the agent to actively seek out new information and refine its internal models of the world, ultimately fostering a more robust and intelligent system capable of navigating and interacting with dynamic, real-world scenarios.

The architecture detailed within prioritizes succinctness; unnecessary complexity obscures effective function. This resonates with John McCarthy’s assertion, “It is often easier to explain what something is not than what it is.” EmbodiedLGR-Agent, through its integration of lightweight graph representation and vector database, actively defines capability by eliminating extraneous layers. The system’s efficiency in real-time inference stems not from adding more components, but from a deliberate subtraction of the superfluous – a principle aligning with McCarthy’s emphasis on clarity and the pursuit of essential truths within any system, be it computational or conceptual. The focus on semantic-spatial memory demands precision, and precision thrives in the absence of noise.

The Road Ahead

The elegance of EmbodiedLGR-Agent lies in its attempt to bridge the gap between the symbolic and the subsymbolic – a familiar ambition. The architecture’s performance, while promising, invites contemplation. It is easy to construct elaborate frameworks to manage complexity; the true test arrives when those frameworks begin to dissolve, revealing not further layers of intricacy, but a core of fundamental simplicity. Future work will undoubtedly focus on scaling these systems – larger graphs, more agents, broader environments. But a more pressing question concerns the nature of the knowledge represented. Currently, it seems a focus on ‘what’ is known eclipses the ‘how’ and ‘why’ – the inferential processes that distinguish intelligence from mere recall.

The current reliance on vector databases, while pragmatic, feels… provisional. They serve as excellent repositories, but offer little in the way of active reasoning. A genuine leap forward will likely involve integrating these retrieval mechanisms with more robust symbolic reasoning engines, or perhaps, finding ways to imbue the vector space itself with inferential capabilities. One suspects the path isn’t about adding more layers of abstraction, but about stripping away the unnecessary, revealing the underlying geometric principles that govern both perception and cognition.

Ultimately, the challenge remains not to build robots that possess memory, but to understand how memory emerges as a consequence of interaction. This architecture is a step in that direction – a carefully constructed, and arguably over-engineered, attempt to capture a fleeting glimpse of that elusive process. The true measure of its success will be not in benchmarks achieved, but in the questions it inspires.


Original article: https://arxiv.org/pdf/2604.18271.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-21 13:22