Author: Denis Avetisyan
Researchers are developing richer ways for artificial intelligence to understand indoor environments, enabling more effective task planning and interaction.

This review details MomaGraph, a novel state-aware unified scene graph framework utilizing vision-language models and reinforcement learning for improved spatial-functional reasoning in embodied AI.
Effective robotic manipulation in complex environments demands a robust understanding of scene context, yet current approaches often fragment spatial and functional reasoning or treat scenes as static entities. To address these limitations, we present MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning, introducing a novel, unified scene representation integrating spatial-functional relationships, object states, and interactive elements. This work is accompanied by MomaGraph-Scenes, a large-scale dataset, and MomaGraph-Bench, a comprehensive evaluation suite, alongside MomaGraph-R1, a 7B vision-language model trained via reinforcement learning to predict task-oriented scene graphs and serve as a zero-shot planner. Will this framework unlock more adaptable and intelligent embodied agents capable of seamlessly navigating and interacting with real-world households?
The Limits of Pixel-Based Perception
Conventional approaches to representing environments for intelligent agents often fall short due to their inability to fully capture the intricate relationships between objects within a scene. These methods typically prioritize individual object recognition, neglecting the crucial spatial and semantic connections that define how objects interact and influence each other. Consequently, an agent operating on such representations struggles to understand affordances – what actions are possible with which objects – and to effectively plan complex tasks. For instance, a system might identify a ‘cup’ and a ‘table’ but fail to understand that the table provides a stable surface for the cup, hindering its ability to grasp and move the cup without spillage. This limitation ultimately restricts the agent’s capacity to navigate and manipulate the world with the same intuitive understanding as a human, demanding more complex and often brittle planning algorithms to compensate for the impoverished environmental understanding.
Many contemporary approaches to scene understanding for intelligent agents fundamentally represent environments as single, static images, a simplification that limits their capacity for effective planning and interaction. This reliance on purely visual data overlooks the inherent interactive nature of most real-world scenarios, neglecting crucial information about how objects respond to forces, how they can be manipulated, and the potential consequences of an agent’s actions. Consequently, an agent perceiving a scene in this manner struggles to anticipate outcomes, adapt to dynamic changes, or reliably execute complex tasks requiring physical reasoning; it essentially lacks the ability to “test” scenarios before committing to an action. A robust system necessitates encoding not just what is present, but also how elements within the scene relate and respond to each other, allowing for predictive modeling and resilient navigation of interactive environments.

A Formal Representation: MomaGraph
MomaGraph utilizes a graph structure to represent scenes, where nodes represent objects and edges define relationships between them. These relationships are explicitly categorized as either spatial or functional. Spatial relationships denote the physical arrangement of objects – their positions and orientations relative to each other – and are critical for navigation and collision avoidance. Functional relationships, conversely, describe how objects interact and affect each other’s states; for example, a “supports” relationship between a table and a book, or a “contains” relationship between a refrigerator and food items. This dual modeling of spatial and functional relationships allows for a more comprehensive understanding of the scene compared to representations focused solely on geometry or object recognition.
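To make the dual edge typing concrete, the sketch below shows a minimal, hypothetical scene-graph data structure in Python. The class and relation names (SceneGraph, RelationType, "supports", "contains") are illustrative assumptions, not the schema used by MomaGraph itself.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum

class RelationType(Enum):
    SPATIAL = "spatial"        # physical arrangement, e.g. "on", "inside", "left_of"
    FUNCTIONAL = "functional"  # how objects affect each other, e.g. "supports", "contains"

@dataclass
class ObjectNode:
    name: str
    state: dict = field(default_factory=dict)  # e.g. {"open": False}

@dataclass
class Relation:
    source: str
    target: str
    kind: RelationType
    label: str

@dataclass
class SceneGraph:
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: list[Relation] = field(default_factory=list)

    def add_object(self, name: str, **state) -> None:
        self.nodes[name] = ObjectNode(name, dict(state))

    def relate(self, src: str, dst: str, kind: RelationType, label: str) -> None:
        self.edges.append(Relation(src, dst, kind, label))

# Example: a book resting on a table, food stored in a closed fridge.
g = SceneGraph()
g.add_object("table")
g.add_object("book")
g.add_object("fridge", open=False)
g.add_object("milk")
g.relate("book", "table", RelationType.SPATIAL, "on")
g.relate("table", "book", RelationType.FUNCTIONAL, "supports")
g.relate("fridge", "milk", RelationType.FUNCTIONAL, "contains")
```

Separating the two edge kinds lets a planner query geometry (what is where) and causality (what affects what) independently over the same graph.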
MomaGraph incorporates part-level interactive elements to facilitate agent reasoning about object effects and subsequent planning. This is achieved by representing objects not as monolithic entities, but as collections of interacting parts, each with defined functionalities and spatial relationships. Consequently, agents can simulate how manipulating a specific part of an object – for example, opening a drawer or rotating a doorknob – will affect other objects in the scene, and use this information to generate feasible action sequences. This part-level granularity allows for more precise predictions of physical interactions and enables agents to solve tasks requiring complex manipulation and reasoning about affordances, going beyond simple whole-object interactions.
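A small, hypothetical sketch of how part-level interactivity might be encoded: objects are broken into parts, each carrying its own affordances and state, so an agent can simulate how acting on one part (opening a drawer) changes what is reachable. The names (Part, ArticulatedObject, apply) are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """An interactive part of an object (e.g. a drawer, a knob)."""
    name: str
    affordances: list                      # actions this part supports, e.g. ["open", "close"]
    state: dict = field(default_factory=dict)

@dataclass
class ArticulatedObject:
    name: str
    parts: dict = field(default_factory=dict)     # part name -> Part
    contents: dict = field(default_factory=dict)  # part name -> items it encloses

    def apply(self, part_name: str, action: str) -> list:
        part = self.parts[part_name]
        if action not in part.affordances:
            raise ValueError(f"{part_name} does not afford '{action}'")
        if action in ("open", "close"):
            part.state["open"] = (action == "open")
        return self.reachable_items()

    def reachable_items(self) -> list:
        # Items become reachable only when their enclosing part is open.
        return [item
                for pname, items in self.contents.items()
                if self.parts[pname].state.get("open", False)
                for item in items]

cabinet = ArticulatedObject("cabinet")
cabinet.parts["drawer_1"] = Part("drawer_1", ["open", "close"], {"open": False})
cabinet.contents["drawer_1"] = ["spoon", "fork"]

print(cabinet.reachable_items())          # [] -- drawer still closed
print(cabinet.apply("drawer_1", "open"))  # ['spoon', 'fork']
```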

MomaGraph-R1: A Learned Predictive Model
MomaGraph-R1 is a 7-billion-parameter vision-language model trained with reinforcement learning to predict task-oriented scene graphs. The model processes visual input and associated task instructions to predict objects, their attributes, and the relationships between them, representing the scene as a graph data structure. The reinforcement learning training process optimizes the model's ability to accurately predict these scene graphs, enabling subsequent task planning. The architecture combines visual and textual encoders with a graph prediction module, allowing it to generate structured representations of the environment relevant to the given task.
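The skeleton below, written against PyTorch, illustrates the kind of encoder-plus-graph-head pipeline described above. It is a rough sketch under the assumption of pooled, fixed-size embeddings from each encoder; the actual 7B MomaGraph-R1 architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class SceneGraphPredictor(nn.Module):
    """Hypothetical skeleton: fuse image and instruction features, emit graph tokens."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 hidden_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.vision_encoder = vision_encoder   # assumed to return (B, hidden_dim)
        self.text_encoder = text_encoder       # assumed to return (B, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # A plain LM head stands in for the graph prediction module, which would
        # autoregressively emit (subject, relation, object) tokens.
        self.graph_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image: torch.Tensor, instruction_tokens: torch.Tensor) -> torch.Tensor:
        v = self.vision_encoder(image)
        t = self.text_encoder(instruction_tokens)
        fused = torch.tanh(self.fuse(torch.cat([v, t], dim=-1)))
        return self.graph_head(fused)          # logits over graph-token vocabulary
```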
MomaGraph-R1 is trained with the DAPO reinforcement learning algorithm against a graph-alignment reward designed to evaluate how closely a predicted scene graph matches the ground truth. This reward quantifies the accuracy with which the generated graph captures task-relevant information, encouraging graph structures that faithfully reflect the relationships between objects and their roles in completing the given task. Because the graph-alignment reward directly guides policy optimization, MomaGraph-R1 learns to generate scene graphs that improve downstream task-planning performance.
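The paper's exact reward is not reproduced here, but a minimal sketch of the idea, scoring a predicted graph against a reference with an F1 over (subject, relation, object) triples, looks like the following; the triple format and the choice of F1 are assumptions for illustration only.

```python
def graph_alignment_reward(predicted, ground_truth):
    """Toy graph-alignment reward: F1 over (subject, relation, object) triples.

    `predicted` and `ground_truth` are iterables of triples. The reward actually
    used to train MomaGraph-R1 may differ; this only illustrates scoring a
    generated graph against a reference.
    """
    pred, gt = set(predicted), set(ground_truth)
    if not pred or not gt:
        return 0.0
    tp = len(pred & gt)                 # correctly predicted triples
    precision = tp / len(pred)
    recall = tp / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = {("book", "on", "table"), ("fridge", "contains", "milk")}
pred = {("book", "on", "table"), ("fridge", "left_of", "table")}
print(graph_alignment_reward(pred, gt))  # 0.5
```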
The MomaGraph-R1 model utilizes a Graph-then-Plan framework for zero-shot task planning, initiating the process by constructing a scene graph representing the environment and relevant objects. This graph serves as an intermediate representation, enabling the model to subsequently generate a sequence of actions designed to achieve the specified task. Evaluation on the MomaGraph-Bench dataset demonstrates an overall task success rate of 70% when employing this framework, indicating the model’s ability to generalize to unseen tasks without requiring task-specific training data. The success rate is calculated based on complete and correct execution of tasks as defined within the benchmark.
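A minimal sketch of the Graph-then-Plan control flow: query the model once for a task-oriented scene graph, then condition a second query on that graph to obtain an action sequence. The query_vlm function is a stand-in for whatever inference stack hosts the model, and the prompts are illustrative, not the ones used in the paper.

```python
def query_vlm(prompt: str, image_path: str) -> str:
    # Placeholder: plug in your own model/serving call here.
    raise NotImplementedError("Connect this to a vision-language model backend.")

def graph_then_plan(task: str, image_path: str) -> str:
    # Step 1: elicit a task-oriented scene graph from the observation.
    graph_prompt = (
        f"Task: {task}\n"
        "List the task-relevant objects, their states, and their spatial/functional "
        "relations as (subject, relation, object) triples."
    )
    scene_graph = query_vlm(graph_prompt, image_path)

    # Step 2: plan against the graph rather than the raw pixels.
    plan_prompt = (
        f"Task: {task}\nScene graph:\n{scene_graph}\n"
        "Using only objects and relations in the graph, output an ordered list of "
        "primitive actions that completes the task."
    )
    return query_vlm(plan_prompt, image_path)
```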

Demonstrating Robustness and Generalization
MomaGraph-R1 underwent rigorous testing via MomaGraph-Bench, a purposefully designed evaluation suite that moves beyond simple task completion to probe the model’s capacity for complex reasoning and sequential planning. This benchmark isn’t merely about providing correct answers; it assesses how the model arrives at those answers, demanding a demonstration of logical thought processes and the ability to anticipate consequences across multiple steps. The structure of MomaGraph-Bench allows for a nuanced understanding of the model’s strengths and weaknesses in scenarios requiring foresight, adaptability, and the integration of diverse information – critical capabilities for advanced AI systems operating in dynamic, real-world environments. By systematically evaluating performance on this challenging benchmark, researchers gained valuable insight into MomaGraph-R1’s core cognitive abilities and its potential for tackling intricate problems.
To move beyond simulated environments, MomaGraph-R1 was integrated with a physical robot, enabling evaluation of its capabilities in authentic, unstructured settings. This deployment involved tasks requiring real-time perception, planning, and execution – scenarios where even minor discrepancies between simulation and reality can derail performance. Results from these robotic experiments confirm that MomaGraph-R1 doesn’t just excel in benchmarks; it reliably translates its reasoning and planning abilities into effective action in the physical world. This successful implementation highlights the model’s robustness and potential for deployment in practical applications such as autonomous navigation, object manipulation, and human-robot interaction, paving the way for more adaptable and intelligent robotic systems.
Rigorous evaluation demonstrates that MomaGraph-R1 represents a substantial advancement in reasoning and planning capabilities. On the MomaGraph-Bench benchmark, the model achieved a performance increase of 11.4% over the leading open-source alternative, and notably exceeded the performance of In-Context Learning by 4.8%. Further validating its efficacy, MomaGraph-R1 attained 92.6% accuracy on the BLINK benchmark, a 3.1% improvement over the Supervised Fine-Tuning (SFT) baseline. These results collectively indicate a significant leap forward, establishing MomaGraph-R1 as a state-of-the-art solution for complex, multi-step reasoning tasks.

The pursuit of robust embodied AI, as demonstrated by MomaGraph, necessitates a shift from merely achieving functional outcomes to establishing provable understanding of the environment. The system’s emphasis on spatial-functional reasoning within unified scene graphs echoes a fundamental principle of mathematical elegance. Fei-Fei Li aptly stated, “If it feels like magic, you haven’t revealed the invariant.” MomaGraph strives to expose these invariants – the underlying, consistent relationships between objects and their functions – allowing agents to operate not through brute-force learning, but through demonstrable comprehension of the world’s structure. This focus on revealing the ‘invariant’ is crucial for building AI systems that are not simply reactive, but truly intelligent.
What’s Next?
The pursuit of robust scene understanding, as exemplified by MomaGraph, inevitably reveals the brittleness inherent in mapping perceptual data to symbolic representations. While the integration of vision-language models represents a step towards grounding these abstractions, the system remains fundamentally reliant on the quality of the initial dataset and the inductive biases encoded within the training regime. The true test will not be achieving high scores on curated benchmarks, but rather demonstrating consistent performance in genuinely novel environments – spaces that defy the assumptions baked into the learning process.
A critical, often overlooked, aspect is the computational cost of maintaining and reasoning over these increasingly complex scene graphs. Each added node, each inferred relationship, introduces a potential source of error and exponentially increases the search space for task planning. Future work must prioritize algorithmic efficiency, exploring methods for lossless compression of scene information and the development of provably optimal planning algorithms. Redundancy, even when seemingly benign, is an invitation to failure.
Ultimately, the field requires a shift in focus – away from simply representing the world, and towards building agents that can actively learn its underlying principles. A system that can dynamically refine its internal model of a scene, based on minimal interaction and a few core axioms, would be far more elegant – and far more robust – than any hand-crafted knowledge base, no matter how comprehensive. The goal is not to build a perfect map, but to forge an agent capable of navigating imperfection.
Original article: https://arxiv.org/pdf/2512.16909.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/