Robots Learn by Watching: Building 3D Scene Understanding Through Human Guidance

Author: Denis Avetisyan


Researchers have developed a new framework that enables robots to construct detailed 3D representations of environments by learning from human demonstrations of how to manipulate objects.

ArtiSG enhances robotic manipulation of objects with challenging characteristics, including subtle functional details, unusual movement patterns, and visual ambiguity, by drawing on a stored record of human demonstrations to predict and execute precise 6-DoF trajectories, overcoming the limitations of vision-language models that struggle with such intricacies.

ArtiSG constructs functional 3D scene graphs from human-demonstrated articulated object manipulation, improving robot interaction and viewpoint robustness.

While 3D scene graphs offer robots semantic understanding for navigation, they often lack the functional details needed for dexterous manipulation, particularly with articulated objects. To address this, we present ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation, a framework that builds functional scene graphs by encoding human demonstrations into a structured robotic memory. This approach leverages captured articulation trajectories and interaction data to discover even inconspicuous functional elements, enabling more precise kinematic estimation and improved recall. Could this human-guided approach unlock a new level of robotic proficiency in complex, real-world manipulation tasks?


Beyond Simple Recognition: Understanding Action in Visual Scenes

While current computer vision systems demonstrate remarkable proficiency in identifying and localizing objects within an image or video, a critical gap remains in their ability to interpret how those objects are utilized and how they relate to one another. These systems often treat a scene as a collection of independent entities, failing to grasp the functional context that dictates interaction. For instance, a system might accurately detect a mug, a table, and a hand, but struggle to infer that the hand is reaching for the mug on the table, or understand the purpose of that interaction – drinking. This limitation hinders progress in fields like robotics, where machines require a nuanced understanding of scenes to perform complex tasks, and emphasizes the need for algorithms that move beyond simple object recognition towards a more holistic comprehension of visual environments.

Effective robot manipulation hinges not simply on identifying what objects are present in a scene, but critically, on understanding how those objects can be used. This necessitates a grasp of ‘affordances’ – the potential actions an object allows – and ‘kinematic possibilities’, which define the range of motion and configurations achievable with the object and the robot itself. A robot perceiving a mug, for example, doesn’t just register its shape; it must infer that the mug can be grasped, lifted, tilted for pouring, and that these actions are constrained by its handle and the robot’s arm reach. Without this functional understanding, robots remain limited to pre-programmed sequences, unable to adapt to novel situations or interact with the environment in a truly intelligent and flexible manner.

A fundamental limitation of conventional scene understanding lies in its reliance on object lists – simple enumerations of ‘what’ is present. This approach proves inadequate when robots or AI systems must reason about a scene’s purpose. Knowing a kitchen contains a knife, plate, and apple doesn’t explain how these objects relate – whether the knife is for cutting the apple, or simply resting beside it. Complex tasks, such as assisting with cooking or tidying a room, necessitate understanding object affordances – what actions an object enables – and the kinematic possibilities for interaction. Representing a scene functionally, as a network of potential actions and relationships, is crucial for enabling truly intelligent behavior, moving beyond mere object identification to genuine comprehension of a scene’s utility and purpose.

Human demonstrations of manipulation sequences, captured with a custom gripper, are used to construct functional scene graphs that represent object articulation and enable open-vocabulary queries for robot manipulation.

ArtiSG: Encoding Functionality for 3D Scene Understanding

ArtiSG extends traditional 3D Scene Graphs by introducing ‘Functional Element Nodes’ which specifically denote the components of an articulated object that are directly involved in an action or manipulation. Unlike standard scene graph nodes that focus on geometric properties, these functional elements highlight parts such as a drawer handle, a door hinge, or a robot’s gripper, allowing the system to understand an object’s interactive capabilities. This representation facilitates the decomposition of complex tasks into a series of actions performed on these functional elements, enabling robots to move beyond static scene understanding and engage with the dynamic, interactive properties of their environment. The inclusion of these nodes allows for a more targeted and efficient planning process, focusing on the actionable components rather than the entire object geometry.
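
The paper does not publish a concrete node schema, so the following Python sketch is only one plausible way to realize the idea: functional element nodes hang off ordinary object nodes, and a planner enumerates actionable parts rather than whole objects. All class and field names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalElementNode:
    """Hypothetical node for an actionable part (handle, hinge, knob)."""
    name: str                      # e.g. "drawer handle"
    parent_object: str             # id of the object node it belongs to
    grasp_pose: list               # 6-DoF pose, e.g. [x, y, z, roll, pitch, yaw]
    joint_id: str = ""             # link to an articulation model, if known

@dataclass
class ObjectNode:
    """Standard object node carrying geometry plus its functional elements."""
    name: str
    point_cloud: object            # geometry, e.g. an Nx3 array
    functional_elements: list = field(default_factory=list)

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)

    def actionable_parts(self):
        """Enumerate every functional element in the scene for task planning."""
        for obj in self.objects.values():
            yield from obj.functional_elements
```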

The ArtiSG framework utilizes human demonstrations to create a functional representation of 3D scenes for robotic manipulation. This is achieved by encoding observed actions into a structured format that defines how articulated objects can be interacted with. Evaluation in real-world scenarios demonstrates an 88.5% functional element recall rate, indicating the system’s ability to accurately identify and reproduce the demonstrated functionalities. This high recall suggests the encoded representation effectively captures the essential elements of the human-performed actions, allowing the robot to replicate them with a significant degree of accuracy.

ArtiSG extends robotic manipulation capabilities beyond basic pick-and-place tasks by explicitly modeling articulation possibilities within 3D scenes. This is achieved through the representation of how objects can be moved and reconfigured, enabling the planning of complex actions such as assembling parts, opening containers, or rearranging objects in a specific order. By encoding these articulation parameters into the 3D scene graph, ArtiSG facilitates the generation of motion plans that account for an object’s degrees of freedom and constraints, allowing a robot to perform tasks requiring multi-step manipulations and dynamic adjustments during execution. This contrasts with traditional approaches that treat objects as rigid bodies, limiting the robot to simpler, pre-defined actions.
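
As an illustration of what encoding articulation into the graph could look like, the sketch below attaches a joint record (type, axis, origin, limits) to a functional element and rolls out the waypoints a grasped part would follow. The field names and the Rodrigues-formula rollout are assumptions made for clarity, not ArtiSG's published implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ArticulationModel:
    """Illustrative kinematic attributes for one joint in the scene graph."""
    joint_type: str        # "revolute" or "prismatic"
    axis: np.ndarray       # unit direction of the joint axis, shape (3,)
    origin: np.ndarray     # a point on the axis in the world frame, shape (3,)
    limits: tuple          # (min, max) in radians or meters

    def element_trajectory(self, start: np.ndarray, steps: int = 20) -> np.ndarray:
        """Waypoints a grasped functional element follows as the joint moves
        from its lower to its upper limit (positions only, for brevity)."""
        qs = np.linspace(self.limits[0], self.limits[1], steps)
        if self.joint_type == "prismatic":
            return start[None, :] + qs[:, None] * self.axis[None, :]
        # revolute: rotate the start point about the axis (Rodrigues' formula)
        r = start - self.origin
        k = self.axis / np.linalg.norm(self.axis)
        out = []
        for q in qs:
            rot = (r * np.cos(q)
                   + np.cross(k, r) * np.sin(q)
                   + k * np.dot(k, r) * (1 - np.cos(q)))
            out.append(self.origin + rot)
        return np.asarray(out)
```

A planner can turn such element waypoints into gripper poses that respect the joint's single degree of freedom, which is what distinguishes articulated manipulation from rigid-body pick-and-place.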

This system constructs a functional scene graph for indoor environments by initializing an element-aware representation from multi-view semantics, tracking human manipulation to estimate object articulation, and refining the graph with interaction data to recover hidden functional elements and kinematic attributes.

From Perception to Understanding: Building the Functional Scene Graph

The system leverages pre-trained Vision Foundation Models, specifically Grounding DINO for object detection and Segment Anything Model (SAM) for precise segmentation. Grounding DINO identifies objects within a scene based on textual queries, enabling the system to focus on relevant items. SAM then generates high-quality masks delineating the boundaries of these detected objects, and critically, their functional regions – the specific parts of an object that participate in an action or interaction. This two-stage process allows for both identification and detailed segmentation, providing a richer representation of the visual input than traditional object detection methods.
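
A rough sketch of the segmentation half of this detect-then-segment stage is shown below. The boxes are assumed to come from Grounding DINO prompted with a text query such as "drawer handle"; the SAM calls follow the public segment_anything API, but the glue code is an illustration rather than the paper's pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_functional_regions(image: np.ndarray, boxes: np.ndarray, sam_ckpt: str):
    """image: RGB HxWx3 uint8; boxes: Nx4 (x1, y1, x2, y2) from the detector."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_ckpt)   # load SAM weights
    predictor = SamPredictor(sam)
    predictor.set_image(image)                               # embed the image once
    masks = []
    for box in boxes:
        m, scores, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(m[0])                                   # HxW boolean mask per box
    return masks
```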

Top-k Frame Selection is implemented to optimize processing of video data by reducing redundancy and computational load. This technique analyzes consecutive frames and selects only the k most significantly different frames based on a change detection metric. By discarding highly similar, intervening frames, the system minimizes redundant calculations within the perception pipeline. The value of k is a configurable parameter, allowing for a trade-off between processing speed and the granularity of temporal information retained. This selective frame processing directly lowers the computational cost associated with object detection, segmentation, and tracking, while maintaining sufficient data to accurately represent dynamic scenes.
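
The exact change-detection metric is not specified in the text above, so the sketch below assumes a simple mean absolute frame difference purely to show the selection logic.

```python
import numpy as np

def select_top_k_frames(frames: list, k: int) -> list:
    """Keep the k frames that differ most from their predecessor, measured by
    mean absolute pixel difference (illustrative metric; the paper's exact
    change-detection score is not specified). The first frame is always kept."""
    diffs = [np.inf]  # first frame has no predecessor, so it is always retained
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(np.mean(np.abs(cur.astype(np.float32) - prev.astype(np.float32))))
    keep = sorted(np.argsort(diffs)[-k:])   # indices of the k largest changes, in temporal order
    return [frames[i] for i in keep]
```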

Accurate estimation of articulation trajectories is achieved by integrating three tracking methodologies. CoTracker provides multi-object tracking capabilities, maintaining consistent identities across frames. Mediapipe contributes real-time human pose estimation, identifying key body and hand landmarks. Simultaneous Localization and Mapping (SLAM) techniques are employed to create a 3D map of the environment, providing spatial context and enhancing tracking robustness, particularly in scenarios with occlusion or rapid movement. The combined system leverages the strengths of each component to deliver precise and consistent articulation data over time.
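
One concrete step such a pipeline needs is lifting 2D point tracks into a world-frame 3D trajectory using the SLAM camera poses. The sketch below assumes per-frame depth and camera-to-world transforms are available; it is a simplified illustration of that geometric step, not the paper's fusion scheme.

```python
import numpy as np

def lift_track_to_world(uv_track, depth_track, K, cam_poses):
    """Combine a 2D point track (e.g. from CoTracker or a Mediapipe hand
    landmark) with per-frame depth and SLAM camera poses (4x4 camera-to-world
    matrices) to obtain a 3D articulation trajectory in the world frame."""
    K_inv = np.linalg.inv(K)
    world_points = []
    for (u, v), z, T_wc in zip(uv_track, depth_track, cam_poses):
        p_cam = z * (K_inv @ np.array([u, v, 1.0]))      # backproject into the camera frame
        p_w = T_wc[:3, :3] @ p_cam + T_wc[:3, 3]         # transform to the world frame
        world_points.append(p_w)
    return np.asarray(world_points)                       # shape (T, 3)
```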

Articulated motion data was collected using a handheld UMI gripper with a 26-sided ArUco tracking sphere for 6-DoF pose estimation, and validated against ground truth poses obtained via OptiTrack retro-reflective markers on the cabinet door.

Refining Accuracy: Robustness Through Computational Methods

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is implemented as a filtering stage to improve the accuracy of articulated object segmentation. This algorithm identifies and removes outlier data points, effectively reducing noise that can arise from cluttered environments or imperfect sensor data. By grouping together closely packed points and labeling those that lie alone as outliers, DBSCAN enables robust segmentation even when dealing with incomplete or noisy visual information. This pre-processing step is crucial for maintaining the integrity of subsequent pose estimation and tracking procedures, contributing to overall system performance in complex scenes.
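
A minimal version of this filtering step, using scikit-learn's DBSCAN with illustrative parameter values rather than the paper's, could look like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_segment_points(points: np.ndarray, eps: float = 0.03, min_samples: int = 20):
    """Remove outlier 3D points from a segmented part and keep the largest
    dense cluster. eps and min_samples are illustrative values only."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
    valid = labels[labels != -1]                 # DBSCAN marks noise points with -1
    if valid.size == 0:
        return points[labels != -1]              # everything was noise: empty array
    main = np.bincount(valid).argmax()           # id of the most populated cluster
    return points[labels == main]
```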

Perspective-n-Point (PnP) algorithms are utilized to determine the six-degree-of-freedom pose – three for rotation and three for translation – of the camera relative to a known 3D object or scene. These algorithms take as input 2D image projections of 3D points and their corresponding 3D world coordinates. By solving for the camera’s rotation and translation that best aligns the projected 2D points with their 3D counterparts, PnP enables precise camera pose estimation. Iterative methods, such as the Levenberg-Marquardt algorithm, are commonly employed to refine the solution and minimize reprojection error, thereby improving the accuracy of both camera pose and tracked object position. The quality of the 3D point correspondences and the number of points used significantly impact the robustness and precision of the PnP solution.
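
A compact example using OpenCV's iterative PnP solver, with mean reprojection error as the accuracy check described above (a generic sketch, not the paper's exact ArUco-sphere setup):

```python
import numpy as np
import cv2

def estimate_camera_pose(object_pts, image_pts, K, dist=None):
    """Recover the 6-DoF camera pose from 3D-2D correspondences and report
    the mean reprojection error in pixels."""
    dist = np.zeros(5) if dist is None else dist
    ok, rvec, tvec = cv2.solvePnP(object_pts.astype(np.float32),
                                  image_pts.astype(np.float32),
                                  K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed")
    proj, _ = cv2.projectPoints(object_pts.astype(np.float32), rvec, tvec, K, dist)
    err = np.linalg.norm(proj.reshape(-1, 2) - image_pts, axis=1).mean()
    return rvec, tvec, err
```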

Principal Component Analysis (PCA) is implemented to reduce the dimensionality of visual data, thereby improving both computational efficiency and the robustness of pose estimation. This dimensionality reduction focuses on extracting the most salient features from the input data, discarding less informative components. Quantitative evaluation demonstrates a trajectory Root Mean Squared Error (RMSE) of 1.09 cm for tracking revolute joints in dynamic scenarios. This performance represents a significant improvement over comparative methods, with CoTracker achieving an RMSE of 7.31 cm and Mediapipe resulting in an RMSE of 3.83 cm under identical testing conditions.
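
The text does not detail exactly where PCA is applied, so the plain-NumPy sketch below shows one plausible use: extracting the dominant directions of a tracked trajectory, for example as an estimate of a prismatic joint's translation axis.

```python
import numpy as np

def principal_axes(points: np.ndarray):
    """PCA via SVD: returns principal directions (rows) and explained variances.
    Illustrative use: the first component of a handle trajectory approximates a
    prismatic joint's translation axis (an assumption, not the paper's recipe)."""
    centered = points - points.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / max(len(points) - 1, 1)
    return vt, variance

# usage: dominant axis of a drawer-opening trajectory
# traj = lift_track_to_world(...)            # (T, 3) points, see the earlier sketch
# axes, var = principal_axes(traj)
# prismatic_axis = axes[0]                   # direction explaining most of the motion
```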

This decoupled tracking setup accurately recovers smooth, precise gripper trajectories (visualized in orange) for both prismatic and revolute joint manipulations, as demonstrated by consistent tracking performance across varying viewpoints and highlighted coordinate frames.

Towards True Intelligence: The Future of Robotic Understanding

ArtiSG establishes a crucial stepping stone towards more versatile robotic systems, moving beyond the limitations of robots confined to executing pre-defined instructions. This framework doesn’t simply allow a robot to perform a task; it provides the means for the robot to understand the task’s underlying structure and, crucially, to apply that understanding to new, unforeseen circumstances. By representing environments and actions as interconnected functional elements, ArtiSG facilitates a level of abstraction that allows robots to reason about tasks in a manner analogous to human cognition. This capability isn’t merely about executing commands, but about adapting to variations, recovering from errors, and ultimately, learning from experience – laying the groundwork for robots capable of true autonomous problem-solving and genuine intelligence in dynamic real-world settings.

The development of robotic systems capable of truly understanding and responding to human instruction hinges on bridging the semantic gap between language and action. Recent advances demonstrate that grounding language commands within a functional scene graph – a structured representation of objects and their relationships – facilitates this crucial link. This approach moves beyond simple keyword recognition, allowing robots to interpret what a user wants done, not just how to do it based on pre-defined routines. Consequently, interactions become more intuitive and natural, resembling human-to-human communication where context and implied understanding play significant roles; the robot effectively ‘understands’ the intent behind a request, even if phrased in varied or ambiguous terms, leading to more successful and efficient task completion.
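
One simple way to realize such grounding, assuming the functional element nodes carry precomputed vision-language embeddings (e.g. CLIP features), is a cosine-similarity lookup over the graph. The sketch below is an assumption about the mechanism rather than ArtiSG's confirmed design, and encode_text is a hypothetical helper.

```python
import numpy as np

def ground_command(command_embedding: np.ndarray,
                   node_embeddings: dict) -> str:
    """Return the functional-element node whose precomputed embedding is most
    similar to the command embedding, using cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(node_embeddings, key=lambda k: cos(command_embedding, node_embeddings[k]))

# usage (embeddings from any shared vision-language encoder; encode_text is hypothetical):
# best_node = ground_command(encode_text("open the top drawer"), graph_node_embeddings)
```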

Current research endeavors are directed towards extending the capabilities of this robotic intelligence framework to increasingly intricate and dynamic environments. A central focus lies in enabling lifelong learning, allowing robots to autonomously acquire new skills and refine existing ones over time without explicit reprogramming. Notably, integrating human demonstrations into the learning process has proven highly effective, resulting in a substantial 32.7% improvement in the robot’s ability to correctly identify and recall functional elements within a scene – increasing performance from 55.8% to 88.5%. This suggests a promising pathway towards robots that not only execute commands but also learn from, and adapt to, the nuances of human interaction and the complexities of real-world scenarios.

ArtiSG outperforms baseline methods like Lost&Found and OpenFunGraph in both real and simulated scenes by accurately localizing functional elements (indicated by green dots) with higher recall and precision.

ArtiSG prioritizes distilling complex environments into manageable representations. The framework’s emphasis on encoding human demonstrations to construct functional 3D scene graphs mirrors a core tenet of effective design: simplification. As John McCarthy stated, “Every complexity needs an alibi.” The system doesn’t attempt to model every nuance of a scene; instead, it focuses on the functional relationships between articulated objects, the essential elements for robotic interaction. This targeted approach, prioritizing utility over exhaustive detail, demonstrates an understanding that abstractions age, principles don’t. The resulting scene graphs are robust and viewpoint invariant, a testament to the power of clarity.

What Remains?

The construction of functional scene graphs, as demonstrated by ArtiSG, is not an endpoint, but a distillation. The framework successfully encodes human intention within object manipulation, yet the true challenge lies not in representing articulation, but in accepting its inherent ambiguity. Current methods, even those leveraging demonstration, still presume a singular, correct interpretation of interaction. Future work must address the inevitable noise (the imprecise grasps, the unanticipated collisions) and build systems robust enough to function with uncertainty, not in spite of it.

A critical simplification remains the assumption of complete observability. Human understanding of an object’s state isn’t solely visual; it’s tactile, auditory, and informed by prior experience. To move beyond simulated environments, systems must integrate multi-modal sensing and, crucially, learn to actively request clarification when faced with incomplete information. The pursuit of ‘viewpoint robustness’ is merely a palliative; true intelligence requires a strategy for seeking better views.

Ultimately, the value of this work isn’t in the complexity of the graphs it generates, but in the simplicity of the question it poses: what minimal representation is sufficient for effective interaction? The continued refinement of ArtiSG, and of its successors, will not be measured by the features added, but by the assumptions gracefully removed.


Original article: https://arxiv.org/pdf/2512.24845.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
