Author: Denis Avetisyan
Researchers have developed a novel framework for accurately perceiving and reconstructing articulated object movements from everyday videos captured by wearable cameras.

The PAWS framework utilizes hand-object interaction and foundation models to enable accurate 3D reconstruction from egocentric videos, paving the way for advanced robotic manipulation.
Accurate perception of articulated objects remains a challenge despite advances in 3D scene understanding, largely due to the reliance on labeled datasets and limited scalability. This paper introduces PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos, a novel framework that directly extracts object articulations from hand-object interactions within large-scale, in-the-wild egocentric videos. By leveraging foundation models, PAWS achieves significant improvements in articulation perception and facilitates downstream tasks like robotic manipulation and 3D articulation prediction. Could this approach unlock truly scalable and robust 3D understanding for robotics and beyond?
Deciphering Dynamic Scenes: The Foundation of Intelligent Systems
The ability of robots and augmented/virtual reality systems to convincingly interact with the physical world hinges on accurately interpreting dynamic scenes. However, current methodologies frequently falter when confronted with the unpredictable nature of real-world environments. Unlike controlled laboratory settings, "in-the-wild" scenarios present a cascade of challenges (occlusions, varying lighting conditions, complex object interactions, and a sheer diversity of object types) that overwhelm algorithms trained on simplified datasets. These systems often struggle to differentiate between rigid and articulated object movement, or to predict how forces will propagate through interconnected objects, leading to inaccurate manipulations and unrealistic virtual experiences. Consequently, advancements in robotics and immersive technologies are significantly hampered by this persistent difficulty in robustly perceiving and interpreting complex motion.
Many conventional methods for interpreting visual scenes operate under constraints that rarely hold true in real-world settings. These approaches frequently assume objects possess simple, predictable shapes – like perfect spheres or cubes – and move with uncomplicated trajectories. This simplification, while easing computational burden, drastically reduces their effectiveness when faced with the inherent complexity of deformable objects, articulated limbs, or chaotic environments. Consequently, a robotic system or augmented reality application built upon these foundations will struggle to accurately predict how an object will behave, or to realistically integrate virtual elements into a dynamic, unpredictable scene. The reliance on idealized models creates a significant gap between laboratory performance and the nuanced demands of authentic, everyday interactions, hindering the development of truly adaptable and intelligent systems.
Determining how objects move, that is, their articulation, presents a significant hurdle in enabling machines to truly understand visual scenes. This isn't simply about detecting motion, but discerning how an object moves: is it a hinge rotating around a fixed axis, a sliding motion along a surface, or a more complex, multi-degree-of-freedom movement? Pinpointing the precise axis of rotation, the origin from which movement stems, and the type of articulation are all critical. Current computer vision systems often struggle with this, particularly when faced with real-world complexity: cluttered environments, varying lighting conditions, and objects exhibiting non-rigid movements. Successfully identifying these nuances in articulation is fundamental, as it allows systems to predict future object states, plan interactions, and ultimately build more intelligent and responsive robotic and augmented reality applications.
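The quantities this paragraph enumerates (the motion type, the axis direction, and the origin the axis passes through) fit naturally into a small data structure. A minimal sketch in Python; the class and field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Articulation:
    """Minimal articulation descriptor: motion type plus the axis it acts along."""
    kind: str            # "revolute" (hinge) or "prismatic" (slider)
    axis: np.ndarray     # direction of the rotation/translation axis
    origin: np.ndarray   # a point the axis passes through (unused for sliders)

    def __post_init__(self):
        n = np.linalg.norm(self.axis)
        if n == 0:
            raise ValueError("axis must be non-zero")
        # Normalize so downstream math can assume a unit axis.
        self.axis = self.axis / n


# A cabinet door hinging about a vertical edge, and a drawer sliding outward.
door = Articulation("revolute", axis=np.array([0.0, 0.0, 2.0]),
                    origin=np.array([0.4, 0.0, 0.0]))
drawer = Articulation("prismatic", axis=np.array([1.0, 0.0, 0.0]),
                      origin=np.zeros(3))
```

Keeping the axis normalized at construction time means later code (angle computations, projections) never has to re-check it.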
The progression towards genuinely interactive and intelligent systems is fundamentally constrained by limitations in accurately predicting object articulation. Without a reliable understanding of how objects move – their joints, hinges, and flexible components – robots struggle to manipulate the world effectively, and augmented or virtual reality experiences remain visually and physically unconvincing. Current systems often treat objects as rigid bodies, failing to account for the nuanced dynamics of deformable materials or complex mechanisms. This deficiency impacts a broad range of applications, from robotic surgery and automated assembly to realistic game physics and intuitive human-computer interfaces. Consequently, advancements in articulation prediction are not merely an incremental improvement, but a critical prerequisite for unlocking the full potential of intelligent machines and immersive digital environments, enabling them to respond to, and interact with, the physical world in a natural and meaningful way.

PAWS: A Framework for Unveiling Scene Articulation
The PAWS framework determines scene articulation through the coordinated analysis of hand movements, scene geometry, and linguistic understanding. Specifically, it integrates hand trajectory estimation – which provides data on object interaction – with a process of geometric recovery to reconstruct the 3D environment. This reconstructed scene, combined with the hand trajectory data, is then fed into a Vision-Language Model (VLM) for reasoning. The VLM, leveraging pre-trained foundation models, infers the types of motion occurring and identifies appropriate articulation axes, ultimately enabling the system to understand how objects are being manipulated within the scene.
Hand trajectory estimation within the PAWS framework utilizes Kalman Filtering to provide data regarding object interaction. Kalman Filtering is employed to predict and smooth the 3D position of hands over time, accounting for potential noise and occlusion. This results in a temporally coherent representation of hand movements, which are then analyzed to determine points of contact with objects in the scene. The estimated trajectories are crucial for identifying the type of interaction – such as pushing, pulling, or grasping – and for inferring the intended manipulation of the object. The accuracy of this estimation directly impacts the subsequent geometric recovery and VLM reasoning stages, providing essential spatial and temporal cues for scene-level articulation inference.
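A constant-velocity Kalman filter over 3D fingertip positions, with a chi-square gate that skips implausible measurements, can be sketched as follows. The function name, noise parameters, and gate value are illustrative assumptions; the paper's exact filter design is not reproduced here:

```python
import numpy as np


def smooth_fingertip(zs, dt=1.0 / 30, q=1e-2, r=1e-3, chi2_gate=7.815):
    """Constant-velocity Kalman filter over noisy 3D fingertip observations.

    zs: (T, 3) array of per-frame fingertip positions (may contain outliers).
    Measurements whose innovation Mahalanobis distance exceeds the chi-square
    gate (7.815 is the 95th percentile for 3 degrees of freedom) are skipped,
    leaving a predict-only step. Returns the (T, 3) filtered positions.
    """
    F = np.eye(6); F[:3, 3:] = dt * np.eye(3)   # state: [position, velocity]
    H = np.zeros((3, 6)); H[:, :3] = np.eye(3)  # we observe position only
    Q = q * np.eye(6)                           # process noise
    R = r * np.eye(3)                           # measurement noise

    x = np.zeros(6); x[:3] = zs[0]
    P = np.eye(6)
    out = np.empty_like(zs)
    for t, z in enumerate(zs):
        # Predict.
        x = F @ x
        P = F @ P @ F.T + Q
        # Gate: compare the innovation's Mahalanobis distance to the threshold.
        y = z - H @ x
        S = H @ P @ H.T + R
        if y @ np.linalg.solve(S, y) < chi2_gate:
            K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
            x = x + K @ y
            P = (np.eye(6) - K @ H) @ P
        out[t] = x[:3]
    return out
```

On a trajectory with a single gross outlier, the gated filter outputs the prediction at that frame instead of jumping toward the bad measurement, which is exactly the behavior needed before estimating articulation parameters.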
Geometric recovery within the PAWS framework utilizes multi-view stereo techniques to reconstruct a 3D representation of the scene from RGB images. This reconstruction provides critical spatial context for articulation analysis by establishing the relative positions of objects and their surfaces. The resulting point cloud or mesh allows the system to determine potential articulation axes based on object geometry and connectivity, and to disambiguate ambiguous motion inferences. Specifically, the 3D reconstruction facilitates accurate estimation of contact points and feasible ranges of motion, informing the selection of appropriate articulation parameters for subsequent analysis of object manipulation.
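One concrete way reconstructed geometry constrains articulation: given the relative rigid transform of a part observed in two frames, a hinge's axis direction and a point on the axis can be recovered in closed form. A sketch under that assumption (not the paper's implementation):

```python
import numpy as np


def revolute_axis(R_rel, t_rel):
    """Recover a hinge axis from the relative rigid transform of a part.

    R_rel, t_rel: relative rotation (3x3) and translation (3,) between two
    observed poses of the moving part. The axis direction is the rotation
    axis of R_rel (read off its skew-symmetric part; degenerate near 180
    degrees). A point on the axis is a fixed point of the transform, found
    by solving (I - R_rel) p = t_rel in the least-squares sense, since
    (I - R_rel) is rank-2 for a proper rotation.
    """
    w = np.array([R_rel[2, 1] - R_rel[1, 2],
                  R_rel[0, 2] - R_rel[2, 0],
                  R_rel[1, 0] - R_rel[0, 1]])
    axis = w / np.linalg.norm(w)
    # lstsq returns the minimum-norm solution along the axis's null space.
    origin, *_ = np.linalg.lstsq(np.eye(3) - R_rel, t_rel, rcond=None)
    return axis, origin
```

For a door rotated about a vertical line through a hinge point c, the relative translation is t = c - R c, and the recovered origin is c projected perpendicular to the axis.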
VLM Reasoning within the PAWS framework leverages the capabilities of pre-trained Foundation Models to determine the type of motion being performed and to identify the relevant articulation axes for an object. These models, trained on extensive multimodal datasets, analyze visual input in conjunction with language prompts to categorize actions – such as opening, closing, rotating, or sliding – and subsequently pinpoint the appropriate axis around which the motion occurs. This process is not reliant on explicit 3D pose estimation; instead, the VLM infers articulation based on contextual understanding, effectively translating observed interactions into actionable parameters for scene articulation analysis. The selected articulation axis then guides the reconstruction and representation of the object's movement within the 3D scene.
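Because the VLM's answer must become actionable parameters, a typical integration wraps the model in a prompt template and a strict parser for its structured reply. A hypothetical sketch; the prompt wording, JSON schema, and function names are assumptions, not the paper's actual prompts:

```python
import json

MOTION_TYPES = ("revolute", "prismatic")


def build_articulation_prompt(object_name, hand_motion_summary):
    """Build an illustrative prompt asking a VLM to classify articulation."""
    return (
        f"An egocentric video shows a hand interacting with a {object_name}. "
        f"The tracked fingertip motion is: {hand_motion_summary}. "
        "Classify the articulation as 'revolute' or 'prismatic' and describe "
        "the axis. Reply as JSON with keys motion_type and axis_hint."
    )


def parse_vlm_reply(reply):
    """Validate the model's JSON reply into (motion_type, axis_hint).

    Rejecting out-of-vocabulary labels here keeps downstream geometry code
    from ever seeing a malformed articulation type.
    """
    data = json.loads(reply)
    if data["motion_type"] not in MOTION_TYPES:
        raise ValueError(f"unexpected motion type: {data['motion_type']}")
    return data["motion_type"], data["axis_hint"]
```

The strict parse step is the design choice worth noting: free-form model text is never trusted directly; only a validated label and axis hint flow into the geometric stages.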
![The hand filtering pipeline refines noisy fingertip observations [latex]\mathbf{z}_{t}[/latex] using contact-based trimming, Kalman filtering with [latex]\chi^{2}[/latex] outlier rejection, and RTS smoothing to produce accurate trajectories [latex]\hat{\mathbf{p}}_{t}[/latex] for articulation parameter estimation.](https://arxiv.org/html/2603.25539v1/x3.png)
Demonstrating Robustness: Validation on Benchmark Datasets
PAWS was evaluated using the HD-EPIC and Arti4D datasets to assess its generalization capabilities across varying environmental conditions and complex articulated objects. HD-EPIC presents a large-scale, in-the-wild dataset with diverse scenes and object interactions, while Arti4D focuses on detailed 4D human motion capture data. Performance on both datasets demonstrates PAWS's ability to accurately predict object articulation even with changes in viewpoint, lighting, and background clutter, indicating a robust and adaptable framework for handling real-world scenarios beyond controlled laboratory settings.
The PAWS framework employs iTACO (interactive Topological object CO-segmentation) to establish an initial 3D object estimate from sparse inputs. This estimate is then refined using LoFTR (Local Feature TRansformer), a detector-free feature-matching method utilized for robust feature tracking across frames. LoFTR identifies and matches salient features, enabling the system to maintain object correspondence and accurately track articulated parts even under significant viewpoint changes or occlusions. This combination allows PAWS to effectively handle challenging scenarios where initial estimates may be incomplete or noisy, providing a reliable foundation for subsequent articulation prediction.
Ground truth annotation for the HD-EPIC dataset was facilitated by the Open3D platform, providing tools for interactive 3D point cloud manipulation. Specifically, Open3D enabled annotators to perform precise point selection directly on the 3D data, crucial for defining object boundaries and keypoints. This interactive process supported accurate labeling of object articulation, including identifying joint locations and defining kinematic chains. The platform's capabilities ensured consistency and reduced error in the annotation process, contributing to a high-quality dataset for evaluating articulation prediction algorithms.
Quantitative evaluation demonstrates that PAWS achieves superior articulation prediction accuracy and robustness compared to existing state-of-the-art methods. Specifically, PAWS outperformed baseline approaches, including Articulation3D and ArtiPoint, on benchmark datasets representing challenging in-the-wild scenarios. Performance gains were measured by assessing the percentage of correctly predicted articulation poses and the consistency of predictions across varying viewpoints and lighting conditions. These results indicate that PAWS offers a substantial improvement in handling the complexities of real-world articulated object pose estimation.
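A common way to score articulation-axis predictions of the kind evaluated above is the angular error between predicted and ground-truth axes, summarized as accuracy at a threshold. A sketch; the 10-degree threshold is an illustrative choice, not necessarily the paper's metric:

```python
import numpy as np


def axis_angular_error(pred_axis, gt_axis):
    """Angle in degrees between predicted and ground-truth articulation axes.

    An axis direction is sign-ambiguous (a hinge about +z is the same hinge
    as one about -z), so the absolute dot product is used.
    """
    p = np.asarray(pred_axis, dtype=float)
    g = np.asarray(gt_axis, dtype=float)
    p /= np.linalg.norm(p)
    g /= np.linalg.norm(g)
    return float(np.degrees(np.arccos(np.clip(abs(p @ g), -1.0, 1.0))))


def accuracy_at_threshold(pred_axes, gt_axes, thresh_deg=10.0):
    """Fraction of predicted axes within an angular threshold of ground truth."""
    errs = [axis_angular_error(p, g) for p, g in zip(pred_axes, gt_axes)]
    return float(np.mean([e <= thresh_deg for e in errs]))
```

Reporting accuracy at a fixed angular threshold makes results comparable across methods regardless of how each one parameterizes the axis internally.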
![Our articulation prediction method, demonstrated across four tasks, accurately identifies joint movements (indicated by yellow arrows) and effectively tracks both object and hand trajectories, outperforming existing methods like Articulate-Anything[40] and ArtiPoint[82].](https://arxiv.org/html/2603.25539v1/fig/4_qualitative_1/rh078_scene_2025-04-04-19-18-54-cabinet-middle-0000/ours.png)
Beyond Prediction: Implications for Intelligent Systems
The development of PAWS signifies a notable advancement in robotic cognition, moving beyond simple object recognition towards a deeper understanding of how objects are used. This framework doesn’t merely identify an object; it infers the underlying articulation – the joints, hinges, or flexible components – allowing the robot to predict how the object can be manipulated. This capacity for inferring articulation is crucial because it enables robots to interact with the world in a manner more akin to human intuition, anticipating how objects will respond to force or movement. Consequently, a robot equipped with PAWS isn’t just seeing a collection of shapes, but grasping the potential for dynamic interaction, representing a fundamental shift towards more natural and effective human-robot collaboration and opening doors to complex tasks previously unattainable by automated systems.
The ability of the PAWS framework to infer the articulation of objects – understanding how parts connect and move – unlocks significant potential across multiple fields. In robotic assembly, this means robots can manipulate complex components with greater dexterity, even when visual obstructions are present. Similarly, surgical assistance benefits from a system capable of predicting how tissues and instruments will move, enabling more precise and safer procedures. Perhaps most profoundly, this advancement fosters more natural and effective human-robot collaboration; a robot that understands an object's degrees of freedom can anticipate human actions and respond accordingly, creating a truly intuitive partnership and opening doors to shared workspaces and cooperative tasks previously unattainable.
Ongoing development of the PAWS framework prioritizes scalability to increasingly intricate environments and a wider range of manipulable objects. Researchers are actively working to refine the system's ability to process scenes with greater visual clutter and to accurately infer the articulation of objects exhibiting more degrees of freedom – from complex tools to deformable materials. Crucially, this expansion isn't occurring in isolation; integration with established robotic perception and motion planning pipelines is a key objective. By combining PAWS's strengths in understanding object pose and articulation with robust planning algorithms, the goal is to create a fully integrated system capable of not just seeing how an object moves, but also intelligently acting upon it in dynamic, real-world scenarios.
The potential for robots to truly become integrated into daily life hinges on their ability to not just see the world, but to understand it and react accordingly – a challenge PAWS addresses by fundamentally linking perception with action. This framework moves beyond simple object recognition, enabling robots to infer how objects move and interact, allowing for more fluid and intuitive responses to dynamic environments. Consequently, tasks currently requiring human dexterity or complex programming – from assisting in a cluttered kitchen to collaborating on an assembly line – become increasingly achievable with robotic assistance. The long-term impact suggests a future where robots are not simply tools, but collaborative partners capable of anticipating needs and seamlessly integrating into the fabric of everyday routines, fundamentally reshaping how humans interact with technology and the world around them.

The pursuit of accurately perceiving articulation in the wild, as detailed in this work, echoes a fundamental tenet of understanding complex systems. The framework, PAWS, attempts to bridge the gap between visual input and meaningful interpretation of hand-object interaction. This aligns with Geoffrey Hinton's observation that "Learning is about finding patterns." The system's reliance on foundation models and 3D reconstruction isn't merely about technical achievement; it's an effort to discern underlying patterns within noisy, real-world egocentric videos. By focusing on these patterns, the framework aims to unlock more robust and reliable robotic manipulation capabilities, essentially translating visual data into actionable insights.
What Lies Ahead?
The PAWS framework, by linking visual data of hand-object interaction with the predictive power of foundation models, offers a compelling step toward truly understanding articulation in natural scenes. However, the elegance of 3D reconstruction from egocentric video should not obscure the inherent difficulties. Current reliance on large datasets, while yielding impressive results, raises the question of generalization. How readily does this perception translate to novel objects, lighting conditions, or even subtly different cultural practices of manipulation? The system, for all its sophistication, still perceives what it has been shown, not necessarily what is.
A critical path forward involves moving beyond correlation and toward causal understanding. Simply mapping visual features to articulated movement is insufficient; a robust system must reason about the why of manipulation – the goals, constraints, and affordances that drive action. Integrating principles of physics-based simulation, and incorporating feedback loops that allow the system to test its hypotheses, represents a logical, if challenging, extension.
Ultimately, the true measure of success will not be the fidelity of the reconstruction, but the ability to anticipate. If a predicted articulation fails to materialize, or cannot be explained by a coherent internal model, it doesn't exist.
Original article: https://arxiv.org/pdf/2603.25539.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-27 10:39