Robots Get a Grip: Mastering Articulated Objects in Real Time

Author: Denis Avetisyan


A new system combines sensing and machine learning to allow robots to reliably estimate and manipulate complex, moving objects like tools and deformable parts.

The system demonstrates a capacity for online articulation estimation by autonomously opening cabinet doors that are visually indistinguishable in their closed state, revealing hidden mechanical differences only through interaction – a feat impossible to predict from static observation alone.

This work presents a factor graph-based approach integrating visual, force, and kinematic data with deep learning for robust online estimation and shared autonomy in articulated object manipulation.

While robots struggle with the everyday task of interacting with common articulated objects like drawers and doors, humans perform these actions effortlessly. This paper introduces a novel approach to robotic manipulation, detailed in ‘Online Estimation and Manipulation of Articulated Objects’, which combines learned visual priors with real-time force and kinematic sensing. By fusing these data streams within a factor graph framework grounded in Screw Theory, our system achieves robust online estimation of articulation, enabling successful manipulation even with unknown mechanisms. Could this integrated sensing and analytical modeling pave the way for truly adaptable robotic assistants in complex, unstructured environments?


The Challenge of Articulated Systems: A Foundation for Predictable Manipulation

The reliable manipulation of articulated objects – items comprised of interconnected, movable parts like robotic arms, doors, or even clothing – presents a significant hurdle for robotics researchers. Unlike rigid objects with predictable behavior, these systems possess an infinite number of potential configurations, making accurate prediction of their state incredibly complex. This isn’t simply a matter of tracking position and orientation; it demands an understanding of the relationships between moving parts and how forces applied to one component will affect the entire assembly. Current robotic systems often struggle with this multi-faceted prediction, leading to unstable grasps, failed manipulations, and a limited ability to interact with the dynamic, complex world humans navigate with ease. Consequently, advancements in predicting the configuration of articulated objects are fundamental to building truly versatile and adaptable robots.

Estimating the configuration of articulated objects – items with joints and moving parts like robotic arms or even simple hinges – presents a significant hurdle for robotic systems. Conventional approaches to this problem often rely on simplifying assumptions or struggle with the high dimensionality of possible configurations, leading to inaccuracies in predicting where and how an object will move. These limitations directly impact a robot’s ability to plan reliable grasps; a slightly incorrect estimate of an object’s pose can result in a failed grasp or even damage to the object or the robot itself. Consequently, robust manipulation – the ability to smoothly and effectively interact with these objects – remains a key area of ongoing research, demanding more sophisticated methods for accurately determining articulation parameters and enabling truly adaptable robotic control.

Effective interaction with articulated objects – from robotic assembly to assistive devices – fundamentally depends on a system’s ability to predict object affordance – essentially, what actions the object allows. This isn’t simply recognizing the object, but understanding its potential for movement and how that movement can be utilized. However, realizing this predictive capability requires exceptionally robust state estimation; a precise understanding of the object’s current configuration – the angles of its joints, the tension in its linkages – is paramount. Errors in state estimation directly translate to inaccurate predictions of affordance, leading to failed grasps, clumsy manipulations, and ultimately, an inability to effectively interact with the object. Consequently, advancements in state estimation techniques are inextricably linked to progress in robotic manipulation and the creation of truly versatile, adaptable robotic systems.

The system integrates RGB-D camera input and human click prompts to estimate object articulation, refine a symbolic model of the robot and object, and subsequently solve for an optimal robot trajectory using quadratic programming (QP).

A Probabilistic Framework: Fusing Sensory Data for Articulation Estimation

The system employs a Factor Graph to integrate data from heterogeneous sensor modalities – visual, kinematic, and force sensing – within a probabilistic framework. This graph represents the relationships between observed sensor measurements and the underlying articulation states of the object as a set of factors connected by variables. Visual sensing provides pose and appearance information, kinematic sensing reports joint angles and velocities, and force sensing measures contact forces and torques. The Factor Graph allows for the consistent fusion of these data streams by modeling them as probabilistic constraints, enabling the system to compute a posterior distribution over the articulation states and quantify associated uncertainties. This approach facilitates robust state estimation, even in the presence of sensor noise or occlusions, by leveraging the complementary information provided by each modality.
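
The article does not spell out the implementation, but the fusion idea can be sketched with a generic factor-graph library. The snippet below uses GTSAM (an assumption – the paper may use a different backend) to combine a visual pose prior with a kinematic between-factor and then read back the marginal covariance of the fused estimate; keys, noise values, and measurements are placeholders, and force factors would enter as additional constraints of the same kind.

```python
import numpy as np
import gtsam

# Keys for the handle pose at two time steps.
X0, X1 = gtsam.symbol('x', 0), gtsam.symbol('x', 1)

graph = gtsam.NonlinearFactorGraph()

# Visual factor: a noisy absolute pose detection of the handle.
vision_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 3 + [0.02] * 3))
graph.add(gtsam.PriorFactorPose3(X0, gtsam.Pose3(), vision_noise))

# Kinematic factor: relative motion measured through the gripper while grasping.
kin_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01] * 3 + [0.005] * 3))
delta = gtsam.Pose3(gtsam.Rot3.Rz(0.1), gtsam.Point3(0.0, 0.0, 0.0))
graph.add(gtsam.BetweenFactorPose3(X0, X1, delta, kin_noise))

# Initial guess and nonlinear optimization over the graph.
initial = gtsam.Values()
initial.insert(X0, gtsam.Pose3())
initial.insert(X1, gtsam.Pose3())
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()

# Posterior uncertainty of the fused estimate at the latest time step.
cov = gtsam.Marginals(graph, result).marginalCovariance(X1)
```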

Screw Theory provides a robust mathematical framework for representing the rigid body motions of articulated objects, built around the concept of a ‘screw’, which combines rotational and translational movement into a single entity. This allows the kinematic relationships between links in an articulated system to be modeled concisely and accurately. A screw is fully described by its axis and pitch, representing an infinitesimal rotation about that axis combined with an infinitesimal translation along it; the velocity of a point at position $r$ relative to the axis is $v = \omega \times r + h\omega$, where $\omega$ is the angular velocity and $h$ is the pitch. By representing motions as screws, complex multi-body systems can be analyzed using linear algebra, enabling efficient computation of velocities and forces throughout the kinematic chain and providing a consistent foundation for state estimation.
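
As a concrete illustration of the screw representation, the sketch below builds the 4x4 twist matrix for a screw axis and maps it to a rigid-body transform via the matrix exponential; the hinge location, pitch, and angle are arbitrary example values, not parameters from the paper.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ r == np.cross(w, r)."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def screw_transform(axis, point, pitch, theta):
    """Rigid-body transform produced by moving angle theta along a screw whose
    axis passes through `point` with the given pitch (translation per radian)."""
    w = axis / np.linalg.norm(axis)
    v = np.cross(point, w) + pitch * w     # v = -w x q + h*w (unit twist, q on the axis)
    xi = np.zeros((4, 4))                  # se(3) element (twist) in matrix form
    xi[:3, :3] = skew(w)
    xi[:3, 3] = v
    return expm(xi * theta)                # SE(3) motion via the exponential map

# Revolute joint (pitch 0): a door hinge along z passing through (0.4, 0, 0),
# opened by 90 degrees. A prismatic joint is modeled as a pure-translation
# twist (w = 0, v = sliding direction) instead.
T = screw_transform(np.array([0.0, 0.0, 1.0]), np.array([0.4, 0.0, 0.0]), 0.0, np.pi / 2)
```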

A Deep Neural Network (DNN) is integrated to predict articulation affordance – the range of possible joint configurations – and quantify the associated uncertainty. This DNN receives data representing the current state of the articulated object and outputs a probability distribution over potential articulation states. The predicted uncertainty, expressed as a variance or standard deviation, is then utilized within the Factor Graph framework to weight the contributions of different sensor modalities. Specifically, estimations with lower uncertainty are prioritized, effectively increasing their influence on the final, fused articulation state estimate while downweighting less reliable data. This allows the system to adaptively rely more heavily on sensors providing confident measurements and to mitigate the impact of noisy or ambiguous data from other sources.
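
A minimal sketch of how a network can report both a prediction and its confidence is shown below; the architecture is a toy stand-in rather than the paper’s network, and it assumes per-point features have already been extracted by some backbone. The predicted variance is what would be handed to the factor graph as the measurement covariance of the visual term.

```python
import torch
import torch.nn as nn

class AffordanceHead(nn.Module):
    """Toy stand-in for an affordance network: maps a per-point feature to a
    predicted motion direction (flow) and a log-variance used as confidence."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.flow = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.logvar = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats):              # feats: (N, feat_dim)
        return self.flow(feats), self.logvar(feats)

def gaussian_nll(pred_flow, logvar, gt_flow):
    """Heteroscedastic loss: confident but wrong predictions are penalized,
    while admitting high variance discounts the squared-error term."""
    inv_var = torch.exp(-logvar)
    return (inv_var * (pred_flow - gt_flow).pow(2).sum(-1, keepdim=True) + logvar).mean()

# Downstream, exp(logvar) becomes the measurement covariance of the visual
# factor, so low-confidence predictions carry less weight in the fused estimate.
```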

Through iterative refinement using kinematic measurements and force feedback, the robot successfully opened all four cabinet doors, initially predicting prismatic joints before converging on the correct revolute solutions, even when starting with inaccurate estimates.

Optimization and Refinement: Achieving Precise State Estimation Through Mathematical Rigor

Factor Graph Optimization (FGO) is employed to estimate the state of an articulated object by minimizing an error function that quantifies the discrepancy between predicted and observed measurements. This minimization iteratively refines the object’s pose and configuration. A key component of the evaluation is Tangent Similarity, a metric that assesses estimation accuracy by measuring how well the predicted and ground-truth transformations align in the tangent space of the pose manifold. It is computed as the normalized inner product of the two tangent-space motions, providing a robust measure of directional error even in the presence of rotational uncertainty: $\text{Tangent Similarity} = \frac{\langle \text{Predicted Transformation},\, \text{Ground Truth Transformation} \rangle}{\lVert \text{Predicted Transformation} \rVert \, \lVert \text{Ground Truth Transformation} \rVert}$. The optimization then adjusts the object’s state to reduce the residual error, leading to a more accurate and consistent estimate of its configuration.
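
Read as a normalized inner product, the metric reduces to a cosine similarity over tangent-space motions and can be computed directly; the 6-D twist parameterization below is an assumption about how the transformations are vectorized, not a detail taken from the paper.

```python
import numpy as np

def tangent_similarity(pred_twist, gt_twist, eps=1e-9):
    """Cosine similarity between predicted and ground-truth tangent-space
    motions (6-D twists); 1.0 means the estimated direction is exact."""
    pred, gt = np.asarray(pred_twist, float), np.asarray(gt_twist, float)
    return float(pred @ gt / (np.linalg.norm(pred) * np.linalg.norm(gt) + eps))

# A hinge estimated with a slightly tilted rotation axis still scores near 1.
print(tangent_similarity([0, 0, 1, 0, 0, 0], [0, 0.05, 1, 0, 0, 0]))
```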

Sim-to-Real transfer techniques address the domain gap between simulated environments and real-world robotic deployments by reducing the discrepancy in data distributions. This is achieved through methods like domain randomization, where simulation parameters are varied during training to force the model to learn robust features, and domain adaptation, which involves modifying the model or data to align the simulated and real distributions. Successful Sim-to-Real transfer enables the trained state estimation framework to generalize effectively to real-world sensor data, mitigating the need for extensive real-world data collection and retraining, and facilitating deployment on physical robots without significant performance degradation.
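
A typical domain-randomization pass over simulated point clouds might look like the sketch below; the noise levels, dropout rates, and jitter ranges are illustrative choices, not the values used in the paper.

```python
import numpy as np

def randomize_sample(points, rng):
    """Apply simple domain randomization to a simulated point cloud so a
    trained network becomes less sensitive to the simulator's exact rendering."""
    pts = points.copy()
    pts += rng.normal(0.0, rng.uniform(0.001, 0.01), pts.shape)   # sensor/depth noise
    keep = rng.random(len(pts)) > rng.uniform(0.0, 0.3)           # random dropout (partial views)
    scale = rng.uniform(0.9, 1.1)                                 # scale jitter
    angle = rng.uniform(-np.pi, np.pi)                            # random yaw of the camera frame
    R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
    return (pts[keep] * scale) @ R.T

rng = np.random.default_rng(0)
augmented = randomize_sample(np.random.rand(2048, 3), rng)
```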

The PartNet-Mobility Dataset serves as a comprehensive resource for training the Deep Neural Network (DNN) used in state estimation. It contains several thousand articulated 3D CAD models spanning dozens of everyday object categories, each annotated with part segmentations and joint parameters that support physics-based simulation of object mobility and interaction. The combination of geometric and kinematic data allows the DNN to learn robust representations of object affordance – the actions an object can perform or enable – and to estimate the uncertainty of its predictions. The scale and diversity of PartNet-Mobility improve the network’s ability to generalize to novel objects and environments, contributing to more reliable state estimation in complex scenarios.
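
Generating diverse training states from such assets is straightforward in PyBullet, which the article’s visualizations also use. The sketch below loads one object and randomizes its joint configuration; the folder layout and object id are hypothetical, but PartNet-Mobility distributes each object as a URDF whose joints can be enumerated this way.

```python
import numpy as np
import pybullet as p

# Hypothetical path; PartNet-Mobility ships one URDF per object folder.
URDF = "partnet_mobility/100051/mobility.urdf"

p.connect(p.DIRECT)                       # headless physics server
obj = p.loadURDF(URDF, useFixedBase=True)

# Enumerate the articulated joints and randomize their configuration to
# produce diverse poses for training the affordance network.
rng = np.random.default_rng(0)
for j in range(p.getNumJoints(obj)):
    info = p.getJointInfo(obj, j)
    joint_type, lower, upper = info[2], info[8], info[9]
    if joint_type in (p.JOINT_REVOLUTE, p.JOINT_PRISMATIC) and upper > lower:
        p.resetJointState(obj, j, rng.uniform(lower, upper))
```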

Simulation results demonstrate that our neural network accurately predicts articulation flow (red lines) from input point clouds (blue) for objects within the PartNet Mobility dataset, as visualized in a PyBullet rendering with applied pulling directions.

Shared Autonomy and Real-World Deployment: Towards Robust Human-Robot Collaboration

This research introduces a framework centered on shared autonomy, a collaborative approach where a human operator and the KUKA LBR iiwa robot work in tandem to accomplish intricate manipulation tasks. Rather than fully autonomous operation or direct teleoperation, the system intelligently distributes control between human expertise and robotic precision. The human operator retains high-level command and oversight, guiding the robot through complex maneuvers, while the robot handles the detailed execution and maintains stability. This division of labor is particularly beneficial in scenarios demanding adaptability and problem-solving skills beyond the capabilities of current automation, offering a more flexible and intuitive interface for complex robotic tasks and ultimately paving the way for broader real-world deployment in manufacturing, assembly, and other dynamic environments.

Effective manipulation of complex objects demands a robot’s ability to discern individual components, even when those components are moving relative to one another. This system achieves this through robust robot segmentation, a process where the robot’s vision system isolates articulated objects – those with joints, like chains or flexible tools – from the surrounding environment. By identifying each segment and tracking its movement independently, the system significantly improves manipulation success rates. This isn’t simply object detection; it’s about understanding an object’s structure and dynamics, allowing for precise control and coordinated movement during tasks such as assembling complex parts or manipulating deformable materials. The ability to isolate and manage articulated objects opens possibilities for automating tasks previously considered too intricate for robotic systems.

The implementation of CasADi, a symbolic mathematics and optimization software, proves central to achieving real-time control and optimization capabilities when paired with the KUKA LBR iiwa robot. This software facilitates the rapid prototyping and deployment of complex control algorithms, enabling the robot to dynamically adjust its movements based on environmental feedback and task demands. By automatically generating efficient code for optimization problems, CasADi bypasses the need for manual code tuning, significantly reducing development time and improving system performance. This capability is crucial for real-world applications where unpredictable scenarios require immediate and precise robotic responses, showcasing the practicality of the developed framework beyond simulated environments and validating its potential for industrial adoption and human-robot collaboration.
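
As a flavor of how such an optimization is posed, the sketch below uses CasADi’s Opti stack to solve a small differential-kinematics problem: choose joint velocities that realize a desired handle twist subject to velocity limits. The Jacobian, limits, and solver choice are placeholders rather than the paper’s actual formulation, which would use the refined symbolic robot/object model and a dedicated QP solver.

```python
import numpy as np
import casadi as ca

n_joints = 7
# Placeholder Jacobian; in practice it comes from the symbolic robot model.
J = ca.DM(np.random.default_rng(0).standard_normal((6, n_joints)))
twist_des = ca.DM([0.0, 0.0, 0.05, 0.0, 0.0, 0.0])   # desired handle twist

opti = ca.Opti()
dq = opti.variable(n_joints)                          # joint velocities to solve for

# Track the desired twist while regularizing toward zero motion.
opti.minimize(ca.sumsqr(ca.mtimes(J, dq) - twist_des) + 1e-3 * ca.sumsqr(dq))
opti.subject_to(opti.bounded(-0.5, dq, 0.5))          # joint velocity limits

opti.solver("ipopt")       # a dedicated QP solver can be plugged in instead
sol = opti.solve()
print(sol.value(dq))
```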

Experiments using revolute and prismatic joints demonstrate that the estimated articulation – represented by angular velocity $\boldsymbol{\omega}$ (yellow arrows) and linear velocity $\mathbf{v}$ (red arrows) relative to the base frame $\mathtt{W}$ – accurately reflects the estimated pose $\mathbf{T}_{\mathtt{A}}$, despite relying only on joint encoder sensing, with motion capture used solely as ground truth.

The presented system navigates the inherent complexities of articulated object manipulation by prioritizing a provable, mathematically grounded approach. It’s a system built not just to work on a given set of tasks, but to reliably estimate and adapt to the unknown-a pursuit mirroring the elegance of a well-defined invariant. As Ken Thompson famously stated, “If it feels like magic, you haven’t revealed the invariant.” This sentiment encapsulates the work’s emphasis on transparent, robust estimation via factor graphs and deep learning; the system strives for demonstrable correctness, moving beyond empirical success to achieve a deeper understanding of the manipulated objects and their mechanisms. The integration of visual, force, and kinematic sensing provides the necessary data, but it’s the underlying mathematical framework that transforms data into demonstrable truth.

What’s Next?

The presented work, while a step toward accommodating the inherent messiness of the physical world, merely illuminates the depth of the challenges remaining. The reliance on deep learning, however strategically applied, introduces a familiar compromise. While effective at generalizing from observed data, such approaches offer little in the way of provable guarantees regarding manipulation stability or even accurate state estimation under unforeseen circumstances. The system functions; a crucial first step, but one predicated on empirical success, not mathematical necessity.

Future effort must address the fundamental disconnect between perception and control. The current paradigm largely treats estimation and manipulation as sequential processes. A more elegant, and ultimately robust, solution will necessitate their unification – a simultaneous optimization of state and action, guided by a formally verifiable model of both the object’s kinematics and the robot’s dynamics. To claim true autonomy, the system must move beyond reactive adaptation and demonstrate anticipatory control-predicting, rather than merely responding to, external disturbances.

The pursuit of shared autonomy, too, remains fraught with difficulty. Human intuition, while often effective, is notoriously difficult to formalize. Bridging this gap requires a careful consideration of information theory – what information must be conveyed to the human operator, and in what form, to maximize their ability to guide the robot safely and efficiently. The ultimate goal is not simply to offload tasks, but to augment human capabilities with the precision and repeatability of robotic systems – a feat demanding far more than incremental improvements in existing techniques.


Original article: https://arxiv.org/pdf/2601.01438.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
