Bridging the Gap: A Novel State Estimator for Humanoid Robots

Author: Denis Avetisyan

Researchers have developed a new approach to accurately track the position and orientation of humanoid robots, enhancing their ability to navigate and interact with complex environments.

A hybrid state estimation approach integrates a model-based Iterated Extended Kalman Filter with a data-driven Transformer network to enhance perception and control for the RH5 humanoid robot, effectively merging the benefits of both traditional and learned methodologies.

InEKFormer combines the strengths of the InEKF and Transformer networks to achieve superior state estimation performance and facilitate robust sim-to-real transfer.

Despite the increasing potential of humanoid robots across diverse applications, achieving stable and dynamic bipedal locomotion remains challenging, critically relying on accurate state estimation. This paper introduces InEKFormer: A Hybrid State Estimator for Humanoid Robots, a novel approach that synergistically combines the strengths of an invariant extended Kalman filter (InEKF) with a Transformer network. Experimental results on a real humanoid robot demonstrate that this hybrid architecture outperforms both traditional Kalman filtering and purely learning-based methods. Can this fusion of established state estimation techniques and deep learning unlock more robust and adaptable locomotion for complex robotic systems?

The Foundations of Embodied Perception

For humanoid robots to navigate and interact effectively within complex environments, accurate state estimation – determining the robot’s precise position, orientation, and velocity – is fundamentally critical. Unlike factory robots operating in structured settings, humanoids contend with unpredictable terrains, dynamic obstacles, and incomplete sensory information. Errors in state estimation directly translate to instability during locomotion, inaccurate manipulation of objects, and ultimately, failure to complete intended tasks. Consequently, significant research focuses on developing robust algorithms capable of filtering sensor noise, accounting for the robot’s nonlinear movements, and maintaining a consistent and reliable understanding of the robot’s own state in real-time, thereby enabling truly autonomous and adaptable humanoid operation.

Humanoid robots navigating real-world environments face a significant hurdle in maintaining accurate localization due to the inherent complexities of movement and imperfect sensor data. Traditional localization algorithms, often relying on linear approximations, struggle with the nonlinear dynamics of bipedal locomotion – the constantly changing center of gravity, joint angles, and foot-ground interactions. Simultaneously, sensor noise from cameras, inertial measurement units, and force sensors introduces errors into the robot’s perception of its surroundings. These combined challenges lead to accumulating errors in position and orientation estimation, directly impacting the robot’s ability to maintain balance, plan effective trajectories, and successfully complete tasks. The consequence is often diminished stability and degraded performance, necessitating more robust and adaptable localization strategies that can effectively filter noise and account for these nonlinearities.

The RH5 robot utilizes an inertial measurement unit (IMU), linear joint encoders, and force-torque (FT) sensors to determine its position and interaction forces.

Invariant Extended Kalman Filters: A Geometric Approach

The Invariant Extended Kalman Filter (InEKF) utilizes the mathematical framework of Lie groups and Lie algebras to enhance state estimation. Lie groups describe continuous symmetries present in the system’s state space, such as rotations or translations represented by matrices like $SO(3)$ or $SE(3)$. By propagating the state estimate and covariance matrix using the Lie group structure, the InEKF avoids issues arising from parameterization ambiguities and maintains valid state representations. This approach ensures that the estimated state remains within the physically realizable manifold defined by the Lie group, leading to improved convergence rates and increased accuracy, particularly in high-dimensional or strongly nonlinear systems where traditional Extended Kalman Filters (EKFs) may diverge or produce inaccurate results.

The Invariant Extended Kalman Filter (InEKF) addresses challenges in state estimation arising from nonlinear system dynamics and sensor noise by utilizing the principles of Lie group symmetries. Traditional Extended Kalman Filters (EKFs) linearize nonlinear functions, introducing errors that accumulate over time; the InEKF, however, propagates the state estimate directly on the relevant Lie group manifold. This approach avoids the explicit linearization step and the associated approximations, reducing the impact of both process and measurement noise. Specifically, by preserving the underlying geometric structure of the system – represented by the Lie group and its associated algebra – the InEKF minimizes the covariance matrix’s tendency to become non-positive definite, a common source of divergence in standard EKF implementations. The result is improved accuracy and robustness, particularly in high-dimensional or strongly nonlinear systems where conventional EKF performance degrades significantly.

While InEKF provides a theoretically sound basis for robust state estimation through its exploitation of Lie group symmetries, practical implementations often benefit from data-driven enhancements. Specifically, the performance of the InEKF can be improved by adapting its noise covariance matrices, $Q$ and $R$, based on observed data rather than relying solely on modeled values. Techniques such as adaptive Kalman filtering or the use of machine learning to predict process or measurement noise can refine the filter’s performance, particularly in scenarios with poorly characterized or time-varying noise. Furthermore, data-driven approaches can help to mitigate the effects of model inaccuracies that inevitably arise in real-world applications, leading to more accurate and reliable state estimates.

The InEKFormer model utilizes a gain estimator architecture to improve performance.

Introducing InEKFormer: A Hybrid Architecture for Adaptive Estimation

InEKFormer utilizes a hybrid architecture that integrates the InEKF with a Transformer network to dynamically determine the Kalman gain. The InEKF, an iterative extended Kalman filter, provides an initial gain estimate, which is then refined by the Transformer network. This network is trained on historical system data to learn the relationship between system states, measurements, and optimal gain values. By predicting the Kalman gain based on learned patterns, InEKFormer can adapt to non-linearities and time-varying dynamics that often challenge traditional Kalman filtering methods, improving estimation performance in complex systems where a fixed gain is suboptimal. The Transformer’s attention mechanism allows it to weigh the importance of different system variables when calculating the gain, enabling a more nuanced and accurate response to changing conditions.

The InEKFormer architecture employs a Transformer network to dynamically adjust the Kalman gain, a critical parameter in state estimation. This network is trained on historical data comprising system states, measurements, and associated uncertainties. By learning the relationship between these variables, the Transformer predicts a refined Kalman gain for each time step. This data-driven adaptation allows the filter to optimally weight sensor measurements and model predictions, particularly in scenarios where the system dynamics are non-linear or time-varying, ultimately improving the accuracy and robustness of the state estimate compared to fixed or heuristically tuned gain values.

The InEKFormer architecture integrates model-based filtering, specifically the InEKF, with a data-driven Transformer network to achieve improvements in state estimation performance. Traditional Kalman filtering relies on an accurate system model and noise covariances; however, real-world systems often exhibit complexities not fully captured by these models. By learning from historical data, the Transformer network refines the Kalman gain, allowing the filter to adapt to unmodeled dynamics and sensor noise characteristics. This hybrid methodology combines the extrapolation capabilities of the InEKF with the adaptive learning of the Transformer, resulting in enhanced accuracy and increased robustness to uncertainties in both the system model and measurement data. The resulting state estimates exhibit reduced error variance compared to solely model-based or data-driven approaches.

The InEKFormer architecture utilizes k-unit delays, represented as 1/z^k, within its high-level block diagram to process sequential data.

Validation Through Embodied Experimentation

High-accuracy ground truth data for evaluation was generated using a 7-DoF RH5 robot arm and a Qualisys motion capture system. The Qualisys system, consisting of eight cameras operating at 240 Hz, provided sub-millimeter positional accuracy of reflective markers affixed to the robot’s end-effector and key link locations. This configuration enabled precise tracking of the robot’s kinematics throughout a range of motions, serving as the definitive reference for evaluating the performance of the InEKFormer state estimation network. Data was collected across multiple trajectories, encompassing varying speeds and orientations, to ensure robust and comprehensive validation.

To increase the volume and diversity of training data, the RobotDART simulator was utilized in conjunction with real-world motion capture data. RobotDART enabled the generation of synthetic data representing robot states and movements, effectively expanding the dataset beyond the limitations of physical data acquisition. This augmented dataset, combining both real and simulated data, was then used to train the InEKFormer network, improving its generalization capability and performance across a wider range of operational scenarios. The synthetic data generation process included variations in robot trajectories, environmental conditions, and sensor noise to further enhance the robustness of the trained model.

Training of the Transformer network leveraged both teacher forcing and scheduled sampling techniques to improve convergence and generalization. Teacher forcing initially provided ground truth data as input during training, guiding the network towards correct predictions. Subsequently, scheduled sampling gradually replaced ground truth inputs with the network’s own predictions, increasing robustness to accumulated errors and bridging the gap between training and deployment. Optimization was performed using the Adam optimizer, configured with default learning rates and parameters, to minimize the loss function and refine network weights.

Quantitative evaluation using the RH5 robot and motion capture system demonstrated a significant performance improvement of the InEKFormer over traditional state estimation techniques. Specifically, models trained on real-world data achieved a reduction in Root Mean Squared Error (RMSE) by a factor of 100 when compared to those trained exclusively on simulated data. This indicates a substantial improvement in the accuracy of state estimation when leveraging real-world observations for training, and highlights the InEKFormer’s ability to generalize from limited real-world data to achieve more accurate results than existing methods.

The InEKFormer network exhibits an inference time ranging from 50 to 350 milliseconds when executed on a 4-core Intel i5-3320M processor. This performance metric was determined through direct execution of the trained network and measurement of the time required to process a single input frame and generate a state estimate. The observed variation in inference time is influenced by the complexity of the input data and the resulting computational load, with more complex scenarios generally requiring longer processing times. This timing allows for real-time or near real-time operation in many robotics applications.

Models utilizing both single and multiple trajectory tests, alongside an online InEKF, accurately predict subsequent positions during a walking gait.

Towards Intelligent and Adaptive Robotic Systems

The InEKFormer framework represents a significant step towards creating robotic systems capable of genuine adaptation and intelligence. This novel approach integrates embodied knowledge – accumulated through physical interaction with the world – with the power of Transformer networks, traditionally used in natural language processing. By encoding robot state, actions, and sensory inputs into a unified representation, the InEKFormer allows the robot to predict future states and plan actions more effectively, even in uncertain or changing environments. Unlike conventional methods that often rely on pre-programmed behaviors or extensive data labeling, this framework facilitates continuous learning and allows the robot to generalize its skills to novel situations, promising more robust and versatile performance in real-world applications. The architecture’s ability to reason about its own capabilities and limitations also opens doors for more intuitive human-robot interaction and collaborative problem-solving.

Ongoing research endeavors are directed towards enriching the InEKFormer framework by integrating a broader spectrum of sensor inputs, moving beyond current limitations to encompass tactile sensing, thermal imaging, and advanced audio processing. This multi-modal approach aims to provide robots with a more comprehensive understanding of their surroundings, mirroring human perception and enabling more nuanced interactions with the environment. Simultaneously, investigations are underway to implement and evaluate more sophisticated learning algorithms, including reinforcement learning and meta-learning techniques, to accelerate adaptation and improve performance in unpredictable scenarios. These combined advancements promise to yield robotic systems capable of not only reacting to stimuli, but also proactively anticipating challenges and learning from experience, ultimately fostering true autonomy and resilience.

The development of more robust and independent humanoid robots promises to reshape operations within challenging real-world settings. Anticipated advancements facilitated by this research extend beyond simplified laboratory conditions, envisioning robots capable of navigating and responding effectively to unpredictable changes in environments like hospitals, factories, and even extraterrestrial landscapes. In healthcare, this translates to robots assisting with patient care and logistics with greater dependability; in manufacturing, it enables adaptable automation in dynamic assembly lines; and in exploration, it unlocks the potential for autonomous data collection and sample retrieval in previously inaccessible locations. Ultimately, increased reliability and autonomy represent a crucial step towards integrating robots seamlessly into complex human environments and expanding their utility across a diverse range of applications.

The presented InEKFormer architecture embodies a commitment to holistic system design. The fusion of the InEKF with a Transformer network isn’t merely additive; it acknowledges the inherent trade-offs between computational efficiency and representational power. The system’s performance, particularly its sim-to-real transfer capabilities, suggests a structure that anticipates and mitigates the inevitable leaks in abstraction. One might observe that the architecture doesn’t attempt to solve the state estimation problem in isolation, but rather to manage its complexity through a carefully balanced hybrid approach. As Donald Davies noted, “A system is only as good as its weakest link,” and InEKFormer appears designed to reinforce those critical connections within the broader estimation framework.

Beyond the Horizon

The introduction of InEKFormer, while a demonstrable refinement of state estimation, merely clarifies existing boundaries. The persistent challenge isn’t simply more accurate estimation, but the inherent fragility of monolithic systems. Current approaches, even hybrid ones, treat the robot as a closed entity, neglecting the vital interplay between the physical platform, the sensing modalities, and the environment itself. A truly scalable solution demands an estimator capable of learning and adapting to systemic uncertainties – one that acknowledges the robot isn’t solving a problem in an environment, but with it.

Future work shouldn’t focus on increasingly complex filters or network architectures, but on the decomposition of the state estimation problem. Can we distribute the computational burden, embedding estimation directly into the actuators and sensors? A decentralized, emergent estimation system, built upon principles of redundancy and self-correction, offers a more robust and ultimately more elegant path. The goal isn’t to model reality, but to create a system that gracefully responds to its inherent unpredictability.

Sim-to-real transfer remains a particularly thorny issue. The fidelity of simulation, no matter how impressive, will always be an approximation. The true test of an estimator lies not in its performance within a controlled virtual environment, but in its ability to maintain stability and accuracy when confronted with the messy, uncooperative nature of the real world. A system built on brittle precision is destined to fail; one built on resilient adaptability will endure.

Original article: https://arxiv.org/pdf/2511.16306.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/