Robot Rally: Humanoid Takes on Badminton

Author: Denis Avetisyan


Researchers have successfully trained a humanoid robot to play a full game of badminton, demonstrating advanced whole-body coordination and real-time decision-making.

A curriculum of progressively refined control, leveraging privileged critic observations alongside actor observations and proprioceptive feedback, enables a 1.28-meter, 21-DoF humanoid to learn a whole-body policy $\pi_{WBC}$. The policy dynamically adjusts to a shuttlecock’s trajectory, estimated via an Extended Kalman Filter and path prediction, and executes whole-body actions through a low-level PD controller, anticipating inevitable estimation errors and the inherent fragility of physical systems.

This work details a multi-stage reinforcement learning approach enabling a humanoid robot to predict trajectories and execute complex badminton returns in a real-world setting.

While humanoid robots excel at pre-programmed interactions, adapting to the dynamism of real-world scenarios remains a significant challenge. This is addressed in ‘Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning’, which introduces a novel reinforcement learning pipeline enabling a humanoid robot to play badminton with coordinated full-body movements. Through a carefully designed curriculum and trajectory prediction, the robot achieves successful rallies in simulation and demonstrates on-court hitting with outgoing shuttlecock speeds of up to 10 m/s. Could this approach pave the way for more adaptable and dynamic humanoid robots capable of excelling in similarly complex, real-time environments?


The Illusion of Mastery: Why Agile Play Defies Simple Systems

The swift and unpredictable nature of badminton presents a uniquely demanding challenge for robotic systems. Unlike many sports with relatively predictable ball paths, a badminton shuttlecock’s aerodynamic properties create highly dynamic and erratic trajectories, requiring exceptional agility and precise timing to intercept. A robot must not only react to the shuttlecock’s initial velocity but also continuously adapt to its deceleration and changing direction due to air resistance and spin. This necessitates a level of responsiveness and adaptability that surpasses the capabilities of most current robotic control systems, which often struggle with the coordination of full-body movements at the speeds required to effectively play the game. Successfully mimicking the reflexes and dexterity of a human badminton player demands advancements in areas like real-time trajectory prediction, rapid motor control, and robust balance maintenance.

Conventional robotic control systems frequently encounter difficulties when tasked with intercepting rapidly moving objects due to the intricate coordination required across multiple joints and the need for real-time adjustments. These systems often rely on pre-programmed trajectories or simplified models of motion, proving inadequate for the unpredictable and dynamic nature of a badminton rally. Successfully tracking a shuttlecock—which decelerates rapidly and follows a non-linear path—demands not only precise positioning but also the ability to anticipate its trajectory and react with the necessary speed and accuracy. The sheer complexity of coordinating a robot’s full-body movements – including base rotation, arm extension, and racket angle – while simultaneously accounting for external forces and maintaining balance presents a substantial engineering hurdle. This challenge necessitates advanced control algorithms capable of handling high-dimensional, time-varying dynamics and incorporating sensory feedback for robust and adaptable interception strategies.

Achieving competitive badminton play for robots hinges on the creation of control systems capable of managing the sport’s inherent unpredictability. Unlike pre-programmed robotic tasks, a badminton match demands continuous adaptation to the opponent’s shots – varying speeds, angles, and deceptive movements. A robust system must therefore integrate real-time visual processing to accurately track the shuttlecock’s trajectory, predict its landing point, and instantaneously calculate the necessary full-body movements for interception. This requires moving beyond traditional robotic control methods towards algorithms that prioritize dynamic adaptation and anticipatory control, allowing the robot to not simply react to the shuttlecock’s current position, but to proactively position itself for the next shot. Successfully implementing such a system will not only elevate robotic badminton performance but also advance the broader field of robotics by demonstrating a significant step towards truly agile and intelligent machines.

Twenty robot swings toward a designated hitting position reveal a consistent racket center trajectory as it passes through the z = 1540 mm plane, as shown by the green spheres.

A Phased Ascent: Building Skill from Foundation to Complexity

The robot’s training regimen employs a multi-stage curriculum, prioritizing foundational skills before progressing to more complex maneuvers. Initial training focuses on ‘Footwork Acquisition’, which establishes the robot’s ability to achieve stable interception positioning. This phase involves repetitive drills designed to optimize the robot’s gait, balance, and responsiveness to predicted interception points. Successful footwork is critical, as it provides the necessary platform for subsequent actions, including swing generation and accurate striking; without stable positioning, the robot cannot consistently deliver effective hits. Data collected during footwork acquisition informs adjustments to the robot’s locomotion parameters, ensuring efficient and reliable movement towards interception targets.

Following the acquisition of stable interception positioning, the training curriculum advances to ‘Swing Generation’, a phase dedicated to developing both the accuracy and power of the robot’s hitting mechanism. This stage involves iterative refinement of kinematic parameters, including angular velocity and trajectory optimization, to maximize impact force while maintaining directional control. The curriculum then culminates in ‘Precision Striking’, which focuses on integrating learned swing mechanics with real-time visual data to optimize performance against moving targets; this includes incorporating feedback loops to correct for trajectory deviations and refine strike timing, ultimately enhancing the robot’s ability to consistently and effectively intercept and impact designated objects.

The robot’s learning process is structured around a modular design, enabling incremental skill acquisition. This approach decomposes complex tasks into smaller, manageable sub-problems, allowing the robot to master foundational elements before progressing to more intricate maneuvers. By sequentially building capabilities, the system avoids the computational inefficiencies and instability often associated with attempting to learn all aspects of a complex skill simultaneously. Furthermore, this modularity enhances robustness; if a specific module encounters an unforeseen circumstance, the core functionality of other, independently learned modules remains unaffected, preventing catastrophic failure and facilitating easier debugging and refinement of individual components.
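
As a concrete sketch, such a staged curriculum can be expressed as a sequence of reward configurations gated by performance thresholds. The stage names below follow the article; the reward terms, thresholds, and the `env`/`trainer` interface are hypothetical placeholders rather than the authors’ implementation.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One curriculum stage: which rewards are active and when to advance."""
    name: str
    active_rewards: tuple     # reward terms enabled in this stage
    success_threshold: float  # mean success rate required to advance

# Stage ordering mirrors the article's curriculum; thresholds are invented.
CURRICULUM = [
    Stage("footwork_acquisition", ("interception_position", "balance"), 0.90),
    Stage("swing_generation", ("racket_velocity", "impact_direction"), 0.80),
    Stage("precision_striking", ("landing_accuracy", "strike_timing"), 0.75),
]

def train_curriculum(env, trainer, eval_fn, steps_per_iter=100_000):
    """Advance to the next stage only after the current one is mastered."""
    for stage in CURRICULUM:
        env.set_active_rewards(stage.active_rewards)  # hypothetical API
        while eval_fn(env, trainer.policy) < stage.success_threshold:
            trainer.learn(steps_per_iter)
        print(f"stage '{stage.name}' passed; advancing")
```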

Robot training utilizes filtered shuttlecock trajectories constrained to an interception region of $x \in [-0.8, 0.8]$ m, $y \in [-1, 0.2]$ m, and $z \in [1.5, 1.6]$ m.

Reinforcement Learning: Sculpting Control Through Iteration

Reinforcement Learning (RL) was implemented to develop a whole-body controller for complex locomotion. This controller coordinates the kinematic chains governing both footwork and arm movements, enabling the agent to learn optimal control policies through interaction with a simulated environment. The RL framework allows for the automated discovery of strategies for balancing, stepping, and reaching without explicit pre-programming of these behaviors. The resulting policy maps environmental observations directly to actuator commands, facilitating adaptive and dynamic movement capabilities. The controller’s performance is evaluated based on metrics such as forward velocity, stability margin, and energy efficiency, all optimized through the learning process.
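
The pipeline pairs this learned policy with a low-level PD controller that turns commanded joint targets into torques. A minimal sketch of that lower layer, with illustrative gains (the actual gains and joint layout are not specified in the article):

```python
import numpy as np

N_JOINTS = 21  # the article's humanoid has 21 degrees of freedom

# Illustrative gains; real values are robot-specific and not reported here.
KP = np.full(N_JOINTS, 40.0)  # proportional gain per joint
KD = np.full(N_JOINTS, 1.0)   # derivative gain per joint

def pd_torques(q_target, q, qd):
    """Convert policy-commanded joint targets into joint torques.

    q_target: desired joint positions from the policy (rad)
    q:        measured joint positions (rad)
    qd:       measured joint velocities (rad/s)
    """
    return KP * (q_target - q) - KD * qd

# In a typical setup the policy emits q_target at a low rate (e.g. 50 Hz)
# while this PD loop runs much faster, holding the latest target in between.
```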

The Asymmetric Actor-Critic architecture separates policy learning into two components with deliberately different inputs. The actor, responsible for selecting actions, is parameterized by $\theta$, updated via a policy gradient algorithm, and receives only observations that remain available on the physical robot, such as proprioception and the estimated shuttlecock state. The critic, parameterized by $\omega$, estimates the value function $V(s)$ (or, in some variants, the Q-function $Q(s,a)$) and is trained with a temporal difference (TD) learning approach on privileged observations available only in simulation. This asymmetry improves learning stability by giving value estimation access to low-noise ground truth while keeping the actor deployable: because the critic is discarded after training, the privileged signals never need to be measured on the real robot.
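
A sketch of what this observation asymmetry looks like in PyTorch; the network sizes and observation contents are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the article does not specify exact sizes.
ACTOR_OBS = 64   # proprioception + estimated shuttle state (deployable)
PRIV_OBS = 32    # ground-truth sim quantities (true shuttle state, contacts, ...)
N_ACTIONS = 21   # one target per joint

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ELU(),
                         nn.Linear(256, 128), nn.ELU(),
                         nn.Linear(128, out))

actor = mlp(ACTOR_OBS, N_ACTIONS)       # sees only deployable signals
critic = mlp(ACTOR_OBS + PRIV_OBS, 1)   # additionally sees privileged state

def value(actor_obs, priv_obs):
    # The critic conditions on privileged simulator state for a lower-variance
    # value estimate; only the actor is needed at deployment time.
    return critic(torch.cat([actor_obs, priv_obs], dim=-1))
```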

Domain randomization was implemented as a key component of the training process to address the sim-to-real gap. This technique involves randomly varying simulation parameters – including friction coefficients, mass distributions, and actuator delays – during training, forcing the reinforcement learning agent to learn a robust policy insensitive to these variations. The underlying dynamics of the simulated agent are informed by a simplified 6-DoF Shuttlecock Dynamics Model, which prioritizes computational efficiency while retaining essential characteristics of the system’s behavior. This combination of randomized training and a streamlined dynamics model facilitates the transfer of the learned control policy to a physical robot operating in unmodeled and unpredictable real-world conditions.
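
Both ingredients are easy to illustrate: domain randomization resamples physical parameters each episode, and a simplified shuttlecock model reduces to gravity plus velocity-dependent drag. The ranges and drag coefficient below are invented for illustration; the paper’s 6-DoF model additionally tracks orientation.

```python
import numpy as np

rng = np.random.default_rng()
G = np.array([0.0, 0.0, -9.81])  # gravity

def sample_randomized_params():
    """Resample simulation parameters each episode; ranges are illustrative."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "mass_scale": rng.uniform(0.9, 1.1),     # per-link mass multiplier
        "actuator_delay_s": rng.uniform(0.0, 0.02),
        "drag_coeff": rng.uniform(0.4, 0.7),     # shuttlecock drag, see below
    }

def shuttle_step(p, v, drag_coeff, dt=0.002):
    """One Euler step of a drag-dominated shuttlecock point-mass model.

    Deceleration proportional to |v| * v captures the shuttle's rapid
    slowdown; the paper's 6-DoF model additionally tracks orientation.
    """
    a = G - drag_coeff * np.linalg.norm(v) * v
    return p + dt * v, v + dt * a
```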

Initially, the reinforcement learning framework leveraged an Extended Kalman Filter (EKF) to provide state predictions necessary for policy learning and action selection. The EKF facilitated more accurate estimations of the system’s state, improving the efficiency of the learning process. However, subsequent development shifted towards a prediction-free reinforcement learning approach, eliminating the reliance on state predictions. This transition was motivated by a desire to reduce computational complexity and enhance the robustness of the controller by removing a potential source of error associated with the EKF’s predictive model.
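
For reference, a minimal EKF over a drag-augmented point-mass state, in the spirit of the filter used in the earlier stages; the noise covariances and drag coefficient are illustrative assumptions.

```python
import numpy as np

# State x = [px, py, pz, vx, vy, vz]; the camera measures position only.
H = np.hstack([np.eye(3), np.zeros((3, 3))])  # measurement model
Q = np.eye(6) * 1e-3                          # process noise (illustrative)
R = np.eye(3) * 1e-4                          # measurement noise (illustrative)

def f(x, dt=0.01, kd=0.5):
    """Nonlinear drag dynamics; kd is an illustrative drag coefficient."""
    p, v = x[:3], x[3:]
    a = np.array([0.0, 0.0, -9.81]) - kd * np.linalg.norm(v) * v
    return np.concatenate([p + dt * v, v + dt * a])

def jacobian(x, eps=1e-6):
    """Central-difference Jacobian of f; the drag term makes f nonlinear,
    which is why an *extended* Kalman filter is needed."""
    J = np.zeros((6, 6))
    for i in range(6):
        d = np.zeros(6)
        d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

def ekf_step(x, P, z):
    """One predict/update cycle given a new position measurement z."""
    F = jacobian(x)
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new
```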

This simulation demonstrates two humanoid robots sustaining a 21-return rally, showcasing a Prediction-Free policy that infers optimal impact positions from initial shuttlecock trajectories and a Target-Known policy guided by a predetermined hitting location, both successfully executing impacts as indicated by green spheres.

The Elegance of Reaction: Prediction-Free Control in Action

The system integrates Prediction-Free Control directly into the robot’s Whole-Body Controller, enabling a reactive approach to badminton play. Rather than attempting to forecast the shuttlecock’s trajectory – a computationally intensive process – the controller extracts hitting information solely from immediate, short-term observations of the shuttlecock’s position and velocity. This circumvents the delays inherent in predictive methods, allowing the robot to respond dynamically to the incoming shuttlecock and initiate a swing based on the present state, rather than a predicted future state. By focusing on real-time perception, the system achieves agility and responsiveness crucial for successful rallies, effectively decoupling control from the uncertainties of long-term prediction and simplifying the computational demands on the platform.

Conventional robotic control for dynamic tasks often relies heavily on predicting the future state of objects – a computationally expensive process prone to inaccuracies. This system bypasses that requirement entirely, enabling a significantly lighter computational load and markedly improved reaction times. By directly inferring hitting information from immediate sensory input – short-term observations of the shuttlecock’s current position and velocity – the robot can formulate control actions without first calculating a predicted trajectory. This approach not only streamlines processing but also enhances robustness, as the system isn’t hampered by errors accumulating in long-range predictions, ultimately resulting in a more agile and responsive platform capable of real-time adjustments during play.
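
Concretely, "prediction-free" can be read as: the policy input carries a short window of raw shuttle observations instead of a precomputed impact point, leaving the network to infer the hit implicitly. A sketch under that reading, with a hypothetical window length:

```python
from collections import deque
import numpy as np

HISTORY = 5  # number of recent shuttle observations kept; illustrative

class PredictionFreeObs:
    """Build the policy input from proprioception plus a short history of
    raw shuttle positions/velocities; no predicted impact point is used."""

    def __init__(self):
        self.shuttle_hist = deque(maxlen=HISTORY)

    def update(self, shuttle_pos, shuttle_vel):
        self.shuttle_hist.append(np.concatenate([shuttle_pos, shuttle_vel]))

    def build(self, proprioception):
        # Pad with zeros until the window is full (e.g. right after a serve).
        hist = list(self.shuttle_hist)
        while len(hist) < HISTORY:
            hist.insert(0, np.zeros(6))
        return np.concatenate([proprioception, *hist])
```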

Rigorous testing of the prediction-free control system moved beyond simulation with deployment on a fully functional humanoid platform. Utilizing a ball machine to deliver consistent shuttlecock trajectories and a motion capture system for precise performance analysis, researchers demonstrated the system’s capacity for dynamic, real-time adjustments. This validation confirmed the approach’s effectiveness, enabling the robot to achieve swing speeds of 5.3 meters per second and propel shuttlecocks at outgoing velocities reaching 10 meters per second – performance levels indicative of a highly responsive and agile control system capable of complex athletic maneuvers.

Simulations featuring two identical humanoid robots engaging in a badminton rally have confirmed the system’s ability to maintain a sustained exchange, achieving a rally length of 21 consecutive returns. This demonstrates a robust level of dynamic stability and precise control throughout the interaction. Further analysis reveals a high degree of accuracy in the robots’ hitting mechanics, with position errors averaging just 0.10 meters and orientation errors limited to 0.2 radians. These metrics indicate the system’s capacity to consistently place the shuttlecock within a tight tolerance, paving the way for more complex and realistic robotic sports applications and showcasing the effectiveness of the prediction-free control strategy in a dynamic, interactive setting.

The robot successfully intercepted the shuttle at both ends of the 98 cm × 42 cm interception area, demonstrating consistent performance across the workspace.

The pursuit of embodied intelligence, as demonstrated by this humanoid badminton player, echoes a fundamental truth about complex systems. This work doesn’t build a badminton player; it cultivates one through iterative learning, a process inherently resistant to perfect design. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” Similarly, attempting to pre-program every nuance of a whole-body athletic interaction proves futile. Instead, the robot’s ability to adapt—to learn from countless rallies through reinforcement learning and domain randomization—reveals the inherent limitations of top-down control and the emergent power of bottom-up adaptation. Every successful return isn’t a victory over chaos, but a temporary reprieve, a localized order sustained by the ongoing struggle against inevitable failures.

What Lies Beyond the Net?

The demonstration of a humanoid capable of engaging in the dynamic interplay of badminton is less a culmination, and more a carefully orchestrated postponement of inevitable chaos. This work reveals not the triumph of control, but the exquisite fragility of it. Each successful return is built upon layers of prediction and reaction, a temporary reprieve from the fundamental unpredictability inherent in physical interaction. The system functions, therefore, not by mastering badminton, but by anticipating, and momentarily delaying, its own failure.

Future work will inevitably focus on robustness – expanding the domain of acceptable variance before the cascade begins. Yet, such efforts are merely tactical. The true challenge lies not in perfecting the response, but in accepting the inevitability of breakdown. There are no best practices in robotics – only survivors. The most valuable advancements will likely emerge from embracing imperfection, designing for graceful degradation, and understanding that order is just cache between two outages.

This is not to diminish the accomplishment. Rather, it is a reminder that architecture is how systems postpone chaos. The next generation of humanoid robots will not be defined by their ability to play a game, but by their capacity to learn from their mistakes, to adapt to the unexpected, and to continue functioning—however imperfectly—in the face of constant disruption.


Original article: https://arxiv.org/pdf/2511.11218.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
