Cooperative Robots: A Single Policy for Teamwork of Any Size

Author: Denis Avetisyan


Researchers have developed a new framework allowing multiple robots to collaborate on complex object manipulation tasks, regardless of the number of team members.

The TeamHOI framework establishes coordinated multi-agent behavior through a transformer-based policy network utilizing alternating self- and cross-attention, enabling a unified approach to teamwork across varying team sizes, and further refines motion realism and skill diversity via a masked AMP strategy that blends full-body and object-interaction-based discriminators.

TeamHOI leverages Transformer networks and masked motion priors to achieve decentralized control in physics-based simulations of human-object interaction.

Despite advances in physics-based control, scaling cooperative human-object interaction (HOI) to variable team sizes remains a significant challenge. This paper introduces ‘TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size’, a framework enabling a single decentralized policy to govern multiple humanoids in collaborative object manipulation. By leveraging a Transformer-based network with teammate tokens and a masked Adversarial Motion Prior, TeamHOI achieves scalable coordination and realistic motion, even with limited cooperative HOI data. Could this approach unlock more robust and adaptable multi-agent systems for complex real-world tasks?


The Challenge of Scalable Cooperative Systems

The development of truly cooperative multi-agent systems faces a significant hurdle: maintaining robust coordination as team size fluctuates. While small groups can often achieve alignment through simple communication and pre-programmed behaviors, scaling these approaches to larger, more dynamic teams introduces immense complexity. Each additional agent exponentially increases the number of possible interactions and required adjustments, quickly overwhelming traditional centralized control methods. Consequently, researchers are actively exploring decentralized strategies – where agents rely on local information and peer-to-peer communication – to foster adaptable and scalable coordination. The ability for a team to seamlessly incorporate or lose members without compromising performance is crucial for real-world applications, ranging from robotic swarms performing search and rescue operations to autonomous vehicles navigating complex traffic scenarios, making this a central challenge in the field.

Conventional methods for multi-agent coordination often falter when faced with fluctuating team sizes and compositions. These approaches typically rely on centralized planning or explicitly programmed strategies, which become computationally prohibitive as the number of agents increases. More critically, such systems struggle to generalize: a strategy effective for a team of three may fail outright with five or seven agents, requiring extensive re-programming or re-training for each new scenario. This lack of adaptability stems from the difficulty of representing and predicting the complex interactions that emerge within dynamic teams, hindering the development of robust and scalable cooperative systems.

Cooperative physical tasks, such as collaboratively lifting a heavy object, necessitate a nuanced understanding of contact dynamics and stable formations amongst agents – a challenge that intensifies dramatically as team size increases. While small groups can often achieve coordination through relatively simple strategies, scaling these approaches to larger teams introduces exponential complexity in both planning and execution. Maintaining stable contact requires precise adjustments based on individual agent positions, forces, and the object’s shifting center of gravity, all while accounting for potential disturbances. Current learning methods struggle to generalize these complex interactions efficiently, often requiring extensive training for each new team configuration or environmental variation, hindering the development of truly scalable multi-agent coordination systems.

Our method achieves synchronized and stable table manipulation with both four and eight agents, demonstrably outperforming the CooHOI* baseline, which exhibits only limited cooperative behavior; the red trajectory line indicates table movement, and the black dot marks its final position.

TeamHOI: A Decentralized Policy for Flexible Collaboration

TeamHOI employs a Transformer-based Policy Network to facilitate cooperative behavior by enabling agents to model the states of their teammates. This network architecture allows each agent to attend to the states of other agents within the team, effectively creating a representation of the overall team state. The Transformer’s attention mechanism processes teammate states as input, producing contextualized embeddings that inform the agent’s decision-making process. This internal modeling of teammates’ states enables agents to anticipate actions, coordinate strategies, and adapt to changing team dynamics, moving beyond simple reactive behaviors to proactive cooperation.

TeamHOI employs ‘Teammate Tokens’ as a core mechanism for inter-agent communication and coordination. These tokens function as embeddings representing the observed state of each teammate, allowing the policy network to condition its actions on the states of others. Specifically, each agent receives a fixed-length vector summarizing the relevant information about its teammates, regardless of the total team size. This approach avoids the need for variable-length input or complex attention mechanisms that scale with team size, ensuring the policy remains computationally efficient and generalizable to teams ranging from 2 to 8 agents. The use of these tokens enables agents to effectively reason about the intentions and capabilities of their teammates, promoting cooperative behavior without explicit communication protocols.
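The paper’s actual network is not reproduced here, but the size-invariance idea can be illustrated with a minimal attention-pooling sketch in plain Python: the ego agent’s query attends over however many teammate embeddings are present and always returns one fixed-length token. All names and dimensions below are illustrative, not TeamHOI’s real architecture.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def teammate_token(ego_query, teammate_states):
    """Cross-attention pooling: the ego agent's query attends over an
    arbitrary number of teammate state embeddings and returns a single
    fixed-length vector, independent of team size."""
    d = len(ego_query)
    scores = [dot(ego_query, s) / math.sqrt(d) for s in teammate_states]
    weights = softmax(scores)
    # Weighted sum of teammate states -> one fixed-length summary token.
    return [sum(w * s[i] for w, s in zip(weights, teammate_states))
            for i in range(d)]

# The same ego query yields a same-sized token for 2 or 7 teammates.
tok_small = teammate_token([1.0, 0.0], [[0.5, 0.5], [0.2, 0.8]])
tok_large = teammate_token([1.0, 0.0], [[0.1 * k, 0.2] for k in range(7)])
assert len(tok_small) == len(tok_large) == 2
```

Because the pooled token has the same shape regardless of how many teammates contribute to it, the downstream policy never needs to know the team size explicitly.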

The TeamHOI architecture achieves scalability and generalization by designing agent policies independent of team size. Traditional multi-agent reinforcement learning often requires retraining policies for each new team configuration, limiting adaptability. By abstracting teammate information into ‘Teammate Tokens’ and processing these tokens with a Transformer network, the policy learns to reason about relationships between agents rather than specific agent identities or numbers. This decoupling allows a single trained policy to effectively coordinate teams ranging from two to eight agents without performance degradation, demonstrated through experimental results and quantitative analysis of cooperative task completion rates across varying team sizes.

Team-size normalization consistently yields higher task rewards compared to global advantage normalization throughout training.

Robust Training and Stabilization Methodologies

TeamHOI training was conducted within Isaac Gym, a physics simulator developed by NVIDIA, to facilitate accelerated data generation and parallelization. This simulation environment enables the creation of diverse training scenarios and the collection of large datasets at a significantly reduced computational cost compared to real-world robotic experimentation. Isaac Gym’s parallelization capabilities were leveraged to simultaneously simulate multiple training instances, substantially decreasing the overall training time required to achieve robust policies. The use of a physics simulator also allows for precise control over environmental variables and repeatable experiments, crucial for consistent performance evaluation and algorithm development.

Advantage Normalization was implemented to address performance instability during training with varying numbers of cooperative agents. This technique normalizes the advantage function – the difference between the observed reward and the expected reward – by dividing it by an estimate of its standard deviation. This normalization effectively scales the learning signal, preventing excessively large or small updates that can occur when team size changes, and thereby mitigates performance degradation. By maintaining a consistent learning scale regardless of the number of agents, Advantage Normalization ensures stable and reliable training across diverse team configurations.
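As a concrete illustration, here is a minimal plain-Python sketch of advantage normalization grouped by team size. Grouping by team size is an assumption about how the paper’s team-size normalization might be realized, not the authors’ actual implementation; the sample format is likewise illustrative.

```python
def mean_std(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, var ** 0.5

def normalize_advantages(samples, eps=1e-8):
    """Normalize advantages to zero mean and unit std WITHIN each
    team-size group, so batches dominated by one team size do not
    distort the learning signal for the others. `samples` is a list
    of (team_size, advantage) pairs (an illustrative format)."""
    groups = {}
    for size, adv in samples:
        groups.setdefault(size, []).append(adv)
    stats = {s: mean_std(vs) for s, vs in groups.items()}
    return [(s, (a - stats[s][0]) / (stats[s][1] + eps))
            for s, a in samples]
```

Normalizing per group keeps the gradient step size comparable whether an agent trained in a pair or in a team of eight, which is the stability property the text describes.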

The Principal-Axes Coverage Reward was incorporated into the training regime to promote stable multi-agent formations during object manipulation. This reward function incentivizes agents to align their movements with the object’s principal axes – the axes of maximum variance – effectively reducing instability and promoting coordinated action. Implementation of this reward resulted in a near-perfect success rate – consistently exceeding 99% – across all tested team sizes when operating under normal weight conditions, demonstrating its efficacy in stabilizing team behavior and ensuring reliable performance regardless of the number of participating agents.
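The paper’s exact reward formula is not given here; the following is a plausible sketch under two stated assumptions: the object’s principal axes are already available (e.g., from its bounding box), and coverage along an axis is measured as the span of the agents’ grasp-point projections, clipped to a unit budget.

```python
def coverage_reward(agent_offsets, axes):
    """Hypothetical principal-axes coverage reward: project each agent's
    grasp-point offset (relative to the object's center) onto each
    principal axis, and reward the spread of those projections.
    Larger spread -> more balanced support around the object."""
    total = 0.0
    for axis in axes:
        projections = [sum(o * a for o, a in zip(off, axis))
                       for off in agent_offsets]
        span = max(projections) - min(projections)
        total += min(span, 1.0)  # clip each axis's contribution
    return total / len(axes)

# Two agents at opposite ends of the x-axis cover that axis fully
# but leave the y-axis uncovered, giving a mid-range reward.
r = coverage_reward([(-0.5, 0.0), (0.5, 0.0)], [(1.0, 0.0), (0.0, 1.0)])
assert abs(r - 0.5) < 1e-9
```

A reward shaped this way pushes agents to distribute themselves around the load rather than clustering on one side, which matches the stabilizing effect the text reports.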

IsaacGym facilitates flexible team-size training by temporarily suspending extra agents at a dummy ceiling, effectively removing them from the learning process without altering the environment’s dynamics.

Enhancing Exploration and Diversity with Masked AMP

The pursuit of increasingly versatile robotic movement necessitated a technique to encourage exploration beyond initially successful strategies. To this end, an adversarial motion prior, termed Masked AMP, was integrated into the learning process. This approach leverages a discriminator network trained to distinguish between realistic and generated motions, effectively guiding the agent towards more natural and varied behaviors. By introducing an adversarial element, the system is compelled to continuously refine its movements, escaping local optima and fostering a broader range of learned skills. The result is a more robust and adaptable agent capable of handling a wider variety of tasks and environments, moving beyond the limitations of purely reward-driven learning.

The study implemented a technique of strategically masking portions of the agent’s body during the discriminator update phase, effectively prompting the system to explore a more diverse set of potential behaviors. This approach prevents the discriminator from overly focusing on specific, potentially limiting, configurations of the agent, and instead forces it to evaluate movement validity based on partial observations. By obscuring certain body parts, the system is encouraged to generalize its understanding of plausible motion, leading to the discovery of novel and robust strategies that might otherwise be overlooked. This method cultivates a broader understanding of acceptable movement, ultimately enhancing the agent’s adaptability and performance in varied and complex environments.
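A minimal sketch of this masking step in plain Python: whole body-part feature groups are zeroed out before the observation reaches the discriminator. The part names, slice layout, and keep probability below are illustrative assumptions, not the paper’s actual configuration.

```python
import random

def mask_body_parts(obs, part_slices, keep_prob=0.7, rng=random):
    """Randomly zero out entire body-part feature groups in a flat
    observation vector before it is fed to the AMP discriminator.
    `part_slices` maps a (hypothetical) part name to its (start, stop)
    index range in `obs`. Masking forces the discriminator to judge
    motion realism from partial observations."""
    masked = list(obs)  # copy; never mutate the caller's observation
    for name, (start, stop) in part_slices.items():
        if rng.random() > keep_prob:
            for i in range(start, stop):
                masked[i] = 0.0
    return masked
```

Passing a seeded `random.Random` instance as `rng` makes the masking pattern reproducible, which is convenient when debugging discriminator updates.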

The integration of reference motions with Masked AMP serves as a powerful regularization technique, guiding the learning process toward more plausible and resilient movements. By providing the discriminator with examples of desired motion characteristics, the system doesn’t merely seek to distinguish between real and generated movements, but specifically assesses how closely the generated actions align with natural biomechanics. This nuanced feedback encourages the agent to not only explore a diverse range of behaviors – as facilitated by the masking process – but also to ground those explorations in physically realistic and coordinated actions. Consequently, the resulting movements are demonstrably more stable, efficient, and visually convincing, exhibiting a robustness that extends beyond the training environment and improves generalization to novel scenarios.

Masking the Adversarial Motion Prior (AMP) discriminator inputs during training demonstrably improves both task reward and the success rate of hand-object interactions.

Zero-Shot Generalization and Broad Applicability: A Path Forward

TeamHOI exhibits a remarkable capacity for zero-shot generalization, successfully navigating cooperative carrying tasks even when presented with entirely new scenarios. The system doesn’t require specific training for each table shape or team composition; instead, it applies learned principles to unfamiliar geometries and varying numbers of agents. This adaptability was demonstrated through successful performance on previously unseen table configurations and with team sizes differing from those used during training. The framework’s ability to extrapolate from limited experience suggests a robust underlying understanding of cooperative behavior, rather than simply memorizing specific solutions, opening avenues for deployment in dynamic and unpredictable real-world environments where pre-training on every possible condition is impractical.

TeamHOI’s architecture isn’t limited to a single method of interacting with the carried object; it demonstrates a remarkable adaptability to multiple affordance behaviors. This means the agents can dynamically adjust their grip and cooperative strategy based on subtle changes in the object’s state or the environment. Rather than being programmed for a specific carrying style, the framework allows for diverse interactions: agents might shift from a balanced, two-handed carry to a more precarious, one-handed maneuver to navigate obstacles, or redistribute weight based on perceived instability. This flexibility isn’t merely reactive; the system anticipates and proactively adjusts to various potential interactions, fostering a more robust and versatile approach to cooperative manipulation and setting the stage for application in more complex, real-world scenarios.

TeamHOI’s architecture establishes a compelling blueprint for the development of multi-agent systems capable of thriving in dynamic and unpredictable environments. Beyond the successful execution of a cooperative carrying task, the framework demonstrates an ability to maintain coherent collaboration even as team size expands to sixteen agents – a significant scaling achievement. Critically, this consistent performance extends to increasingly challenging physical demands, with the system sustaining success even when subjected to a fivefold increase in the weight of the carried object. This combination of robust generalization and adaptability suggests that TeamHOI’s underlying principles can be applied to a diverse range of collaborative tasks, offering a pathway toward more versatile and resilient artificial intelligence systems.

Using a single set of human reference motions, the policy learns diverse affordance behaviors, enabling agents both to walk in various directions and to adapt to different object manipulation strategies, such as side-holding or edge-lifting.

The pursuit of a unified policy, as demonstrated by TeamHOI, echoes a fundamental principle of algorithmic elegance. The framework’s capacity to adapt to variable team sizes in cooperative manipulation tasks isn’t merely about achieving functionality; it’s about establishing predictable boundaries within a complex system. This aligns with Fei-Fei Li’s observation: “AI is not about making machines smarter, it’s about making humans more capable.” TeamHOI exemplifies this by shifting the focus from individual agent intelligence to a cohesive, scalable system: a predictable and provable approach to human-object interaction, regardless of team composition. The innovation resides not in complex individual behaviors, but in the consistent application of a unified policy.

What’s Next?

The elegance of TeamHOI lies in its ambition: a single policy governing collective manipulation. However, the devil, as always, resides in the details. While scaling to variable team sizes is a commendable step, the underlying assumption of a homogeneous team feels optimistic. Real-world collaboration rarely features perfect agents. Future work must address the inevitable heterogeneity: varying skill levels, imperfect sensing, and the occasional agent failure. If the system falters when a robot simply doesn’t comply, the illusion of robustness quickly dissipates.

Furthermore, the reliance on a masked motion prior, while currently effective, begs the question of generalization. The system learns to predict likely motions, but what happens when confronted with truly novel interactions? If it feels like magic when the system anticipates a teammate’s action, it hasn’t revealed the invariant – the fundamental principle governing successful cooperation. A mathematically rigorous understanding of these invariants, rather than purely empirical observation, remains the ultimate goal.

Ultimately, this framework’s success hinges on bridging the gap between simulated physics and the messy reality of human-robot interaction. The true test won’t be manipulating virtual objects with perfect precision, but gracefully handling unexpected disturbances – a dropped tool, a momentary lapse in attention, or the simple, unpredictable nature of human intent. Until then, the pursuit of a provably correct cooperative policy remains a compelling, if elusive, challenge.


Original article: https://arxiv.org/pdf/2603.07988.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-13 23:38