Author: Denis Avetisyan
Researchers have released RoboCOIN, a large-scale dataset designed to accelerate the development of more capable and adaptable bimanual robotic manipulation systems.
The RoboCOIN dataset provides a multi-embodiment collection of bimanual manipulation data, coupled with a hierarchical annotation scheme and a data processing framework for improved robot learning and generalization.
Achieving human-level dexterity in robotics requires robust bimanual manipulation skills, yet the scarcity of large-scale, diverse datasets hinders progress across heterogeneous robotic platforms. To address this challenge, we introduce RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation, a comprehensive dataset of over 180,000 demonstrations collected from 15 robots, annotated with a novel hierarchical structure spanning trajectory-level concepts down to frame-level kinematics. Coupled with the CoRobot processing framework and the Robot Trajectory Markup Language, RoboCOIN demonstrably improves multi-embodiment bimanual learning performance. How, then, can this resource further accelerate the development of adaptable and intelligent robotic systems?
The Inevitable Constraints of Embodied Intelligence
Conventional robotic learning approaches often falter when confronted with manipulation tasks requiring nuanced coordination and adaptability. These methods typically rely on painstakingly curated datasets and algorithms specifically tailored to each individual skill, creating a significant bottleneck in development. The demand for extensive data is particularly problematic; robots require numerous examples to learn even simple actions, a requirement that doesn’t scale well to the complexity of real-world environments. Furthermore, transferring learned skills to new objects or situations proves challenging, as these systems struggle with generalization. This reliance on specialized expertise and massive datasets limits the deployment of robots in dynamic, unstructured settings, hindering the realization of truly versatile and autonomous machines.
Despite the recent advancements in Vision-Language Action (VLA) models for robotics, a significant hurdle remains in their ability to reliably perform tasks beyond the data they were trained on. While these models demonstrate promise in translating natural language instructions into robotic actions, their performance often degrades dramatically when confronted with scenarios differing even slightly from those present in the training dataset. This limitation stems from the substantial data requirements needed to capture the vastness of the real world; current datasets, though growing, frequently lack the diversity and scale necessary for robust generalization. Consequently, a robot guided by a VLA model may successfully grasp a red block in a controlled environment, but struggle with a blue block, or the same red block in a cluttered space, highlighting the critical need for more expansive and adaptable learning frameworks.
Current robotic learning efforts are often constrained by the limitations of available datasets, particularly when it comes to teaching robots to perform complex, two-handed tasks. While resources like Open X-Embodiment provide valuable data for single-arm manipulation, they inherently fall short of representing the full spectrum of dexterity required for bimanual interactions – those involving both hands working in concert. This restriction poses a significant hurdle, as many real-world tasks – from assembling electronics to preparing food – demand the nuanced coordination of two arms. Consequently, robots trained solely on single-arm data struggle to generalize to scenarios requiring bimanual dexterity, leading to brittle performance and hindering progress towards truly versatile robotic systems capable of seamlessly interacting with the physical world.
Advancing robotic intelligence beyond current limitations necessitates a learning framework capable of broad adaptation and efficient scaling. Existing methods often falter when confronted with the variability of real-world environments and the complexity of nuanced tasks, demanding extensive, task-specific data collection. A truly versatile framework would move beyond these constraints, enabling robots to generalize from limited experience and transfer learned skills to novel situations. Such a system would not simply memorize solutions, but rather develop an understanding of underlying principles, allowing for robust performance across a wide range of challenges. This leap in capability is crucial, as it promises to unlock advanced robotic applications in areas like manufacturing, healthcare, and even complex exploration, fundamentally reshaping how humans and robots interact with the world.
RoboCOIN: A Foundation for Adaptable Dexterity
The RoboCOIN dataset comprises over 180,000 demonstrated task executions, collected across 421 distinct manipulation tasks. Data was captured using 15 different robotic platforms, encompassing a range of kinematic structures and morphologies. This scale of data is significantly larger than previously available bimanual manipulation datasets and is intended to facilitate the training of robust and generalizable learning algorithms. The diversity in both task execution and robotic embodiment aims to mitigate overfitting and improve the transferability of learned policies to novel scenarios and hardware configurations.
The RoboCOIN dataset’s multi-embodiment design incorporates data collected from a diverse range of robotic platforms, specifically Dual-Arm Robots, Half-Humanoid Robots, and fully Humanoid Robots. This deliberate variety addresses a key limitation in robotic learning – the tendency for models to overspecialize to a single morphology. By training on data from multiple embodiments, RoboCOIN facilitates the development of manipulation policies that generalize more effectively to novel robotic systems and environments. The dataset’s coverage extends to variations in kinematic structure, degrees of freedom, and actuator configurations, enabling the creation of adaptable policies less susceptible to the “sim-to-real” gap and promoting transfer learning between different robotic platforms.
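One widely used recipe for this kind of cross-embodiment training is to map every robot's action vector into a fixed-width shared space with a validity mask, so a single policy head can serve platforms with different degrees of freedom. The sketch below illustrates the idea under assumed dimensions and function names; it is not RoboCOIN's actual data format.

```python
import numpy as np

# Hedged sketch: zero-pad per-robot actions into one shared action
# space with a boolean validity mask. MAX_ACTION_DIM is an assumed
# upper bound across platforms, not a value taken from RoboCOIN.
MAX_ACTION_DIM = 32

def to_shared_action(action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad a robot-specific action vector and return it with a mask."""
    dim = action.shape[-1]
    padded = np.zeros(MAX_ACTION_DIM, dtype=np.float32)
    padded[:dim] = action
    mask = np.zeros(MAX_ACTION_DIM, dtype=bool)
    mask[:dim] = True
    return padded, mask

# Example: a 14-DoF dual-arm platform (7 joints per arm) and a 26-DoF
# humanoid both map into the same 32-dim policy output space.
dual_arm_action, dual_arm_mask = to_shared_action(np.random.randn(14))
humanoid_action, humanoid_mask = to_shared_action(np.random.randn(26))
```

Masking the padded dimensions during loss computation keeps a shared policy from being penalized for degrees of freedom a given robot does not possess.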
The RoboCOIN dataset utilizes a Hierarchical Capability Pyramid to facilitate both multi-resolution learning and complex task decomposition. This pyramid consists of three annotation levels: Trajectory-Level, providing high-level task goals; Segment-Level, detailing intermediate steps or skills required to achieve those goals; and Frame-Level, offering precise state information at each timestep. This structure enables algorithms to learn from coarse-to-fine representations, allowing for generalization across tasks and efficient transfer of learned skills. Furthermore, the hierarchical organization supports task decomposition by allowing robots to break down complex goals into manageable sub-problems, each addressable at a specific level of the pyramid.
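As a concrete illustration, the annotation hierarchy can be encoded as nested records, one per level of the pyramid. The field names and skill labels below are assumptions for the sketch, not the dataset's published schema.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of the three-level Hierarchical Capability
# Pyramid; field names are illustrative, not RoboCOIN's actual schema.
@dataclass
class FrameAnnotation:          # Frame-Level: per-timestep state
    timestep: int
    left_ee_pose: list[float]   # left end-effector pose
    right_ee_pose: list[float]  # right end-effector pose

@dataclass
class SegmentAnnotation:        # Segment-Level: an intermediate skill
    skill: str                  # e.g. "grasp_left", "handover"
    start: int                  # first frame index of the segment
    end: int                    # last frame index of the segment
    frames: list[FrameAnnotation] = field(default_factory=list)

@dataclass
class TrajectoryAnnotation:     # Trajectory-Level: the overall goal
    goal: str                   # e.g. "fold the towel with both arms"
    segments: list[SegmentAnnotation] = field(default_factory=list)
```

A learner can then consume whichever resolution it needs: trajectory goals for planning, segments for skill discovery, and frames for low-level control.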
The CoRobot Framework is designed to accelerate research in robotic manipulation by providing a comprehensive and standardized infrastructure. It integrates tools for data processing, including filtering, conversion, and augmentation of raw sensor data collected from diverse robotic platforms. Data validation procedures within the framework ensure data quality and consistency, crucial for reliable machine learning. Furthermore, CoRobot facilitates robotic experimentation through a unified API and simulation environment, enabling researchers to rapidly prototype, test, and refine algorithms without significant platform-specific modifications. This integrated approach reduces the overhead associated with data management and experimentation, thereby streamlining the development pipeline for robotic manipulation systems.
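The following minimal Python sketch illustrates the staged-pipeline idea behind such an infrastructure, assuming hypothetical stage functions and sample formats; it is not CoRobot's real API.

```python
import math
from typing import Callable, Iterable, Optional

# Minimal sketch of a staged processing pipeline in the spirit of the
# CoRobot stages (validation, conversion, filtering); function names
# and the sample format are hypothetical.
Stage = Callable[[dict], Optional[dict]]  # returning None drops the sample

def run_pipeline(samples: Iterable[dict], stages: list[Stage]):
    for sample in samples:
        for stage in stages:
            sample = stage(sample)
            if sample is None:   # a stage rejected this demonstration
                break
        else:
            yield sample

def validate(sample: dict) -> Optional[dict]:
    # Reject demonstrations with an empty action stream.
    return sample if sample.get("actions") else None

def convert(sample: dict) -> dict:
    # Example unit normalization: degrees to radians.
    sample["actions"] = [a * math.pi / 180.0 for a in sample["actions"]]
    return sample

demos = [{"actions": [10.0, 20.0]}, {"actions": []}]
clean = list(run_pipeline(demos, [validate, convert]))  # keeps only the first
```

Composing stages this way is one natural fit for the framework's goal of reusing the same processing code across heterogeneous platforms.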
Empirical Validation: The Emergence of Generalization
Multiple Vision-Language Action (VLA) models, notably Robotics Diffusion Transformer, OpenVLA, and GR00T-N1.5, have exhibited performance gains when trained utilizing the RoboCOIN dataset. These models benefit from RoboCOIN’s scale and diversity compared to existing datasets, enabling improved generalization to robotic tasks. Observed improvements are quantifiable; for example, GR00T-Fine achieved a 16% performance increase over its GR00T-Raw counterpart through training on RoboCOIN, while GR00T-Mine saw a 23% improvement utilizing the same dataset and associated data augmentation techniques. This indicates RoboCOIN effectively facilitates learning and enhances the capabilities of various VLA architectures in the robotics domain.
GR00T-N1.5, a vision-language action (VLA) model utilizing a diffusion-based architecture, has achieved state-of-the-art performance through training with the RoboCOIN dataset and the CoRobot framework. Specifically, on the Realman RMC-AIDA-L task under the π0 evaluation setting, GR00T-N1.5 demonstrates a 70% success rate, a substantial improvement over the previous benchmark of 20% on the same task and evaluation parameters. The performance gain is directly attributable to the utilization of RoboCOIN and CoRobot, highlighting their effectiveness in training advanced VLA models.
Evaluations demonstrate RoboCOIN’s advantages over the π0 dataset in terms of training data characteristics. Specifically, RoboCOIN provides a larger and more diverse dataset, increasing the robustness of trained models in real-world robotic applications. Data quality control within RoboCOIN utilized the Robot Trajectory Markup Language (RTML) to identify and remove low-quality trajectories, resulting in the elimination of 35.3% of such data across two separate tasks. This filtering process contributes to improved model performance by reducing the influence of noisy or inaccurate training examples.
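To make the filtering step concrete, the hedged sketch below rejects trajectories using simple length and smoothness heuristics. RTML's actual quality criteria are not described in this summary, so the thresholds and checks here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of trajectory quality gating in the spirit of RTML
# filtering; the thresholds and heuristics are assumptions, not the
# actual RTML criteria.
def is_low_quality(traj: dict,
                   max_jerk: float = 50.0,
                   min_length: int = 20) -> bool:
    actions = np.asarray(traj["actions"])
    if len(actions) < min_length:          # too short to be a real demo
        return True
    jerk = np.diff(actions, n=3, axis=0)   # third difference ~ jerk
    if np.abs(jerk).max() > max_jerk:      # jittery teleoperation
        return True
    return False

def filter_dataset(trajs: list[dict]) -> list[dict]:
    kept = [t for t in trajs if not is_low_quality(t)]
    removed = 1 - len(kept) / max(len(trajs), 1)
    print(f"removed {removed:.1%} low-quality trajectories")
    return kept
```

In a gate of this kind, the printed `removed` fraction plays the role of the 35.3% figure reported for RoboCOIN's two evaluated tasks.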
Performance gains were observed through targeted data refinement of the GR00T model. Specifically, GR00T-Fine achieved a 16% improvement in performance relative to the GR00T-Raw baseline, while GR00T-Mine demonstrated a 23% improvement. These gains were directly attributable to the implementation of Robot Trajectory Markup Language (RTML) filtering, used to remove low-quality trajectories, coupled with data augmentation techniques applied during the training process. This indicates the efficacy of curated datasets in enhancing the capabilities of Vision-Language-Action (VLA) models.
Toward Robust Intelligence: The Horizon of Adaptable Systems
The demonstrated capabilities of RoboCOIN represent a significant step toward robotic systems capable of truly dexterous manipulation. Previous approaches often struggled with the subtleties of real-world interactions, requiring painstaking task-specific programming; however, RoboCOIN’s success in diverse, unscripted scenarios suggests a pathway toward more adaptable robotic hands. This breakthrough doesn’t simply refine existing techniques – it unlocks possibilities for addressing manipulation challenges demanding both fine motor skills, such as assembling intricate components, and precise coordination, like surgical procedures or delicate object handling. Researchers are now positioned to explore increasingly complex tasks, moving beyond simple pick-and-place operations to embrace scenarios requiring nuanced force control, adaptive grasping, and real-time problem-solving – ultimately bringing robots closer to seamlessly assisting humans in a wider array of applications.
The continued advancement of robotic intelligence relies heavily on the availability of robust and diverse datasets, and resources like Galaxea Open-World and AGIbot World are proving instrumental in this regard. These platforms offer significantly more than simple training data; they present complex, open-ended environments where robots can encounter a wide range of scenarios and challenges. Galaxea, with its focus on physically simulated environments, allows for the refinement of motor skills and manipulation strategies, while AGIbot World provides a valuable testbed for integrating these skills into more generalized, goal-directed behavior. Crucially, these datasets aren’t static; they are designed to evolve, providing ongoing opportunities to validate model performance, identify limitations, and drive further innovation in robotic learning. By leveraging these complementary resources, researchers can move beyond narrow, task-specific solutions and build robots capable of adapting to the complexities of real-world environments.
The success of hierarchical learning and multi-embodiment extends beyond complex manipulation, offering a powerful framework for advancements in robotic navigation and perception. By breaking down these traditionally monolithic challenges into layered, manageable sub-problems, robots can learn to perceive the world and plan paths with increased robustness and efficiency. For example, a robot navigating a cluttered room might first learn to identify broad object categories, then refine this understanding to distinguish specific instances, and finally utilize this knowledge to plan a collision-free trajectory. Furthermore, training across multiple virtual embodiments – different robot morphologies or sensor configurations – encourages the development of generalized skills less susceptible to the limitations of any single platform. This approach promises to yield robots capable of adapting to unforeseen circumstances and operating effectively in dynamic, real-world environments, ultimately moving the field closer to truly versatile, intelligent machines.
The development of robotic systems capable of complex manipulation, as demonstrated by this research, signifies a crucial step toward truly versatile robotic assistants. These systems are no longer limited to narrowly defined tasks in controlled settings, but instead exhibit the potential to adapt and perform a diverse array of functions within the unpredictable nature of human environments. This adaptability isn’t simply about executing pre-programmed routines; it suggests a future where robots can learn new skills, solve unforeseen problems, and collaborate with people in a natural and intuitive manner. The long-term vision extends beyond industrial automation to include personalized assistance in homes, support for elderly care, and collaborative roles in various service industries, fundamentally changing the relationship between humans and machines.
The introduction of RoboCOIN and its accompanying CoRobot framework acknowledges the inevitable entropy inherent in robotic systems. While the dataset aims to facilitate learning and generalization across varied platforms, it implicitly accepts that robotic embodiments and the data they generate are not static entities. As Edsger W. Dijkstra observed, “It’s not enough to have good intentions, you also need to do things the right way.” RoboCOIN isn’t merely a collection of data; it’s a structured approach to managing that data’s lifecycle, recognizing that maintaining a usable, evolving dataset, like any complex system, requires diligent attention to detail and a framework for graceful decay. The hierarchical annotation scheme and data processing tools are, in essence, mechanisms for slowing that decay and ensuring the dataset remains valuable over time.
What Lies Ahead?
The proliferation of datasets like RoboCOIN represents a predictable surge: a temporary bulwark against the inevitable decay of robotic systems. Each collected trajectory, each annotated grasp, is merely a snapshot of a transient state. The value isn’t in the data itself, but in the methods developed to extract signal from the noise before entropy claims it. Uptime is, after all, merely temporary.
The true challenge isn’t accumulating more examples of manipulation, but crafting architectures resilient to the inherent variability of the physical world. Multi-embodiment offers a path, but the transfer of learned policies remains a brittle process. Latency is the tax every request must pay, and current approaches often incur exorbitant costs when adapting to novel platforms. A deeper exploration of sim-to-real discrepancies, and of the fundamental limits of generalization, is essential.
Stability is an illusion cached by time. Future work should focus not solely on expanding datasets, but on developing methods for continual learning and adaptation. Systems must not merely perform in a controlled environment, but gracefully degrade as components age, sensors drift, and the world refuses to conform to pre-defined parameters. The goal isn’t perfection, but elegant obsolescence.
Original article: https://arxiv.org/pdf/2511.17441.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/