Surgical Robotics Gets a Boost from Massive New Dataset

Author: Denis Avetisyan


Researchers have unveiled a large-scale, multi-modal dataset designed to accelerate the development of foundation models for surgical robotics and simulation.

A comprehensive dataset encompassing 770 hours of synchronized multimodal demonstrations from 49 institutions across North America, Europe, the Middle East, and Asia supports the development of advanced healthcare robotics, specifically through training models like GR00T-H, a vision-language-action model for surgical autonomy, and Cosmos-H-Surgical-Simulator, a multi-embodied, action-conditioned world model leveraging data from 20 diverse robotic platforms including surgical systems [latex]\text{(da Vinci Si, dVRK)}[/latex] and adaptable manipulators [latex]\text{(Franka Panda, UR5e)}[/latex].

Open-H-Embodiment provides the data needed to train advanced robotic systems capable of learning and generalizing across a range of surgical tasks.

Despite the promise of autonomous medical robots to revolutionize healthcare, progress has been hindered by a critical lack of large, openly available datasets. To address this, we introduce ‘Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics’, comprising multi-modal data from over 49 institutions and diverse robotic platforms, including the da Vinci and Versius systems, spanning surgical manipulation, ultrasound, and endoscopy. This dataset enabled the development of GR00T-H, a vision-language-action model achieving unprecedented performance on a suturing benchmark, and Cosmos-H-Surgical-Simulator, an action-conditioned world model supporting multi-embodiment surgical simulation. Could this open infrastructure unlock a new era of robust, generalizable robot learning and ultimately broaden access to precision medicine?


The Challenge of Surgical Skill Transfer

Historically, the acquisition of surgical proficiency has relied heavily on an apprenticeship model, demanding significant time and resources from experienced surgeons to oversee trainees. This traditional approach, while valuable, presents inherent limitations; the sheer volume of cases required for mastery is often unattainable, and the subjective nature of assessment introduces variability in skill transfer – meaning competence demonstrated in one setting doesn’t always reliably translate to novel situations or anatomical variations. Furthermore, access to specialized surgical training remains unevenly distributed, creating disparities in the quality of care available to patients and hindering the development of a consistently skilled surgical workforce. The financial burden associated with extended training periods and the need for dedicated mentorship further compound these challenges, prompting exploration of alternative, more efficient, and standardized learning paradigms.

Contemporary robotic surgery, while offering precision and minimally invasive techniques, often struggles with adaptability. Current systems are typically programmed for specific procedures and anatomical conditions, exhibiting limited capacity to generalize to unforeseen variations during surgery. This rigidity stems from a reliance on pre-programmed movements and a lack of sophisticated sensing capabilities to interpret nuanced anatomical differences or unexpected tissue behavior. Consequently, surgeons may find robotic assistance less effective when confronted with anatomical anomalies, challenging surgical sites, or patient-specific variations not accounted for in the system’s programming. Overcoming this limitation necessitates advancements in artificial intelligence and machine learning, enabling robotic platforms to perceive, learn, and autonomously adjust to the dynamic and unpredictable nature of the surgical environment, ultimately mirroring the adaptability of a skilled human surgeon.

Achieving true surgical skill transfer necessitates learning approaches that move beyond rote memorization of specific techniques and instead cultivate adaptable dexterity. Current training often struggles to bridge the gap between controlled laboratory settings and the unpredictable realities of the operating room, where anatomical variations and unforeseen complications demand immediate, nuanced responses. Robust paradigms emphasize not just the what of a procedure, but the how and why, fostering a deeper understanding of surgical principles and biomechanical interactions. This allows surgeons to extrapolate learned skills to novel situations, mastering a spectrum of conditions rather than being limited to pre-defined scenarios. Consequently, research is increasingly focused on developing training methods, including advanced simulation and machine learning, that prioritize adaptability and equip surgeons with the cognitive and motor skills necessary to confidently navigate the inherent complexities of surgical practice.

Post-training GR00T-H on Open-H significantly improves surgical task success rates (p<0.001) across the da Vinci Research Kit Si, CMR Versius, and Virtual Incision MIRA platforms, as demonstrated by 95% confidence intervals.

Open-H: A Foundation for Surgical Intelligence

The Open-H-Embodiment dataset is designed to facilitate the development of robust surgical foundation models through its extensive scale and diversity. It comprises 770 hours of synchronized video and kinematic data captured across 20 distinct robotic platforms. Data standardization is achieved via the LeRobot v2.1 framework, ensuring consistency and compatibility for machine learning applications. This large dataset addresses the historical limitations of data availability in surgical robotics, providing a resource for training and evaluating algorithms capable of generalizing across different surgical systems and procedures.
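To make that layout concrete, below is a minimal sketch of what one synchronized sample might look like when loaded into Python. The field names, shapes, and the iter_episode helper are illustrative assumptions for exposition, not the actual Open-H or LeRobot v2.1 schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SurgicalSample:
    """Hypothetical synchronized multimodal sample (field names are assumptions)."""
    rgb: np.ndarray               # (H, W, 3) endoscope or scene-camera frame
    joint_positions: np.ndarray   # (D,) robot joint angles at the same timestamp
    ee_pose: np.ndarray           # (7,) end-effector position + quaternion
    gripper: float                # normalized gripper opening
    embodiment_id: str            # e.g. "dvrk", "franka_panda", "ur5e"
    task_description: str         # natural-language annotation of the demonstration
    timestamp: float              # seconds since episode start

def iter_episode(frames, kinematics, meta):
    """Pair video frames with kinematic rows, assumed pre-synchronized per timestep."""
    for rgb, kin in zip(frames, kinematics):
        yield SurgicalSample(
            rgb=rgb,
            joint_positions=kin["q"],
            ee_pose=kin["ee_pose"],
            gripper=kin["gripper"],
            embodiment_id=meta["embodiment_id"],
            task_description=meta["task"],
            timestamp=kin["t"],
        )
```

The key property such a record format captures is that every kinematic reading is aligned to a video frame, which is what makes the data usable for both policy learning and action-conditioned video prediction.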

Multi-embodiment learning addresses the challenge of limited generalization in surgical robotics by training policies across a diverse set of robotic platforms. Traditional approaches often result in policies that perform well only on the specific robot used during training. By exposing a learning agent to multiple robotic embodiments – in the case of Open-H, data from 20 different platforms is utilized – the agent develops a more robust and adaptable skillset. This allows a learned surgical policy to transfer effectively to novel robotic systems without requiring extensive retraining, significantly reducing development time and costs associated with deploying surgical intelligence across varied hardware configurations.
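A common way to realize this is to train one shared policy conditioned on an embodiment identifier, so a single network serves every platform in the data mixture. The sketch below illustrates that conditioning pattern in PyTorch; the architecture, dimensions, and the MultiEmbodimentPolicy name are assumptions for illustration, not the approach used by GR00T-H.

```python
import torch
import torch.nn as nn

class MultiEmbodimentPolicy(nn.Module):
    """Shared policy conditioned on an embodiment embedding (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, num_embodiments: int, embed_dim: int = 32):
        super().__init__()
        self.embodiment_embedding = nn.Embedding(num_embodiments, embed_dim)
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor, embodiment_id: torch.Tensor) -> torch.Tensor:
        emb = self.embodiment_embedding(embodiment_id)        # (B, embed_dim)
        return self.backbone(torch.cat([obs, emb], dim=-1))   # (B, act_dim)

# Training mixes batches drawn from all platforms, so the policy cannot overfit
# to any single robot's kinematics or camera placement.
policy = MultiEmbodimentPolicy(obs_dim=64, act_dim=7, num_embodiments=20)
obs = torch.randn(8, 64)
ids = torch.randint(0, 20, (8,))
actions = policy(obs, ids)   # (8, 7) predicted actions
```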

The limited availability of labeled data has historically constrained progress in surgical robotics research and the development of intelligent systems. The Open-H-Embodiment dataset, comprising 770 hours of paired video and kinematic data from 20 robotic platforms, directly addresses this data scarcity. This scale enables the training of more robust and generalizable foundation models, reducing the reliance on extensive, platform-specific data collection. By providing a standardized, large-scale resource, Open-H facilitates accelerated innovation in areas such as surgical skill learning, autonomous surgical tasks, and improved robotic assistance, lowering the barrier to entry for researchers and developers.

Across diverse surgical datasets, institutions, and robotic embodiments, the model accurately predicts surgical video frames from recorded kinematic data, demonstrating robust generalization capabilities.

GR00T-H: A Next-Generation Surgical Policy

GR00T-H utilizes a vision-language-action framework as its foundation, representing an advancement over prior iterations. The policy’s enhanced capabilities are achieved through pre-training the GR00T-N1.6 model on the Open-H dataset, a large-scale collection of robotic manipulation data. This pre-training process allows GR00T-H to leverage learned representations from the Open-H dataset, improving its ability to map visual inputs and language instructions to appropriate robotic actions. The use of GR00T-N1.6 as a base model ensures compatibility with existing infrastructure and facilitates transfer learning from the Open-H dataset.
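In outline, a vision-language-action policy consumes a camera frame and a tokenized instruction and emits a short chunk of future actions. The toy sketch below shows only that interface; the encoders, dimensions, and the ToyVLAPolicy name are assumptions and do not reflect the internals of GR00T-N1.6 or GR00T-H.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Minimal vision-language-action interface: (image, instruction) -> action chunk."""

    def __init__(self, vocab_size: int = 1000, act_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        # Stand-ins for the pretrained vision and language encoders a real VLA would use.
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128))
        self.language = nn.EmbeddingBag(vocab_size, 128)
        self.action_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                                         nn.Linear(256, horizon * act_dim))

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vision(image), self.language(token_ids)], dim=-1)
        return self.action_head(fused).view(-1, self.horizon, self.act_dim)

policy = ToyVLAPolicy()
image = torch.randn(1, 3, 224, 224)        # one camera frame
tokens = torch.randint(0, 1000, (1, 12))   # tokenized instruction, e.g. a suturing command
action_chunk = policy(image, tokens)       # (1, 16, 7) predicted future actions
```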

Evaluations using the SutureBot platform demonstrate a significant performance increase with the GR00T-H policy. Specifically, GR00T-H achieved a 25% end-to-end task completion rate, contrasting with 0% completion for ACT, GR00T-N1.6, and LingBot-VA. Furthermore, the average task success rate across three SutureBot tasks was 54% for GR00T-H, representing an improvement over the 30% success rate achieved by GR00T-N1.6.

GR00T-H establishes a new performance standard for surgical policy, demonstrably surpassing the capabilities of previously established models such as LingBot-VA. Comparative evaluations on the SutureBot platform reveal a 54% task success rate for GR00T-H, significantly exceeding the 30% achieved by GR00T-N1.6 and the 0% completion rate of both ACT and LingBot-VA. These results validate the efficacy of the vision-language-action framework and pre-training methodologies employed in the development of GR00T-H, positioning it as a robust benchmark for future research in robotic surgical assistance.

GR00T-H demonstrates superior data efficiency and scaling capabilities on the SutureBot task, achieving performance comparable to ACT with only 33% of the training data and outperforming all baselines with 100% data [latex] (n=10 per subtask) [/latex], as indicated by Clopper-Pearson 95% confidence intervals.
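Because each subtask is evaluated over a small number of binary trials, the reported uncertainty uses exact Clopper-Pearson intervals. A minimal sketch of that standard computation with SciPy follows; the success count k = 7 is an illustrative value, not a number from the paper.

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a binomial success rate."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Illustrative counts: 7 successes out of 10 trials on one subtask.
k, n = 7, 10
lo, hi = clopper_pearson(k, n)
print(f"success rate {k / n:.0%}, 95% CI [{lo:.2f}, {hi:.2f}]")
```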

Cosmos-H: Bridging Simulation and Reality

The Cosmos-H-Surgical-Simulator represents a leap forward in surgical training, built upon the foundation of the Cosmos-Predict 2.5 platform and employing sophisticated Action-Conditioned World Models. These models don’t simply recreate visual fidelity; they predict how tissues will react to surgical instruments with remarkable accuracy. By learning from extensive datasets of surgical procedures, the simulator anticipates the consequences of each action (a cut, a suture, a cauterization), allowing trainees to practice complex maneuvers in a completely virtual, yet profoundly realistic, environment. This predictive capability extends beyond surface appearances, simulating nuanced physical interactions like tissue deformation, bleeding, and even the propagation of forces, providing a level of immersion previously unattainable and fostering the development of crucial surgical skills without risk to patients.
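The defining loop of an action-conditioned world model is autoregressive: encode the current frame, apply the commanded action, predict the next latent frame, and feed the prediction back in. The sketch below shows that loop generically; the ToyWorldModel class, its dimensions, and the latent-space formulation are assumptions for illustration and do not describe the Cosmos-Predict 2.5 architecture.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Predicts the next latent frame from the current latent and an action (illustrative)."""

    def __init__(self, latent_dim: int = 256, act_dim: int = 7):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.dynamics(torch.cat([latent, action], dim=-1))

def rollout(model: ToyWorldModel, latent0: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Autoregressively unroll the model over a sequence of actions."""
    latents, latent = [], latent0
    for t in range(actions.shape[1]):
        latent = model(latent, actions[:, t])
        latents.append(latent)
    return torch.stack(latents, dim=1)   # (B, T, latent_dim)

model = ToyWorldModel()
z0 = torch.randn(1, 256)                 # encoded current frame
acts = torch.randn(1, 72, 7)             # 72 future actions, matching the evaluation horizon
trajectory = rollout(model, z0, acts)    # (1, 72, 256) predicted latent frames
```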

The Cosmos-H surgical simulator represents a significant leap forward in robotic surgery development by providing a secure and streamlined environment for training complex surgical policies. Traditionally, honing these policies required extensive animal trials or supervised practice with cadavers – costly, time-consuming, and ethically challenging approaches. This simulator, however, enables researchers to rapidly prototype, test, and refine robotic surgical techniques within a virtual setting, drastically reducing both the financial burden and inherent risks. Through iterative simulation and reinforcement learning, surgical algorithms can be optimized for precision, efficiency, and adaptability before ever entering a real operating room, ultimately accelerating the translation of innovative robotic surgery solutions into clinical practice and enhancing patient outcomes.

Traditional surgical training relies heavily on supervised practice, often utilizing animal models or cadavers – resources that are both expensive and carry inherent limitations. However, advancements in large-scale data analysis and sophisticated modeling techniques are enabling the creation of virtual surgical environments that dramatically reduce these costs and risks. By training surgical policies within these simulations, using datasets derived from actual surgical procedures and patient outcomes, a surgeon can refine techniques and build expertise without exposing patients to unnecessary risk. This approach not only accelerates the development of robotic surgery systems but also offers a scalable and accessible platform for continuous professional development, potentially democratizing access to high-quality surgical training worldwide and fostering a new era of safer, more efficient healthcare.

Across both benchtop and tissue-based datasets, the Cosmos-H-Surgical-Simulator achieves consistently low [latex]\text{L}_1[/latex] error and high SSIM scores, indicated by the narrow shaded bands representing one standard deviation across generation seeds, demonstrating robust performance over 72 autoregressively generated frames.
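For readers wanting to reproduce this style of evaluation, the snippet below computes per-frame L1 error and SSIM against a reference video using scikit-image; the random placeholder arrays stand in for the simulator's generated frames and the ground-truth recordings.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def per_frame_metrics(generated: np.ndarray, reference: np.ndarray):
    """L1 error and SSIM for each frame of two (T, H, W, 3) uint8 videos."""
    l1_errors, ssim_scores = [], []
    for gen, ref in zip(generated, reference):
        l1_errors.append(np.mean(np.abs(gen.astype(float) - ref.astype(float))))
        ssim_scores.append(ssim(gen, ref, channel_axis=-1, data_range=255))
    return np.array(l1_errors), np.array(ssim_scores)

# Placeholder videos: 72 autoregressively generated frames vs. ground truth.
T, H, W = 72, 128, 128
gen_video = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
ref_video = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
l1, scores = per_frame_metrics(gen_video, ref_video)
print(f"mean L1 = {l1.mean():.2f}, mean SSIM = {scores.mean():.3f}")
```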

The creation of Open-H underscores a fundamental principle: data, meticulously curated and broadly accessible, serves as the essential substrate for progress. This aligns with Alan Turing’s observation: “The question is not whether a machine can think, but whether it can be made to perform tasks that would be regarded as intelligent if done by a human.” Open-H doesn’t aim to replicate human intelligence directly, but to provide the necessary data for machines – specifically, foundation models like GR00T-H – to perform increasingly complex surgical tasks. The scale of the dataset, and its multi-modal nature, represents a reduction to essential components – vision, language, and action – stripping away superfluous complexity to enable effective learning. The project’s focus on data augmentation and simulation further emphasizes this commitment to clarity and efficient knowledge transfer.

Where To Now?

Open-H offers scale. Scale is useful, but not inherently insightful. The data itself does not solve the problem of generalization. True robustness demands models that understand why actions succeed, not merely that they do. The current work addresses simulation-to-real transfer, a persistent challenge. But the gap between simulated perfection and clinical messiness remains substantial.

Abstractions age, principles don’t. The focus on vision-language-action is sensible, but limiting. Tactile sensing, force feedback, and the nuanced physics of tissue manipulation are underrepresented. These aren’t simply additional data streams; they alter the fundamental nature of the problem. Every complexity needs an alibi.

Future work must move beyond mimicking demonstrations. Developing models capable of planning – forming and testing hypotheses about surgical interventions – will be crucial. The dataset provides a foundation. The next step is to build systems that learn from failure, adapt to the unexpected, and ultimately augment, not replace, human skill.


Original article: https://arxiv.org/pdf/2604.21017.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-24 20:28