Author: Denis Avetisyan
A new runtime system, RHAPSODY, streamlines the execution of complex, hybrid workflows that combine the strengths of high-performance computing and artificial intelligence.

RHAPSODY efficiently orchestrates heterogeneous workflows integrating simulation, training, inference, and agent-driven control for large-scale applications.
Emerging scientific applications increasingly demand the convergence of high-performance computing (HPC) and artificial intelligence (AI), yet existing systems struggle to efficiently manage the heterogeneous requirements of tightly coupled, large-scale workflows. This paper introduces RHAPSODY: Execution of Hybrid AI-HPC Workflows at Scale, a multi-runtime middleware designed to orchestrate diverse workloads (spanning simulation, training, inference, and agent-driven control) by composing existing HPC and AI technologies. RHAPSODY achieves scalable heterogeneity with minimal overhead, enabling near-linear scaling for inference and efficient data/control coupling in agentic workflows. Will this approach unlock a new era of scientific discovery by seamlessly integrating the strengths of HPC and AI at scale?
Beyond Computation: Embracing Intelligence in Scientific Discovery
Scientific advancement increasingly confronts problems characterized by immense datasets, intricate physical models, and multi-scale phenomena – challenges that push the boundaries of traditional computational methods. These approaches, often reliant on exhaustive simulations and predetermined algorithms, struggle with the inherent uncertainties and complexities present in fields like climate modeling, drug discovery, and materials science. The sheer computational cost of exploring vast parameter spaces and resolving fine-grained details often becomes prohibitive, hindering progress and limiting the scope of inquiry. Consequently, researchers are actively seeking novel paradigms that can overcome these limitations, moving beyond purely deterministic methods to embrace techniques capable of learning from data, adapting to new information, and efficiently navigating complex solution landscapes. This demand for innovation has spurred exploration into hybrid approaches, seeking to integrate the strengths of both artificial intelligence and high-performance computing.
The convergence of artificial intelligence and high-performance computing (HPC) is forging a new paradigm for scientific discovery, leveraging the distinct strengths of both fields. This hybrid approach marries the precision and established physics-based modeling capabilities of HPC with the pattern recognition and adaptability inherent in AI. The evaluation presented in this work shows that such hybrid workflows achieve near-linear scaling for inference workloads, meaning that doubling computational resources nearly doubles throughput, a significant advance over traditional methods. This efficiency stems from AI’s ability to accelerate simulations, refine models, and extract meaningful insights from the vast datasets generated by HPC systems, ultimately enabling researchers to tackle previously intractable scientific problems with unprecedented speed and accuracy.

Orchestrating Complexity: The Foundation of Efficient Workflows
Modern computational workflows frequently integrate diverse tasks – encompassing data processing, model inference, and training – necessitating runtime systems capable of managing this heterogeneity. These systems must efficiently schedule and coordinate tasks across varying hardware and software environments, handling dependencies and resource allocation dynamically. The complexity arises from the need to optimize for both throughput and latency, often requiring specialized execution strategies for each task type. Effective runtime management also includes monitoring task progress, handling failures, and ensuring data consistency across the workflow, demanding robust error handling and recovery mechanisms. Consequently, sophisticated runtime systems are critical for realizing the full potential of hybrid workflows and achieving optimal performance.
RHAPSODY functions as a multi-runtime execution substrate designed to optimize heterogeneous workloads by composing existing, specialized runtimes. Currently, it integrates Flux, Dragon, vLLM, and DeepSpeed, enabling the concurrent execution of up to 22 distinct task types. This architecture facilitates workload decomposition and distribution across the most appropriate runtime environment for each task, maximizing efficiency. Scalability has been demonstrated up to 1024 nodes, allowing for the orchestration of complex workflows across substantial computational resources.
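To make the idea of composition concrete, the sketch below shows one way a multi-runtime substrate might route heterogeneous tasks to specialized backends. It is a minimal illustration, not RHAPSODY’s actual API: the TaskSpec structure, the dispatch function, and the kind-to-backend mapping are assumptions made for this example, although the backend names mirror the runtimes listed above.

```python
# Hypothetical sketch of multi-runtime dispatch; not RHAPSODY's actual API.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    name: str
    kind: str                      # e.g. "simulation", "training", "inference", "agent"
    command: list[str]             # executable and arguments for the task
    resources: dict = field(default_factory=dict)   # e.g. {"nodes": 4, "gpus_per_node": 4}
    depends_on: list[str] = field(default_factory=list)

# Assumed mapping from task kind to the runtime best suited to execute it.
BACKENDS = {
    "simulation": "flux",       # batch HPC jobs via the Flux scheduler
    "training":   "deepspeed",  # distributed training
    "inference":  "vllm",       # batched LLM inference / serving
    "agent":      "dragon",     # long-lived coordination processes
}

def dispatch(task: TaskSpec) -> str:
    """Select a backend for a task; a real system would also submit it and track its state."""
    backend = BACKENDS.get(task.kind)
    if backend is None:
        raise ValueError(f"No runtime registered for task kind {task.kind!r}")
    print(f"[{backend}] launching {task.name} with {task.resources}")
    return backend

if __name__ == "__main__":
    sim = TaskSpec("md_run_001", "simulation", ["lmp", "-in", "in.protein"], {"nodes": 8})
    inf = TaskSpec("score_candidates", "inference", ["python", "score.py"],
                   {"gpus_per_node": 4}, depends_on=["md_run_001"])
    for task in (sim, inf):
        dispatch(task)
```

In a real deployment, dispatch would translate each task into the chosen runtime’s own submission format and honor the declared dependency, so that score_candidates only starts after md_run_001 completes.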
SmartRedis facilitates efficient data exchange between tightly coupled tasks within a heterogeneous computing environment. This system employs an in-memory data store optimized for low-latency communication, enabling coordinated resource orchestration across diverse computational workloads. Benchmarks demonstrate that SmartRedis consistently introduces a runtime overhead of less than 5% of the total execution time, maintaining this performance level across varying scales of deployment and computational complexity. This minimized overhead ensures that data transfer does not become a significant bottleneck, allowing for efficient parallelization and utilization of resources in hybrid workflows.
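For readers unfamiliar with SmartRedis, a minimal producer/consumer sketch using its Python client is shown below. It assumes a Redis-compatible database has already been launched (for example via SmartSim) with its address exported in the SSDB environment variable; the key name and tensor shape are purely illustrative.

```python
# Minimal SmartRedis producer/consumer sketch; assumes a database is already running
# and that its address is exported in the SSDB environment variable.
import numpy as np
from smartredis import Client

client = Client(cluster=False)   # connects to the database pointed to by $SSDB

# Producer side (e.g. a simulation task): publish a frame of results.
frame = np.random.rand(128, 128).astype(np.float32)
client.put_tensor("sim_frame_0001", frame)

# Consumer side (e.g. an inference task): wait for the tensor, then read it.
# poll_tensor checks every 100 ms, up to 100 tries, and returns False on timeout.
if client.poll_tensor("sim_frame_0001", 100, 100):
    data = client.get_tensor("sim_frame_0001")
    print("received frame with shape", data.shape)
```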

Accelerating Discovery: Applications at the Forefront of Science
IMPECCABLE accelerates drug discovery by integrating artificial intelligence with high-performance computing (HPC) resources. This hybrid approach focuses on optimizing computationally intensive simulations critical to identifying and validating potential drug candidates. Specifically, AI algorithms are employed to refine simulation parameters, reduce computational load, and accelerate the overall pipeline. This methodology addresses the complexity inherent in molecular dynamics, quantum mechanics, and other simulation types used in pharmaceutical research, allowing for more efficient exploration of chemical space and faster identification of promising compounds. The system’s architecture is designed to handle the substantial data volumes and processing demands associated with these complex simulations, ultimately reducing time-to-market for new therapeutics.
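The general pattern behind such AI-accelerated campaigns, a cheap surrogate model triaging candidates so that only the most promising ones reach expensive simulation, can be sketched in a few lines. The example below is illustrative rather than the IMPECCABLE implementation: run_expensive_simulation is a synthetic stand-in for an HPC docking or molecular dynamics score, and the candidate features are random placeholders.

```python
# Illustrative surrogate-guided screening loop (not the IMPECCABLE pipeline):
# a cheap ML model ranks candidates so only a small fraction is simulated in full.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def run_expensive_simulation(x: np.ndarray) -> float:
    """Synthetic stand-in for an HPC docking/MD score."""
    return float(-np.sum((x - 0.5) ** 2))

rng = np.random.default_rng(0)
pool = rng.random((5000, 16))                      # unlabeled candidate feature vectors
labeled_x = pool[:64]                              # small seed set simulated up front
labeled_y = np.array([run_expensive_simulation(x) for x in labeled_x])
pool = pool[64:]                                   # remove the seed set from the pool

for round_ in range(3):
    surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
    surrogate.fit(labeled_x, labeled_y)
    scores = surrogate.predict(pool)
    top = np.argsort(scores)[-32:]                 # indices of the most promising candidates
    new_y = np.array([run_expensive_simulation(x) for x in pool[top]])
    labeled_x = np.vstack([labeled_x, pool[top]])
    labeled_y = np.concatenate([labeled_y, new_y])
    pool = np.delete(pool, top, axis=0)            # avoid re-simulating the same candidates
    print(f"round {round_}: {len(labeled_y)} simulations run, best score {labeled_y.max():.4f}")
```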
The SPHERICAL and LUCID projects demonstrate the application of hybrid AI-HPC workflows to large-scale inference tasks. SPHERICAL focuses on antigen design, leveraging computational methods to predict and optimize immune responses. LUCID, conversely, applies this paradigm to the processing of scientific literature, enabling efficient extraction of knowledge and identification of relevant data. Both projects utilize distributed computing resources to manage the computational demands of these inference tasks, accelerating research in their respective fields through increased throughput and reduced processing times.
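To give a concrete sense of what a single inference worker in such a pipeline might look like, the snippet below uses vLLM’s offline batch-generation interface. The model name and prompts are placeholders, and a deployment at the scale described above would shard many such workers across nodes.

```python
# Minimal single-node vLLM batch inference sketch; model and prompts are placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the key finding of the following abstract: ...",
    "List the protein targets mentioned in this passage: ...",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any locally available model works
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt[:40], "->", out.outputs[0].text[:80])
```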
Scientific Foundation Models are central to current large-scale scientific applications, and their efficient operation is facilitated by platforms such as the American Science Cloud. The RHAPSODY framework, designed for these models, minimizes data-coupling overhead during distributed processing. Benchmarks indicate an overhead of at most 0.32 GB on a single node, growing to roughly 164 GB across 512 nodes, consistent with the per-node cost remaining approximately constant (0.32 GB × 512 ≈ 164 GB). Keeping this overhead low is critical for maintaining performance and scalability in computationally intensive scientific workflows, allowing for efficient data transfer and reduced communication bottlenecks.

Beyond Automation: Towards a Future of Autonomous Scientific Inquiry
Scientific progress is poised to shift from pre-programmed automation to dynamic, agentic workflows. These emerging systems envision artificial intelligence not simply executing tasks, but actively generating and coordinating them – essentially designing and managing its own computational experiments. Instead of scientists explicitly defining every step of an investigation, AI agents will autonomously formulate hypotheses, select appropriate simulations or analyses, and iteratively refine their approach based on results. This represents a fundamental change in how research is conducted, moving beyond scripted procedures to a more exploratory and adaptive paradigm where the AI takes a proactive role in knowledge discovery, promising to unlock insights from complex datasets and accelerate the pace of innovation across diverse scientific domains.
Agentic systems are increasingly reliant on the synergistic capabilities of hybrid AI-HPC architectures to tackle previously intractable scientific challenges. These systems don’t merely execute pre-defined instructions; instead, they intelligently combine the pattern recognition and predictive power of artificial intelligence with the raw computational force of high-performance computing. This allows for autonomous exploration of vast parameter spaces and optimization of complex models, effectively enabling the system to design and refine its own experiments. By dynamically allocating resources and adapting strategies based on real-time results, these hybrid approaches accelerate scientific discovery, particularly in fields where simulations and data analysis are computationally intensive, and where identifying optimal solutions demands navigating highly complex landscapes.
The transition from automated scientific workflows to agentic systems promises a substantial leap in the rate of discovery. Traditional automation executes pre-defined tasks, limiting exploration to programmed parameters; however, agentic workflows empower AI to dynamically formulate hypotheses, design experiments, and analyze results – effectively operating as an autonomous scientist. This capability was recently showcased by RHAPSODY, a scalable execution framework demonstrating near-linear scaling for inference workloads. This performance indicates that, as computational resources increase, the system’s ability to process data and derive insights grows proportionally, effectively diminishing the bottleneck of computational limitations and allowing for the exploration of significantly larger and more complex datasets. Such scalability is crucial for tackling grand scientific challenges and accelerating progress across diverse fields, moving beyond incremental improvements towards genuinely novel insights.
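A deliberately simplified sketch of such an agentic loop is shown below: an agent proposes the next experiment, a compute backend runs it, and the observation is fed back into the agent’s history. Both propose_next_experiment and run_simulation are hypothetical stand-ins (for an LLM-backed planner and an HPC simulation task, respectively), not components of RHAPSODY.

```python
# Conceptual agentic control loop (illustrative only): an agent proposes the next
# experiment, a compute backend executes it, and the result feeds back to the agent.
import random

def propose_next_experiment(history: list[dict]) -> dict:
    """Hypothetical stand-in for an LLM agent; here it simply perturbs the best point so far."""
    if not history:
        return {"temperature": 300.0, "pressure": 1.0}
    best = max(history, key=lambda h: h["score"])
    return {k: v * random.uniform(0.95, 1.05) for k, v in best["params"].items()}

def run_simulation(params: dict) -> float:
    """Hypothetical stand-in for a simulation task submitted to an HPC runtime."""
    return -abs(params["temperature"] - 310.0) - abs(params["pressure"] - 1.2)

history: list[dict] = []
for step in range(10):
    params = propose_next_experiment(history)
    score = run_simulation(params)
    history.append({"params": params, "score": score})

print("best parameters found:", max(history, key=lambda h: h["score"]))
```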

The design presented within this work echoes a fundamental principle of efficient systems: reducing complexity to reveal inherent functionality. RHAPSODY’s integration of disparate technologies, HPC and AI, into a unified runtime system exemplifies this approach. It’s a deliberate refinement, prioritizing streamlined workflow orchestration over superfluous features. As Bertrand Russell observed, “The point of civilization is to lessen suffering.” While seemingly distant from computational science, this sentiment aligns with the core ambition of RHAPSODY: to diminish the complexities inherent in scaling hybrid AI-HPC workflows, thereby enabling more efficient and, ultimately, impactful research.
What Lies Ahead?
The presented work addresses a necessary, if predictably complex, integration. The coupling of simulation and learning, while conceptually elegant, invariably introduces a new stratum of engineering difficulty. RHAPSODY represents a pragmatic attempt to manage that difficulty, but does not, and cannot, resolve the fundamental tension between the deterministic rigor of high-performance computing and the inherent stochasticity of modern artificial intelligence. Future efforts will likely focus not on merely connecting these worlds, but on developing abstractions that conceal their fundamental incompatibility.
A crucial, and often overlooked, limitation resides in the assumption of workflow predictability. The agentic aspects, while promising, are still tethered to pre-defined tasks. True autonomy – the ability for these workflows to self-discover and refine their objectives – remains elusive. The pursuit of such systems will necessitate a re-evaluation of existing runtime architectures, perhaps towards more decentralized, emergent models. The goal is not simply efficient execution, but the elegant disappearance of the orchestration layer itself.
Ultimately, the value of these systems will not be measured in FLOPS or training epochs, but in their ability to yield useful insight with minimal human intervention. A system that efficiently simulates and learns, yet fails to produce actionable knowledge, is merely a beautifully complex exercise in self-regard. The true challenge lies in translating computational power into genuine understanding.
Original article: https://arxiv.org/pdf/2512.20795.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/