Author: Denis Avetisyan
A new approach combines artificial intelligence with physics-based simulation to build and validate scientific models through active experimentation.
![The study elucidates a formal description of a simulated fluid displacement problem, encompassing foundational assumptions, governing equations (including constitutive laws and the introduction of fractional calculus), and a nuanced analysis of mobility ratio effects on favorable versus unfavorable flow regimes, ultimately emphasizing the significance of the quarter-five-spot configuration as a benchmark for understanding multiphase flow dynamics in porous media, expressed mathematically as [latex] \frac{d P}{d x} = -\frac{\mu}{\kappa} v [/latex].](https://arxiv.org/html/2603.00214v1/2603.00214v1/x3.png)
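The caption's governing relation is Darcy's law, and the favorable/unfavorable regimes it mentions hinge on the mobility ratio. A minimal numerical sketch, using the standard endpoint-mobility definition; the function names and fluid values below are illustrative stand-ins, not taken from the paper:

```python
def darcy_pressure_gradient(mu, kappa, v):
    """Darcy's law: dP/dx = -(mu / kappa) * v."""
    return -(mu / kappa) * v

def mobility_ratio(k_rw, mu_w, k_ro, mu_o):
    """Endpoint mobility ratio M = (k_rw / mu_w) / (k_ro / mu_o).
    Conventionally, M <= 1 is 'favorable' (stable displacement front),
    while M > 1 is 'unfavorable' (prone to viscous fingering)."""
    return (k_rw / mu_w) / (k_ro / mu_o)

# Water (1 cP) displacing a 10 cP oil with equal endpoint permeabilities:
M = mobility_ratio(k_rw=1.0, mu_w=1.0e-3, k_ro=1.0, mu_o=1.0e-2)
regime = "favorable" if M <= 1.0 else "unfavorable"
```

With these illustrative viscosities the ratio is 10, i.e. an unfavorable displacement, which is exactly the regime where the quarter-five-spot benchmark becomes interesting.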
This paper details an agentic workflow leveraging LLM orchestration and the Jutul framework to improve reproducibility and physical validation in scientific modeling.
Despite advances in code-generating LLMs, constructing physically realistic and scientifically reproducible simulations remains challenging due to inherent ambiguities in natural language descriptions. This paper, ‘Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction’, introduces an agentic workflow where a large language model orchestrates scientific simulation using the [latex]Jutul[/latex] framework, grounding model construction in execution and simulator validation. Our results demonstrate that this approach enables explicit detection of underspecified modeling choices and reveals a critical limitation: tacit assumptions resolved by simulator defaults remain invisible to downstream analysis. Can systematically surfacing these hidden assumptions unlock a new level of transparency and reproducibility in scientific modeling?
The Erosion of Manual Effort in Scientific Modeling
Historically, the advancement of scientific understanding through computational modeling has been significantly constrained by the laborious process of manual code development and validation. Researchers meticulously craft algorithms, line by line, to simulate complex phenomena, a task demanding substantial time and expertise. This manual approach isn't merely time-consuming; it's inherently prone to errors: subtle bugs in the code can propagate through simulations, leading to inaccurate or misleading results. The validation phase, where model outputs are compared against empirical data, is equally challenging, often requiring extensive testing and refinement. This slow, iterative cycle limits the scope and speed of scientific discovery, as researchers spend considerable effort ensuring the reliability of their computational tools rather than exploring new hypotheses and insights.
The recent proliferation of Large Language Models (LLMs) presents a transformative opportunity for scientific discovery, yet their potential remains largely untapped without integration with systems capable of enacting and validating their outputs. While LLMs excel at generating hypotheses, designing experiments, and even writing code, these remain textual propositions until connected to executable environments: simulations, robotic platforms, or data analysis pipelines. This "grounding" is crucial; it moves LLMs beyond sophisticated text generation and towards agentic science, where models can independently formulate, test, and refine knowledge through interaction with a defined reality. Effectively, the power of LLMs lies not just in what they can articulate, but in their capacity to drive real-world or simulated experimentation, thereby closing the loop between theory and observation and accelerating the pace of scientific progress.

An Iterative Loop for Autonomous Scientific Inquiry
Agentic Scientific Simulation utilizes a cyclical "Interpret-Act-Validate" loop to facilitate autonomous scientific model development and testing. The LLM first interprets existing data or a defined problem, then acts by generating a model or experimental plan, and finally, the generated output is validated against a simulator or real-world data. This process is iterative; validation results are fed back into the interpretation phase, allowing the agent to refine its models and experimental designs without direct human intervention. The loop enables the LLM to move beyond purely linguistic tasks and actively engage in the scientific method through repeated cycles of hypothesis generation, experimentation, and analysis, ultimately building and refining models based on empirical evidence.
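The loop's structure can be made concrete with a toy sketch in which the "interpret" step stands in for the LLM and the "act" step stands in for code generation plus simulator execution. Everything here is hypothetical scaffolding (the update rule, function names, and tolerances are invented for illustration):

```python
def interpret(observation, history):
    """Stand-in for the LLM step: turn data and past results into a model spec.
    Hypothetical rule: nudge a single parameter toward closing the residual."""
    guess = history[-1]["param"] if history else 0.0
    return {"param": guess + 0.5 * observation["residual"]}

def act(spec):
    """Stand-in for code generation + simulator execution (a toy 'model')."""
    return {"output": spec["param"] ** 2}

def validate(result, target, tol=1e-3):
    """Compare simulated output against the target observation."""
    residual = target - result["output"]
    return abs(residual) < tol, residual

def interpret_act_validate(target, max_iters=50):
    """Run the cyclical Interpret-Act-Validate loop until validation passes."""
    history, observation = [], {"residual": target}
    for _ in range(max_iters):
        spec = interpret(observation, history)
        result = act(spec)
        ok, residual = validate(result, target)
        history.append({"param": spec["param"], "residual": residual})
        if ok:
            return spec, history
        observation = {"residual": residual}
    return spec, history
```

The key design point mirrored from the text is that validation output feeds back into interpretation, so refinement happens without human intervention.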
Execution-Grounded Workflows prioritize code execution and validation as the core components of an agent's reasoning cycle, diverging from approaches reliant on solely linguistic processing. In this framework, the agent doesn't simply reason about a problem using language; it formulates hypotheses, translates them into executable code, typically involving simulations or data analysis, and then assesses the results of that execution against pre-defined validation criteria. This process creates a feedback loop where the agent's understanding is directly informed by empirical outcomes, rather than inferences drawn from textual data alone. The emphasis on executable workflows ensures that reasoning is consistently tethered to demonstrable results, increasing the reliability and accuracy of the agent's conclusions.
The core of this agentic system relies on a differentiable simulator, such as JutulDarcy, to enable automated model validation and sensitivity analysis. Unlike traditional simulators that provide only output values for given inputs, a differentiable simulator provides gradients – information about how changes in input parameters affect the output. This allows the agent to directly optimize model parameters based on discrepancies between simulated and observed data, effectively performing gradient-based optimization of the generated models. The ability to compute these gradients programmatically bypasses the need for manual experimentation or finite difference approximations, significantly accelerating the validation process and enabling systematic exploration of model sensitivity to various input parameters and assumptions.
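The gradient-based calibration this enables can be illustrated with a toy one-parameter "simulator" whose derivative is known in closed form. JutulDarcy obtains such gradients via automatic differentiation and adjoint methods; here the gradient is hand-coded, and all constants and function names are illustrative:

```python
def simulate(perm):
    """Toy 'simulator': steady linear Darcy flow rate q = perm * dp / (mu * L).
    Fluid/geometry constants are fixed, hypothetical values."""
    mu, L, dp = 1.0e-3, 10.0, 1.0e5
    return perm * dp / (mu * L)

def simulate_grad(perm):
    """Analytic dq/dperm -- the role an AD/adjoint system plays inside a
    differentiable simulator, replacing finite-difference probing."""
    mu, L, dp = 1.0e-3, 10.0, 1.0e5
    return dp / (mu * L)

def calibrate(q_observed, perm0=1.0e-13, lr=5.0e-15, steps=200):
    """Gradient descent on the squared mismatch between simulated and
    observed flow rate, driven by simulator-provided gradients."""
    perm = perm0
    for _ in range(steps):
        residual = simulate(perm) - q_observed
        grad = 2.0 * residual * simulate_grad(perm)  # d(residual^2)/dperm
        perm -= lr * grad
        if abs(residual) < 1e-12:
            break
    return perm
```

Because the toy model is linear in the parameter, this converges almost immediately; the point is only that gradients flow from the simulator to the optimizer programmatically, with no manual experimentation.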
![JutulGPT generated a 3D reservoir model with three stratified layers of varying permeability and porosity, visualized with semi-transparent coloring and [latex]log_{10}[/latex] of permeability, and configured with four injectors and three producers to simulate a peripheral waterflood influenced by layered heterogeneity and an anticlinal trap.](https://arxiv.org/html/2603.00214v1/2603.00214v1/x4.png)
JutulGPT: A Concrete Implementation of Agentic Science
JutulGPT integrates a Large Language Model (LLM) with the JutulDarcy reservoir simulator, creating an agent capable of automated simulation workflow construction and execution. This coupling allows the agent to translate high-level instructions into a series of computational steps performed by JutulDarcy. Specifically, the system has demonstrated the ability to build and run complex simulations, including those modeling two-phase flow, a common requirement in petroleum reservoir engineering. The LLM doesn't simply generate code; it orchestrates the entire simulation process within the JutulDarcy environment, handling input parameterization, simulation execution, and result retrieval. This integration facilitates automated workflows previously requiring significant manual effort from reservoir engineers.
JutulGPT utilizes Semantic Retrieval-Augmented Generation (RAG) to enhance code generation by accessing and incorporating relevant information from a knowledge base. This process involves retrieving documentation and illustrative examples based on the semantic meaning of the user's prompt, rather than relying solely on keyword matching. The retrieved content is then provided as context to the Large Language Model (LLM), enabling it to produce more accurate, contextually appropriate, and reliable simulation code. This approach mitigates the risk of generating code based on incomplete or outdated information, and improves the LLM's ability to correctly interpret the user's intent and leverage the capabilities of the JutulDarcy simulator.
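A minimal sketch of semantic retrieval, substituting a toy bag-of-words "embedding" for the neural sentence encoder a real RAG stack would use. The documentation snippet strings are illustrative, not quoted from JutulDarcy's docs:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call a neural
    sentence encoder here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Rank documentation snippets by similarity to the prompt and return
    the top-k to be injected as LLM context."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "setup_reservoir_model builds the grid and rock properties",
    "simulate_reservoir runs the two-phase flow simulation",
    "plot_reservoir visualizes saturation and pressure fields",
]
context = retrieve("run a two-phase flow simulation", docs, k=1)
```

Even this crude similarity measure surfaces the simulation-related snippet rather than the gridding or plotting ones, which is the behavior that distinguishes semantic retrieval from brittle keyword lookup.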
JutulGPT demonstrated the ability to generate fully executable simulations from varying levels of input abstraction. Starting with a direct operational prompt, the system successfully translated requests into functional models. This capability extended to more complex inputs, including detailed technical reports and even concise descriptions typically found in peer-reviewed journal articles. Across all three input types (operational prompt, technical report, and journal description), the system achieved a 100% success rate in converting high-level intent into a working simulation, validating the effectiveness of the LLM-simulator coupling and the Semantic RAG implementation.
Structured Documentation is integral to JutulGPT's functionality, providing the LLM with a formalized, machine-readable representation of the simulator's features, parameters, and operational logic. This documentation isn't simply natural language descriptions; it utilizes a defined schema that details each component's inputs, outputs, and dependencies. Consequently, the LLM can parse this structured data to accurately identify relevant simulator capabilities for a given task, generate valid code based on those capabilities, and avoid errors stemming from ambiguous or incomplete information. The schema facilitates precise mapping between user intent, expressed in natural language prompts, and the specific functions within JutulDarcy, thereby maximizing the agent's ability to construct and execute complex simulations.
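One plausible shape for such a schema, sketched as a dictionary with a pre-execution validator. The field names, defaults, and function entry are hypothetical, not the paper's actual documentation format; the useful property shown is that defaults applied silently become reportable:

```python
# Hypothetical machine-readable schema entry for one simulator function.
SCHEMA = {
    "setup_reservoir_model": {
        "inputs": {
            "permeability": {"type": "float", "unit": "m^2", "required": True},
            "porosity": {"type": "float", "unit": "-", "required": False,
                         "default": 0.2},
        },
        "outputs": ["model"],
    }
}

def validate_call(func, kwargs):
    """Check a generated call against the schema before execution and
    report which optional parameters fell back to defaults."""
    spec = SCHEMA[func]["inputs"]
    missing = [p for p, s in spec.items()
               if s["required"] and p not in kwargs]
    defaults_used = {p: s["default"] for p, s in spec.items()
                     if not s["required"] and p not in kwargs}
    unknown = [p for p in kwargs if p not in spec]
    return {"missing": missing, "defaults_used": defaults_used,
            "unknown": unknown}
```

A validator like this lets the agent reject malformed calls before the simulator ever runs, and, just as importantly, makes the applied defaults explicit rather than invisible.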
The Shadow of Assumptions and the Path Towards Collaborative Intelligence
Despite its demonstrated capabilities, JutulGPT underscores a critical challenge in AI-driven simulation: the prevalence of tacit assumptions baked into the default settings of the underlying models. These assumptions, often concerning physical processes or environmental conditions, operate as hidden constraints that significantly influence the agent's reasoning and subsequent actions. Crucially, these foundational beliefs are rarely explicitly documented or subjected to scrutiny during the simulation process, creating a potential source of systematic error or unexpected outcomes. The system's performance, therefore, isn't solely a measure of its reasoning ability, but also a reflection of the often-unacknowledged biases inherent in the simulator's initial configuration, highlighting the need for transparent assumption logging and active validation within agent-based modeling.
Future development envisions a shift towards more collaborative AI through paradigms like Agentic Coding and Vibe Coding. These approaches prioritize intuitive, human-in-the-loop interactions, moving beyond simple prompt-response cycles to establish genuine partnerships with the AI. Agentic Coding allows users to guide the AI's reasoning process with high-level directives, while Vibe Coding focuses on communicating intent through nuanced signals, mirroring human communication. Such methods are designed not only to improve task performance but also to build trust and transparency, fostering a sense of shared understanding between humans and AI systems, ultimately enabling more effective and insightful problem-solving.
A key feature of this research lies in its comprehensive logging system, meticulously recording every step of the model's construction. Beyond simply tracking prompts and retrieved documentation, the framework diligently catalogs the agent's underlying assumptions and the iterative repair processes undertaken to refine its reasoning. This complete audit trail allows for detailed post-hoc analysis, enabling researchers to pinpoint the origins of errors, identify biases embedded within the system, and gain a deeper understanding of how the model arrives at its conclusions. Consequently, the framework not only builds an intelligent agent, but also provides a transparent and verifiable record of its cognitive development, fostering trust and facilitating ongoing improvement.
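Such an audit trail can be sketched as an append-only event log. The event kinds, example entries, and class name below are hypothetical; the paper's actual logging schema is not reproduced here:

```python
import json
import time

class AuditLog:
    """Append-only trace of prompts, assumptions, and repair attempts,
    serializable for post-hoc analysis. A minimal sketch only."""

    def __init__(self):
        self.events = []

    def record(self, kind, detail):
        """Append one timestamped event (e.g. 'prompt', 'assumption',
        'repair') to the trace."""
        self.events.append({"t": time.time(), "kind": kind, "detail": detail})

    def assumptions(self):
        """Surface every recorded modeling assumption for review."""
        return [e["detail"] for e in self.events if e["kind"] == "assumption"]

    def dump(self):
        """Serialize the full trail as JSON for archiving or audit."""
        return json.dumps(self.events, indent=2)

log = AuditLog()
log.record("prompt", "build a quarter-five-spot waterflood model")
log.record("assumption", "no-flow outer boundary (simulator default)")
log.record("repair", "retry after undefined variable in generated code")
```

The `assumptions()` view is the part that matters for the paper's argument: it turns tacit, default-resolved choices into an explicit artifact that downstream analysis can inspect.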
The development of this framework signifies a crucial step towards realizing "Scientific Collaborators": artificial intelligence systems poised to revolutionize the pace of discovery. These collaborators transcend simple data analysis; they actively integrate reasoning processes, the execution of simulations or experiments, and the subsequent interpretation of results, all within a unified system. By autonomously connecting hypothesis generation with empirical testing, and adapting strategies based on observed outcomes, these AI collaborators offer the potential to accelerate scientific progress across diverse fields. This seamless integration promises not merely assistance to researchers, but a true partnership in unraveling complex scientific questions, potentially identifying patterns and insights previously obscured by the limitations of human analysis or the scale of available data.
The pursuit of agentic simulation, as detailed in this work, necessitates a rigorous adherence to provability, not merely observed functionality. The system's capacity to orchestrate simulations and validate them against physical realities, grounding execution in measurable outcomes, echoes a sentiment shared by Ada Lovelace, who observed that "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform." This is not a limitation, but a feature; the system's fidelity lies in the transparency of its operations, revealing the invariant rules governing the simulated world. Implicit modeling assumptions, highlighted as a crucial area for reproducibility, represent precisely those unstated "orders" that, if left unexamined, can introduce subtle, yet critical, errors into the system's logic.
Future Directions
The presented work, while demonstrating a functional orchestration of simulation and language models, merely scratches the surface of a deeper, and frankly more troubling, issue. The emphasis on "execution-grounded" validation is, of course, laudable; a simulation is worthless without correspondence to observed reality. However, the inescapable truth remains: any simulation is predicated on a set of assumptions, often implicit, regarding the underlying physics. To claim reproducibility requires not just replicating the process, but a formal, provable accounting for every such assumption, and a demonstration of its impact on the resultant output. The current landscape favors empirical testing; a mathematically rigorous proof of model fidelity remains elusive.
Future efforts should not focus on simply scaling the agentic workflow or integrating additional simulators. A more fruitful avenue lies in developing formal methods for identifying, representing, and validating these implicit assumptions. Consider, for instance, the challenge of differing boundary conditions: a seemingly minor alteration can yield drastically different results, yet is often treated as a mere "tuning parameter". A truly robust system demands a declarative representation of these choices, allowing for a formal analysis of their influence.
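A declarative representation of boundary-condition choices might look like the following sketch, which tags each choice with its provenance so that default-resolved assumptions can be surfaced mechanically. The class, field names, and example values are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BoundaryCondition:
    """Declarative record of one boundary-condition choice, flagging
    whether it was stated by the user or inherited from a default."""
    face: str                 # e.g. "outer", "top", "bottom"
    kind: str                 # e.g. "no_flow", "constant_pressure"
    value: Optional[float] = None
    source: str = "simulator_default"   # vs "user_specified"

def implicit_choices(conditions):
    """Return every boundary condition that was never stated explicitly --
    the tacit assumptions that should be surfaced, not silently applied."""
    return [c for c in conditions if c.source == "simulator_default"]

bcs = [
    BoundaryCondition("outer", "no_flow"),
    BoundaryCondition("top", "constant_pressure", 1.0e7,
                      source="user_specified"),
]
```

Because the representation is data rather than buried procedure, the influence of each choice can be analyzed, diffed across runs, and logged alongside the model it shaped.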
The ultimate goal is not simply to build simulations that appear correct, but to construct models whose behavior can be mathematically predicted. The pursuit of elegant algorithms, founded on irrefutable logic, remains paramount. Empirical success is, at best, a provisional indicator – a temporary reprieve from the inevitable scrutiny of mathematical truth.
Original article: https://arxiv.org/pdf/2603.00214.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/