Beyond Code: How Physics Expertise Guided AI to Build Reliable Scientific Software

Author: Denis Avetisyan

A new case study demonstrates that effectively supervising AI coding agents for complex scientific tasks requires a focus on physical correctness and architectural oversight, not just raw code generation capability.

Physicist-supervised development, incorporating oracle testing and explanation agency, proved critical for achieving trustworthy autonomous resolution in scientific software creation.

While artificial intelligence increasingly automates complex tasks, ensuring the correctness, not merely the functionality, of AI-generated scientific software remains a challenge. This is explored in ‘Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software’, which details a 12-day collaboration between a physicist and an AI coding agent building a module for one-loop perturbation theory. The study found that trustworthy results hinged not on the agent’s inherent capabilities, but on focused human supervision emphasizing physical consistency and architectural oversight, practices that caught errors missed by automated tests. Can we design AI agents that proactively propose fundamental changes to their approach, rather than simply optimizing within predefined structures, and ultimately distinguish between predictive accuracy and true explanatory power?

Precision Cosmology: Navigating the Limits of Approximation

Cosmological models routinely employ perturbative calculations to chart the evolution of large-scale structure in the universe, essentially building a picture from small, manageable fluctuations. However, as observational precision increases – driven by increasingly sophisticated galaxy surveys – these approximations begin to falter. The universe isn’t perfectly linear; gravity’s influence becomes powerfully nonlinear at smaller scales, where matter clumps and collapses in complex ways. This breakdown isn’t merely a technical inconvenience; it introduces systematic errors into predictions about the abundance and distribution of galaxies, and ultimately limits the ability to accurately determine fundamental cosmological parameters. While perturbative methods remain valuable for initial insights, capturing the full richness of structure formation requires moving beyond these approximations, demanding either more sophisticated analytical techniques or computationally intensive numerical simulations that can directly model the nonlinear gravitational interactions.

Cosmological understanding hinges on interpreting the distribution of galaxies, but accurately deciphering these large-scale structures demands sophisticated modeling of gravity’s nonlinear effects. While gravity is well-understood in its linear regime, the interactions between galaxies and dark matter become profoundly complex at smaller scales, rendering standard perturbative techniques inadequate. Simulating these nonlinear gravitational dynamics requires immense computational resources – often necessitating supercomputers and months of processing time – to track the evolution of cosmic structures with sufficient precision. The challenge isn’t simply one of brute force computation; it also involves developing algorithms that can efficiently capture the intricate interplay of gravity and matter, while avoiding the introduction of spurious numerical artifacts that could skew cosmological inferences. Successfully navigating these computational hurdles is critical for unlocking the full potential of galaxy surveys and achieving a more accurate picture of the universe’s evolution.

Contemporary cosmological modeling frequently necessitates the implementation of empirical corrections – often termed ‘fudge factors’ – to reconcile theoretical predictions with observational data from galaxy surveys. These adjustments are not derived from established physical principles, but rather introduced as ad hoc parameters tuned to minimize discrepancies between simulations and real-world measurements. While effective in achieving a superficial agreement, this practice introduces a fundamental uncertainty; the underlying physics driving structure formation remains obscured by these phenomenological corrections. The reliance on such parameters limits the ability to confidently constrain cosmological parameters and raises questions about the true physical basis of the observed universe, potentially masking new physics beyond the standard model.

The pursuit of precise cosmological understanding is currently hampered by a dependence on empirically-derived corrections within theoretical models. While simulations accurately predict large-scale structure, discrepancies arise when matching these predictions to observations of galaxy distributions – requiring the introduction of ‘fudge factors’ to achieve alignment. These adjustments, though effective in forcing agreement, lack a foundation in established physics and obscure the underlying mechanisms driving cosmic structure formation. Consequently, the ability to confidently determine key cosmological parameters, such as the density of dark matter and the expansion rate of the universe, is compromised; estimates become less robust and potentially misleading. Until these ad-hoc corrections can be replaced with physically motivated solutions, a truly accurate and insightful picture of the cosmos remains elusive, hindering progress towards unraveling the fundamental laws governing the universe.

CLAX-PT: A Differentiable Framework for Precision

CLAX-PT is a JAX-based module designed for the computation of one-loop perturbation theory loop integrals. It provides a differentiable implementation, meaning gradients can be calculated with respect to input parameters. This allows for efficient evaluation of these integrals, which are crucial for precision cosmology and are often computationally expensive using conventional numerical methods. The framework handles the complexities of loop integration by leveraging automatic differentiation within the JAX ecosystem, facilitating both accurate results and integration with machine learning pipelines for tasks such as parameter estimation and model comparison.

CLAX-PT integrates with machine learning workflows through the utilization of automatic differentiation. This capability allows for the computation of derivatives of complex loop integral calculations with respect to input cosmological parameters without requiring manual derivation or numerical approximation of these derivatives. Consequently, CLAX-PT enables gradient-based optimization algorithms, such as those employed in Bayesian inference and maximum likelihood estimation, to directly adjust parameters and assess their impact on calculated observables. This seamless integration facilitates efficient parameter estimation, model comparison, and uncertainty quantification within modern machine learning frameworks, bypassing the need for computationally expensive derivative calculations or approximations.
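As a minimal sketch of what differentiating an observable with respect to its parameters looks like in JAX, the example below uses a toy spectrum in place of a real loop calculation. The function `toy_spectrum`, its functional form, and the parameter values are hypothetical and are not taken from CLAX-PT; only `jax.jacfwd` and `jax.grad` are actual JAX calls.

```python
import jax
import jax.numpy as jnp


def toy_spectrum(params, k):
    """Hypothetical stand-in for a loop-corrected power spectrum.

    P(k) = A * k**n / (1 + (k / k0)**2)**2, with k0 held fixed for simplicity.
    """
    amplitude, tilt = params[0], params[1]
    k0 = 0.1
    return amplitude * k**tilt / (1.0 + (k / k0) ** 2) ** 2


k = jnp.logspace(-3, 0, 32)      # wavenumbers (arbitrary units)
params = jnp.array([2.0, 0.96])  # (amplitude, tilt), illustrative values only

# Jacobian dP(k_i)/dparam_j via forward-mode autodiff: shape (32, 2)
jacobian = jax.jacfwd(toy_spectrum)(params, k)

# Gradient of a scalar summary (here the summed power) via reverse mode
grad_sum = jax.grad(lambda p: jnp.sum(toy_spectrum(p, k)))(params)
```

For this toy form the derivatives are known in closed form (dP/dA = P/A and dP/dn = P ln k), which gives a quick way to confirm that the automatic derivatives are exact rather than finite-difference approximations.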

CLAX-PT achieves computational efficiency in loop integral evaluation through the implementation of FFTLog Decomposition. This technique transforms the complex, multi-dimensional integration problem into a series of one-dimensional Fast Fourier Transforms (FFTs) and logarithmic operations. Traditional methods for loop integral calculation, such as Monte Carlo integration or sector decomposition, scale poorly with increasing loop order and dimensionality. FFTLog Decomposition offers a computational complexity of [latex]O(N \log N)[/latex], where N is the number of momentum points, representing a substantial improvement over the [latex]O(N^d)[/latex] scaling of direct numerical integration in d dimensions. This allows CLAX-PT to compute loop integrals at a fraction of the cost of conventional approaches, enabling faster parameter estimation and model exploration.
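The core of an FFTLog-style decomposition can be sketched as follows: sample the spectrum on a log-spaced grid, remove a real tilt, and take one FFT so that the spectrum becomes a sum of power laws with complex exponents. This is an illustrative sketch under stated assumptions, not CLAX-PT's implementation; the function names, the toy spectrum, and the choice of `bias` are hypothetical, and the endpoint weighting used in some published conventions is omitted.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # complex phases benefit from double precision


def fftlog_decompose(power, k, bias):
    """Return (c_m, eta_m) such that P(k_j) = sum_m c_m * k_j**(bias + i*eta_m).

    Assumes k is log-spaced; `bias` is a real tilt chosen by the user.
    """
    n = k.shape[0]
    delta = jnp.log(k[-1] / k[0]) / (n - 1)          # log-spacing of the grid
    m = jnp.round(jnp.fft.fftfreq(n) * n)            # signed integer mode numbers
    eta = 2.0 * jnp.pi * m / (n * delta)
    transformed = jnp.fft.fft(power * k ** (-bias))  # the O(N log N) step
    coeffs = transformed * jnp.exp(-1j * eta * jnp.log(k[0])) / n
    return coeffs, eta


def fftlog_reconstruct(coeffs, eta, k, bias):
    """Evaluate sum_m c_m * k**(bias + i*eta_m) directly (O(N^2), for checking only)."""
    exponents = (bias + 1j * eta)[None, :] * jnp.log(k)[:, None]
    return jnp.real(jnp.sum(coeffs[None, :] * jnp.exp(exponents), axis=1))


k = jnp.logspace(-4, 1, 128)
power = k / (1.0 + (k / 0.02) ** 2) ** 1.2  # toy spectrum, not a cosmological one
coeffs, eta = fftlog_decompose(power, k, bias=-0.3)
rebuilt = fftlog_reconstruct(coeffs, eta, k, bias=-0.3)
```

On the sampling grid the reconstruction reproduces the input to floating-point precision. The practical benefit is that an integral over a pure power law can often be done analytically, so the expensive step is replaced by algebra on the coefficients.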

CLAX-PT enables the direct calculation of gradients of loop integral results with respect to variations in cosmological parameters. This functionality is achieved through automatic differentiation within the JAX framework, eliminating the need for numerical derivative approximations or manual gradient calculations. Consequently, CLAX-PT facilitates the implementation of Bayesian inference techniques, such as Markov Chain Monte Carlo (MCMC), for robust parameter estimation and uncertainty quantification. The ability to efficiently compute gradients also supports model selection criteria, allowing for quantitative comparison of different cosmological models based on their likelihoods given observational data. This differentiable programming approach streamlines the process of connecting theoretical predictions to observational constraints, enabling more efficient and accurate cosmological analyses.
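One simple way to see how exact derivatives feed into uncertainty quantification is a Fisher-matrix forecast, used here as a lightweight stand-in for the MCMC workflows mentioned above. The two-parameter `model`, the fiducial values, and the 5% error bars are assumptions made purely for illustration.

```python
import jax
import jax.numpy as jnp


def model(params, k):
    """Hypothetical two-parameter spectrum: amplitude times a tilted power law."""
    return params[0] * k ** params[1]


k = jnp.linspace(0.05, 0.3, 20)
fiducial = jnp.array([1.5, -1.2])   # illustrative fiducial parameters
sigma = 0.05 * model(fiducial, k)   # assumed independent 5% Gaussian errors per bin

# Derivative of every data point with respect to every parameter: shape (20, 2)
jacobian = jax.jacfwd(model)(fiducial, k)

# Fisher matrix for independent Gaussian errors: F = J^T diag(1 / sigma^2) J
fisher = jacobian.T @ (jacobian / sigma[:, None] ** 2)

# Forecast marginalized 1-sigma uncertainties from the inverse Fisher matrix
forecast_errors = jnp.sqrt(jnp.diag(jnp.linalg.inv(fisher)))
```

The same Jacobian could instead be passed to a gradient-based sampler or optimizer; the Fisher matrix is shown only because it is deterministic and short.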

Validation and Autonomous Correction: A Robust System

Rigorous validation of CLAX-PT involved a comparative analysis against the established CLASS-PT reference code, focusing on predicted power spectra. This process demonstrated a high degree of agreement between the two codes, with discrepancies consistently remaining below 1%. The validation methodology ensured that CLAX-PT’s output was demonstrably consistent with a well-established and thoroughly tested cosmological code, establishing a baseline for its reliability and correctness in predicting large-scale structure formation.

The system incorporates an automated Oracle Test Suite to verify code correctness throughout the development lifecycle. This suite functions by executing a predefined set of tests and comparing the results against expected outputs, effectively identifying discrepancies and flagging potential errors. The tests are designed to cover a broad range of input conditions and edge cases, ensuring comprehensive evaluation of the code’s functionality. Automated error flagging allows developers to address issues promptly, reducing the risk of bugs propagating to later stages and improving overall code quality. The suite operates continuously, providing ongoing validation as new code is integrated.
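The comparison against a reference and the automated flagging described above can be sketched as a small harness that checks each named quantity against its oracle output and reports anything outside a tolerance. The case names and arrays below are synthetic; in the study the oracle is the CLASS-PT reference code, whose outputs are not reproduced here, and the helper names are hypothetical.

```python
import jax.numpy as jnp


def max_relative_error(candidate, reference):
    """Largest fractional deviation of candidate from reference."""
    return float(jnp.max(jnp.abs(candidate / reference - 1.0)))


def run_oracle_suite(cases, tolerance=0.01):
    """Return {name: error} for every case whose error exceeds the tolerance."""
    failures = {}
    for name, (candidate, reference) in cases.items():
        error = max_relative_error(candidate, reference)
        if error > tolerance:
            failures[name] = error
    return failures


k = jnp.logspace(-2, 0, 50)
reference = k ** -1.5  # synthetic "oracle" output
cases = {
    # Small oscillating mismatch, well inside the 1% tolerance
    "matter_spectrum": (reference * (1.0 + 0.004 * jnp.sin(10.0 * k)), reference),
    # Deliberate 3% offset, which the suite should flag
    "buggy_term": (reference * 1.03, reference),
}
failures = run_oracle_suite(cases)
```

Here only `buggy_term` is reported. A check of this kind confirms numerical agreement with the reference but says nothing about whether the code reaches that agreement for the right physical reasons, which is where the human supervision discussed in the study comes in.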

During testing, the automated system successfully resolved 10 of 15 identified issues without human intervention. Of the remaining 5, 2 necessitated manual correction, and 3 required 33 sessions to correct because they stemmed from a fundamentally incorrect code architecture. This demonstrates a substantial degree of automated bug resolution capability, although human oversight remains critical for addressing complex or architecturally rooted problems and ensuring efficient correction.

Initial autonomous bug resolution required an average of 10 iterations per issue; however, a critical architectural flaw necessitated 33 resolution sessions before correction. This discrepancy indicates that while the automated system effectively addresses localized errors, substantial issues stemming from fundamental design choices require human oversight for timely and accurate remediation. The extended resolution time associated with the architectural error underscores the necessity of integrating human expertise into the debugging workflow, even with advanced automated tools.

Precision and Control: Unlocking Cosmological Insights

Cosmological measurements rely on Baryon Acoustic Oscillations (BAO) as a standard ruler, but accurately interpreting these signals requires accounting for subtle distortions known as anisotropic damping. CLAX-PT distinguishes itself by precisely modeling this effect, which arises from the fact that BAO aren’t perfectly isotropic – they appear stretched and squashed due to the universe’s expansion history and the dynamics of matter. This anisotropic damping is particularly important for extracting precise cosmological parameters from large-scale structure surveys; neglecting it introduces systematic errors. By accurately capturing these distortions, CLAX-PT offers a more faithful representation of the underlying physics, improving the reliability of measurements aimed at understanding dark energy, dark matter, and the overall geometry of the universe.

The CLAX-PT framework addresses a significant challenge in cosmological simulations: accurately modeling the impact of large-scale bulk flows. These flows, arising from primordial density fluctuations, can induce systematic errors in measurements of Baryon Acoustic Oscillations (BAO) – a standard ruler used to map the expansion history of the universe. To mitigate this, CLAX-PT incorporates Infrared (IR) Resummation, a technique that systematically accounts for the long-wavelength fluctuations driving these bulk flows. This isn’t merely a correction, but a fundamental shift in approach, enabling a more accurate prediction of the observed BAO signal even in the presence of substantial cosmic flows. By effectively ‘summing over’ the infinite series of perturbations contributing to these flows, the framework minimizes uncertainties and improves the reliability of cosmological parameter estimation, delivering a more precise understanding of the universe’s evolution.
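At leading order, IR resummation with anisotropic damping is commonly written as a smooth ("no-wiggle") spectrum plus an oscillating ("wiggle") part suppressed by a direction-dependent exponential. The sketch below uses the form widely quoted in the perturbation-theory literature; whether CLAX-PT implements exactly this expression is an assumption, the Kaiser prefactor and one-loop terms are omitted, and all numerical inputs are illustrative.

```python
import jax.numpy as jnp


def damping_scale_squared(mu, growth_rate, sigma2, delta_sigma2):
    """Sigma_tot^2(mu) = (1 + f mu^2 (2 + f)) Sigma^2 + f^2 mu^2 (mu^2 - 1) dSigma^2."""
    f = growth_rate
    return (1.0 + f * mu**2 * (2.0 + f)) * sigma2 + f**2 * mu**2 * (mu**2 - 1.0) * delta_sigma2


def ir_resummed_spectrum(k, mu, p_nowiggle, p_wiggle, growth_rate, sigma2, delta_sigma2):
    """Smooth part left untouched; BAO wiggles exponentially damped."""
    sigma_tot2 = damping_scale_squared(mu, growth_rate, sigma2, delta_sigma2)
    return p_nowiggle + jnp.exp(-(k**2) * sigma_tot2) * p_wiggle


k = jnp.linspace(0.01, 0.3, 60)
p_nowiggle = k ** -1.5                                # toy smooth spectrum
p_wiggle = 0.05 * p_nowiggle * jnp.sin(105.0 * k)     # toy oscillation
f, sigma2, delta_sigma2 = 0.8, 30.0, 5.0              # illustrative values only

transverse = ir_resummed_spectrum(k, 0.0, p_nowiggle, p_wiggle, f, sigma2, delta_sigma2)
line_of_sight = ir_resummed_spectrum(k, 1.0, p_nowiggle, p_wiggle, f, sigma2, delta_sigma2)
```

Here `mu` is the cosine of the angle to the line of sight. Transverse modes (`mu = 0`) are damped by the plain `sigma2`, while modes along the line of sight (`mu = 1`) are damped by `(1 + f)**2 * sigma2`, so the wiggles are washed out more strongly in that direction.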

Conventional perturbative approaches to cosmology often struggle with short-wavelength, or ‘ultraviolet’ (UV), modes, necessitating the introduction of arbitrary cutoff scales or ‘fudge factors’ to maintain stability and accuracy. CLAX-PT distinguishes itself by directly incorporating UV counterterms into its calculations; these terms systematically absorb divergences arising from high-momentum modes, providing a physically motivated method for regulating short-scale behavior. This isn’t merely a mathematical trick; by explicitly accounting for these contributions, the framework avoids the need for ad-hoc modifications and ensures that predictions remain well-defined even at the smallest scales. The result is a more robust and reliable model, free from the ambiguities often associated with arbitrary cutoff procedures, and offering a principled pathway toward controlling potentially problematic short-distance physics within cosmological simulations.
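In its simplest real-space form, a UV counterterm is often written as a term proportional to k² times the linear spectrum, with a free coefficient that absorbs the cutoff dependence of the loop integrals. The sketch below assumes that common form and sign convention; the article does not state CLAX-PT's exact parameterization, and the toy spectra and the name `cs2` are hypothetical.

```python
import jax
import jax.numpy as jnp


def counterterm(cs2, k, p_linear):
    """Leading real-space counterterm, P_ctr(k) = -2 * cs2 * k^2 * P_lin(k)."""
    return -2.0 * cs2 * k**2 * p_linear


def total_spectrum(cs2, k, p_linear, p_loop):
    """Linear + one-loop + counterterm contributions."""
    return p_linear + p_loop + counterterm(cs2, k, p_linear)


k = jnp.linspace(0.01, 0.3, 30)
p_linear = 1.0e4 * k / (1.0 + (k / 0.02) ** 2) ** 1.2  # toy linear spectrum
p_loop = 0.1 * k**2 * p_linear                         # placeholder for a loop term
cs2 = jnp.asarray(1.0)                                 # illustrative coefficient

# Sensitivity of the full prediction to the counterterm coefficient
d_total_d_cs2 = jax.jacfwd(total_spectrum)(cs2, k, p_linear, p_loop)
```

Because the prediction is linear in `cs2`, its derivative is simply `-2 * k**2 * p_linear`, which grows toward small scales. That is the expected behavior: the coefficient matters most where short-distance physics starts to influence the prediction.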

CLAX-PT distinguishes itself in cosmological modeling through a commitment to physical accuracy and control, eschewing the need for empirically-determined, yet theoretically opaque, parameters often used to ‘fine-tune’ results. The framework agrees with the established CLASS-PT code to better than 1%, supporting its reliability without sacrificing computational efficiency – all within a concise implementation of approximately 2,100 lines of code. By explicitly modeling key effects like anisotropic baryon acoustic oscillation damping and large-scale flows, and through rigorous ultraviolet control, CLAX-PT establishes a foundation for more robust cosmological inference, minimizing systematic uncertainties and offering a pathway toward more dependable interpretations of the universe’s fundamental properties.

The pursuit of autonomous scientific software, as detailed in the case study, benefits from a focused supervisory approach that prioritizes physical correctness over sheer computational capability. This echoes Tim Berners-Lee’s sentiment: “The web is more a social creation than a technical one.” The study demonstrates that technical prowess alone is insufficient; instead, a ‘social’ element – in this case, human oversight grounded in physical principles – is vital. The architecture of trust, built upon verifiable physical constraints, becomes paramount. Clarity is the minimum viable kindness; a needlessly complex AI, absent such supervision, offers little practical benefit.

Beyond the Hype Cycle

The pursuit of autonomous code generation, predictably, has become enamored with capability. This work suggests that capability, divorced from demonstrable physical correctness, is merely a faster path to elegant errors. The true challenge isn’t building an agent that can code, but one that understands – or, failing that, can be reliably steered toward – what constitutes a meaningful solution within a physical model. They called it ‘oracle testing’; a rather generous term for repeatedly asking a human to confirm the machine hadn’t invented a perpetual motion device.

Future effort should resist the temptation to layer complexity upon complexity. The field fixates on ‘explanation agency’ as if verbose justification absolves a fundamentally flawed calculation. Simpler architectures, coupled with rigorous, physically-informed supervision protocols, likely offer a more robust path forward. The goal isn’t to replicate a physicist, but to amplify one – to offload the tedious aspects of implementation, not to outsource critical thinking.

One wonders if the current emphasis on large models is a distraction, a way to avoid confronting the genuinely difficult problem of encoding physical intuition. Perhaps the most fruitful avenue lies not in scaling up, but in scaling down – in seeking minimal, verifiable components, and accepting that some problems are best solved with a little human judgment.

Original article: https://arxiv.org/pdf/2605.30353.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-29 13:51