Uncovering Hidden Equations: AI Finds Stochastic Dynamics

Author: Denis Avetisyan


Researchers have developed a new artificial intelligence technique that automatically discovers the underlying equations governing complex, random systems.

The research introduces a method for discovering interpretable stochastic differential equations directly from observed time series data through genetic programming, optimizing both the drift [latex] f(x) [/latex] and diffusion [latex] g(x) [/latex] terms via tree-based variation operators, including crossover (exemplified by subtree exchange) and mutation of operators, to model the dynamics of stochastic systems.

This work presents a genetic programming approach for symbolic regression of stochastic differential equations, achieving competitive performance and scalability for high-dimensional problems.

While traditional system identification often overlooks the impact of noise, hindering the modeling of realistic dynamical systems, this work, ‘Symbolic Discovery of Stochastic Differential Equations with Genetic Programming’, introduces a novel genetic programming approach to simultaneously learn both the drift and diffusion terms of stochastic differential equations. By directly optimizing for maximum likelihood, the method accurately recovers governing equations, scales efficiently to higher dimensions, and generalizes to stochastic partial differential equations. This advancement extends symbolic regression toward interpretable discovery in noisy environments, but can these techniques be further adapted to uncover hidden stochasticity in even more complex, real-world phenomena?


The Inherent Uncertainty of Complex Systems

The natural world frequently presents systems governed by chance and intricate relationships, rendering traditional mathematical descriptions inadequate. Phenomena like weather systems, ecological populations, and even financial markets aren’t predictable through neat, linear equations; instead, they exhibit inherent randomness – a stochastic quality where future states aren’t solely determined by present conditions. This unpredictability arises from the non-linear interactions within these systems, where small changes can trigger disproportionately large effects – the hallmark of chaos. Consequently, attempts to model these complex processes with purely deterministic approaches often fall short, necessitating alternative frameworks that embrace the role of probability and acknowledge the limitations of finding closed-form, analytical solutions. These systems demand modeling techniques that can grapple with uncertainty and capture the emergent behaviors arising from countless interwoven factors.

Conventional modeling techniques frequently depend on simplifications to render complex systems mathematically tractable, yet these very simplifications can obscure crucial dynamics. For instance, assuming linearity-a straight-line relationship between cause and effect-allows for elegant solutions, but often fails to represent real-world phenomena where effects can be disproportionate or delayed. Similarly, averaging out random fluctuations or ignoring feedback loops can produce models that predict stable states when, in reality, the system is prone to chaotic behavior or sudden shifts. While these approximations offer analytical convenience, they frequently come at the cost of predictive power, potentially leading to inaccurate forecasts or a fundamental misunderstanding of the system’s underlying mechanisms. The pursuit of realism, therefore, often requires abandoning these convenient simplifications in favor of more nuanced, though computationally demanding, approaches.

Addressing the limitations of conventional equations requires a shift towards computational modeling techniques that embrace the unpredictable nature of many real-world systems. These methods, such as agent-based modeling and stochastic differential equations, move beyond deterministic predictions by explicitly incorporating randomness and allowing for emergent behaviors. Rather than seeking a single, closed-form solution, these approaches simulate the interactions of numerous components, revealing patterns and sensitivities that are often obscured by simplification. This allows researchers to explore a wider range of possible outcomes and understand how complex systems respond to changing conditions – crucial for fields ranging from epidemiology and climate science to financial forecasting and ecological conservation. By acknowledging and integrating inherent uncertainties, these modeling techniques offer a more realistic and nuanced understanding of the world, even if complete predictability remains elusive.

GP-SDE accurately recovers the governing equations of stochastic dynamical systems across diverse test cases (the double well, van der Pol oscillator, Rössler attractor, Lorenz96, and Lotka-Volterra models), demonstrating superior performance, indicated by low mean squared error and consistently correct equation structure (stars), compared to Kramers-Moyal expansion with sparse regression and GP-ODE, even with multi-step integration.

Stochastic Dynamics: A Necessary Embrace of Randomness

Stochastic Differential Equations (SDEs) are mathematical models used to describe systems where evolution is impacted by random fluctuations. Unlike Ordinary Differential Equations (ODEs) which assume deterministic progression, SDEs incorporate a stochastic, or random, component – typically represented as a Wiener process or Brownian motion – to account for inherent uncertainties in the modeled system. This is crucial because many real-world processes, from molecular motion to financial markets, are subject to unpredictable noise. By integrating this randomness directly into the equation, SDEs provide a more accurate and realistic representation of system dynamics than purely deterministic approaches, allowing for probabilistic predictions rather than fixed outcomes. The general form of an SDE often includes a drift coefficient, representing the deterministic trend, and a diffusion coefficient, quantifying the strength of the random noise.

Traditional deterministic models in physics and engineering assume precise, predictable evolution of a system given initial conditions. However, many real-world processes are subject to random disturbances. Stochastic Differential Equations (SDEs) address this limitation by incorporating a diffusion term, typically represented as a Wiener process or Brownian motion [latex]dW(t)[/latex], into the governing equation. This term introduces randomness and accounts for unpredictable fluctuations originating from external noise or inherent system uncertainties. Mathematically, this addition transforms a standard differential equation, such as [latex]\frac{dx}{dt} = f(x,t)[/latex], into an SDE of the form [latex]dx = f(x,t)dt + g(x,t)dW(t)[/latex], where [latex]g(x,t)[/latex] determines the magnitude of the random fluctuations.
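This discretization is easy to state in code. Below is a minimal Euler-Maruyama sketch in plain Python (an illustration, not the paper's implementation); the Ornstein-Uhlenbeck drift [latex]f(x) = -\theta x[/latex] and constant diffusion [latex]g(x) = \sigma[/latex] are assumed purely for the example.

```python
import math
import random

def euler_maruyama(f, g, x0, dt, n_steps, rng=None):
    """Simulate dx = f(x) dt + g(x) dW with the Euler-Maruyama scheme."""
    rng = rng or random.Random(0)
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment ~ N(0, dt)
        x = x + f(x) * dt + g(x) * dw
        xs.append(x)
    return xs

# Ornstein-Uhlenbeck process: drift pulls toward 0, constant diffusion.
theta, sigma = 1.0, 0.3
path = euler_maruyama(lambda x: -theta * x, lambda x: sigma,
                      x0=2.0, dt=0.01, n_steps=1000)
```

Each Wiener increment is drawn as a Gaussian with variance [latex]dt[/latex]; setting [latex]g \equiv 0[/latex] recovers the ordinary Euler scheme for the underlying deterministic ODE.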

Stochastic Differential Equations (SDEs) find application across numerous scientific disciplines due to their ability to incorporate randomness. In the field of heat transfer, SDEs can model fluctuations in thermal conductivity or external noise affecting heat diffusion. The Fisher-KPP equation, a foundational model in population genetics and ecology, is often expressed as an SDE to account for demographic stochasticity and unpredictable environmental variations influencing population growth. Furthermore, complex systems like atmospheric dynamics are amenable to SDE modeling; the Lorenz96 model, used to study predictability in weather patterns, can be formulated as an SDE to represent uncertainties in atmospheric forcing and model parameters, thereby enabling probabilistic forecasting.

GP-SDE effectively recovers stochastic partial differential equations, accurately replicating the evolution of both the Fisher-KPP equation and two-dimensional heat transfer as demonstrated by its ability to match the ground truth systems.

Automated Equation Discovery: A Pursuit of Underlying Truth

Symbolic Regression is a data-driven technique used to identify mathematical equations that describe the relationships within a dataset. Unlike traditional modeling approaches that require a user to specify the functional form of the equation, Symbolic Regression automatically searches for the best-fit equation directly from the observed data. This is achieved by exploring a vast space of possible mathematical expressions, utilizing algorithms like Genetic Programming to evolve and refine candidate equations based on their ability to accurately predict the observed data. The resulting equation represents a compact and interpretable model of the underlying system, offering insights into the governing principles without requiring prior knowledge or assumptions about the system’s structure. The technique can be applied to various data types and system complexities, offering a versatile tool for scientific discovery and model building.

Symbolic Regression utilizes Genetic Programming (GP) to explore a vast space of potential mathematical expressions without requiring a pre-defined model structure. GP operates by creating an initial population of randomly generated equations – represented as tree-like structures composed of mathematical operators and variables – and then iteratively evolving this population through processes analogous to natural selection. Each equation’s performance is evaluated based on its ability to accurately predict observed data, typically quantified using a loss function such as Mean Squared Error. Equations with lower error are more likely to be selected for reproduction, with crossover and mutation operators introducing variation and exploring new regions of the equation space. This iterative process continues until a satisfactory equation, or a population of equations, is found that effectively captures the underlying relationships within the data, effectively discovering a mathematical model directly from observation.
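A toy version of this evolve-evaluate-select cycle fits in a few dozen lines of Python. The sketch below is mutation-only with a tiny primitive set and a made-up target function, so it illustrates the loop rather than the full GP-SDE method (which also applies subtree crossover and fits drift and diffusion terms jointly).

```python
import math
import random

OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}
TERMINALS = ['x', 1.0, 2.0]

def random_tree(rng, depth=3):
    # Leaf with some probability, or when maximum depth is reached.
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def mutate(tree, rng):
    # Replace a randomly chosen subtree with a fresh random one.
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth=2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def mse(tree, xs, ys):
    total = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys))
    total /= len(xs)
    return total if math.isfinite(total) else float('inf')

def evolve(xs, ys, pop_size=60, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: mse(t, xs, ys))
        survivors = pop[:pop_size // 3]  # elitist selection
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda t: mse(t, xs, ys))

# Illustrative target: y = x^2 + x, sampled on a grid.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x + x for x in xs]
best = evolve(xs, ys)
```

Each candidate is a nested tuple such as `('+', ('*', 'x', 'x'), 'x')`; fitness is plain MSE against the sampled data, and the best third of each generation survives unchanged while the rest are replaced by mutants.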

Kramers-Moyal Expansion (KME) and Sparse Regression (SR) are techniques used to improve the efficiency and interpretability of symbolic regression. KME approximates stochastic differential equations (SDEs) using a series expansion, effectively representing the system’s dynamics with a finite number of terms and reducing model complexity. SR, conversely, focuses on identifying the most significant terms within a potential equation by enforcing sparsity – driving coefficients of less influential terms to zero. This results in simplified equations that highlight the key relationships governing the data while minimizing overfitting and computational cost. Both methods contribute to a more focused search space for symbolic regression algorithms, improving their ability to discover accurate and concise models.
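The sparsity mechanism can be illustrated with sequential thresholded least squares over a fixed library of candidate terms (the scheme popularized by SINDy-style methods; the basis, threshold, and pure-Python solver below are illustrative assumptions, not the paper's implementation). The example recovers the double-well drift [latex]f(x) = x - x^3[/latex] from noisy samples.

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col] != 0:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def least_squares(X, y):
    # Normal equations: X^T X w = X^T y.
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    Xty = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    return solve(XtX, Xty)

def stlsq(X, y, threshold=0.1, iters=5):
    # Repeatedly fit, zero out small coefficients, and refit.
    n = len(X[0])
    active = list(range(n))
    w = [0.0] * n
    for _ in range(iters):
        Xa = [[row[j] for j in active] for row in X]
        wa = least_squares(Xa, y)
        w = [0.0] * n
        for j, c in zip(active, wa):
            w[j] = c
        active = [j for j in range(n) if abs(w[j]) >= threshold]
        if not active:
            break
    return w

# Candidate library for the drift: [1, x, x^2, x^3].
rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(200)]
ys = [x - x ** 3 + rng.gauss(0, 0.05) for x in xs]  # noisy f(x) = x - x^3
X = [[1.0, x, x ** 2, x ** 3] for x in xs]
w = stlsq(X, ys)
```

After the first fit, the near-zero coefficients on the constant and [latex]x^2[/latex] terms fall below the threshold and are pruned, and the model is refit on the surviving terms, yielding a concise, interpretable drift.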

Evaluations demonstrate that the GP-SDE approach achieves Mean Squared Error (MSE) comparable to, and in some cases lower than, that of both Kramers-Moyal Symbolic Regression (KM-SR) and Genetic Programming for Ordinary Differential Equations (GP-ODE) across a suite of established benchmark problems. Specifically, GP-SDE was tested on the double well potential, the van der Pol oscillator, and the Rössler attractor, demonstrating its ability to accurately model dynamical systems in these contexts. The resulting MSE values indicate a statistically significant level of accuracy relative to the compared methods, suggesting GP-SDE is a viable alternative for automated equation discovery.

Simulation of the Rössler attractor using different modeling techniques (ordinary differential equations evolved by genetic programming, and stochastic differential equations derived from both genetic programming and Kramers-Moyal expansion with sparse regression) demonstrates that all methods can accurately approximate the system’s trajectories from a fixed initial condition, as shown by the close alignment of simulated paths (colored lines) with the true system (black lines) and their respective statistical distributions (shaded areas).

Validating the Model: A Rigorous Test of Accuracy

The fidelity of simulations built on stochastic differential equations (SDEs) is inextricably linked to the precision of numerical integration techniques. Because analytical solutions to SDEs are often intractable, researchers rely on algorithms to approximate their behavior over time. Accurate integration isn’t merely about generating a plausible trajectory; it’s foundational for validating equations discovered through symbolic regression. These regression techniques aim to unearth the underlying dynamics governing a system, but the resulting equations are only as trustworthy as the data used to confirm them. If the numerical integration introduces substantial error, it becomes difficult to discern whether discrepancies between the model and observed data stem from flaws in the discovered equation or simply inaccuracies in the simulation itself. Consequently, robust and reliable numerical integration is paramount, ensuring that any identified relationships accurately reflect the true underlying dynamics and aren’t artifacts of the simulation process.

Multi-step integration techniques represent a powerful approach to numerically solving stochastic differential equations, offering computational efficiency vital for both long-term simulations and comprehensive parameter studies. Unlike single-step methods that rely solely on the current state, these methods leverage information from previous time steps to predict future behavior, drastically reducing the computational burden. This efficiency becomes particularly crucial when exploring complex dynamical systems where numerous simulations are needed to map out parameter space or to assess the robustness of discovered equations. By effectively approximating solutions without requiring excessively small time steps, multi-step integration enables researchers to investigate system dynamics over extended periods, uncovering long-term trends and behaviors that would be inaccessible with more computationally demanding approaches. The ability to efficiently simulate system trajectories under varying parameter conditions is therefore foundational to validating discovered equations and building predictive models.
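As an illustration of the multistep principle (not necessarily the scheme used in the paper), the sketch below reuses the previous drift evaluation via two-step Adams-Bashforth weights inside an otherwise standard Euler-Maruyama update; the Ornstein-Uhlenbeck test system is an assumption for the example.

```python
import math
import random

def multistep_em(f, g, x0, dt, n_steps, rng=None):
    """Euler-Maruyama with a two-step Adams-Bashforth treatment of the
    drift: the deterministic part uses 1.5*f(x_n) - 0.5*f(x_{n-1}),
    reusing the previous drift evaluation instead of a smaller step."""
    rng = rng or random.Random(0)
    xs = [x0]
    f_prev = f(x0)
    # Bootstrap the first step with plain Euler-Maruyama.
    x = x0 + f_prev * dt + g(x0) * rng.gauss(0.0, math.sqrt(dt))
    xs.append(x)
    for _ in range(n_steps - 1):
        f_curr = f(x)
        drift = 1.5 * f_curr - 0.5 * f_prev  # Adams-Bashforth 2 weights
        x = x + drift * dt + g(x) * rng.gauss(0.0, math.sqrt(dt))
        f_prev = f_curr
        xs.append(x)
    return xs

# Illustrative run on an Ornstein-Uhlenbeck process.
path = multistep_em(lambda x: -2.0 * x, lambda x: 0.1,
                    x0=1.0, dt=0.05, n_steps=200)
```

The extra accuracy on the drift comes at essentially no cost: each step still needs only one new drift evaluation, which is what makes multistep schemes attractive for long simulations and large parameter sweeps.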

Maximum Likelihood Estimation (MLE) plays a vital role in refining the accuracy of discovered equations by determining the parameter values that best align the model’s predictions with observed data. This statistical method operates on the principle of finding the parameters that maximize the likelihood of obtaining the observed dataset, effectively quantifying the goodness of fit. In the context of symbolic regression and stochastic differential equations, MLE transforms a discovered equation – such as [latex] \frac{dx}{dt} = f(x, \theta) [/latex] – from a symbolic form into a predictive model by estimating the optimal values for its parameters [latex] \theta [/latex]. The process involves defining a likelihood function that represents the probability of observing the data given a specific set of parameters, then employing optimization algorithms to locate the parameter values that maximize this likelihood. Consequently, MLE not only validates the structural form of the discovered equation but also ensures its predictive power is maximized, allowing for reliable simulations and forecasting based on the learned model.
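Under an Euler-Maruyama discretization, the transition density is Gaussian, [latex] x_{t+\Delta t} \mid x_t \sim \mathcal{N}\big(x_t + f(x_t)\Delta t,\; g(x_t)^2 \Delta t\big) [/latex], so the log-likelihood of a trajectory is a sum of Gaussian log-densities. The sketch below recovers a drift parameter by grid search over this likelihood; the Ornstein-Uhlenbeck model and known [latex]\sigma[/latex] are simplifying assumptions for illustration.

```python
import math
import random

def neg_log_lik(theta, sigma, xs, dt):
    """Negative log-likelihood of a path under the Euler-Maruyama
    transition density for f(x) = -theta * x, g(x) = sigma."""
    var = sigma ** 2 * dt
    nll = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        mean = x0 - theta * x0 * dt
        nll += 0.5 * math.log(2 * math.pi * var) + (x1 - mean) ** 2 / (2 * var)
    return nll

# Simulate data from a known theta, then recover it by maximizing
# the likelihood (here, a simple grid search over theta).
rng = random.Random(0)
true_theta, sigma, dt = 1.5, 0.3, 0.01
x, xs = 1.0, [1.0]
for _ in range(20000):
    x = x - true_theta * x * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
    xs.append(x)
grid = [0.05 * i for i in range(61)]  # theta candidates in [0, 3]
theta_hat = min(grid, key=lambda t: neg_log_lik(t, sigma, xs, dt))
```

In practice one would optimize over all parameters of both [latex]f[/latex] and [latex]g[/latex] with a gradient-based or derivative-free optimizer rather than a coarse grid, but the likelihood being maximized has exactly this form.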

Investigations into high-dimensional systems, exemplified by the Lorenz96 model, reveal a notable difference in computational efficiency between GP-SDE and KM-SR. While KM-SR’s runtime escalates considerably as dimensionality increases to 10 and 20, GP-SDE demonstrates a comparatively stable performance profile. This resilience stems from the algorithmic design of GP-SDE, allowing it to manage the computational burden associated with higher dimensions more effectively. Consequently, GP-SDE offers a practical advantage when exploring and simulating complex systems where dimensionality is a significant factor, providing a pathway to tractable solutions where other methods become computationally prohibitive.

The effectiveness of GP-SDE is notably enhanced when paired with multi-step integration techniques, particularly when dealing with limited data availability. Investigations using the Lotka-Volterra model, a common system for ecological modeling, demonstrate that this combination maintains robust performance even with sparse sampling rates of 0.02, 0.2, and 0.5 – conditions that often challenge traditional methods. This resilience suggests that GP-SDE, when leveraged with multi-step integration, can accurately approximate system dynamics even when observations are infrequent, offering a significant advantage for modeling complex phenomena where continuous data collection is impractical or impossible. The method’s ability to extract meaningful insights from sparse datasets underscores its potential for applications in fields reliant on intermittent or incomplete observations.

As dimensionality increases on the Lorenz96 model, the runtime of KM-SR (evaluated with both four and sixteen bins) grows more rapidly than that of GP-ODE and GP-SDE, as demonstrated by averaging results over ten independent seeds after a compilation run.

The pursuit of symbolic regression, as demonstrated in this work concerning stochastic differential equations, aligns with a fundamental tenet of computational purity. The method’s success in discovering governing equations, particularly its improved scalability in higher dimensions, echoes the importance of provable solutions over mere empirical observation. As Edsger W. Dijkstra stated, “It’s not enough to show that something works; you must show why it works.” This paper doesn’t simply offer a ‘working’ method for system identification; it offers a pathway towards a demonstrably correct one, emphasizing the elegance of a solution derived through rigorous, symbolic manipulation, a true embodiment of mathematical beauty within the realm of machine learning.

What Lies Ahead?

The demonstrated capacity to evolve symbolic representations of stochastic differential equations, while promising, merely addresses the superficial aspects of the problem. The true challenge isn’t simply finding an equation that fits observed data – any curve can be approximated, given sufficient complexity. Instead, the pursuit must shift toward equations exhibiting inherent mathematical elegance, those that reveal underlying principles rather than merely describing phenomena. Current reliance on genetic programming, while scalable, remains fundamentally a search process – an exploration of a vast, largely meaningless, solution space.

A critical limitation resides in the definition of ‘fitness’. Current approaches prioritize minimizing error against a given dataset, a distinctly pragmatic, and ultimately unsatisfying, metric. Future work should investigate fitness functions that reward simplicity, symmetry, and consonance with established mathematical frameworks. The algorithm should, ideally, disprove hypotheses as readily as it confirms them, actively seeking contradictions and refining its models accordingly.

The extension to partial differential equations presents a substantial, though not insurmountable, hurdle. However, the more profound question concerns the nature of ‘discovery’ itself. Is the goal to automate the inductive leap, or merely to accelerate the existing process? True scientific advancement demands not simply more equations, but better ones – those possessing the capacity to generalize, predict, and illuminate the fundamental laws governing the universe. The algorithm must move beyond mimicry and embrace the austere beauty of mathematical truth.


Original article: https://arxiv.org/pdf/2603.09597.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-11 11:00