Uncovering Hidden Equations with Probabilistic Trees

Author: Denis Avetisyan


A new framework leverages variational inference and soft symbolic trees to automatically discover interpretable mathematical relationships within complex datasets.

Across 10 repetitions of a 90/10 train-test split, the [latex]\mathsf{VaSST}[/latex] model, configured with [latex]K=3[/latex] and [latex]D=3[/latex], demonstrates robust performance in learning the function [latex]y = x_0^2 - x_1 + \tfrac{1}{2} x_2^2[/latex] across varying noise levels, as evidenced by consistently low out-of-sample Root Mean Squared Errors (RMSE).

VaSST enables scalable and uncertainty-aware symbolic regression for scientific discovery.

Despite increasing demand for interpretable machine learning, recovering explicit equations from data, a core goal of symbolic regression, remains challenging due to the computational intractability of exploring complex expression spaces. This work introduces VaSST (Variational Inference for Symbolic Regression using Soft Symbolic Trees), a novel probabilistic framework employing a continuous relaxation of symbolic trees to enable scalable and principled uncertainty quantification. By transforming combinatorial search into gradient-based optimization, VaSST achieves superior performance in both structural recovery and predictive accuracy compared to state-of-the-art methods across benchmark datasets. Could this approach unlock a new era of automated scientific discovery through transparent and reliable machine learning?


Unveiling Systemic Patterns: The Challenge of Equation Discovery

Historically, building models of complex systems – from weather patterns to biological networks – has frequently depended on researchers assuming the underlying mathematical form of the relationships at play. This approach necessitates pre-defining equations – such as linear, exponential, or polynomial functions – and then adjusting their parameters to fit observed data. While seemingly pragmatic, this reliance on pre-defined functional forms can severely limit discovery. Crucially, it introduces a strong bias, effectively preventing the identification of governing equations that deviate from these initial assumptions. Consequently, truly novel or unexpected relationships hidden within the data may remain obscured, hindering a complete understanding of the system and potentially leading to inaccurate predictions. The inherent constraint of this methodology underscores the need for methods that aren’t constrained by pre-conceived notions of how a system should behave, but instead allow the data to reveal its own governing principles.

The historical approach to modeling physical systems has heavily depended on researchers painstakingly constructing equations based on existing knowledge and educated guesses. This process, while valuable, is inherently limited by human biases and the sheer complexity of many real-world phenomena. Constructing even a moderately complex model can demand months or years of effort, and crucially, this manual approach often struggles to uncover subtle or non-intuitive relationships present within the data itself. Important predictive factors may remain hidden, obscured by pre-conceived notions about the underlying mechanisms or simply overlooked due to the vastness of possible interactions. Consequently, models built through manual equation crafting may provide incomplete or inaccurate representations of the system, hindering both understanding and predictive power – a significant limitation given the increasing availability of large, complex datasets.

The limitations of traditional modeling approaches are prompting a significant evolution in scientific inquiry. Rather than relying on researchers to hypothesize and manually construct equations, a new wave of automated methods aims to directly uncover the underlying mathematical relationships governing complex systems. These techniques, often leveraging advancements in machine learning and symbolic regression, analyze observational data to identify [latex]f(x)[/latex] – the functional forms that best describe the interactions within the system. This data-driven approach promises to accelerate discovery in fields ranging from physics and chemistry to biology and engineering, potentially revealing previously unknown or obscured principles and offering more accurate predictive models without the constraints of pre-defined assumptions.

Decoding Relationships: Symbolic Regression and its Limitations

Symbolic regression is a type of regression analysis that seeks to identify mathematical expressions representing relationships within data. Unlike traditional regression which assumes a predefined functional form – such as linear or polynomial – symbolic regression automatically discovers the structure of the equation. This is accomplished by searching a space of possible mathematical operations – including addition, subtraction, multiplication, division, exponentiation, and trigonometric functions – and combining them with input variables to generate candidate equations. The quality of each equation is then evaluated using a fitness function, typically based on minimizing the error between the predicted and observed values. The process continues iteratively, refining the population of equations until a satisfactory model is found that accurately describes the underlying data relationships, expressed as a human-readable equation like [latex] y = ax^2 + bx + c [/latex].
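
As a toy sketch of this search-and-score loop (purely illustrative; the candidate set, target function, and `rmse` helper are invented for the example), candidate expressions can be represented as plain functions and ranked by their error on training data:

```python
import math

def rmse(candidate, xs, ys):
    """Fitness of a candidate expression: root mean squared error."""
    errs = [(candidate(x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

# Training data sampled from the (here secretly known) target y = 2x^2 + 3x + 1
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x ** 2 + 3 * x + 1 for x in xs]

# A few hand-written candidates standing in for a search population.
candidates = {
    "2x^2 + 3x + 1": lambda x: 2 * x ** 2 + 3 * x + 1,
    "x^2 + x":       lambda x: x ** 2 + x,
    "3x + 1":        lambda x: 3 * x + 1,
}

scores = {name: rmse(f, xs, ys) for name, f in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])  # the true form achieves zero error
```

A real symbolic-regression system would generate and mutate such candidates automatically rather than enumerating them by hand, but the fitness evaluation is essentially this.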

Genetic Programming (GP) and Bayesian Symbolic Regression (BSR) improve the equation search process in symbolic regression, but introduce significant computational costs. GP utilizes evolutionary algorithms to iteratively refine candidate equations, requiring extensive evaluation of numerous expressions across multiple generations. BSR, while employing Bayesian inference to guide the search, necessitates complex calculations involving prior probabilities, likelihood functions, and posterior distributions for each potential model. Both methods scale poorly with increasing data dimensionality and model complexity, as the number of possible equations grows exponentially. Consequently, even with optimized implementations, GP and BSR can demand substantial processing time and memory resources, limiting their applicability to high-dimensional datasets or real-time applications.

The search for optimal mathematical expressions in symbolic regression is hindered by the ‘curse of dimensionality’, stemming from the exponentially increasing size of the solution space as the complexity of potential equations grows. The number of possible equations – incorporating variables, constants, and mathematical operators – expands combinatorially with each added term or function. Consequently, exhaustive search methods become impractical even for moderate-complexity problems. Efficient algorithms, such as genetic programming with carefully designed fitness functions and Bayesian methods employing Markov Chain Monte Carlo (MCMC) sampling, are therefore crucial. These techniques aim to intelligently navigate this vast solution space, focusing computational resources on promising regions while avoiding the need to evaluate every possible equation. Scalability relies on strategies to reduce the effective dimensionality, such as feature selection, regularization techniques, and the use of parallel computing architectures.
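
A back-of-the-envelope count makes the blow-up concrete. Assuming a grammar with four binary operators and five terminal symbols (both numbers chosen arbitrarily for illustration), the number of distinct expression trees of bounded depth grows roughly as the square of the count one level shallower:

```python
def count_expressions(depth, n_ops=4, n_terminals=5):
    """Count expression trees of depth at most `depth`.

    A tree is either a bare terminal (variable or constant), or one of
    `n_ops` binary operators applied to two subtrees of smaller depth.
    """
    if depth == 0:
        return n_terminals
    sub = count_expressions(depth - 1, n_ops, n_terminals)
    return n_terminals + n_ops * sub * sub

for d in range(5):
    print(d, count_expressions(d))
```

Already at depth 3 the count exceeds seven billion, which is why exhaustive enumeration is abandoned in favor of guided search.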

VaSST demonstrates superior computational scalability compared to BMS and BSR, handling increasing problem sizes efficiently.

Inferring Underlying Principles: Variational Inference for Scalable Symbolic Regression

Variational Inference (VI) addresses the intractability of calculating the posterior distribution in Symbolic Regression by approximating it with a simpler, tractable distribution [latex]q(\theta)[/latex]. Traditional methods often require evaluating integrals over the model parameter space [latex]\theta[/latex], which becomes computationally prohibitive as model complexity increases. VI transforms the problem into an optimization task: minimizing the Kullback-Leibler (KL) divergence between [latex]q(\theta)[/latex] and the true posterior [latex]p(\theta|D)[/latex], where [latex]D[/latex] represents the training data. This optimization allows for efficient estimation of model parameters by leveraging gradient-based methods, enabling Symbolic Regression to scale to larger datasets and more complex expressions that would otherwise be computationally infeasible. The accuracy of the approximation depends on the chosen variational family for [latex]q(\theta)[/latex], with more flexible families generally yielding better approximations at the cost of increased computational complexity.
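
For intuition, the one-dimensional Gaussian case admits a closed-form KL divergence, so the VI optimization can be sketched end-to-end with plain gradient descent (a deliberately simplified stand-in for the paper's actual objective; the target posterior is known here only for illustration):

```python
import math

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between two univariate Gaussians, in closed form."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sd_p ** 2) - 0.5)

mu_q, sd_q = 0.0, 1.0   # initial variational parameters
mu_p, sd_p = 2.0, 0.5   # "true" posterior, known here for illustration
lr = 0.05
for _ in range(2000):
    # analytic gradients of KL(q || p) w.r.t. mu_q and sd_q
    g_mu = (mu_q - mu_p) / sd_p ** 2
    g_sd = -1.0 / sd_q + sd_q / sd_p ** 2
    mu_q -= lr * g_mu
    sd_q -= lr * g_sd

print(round(mu_q, 3), round(sd_q, 3))  # approaches (2.0, 0.5)
```

In realistic settings the KL term cannot be written in closed form, which is where the Monte Carlo estimators of the next subsections come in.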

Soft Symbolic Trees address the intractability of searching the space of possible symbolic expressions by representing these expressions with continuous, differentiable parameters. Traditional symbolic regression relies on discrete operators, precluding the use of gradient-based optimization techniques. Soft trees utilize continuous relaxations of discrete choices – such as node type, operator selection, and branching – allowing the entire expression to be represented as a computational graph amenable to backpropagation. This permits optimization via standard methods like stochastic gradient descent, efficiently adjusting tree structure and parameters to minimize a loss function that measures the discrepancy between the predicted and actual outputs. The relaxation introduces probabilistic interpretations to tree elements, with the final symbolic expression typically obtained through a process of argmax or sampling after optimization, effectively discretizing the softened representation. The gradient [latex]\frac{\partial L}{\partial \theta}[/latex] of the loss can then be used to update the tree parameters [latex]\theta[/latex].
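
A minimal sketch of such a relaxation (not the paper's exact parameterization) replaces the discrete operator choice at a node with a softmax-weighted mixture over all operators, which is differentiable in the logits:

```python
import math

# Candidate binary operators at a single tree node.
OPS = [lambda a, b: a + b,
       lambda a, b: a - b,
       lambda a, b: a * b]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_node(logits, a, b):
    """Differentiable relaxation: weighted sum over all operators."""
    w = softmax(logits)
    return sum(wi * op(a, b) for wi, op in zip(w, OPS))

# Nearly one-hot logits recover the discrete choice (here: multiplication).
soft_value = soft_node([0.0, 0.0, 10.0], 3.0, 4.0)   # close to 3*4 = 12
# After training, the hard expression is read off with argmax.
w = softmax([0.0, 0.0, 10.0])
hard_value = OPS[w.index(max(w))](3.0, 4.0)
print(soft_value, hard_value)
```

Driving one logit large collapses the mixture onto a single operator, so gradient descent on the logits smoothly interpolates between exploring operators and committing to one.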

Black-Box Variational Inference (BBVI) addresses the computational limitations of traditional Variational Inference by eliminating the requirement for explicit gradient calculations through the model parameters and the variational distribution. This is achieved via the reparameterization trick, which allows gradients to be estimated using samples from the variational distribution. Consequently, BBVI is applicable to symbolic regression problems where the objective function or the structure of the symbolic expression may be non-differentiable or complex, preventing the use of standard gradient-based optimization techniques. The removal of gradient requirements notably expands the scalability of variational symbolic regression to datasets and model complexities previously intractable due to computational expense.
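
The reparameterization trick itself is easy to demonstrate in isolation: writing [latex]z = \mu + \sigma\epsilon[/latex] with [latex]\epsilon \sim \mathcal{N}(0,1)[/latex] moves the randomness out of the parameters, so a Monte Carlo average of pathwise derivatives estimates the gradient. The toy objective below ([latex]\mathbb{E}[z^2][/latex], with exact gradient [latex]2\mu[/latex]) is chosen only because its answer is checkable; it is not VaSST's objective:

```python
import random

random.seed(0)

def grad_mu_of_E_z2(mu, sd, n=200_000):
    """Pathwise Monte Carlo estimate of d/d_mu E[z^2] for z = mu + sd*eps."""
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        z = mu + sd * eps
        total += 2.0 * z   # d(z^2)/d_mu = 2*z, since dz/d_mu = 1
    return total / n

est = grad_mu_of_E_z2(mu=1.5, sd=0.7)
print(est)  # close to the exact value 2*mu = 3.0
```

Because the gradient flows through the sample rather than through the density, the same estimator applies even when the downstream loss has no convenient analytic form.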

Balancing Predictive Power and Generalization

The principle of Occam’s Razor, often summarized as “the simplest explanation is usually the best,” plays a vital role in building predictive models that perform well on unseen data. Complex models, while capable of perfectly memorizing training data, frequently struggle to generalize, meaning they fail to accurately predict outcomes for new, previously unencountered examples. This phenomenon, known as overfitting, arises because the model learns not only the underlying patterns but also the noise and idiosyncrasies specific to the training set. By favoring simpler models, those with fewer parameters or more constrained structures, researchers actively discourage the model from learning these irrelevant details. This preference for parsimony encourages the identification of the core relationships within the data, leading to solutions that are more robust, interpretable, and capable of accurately predicting future observations. Essentially, a simpler model is less likely to be distracted by noise and more likely to capture the true signal, ultimately enhancing its ability to generalize beyond the confines of the training data.

Decision tree models, while powerful, are susceptible to creating overly complex structures that memorize training data instead of learning underlying patterns; this phenomenon is known as overfitting. To counteract this, Depth Adaptive Split Probability introduces a nuanced prior that discourages the creation of excessively deep trees. The technique effectively assigns lower probabilities to splits at greater depths, thereby penalizing complexity during the tree-building process. This isn’t simply a blanket restriction, however; the probability of a split is dynamically adjusted based on the data itself, allowing for deeper trees where genuinely needed to capture intricate relationships. Consequently, the model favors more parsimonious solutions – simpler trees that generalize better to unseen data – by implicitly prioritizing explanations that achieve sufficient accuracy with minimal complexity.
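
One common way to realize such a prior is to let the split probability fall off polynomially with depth; the BART-style form below is an assumption for illustration (the paper's exact prior may differ, and `alpha`/`beta` are arbitrary example values):

```python
def p_split(depth, alpha=0.95, beta=1.5):
    """Probability that a node at the given depth splits rather than
    becoming a leaf; decays with depth to penalize deep trees."""
    return alpha * (1 + depth) ** (-beta)

for d in range(5):
    print(d, round(p_split(d), 3))
```

With these values, splitting is near-certain at the root but quickly becomes unlikely a few levels down, so deep branches must earn their complexity through a correspondingly large improvement in fit.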

Variational Inference offers a powerful approach to model building by strategically navigating the trade-off between accuracy and interpretability. This technique doesn’t simply search for the single ‘best’ equation, but instead defines a probability distribution over possible equations, guided by the principle of minimizing the Kullback-Leibler ([latex]\mathrm{KL}[/latex]) divergence. This divergence measures the difference between the approximate distribution and the true, but often intractable, posterior distribution. Critically, carefully designed ‘priors’ are incorporated into this process; these priors reflect pre-existing knowledge or preferences for simpler, more understandable models. By favoring solutions that align with these priors, perhaps penalizing overly complex terms or promoting sparsity, Variational Inference effectively steers the search towards equations that not only fit the observed data well, but also remain readily interpretable by researchers and practitioners. This focus on both accuracy and clarity is particularly valuable in fields where understanding the underlying mechanisms is as important as predictive power.

Across 10 repetitions of a 90/10 train-test split, [latex]\mathsf{VaSST}[/latex] (with [latex]K=3[/latex] and [latex]D=3[/latex]) achieves lower out-of-sample RMSEs than competing methods when learning [latex]y = 6\sin(x_0)\cos(x_1)[/latex] under varying noise conditions.

Towards a New Era of Scientific Discovery

Scientific Machine Learning, or SciML, represents a paradigm shift in discovery by intelligently combining the strengths of both physics-based modeling and data-driven techniques like Symbolic Regression. This integration isn’t simply about applying machine learning to science; it’s about building systems that leverage existing domain knowledge – the established laws and principles governing a phenomenon – and then use data to refine, extend, or even rediscover those principles. Symbolic Regression, a key component of SciML, automatically searches for mathematical equations that best describe observed data, potentially uncovering hidden relationships or simplifying complex models. By grounding machine learning in fundamental scientific understanding, SciML accelerates progress, allowing researchers to move beyond purely empirical observations and towards more robust, interpretable, and generalizable scientific insights. This approach offers the potential to unlock discoveries across diverse fields, from materials science and drug discovery to climate modeling and astrophysics.

The capacity to automatically uncover underlying mathematical relationships within data represents a paradigm shift across scientific inquiry. Traditionally, researchers formulate hypotheses and then seek data to confirm or refute them; however, automated equation discovery reverses this process, allowing algorithms to explore data and propose governing equations without prior assumptions. This approach proves particularly valuable when dealing with complex systems where relationships are non-intuitive or obscured by noise, offering potential breakthroughs in fields ranging from physics and chemistry to biology and finance. By identifying these hidden mathematical structures, scientists gain not only predictive models but also deeper insights into the fundamental mechanisms driving observed phenomena, accelerating the pace of discovery and potentially revealing previously unknown scientific principles. The ability to distill complex data into concise, interpretable equations promises to democratize scientific exploration, enabling researchers to focus on interpreting results rather than painstakingly formulating and testing hypotheses.

The trajectory of automated scientific discovery is increasingly reliant on sophisticated probabilistic methods, and ongoing developments in Variational Inference (VI) promise to unlock even greater potential. VI allows researchers to approximate complex probability distributions, crucial for modeling uncertainty inherent in scientific data and simulations. As these techniques mature, they are becoming more computationally efficient and easier to implement, broadening access beyond specialized machine learning experts. This accessibility is pivotal, as it enables domain scientists – physicists, chemists, biologists – to directly leverage the power of probabilistic programming for equation discovery and model building. Future advancements will likely focus on scaling VI to handle even larger datasets and more complex models, further accelerating the pace of scientific breakthroughs by seamlessly integrating data-driven insights with established domain knowledge – ultimately transforming how scientific hypotheses are formulated and tested.

The newly developed VaSST framework demonstrates a significant leap in automated equation discovery, achieving an out-of-sample Root Mean Squared Error (RMSE) of just 0.08 on Simulation 1. This performance notably surpasses that of established methods like QLattice, which yielded an RMSE of 0.12, and gplearn, with a score of 0.15. This improvement isn’t merely incremental; it suggests VaSST possesses a superior capacity to generalize from data and accurately model underlying relationships, hinting at its potential to accelerate discovery across various scientific domains.

Investigations into the application of VaSST to Feynman equations reveal a substantial gain in computational efficiency. Analysis indicates the framework completes equation solving in just 250 seconds – a marked improvement over the 600 seconds required by the Bayesian Model Selection (BMS) method and the 500 seconds needed by Bayesian Symbolic Regression (BSR). This accelerated runtime suggests VaSST offers a practical advantage for researchers tackling complex physical modeling, potentially enabling quicker iterations and broader exploration of scientific hypotheses within the realm of quantum physics and beyond.

The successful application of the VaSST framework to the classic Feynman Equations yielded a remarkably high accuracy of 98.5%. This performance demonstrates a competitive edge in automated equation discovery. While other methods achieved comparable levels of precision, the VaSST framework’s ability to accurately reproduce known physical laws underscores its potential for not just identifying correlations within data, but also for rediscovering fundamental relationships governing complex systems – a crucial step towards truly interpretable and scientifically meaningful machine learning.

The pursuit of interpretable models, as demonstrated by VaSST, echoes a fundamental tenet of understanding any complex system: discerning underlying structure. Each observation, each data point, is but a surface manifestation of deeper dependencies. As Confucius stated, “Study the past if you would define the future.” This principle applies directly to the framework’s use of soft symbolic trees; they aren’t merely fitting equations to data, but building a probabilistic representation of the generative process, a ‘past’ from which future predictions can be reliably drawn. The emphasis on uncertainty quantification further reinforces this idea, acknowledging that complete certainty is an illusion, and that robust understanding necessitates acknowledging the limits of knowledge.

What Lies Ahead?

The introduction of VaSST offers a compelling, if tentative, step towards tractable uncertainty quantification in symbolic regression. The framework’s reliance on soft symbolic trees permits a scaling previously inaccessible, yet simultaneously introduces a new set of challenges. The very ‘softness’ that enables computational efficiency demands careful consideration; the interpretation of partially-defined expressions, while offering probabilistic insights, requires robust validation against established scientific principles. Every inferred expression is a challenge to understanding, not merely a model output.

Future work must address the inherent trade-off between expression complexity and interpretability. While VaSST excels at navigating the search space, the resulting symbolic trees – however ‘soft’ – still demand human scrutiny. Automation of this validation process, perhaps through integration with knowledge graphs or existing scientific databases, represents a critical next step. Further investigation into the interplay between prior distributions and the induced sparsity of the symbolic trees promises to refine the framework’s inductive biases.

Ultimately, the true test lies not in algorithmic elegance, but in demonstrable scientific discovery. The potential for VaSST, or frameworks like it, to unearth non-intuitive relationships within complex datasets is significant. However, the field must remain grounded in the understanding that correlation is not causation, and that every inferred equation is a hypothesis demanding rigorous experimental verification.


Original article: https://arxiv.org/pdf/2602.23561.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
