Author: Denis Avetisyan
A new benchmark challenges large language models to derive the equations defining 3D surfaces, revealing current limitations in their ability to perform geometric reasoning.

SurfaceBench assesses the capacity of self-evolving models to perform symbolic regression and discover the underlying equations of complex 3D scientific surfaces.
Despite advances in machine learning for science, discovering concise symbolic equations from data, particularly for complex 3D geometric phenomena, remains a significant challenge. This work introduces ‘SurfaceBench: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?’, a comprehensive benchmark comprising 183 tasks designed to rigorously evaluate symbolic regression methods on surface discovery. Our results reveal that state-of-the-art large language models struggle to generalize across surface complexities and equation representations, exposing limitations in their geometric reasoning. Can future benchmarks and methodologies bridge the gap between symbolic manipulation and accurate geometric reconstruction, fostering truly data-driven scientific discovery?
The Mathematical Imperative: Discerning Order from Data
The pursuit of scientific understanding frequently hinges on discerning the underlying mathematical relationships governing observed data. This process, known as symbolic regression, aims to automatically discover equations – such as $y = ax^2 + bx + c$ – that best fit a given dataset. However, this task presents a significant challenge; the space of possible equations is vast, and efficiently searching for the correct one requires sophisticated algorithms. Unlike simply finding a curve that fits the data, symbolic regression seeks a concise, interpretable equation revealing the fundamental principles at play. This ability to uncover governing equations from data, rather than relying solely on pre-defined models, holds immense promise for accelerating discovery across diverse fields, but remains a computationally demanding and statistically complex endeavor.
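To make the distinction concrete, here is a minimal sketch (an illustration, not any published system) that frames symbolic regression as search over a tiny hand-written grammar: each candidate structure has its constants fit by least squares and is scored by error plus a complexity penalty, so the most concise accurate equation wins.

```python
# Minimal symbolic-regression-as-search sketch. The grammar, penalty weight,
# and data are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 3 * x**2 - x + 1 + rng.normal(0, 0.05, x.size)  # hidden law: y = 3x^2 - x + 1

# Candidate structures; each model is a linear mix of basis functions,
# so its constants can be fit exactly with least squares.
candidates = {
    "a*x + b":          [x, np.ones_like(x)],
    "a*x**2 + b*x + c": [x**2, x, np.ones_like(x)],
    "a*sin(x) + b":     [np.sin(x), np.ones_like(x)],
}

def score(bases):
    A = np.column_stack(bases)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = float(np.mean((A @ coef - y) ** 2))
    return mse + 0.01 * len(bases), coef  # parsimony term favors concision

best = min(candidates, key=lambda name: score(candidates[name])[0])
print("selected structure:", best)                      # a*x**2 + b*x + c
print("fitted constants:", score(candidates[best])[1])  # approximately [3, -1, 1]
```

The parsimony term is what separates equation discovery from mere curve fitting: without it, the most flexible candidate always wins, whether or not it reflects the underlying law.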
Conventional symbolic regression techniques, such as Genetic Programming, face substantial limitations when tasked with discerning mathematical expressions for intricate surfaces. These methods evolve populations of candidate equations, a process that becomes computationally expensive as the complexity of the target function grows. Evaluation on the benchmark introduced below reveals a strikingly low success rate: only 6% of attempts accurately recover the underlying equation. This poor performance underscores a critical bottleneck in automated scientific discovery and motivates more efficient, robust algorithms for high-dimensional, nonlinear data. The difficulty stems not only from the vast search space of possible equations but also from the noise and limited data typical of real-world investigations, so that even a modest target like $f(x, y) = x^2 + y^2$ can demand substantial computational resources.
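The toy hill-climber below, a deliberately crude stand-in for genetic programming (string genomes, no crossover), shows why such search gets expensive: even recovering $f(x, y) = x^2 + y^2$ from a six-symbol alphabet can take hundreds or thousands of candidate evaluations, and the cost explodes as the grammar grows.

```python
# Toy evolutionary search for f(x, y) = x^2 + y^2; an illustration of cost,
# not any published method.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
target = X[:, 0] ** 2 + X[:, 1] ** 2           # hidden law: f(x,y) = x^2 + y^2

TERMS = ["x", "y", "x*x", "y*y", "x*y", "1.0"]  # toy building blocks

def evaluate(genome):
    x, y = X[:, 0], X[:, 1]
    pred = sum(eval(t, {"x": x, "y": y}) for t in genome)
    return float(np.mean((pred - target) ** 2))

def mutate(genome):
    child = list(genome)
    if rng.random() < 0.5 and len(child) < 6:
        child.append(str(rng.choice(TERMS)))                      # grow
    else:
        child[rng.integers(len(child))] = str(rng.choice(TERMS))  # point mutation
    return child

genome = [str(rng.choice(TERMS))]
err = evaluate(genome)
for evals in range(1, 5001):                   # budget: up to 5000 evaluations
    child = mutate(genome)
    child_err = evaluate(child)
    if child_err <= err:                       # greedy acceptance
        genome, err = child, child_err
    if err < 1e-12:
        break
print(sorted(genome), f"mse={err:.2e} after {evals} evaluations")
```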

Leveraging Linguistic Priors for Equation Synthesis
Large Language Models (LLMs) demonstrate an advantage in equation discovery due to their pre-training on extensive text and code corpora, which imparts inherent symbolic priors. This allows LLMs to not simply search for equations that fit data, but to generate plausible equation structures based on learned relationships between symbols and mathematical concepts. Unlike traditional symbolic regression methods that rely on random exploration or gradient-based optimization, LLMs can propose equations containing terms like $y = ax^2 + bx + c$ with a higher prior probability if similar structures have been observed during training. This generative capability effectively narrows the search space, improving the efficiency of discovering equations that accurately model observed data and potentially generalizing to unseen data more effectively.
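In practice, exploiting that prior can be as simple as asking the model for structurally plausible candidates. The sketch below assumes a hypothetical `llm_complete(prompt) -> str` client standing in for any chat-completion API; the prompt wording and the parsing step are illustrative assumptions, not a published framework's protocol.

```python
# Hypothetical LLM-backed proposal step: the model's symbolic priors do the
# structural search, and `llm_complete` is an assumed stand-in client.
from typing import Callable, List, Tuple

def build_prompt(points: List[Tuple[float, float, float]]) -> str:
    sample = "\n".join(f"x={x:.3f}, y={y:.3f}, z={z:.3f}" for x, y, z in points[:20])
    return (
        "You are a symbolic regression assistant.\n"
        "Propose 5 candidate closed-form expressions for z in terms of x and y\n"
        "that could generate these samples. Reply with one Python expression\n"
        "per line, using only x, y, numeric constants, +, -, *, /, **, sin, cos, exp.\n\n"
        + sample
    )

def propose(points, llm_complete: Callable[[str], str]) -> List[str]:
    reply = llm_complete(build_prompt(points))
    # Keep bare-expression lines; a real system would parse and validate them.
    return [ln.strip() for ln in reply.splitlines() if ln.strip() and "=" not in ln]
```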
Several methodologies integrate Large Language Models (LLMs) into symbolic regression to improve search efficiency. LLM-SR uses an LLM to propose equation candidates directly, which are then evaluated against observed data. LaSR (Language-guided Symbolic Regression) employs an LLM to rank and prioritize candidate equation terms during the search. SGA uses an LLM to steer the iterative optimization of symbolic expressions toward better solutions. OpenEvolve combines LLM-generated equation structures with evolutionary algorithms to explore a broader, better-informed solution space, reducing the computational cost of exhaustive search for relationships such as $y = ax^2 + bx + c$.
Traditional symbolic regression relies on algorithms like genetic algorithms or random search to explore the space of possible equations, often requiring substantial computational resources. Approaches integrating Large Language Models (LLMs) differ by utilizing the LLM’s capacity to generate candidate equations directly, effectively acting as a proposal mechanism. This allows the LLM to suggest plausible equation structures – combinations of variables, constants, and mathematical operators like $+, -, *, /$ – based on its pre-training data and understanding of mathematical relationships. Subsequently, these proposed equations are evaluated against the target data, and the LLM can refine its proposals through iterative feedback, optimizing the equation’s fit and complexity. This generative and refinement process accelerates the search for accurate models compared to purely stochastic or heuristic methods.
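A minimal version of that propose-evaluate-refine loop, reusing the hypothetical `build_prompt` and `llm_complete` from the previous sketch, might look as follows; the feedback format is an assumption for illustration.

```python
# Propose-evaluate-refine loop: score LLM proposals numerically and feed the
# errors back as context for the next round.
import numpy as np

SAFE = {"sin": np.sin, "cos": np.cos, "exp": np.exp, "__builtins__": {}}

def mse(expr, X, Y, Z):
    try:
        pred = eval(expr, SAFE, {"x": X, "y": Y})
        return float(np.mean((pred - Z) ** 2))
    except Exception:
        return float("inf")                    # malformed proposals score worst

def refine(points, llm_complete, rounds=3):
    X, Y, Z = (np.asarray(c) for c in zip(*points))
    feedback, best = "", (float("inf"), None)
    for _ in range(rounds):
        reply = llm_complete(build_prompt(points) + feedback)
        exprs = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        scored = sorted((mse(e, X, Y, Z), e) for e in exprs)
        if scored and scored[0][0] < best[0]:
            best = scored[0]                   # keep the best candidate so far
        feedback = "\nScored attempts (improve on these):\n" + "\n".join(
            f"{e} -> mse={s:.3g}" for s, e in scored[:5])
    return best                                # (error, expression)
```

Frameworks such as LLM-SR and OpenEvolve wrap this skeleton in richer search machinery, but the propose-score-feedback cycle is the common core.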

SurfaceBench: A Rigorous Test of Algorithmic Fidelity
SurfaceBench is a benchmark dataset designed for evaluating symbolic regression algorithms, consisting of 183 surface equations categorized into 15 scientific domains. These equations are represented in three primary forms: Explicit surfaces defined as $z = f(x, y)$, Implicit surfaces defined by $F(x, y, z) = 0$, and Parametric surfaces defined by $x = f(u, v)$, $y = g(u, v)$, and $z = h(u, v)$. This diversity in equation types and scientific categories—including but not limited to mathematics, physics, and engineering—allows for a comprehensive assessment of an algorithm’s ability to generalize across different surface representations and scientific principles. The dataset is intended to rigorously test the robustness and accuracy of methods attempting to recover the underlying equation from data representing the surface.
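For intuition about what each representation yields, the snippet below samples one surface of each type into a point cloud of the kind the benchmark evaluates against. The concrete surfaces (a paraboloid, a torus, a sphere) are illustrative stand-ins, not SurfaceBench entries.

```python
# Turning the three surface representations into point clouds.
import numpy as np

n = 40
u, v = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))

# Explicit: z = f(x, y); sample a grid and evaluate directly.
explicit = np.stack([u, v, u**2 + v**2], axis=-1).reshape(-1, 3)

# Parametric: (x, y, z) = (f(u,v), g(u,v), h(u,v)); here, a torus.
R, r = 2.0, 0.5
theta, phi = np.pi * (u + 1), np.pi * (v + 1)
parametric = np.stack([(R + r * np.cos(phi)) * np.cos(theta),
                       (R + r * np.cos(phi)) * np.sin(theta),
                       r * np.sin(phi)], axis=-1).reshape(-1, 3)

# Implicit: F(x, y, z) = 0; keep grid points where |F| is near zero
# (a crude level-set filter; real pipelines use marching cubes).
g = np.linspace(-1.5, 1.5, 60)
X, Y, Z = np.meshgrid(g, g, g)
F = X**2 + Y**2 + Z**2 - 1.0            # unit sphere
mask = np.abs(F) < 0.05
implicit = np.stack([X[mask], Y[mask], Z[mask]], axis=-1)
```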
SurfaceBench uses Chamfer Distance and Hausdorff Distance as its primary metrics for the geometric accuracy of symbolically regressed equations. Chamfer Distance averages nearest-neighbor distances between the point clouds sampled from the original and recovered surfaces, $d_{CD}(P, Q) = \frac{1}{|P|}\sum_{p \in P}\min_{q \in Q}\|p - q\| + \frac{1}{|Q|}\sum_{q \in Q}\min_{p \in P}\|q - p\|$, providing an assessment of overall shape similarity. Hausdorff Distance, conversely, reports the largest distance from any point on one surface to the nearest point on the other, $d_{H}(P, Q) = \max\{\max_{p \in P}\min_{q \in Q}\|p - q\|,\ \max_{q \in Q}\min_{p \in P}\|q - p\|\}$, thereby capturing the worst-case deviation between the two geometries; it is sensitive to outliers and highlights the largest error in the recovered equation. Both metrics are computed over point sets in $\mathbb{R}^n$ and quantify the fidelity of the recovered equation to the ground truth.
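Both metrics reduce to nearest-neighbor queries over sampled point clouds. The sketch below implements the symmetric forms given above with SciPy's KD-tree; treat it as one common convention rather than the benchmark's official scoring code, whose normalization may differ.

```python
# Chamfer and Hausdorff distances between point clouds P and Q.
import numpy as np
from scipy.spatial import cKDTree

def chamfer(P: np.ndarray, Q: np.ndarray) -> float:
    d_pq, _ = cKDTree(Q).query(P)   # for each p in P, distance to nearest q
    d_qp, _ = cKDTree(P).query(Q)   # and the reverse direction
    return d_pq.mean() + d_qp.mean()

def hausdorff(P: np.ndarray, Q: np.ndarray) -> float:
    d_pq, _ = cKDTree(Q).query(P)
    d_qp, _ = cKDTree(P).query(Q)
    return max(d_pq.max(), d_qp.max())   # worst-case deviation, either way

rng = np.random.default_rng(0)
P = rng.normal(size=(500, 3))
Q = P + rng.normal(scale=0.01, size=P.shape)   # slightly perturbed copy
print(f"chamfer={chamfer(P, Q):.4f}  hausdorff={hausdorff(P, Q):.4f}")
```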
Evaluation of current state-of-the-art Large Language Model (LLM)-based frameworks on SurfaceBench yields a 4% equation recovery rate across the 183 surface equations, assessed against ground-truth solutions with the geometric distance metrics above. Traditional symbolic regression methods reach a slightly higher recovery rate of 6% under the same conditions, leaving both families of approaches far from reliable recovery and underscoring how difficult it is to reconstruct surface equations from data.

Implications for Automated Scientific Discovery
Even the modest success of these equation discovery methods on the SurfaceBench dataset hints at a transformative potential extending far beyond its specific challenges. Successfully inferring governing equations from limited data, a feat previously demanding significant human expertise, opens doors to automating scientific modeling across diverse fields. From materials science and climate modeling to biological systems and economic forecasting, the ability to rapidly identify underlying relationships within complex datasets promises to accelerate discovery. This approach doesn’t merely offer a computational shortcut; it provides a framework for systematically exploring model spaces and potentially uncovering previously unknown physical laws hidden within observational data, democratizing scientific inquiry and pushing the boundaries of data-driven research.
Powerful large language models, such as GPT-4, represent a significant advancement in the automation of scientific discovery, particularly in identifying the underlying equations that govern complex phenomena. Traditionally, deriving these governing equations relied heavily on human intuition and laborious symbolic regression techniques. However, LLMs demonstrate an ability to analyze data and propose candidate equations with remarkable efficiency, effectively functioning as ‘equation learners’. This capability extends beyond the specific datasets used in training, suggesting potential applications across diverse scientific domains – from fluid dynamics and materials science to biology and economics. The models can effectively sift through vast amounts of data, recognize patterns, and formulate mathematical expressions that describe the observed relationships, potentially accelerating the pace of scientific progress and enabling the modeling of previously intractable systems. While not a replacement for rigorous scientific validation, LLMs offer a powerful new tool for hypothesis generation and the initial formulation of mathematical models.
Continued development hinges on bolstering the reliability and efficiency of these equation discovery methods, particularly when confronted with noisy or incomplete datasets. Current approaches, while promising, require further refinement to handle the complexity of real-world scientific problems and scale effectively to high-dimensional data. A particularly fruitful avenue for future work lies in synergistic combinations of large language models with established symbolic regression techniques; integrating the pattern recognition capabilities of LLMs with the rigorous mathematical foundations of traditional methods could unlock a new generation of automated scientific discovery tools, potentially revealing previously hidden governing equations and accelerating progress across diverse fields like physics, chemistry, and engineering.
The pursuit of discovering underlying equations, as exemplified by SurfaceBench, demands a commitment to formal definitions and rigorous logic. Robert Tarjan aptly stated, “Data structures and algorithms are the heart of computer science.” This sentiment resonates deeply with the challenge presented by symbolic regression; the benchmark isn’t merely about finding an equation that fits the data, but identifying the correct equation – one underpinned by provable mathematical principles. SurfaceBench exposes the limitations of current large language models in geometric reasoning, emphasizing the necessity of algorithms built upon sound mathematical foundations rather than empirical performance on test cases. The benchmark, therefore, serves as a crucial step towards achieving true intelligence in machine learning – intelligence rooted in demonstrable truth.
Beyond Equations: Charting a Course for Geometric Intelligence
The introduction of SurfaceBench reveals, with characteristic clarity, that current methods for equation discovery are, at best, approximations. To claim ‘success’ based on limited test cases is a statistical fallacy; the true measure lies in asymptotic scalability and provable generalization. The benchmark’s inherent difficulty is not merely a matter of computational cost, but a fundamental limitation in the ability of these systems to reason geometrically. The discovered equations, while superficially correct, often lack the elegance—the mathematical purity—one would expect from a truly intelligent system.
Future work must move beyond the fitting of parameterized functions. The pursuit of “geometry-aware metrics” is a necessary, but insufficient, condition. The core challenge remains: how to instill a system with an intrinsic understanding of spatial relationships, rather than merely a capacity to memorize and extrapolate from examples. The goal is not simply to reproduce known surfaces, but to anticipate and define those that have yet to be observed—a capacity demanding a shift in focus from data-driven learning to deductive reasoning.
Ultimately, the value of benchmarks like SurfaceBench resides not in celebrating present achievements, but in ruthlessly exposing the gaps in current approaches. The pursuit of artificial intelligence must be guided by mathematical rigor, not empirical expediency. Until these systems can demonstrate a genuine understanding of the underlying geometric principles, they remain, fundamentally, sophisticated pattern-matching machines.
Original article: https://arxiv.org/pdf/2511.10833.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/