From Data to Guaranteed Optimality: A New Path for Model Discovery

Author: Denis Avetisyan


Researchers have developed a framework that automatically builds interpretable and provably-convex mathematical models directly from data, bypassing traditional modeling limitations.

DiscoverDCP identified a surprisingly parsimonious functional form, $y=e^{3.2x}+e^{-4.9x}$, capable of approximating complex, nested nonlinearities present in the ground truth data, demonstrating that accurate modeling doesn’t necessarily demand excessive complexity.

DiscoverDCP leverages symbolic regression and Disciplined Convex Programming to ensure optimization problems remain solvable and well-behaved.

Achieving both accuracy and guaranteed convexity in system identification remains a persistent challenge in optimization and control. This paper introduces DiscoverDCP: A Data-Driven Approach for Construction of Disciplined Convex Programs via Symbolic Regression, a novel framework that leverages symbolic regression guided by the rules of Disciplined Convex Programming (DCP) to automatically construct globally convex models from data. By enforcing convexity during model discovery, rather than verifying it post-hoc, DiscoverDCP enables the identification of flexible, interpretable surrogates beyond traditional fixed-parameter convex forms. Could this approach unlock more robust and efficient solutions for safety-critical applications demanding verifiable performance?


Deconstructing the Black Box: Why Interpretability Matters

Contemporary machine learning frequently prioritizes predictive accuracy, resulting in complex models often described as “black boxes.” While these algorithms can achieve remarkable performance in tasks like image recognition or natural language processing, their internal workings remain opaque. This lack of transparency poses significant challenges; understanding why a model arrives at a particular decision is often as crucial as the decision itself. Without insight into the model’s reasoning, debugging errors becomes extraordinarily difficult – identifying and correcting biases or unexpected behaviors is akin to searching for a needle in a haystack. Moreover, the inability to explain predictions erodes trust, particularly in high-stakes domains like healthcare, finance, and criminal justice, where accountability and justification are paramount. Consequently, a growing emphasis is being placed on developing methods that not only predict accurately but also offer a clear and understandable rationale for their conclusions.

The increasing reliance on machine learning necessitates a shift towards interpretable models, particularly within domains where accountability and understanding are paramount. While complex algorithms may achieve high predictive accuracy, their “black box” nature presents critical limitations in fields like healthcare, finance, and criminal justice. Without the ability to discern why a model arrives at a specific conclusion, it becomes difficult to validate its reasoning, identify potential biases, or ensure fair and ethical outcomes. Consequently, the pursuit of interpretability isn’t merely about enhancing transparency; it’s about fostering trust, enabling effective debugging, and ultimately, facilitating responsible innovation that aligns with human values and societal expectations. This demand for insight extends beyond simply identifying influential features; it requires a comprehensive understanding of the model’s decision-making process, allowing stakeholders to confidently leverage its predictions and intervene when necessary.

A persistent challenge in machine learning lies in the inherent trade-off between a model’s predictive accuracy and its interpretability. While complex algorithms – like deep neural networks – often achieve state-of-the-art performance, their internal workings remain largely opaque, functioning as “black boxes.” Simplifying models to enhance understanding frequently leads to a reduction in their ability to accurately forecast outcomes. Researchers are actively investigating techniques – including attention mechanisms and surrogate models – to bridge this gap, striving for algorithms that are both powerful and transparent. This pursuit is not merely academic; in fields like healthcare and finance, understanding why a model makes a specific prediction is as crucial as the prediction itself, demanding a careful calibration between performance metrics and the ability to extract meaningful insights from the model’s decision-making process.

DiscoverDCP: Carving Order from Complexity

DiscoverDCP introduces a new methodology for modeling non-linear convex dynamics by integrating symbolic regression with Disciplined Convex Programming (DCP). This approach differentiates itself from traditional methods, such as those relying on quadratic approximations, by directly seeking convex mathematical expressions that fit observed data. Utilizing symbolic regression, DiscoverDCP explores a vast space of potential models, while DCP ensures that any learned expression maintains convexity – a critical property for guaranteeing globally optimal solutions and stable system behavior. Benchmarking demonstrates that DiscoverDCP achieves improved accuracy in representing these dynamics compared to standard quadratic baselines, offering a more precise and reliable means of modeling complex systems where convex constraints are known or desired.

DiscoverDCP ensures convexity of learned expressions through the application of Disciplined Convex Programming (DCP). DCP is a set of composition rules that assigns a known curvature to expressions built from atoms of known convexity, making convexity automatically verifiable during the symbolic-regression search. This verification process relies on the rules for composing convex functions, ensuring that any expression generated and validated by DiscoverDCP adheres to the properties of convex optimization. Consequently, any optimization problem formulated using these expressions is guaranteed to possess a globally optimal solution, eliminating the risk of spurious local minima and enhancing the stability of the learned model. The framework achieves this by tracking the convexity of each operation and variable within the symbolic expression, allowing it to reject any non-convex formulation during the search process.
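To make that verification step concrete, here is a minimal sketch, assuming CVXPY as the DCP engine (the paper’s own tooling may differ), of how curvature analysis certifies one candidate expression and fails to certify another:

```python
import cvxpy as cp

x = cp.Variable()

# Sum of convex atoms applied to affine arguments: certified convex by DCP rules.
candidate = cp.exp(3.2 * x) + cp.exp(-4.9 * x)
print(candidate.curvature)    # CONVEX
print(candidate.is_convex())  # True

# square() composed with a concave inner expression cannot be certified,
# mirroring how a DCP-guided search would reject such a candidate.
rejected = cp.square(cp.log(x))
print(rejected.is_convex())   # False: curvature unknown under DCP rules
```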

DiscoverDCP employs PySR, a search-based symbolic regression library, to navigate the landscape of potential mathematical expressions. PySR utilizes evolutionary algorithms to efficiently explore this expression space, evaluating candidates based on their ability to fit provided data. The search process is guided by a fitness function that quantifies the error between the model’s predictions and the ground truth. Parameter optimization within PySR, including population size and mutation rates, is performed to balance exploration of diverse expressions with exploitation of promising candidates, enabling the discovery of compact and accurate models from data. The library’s ability to handle a wide range of mathematical operators and functions is crucial for identifying complex relationships within the data, while its parallelization capabilities accelerate the search process.
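As an illustration of how such a search might be configured, the following hedged sketch uses PySR’s public API on synthetic data; the operator set, hyperparameters, and target function here are illustrative choices, not the paper’s settings:

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.exp(3.0 * X[:, 0]) + np.exp(-5.0 * X[:, 0])  # synthetic convex target

model = PySRRegressor(
    niterations=40,                     # evolutionary search budget
    binary_operators=["+", "*"],        # combinations to compose candidates from
    unary_operators=["exp", "square"],  # convex atoms in the search space
    maxsize=15,                         # cap on expression complexity
)
model.fit(X, y)
print(model.sympy())  # best expression on the accuracy/complexity trade-off
```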

On synthetic 1D data, a low-complexity approximation (blue) effectively captures the ground truth exponential function (red) and outperforms a quadratic fit (yellow) and a higher-complexity approximation (purple).

The Geometry of Prediction: Convexity and Operators

The reliance on convex functions within DiscoverDCP is foundational to its optimization process. Convexity guarantees that any local minimum is also the global minimum, thereby eliminating the risk of converging to suboptimal solutions. This property simplifies optimization algorithms, allowing for efficient determination of the best model parameters. Without the assurance of a single global minimum, optimization would require exhaustive searches or heuristics, significantly increasing computational cost and potentially failing to identify the optimal solution. The framework leverages this characteristic to accelerate model training and ensure reliable results across various problem domains.
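A small numerical illustration of this property (not taken from the paper): minimizing the convex expression quoted in the article from several starting points yields the same minimizer, because for a convex function every local minimum is global.

```python
import numpy as np
from scipy.optimize import minimize

# The convex expression quoted in the article.
f = lambda x: np.exp(3.2 * x[0]) + np.exp(-4.9 * x[0])

# Different starting points converge to the same minimizer (~0.053), the global one.
for x0 in (-0.9, 0.0, 0.9):
    res = minimize(f, x0=[x0], bounds=[(-1.0, 1.0)])
    print(x0, res.x[0], res.fun)
```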

DiscoverDCP utilizes a system of unary and binary operators to construct mathematical expressions, enabling the representation of a wide range of optimization problems. Unary operators, such as negation or the exponential function, act on single operands, while binary operators, including addition, multiplication, and composition, require two operands. This operator-based construction allows for the creation of complex expressions from simpler components. Specifically, the framework supports operators that preserve convexity, ensuring that composed expressions remain convex when applied to convex functions. This compositional property is critical for guaranteeing the global optimality of solutions found by optimization algorithms. The system allows for the definition of custom operators, extending the framework’s modeling capabilities beyond the built-in set.
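The sketch below shows one plausible way such curvature bookkeeping could be encoded; the rule set is deliberately tiny and the names are hypothetical, intended only to illustrate how composition rules propagate convexity through operators:

```python
from enum import Enum

class Curvature(Enum):
    AFFINE = "affine"
    CONVEX = "convex"
    CONCAVE = "concave"
    UNKNOWN = "unknown"

def add(a: Curvature, b: Curvature) -> Curvature:
    """Sum rule: convex + convex (or affine) stays convex; likewise for concave."""
    if {a, b} <= {Curvature.AFFINE, Curvature.CONVEX}:
        return Curvature.CONVEX if Curvature.CONVEX in (a, b) else Curvature.AFFINE
    if {a, b} <= {Curvature.AFFINE, Curvature.CONCAVE}:
        return Curvature.CONCAVE if Curvature.CONCAVE in (a, b) else Curvature.AFFINE
    return Curvature.UNKNOWN

def exp_of(inner: Curvature) -> Curvature:
    """exp is convex and nondecreasing: convex or affine arguments stay convex."""
    if inner in (Curvature.AFFINE, Curvature.CONVEX):
        return Curvature.CONVEX
    return Curvature.UNKNOWN

# e.g. exp(3.2x) + exp(-4.9x): both inner arguments are affine, so the sum is convex.
print(add(exp_of(Curvature.AFFINE), exp_of(Curvature.AFFINE)))  # Curvature.CONVEX
```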

Convex quadratic functions within DiscoverDCP are defined using positive semidefinite matrices. A model of dimension $n$ requires a positive semidefinite matrix of size $n \times n$ to represent the quadratic component, necessitating $n(n+1)/2$ unique parameters due to symmetry. Additionally, $n$ parameters define the linear terms and a single scalar term represents the constant offset. Therefore, the total parameter count for a convex quadratic model of dimension $n$ is $n(n+1)/2 + n + 1$. This parameterization ensures that the resulting function is convex, which is a core requirement for efficient optimization within the framework.
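The following sketch counts those parameters and builds the quadratic under one common PSD-guaranteeing construction, a factorized parameterization $P = LL^{T}$; this factorization is an illustrative assumption, and the paper may parameterize the symmetric matrix differently:

```python
import numpy as np

n = 3
num_params = n * (n + 1) // 2 + n + 1  # 6 + 3 + 1 = 10 for n = 3
print(num_params)

# n(n+1)/2 free parameters fill a lower-triangular factor L, so that
# P = L @ L.T is positive semidefinite by construction.
theta = np.arange(1.0, num_params + 1.0)   # placeholder parameter values
L = np.zeros((n, n))
L[np.tril_indices(n)] = theta[: n * (n + 1) // 2]
P = L @ L.T
q = theta[n * (n + 1) // 2 : -1]           # n linear-term parameters
c = theta[-1]                              # scalar constant offset

x = np.ones(n)
f = x @ P @ x + q @ x + c                  # convex quadratic x^T P x + q^T x + c
print(f)
```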

Beyond the Fit: Prioritizing Understandability

DiscoverDCP distinguishes itself through the implementation of a Complexity Score, a metric designed to quantify the intricacy of the relationships it uncovers within data. This score doesn’t merely assess the model’s performance, but actively evaluates the simplicity of the discovered expression; a lower score indicates a more parsimonious, easily interpretable model. By systematically penalizing overly complex formulations – those with numerous variables or convoluted interactions – the framework encourages solutions that achieve high accuracy without sacrificing understandability. This emphasis on finding the most straightforward explanation, rather than simply the most predictive one, is central to DiscoverDCP’s design, fostering a preference for models that reveal fundamental relationships rather than memorizing training data. The result is a system capable of generating insights that remain transparent and accessible, even when dealing with high-dimensional and noisy datasets.
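In the spirit of that trade-off, a minimal sketch of complexity-penalized model selection is shown below; the scoring rule, penalty weight, and candidate list are hypothetical, and DiscoverDCP’s actual complexity score may be defined differently:

```python
def select(candidates, lam=0.01):
    """Pick the candidate minimizing error plus a complexity penalty."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# (expression, mean-squared error, complexity score) -- hypothetical values.
candidates = [
    ("exp(3.2*x) + exp(-4.9*x)", 0.012, 7),
    ("a much larger nested expression", 0.010, 41),
]
print(select(candidates)[0])  # the parsimonious model wins: 0.082 < 0.420
```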

DiscoverDCP actively prioritizes solutions that strike a balance between predictive power and comprehensibility by implementing a penalty for model complexity. This isn’t merely about achieving high accuracy; the framework deliberately favors simpler explanations, even if they represent a marginal trade-off in performance. By discouraging excessively intricate models, those with numerous parameters or convoluted relationships, DiscoverDCP ensures that identified patterns are not only statistically significant but also readily interpretable by researchers. This approach moves beyond simply finding correlations to understanding the underlying mechanisms, revealing insights that would remain obscured within the opaque layers of more complex, “black-box” machine learning algorithms. The result is a system designed to deliver both robust predictions and human-understandable knowledge.

DiscoverDCP distinguishes itself from many contemporary machine learning approaches by prioritizing not just predictive power, but also the transparency of the resulting models. While complex algorithms – often described as “black boxes” – can achieve high accuracy, they frequently obscure the underlying relationships driving their predictions, hindering scientific understanding. This framework deliberately favors simpler explanations, even if it means a marginal decrease in performance, because readily interpretable models facilitate the extraction of meaningful insights. By revealing how a prediction is made, rather than merely that a prediction is made, DiscoverDCP allows researchers to validate hypotheses, uncover novel mechanisms, and build more robust and trustworthy knowledge – aspects frequently lost when relying solely on opaque, high-dimensional models.

Expanding the Canvas: Beyond Quadratic Limitations

DiscoverDCP represents a significant departure from traditional machine learning approaches often limited by quadratic programming. The system deliberately expands the scope of model exploration, venturing into a broader landscape of convex functions and expressions – including, but not limited to, functions involving hyperbolic tangents, exponential terms, and power laws. This deliberate broadening isn’t simply about increased complexity; it’s about unlocking the potential to represent more nuanced relationships within data. By moving beyond the constraints of simple $x^2$ terms, DiscoverDCP can potentially uncover more accurate and interpretable models for phenomena that don’t conform to strictly quadratic behavior, effectively addressing limitations inherent in many existing techniques and offering a more flexible toolkit for data scientists.

The DiscoverDCP framework doesn’t simply propose convex models; it rigorously validates them. Leveraging established tools like CVXPY, a Python-embedded modeling language for convex optimization, the framework systematically verifies whether a discovered expression truly represents a convex program. This verification process isn’t merely a check for mathematical correctness, but a guarantee that the resulting model can be reliably solved using efficient and well-understood algorithms. Consequently, any insights derived from these models are demonstrably robust, and the solutions obtained are guaranteed to be globally optimal within the defined constraints – a critical feature for applications demanding high precision and trustworthiness, such as resource allocation, financial modeling, and control systems. This commitment to verifiable convexity distinguishes DiscoverDCP and ensures the dependability of its machine learning outputs.
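As a hedged example of that workflow, the sketch below checks DCP compliance and then solves a small problem built from the expression quoted earlier in the article; the bounds and usage are illustrative, not drawn from the paper:

```python
import cvxpy as cp

x = cp.Variable()
surrogate = cp.exp(3.2 * x) + cp.exp(-4.9 * x)  # discovered convex expression

problem = cp.Problem(cp.Minimize(surrogate), [x >= -1.0, x <= 1.0])
assert problem.is_dcp()  # convexity is certified before any solver runs
problem.solve()
print(x.value, problem.value)  # a globally optimal solution, guaranteed by convexity
```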

The expansion beyond quadratic modeling facilitated by DiscoverDCP unlocks significant potential for interpretable machine learning applications across numerous fields. By accommodating a broader class of convex functions, the framework enables the development of models that are not only accurate but also readily understood by human experts. This is particularly crucial in domains like healthcare, finance, and engineering, where transparency and trust are paramount. The ability to express complex relationships with interpretable convex programs allows for better decision-making, improved risk assessment, and enhanced model validation. Consequently, problems previously addressed with opaque “black box” models can now benefit from solutions that offer both predictive power and clear, actionable insights, fostering greater confidence and facilitating more effective collaboration between humans and machines.

The pursuit of automated model discovery, as exemplified by DiscoverDCP, isn’t merely about finding a solution, but uncovering the underlying principles governing a system. This echoes Marvin Minsky’s sentiment: “You can’t always get what you want, but sometimes you get what you need.” DiscoverDCP doesn’t simply aim to fit data; it seeks to reconstruct the inherent structure, ensuring the resulting model adheres to the rigorous rules of Disciplined Convex Programming. This framework, by prioritizing convexity-preserving operations, embodies a commitment to understanding – and therefore controlling – the system’s behavior, aligning with the idea that true intelligence lies in reverse-engineering reality itself.

Beyond the Guarantee

DiscoverDCP, at its core, isn’t only about finding convex programs; it’s about revealing the limitations of the data itself. What happens when the attempt to force convexity consistently fails? Does that indicate a fundamental non-convexity in the underlying system, or simply a deficiency in the observed data? The framework implicitly asks: what crucial information is missing, and how does that missing information manifest as an inability to satisfy the demands of disciplined convexity? The true signal may lie not in the successfully constructed programs, but in the regressions that stubbornly refuse to conform.

Future work will undoubtedly explore scaling these techniques to higher-dimensional and more complex datasets. However, a more intriguing path involves deliberately introducing ‘controlled violations’ of convexity. Could a carefully designed, slightly non-convex program offer a better approximation of reality than a strictly convex, but ultimately impoverished, model? The pursuit of guaranteed solutions may inadvertently obscure potentially more accurate, albeit less formally verifiable, representations.

The elegance of DiscoverDCP lies in its transparency. It doesn’t merely output a model; it provides a lineage: a traceable path from data to program. The next step isn’t simply to automate this process further, but to build tools that allow practitioners to interrogate why a particular program was chosen, and, more importantly, why others were rejected. Perhaps the discarded models hold the key to a deeper understanding of the system being modeled.


Original article: https://arxiv.org/pdf/2512.15721.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
