AI Solves Math Problems, Autonomously

Author: Denis Avetisyan

New research shows artificial intelligence is capable of independently discovering and proving mathematical theorems, marking a shift from verification to genuine creation.

The AlphaProof Nexus agent autonomously constructs formal proofs by iteratively refining sketches, optionally calling the AlphaProof theorem prover as a tool, and validating each sketch against the original problem statement. An optional evolutionary framework maintains a population of previously successful sketches, ranks them with LLM-based critics, and uses Elo scores to guide the generation of new proof attempts, applying a kind of selection pressure to proof ideas.

This work demonstrates an AI-driven approach to formal mathematical discovery that uses large language models and evolutionary algorithms to solve open Erdős problems and conjectures from the Online Encyclopedia of Integer Sequences.

Despite increasing prowess in mathematical reasoning, large language models often lack the reliability needed for rigorous research. The paper ‘Advancing Mathematics Research with AI-Driven Formal Proof Search’ addresses this gap with an approach to automated mathematical discovery based on formal proof generation. The system autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 conjectures from the Online Encyclopedia of Integer Sequences, showcasing the potential for AI to move beyond proof verification and generate novel results. Could this AI-driven approach fundamentally reshape the landscape of mathematical exploration and accelerate discovery across diverse fields?

Breaking the Wall: The Challenge of Formal Proof and LLM Limitations

The pursuit of automated theorem proving represents a longstanding ambition in computer science and mathematics, demanding more than simply computational power. Establishing the truth of mathematical statements necessitates both unwavering logical rigor – ensuring each step adheres to established axioms and inference rules – and a surprising degree of creative insight. Unlike calculations that follow prescribed algorithms, proving theorems often requires identifying subtle connections, employing unexpected strategies, and reframing problems in novel ways. This dual requirement poses a significant hurdle; a system must not only avoid logical fallacies but also exhibit a form of mathematical intuition to navigate the vast landscape of possible proofs, making it one of the most complex challenges in artificial intelligence. The ability to replicate this uniquely human capacity remains elusive, despite decades of research and increasingly sophisticated algorithms.

Large Language Models demonstrate remarkable aptitude in identifying and replicating patterns within data, a capability driving advancements in areas like natural language processing and image recognition. However, this strength doesn’t readily translate to the demands of formal verification – the process of rigorously proving the correctness of mathematical statements or computer programs. Unlike pattern matching, formal proofs require absolute precision and a nuanced understanding of logical rules; LLMs, trained on vast datasets of often-imprecise text, frequently struggle with the subtle distinctions and exhaustive cases necessary to guarantee a valid proof. The models may generate plausible-sounding arguments that contain logical fallacies, or fail to account for edge cases that would invalidate the entire construction. This limitation highlights a fundamental difference between statistical learning and genuine reasoning, suggesting that while LLMs can assist in the discovery of potential proofs, they currently lack the reliability required for certifying their correctness.

Current approaches to automated theorem proving, despite decades of development, are often hampered by significant computational burdens. The complexity of verifying even moderately intricate mathematical statements frequently leads to exponential increases in processing time and memory requirements, effectively limiting their applicability to all but the simplest conjectures. This scalability issue arises from the exhaustive search strategies employed, which attempt to explore every possible proof path – a task that quickly becomes intractable as the problem’s dimensionality grows. Consequently, while these methods can rigorously confirm certain theorems, they struggle with the vast landscape of unsolved problems in mathematics, particularly those requiring novel insights or non-obvious proof techniques. The inability to efficiently handle complex conjectures represents a major bottleneck in leveraging automated systems for genuine mathematical discovery and verification, necessitating the exploration of more efficient and scalable algorithms.

The agent utilizes a population of asynchronous prover and rater subagents, guided by a P-UCB strategy and [latex]P=7[/latex], to iteratively refine sketches using an LLM (Gemini 3.1 Pro & 3.0 Flash) and tools like AlphaProof, accepting only compilable sketches that preserve the original theorem while updating Elo scores for improved sampling.

Deconstructing the Proof: AlphaProof Nexus – A Framework for LLM-Aided Verification

AlphaProof Nexus integrates Large Language Models (LLMs) with Lean, a formal proof assistant designed for the construction of mathematically rigorous proofs. Lean provides a foundational logic and a type system that guarantees the correctness of verified statements, while the LLM component is employed to assist in the proof development process. This integration isn’t a standalone LLM-based verifier; rather, it’s a hybrid system where LLMs generate potential proof steps or strategies that are then checked and validated by Lean’s internal type checker. The system relies on Lean’s existing capabilities for ensuring logical soundness, utilizing the LLM to explore the proof space and suggest potential lines of reasoning within the constraints of Lean’s formal system.
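To make the idea of a checkable proof sketch concrete, here is a toy Lean 4 example. It is an illustration only and is not taken from the paper: the theorem names are hypothetical, and the statements are deliberately trivial. The helper lemma is left as a hole marked with `sorry`, while the main statement is reduced to that lemma.

```lean
-- A helper lemma left as a hole; `sorry` marks a goal that still needs a proof.
theorem add_zero_hole (n : Nat) : n + 0 = n := by
  sorry

-- The main statement is reduced to the helper lemma by rewriting twice.
theorem add_zero_twice (n : Nat) : n + 0 + 0 = n := by
  rw [add_zero_hole, add_zero_hole]
```

Lean accepts the overall structure but warns that a declaration uses `sorry`, so the result does not yet count as a proof. In a hybrid system of the kind described above, a hole like this is what a language model or a dedicated prover would be asked to fill, with Lean rechecking the result.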

Basic Agents within the AlphaProof Nexus framework operate using a ‘Ralph Loop’, a cyclical process for proof sketch generation. This loop initiates with an LLM receiving a problem statement and producing an initial proof attempt. The attempt is then formally verified by Lean; any detected errors or gaps are fed back to the LLM as feedback. The LLM uses this feedback to refine the proof sketch in subsequent iterations. This iterative process, driven by LLM inference and formal verification, continues until a complete and formally verified proof is achieved, or a predetermined iteration limit is reached. The Ralph Loop enables the LLM to progressively improve its proof strategies based on Lean’s rigorous validation, effectively navigating the complex space of possible proofs.
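The generate, check, and refine cycle described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: `refine_until_verified`, `CheckResult`, and the two toy stand-ins are hypothetical names, the generator stands in for an LLM call, and the checker stands in for a Lean compilation step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool        # True when the checker accepts the sketch as a complete proof
    feedback: str   # error messages or remaining gaps, passed back to the generator


def refine_until_verified(
    statement: str,
    generate: Callable[[str, str, str], str],   # (statement, previous sketch, feedback) -> new sketch
    check: Callable[[str, str], CheckResult],   # (statement, sketch) -> verdict
    max_iters: int = 8,
) -> Optional[str]:
    """Loop until the checker accepts a sketch or the iteration limit is hit."""
    sketch, feedback = "", ""
    for _ in range(max_iters):
        sketch = generate(statement, sketch, feedback)
        result = check(statement, sketch)
        if result.ok:
            return sketch
        feedback = result.feedback
    return None


# Toy stand-ins (assumptions): the "LLM" fills one hole per round,
# and the "checker" counts the holes that remain.
def toy_generate(statement: str, previous: str, feedback: str) -> str:
    if not previous:
        return "by\n  sorry\n  sorry"
    return previous.replace("sorry", "done", 1)


def toy_check(statement: str, sketch: str) -> CheckResult:
    holes = sketch.count("sorry")
    return CheckResult(ok=(holes == 0), feedback=f"{holes} unresolved goal(s)")


proof = refine_until_verified("toy theorem", toy_generate, toy_check)
```

With the toy stand-ins, the loop needs three rounds to remove both holes; lowering `max_iters` to 2 makes it give up and return `None`, which mirrors the iteration limit mentioned above.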

AlphaProof Nexus is designed to enhance, not supplant, existing formal verification methodologies. The process of constructing formal proofs involves searching a potentially infinite space of logical steps; this search is computationally expensive and often requires significant human effort. The system utilizes Large Language Models (LLMs) to explore this proof space more efficiently, generating potential proof sketches that are then rigorously checked by the Lean proof assistant. This LLM-driven exploration acts as a heuristic guide, significantly reducing the manual effort required to identify valid proof paths while maintaining the absolute certainty of formally verified results; all generated proofs must still adhere to the strict logical rules enforced by Lean.

An AlphaProof-equipped agent successfully solved the Erdős #125 problem by iteratively refining a Lean proof sketch, leveraging AlphaProof to resolve goals, decomposing complex subgoals into simpler lemmas, and generating a natural language summary of its reasoning process within specified code blocks.

Evolving the Solution: Boosting Performance with Evolutionary Search and Reinforcement Learning

AlphaProof Nexus employs the ‘AlphaEvolve’ algorithm, an evolutionary strategy designed to navigate the space of potential proofs. This algorithm functions by maintaining a population of proof sketches and iteratively refining them through processes analogous to natural selection. The exploration phase is guided by the ‘P-UCB’ (Pessimistic Upper Confidence Bound) method, a technique that balances exploration of less-tried sketches with exploitation of those already showing promise. P-UCB assigns scores to sketches based on their estimated potential, prioritizing those with high upper confidence bounds while also encouraging continued investigation of less certain options. This allows AlphaProof Nexus to search the proof landscape efficiently, focusing computational resources on the most likely avenues for successful proof generation and iteratively improving the quality of existing sketches.
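The exploration and exploitation trade-off can be illustrated with the textbook UCB1 rule. This is a swapped-in stand-in, not the paper's P-UCB formula, which the article does not spell out. The `Candidate` class, the function names, and the exploration constant 1.4 are all assumptions made for the illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    value: float = 0.0   # running mean of observed rewards
    visits: int = 0      # how often this sketch has been selected


def ucb_score(c: Candidate, total_visits: int, explore: float = 1.4) -> float:
    """Standard UCB1 score: mean reward plus an uncertainty bonus."""
    if c.visits == 0:
        return math.inf   # untried candidates are always sampled first
    return c.value + explore * math.sqrt(math.log(total_visits) / c.visits)


def select(population: List[Candidate], explore: float = 1.4) -> Candidate:
    """Pick the candidate with the highest UCB1 score."""
    total = sum(c.visits for c in population)
    return max(population, key=lambda c: ucb_score(c, total, explore))


def record(c: Candidate, reward: float) -> None:
    """Update the running mean after observing a reward for this candidate."""
    c.visits += 1
    c.value += (reward - c.value) / c.visits


population = [
    Candidate("well-explored", value=0.9, visits=50),
    Candidate("barely-tried", value=0.5, visits=2),
]
chosen = select(population)
```

In this toy population the rule picks the barely tried sketch even though its mean reward is lower, because its uncertainty bonus is larger. Setting `explore=0` turns the rule into pure exploitation and selects the well-explored sketch instead.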

Test-Time Reinforcement Learning (TRL) is implemented within AlphaProof to dynamically optimize the proof search process. This involves training a reinforcement learning agent during the verification of each individual problem instance, rather than relying on pre-trained models. The agent learns to select actions – specifically, modifications to proof sketches – based on feedback derived from the success or failure of those actions in progressing toward a complete proof. This adaptive approach allows the system to tailor its search strategy to the specific characteristics of each problem, improving efficiency and increasing the likelihood of finding a valid proof. The reward signal is directly tied to the progress made in completing the proof, incentivizing the agent to explore promising avenues and avoid unproductive paths.

The full-featured AlphaProof Nexus agent integrates the AlphaProof theorem prover, an evolutionary search algorithm guided by P-UCB, and a population database to optimize proof generation. Large language models, specifically Gemini 3.1 Pro, guide the search. In evaluations, the agent autonomously solved 9 of 353 attempted open Erdős problems and proved 44 of 492 previously open conjectures sourced from the Online Encyclopedia of Integer Sequences (OEIS).

The agent iteratively refines a Lean proof sketch by conversing with a large language model (Gemini 3.1 Pro), applying edits based on compilation feedback, while a parallel ensemble of subagents accelerates the process by halting upon the first successful proof.
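The "halt on first success" behaviour of a parallel ensemble can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch under stated assumptions rather than the paper's design: the attempts are toy functions that sleep instead of calling an LLM or Lean, and `first_success` and `make_attempt` are hypothetical names. Threads cannot be killed from outside, so each attempt checks a shared stop flag cooperatively.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional

Attempt = Callable[[threading.Event], Optional[str]]


def first_success(attempts: List[Attempt], max_workers: int = 4) -> Optional[str]:
    """Run attempts in parallel and return the first proof found, if any."""
    stop = threading.Event()
    winner: Optional[str] = None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(attempt, stop) for attempt in attempts]
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None and winner is None:
                winner = result
                stop.set()   # ask the remaining attempts to give up
    return winner


def make_attempt(name: str, steps: int, succeeds: bool) -> Attempt:
    """Toy subagent: works for `steps` short ticks, then succeeds or fails."""
    def attempt(stop: threading.Event) -> Optional[str]:
        for _ in range(steps):
            if stop.is_set():
                return None      # another subagent already finished
            time.sleep(0.01)
        return f"proof from {name}" if succeeds else None
    return attempt


result = first_success([
    make_attempt("slow", 50, True),
    make_attempt("fails", 2, False),
    make_attempt("fast", 5, True),
])
```

Here the fast subagent finishes first, so its proof is returned and the slow one exits early at its next check of the stop flag.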

Expanding the Boundaries: Applications and Impact from Olympiad Problems to Open Conjectures

AlphaProof Nexus showcases a remarkable capacity for complex mathematical reasoning by successfully tackling problems sourced from the International Mathematical Olympiad. These challenges, designed to test the ingenuity of the world’s most talented young mathematicians, require more than rote calculation; they demand creative problem-solving, insightful pattern recognition, and the ability to construct rigorous, logical proofs. The system’s performance on these Olympiad-level problems isn’t simply about arriving at the correct answer, but about replicating the process of human mathematical thought – a feat previously considered beyond the reach of artificial intelligence. This success demonstrates a pivotal advancement in automated theorem proving, indicating that machines are increasingly capable of not just computing, but truly understanding mathematical concepts.

AlphaProof Nexus demonstrates a remarkable capacity to move beyond standard Olympiad problems and delve into advanced mathematical territories, including the intricate fields of Graph Reconstruction, Hilbert Functions, and Convex Optimization. Within these domains, the system isn’t merely calculating results, but formally verifying long-standing conjectures – establishing mathematical truths with a level of rigor previously unattainable without extensive human effort. For example, in Graph Reconstruction, it confirms or refutes statements about determining a graph’s structure from its subgraphs; in Hilbert Functions, it validates properties related to polynomial rings; and in Convex Optimization, it proves the optimality of solutions. This ability to provide formal proofs within these complex areas signifies a powerful new tool for mathematical research and validation, extending the reach of automated theorem proving into areas demanding deep mathematical insight.

AlphaProof Nexus significantly streamlines mathematical progress by intelligently accessing and utilizing established databases of conjectures and mathematical constants. The system draws upon resources like the Formal Conjectures Repository, a curated collection of unproven statements, and the Online Encyclopedia of Integer Sequences (OEIS), a vast compendium of numerical patterns. This access isn’t merely archival; the system actively searches for connections and applies automated reasoning to test these conjectures, ultimately achieving what was previously unattainable. A compelling demonstration of this capability is the recent resolution of two long-standing problems originally posed by Paul Erdős, which had defied attempts at proof for over 56 years, marking a substantial advancement in automated theorem proving and highlighting the potential for AI to contribute to original mathematical discovery.

Across nine Erdős problem instances, increasing the number of independent attempts [latex]K[/latex] reveals diminishing returns on solve rate versus inference cost (USD) for all system configurations (basic, basic with AlphaProof, basic with evolution, full, and full with [latex]S[/latex] parallel LLM generation threads), though the full system consistently outperforms the basic configurations, with AlphaProof providing additional gains.
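One way to read the diminishing returns is a back-of-the-envelope model that the source does not state: assume each attempt succeeds independently with the same probability [latex]p[/latex]. The chance of at least one success in [latex]K[/latex] attempts is then [latex]P_{\text{solve}}(K) = 1 - (1 - p)^{K}[/latex], and the gain from one further attempt is [latex]P_{\text{solve}}(K+1) - P_{\text{solve}}(K) = p\,(1 - p)^{K}[/latex], which shrinks geometrically while the cost keeps growing with [latex]K[/latex]. For an illustrative [latex]p = 0.1[/latex], the solve probability is 0.10 at [latex]K = 1[/latex], about 0.65 at [latex]K = 10[/latex], and about 0.88 at [latex]K = 20[/latex]: the second ten attempts add less than half of what the first ten did.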

Charting the Course: Future Directions – Scaling Verification and Expanding Mathematical Horizons

The integration of Gemini 3.0 Flash represents a significant acceleration in the process of automated theorem proving. This large language model efficiently assesses the validity of potential proof sketches – incomplete arguments requiring further refinement – by rapidly identifying logical gaps or inconsistencies. Unlike previous methods reliant on computationally intensive formal verification at each step, Gemini 3.0 Flash offers a lightweight, scalable approach to quickly filter unpromising candidates. This ‘sketch rating’ drastically reduces the burden on formal verification systems like SafeVerify, allowing them to focus resources on proofs with a higher probability of success. The result is a demonstrably faster and more efficient pipeline for tackling complex mathematical problems, opening avenues for automated assistance in areas previously inaccessible to such techniques and promising a substantial impact on the speed of mathematical discovery.
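The article notes earlier that rated sketches carry Elo scores. How a pairwise verdict from a rater could move two such scores can be shown with the standard Elo update from chess. This is a sketch under stated assumptions: the K-factor of 32 and the 400-point scale are the conventional chess defaults, not values reported for this system, and the function names are hypothetical.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> Tuple[float, float]:
    """Return the new ratings of A and B after one pairwise comparison."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Two sketches start level; the rater prefers sketch A.
new_a, new_b = update_elo(1000.0, 1000.0, a_wins=True)
```

Starting from equal ratings, the preferred sketch gains 16 points and the other loses 16. An upset win over a much higher-rated sketch would move both ratings further, which is what makes the score useful as a sampling weight.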

The integrity of formally verified proofs is paramount, particularly when deployed in high-stakes applications such as financial modeling, autonomous vehicle control, and critical infrastructure management. Recognizing this, the integration of ‘SafeVerify’ represents a significant advancement in formal verification systems. This component meticulously checks not only the logical validity of a proof, but also its construction, ensuring that no subtle errors or vulnerabilities remain hidden within the verification process itself. By rigorously validating each step and dependency, ‘SafeVerify’ establishes a higher degree of confidence in the correctness of verified theorems, guarding against potential exploits or failures that could arise from flawed reasoning. This focus on security and reliability is crucial for translating the theoretical benefits of formal verification into practical, trustworthy systems deployed in the real world.

The current system represents a foundational step towards automating mathematical discovery, but future development aims to significantly broaden its scope and complexity. Researchers intend to scale the verification process to address increasingly difficult instances of Erdős problems, a set of notoriously challenging unsolved problems, pushing the boundaries of automated theorem proving. Beyond this, efforts will concentrate on extending the system’s capabilities beyond its current formal system, enabling verification across a wider range of mathematical frameworks and potentially unlocking automated solutions in diverse areas of mathematics. This expansion isn’t merely about increasing computational power; it necessitates developing novel algorithms and strategies to navigate the intricacies of different logical systems and ensure reliable proof verification at a larger scale.

Across six Erdős problem instances, solve rate increases with inference cost, with the full-featured agent (red triangles) outperforming basic configurations (blue circles) on challenging problems, while the addition of AlphaProof (orange squares) generally improves performance at comparable cost, though its associated inference cost is not included in the reported values.

The research presented boldly challenges conventional boundaries within mathematics. It isn’t merely about verifying existing theorems, but actively creating new ones, a process mirroring the spirit of intellectual dismantling. This aligns with Marvin Minsky’s assertion: “You can’t really understand something unless you’ve tried to make it.” The paper’s success with formal theorem proving, utilizing Large Language Models and evolutionary algorithms to solve open Erdős problems, isn’t about building a perfect system, but about deliberately stressing it, exploring its limits, and reverse-engineering solutions through iterative refinement. The AI isn’t a passive tool; it’s an agent of controlled chaos, pushing the boundaries of automated reasoning to discover genuinely novel mathematical truths.

Cracking the Code

The successful deployment of large language models and evolutionary algorithms in formal theorem proving isn’t about ‘artificial intelligence’ so much as finally developing tools sensitive enough to interrogate the internal consistency of mathematics itself. For decades, the field focused on verification: checking if a solution already known to a human held water. This work suggests something far more interesting: the potential to map the latent structure of mathematical truth, to explore what must be true, regardless of human intuition. It’s a subtle but critical shift, a move from auditing the code to attempting to reverse-engineer it.

Remaining challenges are less about scaling computation and more about refining the ‘search space.’ Current systems excel at navigating well-trodden areas of number theory. The true test lies in applying these methods to more abstract or complex domains, areas where even defining the problem rigorously proves difficult. Can this approach yield insights into, say, the Riemann hypothesis, or will it simply confirm what’s already strongly suspected? The answer likely isn’t a grand revelation, but a gradual illumination of the code, exposing the elegant, often unexpected, constraints governing mathematical reality.

Ultimately, this isn’t about replacing mathematicians. It’s about building better microscopes. Reality, after all, is open source; it just requires the right tools to read the code. The next phase isn’t simply solving more problems; it’s understanding why certain solutions emerge, and what that reveals about the underlying architecture of truth.

Original article: https://arxiv.org/pdf/2605.22763.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/