Author: Denis Avetisyan
A new approach to automated theorem proving leverages large language models and iterative refinement to achieve strong results with minimal complexity.

This paper introduces AxProverBase, a minimal agent for formal verification within the Lean theorem prover, demonstrating competitive performance through strategic memory management and language model integration.
Despite advances in artificial intelligence, systematically evaluating and comparing automated theorem proving architectures remains a significant challenge. This paper, ‘A Minimal Agent for Automated Theorem Proving’, introduces AxProverBase, a streamlined agentic system designed to establish a robust baseline for such comparisons. By focusing on iterative proof refinement, effective memory management, and leveraging large language models, AxProverBase achieves competitive performance against state-of-the-art approaches with a significantly simpler design. Could this minimal approach unlock more accessible and efficient formal verification tools for the broader research community?
The Challenge of Formal Systems Verification
Automated Theorem Proving (ATP) stands as a cornerstone in the pursuit of flawless software and hardware systems, yet its implementation is profoundly challenging. The difficulty arises from the inherent complexity of formal systems, which require representing intricate designs and specifications with absolute precision. ATP systems attempt to mechanically verify that a given design adheres to its specification – essentially, proving its correctness without ambiguity. However, even moderately complex systems generate an astronomical number of potential states and interactions, creating a vast search space for any proof attempt. This combinatorial explosion necessitates increasingly sophisticated algorithms and substantial computational power, as even minor errors in the formal representation or proof process can invalidate the entire verification effort. Consequently, despite significant advancements, achieving fully automated and scalable verification remains a central hurdle in ensuring the reliability of critical systems.
Traditional formal verification techniques, particularly those reliant on whole proof generation, encounter considerable limitations when applied to realistically complex systems. These methods attempt to construct a complete, end-to-end proof of a system’s correctness, a process that rapidly becomes computationally expensive as the system’s size and intricacy increase. The number of potential proof steps grows exponentially with even modest increases in complexity, quickly overwhelming available resources and rendering the approach impractical. This scalability issue isn’t simply a matter of faster hardware; it reflects a fundamental challenge in navigating the vast, often infinite, search space of possible logical inferences. Consequently, while theoretically sound, whole proof generation struggles to provide timely or even feasible verification for many modern software and hardware designs, necessitating the development of more efficient and targeted methodologies.
The core of formal verification’s challenge resides in the sheer scale of possibilities that must be explored to confirm a system’s correctness. Each verification attempt involves navigating a remarkably vast ‘search space’ – a combinatorial explosion of potential proof steps. This isn’t simply a matter of computational power; traditional algorithms can become trapped in unproductive branches, endlessly pursuing paths that don’t lead to a conclusive proof. Consequently, research focuses on developing more intelligent methodologies, algorithms capable of adapting their search strategies, prioritizing promising avenues, and effectively pruning irrelevant ones. These adaptable approaches aim to move beyond brute-force exploration, instead employing heuristics and learned patterns to efficiently traverse the proof landscape and pinpoint potential errors with greater speed and accuracy, ultimately enabling verification of increasingly complex systems.
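The idea of prioritizing promising branches rather than exhaustively enumerating them can be sketched as a best-first search. The sketch below is illustrative only and is not from the paper: `expand`, `score`, and `is_proved` are hypothetical stand-ins for whatever state expansion, heuristic, and success check a real prover would use.

```python
import heapq

def best_first_prove(start, expand, score, is_proved, max_steps=1000):
    """Explore proof states in order of heuristic promise instead of brute force."""
    frontier = [(score(start), 0, start)]
    tie = 1                                  # tie-breaker so states never compare
    seen = {start}                           # prune already-visited branches
    for _ in range(max_steps):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), tie, nxt))
                tie += 1
    return None

# Toy example: "proving" means reaching 0 by halving or decrementing integers.
result = best_first_prove(
    10,
    expand=lambda n: [n // 2, n - 1],
    score=lambda n: n,                       # smaller means closer to the goal
    is_proved=lambda n: n == 0,
)
print(result)  # 0
```

The heuristic here is trivial; the point is the structure: a priority queue that always expands the most promising state, with visited-state pruning standing in for the "pruning irrelevant branches" the paragraph describes.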
![An ablation study using Claude Opus 4.5 and a 10,000-token budget demonstrates that iteratively incorporating feedback, memory (via previous attempt history and self-reflection), and search tools significantly improves theorem-proving performance, as measured by pass@k values (mean ± 95% confidence interval across 50 samples) and the percentage of proven theorems per iteration (mean ± standard error across 2-3 samples).](https://arxiv.org/html/2602.24273v1/2602.24273v1/x2.png)
Deconstructing Complexity: An Agentic Approach with AxProverBase
AxProverBase employs an agentic architecture to address complex reasoning tasks by decomposing them into a series of smaller, independently solvable sub-problems. This approach deviates from monolithic problem-solving methods by distributing the workload across multiple specialized agents. Each agent focuses on a specific aspect of the overall problem, facilitating parallel processing and modularity. This decomposition allows for targeted application of reasoning strategies and enables more efficient resource allocation, ultimately improving the system’s capacity to handle intricate and multi-faceted challenges. The agentic framework also supports greater flexibility and adaptability, as individual agents can be modified or replaced without disrupting the entire system.
The AxProverBase architecture employs a two-agent system for proof construction. The Proposer Agent autonomously generates candidate proof steps based on the current proof state and available axioms. These steps are then submitted to the Review System, which independently verifies their logical validity according to the underlying formal system’s rules. The Review System’s output, a boolean indication of correctness, is fed back to the Proposer Agent, guiding subsequent step generation. This iterative process of proposal and verification forms the core mechanism for navigating the proof search space and constructing valid proofs.
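The propose-and-review cycle described above can be sketched as a simple loop. This is a hedged sketch, not the paper's actual API: `propose` and `review` are hypothetical callables that a real system would back with an LLM and the Lean kernel, respectively.

```python
def prove(goal, propose, review, max_iters=10):
    """Iterate: propose a candidate step, verify it, feed the verdict back."""
    feedback = None
    for _ in range(max_iters):
        candidate = propose(goal, feedback)      # Proposer Agent
        ok, feedback = review(goal, candidate)   # Review System verdict
        if ok:
            return candidate                     # verified proof step
    return None

# Toy instantiation: the "proof" of goal n is the string "step-n";
# the proposer counts upward until the reviewer accepts.
def toy_propose(goal, feedback):
    k = 0 if feedback is None else feedback + 1
    return f"step-{k}"

def toy_review(goal, candidate):
    if candidate == f"step-{goal}":
        return True, None
    return False, int(candidate.split("-")[1])

print(prove(3, toy_propose, toy_review))  # step-3
```

The key structural feature is that the reviewer's verdict is threaded back into the next proposal, which is what distinguishes this loop from independent sampling of whole proofs.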
AxProverBase’s performance is significantly enhanced by its Context Management and Memory Node, which facilitate Iterative Proof Refinement. The system stores data from each attempted proof step – including proposed steps, review outcomes, and associated contextual information – within the Memory Node. This retained information is then leveraged during subsequent attempts; the Context Management system analyzes prior failures and successes to inform the Proposer Agent, guiding it towards more promising proof strategies and avoiding redundant exploration of previously invalidated paths. Empirical results indicate that this iterative refinement process, enabled by contextual memory, contributes the most substantial performance gains within the AxProverBase architecture.
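A minimal sketch of the Memory Node idea follows: record every attempt together with its review outcome, and let later refinement passes skip candidates that were already refuted. The class and function names here are illustrative assumptions, not the paper's implementation, and the Lean tactic strings are purely decorative.

```python
class MemoryNode:
    """Stores (candidate, outcome, feedback) triples across proof attempts."""
    def __init__(self):
        self.attempts = []

    def record(self, candidate, ok, feedback):
        self.attempts.append((candidate, ok, feedback))

    def failed_candidates(self):
        return {c for c, ok, _ in self.attempts if not ok}

def refine(goal, candidates, check, memory):
    """One refinement pass: avoid paths memory has already invalidated."""
    for cand in candidates:
        if cand in memory.failed_candidates():
            continue                          # prune previously refuted branch
        ok = check(goal, cand)
        memory.record(cand, ok, None)
        if ok:
            return cand
    return None

mem = MemoryNode()
mem.record("rw [add_comm]", False, "no progress")  # a failed earlier attempt
found = refine(
    "a + b = b + a",
    ["rw [add_comm]", "simp", "ring"],
    check=lambda g, c: c == "ring",           # toy checker: only "ring" succeeds
    memory=mem,
)
print(found)  # ring
```

Because the refuted tactic is skipped rather than retried, each pass spends its budget on unexplored candidates, which is the intuition behind the "avoiding redundant exploration" claim above.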
Benchmarking Performance: Validation on Established Datasets
AxProverBase has undergone evaluation using established benchmark datasets designed to assess automated theorem proving capabilities across diverse mathematical areas. These datasets include PutnamBench, which focuses on problems from undergraduate mathematics competitions; FATE, a benchmark specifically constructed for evaluating formal proof systems in abstract and commutative algebra; and LeanCat, a collection of tactics and proofs from the Lean proof assistant. Utilizing these benchmarks allows for a standardized comparison of AxProverBase’s performance against other automated theorem provers and provides insights into its strengths and weaknesses in different mathematical domains. The use of multiple datasets ensures a comprehensive evaluation, moving beyond performance on a single type of problem.
Evaluations on the PutnamBench benchmark dataset indicate that AxProverBase achieves performance levels competitive with state-of-the-art automated theorem provers. This is notable because AxProverBase utilizes a comparatively simpler architectural design than many leading systems. While specific performance metrics vary depending on the problem set, the system demonstrates a consistent ability to solve a substantial number of PutnamBench problems, achieving comparable success rates despite its reduced complexity. This suggests an efficient implementation and effective proof search strategy within the constraints of its architecture.
Evaluations using the FATE benchmark demonstrate AxProverBase’s successful application to problems in abstract and commutative algebra, indicating its capability within advanced mathematical domains. Performance analysis reveals a substantial reduction in resource consumption compared to the Hilbert proof assistant; AxProverBase requires significantly fewer computational resources to achieve comparable results on these benchmarks. This efficiency is notable considering the complexity of the problems addressed within the FATE dataset, which include formalizations of mathematical competition problems and research-level theorems.
AxProverBase distinguishes itself through its iterative proof construction capabilities, enabling the system to leverage prior unsuccessful attempts to refine subsequent proof searches. This functionality is crucial for navigating complex proof landscapes, as the system doesn’t simply discard failed branches but instead analyzes them to inform future strategies. By building upon previous work, AxProverBase avoids redundant exploration and focuses computational resources on promising avenues, contributing to its overall efficiency and competitive performance, particularly within challenging mathematical domains where exhaustive search is impractical.
![Claude Sonnet and Opus 4.5, allocated a 10,000-token budget, and Gemini 3 Flash and Pro, set to high thinking, demonstrate varying pass rates with 95% confidence intervals, revealing a cost-performance trade-off as thinking budgets are adjusted from 2k to 32k tokens.](https://arxiv.org/html/2602.24273v1/2602.24273v1/x3.png)
Envisioning the Future: Foundation Models and Reinforcement Learning
The integration of large-scale foundation models with the AxProverBase system promises a significant leap in automated theorem proving capabilities. These models, pre-trained on vast corpora of mathematical text and code, possess an inherent understanding of mathematical language, symbols, and common proof strategies. By leveraging this pre-existing knowledge, the system can move beyond purely symbolic manipulation to grasp the meaning behind mathematical statements. This allows for more intelligent hypothesis generation, a refined ability to identify relevant axioms and theorems, and ultimately, a substantial acceleration of the proof search process. Instead of exhaustively exploring all possible paths, the system can prioritize those most likely to lead to a valid proof, mirroring the intuition of a human mathematician and enabling progress on problems previously considered intractable. The potential extends to not only verifying existing proofs but also assisting in the discovery of novel mathematical relationships, essentially augmenting human mathematical reasoning.
The Proposer Agent’s effectiveness within automated theorem proving can be significantly enhanced through the application of reinforcement learning. This training methodology allows the agent to learn from a reward signal – positive reinforcement for steps that contribute to a successful proof, and negative for those that lead to dead ends – thereby optimizing its strategy for navigating the vast search space of possible proof steps. Unlike traditional approaches reliant on hand-crafted heuristics, reinforcement learning enables the agent to discover novel and potentially more efficient proof techniques. By iteratively refining its decision-making process through trial and error, the agent becomes adept at identifying promising avenues of exploration and avoiding unproductive paths, ultimately accelerating the overall proof process and improving its ability to tackle complex mathematical challenges. The agent learns not just what steps to take, but when to take them, leading to a more nuanced and effective search strategy than previously attainable.
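The reward-driven learning described above can be illustrated with a tiny tabular sketch. This is an assumption-laden toy, not the paper's method: the tactic list, the binary reward, and the preference update are all invented for illustration, standing in for a real policy-gradient update over an LLM proposer.

```python
import random

random.seed(0)
tactics = ["simp", "ring", "linarith"]
prefs = {t: 0.0 for t in tactics}              # tabular preferences per tactic

def sample(prefs):
    """Draw a tactic with probability proportional to 2^preference."""
    total = sum(2.0 ** p for p in prefs.values())
    r = random.random() * total
    for t, p in prefs.items():
        r -= 2.0 ** p
        if r <= 0:
            return t
    return t

def reward(tactic):
    return 1.0 if tactic == "ring" else 0.0    # toy: only "ring" closes the goal

for _ in range(200):                           # trial-and-error episodes
    t = sample(prefs)
    prefs[t] += 0.1 * (reward(t) - 0.5)        # reinforce successes, penalize dead ends

best = max(prefs, key=prefs.get)
print(best)  # ring
```

Successful tactics accumulate preference and are sampled more often, while dead ends are suppressed, mirroring in miniature how a reward signal would bias the Proposer Agent toward productive proof steps.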
The convergence of agentic architectures, large foundation models, and reinforcement learning strategies promises a significant leap forward in automated theorem proving. By framing the proving process as a series of actions undertaken by an intelligent agent, researchers can leverage the contextual understanding and generative capabilities of foundation models – trained on vast mathematical corpora – to suggest promising proof steps. Reinforcement learning then refines this process, allowing the agent to learn from its successes and failures, optimizing its exploration of the proof search space and ultimately improving its ability to solve increasingly complex mathematical problems. This synergistic approach moves beyond traditional, rule-based systems, paving the way for theorem provers capable of not just verifying proofs, but also discovering them with a level of ingenuity previously unattainable, potentially accelerating mathematical discovery itself.
The work detailed in this paper underscores a crucial principle: effective systems aren’t born from complexity, but from refined simplicity. AxProverBase, by stripping away unnecessary components and concentrating on iterative refinement and memory management, demonstrates a surprising level of performance. This echoes Bertrand Russell’s observation: “The point of the game is to be able to think clearly.” Just as Russell valued clarity of thought, this research prioritizes a streamlined approach to automated theorem proving, proving that a minimal agent, focused on core functionality, can achieve competitive results within the landscape of formal verification and large language models. Structure, in this instance, undeniably dictates behavior.
What Lies Ahead?
The architecture presented here, intentionally minimal, reveals a predictable truth: complexity rarely originates within a core system, but rather accrues at its interfaces. AxProverBase demonstrates that competitive performance in automated theorem proving does not necessarily demand intricate, hand-engineered heuristics. Instead, the limitations now reside in the scaffolding – the imperfect translation between natural language problem statements and formal logic, the subtle biases embedded within the large language models themselves, and the surprisingly brittle nature of even well-defined formal systems. Modifying one component of this chain – say, improving the LLM’s ability to parse mathematical intent – will inevitably trigger a cascade of adjustments required elsewhere.
Future work, therefore, must shift focus from optimizing the ‘solver’ itself to understanding, and ultimately streamlining, the entire cognitive loop. The question is not merely ‘can a machine prove this theorem?’, but ‘how does one effectively communicate a mathematical idea to a machine?’. This necessitates a deeper exploration of formal language design, perhaps moving beyond the current reliance on purely symbolic notation towards systems that can incorporate more nuanced, context-aware representations.
Ultimately, the pursuit of automated theorem proving is not about replacing mathematicians, but about augmenting their capabilities. A truly elegant system will not merely verify correctness, but actively participate in the creative process, identifying promising avenues of inquiry and illuminating the underlying structure of mathematical truth. That, however, demands a humility rarely seen in artificial intelligence – a willingness to admit the limits of its own understanding and to learn from the intricate, often paradoxical, logic of the universe itself.
Original article: https://arxiv.org/pdf/2602.24273.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-02 11:36