Can AI Write Formal Proofs?

Author: Denis Avetisyan


A new approach leverages large language model agents to significantly automate the painstaking process of verifying complex systems with mathematical rigor.

This paper demonstrates an 87% success rate in mechanizing a formal system through agentic proof automation, using off-the-shelf language models and the Lean 4 theorem prover.

Despite the increasing demand for mathematically certified software, formal proof development remains a laborious and time-consuming process. This paper, ‘Agentic Proof Automation: A Case Study’, introduces a novel workflow leveraging recent advances in large language models to automate the majority of proof engineering tasks under human direction. We demonstrate that off-the-shelf LLM agents can successfully complete proof tasks with an 87% success rate when applied to a complex formal system, System Capless in Lean 4, requiring minimal human intervention in 84% of cases. Could this agentic approach represent a paradigm shift towards more accessible and productive formal verification?


The Formalization Imperative: Bridging Intuition and Rigor

Mathematicians often construct proofs through a process of insightful leaps and intuitive reasoning, prioritizing clarity of concept over absolute formal rigor. However, translating these informal arguments into a completely formalized structure is paramount when these mathematical principles underpin critical software systems. This formalization isn’t merely an academic exercise; it’s the bedrock of software verification, ensuring that code operates predictably and without errors. By meticulously detailing every logical step – a process akin to building a bridge with individually inspected supports – formal proofs allow computers to independently verify the correctness of algorithms, guaranteeing reliability in applications ranging from aircraft control systems to financial modeling. The demand for this level of assurance is driving a need for tools and techniques that can bridge the gap between human intuition and machine-checkable proofs.

The process of converting mathematical intuitions into formally verifiable proofs, known as proof mechanization, historically demands an extraordinary investment of time and specialized skill. Each step, from translating the initial idea into symbolic logic to meticulously verifying every inference rule, is executed manually, often requiring years of dedicated effort from highly trained experts. This labor-intensive nature creates a significant bottleneck in fields like software and hardware verification, where proving the correctness of complex systems is paramount. While the benefits of formal proof are substantial – guaranteeing reliability and preventing costly errors – the practical limitations of traditional methods restrict their widespread adoption, hindering progress in critical applications demanding absolute certainty.

The chasm between how mathematicians instinctively approach problems and the absolute precision demanded by formal proof systems presents a significant hurdle in applying mathematical rigor to practical applications. Human reasoning often relies on intuition, glossing over subtle details that a computer, strictly interpreting formal logic, requires explicit declaration. This disconnect isn’t a flaw in either process, but rather a difference in methodology; however, it creates a substantial bottleneck. Successfully bridging this gap necessitates increased automation in proof construction, allowing computers to translate the essence of a mathematician’s insight into a formally verifiable argument. Without such tools, the benefits of mathematically grounded software and hardware verification remain largely inaccessible, hindering progress in critical domains that demand absolute reliability.
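To make this gap concrete, consider two facts a mathematician would call equally obvious. The Lean 4 snippet below (an illustrative example, not drawn from the paper) shows that the formal effort they demand differs sharply: one holds by definition, while the other requires an explicit induction.

```lean
-- Illustrative only: two facts a human would call equally obvious.
-- `n + 0 = n` holds by definition of addition on Nat, so `rfl` suffices;
-- `0 + n = n` does not, and the machine demands an explicit induction.
example (n : Nat) : n + 0 = n := rfl

theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ n ih => rw [Nat.add_succ, ih]
```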

Agentic Automation: A Paradigm Shift in Formal Verification

Agentic Proof Automation represents a shift in formal verification workflows by implementing a system where Large Language Model (LLM) agents independently perform significant portions of the proof engineering process. This automation extends beyond simple script execution to include tasks such as hypothesis generation, tactic selection, and state exploration within a formal verification environment. While these agents operate autonomously, human oversight remains crucial for guiding the agent’s direction, validating generated proofs, and resolving ambiguous or complex scenarios. This hybrid approach aims to substantially reduce the manual effort traditionally required for developing and maintaining formal proofs, increasing both efficiency and scalability of verification processes.

Agentic behavior within proof automation systems involves the use of Large Language Models (LLMs) to independently navigate and analyze codebases. This process extends beyond simple code completion; the LLM autonomously identifies candidate proof steps and the definitions, lemmas, and hypotheses they depend on, without explicit, step-by-step human instruction. By iteratively exploring the codebase and generating these proof steps, the system significantly reduces the manual effort traditionally required for formal verification. The LLM acts as an agent, proactively searching for relevant code segments and constructing a candidate proof, which is then subject to human review and refinement, substantially decreasing the time and resources needed for complex proof engineering.

The architecture of Agentic Proof Automation relies on large language models (LLMs), specifically instances of Claude and GPT, to function as the primary reasoning engines. These LLMs are not utilized for one-time proof generation; rather, they operate in an iterative refinement loop. The LLM proposes a proof step, evaluates its validity (through automated checking or human feedback), and then uses the results of that evaluation to refine subsequent proposals. This iterative process allows the system to progressively build more robust and accurate proofs, leveraging the LLM’s ability to learn from its own outputs and adjust its reasoning strategy. Within this framework, the LLMs orchestrate the overall proof search while delegating type checking and constraint solving to the underlying verification tools.
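A rough sketch of this propose-evaluate-refine cycle is given below. The function and parameter names (refine_proof, llm_complete, check, MAX_ATTEMPTS) are hypothetical placeholders rather than the paper's actual interface; llm_complete stands in for a call to an off-the-shelf model such as Claude or GPT, and check for the proof checker discussed in the next section.

```python
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical sketch of the iterative refinement loop: the agent proposes a
# proof, the checker evaluates it, and the compiler feedback is fed into the
# next proposal. All names are illustrative placeholders.

MAX_ATTEMPTS = 8  # assumed per-task budget, not a figure from the paper


def refine_proof(
    goal: str,
    module: Path,
    llm_complete: Callable[[str], str],
    check: Callable[[Path], Tuple[bool, str]],
) -> bool:
    """Iteratively propose, check, and repair a proof for one goal."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        draft = llm_complete(
            f"Prove the following Lean 4 goal.\n\nGoal:\n{goal}\n\n"
            f"Checker feedback from the previous attempt:\n{feedback or 'none'}"
        )
        module.write_text(draft)      # write the candidate proof to the module
        ok, feedback = check(module)  # compile and collect diagnostics
        if ok:
            return True               # proof accepted by the checker
    return False                      # escalate to the human operator
```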

The Toolkit for Automated Reasoning: Foundations and Integration

The agent operates on the formal system’s codebase through dedicated tools for code exploration and modification. These tools facilitate navigation of the project’s file structure and allow the agent to programmatically read, edit, and save files within the codebase. Specifically, the agent utilizes these capabilities to inspect existing definitions, identify relevant lemmas or theorems, and propose changes to the formal system’s code, such as adding new definitions, modifying existing ones, or attempting to complete proofs. The tools abstract the underlying file system interactions, providing an interface for the agent to manipulate the formal system’s source code as part of its automated reasoning workflow.
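A minimal sketch of what such exploration and modification tools could look like when exposed to an agent follows; the tool names, return formats, and project layout are assumptions made for illustration, not the paper's actual toolset.

```python
import json
from pathlib import Path

# Hypothetical read/search/edit tools over a Lean project, of the kind an
# LLM agent could call. Names and return formats are illustrative only.

PROJECT_ROOT = Path("SystemCapless")  # assumed project layout


def list_modules() -> list[str]:
    """Enumerate Lean source files so the agent can navigate the codebase."""
    return [str(p.relative_to(PROJECT_ROOT)) for p in PROJECT_ROOT.rglob("*.lean")]


def read_module(rel_path: str) -> str:
    """Return the full text of one module for inspection."""
    return (PROJECT_ROOT / rel_path).read_text()


def edit_module(rel_path: str, old: str, new: str) -> str:
    """Replace an exact snippet (e.g. a `sorry` placeholder) with new text."""
    path = PROJECT_ROOT / rel_path
    text = path.read_text()
    if old not in text:
        return json.dumps({"ok": False, "error": "snippet not found"})
    path.write_text(text.replace(old, new, 1))
    return json.dumps({"ok": True})
```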

The lean4check tool serves as a critical component in the automated reasoning workflow by compiling Lean 4 modules and presenting the results to the agent. This compilation process includes type checking and, where applicable, invoking SMT solvers to resolve logical constraints within the formal system. The output from lean4check is not merely a success/failure indicator; it provides detailed error messages and contextual information that the agent utilizes to diagnose issues in its attempted proofs or formalizations. This feedback loop allows the agent to iteratively refine its strategies and ensure the correctness of the developed formal system, effectively guiding its subsequent actions and facilitating progress toward a verified solution.
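The exact interface of lean4check is not reproduced here, but a stripped-down stand-in can be sketched by wrapping the Lean toolchain directly. The example below assumes a Lake-managed project and uses the standard lake env lean invocation to elaborate one module and return its diagnostics; the module path shown is hypothetical.

```python
import subprocess
from pathlib import Path

# Stand-in for a lean4check-style tool (the real tool's interface may differ):
# compile one Lean 4 module inside a Lake project and hand the full compiler
# output back to the agent as feedback.


def check_module(module: Path) -> tuple[bool, str]:
    """Type-check `module` and return (success, diagnostics)."""
    result = subprocess.run(
        ["lake", "env", "lean", str(module)],  # assumes a Lake-managed project
        capture_output=True,
        text=True,
    )
    diagnostics = (result.stdout + result.stderr).strip()
    return result.returncode == 0, diagnostics


if __name__ == "__main__":
    ok, report = check_module(Path("SystemCapless/Subtyping.lean"))  # hypothetical path
    print("accepted" if ok else "rejected")
    print(report)
```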

The automated reasoning agent leverages Lean 4 as its Interactive Theorem Prover (ITP), a system designed for constructing and verifying mathematical proofs. Lean 4 provides the foundational logic and proof infrastructure. However, many problems require determining the satisfiability of logical constraints, a task for which the agent utilizes SMT (Satisfiability Modulo Theories) Solvers. These solvers, integrated with Lean 4, efficiently determine whether a given set of logical formulas, potentially involving arithmetic, arrays, and other theories, is consistent or if a counterexample exists. The combination of Lean 4’s proof capabilities and SMT solvers’ constraint solving enables the agent to address a broader range of formal reasoning tasks than either system could manage independently.
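As a small illustration of this division of labour (not an excerpt from System Capless), the goals below are closed by built-in decision procedures rather than hand-written arguments, which is the role automated constraint solving plays in the workflow. This assumes a recent Lean 4 toolchain in which the omega and decide tactics are available.

```lean
-- Illustrative only: arithmetic and decidable goals discharged automatically.
-- `omega` decides linear arithmetic over integers and naturals;
-- `decide` evaluates a decidable proposition to produce a proof.
example (a b c : Nat) (h₁ : a ≤ b) (h₂ : b < c) : a + 1 ≤ c := by
  omega

example : 17 % 4 = 1 := by
  decide
```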

Impact Assessment: System Capless and Beyond

This automated methodology was rigorously tested on System Capless, a complex formal system distinguished by its support for capture polymorphism, the ability to abstract over the sets of capabilities that terms and types may capture. This system presents a significant challenge for automated theorem proving due to its intricate binding structure and the subtle reasoning it requires about capture sets. Applying the automated approach to System Capless allowed for a detailed evaluation of its capabilities in a non-trivial setting, pushing the boundaries of what is currently achievable in formal verification and demonstrating its potential to tackle highly complex mathematical structures. The selection of System Capless was strategic: it represents a benchmark for advanced formal systems and serves as a proving ground for future developments in automated reasoning.
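As a rough intuition for what such a system must represent (a deliberately simplified, hypothetical sketch, not the actual System Capless definitions), a capture-aware syntax pairs function types with the set of capabilities they may capture and provides a binder for quantifying over such sets:

```lean
-- Heavily simplified, hypothetical sketch; not the System Capless sources.
-- Capture sets record which capabilities a value may retain; function types
-- carry one, and `capAll` quantifies over them (capture polymorphism).
abbrev CaptureSet := List Nat           -- capability variables, by index

inductive Ty where
  | top    : Ty                         -- the top type
  | arrow  : Ty → CaptureSet → Ty → Ty  -- S →{C} T: a function that may capture C
  | capAll : Ty → Ty                    -- ∀[c] T: abstraction over a capture set
```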

Across a comprehensive evaluation involving 189 distinct tasks, the automated agent demonstrated a remarkable ability to navigate and formalize elements within the System Capless development, achieving an overall success rate of 87%. This performance signifies a substantial advancement in automating formalization processes, as the agent consistently identified and successfully implemented logical structures within the complex system. The high success rate isn’t merely a statistical measure; it underscores the agent’s capacity to handle nuanced challenges inherent in formal systems, effectively translating abstract concepts into rigorously defined, machine-checkable proofs.

The automated formalization agent exhibited a remarkable degree of independence throughout its operation, necessitating human guidance for only 16% of the assigned tasks. This high level of autonomy underscores the system’s ability to navigate the complexities of formal systems with minimal external support. Beyond initial proof construction, the agent demonstrated a robust capacity for error recovery, successfully repairing 90% of broken proofs encountered during the formalization process. This proficiency in self-correction is particularly noteworthy, as it highlights the system’s potential to not only generate formalizations but also maintain their integrity and correctness with limited human oversight, suggesting a pathway toward increasingly automated theorem proving and verification.

The automated formalization process, as demonstrated with System Capless, isn’t a linear progression but rather a dynamic cycle of conjecture and verification. Analysis reveals the agent invoked the lean4check tool, on average, 8.3 times for each completed task, encompassing both proof development and state verification. This frequent interaction isn’t a sign of inefficiency; it underscores the agent’s deliberate strategy of constantly testing and refining its work based on feedback. Each invocation of lean4check represents a critical self-assessment, allowing the system to identify and correct errors iteratively, ultimately contributing to a robust and reliable formalization process and a high overall success rate.

The successful automation of formal system mechanization, as demonstrated in the study, highlights a fundamental principle of systemic integrity. One observes that structure invariably dictates behavior, and the agentic workflow presented here meticulously enforces a structure upon the proof development process. As Donald Davies observed, “The only thing worse than being right is being right too soon.” This resonates with the need for robust, verified systems; a correct solution delivered prematurely, without thorough validation such as the formal verification achieved here, offers little practical benefit. The 87% success rate underscores the power of aligning structure with intended functionality, preventing failures along previously invisible boundaries.

Future Directions

The demonstrated success, while encouraging, illuminates the inherent fragility of current approaches. This work doesn’t solve the problem of formal verification; rather, it shifts the locus of difficulty. The bottleneck isn’t simply generating proof steps, but ensuring their structural integrity within a larger system. As in urban planning, incremental evolution is preferable to wholesale reconstruction, yet current methodologies often require significant re-evaluation of prior ‘successful’ steps as the formal system expands. The 87% success rate represents a promising start, but the remaining 13% hints at a fundamental need for agents capable of self-diagnosis and repair, not merely step generation.

A key limitation lies in the reliance on human guidance. True automation demands a system capable of navigating ambiguity and making informed decisions about proof strategy. This necessitates a deeper integration of meta-reasoning – the ability to reason about reasoning – into the agentic framework. Capture polymorphism, while powerful, represents a single tool in a much larger toolkit. Future research should explore how different formalization techniques can be seamlessly integrated and dynamically selected based on the specific characteristics of the system being verified.

Ultimately, the goal isn’t to create agents that mimic human mathematicians, but to construct systems that embody the principles of robust, verifiable computation. A focus on structural evolution, rather than isolated proof construction, offers a path toward creating systems that are not merely correct, but demonstrably resilient to change. The elegance of such a system lies not in its complexity, but in its inherent simplicity.


Original article: https://arxiv.org/pdf/2601.03768.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
