Author: Denis Avetisyan
A new agent-based system harnesses the power of large language models to dramatically improve the automation of formal program verification.

This paper introduces AutoRocq, a system leveraging agentic workflows and tree-structured proofs for enhanced program analysis and verification.
Despite advances in automated code generation, ensuring the correctness of increasingly complex software remains a significant challenge. The authors frame this task as ‘Agentic Program Verification’ and present AutoRocq, a novel system that leverages large language models within an agent-based framework to autonomously verify program code. Unlike prior approaches reliant on extensive training data, AutoRocq learns on the fly through iterative refinement, collaborating with a formal theorem prover to construct and validate proofs. Could this autonomous, agent-driven approach pave the way for truly trusted and automated software development cycles?
The Challenge of Formal Verification: A Necessary Rigor
The pursuit of bug-free software has long driven the field of formal verification, a rigorous process aiming to mathematically prove program correctness. However, achieving this level of assurance traditionally demands substantial manual intervention from skilled engineers. These experts must meticulously translate program logic into formal specifications and painstakingly guide verification tools through each step, a process that can be both time-consuming and prone to human error. While the potential benefits – enhanced security, improved reliability, and reduced development costs – are considerable, the high initial investment in expertise and effort has historically limited the widespread adoption of formal verification techniques, particularly within industries prioritizing rapid development cycles. This reliance on manual effort represents a significant bottleneck, hindering the ability to consistently deliver demonstrably correct software at scale.
Despite advancements in computer science, automated formal verification tools often falter when confronted with the intricacies of real-world software. These tools, designed to mathematically prove the absence of bugs, encounter scalability issues as program size and complexity increase; the computational resources and time required for verification can grow exponentially. This limitation creates a significant bottleneck in the software development lifecycle, preventing widespread adoption of formal methods. While effective for small, critical components, applying these tools to larger systems frequently proves impractical, forcing developers to rely more heavily on traditional testing methods – which, while faster, offer no guarantee of complete correctness. Consequently, a gap remains between the promise of bug-free software and the realities of efficient development, highlighting the ongoing need for more scalable and robust verification techniques.
The automated verification of software relies heavily on the identification of loop invariants – logical statements that remain true throughout the execution of a loop, serving as critical stepping stones to proving a program’s correctness. However, automatically generating these invariants proves remarkably difficult, particularly for non-trivial programs. Existing tools often falter, requiring human experts to manually devise these statements, a process that is both time-consuming and prone to error. This bottleneck significantly limits the scalability of formal verification, as the effort required to specify invariants can quickly outweigh the benefits of automated checking, hindering the widespread adoption of this vital software assurance technique. Researchers continue to explore machine learning and symbolic execution approaches to alleviate this challenge, aiming to create tools capable of autonomously inferring sufficient invariants for complex codebases.
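To make the idea concrete, here is a minimal, illustrative sketch (not taken from the paper) of what a loop invariant is: a statement that holds on entry to every iteration and, combined with the loop's exit condition, implies the postcondition. A formal tool would prove this symbolically; the sketch below merely checks it dynamically with assertions.

```python
# Illustration only: a loop invariant for a summation loop, checked
# dynamically at every iteration boundary. A verifier such as Frama-C
# would prove the same property symbolically rather than by execution.

def sum_to_n(n: int) -> int:
    s, i = 0, 1
    while i <= n:
        # Candidate loop invariant: s is the sum of 1..i-1.
        assert s == i * (i - 1) // 2, "invariant violated"
        s += i
        i += 1
    # On exit i == n + 1, so the invariant yields the postcondition.
    assert s == n * (n + 1) // 2
    return s

print(sum_to_n(10))  # prints 55
```

The value of the invariant is exactly that the final assertion follows from it plus the exit condition, without inspecting the loop body again; finding such a statement automatically is the hard part the article describes.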
Harnessing LLMs: A Pathway to Automated Invariant Synthesis
Large Language Models (LLMs), particularly GPT-4, have shown the ability to generate candidate loop invariants with a degree of success previously unattainable through purely automated methods. Loop invariants are logical statements that remain true at the beginning and end of each iteration of a loop, and are critical for program verification. GPT-4’s performance stems from its training on a massive dataset of code and natural language, enabling it to identify patterns and relationships within loop structures. While not guaranteed to be correct, the generated invariants provide a starting point for automated verification tools, significantly reducing the manual effort required to prove program correctness. Initial evaluations demonstrate that GPT-4 can generate plausible invariants for a non-trivial subset of benchmark problems, though further refinement and formal validation are consistently necessary.
Direct application of Large Language Models (LLMs) to complex program verification presents performance and reliability challenges. LLMs, while capable of generating candidate invariants, often lack the precision and formal guarantees required for conclusive verification. Consequently, successful integration necessitates combining LLM outputs with established static analysis tools. This approach leverages the LLM’s ability to propose potential invariants, while the static analyzer – such as Frama-C – rigorously validates these proposals against the program’s semantics. Furthermore, techniques like prompt engineering and output filtering are crucial to manage the LLM’s stochastic nature and ensure the generated candidates are syntactically and semantically plausible, reducing the burden on the formal verification stage and improving overall efficiency.
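The output filtering mentioned above could take many forms; one plausible, cheap first pass (an assumption about the design, not AutoRocq's actual code) is a purely syntactic filter that discards candidates which fail to parse or mention out-of-scope variables, before any formal checking is attempted:

```python
# Sketch of syntactic output filtering for LLM-proposed invariants
# (assumed design, not the paper's implementation): keep only candidates
# that parse as boolean expressions over variables in scope.
import ast

ALLOWED_VARS = {"i", "n", "s"}  # variables in scope at the loop head

def plausible(candidate: str) -> bool:
    """Cheap filter applied before the expensive formal-verification stage."""
    try:
        tree = ast.parse(candidate, mode="eval")
    except SyntaxError:
        return False
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return names <= ALLOWED_VARS

candidates = [
    "s == i * (i - 1) // 2",  # syntactically plausible
    "s == total_so_far",      # references an unknown variable
    "s == i * (",             # does not parse at all
]
survivors = [c for c in candidates if plausible(c)]
print(survivors)  # only the first candidate survives
```

A filter like this cannot establish correctness, but it reduces how many candidates the formal stage must reject, which is the efficiency point made above.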
The integration of Large Language Model (LLM)-generated loop invariants with static analysis platforms such as Frama-C represents a significant advancement in automated program verification. LLMs, while capable of proposing candidate invariants, lack the formal rigor to guarantee their correctness; Frama-C addresses this limitation by providing a formal verification engine. This combined approach leverages the LLM’s ability to generate plausible invariants from code, then utilizes Frama-C’s deductive verification capabilities to rigorously prove or disprove these invariants. Specifically, Frama-C can analyze the program code and the proposed invariant, determining if the invariant holds true at all loop entry and exit points, thereby ensuring its validity. This workflow allows for automation of a traditionally manual process, potentially accelerating program verification and improving software reliability.
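Mechanically, handing an LLM-proposed invariant to Frama-C means rendering it in ACSL, Frama-C's annotation language, as a `/*@ loop invariant ...; */` block in front of the loop, and then running the deductive-verification plugin. The sketch below shows that plumbing; the C function, the proposed invariants, and the exact command line are illustrative assumptions, and the `frama-c` invocation is constructed but not executed:

```python
# Minimal sketch of the combined workflow: wrap proposed invariants in
# ACSL syntax, splice them in front of the loop, and prepare a Frama-C
# invocation (shown, not run). Template and invariants are illustrative.
import shlex

C_TEMPLATE = """\
int sum_to_n(int n) {{
  int s = 0;
{annotation}  for (int i = 1; i <= n; i++)
    s += i;
  return s;
}}
"""

def acsl_annotation(invariants: list[str]) -> str:
    """Render candidate invariants as an ACSL loop annotation block."""
    lines = "".join(f"    loop invariant {inv};\n" for inv in invariants)
    return "  /*@\n" + lines + "  */\n"

proposed = ["0 <= i", "s == (i - 1) * i / 2"]  # e.g. filtered LLM output
source = C_TEMPLATE.format(annotation=acsl_annotation(proposed))
cmd = ["frama-c", "-wp", "sum.c"]  # deductive verification via the WP plugin
print(source)
print(shlex.join(cmd))
```

If the WP plugin discharges the generated proof obligations, the invariants are established at every loop entry and exit; if not, the failure is the signal to generate new candidates, closing the loop described above.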

AutoRocq: An LLM Agent for Automated Proof Generation
AutoRocq operates as an LLM agent leveraging three core components for automated theorem proving. Context-aware tactic generation enables the agent to select proof steps – or tactics – based on the current state of the proof and the specific lemma being addressed. Proof tree management facilitates exploration of the proof space by maintaining a structured representation of attempted proof paths, allowing AutoRocq to backtrack and pursue alternative strategies. Crucially, the agent incorporates a feedback handling mechanism that interprets responses from the Rocq interactive theorem prover, using this information to refine tactic selection and guide the proof search process; this cycle of tactic application, prover evaluation, and agent adaptation is central to its functionality.
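The proof-tree component can be pictured as backtracking search over tactic choices. The toy below (an assumed simplification, not the paper's implementation) replaces Rocq with a lookup table from (goal, tactic) to subgoals: a node succeeds when some tactic closes all of its subgoal branches, and failure at a node triggers backtracking to try an alternative tactic.

```python
# Toy proof-tree search with backtracking. The "prover" is a stub table
# mapping (goal, tactic) to remaining subgoals; the real system would
# submit each tactic to Rocq instead.
RULES = {
    ("A -> A", "intro"):      ["A |- A"],
    ("A |- A", "assumption"): [],            # closes the goal
    ("A & A", "split"):       ["A |- A", "A |- A"],
}
TACTICS = ["intro", "split", "assumption"]

def search(goal, depth=8):
    """Return a proof tree (tactic, subproofs), or None after backtracking."""
    if depth == 0:
        return None
    for tactic in TACTICS:
        subgoals = RULES.get((goal, tactic))
        if subgoals is None:
            continue                          # tactic fails here: try the next
        subproofs = [search(g, depth - 1) for g in subgoals]
        if all(p is not None for p in subproofs):
            return (tactic, subproofs)        # every branch closed
        # Some branch was a dead end: backtrack to another tactic.
    return None

print(search("A -> A"))  # ('intro', [('assumption', [])])
```

Maintaining the explored tree explicitly, as AutoRocq does, is what lets the agent abandon a dead-end path and resume from an earlier proof state rather than restarting from scratch.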
AutoRocq leverages the Rocq interactive theorem prover to navigate the search space for formal proofs. This integration allows the agent to propose proof steps as tactics, submit them to Rocq for verification, and receive feedback indicating whether a tactic is valid or leads to a dead end. Crucially, AutoRocq utilizes this feedback to dynamically adjust its tactic selection strategy; invalid tactics are avoided in subsequent attempts, and successful tactics are prioritized, effectively learning from the prover’s evaluations. This iterative process of proposal, verification, and refinement enables AutoRocq to explore the proof space more efficiently than methods relying on static strategies or limited feedback mechanisms.
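One simple way to realize "avoid invalid tactics, prioritize successful ones" (a hedged guess at the mechanism, not the paper's code) is to blacklist (goal, tactic) pairs the prover rejected and order the remaining tactics by a global success count:

```python
# Sketch of feedback-driven tactic selection: failed (goal, tactic)
# pairs are never retried on the same goal, and tactics that have
# succeeded anywhere are tried first. Assumed design, for illustration.
from collections import Counter

class TacticSelector:
    def __init__(self, tactics):
        self.tactics = list(tactics)
        self.failed = set()    # (goal, tactic) pairs the prover rejected
        self.wins = Counter()  # global success count per tactic

    def candidates(self, goal):
        """Tactics worth trying on this goal, most successful first."""
        usable = [t for t in self.tactics if (goal, t) not in self.failed]
        return sorted(usable, key=lambda t: -self.wins[t])

    def record(self, goal, tactic, ok):
        """Fold the prover's verdict back into the selection policy."""
        if ok:
            self.wins[tactic] += 1
        else:
            self.failed.add((goal, tactic))

sel = TacticSelector(["intro", "split", "assumption"])
sel.record("g1", "intro", False)      # Rocq rejected intro on g1
sel.record("g2", "assumption", True)  # assumption closed g2
print(sel.candidates("g1"))           # intro excluded, assumption first
```

Even this crude policy captures the iterative propose–verify–refine loop the paragraph describes: every prover response narrows or reorders the search on subsequent attempts.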
Evaluations demonstrate that AutoRocq achieves a success rate of 51.1% when attempting to prove mathematical lemmas, representing a substantial improvement over existing automated theorem proving systems. Furthermore, AutoRocq successfully proves 30.9% of program lemmas, indicating its applicability to both formal mathematical problems and the verification of computer code. These results were obtained through rigorous testing against benchmark datasets commonly used for evaluating automated reasoning tools, and the reported percentages represent the proportion of lemmas for which AutoRocq could generate a complete and formally verified proof.
Benchmarking and Future Directions: Expanding the Boundaries of Formal Verification
Rigorous evaluation of AutoRocq on established benchmarks demonstrates a substantial advancement in automated theorem proving capabilities. The system achieves a 51.1% success rate in proving mathematical lemmas and a 30.9% rate for program lemmas, representing a significant leap beyond existing approaches. Specifically, AutoRocq outperforms baseline methods by a margin of 20.8% to 343.0% in mathematical lemma proving, and by 42.4% to 204.6% in the more challenging domain of program verification. These results underscore AutoRocq’s ability to tackle complex verification problems and establish a new standard for performance in this critical area of computer science.
The system, AutoRocq, distinguishes itself through the successful automated proof of 142 lemmas – logical statements crucial for verifying software and mathematical systems. This accomplishment isn’t merely quantitative; of these, 98 pertain to mathematical proofs and 44 to program lemmas, indicating a broad applicability and a capacity to address challenges in both abstract reasoning and concrete code verification. Critically, these proofs weren’t achieved by existing methods, signifying that AutoRocq has demonstrably expanded the frontier of automated reasoning and offers genuinely novel contributions to the field of formal verification, paving the way for more reliable and secure software systems.
Evaluations utilizing Linux kernel modules demonstrate AutoRocq’s capacity to address real-world software verification challenges. The system successfully verified 12 lemmas within these complex modules, a substantial improvement over baseline approaches which typically managed between 2 and 10 lemmas. This performance indicates that AutoRocq isn’t merely achieving theoretical gains in lemma proving, but is providing a measurable benefit in practical scenarios. The ability to verify a significantly higher number of properties within critical system code underscores the potential of AutoRocq to enhance software reliability and security in deployed applications, moving beyond research benchmarks to offer tangible improvements in code validation.
The success of AutoRocq underscores a powerful convergence of artificial intelligence and rigorous mathematical verification. By integrating large language models with formal methods, the system doesn’t simply detect potential software flaws, but actively proves the absence of those flaws with mathematical certainty. This synergistic approach moves beyond traditional testing, which can only demonstrate the presence of errors, towards a future where software reliability is established through formal proof. The demonstrated ability to verify complex lemmas, particularly within the challenging domain of Linux kernel modules, suggests a pathway for building demonstrably trustworthy software systems and reducing the risk of critical failures, marking a significant advancement in software engineering practices.

The pursuit of automated program verification, as demonstrated by AutoRocq, mirrors a fundamental principle of elegant design: reducing complexity to reveal underlying truth. This system’s agent-based approach, utilizing large language models to navigate a tree-structured proof representation, exemplifies a commitment to parsimony. As Marvin Minsky observed, “Questions are more important than answers.” AutoRocq doesn’t merely provide verification; it frames the problem in a way that allows for focused inquiry, streamlining the process and illuminating potential flaws with greater efficiency. The system’s success rests on asking the right questions, not simply accumulating layers of code or complexity.
What Lies Ahead?
The advent of AutoRocq represents not an arrival, but a refinement of the inevitable. Automation, previously constrained by the combinatorial explosion of proof search, now benefits from the statistical leverage of large language models. However, this benefit is not without cost. The system, while demonstrably effective, remains tethered to the inherent limitations of its linguistic foundation. True verification demands not merely plausible reasoning, but guaranteed correctness. The current reliance on LLM-generated heuristics introduces a subtle, yet persistent, vulnerability.
Future work must address the fidelity of semantic representation. The tree-structured proof, while elegant, is ultimately an interpretation – a map, not the territory. Bridging the gap between linguistic inference and formal semantics remains the central challenge. A reduction in reliance on probabilistic completion, and a corresponding increase in deductive rigor, is not merely desirable; it is the logical terminus of this line of inquiry. Unnecessary embellishment is violence against attention.
The pursuit of complete automation is, perhaps, a category error. The most valuable contributions may lie not in replacing human insight, but in augmenting it. A system capable of identifying critical proof states, or suggesting promising lemmas, offers a more realistic, and arguably more impactful, path forward. Density of meaning is the new minimalism.
Original article: https://arxiv.org/pdf/2511.17330.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-11-24 15:00