Author: Denis Avetisyan
Researchers have developed a new framework that leverages the power of artificial intelligence to translate visual and textual information into rigorously verifiable logical statements.

MMFormalizer introduces a method for multimodal autoformalization, alongside the PhyX-AF benchmark for evaluating formal reasoning in mathematics and physics.
Despite advances in machine reasoning, translating real-world mathematics, often expressed through both text and visuals, into formal, verifiable statements remains a fundamental challenge. This paper introduces MMFormalizer: Multimodal Autoformalization in the Wild, a novel framework that extends autoformalization by adaptively grounding language in perceptual data from mathematical and physical domains. MMFormalizer recursively constructs formal propositions and is evaluated on a new benchmark, PhyX-AF, demonstrating strong performance with frontier large language models like GPT-5 and Gemini-3-Pro, even tackling complex areas like relativity and quantum mechanics. Could this unified approach to multimodal autoformalization ultimately bridge the gap between human intuition and rigorous, machine-checkable proofs?
Bridging Perception and Logic: The Challenge of Autoformalization
The pursuit of truly reliable artificial intelligence fundamentally depends on effectively translating the nuances of human understanding – expressed through natural language – into the precise language of formal systems. Current AI often excels at pattern recognition but struggles with genuine reasoning because it lacks a solid foundation in logical axioms derived from perceived information. This disconnect hinders an AI’s ability to not only process data, but to understand its implications, draw valid conclusions, and justify those conclusions with provable steps. Establishing a seamless bridge between how information is received – through language, vision, or other senses – and how it is represented internally as formal logic is therefore paramount; without it, AI remains susceptible to errors stemming from ambiguity, incomplete information, or simply a lack of true comprehension, limiting its potential for complex problem-solving and trustworthy decision-making.
The translation of raw perceptual data – be it visual scenes, auditory signals, or sensor readings – into the precise language of formal logic presents a formidable challenge for conventional artificial intelligence systems. These methods often falter when confronted with the inherent ambiguity and nuanced complexity of real-world input, struggling to distill coherent, logically sound axioms from messy, incomplete, or contradictory information. The difficulty arises not merely from computational limitations, but from the fundamental mismatch between the continuous, analog nature of perception and the discrete, symbolic structure of formal systems. Attempts to manually encode perceptual knowledge into logical rules are brittle and scale poorly, while automated approaches frequently generate axioms that are either trivially true, computationally intractable, or fail to accurately reflect the underlying perceptual reality – hindering reliable reasoning and robust AI performance.
The development of a resilient autoformalization framework represents a significant hurdle in achieving truly reliable artificial intelligence. Current systems often require painstakingly hand-crafted logical axioms, a process that is both time-consuming and prone to human error. A successful framework must ingest data from varied sources – including text, images, and sensor readings – and autonomously construct a formal, logically consistent representation. This isn’t merely about translation; it requires discerning underlying relationships, handling ambiguity, and establishing a robust system of definitions and inferences. Such a system would move beyond pattern recognition and allow AI to engage in genuine deductive reasoning, verifying its conclusions based on established axioms rather than probabilistic estimations. Ultimately, the ability to automatically build formal representations unlocks the potential for AI systems capable of explaining why they arrive at specific conclusions, fostering trust and accountability in critical applications.

Formalizing Knowledge: Dependency Graphs and Recursive Grounding
The MMFormalizer utilizes a Dependency Graph as a central data structure to explicitly represent the relationships between formalized mathematical concepts and the entities they describe. Nodes within this graph represent propositions, definitions, and entities, while directed edges denote dependencies – for example, a theorem proving a proposition, or a definition introducing a new entity. This graph isn’t merely a visualization; it serves as the foundation for tracking provenance and justification. Each formalized statement is linked back to its supporting definitions and axioms, creating an auditable trail of logical inference. The Dependency Graph facilitates both forward and backward reasoning, enabling the system to trace the origins of a concept or explore its consequences. Furthermore, the graph’s structure allows for efficient identification of redundant proofs or potential inconsistencies within the formalized knowledge base.
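The article does not include the underlying code, but the structure it describes, typed nodes, directed dependency edges, and backward provenance tracing, can be sketched in a few lines. The class and method names below are illustrative, not part of the MMFormalizer implementation.

```python
# Minimal sketch (not the MMFormalizer implementation) of a provenance-tracking
# dependency graph: nodes are entities, definitions, or propositions; directed
# edges point from a statement to the statements it depends on.
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    ENTITY = "entity"
    DEFINITION = "definition"
    PROPOSITION = "proposition"


@dataclass
class Node:
    name: str
    kind: Kind
    statement: str                                        # formal text, e.g. a Lean declaration
    depends_on: list[str] = field(default_factory=list)   # names of supporting nodes


class DependencyGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        # Every dependency must already be grounded, which keeps the trail auditable.
        missing = [d for d in node.depends_on if d not in self.nodes]
        if missing:
            raise ValueError(f"ungrounded dependencies: {missing}")
        self.nodes[node.name] = node

    def provenance(self, name: str) -> set[str]:
        """Trace a statement back to every definition and axiom it rests on."""
        seen: set[str] = set()
        stack = [name]
        while stack:
            current = stack.pop()
            for dep in self.nodes[current].depends_on:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen
```

The explicit dependency check in `add` is what makes the graph usable for both forward reasoning (what follows from this?) and backward justification (what does this rest on?).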
Recursive grounding functions by initiating the construction of a dependency graph from initial perceptual data, which are treated as base propositions. These propositions are then subjected to logical inference, utilizing axioms and previously formalized theorems, to generate more complex propositions and establish relationships between concepts. This process is iterative; each newly inferred proposition is added to the dependency graph and becomes the basis for further inference. Consequently, the graph expands incrementally, moving from raw perceptual input towards more abstract and rigorously defined concepts, with each level of refinement grounded in prior assertions and logical derivations.
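Under the same assumptions, the recursive grounding loop itself might look like the following sketch, reusing the `DependencyGraph` from the previous snippet; `infer_new_propositions` is a placeholder for the LLM-driven inference step, not a documented function.

```python
# Sketch of the recursive grounding loop: seed the graph with perceptual base
# propositions, then repeatedly infer and attach new propositions until a fixed
# point or the round limit is reached. Names are illustrative.
def recursive_grounding(graph, perceptual_facts, infer_new_propositions, max_rounds=5):
    # Seed the graph with base propositions extracted from the image/text input.
    for fact in perceptual_facts:
        graph.add(fact)

    for _ in range(max_rounds):                   # bounded by the termination condition
        frontier = infer_new_propositions(graph)  # propose refinements grounded in the graph
        if not frontier:                          # fixed point: nothing new to derive
            break
        for proposition in frontier:
            graph.add(proposition)                # each inference becomes a new grounding
    return graph
```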
The MMFormalizer utilizes LeanSearch as a critical component for knowledge retrieval during the formalization process. LeanSearch functions by querying established knowledge bases, most notably MathLib and PhysLean, to identify relevant definitions, axioms, and previously proven theorems. These retrieved elements are then integrated into the developing formal proof, allowing the system to leverage existing mathematical and physical knowledge rather than requiring de novo derivations for each step. The search process is optimized for the Lean proof assistant, ensuring compatibility and efficient application of retrieved knowledge within the formalization framework. This dependency on curated libraries significantly accelerates the formalization process and increases the reliability of generated proofs.
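The LeanSearch interface itself is not documented in this article, so the snippet below only illustrates the retrieve-then-integrate pattern; `query_leansearch` is a hypothetical callable standing in for whatever interface the real service exposes.

```python
# Illustrative retrieve-then-integrate step; `query_leansearch` is a placeholder,
# not the real LeanSearch API, and the result fields are assumed for the sketch.
def retrieve_supporting_lemmas(informal_statement: str, query_leansearch, top_k: int = 5):
    # Ask the search service for declarations relevant to the informal statement.
    hits = query_leansearch(informal_statement, limit=top_k)
    # Keep only declarations that can be cited directly inside the Lean proof.
    return [hit["name"] for hit in hits if hit.get("library") in {"Mathlib", "PhysLean"}]
```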
The MMFormalizer incorporates a termination condition to constrain the growth of the dependency graph during recursive grounding. This condition establishes predefined limits on the depth or complexity of the graph’s expansion, halting the refinement process when these thresholds are met. Specifically, the termination condition monitors the number of recursive calls or the length of derived propositions; exceeding a specified maximum value triggers graph stabilization. This mechanism is critical for preventing infinite loops during logical inference and ensuring computational feasibility, particularly when dealing with complex or potentially ill-defined perceptual data. The implementation allows for configurable parameters to balance the completeness of formalization with computational resource constraints.
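A plausible shape for such a configurable check is sketched below; the thresholds are placeholders, not values reported for MMFormalizer.

```python
# Sketch of a configurable termination condition for recursive grounding.
from dataclasses import dataclass


@dataclass
class TerminationConfig:
    max_depth: int = 6             # maximum depth of recursive grounding calls (placeholder)
    max_statement_len: int = 2000  # cap on the length of any derived proposition (placeholder)


def should_stop(depth: int, statement: str, cfg: TerminationConfig) -> bool:
    # Halt refinement once either threshold is exceeded, stabilizing the graph.
    return depth >= cfg.max_depth or len(statement) > cfg.max_statement_len
```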

Ensuring Logical Integrity: Semantic Checking and Axiom Composition
While formalization translates natural language into a machine-readable format, it does not inherently guarantee logical validity. `Semantic Checking` is therefore a crucial subsequent step, employing a Large Language Model (LLM) to rigorously assess the generated formal statements for internal consistency and correctness. This process verifies that axioms and inferences adhere to the rules of the formal system, identifying potential contradictions or illogical derivations that could arise even with syntactically correct formalization. Without semantic checking, a formalized system may appear valid but still produce incorrect or meaningless results, necessitating this verification stage to ensure reliable reasoning and proof construction.
Semantic checking within the framework employs a Large Language Model (LLM) to validate the logical soundness of generated axioms and inferences. The LLM is utilized to determine if a statement logically follows from established axioms and definitions, effectively acting as a consistency checker. This assessment involves evaluating the syntactic and semantic structure of the statement against the underlying formal system. The LLM’s output provides a confidence score indicating the probability that the statement is logically valid, which is then used to filter out potentially incorrect or inconsistent inferences before they are incorporated into formal proofs. This process is critical for ensuring the reliability and correctness of the automated reasoning system.
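One way this LLM-as-judge filter could be wired up is sketched below. The prompt wording and the confidence threshold are illustrative assumptions, and `call_llm` stands in for any chat-completion client rather than a specific API.

```python
# Sketch of LLM-based semantic checking with confidence filtering. The prompt
# and the 0.8 threshold are assumptions for illustration, not reported values.
import json

CHECK_PROMPT = """You are a formal-logic referee.
Axioms and definitions:
{context}

Candidate statement:
{statement}

Does the statement follow consistently from the context? Reply as JSON:
{{"valid": true/false, "confidence": 0.0-1.0, "reason": "..."}}"""


def semantically_valid(statement: str, context: str, call_llm, threshold: float = 0.8) -> bool:
    reply = call_llm(CHECK_PROMPT.format(context=context, statement=statement))
    verdict = json.loads(reply)
    # Only statements the judge deems valid with high confidence survive filtering.
    return bool(verdict["valid"]) and float(verdict["confidence"]) >= threshold
```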
Axiom composition within the framework facilitates the construction of formal proofs through the systematic combination of established, grounded concepts and axioms. This process leverages two primary libraries: MathLib, providing a broad base of mathematical definitions and theorems, and PhysLean, which specializes in physics-related axioms and concepts. By drawing upon these resources, the system can build increasingly complex proofs from fundamental principles, ensuring a logically sound and verifiable derivation of new statements. The composition isn’t simply concatenation; it involves a search for relevant axioms and concepts applicable to the current proof state, effectively chaining together known truths to reach a desired conclusion.
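Stripped of the Lean machinery, the chaining idea reduces to a forward search over applicable facts: apply any axiom whose hypotheses are already established until the goal is reached. The sketch below is purely illustrative and is not MathLib or PhysLean code.

```python
# Forward-chaining sketch of axiom composition: each axiom is a pair
# (hypotheses, conclusion); returns the chain of conclusions used, or None.
def compose_axioms(premises: set[str], axioms: list[tuple[set[str], str]], goal: str):
    known, chain = set(premises), []
    changed = True
    while changed and goal not in known:
        changed = False
        for hypotheses, conclusion in axioms:
            if hypotheses <= known and conclusion not in known:
                known.add(conclusion)
                chain.append(conclusion)   # record the derivation order
                changed = True
    return chain if goal in known else None
```

In the real system the "axioms" are retrieved Lean declarations and the applicability test is proof search rather than set inclusion, but the chaining structure is the same.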
Evaluations on the PhyX-AF benchmark demonstrate the framework’s superior performance, achieving state-of-the-art results in both compile accuracy and semantic correctness when compared to a range of open-source and closed-source large language models. Compile accuracy, measured by the percentage of successfully parsed formal statements, indicates the framework’s ability to generate syntactically valid expressions. Semantic correctness, assessed through rigorous verification of logical consistency, confirms the framework’s capacity to produce meaningful and valid inferences. These metrics collectively establish the framework as a leading solution for automated theorem proving and formal reasoning tasks within the PhyX-AF domain.
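For concreteness, the two metrics can be computed roughly as follows; the paper's exact protocol (for instance, whether semantic correctness is measured over all statements or only over those that compile) may differ, so treat this as an assumption-laden sketch.

```python
# Sketch of the two headline metrics. "compiles" means the Lean checker accepts
# the statement; "semantically_correct" is the verdict of the semantic check.
def compile_accuracy(results: list[dict]) -> float:
    return sum(r["compiles"] for r in results) / len(results)


def semantic_correctness(results: list[dict]) -> float:
    # Assumed convention: only statements that both compile and pass the
    # semantic check count as correct, measured over the full set.
    return sum(r["compiles"] and r["semantically_correct"] for r in results) / len(results)
```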

Expanding the Boundaries: Formalizing Physics and Beyond
This innovative framework achieves a significant milestone by translating core principles of physics – such as F = ma from Newton’s Laws and the intricacies of the Hamiltonian – into a strictly logical, machine-readable format. This formalization isn’t merely representational; it unlocks the potential for automated theorem proving within these established physical systems. By expressing these laws as formal axioms and definitions, the system can rigorously verify existing proofs, explore novel derivations, and potentially identify inconsistencies or gaps in current physical models. This capability extends beyond simple verification, enabling the automated generation of new theorems and insights directly from the formalized foundations of classical and quantum mechanics, promising to accelerate research and deepen understanding.
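As a flavor of what such a machine-readable statement might look like, here is a minimal Lean 4 sketch of Newton's second law for a one-dimensional point particle. It is an illustration under simplifying assumptions, not the PhysLean formalization used by the framework, and the `Particle` structure is hypothetical.

```lean
-- Minimal Lean 4 sketch (not the PhysLean formalization) of Newton's second law
-- for a one-dimensional point particle, stated over real-valued trajectories.
import Mathlib

/-- A hypothetical particle: a mass and a position trajectory over time. -/
structure Particle where
  mass : ℝ
  pos  : ℝ → ℝ

/-- Newton's second law as a proposition: the applied force equals mass times
the second time-derivative of position at every instant `t`. -/
def NewtonSecondLaw (p : Particle) (force : ℝ → ℝ) : Prop :=
  ∀ t : ℝ, force t = p.mass * deriv (deriv p.pos) t
```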
The development of the PhyX-AF benchmark represents a significant step towards objective evaluation of autoformalization techniques in physics and related fields. This standardized platform offers a diverse collection of problems, spanning classical mechanics, electromagnetism, and quantum mechanics, allowing researchers to rigorously compare the performance of different automated reasoning systems. By providing a common ground for assessment, PhyX-AF facilitates progress in translating physical theories into logically verifiable formal proofs. The benchmark isn’t merely a collection of problems; it includes a robust evaluation framework and a process for ensuring the quality of generated solutions, thereby fostering reproducibility and accelerating the development of reliable autoformalization tools capable of handling increasingly complex scientific challenges.
Rigorous verification is central to the PhyX-AF framework; automatically generated formalizations aren’t simply accepted, but undergo meticulous scrutiny by a team of expert annotators. These specialists, possessing deep knowledge in both physics and formal logic, assess each solution for logical validity and mathematical soundness. This validation process extends beyond mere syntactic correctness; annotators confirm that the formalized statements accurately reflect the underlying physical principles and that the proofs are not only formally correct but also meaningfully represent the intended reasoning. This human-in-the-loop approach ensures a high degree of confidence in the automated results, establishing PhyX-AF as a reliable tool for formalizing complex scientific knowledge and reducing the potential for subtle errors in automated reasoning.
Automated formalization represents a significant leap towards enhancing the rigor and efficiency of scientific inquiry. Traditionally, translating physical principles and mathematical derivations into logically verifiable formal systems has been a laborious and error-prone process, demanding extensive expert time. By automating this translation, researchers can accelerate the pace of discovery, allowing computational systems to rigorously verify hypotheses and explore complex models with a level of precision unattainable through manual methods. This capability is particularly crucial in fields where even subtle errors in reasoning can have significant consequences, such as advanced physics and engineering. Moreover, automated formalization offers a pathway to building more robust and reliable artificial intelligence systems capable of not only performing calculations but also proving their correctness, ultimately minimizing the risk of propagating errors and fostering greater confidence in scientific results.

The pursuit of MMFormalizer exemplifies a dedication to stripping away ambiguity, translating complex realities into the austere language of formal logic. This aligns perfectly with the sentiment expressed by Donald Knuth: “Premature optimization is the root of all evil.” The framework doesn’t immediately seek the most complex solution, but rather focuses on establishing a clear, verifiable foundation – a formalization – before any further refinement. By prioritizing clarity in representing both visual and textual information, and employing recursive grounding to ensure logical consistency, MMFormalizer embodies the principle that a system’s true strength lies not in its feature set, but in its fundamental simplicity and correctness. The PhyX-AF benchmark, as a method of evaluation, further reinforces this focus on essential qualities.
The Path Ahead
The endeavor to ground language in formal systems, as exemplified by MMFormalizer, reveals less a destination than a deepening awareness of the chasm between intuition and proof. The framework’s capabilities, while notable, merely highlight the brittleness inherent in translating the messy abundance of sensory input into the austere language of logic. The PhyX-AF benchmark, a necessary construct, simultaneously underscores the limitations of current evaluation metrics; a passed test confirms only adherence to a particular formalization, not necessarily a capture of underlying physical or mathematical truth.
Future work must resist the temptation to accumulate complexity. Instead, attention should focus on discerning the minimal set of assumptions required for effective formalization. Recursive grounding, while promising, risks infinite regress if not tethered to a core of self-evident principles. A fruitful avenue lies in exploring alternative formalisms – those perhaps less concerned with exhaustive representation and more with facilitating useful inference, even if approximate.
The ultimate challenge is not to build systems that mimic reasoning, but to illuminate the very nature of logical consequence. MMFormalizer, and its successors, should be judged not by what they can prove, but by the clarity with which they reveal what remains unprovable.
Original article: https://arxiv.org/pdf/2601.03017.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/