Can AI Solve Graduate-Level Math?

Author: Denis Avetisyan


A new benchmark assesses the ability of cutting-edge artificial intelligence models to tackle complex mathematical reasoning problems found in a theoretical computer science textbook.

This review presents LLMathBench, a rigorous evaluation of frontier language models on PhD-level mathematical reasoning, specifically focusing on formal proofs within randomized algorithms.

Despite recent advances showcasing the potential of large language models in mathematical discovery, a rigorous assessment of their foundational reasoning capabilities remains crucial. This paper, ‘Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms’, presents a comprehensive benchmark of GPT-5, Gemini, Claude, and Grok against a graduate-level curriculum, revealing substantial variance in their ability to generate formal proofs. Our findings indicate that while top-performing models demonstrate proficiency in probabilistic reasoning, consistency and reliability remain significant challenges. How can we further refine these models to achieve robust and trustworthy automated mathematical derivation?


The Ephemeral Nature of Mathematical Certainty

Despite their remarkable proficiency in processing and generating human language, Large Language Models (LLMs) consistently encounter difficulties when applied to the domain of rigorous mathematical reasoning. These models, trained on vast datasets of text and code, often demonstrate success in mimicking the form of mathematical arguments, but struggle with the underlying logical validity. The challenge stems from the inherent differences between natural language, which allows for ambiguity and nuance, and formal mathematical systems, which demand absolute precision and adherence to axiomatic rules. While LLMs can readily identify patterns and relationships within textual data, they frequently falter when required to construct or verify proofs – tasks that necessitate a deep understanding of logical inference and the ability to manipulate symbolic representations, such as the equation $E = mc^2$, with unwavering accuracy. This limitation suggests that simply scaling up model size or training data is unlikely to fully resolve the issue, indicating a need for novel architectures or training methodologies specifically designed to enhance mathematical reasoning capabilities.

Conventional methods in automated reasoning frequently falter when confronted with the stringent demands of formal proofs. Unlike tasks involving pattern recognition or statistical inference, constructing a logically sound mathematical argument requires absolute precision and an ability to navigate complex chains of inference. Existing systems, often reliant on heuristics or incomplete logical frameworks, can generate proofs that appear correct but contain subtle errors or rely on unstated assumptions. This leads to unreliable results, particularly when dealing with non-trivial theorems or complex mathematical structures. The challenge isn’t simply about arriving at the correct answer, but about demonstrating a valid and verifiable path – a demand that exposes the limitations of approaches that prioritize superficial correctness over rigorous logical depth. Consequently, assessments of reasoning ability must move beyond simple solution verification and focus on the validity of the process used to reach that solution, ensuring each step adheres to the established rules of formal logic, such as those of predicate logic – expressed symbolically as $\forall x\, \bigl(P(x) \implies Q(x)\bigr)$ – to guarantee trustworthiness.
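
To make concrete what “validity of the process” means, the following sketch – illustrative only, and not drawn from the benchmark itself – states a trivial predicate-logic fact in Lean 4. The proof is accepted only because the kernel checks every inference step, which is precisely the standard an informal, LLM-generated proof is being held to.

```lean
-- Illustrative only (not from the benchmark): a machine-checked proof in
-- Lean 4 that universally quantified implications chain together. The
-- kernel accepts the theorem only because each inference step is valid.
theorem chain_of_implications {α : Type} (P Q R : α → Prop)
    (hPQ : ∀ x, P x → Q x) (hQR : ∀ x, Q x → R x) :
    ∀ x, P x → R x := by
  intro x hp
  exact hQR x (hPQ x hp)
```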

Assessing the mathematical prowess of Large Language Models demands more than simply checking answers; a rigorous and standardized evaluation framework is crucial to differentiate genuine reasoning from skillful pattern recognition. Current benchmarks often fall short, susceptible to exploitation by models that memorize solutions or identify superficial cues without grasping underlying principles. A truly robust framework necessitates diverse problem sets, varying in complexity and requiring multi-step reasoning, alongside methods to analyze the process of solution-finding, not just the final result. This includes evaluating the model’s ability to justify each step, identify potential errors, and generalize to unseen problems – essentially, demonstrating an understanding of the axioms and rules governing the mathematical domain, rather than merely replicating learned patterns. Such a framework will be essential for charting progress and building LLMs capable of reliable formal reasoning, moving beyond statistical mimicry towards true mathematical intelligence.

A Framework for Deciphering Logical Structures

The Benchmark Evaluation Framework is a six-stage process developed to systematically evaluate the capacity of Large Language Models (LLMs) to both generate and verify mathematical proofs. This framework moves beyond simple problem-solving by requiring LLMs to demonstrate a complete proof construction process, including formalization and verification steps. Each stage is designed to isolate and assess specific capabilities, allowing for granular analysis of LLM performance in mathematical reasoning. The framework is intended to provide a standardized and rigorous methodology for comparing different LLMs and tracking improvements in their mathematical proof capabilities. The output of each stage is a clearly defined metric, facilitating quantitative comparisons and identifying areas for model refinement.
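
As a rough sketch of how such a staged pipeline could be organized, the code below uses hypothetical stage names and callables; it is not the authors’ implementation. Only formalization, proof generation, automated verification, verdict recording, human spot-checking, and per-stage metrics are named in the surrounding text, and the ordering here is an assumption.

```python
from enum import Enum, auto

# A schematic sketch (not the authors' code) of a staged evaluation
# pipeline of the kind described above. Stage names are assumptions
# drawn from the surrounding description, used only for illustration.
class Stage(Enum):
    FORMALIZE_PROBLEM = auto()   # textbook exercise -> formal LaTeX statement
    GENERATE_PROOF = auto()      # LLM produces a candidate proof
    AUTO_VERIFY = auto()         # automated check of the candidate proof
    RECORD_VERDICT = auto()      # store the (problem, proof, verdict) triplet
    HUMAN_SPOT_CHECK = auto()    # manual review of a sample of verdicts
    AGGREGATE_METRICS = auto()   # per-stage metrics for model comparison

def run_pipeline(problem, formalize, prove, verify):
    """Minimal end-to-end sketch; the callables stand in for the
    framework's actual components, and each stage yields a recordable
    artifact so results can be compared quantitatively."""
    statement = formalize(problem)          # FORMALIZE_PROBLEM
    proof = prove(statement)                # GENERATE_PROOF
    verdict = verify(statement, proof)      # AUTO_VERIFY
    return {"problem": statement, "proof": proof, "verdict": verdict}  # RECORD_VERDICT
```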

The evaluation framework employs problems sourced directly from the “Randomized Algorithms” textbook by Motwani and Raghavan. This selection provides a standardized and consistently defined problem set, mitigating variability introduced by differing problem sources or ambiguous problem statements. Problems are chosen to cover a range of algorithmic techniques, including probability, hashing, and graph algorithms, and span varying levels of complexity. Utilizing a single, established textbook ensures reproducibility of results and allows for direct comparison of LLM performance across different configurations and prompts. The textbook’s rigorous mathematical presentation also supports the accurate assessment of proof generation and verification capabilities, as solutions are grounded in established mathematical principles and formal logic.

Problem Formalization is the initial stage of the evaluation process, and it involves converting mathematical exercises sourced from the Randomized Algorithms Textbook into a standardized format for Large Language Model (LLM) input. This conversion centers on representing each problem, including its premises, desired conclusion, and any necessary definitions, using $\LaTeX$ code. The resulting “Formal $\LaTeX$” format ensures consistent problem presentation, facilitating automated processing by the LLM and eliminating ambiguity arising from natural language variations. This standardized format allows for direct comparison of LLM outputs and enables automated verification of generated proofs against established solutions.
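
As an illustration of what such a formalized statement might look like, the fragment below formalizes a standard Chernoff-bound exercise of the kind found in Motwani and Raghavan; it is an example of the format, not an item reproduced from the benchmark.

```latex
% Illustrative "Formal LaTeX" statement (format example, not a benchmark item).
\paragraph{Premises.}
Let $X_1, \dots, X_n$ be independent indicator variables with
$\Pr[X_i = 1] = p_i$, and set $X = \sum_{i=1}^{n} X_i$ and
$\mu = \mathbb{E}[X] = \sum_{i=1}^{n} p_i$.

\paragraph{Claim to prove.}
For every $\delta > 0$,
\[
  \Pr\!\bigl[ X \ge (1+\delta)\mu \bigr]
  \;\le\;
  \left( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^{\!\mu}.
\]

\paragraph{Required output.}
A complete derivation via the moment-generating function
$\mathbb{E}\bigl[e^{tX}\bigr]$ and Markov's inequality, with every
inequality explicitly justified.
```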

Observing the Capacity for Formal Deduction

The benchmark evaluated the performance of four state-of-the-art Large Language Models (LLMs): Gemini-3-Pro, GPT-5-Thinking, Grok-4, and Claude-Sonnet-4.5-Thinking. These models were selected to represent current leading capabilities in natural language processing and generation. The evaluation focused on assessing their ability to perform complex reasoning and problem-solving tasks, with the aim of quantifying their strengths and weaknesses in a standardized manner. Model versions were current as of the date of testing, and all models were accessed via their respective APIs under standard usage conditions. Performance metrics were recorded to facilitate comparative analysis and identify areas for potential optimization.

The Double Timing Strategy addresses the issue of premature termination during mathematical proof generation by Large Language Models (LLMs). This optimization dynamically adjusts timeout limits applied during the proof construction process. Initially, a conservative timeout is used to quickly identify proofs that can be generated within a short timeframe. If a proof generation process approaches the initial timeout, the limit is automatically extended, allowing the LLM additional time to complete complex derivations. This adaptive approach prevents the LLM from halting before reaching a valid solution, particularly for problems requiring extensive reasoning steps, without significantly impacting the evaluation time for simpler, rapidly solvable problems.
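
A minimal sketch of one way to implement the described policy is given below; the specific time limits and the generate_proof callable are placeholder assumptions rather than the paper’s actual parameters, and the real framework may extend a running attempt rather than retry it.

```python
import time

# Sketch of an adaptive ("double timing") timeout policy. The limits and
# the generate_proof callable are placeholder assumptions, not the
# paper's actual parameters or API.
SOFT_TIMEOUT_S = 120.0    # conservative first budget (assumed value)
HARD_TIMEOUT_S = 480.0    # extended budget for long derivations (assumed value)

def generate_with_adaptive_timeout(problem, generate_proof):
    """Fast attempt first; grant the extended budget only if the
    conservative one is exhausted, so easy problems stay cheap."""
    start = time.monotonic()
    proof = generate_proof(problem, timeout=SOFT_TIMEOUT_S)
    if proof is None:
        # The conservative limit was hit: allow the remaining budget so a
        # long but valid derivation is not cut off prematurely.
        proof = generate_proof(problem, timeout=HARD_TIMEOUT_S - SOFT_TIMEOUT_S)
    return proof, time.monotonic() - start
```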

Mathematical Proof Generation, as assessed in this evaluation, requires the Large Language Models (LLMs) to produce formally valid proofs for given mathematical problems. The testing process involves presenting the LLMs with a diverse set of problems drawn from the textbook, ranging in complexity from foundational probability to more advanced material on hashing, graph algorithms, and other randomized techniques. Successful proof generation is determined by verifying the logical consistency and correctness of each step within the generated proof, ensuring adherence to established mathematical axioms and inference rules. The evaluation metrics focus on both the completeness of the proof – whether all necessary steps are included – and its validity – whether each step is logically sound and contributes to a correct solution. The LLMs’ outputs are assessed against known correct proofs, with automated verification tools used where possible to confirm the accuracy of the generated mathematical statements and logical derivations, including the intermediate algebraic manipulations and probability bounds on which the proofs rely.

Establishing Confidence in Logical Validity

The system leverages automated proof verification as a crucial step in assessing the validity of mathematically generated proofs. Initially, candidate proofs are formalized into $\LaTeX$ code, enabling computational analysis. This formalized representation isn’t merely for display; it serves as the input for a dedicated verification engine. This engine systematically checks each step of the proof against established mathematical axioms and inference rules, identifying potential logical fallacies or inconsistencies. The automated process significantly reduces the burden of manual verification, allowing for rapid assessment of a large number of proofs, and provides a quantifiable measure of confidence in the generated solutions. While not infallible, this automated component establishes a reliability baseline, flagging proofs that require further human scrutiny and ensuring a higher standard of mathematical rigor.
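
A minimal sketch of such step-wise checking follows; here check_step stands in for whatever checker the engine actually applies (an automated judge, a computer algebra system, or a proof assistant) and is not a real library call.

```python
# Step-wise proof checking, sketched under the assumption that a proof is
# split into discrete steps and each step is judged in the context of the
# statement and the steps accepted so far. `check_step` is a placeholder.
def verify_proof(statement, proof_steps, check_step):
    """Return a verdict; 'valid' only if every step passes the checker."""
    context = [statement]
    for i, step in enumerate(proof_steps, start=1):
        ok, reason = check_step(context, step)
        if not ok:
            return {"verdict": "invalid", "failed_step": i, "reason": reason}
        context.append(step)
    return {"verdict": "valid", "failed_step": None, "reason": None}
```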

A rigorous evaluation strategy incorporated both automated assessment and human oversight to guarantee the quality of generated proofs. Following automated verification, a representative sample of 20% of the (Problem, Proof, Verdict) triplets underwent manual review by human experts. This human verification step served as a crucial final check, identifying potential errors or logical inconsistencies that the automated system might have missed, and confirming the overall reliability of the framework’s output. The combination of these two validation methods ensured a comprehensive and robust assessment of the generated mathematical proofs, acknowledging that complete reliance on either automated or manual approaches alone could be insufficient for establishing trustworthiness in complex reasoning tasks.
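
The spot-check and the resulting divergence figure can be reproduced with a few lines of code, sketched below; the record fields are illustrative assumptions, not the paper’s data schema.

```python
import random

# Draw a 20% sample of (problem, proof, verdict) records for human review
# and measure disagreement with the automated verdicts. Field names are
# illustrative assumptions, not the paper's schema.
def sample_for_human_review(records, fraction=0.2, seed=0):
    rng = random.Random(seed)
    k = max(1, round(fraction * len(records)))
    return rng.sample(records, k)

def divergence(sampled_records, human_verdicts):
    """Fraction of sampled items where the automated verdict disagrees
    with the human verdict (the quantity bounded by 20% in the text)."""
    disagreements = sum(
        1 for rec, human in zip(sampled_records, human_verdicts)
        if rec["auto_verdict"] != human
    )
    return disagreements / len(sampled_records)
```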

Analysis reveals the automated verification system successfully establishes a reliability threshold, exhibiting divergence of no more than 20% from human-validated proofs. This finding underscores a crucial point: while large language models demonstrate promise in generating mathematical proofs, complete reliance on automation is insufficient. The observed divergence, though limited, necessitates the integration of human oversight to ensure the absolute correctness and logical soundness of generated proofs. Consequently, a combined approach – leveraging the efficiency of automated verification and the precision of human review – represents the most robust strategy for assessing the dependability of LLM-generated mathematical reasoning and maintaining confidence in their outputs, particularly in fields demanding rigorous proof, such as formal mathematics and computer science.

Toward the Horizon of AI-Assisted Discovery

A newly developed benchmark and evaluation framework is poised to accelerate progress in artificial intelligence for mathematical reasoning. This system moves beyond simple problem-solving to assess an AI’s capacity for genuine mathematical insight – not just arriving at the correct answer, but demonstrating the logical steps and justifications required for a rigorous proof. The framework utilizes a diverse set of mathematical problems spanning probability, combinatorics, hashing, and graph algorithms, in line with the source textbook, each designed to challenge different facets of mathematical intelligence. Crucially, the evaluation isn’t solely focused on success rates; it also measures the quality of the reasoning process, penalizing solutions that rely on flawed logic or incomplete justifications. By providing a standardized and objective means of assessment, this framework enables researchers to systematically compare different AI approaches, identify areas for improvement, and ultimately build more robust and reliable systems capable of tackling complex mathematical challenges and potentially even discovering new mathematical truths.

The pursuit of automated mathematical discovery benefits from a synthesis of seemingly disparate approaches: the Probabilistic Method and Formal Logic. Traditionally, the Probabilistic Method establishes the existence of objects with certain properties by demonstrating that a random construction yields them with non-zero probability, offering intuitive, though non-constructive, evidence. Formal Logic, conversely, demands absolute certainty through rigorous proof, but can struggle with the initial exploration of complex mathematical landscapes. Integrating these concepts allows systems to leverage probabilistic arguments for hypothesis generation – identifying promising avenues of investigation – then employ formal methods to verify those hypotheses with unassailable certainty. This interplay facilitates a more efficient and powerful approach to mathematical reasoning, moving beyond purely deductive or inductive strategies and enabling AI not only to prove theorems but also to intelligently discover new ones, in the same spirit as classical applications of the probabilistic method, such as Erdős’s lower bounds on Ramsey numbers.
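
A textbook instance of this interplay – standard material rather than a result of the benchmarked paper – is Erdős’s probabilistic lower bound for Ramsey numbers, where a probabilistic existence argument can afterwards be verified line by line by a formal system:

```latex
% Erdős's probabilistic lower bound for diagonal Ramsey numbers
% (classical textbook material, included here only as an illustration).
Colour each edge of $K_n$ red or blue independently with probability $1/2$.
For a fixed set $S$ of $k$ vertices,
\[
  \Pr\bigl[\text{$S$ induces a monochromatic $K_k$}\bigr]
  = 2 \cdot 2^{-\binom{k}{2}} = 2^{\,1-\binom{k}{2}},
\]
so the union bound over all $\binom{n}{k}$ such sets gives
\[
  \Pr\bigl[\exists\ \text{monochromatic}\ K_k\bigr]
  \le \binom{n}{k}\, 2^{\,1-\binom{k}{2}}.
\]
If $\binom{n}{k}\, 2^{\,1-\binom{k}{2}} < 1$, some colouring contains no
monochromatic $K_k$, hence $R(k,k) > n$: a non-constructive existence
proof whose every step a formal verifier can subsequently confirm.
```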

Ongoing research prioritizes advancements in automated verification, seeking methods to move beyond simple proof checking towards genuine proof understanding and gap-filling – essentially, enabling AI to not just confirm a solution, but to independently identify missing logical steps. Simultaneously, exploration extends to novel computational architectures tailored for mathematical discovery, moving past general-purpose AI models. These designs incorporate inductive biases that reflect the structural properties of mathematical objects and relationships, potentially leveraging techniques like graph neural networks to represent and manipulate complex mathematical expressions. The ultimate aim is to create systems capable of autonomously formulating conjectures, designing proof strategies, and ultimately, discovering new mathematical truths, pushing the boundaries of both artificial intelligence and mathematical knowledge. This includes investigating how to represent mathematical concepts in a way that facilitates both symbolic manipulation and probabilistic reasoning, bridging the gap between rigorous proof and intuitive mathematical insight.

The evaluation of frontier LLMs against rigorous mathematical texts exposes an inherent fragility within these systems. Much like geological formations yielding to the ceaseless pressure of time, the LLMs demonstrate varying capacities for maintaining logical consistency throughout extended proofs. This benchmark, assessing formal reasoning on randomized algorithms, reveals that even the most advanced models struggle with the nuances of mathematical rigor. As John McCarthy observed, “It is better to be thought a fool than to do a foolish thing.” The pursuit of automated verification, therefore, isn’t merely about achieving correct outputs, but building systems that gracefully degrade – recognizing limitations and avoiding confidently presented fallacies. The work underscores that true intelligence isn’t about speed, but about the capacity for sustained, accurate reasoning – a quality still elusive in current architectures.

The Horizon of Proof

The evaluation detailed within reveals not a failure of intelligence, but the predictable entropy of any system attempting complete formalization. Each incorrect derivation, each flawed proof, is a signal from time – a reminder that even the most sophisticated models operate within the constraints of finite resources and imperfect training. The benchmark, while assessing current capabilities, inadvertently maps the boundaries of those limitations. The question is not whether these large language models can become perfect reasoners, but how gracefully they age as the complexity of mathematical inquiry increases.

Future work will undoubtedly focus on scaling parameters and refining training data. However, true progress lies in refactoring the underlying approach – treating the creation of formal proofs not as a task of pattern completion, but as a dialogue with the past. The model must learn to acknowledge not just what is true, but why it is true, and, crucially, where the lineage of that truth originates. Automated verification, therefore, becomes less about judging the final output and more about tracing the evolution of its internal logic.

The pursuit of mathematical reasoning in artificial systems is, at its core, a study in decay. Every system will eventually succumb to the weight of its own complexity. The measure of success will not be in avoiding this fate, but in designing systems that, as they age, reveal the beauty of their own inevitable decline.


Original article: https://arxiv.org/pdf/2512.13978.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
