Beyond Calculation: What Makes Math Problems Intriguing?

Author: Denis Avetisyan


New research explores how well artificial intelligence can discern the qualities that make a mathematical problem genuinely interesting, comparing its judgment to that of human mathematicians.

Human engagement with mathematics extends beyond problem-solving to encompass the discernment of worthwhile pursuits, a cognitive process largely absent in current large language models tasked solely with direct solutions; this work elucidates the divergence by comparing human and model evaluations of problem significance, both in final assessments and the underlying factors informing those judgments.

This study investigates the alignment between human and large language model assessments of mathematical problem interestingness, with implications for automated discovery and educational applications.

The pursuit of mathematical progress—and even engagement with mathematical problems—is fundamentally guided by subjective judgments of “interestingness,” a quality difficult to quantify. This paper, ‘A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models’, investigates the alignment between human and large language model (LLM) assessments of this elusive quality across a range of mathematical expertise. Our findings reveal that while LLMs demonstrate a broad agreement with human perceptions of interestingness, they often fail to replicate the nuanced distributions and underlying rationales driving those judgments. This disconnect raises critical questions about the potential—and limitations—of LLMs as collaborative partners in mathematical discovery and education.


The Subjective Calculus of Mathematical Worth

Evaluating mathematical problems extends beyond solvability; ‘interestingness’ significantly influences human engagement. While algorithms assess difficulty, capturing qualities like elegance, novelty, or surprising connections remains a challenge for artificial intelligence. Recent analysis of responses to International Mathematical Olympiad (IMO) problems reveals that human perceptions of mathematical interest are multi-faceted, shaped by a complex interplay of factors. Understanding these preferences is key to building AI capable of not simply solving problems, but appreciating mathematics itself – tracing the echoes of past insights with each solution.

Analysis of human responses in an IMO study reveals correlations in the selection of multiple reasons for both identifying problems as interesting and uninteresting.
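To make the co-selection analysis concrete, here is a minimal sketch of how such correlations can be computed from survey data; the reason labels and response matrix below are hypothetical placeholders, not the study’s actual taxonomy or responses.

```python
# Minimal sketch of a co-selection analysis, assuming each survey response is a
# row of binary flags (1 = reason selected). The column names are hypothetical,
# not the paper's reason taxonomy.
import pandas as pd

# Hypothetical response matrix: rows are participants, columns are reasons
# a problem was marked interesting.
responses = pd.DataFrame(
    {
        "elegant_solution": [1, 0, 1, 1, 0],
        "surprising_result": [1, 0, 1, 0, 0],
        "novel_technique": [0, 1, 1, 1, 1],
        "real_world_link": [0, 1, 0, 0, 1],
    }
)

# Pairwise Pearson correlation of binary selections (equivalent to the phi
# coefficient) shows which reasons tend to be chosen together.
co_selection = responses.corr(method="pearson")
print(co_selection.round(2))
```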

Benchmarking AI Against the Human Intuition

Evaluating an AI’s ability to assess problem worth requires comparison with judgments from expert human mathematicians. Datasets of problems drawn from competitive examinations, including the AMC and IMO, served as the basis for this comparative analysis. Large language models demonstrate an ability to approximate human perceptions of mathematical problem interestingness, with squared Pearson correlation ($R^2$) values ranging from 0.48 to 0.78. A more rigorous comparison using the Wasserstein Distance (WD) metric shows that the best-performing model, Mistral 7B, achieves a WD of 12.4 (95% CI: [0.3, 16.0]), while a human split-half baseline yields a WD of 9.5 (95% CI: [7.8, 11.5]); the overlapping intervals indicate that the model’s distributional judgments are broadly, though not perfectly, consistent with those of human experts.
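For readers who want to run this style of comparison on their own ratings, the following is a minimal sketch of the two agreement metrics, squared Pearson correlation and Wasserstein distance, using placeholder data rather than the study’s ratings; the 1–7 rating scale is an assumption.

```python
# Minimal sketch of the agreement metrics described above: squared Pearson
# correlation between human and model interestingness ratings, and the
# Wasserstein distance between the two rating distributions. The data here are
# random placeholders, not the study's ratings; a 1-7 scale is assumed.
import numpy as np
from scipy.stats import pearsonr, wasserstein_distance

rng = np.random.default_rng(0)
human = rng.uniform(1, 7, size=100)                    # mean human rating per problem
model = np.clip(human + rng.normal(0, 1, 100), 1, 7)   # hypothetical model rating

r, _ = pearsonr(human, model)
print(f"R^2 = {r**2:.2f}")                     # per-problem agreement

wd = wasserstein_distance(human, model)
print(f"Wasserstein distance = {wd:.2f}")      # distribution-level agreement

# The paper's human baseline repeats the WD computation between ratings from
# two random halves of the human raters for the same problems.
```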

Comparison of human and large language model (LLM) judgments on the Prolific dataset demonstrates a quantifiable agreement in assessing both problem interestingness and difficulty, as measured by scaled squared Pearson correlations, with darker values indicating stronger alignment.

Efficient Reasoning: A Dynamic Allocation of Cognitive Resources

Large Reasoning Models (LRMs) offer a novel approach to automated problem-solving, utilizing reinforcement learning to optimize both solution accuracy and computational efficiency. Unlike conventional large language models, LRMs strategically allocate ‘Reasoning Token Count’—a measure of computational effort—focusing on the most salient aspects of a given problem. LRMs assess not only whether a problem has a solution, but also its inherent quality, reflected in the model’s allocation of Reasoning Token Count. This adaptive allocation prioritizes problems with significant learning potential, leading to a more efficient use of resources.
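A minimal sketch of how this relationship can be measured: split problems at the median interestingness score and compare reasoning token counts between the two groups. The records and field names below are illustrative assumptions, not the study’s data.

```python
# Minimal sketch of the "judgment speed" comparison: split problems at the
# median model-assigned interestingness score and compare reasoning token
# counts between the two groups. Field names and values are assumptions.
from statistics import median

# Hypothetical per-problem records: model interestingness score and the
# number of reasoning tokens the LRM spent before answering.
records = [
    {"interest": 2.1, "reasoning_tokens": 310},
    {"interest": 6.4, "reasoning_tokens": 980},
    {"interest": 3.8, "reasoning_tokens": 450},
    {"interest": 5.9, "reasoning_tokens": 1210},
    {"interest": 4.7, "reasoning_tokens": 760},
    {"interest": 1.5, "reasoning_tokens": 280},
]

cut = median(r["interest"] for r in records)
low = [r["reasoning_tokens"] for r in records if r["interest"] <= cut]
high = [r["reasoning_tokens"] for r in records if r["interest"] > cut]

print(f"median split at interest = {cut:.1f}")
print(f"mean tokens, low interest:  {sum(low) / len(low):.0f}")
print(f"mean tokens, high interest: {sum(high) / len(high):.0f}")
```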

Across a range of large reasoning models (LRMs), judgment speed, as determined by reasoning chain length, varies depending on whether a problem is perceived as low or high interest.

Toward Automated Discovery: Guiding the Search for Enduring Truths

Large reasoning models (LRMs) demonstrate a capacity to assess the ‘Problem Interestingness’ of mathematical statements, offering a novel way to guide research directions. By accurately gauging this quality, LRMs can prioritize the search for unexplored problems, potentially bypassing unproductive avenues of investigation. Exploration of ‘Problem Variants’ reveals which elements consistently elevate or diminish perceived value, refining the model’s ability to discriminate between promising and uninteresting challenges. This work lays the groundwork for ‘Automated Mathematical Discovery’, where AI autonomously proposes and investigates new theorems, driven by its assessment of problem interestingness—a process where, even in the pursuit of abstract truth, the echoes of obsolescence guide the search for what endures.
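As a rough illustration of how such a signal could steer exploration, the sketch below ranks candidate problem variants by a numeric interestingness score; `score_interestingness` is a hypothetical stand-in for whatever LRM call produces the judgment, not an interface described in the paper.

```python
# Minimal sketch of using an interestingness score to prioritize problem
# variants for exploration. `score_interestingness` is a hypothetical callable,
# not an API from the paper.
from typing import Callable, List, Tuple

def prioritize_variants(
    variants: List[str],
    score_interestingness: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each candidate problem variant and return the top_k to explore."""
    scored = [(v, score_interestingness(v)) for v in variants]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy usage with a placeholder scorer that simply favors longer statements.
if __name__ == "__main__":
    candidates = [
        "Generalize the original inequality to n variables.",
        "Restrict the original problem to the integers.",
        "Replace the equality constraint with a divisibility condition.",
    ]
    for statement, score in prioritize_variants(candidates, lambda v: float(len(v)), top_k=2):
        print(f"{score:5.1f}  {statement}")
```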

Distribution of judgment speeds across LRMs reveals that models tend to employ longer reasoning chains when evaluating problems ultimately categorized as high interest, differentiating them from those deemed low interest based on median interestingness scores.

The study of interestingness, as explored within this paper, reveals a fascinating echo of system entropy. Just as all structures inevitably degrade, so too does the perceived novelty of mathematical problems. Claude Shannon observed, “The most important thing in communication is to convey information, and the most important thing in information is to convey meaning.” This resonates deeply with the findings; LLMs can assess interestingness – a form of meaning extraction – but their distributions diverge from human judgment, suggesting a difference in how ‘information’ is weighted over time. The very act of evaluating problem interestingness is a versioning process – a snapshot of perceived value against the backdrop of a constantly expanding mathematical landscape. It is a subtle reminder that even in the realm of abstract thought, the arrow of time points inexorably towards refinement, and ultimately, refactoring of our understanding.

What Lies Ahead?

The correlation between human and large language model judgments of mathematical problem “interestingness” – a surprisingly pliable metric – reveals less about true understanding and more about shared structural preferences. Every architecture lives a life, and this one demonstrates a capacity to mimic the surface of curiosity. However, the distributional differences noted within the study hint at a divergence in why a problem is deemed interesting, a chasm that automated discovery cannot easily bridge without risking a self-referential loop of novelty. The system will inevitably optimize for what it finds interesting, potentially diverging from genuinely fruitful lines of inquiry.

Future work must address the limitations of relying on human judgment as a gold standard. Human mathematical taste is itself a product of cultural trends, pedagogical biases, and, quite often, sheer accident. It is a moving target, and improvements age faster than one can understand them. A more robust approach might involve defining interestingness not as a subjective evaluation, but as a function of a problem’s capacity to generate novel, verifiable, and unexpected results – a definition inherently less susceptible to mimicry.

Ultimately, the pursuit of automated mathematical discovery is not about creating a machine that likes problems, but one that can systematically explore the landscape of mathematical possibility, unburdened by the peculiar constraints of human intuition. The question is not whether the system can judge interestingness, but whether it can transcend it.


Original article: https://arxiv.org/pdf/2511.08548.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
