The Self-Evolving Math Problem: Training AI with Increasingly Complex Challenges

Author: Denis Avetisyan


Researchers are exploring a new approach to artificial intelligence training where code-driven agents automatically generate and refine mathematical problems to push the boundaries of reasoning capabilities.

The system iteratively refines initial problem definitions through computational exploration, abstracting empirical findings into increasingly complex challenges: a process that mirrors organic growth rather than deliberate construction, and that implicitly forecasts eventual limitations within the evolved structure.

This paper introduces Code2Math, a framework utilizing code agents to autonomously evolve mathematical problems, escalating difficulty while preserving solvability for improved model training and evaluation.

The increasing demand for challenging mathematical problems to train advanced reasoning models is hampered by a scarcity of high-quality, complex examples. Addressing this bottleneck, the work ‘Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?’ introduces a novel framework leveraging code-driven agents to autonomously generate more difficult, yet solvable, variations of existing problems. Experiments demonstrate that these agents can successfully synthesize new problems that are structurally distinct from the originals and exhibit increased complexity through test-time exploration. Could this approach provide a scalable pathway towards automatically creating datasets for evaluating the next generation of mathematical reasoning systems?


The Illusion of Complexity: Why Easy Problems Are the Real Challenge

The creation of truly challenging mathematical problems isn’t simply a matter of increasing complexity; often, seemingly minor alterations to existing problems can inadvertently render them trivial. This counterintuitive phenomenon stems from the delicate balance required between problem constraints and the potential solution pathways. A problem designed to test a specific reasoning skill may, with a slight modification, become solvable through rote application of a formula or a straightforward, uninspired approach. Researchers have found that ensuring genuine difficulty necessitates a deep understanding of the underlying mathematical concepts and a careful consideration of how subtle changes can unintentionally unlock unintended shortcuts, highlighting the surprising nuance involved in crafting problems that genuinely assess cognitive effort rather than mere computational ability. The challenge lies in designing problems where the path to the solution isn’t immediately obvious, demanding a degree of insightful thinking beyond simple application of learned procedures.

Current automated problem generation techniques often fall short when aiming for truly insightful challenges. While algorithms can readily produce solvable problems, consistently crafting those demanding non-obvious approaches proves remarkably difficult. The core issue lies in the tendency for even slight modifications to established problem structures to inadvertently create trivial variations, easily solved with standard techniques. This limitation significantly hinders the utility of these methods in accurately assessing higher-order reasoning skills; a problem easily solved doesn’t differentiate between rote memorization and genuine cognitive flexibility. Consequently, evaluations relying on such generated problems may overestimate an individual’s or system’s actual reasoning capabilities, as they fail to effectively probe for the nuanced insights that characterize true problem-solving expertise.

The true test of a well-crafted mathematical problem isn’t simply its solvability, but the cognitive resources it demands from the individual attempting to solve it. Researchers increasingly recognize that generating problems requiring substantial effort – those resisting immediate application of standard algorithms or memorized facts – is paramount for accurate assessment of reasoning skills. A problem easily solved through rote learning provides little insight into genuine understanding, while one demanding careful analysis, the formulation of novel strategies, or the application of multiple concepts offers a far more revealing picture of cognitive ability. This necessitates a shift in problem generation methodologies, moving beyond simple variations on existing exercises toward designs that intentionally introduce complexity and encourage deeper engagement with the underlying mathematical principles. The goal is to create challenges that aren’t merely difficult, but meaningfully so – requiring not just calculation, but true intellectual investment.

This multi-agent system validates new problems and solutions by leveraging agents focused on evolution, solvability, and difficulty assessment, utilizing mathematical tools to refine initial inputs into verified problem-solution pairs.

Architecting Insight: The Evolution of Deceptive Challenges

LLM-Based Problem Evolution is a process by which new problem instances are automatically generated from a set of initial, or seed, problems. This is achieved through the application of large language models (LLMs) which are prompted to create variations while preserving core problem characteristics. The LLM takes an existing problem as input and, based on its training data and specified parameters, outputs a modified problem statement. This allows for the programmatic creation of a diverse problem set without manual authoring, enabling scalability in areas such as automated curriculum generation and benchmark creation. The process focuses on maintaining problem validity and complexity while introducing novel elements, ensuring the generated problems remain solvable but present unique challenges.
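The prompting step described above can be sketched as a simple template function. This is purely illustrative: the paper's actual prompts and model interface are not reproduced here, and the function name and strategy strings are hypothetical.

```python
# Hypothetical sketch of LLM-driven problem evolution: compose a prompt
# that asks a model to vary a seed problem under a named strategy while
# preserving solvability.

def build_evolution_prompt(seed_problem: str, strategy: str) -> str:
    """Compose a prompt asking an LLM to evolve a seed problem.

    `strategy` names the desired modification, e.g. "add a constraint"
    or "generalize a constant to a parameter".
    """
    return (
        "You are a math problem designer.\n"
        f"Seed problem:\n{seed_problem}\n\n"
        f"Rewrite it using this strategy: {strategy}.\n"
        "Keep the problem solvable, but make it harder than the seed.\n"
        "Return only the new problem statement."
    )

prompt = build_evolution_prompt(
    "Find all integers n with n^2 < 50.",
    "replace the constant with a parameter and ask for a general bound",
)
```

The returned string would then be sent to whichever LLM backs the Evolution Agent; the response becomes a candidate problem for downstream verification.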

The Evolution Agent automates problem modification through a defined strategy, employing techniques such as Mathematical Problem Adaptation to generate novel challenges. This adaptation involves systematically altering problem parameters, constraints, or the underlying mathematical relationships while preserving the core problem-solving principles. The agent doesn’t simply randomize changes; instead, it applies rules based on the problem’s structure to ensure the resulting problems remain solvable but potentially require different solution approaches. These modifications can include scaling numerical values, introducing distractors, changing variable representations, or transforming equations, all executed according to the agent’s programmed strategy. The goal is to create a diverse set of problems that build upon existing ones, facilitating a more robust evaluation of solver capabilities.
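One of the adaptation rules mentioned above, scaling numerical values, can be illustrated in a few lines. A real agent would apply structure-aware transformations rather than blind text substitution; this regex version is a minimal sketch, not the paper's implementation.

```python
import re

# Minimal sketch of a single adaptation rule: multiply every integer
# literal in a problem statement by a fixed factor. Real agents would
# respect the problem's structure; this is purely illustrative.

def scale_numbers(problem: str, factor: int) -> str:
    """Multiply every integer literal in the statement by `factor`."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) * factor), problem)

seed = "A rectangle has sides 3 and 7. Find its area."
variant = scale_numbers(seed, 2)
# variant == "A rectangle has sides 6 and 14. Find its area."
```

Other rules from the text, such as introducing distractors or renaming variables, would each get their own transformation with the same interface.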

The Evolution Agent incorporates a Theory of Mind (ToM) framework to model the anticipated reasoning processes of problem-solving agents. This involves predicting the steps a solver will likely take, identifying potential cognitive biases, and crafting problem variations designed to exploit these vulnerabilities. By simulating solver thought processes, the agent generates problems not simply based on complexity, but on their capacity to mislead or create deceptive challenges – specifically, problems that appear solvable with common approaches but require a less obvious solution path. This ToM-driven approach differs from standard problem generation which primarily focuses on increasing difficulty through parameters like numerical range or equation length, and instead aims for problems that are challenging due to their structure and how they interact with expected solver behavior.

Analysis of agentic problem evolution reveals that DeepSeek-Chat, DeepSeek-Reasoner, and Gemini-3-Pro-Preview-Thinking differ in their failure rates, as quantified by rejection rates from the solvability and difficulty verification agents.

Validating the Illusion: Ensuring Genuine Challenge

The system incorporates a Solvability Verification Agent to guarantee the functional correctness of generated problems before deployment. This agent assesses whether a valid solution exists for each new problem instance, effectively preventing the introduction of unsolvable or logically flawed challenges. Evaluation using DeepSeek-Reasoner demonstrates a high degree of accuracy, with the agent confirming the solvability of approximately 96% of generated problems, indicating a robust capacity to filter out invalid problem instances.
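The essence of the check described above is to confirm that a candidate problem admits at least one valid solution before it is deployed. A minimal sketch, assuming a constraint-satisfaction framing with a small searchable domain (the paper's agent reasons with an LLM and mathematical tools, not brute force):

```python
# Illustrative solvability check (not the paper's agent): verify that a
# generated constraint problem has at least one solution by exhaustive
# search over a small integer domain.

def is_solvable(constraint, domain=range(-100, 101)) -> bool:
    """Return True if some x in `domain` satisfies `constraint(x)`."""
    return any(constraint(x) for x in domain)

# A solvable instance: x^2 - 5x + 6 = 0 has integer roots 2 and 3.
assert is_solvable(lambda x: x * x - 5 * x + 6 == 0)
# An unsolvable variant would be filtered out before deployment:
assert not is_solvable(lambda x: x * x + 1 == 0)
```

Problems failing the check are rejected rather than shipped, which is the behavior the ~96% figure measures.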

The Difficulty Verification Agent evaluates the complexity of newly generated problems relative to the initial seed problem, employing metrics such as Burden of Discovery to quantify this increase. Empirical results indicate a consistent reduction in solve rates when evaluated against strong solvers, ranging from a 6% to 21% decrease. This demonstrates the agent’s capacity to produce problems that present a genuine increase in challenge, effectively modulating difficulty during problem generation.
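The solve-rate comparison behind the 6–21% figure can be sketched directly. Function names here are hypothetical, and the paper's Burden of Discovery metric is richer than this single difference.

```python
# Sketch of the difficulty comparison: measure how far the solve rate
# drops from seed problem to evolved problem across a pool of solvers.

def solve_rate(attempts) -> float:
    """Fraction of solver attempts that succeeded."""
    attempts = list(attempts)
    return sum(attempts) / len(attempts)

def difficulty_gain(seed_attempts, evolved_attempts) -> float:
    """Positive value = evolved problem is harder (lower solve rate)."""
    return solve_rate(seed_attempts) - solve_rate(evolved_attempts)

# 8/10 solvers crack the seed, but only 6/10 crack the evolved version:
gain = difficulty_gain([True] * 8 + [False] * 2, [True] * 6 + [False] * 4)
# gain is approximately 0.2, i.e. a 20% solve-rate drop, comparable to
# the 6-21% range reported.
```

A difficulty agent would accept the evolved problem only if this drop falls in a target band, rejecting variants that became trivial or unsolvable.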

Test-Time Scaling is implemented to improve the robustness of generated problems by creating multiple variations of each challenge. This process involves generating a set of problems from a single seed, then evaluating properties such as solvability and difficulty across this set. By assessing these characteristics across multiple variations, the system ensures that the final selected problem consistently presents the intended level of challenge and avoids anomalies or unintended simplifications that might occur due to random generation. This approach contributes to a more reliable and predictable problem-generation process, increasing confidence in the quality of the generated challenges.
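The generate-many-then-select loop described above can be sketched as best-of-n selection. `generate`, `solvable`, and `difficulty` are stand-ins for the agents discussed earlier, and the toy instantiation below is purely illustrative.

```python
# Minimal sketch of test-time scaling: produce several variants of a
# seed, keep only those passing the solvability check, and select the
# hardest survivor.

def select_problem(seed, generate, solvable, difficulty, n_variants=8):
    """Best-of-n selection over generated variants of `seed`."""
    variants = [generate(seed, i) for i in range(n_variants)]
    valid = [v for v in variants if solvable(v)]
    if not valid:
        return None  # every variant rejected; retry with a new seed
    return max(valid, key=difficulty)

# Toy instantiation: a variant is (seed, i); even i counts as
# "solvable", and a larger i counts as "harder".
best = select_problem(
    "seed",
    generate=lambda s, i: (s, i),
    solvable=lambda v: v[1] % 2 == 0,
    difficulty=lambda v: v[1],
)
# best == ("seed", 6): the hardest even-indexed variant among 0..7.
```

Evaluating the checks across the whole variant set, rather than on a single sample, is what smooths out the anomalies the text mentions.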

Code-Driven Exploration utilizes symbolic computation to validate proposed solutions and quantify problem complexity during problem generation. This approach involves executing and verifying code against problem constraints, which is computationally intensive. Analysis of the evolutionary process revealed an average of 1.56 to 6.55 failures per problem attempted before a valid and appropriately complex problem was generated. This failure rate indicates a significant computational cost associated with ensuring problem reliability, necessitating a trade-off between the rigor of validation and the overall efficiency of the generation process.
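The cost accounting behind the 1.56–6.55 figure amounts to counting rejected candidates before each acceptance. A deterministic sketch, with a stand-in accept predicate in place of the paper's symbolic verification:

```python
# Sketch of the exploration loop's cost accounting: count how many
# rejected candidates precede an accepted problem. The accept/reject
# logic here is a deterministic stand-in for symbolic verification.

def evolve_with_retries(candidates, accept):
    """Try candidates in order; return (accepted, n_failures)."""
    failures = 0
    for c in candidates:
        if accept(c):
            return c, failures
        failures += 1
    return None, failures

# Toy run: candidates 1..9, only multiples of 4 are accepted.
accepted, failures = evolve_with_retries(
    candidates=range(1, 10), accept=lambda c: c % 4 == 0
)
# accepted == 4 after 3 failures (1, 2, 3 rejected); averaging such
# per-problem failure counts yields figures like the reported 1.56-6.55.
```

Each failure costs a full generate-and-verify cycle, which is where the trade-off between validation rigor and generation efficiency arises.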

The distribution of average token consumption reveals that agent-evolved problems generally require more tokens to solve than the original problems, with timeout failures conservatively assigned the maximum token limit to indicate high difficulty.

The Art of Concealment: Measuring Insight, Not Just Accuracy

The foundation of this problem-solving approach rests on a deliberate strategy: embedding a crucial, yet hidden, ‘Aha Moment’ directly within the problem’s initial presentation. This isn’t about making the problem unsolvable, but rather ensuring the path to a solution demands genuine insight, not just the application of pre-existing knowledge. The core idea is to mask the essential realization – the pivotal understanding needed to bridge the gap between the problem’s conditions and its solution – so that solvers are compelled to actively reason and discover this insight themselves. By concealing this key element, the approach moves beyond evaluating simple recall or pattern recognition, instead assessing a solver’s capacity for flexible thinking and innovative problem-solving.

The Evolution Agent operates on the principle of veiled complexity, deliberately introducing subtle obfuscation to hinder reliance on rote memorization or superficial pattern recognition. Rather than presenting a straightforward problem, the agent dynamically adjusts the information landscape, embedding the core insight within a network of seemingly relevant, yet ultimately distracting, details. This approach forces a solver to engage in genuine reasoning, demanding a deep understanding of the underlying principles and a deliberate process of hypothesis formation and testing. Consequently, assessments move beyond mere accuracy, revealing the true depth of a solver’s cognitive flexibility and ability to extract meaningful solutions from ambiguous or misleading information.

The assessment framework moves beyond traditional metrics of correctness to evaluate the process of problem-solving itself. By intentionally obscuring a critical insight within the problem’s structure, the system necessitates genuine reasoning and discourages solutions derived from rote memorization or superficial pattern recognition. This approach allows for a more detailed analysis of a solver’s cognitive strategies, revealing not just whether a problem was solved, but how – identifying strengths in analytical thinking, creative hypothesis generation, and the ability to overcome conceptual obstacles. Consequently, the resulting evaluation provides a significantly richer and more nuanced understanding of problem-solving capabilities than simple accuracy scores ever could, highlighting a solver’s true potential for innovative thought.

The pursuit of increasingly complex mathematical challenges, as explored in this work, echoes a fundamental truth about systems. They don’t simply reach a state of difficulty; they evolve towards it. This mirrors Kolmogorov’s observation: “The most important discoveries often come from looking at things from a different angle.” The framework presented isn’t about constructing a perfectly scaled difficulty ladder, but rather about allowing an agent to explore the solution space, generating problems that, while solvable, push the boundaries of current reasoning models. Long-term stability in problem generation, a consistent output of ‘medium’ difficulty, wouldn’t signify success, but rather a hidden constraint, a failure to truly probe the limits of these models. The system’s ability to escalate difficulty through exploration indicates a more natural, organic growth, a key characteristic of resilient systems.

The Horizon of Difficulty

The presented work does not solve mathematical problems; it cultivates an ecosystem of mathematical problems. This is a subtle, yet crucial distinction. The framework’s capacity for autonomous difficulty escalation hints at a future where datasets are not curated artifacts, but generative landscapes. However, the very act of defining “solvability” introduces an inherent prophecy of failure. Every metric, every reward function, establishes a boundary beyond which the system will inevitably strain, and reveal the limitations of its own definitions.

True resilience in these systems begins where certainty ends. The framework’s current focus on single-agent evolution, while a pragmatic starting point, obscures the more complex dynamics of a truly adaptive system. The future likely lies in exploring multi-agent scenarios – not as a means to achieve consensus, but to foster constructive conflict, where differing “solvers” continuously challenge and refine the problem space. Monitoring, then, is not about preventing errors; it is the art of fearing consciously.

That is not a bug; it is a revelation. The inevitable emergence of unsolvable, or ill-defined, problems should not be viewed as a deviation from the intended function, but as essential data points. These are the edges of the system’s comprehension, the points where new architectures, new reward structures, and ultimately, a deeper understanding of mathematical reasoning itself, must be forged.


Original article: https://arxiv.org/pdf/2603.03202.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-04 20:58