Author: Denis Avetisyan
New research explores whether dividing the task of geometric reasoning between specialized AI agents improves problem-solving performance.

This study evaluates agentic frameworks for diagram-grounded geometry, comparing multi-agent systems to single-agent approaches with open- and closed-source models.
Despite advances in multimodal large language models, it remains unclear whether decomposing complex reasoning tasks into multi-agent systems consistently improves performance. This is the central question addressed in ‘Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning’, which systematically compares single- and multi-agent pipelines on challenging visual math benchmarks. The study reveals that while multi-agent approaches consistently benefit open-source models, strong closed-source models exhibit only modest gains, suggesting agentic decomposition isn’t universally optimal. Does this indicate a need for more nuanced approaches to agent design, tailored to the capabilities of different model architectures?
The Geometry of Thought: Why Problems Persist
Geometry problem solving, despite seeming straightforward, presents a significant challenge for many conventional artificial intelligence systems due to its inherent demand for complex reasoning. These systems often falter not because of a lack of geometric knowledge, but because problems require integrating information from both visual diagrams and textual descriptions – a process demanding multiple sequential inferences. Traditional approaches frequently dissect problems into isolated steps, struggling to maintain a cohesive understanding across these steps and failing to effectively connect diagrammatic elements with stated theorems or given facts. This difficulty in visual-textual integration hinders the ability to construct complete proofs, where each inference builds logically upon the previous one, ultimately limiting performance on anything beyond simple, single-step problems. The nuanced interplay between visual perception and logical deduction proves a considerable hurdle for algorithms designed to mimic human geometric reasoning.
Despite advancements in artificial intelligence, even the most sophisticated single models encounter limitations when tackling geometric proofs and problems requiring the integration of visual and textual information. These challenges stem from the inherent complexity of geometry, which demands not only logical deduction but also the ability to accurately interpret diagrams and translate visual cues into symbolic representations. Current models often struggle with multi-step reasoning, where each step builds upon previous inferences, and can be easily derailed by subtle variations in problem presentation or diagrammatic details. This performance ceiling suggests that a new approach is needed, perhaps one that combines the strengths of different AI architectures or incorporates human-like reasoning strategies, to truly unlock the potential for automated geometric problem solving.
Deconstructing Complexity: A Multi-Agent Strategy
The MultiAgentPipeline addresses geometry problem solving through functional decomposition, assigning distinct roles to individual agents within a pipeline. This typically involves separating the process into stages such as problem interpretation – extracting relevant information from the textual problem statement and diagrams – and subsequent solving, where the interpreted data is used to derive the solution. This modular design allows for specialization; agents can be trained or configured to excel at specific sub-tasks, improving overall pipeline efficiency. For example, one agent might focus solely on identifying geometric shapes and relationships, while another concentrates on applying relevant theorems or formulas to arrive at a numerical answer. This contrasts with single-agent approaches where a single model is responsible for all stages of the problem-solving process.
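As a rough illustration of this decomposition, the sketch below wires an interpreter stage and a solver stage into a single pipeline. All class, type, and function names are hypothetical, and the agents themselves are left abstract; this is a sketch of the architecture described above, not the paper’s code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GeometryProblem:
    text: str          # natural-language problem statement
    diagram_path: str  # path to the diagram image

# Each stage is a plain callable: interpret(problem) -> symbolic facts,
# solve(facts) -> final answer. Concrete agents wrap whatever model backs them.
InterpretFn = Callable[[GeometryProblem], list[str]]
SolveFn = Callable[[list[str]], str]

@dataclass
class MultiAgentPipeline:
    interpret: InterpretFn
    solve: SolveFn

    def run(self, problem: GeometryProblem) -> str:
        facts = self.interpret(problem)   # stage 1: diagram + text -> predicates
        return self.solve(facts)          # stage 2: predicates -> answer
```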
The modular architecture of the multi-agent pipeline facilitates the incorporation of ZeroShotLearning techniques by allowing agents to be trained on distinct, potentially unrelated, datasets and then applied to novel geometry problems without specific retraining. This is achieved through standardized interfaces between agents, enabling the seamless integration of OpenSourceModels. Utilizing open-source components increases accessibility for researchers and developers, lowers deployment costs, and promotes customization, while zero-shot capabilities expand the range of solvable problems beyond the training data through generalization and transfer learning.
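A sketch of what such a standardized interface might look like, assuming a generic `chat(model, system, user)` helper is available for whichever open-source checkpoint is deployed; the helper, model names, and prompts are illustrative assumptions, not drawn from the paper.

```python
def make_llm_agent(model_name: str, system_prompt: str, chat):
    """Wrap a generic chat-completion callable as a pipeline stage.

    `chat(model, system, user) -> str` stands in for whatever inference stack
    is available (a local open-source serving setup or a hosted endpoint).
    """
    def agent(user_message: str) -> str:
        return chat(model_name, system_prompt, user_message)
    return agent

# Zero-shot role assignment: the same wrapper serves both roles; only the
# system prompt and the underlying open-source checkpoint differ, e.g.
#   interpreter = make_llm_agent("Qwen2.5-VL-7B-Instruct",
#                                "Extract geometric predicates...", chat)
#   solver      = make_llm_agent("Phi-4",
#                                "Solve using the given predicates...", chat)
```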
The implementation of a multi-agent pipeline decomposes geometry problem solving into discrete functional roles, allowing each agent to specialize in a specific task such as problem interpretation or solution derivation. Experimental results indicate that this modular approach generally improves performance in open-source geometry solvers; however, strong, closed-source models often maintain a performance advantage when operating in a single-agent configuration. This suggests the benefits of decomposition are contingent on the underlying model capacity and the complexity of the geometry problems presented; lower-capacity models benefit more significantly from task specialization, while highly capable models may not require it.
From Vision to Symbol: The Interpreter Agent
The InterpreterAgent functions as the central component in converting both visual geometric diagrams and accompanying textual problem statements into a formalized, symbolic language suitable for downstream problem-solving. This translation process involves identifying the key elements within the input – points, lines, angles, and their relationships – and representing them using a structured notation. The resulting symbolic representation, typically involving predicates and variables, allows the solver agent to reason about the geometric problem in a computationally tractable manner, independent of the original visual or textual format. Without this intermediate symbolic conversion, direct reasoning from raw visual or textual data would be significantly more challenging and less reliable.
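A minimal sketch of how such an interpretation stage could be organized, assuming a placeholder `vlm_query(image_path, prompt) -> str` helper for whatever vision-language model is used; the prompt wording and line-based parsing are illustrative, not the paper’s implementation.

```python
INTERPRETER_PROMPT = (
    "Look at the diagram and read the problem statement below. "
    "List every geometric fact you can justify, one per line, using predicates "
    "such as Equal(...), Parallel(...), Collinear(...), IsTriangle(...).\n\n"
    "Problem: {problem_text}"
)

def interpret(problem_text: str, diagram_path: str, vlm_query) -> list[str]:
    """Translate diagram + text into a list of predicate strings, one per line."""
    raw = vlm_query(diagram_path, INTERPRETER_PROMPT.format(problem_text=problem_text))
    # Keep only lines that look like predicates, e.g. "Collinear(A, B, C)".
    return [line.strip() for line in raw.splitlines() if "(" in line and ")" in line]
```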
PredicateGeneration is the process of identifying and formalizing geometric relationships and properties present in both visual and textual problem inputs. This involves extracting information such as angle equalities ($ \angle ABC = \angle DEF $), length relationships ($ AB = CD $), geometric constraints (points A, B, and C are collinear), and object classifications (identifying shapes as triangles, circles, or lines). The generated predicates are then represented in a structured format, allowing the system to reason about the geometric scenario and translate it into a symbolic language suitable for the solver agent. Accurate predicate generation is fundamental, as the solver’s subsequent reasoning is directly dependent on the completeness and correctness of these extracted properties.
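One illustrative way to make generated predicates machine-checkable is to parse each line into a small structured record before it reaches the solver; the representation below is an assumption for exposition, not the paper’s notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    name: str               # e.g. "Equal", "Collinear", "IsTriangle"
    args: tuple[str, ...]   # e.g. ("AB", "CD") or ("A", "B", "C")

def parse_predicate(line: str) -> Predicate:
    """Parse a flat predicate string such as 'Collinear(A, B, C)'.

    Nested arguments (e.g. angles written as Angle(A,B,C)) would need a real
    parser; this sketch handles only the simple comma-separated case.
    """
    name, rest = line.split("(", 1)
    args = tuple(a.strip() for a in rest.rstrip(")").split(","))
    return Predicate(name.strip(), args)

# parse_predicate("Equal(AB, CD)") -> Predicate(name="Equal", args=("AB", "CD"))
```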
The integration of advanced multimodal models, such as Google’s Gemini and OpenAI’s GPT-4o, into the interpretation phase of geometric problem solving demonstrably improves performance. These models excel at processing both visual diagrams and textual problem statements, enabling more accurate extraction of relevant geometric features and relationships than unimodal or less sophisticated models. This enhancement stems from their capacity to effectively correlate visual and textual information, reducing ambiguity and improving the reliability of the symbolic representation passed to the solver.
Performance of the multi-agent pipeline was validated on several datasets, including MathVerse, Geometry3K, WeMath, and OlympiadBench. On Geometry3K, the Qwen-2.5-VL-7B model within the multi-agent system reached 60.07% accuracy, a gain of 6.8 percentage points over the 53.24% of a single-agent baseline, indicating that geometric problem solving improves when the multi-agent architecture is paired with a vision-language model.
The Solver Agent: Precision in Reasoning
The SolverAgent functions as the reasoning engine within the multi-agent system, receiving a formalized, symbolic representation of the problem from the preceding InterpreterAgent. This symbolic input, typically expressed as logical statements or equations, is then processed through a defined set of reasoning rules. These rules, which may include deductive logic, algebraic manipulation, or geometric theorems, are applied iteratively to the symbolic representation. The application of these rules aims to transform the initial problem statement into a series of logical steps, ultimately leading to the derivation of a solution. The agent’s performance is therefore directly dependent on both the quality of the symbolic representation received and the completeness and correctness of its internal reasoning rules.
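As an illustration of this kind of iterative rule application, the fragment below runs a simple forward-chaining loop over a set of predicate strings; the single rule shown (symmetry of equality) is a toy stand-in for the solver’s actual rule set, which the paper does not specify at this level of detail.

```python
def forward_chain(facts: set[str], rules, max_rounds: int = 10) -> set[str]:
    """Repeatedly apply rules (each: set[str] -> set[str] of derived facts)
    until no rule produces anything new or the round limit is reached."""
    for _ in range(max_rounds):
        new_facts = set()
        for rule in rules:
            new_facts |= rule(facts) - facts
        if not new_facts:
            break              # fixed point reached
        facts |= new_facts
    return facts

# Toy rule: symmetry of equality, Equal(x, y) => Equal(y, x).
def equality_symmetry(facts: set[str]) -> set[str]:
    derived = set()
    for fact in facts:
        if fact.startswith("Equal(") and fact.endswith(")"):
            left, right = [s.strip() for s in fact[len("Equal("):-1].split(",", 1)]
            derived.add(f"Equal({right}, {left})")
    return derived
```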
The integration of compact language models, such as Phi-4, into the solver agent framework highlights the feasibility of achieving significant problem-solving capabilities with relatively small parameter counts. Phi-4, despite its size, demonstrates strong performance on reasoning tasks, indicating that model scale is not the sole determinant of solving power within this system. This suggests a potential for efficient deployment and reduced computational costs without substantial performance degradation, and opens avenues for exploring other similarly-sized models for specialized reasoning applications.
The SolverAgent architecture is designed for flexibility in solving strategies. This modularity facilitates experimentation with various reasoning rules and algorithms without altering the core framework. Furthermore, it enables the combination of multiple solving approaches, for instance deploying agents specializing in different geometric theorems or algebraic manipulations, to address complex problems; the benefits of this design are borne out in the evaluations reported below.
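The same modularity also makes it straightforward to compose strategies, for example attempting a rule-based pass first and deferring to a language-model solver when it fails. The composition below is a hypothetical sketch of that idea, not the configuration evaluated in the paper; both solver callables are placeholders.

```python
def combined_solver(facts, question, rule_based_solve, llm_solve):
    """Try the symbolic solver first; fall back to the LLM solver.

    `rule_based_solve(facts, question)` is assumed to return an answer string
    or None when it cannot conclude; `llm_solve(facts, question)` always
    returns a string. Both are stand-ins for whatever strategies are plugged in.
    """
    answer = rule_based_solve(facts, question)
    if answer is not None:
        return answer
    return llm_solve(facts, question)
```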
Evaluations of the multi-agent pipeline demonstrate performance gains on established benchmark datasets. Using the Qwen-2.5-VL-32B model, the system achieved 72.05% accuracy on Geometry3K, an improvement of 3.3 percentage points over the single-agent baseline of 68.72%. With the Qwen-2.5-VL-7B model, the multi-agent system attained 61.84% accuracy on OlympiadBench, 9.4 percentage points above the single-agent accuracy of 52.44%. These results indicate the effectiveness of the multi-agent approach in enhancing problem-solving capabilities.
Beyond Solution: Towards Robust and Explainable Intelligence
The pursuit of artificial intelligence capable of not only solving complex problems but also articulating how those solutions are reached has led to increasing interest in multi-agent systems. This approach, demonstrated effectively in the realm of geometry problem solving, moves beyond monolithic AI models by distributing the task amongst multiple specialized agents. Each agent contributes a specific skill – such as identifying key geometric features, formulating proof strategies, or verifying solution steps – and communicates with others to achieve a unified result. This decomposition offers inherent advantages in robustness; should one agent falter, others can compensate, preventing complete failure. More crucially, the modular nature of these systems fosters explainability, as the reasoning process becomes transparent through the analysis of interactions between agents. By tracing the flow of information and the contributions of each module, researchers and users gain valuable insights into the AI’s decision-making process, paving the way for more trustworthy and reliable artificial intelligence.
Deconstructing intricate problems into smaller, self-contained modules offers a pathway toward not only enhanced artificial intelligence, but also greater transparency in its operation. This modular approach allows researchers to pinpoint the specific steps within a reasoning process where errors occur, facilitating targeted improvements and refinements. Rather than treating an AI as a “black box,” this strategy enables a detailed examination of each component’s contribution to the final solution. By isolating individual modules, it becomes possible to assess their accuracy, efficiency, and potential biases, ultimately leading to more reliable and explainable AI systems. Such granularity is crucial for debugging, optimization, and building trust in complex algorithms, as it shifts the focus from simply what an AI decides to how it arrives at that decision.
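In practice, this kind of module-level inspection can be as simple as recording each agent’s inputs and outputs so that a wrong final answer can be traced to the stage that introduced it. The tracing wrapper below is a minimal sketch of that idea, not tooling from the study.

```python
import json

def traced(stage_name: str, fn, log: list):
    """Wrap a pipeline stage so its inputs and outputs are recorded in `log`."""
    def wrapper(*args):
        result = fn(*args)
        log.append({"stage": stage_name,
                    "inputs": [repr(a) for a in args],
                    "output": repr(result)})
        return result
    return wrapper

def dump_trace(log: list) -> None:
    """Print the recorded trace; a wrong answer can then be attributed to the
    first stage whose output is already incorrect."""
    print(json.dumps(log, indent=2))
```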
The current geometry problem-solving framework stands to gain significantly from the incorporation of both closed-source models and the LLaVA multimodal model. Closed-source models, with their often larger parameter counts and training datasets, can contribute enhanced reasoning and problem-solving abilities, particularly in areas where open-source alternatives are still developing. Simultaneously, the integration of LLaVA – a model capable of processing both images and text – introduces a vital dimension of visual understanding. This is crucial for geometry, where diagrams are fundamental to problem definition and solution. By enabling the system to ‘see’ and interpret visual information, LLaVA facilitates a more comprehensive analysis of the problem, potentially unlocking solutions that text-based models might overlook and paving the way for more versatile and effective AI systems.
Recent evaluations on the We-Math benchmark show a more modest effect. Using Gemini-2.0-Flash, the collaborative approach achieved an accuracy of 62.90%, compared with 61.16% for a single-agent baseline, an improvement of 1.74 percentage points. This narrower margin is consistent with the study’s broader finding: strong closed-source models such as Gemini-2.0-Flash gain comparatively little from agentic decomposition, whereas open-source models, with more limited standalone capabilities, benefit more from distributing task decomposition and knowledge sharing across specialized modules.
The study dissects problem-solving approaches, favoring decomposition into specialized roles, an interpreter and a solver, for certain models. This echoes a principle of reductive reasoning. As John von Neumann observed, “If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.” The research demonstrates that, for open-source visual language models, this simplification, distributing cognitive load across specialized roles, yields measurable gains in geometry problem solving. However, the benefits are not universal, suggesting that advanced closed-source models already implicitly employ such strategies. Clarity, it appears, is the minimum viable kindness, even for artificial intelligence.
Beyond the Chorus: Future Directions
The persistent question of whether distributing cognition yields better solutions receives, as often happens, a qualified answer. This work demonstrates benefits for less capable models, suggesting decomposition acts as a scaffolding, a prosthetic for limited reasoning. Yet the lack of consistent gains with more powerful systems implies a saturation point. Perhaps the problem isn’t how to reason, but the inherent ambiguity within the formalization of geometry itself. A system capable of flawless execution on ill-defined problems remains elusive.
Future work should concentrate on refining the interface between agent roles. The current paradigm, while functional, feels… crude. A more nuanced exchange, with less predicate generation and more direct manipulation of visual primitives, might unlock greater synergy. Furthermore, the emphasis on zero-shot learning, while commendable, obscures a deeper issue: are these systems truly ‘solving,’ or simply pattern-matching with increased sophistication?
The pursuit of increasingly complex architectures feels, increasingly, like building ever-more elaborate Rube Goldberg machines. The true elegance lies not in doing more, but in distilling the essential elements. A focus on minimalist yet demonstrably robust agentic frameworks, systems that prioritize clarity over capacity, promises a more fruitful path forward.
Original article: https://arxiv.org/pdf/2512.16698.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/