Beyond Self-Critique: How LLMs Can Sharpen Reasoning Through Debate

Author: Denis Avetisyan


A new framework leverages the power of multi-agent systems to enable large language models to improve their reasoning abilities by engaging in structured, persona-driven debates.

The Multi-Agent Reflexion architecture extends a single-agent framework, propagating the capacity for self-evaluation and iterative improvement across a distributed system, anticipating inevitable systemic failures rather than striving for centralized control.

Multi-Agent Reflexion replaces single-agent self-critique with diverse perspectives, enhancing performance on reasoning and code generation tasks.

While large language models exhibit promising self-improvement through reflection, continual self-critique often leads to repetitive errors and diminished performance. This paper introduces ‘MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs’, a novel framework that overcomes this limitation by leveraging debates among multiple agents embodying diverse personas to generate more robust and insightful reflections. Our experiments demonstrate that this multi-agent approach significantly enhances reasoning and code generation capabilities, achieving 47% exact match accuracy on HotPotQA and 82.7% on HumanEval, surpassing single-agent reflection methods. Could this paradigm shift in self-improvement unlock a new era of truly adaptive and reliable artificial intelligence?


The Illusion of Reasoning: Why Scale Isn’t Enough

Despite their remarkable ability to generate human-quality text and perform various linguistic tasks, Large Language Models often falter when confronted with problems demanding complex reasoning. While proficient at identifying patterns in training data, these models struggle with tasks requiring logical deduction, common sense, or an understanding of causal relationships. This limitation manifests in several ways, including susceptibility to logical fallacies, difficulty with multi-step problem-solving, and a tendency to generate plausible-sounding but ultimately incorrect answers. The core issue isn’t a lack of knowledge, but rather an inability to reliably apply that knowledge in a nuanced and contextually appropriate manner, leading to predictable, yet frustrating, flaws in their reasoning processes.

Current research indicates that simply increasing the size of Large Language Models (LLMs) does not guarantee proportional improvements in their reasoning abilities. While scaling up model parameters initially yielded performance gains, these benefits are now diminishing, with substantial computational cost required for marginal gains. This suggests that the fundamental architectural design of LLMs is reaching a point of diminishing returns. Innovations beyond simply adding more layers or parameters – such as incorporating mechanisms for more robust knowledge representation, improved contextual understanding, or enhanced algorithmic reasoning – are increasingly vital to unlock true advancements in artificial intelligence. The focus is shifting from ‘bigger is better’ to ‘smarter design’ to overcome the limitations of scale and achieve genuinely intelligent systems.

Large language models, despite their impressive ability to generate human-like text, are increasingly recognized for a critical flaw termed ‘Degeneration of Thought’. This phenomenon describes a tendency for these models to not only make initial errors, but to persistently repeat them even after receiving corrective feedback. Unlike human reasoning, where mistakes are typically adjusted with new information, LLMs can become locked in flawed patterns, compounding inaccuracies through iterative processing. This isn’t simply a matter of occasional slips; it represents a fundamental limitation in their ability to self-correct and refine understanding, severely hindering reliable performance in tasks requiring sustained, accurate reasoning and posing significant challenges for applications demanding dependable outputs.

The Reflexion architecture improves action selection iteratively: an actor interacts with the environment, an evaluator provides feedback, a reflector LLM distills that feedback into reflections held in short-term memory, and those reflections inform subsequent actions.

Reflexion: Building Models That Question Themselves

Reflexion departs from traditional Large Language Model (LLM) operation by implementing a system where the model actively critiques its own generated content. Rather than solely producing text based on input prompts, Reflexion enables LLMs to analyze their responses for potential errors, inconsistencies, or areas for improvement. This is achieved through a dedicated internal evaluation process, allowing the model to identify weaknesses in its initial output before finalizing it. Consequently, the LLM functions not merely as a text generator, but as an iterative problem-solver capable of self-assessment and refinement, moving beyond passive response creation to active, self-directed improvement.

The Reflexion framework utilizes three primary components to facilitate iterative response refinement. The Actor component initially generates a response to a given prompt. This response is then passed to the Evaluator, which assesses the output based on predefined criteria – including relevance, accuracy, and coherence – and generates a critique. Finally, the Self-Reflector receives both the original response and the critique, using this combined information to revise the initial output, creating an improved response for subsequent evaluation or direct use. This Actor-Evaluator-Self-Reflector cycle allows the model to move beyond simple generation and actively address identified weaknesses in its own work.
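This cycle can be made concrete with a short sketch. The `complete` callable below stands in for whatever chat-completion API is in use, and the prompt wording, stopping criterion, and iteration count are illustrative assumptions rather than the paper’s exact implementation.

```python
# Minimal sketch of the Actor -> Evaluator -> Self-Reflector loop.
# `complete(prompt) -> str` is assumed to wrap an LLM client; prompts and
# the PASS/FAIL convention are illustrative, not taken from the paper.

def reflexion_loop(task: str, complete, max_iters: int = 3) -> str:
    reflections: list[str] = []  # short-term memory of distilled lessons
    answer = ""
    for _ in range(max_iters):
        # Actor: generate an answer, conditioned on accumulated reflections.
        actor_prompt = (
            f"Task: {task}\nPrior reflections:\n"
            + "\n".join(reflections)
            + "\nAnswer:"
        )
        answer = complete(actor_prompt)

        # Evaluator: critique the answer against the task.
        verdict = complete(
            f"Task: {task}\nAnswer: {answer}\n"
            "Is this correct? Reply PASS or FAIL with a short reason."
        )
        if verdict.strip().upper().startswith("PASS"):
            break

        # Self-Reflector: turn the critique into a reusable lesson.
        lesson = complete(
            f"Task: {task}\nFailed answer: {answer}\nCritique: {verdict}\n"
            "Write one short lesson to avoid this mistake next time."
        )
        reflections.append(lesson)
    return answer
```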

Reflexion builds upon established prompting techniques such as Chain-of-Thought and ReAct by incorporating a self-critique and revision cycle. While Chain-of-Thought encourages step-by-step reasoning and ReAct combines reasoning with action execution, these methods typically lack a dedicated mechanism for the model to evaluate its own performance and refine its outputs. Reflexion introduces components – the Evaluator and Self-Reflector – specifically designed to assess generated responses against predefined criteria and then utilize that assessment to iteratively improve subsequent generations. This feedback loop allows the model to identify and correct errors, leading to a demonstrable increase in response quality and a reduction in the occurrence of factual inaccuracies or irrelevant information.

Multi-Agent Reflexion (MAR) significantly improves exact match (EM) performance on the HotPotQA dataset compared to both the baseline gpt-3.5-Turbo (grey) and standard Reflexion (blue).

Multi-Agent Reflexion: The Illusion of Consensus

Multi-Agent Reflexion moves beyond the limitations of self-critique by implementing a system of iterative evaluation conducted by multiple, distinct agents. Rather than a single model assessing its own outputs, this framework utilizes several ‘Persona-Based Critics’ to engage in a structured debate regarding the quality and accuracy of generated responses. Each agent operates with a predefined role and evaluation criteria, allowing for a more comprehensive analysis than is possible with a single evaluator. This approach simulates a peer-review process, where diverse perspectives are considered to identify weaknesses and improve the overall performance of the system through constructive criticism and refinement.

The Multi-Agent Reflexion framework employs four distinct personas – Logician, Skeptic, Creative, and Verifier – to provide a multifaceted evaluation of generated responses. The Logician Persona focuses on the logical consistency and coherence of the reasoning process. The Skeptic Persona actively challenges assumptions and identifies potential flaws in the response. The Creative Persona explores alternative approaches and suggests novel solutions, promoting broader thinking. Finally, the Verifier Persona concentrates on factual accuracy and consistency with external knowledge sources, ensuring the information presented is reliable. This division of labor allows for a more comprehensive critique than a single agent could provide, addressing various aspects of response quality and mitigating individual biases.
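To make this division of labor concrete, the sketch below encodes each persona as a short instruction and collects one critique per persona. The prompt texts and the `complete` callable are illustrative assumptions; the paper’s actual prompts are not reproduced here.

```python
# Illustrative persona instructions; only the division of labor described
# above is assumed, not the paper's exact wording.
PERSONAS = {
    "Logician": "Check the reasoning chain for logical gaps or inconsistencies.",
    "Skeptic": "Challenge the assumptions behind the answer and look for flaws.",
    "Creative": "Propose alternative approaches the answer may have overlooked.",
    "Verifier": "Check every factual claim against known reference knowledge.",
}

def persona_critiques(task: str, answer: str, complete) -> dict[str, str]:
    """Collect one critique per persona for a proposed answer."""
    return {
        name: complete(
            f"You are the {name}. {role}\nTask: {task}\nAnswer: {answer}\nCritique:"
        )
        for name, role in PERSONAS.items()
    }
```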

The Multi-Agent Reflexion framework addresses limitations of single-agent self-critique by incorporating multiple, distinct personas to evaluate responses. This approach mitigates bias through the aggregation of diverse perspectives, preventing over-reliance on a single evaluation metric. Furthermore, the system utilizes ‘Episodic Memory’ – a retention mechanism for past interactions and critiques – to inform current evaluations and prevent repetitive errors. By referencing prior experiences, the system improves robustness and consistency, enabling it to generalize better to unseen prompts and reduce susceptibility to flawed reasoning or incomplete information.
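One plausible shape for the episodic memory and the aggregation of persona critiques is sketched below, assuming memory is a simple list of past (task, lesson) records with naive keyword retrieval; the paper’s actual retention and fusion mechanisms may differ.

```python
# Sketch of episodic memory plus critique aggregation, under the assumption
# that memory is a flat list of (task, lesson) records.
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    records: list = field(default_factory=list)  # past (task, lesson) pairs

    def add(self, task: str, lesson: str) -> None:
        self.records.append((task, lesson))

    def recall(self, task: str, k: int = 3) -> list:
        # Naive keyword-overlap retrieval; a real system might use embeddings.
        scored = sorted(
            self.records,
            key=lambda r: len(set(task.lower().split()) & set(r[0].lower().split())),
            reverse=True,
        )
        return [lesson for _, lesson in scored[:k]]

def aggregate_debate(critiques: dict, memory_lessons: list, complete) -> str:
    """Fuse persona critiques and recalled lessons into one reflection."""
    joined = "\n".join(f"{name}: {text}" for name, text in critiques.items())
    past = "\n".join(memory_lessons)
    return complete(
        "Summarize the following critiques and past lessons into one actionable "
        f"reflection for the next attempt.\nCritiques:\n{joined}\nPast lessons:\n{past}"
    )
```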

Evaluations on the HotPotQA dataset demonstrate a significant performance advantage for Multi-Agent Reflexion. The framework achieved an Exact Match score of 47.0%, representing a 13.0 percentage point increase over the 34.0% achieved by standard Reflexion. Comparative analysis also reveals superior results to the ReAct methodology, which attained an Exact Match score of 32.0%. These results indicate that the incorporation of diverse, persona-based critics enhances question answering accuracy as measured by the HotPotQA benchmark.

Validating the Approach: The Illusion of Progress

Evaluations of Multi-Agent Reflexion demonstrate its efficacy on challenging benchmarks requiring complex reasoning. The approach has been rigorously tested on tasks such as multi-hop question answering, utilizing the ‘HotPotQA’ dataset which demands synthesizing information from multiple sources, and code generation with ‘HumanEval’, a benchmark focused on functional correctness. These evaluations aren’t simply about achieving a result; they showcase the method’s ability to tackle problems that necessitate a deeper understanding and more nuanced approach than single-step reasoning allows. Success on these complex tasks validates the core principles of internal self-critique and iterative refinement, suggesting a pathway toward more robust and adaptable artificial intelligence systems.

The efficacy of this approach is rigorously quantified through established metrics central to evaluating generative models. ‘Exact Match’ assesses the precise correctness of responses, particularly crucial in question answering tasks, while ‘Pass@1’ measures the probability of generating a functionally correct solution within a single attempt – a key indicator of code quality. Results demonstrate consistent improvements across both metrics, signifying not just an increase in the plausibility of generated content, but a demonstrable enhancement in its factual accuracy and practical utility. These gains are particularly notable in complex challenges where nuanced understanding and precise execution are paramount, validating the method’s ability to move beyond superficial coherence towards genuinely reliable performance.
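For readers unfamiliar with the two metrics, the following sketch shows how they are commonly computed. Exact-match normalization rules vary by benchmark (HotPotQA additionally strips articles and punctuation, which is omitted here), and the pass@k estimator is the standard unbiased form from the HumanEval literature; for k = 1 it reduces to the fraction of correct samples.

```python
from math import comb

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after light normalization (lowercase, stripped whitespace)."""
    return prediction.strip().lower() == reference.strip().lower()

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```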

Evaluations on the HumanEval code generation benchmark demonstrate the efficacy of Multi-Agent Reflexion, which achieved a pass@1 score of 82.6%. This result signifies a substantial improvement over prior techniques; notably, MAR surpassed both ReAct (67.1%) and its predecessor, Reflexion (76.4%). The performance is particularly compelling when set against GPT-4, a leading large language model, with MAR’s 82.6% slightly exceeding GPT-4’s 81.7%. This near-parity suggests that the method effectively leverages self-critique and validation to generate high-quality, executable code, offering a competitive solution in the realm of automated code synthesis.

A significant advantage of Multi-Agent Reflexion lies in its ability to achieve strong performance without the substantial computational demands typically associated with Reinforcement Learning (RL). Traditional RL methods often require extensive training and significant resources to fine-tune agents for complex tasks. This approach, however, leverages iterative self-critique and external validation to refine responses, effectively diminishing the need for costly and time-consuming RL processes. Consequently, Multi-Agent Reflexion presents a more scalable solution for tackling intricate problems, opening possibilities for deployment in resource-constrained environments and facilitating broader accessibility to advanced reasoning capabilities. This efficiency isn’t achieved at the expense of quality; rather, it allows for robust performance with a comparatively streamlined computational footprint.

This methodology distinguishes itself from conventional approaches by prioritizing the quality of responses, not merely their superficial plausibility. Instead of solely focusing on generating text that sounds correct, the system integrates a rigorous process of internal self-critique, where generated answers are first evaluated against pre-defined reasoning standards. This internal assessment is then cross-validated with external sources – effectively subjecting the response to a double-check for accuracy and logical consistency. The result is a shift towards generating answers that are not only coherent but demonstrably reliable and well-reasoned, addressing a critical limitation in many large language models prone to confidently producing incorrect or misleading information. This emphasis on reasoned justification represents a significant step towards building more trustworthy and dependable artificial intelligence systems.

Towards Adaptive Intelligence: The Allure of Perpetual Learning

The innovative ‘Multi-Agent Debate’ method signifies a considerable advancement beyond the capabilities of Reflexion, establishing a versatile framework applicable to a broad spectrum of collaborative problem-solving scenarios. This technique doesn’t merely rely on a single AI agent reflecting on its own reasoning; instead, it instigates a dynamic exchange between multiple agents, each presenting and critically evaluating potential solutions. Through constructive disagreement and iterative refinement, these agents collectively enhance the quality of their knowledge and reasoning processes. This approach allows for the identification and correction of errors more efficiently than self-reflection alone, fostering a robust system capable of tackling increasingly complex challenges and achieving more nuanced understanding – essentially moving beyond individual learning to a collective intelligence paradigm.

Current research endeavors are heavily invested in expanding the scope of these self-reflective AI techniques to encompass significantly larger models and increasingly intricate challenges. This scaling process isn’t merely about computational power; it necessitates the development of novel algorithms capable of efficiently managing the expanded knowledge base and maintaining coherence across complex reasoning chains. Successfully navigating this scaling hurdle promises to unlock unprecedented levels of AI capability, moving beyond narrow task proficiency towards a more generalized and robust intelligence. The anticipated benefits extend to fields requiring nuanced understanding and adaptability, such as scientific discovery, complex systems management, and creative problem-solving, ultimately allowing AI to tackle problems previously considered beyond its reach.

The current trajectory of artificial intelligence is witnessing a move beyond conventional training methodologies toward a model of sustained intellectual growth. Rather than simply optimizing performance on a fixed dataset, researchers are now focused on systems capable of continuous learning and adaptation, mirroring the hallmarks of natural intelligence. This represents a fundamental shift in perspective – from building machines that perform tasks to fostering systems that actively develop competence. This cultivation of intelligence allows AI to not only refine existing knowledge but also to identify and correct its own errors, leading to more robust and generalizable problem-solving capabilities. The implications extend beyond improved accuracy; it opens the door to AI that can navigate unforeseen circumstances, creatively address novel challenges, and ultimately, evolve its understanding of the world.

The trajectory of artificial intelligence research is shifting from performance on specific tasks to the development of genuine understanding and continuous improvement. Current AI often excels at answering questions based on patterns in data, but lacks the capacity to truly comprehend the underlying concepts or context. Similarly, many systems are adept at solving problems within predefined parameters, yet struggle to generalize learning to novel situations. The ultimate ambition is to engineer AI that doesn’t simply process information, but internalizes it, fostering a capacity for self-assessment and iterative refinement – essentially, systems capable of learning from the process of problem-solving, rather than merely executing pre-programmed responses. This pursuit of adaptable intelligence necessitates a move beyond static models to those that actively seek knowledge, evaluate their own reasoning, and evolve their capabilities over time.

The pursuit of increasingly capable large language models, as demonstrated by Multi-Agent Reflexion, inevitably introduces complexity. It isn’t about achieving a flawless system, but fostering one capable of iterative refinement through internal discourse. As Linus Torvalds once stated, “Most developers think lots of testing is expensive, but I’ve found it’s much more expensive to fix a broken system.” This resonates with the MAR framework’s core principle – that robust reasoning isn’t born from solitary introspection, but from the rigorous debate between diverse critical perspectives. A system that never breaks is, after all, a system that has ceased to grow, failing to adapt to the inherent uncertainty of complex problems.

What’s Next?

The architecture reveals itself, predictably. This work, in attempting to formalize self-improvement, has merely relocated the failure points. The debate framework, while demonstrably effective, assumes a stability of ‘persona’ that is almost certainly illusory. Each iteration refines the masks, not the reasoning itself. The system doesn’t learn; it accrues increasingly sophisticated justifications for pre-existing biases, dressed in the rhetoric of critical discourse.

Future work will inevitably focus on scaling these debates, on increasing the ‘diversity’ of critical voices. But this feels like rearranging deck chairs. The fundamental problem isn’t a lack of adversarial pressure, but the brittleness of the underlying model. A more fruitful line of inquiry might explore methods for deliberately introducing instability, for seeding the system with productive contradictions that resist easy resolution.

One suspects that true self-improvement isn’t a matter of optimization, but of controlled demolition. Perhaps the next generation of these systems will not strive for coherence, but for a graceful acceptance of its own inevitable inconsistencies. Every deploy is, after all, a small apocalypse.


Original article: https://arxiv.org/pdf/2512.20845.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-27 17:46