Author: Denis Avetisyan
New research demonstrates how automatically refining the instructions given to teams of AI agents can dramatically improve their ability to gather complex information and produce insightful reports.

This review explores the application of prompt optimization techniques like TextGrad and GEPA to multi-agent systems, showing significant performance gains even with minimal initial prompting.
Achieving robust performance in complex information synthesis tasks often demands substantial manual effort in prompt engineering and system design. This work, ‘Self-Optimizing Multi-Agent Systems for Deep Research’, investigates automated methods for optimizing multi-agent systems tasked with gathering and synthesizing information from large document collections. Our results demonstrate that employing techniques like TextGrad and GEPA to evolve agent prompts can yield Deep Research systems that match or exceed the performance of expertly crafted configurations, even when initialized with minimal prompting. Could this self-optimization approach unlock a new paradigm for building adaptable and high-performing information synthesis tools?
Deconstructing the Information Flood
Contemporary research faces an unprecedented challenge stemming from the sheer volume and fragmented nature of available information. Traditional methodologies, designed for curated collections and limited datasets, frequently fail to capture the full scope of relevant knowledge. This limitation often results in incomplete analyses, skewed perspectives, and ultimately, biased insights. The modern information landscape, characterized by rapidly evolving digital sources, diverse viewpoints, and the proliferation of misinformation, demands a shift beyond simple data gathering. Studies reveal that relying solely on conventional approaches can inadvertently amplify existing biases, overlook critical information, and hinder the development of truly comprehensive understandings of complex phenomena. Consequently, researchers must actively address these methodological shortcomings to ensure the integrity and validity of their findings.
The sheer volume of data in contemporary research necessitates a shift beyond traditional methods. Synthesizing insights from a multitude of sources – academic papers, news reports, social media, and specialized databases – quickly overwhelms manual approaches and the limitations of basic keyword searches. This isn’t simply a matter of finding more information, but of discerning patterns, identifying biases, and establishing connections across disparate fields. Consequently, researchers are increasingly reliant on computational tools – including machine learning and natural language processing – to automate the process of information extraction, analysis, and ultimately, knowledge creation. These technologies allow for the identification of subtle relationships and previously unseen trends, moving beyond surface-level understanding to a more holistic and nuanced comprehension of complex phenomena.
Automated Dissection: A Multi-Agent System
A Multi-Agent System (MAS) facilitates the decomposition of complex research inquiries into discrete, manageable sub-tasks. This decomposition allows for the parallel execution of these sub-tasks, significantly reducing overall research time compared to sequential processing. Each sub-task is assigned to a specialized agent within the system, enabling focused processing and reducing the computational burden on any single component. The parallel nature of the MAS also promotes exploration of multiple research avenues simultaneously, enhancing the potential for novel discoveries and comprehensive synthesis of information from diverse sources. This approach differs from monolithic systems by distributing the workload and promoting modularity, which improves scalability and maintainability.
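The decompose-then-parallelize pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sub-task facets and the `research_subtask` stand-in are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def research_subtask(question: str) -> str:
    """Stand-in for a specialized agent processing one sub-task."""
    return f"findings for: {question}"

def run_parallel_research(query: str) -> list[str]:
    # Decompose the complex inquiry into discrete, manageable sub-tasks.
    subtasks = [
        f"{query} -- background",
        f"{query} -- recent results",
        f"{query} -- open problems",
    ]
    # Execute the sub-tasks in parallel rather than sequentially,
    # reducing wall-clock research time.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(research_subtask, subtasks))

results = run_parallel_research("prompt optimization for multi-agent systems")
```

Because each sub-task is independent, the same structure scales by swapping the thread pool for process- or service-level parallelism.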
The automated research system is structured around four specialized agents, each handling a distinct phase of the research process. The Orchestrator Agent initiates and manages the overall workflow, assigning tasks to other agents and monitoring progress. Reader Agents are responsible for retrieving and parsing information from relevant sources, including academic papers and online databases. The Aggregator Agent synthesizes the information gathered by the Reader Agents, identifying key findings and resolving conflicting data. Finally, the Writer Agent generates the final research output, composing reports, summaries, or other specified deliverables based on the aggregated information.
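The four-role workflow might be wired up as below. The class and method names are illustrative stand-ins for the Orchestrator, Reader, Aggregator, and Writer roles, not the system's actual code.

```python
class ReaderAgent:
    """Retrieves and parses passages from sources."""
    def retrieve(self, task: str) -> str:
        return f"passages for {task}"

class AggregatorAgent:
    """Synthesizes findings gathered by the Reader Agents."""
    def synthesize(self, passages: list[str]) -> str:
        return " | ".join(passages)

class WriterAgent:
    """Composes the final deliverable from aggregated findings."""
    def compose(self, summary: str) -> str:
        return f"REPORT: {summary}"

class OrchestratorAgent:
    """Initiates the workflow and hands results between agents."""
    def __init__(self):
        self.readers = [ReaderAgent(), ReaderAgent()]
        self.aggregator = AggregatorAgent()
        self.writer = WriterAgent()

    def run(self, query: str) -> str:
        # Assign retrieval to the Readers, then aggregate, then write.
        passages = [r.retrieve(query) for r in self.readers]
        summary = self.aggregator.synthesize(passages)
        return self.writer.compose(summary)

report = OrchestratorAgent().run("deep research query")
```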
The core functionality of each agent within the multi-agent system is powered by the `GPT-4.1-mini` large language model. This model facilitates information processing, including text comprehension, summarization, and generation, for each designated task. Specifically, `GPT-4.1-mini` enables the Reader Agent to extract relevant data from research papers, the Aggregator Agent to synthesize findings from multiple sources, and the Writer Agent to formulate coherent and well-structured reports. Utilizing this model ensures a consistent level of output quality and allows for efficient handling of complex information required for automated research workflows.
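One way to share a single model across all agent roles is to bind each role's system prompt to a common chat callable. The sketch below assumes an OpenAI-compatible chat interface injected as a function; the stub model and prompts are illustrative only.

```python
from typing import Callable

ModelFn = Callable[[list[dict]], str]  # maps chat messages to a completion

MODEL = "gpt-4.1-mini"  # the single backbone shared by all agents

def make_agent(role_prompt: str, call_model: ModelFn) -> Callable[[str], str]:
    """Bind a role-specific system prompt to the shared model."""
    def agent(task: str) -> str:
        messages = [
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": task},
        ]
        return call_model(messages)
    return agent

# A stub model lets the wiring be exercised without API access;
# in production this would call the real gpt-4.1-mini endpoint.
def stub_model(messages: list[dict]) -> str:
    return f"[{MODEL}] handled: {messages[-1]['content']}"

reader = make_agent("Extract relevant data from the paper.", stub_model)
out = reader("Find the reported GEPA score.")
```

Keeping the model behind a callable also makes it trivial to swap backbones when comparing configurations.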

The Art of the Prompt: Genetic Algorithm Optimization
Automated prompt optimization is a core component of our agent performance strategy. We utilize techniques such as TextGrad and GEPA to refine the input prompts used by our agents, moving beyond manual prompt engineering. These methods systematically analyze and modify prompts to maximize performance based on predefined evaluation metrics. The goal is to identify prompts that elicit the most accurate and relevant responses from the underlying language models, thereby improving the overall effectiveness of our agents without requiring continuous human intervention in prompt creation and tuning.
The Genetic Algorithm employed within the GEPA framework functions by iteratively generating and evaluating a population of prompt variations. Each prompt within the population is assessed against a predefined evaluation metric, quantifying its performance in eliciting desired responses. Prompts demonstrating higher performance are then selected for “reproduction,” wherein their characteristics are combined and mutated to create new prompt variations for the next generation. This process of selection, reproduction, and mutation is repeated over multiple generations, effectively exploring a large solution space of potential prompts and converging on those that consistently maximize the evaluation metric. The algorithm’s search capability allows it to identify high-performing prompts without requiring explicit human guidance or predefined templates.
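The select-reproduce-mutate loop can be sketched as a toy genetic search over prompt strings. The evaluation metric and mutation vocabulary here are placeholders; a real GEPA-style run scores each candidate prompt on actual task performance.

```python
import random

random.seed(0)  # deterministic for the example

def evaluate(prompt: str) -> float:
    """Stand-in metric: real systems score prompts on downstream task quality."""
    keywords = ("cite sources", "be concise", "step by step")
    return sum(kw in prompt for kw in keywords) / len(keywords)

def mutate(prompt: str) -> str:
    """Append one instruction fragment; a crude stand-in for prompt mutation."""
    extras = ["cite sources", "be concise", "step by step", "use headings"]
    return prompt + " " + random.choice(extras) + "."

def genetic_prompt_search(seed_prompt: str, population=8, generations=5) -> str:
    pool = [mutate(seed_prompt) for _ in range(population)]
    for _ in range(generations):
        # Selection: keep the top-scoring half of the population...
        pool.sort(key=evaluate, reverse=True)
        survivors = pool[: population // 2]
        # ...then refill the population by mutating the survivors.
        pool = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pool, key=evaluate)

best = genetic_prompt_search("Summarize the retrieved documents.")
```

Over generations the population drifts toward prompts that maximize the metric, with no human guidance beyond the metric itself.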
Performance evaluations indicate that GEPA achieves a score of 0.705 when applied to minimally constructed prompts. This result surpasses the performance of both TextGrad (0.685) and OpenAI’s optimization method (0.583) under the same conditions. Furthermore, when utilizing prompts crafted by experts, GEPA maintains strong performance with a score of 0.701, continuing to outperform TextGrad (0.670) and OpenAI’s optimizer (0.667). These results demonstrate GEPA’s efficacy across varying prompt quality levels.
Performance evaluations demonstrate substantial gains achieved through prompt optimization. Specifically, GEPA improved overall scores by 0.141 when applied to minimal prompts, increasing performance from a baseline of 0.564 to 0.705. TextGrad exhibited an equivalent improvement of 0.141, raising scores from 0.513 to 0.654 under the same optimization process. These results indicate that both methods effectively enhance prompt quality, though GEPA achieved the higher post-optimization score.
From Raw Data to Actionable Insight: The Research Pipeline
The initial stage of knowledge discovery relies on a dedicated component, the Reader Agent, which functions as a highly focused web harvester. This agent doesn’t simply search the internet; it strategically accesses the FineWeb Index – a curated and refined collection of web content – guided by a precise Query Planning established by the orchestrator. This planning process decomposes complex research questions into a series of targeted information requests, ensuring the agent retrieves only the most relevant data. By operating within this indexed environment and adhering to the pre-defined query structure, the Reader Agent efficiently gathers the raw materials necessary for subsequent analysis, minimizing noise and maximizing the signal-to-noise ratio in the overall research pipeline. The speed and accuracy of this initial retrieval are paramount, as they directly influence the quality and timeliness of the final insights.
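The Query Planning step can be pictured as splitting one research question into targeted retrieval requests that the Reader Agent runs against an index. The facet list and the tiny dictionary standing in for the FineWeb Index are assumptions for illustration.

```python
def plan_queries(research_question: str) -> list[str]:
    """Orchestrator-style planning: decompose one question into
    targeted information requests (facets here are illustrative)."""
    facets = ["background", "key findings", "criticisms"]
    return [f"{research_question} ({facet})" for facet in facets]

def reader_search(index: dict[str, str], queries: list[str]) -> list[str]:
    """Reader-style retrieval against a toy stand-in for the FineWeb Index:
    only documents whose keyword matches a planned query are returned,
    keeping retrieval focused and noise low."""
    hits = []
    for query in queries:
        hits.extend(doc for keyword, doc in index.items() if keyword in query)
    return hits

index = {"background": "survey passage", "criticisms": "critique passage"}
docs = reader_search(index, plan_queries("prompt optimization"))
```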
The Aggregator Agent functions as a critical distillation hub within the research pipeline, transforming raw data into structured knowledge. Following information retrieval, this agent doesn’t simply compile documents; it actively synthesizes extracted passages into concise “mini-reports,” each addressing a specific facet of the initial query. These mini-reports aren’t intended as standalone conclusions, but rather as modular building blocks, providing focused evidence and analysis. By breaking down complex information into these manageable units, the agent facilitates a more nuanced and accurate Answer Generation process, ensuring that subsequent synthesis isn’t burdened by unorganized or redundant data. This modular approach also allows for greater transparency, as each claim within the final report can be traced back to its originating source via the corresponding mini-report.
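The mini-report idea lends itself to a simple data structure that keeps each summary paired with its provenance. The field names and the `aggregate` helper below are illustrative, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MiniReport:
    facet: str                      # the sub-question this report addresses
    summary: str                    # condensed evidence for that facet
    sources: list = field(default_factory=list)  # provenance for traceability

def aggregate(passages: dict) -> list[MiniReport]:
    """Turn retrieved (source, text) pairs, grouped by facet,
    into modular mini-reports for the Writer Agent."""
    reports = []
    for facet, items in passages.items():
        reports.append(MiniReport(
            facet=facet,
            summary=" ".join(text for _, text in items),
            sources=[src for src, _ in items],
        ))
    return reports

reports = aggregate({
    "performance": [
        ("doc1", "GEPA scored 0.705."),
        ("doc2", "TextGrad scored 0.685."),
    ],
})
```

Because every mini-report carries its source list, each claim in the final report can be traced back to its originating documents.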
The culmination of the research pipeline rests with the Writer Agent, a sophisticated component responsible for transforming fragmented data into a cohesive and insightful report. This agent doesn’t merely assemble information; it actively synthesizes the mini-reports generated by the Aggregator, identifying key themes, resolving discrepancies, and constructing a logically flowing narrative. Crucially, every assertion within the final report is meticulously source-attributed, providing complete transparency and enabling readers to verify the underlying evidence. The output isn’t simply a collection of facts, but rather a polished document designed to deliver actionable insights, empowering users to make informed decisions based on rigorously validated research.
The pursuit of optimized multi-agent systems, as detailed in this research, embodies a fundamental principle: understanding through deconstruction. It’s not enough to simply build; the system must be stressed, probed, and refined through iterative testing – essentially, broken down to reveal its underlying mechanics. This aligns perfectly with Donald Davies’ observation: “If you want to know how something works, you have to take it apart.” The article demonstrates this vividly by showcasing how techniques like TextGrad and GEPA, through a process of prompt optimization, expose the limitations of initial configurations and unlock significantly improved performance in Deep Research tasks. The ability to coax greater efficacy from minimal prompts isn’t merely efficiency; it’s an exploit of comprehension, a testament to the power of reverse-engineering complex systems.
Where Do We Go From Here?
The presented work demonstrates that even systems designed for ‘deep research’ – a term already brimming with aspiration – remain profoundly sensitive to the initial conditions of their prompting. Optimization techniques like TextGrad and GEPA aren’t merely refining performance; they’re revealing how fragile intelligence can be, how readily complex behavior emerges from – and collapses without – carefully sculpted instruction. The implication isn’t that these systems are failing, but that the very notion of ‘general’ intelligence requires an equally general scaffolding of prompts – a perpetually shifting foundation.
Future work must move beyond optimizing for a fixed research task. The real challenge lies in building systems that optimize themselves, iteratively refining not just prompt content, but the optimization algorithms themselves. Currently, these approaches treat prompts as static entities to be tuned. What happens when the system begins to rewrite the rules of prompt construction – to evolve its own methods of inquiry? This necessitates a deeper investigation into meta-optimization – optimization of the optimizers – and a willingness to tolerate, even embrace, emergent and unpredictable behaviors.
Ultimately, the limitations of current automated prompt engineering aren’t technical; they’re philosophical. The pursuit of ‘optimal’ prompts implies a knowable ‘truth’ to be extracted, a fixed target for information gathering. But reality rarely conforms to such neat expectations. Perhaps the most fruitful path forward involves designing systems that are comfortable with ambiguity, that thrive in the space between information and uncertainty – systems that don’t seek to eliminate the chaos, but to learn from it.
Original article: https://arxiv.org/pdf/2604.02988.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-06 09:08