Author: Denis Avetisyan
A new framework demonstrates how teams of AI agents can dramatically improve the efficiency and performance of complex research tasks.

MindDR, a multi-agent system leveraging reinforcement learning and large language models, achieves leading results with approximately 30 billion parameters through optimized data synthesis and a staged training process.
Achieving leading performance in complex deep research tasks typically demands substantial model scale, creating a cost and efficiency challenge. This paper, ‘Mind DeepResearch Technical Report’, introduces MindDR, a multi-agent framework that overcomes this limitation by achieving competitive results with approximately [latex]30B[/latex]-parameter models. MindDR leverages a collaborative three-agent architecture and a four-stage training pipeline, including preference alignment, to maximize performance across tasks like browsing and long-form generation, demonstrated by state-of-the-art scores on benchmarks including a novel, curated dataset, MindDR Bench. Could this cost-effective approach unlock broader access to powerful deep research capabilities?
The Inherent Limitations of Superficial Information Retrieval
Contemporary information retrieval often falls short when tackling complex research questions. Traditional methods, reliant on keyword searches and shallow pattern matching, frequently deliver a deluge of results that are tangentially related, or simply miss crucial nuances hidden within vast datasets. This superficiality stems from an inability to grasp the contextual relationships between concepts, or to synthesize information from disparate sources effectively. Consequently, researchers spend considerable time sifting through irrelevant material, struggling to build a comprehensive understanding, and potentially overlooking critical insights. The sheer volume of available data, coupled with the increasing sophistication of information obfuscation, exacerbates this problem, demanding a paradigm shift beyond simple data aggregation towards systems capable of true cognitive processing.
Simply increasing the size of current large language models will not unlock genuinely deep research capabilities. While scaling enhances a model's ability to recall and synthesize information from its training data, it doesn't address fundamental limitations in reasoning and planning. These models excel at pattern recognition but struggle with tasks requiring multi-step inference, hypothesis generation, and the critical evaluation of conflicting evidence – all hallmarks of thorough research. A larger model may offer more comprehensive answers, but without architectural innovations focused on robust reasoning, it remains prone to generating plausible-sounding yet ultimately superficial or inaccurate conclusions. True depth necessitates a shift beyond mere scale, demanding systems capable of actively formulating research questions, strategically exploring information spaces, and resolving inconsistencies to build genuinely novel insights.
Truly insightful research transcends simple information gathering; it demands a sophisticated interplay of cognitive skills. While extensive knowledge provides the raw material, the ability to synthesize this information hinges on robust reasoning capabilities – the capacity to draw logical inferences and identify underlying principles. Equally critical is strategic planning, allowing a researcher to formulate hypotheses, design effective investigative pathways, and prioritize relevant data. However, information is rarely uniform, and conflicting evidence is commonplace; therefore, successful deep research necessitates skillful conflict resolution, the ability to critically evaluate competing claims, and reconcile discrepancies to arrive at well-supported conclusions. It is this combination of breadth, logical acumen, foresight, and critical judgment that distinguishes superficial analysis from genuinely deep understanding.

A Multi-Agent Architecture for Principled Deep Research
MindDR employs a multi-agent architecture consisting of three core agents: the Planning Agent, the DeepSearch Agent, and the Report Agent. The Planning Agent is responsible for query decomposition, breaking down complex research requests into a series of discrete, actionable subtasks. The DeepSearch Agent then executes these subtasks by performing multi-step information retrieval across various data sources. Finally, the Report Agent integrates the retrieved information, resolves any conflicting data, and generates a cohesive and comprehensive report as the output. Each agent operates independently yet collaboratively within the system to achieve the overarching research goal.
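The division of labor described above can be sketched as a minimal pipeline. This is an illustrative sketch only: the class and function names are hypothetical stand-ins, not MindDR's actual API, and each agent body would in practice be an LLM-driven component.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    """One discrete information need produced by the Planning Agent."""
    question: str


def plan(query: str) -> list[Subtask]:
    # Illustrative decomposition: a real Planning Agent would use an LLM
    # to split the query into semantically coherent subtasks.
    return [Subtask(q.strip()) for q in query.split(";") if q.strip()]


def deep_search(task: Subtask) -> str:
    # Stand-in for multi-step retrieval across heterogeneous sources.
    return f"evidence for: {task.question}"


def report(findings: list[str]) -> str:
    # Stand-in for synthesis, conflict resolution, and report generation.
    return "\n".join(f"- {f}" for f in findings)


def run_pipeline(query: str) -> str:
    subtasks = plan(query)
    findings = [deep_search(t) for t in subtasks]
    return report(findings)


print(run_pipeline("survey multi-agent RL; compare open-source research agents"))
```

The point of the sketch is the control flow: the planner's output fans out to independent search calls, and all findings funnel into a single synthesis step.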
The Planning Agent within MindDR employs a query decomposition strategy to address complex research requests. This involves breaking down an initial, broad query into a series of discrete, sequentially-executed subtasks. Each subtask represents a specific information need, allowing for targeted searches and reducing the cognitive load on subsequent agents. This hierarchical approach enables the system to manage complexity and maintain focus throughout the research process, improving both efficiency and the quality of results by ensuring each component contributes to a clearly defined objective. The decomposition is not pre-defined but dynamically generated based on the initial query's structure and semantic content.
The DeepSearch Agent utilizes a recursive retrieval process to address multi-step information needs. This agent doesn't rely on single-query searches; instead, it formulates a series of interconnected queries based on initial results and intermediate findings. It accesses information from heterogeneous sources, including text corpora, knowledge graphs, and web-based data, and dynamically adjusts search strategies based on the content encountered. This iterative approach allows the agent to explore complex topics, refine its understanding through each retrieval step, and effectively gather relevant information that would likely be missed by traditional, single-query search methods.
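The retrieve-then-reformulate loop can be expressed generically. In this sketch, `search_fn` and `refine_fn` are assumed callbacks standing in for the retrieval backend and the LLM-based query reformulator, neither of which is specified in the report:

```python
def recursive_search(question, search_fn, refine_fn, max_steps=4):
    """Iteratively retrieve, then reformulate the query from what was found.

    search_fn(query) -> list of result snippets (hypothetical retrieval backend)
    refine_fn(question, evidence) -> next query string, or None once the
        agent judges the information need satisfied
    """
    evidence, query = [], question
    for _ in range(max_steps):
        results = search_fn(query)
        evidence.extend(results)
        query = refine_fn(question, evidence)
        if query is None:  # the agent decides enough has been gathered
            break
    return evidence
```

The `max_steps` cap is a common safeguard for such loops: without it, a reformulator that never returns `None` would search indefinitely.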
The Report Agent within the MindDR architecture is responsible for the final stage of research, consolidating data retrieved by the DeepSearch Agent. This process includes information synthesis from multiple sources, conflict resolution between potentially contradictory statements, and the generation of a comprehensive report. The agent is designed to produce human-aligned reports, meaning the output is structured and presented in a manner consistent with expectations for clarity, conciseness, and logical flow, ensuring the final product is readily understandable and actionable for a human user.
![The MindDR framework utilizes a planning agent to decompose user queries into subtasks, which are independently executed by deep search agents employing [latex]\text{ReAct}[/latex] and extended chain-of-thought reasoning, and then synthesized into a comprehensive, citation-supported report by a report agent.](https://arxiv.org/html/2604.14518v1/x2.png)
A Four-Stage Training Pipeline: Towards Robust Reasoning
The initial stage of the MindDR training pipeline utilizes Supervised Fine-Tuning (SFT) as a cold-start procedure. This involves training the model on a dataset of instruction-following and tool-use examples to establish a baseline capability for understanding and executing user requests. The SFT process provides the foundational skills necessary for subsequent reinforcement learning stages by pre-training the model to generate appropriate responses and leverage external tools. This pre-training minimizes the exploration required during reinforcement learning, improving training stability and sample efficiency. The resulting model serves as a starting point for optimizing long-horizon reasoning, report generation, and human preference alignment.
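A standard detail of such cold-start SFT, worth making concrete, is that the cross-entropy loss is usually computed only over the response and tool-call tokens, with prompt tokens masked out so the model learns to answer rather than to echo instructions. The report does not specify MindDR's exact loss, so this is a generic sketch:

```python
def masked_sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: model log-probs of each target token in the sequence
    loss_mask: 1 for response/tool-call tokens, 0 for prompt tokens,
               so gradient flows only through what the model should produce.
    """
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(kept) / len(kept)


# A 5-token sequence where only the last 3 tokens (the response) count.
loss = masked_sft_loss(
    token_logprobs=[-0.1, -0.2, -0.5, -1.0, -1.5],
    loss_mask=[0, 0, 1, 1, 1],
)
print(loss)  # mean of 0.5, 1.0, 1.5 -> 1.0
```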
Following Supervised Fine-Tuning (SFT), the training pipeline incorporates Search-RL to enhance the DeepSearch Agent's performance in complex, multi-step reasoning and information retrieval. This stage utilizes the Group Relative Policy Optimization (GRPO) algorithm, which dispenses with a learned value critic: for each query it samples a group of rollouts, scores them with the reward function, and computes each rollout's advantage relative to its group. GRPO enables the agent to learn efficient search strategies by maximizing cumulative rewards derived from successful task completion, improving both the speed and accuracy of the information gathering required for downstream report generation. The goal is to optimize the agent's ability to formulate effective search queries and synthesize relevant information over extended reasoning horizons.
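The core of GRPO-style training is the group-relative advantage: rewards for rollouts of the same query are normalized against each other, so no separate value network is needed. A minimal sketch of that normalization (function name illustrative):

```python
import statistics


def grpo_advantages(group_rewards):
    """Group-relative advantages: reward minus the group mean, scaled by
    the group standard deviation. Rollouts that beat their siblings get
    positive advantage; identical rewards yield zero advantage for all.
    """
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard: all-equal group
    return [(r - mu) / sigma for r in group_rewards]


# Four rollouts for the same research query, scored by the reward function.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight the policy-gradient update for each rollout's tokens; the scheduled reward coefficients shown in the training figure would enter when composing `group_rewards` from the individual reward terms.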
The Report-RL stage focuses on refining the Report Agent's ability to generate extended, coherent text. This is achieved through reinforcement learning with the DAPO algorithm (Decoupled Clip and Dynamic Sampling Policy Optimization) and a Large Language Model (LLM) functioning as a judge. DAPO stabilizes policy optimization over long generations, while the LLM-as-Judge provides scalable reward signals by evaluating the quality of generated reports against predefined criteria. This combination enables the agent to learn complex relationships between input instructions and desired long-form output characteristics, leading to improved content quality and adherence to specified reporting guidelines.
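An LLM-as-Judge reward, in its simplest form, reduces to prompting a strong model with a rubric and parsing a scalar score from its reply. The sketch below assumes a hypothetical `llm_call` callable (any chat-completion client returning a string); the rubric and normalization are illustrative, not MindDR's actual criteria:

```python
def judge_reward(report_text, rubric, llm_call):
    """Score a generated report with an LLM judge.

    llm_call: stand-in for a chat-completion client; takes a prompt string,
              returns the judge model's text reply.
    """
    prompt = (
        "Score the report from 1 to 10 against this rubric.\n"
        f"Rubric: {rubric}\nReport:\n{report_text}\n"
        "Reply with the number only."
    )
    raw = llm_call(prompt)
    score = float(raw.strip())
    return (score - 1.0) / 9.0  # normalize to [0, 1] for RL training


# With a stubbed judge that always answers "8":
print(judge_reward("...", "coverage, citations, coherence", lambda p: "8"))
```

Production judges typically add retries and stricter parsing, since a free-form judge reply is not guaranteed to be a bare number.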
Preference Alignment constitutes the final stage of MindDR training, focusing on refining the model's output to align with human expectations regarding both factual correctness and overall report quality. This is achieved through the systematic collection and utilization of human feedback on generated reports. Evaluators provide preference rankings between different report variations, indicating which responses are deemed more accurate, insightful, and helpful. These preference signals are then used to train a reward model, which learns to predict human preferences. The reward model subsequently guides further model training via Reinforcement Learning from Human Feedback (RLHF), iteratively calibrating the system to consistently produce reports that meet human standards for accuracy and usefulness.
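Reward models trained from pairwise rankings conventionally use a Bradley-Terry objective: the probability that the preferred report outscores the rejected one is modeled as a sigmoid of the score margin. The report does not state MindDR's exact objective, so this is the standard formulation as a sketch:

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the human-preferred report outscores
    the rejected one: -log(sigmoid(r_chosen - r_rejected)). Minimizing it
    pushes the reward model to rank preferred reports higher."""
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))


# A correctly separated pair is cheap; a mis-ranked pair is expensive.
print(bradley_terry_loss(2.0, 0.0))  # small loss
print(bradley_terry_loss(0.0, 2.0))  # large loss
```

Once fitted, the reward model's scalar output replaces direct human labels during the RL phase, which is what makes preference alignment scale beyond the labeled pairs.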
![Search-RL training, tracked over 180 steps, demonstrates progressive capability development, achieved through dynamically scheduled reward coefficients [latex]\lambda_{tool}[/latex], [latex]\lambda_{format}[/latex], [latex]\lambda_{PRM}[/latex], [latex]\lambda_{ORM}[/latex], resulting in improvements across answer accuracy, entity coverage, tool use, and format compliance.](https://arxiv.org/html/2604.14518v1/x5.png)
Empirical Validation: Demonstrating Superior Performance
Evaluations on the MindDR Bench – a challenging dataset comprising 500 authentic Chinese user queries – demonstrate the framework's leading performance in deep research tasks. This curated benchmark rigorously tests a system's ability to understand complex information needs and synthesize relevant answers from the web, and MindDR consistently achieved top results. The framework's success on this demanding test highlights its proficiency in handling real-world queries, going beyond simple question-answering to encompass the nuances of open-ended information exploration and complex reasoning – a crucial step toward building truly intelligent research agents.
Evaluations beyond the MindDR Bench confirm the framework's robust performance and adaptability to diverse research tasks. Strong results on the DeepResearch Bench, a challenging dataset designed to assess complex information seeking, and BrowseComp-ZH, which focuses on web browsing and comprehension in Chinese, demonstrate that MindDR's capabilities extend beyond a single curated dataset. This success indicates the framework isn't simply memorizing answers, but rather developing a generalizable ability to effectively navigate information, synthesize knowledge, and respond to complex queries – a critical step toward creating truly intelligent research agents.
The MindDR framework distinguishes itself as a cost-effective solution for deep research, achieving state-of-the-art performance utilizing models with approximately 30 billion parameters. This represents a significant advancement, as MindDR demonstrably surpasses the capabilities of currently available open-source systems in complex information retrieval and reasoning tasks. By strategically coordinating multiple agents within a unified framework, MindDR maximizes the potential of relatively smaller models, offering a compelling alternative to the resource-intensive demands of larger language models. This efficiency not only reduces computational costs but also broadens accessibility to advanced research capabilities, positioning MindDR as a practical and powerful tool for a wider range of applications.
Evaluations demonstrate MindDR's leading performance in complex reasoning tasks; the framework achieved a score of 45.7% on the BrowseComp-ZH benchmark, surpassing all other publicly available open-source agent-style systems. Further validation on the MindDR Bench yielded a RACE score of 51.8, indicating a robust capacity for reading comprehension and answering challenging questions. These results highlight MindDR's effectiveness as a cost-efficient deep research framework, capable of achieving state-of-the-art results even with models containing approximately 30 billion parameters, and establishing a new benchmark for performance in the field.
To overcome limitations in available training data for complex reasoning tasks, the framework utilizes a Knowledge Graph to proactively synthesize new examples. This approach doesn’t simply rely on existing datasets, but instead constructs targeted training instances by leveraging the structured relationships within the Knowledge Graph. The resulting synthetic data demonstrably improves both data efficiency – allowing the system to achieve strong performance with fewer examples – and generalization capabilities, enabling it to tackle a wider range of unseen queries and scenarios. By effectively augmenting the training process with synthesized knowledge, the system learns more robust and adaptable reasoning skills, ultimately boosting performance on benchmark evaluations.
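One common way to synthesize multi-hop training examples from a knowledge graph is to walk a chain of relations and turn the path into a question whose answer is the final entity, which forces multi-step retrieval rather than single-fact lookup. The report does not detail MindDR's synthesis procedure, so the mini-graph, function, and question template below are purely illustrative:

```python
# A toy knowledge graph: (head, relation, tail) triples.
KG = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "currency", "zloty"),
]


def synthesize_multihop(kg, start, hops=2):
    """Walk `hops` relations from `start` and emit a QA training example.

    Chained relations make the answer unreachable from any single triple,
    so the trained agent must plan and retrieve step by step.
    """
    entity, path = start, []
    for _ in range(hops):
        step = next(((r, t) for h, r, t in kg if h == entity), None)
        if step is None:  # dead end in the graph: stop early
            break
        path.append(step[0])
        entity = step[1]
    question = f"Starting from {start}, follow: {' -> '.join(path)}?"
    return {"question": question, "answer": entity}


print(synthesize_multihop(KG, "Marie Curie", hops=2))  # answer: Poland
```

In a real pipeline an LLM would verbalize the relation path into a natural-language question, and the known final entity serves as a free, automatically verifiable gold answer.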
The MindDR framework's success in complex reasoning stems from its Extended Chain-of-Thought mechanism, a sophisticated approach to coordinating multiple agents. This mechanism moves beyond simple sequential thought processes by allowing agents to not only formulate individual steps, but also to anticipate the needs of subsequent agents, proactively providing relevant information and context. This inter-agent communication fosters a collaborative problem-solving environment where each agent builds upon the insights of others, ultimately enabling the system to tackle intricate queries that demand nuanced understanding and multi-step inference. By effectively synthesizing information across agents, the framework avoids the pitfalls of isolated reasoning, achieving a more robust and accurate approach to deep research tasks.
Future Directions: Towards Autonomous Scientific Discovery
The continued development of MindDR hinges on its ability to process increasingly vast and intricate datasets, pushing the boundaries of automated scientific inquiry. Future efforts will concentrate on scaling the system's architecture and algorithms to accommodate datasets orders of magnitude larger than those currently utilized. This scaling isn't merely about handling volume; it requires innovations in data indexing, retrieval, and representation to maintain efficiency and accuracy. Simultaneously, the system will be challenged with increasingly complex queries – those demanding multi-step reasoning, nuanced interpretation, and the synthesis of information from disparate sources. Successfully navigating these challenges will not only enhance MindDR's analytical capabilities but also unlock its potential to address previously intractable scientific problems and accelerate the pace of discovery across numerous disciplines.
Continued advancement of MindDR, and similar autonomous research agents, hinges on refining the algorithms that guide their exploration and discovery. Current reinforcement learning approaches often rely on carefully engineered reward functions, which can be brittle and limit the agent’s ability to generalize. Future research will prioritize the development of more robust and adaptable algorithms, potentially leveraging techniques like intrinsic motivation and curiosity-driven learning, to allow the agent to independently define and pursue valuable research directions. This includes investigating reward functions that incentivize not only the accuracy of findings but also the novelty and impact of proposed hypotheses, ultimately fostering a more creative and efficient scientific process.
Future advancements in autonomous research necessitate a shift towards systems capable of integrating and dynamically updating their knowledge base. Currently, many AI models rely on static datasets, limiting their ability to address novel inquiries or incorporate the latest scientific findings. Researchers are actively exploring methods to allow agents to access and synthesize information from diverse external sources – including scientific databases, pre-print servers, and real-time data streams – effectively creating a continuously learning system. This adaptive capacity isn't merely about accessing more data; it requires sophisticated techniques for knowledge representation, validation, and integration, ensuring the agent can discern credible information and reconcile conflicting findings within an ever-changing information landscape. Successfully implementing such systems will unlock the potential for truly autonomous discovery, enabling agents to formulate hypotheses, design experiments, and refine understanding beyond the limitations of their initial training.
The development of MindDR showcases a marked improvement in training efficiency for large language models applied to scientific reasoning. Current methodologies often demand substantial computational resources; however, this research achieved comparable performance with 1.03 billion training tokens and 6,000 GPU card-hours. Against previous-generation systems that required 3.6 billion tokens and 15,000 card-hours, this represents roughly a 71% reduction in training tokens and a 60% reduction in compute time. Such a decrease in resource intensity not only lowers the financial and environmental costs associated with model training, but also broadens accessibility, potentially enabling a wider range of researchers and institutions to participate in advanced AI-driven scientific exploration.
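The reported savings follow directly from the figures quoted above and are easy to verify:

```python
# Figures reported for MindDR vs. previous-generation systems.
tokens_new, tokens_old = 1.03e9, 3.6e9    # training tokens
hours_new, hours_old = 6_000, 15_000      # GPU card-hours

token_reduction = 1 - tokens_new / tokens_old
hour_reduction = 1 - hours_new / hours_old

print(f"token reduction: {token_reduction:.1%}")     # 71.4%
print(f"GPU-hour reduction: {hour_reduction:.1%}")   # 60.0%
```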
The long-term objective centers on the development of fully autonomous research agents, systems poised to independently formulate hypotheses, design experiments, analyze data, and ultimately, propel scientific discovery. These agents represent a paradigm shift, moving beyond tools that simply assist researchers to entities capable of driving innovation with minimal human intervention. By automating the iterative process of scientific inquiry, such agents promise to accelerate the pace of discovery across diverse fields, potentially unlocking solutions to complex challenges currently beyond reach. This vision extends beyond mere automation; it anticipates a future where artificial intelligence actively contributes to the expansion of human knowledge and fosters a new era of scientific advancement, unconstrained by the limitations of time and resources.

The MindDR framework, detailed in this report, embodies a commitment to provable efficacy. It isn't merely about achieving results, but about constructing a system with demonstrable efficiency through meticulous data synthesis and a multi-stage training pipeline. This aligns with Alan Turing's assertion: "The question of whether a machine can think is too loaded with the vagueness of language." MindDR strives to replace ambiguity with quantifiable metrics (search efficiency and preference alignment), creating a system whose "thinking" is defined by demonstrable performance, not subjective interpretation. The focus on scalable, cost-effective models of approximately 30B parameters reflects a mathematical purity, prioritizing algorithmic elegance over brute-force computational power.
Future Directions
The presented framework, while demonstrating efficacy with relatively constrained model parameters, merely shifts the fundamental challenge rather than resolving it. The apparent success hinges on a carefully orchestrated data synthesis pipeline – a process inherently susceptible to bias and, crucially, lacking a formal guarantee of truth. One must ask: does efficiency in exploration compensate for potential flaws in the very foundations upon which knowledge is built? The observed performance, while numerically impressive, remains an empirical observation, not a logically derived necessity.
Future efforts should prioritize formalizing the data synthesis process, perhaps through the application of techniques from automated theorem proving or constructive logic. Simply scaling the model or increasing the volume of synthesized data offers diminishing returns if the underlying generative process is not rigorously defined. The question isn't whether a system appears to reason, but whether its conclusions are demonstrably valid, independent of any specific training regime.
Ultimately, the pursuit of "intelligent" agents necessitates a move beyond superficial performance metrics. The true test lies in the ability to construct systems whose behavior is not merely predictable, but provably correct. Until then, the field remains tethered to empirical validation – a frustratingly imprecise substitute for genuine understanding.
Original article: https://arxiv.org/pdf/2604.14518.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/