The AI Scientist: Automating Research Plan Creation

Author: Denis Avetisyan


A new framework uses artificial intelligence to autonomously generate detailed research plans, bypassing the need for human-created training data.

The system trains models to autonomously generate research plans and evaluates their quality through rubric-based rewards. The underlying dataset is curated by a separate model that selects representative examples from existing research papers; each plan is assessed against both goal-specific rubrics and general guidelines, and the fraction of satisfied criteria forms the basis for training and evaluation metrics.

ResearchPlanGen leverages large language models and self-rewarding reinforcement learning to extract rubrics and solutions from existing scientific literature.

Despite advances in artificial intelligence, consistently generating rigorous research plans remains a challenge for language models. In ‘Training AI Co-Scientists Using Rubric Rewards’, we address this limitation by introducing a self-improving framework that automatically trains models to generate detailed research plans using grading rubrics and solutions extracted from existing scientific literature. This approach, leveraging reinforcement learning with self-grading, achieves significant improvements in plan quality, validated by human experts and frontier models across machine learning and medical research, without requiring external annotation. Could this scalable, automated training recipe represent a crucial step towards realizing truly general AI co-scientists capable of independent scientific discovery?


The Illusion of Automated Discovery

Scientific advancement fundamentally relies on well-defined research plans, serving as roadmaps for investigation and discovery. Despite its critical importance, the creation of these plans predominantly remains a labor-intensive, manual process, demanding significant time and expertise from researchers. This often involves iteratively refining hypotheses, designing experiments, anticipating potential challenges, and meticulously outlining procedures – tasks currently beyond the reliable scope of most automated systems. The continued dependence on manual planning creates a bottleneck in scientific workflows, limiting the scale and speed at which new knowledge can be generated and potentially hindering progress across diverse fields of study. Consequently, automating aspects of research planning represents a significant opportunity to accelerate the pace of scientific innovation.

Current large language models, despite their proficiency in natural language processing, encounter significant obstacles when tasked with devising comprehensive research strategies. The creation of a robust scientific plan demands more than simply identifying relevant information; it necessitates nuanced reasoning about experimental design, hypothesis testing, and the logical sequencing of investigations. LLMs often struggle with this structured, multi-step thinking, frequently exhibiting difficulties in prioritizing experiments, anticipating potential pitfalls, or effectively integrating prior knowledge into a coherent research roadmap. This limitation isn’t a matter of lacking data, but rather a deficit in the ability to apply abstract reasoning and causal inference – cognitive skills crucial for formulating genuinely effective scientific inquiries and translating broad questions into actionable, testable hypotheses.

The protracted nature of manual research planning presents a significant bottleneck in scientific advancement. While countless datasets and computational tools exist, their effective deployment relies on meticulously crafted experimental strategies – a process demanding deep domain expertise and often requiring substantial time investment. This reliance on human-driven planning restricts the sheer volume of hypotheses that can be rigorously tested, and consequently, slows the rate at which new knowledge is generated. The inability of current large language models to consistently generate viable research pathways exacerbates this issue, limiting the potential for automated hypothesis generation and efficient exploration of the scientific landscape, ultimately hindering the pace of discovery and innovation across all disciplines.

Human evaluation demonstrates that finetuning the model significantly improves research plan quality, consistently yielding preferred plans over the base Qwen-3-30B model across tested criteria, with 95% confidence intervals estimated via bootstrap sampling.

Trading Manual Effort for Algorithmic Oversight

ResearchPlanGen addresses the limitations of current LLM-based research planning methods which typically require substantial human-annotated datasets for training. This framework introduces an alternative approach focused on autonomous learning, eliminating the need for costly and time-consuming manual labeling. By enabling LLMs to generate complete research plans – encompassing problem definition, hypothesis formulation, methodology, and expected outcomes – without direct human supervision, ResearchPlanGen aims to significantly reduce the resources required to develop AI capable of scientific inquiry. The system is designed to produce plans that are structurally complete and logically coherent, facilitating downstream tasks such as experiment design and data analysis.
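As a rough illustration of the kind of structured output involved, the sketch below models a plan as a small Python object with the four sections named above. The class and field names are illustrative assumptions, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ResearchPlan:
    """Illustrative container for the sections a generated plan is expected to cover."""
    problem_definition: str
    hypothesis: str
    methodology: str
    expected_outcomes: str

    def is_structurally_complete(self) -> bool:
        # A plan counts as structurally complete when every core section is non-empty.
        sections = (self.problem_definition, self.hypothesis,
                    self.methodology, self.expected_outcomes)
        return all(section.strip() for section in sections)
```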

The Generator-Verifier loop functions as the central component of the framework, employing two Large Language Model (LLM) instances in an iterative process. The ‘Generator’ LLM produces a complete research plan based on a given prompt. This plan is then submitted to the ‘Verifier’ LLM, which assesses the plan’s completeness, logical coherence, and relevance to the initial prompt. The Verifier outputs a score and detailed feedback, which is then used to refine the Generator’s subsequent planning attempts. This automated grading process, facilitated by the second LLM instance, enables scalable evaluation without requiring human annotation, and allows the Generator to continuously improve its planning capabilities through self-supervised learning.
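A minimal sketch of that loop, assuming simple `generate` and `verify` wrappers around the two LLM instances (their interfaces here are assumptions, not the paper's API):

```python
def generator_verifier_loop(prompt, generate, verify, n_rounds=3):
    """Iterate Generator and Verifier: draft a plan, grade it, feed the critique back.

    `generate(prompt, feedback)` returns a plan string; `verify(plan)` returns
    (score, feedback). Both stand in for calls to the two LLM instances.
    """
    feedback = None
    best_plan, best_score = None, float("-inf")
    for _ in range(n_rounds):
        plan = generate(prompt, feedback)   # Generator drafts or revises a full plan
        score, feedback = verify(plan)      # Verifier grades it and returns a critique
        if score > best_score:              # keep the highest-scoring attempt so far
            best_plan, best_score = plan, score
    return best_plan, best_score
```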

The Generator-Verifier loop facilitates self-supervised learning by enabling the generator LLM to iteratively improve research plan generation. The process begins with the generator LLM producing a plan, which is then evaluated by the verifier LLM based on predefined criteria – such as completeness, logical flow, and feasibility. The verifier provides a score or critique that is then fed back to the generator as training signal. This allows the generator to adjust its internal parameters and refine its planning strategies without requiring human-labeled data, creating a closed-loop system for continuous improvement and adaptation.

Domain-specific finetuning consistently improves research plan generation scores across machine learning, medical, and arXiv goals, with finetuned models outperforming the base Qwen-3-30B-A3B-Instruct model; larger, more recent GPT models achieve the highest scores and exhibit significant cross-domain generalization.

The Illusion of Rigor: Rubrics and Reinforcement

Rubric-Guided Training enhances the development of research plans by utilizing both goal-specific criteria and broader, general guidelines. This approach moves beyond simple task completion to prioritize the methodological soundness and overall quality of the generated plans. The rubric serves as a structured evaluation framework, informing the training process and enabling targeted improvements in areas such as research question formulation, methodology selection, and data analysis techniques. By explicitly defining expectations for both the content and the structure of the research plan, Rubric-Guided Training aims to produce more rigorous and well-defined outputs, increasing the likelihood of successful downstream research execution.

The Self-RewardRL component utilizes a pre-trained, fixed-parameter Large Language Model (LLM) functioning as a grader to provide automated reward signals during training. This LLM assesses generated research plans based on adherence to the established rubric, quantifying the quality of the plan according to pre-defined criteria. The resulting score is then used as a reward signal to reinforce desirable behaviors in the generator LLM, effectively creating a closed-loop reinforcement learning system where the generator learns to maximize rubric scores without requiring human intervention for evaluation.
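Per the description above, and the earlier note that the fraction of satisfied criteria forms the training metric, the reward reduces to the share of rubric items the frozen grader judges as met. A minimal sketch, assuming a hypothetical `grade_criterion` wrapper around the grader LLM:

```python
def rubric_reward(plan: str, rubric: list[str], grade_criterion) -> float:
    """Fraction of rubric criteria the frozen grader marks as satisfied.

    `grade_criterion(plan, criterion)` is an assumed helper that queries the
    frozen grader LLM and returns True/False for a single rubric item.
    """
    if not rubric:
        return 0.0
    satisfied = sum(1 for criterion in rubric if grade_criterion(plan, criterion))
    return satisfied / len(rubric)

# Goal-specific rubric items and general guidelines are pooled before scoring,
# mirroring the rubric-guided setup described above.
# reward = rubric_reward(plan, goal_rubric + general_guidelines, grade_criterion)
```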

GroupRelativePolicyOptimization (GRPO) is employed as a refinement technique for the generator Large Language Model (LLM) following rubric-based evaluation. GRPO functions by comparing the performance of the generator LLM against a group of previously generated plans, normalizing rewards based on relative rubric scores. This relative scoring encourages the generator LLM to not only achieve high rubric scores in absolute terms, but also to outperform other generated plans, leading to a maximized performance distribution centered around optimal rubric adherence. The process utilizes the rubric scores provided by the frozen LLM grader as the primary signal for policy updates, effectively driving the generator LLM towards consistently producing research plans that meet defined quality and rigor standards.
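A short sketch of the group-relative normalization at the heart of GRPO, applied to rubric rewards for a group of plans sampled for the same goal. The exact normalization shown follows the standard GRPO formulation and is an assumption, not a detail taken from the paper.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each plan's rubric reward against its group's mean and std.

    The policy is thereby pushed toward plans that beat their siblings,
    not merely toward high absolute rubric scores.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four plans sampled for one research goal, graded by the frozen rubric grader.
advantages = group_relative_advantages([0.40, 0.65, 0.55, 0.80])
```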

Training both instruction-tuned and reasoning-based Qwen-3-4B models with a Qwen-3-30B MoE reward model $\theta_r$ demonstrates comparable validation performance, despite the reasoning model requiring over twice the computational resources.

The Promise and Peril of Automated Inquiry

ResearchPlanGen demonstrates a remarkable capacity for domain generalization, consistently producing effective research plans across diverse scientific disciplines. Experiments reveal the framework isn’t limited by the specific training data used during its development; instead, it leverages underlying principles of scientific inquiry to adapt to fields ranging from astrophysics to zoology. This adaptability stems from the system’s ability to identify core research objectives and translate them into actionable plans, regardless of the subject matter. The consistent generation of high-quality plans across such varied fields suggests a robust understanding of the scientific method itself, exceeding expectations for a system trained on a limited dataset and indicating its potential as a broadly applicable tool for researchers across all disciplines.

Rigorous ablation studies reveal that the generation of robust research plans hinges on the synergistic interplay between goal-specific directives and broadly applicable guidelines. Removing either component demonstrably degrades the quality and feasibility of the proposed research. Plans generated solely with goal-specific information often lack the necessary methodological rigor or fail to account for established best practices within the scientific community. Conversely, relying exclusively on general guidelines produces plans that, while technically sound, are insufficiently tailored to address the unique nuances of the stated research objective. This suggests that an effective framework requires a balanced approach, integrating contextual goals with universally applicable principles to produce research plans that are both innovative and realistically executable; the optimal framework doesn’t simply know what to do, but how to adapt general knowledge to specific problems.

To accelerate progress in automated research planning, the DatasetResearchPlanGen has been made publicly available. This novel resource comprises a carefully curated collection of extracted research goals, detailed evaluation rubrics, and corresponding solution plans, spanning a diverse range of scientific disciplines. By providing a standardized and accessible dataset, researchers are now equipped to develop, benchmark, and refine algorithms capable of autonomously generating high-quality research proposals. The dataset’s structure encourages both supervised learning approaches and reinforcement learning paradigms, ultimately fostering innovation in areas such as scientific discovery, experimental design, and knowledge synthesis. Its release represents a significant step towards building intelligent systems that can assist scientists in navigating the complexities of modern research.

The pursuit of automated research plan generation, as demonstrated by ResearchPlanGen, inevitably highlights the transient nature of elegant solutions. It seems fitting, then, to recall Ken Thompson’s observation: “Software is like entropy: It is difficult to stop it from becoming disordered.” This framework, built upon self-rewarding reinforcement learning and rubric-based evaluation, attempts to codify rigor. However, the system’s reliance on extracted rubrics and solutions – data derived from existing scientific papers – implies a cycle. Each iteration refines the process, yet simultaneously inherits the biases and limitations of its predecessors. The architecture isn’t a perfect diagram; it’s a compromise that survived deployment, and will, inevitably, require resuscitation.

What’s Next?

The automation of research plan generation, as demonstrated by ResearchPlanGen, addresses a logistical bottleneck. However, the core challenge remains untouched: discerning genuine scientific progress from sophisticated mimicry. The framework excels at synthesizing existing structures, but novelty, the uncomfortable and messy process of proposing something genuinely new, requires more than rubric optimization. Expect diminishing returns as the system saturates on established knowledge; the truly interesting failures will not be neatly graded.

Future iterations will inevitably focus on expanding the scope of ‘extracted rubrics’. The implicit assumption – that a paper’s evaluation criteria fully capture its contribution – is, of course, naive. The real innovation isn’t in automating the process of science, but in automating the appearance of it. The next stage will involve increasingly complex attempts to model reviewer biases, effectively building a system that’s good at getting papers accepted, not necessarily good science.

Ultimately, the field doesn’t need more generative AI. It needs a more honest accounting of what gets lost in translation when human judgment is reduced to a reward function. The current trajectory suggests a future where research becomes an exercise in algorithmic optimization, and the pursuit of truth becomes a secondary concern. Perhaps, eventually, the system will be able to write its own critiques – a fitting conclusion to the cycle.


Original article: https://arxiv.org/pdf/2512.23707.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
