Author: Denis Avetisyan
A new system is tackling the complex challenge of automating long-term machine learning experimentation and improvement.

AiScientist leverages hierarchical orchestration and durable state management to enable autonomous research engineering, demonstrated through successful paper replication and experimental advancement.
Despite advances in autonomous AI, sustaining coherent progress across extended machine learning research engineering tasks remains a significant challenge. This paper, ‘Toward Autonomous Long-Horizon Engineering for ML Research’, introduces AiScientist, a system designed to address this by combining hierarchical orchestration with a novel “File-as-Bus” workspace for durable state continuity. AiScientist demonstrably improves performance on both paper replication and experiment improvement benchmarks by leveraging artifact-mediated coordination rather than relying on conversational handoffs. These results suggest that automating long-horizon ML research is fundamentally a systems problem – but can we further refine these systems to truly unlock autonomous scientific discovery?
The Challenge of Sustained Scientific Inquiry
Conventional machine learning workflows often falter when applied to research endeavors that unfold over extended timelines. Unlike tasks with immediate feedback, long-horizon projects – those requiring months or even years to yield results – present unique challenges in maintaining context and managing complexity. The iterative process of scientific discovery demands nuanced adjustments based on accumulating evidence, a level of adaptability that current systems struggle to provide without substantial human oversight. This reliance on manual intervention not only slows down progress but also introduces potential for bias and limits the scalability of research efforts, as each step frequently requires a researcher to interpret data, refine hypotheses, and guide subsequent experiments. Consequently, automating the entire scientific process, from initial ideation to experimental validation, remains a significant hurdle in accelerating discovery.
Current machine learning methodologies, while powerful in narrowly defined tasks, falter when applied to the extended timelines and intricate dependencies inherent in long-horizon scientific research. Simply increasing computational power or data volume doesn’t address the fundamental issue: a lack of systemic automation. A truly transformative approach necessitates a new paradigm – one where algorithms not only analyze data and formulate hypotheses, but also autonomously design, execute, and interpret experiments over weeks, months, or even years. This requires systems capable of maintaining a coherent project state, adapting to unexpected results, and iteratively refining research directions without constant human intervention, effectively closing the loop between computational ideation and empirical validation to accelerate the pace of discovery.
Maintaining a cohesive research trajectory over extended periods presents a fundamental hurdle for conventional machine learning systems. Unlike tasks with immediate feedback, long-horizon projects demand the retention of complex project states – encompassing hypotheses, experimental setups, partial results, and evolving research goals – across weeks, months, or even years. Traditional methods, designed for discrete, short-lived experiments, struggle with this continuous, cumulative knowledge representation and often require substantial human intervention to prevent drift or loss of context. This necessitates not just data storage, but a system capable of intelligently interpreting the significance of past work, adapting to new findings, and proactively guiding future experimentation – a level of sustained reasoning that remains a significant challenge for current automated scientific discovery platforms.

AiScientist: An Architecture for Durable Experimentation
AiScientist is an autonomous machine learning research system engineered for experiments spanning extended durations. Its architecture integrates hierarchical orchestration, allowing for the decomposition of complex research goals into manageable stages, with a novel File-as-Bus protocol. This protocol functions by externalizing the complete state of a research project – including code, data, configurations, and results – and persisting it as a collection of readily accessible files. This design enables durability and reproducibility, as the entire project state is continuously saved and versioned, and facilitates modularity by allowing different components to interact through shared file access.
The File-as-Bus protocol within AiScientist establishes durable state continuity by serializing the complete project state – including code, data, configurations, and experimental results – into easily accessible files. This externalization differs from traditional in-memory state management; all information is persisted to a file system, allowing for seamless resumption after interruptions, reproducibility of experiments, and inspection of intermediate results. The protocol facilitates a decoupling between the agent’s control logic and the underlying project state, enabling agents to reliably access and modify the project’s status through standard file I/O operations, and ensuring that no data is lost during long-running autonomous research cycles.
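The paper does not publish the protocol's implementation, but the idea of externalizing state into files so that a restarted process can resume purely by reading the workspace can be sketched as follows. The `FileBus` class and the JSON-per-key layout are illustrative assumptions, not the authors' actual design.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of a File-as-Bus workspace: all project state lives in
# plain files, so a fresh process can resume purely by reading the directory.
class FileBus:
    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, payload: dict) -> None:
        # Persist one piece of project state (config, result, status) as JSON.
        (self.root / f"{key}.json").write_text(json.dumps(payload))

    def read(self, key: str) -> dict:
        return json.loads((self.root / f"{key}.json").read_text())

# One "run" writes its state; a second FileBus over the same directory
# recovers it, as if the process had been interrupted and restarted.
workspace = Path(tempfile.mkdtemp())
FileBus(workspace).write("experiment_01", {"lr": 0.001, "status": "running"})
resumed = FileBus(workspace).read("experiment_01")
```

The key property is that `resumed` is reconstructed entirely from the file system, with no in-memory handoff between the two `FileBus` instances.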
AiScientist’s ‘Thin Control’ architecture minimizes the complexity of the control logic required for autonomous experimentation. Instead of dictating low-level implementation details, the system focuses on orchestrating high-level stages within a research pipeline. This is achieved by relying on the ‘Thick State’ – a fully externalized and persistent record of the project’s complete state – for actual execution. By decoupling control from implementation, AiScientist gains substantial flexibility; the control logic can be readily adapted to new tasks or environments without modifying the underlying execution components. Furthermore, this approach enhances resilience, as the complete project state is durable and can be easily restored or resumed following interruptions, allowing for long-horizon experimentation without loss of progress.
‘Thick State’ in AiScientist refers to the comprehensive serialization of all relevant project data – including code, data versions, experimental results, and configuration parameters – into readily accessible files. This contrasts with systems relying on internal, often transient, state representations. By externalizing this information, AiScientist ensures that the agent has complete visibility into the project’s history and current status. This accessibility enables robust introspection, debugging, and adaptation; agents can leverage the ‘Thick State’ to inform decision-making, reproduce experiments, and recover from failures without relying on potentially lost internal representations. The granularity of this state serialization is designed to allow for detailed analysis and reconstruction of any prior project state.
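The claim that any prior project state can be reconstructed implies some form of versioned serialization. A minimal sketch of that idea, assuming a simple snapshot-per-file scheme (the `ThickState` class and file naming are hypothetical, not from the paper):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of "Thick State": every revision of the project state
# is serialized to its own file, so any prior state can be restored exactly.
class ThickState:
    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def snapshot(self, state: dict) -> int:
        # Version number = count of existing snapshots in the workspace.
        version = len(list(self.root.glob("state_*.json")))
        (self.root / f"state_{version:04d}.json").write_text(json.dumps(state))
        return version

    def restore(self, version: int) -> dict:
        return json.loads((self.root / f"state_{version:04d}.json").read_text())

store = ThickState(Path(tempfile.mkdtemp()))
v0 = store.snapshot({"hypothesis": "wider nets help", "best_acc": 0.71})
v1 = store.snapshot({"hypothesis": "wider nets help", "best_acc": 0.74})
# Even after later progress, the earlier state remains fully reconstructable.
earlier = store.restore(v0)
```

Because nothing is held only in memory, a failure between snapshots loses at most the work since the last `snapshot` call.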

Orchestration and Coordination: A Multi-Agent System
Hierarchical Orchestration functions by dividing a complex research task into discrete sub-tasks, each representing a specialized role. These roles are then individually assigned to specific agents within the system. This decomposition allows for parallel processing and focused execution, improving overall efficiency. The hierarchy isn’t necessarily rigid; agents can be dynamically assigned roles based on task requirements and available resources. The process relies on clearly defined interfaces between roles, ensuring seamless integration of results and preventing functional overlap. By distributing the workload and assigning responsibility, Hierarchical Orchestration mitigates the cognitive load on any single agent and facilitates scalability for increasingly complex research endeavors.
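The stage decomposition described above can be illustrated with a toy orchestrator. The role names (`planner`, `engineer`) and handler functions here are invented for illustration; the paper's actual role set is not specified in this summary.

```python
# Hypothetical sketch of hierarchical orchestration: a top-level goal is
# decomposed into staged sub-tasks, each dispatched to a specialized role
# handler through a well-defined interface (plain dicts, in this sketch).
def plan_stage(goal: str) -> dict:
    # The planner role turns a research goal into concrete sub-tasks.
    return {"stage": "plan", "tasks": [f"design experiment for {goal}"]}

def execute_stage(plan: dict) -> dict:
    # The engineer role consumes the planner's output, never the raw goal.
    return {"stage": "execute", "results": [t + ": done" for t in plan["tasks"]]}

ROLES = {"planner": plan_stage, "engineer": execute_stage}

def orchestrate(goal: str) -> dict:
    # Thin control logic: only sequences the stages and delegates the work.
    plan = ROLES["planner"](goal)
    return ROLES["engineer"](plan)

outcome = orchestrate("reduce validation loss")
```

The control function knows nothing about how each role does its work, which is what makes the role assignments swappable.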
Specialized Agents operate by leveraging a shared workspace established through the File-as-Bus protocol, enabling concurrent access and modification of data. This protocol treats files not as static containers, but as dynamic communication channels, providing a durable state that persists across agent interactions. Each agent is designed to perform a specific function within a larger research task, and coordination is achieved by reading and writing to designated files within the shared workspace. This approach eliminates the need for a central database or complex API calls, allowing agents to operate asynchronously and independently while maintaining data consistency through file-based communication. The shared workspace serves as a single source of truth, facilitating knowledge sharing and collaboration between agents without requiring direct inter-agent communication.
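Artifact-mediated coordination, where agents never call each other directly, can be sketched with two toy agents sharing a directory. The agent functions, file name, and payload fields are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of artifact-mediated coordination: one agent publishes
# an artifact file; the other discovers it by inspecting the shared workspace.
workspace = Path(tempfile.mkdtemp())

def data_agent(ws: Path) -> None:
    # Publishes its output as a file instead of messaging another agent.
    (ws / "dataset.ready.json").write_text(json.dumps({"rows": 50000}))

def training_agent(ws: Path) -> dict:
    # Proceeds only once the upstream artifact exists in the workspace.
    artifact = ws / "dataset.ready.json"
    if not artifact.exists():
        return {"status": "waiting"}
    rows = json.loads(artifact.read_text())["rows"]
    return {"status": "trained", "examples_seen": rows}

before = training_agent(workspace)   # no artifact yet -> waiting
data_agent(workspace)
report = training_agent(workspace)   # artifact present -> proceeds
```

The two agents share no in-memory state and need no direct channel; the file system is the single source of truth.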
Multi-Agent Coordination is achieved through a defined hierarchical structure wherein agents operate with specific roles and responsibilities. This structure facilitates efficient collaboration by channeling information flow and minimizing redundant processing. Agents communicate and share knowledge by accessing and modifying files within the shared workspace, guided by the Workspace Map. The hierarchical arrangement allows for task decomposition and distribution, enabling parallel processing and reducing overall task completion time. Furthermore, this organization supports knowledge sharing as agents build upon each other’s work, contributing to a cumulative understanding and improved research outcomes.
The Workspace Map functions as a lightweight indexing system within the shared workspace, enabling specialized agents to efficiently locate and utilize relevant information. This map does not replicate data; instead, it maintains pointers to data locations, minimizing storage overhead and ensuring consistency. Agents consult the Workspace Map to discover available resources, identify the outputs of other agents, and determine the appropriate data to use for their assigned tasks. The map’s lightweight nature prioritizes speed and scalability, facilitating rapid access to information crucial for multi-agent coordination and efficient task completion.
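A map that stores pointers to data rather than the data itself might look like the following sketch. The `MAP.json` name, the map schema, and the `lookup` helper are hypothetical, introduced only to illustrate the pointer-not-copy design.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of a Workspace Map: a lightweight index that stores
# pointers (relative paths) to artifacts, never the artifact contents.
workspace = Path(tempfile.mkdtemp())
(workspace / "results").mkdir()
(workspace / "results" / "run_07.json").write_text(json.dumps({"acc": 0.91}))

# The map records where artifacts live, keyed by a logical name.
workspace_map = {"best_run_metrics": "results/run_07.json"}
(workspace / "MAP.json").write_text(json.dumps(workspace_map))

def lookup(ws: Path, name: str) -> dict:
    # An agent consults the map, then follows the pointer to the real file.
    pointer = json.loads((ws / "MAP.json").read_text())[name]
    return json.loads((ws / pointer).read_text())

metrics = lookup(workspace, "best_run_metrics")
```

Since the map holds only paths, updating an artifact in place never leaves the index holding a stale copy of its contents.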
Validation and Benchmarking: Demonstrating Autonomous Research Capability
AiScientist’s capacity for genuine scientific discovery is rigorously tested through ‘PaperBench’, a challenging evaluation framework demanding complete replication of published research. This isn’t simply about achieving similar results; the system must independently recreate the entire experimental pipeline – from data processing and model selection to training and evaluation – based solely on the information presented in top-tier machine learning conference papers. By assessing its ability to autonomously translate theoretical descriptions into functional code and reproducible experiments, PaperBench provides a strong indicator of AiScientist’s potential to not just apply existing knowledge, but to genuinely do science – a crucial step toward fully autonomous research capabilities and the acceleration of scientific progress.
MLE-Bench Lite serves as a rigorous proving ground for autonomous machine learning research systems, specifically designed to evaluate sustained progress on challenging, competitive tasks. Unlike benchmarks that assess performance on a single, static problem, MLE-Bench Lite demands consistent improvement across a series of experiments, mirroring the iterative nature of real-world research. The benchmark focuses on measuring a system’s ability to autonomously design, execute, and analyze experiments over an extended period, rewarding not just initial success, but also the capacity to learn from failures and refine its approach. This sustained performance metric provides a more holistic evaluation of an AI researcher’s capabilities, moving beyond one-off achievements to assess its potential for long-term innovation and impactful contributions to the field.
AiScientist’s performance on the MLE-Bench Lite benchmark reached an ‘Any Medal’ rate of 81.82%, a result indicative of substantial progress in automating extended, iterative machine learning research. This score doesn’t simply reflect task completion; it demonstrates the system’s capacity to autonomously refine experiments over a sustained period, effectively acting as a self-improving research engineer. Unlike systems focused on single-shot optimization, AiScientist navigates the complexities of long-horizon research, adapting to challenges and building upon prior results to consistently enhance performance on competitive machine learning tasks. The achievement signals a potential shift towards systems capable of independently driving progress in areas demanding prolonged experimentation and nuanced analysis, ultimately accelerating the pace of discovery.
AiScientist demonstrates a notable advancement in automating machine learning research through its performance on the PaperBench benchmark, exceeding the strongest existing baseline by 11.15 points. This improvement isn’t merely incremental; it signifies the system’s capacity to independently navigate the complex stages of a research project – from experimental design and code implementation to analysis and refinement. The ability to replicate results from top-tier conference papers, consistently outperforming established methods, suggests a fundamental shift in how ML research can be conducted, potentially accelerating discovery and reducing the reliance on manual effort in key areas such as hyperparameter optimization and model selection. This achievement highlights AiScientist’s potential to serve as a powerful tool for researchers, enabling them to explore a wider range of ideas and focus on higher-level conceptual challenges.
Rigorous testing of the AiScientist system involved systematically disabling core components to assess their impact on performance, a process known as ablation study. Results demonstrated the crucial role of the ‘File-as-Bus’ protocol, a mechanism facilitating data and code exchange within the autonomous research loop. Removing this protocol led to a substantial decrease in the system’s capabilities; specifically, performance on the challenging ‘MLE-Bench Lite’ benchmark dropped by 31.82 percentage points, and its score on ‘PaperBench’ – assessing replication of scientific papers – decreased by 6.41 points. This significant reduction underscores the protocol’s importance as a foundational element enabling AiScientist to effectively manage the complex workflow of autonomous machine learning research and consistently achieve high-level results.
The AiScientist system, as detailed in the paper, embodies a holistic approach to machine learning research engineering. It recognizes that automating long-horizon tasks isn’t merely about stringing together individual steps, but about managing the entire lifecycle of an experiment, from initial setup to artifact storage and subsequent iterations. This mirrors Tim Berners-Lee’s sentiment: “The web is more a social creation than a technical one.” The AiScientist’s emphasis on durable state continuity and artifact-mediated coordination acknowledges that research builds upon previous work, creating a web of interconnected experiments and findings. Just as the web relies on consistent linking and accessible information, AiScientist’s success hinges on maintaining a reliable and navigable record of each experiment’s progression.
Future Directions
The presented work, while demonstrating a path toward automating elements of machine learning research, inevitably highlights just how much remains tacit. AiScientist’s reliance on ‘artifacts’ – durable representations of state – is less a solution and more a formalization of a problem everyone already knew existed: experiments are not merely code, they are entangled histories. The system’s performance, then, isn’t a measure of intelligence, but of bookkeeping. If the system looks clever, it’s probably fragile.
A natural extension lies in refining the ‘hierarchical orchestration.’ Current approaches to automation tend toward brittle modularity; the illusion of control maintained by carefully partitioning concerns. More interesting, and considerably more difficult, is the development of systems that can renegotiate those partitions – that can dynamically restructure the research process itself. The architecture, after all, is the art of choosing what to sacrifice.
Ultimately, the true test won’t be replication, but genuine improvement. Can such a system, freed from the constraints of human attention, stumble upon genuinely novel approaches? Or will it merely optimize existing strategies, endlessly refining local maxima? The answer, predictably, will likely reveal more about the nature of ‘intelligence’ itself than about machine learning.
Original article: https://arxiv.org/pdf/2604.13018.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/