Author: Denis Avetisyan
A new system automatically generates reusable code examples from existing repositories, empowering AI agents to accelerate scientific discovery.

CodeDistiller leverages code distillation and a judge-based LLM approach to construct functional libraries for materials science and experiment-driven research.
Automated scientific discovery systems are often constrained by the limited, manually-created code they can leverage for experimentation. This paper introduces CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents, a system designed to overcome this limitation by automatically distilling functional, domain-specific code examples from large collections of scientific GitHub repositories. We demonstrate that CodeDistiller can produce working examples for a substantial portion of materials science repositories, and that augmenting automated discovery agents with these distilled libraries leads to more accurate and scientifically sound experimental results. Could this approach unlock a new era of self-improving scientific software and accelerate the pace of discovery?
The Persistence of Error: Reproducibility as a Foundational Challenge
Research code is shared alongside publications more routinely than ever, yet independently verifying published findings remains surprisingly difficult, creating a persistent reproducibility bottleneck. This is not simply a matter of typos or minor errors; complex software dependencies, undocumented environments, and the inherent difficulty of translating described methods into executable code all contribute to widespread failures in replication. Consequently, valuable time and resources are diverted from building upon existing knowledge, as scientists are often forced to re-implement analyses rather than directly validate or extend prior work. This impedes the cumulative nature of science, slowing the pace of discovery and risking the propagation of erroneous results, despite the vast quantities of shared code.
Current approaches to ensuring code reproducibility frequently falter when confronted with the intricate realities of scientific software. Research repositories are rarely isolated entities; they depend on a tangled web of external libraries, specific software versions, and often, complex computational environments. These dependencies, which can number in the hundreds or even thousands, are often poorly documented or change over time, creating a moving target for replication. Attempts to simply package code and data frequently fail to capture the complete computational provenance necessary for a successful rebuild, leading to the “dependency hell” that plagues many scientific endeavors. The sheer scale of these dependency graphs, coupled with the difficulty in managing version conflicts and ensuring consistent environments, presents a substantial obstacle to verifying published findings and fostering cumulative knowledge.
The inability to consistently reproduce published research findings poses a substantial impediment to scientific advancement. When results cannot be reliably verified, researchers expend valuable time and resources attempting to reimplement analyses or validate claims, rather than extending existing knowledge. This creates a compounding effect, where each unverifiable finding necessitates further independent effort, slowing the overall pace of discovery. The consequence is not merely duplicated work, but a fractured landscape of isolated studies, hindering the cumulative nature of science and diminishing the potential for breakthroughs that rely on a solid foundation of reproducible evidence. Ultimately, a lack of reproducibility stifles innovation and limits the capacity to address complex scientific challenges effectively.

CodeDistiller: An Attempt at Automated Algorithmic Reconstruction
CodeDistiller is an automated system designed to process complex codebases hosted on GitHub and transform them into functional, executable examples. This conversion process addresses the challenge of utilizing existing research code, which often requires significant effort to set up and debug. The system focuses on extracting core functionalities from repositories and packaging them in a readily usable format, specifically for integration into automated scientific discovery pipelines. This allows researchers to leverage existing code as building blocks for new experiments and analyses without the need for manual intervention or extensive reverse-engineering of the original source material.
CodeDistiller incorporates a Vetted Code Library as a critical component for ensuring the quality and reliability of automatically generated code examples. This library consists of pre-validated code snippets and functions, sourced from established and trusted repositories, that are used as building blocks during the distillation process. When converting complex code, CodeDistiller prioritizes the use of these vetted components, effectively substituting potentially problematic or undocumented sections with functionally equivalent, pre-tested code. This approach significantly reduces the risk of introducing errors or inaccuracies in the distilled examples, and provides a baseline of known-good functionality upon which the automated conversion can reliably operate.
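To make the distillation step concrete, the sketch below shows what such a pipeline might look like in Python: an LLM-backed step (stubbed here as `llm_distill`) turns a repository into a minimal example, preferring snippets from a vetted library, and the candidate is kept only if it executes cleanly in a sandboxed subprocess. The helper names, the placeholder vetted snippet, and the pass/fail criterion are illustrative assumptions, not the system's actual API.

```python
"""Minimal sketch of a CodeDistiller-style distill-and-vet loop (illustrative only)."""
import subprocess
import sys
import tempfile

# Hypothetical vetted code library: pre-validated snippets keyed by task name.
VETTED_LIBRARY = {
    "load_structure": "def load_structure(path):\n    return open(path).read()\n",
}

def llm_distill(repo_readme: str, vetted: dict) -> str:
    """Placeholder for the LLM call that turns a repository into a minimal
    runnable example, preferring vetted snippets where possible."""
    # In the real system an LLM would generate this; here we return a stub.
    return vetted["load_structure"] + "\nprint('distilled example ran')\n"

def validate(example_code: str, timeout: int = 60) -> bool:
    """Execute the distilled example in a subprocess and report success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(example_code)
        script = f.name
    result = subprocess.run([sys.executable, script], capture_output=True, timeout=timeout)
    return result.returncode == 0

if __name__ == "__main__":
    readme = "Example materials-science repository README (placeholder)."
    candidate = llm_distill(readme, VETTED_LIBRARY)
    print("kept" if validate(candidate) else "discarded")
```

The real system's distillation prompts, sandboxing, and acceptance criteria are far richer, but the keep-only-what-runs loop is the essential idea.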
CodeDistiller demonstrated a variable success rate in automatically generating functional code examples, ranging from 26% to 74% across a test set of 250 GitHub repositories specifically within the materials science domain. This performance indicates that the system’s ability to successfully distill executable code is heavily dependent on the complexity and quality of the source code within each repository. The range suggests significant variation in the ease with which CodeDistiller can extract and validate functional components from different projects, highlighting both the potential and the limitations of automated code distillation techniques.
The difficulty in reproducing published research results is a significant challenge across scientific disciplines, frequently attributed to the lack of accessible, executable code accompanying publications. CodeDistiller addresses this reproducibility crisis by automatically generating functional code examples from complex repositories. This functionality lowers the barrier to verification, allowing researchers to directly execute and validate published methods, and facilitates further research by providing a readily available foundation for experimentation and extension of existing work. By providing this executable resource, CodeDistiller streamlines the research process and promotes more reliable scientific progress.

Code Retrieval and Experimentation: Towards Automated Hypothesis Testing
Code-RAG, or Code Retrieval-Augmented Generation, functions by leveraging the outputs of the CodeDistiller component to identify and retrieve pertinent code examples. This retrieval process is central to supporting scientific discovery by providing relevant code snippets that can be used as a basis for analysis or modification. The system doesn’t generate code from scratch; instead, it utilizes existing, distilled code as building blocks, enabling the rapid prototyping and testing of scientific hypotheses. Retrieved code examples are then integrated into the workflow, allowing agents to build upon established research and accelerate the experimentation cycle.
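As a rough illustration of the retrieval step, the sketch below ranks distilled examples against a task description using a simple bag-of-words cosine similarity. The paper does not specify this particular retrieval mechanism; the example library and the scoring function are placeholders.

```python
"""Illustrative Code-RAG retrieval: rank distilled examples against a task description."""
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(task: str, library: dict, k: int = 3) -> list:
    """Return the k distilled examples whose descriptions best match the task."""
    query = Counter(task.lower().split())
    scored = [(cosine(query, Counter(desc.lower().split())), name)
              for name, desc in library.items()]
    return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]

# Hypothetical distilled library: example name -> short description.
DISTILLED_LIBRARY = {
    "bandgap_dft.py": "compute band gap of a crystal structure with DFT",
    "phase_diagram.py": "construct a binary phase diagram from formation energies",
    "xrd_fit.py": "fit an experimental XRD pattern to candidate structures",
}

print(retrieve("predict the band gap of a new crystal", DISTILLED_LIBRARY))
```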
The performance of the Code-RAG system also relies on Parametric Knowledge acquired when the coding agents are trained. This knowledge is not merely about code syntax: the agents learn parameters that guide which code examples are worth retrieving, evaluating and prioritizing candidates by their potential to contribute to the experiment at hand. This lets the system move beyond keyword-based retrieval and intelligently identify code that aligns with the goals of a specific research context, yielding more accurate and efficient knowledge discovery.
The integration of code retrieval with agents such as AI Scientist and AgentLab facilitates the generation of novel experiments by leveraging existing research as a foundation. These agents utilize retrieved code examples – distilled for relevance and accuracy – to propose and execute experiments that build upon prior work, rather than requiring de novo experimental design. This process significantly accelerates discovery by automating hypothesis generation and testing, allowing for a greater volume of experiments to be conducted in a given timeframe. The system’s capacity to synthesize existing knowledge with automated experimentation reduces the time required to explore research spaces and identify potentially impactful findings.
Experiment-Driven Discovery facilitates automated scientific investigation by iteratively generating and evaluating experiments based on retrieved code and parametric knowledge. This process leverages coding agents to formulate hypotheses, design experimental procedures as executable code, and analyze resulting data to refine subsequent iterations. The system’s capacity for automated experimentation is integral to accelerating the pace of scientific discovery, allowing for exploration of a wider experimental space than is feasible through manual methods. This approach enables continuous learning and optimization of research pathways, systematically testing and validating scientific principles through code execution and data analysis.
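A schematic of this loop, with the agent's LLM-backed steps replaced by placeholders (`propose_experiment`, `run_experiment`, `analyse`), might look like the following; the toy objective stands in for a real experiment executed as code.

```python
"""Schematic experiment-driven discovery loop; all three steps are placeholders."""
import random

def propose_experiment(history: list) -> dict:
    """Placeholder: an agent would draft executable experiment code here,
    conditioned on retrieved examples and earlier results."""
    return {"param": random.uniform(0.0, 1.0)}

def run_experiment(experiment: dict) -> float:
    """Placeholder: execute the generated code; here a toy objective."""
    return -(experiment["param"] - 0.3) ** 2

def analyse(history: list) -> dict:
    """Keep the best result seen so far as the current working hypothesis."""
    return max(history, key=lambda h: h["score"])

history = []
for step in range(10):  # iterative generate-evaluate-refine cycle
    exp = propose_experiment(history)
    history.append({"experiment": exp, "score": run_experiment(exp)})
    best = analyse(history)
print("best experiment so far:", best)
```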
CodeDistiller functions as a critical component in enhancing automated scientific discovery by improving the quality of code retrieved for experimentation. Evaluations on the SUPER Benchmark demonstrate a 16% performance increase compared to previously established state-of-the-art systems. This improvement is attributed to CodeDistiller’s ability to refine code selection, resulting in more accurate, complete, and scientifically valid experiment generation. The system achieves this through optimized filtering and prioritization of code examples, directly impacting the reliability and efficiency of the automated discovery pipeline.
Rigorous Validation and the Pursuit of Generalizability
CodeDistiller’s capabilities are rigorously tested through a comprehensive suite of benchmarks designed to mirror the challenges of real-world research code. Evaluations utilize established tests like SUPER Benchmark, which focuses on code synthesis, and RexBench, geared towards reproducing experimental results. Further assessment comes from TM-Bench, evaluating task management, alongside ENVBENCH, which assesses environment setup, and the more complex scenarios presented by ResearchCodeBench. Finally, GISTIFY provides a benchmark for identifying the core functionality within a codebase, collectively providing a robust measure of CodeDistiller’s performance across a spectrum of scientific computing tasks and ensuring broad applicability to diverse research domains.
The system’s capabilities are rigorously tested through a series of benchmarks designed to mirror the challenges of real-world scientific computing. These evaluations go beyond simple code execution, specifically assessing the ability to establish the necessary computational environment, faithfully reproduce published research results, and adapt existing code for new purposes. This holistic approach ensures that the system isn’t merely generating code, but demonstrating a practical understanding of the scientific process – a crucial step toward automating discovery. Successfully navigating benchmarks like SUPER Benchmark and ResearchCodeBench signifies a capacity to handle the intricacies of research code, including complex dependencies and specialized configurations, thereby validating the system’s potential as a valuable tool for the scientific community.
Evaluation using the ResearchCodeBench benchmark reveals that CodeDistiller currently achieves a success rate of just under 40% when tasked with replicating complex research code. While seemingly modest, this result is significant because ResearchCodeBench is specifically designed to assess performance on challenging, real-world scientific projects, often involving intricate dependencies and undocumented procedures. This partial success indicates that CodeDistiller possesses a foundational ability to navigate these complexities, offering a promising starting point for further development and refinement of automated code repurposing techniques in demanding scientific domains. The current performance highlights both the potential and the ongoing challenges in automating the process of understanding and adapting existing research code.
The development of robust evaluation metrics is critical for advancing the field of automated scientific discovery, and existing benchmarks often fall short when assessing complex code replication and repurposing tasks. To address this, researchers have extended the capabilities of benchmarks like SUPER Benchmark, RexBench, and ResearchCodeBench, creating a standardized framework for rigorously evaluating systems like CodeDistiller. This expanded framework allows for consistent and comparable assessments of different automated systems’ abilities to navigate the intricacies of research code, fostering innovation and accelerating progress in areas such as data analysis, model building, and scientific simulation. By providing a common ground for evaluation, the framework encourages the development of more reliable and effective tools for scientific exploration and discovery.
Automated evaluation of code generated for scientific tasks presents a significant challenge, traditionally requiring extensive manual review by experts. To address this, the system leverages a novel approach utilizing a Large Language Model (LLM) as an autonomous judge. This ‘LLM-as-a-Judge’ methodology assesses the functionality and correctness of the generated code by executing it, comparing the outputs against expected results, and verifying adherence to the original research intent. By automating this critical evaluation step, researchers can rapidly iterate on code generation strategies and benchmark performance across diverse scientific domains, fostering more efficient and reproducible research workflows. The LLM’s capacity to understand code semantics and research context allows for a nuanced assessment, moving beyond simple pass/fail tests and providing valuable insights into the quality and reliability of the generated solutions.
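A minimal sketch of such a check appears below: the candidate script is executed, its outputs are captured, and a stubbed `judge_with_llm` call stands in for the model that would score the result against the research intent. The rubric and the pass/fail criterion are assumptions for illustration, not the paper's actual prompt.

```python
"""Minimal LLM-as-a-judge sketch: run a candidate script, then score it with a stubbed judge."""
import subprocess
import sys
import tempfile

def execute(code: str, timeout: int = 120):
    """Run the candidate script and capture its return code, stdout, and stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr

def judge_with_llm(task: str, code: str, stdout: str, stderr: str) -> dict:
    """Stub: a real system would send the task, code, and outputs to an LLM
    with a rubric (does it run? is the output plausible? is it faithful to the task?)."""
    return {"runs": stderr == "", "verdict": "pass" if stderr == "" else "fail"}

candidate = "print(sum(range(10)))  # toy stand-in for a distilled example"
rc, out, err = execute(candidate)
print(judge_with_llm("sum the integers 0..9", candidate, out, err))
```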
The pursuit of automated scientific discovery, as detailed in this work concerning CodeDistiller, hinges on the creation of demonstrably correct and reusable code components. This mirrors a fundamental tenet of algorithmic design – the unwavering need for provable solutions. Robert Tarjan aptly stated, “Algorithms must be correct, not just work.” The system’s reliance on distilling functional code from existing repositories, and validating it through an LLM-as-a-judge approach, emphasizes this principle. It isn’t sufficient for the generated code to merely produce results; its correctness must be inherently verifiable, ensuring the reliability of the entire experiment-driven discovery process. The elegance lies in reducing complex tasks to verifiable, distilled components.
What Remains Invariant?
The pursuit of automated scientific discovery, as exemplified by CodeDistiller, hinges on a fundamental question: can the elegance of a mathematical principle be preserved through the imperfect medium of code? The system demonstrably constructs functional libraries, yet the core challenge isn't merely generation, but guaranteeing correctness as the scope of potential scientific problems expands. Let N approach infinity: what remains invariant? Currently, the reliance on GitHub repositories, while pragmatic, introduces a dependence on the quality and biases of existing code, a distinctly un-axiomatic foundation. The LLM-as-a-judge paradigm, though novel, begs the question of verifiable truth: assessment based on probability, however sophisticated, is not proof.
Future work must address the formal verification of generated code, moving beyond empirical testing to provable correctness. This necessitates a shift from treating code as merely ‘working’ to establishing its logical equivalence to underlying scientific principles. The true test lies not in reproducing known results, but in independently deriving novel, verifiable theorems – a task demanding more than statistical likelihood.
Ultimately, the field needs to confront the limitations of current LLMs as arbiters of scientific validity. A system that seeks to automate discovery cannot simply mimic existing knowledge; it must embody the very principles of logical rigor that define the scientific method. The elegance of a solution isn’t measured by its performance on a benchmark, but by its inherent, mathematical truth.
Original article: https://arxiv.org/pdf/2512.01089.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/