Beyond Language: Building AI Scientists for Biology

Author: Denis Avetisyan

A new framework empowers artificial intelligence to not just process biological data, but to actively conduct and verify research.

PRAXIS integrates long-term memory, case-based reasoning, and verifiable workflows to create robust and auditable AI agents for biological discovery.

While large language models increasingly automate scientific tasks, ensuring reliability and auditability remains a critical challenge in biological research. To address this, we present ‘PRAXIS: Case-distilled and code-verified AI agents for biological research’, a framework that integrates long-term memory and case-based reasoning to create verifiable agentic workflows. This approach transforms research experience into executable capabilities, improving method selection and error suppression across diverse biocomputational tasks. Could this represent a pathway toward AI systems that not only assist, but actively enhance the rigor and reproducibility of scientific discovery?

The Illusion of Progress: Bottlenecks in the Pursuit of Knowledge

The foundations of scientific progress have long been burdened by processes demanding significant human effort. Manual literature review, for instance, requires researchers to sift through vast and ever-growing volumes of publications – a task that increasingly strains time and resources. Similarly, experimental design, while intellectually stimulating, is often a painstakingly slow cycle of hypothesis formulation, setup, data collection, and analysis. This reliance on manual processes creates substantial bottlenecks, limiting the speed at which new knowledge can be generated and applied. Consequently, discoveries are often delayed, and potentially groundbreaking research may remain unexplored due to practical limitations imposed by these traditional workflows. The sheer volume of scientific data now being produced necessitates a shift towards more automated and efficient methodologies to overcome these inherent constraints and truly accelerate the pace of innovation.

Despite their remarkable abilities in processing and generating human language, Large Language Models (LLMs) currently function as powerful tools within a research workflow rather than as autonomous drivers of it. While adept at tasks like summarizing papers or brainstorming hypotheses, LLMs lack the crucial capacity to independently orchestrate the multifaceted stages of scientific inquiry. These models struggle with tasks requiring long-term planning, iterative experimentation, and the integration of the diverse tools, from data analysis software to laboratory equipment, that a complete research cycle requires. Essentially, LLMs can provide insightful pieces of the puzzle, but they cannot, on their own, assemble those pieces into a coherent and self-directed investigation. This highlights a critical gap between linguistic proficiency and true scientific agency.

The current pace of scientific discovery demands a fundamental shift beyond simply analyzing existing data; a new research entity – the Scientific Agent – is envisioned to proactively drive the entire investigative process. These agents wouldn’t merely process information, but autonomously formulate hypotheses, design experiments – both in silico and potentially in vitro – analyze results, and iteratively refine their approach. Such automation promises to overcome the inherent limitations of traditional research, where human constraints restrict the scope and speed of inquiry. By integrating AI with automated laboratory equipment and knowledge databases, these agents could dramatically accelerate the cycle of discovery, potentially identifying promising avenues of research currently obscured by the sheer volume of available data and the constraints of manual effort. Ultimately, the goal is to create a self-improving system capable of tackling complex scientific challenges with unprecedented efficiency and scale.

PRAXIS: A Framework for Automated Inquiry

PRAXIS integrates two distinct learning methodologies to equip agents with research capabilities. Literature learning involves processing and extracting knowledge from a corpus of research papers, enabling the agent to understand established concepts and identify relevant information. Complementing this, case distillation focuses on learning from expert-demonstrated research workflows, effectively transferring procedural knowledge. By combining these approaches – broad knowledge acquisition from literature with specific skill replication via case distillation – PRAXIS aims to create agents capable of both independent exploration and efficient execution of complex research tasks.

PRAXIS employs a ‘Workflow Schema’ to formally define the sequential steps comprising a research task. This schema, represented as a directed acyclic graph, explicitly outlines dependencies between individual steps – such as literature review, data acquisition, analysis, and report generation – enabling automated execution and precise control over the research process. By codifying the workflow, PRAXIS ensures reproducibility, as any researcher can re-execute the defined schema with consistent results. Furthermore, the schema-based approach facilitates scalability; the framework can readily adapt to larger and more complex research tasks by adding or modifying steps within the defined structure, and allows for parallelization of independent steps to reduce overall execution time.
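
To make the idea of a schema as a directed acyclic graph concrete, here is a minimal sketch using Python's standard `graphlib` module. The step names come from the paragraph above, but the dependency edges and the `execution_batches` helper are assumptions made for illustration; the source does not show how PRAXIS actually encodes its schema.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical schema: each step maps to the set of steps it depends on.
WORKFLOW = {
    "literature_review": set(),
    "data_acquisition": set(),
    "analysis": {"literature_review", "data_acquisition"},
    "report_generation": {"analysis"},
}

def execution_batches(schema):
    """Group steps into batches; steps within a batch do not depend on each other."""
    sorter = TopologicalSorter(schema)
    sorter.prepare()  # raises CycleError if the schema is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

for index, batch in enumerate(execution_batches(WORKFLOW), start=1):
    print(f"batch {index}: {batch}")
```

Steps that land in the same batch are the ones that could run in parallel, which is where the reduction in execution time mentioned above would come from.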

PRAXIS implements checkpointing by periodically saving the state of its research workflows to persistent storage. This allows workflows to resume execution from the last saved state following any system failure or interruption, preventing the loss of progress in potentially lengthy research processes. Checkpointing isn’t solely for failure recovery; the ability to resume from a checkpoint, rather than restarting entirely, also enables workflow acceleration through parallelization and optimization of individual workflow steps. The system logs checkpoint data, including workflow state and relevant variables, to facilitate reproducibility and debugging of long-running experiments.
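
The resume-from-last-state behavior can be sketched in a few lines. This is an illustrative toy, not the PRAXIS implementation: the JSON file layout, the `run_with_checkpoints` function, and the two example steps are all assumptions.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, func) steps in order, saving state after each completed step.

    On a later run, steps already recorded in the checkpoint are skipped.
    """
    state = {"completed": [], "variables": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)

    for name, func in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run
        state["variables"] = func(dict(state["variables"]))
        state["completed"].append(name)
        # Write to a temporary file, then rename it into place, so that a crash
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp_path, checkpoint_path)
    return state

# Usage: the first run stops after one step; the second run resumes and skips it.
calls = []

def load(variables):
    calls.append("load")
    return {**variables, "n": 3}

def square(variables):
    calls.append("square")
    return {**variables, "sq": variables["n"] ** 2}

path = os.path.join(tempfile.mkdtemp(), "workflow.ckpt.json")
run_with_checkpoints([("load", load)], path)
final = run_with_checkpoints([("load", load), ("square", square)], path)
print(calls)  # ['load', 'square']: the load step ran only once
```

Because the checkpoint file holds both the list of completed steps and the intermediate variables, it also serves as a record for debugging a long-running workflow.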

Evaluations demonstrate that the PRAXIS framework substantially improves both the reliability and auditability of automated research workflows. Specifically, internal testing revealed a 37% reduction in workflow failures compared to baseline implementations lacking checkpointing and workflow schema enforcement. Furthermore, the system’s detailed logging of each workflow step, combined with the standardized ‘Workflow Schema’, allows for complete reconstruction and verification of research processes, facilitating independent review and error identification. These features address key limitations of prior automated research systems, which often lacked transparency and robust failure recovery mechanisms.

Learning from Experience: The Echoes of Past Investigations

PRAXIS utilizes ‘Cases’ as the fundamental unit of experiential storage, comprehensively documenting each research interaction. A Case encapsulates all relevant input parameters – the initial conditions or query presented to the system – alongside a detailed record of the methods employed to address it. Critically, each Case also includes a complete accounting of the resulting outputs, whether successful, unsuccessful, or partial. This structured storage allows for retrospective analysis, facilitating the identification of patterns, correlations, and causal relationships between inputs, methods, and outcomes. The persistent archiving of these Cases forms the foundation for subsequent rule extraction and refinement of agent behavior.
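
One way to picture a Case is as a small immutable record. The sketch below is an assumption about what such a record might contain, based only on the description above (inputs, methods, outputs, and an outcome); the field names and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

VALID_OUTCOMES = {"success", "partial", "failure"}

@dataclass(frozen=True)
class Case:
    """One archived research interaction (hypothetical field layout)."""
    case_id: str
    inputs: dict    # initial conditions or query presented to the system
    methods: list   # ordered record of the methods that were applied
    outputs: dict   # whatever the run produced, including error details
    outcome: str    # "success", "partial", or "failure"

    def __post_init__(self):
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    def to_json(self):
        """Serialize the Case for persistent archiving."""
        return json.dumps(asdict(self), sort_keys=True)

# Made-up example values, for illustration only.
example = Case(
    case_id="case-0001",
    inputs={"task": "cell_type_annotation", "n_cells": 5000},
    methods=["normalize", "cluster", "label_by_markers"],
    outputs={"error": "marker list empty"},
    outcome="failure",
)
restored = Case(**json.loads(example.to_json()))
print(restored == example)  # True
```

Storing failed and partial runs in the same structure as successful ones is what makes the later retrospective analysis possible.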

PRAXIS employs a rule-extraction process to analyze completed ‘Cases’, identifying which parameters and methods recur in successful and in unsuccessful outcomes. This analysis focuses on establishing constraints that govern agent behavior. Extracted ‘Rules’ function as conditional statements: if a specific input condition is met, the agent is directed to favor or avoid a particular action or methodological approach. These Rules are not pre-programmed but are dynamically derived from experiential data, allowing the system to adapt to the specific nuances of each research domain and proactively prevent the repetition of previously identified errors. The Rules vary in granularity, ranging from high-level methodological guidelines to specific parameter value restrictions, and they are continually refined as new Cases are processed.

PRAXIS incorporates analysis of failed experiments, designated as ‘Negative Cases’, into its learning process. These Negative Cases are not simply discarded; the system actively extracts data regarding the conditions and actions that led to the unfavorable outcome. This information is then used to construct constraints that prevent the agent from revisiting similar unproductive lines of inquiry. By explicitly identifying and encoding what doesn’t work, PRAXIS proactively avoids repeating errors and focuses computational resources on more promising avenues of investigation, thereby improving efficiency and reducing the likelihood of unsafe recommendations.

PRAXIS utilizes a persistent ‘Long-Term Memory’ to store accumulated knowledge derived from research experiences, formalized as ‘Cases’, extracted ‘Rules’, and developed ‘Skills’. This storage mechanism allows the system to retain learnings across sessions and continuously refine its behavior. The system doesn’t simply discard data after an experiment; instead, it integrates new findings with existing knowledge, strengthening successful strategies and reinforcing avoidance of previously identified errors. This persistent memory is critical for reducing the frequency of unsafe recommendations, as the system can proactively apply learned constraints and patterns to novel situations and prevent the repetition of flawed approaches.

Robustness and Validation: Maintaining the Illusion of Certainty

PRAXIS incorporates ‘Identifier Validation’ as a critical data integrity measure. This process systematically verifies the accuracy and uniqueness of identifiers used to reference entities within databases, such as genes, proteins, or chemical compounds. By confirming that each identifier unambiguously links to a single, correct database entry, the framework prevents errors arising from misidentification, synonym conflicts, or outdated records. This validation step is applied across all data inputs, ensuring the reliability of subsequent analyses including virtual screening, CRISPR off-target prediction, and single-cell cell-type annotation. The system flags and resolves ambiguous or incorrect identifiers, maintaining data consistency and minimizing the potential for downstream errors.
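
A minimal sketch of this kind of check follows, assuming a local set of canonical identifiers and an alias table. Both tables, the `validate_identifiers` function, and the toy identifiers are hypothetical; a real system would query the relevant gene, protein, or compound databases instead.

```python
def validate_identifiers(identifiers, canonical_ids, alias_map):
    """Map each raw identifier to one canonical ID, or flag it with a reason."""
    resolved, flagged = {}, {}
    for raw in identifiers:
        key = raw.strip()
        if key in canonical_ids:
            resolved[raw] = key
            continue
        candidates = alias_map.get(key, [])
        if len(candidates) == 1:
            resolved[raw] = candidates[0]  # the alias points at exactly one entry
        elif candidates:
            flagged[raw] = "ambiguous: " + ", ".join(sorted(candidates))
        else:
            flagged[raw] = "unknown"
    return resolved, flagged

# Toy tables, for illustration only.
canonical_ids = {"GENE_A", "GENE_B", "GENE_C"}
alias_map = {
    "gA": ["GENE_A"],               # unambiguous synonym
    "old1": ["GENE_B", "GENE_C"],   # synonym conflict
}
resolved, flagged = validate_identifiers(
    ["GENE_A", "gA", "old1", "GENE_Z"], canonical_ids, alias_map
)
print(resolved)
print(flagged)
```

Flagged identifiers are returned with a reason rather than silently dropped, so they can be resolved before any downstream analysis runs.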

PRAXIS is designed to facilitate multiple computational biology applications, notably including in silico identification of potential drug candidates via Virtual Screening, prediction of unintended genomic edits in CRISPR applications through Off-Target Prediction, and automated classification of cells within Single-Cell Cell-Type Annotation workflows. These tasks leverage PRAXIS’s knowledge graph and adaptive retrieval capabilities to provide results across diverse biological domains. The framework’s versatility allows researchers to apply a consistent methodology to different problem spaces, improving efficiency and reproducibility of results.

PRAXIS utilizes an ‘Adaptive Retrieval’ method which, in benchmark testing, achieved a Recall@10 score of 0.683. The BM25 algorithm attained a Recall@10 of 0.646 under the same testing conditions, so Adaptive Retrieval gains 0.037 in absolute terms, or roughly 5.7% relative to the baseline. Recall@10 measures the fraction of all relevant items that appear among the top ten retrieved results; a higher score therefore means that more of the pertinent items surface near the top of the ranking. The benchmark used for this comparison consists of a standardized dataset designed to assess the effectiveness of information retrieval systems.
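
For readers unfamiliar with the metric, Recall@k for a single query can be computed as below. The function names and the toy document IDs are illustrative, and averaging per-query scores is assumed here as the usual way to report a benchmark figure; the source does not state how its scores were aggregated.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(queries, k=10):
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in queries) / len(queries)

# Toy query: two of the three relevant documents appear in the top three results.
score = recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2", "d3"}, k=3)
print(round(score, 3))  # 0.667
```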

Ligand-based virtual screening utilizing PRAXIS’s adaptive rules resulted in improved enrichment factor at 1% (EF1%), indicating enhanced hit identification compared to standard methods. Furthermore, implementation of the complete PRAXIS system – referred to as the ‘PRAXIS brain’ – substantially decreased the rate of unsafe recommendations during virtual screening. This reduction in unsafe recommendations suggests improved filtering of potentially problematic compounds, contributing to a more reliable and focused selection of viable candidates.
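
EF1% compares the hit rate among the top-ranked 1% of screened compounds with the hit rate across the whole library. The sketch below uses the standard definition and assumes that a higher score means a more promising compound; the function name and the synthetic library are illustrative only.

```python
import math

def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate.

    `labels` holds 1 for a known active and 0 for an inactive compound.
    Assumes higher scores rank first.
    """
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library containing at least one active")
    n_top = max(1, math.ceil(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / n)

# Synthetic library: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
scores = list(range(1000, 0, -1))
labels = [1] * 5 + [0] * 495 + [1] * 5 + [0] * 495
print(enrichment_factor(scores, labels))  # (5/10) / (10/1000) = 50.0
```

An EF1% of 1.0 corresponds to random ranking, so values above 1.0 indicate that actives are concentrated near the top of the list.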

The Horizon of Automated Discovery: A Reflection of Our Own Aspirations

PRAXIS signifies a considerable advancement in the pursuit of artificial intelligence capable of independent scientific discovery. This system isn’t merely analyzing existing data; it’s demonstrating the capacity to autonomously design and conduct complete research projects, from formulating initial questions and designing experimental procedures to analyzing results and drawing conclusions. By integrating automated reasoning with robotic experimentation, PRAXIS circumvents many of the limitations inherent in traditional scientific workflows, offering the potential to accelerate the pace of discovery and explore research avenues previously inaccessible due to practical constraints. This represents a foundational step towards AI scientists that can operate with minimal human intervention, potentially revolutionizing fields ranging from materials science and drug discovery to fundamental physics and beyond, ultimately paving the way for a new era of automated scientific exploration.

Current research endeavors are concentrating on equipping the agent with sophisticated capabilities extending beyond data analysis and experimental execution, specifically focusing on the generation of genuinely novel hypotheses. This involves integrating advanced reasoning modules and exploring computational creativity techniques, allowing the agent to not merely identify correlations, but to propose explanations and predictive models that transcend existing knowledge. The intention is to move beyond incremental advancements towards breakthroughs facilitated by the agent’s ability to synthesize information in unconventional ways and devise innovative solutions to complex scientific problems – a process mirroring, and potentially accelerating, the intuitive leaps often made by human researchers.

The adaptability of PRAXIS hinges on its expansion beyond its current focus, promising a surge in scientific discovery as it ventures into previously unexplored disciplines. PRAXIS is currently adept at navigating specific research landscapes; broadening its scope requires not merely the ingestion of new data, but a fundamental restructuring of its knowledge representation and reasoning capabilities. This involves developing algorithms that can abstract core principles across diverse fields – recognizing, for example, that optimization problems in materials science share underlying similarities with those encountered in economic modeling. Successfully achieving this interdisciplinary fluency will allow PRAXIS to identify unexpected connections and generate truly novel hypotheses, potentially accelerating breakthroughs in areas ranging from drug discovery and climate modeling to fundamental physics and beyond. The ultimate innovation won’t simply be solving more problems, but redefining the very questions scientists ask.

The envisioned future of scientific discovery hinges on a synergistic partnership between human intellect and artificial intelligence. Rather than replacing researchers, the ultimate aim is to cultivate a collaborative ecosystem where AI agents, such as PRAXIS, function as powerful extensions of human capabilities. These agents will handle computationally intensive tasks, analyze vast datasets, and propose innovative approaches, freeing scientists to focus on higher-level conceptualization, critical evaluation, and the ethical implications of research. This collaborative dynamic promises to accelerate the pace of discovery, tackle increasingly complex challenges, from climate change and disease to sustainable energy, and unlock solutions previously beyond reach, fostering a new era of scientific progress driven by the combined strengths of both human and artificial minds.

The development of PRAXIS necessitates rigorous attention to the foundations of computational inference, mirroring the inherent challenges in theoretical physics. As Stephen Hawking once stated, “It is not enough to be right. One must also be able to explain why one is right.” This principle resonates deeply with the framework’s emphasis on verifiability and auditable scientific judgment. PRAXIS doesn’t merely generate hypotheses; it constructs a traceable workflow, allowing for independent validation of its reasoning process – a crucial step in mitigating the risk of unfounded conclusions. The system’s integration of case-based reasoning and long-term memory further strengthens this capability, providing a historical context for assessing the reliability of its inferences, much like scrutinizing the stability of solutions to Einstein’s equations.

What Lies Beyond the Horizon?

The pursuit of automated scientific judgment, as exemplified by frameworks like PRAXIS, reveals less about intelligence and more about the limitations of formalizing the tacit. The system elegantly addresses the challenge of verifiable workflows and long-term memory, yet sidesteps the inherent messiness of biological discovery. It constructs a scaffolding for reasoning, but the truly novel insights often arise from intuitive leaps – those elegantly illogical connections that defy algorithmic capture. The cosmos generously shows its secrets to those willing to accept that not everything is explainable.

Future iterations will undoubtedly refine the case-based reasoning and knowledge representation. However, the fundamental question remains: can a system truly judge, or merely correlate and extrapolate? The architecture addresses the how of scientific inquiry, but offers little on the why. A crucial next step involves grappling with the inherent uncertainties within biological systems, acknowledging that predictive power will always be bounded by irreducible complexity.

Black holes are nature’s commentary on human hubris. Similarly, this line of research, while promising, should serve as a constant reminder that the most profound discoveries may forever lie beyond the event horizon of formalization. The goal is not to replace the scientist, but to build tools that amplify their capacity for wonder and, crucially, their willingness to embrace the unknown.

Original article: https://arxiv.org/pdf/2605.23169.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-25 21:47