Author: Denis Avetisyan
A new benchmark reveals that while large language models are improving, significant challenges remain in automating complex biological research tasks.

LABBench2 assesses AI systems on real-world biology problems, highlighting current limitations in data access, retrieval, and operational fidelity.
Despite growing optimism surrounding AI’s potential to accelerate scientific discovery, robustly evaluating progress beyond rote knowledge remains a significant challenge. This work introduces ‘LABBench2: An Improved Benchmark for AI Systems Performing Biology Research’, an evolution of the original LAB-Bench designed to assess real-world capabilities in performing useful biological research tasks. Our benchmark, comprising nearly 1,900 tasks, reveals that while current frontier models have improved, substantial gaps persist in areas like data access, literature retrieval, and faithful execution of complex operations, demonstrated by accuracy drops of 26% to 46% across subtasks. Will these findings spur the development of more capable AI tools to truly augment core research functions and unlock the full potential of AI for scientific advancement?
The Expanding Frontier of Scientific Data
The sheer volume of data now produced by scientific endeavors presents a formidable challenge to traditional analytical methods. Advances in fields like genomics, astronomy, and materials science are driving an exponential increase in datasets, quickly overwhelming the capacity of researchers to manually process and interpret findings. This isn’t simply a matter of ‘big data’ requiring more storage; the complexity and interconnectedness of these datasets demand computational approaches capable of identifying subtle patterns and relationships previously hidden within the noise. Consequently, the bottleneck in many scientific disciplines has shifted from data collection to data analysis, necessitating the development of innovative tools and techniques to extract meaningful insights from this rapidly expanding information landscape.
The sheer volume of contemporary scientific data necessitates a shift towards automated knowledge discovery systems. These aren’t simply advanced search engines; rather, they employ complex reasoning algorithms to move beyond information retrieval and towards genuine synthesis. Such systems aim to identify patterns, formulate hypotheses, and even predict future research directions by integrating data from disparate sources – genomic databases, clinical trials, published literature, and more. This automated analysis addresses a critical bottleneck in scientific progress, allowing researchers to explore connections and insights that would be impossible to discern manually. The development of these tools promises to accelerate the pace of discovery, moving the field beyond data accumulation towards meaningful understanding and innovation.
Current approaches to analyzing scientific literature often fall short of capturing the subtle complexities inherent in research findings. Traditional methods, reliant on keyword searches and manual review, struggle with ambiguity, context-dependent meanings, and the evolving language of science. This limitation hinders effective knowledge discovery, as critical insights can be obscured by imprecise interpretations or missed entirely due to a lack of nuanced understanding. Consequently, there is a growing demand for advanced tools capable of discerning subtle relationships, identifying implicit assumptions, and accurately representing the full scope of scientific arguments – moving beyond simple information retrieval toward genuine comprehension and synthesis of complex data.
The accelerating pace of discovery across scientific disciplines has created an overwhelming deluge of data and publications, necessitating novel approaches to knowledge extraction. Researchers are increasingly hampered not by a lack of information, but by the sheer difficulty of accessing, interpreting, and integrating findings from disparate sources – journals, databases, pre-print servers, and patents. Consequently, there is a pressing demand for sophisticated tools capable of not merely searching for relevant data, but of synthesizing it – identifying patterns, resolving contradictions, and generating new hypotheses. These systems must move beyond simple keyword searches and embrace semantic understanding, allowing them to navigate the complex relationships within scientific literature and unlock the full potential of accumulated knowledge, ultimately fostering more rapid and impactful advancements across all fields of study.
LABBench2: A Rigorous Evaluation Platform for Scientific AI
Traditional benchmarks for scientific AI often rely on question answering, which assesses recall of factual knowledge. LABBench2 departs from this approach by focusing on the execution of complete scientific tasks, such as designing experiments, interpreting data, and resolving procedural issues. This shift aims to provide a more comprehensive evaluation of an AI system’s capabilities, moving beyond simple knowledge retrieval to assess its ability to apply that knowledge in a simulated research context. Consequently, performance on LABBench2 requires not only accessing relevant information but also reasoning, planning, and problem-solving skills analogous to those used by human scientists.
LABBench2 extends traditional benchmark assessments by incorporating tasks that simulate common research workflows, specifically experiment planning and protocol troubleshooting. Experiment planning tasks require the AI to design a valid experimental procedure to address a given scientific question, including selection of appropriate materials and methods. Protocol troubleshooting tasks present the AI with a flawed experimental protocol and necessitate the identification of errors and proposal of corrective actions. These task types move beyond simple factual recall and demand integrated reasoning skills, mirroring the iterative and problem-solving nature of actual scientific research activities and providing a more comprehensive evaluation of an AI’s capabilities in a laboratory context.
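To make the task shape concrete, here is a made-up protocol-troubleshooting item in the spirit of the benchmark’s description; the protocol, question, and answer are illustrative inventions, not items drawn from LABBench2.

```python
# A made-up protocol-troubleshooting item, shown only to illustrate the task
# shape; it is not drawn from LABBench2 itself.
item = {
    "protocol": [
        "1. Digest the plasmid with EcoRI at 37 °C for 1 hour.",
        "2. Heat-inactivate the enzyme at 4 °C for 20 minutes.",  # flawed step
        "3. Run the digest on a 1% agarose gel.",
    ],
    "question": "Which step is flawed, and how should it be corrected?",
    "answer": "Step 2: heat inactivation of EcoRI requires ~65 °C, not 4 °C.",
}
```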
LABBench2 comprises a dataset of 1,893 distinct tasks intended to rigorously evaluate the capabilities of scientific AI models. These tasks are not limited to simple question answering; instead, they require complex reasoning and problem-solving skills across a range of scientific disciplines. The large scale of the benchmark is designed to move beyond performance on narrow, well-defined problems and assess generalization to novel scenarios, pushing the boundaries of what is currently achievable in automated scientific discovery and analysis. The tasks were developed with an emphasis on requiring multi-step reasoning and integration of information from diverse sources.
The LABBench2 benchmark architecture demands advanced capabilities in three core areas. Effective data retrieval is required to identify relevant information from a provided knowledge base, which includes scientific papers, databases, and experimental protocols. Following retrieval, data analysis functionalities are necessary to interpret the extracted information, identify patterns, and draw logical inferences. Crucially, the benchmark also necessitates synthesis capabilities – the ability to combine information from multiple sources, resolve conflicts, and formulate novel solutions or predictions based on the analyzed data, ultimately enabling the AI to address complex scientific tasks.
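A minimal sketch of what such an evaluation harness might look like, assuming a multiple-choice task format; the `Task` fields and grading rule below are hypothetical conveniences, not the benchmark’s published schema.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    subtask: str               # e.g. "FigQA2", "TableQA2", "SuppQA2"
    question: str
    choices: list[str]         # multiple-choice options
    answer: str                # the correct option
    attachments: list[str] = field(default_factory=list)  # figures, tables, papers

def evaluate(answer_fn, tasks: list[Task]) -> float:
    """Fraction of tasks answered correctly; answer_fn is the system under test."""
    correct = sum(
        answer_fn(t.question, t.choices, t.attachments) == t.answer
        for t in tasks
    )
    return correct / len(tasks)
```

Keeping the attachments a system receives separate from the grading step makes it easy to run the same tasks under ‘evidence provided’ and ‘evidence must be retrieved’ conditions, a distinction that matters in the results discussed below.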

Decoding Scientific Data: Task Modalities and Methodologies
LABBench2 utilizes a benchmark suite comprising multiple question-answering tasks to comprehensively evaluate scientific understanding. These tasks – TableQA2, which focuses on tabular data; FigQA2, centered on interpreting figures; and SuppQA2, requiring analysis of supplementary materials – each present distinct challenges in scientific reasoning. The suite is designed to move beyond simple fact retrieval and assess a model’s ability to synthesize information from various scientific data representations, providing a multifaceted evaluation of its comprehension capabilities. These tasks, alongside others within the LABBench2 framework, collectively gauge performance across a broad spectrum of scientific information processing skills.
LABBench2 tasks necessitate the synthesis of information presented across multiple document modalities. Successful completion of TableQA2, FigQA2, and SuppQA2, for example, requires models to not only identify relevant data within tables, figures, and supplementary materials, but also to correlate this data with information contained in the associated research papers. This process of complex data integration goes beyond simple information retrieval; it demands an understanding of relationships between different data representations and the ability to draw inferences based on combined evidence from heterogeneous sources. The tasks are specifically designed to evaluate a system’s capacity to perform this cross-modal reasoning and knowledge aggregation.
The DbQA2, PatentQA, and TrialQA tasks within LABBench2 require models to query and interpret information from structured databases, patent literature, and clinical trial records, respectively. DbQA2 utilizes databases such as BioKG and DrugBank, demanding knowledge-graph traversal and entity linking. PatentQA involves retrieving relevant claims and descriptions from patent documents, necessitating an understanding of legal terminology and claim structure. TrialQA focuses on answering questions about registered clinical trials, requiring the parsing of study designs, eligibility criteria, and reported outcomes. Successful completion of these tasks necessitates not only information retrieval but also a capacity for understanding domain-specific language and complex data structures beyond standard scientific literature.
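As a toy illustration of the structured lookups DbQA2-style questions demand, consider reducing ‘do these two drugs share a protein target?’ to a traversal over knowledge-graph edges; the mini-graph below is hand-written for the example, not queried from BioKG or DrugBank.

```python
# Toy knowledge-graph edges (drug -> protein targets), hand-written for
# illustration rather than extracted from BioKG or DrugBank.
DRUG_TARGETS: dict[str, set[str]] = {
    "imatinib":  {"ABL1", "KIT", "PDGFRA"},
    "sunitinib": {"KIT", "PDGFRA", "FLT3"},
}

def shared_targets(drug_a: str, drug_b: str) -> set[str]:
    """With entity linking done, a one-hop traversal reduces to set intersection."""
    return DRUG_TARGETS.get(drug_a, set()) & DRUG_TARGETS.get(drug_b, set())

print(sorted(shared_targets("imatinib", "sunitinib")))  # ['KIT', 'PDGFRA']
```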
Evaluations utilizing the FigQA2 and TableQA2 tasks reveal a substantial decline in performance when employing a retrieval-based approach compared to direct answer prediction. Specifically, models exhibit difficulty in identifying and synthesizing information from the full text of associated research papers to accurately answer questions based on presented figures and tables. This suggests limitations in current methods for effectively accessing, parsing, and integrating information across multiple document sections, and highlights a core challenge in linking visual or tabular data with its contextual explanation within the broader scientific literature.
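The size of that gap is simply the difference between the two evaluation conditions, scored on the same items; the accuracies below are illustrative placeholders rather than the paper’s measured values.

```python
def accuracy_drop(provided_acc: float, retrieval_acc: float) -> float:
    """Percentage-point drop when the model must locate the evidence itself."""
    return (provided_acc - retrieval_acc) * 100

# Placeholder scores for illustration; the paper reports drops of 26% to 46%.
print(round(accuracy_drop(provided_acc=0.78, retrieval_acc=0.44), 1))  # 34.0
```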

Frontier Models and the Future of Scientific AI
Recent advancements in artificial intelligence have seen the application of sophisticated language models – often termed ‘Frontier Models’ – to complex scientific challenges, as exemplified by their engagement with the LABBench2 benchmark. These models, characterized by their increased scale and capacity for reasoning, are not merely processing data but actively attempting to solve problems requiring scientific knowledge and logical deduction. Initial results from LABBench2 demonstrate a notable ability of these Frontier Models to tackle tasks previously beyond the reach of conventional AI, suggesting a promising trajectory for automating aspects of scientific discovery. While challenges remain in achieving human-level performance, the observed successes highlight the potential of these models to accelerate research across diverse scientific disciplines, offering a powerful new toolkit for exploration and innovation.
LABBench2 functions as a vital arena for assessing the potential of advanced language models within scientific inquiry. This benchmark isn’t merely a measure of rote knowledge; it probes a model’s ability to reason through complex scientific problems, interpret data, and generate plausible hypotheses. By presenting challenges that demand more than simple pattern recognition, LABBench2 distinguishes between models capable of genuinely assisting scientific workflows and those limited to superficial mimicry. The carefully curated tasks, spanning diverse scientific disciplines, provide a standardized and rigorous method for comparing model performance and identifying areas where further development is crucial for impactful contributions to research. Ultimately, the benchmark facilitates a focused evaluation of whether these models are truly poised to become powerful tools for scientists, capable of accelerating discovery and innovation.
The evolution of scientific benchmarks is exemplified by LABBench2, a significantly more challenging iteration of the original LAB-Bench. Recent evaluations reveal a consistent performance decrease across various models, ranging from 26% to 46%. This deliberate increase in difficulty isn’t intended to discourage progress, but rather to provide a more rigorous and realistic assessment of artificial intelligence capabilities in the face of complex scientific reasoning. The gap in performance highlights the need for continued development in areas like data interpretation, experimental design, and hypothesis generation, skills that remain a substantial hurdle for even the most advanced AI systems. Ultimately, LABBench2 serves as a crucial calibration point, revealing the limitations of current models and charting a course towards more robust and genuinely intelligent scientific AI.
Recent evaluations utilizing benchmarks like SeqQA2 and CloningQA reveal a significant performance boost when advanced language models are equipped with external tools. This isn’t simply incremental improvement; the integration of tools functions as a powerful performance equalizer, substantially narrowing the gap between leading-edge models and those with fewer parameters. While model size traditionally dictated success in complex reasoning tasks, the ability to access and utilize tools – such as calculators, search engines, or specialized databases – allows smaller, more efficient models to achieve comparable, and in some cases superior, results. This suggests that the future of scientific AI may not solely depend on scaling model size, but rather on developing robust mechanisms for tool use, effectively augmenting a model’s inherent capabilities and unlocking new levels of problem-solving potential.
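The pattern is easy to sketch: a deterministic utility of the kind SeqQA2-style items reward, registered so a model can delegate to it rather than manipulating symbols in its weights. The registry and routing interface below are invented for the example and do not reflect any particular model’s API.

```python
# Deterministic sequence utility a model could delegate to for SeqQA2-style items.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical tool registry; a real agent framework would supply its own.
TOOLS = {"reverse_complement": reverse_complement}

def answer(tool_call: tuple[str, str] | None, fallback: str) -> str:
    """Run a registered tool when the model requests one, else use the fallback.

    tool_call is the (tool_name, argument) pair the model is assumed to emit;
    producing it is left to the surrounding agent loop.
    """
    if tool_call and tool_call[0] in TOOLS:
        name, arg = tool_call
        return TOOLS[name](arg)
    return fallback

print(answer(("reverse_complement", "ATGC"), fallback="?"))  # GCAT
```

Delegating error-prone symbol manipulation to exact tools is precisely what lets smaller models close the gap: the tool’s output does not depend on parameter count.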

The development of LABBench2 highlights a crucial point regarding complex systems – structure alone does not guarantee functionality. While large language models demonstrate increasing structural sophistication in processing information, the benchmark reveals limitations in accessing and integrating external data, essential for genuine scientific discovery. This echoes Henri Poincaré’s sentiment: “It is through science that we learn to control nature; it is through art that we learn to control ourselves.” The benchmark isn’t merely testing recall; it’s assessing the ability to apply knowledge – to orchestrate data retrieval and execution, mirroring the delicate balance Poincaré describes between understanding and control. The findings suggest that improvements in language modeling must be coupled with robust mechanisms for external knowledge integration to achieve true scientific competence.
What’s Next?
The results detailed within suggest that current approaches to AI in biology resemble intricate scaffolding built on shifting sand. Models demonstrate a capacity for mimicking the form of scientific reasoning, yet struggle with the fundamental act of actually doing science – the messy, iterative process of data acquisition, verification, and synthesis. If the system survives on duct tape and clever prompting, it’s probably overengineered, masking a core fragility. The benchmark itself, while a step forward, highlights a persistent issue: evaluation metrics often prioritize elegant output over robust execution.
The field now faces a choice. It can continue refining language models to better simulate competence, or focus on architectures that genuinely integrate with – and learn from – the underlying complexity of biological data. Modularity, frequently touted as a solution, is an illusion of control without a deep understanding of the dependencies and feedback loops inherent in living systems. A truly intelligent system will not simply retrieve information; it will know what information is reliable, how it was obtained, and why it matters in context.
Future work must therefore move beyond isolated tasks and focus on building systems capable of sustained, self-correcting inquiry. The goal isn’t to automate the scientist, but to create a partner capable of navigating the vast, often ambiguous landscape of biological knowledge – a system that doesn’t just answer questions, but knows when to ask better ones.
Original article: https://arxiv.org/pdf/2604.09554.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/