Can AI Truly Do Science?

Author: Denis Avetisyan


A new evaluation framework assesses whether large language models can move beyond data recall to actually contribute to the process of scientific discovery.

This paper introduces the Scientific Discovery Evaluation (SDE), a scenario-grounded benchmark for assessing large language models’ capabilities in hypothesis generation, experimental design, and results interpretation within complex scientific projects.

While large language models (LLMs) are rapidly being applied to scientific problems, existing benchmarks often fail to capture the iterative, hypothesis-driven nature of genuine discovery. This limitation is addressed in ‘Evaluating Large Language Models in Scientific Discovery’, which introduces a scenario-grounded benchmark spanning biology, chemistry, materials science, and physics, evaluating not only question-level accuracy but also the ability to propose hypotheses, design experiments, and interpret results within complex research projects. The study reveals a clear gap between LLM performance on this discovery-focused benchmark and on general science assessments, suggesting current models fall short of true scientific “superintelligence.” Can this new framework guide the development of LLMs toward more effective contributions to scientific exploration and accelerate the pace of discovery?


The Illusion of Progress: Bottlenecks in Scientific Inquiry

Historically, the scientific process has been notably constrained by the immense cognitive load placed on researchers during the initial phases of investigation. Formulating testable hypotheses and conducting comprehensive literature reviews – tasks demanding both deep expertise and considerable time – often represent significant bottlenecks in discovery. Researchers frequently dedicate months, even years, to manually sifting through existing data, identifying potential relationships, and designing experiments to validate their ideas. This reliance on human-driven exploration, while crucial for nuanced understanding, inherently limits the scale and speed at which new scientific frontiers can be approached. The sheer volume of published research, coupled with the increasing complexity of data sets, exacerbates this challenge, demanding innovative approaches to augment human capabilities and accelerate the pathway from observation to insight.

The advent of Large Language Models (LLMs) signals a transformative shift in how scientific inquiry is conducted. Traditionally, researchers dedicate substantial effort to formulating hypotheses and performing preliminary investigations – processes that can be both time-consuming and resource-intensive. LLMs now offer the capacity to automate these initial stages, effectively functioning as collaborators in the early phases of discovery. By analyzing vast datasets and identifying potential relationships, these models can generate novel hypotheses and suggest promising avenues for experimentation. This automation doesn’t replace the scientist, but rather augments their capabilities, allowing them to focus on critical analysis, experimental design, and the interpretation of results – ultimately accelerating the overall pace of scientific progress and potentially uncovering insights previously obscured by the sheer volume of available data.

The effective integration of Large Language Models into scientific discovery is not simply a matter of achieving a passing grade on standard benchmarks. Even on the Scientific Discovery Evaluation (SDE), where current models score between 0.60 and 0.75 across different scientific domains, a headline number offers an incomplete picture of true capability. A more nuanced assessment is crucial, moving beyond merely confirming whether an LLM can complete a task to rigorously evaluating the quality of its hypotheses, the novelty of its proposed experiments, and the logical soundness of its reasoning. This requires evaluation methods that probe for genuine understanding and creative problem-solving rather than pattern recognition and rote memorization, so that an LLM’s potential to accelerate scientific progress can be gauged accurately and flawed or unoriginal ideas are not propagated.

Beyond Task Lists: Evaluating Scientific Reasoning

Current large language model (LLM) evaluation frameworks, such as LM-Evaluation-Harness, predominantly assess performance on discrete, narrowly defined tasks. This approach typically involves evaluating LLMs on benchmarks consisting of individual questions or problems isolated from broader scientific contexts. Consequently, these frameworks often fail to capture the multifaceted challenges inherent in real-world scientific research, which requires integration of knowledge, iterative experimentation, and complex reasoning across multiple stages. The emphasis on isolated tasks limits the ability to assess an LLM’s capacity to contribute to a complete research workflow, from formulating hypotheses to analyzing and interpreting results within a broader scientific framework.
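
To make this concrete, the following minimal sketch shows what task-level evaluation typically reduces to: isolated prompts, exact-match scoring, and a single accuracy figure. The `ask_model` callable and toy questions are hypothetical stand-ins for illustration; this is not the LM-Evaluation-Harness API itself.

```python
# Minimal sketch of task-level evaluation: isolated questions, exact-match
# scoring, and one accuracy number, with no surrounding research context.
# `ask_model` and the toy questions are hypothetical stand-ins.
from typing import Callable

def evaluate_tasks(ask_model: Callable[[str], str], questions: list[dict]) -> float:
    """Return the fraction of isolated questions answered correctly."""
    correct = 0
    for q in questions:
        prediction = ask_model(q["prompt"]).strip().upper()
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    toy_questions = [
        {"prompt": "Which element has atomic number 6? (A) Carbon (B) Oxygen", "answer": "A"},
        {"prompt": "Is the 2D Ising model exactly solvable? (A) Yes (B) No", "answer": "A"},
    ]
    # A dummy model that always answers "A", purely to make the sketch runnable.
    print(evaluate_tasks(lambda prompt: "A", toy_questions))
```

A score produced this way says nothing about whether the model could have formulated the hypothesis behind a question or designed the experiment that generated its answer, which is precisely the gap project-level evaluation targets.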

Scientific Discovery Evaluation (SDE) represents a shift from task-specific Large Language Model (LLM) assessments to a more comprehensive methodology evaluating performance across core scientific disciplines. SDE benchmarks LLMs not on isolated questions, but within the context of complete research projects spanning biology, chemistry, materials science, and physics. This approach necessitates evaluation of an LLM’s capabilities throughout the entire scientific process – including hypothesis generation, experimental design, data analysis, and interpretation of results – providing a more realistic and nuanced understanding of its potential for contributing to scientific discovery than traditional benchmarks allow. The goal is to determine how effectively an LLM can function as a collaborative research tool, rather than simply demonstrating knowledge recall.

Scientific Discovery Evaluation (SDE) employs dedicated frameworks, notably SDE-Harness, to move beyond isolated task evaluation and assess Large Language Models (LLMs) across the complete research lifecycle. This involves prompting LLMs to perform sequential operations mirroring a scientific project: formulating a hypothesis, designing a methodology – often involving simulation parameters – executing the simulated experiment, analyzing the resulting data, and finally, interpreting the findings to draw conclusions. SDE-Harness facilitates this end-to-end evaluation by providing tools for defining these stages and automatically assessing LLM performance at each step, offering a more realistic and comprehensive measure of scientific reasoning capability than traditional benchmarks.
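
The sketch below illustrates this staged structure under simplifying assumptions: a generic `ask_model` callable stands in for any LLM, and `score_stage` stands in for rubric- or reference-based grading. It is a minimal illustration of the end-to-end flow, not the actual SDE-Harness interface.

```python
# Minimal sketch of project-level evaluation: one research scenario is
# decomposed into sequential stages, the model's output at each stage is
# fed into the next, and every stage is scored against its own rubric.
# `ask_model` and `score_stage` are hypothetical stand-ins, not SDE-Harness APIs.
from typing import Callable

STAGES = ["hypothesis", "experimental_design", "execution_plan", "analysis", "interpretation"]

def run_project(ask_model: Callable[[str], str],
                score_stage: Callable[[str, str], float],
                scenario: str) -> dict[str, float]:
    """Walk a model through one research scenario, stage by stage."""
    context = f"Research scenario: {scenario}"
    scores = {}
    for stage in STAGES:
        prompt = f"{context}\n\nProduce the {stage.replace('_', ' ')} for this project."
        output = ask_model(prompt)
        scores[stage] = score_stage(stage, output)   # rubric- or reference-based scoring
        context += f"\n\n[{stage}]\n{output}"        # later stages see earlier outputs
    return scores

if __name__ == "__main__":
    # Dummy model and scorer so the sketch runs end to end.
    demo = run_project(lambda prompt: "placeholder output",
                       lambda stage, out: 0.5,
                       "Identify a dopant that raises the critical temperature of a known superconductor.")
    print(demo)
```

Because each stage consumes the model’s earlier outputs, an error in the hypothesis propagates through design, analysis, and interpretation, which is exactly the failure mode isolated-question benchmarks cannot surface.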

Project-level evaluation represents a significant advancement in assessing Large Language Models (LLMs) for scientific applications, requiring capabilities beyond simple question answering. This methodology requires LLMs to perform a complete research cycle, encompassing hypothesis generation, experimental design (including simulation setup), and results interpretation. Current results from the Scientific Discovery Evaluation (SDE) indicate that LLM performance on these complex, project-level questions is lower than on more traditional science benchmarks: the same models that reach accuracies of 0.84 on MMMU and 0.86 on GPQA-Diamond score only between roughly 0.60 and 0.75 on SDE, revealing a performance gap when tackling full research projects as opposed to isolated tasks.

Pattern Matching in a Lab Coat: LLMs and Scientific Domains

Large language models (LLMs) are demonstrating capability across several scientific disciplines. In chemistry, LLMs are being applied to retrosynthesis, the planning of a synthesis by working backward from a target molecule to available precursors, and to the optimization of transition metal complex structures for catalysis. Furthermore, LLMs can be utilized for molecular property estimation, predicting characteristics such as solubility or toxicity from molecular structure. These applications leverage the LLM’s ability to identify patterns and relationships within complex datasets, effectively modeling chemical structures and predicting their behavior without explicit programming for each specific task.

Large Language Models (LLMs) are extending their capabilities beyond empirical data analysis to encompass abstract mathematical problem-solving. Specifically, LLMs have demonstrated proficiency in symbolic regression, the automated discovery of mathematical expressions from data, and in modeling the Ising Model, a mathematical model of ferromagnetism in statistical mechanics. Successful application to the Ising Model involves predicting the critical temperature at which phase transitions occur, requiring the LLM to infer relationships between spins and energy states. These achievements suggest an emerging capacity for mathematical reasoning, going beyond pattern recognition to include the manipulation of mathematical concepts and the derivation of solutions from first principles, though performance limitations still exist.
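
For context on the physics task, the sketch below shows the kind of ground truth an LLM’s Ising-model answers can be checked against: Onsager’s exact critical temperature for the 2D square lattice, alongside a very small Metropolis simulation of magnetization versus temperature. The lattice size and sweep counts are illustrative choices, not parameters taken from the paper.

```python
# Ground truth for the 2D Ising task: Onsager's exact critical temperature
# and a tiny Metropolis simulation of magnetization versus temperature.
# Lattice size and sweep counts are illustrative, not values from the paper.
import math
import random

J = 1.0  # coupling constant (units with k_B = 1)
T_C_EXACT = 2.0 * J / math.log(1.0 + math.sqrt(2.0))  # ~= 2.269

def metropolis_magnetization(L: int, T: float, sweeps: int = 2000, seed: int = 0) -> float:
    """Mean |magnetization| per spin of an L x L lattice at temperature T."""
    rng = random.Random(seed)
    spins = [[1 for _ in range(L)] for _ in range(L)]
    mags = []
    for sweep in range(sweeps):
        for _ in range(L * L):
            i, j = rng.randrange(L), rng.randrange(L)
            # Sum of the four nearest neighbours with periodic boundaries.
            nn = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
                  + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
            dE = 2.0 * J * spins[i][j] * nn
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                spins[i][j] *= -1
        if sweep > sweeps // 2:  # discard the first half as equilibration
            m = sum(sum(row) for row in spins) / (L * L)
            mags.append(abs(m))
    return sum(mags) / len(mags)

if __name__ == "__main__":
    print(f"Exact T_c = {T_C_EXACT:.3f}")
    for T in (1.5, 2.27, 3.5):
        print(T, round(metropolis_magnetization(L=16, T=T), 3))
```

Below $T_c \approx 2.269 \, J/k_B$ the magnetization stays near one, and above it the ordered phase melts away; a model reasoning correctly about spins and energy states should recover this qualitative picture.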

Large Language Models are being applied to challenges in materials science and biotechnology, specifically in the identification of novel crystal structures and the optimization of protein sequences. In materials discovery, LLMs can predict stable crystal structures from compositional data, accelerating the search for new materials with desired properties. Similarly, in protein engineering, LLMs are utilized to design protein sequences with enhanced or altered functions, going beyond naturally occurring variants. These applications leverage the LLM’s ability to learn complex relationships from large datasets of materials and biomolecular information, offering a computational approach to accelerate innovation in these fields, though current limitations require further development for widespread adoption.

Despite demonstrated improvements in scientific tasks, the performance of Large Language Models (LLMs) employing reasoning approaches exhibits limitations. While substantial gains have been observed in specific instances – for example, accuracy on Lipinski’s rule of five increased from 0.65 to 1.00 – these improvements are not sustained indefinitely. Performance gains ultimately plateau, suggesting that simply increasing model scale will not yield continued progress. This saturation indicates a requirement for novel techniques and architectures beyond continued scaling to further enhance LLM capabilities in scientific domains.
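
The Lipinski result above is easy to ground because the rule itself is mechanical: molecular weight at most 500 Da, computed logP at most 5, no more than 5 hydrogen-bond donors, and no more than 10 hydrogen-bond acceptors. A minimal sketch, assuming RDKit is installed and treating any violation as a failure, could serve as the reference against which an LLM’s pass/fail answers are scored; the example molecule is illustrative, not drawn from the benchmark.

```python
# Reference implementation of Lipinski's rule of five, usable as ground
# truth when scoring an LLM's drug-likeness judgments. Assumes RDKit is
# installed; the example SMILES string is illustrative only.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def rule_of_five(smiles: str) -> dict:
    """Return the four Lipinski criteria and a strict overall verdict."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    checks = {
        "mol_weight_le_500": Descriptors.MolWt(mol) <= 500,
        "logp_le_5": Descriptors.MolLogP(mol) <= 5,
        "h_donors_le_5": Lipinski.NumHDonors(mol) <= 5,
        "h_acceptors_le_10": Lipinski.NumHAcceptors(mol) <= 10,
    }
    checks["passes_strict"] = all(checks.values())  # no violations allowed here
    return checks

if __name__ == "__main__":
    print(rule_of_five("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
```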

The Mirage of Automation: AI and the Future of Scientific Exploration

Large language models are poised to reshape scientific workflows by taking on the traditionally time-consuming initial phases of research. This automation extends beyond simple literature reviews; LLMs can now assist in hypothesis generation, experimental design, and even the preliminary analysis of datasets. By handling these early-stage tasks, scientists are liberated to concentrate on the more nuanced aspects of their work – interpreting complex results, formulating innovative theories, and tackling unforeseen challenges. This shift promises not just increased efficiency, but a fostering of creativity, as researchers gain valuable time to explore uncharted territory and pursue truly groundbreaking solutions. The potential lies in augmenting, not replacing, human intellect, enabling a synergistic partnership between scientists and artificial intelligence.

Rigorous evaluation frameworks, such as the Scientific Discovery Evaluation (SDE), are becoming indispensable tools in harnessing the potential of large language models (LLMs) for scientific advancement. These frameworks move beyond simple benchmark scores, probing for deeper understanding of an LLM’s capabilities and, crucially, its limitations within specialized scientific domains. Identifying these weaknesses, whether in logical reasoning, data interpretation, or the application of fundamental principles, is not merely an academic exercise. It directly informs strategies for refining model architectures, improving training datasets, and developing targeted interventions to mitigate errors. Without such nuanced evaluation, the potential for LLMs to propagate inaccuracies or offer misleading insights remains a significant concern, hindering rather than accelerating the pace of genuine scientific discovery. The ongoing development and application of SDE and similar frameworks are therefore pivotal in ensuring that AI serves as a reliable and trustworthy partner in the pursuit of knowledge.

The trajectory of scientific discovery is increasingly linked to the capacity of large language models (LLMs), yet maximizing their potential requires both increased scale and improvements in reasoning. While expanding model size consistently yields performance gains, recent analyses reveal a striking correlation, greater than 0.8 as measured by both Spearman’s rank correlation coefficient ($r_s$) and Pearson’s correlation coefficient ($r$), between the performance of leading LLMs across seemingly disparate fields like chemistry and physics. This suggests that current advancements may be relying on broadly applicable, but ultimately limited, patterns rather than genuine domain-specific understanding, and that shared weaknesses are being amplified as models grow. Addressing these underlying limitations, by refining the ability to perform causal reasoning, handle uncertainty, and extrapolate beyond training data, will be crucial to unlock truly transformative breakthroughs and move beyond simply automating existing scientific processes.
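
For readers who want to see how such a cross-domain correlation is obtained, the snippet below applies SciPy’s Spearman and Pearson estimators to per-model scores in two domains. The score lists are invented placeholders for illustration, not figures from the paper.

```python
# Illustration of the cross-domain correlation analysis: compare per-model
# scores in two domains and compute Spearman and Pearson coefficients.
# The scores below are invented placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores on a chemistry track and a physics track.
chemistry = [0.72, 0.65, 0.58, 0.70, 0.61]
physics   = [0.69, 0.63, 0.55, 0.71, 0.60]

r_s, p_s = spearmanr(chemistry, physics)
r_p, p_p = pearsonr(chemistry, physics)
print(f"Spearman r_s = {r_s:.2f} (p = {p_s:.3f})")
print(f"Pearson  r   = {r_p:.2f} (p = {p_p:.3f})")
```

A high value on both coefficients means models that rank well in one domain tend to rank well in the other, consistent with shared, general-purpose strengths and weaknesses rather than domain-specific expertise.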

The intersection of artificial intelligence and scientific investigation is poised to redefine the pace of discovery. This convergence isn’t simply about automating existing processes; it represents a fundamental shift in how knowledge is generated and validated. Researchers anticipate a future where AI systems, capable of analyzing vast datasets and formulating novel hypotheses, collaborate seamlessly with human scientists, accelerating the cycle from initial inquiry to impactful breakthrough. While challenges remain in ensuring the reliability and interpretability of AI-driven results, the potential for transformative advances across disciplines – from materials science and drug discovery to climate modeling and fundamental physics – is substantial. This synergistic relationship promises not only incremental progress but the potential for paradigm shifts, unlocking solutions to complex problems previously considered intractable and ushering in an era of unprecedented innovation.

The pursuit of automating scientific discovery, as this paper details with its Scientific Discovery Evaluation (SDE) benchmark, feels less like innovation and more like accelerating the inevitable accumulation of technical debt. The SDE attempts to move beyond superficial question-answering, demanding LLMs formulate hypotheses and design experiments – a commendable effort, yet one destined to reveal the limits of current models. As Carl Friedrich Gauss observed, “It is not enough to know, one must apply.” These models may know vast amounts of scientific data, but applying that knowledge to genuinely novel research projects exposes the brittleness beneath the surface. The benchmark itself, however elegantly constructed, will become just another layer of abstraction before long, masking the core issues of reasoning and interpretability. The hope is that the evaluation process reveals where the models fail, not that it creates a perfect, self-sustaining system.

What’s Next?

The introduction of the scenario-grounded Scientific Discovery Evaluation represents, predictably, a more complicated failure mode. Moving beyond isolated question-answering, the benchmark attempts to mirror the messiness of actual scientific projects – a laudable goal, if one is prepared for the inevitable cascade of edge cases. Every abstraction dies in production, and here, the ‘production’ is the relentless demand for novelty inherent in scientific inquiry. The current framework will undoubtedly reveal its limitations as models begin to exploit loopholes in the evaluation criteria, optimizing for benchmark performance rather than genuine reasoning.

The true challenge lies not in achieving high scores on SDE, but in developing metrics that can distinguish between a cleverly hallucinated hypothesis and one with a plausible path toward empirical validation. Current methods primarily assess the form of scientific output, not its grounding in reality – a distinction that will become increasingly blurred as models gain fluency. The field will likely shift toward increasingly adversarial testing, probing for fundamental flaws in the models’ understanding of causality and experimental design.

Ultimately, this work, like all attempts to formalize complex processes, provides a temporary respite from chaos. It buys time, refines the questions, and clarifies the points of failure. Everything deployable will eventually crash. The value, perhaps, is in designing for a beautiful, informative crash – one that illuminates the next set of unsolved problems.


Original article: https://arxiv.org/pdf/2512.15567.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2025-12-18 09:51