Can AI Truly Discover New Science?

Author: Denis Avetisyan

A new benchmark assesses whether large language models can move beyond recalling existing knowledge to genuinely project experimental outcomes and generate novel scientific insights.

The varying difficulty in communicating scientific discoveries across manuscripts suggests an inherent fragility in the projection of knowledge, where even robust findings are susceptible to distortion or misinterpretation as they traverse the landscape of scholarly exchange.

ProjectionBench evaluates large language models’ ability to perform scientific discovery under progressive information disclosure, measuring reasoning and knowledge synthesis capabilities.

While large language models excel at recalling known information, truly innovative scientific discovery demands reasoning beyond simple knowledge retrieval. To address this limitation, we introduce ProjectionBench (full title: “ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure”), a benchmarking framework that assesses a model’s ability to project experimental outcomes as information is progressively revealed, from the initial research question to full experimental details. Our evaluation shows that models such as GPT-5.4 and Gemini 3.1 pro improve on prior generations, with GPT-5.4 reaching an F1 score of about 0.7 against ground-truth conclusions even with minimal context. Can this progressive evaluation of semantic divergence unlock the potential for LLMs to function as genuine co-scientists, driving forward the next generation of scientific inquiry?
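Progressive information disclosure can be pictured as a series of cumulative contexts, each revealing one more part of a paper. The sketch below is only an illustration under stated assumptions: the function name `progressive_contexts`, the section labels, and the example paper are hypothetical and are not taken from the benchmark itself.

```python
from typing import List, Tuple


def progressive_contexts(sections: List[Tuple[str, str]]) -> List[Tuple[float, str]]:
    """Return (fraction_disclosed, context_text) for each cumulative prefix.

    `sections` is an ordered list of (name, body) pairs, from the least
    revealing part of a paper to the most revealing one.
    """
    if not sections:
        raise ValueError("at least one section is required")
    contexts = []
    for k in range(1, len(sections) + 1):
        visible = sections[:k]  # everything disclosed so far
        text = "\n\n".join(f"{name}:\n{body}" for name, body in visible)
        contexts.append((k / len(sections), text))
    return contexts


# Hypothetical paper, split into stages purely for illustration.
paper = [
    ("Research question", "Does additive X raise the tensile strength of polymer Y?"),
    ("Methods", "Samples with 0%, 1%, and 5% additive were cast and cured."),
    ("Experimental details", "Tensile tests used a fixed strain rate, five samples per group."),
]

for fraction, context in progressive_contexts(paper):
    print(f"{fraction:.2f} disclosed, {len(context)} characters of context")
```

Each context would be handed to the model with a request to project the conclusions, so that scores can be compared across disclosure levels.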

The Erosion of Pattern Recognition: Distinguishing Correlation from Causation

While Large Language Models demonstrate impressive abilities in identifying correlations and predicting outcomes from vast datasets, genuine scientific discovery necessitates a far more discerning process. It isn’t sufficient to simply recognize patterns; a robust capacity for reasoning – formulating hypotheses, designing experiments to test those hypotheses, and interpreting results with critical analysis – is paramount. Furthermore, verification plays a crucial role, demanding that findings are rigorously tested, replicated, and subjected to scrutiny to ensure their validity and reliability. This emphasis on reasoning and verification distinguishes true scientific progress from mere data association, highlighting a key limitation of current AI systems and a critical area for future development in achieving artificial general science.

Existing evaluations of artificial intelligence often prioritize superficial performance on datasets, masking a critical deficit in genuine scientific reasoning. These benchmarks typically assess an AI’s ability to mimic scientific outputs – predicting results or completing patterns – rather than its capacity for independent verification, hypothesis refinement, or identifying flawed methodology. This limitation hinders progress because an AI capable of simply ‘passing’ current tests may still fail when confronted with novel data, ambiguous results, or the need to critically assess the validity of its own conclusions. Consequently, the field risks being misled by inflated scores that do not reflect a true advancement in AI’s ability to contribute to genuine scientific discovery, necessitating the development of more granular and rigorous evaluation protocols.

The pursuit of artificial intelligence capable of genuine scientific discovery necessitates a shift from broad performance metrics to detailed, granular evaluations. Current benchmarks often measure only superficial success – an AI’s ability to replicate existing patterns – rather than its capacity for robust reasoning, hypothesis generation, and critical verification. A truly insightful AI must not simply find correlations, but understand why they exist, and its evaluation must reflect this depth. This requires developing tests that dissect the AI’s process – examining its reasoning steps, its ability to identify flaws in its own logic, and its capacity to design experiments that rigorously test its hypotheses. Only through such granular assessment can the potential of AI to move beyond pattern recognition and contribute to genuinely novel scientific insight be fully unlocked, fostering advancements beyond the limitations of existing data and methodologies.

Scores assigned by the GPT-5 grader track the completeness of the provided ground truth document: the larger the fraction of the document made available, the higher the score.

Dissecting Scientific Proficiency: New Tools for Assessment

Recent advancements in artificial intelligence evaluation have led to the development of specialized benchmarks designed to assess scientific capabilities. SciBench, MatSciBench, and DiscoveryBench represent this new generation of tools, each targeting distinct facets of scientific proficiency. SciBench focuses on college-level problem-solving skills across various scientific disciplines, while MatSciBench specifically evaluates understanding and application of materials science principles. DiscoveryBench, conversely, emphasizes automated data analysis and hypothesis generation, simulating the process of scientific discovery. These benchmarks move beyond simple recall, requiring models to demonstrate reasoning and application of scientific knowledge, rather than merely identifying known facts.

Current scientific benchmarks are evolving to assess higher-order reasoning skills beyond basic question answering. These new evaluations necessitate that models demonstrate the ability to integrate information from multiple sources, a process requiring more than simple information retrieval. Specifically, models are challenged to formulate testable hypotheses based on provided data and existing knowledge, and then critically evaluate evidence – both supporting and contradictory – to validate or refine those hypotheses. This moves the focus from recognizing correct answers to simulating the core processes of scientific inquiry, demanding capabilities in data synthesis, inference, and evidence-based reasoning.

DeepScholar-Bench and ScholarEval are designed to evaluate a language model’s capacity for complex information processing within a research context. DeepScholar-Bench assesses the ability to contextualize novel research ideas by requiring models to identify relevant prior work and articulate the contribution of a proposed concept. ScholarEval focuses on verifying factual claims against a corpus of scientific literature, specifically testing whether a model can accurately identify supporting or contradictory evidence for a given statement. Both benchmarks utilize datasets constructed from real scientific publications and rely on metrics that measure the precision and recall of information retrieval and the logical consistency of reasoning, moving beyond simple fact verification to evaluate a model’s understanding of scientific argumentation.

Area under the curve (AUC) scores demonstrate that model performance improves with increased contextual information, with notable variations across domain categories: bioactive, mechanical, and nanomaterials.

Granular Evaluation: Deconstructing Claims for Precise Assessment

Automated grading assesses model performance through a claim-based approach, wherein generated outputs are deconstructed into individual claims and then compared to established ground truth data. This granular methodology moves beyond holistic scoring by evaluating the factual accuracy and logical consistency of each claim. By isolating performance at the claim level, the system facilitates the identification of specific strengths and weaknesses within a model’s reasoning process, offering a more detailed diagnostic than traditional evaluation metrics. This approach allows for targeted improvements and a nuanced understanding of a model’s capabilities, rather than a single aggregate score.
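Claim-level grading of this kind reduces naturally to precision, recall, and F1. The following is a minimal sketch, assuming the judging step has already produced two lists of yes/no verdicts; the names `ClaimScore` and `score_claims` are hypothetical, and the source does not state that the benchmark computes its scores exactly this way.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClaimScore:
    precision: float
    recall: float
    f1: float


def score_claims(generated_supported: List[bool], truth_covered: List[bool]) -> ClaimScore:
    """Score one model output at the claim level.

    generated_supported[i]: is generated claim i supported by the ground truth?
    truth_covered[j]: is ground-truth claim j covered by some generated claim?
    """
    # Precision penalizes extraneous claims; recall penalizes omitted ones.
    precision = sum(generated_supported) / len(generated_supported) if generated_supported else 0.0
    recall = sum(truth_covered) / len(truth_covered) if truth_covered else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return ClaimScore(precision, recall, f1)


# Illustrative verdicts only: 3 of 4 generated claims are supported,
# and 3 of 5 ground-truth claims are covered.
score = score_claims([True, True, False, True], [True, True, True, False, False])
print(score)
```

In this example precision is 3/4 and recall is 3/5, which gives an F1 of 2/3. Keeping the two verdict lists separate makes it visible whether a low score comes from extraneous claims or from omitted ones.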

The automated grading system utilizes GPT-5 to perform both claim extraction from generated text and subsequent judgment of those claims against established ground truth. This dual application of GPT-5 streamlines the evaluation process, reducing the need for manual review and enabling assessment of a larger volume of generated content. By automating both the identification of core assertions and their verification, the system achieves efficiency and scalability, allowing for consistent and repeatable performance measurement across diverse models and contexts. The reliance on a single model, GPT-5, for both tasks minimizes potential inconsistencies arising from differing evaluation criteria.

Model performance is quantified with the F1 score and the area under the curve (AUC). The results in Figure 3 show that AUC varies with both the model and the evaluation context. GPT-5.4 reaches an F1 score of approximately 0.70 in certain contexts; because that figure is context-dependent, it should be read alongside the AUC variations rather than on its own.
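One plausible reading of the AUC here is the area under a score-versus-context curve, summarizing performance across all disclosure levels in a single number. The sketch below assumes that reading and uses the trapezoidal rule, normalized by the range of context fractions; the function name `auc_over_context` and the sample values are illustrative, not reported results.

```python
from typing import Sequence


def auc_over_context(fractions: Sequence[float], scores: Sequence[float]) -> float:
    """Normalized trapezoidal area under a score-vs-context curve.

    fractions: share of the source document disclosed at each step.
    scores: the metric (for example F1) measured at each step.
    The result lies on the same scale as the scores themselves.
    """
    if len(fractions) != len(scores) or len(fractions) < 2:
        raise ValueError("need at least two (fraction, score) pairs of equal length")
    points = sorted(zip(fractions, scores))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("context fractions must cover a non-zero range")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # area of one trapezoid
    return area / span


# Made-up scores at three disclosure levels, for illustration only.
print(auc_over_context([0.0, 0.5, 1.0], [0.4, 0.6, 0.8]))
```

Normalizing by the covered range keeps the value comparable between runs that sample different sets of disclosure levels.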

An imperfect projection of the ground truth may omit valid claims or introduce extraneous ones; comparing the two claim by claim exposes both kinds of error.

The Foundation of Discovery: A Dynamically Maintained Dataset

The foundation of these materials science benchmarks rests upon a meticulously curated dataset drawn from openly accessible articles published by Springer Nature. This commitment to open access ensures broad usability and facilitates reproducibility for researchers globally. More than simply a large collection of papers, the dataset is actively maintained, guaranteeing its relevance and currency in a rapidly evolving field. By leveraging recent publications, the benchmarks accurately reflect the current state of materials science research, moving beyond static, potentially outdated evaluations. This dynamic approach allows for meaningful comparisons of new methodologies and models against the most up-to-date scientific understanding, fostering innovation and accelerating discovery.

The benchmark dataset encompasses a remarkably wide range of material science disciplines, moving beyond singular focuses to incorporate nanomaterials – exploring structures at the atomic and molecular scale – alongside bioactive materials designed for interaction with biological systems, and conventional mechanical materials crucial for engineering applications. This deliberate diversity isn’t merely comprehensive; it reflects the increasingly interdisciplinary nature of modern materials research, where breakthroughs often occur at the intersection of these fields. By including data from such varied categories, the benchmark allows for robust evaluation of research tools across a broad spectrum of scientific inquiry, facilitating the development of solutions applicable to diverse challenges – from targeted drug delivery and advanced sensors to high-performance structural components and sustainable energy technologies.

ResearcherBench elevates materials science evaluation by prioritizing sophisticated deep research information retrieval. This benchmark doesn’t simply assess whether a system can find relevant papers; it probes the ability to synthesize information from complex scientific literature, identifying nuanced relationships and critical details often buried within lengthy texts and intricate data sets. Such capabilities are increasingly vital as materials science advances, demanding researchers navigate an ever-expanding body of knowledge to solve multifaceted challenges – from designing novel alloys with tailored properties to engineering biocompatible implants with optimized performance. By focusing on this deeper level of information access, ResearcherBench provides a more robust and realistic assessment of a system’s potential to accelerate materials discovery and innovation, moving beyond superficial keyword searches to true scientific understanding.

Beyond Performance Metrics: Charting a Course for True Innovation

InnovatorBench represents a significant leap forward in assessing artificial intelligence for scientific advancement by moving beyond simple task completion to evaluate complete innovation cycles. This novel framework doesn’t merely check if an AI can, for instance, predict a molecular property; instead, it gauges the entire process, from hypothesis generation and experimental design to validation and ultimately, the potential for real-world application. By simulating a complete scientific workflow, including crucial steps like addressing potential confounding factors and demonstrating robustness, InnovatorBench identifies AI systems capable of truly novel discovery – systems that can not only generate insights, but also guide experiments and convincingly demonstrate their impact. This holistic approach provides a more nuanced understanding of an AI’s capabilities, fostering development that prioritizes practical scientific breakthroughs over isolated performance metrics and ultimately bridging the gap between automated discovery and tangible progress.

Despite advancements in artificial intelligence capable of generating novel scientific hypotheses, null hypothesis testing remains critical for verifying their validity. This established statistical method provides a rigorous framework for evaluating AI-driven insights: a claimed effect is accepted only when the data give sufficient grounds to reject the null hypothesis, which is often the assumption of no effect. Researchers use it to estimate how probable results at least as extreme as those observed would be if the null hypothesis were true, a crucial step in preventing the propagation of false positives. By demanding statistically significant evidence against the null hypothesis, scientists can assess the reliability of AI-generated findings and check that reported discoveries are robust and replicable, safeguarding the integrity of the scientific process even as it becomes increasingly reliant on automated systems.
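As a concrete instance of null hypothesis testing, an exact permutation test asks how often a difference at least as large as the observed one would arise if group labels were assigned by chance. This is a minimal sketch of that general technique, not a procedure described in the source; the function name and the measurements are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def exact_permutation_p_value(treated: Sequence[float], control: Sequence[float]) -> float:
    """One-sided exact permutation test on the difference of means.

    Null hypothesis: group labels are exchangeable (no effect).
    Returns the share of relabelings whose mean difference is at least
    as large as the observed one. Only practical for small samples.
    """
    if not treated or not control:
        raise ValueError("both groups must be non-empty")
    pooled = list(treated) + list(control)
    observed = mean(treated) - mean(control)
    extreme = 0
    total = 0
    for chosen in combinations(range(len(pooled)), len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements for an AI-proposed treatment versus a control.
p = exact_permutation_p_value([0.90, 0.80, 0.85], [0.50, 0.60, 0.55])
print(f"p-value: {p}")
```

With three values per group there are only 20 possible relabelings, so the smallest attainable p-value is 1/20; small samples cap how strong the evidence against the null hypothesis can be.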

The trajectory of AI-driven scientific discovery hinges not merely on developing increasingly sophisticated algorithms, but on establishing rigorous and evolving methods to assess their genuine impact. Current evaluation metrics often focus on narrow benchmarks, failing to capture the holistic value of a scientific contribution – its creativity, robustness, and potential for real-world application. Continually refining these evaluation boundaries – incorporating null hypothesis testing, assessing end-to-end innovation, and demanding evidence of practical benefit – is therefore paramount. This iterative process of assessment unlocks AI’s full potential, transforming it from a tool for data analysis into a collaborative partner capable of accelerating breakthroughs in fields ranging from medicine and materials science to climate modeling and beyond, ultimately offering solutions to complex global challenges.

The pursuit of scientific discovery, as outlined in ProjectionBench, isn’t merely about recalling established facts; it’s about navigating uncertainty and projecting potential outcomes – a process inherently tied to the passage of time. Robert Tarjan observed, “Ultimately, a program is only as good as the data structures it uses.” This resonates deeply with the framework’s emphasis on evaluating an LLM’s ability to synthesize knowledge and reason – its internal ‘data structures’ – to anticipate experimental results. The benchmark doesn’t assess what a model knows, but how gracefully it extrapolates, effectively testing its capacity to age – or evolve – beyond initial conditions. The challenge lies not in preventing decay, but in building systems capable of productive transformation.

What Lies Ahead?

The pursuit of automated scientific hypothesis generation, as illuminated by this work, inevitably encounters the limitations inherent in any system attempting to model complex phenomena. ProjectionBench offers a valuable metric – the capacity to anticipate experimental outcomes – but anticipates only the current state of decay. Scientific knowledge isn’t static; it erodes, shifts, and occasionally rebuilds upon former foundations. The benchmark’s utility will diminish as the underlying data ages, necessitating continuous refinement and, crucially, the inclusion of negative results – the evidence of paths not taken.

A persistent challenge remains: differentiating genuine insight from sophisticated pattern matching. Large language models excel at identifying correlations, but correlation isn’t causation, and projecting a known trend isn’t innovation. Future iterations of this work must focus on assessing the model’s ability to extrapolate beyond existing data, to formulate hypotheses that are not merely plausible extensions of the present, but genuinely novel departures. This isn’t about achieving perfect prediction, but measuring the graceful acceptance of inevitable error.

Ultimately, the field must acknowledge that ‘scientific reasoning’ within an artificial system is an artifact, a temporary phase of temporal harmony. The true metric isn’t accuracy but resilience: the capacity to adapt, recalibrate, and continue formulating hypotheses even in the face of persistent uncertainty. The benchmark, like the scientific method itself, is a tool for managing entropy, not escaping it.

Original article: https://arxiv.org/pdf/2605.30284.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-31 09:07