Author: Denis Avetisyan
A new benchmark reveals the surprising limitations of large language models when it comes to forecasting the results of real-world scientific studies.
![A research workflow integrating large language model predictions into experimental design demonstrably accelerates scientific discovery by prioritizing high-potential experiments, identified through LLM-based outcome forecasting, and preemptively filtering resource-intensive investigations unlikely to yield significant results, thereby optimizing the allocation of empirical validation efforts.](https://arxiv.org/html/2604.10718v1/x2.png)
SciPredict assesses large language models’ ability to predict experimental outcomes in natural sciences, highlighting critical deficiencies in both accuracy and prediction confidence calibration.
Despite the increasing capacity of large language models (LLMs) to demonstrate scientific knowledge, their ability to proactively predict the outcomes of empirical experiments – a task potentially exceeding human capabilities – remains largely unexplored. To address this gap, we introduce SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?, a benchmark comprising 405 tasks spanning physics, biology, and chemistry. Our evaluations reveal that while some frontier models achieve accuracies comparable to human experts (around 20%), both models and human experts struggle with reliable prediction, failing to accurately assess the trustworthiness of their own forecasts. This raises a critical question: can LLMs not only make predictions about experimental results, but also reliably know when those predictions are likely to be correct – a necessity for truly accelerating scientific discovery?
The Limits of Linguistic Mimicry in Scientific Reasoning
While Large Language Models (LLMs) excel at processing and generating human language – composing text, translating languages, and even writing different kinds of creative content – their abilities plateau when confronted with tasks demanding genuine scientific understanding. These models operate by identifying patterns and relationships within vast datasets of text, enabling them to mimic human language proficiency. However, this proficiency doesn’t necessarily translate into a capacity for causal reasoning or the ability to apply fundamental scientific principles to novel situations. Consequently, LLMs often struggle with problems requiring an understanding of physical laws, chemical reactions, or biological processes, highlighting a critical gap between linguistic competence and true scientific intelligence. The core limitation lies in their reliance on correlation rather than causation, meaning they can identify patterns in data but lack the ability to accurately predict outcomes based on underlying mechanisms.
Despite advancements in natural language processing, current Large Language Models (LLMs) exhibit limited capacity when tasked with predicting the outcomes of empirical experiments. Recent evaluations reveal a surprisingly low accuracy rate, ranging from just 14 to 26 percent, indicating a significant gap between linguistic proficiency and genuine scientific reasoning. This deficiency substantially hinders the potential of LLMs as tools for accelerating scientific discovery; while adept at processing and summarizing existing knowledge, they struggle to extrapolate from data and formulate accurate predictions about novel phenomena. The inability to reliably forecast experimental results suggests that these models primarily excel at pattern recognition within text, rather than possessing a deeper understanding of the underlying scientific principles at play.
The limitations of current Large Language Models in scientific prediction necessitate the development of a comprehensive benchmark that transcends simple linguistic proficiency. Existing evaluations often prioritize the ability to mimic scientific language – summarizing papers or answering definitional questions – rather than assessing genuine predictive power regarding experimental outcomes. A truly robust benchmark would demand models not merely describe scientific concepts, but actively forecast results, requiring a deeper understanding of causal relationships and underlying principles. Such a benchmark would move beyond surface-level analysis, incorporating diverse scientific domains, varying levels of complexity, and quantifiable metrics to accurately gauge an LLM’s capacity for genuine scientific reasoning and discovery, ultimately pinpointing areas where these models fall short and guiding future development efforts.

SciPredict: A Benchmark for Empirically-Grounded Reasoning
SciPredict is a newly developed benchmark intended to evaluate the capacity of Large Language Models (LLMs) to accurately forecast the results of empirical scientific experiments. The benchmark is distinguished by its focus on assessing predictive reasoning, requiring LLMs to move beyond simply recalling facts and instead apply scientific principles to novel situations. It achieves this by presenting LLMs with descriptions of experimental setups and asking them to predict the observed outcomes. The scope of SciPredict is multi-disciplinary, encompassing experiments drawn from the fields of Physics, Biology, and Chemistry, thereby providing a comprehensive evaluation of LLM performance across diverse scientific domains.
SciPredict employs a variety of question types to comprehensively assess reasoning capabilities. Multiple Choice Questions (MCQs) evaluate the selection of correct hypotheses from a predefined set, while Numerical Value Questions require precise quantitative predictions based on experimental parameters. Free Form Questions demand generative responses, assessing the model’s ability to articulate reasoning and provide justifications for its predictions in natural language. This multi-faceted approach allows for granular evaluation of LLMs, differentiating strengths and weaknesses across various reasoning skills – including hypothesis selection, quantitative analysis, and explanatory capabilities – beyond simple accuracy metrics.
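As a rough illustration of how these three formats could be represented and scored, consider the minimal sketch below. The field names, the numeric tolerance, and the exact-match placeholder for free-form answers are assumptions made for illustration, not SciPredict’s actual schema or grading logic.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BenchmarkItem:
    question: str
    qtype: str                           # "mcq" | "numeric" | "free_form"
    gold: str                            # gold answer: choice label, number, or text
    choices: Optional[List[str]] = None  # populated only for MCQs

def score(item: BenchmarkItem, prediction: str, rel_tol: float = 0.05) -> float:
    """Return 1.0 for a correct prediction, 0.0 otherwise."""
    if item.qtype == "mcq":
        # Hypothesis selection: compare chosen option labels.
        return float(prediction.strip().upper() == item.gold.strip().upper())
    if item.qtype == "numeric":
        # Quantitative prediction: accept values within a relative tolerance.
        try:
            pred, gold = float(prediction), float(item.gold)
        except ValueError:
            return 0.0
        return float(abs(pred - gold) <= rel_tol * abs(gold))
    # Free-form answers generally need a rubric or judge model;
    # exact string match is only a placeholder here.
    return float(prediction.strip().lower() == item.gold.strip().lower())
```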
The SciPredict benchmark incorporates experimental data from the core scientific disciplines of Physics, Biology, and Chemistry to provide a comprehensive assessment of large language models’ empirical reasoning capabilities. This multi-disciplinary approach ensures that models are not evaluated solely on their performance within a single scientific domain, mitigating potential biases and fostering generalization. The inclusion of experiments from these fields necessitates models to demonstrate understanding of varied experimental setups, data interpretation techniques, and scientific principles specific to each discipline, thereby establishing a robust and representative evaluation of reasoning skill.
Data Leakage Prevention is a core component of the SciPredict benchmark, implemented through rigorous dataset construction and evaluation protocols. The benchmark employs a strict separation between training and testing data, ensuring no experimental results or closely related information appear in the training set that could allow a model to simply memorize answers instead of performing empirical reasoning. Specifically, SciPredict utilizes a multi-stage filtering process to identify and remove potentially leaked information, including exact string matches, near-duplicate experimental setups, and results reported in datasets predating the test experiment. This prevents models from exploiting spurious correlations and guarantees evaluations accurately reflect genuine reasoning capabilities, rather than memorization of prior data.
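The sketch below illustrates the kind of multi-stage filter the text describes – a date cutoff, exact string matching, and near-duplicate detection – using only standard-library tools. The record fields, similarity threshold, and function names are assumptions, not SciPredict’s published pipeline.

```python
import difflib
from datetime import date

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag experimental setups whose text similarity exceeds a threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def filter_leaks(test_items: list, corpus: list, cutoff: date) -> list:
    """Keep only test items that cannot be answered by memorizing the corpus."""
    clean = []
    for item in test_items:
        if item["published"] < cutoff:
            continue  # result already reported before the training cutoff
        if any(item["setup"] == doc["text"] for doc in corpus):
            continue  # exact string match against training text
        if any(is_near_duplicate(item["setup"], doc["text"]) for doc in corpus):
            continue  # near-duplicate experimental setup
        clean.append(item)
    return clean
```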

Establishing Ground Truth: Human Baselines and Expert Validation
SciPredict utilizes a Human Expert Baseline as a foundational component of its validation process. This baseline is constructed by having qualified scientists independently respond to each benchmark question, creating a set of verified, correct answers. LLM predictions are then directly compared against this human-generated baseline to quantify performance. This direct comparison methodology allows for a statistically rigorous assessment of LLM capabilities, moving beyond relative scoring to establish absolute performance levels relative to human expertise. The Human Expert Baseline serves not only as a performance target but also as a crucial point of reference for interpreting LLM results and identifying areas where models demonstrate strengths or weaknesses in scientific reasoning.
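One standard way to make such a paired, question-by-question comparison statistically rigorous is an exact McNemar test on the items where model and expert disagree. The sketch below shows this approach; it is offered as a plausible analysis under that assumption, not the benchmark’s documented procedure.

```python
from scipy.stats import binomtest

def compare_to_baseline(model_correct: list, human_correct: list):
    """Paired exact McNemar test on per-question 0/1 correctness vectors."""
    model_only = sum(m and not h for m, h in zip(model_correct, human_correct))
    human_only = sum(h and not m for m, h in zip(model_correct, human_correct))
    n = model_only + human_only
    # Under the null of equal accuracy, discordant questions split 50/50.
    p_value = binomtest(model_only, n, 0.5).pvalue if n else 1.0
    return model_only, human_only, p_value
```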
Each question and corresponding answer within the SciPredict benchmark undergoes a multi-stage Expert Review process conducted by scientists with relevant subject matter expertise. This review verifies the scientific accuracy of the question’s premise and the provided answer, ensuring alignment with established scientific consensus. Furthermore, the review assesses the clarity and unambiguousness of both the question and answer wording, mitigating potential misinterpretations and ensuring a consistent evaluation metric. Any identified inaccuracies or ambiguities result in question/answer revision or removal from the benchmark dataset, maintaining a high standard of scientific rigor and reliability.
Rigorous validation of benchmark results is essential to mitigate the risk of inaccurate conclusions regarding Large Language Model (LLM) capabilities. Without these checks, observed performance gains may be attributable to flaws in the benchmark itself – such as ambiguous questions or incorrect answers – rather than genuine improvements in LLM scientific reasoning. Establishing a reliable benchmark requires actively identifying and correcting these issues through expert review, ensuring that reported performance accurately reflects the LLM’s ability to solve scientifically valid problems and avoids falsely inflating perceived competency. This process directly supports the interpretability and trustworthiness of SciPredict’s evaluations.
Analyzing LLM performance against the Human Expert Baseline and validated benchmark questions allows for the identification of specific scientific reasoning skills where LLMs demonstrate proficiency and those where they struggle. This granular analysis extends beyond overall accuracy scores to pinpoint weaknesses in areas such as causal inference, quantitative reasoning, or the application of specific scientific principles. By comparing LLM responses to expert judgments on a question-by-question basis, researchers can determine the types of scientific problems LLMs consistently solve correctly, and conversely, the specific areas requiring further development and refinement in LLM architectures and training data. This detailed understanding is crucial for targeted improvements and a more realistic assessment of LLM capabilities in scientific domains.

Calibration and the Future of Scientific Artificial Intelligence
A reliable artificial intelligence doesn’t simply make predictions; it accurately reflects its certainty about those predictions. This alignment between a model’s stated confidence and its actual accuracy is known as calibration, and it’s a cornerstone of trustworthy AI systems. SciPredict directly assesses this critical feature, moving beyond simple accuracy metrics to evaluate whether a model is appropriately humble when uncertain and confidently correct when it is. Poorly calibrated models can mislead researchers, potentially leading to flawed conclusions or wasted resources, even if their overall accuracy appears reasonable. Therefore, the ability to rigorously evaluate calibration, as SciPredict provides, is essential for developing AI tools that can be reliably integrated into the scientific process and build trust among researchers.
Large language models demonstrate a surprising capacity for predicting empirical outcomes on the SciPredict benchmark, achieving accuracy levels – around 20% – comparable to those of human experts in the field. However, this performance is tempered by a significant lack of calibration; the models’ stated confidence in their predictions does not reliably reflect their actual correctness. While a model might confidently assert a particular outcome, its success rate doesn’t consistently align with that reported confidence, raising concerns about the trustworthiness of its predictions and highlighting a critical area for improvement in scientific AI development. This discrepancy suggests that while LLMs can often arrive at correct answers, they struggle to accurately assess the certainty of those answers, potentially leading to overreliance on inaccurate predictions.
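A common way to quantify this confidence–accuracy gap is expected calibration error (ECE), sketched below. The bin count is an arbitrary choice, and nothing here is taken from the paper beyond the general notion of calibration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Occupancy-weighted average |accuracy - confidence| gap across bins."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin is closed on the left so confidence 0.0 is not dropped.
        mask = (conf >= lo if i == 0 else conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return float(ece)

# A model can match the ~20% expert-level accuracy yet be badly calibrated:
# 20% accuracy at a uniform 90% stated confidence gives an ECE of 0.7.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0, 0]))
```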
Recent investigations demonstrate that augmenting large language models with expertly curated background knowledge yields a measurable improvement in predictive accuracy, approximately 3%. This suggests that while LLMs possess substantial pattern recognition capabilities, their performance is notably enhanced when provided with foundational scientific context. The incorporation of this knowledge isn’t simply about increasing the dataset size; rather, it’s about providing the models with the necessary conceptual framework to better interpret and extrapolate from existing data. This targeted knowledge infusion represents a practical strategy for mitigating some of the inherent limitations of LLMs and moving towards more reliable and insightful AI tools for scientific discovery.
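A minimal sketch of what such knowledge augmentation might look like in practice – prepending curated background context to the outcome-prediction prompt – appears below. The prompt template and function name are illustrative assumptions, not the paper’s protocol.

```python
from typing import Optional

def build_prompt(setup: str, background: Optional[str] = None) -> str:
    """Prepend curated background knowledge to an outcome-prediction prompt."""
    context = f"Background knowledge:\n{background}\n\n" if background else ""
    return (
        f"{context}Experimental setup:\n{setup}\n\n"
        "Predict the outcome of this experiment and state your "
        "confidence as a probability between 0 and 1."
    )
```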
Analysis of SciPredict results reveals a noteworthy connection between a large language model’s ability to predict scientific outcomes and its broader reasoning capabilities. Specifically, the model’s accuracy on SciPredict, when provided with necessary background knowledge (NBK), demonstrates a moderate positive correlation – a Pearson r of approximately 0.46 – with its performance on the Humanity’s Last Exam (HLE) benchmark. This suggests that an LLM’s capacity for accurate empirical prediction isn’t simply rote memorization, but is linked to a more general ability to reason, infer, and synthesize information – a crucial characteristic for advancing scientific discovery through artificial intelligence. The finding implies that improvements in general reasoning skills within LLMs could translate directly into more reliable and insightful predictions within specific scientific domains.
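For readers who want the correlation made concrete, the sketch below computes Pearson’s r from per-model score pairs. The numbers are hypothetical placeholders, not the paper’s data.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: covariance normalized by both standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical per-model (SciPredict-with-NBK accuracy, HLE score) pairs.
scipredict_nbk = [0.14, 0.17, 0.20, 0.23, 0.26]
hle            = [0.04, 0.09, 0.07, 0.12, 0.14]
print(f"r = {pearson_r(scipredict_nbk, hle):.2f}")
```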
SciPredict serves as a crucial benchmark in the evolving landscape of scientific artificial intelligence, not merely by assessing predictive capabilities, but by meticulously detailing both successes and shortcomings of current large language models. This dual focus is paramount; identifying areas where AI excels allows for strategic implementation and focused development, while acknowledging weaknesses – particularly regarding calibration and reliable confidence scoring – guides researchers toward more robust methodologies. The platform’s capacity to pinpoint these strengths and vulnerabilities fosters the creation of AI tools that aren’t simply accurate, but demonstrably trustworthy, ultimately accelerating scientific discovery through dependable computational assistance. By offering a nuanced evaluation, SciPredict transcends a simple performance metric, functioning instead as a compass for building more responsible and effective AI collaborators in research.

The SciPredict benchmark, as detailed in the article, rigorously assesses Large Language Models not merely on whether they predict experimental outcomes correctly, but on how confidently they assign probabilities to those predictions. This emphasis on calibration – a measure of trustworthiness – resonates deeply with the pursuit of mathematical purity in computation. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” Similarly, a predictive model, no matter how complex, gains true value only when its confidence levels align with actual accuracy; a model that consistently overestimates or underestimates its certainty is, fundamentally, flawed. The pursuit of a “well-calibrated” LLM, therefore, mirrors the elegant simplicity of a provable algorithm – a solution that isn’t simply “working on tests,” but demonstrably correct within defined parameters.
Beyond Prediction: Charting a Course for Scientific LLMs
The introduction of SciPredict exposes a chasm between the appearance of competence in large language models and genuine predictive capability within the natural sciences. The observed deficiencies in calibration are particularly noteworthy; a model may offer a prediction, but without a reliable assessment of its own uncertainty, the output remains, at best, an educated guess – and frequently, a statistically unsupported one. The pursuit of higher raw accuracy, while seemingly logical, is ultimately superficial if not coupled with demonstrable trustworthiness in its stated confidence.
Future work must prioritize not merely the success of predictions, but the provable relationship between predicted probabilities and actual outcomes. The field should shift from treating LLMs as empirical pattern-matchers to demanding a more mathematically grounded understanding of why a prediction is made. Benchmarks should not reward merely correct answers, but penalize overconfident incorrectness, forcing a more rigorous internal representation of scientific uncertainty.
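A standard proper scoring rule that does exactly this is the Brier score, which charges a model quadratically for the gap between its stated confidence and the actual outcome. The sketch below is a generic illustration of that metric, not a SciPredict-specific proposal.

```python
import numpy as np

def brier_score(confidences, correct) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    conf = np.asarray(confidences, dtype=float)
    out = np.asarray(correct, dtype=float)
    return float(np.mean((conf - out) ** 2))

# A confidently wrong answer costs far more than a hedged wrong answer.
print(brier_score([0.95], [0]))  # 0.9025
print(brier_score([0.55], [0]))  # 0.3025
```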
The enduring question is not whether these models can mimic scientific reasoning, but whether they can embody it. Until a demonstrable link is established between internal model state and probabilistic correctness – a linkage verifiable through formal methods, not just empirical testing – the promise of LLMs in scientific discovery will remain largely symbolic. The elegance of a solution, after all, lies not in its complexity, but in its demonstrable truth.
Original article: https://arxiv.org/pdf/2604.10718.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/