Can AI Truly Do Science? A New Benchmark Puts Problem-Solving to the Test

Author: Denis Avetisyan


Researchers have unveiled a rigorous evaluation framework designed to assess the scientific reasoning capabilities of artificial intelligence agents.

The analysis quantifies task complexity within the COMPOSITE-STEM domain, revealing the distribution of task counts at a granular level and providing a basis for assessing algorithmic scalability.

COMPOSITE-STEM is a 70-task benchmark, curated by experts, for evaluating AI performance on complex STEM problems.

Despite increasing promise for accelerating scientific discovery, robust evaluation of AI agents remains a critical bottleneck, with existing benchmarks quickly becoming saturated and unable to assess complex reasoning. To address this gap, we introduce COMPOSITE-STEM, a new benchmark comprising 70 expert-written STEM tasks spanning physics, biology, chemistry, and mathematics, evaluated using both exact-match grading and a novel LLM-as-a-jury protocol. Our findings reveal that even frontier models achieve only 21% on this challenging benchmark, highlighting the need for continued development in AI-driven scientific problem-solving. Will benchmarks like COMPOSITE-STEM prove essential for unlocking the full potential of AI to augment and accelerate scientific progress?


The Crisis of Quantifiable Intelligence

The field of artificial intelligence currently faces a significant hurdle in its pursuit of increasingly capable agents: a dearth of standardized evaluation benchmarks. Without consistent metrics and universally accepted tests, comparing the performance of different AI systems becomes a problematic exercise, often relying on results derived from disparate datasets or task-specific evaluations. This lack of comparability doesn’t simply impede academic progress; it actively hinders the development of robust and reliable AI, as improvements become difficult to quantify and validate across different architectures and training methodologies. Consequently, the absence of standardized benchmarks slows the pace of innovation, creating ambiguity in determining which approaches genuinely represent advancements in artificial intelligence and which merely excel within a limited, non-generalizable context.

Current artificial intelligence evaluation techniques often fall short when tasked with gauging complex scientific reasoning abilities. While agents may excel at tasks requiring pattern recognition or data recall, reliably assessing their capacity for genuine understanding – the ability to formulate hypotheses, design experiments, and interpret results – remains a significant challenge. This limitation yields unreliable performance metrics, particularly in crucial domains like drug discovery, materials science, and climate modeling, where nuanced judgment and innovative problem-solving are paramount. Consequently, reported successes may be overstated, hindering genuine progress and potentially leading to flawed conclusions derived from systems that mimic understanding without actually possessing it. The difficulty lies in constructing benchmarks that go beyond simple question-answering and instead demand the application of scientific principles to novel, ambiguous scenarios – a task that requires AI to not just know science, but to do science.

Determining genuine scientific understanding in artificial intelligence remains a significant hurdle due to the limitations of current evaluation methods. An agent might achieve high scores on benchmarks by identifying correlations within training data – effectively memorizing patterns – without possessing the capacity for actual reasoning or generalization to novel scenarios. This poses a critical problem; superficial performance can mask a lack of deep comprehension, leading to overestimation of an AI’s capabilities and potential failures when confronted with previously unseen problems. Consequently, the development of robust evaluation frameworks is paramount, requiring assessments that move beyond pattern recognition and probe for a true grasp of underlying scientific principles, causal relationships, and the ability to apply knowledge flexibly – ensuring that progress in AI reflects genuine intelligence, not simply sophisticated memorization.

The table summarizes performance metrics for each model, allowing for a comparative assessment of their effectiveness.

Introducing COMPOSITE-STEM: A Rigorous Test of Scientific Acumen

COMPOSITE-STEM is a benchmark designed to evaluate the scientific reasoning capabilities of large language models. It consists of 70 distinct tasks, each authored by experts in their respective fields, and covers the core disciplines of Physics, Biology, Chemistry, and Mathematics. These tasks are not simple factual recall questions; rather, they require applying scientific principles to solve problems and demonstrate understanding across multiple domains. The benchmark’s composition is intended to provide a comprehensive and rigorous assessment of a model’s ability to perform scientific reasoning, going beyond performance on individual, isolated topics.

The creation of COMPOSITE-STEM’s benchmark tasks prioritized scientific rigor through a process of Expert Task Curation. This involved commissioning subject matter experts – holding advanced degrees and professional experience in Physics, Biology, Chemistry, and Mathematics – to individually author each of the 70 tasks. Tasks were then subjected to a multi-stage review process, including verification of factual accuracy, assessment of task clarity, and validation of solution methodologies. This curation process was designed to ensure that each task accurately reflects established scientific principles and demands genuine reasoning capabilities, rather than exploiting superficial patterns or statistical correlations present in training data. The resulting benchmark is therefore intended to provide a reliable measure of a model’s true scientific understanding and problem-solving skills.
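To make the curation pipeline concrete, the sketch below shows one plausible machine-readable shape for an expert-curated task. The field names (domain, prompt, reference_answer, grading_mode, reviewers) and the example contents are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class GradingMode(Enum):
    EXACT_MATCH = "exact_match"   # objective answers with a single correct value
    LLM_JURY = "llm_as_a_jury"    # open-ended answers graded by multiple LLMs


@dataclass
class StemTask:
    """Hypothetical record for one expert-curated COMPOSITE-STEM task."""
    task_id: str
    domain: str                   # "physics" | "biology" | "chemistry" | "mathematics"
    prompt: str                   # the expert-written problem statement
    reference_answer: str         # the validated solution
    grading_mode: GradingMode
    reviewers: list[str] = field(default_factory=list)  # multi-stage review sign-offs


# Example instance; contents are invented purely for illustration.
task = StemTask(
    task_id="chem-017",
    domain="chemistry",
    prompt="Given the structure below, report the total number of hydrogen atoms.",
    reference_answer="350",
    grading_mode=GradingMode.EXACT_MATCH,
    reviewers=["expert_a", "expert_b"],
)
```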

Evaluation using the COMPOSITE-STEM benchmark demonstrates substantial performance differences among leading large language models. Specifically, claude-opus-4.6 achieved a Pass@1 rate of 21.4% across the 70-task suite. Preliminary results indicate that models such as GPT-5.4 and Grok-4.20-beta exhibit significantly lower Pass@1 rates, suggesting a considerable capability gap in scientific reasoning and problem-solving when compared to claude-opus-4.6 on this rigorous benchmark. Pass@1 represents the percentage of tasks solved correctly on the first attempt.
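Since Pass@1 is simply the fraction of the 70 tasks solved on the first attempt, the computation can be sketched in a few lines; the per-task correctness flags below are illustrative, not reported data.

```python
def pass_at_1(first_attempt_correct: list[bool]) -> float:
    """Fraction of tasks solved correctly on the very first attempt."""
    if not first_attempt_correct:
        return 0.0
    return sum(first_attempt_correct) / len(first_attempt_correct)


# Illustrative check: 15 of 70 tasks solved on the first try is roughly 21.4%.
results = [True] * 15 + [False] * 55
print(f"Pass@1 = {pass_at_1(results):.1%}")  # -> Pass@1 = 21.4%
```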

A Dual-Strategy Grading System for Unbiased Assessment

The evaluation framework utilizes a dual-strategy grading system tailored to question type. For objective questions – those with defined correct answers – Exact Match Grading is employed, providing a straightforward assessment of accuracy. However, for tasks requiring evaluation of meaning and reasoning – semantic correctness – LLM-as-a-Jury Grading is implemented. This method leverages multiple Large Language Models to independently assess responses, with a consensus-based approach used to determine a final grade, mitigating the biases of any single model and improving the reliability of subjective assessments.
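A minimal sketch of the dual-strategy idea follows, under simplifying assumptions: exact-match grading normalizes and compares strings, while the jury path polls several graders and takes a majority vote. The toy graders stand in for whatever LLM judges the framework actually calls.

```python
from collections import Counter
from typing import Callable


def exact_match(answer: str, reference: str) -> bool:
    """Objective questions: normalize whitespace and case, then compare directly."""
    return answer.strip().lower() == reference.strip().lower()


def llm_jury(answer: str, reference: str,
             graders: list[Callable[[str, str], str]]) -> bool:
    """Open-ended questions: each grader votes 'correct' or 'incorrect'; majority wins."""
    votes = Counter(g(answer, reference) for g in graders)
    return votes["correct"] > votes["incorrect"]


def grade(answer: str, reference: str, objective: bool,
          graders: list[Callable[[str, str], str]]) -> bool:
    return exact_match(answer, reference) if objective else llm_jury(answer, reference, graders)


# Toy graders standing in for independent LLM judges.
lenient = lambda a, r: "correct" if r.lower() in a.lower() else "incorrect"
strict = lambda a, r: "correct" if a.strip() == r.strip() else "incorrect"
print(grade("The value is 42", "42", objective=False, graders=[lenient, strict, lenient]))  # -> True
```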

AsymmetryZero is a core component of the grading framework, functioning as a mechanism to translate subjective expert preferences into quantifiable, auditable contracts. This is achieved by defining grading rubrics not as vague guidelines, but as formally specified conditions that an answer must meet to receive a particular score. These conditions are then encoded and executed programmatically, ensuring that all responses are evaluated against the same, consistent criteria. The resulting audit trail details precisely why a given answer received a particular grade, enabling review and verification of the grading process and mitigating bias. This approach facilitates reliable, reproducible assessment, regardless of the evaluator, and provides transparency into the scoring logic.
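The "rubric as executable contract" idea can be sketched as a list of named, programmatic conditions applied to an answer, with every pass/fail decision logged; the condition names and audit format below are invented for illustration and do not reflect AsymmetryZero's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricCondition:
    """One formally specified condition an answer must satisfy to earn its points."""
    name: str
    points: float
    check: Callable[[str], bool]


def grade_with_audit(answer: str, rubric: list[RubricCondition]) -> tuple[float, list[str]]:
    """Apply every condition and record exactly why each point was or was not awarded."""
    score, audit = 0.0, []
    for cond in rubric:
        passed = cond.check(answer)
        score += cond.points if passed else 0.0
        audit.append(f"{cond.name}: {'PASS' if passed else 'FAIL'} ({cond.points} pts)")
    return score, audit


# Hypothetical rubric for a short physics answer.
rubric = [
    RubricCondition("states final value", 1.0, lambda a: "9.8" in a),
    RubricCondition("includes SI units", 0.5, lambda a: "m/s^2" in a),
]
score, trail = grade_with_audit("g is approximately 9.8 m/s^2", rubric)
print(score)            # -> 1.5
print("\n".join(trail))
```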

The framework leverages tools such as RDKit to facilitate accurate evaluation of chemistry-based tasks. In a comparative analysis, claude-opus-4.6 successfully determined a hydrogen atom count of 350 within a given molecular structure. GPT-5.4, utilizing a custom Python script for analysis, reported a count of 399 for the same structure. This discrepancy highlights the importance of standardized evaluation tools and consistent methodologies when employing large language models for scientific assessments, and demonstrates the need for validation against established cheminformatics software like RDKit.
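To give a concrete sense of the kind of check RDKit enables here, the sketch below counts all hydrogens in a molecule supplied as SMILES. The ethanol SMILES is only a stand-in, since the article does not specify the benchmark structure in question.

```python
from rdkit import Chem


def count_hydrogens(smiles: str) -> int:
    """Count all hydrogen atoms (explicit and implicit) in a molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    mol_with_h = Chem.AddHs(mol)  # make implicit hydrogens explicit graph atoms
    return sum(1 for atom in mol_with_h.GetAtoms() if atom.GetAtomicNum() == 1)


# Illustrative stand-in molecule (ethanol), not the benchmark structure.
print(count_hydrogens("CCO"))  # -> 6
```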

Performance across the dense task-by-model grid varied by domain, with green indicating successful completion, red denoting failure, and gray representing unscored trials.

Toward Verifiable Progress: The Harbor Framework and its Implications

The Harbor Framework prioritizes reproducibility as a cornerstone of reliable AI evaluation. This is achieved through meticulous tracking of all experimental parameters, code versions, and environment configurations, enabling independent researchers to precisely recreate reported results. By facilitating this verification process, the framework moves beyond simply reporting performance to demonstrably proving it. This commitment to transparency not only builds confidence in the validity of AI benchmarks but also accelerates scientific progress by allowing researchers to build upon established findings with assurance. Ultimately, Harbor’s reproducibility features are designed to foster a more robust and trustworthy foundation for the rapidly evolving field of artificial intelligence, shifting the focus from anecdotal success to verifiable, repeatable outcomes.
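The bookkeeping this implies can be sketched as a run manifest that snapshots code version, environment, and parameters before an evaluation begins; the field names and layout below are assumptions for illustration, not Harbor's actual format.

```python
import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone


def build_run_manifest(params: dict) -> dict:
    """Snapshot everything needed to recreate an evaluation run (illustrative only)."""
    try:
        git_rev = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip() or "unknown"
    except OSError:
        git_rev = "unknown"
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_version": git_rev,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "parameters": params,
    }
    # A content hash lets independent researchers confirm they reran the same configuration.
    manifest["config_hash"] = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return manifest


print(json.dumps(build_run_manifest({"benchmark": "COMPOSITE-STEM", "seed": 0}), indent=2))
```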

The creation of a standardized evaluation platform, built by extending the capabilities of TerminalBench, represents a significant step towards more reliable and accelerated progress in artificial intelligence. Previously, comparing the performance of different agents was often hampered by variations in evaluation environments and metrics. This new platform addresses this challenge by providing a consistent and well-defined set of tasks and scoring methods, enabling researchers to directly assess and contrast the strengths and weaknesses of various AI systems. Such standardization not only streamlines the research process but also fosters greater transparency and reproducibility, allowing for independent verification of results and ultimately driving innovation in the field. By removing ambiguity in evaluation, the platform facilitates more focused development efforts and a clearer understanding of the capabilities of increasingly complex AI agents.

The Harbor framework distinguishes itself through a deliberately modular architecture, enabling seamless incorporation of diverse input types beyond traditional text. This flexibility is powerfully demonstrated by Multimodal Terminus-2, an extension of the platform capable of processing and responding to both textual and visual information. By accepting inputs like images alongside text prompts, the framework significantly broadens the range of skills that can be rigorously assessed in AI agents. This capability moves beyond evaluating purely linguistic understanding to encompass perception, visual reasoning, and the ability to integrate information from multiple sensory modalities – a critical step towards building more versatile and generally intelligent artificial systems. The ease with which multimodal inputs are integrated facilitates more comprehensive and realistic evaluations, pushing the boundaries of what can be measured in AI performance.
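As a rough sketch of what a multimodal task input might look like, the structure below pairs a text prompt with an optional image reference and flattens it into the text/image "parts" shape many multimodal APIs expect; this is an assumption about shape, not Multimodal Terminus-2's actual interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class MultimodalTaskInput:
    """Hypothetical input pairing a text prompt with an optional image attachment."""
    prompt: str
    image_path: Optional[Path] = None

    def to_messages(self) -> list[dict]:
        """Flatten into a generic text/image 'parts' message structure."""
        parts: list[dict] = [{"type": "text", "text": self.prompt}]
        if self.image_path is not None:
            parts.append({"type": "image", "path": str(self.image_path)})
        return [{"role": "user", "content": parts}]


# Invented example: a spectroscopy question that requires reading an attached figure.
example = MultimodalTaskInput(
    prompt="From the spectrum shown, identify the functional group responsible for the peak near 1700 cm^-1.",
    image_path=Path("ir_spectrum.png"),
)
print(example.to_messages())
```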

The pursuit of robust AI agents, as demonstrated by COMPOSITE-STEM, necessitates a commitment to verifiable truth. This benchmark, with its 70 expert-curated STEM tasks, isn’t merely assessing if an agent appears to solve problems, but whether its reasoning adheres to logical principles. This aligns perfectly with the sentiment expressed by David Hilbert: “One must be able to say everything that one wishes to say.” Just as a mathematical proof demands rigorous justification at every step, so too must an AI agent’s solution be demonstrably correct, not simply a plausible output. The Harbor framework, integral to COMPOSITE-STEM, offers a means of establishing this demonstrable correctness, turning potential solutions into verifiable truths.

Future Directions

The introduction of COMPOSITE-STEM, while a necessary step towards quantifiable assessment of AI scientific aptitude, merely highlights the vastness of what remains unknown. The benchmark’s reliance on expert-curated tasks, though currently unavoidable, introduces a subtle but critical dependency on human biases – a precarious foundation for judging true intelligence. The pursuit of a perfectly objective, universally valid scientific challenge is, perhaps, a philosophical conceit; however, striving for minimized subjective influence is paramount.

A key limitation resides in the static nature of the benchmark. Real scientific progress is iterative, demanding agents not simply solve problems, but generate them in order to identify the limitations of existing knowledge. Future iterations should prioritize dynamic evaluation, where agents propose experiments, interpret ambiguous data, and refine hypotheses – capabilities that demand a level of causal reasoning presently absent in most systems. Reproducibility remains the cornerstone of scientific validity; any result that cannot be consistently re-created is, at best, a statistical anomaly.

Ultimately, the value of such benchmarks lies not in declaring a ‘winner,’ but in rigorously exposing the shortcomings of current approaches. The Harbor framework, as a platform for evaluation, offers a valuable structure, but its true test will be its capacity to accommodate increasingly complex, open-ended challenges – ones that demand not merely calculation, but genuine insight.


Original article: https://arxiv.org/pdf/2604.09836.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
