Author: Denis Avetisyan
Researchers have created a dynamic benchmark, PRiSM, to push the boundaries of artificial intelligence in scientific problem-solving, moving beyond static datasets and simple question answering.

PRiSM is an agentic, multimodal benchmark designed to rigorously evaluate scientific reasoning via a scalable Python-grounded pipeline.
Existing benchmarks for evaluating vision-language models struggle to assess true scientific reasoning, often focusing on final answers rather than the underlying conceptual understanding and symbolic manipulation required in fields like physics and mathematics. To address this limitation, we introduce PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation, a dynamic and scalable platform featuring over 24,750 university-level problems generated via an agent-based pipeline. PRiSM uniquely leverages executable Python code for both ground truth generation and automated verification, enabling fine-grained auditing of model reasoning and revealing critical failure modes. Will this new benchmark catalyze the development of genuinely scientifically literate multimodal AI systems?
The Fragility of Simulated Understanding
Contemporary artificial intelligence systems, despite achieving impressive results in narrow tasks, exhibit a surprising fragility when confronted with complex scientific reasoning. These systems often demonstrate a lack of consistent logic, failing to arrive at correct conclusions even when presented with problems that are structurally similar to those they have already solved. This sensitivity to superficial changes – minor alterations in phrasing, the order of information, or the inclusion of irrelevant details – highlights a fundamental limitation in their ability to generalize knowledge and apply it flexibly. Unlike human scientists who can readily identify the core principles at play, current AI frequently fixates on surface-level features, leading to inconsistent and unreliable performance across even subtly varied scenarios. This brittleness poses a significant challenge to deploying AI as a reliable partner in scientific discovery, demanding a shift from mere accuracy to a more robust and adaptable form of reasoning.
Assessing the true intelligence of artificial intelligence in scientific contexts necessitates evaluation methods that move past traditional accuracy scores. Current benchmarks often prioritize achieving the correct answer in a specific instance, overlooking a system’s capacity to maintain consistent reasoning across subtly altered problems – a hallmark of robust intelligence. Consequently, researchers are developing new metrics focused on error recovery and consistency; a system isn’t merely judged on its successes, but on its ability to identify and correct mistakes, and to apply the same logical principles even when presented with variations in data or problem framing. This shift towards evaluating for resilience, rather than simply correctness, is crucial for building AI capable of genuine scientific discovery and reliable application in real-world scenarios, where perfect information and predictable conditions are rarely encountered.

Procedural Genesis: Cultivating a Robust Benchmark
The PRiSM dataset distinguishes itself through procedural generation, creating 24,750 unique problem instances spanning university-level physics and mathematics. This approach contrasts with static datasets by offering a potentially limitless supply of novel problems, mitigating issues of overfitting and memorization common in evaluating scientific AI. The generated problems are not pre-authored but are created algorithmically, allowing for control over problem characteristics and the creation of targeted evaluation sets. This dynamic generation enables researchers to assess an AI’s ability to generalize and solve unseen problems, rather than simply recognizing previously encountered examples. The dataset’s scale, combined with the procedural generation method, provides a robust platform for benchmarking and advancing the field of scientific AI.
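To make the procedural setup concrete, the sketch below shows how a seeded template can emit an effectively unbounded stream of distinct problem instances, each with a computable ground truth. The template, parameter range, and dictionary fields are illustrative assumptions, not PRiSM's actual schema.

```python
# A minimal sketch of procedural problem generation in the spirit described
# above; the template, parameter range, and field names are illustrative
# assumptions, not PRiSM's actual schema.
import random

def generate_projectile_problem(seed: int) -> dict:
    """Create one unique kinematics problem instance from a seeded template."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 50.0)   # launch speed in m/s, varied per instance
    g = 9.81                      # gravitational acceleration in m/s^2
    statement = (
        f"A ball is thrown straight up at {v0:.1f} m/s. "
        "How long does it take to return to its launch height?"
    )
    return {"seed": seed, "statement": statement, "answer": 2 * v0 / g}

# Every seed yields a structurally similar but numerically distinct problem,
# so memorizing individual instances buys a model very little.
for instance in (generate_projectile_problem(s) for s in range(3)):
    print(instance["statement"], f"-> {instance['answer']:.2f} s")
```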
The PrismAgent is a crucial element of the PRiSM benchmark, functioning as an automated pipeline for problem creation and validation. It utilizes Optical Character Recognition (OCR) technology to convert visually presented problems – such as those derived from textbooks or handwritten materials – into a machine-readable format. Following OCR processing, the agent generates corresponding Python code which serves as an automated verification system. This code executes the problem and compares the model’s solution against the known, correct answer, providing an objective assessment of performance without manual grading. The system supports the creation of a diverse range of problems and ensures consistent, scalable evaluation.
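As one illustration of the ingestion stage, the sketch below uses the open-source pytesseract wrapper around Tesseract to turn an image of a problem into machine-readable text. PRiSM's actual OCR backend is not specified in this summary, so the library choice and file name are assumptions made for the example.

```python
# A minimal sketch of the OCR ingestion step, assuming the open-source
# pytesseract wrapper (and an installed Tesseract binary); PRiSM's actual OCR
# backend and file layout are not specified here.
from PIL import Image
import pytesseract

def extract_problem_text(image_path: str) -> str:
    """Convert a scanned or photographed problem into machine-readable text."""
    return pytesseract.image_to_string(Image.open(image_path))

# Hypothetical input: a page scan containing a single physics problem.
raw_statement = extract_problem_text("problem_page.png")
print(raw_statement)
```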
The PRiSM benchmark provides complete, step-by-step solutions for each of its 24,750 problems, designed to assess symbolic reasoning capabilities. Analysis indicates an average solution length of 6.91 reasoning steps, representing the number of individual inferences required to arrive at the correct answer. The standard deviation of 1.75 steps demonstrates significant variability in problem complexity, ranging from relatively straightforward derivations to more involved, multi-step proofs. This distribution is intended to provide a robust evaluation across a spectrum of reasoning demands and differentiate between models with varying levels of symbolic manipulation proficiency.

Verifiability as a Foundation: The Logic of Automated Solutions
PRiSM’s core functionality centers on the dynamic generation of Python code corresponding to each problem instance. This allows for automated execution and verification of proposed solutions against the problem’s defined parameters and expected outcomes. The generated code isn’t simply a numerical evaluation; it represents a programmatic instantiation of the solution’s logic. Any discrepancies between the executed code’s output and the ground truth are flagged as errors, facilitating precise identification of flaws in the solution process. This automated error detection is crucial for both evaluating the performance of solution synthesis algorithms and debugging individual solution attempts, ensuring a high degree of reliability and accuracy.
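A minimal sketch of that execution-and-comparison loop might look like the following, assuming the ground truth is stored as a short Python program that leaves its result in an `answer` variable. The code string, variable convention, and tolerance are illustrative assumptions rather than PRiSM's actual interface.

```python
# A minimal sketch of execution-based verification; the generated code string,
# the `answer` convention, and the tolerance are illustrative assumptions
# about how a programmatic ground truth might be run and checked.
GENERATED_SOLUTION = """
v0 = 9.81            # launch speed in m/s, injected for this instance
g = 9.81             # gravitational acceleration in m/s^2
answer = 2 * v0 / g  # flight time in seconds
"""

def run_ground_truth(code: str) -> float:
    """Execute a generated solution and return the value it computes."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace["answer"]

expected = run_ground_truth(GENERATED_SOLUTION)   # 2.0 seconds
model_answer = 2.03                               # hypothetical model output
discrepancy = abs(model_answer - expected) > 1e-3 * abs(expected)
print(f"expected={expected:.3f}, flagged={discrepancy}")   # flagged=True
```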
PRiSM’s generated Python code utilizes the SymPy library to perform symbolic mathematics, enabling manipulation of equations and expressions without numerical evaluation. Crucially, the Pint library is integrated to ensure dimensional consistency throughout calculations; Pint tracks units of measure alongside numerical values, raising exceptions when operations would result in dimensionally invalid results. This is particularly important for scientific and engineering problems where maintaining correct units – such as meters, seconds, kilograms, and amperes – is essential for the validity of the solution; for example, adding a length in meters to a time in seconds would be flagged as an error. The use of Pint enforces adherence to physical laws expressed through dimensional analysis, thereby increasing the reliability and correctness of the generated solutions.
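The snippet below exercises both mechanisms directly: SymPy solves an equation symbolically, and Pint rejects a dimensionally invalid operation. The specific kinematics equation is an illustrative example, not a problem drawn from the benchmark.

```python
# Symbolic manipulation with SymPy and unit checking with Pint; the equation
# is an illustrative example, not a problem taken from PRiSM itself.
import sympy as sp
import pint

# SymPy: solve v0*t - (1/2)*g*t**2 = 0 for the positive flight time.
t, v0, g = sp.symbols("t v0 g", positive=True)
print(sp.solve(sp.Eq(v0 * t - g * t**2 / 2, 0), t))   # [2*v0/g]

# Pint: quantities carry units, and invalid combinations raise an exception.
ureg = pint.UnitRegistry()
speed = 19.6 * ureg.meter / ureg.second
accel = 9.81 * ureg.meter / ureg.second ** 2
print((2 * speed / accel).to(ureg.second))            # ~3.996 second

try:
    speed + accel        # adding m/s to m/s^2 is dimensionally invalid
except pint.DimensionalityError as err:
    print("flagged:", err)
```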
The PRiSM framework is designed to address both programmatic solution synthesis (Task IV) and reasoning under ambiguity (Task V) within problem-solving contexts. This capability is supported by a comprehensive knowledge base encompassing 450 distinct physics concepts and 110 mathematics concepts. This broad conceptual coverage allows the framework to generate and verify solutions across a wide range of problems, and to manage situations where problem statements are not fully specified or contain inherent uncertainty. The framework’s ability to handle these tasks is central to its overall functionality and validation process.

Beyond Accuracy: Exposing the Cracks in Simulated Intelligence
A comprehensive evaluation of artificial intelligence models requires more than simply assessing performance on a single, static dataset. The PRiSM framework addresses this need by systematically introducing variations to scientific problems, allowing researchers to gauge a model’s consistency and reliability. Utilizing metrics such as the ‘true Score’, PRiSM quantifies how well a model maintains accurate reasoning despite alterations in irrelevant details, such as changing an image or rephrasing a question. This approach reveals whether a model truly understands the underlying scientific principles or merely exploits superficial correlations within the training data. By rigorously testing across these problem variations, PRiSM provides a robust assessment of an AI’s ability to generalize and avoid brittle, easily fooled behavior, offering a crucial step toward trustworthy scientific reasoning systems.
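As a rough illustration of consistency-style scoring, the sketch below credits a model only when it solves every variation of a base problem. This all-or-nothing rule is an assumption made for the example, not PRiSM's exact definition of its score.

```python
# A rough sketch of a consistency-style metric over problem variations; the
# all-or-nothing scoring rule is an illustrative assumption, not PRiSM's
# exact formula.
from typing import Dict, List

def consistency_score(results: Dict[str, List[bool]]) -> float:
    """Fraction of base problems answered correctly across all their variations."""
    per_problem = [all(variants) for variants in results.values()]
    return sum(per_problem) / len(per_problem)

# Each key is a base problem; each list records correctness on its variants
# (rephrased text, swapped image, reordered givens, and so on).
results = {
    "projectile_01": [True, True, True],
    "circuit_07":    [True, False, True],   # right answers, but not robustly
}
print(consistency_score(results))           # 0.5
```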
Analysis of a newly developed dataset reveals a consistent failure mode in Vision-Language Models termed ‘Modality Conflict’. This occurs when visual information presented alongside a problem directly contradicts the logically correct answer, leading the model to prioritize the image over established reasoning principles. For instance, a physics problem depicting an object falling upwards might be incorrectly solved due to the model’s reliance on the visual cue of upward motion, despite understanding the laws of gravity. This highlights a critical limitation in these models – a susceptibility to being misled by perceptual input, even when it clashes with learned knowledge – and underscores the need for improved mechanisms to integrate and validate information across different modalities.
The development of artificial intelligence capable of robust scientific reasoning demands more than just achieving high accuracy on standard datasets. PRiSM emerges as a critical resource, offering researchers a platform to rigorously evaluate and refine these systems. By systematically generating diverse problem variations, PRiSM moves beyond simple benchmark testing and exposes vulnerabilities in AI reasoning processes. This allows for targeted improvements and the development of models less susceptible to superficial correlations or biases. Consequently, PRiSM isn’t merely a testing tool, but an iterative development partner, facilitating the creation of AI systems demonstrably capable of reliable and consistent scientific thought – a crucial step towards trustworthy AI in complex domains.

The creation of PRiSM isn’t about crafting a perfect test, but rather acknowledging the inherent unpredictability of complex systems. The benchmark operates as a dynamic ecosystem, mirroring the scientific process itself: a continual cycle of observation, experimentation, and refinement. It recognizes that stability is merely an illusion that caches well; any attempt to rigidly define ‘correct’ answers will inevitably fail as the system evolves. As Edsger W. Dijkstra observed, “In moments of decision, the best thing you can do is the right thing; the next best thing is the wrong thing; and the worst thing you can do is nothing.” PRiSM doesn’t offer guarantees of definitive success, but a framework for measuring a model’s adaptability within a chaotic, yet structured, environment. The agent-based pipeline deliberately introduces variables, understanding that chaos isn’t failure – it’s nature’s syntax.
What Lies Ahead?
The construction of PRiSM, as with any benchmark, is less a solution and more a carefully documented compromise. It addresses present failings in evaluating scientific reasoning, yet implicitly forecasts the future failures of evaluation itself. Models will adapt, not by ‘understanding’ science, but by mastering the quirks of the synthetic environment – optimizing for the benchmark, not for genuine generalization. The pursuit of ever-more-complex synthetic data will inevitably lead to models exquisitely tuned to artifice.
The true challenge, then, isn’t scaling the dataset, but acknowledging the inherent limitations of proxy tasks. PRiSM rightly emphasizes an agent-based pipeline, yet even that architecture is a provisional scaffolding. Dependencies accumulate; the ecosystem evolves. The benchmark will become a fossil, revealing more about the state of evaluation at this moment than about any enduring capacity for scientific thought.
One anticipates a shift, perhaps a reluctant one, toward evaluating models not on isolated tasks, but on their capacity to fail gracefully within real-world scientific workflows. Less focus on correct answers, more on identifying the boundaries of competence. For in the end, the most valuable skill isn’t knowledge, but the ability to recognize what one does not know.
Original article: https://arxiv.org/pdf/2512.05930.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/