Author: Denis Avetisyan
Researchers have created a challenging benchmark to test if artificial intelligence can deduce molecular structures from experimental data, mimicking the problem-solving process of a chemist.

MolQuest is a dynamic benchmark for agentic evaluation of abductive reasoning in chemical structure elucidation, designed to assess the capabilities of large language models in AI-driven scientific discovery.
While large language models demonstrate promise in scientific discovery, robust evaluation of their dynamic reasoning capabilities remains a significant challenge. To address this gap, we introduce ‘MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation’, a novel agent-based framework that assesses LLMs' ability to perform iterative molecular structure elucidation from authentic spectral data. Our benchmark reveals that even state-of-the-art models struggle with this complex, multi-step task, achieving only approximately 50% accuracy, highlighting a critical need for improved strategic reasoning in AI for science. Can we develop LLMs capable of truly participating in the scientific process, autonomously designing experiments and refining hypotheses like a skilled chemist?
The Challenge of Scientific Reasoning: Beyond Pattern Matching
Large Language Models, despite their successes in areas like natural language processing and code generation, struggle when applied to complex scientific reasoning tasks, particularly those requiring strategic planning and iterative refinement. Molecular structure elucidation, for instance, isn’t simply a matter of recalling facts; it demands a deliberate sequence of hypotheses, experiments – whether real or simulated – and data analysis. Current LLMs often exhibit a ‘one-shot’ approach, attempting to directly deduce the solution without effectively planning a series of logical steps or adapting to unexpected results. This limitation stems from a core architectural challenge: these models excel at pattern recognition but lack the capacity for sustained, goal-directed thought processes crucial for navigating the uncertainties inherent in scientific discovery. Consequently, while capable of generating plausible-sounding answers, they frequently falter when faced with problems demanding nuanced judgment, careful observation, and the ability to learn from iterative feedback.
Current evaluations of Large Language Models in scientific contexts, utilizing benchmarks such as ChemBench and ChemiQ, frequently present challenges that don’t fully mirror the intricacies of actual scientific problem-solving. These benchmarks tend to focus on isolated tasks with definitive answers, neglecting the dynamic and iterative nature of research. A true scientific workflow involves proposing hypotheses, designing experiments, interpreting ambiguous data, and adapting strategies based on evolving evidence – elements often absent in static benchmark datasets. Consequently, while an LLM might achieve high scores on these tests, it doesn't necessarily demonstrate the capacity for genuine scientific reasoning, which requires navigating uncertainty, handling incomplete information, and making informed decisions across a sequence of interdependent steps. The limitations of these benchmarks highlight the need for more sophisticated evaluation tools that can assess an LLM's ability to function not just as a knowledge repository, but as a flexible and adaptable scientific investigator.
The prevailing methods for assessing Large Language Models in scientific contexts frequently rely on static datasets, presenting a limited snapshot of problem-solving ability. This approach overlooks the fundamentally iterative and adaptive nature of genuine scientific workflows, where hypotheses are refined through cycles of experimentation and analysis. Real-world scientific reasoning isn't a single-step answer retrieval; it demands sequential decision-making, where each step informs the next, and models must dynamically adjust strategies based on emerging evidence. Consequently, performance on these fixed benchmarks may not accurately reflect an LLM's capacity to navigate the complexities of open-ended scientific inquiry, potentially overestimating their readiness for tackling nuanced, real-world challenges that require ongoing learning and adaptation.

MolQuest: A Framework for Evaluating Scientific Agency
MolQuest is an agent-based framework designed to assess the capabilities of Large Language Models (LLMs) in the domain of molecular structure elucidation. This benchmark moves beyond traditional static datasets by simulating the iterative workflow of scientific investigation. The LLM functions as an autonomous agent, actively proposing hypotheses regarding molecular structure and then requesting simulated experimental data – specifically, mass spectrometry and nuclear magnetic resonance (NMR) spectroscopy – to validate or refine those hypotheses. This dynamic interaction allows for evaluation of an LLM’s ability to plan experiments, interpret results, and converge on a plausible molecular structure through successive refinement, mirroring the process used by chemists in a laboratory setting.
MolQuest distinguishes itself from traditional benchmarks by simulating an active research process. Instead of providing complete information upfront, the LLM functions as an agent within a dynamic environment where it must request specific data, mirroring experimental techniques like Mass Spectrometry and Nuclear Magnetic Resonance (NMR) Spectroscopy. The agent iteratively queries for data based on its current hypothesis, analyzes the results, and then refines that hypothesis – a process repeated until a satisfactory molecular structure is proposed. This iterative data acquisition and hypothesis refinement is central to MolQuest’s design, demanding capabilities beyond simple pattern recognition and requiring the LLM to manage uncertainty and prioritize information gathering.
MolQuest utilizes abductive reasoning as its core evaluation principle, challenging Large Language Models (LLMs) to determine the most probable molecular structure from limited and imperfect data. This process necessitates inference based on incomplete evidence, simulating the real-world complexities of scientific investigation where experimental data – such as Mass Spectrometry and NMR Spectroscopy results – inherently contains noise and ambiguities. The LLM must therefore not simply recall known structures, but actively formulate and revise hypotheses to account for the observed data, effectively performing a best-guess inference given the available, and potentially flawed, evidence. The success of an LLM is measured by its ability to consistently propose plausible structures and refine those proposals as more data becomes available, reflecting the iterative nature of scientific discovery.
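To make this loop concrete, the sketch below illustrates one way such an agent-environment interaction could be organized. The class and method names (MockAgent, choose_experiment, refine, and so on) are illustrative assumptions rather than MolQuest's actual interface; the point is the hypothesis-experiment-refine cycle itself, together with the quantities the benchmark's metrics draw on, namely the number of rounds used and the agent's final confidence.

```python
# Illustrative sketch of a MolQuest-style iterative elucidation loop.
# The environment, the agent interface, and the stopping rule below are
# assumptions made for exposition; they are not the benchmark's actual API.
import random


class MockSpectralEnvironment:
    """Stands in for the simulated lab: answers queries about a hidden target."""

    def __init__(self, target_formula: str):
        self._target_formula = target_formula

    def answer(self, query: str) -> str:
        # A real environment would return simulated MS peaks or NMR shifts;
        # this stub only acknowledges the requested measurement.
        return f"simulated {query} data for a molecule with formula {self._target_formula}"


class MockAgent:
    """Stands in for the LLM: proposes, queries, and refines hypotheses."""

    def initial_hypothesis(self) -> str:
        return "CCO"  # a candidate SMILES string

    def choose_experiment(self, hypothesis: str) -> str:
        # A real agent would pick the measurement expected to be most informative.
        return random.choice(["mass_spectrum", "nmr_1h", "nmr_13c"])

    def refine(self, hypothesis: str, observation: str):
        # A real agent would revise the structure in light of the new data;
        # this stub keeps its hypothesis and declares itself confident.
        return hypothesis, True

    def confidence(self, hypothesis: str) -> float:
        return 0.7


def run_episode(agent, environment, max_rounds: int = 10) -> dict:
    """One hypothesis-experiment-refine loop, ending when the agent commits."""
    hypothesis = agent.initial_hypothesis()
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        query = agent.choose_experiment(hypothesis)   # which measurement to request
        observation = environment.answer(query)       # noisy, possibly ambiguous data
        hypothesis, committed = agent.refine(hypothesis, observation)
        if committed:                                  # agent commits to a structure
            break
    return {
        "prediction": hypothesis,
        "rounds_used": rounds_used,                    # feeds the interaction-round metric
        "confidence": agent.confidence(hypothesis),    # feeds the calibration analysis
    }


print(run_episode(MockAgent(), MockSpectralEnvironment("C2H6O")))
```

In this framing, the number of loop iterations and the agent's final stated confidence are exactly the quantities that the strategic-planning and calibration metrics discussed below examine.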

Beyond Accuracy: Assessing Calibration and Conservation
MolQuest incorporates Calibration Error as a key performance metric, quantifying the degree to which a model's predicted confidence in its structural predictions aligns with actual correctness. This assessment moves beyond simple accuracy by evaluating whether high-confidence predictions are consistently accurate and low-confidence predictions are consistently inaccurate. A well-calibrated model provides reliable uncertainty estimates, which are crucial for scientific applications where understanding the limits of a prediction is as important as the prediction itself. Calibration Error is calculated based on the Expected Calibration Error (ECE), providing a quantitative measure of this alignment and enabling a more nuanced evaluation of model reliability beyond overall accuracy scores.
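As a point of reference, a common way to compute a binned Expected Calibration Error is sketched below. The paper does not specify its exact binning scheme, so the ten-bin choice here is an assumption; the idea is simply to compare the average stated confidence with the empirical accuracy inside each bin and weight the gaps by bin occupancy.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| per bin, weighted by the
    fraction of predictions falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_acc = correct[in_bin].mean()        # empirical accuracy in the bin
        bin_conf = confidences[in_bin].mean()   # average stated confidence
        ece += in_bin.mean() * abs(bin_acc - bin_conf)
    return ece


# Toy example: one wrong but highly confident prediction drives the error up.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```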
Formula Conservation, a key metric within the MolQuest benchmark, assesses the degree to which predicted molecular structures maintain correct elemental composition. Gemini 3 Pro achieved a score of 93.57% on this metric, indicating a high level of adherence to established chemical principles and consistency with experimentally derived data. This evaluation is performed by verifying that the number of atoms of each element in the predicted molecule matches the target molecule, thereby ensuring the model does not generate chemically invalid structures. A high Formula Conservation score is critical for practical applications, as incorrect elemental composition renders a predicted structure meaningless for downstream tasks like property prediction or synthesis planning.
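A minimal version of such a check can be written with RDKit by comparing molecular formulas, as in the sketch below. This is an illustrative reconstruction of the idea rather than MolQuest's own implementation, and the SMILES strings are chosen purely for demonstration.

```python
# Hedged sketch of a formula-conservation check using RDKit: a prediction
# passes only if it parses and matches the target's elemental composition.
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula


def formulas_match(predicted_smiles: str, target_smiles: str) -> bool:
    """Return True if both SMILES parse and share the same molecular formula."""
    pred = Chem.MolFromSmiles(predicted_smiles)
    target = Chem.MolFromSmiles(target_smiles)
    if pred is None or target is None:       # an unparseable prediction fails the check
        return False
    return CalcMolFormula(pred) == CalcMolFormula(target)


# Two constitutional isomers of C2H6O conserve the formula...
print(formulas_match("CCO", "COC"))   # True  (ethanol vs. dimethyl ether)
# ...while dropping a carbon does not.
print(formulas_match("CO", "CCO"))    # False (methanol vs. ethanol)
```

Note that a conserved formula says nothing about whether the connectivity is right; it is a necessary sanity check, not a measure of structural accuracy.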
The MolQuest benchmark utilizes an Agent-Based Framework to assess the strategic planning capabilities of language models during molecular reasoning. This framework measures the number of interaction rounds required to arrive at a predicted molecular structure. Evaluations of Gemini 3 Flash and Pro models demonstrate efficient information acquisition, with both models achieving an average of 4.7 to 4.8 interaction rounds. This metric indicates the models' ability to effectively refine their hypotheses and converge on solutions with a minimal number of iterative steps within the defined agent-based system.

Towards Robust AI: Benchmarking and Future Directions
MolQuest establishes a standardized benchmarking environment for evaluating large language models (LLMs) applied to scientific challenges, offering a level playing field for comparing performance across models like GPT-5.2, Gemini 3 Pro, and Qwen-3-Max. Recent evaluations utilizing this framework demonstrate significant advancements in structure prediction accuracy; notably, Gemini 3 Flash achieved a state-of-the-art result of 51.51%, while Gemini 3 Pro attained a score of 48.30%. This rigorous comparison, conducted under consistent and challenging conditions, is crucial for identifying the strengths and weaknesses of different LLMs, ultimately driving innovation in AI-assisted scientific workflows and enabling the selection of optimal models for specific research tasks.
MolQuest is designed not as a static benchmark, but as a continually evolving environment that fosters the creation of AI agents capable of genuine scientific discovery. This dynamic framework presents agents with opportunities to iteratively refine their hypotheses and methodologies based on incoming data, mirroring the core principles of the scientific method. Unlike traditional evaluations that assess performance on fixed datasets, MolQuest challenges AI to actively learn from its mistakes and adapt to novel situations, encouraging the development of agents that don't simply answer questions, but investigate them. The system's capacity for ongoing assessment and refinement paves the way for AI tools that can contribute meaningfully to complex scientific workflows, pushing beyond pattern recognition toward true understanding and innovation.
MolQuest emphasizes that scientific validity extends beyond mere predictive accuracy, demanding AI systems demonstrate both calibration and adherence to fundamental chemical principles like formula conservation. This focus fosters the creation of trustworthy AI agents capable of generating not just plausible, but scientifically defensible results. Recent evaluations using the MolQuest framework reveal Claude Opus 4.5 achieves an accuracy of 9.18 per 1 million tokens when assessed on these criteria, suggesting a promising trajectory towards robust and reliable artificial intelligence for complex scientific tasks. This level of performance underscores the importance of evaluating AI not solely on ‘what’ it predicts, but ‘how’ it arrives at those predictions, ensuring alignment with established scientific knowledge and minimizing the risk of spurious correlations or physically impossible outcomes.

Expanding the Horizon: Towards Autonomous Scientific Discovery
Future development of the MolQuest framework centers on a crucial integration with Human-in-the-Loop Data Pipelines. This approach doesn't aim to replace scientific expertise, but rather to augment it; the AI will generate hypotheses based on complex datasets, which are then presented to domain experts for critical evaluation. This collaborative process allows researchers to validate AI-driven predictions, identify potential flaws, and refine the models with valuable human insight. By iteratively combining computational power with nuanced scientific judgment, the system promises to not only accelerate discovery but also to ensure the robustness and reliability of the generated knowledge, ultimately fostering a more effective synergy between artificial intelligence and human intellect.
The MolQuest framework, initially developed for accelerating molecular discovery, demonstrates significant potential beyond its original scope. Researchers are actively investigating its adaptation to tackle notoriously complex challenges in protein folding and materials discovery, fields where identifying stable configurations and optimal properties often requires immense computational resources and experimental validation. By leveraging MolQuest's core principles of AI-driven hypothesis generation and iterative refinement, scientists anticipate a substantial reduction in the time and cost associated with these endeavors. This expansion isn't merely about applying existing algorithms to new domains; it involves tailoring the framework to accommodate the unique characteristics of each problem, such as the intricacies of intermolecular forces in protein structures or the high-dimensional parameter spaces of materials composition. Successfully extending MolQuest's reach promises not only to unlock breakthroughs in these crucial scientific areas, but also to establish a versatile platform for tackling a wide range of complex scientific problems.
The long-term vision driving advancements in artificial intelligence for scientific discovery centers on creating systems capable of autonomous hypothesis generation and testing. Current AI often serves as a powerful tool for scientists, analyzing data and suggesting potential avenues of research; however, the next generation aims for true scientific independence. This involves developing algorithms that can sift through existing knowledge, identify gaps, formulate novel hypotheses, design experiments – potentially utilizing robotic automation – and interpret the resulting data to validate or refute those hypotheses. Such a capability promises to dramatically accelerate the pace of discovery, moving beyond assistance to genuine partnership with, and even leadership of, the scientific process, ultimately unlocking solutions to complex challenges across diverse fields with unprecedented speed and efficiency.

The pursuit of automated chemical reasoning, as exemplified by MolQuest, demands a rigor beyond mere empirical success. It requires solutions demonstrably correct, not simply those appearing to work on a limited test set. This aligns perfectly with Andrey Kolmogorov's assertion: “The most important thing in science is not to know things, but to know how to find them out.” MolQuest doesn't simply assess if a large language model can guess a molecular structure; it evaluates its ability to deduce it through a dynamic process of experimentation and constraint satisfaction, mirroring the fundamental principles of scientific inquiry. The benchmark's agent-based framework necessitates provable steps, ensuring each deduction is logically sound and transparent, echoing Kolmogorov's emphasis on the process of discovery over the mere accumulation of facts.
What Remains Constant?
The advent of MolQuest, while a pragmatic step toward assessing large language models' capabilities in chemical reasoning, merely sharpens the fundamental question. The benchmark establishes an arena for agentic problem-solving, but does not address the core issue: can a system truly elucidate structure, or simply correlate data points with statistically probable arrangements? Let N approach infinity – what remains invariant? The elegance of a solution does not reside in its success on a finite test set, but in its adherence to underlying physical principles. Current evaluations, however sophisticated, remain susceptible to exploitation via clever prompt engineering or statistical overfitting.
Future work must move beyond performance metrics and embrace provability. The challenge isn't simply to build a system that finds a molecular structure, but one that can justify its solution with a demonstrably valid chain of reasoning: a digital analog of a chemist's meticulous derivation. This necessitates incorporating formal methods and constraint satisfaction techniques directly into the model architecture, not merely as post-hoc validation steps.
Ultimately, the true test lies not in replicating the results of chemistry, but in embodying its logic. The benchmark is a useful tool, certainly, but it is the pursuit of mathematical purity, of an invariant core, that will determine whether these models transcend the realm of sophisticated pattern matching and achieve genuine scientific intelligence.
Original article: https://arxiv.org/pdf/2603.25253.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/