Author: Denis Avetisyan
A new benchmark dataset challenges large language models to demonstrate multimodal reasoning skills on complex chemistry Olympiad questions.

Researchers introduce USNCO-V to evaluate the ability of AI to integrate visual and textual information for advanced scientific problem-solving.
Despite advances in artificial intelligence, robust multimodal reasoning remains a significant hurdle for large language models, particularly in complex scientific domains. The paper ‘Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams’ addresses this gap by presenting USNCO-V, a challenging benchmark dataset constructed from decades of U.S. National Chemistry Olympiad exams and used to systematically assess 40 leading multimodal LLMs. The findings reveal that while some models achieve super-human performance, limitations in vision-language integration persist, and improved prompting strategies are crucial for effective visual grounding. How can we further refine these models to truly unlock their potential for scientific discovery and education?
Decoding Visual Complexity: The Challenge for Artificial Chemists
For decades, chemistry education has fundamentally depended on the ability to decipher visual information – from interpreting complex reaction diagrams and understanding the setup of laboratory apparatus, to mentally manipulating three-dimensional molecular structures. However, artificial intelligence systems consistently falter when asked to perform these same visual tasks. Unlike humans, who intuitively connect visual cues with underlying chemical principles, current AI often treats images as mere pixel arrangements, lacking the capacity to ‘see’ a reaction mechanism or predict molecular behavior based on spatial arrangement. This discrepancy isn’t simply a matter of image recognition; it’s a failure to bridge the gap between visual representation and abstract chemical concepts, hindering progress in automating tasks that require visual problem-solving, like predicting reaction outcomes or designing novel molecules.
Despite advances in image recognition, current artificial intelligence systems demonstrate a significant disconnect between seeing a chemical representation and understanding its underlying principles. While an AI might identify a beaker in an image, it often struggles to infer the ongoing reaction, the properties of the reactants, or the potential products. This limitation stems from a lack of ‘chemical intuition’ – the ability to connect visual features, such as bond angles or functional groups, to concepts like reactivity and stability. Consequently, problem-solving in chemistry, which frequently demands interpreting visual information to predict outcomes or propose mechanisms, remains a substantial challenge for AI. The inability to bridge this visual-conceptual gap hinders the development of AI tools capable of accelerating chemical discovery and automating complex scientific tasks, particularly those requiring nuanced understanding of molecular structure and behavior.
The advancement of artificial intelligence in scientific discovery faces a significant hurdle due to limitations in visual reasoning, especially within chemistry. While AI excels at processing numerical data, interpreting visual information – such as laboratory setups, reaction mechanisms depicted in diagrams, or the three-dimensional structure of molecules – remains a substantial challenge. This isn’t merely a matter of image recognition; it requires an understanding of how visual cues relate to fundamental chemical principles, a connection current AI systems frequently fail to establish. Consequently, the ability to automate tasks requiring visual analysis – predicting reaction outcomes from apparatus diagrams, inferring molecular properties from structural representations, or designing novel experiments based on visual observations – is severely restricted, effectively creating a bottleneck in the potential for AI-driven innovation within the chemical sciences and beyond.
A New Paradigm: Multimodal LLMs for Scientific Understanding
Multimodal Large Language Models (LLMs) address limitations of traditional LLMs by integrating the processing of both textual and visual inputs. These models utilize architectures capable of accepting and correlating data from multiple modalities, such as text, images, and diagrams. This simultaneous processing allows the AI to establish relationships between different data types, improving performance on tasks requiring cross-modal understanding. Unlike unimodal LLMs restricted to textual information, multimodal LLMs can directly interpret visual data like chemical structures, reaction schemes, or experimental setups, enabling a more comprehensive analysis and informed decision-making process.
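As a minimal sketch of what "accepting multiple modalities" looks like in practice, the snippet below sends a text question together with an image in a single request via the OpenAI Python client. The file name, model choice, and prompt are illustrative assumptions, not details from the paper.

```python
import base64
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local diagram (hypothetical file) so it can be embedded in the request.
with open("reaction_scheme.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# One message carries both modalities: a text question plus the image.
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the major product of the reaction shown?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```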
Few-shot prompting and chain-of-thought prompting are techniques used to improve the reasoning and problem-solving abilities of multimodal Large Language Models. Few-shot prompting involves providing the model with a limited number of example input-output pairs to guide its responses to new, unseen inputs. Chain-of-thought prompting builds upon this by encouraging the model to explicitly articulate its reasoning process, breaking down complex problems into intermediate steps before arriving at a final answer. This approach allows the model to demonstrate a more transparent and interpretable decision-making process, and has been shown to significantly improve performance on complex tasks by enabling more accurate and logical inferences.
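A small sketch of how the two techniques compose into a single prompt for a chemistry item; the worked example and template wording are illustrative, not the study's exact format.

```python
# Few-shot + chain-of-thought prompt assembly (illustrative template,
# not the study's exact wording).

FEW_SHOT_EXAMPLE = """\
Q: A gas occupies 2.0 L at 300 K. What is its volume at 600 K (constant P)?
Reasoning: Charles's law gives V2 = V1 * (T2 / T1) = 2.0 L * (600 / 300) = 4.0 L.
Answer: 4.0 L
"""

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example, then ask the model to reason step by step."""
    return (
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Q: {question}\n"
        "Reasoning: Let's think step by step."
    )

prompt = build_cot_prompt(
    "0.50 mol of an ideal gas at 1.0 atm and 273 K occupies what volume?"
)
print(prompt)  # pass this string to any LLM client of your choice
```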
The integration of Large Language Models (LLMs) with visual data processing capabilities enables the development of AI systems capable of addressing complex chemistry problems traditionally solved by human experts. These systems can analyze visual inputs such as reaction schemes, molecular structures, and spectroscopic data, then combine this visual understanding with the reasoning and knowledge encoded within the LLM. This allows for tasks like predicting reaction outcomes, proposing synthetic routes, and interpreting experimental results to be automated. Specifically, the LLM component provides contextual understanding and reasoning, while the visual processing component extracts information directly from images or diagrams, bridging the gap between visual representations of chemical information and the LLM’s textual processing abilities.
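One way to picture this division of labor is a two-stage pipeline, sketched below under loose assumptions: `extract_structures` stands in for any optical chemical structure recognition tool, and `ask` wraps an LLM call. Neither hook reflects the paper's actual pipeline.

```python
def solve_with_vision(image_path: str, question: str, ask, extract_structures):
    """Two-stage sketch: extract chemical content from the image, then reason.

    `extract_structures` stands in for any optical chemical structure
    recognition tool returning SMILES strings; `ask` wraps an LLM call.
    Both hooks are illustrative assumptions, not the paper's method.
    """
    smiles = extract_structures(image_path)            # e.g. ["CCO", "CC(=O)O"]
    context = "Structures in the figure (SMILES): " + ", ".join(smiles)
    return ask(f"{context}\n\nQuestion: {question}\nExplain your reasoning.")
```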
Chain-of-Thought (CoT) prompting substantially improves the performance of Large Language Models (LLMs) on complex reasoning tasks. Specifically, implementation of CoT prompting techniques has resulted in an 80-82% increase in “win rates” – the frequency with which the LLM provides the correct solution – when benchmarked against baseline LLM performance without CoT. This improvement is observed across various problem types requiring multi-step reasoning, and demonstrates that guiding the model to explicitly articulate its reasoning process, rather than directly providing an answer, significantly enhances its problem-solving capabilities. The observed gains indicate CoT prompting is a robust method for unlocking the latent reasoning potential of LLMs.
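Reading "win rate" as the fraction of items answered correctly, as described above, the comparison reduces to a simple scoring loop; the toy data below are invented for illustration.

```python
def win_rate(predictions: list[str], answers: list[str]) -> float:
    """Fraction of items answered correctly (the 'win rate' described above)."""
    wins = sum(p == a for p, a in zip(predictions, answers))
    return wins / len(answers)

# Invented toy data: gold answers, plus predictions with and without CoT.
gold     = ["B", "D", "A", "C", "B", "A"]
baseline = ["B", "A", "A", "D", "C", "C"]  # 2/6 correct
with_cot = ["B", "D", "A", "C", "C", "C"]  # 4/6 correct

base, cot = win_rate(baseline, gold), win_rate(with_cot, gold)
print(f"baseline={base:.2f}  CoT={cot:.2f}  "
      f"relative gain={(cot - base) / base:.0%}")
```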

Establishing a Standard: The USNCO-V Benchmark
The USNCO-V dataset is designed as a standardized and challenging benchmark for assessing the visual reasoning skills of multimodal Large Language Models (LLMs) specifically within the domain of chemistry. Constructed from questions sourced from past US National Chemistry Olympiad examinations, the dataset presents problems that necessitate the interpretation of visual information, including chemical diagrams, experimental apparatus setups, and molecular structures. This focus differentiates USNCO-V from general visual question answering datasets, requiring models to not only recognize visual elements but also to apply chemical knowledge to derive correct answers. The rigor of the benchmark stems from the complexity of the problems and the need for models to integrate visual and textual information for successful problem-solving, thereby providing a more nuanced evaluation of multimodal LLM capabilities in a scientific context.
The USNCO-V dataset is derived from questions used in the US National Chemistry Olympiad examinations, and therefore presents a high level of chemical reasoning complexity. Problems within the dataset necessitate the interpretation of various visual elements, including chemical diagrams representing molecular structures and reaction mechanisms, schematics of laboratory apparatus used in experiments, and graphical representations of data derived from chemical processes. These visuals are not simply illustrative; they contain critical information essential for solving the problems, requiring models to extract and integrate visual data with their existing chemical knowledge base. The dataset’s structure directly reflects the skills assessed in the Olympiad, focusing on a student’s ability to analyze visual information to apply chemical principles and predict outcomes.
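To make the dataset's shape concrete, here is one plausible record layout; the field names and example values are assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class USNCOVItem:
    """Hypothetical record layout for one USNCO-V benchmark item."""
    question_id: str         # e.g. "national-2019-q42" (illustrative)
    exam_set: str            # "local" or "national"
    question_text: str       # textual stem of the problem
    image_path: str          # diagram, apparatus, or structure image
    choices: dict[str, str]  # multiple-choice options keyed by letter
    answer: str              # gold answer key, e.g. "C"

item = USNCOVItem(
    question_id="national-2019-q42",
    exam_set="national",
    question_text="Which apparatus shown is suitable for fractional distillation?",
    image_path="images/national-2019-q42.png",
    choices={"A": "I", "B": "II", "C": "III", "D": "IV"},
    answer="C",
)
```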
Evaluations using the USNCO-V dataset indicate that large multimodal models, specifically GPT-5, achieve a high level of accuracy – 86.3% – when assessed on the national exam set. This performance exceeds the estimated capabilities of participants in the actual US National Chemistry Olympiad competition, suggesting a substantial advancement in the ability of AI to solve complex, visually-based chemistry problems. This benchmark highlights the model’s capacity to not only process textual information but also to interpret and reason about diagrams of chemical apparatus, molecular structures, and experimental setups presented within the exam questions.
Evaluations of the GPT-4.1 multimodal LLM on the USNCO-V dataset demonstrate performance variability based on the problem set utilized. Specifically, the model achieved an accuracy of 42.5% when tested on the local set of US National Chemistry Olympiad problems. Performance increased to 50.7% accuracy when evaluated on the national set of problems, indicating a potential difference in difficulty or problem characteristics between the two datasets. These results establish a quantitative baseline for assessing the model’s capabilities in visually-grounded chemical reasoning and provide a comparative metric against human performance on the USNCO exam.
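Reproducing this kind of split comparison is straightforward once each item carries an exam-set label, as in the hypothetical record sketched earlier; the helper below is a sketch under that assumption, not the authors' evaluation harness.

```python
from collections import defaultdict

def accuracy_by_set(items, predictions):
    """Per-exam-set accuracy over (item, predicted answer) pairs.

    Assumes each item exposes `exam_set` ("local" / "national") and a
    gold `answer`, as in the hypothetical USNCOVItem sketch above.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.exam_set] += 1
        correct[item.exam_set] += int(pred == item.answer)
    return {s: correct[s] / total[s] for s in total}
```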
Occlusion-based saliency analysis assesses the importance of different visual elements by measuring the change in a model’s log-probability when those elements are altered or removed. This technique operates on the premise that features crucial to the model’s decision-making process will exhibit the most significant shifts in log-probability. By quantifying these changes, researchers can identify which specific components of a chemical diagram, apparatus setup, or molecular structure are driving the model’s predictions. The resulting saliency maps offer a granular view of the model’s visual attention, allowing for the investigation of whether the model is focusing on chemically relevant features, or instead relying on spurious correlations within the image. This method provides empirical evidence for understanding the internal reasoning process of multimodal LLMs when applied to visual chemistry problems.
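A minimal sketch of the occlusion procedure, assuming access to a function `answer_logprob(image)` that returns the model's log-probability for the gold answer; that hook is a stand-in for whatever scoring interface a given model exposes.

```python
import numpy as np

def occlusion_saliency(image: np.ndarray, answer_logprob, patch: int = 32):
    """Map each image region to the drop in log-probability when it is masked.

    `answer_logprob(img)` is assumed to return the model's log-probability
    of the correct answer given `img`; larger drops mark salient regions.
    """
    baseline = answer_logprob(image)
    h, w = image.shape[:2]
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()  # grey out patch
            saliency[i // patch, j // patch] = baseline - answer_logprob(occluded)
    return saliency  # high values = regions the prediction depends on
```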

A New Era of Scientific Collaboration
The recent achievements of multimodal large language models, as demonstrated by strong performance on challenging benchmarks like the USNCO-V chemistry exam, signify a potential paradigm shift in scientific discovery. These models, capable of processing and integrating information from diverse sources – including text and visual data – are no longer confined to simple pattern recognition. Instead, they exhibit an emerging capacity for reasoning and problem-solving within complex scientific domains. This success isn’t merely about automating existing tasks; it suggests a future where AI actively collaborates with scientists, accelerating the pace of research by identifying subtle connections, proposing novel hypotheses, and offering fresh perspectives on established data. The implications extend beyond chemistry, hinting at a broader applicability across various scientific disciplines and a fundamental change in how knowledge is generated and validated.
Multimodal large language models represent a paradigm shift in scientific inquiry, extending beyond simple automation to become active collaborators in the discovery process. These models don’t just process data; they demonstrate an ability to synthesize information from diverse sources – text, images, and potentially other modalities – to formulate novel hypotheses that a researcher might not immediately consider. By identifying patterns and correlations within complex datasets, these systems can assist in interpreting experimental results and even suggest avenues for further investigation. This capacity for insight generation stems from their ability to learn relationships and extract meaning beyond explicit programming, potentially accelerating the pace of scientific understanding and enabling breakthroughs in fields ranging from materials science to drug discovery.
A significant challenge accompanying the rise of powerful multimodal large language models lies in their frequently proprietary nature. While these models demonstrate remarkable capabilities in scientific reasoning, access is often restricted to those with the resources to utilize them through APIs or specific platforms. This limited accessibility hinders independent verification of results and impedes the broader scientific community’s ability to build upon these advancements. Reproducibility, a cornerstone of scientific rigor, becomes difficult when the underlying model and its parameters are not openly available for scrutiny and replication. Consequently, progress in the field may be concentrated within a select few institutions, potentially slowing down the overall pace of discovery and innovation.
Continued investigation into multimodal large language models demands a dual focus on capability and responsible implementation. While current benchmarks demonstrate impressive performance, a comprehensive understanding of these models’ limits – including potential biases embedded within training data and susceptibility to adversarial inputs – remains crucial. Future research must prioritize not only enhancing predictive power and expanding the range of solvable scientific problems, but also developing robust methods for ensuring transparency, accountability, and fairness in their application. This includes exploring techniques for model interpretability, mitigating potential harms, and establishing clear ethical guidelines for their deployment across diverse scientific disciplines and beyond, fostering trust and maximizing benefit while minimizing risk.

The evaluation of large language models, as demonstrated by the USNCO-V benchmark, reveals a critical interplay between component capabilities and systemic performance. Models may excel at isolated tasks – identifying elements in an image or recalling chemical principles – yet stumble when integrating these abilities for complex, multimodal reasoning. This echoes a fundamental principle of systemic design: a weakness in one area inevitably propagates through the whole. As Ada Lovelace observed, “The Analytical Engine has no pretensions whatever to originate anything.” The engine, like these models, can perform what it is programmed to do, but true intelligence, and robust performance on challenges like the USNCO-V exam, requires more than just processing power; it demands a cohesive structure capable of translating data into insightful solutions.
Beyond the Surface
The demonstration that large language models can, in certain instances, exceed human performance on a specialized exam like the USNCO-V is less a triumph of artificial intelligence than a pointed commentary on the nature of the exam itself. The system responds to patterns, of course, but true understanding – the ability to reconfigure knowledge and apply it to genuinely novel scenarios – remains elusive. The benchmark reveals not so much intelligence as a refined capacity for mimicry, a sophisticated form of pattern completion.
Future work must move beyond simply integrating visual and textual data; the challenge lies in fostering genuine reasoning across modalities. The current models treat images as another source of tokens, effectively flattening information. A more robust architecture will likely necessitate a deeper representation of underlying physical principles, a structured understanding of chemical relationships, rather than a purely correlative one.
Ultimately, the limitations exposed by USNCO-V highlight a fundamental truth: elegance in a system derives not from the accumulation of data, but from the clarity of its underlying structure. The pursuit of artificial intelligence is, at its core, a quest to understand the principles of organization – the same principles that govern all complex systems, from molecules to minds.
Original article: https://arxiv.org/pdf/2512.14989.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/