Can AI Truly Do Science?

Author: Denis Avetisyan


New research reveals that while artificial intelligence can now perform scientific tasks, it doesn’t necessarily understand the reasoning behind them.

Analysis of LLM-based agents demonstrates a lack of epistemic rigor in scientific workflows, specifically regarding evidence-based reasoning and belief revision.

Despite increasing deployment of artificial intelligence in scientific discovery, a fundamental question remains regarding whether these systems reason scientifically. Our work, detailed in ‘AI scientists produce results without reasoning scientifically’, systematically evaluates large language model (LLM)-based agents across diverse scientific tasks, revealing a disconnect between workflow execution and genuine epistemic rigor. We find that performance is driven primarily by the base model, with minimal contribution from agent scaffolding, and that critical elements of scientific reasoning – such as evidence consideration and belief revision – are strikingly absent. This raises a crucial question: can scientific knowledge generated by these agents be justified if the process itself lacks the hallmarks of sound scientific inquiry?


The Burden of Human Intervention in Scientific Discovery

Established scientific processes, despite their proven reliability, inherently depend on substantial human intervention at each stage – from formulating initial hypotheses and designing experiments to interpreting results and drawing conclusions. This reliance creates bottlenecks, particularly when confronting the sheer volume of data generated by modern instruments and simulations. The traditional workflow struggles to rapidly assimilate new information, meaning potentially groundbreaking discoveries can be delayed as researchers manually sift through findings and recalibrate their approaches. While meticulous, this iterative process is often ill-equipped to handle the accelerating pace of data generation, hindering the potential for truly dynamic and responsive scientific exploration.

The frontiers of scientific inquiry are increasingly defined by problems of immense complexity, exceeding the capacity of traditional, manually-driven research approaches. Contemporary challenges – from modeling climate change and predicting protein folding to discovering novel materials – demand systems capable of more than just data analysis; they require autonomous agents that can formulate testable hypotheses, design and execute experiments – whether physical or in silico – and rigorously interpret the resulting evidence. This necessitates a shift toward tools that don’t simply accelerate existing workflows, but actively participate in the scientific process, iteratively refining understanding and identifying previously unconsidered avenues of investigation. Such capabilities promise to unlock insights hidden within the deluge of modern data and accelerate the pace of discovery, moving beyond correlation to establish genuine causal relationships and build more robust, predictive models of the natural world.

Contemporary artificial intelligence often falters when tasked with genuine scientific inquiry, not due to a lack of processing power, but because of an inability to cohesively blend hypothesis formation, experimental design, and rigorous evidence evaluation. Studies reveal a significant ‘Evidence Non-Uptake Rate’ – currently measured at 68% across analyzed research traces – indicating that a substantial portion of potentially relevant data is either ignored or misinterpreted by these systems. This isn’t simply a matter of inaccurate conclusions; it demonstrates a fundamental difficulty in utilizing existing knowledge to refine understanding, resulting in the generation of claims unsupported by available evidence and hindering the potential for accelerated scientific discovery. The challenge, therefore, lies not merely in building smarter algorithms, but in creating systems capable of embodying the core principles of the scientific method itself.

Liberating Inquiry: LLM Agents as Catalysts for Discovery

LLM-based agents represent a paradigm shift in scientific workflows by integrating large language models (LLMs) with predefined structural frameworks. This combination moves beyond simple LLM prompting by enabling autonomous execution of research tasks. The framework provides the LLM with a defined operational space, managing the sequence of steps required for a given scientific inquiry. This allows the agent to not only generate hypotheses or interpret data, but also to independently initiate data acquisition, execute analyses using specialized tools, and iterate on its approach without constant human intervention. The resulting workflows are designed for repeatability and scalability, offering the potential to accelerate discovery across diverse scientific domains.

The Agent Scaffold serves as the foundational architecture for LLM-based agents in scientific discovery, providing the necessary infrastructure to manage the complex workflows involved in autonomous research. This scaffold encompasses several core functionalities: prompt engineering to guide the LLM’s reasoning, dynamic tool selection based on task requirements, and overall orchestration to sequence actions and manage data flow. Specifically, it handles the parsing of research goals into actionable steps, the identification of appropriate tools – such as databases, simulation software, or analytical packages – and the execution of those tools with relevant parameters. By centralizing these functions, the Agent Scaffold enables iterative experimentation, data-driven decision-making, and ultimately, the automation of scientific processes.
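The orchestration loop described above can be sketched in miniature. Everything in this sketch (the `plan` function standing in for the LLM planner, the toy `TOOLS` registry, the `run_agent` name) is a hypothetical illustration of the goal-to-steps-to-tools pattern, not the implementation evaluated in the paper:

```python
# Minimal sketch of an agent scaffold: parse a goal into steps,
# select a tool for each step, execute it, and record the trace.
from typing import Callable

# Registry of available "tools"; a real scaffold would wrap databases,
# simulators, or analysis packages behind this same interface.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda query: f"record for {query!r}",
    "analyze": lambda data: f"summary of {data!r}",
}

def plan(goal: str) -> list[tuple[str, str]]:
    """Stand-in for the LLM planner: decompose a goal into (tool, argument) steps."""
    return [("lookup", goal), ("analyze", f"record for {goal!r}")]

def run_agent(goal: str) -> list[str]:
    """Orchestration loop: execute each planned step and collect tool outputs."""
    trace = []
    for tool_name, arg in plan(goal):
        result = TOOLS[tool_name](arg)
        trace.append(f"{tool_name} -> {result}")
    return trace

print(run_agent("compound X"))
```

The point of the pattern is that planning, tool selection, and execution are centralized in one loop, which is what makes the resulting workflows repeatable.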

Structured tool-calling is a core mechanism enabling LLM-based agents to perform scientific tasks by interfacing with external resources. This process moves beyond simple text generation, allowing the agent to identify when a specific task requires an external tool – such as a database query engine, a computational chemistry package, or a data visualization library – and to formulate a precise request to that tool. The tool then executes the request and returns the results to the agent, which can interpret the output and integrate it into its ongoing workflow. Successful implementation requires a defined interface between the LLM and available tools, including clearly specified input parameters and output formats, ensuring reliable and repeatable interactions for data acquisition and analysis.
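A minimal sketch of that interface, assuming a JSON-formatted tool call (the `query_database` tool, its parameter names, and the `dispatch` helper are all illustrative, not a specific vendor API):

```python
# Hedged sketch of structured tool-calling: the model emits a JSON object
# naming a tool and its arguments; the scaffold validates the call against
# a declared parameter spec and dispatches it.
import json

TOOL_SPECS = {
    "query_database": {"required": ["table", "filter"]},
}

def query_database(table: str, filter: str) -> str:
    # Placeholder for a real database query engine.
    return f"rows from {table} where {filter}"

def dispatch(raw_call: str) -> str:
    """Parse the model's JSON tool call, check required parameters, execute."""
    call = json.loads(raw_call)
    name, args = call["name"], call["arguments"]
    missing = [p for p in TOOL_SPECS[name]["required"] if p not in args]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return globals()[name](**args)

model_output = '{"name": "query_database", "arguments": {"table": "spectra", "filter": "mz > 100"}}'
print(dispatch(model_output))  # rows from spectra where mz > 100
```

The validation step is what makes the interaction repeatable: a malformed call fails loudly at the interface rather than propagating bad parameters into the analysis.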

Demonstrating Capability Across Scientific Disciplines

LLM-based agents are demonstrating utility in a growing number of scientific fields. In spectroscopic structure elucidation, these agents can interpret spectral data – such as mass spectrometry and NMR – to propose molecular structures. Within inorganic qualitative analysis, they facilitate the identification of unknown compounds based on chemical tests and observations. Furthermore, LLM agents are being applied to circuit inference, where they analyze circuit behavior to deduce the underlying network topology. These applications highlight the agents’ capacity to process complex, discipline-specific data and perform reasoning tasks relevant to scientific investigation.

LLM-based agents demonstrate capability in computationally intensive scientific tasks, specifically Molecular Dynamics (MD) and Retrosynthetic Planning. MD simulations, which model the time-dependent behavior of molecular systems, benefit from agent-driven automation of parameter selection and analysis of trajectory data. Similarly, in Retrosynthetic Planning – the automated design of chemical syntheses – agents can navigate complex reaction networks and propose viable synthetic routes by leveraging large chemical databases and predictive algorithms. These applications require substantial computational resources, and agent implementation offers potential for increased efficiency and scalability compared to traditional methods.

LLM-based agents are demonstrating capability in experimental workflows through integration with advanced data acquisition and analysis techniques. Specifically, these agents can control instrumentation such as Atomic Force Microscopy (AFM) to collect high-resolution surface topography data. Following data acquisition, Machine Learning (ML) algorithms, implemented within the agent framework, are utilized for subsequent analysis, including feature identification, pattern recognition, and quantitative measurements derived from the AFM data. This integration allows for automated experimentation and data processing, increasing efficiency and enabling complex analyses beyond traditional methods.

The Fragility of Belief: A Critical Assessment of Agent Reasoning

The effectiveness of advanced LLM-Based Agents hinges on their ability to revise beliefs when confronted with conflicting evidence – a process known as Refutation-Driven Belief Revision. This isn’t simply about accumulating data, but actively updating internal models in response to disconfirming information, mirroring a core tenet of scientific reasoning. However, current research indicates this capability remains limited; observed traces reveal that agents successfully refute and revise beliefs in only 26% of cases. This suggests a significant gap between the potential for rational agency and its actual implementation in these systems, highlighting a critical area for improvement in the pursuit of truly robust and reliable AI.

The pursuit of reliable conclusions within complex systems necessitates a move beyond singular data points; instead, the framework prioritizes convergent multi-test evidence, demanding that assertions are consistently supported by multiple, independent lines of inquiry. This approach actively mitigates the risk of drawing inaccurate conclusions from anomalous results or biased data, strengthening the overall validity of the agent’s reasoning. However, current observations reveal that this rigorous standard is only met 7% of the time across analyzed traces, indicating a significant opportunity for improvement in bolstering the robustness of LLM-based agents and ensuring consistently well-supported beliefs.

The agent’s reasoning isn’t simply a matter of processing information, but is instead governed by a foundational epistemological structure – a system dictating how hypotheses are formed, experiments designed, and evidence critically evaluated. This structure leverages token-level log-probability, a metric used to gauge the model’s own confidence in its assertions. However, recent research reveals a surprising dynamic: the inherent capabilities of the base language model itself explain a substantial 41.4% of the overall performance variance. This suggests that, while carefully constructed scaffolding and reasoning frameworks are intended to enhance performance, their contribution – measured at just 1.5% of explained variance – is currently overshadowed by the pre-existing knowledge and capabilities embedded within the foundational model itself, highlighting a critical area for future development and optimization.
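To make the confidence metric concrete, here is a small sketch of how token-level log-probabilities are typically turned into a confidence score. The per-token values are made up for illustration; a real system would read them from the model's output:

```python
# Sketch: token-level log-probabilities as a confidence signal.
import math

# Hypothetical log-probabilities for a four-token assertion.
token_logprobs = [-0.05, -0.40, -0.10, -1.20]

# The joint probability of the assertion is the product of token
# probabilities, i.e. the exponential of the summed log-probabilities.
sequence_prob = math.exp(sum(token_logprobs))

# A length-normalized score (mean log-prob) avoids systematically
# penalizing longer answers when comparing assertions of different lengths.
mean_logprob = sum(token_logprobs) / len(token_logprobs)

print(round(sequence_prob, 4), round(mean_logprob, 4))
```

Note that this score reflects the model's fluency on its own output, not calibrated belief, which is part of why scaffolding built on it contributes so little explained variance.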

A Future Defined by AI-Driven Scientific Discovery

The landscape of scientific investigation is undergoing a fundamental transformation with the emergence of Large Language Model (LLM)-based agents. Traditionally, scientific discovery has been a largely human-directed process, reliant on researchers to formulate hypotheses, design experiments, and interpret results. However, these AI agents represent a shift towards autonomous inquiry, capable of independently generating hypotheses, planning experiments – even controlling laboratory equipment in some instances – and analyzing data to draw conclusions. This isn’t simply automation of existing tasks; it’s a move from scientists directing discovery to AI agents driving it, potentially uncovering patterns and relationships previously obscured by human bias or limitations in scale. The implications are profound, suggesting a future where scientific progress is accelerated not by faster human researchers, but by a continuously learning, AI-driven system of discovery.

Current research and development surrounding LLM-based scientific agents are heavily focused on bolstering their capacity for independent operation and sophisticated analytical thinking. A significant hurdle lies in the low ‘Pass@k’ rate – currently below 0.05 – observed in hypothesis-driven scientific fields, indicating a limited ability to successfully navigate complex problem-solving. Consequently, efforts are being directed toward improving the reasoning capabilities of these agents, enabling them to not only generate hypotheses but also critically evaluate evidence and refine their approaches. This includes exploring methods to enhance planning, execution, and error correction within the agent’s workflow. Furthermore, the scope of application is expanding beyond well-defined laboratory settings to encompass a broader range of scientific disciplines, with the ultimate goal of creating agents capable of driving discovery across diverse and challenging research areas.
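For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled attempts solves the task, and it is commonly computed with the unbiased estimator from the code-generation evaluation literature, pass@k = 1 - C(n-c, k)/C(n, k), where n attempts are sampled and c succeed. The values below are illustrative, not the paper's data:

```python
# Unbiased pass@k estimator: draw n samples, observe c successes,
# estimate the chance that a batch of k samples contains at least one success.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k); returns 1.0 when fewer than k failures exist."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=1, k=5))  # 0.25
```

A rate below 0.05 therefore means that even granting the agent k attempts, almost no task instances yield a single success.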

The advent of LLM-based scientific agents signals a potential revolution in how discovery unfolds, offering the capacity to not only analyze existing data with unprecedented speed but also to proactively formulate and test hypotheses. This acceleration of the scientific method extends beyond incremental improvements; it promises entirely novel insights by identifying patterns and connections previously obscured by the limitations of human analysis or the sheer volume of complex data. By automating the traditionally laborious process of experimentation and interpretation, these agents can dramatically shorten the time between initial inquiry and meaningful results, ultimately fostering innovation across diverse fields and equipping researchers with powerful tools to address pressing global challenges, from climate change and disease to resource management and sustainable energy.

The research meticulously details the performance of a process devoid of underlying principle. LLM-based agents successfully execute scientific workflows, yet operate without demonstrable scientific reasoning or belief revision – a crucial element of genuine inquiry. This echoes Arthur C. Clarke’s observation: “Any sufficiently advanced technology is indistinguishable from magic.” The systems appear to perform science, achieving results, but the mechanism remains opaque, lacking the epistemic foundations that define the practice. The study effectively highlights this distinction, demonstrating the difference between algorithmic proficiency and true understanding – a critical point as these agents become increasingly integrated into scientific endeavors.

Where Do We Go From Here?

The demonstration that automated agents can perform scientific workflows without understanding them is not a surprise, merely a clarification. A system that needs instructions has already failed. The core issue is not the imitation of scientific output, but the absence of epistemic principles. Current approaches prioritize synthetic performance over genuine belief revision – the capacity to discard hypotheses in the face of contradictory evidence. This is not a limitation of language models, but a fundamental flaw in the objective itself. To demand ‘reasoning’ from a system designed to predict tokens is to mistake the map for the territory.

Future work must abandon the pursuit of artificial ‘intelligence’ and focus instead on artificial clarity. The challenge lies not in building agents that can generate plausible narratives, but in creating systems that reveal their own ignorance. Trace intervention, while promising, remains a descriptive tool, not a prescriptive one. It shows how errors occur, not why they should be avoided. The ultimate goal is not to replicate human cognition, but to surpass it – to build systems that operate with a level of transparency and self-correction that is currently beyond our reach.

Perhaps the most productive avenue for investigation is the deliberate introduction of ‘friction’ – constraints and limitations designed to force the agent to confront its own uncertainty. Clarity is courtesy. A system that acknowledges what it does not know is, in a very real sense, more intelligent than one that pretends to know everything.


Original article: https://arxiv.org/pdf/2604.18805.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-22 08:32