Author: Denis Avetisyan
A new approach uses artificially generated research problems to train artificial intelligence agents in the iterative process of scientific discovery.

Researchers present a scalable pipeline for synthetic task generation, enabling AI to perform multi-hop reasoning and long-form question answering for automated machine learning research.
Despite advances in artificial intelligence, training agents to autonomously conduct meaningful scientific research remains a significant challenge, often hindered by a lack of principled training data. This work, ‘AI Scientist via Synthetic Task Scaling’, introduces a pipeline for automatically generating scalable, high-quality synthetic machine learning research tasks, complete with dataset proposals and code generation. Experiments demonstrate that training AI agents on these synthetic tasks, verified against real-world data via the Huggingface API, significantly improves performance on benchmark MLGym tasks, raising the AUP metric by up to 12% for tested models. Could this approach pave the way for AI systems capable of independent scientific discovery and iterative problem-solving in complex domains?
The Challenge of Complex Information Synthesis
Contemporary question answering systems face significant hurdles when tasked with long-form question answering (LongFormQA), a process demanding the synthesis of information dispersed across multiple source documents. Unlike systems designed to retrieve answers from a single text passage, LongFormQA requires a deeper level of comprehension and the ability to establish connections between disparate pieces of information. This presents a considerable challenge, as current models often struggle to effectively aggregate and reconcile conflicting or nuanced details presented in multiple documents, leading to incomplete or inaccurate responses. The difficulty isn’t simply locating relevant information, but rather in performing the complex reasoning necessary to construct a cohesive and well-supported answer from a fragmented knowledge base, a key limitation in achieving truly intelligent question answering capabilities.
Current question answering systems frequently encounter difficulties when tasked with synthesizing information from multiple sources, a critical limitation impacting their ability to address complex queries. The challenge lies not simply in retrieving relevant passages, but in effectively integrating these disparate pieces of information into a coherent and accurate response. These systems often struggle to identify relationships between facts presented in different documents, leading to fragmented or contradictory answers. This inability to perform robust information synthesis hinders performance on long-form question answering, where a complete understanding necessitates a holistic view derived from multiple documents, rather than isolated facts. Consequently, even with access to vast knowledge bases, the systems’ responses can lack the nuance and depth required for truly insightful answers.
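The gap between single-passage retrieval and multi-hop synthesis can be made concrete with a toy example. In the sketch below, the corpus, film title, and "bridge" entity are all invented for illustration: no single document answers the question, so the system must chain a lookup through an intermediate entity found in the first document.

```python
# Toy corpus for the question "In which country was the director of Film X born?"
# Neither document alone contains the answer; the bridge entity "Jane Doe"
# discovered in the first document is the key to the second.
docs = {
    "Film X": "Film X is a 2001 drama directed by Jane Doe.",
    "Jane Doe": "Jane Doe is a filmmaker born in Norway.",
}

def one_hop(entity):
    """Single-passage retrieval: fetch at most one document by entity."""
    return docs.get(entity, "")

def two_hop(film):
    """Chain two lookups: film -> director (bridge entity) -> birthplace."""
    first = one_hop(film)
    director = first.split("directed by ", 1)[1].rstrip(".")  # bridge entity
    second = one_hop(director)
    return second.split("born in ", 1)[1].rstrip(".")

print(two_hop("Film X"))  # -> Norway
```

A system limited to one-hop retrieval returns at best a document about Film X; only by synthesizing across both documents does the answer emerge.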
HotpotQA: A Rigorous Testbed for Multi-Hop Reasoning
The HotpotQA dataset is designed to assess an AI agent’s capacity for multi-hop reasoning, requiring the synthesis of information from multiple supporting documents to answer a question. Unlike simpler question answering datasets, HotpotQA necessitates identifying several relevant passages – typically more than one – and then logically combining the information contained within them. The dataset consists of questions posed over long documents, with answers requiring inference rather than direct retrieval. Evaluation metrics focus on both answer accuracy and the ability to correctly identify the supporting facts used to arrive at the answer, providing a comprehensive assessment of reasoning capabilities beyond simple fact matching.
The HotpotQA dataset incorporates a DistractorSetting to specifically challenge question answering models with extraneous information. This setting includes irrelevant documents and sentences alongside those containing supporting facts, requiring models to discern which information is genuinely pertinent to answering the given question. The inclusion of distractors increases the difficulty of the task by forcing models to move beyond simple keyword matching and instead perform a more nuanced evaluation of semantic relevance. Performance within the DistractorSetting serves as a key indicator of a model’s ability to filter noise and focus on the critical evidence needed for accurate multi-hop reasoning.
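The distractor setting can be sketched with HotpotQA's record layout, in which `context` holds `[title, sentences]` pairs and `supporting_facts` holds `[title, sentence_index]` pairs; the miniature record below is invented for illustration, not taken from the dataset.

```python
# A miniature record in the HotpotQA distractor-setting layout (invented content).
example = {
    "question": "Which city hosts the university founded in 1837?",
    "context": [
        ["Univ. A", ["Univ. A was founded in 1837.", "It sits in City B."]],
        ["River C", ["River C flows through several regions."]],  # distractor
        ["City B", ["City B is a mid-sized city."]],              # distractor
    ],
    "supporting_facts": [["Univ. A", 0], ["Univ. A", 1]],
}

def split_gold_and_distractors(ex):
    """Separate paragraphs containing supporting facts from pure distractors."""
    gold_titles = {title for title, _ in ex["supporting_facts"]}
    gold = [p for p in ex["context"] if p[0] in gold_titles]
    distractors = [p for p in ex["context"] if p[0] not in gold_titles]
    return gold, distractors

gold, noise = split_gold_and_distractors(example)
print([p[0] for p in gold])   # paragraphs holding supporting facts
print([p[0] for p in noise])  # paragraphs the model must learn to ignore
```

The benchmark withholds the `supporting_facts` field at prediction time, so a model must recover this split from semantic relevance alone.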
Robust Supporting Fact Selection is essential for performance in question answering tasks involving distractor settings, such as the HotpotQA dataset. This process requires identifying and extracting only the sentences directly relevant to answering the question, while effectively filtering out irrelevant or contradictory information presented as distractors. Models must not only locate supporting facts within a document but also discern their validity and relationship to the query, demanding a nuanced understanding of semantic relevance beyond simple keyword matching. Failure to accurately select supporting facts leads to incorrect answers, even when the necessary information is present within the provided context.
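A minimal sketch of supporting-fact selection ranks candidate sentences by token overlap with the question, a crude lexical proxy for the learned semantic scorers real systems use; the sentences below are invented.

```python
import re

def tokens(text):
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def select_supporting_sentences(question, sentences, top_k=2):
    """Rank sentences by token overlap with the question; keep the top k.
    This catches only lexical relevance -- exactly the shortcut that
    distractor settings are designed to defeat."""
    q = tokens(question)
    scored = sorted(
        ((len(q & tokens(s)), i, s) for i, s in enumerate(sentences)),
        key=lambda t: (-t[0], t[1]),  # high overlap first, stable order
    )
    return [(i, s) for score, i, s in scored[:top_k] if score > 0]

sentences = [
    "The bridge was completed in 1932.",
    "Local cuisine features seafood.",
    "The bridge spans the harbour of the city.",
]
print(select_supporting_sentences("When was the bridge completed?", sentences))
```

Note that the second-ranked sentence here is only lexically related; discarding such near-miss candidates is precisely where keyword matching fails and semantic modeling becomes necessary.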
Establishing a Baseline for Answer Extraction
A BaselineModel in answer extraction functions as a preliminary system against which the performance of more complex models is evaluated. This foundational approach typically employs relatively simple algorithms – such as keyword matching or regular expressions – to identify potential answers within a given text. Establishing this baseline is critical for determining whether subsequent, more computationally intensive techniques offer statistically significant improvements in accuracy, recall, or F1-score. The BaselineModel’s primary purpose isn’t to achieve state-of-the-art results, but rather to provide a consistent and easily reproducible standard for comparative analysis during the development and refinement of answer extraction systems.
AnswerExtraction is the core component enabling a BaselineModel to function; it involves locating specific segments, or spans, of text within a source document that directly address the posed question. This process typically involves analyzing the question’s semantic content to identify relevant keywords and entities, then searching the document for matching or related textual units. The identified spans are then evaluated based on factors such as contextual relevance and grammatical correctness to determine the most appropriate answer. The output of AnswerExtraction is a discrete textual segment representing the model’s response to the input question, forming the basis for performance evaluation and comparison with other extraction methods.
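A baseline extractor of the kind described can be sketched with question-word-triggered regular expressions; the patterns, question types, and document below are illustrative inventions, deliberately simple rather than representative of the paper's system.

```python
import re

# Keyword-triggered span patterns: "when" questions look for a year,
# "who" questions look for a capitalized two-word name.
PATTERNS = {
    "when": r"\b(1[0-9]{3}|20[0-9]{2})\b",
    "who":  r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b",
}

def extract_answer(question, document):
    """Pick a span from the document using the question's leading word."""
    qword = question.lower().split()[0]
    pattern = PATTERNS.get(qword)
    if pattern:
        match = re.search(pattern, document)
        if match:
            return match.group(1)
    return ""  # no extraction -- counts against recall

doc = "The observatory was opened in 1888 by Clara Benn."
print(extract_answer("When did the observatory open?", doc))  # -> 1888
print(extract_answer("Who opened the observatory?", doc))     # -> Clara Benn
```

Such a baseline is easy to reproduce and cheap to run, which is exactly what makes it a useful yardstick for more expensive neural extractors.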
The BaselineModel frequently utilizes JSONFormat for data input due to its inherent structure and parsing efficiency. JSON’s key-value pairs allow for clear delineation of question-answer relationships and document context, facilitating automated processing. This format enables the model to readily access relevant information, including the question text, the document containing the answer, and the character-based offsets indicating the answer’s location within the document. The use of JSON also simplifies data validation and integration with various programming languages and machine learning frameworks, streamlining the development and deployment of the BaselineModel.
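An input record in this style might look like the following; the field names are illustrative rather than the paper's exact schema, with character offsets locating the answer span inside the document.

```python
import json

# Hypothetical input record; "answer_start"/"answer_end" are character offsets.
raw = """
{
  "question": "Where is the river's source?",
  "document": "The river rises in the Blue Hills and flows south.",
  "answer_start": 23,
  "answer_end": 33
}
"""

record = json.loads(raw)
# Slicing by the stored offsets recovers the gold answer span.
span = record["document"][record["answer_start"]:record["answer_end"]]
print(span)  # -> Blue Hills
```

Because the offsets are part of the record, validating a dataset reduces to checking that each slice reproduces the expected answer string.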

Measuring Performance: Comprehensive Evaluation Metrics
Question answering systems are rigorously evaluated using metrics designed to quantify both accuracy and completeness. Exact Match (EM) determines if a model’s generated answer perfectly matches a known correct answer, offering a strict measure of precision. However, recognizing that multiple valid answers often exist, the F1 score provides a more nuanced assessment by calculating the harmonic mean of precision and recall between the generated and reference answers. This metric considers overlapping words and phrases, acknowledging partial correctness even when an exact match isn’t achieved. By employing both EM and F1, researchers gain a comprehensive understanding of a model’s ability to not only find the right answer, but also to provide complete and relevant information, crucial for building trustworthy and effective question answering systems.
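The two metrics can be sketched with the normalization conventions common to extractive QA evaluation (lowercasing, stripping articles and punctuation); the example strings are invented.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation -- standard QA normalization."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    return " ".join(re.findall(r"\w+", text))

def exact_match(pred, gold):
    """1.0 only if the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Harmonic mean of token precision and recall: credits partial overlap."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))           # -> 1.0
print(round(f1_score("tower in Paris", "the Eiffel Tower"), 2))  # -> 0.4
```

The second call shows why F1 matters: the prediction shares one token with the gold answer, earning partial credit where Exact Match would score zero.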
A thorough evaluation of question answering systems necessitates assessing performance across multiple facets of the process, and consequently, metrics like Exact Match and F1 score aren’t solely applied to the final answer. These measures are also critically used to evaluate SupportingFactSelection, the model’s ability to correctly identify the relevant evidence from a knowledge source that justifies its answer. By independently scoring both AnswerExtraction – the precision of the answer itself – and SupportingFactSelection, researchers gain a nuanced understanding of where a model excels or falters. This dual assessment reveals whether a poor answer stems from an inability to find the right information, or from a failure to synthesize that information correctly, ultimately enabling more targeted improvements to the system’s architecture and training data.
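One way to couple the two assessments, used by HotpotQA's joint metric, is to multiply the precisions and recalls of the answer and supporting-fact tasks before computing F1; a minimal sketch, assuming component precision/recall values are already available:

```python
def joint_f1(ans_prec, ans_rec, sp_prec, sp_rec):
    """Joint F1 in the HotpotQA style: answer and supporting-fact precisions
    and recalls are multiplied, so a model must do well on BOTH subtasks
    to score at all."""
    joint_prec = ans_prec * sp_prec
    joint_rec = ans_rec * sp_rec
    if joint_prec + joint_rec == 0:
        return 0.0
    return 2 * joint_prec * joint_rec / (joint_prec + joint_rec)

# A perfect answer with mediocre fact selection still scores low jointly.
print(round(joint_f1(1.0, 1.0, 0.5, 0.25), 4))  # -> 0.3333
```

The multiplication makes the metric deliberately punishing: joint scores sit well below either component score, which is why even modest joint values can reflect nontrivial capability.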
Evaluations on the MLGym benchmark reveal a substantial enhancement in performance facilitated by this approach. Specifically, agents trained with the developed method achieved a 9% gain in the AUP metric on the Qwen3-4B architecture, and a more pronounced 12% improvement with the larger Qwen3-8B model. These results indicate a clear advancement over baseline models, demonstrating the effectiveness of the techniques employed in enhancing multi-hop reasoning and answer extraction capabilities.
A crucial element in assessing the capacity of question answering systems to perform complex reasoning is the Joint F1 score, a metric designed specifically for multi-hop questions: those requiring synthesis of information from multiple sources. Joint F1 is a deliberately strict test, as it demands that both the extracted answer and the selected supporting facts be correct simultaneously; simply retrieving relevant facts is insufficient, since the system must also logically integrate them. Measurements yielded a Joint F1 score of approximately 0.0222, a quantitative reference point for the method's ability to connect disparate pieces of information under this demanding criterion.
The presented research embodies a philosophy of incremental development, mirroring the evolution of complex systems. Just as infrastructure should evolve without rebuilding the entire block, this pipeline for synthetic task scaling allows for iterative refinement of AI agents’ scientific capabilities. The generation of increasingly complex multi-hop reasoning challenges, facilitated by synthetic data, demonstrates a commitment to building upon existing foundations rather than attempting wholesale redesign. This approach aligns with Dijkstra’s observation that, “It is not enough to make things working; they must also be understandable.” The pipeline’s scalability and focus on verifiable experimentation promote not only functional progress but also a deeper comprehension of the AI’s learning process – a crucial element for robust and reliable scientific discovery.
Beyond the Horizon
The presented work establishes a method for generating complexity, but does not resolve the fundamental question of what constitutes meaningful scientific progress. Scaling synthetic tasks reveals the capacity for automated discovery, yet the true measure lies in the novelty and generalizability of those discoveries – qualities difficult to assess within a self-contained system. The elegance of this pipeline rests on its ability to sidestep the bottlenecks of human-curated datasets, but it simultaneously inherits the biases embedded within the initial generative models. A crucial next step involves mechanisms for evaluating the ‘surprisingness’ of results, and actively steering the agent towards unexplored regions of the scientific landscape.
One anticipates that future iterations will focus on refining the feedback loops – not simply rewarding successful experimentation, but penalizing unproductive paths and encouraging conceptual leaps. The current framework treats the agent as a solitary explorer; integrating multiple agents, each with specialized roles and competing hypotheses, could foster a more robust and efficient discovery process. However, such a system introduces new challenges regarding coordination, communication, and the potential for emergent, and perhaps undesirable, behaviors.
Ultimately, the limitations are not computational, but conceptual. The real frontier lies in defining what it means to ‘understand’ a scientific phenomenon, and translating that understanding into actionable insights. A scalable pipeline is merely a tool; the quality of the science it produces depends entirely on the clarity and depth of the questions it is designed to address.
Original article: https://arxiv.org/pdf/2603.17216.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/