Histopathology’s New Agent: AI Automates Complex Image Analysis

Author: Denis Avetisyan

Researchers have developed an agentic system that leverages the power of large language models to unlock deeper insights from whole-slide histopathology images.

The system autonomously translates user requests regarding histological images into executable Python code, iteratively refining the analysis through a feedback loop in which generated code informs subsequent iterations, effectively enabling dynamic, multi-stage investigation of image data without explicit programming.

Nova, an agentic framework coupled with the SlideQuest benchmark, streamlines automated histopathology analysis and discovery.

Despite advances in digital pathology, complex analyses remain time-consuming and require specialized expertise, limiting broad accessibility. This paper introduces NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery, a system that leverages large language models to translate scientific queries into executable Python code, integrating 49 open-source tools and enabling on-demand tool creation. We demonstrate NOVA’s capabilities with SlideQuest, a novel 90-question benchmark designed to assess multi-step reasoning and computational problem-solving in histopathology, and show it outperforms existing coding-agent baselines. Can this agentic approach unlock scalable discovery and fundamentally reshape the landscape of computational pathology research?

Deconstructing Diagnosis: The Limits of Convention

The established practice of histopathological analysis, while crucial for accurate diagnosis and biomedical research, faces inherent limitations in efficiency and consistency. Examining tissue samples under a microscope is a remarkably time-consuming process, demanding significant attention from highly trained pathologists. Furthermore, subjective interpretation is unavoidable, leading to demonstrable variations – known as inter-observer variability – even amongst experts. This discrepancy in assessment can delay diagnoses, introduce uncertainty in research findings, and potentially impact patient care. The sheer volume of samples, coupled with the increasing complexity of diagnostic criteria, exacerbates this bottleneck, highlighting the urgent need for innovative solutions to streamline and standardize the process of pathological evaluation.

The advent of digital pathology has generated an unprecedented surge in Whole-Slide Images (WSIs), far exceeding the capacity of traditional manual analysis. This escalating volume isn’t simply a scaling problem; effective diagnosis increasingly requires complex, multi-step reasoning – identifying subtle patterns, integrating information across vast tissue areas, and differentiating between nuanced disease states. Automated systems must therefore move beyond simple object detection and embrace cognitive tasks like contextual analysis and hierarchical inference. Such intelligent agents are needed to efficiently navigate these high-resolution images, prioritize regions of interest, and ultimately, synthesize a comprehensive understanding of the underlying pathology – a feat demanding computational power and sophisticated algorithms capable of mirroring the expertise of a skilled pathologist.

Current image analysis workflows in pathology frequently struggle with generalization, exhibiting limited performance when applied to new staining protocols, tissue types, or even variations in imaging hardware. These pipelines are often meticulously tailored to specific datasets, relying on hand-engineered features and rigid parameter settings that fail to capture the inherent biological variability present in clinical samples. This inflexibility necessitates substantial retraining or manual adjustments whenever the input data deviates from the original training set, creating a significant impediment to widespread adoption and hindering the potential for truly scalable, automated diagnostic solutions. Consequently, a demand exists for more robust and adaptable systems capable of learning transferable representations and reasoning across diverse pathological contexts, moving beyond narrowly focused, dataset-specific approaches.

Nova successfully characterizes breast cancer subtypes (Luminal A, Luminal B, Basal-like, HER2-enriched) by analyzing morphological features and correlating them with tumor characteristics, as demonstrated in this case study and detailed in Figures B.1 and B.2.

Nova: Architecting an Agentic Pathology

Nova’s core functionality relies on a Large Language Model (LLM) to process user instructions expressed in natural language. This LLM serves as the central decision-making component, translating textual queries into executable Python code. The LLM is responsible for understanding the intent of the request – for example, identifying specific cellular structures or quantifying staining intensity – and formulating the appropriate code to achieve the desired histopathological analysis. The generated Python code is then passed to an interpreter for execution, effectively allowing users to interact with histopathological data through intuitive language-based commands rather than requiring direct coding expertise.

SmolAgents is a Python framework designed to facilitate the creation and orchestration of autonomous agents. It provides a modular architecture allowing for the definition of agent roles, tool usage, and memory management. Agents within the SmolAgents framework operate by receiving tasks, selecting appropriate tools, executing those tools, and observing the results to inform subsequent actions. This framework supports both sequential and parallel agent execution, enabling complex workflows to be broken down into manageable, independent steps. Furthermore, SmolAgents incorporates mechanisms for agent communication and coordination, allowing multiple agents to collaborate on a single task or share information across different tasks, thereby increasing overall system efficiency and adaptability.

Nova leverages a Python Interpreter to directly execute the Python code generated by its Large Language Model. This execution capability is fundamental to the system’s functionality, allowing for programmatic data processing of histopathology images and associated data. Specifically, the interpreter handles tasks such as image loading, manipulation, feature extraction, quantitative analysis, and the creation of visualizations – including charts and graphs – to present findings. The interpreter’s role extends beyond simple execution; it also provides a runtime environment for accessing and utilizing the custom tools integrated within Nova’s agentic workflow, facilitating a closed-loop system of query, code generation, execution, and result interpretation.
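The closed loop of query, code generation, execution, and result interpretation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nova's implementation: `fake_llm` is a hypothetical stand-in for a real model call, `execute` and `run_agent` are names invented here, and the bare `exec` call is unsandboxed, which a real system would not accept.

```python
import contextlib
import io
import traceback


def execute(code: str, namespace: dict) -> str:
    """Run generated code and return captured stdout, plus a traceback on error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOTE: unsandboxed; for illustration only
    except Exception:
        return buffer.getvalue() + traceback.format_exc(limit=1)
    return buffer.getvalue()


def fake_llm(query: str, history: list) -> str:
    """Hypothetical stand-in for an LLM: returns code given prior observations."""
    if not history:
        # The first attempt contains a bug (undefined name) to exercise the loop.
        return "print(sum(areas) / len(area))"
    if "NameError" in history[-1][1]:
        return "print('FINAL', sum(areas) / len(areas))"
    return "print('FINAL', None)"


def run_agent(query: str, namespace: dict, max_steps: int = 5):
    """Generate code, execute it, and feed the observation back until done."""
    history = []  # list of (code, observation) pairs
    for _ in range(max_steps):
        code = fake_llm(query, history)
        observation = execute(code, namespace)
        history.append((code, observation))
        if observation.startswith("FINAL"):
            return observation.strip(), history
    return None, history


answer, steps = run_agent("mean nucleus area?", {"areas": [40.0, 50.0, 60.0]})
print(answer)      # FINAL 50.0
print(len(steps))  # 2
```

The first generated snippet fails, the traceback becomes the observation, and the second snippet corrects the error. That is the sense in which each execution result informs the next iteration.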

Nova incorporates Custom Tools, which are specialized Python functions designed to address specific challenges within histopathology. These tools extend the core functionality of the agentic workflow by providing pre-built capabilities for tasks such as image processing, feature extraction from microscopic slides, quantitative analysis of tissue structures, and the generation of pathology reports. The modular design allows for the easy integration of new tools as needed, and existing tools can be updated without requiring modifications to the core agent framework. This approach enables Nova to perform complex histopathological analyses by orchestrating a sequence of targeted Python functions, effectively automating aspects of the diagnostic process.
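One common way to make tools modular in this sense is a small registry that the agent can list and call by name. The sketch below assumes such a design; the `tool` decorator, the `TOOLS` dictionary, and both example functions are hypothetical and are not taken from Nova or from SmolAgents.

```python
import inspect

TOOLS = {}


def tool(func):
    """Register a function as an agent tool, keyed by its name."""
    TOOLS[func.__name__] = func
    return func


@tool
def tissue_fraction(mask: list) -> float:
    """Return the fraction of pixels marked as tissue (1) in a binary mask."""
    pixels = [p for row in mask for p in row]
    return sum(pixels) / len(pixels)


@tool
def count_above(values: list, threshold: float) -> int:
    """Count how many values exceed a threshold."""
    return sum(1 for v in values if v > threshold)


def describe_tools() -> str:
    """Build the tool listing that would be placed in the LLM prompt."""
    lines = []
    for name, func in sorted(TOOLS.items()):
        lines.append(f"{name}{inspect.signature(func)}: {func.__doc__}")
    return "\n".join(lines)


print(describe_tools())
mask = [[1, 1, 0, 0], [1, 0, 0, 0]]
print(TOOLS["tissue_fraction"](mask))  # 0.375
```

Adding a new capability then means writing one decorated function; the agent loop and the prompt-building code stay unchanged.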

Nova generated a detailed analysis report identifying morphological features linked to the PAM50 subtypes of breast cancer.

Validating Intelligence: The SlideQuest Benchmark

SlideQuest is a 90-question benchmark intended for the evaluation of computational agents operating within the field of pathology. The benchmark differentiates itself through its requirement of multi-step reasoning; questions are not answerable with single data lookups but necessitate the integration of information across multiple sources and analytical steps. Furthermore, SlideQuest assesses performance at the dataset level, meaning agents must demonstrate consistent accuracy and reliability when applied to entire datasets rather than isolated instances. This holistic approach aims to provide a more comprehensive and realistic evaluation of an agent’s capabilities in complex pathological analysis.

The Nova agentic framework was evaluated on the SlideQuest benchmark, a 90-question assessment of computational pathology agents, and demonstrated competency across multiple task types within the benchmark. Specifically, Nova successfully addressed challenges categorized as DataQA, which assesses dataset-level analysis; PatchQA, evaluating performance on image patches; and SlideQA, focusing on whole-slide image understanding. This ability to handle diverse tasks highlights Nova’s versatility and its capacity to integrate and apply reasoning across different levels of granularity within histopathology data.

CellularQA tasks within the SlideQuest benchmark utilize publicly available datasets, specifically MoNuSeg and PanopTILs, to assess a computational agent’s capacity for cell-level image analysis. MoNuSeg provides images of annotated cells for tasks like cell segmentation and counting, while PanopTILs focuses on identifying and quantifying tumor-infiltrating lymphocytes within whole slide images of cancer tissues. Performance on these datasets measures the agent’s ability to accurately identify, classify, and analyze cellular structures, offering a quantitative evaluation of its proficiency in digital pathology tasks requiring fine-grained image understanding.

The integration of the Nova agentic framework with the SlideQuest benchmark resulted in an average score of 0.477 across a suite of complex histopathology tasks. This performance indicates the effectiveness of augmenting Large Language Models (LLMs) with external tools and agentic capabilities for reasoning over visual data. Specifically, SlideQuest evaluates computational agents on tasks requiring multi-step reasoning and dataset-level analysis within the field of pathology. The achieved score demonstrates Nova’s capacity to address these challenges, surpassing baseline LLM performance both with and without a Python interpreter; the Nova framework outperformed an LLM with Python interpreter and retries (0.269) and a standalone LLM (0.0).

The Nova agentic framework showed a clear performance improvement on the SlideQuest benchmark compared to baseline models. Nova achieved an average score of 0.477 on complex histopathology tasks, while a Large Language Model (LLM) utilizing a Python interpreter and retry mechanisms scored 0.269. A standalone LLM, without tool augmentation or retries, achieved a score of 0, indicating a complete inability to address the benchmark’s challenges. These results highlight the benefits of Nova’s agentic approach and tool integration for complex reasoning tasks in computational pathology.

Within the SlideQuest benchmark, Nova demonstrated significant performance variation across different task categories, achieving a DataQA score of 0.777, which represents the highest score attained across all categories evaluated. Conversely, Nova’s performance on CellularQA tasks resulted in a score of 0.323, representing the lowest score achieved within the benchmark. These results indicate a relative strength in data-level question answering compared to cell-level analysis capabilities, suggesting potential areas for focused improvement within the Nova agentic framework.
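To put the reported numbers side by side, the short calculation below derives the absolute and relative gain over the strongest baseline and the gap between Nova's best and worst categories. It uses only the scores quoted above; the variable names are chosen here for illustration.

```python
# Scores as reported in the text above; nothing here is newly measured.
overall = {
    "Nova": 0.477,
    "LLM + Python interpreter + retries": 0.269,
    "LLM only": 0.0,
}
nova_by_category = {"DataQA": 0.777, "CellularQA": 0.323}

baseline = overall["LLM + Python interpreter + retries"]
absolute_gain = overall["Nova"] - baseline
relative_gain = absolute_gain / baseline

category_gap = nova_by_category["DataQA"] - nova_by_category["CellularQA"]

print(f"absolute gain: {absolute_gain:.3f}")       # 0.208
print(f"relative gain: {relative_gain:.1%}")       # 77.3%
print(f"DataQA - CellularQA: {category_gap:.3f}")  # 0.454
```

The gap between categories (0.454) is more than twice the gain over the interpreter baseline (0.208), which underlines how uneven performance is across task types.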

Using GPT-4.1 with a Python interpreter consistently yielded high average scores and low failure rates across diverse SlideQuest benchmark categories, as indicated by standard error analysis over three trials.
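For readers unfamiliar with the statistic, the standard error over a small number of trials is the sample standard deviation divided by the square root of the trial count. The sketch below shows the calculation; the three per-trial scores are hypothetical placeholders, not values from the paper.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for repeated trial scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# Hypothetical per-trial scores for one category; NOT values from the paper.
trial_scores = [0.40, 0.50, 0.60]
mean, sem = mean_and_standard_error(trial_scores)
print(f"{mean:.3f} ± {sem:.3f}")  # 0.500 ± 0.058
```

With only three trials the standard error is a rough estimate, so small differences between configurations should be read with caution.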

Beyond Automation: Towards a Dynamic Pathology

Nova signals a fundamental departure from conventional pathology image analysis, moving beyond pre-programmed pipelines to embrace an agentic system driven by artificial intelligence. Traditional methods rely on meticulously crafted algorithms for specific tasks, limiting adaptability and requiring substantial manual intervention for each new research question. In contrast, Nova employs large language models to interpret instructions and autonomously execute code, effectively creating a self-directed analytical agent. This paradigm shift fosters a dynamic workflow where the system can independently formulate hypotheses, explore data, and generate insights, a capability poised to dramatically accelerate discovery in pathology and related biomedical fields by unlocking the potential of complex, multi-faceted investigations.

Traditional pathology image analysis often relies on rigid pipelines, meticulously crafted for specific questions and proving cumbersome to repurpose. Nova, however, facilitates remarkably flexible workflows by moving beyond pre-defined steps. This system can dynamically adjust its analytical approach, enabling researchers to explore diverse research questions without extensive re-tooling. By decoupling the analytical process from fixed algorithms, Nova allows for iterative experimentation and adaptation to new data or hypotheses; a researcher might initially investigate tumor microenvironment characteristics, then seamlessly shift focus to identifying rare cell types, all within the same framework. This adaptability not only accelerates the pace of discovery but also empowers pathologists and researchers to pursue more nuanced and comprehensive investigations into complex diseases.

The integration of Large Language Models (LLMs) and automated code execution within pathology signifies a transformative leap beyond conventional image analysis. This innovative approach empowers researchers to not only ask complex questions of digital pathology data, but to dynamically generate and execute the necessary analytical pipelines to find answers. Rather than being limited by pre-defined algorithms, the system can interpret natural language queries – such as “identify all instances of tumor-infiltrating lymphocytes within these images and quantify their spatial relationship to cancer cells” – and autonomously construct bespoke image processing workflows, leveraging existing libraries or even writing new code as needed. This capability dramatically accelerates the pace of discovery, allowing for rapid hypothesis testing and the exploration of nuanced biological phenomena previously inaccessible due to computational limitations, with potential applications extending far beyond the field of pathology itself.
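As an illustration of the kind of code such a query might produce, the sketch below quantifies the spatial relationship between lymphocytes and tumor cells using nearest-neighbor distances between centroids. It assumes cell centroids are already available from an upstream detection step; the function names, the coordinates, and the 10 µm radius are all hypothetical choices made for this example.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from one centroid to the closest centroid in `others`."""
    return min(math.dist(point, other) for other in others)


def til_proximity(lymphocytes, tumor_cells, radius):
    """Summarize how close lymphocytes sit to tumor cells.

    Returns the mean nearest-tumor distance and the fraction of lymphocytes
    lying within `radius` of at least one tumor cell.
    """
    distances = [nearest_distance(cell, tumor_cells) for cell in lymphocytes]
    mean_distance = sum(distances) / len(distances)
    fraction_within = sum(d <= radius for d in distances) / len(distances)
    return mean_distance, fraction_within


# Hypothetical centroids in micrometres; a real pipeline would obtain these
# from a cell detection or segmentation model.
lymphocytes = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0)]
tumor_cells = [(3.0, 4.0), (10.0, 5.0)]

mean_distance, fraction_within = til_proximity(lymphocytes, tumor_cells, radius=10.0)
print(f"mean nearest-tumor distance: {mean_distance:.1f} um")
print(f"fraction within 10 um: {fraction_within:.2f}")
```

The brute-force search is quadratic in the number of cells; on whole-slide data with many thousands of detections, a spatial index such as a k-d tree would be the more practical choice.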

The increasing volume and complexity of digital pathology data necessitate solutions that move beyond static analysis pipelines. This framework addresses this challenge through inherent adaptability; it’s designed not as a fixed tool, but as a flexible system capable of integrating new datasets and tackling previously unsupported tasks with relative ease. By decoupling analytical components and leveraging automated code execution, the system can be reconfigured and extended without extensive manual intervention. This scalability is crucial for accommodating the expanding scope of digital pathology research, from rare disease identification to large-scale biomarker discovery, and ultimately promises to accelerate the translation of research findings into clinical practice.

Nova generated a final analysis report detailing the morphological features linked to molecular PAM50 breast cancer subtypes.

The pursuit of automating histopathology analysis, as demonstrated by Nova, isn’t simply about achieving accuracy; it’s about systematically challenging the boundaries of what’s computationally possible. The framework deliberately introduces an agentic approach, essentially asking, ‘what happens if we allow the AI to actively explore and formulate its own analytical path?’ This echoes Claude Shannon’s insight: “The most important thing in communication is to reduce uncertainty.” Nova seeks to reduce the uncertainty inherent in complex slide analysis not through pre-defined rules alone, but by enabling a dynamic, exploratory process – a calculated dismantling of traditional analytical constraints to reveal deeper insights within the data. The SlideQuest benchmark further embodies this spirit, rigorously testing the limits of these systems and forcing a reevaluation of existing methods.

What Lies Beyond?

The introduction of Nova, and of benchmarks like SlideQuest, represents a predictable escalation. The field of computational pathology has, until recently, focused on narrowly defined tasks – pattern recognition, mostly. Now, the ambition expands to understanding (or, more accurately, simulating understanding) within the complexity of whole-slide images. This shift invites a necessary reckoning. The current reliance on large language models, while yielding promising results, is, at best, a sophisticated form of mimicry. True intelligence isn’t about generating plausible narratives; it’s about identifying the fundamental principles governing a system, and those principles are rarely articulated in natural language.

The limitations are not merely technical. The quest for “agentic” frameworks implicitly assumes a singular, optimal solution to histopathological analysis. Yet, pathology, like most biological systems, is inherently ambiguous and context-dependent. Multiple interpretations, each valid within a certain framework, are the norm. The challenge, then, isn’t building an agent that finds the answer, but one that transparently articulates the assumptions and biases embedded within its reasoning process. Security isn’t achieved through a ‘black box’ delivering confident pronouncements; it requires complete visibility into the inferential steps.

Future work should prioritize not scale, but dissection. A rigorous exploration of failure modes, identifying precisely where and why these systems err, will prove more valuable than incremental improvements in accuracy. The goal isn’t to automate the pathologist, but to build a tool that exposes the inherent uncertainties within the diagnostic process, forcing a more honest and nuanced appraisal of the underlying biology.

Original article: https://arxiv.org/pdf/2511.11324.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
