Author: Denis Avetisyan
A new benchmark reveals the challenges AI agents face when analyzing complex, real-world biological data from spatial transcriptomics.

SpatialBench assesses the performance of AI models and data workflows in interpreting spatially resolved gene expression data.
Despite rapid advances in artificial intelligence, reliably extracting biological insight from complex spatial transcriptomics data remains a significant challenge. To address this, we introduce SpatialBench (‘Can Agents Analyze Real-World Spatial Biology Data?’), a benchmark comprising 146 verifiable problems derived from practical spatial analysis workflows. Our findings reveal substantial limitations in current AI agent performance, with accuracy ranging from 20% to 38%, and demonstrate a strong dependence on both the chosen model and the ‘harness’ used to control its execution. Can a focus on optimizing these agent harnesses, alongside model development, unlock the full potential of AI for spatial biology?
Deciphering Spatial Complexity: A New Era in Transcriptomics
Spatial transcriptomics represents a paradigm shift in biological research, moving beyond the averaging of signals from bulk tissue samples to reveal gene expression patterns within the anatomical context of a tissue. The technology identifies not simply which genes are expressed, but crucially, where they are expressed in relation to cells, structures, and even disease microenvironments. Consequently, datasets generated by spatial transcriptomics are inherently high-dimensional and complex, capturing not only gene activity but also precise positional information for tens of thousands of genes across potentially millions of cells. This wealth of data is fundamentally reshaping our understanding of tissue organization, developmental processes, and the pathogenesis of diseases like cancer, but it also presents significant analytical hurdles as researchers strive to extract meaningful biological insights from this unprecedented level of detail.
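In practice, such a dataset is typically represented as an expression matrix paired with per-cell tissue coordinates. A minimal sketch using the AnnData convention shared by tools like Scanpy and Squidpy – where coordinates live in `obsm["spatial"]` – is shown below; the array sizes are illustrative, not drawn from the paper.

```python
import numpy as np
import anndata as ad

# Illustrative sizes only: real datasets can reach tens of thousands of
# genes measured across millions of cells.
n_cells, n_genes = 5_000, 2_000

# Cells x genes count matrix plus 2D tissue coordinates per cell.
adata = ad.AnnData(X=np.random.poisson(0.3, size=(n_cells, n_genes)).astype(np.float32))
adata.obsm["spatial"] = np.random.uniform(0, 6_500, size=(n_cells, 2))  # x/y in microns

# Every expression profile is now tied to an anatomical position, which is
# what distinguishes spatial transcriptomics from bulk or dissociated assays.
print(adata)
```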
The advent of spatial transcriptomics has unleashed an unprecedented volume of biological data, detailing gene expression with precise anatomical context; however, conventional analytical methods, designed for bulk RNA sequencing, are increasingly strained by this complexity. These pipelines often struggle to effectively integrate the spatial information with transcriptomic profiles, leading to bottlenecks in processing and interpretation. The sheer scale of data – encompassing millions of gene expression measurements across numerous spatial locations – overwhelms existing computational resources and statistical approaches, obscuring subtle yet critical patterns. Consequently, researchers face challenges in identifying spatially-defined cell types, understanding tissue microenvironments, and ultimately, translating these insights into a deeper understanding of development and disease.
The transformative potential of spatial transcriptomics – mapping gene expression within the precise architecture of tissues – is currently bottlenecked by analytical limitations. While the technology generates exceptionally detailed data, extracting meaningful biological insights demands automated and robust computational methods. Current state-of-the-art models, despite advancements in machine learning, struggle to accurately interpret these complex datasets, achieving only 20 to 38 percent accuracy when applied to real-world samples. This substantial margin of error highlights a critical need for improved algorithms and analytical pipelines capable of handling the scale and intricacies inherent in spatial transcriptomic data, ultimately paving the way for a more complete understanding of tissue organization and disease mechanisms.

Establishing a Standard: Introducing SpatialBench for Rigorous Evaluation
SpatialBench is a benchmark suite constructed to rigorously evaluate workflows used in spatial transcriptomics data analysis. It comprises 146 distinct problems, each formulated to be verifiable, allowing for objective assessment of analytical performance. These problems cover a range of common tasks within spatial transcriptomics, including data processing, normalization, spatial domain identification, and cell type deconvolution. The suite is designed to provide a standardized and reproducible method for comparing different analytical approaches and identifying areas for improvement in the field.
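The paper does not publish the benchmark’s internal schema, but a ‘verifiable problem’ of this kind can be pictured as a task definition paired with a programmatic checker. The sketch below is purely illustrative – every field name and the example task are assumptions, not SpatialBench’s actual format.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SpatialBenchProblem:
    """Hypothetical shape of one verifiable benchmark problem."""
    task_id: str
    category: str                   # e.g. "normalization", "spatial domain identification"
    prompt: str                     # natural-language instruction given to the agent
    dataset_path: str               # input spatial transcriptomics dataset
    verify: Callable[[Any], bool]   # objective check against a known answer

# Example: a cell type deconvolution task whose answer can be checked exactly.
problem = SpatialBenchProblem(
    task_id="deconv-001",
    category="cell type deconvolution",
    prompt="Estimate per-spot cell-type proportions and report the dominant type per spot.",
    dataset_path="data/visium_sample.h5ad",               # placeholder path
    verify=lambda answer: answer.get("spot_17") == "B cell",  # invented ground truth
)
```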
SpatialBench utilizes AI Agents to automate the execution of spatial transcriptomics analysis pipelines, addressing the challenges of manual operation and ensuring consistent, verifiable results. These agents are designed to perform specific tasks within the workflow – such as data loading, quality control, analysis, and visualization – without direct human intervention. This automation is crucial for reproducibility, as it eliminates variability introduced by differing user implementations and parameter settings. The agents operate within a defined computational environment, further standardizing the execution process and allowing for consistent benchmarking of different analytical methods. Each agent’s actions are logged, providing a complete audit trail of the analysis and facilitating validation of the results.
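The text describes autonomous, fully logged execution without specifying the implementation; a minimal sketch of such an agent loop, with each action appended to an audit trail, might look like the following (the `agent` and `action` interfaces are hypothetical).

```python
import json
import time

def run_agent(agent, problem, max_steps=25, log_path="audit_trail.jsonl"):
    """Drive a hypothetical agent through one benchmark problem, logging
    every action so the run can be audited and reproduced later."""
    state = {"prompt": problem.prompt, "observations": []}
    with open(log_path, "a") as log:
        for step in range(max_steps):
            action = agent.next_action(state)   # e.g. load data, run QC, plot
            result = action.execute()           # runs inside the controlled environment
            log.write(json.dumps({"step": step,
                                  "action": action.name,
                                  "timestamp": time.time()}) + "\n")
            state["observations"].append(result)
            if action.name == "submit_answer":
                return result                   # checked later by problem.verify
    return None                                 # agent exhausted its step budget
```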
The architecture of SpatialBench necessitates a carefully constructed ‘Harness Design’ to manage the execution of AI Agents and maintain a controlled analytical environment. This harness is responsible for task orchestration, data handling, and environment consistency across all benchmark problems. Evaluations within SpatialBench demonstrate that variations in harness design – including agent prompting strategies, error handling, and resource allocation – can yield performance differences comparable to, and in some cases exceeding, those observed when altering the underlying analytical model itself. This highlights the critical importance of a well-defined harness not simply as a facilitator, but as a substantial determinant of overall workflow efficacy and result reproducibility.
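Beyond naming prompting, error handling, and resource allocation, the text does not enumerate the harness’s parameters; one way to picture the separation between harness and model is a configuration object like the hypothetical one below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical harness parameters, held fixed or varied independently
    of the underlying model being benchmarked."""
    system_prompt: str = "You are a spatial transcriptomics analyst."  # prompting strategy
    max_retries_on_error: int = 2      # error-handling policy
    step_timeout_s: float = 300.0      # resource allocation per tool call
    max_steps: int = 25                # overall step budget
    tools: tuple = ("load_data", "run_qc", "cluster", "plot", "submit_answer")

# SpatialBench's finding, restated: varying this object can shift accuracy
# as much as swapping out the model it wraps.
```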

Defining the Queries: Constructing Verifiable Biological Problems
Problem construction within SpatialBench necessitates a balance between biological plausibility and the feasibility of computational analysis. Biological realism is achieved by grounding problem parameters – such as cell type proportions, gene expression levels, and spatial arrangements – in established experimental data or validated biological models. However, purely realistic simulations can be computationally prohibitive or analytically intractable. Therefore, problem design often involves controlled simplification and parameterization, focusing on key biological processes while abstracting away less critical details. This ensures that generated datasets are both representative of biological systems and amenable to efficient analysis using SpatialBench’s benchmarking tools, enabling quantifiable performance evaluation of spatial transcriptomics methods.
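As a concrete illustration of ‘controlled simplification’, one might parameterize a toy tissue with a few cell types at fixed proportions and one marker gene each, so the ground truth stays exactly known. All parameters below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters: three cell types with fixed proportions and markers.
cell_types = ["tumor", "T cell", "fibroblast"]
proportions = [0.5, 0.3, 0.2]
marker_boost = 5.0   # fold-increase of each type's marker gene over baseline

n_cells, n_genes = 3_000, 100
labels = rng.choice(len(cell_types), size=n_cells, p=proportions)

# Baseline Poisson counts, then boost gene t in cells of type t.
counts = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
for t in range(len(cell_types)):
    counts[labels == t, t] *= marker_boost

# The true labels are known by construction, so any cell-typing answer an
# agent produces can be verified objectively against them.
```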
The generation of verifiable biological queries within SpatialBench relies on a tiered methodological approach to establish ground truth expectations. Specifically, each problem instantiation requires the application of ‘Cell Typing’ to categorize cells based on defined markers, followed by ‘Differential Expression’ analysis to identify statistically significant gene expression changes between these cell types or conditions. Finally, ‘Spatial Analysis’ methods are employed to correlate these molecular signatures with their anatomical locations, thereby defining the expected spatial distribution of cells and molecules. This integrated workflow ensures that the problem’s expected results are not arbitrary, but are derived from established biological principles and analytical procedures.
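The paper does not prescribe particular libraries for this tiered workflow, but its three stages map naturally onto the widely used Scanpy/Squidpy stack; a minimal sketch (with a placeholder file path) follows.

```python
import scanpy as sc
import squidpy as sq

adata = sc.read_h5ad("data/sample.h5ad")  # placeholder path

# 1) Cell typing: normalize, embed, and cluster cells (Leiden on a k-NN graph).
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, key_added="cell_type")

# 2) Differential expression: marker genes that separate the clusters.
sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")

# 3) Spatial analysis: do cell types co-locate in the tissue? Builds a
#    spatial graph from coordinates stored in adata.obsm["spatial"].
sq.gr.spatial_neighbors(adata)
sq.gr.nhood_enrichment(adata, cluster_key="cell_type")
```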
Prior to problem instantiation within SpatialBench, rigorous quality control (QC) and dimensionality reduction are essential preprocessing steps. QC procedures, including filtering of low-quality cells based on metrics like library size, number of detected genes, and mitochondrial content, mitigate the impact of technical noise and ensure data reliability. Dimensionality reduction techniques, such as Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP), reduce computational burden and facilitate meaningful downstream analysis by removing redundant or irrelevant features while preserving biologically significant variance. These steps are critical for both maintaining data integrity and enabling efficient computation, particularly when analyzing large spatial transcriptomic datasets.
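These QC and dimensionality reduction steps correspond directly to standard Scanpy calls; in the sketch below the filtering thresholds are illustrative, since real cutoffs are dataset- and platform-specific.

```python
import scanpy as sc

adata = sc.read_h5ad("data/sample.h5ad")  # placeholder path

# Flag mitochondrial genes and compute per-cell QC metrics
# (library size, detected genes, mitochondrial fraction).
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Filter low-quality cells; thresholds here are illustrative only.
keep = (
    (adata.obs["total_counts"] > 500)
    & (adata.obs["n_genes_by_counts"] > 200)
    & (adata.obs["pct_counts_mt"] < 20)
)
adata = adata[keep].copy()

# Dimensionality reduction: PCA, then UMAP on the neighbor graph.
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```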

Unveiling Platform Capabilities and Charting Future Directions
SpatialBench represents a significant advancement in the field of spatial transcriptomics by offering a unified framework for rigorously comparing diverse platforms – including Xenium, Visium, MERFISH, Seeker, and AtlasXomics – against standardized analytical tasks. This comparative capability is crucial, as each platform possesses unique strengths and weaknesses in terms of resolution, throughput, and cost. By benchmarking these technologies on a common set of challenges, SpatialBench facilitates informed decision-making for researchers selecting the optimal tool for their specific biological questions. The system doesn’t just report if a platform works, but quantifies how well it performs, providing a granular understanding of the trade-offs inherent in each approach and driving innovation through direct, measurable comparisons.
Evaluations across several leading AI models reveal a significant opportunity to enhance analytical accuracy in spatial transcriptomics data processing. Current performance, as measured by the SpatialBench platform, indicates a mean accuracy of 38.4% for Opus-4.5, decreasing to 34.0% with GPT-5.2 and 28.3% for Sonnet-4.5. These results, while representing a functional baseline, highlight the substantial potential for refinement and optimization of AI-driven workflows in this emerging field; further development promises to unlock more precise and reliable interpretations of complex spatial gene expression patterns.
Evaluating the analytical efficiency of AI Agents within the SpatialBench framework hinges on quantifiable metrics like ‘Step Count’ and ‘Latency’. Current models demonstrate varying computational demands – Claude and GPT models complete tasks in 2-3 steps, while Grok variants require approximately 9.8-9.9 steps. These measurements reveal the processing complexity inherent in each agent’s approach. Notably, implementing the Latch harness resulted in a substantial 23.3 percentage point improvement in accuracy when compared to a standard Opus-4.5 configuration, highlighting the potential for optimized infrastructure to significantly enhance performance and analytical throughput in spatial transcriptomics data analysis.
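Given per-step audit logs like the one sketched earlier, step count and latency reduce to simple aggregations over the log records; the JSONL format assumed below is illustrative, not the benchmark’s.

```python
import json

def summarize_run(log_path: str) -> dict:
    """Compute step count and end-to-end latency from a JSONL audit log
    whose records carry 'step' and 'timestamp' fields (assumed format)."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f]
    if not records:
        return {"step_count": 0, "latency_s": 0.0}
    timestamps = [r["timestamp"] for r in records]
    return {
        "step_count": len(records),
        "latency_s": max(timestamps) - min(timestamps),
    }

# Benchmark accuracy is then the fraction of runs whose submitted answer
# passes the corresponding problem's verification check.
```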

The introduction of SpatialBench underscores a critical principle in system design: structure dictates behavior. The benchmark isn’t merely evaluating whether models can analyze spatial transcriptomics data, but how they are controlled and executed via the ‘harness’. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” This rings true for SpatialBench; the interplay between model and harness reveals that even sophisticated analytical tools are limited by the frameworks within which they operate. A fragile harness undermines the potential of the model, echoing the sentiment that simplicity and clarity are paramount to a robust, enduring system.
Where Do We Go From Here?
The exercise, as often happens, reveals more about the tools than the territory. SpatialBench demonstrates, with characteristic understatement, that simply possessing a powerful model does not equate to meaningful analysis of spatial biology data. A skilled artisan can coax beauty from coarse materials, but even the finest chisel is useless without a hand to guide it. The ‘harness’ – that seemingly mundane architecture governing model access and execution – proves pivotal. If the system looks clever, it’s probably fragile.
Future work will undoubtedly focus on refining both models and harnesses. However, a more fundamental challenge remains: defining ‘understanding’ in this context. Current benchmarks largely assess performance on discrete tasks, yet biological systems rarely present neatly packaged problems. A truly robust system must not only answer questions, but also know which questions to ask, and acknowledge when its answers are, at best, provisional.
The architecture, ultimately, is the art of choosing what to sacrifice. Complete generality is a phantom. Progress will likely involve embracing specialized systems, tailored to specific biological questions and data types. It is a humbling thought – that the path forward lies not in building ever-more-complex universal solvers, but in acknowledging the inherent limitations of any single approach.
Original article: https://arxiv.org/pdf/2512.21907.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/