Author: Denis Avetisyan
A new study systematically maps the geometric landscape within single-cell genomic foundation models, revealing that apparent biological structure often lacks robustness.

Researchers assessed 141 hypotheses to determine whether the embedded geometric structure of these models reflects true biological signal or spurious correlations.
Despite the increasing complexity of biological foundation models, the nature of the geometric and topological structures they learn remains largely unexplored. In ‘What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses’, we systematically investigate the internal representations of these models using an autonomous hypothesis-screening approach, revealing that while they do capture genuine geometric structure, this signal is often fragile, domain-dependent, and concentrated in specific tissues. Our analysis, spanning persistent homology, manifold distances, and cross-model alignment, demonstrates shared structure between models like scGPT and Geneformer, yet highlights challenges in recovering precise gene-level correspondences. Can these findings inform the development of more robust and interpretable foundation models for single-cell genomics, and ultimately, a deeper understanding of biological systems?
Unveiling Biological Knowledge Encoded in Foundation Models
The rapid evolution of single-cell genomics technologies has resulted in an unprecedented accumulation of biological data, far exceeding the capacity of traditional analytical methods. These experiments, which measure gene expression and other characteristics in individual cells, generate datasets containing information on tens of thousands, even millions, of cells. This data deluge necessitates the development of sophisticated computational tools capable of not only processing this immense volume of information but also extracting meaningful biological insights. Existing methods often struggle with the complexity and high dimensionality inherent in single-cell data, prompting a search for more powerful approaches: specifically, those that can effectively represent and interpret the intricate relationships between genes, cells, and biological processes. The sheer scale of these datasets, coupled with the need for robust and scalable analysis, has positioned machine learning, and particularly foundation models, as a promising avenue for tackling this analytical challenge.
Foundation models, increasingly utilized in single-cell genomics, represent a significant leap in capturing the intricate relationships within biological data, yet their operational logic often remains a “black box”. These models, trained on vast datasets of cellular characteristics, demonstrate an ability to predict gene expression, cell types, and even responses to stimuli – exceeding the performance of traditional methods. However, precisely how these models internally represent and utilize biological knowledge is poorly understood. While they excel at pattern recognition and prediction, the encoded information isn’t easily interpretable, hindering efforts to validate their conclusions or glean novel biological insights. This opacity poses a challenge; researchers can observe the model’s output, but discerning the underlying reasoning – the specific features and interactions driving its decisions – requires innovative approaches to probe and visualize the model’s internal states and representations.
The true value of foundation models in biology hinges not merely on their predictive power, but on deciphering how they arrive at those predictions. Without understanding the internal representation of biological knowledge – the features, relationships, and patterns the model prioritizes – validating their outputs becomes problematic. A model might accurately predict a cellular response, yet do so based on spurious correlations rather than genuine biological mechanisms. Investigating the model’s “reasoning” – through techniques like feature attribution and representational similarity analysis – is therefore essential. This level of interpretability fosters trust in the model’s findings, allows for the refinement of biological hypotheses, and ultimately unlocks the full potential of these tools to drive discovery and accelerate advancements in fields like drug development and personalized medicine.

Mapping Gene Function Through Geometric Embeddings
Gene embeddings were generated utilizing scGPT, a pre-trained foundation model designed for single-cell genomic data analysis. This process involves inputting gene expression data – typically a matrix representing the expression levels of genes across individual cells – into scGPT. The model then transforms this data into a lower-dimensional vector representation for each gene, capturing its expression patterns in relation to other genes within the dataset. These resulting vectors, or gene embeddings, serve as a numerical representation of each gene’s functional role and relationships, enabling quantitative analysis of genomic data and facilitating downstream tasks such as gene network inference and cell type identification.
Analysis of the residual stream geometry within scGPT demonstrates that gene embeddings are not randomly distributed, but exhibit a structured organization. Specifically, the residual streams – the outputs of intermediate layers within the neural network – were analyzed to reveal that genes with related biological functions tend to cluster in proximity within the embedding space. This spatial relationship suggests the existence of interpretable axes of biological meaning, where directions within the high-dimensional embedding space correspond to variations in specific biological processes or pathways. Quantification of these relationships, through methods like geodesic distance, allows for the identification of genes involved in similar biological functions and potentially reveals underlying regulatory relationships.
To quantify relationships between genes within the scGPT-generated embedding space, we calculated geodesic distance and triangle defect. Geodesic distance represents the shortest path between two gene embeddings on the manifold defined by the residual stream, providing a measure of dissimilarity beyond Euclidean distance. Triangle defect, calculated as the sum of angle deficits at the vertices of triangles formed by gene embeddings, indicates the curvature of the embedding space and can highlight genes that are clustered or exhibit non-Euclidean relationships. Higher defect values suggest a greater degree of curvature and potentially indicate functional associations or regulatory interactions. These metrics were computed using established algorithms in topological data analysis to provide a rigorous, quantifiable assessment of gene relationships.
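To make the geodesic measure concrete, here is a minimal sketch: build a k-nearest-neighbour graph over a toy point cloud and compute all-pairs shortest paths with Floyd-Warshall. On curved data, the graph geodesic between distant points exceeds the straight-line Euclidean distance, and that gap is the extra manifold signal such metrics capture. The data, the choice of `k`, and the function names are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.full((n, n), np.inf)
    np.fill_diagonal(W, 0.0)
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:  # position 0 is the point itself
            W[i, j] = W[j, i] = d[i, j]
    return W, d

def geodesic(W):
    """All-pairs shortest paths (Floyd-Warshall): graph geodesic distances."""
    G = W.copy()
    for m in range(len(G)):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# Toy "gene embeddings" lying on a curved arc: the geodesic between the two
# endpoints follows the arc, so it is longer than the Euclidean chord.
t = np.linspace(0, np.pi, 20)
X = np.c_[np.cos(t), np.sin(t)]
W, d = knn_graph(X, k=2)
G = geodesic(W)
```

With a small `k` the path is forced along the arc, so `G[0, -1]` (roughly the arc length) exceeds `d[0, -1]` (the chord); a regulatory pair that is close geodesically but far in Euclidean terms is exactly the case where manifold-aware distances can help.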
Persistent homology was employed on the scGPT-generated gene embedding space to detect topological features beyond simple connectivity. This technique identifies cycles, voids, and higher-dimensional holes within the data, quantifying their persistence – the range of scales over which they remain present. Specifically, Betti numbers, [latex] \beta_i [/latex], were calculated to enumerate the number of [latex] i [/latex]-dimensional holes, with [latex] \beta_0 [/latex] representing connected components, [latex] \beta_1 [/latex] representing loops, and higher values indicating more complex voids. These persistent topological features are hypothesized to correspond to functional modules, such as co-expressed gene sets, or regulatory circuits where genes interact in a cyclic manner, offering a novel means of inferring biological relationships directly from the learned embedding space.
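For intuition, the [latex] \beta_0 [/latex] part of such a computation can be sketched without any dependencies: sweep edges in increasing distance order with union-find, recording the scale at which each connected component dies. Real analyses would use a persistent-homology library such as Ripser or GUDHI, which also recover loops and voids; the two-cluster data below is synthetic.

```python
import numpy as np

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[ri] = rj
        return True

def betti0_deaths(X):
    """Death scales of connected components in the Vietoris-Rips filtration:
    every point is born at scale 0; a component dies when an edge first
    merges it into another (a Kruskal-style union-find sweep)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    uf, deaths = UnionFind(n), []
    for w, i, j in edges:
        if uf.union(i, j):
            deaths.append(w)  # one component dies at scale w
    return deaths  # n - 1 deaths; the final component persists at all scales

# Two well-separated clusters: 18 early deaths (within-cluster merges) and one
# late, highly persistent death when the clusters finally connect.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
deaths = betti0_deaths(X)
```

The single long-lived component death is the topological signature of two clusters; in the embedding-space analogy, a highly persistent feature is a candidate functional module rather than sampling noise.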
![Geodesic distances consistently improve regulatory edge discrimination compared to Euclidean distances, particularly within middle transformer layers, as evidenced by a consistent, albeit modest, [latex]\Delta AUROC \approx 0.01[/latex] across both source- and target-disjoint gene pool splits.](https://arxiv.org/html/2602.22289v1/2602.22289v1/fig_distance_hierarchy.png)
Validating Geometric Structure Against Known Regulatory Interactions
The foundational premise of this analysis is that the spatial arrangement of genes within the embedding space is non-random and directly corresponds to established regulatory relationships. This hypothesis posits that genes with strong, known interactions – either activating or repressing – will exhibit closer proximity in the embedding compared to genes with no or weak interactions. Specifically, the geometric distance between gene representations is expected to inversely correlate with the strength and direction of their regulatory link, as defined by curated databases of regulatory interactions. This expected correspondence forms the basis for evaluating the biological relevance and interpretability of the embedding generated, serving as a primary validation metric.
Analysis focused on signed regulatory motifs, defined as combinations of transcription factors and their documented activation or repression effects on target genes. These motifs were used to assess the relationship between predicted regulatory interactions and the geometric proximity of genes within the embedding space. Specifically, the sign of the regulatory effect (activating or repressing) was compared to the direction and magnitude of the distance between gene representations; a positive correlation was expected between activating motifs and shorter distances, and between repressing motifs and greater distances. This approach allowed for a quantitative assessment of whether the embedding space accurately captured not only the presence, but also the nature of regulatory relationships.
The validation of predicted regulatory interactions relied on established resources documenting known relationships between genes. Specifically, the DoRothEA database, providing curated transcription factor (TF)-target interactions with confidence scores, was utilized. Complementary data was sourced from TRRUST, a database of human and mouse transcriptional regulatory interactions, and STRING, which integrates multiple sources to provide protein-protein interaction data, including those mediated by transcriptional regulation. These databases served as the ground truth against which the geometric relationships derived from the gene embedding were compared, enabling quantitative assessment of the hypothesis that spatial proximity reflects functional connectivity.
To assess the validity of observed correlations between geometric proximity and regulatory interactions, a suite of null models was implemented to account for potential confounding factors. These included degree-preserving rewiring, which maintains network degree distribution; label permutation, which randomizes the association between genes and their regulatory roles; and feature shuffling, which disrupts the correlation between gene features and embedding coordinates. A stringent max-null audit was then performed, comparing observed correlations to the distribution of correlations generated by these null models. This revealed limited robust signal, with only 15 out of 25 tested regulatory relationships demonstrating statistically significant correlation after correction for multiple hypothesis testing, suggesting that a substantial portion of initially observed associations may be attributable to chance or confounding variables.
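The logic of such a max-null audit can be sketched on synthetic data using the label-permutation null (the rewiring and feature-shuffling controls would slot into the same loop): the observed AUROC must beat the best score achieved by any null replicate before a signal is counted. All names and numbers here are illustrative, not the paper's values.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney) identity; assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

rng = np.random.default_rng(1)
# Toy setup: known regulatory pairs sit systematically closer in the embedding.
dist = np.r_[rng.normal(1.0, 0.3, 200), rng.normal(1.6, 0.3, 200)]
is_edge = np.r_[np.ones(200), np.zeros(200)]
obs = auroc(-dist, is_edge)  # shorter distance -> higher score

# Label-permutation null: break the pair-to-label association, keep geometry.
null = [auroc(-dist, rng.permutation(is_edge)) for _ in range(500)]
max_null_gap = obs - max(null)  # signal counts only if this gap is positive
```

A merely average-null comparison would pass far more hypotheses; requiring a positive gap against the maximum over all null replicates is what makes the audit stringent, and it is the kind of control under which most of the initially observed associations above failed.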
![Varying the radius of the [latex]k[/latex]-nearest neighbor search reveals how the embedding captures regulatory information based on local geometric signal resolution.](https://arxiv.org/html/2602.22289v1/2602.22289v1/codex_h140_scaling_gain.png)
Autonomous Hypothesis Screening for Mechanistic Insights
An automated pipeline was constructed to test hypotheses regarding gene function using OpenAI’s Codex language model. This system operates iteratively, generating potential functional relationships, programmatically executing experiments to evaluate those relationships within a single-cell transcriptomic dataset, and then assessing the results. The pipeline’s autonomy stems from Codex’s ability to translate natural language prompts – representing the proposed hypotheses – into executable code, enabling large-scale, systematic investigation of gene function without manual intervention. This approach facilitates the unbiased screening of numerous hypotheses, exceeding the scale achievable through traditional manual experimentation.
The hypothesis generation process initiates by leveraging the geometric structure within the scGPT embedding space. Single-cell gene expression data is first projected into this space, where genes are represented as points. Hypotheses are then formulated based on the proximity of these gene representations; genes located near each other in the embedding space are posited to have related functions. Specifically, the pipeline examines relationships defined by distances and angles between gene embeddings, allowing for the automated creation of testable predictions regarding gene function and interaction without prior biological knowledge. This approach assumes that functionally similar genes will cluster together in the embedding space, creating a quantifiable basis for hypothesis generation.
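A hedged sketch of this proximity-driven step: for each gene, the k most similar embedding vectors (by cosine similarity, i.e. the angle between them) become candidate functional partners to be tested downstream. The function name `propose_hypotheses`, the gene labels, and the random embeddings are all illustrative assumptions.

```python
import numpy as np

def propose_hypotheses(emb, names, k=3):
    """For each gene, rank its k nearest neighbours in embedding space by
    cosine similarity, yielding (gene, candidate partner, score) triples."""
    E = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = E @ E.T
    np.fill_diagonal(sim, -np.inf)  # a gene is not its own hypothesis
    hyps = []
    for i, g in enumerate(names):
        for j in np.argsort(sim[i])[::-1][:k]:
            hyps.append((g, names[j], float(sim[i, j])))
    return hyps

rng = np.random.default_rng(2)
emb = rng.normal(size=(6, 8))
emb[1] = emb[0] + rng.normal(0, 0.01, 8)  # plant a near-parallel pair: G0, G1
hyps = propose_hypotheses(emb, [f"G{i}" for i in range(6)], k=1)
# The planted pair is recovered as G0's top-ranked candidate partner.
```

The planted near-duplicate illustrates the working assumption in the text: if functionally related genes really do cluster, proximity queries like this surface them; if the geometry is spurious, the same queries surface noise, which is why every proposal still has to survive the null-controlled tests.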
To mitigate the risk of spurious correlations arising from data sharing between model training and evaluation, a disjoint gene-pool split was implemented. This methodology ensured that no genes present in the training set were also included in the testing set. Specifically, genes were randomly assigned to either a training or testing pool, and this assignment was maintained throughout the entire hypothesis screening process. This strict separation prevented the model from leveraging information about genes seen during training to predict the function of genes in the test set, thereby providing a more rigorous assessment of its generalization capability and reducing the potential for inflated performance metrics.
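A minimal sketch of such a split, assuming a simple random assignment (the fraction, seed, and edge list are illustrative): each gene lands in exactly one pool, and any regulatory edge that straddles the pools is dropped rather than allowed to leak information across the boundary.

```python
import numpy as np

def disjoint_gene_split(genes, train_frac=0.8, seed=0):
    """Assign every gene to exactly one pool, so no gene appears on both
    sides of the evaluation."""
    rng = np.random.default_rng(seed)
    genes = np.asarray(genes)
    mask = rng.random(len(genes)) < train_frac
    return set(genes[mask]), set(genes[~mask])

genes = [f"G{i}" for i in range(1000)]
train, test = disjoint_gene_split(genes)

# Regulatory edges are kept only when both endpoints stay in one pool;
# edges straddling the split are discarded rather than leaked.
edges = [(f"G{i}", f"G{i + 1}") for i in range(0, 1000, 50)]
train_edges = [(s, t) for s, t in edges if s in train and t in train]
test_edges = [(s, t) for s, t in edges if s in test and t in test]
```

The stricter source- and target-disjoint variants mentioned in the figure caption above would additionally require an edge's source gene and target gene to come from different designated pools, which this sketch does not attempt.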
Evaluation of the autonomous hypothesis screening pipeline utilized the Area Under the Receiver Operating Characteristic curve (AUROC) as a performance metric. Initial testing across 24 independent rows demonstrated a mean ΔAUROC of 0.079, indicating a measurable improvement in predictive accuracy. However, application of a stringent null-gap criterion, designed to minimize false positive findings, resulted in only 1 out of 6 hypotheses meeting the threshold for statistical significance. A broader systematic screen encompassing 141 hypotheses identified a ΔΔAUROC of 0.094 specifically for signed motif-community interactions when evaluated under strict null controls, suggesting a focused area of reliable performance within the pipeline.
![Motif-community hardening [latex]\Delta\Delta AUROC[/latex] consistently improved performance across domain splits and uniquely achieved complete null-gap coverage, representing the strongest finding among 141 tested hypotheses.](https://arxiv.org/html/2602.22289v1/2602.22289v1/fig_motif_community.png)
Future Directions: Towards Interpretable and Predictive Foundation Models
Geometric analysis is proving to be a powerful lens through which to understand the complex “black box” of foundation models in single-cell genomics. This approach moves beyond simply assessing a model’s predictive accuracy and instead focuses on characterizing the high-dimensional data manifolds these models learn. By treating single-cell data as points in a geometric space, researchers can identify intrinsic structures, such as clusters and trajectories, that correspond to meaningful biological states and transitions. This dissection reveals how the model organizes and represents cellular heterogeneity, enabling a deeper understanding of the underlying biological processes driving its predictions. Ultimately, this geometric perspective allows for the validation of model-derived insights against known biology, and offers a framework for identifying potential biases or limitations within the model itself, thereby increasing trust and interpretability.
The integration of geometric analysis with autonomous hypothesis screening represents a significant advancement in biological discovery. This approach leverages the inherent structure within single-cell data, revealed through geometric techniques, to formulate and test biological hypotheses without manual intervention. By automatically exploring the landscape of potential mechanisms, researchers can bypass traditional bottlenecks and accelerate the identification of key drivers of cellular behavior. This computational process not only increases the speed of discovery, but also allows for the exploration of a broader range of possibilities, potentially uncovering previously overlooked biological relationships and mechanisms that would have remained hidden through conventional methods. The ability to autonomously screen hypotheses, guided by geometric insights, promises a future where biological understanding is driven by data-driven exploration and computational rigor.
The progression of this research necessitates an expansion beyond current data limitations, with future efforts directed towards applying these geometric and autonomous screening techniques to significantly larger single-cell datasets. A crucial advancement lies in multimodal integration, aiming to synthesize information not only from transcriptomic data, but also from proteomic and metabolomic profiles. This holistic approach promises a more comprehensive understanding of cellular states and dynamics, allowing the construction of foundation models that capture the intricate interplay between different molecular layers. By bridging these traditionally separate “omics” domains, researchers anticipate uncovering previously hidden regulatory mechanisms and generating predictive models with enhanced biological relevance and accuracy.
The development of foundation models in single-cell genomics strives for a dual purpose: accurate prediction and biological insight. While current models excel at forecasting cellular states, understanding why these predictions are made remains a significant challenge. This research endeavors to overcome this limitation by constructing models that are not simply “black boxes,” but rather offer transparent, interpretable reasoning behind their outputs. Demonstrating a robust consistency – confirmed by a Jaccard similarity of 0.65 across independent runs – suggests the identified patterns are not merely artifacts of the analytical process. Ultimately, these advancements aim to unlock a deeper understanding of the intricate mechanisms governing complex biological systems, moving beyond prediction to genuine biological discovery.

The exploration of foundation models in single-cell genomics, as detailed in the research, reveals a concerning trend: encoded biological information isn’t necessarily robust or universally applicable. This fragility echoes a broader issue – the values embedded within any system, be it algorithmic or scientific, shape its interpretation of reality. As Galileo Galilei observed, “You cannot teach a man anything; you can only help him discover it himself.” The study underscores this principle; the models don’t inherently reveal biological truths, but rather reflect the biases and limitations of the data and methods used to construct them. Rigorous null-model controls become, therefore, not just a technical necessity, but an ethical imperative – a means of “helping” the model, and ourselves, discover what is genuinely meaningful within the complex landscape of biological data.
Beyond the Map
The systematic interrogation of geometric structure within single-cell genomic foundation models, as presented, reveals a disquieting truth: encoding information is not the same as understanding it. While these models demonstrably capture variance, the fragility of the learned topology – its dependence on specific datasets and parameter choices – suggests an over-reliance on correlation rather than causal mechanisms. An engineer is responsible not only for system function but its consequences; therefore, simply achieving predictive power is insufficient. The field now faces a critical juncture – a need to move beyond simply mapping cell states to discerning the underlying generative principles.
Future work must prioritize the development of robust null models – controls not simply for random chance, but for the inherent biases embedded within data acquisition and model architecture. Autonomous screening, while promising, demands careful consideration of what constitutes a “meaningful” signal; a signal, after all, can be an artifact. The pursuit of geometric interpretability cannot be a post-hoc exercise; topological constraints and biological priors must be integrated into model design.
Ultimately, the value of these foundation models will not be measured by their ability to reproduce existing knowledge, but by their capacity to generate novel, testable hypotheses. Ethics must scale with technology, and the automation of biological discovery demands a commensurate level of intellectual rigor and philosophical humility.
Original article: https://arxiv.org/pdf/2602.22289.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/