Mapping Life’s Code: How AI Uncovers Hidden Structure in Cells

Author: Denis Avetisyan

New research reveals that artificial intelligence models are organizing gene data into a biologically meaningful framework, offering unprecedented insight into cellular organization and regulation.

With increasing layer depth, the transformer's internal geometry progressively concentrates gene representations along a secretory/localization axis (explained variance rises from 19% to 77%) while also embedding co-regulated genes in close proximity, as shown by consistently significant co-localization of TRRUST regulatory pairs across all layers.

Single-cell transformer representations establish a multi-dimensional spectral geometry that reveals underlying relationships in biological networks.

Despite the increasing complexity of single-cell genomic data, the biological knowledge encoded within high-dimensional gene representations learned by foundation models remains largely opaque. This study, ‘Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations’, reveals that these models, such as scGPT, organize genes into a structured biological coordinate system that reflects key aspects of cellular organization. Specifically, the study demonstrates that this geometry encodes subcellular localization, protein-protein interactions, and regulatory relationships, distinguishing transcription factors from target genes and revealing a nuanced representation of activation versus repression. How can this interpretable internal model advance our understanding of cellular processes and facilitate more effective drug discovery and model auditing?

Emergent Patterns in High-Dimensional Gene Expression

The advent of single-cell genomics has unleashed a flood of data, with each cell’s complete transcriptional profile represented by tens of thousands of genes – a dimensionality that quickly overwhelms traditional analytical methods. This high-dimensional space, while incredibly detailed, presents a significant challenge: extracting biologically relevant meaning from the noise requires innovative computational strategies. Simply cataloging gene expression levels proves insufficient; researchers need techniques capable of discerning underlying patterns, identifying subtle distinctions between cell types, and ultimately, reconstructing the complex relationships that govern cellular behavior. Consequently, the field is actively pursuing methods that can effectively reduce dimensionality without sacrificing crucial biological information, shifting the focus from merely describing gene activity to understanding the functional implications of these intricate molecular profiles.

Conventional methods for reducing the dimensionality of single-cell genomic data often fall short in representing the intricate relationships between genes that define cellular states. Techniques like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) each carry a limiting assumption: PCA captures only linear structure, while t-SNE prioritizes local neighborhoods at the expense of global arrangement. Both can therefore overlook the complex, non-linear interactions and dependencies that orchestrate gene expression. This limitation results in simplified representations where crucial biological information is lost, hindering the accurate identification of cell types and the understanding of developmental trajectories. Consequently, interpretations based solely on these reduced dimensions can be misleading, obscuring the nuanced genetic programs governing cellular identity and function.

scGPT represents a significant advancement in analyzing single-cell RNA sequencing data by employing a foundation model built on the transformer architecture. This approach moves beyond traditional dimensionality reduction, which often presents cellular data as a complex and unintelligible feature space. Instead, scGPT generates robust gene embeddings – numerical representations of genes that capture their relationships – revealing an inherent geometric organization within the data. By learning these embeddings, the model effectively maps cells onto a structured landscape where proximity reflects biological similarity, facilitating the identification of distinct cell types, developmental trajectories, and subtle variations in cellular states with greater accuracy and interpretability than previously possible. This structured representation not only simplifies the visualization of high-dimensional data but also allows for more effective downstream analyses, such as cell type annotation and the prediction of cellular responses to stimuli.

A six-dimensional subspace, particularly the combination of [latex]\mathrm{SV}_{2}[/latex]-[latex]\mathrm{SV}_{7}[/latex] (green), effectively distinguishes transcription factors from their targets across layers, with complementary depth profiles from [latex]\mathrm{SV}_{5}[/latex]-[latex]\mathrm{SV}_{7}[/latex] (orange) and [latex]\mathrm{SV}_{2}[/latex]-[latex]\mathrm{SV}_{4}[/latex] (blue) ensuring complete regulatory information coverage.

Revealing Intrinsic Complexity Through Dimensionality Reduction

Singular Value Decomposition (SVD) was applied to the embeddings generated by scGPT to determine the ‘effective rank’ of the data, which represents the number of independent signals contributing to cellular diversity. SVD decomposes the embedding matrix into three matrices, allowing for the identification of the principal components that capture the most variance in the data. The effective rank is then estimated by analyzing the singular values obtained from the decomposition; larger singular values correspond to stronger, more significant signals. By quantifying this effective rank, we can assess the intrinsic dimensionality of the gene expression landscape captured within the embeddings and understand how many independent factors are driving the observed heterogeneity in the single-cell data.
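To make the idea of an effective rank concrete, here is a minimal sketch using one common definition: the exponential of the Shannon entropy of the normalized singular values. Whether the study uses this exact definition is an assumption, and the `effective_rank` helper and the synthetic matrices are illustrative stand-ins, not the paper's code or data.

```python
import numpy as np

def effective_rank(embeddings: np.ndarray) -> float:
    """Entropy-based effective rank of a (genes x dims) embedding matrix."""
    # Center each dimension so the spectrum reflects variance, not a mean offset.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Only the singular values are needed, not the U and V factors.
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / s.sum()          # normalize the spectrum to a probability vector
    p = p[p > 0]             # avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one layer's gene embeddings (500 genes, 64 dims):
# one generated from only 3 latent factors, one with no low-rank structure.
low_rank = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))
full_rank = rng.normal(size=(500, 64))
print(effective_rank(low_rank), effective_rank(full_rank))
```

A matrix driven by three latent factors scores at most three on this measure, while unstructured noise scores close to the full embedding width, which is the sense in which a small effective rank signals a few dominant independent signals.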

Two Nearest Neighbors (TwoNN) Intrinsic Dimensionality is a method used to estimate the complexity of the gene expression landscape as represented within the scGPT embeddings. This technique functions by analyzing the distances between each data point (representing a cell) and its two nearest neighbors in the embedding space. The intrinsic dimensionality is then inferred from the distribution of the ratio of these two distances; a lower intrinsic dimensionality indicates that the data can be effectively represented with fewer variables while still capturing the essential relationships within the gene expression data. This provides a quantitative measure of the inherent complexity of cellular heterogeneity captured by the reduced-dimensional embeddings, independent of the original high-dimensional gene count.
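The TwoNN idea can be sketched in a few lines. The version below uses the closed-form maximum-likelihood estimator on the ratio of second- to first-nearest-neighbor distances; the original TwoNN formulation fits a line to the empirical distribution instead, and which variant the study used is not stated here. The `twonn_dimension` helper and the synthetic data are illustrative assumptions.

```python
import numpy as np

def twonn_dimension(points: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate of intrinsic dimensionality."""
    # Pairwise squared Euclidean distances (fine for a few thousand points).
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)      # exclude self-distances
    d2 = np.maximum(d2, 0.0)          # clip tiny negative round-off
    # Two smallest distances per point: r1 (nearest), r2 (second nearest).
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                     # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model, mu is Pareto-distributed with shape d,
    # so the maximum-likelihood estimate is N / sum(log mu).
    return float(len(mu) / np.log(mu).sum())

rng = np.random.default_rng(0)
# Hypothetical data: a flat 2-D sheet sitting inside a 10-D ambient space.
sheet = np.zeros((1500, 10))
sheet[:, :2] = rng.uniform(size=(1500, 2))
print(twonn_dimension(sheet))
```

The estimate tracks the dimensionality of the sheet rather than the ten ambient coordinates, which is why the method is useful for asking how many degrees of freedom an embedding actually occupies.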

Analysis of scGPT embeddings using Singular Value Decomposition indicates a substantial reduction in gene representation dimensionality. Specifically, scGPT achieves a 14.4-fold reduction across its 12 transformer layers, transitioning from the original gene expression space to a lower-dimensional representation. This dimensionality reduction does not appear to sacrifice critical biological information, suggesting that the underlying structure of cellular heterogeneity is surprisingly low-dimensional. The preserved variance within this reduced space indicates that scGPT effectively captures the essential signals driving cellular diversity while mitigating noise and redundancy inherent in high-dimensional gene expression data.

Edge-level regulatory geometry, measured by [latex] SV_5 [/latex]-[latex] SV_7 [/latex] (orange), peaks in early layers and diminishes with depth, while [latex] SV_2 [/latex]-[latex] SV_4 [/latex] (blue) exhibits near-random behavior.

Mapping Interaction Networks Through Embedded Structure

Analysis of the embedding matrix generated by scGPT reveals that the initial singular vectors, specifically SV2 and SV3, contain information relating to protein interaction networks. Corroboration of this finding is achieved through comparison with the STRING Database, a curated resource of known and predicted protein-protein interactions. The observed alignment between the spectral properties captured in SV2 and SV3 and the interaction data within STRING indicates that the embedding process effectively encodes relationships derived from protein interactions, suggesting the learned representation reflects network topology.

The co-pole rate quantifies the degree to which pairs of genes are co-localized along the principal spectral axes derived from the scGPT embedding. This metric assesses the overlap in projections of gene pairs onto these axes, effectively measuring functional association based on embedding proximity. Higher co-pole rates indicate a stronger tendency for functionally related gene pairs to exhibit similar embedding profiles, providing empirical support for the embedding’s capacity to represent biological relationships. Statistical analysis of co-pole rates demonstrates significant enrichment of known functional gene pairs, further validating the embedding’s ability to capture and organize genes based on their functional roles within the cell.
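The exact definition of the co-pole rate is not given here, so the following is only one plausible reading, sketched under stated assumptions: a gene sits at a "pole" of an axis if its projection falls in the top or bottom fraction of all genes, and a pair counts as co-pole when both members sit at the same pole. The `co_pole_rate` helper, the `frac` threshold, and the toy values are hypothetical.

```python
import numpy as np

def co_pole_rate(projection, pairs, frac=0.25):
    """Fraction of gene pairs whose two members share a pole of one axis.

    projection: 1-D array, each gene's coordinate along a singular vector.
    pairs: sequence of (i, j) gene-index pairs.
    frac: assumed pole size, as a fraction of genes at each extreme.
    """
    projection = np.asarray(projection, dtype=float)
    lo, hi = np.quantile(projection, [frac, 1.0 - frac])
    # +1 = positive pole, -1 = negative pole, 0 = neither.
    pole = np.where(projection >= hi, 1, np.where(projection <= lo, -1, 0))
    pairs = np.asarray(pairs)
    a, b = pole[pairs[:, 0]], pole[pairs[:, 1]]
    return float(np.mean((a != 0) & (a == b)))

# Toy axis with eight genes; genes 0-1 form one pole and genes 6-7 the other.
axis = np.array([-3.0, -2.0, -1.0, 0.0, 0.1, 1.0, 2.0, 3.0])
toy_pairs = [(0, 1), (6, 7), (0, 7), (3, 4)]
rate = co_pole_rate(axis, toy_pairs)
print(rate)
```

In the toy example, two of the four pairs share a pole, one pair straddles opposite poles, and one sits in the middle of the axis, so only half of the pairs count.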

Analysis indicates that scGPT’s learned gene representations are not independent, but rather reflect the underlying protein interaction network. Specifically, the model encodes information about interacting protein pairs, as evidenced by statistically significant Z-score values. These scores, quantifying the enrichment of known protein-protein interactions along the embedding’s spectral axes, consistently exceed established significance thresholds, demonstrating that co-localized gene pairs within the embedding space are more likely to represent physically interacting proteins according to external databases like STRING. This suggests scGPT captures functional relationships beyond individual gene characteristics, representing genes within the context of their broader biological network.
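One common way to turn an enrichment like this into a Z-score is to compare the observed statistic with a permutation null. The sketch below assumes a gene-label shuffle as the null model, which keeps the shape of the interaction graph but randomizes which genes occupy it. The study's actual null model is not specified here, and the `permutation_z` helper, the toy `same_side` statistic, and the synthetic pairs are illustrative assumptions.

```python
import numpy as np

def permutation_z(stat_fn, pairs, n_genes, n_perm=1000, seed=0):
    """Z-score of stat_fn(pairs) against a gene-label permutation null."""
    rng = np.random.default_rng(seed)
    pairs = np.asarray(pairs)
    observed = stat_fn(pairs)
    null = np.empty(n_perm)
    for k in range(n_perm):
        # Relabel genes at random; the pair graph keeps its topology.
        relabel = rng.permutation(n_genes)
        null[k] = stat_fn(relabel[pairs])
    return float((observed - null.mean()) / null.std(ddof=1))

rng = np.random.default_rng(1)
coord = rng.normal(size=200)          # hypothetical projections on one axis
order = np.argsort(coord)
# Hypothetical "interacting" pairs: genes adjacent in rank along the axis.
pairs = np.column_stack([order[:-1], order[1:]])

def same_side(p):
    """Share of pairs whose members fall on the same side of zero."""
    return float(np.mean(np.sign(coord[p[:, 0]]) == np.sign(coord[p[:, 1]])))

z = permutation_z(same_side, pairs, n_genes=200)
print(z)
```

Because the toy pairs are built to be neighbors along the axis, the observed statistic sits far above the shuffled baseline; real interaction pairs would be expected to produce a smaller but still positive score if the axis carries network information.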

Repression edges exhibit greater geometric prominence than activation edges across both analyzed spectral subspaces.

B-Cell Differentiation: A Landscape of Transcriptional Regulation

Single-cell RNA sequencing data, when processed through the scGPT embedding model, reveals a compelling correlation between the resulting data structure and the well-defined stages of B-cell differentiation. This analysis demonstrates that the embedding space isn’t simply a random distribution of cells, but rather organizes itself in a way that mirrors the cellular transformations occurring during B-cell development, notably within the germinal center reaction. The germinal center, critical for antibody affinity maturation, is particularly well-defined within the embedding, suggesting scGPT effectively captures the transcriptional changes that define this crucial immune process. This inherent organization allows for a nuanced understanding of how B-cells progress through differentiation, potentially uncovering key regulatory factors and pathways involved in a functional immune response.

Analysis reveals that transcription factors, as cataloged in the TRRUST Database, are fundamental organizers of the single-cell gene expression landscape during B-cell differentiation. These proteins, which regulate gene expression, don’t operate in isolation; rather, they establish a hierarchical structure within the embedding space, effectively charting the developmental trajectories of B-cells. This organization suggests that the activity of specific transcription factors directly correlates with, and likely drives, key stages of differentiation, including the germinal center reaction where B-cells refine their antibody production. The observed relationship indicates that variations in transcription factor expression patterns can accurately predict the regulatory edges influencing B-cell development and contribute to distinct B-cell marker clustering, highlighting their crucial role in orchestrating immune responses.

The capacity of scGPT to model intricate relationships within single-cell data suggests a powerful new avenue for dissecting immune cell development and function. Initial evaluations demonstrate scGPT’s predictive capabilities, achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.602 at the initial embedding layer for identifying regulatory connections between genes. Furthermore, significant enrichment in AUROC values was observed when applying scGPT to cluster cells based on established B-cell markers, indicating the model effectively captures the transcriptional signatures defining distinct stages of B-cell differentiation. These results establish scGPT not merely as a data representation tool, but as a potentially insightful framework for uncovering the regulatory logic governing immune cell fate and behavior.
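For readers less familiar with AUROC, it is the probability that a randomly chosen true regulatory pair receives a higher score than a randomly chosen non-pair, so 0.5 is chance and 0.602 means the true pair wins about 60% of the time. The sketch below computes it through the Mann-Whitney U statistic. How the study scores candidate pairs is not described here; the `auroc` helper and the toy scores are illustrative.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic, with ties given average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average the ranks within each group of tied scores.
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

# Toy example: four candidate gene pairs, two of which are true edges.
# A score could be, for instance, a similarity between the two embeddings.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auroc(toy_scores, toy_labels))
```

Here three of the four possible (true edge, non-edge) comparisons are ranked correctly, giving 0.75.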

Beyond Immunology: A Foundation for Cellular Understanding

The success of scGPT isn’t limited to the study of immune cells; its core strength lies in the application of transformer architectures, a methodology proven capable of deciphering intricate relationships within diverse biological systems. These architectures, initially prominent in natural language processing, excel at identifying contextual dependencies – a crucial skill when analyzing the complex interplay of genes and proteins within any cell type. This adaptability means the principles behind scGPT can be readily extended to model other tissues and cellular compositions, offering a unified framework for understanding cellular function beyond immunology. Researchers anticipate leveraging this foundational approach to unlock insights across a spectrum of biological investigations, from neurological disorders to developmental biology, effectively establishing a versatile tool for single-cell genomics.

The power of scGPT isn’t solely in its ability to map cellular states, but in its connection to established biological knowledge. By integrating Gene Ontology (GO) data – a comprehensive framework detailing gene functions – the model transforms abstract embeddings into readily interpretable insights. This fusion allows researchers to pinpoint key functional modules driving cellular behavior, moving beyond simple cell type identification to understand how cells operate. Consequently, the system doesn’t just cluster cells; it reveals the underlying biological processes responsible for those groupings, accelerating the discovery of gene networks and pathways involved in both healthy and diseased states. The resulting embeddings, therefore, represent a powerful bridge between computational analysis and established biological understanding.

scGPT is poised to become a central resource for single-cell genomics, promising to dramatically accelerate biological and medical discoveries. This model establishes a new paradigm by learning comprehensive cellular representations, enabling researchers to investigate diverse cell types and tissues with greater efficiency. Notably, the convergence of germinal center (GC) regulator embeddings across its layers shows a strong negative Spearman correlation of -0.972, a nearly monotonic trend that indicates a robust and consistent internal representation of gene regulatory mechanisms. This internal consistency suggests scGPT doesn’t simply memorize data but learns underlying biological principles, offering a powerful foundation for predictive modeling, disease mechanism elucidation, and the development of novel therapeutic strategies. Its potential extends beyond current applications, serving as a springboard for future innovations in personalized medicine and a deeper understanding of cellular life.

The research illuminates how complex biological systems self-organize, much like a coral reef forming an ecosystem from local interactions. The study reveals an inherent geometry within single-cell data, suggesting order doesn’t require a central architect, but emerges from the interplay of regulatory networks. This resonates with Sartre’s assertion, “Existence precedes essence,” because the functional meaning of genes isn’t predetermined; it arises from their relationships within the cellular context, a discovered ‘essence’ born of ‘existence’ within the scGPT coordinate system. The model doesn’t impose order; it reveals it, demonstrating that constraints can indeed be invitations to creativity, as the model’s architecture facilitates the discovery of biological principles.

What Lies Ahead?

The emergence of a discernible biological coordinate system within the latent space of single-cell foundation models is less a triumph of engineering and more an acknowledgement of inherent order. Global regularities emerge from simple rules; the model doesn’t impose biology, it reveals what was already present in the high-dimensional data. The immediate challenge, however, isn’t further refinement of the architecture, but a rigorous accounting of the limitations of this revealed order. Is the ‘geometry’ truly representative of underlying biology, or merely a consequence of algorithmic constraints and data biases?

Future work will likely focus on perturbing the system – deliberately introducing noise or altering data representations – to assess the robustness of the observed coordinate system. Any attempt at directive management often disrupts this process, so a cautious, observational approach is warranted. A critical examination of the residual streams within these transformer networks could offer further insight into the mechanisms driving this organization, potentially unveiling the principles governing cellular differentiation and regulatory network dynamics without requiring explicit, pre-defined knowledge.

Ultimately, the power of this approach lies not in prediction, but in description. The goal shouldn’t be to control cellular behavior, but to understand the rules by which it self-organizes. The next phase of research will require abandoning the notion of a ‘complete’ model and embracing the inherent ambiguity of biological systems, recognizing that any representation is, at best, a partial and provisional map of an infinitely complex landscape.

Original article: https://arxiv.org/pdf/2602.22247.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/