Author: Denis Avetisyan
A new framework, BioArc, automatically engineers neural networks optimized for understanding complex biological data like DNA and proteins.

BioArc is a neural architecture search framework for discovering high-performance foundation models tailored to biological data, achieving competitive results with smaller models and advancing self-supervised learning in the life sciences.
While foundation models have driven progress in fields like language and vision, directly applying architectures from these domains to biological data often yields suboptimal results due to inherent differences in data structure and properties. To address this, we present BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models, a novel framework leveraging Neural Architecture Search to systematically explore and identify high-performing architectures specifically tailored for biological data modalities. This large-scale analysis not only reveals novel designs, which often achieve competitive results with smaller models, but also distills empirical principles for future model development and proposes effective architecture prediction methods. Will this principled approach to architecture discovery unlock a new era of task-specific and generalized foundation models for biology?
The Evolving Landscape of Biological Modeling
The rapid evolution of Foundation Models, notably those built upon the Transformer architecture and Diffusion Models, represents a significant leap forward in artificial intelligence. These models, pre-trained on massive datasets, demonstrate an unprecedented ability to generalize and adapt to diverse downstream tasks with minimal task-specific fine-tuning. The Transformer, with its attention mechanisms, excels at capturing contextual relationships within data, while Diffusion Models have revolutionized generative tasks, producing remarkably realistic images, audio, and other complex outputs. This paradigm shift moves away from traditional, task-specific AI development, instead fostering the creation of broadly capable models that can be applied to a wide range of problems, fundamentally altering the landscape of machine learning and opening doors to previously unattainable levels of automation and insight.
Applying recent advances in foundation models to biological data, such as DNA and protein sequences, introduces significant hurdles stemming from the very nature of these systems. Unlike the relatively structured data often used to train these models – like text or images – biological sequences exhibit intricate, long-range dependencies and complex, hierarchical structures. Traditional machine learning approaches struggle to effectively capture these relationships, often requiring massive computational resources and still failing to fully represent the underlying biological processes. The inherent variability and noise within genomic data, coupled with the sheer scale of these sequences, further complicate the task, demanding innovative strategies to overcome limitations in data representation and model architecture.
Genomic data, unlike many datasets utilized in artificial intelligence, is characterized by intricate, long-range dependencies – a single gene’s expression can be influenced by regulatory elements located far along the DNA sequence. Traditional computational methods, often designed for more localized patterns, frequently fail to adequately capture these distant interactions, leading to incomplete or inaccurate analyses. This limitation hinders progress in areas like predicting gene function, understanding disease mechanisms, and designing effective therapies. Consequently, researchers are actively exploring novel approaches – including those inspired by the efficiency of biological systems themselves – to better model these complex relationships and unlock the full potential of genomic information. The inability of prior methods to account for these long-range effects underscores the necessity for innovative techniques capable of deciphering the full context of genomic data.
A significant evolution in artificial intelligence is occurring through the embrace of principles found in biological systems, specifically regarding efficient information processing. Rather than relying on manually designed neural network architectures, researchers are now implementing automated architecture discovery methods. This approach allows algorithms to independently determine the optimal network structure for a given biological dataset, like genomic sequences. The resulting models demonstrate a remarkable reduction in size, with discovered architectures roughly 25× smaller than conventional designs, without sacrificing predictive power. In many cases, these biologically-inspired, automatically-discovered architectures actually improve performance, suggesting that the inherent efficiency of natural systems can be successfully translated into more streamlined and effective AI models for complex biological data analysis.

BioArc: Automating Architectural Evolution
BioArc implements a systematic framework for neural architecture discovery by defining a comprehensive search space and employing automated optimization techniques. This framework moves beyond manual architecture engineering by treating architecture design as a learnable component. It utilizes a predefined “Supernet” – a large, over-parameterized neural network containing numerous possible architectures – and then leverages Neural Architecture Search (NAS) algorithms to identify optimal subnetworks specifically suited for biological data analysis tasks. The systematic approach includes defined evaluation metrics and search strategies, enabling reproducible results and facilitating the discovery of architectures that maximize performance on target datasets, such as those used in genomics and proteomics.
BioArc employs Neural Architecture Search (NAS) to automate the identification of effective neural network architectures for biological data. Rather than manually designing and testing individual models, BioArc utilizes a Supernet – a large, encompassing network containing numerous potential architectures as subnetworks. NAS algorithms then efficiently search this expansive design space by evaluating different subnetworks within the Supernet, thereby reducing the computational cost associated with exploring all possible configurations. This approach enables BioArc to identify architectures optimized for specific biological tasks without requiring extensive manual effort or prior architectural assumptions.
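The sample-and-evaluate loop at the heart of NAS can be sketched in a few lines. Everything below is a hypothetical stand-in for illustration: the candidate operators, widths, and proxy scores are invented, whereas BioArc's real supernet shares weights across subnetworks and evaluates them on held-out biological data.

```python
import random

# Toy search space: each layer picks one candidate operator, plus a width.
# These choices and scores are hypothetical, not BioArc's actual space.
SEARCH_SPACE = {
    "layer1": ["conv", "lstm", "hyena"],
    "layer2": ["conv", "lstm", "hyena"],
    "width": [64, 128, 256],
}

# Hypothetical proxy scores; in a real supernet each sampled subnetwork
# would be evaluated with the shared weights on a validation set.
LAYER1_SCORE = {"conv": 0.20, "lstm": 0.30, "hyena": 0.35}
LAYER2_SCORE = {"conv": 0.30, "lstm": 0.25, "hyena": 0.28}
WIDTH_SCORE = {64: 0.10, 128: 0.15, 256: 0.12}  # bigger is not always better

def evaluate(arch):
    """Score one subnetwork (a single choice per search dimension)."""
    return (LAYER1_SCORE[arch["layer1"]]
            + LAYER2_SCORE[arch["layer2"]]
            + WIDTH_SCORE[arch["width"]])

def random_search(n_samples=25, seed=0):
    """Sample subnetworks and keep the best seen, instead of
    exhaustively enumerating every possible configuration."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_samples):
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best, score = random_search()
print(best, round(score, 3))
```

Real NAS strategies replace the random sampler with evolutionary or gradient-based controllers, but the contract is the same: a search space of discrete choices and a cheap evaluator over shared weights.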
BioArc’s architecture integrates Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and the Hyena operator to capitalize on the strengths of each for biological sequence analysis. CNNs effectively capture local patterns, LSTMs model long-range dependencies, and Hyena offers efficient processing of long sequences via implicit convolutions. Crucially, BioArc incorporates data preprocessing via tokenization, converting raw biological data, such as DNA sequences, into a numerical representation suitable for neural network input. This tokenization step is essential for consistent data formatting and enables the subsequent application of CNN, LSTM, and Hyena layers within the discovered architectures.
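A minimal sketch of DNA tokenization, assuming a single-nucleotide vocabulary with padding and mask tokens; BioArc's actual tokenizer may well use a different scheme (k-mers or learned subwords), so treat the vocabulary below as illustrative:

```python
# Hypothetical single-nucleotide vocabulary; special tokens first so that
# id 0 can double as padding. "N" absorbs any ambiguous base call.
VOCAB = {"<pad>": 0, "<mask>": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}

def tokenize(sequence, max_len=16):
    """Map a DNA string to a fixed-length list of integer token ids."""
    ids = [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]
    ids = ids[:max_len]                           # truncate long sequences
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))  # pad short ones
    return ids

print(tokenize("ACGTN", max_len=8))  # → [2, 3, 4, 5, 6, 0, 0, 0]
```

Fixed-length integer sequences like these are what the CNN, LSTM, and Hyena layers then consume, typically after an embedding lookup.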
BioArc facilitates a transition from manually designed neural networks to architectures discovered through automated search, specifically addressing limitations in applying intuition to complex biological data. This automated approach systematically explores a broad range of network configurations, identifying models that demonstrably outperform existing, manually crafted solutions on DNA sequence analysis tasks. Benchmarking indicates BioArc-generated models achieve higher accuracy and efficiency in tasks such as sequence classification and regulatory element prediction compared to established methods, indicating an ability to better capture underlying biological patterns and relationships within genomic datasets.

Unveiling Genomic Insights: Promoter Detection & Pretraining Strategies
Promoter detection is a fundamental task in genomics, involving the identification of DNA regions that regulate gene expression. BioArc improves promoter detection by effectively modeling complex regulatory signals present in genomic sequences. These signals often consist of combinations of transcription factor binding sites and other sequence features that are difficult to capture with traditional methods. BioArc’s architecture is designed to integrate these features, leading to more accurate identification of true promoter regions and a reduction in false positives. The framework achieves this through a combination of specialized layers and attention mechanisms that prioritize relevant sequence motifs and their interactions, ultimately enhancing the precision of promoter identification.
BioArc incorporates prior knowledge of genomic elements to improve promoter detection accuracy. Specifically, the framework utilizes features representing CpG islands, regions of DNA with a high concentration of CpG dinucleotides often associated with gene regulation, and the DPE (Downstream Promoter Element), a core promoter motif located roughly 28–32 bp downstream of the transcription start site that helps recruit the transcription machinery. Integrating these genomic element annotations as input features allows the model to leverage established biological insights, thereby refining its ability to distinguish true promoters from non-promoter sequences and enhancing overall performance beyond models relying solely on raw DNA sequence information.
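CpG-island features of the kind described above are usually derived from two window statistics: GC fraction and the observed/expected CpG ratio (the classic Gardiner-Garden and Frommer criteria flag windows of at least 200 bp with GC > 0.5 and obs/exp > 0.6). A minimal sketch, not BioArc's actual feature extractor:

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a window.

    The expected CpG count assumes C and G occur independently, so
    expected = (#C * #G) / len(seq); the ratio compares actual CG
    dinucleotide counts against that baseline.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")                     # observed CpG dinucleotides
    gc_frac = (c + g) / n
    expected = (c * g) / n if c and g else 0.0
    obs_exp = cg / expected if expected else 0.0
    return gc_frac, obs_exp

gc, ratio = cpg_stats("CGCGCGCGATATCGCG")
print(gc, ratio)
```

Sliding these statistics along a sequence yields per-window annotation tracks that can be concatenated with token embeddings as model input.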
BioArc leverages pretraining strategies to build robust genomic representations without relying on labeled data. Specifically, the framework utilizes Masked Modeling, where portions of the input sequence are randomly masked and the model is trained to predict the missing nucleotides. Next Token Prediction involves training the model to predict the subsequent nucleotide in a given genomic sequence. Furthermore, BioArc employs Contrastive Learning, which aims to learn embeddings where similar sequences are close in vector space and dissimilar sequences are distant. These unsupervised pretraining objectives allow the model to capture inherent patterns and dependencies within unlabeled genomic data, forming a strong foundation for subsequent fine-tuning on specific downstream tasks.
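The masked-modeling objective can be sketched as a corruption step over token ids: a fraction of positions is hidden behind a mask token and the model must recover the originals. The 15% rate and the -100 "ignore" label mirror common BERT-style practice; BioArc's exact masking recipe is an assumption here.

```python
import random

MASK_ID = 1  # assumed id of the <mask> token in the vocabulary

def mask_tokens(token_ids, mask_rate=0.15, seed=0):
    """Produce (corrupted inputs, prediction targets) for masked modeling.

    Masked positions keep their original token as the target; unmasked
    positions get -100, the conventional label ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_rate:
            inputs.append(MASK_ID)  # hide this nucleotide from the model
            targets.append(tok)     # ...and ask it to predict the original
        else:
            inputs.append(tok)
            targets.append(-100)    # position excluded from the loss
    return inputs, targets
```

Next-token prediction needs no corruption at all (the target is simply the input shifted by one position), and contrastive objectives instead pair two augmented views of the same sequence; all three operate on the same tokenized inputs.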
Fine-tuning BioArc’s pretrained models demonstrates improved performance on downstream genomic tasks. A Spearman rank correlation of 0.8170 between the ranking of architectures evaluated with supernet-initialized weights and the ranking obtained after full task-specific training indicates strong agreement between the two. This high correlation confirms the reliability of the architecture ranking produced by the pretrained supernet, suggesting that the learned representations effectively capture relevant genomic information and facilitate efficient transfer learning to various downstream applications.
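The Spearman correlation used above is just the Pearson correlation applied to ranks, so it measures agreement in ordering rather than in raw scores. A self-contained sketch (in practice one would call `scipy.stats.spearmanr`; the score lists in the usage example are invented):

```python
def rankdata(xs):
    """1-based ranks, with ties receiving the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: proxy scores from a supernet vs. final scores
# after full training for four candidate architectures.
proxy = [0.61, 0.72, 0.80, 0.75]
final = [0.58, 0.70, 0.83, 0.79]
print(spearman(proxy, final))
```

A value near 1 means architectures that look good under the cheap supernet proxy also rank highly after full training, which is exactly what makes the proxy useful for search.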

Expanding the Horizon: BioArc’s Future Impact
The true strength of BioArc lies in its capacity as a broadly applicable analytical tool, extending well beyond the confines of genomic data analysis. While initially demonstrated with genomic datasets, the framework’s core architecture is inherently adaptable to diverse biological data types, notably proteomics and metabolomics. This versatility stems from BioArc’s focus on relationships between data points, rather than the specifics of the data itself; it can model complex interactions regardless of whether those interactions are derived from gene expression, protein abundance, or metabolite concentrations. Consequently, researchers can leverage the same automated architecture discovery process across multiple ‘omics layers, facilitating a more holistic and integrated understanding of biological systems and potentially revealing previously hidden connections between different levels of biological organization.
BioArc’s design prioritizes adaptability, enabling effortless incorporation of cutting-edge deep learning methodologies. The framework’s modular structure allows researchers to readily experiment with, and benefit from, advancements like sophisticated attention mechanisms – which refine the model’s focus on critical data points – and the power of graph neural networks, capable of representing and analyzing complex biological relationships. This seamless integration isn’t merely about adopting new tools; it’s about creating a platform that dynamically evolves with the field, ensuring BioArc remains at the forefront of biological data analysis and facilitates the translation of algorithmic progress into tangible scientific insights. The inherent flexibility promises a sustained capacity to leverage future innovations in deep learning, maximizing the potential for discovery across diverse biological datasets.
BioArc fundamentally shifts the paradigm of biological data analysis by automating the traditionally laborious process of machine learning architecture discovery. Rather than requiring researchers to manually design and optimize complex neural networks, the framework intelligently searches for the most effective model structure tailored to a specific biological question. This automation frees scientists from the constraints of model engineering, allowing them to concentrate expertise and resources on interpreting results and formulating new hypotheses. Consequently, BioArc promises to dramatically accelerate the pace of scientific discovery by lowering the barrier to entry for advanced machine learning techniques and enabling faster, more efficient exploration of complex biological datasets, ultimately fostering breakthroughs across diverse fields like genomics, proteomics, and beyond.
Ongoing development of BioArc prioritizes enhanced computational efficiency and broadened applicability through refined search algorithms and knowledge transfer techniques. Researchers are actively working to accelerate the automated architecture discovery process, allowing for faster analysis of complex biological data. A key objective is to enable the transfer of learned patterns between diverse datasets – for instance, applying insights from genomic data to improve metabolomic predictions. This cross-dataset learning is being pursued concurrently with efforts to drastically reduce model size, aiming for a roughly 25× reduction, which promises to lower computational costs and facilitate wider accessibility without compromising predictive power.

The pursuit of optimal neural architectures, as detailed in this work with BioArc, echoes a fundamental principle of resilient systems. BioArc’s method of discovering high-performance foundation models from biological data – prioritizing efficiency even with smaller models – suggests an acceptance of inherent limitations. As Alan Turing observed, “We can only see a short distance ahead, but we can see plenty there that needs to be done.” This rings true; BioArc doesn’t attempt to brute-force complexity, but rather to refine and optimize within the constraints of available data and computational resources, allowing the system to age gracefully and continue functioning effectively. The framework isn’t about achieving ultimate power, but about sustained, adaptable performance – a hallmark of enduring systems.
What Lies Ahead?
The pursuit of optimal neural architectures, as demonstrated by BioArc, is less a quest for perfection and more a calculated deceleration of inevitable decay. Each refined layer, each strategically pruned connection, merely extends the period of functional coherence before the model, like any complex system, succumbs to the pressures of entropy. The framework’s success with biological data – protein and DNA sequences – suggests a shift in focus: not simply building larger models, but sculpting them with an acute understanding of the data’s intrinsic limitations and inherent noise.
Future work will inevitably confront the challenge of generalizability. Can architectures discovered for one biological domain be effectively transferred to others? Or is each system – genome, proteome, metabolome – so uniquely textured that it demands a bespoke neural scaffolding? The reliance on self-supervised learning, while powerful, introduces a subtle form of bias: the model learns to predict based on existing patterns, potentially reinforcing established dogma rather than fostering genuine discovery.
Ultimately, the value of BioArc lies not in its current performance, but in its potential to accelerate the cycle of refinement. Technical debt, in this context, is akin to erosion; a constant force requiring continuous maintenance. Uptime is a rare phase of temporal harmony, a fleeting moment of stability before the system’s inherent vulnerabilities are exposed. The next phase demands a more holistic view: integrating architectural search with robust uncertainty quantification and a rigorous assessment of long-term stability.
Original article: https://arxiv.org/pdf/2512.00283.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-02 13:31