Author: Denis Avetisyan
Researchers have developed a new artificial intelligence framework that dramatically accelerates the process of designing proteins with desired functions.

AlphaDE combines protein language models and Monte Carlo tree search to efficiently navigate the sequence-function landscape and outperform existing in silico directed evolution methods.
While computational protein design has advanced significantly, existing in silico directed evolution algorithms often underutilize the rich evolutionary information encoded within protein sequences. To address this, we present ‘Boosting In-Silico Directed Evolution with Fine-Tuned Protein Language Model and Tree Search’, introducing AlphaDE, a novel framework that harnesses the power of large language models and Monte Carlo tree search to efficiently navigate the protein sequence landscape. Our results demonstrate that AlphaDE markedly outperforms state-of-the-art methods, even with limited training data, and can effectively condense the search space for optimal protein sequences. Could this approach unlock a new era of rational protein design and accelerate breakthroughs in biotechnology and medicine?
The Limits of Exploration: Navigating Protein Sequence Space
The conventional approach to protein engineering, known as directed evolution, often proves to be a substantial computational undertaking. This method typically involves creating numerous protein variants, assessing their functionality, and iteratively refining them based on desired traits. However, the sheer size of the possible sequence combinations – a landscape often referred to as ‘sequence space’ – necessitates immense processing power and time. Consequently, successful designs frequently emerge not from systematic exploration but from the serendipitous discovery of functional variants. While effective in some instances, this reliance on chance limits the predictability and efficiency of the process, hindering the rational design of proteins with entirely novel capabilities and prompting the search for more targeted and computationally efficient strategies.
The challenge in designing novel proteins lies not in a lack of building blocks—amino acids—but in the sheer immensity of possible combinations. Each protein is a chain of these acids, and even a relatively short protein offers an astronomical number of potential sequences. Current protein design methods, while sophisticated, are fundamentally limited by their inability to thoroughly explore this “sequence space.” Imagine searching for a single needle in a haystack the size of a galaxy – that’s the scale of the problem. Consequently, many promising protein designs remain undiscovered, and researchers often rely on incremental improvements to existing proteins rather than truly novel creations. This vastness necessitates the development of innovative computational tools and experimental techniques capable of efficiently navigating and harnessing the full potential of protein sequence space to unlock proteins with tailored functionalities.
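To make that scale concrete, a back-of-the-envelope calculation: with 20 standard amino acids, a chain of length L admits 20^L possible sequences, so even a 100-residue protein has on the order of 10^130 variants. The short Python sketch below illustrates the arithmetic; the chain lengths are arbitrary examples, not values from the paper.

```python
# Back-of-the-envelope size of protein sequence space: 20 standard amino acids
# raised to the chain length. The lengths below are arbitrary illustrations.
NUM_AMINO_ACIDS = 20

for length in (10, 50, 100, 300):
    variants = NUM_AMINO_ACIDS ** length
    print(f"length {length:>3}: ~10^{len(str(variants)) - 1} possible sequences")
```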
Deep mutational scanning, a cornerstone of modern protein engineering, systematically assesses the functional impact of numerous amino acid substitutions, yet faces inherent scalability challenges. While capable of comprehensively mapping sequence-structure-function relationships for smaller proteins or specific regions, the technique becomes progressively time-consuming and resource-intensive as protein size increases. Creating and evaluating libraries encompassing all possible mutations – even within a single protein – rapidly becomes impractical due to the exponential growth of combinatorial space. Furthermore, current deep mutational scanning methods often focus on a limited range of conditions or functions, potentially overlooking crucial adaptive properties or novel functionalities that might emerge under alternative circumstances. This constrained scope necessitates careful experimental design and can hinder the discovery of proteins exhibiting truly exceptional or unexpected capabilities.
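The combinatorial growth is easy to quantify. For a protein of length L, an exhaustive k-mutant library contains C(L, k) x 19^k variants; the sketch below, assuming an arbitrary 300-residue protein, shows how quickly this outruns any feasible screen.

```python
# Exhaustive mutational library sizes for a protein of length L: choose k positions,
# then 19 alternative residues at each. L = 300 is an arbitrary illustration.
from math import comb

L = 300
for k in (1, 2, 3):
    variants = comb(L, k) * 19 ** k
    print(f"{k}-mutant library: {variants:,} variants")
```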

Unveiling Evolutionary Principles: The Power of Protein Language Models
Protein Language Models (PLMs) leverage the extensive data available in protein sequence databases – such as UniProt and NCBI – to statistically model the relationships between amino acids. These models do not require explicit structural information; instead, they infer evolutionary constraints by analyzing the frequency and co-occurrence of amino acids across millions of naturally occurring protein sequences. By treating protein sequences as a ‘language’, PLMs identify patterns indicative of conserved residues essential for protein function and stability, effectively capturing information accumulated through billions of years of evolution. This allows for the prediction of protein properties and the generation of novel sequences with desired characteristics based on learned evolutionary principles.
Protein Language Models (PLMs) leverage the Masked Language Modeling (MLM) technique, analogous to natural language processing, to decipher the principles governing protein sequences. In MLM, a percentage of amino acids within a protein sequence are masked, and the model is trained to predict these masked residues based on the surrounding context. Through exposure to large datasets of protein sequences, PLMs statistically learn the relationships between amino acids, their positions within the sequence, and the resulting impact on protein structure and function. This process enables the model to capture complex dependencies, including evolutionary constraints and biophysical principles, without explicit structural information. The resulting models can then represent protein sequences as high-dimensional vectors, effectively encoding information about the underlying rules governing protein biology.
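As a concrete illustration of masked-residue prediction, the sketch below queries a small public ESM-2 checkpoint through the Hugging Face transformers API. The checkpoint, the toy sequence, and the masked position are assumptions chosen for illustration, not details taken from the paper.

```python
# Minimal masked-language-modeling query against a small public ESM-2 checkpoint.
# The sequence and masked position are toy choices for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"  # small public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"               # toy sequence
masked = sequence[:10] + tokenizer.mask_token + sequence[11:]  # hide residue 11

inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and report the model's top residue guesses.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.convert_ids_to_tokens(int(idx)):>3s}  p={p:.3f}")
```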
Fine-tuning Protein Language Models (PLMs) with homologous sequences significantly improves their capacity for de novo protein variant generation. This process leverages the existing knowledge encoded within the PLM and adapts it to a specific protein family or function. By exposing the model to multiple sequence alignments of related proteins, fine-tuning refines the probability distributions governing amino acid selection, resulting in generated sequences that exhibit higher sequence similarity to functional proteins and a greater likelihood of adopting correct protein folds. The performance gain stems from the model’s ability to learn subtle patterns and constraints specific to the target protein family, enabling the creation of variants that are not only plausible in terms of sequence but also predicted to be structurally stable and functionally active.
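A minimal sketch of such fine-tuning, assuming the Hugging Face masked-LM tooling and a placeholder set of homologous sequences, is shown below; the homolog retrieval, masking rate, and optimizer settings actually used for AlphaDE are not reproduced here.

```python
# Sketch: fine-tune a masked protein LM on a handful of homologous sequences.
# The homolog list and hyperparameters are placeholder assumptions for illustration.
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

homologs = [  # placeholder homologs for the protein family of interest
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKTAYIAKQRQISFVKSHFARQLEERLGLIEVQ",
    "MKTAYLAKQRQISFVKSHFSRQLEERLGLVEVQ",
]

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
encodings = [tokenizer(seq) for seq in homologs]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):              # a few passes over the homolog set
    batch = collator(encodings)     # applies fresh random masking each pass
    loss = model(**batch).loss      # MLM loss on the masked positions
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {loss.item():.3f}")
```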

AlphaDE: A Framework for Guided Protein Sequence Evolution
AlphaDE combines a Protein Language Model (PLM), specifically ESM2, with Monte Carlo Tree Search (MCTS) for protein sequence optimization. ESM2 provides a learned representation of protein sequences that captures structural and functional constraints, enabling rapid evaluation of sequence viability without computationally expensive physical simulations. MCTS then explores the vast protein sequence space efficiently, selectively sampling and evaluating promising sequences guided by ESM2’s predictions. This integration allows AlphaDE to navigate the complex fitness landscape while focusing computational resources on regions with a high probability of yielding improved protein designs.
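The search side can be pictured with a schematic Monte Carlo tree search over single-residue mutations, sketched below. The stand-in score() function, the UCB constant, and the expansion rule are illustrative assumptions; they do not reproduce AlphaDE's actual policy, reward, or fine-tuned language model.

```python
# Schematic MCTS over single-residue mutations. score() is a toy stand-in for the
# language-model-based evaluation; the search constants are arbitrary choices.
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(seq):
    """Toy fitness: fraction of alanines. A protein LM would score sequences here."""
    return sum(aa == "A" for aa in seq) / len(seq)

class Node:
    def __init__(self, seq, parent=None):
        self.seq, self.parent = seq, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def expand(node):
    """Add one child that differs from the parent by a single random mutation."""
    i = random.randrange(len(node.seq))
    child = Node(node.seq[:i] + random.choice(AMINO_ACIDS) + node.seq[i + 1:], parent=node)
    node.children.append(child)
    return child

def mcts(root_seq, iterations=200):
    root = Node(root_seq)
    best = (score(root_seq), root_seq)
    for _ in range(iterations):
        node = root
        while node.children:                 # selection: descend by UCB
            node = max(node.children, key=Node.ucb)
        leaf = expand(node)                  # expansion: one new mutation
        reward = score(leaf.seq)             # evaluation by the (toy) oracle
        best = max(best, (reward, leaf.seq))
        while leaf is not None:              # backpropagation of the reward
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return best

print(mcts("MKTAYIAKQRQISFVKS"))
```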
AlphaDE utilizes pretrained protein language models – specifically ESM-1b and TAPE – as oracles to assess the fitness of generated protein sequences. These models, trained on vast datasets of known protein sequences, provide a predictive capability for evaluating sequence stability and function without requiring computationally expensive physical simulations. The models assign a probability score to each generated sequence, representing its likelihood of being a functional protein; this score serves as the fitness metric within the AlphaDE framework. By leveraging the predictive power of ESM-1b or TAPE, AlphaDE efficiently filters and prioritizes sequences, enabling the rapid identification of high-fitness variants during the evolutionary search process.
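One common way to turn a masked protein LM into such a scorer is a pseudo-log-likelihood: mask each position in turn and sum the log-probability the model assigns to the true residue. The sketch below illustrates this with a small public ESM-2 checkpoint; the checkpoint and the scoring rule are assumptions for illustration and are not the specific oracles (ESM-1b, TAPE) configured in the paper.

```python
# Pseudo-log-likelihood of a sequence under a masked protein LM: mask each
# position in turn and accumulate the log-probability of the true residue.
# The checkpoint and toy sequence are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def pseudo_log_likelihood(seq: str) -> float:
    ids = tokenizer(seq, return_tensors="pt").input_ids
    total = 0.0
    for pos in range(1, ids.shape[1] - 1):       # skip the BOS/EOS special tokens
        masked = ids.clone()
        true_id = masked[0, pos].item()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        total += torch.log_softmax(logits[0, pos], dim=-1)[true_id].item()
    return total

# Higher (less negative) scores indicate sequences the model finds more plausible.
print(pseudo_log_likelihood("MKTAYIAKQRQISFVKS"))
```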
Benchmarking results demonstrate that the AlphaDE framework achieves a mean fitness value of 1.22 when evaluated across five distinct protein engineering tasks: avGFP, TEM, AAV, E4B, and AMIE. This performance represents a substantial improvement over the TreeNeuralTS method, with AlphaDE exhibiting a 351.85% increase in fitness. The consistent achievement of higher fitness values across multiple tasks indicates the robustness and generalizability of the AlphaDE approach for protein sequence optimization.

Expanding the Horizon: Synergies with Advanced Protein Design Methods
AlphaDE isn’t intended to replace established directed evolution techniques, but rather to function as a powerful synergistic component within them. Current methods like Bayesian Optimization, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), and AdaLead all possess inherent strengths in navigating the vast sequence space of protein engineering; however, they can be limited by computational cost or premature convergence. Integration with AlphaDE addresses these challenges by providing a more informed and efficient exploration strategy, particularly in high-dimensional spaces. The framework dynamically adjusts sampling based on predicted performance, allowing existing algorithms to focus on promising regions and avoid wasted computational effort. This complementary approach consistently demonstrates improved performance across a range of protein design tasks, accelerating the discovery of novel enzymes and proteins with desired characteristics.
AlphaDE demonstrates a powerful synergy when paired with TreeNeuralTS and TreeNeuralUCB, sophisticated algorithms designed to navigate the vast landscape of possible protein sequences. These tree-based methods excel at balancing exploration – venturing into uncharted sequence space – with exploitation, refining promising candidates. Integrating AlphaDE with TreeNeuralTS and TreeNeuralUCB allows for a more informed and efficient search process; the framework effectively leverages the algorithms’ predictive capabilities to prioritize mutations and accelerate the identification of functional proteins. This combination isn’t merely additive, but demonstrably enhances sequence exploration, enabling the discovery of proteins with improved or novel characteristics beyond the reach of either technique alone, and ultimately expanding the boundaries of protein engineering.
AlphaDE demonstrates a compelling capacity for zero-shot learning, a capability that extends beyond the confines of previously explored protein sequences and functionalities. This adaptability arises from the framework’s ability to generalize learned patterns and apply them to entirely new design challenges without requiring task-specific training data. Researchers found that AlphaDE can effectively predict optimal enzyme variants even for targets with significant structural or functional divergence from those used during its initial development. This suggests the framework doesn’t simply memorize successful sequences, but rather learns underlying principles of protein stability and activity, offering a substantial advantage in tackling the vast and largely uncharted territory of de novo protein design and accelerating the creation of enzymes with unprecedented capabilities.

The pursuit of optimized protein sequences, as demonstrated by AlphaDE, inherently demands a ruthless simplification of the sequence-function landscape. This framework doesn’t attempt to model every nuance of protein behavior; instead, it focuses on distilling the essential elements for efficient evolution. As Donald Davies once stated, “If you can’t explain it simply, you don’t understand it.” AlphaDE embodies this principle by leveraging protein language models to create a manageable representation of protein space, enabling targeted exploration via Monte Carlo tree search. The elegance lies not in the complexity of the model, but in its ability to rapidly navigate and refine solutions, stripping away unnecessary details to achieve superior performance in in silico directed evolution.
The Road Ahead
The present work offers a reduction—a distillation of protein evolution into algorithmic terms. Yet, the sequence-function landscape remains stubbornly high-dimensional. AlphaDE navigates this space with improved efficiency, but efficiency is merely a means. The ultimate question—what constitutes ‘fit’ beyond pre-defined metrics—persists. Future iterations must address the inherent limitations of relying solely on labeled data, and explore the potential of truly de novo design—protein creation untethered to existing structures.
A critical bottleneck lies in accurately modeling the physical reality underpinning protein behavior. Current language models, however sophisticated, are fundamentally statistical. Bridging the gap between sequence probability and biophysical feasibility demands a more integrated approach—one that couples machine learning with rigorous simulation. The pursuit of ‘generalizability’—a model capable of evolving proteins for arbitrary functions—may prove illusory. Perhaps a more fruitful path lies in specialization—creating bespoke evolution engines tailored to specific protein families or catalytic mechanisms.
The elegance of in silico directed evolution resides in its promise of automation. However, true autonomy requires more than just efficient search. It demands a capacity for self-critique – a system capable of identifying its own limitations and iteratively refining its methodology. The path forward is not simply to build a better algorithm, but to design a learning system – one that evolves not just proteins, but the very principles of protein evolution.
Original article: https://arxiv.org/pdf/2511.09900.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/