Unfolding Protein Prediction with Scale

Author: Denis Avetisyan


A new model, SeedFold, demonstrates that scaling both data and model size, combined with a novel attention mechanism, dramatically improves the accuracy of biomolecular structure prediction.

SeedFold achieves state-of-the-art performance on the FoldBench benchmark by scaling folding models along three axes: model capacity, through a Pairformer width of 512; architecture, through linear triangular attention that reduces computational complexity, as demonstrated by its 384-width variant, SeedFold-Linear; and data, through large-scale distillation that expands training to 26.5 million samples.

SeedFold achieves state-of-the-art performance on FoldBench through efficient linear attention and optimized data distillation techniques.

Achieving highly accurate biomolecular structure prediction requires continually increasing model capacity, a process often limited by computational complexity. This need drives the work presented in ‘SeedFold: Scaling Biomolecular Structure Prediction’, which introduces a novel approach to scaling both model and data size for protein folding. SeedFold leverages an effective width-scaling strategy, a new linear triangular attention mechanism to reduce computational load, and a large-scale distilled dataset, demonstrably outperforming AlphaFold3 on most protein-related tasks evaluated on FoldBench. Will these innovations pave the way for even more powerful and accessible biomolecular foundation models?


The Enduring Challenge of Protein Folding

The fundamental link between a protein’s three-dimensional structure and its biological function necessitates accurate structure prediction, yet achieving this has long presented a significant computational hurdle. Traditional methods, relying on experimental techniques like X-ray crystallography and nuclear magnetic resonance, are often time-consuming, costly, and struggle with proteins that are difficult to isolate or unstable. Computational approaches, while offering a faster alternative, face the ‘combinatorial explosion’ of possible conformations a protein chain can adopt as it folds – a problem so complex that exhaustively searching for the lowest energy state, which corresponds to the native structure, remains computationally intractable for all but the smallest proteins. This inherent difficulty limits progress in diverse fields, from designing targeted therapeutics to engineering novel biomaterials, as understanding a protein’s role requires knowing precisely how it is shaped and how that shape dictates its interactions.

The difficulty in predicting protein structures initially stemmed from the sheer number of possible conformations a protein chain can adopt – a problem known as combinatorial complexity. Each amino acid within the chain possesses multiple rotational and bonding possibilities, and the total number of arrangements grows exponentially with the protein’s length. This meant that even simulating relatively short protein sequences using brute-force computational methods became intractable, quickly overwhelming available processing power. Consequently, advancements in fields reliant on understanding protein shape – such as rational drug design, where molecules are engineered to bind to specific protein targets – and the creation of novel biomaterials with tailored properties were significantly delayed. The inability to accurately model these structures presented a fundamental bottleneck, requiring innovative approaches to bypass the limitations imposed by this combinatorial explosion.
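The exponential growth described above can be made concrete with a back-of-the-envelope Levinthal-style estimate. The figure of roughly three conformations per residue is a common textbook assumption, not a number taken from the article:

```python
# Back-of-the-envelope Levinthal-style estimate. The ~3 conformations
# per residue is an illustrative textbook assumption, not a figure
# from the article.
def conformation_count(n_residues: int, states_per_residue: int = 3) -> int:
    """Size of the conformational search space under a fixed-states model."""
    return states_per_residue ** n_residues

print(conformation_count(10))    # 59049
print(conformation_count(100))   # ~5e47: far beyond exhaustive search
```

Even a short 100-residue chain yields on the order of 10^47 conformations, which is why brute-force enumeration was never viable.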

Increasing model width consistently improves both global structural accuracy (measured by complex RMSD, with lower values being better) and local structural quality (measured by intra-protein lDDT, with higher values being better), with the most significant gains achieved when scaling from 128×128 to 256×256, while further increases to 512×512 yield diminishing returns and depth scaling provides only marginal improvements.

AlphaFold2: A Leap Forward in Structural Insight

AlphaFold2 achieved a breakthrough in protein structure prediction by utilizing attention mechanisms, a technique originally developed for natural language processing. Prior to AlphaFold2, ab initio or template-based modeling methods struggled to consistently reach accuracy levels comparable to experimental techniques like X-ray crystallography or cryo-electron microscopy. Evaluations using the Critical Assessment of Structure Prediction (CASP) competitions demonstrated that AlphaFold2 consistently achieved Global Distance Test – Total Score (GDT_TS) scores exceeding 90, placing its predictions within the resolution range of experimental methods. This near-experimental accuracy was achieved through a deep neural network architecture that learns the relationships between amino acids and utilizes these relationships to predict the 3D structure of proteins, representing a significant advancement over previous computational approaches.
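The GDT_TS metric mentioned above can be sketched from its standard definition: the mean fraction of Cα atoms within 1, 2, 4, and 8 Å of their reference positions after optimal superposition, scaled to 0–100. The distance values below are illustrative inputs, not CASP data:

```python
import numpy as np

def gdt_ts(ca_deviations: np.ndarray) -> float:
    """GDT_TS from per-residue Ca deviations (in angstroms) after
    superposition: mean fraction of residues within the four standard
    cutoffs of 1, 2, 4, and 8 angstroms, scaled to 0-100."""
    fractions = [(ca_deviations <= c).mean() for c in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * float(np.mean(fractions))

# Illustrative deviations: per-cutoff fractions 1/4, 2/4, 3/4, 3/4
print(gdt_ts(np.array([0.5, 1.5, 3.0, 9.0])))  # 56.25
```

A score above 90 means nearly all residues sit within the tighter cutoffs, which is why GDT_TS > 90 is treated as experimental-resolution accuracy.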

The Pairformer module within AlphaFold2 is central to its predictive capabilities by explicitly modeling interactions between all pairs of amino acids within a protein sequence. This is achieved through the calculation of a pairwise representation, where each residue is related to every other residue, capturing both local and long-range dependencies crucial for determining the final folded structure. This pairwise attention mechanism allows the model to infer geometric constraints, such as distances and orientations, between amino acids, enabling accurate reconstruction of the protein’s three-dimensional geometry. By directly addressing residue-residue relationships, the Pairformer bypasses the need for handcrafted features and allows the network to learn these relationships directly from the sequence data.
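A minimal sketch of what "relating each residue to every other residue" means in tensor terms. AlphaFold2 actually builds its pair representation from MSA features via an outer-product-mean; the simple concatenation scheme and dimensions below are an illustrative stand-in, not the real initialization:

```python
import numpy as np

def init_pair_representation(single: np.ndarray) -> np.ndarray:
    """Illustrative pair init: relate every residue i to every residue j
    by concatenating their feature vectors into an (N, N, 2*d) tensor.
    (AlphaFold2 uses an outer-product-mean over MSA features instead.)"""
    n, d = single.shape
    left = np.broadcast_to(single[:, None, :], (n, n, d))   # features of i
    right = np.broadcast_to(single[None, :, :], (n, n, d))  # features of j
    return np.concatenate([left, right], axis=-1)

s = np.arange(6.0).reshape(3, 2)          # 3 residues, 2 features each
pair = init_pair_representation(s)
print(pair.shape)                          # (3, 3, 4)
```

Entry (i, j) of the result carries both residues' features, giving downstream attention layers direct access to every residue-residue relationship.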

The Pairformer module within AlphaFold2 utilizes triangular attention to compute interactions between all pairs of amino acids in a protein sequence; however, this approach exhibits computational complexity scaling cubically with sequence length, O(N³), where N represents the number of amino acids. This cubic scaling arises because each pair of residues (i, j) must attend over every third residue k to update its representation. Consequently, the triangular attention mechanism becomes a significant computational bottleneck when applying AlphaFold2 to larger proteins or protein complexes, limiting its scalability and increasing processing time and memory requirements.
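The source of the cubic cost is visible in a stripped-down NumPy version of triangular attention around starting nodes. This omits AlphaFold2's pair bias, gating, and multi-head structure, and the weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def triangle_attention(pair, wq, wk, wv):
    """Stripped-down triangular attention around starting nodes:
    entry (i, j) attends over all entries (i, k), so the score tensor
    is (N, N, N) -- the source of the cubic cost. Omits the pair-bias
    term, gating, and multiple heads of the real module."""
    q, k, v = pair @ wq, pair @ wk, pair @ wv               # each (N, N, d)
    scores = np.einsum('ijd,ikd->ijk', q, k) / np.sqrt(q.shape[-1])
    return np.einsum('ijk,ikd->ijd', softmax(scores), v)    # (N, N, d)

rng = np.random.default_rng(0)
pair = rng.standard_normal((4, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = triangle_attention(pair, wq, wk, wv)
print(out.shape)  # (4, 4, 8)
```

Doubling N multiplies the size of the `scores` tensor by eight, which is exactly the scaling wall the linear variant is designed to avoid.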

SeedFold and SeedFold-Linear achieve competitive interface prediction success rates compared to Boltz-1 and Protenix-0.5, though a detailed performance breakdown for AlphaFold 3 remains unavailable due to licensing restrictions.

SeedFold: Scaling Prediction with Efficient Attention

SeedFold advances the protein structure prediction capabilities established by AlphaFold2 through two primary modifications: scalable attention modules and increased model and training data scale. While AlphaFold2 demonstrated high accuracy, its computational demands limited scalability. SeedFold addresses this by implementing attention mechanisms designed to reduce computational complexity, enabling the processing of larger protein structures. Concurrently, the model’s capacity has been expanded, and it is trained on a larger dataset comprising both the AlphaFold Database (AFDB) and the MGnify database, resulting in improved predictive performance and generalization to novel protein sequences.

SeedFold improves computational efficiency by implementing linear attention mechanisms. The triangular attention used in models like AlphaFold2 has a computational complexity of O(N³), where N is the sequence length. Linear attention reduces this complexity to O(N²) by approximating the attention matrix, enabling the processing of significantly longer protein sequences. This reduction in complexity directly facilitates the prediction of larger protein structures and complexes that were previously intractable due to computational limitations, while maintaining comparable or improved accuracy.
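SeedFold's own gated linear triangular attention is not spelled out in this summary, but the generic kernel trick behind linear attention can be sketched: with a positive feature map φ, softmax attention is approximated by φ(Q)(φ(K)ᵀV), so the (N, N, N) score tensor is never materialized. The feature map (elu + 1) and weight shapes below are illustrative assumptions:

```python
import numpy as np

def linear_triangle_attention(pair, wq, wk, wv):
    """Kernelized ('linear') attention over the starting-node pattern:
    phi(Q) @ (phi(K)^T V) replaces softmax(Q K^T) V, dropping the cost
    from O(N^3) to O(N^2) in sequence length. phi = elu + 1 is one
    common positive feature map, assumed here for illustration."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
    q, k, v = pair @ wq, pair @ wk, pair @ wv        # each (N, N, d)
    kv = np.einsum('ikd,ike->ide', phi(k), v)        # (N, d, d): O(N^2 d^2)
    z = phi(k).sum(axis=1)                           # (N, d) normalizer
    num = np.einsum('ijd,ide->ije', phi(q), kv)
    den = np.einsum('ijd,id->ij', phi(q), z)[..., None]
    return num / den                                 # (N, N, d)

rng = np.random.default_rng(1)
pair = rng.standard_normal((4, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = linear_triangle_attention(pair, wq, wk, wv)
print(out.shape)  # (4, 4, 8)
```

The key design choice is associativity: computing φ(K)ᵀV first yields a fixed-size (d, d) summary per row, so the per-pair cost no longer grows with N.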

SeedFold leverages data distillation techniques to improve both performance and generalization capabilities. This process involves training on datasets curated from large structural databases, specifically the AlphaFold Database (AFDB) and the MGnify database. AFDB provides a high-quality, diverse set of predicted protein structures, while MGnify offers a broader, though potentially less accurate, collection of structural templates. By training on these distilled datasets, SeedFold benefits from the collective knowledge embedded within these large-scale resources, allowing it to more effectively predict the structures of novel proteins and improve accuracy on challenging cases.

SeedFold achieves state-of-the-art performance through strategic scaling of its core modules. Specifically, both the Pairformer and Structure Module were scaled in terms of width – increasing the number of features – and depth – increasing the number of layers. This scaling process resulted in an lDDT (local distance difference test) score of 0.8889 when predicting the structures of protein monomers, representing a significant improvement over existing methods. The increase in model capacity facilitated by width and depth scaling allows SeedFold to capture more complex relationships within protein sequences and improve the accuracy of its predictions.
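The lDDT score cited above can be sketched from its standard definition: among reference atom pairs closer than a 15 Å inclusion radius, score the fraction of distances preserved within 0.5, 1, 2, and 4 Å, averaged over the four tolerances. The version below is simplified to Cα distance matrices (real lDDT also excludes pairs within the same residue):

```python
import numpy as np

def lddt(ref_dist: np.ndarray, pred_dist: np.ndarray,
         cutoff: float = 15.0) -> float:
    """Simplified lDDT over precomputed Ca distance matrices: for
    reference pairs closer than `cutoff`, average the fraction of
    distances preserved within 0.5, 1, 2, and 4 angstrom tolerances.
    (The full metric also excludes same-residue pairs.)"""
    n = ref_dist.shape[0]
    mask = (ref_dist < cutoff) & ~np.eye(n, dtype=bool)
    diff = np.abs(ref_dist - pred_dist)[mask]
    return float(np.mean([(diff < t).mean() for t in (0.5, 1.0, 2.0, 4.0)]))

pts = np.random.rand(6, 3) * 5.0                      # toy Ca coordinates
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(lddt(d, d))  # a perfect prediction scores 1.0
```

Because lDDT is built from local distance differences rather than a global superposition, it rewards correct local geometry even when domains move relative to one another.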

Sequence length distributions vary significantly across datasets, with AFDB favoring shorter sequences and MGnify containing a greater proportion of longer ones.

Expanding the Horizon: Biomolecular Modeling and Future Prospects

Recent evaluations on the FoldBench benchmark reveal SeedFold as a leading force in biomolecular structure prediction, distinguished by both its accuracy and computational efficiency. The model achieves a DockQ score of 53.21% when predicting the interfaces between antibodies and antigens – critical for understanding immune responses – and further excels with a 65.31% DockQ score for protein-RNA interactions, essential for gene regulation and cellular processes. These results highlight SeedFold’s capacity to accurately model complex biological systems, suggesting its potential to significantly advance research in areas ranging from immunology to RNA biology and beyond.
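For context on the DockQ percentages above, the published DockQ formula (Basu & Wallner, 2016) combines the fraction of native contacts with two scaled interface RMSD terms; the input values below are illustrative, not taken from the benchmark:

```python
def dockq(fnat: float, irmsd: float, lrmsd: float) -> float:
    """DockQ (Basu & Wallner, 2016): average of the fraction of native
    contacts (fnat) and scaled interface/ligand RMSD terms, each mapped
    to (0, 1] via 1 / (1 + (rmsd / d0)^2) with d0 = 1.5 and 8.5 A."""
    def scaled(rmsd: float, d0: float) -> float:
        return 1.0 / (1.0 + (rmsd / d0) ** 2)
    return (fnat + scaled(irmsd, 1.5) + scaled(lrmsd, 8.5)) / 3.0

print(dockq(1.0, 0.0, 0.0))  # 1.0: a perfect interface model
print(dockq(0.5, 2.0, 5.0))  # an intermediate-quality docking pose
```

Scores near 1.0 indicate near-native interfaces, so success rates in the 50–65% range on hard targets like antibody-antigen complexes represent substantial headroom gained over earlier methods.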

Recent advancements in biomolecular modeling have centered on refining the attention mechanisms within computational algorithms, and innovations like gated linear triangular attention represent a significant leap forward. This technique builds upon standard attention by introducing a gating mechanism, effectively controlling the flow of information and enhancing computational efficiency. Crucially, the incorporation of Layer Normalization further stabilizes the training process, preventing performance degradation and allowing for more robust predictions. By selectively focusing on relevant interactions and minimizing noise, gated linear triangular attention not only improves the accuracy of biomolecular structure prediction but also contributes to the overall stability and scalability of these complex computational models, paving the way for investigations into larger and more intricate biological systems.
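The gating-plus-normalization pattern described above can be sketched generically. The exact placement of the gate and LayerNorm in SeedFold is not specified in this summary, so the wiring below is a hypothetical illustration of the common design:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize the last dimension to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_norm_output(pair: np.ndarray, attn_out: np.ndarray,
                      wg: np.ndarray) -> np.ndarray:
    """Hypothetical gating pattern: a sigmoid gate computed from the
    pair representation scales the attention output element-wise,
    controlling information flow, and LayerNorm keeps the result in a
    stable range during training."""
    gate = 1.0 / (1.0 + np.exp(-(pair @ wg)))  # sigmoid in (0, 1)
    return layer_norm(gate * attn_out)

rng = np.random.default_rng(2)
pair = rng.standard_normal((4, 4, 8))
attn_out = rng.standard_normal((4, 4, 8))
wg = rng.standard_normal((8, 8))
out = gated_norm_output(pair, attn_out, wg)
```

The gate lets the network suppress uninformative updates per channel, while the normalization counters the drift in activation scale that destabilizes deep stacks of such layers.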

The capacity of SeedFold to model complete molecular complexes, mirroring the advancements seen in AlphaFold3, represents a significant leap towards deciphering the complexities of biological systems. Traditionally, biomolecular modeling focused on predicting the structure of individual proteins; however, life’s processes rarely occur in isolation. This expanded capability allows researchers to investigate how molecules interact as functional units, offering insights into everything from cellular signaling pathways to the mechanisms of disease. By accurately predicting the structures of entire complexes, SeedFold facilitates a more holistic understanding of biological interactions, paving the way for rational drug design, the engineering of novel proteins with tailored functions, and a deeper appreciation for the intricate choreography of life at the molecular level.

SeedFold establishes a new standard in biomolecular modeling, achieving a DockQ score of 74.14% for accurately predicting protein-protein interactions and 66.48% for protein-ligand binding – results that currently surpass all competing methods. This heightened predictive power isn’t merely an academic achievement; it directly translates to advancements across multiple scientific disciplines. Researchers can now accelerate the identification of potential drug candidates by rapidly screening for molecules that effectively bind to target proteins, potentially shortening development timelines and reducing costs. Furthermore, the ability to accurately model protein interactions unlocks opportunities for de novo protein design, allowing scientists to engineer novel proteins with tailored functions. Ultimately, SeedFold’s precision promises to not only refine existing knowledge of biological systems but also to illuminate previously inaccessible aspects of life’s fundamental processes.

Gated linear tri-attention consistently outperformed additive linear tri-attention in predicting interface success rates for both antibody-antigen and protein-ligand interactions, leading to its selection as the default attention configuration.

SeedFold’s architecture, detailed in the study, embodies a principle of reduction: a refinement toward essential functionality. The model achieves state-of-the-art performance not through sheer complexity, but through a focused scaling of both data and model size, underpinned by a novel linear attention mechanism. This resonates with Simone de Beauvoir’s observation: “One is not born, but rather becomes a woman.” Just as gender isn’t a pre-defined state but a becoming, SeedFold isn’t simply built; it becomes a powerful predictor through iterative refinement and the distillation of crucial information, stripping away the unnecessary to reveal the core structure of prediction.

Future Directions

The demonstration of SeedFold’s scaling properties, while predictable given sufficient computational resources, does not resolve the fundamental question of why a sequence dictates a structure. The model excels at mimicking the patterns observed in existing data, but offers little insight into the biophysical principles governing protein self-organization. Future iterations will likely reveal diminishing returns; increased parameter counts do not equate to increased understanding. The true challenge lies not in achieving higher accuracy on benchmark datasets, but in developing models capable of generalization beyond the observed.

A critical limitation remains the reliance on distilled datasets. While efficient, this approach inherently introduces bias and obscures the underlying complexity of biological systems. Exploration of models trained directly on raw, uncurated data – accepting the associated noise – may yield more robust and insightful representations. The pursuit of ‘perfect’ data is often a distraction from addressing the inadequacies of the model itself.

Ultimately, the value of SeedFold, and its successors, will be measured not by its ability to predict structures, but by its utility in accelerating the design of novel proteins with desired functions. The structure is merely a prelude; the function is the consequence. Emotion, in this context, is a side effect of structure, and clarity is compassion for cognition.


Original article: https://arxiv.org/pdf/2512.24354.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
