Designing Proteins from Scratch: A New Diffusion Approach

Author: Denis Avetisyan


Researchers have developed a novel method for creating entirely new protein structures with precise binding capabilities, pushing the boundaries of protein engineering.

SeedProteo successfully generated binders (molecules designed to bind to specific protein targets) for notoriously difficult multi-chain proteins, including H1 (as a dimer), VEGF-A (as a dimer), and TNF-α (as a trimer), demonstrating its capacity to address complex protein interactions and meet predefined computational success criteria.

SeedProteo leverages diffusion modeling and all-atom accuracy to achieve state-of-the-art de novo protein binder design with improved sequence-structure consistency.

Despite advances in computational protein design, generating high-quality de novo protein structures, particularly those with desired binding properties, remains a significant challenge. The paper “SeedProteo: Accurate De Novo All-Atom Design of Protein Binders” presents a diffusion-based model that repurposes a state-of-the-art folding architecture to achieve substantial improvements in both unconditional protein generation and targeted binder design. By focusing on all-atom modeling and enhancing sequence-structure consistency, SeedProteo surpasses existing open-source methods in terms of design success rates, structural diversity, and novelty. Could this approach unlock new possibilities for designing proteins with tailored functions and therapeutic applications?


The Illusion of Control: Limitations in Protein Design

Contemporary protein design strategies, prominently including inverse folding techniques, frequently encounter limitations when addressing the intricate web of all-atom interactions within a protein structure. These methods often simplify the energetic landscape, treating atoms as coarse-grained units to manage computational demands, but this simplification can obscure critical details governing protein folding and stability. The delicate balance of van der Waals forces, electrostatic interactions, and hydrogen bonds – all operating at the atomic level – profoundly influences a protein’s final conformation and function. Consequently, designs generated through these streamlined approaches may exhibit reduced stability, unexpected conformational flexibility, or fail to achieve the intended biological activity, highlighting the persistent challenge of accurately modeling the full complexity of interatomic forces.

Computational protein design, while increasingly sophisticated, faces significant hurdles stemming from the sheer complexity of accurately simulating a protein’s journey to its final, functional shape. Existing methods often rely on approximations to manage the immense computational cost of exploring all possible conformations – the myriad ways a protein chain can fold. This simplification, however, can lead to an incomplete picture of the protein’s conformational landscape, obscuring crucial intermediate states and potentially overlooking the most stable or functional forms. Consequently, true de novo design – creating entirely new proteins from scratch – remains a challenge, as these methods may struggle to reliably predict how a designed sequence will actually fold and behave in a biological context. The inability to fully capture this dynamic landscape limits the success rate of computationally designed proteins and underscores the need for more efficient and accurate modeling techniques.

The creation of functional and stable proteins hinges on a precise understanding of all-atom interactions, yet accurately modeling these forces remains a significant obstacle in protein design. These interactions, encompassing everything from van der Waals forces to electrostatic attractions and hydrogen bonds, dictate a protein’s three-dimensional structure and, consequently, its biological activity. Current computational methods often simplify these complex forces, leading to designs that may appear promising in silico but fail to fold correctly or exhibit limited stability in real-world conditions. This bottleneck isn’t merely a matter of computational power; it’s a fundamental challenge in capturing the nuanced interplay of atomic forces that govern protein behavior, demanding innovative approaches that move beyond simplified models and embrace the full complexity of the protein energy landscape to reliably engineer novel biomolecules.

SeedProteo outperforms open-source methods in both successfully designing binders for ten target proteins and maximizing the diversity of those designs.

SeedProteo: Mapping the Protein Manifold

SeedProteo employs a diffusion-based generative model, a probabilistic approach in which noise is progressively added to structural data during a forward process and a network learns to reverse that corruption, generating novel protein structures. Training on a dataset of known protein structures lets the model learn the underlying distribution of all-atom coordinates. Its generative capability comes from sampling this learned distribution, which yields new structures that statistically resemble those in the training data. Specifically, the model predicts the parameters of a noise distribution, allowing it to iteratively refine a random noise vector into a valid, all-atom protein structure. This contrasts with discriminative models that predict properties of existing structures; SeedProteo actively creates new structural possibilities.
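
The forward and reverse processes described above can be sketched in a few lines of NumPy. This is a toy illustration, not SeedProteo's implementation: the noise schedule, the function names `add_noise` and `sample`, and the `oracle` denoiser (which simply returns the known clean coordinates in place of a trained network) are all assumptions made for the example.

```python
import numpy as np

def add_noise(x0, sigma, rng):
    """Forward process: corrupt clean coordinates with Gaussian noise of scale sigma."""
    return x0 + sigma * rng.standard_normal(x0.shape)

def sample(denoiser, shape, sigmas, rng):
    """Reverse process: start from pure noise and step down the noise schedule."""
    x = sigmas[0] * rng.standard_normal(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x_hat = denoiser(x, sigma)                 # guess at the clean structure
        direction = (x - x_hat) / sigma            # points from the guess toward the noise
        x = x + (sigma_next - sigma) * direction   # Euler step to the next noise level
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 14, 3))   # toy "structure": 8 residues x 14 atoms x xyz

# Hypothetical schedule: geometrically decreasing noise levels, ending at zero.
sigmas = np.geomspace(160.0, 0.05, 50).tolist() + [0.0]

# Stand-in for a trained network: an oracle that always returns the clean coordinates.
oracle = lambda x, sigma: x0

generated = sample(oracle, x0.shape, sigmas, rng)
```

With the oracle in place, each step shrinks the remaining noise by the ratio of consecutive noise levels, so the loop ends at the clean coordinates. A trained denoiser only approximates that guess, and this is where new structures come from.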

SeedProteo builds upon the architectural framework of AlphaFold3, specifically inheriting its attention mechanisms and network design. This foundation allows SeedProteo to process and understand complex relationships within protein structures. However, SeedProteo diverges from AlphaFold3 by prioritizing generative capabilities and all-atom accuracy. Modifications to the network, including adjustments to the loss function and training data, facilitate the generation of novel protein structures with precise atomic coordinates. This focus contrasts with AlphaFold3’s primary function of structure prediction from amino acid sequences; SeedProteo is designed to create structures, not merely predict them, and maintains high fidelity at the atomic level throughout the generative process.

SeedProteo distinguishes itself from many protein design methods by directly learning from three-dimensional structural data, rather than relying primarily on amino acid sequences or computationally defined energy functions. Traditional approaches often predict structure from sequence, introducing potential inaccuracies due to the complex relationship between the two, or utilize scoring functions which may not fully capture the intricacies of protein folding and stability. By training directly on known protein structures, SeedProteo bypasses these limitations, enabling the generation of novel designs that are inherently consistent with observed structural features and avoiding biases introduced by sequence-based predictions or simplified energy landscapes. This direct structural learning approach allows for greater fidelity in all-atom modeling and expands the potential for designing proteins with desired structural characteristics.

SeedProteo adapts a folding network to generative design by altering its input channels, demonstrating the framework’s versatility with a minimal architectural change.

Architectural Foundations: Geometry, Coordinates, and Sequences

SeedProteo utilizes an Atom14 representation to define protein structures, which encodes the 3D coordinates of up to 14 heavy atoms per amino acid residue: the four backbone atoms (N, Cα, C, O) and up to ten side-chain atoms, beginning with Cβ. Residues with fewer than 14 heavy atoms leave the remaining slots empty. This detailed representation captures essential geometric information, including bond lengths, bond angles, and dihedral angles, necessary for accurate modeling of protein folding and conformation. By giving every residue the same fixed number of atom slots, SeedProteo balances computational efficiency against the ability to represent the crucial structural features of proteins with sufficient fidelity for downstream tasks like structure prediction and design.
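
As a rough illustration of the fixed-slot layout, the sketch below builds a padding mask for a short sequence. The heavy-atom counts are standard chemistry (terminal oxygen excluded); the helper name `atom14_mask` and the array layout are assumptions for the example, not taken from the paper.

```python
import numpy as np

# Heavy atoms per residue type: 4 backbone atoms plus the side chain.
HEAVY_ATOMS = {
    "G": 4, "A": 5, "S": 6, "C": 6, "V": 7, "T": 7, "P": 7, "I": 8, "L": 8,
    "D": 8, "N": 8, "M": 8, "E": 9, "Q": 9, "K": 9, "H": 10, "F": 11, "R": 11,
    "Y": 12, "W": 14,
}

def atom14_mask(sequence):
    """Return an (L, 14) boolean mask: True where a slot holds a real atom."""
    counts = np.array([HEAVY_ATOMS[aa] for aa in sequence])
    return np.arange(14)[None, :] < counts[:, None]

seq = "GAW"
mask = atom14_mask(seq)                 # glycine fills 4 slots, tryptophan all 14
coords = np.zeros((len(seq), 14, 3))    # one xyz triple per slot; padded slots stay zero
```

Tryptophan is the only residue that fills all 14 slots, which is why 14 is enough for every standard amino acid.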

Pairformer blocks are utilized to efficiently process the 3-dimensional atomic coordinates that define protein structures, enabling the model to understand spatial relationships between atoms. These blocks leverage attention mechanisms to identify coordinate pairings crucial for structural stability and function. Complementing this, a Markov Random Field (MRF) is employed for sequence sampling, which allows the model to generate diverse and plausible amino acid sequences. The MRF defines probabilities based on dependencies between adjacent amino acids, ensuring that generated sequences adhere to biophysical principles and are likely to fold into stable protein structures. This combination of Pairformer blocks and the MRF facilitates both accurate structural modeling and the generation of novel protein sequences.
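
To make the sequence-sampling step concrete, here is a minimal sketch of Gibbs sampling from a chain-structured MRF with per-position fields `h` and nearest-neighbor couplings `J`. The parameters below are random placeholders; in a real model they would come from the network. The neighbor-only coupling follows the description above rather than the paper's exact formulation, which may differ.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def gibbs_sample(h, J, n_sweeps, rng):
    """Sample from p(s) proportional to exp(sum_i h[i, s_i] + sum_i J[i, s_i, s_{i+1}]).

    h has shape (L, 20); J has shape (L - 1, 20, 20).
    """
    L, K = h.shape
    s = rng.integers(K, size=L)                   # random starting sequence
    for _ in range(n_sweeps):
        for i in range(L):
            logits = h[i].copy()
            if i > 0:
                logits += J[i - 1, s[i - 1], :]   # coupling to the left neighbor
            if i < L - 1:
                logits += J[i, :, s[i + 1]]       # coupling to the right neighbor
            p = np.exp(logits - logits.max())     # stable softmax
            s[i] = rng.choice(K, p=p / p.sum())   # resample position i
    return "".join(AA[k] for k in s)

rng = np.random.default_rng(0)
L, K = 12, len(AA)
h = rng.normal(size=(L, K))                # placeholder fields
J = 0.1 * rng.normal(size=(L - 1, K, K))   # placeholder couplings
sequence = gibbs_sample(h, J, n_sweeps=20, rng=rng)
```

Because each position is resampled conditioned on its neighbors, repeated calls give different sequences that still respect the same pairwise preferences, which is the source of diversity the paragraph describes.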

Training the generative model combines several specialized loss functions to optimize both accuracy and structural diversity. Distogram Loss minimizes the error between predicted and actual residue-residue distances, enforcing geometric constraints. Smooth LDDT Loss, based on the Local Distance Difference Test, promotes high-quality local structures and overall model confidence. Coordinate Diffusion Loss encourages the model to generate diverse structures by penalizing deviations from a diffused coordinate distribution during the sampling process; this loss function operates directly on the 3D coordinates of atoms, rather than intermediate representations, to improve structural fidelity and explore a broader conformational space.
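
Two of these losses can be written compactly. The sketch below uses common formulations (cross-entropy over binned distances for the distogram, and a sigmoid-smoothed LDDT with the usual 0.5, 1, 2 and 4 Å thresholds); the bin edges, the 15 Å inclusion cutoff, and the function names are assumptions, and SeedProteo's exact definitions and weights are not given in this article.

```python
import numpy as np

def pairwise_dist(x):
    """All-pairs Euclidean distances for coordinates of shape (N, 3)."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def distogram_loss(logits, true_coords, bin_edges):
    """Cross-entropy between predicted bin logits (N, N, B) and true distance bins.

    B must equal len(bin_edges) + 1.
    """
    target = np.digitize(pairwise_dist(true_coords), bin_edges)   # (N, N) bin indices
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target[..., None], axis=-1)
    return float(-picked.mean())

def smooth_lddt_loss(pred_coords, true_coords, cutoff=15.0):
    """One minus a differentiable LDDT: sigmoids replace the hard thresholds."""
    err = np.abs(pairwise_dist(pred_coords) - pairwise_dist(true_coords))
    score = np.mean([1.0 / (1.0 + np.exp(err - t)) for t in (0.5, 1.0, 2.0, 4.0)], axis=0)
    # Score only pairs that are close in the true structure, excluding self-pairs.
    mask = (pairwise_dist(true_coords) < cutoff) & ~np.eye(len(true_coords), dtype=bool)
    return float(1.0 - (score * mask).sum() / mask.sum())
```

Because the sigmoids never reach 1, the smooth LDDT loss stays above zero even for a perfect prediction; what matters for training is that it decreases as local distances improve.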

Secondary structure prediction serves as a key conditioning input for the generative model, significantly influencing the resulting protein designs. This prediction, typically encompassing elements like alpha-helices and beta-sheets, provides a structural scaffold that constrains the conformational space explored during generation. By incorporating predicted secondary structure elements, the model is biased towards producing designs that align with known physical principles governing protein folding and stability. This pre-conditioning improves the likelihood of generating plausible and realistic protein structures, reducing the computational cost of exploring improbable conformations and enhancing the overall quality of the generated designs. The predicted secondary structure information is integrated into the generative process, effectively acting as a prior that guides the model towards structurally sound outputs.
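
One simple way such conditioning could be fed to a network is as a per-residue one-hot feature. The sketch below is an assumption for illustration only: the three-state alphabet, the extra “unspecified” channel, and the function name are not taken from the paper.

```python
import numpy as np

SS_CLASSES = "HEL-"   # helix, strand, loop, and "-" for positions left unspecified

def encode_secondary_structure(ss):
    """Turn a string such as 'HHHH-EEE' into an (L, 4) one-hot conditioning array."""
    idx = np.array([SS_CLASSES.index(c) for c in ss])
    return np.eye(len(SS_CLASSES))[idx]

features = encode_secondary_structure("HHHHLL-EEE")
```

Leaving a position as “-” would let the model choose freely there, while labeled positions bias generation toward the requested element.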

The Illusion of Designability: Expanding the Boundaries

SeedProteo exhibits a remarkable capacity for de novo protein generation, constructing diverse and structurally sound proteins without relying on existing templates. The model achieves a success rate exceeding 60% when generating sequences 1000 amino acids long, whereas existing methods rapidly decline to near-zero performance at comparable scales. This capability stems from SeedProteo’s innovative approach to sequence design, which enables the creation of novel proteins with a high probability of folding into stable, three-dimensional structures and represents a substantial step forward in computational protein design.

Rigorous assessment of de novo protein designs requires robust structural validation, and SeedProteo integrates SeedFold for this purpose. SeedFold, functioning analogously to the advanced AlphaFold3, meticulously evaluates the designability and biophysical plausibility of generated structures. This process isn’t simply a check for foldability; it assesses whether the designed sequence is intrinsically likely to adopt the predicted conformation, minimizing the risk of unstable or non-functional proteins. By leveraging SeedFold’s predictive power, researchers can confidently filter designs, prioritizing those with a high probability of successfully folding into the intended three-dimensional structure and ultimately, performing the desired function. This validation step is critical for translating computational design into tangible, functional proteins with real-world applications.
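
The article does not say how the refolded structure is compared with the design. A common choice for this kind of self-consistency check is the RMSD after optimal superposition, computed with the Kabsch algorithm. The sketch below assumes both structures are given as matching (N, 3) coordinate arrays, for example Cα atoms.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) point sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)                      # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))          # guard against improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt         # rotation so that P @ R best matches Q
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))
```

A design whose refolded structure superimposes on the intended one with a low RMSD is counted as self-consistent; the specific cutoff is a policy choice.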

SeedProteo’s functionality extends beyond de novo protein creation to encompass the challenging field of binder design, effectively generating proteins with a high potential for strong, specific interactions with desired target molecules. Evaluations on ten benchmark targets reveal that this model currently achieves the highest success rate for binder design among publicly available, open-source methods. This capability is significant, as creating proteins that bind to specific targets is crucial for therapeutic development, diagnostics, and biotechnological applications, offering a powerful new tool for researchers seeking to engineer proteins with tailored functionalities and affinities.

SeedProteo distinguishes itself through consistent generation of highly realistic and diverse protein structures, a feat demonstrated across a benchmark of ten distinct targets. The system not only produces viable protein folds, but achieves a greater variety of those structures – as measured by unique structural clusters – and introduces genuinely novel conformations compared to existing methods. Critically, designs generated by SeedProteo consistently meet stringent criteria for success, exhibiting low interface PAE values (below 1.5), minimal complex RMSD (under 2.5 Å), and high Binder pTM scores exceeding 0.8, collectively validating the structural integrity and potential functionality of the de novo designed proteins.
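
The three thresholds quoted above amount to a simple filter. The sketch below applies them to per-design metrics; the `DesignMetrics` container and the example values are made up for illustration, and the metrics themselves would come from a structure predictor rather than from this code.

```python
from dataclasses import dataclass

@dataclass
class DesignMetrics:
    interface_pae: float   # predicted aligned error across the interface
    complex_rmsd: float    # in Angstroms
    binder_ptm: float      # predicted TM-score of the binder

def passes_filter(m, max_ipae=1.5, max_rmsd=2.5, min_ptm=0.8):
    """Apply the success criteria quoted in the text to one design."""
    return (m.interface_pae < max_ipae
            and m.complex_rmsd < max_rmsd
            and m.binder_ptm > min_ptm)

def success_rate(designs):
    """Fraction of designs that pass all three criteria."""
    if not designs:
        return 0.0
    return sum(passes_filter(d) for d in designs) / len(designs)

# Made-up metric values, for illustration only.
designs = [DesignMetrics(1.2, 1.8, 0.86), DesignMetrics(0.9, 3.1, 0.91)]
rate = success_rate(designs)
```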

Designability is assessed using increasingly strict thresholds for monomer acceptance, as detailed in Appendix 6.

The pursuit of de novo protein design, as demonstrated by SeedProteo, mirrors a fundamental tension between theoretical construction and the inherent limitations of any model. The model achieves improvements in sequence-structure consistency, yet any attempt to predict or generate complex biological structures, even with all-atom modeling, ultimately faces an event horizon of uncertainty. As Friedrich Nietzsche observed, “There are no facts, only interpretations.” SeedProteo offers a powerful interpretive framework for protein folding, but the ‘true’ structure remains elusive, subject to the constraints of the model and the inherent ambiguities of biological systems. The diffusion-based approach, while sophisticated, doesn’t eliminate the possibility that alternative, equally valid designs exist beyond the current computational reach.

What Lies Beyond the Fold?

SeedProteo, and its lineage of diffusion-based protein design tools, presents a technically impressive advance. Each incremental improvement in de novo protein creation, however, merely reframes the enduring question: design relative to what? The model’s success hinges on a sophisticated understanding of sequence-structure relationships, yet biological function rarely conforms to purely structural predictions. The cosmos, indifferent to algorithmic elegance, continues to demand demonstrable activity, not just plausible conformation.

The emphasis on all-atom modeling, while commendable, introduces computational demands that scale exponentially with complexity. Future iterations will inevitably confront the limitations of available resources. One suspects that each new benchmark achieved in binder design will reveal even more subtle, and potentially intractable, criteria for true biological compatibility. The pursuit of sequence-structure consistency, after all, may be a beautiful mathematical exercise, but life is rarely so obliging.

The field now faces a choice: to endlessly refine algorithms for predicting form, or to confront the deeper mystery of function. Perhaps the most profound discoveries will not arise from increasingly accurate simulations, but from acknowledging the irreducible complexity of the systems under investigation. The event horizon of biological reality is perpetually receding, mocking the ambition of complete understanding.


Original article: https://arxiv.org/pdf/2512.24192.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-04 05:43