Building Better Drugs with AI’s Building Blocks

Author: Denis Avetisyan


A new artificial intelligence framework leverages fragment-based design and automated tuning to accelerate the discovery of promising drug candidates.

The progression from traditional human-in-the-loop medicinal chemistry and AI engineering, where both disciplines collaborate, to a fully automated agent-to-agent system marks a shift toward increasingly autonomous workflows: first the agentic framework replaces the AI engineer, and ultimately both roles are automated, suggesting a trajectory toward complete automation in the field.

FRAGMENTA, an end-to-end generative model combining fragment-based molecule generation with agentic tuning, demonstrates improved drug lead identification in a cancer drug discovery setting.

Despite advances in generative AI for drug discovery, limited datasets often hinder performance, and current fragment-based methods struggle with both diversity and efficient model tuning. This paper introduces FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization, a novel framework that reframes molecular generation as a vocabulary selection problem and employs an agentic AI system for automated objective refinement. Experiments in cancer drug discovery demonstrate that FRAGMENTA’s agentic tuning not only surpasses traditional human-driven approaches but also identifies significantly more promising drug candidates. Could this represent a paradigm shift towards fully autonomous AI-driven drug lead optimization?


The Molecular Maze: Why Innovation Gets Stuck

The search for novel therapeutic molecules is hampered by the sheer scale of chemical possibility. Estimates suggest that there are potentially $10^{60}$ stable, synthesizable molecules, a number far exceeding the capacity of traditional high-throughput screening or even focused combinatorial libraries to explore effectively. This “chemical space” presents a formidable challenge, as identifying compounds with desired biological activity requires sifting through an astronomically large number of potential candidates. Moreover, many theoretically possible molecules are either difficult or impossible to create in a laboratory setting, further narrowing the viable search space. Consequently, drug discovery often relies on exploring only a tiny fraction of this vast landscape, potentially overlooking promising compounds and limiting innovation.

Despite the initial excitement surrounding deep learning’s application to drug discovery, a significant hurdle remains: the limited ability of these models to reliably generalize beyond the training data. While capable of identifying patterns, current deep learning approaches frequently generate molecules that, though novel, are impractical to synthesize in a laboratory setting. This lack of ‘synthesizability’ stems from the models’ insufficient understanding of chemical reactions and the constraints of real-world chemistry; a molecule may appear valid according to the model’s criteria, yet require reaction conditions or reagents that are impossible or prohibitively expensive to achieve. Consequently, a substantial portion of computationally-designed molecules are ultimately unusable, demanding further refinement and highlighting the need for algorithms that prioritize both novelty and feasibility.

The conventional depiction of molecules as simple molecular graphs, while computationally convenient, inherently restricts a deep learning model’s comprehension of nuanced chemical relationships. These graphs typically represent atoms as nodes and bonds as edges, failing to encode crucial three-dimensional information like bond angles, chirality, and conformational flexibility – aspects vital for predicting a molecule’s properties and reactivity. This simplification forces the model to infer complex spatial arrangements from a flattened representation, leading to inaccuracies and limiting its ability to generalize to novel compounds. Consequently, models trained on these graphs often struggle with tasks requiring a precise understanding of molecular geometry, hindering the design of synthesizable and effective drug candidates. A more holistic representation, capable of capturing these intricate details, is therefore crucial for advancing the field of molecular design and discovery.
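To make the limitation concrete, here is a minimal sketch using RDKit (the aspirin SMILES is purely illustrative): a parsed molecule reduces to atom nodes and bond edges, with no geometry anywhere in the structure.

```python
# Minimal sketch (RDKit): a SMILES string parsed into a flat molecular graph.
# Atoms become nodes and bonds become edges, but bond angles, chirality, and
# conformational flexibility are absent unless computed in a separate step.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

# Nodes: atom indices with their element symbols.
nodes = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]

# Edges: bonds between atom indices, with bond type.
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
         for b in mol.GetBonds()]

print(nodes)  # e.g. [(0, 'C'), (1, 'C'), (2, 'O'), ...]
print(edges)  # e.g. [(0, 1, 'SINGLE'), (1, 2, 'DOUBLE'), ...]
# No coordinates appear anywhere in this structure; 3D geometry would require
# a separate embedding step (e.g. AllChem.EmbedMolecule), which the flat graph
# representation discussed above simply omits.
```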

A computational metric, synthetic accessibility, predicts the practical difficulty of synthesizing molecules on a scale of 1 to 10, with lower scores indicating greater feasibility for drug development.
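The paper's exact metric is not detailed here, but the widely used Ertl-Schuffenhauer SA score, which ships with RDKit's Contrib directory, is one standard implementation of this 1-to-10 scale; a minimal sketch:

```python
# Hedged sketch: computing a synthetic accessibility (SA) score with RDKit's
# contrib implementation. Scores run from 1 (easy to make) to 10 (very hard),
# matching the scale described above.
import os
import sys

from rdkit import Chem, RDConfig

# The SA scorer lives in RDKit's Contrib directory rather than the core API.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, for illustration
print(round(sascorer.calculateScore(mol), 2))  # low score => easier to synthesize
```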

Building with LEGOs: Fragment-Based Modeling

Fragment-based models represent molecules not as a single entity, but as collections of smaller, pre-defined units termed molecular fragments. This decomposition enables a more targeted exploration of chemical space by focusing computational resources on the assembly and modification of these fragments rather than evaluating entire molecules. The size of these fragments typically ranges from a few to approximately twenty heavy atoms, facilitating efficient calculations and reducing the combinatorial complexity associated with de novo molecular generation. By operating on these substructures, algorithms can efficiently generate and evaluate a larger number of diverse, synthetically accessible compounds compared to methods that treat molecules as monolithic structures, and allows for the identification of novel compounds with desired properties.
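One common way to obtain such fragments is RDKit's BRICS decomposition, shown below as a hedged sketch; the paper's own fragmentation scheme may differ, and the input molecule is illustrative.

```python
# Minimal sketch: decomposing a molecule into fragments with RDKit's BRICS
# rules, one standard route to the kind of fragment collection described above.
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("c1ccccc1OCCN1CCCCC1")  # illustrative molecule
fragments = sorted(BRICS.BRICSDecompose(mol))
print(fragments)
# e.g. ['[14*]CC[16*]', '[16*]c1ccccc1', ...] -- dummy atoms ([n*]) mark the
# attachment points where fragments can later be recombined.
```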

Fragment-based models often employ core structural frameworks, such as the Bemis-Murcko scaffold, to standardize and guide the assembly of molecular fragments. The Bemis-Murcko scaffold defines a set of ring systems representing the core of a molecule, effectively ignoring side chains and allowing for comparisons based on fundamental structure. By anchoring fragment linking and growth to these predefined scaffolds, the models maintain structural integrity and avoid the generation of chemically invalid or unstable compounds. This scaffold-centric approach facilitates the efficient exploration of chemical space while ensuring that resulting molecules adhere to established principles of chemical feasibility and are more likely to be synthesizable.
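RDKit provides a direct implementation of this scaffold extraction; a minimal sketch (the input molecule is only an example):

```python
# Minimal sketch: extracting the Bemis-Murcko scaffold with RDKit. The
# scaffold keeps ring systems and their linkers while stripping side chains,
# giving the anchoring framework described above.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles("CCOc1ccc2nc(S(N)(=O)=O)sc2c1")  # illustrative molecule
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
print(Chem.MolToSmiles(scaffold))
# The fused benzothiazole ring system remains; the ethoxy and sulfonamide
# side chains are stripped away.
```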

Utilizing pre-existing, readily available molecular fragments as building blocks demonstrably improves the synthesizability of generated compounds. By assembling molecules from validated fragment libraries, the need for de novo synthesis of complex substructures is reduced, streamlining the chemical process and increasing the probability of successful compound creation. This fragment-based approach also inherently facilitates the generation of molecular diversity; variations in fragment selection and linking strategies allow for the exploration of a wider range of chemical space compared to methods reliant on complete molecular design. The use of commercially available fragments further accelerates the drug discovery pipeline by reducing lead times associated with sourcing necessary chemical components.

Fragment-based representations offer advantages over simple graph-based methods in molecular representation due to their ability to capture more nuanced chemical information. Graph-based approaches typically treat atoms and bonds as uniform entities, limiting the expression of structural features and chemical properties. In contrast, fragment-based methods decompose molecules into pre-defined, chemically meaningful units – such as aromatic rings, alkyl chains, or heterocycles – allowing the model to explicitly represent these substructures. This decomposition facilitates the capture of complex relationships between molecular fragments, including their connectivity, spatial arrangement, and chemical environment, leading to a more expressive and potentially discriminative feature space for downstream tasks like property prediction or similarity searching. The explicit representation of these building blocks also enables the incorporation of prior chemical knowledge, further enhancing the model’s capacity to generalize and accurately represent molecular characteristics.
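As a toy illustration of such a representation, the sketch below builds a presence/absence vector over a shared BRICS fragment vocabulary; a real system would use a richer featurization, and the molecules here are arbitrary examples.

```python
# Hedged sketch: a simple fragment-occurrence representation. Each molecule
# becomes a binary vector over a shared fragment vocabulary, making chemically
# meaningful substructures explicit in the feature space.
from rdkit import Chem
from rdkit.Chem import BRICS

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1OCCN1CCCCC1"]  # illustrative set
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Decompose each molecule once, then pool all fragments into one vocabulary.
frag_sets = [set(BRICS.BRICSDecompose(m)) for m in mols]
vocab = sorted(set().union(*frag_sets))

# Represent each molecule as presence/absence over that vocabulary.
vectors = [[int(f in fs) for f in vocab] for fs in frag_sets]
for s, v in zip(smiles, vectors):
    print(s, v)
```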

This comparison demonstrates the differences in fragment selection between the implemented approaches.

Orchestrating Intelligence: Agentic AI Takes the Stage

Agentic AI systems address the challenge of molecular design by decomposing the task into manageable sub-problems handled by specialized agents. This approach moves beyond monolithic AI models by enabling focused expertise and iterative refinement. Rather than a single model attempting all aspects of design, individual agents concentrate on specific functions – such as knowledge extraction, requirement clarification, vocabulary selection, and model modification. The coordinated operation of these agents allows for a more intelligent and efficient exploration of chemical fragment space, facilitating the discovery of novel molecular structures with desired properties. This modularity enhances both the flexibility and scalability of the design process, allowing for easier adaptation to new data and objectives.

The agentic AI system incorporates both an Extract Agent and a Query Agent to refine the molecular design process through interaction with human experts. The Extract Agent processes feedback – typically in the form of evaluations or corrections to proposed molecular fragments – and converts it into structured, machine-readable data. This structured knowledge is then used to update the system’s understanding of desirable molecular characteristics. Conversely, the Query Agent addresses ambiguity in initial requirements or design goals. When the system encounters unclear specifications, the Query Agent formulates targeted questions to the expert, clarifying the intent and ensuring the generative model receives precise instructions, thereby reducing iterative refinement cycles.
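The paper does not spell out these agents' interfaces, so the sketch below is purely hypothetical: the class names, the feedback format, and the clarification heuristic are all invented to illustrate the division of labor.

```python
# Hypothetical sketch of the Extract/Query agent roles described above.
# All names and behaviors here are illustrative inventions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Feedback:
    fragment_smiles: str  # fragment the expert commented on
    verdict: str          # e.g. "too reactive", "good scaffold"

class ExtractAgent:
    """Converts free-form expert feedback into structured records."""
    def extract(self, raw_feedback: str) -> Feedback:
        # Stub: a deployed system would use an LLM to parse arbitrary text.
        fragment, _, verdict = raw_feedback.partition(":")
        return Feedback(fragment.strip(), verdict.strip())

class QueryAgent:
    """Asks targeted questions when a design requirement is ambiguous."""
    def clarify(self, requirement: str) -> str | None:
        if "potent" in requirement and "target" not in requirement:
            return "Potent against which target, and at what activity threshold?"
        return None  # requirement is specific enough to pass through

knowledge_base = [ExtractAgent().extract("c1ccncc1 : good scaffold")]
print(QueryAgent().clarify("generate potent molecules"))
```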

LVSEF (Learned Vocabulary Selection via Exploration and Feedback) employs a Q-Learning reinforcement learning approach to dynamically refine the molecular fragment vocabulary used during de novo molecular design. This process optimizes the balance between exploration of chemical space – maximizing molecular diversity – and exploitation of known synthesizable motifs. The Q-Learning algorithm learns a policy that selects fragment vocabularies based on observed rewards, where rewards are determined by metrics assessing both the novelty of generated molecules and their predicted synthesizability, typically estimated using retrosynthetic analysis tools. By iteratively refining the vocabulary based on this feedback loop, LVSEF aims to generate molecules that are both structurally diverse and realistically synthesizable, surpassing the performance of static or randomly selected fragment sets.
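A hedged sketch of the core loop, treating vocabulary selection as a bandit-style Q-learning problem: the reward function is a random placeholder standing in for the novelty-and-synthesizability scoring described above, and the fragment pool, rates, and update rule are illustrative assumptions rather than the paper's formulation.

```python
import random

# Toy pool of candidate fragments (attachment points marked with [*]).
fragments = ["[*]c1ccccc1", "[*]C(=O)N[*]", "[*]CCO", "[*]c1ccncc1"]
q_table = {f: 0.0 for f in fragments}  # learned value of including each fragment
alpha, epsilon = 0.1, 0.2              # learning rate, exploration rate

def reward(fragment: str) -> float:
    # Placeholder: a real system would generate molecules with the updated
    # vocabulary and score their novelty and synthesizability (e.g. via
    # retrosynthetic analysis), as the text describes.
    return random.uniform(0.0, 1.0)

for episode in range(200):
    # Epsilon-greedy: occasionally explore a random fragment, otherwise
    # exploit the fragment with the highest learned value.
    if random.random() < epsilon:
        choice = random.choice(fragments)
    else:
        choice = max(q_table, key=q_table.get)
    # Bandit-style Q-update toward the observed reward.
    q_table[choice] += alpha * (reward(choice) - q_table[choice])

# Highest-valued fragments form the refined vocabulary.
print(sorted(q_table.items(), key=lambda kv: -kv[1]))
```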

The Code Agent functions as the central mechanism for iterative refinement within the agentic AI system. Receiving refined vocabulary selections and expert feedback distilled by other agents, it directly modifies the parameters and architecture of the generative model responsible for proposing molecular designs. This modification isn’t a one-time adjustment; the Code Agent continuously updates the model based on ongoing evaluation of generated designs, effectively implementing a closed-loop optimization process. These updates can include adjustments to the model’s loss function, changes to the sampling strategy, or even alterations to the underlying neural network structure, all aimed at improving the quality, diversity, and synthesizability of subsequent molecular proposals. The agent’s actions are guided by the goal of aligning the generative model’s output with the desired properties and constraints identified by the system.
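One plausible reading of this loop, sketched below with invented names and thresholds, is a weighted multi-term objective whose weights the Code Agent nudges after each evaluation round; the paper's actual update mechanism may be considerably more involved, up to the architectural changes mentioned above.

```python
# Hypothetical sketch of the Code Agent's closed-loop objective adjustment.
# The loss weights, report fields, and thresholds are illustrative assumptions.
weights = {"docking": 1.0, "diversity": 0.5, "synthesizability": 0.5}

def code_agent_update(eval_report: dict) -> None:
    # If generated molecules are predicted hard to make, shift weight toward
    # synthesizability for the next tuning round.
    if eval_report["mean_synthesizability"] < 0.4:
        weights["synthesizability"] += 0.1
    # If too few molecules clear the -6 docking threshold, emphasize binding.
    if eval_report["frac_below_minus6"] < 0.1:
        weights["docking"] += 0.1

code_agent_update({"mean_synthesizability": 0.3, "frac_below_minus6": 0.05})
print(weights)  # {'docking': 1.1, 'diversity': 0.5, 'synthesizability': 0.6}
```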

This multi-agent system iteratively refines molecular designs by using feedback from medicinal chemists to update a knowledge base and adjust a generative model’s objective function through a process of evaluation, querying, extraction, and modification.

FRAGMENTA: A New Synthesis in Drug Discovery

The FRAGMENTA framework represents a novel convergence of computational drug design strategies, seamlessly integrating fragment-based modeling with the capabilities of agentic artificial intelligence. This synergistic approach moves beyond traditional methods by leveraging the precision of fragment-based techniques – which build molecules from smaller, pre-defined units – with the autonomous exploration and optimization facilitated by AI agents. Rather than operate as separate entities, these components are interwoven into a cohesive pipeline, enabling FRAGMENTA to efficiently generate and refine molecular structures with a heightened focus on target protein binding. The result is a drug design process that is not only accelerated, but also capable of identifying promising lead candidates that might be overlooked by conventional techniques, ultimately streamlining the path from initial concept to potential therapeutic intervention.

FRAGMENTA’s core innovation lies in its synergistic combination of fragment-based modeling and agentic artificial intelligence, resulting in the de novo generation of molecules exhibiting significantly enhanced binding potential. Traditional drug discovery often struggles with the vast chemical space, but by initiating design with small, readily synthesizable molecular fragments, FRAGMENTA focuses computational power on promising substructures. The subsequent application of agentic AI refines these fragments, iteratively optimizing their structure to maximize predicted affinity for the target protein – as quantified by docking scores. Improved docking scores directly correlate with a higher probability of successful binding, signifying that FRAGMENTA doesn’t simply produce more molecules, but rather molecules demonstrably more likely to become effective therapeutic agents. This approach circumvents common pitfalls in drug design by prioritizing molecules with inherent structural advantages, potentially accelerating the identification of viable drug candidates.

The FRAGMENTA framework strategically incorporates GENTRL, a generative model proficient in assembling molecules from pre-defined fragments, to enhance the drug discovery process. This integration allows for the efficient generation of diverse chemical structures, focusing on building molecules from known, stable building blocks rather than de novo design. By leveraging GENTRL’s ability to explore a vast chemical space with fragment-based components, FRAGMENTA can rapidly prototype and evaluate potential drug candidates, improving the likelihood of identifying compounds with favorable binding properties. This approach circumvents challenges associated with synthesizing unstable or unrealistic molecular structures, ultimately accelerating the identification of high-affinity lead compounds for further pharmaceutical development.

The FRAGMENTA framework demonstrably excels at identifying promising drug candidates, having successfully generated thirteen high-affinity leads – molecules predicted to strongly bind to their target proteins, as indicated by a docking score of -6 or less. This represents a substantial improvement over conventional methods, nearly doubling the number of viable candidates discovered through traditional approaches. A lower docking score signifies a more stable and favorable interaction between the molecule and the protein, suggesting a higher potential for therapeutic effect. The sheer volume of these high-affinity leads provides a significantly broader foundation for subsequent optimization and development, potentially accelerating the drug discovery timeline and increasing the likelihood of identifying a clinically effective compound.
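The selection criterion itself is simple to state in code; the sketch below applies the -6 cutoff to made-up scores, where a more negative value means more favorable predicted binding.

```python
# Minimal sketch of the lead-selection criterion described above: keep
# candidates whose predicted docking score is -6 or lower. All names and
# scores here are hypothetical.
candidates = {"mol_A": -7.3, "mol_B": -5.1, "mol_C": -6.0, "mol_D": -8.2}

leads = {name: score for name, score in candidates.items() if score <= -6.0}
print(leads)  # {'mol_A': -7.3, 'mol_C': -6.0, 'mol_D': -8.2}
```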

Recent trials within a functioning pharmaceutical laboratory showcased the practical benefits of integrating artificial intelligence into the drug discovery process. Specifically, a collaborative “Human-Agent” configuration, in which researchers work alongside an AI system, achieved an impressive 86% improvement in identifying promising initial compounds, known as “hits,” compared to conventional methods. This substantial gain wasn’t merely theoretical; it represents a significant acceleration of the early stages of drug development, potentially reducing both the time and resources required to bring new therapies to market. The results highlight the power of combining human expertise with the computational speed and predictive capabilities of advanced AI, offering a compelling vision for the future of pharmaceutical research.

The FRAGMENTA framework’s fully autonomous, Agent-Agent configuration demonstrated a remarkable capacity for de novo drug design, identifying eleven high-affinity lead candidates – molecules predicted to strongly bind to target proteins, as indicated by docking scores of -6 or less. This performance notably surpasses that of conventional human-in-the-loop lead optimization strategies, suggesting a potential paradigm shift in pharmaceutical innovation. By leveraging agentic AI to autonomously explore chemical space using molecular fragments, the system effectively navigates complex relationships between molecular structure and protein binding, accelerating the discovery process and potentially unlocking novel therapeutic interventions with greater efficiency than traditional methods.

LVSEF iteratively decomposes molecules into fragments, stores novel ones in a Q-table, and uses reward-based reinforcement learning to generate new molecules with improved properties.

The pursuit of automated drug lead optimization, as demonstrated by FRAGMENTA, inevitably introduces new layers of complexity. This framework, with its agentic tuning and fragment-based generation, represents a compelling attempt to navigate that complexity, yet it’s a temporary reprieve. As Marvin Minsky observed, “The more we learn about intelligence, the more we realize how much we don’t know.” FRAGMENTA’s success in a specific cancer drug discovery setting is noteworthy, but the underlying principles (vocabulary selection, reinforcement learning) will eventually become the technical debt of the next iteration. Architecture isn’t a diagram; it’s a compromise that survived deployment, and FRAGMENTA, however elegant, is simply another stepping stone in an endlessly evolving landscape.

The Road Ahead

FRAGMENTA, like all elegant constructions, simply postpones the inevitable. The framework’s success hinges on a vocabulary: a curated selection of molecular fragments. It will be fascinating to observe the point at which that vocabulary becomes the limiting factor, the point where novelty is sacrificed for the comfort of known chemical space. Anything ‘self-healing’ in these systems hasn’t broken enough yet. The true test won’t be performance on benchmark datasets, but the first production deployment where the edge cases bloom.

The agentic tuning, while promising, introduces another layer of opacity. Automated optimization is, at best, a sophisticated form of error concealment. Documentation, as always, is collective self-delusion; a comforting narrative constructed around a fundamentally unpredictable process. The real advancement won’t be generating more leads, but in developing tools to reliably diagnose failure: to understand why a particular molecule was rejected, beyond the opaque pronouncements of a reinforcement learning agent.

Ultimately, the field will likely gravitate toward systems that embrace, rather than obscure, their limitations. If a bug is reproducible, the system is stable; that’s the metric that matters. FRAGMENTA is a step toward automation, certainly, but the next generation will require a hard reckoning with the inherent messiness of chemical innovation.


Original article: https://arxiv.org/pdf/2511.20510.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
