Author: Denis Avetisyan
Researchers demonstrate that carefully training smaller language models can yield surprisingly powerful results in drug discovery, challenging the current trend of ever-larger AI systems.

A specialized training environment, MMAI Gym, enables a relatively small language model to achieve competitive performance on a range of molecular reasoning and drug discovery tasks.
While large language models have shown promise across diverse fields, their application to complex scientific reasoning, particularly in drug discovery, remains limited by a reliance on scale rather than specialized training. This work, ‘MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery’, introduces a targeted training framework and dataset designed to imbue foundation models with a deeper understanding of molecular data. We demonstrate that a relatively small, purpose-trained ‘Liquid Foundation Model’, trained using the MMAI Gym, can outperform significantly larger models on key drug discovery benchmarks, achieving near specialist-level performance across tasks like molecular optimization and ADMET prediction. Does this signal a shift towards more efficient, specialized models as the future of AI-driven drug discovery?
Navigating the Complexity of Molecular Representations
Conventional machine learning algorithms, designed for sequential or grid-like data, often falter when applied to molecules because of their inherent graph structure – atoms connected by bonds forming intricate, three-dimensional networks. This poses a significant challenge, as accurately predicting a molecule’s properties, such as its reactivity, solubility, or toxicity, requires understanding these complex relationships. Representing molecules as simple vectors or matrices loses crucial information about connectivity and spatial arrangement, diminishing predictive power. Consequently, traditional methods struggle to discern subtle structural differences that can dramatically alter chemical behavior, necessitating specialized techniques capable of capturing and interpreting this graph-based information.
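To make the point concrete, here is a minimal sketch (not taken from the paper) of treating a molecule as a graph rather than a flat feature vector, using plain Python and ethanol as a toy example; the `neighbors` helper is a hypothetical illustration of the connectivity information that vector representations discard.

```python
# A minimal sketch of representing a molecule as a graph rather than a
# flat vector. Ethanol (SMILES: CCO) is encoded as atoms plus a bond list;
# any learning method must respect this connectivity, not just atom counts.
ethanol = {
    "atoms": ["C", "C", "O"],   # heavy atoms only, hydrogens implicit
    "bonds": [(0, 1), (1, 2)],  # single bonds between atom indices
}

def neighbors(mol, i):
    """Return indices of atoms bonded to atom i."""
    return [b for a, b in mol["bonds"] if a == i] + \
           [a for a, b in mol["bonds"] if b == i]

# The central carbon (index 1) is bonded to both the terminal carbon
# and the oxygen; a bag-of-atoms vector could not express this.
print(neighbors(ethanol, 1))  # [2, 0]
```

Graph neural networks and graph-aware tokenizations build on exactly this kind of adjacency structure, propagating information along bonds instead of along an arbitrary serialization order.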
The inherent scalability issues of standard Transformer architectures present a considerable challenge when applied to molecular reasoning, particularly in complex tasks like retrosynthesis – the process of planning a chemical synthesis. These models, while powerful in natural language processing, struggle with the combinatorial explosion that arises when representing molecules as sequences; each atom and bond necessitates a corresponding token, quickly exceeding the input length limitations of conventional Transformers. This restriction hinders their ability to capture long-range dependencies crucial for understanding molecular structure and reactivity. Consequently, representing larger, more complex molecules requires truncation or fragmentation, potentially losing vital information needed for accurate predictions. Researchers are actively exploring modifications to the Transformer architecture, such as sparse attention mechanisms and hierarchical representations, to overcome these scaling limitations and enable effective reasoning about increasingly sophisticated molecular structures.
A major impediment to accelerating drug discovery lies in the substantial need for labeled molecular data; training machine learning models to predict crucial properties or design novel compounds typically demands vast datasets painstakingly curated by experts. This reliance on extensive labeling is not merely a logistical hurdle, but a fundamental bottleneck, as acquiring such data is both time-consuming and expensive. Consequently, researchers are increasingly focused on developing data-efficient methodologies – techniques capable of learning effectively from limited examples. These approaches encompass strategies like self-supervised learning, transfer learning, and active learning, all aimed at maximizing the information gleaned from each labeled data point and reducing the overall labeling burden, ultimately streamlining the process of identifying promising drug candidates.
LFM2: A Scalable Architecture for Efficient Molecular Reasoning
The Liquid Foundation Model 2 (LFM2) utilizes a hybrid architecture designed to overcome the limitations of both traditional State Space Models (SSMs) and attention mechanisms. Standard attention exhibits quadratic scaling with sequence length, becoming computationally expensive for long sequences. SSMs offer linear scaling but can struggle with complex dependencies. LFM2 combines these approaches, integrating elements of both to achieve sub-quadratic scaling – specifically, a complexity between linear and quadratic. This is accomplished by employing SSMs to efficiently process sequential data while incorporating attention-like mechanisms to selectively focus on relevant information within the sequence, enabling effective reasoning over extended contexts without prohibitive computational cost.
LFM2-2.6B utilizes Gated Short Convolution to efficiently process sequential data by applying convolutional layers with a limited receptive field, reducing computational complexity compared to traditional attention mechanisms. This is coupled with Grouped-Query Attention (GQA), a modification of Multi-Head Attention that divides the query, key, and value projections into groups, thereby decreasing the memory bandwidth requirements and accelerating attention computations. The combination of these two techniques allows LFM2-2.6B to achieve faster sequence processing speeds while maintaining a comparable representational capacity to standard attention-based models, as demonstrated through benchmark testing on various language modeling tasks.
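The memory saving from grouped-query attention can be illustrated with a short NumPy sketch. This is a generic, unmasked, single-batch toy, not LFM2's actual implementation; the shapes and group count are illustrative assumptions.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_groups):
    """Minimal grouped-query attention sketch (no masking, single batch).

    q: (n_q_heads, seq, d)    k, v: (n_groups, seq, d)
    Each group of query heads shares one key/value head, shrinking the
    KV cache by a factor of n_q_heads / n_groups.
    """
    n_q_heads, seq, d = q.shape
    heads_per_group = n_q_heads // n_groups
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // heads_per_group             # which shared KV group this head reads
        scores = q[h] @ k[g].T / np.sqrt(d)  # (seq, seq) attention logits
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = weights @ v[g]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV groups -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
print(grouped_query_attention(q, k, v, n_groups=2).shape)  # (8, 4, 16)
```

The output retains the full per-head dimensionality; only the stored keys and values are shared, which is why GQA cuts memory bandwidth rather than representational width.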
Pre-RMSNorm normalization is implemented within the LFM2 architecture to address training instability frequently encountered in large-scale models. RMSNorm is applied before each transformer block’s attention and feedforward layers, rescaling inputs by their root-mean-square so that activation magnitudes remain stable throughout the network (unlike LayerNorm, RMSNorm does not subtract the mean). This pre-normalization strategy mitigates the vanishing and exploding gradient problems that can hinder convergence during training, particularly as model size and sequence length grow. By stabilizing the activations, Pre-RMSNorm enables the use of higher learning rates and larger batch sizes, resulting in significant improvements in training efficiency and overall model performance for LFM2.
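A compact sketch of the two ingredients, written in NumPy as a generic illustration rather than LFM2's code; `sublayer` stands in for any attention or feedforward function.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the last axis.
    No mean subtraction, so it is cheaper than LayerNorm."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_block(x, sublayer, gain):
    """Pre-norm residual block: normalize *before* the sublayer, then
    add the raw input back. The un-normalized residual path is what
    keeps gradients well-scaled in deep stacks."""
    return x + sublayer(rms_norm(x, gain))
```

Placing the norm before the sublayer (rather than after, as in post-norm transformers) leaves an identity path from input to output, which is the standard explanation for why pre-norm models tolerate larger learning rates.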

Validating LFM2 Through Rigorous Drug Discovery Benchmarks
LFM2 is adapted for specific drug discovery applications through both supervised and reinforcement learning fine-tuning methodologies. This process leverages datasets such as the TDC (Therapeutics Data Commons) collection, which provides labeled data for training. Supervised learning utilizes known input-output pairs to guide the model’s parameter adjustments, while reinforcement learning employs a reward signal to optimize performance on defined objectives relevant to drug discovery, such as molecule generation or property prediction. These fine-tuning strategies allow LFM2 to specialize its pre-trained capabilities and achieve higher accuracy and efficiency in targeted applications within the pharmaceutical domain.
Successful fine-tuning of LFM2 for drug discovery applications relies heavily on the choice of optimization algorithms. The AdamW optimizer manages weight decay during training, preventing overfitting and improving generalization on downstream tasks. Complementing AdamW, Group Relative Policy Optimization (GRPO) is employed, particularly in reinforcement learning scenarios, to stabilize training and enhance policy convergence. GRPO normalizes advantages within groups of sampled responses, reducing variance and enabling more effective exploration of the action space, ultimately yielding higher performance during fine-tuning for tasks like retrosynthesis and multi-objective optimization.
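The group-relative normalization at the heart of GRPO reduces to a few lines. This is a sketch of the advantage computation only (the policy-gradient update around it is omitted), assuming a group of completions sampled for the same prompt and scored by some task reward.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: for a group of sampled completions to the
    same prompt, subtract the group mean reward and divide by the group
    standard deviation. 'Better than its siblings' replaces a learned
    value baseline, which is what reduces gradient variance."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled answers to one prompt, scored by a task reward in [0, 1]:
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)  # zero-mean; the best completion gets the largest positive advantage
```

Because the baseline is computed per group rather than by a separate critic network, no extra value model needs to be trained, which is a large part of GRPO's appeal for fine-tuning.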
LFM2’s performance was rigorously evaluated across several established benchmarks for molecular prediction. Testing on USPTO-50K, URSA-expert-2026, FGBench, and MuMO-Instruct demonstrated strong capabilities in retrosynthesis, functional group editing, and multi-objective optimization tasks. Specifically, LFM2 achieved state-of-the-art results on the USPTO-50K benchmark for single-step retrosynthesis, as measured by CC Metrics. Performance on FGBench indicated comparable accuracy to state-of-the-art models in functional group reasoning, while MuMO-Instruct evaluation showed LFM2 achieving the best overall success rate among tested models.
Utilizing BF16 (BFloat16) precision during model training offers substantial computational benefits without compromising predictive performance. BF16 is a 16-bit floating-point format that reduces memory footprint by half compared to FP32 (32-bit floating-point) training, enabling the use of larger batch sizes or more complex models within the same hardware constraints. This reduction in memory usage directly translates to accelerated training speeds, as data transfer between memory and processing units becomes more efficient. Empirical results demonstrate that training with BF16 precision for LFM2 yields comparable performance metrics to FP32 training, indicating minimal loss in accuracy or generalization capability despite the reduced precision.
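Why BF16 halves memory while barely affecting accuracy becomes clear from its bit layout: it is simply the top 16 bits of an IEEE float32, keeping the full 8-bit exponent range and sacrificing only mantissa precision. The sketch below (a generic illustration, not anything from the paper) round-trips a value through that truncation using only the standard library.

```python
import struct

def to_bfloat16_bits(x):
    """Round a float to bfloat16 by keeping the top 16 bits of its
    float32 encoding, with round-to-nearest-even on the dropped bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even
    return ((bits + rounding) >> 16) & 0xFFFF

def from_bfloat16_bits(b):
    """Decode 16 bfloat16 bits back to a float by zero-padding the mantissa."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159
y = from_bfloat16_bits(to_bfloat16_bits(x))
# BF16 preserves float32's dynamic range (same exponent width) but only
# about 3 decimal digits of precision, e.g. 3.14159 -> ~3.14.
print(x, y)
```

Gradients and activations tolerate this precision loss well, while the preserved exponent range avoids the underflow/overflow issues that make FP16 training require loss scaling.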
The LFM2-2.6B-MMAI model, containing 2.6 billion parameters, demonstrates strong performance across a range of molecular prediction tasks, achieving results comparable to, and often exceeding, those of significantly larger models. Specifically, this model achieved the best overall success rate on the MuMO-Instruct benchmark, which evaluates multi-objective molecular optimization, and attained accuracy levels comparable to the state-of-the-art on the FGBench benchmark, designed to assess functional group reasoning capabilities in molecular contexts. This performance indicates an efficient use of parameters, providing competitive results despite being more than an order of magnitude smaller than some contemporary models.
Expanding the Horizon: A Comprehensive Toolkit for Molecular AI
To enable large language models to ‘understand’ molecules, researchers employ a technique called tokenization, effectively translating complex chemical structures into a language the model can process. Simplified Molecular Input Line Entry System (SMILES) and Self-Referencing Embedded Strings (SELFIES) – text-based representations of molecular structures – are broken down into smaller units, or tokens. This process transforms a molecule’s intricate arrangement of atoms and bonds into a sequence of discrete symbols, much like converting words into letters for natural language processing. By representing molecular information in this standardized, tokenized format, the model can learn patterns, predict properties, and even generate novel molecular structures with greater efficiency and accuracy, paving the way for accelerated drug discovery and materials science innovation.
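A regex-based SMILES tokenizer of the kind described above fits in a few lines. The pattern below is a simplified version of a common scheme from the molecular-transformer literature, not the tokenizer used in this work; multi-character atoms like `Cl` and `Br` must be matched before single letters, or the model would see spurious tokens.

```python
import re

# Simplified SMILES tokenizer pattern: bracket atoms, two-letter halogens,
# the organic subset (upper = aliphatic, lower = aromatic), bond/branch/ring
# symbols, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]"
    r"|[=#\-\+\(\)\\\/%@\.]|\d)"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Treating `Cl` or `[NH+]` as single tokens, rather than raw characters, is what lets a language model learn valence and ring patterns instead of spelling.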
The MMAI Gym addresses a critical need in the development of large language models for drug discovery: a reliable and standardized platform for training and assessment. This environment moves beyond general language understanding to specifically challenge models with tasks requiring domain-faithful reasoning – essentially, the ability to think like a chemist. By presenting LLMs with curated datasets and chemically relevant problems, the MMAI Gym facilitates the development of models capable of accurate molecular reasoning, reaction prediction, and retrosynthetic analysis. The structured nature of the Gym allows for rigorous evaluation of LLM performance, identifying strengths and weaknesses in their chemical understanding and guiding further model refinement towards more effective and trustworthy drug design capabilities.
LFM2 distinguishes itself as a significant advancement in molecular design and prediction through a carefully orchestrated synergy of computational elements. Its efficient architecture, built for speed and scalability, is not merely a structural choice but is intrinsically linked to a specialized training regimen focused on the nuances of chemical language and properties. This focused training is then rigorously validated through a robust evaluation framework, ensuring the model’s predictions are both accurate and reliable. The result is a powerful tool capable of not only forecasting molecular characteristics but also intelligently generating novel compounds with desired attributes, representing a considerable leap forward in areas like drug discovery and materials science.
The pursuit of increasingly large models, as highlighted in the development of Liquid Foundation Models, often obscures a more fundamental principle: elegant design. This work demonstrates that focused fine-tuning, utilizing a specialized environment like the MMAI Gym, can yield surprisingly competitive results. As Linus Torvalds once stated, “Talk is cheap. Show me the code.” This sentiment perfectly encapsulates the approach taken here – a pragmatic focus on demonstrable performance rather than sheer scale. The success of this relatively small model, achieving state-of-the-art results on drug discovery tasks, underscores the importance of optimizing for efficiency and targeted learning, proving that a well-structured system, even one of modest size, can outperform a larger, less refined one.
The Road Ahead
The demonstration that a modestly sized language model, carefully cultivated within a focused training environment, can rival the performance of significantly larger counterparts, prompts a necessary re-evaluation of current trajectories. The relentless pursuit of scale, reminiscent of adding floors to a building without reinforcing the foundations, may yield diminishing returns. It is not merely the quantity of parameters, but the elegance of their orchestration, that dictates a system’s capability. One cannot simply replace a component without understanding the circulatory system that sustains it.
Remaining challenges extend beyond simply achieving benchmark scores. The true test lies in generalizability – will this fine-tuned architecture maintain efficacy when confronted with novel chemical spaces, or unforeseen biological mechanisms? The MMAI Gym represents a crucial step, but a comprehensive ‘stress test’ across diverse, real-world datasets is paramount. Moreover, the inherent limitations of language models in representing true physical reality must be addressed. A molecule is not merely a string of characters; it is a three-dimensional entity governed by quantifiable forces.
Future work should prioritize the development of training methodologies that emphasize efficiency and interpretability. The focus must shift from brute-force scaling to cultivating ‘molecular reasoning’ – the ability to not just predict properties, but to understand why those properties exist. A system that can articulate its rationale, and adapt its strategy based on feedback, will ultimately prove far more valuable than one that merely delivers a result.
Original article: https://arxiv.org/pdf/2603.03517.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-05 21:47