Teaching AI to Understand Medicine

Author: Denis Avetisyan


New research demonstrates a method for significantly improving large language models’ ability to reason about complex biomedical information.

Balanced Fine-Tuning (BFT) consistently improves the performance of the DeepSeek-R1-Distill series (14B, 32B, and 70B) across diverse benchmarks, including medical and biological reasoning as assessed through theme and axis-wise evaluations and metrics such as ROUGE-L, ROUGE-1, and ROUGE-2. It mitigates catastrophic forgetting following fine-tuning on specialized datasets such as the OpenAI HealthBench Consensus subset and achieves state-of-the-art results in biological process reasoning.

Balanced Fine-Tuning stabilizes training and adaptively weights samples to enhance generalization and representation learning in large language models applied to biomedical science.

Despite the promise of large language models in life science research, effectively imbuing them with specialized biomedical knowledge remains a significant challenge due to sparse data and overfitting. This paper introduces ‘Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning’, presenting a novel post-training method that enhances LLM performance through stabilized gradients and adaptive sample weighting. Our Balanced Fine-Tuning (BFT) approach demonstrably surpasses standard fine-tuning, enabling improved knowledge acquisition and reasoning in both medical and biological domains – even exceeding the performance of specialized agents like GeneAgent. Could BFT unlock broader applications of LLMs, accelerating discovery across the biomedical landscape?


The Illusion of Intelligence: LLMs and the Limits of Scale

While Large Language Models (LLMs) excel at producing human-quality text, crafting coherent narratives, and even mimicking various writing styles, their capacity for dependable reasoning remains a substantial challenge. These models frequently demonstrate inconsistencies when tasked with complex problem-solving, logical inference, or applying knowledge to unfamiliar situations. The core issue isn’t a lack of information – LLMs are trained on massive datasets – but rather an inability to consistently apply that information in a reliable manner. This manifests as errors in multi-step reasoning, susceptibility to misleading prompts, and difficulty distinguishing between correlation and causation. Consequently, despite their impressive generative abilities, deploying LLMs in domains demanding accuracy and trustworthiness – such as medical diagnosis, legal analysis, or financial forecasting – requires careful consideration and mitigation of these inherent reasoning limitations.

Despite substantial investment in increasing the size of Large Language Models (LLMs) – a strategy known as scaling – researchers are observing a plateau in performance gains. Simply adding more parameters and data is yielding diminishing returns, suggesting that the path to robust reasoning isn’t solely through brute force. Simultaneously, the training process itself remains remarkably unstable; slight variations in initialization or data order can lead to drastically different outcomes, hindering reproducibility and reliable model development. This instability manifests as unpredictable shifts in learned representations and can derail the entire training run, demanding significant computational resources and expertise to mitigate. Consequently, the field is actively seeking alternative methods that prioritize not just scale, but also the stability and consistency of the learning process to unlock truly reliable LLM reasoning.

For large language models to move beyond impressive demonstrations and become truly useful in critical applications – such as medical diagnosis, legal reasoning, or financial forecasting – both training stability and generalization ability must be dramatically improved. Current LLMs can exhibit unpredictable behavior, offering inconsistent results even with slight variations in input, and often struggle to apply learned knowledge to novel situations. Addressing these limitations isn’t simply about increasing model size; it requires innovative techniques that ensure the learning process itself is more robust and that the resulting models can reliably extrapolate from training data to unseen data. Without these advancements, widespread deployment in high-stakes scenarios remains risky, as unpredictable outputs could lead to significant errors and potentially harmful consequences, hindering the realization of LLMs’ full potential.

This comparison of responses from supervised fine-tuning (SFT, blue) and Balanced Fine-Tuning (BFT, orange) demonstrates that BFT enhances large language models’ reasoning about biological processes, as prompted by user input (black).

A Fragile Fix: Introducing Balanced Fine-Tuning

Balanced Fine-Tuning (BFT) represents an iterative advancement in Large Language Model (LLM) adaptation techniques. Building directly on the established practice of Supervised Fine-Tuning (SFT), which utilizes labeled datasets to refine model weights, BFT incorporates principles from Dynamic Fine-Tuning (DFT). DFT introduces adaptive learning rates and regularization strategies to address training instability. BFT extends these concepts by implementing specific metrics and optimization procedures designed to further stabilize the fine-tuning process and enhance the model’s ability to generalize to unseen data, ultimately aiming to improve performance across a wider range of inputs and tasks.

Balanced Fine-Tuning (BFT) employs Token Confidence and Group Confidence as key metrics for assessing and addressing instability during the fine-tuning of Large Language Models. Token Confidence measures the probability assigned to each generated token, flagging low-confidence tokens as potential indicators of instability. Group Confidence extends this by evaluating the collective confidence of tokens within a defined group – typically a sentence or phrase – providing a broader assessment of output reliability. During training, BFT utilizes these metrics to dynamically adjust optimization strategies, such as applying weighted loss or targeted regularization, to mitigate the impact of unstable token or group predictions and promote more robust learning.
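The idea of weighting the training loss by per-token and per-group confidence can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `confidence_weighted_loss`, the sliding-window definition of group confidence, and the specific `1 - group_conf` weighting rule are all assumptions made for clarity.

```python
import numpy as np

def confidence_weighted_loss(logits, targets, window=8):
    """Illustrative confidence-based loss weighting (not the paper's exact rule).

    logits:  (seq_len, vocab) array of model outputs
    targets: (seq_len,) array of gold token ids
    """
    # Per-token log-probabilities of the gold tokens (stable softmax).
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    token_logp = log_probs[np.arange(len(targets)), targets]
    token_conf = np.exp(token_logp)            # "Token Confidence"

    # "Group Confidence": mean token confidence over a sliding window.
    pad = window // 2
    padded = np.pad(token_conf, pad, mode="edge")
    group_conf = np.array([padded[i:i + window].mean()
                           for i in range(len(token_conf))])

    # Down-weight tokens whose neighbourhood is already confident,
    # focusing the update on unstable regions (one plausible choice).
    weights = 1.0 - group_conf
    return float(-(weights * token_logp).mean())
```

In a real training loop these weights would be detached from the gradient and applied inside the cross-entropy objective; the sketch only shows how the two confidence signals combine into a scalar loss.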

Balanced Fine-Tuning (BFT) prioritizes the stability of the optimization process during LLM adaptation, moving beyond solely maximizing accuracy on a training dataset. This approach seeks to minimize unpredictable behavior and enhance response consistency by preventing overconfidence in specific tokens or groups of tokens during training. By actively monitoring and mitigating optimization imbalances, BFT aims to create models that not only generate correct outputs but also exhibit more consistent and predictable performance across diverse inputs, ultimately improving the reliability of the LLM in real-world applications.

An ablation study demonstrates that the proposed BFT model, utilizing both sample and token-level weighting and adjustable window lengths, consistently outperforms standard supervised fine-tuning (SFT) and reinforcement learning (GRPO) on mathematical reasoning datasets, with performance tracked across a single training epoch.

Evidence, Such as It Is: Validating BFT on Key Benchmarks

Evaluation of the proposed Balanced Fine-Tuning (BFT) methodology commenced with DeepSeek-R1-Distill serving as the foundational Large Language Model (LLM). Performance assessment focused on two core knowledge domains: general reasoning capabilities, measured using the Massive Multitask Language Understanding (MMLU) and Chinese Massive Multitask Language Understanding (CMMLU) benchmarks; and biomedical knowledge, assessed via the OpenAI HealthBench. This dual-domain evaluation strategy was implemented to quantify BFT’s impact on both broad linguistic understanding and specialized scientific expertise, providing a comprehensive understanding of the model’s capabilities post-training.

Benchmarking revealed that the implementation of BFT consistently maintained or improved performance on the MMLU and CMMLU benchmarks when compared to the base DeepSeek-R1-Distill model. This indicates that the BFT methodology does not negatively impact general reasoning capabilities and, crucially, mitigates the issue of catastrophic forgetting – the tendency of LLMs to lose previously learned information when exposed to new data. The preservation of scores on these general knowledge benchmarks demonstrates BFT’s ability to integrate specialized biological training data without sacrificing performance on broader knowledge domains.

The incorporation of biologically focused training data, produced via the GenePT methodology, together with gene embeddings created using techniques such as Youtu-Embedding, demonstrably enhanced the LLM’s biomedical reasoning capabilities. Performance was quantified using ROUGE metrics – specifically ROUGE-L, ROUGE-1, and ROUGE-2 – on benchmarks assessing biological process reasoning. Following this data integration, the LLM achieved higher ROUGE scores than the GeneAgent model, reflecting improved comprehension and recall of biological information.
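For readers unfamiliar with the metric, ROUGE-L scores overlap between a candidate and a reference via their longest common subsequence (LCS). A minimal sketch over whitespace tokens, without the stemming or bootstrapping that production implementations typically add:

```python
def rouge_l_f1(reference, candidate):
    """Minimal ROUGE-L F1 over whitespace tokens (no stemming)."""
    ref, cand = reference.split(), candidate.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if r == c
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: LCS "the cat on the mat" (5 of 6 tokens) gives F1 = 5/6.
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

ROUGE-1 and ROUGE-2 follow the same precision/recall pattern but count unigram and bigram overlap instead of the LCS.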

BFT learns biologically meaningful gene representations, as demonstrated by its ability to accurately predict gene attributes, interactions, cell types, and responses to perturbations, performing favorably in multimodal integration and single-cell analysis compared to other models.

A Temporary Stay of Execution: Implications and Future Directions

The capacity of Balanced Fine-Tuning (BFT) to bolster Large Language Model (LLM) stability and generalization carries profound implications for fields demanding unwavering reliability. In applications such as medical diagnosis and treatment planning, even minor inaccuracies can have critical consequences; BFT’s enhancements minimize such risks by fostering more consistent and dependable outputs. This improved robustness stems from the model’s refined ability to extrapolate from learned data, enabling it to confidently address novel scenarios and complex patient cases. Consequently, BFT represents a substantial step towards deploying LLMs in high-stakes environments where precise and trustworthy performance is paramount, potentially revolutionizing healthcare decision-making and patient care.

The convergence of large language models with comprehensive biomedical knowledge, and increasingly granular datasets like single-cell data, signifies a paradigm shift in artificial intelligence. This integration isn’t merely about feeding more information into an existing model; it’s about constructing LLMs specifically attuned to the intricacies of biological systems. Such specialized models demonstrate enhanced accuracy and reliability when addressing complex biomedical questions, moving beyond general knowledge to provide nuanced insights. The potential extends to personalized medicine, drug discovery, and a deeper understanding of disease mechanisms, offering a pathway towards AI systems capable of tackling domain-specific challenges with unprecedented precision and offering solutions beyond the scope of current general-purpose language models.

The benefits of Balanced Fine-Tuning (BFT) extend beyond enhanced performance, as the technique achieves notable gains in evaluation scores while maintaining a training runtime comparable to traditional Supervised Fine-Tuning (SFT). This efficiency is crucial for practical implementation and wider adoption within resource-constrained environments. Current research is actively investigating the scalability of BFT to even more expansive language models, aiming to unlock further improvements in complex reasoning and knowledge integration. Beyond the biomedical field, explorations are underway to determine the effectiveness of BFT across diverse and challenging domains, suggesting a potentially versatile approach to enhancing the capabilities of large language models in a variety of specialized applications.

Biological embeddings are extracted from LLM-BFT by generating gene embeddings from textual descriptions using Tencent Youtu-Embedding, then weighting these embeddings by gene expression values within a single-cell dataset to produce cell embeddings.
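The expression-weighted pooling described in the caption can be sketched in a few lines. The function name `cell_embeddings` and the per-cell normalization of expression values are assumptions for illustration; the source does not specify the exact weighting scheme.

```python
import numpy as np

def cell_embeddings(expr, gene_emb):
    """Pool text-derived gene embeddings into cell embeddings,
    weighted by each gene's expression in each cell.

    expr:     (n_cells, n_genes) expression matrix
    gene_emb: (n_genes, dim) gene embeddings from textual descriptions
    returns:  (n_cells, dim) cell embeddings
    """
    # Normalize expression per cell so weights sum to 1 (an assumption;
    # the clip guards against all-zero cells).
    weights = expr / np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    return weights @ gene_emb
```

A cell expressing only one gene simply inherits that gene’s embedding, while mixed expression yields a convex combination of the expressed genes’ embeddings.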

The pursuit of aligning large language models with biomedical knowledge, as detailed in this work, feels predictably Sisyphean. Balanced Fine-Tuning aims for gradient stability and improved representation learning, but the underlying assumption – that one can truly solve the problem of generalization – is where the cracks begin to show. It’s a temporary reprieve, a local maximum before production inevitably introduces novel edge cases. As Blaise Pascal observed, “The belly is an ungovernable master.” Similarly, the demands of real-world data will always exceed the neatness of any training scheme. This isn’t a criticism of the technique; merely an observation that even the most elegant architectures eventually succumb to the entropy of scale.

What’s Next?

This work, predictably, moves the goalposts. Balanced Fine-Tuning addresses gradient instability, a problem that will undoubtedly be replaced by a more subtle, equally irritating one. The claim of improved generalization feels… optimistic. Any model that performs well on a held-out dataset hasn’t encountered all the ways production data will contort itself. If a bug is reproducible, it suggests a stable system; the absence of reported bugs merely indicates insufficient adversarial testing.

The emphasis on representation learning is particularly fraught. Elegant representations are, after all, just a more complex form of technical debt. The field will chase ever-more-abstract embeddings until a simple data drift renders them meaningless. Documentation of these learned representations, naturally, will be collective self-delusion – a snapshot of understanding that decays the moment the model leaves the researcher’s machine.

Future work will likely focus on automating the ‘balancing’ process itself, creating a feedback loop of increasing complexity. Anything self-healing just hasn’t broken yet. The real challenge, unstated here, remains the same: biomedical knowledge isn’t static. The model isn’t learning science; it’s learning a snapshot of current consensus, which, history suggests, is a moving target.


Original article: https://arxiv.org/pdf/2511.21075.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
