Author: Denis Avetisyan
As artificial intelligence reshapes biomedical research, ensuring fairness and preventing the amplification of existing healthcare disparities is paramount.
This review examines the sources of bias in biomedical foundation models and proposes strategies for improved data provenance, transparency, and benchmarking to promote equitable AI in healthcare.
Despite advances in precision medicine, healthcare disparities persist, often masked by biases embedded within the very data driving innovation. In ‘Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities’, we reveal a critical vulnerability: substantial population biases in omics datasets used to train biomedical foundation models, where data from individuals of European ancestry disproportionately dominate. This skewed representation risks perpetuating, and even amplifying, existing inequities as these models become central to downstream healthcare applications. Can a commitment to data provenance, openness, and transparent evaluation foster truly equitable and robust biomedical AI that benefits all populations?
The Illusion of Hypothesis: Why We Must Grow, Not Build
Historically, biomedical research has proceeded through the formulation of specific hypotheses, followed by experiments designed to confirm or refute them. While effective in many instances, this approach inherently constrains investigation within the boundaries of pre-defined expectations. Researchers, consciously or unconsciously, often focus on parameters they believe are relevant, potentially overlooking crucial but unanticipated connections. This reliance on pre-set criteria introduces a form of bias, limiting the scope of discovery and hindering exploration of the full complexity of biological systems. Consequently, valuable insights might remain hidden simply because the research framework wasn’t designed to detect them, necessitating a shift towards methodologies capable of unbiased data exploration.
The recent surge in large-scale data generation technologies, particularly Single-Cell RNA Sequencing (scRNA-seq), is fundamentally reshaping biomedical research by enabling a transition from hypothesis-driven to data-driven discovery. scRNA-seq allows researchers to profile the gene expression of thousands of individual cells, creating a high-resolution map of cellular states and interactions within complex tissues. This unprecedented level of detail moves beyond traditional bulk analyses, revealing previously hidden heterogeneity and rare cell populations critical to disease development and progression. Consequently, researchers can now explore biological systems without pre-defined assumptions, identifying novel patterns, biomarkers, and therapeutic targets directly from the data itself – a paradigm shift poised to accelerate breakthroughs across numerous fields, from immunology to oncology.
The sheer volume and intricacy of modern biological datasets, particularly those arising from single-cell technologies, demand analytical tools that surpass the capabilities of traditional statistical methods. Existing approaches often struggle with the high dimensionality and inherent noise within these data, hindering the identification of meaningful patterns and relationships. Consequently, researchers are increasingly turning to foundation models – large, pre-trained models initially developed for natural language processing – and adapting them for biological applications. These models, capable of learning complex representations from vast amounts of unlabeled data, offer a powerful means of integrating diverse biological information and uncovering subtle, previously hidden connections. This shift promises to move beyond predefined hypotheses and facilitate a more exploratory, data-driven approach to understanding the fundamental principles of life.
Foundation Models: From Tools to Ecosystems
Biomedical foundation models signify a departure from traditional artificial intelligence approaches focused on solving individual, narrowly defined biomedical problems. Prior AI systems were typically engineered and trained for a specific task, limiting their applicability to other areas. In contrast, foundation models, such as ESM-3 for protein language modeling and AlphaFold for protein structure prediction, are trained on extensive, heterogeneous biomedical datasets to develop generalized representations of biological data. This pre-training allows these models to be adapted – through techniques like fine-tuning – to a wide array of downstream tasks, including those not explicitly included in their initial training, thereby increasing efficiency and reducing the need for task-specific model development.
Biomedical foundation models are constructed through training on extensive datasets representing multiple modalities of biological information. Genomic sequences are utilized, with models like DNABERT specifically trained on DNA data. Transcriptomic profiles, detailing RNA expression, are incorporated using models such as scBERT and BMFM-RNA, which analyze single-cell RNA sequencing data. Furthermore, these models ingest large corpora of biological text, including scientific literature and databases, as exemplified by BioBERT and BioGPT, and increasingly, general-purpose Large Language Models adapted for biomedical applications. The scale of these datasets, often encompassing billions of data points, is critical to the models’ ability to learn complex biological relationships and generalize to new tasks.
Pre-training biomedical foundation models on extensive datasets allows them to learn complex relationships within biological data without task-specific labeling. This process creates models that represent a generalized understanding of biology, enabling effective transfer learning to a variety of downstream applications. For example, in drug discovery, these models can predict protein structures, identify potential drug targets, and estimate drug efficacy, accelerating the initial phases of the Drug Discovery Process. In personalized medicine, models trained on patient genomic data and clinical records, like those used in Warfarin Dosing Algorithms, can predict individual responses to medications, enabling tailored treatment plans and improved patient outcomes. The broad pre-training phase effectively encodes biological knowledge, reducing the need for extensive task-specific training data and improving performance on tasks with limited labeled examples.
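The transfer-learning pattern described above can be sketched in a few lines: the pretrained encoder is frozen, and only a lightweight task head is trained on a handful of labels. Everything in this sketch is a stand-in – the "encoder" is a fixed nonlinear projection rather than any real foundation model, and the data and task are invented for illustration.

```python
import math

# Hypothetical stand-in for a pretrained encoder: in practice this would be
# a foundation model's frozen embedding of, say, a gene-expression profile.
def frozen_embedding(x):
    # Fixed nonlinear projection; these "pretrained" weights are never updated.
    return [math.tanh(x[0] - x[1]), math.tanh(x[0] + x[1])]

# A small labeled set for the downstream task (e.g., responder vs non-responder).
data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 1.0], 0), ([0.2, 0.9], 0)]

# Train only a lightweight logistic-regression head on the frozen embeddings.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    for x, y in data:
        z = frozen_embedding(x)
        p = 1.0 / (1.0 + math.exp(-(w[0] * z[0] + w[1] * z[1] + b)))
        g = p - y  # gradient of the log-loss with respect to the logit
        w = [w[0] - lr * g * z[0], w[1] - lr * g * z[1]]
        b -= lr * g

def predict(x):
    z = frozen_embedding(x)
    return 1 if (w[0] * z[0] + w[1] * z[1] + b) > 0 else 0

print([predict(x) for x, _ in data])
```

The point of the pattern is economy: because the encoder already carries general biological structure, the head needs only a few labeled examples – exactly the regime the paragraph above describes.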
The Shadow of Bias: Ancestry and the Limits of Representation
Ancestry bias in Biomedical Foundation Models arises from the non-representative composition of training datasets, where certain ancestral groups are significantly overrepresented while others are underrepresented. This imbalance directly impacts model performance, potentially leading to decreased accuracy and reliability when applied to populations not adequately reflected in the training data. The consequence is a systemic skewing of predictions, which can perpetuate and even exacerbate existing health disparities by providing less accurate diagnoses or treatment recommendations for underrepresented groups. This bias isn’t simply a matter of statistical error; it’s a fundamental flaw in the model’s ability to generalize across the full spectrum of human biological diversity.
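The mechanism described above – a pooled model dominated by the majority group – can be made concrete with a toy simulation. Here the label-relevant threshold differs between two synthetic "ancestry" groups (an illustrative stand-in for ancestry-dependent effect sizes), the training corpus is 95% group A, and a single ancestry-blind threshold is fit to the pooled data.

```python
import random

random.seed(1)

def sample(group, n):
    # Toy biomarker: the decision-relevant cutoff differs by group.
    cut = 0.0 if group == "A" else 1.0
    xs = [random.gauss(cut, 1.0) for _ in range(n)]
    return [(x, int(x > cut)) for x in xs]

# Training set mirrors a skewed corpus: 95% group A, 5% group B.
train = sample("A", 950) + sample("B", 50)

# Fit the single best threshold on the pooled data (an ancestry-blind model).
cands = sorted(x for x, _ in train)
best = max(cands, key=lambda t: sum((x > t) == bool(y) for x, y in train))

def accuracy(test, t):
    return sum((x > t) == bool(y) for x, y in test) / len(test)

acc_a = accuracy(sample("A", 2000), best)
acc_b = accuracy(sample("B", 2000), best)
print(f"group A accuracy: {acc_a:.2f}, group B accuracy: {acc_b:.2f}")
```

The fitted threshold lands near the majority group's cutoff, so the model looks accurate overall while systematically mislabeling the underrepresented group – the "systemic skewing of predictions" the paragraph describes, reproduced in miniature.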
Analysis of publications between 2015 and 2024 reveals a significant lack of demographic reporting, contributing to potential biases in Biomedical Foundation Models. Specifically, only 2.7% of analyzed publications report on ancestry or ethnicity, while 4.4% report geographic origin, and 3.0% detail the sex/gender composition of study participants. This limited reporting hinders the assessment of model generalizability across diverse populations and raises concerns about the potential for inaccurate predictions and the exacerbation of existing health disparities, as model performance may vary significantly based on these unreported demographic factors.
Analysis of publications between 2015 and 2024 reveals substantial underreporting of demographic data in biomedical research, specifically concerning ancestry. Transcriptomic studies exhibit the lowest reporting rate at 0.7%, indicating a significant gap in understanding potential biases within this field. In contrast, genomic studies report demographic information at a rate of 6.1%, the highest among omics disciplines. This disparity highlights the critical need for standardized demographic reporting practices and rigorous data curation procedures across all omics fields to mitigate the risk of ancestry bias and ensure equitable performance of Biomedical Foundation Models.
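Auditing reporting rates of the kind cited above is mechanically simple once study metadata is structured. The sketch below computes the fraction of records that report each demographic field; the records and field names are invented for illustration, not drawn from any real registry.

```python
# Hypothetical metadata records for a corpus of training publications.
corpus = [
    {"id": "pub1", "ancestry": "EUR", "geography": "UK"},
    {"id": "pub2", "sex": "F/M"},
    {"id": "pub3"},
    {"id": "pub4", "geography": "JP", "sex": "F"},
]

def reporting_rate(records, field):
    """Fraction of records that report the given demographic field at all."""
    return sum(field in r for r in records) / len(records)

for field in ("ancestry", "geography", "sex"):
    print(f"{field}: {100 * reporting_rate(corpus, field):.1f}% reported")
```

The hard part, of course, is not the arithmetic but the curation: the low rates reported above exist precisely because this metadata is rarely captured in a machine-readable form.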
Explainable AI (XAI) methods are essential for evaluating Biomedical Foundation Models due to their complexity and potential for biased outputs. Techniques like SHAP (SHapley Additive exPlanations) assign each feature an importance value for a particular prediction, allowing researchers to identify which factors disproportionately influence model outcomes. By analyzing these feature attributions across diverse demographic groups, hidden biases can be detected – for instance, if a model relies more heavily on ancestry-correlated features for certain predictions. This granular level of interrogation facilitates model debugging, promotes transparency in decision-making, and ultimately contributes to the development of more fair and reliable predictive tools in healthcare and biomedical research.
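To make the SHAP idea concrete without invoking the library itself, the sketch below computes exact Shapley values by enumerating feature coalitions for a tiny toy risk model. The model, feature names, and baseline are illustrative assumptions; real SHAP tooling approximates this same quantity for models far too large to enumerate.

```python
from itertools import combinations
from math import factorial

# A toy risk model over three illustrative features; the product term makes
# attributions depend on interactions, not just coefficients.
FEATURES = ["variant_burden", "expression", "ancestry_pc1"]

def f(x):
    return x["variant_burden"] + 2 * x["expression"] * x["ancestry_pc1"]

baseline = {g: 0.0 for g in FEATURES}
instance = {g: 1.0 for g in FEATURES}

def shapley(feature):
    """Exact Shapley value: the weighted average marginal contribution of
    `feature` over all coalitions, with absent features held at baseline."""
    others = [g for g in FEATURES if g != feature]
    n, total = len(FEATURES), 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            present = set(subset) | {feature}
            with_f = {g: instance[g] if g in present else baseline[g] for g in FEATURES}
            without = {g: instance[g] if g in subset else baseline[g] for g in FEATURES}
            total += weight * (f(with_f) - f(without))
    return total

attr = {g: shapley(g) for g in FEATURES}
print(attr)  # attributions sum to f(instance) - f(baseline)
```

Averaging such attributions separately within demographic groups is how the bias audit described above proceeds: if an ancestry-correlated feature carries systematically larger attributions for one group's predictions, the model's reliance on it is exposed rather than hidden.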
The Architecture of Trust: Reproducibility and the Virtual Cell
Rigorous benchmarking serves as a critical cornerstone in validating the efficacy of biomedical foundation models, a necessity given their increasing application in areas impacting human health. These models, trained on vast datasets, require thorough evaluation not simply on a single task, but across a spectrum of challenges and varied datasets to truly assess their reliability and ability to generalize beyond the training data. Such evaluations expose potential biases, identify limitations in performance across different patient populations or disease states, and ultimately ensure that these AI systems are robust and trustworthy when deployed in clinical settings. Without standardized benchmarks and transparent reporting of results, the potential of these powerful tools remains hampered by uncertainty and a lack of confidence in their predictive capabilities.
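One practical consequence of the argument above is that benchmarks should report worst-case as well as average performance across cohorts, since a single pooled score can hide failure on an underrepresented dataset. The harness below is a minimal sketch of that idea; the cohort names, data, and deliberately brittle model are all invented for illustration.

```python
def evaluate(model, cohorts):
    """cohorts: {name: list of (features, label)} -> per-cohort accuracy."""
    scores = {}
    for name, examples in cohorts.items():
        correct = sum(model(x) == y for x, y in examples)
        scores[name] = correct / len(examples)
    return scores

# A toy threshold model that happens to suit only one cohort.
model = lambda x: int(x > 0.5)

cohorts = {
    "cohort_EUR": [(0.9, 1), (0.1, 0), (0.8, 1), (0.2, 0)],
    "cohort_EAS": [(0.6, 0), (0.7, 0), (0.9, 1), (0.1, 0)],
}

scores = evaluate(model, cohorts)
pooled = sum(scores.values()) / len(scores)
print(scores, "worst-case:", min(scores.values()), "mean:", pooled)
```

Here the mean looks respectable while the worst-case score tells the real story – which is exactly why evaluation across a spectrum of datasets, not a single aggregate, is the cornerstone the paragraph calls for.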
Establishing reproducible biomedical AI necessitates a shift towards rigorous documentation practices, and established frameworks like the Consort Guidelines offer a valuable blueprint. Originally designed to enhance the reporting quality of clinical trials, these guidelines emphasize transparent data provenance – detailing not just what data was used, but also how it was collected, processed, and analyzed. Applying similar standards to the development of AI models – meticulously recording data versions, preprocessing steps, model architectures, training parameters, and evaluation metrics – allows for independent verification and replication of results. This level of detail is crucial for building trust in AI-driven insights, facilitating collaborative research, and ultimately ensuring the robustness and reliability of these models in critical biomedical applications, moving the field beyond ‘black box’ predictions towards verifiable and dependable outcomes.
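A provenance record of the kind described above can be as simple as a structured document validated before results are published. The sketch below uses a plain dictionary with a required-field check; the field names are illustrative, in the spirit of CONSORT-style reporting items rather than any published schema.

```python
import json

# Fields a hypothetical provenance standard might require.
REQUIRED = {"dataset_version", "collection_protocol", "preprocessing",
            "demographics_reported", "model_commit", "eval_metrics"}

record = {
    "dataset_version": "scrna-atlas-v2.1",
    "collection_protocol": "10x Chromium; see study appendix",
    "preprocessing": ["ambient RNA removal", "log1p normalization"],
    "demographics_reported": {"ancestry": True, "sex": True, "geography": False},
    "model_commit": "abc1234",
    "eval_metrics": {"auroc_overall": 0.91},
}

def validate(rec):
    """Fail fast if any provenance field is missing."""
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"incomplete provenance: {sorted(missing)}")
    return True

validate(record)
print(json.dumps(record, indent=2))
```

The value of such a record is less the format than the enforcement: a pipeline that refuses to emit results without complete provenance makes the 'black box' harder to ship by accident.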
The pursuit of a Virtual Cell Paradigm represents a transformative ambition within biomedical artificial intelligence. This concept envisions generative models capable of faithfully simulating the intricate behaviors of cells – from molecular interactions to complex physiological responses. Such models wouldn’t simply analyze existing data, but proactively predict cellular outcomes under various conditions, effectively functioning as digital twins of biological systems. This capability promises to dramatically accelerate drug discovery by enabling in silico screening of potential therapeutics, reducing the reliance on costly and time-consuming laboratory experiments. Furthermore, a robust Virtual Cell Paradigm holds the potential to revolutionize personalized medicine, allowing clinicians to model a patient’s specific cellular profile and predict their response to different treatments, ultimately leading to more effective and targeted therapies.
The pursuit of equitable AI in biomedicine reveals a familiar pattern: systems designed with the best intentions often inherit the imperfections of their origins. This paper rightly focuses on the propagation of bias through foundation models, a consequence of datasets reflecting existing societal imbalances. It echoes a sentiment expressed by David Hilbert: “We must be able to answer the question: what are the prerequisites for the existence of a solution?” In this context, the prerequisites aren’t mathematical, but ethical – a comprehensive understanding of data provenance and a commitment to benchmarking for fairness. The architecture isn’t the solution; it’s a compromise frozen in time, a snapshot of the biases present during its construction. Technologies change, dependencies remain – and those dependencies are often built on flawed foundations.
What Lies Ahead?
The pursuit of equitable biomedical AI feels less like engineering and more like tending a garden. This work, illuminating the potential for bias within foundation models, doesn’t offer solutions so much as a clearer view of the weeds. The focus on data provenance and explainability is sensible, of course, but each layer of transparency reveals further complexity – and new surfaces for bias to cling to. Scalability is simply the word used to justify that complexity. One suspects that, as models grow larger and more integrated with clinical practice, the origins of any given prediction will become increasingly opaque, a black box built of layers upon layers of abstraction.
The call for standardized benchmarking is pragmatic, yet every benchmark is, by definition, a simplification of reality. Everything optimized will someday lose flexibility. A model that performs flawlessly on current datasets may stumble badly when confronted with genuinely novel data, or subtle shifts in population demographics. The pursuit of a ‘fair’ algorithm implies a fixed definition of fairness, a static target in a dynamic world.
Perhaps the perfect architecture is a myth, a necessary fiction to quiet the anxieties of those who believe systems can be controlled. The real work, then, isn’t building better models, but cultivating a humility about their limitations – and a willingness to continuously monitor, adapt, and acknowledge the inevitable imperfections. The goal isn’t bias elimination, but bias awareness – a constant, vigilant tending of the garden, knowing full well that weeds will always return.
Original article: https://arxiv.org/pdf/2604.14514.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/