Can AI Spot Bias in Cancer Detection?

Author: Denis Avetisyan


A new study explores how intelligent agents can help ensure fairness in early-onset colorectal cancer screening.

Semantic similarity between generated outputs and ground truth statements varied considerably across agent roles (Domain Expert and Fairness Consultant) and model sizes (Llama 3.1 8B, OSS 20B, and OSS 120B), with retrieval-augmented generation (RAG) consistently improving alignment compared to both large language models alone and agents without RAG, as evidenced by the distributions of similarity scores and statistical significance testing.

Researchers evaluated an agentic AI system leveraging retrieval-augmented generation for fairness auditing in early-onset colorectal cancer detection, analyzing the impact of model scale and task complexity.

Despite increasing reliance on artificial intelligence in clinical decision-making, ensuring fairness and mitigating algorithmic bias remain significant challenges. This study, ‘Ablation Study of a Fairness Auditing Agentic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection’, investigates the potential of an agentic AI system, comprising a domain expert and a fairness consultant, to audit machine learning models for disparities in early-onset colorectal cancer detection. Results demonstrate that this agentic approach, particularly when leveraging Retrieval-Augmented Generation, enhances the identification of relevant fairness considerations, with performance scaling with model size. Could such systems offer a scalable solution for proactive fairness auditing and responsible AI implementation in healthcare?


Decoding Bias: The Algorithmic Mirror

The accelerating integration of machine learning into biomedical practice, while promising enhanced diagnostics and treatment, presents a critical challenge: the potential to exacerbate existing health disparities. Algorithms trained on biased datasets – reflecting historical inequities in healthcare access, representation, and data collection – can systematically disadvantage vulnerable populations. This isn’t merely a matter of inaccurate predictions for certain groups; the models can actively amplify pre-existing biases, leading to misdiagnosis, inappropriate treatment recommendations, and ultimately, unequal health outcomes. Consequently, a growing body of research highlights the urgent need to proactively assess and mitigate these algorithmic biases, moving beyond simple performance metrics to evaluate fairness and equity in machine learning-driven healthcare solutions.

Conventional evaluations of machine learning models in biomedical applications frequently prioritize overall accuracy, potentially masking significant disparities in performance across different demographic groups. This oversight can lead to models that, while seemingly effective on average, systematically disadvantage vulnerable populations due to biased training data or algorithmic design. Consequently, a shift towards proactive auditing strategies is essential; these strategies must move beyond simple accuracy metrics to incorporate fairness assessments that specifically examine model behavior across various subgroups defined by race, ethnicity, socioeconomic status, and other relevant factors. Such audits require careful consideration of potential bias sources, the selection of appropriate fairness metrics – like equal opportunity or predictive parity – and the implementation of techniques to mitigate identified disparities, ultimately ensuring equitable and just healthcare outcomes.
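The two fairness metrics named above can be made concrete with a small computation. The sketch below, using synthetic labels rather than any data from the study, computes the per-group true-positive rate (the quantity equalized under equal opportunity) and positive predictive value (the quantity equalized under predictive parity), then measures the gap between groups.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group TPR (equal opportunity) and PPV (predictive parity)."""
    rates = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        tpr = tp / (tp + fn) if (tp + fn) else float("nan")
        ppv = tp / (tp + fp) if (tp + fp) else float("nan")
        rates[g] = {"tpr": tpr, "ppv": ppv}
    return rates

# Toy example: two demographic groups, binary screening labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
r = group_rates(y_true, y_pred, groups)
eo_gap = abs(r["A"]["tpr"] - r["B"]["tpr"])  # equal-opportunity gap
```

A model with high overall accuracy can still show a large `eo_gap`, which is exactly the disparity that aggregate metrics mask.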

The opacity of many machine learning models presents a significant obstacle to identifying and correcting algorithmic bias within biomedical applications. Often described as “black boxes,” these models make predictions without revealing the specific factors driving those conclusions, making it difficult to discern whether decisions are based on legitimate medical indicators or inadvertently reflect societal biases present in the training data. This lack of interpretability prevents clinicians and researchers from thoroughly auditing the model’s reasoning, hindering efforts to pinpoint the source of unfair or discriminatory outcomes. Consequently, biases can remain undetected, perpetuating health disparities and eroding trust in AI-driven healthcare solutions; the inability to trace the decision-making process effectively blocks both the correction of existing biases and the prevention of new ones.

A robust evaluation of biomedical machine learning models necessitates more than just overall accuracy; it demands a system capable of systematically interrogating the vast and complex landscape of medical literature to uncover potential biases. Such a system would not merely assess model performance on diverse datasets, but actively synthesize knowledge regarding social determinants of health, known health disparities, and the nuanced ways in which these factors might influence both data and algorithmic outcomes. Crucially, this requires a framework for quantifying fairness – moving beyond simple metrics to incorporate context-specific considerations and allowing researchers to evaluate models across multiple dimensions of equity. By automating the process of literature review and fairness assessment, developers can proactively identify and mitigate biases, ultimately ensuring that these powerful tools contribute to more just and equitable healthcare for all populations.

The Autonomic Auditor: An Agentic System

The Agentic AI System is designed to automate auditing procedures through the coordinated action of multiple specialized agents. This architecture departs from monolithic AI models by distributing responsibility across distinct components, each focused on a specific task within the auditing workflow. These agents operate autonomously, yet communicate and collaborate to achieve a comprehensive evaluation. The system’s modular design facilitates scalability and allows for the easy addition or modification of agents to address evolving auditing requirements. Each agent is designed with a defined scope and objective, contributing to the overall audit process through specialized analysis and reporting.

The auditing system utilizes Retrieval-Augmented Generation (RAG) to improve the contextual relevance of its analyses. RAG functions by first retrieving pertinent passages from a biomedical literature database based on the audit query. These retrieved passages are then incorporated as context for a large language model, enabling it to generate responses grounded in established scientific evidence. This process mitigates the risk of the model relying solely on its pre-trained knowledge, which may be incomplete or outdated, and allows for reasoning informed by the latest research findings. The retrieved documents provide verifiable support for the system’s conclusions, increasing transparency and reliability in the auditing process.

The Agentic AI system incorporates a Domain Expert Agent and a Fairness Consultant Agent to facilitate comprehensive auditing. The Domain Expert Agent is responsible for synthesizing current biomedical knowledge related to the condition under audit, enabling informed evaluation of the AI model’s outputs. Complementing this, the Fairness Consultant Agent recommends relevant fairness metrics – beyond traditional accuracy – to assess potential performance disparities across different demographic groups. This dual-agent approach ensures that audits consider both clinical accuracy and equitable performance, addressing potential biases inherent in AI models and promoting responsible AI deployment.
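The two agent roles can be sketched structurally. The role names below come from the study; the class design, prompt wording, and `build_prompt` interface are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    role: str
    system_prompt: str

    def build_prompt(self, question: str, context: str = "") -> str:
        # In the deployed system a prompt like this would be sent to an
        # LLM via Ollama, with `context` filled by RAG retrieval; here we
        # only assemble it, so the sketch runs without a model server.
        return (f"[{self.role}] {self.system_prompt}\n\n"
                f"Context:\n{context}\n\nQuestion: {question}")

domain_expert = Agent(
    role="Domain Expert",
    system_prompt="Synthesize biomedical evidence on early-onset colorectal cancer.",
)
fairness_consultant = Agent(
    role="Fairness Consultant",
    system_prompt="Recommend fairness metrics for auditing subgroup disparities.",
)
prompt = fairness_consultant.build_prompt(
    "Which metrics should audit a screening model across racial groups?"
)
```

Keeping each role as a separate agent with its own system prompt is what allows the audit to weigh clinical evidence and fairness criteria independently before combining them.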

Traditional model evaluation often relies on overall accuracy, which can mask performance differences across demographic subgroups. This system incorporates fairness metrics alongside accuracy to specifically identify and mitigate disparities in model outputs. Our evaluation demonstrates improvements in semantic similarity – a measure of how closely the model’s reasoning aligns with human understanding – when assessed across various demographic groups, indicating a reduction in biased outcomes. This approach prioritizes equitable performance by moving beyond aggregate statistics and focusing on group-specific evaluations, thereby addressing potential harms stemming from disparate model behavior.

Knowledge Excavation: Technical Foundations

The Retrieval-Augmented Generation (RAG) component employs a Chroma Vector Database for the storage and retrieval of document embeddings. These embeddings are generated using the mxbai-embed-large model, which converts textual data into numerical vector representations. Chroma is utilized due to its efficiency in performing similarity searches on these vectors, enabling the RAG system to quickly identify and retrieve the most relevant documents in response to a given query. This process facilitates the augmentation of large language model prompts with pertinent information, enhancing the accuracy and contextual relevance of generated outputs.
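The store-and-retrieve mechanics can be illustrated without the real stack. The study uses Chroma with mxbai-embed-large embeddings; the stand-in below substitutes a toy bag-of-words embedding and an in-memory store so the workflow (embed documents, embed the query, rank by cosine similarity) stays runnable without external services.

```python
import math

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words embedding (stand-in for mxbai-embed-large)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Minimal in-memory stand-in for a Chroma collection."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.docs = []

    def add(self, text: str):
        self.docs.append((text, embed(text, self.vocab)))

    def query(self, text: str, n_results: int = 1):
        q = embed(text, self.vocab)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [t for t, _ in ranked[:n_results]]

vocab = ["colorectal", "cancer", "screening", "fairness", "bias", "weather"]
store = VectorStore(vocab)
store.add("early onset colorectal cancer screening guidelines")
store.add("fairness and bias in clinical models")
store.add("weather report for tomorrow")
top = store.query("screening for colorectal cancer", n_results=1)
```

With real embeddings the ranking captures conceptual similarity rather than word overlap, but the retrieval loop is the same shape: embed, compare, take the top-k passages as LLM context.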

Agent functionality is driven by several large language models (LLMs), specifically Llama 3.1 8B, GPT-OSS 20B, and GPT-OSS 120B, providing a range of model sizes for performance and resource trade-offs. These LLMs are deployed and managed using the Ollama Inference Framework, which streamlines the process of running and serving the models. Ollama facilitates model loading, version control, and execution, enabling efficient integration of the LLMs into the agent architecture and simplifying the deployment pipeline.

Semantic similarity scoring quantifies the relevance of retrieved documents to the agent’s input query, enabling contextually grounded reasoning. This process employs vector embeddings – numerical representations of text – to calculate the cosine similarity between the query embedding and the embeddings of candidate documents. Higher similarity scores indicate greater semantic relatedness, allowing the system to prioritize and incorporate the most relevant information into the agent’s reasoning process. The selection of documents is thus not based on keyword matching, but on conceptual alignment, improving the accuracy and coherence of generated outputs by providing contextually appropriate information to the large language model.
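Concretely, for a query embedding $\mathbf{q}$ and a candidate document embedding $\mathbf{d}$, the score described above is

$$\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert},$$

which ranges from $-1$ to $1$, with values near $1$ indicating strong conceptual alignment between query and document regardless of their surface wording.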

Analysis of agent outputs demonstrated a statistically significant improvement in semantic similarity to expert-validated reference statements when Retrieval-Augmented Generation (RAG) was implemented (p < 0.05). This effect was particularly pronounced in the Domain Expert Agent’s performance when identifying disparities in early-onset colorectal cancer. Furthermore, the RAG tool was utilized in all experimental runs employing the Llama 3.1 8B large language model, indicating its integral role in facilitating contextually relevant reasoning with that model.
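The significance claim above can be illustrated with one common non-parametric choice for comparing two score distributions. The paper does not specify its test here, and the scores below are synthetic, so this is only a sketch of the shape such an analysis takes: a two-sided Mann-Whitney U test with a normal approximation.

```python
import math

def mann_whitney_u(a, b):
    """U statistic by direct pairwise comparison (fine for small samples)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def two_sided_p(u, n1, n2):
    """Normal approximation to the U distribution (no tie correction)."""
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Synthetic semantic-similarity scores, with and without RAG.
rag_scores = [0.82, 0.79, 0.88, 0.91, 0.85, 0.80, 0.86, 0.84]
no_rag_scores = [0.61, 0.70, 0.66, 0.58, 0.72, 0.64, 0.69, 0.63]
u = mann_whitney_u(rag_scores, no_rag_scores)
p = two_sided_p(u, len(rag_scores), len(no_rag_scores))
```

Because every RAG score in this toy sample exceeds every non-RAG score, the U statistic is maximal and the approximate p-value falls well below 0.05.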

Beyond Detection: Shaping Equitable AI

An agentic artificial intelligence system now offers a viable pathway for systematically auditing biomedical machine learning models, directly addressing concerns about fairness and the potential for exacerbating existing health disparities. This innovative approach moves beyond theoretical bias detection by providing a practical, automated framework for evaluating model outputs across different demographic groups. The system functions by proactively identifying and flagging instances where algorithms may produce inequitable predictions or recommendations, allowing developers to mitigate these biases before deployment. By enabling continuous monitoring and assessment, this technology ensures that AI-driven healthcare tools consistently deliver equitable outcomes, fostering greater trust and ultimately improving the quality of care for all patients, regardless of background or identity.

The dynamic nature of real-world data necessitates ongoing scrutiny of machine learning models in healthcare; a one-time assessment of fairness is insufficient to guarantee consistently equitable outcomes. Automated auditing systems address this challenge by providing continuous monitoring and evaluation capabilities, allowing for the detection of emerging biases as models encounter new data distributions. This proactive approach differs from traditional, retrospective bias detection methods, which often identify issues only after harm has occurred. By regularly assessing model performance across different demographic groups, these systems can flag instances where disparities arise, triggering interventions such as model retraining or data augmentation. The benefit is a feedback loop that actively maintains model fairness over time, fostering trust in AI-driven healthcare solutions and preventing the perpetuation of health inequities through algorithmic bias.
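The flagging step in such a feedback loop can be reduced to a small check. The function name, threshold, and group labels below are illustrative assumptions: after each fresh evaluation batch, an alert fires when the per-group true-positive-rate gap exceeds a tolerance, which would then trigger review, retraining, or data augmentation.

```python
def fairness_alert(tpr_by_group: dict, tol: float = 0.1) -> bool:
    """Flag the model when the TPR gap between any two groups exceeds tol."""
    vals = list(tpr_by_group.values())
    return (max(vals) - min(vals)) > tol

# A 15-point TPR gap between groups triggers the alert; a 5-point gap does not.
drifted = fairness_alert({"group_a": 0.90, "group_b": 0.75})
stable = fairness_alert({"group_a": 0.88, "group_b": 0.83})
```

Running a check like this on every new data batch, rather than once at deployment, is what turns a one-time fairness audit into continuous monitoring.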

Addressing algorithmic bias in healthcare AI is not merely a technical refinement, but a crucial step towards fostering genuine trust and demonstrably improving patient outcomes. When machine learning models perpetuate or amplify existing health disparities, they erode confidence in medical technology and can lead to unequal access to care. Proactive mitigation of these biases – through rigorous auditing and transparent model development – ensures that AI tools serve as equitable partners in diagnosis and treatment. This, in turn, promotes wider adoption of AI-driven solutions, allowing healthcare professionals to leverage the technology’s potential to enhance accuracy, personalize treatment plans, and ultimately deliver more effective and just care for all patients, regardless of demographic background or social determinants of health.

This agentic AI auditing framework transcends specific biomedical applications, offering a versatile foundation for evaluating machine learning models across diverse healthcare domains. Beyond mitigating health disparities, the system’s architecture allows for the identification and correction of other potentially harmful biases embedded within algorithms – such as those relating to age, gender, or socioeconomic status. Performance evaluations revealed consistent efficacy across various model configurations; however, the GPT-OSS 120B model paired with the Fairness Consultant Agent exhibited a marginally reduced reliance on Retrieval-Augmented Generation (RAG) tools – registering 85% usage compared to other setups – suggesting potential refinements in prompting or knowledge retrieval strategies for optimal performance within that specific configuration.

The study meticulously dissects the agentic AI’s capabilities, essentially probing its limits in the critical domain of early-onset colorectal cancer detection. This approach mirrors a fundamental tenet of understanding any complex system: deliberate disruption to reveal underlying mechanisms. As John von Neumann observed, “If people do not believe that mathematics is simple and elegant and if they are not excited by a sense of the beautiful, there is no hope for them.” The researchers didn’t simply accept the AI’s outputs; they actively ablated components – systematically removing functionalities – to assess their impact on fairness auditing. This isn’t merely about optimizing performance; it’s about understanding why the system succeeds or fails, particularly concerning biases embedded within the data and algorithms. That, in itself, is an elegant approach to problem solving.

Beyond the Audit: Pushing the Boundaries

The demonstrated capacity of an agentic system to probe for bias in early-onset colorectal cancer detection isn’t a resolution, but a controlled demolition of assumptions. This work reveals not a ‘solved’ problem of fairness, but a methodology for systematically exposing where the system believes it is neutral – and subsequently, where it is most likely wrong. The variability observed with model scale suggests that simply increasing parameters isn’t a path to unbiased outcomes; rather, it amplifies the existing structural prejudices within the data and retrieval mechanisms. The real challenge lies not in refining the audit, but in accepting that ‘fairness’ isn’t a static property, but a dynamic vulnerability.

Future work should deliberately stress-test these agentic systems with adversarial examples designed not to ‘trick’ the AI, but to reveal its underlying logic. Focus should shift from semantic similarity – a surface-level assessment – to interrogating the reasoning behind the retrieval-augmented generation. Can the agent explain why certain data points are considered relevant, and more importantly, can it articulate the limitations of that reasoning? The goal isn’t to eliminate bias – that’s a naive fantasy – but to create systems that are transparently biased, allowing for informed mitigation strategies.

Ultimately, this approach frames AI fairness not as an engineering problem, but as an epistemological one. It demands a constant process of deconstruction, challenging the very foundations of algorithmic ‘knowledge’. The system doesn’t just detect bias; it forces a confrontation with the biases inherent in the act of categorization itself.


Original article: https://arxiv.org/pdf/2603.17179.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-19 17:54