Author: Denis Avetisyan
Researchers have developed an open-source system that combines information retrieval with a robust verification process to deliver accurate and trustworthy answers to complex medical inquiries.

VerifAI leverages retrieval-augmented generation and natural language inference to mitigate the risk of inaccurate responses in biomedical question answering systems.
Despite advances in large language models, ensuring factual consistency remains a critical challenge in biomedical question answering. This paper introduces ‘VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering’, an expert system that combines retrieval-augmented generation with a novel post-hoc claim verification mechanism to mitigate the risk of hallucinations. By decomposing generated answers into atomic claims and validating them against retrieved evidence using natural language inference, VerifAI achieves state-of-the-art accuracy and provides a transparent lineage for every claim, outperforming even GPT-4 on the HealthVer benchmark. Will this approach to verifiable AI pave the way for more trustworthy and reliable applications of LLMs in high-stakes medical contexts?
The Illusion of Knowledge: Why Biomedical Question Answering is So Hard
The sheer volume and intricate nature of biomedical literature present a significant hurdle to efficient information retrieval. Traditional search methods often fall short when confronted with complex questions requiring synthesis of data across multiple studies, or those demanding understanding of subtle contextual cues. Researchers frequently spend considerable time sifting through irrelevant results or struggling to pinpoint evidence supporting specific hypotheses. This inefficiency isn’t merely a matter of convenience; it directly impacts the pace of discovery, hindering progress in areas like drug development and personalized medicine. Current systems struggle with the inherent ambiguity of natural language, the constantly evolving terminology within the field, and the need to integrate diverse data types – from genomic sequences to clinical trial outcomes – to provide truly comprehensive answers.
Despite their remarkable capacity to process and generate human-like text, large language models exhibit a significant vulnerability known as ‘hallucination’ – the tendency to confidently present fabricated information as factual. This is particularly concerning within biomedical applications, where accuracy is paramount; a model might, for instance, suggest a nonexistent drug interaction or misrepresent the efficacy of a treatment. The root of this issue lies in the models’ probabilistic nature; they predict the most likely continuation of a text sequence, not necessarily the true continuation. Consequently, even models trained on vast datasets of scientific literature can generate plausible-sounding but entirely incorrect responses, demanding rigorous validation and the development of techniques to mitigate these ‘hallucinations’ before such systems can be reliably deployed in healthcare settings.
VerifAI: A Pragmatic Approach to Retrieval and Generation
VerifAI’s retrieval strategy employs a hybrid approach, integrating both lexical and semantic search methods to enhance the recall of relevant biomedical literature. Lexical search utilizes keyword matching against document text, providing precision but potentially missing conceptually similar information. Semantic search, conversely, leverages vector embeddings to identify documents with similar meaning, even if they lack shared keywords. By combining these techniques, VerifAI aims to maximize recall – the proportion of relevant documents retrieved – while maintaining a reasonable level of precision. The system’s implementation prioritizes identifying a comprehensive set of potentially relevant abstracts for subsequent processing by the generative component.
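The paper does not reproduce its retrieval code here, but the blend of lexical and semantic scoring can be sketched in a few lines of Python. The keyword-overlap scorer and character-trigram "embedding" below are deliberately simple stand-ins for a real BM25 index and a neural encoder, and the `alpha` mixing weight is an assumption for illustration, not a VerifAI parameter:

```python
from collections import Counter
import math

def lexical_score(query, doc):
    """Keyword-overlap score: a toy stand-in for BM25."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q) / max(len(q), 1)

def embed(text):
    """Toy character-trigram vector: a stand-in for a neural sentence encoder."""
    grams = Counter(text.lower()[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(v * v for v in grams.values())) or 1.0
    return {g: v / norm for g, v in grams.items()}

def semantic_score(query, doc):
    """Cosine similarity between the toy embeddings."""
    q, d = embed(query), embed(doc)
    return sum(q[g] * d.get(g, 0.0) for g in q)

def hybrid_rank(query, docs, alpha=0.5):
    """Blend both signals; higher alpha favours exact keyword matches."""
    scored = [(alpha * lexical_score(query, doc)
               + (1 - alpha) * semantic_score(query, doc), doc)
              for doc in docs]
    return [doc for _, doc in sorted(scored, reverse=True)]
```

The point of the blend is visible even in this toy: a document sharing only a few keywords can still rank highly if its overall wording is close to the query, and vice versa.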
The VerifAI system’s generative component utilizes the Mistral-7B-Instruct-v0.2 large language model within a Retrieval-Augmented Generation (RAG) framework. This approach first retrieves relevant abstracts from biomedical literature, and then the Mistral-7B-Instruct-v0.2 model synthesizes these retrieved passages into concise answers. The RAG framework enables the model to ground its responses in verifiable source material, improving the factual accuracy and reliability of the generated text. The model is specifically instructed to formulate answers directly based on the content of the retrieved abstracts, avoiding reliance on pre-existing parametric knowledge.
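A minimal sketch of how such grounding can be enforced at the prompt level. The wording and the `build_rag_prompt` helper are illustrative assumptions, not VerifAI's published prompt; the key idea is that abstracts are numbered so every claim in the answer can cite its source:

```python
def build_rag_prompt(question, abstracts):
    """Assemble a grounded prompt: the model is instructed to answer only
    from the numbered abstracts, so each claim can be traced to a source."""
    context = "\n".join(f"[{i}] {a}" for i, a in enumerate(abstracts, 1))
    return (
        "Answer the question using ONLY the abstracts below. "
        "Cite the abstract number for each claim. "
        "If the abstracts do not contain the answer, say so.\n\n"
        f"Abstracts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string would then be passed to the instruction-tuned model; instructing the model to admit when the abstracts are insufficient is one common way to discourage fallback to parametric knowledge.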
The generative component of VerifAI undergoes fine-tuning using the PQAref dataset, a corpus specifically curated for biomedical question answering. This process adapts the base ‘Mistral-7B-Instruct-v0.2’ model to the nuances of medical terminology and reasoning. Evaluation metrics demonstrate that fine-tuning with PQAref yields measurable improvements in both answer accuracy – assessed by metrics such as exact match and F1 score – and fluency, as judged by human evaluators. The dataset’s composition, including a diverse range of question types and biomedical topics, facilitates generalization and robust performance across varied query inputs.
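The exact-match and token-level F1 metrics mentioned above are standard in question answering; a minimal reference implementation (without the answer normalization steps, such as punctuation and article stripping, that full evaluation scripts usually add):

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if prediction and reference match after case-folding, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Token F1 gives partial credit, which matters for generative models whose answers rarely match a reference string verbatim.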

Detecting the Lies: Rigorous Verification of Factual Consistency
The verification component utilizes Natural Language Inference (NLI) to determine the logical connection between a generated claim and its corresponding supporting evidence. This process is implemented with a fine-tuned DeBERTa model, a transformer-based architecture selected for its performance in NLI tasks. The model is trained to classify the relationship as either entailment, contradiction, or neutral, with a primary focus on identifying instances of entailment – where the evidence logically supports the claim. By evaluating this relationship, the system aims to quantitatively assess the factual consistency of generated statements and flag potential inaccuracies before dissemination.
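The verification loop can be sketched as follows. Here `stub_nli` is a deliberately naive word-overlap stand-in for the fine-tuned DeBERTa classifier, and splitting on sentence boundaries is an assumed approximation of claim decomposition; the structure of the loop, not the stub, is the point:

```python
def split_claims(answer):
    """Naively split a generated answer into atomic claims, one per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def stub_nli(claim, evidence):
    """Stand-in for a DeBERTa NLI model: a claim counts as entailed only if
    all of its content words appear in the evidence (toy heuristic)."""
    ev = set(evidence.lower().split())
    words = [w for w in claim.lower().split() if len(w) > 3]
    return "entailment" if words and all(w in ev for w in words) else "neutral"

def verify_answer(answer, evidence, nli=stub_nli):
    """Label every claim; anything not entailed is flagged for review."""
    return {claim: nli(claim, evidence) for claim in split_claims(answer)}
```

In a real deployment the `nli` callable would wrap the fine-tuned transformer and also emit a "contradiction" label, but the per-claim bookkeeping is the same.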
The system’s evaluation of ‘Entailment’ involves a logical assessment of the relationship between a generated claim and its supporting evidence. This process determines whether the evidence provides sufficient justification for the claim’s validity; a positive entailment score indicates the evidence logically supports the claim, while a negative or neutral score signals a potential inconsistency. This evaluation is not a simple keyword match; the system analyzes semantic relationships to identify if the meaning of the evidence necessitates the truth of the claim, allowing for the detection of inaccuracies that might not be apparent through superficial textual comparisons.
The verification component’s performance in discerning factual consistency within biomedical texts is significantly improved through training on the SciFact dataset. This dataset, comprised of scientific claims and associated evidence, provides a robust training ground for the DeBERTa model used in the Natural Language Inference (NLI) process. Exposure to the SciFact dataset allows the model to better identify subtle relationships between claims and supporting text, specifically enhancing its ability to accurately assess entailment – whether the evidence logically supports the claim – in the complex terminology and nuanced reasoning characteristic of biomedical research. This focused training directly addresses the challenges of verifying factual accuracy in a domain requiring specialized knowledge and precise interpretation.

A Cautious Optimism: Performance and Future Directions
Rigorous evaluation of VerifAI on the BioASQ benchmark, a standardized test for biomedical question answering systems, reveals a substantial capacity for accurate information retrieval. The system achieves a Precision@10 of 23.7%, meaning that, on average, more than two of the top ten returned results directly answer the posed question. It further obtains a Mean Average Precision@10 of 42.7%, a metric that also rewards ranking correct answers near the top of the result list. These scores highlight VerifAI’s potential as a valuable tool for researchers and clinicians seeking reliable answers to complex biomedical queries, and suggest it effectively navigates the vast landscape of scientific literature to pinpoint pertinent information.
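For readers unfamiliar with these metrics, a minimal sketch of how Precision@10 and Average Precision@10 are computed (VerifAI's actual evaluation harness may differ in normalization details):

```python
def precision_at_k(relevant, ranked, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision_at_k(relevant, ranked, k=10):
    """Average of precision values at each rank where a relevant
    result appears, normalized by the number of relevant items."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], 1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0
```

Mean Average Precision@10 is then simply this value averaged over all benchmark questions, which is why it is sensitive to ranking quality in a way that raw Precision@10 is not.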
The system’s verification component demonstrates a robust capacity for discerning accurate biomedical information, achieving an overall accuracy of 81%. This performance notably surpasses that of current GPT-4 models when tasked with validating claims against provided evidence. Further analysis reveals strong performance in both identifying supporting evidence – with F1-scores ranging from 0.81 to 0.86 – and, crucially, in correctly flagging instances where no supporting evidence exists. This nuanced ability to assess veracity is vital for reliable question answering, minimizing the risk of disseminating potentially harmful misinformation within the biomedical domain and highlighting the system’s potential as a trustworthy resource for researchers and clinicians.
VerifAI’s architecture is intentionally constructed to be highly adaptable, enabling seamless incorporation of diverse biomedical databases and knowledge resources. This modular design isn’t merely a structural choice; it’s a key factor in the system’s potential for growth and enhanced performance. By decoupling core functionalities, VerifAI can readily assimilate new data streams – from genomic repositories and clinical trial results to specialized protein interaction databases – without requiring fundamental code revisions. This facilitates continuous learning and allows the system to stay current with the rapidly expanding landscape of biomedical knowledge, ultimately broadening its capacity to answer complex questions and provide more comprehensive insights.
The pursuit of ‘verifiable AI,’ as detailed in this paper, feels…familiar. It’s a Sisyphean task, really. One builds elaborate systems to check the outputs of other systems, all to avoid the inevitable ‘hallucination’ – a politely named failure. As Bertrand Russell observed, “The problem with the world is that everyone is an expert in everything.” This rings painfully true; everyone thinks the LLM is giving a definitive answer, until it confidently states something demonstrably false. VerifAI attempts to build a safety net, a verification engine to catch these errors, but one suspects that production will quickly find new and inventive ways to circumvent even the most rigorous checks. It’s an endless cycle; today’s innovation is tomorrow’s technical debt, neatly repackaged.
What’s Next?
The pursuit of verifiable AI, as exemplified by VerifAI, inevitably encounters the limitations of its components. While the framework addresses hallucination through retrieval and verification, the underlying fragility of natural language inference remains. Production deployments will undoubtedly reveal edge cases where even carefully constructed verification chains fail – a beautifully crafted abstraction collapsing under the weight of real-world ambiguity. The current emphasis on biomedical questions, while practical, begs the question of domain transferability; will the verification engine scale elegantly to fields with less structured knowledge?
A significant challenge lies not in improving verification accuracy, but in defining ‘truth’ itself. Biomedical knowledge is constantly evolving; a ‘verified’ answer today may become obsolete tomorrow. Future work must grapple with temporal uncertainty and the inherent provisionality of scientific understanding. Simply flagging outdated information isn’t enough; the system needs mechanisms for acknowledging and incorporating evolving consensus.
Ultimately, the true test of VerifAI – and all similar systems – won’t be its performance on benchmark datasets, but its behavior when faced with genuinely novel, adversarial queries. Every abstraction dies in production, and the manner of its demise will reveal the fundamental limits of verifiable AI. The goal, then, isn’t to eliminate errors, but to design systems that fail gracefully – and predictably.
Original article: https://arxiv.org/pdf/2604.08549.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-13 23:28