Author: Denis Avetisyan
A new study explores how combining text and image retrieval can improve the accuracy of complex question answering systems in the specialized field of glycobiology.

Researchers evaluate different retrieval-augmented generation strategies, including multi-modal approaches, for biomedical question answering, demonstrating trade-offs between simplicity and reasoning capacity.
Despite advances in large language models, effectively integrating visual information remains a challenge in knowledge-intensive biomedical domains. This is explored in ‘Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology’, which investigates optimal strategies for incorporating figures and tables into retrieval-augmented generation (RAG) pipelines. The results demonstrate a trade-off between simplifying pipelines by converting visuals to text, which benefits mid-size models, and leveraging the reasoning capacity of frontier models with direct visual retrieval. As multi-modal RAG matures, will increasingly efficient visual retrievers and stronger generators unlock even more effective knowledge synthesis in complex scientific fields?
The Illusion of Fluency: Exposing the Limits of LLMs
Large Language Models have demonstrated a remarkable capacity for generating human-quality text, mastering grammar, and even mimicking diverse writing styles. However, this fluency often masks a fundamental limitation: a struggle with tasks demanding precise recall of specific knowledge. While proficient at identifying patterns within the text they’ve been trained on, LLMs frequently falter when required to synthesize information from external sources or apply nuanced contextual understanding. This isn’t a matter of lacking information entirely, but rather an inability to reliably retrieve and integrate the right knowledge at the moment it’s needed, leading to inaccuracies or generic responses in knowledge-intensive applications. The models essentially excel at ‘knowing what words usually follow other words’ but can struggle when deeper reasoning and accurate contextual grounding are paramount.
Conventional Large Language Models, while proficient in generating human-like text, encounter significant hurdles when tasked with complex reasoning within specialized fields or those heavily reliant on visual information. These models are trained on vast datasets of general knowledge, but often lack the nuanced understanding required to accurately interpret data from domains like medical imaging, scientific visualizations, or engineering schematics. This limitation stems from their reliance on statistical correlations within text, rather than a genuine comprehension of underlying principles; consequently, they may struggle with tasks demanding precise inferences, detailed analysis of visual cues, or application of domain-specific expertise. The result is a susceptibility to errors and inconsistencies when confronted with information outside their broad, generalized training, highlighting the need for approaches that can effectively integrate external knowledge and specialized data types.
Retrieval-Augmented Generation, or RAG, represents a significant advancement in addressing the knowledge limitations of large language models. Instead of relying solely on the parameters learned during training, RAG dynamically integrates information retrieved from external sources during the text generation process. This allows the model to access and incorporate up-to-date or highly specialized knowledge, improving the accuracy and relevance of its responses. By effectively bridging the gap between a model’s pre-existing knowledge and a vast external information landscape, RAG empowers LLMs to tackle complex, knowledge-intensive tasks with greater fidelity and nuance. The technique doesn’t alter the core LLM itself, but augments its capabilities by providing a constantly updated, relevant knowledge base for each query, ultimately enhancing its performance on tasks demanding precise contextual understanding.
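The retrieve-then-generate loop at the heart of RAG can be summarized in a few lines. The sketch below is purely illustrative: the `embed`, `vector_store`, and `llm` objects are hypothetical placeholders standing in for whatever embedding model, vector database, and language model a concrete pipeline uses.

```python
# Minimal retrieve-then-generate loop. Illustrative only: `embed`,
# `vector_store`, and `llm` are hypothetical interfaces, not a specific library.

def answer_with_rag(question: str, embed, vector_store, llm, top_k: int = 5) -> str:
    # 1. Embed the query into the same vector space as the indexed documents.
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar passages from the external knowledge base.
    passages = vector_store.search(query_vector, limit=top_k)

    # 3. Assemble a prompt that grounds the model in the retrieved evidence.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 4. Generate the answer; the base LLM is unchanged, only its input is augmented.
    return llm.generate(prompt)
```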
While Retrieval-Augmented Generation (RAG) offers a promising path to enhance Large Language Model performance, simplistic implementations frequently underutilize the potential of multi-modal data. Many RAG systems treat images, audio, or video as mere supplementary information, failing to deeply integrate these inputs with the textual context. This often results in superficial connections, where the model can identify the presence of a visual element but struggles to accurately reason about its implications or integrate it into a coherent response. Consequently, the effectiveness of RAG is significantly curtailed when dealing with complex, real-world scenarios where understanding requires a nuanced interpretation of diverse data types, rather than simply retrieving related text fragments.
Beyond Text: Architecting Multi-Modal Reasoning
Multi-Modal Retrieval-Augmented Generation (MM-RAG) builds upon the principles of Retrieval-Augmented Generation (RAG) by expanding its capabilities to encompass data beyond text. Traditional RAG systems primarily utilize textual data sources for knowledge retrieval; MM-RAG extends this to include other modalities, most notably images, but potentially also audio or video. This integration requires the system to process and understand information represented in these different formats, converting them into a common embedding space for efficient similarity search. By combining information retrieved from both textual and visual sources, MM-RAG aims to provide more comprehensive and contextually relevant responses than systems limited to text alone, enabling applications requiring understanding of multi-modal inputs.
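One common way to place text and images in a shared embedding space is a CLIP-style dual encoder. The sketch below uses the `openai/clip-vit-base-patch32` checkpoint from Hugging Face transformers as an assumed example; the paper itself does not prescribe this particular model, and the query strings and figure path are hypothetical.

```python
# Sketch of a shared text-image embedding space with a CLIP-style dual encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["N-linked glycosylation pathway", "mass spectrum of a glycan"]
image = Image.open("figure_1.png")  # hypothetical figure extracted from a paper

with torch.no_grad():
    text_inputs = processor(text=texts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# Normalize and compare: both modalities now live in the same vector space,
# so a text query can retrieve figures (and vice versa) by cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(text_emb @ image_emb.T)
```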
Effective multi-modal Retrieval-Augmented Generation (MM-RAG) necessitates accurate extraction of textual content from visual documents, a process often handled by specialized document parsing tools. Systems like Docling utilize Optical Character Recognition (OCR) combined with layout analysis to identify and extract text, tables, and figures from document images or PDFs. This extracted content is then converted into a structured format suitable for embedding and indexing. Robust parsing is critical because the quality of extracted text directly impacts the relevance of retrieved information and, consequently, the performance of the downstream LLM. Failure to accurately parse visual documents can lead to incomplete or erroneous data being incorporated into the knowledge base, hindering the system’s ability to provide accurate and contextually relevant responses.
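As a rough sketch of this parsing step, Docling exposes a `DocumentConverter` that parses a PDF and exports the recovered structure (text, tables, figure captions) to Markdown for downstream chunking and embedding. The interface shown follows Docling's documented quickstart and may differ across versions; the source file name is hypothetical.

```python
# Hedged sketch of parsing a PDF with Docling before indexing.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("glycobiology_review.pdf")  # hypothetical source file

# Export the parsed layout (text, tables, figure captions) to Markdown,
# which can then be chunked and embedded for retrieval.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])
```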
Efficient vector databases are essential for multi-modal Retrieval-Augmented Generation (MM-RAG) systems due to the high dimensionality and volume of embeddings generated from both text and image data. These databases, such as Qdrant, utilize approximate nearest neighbor (ANN) search algorithms to enable rapid retrieval of relevant information from massive datasets, which is critical for maintaining low latency in RAG applications. Unlike traditional databases, vector databases are optimized for similarity search, allowing them to identify data points with the closest vector representations, even if an exact match is not present. Scalability is also a key consideration; these databases must be capable of handling billions of vectors and supporting high query throughput to facilitate real-time responses in complex MM-RAG pipelines.
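A minimal Qdrant sketch, assuming the official `qdrant-client` Python package: create a collection, upsert a few toy vectors with payloads, and run an approximate nearest-neighbour search. The collection name, vector size, and values are illustrative only; real pipelines would store the high-dimensional embeddings produced by the text or vision encoder.

```python
# Minimal Qdrant indexing/search sketch (toy 4-dimensional vectors for brevity).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-memory instance for demonstration

client.create_collection(
    collection_name="glyco_chunks",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="glyco_chunks",
    points=[
        PointStruct(id=1, vector=[0.1, 0.9, 0.2, 0.0], payload={"text": "Figure 2 caption ..."}),
        PointStruct(id=2, vector=[0.8, 0.1, 0.3, 0.1], payload={"text": "Table 1: glycan classes ..."}),
    ],
)

# Approximate nearest-neighbour search over the stored vectors.
hits = client.search(
    collection_name="glyco_chunks",
    query_vector=[0.1, 0.8, 0.25, 0.05],
    limit=1,
)
print(hits[0].payload["text"])
```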
The performance of Multi-Modal Retrieval-Augmented Generation (MM-RAG) systems is fundamentally dependent on Vision-Language Models (VLMs) that can process and correlate information from both visual and textual inputs. These models utilize architectures, such as transformers, pre-trained on extensive datasets containing paired image and text data, enabling them to generate contextualized embeddings representing the combined semantic meaning of both modalities. Effective VLMs must not only identify objects and scenes within images but also understand the relationships between visual elements and corresponding textual descriptions, allowing for accurate cross-modal retrieval and generation of relevant responses. The ability to reason across modalities, inferring information from images based on textual queries and vice versa, is critical for the success of MM-RAG in complex tasks requiring integrated understanding of diverse data types.
Granular Relevance: Deconstructing the Search for Precision
Traditional information retrieval systems commonly employ techniques like term frequency-inverse document frequency (TF-IDF) or bag-of-words models, which represent documents and queries as aggregated statistics of their constituent terms. This coarse-grained approach treats documents as collections of keywords, disregarding the semantic relationships between terms and the contextual information crucial for understanding complex data. Consequently, these methods often fail to identify documents that are relevant conceptually but differ in wording or contain nuanced information not directly captured by simple keyword matching. This limitation becomes particularly pronounced when dealing with specialized domains or data types where subtle distinctions are significant, leading to reduced precision and recall in retrieval tasks.
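The toy example below, using scikit-learn's `TfidfVectorizer`, illustrates this coarse-grained view: each document collapses into a single term-weight vector, so a query that paraphrases a relevant passage without sharing its surface vocabulary scores poorly. The sentences are invented for illustration.

```python
# Toy illustration of the bag-of-words limitation: similarity depends on
# shared surface terms, not meaning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Sialic acid residues terminate many N-linked glycans.",
    "Terminal monosaccharides often cap the branches of glycoproteins.",
]
query = ["Which sugars cap N-glycan branches?"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Low scores for conceptually relevant documents that use different wording.
print(cosine_similarity(query_vector, doc_vectors))
```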
Late Interaction models address limitations in traditional retrieval by moving the computation of similarity scores to a more granular level. Instead of representing an entire query or document with a single vector, these models, exemplified by ColBERT, operate on token or sub-word level representations. This allows for the comparison of individual terms within the query to those within the document, capturing nuanced semantic relationships that would be lost in coarse-grained methods. Specifically, ColBERT encodes both queries and documents into contextualized token embeddings, then computes a maximum similarity score between each query token and each document token. The final document score is derived from the aggregate of these fine-grained comparisons, enabling a more precise assessment of relevance than methods relying on document-level vector representations.
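The core of this late interaction scoring can be sketched in a few lines: for each query token, take its maximum cosine similarity over all document tokens, then sum the per-token maxima. The snippet below uses random embeddings purely for illustration and is not the actual ColBERT implementation.

```python
# MaxSim-style late interaction scoring (conceptual sketch with random embeddings).
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (num_query_tokens, dim); doc_tokens: (num_doc_tokens, dim)
    q = torch.nn.functional.normalize(query_tokens, dim=-1)
    d = torch.nn.functional.normalize(doc_tokens, dim=-1)
    sim = q @ d.T  # cosine similarity between every query token and every document token
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=-1).values.sum()

query = torch.randn(6, 128)    # e.g. 6 query tokens, 128-dim embeddings
doc_a = torch.randn(300, 128)  # token embeddings of document A
doc_b = torch.randn(300, 128)  # token embeddings of document B
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```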
Visual document retrieval models, including ColFlor, ColPali, and ColQwen, extend the principles of late interaction to accommodate the unique characteristics of visual data. Unlike traditional methods that may operate on aggregated features, these models compute document-query relevance at a fine-grained level, considering individual visual elements or patches. This approach is particularly beneficial for visual search tasks where nuanced details and contextual understanding are crucial for accurate retrieval. By leveraging late interaction, these models can effectively capture the relationships between query terms and specific visual components within documents, leading to improved performance compared to coarse-grained matching techniques.
Late interaction models, including ColFlor and ColPali, have shown performance gains in knowledge-intensive tasks specifically within the Glycobiology domain. When integrated with GPT-5, both ColFlor and ColPali achieved an accuracy score of 0.828. This indicates a significant improvement in retrieving relevant information and supporting complex reasoning tasks within this specialized scientific field, demonstrating the effectiveness of fine-grained similarity computations for nuanced data analysis.
Quantifying Intelligence: Assessing and Refining RAG Performance
The efficacy of Retrieval-Augmented Generation (RAG) systems is fundamentally assessed through performance benchmarks established by leading Large Language Models (LLMs). Both proprietary models, such as GPT-4o and GPT-5, and increasingly powerful open-source alternatives like Gemma-3-27B-IT, provide critical reference points for evaluating a RAG system’s ability to accurately and relevantly respond to queries. These LLMs act as a standardized measure; a RAG system’s output is often compared against the direct response of these models to the same prompt, allowing researchers to quantify improvements in context utilization and answer quality. The use of diverse LLMs as benchmarks is crucial, as performance can vary depending on model architecture, training data, and inherent biases, ensuring a robust and comprehensive evaluation of RAG system capabilities.
VisRAG establishes a novel framework for multi-modal question answering by integrating a Vision Language Model (VLM) directly into the Retrieval-Augmented Generation (RAG) pipeline. This approach moves beyond traditional text-based RAG systems by enabling the model to not only understand text but also to interpret visual information within documents – such as charts, diagrams, and images – as part of the retrieval process. By grounding responses in both textual and visual context, VisRAG demonstrably improves accuracy and relevance in answering complex questions that require the synthesis of information from multiple modalities. Its architecture serves as a powerful benchmark for evaluating RAG systems operating on visual data and paves the way for more sophisticated and versatile question-answering applications, particularly in domains like document understanding and information retrieval from visual archives.
The successful integration of visual documents into Retrieval-Augmented Generation (RAG) systems hinges on the initial step of Optical Character Recognition (OCR). This technology converts images containing text, such as scanned documents, PDFs, or screenshots, into machine-readable text formats. Without accurate OCR, the valuable information embedded within these visual sources remains inaccessible to the LLM, hindering its ability to provide informed responses. Modern OCR engines employ sophisticated algorithms, including deep learning models, to overcome challenges like varying font styles, image distortions, and low-resolution scans, ensuring high-fidelity text extraction. This preprocessing stage is not merely a technical requirement, but a fundamental enabler for building truly multi-modal RAG pipelines capable of leveraging the full spectrum of available data.
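As a hedged sketch of this preprocessing step, the snippet below uses pytesseract, a common open-source OCR wrapper; the paper does not prescribe a particular OCR engine, and the input file name is hypothetical.

```python
# Hedged OCR sketch: convert a scanned page or extracted figure into
# machine-readable text so it can be chunked, embedded, and indexed.
from PIL import Image
import pytesseract

page_image = Image.open("scanned_page.png")  # hypothetical input image
extracted_text = pytesseract.image_to_string(page_image)
print(extracted_text[:300])
```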
The performance of Retrieval-Augmented Generation (RAG) systems benefits significantly from the fine-tuning of Large Language Models (LLMs) on specialized datasets, allowing them to more effectively utilize retrieved information for accurate response generation. On the retrieval side, the ColFlor model showcases substantial efficiency gains: paired with GPT-5, it matches ColPali’s accuracy of 0.828 while being seventeen times smaller than ColPali. Furthermore, multi-modal augmentation applied to GPT-4o attains an accuracy of 0.808, highlighting the potential for optimizing LLMs not only for size and efficiency but also for enhanced contextual understanding and precise response delivery within RAG architectures.
The pursuit of reliable knowledge, as demonstrated in this exploration of multi-modal retrieval-augmented generation, echoes a fundamental tenet of computational integrity. Ken Thompson famously stated, “Software is only ever as good as its testing.” This sentiment directly applies to the evaluation of RAG pipelines in glycobiology; a system’s efficacy isn’t solely determined by its architectural complexity, but by its demonstrable ability to consistently and accurately retrieve and synthesize information. The study highlights a trade-off between pipeline simplicity and reasoning capacity, mirroring the need for rigorous testing to validate the reproducibility of results – a core principle in ensuring the reliability of any complex system, be it software or a knowledge retrieval mechanism.
What Lies Ahead?
The presented work, while demonstrating the potential of multi-modal retrieval for specialized knowledge domains, merely scratches the surface of a fundamental problem. The observed trade-off between pipeline complexity and reasoning capacity is not a peculiarity of glycobiology, but a symptom of a deeper issue: current large language models often excel at syntactic manipulation rather than genuine semantic understanding. Augmenting these models with retrieved information is, at best, a palliative, not a cure.
Future efforts should not focus solely on optimizing retrieval strategies or expanding multi-modal inputs. Rather, the field must grapple with the question of how to imbue these systems with verifiable reasoning capabilities. The pursuit of ever-larger datasets and model parameters, without concomitant advances in formal verification and logical consistency, risks creating increasingly sophisticated – yet ultimately unreliable – oracles. Optimization without analysis is self-deception, a trap for the unwary engineer.
A fruitful avenue lies in exploring the integration of symbolic reasoning systems with large language models, potentially leveraging the strengths of both paradigms. This demands a shift in evaluation metrics, moving beyond simple accuracy scores to assess the logical soundness and explainability of generated answers. Only then can the promise of true knowledge-based reasoning in the biomedical domain – and beyond – be realized.
Original article: https://arxiv.org/pdf/2512.16802.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/