Beyond Search: Building Smarter Chatbots with Document Intelligence

Author: Denis Avetisyan


A new framework, HybridRAG, enhances chatbot responses by proactively preparing answers from unstructured documents, even scanned PDFs.

HybridRAG utilizes pre-generated question-answer pairs and hierarchical chunking to improve Retrieval-Augmented Generation (RAG) performance with raw, unstructured data.

While Retrieval-Augmented Generation (RAG) has become a leading paradigm for knowledge-grounded chatbots, its reliance on structured data and on-the-fly processing limits scalability in real-world applications. This paper introduces ‘HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents’, a novel framework that proactively generates a question-answer knowledge base from unstructured documents, including scanned PDFs processed via Optical Character Recognition, to accelerate response times and improve accuracy. By pre-computing answers, HybridRAG minimizes reliance on costly real-time generation, offering both reduced latency and enhanced performance as demonstrated on the OHRBench dataset. Could this approach unlock more practical and efficient chatbot deployments capable of handling large volumes of complex, unstructured information?


The Illusion of Understanding: LLMs and Unstructured Data

Despite the remarkable advancements in Large Language Models (LLMs) and their proficiency in generating coherent text, extracting meaningful insights from unstructured documents remains a persistent challenge. These models, while adept at pattern recognition and language prediction, often struggle with the inherent complexities of real-world data: information lacking a predefined format, such as reports, emails, or legal contracts. The difficulty lies not in processing text itself, but in deciphering the context, relationships, and relevant details embedded within varying layouts, inconsistent formatting, and ambiguous structures. Consequently, simply feeding these documents into an LLM frequently yields incomplete or inaccurate results, demanding sophisticated pre-processing techniques and specialized architectures to bridge the gap between raw data and actionable knowledge.

Conventional techniques for extracting data from documents frequently falter when faced with the inherent variability of real-world layouts. These methods typically demand substantial pre-processing – involving steps like optical character recognition, table detection, and zone identification – to impose order on what is often a chaotic arrangement of text and graphics. This elaborate preparation not only adds significant computational overhead but also introduces fragility; even slight deviations in document formatting can necessitate re-tuning the pre-processing pipeline. Consequently, scaling these systems to handle large volumes of diverse documents, or achieving the speed required for real-time applications like instant document search, proves exceptionally challenging. The reliance on rigid, pre-defined rules limits adaptability and creates a bottleneck in leveraging the wealth of information contained within unstructured data.

HybridRAG: A Pragmatic Approach to Document Understanding

HybridRAG is a Retrieval-Augmented Generation (RAG) framework engineered to address the challenges inherent in processing unstructured documents. Unlike traditional RAG systems often optimized for structured data, HybridRAG incorporates techniques to effectively handle the complexities of formats like PDFs, scans, and images. This practical approach focuses on extracting meaningful information from these sources, enabling more accurate and contextually relevant responses from large language models. The framework’s design prioritizes adaptability to diverse document types and layouts, a crucial factor in real-world applications where data is rarely perfectly formatted.

HybridRAG utilizes Optical Character Recognition (OCR) to convert scanned documents or images containing text into machine-readable text. This process is crucial for processing documents that are not natively digital. Following OCR, Layout Analysis is employed to determine the structural elements within the document, such as headings, paragraphs, tables, and lists. This structural understanding enables accurate text extraction and maintains document context, which is vital for downstream tasks like question answering and information retrieval. The combination of OCR and Layout Analysis ensures that text is not only extracted but also understood in relation to its original document formatting, improving the quality and relevance of retrieved information.
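To make that pipeline concrete, the sketch below shows how a scanned PDF might be converted into positioned text blocks. The paper does not name its OCR stack, so the use of pdf2image and pytesseract here, along with Tesseract's block numbers standing in for full layout analysis, is an illustrative assumption rather than the authors' implementation.

```python
# OCR + coarse layout sketch: convert scanned pages to text blocks.
# pdf2image and pytesseract are assumptions; the paper does not
# specify its OCR stack or layout-analysis method.
from pdf2image import convert_from_path
import pytesseract
from pytesseract import Output

def ocr_pdf_with_layout(pdf_path: str) -> list[dict]:
    """Return per-page text blocks, using Tesseract's block numbers
    as a crude stand-in for full layout analysis."""
    pages = convert_from_path(pdf_path, dpi=300)
    blocks = []
    for page_num, image in enumerate(pages, start=1):
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
        grouped: dict[int, list[str]] = {}
        for word, block_id, conf in zip(data["text"], data["block_num"], data["conf"]):
            if word.strip() and float(conf) > 0:  # drop empty/low-confidence tokens
                grouped.setdefault(block_id, []).append(word)
        for block_id, words in sorted(grouped.items()):
            blocks.append({"page": page_num, "block": block_id, "text": " ".join(words)})
    return blocks
```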

Hierarchical chunking improves document retrieval performance by recursively dividing documents into smaller, contextually relevant segments, moving from broad sections to increasingly granular chunks. This approach addresses the limitations of fixed-size chunking, which can disrupt semantic context. The process facilitates more accurate semantic matching when combined with embedding models such as BGE-M3, a model trained to generate dense vector embeddings that capture the semantic meaning of text. BGE-M3’s architecture is optimized for both sentence and passage retrieval, resulting in improved recall and precision in identifying relevant document segments during the retrieval stage of a RAG pipeline. These embeddings are then used to calculate the similarity between the query and each document chunk, enabling the system to retrieve the most semantically related information.
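A minimal sketch of the idea follows. The separator hierarchy, the 512-character budget, and the input path are assumptions, since the paper does not publish its exact splitting rules; BGE-M3 is loaded here through sentence-transformers.

```python
# Hierarchical chunking sketch: recurse from coarse separators
# (sections) to fine ones (sentences) until chunks fit a budget.
# The separators and 512-character budget are illustrative assumptions.
from sentence_transformers import SentenceTransformer

SEPARATORS = ["\n\n", "\n", ". "]  # sections -> paragraphs -> sentences

def hierarchical_chunks(text: str, max_chars: int = 512, level: int = 0) -> list[str]:
    if len(text) <= max_chars or level >= len(SEPARATORS):
        return [text.strip()] if text.strip() else []
    chunks: list[str] = []
    for piece in text.split(SEPARATORS[level]):
        if len(piece) > max_chars:
            chunks.extend(hierarchical_chunks(piece, max_chars, level + 1))
        elif piece.strip():
            chunks.append(piece.strip())
    return chunks

# Embed chunks with BGE-M3 and rank them against a query by cosine
# similarity (a dot product, since the embeddings are normalized).
model = SentenceTransformer("BAAI/bge-m3")
document_text = open("document.txt").read()  # hypothetical path: any extracted text, e.g. OCR output
chunks = hierarchical_chunks(document_text)
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode("What does section 3 conclude?", normalize_embeddings=True)
ranked = sorted(zip(chunk_vecs @ query_vec, chunks), reverse=True)[:3]
```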

Validation on OHRBench: A Necessary, Though Imperfect, Test

The OHRBench dataset was utilized for a comprehensive evaluation of the HybridRAG framework. This benchmark presents a significant challenge for question answering systems due to its composition of unstructured documents, requiring robust information retrieval and generation capabilities. OHRBench is designed to assess performance on complex queries that necessitate reasoning over multiple document segments, and its difficulty stems from the need to accurately identify relevant information within these unstructured sources. The dataset’s structure and content allow for a granular analysis of a system’s ability to handle real-world document retrieval and question answering scenarios, making it a suitable testbed for evaluating the HybridRAG framework’s performance characteristics.

Evaluation of the HybridRAG framework utilized established metrics for question answering performance, including ROUGE-L, which measures longest common subsequence overlap between generated and reference answers; F1-score, representing the harmonic mean of precision and recall; and BERTScore, a metric leveraging pre-trained language models to assess semantic similarity. Results indicate that HybridRAG consistently outperforms baseline Retrieval-Augmented Generation (RAG) approaches across these metrics, demonstrating statistically significant improvements in answer quality and relevance as determined by quantitative assessment on the OHRBench dataset.
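For reference, the first two metrics can be computed in a few lines. This is a self-contained sketch following the standard definitions, not the paper's evaluation script; BERTScore is omitted because it requires a pretrained model (e.g., via the bert-score package), and tokenization details may differ from the authors' setup.

```python
# Reference implementations of token-level F1 and ROUGE-L.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure via longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```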

Evaluation of the HybridRAG framework on the Administration domain of the OHRBench dataset demonstrated a 19% improvement in F1-score when compared to a Simplified HybridRAG configuration utilizing the Qwen2.5 large language model. This performance gain indicates a substantial enhancement in the precision and recall of answer retrieval and generation within the Administration domain, specifically attributable to the features incorporated into the full HybridRAG framework over the simplified baseline. The F1-score, a harmonic mean of precision and recall, serves as a key metric for evaluating the accuracy of question answering systems on this benchmark.

Evaluation on the Finance domain of the OHRBench dataset revealed a 22% improvement in F1-score when utilizing the HybridRAG framework compared to a Simplified HybridRAG configuration employing the Qwen2.5 large language model. This performance gain indicates a substantial enhancement in the precision and recall of answer retrieval and generation specifically within financial documents, demonstrating the effectiveness of the HybridRAG approach for complex domain-specific information retrieval and response generation.

QA Pre-Generation enhances the retrieval component of HybridRAG by proactively generating potential question-answer pairs from the document corpus prior to query processing. This pre-generation step creates a more comprehensive and nuanced index of document content, enabling the system to identify and retrieve passages more effectively aligned with user queries. By anticipating potential questions, the framework improves both the quality and relevance of retrieved passages, ultimately leading to more accurate and informative generated answers compared to standard retrieval-augmented generation methods.
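A sketch of that pre-generation loop appears below. The prompt wording and the ask_llm helper are hypothetical placeholders for whatever LLM client is in use; the paper's actual prompts are not reproduced here.

```python
# QA pre-generation sketch: ask an LLM, offline, to propose Q&A pairs
# for each chunk, keeping provenance so answers stay grounded.
# `ask_llm` is a hypothetical helper wrapping any LLM client.
import json

PROMPT = (
    "Generate three question-answer pairs that this passage can answer.\n"
    'Return a JSON list of {{"question": ..., "answer": ...}} objects.\n\n'
    "Passage:\n{chunk}"
)

def pregenerate_qa(chunks: list[str], ask_llm) -> list[dict]:
    qa_pairs = []
    for chunk in chunks:
        raw = ask_llm(PROMPT.format(chunk=chunk))
        for pair in json.loads(raw):  # assumes the LLM returns valid JSON
            pair["source_chunk"] = chunk  # provenance for later grounding
            qa_pairs.append(pair)
    return qa_pairs
```

At query time, the stored questions, rather than the raw chunks, become the retrieval targets: a user query that resembles a pre-generated question can be answered directly from the cached pair, skipping live generation entirely.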

Evaluations using the OHRBench dataset demonstrate that the HybridRAG framework achieves a Context-Question-Answer Relevance (CQAR) score of 0.24. This metric assesses the degree to which the retrieved context, the original question, and the generated answer are logically aligned and mutually supportive. The observed CQAR score of 0.24 represents a significant improvement over typical question-answer pairs within the OHRBench benchmark, indicating that HybridRAG consistently retrieves more pertinent contextual information, leading to more relevant and focused responses.

The HybridRAG framework was tested with multiple Large Language Models (LLMs) to demonstrate its adaptability across different architectures and capabilities. Specifically, experiments incorporated Llama3, Qwen2.5, and GPT-4o, allowing for a comparative assessment of performance irrespective of the underlying LLM. Results indicated that HybridRAG consistently improved performance metrics across these diverse LLMs, confirming its ability to effectively leverage various language models for enhanced question answering over unstructured documents and showcasing a lack of dependency on a specific LLM implementation.

Evaluation using the OHRBench dataset demonstrates that HybridRAG generates responses with improved qualitative characteristics compared to standard question-answer pairs within the benchmark. Specifically, HybridRAG achieved an Answerability score of 0.31, indicating a higher propensity for providing responses that directly address the posed question. Furthermore, the framework attained Clarity and Fluency scores of 0.50 and 0.76 respectively, signifying that generated answers are more readily understandable and exhibit a higher degree of natural language quality when compared to typical OHRBench outputs.

Beyond the Benchmark: Real-World Implications and Future Directions

HybridRAG represents a notable step forward in the field of document understanding, effectively bridging the gap between information stored in unstructured formats and the need for precise, efficient retrieval. Traditional methods often struggle with the inherent ambiguity and complexity of raw text, leading to inaccurate or incomplete results; however, this framework leverages a hybrid approach to knowledge retrieval, combining the strengths of both retrieval-based and generative models. This allows for a more nuanced comprehension of document content, enabling the system to not only locate relevant information but also synthesize it into coherent and insightful responses. The resulting improvement in accuracy and speed has the potential to transform workflows across numerous industries reliant on processing large volumes of unstructured data, from legal and financial sectors to scientific research and beyond.

The enhanced document understanding facilitated by HybridRAG extends far beyond simple information retrieval, promising to reshape workflows across multiple critical sectors. In legal discovery, the framework’s ability to accurately pinpoint relevant passages within vast document collections can dramatically reduce review times and associated costs, while simultaneously improving the precision of evidence gathering. Financial analysts stand to benefit from more efficient processing of regulatory filings and market reports, enabling faster identification of key trends and risks. Perhaps most significantly, scientific research can be accelerated through the rapid synthesis of information from a growing body of publications, allowing researchers to quickly build upon existing knowledge and formulate new hypotheses. This capability isn’t merely about finding information; it’s about unlocking insights previously buried within unstructured data, fostering innovation and informed decision-making across diverse disciplines.

HybridRAG exhibits a high degree of accuracy in retrieving previously stored responses, a crucial aspect of effective information recall. Evaluations demonstrate the system correctly identifies 80% of stored responses when a similarity threshold of 0.7 is applied, indicating a strong ability to match new queries to relevant past answers. With a more stringent similarity requirement (a threshold of 0.9), accuracy falls to 13%, illustrating the expected trade-off between match precision and cache hit rate. This threshold-gated handling of stored responses positions HybridRAG as a practical solution for applications demanding accurate and consistent information retrieval from large knowledge bases.
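The threshold-gated lookup behind these figures can be sketched in a few lines. Here the embed function is assumed to return L2-normalized vectors (for instance from BGE-M3), and the fallback path to live generation is implied rather than shown.

```python
# Threshold-gated cache lookup: serve a pre-generated answer only when
# the incoming query is close enough to a stored question.
# `embed` is an assumed helper returning L2-normalized vectors.
import numpy as np

def lookup_cached_answer(query: str, questions: list[str], answers: list[str],
                         embed, threshold: float = 0.7) -> str | None:
    q_vec = np.asarray(embed([query]))[0]
    sims = np.asarray(embed(questions)) @ q_vec  # cosine similarity on unit vectors
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return answers[best]  # confident hit: reuse the cached answer
    return None               # miss: fall back to live RAG generation
```

Raising the threshold from 0.7 to 0.9 trades recall for precision, consistent with the drop from 80% to 13% reported above.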

Ongoing development of HybridRAG prioritizes refinement through adaptive chunking strategies, aiming to dynamically adjust document segmentation based on content complexity and query relevance. This will move beyond fixed-size chunks to optimize information capture and reduce noise. Simultaneously, researchers are actively integrating multimodal data – including images, audio, and video – into the framework. This expansion promises a richer understanding of information sources, enabling HybridRAG to answer queries that require analysis beyond textual content. The successful incorporation of these elements is expected to significantly broaden the application scope, fostering more nuanced and comprehensive responses across various domains.

Continued refinement of HybridRAG’s individual components promises to unlock even greater performance gains in Retrieval-Augmented Generation. Current development prioritizes algorithmic efficiency and scalability, aiming to reduce computational costs without sacrificing accuracy. Crucially, the framework’s true potential will be demonstrated through rigorous testing on a broader spectrum of datasets – encompassing varying document lengths, subject matter complexities, and data formats. This comprehensive evaluation will not only benchmark HybridRAG against existing RAG solutions but also identify specific areas for targeted improvement, ultimately positioning it as a robust and versatile tool for a wide range of information retrieval tasks and solidifying its role at the forefront of the field.

The pursuit of elegant chatbot frameworks invariably encounters the brutal realities of production. HybridRAG, with its pre-generated Q&A pairs and hierarchical chunking, attempts to tame the chaos inherent in unstructured documents – scanned PDFs, no less. It’s a valiant effort, yet one can’t help but anticipate the edge cases, the oddly formatted documents, and the queries that will inevitably break the pre-computed knowledge. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” This framework, for all its technical sophistication, is still fundamentally dependent on the quality and consistency of the underlying data – a messy, social creation indeed. The bug tracker will, undoubtedly, become a chronicle of unanticipated document quirks. They don’t deploy – they let go.

What’s Next?

The promise of effortlessly querying unstructured data with Large Language Models continues to accelerate, and HybridRAG represents a logical, if predictably complex, step. The pre-generation of question-answer pairs is a concession to reality; anyone who’s spent more than an hour with LLMs knows ‘context window’ is just a polite term for ‘limited patience.’ It’s a band-aid, certainly, but a pragmatic one. The inevitable question, of course, is scale. This framework will work beautifully on a curated dataset of ten PDFs. Then production will happen, and suddenly it’s ten thousand scanned, poorly formatted, occasionally handwritten documents. They’ll call it AI and raise funding, naturally.

The hierarchical chunking is sensible, but also feels… familiar. It’s another layer of engineering complexity built on top of an already fragile foundation. One suspects that within six months, the elegant architecture described here will resemble a series of nested `if` statements and desperate hacks, all held together by duct tape and fervent hope. The real challenge isn’t building a better RAG pipeline; it’s maintaining one.

Ultimately, this work is a reminder that the holy grail of ‘semantic search’ remains elusive. The system ‘used to be a simple bash script’ that grepped for keywords. Now it’s… this. Tech debt is just emotional debt with commits, and the accumulating weight of these ‘improvements’ will eventually bring the whole edifice crashing down. One can only hope the documentation is updated before then, although experience suggests otherwise.


Original article: https://arxiv.org/pdf/2602.11156.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
