Author: Denis Avetisyan
A new system combines the power of artificial intelligence with human oversight to accelerate the process of extracting reliable evidence for systematic reviews.
![The EviSearch system establishes a foundational architecture for evidence-based reasoning, enabling the structured exploration of complex information spaces and facilitating the rigorous evaluation of claims through logically connected evidence chains [latex] E = \{e_1, e_2, ..., e_n\} [/latex].](https://arxiv.org/html/2604.14165v1/image.png)
EviSearch utilizes a multi-agent system with large language models and complete provenance tracking to improve the accuracy and efficiency of clinical evidence extraction.
Systematic reviews, crucial for evidence-based medicine, are often bottlenecked by the laborious process of extracting data from complex clinical trial publications. To address this, we introduce ‘EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews’, a multi-agent system that combines large language models with human oversight to accurately capture structured clinical evidence directly from trial PDFs, while guaranteeing complete provenance for every extracted value. This approach substantially improves extraction accuracy and provides a verifiable audit trail, enabling safer integration of automated methods into evidence synthesis pipelines. Will this human-in-the-loop design pave the way for truly living systematic reviews and accelerate the translation of research into clinical practice?
The Impedance of Data Heterogeneity in Clinical Evidence Synthesis
The sheer volume of data produced by modern clinical trials presents a significant obstacle to efficient systematic reviews. These trials generate information across diverse formats – from meticulously structured databases and spreadsheets to free-text reports, scanned documents, and complex PDF publications – each requiring specialized parsing and interpretation. This heterogeneity isn’t merely a logistical issue; it creates a substantial bottleneck, delaying the synthesis of evidence needed for informed healthcare decisions. The disparate nature of the data demands considerable effort to standardize and integrate, effectively slowing down the process of meta-analysis and hindering the timely application of research findings to clinical practice. Consequently, accessing robust, synthesized evidence often lags behind the rapid pace of new trial results, impacting both healthcare professionals and patients.
The painstaking process of manually extracting data from clinical trials represents a significant impediment to rapid scientific advancement. Researchers traditionally pore over lengthy reports and publications, a labor-intensive undertaking that demands substantial financial resources and considerable time. This manual approach isn’t merely slow; it introduces a high risk of human error, from misinterpreting statistical values to overlooking crucial patient characteristics. Consequently, the completion of meta-analyses – vital syntheses of existing research – is frequently delayed, and the translation of trial results into informed clinical practice and evidence-based decision-making suffers, potentially impacting patient care and hindering the development of effective healthcare strategies.
Automated data extraction from clinical trial reports, while promising, frequently encounters obstacles due to the inherent complexity and inconsistency of PDF formats. These documents aren’t standardized; variations in layout, table structures, and the use of images instead of text significantly impede the performance of algorithms designed for automated parsing. Many systems struggle to accurately identify and categorize critical data points – such as patient demographics, treatment regimens, and outcome measures – leading to incomplete datasets or, more concerningly, inaccurate information. This necessitates extensive manual verification, undermining the efficiency gains sought through automation and delaying the availability of synthesized evidence for meta-analysis and informed clinical practice. The challenge isn’t simply recognizing text, but also interpreting its meaning within the specific context of a clinical trial report – a task requiring a level of nuanced understanding that current automated solutions often lack.
EviSearch: A Modular Framework for Automated Evidence Extraction
EviSearch’s architecture is predicated on the distribution of tasks across specialized agents. The document parsing agent is responsible for initial document processing and conversion into a structured format. Following parsing, the information extraction agent identifies and extracts relevant data points according to pre-defined criteria. Crucially, a reconciliation agent is incorporated to validate and harmonize extracted information, resolving discrepancies between outputs from different extraction processes and ensuring data integrity. This multi-agent system facilitates modularity, allowing for independent scaling and optimization of each component, and enhances the robustness of the automated extraction pipeline.
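The three-stage division of labor described above can be sketched as a minimal pipeline. Everything here is an illustrative assumption — the class names, interfaces, and the keyword-match stand-in for LLM extraction are not EviSearch's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a parse -> extract -> reconcile agent pipeline.
# All names and interfaces are assumptions, not the paper's implementation.

@dataclass
class ParsedDocument:
    text: str
    tables: list = field(default_factory=list)

class ParsingAgent:
    def run(self, pdf_bytes: bytes) -> ParsedDocument:
        # The real system delegates to a document-parse model; here the
        # bytes are simply decoded to text as a placeholder.
        return ParsedDocument(text=pdf_bytes.decode("utf-8", errors="ignore"))

class ExtractionAgent:
    def run(self, doc: ParsedDocument, fields: list) -> dict:
        # A keyword-presence check stands in for LLM-driven extraction.
        return {f: (f in doc.text) for f in fields}

class ReconciliationAgent:
    def run(self, candidates: list) -> dict:
        # Merge outputs from multiple extraction runs by majority vote.
        merged = {}
        for key in candidates[0]:
            votes = [c[key] for c in candidates]
            merged[key] = max(set(votes), key=votes.count)
        return merged

# Wire the stages together on a toy input.
doc = ParsingAgent().run(b"sample size: 120 patients; outcome: HbA1c")
a = ExtractionAgent().run(doc, ["sample size", "outcome"])
b = ExtractionAgent().run(doc, ["sample size", "outcome"])
result = ReconciliationAgent().run([a, b])
```

The point of the structure, as the paragraph notes, is that each stage can be swapped or scaled independently of the others.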
The EviSearch framework employs a dual-agent approach to information retrieval from documents. The PDF Query Module directly processes Portable Document Format (PDF) files, utilizing the Gemini-2.5-Flash large language model to extract both textual content and embedded data. This module operates directly on the PDF’s raw structure. In contrast, the Search Agent functions on pre-processed, parsed representations of documents, meaning the PDF has already been converted into a more structured format before analysis. This division of labor allows for optimized extraction; Gemini-2.5-Flash handles the complexities of PDF formatting, while the Search Agent benefits from the efficiency of operating on readily available, structured data.
Batching techniques within EviSearch significantly improve processing speed and efficiency by dividing large extraction tasks into smaller, independent batches. These batches are then processed concurrently using parallel execution, leveraging multi-core processors and distributed computing resources. This approach minimizes idle time and maximizes throughput compared to sequential processing. The size of each batch is dynamically adjusted to optimize performance based on available resources and the complexity of the extraction tasks, enabling efficient handling of varying document sizes and data volumes. By distributing the workload, batching reduces the overall extraction time and increases the scalability of the system.
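The batch-and-parallelize strategy can be illustrated with standard-library primitives. The batch size, worker function, and use of a thread pool are assumptions for the sketch; the paper does not specify these details:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of batched, concurrent extraction: split a large job into
# fixed-size batches and process them in parallel.

def make_batches(items, batch_size):
    """Yield successive fixed-size batches from a list of tasks."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def extract_batch(batch):
    """Placeholder for per-batch extraction work (e.g. one model call)."""
    return [len(page_text) for page_text in batch]

pages = [f"page {i} text" for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_batch, make_batches(pages, batch_size=3)))

flat = [r for batch in results for r in batch]
```

With ten pages and a batch size of three, the pool receives four independent batches, so slow batches no longer serialize the whole job.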
The Reconciliation Agent functions as a critical component in ensuring the reliability of extracted data by performing cross-validation across multiple information sources. This agent receives outputs from both the PDF Query Module and the Search Agent, identifying discrepancies in extracted entities and relationships. Discrepancies are flagged and resolved through a defined protocol, prioritizing outputs based on confidence scores assigned by the individual extraction modules. This comparative analysis mitigates errors stemming from individual module limitations, such as OCR inaccuracies or parsing errors, ultimately improving the overall accuracy and consistency of the extracted knowledge base. The agent does not simply select one output, but aims to synthesize a consolidated and verified representation of the information.
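A heavily simplified version of the confidence-based resolution step might look like the following. Note the real agent synthesizes a consolidated representation rather than just picking a winner; this sketch shows only the confidence-preference part, and all field names and scores are invented:

```python
# Illustrative reconciliation: given candidate (value, confidence) pairs for
# the same field from two extraction modules, keep the higher-confidence one.

def reconcile(pdf_output: dict, search_output: dict) -> dict:
    """Each output maps field -> (value, confidence)."""
    merged = {}
    for field in pdf_output.keys() | search_output.keys():
        candidates = [out[field] for out in (pdf_output, search_output)
                      if field in out]
        merged[field] = max(candidates, key=lambda vc: vc[1])
    return merged

pdf_out = {"n_patients": (120, 0.92), "dropout_rate": ("4%", 0.60)}
search_out = {"n_patients": (120, 0.88), "dropout_rate": ("5%", 0.75)}
merged = reconcile(pdf_out, search_out)
```

Here the two modules agree on the patient count but disagree on the dropout rate, and the discrepancy is resolved in favor of the Search Agent's higher-confidence reading.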

Dissecting the Pipeline: Parsing, Reasoning, and Provenance
EviSearch utilizes the Landing AI Document Parse Model to address the challenges inherent in processing clinical trial PDFs, which often contain complex layouts and unstructured data. This model converts the PDF content into structured formats – JSON and Markdown – enabling efficient downstream processing by subsequent components of the EviSearch framework. The conversion process facilitates the extraction of text, tables, and figures, transforming the originally unstructured data into a machine-readable format suitable for semantic search, reasoning, and data analysis. This structured output is essential for accurate information retrieval and supports the framework’s ability to pinpoint relevant details within the source documents.
The EviSearch Search Agent leverages semantic search capabilities enabled by the OpenAI Text Embedding Model to identify relevant information within documents that have undergone parsing. This process converts text into vector embeddings, representing the meaning of words and phrases numerically. By comparing the embedding of a user query to the embeddings of text segments within the parsed documents, the Search Agent can identify passages with similar semantic meaning, even if they do not share exact keyword matches. This allows for more nuanced and accurate information retrieval compared to traditional keyword-based search methods, improving the precision and recall of relevant data within clinical trial PDFs and other complex document types.
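The ranking step at the heart of semantic search is cosine similarity between the query embedding and each segment embedding. The 3-dimensional vectors below are toy stand-ins for real embedding-model outputs (which have hundreds or thousands of dimensions); the segments and query are invented examples:

```python
import math

# Toy embedding-based semantic search: rank text segments by cosine
# similarity between their vectors and a query vector.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

segments = {
    "Patients received 10 mg daily": [0.9, 0.1, 0.0],
    "The trial ran for 24 weeks":    [0.1, 0.9, 0.1],
    "Dosage was titrated weekly":    [0.8, 0.2, 0.1],
}
query_vec = [0.85, 0.15, 0.05]  # e.g. embedding of "what dose was given?"

ranked = sorted(segments,
                key=lambda s: cosine(query_vec, segments[s]),
                reverse=True)
```

Note that the dosage segments rank above the duration segment even though none of them shares a keyword with a query like "what dose was given?" — the similarity lives in the vector space, not the surface text.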
EviSearch incorporates a comprehensive provenance tracking system to maintain data integrity and auditability. This system meticulously records the origin of each extracted value, including the specific PDF document, page number, and the parsing method employed – currently, the Landing AI Document Parse Model. Furthermore, the system logs all reasoning steps taken by the Search Agent, referencing the semantic search results generated by the OpenAI Text Embedding Model and the precise location within the parsed document where the value was identified. This detailed lineage allows for complete traceability of information, enabling verification of results and facilitating error analysis or model refinement.
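One way to picture the per-value lineage described above is as a structured record attached to every extracted value. The schema, field names, and the example filename below are assumptions for illustration, not EviSearch's actual provenance format:

```python
from dataclasses import dataclass, asdict

# Hypothetical provenance record for one extracted value: source document,
# page, parsing method, and the ordered reasoning trace.

@dataclass(frozen=True)
class Provenance:
    source_pdf: str
    page: int
    parser: str
    reasoning_steps: tuple  # ordered trace of how the value was located

@dataclass(frozen=True)
class ExtractedValue:
    field: str
    value: str
    provenance: Provenance

record = ExtractedValue(
    field="sample_size",
    value="120",
    provenance=Provenance(
        source_pdf="trial_report.pdf",  # hypothetical filename
        page=4,
        parser="document-parse-model",
        reasoning_steps=("semantic search hit on 'enrolled'",
                         "value read from Table 1, row 'N'"),
    ),
)
audit_trail = asdict(record)  # serializable form for the audit log
```

Because every value carries this record, a reviewer can jump from any number in the synthesized output straight back to the page it came from.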
EviSearch is designed to extract data from multiple data types commonly found within clinical trial PDFs. The system performs table reasoning, identifying and interpreting tabular data including headers, rows, and cell values. Simultaneously, it supports chart interpretation, processing visual data such as graphs and charts to identify trends, values, and associated metadata. This multi-modal approach allows EviSearch to aggregate information from diverse sources within a single document, increasing the comprehensiveness of extracted insights and supporting more complex queries.
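As a concrete example of the table-reasoning step, a Markdown table emitted by the parsing stage can be turned into structured rows keyed by header. The parsing helper and the table contents are illustrative assumptions, not output from the actual system:

```python
# Illustrative table reasoning: convert a parsed Markdown table into a list
# of dicts mapping column headers to cell values.

def parse_markdown_table(md: str) -> list:
    def cells(line):
        return [c.strip() for c in line.strip("|").split("|")]
    lines = [l.strip() for l in md.strip().splitlines()]
    header = cells(lines[0])
    rows = [cells(l) for l in lines[2:]]  # skip the |---| separator row
    return [dict(zip(header, r)) for r in rows]

table = """
| Arm     | N   | Mean change |
|---------|-----|-------------|
| Drug    | 120 | -1.2        |
| Placebo | 118 | -0.3        |
"""
rows = parse_markdown_table(table)
```

Once cell values are keyed by header like this, queries such as "mean change in the placebo arm" become simple lookups rather than layout puzzles.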
Augmenting Automation with Human Oversight: A Pragmatic Approach
Despite the significant automation achieved by EviSearch in clinical evidence extraction, a human-in-the-loop verification process remains essential to the framework’s reliability. This deliberate integration of human expertise addresses the inherent complexities within scientific literature, allowing for the correction of nuanced errors and the validation of data points requiring contextual understanding. By combining computational efficiency with human judgment, EviSearch minimizes the risk of propagating inaccuracies and ensures a higher standard of data integrity, particularly when dealing with complex information frequently presented in figures and tables – a critical aspect of robust clinical evidence synthesis.
The integration of human review within EviSearch isn’t merely a quality control step, but a dynamic process that actively enhances the system’s performance. Human experts correct instances where automated extraction falters, particularly with nuanced or ambiguous data, and validate the accuracy of complex relationships identified within clinical trial reports. Critically, these human-verified corrections aren’t isolated fixes; they are fed back into the extraction models, allowing the system to learn from its errors and refine its algorithms over time. This continuous feedback loop fosters iterative improvement, ensuring that EviSearch becomes increasingly adept at accurately and comprehensively synthesizing clinical evidence, ultimately exceeding the capabilities of purely automated approaches.
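The triage-and-feedback loop described above can be sketched as follows. The confidence threshold, record shape, and the idea of storing (model value, human value) pairs as training signal are all assumptions about how such a loop could be wired, not details from the paper:

```python
# Minimal human-in-the-loop sketch: route low-confidence extractions to a
# reviewer and collect corrections for later model refinement.

REVIEW_THRESHOLD = 0.80  # assumed cutoff; the paper does not specify one

def triage(extractions):
    """Split extractions into auto-accepted and needs-human-review."""
    accepted, queued = [], []
    for item in extractions:
        (queued if item["confidence"] < REVIEW_THRESHOLD
         else accepted).append(item)
    return accepted, queued

def apply_review(item, corrected_value):
    """Record the human correction; (model value, human value) pairs
    become feedback for the next refinement cycle."""
    feedback_pair = (item["value"], corrected_value)
    item["value"], item["verified"] = corrected_value, True
    return feedback_pair

extractions = [
    {"field": "n_patients", "value": "120", "confidence": 0.95},
    {"field": "dropout_rate", "value": "5%", "confidence": 0.55},
]
accepted, queued = triage(extractions)
feedback = [apply_review(item, "4%") for item in queued]
```

The design choice worth noting is that human effort is spent only where the model is unsure, which is what keeps the reviewer burden sublinear in the number of documents.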
EviSearch leverages a carefully balanced methodology, uniting automated extraction with the nuanced capabilities of human reviewers. This synergistic approach allows the system to rapidly process extensive clinical documentation, capitalizing on computational speed and scale, while simultaneously ensuring data integrity through expert validation. Complex relationships, ambiguous phrasing, and subtle contextual cues, often missed by algorithms, are effectively addressed by human oversight, leading to a more robust and reliable synthesis of evidence. The resulting framework doesn’t simply aim to replace human analysis, but rather to augment it, enabling researchers to focus on higher-level interpretation and decision-making rather than tedious manual data collection.
EviSearch establishes a new benchmark in clinical evidence synthesis, achieving an overall extraction accuracy of 91.3% when evaluated against a rigorous clinical trial dataset. This represents a substantial advancement over existing methods, exceeding the performance of the best baseline by a significant 7.2 percentage points. The enhanced accuracy isn’t solely about identifying information; it directly contributes to improved auditability, enabling researchers and clinicians to confidently trace the origins and validity of extracted evidence. This level of precision minimizes the risk of errors in meta-analyses and systematic reviews, ultimately fostering more reliable and trustworthy conclusions in healthcare research and practice.
EviSearch demonstrates a high degree of reliability in clinical evidence synthesis, achieving 90.9% correctness in identifying and extracting factual information and 91.6% completeness in capturing all relevant data points within source documents. These metrics, assessed against a clinical trial benchmark, indicate the framework’s ability to not only find the right answers but also to comprehensively represent the available evidence. This combination of high correctness and high completeness is crucial for building trustworthy and auditable knowledge bases, allowing researchers and clinicians to confidently utilize the extracted information for informed decision-making and rigorous analysis.
EviSearch distinguishes itself through its ability to accurately extract data presented visually within figures, achieving an 86.7% accuracy rate on figure-sourced evidence – a substantial improvement over the leading baseline, which attained only approximately 50% accuracy. This capability is particularly vital in clinical research, where critical information is often embedded within charts, graphs, and diagrams rather than explicitly stated in text. By substantially enhancing the extraction of data from these visual sources, EviSearch significantly reduces the burden of manual data collection and interpretation, fostering more efficient and reliable evidence synthesis while minimizing the potential for human error in translating visual data into structured, usable formats.
EviSearch operates on a substantial scale, processing an average of 642,798 tokens – individual units of text – per document to comprehensively extract relevant clinical evidence. This intensive analysis is facilitated by the strategic utilization of 79 API calls per document, enabling the system to access and integrate information from diverse sources and perform complex data processing tasks. The high token count underscores the framework’s ability to handle lengthy and detailed clinical reports, while the efficient API call management demonstrates a carefully optimized workflow designed for both thoroughness and speed in evidence synthesis.
The pursuit of verifiable truth, as embodied by EviSearch, resonates with a sentiment expressed by David Hilbert: “We must be able to answer, yes or no, to any question.” The system’s architecture, meticulously designed to capture provenance for every extracted clinical evidence value, directly addresses the need for absolute certainty. Unlike systems that prioritize functional output, EviSearch champions provability; each assertion isn’t simply produced by a large language model, but attested through a clear chain of evidence. This commitment to rigorous validation, mirroring Hilbert’s formalist ideals, elevates EviSearch beyond a mere tool for data extraction and positions it as a framework for establishing demonstrable, and therefore reliable, clinical knowledge.
What’s Next?
The pursuit of automated clinical evidence extraction, as exemplified by EviSearch, inevitably raises a fundamental question: Let N approach infinity – what remains invariant? The system demonstrably improves upon purely automated approaches, yet the continued necessity of human oversight suggests an inherent limit. The challenge isn’t merely to achieve higher recall or precision – those are transient metrics. The enduring problem lies in the semantic ambiguity of natural language itself. A model can learn to associate terms with concepts, but true understanding – the ability to discern subtle nuances of methodology or patient characteristics – remains elusive.
Future work must therefore shift focus. Rather than striving for complete automation – a Sisyphean task – attention should be directed towards minimizing the cognitive burden on human reviewers. Can the system not only extract data, but actively challenge assertions, highlighting potential inconsistencies or methodological weaknesses? This demands a move beyond passive extraction towards an active, adversarial system – a ‘devil’s advocate’ for clinical evidence.
Furthermore, the emphasis on provenance, while laudable, must extend beyond simply tracing the origin of extracted values. A complete provenance should encompass not just the source of the information, but also a formal representation of the reasoning applied in its interpretation. Only then can the system approach a level of transparency and verifiability commensurate with the gravity of its task – and perhaps, approach a solution that is, if not perfect, at least provably correct.
Original article: https://arxiv.org/pdf/2604.14165.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-19 05:45