Unlocking Drug Discovery with AI-Powered Literature Mining

Author: Denis Avetisyan

A new system leverages artificial intelligence to automatically extract crucial protein-ligand interaction data from scientific publications, accelerating the pace of pharmaceutical research.

BioMiner systematically harvests large-scale bioactivity data, at significantly lower time and cost than manual curation of datasets such as PDBbind v2016. The expanded dataset, though heavily skewed towards a limited distribution of proteins, substantially improves deep learning model performance when used for pre-training, as evidenced by consistent gains across five independent runs (n = 5).

BioMiner, a multi-modal system paired with the BioVista benchmark, enables automated extraction of bioactivity data from scientific literature.

The exponential growth of biomedical literature presents a significant bottleneck in extracting valuable protein-ligand bioactivity data essential for drug discovery. To address this challenge, we introduce ‘BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature’, a novel framework that integrates semantic reasoning with chemically-grounded visual analysis to automatically extract bioactivity information, including complex ligand structures like Markush compounds. Validated on the newly established BioVista benchmark, BioMiner demonstrates substantial improvements in both extraction speed and accuracy, enabling applications ranging from pre-training datasets to accelerating hit identification. Will this multi-modal approach redefine the landscape of automated knowledge discovery in the life sciences?

Decoding the Bioactivity Bottleneck

A substantial body of knowledge regarding protein-ligand interactions remains hidden within the text of scientific literature, presenting a significant obstacle to large-scale biological analysis. While countless studies detail the effects of various molecules on proteins – fundamental to understanding disease and developing new therapies – this data is rarely formatted in a way that computers can readily interpret. Researchers typically encounter this information embedded within complex sentences, figures, and supplementary materials, requiring manual curation which is both time-consuming and prone to error. This inaccessibility hinders efforts to build comprehensive databases, perform meta-analyses, and leverage the power of machine learning for drug discovery and systems biology; effectively, a wealth of potentially transformative insights remains locked away, awaiting effective extraction and computational processing.

The sheer volume of protein-ligand interaction data embedded within scientific literature presents a significant hurdle for automated analysis, largely due to the inconsistent ways this information is presented. Researchers historically report bioactivity data using a wide array of formats – from textual descriptions and tables to complex graphs and diagrams – creating a fragmented landscape for data mining. Furthermore, ambiguous phrasing and variations in chemical nomenclature introduce considerable noise, challenging algorithms attempting to accurately identify and interpret key interactions. This inconsistency necessitates sophisticated natural language processing techniques capable of discerning meaningful data amidst the complexity of scientific writing, a task that traditional methods often struggle to accomplish with the required precision and scale.

The pursuit of novel therapeutics and a deeper comprehension of biological processes hinges on access to reliable bioactivity data: information detailing how chemical compounds interact with biological targets. However, this crucial data remains largely trapped within scientific publications, presenting a significant bottleneck for researchers. BioMiner was developed to address this challenge. The automated system systematically extracts bioactivity information from a corpus of 11,683 scientific papers, unlocking knowledge previously inaccessible to computational analysis. By converting unstructured text into a structured, machine-readable format, BioMiner helps scientists accelerate drug discovery, refine predictive models of biological systems, and ultimately advance the pace of scientific innovation.

The BioMiner framework extracts protein-ligand bioactivity data from publications, utilizing a chemical structure extraction agent that processes both explicit and Markush structures, and is benchmarked against BioVista, a dataset of 16,457 bioactivity data points and 8,735 structures designed for comprehensive evaluation across six tasks.

BioMiner: An Agentic System for Automated Extraction

BioMiner employs an agentic system wherein multiple specialized agents collaborate to identify and extract protein-ligand interactions from scientific literature. This approach integrates both visual and semantic reasoning capabilities; agents analyze figures and tables to locate potential interactions, while simultaneously parsing accompanying text for supporting evidence and contextual information. The multi-modal design allows BioMiner to leverage complementary data sources, improving accuracy and robustness compared to systems relying on a single modality. Agents communicate and coordinate to validate findings and construct a comprehensive understanding of the observed interactions, ultimately enabling automated extraction of bioactivity triplets.

BioMiner’s foundational large language model is Qwen3-VL-32B, a vision and language model chosen for its capacity to process both visual and textual data relevant to protein-ligand interactions. To optimize performance for the specific task of bioactivity triplet extraction, the model undergoes parameter-efficient LoRA (Low-Rank Adaptation) fine-tuning. This technique modifies a small number of the model’s parameters, reducing computational costs and preventing catastrophic forgetting while adapting Qwen3-VL-32B to the nuances of biological data and the BioVista benchmark. LoRA allows BioMiner to achieve strong results without requiring full model retraining.
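To make the low-rank idea concrete, here is a minimal sketch in plain Python. It follows the standard LoRA formulation, in which a frozen weight matrix W is adjusted by a scaled product of two small matrices, W' = W + (alpha / r) * B A. The layer size (4096 x 4096), the rank r = 8, and the scaling factor alpha are illustrative assumptions; the article does not state the configuration used to fine-tune Qwen3-VL-32B.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA trains only B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Compute W' = W + (alpha / rank) * B @ A without modifying W."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For the assumed 4096 x 4096 layer, the adapter trains 65,536 values against 16,777,216 in the full matrix, roughly 0.4%. That ratio is why the approach is described as parameter-efficient.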

BioMiner’s Chemical Structure-Grounded Visual Semantic Reasoning (CSG-VSR) method integrates data from multiple sources to identify bioactivity triplets, each consisting of a protein, a ligand, and an associated activity. This integration occurs across figure images, tabular data, and accompanying text, enabling a holistic understanding of protein-ligand interactions. Evaluation on the BioVista benchmark yields an F1 score of 0.32 for bioactivity triplet extraction, which shows that the system can recover and relate these components while leaving considerable room for improvement on this end-to-end task.
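As a rough illustration of how an F1 score over triplets can be computed, the sketch below treats each triplet as a (protein, ligand, activity) record and counts a prediction as correct only on an exact match. The `Triplet` type, the exact-match rule, and the example records are assumptions made for illustration; the article does not describe BioVista's actual matching criteria.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """Hypothetical record for one extracted bioactivity data point."""
    protein: str
    ligand: str    # e.g. a SMILES string
    activity: str  # e.g. "IC50=12 nM"


def triplet_f1(predicted, reference):
    """Return (precision, recall, F1) using exact-match set overlap."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up records: three reference triplets, two predictions, one of them correct.
reference = [
    Triplet("NLRP3", "CCO", "IC50=12 nM"),
    Triplet("NLRP3", "CCN", "IC50=40 nM"),
    Triplet("NLRP3", "CCC", "IC50=90 nM"),
]
predicted = [
    Triplet("NLRP3", "CCO", "IC50=12 nM"),
    Triplet("NLRP3", "CCN", "IC50=400 nM"),  # wrong activity value, so no match
]
print(triplet_f1(predicted, reference))
```

Because a single wrong field invalidates the whole triplet under this rule, end-to-end scores tend to sit well below the accuracy of any individual component.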

BioMiner demonstrates superior performance in bioactivity triplet and structure-bioactivity annotation tasks, achieving high recall in Markush enumeration even with complex R-group modalities and structures, as evidenced by component-level analysis and successful processing of challenging examples.
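Markush enumeration can be pictured as expanding a scaffold with placeholder positions into every combination of listed substituents. The sketch below does this with plain string substitution on SMILES-like text. The `{R1}` placeholder syntax, the scaffold, and the substituent lists are invented for illustration, and no valence or chemistry checking is performed; a production system would rely on a cheminformatics toolkit rather than string replacement.

```python
from itertools import product


def enumerate_markush(scaffold, r_groups):
    """Expand a scaffold with {R1}, {R2}, ... placeholders into explicit strings.

    r_groups maps each placeholder name to its list of candidate substituents.
    The result contains one string per combination (a Cartesian product).
    """
    names = sorted(r_groups)
    structures = []
    for combo in product(*(r_groups[name] for name in names)):
        smiles = scaffold
        for name, substituent in zip(names, combo):
            smiles = smiles.replace("{" + name + "}", substituent)
        structures.append(smiles)
    return structures


# Hypothetical para-disubstituted benzene scaffold with two variable positions.
scaffold = "{R1}c1ccc(cc1){R2}"
r_groups = {"R1": ["C", "Cl", "CO"], "R2": ["F", "N"]}

for smiles in enumerate_markush(scaffold, r_groups):
    print(smiles)
```

Three choices at one position and two at the other yield six explicit structures. The count grows multiplicatively with each additional R-group, which is one reason Markush structures are hard to curate by hand.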

Robust Chemical Structure Recognition Within BioMiner

BioMiner utilizes Optical Chemical Structure Recognition (OCSR) techniques to accurately identify chemical structures embedded within scientific publications. This process involves converting image-based depictions of molecules into machine-readable formats. A core component of this capability is the MolGlyph model, specifically designed for robust structure identification. OCSR allows BioMiner to extract chemical information directly from figures and diagrams, bypassing the limitations of text-based searches and enabling comprehensive data mining of chemical literature. The system can process diverse image qualities and structural representations common in scientific papers, facilitating the automated extraction of chemical entities for downstream analysis.

BioMiner employs Domain-Specific Models (DSMs) to improve chemical structure processing accuracy by accounting for the specific characteristics of scientific publications. These models are trained on datasets derived from the corpus of biomedical literature indexed by BioMiner, enabling adaptation to variations in structural representation, diagram quality, and the prevalence of particular chemical motifs within the domain. This targeted approach contrasts with general-purpose Optical Chemical Structure Recognition (OCSR) systems and allows BioMiner to more reliably interpret structures as they appear in published research, thereby minimizing errors arising from the unique challenges of scientific documentation.

BioMiner’s outputs are assessed quantitatively on established benchmark datasets, including PoseBusters. On the PDBbind v2016 core set and CSAR-HiQ, Root Mean Squared Error (RMSE) falls by 3.9% and 3.4%, respectively, when a deep learning model is pre-trained on data extracted and curated by BioMiner. Because RMSE measures error in a predicted numerical value, these gains speak to the usefulness of the mined data for downstream prediction on standard datasets, which in turn depends on structures being recognized correctly.
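For readers less familiar with the metric, the following sketch shows how RMSE and a percentage reduction between two models would be computed. It assumes the reported 3.9% and 3.4% are relative reductions against a baseline; the article gives only the percentages, so the baseline value used here is a hypothetical placeholder.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100


# Hypothetical baseline RMSE of 1.000; a 3.9% reduction would give 0.961.
baseline_rmse = 1.000
pretrained_rmse = 0.961
print(f"{relative_reduction_pct(baseline_rmse, pretrained_rmse):.1f}% lower RMSE")
```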

A controlled evaluation using 4 annotators (2 experts and 2 novices) demonstrated that BioMiner effectively assists structure-bioactivity annotation on the PoseBusters dataset, as shown by performance comparisons and error decomposition analysis across both expert and novice groups.

Human Oversight: Refining Accuracy and Impact

BioMiner distinguishes itself through the implementation of a Human-In-The-Loop (HITL) workflow, a crucial component for ensuring data integrity and reliability. This process doesn’t rely solely on automated extraction; instead, it actively incorporates expert scientists who review and refine the information identified from scientific literature. The system flags data points for human assessment, allowing researchers to correct inaccuracies or ambiguities that automated systems might miss. This iterative feedback loop – where human curation improves the algorithm’s performance and the algorithm pre-processes data for faster expert review – is central to BioMiner’s design, fostering a symbiotic relationship between artificial intelligence and human expertise and ultimately generating a highly trustworthy bioactivity database.

The HITL workflow is designed to refine data extraction and catch the inaccuracies inherent in automated systems. In this iterative process, expert curators review and correct machine-identified bioactivity data, which improves both the speed and the reliability of the resulting database. Specifically, the HITL approach reduces annotation time by a factor of 5.59, from approximately 195.8 seconds to 35.0 seconds per data entry. More critically, human oversight raises the accuracy of the bioactivity data to 96.25%, compared with 90.5% for fully manual annotation, establishing a more trustworthy foundation for scientific research.
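The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the four numbers quoted above; the function name and the derived quantities (time saved and error-rate reduction) are additions for illustration and are not reported in the article.

```python
def annotation_summary(manual_seconds, assisted_seconds, manual_acc_pct, assisted_acc_pct):
    """Derive speed and accuracy comparisons from per-entry time and accuracy."""
    manual_error = 100 - manual_acc_pct
    assisted_error = 100 - assisted_acc_pct
    return {
        # How many times faster assisted annotation is per data entry.
        "speedup": manual_seconds / assisted_seconds,
        # Share of per-entry time eliminated.
        "time_saved_pct": (1 - assisted_seconds / manual_seconds) * 100,
        # Absolute accuracy difference in percentage points.
        "accuracy_gain_points": assisted_acc_pct - manual_acc_pct,
        # Relative drop in the error rate.
        "error_reduction_pct": (manual_error - assisted_error) / manual_error * 100,
    }


summary = annotation_summary(195.8, 35.0, 90.5, 96.25)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The speedup works out to 195.8 / 35.0, or about 5.59, matching the stated factor. Read another way, that is roughly 82% less time per entry, and the accuracy gain of 5.75 percentage points corresponds to about 60% fewer erroneous entries.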

BioMiner combines machine automation with human insight to produce a comprehensive and reliable resource for bioactivity data. From 11,683 scientific papers, the system extracted 82,262 individual bioactivity data points. This curated dataset also improves the predictive power of Quantitative Structure-Activity Relationship (QSAR) models: models trained on BioMiner data showed a 38.6% improvement in EF1% (the enrichment factor in the top 1% of ranked compounds, a standard measure of how well a model concentrates active compounds at the top of a screening list) compared with those built on the established ChEMBL database. This gain underscores BioMiner’s potential not only to amass data, but also to raise the quality of information available to researchers developing new therapeutics and studying biological systems.
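EF1% is commonly defined as the hit rate among the top-scoring 1% of compounds divided by the hit rate across the whole library. The sketch below implements that common definition; the library size, scores, and active labels are made up for illustration, and the article does not specify how ties or rounding of the top-1% cutoff were handled in the study.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor: hit rate in the top `fraction` over the overall hit rate.

    scores    -- model scores, higher meaning more likely active
    is_active -- parallel sequence of 0/1 (or bool) labels
    """
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(active for _, active in ranked[:n_top])
    return (hits_in_top / n_top) / (total_actives / n)


# Made-up library: 200 compounds, 10 actives, scores decreasing with index.
scores = [200 - i for i in range(200)]
is_active = [1 if i % 10 == 0 and i < 100 else 0 for i in range(200)]
print(enrichment_factor(scores, is_active))
```

In this toy library the top 1% is two compounds, one of which is active, against a base rate of 5%, so the model is ten times better than random selection at that cutoff.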

BioMiner efficiently extracts NLRP3 bioactivity data from scientific literature, as demonstrated by its rapid processing of 85 papers, comparable pIC50 distributions to ChEMBL, and successful generation of robust QSAR models, further validated by molecular docking and dynamics simulations showing stable binding poses for compounds like Z6739936901 and Z5232931194.
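The pIC50 values mentioned in the caption are the negative base-10 logarithm of the IC50 expressed in molar units, which puts potencies that span orders of magnitude on a compact scale. The sketch below shows the conversion; the unit table and function name are illustrative choices, not part of BioMiner.

```python
import math

# Conversion factors from common concentration units to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pic50(value, unit="nM"):
    """Convert an IC50 in the given unit to pIC50 = -log10(IC50 in mol/L)."""
    if value <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


for ic50, unit in [(10, "nM"), (1, "uM"), (250, "nM")]:
    print(f"IC50 = {ic50} {unit}  ->  pIC50 = {pic50(ic50, unit):.2f}")
```

An IC50 of 10 nM corresponds to a pIC50 of 8 and 1 uM to a pIC50 of 6, so higher pIC50 means a more potent compound. Comparing distributions on this scale is what allows a mined dataset to be checked against a reference such as ChEMBL.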

BioMiner, as detailed in the study, aggressively tackles the challenge of extracting meaningful data from the complex landscape of scientific literature. This pursuit mirrors a fundamental tenet of understanding any system: dismantling it to reveal its inner workings. As Donald Knuth observed, “The best computer programs are the ones that work, but the best programmers are the ones who can debug them.” BioMiner doesn’t merely accept the published word as truth; it actively parses, dissects, and validates information (in effect, debugging the claims embedded within research papers) to construct a robust and reliable bioactivity dataset. This process of rigorous analysis, pushing the boundaries of automated extraction, is essential for accelerating drug discovery and validating existing knowledge.

What Lies Ahead?

BioMiner, and systems like it, represent an attempt to brute-force the decoding of biological systems. The premise – that actionable data is locked within the published record, waiting for the right algorithm – isn’t novel. What’s intriguing is the implicit admission that the current methods of knowledge dissemination are… inefficient. The system functions as a ‘reverse compiler’ of sorts, attempting to reconstruct intent from the compiled output of research. The success of BioMiner isn’t a testament to the system itself, but rather an indictment of how poorly we’ve structured the communication of scientific findings.

The immediate challenge isn’t improving accuracy, but broadening scope. Markush enumeration is a clever patch, but it highlights a fundamental problem: chemical space is vast, and our notation for describing it is… constrained. The true bottleneck isn’t extracting data from publications, it’s the fact that much of the interesting information likely isn’t published in a readily machine-readable format. Reality is open source – the code exists – but we’re still stuck reading the documentation instead of the source.

Future iterations will undoubtedly focus on multi-modal integration, and BioVista provides a valuable testing ground. However, a more radical approach might involve incentivizing researchers to submit not just results, but the process – the raw data, the failed experiments, the reasoning behind each step. Only then can a system like BioMiner truly begin to map the underlying logic of biological activity, and move beyond mere data extraction towards genuine knowledge discovery.

Original article: https://arxiv.org/pdf/2604.21508.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
