Author: Denis Avetisyan
Researchers have created a large-scale resource designed to help AI systems generate more effective and informative diagrams for scientific publications.

DiagramBank is a dataset of scientific diagrams and metadata intended to improve retrieval-augmented generation of teaser-style figures.
Despite recent advances in automated scientific writing, the generation of compelling visual summaries, such as teaser figures, remains a significant bottleneck. To address this, we introduce DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation, a curated collection of 89,422 schematic diagrams paired with rich contextual information from scientific publications. This dataset enables improved multimodal retrieval and exemplar-driven generation of publication-quality figures, moving beyond simple data plots to embrace conceptual synthesis. Could this resource unlock a new era of fully automated scientific authoring, where both text and figures are generated with equal sophistication?
The Challenge of Visualizing Scientific Understanding
The bedrock of scientific understanding and dissemination frequently rests upon visual representations, with diagrams serving as crucial tools for conveying complex data and theoretical models. However, the production of these essential visuals remains surprisingly laborious, often requiring extensive manual effort from researchers and graphic designers. This process isn’t merely about aesthetic appeal; accurately translating intricate scientific concepts into digestible diagrams demands considerable time and expertise, diverting valuable resources from core research activities. The current reliance on manual creation presents a significant bottleneck in the scientific workflow, hindering the rapid communication and synthesis of knowledge, especially as the sheer volume of published research continues to expand exponentially.
Current approaches to scientific visualization frequently struggle to capture the subtle intricacies of complex phenomena. While tools exist to create diagrams and charts, they often require significant manual adjustment to accurately reflect the underlying data and theoretical models. This limitation stems from a reliance on generalized templates and a difficulty in representing multi-dimensional relationships or abstract concepts with sufficient precision. Consequently, visualizations can inadvertently oversimplify, misrepresent, or even obscure crucial details, hindering effective communication and potentially leading to flawed interpretations. The need for visuals that move beyond basic data depiction towards nuanced representations of scientific understanding remains a significant challenge in the field.
The relentless growth of scientific publications presents a significant bottleneck in knowledge dissemination, as researchers struggle to synthesize information from an ever-expanding body of work. This surge in data isn’t merely quantitative; it demands fundamentally new approaches to knowledge extraction and visual representation. Traditional methods of manually creating diagrams and figures simply cannot keep pace, leading to delays in understanding and innovation. Consequently, there is a pressing need for scalable solutions – automated systems capable of identifying key concepts, relationships, and trends within the literature and translating them into clear, concise visualizations. Such tools would not only accelerate scientific discovery but also facilitate broader accessibility of complex information, enabling researchers across disciplines to build upon existing knowledge more effectively.

DiagramBank: A Foundation for Visual AI
DiagramBank comprises 89,422 schematic diagrams assembled as a resource for multimodal retrieval research. The dataset is not simply a collection of images; each diagram is paired with accompanying metadata detailing its origin and content. This metadata is critical for enabling AI models to not only recognize diagrams but also to understand their meaning within a specific scientific context. The large scale of DiagramBank is intended to provide sufficient data for training robust models capable of both diagram understanding and generation tasks, moving beyond simple image recognition to semantic comprehension.
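To make the pairing of image and metadata concrete, here is a minimal sketch of what a single DiagramBank entry might look like. The article does not publish the actual schema, so every field name below is an illustrative assumption; only the venues and year range come from the text.

```python
from dataclasses import dataclass

# Hypothetical record layout; DiagramBank's real schema is not specified
# in this article, so the field names here are illustrative only.
@dataclass
class DiagramRecord:
    image_path: str   # extracted figure image
    caption: str      # figure caption from the source paper
    paper_title: str  # source publication
    venue: str        # e.g. "ICLR", "ICML", "NeurIPS", "TMLR"
    year: int         # dataset covers 2017-2025
    context: str      # surrounding text that references the figure

record = DiagramRecord(
    image_path="figures/paper123_fig2.png",
    caption="Overview of the proposed two-stage pipeline.",
    paper_title="An Example Paper",
    venue="NeurIPS",
    year=2023,
    context="As shown in Figure 2, the pipeline first retrieves ...",
)
```

Keeping origin (venue, year, paper) and content (caption, context) together in one record is what lets a retrieval system match on semantics rather than pixels alone.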
DiagramBank’s construction utilized PDFFigures 2.0, a tool designed for automated extraction of figures from PDF documents, to maximize the volume of diagrams sourced. Data was systematically collected from OpenReview, the open review platform for major machine learning conferences, ensuring a focus on current research. Specifically, the dataset includes diagrams from ICLR, ICML, NeurIPS, and TMLR, representing a diverse range of topics within the field. This combination of automated extraction and targeted sourcing from reputable academic venues prioritizes both the scale and the reliability of the assembled diagram collection.
DiagramBank serves as a training resource for artificial intelligence models designed for the interpretation and creation of scientific diagrams. The dataset comprises 89,422 curated diagrams sourced from prominent machine learning conferences and journals – including ICLR, ICML, NeurIPS, and TMLR – and covers the period from 2017 to 2025. This temporal scope allows for the development of models capable of adapting to evolving diagrammatic conventions within the field, and the large scale of the dataset facilitates robust model training and generalization. The availability of this resource is intended to accelerate progress in visual AI applications within scientific and technical domains.
Retrieval-Augmented Generation for Diagram Synthesis
Retrieval-Augmented Generation (RAG) is implemented by initially querying DiagramBank, a repository of scientific figures, to identify diagrams relevant to the input text prompt. These retrieved diagrams function as visual priors, providing the generative model with existing visual information to guide the synthesis of new diagrams. This contrasts with purely generative approaches by conditioning the generation process on established visual representations, improving the fidelity and relevance of the output. The system does not generate diagrams from scratch; instead, it leverages existing diagrams as a basis for creation, effectively combining retrieval and generation steps.
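The retrieve-then-generate flow can be sketched as follows. This is a toy stand-in, not the paper's implementation: `retrieve_diagrams`, `overlap`, and `generate_diagram` are hypothetical names, and the crude token-overlap scorer merely stands in for the embedding-based retrieval described below.

```python
def overlap(a, b):
    """Crude lexical similarity: count of shared lowercase tokens.
    A stand-in for the real embedding-based similarity."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_diagrams(prompt, bank, k=3):
    """Return the k bank entries whose caption best matches the prompt."""
    scored = sorted(bank, key=lambda d: overlap(prompt, d["caption"]),
                    reverse=True)
    return scored[:k]

def generate_diagram(prompt, exemplars):
    """Stand-in for a generative model conditioned on retrieved exemplars
    (visual priors); here it just records what it was conditioned on."""
    return {"prompt": prompt, "priors": [e["caption"] for e in exemplars]}

bank = [
    {"caption": "transformer encoder decoder architecture"},
    {"caption": "reinforcement learning training loop"},
    {"caption": "diagram of attention in transformer models"},
]
out = generate_diagram(
    "transformer attention diagram",
    retrieve_diagrams("transformer attention diagram", bank, k=2),
)
```

The essential point the sketch preserves: generation never starts from a blank canvas; it is always conditioned on exemplars pulled from the bank.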
Hierarchical retrieval is implemented to refine the diagram selection process and enhance generation accuracy. This method moves beyond single-level searches by evaluating relevance at three distinct granularities: the full research paper, the diagram caption associated with the image, and the surrounding textual context within the paper. By considering these multiple levels, the system can more effectively identify diagrams that are not only visually similar to the query but also semantically aligned with the intended meaning and purpose, leading to improved results compared to methods relying on single-level retrieval.
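A minimal sketch of such three-granularity scoring, assuming embeddings already exist for the paper, the caption, and the surrounding context. The blending weights are an assumption for illustration; the article does not state how the three levels are actually combined.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hierarchical_score(query, paper, caption, context,
                       weights=(0.2, 0.5, 0.3)):
    """Blend relevance at three granularities: full paper, diagram
    caption, and surrounding context.  The weights are illustrative."""
    w_paper, w_cap, w_ctx = weights
    return (w_paper * cosine(query, paper)
            + w_cap * cosine(query, caption)
            + w_ctx * cosine(query, context))

# Toy vectors: a candidate aligned with the query at every level beats
# one that only matches on context.
q = np.array([1.0, 0.0, 0.0])
off = np.array([0.0, 1.0, 0.0])
score_match = hierarchical_score(q, q, q, q)       # all levels agree
score_off = hierarchical_score(q, off, off, q)     # only context agrees
```

Weighting the caption most heavily reflects the intuition that a caption is the tightest textual description of a diagram, but any mixture could be tuned empirically.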
Similarity search within DiagramBank is optimized through a combination of FAISS, OpenAI’s text-embedding-3- model, and a technique called Deep Fetch. The text-embedding-3- model generates vector representations of both diagram captions and user queries, enabling semantic comparison. These vectors are indexed using FAISS, a library designed for efficient similarity search on large datasets. Deep Fetch further enhances recall by iteratively expanding the search beyond the initial, most similar results, incorporating related diagrams to mitigate the risk of overlooking relevant visual priors. This multi-stage approach balances precision and recall, ensuring a comprehensive retrieval of diagrams relevant to the generation task.
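The search-and-expand pattern can be sketched as below. An exact numpy nearest-neighbour search stands in for the FAISS index, and the expansion step is an assumed interpretation of Deep Fetch based only on the article's description (iteratively widening the search past the initial hits); the real procedure may differ.

```python
import numpy as np

def top_k(index, query, k):
    """Exact cosine-similarity nearest neighbours.  In production this
    role is played by a FAISS index over the embedding vectors."""
    normed = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(normed @ q)[::-1][:k]

def deep_fetch(index, query, k=2, rounds=1):
    """Recall-oriented expansion in the spirit of the Deep Fetch step:
    after the initial top-k, also pull in the neighbours of each hit.
    The exact procedure is an assumption; the article only says the
    search is iteratively expanded beyond the first results."""
    hits = set(top_k(index, query, k).tolist())
    for _ in range(rounds):
        for i in list(hits):
            hits |= set(top_k(index, index[i], k).tolist())
    return sorted(hits)

# Toy index: item 2 is not in the query's top-2, but it is a close
# neighbour of item 1, so the expansion recovers it.
index = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]])
query = np.array([1.0, 0.0])
first_pass = top_k(index, query, k=2)
expanded = deep_fetch(index, query, k=2)
```

The toy run shows the precision/recall trade the text describes: the first pass is tight, and the expansion admits related items the query alone would have missed.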

Towards Autonomous Scientific Visualization
The automated identification of diagrammatic content within large visual datasets relies on the power of Contrastive Language-Image Pre-Training, most notably through the implementation of CLIP. This technique allows for the nuanced classification of extracted figures, distinguishing between diagrams, plots, and more general images with a high degree of accuracy. Initial results demonstrate that approximately 19.8% of figures automatically extracted from scientific literature are successfully identified as diagrams using this method. This automated filtering is crucial, as it allows researchers to focus computational resources on relevant visual elements, ultimately streamlining the process of scientific visualization and enabling the creation of more coherent and accurate synthesized diagrams.
A crucial refinement in the diagram generation process involves a filtering step designed to prioritize visual relevance. By employing Contrastive Language-Image Pre-training – specifically utilizing CLIP – the system effectively distinguishes between diagrams, plots, and general images before incorporating them into synthesis. This meticulous selection dramatically improves the coherence and accuracy of the generated diagrams, preventing the inclusion of extraneous or misleading visuals. The application of a 0.85 confidence threshold, based on CLIP’s assessment, yielded a curated dataset of 59,765 high-confidence diagrams, representing a substantial resource for automated scientific visualization and enabling the development of more reliable and insightful AI-driven scientific tools.
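Only the thresholding logic below comes from the article (a 0.85 confidence cut on CLIP's assessment); `classify_figure` and the toy classifier are hypothetical stand-ins for a CLIP zero-shot classifier scoring each image against labels such as "diagram", "plot", and "general image".

```python
# Threshold reported in the article for the curated diagram subset.
CONFIDENCE_THRESHOLD = 0.85

def keep_high_confidence_diagrams(figures, classify_figure):
    """Keep only figures the classifier labels as diagrams with
    confidence at or above the threshold."""
    kept = []
    for fig in figures:
        label, confidence = classify_figure(fig)
        if label == "diagram" and confidence >= CONFIDENCE_THRESHOLD:
            kept.append(fig)
    return kept

# Toy classifier for demonstration only; in practice this would be a
# CLIP zero-shot comparison against the candidate label set.
def toy_classifier(fig):
    return fig["label"], fig["score"]

figures = [
    {"id": 1, "label": "diagram", "score": 0.92},
    {"id": 2, "label": "plot", "score": 0.97},    # confident, wrong class
    {"id": 3, "label": "diagram", "score": 0.60}, # right class, low confidence
]
kept = keep_high_confidence_diagrams(figures, toy_classifier)
```

Applied at dataset scale, this gate is what reduces the 89,422 extracted diagrams to the 59,765 high-confidence subset the article reports.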
The creation of a meticulously curated and classified repository of scientific figures directly facilitates the advancement of autonomous research capabilities. This resource moves beyond simple data aggregation; it provides a foundation for Artificial Intelligence to not only understand the visual language of science (distinguishing diagrams from general images) but also to actively generate new visualizations. By automating the process of scientific visualization, researchers can accelerate discovery, explore complex datasets with greater efficiency, and ultimately empower AI systems to function as true ‘Autonomous AI Scientists’, capable of independent investigation and knowledge synthesis without constant human intervention in the visual communication of findings.

DiagramBank’s construction embodies a holistic understanding of information systems, recognizing that effective retrieval isn’t merely about data volume, but about structural coherence. The dataset meticulously links diagrams with their associated metadata – a critical step in enabling retrieval-augmented generation. As Grace Hopper famously stated, “It’s easier to ask forgiveness than it is to get permission.” This sentiment, while often applied to innovation, also resonates with the dataset’s approach; by comprehensively capturing the ‘structure’ of diagrammatic information, DiagramBank anticipates the diverse ‘interactions’ users will have with it, enabling more flexible and forgiving retrieval even with imperfect queries. The project acknowledges that a system’s behavior arises not just from its components, but from their interconnectedness, mirroring the principle that understanding the whole is paramount.
Beyond the Figure: Charting Future Directions
DiagramBank represents a necessary, if incremental, step towards a more intelligent authoring process. The current landscape often demands complete overhauls when adapting existing visual communication – rebuilding the entire block, as it were – instead of facilitating organic evolution. A truly robust system will not simply retrieve similar diagrams, but diagnose the underlying structural deficiencies in a draft and then suggest minimal interventions to improve clarity and impact. The provided metadata is a promising start, but future iterations must move beyond simple tagging, incorporating a formal grammar of visual arguments.
The limitations inherent in defining ‘good’ diagram design also deserve consideration. Current metrics tend to prioritize aesthetic qualities or adherence to established conventions, rather than genuine communicative effectiveness. Evaluating a diagram’s success requires understanding its role within a larger narrative – how it shapes understanding, highlights key findings, and ultimately, convinces a skeptical audience. This demands a shift from passive evaluation to active modeling of cognitive processes.
Ultimately, the goal should not be to automate the creation of pretty pictures, but to augment human reasoning. DiagramBank, therefore, is best viewed as a foundational element – a new material with which to construct a more resilient and adaptable infrastructure for scientific communication. The true test will lie in its capacity to facilitate not just the reproduction of existing knowledge, but the generation of novel insights.
Original article: https://arxiv.org/pdf/2604.20857.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/