From Code to Canvas: Automating Scientific Visuals

Author: Denis Avetisyan


Researchers have developed a new system that automatically generates publication-quality diagrams and plots, streamlining the visual communication of complex scientific findings.

PaperBanana introduces an agentic framework leveraging visual language models to automate the creation of academic illustrations, including methodology diagrams and statistical plots, as evaluated on the PaperBananaBench benchmark.

Despite rapid advancements in autonomous AI scientists, a significant bottleneck remains in the creation of publication-ready visuals. This work introduces PaperBanana: Automating Academic Illustration for AI Scientists, an agentic framework designed to automatically generate high-quality academic illustrations, including methodology diagrams and statistical plots. Through the orchestration of specialized agents and leveraging state-of-the-art visual language models, PaperBanana consistently outperforms existing methods in generating faithful, concise, and aesthetically pleasing figures, as demonstrated by comprehensive evaluations on the newly introduced PaperBananaBench benchmark. Could this automated approach fundamentally reshape the scientific communication process and accelerate the pace of discovery?


The Sisyphean Task of Scholarly Figures

The production of polished, publication-ready figures represents a substantial hurdle for researchers across disciplines. While data acquisition and analysis often receive primary focus, translating those findings into clear and compelling visuals frequently demands considerable time and effort. This bottleneck stems not merely from the technical aspects of graphic design, but also from the need to adhere to the specific aesthetic and informational standards of academic journals and conferences. Many researchers find themselves spending weeks – or even months – refining charts, graphs, and illustrations, diverting resources from core scientific investigation. This situation is further compounded by the increasing expectation for complex datasets to be represented with nuanced and informative graphics, necessitating specialized skills that are not always readily available within research teams. Ultimately, the time invested in visual communication can significantly delay the dissemination of scientific knowledge, hindering progress and innovation.

The creation of effective scholarly visualizations is frequently hampered by a substantial demand for manual refinement and dedicated design expertise. Researchers often find themselves spending considerable time not on data analysis itself, but on painstakingly adjusting visual elements – color schemes, font choices, label placements, and overall aesthetic balance – within existing tools. This process extends beyond simply generating a graph; it requires a nuanced understanding of visual communication principles to ensure clarity, accuracy, and adherence to specific journal or publication guidelines. Consequently, those lacking formal training in graphic design, or the resources to collaborate with specialists, may struggle to produce figures that adequately represent their findings and meet the rigorous standards of academic discourse, effectively creating a barrier to dissemination and potentially impacting the reception of valuable research.

As research endeavors push the boundaries of knowledge, the methods of communicating findings must evolve alongside them. Traditional charts and graphs often prove inadequate for representing multi-dimensional datasets, complex simulations, or the nuanced relationships discovered within intricate systems. This escalating complexity necessitates visual communication strategies that go beyond simple data depiction; researchers increasingly require tools capable of conveying abstract concepts, illustrating dynamic processes, and highlighting subtle patterns that would otherwise remain hidden. Effectively communicating these findings demands not just clarity, but also the ability to synthesize large volumes of information into easily digestible visuals, ultimately influencing how quickly and accurately scientific breakthroughs are understood and built upon by the wider community.

Despite rapid advancements, contemporary image generation models frequently fall short when tasked with producing figures suitable for academic publication. These models, while capable of creating visually appealing imagery, often struggle with the precision and clarity demanded by scholarly standards: precise data representation, consistent styling, and accurate labeling are paramount, yet prove difficult to achieve consistently. Issues arise from a tendency towards aesthetic embellishment over factual accuracy, difficulties in maintaining visual consistency across multiple related figures, and an inability to reliably reproduce specific chart types or data visualizations commonly used in research. Consequently, researchers still require significant manual intervention, often hours of painstaking refinement, to transform model outputs into publication-ready graphics, highlighting a critical gap between the capabilities of artificial intelligence and the rigorous demands of scientific communication.

PaperBanana: A Pragmatic Approach to Visual Automation

PaperBanana employs an agentic framework to decompose the complex task of illustration into a series of discrete, manageable sub-tasks. This architecture consists of multiple specialized agents, each responsible for a specific aspect of the process. By distributing the workload, the system avoids the limitations of monolithic models and facilitates greater control and precision. Each agent operates autonomously, receiving inputs, processing information, and producing outputs that are then passed to subsequent agents in a defined sequence. This modular design promotes scalability and allows for independent improvement and refinement of individual components, ultimately enhancing the overall quality and efficiency of the illustration pipeline.

The Planner Agent within PaperBanana functions as the initial processing unit, converting user-provided, unstructured input – such as broad concepts or incomplete descriptions – into a precise textual prompt suitable for image generation. This process involves identifying key elements, defining relationships between those elements, and specifying details regarding composition, subject matter, and overall visual characteristics. The agent employs natural language processing techniques to disambiguate ambiguous requests and to infer implicit requirements, ultimately constructing a detailed textual representation of the desired visualization. This structured prompt then serves as the primary input for subsequent agents in the pipeline, ensuring consistent and accurate image generation based on the user’s intent.
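A minimal sketch of the Planner step, assuming the agent is a thin wrapper around an LLM call: the instruction text and the `llm` callable interface here are assumptions for illustration, not the system's actual prompt.

```python
def plan_prompt(user_request: str, llm) -> str:
    """Hypothetical Planner step: expand a rough, unstructured figure
    request into a detailed image-generation prompt. `llm` is any
    callable mapping str -> str (e.g. a chat-model wrapper)."""
    instruction = (
        "Rewrite the following figure request as a detailed "
        "image-generation prompt. Identify the key elements, the "
        "relationships between them, and the desired composition "
        "and layout.\n\nRequest: "
    )
    return llm(instruction + user_request)
```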

The Retriever Agent functions by querying a pre-existing database of images and associated metadata to locate visual examples pertinent to the user’s prompt and the Planner Agent’s detailed description. This process utilizes semantic search techniques to identify images that align with the desired subject matter, composition, and style, even if exact keyword matches are absent. The retrieved examples are then provided as contextual guidance to the image generation model, effectively serving as visual constraints and improving the accuracy and relevance of the final illustration by reducing ambiguity and promoting adherence to user intent. The number of retrieved examples and the similarity threshold used during the search are configurable parameters impacting the degree of influence on the generation process.
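The retrieval step with its two configurable parameters (number of examples and similarity threshold) can be sketched as cosine-similarity search over precomputed embeddings; the embedding model itself is out of scope here, and this toy version is an assumption about how such a retriever is typically built.

```python
import numpy as np

def retrieve(query_vec, index_vecs, metadata, k=3, threshold=0.5):
    """Toy semantic retrieval: rank stored examples by cosine
    similarity to the query embedding, keeping at most k results
    whose similarity clears the threshold. Both k and the threshold
    control how strongly retrieved examples steer generation."""
    q = query_vec / np.linalg.norm(query_vec)
    db = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per example
    order = np.argsort(-sims)          # best match first
    return [metadata[i] for i in order[:k] if sims[i] >= threshold]
```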

Image generation within PaperBanana is achieved through the integration of Large Language Models (LLMs) and Nano-Banana-Pro, a diffusion-based image generation model. The LLM processes the detailed textual description provided by the Planner Agent and translates it into prompts suitable for Nano-Banana-Pro. This process is further refined by a Stylist Agent which modulates the prompts to enforce specific aesthetic guidelines – including color palettes, artistic styles, and composition rules – thereby influencing the final visual output. The Stylist Agent operates by appending stylistic directives to the LLM-generated prompts before they are passed to Nano-Banana-Pro, ensuring consistency and adherence to desired visual characteristics.
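The Stylist's role of appending stylistic directives to the LLM-generated prompt reduces to simple prompt composition; the directive format below is a hypothetical rendering of that idea, not the framework's actual syntax.

```python
def apply_style(prompt: str, style_guide: dict) -> str:
    """Hypothetical Stylist step: append aesthetic directives
    (palette, artistic style, composition rules, ...) to the prompt
    before it is passed to the image-generation model."""
    directives = "; ".join(f"{k}: {v}" for k, v in style_guide.items())
    return f"{prompt}. Style constraints: {directives}."
```

Because the directives are appended uniformly, every figure in a paper can be generated against the same style guide, which is what yields cross-figure consistency.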

PaperBananaBench: A Baseline for Pragmatic Evaluation

PaperBananaBench is a newly developed benchmark designed for the quantitative evaluation of automated diagram generation systems. Constructed from a curated dataset of diagrams and associated textual context sourced from publications at the NeurIPS conference, the benchmark provides a standardized method for assessing system performance. The dataset includes a range of diagram types commonly found in machine learning research papers, enabling evaluation across various complexities and visual representations. The comprehensive nature of PaperBananaBench facilitates reproducible research and allows for direct comparison of different automated diagram generation approaches, moving beyond subjective qualitative assessments.
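A benchmark item as described pairs a human-made diagram with the paper text it illustrates; a plausible record schema might look like the following, with field names chosen for illustration rather than taken from the released dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchEntry:
    """One benchmark item: a reference diagram plus the source
    context it depicts. Field names are illustrative only."""
    paper_id: str          # identifier of the source NeurIPS paper
    source_context: str    # abstract or section the diagram depicts
    diagram_type: str      # e.g. "methodology" or "statistical plot"
    reference_image: str   # path to the human-created figure
```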

PaperBananaBench evaluates generated diagrams based on two primary metrics: faithfulness and aesthetic qualities. Faithfulness is quantitatively determined by assessing the degree to which visual elements and their relationships accurately represent the information presented in the source context, typically a research paper abstract or section. Aesthetic qualities are judged based on characteristics contributing to visual clarity and appeal, including conciseness, readability, and overall visual design. These metrics are combined to provide a holistic evaluation of the generated diagram’s effectiveness in communicating complex information from the source material.
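The article later reports a single composite score built from faithfulness, conciseness, readability, and aesthetics. The exact aggregation is not specified, so the equal weighting below is an assumption, shown only to make the "holistic evaluation" concrete.

```python
def overall_score(faithfulness, conciseness, readability, aesthetics,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Illustrative composite metric: a weighted average of the four
    criteria, each assumed to lie in [0, 1]. The equal weights are an
    assumption, not the benchmark's published formula."""
    parts = (faithfulness, conciseness, readability, aesthetics)
    return sum(w * p for w, p in zip(weights, parts))
```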

Reference-based evaluation utilizes a dataset of human-created diagrams paired with their corresponding source context to provide quantitative metrics for assessing generated images. This method involves calculating similarity scores between the generated diagram and the human reference using established image comparison algorithms. These algorithms quantify differences in pixel values, structural similarity, and perceptual quality, providing objective measurements for faithfulness and aesthetic qualities. By comparing against human-created references, the evaluation moves beyond subjective assessments and allows for statistically significant comparisons between different automated diagram generation systems and algorithms.
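As a minimal stand-in for the image-comparison algorithms mentioned above, here is a crude pixel-difference score; real reference-based evaluation would use structural similarity (SSIM) or a learned perceptual metric, and this simplification is mine, not the benchmark's.

```python
import numpy as np

def pixel_similarity(gen: np.ndarray, ref: np.ndarray) -> float:
    """Crude reference-based score: 1 minus the mean absolute pixel
    difference between generated and reference images, with pixel
    values normalized to [0, 1]. Higher means closer to the human
    reference; SSIM or a perceptual metric would be used in practice."""
    assert gen.shape == ref.shape, "images must share dimensions"
    return float(1.0 - np.abs(gen - ref).mean())
```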

Evaluations conducted using the PaperBananaBench benchmark indicate a 17.0% improvement in the overall score achieved by the automated diagram generation system when compared to leading baseline models. This overall score is a composite metric derived from assessments of diagram faithfulness, conciseness, readability, and aesthetics. The system’s performance gains were quantitatively measured across a curated dataset of diagrams sourced from NeurIPS publications, establishing a statistically significant advantage in automated diagram creation capabilities.

Quantitative evaluation using PaperBananaBench demonstrates that the proposed system achieves statistically significant improvements over baseline automated diagram generation models across multiple metrics. Specifically, the system exhibits a +2.8% increase in Faithfulness, indicating a greater accuracy in reflecting the source material; a substantial +37.2% improvement in Conciseness, representing a more compact and efficient illustration of the key information; a +12.9% gain in Readability, suggesting enhanced clarity and ease of comprehension; and a +6.6% improvement in Aesthetics, indicating a higher visual quality of the generated diagrams. These results collectively demonstrate the system’s capacity to generate diagrams that are not only more accurate and concise, but also more easily understood and visually appealing.

The PaperBanana system incorporates a Critic Agent designed to enhance illustration quality through iterative refinement. This agent functions by analyzing generated diagrams and providing feedback based on pre-defined criteria related to faithfulness, conciseness, readability, and aesthetics. The feedback is then used to adjust the diagram generation process, enabling the system to progressively improve its output. This iterative feedback loop allows for continuous optimization of the illustrations, resulting in diagrams that more accurately reflect the source context and adhere to established visual communication principles. The Critic Agent’s assessments are integrated into the training process, guiding the model towards producing higher-quality diagrams.
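The generate-critique loop described above can be sketched as follows. The `generator` and `critic` signatures, the acceptance threshold, and the round limit are all assumptions made for illustration; the critic is assumed to return a score plus textual feedback that conditions the next draft.

```python
def refine(task, generator, critic, max_rounds=3, accept=0.8):
    """Iterative refinement loop: generate a draft, have the critic
    score it against the faithfulness/conciseness/readability/
    aesthetics criteria, and regenerate with the critic's feedback
    until the score clears the acceptance threshold or the round
    budget is spent. Signatures here are illustrative assumptions."""
    feedback = ""
    image = None
    for _ in range(max_rounds):
        image = generator(task, feedback)      # draft, conditioned on feedback
        score, feedback = critic(task, image)  # score in [0, 1] plus critique
        if score >= accept:
            break
    return image
```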

The Long View: Automation and the Future of Scientific Communication

The automation of visualization tasks, as demonstrated by PaperBanana, promises a substantial reduction in the time researchers spend transforming raw data into interpretable figures. Traditionally, creating compelling visuals requires significant manual effort – selecting appropriate chart types, meticulously formatting axes, and iteratively refining aesthetics – all of which divert valuable time and resources from core scientific inquiry. By automating these processes, PaperBanana allows researchers to rapidly explore datasets, test hypotheses, and communicate findings with increased efficiency. This acceleration isn’t merely about saving hours; it’s about fostering a more iterative research cycle, enabling quicker insights and potentially unlocking discoveries that might otherwise remain hidden due to the logistical burden of visualization.

The effective conveyance of scientific discoveries hinges critically on visual communication, and improvements in this area can dramatically amplify the clarity and impact of research findings. Studies consistently demonstrate that information presented visually is processed far more readily and retained more effectively than text-based data alone. By transforming complex datasets into intuitive graphs, charts, and diagrams, researchers can bypass cognitive limitations and facilitate quicker comprehension for both peers and the public. This not only accelerates the pace of scientific progress by streamlining knowledge dissemination, but also enhances the potential for broader societal impact as findings become more accessible and understandable, fostering informed decision-making and innovation.

PaperBanana’s architecture prioritizes seamless interoperability within existing research pipelines. The system is constructed from independent, well-defined modules, each responsible for a specific stage of the visualization process – data ingestion, analysis, and graphical rendering. This modularity enables researchers to readily incorporate PaperBanana’s capabilities into their preferred analytical tools and workflows, rather than requiring a complete overhaul of established practices. Furthermore, the design facilitates the substitution or enhancement of individual modules without disrupting the overall system functionality, allowing for continuous improvement and adaptation to evolving research needs. This ‘plug-and-play’ approach promises to significantly reduce the barrier to entry for automated visualization, fostering wider adoption and accelerating scientific discovery.

Continued development efforts are centered on broadening the scope of PaperBanana to encompass a more diverse array of visualization techniques, moving beyond current capabilities to include network graphs, geospatial representations, and potentially even interactive 3D models. Crucially, researchers aim to enhance the system’s performance when processing increasingly complex datasets – those with higher dimensionality or larger volumes – through algorithmic optimization and the exploration of parallel computing architectures. This expansion isn’t simply about adding features; it’s about ensuring that automated visualization remains a scalable and reliable tool for scientific discovery, capable of extracting meaningful insights from the ever-growing flood of research data and facilitating more robust data exploration.

The pursuit of fully autonomous AI scientists, as demonstrated by PaperBanana, feels remarkably like building a more elaborate Rube Goldberg machine. It’s an elegant attempt to automate academic illustration, yes, but one suspects production environments will quickly uncover edge cases the framework never anticipated. G. H. Hardy once observed, “A mathematician, like a painter or a poet, is a maker of patterns.” This pursuit of pattern-making, of automating visual language models to generate diagrams and statistical plots, is admirable. However, it’s also a temporary victory. The system will crash, likely in spectacular and unforeseen ways. At least, it’s predictably unpredictable, and the resulting diagrams will provide future digital archaeologists with a fascinating case study in applied technical debt.

So What Comes Next?

The automation of academic illustration, as proposed by PaperBanana, feels predictably inevitable. One suspects the core problem isn’t a lack of algorithms, but rather the relentless human capacity to redefine ‘high-quality’ just as soon as a system achieves it. Today’s elegant diagram will, without fail, become tomorrow’s visually-dated relic. The introduction of PaperBananaBench is, of course, a necessary step – a target to aim at – but it also serves as a guarantee that the target will swiftly move.

The real challenge lies not in generating a plot, but in generating the right plot – one that doesn’t subtly misrepresent the data or, worse, obscure interesting anomalies. The system’s current reliance on pre-defined templates feels… limiting. It recalls the early days of GUI design, where everything looked like a variation of Windows 3.1. One anticipates a future where these agentic frameworks spend more time arguing about aesthetic choices than actually producing insights.

Ultimately, PaperBanana appears to be a sophisticated wrapper around existing plotting libraries. Which is, naturally, how most ‘revolutionary’ frameworks end up. The documentation will inevitably become outdated, the dependencies will fracture, and someone will eventually fork it, claiming to have ‘fixed’ the problems with a new, equally fragile system. Everything new is just the old thing with worse docs.


Original article: https://arxiv.org/pdf/2601.23265.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-02-03 02:54