Unlocking History’s Visual Secrets

Author: Denis Avetisyan


A new deep-learning pipeline is enabling researchers to automatically analyze and understand the illustrations within vast collections of digitized historical manuscripts.

Large multimodal models, such as LLaVA, demonstrate the capacity to generate descriptive captions directly applicable to figures within research manuscripts.

This work details an efficient approach to image classification, object detection, and captioning for historical document analysis and digital humanities research.

Despite growing digital archives of historical manuscripts, systematic, large-scale study of their visual content remains a significant challenge. This paper, ‘Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach’, presents a fast and scalable deep-learning pipeline for automatically detecting, extracting, and describing illustrations within these collections. By combining image classification, object detection, and multimodal image captioning, we demonstrate the ability to process over three million manuscript pages, identifying more than 200,000 unique illustrations at unprecedented speed. Will this approach fundamentally reshape scholarly workflows and unlock new insights into the artistic and cultural heritage embedded within these invaluable historical resources?


Unveiling Hidden Narratives: A System for Manuscript Exploration

Historical manuscripts constitute a remarkably extensive, largely unexplored source of cultural and artistic insight, yet their thorough investigation is severely hampered by the constraints of manual analysis. Centuries of accumulated documents – from medieval chronicles and religious texts to personal letters and early scientific observations – remain unread or only partially understood due to the sheer labor involved in deciphering faded inks, complex scripts, and fragmented pages. While dedicated scholars have painstakingly unlocked portions of this hidden heritage, the process is incredibly time-consuming, limiting the scope of research and preventing a comprehensive understanding of past civilizations. This slow pace not only restricts academic discovery but also jeopardizes the preservation of these fragile artifacts, as repeated handling can accelerate their deterioration, creating a pressing need for innovative approaches to access and interpret their wealth of information.

The preservation of cultural heritage faces a significant hurdle due to the immense scale of historical manuscript collections and their inherent vulnerability. Millions of documents – many centuries old and exquisitely fragile – require careful study, but manual examination is a painstakingly slow process, and direct handling risks irreversible damage. This confluence of volume and fragility necessitates the development of automated analytical methods. Researchers are increasingly turning to computational techniques – including advanced image processing and machine learning – to create digital surrogates and extract information from these invaluable sources without physically interacting with the originals, enabling wider access and more efficient exploration of history’s hidden stories.

Illuminated manuscripts present a unique challenge to conventional image analysis due to their inherent artistic complexity. Unlike modern text documents with standardized layouts, these historical artifacts feature highly variable designs – intricate borders, stylized calligraphy, and diverse miniature paintings – that confound algorithms designed for simpler, more uniform images. Traditional methods relying on identifying consistent patterns or clear text segmentation often fail when confronted with the deliberate ornamentation and non-linear arrangements characteristic of medieval book production. The very features that make these manuscripts beautiful and culturally significant – the artistic flourishes, the varied page designs, and the integration of image and text – actively impede automated analysis, requiring the development of novel approaches capable of discerning meaningful information amidst visual richness and deliberate deviation from standardized formats.

This pipeline converts scanned historical document pages into a searchable database of artwork and illustrations.

Automated Illustration Extraction: A Deep Learning Pipeline

The initial stage of the illustration extraction pipeline uses an EfficientNet model for image classification. Trained on a dedicated dataset, the model achieves a classification accuracy of 95.1% on a held-out test set, where accuracy is the proportion of correctly classified pages. This filtering step ensures that pages containing illustrations are correctly identified before subsequent processing stages, minimizing false positives and focusing computational resources on relevant content.
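To make this stage concrete, here is a minimal sketch of a binary page classifier built on EfficientNet with PyTorch and torchvision. The checkpoint path, label order, and decision threshold are hypothetical; the paper does not specify its exact EfficientNet variant or training setup.

```python
# Minimal sketch of the page-classification stage. Assumptions: a binary
# "illustration vs. text-only" head fine-tuned on EfficientNet-B0; the
# checkpoint "efficientnet_pages.pt" and label order are hypothetical.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = models.efficientnet_b0(num_classes=2)  # class 0: text-only, class 1: illustrated
model.load_state_dict(torch.load("efficientnet_pages.pt", map_location="cpu"))
model.eval()

def page_has_illustration(path: str, threshold: float = 0.5) -> bool:
    """Return True when the page image likely contains an illustration."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs[0, 1].item() >= threshold
```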

Object detection is performed using the YOLOv11 model to identify and crop illustrations within each page of a document. This process yields an object detection recall of 79% specifically for illustration localization, indicating the model’s ability to correctly identify and bound the majority of illustrations present. The model outputs bounding box coordinates, which are then used to crop the identified illustration from the original page image, preparing it for subsequent analysis or processing steps.
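A minimal sketch of this detect-and-crop step using the Ultralytics YOLO API follows; the fine-tuned checkpoint name and the confidence threshold are assumptions rather than the paper's actual artifacts.

```python
# Sketch of the detect-and-crop step with the Ultralytics YOLO API. The
# checkpoint "yolo11n_illustrations.pt" and the confidence threshold are
# assumptions.
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolo11n_illustrations.pt")

def crop_illustrations(path: str, conf: float = 0.25) -> list[Image.Image]:
    """Detect illustrations on a page image and return the cropped regions."""
    page = Image.open(path).convert("RGB")
    result = detector.predict(path, conf=conf, verbose=False)[0]
    # boxes.xyxy holds one (x1, y1, x2, y2) box per detected illustration
    return [
        page.crop((int(x1), int(y1), int(x2), int(y2)))
        for x1, y1, x2, y2 in result.boxes.xyxy.tolist()
    ]
```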

Page Layout Analysis, implemented through pixel-level segmentation, refines the boundaries of detected illustrations beyond the initial object detection phase. This technique analyzes the image at the pixel level to differentiate between illustration content and surrounding page elements, such as text, lines, and background. By precisely identifying the edges of the illustrations based on pixel characteristics, the system minimizes inclusion of extraneous page content within the extracted illustration region. This granular approach results in more accurate cropping and improved fidelity of the extracted illustration, enhancing its suitability for downstream analysis tasks.
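The refinement idea can be sketched as follows, assuming some layout-segmentation model has already produced a pixel mask in which nonzero values mark illustration content; the mask source and the box format are assumptions.

```python
# Sketch of mask-based boundary refinement. Assumption: a layout-segmentation
# model has produced an H x W pixel mask whose nonzero entries mark
# illustration content; boxes are (x1, y1, x2, y2) in pixels.
import numpy as np

def refine_box(mask: np.ndarray,
               box: tuple[int, int, int, int]) -> tuple[int, int, int, int]:
    """Shrink a detector box to the extent of the illustrated pixels inside it."""
    x1, y1, x2, y2 = box
    ys, xs = np.nonzero(mask[y1:y2, x1:x2])
    if ys.size == 0:  # no illustrated pixels found: keep the original box
        return box
    return (x1 + int(xs.min()), y1 + int(ys.min()),
            x1 + int(xs.max()) + 1, y1 + int(ys.max()) + 1)
```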

The automated illustration extraction pipeline processes each page in under 0.06 seconds. This performance was achieved through optimization of the integrated deep learning models and efficient data handling, and it significantly reduces the manual effort previously required for illustration preparation, enabling faster analysis of large document sets. Prior methods necessitated substantial human intervention for cropping and isolating illustrations, often requiring several minutes per page; this automated approach represents a considerable time savings and facilitates scalability for large-scale projects.

Our fine-tuned YOLOv11n model accurately detects illustrations within images, as indicated by the bounding boxes highlighting their regions.

From Visuals to Verbiage: Describing Illustrated Narratives

The image captioning process utilizes LLaVA, a multimodal model combining visual and language processing capabilities. Specifically, LLaVA receives the extracted illustrations as visual input and generates corresponding textual descriptions detailing the depicted content. This is achieved through a transformer-based architecture trained on extensive image-text datasets, allowing the model to correlate visual features with semantic language representations. The output consists of descriptive sentences identifying key elements, actions, and contextual information present within each illustration, effectively translating visual data into a machine-readable and human-understandable format.
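A minimal captioning sketch along these lines, using the publicly released LLaVA-1.5 checkpoint via Hugging Face Transformers, is given below. The model variant, prompt wording, and generation settings are assumptions; the paper does not publish its exact configuration.

```python
# Captioning sketch with the public LLaVA-1.5 checkpoint via Hugging Face
# Transformers. The model variant, prompt wording, and generation settings
# are assumptions, not the paper's published configuration.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def caption_illustration(image: Image.Image) -> str:
    """Generate a description of subjects, objects, and artistic style."""
    prompt = ("USER: <image>\nDescribe the subjects, objects, and artistic "
              "style of this manuscript illustration. ASSISTANT:")
    inputs = processor(images=image, text=prompt,
                       return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=128)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()  # keep only the model's answer
```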

The vision-language model employed generates textual descriptions that encompass multiple facets of each illustration. Descriptions detail the primary subjects depicted – individuals, animals, or mythological figures – alongside identified objects present within the scene, such as furniture, tools, or garments. Crucially, the model also characterizes the artistic style, noting features indicative of specific periods, techniques – like illumination or engraving – and stylistic conventions, thereby providing a comprehensive content representation beyond simple object recognition.

Performance of the image captioning pipeline was evaluated using the Vatican Library Collection and the Golden Haggadah. Results indicate an F1-score of 76.5% on a held-out test set, demonstrating the system’s capacity to accurately identify and describe elements within complex visual scenes. This metric assesses the balance between precision and recall in classifying the content of illustrations, and the achieved score suggests a robust level of performance in understanding and characterizing the visual information present in these historical texts.
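For reference, a minimal sketch of an element-level F1 computation of this kind follows; representing gold annotations and caption-derived predictions as sets of element labels is an assumption about the evaluation protocol.

```python
# Sketch of an element-level F1 computation. Assumption: gold annotations and
# caption-derived predictions are both represented as sets of element labels.
def f1_score(gold: set[str], pred: set[str]) -> float:
    """Harmonic mean of precision and recall over identified elements."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Example: gold {"knight", "horse", "gold leaf"} vs. pred {"knight", "horse",
# "castle"} gives precision = recall = 2/3, so F1 = 2/3, roughly 0.67.
```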

The textual captions generated by the image captioning pipeline function as descriptive metadata associated with each illustration, facilitating advanced content-based retrieval and analysis. This metadata allows researchers to implement precise searches, filtering illustrations by identified objects, subjects, or stylistic elements. Furthermore, the structured textual data enables quantitative analysis of visual trends within a collection, such as the frequency of specific motifs or the prevalence of certain artistic techniques. The availability of this metadata significantly enhances the accessibility and research potential of digitized historical illustrations, moving beyond simple visual inspection to enable data-driven investigations.
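The kind of retrieval this enables can be sketched as a simple keyword filter over a caption index; the JSON record schema and file name here are hypothetical, and the paper's prototype interface likely uses a more sophisticated search backend than a linear scan.

```python
# Sketch of content-based retrieval over caption metadata. The JSON schema
# and file name are hypothetical; a production system would use a proper
# search index rather than a linear scan.
import json

def search_illustrations(index_path: str, query: str) -> list[dict]:
    """Return illustration records whose captions mention the query term."""
    with open(index_path, encoding="utf-8") as f:
        records = json.load(f)  # e.g. [{"image": "p12_1.png", "caption": "..."}]
    q = query.lower()
    return [r for r in records if q in r["caption"].lower()]

# e.g. search_illustrations("golden_haggadah_captions.json", "seder table")
```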

Our prototype interface enables search within the Golden Haggadah.

Expanding the Horizon: Implications for the Digital Humanities

The development of an automated pipeline represents a significant leap forward in the capacity to process vast quantities of historical documents. This system isn’t simply about converting pages to digital images; it’s engineered for scale, capable of digitizing and analyzing over three million manuscript pages with an efficiency previously unattainable. This scalability unlocks opportunities for researchers to move beyond localized studies and explore broad historical trends, comparative analyses across geographically diverse collections, and the identification of patterns within enormous datasets. By automating key stages of the digitization and analysis process, the pipeline dramatically reduces both the time and resources required, effectively democratizing access to these invaluable cultural resources and fostering new avenues of inquiry in the digital humanities.

The automated extraction and description of illustrations from digitized manuscripts is fundamentally reshaping art historical inquiry. Previously requiring painstaking manual review, this process now allows researchers to analyze vast quantities of visual material, identifying recurring motifs, stylistic evolutions, and the diffusion of artistic ideas across time and geography with unprecedented scale. This computational approach moves beyond traditional iconographic studies, enabling the tracing of visual trends and influences – from the subtle adoption of decorative elements to the wholesale imitation of compositional strategies – across entire collections. Consequently, scholars can now explore networks of artistic exchange, pinpoint regional variations in artistic practice, and even reconstruct the visual cultures that shaped historical perceptions with a level of detail previously unattainable, fostering a more nuanced understanding of art’s role in broader cultural contexts.

The systematic generation of descriptive metadata from digitized manuscripts offers a transformative opportunity for cultural heritage institutions. This data, detailing aspects like script type, layout features, and even identified illustrations, isn’t merely a byproduct of the digitization process, but a key to unlocking wider access and facilitating richer research. By integrating this machine-generated metadata into digital libraries and archives, previously fragmented or inaccessible collections become searchable and interconnected. This enhanced discoverability allows researchers – and the public – to move beyond simple keyword searches, instead exploring manuscripts based on nuanced visual or textual characteristics. Consequently, the pipeline promises to significantly broaden engagement with historical documents, fostering new avenues for scholarship and preserving cultural knowledge for future generations.

The digitization and analysis of historical manuscripts represents more than a technical achievement; it unlocks a wealth of previously inaccessible narratives that profoundly shape cultural heritage. These documents, often penned centuries ago, contain the thoughts, beliefs, and daily lives of past generations, offering unique insights into the evolution of societies, artistic movements, and intellectual thought. By making these stories readily available, researchers and the public alike can explore patterns of cultural exchange, trace the development of ideas, and gain a nuanced understanding of the forces that have shaped the present. The ability to computationally analyze these texts and associated imagery further amplifies this potential, revealing hidden connections and allowing for the reconstruction of lost contexts, ultimately enriching and expanding collective knowledge of the human experience.

The pursuit of automated analysis within digitized manuscripts, as detailed in this study, mirrors a systemic approach to understanding complex structures. Just as a holistic view is crucial for identifying potential weaknesses, this deep-learning pipeline acknowledges the interconnectedness of visual elements within historical documents. As Bertrand Russell observed, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” This research escapes the limitations of manual analysis, offering a framework to move beyond traditional methods and unlock previously inaccessible insights from vast collections. The system’s ability to identify and describe illustrations – moving from pixel data to meaningful contextualization – highlights how structure dictates behavior, both in the manuscript itself and in the analytical process.

Beyond the Image

The presented pipeline, while efficient in its extraction and description of illustrative content within manuscripts, merely addresses the surface of a far more intricate problem. The true challenge lies not simply in detecting an image, but in understanding its function within the larger codex structure. A drawing is not an isolated entity; its meaning is intrinsically linked to the text surrounding it, the physical placement on the page, and the historical context of its creation. Future work must therefore prioritize relational understanding – how these visual elements interact to construct meaning.

Current approaches, even those incorporating image captioning, often fall into the trap of descriptive labeling rather than interpretive analysis. A system might identify a ‘knight on horseback’, but it cannot discern whether that knight represents a historical figure, a symbolic archetype, or a marginal doodle. The pursuit of ‘ground truth’ in these domains is inherently problematic; interpretations shift with scholarly consensus, and ambiguity is often intentional. The goal should not be to eliminate ambiguity, but to represent it transparently.

Ultimately, the utility of such a pipeline rests on its ability to facilitate, not replace, humanistic inquiry. The system functions best not as an autonomous analyst, but as a sophisticated filter, allowing researchers to navigate vast digital collections with greater efficiency. The elegance of any solution will be measured not by its complexity, but by its capacity to reveal the underlying simplicity of the manuscript itself.


Original article: https://arxiv.org/pdf/2601.05269.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
