
                         Transfer learning for Renaissance illuminated manuscripts:
                         starting a journey from classification to interpretation
                         Valeria Minisini1,2,*,‡ , Giorgio Gosti2,‡ and Bruno Fanini2
                         1
                             Sapienza Università di Roma, Dipartimento di Scienze dell’Antichità, Piazzale Aldo Moro 5, 00185 Rome, Italy
                         2
                             National Research Council - Institute of Heritage Science (CNR-ISPC), Via Salaria KM 29300, 00015 Monterotondo, Italy


                                        Abstract
In recent years, illuminated manuscripts have been extensively digitized, providing an unprecedented amount
of material for computer vision research and for the creation of better-performing neural networks. In addition to
character recognition, which has long been the main application field, these new resources have allowed the
adoption of machine learning to detect decorations and miniatures, which usually concern only a few pages of a
codex yet attract great attention, especially from the non-academic public. This paper presents ongoing research
that aims to demonstrate the adoption of transfer learning on digitized artworks to improve a pretrained deep
neural network's ability to recognize handwritten pages, identify layout elements and, specifically, figurative
miniatures in Renaissance manuscripts, to be used in the creation of an immersive interface for consulting and
comparing images based on their iconography. After a brief introduction contextualizing the changes brought
by massive cultural heritage digitization, we present some of the most interesting research conducted
on both manuscripts and artworks. Next, we describe the dataset built to train the model, focusing on its
composition and the image classification system adopted. Finally, we present the training strategy
chosen to minimize human effort by dividing the dataset into three groups, before concluding with the first results
obtained so far and the prospects for future development.

                                        Keywords
                                        Deep Neural Network, Transfer Learning, Illuminated Manuscript, Image Classification, Layout Analysis




                         1. Introduction
Until the mid-15th century, illuminated manuscripts were the main vehicle for the circulation and
preservation of knowledge among the Western upper classes, surviving even beyond Johannes Gutenberg's
introduction of movable type printing. This typology of cultural artifact, difficult to access
because of the fragility of its materials, has found new life thanks to digitization, which has made available
to scholars and the general public many little-known works that are rarely brought out of storage.
Each complete codex comprises hundreds of pages, which not only contain text but are also
richly decorated with miniatures that share subjects and iconography with other artistic expressions
such as paintings and sculptures. Although illuminations occupy only a limited number of sheets, they are
also the components that most attract non-academic audiences, making the often-incomprehensible
written content partly accessible.
To enhance the thousands of volumes made available online, we are developing an immersive visualization
interface that allows users to move easily between digitized pages, search for figurative
miniatures, and relate reproductions of different works sharing the same subject. To make this
possible, a deep neural network system is being trained on a specially constructed dataset containing
reproductions of Medieval and Renaissance volumes together with artworks. Our objective is to obtain
a model that can recognize elements in the page layout and identify handwritten sheets within

3rd Workshop on Artificial Intelligence for Cultural Heritage (AI4CH 2024, https://ai4ch.di.unito.it/), co-located with the 23rd
International Conference of the Italian Association for Artificial Intelligence (AIxIA 2024), 26-28 November 2024, Bolzano, Italy
                         *
                           Corresponding author.
                         ‡
                           These authors contributed equally.
valeria.minisini@uniroma1.it (V. Minisini); giorgio.gosti@cnr.it (G. Gosti); bruno.fanini@cnr.it (B. Fanini)
ORCID: 0009-0000-6266-7757 (V. Minisini); 0000-0002-8571-1404 (G. Gosti); 0000-0003-4058-877X (B. Fanini)
                                        © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


heterogeneous collections of items. Given that deep neural networks require massive datasets with
high-quality annotations that are labor-intensive to produce, we implemented an Interactive Machine Learning
approach to rapidly compile increasingly larger databases, in which a domain expert and an AI expert
cooperate to train a model with domain-specific knowledge using transfer learning [1, 2].
Section 2 outlines the state of the art, while Section 3 presents the dataset, describing its
composition and image classification system, before presenting the technique adopted for training the
model. Finally, Section 4 discusses conclusions.


2. Related work
Most datasets of digitized handwritten and early printed texts are composed of materials
grouped by a shared language, creation period, and distributing institution [3]. They are
often small compared with newspaper [4] and museum collections [5], frequently containing
only a subset of pages from the volumes or a few complete codices [6], such as the HBA dataset [7],
whose images originate from 11 books of the French digital library Gallica. In several cases the
dataset needs to be augmented with new items; an example is [8], which uses generative adversarial
networks (GANs) to create synthetic pages to train the model.
   The use of machine learning, and specifically neural networks, to facilitate the study of historical
documents is an established field of research [9]. The main task investigated has been text recognition
[10], but work on graphic elements, such as layout analysis [8], drop-cap letter extraction [11],
figure gesture classification through template-based detectors [12] and miniature retrieval [13], has
gradually shifted attention from text to artistic aspects. Bounding-box approaches have also been
applied to illumination detection [14] and iconographic recognition [15] with good results.
   In the historical-artistic field, among the several neural network architectures, ResNet [16] has been used
on its own [17] or within more complex pipelines [18], mostly because it performs well on labeling
and detection tasks and is easily trainable, given that its residual functions allow
better error propagation. Transfer learning is often used to train ResNet architectures on small
datasets [18, 19].
   For an in-depth review of "human-in-the-loop" approaches, refer to [20]; in particular, [1, 2] discuss
Interactive Machine Learning. Label Studio is a flexible labeling tool that implements active learning
[21]. Likewise, ilastik [22] and Cellpose [23] are machine learning tools that offer Interactive Machine
Learning capabilities.


3. The Project
Collective interest in illuminated manuscripts is still lower than the attraction exerted by works
of art, coming almost entirely from relatively few expert researchers, owing to language barriers and the
complexity of interpreting handwritten sheets. Beyond the textual part, which is simultaneously
the main vehicle of and greatest obstacle to content communication, the miniatures that
captivate the viewer can be used to build accessible experiential paths based on a visual language
understood regardless of one's country of origin or historical knowledge. The first step in creating an
interface that allows users to explore manuscript pages through their more intuitive graphical
contents was to automate the recognition of those contents.
   To develop the dataset required for deep neural network transfer learning, we implemented an
iterative incremental training cycle composed of four main phases. First, a domain expert annotates or
corrects the labels of a smaller dataset. Second, an AI expert uses transfer learning and hyperparameter
optimization to train a deep neural network. Third, the model predicts the labels of a larger dataset.
The cycle then restarts from the first phase, with the domain expert correcting the annotations predicted
for the larger dataset. Labels were created using Label Studio [21].
3.1. The Dataset
To train our deep neural network to identify the different visual components inside the pages and to
distinguish miniatures from ornaments or illuminated initials, we built an
original training dataset. It contains images from digitized manuscript volumes, mainly dating from
the Medieval and Renaissance periods, together with reproductions of incunabula and later printed
copies, as well as works of art and objects commonly present in digital museum catalogs, often featuring
figurative subjects with sacred themes as decoration.
   Files come exclusively from institutional databases such as the US Library of Congress, The Metropolitan
Museum of Art in New York, and the J. Paul Getty Museum Collection in Los Angeles, as well as from
European institutions such as the Staatsbibliothek in Berlin and the Koninklijke Bibliotheek of the Netherlands.
We preferred images in the public domain or released under the Creative Commons Attribution (CC BY) license.
Furthermore, by using different sources we sought a certain variety of scanning and shooting conditions as
well as of reproduction quality, selecting both overall views and detail shots.
   Each element was classified using an alphanumeric code that allowed us to quickly trace its origin
and provided information about the depicted subject, the century of production, and the specific author
or, if unknown, at least its geographical area. As the number of elements increased, a prefix was also
added to each series to distinguish the type of physical object digitized, the associated category, and
the subclass. An example is "P.01_S(An)_14BMGetty", where "P.01" indicates the
category and subclass, "S(An)" identifies the subject as Saint Anne, "14" is the
century of creation, "BM" the author's initials, and "Getty" the provenance institution. At the same
time, this system kept low the risk of inserting multiple copies of a work, especially in cases such
as engravings, and associated images with the same themes already in the early stages of the research,
making it possible to add examples of a specific subject when it was poorly represented.
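As an illustration, such a code could be split into its fields as follows. The grammar below is inferred solely from the single example "P.01_S(An)_14BMGetty", so the field widths (two-digit subclass, two-letter author initials) are assumptions rather than the project's actual specification:

```python
import re

# Assumed grammar, inferred from "P.01_S(An)_14BMGetty"; real codes may
# deviate (e.g. when the author is unknown and an area code is used instead).
CODE_RE = re.compile(
    r"(?P<category>[PA]\.\d{2})"   # macro-category and subclass, e.g. P.01
    r"_S\((?P<subject>[^)]+)\)"    # depicted subject, e.g. An = Saint Anne
    r"_(?P<century>\d{1,2})"       # century of creation
    r"(?P<author>[A-Z]{2})"        # author's initials (assumed two letters)
    r"(?P<institution>.+)"         # provenance institution
)

def parse_code(code: str) -> dict:
    """Split a dataset identifier into its descriptive fields."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized code: {code!r}")
    return match.groupdict()
```

Encoding the metadata in the filename keeps every item self-describing, which is what enables the duplicate-detection and theme-grouping benefits described above.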
   In total, the dataset consists of 45,798 items divided into two macro-categories, volumes (P) and art
(A), and into ten subgroups: manuscript sheets (P.01), printed pages (P.02), paintings (A.01), engravings
(A.02), drawings (A.03), sculptures (A.04), stained glass windows (A.05), tapestries (A.06), art prints
(A.07) and other objects (A.08). Since the project focuses on manuscripts, the number of images
dedicated to them is higher than that of all other typologies, constituting 68.25% of the total, and was
chosen to cover a wide variety of layouts and styles. They come chiefly from books of hours, psalters,
missals, breviaries, and bibles, selected primarily for the presence of illustrations. The miniatures in the
chosen manuscripts are both full-page and inserted in the text.
   Always privileging the figurative element, we classified the sheets without illustrations according
to the presence of decorations and, only in their absence, according to the dominant textual or musical
component. Four summary partitions were thus obtained in the main category, useful first to balance the
composition of the dataset, avoiding the insertion of too many text pages compared to the other elements,
and afterwards for the subdivision into three training groups of progressively larger size: 400, 4,000
and 41,398 images.

3.2. The Training
The smallest group consists of 330 pages and 70 artworks, a proportion that remains constant in the
second group and undergoes slight variations in the third, where the category "Pages" contains a total of
34,447 items, to which 6,951 images of the other category are added. In the preprocessing step we uniformed
the dataset by resizing images to 1280 × 1280 pixels while maintaining proportions.
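A minimal sketch of this preprocessing step, assuming Pillow and assuming that "maintaining proportions" means fitting the longer side into a 1280-pixel box (the paper does not say whether images are then padded to a square):

```python
from PIL import Image

def resize_max_side(img: Image.Image, max_side: int = 1280) -> Image.Image:
    """Downscale so the longer side is at most `max_side` pixels, preserving
    the aspect ratio (smaller images are left untouched)."""
    out = img.copy()
    out.thumbnail((max_side, max_side))  # in-place resize, keeps proportions
    return out
```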
   To limit human effort in the annotation phase, we manually labeled only the first group, reporting
the presence or co-presence on each page of figurative miniatures (Fig), decorations (Deco), text (Text),
music scores (Mus) and illuminated initials (Let), with a minimum of one and a maximum of four tags
per page, for a total of 818 tags. This set, split into 70% training and 30% test, was used to initially
train our deep neural network to distinguish pages from art and manuscript from printed pages, as well
as to detect the layout elements.
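Since each page can carry one to four of the five tags, the natural encoding is a multi-hot vector per image. The helpers below are a sketch under that assumption, together with a reproducible 70/30 split; the label names follow the abbreviations above.

```python
import random

# Tag abbreviations from the annotation scheme described above.
LABELS = ["Fig", "Deco", "Text", "Mus", "Let"]

def to_multi_hot(tags):
    """Encode a page's tag set (1 to 4 of the 5 labels) as a multi-hot vector."""
    return [1.0 if label in tags else 0.0 for label in LABELS]

def split_70_30(items, seed=0):
    """Reproducible 70% training / 30% test split."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(0.7 * len(items))
    return items[:cut], items[cut:]
```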
Figure 1: Prediction accuracy for target labels for the 400-image dataset.

   We use a transfer learning approach based on a ResNet architecture [16] pretrained on the ImageNet
dataset [24]. We planned to test ResNet-50 and ResNet-101 as well as ResNet-18. Since we had a relatively small
dataset, we started with ResNet-18 because models with fewer parameters are less likely to be in an
over-parametrization regime. We obtained unexpectedly good results for both the training and the test
metrics. In future work, we will test the larger networks.
   In the training phase, we augmented the dataset by transforming the images passed by the dataloader
through the following sequence: first, the dataloader resized the images to 256 × 256; then it randomly
cropped and resized them back to 256 × 256; finally, it normalized the images according to ResNet's
suggested parameters. Initially, a distinct group of deep neural networks was trained on the first group
without the category "Art", which allowed us to verify that art images do not negatively impact the
prediction accuracy.
   We considered two gradient descent training methods: Stochastic Gradient Descent (SGD) and Adam.
For SGD we used an exponential learning rate scheduler with a starting learning rate of 0.001, momentum
0.9, scheduler step size 7, and gamma 0.1. For Adam, we settled on a learning rate of 0.001.
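In PyTorch terms, the reported hyperparameters (step size 7, gamma 0.1) correspond to a step-decay schedule; the sketch below uses `StepLR` under that reading, with a placeholder model standing in for the network:

```python
import torch.nn as nn
from torch import optim

model = nn.Linear(8, 5)  # placeholder; in the project this is the ResNet

# SGD: starting learning rate 0.001, momentum 0.9, decayed by gamma = 0.1
# every 7 epochs (the step-size / gamma pair reported above maps to StepLR).
sgd = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(sgd, step_size=7, gamma=0.1)

# Adam alternative with the same starting learning rate.
adam = optim.Adam(model.parameters(), lr=0.001)
```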
   We also compared two types of transfer learning: fine-tuning all weights, or freezing the features
during training and fine-tuning only the last layer.
   Figure 1 shows the accuracy levels achieved: in almost all cases, fine-tuning all weights with
SGD gave the best results, even if the differences are not large. The accuracy rate is almost always
above 90%, and particularly high in distinguishing pages from art and in detecting text or music, while
the lowest value was recorded for illuminated initials, caused mostly by the inherent ambiguity between
simple ornamental letters and historiated initials.
   Given these promising results, the models with the highest accuracy for each target were
used to automatically predict the labels for the 4,000 images of the second set. The domain expert
then carefully checked the output to identify specific model biases and corrected the assigned labels.
These models demonstrated their ability to recognize with high precision the presence of decorations
and illustrations on the pages, especially when borders are present and the scene occupies a large
portion of the space. They are, however, less precise in distinguishing miniatures inserted in the text
from illuminated initials, often confusing the two elements and labeling some of the former as "Let".
They also incorrectly classified blank pages as "Text" and applied that tag to some sculptures alongside
the correct "Fig".
   In only ten cases was the prediction completely wrong: out of 4,000 images, only 1,549 required
manual intervention, saving several hours of work and yielding an overall total of 7,205 annotations.
Comparing the automatically applied tags with the corrected ones, the most significant decrease was
observed for "Text", which went from 3,174 to 2,854, followed by "Let", reduced from 966 to 936. On the
other hand, the "Fig" and "Deco" classes increased, going from 1,495 to 1,765 and from 1,214 to 1,292,
respectively.
   The accuracy of the predictions was overall satisfactory, with a rate of 0.98 for the
recognition of figurative miniatures, 0.88 for text, 0.86 for decorations and 0.74 for illuminated
initials. However, a particular weakness was observed with musical scores printed after the 15th
century, which are not recognized as such; this significantly affected the number of labels assigned by
the system, only 251 out of the actual 358.


4. Conclusions
The initial recognition tests on the layout elements gave very encouraging results, since accuracy levels
above 80% are not often obtained with such small datasets. Our approach allows for a gradual
and efficient labeling process; in this paper we stop at the end of the first step of the incremental
learning cycle. In future work, we plan to progressively grow the number of annotated items to verify
whether similar results are maintained on the third, larger group.
   Once this phase is completed, a further step will be taken to improve the dataset with more complex
labeling tasks such as instance and semantic segmentation. Specifically, we will gradually move from
image classification to the detection of specific components, using the activation zones as a guide to
automatically place bounding boxes, and then perform semantic segmentation of any miniatures present.
   The masks thus obtained will be used to enrich an interactive and immersive web3D system based on
an open-source framework [25], which will be investigated in the upcoming stages of the research. This
will give users the possibility to inspect and query large numbers of pages in a 3D space, juxtapose
similar images, and compare them with art reproductions.
   In doing so, we intend to demonstrate how, through the use of new technologies, figurative language
can be used to structure narratives that make artifacts originally the prerogative of a few available
to all, freeing a delicate cultural property from its physical and conservation-related limits and
exploiting iconography as an interpretative system that can be understood independently of the spoken
language.



References
 [1] A. Holzinger, Interactive machine learning for health informatics: when do we need the human-
     in-the-loop?, Brain Informatics 3 (2016) 119–131. doi:10.1007/s40708-016-0042-6.
 [2] R. Porter, J. Theiler, D. Hush, Interactive machine learning in data exploitation, Computing in
     Science and Engineering 15 (2013) 12–20. doi:10.1109/MCSE.2013.74.
 [3] K. Nikolaidou, M. Seuret, H. Mokayed, M. Liwicki, A survey of historical document image
     datasets, International Journal on Document Analysis and Recognition (IJDAR) 25 (2022) 305–338.
     doi:10.1007/s10032-022-00405-8.
 [4] B. C. G. Lee, J. Mears, E. Jakeway, M. Ferriter, C. Adams, N. Yarasavage, D. Thomas, K. Zwaard, D. S.
     Weld, The newspaper navigator dataset: extracting headlines and visual content from 16 million
     historic newspaper pages in Chronicling America, in: Proceedings of the 29th ACM International
     Conference on Information & Knowledge Management, CIKM ’20, Association for Computing
     Machinery, New York, NY, USA, 2020, pp. 3055–3062. doi:10.1145/3340531.3412767.
 [5] N. A. Ypsilantis, N. Garcia, G. Han, S. Ibrahimi, N. V. Noord, G. Tolias, The Met Dataset: instance-
     level recognition for artworks, 2022. URL: https://arxiv.org/abs/2202.01747. arXiv:2202.01747.
 [6] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, R. Ingold, DIVA-HisDB: a
     precisely annotated large dataset of challenging medieval manuscripts, in: 2016 15th In-
     ternational Conference on Frontiers in Handwriting Recognition (ICFHR), 2016, pp. 471–476.
     doi:10.1109/ICFHR.2016.0093.
 [7] M. Mehri, P. Héroux, R. Mullot, J.-P. Moreux, B. Coüasnon, B. Barrett, HBA 1.0: a pixel-based
     annotated dataset for historical book analysis, in: Proceedings of the 4th International Workshop
     on Historical Document Imaging and Processing, HIP ’17, Association for Computing Machinery,
     New York, NY, USA, 2017, pp. 107–112. doi:10.1145/3151509.3151528.
 [8] N. Rahal, L. Vögtlin, R. Ingold, Approximate ground truth generation for semantic labeling of
     historical documents with minimal human effort, International Journal on Document Analysis
     and Recognition (IJDAR) 27 (2024) 335–347. doi:10.1007/s10032-024-00475-w.
 [9] F. Lombardi, S. Marinai, Deep learning for historical document analysis and recognition—a survey,
     Journal of Imaging 6 (2020) 110. doi:10.3390/jimaging6100110.
[10] M. A. Islam, I. E. Iacob, Manuscripts character recognition using machine learning and deep
     learning, Modelling 4 (2023) 168–188. doi:10.3390/modelling4020010.
[11] M. Coustaty, R. Pareti, N. Vincent, J.-M. Ogier, Towards historical document indexing: extraction
     of drop cap letters, International Journal on Document Analysis and Recognition (IJDAR) 14 (2011)
     243–254. doi:10.1007/s10032-011-0152-x.
[12] J. Schlecht, B. Carqué, B. Ommer, Detecting gestures in medieval images, in: 2011 18th IEEE
     International Conference on Image Processing, IEEE, Brussels, Belgium, 2011, pp. 1285–1288.
     doi:10.1109/ICIP.2011.6115669.
[13] D. Borghesani, C. Grana, R. Cucchiara, Miniature illustrations retrieval and innovative inter-
     action for digital illuminated manuscripts, Multimedia Systems 20 (2014) 65–79. doi:10.1007/
     s00530-013-0315-3.
[14] F. Aouinti, V. Eyharabide, X. Fresquet, F. Billiet, Illumination detection in IIIF medieval manuscripts
     using deep learning, Digital Medievalist 15 (2022) 1–18. doi:10.16995/dm.8073.
[15] P. Manoni, AI4MSS : un esperimento di intelligenza artificiale alla Biblioteca Apostolica Vaticana,
     in: G. Bergamin, T. Possemato (Eds.), Guardando oltre i confini : partire dalla tradizione per
     costruire il futuro delle biblioteche : studi e testimonianze per i 70 anni di Mauro Guerrini, AIB -
     Associazione Italiana Biblioteche, 2023, pp. 231–244. doi:10.1400/294065.
[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE
     Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 770–778.
     doi:10.1109/CVPR.2016.90.
[17] W. Zhao, D. Zhou, X. Qiu, W. Jiang, Compare the performance of the models in art classification,
     PLoS ONE 16 (2021). doi:10.1371/journal.pone.0248414.
[18] F. Milani, P. Fraternali, A dataset and a convolutional model for iconography classification in
     paintings, Journal on Computing and Cultural Heritage 14 (2021) 46. doi:10.1145/3458885.
[19] N. Banar, W. Daelemans, M. Kestemont, Transfer learning for the visual arts: the multi-modal
     retrieval of Iconclass codes, Journal on Computing and Cultural Heritage 16 (2023) 32. doi:10.
     1145/3575865.
[20] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, A. Fernández-Leal,
     Human-in-the-loop machine learning: a state of the art, Artificial Intelligence Review 56 (2022)
     3005–3054. doi:10.1007/s10462-022-10246-w.
[21] M. Tkachenko, M. Malyuk, A. Holmanyuk, N. Liubimov, Label Studio: data labeling software,
     2020-2024. URL: https://github.com/HumanSignal/label-studio, open source software available
     from https://github.com/HumanSignal/label-studio.
[22] S. Berg, D. Kutra, T. Kroeger, C. N. Straehle, B. X. Kausler, C. Haubold, M. Schiegg, J. Ales, T. Beier,
     M. Rudy, K. Eren, J. I. Cervantes, B. Xu, F. Beuttenmueller, A. Wolny, C. Zhang, U. Koethe, F. A.
     Hamprecht, A. Kreshuk, ilastik: interactive machine learning for (bio)image analysis, Nature
     Methods 16 (2019) 1226–1232. doi:10.1038/s41592-019-0582-9.
[23] M. Pachitariu, C. Stringer, Cellpose 2.0: how to train your own model, Nature Methods 19 (2022)
     1634–1641. doi:10.1038/s41592-022-01663-4.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
     M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, International
     Journal of Computer Vision 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[25] B. Fanini, D. Ferdani, E. Demetrescu, S. Berto, E. d’Annibale, ATON: an open-source framework
     for creating immersive, collaborative and liquid web-apps for cultural heritage, Applied Sciences
     11 (2021) 11062. doi:10.3390/app112211062.