Dynamic instance generation for few-shot handwritten document layout segmentation (short paper)

Axel De Nardin¹, Silvia Zottin¹, Matteo Paier¹, Gian Luca Foresti¹, Emanuela Colombi² and Claudio Piciarelli¹

¹ Department of Mathematics, Computer Science and Physics, University of Udine, Udine, Italy
² Department of Humanities and Cultural Heritage, University of Udine, Udine, Italy

Abstract
Historical handwritten document analysis is an important activity for retrieving information about our past. Since this type of process is slow and time-consuming, the humanities community is searching for new techniques that could aid it in this activity. Document layout analysis is a branch of machine learning that aims to extract semantic information from digitised documents. Here we propose a new framework for handwritten document layout analysis that differs from the current state of the art in two respects: it relies on few-shot learning, thus achieving good results with little manually labelled data, and it features a dynamic instance generation process. Our results were obtained on the DIVA-HisDB dataset.

Keywords
Few-shot learning, Handwritten document layout analysis, Fully-Convolutional Network, Document image segmentation

1. Introduction

In the humanities community the study of historical handwritten documents is a crucial activity [1]. For centuries humanists have focused only on the text, without considering the elements that accompany it, such as comments and decorations, generally called paratext. In recent years, paratext analysis has gained more and more relevance: this data is fundamental to the cultural-historical understanding of the individual manuscript, but it is also of great philological relevance, because paratexts can be transcribed from one manuscript to another [2].

Despite this importance, there are currently few studies on paratexts, mainly because their analysis is a difficult and expensive task. Newly developed tools and methods are therefore sought: automated paratext extraction not only saves large amounts of time, but also enables rapid comparison of the extracted data, making it possible to establish connections between seemingly distant manuscripts, connections that escape the human eye and memory.

Figure 1: Visual representation of the proposed segmentation framework.

To achieve this, we must start from the segmentation of a given document image into semantically meaningful regions (e.g. main text, comments, decorations and background), which is the main focus of this paper. Page segmentation is a well-known open problem in the machine learning community.
Due to the non-uniform nature of the images, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm [3, 4, 5]. An exhaustive survey on document layout analysis can be found in Binmakhashen and Mahmoud [6].

In contrast, our paradigm aims at few-shot learning, since in real-world applications the available ground truth is usually limited in size [7]. To overcome this limit we introduce a dynamic instance generation process that allows us to make better use of the limited data available in this scenario. By combining this process with a fully-convolutional network operating on image patches, we achieve better performance. We test the proposed framework on the DIVA-HisDB dataset [8].

The rest of this paper is organized as follows. Section 2 describes the components of the proposed framework. Section 3 reports the details of our experimental setup. Finally, Section 4 draws the conclusions and discusses future work.

2. Proposed Method

In this section, we present the details of the proposed framework with a brief description of its key components. First, we introduce a Fully-Convolutional Network model with a ResNet-50 backbone, which performs the segmentation task in our framework. We then present our training process, characterized by dynamic instance generation. Fig. 1 presents a visual representation of our framework.

Figure 2: An example of the dynamically enhanced training data. (a) Patches division: a page of the CSG863 manuscript class is divided into non-overlapping, fixed-size patches that cover the entire input image. (b) Randomly selected crops: 10 crops of the same size as the patches are randomly selected from the same page to enlarge the training set.

2.1. Backbone network

The main component of our framework is a Fully-Convolutional Network (FCN) model [9], a ResNet-based deep model that combines layers of the feature hierarchy and refines the spatial precision of the output. This architecture is widely adopted for image semantic segmentation. The network is composed of a downsampling and an upsampling path: the former extracts and interprets context, while the latter enables precise localization. The network thus combines coarse, high-layer information with fine, low-layer information. The multilayer outputs are followed by deconvolutional layers performing bilinear upsampling to pixel-dense outputs, which allows pixel-level prediction of class labels and segmentation masks. FCN-based methods learn a pixel-to-pixel mapping without extracting region proposals. The architecture employs only locally connected layers (convolution, pooling and upsampling) and avoids dense layers; this reduces the number of parameters, makes the network faster to train, and enables predictions on inputs of variable size.
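To make the architecture concrete, the snippet below is a minimal sketch of how such a backbone could be instantiated. It assumes a PyTorch/torchvision implementation, which the paper does not specify, and the variable names are illustrative.

```python
# Minimal sketch of the segmentation backbone (assumption: PyTorch/torchvision;
# the paper does not state which framework was used).
import torch
from torchvision.models.segmentation import fcn_resnet50

NUM_CLASSES = 4  # background, main text, comments, decoration

# FCN with a ResNet-50 backbone; the head is sized for our 4 layout classes.
model = fcn_resnet50(weights=None, num_classes=NUM_CLASSES)

# Being fully convolutional, the network accepts inputs of variable size:
patches = torch.randn(8, 3, 224, 224)   # a batch of page patches
logits = model(patches)["out"]          # -> (8, 4, 224, 224) per-pixel scores

page = torch.randn(1, 3, 1344, 1120)    # a whole resized page also works
page_logits = model(page)["out"]        # -> (1, 4, 1344, 1120)
```

The per-pixel class labels are then obtained with an argmax over the channel dimension of the output.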
2.2. Dynamically enhanced training data

The key element of our few-shot learning system is making the most of the available data. Usually, to capture global contextual information, models are trained on whole images; however, we believe that much of the same information can also be retrieved from smaller sections of the document pages. For this reason, to improve the efficiency of our training setup while reducing the number of annotated images, we split each page of the manuscript into a set P of non-overlapping, fixed-size patches that cover the entire input image. These patches form the base training set.

The cardinality of P cannot grow indefinitely: each patch must be large enough to capture contextual information from the area of the original image it represents. To overcome this limitation, we introduce a dynamic instance generation process. At each epoch we retrieve a set C of randomly selected crops of the same size as the patches, and train the segmentation network on these crops, together with the corresponding segmentation maps, as additional instances. The patches selected to cover the entire image are therefore always the same, while the crops change at each epoch and are taken at random positions in the entire image; being randomly selected, they can also overlap each other. An example of this process is shown in Fig. 2. This enables us to further improve the efficiency of our training process while enhancing the generalization capacity of our model.
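To make the procedure concrete, the sketch below shows one way the static patch set P and the per-epoch crop set C could be built. The function names and tensor layout (channels-first images, integer label masks) are our own assumptions; the paper does not publish its implementation.

```python
# Illustrative sketch of the dynamic instance generation (our assumption of how
# P and C could be built; not the authors' published code).
import random
import torch

PATCH = 224  # patch/crop side length used in our experiments

def fixed_patches(image: torch.Tensor, mask: torch.Tensor, patch: int = PATCH):
    """The static set P: non-overlapping patches covering the whole page."""
    _, h, w = image.shape
    return [(image[:, y:y + patch, x:x + patch], mask[y:y + patch, x:x + patch])
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]

def random_crops(image: torch.Tensor, mask: torch.Tensor,
                 n: int = 10, patch: int = PATCH):
    """The dynamic set C: randomly placed, possibly overlapping crops."""
    _, h, w = image.shape
    crops = []
    for _ in range(n):
        y = random.randint(0, h - patch)
        x = random.randint(0, w - patch)
        crops.append((image[:, y:y + patch, x:x + patch],
                      mask[y:y + patch, x:x + patch]))
    return crops

# At every epoch the fixed patches stay the same, while the crops are re-sampled:
#   epoch_instances = fixed_patches(img, gt) + random_crops(img, gt)
```

With pages resized to 1120 × 1344 px and 224 × 224 px patches, this yields 30 fixed patches plus 10 fresh crops per image at every epoch, matching the figures reported in Section 3.2.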
3. Experiments

In this section, we describe the dataset used, highlighting its characteristics, and detail the training setup adopted for the experiments. We then outline the metrics used to evaluate our framework and provide an ablation study supporting the effectiveness of the choices defining the proposed system.

Figure 3: Samples from the three manuscripts (CB55, CSG18 and CSG863) present in DIVA-HisDB [8]. Figs. 3a–3c show a full page from each manuscript, while Figs. 3d–3f show a detail extracted from each of them.

3.1. Dataset

To train and test our system we selected the DIVA-HisDB dataset [8], a historical handwritten document dataset consisting of high-resolution, annotated RGB pages from three different medieval manuscripts, identified as CB55, CSG18 and CSG863. These documents have complex and heterogeneous layouts and, as an additional challenge, different levels of degradation. A sample page from each manuscript is shown in Fig. 3. The dataset consists of 150 images in total; for each manuscript, 20 are typically used for training, 10 for validation and another 20 for testing. For the present work, we relied on only 2 images per manuscript to train the model. DIVA-HisDB supplies pixel-level ground-truth segmentation (Fig. 4) for the layout of each image, which distinguishes between 4 classes of elements: background, main text, comments and decoration.

Figure 4: A page of the CSG863 manuscript class (4a) and the corresponding ground-truth segmentation map (4b): the magenta areas represent the main text, the yellow the comments and the cyan the decorations.

3.2. Hyperparameters setup

For the training of the proposed model we selected a weighted cross-entropy loss, since DIVA-HisDB, like historical manuscripts in general, exhibits a strong class imbalance. The class distribution of DIVA-HisDB is reported in Tab. 1.

Table 1: Class distribution (%).

          Background   Comments   Decoration    Text
CB55           82.41       8.36         0.55    8.68
CSG18          85.16       6.78         1.47    6.59
CSG863         77.82       6.35         1.83   14.00

The weight for each class is computed as the square root of 1 over the class frequency (Eq. 1, where $F_i$ is the frequency (%) of class $i$ in the corresponding manuscript):

$W_i = \sqrt{\frac{1}{F_i}}$   (1)

We used the Adam optimizer with a learning rate of 1e-3 and a weight decay of 1e-5. Training ran for a maximum of 200 epochs, with early stopping, applied after the first 50 epochs, if the network did not improve over the last 20 epochs. The original images in the dataset have a high spatial resolution (up to 4.8k × 6.8k px); to reduce the model's computational complexity, they were resized to 1120 × 1344 px. To train the proposed model, 2 images were selected for each manuscript and divided into patches of size 224 × 224 px, for a total of 60 patches per manuscript. This set is then enhanced by generating 10 random crops of the same size from each image as part of our dynamic training routine, yielding a maximum of 4000 additional instances per manuscript if the model needs all the epochs to converge.
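A possible realization of this training setup is sketched below, again under the assumption of a PyTorch implementation. The class frequencies are the CB55 values from Table 1, used here purely as an example; `model` is the FCN backbone of Section 2.1.

```python
# Sketch of the Section 3.2 training setup (assumption: PyTorch; frequencies
# are the CB55 values from Table 1, converted from percentages to fractions).
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

# Class order: background, comments, decoration, text (as in Table 1).
freq = torch.tensor([82.41, 8.36, 0.55, 8.68]) / 100.0
class_weights = torch.sqrt(1.0 / freq)  # Eq. 1: W_i = sqrt(1 / F_i)

model = fcn_resnet50(weights=None, num_classes=4)
criterion = nn.CrossEntropyLoss(weight=class_weights)  # weighted cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

def step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One training step on a batch of patches/crops with integer label maps."""
    optimizer.zero_grad()
    loss = criterion(model(images)["out"], labels)  # labels: (N, H, W), long
    loss.backward()
    optimizer.step()
    return loss.item()
```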
3.3. Metrics

The pixel-level performance of the proposed approach is evaluated through Precision, Recall, Intersection over Union (IoU) and F1-score. These metrics were calculated individually for each class of each manuscript, following the definitions in Eqs. 2–5, where TP, FP and FN stand for True Positives, False Positives and False Negatives, respectively; a weighted average based on each class frequency was then computed.

$\text{Precision} = \frac{TP}{TP + FP}$   (2)

$\text{Recall} = \frac{TP}{TP + FN}$   (3)

$\text{IoU} = \frac{TP}{TP + FP + FN}$   (4)

$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$   (5)

3.4. Results

We conducted an ablation study on the different versions of the proposed framework. The baseline approach uses whole images as the backbone network input; our improved model runs the same network at patch level with dynamic crop generation. A comparison of the two versions is presented in Tab. 2. We report both the scores for the individual manuscripts and the final averaged ones. As can be observed, splitting the images into patches and adding dynamic crop generation improve the framework's performance across all the selected metrics and all three manuscripts, most notably on CB55.

Table 2: Results of the ablation study. Each row shows the performance of a version of our system across all the selected metrics for the three manuscripts of the DIVA-HisDB dataset; the last four columns show the average scores.

                              CB55                       CSG18                      CSG863                     Mean
                              Prec   Rec    IoU    F1    Prec   Rec    IoU    F1    Prec   Rec    IoU    F1    Prec   Rec    IoU    F1
Ours (baseline)               0.814  0.829  0.705  0.795  0.853  0.862  0.756  0.831  0.893  0.881  0.782  0.862  0.853  0.857  0.748  0.829
Ours (w/ dynamic crop gen.)   0.867  0.853  0.735  0.824  0.869  0.873  0.775  0.845  0.894  0.885  0.789  0.867  0.877  0.870  0.766  0.845

Figure 5: Comparison between the segmentation results obtained by our framework and the ground truth provided by DIVA-HisDB. Each column shows a different instance from one of the three manuscript classes.

In Fig. 5 we provide some examples of the results, comparing them with the provided ground truth. As can be seen, our model produces a coarser segmentation map than the pixel-precise ground truth.

4. Conclusions

In this paper, we presented a few-shot learning framework for the layout segmentation of historical handwritten documents. In particular, we introduced a dynamic instance generation module that allowed us to increase the model's performance while requiring only a few segmented images for training. This is an important aspect of our research, because manually annotating the ground truth of manuscripts is a demanding task, and reducing the number of required segmented pages greatly helps humanists.

In future work, we would like to refine our results to obtain pixel-level segmentation matching the ground truth. This would provide results comparable to the current state of the art for historical document layout segmentation while preserving few-shot learning, which in our opinion is a fundamental feature.

References

[1] F. Simistira, M. Bouillon, M. Seuret, M. Würsch, M. Alberti, R. Ingold, M. Liwicki, ICDAR2017 competition on layout analysis for challenging medieval manuscripts, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 1361–1370.
[2] P. Andrist, Toward a definition of paratexts and paratextuality: the case of ancient Greek manuscripts, in: Bible as Notepad. Tracing Annotations and Annotation Practices, 2018, pp. 130–149.
[3] M. Mehri, N. Nayef, P. Héroux, P. Gomez-Krämer, R. Mullot, Learning texture features for enhancement and segmentation of historical document images, in: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, 2015, pp. 47–54.
[4] Y. Xu, F. Yin, Z. Zhang, C.-L. Liu, et al., Multi-task layout analysis for historical handwritten documents using fully convolutional networks, in: IJCAI, 2018, pp. 1057–1063.
[5] S. A. Oliveira, B. Seguin, F. Kaplan, dhSegment: A generic deep-learning approach for document segmentation, in: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2018, pp. 7–12.
[6] G. M. Binmakhashen, S. A. Mahmoud, Document layout analysis: a comprehensive survey, ACM Computing Surveys (CSUR) 52 (2019) 1–36.
[7] A. Garz, M. Seuret, F. Simistira, A. Fischer, R. Ingold, Creating ground truth for historical manuscripts with document graphs and scribbling interaction, in: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), IEEE, 2016, pp. 126–131.
[8] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, R. Ingold, DIVA-HisDB: A precisely annotated large dataset of challenging medieval manuscripts, in: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2016, pp. 471–476.
[9] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.