<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic instance generation for few-shot handwritten document layout segmentation (short paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Axel De Nardin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Zottin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Paier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Luca Foresti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuela Colombi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Piciarelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Humanities and Cultural Heritage, University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics</institution>
          ,
          <addr-line>Computer Science and Physics</addr-line>
          ,
          <institution>University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Historical handwritten document analysis is an important activity for retrieving information about our past. Given that this type of process is slow and time-consuming, the humanities community is searching for new techniques that could aid them in this activity. Document layout analysis is a branch of machine learning that aims to extract semantic information from digitised documents. Here we propose a new framework for handwritten document layout analysis that differentiates itself from the current state-of-the-art through few-shot learning, which allows for good results with little manually labelled data, and through a dynamic instance generation process. Our results were obtained using the DIVA-HisDB dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Few-shot learning</kwd>
        <kwd>Handwritten document layout analysis</kwd>
        <kwd>Fully-Convolutional Network</kwd>
        <kwd>Document image segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the humanities community, the study of historical handwritten documents is a crucial
activity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For centuries humanists have focused only on the text, without considering the elements
that accompany it, such as comments and decorations, collectively called paratext. In recent years,
paratext analysis has gained more and more relevance: this data is fundamental to the
cultural-historical understanding of the individual manuscript, but is also of great philological relevance,
because paratexts can be transcribed from one manuscript to another [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Despite this importance, there are currently few studies about paratexts, mainly because
paratext analysis is a difficult and expensive task. Newly developed tools and methods are
therefore sought: automated paratext extraction not only saves large amounts of time, but also enables
rapid comparison of the extracted data, making it possible to establish connections
between seemingly distant manuscripts, connections that escape the human eye and memory.</p>
      <p>In order to achieve this, we must start from the segmentation of a given document image
into semantically meaningful regions (e.g. main text, comments, decorations and background),
which is the main focus of this paper.</p>
      <p>
        Page segmentation is a well-known open problem in the machine learning community. Due
to the non-uniformity and heterogeneity of the images, many of the approaches adopted to solve
this problem rely on a fully supervised learning paradigm [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. An exhaustive survey on
document layout analysis can be found in Binmakhashen and Mahmoud [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In contrast, our paradigm aims at few-shot learning, given that in real-world
applications the available ground truth is usually limited in size [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To overcome this limit, we introduce a
dynamic instance generation process that allows us to augment the limited data available in this
scenario. By combining this with a fully-convolutional network and image patching, we are able to achieve
better performance. We used DIVA-HisDB to test the proposed framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes the components defining
the proposed framework. Section 3 reports the details of our experimental setup. Finally,
Section 4 draws the conclusions and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <p>In this section, we present the details of the proposed framework with a brief description of its
key components. First, we introduce a Fully-Convolutional Network model with a ResNet-50
backbone, which is used for the segmentation task in our framework. We then present our
training process, characterized by dynamic instance generation. Fig. 1 presents a
visual representation of our framework.</p>
      <p>(Figure: (a) Patch division; (b) Randomly selected crops)</p>
      <sec id="sec-2-1">
        <title>2.1. Backbone network</title>
        <p>
          The main component of our framework is a Fully-Convolutional Network (FCN) model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a
ResNet-based deep model that combines layers of the feature hierarchy and refines the spatial
precision of the output. This architecture is widely adopted in the context of image semantic
segmentation.
        </p>
        <p>The network is composed of a downsampling and an upsampling path. The former is used to
extract and interpret the context, while the latter enables localization. The network is able
to combine coarse, high-layer information with fine, low-layer information. Furthermore, the
multilayer outputs are followed by deconvolutional layers for bilinear upsampling to pixel-dense
outputs. This allows for pixel-level identification of class labels and the prediction of segmentation masks.</p>
        <p>FCN-based methods learn a mapping from pixels to pixels without extracting region
proposals. This architecture employs solely locally connected layers (such as convolution,
pooling and upsampling) and avoids the use of dense layers. This means the network requires fewer
parameters, making it faster to train, and it can make predictions on
inputs of variable sizes.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dynamically enhanced training data</title>
        <p>In this paper we present a few-shot learning system, and the key element of this process is
maximizing the exploitation of the available data.</p>
        <p>Usually, to capture the global contextual information of an image, the model is trained on whole
images; however, we believe that much of the same information can also be retrieved
from smaller sections of the document pages. For this reason, to improve the efficiency of our
training setup while reducing the number of annotated images, we decided to split each page of the
manuscript into a set of non-overlapping, fixed-size patches that cover the entire input image.
(Figure: (a) CB55 page; (b) CSG18 page; (c) CSG863 page; (d) CB55 detail; (e) CSG18 detail; (f) CSG863 detail)</p>
        <p>All of these patches form the base training set.</p>
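        <p>The patch-division step can be sketched as follows. This is a hypothetical helper, not the authors' code; the function name and the edge-handling policy are our assumptions.</p>

```python
def patch_grid(image_w, image_h, patch_size):
    """Top-left corners of fixed-size patches covering the whole page.

    Interior patches are non-overlapping; if the page size is not a
    multiple of patch_size, the last row/column is shifted back so the
    page is still fully covered (one possible policy; an assumption).
    """
    xs = list(range(0, image_w - patch_size + 1, patch_size))
    ys = list(range(0, image_h - patch_size + 1, patch_size))
    if xs[-1] + patch_size < image_w:
        xs.append(image_w - patch_size)
    if ys[-1] + patch_size < image_h:
        ys.append(image_h - patch_size)
    return [(x, y) for y in ys for x in xs]
```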
        <p>The cardinality of  cannot raise indefinitelity: the size of the single patches must be large
enough to allow capturing contextual information from the corresponding represented area of
the original image. To overcome this limitation, we introduce a dynamic instance generation
process. At each epoch we retrieve a set  of randomly selected crops of the same size as the
patches. We then train the segmentation network using as additional instances these patches,
together with the corresponding segmentation maps. So, the patches selected to cover the entire
image are always the same, while the crops will change at each epoch and will be taken at
random in the entire image. These crops, being randomly selected, can also overlap each other.
An example of this process can be seen in Fig. 2.</p>
        <p>This enables us to further improve the efficiency of our training process while
enhancing the generalization capacity of our model.</p>
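        <p>The per-epoch crop sampling described above can be sketched as follows; the function name and signature are hypothetical, not taken from the authors' code.</p>

```python
import random

def sample_crops(image_w, image_h, crop_size, n_crops, rng=None):
    """Sample top-left corners of n_crops random crops of size crop_size.

    Called once per epoch, so the fixed patch set stays the same while
    these crops change every time; overlaps between crops are allowed.
    """
    rng = rng or random.Random()
    return [
        (rng.randint(0, image_w - crop_size), rng.randint(0, image_h - crop_size))
        for _ in range(n_crops)
    ]

# Example: 10 fresh 256x256 crop positions for a 1000x800 page.
epoch_crops = sample_crops(1000, 800, 256, 10, rng=random.Random(0))
```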
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In this section, we offer a description of the dataset used, highlighting its characteristics. We
also describe the detailed training setup adopted for the experiments. Then, we outline the
metrics used for the evaluation of our framework and provide an ablation study aimed at
supporting the effectiveness of the choices defining the proposed system.</p>
      <p>(Figure: (a) Original page; (b) Ground truth)</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          To train and test our system we selected the DIVA-HisDB dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It is a historical handwritten
document dataset consisting of high-resolution, RGB-color annotated pages coming from
three different medieval manuscripts, identified as CB55, CSG18 and CSG863. These documents
have complex and heterogeneous layouts. Moreover, as an additional challenge, they have
different levels of degradation. A sample page from each manuscript is reported in Fig. 3.
        </p>
        <p>The dataset consists of a total of 150 images where, for each manuscript, 20 are typically
used for training, 10 for validation and another 20 for testing. For the present work, we
relied on only 2 images per manuscript to train the model.</p>
        <p>DIVA-HisDB supplies pixel-level ground truth segmentation (Fig. 4) for the layout of each
image, which distinguishes between 4 classes of elements: background, main text, comments
and decoration.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hyperparameters setup</title>
        <p>For the training of the proposed model, the selected loss function is a weighted cross-entropy
loss. This is due to the fact that DIVA-HisDB, like historical manuscripts in general, is
very imbalanced between classes. Details of the DIVA-HisDB class distribution are provided in Tab. 1.</p>
        <p>Table 1: Class distribution (%) for each manuscript.
Class        CB55    CSG18   CSG863
Background   82.41   85.16   77.82
Comments      8.36    6.78    6.35
Decoration    0.55    1.47    1.83
Text          8.68    6.59   14.00</p>
        <p>Then, the weight for each class is calculated by taking the square root of 1 over the
class frequency in the dataset: w_c = sqrt(1 / f_c) (Eq. 1), where f_c represents the frequency (%) of class c in the
corresponding manuscript.</p>
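        <p>Under that rule, rarer classes receive larger weights. The snippet below is our own illustration, using the Tab. 1 frequencies for CB55.</p>

```python
import math

# Class frequencies (%) for CB55, from Tab. 1.
frequencies = {"background": 82.41, "comments": 8.36, "decoration": 0.55, "text": 8.68}

# Eq. 1: weight = sqrt(1 / frequency). Whether the frequency is expressed
# as a percentage or a fraction only rescales all weights by a constant.
weights = {c: math.sqrt(1.0 / (f / 100.0)) for c, f in frequencies.items()}
```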
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Metrics</title>
        <p>The pixel-level performance of the proposed approach is evaluated by Precision, Recall,
Intersection over Union (IoU) and F1-Score. These metrics were calculated individually for each
manuscript following the definitions reported in Eq. 2–5, where TP, FP and FN stand
respectively for True Positives, False Positives and False Negatives; a weighted average
based on each class frequency was then computed.</p>
        <p>Precision = TP / (TP + FP) (2)</p>
        <p>Recall = TP / (TP + FN) (3)</p>
        <p>IoU = TP / (TP + FP + FN) (4)</p>
        <p>F1-Score = (2 × Precision × Recall) / (Precision + Recall) (5)</p>
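        <p>These definitions translate directly into code. The helper below is a sketch (the function name is ours) computing the four per-class scores from pixel-level counts.</p>

```python
def segmentation_scores(tp, fp, fn):
    """Precision, Recall, IoU and F1 (Eqs. 2-5) from pixel-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, iou, f1
```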
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>We have conducted an ablation study on the different versions of the proposed framework. The
baseline approach consists of using whole images as the backbone network input. Our improved
model is obtained by running the same network at the patch level with dynamic crop generation.
A comparison of the two versions is presented in Tab. 2. We report both the scores for the
individual manuscripts as well as the final averaged ones.</p>
        <p>In general, as we can observe, the split of the image into patches and the dynamic crop generation
determine an improvement in the framework performance, with an average improvement across
all the metrics. Our framework raises all of the selected metrics for all of the manuscripts, in particular
for CB55.
(Tab. 2 caption: Results of the ablation study. Each row shows the performance of the different versions of our system
across all the selected metrics for the manuscripts of the DIVA-HisDB dataset. The last four columns
show the average scores.)
(Fig. 5 caption fragment: comparison with the ground truth provided by DIVA-HisDB. Each column represents a different
instance of the three manuscripts.)</p>
        <p>In Fig. 5 we provide some examples of the results. In particular, we show the comparison
between our results and the provided ground truth. As can be seen, our model provides a
different level of precision (a coarser segmentation map) compared with the fidelity of the
ground truth segmentation map.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we presented a few-shot learning framework for the layout segmentation of
historical handwritten documents. In particular, we introduced a dynamic instance generation
module that allowed us to increase the model performance while maintaining the requirement
of few segmented images for training. This was an important aspect of our research because
manually annotating the ground truth of manuscripts is a heavy task, so reducing the number
of required segmented pages greatly helps humanists.</p>
      <p>For future work, we would like to refine our results in order to obtain pixel-level segmentation, like
the ground truth. This would provide results similar to the current state-of-the-art for historical
document layout segmentation while maintaining few-shot learning, which in our opinion is a
fundamental feature.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Würsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <article-title>ICDAR2017 competition on layout analysis for challenging medieval manuscripts</article-title>
          ,
          <source>in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1361</fpage>
          -
          <lpage>1370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Andrist</surname>
          </string-name>
          ,
          <article-title>Toward a definition of paratexts and paratextuality: the case of ancient greek manuscripts, Bible as Notepad</article-title>
          .
          <source>Tracing Annotations and Annotation Practices</source>
          (
          <year>2018</year>
          )
          <fpage>130</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mehri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nayef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Héroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gomez-Krämer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mullot</surname>
          </string-name>
          ,
          <article-title>Learning texture features for enhancement and segmentation of historical document images</article-title>
          ,
          <source>in: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Multi-task layout analysis for historical handwritten documents using fully convolutional networks</article-title>
          .,
          <source>in: IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1057</fpage>
          -
          <lpage>1063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Seguin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>dhSegment: A generic deep-learning approach for document segmentation</article-title>
          ,
          <source>in: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Binmakhashen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <article-title>Document layout analysis: a comprehensive survey</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 52</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Garz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <article-title>Creating ground truth for historical manuscripts with document graphs and scribbling interaction</article-title>
          ,
          <source>in: 2016 12th IAPR Workshop on Document Analysis Systems (DAS)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Eichenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <article-title>DIVA-HisDB: A precisely annotated large dataset of challenging medieval manuscripts</article-title>
          ,
          <source>in: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>471</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>3431</fpage>
          -
          <lpage>3440</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>