<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic instance generation for few-shot handwritten document layout segmentation (short paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Axel De Nardin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Zottin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Paier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Luca Foresti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuela Colombi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Piciarelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Humanities and Cultural Heritage, University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics</institution>
          ,
          <addr-line>Computer Science and Physics</addr-line>
          ,
          <institution>University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Historical handwritten document analysis is an important activity for retrieving information about our past. Given that this type of process is slow and time-consuming, the humanities community is searching for new techniques that could aid them in this activity. Document layout analysis is a branch of machine learning that aims to extract semantic information from digitised documents. Here we propose a new framework for handwritten document layout analysis that differentiates itself from the current state-of-the-art through few-shot learning, which allows for good results with little manually labelled data, and through a dynamic instance generation process. Our results were obtained using the DIVA-HisDB dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Few-shot learning</kwd>
        <kwd>Handwritten document layout analysis</kwd>
        <kwd>Fully-Convolutional Network</kwd>
        <kwd>Document image segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the humanities community, the study of historical handwritten documents is a crucial
activity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For centuries humanists have focused only on the text, without considering the elements
that accompany it, such as comments and decorations, collectively called paratext. In recent years,
paratext analysis has gained more and more relevance: this data is fundamental to the
cultural-historical understanding of the individual manuscript, but is also of great philological relevance,
because paratexts can be transcribed from one manuscript to another [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Despite this importance, there are currently few studies about paratexts, mainly because
paratext analysis is a difficult and expensive task. Newly developed tools and methods are
therefore sought: automated paratext extraction not only saves large amounts of time, but also enables
rapid comparison of the extracted data, making it possible to establish connections
between seemingly distant manuscripts, connections that escape the human eye and memory.</p>
      <p>In order to achieve this, we must start from the segmentation of a given document image
into semantically meaningful regions (e.g. main text, comments, decorations and background),
which is the main focus of this paper.</p>
      <p>
        Page segmentation is a well-known open problem in the machine learning community. Due
to the non-uniformity and heterogeneity of the images, many of the approaches adopted to solve
this problem rely on a fully supervised learning paradigm [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. An exhaustive survey on
document layout analysis can be found in Binmakhashen and Mahmoud [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In contrast, our paradigm aims at few-shot learning, given that in real-world
applications the available ground truth is usually limited in size [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To overcome this limit, we introduce a
dynamic instance generation process that allows us to augment the limited data available in this
scenario. By combining this with a fully-convolutional network and image patching, we are able to achieve
better performance. We used DIVA-HisDB to test the proposed framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes the components defining
the proposed framework. Section 3 reports the details of our experimental setup. Finally,
Section 4 draws the conclusions and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <p>In this section, we present the details of the proposed framework with a brief description of its
key components. First, we introduce a Fully-Convolutional Network model with a ResNet-50
backbone, which is used for the segmentation task in our framework. We then present our
training process, characterized by dynamic instance generation. Fig. 1 presents a
visual representation of our framework.</p>
      <p>(Figure: (a) Patch division; (b) Randomly selected crops)</p>
      <sec id="sec-2-1">
        <title>2.1. Backbone network</title>
        <p>
          The main component of our framework is a Fully-Convolutional Network (FCN) model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a
ResNet-based deep model that combines layers of the feature hierarchy and refines the spatial
precision of the output. This architecture is widely adopted in the context of image semantic
segmentation.
        </p>
        <p>The network is composed of a downsampling and an upsampling path. The former is used to
extract and interpret the context, while the latter enables localization. The network is able
to combine coarse, high-layer information with fine, low-layer information. Furthermore, the
multilayer outputs are followed by deconvolutional layers for bilinear upsampling to pixel-dense
outputs. This allows for pixel-level identification of class labels and the prediction of segmentation masks.</p>
        <p>FCN-based methods learn a mapping from pixels to pixels without extracting region
proposals. This architecture employs solely locally connected layers (such as convolution,
pooling and upsampling) and avoids the use of dense layers. This means the network requires fewer
parameters, making it faster to train, and it can make predictions on
inputs of variable sizes.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dynamically enhanced training data</title>
        <p>In this paper we present a few-shot learning system, and the key element of this process is
maximizing the exploitation of the available data.</p>
        <p>Usually, to capture the global contextual information of an image, the model is trained on whole
images; however, we believe that much of the same information can also be retrieved
from smaller sections of the document pages. For this reason, to improve the efficiency of our
training setup while reducing the number of annotated images, we decided to split each page of the
manuscript into a set of non-overlapping, fixed-size patches that cover the entire input image.
(Figure: (a) CB55 page; (b) CSG18 page; (c) CSG863 page; (d) CB55 detail; (e) CSG18 detail; (f) CSG863 detail)</p>
        <p>All of these patches form the base training set.</p>
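        <p>The patch-division step can be sketched as follows. This is a hypothetical helper, not the authors' code; the function name and the edge-handling policy are our assumptions.</p>

```python
def patch_grid(image_w, image_h, patch_size):
    """Top-left corners of fixed-size patches covering the whole page.

    Interior patches are non-overlapping; if the page size is not a
    multiple of patch_size, the last row/column is shifted back so the
    page is still fully covered (one possible policy; an assumption).
    """
    xs = list(range(0, image_w - patch_size + 1, patch_size))
    ys = list(range(0, image_h - patch_size + 1, patch_size))
    if xs[-1] + patch_size < image_w:
        xs.append(image_w - patch_size)
    if ys[-1] + patch_size < image_h:
        ys.append(image_h - patch_size)
    return [(x, y) for y in ys for x in xs]
```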
        <p>The cardinality of  cannot raise indefinitelity: the size of the single patches must be large
enough to allow capturing contextual information from the corresponding represented area of
the original image. To overcome this limitation, we introduce a dynamic instance generation
process. At each epoch we retrieve a set  of randomly selected crops of the same size as the
patches. We then train the segmentation network using as additional instances these patches,
together with the corresponding segmentation maps. So, the patches selected to cover the entire
image are always the same, while the crops will change at each epoch and will be taken at
random in the entire image. These crops, being randomly selected, can also overlap each other.
An example of this process can be seen in Fig. 2.</p>
        <p>This enables us to further improve the efficiency of our training process while
enhancing the generalization capacity of our model.</p>
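        <p>The per-epoch crop sampling described above can be sketched as follows; the function name and signature are hypothetical, not taken from the authors' code.</p>

```python
import random

def sample_crops(image_w, image_h, crop_size, n_crops, rng=None):
    """Sample top-left corners of n_crops random crops of size crop_size.

    Called once per epoch, so the fixed patch set stays the same while
    these crops change every time; overlaps between crops are allowed.
    """
    rng = rng or random.Random()
    return [
        (rng.randint(0, image_w - crop_size), rng.randint(0, image_h - crop_size))
        for _ in range(n_crops)
    ]

# Example: 10 fresh 256x256 crop positions for a 1000x800 page.
epoch_crops = sample_crops(1000, 800, 256, 10, rng=random.Random(0))
```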
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In this section, we offer a description of the dataset used, highlighting its characteristics. We
also describe the detailed training setup adopted for the experiments. Then, we outline the
metrics used for the evaluation of our framework and provide an ablation study aimed at
supporting the effectiveness of the choices defining the proposed system.</p>
      <p>(Figure: (a) Original page; (b) Ground truth)</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          To train and test our system we selected the DIVA-HisDB dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It is a historical handwritten
document dataset consisting of high-resolution, RGB-color annotated pages coming from
three different medieval manuscripts, identified as CB55, CSG18 and CSG863. These documents
have complex and heterogeneous layouts. Moreover, as an additional challenge, they have
different levels of degradation. A sample page from each manuscript is reported in Fig. 3.
        </p>
        <p>The dataset consists of a total of 150 images where, for each manuscript, 20 are typically
used for training, 10 for validation and another 20 for testing. For the present work, we
relied on only 2 images per manuscript to train the model.</p>
        <p>DIVA-HisDB supplies pixel-level ground truth segmentation (Fig. 4) for the layout of each
image, which distinguishes between 4 classes of elements: background, main text, comments
and decoration.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hyperparameters setup</title>
        <p>For the training of the proposed model, the selected loss function is a weighted cross-entropy
loss. This is due to the fact that DIVA-HisDB, like historical manuscripts in general, is
very imbalanced between classes. Details of the DIVA-HisDB class distribution are provided in Tab. 1.</p>
        <p>Table 1: Class distribution (%) for each manuscript.
Class        CB55    CSG18   CSG863
Background   82.41   85.16   77.82
Comments      8.36    6.78    6.35
Decoration    0.55    1.47    1.83
Text          8.68    6.59   14.00</p>
        <p>Then, the weight for each class is calculated by taking the square root of 1 over the
class frequency in the dataset: w_c = sqrt(1 / f_c) (Eq. 1), where f_c represents the frequency (%) of class c in the
corresponding manuscript.</p>
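        <p>Under that rule, rarer classes receive larger weights. The snippet below is our own illustration, using the Tab. 1 frequencies for CB55.</p>

```python
import math

# Class frequencies (%) for CB55, from Tab. 1.
frequencies = {"background": 82.41, "comments": 8.36, "decoration": 0.55, "text": 8.68}

# Eq. 1: weight = sqrt(1 / frequency). Whether the frequency is expressed
# as a percentage or a fraction only rescales all weights by a constant.
weights = {c: math.sqrt(1.0 / (f / 100.0)) for c, f in frequencies.items()}
```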
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Metrics</title>
        <p>The pixel-level performance of the proposed approach is evaluated by Precision, Recall,
Intersection over Union (IoU) and F1-Score. These metrics were calculated individually for each
manuscript following the definitions reported in Eq. 2–5, where TP, FP and FN stand
respectively for True Positives, False Positives and False Negatives; a weighted average
based on each class frequency was then computed.</p>
        <p>Precision = TP / (TP + FP) (2)</p>
        <p>Recall = TP / (TP + FN) (3)</p>
        <p>IoU = TP / (TP + FP + FN) (4)</p>
        <p>F1-Score = (2 × Precision × Recall) / (Precision + Recall) (5)</p>
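        <p>These definitions translate directly into code. The helper below is a sketch (the function name is ours) computing the four per-class scores from pixel-level counts.</p>

```python
def segmentation_scores(tp, fp, fn):
    """Precision, Recall, IoU and F1 (Eqs. 2-5) from pixel-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, iou, f1
```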
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>We have conducted an ablation study on the different versions of the proposed framework. The
baseline approach consists of using whole images as the backbone network input. Our improved
model is obtained by running the same network at the patch level with dynamic crop generation.
A comparison of the two versions is presented in Tab. 2. We report both the scores for the
individual manuscripts as well as the final averaged ones.</p>
        <p>In general, as we can observe, the split of the image into patches and the dynamic crop generation
determine an improvement in the framework performance, with an average improvement across
all the metrics. Our framework raises all of the selected metrics for all of the manuscripts, in particular
for CB55.
(Tab. 2 caption: Results of the ablation study. Each row shows the performance of the different versions of our system
across all the selected metrics for the manuscripts of the DIVA-HisDB dataset. The last four columns
show the average scores.)
(Fig. 5 caption fragment: comparison with the ground truth provided by DIVA-HisDB. Each column represents a different
instance of the three manuscripts.)</p>
        <p>In Fig. 5 we provide some examples of the results. In particular, we show the comparison
between our results and the provided ground truth. As can be seen, our model provides a
different level of precision (a coarser segmentation map) compared with the fidelity of the
ground truth segmentation map.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we presented a few-shot learning framework for the layout segmentation of
historical handwritten documents. In particular, we introduced a dynamic instance generation
module that allowed us to increase the model performance while maintaining the requirement
of few segmented images for training. This was an important aspect of our research because
manually annotating the ground truth of manuscripts is a heavy task, so reducing the number
of required segmented pages greatly helps humanists.</p>
      <p>For future work, we would like to refine our results in order to obtain pixel-level segmentation, like
the ground truth. This would provide results similar to the current state-of-the-art for historical
document layout segmentation while maintaining few-shot learning, which in our opinion is a
fundamental feature.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Würsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <article-title>ICDAR2017 competition on layout analysis for challenging medieval manuscripts</article-title>
          ,
          <source>in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1361</fpage>
          -
          <lpage>1370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Andrist</surname>
          </string-name>
          ,
          <article-title>Toward a definition of paratexts and paratextuality: the case of ancient greek manuscripts, Bible as Notepad</article-title>
          .
          <source>Tracing Annotations and Annotation Practices</source>
          (
          <year>2018</year>
          )
          <fpage>130</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mehri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nayef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Héroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gomez-Krämer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mullot</surname>
          </string-name>
          ,
          <article-title>Learning texture features for enhancement and segmentation of historical document images</article-title>
          ,
          <source>in: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Multi-task layout analysis for historical handwritten documents using fully convolutional networks</article-title>
          .,
          <source>in: IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1057</fpage>
          -
          <lpage>1063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Seguin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>dhSegment: A generic deep-learning approach for document segmentation</article-title>
          ,
          <source>in: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Binmakhashen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <article-title>Document layout analysis: a comprehensive survey</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 52</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Garz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <article-title>Creating ground truth for historical manuscripts with document graphs and scribbling interaction</article-title>
          ,
          <source>in: 2016 12th IAPR Workshop on Document Analysis Systems (DAS)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Simistira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Eichenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ingold</surname>
          </string-name>
          ,
          <article-title>DIVA-HisDB: A precisely annotated large dataset of challenging medieval manuscripts</article-title>
          ,
          <source>in: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>471</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>3431</fpage>
          -
          <lpage>3440</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>