<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Location of Simple Graphemes in Mediaeval Manuscripts based on Mask R-CNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Marinai</string-name>
          <email>simone.marinai@unifi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pomaro</string-name>
          <email>gabriella.pomaro@sismelfirenze.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Raffaelli</string-name>
          <email>claudia.raffaelli@stud.unifi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Scandiffio</string-name>
          <email>francesco.scandiffio@stud.unifi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Società Internazionale per lo Studio del Medioevo Latino</institution>
          ,
          <addr-line>Via Montebello 7, 50123 Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Florence, Department of Information Engineering</institution>
          ,
          <addr-line>Via di Santa Marta 3, 50139 Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe a system for the location of simple graphemes in mediaeval manuscripts based on the Mask R-CNN convolutional neural network. This is the first step towards the ambitious goal of providing palaeographers with a powerful tool with which to speed up and refine the delicate process of dating and determining the origin of manuscripts. In order to train the network, a new dataset composed of 49 pages of Latin Middle Ages manuscripts has been built. Experimental results demonstrate that using the Mask R-CNN network, along with a proper configuration of parameters, leads to good overall outcomes of classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Character recognition</kwd>
        <kwd>Mask R-CNN</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Grapheme classification and location</kwd>
        <kwd>Mediaeval Manuscripts</kwd>
        <kwd>Paleography</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The analysis of manuscripts, in particular their dating and localization,
represents the principal way to reconstruct our history before the invention of movable
type printing. Unfortunately, for the large majority of manuscripts there is no
reliable information about their origin and provenance. As a matter of fact, only
from the late 14th century onwards do we have significant quantities of items with associated
dating information in libraries and archives. For this reason, palaeographers use
a variety of methods to determine the age of a manuscript, but they can
usually only provide an approximate period for its origin. Among the
methodologies used by palaeographers we can list the study of the material, the
ink used, and of course the analysis of the language. In the case of mediaeval
manuscripts, researchers have an additional element from which to draw
important information to carry out their dating task: the analysis of the shape of
graphemes and the features that make them up. Indeed, different ways of
writing the same grapheme spread among the amanuenses, in a way similar to
what happens to us today with cursive and capital letters. Since these graphic
signs changed several times and spread slowly over the centuries, they are
a trace of how writing has changed over time. By closely observing the changes
in lettering, palaeographers can provide a basic time frame for when the
document was written. However, some writing styles lasted for so long or were
so widespread that they cannot provide any useful information for dating.
For these reasons scholars are interested in examining, for each manuscript, the
copyist’s graphic choices and also the presence or absence of significant graphic
variants. This process is very long and prone to human errors, such as misreading
a sign or missing an occurrence of the searched sign. When manuscripts consist
of a large number of pages, i.e. hundreds or thousands of pages, it is very difficult
to inspect them completely and carefully because it would be an extremely
time-consuming task. Usually, only a small number of sample pages is analysed in
order to extract the required information. This kind of approach can easily lead
to incorrect dating results, because copying a manuscript took years,
during which even the same amanuensis could change his writing style
several times.</p>
      <p>Research supported by Fondazione Cassa di Risparmio di Firenze.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). This volume is published
and copyrighted by its editors. IRCDL 2021, February 18-19, 2021, Padua, Italy.</p>
      <p>For these reasons, a system capable of extracting information about the
presence of certain palaeographic letter variants within a collection of documents can
be of particular use to palaeographers. In this paper we describe the first version
of a grapheme-detection system based on the Mask R-CNN deep neural network
in order to identify and count the occurrences of a specific subset of graphemes in
manuscripts from the Latin Middle Ages. This work is part of the project
"Mediaeval manuscripts of Tuscany (XIII-XIV centuries): design and development
of software for dating and determination of origin" financed by SISMEL
(Società Internazionale per lo Studio del Medioevo Latino), DINFO (Dipartimento
di Ingegneria dell’Informazione of the University of Florence), and Fondazione
Cassa di Risparmio Firenze.</p>
      <p>The remainder of the paper is structured as follows. In Section 2 we present
related work about character recognition and the Mask R-CNN framework. In
Section 3 we describe the building process of the dataset employed in this project.
In Section 4 we present the approaches used to train the network. Finally, Section
5 contains the obtained results and in Section 6 we draw the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In this section we discuss related work concerning the detection of characters
in manuscript images and we provide a summary of the main object detection
approach considered in our work.</p>
      <sec id="sec-2-1">
        <title>Character detection in manuscripts</title>
        <p>
          He and Schomaker [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] claim that, since automatic document reading does not always
make it possible to fully understand documents, additional techniques that go beyond the
mere use of OCR systems are needed. In particular, with respect to manuscripts,
locating the geographic origin or identifying the writer may also be relevant tasks.
For this reason the authors developed a set of features
that maps the pixels composing the characters into a high-dimensional
space, thus capturing specific information about the characters. The
proposed features can be used separately or jointly and are based on the principle
of Joint Feature Distribution (JFD). The goal is to answer four questions in the
field of palaeography about who produced a certain document, which document,
when and where. The proposed features are divided into textural-based
features and grapheme-based features. The former consider the manuscripts as
textual images, extracting statistical information from the text blocks of the
entire image. They capture the curvature and skew characteristic of different
writing styles and typically do not require line or character segmentation. The
grapheme-based features, instead, capture the statistical distribution
of the single characters already segmented from the documents. These are
based on the principle of the JFD to concatenate spatial information to obtain
a larger structure that is faithful to the traced sign. Among the features that
best allow a document to be dated, the authors highlight CoHinge and
QuadHinge [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which fall into the category of textural-based features, as well
as the Junction feature (Junclets) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] with regard to grapheme-based features.
Wick et al. in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] propose a method for the automatic transcription of lyrics in
mediaeval music manuscripts. The work is based on the open-source OCR
engine Calamari [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The predictions are made on previously segmented lines of the
original manuscript page using an available pre-trained model or with a custom
model. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], Wahlberg presents a method for line segmentation, along with a
set of features that can be used for text recognition, writer identification and
estimation of production dates.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Mask R-CNN</title>
        <p>
          Artificial Neural Networks, and in particular deep learning architectures, have
been widely used to process historical documents [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Among other approaches,
Mask R-CNN is a state-of-the-art model for instance segmentation and object
detection, developed by the Facebook AI Research group [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Mask R-CNN extends
Faster R-CNN (already used to locate words in early printed documents [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ])
making use of an extra mask head, and is composed of a standard
convolutional network for image classification and, on top of that, an additional
fully convolutional network for semantic segmentation at pixel level on the
proposed regions. The network operates in two main stages. The first one
is responsible for generating, on the input image, a set of proposals, i.e. regions
where there might be an object. The second one classifies the proposals generated in
the previous stage and produces the corresponding bounding boxes and masks. At
the end of the process, the result is a bounding box that encloses the recognised
object and a pixel mask placed on it. The detection branch, that is the branch
for classification and bounding box prediction, runs in parallel with the branch used for
predicting segmentation masks, allowing a decoupling of the two tasks. For more
details refer to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>[Figure 1: (a) S Dritta. (b) S Tonda. (c) S Documentaria. (d) D Dritta. (e) D Tonda. (f) K. (g) Z. (h) Z3. (i) C Cedigliata. (j) Et in legatura. (k) Et tachigrafica. (l) T assibilata.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Building the dataset</title>
      <p>It is important to remark that a program aimed at identifying simple graphemes
should not be designed taking into account one specific period for the manuscript
production or one particular region. Rather, it should focus on the peculiarities of
the graphic material to be examined, in which the simple grapheme has a specific
meaning. Even if the approach we propose is not designed to deal with cursive
writings and would not work for historical periods in which the syntagm is more
important than the paradigm, there are whole centuries whose manuscripts can
be suitably analysed and on which we focus our research: they are manuscripts
between the XIth and the XIVth centuries and also documents from the XVth
century.</p>
      <p>The manuscripts used in this study were carefully selected from the large
Codex archive that has been built by SISMEL over the last twenty years by
cataloguing mediaeval manuscripts from Tuscany in the Codex Project
(https://www.sismelfirenze.it/index.php/biblioteca-digitale/codex). In
particular, the data used are based on the ample collection of scanned works accurately
linked to the codicological descriptions. Taking into account the period of time
previously mentioned, we believe that it is important to identify the following
graphemes, and to compute their distribution in the manuscripts: three variants
of the letter s and two of the letter d; ligature et; tachygraphic et; k; three
variants of z (including the ç). Given the peculiarities of the dataset, we are also
considering including the graphic sign ti assibilata: this is a ligature of t and i
that is nevertheless an isolated symbol. Currently, the data collection process is still
in progress, so this last grapheme will be introduced in a later version of the
dataset. The set of graphemes of interest is shown in Figure 1.</p>
      <p>In order to locate and recognize the graphemes of the subset above, it was
necessary to build an appropriate dataset with Latin Middle Ages characters
from the period XI-XIV ineunte. The manuscripts that make up the dataset are listed
in the Acknowledgements section and are available upon request by sending an email
to the authors of this paper. The examples were gathered from manuscripts
in accessible digital databases, giving preference to those originating from
Tuscany. To achieve good training of the neural network, only documents without
excessive signs of wear such as burns, rubs and tears were selected. However,
it should be noted that online libraries of ancient documents usually expose
only low-quality images to the public. The difficulty of collecting good quality
pages suitable for our purpose inevitably influenced the quantity of images that
make up the dataset. The dataset consists of 49 pages, of which only 11 are
completely labelled. Other documents have already been identified but have not
been validated; we plan to include a much larger number of pages and examples
in a future release of the dataset.</p>
      <p>
        The occurrences of the graphemes of interest have been manually annotated
through the Interactive LabelMe program [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] (Figure 2), which outputs a
JSON file containing polygonal segmentations of the graphemes. The JSON files
have been converted into the COCO (Common Objects in Context) format for
ease of use with Mask R-CNN. Since characters are usually drawn very close to
each other, some annotations contain not only the grapheme of interest but also
small portions of adjacent signs. This implies that some of the segmentations
contain a small amount of noise, which was nevertheless deemed acceptable.
      </p>
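      <p>The conversion step described above can be sketched as follows. This is a minimal illustration rather than the authors' script: it assumes the usual LabelMe layout (a "shapes" list whose entries carry "label" and "points", plus "imagePath", "imageWidth" and "imageHeight") and the COCO annotation fields "segmentation", "bbox", "area" and "iscrowd". The function names, the file name and the category mapping are hypothetical.</p>
      <preformat><![CDATA[
```python
def polygon_area(points):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def labelme_to_coco(labelme_page, image_id, category_ids, first_annotation_id=1):
    """Convert one LabelMe page (already parsed from JSON) into COCO records."""
    image = {
        "id": image_id,
        "file_name": labelme_page["imagePath"],
        "width": labelme_page["imageWidth"],
        "height": labelme_page["imageHeight"],
    }
    annotations = []
    for offset, shape in enumerate(labelme_page["shapes"]):
        points = [tuple(p) for p in shape["points"]]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        annotations.append({
            "id": first_annotation_id + offset,
            "image_id": image_id,
            "category_id": category_ids[shape["label"]],
            # COCO stores each polygon as a flat [x1, y1, x2, y2, ...] list.
            "segmentation": [[c for p in points for c in p]],
            # COCO boxes are [x, y, width, height].
            "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
            "area": polygon_area(points),
            "iscrowd": 0,
        })
    return image, annotations


# Hypothetical page with a single rectangular annotation.
page = {
    "imagePath": "page_001.jpg", "imageWidth": 100, "imageHeight": 80,
    "shapes": [{"label": "s_dritta",
                "points": [[10, 10], [20, 10], [20, 30], [10, 30]]}],
}
image, annotations = labelme_to_coco(page, image_id=1,
                                     category_ids={"s_dritta": 1})
```
]]></preformat>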
      <p>Some of the images in the dataset have a very high resolution. This has a
positive effect on learning but is also a challenging element for the amount of
GPU memory required for training. After investigating various cutting methods,
we decided to cut each image into four blocks of the same size. Since manuscripts
do not have a predetermined page structure (in some documents the text is a
continuum without separation into columns while in others it surrounds figures
that can be placed anywhere in the page) all images are processed in the same
way. Annotations along the cut lines are discarded.</p>
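      <p>A possible reading of this cutting step is sketched below. The text only states that each page is cut into four blocks of the same size, so the 2 x 2 layout is an assumption, as are the function name and the use of COCO-style [x, y, width, height] boxes in place of full polygons.</p>
      <preformat><![CDATA[
```python
def split_into_quadrants(width, height, boxes):
    """Assign [x, y, w, h] boxes to the four quadrants of a page.

    Returns a dict mapping (row, col) to the boxes that lie entirely inside
    that quadrant, shifted into the quadrant's own coordinate system.
    Boxes that cross a cut line are dropped.
    """
    cut_x, cut_y = width / 2, height / 2
    tiles = {(row, col): [] for row in (0, 1) for col in (0, 1)}
    for x, y, w, h in boxes:
        crosses_vertical = x < cut_x < x + w
        crosses_horizontal = y < cut_y < y + h
        if crosses_vertical or crosses_horizontal:
            continue  # annotation lies along a cut line: discard it
        col = 0 if x + w <= cut_x else 1
        row = 0 if y + h <= cut_y else 1
        tiles[(row, col)].append([x - col * cut_x, y - row * cut_y, w, h])
    return tiles


# Hypothetical 100 x 80 page: the second box straddles the vertical cut.
tiles = split_into_quadrants(100, 80, [[10, 10, 20, 20],
                                       [45, 10, 10, 10],
                                       [60, 50, 10, 10]])
```
]]></preformat>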
      <p>Since the dataset is to be used for a very specific task, we decided to rely
only on the content of carefully selected manuscripts, avoiding the use of data
augmentation techniques. As a consequence of this choice, considering also that
the Latin language has some graphemes much more frequent than others, the
number of annotated characters is strongly unbalanced in favour of some common
classes, while annotations are almost non-existent for other rare, but still important,
palaeographic letter variants (see Table 1). For instance, S dritta and D tonda
are the most frequent classes, making up 70% of the dataset. Such an unbalanced
set of data can create some learning issues. It is indeed highly probable that,
with this configuration, the network will learn well the most common classes,
and not so well the rarest ones. Finally, the dataset has been divided into train,
validation and test sets. Since a complete and reliable ground truth is critical to
a proper performance evaluation, the 11 fully annotated pages have been divided
between validation (5) and test (6). The remaining 38 documents were used for
training.</p>
    </sec>
    <sec id="sec-4">
      <title>Model Identification</title>
      <p>In this section we present the experiments conducted to adjust the hyper-parameters
to be used in the network training, discussing the effects produced by their
variations and explaining how they led us towards the final model.</p>
      <p>
        As previously discussed, this work is based on Mask R-CNN, which is a
state-of-the-art convolutional network used for object detection and image
segmentation. In particular, we selected the Detectron2 implementation developed
by the Facebook AI Research group [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To configure the network we used the
fine-tuning paradigm which consists of initialising the weights with a pre-trained
model. This approach is useful when training the network from scratch is made
difficult by the limited amount of data available, as pointed out by a variety of
scientific publications [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref16 ref17 ref18">13–18</xref>
        ]. The chosen model uses a ResNet-50 backbone with FPN and was
pre-trained for 37 epochs on the COCO dataset. After selecting the model, it was
necessary to refine the training parameters, paying particular attention to the
learning rate, the number of iterations and the batch size.
      </p>
      <p>In order to identify the optimal learning rate we carried out 69
independent training runs of 500 iterations each, assigning to the i-th run the
learning rate lr<sub>i</sub> = 0.0001 · i. The aim of the experiment was to compare the
loss computed on the training set at the end of the 500 iterations, selecting the lr with
the largest reduction towards the minimum value of the loss. We observed that from
run i = 40 onwards the loss increases, diverging at run 69. With low values
of lr (magnitude of 1e-3) we saw a good reduction, but the loss was still high.
We therefore decided to keep a learning rate of 3.5e-3, which combines an
overall reduction with the global minimum of the loss.</p>
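      <p>The sweep can be illustrated with a short sketch. The candidate grid follows the text (69 runs, step 0.0001), whereas the selection rule is a simplification that keeps the run with the lowest finite final loss, and the loss values are synthetic placeholders, not measurements from the experiment.</p>
      <preformat><![CDATA[
```python
import math


def candidate_learning_rates(n_runs=69, step=0.0001):
    """Learning rates lr_i = step * i for i = 1 .. n_runs."""
    return [round(step * i, 4) for i in range(1, n_runs + 1)]


def pick_learning_rate(final_losses):
    """Return the learning rate whose run ended with the lowest finite loss.

    `final_losses` maps each learning rate to the training loss observed at
    the end of its run; diverged runs (inf or nan) are ignored.
    """
    finite = {lr: loss for lr, loss in final_losses.items()
              if math.isfinite(loss)}
    return min(finite, key=finite.get)


rates = candidate_learning_rates()

# Synthetic, made-up losses with a minimum at 0.0035 and a diverged last run.
synthetic_losses = {lr: 0.5 + 1e4 * (lr - 0.0035) ** 2 for lr in rates}
synthetic_losses[rates[-1]] = float("inf")

best = pick_learning_rate(synthetic_losses)
```
]]></preformat>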
      <p>The learning rate schedule and the total number of iterations were
identified through an experimental trial-and-error approach. All other configuration
parameters being equal, we evaluated different scheduling policies and maximum
numbers of iterations by comparing the Precision, Recall and F1 measures calculated
on the validation set. Regarding the scheduling policy, the best results were
achieved by training the network with a fixed learning rate of 3.5e-3 for a
total of 1150 iterations, obtaining the following values: Precision = 0.869, Recall
= 0.551, F1 = 0.675. Similar but slightly worse results were obtained by training
with lr fixed at 3.5e-3 for the first 800 iterations, then proceeding with a lr of
5e-4 for a further 400 iterations.</p>
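      <p>As a reminder of how the three measures relate, the sketch below uses the standard definitions, with F1 as the harmonic mean of precision and recall. The function names and the example counts are illustrative. Applied to the rounded precision and recall reported above, the formula gives roughly 0.674, which agrees with the reported F1 up to rounding of the inputs.</p>
      <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute the three measures from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# F1 recomputed from the rounded values reported for the selected model.
f1_selected = f1_score(0.869, 0.551)

# Hypothetical counts: 8 correct detections, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```
]]></preformat>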
      <p>Concerning the number of iterations with fixed learning rate, shorter training
yields a precision within 0.03 of that of the selected model.
Conversely, training for more than 1150 iterations increases precision
by 5 percentage points, but at the same time negatively affects recall, bringing
it down to below 0.20. This trend is confirmed by the comparison of inference
boxes (Figure 3). An excessive number of iterations increases the confidence and
reduces false positives but at the same time makes the model no longer able to
detect graphemes that were previously retrieved. This behaviour is attributable
to the fact that graphemes of the same class are written in slightly different ways
across the dataset, depending on the style of the writer. For instance, some graphic
signs can be traced in a more slanted way, can be larger than others and more
generally present very personal characteristics related to the hand of the writer.
In addition, some graphemes are more likely to be
drawn close to each other, making it even more difficult to distinguish them.
Keeping all of this in mind, it is not surprising that an excessively high number
of iterations leads the network to overfit on the style of the graphemes used in
the training set, making it difficult to locate the others.</p>
      <p>[Figure 3: (a) Test with iteration parameter set to 1150. (b) Test with iteration parameter set to 1250.]</p>
      <p>The last hyperparameter to be tuned is the batch size, i.e. the number of
training samples that are analysed by the network before performing a weight
update. The developers of Mask R-CNN adopted an "image-centric training"
with the consequence that the batch size corresponds to the number of images
analysed by the GPUs for each weight update. We chose a batch size of 16,
thus processing 4 documents at a time, since each document is divided into four
images.</p>
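      <p>The hyper-parameters selected in this section could be expressed in a Detectron2 configuration roughly as follows. This is a sketch under stated assumptions, not the authors' training script: the model zoo entry is our guess at the ResNet-50 FPN model described above, the number of classes is left as an argument because it depends on the dataset version, and the scheduler, warm-up and dataset registration are left at their defaults. Running build_cfg requires Detectron2 to be installed.</p>
      <preformat><![CDATA[
```python
# Assumed model zoo entry (Mask R-CNN, ResNet-50 FPN, 3x schedule).
ZOO_CONFIG = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"

BASE_LR = 0.0035      # fixed learning rate selected in this section
MAX_ITER = 1150       # number of training iterations of the selected model
IMS_PER_BATCH = 16    # images (page quarters) per weight update
TILES_PER_PAGE = 4    # each page is cut into four images

# Derived quantities: pages per batch and images seen during training.
PAGES_PER_BATCH = IMS_PER_BATCH // TILES_PER_PAGE
IMAGES_SEEN = MAX_ITER * IMS_PER_BATCH


def build_cfg(num_classes):
    """Build a Detectron2 config initialised from COCO pre-trained weights."""
    # Imported lazily so the constants above can be used without Detectron2.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(ZOO_CONFIG))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(ZOO_CONFIG)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    cfg.SOLVER.BASE_LR = BASE_LR
    cfg.SOLVER.MAX_ITER = MAX_ITER
    cfg.SOLVER.IMS_PER_BATCH = IMS_PER_BATCH
    return cfg
```
]]></preformat>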
    </sec>
    <sec id="sec-5">
      <title>Results and Analysis</title>
      <p>
        The fine-tuning paradigm discussed in the previous section had been actually
used even at an earlier stage of this work, when the available dataset was
approximately only 25% of the current size. Considering the dataset expansion work
carried out over the last few months, we questioned the usefulness of initialising
the network with a pre-trained model, thus investigating alternative methods of
initialisation. For this reason, inspired by [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ], we carried out experiments
on training from scratch. Furthermore, in this section we analyse how a highly
unbalanced dataset can influence network metrics to appear better than they
actually are.
      </p>
      <p>Table 2 summarises the training examples of the two reference datasets
grouped by classes. It is easy to observe that due to the intrinsic rarity of some
graphemes, the example instances are not properly balanced. Moreover, more
than half of the classes in the initial dataset have fewer than 35 annotations, an
insignificant number compared to the great variety of sign executions that can
be found even within a single manuscript.</p>
      <sec id="sec-5-1">
        <title>Models comparison</title>
        <p>Throughout the analysis of the results we decided to prefer a higher recall even
at the expense of precision, provided that the value of the latter was at least
80%. The reason for this choice is related to the final objective of the research
project. If the automatic grapheme identification and localisation system had a
high recall and low precision it would provide many wrong results, leading the
user to distrust the system and to manually check a large amount of
retrieved data. This would surely discourage the use of the software and therefore
must be avoided. Even the opposite situation - high precision, low recall - would
be counterproductive because it would provide data that is unrepresentative and
not able to satisfy the research, leading the palaeographer to search manually
for important but undetected graphemes. However, we would like to point out
that usually the analysis of the writing is manually carried out and that due to
the complexity and heaviness of the task it is usually done only on a very limited
selection of pages. For this reason obtaining even only a third of the instances
of a manuscript would be a considerable improvement.</p>
        <p>Considering the two datasets and the two techniques for initialising the
weights, we can identify the following scenarios to which we associate codes
for the sake of brevity: training from scratch on the initial dataset (0L); training
from scratch on the updated dataset (0H); pre-trained weights and small dataset
(1L); pre-trained weights and updated dataset (1H). By applying the approach
discussed in Section 4 we obtained the following 4 models:
– 0L is trained for 1400 iterations with fixed learning rate at 0.0035
– 1L is trained for 1400 iterations with fixed learning rate at 0.0035
– 0H is trained for 800 iterations with fixed learning rate at 0.0035
– 1H is trained for 1150 iterations with fixed learning rate at 0.0035</p>
        <p>In the first analysis we make pairwise comparisons between the models
grouping by weight initialisation method and structure of the training set (Table
3). Comparing 0H and 1H it is evident that the network initialized with
pre-trained weights has better recall and precision values than its counterpart
trained from scratch. This is justified by the fact that although the dataset is better supplied
with examples, these are not sufficient to allow a good training of the network
without the support of a basic model.</p>
        <p>Comparing the models 0L and 1L on the validation set it emerges that, given
an equal length of training, initialising with pre-trained weights brings a benefit
in terms of recall (+0.054) at the expense of precision, which instead decreases
by 0.042 points. From the analysis of the evaluation metrics, as the number
of iterations varies 0L shows a trend that grows smoothly on both precision
and recall, going into overfitting after iteration 1400. The 1L metrics, on the
other hand, are more abrupt, oscillating several times before reaching the values
previously reported. These behaviours are in line with what was discussed in
Section 4. The trend of the validation set is confirmed by the results obtained
on the test set.</p>
        <p>When comparing the results on the test of all the networks, 1L and 0L models
obtain the best scores, ranking first and second respectively for F1 measure. From
these values it may seem that the models trained on the initial dataset are better
than those trained on the updated version. This would mean that the update of the
dataset made things worse, an unlikely behaviour if we compare the composition
of the two datasets. Although the problem of imbalance is still present, it appears in a
slightly reduced form: more than half of the classes exceed the 100-annotation
threshold, becoming more significant in the training phase. Since this can only
be a positive fact, a deeper analysis is necessary.</p>
        <p>First of all, we note that in this context it is much more useful and accurate to
assess precision and recall separately for each class, as in Table 4. This method of
analysis is necessary to take into account the different probabilities of occurrence
of Latin graphemes, an element that inevitably affects the number of examples in
the dataset. However, every grapheme in the set of interest has relevance and
this is why it would be unacceptable to produce good results on only a subset
of the selected classes.</p>
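        <p>The effect of class imbalance on aggregate scores can be illustrated with a small sketch. The counts below are invented for illustration and do not come from Table 3 or Table 4; they only show how metrics pooled over all classes can look good while a rare class is never detected.</p>
        <preformat><![CDATA[
```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def per_class_metrics(counts):
    """Map each class name to its own (precision, recall) pair."""
    return {name: (precision(tp, fp), recall(tp, fn))
            for name, (tp, fp, fn) in counts.items()}


def pooled_metrics(counts):
    """Precision and recall computed after summing counts over all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision(tp, fp), recall(tp, fn)


# Hypothetical (true positives, false positives, false negatives) per class.
counts = {
    "S dritta": (90, 10, 10),
    "D tonda": (80, 10, 20),
    "Z3": (0, 0, 5),
}

pooled = pooled_metrics(counts)
by_class = per_class_metrics(counts)
```
]]></preformat>
        <p>With these made-up counts the pooled recall stays above 0.8 even though the recall for Z3 is zero, which is why the per-class view is the more informative one here.</p>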
        <p>From the results grouped by class (Table 4) it is clear that 1L cannot provide
information on more than half of the characters and is therefore an
unsatisfactory model. This result can be explained by looking at the structure of the initial
training set: the 1L model has been able to specialise exclusively on the
recognition of the first four characters with the largest number of examples.
Comparing 1H and 1L we can say that having a lower recall on the most common
classes and a higher recall for all the others is a good indicator that the dataset
update has succeeded in preventing the network from specialising on a subset
of characters, bringing us closer to the final goal. Certainly further work needs
to be done to increase the number of examples relating to the graphemes Z3,
S documentaria and Z, which are currently not recognised by the network because
they are last in terms of number of annotations.</p>
        <p>Summarizing, the information on the performance of the models contained
in Table 3 expresses a value that may seem absolute but that in reality is closely
related to the structure of the training set used. It is therefore incorrect to use
the values in Table 3 to compare L models with H models and say that L models
are preferable to H models because they achieve better metrics. Instead, it is
correct to say that 1L and 1H are better than 0L and 0H respectively, i.e. that
initialization with pretrained weights produced better results than initialization
from scratch.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>The content of this paper is part of a larger project which aims to provide
palaeographers with software able to classify and locate simple graphemes
within mediaeval manuscripts. The major benefit of an automatic detection will
be to replace the time-consuming process of manual analysis with a quick and
easy way of obtaining information about simple graphemes contained within
entire document collections. The core of the system is a Mask R-CNN network
trained to recognise a specific subset of graphemes. The training phase was
carried out on a new dataset built by manually labelling images of manuscripts
from the XIth to the XIVth century. The results discussed in Section 5 show that among
the proposed models the best results are achieved by the pre-trained network.
This is justified by the fact that the amount of data available needs to be further
increased and better balanced before dropping the use of pre-trained weights.
Training from scratch provided satisfactory results, although slightly worse than
its pre-trained counterpart.</p>
      <p>The first future development to be carried out concerns the improvement of
the dataset. Expanding the dataset by adding examples for rarer classes would
help in increasing the recall for those classes. In order to enhance the quality of
the training dataset another technique that could be applied is the one of data
augmentation in favour of those classes with fewer examples.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work is partially supported by the Fondazione Cassa di Risparmio di
Firenze that funded the project I manoscritti medievali della Toscana (sec.
XIII-XIV): progettazione e sviluppo di un software per la datazione e la
determinazione di origine granted to SISMEL.</p>
      <p>We would like to thank the following libraries for providing us the documents:
– Barcellona, Biblioteca de Catalunya: 639 f. not identifiable
– Bologna, Biblioteca Universitaria: 1746 ff. 7, 8, 10, 30, 42, 52
– Berlin, Staatsbibliothek zu Berlin: Rehdiger 227 f. 10r; Phillips 1716 12v, f. 39r
– Cologny, Fondation Martin: Bodmer 30 f. 12v
– Firenze, Biblioteca della Fondazione E. Franceschini: ms. 2 f. 222v, unknown
– Firenze, Biblioteca Medicea Laurenziana: Plut. 42.23 f. 1r - Plut. 19 dex. 1
f. 7v - Plut. 19 dex. 5 ff. 5v, 6v, 14r, 23v, 55r, 55v, 135v - Plut. 19 dex. 8 f.
5v - Plut. 19 dex. 7 ff. 21v, 89v, 108v - Plut. 30 sin. 3 ff. 42v, 110v, 235v -
Conv.Soppr. 321 ff. 145r, 146r, 150v, 151r - Strozzi 146, f. 2r, 12r
– Firenze, Biblioteca Riccardiana: 222 f. 152r - 269 f. 1r - 323 f. 105v - 327 f.
8r - 1422 f. 70r - 829 f. 12r - 1471 f. 43v
– Firenze, Biblioteca Nazionale Centrale: I.III 272-273 f. 32r - C.S D.7.1158 f.
10r - Magl. XII.4, f.17r
– Milano, Biblioteca Ambrosiana: M 76 sup. f. 274
– Pisa, Archivio di Stato: Div. A n. 2 f. 101v
– Siena, Biblioteca Comunale degli Intronati: F.III.3 ff. 2r, 137r.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sheng</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lambert</given-names>
            <surname>Schomaker</surname>
          </string-name>
          . “
          <article-title>Beyond OCR: Multi-faceted understanding of handwritten document characteristics”</article-title>
          .
          <source>In: Pattern Recognition</source>
          <volume>63</volume>
          (
          <month>Mar.</month>
          <year>2017</year>
          ), pp.
          <fpage>321</fpage>
          -
          <lpage>333</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.patcog.2016.09.017</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schomaker</surname>
          </string-name>
          . “
          <article-title>Co-occurrence Features for Writer Identification”</article-title>
          .
          <source>In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          .
          <year>2016</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>83</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICFHR.2016.0027</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Lambert</given-names>
            <surname>Schomaker</surname>
          </string-name>
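          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Wiering</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sheng</given-names>
            <surname>He</surname>
          </string-name>
          . “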
          <article-title>Junction detection in handwritten documents and its application to writer identification”</article-title>
          .
          <source>In: Pattern Recognition</source>
          <volume>48</volume>
          (
          <month>June</month>
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.patcog.2015.05.022</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartelt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Puppe</surname>
          </string-name>
          . “
          <article-title>Lyrics Recognition and Syllable Assignment of Medieval Music Manuscripts”</article-title>
          .
          <source>In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          .
          <year>2020</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>192</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICFHR2020.2020.00043</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Wick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Reul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Puppe</surname>
          </string-name>
          .
          <article-title>Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition</article-title>
          .
          <year>2018</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1807.02004</pub-id>
          [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Fredrik</given-names>
            <surname>Wahlberg</surname>
          </string-name>
          .
          <article-title>“Interpreting the Script: Image Analysis and Machine Learning for Quantitative Studies of Pre-modern Manuscripts”</article-title>
          .
          <source>PhD thesis</source>
          .
          <source>Acta Universitatis Upsaliensis</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Lombardi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Simone</given-names>
            <surname>Marinai</surname>
          </string-name>
          . “
          <article-title>Deep Learning for Historical Document Analysis and Recognition - A Survey”</article-title>
          .
          <source>In: J. Imaging</source>
          <volume>6</volume>
          .
          <issue>10</issue>
          (
          <year>2020</year>
          ), p.
          <fpage>110</fpage>
          . url: https://doi.org/10.3390/jimaging6100110.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          et al. “
          <article-title>Mask R-CNN”</article-title>
          .
          <source>In: 2017 IEEE International Conference on Computer Vision (ICCV)</source>
          .
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Zahra</given-names>
            <surname>Ziran</surname>
          </string-name>
          et al. “
          <article-title>Text alignment in early printed books combining deep learning and dynamic programming”</article-title>
          .
          <source>In: Pattern Recognit. Lett</source>
          .
          <volume>133</volume>
          (
          <year>2020</year>
          ), pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Samuele</given-names>
            <surname>Capobianco</surname>
          </string-name>
          . “
          <article-title>Deep Learning Methods for Document Image Understanding”</article-title>
          .
          <source>PhD thesis</source>
          . University of Florence,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Kentaro</given-names>
            <surname>Wada</surname>
          </string-name>
          .
          <article-title>labelme: Image Polygonal Annotation with Python</article-title>
          . https://github.com/wkentaro/labelme.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Wu</surname>
          </string-name>
          et al.
          <source>Detectron2</source>
          . https://github.com/facebookresearch/detectron2.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Käding</surname>
          </string-name>
          et al. “
          <article-title>Fine-Tuning Deep Neural Networks in Continuous Learning Scenarios”</article-title>
          .
          <source>In: ACCV Workshops</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Pulkit</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jitendra</given-names>
            <surname>Malik</surname>
          </string-name>
          .
          <article-title>Analyzing the Performance of Multilayer Neural Networks for Object Recognition</article-title>
          .
          <year>2014</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1407.1610</pub-id>
          [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          et al. “
          <article-title>Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”</article-title>
          .
          <source>In: 2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2014</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2014.81</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          et al. “
          <article-title>Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks”</article-title>
          .
          <source>In: 2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2014</year>
          , pp.
          <fpage>1717</fpage>
          -
          <lpage>1724</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2014.222</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Branson</surname>
          </string-name>
          et al.
          <source>Bird Species Categorization Using Pose Normalized Deep Convolutional Nets</source>
          .
          <year>2014</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1406.2952</pub-id>
          [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Artem</given-names>
            <surname>Babenko</surname>
          </string-name>
          et al.
          <source>Neural Codes for Image Retrieval</source>
          .
          <year>2014</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1404.1777</pub-id>
          [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Wu</surname>
          </string-name>
          et al.
          <source>Detectron2 Model Zoo and Baselines</source>
          .
          <year>2020</year>
          . url: https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          .
          <source>Rethinking ImageNet Pretraining</source>
          .
          <year>2018</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1811.08883</pub-id>
          [cs.CV].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>