<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Location of Simple Graphemes in Mediaeval Manuscripts based on Mask R-CNN</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Simone</forename><surname>Marinai</surname></persName>
							<email>simone.marinai@unifi.it</email>
							<affiliation key="aff0">
<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Florence</orgName>
								<address>
									<addrLine>Via di Santa Marta</addrLine>
									<postCode>50139</postCode>
									<settlement>Florence</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gabriella</forename><surname>Pomaro</surname></persName>
							<email>gabriella.pomaro@sismelfirenze.it</email>
							<affiliation key="aff1">
<orgName type="institution">Società Internazionale per lo Studio del Medioevo Latino</orgName>
								<address>
									<addrLine>Via Montebello, 7</addrLine>
									<postCode>50123</postCode>
									<settlement>Florence</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Claudia</forename><surname>Raffaelli</surname></persName>
							<email>claudia.raffaelli@stud.unifi.it</email>
							<affiliation key="aff0">
<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Florence</orgName>
								<address>
									<addrLine>Via di Santa Marta</addrLine>
									<postCode>50139</postCode>
									<settlement>Florence</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francesco</forename><surname>Scandiffio</surname></persName>
							<email>francesco.scandiffio@stud.unifi.it</email>
							<affiliation key="aff0">
<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="institution">University of Florence</orgName>
								<address>
									<addrLine>Via di Santa Marta</addrLine>
									<postCode>50139</postCode>
									<settlement>Florence</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Research supported by Fondazione Cassa di Risparmio di Firenze</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Location of Simple Graphemes in Mediaeval Manuscripts based on Mask R-CNN</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">76DAA0EA4A2ED207B23BE803F63AEC6F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Character recognition</term>
					<term>Grapheme classification and location</term>
					<term>Mask R-CNN</term>
					<term>Deep learning</term>
					<term>Mediaeval Manuscripts</term>
					<term>Paleography</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe a system for the location of simple graphemes in mediaeval manuscripts based on the Mask R-CNN convolutional neural network. This is the first step towards the ambitious goal of providing palaeographers with a powerful tool with which to speed up and refine the delicate process of dating and determining the origin of manuscripts. In order to train the network, a new dataset composed of 49 pages of Latin Middle Ages manuscripts has been built. Experimental results demonstrate that using the Mask R-CNN network, along with a proper configuration of parameters, leads to good overall classification results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The analysis of manuscripts, in particular their dating and localization, represents the principal way to reconstruct our history before the invention of movable type printing. Unfortunately, for the large majority of manuscripts there is no reliable information about their origin and provenance. As a matter of fact, only after the late 14th Century do we have significant quantities of items with associated dating information in libraries and archives. For this reason, palaeographers use a variety of methods to determine the age of a manuscript, but they can usually only provide an approximate period for its origin. Among the methodologies used by palaeographers we can list the study of the material, of the ink used, and of course the analysis of the language. In the case of mediaeval manuscripts, researchers have an additional element from which to draw important information for their dating task: the analysis of the shape of graphemes and of the features that make them up. Indeed, different ways of writing the same grapheme spread among the amanuenses, much as happens today with cursive and capital letters. Since these graphic signs changed several times and spread slowly over the centuries, they are a trace of how writing has changed over time. By closely observing the changes in lettering, palaeographers can provide a basic time frame for when a document was written. However, some writing styles lasted for such a long time, or were so widespread, that they cannot provide any useful information for dating. For these reasons scholars are interested in examining, for each manuscript, the copyist's graphic choices and also the presence or absence of significant graphic variants. This process is very long and prone to human errors, such as misreading a sign or missing an occurrence of the sign being searched for. 
When manuscripts consist of a large number of pages, i.e. hundreds or thousands, it is very difficult to inspect them completely and carefully because it would be an extremely time-consuming task. Usually, only a small number of sample pages is analysed in order to extract the required information. This kind of approach can easily lead to incorrect dating results, because copying a manuscript took years, during which even the same amanuensis could change their writing style several times.</p><p>For these reasons, a system capable of extracting information about the presence of certain palaeographic letter variants within a collection of documents can be of particular use to palaeographers. In this paper we describe the first version of a grapheme-detection system based on the Mask R-CNN deep neural network, designed to identify and count the occurrences of a specific subset of graphemes in manuscripts from the Latin Middle Ages. This work is part of the project "Mediaeval manuscripts of Tuscany (XIII-XIV centuries): design and development of software for dating and determination of origin" financed by SISMEL (Società Internazionale per lo Studio del Medioevo Latino), DINFO (Dipartimento di Ingegneria dell'Informazione of the University of Florence), and Fondazione Cassa di Risparmio Firenze.</p><p>The remainder of the paper is structured as follows. In Section 2 we present related work on character recognition and the Mask R-CNN framework. In Section 3 we describe the building process of the dataset employed in this project. In Section 4 we present the approaches used to train the network. Finally, Section 5 contains the obtained results and in Section 6 we draw the conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>In this section we discuss related work concerning the detection of characters in manuscript images and we provide a summary of the main object detection approach considered in our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Character detection in manuscripts</head><p>Sheng et al. <ref type="bibr" target="#b0">[1]</ref> claim that since automatic document reading does not always suffice to fully understand documents, additional techniques that go beyond the mere use of OCR systems are needed. In particular, with respect to manuscripts, locating the geographic origin or identifying the writer may also be relevant tasks. For this reason the authors developed a particular set of features that maps the pixels composing the characters into a high-dimensional space, capturing in this way specific information about the characters. The proposed features can be used separately or jointly and are based on the principle of the Joint Feature Distribution (JFD). The goal is to answer four questions in the field of palaeography: who produced a certain document, which document it is, and when and where it was produced. The proposed features are divided into textural-based features and grapheme-based features. The former treat the manuscripts as textural images, extracting statistical information from the text blocks of the entire image. They capture the curvature and skew characteristic of different writing styles and typically do not require line or character segmentation. The grapheme-based features, in turn, capture the statistical distribution of individual characters already segmented from the documents. They rely on the JFD principle to concatenate spatial information into a larger structure that is faithful to the traced sign. Among the features that best support dating a document, the authors highlight CoHinge and QuadHinge <ref type="bibr" target="#b1">[2]</ref>, which fall into the category of textural-based features, as well as the Junction feature (Junclets) <ref type="bibr" target="#b2">[3]</ref> with regard to grapheme-based features. Wick et al. 
in <ref type="bibr" target="#b3">[4]</ref> propose a method for the automatic transcription of lyrics in mediaeval music manuscripts. The work is based on the open-source OCR engine Calamari <ref type="bibr" target="#b4">[5]</ref>. Predictions are made on previously segmented lines of the original manuscript page, using either an available pre-trained model or a custom one. In <ref type="bibr" target="#b5">[6]</ref>, Wahlberg presents a method for line segmentation, along with a set of features that can be used for text recognition, writer identification and the estimation of production dates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Mask R-CNN</head><p>Artificial Neural Networks, and in particular deep learning architectures, have been widely used to process historical documents <ref type="bibr" target="#b6">[7]</ref>. Among other approaches, Mask R-CNN is a state-of-the-art model for instance segmentation and object detection, developed by the Facebook AI Research group <ref type="bibr" target="#b7">[8]</ref>. Mask R-CNN extends Faster R-CNN (already used to locate words in early printed documents <ref type="bibr" target="#b8">[9]</ref>) with an extra mask head: it is composed of a standard convolutional network for image classification and, on top of that, an additional fully convolutional network for semantic segmentation at pixel level on the proposed regions. The network operates in two main stages. The first is responsible for generating, on the input image, a set of proposals, i.e. regions where there might be an object. The second produces the output of the network and is designed to classify the proposals suggested by the previous stage, allowing the generation of bounding boxes and masks. At the end of the process, the result is a bounding box that encloses the recognised object and a pixel mask placed on it. The detection branch, that is, the branch for classification and bounding-box regression, runs in parallel with a branch used for predicting segmentation masks, decoupling the two tasks. For more details refer to <ref type="bibr" target="#b7">[8]</ref>.</p></div>
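A key ingredient of such two-stage detectors is the suppression of redundant region proposals. The following minimal sketch illustrates greedy non-maximum suppression over hypothetical boxes and scores; it is a simplified stand-in for the suppression used inside real detectors, not the authors' implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes
    overlapping it above iou_thr, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep

# Two overlapping proposals for the same grapheme plus one distinct proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the lower-scoring duplicate is suppressed
```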
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Building the dataset</head><p>It is important to remark that a program aimed at identifying simple graphemes should not be designed around one specific period of manuscript production or one particular region. Rather, it should focus on the peculiarities of the graphic material to be examined, in which the simple grapheme has a specific meaning. Even if the approach we propose is not designed to deal with cursive writings and would not work for historical periods in which the syntagm is more important than the paradigm, there are whole centuries whose manuscripts can be suitably analysed and on which we focus our research: manuscripts between the XIth and the XIVth centuries, and also documents from the XVth century.</p><p>The manuscripts used in this study are carefully selected from the large Codex archive that SISMEL has built over the last twenty years by cataloguing mediaeval manuscripts from Tuscany in the Codex Project <ref type="foot" target="#foot_0">3</ref> . In particular, the data used are based on the ample collection of scanned works accurately linked to their codicological descriptions. Taking into account the period of time previously mentioned, we believe it important to identify, and compute the distribution in the manuscripts of, the following graphemes: three variants of the letter s and two of the letter d; the ligature et; the tachygraphic et; k; three variants of z (including the ç). Given the peculiarities of the dataset, we are also considering including the graphic sign ti assibilata: this is a ligature of t and i that is nevertheless an isolated symbol. Currently, the data collection process is still in progress, so this last grapheme will be introduced in a later version of the dataset. 
The set of graphemes of interest is shown in Figure <ref type="figure" target="#fig_1">1</ref>.</p><p>In order to locate and recognize the graphemes of the subset above, it was necessary to build an appropriate dataset <ref type="foot" target="#foot_1">4</ref> with Latin Middle Ages characters from the period XI-XIV ineunte. The examples were gathered from manuscripts in accessible digital databases, giving preference to those originating from Tuscany. To achieve good training of the neural network, only documents without excessive signs of wear, such as burns, rubbing and tears, were selected. However, it should be noted that online libraries of ancient documents usually expose only low-quality images to the public. The difficulty of collecting good-quality pages suitable for our purpose inevitably limited the number of images that make up the dataset. The dataset consists of 49 pages, of which only 11 are completely labelled. Other documents have already been identified but have not been validated; we plan to include in the future release of the dataset a much larger number of pages and examples.</p><p>The occurrences of the graphemes of interest have been manually annotated with the Interactive LabelMe program <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref> (Figure <ref type="figure" target="#fig_2">2</ref>), which outputs a JSON file containing polygonal segmentations of the graphemes. The JSON files have been converted into the COCO (Common Objects in Context) format for ease of use with Mask R-CNN. Since characters are usually drawn very close to each other, some annotations contain not only the grapheme of interest but also small portions of adjacent signs. This implies that some of the segmentations contain a little noise, which was nevertheless deemed acceptable. Some of the images in the dataset have a very high resolution. 
This has a positive effect on learning but is also challenging in terms of the amount of GPU memory required for training. After investigating various cutting methods, we decided to cut each image into four blocks of the same size. Since manuscripts do not have a predetermined page structure (in some documents the text is a continuum without separation into columns, while in others it surrounds figures that can be placed anywhere on the page), all images are processed in the same way. Annotations along the cut lines are discarded.</p><p>Since the dataset is to be used for a very specific task, we decided to rely only on the content of carefully selected manuscripts, avoiding the use of data augmentation techniques. As a consequence of this choice, and considering also that the Latin language has some graphemes much more frequent than others, the number of annotated characters is strongly unbalanced in favour of some common classes and is almost totally non-existent for other rare, but still important, palaeographic letter variants (see Table <ref type="table" target="#tab_0">1</ref>). For instance, S dritta and D tonda are the most frequent classes, making up 70% of the dataset. Such an unbalanced set of data can create learning issues: with this configuration, the network will very probably learn the most common classes well, and the rarest ones not so well. Finally, the dataset has been divided into train, validation and test sets. Since a complete and reliable ground truth is critical for a proper performance evaluation, the 11 fully annotated pages have been divided between validation (5) and test (6). The remaining 38 documents were used for training.</p></div>
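The tiling step described above can be sketched as follows. The box format (x1, y1, x2, y2), the dictionary layout of the annotations and the exact rule for discarding annotations that cross a cut line are our assumptions for illustration, not the authors' implementation:

```python
def cut_into_quadrants(width, height, annotations):
    """Split a page into four equal blocks and reassign annotations.
    Annotations crossing the vertical or horizontal cut are discarded,
    as described in the paper; kept boxes are shifted into tile-local
    coordinates."""
    cx, cy = width // 2, height // 2
    tiles = {(r, c): [] for r in range(2) for c in range(2)}  # (row, col)
    for ann in annotations:
        x1, y1, x2, y2 = ann["bbox"]
        if (x1 < cx < x2) or (y1 < cy < y2):
            continue  # annotation lies along a cut line: discard
        c = 0 if x2 <= cx else 1
        r = 0 if y2 <= cy else 1
        ox, oy = c * cx, r * cy  # tile origin in page coordinates
        tiles[(r, c)].append({"bbox": (x1 - ox, y1 - oy, x2 - ox, y2 - oy),
                              "category": ann["category"]})
    return tiles

# A hypothetical 1000 x 1200 page with three annotations.
page = [
    {"bbox": (10, 10, 30, 40), "category": "S dritta"},   # top-left tile
    {"bbox": (490, 40, 530, 80), "category": "D tonda"},  # crosses the vertical cut
    {"bbox": (600, 700, 640, 760), "category": "K"},      # bottom-right tile
]
tiles = cut_into_quadrants(1000, 1200, page)
```

The second annotation straddles the vertical cut at x = 500 and is therefore dropped, matching the paper's choice to discard annotations along the cut lines.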
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Model Identification</head><p>In this section we present the experiments conducted to adjust the hyper-parameters used in network training, discussing the effects produced by their variations and explaining how they led us towards the final model.</p><p>As previously discussed, our work is based on Mask R-CNN, a state-of-the-art convolutional network for object detection and image segmentation. In particular, we selected the Detectron2 implementation developed by the Facebook AI Research group <ref type="bibr" target="#b11">[12]</ref>. To configure the network we used the fine-tuning paradigm, which consists of initialising the weights with a pre-trained model. This approach is useful when training the network from scratch is made difficult by the limited amount of data available, as pointed out by a variety of scientific publications <ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref><ref type="bibr" target="#b15">[16]</ref><ref type="bibr" target="#b16">[17]</ref><ref type="bibr" target="#b17">[18]</ref>. The chosen model is a ResNet50 with an FPN backbone trained for 37 epochs on the COCO dataset. After selecting the model, it was necessary to refine the training parameters, paying particular attention to the learning rate, the number of iterations and the batch size.</p><p>In order to identify the optimal learning rate we carried out 69 independent trainings of 500 iterations each, assigning to the i-th training the learning rate lr_i = 0.0001 · i. The aim of the experiment was to compute the loss on the training set at the end of the 500 iterations, selecting the lr with the strongest movement towards the minimum value of the loss. We observed that from i = 40 the loss increases, diverging at i = 69. With low values of lr (magnitude of 1e-3) we saw a good reduction, but the loss was still high. 
We therefore decided to keep the learning rate of 3.5e-3, which combines a steady overall reduction with the global minimum value of the loss.</p><p>The learning rate schedule and the total number of iterations have been identified through an experimental trial-and-error approach. All other configuration parameters being equal, we evaluated different scheduling policies and maximum numbers of iterations by comparing the Precision, Recall and F1 measures calculated on the validation set. Regarding the variation policy, the best results were achieved by training the network with a fixed learning rate of 3.5e-3 for a total of 1150 iterations, obtaining the following values: Precision = 0.869, Recall = 0.551, F1 = 0.675. Similar but slightly worse results were obtained by training with lr fixed at 3.5e-3 for the first 800 iterations, then proceeding with an lr of 5e-4 for a further 400 iterations.</p><p>Concerning the number of iterations with a fixed learning rate, shorter training provides a precision within ±0.03 of that of the selected model. Conversely, training for more than 1150 iterations increases precision by 5 percentage points, but at the same time negatively affects recall, bringing it down to below 0.20. This trend is confirmed by the comparison of inference boxes (Figure <ref type="figure" target="#fig_4">3</ref>). An excessive number of iterations increases the confidence and reduces false positives, but at the same time makes the model no longer able to detect graphemes that were previously retrieved. This behaviour is attributable to the fact that graphemes of the same class are written in slightly different ways across the dataset, depending on the style of the writer. For instance, some graphic signs can be traced in a more slanted way, can be larger than others and, more generally, present very personal characteristics related to the hand of the writer. 
Moreover, some graphemes are more likely to be drawn close to each other, making it even more difficult to distinguish them. Keeping all of this in mind, it is not surprising that an excessively high number of iterations causes the network to overfit to the grapheme styles in the training set, making it difficult to locate the others.</p><p>The last hyperparameter to be tuned is the batch size, i.e. the number of training samples analysed by the network before each weight update. The developers of Mask R-CNN adopted an "image-centric training", with the consequence that the batch size corresponds to the number of images analysed by the GPUs for each weight update. We chose a batch size of 16, thus processing 4 documents at a time, since each document is divided into four images.</p></div>
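The learning-rate sweep and the metric used to compare schedules can be reproduced in a few lines; this is a sketch of the procedure, not the authors' code, and the rounding of the reported F1 is ours:

```python
# Learning-rate grid from the sweep: lr_i = 0.0001 * i for i = 1..69.
lrs = [0.0001 * i for i in range(1, 70)]

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The selected rate, 3.5e-3, is the 35th point of the grid.
selected = lrs[34]

# F1 for the reported Precision/Recall of the chosen model (0.869, 0.551);
# this gives roughly 0.674, while the paper reports 0.675, presumably
# computed from the unrounded precision and recall.
score = f1(0.869, 0.551)
```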
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Result and Analysis</head><p>The fine-tuning paradigm discussed in the previous section was actually used at an earlier stage of this work, when the available dataset was only approximately 25% of its current size. Considering the dataset expansion carried out over the last few months, we questioned the usefulness of initialising the network with a pre-trained model, thus investigating alternative methods of initialisation. For this reason, inspired by <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>, we experimented with training from scratch. Furthermore, in this section we analyse how a highly unbalanced dataset can make network metrics appear better than they actually are. Table <ref type="table" target="#tab_1">2</ref> summarises the training examples of the two reference datasets grouped by class. It is easy to observe that, due to the intrinsic rarity of some graphemes, the example instances are not properly balanced. Moreover, more than half of the classes in the initial dataset have fewer than 35 annotations, an insignificant number compared to the great variety of sign executions that can be found even within a single manuscript.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Models comparison</head><p>Throughout the analysis of the results we decided to prefer a higher recall even at the expense of precision, provided that the value of the latter was at least 80%. The reason for this choice is related to the final objective of the research project. If the automatic grapheme identification and localisation system had a high recall and low precision, it would provide many wrong results, leading the user to distrust the system and requiring them to manually check a large amount of retrieved data. This would surely discourage the use of the software and therefore must be avoided. Even the opposite situation (high precision, low recall) would be counterproductive, because it would provide data that are unrepresentative and unable to satisfy the research needs, leading the palaeographer to search manually for important but undetected graphemes. However, we would like to point out that the analysis of the writing is usually carried out manually and that, due to the complexity and heaviness of the task, it is usually done only on a very limited selection of pages. For this reason, obtaining even only a third of the instances of a manuscript would be a considerable improvement. Considering the two datasets and the two techniques for initialising the weights, we can identify the following scenarios, to which we associate codes for the sake of brevity: training from scratch on the initial dataset (0L); training from scratch on the updated dataset (0H); pre-trained weights and initial dataset (1L); pre-trained weights and updated dataset (1H). 
By applying the approach discussed in Section 4 we obtained the following 4 models:</p><p>-0L is trained for 1400 iterations with fixed learning rate at 0.0035 -1L is trained for 1400 iterations with fixed learning rate at 0.0035 -0H is trained for 800 iterations with fixed learning rate at 0.0035 -1H is trained for 1150 iterations with fixed learning rate at 0.0035</p><p>In the first analysis we make pairwise comparisons between the models, grouping by weight initialisation method and structure of the training set (Table <ref type="table" target="#tab_2">3</ref>). Comparing 0H and 1H, it is evident that the network initialised with pre-computed weights has better recall and precision than its trained-from-scratch counterpart. This is justified by the fact that, although the dataset is better supplied with examples, these are not sufficient to allow good training of the network without the support of a base model.</p><p>Comparing the models 0L and 1L on the validation set, it emerges that, for an equal length of training, initialising with pre-trained weights brings a benefit in terms of recall (+0.054) at the expense of precision, which instead decreases by 0.042 points. From the analysis of the evaluation metrics, as the number of iterations varies, 0L shows a trend that grows smoothly in both precision and recall, going into overfitting after iteration 1400. The 1L metrics, on the other hand, are more erratic, oscillating several times before reaching the values previously reported. These behaviours are in line with what was discussed in Section 4. The trend on the validation set is confirmed by the results obtained on the test set.</p><p>When comparing the results on the test set of all the networks, the 1L and 0L models obtain the best scores, ranking first and second respectively for F1 measure. From these values it may seem that the models trained on the initial dataset are better than those trained on the updated version. 
This would mean that the update of the dataset made things worse, an unlikely conclusion if we compare the composition of the two datasets: although the problem of imbalance is still present, it is in a slightly reduced form, since more than half of the classes now exceed 100 annotations. First of all, we note that in this context it is much more useful and accurate to assess precision and recall separately for each class, as in Table <ref type="table" target="#tab_3">4</ref>. This method of analysis is necessary to take into account the different probabilities of occurrence of Latin graphemes, an element that inevitably affects the number of examples in the dataset. However, every grapheme in the set of interest is relevant, which is why it would be unacceptable to produce good results on only a subset of the selected classes.</p><p>From the results grouped by class (Table <ref type="table" target="#tab_3">4</ref>) it is clear that 1L cannot provide information on more than half of the characters and is therefore an unsatisfactory model. This result can be explained by looking at the structure of the initial training set: the 1L model has specialised exclusively in the recognition of the four characters with the largest number of examples. Comparing 1H and 1L, the lower recall on the most common classes and the higher recall on all the others is a good indicator that the dataset update has succeeded in preventing the network from specialising on a subset of characters, bringing us closer to the final goal. 
Certainly further work needs to be done to increase the number of examples for the graphemes Z3, S documentaria and Z, which are currently not recognised by the network because they rank last in number of annotations.</p><p>Summarising, the information on the performance of the models contained in Table <ref type="table" target="#tab_2">3</ref> expresses a value that may seem absolute but that in reality is closely tied to the structure of the training set used. It is therefore incorrect to use the values in Table <ref type="table" target="#tab_2">3</ref> to compare L models with H models and to say that L models are preferable to H models because they achieve better metrics. Instead, it is correct to say that 1L and 1H are better than 0L and 0H respectively, i.e. that initialisation with pre-trained weights produced better results than initialisation from scratch.</p></div>
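Per-class precision and recall, the kind of breakdown Table 4 relies on to expose the imbalance problem, can be computed from per-class true-positive, false-positive and false-negative counts. The counts below are illustrative only, not taken from the paper:

```python
def per_class_metrics(counts):
    """counts maps class name -> (tp, fp, fn);
    returns class name -> (precision, recall), with 0.0 when undefined."""
    metrics = {}
    for name, (tp, fp, fn) in counts.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[name] = (precision, recall)
    return metrics

# Hypothetical counts: a frequent class dominates the aggregate metrics,
# while a rare class is never detected and would vanish in a global average.
counts = {"S dritta": (400, 40, 100), "Z": (0, 0, 2)}
m = per_class_metrics(counts)
```

Here the frequent class scores well while the rare class has zero recall, which a single aggregate figure would largely hide; this is the effect the per-class analysis above is designed to surface.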
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>The content of this paper is part of a larger project which aims to provide palaeographers with software able to classify and locate simple graphemes within mediaeval manuscripts. The major benefit of automatic detection will be to replace the time-consuming process of manual analysis with a quick and easy way of obtaining information about the simple graphemes contained within entire document collections. The core of the system is a Mask R-CNN network trained to recognise a specific subset of graphemes. The training phase was carried out on a new dataset built by manually labelling images of manuscripts from the XI to XIV centuries. The results discussed in Section 5 show that among the proposed models the best results are achieved by the pre-trained network. This is justified by the fact that the amount of data available needs to be further increased and better balanced before dropping the use of pre-trained weights. Training from scratch provided satisfactory results, although slightly worse than its pre-trained counterpart.</p><p>The first future development to be carried out concerns the improvement of the dataset. Expanding the dataset by adding examples for the rarer classes would help increase the recall for those classes. Another technique that could be applied to enhance the quality of the training dataset is data augmentation in favour of the classes with fewer examples.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>S Documentaria. (d) D Dritta. (e) D Tonda. (f) K. (g) Z. (h) Z3. (i) C Cedigliata. (j) Et in legatura. (k) Et tachigrafica. (l) T assibilata.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Graphemes of interest to be detected.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Example of an annotated image viewed from the Interactive LabelMe interface.</figDesc><graphic coords="5,152.06,115.83,311.24,269.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Test with iteration parameter set to 1150. (b) Test with iteration parameter set to 1250.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Prediction results on the validation set produced with different numbers of training iterations. When iterations exceed 1150, the confidence of box predictions increases but many previously detected instances are no longer recognised.</figDesc><graphic coords="8,140.98,137.70,164.34,147.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Number of labelled graphemes in each split of the dataset grouped by class.</figDesc><table><row><cell>Class</cell><cell cols="4">Train Validation Test Total</cell></row><row><cell>S dritta</cell><cell>1009</cell><cell>441</cell><cell>533</cell><cell>1983</cell></row><row><cell>D tonda</cell><cell>933</cell><cell>329</cell><cell>358</cell><cell>1620</cell></row><row><cell>S tonda</cell><cell>367</cell><cell>70</cell><cell>48</cell><cell>485</cell></row><row><cell>Et tachigrafica</cell><cell>279</cell><cell>39</cell><cell>26</cell><cell>344</cell></row><row><cell>D dritta</cell><cell>192</cell><cell>21</cell><cell>47</cell><cell>260</cell></row><row><cell>K</cell><cell>61</cell><cell>27</cell><cell>67</cell><cell>155</cell></row><row><cell>Et in legatura</cell><cell>106</cell><cell>21</cell><cell>18</cell><cell>145</cell></row><row><cell>C cedigliata</cell><cell>66</cell><cell>17</cell><cell>20</cell><cell>103</cell></row><row><cell>Z3</cell><cell>40</cell><cell>25</cell><cell>20</cell><cell>85</cell></row><row><cell>S documentaria</cell><cell>44</cell><cell>15</cell><cell>5</cell><cell>64</cell></row><row><cell>Z</cell><cell>16</cell><cell>2</cell><cell>2</cell><cell>20</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Number of annotations of the two baseline trainings grouped by class.</figDesc><table><row><cell>Class</cell><cell cols="2">Initial Train set Current Train set</cell></row><row><cell>S dritta</cell><cell>530</cell><cell>1009</cell></row><row><cell>D tonda</cell><cell>395</cell><cell>933</cell></row><row><cell>S tonda</cell><cell>78</cell><cell>367</cell></row><row><cell>Et tachigrafica</cell><cell>85</cell><cell>279</cell></row><row><cell>D dritta</cell><cell>28</cell><cell>192</cell></row><row><cell>Et in legatura</cell><cell>3</cell><cell>106</cell></row><row><cell>C cedigliata</cell><cell>32</cell><cell>66</cell></row><row><cell>K</cell><cell>22</cell><cell>61</cell></row><row><cell>S documentaria</cell><cell>4</cell><cell>44</cell></row><row><cell>Z3</cell><cell>6</cell><cell>40</cell></row><row><cell>Z</cell><cell>1</cell><cell>16</cell></row><row><cell>Totals</cell><cell>1184</cell><cell>3113</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison of the key metrics calculated on the validation set and test set on different network configurations.</figDesc><table><row><cell></cell><cell></cell><cell>Validation</cell><cell></cell><cell></cell><cell>Test</cell></row><row><cell></cell><cell cols="2">Precision Recall</cell><cell>F1</cell><cell cols="2">Precision Recall</cell><cell>F1</cell></row><row><cell>Network 0H</cell><cell>0.857</cell><cell cols="2">0.452 0.592</cell><cell>0.841</cell><cell>0.446 0.583</cell></row><row><cell>Network 1H</cell><cell>0.869</cell><cell cols="2">0.551 0.675</cell><cell>0.869</cell><cell>0.490 0.624</cell></row><row><cell>Network 0L</cell><cell>0.903</cell><cell cols="2">0.510 0.651</cell><cell>0.854</cell><cell>0.528 0.653</cell></row><row><cell>Network 1L</cell><cell>0.861</cell><cell cols="2">0.564 0.682</cell><cell>0.839</cell><cell>0.610 0.706</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Comparison of the precision and recall metrics obtained on the test set and grouped by classes for both network configurations 1L and 1H.</figDesc><table><row><cell></cell><cell cols="2">Network 1L</cell><cell cols="2">Network 1H</cell><cell></cell></row><row><cell>Class</cell><cell cols="5">Precision Recall Precision Recall Test set</cell></row><row><cell>S dritta</cell><cell>0.809</cell><cell>0.835</cell><cell>0.869</cell><cell>0.650</cell><cell>533</cell></row><row><cell>D tonda</cell><cell>0.919</cell><cell>0.664</cell><cell>0.952</cell><cell>0.332</cell><cell>358</cell></row><row><cell>S tonda</cell><cell>1.0</cell><cell>0.042</cell><cell>0.823</cell><cell>0.292</cell><cell>67</cell></row><row><cell>Et tachigrafica</cell><cell>0.611</cell><cell>0.423</cell><cell>0.733</cell><cell>0.423</cell><cell>48</cell></row><row><cell>D dritta</cell><cell>0.0</cell><cell>0.0</cell><cell>1.0</cell><cell>0.149</cell><cell>47</cell></row><row><cell>K</cell><cell>0.0</cell><cell>0.0</cell><cell>0.704</cell><cell>0.567</cell><cell>26</cell></row><row><cell>Et in legatura</cell><cell>0.0</cell><cell>0.0</cell><cell>0.882</cell><cell>0.833</cell><cell>20</cell></row><row><cell>C cedigliata</cell><cell>0.0</cell><cell>0.0</cell><cell>0.857</cell><cell>0.3</cell><cell>20</cell></row><row><cell>Z3</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>18</cell></row><row><cell>S documentaria</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>5</cell></row><row><cell>Z</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>2</cell></row></table><note>threshold, becoming more significant in the training phase. Since this can only be a positive fact, a deeper analysis is necessary.</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://www.sismelfirenze.it/index.php/biblioteca-digitale/codex</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">The manuscripts that make up the dataset are listed in the Acknowledgements section and are available upon request by sending an email to the authors of this paper.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work is partially supported by the Fondazione Cassa di Risparmio di Firenze that funded the project I manoscritti medievali della Toscana (sec. XIII-XIV): progettazione e sviluppo di un software per la datazione e la determinazione di origine granted to SISMEL. We would like to thank the following libraries for providing us with the documents:</p><p>-Barcellona, Biblioteca de Catalunya: 639 f. non precisabile -Bologna, Biblioteca Universitaria: 1746 ff. 7, 8, 10, 30, 42, 52 -Berlin, Staatsbibliothek zu Berlin: Rehdiger 227 f. 10r; Phillips 1716 12v, f. 39r -Cologny, Fondation Martin: Bodmer 30 f. 12v -Firenze, Biblioteca della Fondazione E. Franceschini: ms. 2 f. 222v, ignota -Firenze, Biblioteca Medicea Laurenziana: Plut. 42.23 f. 1r -Plut. 19 dex. 1 f. 7v -Plut. 19 dex. 5 ff. 5v, 6v, 14r, 23v, 55r, 55v, 135v -Plut. 19 dex. 8 f. 5v -Plut. 19 dex. 7 ff. 21v, 89v, 108v -Plut. 30 sin. 3 ff. 42v, 110v, 235v -Conv.Soppr. 321 ff. 145r, 146r, 150v, 151r -Strozzi 146, f. 2r, 12r</p><p>-Firenze, Biblioteca Riccardiana: 222 f. 152r -269 f. 1r -323 f. 105v -327 f. 8r -1422 f. 70r -829 f. 12r -1471 f. 43v -Firenze, Biblioteca Nazionale Centrale: I.III 272-273 f. 32r -C.S D.7.1158 f. 10r -Magl. XII.4, f.17r -Milano, Biblioteca Ambrosiana: M 76 sup. f. 274 -Pisa, Archivio di Stato: Div. A n. 2 f. 101v -Siena, Biblioteca Comunale degli Intronati: F.III.3 ff. 2r, 137r.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Beyond OCR: Multi-faceted understanding of handwritten document characteristics</title>
		<author>
			<persName><forename type="first">He</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lambert</forename><surname>Schomaker</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.patcog.2016.09.017</idno>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="page" from="321" to="333" />
			<date type="published" when="2017-03">Mar. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Co-occurrence Features for Writer Identification</title>
		<author>
			<persName><forename type="first">S</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schomaker</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICFHR.2016.0027</idno>
	</analytic>
	<monogr>
		<title level="m">15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="78" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Junction detection in handwritten documents and its application to writer identification</title>
		<author>
			<persName><forename type="first">Lambert</forename><surname>Schomaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Wiering</surname></persName>
		</author>
		<author>
			<persName><forename type="first">He</forename><surname>Sheng</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.patcog.2015.05.022</idno>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<date type="published" when="2015-06">June 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Lyrics Recognition and Syllable Assignment of Medieval Music Manuscripts</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartelt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Puppe</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICFHR2020.2020.00043</idno>
	</analytic>
	<monogr>
		<title level="m">17th International Conference on Frontiers in Handwriting Recognition (ICFHR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="187" to="192" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Calamari -A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition</title>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Wick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Reul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1807.02004</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>cs.CV</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Interpreting the Script: Image Analysis and Machine Learning for Quantitative Studies of Pre-modern Manuscripts</title>
		<author>
			<persName><forename type="first">Fredrik</forename><surname>Wahlberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Acta Universitatis Upsaliensis</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Deep Learning for Historical Document Analysis and Recognition -A Survey</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Lombardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><surname>Marinai</surname></persName>
		</author>
		<idno type="DOI">10.3390/jimaging6100110</idno>
		<ptr target="https://doi.org/10.3390/jimaging6100110" />
	</analytic>
	<monogr>
		<title level="j">J. Imaging</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">110</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Mask R-CNN</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCV.2017.322</idno>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE International Conference on Computer Vision (ICCV)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2980" to="2988" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Text alignment in early printed books combining deep learning and dynamic programming</title>
		<author>
			<persName><forename type="first">Zahra</forename><surname>Ziran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognit. Lett</title>
		<imprint>
			<biblScope unit="volume">133</biblScope>
			<biblScope unit="page" from="109" to="115" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Deep Learning Methods for Document Image Understanding</title>
		<author>
			<persName><forename type="first">Samuele</forename><surname>Capobianco</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>University of Florence</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">labelme: Image Polygonal Annotation with Python</title>
		<author>
			<persName><forename type="first">Kentaro</forename><surname>Wada</surname></persName>
		</author>
		<ptr target="https://github.com/wkentaro/labelme" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Detectron2</title>
		<author>
			<persName><forename type="first">Yuxin</forename><surname>Wu</surname></persName>
		</author>
		<ptr target="https://github.com/facebookresearch/detectron2" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Fine-Tuning Deep Neural Networks in Continuous Learning Scenarios</title>
		<author>
			<persName><forename type="first">C</forename><surname>Käding</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACCV Workshops</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Analyzing the Performance of Multilayer Neural Networks for Object Recognition</title>
		<author>
			<persName><forename type="first">Pulkit</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jitendra</forename><surname>Malik</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1407.1610</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs.CV</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2014.81</idno>
	</analytic>
	<monogr>
		<title level="m">2014 IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="580" to="587" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Oquab</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2014.222</idno>
	</analytic>
	<monogr>
		<title level="m">2014 IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1717" to="1724" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Bird Species Categorization Using Pose Normalized Deep Convolutional Nets</title>
		<author>
			<persName><forename type="first">Steve</forename><surname>Branson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1406.2952</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs.CV</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Neural Codes for Image Retrieval</title>
		<author>
			<persName><forename type="first">Artem</forename><surname>Babenko</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.1777</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs.CV</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Detectron2 Model Zoo and Baselines</title>
		<author>
			<persName><forename type="first">Yuxin</forename><surname>Wu</surname></persName>
		</author>
		<ptr target="https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Rethinking ImageNet Pretraining</title>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Dollár</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1811.08883</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>cs.CV</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
