In Codice Ratio: OCR of Handwritten Latin Documents using Deep Convolutional Networks

Donatella Firmani (1), Paolo Merialdo (1), Elena Nieddu (1), and Simone Scardapane (2)

(1) Roma Tre University
donatella.firmani@uniroma3.it, merialdo@dia.uniroma3.it, ema.nieddu@gmail.com
(2) Sapienza University
simone.scardapane@uniroma1.it

Abstract. Automatic transcription of historical handwritten documents is a challenging research problem, which in general requires expensive transcriptions by expert paleographers. In Codice Ratio is an end-to-end architecture that instead requires limited labeling effort, and whose aim is the automatic transcription of a portion of the Vatican Secret Archives (one of the largest historical libraries in the world). In this paper, we describe in particular the design of our OCR component for Latin characters. To this end, we first annotated a large corpus of Latin characters with a custom crowdsourcing platform. Leveraging recent progress in deep learning, we designed and trained a deep convolutional network achieving an overall accuracy of 96% over the entire dataset, which is one of the highest results reported in the literature so far. Our training data are publicly available.

Keywords: deep convolutional neural networks, handwritten text recognition, optical character recognition, medieval documents

1 Introduction

Historical documents are an essential source of knowledge concerning past cultures and societies [10]. Until recently, the main bottleneck was the availability of large collections of historical documents in digital form. Today, many historical archives have instead begun a full digitization of their assets, including the Bibliothèque Nationale de France (http://gallica.bnf.fr/) and the Vatican Apostolic Library (http://www.digitavaticana.org/). Due to the cost (and time) required for manual transcription of these documents, and the sheer size of the collections, the challenge has become the design of fully automatic solutions for their transcription in computer-readable form. While impressive results have been achieved for printed historical documents [15], successfully transcribing handwritten documents remains a challenging task, for a variety of reasons: irregularities in writing, ligatures and abbreviations, errors in transcription, and so forth (see the discussion in Section 2).

In Codice Ratio is an interdisciplinary project involving Humanities and Engineering departments from Roma Tre University, as well as the Vatican Secret Archives, aiming at the complete transcription of the Vatican Registers, a corpus of more than 18000 pages held in the Vatican Secret Archives, with minimal labeling effort. The Vatican Secret Archives is one of the largest historical libraries in the world, containing more than 85 linear kilometres of shelving. Interestingly, 'secret' does not mean confidential, but rather denotes the archives as the private property of the Pope. The corpus comprises the official correspondence of the Roman Curia produced in the 13th century, including letters and opinions on legal questions, exchanged with kings and sovereigns as well as with many political and religious institutions throughout Europe. Never having been transcribed in the past, these documents are of unprecedented historical relevance, and could shed light on that crucial historical period. A preliminary description of the system appeared in [1].

Our contribution.
In this paper, we describe the design of a novel component for optical character recognition (OCR) of the Latin characters extracted from the text. Building a corpus for this task is extremely challenging, due to the complexity of segmenting the characters and of reading ancient scripts [5]. For this project, we implemented a custom crowdsourcing platform, employing more than a hundred high-school students to manually label the dataset. After a data augmentation process, the result was an inexpensive, high-quality dataset of 23000 characters. Following recent progress in deep learning [8], we designed a deep convolutional neural network (CNN) for the classification step. In recent years, deep CNNs have become the de facto standard for complex OCR problems [2, 3]. Our trained deep CNN achieves an overall accuracy of 96% on an independent test set, which is one of the highest results reported in the literature so far. The aim of this paper is to show the effectiveness of the classification step; the evaluation of the complete pipeline of [1] is out of our current scope.

Structure of the paper. The rest of the paper is structured as follows. After discussing related projects in Section 2, we detail the construction of our annotated dataset in Section 3, and the design (and training) of the CNN in Section 4. We experimentally evaluate the network in Section 5, before discussing future work in Section 6.

2 Related Work

Due to the many challenges involved in a fully automatic transcription of historical handwritten documents, in recent years many researchers have focused on solving easier sub-problems, most notably keyword spotting [11]. However, as more and more libraries and archives worldwide digitize their collections, great effort is being put into the creation of full-fledged transcription systems [4].

One of the largest efforts to this end was the EU-funded tranScriptorium project [12], which resulted, among others, in the transcription of a relatively large corpus of Dutch handwritten documents from the 15th century. Several competitions have been organized on the datasets released by the tranScriptorium project [13]. State-of-the-art algorithms from these challenges generally follow a segmentation-free approach, where it is not necessary to individually segment each character (segmenting and recognizing a character are two heavily interdependent processes, a difficulty known as Sayre's paradox [14]). While this removes one of the hardest steps in the process, it requires full-text transcriptions for the training corpus, in turn requiring expensive labeling procedures involving paleographers expert in the period under consideration. To overcome this limitation and reduce the training costs, In Codice Ratio focuses on character-level classification, allowing us to collect a large corpus of annotated data with a cheap crowdsourcing procedure.

3 Dataset Collection

The dataset is collected from high-resolution (300 dpi, 2136 × 2697 pixels) scans of 30 pages coming from register 12 of Pope Honorius III. All pages are in the so-called Caroline minuscule script, which spread in Western Europe during Charlemagne's reign and became a standard under the Holy Roman Empire. Compared to similar scripts, Caroline minuscule writing is relatively regular and has fewer ligatures. A sample text is shown in Fig. 1a.

Fig. 1: (a) Sample text from the manuscript Liber septimus regestorum domini Honorii pape III, in the Popes' Registers of the Vatican Secret Archives. (b) Proposed segmentation cut-points for the word "culpam". We use green for actual character boundaries, and red otherwise.

All pages are pre-processed according to the workflow in [1]: we first remove the background, split the text into lines, and then extract tentative character segmentations, as shown in Fig. 1b. Each tentative character is then fed to the OCR system, built on top of a deep CNN, described in the next section. A further sub-system based on a hidden Markov model is then in charge of selecting the most probable word transcription among all the possible segmentations of the word. In this paper we focus on the design of the OCR system, and we refer to [1] for a more accurate description of the first and third steps.
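The concrete implementation of this pipeline is described in [1]. Purely as an illustration, the sketch below shows how the first two steps (background removal and line splitting) could be approximated with OpenCV, using Otsu binarization and a horizontal projection profile; the function names, the `min_ink` threshold, and the projection heuristic are our assumptions, not the project's actual code.

```python
# Illustrative sketch only: background removal and line splitting,
# approximating the pre-processing workflow described in [1].
# Thresholds and the projection heuristic are assumptions.
import cv2
import numpy as np

def binarize(page_path: str) -> np.ndarray:
    """Load a scanned page and remove the background via Otsu thresholding."""
    gray = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    # THRESH_BINARY_INV makes ink white (255) and parchment black (0).
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary

def split_lines(binary: np.ndarray, min_ink: int = 5) -> list:
    """Split a binarized page into text lines using the horizontal
    projection profile: rows with little ink separate consecutive lines."""
    profile = binary.sum(axis=1) // 255       # ink pixels per row
    is_text = profile > min_ink
    lines, start = [], None
    for row, flag in enumerate(is_text):
        if flag and start is None:
            start = row                       # a text line begins
        elif not flag and start is not None:
            lines.append(binary[start:row])   # a text line ends
            start = None
    if start is not None:
        lines.append(binary[start:])
    return lines
```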
Character classes. We take into account minuscule characters of the Latin alphabet, yielding initially 19 classes (a, b, c, d, e, f, g, h, i, l, m, n, o, p, q, r, s, t, u) plus one special non-character class ⊗. Since our dataset includes multiple versions of the characters "d" and "s", we split class d into two classes (d1 and d2), and class s into three (s1, s2 and s3). The different character shapes and the corresponding labels are shown in Fig. 2. In total, we have 23 classes: 22 character classes plus the special non-character class ⊗.

Fig. 2: Different shapes of the characters "d" and "s": (a) d1, (b) d2, (c) s1, (d) s2, (e) s3.

Crowdsourcing. To collect annotations on the segmentations of the manuscript words, we developed a custom crowdsourcing platform. We enrolled 120 high-school students in the city of Rome, who did the labeling as part of a work-related learning program. The task to perform was simple: given positive and negative examples of a target character, each student was required to select all matching images from a grid appearing on the platform. Fig. 3 shows a screenshot of a task. Each task consists of 40 images, arranged in a grid, each with its own check-box. Every time the check-box is marked, the image receives a vote. The label of an image corresponds to the most voted character among those with at least 3 votes (in our experiments, we did not observe any tie). If there is no such character, the image is labelled with the special non-character class, denoting a wrong segmentation.

Fig. 3: Sample screen of our platform.

Characters with fewer than 1K examples were augmented to match the required quantity and balance the training set. The augmentation process involves slight random rotations, zooming, shearing, and both vertical and horizontal shifts. Before training, all image values are normalized to the range [0, 1]. The final dataset comprises 23K examples, evenly split among the 23 classes, and is available online at http://www.dia.uniroma3.it/db/icr/.
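As an illustration, this kind of augmentation can be reproduced with a standard Keras generator. The sketch below is not the project's code: the paper specifies the kinds of transformations but not their magnitudes, so all numeric ranges here are our assumptions.

```python
# Illustrative sketch of the augmentation step with tf.keras.
# Transformation ranges are assumptions; only the kinds of transformations
# (rotation, zoom, shear, shifts) and the [0, 1] rescaling come from the text.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,         # slight random rotation (degrees)
    zoom_range=0.1,           # slight random zooming
    shear_range=0.1,          # slight random shearing
    width_shift_range=0.05,   # horizontal shift (fraction of width)
    height_shift_range=0.05,  # vertical shift (fraction of height)
    rescale=1.0 / 255.0,      # normalize pixel values to [0, 1]
)

# x_train: array of shape (num_examples, 56, 56, 1); y_train: one-hot labels.
# flow() yields augmented mini-batches indefinitely:
# batches = augmenter.flow(x_train, y_train, batch_size=128)
```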
4 Network Architecture

Our deep CNN takes as input 56 × 56 single-channel images, which are binarized before training. The input is propagated through 8 adaptable layers, whose design is inspired by similar networks that have recently achieved state-of-the-art results in modern OCR [8]. First, we apply a convolutional layer with 42 filters of size 5 × 5 and stride 1. Second, the output of the convolutional layer is fed to a rectified linear (ReLU) nonlinearity, applied element-wise:

    g(s) = max{0, s} .    (1)

The output of the ReLU is down-sampled with a max-pooling operation of stride 2 × 2, to reduce the number of adaptable parameters. The previous three operations (convolution, nonlinearity, and max-pooling) are repeated another two times, using 28 filters for the convolutional layers instead of 42. The output of the last convolutional layer is then flattened and fed through a fully connected layer with 100 neurons and ReLU nonlinearities, and a final output layer with a softmax activation function, providing a probability distribution over the 23 classes. In order to prevent overfitting, we apply 50% dropout during training [8] before each of the nonlinearities.

We minimize a regularized cross-entropy loss given by:

    J(w) = − ∑_{i=1}^{N} ∑_{k=1}^{K} y_{i,k} log(ŷ_{i,k}) + λ‖w‖² ,    (2)

where N is the number of examples in the training dataset, K = 23 is the number of classes, y_{i,k} is the correct output of the kth class for the ith input, ŷ_{i,k} is the corresponding predicted output of the network, w is the vector of adaptable parameters of the network, and λ > 0 is a regularization factor. The regularization factor is selected as λ = 0.001 by a grid search over exponentially spaced values, computing the accuracy on a held-out validation set of 2500 examples taken from the original training set. This validation set is also used to select a stopping point for the optimization procedure. We minimize (2) with the Adam algorithm [9], on randomly sampled mini-batches of 128 elements, until the validation accuracy stops improving (200 epochs), using the default hyperparameters of [9]. The final network is then tested on a further independent test set of 2300 examples.
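The paper does not include code; the following tf.keras sketch is our reading of the architecture just described (three convolution/ReLU/max-pooling blocks with 42, 28, and 28 filters of size 5 × 5, a 100-unit fully connected layer, a 23-way softmax, 50% dropout before each nonlinearity, and L2 weight decay with λ = 0.001). Details such as padding and the exact dropout placement are assumptions.

```python
# Illustrative tf.keras reading of the architecture in Section 4.
# Padding and dropout placement are assumptions; filter counts, layer sizes,
# dropout rate, and lambda come from the text.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(0.001)  # regularization factor lambda of Eq. (2)

inputs = tf.keras.Input(shape=(56, 56, 1))   # binarized single-channel images
x = inputs
for filters in (42, 28, 28):
    x = layers.Conv2D(filters, 5, strides=1, padding="same",
                      kernel_regularizer=l2)(x)
    x = layers.Dropout(0.5)(x)               # 50% dropout before nonlinearity
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
x = layers.Flatten()(x)
x = layers.Dense(100, kernel_regularizer=l2)(x)
x = layers.Dropout(0.5)(x)
x = layers.ReLU()(x)
outputs = layers.Dense(23, activation="softmax")(x)  # 23 classes incl. ⊗

model = tf.keras.Model(inputs, outputs)
# Adam with default hyperparameters, cross-entropy loss, mini-batches of 128;
# training stops when validation accuracy plateaus (around 200 epochs).
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=200,
#           validation_data=(x_val, y_val))
```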
5 Experimental Results

Overall accuracy reached 96%, while per-class precision, recall and F1-measure are reported in Fig. 5 (the support is always 100). The confusion matrix is shown in Fig. 4. Some typical errors are the following.

• Characters "f" and "s1" are easily confused, due to their similar shapes. Specifically, ≈ 8% of "s1" instances are labelled as "f", and ≈ 14% of "f" instances as "s1".
• Images not containing any character are sometimes mis-classified as actual characters, mainly as "m". Specifically, ≈ 10% of non-character images are labelled as "m", and ≈ 15% are labelled as some other character.

For comparison purposes, a simple logistic regression classifier on the same dataset achieves an average of 80% precision and 79% recall.

Fig. 4: Confusion matrix for the test set.

Class  Prec.  Rec.   F1
a      0.98   0.99   0.99
b      0.98   0.97   0.97
c      0.95   1.00   0.98
d1     0.97   0.98   0.98
d2     0.92   0.98   0.95
e      0.99   0.98   0.98
f      0.89   0.85   0.87
g      0.97   0.99   0.98
h      0.96   0.97   0.97
i      0.98   0.96   0.97
l      0.96   0.99   0.98
m      0.91   0.99   0.95
n      0.99   1.00   1.00
o      0.98   0.91   0.94
p      0.98   0.97   0.97
q      1.00   0.99   0.99
r      0.94   0.96   0.95
s1     0.86   0.90   0.88
s2     0.99   0.94   0.96
s3     0.95   1.00   0.98
t      0.94   0.97   0.96
u      1.00   1.00   1.00
⊗      0.95   0.74   0.83
avg    0.96   0.96   0.96

Fig. 5: Per-class results on the test set.

Convolution visualization. Fig. 6a shows the effect of the filters learned by our network at the first level. Specifically, we show the result of convolving a sample input image with the first-layer filters, after the activation function (blue denotes positive values). Visually inspecting activation outputs is indeed useful for debugging purposes. In the figure, the effect of edge and lighting detection filters is clearly visible.

Gradient ascent. Given the filters learned by our network, we perform gradient ascent over the input image (initially random), maximizing the output of each filter separately. This is a common step to visualize what the network has learnt to recognize [16]. Intuitively, we generate synthetic images that maximize the activation of the filters of each layer, including the output layer. In deep CNNs, the first layers usually detect simple features, and the features become more complex and abstract as the layers go deeper. The result of this experiment for a sample of our filters is shown in Fig. 6b. The figure suggests that the first layer of our network is in charge of detecting edges, the second layer exhibits more complex, geometrical patterns, and the third and deepest layer seems to detect whole character strokes.

Fig. 6: (a) Input image "q" convolved with kernels from the first convolutional layer, shown after the activation function. (b) Examples of images generated by gradient ascent over filters of the first, second, and third convolutional layers.
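As a hedged illustration of this visualization technique [16], gradient ascent over the input can be implemented with tf.GradientTape, assuming `model` is a tf.keras network such as the one sketched in Section 4. The `layer_name` and `filter_index` parameters, the step size, and the iteration count are our own hypothetical choices.

```python
# Illustrative sketch of gradient ascent on the input image [16].
# Assumes `model` is a functional tf.keras model; step size and number of
# iterations are arbitrary choices.
import tensorflow as tf

def maximize_filter(model, layer_name, filter_index,
                    steps=100, step_size=0.1):
    """Synthesize an input that maximizes the mean activation of one filter."""
    # Sub-model exposing the activations of the chosen layer.
    layer = model.get_layer(layer_name)
    extractor = tf.keras.Model(model.inputs, layer.output)

    # Start from a small random image and ascend the activation gradient.
    image = tf.Variable(tf.random.uniform((1, 56, 56, 1)) * 0.1 + 0.45)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = extractor(image)
            loss = tf.reduce_mean(activation[..., filter_index])
        grad = tape.gradient(loss, image)
        grad = grad / (tf.norm(grad) + 1e-8)  # normalize for stable steps
        image.assign_add(step_size * grad)    # gradient *ascent*
    return image.numpy().squeeze()
```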
6 Conclusions

In this paper, we have described the collection of a large corpus of annotated Latin characters, and the design of a novel deep convolutional network for the classification step. The described system is a key component of the In Codice Ratio project, whose aim is to fully transcribe a large corpus of documents held in the Vatican Secret Archives [1]. Preliminary results with the entire system show that the framework is able to reach around 80% of word-error rate on the pages under consideration. A thorough evaluation of the entire system (including the segmentation step) is ongoing work.

Future work will require the design of a fully differentiable system to replace the currently hand-tuned segmentation step. Recently, indeed, some authors have proposed the use of recurrent networks to process the entire text sequentially [6, 7]. While these methods still require the annotation of the entire text, the annotations can be noisy, and the results obtained are generally higher than those of related systems based on hidden Markov models.

Acknowledgments

We thank Debora Benedetto, Elena Bernardi and Riccardo Cecere for their help with the pre-processing steps and the crowdsourcing application. Finally, we are indebted to all the teachers and students of Liceo Keplero and Liceo Montale who joined the work-related learning program and did all the labeling.

References

1. S. Ammirati, D. Firmani, M. Maiorino, P. Merialdo, E. Nieddu, and A. Rossi. In Codice Ratio: Scalable transcription of historical handwritten documents. In 25th Italian Symposium on Advanced Database Systems (SEBD), 2017. To appear.
2. D. Cireşan and U. Meier. Multi-column deep neural networks for offline handwritten Chinese character classification. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–6. IEEE, 2015.
3. D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
4. A. Fischer. Handwriting Recognition in Historical Documents. PhD thesis, University of Bern, 2012.
5. A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser, and M. Stolz. Ground truth creation for handwriting recognition in historical documents. In 9th IAPR International Workshop on Document Analysis Systems (DAS), pages 3–10. ACM, 2010.
6. A. Fischer, M. Wüthrich, M. Liwicki, V. Frinken, H. Bunke, G. Viehhauser, and M. Stolz. Automatic transcription of handwritten medieval documents. In 15th IEEE International Conference on Virtual Systems and Multimedia (VSMM), pages 137–142. IEEE, 2009.
7. V. Frinken, A. Fischer, H. Bunke, and R. Manmatha. Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents. In 2010 International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 352–357. IEEE, 2010.
8. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
9. D. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.
10. J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
11. M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós. Efficient segmentation-free keyword spotting in historical document collections. Pattern Recognition, 48(2):545–555, 2015.
12. J. A. Sánchez, V. Bosch, V. Romero, K. Depuydt, and J. de Does. Handwritten text recognition for historical documents in the tranScriptorium project. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH), pages 111–117. ACM, 2014.
13. J. A. Sánchez, V. Romero, A. H. Toselli, and E. Vidal. ICFHR2014 competition on handwritten text recognition on tranScriptorium datasets (HTRtS). In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 785–790. IEEE, 2014.
14. K. M. Sayre. Machine recognition of handwritten words: A project report. Pattern Recognition, 5(3):213–228, 1973.
15. U. Springmann, D. Najock, H. Morgenroth, H. Schmid, A. Gotscharek, and F. Fink. OCR of historical printings of Latin texts: Problems, prospects, progress. In First International Conference on Digital Access to Textual Cultural Heritage (DATeCH), pages 71–75. ACM, 2014.
16. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.