In Codice Ratio: Scalable Transcription of Historical Handwritten Documents (Extended Abstract)

Serena Ammirati1, Donatella Firmani1, Marco Maiorino2, Paolo Merialdo1, Elena Nieddu1, and Andrea Rossi1

1 Roma Tre University
serena.ammirati,donatella.firmani@uniroma3.it, merialdo@dia.uniroma3.it, ema.nieddu,and.rossi.516@gmail.com
2 Vatican Secret Archives (Archivum Secretum Apostolicum Vaticanum)
m.maiorino@asv.va

Abstract. Huge amounts of handwritten historical documents are being published by digital libraries worldwide. However, for these raw digital images to be really useful, they need to be annotated with informative content. State-of-the-art Handwritten Text Recognition (HTR) approaches require an impressive training effort by expert paleographers. Our contribution is a scalable, end-to-end transcription workflow – that we call In Codice Ratio – based on fine-grained segmentation of text elements into characters and symbols, with limited training effort. We provide a preliminary evaluation of In Codice Ratio over a corpus of letters by Pope Honorius III, stored in the Vatican Secret Archives.

1 Introduction

Large document collections are sources of important correlations between entities such as people, events, places, and organizations. Previous studies [7] have shown that it is possible to detect macroscopic patterns of cultural change over periods of centuries by analyzing large textual time series. Such automatic methods promise to empower scholars with a quantitative and data-driven tool to study culture and society, but their power has been limited by the amount of digitally transcribed sources. Indeed, the World Wide Web only contains a small part of the traditional archives. (It is evocative to think that it may only contain a few millimeters out of the 85 km of linear shelves in the Vatican Secret Archives.)

Recently, many historical archives have begun to digitize their assets, sharing high-resolution images of the original documents. Notable examples include the Bibliothèque nationale de France3, the Virtual Manuscript Library of Switzerland4, and the Vatican Apostolic Library5. In this scenario, expert paleographers can largely benefit from computer-assisted transcription technologies. This includes not only full transcriptions, but also partial transcriptions and other kinds of automatically produced metadata, useful for indexing and searching.

3 http://gallica.bnf.fr
4 http://www.e-codices.unifr.ch/en
5 http://www.digitavaticana.org

Fig. 1: (a) Fragments. (b) Sample text from the manuscript Liber septimus regestorum domini Honorii papae III, in the Vatican Registers.

Popular automatic tools for transcribing the text content of digital images include Optical Character Recognition (OCR) systems, which work well for typewritten text but are not suitable for handwritten text recognition (HTR). Since most documents digitized by historical archives are manuscripts, HTR has recently gained more and more attention from researchers worldwide. Handwritten text is more challenging to recognize than typewritten text because characters have less regular shapes, and are often combined into single units known as ligatures. Therefore, while OCR systems are trained to recognize individual typewritten glyphs, most state-of-the-art HTR systems use holistic approaches: all text elements (sentences, words, and characters) of a single text line are recognized as a whole, without any prior segmentation of the line into these elements.
So-called segmentation-free models [13] can be trained automatically using well-known techniques, but they require full transcripts of a number of these unsegmented images. In order to use these technologies, users with experience in handwritten document transcription are required to manually transcribe significant portions of the original documents. Currently available HTR technologies are therefore still far from offering scalable automated solutions.

Our contribution. The approach of our project In Codice Ratio lies in the middle of a spectrum that has OCR systems on one side and, on the other, segmentation-free HTR technologies, which recognize larger handwritten elements. Rather than relying on well-segmented glyphs (as in typewriting) or training a system to recognize whole words, we focus on overlapping "fragments" of words composed of zero, one, or two (rarely three) characters, as in Figure 1a. For each word, we compute possible cut-points yielding different ways of segmenting the word into fragments. Then, we choose the best among the different segmentation options using OCR and language models. Cut-points are managed similarly to on-line HTR systems [6], where the text is written on a touch-screen and a sensor captures all the pen-tip movements. Since we do not have access to pen movements, we need a number of labelled fragments6 for training our model. Training with fragments has two advantages over training with words (as in segmentation-free HTR):

• the number of fragments needed for training is much smaller, because it does not need to cover the impressive variety of the lexicon;
• fragments can be labeled by volunteers in large transcription projects, with little or no expertise, provided that adequate examples are given. To this end, we set up a simple crowd-sourcing application.

6 A fragment can also have an empty transcription when it does not contain any character.

Proof of Concept. Our hybrid workflow takes the best of two worlds: we can handle challenging ligatures as in state-of-the-art HTR, and at the same time we require limited training effort, like typical OCR systems. Our system is not yet mature enough for a thorough experimental evaluation, but for the sake of demonstration we consider the "Vatican Registers" corpus in the Vatican Secret Archives. The Vatican Registers is a huge collection of volumes (more than 18,000 pages) produced in the 13th century, containing the official correspondence of the Roman Curia, such as political letters, opinions on legal questions, and documents addressed to various religious institutes throughout Europe. Such records are of unprecedented historical relevance, have not been transcribed yet, and have a regular writing style (see Figure 1b). For these reasons, we believe they can motivate our work. Our crowd-sourcing application enrolled 120 high-school students in the city of Rome, who did the labelling as part of their work-related learning program. The program is jointly organized by the engineering and humanities departments of Roma Tre University, and includes frontal lessons on a variety of topics, such as paleography, history, and machine learning, thus also serving as school guidance.

2 Related Works

The idea of exploiting large textual corpora to detect macroscopic cultural trends has been discussed for many years [11], promising to empower historians and other humanities scholars with a tool for the study of culture and society.

Data-driven history.
Many studies have been published over the past few years [4, 10] about a quantitative and data-driven approach to the study of cultural change and continuity. A seminal study of 5 million English-language books published over an arc of 200 years [9] showed the potential of this approach, for example by measuring the time required for various technologies to become established, or the duration of celebrity for various categories of people.

Fig. 2: Our workflow.

Handwritten text recognition. HTR can be defined as the ability to transform handwritten input, represented as graphical marks, into a symbolic representation such as ASCII text. According to the mode of data acquisition used, HTR systems can be classified into off-line and on-line. In off-line systems the handwriting is given as an image or scanned text, without time-sequence information. In on-line systems the handwriting is given as a temporal sequence of coordinates that represents the pen-tip trajectory. For high-quality text images, current state-of-the-art HTR prototypes provide accuracy levels that range from 40% to 80% at the word level [12, 2]. On the other hand, on-line systems are more accurate [1, 5, 6], reaching 90% word-level accuracy in some cases.

Crowd-sourcing. Expected users of HTR technology belong mainly to two groups:
• individual researchers with experience in handwritten documents;
• volunteers who collaborate in large transcription projects.
Recent HTR projects [14, 8] expose HTR tools through specialised crowd-sourcing web portals, supporting collaborative work.

Other works. Language modeling for on-line handwriting recognition bears many similarities with OCR and speech recognition, which often employ statistical n-gram models at the character or word level [15].

3 System Workflow

Our system first segments every word into small (possibly overlapping) fragments, then recognizes the characters in each fragment, and finally the entire word. The main steps are shown schematically in Figure 2.

1. Pre-processing. In this phase the color image of a page is cropped into lines and words. The image is also transformed into a bi-chromatic one.
2. Cut-level operations. For each word we guess cut-points for characters and build the so-called segmentation lattice data structure [6], such that each path in the lattice represents a way of segmenting the word.
3. Fragment-level operations. For each pair of cut-points we crop the corresponding text fragment and classify it, choosing among the known characters.
4. Word-level operations. We finally return the best path in the labelled lattice, which represents a way of transcribing the word.

3.1 Transcription Algorithms

In this section we describe the inner phases of our system. The inner phases are designed for the Carolingian minuscule script, which is used in the manuscript of our proof of concept (see Figure 1b of Section 1).

Cut-level operations. The input of on-line HTR consists of pen-up/pen-down switching movements, so cut-points for character boundaries can be selected at certain positions of each stroke. Since we do not have access to stroke sequences, we design a simple heuristic based on local minima of the black-pixel distribution, as shown in Figure 3a for a sample occurrence of the word "culpam". Then, we consider all the possible segmentation options induced by the cut-points and let the subsequent phases select the best one. We build a segmentation lattice where each start-end path represents a segmentation option:
• Each cut-point corresponds to a node. The leftmost cut-point is the start node, and the rightmost the end node.
• Each edge corresponds to the fragment of the word bounded by its endpoints.
We observed experimentally that edges corresponding to fragments smaller than 8 pixels or larger than 34 pixels can be safely dropped from further consideration. The segmentation lattice for our "culpam" example is shown in Figure 3b. In the figure, edges have labels, as will be clarified in the next section.
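To make the lattice construction concrete, the following is a minimal sketch of one possible implementation, assuming a bi-chromatic word image given as a NumPy array. The function names, the local-minimum test, and the use of networkx for the lattice are our own illustrative choices rather than the project's actual code; only the 8- and 34-pixel width bounds come from the paper.

```python
import numpy as np
import networkx as nx  # the lattice is a small DAG; networkx keeps the sketch short

MIN_FRAGMENT_PX, MAX_FRAGMENT_PX = 8, 34  # fragment width bounds reported above

def cut_points(word_img):
    """Guess cut-points as local minima of the column-wise ink distribution.

    word_img is a bi-chromatic (0/1) image of a single word, where 1 means ink.
    This is a simplified stand-in for the heuristic described in the paper.
    """
    ink_per_column = word_img.sum(axis=0)
    cuts = [0]  # leftmost cut-point (start node)
    for x in range(1, word_img.shape[1] - 1):
        if ink_per_column[x] <= ink_per_column[x - 1] and ink_per_column[x] <= ink_per_column[x + 1]:
            cuts.append(x)
    cuts.append(word_img.shape[1] - 1)  # rightmost cut-point (end node)
    return sorted(set(cuts))

def segmentation_lattice(word_img):
    """Build the segmentation lattice: nodes are cut-points, edges are fragments."""
    cuts = cut_points(word_img)
    lattice = nx.DiGraph()
    lattice.add_nodes_from(cuts)
    for i, left in enumerate(cuts):
        for right in cuts[i + 1:]:
            width = right - left
            # Drop fragments outside the empirically safe width range.
            if MIN_FRAGMENT_PX <= width <= MAX_FRAGMENT_PX:
                lattice.add_edge(left, right, fragment=word_img[:, left:right])
    return lattice
```

Every start-to-end path of the returned graph corresponds to one candidate segmentation of the word into fragments, which the fragment- and word-level phases described next are then in charge of labeling and scoring.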
Fragment-level operations. We call fragment any squared portion of a word that is bounded by two cut-points, including incomplete characters, combinations of incomplete characters, or multiple characters together. Each edge of the segmentation lattice corresponds to a different fragment, and each path corresponds to a different segmentation option of the word into fragments. Our next step is an OCR step, where each fragment/edge of the segmentation lattice is labeled with the result of a character classifier. There are many principled ways to approach the classification task at hand. We decide to use a convolutional Neural Network (NN), which is one of the most common and popular approaches [3]. Since the NN returns a score distribution, rather than a unique answer, we add multiple edges between the same two nodes. Some of the fragments are submitted to a crowd-sourcing application for training the NN. The application guides the workers, providing sample images of characters and highlighting variations in shape and style. Therefore, workers without experience in transcription can give answers based on the perceived similarity of fragments to the sample images. We refer to the resulting data structure as the labeled lattice.

Fig. 3: Cut-points for the word "culpam" and corresponding lattice. We use green for actual character boundaries, and red otherwise.

Word-level operations. Labeled paths represent candidate transcriptions for the current word. In order to find the best transcription, we use language models (LM). A statistical LM is a representation of a certain language as a probability distribution over sequences of words or characters. In other words, a LM expresses the likelihood that certain sequences of words or characters appear in texts written in the language under analysis. To this end, we downloaded a large medieval Latin corpus (≈ 1.5M words) and computed 3-gram frequencies. Then, we select the best path by maximizing the corresponding word probability. For instance, for the path "culham" in Figure 3b we have7

p($culham∧) = p(c|$) p(u|$c) p(l|cu) p(h|ul) p(a|lh) p(m|ha) p(∧|am),

where $ and ∧ are special symbols denoting the beginning and the end of a word. Every term in the product can be computed directly from the language model.

7 Using a 2nd-order Markov assumption.
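As an illustration of this word-level step, the sketch below scores a candidate transcription with a character 3-gram model under the 2nd-order Markov assumption, using "$" and "^" as start and end symbols as in the formula above. The counting scheme, the add-alpha smoothing, and all names are our own simplifying assumptions, not the exact language model used in In Codice Ratio.

```python
from collections import Counter

START, END = "$", "^"  # word-boundary symbols, as in the path-probability formula

def ngram_counts(corpus_words):
    """Count (context, character) pairs, where the context is the previous one or two symbols."""
    counts = Counter()
    for word in corpus_words:
        chars = [START] + list(word.lower()) + [END]
        for i in range(1, len(chars)):
            context = tuple(chars[max(0, i - 2):i])  # one symbol after "$", two afterwards
            counts[(context, chars[i])] += 1
            counts[context] += 1
    return counts

def path_probability(candidate, counts, alpha=1.0, vocab_size=30):
    """p($w^) as a product of p(char | previous symbols), with add-alpha smoothing.

    vocab_size is a rough guess at the alphabet size (an assumption of this sketch).
    """
    chars = [START] + list(candidate) + [END]
    prob = 1.0
    for i in range(1, len(chars)):
        context = tuple(chars[max(0, i - 2):i])
        numerator = counts[(context, chars[i])] + alpha
        denominator = counts[context] + alpha * vocab_size
        prob *= numerator / denominator
    return prob

# Toy example: rank two candidate transcriptions of the same word image.
counts = ngram_counts(["culpam", "culpa", "criminis", "cullum"])  # stand-in for the Latin corpus
candidates = ["culham", "culpam"]
print(max(candidates, key=lambda c: path_probability(c, counts)))
```

In the full workflow this score would be computed for every start-to-end path of the labeled lattice, and the highest-probability path returned as the transcription of the word.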
4 Experiments

In this section we describe our preliminary experimental results, which serve as a proof of concept for In Codice Ratio. The ideas and methods of the proof of concept have been realized in collaboration with the Vatican Secret Archives, with the aim of demonstrating in principle the practical potential of our system.

Dataset. The "Vatican Registers" corpus consists of 43 parchment registers, for a total of 18650 pages (i.e., writing facades). All the registers are written with the same script: the so-called Cancelleresca. Our dataset consists of 30 pages (≈ 15K characters) of register 12 of Pope Honorius III, who is the only pope with un-transcribed registers and is therefore of most interest for the VSR.

Fig. 4: Sample words for our proof of concept.

Results and future works. Our NN shows 95% accuracy and recall on our dataset, with a 4.6% error rate. The typical errors of the NN are the following.
• The characters "f" and "s" are easily confused, due to their similar shapes. Specifically, ≈ 20% of the "s" occurrences are labelled as "f" and 25% of the "f" occurrences as "s".
• The character "l" is often mis-classified as other "upper" characters, due to spurious ink in the fragment. Specifically, ≈ 72% of the "l" occurrences are labelled as "b" and only ≈ 17% are labelled correctly.
The other characters are labelled correctly more than 96% of the time.

We select 3 words showing the strengths and limits of our method, namely "culpam", "criminis", and "uiuscemod(i)"8. In Table 1 we show the top three paths in the labelled lattice, according to word probability. For "criminis", the right transcription is first in the ranking. For "uiuscemod", the right transcription is not contained in any path, but the maximum-probability path "uiufemod" is correct except for the ligature "sc", which is labelled as "f" as mentioned earlier in this section. This specific problem is due to errors in the training set, where a number of "sc" fragments have been selected as "f" by workers. In the near future, we plan to repeat the training phase after removing such wrong items. Finally, "culpam" is overtaken by "cullum" because of more serious errors in the OCR process ("p" and "a" are labelled as "l" and "u", respectively). While word probability does not help here (the 3-gram "llu" is more frequent than "lpa" in our LM), we are confident that we can correct this kind of error in future work.

8 "uiuscemod" and "i" are processed as separate words.

Future work includes the following.
• More sophisticated path-selection criteria, for instance taking into account the NN output score (the second "l" has a lower score than "p").
• More advanced language model tools, such as Hidden Markov Models. This is currently being developed with promising results.
• Excluding labels that do not fit in the current line margins, for instance excluding an "l" where the lower margin contains black pixels, as for "p".

Table 1: Path probabilities for the words in Figure 4.

  word           path        prob
  culpam         cullum      1 · 10^-4
                 culpam      8 · 10^-7
                 culluni     3 · 10^-7
  criminis       criminis    8 · 10^-7
                 crinunis    6 · 10^-8
                 crinuius    4 · 10^-8
  uiuscemod(i)   uiufemod    2 · 10^-2
                 uuifemod    2 · 10^-3
                 uiiifemod   5 · 10^-4

5 Conclusions

In Codice Ratio is an automatic transcription workflow with low training effort. Our proof of concept is carried out on a high-resolution digitized copy of the registers of Pope Honorius III. Manuscript pages first undergo a series of transformations that extract a clean version of the text image. Then, the preprocessed pages are decomposed into fragments containing basic text elements, such as characters and symbols. Some fragments are labelled by unskilled crowd-sourcing workers, who are simply asked to match images to template symbols selected by paleographers. The labelled symbols are used to train a Neural Network, which is in charge of automatically computing labels for all the un-labelled fragments. The automatically computed labels are aggregated at the word level into a segmentation lattice, in which all the traversing paths represent candidate transcriptions for the word. The best path is selected based on language models. Our proof of concept suggests that In Codice Ratio can be applied to the large collection of Vatican Registers in the Vatican Secret Archives, subject to specific improvements that are the matter of ongoing work.
Acknowledgments

We thank Gaetano Bonofiglio, Veronica Iovinella and Andrea Salvoni for their work on the OCR neural network. We thank Matteo Mariani for helping with all the pre-processing steps. We also thank Gianlorenzo Didonato for developing the front-end of the In Codice Ratio crowd-sourcing application. Finally, we are indebted to all the teachers and students of Liceo Keplero and Liceo Montale who joined the work-related learning program and did all the labelling effort.

References

1. Character recognition experiments using unipen data. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, ICDAR '01, pages 481–, Washington, DC, USA, 2001. IEEE Computer Society.
2. R. Bertolami and H. Bunke. Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recognition, 41(11):3452–3460, 2008.
3. D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
4. I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini. Research methods in the age of digital journalism: Massive-scale automated analysis of news-content: topics, style and gender. Digital Journalism, 1(1):102–116, 2013.
5. S. Jaeger, S. Manke, J. Reichert, and A. Waibel. Online handwriting recognition: the NPen++ recognizer. International Journal on Document Analysis and Recognition, 3(3):169–180, 2001.
6. D. Keysers, T. Deselaers, H. A. Rowley, L.-L. Wang, and V. Carbune. Multi-language online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
7. T. Lansdall-Welfare, S. Sudhahar, J. Thompson, J. Lewis, F. N. Team, and N. Cristianini. Content analysis of 150 years of British periodicals. Proceedings of the National Academy of Sciences, 114(4):E457–E465, 2017.
8. A. Marcus, A. Parameswaran, et al. Crowdsourced data management: Industry and academic perspectives. Foundations and Trends in Databases, 6(1-2):1–161, 2015.
9. J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
10. F. Moretti. Distant Reading. Verso Books, 2013.
11. R. Reddy and G. StClair. The million book digital library project. http://www.rr.cs.cmu.edu/mbdl.htm, 2016. Accessed December 19, 2016.
12. V. Romero, V. Alabau, and J. M. Benedí. Combination of n-grams and stochastic context-free grammars in an offline handwritten recognition system. In Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part I, IbPRIA '07, pages 467–474, Berlin, Heidelberg, 2007. Springer-Verlag.
13. V. Romero, J. A. Sánchez, V. Bosch, K. Depuydt, and J. de Does. Influence of text line segmentation in handwritten text recognition. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 536–540, Aug 2015.
14. J. A. Sánchez, G. Mühlberger, B. Gatos, P. Schofield, K. Depuydt, R. M. Davis, E. Vidal, and J. De Does. tranScriptorium: a European project on handwritten text recognition. In Proceedings of the 2013 ACM Symposium on Document Engineering, pages 227–228. ACM, 2013.
15. A. Stolcke et al. SRILM: an extensible language modeling toolkit. In Interspeech, volume 2002, page 2002, 2002.