Ground-truth Free Evaluation of HTR on Old French
and Latin Medieval Literary Manuscripts
Thibault Clérice
Centre Jean Mabillon, École nationale des Chartes & INRIA


                                         Abstract
                                         As more and more projects openly release ground truth for handwritten text recognition (HTR), we
                                         expect the quality of automatic transcription to improve on unseen data. Getting models robust to
                                         scribal and material changes is a necessary step for specific data mining tasks. However, evaluation
                                         of HTR results requires ground truth against which predictions can be compared statistically. In the
                                         context of modern languages, successful attempts to evaluate quality have been made using lexical
                                         features or n-grams. This, however, proves difficult in the context of the spelling variation that both
                                         Old French and Latin exhibit, even more so in the context of sometimes heavily abbreviated manuscripts.
                                         We propose a new method based on deep learning in which we attempt to categorize each line into one
                                         of four error rate ranges (0 < 10% < 25% < 50% < 100%) using three different encoders (GRU with
                                         attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach
                                         using early-stopped models, as an alternative to rule-based fake predictions. Our models largely
                                         outperform the n-gram approach. We also provide an example application to qualitatively analyse our
                                         classifier, using classification of new predictions on a sample of 1,800 manuscripts ranging from the
                                         9th century to the 15th.

                                         Keywords
                                         HTR, OCR Quality Evaluation, Historical languages, Spelling Variation




1. Introduction
Handwritten Text Recognition (HTR) technologies have come a long way over the last five
years, to the point where data mining of medieval manuscripts and HTR-supported critical
editions are becoming less rare nowadays, thanks in part to the user-friendliness of interfaces
such as Transkribus [23] and eScriptorium [26]. HTR, however, often shows limits in its ability
to adapt to other scribes or periods, as it seems to fit specific scripts and languages. For example,
Schoen and Saretto [38] have shown that a model trained over 1,330 lines of the 15th-century
manuscript CCC 198¹ produces around 8.73% CER over test lines of the same manuscript,
drops to 14% on the same text in another manuscript from the same decade, and can rise to
73.23% CER for a manuscript of a different text² even though it is at most 20 years “younger”
and in the same language.


CHR 2022: Computational Humanities Research Conference, December 12 – 14, 2022, Antwerp, Belgium
Email: thibault.clerice@chartes.psl.eu (T. Clérice)
GitHub: https://github.com/ponteineptique (T. Clérice)
ORCID: 0000-0003-1852-9204 (T. Clérice)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




¹ Oxford, Corpus Christi College 198.
² Oxford, Corpus Christi College 201.




   In order to evaluate the consistency of a model on an out-of-domain document, such as an-
other manuscript or a new hand, researchers usually have to create new ground-truth tran-
scriptions to which the model predictions are compared. In this context, it seems out of reach
to leverage with confidence the amount of data that remains dormant in the open data vaults of
libraries such as the Bibliothèque Nationale de France (BnF) for statistical studies, making the
50,149 IIIF manifests catalogued by Biblissima’s portal [18] promising while leaving a bitter
taste of unavailability: it would require the manual transcription of at least a few hundred
lines for each manuscript³.
   To address this, we can approach this issue not as an HTR problem but rather as a Natural
Language Processing (NLP) task, evaluating the apparent “correctness” of the acquired text
rather than its direct relationship with the digital picture of the manuscript. Evaluating new
transcriptions without ground truth has been done before, but mainly for OCR and non-historical
documents. For modern languages, where spelling is fixed and grammar stable, a dictionary
approach in combination with some n-gram statistics has provided a solid framework for estab-
lishing the probability that a document is well transcribed. However, for languages such as Old
French or Medieval Latin, both evolving over the span of a few centuries, the issue is different.
For example, Camps, Clérice, Duval, Ing, Kanaoka, and Pinche [3] have catalogued 36 forms of
the word cheval (horse) in the largest available Old French corpus. A dictionary approach would
already prove complex, but to make things worse, the abbreviated nature of medieval texts would
require taking into account several abbreviation systems, making it unsustainable.
   In the humanities, HTR is most often not a task in itself but rather a preliminary step for
corpus building (such as digital editions) or corpus analysis. In this context, HTR quality can be
of primordial importance, depending on the task at hand. While Eder [16] has suggested that
good classification in stylometry is still possible for corpora with noise levels as high as 20%,
even for the smallest feature sets, Camps, Vidal-Gorène, and Vernet [4] demonstrated that, for
HTR, noise leads to accumulating errors throughout post-processing (word segmentation,
abbreviation resolution, lemmatization and POS-tagging), making the post-processed textual
features less reliable than original character n-grams. Some other tasks, such as corpus
linguistics (e.g. semantic drift studies), the study of abbreviation systems such as the one per-
formed by Honkapohja and Suomela [22], or even the training of large language models such
as MacBERTh [28], might require a higher level of precision.
   As such, evaluating the textual quality of an automatic transcription “from afar” is extremely
useful, as it provides solid grounds either to exclude documents from analysis or to help guide
ground-truth creation campaigns in well-funded projects. For cultural heritage institutions, it
can also provide a welcome indicator of which documents could be ingested by a search engine.
We can even imagine situations where these institutions transcribe only a sample of each
element of their collection, and only fully and automatically transcribe the ones that reach
a certain level of quality, thus saving energy and ultimately budget on the computation front.
   From a human reader’s perspective, Springmann, Fink, and Schulz [39] and Holley [21] have
set a threshold of a CER below 10% for good OCR quality.

³ Five million lines would be required for the mentioned set of manifests of the BnF with only 100 lines per
  manuscript. As a comparison point, the accumulated number of lines of manuscript datasets, regardless of
  script or language, publicly available on the HTR-United catalog [7] was 164,418 at the end of August 2022.




Recently, Cuper [15] has proposed an evaluation of OCR quality for heritage text collections,
specifically Dutch newspapers from the 17th century, to distinguish good OCR from bad, using
the aforementioned threshold. They provide a tool, QuPipe, which offers binary classification,
putting text either in the range [0; 10]% of CER or in the remaining range of “bad” OCR. In 2022
as well, Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40] addressed this issue for the
HTR of cultural heritage documents, specifically from the 16th century. They provide a strong
argument for using lexical features and (pseudo-)perplexity scores for HTR quality estimation,
with the specific limitation that the texts they studied, 16th-century Latin correspondence, do
not present as much variation as older languages such as historical German. We also note that
correspondence may be less abbreviated, and that this dataset spans a very short period. In
parallel to these, Clausner, Pletschacher, and Antonacopoulos [8] approached the problem from
a global perspective, from segmentation to OCR, and proposed supervised classification methods.
   In this paper, we address this issue as a supervised classification task, based on a dataset
of around 50,000 lines of ground truth spanning the 9th through to the 15th century.
Following the conclusion of Cuper [15], we augment the number of categories we want to find:
we distinguish Good ([0, 10)%), Acceptable ([10, 25)%), Bad ([25, 50)%), and Very Bad (≥ 50%)
CER ranges. This provides a more fine-grained evaluation of the transcription and allows for
guided transcription campaigns, addressing either the low-hanging fruits (Acceptable) or
the rotten ones. We evaluate three kinds of basic architectures (GRU with attention, BiLSTM
and TextCNN) on line classification using real-life “bad” transcriptions and precomputed CER
scores.
   The resulting models have shown promising results, with quality levels such as Very Bad and
Good being well recognized. In order to evaluate the models and showcase their usefulness, we
also provide an example of a real-life classification application, where 1,800 manuscripts were
randomly selected from the BnF and classified by our best model.
   In summary, the contributions of this paper are:

   1. a new approach for HTR evaluation of historical languages with variable spellings;
   2. a new method to produce ground truth for OCR evaluation that does not rely on
      artificially and manually tuned generation;
   3. an initial evaluation of the output and a quick glance at the state of HTR for Old French
      and Medieval Latin over six centuries.

   The remainder of this paper is organised as follows. We start by addressing the background
in Section 2, in particular the specifics of Old French and Medieval Latin and the notion
of readability. In Section 3, we describe the HTR datasets we used and their particularities.
In Section 4, we describe the architecture of the models, their feature engineering and the
process behind the generation of bad predictions. In Section 5, we describe the set-up of our
model selection and evaluation. Finally, in Section 6, we analyse the results both on the dataset
produced ad hoc (described in Sections 3 and 4) and on completely unseen documents from
the BnF, to showcase the capacities of such models.




2. Background and Related Work
Handwritten Text Recognition, a sibling or sub-task of Optical Character Recognition, aims
at recognising text from digitised manuscripts. In the last five years, the digital humanities
landscape has seen a surge in HTR engines, as well as transcription interfaces that connect
and work well with these engines, from the dominant Transkribus [23] to the open-source
pair of eScriptorium [26] and Kraken [25]. To be able to recognize text, users have to provide
models, which are themselves the result of supervised training on ground truth data (human-
provided transcriptions).
   Printed books have been, over the last few decades, the focus in terms of remediation, from
their analogue form to a digitized picture and finally to a machine-readable (and human-
searchable) text. With the advances in HTR over the last five years, the focus can now shift
to or be shared with materials that have, for the most part, remained inaccessible from a digital
point of view, except as pictures. Latin manuscripts are present during the whole period of
manuscript production in western Europe. Literary Old French manuscripts exist from the 12th
century onward, with only a hundred known surviving manuscripts from the 12th century [5].
Over the span of these seven centuries, multiple forms of handwritten scripts have existed, for
both French and Latin. As an example, the ICFHR2016 Competition on the Classification of
Medieval Handwritings in Latin Script [14] provided ground truth for the classification of 12
main families, of which at least six are represented in our datasets. This diversity makes training
models for HTR quite complex but still a reachable goal, as literary manuscripts in particular
tend to be more readable and stable across different hands.
   Medieval French and Latin present both dialectal and scriptural variation in synchrony on
top of diachronic evolution. Old French’s syntax varies chronologically and geographically.
The spelling is simply variable. While Latin shows some level of variation, it differs from Old
French mostly in its higher rate of abbreviation. These observations are limited to the context
of the datasets at hand, which are literary works (including scholastic, theological and medical
works). The Old French CREMMA Medieval dataset [34] has 0.97% of horizontal tildes and
0.16% of vertical ones, which are markers used in the dataset guidelines to indicate various
similar abbreviation diacritics [35]. Using the same guidelines, the CREMMA Medieval Latin
dataset shows rates of 5.63% and 1.52% for the same characters. This difference could be due
to the nature of the transcribed texts.
   The question of abbreviation and the specificity of medieval literary manuscripts has pro-
voked many discussions on how to transcribe documents, from a completely “diplomatic”
approach with variant letterforms to “semi-diplomatic” approaches. In the last year, three
sets of authors have provided guidance or thoughts around transcription guidelines: Pinche [35]
focusing on Old French, Schoen and Saretto [38] on Middle English, and Guéville and Wris-
ley [20] on Latin. The CREMMA guidelines have been used by 5 other datasets for a total
of 1.15 million characters over fifty manuscripts, which makes them the most diverse and
comprehensive ones for HTR of medieval manuscripts in Latin and Old French.
   The most traditional metrics for HTR and OCR are Word Error Rate (WER) and Char-
acter Error Rate (CER). The first proves complicated to apply to Old French and Medieval
Latin, as spaces in medieval manuscripts tend to vary in size or simply be nonexistent from
a modern perspective, relying on the knowledge of the reader to separate words—or on the
ability of NLP models to separate them [9]. The second works well, with the limitation that
spaces are often the first source of mistakes. CER corresponds to the sum of character
insertions, deletions and substitutions over the total number of characters, thus providing a
fine-grained metric.
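   As an illustration, line-level CER can be computed from the edit distance between a predicted
line and its ground truth. The following sketch is ours (not the paper’s evaluation code) and
assumes a plain Levenshtein distance normalised by the length of the ground truth:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """Sum of edits over the total number of ground-truth characters."""
    return levenshtein(prediction, ground_truth) / len(ground_truth)
```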
   As mentioned in the introduction, both CER and WER require ground truth; other metrics
are currently discussed as alternatives, such as the (pseudo-)perplexity or lexical measures
proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40]. The other approach
to evaluating quality without ground truth is to predict a class of CER, such as the work done
by Bazzo, Lorentz, Suarez Vargas, and Moreira [1]. These approaches rely on features such as
n-grams, word statistics and language classifier outputs, which are difficult to leverage in the
present context. In order to train their classifiers, Bazzo, Lorentz, Suarez Vargas, and Moreira [1]
and Nguyen, Jatowt, Coustaty, and Doucet [32] engineered bad predictions by creating rules to
reproduce the most common errors in OCR, such as “rn” becoming “m”. These bad predictions
are then fed to their models along with the metrics both papers want to predict.
   Nguyen, Jatowt, Coustaty, and Doucet [32] provide an innovative approach to the issue of
noise in OCR by shifting from a CER/WER problem to a readability one: if the reader “can
reod a txt with miffpelling” without having to refer back to the picture, at least one of the
goals of OCR has been achieved. As simply put by Martinc, Pollak, and Robnik-Šikonja [30],
“Readability is concerned with the relation between a given text and the cognitive load of a
reader to comprehend it”. It is even more important in the context of handwritten documents,
where a somewhat flawed but readable HTR output can be easier for non-specialists to read than
the original. In the field of readability assessment, Martinc, Pollak, and Robnik-Šikonja [30]
have shown that supervised models perform adequately, while Nguyen, Jatowt, Coustaty, and
Doucet [32] have shown that this translates to OCR issues as well. To our knowledge, this has
not been applied to any medieval dataset.


3. Dataset
To train the different models, we reused data from various projects, aligned with the same
guidelines used by Pinche [35]. Our experiment was made possible by the open release of
many projects’ datasets, including one MA thesis and one student project [41, 2]. We used the
ground truth of the CREMMA [13] and CREMMALab [34] projects, the Rescribe project [42],
and the GalliCorpora project [36, 37], for a total of 42,292 lines (see Table 1). We include one
dataset of incunabula, which use letterforms similar to those of literary manuscripts (but with
more regularity), while also using an abbreviation system.
  The datasets present not only two main languages but also many different levels of digiti-
zation quality (including old binarization), different kinds of handwriting families, different
abbreviation levels and different genres. For example, while the CREMMA Medieval dataset
focuses more on literary texts, specifically hagiographical and chanson de geste texts, the
CREMMA Medieval LAT corpus offers theological commentaries and medicinal recipes, each
genre having its own specific vocabulary. The dataset in general is skewed towards French and
the gothica handwriting family.
  The transcription guidelines of Pinche [35] provide simplification rules: allographic




Table 1
Training material for our models and our future bad transcription dataset.
    Dataset name                           Project or company       Coverage    Language     Lines   Characters   Manuscripts
    Eutyches                               MA Thesis                  850-900       Latin    2,828       86,832             2
    Caroline Minuscule                     Rescribe                  800-1199       Latin      457       17,155            17
    CREMMA Medieval                        CREMMALab                1100-1499     French    21,656      579,368            14
    CREMMA-Medieval-LAT                    CREMMA                   1100-1599       Latin    6,648      240,291            18
    DecameronFR                            Homework                 1430-1455     French       751       19,821             1
    Données HTR manuscrits du 15e siècle   GalliCorpora             1400-1499     French     5,937      169,221            11
    Incunables du 15e siècle               GalliCorpora             1400-1499     French     7,608      244,958            13




Figure 1: Example of lines. (a) comes from the GalliCorpora manuscript dataset, (b) from the incunabula
one. (c) is drawn from the Eutyches MA Thesis, (d) and (e) from the CREMMA Medieval (French)
dataset. (f) and (g) are both taken from the CREMMA Medieval Latin repository.


approaches are forbidden (different shapes of s, such as the long s and the “modern” s, are not
differentiated), macrons and general horizontal-line diacritics over the letters, such as tildes, are
represented by horizontal tildes, any “zigzag”⁴ or similarly shaped forms are simplified into
superscript vertical tildes, etc. This allows for simpler transcriptions and also a limited diversity
of characters for the machine to learn, satisfying both the human transcriber in terms of the
learning curve of the guidelines, and the HTR engine in terms of complexity. Each corpus
was passed through the ChocoMufin software [11] using project-specific character translation
tables. This software, along with these tables, allows each dataset to be controlled at the char-
acter level and adapted to guideline modifications. It also allows project-specific transcription
standards to be translated to a more common one, such as Pinche’s.




⁴ Official name from the Unicode specifications for the character U+299A.




4. Proposed Method
Our goal is to be able to predict a quality class for any HTR output on medieval French and
Latin. First, we design a way to generate ground truth for the quality assessment of HTR output.
Then, we propose three supervised text-based models, with specific adaptations to handle both
languages with a single classifier.

4.1. “Bad Prediction” Ground Truth
In order to train our classification model, we require ground truth material along with a CER
class: Good ([0; 10)%), Acceptable ([10; 25)%), Bad ([25; 50)%) and Very Bad (≥ 50%). In order
to obtain real-life errors, and to reproduce the rather difficult-to-predict tendency of a model to
confuse certain characters with others in specific settings, we propose a three-step method:

   1. We train Kraken [25] models based on the complete dataset, or on a subset. We deliber-
      ately stop some of the training runs at very early stages, while the CER on the validation
      dataset remains high. We also keep one “best” model [12] trained on the full dataset.
   2. We run each model on our two biggest and most diverse repositories, CREMMA Medieval
      and CREMMA Medieval LAT. We also run a model trained on modern and contemporary
      scripts, Manu McFrench [6], to create garbage-level transcriptions.
   3. We evaluate each line’s CER and store it alongside the line. We also keep the ground
      truth, whose CER is 0 by definition. We remove short lines (fewer than 15 characters)
      and duplicated predictions across models for the same line.

  Regarding the final models used to produce predictions, we have 16 models, allowing for a
maximum of 16 versions of each line, if no two models predict the same text (see Table 2
for examples):

   1. 4 models trained on the same train and validation dataset as the best model, with vali-
      dation CERs of 55.9%, 28.3%, 23% and 20.8% according to Kraken.
   2. 5 models trained on the CREMMA Medieval LAT dataset only, from the 1st to the 6th
      epoch, ranging from 86% to 46% CER.
   3. 1 model trained on the Eutyches (Latin, Carolingian of the 9th century) and the Decameron
      (French, 15th century) datasets, with a 98.5% CER on its validation set.
   4. 3 models trained on the CREMMA Medieval (Old French) dataset only, fine-tuned from
      the Manu McFrench model, with CERs from 11% down to 8.2%.
   5. Manu McFrench, the best model, and the ground-truth data.

   These models provide variable CERs on unseen data from the test sets of both CREMMA
datasets, but also on their training and validation sets, as they did not reach their full capacity
during the training phase. After filtering short and repeated predictions, we have access to
322,903 couples of HTR prediction and CER (see Figure 6 in the appendix). We then translate
each CER into its bin to produce the four established classes, as sketched below.
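   A minimal sketch of this binning and filtering step, assuming the four thresholds above (the
function names and the deduplication strategy are ours, not the released code):

```python
def cer_class(cer: float) -> str:
    """Map a line-level CER to one of the four established classes."""
    if cer < 0.10:
        return "Good"
    if cer < 0.25:
        return "Acceptable"
    if cer < 0.50:
        return "Bad"
    return "Very Bad"


def keep(line: str, seen: set) -> bool:
    """Drop short lines (< 15 characters) and duplicated predictions."""
    if len(line) < 15 or line in seen:
        return False
    seen.add(line)
    return True
```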




Table 2
Example of pairs of predictions for the same line of a file of CREMMA Medieval (University of Penn-
sylvania 660, Le Pélerinage de Mademoiselle Sapience). The first line is the ground truth, the second our
best model trained on the full dataset for production, the 4th from the bottom is from Manu McFrench.
Note that the diacritics are not consistently transcribed.
                             transcription                                       CER
                             u̾ra on de q̃l vertu ses petis pies sont que vous    0.0
                             Bra on de q̃l vertuses petis pies sont que vous      6.1
                             Fra on de q̃l vertuses petis pies sont que vou       8.2
                             Bra on de q̃l vertuses petis pies sont que uous      8.2
                             Pra on de ql vertuses petis pies sont que dons      12.2
                             ura on de q̃l vertu ses petis pies font grre op     16.3
                             ura on de q̃l uertu ses petis pies font re dory     16.3
                             ura on de ql vertu ses petis pies font itce ir      20.4
                             Ard ondegl ratules nus mes sont que ls              42.9
                             a on de at etn le peos pes os e                     49.0
                             a om de ał vrtir sot olisͣ pa sosisinos             57.1
                             ⁊s cm dec uł vrtr fe pdp̃ pns ots pte               61.2


4.2. Model Architecture
We applied three model architectures, common to many NLP tasks, with an embedding /
sentence-encoder / linear-classifier structure where only the sentence encoder changes from one
model to another (see Figure 2). The embedding layer takes into account special tokens (padding,
unknown character, start of line, end of line) and each character according to the Unicode NFD⁵
normalization of the line, by which characters and their diacritics are separated, e.g. [é]
becomes [e]+[´]. The linear layer is a simple (encoding output dimension, class count) decision
layer. Each model uses a cross-entropy loss function⁶ and reduces its learning rate on plateau,
monitoring the validation set’s macro-averaged recall. Optimization of the model is done with
the Ranger optimizer [43].
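   The NFD decomposition can be illustrated with Python’s standard library; this toy example
is ours and simply shows the behaviour described above:

```python
import unicodedata

# The precomposed character [é] is decomposed into the base letter [e]
# plus the combining acute accent, so the embedding sees two tokens.
chars = list(unicodedata.normalize("NFD", "é"))
print(chars)  # ['e', '\u0301']
```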
   The encoding layer takes one of three different forms (a sketch of the first follows this list):

       • The first version uses a single BiLSTM network where the sentence encoding is the result
         of the concatenation of the hidden states of the start-of-line token (BOS) and end-of-line
         token (EOS).
       • The second version follows the architecture of sentence-level attention proposed by Yang,
         Yang, Dyer, He, Smola, and Hovy [44], using a bidirectional GRU. The encoded sentence
         vector is the sum of the products of each token’s hidden state with its attention. Atten-
         tion is also provided as an output for human interpretation of the results.
       • The last one, TextCNN [27], uses the concatenation of the max pooling of each n-gram
         size (2, 3, 4, 5, 6) taken into account by a convolutional neural network.
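   A minimal PyTorch sketch of the BiLSTM variant; the dimensions, padding index and hyper-
parameters are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class BiLstmClassifier(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 64,
                 hidden: int = 128, n_classes: int = 4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True,
                           batch_first=True)
        # BOS and EOS hidden states are concatenated -> 4 * hidden features
        self.out = nn.Linear(4 * hidden, n_classes)

    def forward(self, tokens: torch.Tensor, lengths: torch.Tensor):
        states, _ = self.rnn(self.emb(tokens))   # (batch, time, 2 * hidden)
        bos = states[:, 0]                       # hidden state at BOS
        eos = states[torch.arange(len(lengths)), lengths - 1]  # state at EOS
        return self.out(torch.cat([bos, eos], dim=-1))
```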


⁵ Normalization Form Canonical Decomposition.
⁶ Code available at https://github.com/PonteIneptique/neNequitia.




Figure 2: Available model architectures. Elements in orange are optional or varying; elements in blue
are common to all models.


    As we deal with two different languages, we added another special token, following the work
 of Martin, Villemonte de La Clergerie, Sagot, and Bordes [29] and Gong, Bhat, and Viswanath
 [19]: for each encoding variation, we add one variation of the codec where the first token after
 the beginning-of-string token is a metadata token indicating the language. Thus, a line such as
 “Fra on de q̃l vertuses petis pies sont que vo” is encoded with a language marker (here, one for
 Old French) inserted right after the beginning-of-string token.
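   A hedged reconstruction of this codec variation (the exact token strings, such as <BOS>,
<EOS>, <fro> or <lat>, are our assumptions; the original markup was lost in extraction):

```python
def encode(line: str, lang: str) -> list:
    """Prepend a language marker right after the beginning-of-string token."""
    return ["<BOS>", f"<{lang}>"] + list(line) + ["<EOS>"]

tokens = encode("Fra on de q̃l vertuses petis pies sont que vo", "fro")
```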


5. Experimental Setup
In order to avoid lexical bias and to ensure the strength of our analysis, we propose a 5-Fold-
like experiment, where the train, validation and test subsets are the results of splits across
manuscripts. For each K, two French manuscripts and two Latin ones are used for the vali-
dation set and the test set, and they differ by at least one manuscript from one K to another,
leaving three K completely different (K1, K3, K5; see Table 3). Each test set also contains a Latin
manuscript that was not used in any of the HTR model training or validation: Berlin, Hdschr.
25. This manuscript serves as a stable reference point for model evaluation.
Models are then evaluated using class-specific precision and recall, as well as macro-averaged
precision and recall.
   For our baseline, we use the relative frequencies of the 2,000 most common n-grams of sizes 3,
4 and 5 as features and feed them to a linear classifier, with cross-entropy loss and the Adam
optimizer (a sketch of the feature extraction follows). We run each model architecture once
for each K, resulting in 7 different configurations including the baseline (presence/absence of
the language token for the three encoding modules + baseline).
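   A minimal sketch of the baseline features, assuming relative frequencies of the 2,000 most
common character n-grams over the training corpus (the helper names are ours):

```python
from collections import Counter

def ngrams(text: str, sizes=(3, 4, 5)):
    """Yield all character n-grams of the requested sizes."""
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_vocabulary(corpus, top_k: int = 2000):
    """Keep the top_k most common n-grams over the whole corpus."""
    counts = Counter(g for line in corpus for g in ngrams(line))
    return [g for g, _ in counts.most_common(top_k)]

def featurize(line: str, vocabulary) -> list:
    """Relative frequency of each vocabulary n-gram in the line."""
    counts = Counter(ngrams(line))
    total = max(sum(counts.values()), 1)
    return [counts[g] / total for g in vocabulary]
```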
   Our whole pipeline uses pandas for data preparation [31], PyTorch [33] for model develop-
ment, and PyTorch Lightning [17] for the training, evaluation and prediction wrapping.




Table 3
Composition of the K-Fold sets, based on manuscript selection.
K                                                          1                              2                                   3                                         4                                      5
Validation   French            BnF fr. 17229, BnF fr. 25516    BnF fr. 3516, BnF fr. 25516    BnF fr. 24428, BnF. Arsenal 3516                 BnF fr. 24428, BnF fr.844    Pennsylvania Codex 909, BnF fr.844
             Latin               Arras 861, CCCC Ms 165          CLM 13027, CCCC 165           CLM 13027, Montpellier, H318           BnF lat. 6395, Montpellier, H318           BnF lat. 6395, Laur. Plut.33.31
Test         French        BnF Arsenal 3516, BnF fr. 13496      BnF fr. 24428, BnF fr. 411           BnF fr. 844, BnF fr. 22549   BnF fr.412, Phil., Col. of Phys. 10a 13       Bodmer 168, Vat. reg. lat. 1616
             Latin           Sorbonne Fr. 193, CLM 13027           CCCC Ms. 236, H318               BnF lat. 6395, Egerton 821          BnF fr. 16195, Laur. Plut. 33.31        Laur. Plut. 53.08, BnF lat. 8236
Train        Good                                    80,056                         76,564                               65,764                                    39,165                                 39,165
             Acceptable                              44,346                         41,769                               34,429                                    35,803                                 35,803
             Bad                                     60,381                         59,265                               51,637                                    41,793                                 41,793
             Very Bad                                71,008                         71,053                               60,898                                    52,212                                 52,212
Validation   Good                                      4,246                          9,857                              12,770                                    11,625                                 11,625
             Acceptable                                3,933                        10,377                               12,496                                     8,492                                  8,492
             Bad                                       4,338                        10,884                               13,430                                    10,250                                 10,250
             Very Bad                                  4,867                        15,428                               18,386                                    11,461                                 11,461
Test         Good                                      9,165                          7,046                              14,933                                    42,677                                 42,677
             Acceptable                                9,744                          5,877                              11,098                                    13,728                                 13,278
             Bad                                     12,763                           7,333                              12,415                                    25,439                                 25,439
             Very Bad                                18,056                           7,350                              14,647                                    30,258                                 30,258




Table 4
Test result statistics for each K and each model configuration. The “Lang” column indicates whether
the language metadata token is used.
Lang     Encoder          Good                                         Acceptable                                     Bad                                            Very bad
                          Precision            Recall                  Precision               Recall                 Precision              Recall                  Precision              Recall
                          Mean Median          Mean       Median       Mean Median             Mean      Median       Mean Median            Mean       Median       Mean Median            Mean      Median
No       Baseline         33.87     35.24      33.84       33.76       36.81     37.63          7.56       8.67       37.34     37.11        19.15       18.27        60.27    59.73        97.24      97.32
Yes      Attention        65.31     65.61      41.62       41.88       45.29     44.01         26.72      26.32       49.36     49.66        49.23       47.70        75.74    75.33        95.53      95.48
Yes      BiLSTM           67.00     66.82      38.02       37.31       41.75     41.29         21.89      21.05       47.13     47.77        51.79       51.09        76.17    74.18        94.20      94.51
Yes      TextCNN          57.78     59.06      31.52       26.14       41.97     43.24         20.87      22.29       43.66     44.88        35.93       33.26        68.09    65.95        96.73      97.61
No       Attention        58.08     57.00      39.85       41.62       44.10     44.40         35.60      34.21       51.98     51.34        49.98       47.16       80.01     78.53        94.41      94.51
No       BiLSTM           60.30     57.60      39.70       36.55       42.87     42.90         31.15      28.17       50.95     51.37        52.39       52.63        79.96    80.40        94.19      93.80
No       TextCNN          50.85     49.35      38.43       38.32       40.24     40.10         24.77      26.63       48.59     49.14        47.94       47.16        76.92    77.48        94.46      94.44




6. Experiments
6.1. Model Classification Results
The first conclusion we can draw from these experiments is that our models always beat the
baseline (see Table 4 and the Appendix for more details). Neither RNN-based architecture
clearly beats the other, but TextCNN clearly underperforms. The introduction of the language
metadata token helps when detecting Good transcriptions (delta ≈ +7% for attention’s me-
dian precision, ≤ +1% for the recall) for both RNN-based models. Otherwise, models without
a language marker tend to outperform models with language markers, most notably for the
Very Bad class, where the delta is up to +6% in favour of models without language tokens
(using median precision scores).
   Regarding the variability of results, we found that the length of the string had an impact on
the prediction, no matter the model architecture. Surprisingly, none of the models withstands
long noisy lines: the accuracy of the Very Bad class is inversely correlated with line length. On
the contrary, depending on the encoder, some classes benefit from longer strings: Good lines
benefit from it with all models except the baseline. TextCNN is the only model whose accuracy
on the Bad and Acceptable classes really correlates with line length.
   Finally, for all models except the baseline, the most common confusion is always with the
“adjacent” class(es) (see Figure 4). For the classes Acceptable and Bad, which have two neigh-
bours, the error rate is evenly split between them: the class Acceptable tends to be confused
with either Good or Bad. This shows the model’s ability to perceive cleanness or noise, but
also shows the limits of these classes: for a line of 50 characters, such as “quãt tel eufaut gist en
tes lieu. Derite respoint”, 6 mistakes are enough to swing into the Acceptable category
Figure 3: Regression of accuracy on line length over all 5-Fold test sets. The common manuscript
(Berlin, Hdschr. 25) is not included.


(ground truth: “quãt tel enfant gist en tel lieu . Uerite respon”, where one space has been removed
before the dot).
   Overall, with an accuracy for the Good and Very Bad classes around 50% on these languages,
and considering that most of the confusions are with adjacent classes (e.g. Good is confused
with Acceptable, Acceptable with Good and Bad, etc.), the solution performs well either at filter-
ing badly read manuscripts or at keeping only the very good ones. The Acceptable and Bad
classes have stable performance in the face of variable line lengths, although the Acceptable
class shows the worst classification performance.

6.2. Application on a Real-World Library Dataset
As a real-world application, we wanted to apply one of our best models to an unseen dataset, in
the same way that we envision cultural institutions using the tool. We describe the set-up
Figure 4: Confusion rate dispersion in the errors made by each model. Only confusions that happen
more than 50 times are taken into account, for total error counts greater than or equal to 300.
The graph can be read as follows: for the baseline, 40% of the errors for the ground truth class Good
are Acceptable predictions.


for this particular experiment below, then evaluate the results of the classification model
with regard to the capacities of the HTR model; we also study some randomly sampled elements.

6.2.1. Set-up
To evaluate on as much unseen data as possible, we crawled the Biblissima IIIF collection por-
tal [18]. We searched individually for each combination of language (French, Latin) and century
(9th to 15th), limiting the number of samples retrieved to 500 manuscripts. We then sampled
10 sequential pictures from each manuscript.⁷ To avoid empty pages (which tend to be at the
start and the back of each book’s digitization or IIIF manifest at the BnF), we take either the
ten first pictures of the second decile of the manifest, or pictures from the 20th up to the 30th
if there are fewer than 100 pictures, or the 10 last if there are fewer than 20 pictures (see the
sketch below).
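   Our reading of this sampling rule, as a sketch (the function name and list-based interface are
illustrative, not from the released crawler):

```python
def sample_pictures(pictures: list) -> list:
    """Pick 10 pictures while avoiding empty pages at both ends."""
    if len(pictures) < 20:
        return pictures[-10:]        # the 10 last pictures
    if len(pictures) < 100:
        return pictures[20:30]       # from the 20th up to the 30th
    start = len(pictures) // 10      # start of the second decile
    return pictures[start:start + 10]
```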
   Each downloaded sample is then segmented using YALTAi [10] with the included model
designed for cultural heritage manuscripts and the base Kraken BLLA segmenter [24]. As
YALTAi provides different zones—from the margins to the main body of text—through number-
ing, we only consider lines that are part of the main body of text of each page, thus excluding
any marginalia or paratext. We then use Kraken to predict a transcription for each line, using
the best trained model as described in our first experiment. Next, we feed each line to our best
BiLSTM model (K-Fold 1 has the best recall/precision on Good) while keeping the line metadata:
language, century, manuscript identifier, and page identifier.
   Finally, we provide three different evaluations of the transcriptions. The first is based strictly
on the number of lines predicted in each class (Good, Acceptable, etc.). The second is page-based:
we take the most common prediction over all lines of a page. The last one is manuscript-based:
we take the most common page prediction, using the previous page-based metric (a sketch of
both aggregations follows).
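   A minimal sketch of the page- and manuscript-level votes, assuming each record carries a
(manuscript, page, predicted class) triple; ties are broken arbitrarily:

```python
from collections import Counter

def majority(labels):
    """Most common label; ties broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

def page_classes(records):
    """Majority class of the lines of each page."""
    pages = {}
    for ms, page, label in records:
        pages.setdefault((ms, page), []).append(label)
    return {key: majority(votes) for key, votes in pages.items()}

def manuscript_classes(records):
    """Majority class of the pages of each manuscript."""
    mss = {}
    for (ms, _), label in page_classes(records).items():
        mss.setdefault(ms, []).append(label)
    return {ms: majority(votes) for ms, votes in mss.items()}
```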

6.2.2. Evaluation
Overall, the HTR prediction results produced by our BiLSTM module are in line with the HTR
strengths on the dataset (see Figure 5). The model performs extremely well on early manuscripts,
thanks to the presence of two datasets of early manuscripts (Eutyches and Caroline Minuscule).
It performs well on Old French except for the 13th century, where Bad predictions are more
common. The relative frequency of Very Bad predictions tends to grow as we get closer to the
16th century: from the data we have seen, this could be due to the presence of non-literary
manuscripts written in cursive, for which our model has no ground truth.
   If we look at the sampled predictions (Appendix, Table 2), most Good predictions seem cor-
rect or nearly correct. However, we can see that the metadata from Biblissima and the BnF
has some limitations when used automatically, as it can produce problematic results: most
12th-century Acceptable predictions are probably in Latin, which would indicate a multilingual
manuscript or a badly catalogued one. This issue also arises in the crawler for the century,
as some manuscripts were catalogued as French but with a production date that is before the
first known French document: these are most likely multilingual documents, with either a col-
lection of various leaves from previous manuscripts, or the inclusion of the language used for
marginal notes. Three out of the six Acceptable predictions between the 13th and the 14th century
are definitely readable and understandable, and we cannot but wonder whether the lack of spaces
in “q̃ merueilles fu lacitebiengarne mlt” is responsible for its classification as Acceptable rather
than Good. We note that at least one Very Bad prediction in French, “OU EtE L. Cheualier de
Monifort, son Oncle, Gles”, seems rather readable, albeit needing more corrections than a Good
transcription. Latin shows the same trend, being accurate for Good and Acceptable.

⁷ Note that we are not talking about pages but about pictures: in some cases, most commonly in the case of digitised
  microfilms, one picture can contain two pages.




Figure 5: Distribution of predictions per line (first two rows), per page (rows 3 and 4), and per manu-
script (last rows), filtered by language and century.


7. Conclusion
The ability to filter automated transcriptions of manuscripts in Latin, Old French or any other
Western historical language, without pre-transcribing samples, might lead to the production
of datasets designed for analyses that rely on better transcriptions, or to guiding cultural
heritage institutions and their partners in the production of new ground truth. Producing HTR
ground truth does indeed require time, skilled transcribers and, last but not least, budget. However,
most current error-rate prediction or HTR output analysis models rely on n-gram frequencies
and lexical features—two approaches that are often less viable for languages such as Old French,
which “suffers” from a highly variable spelling system, or for languages like Latin, which are
potentially highly abbreviated, with abbreviations changing even within a single manuscript,
depending on the context, the topic and the scribe.




   In this context, we chose to treat CER range prediction as a sentence-level classification
problem, for which we implemented three basic models, using either a single BiLSTM encoder,
an attention-supported GRU, or a TextCNN encoder. These three tools show stronger results
than an n-gram-based baseline. On top of this, we include a language metadata token, which
can improve the reliability of the lowest range of CER (between 0 and 10%, the Good class)
while worsening the classification’s reliability for the highest range (over 50%, the Very Bad
class). For the purpose of training these models, we propose a new way to generate real-life
“bad transcriptions”, using early-stopped HTR models, or models trained on small samples of
data: this provides an alternative to previous rule-based generation of “bad transcription”
ground truths.
   We show that, on a completely unknown dataset of around 1,800 manuscripts analysed
with a new HTR model specifically trained on medieval Latin and French, the predicted number
of well-transcribed manuscripts is in line with the strengths of the HTR model on such material.
The quality assessment predictions provide quick insights for larger collections, and could be
run relatively often by cultural heritage institutions.
   In the future, hyper-parameter fine-tuning and other encoders could be used in the architec-
ture. Specifically, with more correctly transcribed manuscripts, including the abbreviations in
their transcriptions, fine-tuning larger language models could allow the application of (pseudo-
)perplexity ranking such as the one proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and
Schoch [40], while allowing for partial noise in the training data. We hope to see such classi-
fication of manuscripts used by ground truth producers in order to enhance the robustness of
openly available HTR models.


Acknowledgments
I want to thank Jean-Baptiste Camps, Ariane Pinche and Malamatenia Vlachou-Efstathiou
for their constant feedback and replies to some particular questions regarding manuscripts or
HTR data. Many thanks to Ben Nagy for his proof-reading of the pre-print version.
  This work was funded by the Centre Jean Mabillon and the DIM MAP
(https://www.dim-map.fr/projets-soutenus/cremmalab/).


References
 [1] G. T. Bazzo, G. A. Lorentz, D. Suarez Vargas, and V. P. Moreira. “Assessing the Impact of
     OCR Errors in Information Retrieval”. In: Advances in Information Retrieval. Ed. by J. M.
     Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, and F. Martins. Lecture
     Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 102–109.
     doi: 10.1007/978-3-030-45442-5_13.
 [2] S. Biay, V. Boby, K. Konstantinova, and Z. Cappe. TNAH-2021-DecameronFR. 2022. doi:
     10.5281/zenodo.6126376. url: https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-DecameronFR.
 [3] J.-B. Camps, T. Clérice, F. Duval, L. Ing, N. Kanaoka, and A. Pinche. “Corpus and Models
     for Lemmatisation and POS-tagging of Old French”. 2022. url:
     https://halshs.archives-ouvertes.fr/halshs-03353125.
 [4] J.-B. Camps, C. Vidal-Gorène, and M. Vernet. Handling Heavily Abbreviated Manuscripts:
     HTR engines vs text normalisation approaches. May 2021. url:
     https://hal-enc.archives-ouvertes.fr/hal-03279602.
 [5] M. Careri, C. Ruby, and I. Short. Livres et écritures en français et en occitan au XIIe siècle:
     catalogue illustré. Viella, 2011. 274 pp.
 [6] A. Chagué and T. Clérice. HTR-United - Manu McFrench V1 (Manuscripts of Modern and
     Contemporaneous French). Version 1.0.0. 2022. doi: 10.5281/zenodo.6657809. url:
     https://doi.org/10.5281/zenodo.6657809.
 [7] A. Chagué and T. Clérice. HTR-United: Ground Truth Resources for the HTR and OCR of
     patrimonial documents. 2022. url: https://htr-united.github.io.
 [8] C. Clausner, S. Pletschacher, and A. Antonacopoulos. “Quality Prediction System for
     Large-Scale Digitisation Workflows”. In: 2016 12th IAPR Workshop on Document Analysis
     Systems (DAS). 2016, pp. 138–143. doi: 10.1109/das.2016.82.
 [9] T. Clérice. “Evaluating Deep Learning Methods for Word Segmentation of Scripta Con-
     tinua Texts in Old French and Latin”. In: Journal of Data Mining & Digital Humanities
     2020 (2020). doi: 10.46298/jdmdh.5581. url: https://jdmdh.episciences.org/6264.
[10] T. Clérice. “You Actually Look Twice At it (YALTAi): using an object detection approach
     instead of region segmentation within the Kraken engine”. 2022. url:
     https://hal-enc.archives-ouvertes.fr/hal-03723208.
[11] T. Clérice and A. Pinche. Choco-Mufin, a tool for controlling characters used in OCR and
     HTR projects. Computer software. Version 0.0.4. 2021. doi: 10.5281/zenodo.5356154. url:
     https://github.com/PonteIneptique/choco-mufin.
[12] T. Clérice, A. Pinche, and M. Vlachou-Efstathiou. Generic CREMMA Model for Medieval
     Manuscripts (Latin and Old French), 8-15th century. Version 1.0.0. 2022. doi:
     10.5281/zenodo.7234166. url: https://doi.org/10.5281/zenodo.7234166.
[13] T. Clérice, M. Vlachou Efstathiou, and A. Chagué. CREMMA Manuscrits médiévaux latins.
     Ed. by A. Chagué and T. Clérice. 2022. url:
     https://github.com/HTR-United/CREMMA-Medieval-LAT.
[14] F. Cloppet, V. Eglin, V. C. Kieu, D. Stutzmann, and N. Vincent. “ICFHR2016 Competi-
     tion on the Classification of Medieval Handwritings in Latin Script”. In: 2016 15th Inter-
     national Conference on Frontiers in Handwriting Recognition (ICFHR). Shenzhen, China:
     IEEE, Oct. 2016, pp. 590–595. doi: 10.1109/icfhr.2016.0113. url:
     http://ieeexplore.ieee.org/document/7814129/.
[15] M. Cuper. “Examining a Multi Layered Approach for Classification of OCR Quality with-
     out Ground Truth”. In: DH Benelux Journal (2022), p. 17.



[16] M. Eder. “Mind your corpus: systematic errors in authorship attribution”. In: Literary
     and Linguistic Computing 28.4 (Dec. 1, 2013), pp. 603–614. doi: 10.1093/llc/fqt039. url:
     https://doi.org/10.1093/llc/fqt039.
[17] W. Falcon and The PyTorch Lightning team. PyTorch Lightning. Computer software.
     Version 1.4. 2019. doi: 10.5281/zenodo.3828935. url:
     https://github.com/Lightning-AI/lightning.
[18] E. Frunzeanu, E. MacDonald, and R. Robineau. “Biblissima’s Choices of Tools and
     Methodology for Interoperability Purposes”. In: CIAN. Revista de historia de las universi-
     dades 19.1 (2016), pp. 115–132.
[19] H. Gong, S. Bhat, and P. Viswanath. “Enriching Word Embeddings with Temporal and
     Spatial Information”. In: Proceedings of the 24th Conference on Computational Natural
     Language Learning. Online: Association for Computational Linguistics, 2020, pp. 1–11.
     doi: 10.18653/v1/2020.conll-1.1. url: https://aclanthology.org/2020.conll-1.1.
[20] E. Guéville and D. J. Wrisley. “Transcribing Medieval Manuscripts for Machine Learning”.
     In: arXiv preprint arXiv:2207.07726 (2022). url: https://arxiv.org/abs/2207.07726.
[21] R. Holley. “How good can it get? Analysing and improving OCR accuracy in large scale
     historic newspaper digitisation programs”. In: D-Lib Magazine 15.3/4 (2009).
[22] A. Honkapohja and J. Suomela. “Lexical and function words or language and text type?
     Abbreviation consistency in an aligned corpus of Latin and Middle English plague tracts”.
     In: Digital Scholarship in the Humanities 37.3 (2021), pp. 765–787. doi: 10.1093/llc/fqab007.
     url: https://doi.org/10.1093/llc/fqab007.
[23] P. Kahle, S. Colutto, G. Hackl, and G. Mühlberger. “Transkribus - a service platform for
     transcription, recognition and retrieval of historical documents”. In: 2017 14th IAPR In-
     ternational Conference on Document Analysis and Recognition (ICDAR). Vol. 4. IEEE. 2017,
     pp. 19–24.
[24] B. Kiessling. “A modular region and text line layout analysis system”. In: 2020 17th Inter-
     national Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE. 2020, pp. 313–
     318.
[25] B. Kiessling. The Kraken OCR system. Computer software. Version 4.1.2. 2022. url:
     https://kraken.re.
[26] B. Kiessling, R. Tissot, P. Stokes, and D. S. B. Ezra. “eScriptorium: an open source platform
     for historical document analysis”. In: 2019 International Conference on Document Analysis
     and Recognition Workshops (ICDARW). Vol. 2. IEEE. 2019, pp. 19–19.
[27] Y. Kim. “Convolutional Neural Networks for Sentence Classification”. In: Proceedings of
     the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha,
     Qatar: Association for Computational Linguistics, 2014, pp. 1746–1751. doi:
     10.3115/v1/D14-1181. url: https://aclanthology.org/D14-1181.
[28] E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical
     Languages”. In: Journal of Data Mining and Digital Humanities NLP4DH (2022). doi:
     10.46298/jdmdh.9152. url: https://hal.inria.fr/hal-03592137.

                                                17
[29] L. Martin, É. Villemonte de La Clergerie, B. Sagot, and A. Bordes. “Controllable Sentence
     Simplification”. In: LREC 2020 - 12th Language Resources and Evaluation Conference. Mar-
     seille, France, 2020. url: https://hal.inria.fr/hal-02678214.
[30] M. Martinc, S. Pollak, and M. Robnik-Šikonja. “Supervised and Unsupervised Neural Ap-
     proaches to Text Readability”. In: Computational Linguistics 47.1 (Apr. 21, 2021), pp. 141–
     179. doi: 10.1162/coli_a_00398. url: https://doi.org/10.1162/coli_a_00398.
[31] W. McKinney et al. “pandas: a foundational Python library for data analysis and statis-
     tics”. In: Python for high performance and scientific computing 14.9 (2011), pp. 1–9.
[32] H. T. T. Nguyen, A. Jatowt, M. Coustaty, and A. Doucet. “ReadOCR: A Novel Dataset
     and Readability Assessment of OCRed Texts”. In: International Workshop on Document
     Analysis Systems. Springer. 2022, pp. 479–491.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.
     Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
     S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. “PyTorch: An Imperative
     Style, High-Performance Deep Learning Library”. In: Advances in Neural Information
     Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc,
     E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8024–8035. url:
     http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[34] A. Pinche. Cremma Medieval. 2022. doi: 10.5281/zenodo.5235185. url:
     https://github.com/HTR-United/cremma-medieval.
[35] A. Pinche. “Guide de transcription pour les manuscrits du Xe au XVe siècle”. 2022. url:
     https://hal.archives-ouvertes.fr/hal-03697382.
[36] A. Pinche, S. Gabay, N. Leroy, and K. Christensen. Données HTR incunables du 15e siècle.
     2022. url: https://github.com/Gallicorpora/HTR-incunable-15e-siecle.
[37] A. Pinche, S. Gabay, N. Leroy, and K. Christensen. Données HTR manuscrits du 15e siècle.
     2022. url: https://github.com/Gallicorpora/HTR-MSS-15e-Siecle.
[38] J. Schoen and G. E. Saretto. “Optical Character Recognition (OCR) and Medieval
     Manuscripts: Reconsidering Transcriptions in the Digital Age”. In: Digital Philology: A
     Journal of Medieval Cultures 11.1 (2022), pp. 174–206. doi: 10.1353/dph.2022.0010. url:
     https://muse.jhu.edu/article/853521.
[39] U. Springmann, F. Fink, and K. U. Schulz. Automatic quality evaluation and (semi-) auto-
     matic improvement of OCR models for historical printings. Oct. 20, 2016. doi:
     10.48550/arXiv.1606.05157. url: http://arxiv.org/abs/1606.05157.
[40] P. B. Ströbel, S. Clematide, M. Volk, R. Schwitter, T. Hodel, and D. Schoch. “Evaluation of
     HTR models without Ground Truth Material”. In: arXiv preprint arXiv:2201.06170 (2022).
     url: http://arxiv.org/abs/2201.06170.
[41] M. Vlachou-Efstathiou. Voss.Lat.O.41 - Eutyches ”de uerbo” glossed. 2022. url:
     https://github.com/malamatenia/Eutyches.




[42]    N. White, A. Karaisl, and T. Clérice. Caroline Minuscule by Rescribe. Ed. by A. Chagué
        and T. Clérice. 2022. url: https://github.com/rescribe/carolineminuscule-groundtruth.
[43]   L. Wright. New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam +
       LookAhead for the best of… Medium. Sept. 4, 2019. url: https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d.
[44]   Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. “Hierarchical Attention Net-
       works for Document Classification”. In: Proceedings of the 2016 Conference of the North
       American Chapter of the Association for Computational Linguistics: Human Language Tech-
       nologies. San Diego, California: Association for Computational Linguistics, 2016, pp. 1480–
       1489. doi: 10.18653/v1/N16-1174. url: http://aclweb.org/anthology/N16-1174.


A. Appendix
The software has been archived at the following address: https://doi.org/10.5281/zenodo.7233984.
A large part of the data is available at https://github.com/PonteIneptique/neNequitia/releases/tag/chr2022-release.
   Manuscript metadata and the predictions in XML ALTO format for Section 6 are available at
https://doi.org/10.5281/zenodo.7234399. The same repository also contains the XML data for
training the classifier.
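   As an illustration of how the released ALTO files can be consumed, the following minimal
sketch extracts line-level text from them. It is not part of the released software: it assumes
the usual ALTO structure (TextLine elements carrying String elements with a CONTENT
attribute) and a hypothetical predictions/ folder.

    from pathlib import Path
    import xml.etree.ElementTree as ET

    def alto_lines(path):
        """Yield the text of every TextLine in one ALTO file."""
        root = ET.parse(path).getroot()
        # "{*}" matches any XML namespace (Python >= 3.8), so the exact
        # ALTO version of the release does not matter here.
        for line in root.iter("{*}TextLine"):
            parts = [s.get("CONTENT", "") for s in line.iter("{*}String")]
            yield " ".join(p for p in parts if p)

    # Hypothetical layout: one ALTO file per page in a predictions/ folder.
    for xml_file in sorted(Path("predictions").glob("**/*.xml")):
        for text in alto_lines(xml_file):
            print(text)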

Lang  Century  Prediction  Transcription
fro   12       Good        ra monstre de couf de uoudenay
fro   12       Good        uucuns pair er que Iehuz de le chaulre le ieuue uor de teuteur dicelui office oy a este priuoz et de loucez pour msen et acaise de cereus cu ⁊ decpa
fro   12       Good        seriant estoit exilliez en laueniance de sacolpe. li poures
fro   13       Good        les cõdurroit car il sauoit crop bien. coz lespas. ⁊
fro   13       Good        Se il lẽ set dire nouele
fro   13       Good        tiseras. ⁊ ieres sempres amendes. ⁊ en un au
fro   14       Good        Procureur du Roi du mẽme iour qui ne l empeche. lOrdon.
fro   14       Good        sonpere tous les rodais et les tartcites
fro   14       Good        sacies qͥ l nestoit deriens i cant desirans ꝯme de
fro   15       Good        quil ne lui celast mit ains lui deist qui
fro   15       Good        miere pour ce que par la renue de cest
fro   15       Good        sauoit puis fait. Et il lui cõte cõmet̾
lat   9        Good        cumppriae accipit᷑ tab naculum belli res est. Adtem pus enim cumd-abolo dimicamꝰ. & tunc opusẽ
lat   9        Good        aut̃ comminati : miscrunteuminex
lat   9        Good        ce Detempore ordinat ionum.
lat   10       Good        epm̃ ñaccipiant xccraxui a gererĩ ꝓcur auerint
lat   10       Good        babeat᷑. sicastigat: psatis faccionẽ uenia ab epo noluerit ꝓmereri.
lat   10       Good        prima creatrix : posterior
lat   11       Good        cer cũ fratrib in labore manuu
lat   11       Good        la tricem illã uiris armisq nobilẽ hispadua: illam semi
lat   11       Good        ꝓ motionẽ dare debebit Postumianus ep̃ s dixit:
lat   12       Good        minus.¶ Vmmasculo ñ cõmisceb̾is contu femineo: qͥ a
lat   12       Good        non tenuerit eccłiastice ficlei caritatisq cõ
lat   12       Good        que fuerant futura damnantur. Deinde si eisad ꝑcipiendũ bapti
lat   13       Good        diei. q̃ sitꝰ a ꝑentib infans inuentꝰ est ⁊ sublatꝰ defouea obnolutꝰ ceno⁊
lat   13       Good        fit. nͥ adumantibo utust lac̾ tis q st̾ it᷑ costa
lat   13       Good        sub sarcinis adoriri. Qua pulsa inpedimentisq direptis. futurũ
lat   14       Good        seꝙ. dicã i vit̾ . fiałs et̾
lat   14       Good        do rerũ. Que disciplina: Que grã
lat   14       Good        ut̾ usque ⁊ siauł̾ deiusto ⁊ułto tubł̾iliau quostã ꝑtrła
lat   15       Good        a tlium ꝯsilus extraneus aud̾eat discre pare
lat   15       Good        p̾parata pena.S qd cica : Duodici fatemur xpm̃ apostolos habuis
lat   15       Good        absoloe oñm et c̾ ꝑ ꝓcessũ aut et tͥ lu q̾om et c̾ Ncessus de ca̾i et pre uacãtib
fro   12       Acceptable  hoc michi uircus caritacis ex
fro   12       Acceptable  poue sg̃ uis not arcã de roy nosta siro l gñt dixur de gendy
fro   12       Acceptable  uideliet ꝙ. Vluifxix ꝑ iii obo ł ddẽ debił monsõ daentũ erignita t qͣ n-decim ẜ derẽ d̾ cuo̾ foreꝭ pn hune medum ui
fro   13       Acceptable  q̃ merueilles fu lacitebiengarne mlt
fro   13       Acceptable  eceual ꝯmanda a .i. desgrũs baillies
fro   13       Acceptable  ⁊ aumang̾ loea lonseigne inporcee
fro   14       Acceptable  en excepter aucuns : quĩ dit les aroits, sans en excepter aucuns, dir tous
fro   14       Acceptable  beancoipe ⁊ de nofimeeeEt chastellaus du chosirur diu hur d ursarce Confe sfout anen en ilirur
fro   14       Acceptable  ¶ Oedee est alber de chyam de
fro   15       Acceptable  grans coupz sur leitargt du foy des orgueilleux
fro   15       Acceptable  cau en ueritayꝭ cest grant ⁊ Iouff
fro   15       Acceptable  nophanes eracleopolites qͥ ceste
lat   9        Acceptable  septies. sedusque septuagies septies.
lat   9        Acceptable  aestuat. Dehac rcriptũ. ẽ :
lat   9        Acceptable  to hostem patriae redire iubet ad propria. Iune
lat   10       Acceptable  bilis sit deuotio. Consttt qu uram dilec tione magna remune
lat   10       Acceptable  sustinebt̃ salus aut̃ mea insẽpit̃ nũ crit.
lat   10       Acceptable  sorac cae plũr tm̃ ut ħierusolima. quasi ut hic narrabo plũr tm̃ uthutreueri utroque
lat   11       Acceptable  bitatem ipsiis omino ugor
lat   11       Acceptable  diccũ ẽ. ego dns exaudiã eos. dr̃ istł ñderelinquã eos: ñidõ diccũ ẽ. cãquã gen
lat   11       Acceptable  ait.A ẽsis hicuobis micummm̃siũ.primus est uobis irm̃si
lat   12       Acceptable  ꝑ secutio leuist adcauedũ.s beticor seductioꝑni
lat   12       Acceptable  ierłm & uide. immo iudicent int̃
lat   12       Acceptable  surci :reccutores repu :. lic: ett̃ migriin
lat   13       Acceptable  Meseach ⁊ tafari ⁊ Rrasis sic̃ dicc̃
lat   13       Acceptable  orit lui. ⁊ termo optimus est
lat   13       Acceptable  quilibet sp̃ s. omĩno
lat   14       Acceptable  potior conditio pp̃ e.facit de rxp. duobus .li. bl.
lat   14       Acceptable  se dm̃ habere. et pmicꝰ sibimet satiffaciens.
lat   14       Acceptable  ualent vuã breuẽ. ⁊ ultia ualet tũi
lat   15       Acceptable  sup s comꝑarõ ioñ prudẽtes ꝯquas
lat   15       Acceptable  metermuim.et rẽgm euis non erit finis
lat   15       Acceptable  L e carnalis ht, qm pater ip̃ s parentis.
fro   12       Bad         orailleo .poulleer xv lib
fro   12       Bad         Rbir les bartres
fro   12       Bad         deaute lqu ques creppt Eentiferoi rece nyꝰ seelle ces liea aa mn pie do d ce ee lu moasum
fro   13       Bad         atourne. giest sibo lans quil qsui
fro   13       Bad         Carde peo eequie auoit d̾yonde eane de adtus edtoit pao coudequaiẽ
fro   13       Bad         Mol edito se vtan di or icuttͥs
fro   14       Bad         Q Anne Autiron. Que ledit saques de Lancrau, epousa en premieres noces, le
fro   14       Bad         stallis eudinor ꝯpu es uiai fugdutu padeur i uo ferras pu uea puis ai
fro   14       Bad         uolent ipitur ⁊atia rertace, quan ineestaeq aleeeclere, et decõuy ny s ã ã
fro   15       Bad         deianarr de bbrdide
fro   15       Bad         ⁊uribz allegate, Sed epclusissent ab uitestate Ipsi
fro   15       Bad         msuol ̃ ipousaultis anonuen natucõ auol
lat   9        Bad         lus . necnonalu acquealii fundatores ecdlesiae atque erudito
lat   9        Bad         us prae erat ut ꝑhoc. P̃sedemtis
lat   9        Bad         crea turar quues upra
lat   10       Bad         eruc tucins quat tuor an
lat   10       Bad         utunde positum eleganter concin
lat   10       Bad         aecenim consid rtio suasit qnm manifestum. ẽ. omnemutabile
lat   11       Bad         UERBuai : FIuERBũ.
lat   11       Bad         mus qm ipse anns nr̃ animã posuit suã
lat   11       Bad         fra si tua foret roma.to
lat   12       Bad         et arbusta eius cedros dei.
lat   12       Bad         rit i audalunt dñt surreber̃ scm̃q marie
lat   12       Bad         relecti mansueti.
lat   13       Bad         qurdr uicba .i. quit est
lat   13       Bad         de inim̃. m. ñ quãu
lat   13       Bad         usqi io intintoẽm amcti delendi sumuis
lat   14       Bad         cui subsunc becm̃braa laceiicelligas lr̃ãm. siue sint plati or
lat   14       Bad         Sĩ mõlał ãt ibs
lat   14       Bad         fult ad nol in eigilia lanooe orucil oroit ꝓuril crauns
lat   15       Bad         ⁊ Artaiita mons cum flumi-
lat   15       Bad         ꝑ te maiorz pñt corrumꝑe
lat   15       Bad         lo sunt ni locu unu. ⁊ appare
fro   12       Very bad    I guille choeneau
fro   12       Very bad    mnl cct quarantẽ dope
fro   12       Very bad    nullo cappic d bii uigr
fro   13       Very bad    noulonoe rolissicanuꝭ, Rudauu/, ꝯgarobanu
fro   13       Very bad    L a nuis ÷ eueuue sihat ipponses
fro   13       Very bad    diqe ut̃ le diui⁊ inr
fro   14       Very bad    oe consepeedeetante cemere
fro   14       Very bad    OU EtE L. Cheualier de Monifort, son Oncle, Gles
fro   14       Very bad    Bussoy Iaguio dar Rnauex a eedamet dunin
fro   15       Very bad    aximiun oiuinca s apenalriuuuirõuo uutonli
fro   15       Very bad    libas ꝯsadiorandapio sidimił
fro   15       Very bad    Euuon lan uiii fut
lat   9        Very bad    IV Mtru&Ε Rlnꝯrdo¬
lat   9        Very bad    eo locus sp atiosus admanen
lat   9        Very bad    ie godñpsormanilues
lat   10       Very bad    dule hanu curde uut lato
lat   10       Very bad    arnals de seruull
lat   10       Very bad    sbib liotheca tsede
lat   11       Very bad    s ec tanie ca uis uirtusq
lat   11       Very bad    Don de N. le duc de la Tremoille . MV
lat   11       Very bad    uitr fuit rtimuli
lat   12       Very bad    minu Benedlicat uos clns exsy
lat   12       Very bad    sad mumnoui hominıᷤ
lat   12       Very bad    uanni addant̃ ad aunos am̃s cabłam aiomsn ita uidt
lat   13       Very bad    ngeũ drãs mudtistã
lat   13       Very bad    fit cum eo emplin ypocondrus
lat   13       Very bad    mauuse ⁊ de ala mri nream
lat   14       Very bad    G terie eni ĩtuiueẽ sã
lat   14       Very bad    ni quo coła nig ꝯsurg
lat   14       Very bad    uino dt᷑. uł nat
lat   15       Very bad    duabus ncibus ¶ uel syr ma-
lat   15       Very bad    Orano mlelll pam cess aa qra mfenus
lat   15       Very bad    Poleuae me oblous ons

       Table 5: Examples of HTR prediction on unseen documents and their classification by the model.
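   To reproduce a labelling comparable to Table 5 against available ground truth, one can compute
a line-level character error rate (CER, the edit distance normalised by the length of the reference)
and bin it into the four labels above. The sketch below does exactly that; the 10/25/50% cut-offs
are illustrative assumptions, not necessarily the exact boundaries the classifier was trained on.

    def levenshtein(a: str, b: str) -> int:
        """Character-level edit distance (classic dynamic programming)."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (ca != cb),  # substitution
                ))
            previous = current
        return previous[-1]

    def cer_label(prediction: str, reference: str) -> str:
        """Map a line's CER to one of the four labels; cut-offs are illustrative."""
        cer = levenshtein(prediction, reference) / max(len(reference), 1)
        if cer < 0.10:
            return "Good"
        if cer < 0.25:
            return "Acceptable"
        if cer < 0.50:
            return "Bad"
        return "Very bad"

    # Hypothetical ground-truth line, used purely for illustration.
    print(cer_label("Se il lẽ set dire nouele", "Se il l'en set dire novele"))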




Lang   Encoder     Good                Acceptable          Bad                 Very bad
                   Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
Yes    Attention   63.35      40.36    43.04      26.32    49.66      47.70    75.33      96.25
Yes    Attention   65.61      31.47    41.80      32.35    51.81      51.75    77.74      95.41
Yes    Attention   63.32      51.27    49.17      18.27    47.84      45.95    71.96      95.99
Yes    Attention   66.00      41.88    48.45      33.90    52.85      56.78    80.47      94.51
Yes    Attention   68.27      43.15    44.01      22.76    44.67      43.98    73.20      95.48
Yes    TextCNN     59.06      38.07    44.31      11.46    40.16      32.17    64.50      97.87
Yes    TextCNN     54.70      25.13    38.16      22.45    44.88      44.09    72.30      95.41
Yes    TextCNN     51.54      42.39    44.37      20.74    45.34      27.13    64.88      97.61
Yes    TextCNN     60.23      26.14    39.78      27.40    45.54      43.00    72.81      95.16
Yes    TextCNN     63.35      25.89    43.24      22.29    42.40      33.26    65.95      97.61
Yes    BiLSTM      62.33      46.19    44.07      12.07    44.10      51.09    74.11      94.51
Yes    BiLSTM      71.75      32.23    41.29      30.80    50.41      53.94    78.07      94.06
Yes    BiLSTM      60.56      55.33    47.55      21.05    49.94      48.14    74.18      94.64
Yes    BiLSTM      73.53      19.04    36.58      30.80    47.77      58.53    81.51      91.41
Yes    BiLSTM      66.82      37.31    39.26      14.71    43.42      47.26    72.96      96.38
No     Attention   56.05      44.67    45.02      30.80    49.16      44.75    76.96      95.16
No     Attention   61.79      33.25    43.23      43.50    54.51      58.21    86.30      92.76
No     Attention   57.00      43.40    45.66      34.21    50.41      47.16    78.53      94.51
No     Attention   58.57      41.62    44.40      38.08    54.50      53.72    82.40      94.06
No     Attention   56.97      36.29    42.20      31.42    51.34      46.06    75.85      95.54
No     TextCNN     54.51      36.80    41.15      26.63    47.59      43.22    74.30      95.41
No     TextCNN     47.37      38.83    40.10      26.01    49.14      47.16    77.48      94.25
No     TextCNN     54.71      38.32    41.80      28.02    49.66      55.58    81.19      92.83
No     TextCNN     48.31      39.85    39.09      28.02    50.12      47.26    78.94      94.44
No     TextCNN     49.35      38.32    39.04      15.17    46.45      46.50    72.71      95.35
No     BiLSTM      70.79      31.98    42.90      24.30    47.44      55.80    78.07      94.96
No     BiLSTM      57.60      36.55    42.23      35.76    52.28      52.63    80.84      93.22
No     BiLSTM      60.25      36.55    41.84      28.17    51.37      57.44    80.40      93.80
No     BiLSTM      56.69      49.49    42.90      40.71    54.52      47.48    82.85      93.60
No     BiLSTM      56.17      43.91    44.47      26.78    49.12      48.58    77.64      95.35
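   Per-class precision and recall such as those reported above can be recomputed from gold and
predicted labels, for instance with scikit-learn. The snippet below is a minimal illustration with
placeholder label lists; scikit-learn is an assumed convenience here, not a stated dependency of
this work.

    from sklearn.metrics import precision_recall_fscore_support

    LABELS = ["Good", "Acceptable", "Bad", "Very bad"]

    # Placeholder gold and predicted classes, one entry per line.
    y_true = ["Good", "Bad", "Very bad", "Acceptable", "Good"]
    y_pred = ["Good", "Very bad", "Very bad", "Bad", "Good"]

    precision, recall, _, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, zero_division=0)
    for name, p, r in zip(LABELS, precision, recall):
        print(f"{name:>10}: precision={100 * p:.2f} recall={100 * r:.2f}")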




Figure 6: “Bad transcriptions” CER violin plot, per manuscript. Most manuscripts show a strong
enough diversity of CER to train upon.
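   A per-manuscript CER violin plot in the spirit of Figure 6 can be drawn as follows; the
DataFrame content is a placeholder, and seaborn/matplotlib are illustrative choices rather than
the documented stack of this paper.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Placeholder data: one row per predicted line, with its manuscript id
    # and its estimated CER.
    df = pd.DataFrame({
        "manuscript": ["ms1"] * 4 + ["ms2"] * 4,
        "cer": [0.08, 0.22, 0.41, 0.65, 0.12, 0.33, 0.52, 0.71],
    })
    ax = sns.violinplot(data=df, x="manuscript", y="cer", cut=0)
    ax.set_ylabel("CER")
    plt.tight_layout()
    plt.savefig("cer_violin.png", dpi=150)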



