Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts

Thibault Clérice, Centre Jean Mabillon, École nationale des Chartes, & INRIA

Abstract
As more and more projects openly release ground truth for handwritten text recognition (HTR), we expect the quality of automatic transcription to improve on unseen data. Getting models robust to scribal and material changes is a necessary step for specific data mining tasks. However, evaluation of HTR results requires ground truth against which predictions can be compared statistically. In the context of modern languages, successful attempts to evaluate quality have been made using lexical features or n-grams. This, however, proves difficult in the context of the spelling variation that both Old French and Latin exhibit, even more so for sometimes heavily abbreviated manuscripts. We propose a new method based on deep learning where we attempt to categorize each line's error rate into four ranges (0-10%, 10-25%, 25-50%, 50-100%) using three different encoders (GRU with attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach using early-stopped models, as an alternative to rule-based fake predictions. Our model largely outperforms the n-gram approach. We also provide an example application to qualitatively analyse our classifier, using classification of new predictions on a sample of 1,800 manuscripts ranging from the 9th century to the 15th.

Keywords: HTR, OCR Quality Evaluation, Historical languages, Spelling Variation

1. Introduction
Handwritten Text Recognition (HTR) technologies have come a long way over the last five years, to the point where data mining of medieval manuscripts and HTR-supported critical editions are becoming less rare nowadays, thanks in part to the user-friendliness of interfaces such as Transkribus [23] and eScriptorium [26]. HTR, however, often shows limits in its ability to adapt to other scribes or periods, as it seems to fit specific scripts and languages. For example, Schoen and Saretto [38] have shown that a model trained over 1,330 lines of the 15th-century manuscript CCC 198 (Oxford, Corpus Christi College 198) produces around 8.73% CER over test lines of the same manuscript, drops to 14% on the same text in another manuscript from the same decade, and can reach a CER as high as 73.23% for a manuscript of a different text (Oxford, Corpus Christi College 201), even though it is at most 20 years "younger" and in the same language.

CHR 2022: Computational Humanities Research Conference, December 12-14, 2022, Antwerp, Belgium. Contact: thibault.clerice@chartes.psl.eu, https://github.com/ponteineptique, ORCID 0000-0003-1852-9204 (T. Clérice). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

In order to evaluate the consistency of a model on an out-of-domain document such as another manuscript or a new hand, researchers usually have to create new ground-truth transcriptions to which the model predictions are compared.
In this context, it seems out of reach to leverage with confidence, for statistical studies, the amount of data that remains dormant in the open data vaults of libraries such as the Bibliothèque nationale de France (BnF): the 50,149 IIIF manifests catalogued by Biblissima's portal [18] are promising but leave a bitter taste of unavailability, as evaluation would require the manual transcription of at least a few hundred lines for each manuscript. (Five million lines would be required for the mentioned set of BnF manifests at only 100 lines per manuscript; as a comparison point, the accumulated number of manuscript lines publicly available on the HTR-United catalog [7], regardless of script or language, stood at 164,418 at the end of August 2022.)

To address this, we can approach the issue not as an HTR problem but rather as a Natural Language Processing (NLP) task, evaluating the apparent "correctness" of the acquired text rather than its direct relationship with the digital picture of the manuscript. Evaluating new transcriptions without ground truth has been done before, but mainly for OCR and non-historical documents. For modern languages, where spelling is fixed and grammar stable, a dictionary approach in combination with some n-gram statistics has provided a solid framework for establishing the probability that a document is well transcribed. However, for languages such as Old French or Medieval Latin, both evolving over the span of a few centuries, the issue is different. For example, Camps, Clérice, Duval, Ing, Kanaoka, and Pinche [3] have catalogued 36 forms of the word cheval (horse) in the largest available Old French corpus. A dictionary approach would already prove complex, but to make things worse, the abbreviated nature of medieval texts would require taking into account several abbreviation systems, making it unsustainable.

HTR is most often, in the humanities, not a task in itself but rather a preliminary step for corpus building (such as digital editions) or corpus analysis. In this context, HTR quality can be of primordial importance, depending on the task at hand. While Eder [16] has suggested that good classification in stylometry is still possible for corpora with noise levels as high as 20%, even for the smallest feature sets, Camps, Vidal-Gorène, and Vernet [4] demonstrated that, for HTR, noise leads to accumulating errors throughout post-processing (word segmentation, abbreviation resolution, lemmatization and POS-tagging), making the post-processed textual features less reliable than the original character n-grams. Some other tasks, such as corpus linguistics (e.g. semantic drift studies), the study of abbreviation systems such as the one performed by Honkapohja and Suomela [22], or even the training of large language models such as MacBERTh [28], might require a higher level of precision. As such, evaluating the textual quality of an automatic transcription "from afar" is extremely useful, as it provides solid grounds either to exclude documents from analysis or to help guide ground-truth creation campaigns in well-funded projects. For cultural heritage institutions, it can also provide a welcome indicator of which documents could be ingested by a search engine. We can even imagine situations where these institutions transcribe only a sample of each element of their collection, and only fully and automatically transcribe the ones that reach a certain level of quality, thus saving energy and ultimately budget on the computation front.

From a human reader's perspective, Springmann, Fink, and Schulz [39] and Holley [21] have set a limit of a CER below 10% for good OCR quality.
Recently, Cuper [15] has proposed the evaluation of OCR quality for heritage text collections, specifically Dutch newspapers from the 17th century, to distinguish good OCR from bad, using the aforementioned threshold. They provide a tool, QuPipe, which offers binary classification capacities, putting text either in the [0; 10]% CER range or in the remaining range of "bad" OCR. In 2022 as well, Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40] addressed this issue regarding HTR of cultural heritage documents, specifically from the 16th century. They provide a strong argument for using lexical features and (pseudo-)perplexity scores for HTR quality estimation, with the specific limitation that the texts they studied, 16th-century Latin correspondence, do not provide as much variation as older languages such as historical German. We also note that correspondence may be less abbreviated, and that this dataset spans a very short period. In parallel to these, Clausner, Pletschacher, and Antonacopoulos [8] approached the problem from a global perspective, from segmentation to OCR, and proposed supervised classification methods.

In this paper, we address this issue as a supervised classification task, based on a dataset of around 50,000 lines of ground truth spanning from the 9th through to the 15th century. Following the conclusion of Cuper [15], we increase the number of categories we want to find: we distinguish Good ([0, 10)%), Acceptable ([10, 25)%), Bad ([25, 50)%), and Very Bad (≥ 50%) error rates. This provides a more fine-grained evaluation of the transcription and allows for guided transcription campaigns, by addressing either the low-hanging fruits (Acceptable) or the rotten ones. We evaluate three kinds of basic architectures (GRU with attention, BiLSTM and TextCNN) on line classification using real-life "bad" transcriptions and precomputed CER scores. The resulting models have shown promising results, with quality levels such as Very Bad and Good being well recognized. In order to evaluate the models and showcase their usefulness, we also provide an example of a real-life classification application, where 1,800 manuscripts were randomly selected from the BnF and classified by our best model.

In summary, the contributions of this paper are:
1. a new approach for HTR evaluation of historical languages with variable spellings;
2. a new method to produce ground truth for OCR evaluation that does not rely on artificially and manually tuned generation;
3. an initial evaluation of the output and a quick glance at the state of HTR for Old French and Medieval Latin over six centuries.

The remainder of this paper is organised as follows. We start by addressing the background in Section 2, specifically regarding the specifics of Old French and Medieval Latin and the idea of readability. In Section 3, we describe the HTR datasets we used and their particularities. In Section 4, we describe the architecture of the models, their feature engineering and the process behind the generation of bad predictions. In Section 5, we describe the set-up of our model selection and evaluation. Finally, in Section 6, we analyse the results both on the dataset produced ad hoc (described in Sections 3 and 4) and on completely unseen documents from the BnF, to showcase the capacities of such models.
2. Background and Related Work
Handwritten Text Recognition, a sibling or sub-task of Optical Character Recognition, aims at recognising text from digitised manuscripts. In the last five years, the digital humanities landscape has seen a surge in HTR engines, as well as transcription interfaces that connect and work well with these engines, from the dominant Transkribus [23] to the open-source pair of eScriptorium [26] and Kraken [25]. To be able to recognize text, users have to provide models, which are themselves the result of supervised training on ground truth data (human-provided transcriptions).

Printed books have been, over the last few decades, the focus in terms of remediation, from their analogue form to a digitized picture and finally to a machine-readable (and human-searchable) text. With the advances in HTR over the last five years, the focus can now shift to, or be shared with, materials that have, for the most part, remained inaccessible from a digital point of view, except as pictures. Latin manuscripts are present during the whole period of manuscript production in western Europe. Literary Old French manuscripts exist from the 12th century onward, with only a hundred known surviving manuscripts from the 12th century [5]. Over the span of these seven centuries, multiple forms of handwritten scripts have existed, for both French and Latin. As an example, the 2016 ICFHR Competition on the Classification of Medieval Handwritings in Latin Script [14] provided ground truth for the classification of 12 main families, of which at least six are represented in our datasets. This diversity makes training models for HTR quite complex but also a reachable goal, as literary manuscripts in particular tend to be more readable and stable between different hands.

Medieval French and Latin present both dialectal and scriptural variation in synchrony, on top of diachronic evolution. Old French's syntax varies chronologically and geographically. The spelling is simply variable. While Latin shows some level of variation, it differs from Old French mostly in its higher rate of abbreviation. These observations are limited to the context of the datasets at hand, which are literary works (including scholastic, theological and medical works). The Old French CREMMA Medieval dataset [34] has 0.97% of horizontal tildes and 0.16% of vertical ones, which are markers used in the dataset guidelines to indicate various similar abbreviation diacritics [35]. Using the same guidelines, the CREMMA Medieval Latin dataset shows rates of 5.63% and 1.52% for the same characters. This difference could be due to the nature of the transcribed texts.

The question of abbreviation and the specificity of medieval literary manuscripts has provoked many discussions in terms of how to transcribe documents, from a completely "diplomatic" approach with variants of letters to "semi-diplomatic" approaches. In the last year, three authors have provided guidance or thoughts around guidelines for transcriptions: Pinche [35] focusing on Old French, Schoen and Saretto [38] on Middle English, and Guéville and Wrisley [20] on Latin. The CREMMA guidelines have been used by 5 other datasets for a total of 1.15 million characters over fifty manuscripts, which makes them the most diverse and comprehensive ones for HTR of medieval manuscripts in Latin and Old French.

The most traditional metrics for HTR and OCR are Word Error Rate (WER) and Character Error Rate (CER).
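To make the metric concrete, the following minimal sketch (ours, not the paper's code) computes CER as the Levenshtein edit distance between a predicted line and its ground truth, normalised by the ground-truth length:

```python
def levenshtein(pred: str, truth: str) -> int:
    """Minimal edit distance: character insertions, deletions and substitutions."""
    previous = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        current = [i]
        for j, t in enumerate(truth, start=1):
            current.append(min(
                previous[j] + 1,             # deletion of p
                current[j - 1] + 1,          # insertion of t
                previous[j - 1] + (p != t),  # substitution (free if p == t)
            ))
        previous = current
    return previous[-1]


def cer(pred: str, truth: str) -> float:
    """Character Error Rate: edit operations over the ground-truth length."""
    return levenshtein(pred, truth) / max(len(truth), 1)


# A single wrong character in a 12-character line yields a CER of about 8.3%:
# cer("Bra on de ql", "ura on de ql") == 1 / 12
```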
WER proves to be complicated to apply in Old French and Medieval Latin, as spaces in medieval manuscripts tend to vary in size or simply be nonexistent from a modern perspective, relying on the knowledge of the reader, or the ability of NLP models [9], to separate words. CER works well, with the limitation that spaces are often the first source of mistakes. CER corresponds to the sum of character insertions, removals and replacements over the total number of characters, thus providing a fine-grained metric. As mentioned in the introduction, both CER and WER require ground truth; other metrics are currently discussed as alternatives, such as the (pseudo-)perplexity or lexical measures proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40]. The other approach to evaluating quality without ground truth is to predict a class of CER, as in the work done by Bazzo, Lorentz, Suarez Vargas, and Moreira [1]. These approaches rely on features such as n-grams, word statistics and language classifier outputs, which are difficult to leverage in the present context. In order to train their classifiers, Bazzo, Lorentz, Suarez Vargas, and Moreira [1] and Nguyen, Jatowt, Coustaty, and Doucet [32] engineered bad predictions by creating rules to reproduce the most common errors in OCR, such as "rn" becoming "m". These bad predictions are then fed to their model along with the metrics both papers want to predict.

Nguyen, Jatowt, Coustaty, and Doucet [32] provide an innovative approach to the issue of noise in OCR by shifting from a CER/WER problem to a readability one: if the reader "can reod a txt with miffpelling" without having to refer back to the picture, at least one of the goals of OCR has been achieved. As simply put by Martinc, Pollak, and Robnik-Šikonja [30], "Readability is concerned with the relation between a given text and the cognitive load of a reader to comprehend it". It is even more important in the context of handwritten documents, where a somewhat faulty but readable HTR output can be easier for non-specialists to read than the original. In the field of readability assessment, Martinc, Pollak, and Robnik-Šikonja [30] have shown that supervised models perform adequately, while Nguyen, Jatowt, Coustaty, and Doucet [32] have shown that this translates to OCR issues as well. This has not been applied to any medieval dataset that we know of.

3. Dataset
To train different models, we reused the data from various projects, aligned with the same guidelines used by Pinche [35]. Our experiment was made possible by the open release of many projects' datasets, including one MA thesis and one student project [41, 2]. We used the ground truth of the CREMMA [13] and CREMMALab [34] projects, the Rescribe project [42], and the GalliCorpora projects [36, 37], for a total of 42,292 lines (see Table 1). We include one dataset of incunabula, which use graphical shapes similar to those of literary manuscripts (but with more regularity), while also using an abbreviation system.

The datasets present not only two main languages but also many different levels of digitization quality (including old binarization), different kinds of handwriting families, different abbreviation levels and different genres.
For example, while the CREMMA Medieval dataset focuses more on literary texts, specifically hagiographical and chanson de geste texts, the CREMMA Medieval LAT corpus offers theological commentaries and medicinal recipes, each genre having its own specific vocabulary. The dataset in general is skewed towards French and the gothica family of scripts.

Table 1: Training material for our models and our future bad transcription dataset.
Dataset name | Project or company | Coverage | Language | Lines | Characters | Manuscripts
Eutyches | MA Thesis | 850-900 | Latin | 2,828 | 86,832 | 2
Caroline Minuscule | Rescribe | 800-1199 | Latin | 457 | 17,155 | 17
CREMMA Medieval | CREMMALab | 1100-1499 | French | 21,656 | 579,368 | 14
CREMMA-Medieval-LAT | CREMMA | 1100-1599 | Latin | 6,648 | 240,291 | 18
DecameronFR | Homework | 1430-1455 | French | 751 | 19,821 | 1
Données HTR manuscrits du 15e siècle | GalliCorpora | 1400-1499 | French | 5,937 | 169,221 | 11
Incunables du 15e siècle | GalliCorpora | 1400-1499 | French | 7,608 | 244,958 | 13

Figure 1: Example of lines. (a) comes from the GalliCorpora manuscript dataset, (b) from the incunabula one. (c) is drawn from the Eutyches MA Thesis, (d) and (e) from the CREMMA Medieval (French) dataset. (f) and (g) are both taken from the CREMMA Medieval Latin repository.

The transcription guidelines of Pinche [35] provide simplification rules: allographic approaches are forbidden (different shapes of s, such as long s and "modern" s, are not differentiated), macrons and general horizontal-line diacritics over the letters such as tildes are represented by horizontal tildes, any "zigzag" (the official Unicode name for the character U+299A) or similarly shaped form is simplified into a superscript vertical tilde, etc. This allows for simpler transcriptions and also a limited diversity of characters for the machine to learn, satisfying both the human transcriber in terms of the learning curve of the guidelines, and the HTR engine in terms of complexity. Each corpus was passed through the ChocoMufin software [11] using project-specific character translation tables. This software, along with these tables, allows each dataset to be controlled at the character level and adapted to guideline modifications. It also allows project-specific transcription standards to be translated to a more common one, such as Pinche's.

4. Proposed Method
Our goal is to be able to predict a quality class for any HTR output on medieval French and Latin. First, we design a way to generate ground truth for the quality assessment of HTR output. Then, we propose three supervised text-based models, with specific adaptations to handle both languages with a single classifier.

4.1. "Bad Prediction" Ground Truth
In order to train our classification model, we require ground truth material along with a CER class: Good ([0; 10)%), Acceptable ([10; 25)%), Bad ([25; 50)%) and Very Bad (≥ 50%). In order to have real-life errors, and to reproduce the rather difficult-to-predict capacity of a model to confuse certain characters with others in specific settings, we propose a three-step method (a sketch of the binning and filtering of step 3 follows the list):

1. We train Kraken [25] models based on the complete dataset, or on a subset. We voluntarily stop some of the trainings at a very early stage, when the CER on the validation dataset remains high. We also keep one "best" model [12] trained on the full dataset.
2. We run each model on our two biggest and most diverse repositories, CREMMA Medieval and CREMMA Medieval LAT. We also run a model trained on modern and contemporary scripts, Manu McFrench [6], to create garbage-level transcriptions.
3. We evaluate each line's CER and store it alongside the line. We also keep the ground truth, whose CER is estimated as 0. We remove short lines (fewer than 15 characters) and duplicated predictions across models for the same line.
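A minimal illustration of step 3 (the thresholds restate the class boundaries above; the function names, and the exact filtering logic of the project, are assumptions on our part):

```python
def cer_class(cer: float) -> str:
    """Map a line-level CER (as a ratio) to one of the four quality classes."""
    if cer < 0.10:
        return "Good"        # [0, 10)%
    if cer < 0.25:
        return "Acceptable"  # [10, 25)%
    if cer < 0.50:
        return "Bad"         # [25, 50)%
    return "Very Bad"        # >= 50%


def keep_lines(predictions):
    """predictions: iterable of (line_text, cer) pairs produced for one ground-truth line.

    Drops lines shorter than 15 characters and predictions duplicated across models,
    and yields (text, class) training examples.
    """
    seen = set()
    for text, cer in predictions:
        if len(text) < 15 or text in seen:
            continue
        seen.add(text)
        yield text, cer_class(cer)
```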
Regarding the final models used to produce predictions, we have 16 models, allowing for a maximum of 16 versions of each line if none of the models predict the same text (see Table 2 for examples):

1. 4 models trained on the same train and validation dataset as the best one, with validation CERs of 55.9, 28.3, 23 and 20.8% according to Kraken.
2. 5 models trained on the CREMMA Medieval LAT dataset only, from the 1st to the 6th epoch, ranging from 86% to 46% CER.
3. 1 model trained on the Eutyches (Latin, Carolingian of the 9th century) and the Decameron (French, 16th century) datasets, with a 98.5% CER on its validation set.
4. 3 models trained on the CREMMA Medieval (Old French) dataset only, fine-tuned from the Manu McFrench model, from 11% CER down to 8.2%.
5. Manu McFrench, the best model, and the ground-truth data.

These provide variable CER on unseen data from the test sets of both CREMMA datasets, but also on the training and validation sets, as these models did not reach their full capacity during the training phase. After filtering short and repeated predictions, we have access to 322,903 "HTR prediction, CER" pairs (see Figure 6 in the appendix). We then map each line to its CER bin to produce the four established classes.

Table 2: Example of pairs of predictions for the same line for a file of CREMMA Medieval (University of Pennsylvania 660, Le Pélerinage de Mademoiselle Sapience). The first line is the ground truth, the second our best model trained on the full dataset for production, the 4th from the bottom is from Manu McFrench. Note that the diacritics are not consistently transcribed.
Transcription | CER
u̾ra on de q̃l vertu ses petis pies sont que vous | 0.0
Bra on de q̃l vertuses petis pies sont que vous | 6.1
Fra on de q̃l vertuses petis pies sont que vou | 8.2
Bra on de q̃l vertuses petis pies sont que uous | 8.2
Pra on de ql vertuses petis pies sont que dons | 12.2
ura on de q̃l vertu ses petis pies font grre op | 16.3
ura on de q̃l uertu ses petis pies font re dory | 16.3
ura on de ql vertu ses petis pies font itce ir | 20.4
Ard ondegl ratules nus mes sont que ls | 42.9
a on de at etn le peos pes os e | 49.0
a om de ał vrtir sot olisͣ pa sosisinos | 57.1
⁊s cm dec uł vrtr fe pdp̃ pns ots pte | 61.2

4.2. Model Architecture
We applied three model architectures, common to many NLP tasks, with an embedding / sentence encoder / linear classifier structure where only the sentence encoder changes from one model to another (see Figure 2). The embedding layer takes into account special tokens (padding, unknown character, start of line, end of line) and each character according to the Unicode NFD (Normalization Form Canonical Decomposition) normalization of the line, in which characters and their diacritics are separated, e.g. [é] becomes [e]+[´]. The linear layer is a simple (encoding output dimension, class count) decision layer. Each model uses a cross-entropy loss function (code available at https://github.com/PonteIneptique/neNequitia) and reduces its learning rate on plateau, monitoring the validation set's macro-averaged recall. Optimization of the model is done through the Ranger optimizer [43].
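A simplified PyTorch sketch of this shared skeleton follows (a BiLSTM stands in for the interchangeable encoder described below; padding, special tokens, the scheduler and the Ranger optimizer are omitted, and all names are our own):

```python
import unicodedata
import torch
from torch import nn


def to_char_ids(line: str, vocab: dict) -> torch.Tensor:
    """NFD-normalise a line so that base letters and diacritics become separate tokens."""
    chars = unicodedata.normalize("NFD", line)
    unk = vocab.get("<unk>", 1)
    return torch.tensor([[vocab.get(c, unk) for c in chars]])  # shape (1, seq)


class LineQualityClassifier(nn.Module):
    """Character embedding -> interchangeable sentence encoder -> linear decision layer."""

    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128, n_classes: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(4 * hidden, n_classes)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(char_ids)   # (batch, seq, emb_dim)
        states, _ = self.encoder(embedded)    # (batch, seq, 2 * hidden)
        # Sentence vector: first and last positions concatenated, standing in
        # for the BOS/EOS hidden states of the first encoder variant below.
        sentence = torch.cat([states[:, 0], states[:, -1]], dim=-1)
        return self.classifier(sentence)      # logits over the 4 classes
```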
The encoding layer varies between three different forms:

• The first version uses a single BiLSTM network, where the sentence encoding is the concatenation of the hidden states of the start-of-line (BOS) and end-of-line (EOS) tokens.
• The second version follows the architecture of sentence-level attention proposed by Yang, Yang, Dyer, He, Smola, and Hovy [44], using a bidirectional GRU. The encoded sentence vector is the sum of the products of each token's hidden state with its attention weight. Attention is also provided as an output for human interpretation of the results.
• The last one, TextCNN [27], uses the concatenation of the max pooling over each n-gram size (2, 3, 4, 5, 6) taken into account by a convolutional neural network.

Figure 2: Available model architectures. Elements in orange are optional or varying elements, elements in blue are common to all models.

As we deal with two different languages, we added another special token, following the work of Martin, Villemonte de La Clergerie, Sagot, and Bordes [29] and Gong, Bhat, and Viswanath [19]: for each encoding variation, we add one variation of the codec where the first token after the beginning-of-string is a metadata token indicating the language. Thus, a line such as Fra on de q̃l vertuses petis pies sont que vo will be encoded with a language token (here marking Old French) inserted right after the beginning-of-string token.

5. Experimental Setup
In order to avoid lexical bias and to ensure the strength of our analysis, we propose a 5-fold-like experiment, where the train, validation and test subsets are the result of splits across manuscripts. For each K, two French manuscripts and two Latin ones are used for the validation set and the test set, and they differ by at least one manuscript from one K to another, leaving three K completely different (K1, K3, K5; see Table 3). Each test set also contains a Latin manuscript that was not used in any of the HTR model training or validation: Berlin, Hdschr. 25. This manuscript was used across folds to provide a stable reference point for evaluation. Models are then evaluated using class-specific precision and recall, as well as macro-averaged precision and recall. For our baseline, we use the relative frequency of the 2,000 most common n-grams of size 3, 4 and 5 as features and feed them to a linear classifier, with cross-entropy loss and the Adam optimizer. We run each model architecture once for each K, resulting in 7 different configurations including the baseline (presence/absence of the language token for the three encoding modules, plus the baseline). Our whole pipeline uses pandas for data preparation [31], PyTorch [33] for model development, and PyTorch Lightning [17] for the training, evaluation and prediction wrapping.
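For reference, the baseline's feature extraction can be sketched as follows (a minimal illustration under our own naming; the resulting relative-frequency vectors are fed to a single linear layer trained with cross-entropy loss and Adam, as described above):

```python
from collections import Counter


def char_ngrams(line: str, sizes=(3, 4, 5)):
    """All character n-grams of the given sizes present in a line."""
    for n in sizes:
        for i in range(len(line) - n + 1):
            yield line[i:i + n]


def build_vocabulary(train_lines, top_k: int = 2000):
    """The top_k most common n-grams over the training lines."""
    counts = Counter(g for line in train_lines for g in char_ngrams(line))
    return [gram for gram, _ in counts.most_common(top_k)]


def featurise(line: str, vocabulary) -> list:
    """Relative frequency of each vocabulary n-gram within a line."""
    counts = Counter(char_ngrams(line))
    total = sum(counts.values()) or 1
    return [counts[gram] / total for gram in vocabulary]
```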
Table 3: Composition of the K-fold sets, based on manuscript selection.
K | 1 | 2 | 3 | 4 | 5
Validation (French) | BnF fr. 17229, BnF fr. 25516 | BnF fr. 3516, BnF fr. 25516 | BnF fr. 24428, BnF Arsenal 3516 | BnF fr. 24428, BnF fr. 844 | Pennsylvania Codex 909, BnF fr. 844
Validation (Latin) | Arras 861, CCCC Ms 165 | CLM 13027, CCCC 165 | CLM 13027, Montpellier H318 | BnF lat. 6395, Montpellier H318 | BnF lat. 6395, Laur. Plut. 33.31
Test (French) | BnF Arsenal 3516, BnF fr. 13496 | BnF fr. 24428, BnF fr. 411 | BnF fr. 844, BnF fr. 22549 | BnF fr. 412, Phil., Col. of Phys. 10a 13 | Bodmer 168, Vat. reg. lat. 1616
Test (Latin) | Sorbonne Fr. 193, CLM 13027 | CCCC Ms. 236, H318 | BnF lat. 6395, Egerton 821 | BnF fr. 16195, Laur. Plut. 33.31 | Laur. Plut. 53.08, BnF lat. 8236
Train Good | 80,056 | 76,564 | 65,764 | 39,165 | 39,165
Train Acceptable | 44,346 | 41,769 | 34,429 | 35,803 | 35,803
Train Bad | 60,381 | 59,265 | 51,637 | 41,793 | 41,793
Train Very Bad | 71,008 | 71,053 | 60,898 | 52,212 | 52,212
Validation Good | 4,246 | 9,857 | 12,770 | 11,625 | 11,625
Validation Acceptable | 3,933 | 10,377 | 12,496 | 8,492 | 8,492
Validation Bad | 4,338 | 10,884 | 13,430 | 10,250 | 10,250
Validation Very Bad | 4,867 | 15,428 | 18,386 | 11,461 | 11,461
Test Good | 9,165 | 7,046 | 14,933 | 42,677 | 42,677
Test Acceptable | 9,744 | 5,877 | 11,098 | 13,728 | 13,278
Test Bad | 12,763 | 7,333 | 12,415 | 25,439 | 25,439
Test Very Bad | 18,056 | 7,350 | 14,647 | 30,258 | 30,258

Table 4: Test result statistics for each K and each model configuration. For each class, precision (P) and recall (R) are given as mean / median over the five folds.
Lang token | Encoder | Good P | Good R | Acceptable P | Acceptable R | Bad P | Bad R | Very Bad P | Very Bad R
No | Baseline | 33.87 / 35.24 | 33.84 / 33.76 | 36.81 / 37.63 | 7.56 / 8.67 | 37.34 / 37.11 | 19.15 / 18.27 | 60.27 / 59.73 | 97.24 / 97.32
Yes | Attention | 65.31 / 65.61 | 41.62 / 41.88 | 45.29 / 44.01 | 26.72 / 26.32 | 49.36 / 49.66 | 49.23 / 47.70 | 75.74 / 75.33 | 95.53 / 95.48
Yes | BiLSTM | 67.00 / 66.82 | 38.02 / 37.31 | 41.75 / 41.29 | 21.89 / 21.05 | 47.13 / 47.77 | 51.79 / 51.09 | 76.17 / 74.18 | 94.20 / 94.51
Yes | TextCNN | 57.78 / 59.06 | 31.52 / 26.14 | 41.97 / 43.24 | 20.87 / 22.29 | 43.66 / 44.88 | 35.93 / 33.26 | 68.09 / 65.95 | 96.73 / 97.61
No | Attention | 58.08 / 57.00 | 39.85 / 41.62 | 44.10 / 44.40 | 35.60 / 34.21 | 51.98 / 51.34 | 49.98 / 47.16 | 80.01 / 78.53 | 94.41 / 94.51
No | BiLSTM | 60.30 / 57.60 | 39.70 / 36.55 | 42.87 / 42.90 | 31.15 / 28.17 | 50.95 / 51.37 | 52.39 / 52.63 | 79.96 / 80.40 | 94.19 / 93.80
No | TextCNN | 50.85 / 49.35 | 38.43 / 38.32 | 40.24 / 40.10 | 24.77 / 26.63 | 48.59 / 49.14 | 47.94 / 47.16 | 76.92 / 77.48 | 94.46 / 94.44

6. Experiments
6.1. Model Classification Results
The first conclusion we can draw from the experiments is that our models always beat the baseline (see Table 4 and the per-fold breakdown in the appendix). Neither RNN-based architecture clearly beats the other, but TextCNN clearly underperforms. The introduction of the language metadata token helps when detecting Good transcriptions (delta ≈ +7% for attention's median precision, ≤ +1% for the recall) for both RNN-based models. Otherwise, models without a language marker tend to outperform models with language markers; the gap is largest for the Very Bad class, where the delta is up to +6% in favour of models without language tokens (using median precision scores).

Regarding the variability of results, we found that the length of the string had an impact on the prediction, no matter the model architecture. Surprisingly, none of the models withstand long noisy lines: the accuracy of the Very Bad class is inversely correlated with line size. On the contrary, depending on the encoder, some classes benefit from longer strings: Good lines benefit from it with all models except the baseline. TextCNN is the only model whose accuracy on the Bad and Acceptable classes really correlates with line length.

Figure 3: Regression of accuracy based on line length over all 5-fold test sets. The common manuscript (Berlin, Hdschr. 25) is not included.

Finally, for all models except the baseline, the most common confusion is always with the "adjacent" class(es) (see Figure 4). For the classes Acceptable and Bad, which have two neighbours, the error rate is evenly split between them: the class Acceptable tends to be confused with either Good or Bad. This shows the models' ability to understand cleanness or noise, but also shows the limit of these classes: for a line with 50 characters, such as "quãt tel eufaut gist en tes lieu. Derite respoint", 6 mistakes are enough to swing into the Acceptable category (ground truth: "quãt tel enfant gist en tel lieu .
Uerite respon", one space has been removed before the dot).

Overall, with an accuracy for the Good and Very Bad classes around 50% on these languages, and considering that most of the confusions are with adjacent classes (e.g. Good is confused with Acceptable, Acceptable with Good and Bad, etc.), the solution performs well either at filtering out badly read manuscripts or at keeping only the very good ones. The Acceptable class and the Bad class have stable performance in the face of variable line length, although the Acceptable class shows the worst classification performance.

Figure 4: Confusion rate dispersion in the errors made by each model. Only confusions that happen more than 50 times are taken into account, as well as total numbers of errors greater than or equal to 300. The graph can be read as follows: for the baseline, 40% of the errors for the ground truth class Good are Acceptable predictions.

6.2. Application on a Real-World Library Dataset
As a real-world application, we wanted to apply one of our best models to an unseen dataset, in the same way that we envision cultural institutions might use the tool. We describe the set-up for this particular experiment below, and then evaluate the results of the classification model with regard to the capacity of the HTR model; we also study some randomly sampled elements.

6.2.1. Set-up
To evaluate on as much unseen data as possible, we crawled the Biblissima IIIF collection portal [18]. We searched individually for each combination of language (French, Latin) and century (9th to 15th), limiting the number of samples retrieved to 500 manuscripts. We then sampled 10 sequential pictures from each manuscript (note that we speak of pictures rather than pages: in some cases, most commonly digitised microfilms, one picture can contain two pages). To avoid empty pages (which tend to be at the start and the back of each book's digitization or IIIF manifest at the BnF), we take either the first ten pictures of the second decile of the manifest, or pictures 20 to 30 if there are fewer than 100 pictures, or the last 10 if there are fewer than 20 pictures.

Each downloaded sample is then segmented using YALTAi [10] with the included model designed for cultural heritage manuscripts and the base Kraken BLLA segmenter [24]. As YALTAi provides different zones (from the margins to the main body of text) through numbering, we only consider lines that are part of the main text zones, thus excluding any marginalia or paratext. We then use Kraken to predict a transcription for each line, using the best trained model as described in our first experiment. Next, we feed each line to our best BiLSTM model (K-fold 1 has the best recall/precision on Good) while keeping the line metadata: language, century, manuscript identifier, and page identifier. Finally, we provide three different evaluations of the transcriptions. The first is based strictly on the number of lines predicted in each class (Good, Acceptable, etc.). The second is page-based: we take the most common prediction over all lines of a page. The last one is manuscript-based: we take the most common page prediction, using the previous page-based metric.
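The two aggregated views amount to a simple majority vote, as sketched below (ties are broken arbitrarily here, and the names are ours):

```python
from collections import Counter


def majority(labels):
    """Most common label in a sequence (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]


def page_class(line_classes):
    """Page-level prediction: the most common class among the page's lines."""
    return majority(line_classes)


def manuscript_class(pages):
    """Manuscript-level prediction: the most common page-level class.

    `pages` is an iterable of per-page lists of line classes.
    """
    return majority([page_class(lines) for lines in pages])
```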
6.2.2. Evaluation
Overall, the quality predictions produced by our BiLSTM module are in line with the strengths of the HTR model on this dataset (see Figure 5). The model performs extremely well on early manuscripts, thanks to the presence of two datasets of early material in the training data (Eutyches and Caroline Minuscule). It performs well on Old French, except for the 13th century, where Bad predictions are more common. The relative frequency of Very Bad predictions tends to grow as we get closer to the 16th century: from the data we have seen, this could be due to the presence of non-literary manuscripts written in cursive, for which our model has no ground truth.

If we look at the sampled predictions (Appendix, Table 5), most Good predictions seem correct or nearly correct. However, we can see that the metadata from Biblissima and the BnF has some limitations when used automatically, as it can produce problematic results: most 12th-century Acceptable predictions are probably in Latin, which would indicate a multilingual manuscript or a badly catalogued one. This issue also arises in the crawler for the century, as some manuscripts were catalogued as French but with a production date that is before the first known French document: these are most likely multilingual documents, with either a collection of various leaves from earlier manuscripts, or a cataloguing that includes the language used for marginal notes. 3 out of the 6 Acceptable predictions between the 13th and the 14th century are definitely readable and understandable, and we cannot but wonder if the lack of spaces in "q̃ merueilles fu lacitebiengarne mlt" is responsible for its classification as Acceptable rather than Good. We note that at least one Very Bad prediction in French, "OU EtE L. Cheualier de Monifort, son Oncle, Gles", seems rather readable, albeit requiring more corrections than a Good transcription. Latin shows the same trend, being accurate for Good and Acceptable.

Figure 5: Distribution of predictions per line (first two rows), per page (rows 3 and 4), and per manuscript (last rows), broken down by language and century.

7. Conclusion
The ability to filter, without pre-transcribing samples, automated transcriptions of manuscripts in Latin, Old French or any other Western historical language might lead to the production of datasets designed for analyses that rely on better transcriptions, or to guiding cultural heritage institutions and their partners in the production of new ground truth. Producing HTR ground truth does indeed require time, skilled transcribers and, last but not least, budget. However, most current error rate prediction or HTR output analysis models rely on n-gram frequencies and lexical features, two approaches that are often less viable for languages such as Old French, which "suffers" from a highly variable spelling system, or for languages like Latin, which are potentially highly abbreviated, with abbreviations changing even within a single manuscript, depending on the context, the topic and the scribe.

In this context, we chose to treat CER range prediction as a sentence-like classification problem, for which we implemented three basic models, using either a single BiLSTM encoder, an attention-supported GRU, or a TextCNN encoder. These three tools show stronger results than an n-gram based baseline. On top of this, we include a language metadata token, which can improve the reliability of the lowest range of CER (between 0 and 10%, the Good class) while worsening the classification's reliability for the highest range (over 50%, the Very Bad class).
For the purpose of training these models, we propose a new way to generate real-life "bad transcriptions", using early-stopped HTR models or models trained on small samples of data: this provides an alternative to previous rule-based generation of "bad transcription" ground truths. We show that, on a completely unknown dataset of around 1,800 manuscripts analysed with a new HTR model specifically trained on medieval Latin and French, the number of well-transcribed manuscripts predicted is on par with the ground truth for that dataset. The quality assessment predictions provide quick insights for larger collections, and could be run relatively often by cultural heritage institutions.

In the future, hyper-parameter fine-tuning and other encoders could be used in the architecture. Specifically, with more correctly transcribed manuscripts, including the abbreviations in their transcriptions, fine-tuning larger language models could allow the application of (pseudo-)perplexity ranking such as the one proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40], while allowing for partial noise in the training data. We hope to see such classification of manuscripts used by ground truth producers in order to enhance the robustness of openly available HTR models.

Acknowledgments
I want to thank Jean-Baptiste Camps, Ariane Pinche and Malamatenia Vlachou-Efstathiou for their constant feedback and replies to some particular questions regarding manuscripts or HTR data. Many thanks to Ben Nagy for his proof-reading of the pre-print version. This work was funded by the Centre Jean Mabillon and the DIM MAP (https://www.dim-map.fr/projets-soutenus/cremmalab/).

References
[1] G. T. Bazzo, G. A. Lorentz, D. Suarez Vargas, and V. P. Moreira. "Assessing the Impact of OCR Errors in Information Retrieval". In: Advances in Information Retrieval. Ed. by J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, and F. Martins. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 102-109. doi: 10.1007/978-3-030-45442-5_13.
[2] S. Biay, V. Boby, K. Konstantinova, and Z. Cappe. TNAH-2021-DecameronFR. 2022. doi: 10.5281/zenodo.6126376. url: https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-DecameronFR.
[3] J.-B. Camps, T. Clérice, F. Duval, L. Ing, N. Kanaoka, and A. Pinche. "Corpus and Models for Lemmatisation and POS-tagging of Old French". 2022. url: https://halshs.archives-ouvertes.fr/halshs-03353125.
[4] J.-B. Camps, C. Vidal-Gorène, and M. Vernet. Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches. May 2021. url: https://hal-enc.archives-ouvertes.fr/hal-03279602.
[5] M. Careri, C. Ruby, and I. Short. Livres et écritures en français et en occitan au XIIe siècle: catalogue illustré. Viella, 2011. 274 pp.
[6] A. Chagué and T. Clérice. HTR-United - Manu McFrench V1 (Manuscripts of Modern and Contemporaneous French). Version 1.0.0. 2022. doi: 10.5281/zenodo.6657809. url: https://doi.org/10.5281/zenodo.6657809.
[7] A. Chagué and T. Clérice. HTR-United: Ground Truth Resources for the HTR and OCR of patrimonial documents. 2022. url: https://htr-united.github.io.
[8] C. Clausner, S. Pletschacher, and A. Antonacopoulos. "Quality Prediction System for Large-Scale Digitisation Workflows". In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). 2016, pp. 138-143. doi: 10.1109/das.2016.82.
[9] T. Clérice.
“Evaluating Deep Learning Methods for Word Segmentation of Scripta Con- tinua Texts in Old French and Latin”. In: Journal of Data Mining & Digital Humanities 2020 (2020). doi: 10.46298/jdmdh.5581. url: https://jdmdh.episciences.org/6264. [10] T. Clérice. “You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine”. 2022. url: https://hal-enc.ar chives-ouvertes.fr/hal-03723208. [11] T. Clérice and A. Pinche. Choco-Mu昀椀n, a tool for controlling characters used in OCR and HTR projects. Comp. so昀琀ware. Version 0.0.4. 2021. doi: 10 . 5281 / zenodo . 5356154. url: https://github.com/PonteIneptique/choco-mufin. [12] T. Clérice, A. Pinche, and M. Vlachou-Efstathiou. Generic CREMMA Model for Medieval Manuscripts (Latin and Old French), 8-15th century. Version 1.0.0. 2022. doi: 10.5281/zen odo.7234166. url: https://doi.org/10.5281/zenodo.7234166. [13] T. Clérice, M. Vlachou Efstathiou, and A. Chagué. CREMMA Manuscrits médiévaux latins. Ed. by A. Chagué and T. Clérice. 2022. url: https://github.com/HTR-United/CREMMA- Medieval-LAT. [14] F. Cloppet, V. Eglin, V. C. Kieu, D. Stutzmann, and N. Vincent. “ICFHR2016 Competi- tion on the Classi昀椀cation of Medieval Handwritings in Latin Script”. In: 2016 15th Inter- national Conference on Frontiers in Handwriting Recognition (ICFHR). 2016 15th Interna- tional Conference on Frontiers in Handwriting Recognition (ICFHR). Shenzhen, China: Ieee, Oct. 2016, pp. 590–595. doi: 10.1109/icfhr.2016.0113. url: http://ieeexplore.ieee.or g/document/7814129/. [15] M. Cuper. “Examining a Multi Layered Approach for Classi昀椀cation of OCR Quality with- out Ground Truth”. In: DH Benelux Journal (2022), p. 17. 16 [16] M. Eder. “Mind your corpus: systematic errors in authorship attribution”. In: Literary and Linguistic Computing 28.4 (Dec. 1, 2013), pp. 603–614. doi: 10.1093/llc/fqt039. url: https://doi.org/10.1093/llc/fqt039. [17] W. Falcon and The PyTorch Lightning team. PyTorch Lightning. Comp. so昀琀ware. Ver- sion 1.4. 2019. doi: 10.5281/zenodo.3828935. url: https://github.com/Lightning-AI/light ning. [18] E. Frunzeanu, E. MacDonald, and R. Robineau. “Biblissima’s Choices of Tools and Methodology for Interoperability Purposes”. In: CIAN. Revista de historia de las universi- dades 19.1 (2016), pp. 115–132. [19] H. Gong, S. Bhat, and P. Viswanath. “Enriching Word Embeddings with Temporal and Spatial Information”. In: Proceedings of the 24th Conference on Computational Natural Language Learning. Online: Association for Computational Linguistics, 2020, pp. 1–11. doi: 10.18653/v1/2020.conll-1.1. url: https://aclanthology.org/2020.conll-1.1. [20] E. Guéville and D. J. Wrisley. “Transcribing Medieval Manuscripts for Machine Learning”. In: arXiv preprint arXiv:2207.07726 (2022). url: https://arxiv.org/abs/2207.07726. [21] R. Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs”. In: D-Lib Magazine 15.3/4 (2009). [22] A. Honkapohja and J. Suomela. “Lexical and function words or language and text type? Abbreviation consistency in an aligned corpus of Latin and Middle English plague tracts”. In: Digital Scholarship in the Humanities 37.3 (2021), pp. 765–787. doi: 10.1093/llc/fqab007. url: https://doi.org/10.1093/llc/fqab007. [23] P. Kahle, S. Colutto, G. Hackl, and G. Mühlberger. “Transkribus-a service platform for transcription, recognition and retrieval of historical documents”. 
In: 2017 14th IAPR In- ternational Conference on Document Analysis and Recognition (ICDAR). Vol. 4. Ieee. 2017, pp. 19–24. [24] B. Kiessling. “A modular region and text line layout analysis system”. In: 2020 17th Inter- national Conference on Frontiers in Handwriting Recognition (ICFHR). Ieee. 2020, pp. 313– 318. [25] B. Kiessling. The Kraken OCR system. Comp. so昀琀ware. Version 4.1.2. 2022. url: https://k raken.re. [26] B. Kiessling, R. Tissot, P. Stokes, and D. S. B. Ezra. “eScriptorium: an open source platform for historical document analysis”. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. Ieee. 2019, pp. 19–19. [27] Y. Kim. “Convolutional Neural Networks for Sentence Classi昀椀cation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1746–1751. doi: 10.3115/v1 /D14-1181. url: https://aclanthology.org/D14-1181. [28] E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical Languages”. In: Journal of Data Mining and Digital Humanities Nlp4dh (2022). doi: 10.4 6298/jdmdh.9152. url: https://hal.inria.fr/hal-03592137. 17 [29] L. Martin, É. Villemonte de La Clergerie, B. Sagot, and A. Bordes. “Controllable Sentence Simpli昀椀cation”. In: LREC 2020 - 12th Language Resources and Evaluation Conference. Mar- seille, France, 2020. url: https://hal.inria.fr/hal-02678214. [30] M. Martinc, S. Pollak, and M. Robnik-Šikonja. “Supervised and Unsupervised Neural Ap- proaches to Text Readability”. In: Computational Linguistics 47.1 (Apr. 21, 2021), pp. 141– 179. doi: 10.1162/coli\_a\_00398. url: https://doi.org/10.1162/coli%5C%5Fa%5C%5F0039 8. [31] W. McKinney et al. “pandas: a foundational Python library for data analysis and statis- tics”. In: Python for high performance and scienti昀椀c computing 14.9 (2011), pp. 1–9. [32] H. T. T. Nguyen, A. Jatowt, M. Coustaty, and A. Doucet. “ReadOCR: A Novel Dataset and Readability Assessment of OCRed Texts”. In: International Workshop on Document Analysis Systems. Springer. 2022, pp. 479–491. [33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8024–8035. url: http://papers .neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning- library.pdf. [34] A. Pinche. Cremma Medieval. 2022. doi: 10.5281/zenodo.5235185. url: https://github.co m/HTR-United/cremma-medieval. [35] A. Pinche. “Guide de transcription pour les manuscrits du Xe au XVe siècle”. 2022. url: https://hal.archives-ouvertes.fr/hal-03697382. [36] A. Pinche, S. Gabay, N. Leroy, and K. Christensen. Données HTR incunables du 15e siècle. 2022. url: https://github.com/Gallicorpora/HTR-incunable-15e-siecle. [37] A. Pinche, S. Gabay, N. Leroy, and K. Christensen. Données HTR manuscrits du 15e siècle. 2022. url: https://github.com/Gallicorpora/HTR-MSS-15e-Siecle. [38] J. Schoen and G. E. Saretto. “Optical Character Recognition (OCR) and Medieval Manuscripts: Reconsidering Transcriptions in the Digital Age”. 
In: Digital Philology: A Journal of Medieval Cultures 11.1 (2022), pp. 174–206. doi: 10.1353/dph.2022.0010. url: https://muse.jhu.edu/article/853521. [39] U. Springmann, F. Fink, and K. U. Schulz. Automatic quality evaluation and (semi-) auto- matic improvement of OCR models for historical printings. Oct. 20, 2016. doi: 10.48550/ar Xiv.1606.05157. url: http://arxiv.org/abs/1606.05157. [40] P. B. Ströbel, S. Clematide, M. Volk, R. Schwitter, T. Hodel, and D. Schoch. “Evaluation of HTR models without Ground Truth Material”. In: arXiv preprint arXiv:2201.06170 (2022). url: http://arxiv.org/abs/2201.06170. [41] M. Vlachou-Efstathiou. Voss.Lat.O.41 - Eutyches ”de uerbo” glossed. 2022. url: https://git hub.com/malamatenia/Eutyches. 18 [42] N. White, A. Karaisl, and T. Clérice. Caroline Minuscule by Rescribe. Ed. by A. Chagué and T. Clérice. 2022. url: https://github.com/rescribe/carolineminuscule-groundtruth. [43] L. Wright. New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of… Medium. Sept. 4, 2019. url: https://medium.com/%5C@less w/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead- for-the-best-of-2dc83f79a48d. [44] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. “Hierarchical Attention Net- works for Document Classi昀椀cation”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies. Proceedings of the 2016 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies. San Diego, Cali- fornia: Association for Computational Linguistics, 2016, pp. 1480–1489. doi: 10.18653/v 1/N16-1174. url: http://aclweb.org/anthology/N16-1174. A. Appendix The so昀琀ware has been archived at the following address: https://doi.org/10.5281/zenodo.723 3984. A good chunk of the data is available here: https://github.com/PonteIneptique/neNequ itia/releases/tag/chr2022-release. Manuscripts metadata and the predictions in XML ALTO formats for Section 6 are available at https://doi.org/10.5281/zenodo.7234399. The same repository contains also the XML data for training the classi昀椀er. Lang Century Prediction Transcription 昀爀o 12 Good ra monstre de couf de uoudenay 昀爀o 12 Good uucuns pair er que Iehuz de le chaulre le ieuue uor de teuteur dicelui office oy a este priuoz et de loucez pour msen et acaise de cereus cu ⁊ decpa 昀爀o 12 Good seriant estoit exilliez en laueniance de sacolpe. li poures 昀爀o 13 Good les cõdurroit car il sauoit crop bien. coz lespas. ⁊ 昀爀o 13 Good Se il lẽ set dire nouele 昀爀o 13 Good tiseras. ⁊ ieres sempres amendes. ⁊ en un au 昀爀o 14 Good Procureur du Roi du mẽme iour qui ne l empeche. lOrdon. 昀爀o 14 Good sonpere tous les rodais et les tartcites 昀爀o 14 Good sacies qͥ l nestoit deriens i cant desirans ꝯme de 昀爀o 15 Good quil ne lui celast mit ains lui deist qui 昀爀o 15 Good miere pour ce que par la renue de cest 昀爀o 15 Good sauoit puis fait. Et il lui cõte cõmet̾ lat 9 Good cumppriae accipit᷑ tab naculum belli res est. Adtem pus enim cumd- abolo dimicamꝰ. & tunc opusẽ lat 9 Good aut̃ comminati : miscrunteuminex lat 9 Good ce Detempore ordinat ionum. lat 10 Good epm̃ ñaccipiant xccraxui a gererĩ ꝓcur auerint lat 10 Good babeat᷑. sicastigat: psatis faccionẽ uenia ab epo noluerit ꝓmereri. 
lat 10 Good prima creatrix : posterior lat 11 Good cer cũ 昀爀atrib in labore manuu lat 11 Good la tricem illã uiris armisq nobilẽ hispadua: illam semi 19 Lang Century Prediction Transcription lat 11 Good ꝓ motionẽ dare debebit Postumianus ep̃ s dixit: lat 12 Good minus.¶ Vmmasculo ñ cõmisceb̾is contu femineo: qͥ a lat 12 Good non tenuerit eccłiastice ficlei caritatisq cõ lat 12 Good que fuerant futura damnantur. Deinde si eisad ꝑcipiendũ bapti lat 13 Good diei. q̃ sitꝰ a ꝑentib infans inuentꝰ est ⁊ sublatꝰ defouea obnolutꝰ ceno⁊ lat 13 Good fit. nͥ adumantibo utust lac̾ tis q st̾ it᷑ costa lat 13 Good sub sarcinis adoriri. Qua pulsa inpedimentisq direptis. futurũ lat 14 Good seꝙ. dicã i vit̾ . fiałs et̾ lat 14 Good do rerũ. Que disciplina: Que grã lat 14 Good ut̾ usque ⁊ siauł̾ deiusto ⁊ułto tubł̾iliau quostã ꝑtrła lat 15 Good a tlium ꝯsilus extraneus aud̾eat discre pare lat 15 Good p̾parata pena.S qd cica : Duodici fatemur xpm̃ apostolos habuis lat 15 Good absoloe oñm et c̾ ꝑ ꝓcessũ aut et tͥ lu q̾om et c̾ Ncessus de ca̾i et pre uacãtib 昀爀o 12 Acceptable hoc michi uircus caritacis ex 昀爀o 12 Acceptable poue sg̃ uis not arcã de roy nosta siro l gñt dixur de gendy 昀爀o 12 Acceptable uideliet ꝙ. Vluifxix ꝑ iii obo ł ddẽ debił monsõ daentũ erignita t qͣ n- decim ẜ derẽ d̾ cuo̾ foreꝭ pn hune medum ui 昀爀o 13 Acceptable q̃ merueilles fu lacitebiengarne mlt 昀爀o 13 Acceptable eceual ꝯmanda a .i. desgrũs baillies 昀爀o 13 Acceptable ⁊ aumang̾ loea lonseigne inporcee 昀爀o 14 Acceptable en excepter aucuns : quĩ dit les aroits, sans en excepter aucuns, dir tous 昀爀o 14 Acceptable beancoipe ⁊ de nofimeeeEt chastellaus du chosirur diu hur d ursarce Confe sfout anen en ilirur 昀爀o 14 Acceptable ¶ Oedee est alber de chyam de 昀爀o 15 Acceptable grans coupz sur leitargt du foy des orgueilleux 昀爀o 15 Acceptable cau en ueritayꝭ cest grant ⁊ Iouff 昀爀o 15 Acceptable nophanes eracleopolites qͥ ceste lat 9 Acceptable septies. sedusque septuagies septies. lat 9 Acceptable aestuat. Dehac rcriptũ. ẽ : lat 9 Acceptable to hostem patriae redire iubet ad propria. Iune lat 10 Acceptable bilis sit deuotio. Consttt qu uram dilec tione magna remune lat 10 Acceptable sustinebt̃ salus aut̃ mea insẽpit̃ nũ crit. lat 10 Acceptable sorac cae plũr tm̃ ut ħierusolima. quasi ut hic narrabo plũr tm̃ uthutreueri utroque lat 11 Acceptable bitatem ipsiis omino ugor lat 11 Acceptable diccũ ẽ. ego dns exaudiã eos. dr̃ istł ñderelinquã eos: ñidõ diccũ ẽ. cãquã gen lat 11 Acceptable ait.A ẽsis hicuobis micummm̃siũ.primus est uobis irm̃si lat 12 Acceptable ꝑ secutio leuist adcauedũ.s beticor seductioꝑni lat 12 Acceptable ierłm & uide. immo iudicent int̃ lat 12 Acceptable surci :reccutores repu :. lic: ett̃ migriin lat 13 Acceptable Meseach ⁊ tafari ⁊ Rrasis sic̃ dicc̃ lat 13 Acceptable orit lui. ⁊ termo optimus est lat 13 Acceptable quilibet sp̃ s. omĩno lat 14 Acceptable potior conditio pp̃ e.facit de rxp. duobus .li. bl. lat 14 Acceptable se dm̃ habere. et pmicꝰ sibimet satiffaciens. lat 14 Acceptable ualent vuã breuẽ. ⁊ ultia ualet tũi lat 15 Acceptable sup s comꝑarõ ioñ prudẽtes ꝯquas lat 15 Acceptable metermuim.et rẽgm euis non erit finis lat 15 Acceptable L e carnalis ht, qm pater ip̃ s parentis. 昀爀o 12 Bad orailleo .poulleer xv lib 20 Lang Century Prediction Transcription 昀爀o 12 Bad Rbir les bartres 昀爀o 12 Bad deaute lqu ques creppt Eentiferoi rece nyꝰ seelle ces liea aa mn pie do d ce ee lu moasum 昀爀o 13 Bad atourne. 
giest sibo lans quil qsui 昀爀o 13 Bad Carde peo eequie auoit d̾yonde eane de adtus edtoit pao coudequaiẽ 昀爀o 13 Bad Mol edito se vtan di or icuttͥs 昀爀o 14 Bad Q Anne Autiron. Que ledit saques de Lancrau, epousa en premieres noces, le 昀爀o 14 Bad stallis eudinor ꝯpu es uiai fugdutu padeur i uo ferras pu uea puis ai 昀爀o 14 Bad uolent ipitur ⁊atia rertace, quan ineestaeq aleeeclere, et decõuy ny s ã ã 昀爀o 15 Bad deianarr de bbrdide 昀爀o 15 Bad ⁊uribz allegate, Sed epclusissent ab uitestate Ipsi 昀爀o 15 Bad msuol ̃ ipousaultis anonuen natucõ auol lat 9 Bad lus . necnonalu acquealii fundatores ecdlesiae atque erudito lat 9 Bad us prae erat ut ꝑhoc. P̃sedemtis lat 9 Bad crea turar quues upra lat 10 Bad eruc tucins quat tuor an lat 10 Bad utunde positum eleganter concin lat 10 Bad aecenim consid rtio suasit qnm manifestum. ẽ. omnemutabile lat 11 Bad UERBuai : FIuERBũ. lat 11 Bad mus qm ipse anns nr̃ animã posuit suã lat 11 Bad 昀爀a si tua foret roma.to lat 12 Bad et arbusta eius cedros dei. lat 12 Bad rit i audalunt dñt surreber̃ scm̃q marie lat 12 Bad relecti mansueti. lat 13 Bad qurdr uicba .i. quit est lat 13 Bad de inim̃. m. ñ quãu lat 13 Bad usqi io intintoẽm amcti delendi sumuis lat 14 Bad cui subsunc becm̃braa laceiicelligas lr̃ãm. siue sint plati or lat 14 Bad Sĩ mõlał ãt ibs lat 14 Bad fult ad nol in eigilia lanooe orucil oroit ꝓuril crauns lat 15 Bad ⁊ Artaiita mons cum flumi- lat 15 Bad ꝑ te maiorz pñt corrumꝑe lat 15 Bad lo sunt ni locu unu. ⁊ appare 昀爀o 12 Very bad I guille choeneau 昀爀o 12 Very bad mnl cct quarantẽ dope 昀爀o 12 Very bad nullo cappic d bii uigr 昀爀o 13 Very bad noulonoe rolissicanuꝭ, Rudauu/, ꝯgarobanu 昀爀o 13 Very bad L a nuis ÷ eueuue sihat ipponses 昀爀o 13 Very bad diqe ut̃ le diui⁊ inr 昀爀o 14 Very bad oe consepeedeetante cemere 昀爀o 14 Very bad OU EtE L. Cheualier de Monifort, son Oncle, Gles 昀爀o 14 Very bad Bussoy Iaguio dar Rnauex a eedamet dunin 昀爀o 15 Very bad aximiun oiuinca s apenalriuuuirõuo uutonli 昀爀o 15 Very bad libas ꝯsadiorandapio sidimił 昀爀o 15 Very bad Euuon lan uiii fut lat 9 Very bad IV Mtru&Ε Rlnꝯrdo¬ lat 9 Very bad eo locus sp atiosus admanen lat 9 Very bad ie godñpsormanilues lat 10 Very bad dule hanu curde uut lato lat 10 Very bad arnals de seruull lat 10 Very bad sbib liotheca tsede 21 Lang Century Prediction Transcription lat 11 Very bad s ec tanie ca uis uirtusq lat 11 Very bad Don de N. le duc de la Tremoille . MV lat 11 Very bad uitr fuit rtimuli lat 12 Very bad minu Benedlicat uos clns exsy lat 12 Very bad sad mumnoui hominıᷤ lat 12 Very bad uanni addant̃ ad aunos am̃s cabłam aiomsn ita uidt lat 13 Very bad ngeũ drãs mudtistã lat 13 Very bad fit cum eo emplin ypocondrus lat 13 Very bad mauuse ⁊ de ala mri nream lat 14 Very bad G terie eni ĩtuiueẽ sã lat 14 Very bad ni quo coła nig ꝯsurg lat 14 Very bad uino dt᷑. uł nat lat 15 Very bad duabus ncibus ¶ uel syr ma- lat 15 Very bad Orano mlelll pam cess aa qra mfenus lat 15 Very bad Poleuae me oblous ons Table 5: Examples of HTR Prediction on unseen documents and their classi昀椀cation by the model. 
22 Lang Encoder Good Acceptable Bad Very bad Precision Recall Precision Recall Precision Recall Precision Recall Yes Attention 63.35 40.36 43.04 26.32 49.66 47.70 75.33 96.25 Yes Attention 65.61 31.47 41.80 32.35 51.81 51.75 77.74 95.41 Yes Attention 63.32 51.27 49.17 18.27 47.84 45.95 71.96 95.99 Yes Attention 66.00 41.88 48.45 33.90 52.85 56.78 80.47 94.51 Yes Attention 68.27 43.15 44.01 22.76 44.67 43.98 73.20 95.48 Yes TextCNN 59.06 38.07 44.31 11.46 40.16 32.17 64.50 97.87 Yes TextCNN 54.70 25.13 38.16 22.45 44.88 44.09 72.30 95.41 Yes TextCNN 51.54 42.39 44.37 20.74 45.34 27.13 64.88 97.61 Yes TextCNN 60.23 26.14 39.78 27.40 45.54 43.00 72.81 95.16 Yes TextCNN 63.35 25.89 43.24 22.29 42.40 33.26 65.95 97.61 Yes BiLSTM 62.33 46.19 44.07 12.07 44.10 51.09 74.11 94.51 Yes BiLSTM 71.75 32.23 41.29 30.80 50.41 53.94 78.07 94.06 Yes BiLSTM 60.56 55.33 47.55 21.05 49.94 48.14 74.18 94.64 Yes BiLSTM 73.53 19.04 36.58 30.80 47.77 58.53 81.51 91.41 Yes BiLSTM 66.82 37.31 39.26 14.71 43.42 47.26 72.96 96.38 No Attention 56.05 44.67 45.02 30.80 49.16 44.75 76.96 95.16 No Attention 61.79 33.25 43.23 43.50 54.51 58.21 86.30 92.76 No Attention 57.00 43.40 45.66 34.21 50.41 47.16 78.53 94.51 No Attention 58.57 41.62 44.40 38.08 54.50 53.72 82.40 94.06 No Attention 56.97 36.29 42.20 31.42 51.34 46.06 75.85 95.54 No TextCNN 54.51 36.80 41.15 26.63 47.59 43.22 74.30 95.41 No TextCNN 47.37 38.83 40.10 26.01 49.14 47.16 77.48 94.25 No TextCNN 54.71 38.32 41.80 28.02 49.66 55.58 81.19 92.83 No TextCNN 48.31 39.85 39.09 28.02 50.12 47.26 78.94 94.44 No TextCNN 49.35 38.32 39.04 15.17 46.45 46.50 72.71 95.35 No BiLSTM 70.79 31.98 42.90 24.30 47.44 55.80 78.07 94.96 No BiLSTM 57.60 36.55 42.23 35.76 52.28 52.63 80.84 93.22 No BiLSTM 60.25 36.55 41.84 28.17 51.37 57.44 80.40 93.80 No BiLSTM 56.69 49.49 42.90 40.71 54.52 47.48 82.85 93.60 No BiLSTM 56.17 43.91 44.47 26.78 49.12 48.58 77.64 95.35 23 Figure 6: “Bad transcriptions” CER Violin plot, per manuscript. Most manuscript have a strong enough diversity of CER to train upon. 24