<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Antwerp, Belgium</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thibault Clérice</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre Jean Mabillon, École nationale des Chartes, &amp; INRIA</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>As more and more projects openly release ground truth for handwritten text recognition (HTR), we expect the quality of automatic transcription to improve on unseen data. Getting models robust to scribal and material changes is a necessary step for specific data mining tasks. However, evaluation of HTR results requires ground truth to compare predictions statistically. In the context of modern languages, successful attempts to evaluate quality have been made using lexical features or n-grams. This, however, proves difficult in the context of the spelling variation that both Old French and Latin have, even more so in the context of sometimes heavily abbreviated manuscripts. We propose a new method based on deep learning where we attempt to categorize each line's error rate into four ranges (&lt; 10%, &lt; 25%, &lt; 50%, &lt; 100%) using three different encoders (GRU with Attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach using early-stopped models, as an alternative to rule-based fake predictions. Our model largely outperforms the n-gram approach. We also provide an example application to qualitatively analyse our classifier, using classification of new predictions on a sample of 1,800 manuscripts ranging from the 9th century to the 15th.</p>
      </abstract>
      <kwd-group>
        <kwd>HTR</kwd>
        <kwd>OCR Quality Evaluation</kwd>
        <kwd>Historical languages</kwd>
        <kwd>Spelling Variation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In order to evaluate the consistency of a model on an out-of-domain document such as another manuscript or a new hand, researchers usually have to create new ground-truth transcriptions to which the model predictions are compared. In this context, it seems out of reach to leverage with confidence the amount of data that remains dormant in the open data vaults of libraries such as the Bibliothèque Nationale de France (BnF) for statistical studies, making the number of 50,149 IIIF manifests catalogued by Biblissima's portal [18] promising while leaving a bitter taste of unavailability: it would require the manual transcription of at least a few hundred lines for each manuscript.<sup>3</sup></p>
      <p>To address this, we can approach this issue not as an HTR one but rather as a Natural Language Processing (NLP) task, evaluating the apparent "correctness" of the acquired text rather than its direct relationship with the digital picture of the manuscript. Evaluating new transcriptions without ground truth has been done, but mainly for OCR and non-historical documents. For modern languages, where spelling is fixed and grammar stable, a dictionary approach in combination with some n-gram statistics has provided a solid framework for establishing the probability that a document is well transcribed. However, for languages such as Old French or Medieval Latin, both evolving over the span of a few centuries, the issue is different. For example, Camps, Clérice, Duval, Ing, Kanaoka, and Pinche [3] have catalogued 36 forms of the word cheval (horse) in the largest available Old French corpus. A dictionary approach would already prove to be complex, but to make things worse, the abbreviated nature of medieval texts would require taking into account several abbreviation systems, making it unsustainable.</p>
      <p>
        HTR is most often, in the humanities, not a task in itself but rather a preliminary step for corpus building (such as digital editions) or corpus analysis. In this context, HTR quality can be of primordial importance, depending on the task at hand. While Eder [16] has suggested that good classification in stylometry is still possible for corpora with noise levels as high as 20%, even for the smallest feature sets, Camps, Vidal-Gorène, and Vernet [4] demonstrated that, for HTR, noise leads to accumulating errors throughout its post-processing (word segmentation, abbreviation resolution, lemmatization and POS-tagging), making the post-processed textual features less reliable than original character n-grams. For some other tasks, such as in corpus linguistics (e.g. semantic drift studies), the study of abbreviation systems such as the one performed by Honkapohja and Suomela [2<xref ref-type="bibr" rid="ref2">2</xref>] or even the training of large language models such as MacBERTh [2<xref ref-type="bibr" rid="ref8">8</xref>] might require a higher level of precision.
      </p>
      <p>As such, evaluating the textual quality of an automatic transcription "from afar" is extremely useful, as it provides solid grounds to either exclude documents from analysis or help guide ground-truth creation campaigns in well-funded projects. For cultural heritage institutions, it can also provide a welcome indicator for the documents that could be ingested by a search engine. We can even imagine situations where these institutions transcribe only a sample of each element of their collection, and only fully and automatically transcribe the ones that reach a certain level of quality, thus saving energy and ultimately budget on the computation front.</p>
      <p>
        From a human reader's perspective, Springmann, Fink, and Schulz [39] and Holley [2<xref ref-type="bibr" rid="ref1">1</xref>] have set a limit of a CER below 10% for a good OCR quality. Recently, Cuper [15] has proposed the evaluation of OCR quality for heritage text collections, specifically Dutch newspapers from the 17th century, to distinguish good OCR from bad, using the aforementioned threshold. They provide a tool, QuPipe, which offers binary classification capacities, putting text in either the range [0; 10]% of CER or in the remaining range of "bad" OCR. In 2022 as well, Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40] addressed this issue regarding HTR of cultural heritage documents, specifically from the 16th century. They provide a strong argument for using lexical features and (pseudo-)perplexity scores for HTR quality estimation, with the specific limitation that the texts they studied, 16th-century Latin correspondence, do not provide as much variation as older languages such as historical German. We also note that correspondence may be less abbreviated, and that this dataset spans a very short period. In parallel to these, Clausner, Pletschacher, and Antonacopoulos [8] approached the problem from a global perspective, from segmentation to OCR, and proposed supervised classification methods.
        <sup>3</sup>Five million lines would be required for the mentioned set of manifests of the BnF with only 100 lines per manuscript. As a comparison point, the accumulated number of lines of manuscript datasets, regardless of the script or language, publicly available on the HTR-United catalog [17] is 164,418 at the end of August 2022.
      </p>
      <p>In this paper, we address this issue as a supervised classification task, based on a dataset of around 50,000 lines of ground truth spanning from the 9th through to the 15th century. Following the conclusion of Cuper [15], we augment the number of categories we want to find: we distinguish Good ([0, 10)%), Acceptable ([10, 25)%), Bad ([25, 50)%), and Very Bad (≥ 50%) rates of OCR. This provides a more fine-grained evaluation of the transcription and allows for guided transcription campaigns, by addressing either the low-hanging fruits (Acceptable) or the rotten ones. We evaluate three kinds of basic architectures (GRU with attention, BiLSTM and TextCNN) on line classification using real-life "bad" transcriptions and precomputed CER scores.</p>
      <p>The resulting models have shown promising results, with quality levels such as Very Bad and Good being well recognized. In order to evaluate the models and showcase their usefulness, we also provide an example of a real-life classification application, where 1,800 manuscripts were randomly selected from the BnF and classified by our best model.</p>
      <p>In summary, the contributions of this paper are:
1. a new approach for HTR evaluation of historical languages with variable spellings;
2. a new method to produce ground truth for OCR evaluation that does not rely on
artificially and manually tuned generation;
3. an initial evaluation of the output and a quick glance at the state of HTR for Old French
and Medieval Latin over six centuries.</p>
      <p>The remainder of this paper is organised as follows. We start by addressing the background in Section 2, specifically regarding the specifics of Old French and Medieval Latin and the idea of readability. In Section 3, we describe the HTR datasets we used and their particularities. In Section 4, we describe the architecture of the models, their feature engineering and the process behind the generation of bad predictions. In Section 5, we describe the set-up of our model selection and evaluation. Finally, in Section 6, we analyse the results both on the dataset produced ad hoc (described in Sections 3 and 4), and on completely unseen documents from the BnF, to showcase the capacities of such models.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Handwritten Text Recognition, a sibling or sub-task of Optical Character Recognition, aims at recognising text from digitised manuscripts. In the last five years, the digital humanities landscape has seen a surge in HTR engines, as well as transcription interfaces that connect and work well with these engines, from the dominant Transkribus [23] to the open-source pair of eScriptorium [26] and Kraken [<xref ref-type="bibr" rid="ref25">25</xref>]. To be able to recognize text, users have to provide models, which are themselves the result of supervised training on ground truth data (human-provided transcriptions).
      </p>
      <p>
        Printed books have been, over the last few decades, the focus in terms of remediation, from their analogue form to a digitized picture and finally to a machine-readable (and human-searchable) text. With the advances in HTR over the last five years, the focus can now shift or be shared with materials that have, for the most part, remained inaccessible from a digital point of view, except for pictures. Latin manuscripts are present during the whole period of manuscript production in western Europe. Literary Old French manuscripts exist from the 12th century onward, with only a hundred known surviving manuscripts in the 12th century [5]. Over the span of these seven centuries, multiple forms of handwritten scripts have existed, for both French and Latin. As an example, the 2016 ICFHR Competition on the Classification of Medieval Handwritings in Latin Script [<xref ref-type="bibr" rid="ref14">14</xref>] provided ground truth for the classification of 12 main families, of which at least six are represented in our datasets. This diversity makes training models for HTR quite complex but also a reachable goal, as literary manuscripts in particular tend to be more readable and stable between different hands.
      </p>
      <p>Medieval French and Latin present both dialectal and scriptural variation in synchrony on top of diachronic evolution. Old French's syntax varies chronologically and geographically. The spelling is simply variable. While Latin shows some level of variation, it differs from Old French mostly in its higher rate of abbreviation. These observations are limited to the context of the datasets at hand, which are literary works (including scholastic, theological and medical works). The Old French CREMMA Medieval dataset [34] has 0.97% of horizontal tildes and 0.16% of vertical ones, which are markers used in the dataset guidelines to indicate various similar abbreviation diacritics [35]. Using the same guidelines, the CREMMA Medieval Latin dataset shows rates of 5.63% and 1.52% for the same characters. This difference could be due to the nature of the transcribed texts.</p>
      <p>
        The question of abbreviation and the specificity of medieval literary manuscripts has provoked many discussions in terms of how to transcribe documents, from a completely "diplomatic" approach with variants of letters to "semi-diplomatic" approaches. In the last year, three authors have provided guidance or thoughts around guidelines for transcriptions: Pinche [35] focusing on Old French, Schoen and Saretto [38] on Middle English, and Guéville and Wrisley [<xref ref-type="bibr" rid="ref20">20</xref>] on Latin. The CREMMA guidelines have been used by 5 other datasets for a total of 1.15 million characters over fifty manuscripts, which makes them the most diverse and comprehensive ones for HTR of medieval manuscripts in Latin and Old French.
      </p>
      <p>The most traditional metrics for HTR and OCR are Word Error Rate (WER) and Character Error Rate (CER). The first one proves to be complicated to apply in Old French and Medieval Latin, as spaces in medieval manuscripts tend to vary in size or simply be nonexistent from a modern perspective, relying on the knowledge of the reader to separate words—or the ability of NLP models to separate them [9]. The second one works well, with the limitation that spaces are often the first source of mistakes. CER corresponds to the sum of character insertions, removals and replacements over the total number of characters, thus providing a fine-grained metric.</p>
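As a concrete illustration (a minimal sketch, not the authors' implementation; function names are ours), CER can be computed with a standard Levenshtein edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, removals and
    replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # removal
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # replacement
            ))
        prev = cur
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: edit operations over ground-truth length."""
    return levenshtein(prediction, ground_truth) / len(ground_truth)
```

For instance, `cer("chevall", "cheval")` is 1/6, since one insertion suffices over a six-character ground truth.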
      <p>As mentioned earlier, in the introduction, both CER and WER require ground truth, as do other metrics currently discussed as alternatives, such as the (pseudo-)perplexity or lexical measures proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and Schoch [40]. The other approach to evaluating quality without ground truth is to predict a class of CER, such as the work done by Bazzo, Lorentz, Suarez Vargas, and Moreira [1]. These approaches rely on features such as n-grams, word statistics and language classifier outputs, which are difficult to leverage in the present context. In order to train their classifier, Bazzo, Lorentz, Suarez Vargas, and Moreira [1] and Nguyen, Jatowt, Coustaty, and Doucet [32] engineered bad predictions by creating rules to reproduce the most common errors in OCR, such as "rn" becoming "m". These bad predictions are then fed to their model along with the metrics both papers want to predict.</p>
      <p>
        Nguyen, Jatowt, Coustaty, and Doucet [32] provide an innovative approach to the issue of noise in OCR by shifting from a CER/WER problem to a readability one: if the reader "can reod a txt with mispelling" without having to refer back to the picture, at least one of the goals of OCR has been achieved. As simply put by Martinc, Pollak, and Robnik-Šikonja [30], "Readability is concerned with the relation between a given text and the cognitive load of a reader to comprehend it". It is even more important in the context of handwritten documents, where a somewhat bad but readable HTR output can be easier for non-specialists to read than the original. In the field of readability assessment, Martinc, Pollak, and Robnik-Šikonja [30] have shown that supervised models perform adequately, while Nguyen, Jatowt, Coustaty, and Doucet [<xref ref-type="bibr" rid="ref32">32</xref>] have shown that this translates to the OCR issues as well. This has not been applied to any medieval dataset that we know of.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        To train different models, we reused data from various projects, aligned with the same guidelines used by Pinche [3<xref ref-type="bibr" rid="ref5">5</xref>]. Our experiment was made possible by the open release of many projects' datasets, including one MA thesis and one student project [41, 2]. We used the ground truth of the CREMMA [13] and CREMMALab [<xref ref-type="bibr" rid="ref34">34</xref>] projects, the Rescribe [42] project, and the Gallicorpora [36<xref ref-type="bibr" rid="ref37">, 37</xref>] project, for a total of 42,292 lines (see Table 1). We include one dataset of incunabula, which use graphical shapes similar to literary manuscripts (but with more regularity), while also using an abbreviation system.
      </p>
      <p>The datasets present not only two main languages but also many different levels of digitization quality (including old binarization), different kinds of handwriting families, different abbreviation levels and different genres. For example, while the CREMMA Medieval dataset focuses more on literary texts, specifically hagiographical and chanson de geste texts, the CREMMA Medieval LAT corpus offers theological commentaries and medicinal recipes, each genre having its own specific vocabulary. The dataset in general is skewed towards French and the gothica handwriting family.</p>
      <p>
        The transcription guidelines of Pinche [35] provide simplification rules: allographic approaches are forbidden (different shapes of s such as long s and "modern" s are not differentiated), macrons and general horizontal-line diacritics over the letters such as tildes are represented by horizontal tildes, any "zigzag"<sup>4</sup> or similarly shaped forms are simplified into superscript vertical tildes, etc. This allows for simpler transcriptions and also a limited diversity of characters for the machine to learn, satisfying both the human transcriber in terms of the learning curve of the guidelines, and the HTR engine in terms of complexity. Each corpus was passed through the ChocoMufin software [<xref ref-type="bibr" rid="ref11">11</xref>] using project-specific character translation tables. This software, along with these tables, allows each dataset to be controlled at the character level and adapted to guideline modifications. It also allows project-specific transcription standards to be translated to a more common one, such as Pinche's.
      </p>
      <sec id="sec-3-1">
        <title><sup>4</sup>Official name from the Unicode specifications for the character U+299A.</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Method</title>
      <p>Our goal is to be able to predict a quality class for any HTR output on medieval French and Latin. First, we design a way to generate ground truth for the quality assessment of HTR output. Then, we propose three supervised text-based models, with specific adaptations to handle both languages with a single classifier.</p>
      <sec id="sec-4-1">
        <title>4.1. “Bad Prediction” Ground Truth</title>
        <p>
          In order to train our classification model, we require ground truth material along with a CER class: Good ([0; 10)%), Acceptable ([10; 25)%), Bad ([25; 50)%) and Very Bad (≥ 50%). In order to have real-life errors, and to reproduce the rather difficult-to-predict capacity of a model to confuse certain characters with others in specific settings, we propose a three-step method:
1. We train Kraken [2<xref ref-type="bibr" rid="ref5">5</xref>] models based on the complete dataset, or on a subset. We voluntarily stop some of the training in very early stages, when the CER on the validation dataset remains high. We also keep one "best" model [12] trained on the full dataset.
2. We run each model on our two biggest and most diverse repositories, CREMMA Medieval and CREMMA Medieval LAT. We also run a model trained on modern and contemporary scripts, Manu McFrench [6], to create garbage-level transcriptions.
3. We evaluate each line's CER and store it alongside the line. We also keep the ground truth, whose CER is estimated as 0. We remove short lines (fewer than 15 characters) and duplicated predictions across models for the same line.
        </p>
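The filtering in step 3 can be sketched as follows (a minimal reconstruction under our own data layout; the 15-character threshold and the deduplication rule come from the text, everything else is an assumption):

```python
def filter_predictions(triples, min_chars=15):
    """Keep (line_id, prediction, cer) triples, dropping short lines
    and duplicate predictions of the same line across models."""
    seen = set()
    kept = []
    for line_id, prediction, cer in triples:
        if len(prediction) < min_chars:
            continue  # short lines are removed
        key = (line_id, prediction)
        if key in seen:
            continue  # identical prediction already produced by another model
        seen.add(key)
        kept.append((line_id, prediction, cer))
    return kept
```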
        <p>Regarding the final models for prediction production, we have 16 models, allowing for a maximum of 16 versions of each line, if none of the models predict the same text (see Table 2 for examples):
1. 4 models trained on the same train and validation datasets as best, with a validation CER of 55.9, 28.3, 23 and 20.8% according to Kraken.
2. 5 models trained on the CREMMA Medieval LAT dataset only, from the 1st to the 6th epochs, ranging from 86% to 46% CER.
3. 1 model trained on the Eutyches (Latin, Carolingian of the 9th century) and the Decameron (French, 16th century) datasets with a 98.5% CER on its validation set.
4. 3 models trained on the CREMMA Medieval (Old French) dataset only, fine-tuned from the Manu McFrench model, from 11% CER down to 8.2%.
5. Manu McFrench, the best model and the ground-truth data.</p>
        <p>These provide variable CER on unseen data from the test sets of both CREMMA datasets, but also on training and validation sets, as they did not reach their full capacities during the training phase. After filtering small and repeated predictions, we have access to 322,903 lines of "HTR prediction, CER" couples (see Figure 6 in the appendix). We then translate that into each bin of CER to produce the four established classes. Example predictions of the same line at various quality levels:
u̾ra on de q̃ l vertu ses petis pies sont que vous
Bra on de q̃ l vertuses petis pies sont que vous
Fra on de q̃ l vertuses petis pies sont que vou
Bra on de q̃ l vertuses petis pies sont que uous
Pra on de ql vertuses petis pies sont que dons
ura on de q̃ l vertu ses petis pies font grre op
ura on de q̃ l uertu ses petis pies font re dory
ura on de ql vertu ses petis pies font itce ir
Ard ondegl ratules nus mes sont que ls
a on de at etn le peos pes os e
a om de ał vrtir sot olisͣ pa sosisinos
⁊s cm dec uł vrtr fe pdp̃ pns ots pte</p>
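Mapping a line's CER to the four classes is then a simple binning step (a sketch; the bin edges come from the paper, the function name is ours):

```python
def cer_class(cer: float) -> str:
    """Bin a character error rate (as a fraction) into the
    four quality classes defined in the paper."""
    if cer < 0.10:
        return "Good"        # [0; 10)%
    if cer < 0.25:
        return "Acceptable"  # [10; 25)%
    if cer < 0.50:
        return "Bad"         # [25; 50)%
    return "Very Bad"        # >= 50%
```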
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>We applied three model architectures, common to many NLP tasks, with an embedding–sentence encoder–linear classifier structure where only the sentence encoder changes from one model to another (see Figure 2). The embedding layer takes into account special tokens (Padding, Unknown char, Start of Line, End of Line) and each character according to the Unicode NFD<sup>5</sup> normalization of the line, in which characters and their diacritics are separated, e.g. [é] becomes [e]+[´]. The linear layer is a simple (Encoding Output Dimension, Class Count) decision layer. Each model uses a cross-entropy loss function<sup>6</sup> and reduces its learning rate at plateau using the validation set's macro-averaged recall metric. Optimization of the model is done through the Ranger optimizer [43].</p>
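The NFD behaviour described above can be checked with Python's standard library:

```python
import unicodedata

# NFD splits a precomposed character into base letter + combining mark:
decomposed = unicodedata.normalize("NFD", "é")
assert [unicodedata.name(c) for c in decomposed] == [
    "LATIN SMALL LETTER E",
    "COMBINING ACUTE ACCENT",
]

# A character tokenizer working on NFD therefore sees [e] + [´] as two
# symbols, keeping the vocabulary small for abbreviation diacritics:
tokens = list(unicodedata.normalize("NFD", "q̃l"))
assert len(tokens) == 3  # q + combining tilde + l
```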
        <p>
          The encoding layer varies between three different forms:
• The first version uses a single BiLSTM network where the sentence encoding is the result of the concatenation of the start-of-line token (BOS) and end-of-line token (EOS) hidden states.
• The second version follows the architecture of sentence-level attention proposed by Yang, Yang, Dyer, He, Smola, and Hovy [4<xref ref-type="bibr" rid="ref4">4</xref>], using a bidirectional GRU. The encoded sentence vector is the sum of products of the hidden state of each token with its attention. Attention is also provided as an output for human interpretation of the results.
• The last one, TextCNN [2<xref ref-type="bibr" rid="ref7">7</xref>], uses the concatenation of the Max Pooling of each n-gram size (2, 3, 4, 5, 6) taken into account by a convolutional neural network.
        </p>
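The sentence-level attention pooling of the second encoder can be sketched in plain Python (purely illustrative: in the model, hidden states and attention scores come from the learned GRU and attention layers, not from the toy values here):

```python
import math

def attention_pool(hidden_states, scores):
    """Weighted sum of per-token hidden states, using softmax(scores)
    as attention weights (a stand-in for the learned attention)."""
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]  # sums to 1, interpretable per token
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights
```

Because the weights sum to 1, they can be inspected per character, which is the "human interpretation" output mentioned above.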
        <sec id="sec-4-2-1">
          <title><sup>5</sup>Normalization Form Canonical Decomposition. <sup>6</sup>Code available at https://github.com/PonteIneptique/nequitia.</title>
          <p>
            As we deal with two different languages, we added another special token, following the work of Martin, Villemonte de La Clergerie, Sagot, and Bordes [29] and Gong, Bhat, and Viswanath [<xref ref-type="bibr" rid="ref19">19</xref>]: for each encoding variation we add one variation of the codec where the first token after the beginning-of-string is a metadata token indicating the language. Thus, a line such as Fra on de q̃l vertuses petis pies sont que vo will be encoded as &lt;BOS&gt;&lt;FRO&gt;Fra on de q̃l vertuses petis pies sont que vo&lt;EOS&gt;.
          </p>
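The language-token encoding can be sketched as follows (the special token names &lt;BOS&gt;, &lt;EOS&gt; and &lt;FRO&gt; come from the text; the function and a &lt;LAT&gt; token for Latin are our own assumptions):

```python
def encode_line(text: str, language: str) -> list[str]:
    """Prefix the character sequence with <BOS> and a language
    metadata token (e.g. FRO for Old French, LAT for Latin)."""
    return ["<BOS>", f"<{language}>"] + list(text) + ["<EOS>"]

# encode_line("Fra on", "FRO")
# -> ['<BOS>', '<FRO>', 'F', 'r', 'a', ' ', 'o', 'n', '<EOS>']
```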
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>In order to avoid lexical bias and to ensure the strength of our analysis, we propose a 5-Fold-like experiment, where the subsets for train, validation and test are the result of splits across manuscripts. For each K, two French manuscripts and two Latin ones are used for the validation set and the test set, and they differ by at least one manuscript from one K to another, leaving three K completely different (K1, K3, K5; see Table 3). Each test set also contains a Latin manuscript that was not used in any of the HTR model training or validation: Berlin, Hdschr. 25. This manuscript was then used for model evaluation, to have a stable pillar for evaluation. Models are then evaluated using class-specific precision and recall, as well as macro-averaged precision and recall.</p>
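Macro averaging simply takes the unweighted mean of the per-class scores, so that rare classes weigh as much as frequent ones (a stdlib sketch, not the evaluation code used in the paper):

```python
def macro_recall(gold, pred):
    """Unweighted mean of per-class recall over the classes
    present in the gold labels."""
    classes = set(gold)
    recalls = []
    for c in classes:
        true_pos = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        support = sum(1 for g in gold if g == c)
        recalls.append(true_pos / support)
    return sum(recalls) / len(recalls)
```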
      <p>For our baseline, we use the relative frequency of the 2000 most common n-grams of sizes 3, 4 and 5 as features and feed them to a linear classifier, with cross-entropy loss and the Adam optimizer. We run each model architecture once for each K, resulting in 7 different results with the baseline (presence/absence of language token for the three encoding modules + baseline).</p>
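The baseline features can be sketched as relative frequencies of the corpus-wide most common character n-grams (the vocabulary size of 2000 and the n-gram sizes come from the text; the implementation is our own):

```python
from collections import Counter

def ngrams(text, sizes=(3, 4, 5)):
    """Yield all character n-grams of the given sizes."""
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_vocabulary(corpus, top_k=2000):
    """Most common n-grams of sizes 3, 4 and 5 over the training corpus."""
    counts = Counter(g for line in corpus for g in ngrams(line))
    return [g for g, _ in counts.most_common(top_k)]

def featurize(line, vocabulary):
    """Relative frequency of each vocabulary n-gram in the line,
    fed as a fixed-size vector to the linear classifier."""
    counts = Counter(ngrams(line))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocabulary]
```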
      <p>
        Our whole pipeline uses pandas for data preparation [31], PyTorch [<xref ref-type="bibr" rid="ref33">33</xref>] for model development, and PyTorch Lightning [17] for the training, evaluation and prediction wrapping.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <sec id="sec-6-1">
        <title>6.1. Model Classification Results</title>
        <p>The first conclusion we can draw from the experiment is that our models always beat the baseline (see Table 4 and, in the Appendix, more details). No RNN-based architecture clearly beats the other, but TextCNN clearly underperforms. The introduction of the language metadata token helps when detecting Good transcriptions (delta ≈ +7% for attention's median precision, ≤ +1% for the recall) for both RNN-based models. Models without a language marker tend to outperform models with language markers, except for the Very Bad class, where the delta is up to +6% in favour of models without language tokens (using median precision scores).</p>
        <p>Regarding the variability of results, we found that the length of the string had an impact on the prediction, no matter the model architecture. Surprisingly, none of the models withstand long noisy lines: the accuracy of the Very Bad class is inversely correlated with line size. On the contrary, depending on the encoder, some classes benefit from longer strings: Good lines benefit from it with all models except the baseline. TextCNN is the only model to really correlate accuracy on the Bad and Acceptable classes with line length.</p>
        <p>Finally, for all models except the baseline, the most common confusion is always in the "adjacent" class(es) (see Figure 4). For the classes Acceptable and Bad, which have two neighbours, the error rate is evenly split between them: the class Acceptable tends to be confused with either Good or Bad. This shows the model's ability to understand cleanness or noise, but also shows the limit of these classes: for a line with 50 characters, such as "quuãtfatuetlgeist en tes lieu.Derite respiont", 6 mistakes are enough to swing into the Acceptable category (Ground truth: "quãt tel enfant gist en tel lieu . Uerite respon", one space has been removed before the dot).</p>
        <p>Overall, with an accuracy for the Good and Very Bad classes around 50% on these languages, and considering that most of the confusions are between adjacent classes (e.g. Good is confused with Acceptable, Acceptable with Good and Bad, etc.), the solution performs well either at filtering badly read manuscripts, or at keeping only the very good ones. The Acceptable class and the Bad class have stable performance facing variable line length, although the Acceptable class shows the worst classification performance.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Application on a Real-World Library Dataset</title>
        <p>
          As a real-world application, we wanted to apply one of our best models to an unseen dataset, in the same way that we envision cultural institutions might use the tool. We describe the set-up for this particular experiment below, and then evaluate the results of the classification model with regard to the capacity of the HTR model; we also study some randomly sampled elements.
6.2.1. Set-up
To evaluate on as much unseen data as possible, we crawled the Biblissima IIIF collection portal [1<xref ref-type="bibr" rid="ref8">8</xref>]. We searched individually for each combination of language (French, Latin) and century (9th to 15th), limiting the number of samples retrieved to 500 manuscripts. We then sampled 10 sequential pictures from each manuscript.<sup>7</sup> To avoid empty pages (which tend to be at the start and the back of each book's digitization or IIIF manifest at the BnF), we take either the ten first pictures from the second decile of the manifest, or from the 20th up to the 30th if there are fewer than 100 pictures, or the 10 last if there are fewer than 20 pictures.
        </p>
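The sampling rule above can be sketched as follows (a reconstruction of the heuristic as described; the exact 0-based indexing choices are our own assumptions):

```python
def sample_picture_indices(n_pictures: int) -> list[int]:
    """Pick up to 10 pictures from a IIIF manifest, avoiding the empty
    pages that cluster at the start and end of BnF digitizations."""
    if n_pictures < 20:
        # very short manifest: take the 10 last pictures
        return list(range(max(0, n_pictures - 10), n_pictures))
    if n_pictures < 100:
        # short manifest: pictures 20 up to 30 (0-based indices 19..28)
        return list(range(19, min(29, n_pictures)))
    # otherwise: the ten first pictures of the second decile
    start = n_pictures // 10
    return list(range(start, start + 10))
```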
        <p>Each downloaded sample is then segmented using YALTAi [10] with the included model designed for cultural heritage manuscripts and the base Kraken BLLA segmenter [24]. As YALTAi provides different zones—from the margin to the main body of text—through numbering, we only consider lines that are part of the main bodies of text, thus excluding any marginal or paratext. We then use Kraken to predict a transcription for each line, using the best trained model as described in our first experiment. Next, we feed each line to our best BiLSTM model (K-Fold 1 has the best recall/precision on Good) while keeping the line metadata: language, century, manuscript identifier, and page identifier.</p>
        <p>Finally, we provide three different evaluations of the transcriptions. The first is based strictly on the number of lines predicted in each class (Good, Acceptable, etc.). The second is page-based: we take the most common prediction for all lines. The last one is manuscript-based: we take the most common page prediction, using the previous page-based metric.
6.2.2. Evaluation
Overall, the HTR prediction results produced by our BiLSTM module are in line with the HTR strength on the dataset (see Figure 5). The model performs extremely well on early manuscripts thanks to the presence of two datasets of early manuscripts (Eutyches and Caroline Minuscule). It performs well on Old French except for the 13th century, where Bad predictions are more common. The relative frequency of Very Bad predictions tends to grow as we get closer to the 16th century: from the data we have seen, this could be due to the presence of non-literary manuscripts written in cursive, for which our model has no ground truth.</p>
        <p>If we look at the sampled predictions (Appendix, Table 2), most Good predictions seem
correct or nearly correct. However, we can see that the metadata from Biblissima and the BnF
has some limitations when used automatically, as it can produce problematic results: most
12th-century Acceptable predictions are probably in Latin, which would indicate a multilingual
manuscript or a badly catalogued one. This issue also arises in the crawler for the century,
as some manuscripts were catalogued as French but with a production date that is before the
first known French document: these are most likely multilingual documents, with either a
collection of various leaves from previous manuscripts, or the inclusion of the language used for
marginal notes. Three out of the six Acceptable predictions between the 13th and the 14th century
are definitely readable and understandable, and we cannot but wonder if the lack of spaces
in “q̃ merueilles fu lacitebiengarne mlt” is responsible for its classification as Acceptable rather
than Good. We note that at least one Very Bad prediction in French, “OU EtE L. Cheualier de
Monifort, son Oncle, Gles”, seems rather readable, albeit requiring more corrections than a Good
transcription would. Latin shows the same trend, in being accurate over Good and Acceptable.
7. Note that we are not talking about pages but about pictures: in some cases, most commonly in the case of digitised
microfilms, one picture can contain two pages.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The ability to filter, without pre-transcribing samples, automated transcriptions of manuscripts
in Latin, Old French or any other Western historical language might lead to the production of
datasets designed for analysis that relies on better transcriptions, or to guiding cultural heritage
institutions and their partners in the production of new ground truth. Producing HTR ground
truth does indeed require time, skilled transcribers and, last but not least, budget. However,
most current error rate prediction or HTR output analysis models rely on n-gram frequencies
and lexical features: two approaches that are often less viable for languages such as Old French,
which “suffers” from a highly variable spelling system, or for languages like Latin, which are
potentially highly abbreviated, with abbreviations changing even within a single manuscript,
depending on the context, the topic and the scribe.</p>
      <p>In this context, we chose to treat CER range predictions as a sentence-like classification
problem, for which we implemented three basic models, using either a single BiLSTM encoder,
an attention-supported GRU, or a TextCNN encoder. These three tools show stronger results
than an n-gram based baseline. On top of this, we include a language metadata token which
can improve the reliability of the lowest range of CER (between 0 and 10%, the Good class)
while worsening the classification’s reliability for the highest range (over 50%, the Bad class).
For the purpose of training these models, we propose a new way to generate real-life “bad
transcriptions”, using early-stopped HTR models, or models trained on small samples of data:
this provides an alternative to previous rule-based generation of “bad transcription” ground
truths.</p>
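For reference, the four CER ranges used as target classes can be expressed as a trivial binning function. This is only the range-to-label mapping described in the paper (the classifier itself predicts the class from text, without knowing the CER); the function name is ours.

```python
def cer_class(cer: float) -> str:
    """Map a character error rate (0.0-1.0) to the paper's four ranges."""
    if cer < 0.10:
        return "Good"        # 0-10%
    if cer < 0.25:
        return "Acceptable"  # 10-25%
    if cer < 0.50:
        return "Bad"         # 25-50%
    return "Very Bad"        # 50-100%
```

Labelling early-stopped model outputs with this function is what turns "bad transcription" generation into ordinary supervised classification data.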
      <p>We show that on a completely unknown dataset of around 1,800 manuscripts, analysed
with a new HTR model speci昀椀cally trained on medieval Latin and French, the number of
welltranscribed manuscripts predicted is on par with the ground truth for that dataset. The quality
assessment predictions provide quick insights for larger collections, and could be run relatively
o昀琀en by cultural heritage institutions.</p>
      <p>
        In the future, hyper-parameter fine-tuning and other encoders could be used in the
architecture. Specifically, with more correctly transcribed manuscripts, including the abbreviations in
their transcriptions, fine-tuning larger language models could allow the application of
(pseudo-)perplexity ranking such as the one proposed by Ströbel, Clematide, Volk, Schwitter, Hodel, and
Schoch [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], while allowing for partial noise in the training data. We hope to see such
classification of manuscripts used by ground truth producers in order to enhance the robustness of
openly available HTR models.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>I want to thank Jean-Baptiste Camps, Ariane Pinche and Malamatenia Vlachou-Efstathiou
for their constant feedback and replies to some particular questions regarding manuscripts or
HTR data. Many thanks to Ben Nagy for his proof-reading of the pre-print version.</p>
      <p>This work was funded by the Centre Jean Mabillon and the DIM MAP (https://www.dim-map.fr/projets-soutenus/cremmalab/).</p>
      <p>N. White, A. Karaisl, and T. Clérice. Caroline Minuscule by Rescribe. Ed. by A. Chagué
and T. Clérice. 2022. url: https://github.com/rescribe/carolineminuscule-groundtruth</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix</title>
      <p>The software has been archived at the following address: https://doi.org/10.5281/zenodo.7233984. A good chunk of the data is available here: https://github.com/PonteIneptique/nenequitia/releases/tag/chr2022-release.</p>
      <p>Manuscripts metadata and the predictions in XML ALTO formats for Section 6 are available
at https://doi.org/10.5281/zenodo.7234399. The same repository also contains the XML data
for training the classifier.
[Table 2: sampled predictions, with columns for language (lat, fro), century (9th to 15th), and predicted class (Good, Acceptable, Bad); the table body could not be recovered from the extraction.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] G. T. Bazzo, G. A. Lorentz, D. Suarez Vargas, and V. P. Moreira. “Assessing the Impact of OCR Errors in Information Retrieval”. In: Advances in Information Retrieval. Ed. by J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, and F. Martins. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 102-109. doi: 10.1007/978-3-030-45442-5_13.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] S. Biay, V. Boby, K. Konstantinova, and Z. Cappe. TNAH-2021-DecameronFR. 2022. doi: 10.5281/zenodo.6126376. url: https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-DecameronFR.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] J.-B. Camps, T. Clérice, F. Duval, L. Ing, N. Kanaoka, and A. Pinche. “Corpus and Models for Lemmatisation and POS-tagging of Old French”. 2022. url: https://halshs.archives-ouvertes.fr/halshs-03353125.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] J.-B. Camps, C. Vidal-Gorène, and M. Vernet. Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches. May 2021. url: https://hal-enc.archives-ouvertes.fr/hal-03279602.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] M. Careri, C. Ruby, and I. Short. Livres et écritures en français et en occitan au XIIe siècle: catalogue illustré. Viella, 2011. 274 pp.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] A. Chagué and T. Clérice. HTR-United - Manu McFrench V1 (Manuscripts of Modern and Contemporaneous French). Version 1.0.0. 2022. doi: 10.5281/zenodo.6657809. url: https://doi.org/10.5281/zenodo.6657809.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] A. Chagué and T. Clérice. HTR-United: Ground Truth Resources for the HTR and OCR of patrimonial documents. 2022. url: https://htr-united.github.io.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] C. Clausner, S. Pletschacher, and A. Antonacopoulos. “Quality Prediction System for Large-Scale Digitisation Workflows”. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). 2016, pp. 138-143. doi: 10.1109/das.2016.82.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] T. Clérice. “Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin”. In: Journal of Data Mining &amp; Digital Humanities 2020 (2020). doi: 10.46298/jdmdh.5581. url: https://jdmdh.episciences.org/6264.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] T. Clérice. “You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine”. 2022. url: https://hal-enc.archives-ouvertes.fr/hal-03723208.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] T. Clérice and A. Pinche. Choco-Mufin, a tool for controlling characters used in OCR and HTR projects. Comp. software. Version 0.0.4. 2021. doi: 10.5281/zenodo.5356154. url: https://github.com/PonteIneptique/choco-mufin.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] T. Clérice, A. Pinche, and M. Vlachou-Efstathiou. Generic CREMMA Model for Medieval Manuscripts (Latin and Old French), 8-15th century. Version 1.0.0. 2022. doi: 10.5281/zenodo.7234166. url: https://doi.org/10.5281/zenodo.7234166.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] T. Clérice, M. Vlachou Efstathiou, and A. Chagué. CREMMA Manuscrits médiévaux latins. Ed. by A. Chagué and T. Clérice. 2022. url: https://github.com/HTR-United/CREMMA-Medieval-LAT.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] F. Cloppet, V. Eglin, V. C. Kieu, D. Stutzmann, and N. Vincent. “ICFHR2016 Competition on the Classification of Medieval Handwritings in Latin Script”. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). Shenzhen, China: IEEE, Oct. 2016, pp. 590-595. doi: 10.1109/icfhr.2016.0113. url: http://ieeexplore.ieee.org/document/7814129/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] M. Cuper. “Examining a Multi Layered Approach for Classification of OCR Quality without Ground Truth”. In: DH Benelux Journal (2022), p. 17.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. Eder. “Mind your corpus: systematic errors in authorship attribution”. In: Literary and Linguistic Computing 28.4 (Dec. 1, 2013), pp. 603-614. doi: 10.1093/llc/fqt039. url: https://doi.org/10.1093/llc/fqt039.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] W. Falcon and The PyTorch Lightning team. PyTorch Lightning. Comp. software. Version 1.4. 2019. doi: 10.5281/zenodo.3828935. url: https://github.com/Lightning-AI/lightning.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] E. Frunzeanu, E. MacDonald, and R. Robineau. “Biblissima’s Choices of Tools and Methodology for Interoperability Purposes”. In: CIAN. Revista de historia de las universidades 19.1 (2016), pp. 115-132.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] H. Gong, S. Bhat, and P. Viswanath. “Enriching Word Embeddings with Temporal and Spatial Information”. In: Proceedings of the 24th Conference on Computational Natural Language Learning. Online: Association for Computational Linguistics, 2020, pp. 1-11. doi: 10.18653/v1/2020.conll-1.1. url: https://aclanthology.org/2020.conll-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] E. Guéville and D. J. Wrisley. “Transcribing Medieval Manuscripts for Machine Learning”. In: arXiv preprint arXiv:2207.07726 (2022). url: https://arxiv.org/abs/2207.07726.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] R. Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs”. In: D-Lib Magazine 15.3/4 (2009).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] A. Honkapohja and J. Suomela. “Lexical and function words or language and text type? Abbreviation consistency in an aligned corpus of Latin and Middle English plague tracts”. In: Digital Scholarship in the Humanities 37.3 (2021), pp. 765-787. doi: 10.1093/llc/fqab007. url: https://doi.org/10.1093/llc/fqab007.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] P. Kahle, S. Colutto, G. Hackl, and G. Mühlberger. “Transkribus - a service platform for transcription, recognition and retrieval of historical documents”. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 4. IEEE, 2017, pp. 19-24.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] B. Kiessling. “A modular region and text line layout analysis system”. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2020, pp. 313-318.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] B. Kiessling. The Kraken OCR system. Comp. software. Version 4.1.2. 2022. url: https://kraken.re.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] B. Kiessling, R. Tissot, P. Stokes, and D. S. B. Ezra. “eScriptorium: an open source platform for historical document analysis”. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. IEEE, 2019, pp. 19-19.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] Y. Kim. “Convolutional Neural Networks for Sentence Classification”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1746-1751. doi: 10.3115/v1/D14-1181. url: https://aclanthology.org/D14-1181.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical Languages”. In: Journal of Data Mining and Digital Humanities NLP4DH (2022). doi: 10.46298/jdmdh.9152. url: https://hal.inria.fr/hal-03592137.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <surname>É. Villemonte de La Clergerie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          . “
          <article-title>Controllable Sentence Simplification”</article-title>
          .
          <source>In: LREC 2020 - 12th Language Resources and Evaluation Conference</source>
          . Marseille, France,
          <year>2020</year>
          . url: https://hal.inria.fr/hal-02678214.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pollak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Robnik-Šikonja</surname>
          </string-name>
          .
          <article-title>“Supervised and Unsupervised Neural Approaches to Text Readability”</article-title>
          .
          In:
          <source>Computational Linguistics 47.1 (Apr. 21</source>
          ,
          <year>2021</year>
          ), pp.
          <fpage>141</fpage>
          -
          <lpage>179</lpage>
          . doi: 10.1162/coli_a_00398. url: https://doi.org/10.1162/coli_a_00398.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          et al. “
          <article-title>pandas: a foundational Python library for data analysis and statistics”</article-title>
          . In:
          <source>Python for high performance and scientific computing 14.9</source>
          (
          <year>2011</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H. T. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>ReadOCR: A Novel Dataset and Readability Assessment of OCRed Texts”</article-title>
          .
          In:
          <source>International Workshop on Document Analysis Systems</source>
          . Springer.
          <year>2022</year>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          . “
          <article-title>PyTorch: An Imperative Style, High-Performance Deep Learning Library”</article-title>
          .
          In:
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          . Ed. by
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beygelzimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>d'Alché-Buc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fox</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          . Curran Associates, Inc.,
          <year>2019</year>
          , pp.
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
          . url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinche</surname>
          </string-name>
          .
          <source>Cremma Medieval</source>
          .
          <year>2022</year>
          . doi: 10.5281/zenodo.5235185. url: https://github.com/HTR-United/cremma-medieval.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinche</surname>
          </string-name>
          . “
          <article-title>Guide de transcription pour les manuscrits du Xe au XVe siècle”</article-title>
          .
          <year>2022</year>
          . url: https://hal.archives-ouvertes.fr/hal-03697382.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gabay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Christensen</surname>
          </string-name>
          .
          <source>Données HTR incunables du 15e siècle</source>
          .
          <year>2022</year>
          . url: https://github.com/Gallicorpora/HTR-incunable-15e-siecle.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gabay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Christensen</surname>
          </string-name>
          .
          <source>Données HTR manuscrits du 15e siècle</source>
          .
          <year>2022</year>
          . url: https://github.com/Gallicorpora/HTR-MSS-15e-Siecle.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schoen</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Saretto</surname>
          </string-name>
          . “
          <article-title>Optical Character Recognition (OCR) and Medieval Manuscripts: Reconsidering Transcriptions in the Digital Age”</article-title>
          . In:
          <source>Digital Philology: A Journal of Medieval Cultures 11.1</source>
          (
          <issue>2022</issue>
          ), pp.
          <fpage>174</fpage>
          -
          <lpage>206</lpage>
          . doi: 10.1353/dph.2022.0010. url: https://muse.jhu.edu/article/853521.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>U.</given-names>
            <surname>Springmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>K. U.</given-names>
            <surname>Schulz</surname>
          </string-name>
          .
          <article-title>Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings</article-title>
          . Oct. 20,
          <year>2016</year>
          . doi: 10.48550/arXiv.1606.05157. url: http://arxiv.org/abs/1606.05157.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Ströbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Volk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwitter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hodel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Schoch</surname>
          </string-name>
          . “
          <article-title>Evaluation of HTR models without Ground Truth Material”</article-title>
          . In: arXiv preprint arXiv:2201.06170 (
          <year>2022</year>
          ). url: http://arxiv.org/abs/2201.06170.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vlachou-Efstathiou</surname>
          </string-name>
          .
          <source>Voss.Lat.O. 41 - Eutyches “de uerbo” glossed</source>
          .
          <year>2022</year>
          . url: https://github.com/malamatenia/Eutyches.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of…</article-title>
          . Medium. Sept. 4,
          <year>2019</year>
          . url: https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          . “
          <article-title>Hierarchical Attention Networks for Document Classification”</article-title>
          . In:
          <source>Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . San Diego, California: Association for Computational Linguistics,
          <year>2016</year>
          , pp.
          <fpage>1480</fpage>
          -
          <lpage>1489</lpage>
          .
          doi: 10.18653/v1/N16-1174. url: http://aclweb.org/anthology/N16-1174.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>