<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Does Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Benjamin</forename><surname>Kiessling</surname></persName>
							<email>benjamin.kiessling@ephe.psl.eu</email>
							<affiliation key="aff0">
								<orgName type="department">École Pratique des Hautes Études</orgName>
								<orgName type="institution">Université PSL</orgName>
								<address>
									<addrLine>4-14 rue Ferrus</addrLine>
									<postCode>75014</postCode>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thibault</forename><surname>Clérice</surname></persName>
							<email>thibault.clerice@inria.fr</email>
							<affiliation key="aff1">
								<orgName type="institution">Inria</orgName>
								<address>
									<addrLine>48 rue Barrault</addrLine>
									<postCode>75013</postCode>
									<settlement>Paris</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Does Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7BB105B81428E22D83AB22FD26D2ADBE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>handwritten text recognition</term>
					<term>medieval manuscripts</term>
					<term>metadata</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections. In 2024, the CATMuS Medieval dataset was released, featuring extensive diachronic coverage and a variety of languages and script types. Previous research indicated that model performance degraded on the best manuscripts over time as more data was incorporated, likely due to overgeneralization. This paper investigates the impact of incorporating contextual metadata in training HTR models using the CATMuS Medieval dataset to mitigate this effect. Our experiments compare the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models utilizing semantic contextual tokens (Century, Script, Language) outperform baseline models, particularly on challenging manuscripts. The study underscores the importance of metadata in enhancing model accuracy and robustness across diverse historical texts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The digitization wave of the past two decades has significantly increased online access to historical manuscripts. Despite this progress, a substantial number of these documents are available only as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a vital tool for converting these images into text, facilitating the analysis of vast historical collections such as Camps et al.'s work <ref type="bibr" target="#b1">[2]</ref>. Consequently, multiple large datasets have emerged in recent years <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b16">17]</ref>. However, most of these datasets are mono- or bilingual, with relatively limited geographical, temporal, scribal, and generic diversity. While this does not affect the quality of the datasets per se, it limits the generalization of models derived from them. Specifically, such models may face vocabulary limitations in the case of language or generic unicity (e.g., corpora composed solely of biblical content <ref type="bibr" target="#b6">[7]</ref>), and graphical interpretation issues due to the lack of scribal, temporal, or geographical variation.</p><p>The Middle Ages, spanning approximately ten centuries, encompass a period of immense linguistic and cultural diversity. This era witnessed the evolution of numerous languages and dialects, each with distinct characteristics and scripts. From Old English and Latin to Old High German and Old French, the linguistic landscape of the medieval period was dynamic and continually evolving. This diversity poses both challenges and opportunities for HTR, as models must be capable of handling a wide array of scripts and languages that changed significantly over time. 
Addressing these challenges requires datasets that reflect the rich and varied nature of medieval manuscripts, incorporating a broad spectrum of geographical, temporal, and scribal variations to enhance the robustness and generalizability of HTR models.</p><p>In late 2023 and early 2024, the publication of the CATMuS Kraken model <ref type="bibr" target="#b12">[13]</ref> and subsequently the CATMuS Medieval dataset <ref type="bibr" target="#b2">[3]</ref> opened up new opportunities for training and evaluating generic models across a vast diversity of categories and traits. With 200 manuscripts in the initial January 2024 release, and 250 in the 1.5.0 July release, encompassing 10 languages and 6 other metadata fields, these resources provide a robust framework for developing generalizable models that account for these specific features. However, in their initial study, Pinche et al. <ref type="bibr" target="#b12">[13]</ref> indicate that the new generalizing models, trained on the comprehensive dataset, exhibited a drop in performance compared to earlier, more language-specific models. This finding seems to contradict the intended benefit of large, intercompatible datasets<ref type="foot" target="#foot_0">1</ref>.</p><p>One promising approach to mitigate these issues is the enrichment of handwritten text datasets with metadata. Metadata provides contextual information that can enhance model training and improve recognition accuracy. For instance, metadata on the century of production, language, script, and genre can help models better understand and adapt to the specific characteristics of the text they are processing.</p><p>This paper explores the potential need for metadata-enriched handwritten text datasets. We hypothesize that incorporating detailed metadata can improve HTR performance, particularly for complex historical texts. 
By analyzing the performance of current models on metadata-enriched versus non-enriched datasets, we aim to demonstrate the benefits of this approach and propose a framework for its implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>Automatic text recognition in general, and the processing of historical typewritten and machine-printed material in particular, has seen a stellar rise in recent years. This advancement has had a profound impact on scholarly work, especially in the field of historical research. The retrodigitization and accurate transcription of most types of historical documents, which were once laborious tasks, can now be accomplished with relative ease and sufficient precision to enable a multitude of novel investigations.</p><p>Metadata and domain knowledge have long played important roles in the design of automatic text recognition (ATR) systems. In fact, the limitations of early ATR methods, principally utilized for the processing of documents in tightly constrained domains, necessitated incorporating both to restrict the search space and boost accuracy to acceptable levels. Examples include systems designed to aid in automatic letter sorting, where the vocabulary is effectively closed, but also general-purpose ATR software such as Tesseract <ref type="bibr" target="#b14">[15]</ref> utilizing extensive dictionaries and other means of language modelling.</p><p>Unfortunately, traditional techniques to incorporate metadata have strong normalizing tendencies, which are problematic for the recognition of historical documents with their often diverse language use, orthography, and multilingualism. While modern ATR software with its more powerful text recognition methods dispenses with many of these accuracy-boosting techniques, this is doubly true for software designed for historical document digitization like <ref type="bibr" target="#b8">[9]</ref>, which in most cases goes to great lengths to eliminate them as far as possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Automatic Text Recognition</head><p>The principal paradigms employed in typical automatic text recognition systems have been stable for more than a decade, although considerable research has resulted in recognition methods that are significantly more powerful, with higher accuracy, better generalization, and increased ease of training than the basic algorithm proposed in <ref type="bibr" target="#b5">[6]</ref>. These recognizers are placed at the end of a pipeline of interconnected processes. A rudimentary but fairly standard ATR pipeline will ingest one digitized page image at a time, perform any necessary pre-processing, e.g. rectification, dewarping, or binarization, find individual lines on the page image in a step called layout analysis, and feed the identified lines individually through the text recognizer. In a final step, the recognition results of the individual lines are reassembled into a paginated text by concatenation and serialization into raw text files, or combined with data from the layout analysis to produce a digital facsimile, most frequently in standardized formats like ALTO or PageXML.</p><p>The most important feature of these ATR systems is that they implement text recognition as a sequence-to-sequence modelling task where the input sequence is typically a line image and the desired output sequence a string of characters. There are multiple ways to construct such a sequence-mapping text recognizer, although the most popular approach is with Connectionist Temporal Classification (CTC) loss <ref type="bibr" target="#b5">[6]</ref>, which permits the model to learn without requiring an explicit alignment between input and output. 
Further, these methods have multiple other advantages, some especially pertinent for historical document retrodigitization: training data creation is typically much faster than with older character-based ATR methods as line-wise annotation is generally more efficient, a lack of explicit character segmentation markedly improves error rates on cursive writing and connected scripts, and the ability of the recognizer to take contextual information into account boosts accuracy on characters that are difficult to recognize in isolation, e.g. in the case of degraded writing.</p><p>Style-aware HTR and other metadata-enriched architectures While interventions contributing domain knowledge into ATR systems at a general language level, e.g. with dictionaries or language modelling, are widespread, approaches explicitly leveraging other metadata that might be known about the text to be recognized have rarely been described in the literature. Minor exceptions include a method described in <ref type="bibr" target="#b19">[20]</ref>, similar to the semantic context token in section 3.2, for the processing of standardized European Accident Statements, achieving a 10% reduction in CER with an architecture concatenating a metadata vector to the encoder features in a standard CNN-LSTM trained with CTC.</p><p>[1] describes a metadata-aware handwritten text recognition method, albeit for a very different use case: a 𝑘-shot learning algorithm for style-aware HTR based on meta-learning. A base model is first trained on a text recognition training set enriched with writer labels, where each meta-learning task corresponds to writing produced by a single writer. During inference on writing produced by a previously unknown individual scribe, an update of the model weights with a low number of labelled samples results in an adapted model for this particular scribe. 
This approach boosts accuracy by around 5-7 percentage points in comparison to naive fine-tuning.</p><p>Automatic Text Recognition (ATR) datasets for historical, and specifically medieval, manuscripts likely began with Latin script datasets from the IAM Historical Databases, notably the Parzival <ref type="bibr" target="#b4">[5]</ref> and St. Gall <ref type="bibr" target="#b3">[4]</ref> subsets. These datasets, which remain widely used for benchmarking new ATR engines, are relatively small (1,000 and 4,000 lines respectively), derive from single source documents, and are fundamentally incompatible due to differing annotation guidelines.</p><p>Late 2010s datasets, such as those developed by D. Stutzmann and the company Teklia <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref> <ref type="foot" target="#foot_1">2</ref>, have taken a more focused generic approach (e.g., cartularies, books of psalms) and provided a significantly larger number of lines (more than 120,000 for HIMANIS). However, these datasets are limited by their generic and language unicity, and their use of annotation guidelines that resolve abbreviations restricts their reusability in multilingual settings. This is due to genre- or language-specific abbreviations and normalizations, which pose challenges for context-dependent abbreviation resolution <ref type="bibr" target="#b20">[21]</ref>.</p><p>The CATMuS dataset offers an innovative framework to address these limitations, enabling testing of ATR models across diachronic (8th-16th century), diageneric (from practical documents to poetry), and multilingual (10 languages) variations. With a consistent annotation approach, the CATMuS dataset <ref type="bibr" target="#b2">[3]</ref> allows for the development and evaluation of single models capable of handling the rich diversity of medieval manuscripts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed method</head><p>We propose two basic approaches for injecting metadata at different points of a text recognition method and evaluate their impact on recognition performance against a baseline of an advanced attentional text recognizer based on the Conformer architecture <ref type="bibr" target="#b7">[8]</ref> and the default hybrid convolutional and recurrent neural recognizer of the kraken OCR engine. Although our experiments are run on an adaptation of fairly complex Conformer models, the fundamental idea can be employed in almost any type of text recognizer based on neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Text Recognition with Transformers</head><p>The baseline system consists of an adapted Conformer, a Transformer-style <ref type="bibr" target="#b21">[22]</ref> neural network augmented with convolutional layers, currently the dominant neural network architecture in automatic speech recognition (ASR). While ASR and ATR share many of the same features, e.g. relatively low-dimensional inputs and a prevalent sequence-to-sequence paradigm, there is no reported use of Conformers in the ATR domain as of yet.</p><p>While the fundamental architecture requires no adaptation for text recognition, even very large text recognition datasets are significantly smaller than the speech corpora typically used in ASR research, which necessitates downscaling the network for reliable convergence (encoder_dim = 144, encoder_layers = 16, num_attention_heads = 4). In addition, we adopt the computationally more efficient depthwise-convolution downsampling schema (conv_channels = 32, subsampling_factor = 4) from <ref type="bibr" target="#b13">[14]</ref>, which roughly doubles inference speed without accuracy losses.</p><p>Our baseline recognizer consists of this down-sized Conformer encoder followed by a single fully connected layer as a decoder. Like most text recognition methods it is trained with CTC loss.</p></div>
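As a purely illustrative aside (not code from the paper), the decoding step of such a CTC-trained recognizer can be sketched as greedy best-path decoding, which collapses runs of repeated labels and removes blank symbols; the alphabet mapping below is a toy example, not the CATMuS character set:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: collapse runs of identical labels, drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Hypothetical per-frame argmax output of the encoder (0 = CTC blank).
frames = [0, 1, 1, 0, 1, 2, 2, 0]
alphabet = {1: "a", 2: "b"}  # toy alphabet for illustration
text = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
```

Because repeats are collapsed before blanks are removed, the blank between the two `1` labels is what allows the doubled letter "aa" to survive decoding.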
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Semantic Context Token</head><p>Our first proposed method explicitly supplies the text recognizer with contextual information on the line to be recognized during training and inference. Given an input image of a line 𝐼 ∈ ℝ^{𝑤×ℎ×𝑐} with height ℎ, width 𝑤, and 𝑐 channels, and a vector ⃗𝑡 ∈ {0, 1}^{|⃗𝑡|} containing the encoded metadata, which we will call the semantic context token, we simply expand the token to a block of size |⃗𝑡| × ℎ × 𝑐 and concatenate it column-wise to the input, resulting in a new input to the network 𝐼′ ∈ ℝ^{(𝑤+|⃗𝑡|)×ℎ×𝑐}. The neural network is then trained as usual with CTC loss.</p><p>The chosen metadata is encoded into the semantic context token ⃗𝑡 through a simple multi-hot encoding, suitable for a wide range of tag-type metadata, placing a high value at a particular position in the vector to indicate the presence of a tag. Classes are dealt with through expansion, e.g. a language metadata field with possible values 𝐿 = {Castilian, Venetian, Latin} would be converted into a semantic context token of size |⃗𝑡_𝐿| = 3. An obvious drawback of this method is that the text recognizer needs to be supplied the same array of metadata during both training and inference, i.e. it can only effectively recognize unknown text lines when the same metadata used during training is known.</p></div>
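A minimal sketch of the multi-hot encoding and column-wise concatenation described above; the tag vocabularies are illustrative stand-ins, not the actual CATMuS value sets:

```python
# Illustrative tag vocabularies; the real CATMuS fields are larger.
LANGUAGES = ["Castilian", "Venetian", "Latin"]
SCRIPTS = ["Caroline", "Textualis"]

def context_token(language, script):
    """Multi-hot semantic context token: one slot per possible tag value."""
    return ([1.0 if language == l else 0.0 for l in LANGUAGES]
            + [1.0 if script == s else 0.0 for s in SCRIPTS])

def concat_token(image, token):
    """image: nested lists of shape [w][h][c]. Each token entry becomes one
    extra column of height h, broadcast over the c channels."""
    h, c = len(image[0]), len(image[0][0])
    extra_cols = [[[v] * c for _ in range(h)] for v in token]
    return image + extra_cols

tok = context_token("Latin", "Textualis")
line = [[[0.0] for _ in range(96)] for _ in range(128)]  # w=128, h=96, c=1
augmented = concat_token(line, tok)  # width grows from 128 to 128 + |t|
```

In a real training pipeline the same concatenation would be done on the framework's tensors; the nested-list version only makes the shape arithmetic explicit.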
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Auxiliary Loss</head><p>In contrast to the first approach, which is intended to induce the recognition model to context switch based on explicitly provided information during inference, our second method relies on an auxiliary loss during training to aid the network in learning the structure of the input data without requiring a semantic context token during inference.</p><p>Instead, the network is trained to reconstruct the semantic context token as the output of a side branch of the text recognition network. This side branch, situated just after the Conformer encoder, consists of a simple adaptive max pooling and fully connected layer and operates on the totality of the encoder features. For a context token 𝑡 and a prediction of the side branch 𝑡̂ of size 𝑛 the auxiliary loss 𝐿 aux is computed using binary cross-entropy (BCE):</p><formula xml:id="formula_0">𝐿 aux (𝑡, 𝑡̂) = ∑_{𝑖=1}^{𝑛} 𝑙 𝑖<label>(1)</label></formula><p>where</p><formula xml:id="formula_1">𝑙 𝑖 = −[𝑡 𝑖 ⋅ log 𝑡̂ 𝑖 + (1 − 𝑡 𝑖 ) ⋅ log(1 − 𝑡̂ 𝑖 )]<label>(2)</label></formula><p>The overall training objective thus becomes:</p><formula xml:id="formula_2">𝐿 = (1 − 𝑤) ⋅ 𝐿 CTC + 𝑤 ⋅ 𝐿 aux<label>(3)</label></formula><p>where 𝑤 is an additional hyperparameter of the training process that determines the proportion between the main CTC and auxiliary BCE loss. In line with common practice, and confirmed with preliminary experiments, we chose to put a relatively low weight (𝑤 ∈ (0.1, 0.3)) on the auxiliary loss during training. 
benefit from the expanded and more varied test set, enhancing the robustness of our evaluation without compromising the integrity of our initial training and validation processes.</p><p>Representing the diversity, or lack thereof, in the CATMuS dataset is challenging due to the various metrics (lines, characters, pages, or documents) and numerous features to consider (genre, language, script, century, etc.). Language can be seen as a super-category, which is then refined by genre if we view genre as primarily limiting vocabulary. In our dataset description, we focus on script (which can serve as a proxy for century) and language, and use lines as the metric of choice. Lines are ultimately the unit used for training (sample and batch size) and offer a compromise between document and character count. However, it is important to note that some documents are heavily represented in terms of lines, while others have much longer lines (particularly in the context of prose vs. poetry), affecting the overall representation.</p><p>CATMuS 1.0.1 and 1.5.0 are highly uneven across categories. In Table <ref type="table" target="#tab_1">2</ref>, we identify four particularly challenging "couples" in the test set: 156 lines of Castilian in Humanistica script, 273 lines of French in Semihybrida, 736 lines of Navarrese, and 147 lines of Venetian in Textualis script. Each of these scripts has representatives in the training and development sets in other languages, but Venetian has only two documents in CATMuS (1 in train and 1 in test since CATMuS 1.0.0) and Navarrese has only one document overall, and only in the test set. However, the Textualis script, which represents these languages, is the most common script in the training and development sets (see Table <ref type="table" target="#tab_0">1</ref>). We anticipate these test lines to be the most difficult for the model to predict. Latin is the most represented language across scripts, missing </p></div>
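Returning to the training objective of section 3.3, the BCE auxiliary loss and its combination with the CTC loss can be sketched in plain Python (illustrative only; any deep-learning framework provides these primitives):

```python
import math

def bce_loss(target, pred, eps=1e-12):
    """Binary cross-entropy between a multi-hot context token `target`
    and the side branch's sigmoid outputs `pred`, summed over entries."""
    return sum(-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
               for t, p in zip(target, pred))

def total_loss(l_ctc, l_aux, w=0.2):
    """Overall objective L = (1 - w) * L_CTC + w * L_aux."""
    return (1 - w) * l_ctc + w * l_aux

# Toy values: a 3-slot context token and an invented CTC loss of 1.5.
l_aux = bce_loss([1.0, 0.0, 1.0], [0.9, 0.2, 0.8])
loss = total_loss(1.5, l_aux, w=0.2)
```

The `eps` term only guards against `log(0)`; with a low weight such as `w = 0.2` the gradient signal remains dominated by the main CTC loss.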
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>We perform experiments on the latest 2024 version of the CATMuS Medieval dataset. While this dataset is sufficient in size to train a Conformer model from scratch, the models in our experiments were fine-tuned from a base model trained on around 2.5 million text lines in a large number of scripts and languages in order to reduce the time and computational resources expended.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Implementation Details</head><p>All experiments are performed using the same hyperparameters and identical initial seeds. The model architecture follows section 3.1. Line images are scaled to a fixed height of 96 pixels and padded on both sides with 16 pixels. The batch size is set to 32, the maximum supported by our Nvidia A40 GPU under BFloat16 mixed precision.</p><p>Models are trained using the AdamW optimizer <ref type="bibr" target="#b11">[12]</ref> for 100 epochs with a cosine learning rate schedule with linear warmup over 35000 iterations, equivalent to slightly more than 8 epochs at our dataset and batch size. The initial learning rate after warmup is 3e-4, decaying to 3e-5 by the end of the schedule. The network is regularized with weight decay (1e-5), dropout (0.1), and augmentation with random blurring, scaling, rotation, and elastic transforms<ref type="foot" target="#foot_4">5</ref> </p></div>
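The schedule can be sketched as follows; the total step count of 430,000 is an assumption derived from 100 epochs at roughly 4,300 iterations per epoch (implied by "35000 iterations ≈ 8 epochs"), not a figure stated in the paper:

```python
import math

def learning_rate(step, warmup=35000, total=430000, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay to lr_min by `total` steps."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at end of warmup, 1 at end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At step 0 the rate is zero, it peaks at `lr_max` exactly when warmup ends, and the half-cosine brings it smoothly down to `lr_min` at the final step.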
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experimental Setup</head><p>We chose to evaluate our methods on a subset, shown in Table <ref type="table" target="#tab_2">3</ref>, of the line-level metadata provided by the CATMuS dataset. To determine the impact of each metadata field and potential synergistic effects on recognition accuracy, both methods were trained with the language, script type, and age fields both separately and jointly. For the auxiliary loss weight, an upper limit was determined empirically, below which the values {0.1, 0.2, 0.3} were sampled for evaluation.</p><p>All models are evaluated on character accuracy. For comparison, baseline models were trained with both the default configuration of the Kraken OCR engine (CNN+LSTM recognizer) and the unmodified Conformer architecture.</p></div>
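Character accuracy is conventionally derived from the edit distance between reference and predicted transcriptions; a minimal sketch of the standard definition (not the paper's exact evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ref, hyp):
    """Character accuracy = 1 - CER, floored at zero."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

Dividing by the reference length means insertions can push the raw CER above 1, hence the floor at zero accuracy.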
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Test Results (Character Accuracy): Models or combinations not reported failed to converge, exhibiting a micro-accuracy below 15%. Macro-accuracy represents the mean of document-level accuracies.</p></div>
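The two aggregation schemes in the caption can be made concrete with a toy example (the numbers are invented): micro-accuracy pools all characters, so large documents dominate, while macro-accuracy weights every document equally:

```python
def micro_macro_accuracy(per_doc):
    """per_doc: list of (correct_chars, total_chars) tuples, one per document."""
    micro = sum(c for c, _ in per_doc) / sum(t for _, t in per_doc)
    macro = sum(c / t for c, t in per_doc) / len(per_doc)
    return micro, macro

# One large, well-recognized document and one small, difficult one:
# the small document barely moves micro-accuracy but pulls macro down.
micro, macro = micro_macro_accuracy([(9800, 10000), (40, 50)])
```

This is why macro-accuracy is the more sensitive indicator for the rare, challenging manuscripts discussed in the results.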
<div xmlns="http://www.tei-c.org/ns/1.0"><p>General results. Out of the two proposed architectures, only the Conformer model using contextual input tokens with all context tokens (Century, Script, Language) consistently outperforms the other models. Specifically, this model surpasses the baseline Conformer architecture, which itself outperforms the original Kraken baselines (see Table <ref type="table">4</ref>). Models that utilized a single category of features, such as Language or Century, ultimately performed worse than the baseline. The auxiliary loss approach yielded unexpected results: out of the 12 configurations (four types of tasks with three types of loss weights), half did not converge and resulted in character accuracies below 15%. Even worse, the observed unstable training behavior seems to be unrelated to the chosen weight 𝑤, which indicates that optimal hyperparameters must be determined for each new dataset and metadata token.</p><p>Accuracy dispersion across manuscripts. The Contextual Input model consistently outperforms all other models, with the lowest median CER and the lowest variance. For the most challenging manuscript, it achieves over a 2 percentage point increase in accuracy compared to the Conformer baseline (see Table <ref type="table" target="#tab_4">5</ref>). Additionally, the Contextual Input model, without ablation, exhibits the smallest variance among all models (see Figure <ref type="figure" target="#fig_4">3a</ref>). Compared to the baseline (see Figure <ref type="figure" target="#fig_4">3b</ref>), the model utilizing the contextual token demonstrates superior accuracy, with a median improvement of 0.64 percentage points. It only underperforms on three manuscripts: Paris, BnF, fr. 6447 (baseline: 97.20%, -0.33); Paris, BnF, lat. 17903 (baseline: 80.25%, -0.32); and Paris, BnF, lat. 130 (baseline: 97.31%, -0.08). Ablation study. 
To evaluate the impact of the contextual token, we present results with null contextual tokens in Table <ref type="table" target="#tab_6">7</ref>. For models utilizing a single category of contextual input, removing the contextual token results in decreased accuracy, with macro-accuracy dropping by up to 3.2 percentage points for the model using the Century metadata and by as little as 0.88 points for the model using scripts. These findings suggest that the models may be overfitting to the contextual token, as evidenced by the baseline Conformer models outperforming them.</p><p>However, for the model using all contextual inputs (Context Input All), the removal of the context token leads to a smaller reduction in efficiency. Despite being less efficient with null contextual tokens, the model still leverages learned features during decoding, aligning with our expectations for the Auxiliary Loss training architecture. The minimal variation in accuracy between the zeroed-out context and the full context (&lt; 0.15 percentage points), while still surpassing the baseline, may indicate that the model has effectively learned to separate features, even without manually provided context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact of unknown features.</head><p>In documents featuring unknown or extremely rare features, such as the Navarrese language (unknown) and the Venetian language (represented by only one training sample), our results not only remain stable but also surpass those of the Conformer model when utilizing all contextual tokens. Particularly noteworthy are manuscripts BnF 65 and BnF ita. 783 (cf. Table <ref type="table" target="#tab_5">6</ref>), where we observe consistently stronger performance. Even in cases with null semantic tokens, we achieve improvements ranging from +0.2 to +0.4 points in </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this study, we explored the effectiveness of incorporating contextual metadata into Handwritten Text Recognition (HTR) models to enhance the digitization of medieval manuscripts.</p><p>Utilizing the CATMuS Medieval dataset, which offers a rich variety of scripts, languages, and centuries, we compared the performance of Conformer models with and without contextual inputs, as well as training these models with auxiliary classification tasks. Our objective was to determine whether adding metadata such as Century, Script, and Language could improve model accuracy and robustness. We tested several configurations, including models with single and multiple contextual tokens, and evaluated them against both the baseline Conformer architecture and the original Kraken baselines. By doing so, we aimed to identify the most effective strategies for leveraging contextual information in HTR tasks.</p><p>Our results showed that the Conformer model using all contextual input tokens (Century, Script, Language) consistently outperformed other configurations, including the baseline models. This model achieved higher accuracy, particularly on the most challenging manuscripts, with an improvement of over 2 percentage points in some cases. Moreover, it exhibited the smallest variance in performance, indicating its robustness across different types of manuscripts. The use of multiple contextual tokens enabled the model to effectively learn and utilize diverse features, leading to better generalization. Interestingly, models with single contextual tokens did not perform as well and often fell short of the baseline, suggesting that a more comprehensive approach to metadata integration is necessary. 
Additionally, the auxiliary loss approach did not yield the expected improvements and frequently resulted in non-converging models, indicating the complexity of effectively balancing multiple training objectives.</p><p>While our approach demonstrated significant improvements, there are several areas for future exploration. First, the current approach relies on multi-hot encoding of the various categories without embedding these features into a learnable space beforehand. Approaches in natural language processing, such as <ref type="bibr" target="#b9">[10]</ref>, could potentially allow the model to approximate relationships between scripts and languages that are closely related, such as 'Caroline' and 'Humanistica' scripts. Secondly, the context token method appends information directly onto the image data fed into the encoder, a design choice motivated by the very lightweight FFN decoder, which we deemed unlikely to effectively make use of encoder features augmented with the raw context token. Combining a more powerful decoder, e.g. a pre-trained language model as in <ref type="bibr" target="#b10">[11]</ref>, with injecting metadata after the encoder is an avenue of future research. Such an architecture with a clear separation between the visual and linguistic model would presumably be beneficial for some types of semantic tokens, in particular language and genre, which we consider to be of more importance to the latter than the former. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Architecture of the semantic context token method: the multi-hot encoded token is concatenated (light green) column-wise to the input image. The combined input is then fed through the recognition network as normal, both during training and inference. The encoder (orange) is our modified Conformer network, the decoder (light blue) is a single layer feed-forward network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Architecture of the auxiliary loss method: during training the encoder features are processed by the side branch (light yellow) to predict the context token for a particular line. The auxiliary loss 𝐿 aux is merged with the main CTC loss 𝐿 CTC computed on the predicted text to arrive at the overall loss 𝐿.</figDesc></figure>
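The merge of the auxiliary and CTC losses in Figure 2 is presumably a weighted sum; a minimal sketch, assuming a scalar weighting and treating both losses as already-computed scalars (the 0.1 weight mirrors the "All (0.1)" configuration reported in Table 5, and is otherwise an assumption):

```python
def combined_loss(l_ctc, l_aux, weight=0.1):
    """Overall training loss L = L_CTC + weight * L_aux (Figure 2).
    Balancing the two objectives via `weight` is the delicate part:
    the paper reports frequent non-convergence with this approach."""
    return l_ctc + weight * l_aux

# e.g. a CTC loss of 2.0 and an auxiliary classification loss of 0.8
total = combined_loss(2.0, 0.8)
```

At inference the side branch is discarded and only the main CTC decoding path is used.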
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Dispersion of the CER across manuscripts per model for the main models (b) CER difference between the baseline and the best model (Context. Input All nonzeroed).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Dispersion of CER across models on the test set.</figDesc><graphic coords="12,128.47,84.17,187.52,180.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Number of lines in the train and development splits in CATMuS 1.0.0</figDesc><table><row><cell cols="6">Castilian Catalan English French Italian Latin Middle Dutch Venetian</cell></row><row><cell>Caroline</cell><cell></cell><cell>538</cell><cell></cell><cell></cell><cell>6706</cell></row><row><cell>Cursiva</cell><cell>300</cell><cell>482</cell><cell>7560</cell><cell cols="2">595 1394</cell></row><row><cell>Gothic Documentary Script</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>525</cell></row><row><cell>Humanistica</cell><cell></cell><cell></cell><cell></cell><cell>929</cell><cell>598</cell><cell>94</cell></row><row><cell>Hybrida</cell><cell>7089</cell><cell>196</cell><cell>271</cell><cell cols="2">184 1619</cell></row><row><cell>Personal</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>151</cell></row><row><cell>Praegothica</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>816</cell></row><row><cell>Print</cell><cell>5552</cell><cell></cell><cell>11308</cell><cell></cell><cell>1880</cell></row><row><cell>Semihybrida</cell><cell>613</cell><cell>172</cell><cell></cell><cell></cell><cell>605</cell></row><row><cell>Semitextualis</cell><cell>9669</cell><cell>416</cell><cell>416</cell><cell></cell><cell>679</cell></row><row><cell>Textualis</cell><cell>7609</cell><cell></cell><cell>28688</cell><cell cols="2">444 5922</cell><cell>45998</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Representation of lines by script-language pair in the test set compared to the data in the train and development splits, as a percentage, such that 𝑣 = |Lines_test|/max(1, |Lines_train| + |Lines_dev|). When there are no data in the train and development sets, the percentage is normalised using 1 as the number of lines, and values are put in bold. […] representation in only five classes in the test sets. Additionally, two scripts (Personal and Print) and two languages (Catalan and English) are absent from the test set entirely. Caroline and Praegothica scripts are over-represented in the test set in terms of lines, but for Caroline this metric hides the reality in number of documents: three documents in Latin Caroline are in the test set, while 22 different small documents represent this script in the train and dev splits4 .</figDesc><table><row><cell></cell><cell>Castilian</cell><cell cols="4">French Italian Latin Middle Dutch Navarrese Venetian</cell></row><row><cell>Caroline</cell><cell></cell><cell></cell><cell>101.2</cell><cell></cell></row><row><cell>Cursiva</cell><cell></cell><cell>18.1</cell><cell>60.3</cell><cell></cell></row><row><cell>Gothic Doc. Humanistica</cell><cell>15600.0</cell><cell>54.0</cell><cell>20.2 45.7</cell><cell></cell></row><row><cell>Hybrida</cell><cell>7.6</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Praegothica</cell><cell></cell><cell></cell><cell>106.1</cell><cell></cell></row><row><cell>Print Semihybrida</cell><cell cols="2">3.7 25.1 27300.0</cell><cell></cell><cell></cell></row><row><cell>Semitextualis Textualis</cell><cell></cell><cell>5.2</cell><cell>28.4 4.8</cell><cell>2.3</cell><cell>73600.0 14700.0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Selected metadata fields and values</figDesc><table><row><cell>Field</cell><cell>Values</cell></row><row><cell>Language</cell><cell>Italian, English, French, Castilian, Latin, Middle Dutch, Navarrese, Venetian, Catalan</cell></row><row><cell>Script type</cell><cell>Caroline, Cursiva, Gothic Documentary Script, Humanistica, Hybrida, Praegothica, Personal, Print, Semihybrida, Semitextualis, Textualis</cell></row><row><cell>Century</cell><cell>8, 9, 10, 11, 12, 13, 14, 15, 16</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Test results for the two worst-performing manuscripts across models (Bibliothèque inter-universitaire de la Sorbonne, 193 &amp; BnF, Lat. 17903) and the two best ones (BnF, fr. 13496 &amp; BnF, fr. 574). Only the best Aux. Loss and Context. Input configurations are kept.</figDesc><table><row><cell></cell><cell>Input</cell><cell cols="4">BIUS 193 BnF, Lat. 17903 BnF, fr. 13496 BnF, fr. 574</cell></row><row><cell>Kraken</cell><cell></cell><cell>77.19</cell><cell>77.27</cell><cell>95.88</cell><cell>94.94</cell></row><row><cell>Conformer</cell><cell></cell><cell>80.52</cell><cell>80.25</cell><cell>96.53</cell><cell>97.72</cell></row><row><cell>Aux. Loss</cell><cell>All (0.1)</cell><cell>78.64</cell><cell>80.62</cell><cell>97.43</cell><cell>97.35</cell></row><row><cell></cell><cell>All</cell><cell>82.22</cell><cell>79.93</cell><cell>97.25</cell><cell>98.04</cell></row><row><cell></cell><cell>All (zeroed)</cell><cell>82.21</cell><cell>80.63</cell><cell>97.08</cell><cell>98.08</cell></row><row><cell>Context. Input</cell><cell>Century</cell><cell>74.64</cell><cell>77.69</cell><cell>95.79</cell><cell>95.68</cell></row><row><cell></cell><cell>Language</cell><cell>75.60</cell><cell>78.76</cell><cell>96.30</cell><cell>96.98</cell></row><row><cell></cell><cell>Script</cell><cell>71.32</cell><cell>78.46</cell><cell>95.86</cell><cell>96.78</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Test zoom-in on manuscripts with an unknown language (character accuracy).</figDesc><table><row><cell></cell><cell>Input</cell><cell cols="2">BnF, esp. 65 BnF, ita. 783</cell></row><row><cell>Kraken</cell><cell></cell><cell>91.94</cell><cell>90.37</cell></row><row><cell>Conformer</cell><cell></cell><cell>93.60</cell><cell>92.91</cell></row><row><cell>Context. Input</cell><cell>All</cell><cell>94.08</cell><cell>93.11</cell></row><row><cell></cell><cell>All (zeroed)</cell><cell>94.14</cell><cell>93.07</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Ablation results (character accuracy). All Conformer models using contextual inputs include configurations with the nullification of the contextual token, indicated as (zeroed).</figDesc><table><row><cell></cell><cell>Input</cell><cell cols="2">Micro-Accuracy Macro-Accuracy</cell></row><row><cell>Conformer</cell><cell></cell><cell>90.32</cell><cell>92.07</cell></row><row><cell></cell><cell>All</cell><cell>91.14</cell><cell>92.86</cell></row><row><cell></cell><cell>All (zeroed)</cell><cell>91.13</cell><cell>92.79</cell></row><row><cell></cell><cell>Century</cell><cell>87.53</cell><cell>89.59</cell></row><row><cell>Context. Input</cell><cell>Century (zeroed)</cell><cell>85.52</cell><cell>88.64</cell></row><row><cell></cell><cell>Language</cell><cell>88.37</cell><cell>90.21</cell></row><row><cell></cell><cell>Language (zeroed)</cell><cell>86.55</cell><cell>88.66</cell></row><row><cell></cell><cell>Script</cell><cell>87.73</cell><cell>89.91</cell></row><row><cell></cell><cell>Script (zeroed)</cell><cell>87.22</cell><cell>89.03</cell></row></table></figure>
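The "(zeroed)" ablation rows of Table 7 replace the context token with an all-zero vector: the model still receives the extra input columns, but they carry no metadata. A minimal sketch of this inference-time choice, with a hypothetical toy encoder over a four-entry vocabulary (the real vocabulary covers all fields of Table 3):

```python
def context_token(metadata, encode, vocab_size):
    """Context token fed to the network at inference: the encoded
    metadata when supplied, otherwise the all-zero ('zeroed') token,
    which keeps the input shape unchanged while providing no context."""
    if metadata is None:
        return [0] * vocab_size
    return encode(metadata)

# toy multi-hot encoder over a hypothetical 4-entry vocabulary
VOCAB = ["Latin", "French", "Caroline", "Textualis"]
encode = lambda md: [1 if v in md else 0 for v in VOCAB]

with_ctx = context_token({"Latin", "Caroline"}, encode, len(VOCAB))
zeroed = context_token(None, encode, len(VOCAB))
```

That the zeroed and non-zeroed "All" rows score nearly identically suggests the gains come largely from the training procedure rather than from the metadata supplied at inference.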
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8</head><label>8</label><figDesc>Composition of the test dataset on CATMuS 1.5.0</figDesc><table><row><cell>Shelfmark</cell><cell>Language</cell><cell>Script</cell><cell cols="2">Type Genre</cell><cell cols="3">Century Lines Characters</cell></row><row><cell>Paris, BnF, lat. 130</cell><cell>Latin</cell><cell>Caroline</cell><cell cols="2">prose Treatises</cell><cell>12</cell><cell>198</cell><cell>13843</cell></row><row><cell>Paris, BnF, lat. 8001 Paris, BnF, lat. 7499</cell><cell>Latin Latin</cell><cell>Caroline Caroline</cell><cell cols="2">vers prose Treatises Poetry</cell><cell cols="2">13 10 6086 504</cell><cell>19550 168092</cell></row><row><cell>Paris, BnF, fr. 1881</cell><cell>French</cell><cell>Cursiva</cell><cell cols="2">verse Narratives</cell><cell>16</cell><cell>163</cell><cell>3507</cell></row><row><cell>Paris, BnF, fr. 604</cell><cell>French</cell><cell>Cursiva</cell><cell cols="2">verse Narratives</cell><cell>15</cell><cell>343</cell><cell>10514</cell></row><row><cell>Paris, BnF, fr. 413</cell><cell>French</cell><cell>Cursiva</cell><cell cols="2">prose Narratives</cell><cell>15</cell><cell>860</cell><cell>26952</cell></row><row><cell>Paris, BnF, lat. 14650</cell><cell>Latin</cell><cell>Cursiva</cell><cell cols="2">prose Narratives</cell><cell>15</cell><cell>172</cell><cell>10752</cell></row><row><cell>Paris, Bibliothèque inter-universitaire de la Sorbonne, 193</cell><cell>Latin</cell><cell>Cursiva</cell><cell cols="2">prose Treatises</cell><cell>14</cell><cell>669</cell><cell>34655</cell></row><row><cell>Paris, BnF, lat. 10996</cell><cell>Latin</cell><cell cols="3">Gothic Documentary Script prose Documents of practice</cell><cell>13</cell><cell>106</cell><cell>5548</cell></row><row><cell>Paris, BnF, esp. 
368</cell><cell>Castilian</cell><cell>Humanistica</cell><cell cols="2">prose Treatises</cell><cell>16</cell><cell>156</cell><cell>9092</cell></row><row><cell>Paris, BnF, ita. 481</cell><cell>Italian</cell><cell>Humanistica</cell><cell cols="2">prose Narratives</cell><cell>14</cell><cell>502</cell><cell>18493</cell></row><row><cell>Florence, Biblioteca Medicea Laurenziana, Laur. Plut. 39.34</cell><cell>Latin</cell><cell>Humanistica</cell><cell>vers</cell><cell>Poetry</cell><cell>15</cell><cell>135</cell><cell>4268</cell></row><row><cell>Paris, BnF, Smith-Lesouëf 16</cell><cell>Latin</cell><cell>Humanistica</cell><cell cols="2">prose Documents of practice</cell><cell>16</cell><cell>138</cell><cell>6415</cell></row><row><cell>Paris, BnF, esp. 36</cell><cell>Castilian</cell><cell>Hybrida</cell><cell cols="2">prose Narratives</cell><cell>14</cell><cell>541</cell><cell>20043</cell></row><row><cell>Paris, BnF, lat. 17903</cell><cell>Latin</cell><cell>Praegothica</cell><cell>vers</cell><cell>Poetry</cell><cell>13</cell><cell>439</cell><cell>16228</cell></row><row><cell cols="2">Montpellier, Bibliothèque universitaire Historique de Médecine, H318 Latin</cell><cell>Praegothica</cell><cell cols="2">prose Treatises</cell><cell>12</cell><cell>427</cell><cell>26773</cell></row><row><cell>Paris, BnF, Rés. YE-1325</cell><cell>French</cell><cell>Print</cell><cell cols="2">prose Narratives</cell><cell>16</cell><cell>416</cell><cell>13957</cell></row><row><cell>Madrid, BNE, MSS. 3995</cell><cell>Castilian</cell><cell>Semihybrida</cell><cell cols="2">prose Treatises</cell><cell>15</cell><cell>154</cell><cell>6178</cell></row><row><cell>Paris, BnF, fr. 2701</cell><cell>French</cell><cell>Semihybrida</cell><cell cols="2">prose Treatises</cell><cell>15</cell><cell>273</cell><cell>13923</cell></row><row><cell>Paris, BnF, lat. 14137 Paris, BnF, fr. 
574</cell><cell>Latin French</cell><cell>Semitextualis Textualis</cell><cell cols="2">vers prose Treatises Poetry</cell><cell>14 14</cell><cell>193 113</cell><cell>5706 2451</cell></row><row><cell>Paris, BnF, fr. 13496 Paris, BnF, fr. 747</cell><cell>French French</cell><cell>Textualis Textualis</cell><cell cols="2">prose Narratives prose Narratives</cell><cell>13 13</cell><cell>159 91</cell><cell>4755 5349</cell></row><row><cell>Paris, BnF, fr. 6447</cell><cell>French</cell><cell>Textualis</cell><cell cols="2">prose Narratives</cell><cell>13</cell><cell>383</cell><cell>16310</cell></row><row><cell>Paris, BnF, fr. 23117</cell><cell>French</cell><cell>Textualis</cell><cell cols="2">prose Narratives</cell><cell>13</cell><cell>736</cell><cell>24203</cell></row><row><cell>Paris, BnF, NAL 730</cell><cell>Latin</cell><cell>Textualis</cell><cell cols="2">prose Treatises</cell><cell>14</cell><cell>284</cell><cell>14612</cell></row><row><cell>Vienna, ÖNB, 12.905</cell><cell cols="2">Middle Dutch Textualis</cell><cell cols="2">prose Treatises</cell><cell>14</cell><cell>1047</cell><cell>40465</cell></row><row><cell>Paris, BnF, esp. 65</cell><cell>Navarrese</cell><cell>Textualis</cell><cell cols="2">prose Treatises</cell><cell>14</cell><cell>736</cell><cell>19932</cell></row><row><cell>Paris, BnF, ita. 783</cell><cell>Venetian</cell><cell>Textualis</cell><cell cols="2">prose Narratives</cell><cell>14</cell><cell>147</cell><cell>7361</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">It is important to note, however, that the models were compared using a similar architecture, without any hyperparameter optimization based on the newly acquired diversity of the dataset. This suggests that further optimization and adaptation may be necessary to fully leverage the potential of such diverse datasets.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Their publication date is relatively older than their original availability.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">We leveraged the release of a larger, more diverse test set for evaluation; however, due to the short time frame (less than five days) between the release of version 1.5.0 and the submission deadline of this paper, we were unable to retrain and redo all experiments. While some documents seem to have undergone metadata correction in between releases, we expect it to have a relatively small impact on our evaluation scores.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">This is another example of how difficult it is to represent the diversity and over-representation of some categories.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The source code for all experiments can be found under the libre Apache 2.0 licence at https://github.com/mittagessen/conformer_ocr.git.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Data</head><p>For the purpose of this paper, we utilized the CATMuS Medieval dataset, adhering to the provided dataset splits, which segment the training, validation, and evaluation sets by document. The training and validation splits were sourced from the 1.0.1 release, while the evaluation split was taken from the 1.5.0 release 3 for testing purposes (see Table <ref type="table">8</ref>). This approach allowed us to</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Bhunia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Z</forename><surname>Song</surname></persName>
		</author>
		<idno type="DOI">10.1109/cvpr46437.2021.01557</idno>
	</analytic>
	<monogr>
		<title level="m">2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<biblScope unit="volume">2021</biblScope>
			<biblScope unit="page" from="15825" to="15834" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Make Love or War? Monitoring the Thematic Evolution of Medieval French Narratives</title>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Camps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Baumard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-C</forename><surname>Langlais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Morin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Clérice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Norindr</surname></persName>
		</author>
		<ptr target=".org" />
	</analytic>
	<monogr>
		<title level="m">Computational Humanities Research</title>
				<meeting><address><addrLine>CHR</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023. 2023</date>
			<biblScope unit="page" from="734" to="756" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond</title>
		<author>
			<persName><forename type="first">T</forename><surname>Clérice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pinche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vlachou-Efstathiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chagué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Camps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gille-Levenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Brisville-Fertin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gervers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Boutreux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Manton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Haverals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vandyck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kiessling</surname></persName>
		</author>
		<ptr target="https://inria.hal.science/hal-04453952" />
	</analytic>
	<monogr>
		<title level="m">2024 International Conference on Document Analysis and Recognition (ICDAR)</title>
				<meeting><address><addrLine>Athens, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Transcription Alignment of Latin Manuscripts using Hidden Markov Models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Frinken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fornés</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bunke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 Workshop on Historical Document Imaging and Processing</title>
				<meeting>the 2011 Workshop on Historical Document Imaging and Processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="29" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatic Transcription of Handwritten Medieval Documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wuthrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liwicki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Frinken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bunke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Viehhauser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stolz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2009 15th International Conference on Virtual Systems and Multimedia</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="137" to="142" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd international conference on Machine learning. Acm</title>
				<meeting>the 23rd international conference on Machine learning. Acm</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="369" to="376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Transcribing Medieval Manuscripts for Machine Learning</title>
		<author>
			<persName><forename type="first">E</forename><surname>Gueville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Wrisley</surname></persName>
		</author>
		<ptr target="https://shs.hal.science/halshs-03725166" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Conformer: Convolution-augmented Transformer for Speech Recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gulati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pang</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2020-3015</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech</title>
				<meeting>Interspeech</meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="5036" to="5040" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Kraken -a Universal Text Recognizer for the Humanities</title>
		<author>
			<persName><forename type="first">B</forename><surname>Kiessling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Actes de Digital Humanities Conference</title>
				<editor>
			<persName><surname>Adho</surname></persName>
		</editor>
		<meeting>Digital Humanities Conference</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Categorical Metadata Representation for Customized Text Classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Amplayo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Hwang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="201" to="215" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Florencio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="13094" to="13102" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Decoupled Weight Decay Regularization</title>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=Bkg6RiCqY7" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pinche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Clérice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chagué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Camps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vlachou-Efstathiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gille Levenson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Brisville-Fertin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Boschetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gervers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Boutreux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Manton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Haverals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vandyck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>O'Connor</surname></persName>
		</author>
		<ptr target="https://inria.hal.science/hal-04346939" />
	</analytic>
	<monogr>
		<title level="m">DH2024, ADHO</title>
				<meeting><address><addrLine>Washington DC, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition</title>
		<author>
			<persName><forename type="first">D</forename><surname>Rekesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rao Koluguri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kriman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Majumdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Noroozi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hrinchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Puvvada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Balam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ginsburg</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.05084</idno>
		<idno type="arXiv">arXiv:2305.05084</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
	<note>eess.AS</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">An Overview of the Tesseract OCR Engine</title>
		<author>
			<persName><forename type="first">R</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02</title>
				<meeting>the Ninth International Conference on Document Analysis and Recognition - Volume 02<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="629" to="633" />
		</imprint>
	</monogr>
	<note>ICDAR &apos;07</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Original Charters From Fontenay before 1213</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stutzmann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>Fontenay Dataset</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Words as graphic and linguistic structures. Word spacing in Psalm 101 Domine exaudi orationem meam (eleventh-fifteenth centuries)</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stutzmann</surname></persName>
		</author>
		<idno type="DOI">10.1484/m.usml-eb.5.120721</idno>
	</analytic>
	<monogr>
		<title level="m">Les Mots au Moyen Âge -Words in the Middle Ages. Utrecht Studies in Medieval Literacy 46</title>
				<meeting><address><addrLine>Turnhout; Brepols</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="21" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">HOME-Alcar: Aligned and Annotated Cartularies</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stutzmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Torres Aguilar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Chaffenet</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">La recherche en plein texte dans les sources manuscrites médiévales: enjeux et perspectives du projet HIMANIS pour l&apos;édition électronique</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stutzmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-F</forename><surname>Moufflet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hamel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Médiévales</title>
		<imprint>
			<biblScope unit="page" from="67" to="96" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Field Typing for Improved Recognition on Heterogeneous Handwritten Forms</title>
		<author>
			<persName><forename type="first">C</forename><surname>Tomoiaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salzmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jayet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 International Conference on Document Analysis and Recognition (ICDAR)</title>
				<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="487" to="493" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Handwritten Text Recognition for Documentary Medieval Manuscripts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Torres Aguilar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Jolivet</surname></persName>
		</author>
		<idno type="DOI">10.46298/jdmdh.10484</idno>
	</analytic>
	<monogr>
		<title level="m">Historical documents and automatic text recognition</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
