<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David A. Smith</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacob Murel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Parkes Allen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Thomas Miller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Khoury College of Computer Sciences, Northeastern University</institution>
          ,
          <addr-line>Boston MA</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Roshan Institute for Persian Studies, University of Maryland</institution>
          ,
          <addr-line>College Park MD</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
      </contrib-group>
      <fpage>206</fpage>
      <lpage>221</lpage>
      <abstract>
        <p>Handwritten text recognition (HTR) has enabled many researchers to gather textual evidence from the human record. One common training paradigm for HTR is to identify an individual manuscript or coherent collection and to transcribe enough data to achieve acceptable performance on that collection. To build generalized models for Arabic-script manuscripts, perhaps one of the largest textual traditions in the pre-modern world, we need an approach that can improve its accuracy on unseen manuscripts and hands without linear growth in the amount of manually annotated data. We propose Automatic Collation for Diversifying Corpora (ACDC), which takes advantage of the existence of multiple manuscripts of popular texts. Starting from an initial HTR model, ACDC automatically detects matching passages of popular texts in noisy HTR output and selects high-quality lines for retraining HTR without any manually annotated data. We demonstrate the effectiveness of this approach to distant supervision by annotating a test set drawn from a diverse collection of 59 Arabic-script manuscripts and a training set of 81 manuscripts of popular texts embedded within a larger corpus. After a few rounds of ACDC retraining, character accuracy rates on the test set increased by 19.6% absolute, while a supervised model trained on manually annotated data from the same collection increased accuracy by 15.9%. We analyze the variation in ACDC’s performance across books and languages and discuss further applications to collating manuscript families.</p>
      </abstract>
      <kwd-group>
        <kwd>handwritten text recognition</kwd>
        <kwd>collation</kwd>
        <kwd>manuscripts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Within the past decade, widely available handwritten text recognition (HTR) tools have
enabled many disciplines to investigate a wide range of handwritten documentary sources from
the human record [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most of the current generation of HTR systems are trained at the line
level to optimize connectionist temporal classification (CTC) loss [6]. This frees users from
having to annotate individual words or characters to produce training data; instead, they simply
transcribe the plain text of each line. Creating training data nevertheless remains a bottleneck for HTR.
Nockels, Gooding, Ames, and Terras [11] describe the community around the Transkribus HTR
system as “a ‘bottom-up’ mass digitization movement, made up of hundreds of simultaneous
projects driven by motivated researchers” creating their own training data and models. Put
another way, an important use case for HTR is to help those “motivated researchers,” who know
which documents they wish to transcribe, annotate enough data so that they can train a model
to transcribe the rest.
      </p>
      <p>We contend that there is room for a complementary training paradigm for HTR. In some
projects, we are confronted with large collections of documents in a diversity of languages,
hands, genres, and time periods. If we do not know a priori which documents will be interesting,
it may be hard to allocate efficiently the time it takes to produce HTR training data for the
whole collection. Instead, we propose a distant supervision approach to training HTR that takes
advantage of the structure of many larger collections. Automatic Collation for Diversifying
Corpora (ACDC) starts by having users assemble digital editions of texts they believe will be
widely copied in the collection. Starting from an initial, imperfect model, ACDC proceeds by:
1. running initial HTR segmentation and transcription models on a diverse manuscript
collection (§3);
2. aligning passages in this HTR output with passages in the reference digital editions (§4.1);
3. selecting manuscript lines with their page-image coordinate information and their
corresponding text from the reference editions (§4.2); and
4. retraining the HTR model on the selected lines.</p>
      <p>Once a new HTR model is trained, it can be used to re-transcribe the manuscript collection and
run this process again (Figure 1).</p>
      <p>Distant supervision has been employed at the paragraph level in HTR [2] and, using
state-of-the-art vision transformers, for training joint segmentation and transcription models [3].
These systems, however, still assume that we have a “diplomatic” ground-truth transcription
of a particular manuscript paragraph or page. ACDC instead infers which matching passages
between a noisy HTR transcript and a reference digital edition are close enough to use for
training and which might contain variant readings.</p>
      <p>In this paper, we demonstrate ACDC’s effectiveness by applying it to a diverse collection
of Arabic-script manuscripts (§2). We annotate a test set drawn from a diverse collection of
59 Arabic-script manuscripts and a training set of 81 manuscripts of popular texts embedded
within a larger corpus. After a few rounds of ACDC retraining, character accuracy rates on the
test set increased by 19.6% absolute, while a supervised model trained on manually
annotated data from the same collection increased accuracy by 15.9% (§5). We analyze the
variation in ACDC’s performance across books and languages and discuss further applications
to collating manuscript families (§6). We have released our code, annotated data, and trained
models under open-source licenses.1
1See repositories of code (https://github.com/OpenITI/acdc_train), test data (https://github.com/OpenITI/aocp_ms_eval), annotated lines from the training set to compare to unsupervised ACDC (https://github.com/OpenITI/arabic_ms_data), and trained models and evaluation data (https://github.com/OpenITI/acdc_results).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Arabic-script Manuscripts</title>
      <p>
        The Islamicate written traditions together form one of the largest—if not the largest—archives
of human cultural production of the pre-modern world. Primarily written in Arabic and
Persian, and also encompassing Ottoman Turkish, Urdu, and other languages, these textual
traditions stretch through more than twelve hundred years of history and extend from Iberia and
North Africa in the west to the upper reaches of the Volga in the north, Sub-Saharan Africa in the
south, and China, the Indian subcontinent, the Philippines, and Indonesia in the east. The
exact number of extant manuscript volumes has yet to be determined with precision, but rough
estimates suggest at least several million volumes exist today in collections that range in
diversity from West African madrasas and the Turkish state archives to European and American
museums and libraries [1, pp. 34–35]. We know from modern printed editions of pre-modern
texts in these languages that the total number of discrete texts certainly eclipses the Latinate
and European vernacular traditions—perhaps rivaled only by pre-modern Chinese cultural
output. These estimates, however, are based only on modern print production, which is but the
tip of the proverbial iceberg. One prominent scholar, Carl W. Ernst, estimates that only 5–10
percent of the Persian and Arabic written tradition has been published in any print format—a
number broadly consonant with the findings of Maxim Romanov as well [4,
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Whatever the
exact numbers, it is certain that a significant portion of the Persian and Arabic written
tradition remains exclusively in manuscript form in thousands of libraries and archives across the
world. The sheer number of these manuscripts makes it difficult for scholars, librarians, and
students to focus their energies on more than the most important samples in each collection.
      </p>
      <p>To evaluate HTR models on a wide range of script types, languages, and time periods, we
collected sets of public-domain digital images digitized by 17 libraries (Table 1). From the
resulting 59 manuscripts, dating from 900 to 1869 CE, we transcribed 1704 lines. An average
of 29 lines per book is not usually enough to train a book-specific model with acceptably high
accuracy. Unlike some other evaluations [7] on Arabic-script manuscripts, however, this paper
focuses on the use case where the training set and test sets come from different manuscripts
and different hands.</p>
      <p>To test the ACDC method’s effectiveness at producing HTR training data, we selected five
texts, three in Arabic and two in Persian, for which we could find digital transcriptions and a
reasonable number of digital editions (Table 2). We downloaded 81 sets of page images of these
five texts. To perform error analysis and to be able to compare ACDC to supervised training,
we transcribed 6842 lines in total from these 81 manuscripts. None of these manual
transcriptions were used for training ACDC. During training, we also added 50 additional “distractor”
manuscripts to evaluate the alignment process. None of these 131 manuscripts overlap with
the 59 manuscripts used for testing. Furthermore, none of the 59 test manuscripts are copies
of the five widely copied works we use for training.</p>
      <p>All of the manuscript images used here have been released by the libraries digitizing them
into the public domain. We release the layout analysis and transcribed lines under an
open-source license.</p>
    </sec>
    <sec id="sec-3">
      <title>3. HTR Training and Testing</title>
      <p>We employ the Kraken HTR system [8] for training and testing layout analysis and
transcription models due to its support for right-to-left scripts and the curved baselines common in
manuscripts. As with other current HTR systems, Kraken first uses a segmentation model to
detect regions and lines in a page image; it then separately passes each extracted line image to
a transcription model to produce text output. The ACDC method described here could be easily
adapted to other line-oriented OCR systems.</p>
      <p>
        The experiments in this paper start with segmentation and transcription models trained
on annotations produced for Arabic and Persian printed books by the Open Islamicate Texts
Initiative [
        <xref ref-type="bibr" rid="ref14 ref9">14, 9</xref>
        ]. While the layout of books and manuscripts is of course very different, we keep
this print-trained segmentation model fixed for all experiments to focus on improvements in
text alignment and transcription models. (See §6 for further discussion of layout analysis.)
      </p>
      <p>We use character accuracy rate (CAR) to evaluate the effectiveness of transcription
models. This metric computes the (Levenshtein) edit distance between the reference
transcription and the model output and divides by the number of characters in the
reference. The resulting character error rate is then subtracted from one, i.e., CAR =
1 − d(reference, hypothesis)/#(reference chars). We remove Arabic short vowel marks and
merge variant forms of the letters kaf and yah in both the reference and hypothesis before
comparing them. In addition to CAR, we also measure the Arabic character accuracy rate by
removing spaces, punctuation, and other non-Arabic characters from the reference and
hypothesis before comparing them. As we discuss in §4, we ignore non-letter characters when aligning
noisy HTR output with digital editions. Arabic CAR is thus a helpful diagnostic for relating
transcription accuracy on these letters to the amount of training data the ACDC method is able
to extract. When summarizing these evaluation metrics across a test set, we take the average
of the CAR for each book. This “macro averaging” ensures that books with more transcribed
lines do not receive undue weight in the final evaluation.</p>
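      <p>As a concrete sketch (our own illustration, not the released evaluation code), book-level CAR and its macro average can be computed as follows:</p>

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over insertions,
    # deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def car(reference: str, hypothesis: str) -> float:
    # CAR = 1 - d(reference, hypothesis) / #(reference chars)
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)

def macro_car(books):
    # books: one (reference, hypothesis) pair per book; macro averaging
    # weights every book equally, regardless of how many lines it has.
    return sum(car(r, h) for r, h in books) / len(books)
```

      <p>In practice, both strings would first be normalized as described above (removing short vowel marks and merging variant letter forms).</p>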
      <p>We train transcription models with Kraken on pairs of manuscript line images and reference
transcriptions. As with similar line-level HTR systems, Kraken minimizes connectionist
temporal classification (CTC) loss [6] with respect to the weights of a convolutional plus recurrent
neural network. For supervised training, both the boundaries of the lines within the page image
and the transcriptions were produced manually as discussed above (Table 2). For ACDC
training with distant supervision, the boundaries of the lines were produced by the print-trained
segmentation model and the reference transcriptions were inferred by the collation process
(§4). We trained transcription models both from scratch, i.e., with random initialization of all
weights, and by fine-tuning the existing print-trained model. In our experiments, fine-tuning
always proved more effective on both validation and test data. For each training set, we
randomly hold out 10% of the lines as validation data to perform early stopping and model
selection. We use a constant learning rate of 10<sup>−4</sup>, as recommended by Kraken for manuscript training,
and perform early stopping when the best CAR on validation data has not improved for ten
iterations.</p>
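      <p>The training loop just described can be sketched as follows; <monospace>train_step</monospace> and <monospace>evaluate</monospace> are hypothetical callables standing in for Kraken’s actual training and validation routines, not its API:</p>

```python
import random

def train_with_early_stopping(train_lines, train_step, evaluate, patience=10):
    # Hold out a random 10% of lines for validation, as in our setup.
    random.seed(0)
    lines = list(train_lines)
    random.shuffle(lines)
    n_val = max(1, len(lines) // 10)
    val, train = lines[:n_val], lines[n_val:]

    model, best_model, best_car, stale = None, None, -1.0, 0
    while stale < patience:
        model = train_step(model, train)   # one iteration at a constant learning rate
        val_car = evaluate(model, val)
        if val_car > best_car:             # validation CAR improved: keep this model
            best_model, best_car, stale = model, val_car, 0
        else:
            stale += 1                     # stop after `patience` stale iterations
    return best_model, best_car
```

      <p>Model selection here simply returns the checkpoint with the best validation CAR seen before stopping.</p>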
      <p>The print-trained model we use as a starting point for our experiments achieves a
(macro-averaged) 60.5% CAR on the test set. As shown in Figure 2, this average combines clusters of
books with CAR in the high 60s and above and books with CAR in the mid 50s and below.
Fine-tuning this print model on the 6842 manually transcribed lines from our training set achieves
an average CAR of 76.4%. While the print model transcribed 35 test books with CAR less than
60%, the supervised model performs below 60% on only six test books. Note that even the supervised
model is trained with no overlap between the training books and test books. Its accuracy is
therefore below what we would expect from the common HTR paradigm in which training pages
are drawn from the same book as test pages.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Collating Noisy HTR with Digital Editions</title>
      <p>The ACDC method starts with the output from an initial HTR model—here, the print-trained
model. It then aligns this HTR output with a collection of reference texts to see if any parts
of the HTR output are sufficiently close to some passage in a reference text. In this section,
we describe the inference process for collating noisy HTR output with reference texts or the
HTR output on other manuscripts. We then analyze this collation output to select lines for
retraining HTR models.</p>
      <p>
        The proposed approach is another step in increasingly distant supervision for training
HTR. Kraken, like other HTR systems that are trained to minimize the CTC loss between a
reference transcription of a line and model predictions, already performs a character
alignment for each line [6]. This process enables us to forgo annotating each character’s position
on the page image and instead simply annotate a whole line with the desired sequence of
characters. Chammas, Mokbel, and Likforman-Sulem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed collecting reference transcriptions at the paragraph level and using the
best Levenshtein alignment with HTR output
to split this reference into lines. Coquenet, Chatelain, and Paquet [3] proposed collecting
full-page transcriptions and learning a page-level reading order. In this paper, we propose a
corpus-level approach to alignment: rather than deciding ahead of time which lines or paragraphs or
pages we should transcribe, we collect reference texts that we believe will overlap with a
significant number of manuscripts in our corpus (Table 2). When preparing input for ACDC, we do
not need to specify which reference texts correspond to which manuscripts—let alone to which
pages or lines.
      </p>
      <p>Figure 2: Distribution of CAR of the model trained on printed text compared to fully supervised training and to the first three iterations of ACDC training with distant supervision.</p>
      <p>This corpus-level approach, however, makes the alignment problem more difficult in two
ways. First, there is the search problem of matching lines in HTR output with passages in
arbitrarily long reference texts (§4.1). Second, we need to infer which line-level alignments are
of high enough quality to use as training data (§4.2).</p>
      <sec id="sec-4-1">
        <title>4.1. HMM Alignment Models</title>
        <p>
          Unlike previous approaches to distant supervision for HTR, we cannot use Levenshtein
alignment (i.e., Needleman–Wunsch) [2], since the page or other portion of a MS we happen to have
may not cover the whole text of the reference edition we are trying to align it to; moreover,
a given MS page may contain material, such as commentary or other notes, extraneous to the
main text (Figure 7). Previous work on HTR, by contrast, has employed “diplomatic”
transcriptions of those manuscripts selected for the training set. Unlike previous work on text-reuse
detection [
          <xref ref-type="bibr" rid="ref15 ref5">15, 5</xref>
          ], we do not use Smith–Waterman alignment due to the problem of differences
in reading order among different manuscripts and editions. Due to either differences in layout
or errors in layout analysis, two versions of a text might present the
same material in a different sequence.
        </p>
        <p>
          We propose, therefore, to use a more generalized finite-state approach to alignment based on
hidden Markov models (HMMs) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The observations are characters of (the HTR transcript
of) the manuscript we are trying to align to a digital edition or another manuscript, and the
hidden states are positions in these other witnesses. For any position in the target manuscript,
the hidden state is a “read head” that specifies what source we might be copying from. Unlike
Levenshtein or Smith–Waterman alignments, it is possible to move this read head backwards
or forwards an arbitrary distance in the source. That does not mean that all jumps in position
are equally likely, however.
        </p>
      <p>As in other HMMs, we need to specify a transition distribution that assigns probabilities to
shifts in position of the read head and an emission distribution that specifies what characters
we are likely to observe in the target text when reading from a given location in the source. To
compute the probability of the source position generating the ith target character given the
position that generated the (i−1)st, we consider that the source state can continue generating text
in its current position with probability λ, or it can move the read head anywhere in the source
text with probability 1 − λ. We then compute a probabilistic version of Levenshtein distance
with parameter p, the probability that a character will be copied unchanged from the source
to the target. The remaining 1 − p probability is divided uniformly among all other possible
edits, i.e., substitutions, insertions, and deletions. The probability that we will stop generating
target text is (1 − λ)/2. We also include a pruning parameter g, the length of the allowable
gap between target characters that are copied unchanged from the source. For the experiments
in this paper, we let p = 0.8, the average character accuracy rate in previous experiments on
Arabic-script HTR. We let λ = 0.998 and g = 600. Since this HMM is a generative model
of the target text, it is possible to reestimate λ and p from unlabeled data using expectation
maximization. We leave that for future work since, as we shall see, we are able to recover
sufficient high-quality aligned data at these parameter settings.</p>
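      <p>A minimal sketch of these two distributions, using λ for the continuation probability and p for the copy probability (our own labels and a simplified illustration, not the released implementation):</p>

```python
LAM = 0.998   # probability the read head continues from its current position
P_COPY = 0.8  # probability a character is copied unchanged (edit model)

def transition_prob(prev_pos, next_pos, source_len):
    # Continue with probability LAM; otherwise jump, with the remaining
    # 1 - LAM mass spread uniformly over all positions in the source.
    cont = LAM if next_pos == prev_pos + 1 else 0.0
    return cont + (1.0 - LAM) / source_len

def emission_prob(source_char, target_char, n_other_edits):
    # Copy unchanged with probability P_COPY; divide the remaining mass
    # uniformly among the other possible edits (substitutions, etc.).
    if source_char == target_char:
        return P_COPY
    return (1.0 - P_COPY) / n_other_edits
```

      <p>Here “continuing” is modeled as advancing one source position per emitted character; the full model also handles insertions, deletions, and the stopping probability described above.</p>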
        <p>
          As with other edit-distance computations, the time and space complexity of inference with
this HMM grows as the product of the lengths of the source and target texts. In common with
other approaches to text-reuse analysis, therefore, we prune the search space by constraining
the alignment at positions where we find sufficiently long matches between source and target.
Unlike other text-reuse approaches that tokenize the input into words [15,
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], possibly
lemmatizing or taking advantage of thesauri and other lexical resources [10], our alignment operates
at the character level. The lower character accuracy rate for Arabic-script HTR makes matching
even single words between two documents unreliable. Even more seriously, word-segmentation errors
are especially common in Arabic-script manuscripts: the space character is one of the most
commonly inserted or deleted characters in our experiments. Instead of word n-gram features for
pruning, therefore, we use subsequences of n characters in the alphanumeric Unicode class,
thus ignoring combining diacritics (e.g., short vowel marks in Arabic), whitespace, and
punctuation. In preliminary experiments measuring the match rate between HTR output and
digital editions (see §4.2) without access to manual manuscript transcriptions, we set n = 7
and required 5 such subsequences to match before aligning a source and target passage.
When performing a full collation of all manuscript pages against all other pages, this pruning
results in running the HMM alignment on only 2% of possible pairs of pages. Other
character-based methods instead apply alignment algorithms to all possible pairs, resulting in orders of
magnitude more computational cost in order to maximize recall [12].2
        </p>
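      <p>This pruning step can be sketched as follows (an illustration under the parameter settings above; the actual implementation in passim differs):</p>

```python
def alnum_ngrams(text, n=7):
    # Keep only characters in the alphanumeric Unicode class, dropping
    # combining diacritics, whitespace, and punctuation, then take all
    # character n-grams of the result.
    s = "".join(ch for ch in text if ch.isalnum())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def worth_aligning(page_a, page_b, n=7, min_shared=5):
    # Run the expensive HMM alignment only when the two pages share at
    # least `min_shared` character n-grams.
    shared = alnum_ngrams(page_a, n).intersection(alnum_ngrams(page_b, n))
    return len(shared) >= min_shared
```
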
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Scoring Candidate Lines for Training</title>
        <p>Once HTR transcripts have been aligned with a collection of digital editions or with the HTR of
other manuscripts, the output is organized with each manuscript treated as the “target”
text in turn. For each line of the target text, the alignment shows zero or more passages from
other texts as witnesses. In the fragment of JSON output in Figure 3b, for example, one line
of a Berlin manuscript is shown as the target text with one passage spanning two lines from a
digital edition of Fīrūzābādī’s al-Qāmūs al-muḥīṭ and another passage from a Leipzig manuscript
as witnesses. The digital edition matches the target Berlin manuscript perfectly, and so we can
with high confidence use this transcription, along with a line image extracted by the
print-trained segmentation model, as additional training data for HTR.</p>
        <p>Not all lines, of course, match perfectly; moreover, it seems likely that manuscripts with
mistakes in their transcription by the current HTR model might benefit more from additional
training data. Differences between texts, however, can arise for two different reasons. First, as
we saw above (§3), the output of the initial print-trained HTR model will match the diplomatic
transcription in our evaluation set only 60.5% of the time. Second, the manuscripts we are
transcribing with HTR, and the digital editions we are aligning to, may include variants included by
their writers or editors. In the Figure 3b example, we can see that the Leipzig manuscript omits
a word included in both the Berlin manuscript and the digital edition. It would therefore be
dangerous to use the digital edition as ground truth for the image of the Leipzig manuscript.</p>
        <p>To separate these two sources of variation, we analyze both the match rate (the proportion of
characters in the digital edition that are exactly copied in the HTR transcript) and the pattern of
gaps (insertions or deletions) in the alignment between them. Due to errors in the print-trained
segmentation model, many lines are not fully or correctly identified (§6). We therefore exclude
lines under five characters long (about one word, to exclude fragmentary lines) and those with
a gap at the initial or final position in the alignment. We analyze the remaining lines by their
match rate and their max gap, i.e., the length of the maximum number of contiguous insertions
or deletions. Figure 3a shows that lines with a max gap ≥ 4 mostly have a match rate below 50%.
A significant cluster of these lines with longer gaps still has a match rate above 50%, as with
the Leipzig MS example in Figure 3b. Perhaps surprisingly, lines with a max gap of zero, i.e.,
no insertions or deletions at all, tend to have a much lower match rate. Upon inspection, these
tend to be lines with low accuracy in between other lines with much better accuracy, where the
HMM found a higher-probability alignment by substituting a series of non-matching characters.
There are a small number, as in the Berlin MS example, with zero gaps and high match rate.
2Version 2 of passim (https://github.com/dasmiq/passim) implements this model as seriatim.</p>
        <p>Figure 3: (a) Match rate between HTR and edition. (b) Collation output: collating a MS (Staatsbibliothek zu Berlin, Glaser 33) with an edition and another MS (Leipzig Ms. Gabelentz 60).</p>
        <p>Finally, lines with a max gap between 1 and 3 inclusive mostly had a match rate of more than
50%. For our experiments, therefore, we selected those lines with a max gap less than 4 and
a match rate greater than 50%. In experiments with a fixed number of lines for training, we
selected them in descending order of match rate.</p>
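        <p>Putting these selection criteria together (a sketch over a simple pair-list representation of alignments, not the JSON format shown in Figure 3b):</p>

```python
def match_rate(pairs):
    # pairs: aligned (edition_char, htr_char) tuples; None marks a gap.
    copied = sum(1 for e, h in pairs if e is not None and e == h)
    length = sum(1 for e, h in pairs if e is not None)
    return copied / length if length else 0.0

def max_gap(pairs):
    # Length of the longest run of contiguous insertions or deletions.
    longest = run = 0
    for e, h in pairs:
        run = run + 1 if (e is None or h is None) else 0
        longest = max(longest, run)
    return longest

def select_lines(aligned_lines):
    # Keep lines at least five characters long, with no gap at either
    # boundary, max gap under 4, and match rate over 50%; order the
    # survivors by descending match rate.
    def ok(pairs):
        if len(pairs) < 5:
            return False
        if None in pairs[0] or None in pairs[-1]:
            return False
        return max_gap(pairs) < 4 and match_rate(pairs) > 0.5
    return sorted((p for p in aligned_lines if ok(p)), key=match_rate, reverse=True)
```
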
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments with ACDC Training</title>
      <p>We have now described the components of the ACDC method:
1. compiling a corpus of manuscript page images that we believe to have some overlap with
a collection of reference editions (§2);
2. running initial HTR segmentation and transcription models (§3) on this corpus;
3. aligning passages in this HTR output with passages in the reference texts (§4.1);
4. selecting manuscript lines with their page-image coordinate information and their
corresponding text from the reference editions (§4.2); and
5. retraining the HTR model on the selected lines.</p>
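      <p>Iterated, the steps above amount to a simple loop; the stage functions here are hypothetical callables standing in for the components of steps 2–5, not the released code’s interface:</p>

```python
def acdc(initial_model, manuscripts, editions, stages, iterations=3):
    # `stages` bundles steps 2-5 as callables (hypothetical signatures).
    transcribe, collate, select, finetune = stages
    model = initial_model
    for _ in range(iterations):
        output = transcribe(model, manuscripts)   # step 2: segment + transcribe
        alignments = collate(output, editions)    # step 3: align to references
        lines = select(alignments)                # step 4: pick high-quality lines
        model = finetune(model, lines)            # step 5: retrain the model
    return model
```
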
      <p>After executing these steps, we can iterate the process, returning to step 2 and using the
retrained HTR model to re-transcribe the training manuscripts. As noted above, this paper
focuses on HTR transcription and does not retrain the segmentation model. When training HTR
models, we can choose to train from scratch, i.e., from a random initialization of model
parameters, or to start training from an existing model. As noted in §3, the latter always led to
better validation and test accuracy in our experiments. Further search of the space of training
hyperparameters might lead to gains, but we did not pursue that investigation.</p>
      <p>Table 3 shows the average accuracy on the annotated lines of the training set and the test
set for the initial print-trained model, the fully supervised model, and the first three iterations
of using ACDC to retrain the HTR model without any access to transcribed manuscript data.
On average, three iterations of ACDC training improved over the initial model’s CAR by 19.6%
absolute, more than 3 percentage points above the performance of the supervised model.</p>
      <p>Figure 2 shows the distribution of CAR over different books in the test set for each of these
models. As noted above, the initial print-trained model exhibits two discernible clusters of
books that perform above and below 60% CAR. Both ACDC and supervised training greatly
reduce the number of poorly performing books and concentrate CAR more tightly at a higher
level.</p>
      <p>The range of accuracies achieved by the initial model across different books means that
not all books are equally well represented in the training data ACDC extracts on this first (or
later) iteration. As discussed in §4.2, we select lines with a short maximum gap length and a
match rate above 50%. In Figure 4a, we observe that higher Arabic CAR on training data for
books, particularly when above 50%, leads to higher yields, i.e., a higher proportion of a book’s
lines extracted for training. We compute CAR on Arabic characters alone, excluding spaces
and punctuation, because spaces and punctuation are also excluded when finding matching
passages during the alignment process. Note that these evaluations on training data are not
used during ACDC training or for model selection. The observations for the same book’s
accuracy under the initial print-trained model and the first ACDC-trained model are linked
by lines to show the direction of change. Figure 4b shows the next step in the process: higher
numbers of training lines extracted for a book unsurprisingly lead to higher Arabic CAR when
evaluated on that book.</p>
      <p>Figure 4: (a) Yield by Arabic CAR. (b) Book-level Arabic CAR by training lines.</p>
      <p>We also examine the variation in accuracy at the level of individual test books. In Figure 5a,
we see that almost all test books have higher accuracy after ACDC training than with the initial
print-trained model, i.e., their points are above the diagonal. Each point is coded with the first
letter of the book’s language. Books whose accuracy increased the most were in Arabic. This
is less surprising considering the results in Figure 4b, where much more training data was
extracted by ACDC for Arabic books. The exceptions to this consistent improvement were
one Persian book and two Arabic documentary texts with hands very different from the book
hands in the training set. Considering this result from another angle, Figure 5b shows that
ACDC achieved better results than supervised training for most books, with the exception of
the aforementioned documentary texts and a cluster of some of the Persian, Ottoman Turkish,
and mixed-language books.</p>
      <p>We ran an additional experiment to remove the difference in language coverage between
the supervised and ACDC training sets. Selecting the same number of lines from each
training book from among both the manually transcribed data and the lines extracted by the first
iteration of ACDC leaves us with 2786 training lines. The bottom of Table 3 shows that the
differences between the digital editions used by ACDC and the transcriptions produced manually
for these manuscripts still result in a difference in performance between these models even
when training data in exactly the same proportions is used, albeit smaller than the difference
between the full training runs. The remaining differences between the learning curves of these
training methods may be the result of ACDC training using the print-trained layout model to
identify line images, while supervised training uses manually corrected line images. Even if
ACDC could exactly recover the transcription of a line, a layout model’s cutting off some letters
in that line or erroneously including others would inhibit accurate training. Future work could
aim both to evaluate the effectiveness of training on line images with erroneous boundaries
and to bootstrap better layout models.</p>
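      <p>The balanced-coverage experiment above amounts to drawing the same number of lines per training book from each source. A toy sketch of such sampling (the function name and data layout are hypothetical, not from the paper’s code):</p>
```python
import random

def balanced_sample(lines_by_book, per_book, seed=0):
    """Draw up to `per_book` lines from every book, so that coverage
    rather than data volume drives any difference between models.

    Hypothetical illustration: `lines_by_book` maps book id -> list of lines.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for book, lines in sorted(lines_by_book.items()):
        k = min(per_book, len(lines))
        sample.extend(rng.sample(lines, k))
    return sample
```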
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The experiments in §5 show that Automatic Collation for Diversifying Corpora (ACDC) is a
promising approach to improving HTR systems on diverse manuscript collections without
additional annotated data. All that is required is that the manuscript collection have a sufficient
number of widely-copied texts so that we can align their noisy HTR transcripts with clean
digital editions. This may not be the case for many documentary archives with unique manuscript
letters, for instance. Some archives of official documents, however, may include enough
duplicated material for ACDC to work. We do not expect distant supervision to replace supervised
training in the near future for projects where a researcher can identify ahead of time the
documents or hands of interest and curate a training set for them. In any case, we reiterate
that ACDC does not assume that the test set will have manuscripts that overlap with existing
digital editions.</p>
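      <p>The core requirement named above, aligning noisy HTR output against a clean digital edition and keeping only well-matched lines, can be sketched with standard sequence matching. This is a deliberate simplification of ACDC’s collation pipeline; the function name and the 0.8 similarity threshold are our assumptions:</p>
```python
import difflib

def select_training_lines(htr_lines, edition_text, min_ratio=0.8):
    """Toy stand-in for a collation step: align each noisy HTR line
    against a digital edition and keep (htr_line, edition_span) pairs
    whose similarity clears a threshold, so the clean edition text can
    serve as a distant-supervision label for the line image."""
    selected = []
    for line in htr_lines:
        matcher = difflib.SequenceMatcher(None, edition_text, line, autojunk=False)
        match = matcher.find_longest_match(0, len(edition_text), 0, len(line))
        # Anchor on the longest shared substring, then score the whole line.
        start = max(0, match.a - match.b)
        span = edition_text[start:start + len(line)]
        ratio = difflib.SequenceMatcher(None, span, line).ratio()
        if ratio >= min_ratio:
            selected.append((line, span))  # train on the edition text, not the HTR
    return selected
```
In a real pipeline the edition side would be chunked and indexed before alignment; the point here is only that noisy lines with no close edition match are discarded rather than trained on.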
      <p>We also note that the impressive gains shown by ACDC were made despite working with
a page segmentation model trained on printed texts. This model can often fail spectacularly
(Figure 7). Even on that page, with a large amount of unrecognized marginalia, ACDC was
able to extract one line. The majority of the training data extracted by ACDC from the training
manuscripts was from pages where the print layout model worked surprisingly well (Figure 6),
despite some errors with features like rubrication or words written larger than others on the
same line. We are hopeful, therefore, that a similar distant supervision approach can be
employed to improve segmentation models by identifying and perhaps normalizing outputs on
pages like these.</p>
      <p>As the number of manuscripts with digitized page images grows, we expect that
broad-coverage methods like ACDC will complement task-specific training sets. Beyond training
HTR, we also expect that the collation methods developed here will be useful in producing
multi-text editions (Figure 3b), as well as using evidence from multiple manuscripts to model
the text-transmission process.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank their collaborators on the Open Islamicate Texts Initiative—
in particular, John Mullan, Lorenz Nigst, and Alejandro Toselli—for help annotating data and
training the print models. This work was supported in part by a National Endowment for
the Humanities Digital Humanities Advancement Grant (HAA-277203-21) and the Andrew W.
Mellon Foundation’s Scholarly Communications and Information Technology program. Any
views, findings, conclusions, or recommendations expressed do not necessarily reflect those of
the NEH or Mellon.</p>
    </sec>
  </body>
  <back>
  </back>
</article>