<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David A. Smith</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacob Murel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Parkes Allen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Thomas Miller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Khoury College of Computer Sciences, Northeastern University</institution>
          ,
          <addr-line>Boston MA</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Roshan Institute for Persian Studies, University of Maryland</institution>
          ,
          <addr-line>College Park MD</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
      </contrib-group>
      <fpage>206</fpage>
      <lpage>221</lpage>
      <abstract>
        <p>Handwritten text recognition (HTR) has enabled many researchers to gather textual evidence from the human record. One common training paradigm for HTR is to identify an individual manuscript or coherent collection and to transcribe enough data to achieve acceptable performance on that collection. To build generalized models for Arabic-script manuscripts, perhaps one of the largest textual traditions in the pre-modern world, we need an approach that can improve its accuracy on unseen manuscripts and hands without linear growth in the amount of manually annotated data. We propose Automatic Collation for Diversifying Corpora (ACDC), which takes advantage of the existence of multiple manuscripts of popular texts. Starting from an initial HTR model, ACDC automatically detects matching passages of popular texts in noisy HTR output and selects high-quality lines for retraining HTR without any manually annotated data. We demonstrate the effectiveness of this approach to distant supervision by annotating a test set drawn from a diverse collection of 59 Arabic-script manuscripts and a training set of 81 manuscripts of popular texts embedded within a larger corpus. After a few rounds of ACDC retraining, character accuracy rates on the test set increased by 19.6% absolute, while a supervised model trained on manually annotated data from the same collection increased accuracy by 15.9%. We analyze the variation in ACDC’s performance across books and languages and discuss further applications to collating manuscript families.</p>
      </abstract>
      <kwd-group>
        <kwd>handwritten text recognition</kwd>
        <kwd>collation</kwd>
        <kwd>manuscripts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Within the past decade, widely available handwritten text recognition (HTR) tools have
enabled many disciplines to investigate a wide range of handwritten documentary sources from
the human record [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most of the current generation of HTR systems are trained at the line
level to optimize connectionist temporal classification (CTC) loss [6]. This frees users from
having to annotate individual words or characters to produce training data; instead, they simply
transcribe the plain text of each line. Creating training data nevertheless remains a bottleneck for HTR.
Nockels, Gooding, Ames, and Terras [11] describe the community around the Transkribus HTR
system as “a ‘bottom-up’ mass digitization movement, made up of hundreds of simultaneous
projects driven by motivated researchers” creating their own training data and models. Put
another way, an important use case for HTR is to help those “motivated researchers,” who know
which documents they wish to transcribe, annotate enough data so that they can train a model
to transcribe the rest.
      </p>
      <p>We contend that there is room for a complementary training paradigm for HTR. In some
projects, we are confronted with large collections of documents in a diversity of languages,
hands, genres, and time periods. If we do not know a priori which documents will be interesting,
it may be hard to allocate efficiently the time it takes to produce HTR training data for the
whole collection. Instead, we propose a distant supervision approach to training HTR that takes
advantage of the structure of many larger collections. Automatic Collation for Diversifying
Corpora (ACDC) starts by having users assemble digital editions of texts they believe will be
widely copied in the collection. Starting from an initial, imperfect model, ACDC proceeds by:
1. running initial HTR segmentation and transcription models on a diverse manuscript
collection (§3);
2. aligning passages in this HTR output with passages in the reference digital editions (§4.1);
3. selecting manuscript lines with their page-image coordinate information and their
corresponding text from the reference editions (§4.2); and
4. retraining the HTR model on the selected lines.</p>
      <p>Once a new HTR model is trained, it can be used to re-transcribe the manuscript collection and
run this process again (Figure 1).</p>
      <p>Distant supervision has been employed at the paragraph level in HTR [2] and, using
state-of-the-art vision transformers, for training joint segmentation and transcription models [3].
These systems, however, still assume that we have a “diplomatic” ground-truth transcription
of a particular manuscript paragraph or page. ACDC instead infers which matching passages
between a noisy HTR transcript and a reference digital edition are close enough to use for
training and which might contain variant readings.</p>
      <p>In this paper, we demonstrate ACDC’s effectiveness by applying it to a diverse collection
of Arabic-script manuscripts (§2). We annotate a test set drawn from a diverse collection of
59 Arabic-script manuscripts and a training set of 81 manuscripts of popular texts embedded
within a larger corpus. After a few rounds of ACDC retraining, character accuracy rates on the
test set increased by 19.6% absolute, while a supervised model trained on manually
annotated data from the same collection increased accuracy by 15.9% (§5). We analyze the
variation in ACDC’s performance across books and languages and discuss further applications
to collating manuscript families (§6). We have released our code, annotated data, and trained
models under open-source licenses.1
1See repositories of code (https://github.com/OpenITI/acdc_train), test data (https://github.com/OpenITI/aocp_ms_eval), annotated lines from the training set to compare to unsupervised ACDC (https://github.com/OpenITI/arabic_ms_data), and trained models and evaluation data (https://github.com/OpenITI/acdc_results).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Arabic-script Manuscripts</title>
      <p>
        The Islamicate written traditions together form one of the largest—if not the largest—archives
of human cultural production of the pre-modern world. Primarily written in Arabic and
Persian, and also encompassing Ottoman Turkish, Urdu, and other languages, these textual
traditions stretch through more than twelve hundred years of history and extend from Iberia and
North Africa in the west to the upper reaches of the Volga in the north, Sub-Saharan Africa in the
south, and China, the Indian subcontinent, the Philippines, and Indonesia in the east. The
exact number of extant manuscript volumes has yet to be determined with precision, but rough
estimates suggest at least several million volumes exist today in collections that range in
diversity from West African madrasas and the Turkish state archives to European and American
museums and libraries [1, pp. 34–35]. We know from modern printed editions of pre-modern
texts in these languages that the total number of discrete texts certainly eclipses the Latinate
and European vernacular traditions—perhaps rivaled only by pre-modern Chinese cultural
output. These estimates, however, are based only on modern print production, which is but the
tip of the proverbial iceberg. One prominent scholar, Carl W. Ernst, estimates that only 5–10
percent of the Persian and Arabic written tradition has been published in any print format—a
number broadly consonant with the findings of Maxim Romanov as well [4,
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Whatever the
exact numbers, it is certain that a significant portion of the Persian and Arabic written
tradition remains exclusively in manuscript form in thousands of libraries and archives across the
world. The sheer number of these manuscripts makes it difficult for scholars, librarians, and
students to focus their energies on more than the most important samples in each collection.
      </p>
      <p>To evaluate HTR models on a wide range of script types, languages, and time periods, we
collected sets of public-domain digital images digitized by 17 libraries (Table 1). From the
resulting 59 manuscripts, dating from 900 to 1869 CE, we transcribed 1704 lines. An average
of 29 lines per book is not usually enough to train a book-specific model with acceptably high
accuracy. Unlike some other evaluations [7] on Arabic-script manuscripts, however, this paper
focuses on the use case where the training set and test sets come from different manuscripts
and different hands.</p>
      <p>To test the ACDC method’s effectiveness at producing HTR training data, we selected five
texts, three in Arabic and two in Persian, for which we could find digital transcriptions and a
reasonable number of digital editions (Table 2). We downloaded 81 sets of page images of these
five texts. To perform error analysis and to be able to compare ACDC to supervised training,
we transcribed 6842 lines in total from these 81 manuscripts. None of these manual
transcriptions were used for training ACDC. During training, we also added 50 additional “distractor”
manuscripts to evaluate the alignment process. None of these 131 manuscripts overlap with
the 59 manuscripts used for testing. Furthermore, none of the 59 test manuscripts are copies
of the five widely copied works we use for training.</p>
      <p>All of the manuscript images used here have been released by the libraries digitizing them
into the public domain. We release the layout analysis and transcribed lines under an
open-source license.</p>
    </sec>
    <sec id="sec-3">
      <title>3. HTR Training and Testing</title>
      <p>We employ the Kraken HTR system [8] for training and testing layout analysis and
transcription models due to its support for right-to-left scripts and the curved baselines common in
manuscripts. As with other current HTR systems, Kraken first uses a segmentation model to
detect regions and lines in a page image; it then separately passes each extracted line image to
a transcription model to produce text output. The ACDC method described here could be easily
adapted to other line-oriented OCR systems.</p>
      <p>
        The experiments in this paper start with segmentation and transcription models trained
on annotations produced for Arabic and Persian printed books by the Open Islamicate Texts
Initiative [
        <xref ref-type="bibr" rid="ref14 ref9">14, 9</xref>
        ]. While the layout of books and manuscripts is of course very different, we keep
this print-trained segmentation model fixed for all experiments to focus on improvements in
text alignment and transcription models. (See §6 for further discussion of layout analysis.)
      </p>
      <p>We use character accuracy rate (CAR) to evaluate the effectiveness of transcription
models. This metric computes the (Levenshtein) edit distance between the reference
transcription and the model output and divides by the number of characters in the
reference. The resulting character error rate is then subtracted from one, i.e., CAR =
1 − d(reference, hypothesis)/#(reference chars). We remove Arabic short vowel marks and
merge variant forms of the letters kaf and yah in both the reference and hypothesis before
comparing them. In addition to CAR, we also measure the Arabic character accuracy rate by
removing spaces, punctuation, and other non-Arabic characters from the reference and
hypothesis before comparing them. As we discuss in §4, we ignore non-letter characters when aligning
noisy HTR output with digital editions. Arabic CAR is thus a helpful diagnostic for relating
transcription accuracy on these letters to the amount of training data the ACDC method is able
to extract. When summarizing these evaluation metrics across a test set, we take the average
of the CAR for each book. This “macro averaging” ensures that books with more transcribed
lines do not receive undue weight in the final evaluation.</p>
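      <p>As a concrete sketch (our own illustration, not the released evaluation code), book-level CAR and its macro average can be computed as follows:</p>

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over insertions,
    # deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def car(reference: str, hypothesis: str) -> float:
    # CAR = 1 - d(reference, hypothesis) / #(reference chars)
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)

def macro_car(books):
    # books: one (reference, hypothesis) pair per book; macro averaging
    # weights every book equally, regardless of how many lines it has.
    return sum(car(r, h) for r, h in books) / len(books)
```

      <p>In practice, both strings would first be normalized as described above (removing short vowel marks and merging variant letter forms).</p>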
      <p>We train transcription models with Kraken on pairs of manuscript line images and reference
transcriptions. As with similar line-level HTR systems, Kraken minimizes connectionist
temporal classification (CTC) loss [6] with respect to the weights of a convolutional plus recurrent
neural network. For supervised training, both the boundaries of the lines within the page image
and the transcriptions were produced manually as discussed above (Table 2). For ACDC
training with distant supervision, the boundaries of the lines were produced by the print-trained
segmentation model and the reference transcriptions were inferred by the collation process
(§4). We trained transcription models both from scratch, i.e., with random initialization of all
weights, and by fine-tuning the existing print-trained model. In our experiments, fine-tuning
always proved more effective on both validation and test data. For each training set, we
randomly hold out 10% of the lines as validation data to perform early stopping and model
selection. We use a constant learning rate of 10<sup>−4</sup>, as recommended by Kraken for manuscript training,
and perform early stopping when the best CAR on validation data has not improved for ten
iterations.</p>
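      <p>The training loop just described can be sketched as follows; <monospace>train_step</monospace> and <monospace>evaluate</monospace> are hypothetical callables standing in for Kraken’s actual training and validation routines, not its API:</p>

```python
import random

def train_with_early_stopping(train_lines, train_step, evaluate, patience=10):
    # Hold out a random 10% of lines for validation, as in our setup.
    random.seed(0)
    lines = list(train_lines)
    random.shuffle(lines)
    n_val = max(1, len(lines) // 10)
    val, train = lines[:n_val], lines[n_val:]

    model, best_model, best_car, stale = None, None, -1.0, 0
    while stale < patience:
        model = train_step(model, train)   # one iteration at a constant learning rate
        val_car = evaluate(model, val)
        if val_car > best_car:             # validation CAR improved: keep this model
            best_model, best_car, stale = model, val_car, 0
        else:
            stale += 1                     # stop after `patience` stale iterations
    return best_model, best_car
```

      <p>Model selection here simply returns the checkpoint with the best validation CAR seen before stopping.</p>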
      <p>The print-trained model we use as a starting point for our experiments achieves a
(macro-averaged) 60.5% CAR on the test set. As shown in Figure 2, this average combines clusters of
books with CAR in the high 60s and above and books with CAR in the mid 50s and below.
Fine-tuning this print model on the 6842 manually transcribed lines from our training set achieves
an average CAR of 76.4%. While the print model transcribed 35 test books with CAR less than
60%, the supervised model performs below 60% on only six test books. Note that even the supervised
model is trained with no overlap between the training books and test books. Its accuracy is
therefore below what we would expect from the common HTR paradigm in which training pages
are drawn from the same book as test pages.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Collating Noisy HTR with Digital Editions</title>
      <p>The ACDC method starts with the output from an initial HTR model—here, the print-trained
model. It then aligns this HTR output with a collection of reference texts to see if any parts
of the HTR output are sufficiently close to some passage in a reference text. In this section,
we describe the inference process for collating noisy HTR output with reference texts or the
HTR output on other manuscripts. We then analyze this collation output to select lines for
retraining HTR models.</p>
      <p>
        The proposed approach is another step in increasingly distant supervision for training
HTR. Kraken, like other HTR systems that are trained to minimize the CTC loss between a
reference transcription of a line and model predictions, already performs a character
alignment for each line [6]. This process enables us to forgo annotating each character’s position
on the page image and instead simply annotate a whole line with the desired sequence of
characters. Chammas, Mokbel, and Likforman-Sulem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed collecting reference transcriptions at the paragraph level and using the
best Levenshtein alignment with HTR output
to split this reference into lines. Coquenet, Chatelain, and Paquet [3] proposed collecting
full-page transcriptions and learning a page-level reading order. In this paper, we propose a
corpus-level approach to alignment: rather than deciding ahead of time which lines or paragraphs or
pages we should transcribe, we collect reference texts that we believe will overlap with a
significant number of manuscripts in our corpus (Table 2). When preparing input for ACDC, we do
not need to specify which reference texts correspond to which manuscripts—let alone to which
pages or lines.
      </p>
      <p>Figure 2: Distribution of CAR of the model trained on printed text compared to fully supervised training and to the first three iterations of ACDC training with distant supervision.</p>
      <p>This corpus-level approach, however, makes the alignment problem more difficult in two
ways. First, there is the search problem of matching lines in HTR output with passages in
arbitrarily long reference texts (§4.1). Second, we need to infer which line-level alignments are
of high enough quality to use as training data (§4.2).</p>
      <sec id="sec-4-1">
        <title>4.1. HMM Alignment Models</title>
        <p>
          Unlike previous approaches to distant supervision for HTR, we cannot use Levenshtein
alignment (i.e., Needleman–Wunsch) [2], since the page or other portion of a MS we happen to have
may not cover the whole text of the reference edition we are trying to align it to; moreover,
a given MS page may contain material, such as commentary or other notes, extraneous to the
main text (Figure 7). Previous work on HTR, by contrast, has employed “diplomatic”
transcriptions of those manuscripts selected for the training set. Unlike previous work on text-reuse
detection [
          <xref ref-type="bibr" rid="ref15 ref5">15, 5</xref>
          ], we do not use Smith–Waterman alignment due to the problem of differences
in reading order among different manuscripts and editions. Due to either differences in layout
or errors in layout analysis, two versions of a text might present the
same material in a different sequence.
        </p>
        <p>
          We propose, therefore, to use a more generalized finite-state approach to alignment based on
hidden Markov models (HMMs) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The observations are characters of (the HTR transcript
of) the manuscript we are trying to align to a digital edition or another manuscript, and the
hidden states are positions in these other witnesses. For any position in the target manuscript,
the hidden state is a “read head” that specifies what source we might be copying from. Unlike
Levenshtein or Smith–Waterman alignments, it is possible to move this read head backwards
or forwards an arbitrary distance in the source. That does not mean that all jumps in position
are equally likely, however.
        </p>
      <p>As in other HMMs, we need to specify a transition distribution that assigns probabilities to
shifts in position of the read head and an emission distribution that specifies what characters
we are likely to observe in the target text when reading from a given location in the source. To
compute the probability of the source position generating the ith target character given the
position that generated the (i−1)st, we consider that the source state can continue generating text
in its current position with probability λ, or it can move the read head anywhere in the source
text with probability 1 − λ. We then compute a probabilistic version of Levenshtein distance
with parameter p, the probability that a character will be copied unchanged from the source
to the target. The remaining 1 − p probability is divided uniformly among all other possible
edits, i.e., substitutions, insertions, and deletions. The probability that we will stop generating
target text is (1 − λ)/2. We also include a pruning parameter g, the length of the allowable
gap between target characters that are copied unchanged from the source. For the experiments
in this paper, we let p = 0.8, the average character accuracy rate in previous experiments on
Arabic-script HTR. We let λ = 0.998 and g = 600. Since this HMM is a generative model
of the target text, it is possible to reestimate λ and p from unlabeled data using expectation
maximization. We leave that for future work since, as we shall see, we are able to recover
sufficient high-quality aligned data at these parameter settings.</p>
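      <p>A minimal sketch of these two distributions, using λ for the continuation probability and p for the copy probability (our own labels and a simplified illustration, not the released implementation):</p>

```python
LAM = 0.998   # probability the read head continues from its current position
P_COPY = 0.8  # probability a character is copied unchanged (edit model)

def transition_prob(prev_pos, next_pos, source_len):
    # Continue with probability LAM; otherwise jump, with the remaining
    # 1 - LAM mass spread uniformly over all positions in the source.
    cont = LAM if next_pos == prev_pos + 1 else 0.0
    return cont + (1.0 - LAM) / source_len

def emission_prob(source_char, target_char, n_other_edits):
    # Copy unchanged with probability P_COPY; divide the remaining mass
    # uniformly among the other possible edits (substitutions, etc.).
    if source_char == target_char:
        return P_COPY
    return (1.0 - P_COPY) / n_other_edits
```

      <p>Here “continuing” is modeled as advancing one source position per emitted character; the full model also handles insertions, deletions, and the stopping probability described above.</p>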
        <p>
          As with other edit-distance computations, the time and space complexity of inference with
this HMM grows as the product of the lengths of the source and target texts. In common with
other approaches to text-reuse analysis, therefore, we prune the search space by constraining
the alignment at positions where we find sufficiently long matches between source and target.
Unlike other text-reuse approaches that tokenize the input into words [15,
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], possibly
lemmatizing or taking advantage of thesauri and other lexical resources [10], our alignment operates
at the character level. The lower character accuracy rate for Arabic-script HTR makes matching
even single words between two documents unreliable. Even more seriously, word-segmentation errors
are especially common in Arabic-script manuscripts: the space character is one of the most
commonly inserted or deleted characters in our experiments. Instead of word n-gram features for
pruning, therefore, we use subsequences of n characters in the alphanumeric Unicode class,
thus ignoring combining diacritics (e.g., short vowel marks in Arabic), whitespace, and
punctuation. In preliminary experiments measuring the match rate between HTR output and
digital editions (see §4.2) without access to manual manuscript transcriptions, we set n = 7
and required 5 such subsequences to match before aligning a source and target passage.
When performing a full collation of all manuscript pages against all other pages, this pruning
results in running the HMM alignment on only 2% of possible pairs of pages. Other
character-based methods instead apply alignment algorithms to all possible pairs, resulting in orders of
magnitude more computational cost in order to maximize recall [12].2
        </p>
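      <p>This pruning step can be sketched as follows (an illustration under the parameter settings above; the actual implementation in passim differs):</p>

```python
def alnum_ngrams(text, n=7):
    # Keep only characters in the alphanumeric Unicode class, dropping
    # combining diacritics, whitespace, and punctuation, then take all
    # character n-grams of the result.
    s = "".join(ch for ch in text if ch.isalnum())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def worth_aligning(page_a, page_b, n=7, min_shared=5):
    # Run the expensive HMM alignment only when the two pages share at
    # least `min_shared` character n-grams.
    shared = alnum_ngrams(page_a, n).intersection(alnum_ngrams(page_b, n))
    return len(shared) >= min_shared
```
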
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Scoring Candidate Lines for Training</title>
        <p>Once HTR transcripts have been aligned with a collection of digital editions or with the HTR of
other manuscripts, the output is organized with each manuscript treated as the “target”
text in turn. For each line of the target text, the alignment shows zero or more passages from
other texts as witnesses. In the fragment of JSON output in Figure 3b, for example, one line
of a Berlin manuscript is shown as the target text with one passage spanning two lines from a
digital edition of Fīrūzābādī’s al-Qāmūs al-muḥīṭ and another passage from a Leipzig manuscript
as witnesses. The digital edition matches the target Berlin manuscript perfectly, and so we can
with high confidence use this transcription, along with a line image extracted by the
print-trained segmentation model, as additional training data for HTR.</p>
        <p>Not all lines, of course, match perfectly; moreover, it seems likely that manuscripts with
mistakes in their transcription by the current HTR model might benefit more from additional
training data. Differences between texts, however, can arise for two different reasons. First, as
we saw above (§3), the output of the initial print-trained HTR model will match the diplomatic
transcription in our evaluation set only 60.5% of the time. Second, the manuscripts we are
transcribing with HTR, and the digital editions we are aligning to, may include variants included by
their writers or editors. In the Figure 3b example, we can see that the Leipzig manuscript omits
a word included in both the Berlin manuscript and the digital edition. It would therefore be
dangerous to use the digital edition as ground truth for the image of the Leipzig manuscript.</p>
        <p>To separate these two sources of variation, we analyze both the match rate (the proportion of
characters in the digital edition that are exactly copied in the HTR transcript) and the pattern of
gaps (insertions or deletions) in the alignment between them. Due to errors in the print-trained
segmentation model, many lines are not fully or correctly identified (§6). We therefore exclude
lines under five characters long (about one word, to exclude fragmentary lines) and those with
a gap at the initial or final position in the alignment. We analyze the remaining lines by their
match rate and their max gap, i.e., the length of the maximum number of contiguous insertions
or deletions. Figure 3a shows that lines with a max gap ≥ 4 mostly have a match rate below 50%.
A significant cluster of these lines with longer gaps still has a match rate above 50%, as with
the Leipzig MS example in Figure 3b. Perhaps surprisingly, lines with a max gap of zero, i.e.,
no insertions or deletions at all, tend to have a much lower match rate. Upon inspection, these
tend to be lines with low accuracy in between other lines with much better accuracy, where the
HMM found a higher-probability alignment by substituting a series of non-matching characters.
There are a small number, as in the Berlin MS example, with zero gaps and high match rate.
2Version 2 of passim (https://github.com/dasmiq/passim) implements this model as seriatim.</p>
        <p>Figure 3: (a) Match rate between HTR and edition. (b) Collation output: collating a MS (Staatsbibliothek zu Berlin, Glaser 33) with an edition and another MS (Leipzig Ms. Gabelentz 60).</p>
        <p>Finally, lines with a max gap between 1 and 3 inclusive mostly had a match rate of more than
50%. For our experiments, therefore, we selected those lines with a max gap less than 4 and
a match rate greater than 50%. In experiments with a fixed number of lines for training, we
selected them in descending order of match rate.</p>
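        <p>Putting these selection criteria together (a sketch over a simple pair-list representation of alignments, not the JSON format shown in Figure 3b):</p>

```python
def match_rate(pairs):
    # pairs: aligned (edition_char, htr_char) tuples; None marks a gap.
    copied = sum(1 for e, h in pairs if e is not None and e == h)
    length = sum(1 for e, h in pairs if e is not None)
    return copied / length if length else 0.0

def max_gap(pairs):
    # Length of the longest run of contiguous insertions or deletions.
    longest = run = 0
    for e, h in pairs:
        run = run + 1 if (e is None or h is None) else 0
        longest = max(longest, run)
    return longest

def select_lines(aligned_lines):
    # Keep lines at least five characters long, with no gap at either
    # boundary, max gap under 4, and match rate over 50%; order the
    # survivors by descending match rate.
    def ok(pairs):
        if len(pairs) < 5:
            return False
        if None in pairs[0] or None in pairs[-1]:
            return False
        return max_gap(pairs) < 4 and match_rate(pairs) > 0.5
    return sorted((p for p in aligned_lines if ok(p)), key=match_rate, reverse=True)
```
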
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments with ACDC Training</title>
      <p>We have now described the components of the ACDC method:
1. compiling a corpus of manuscript page images that we believe to have some overlap with
a collection of reference editions (§2);
2. running initial HTR segmentation and transcription models (§3) on this corpus;
3. aligning passages in this HTR output with passages in the reference texts (§4.1);
4. selecting manuscript lines with their page-image coordinate information and their
corresponding text from the reference editions (§4.2); and
5. retraining the HTR model on the selected lines.</p>
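      <p>Iterated, the steps above amount to a simple loop; the stage functions here are hypothetical callables standing in for the components of steps 2–5, not the released code’s interface:</p>

```python
def acdc(initial_model, manuscripts, editions, stages, iterations=3):
    # `stages` bundles steps 2-5 as callables (hypothetical signatures).
    transcribe, collate, select, finetune = stages
    model = initial_model
    for _ in range(iterations):
        output = transcribe(model, manuscripts)   # step 2: segment + transcribe
        alignments = collate(output, editions)    # step 3: align to references
        lines = select(alignments)                # step 4: pick high-quality lines
        model = finetune(model, lines)            # step 5: retrain the model
    return model
```
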
      <p>After executing these steps, we can iterate the process, returning to step 2 and using the
retrained HTR model to re-transcribe the training manuscripts. As noted above, this paper
focuses on HTR transcription and does not retrain the segmentation model. When training HTR
models, we can choose to train from scratch, i.e., from a random initialization of model
parameters, or to start training from an existing model. As noted in §3, the latter always led to
better validation and test accuracy in our experiments. Further search of the space of training
hyperparameters might lead to gains, but we did not pursue that investigation.</p>
      <p>Table 3 shows the average accuracy on the annotated lines of the training set and the test
set for the initial print-trained model, the fully supervised model, and the first three iterations
of using ACDC to retrain the HTR model without any access to transcribed manuscript data.
On average, three iterations of ACDC training improved over the initial model’s CAR by 19.6%
absolute, more than 3 percentage points above the performance of the supervised model.</p>
      <p>Figure 2 shows the distribution of CAR over different books in the test set for each of these
models. As noted above, the initial print-trained model exhibits two discernible clusters of
books that perform above and below 60% CAR. Both ACDC and supervised training greatly
reduce the number of poorly performing books and concentrate CAR more tightly at a higher
level.</p>
      <p>The range of accuracies achieved by the initial model across different books means that
not all books are equally well represented in the training data ACDC extracts on this first (or
later) iteration. As discussed in §4.2, we select lines with a short maximum gap length and a
match rate above 50%. In Figure 4a, we observe that higher Arabic CAR on training data for
books, particularly when above 50%, leads to higher yields, i.e., a higher proportion of a book’s
lines extracted for training. We compute CAR on Arabic characters alone, excluding spaces
and punctuation, because spaces and punctuation are also excluded when finding matching
passages during the alignment process. Note that these evaluations on training data are not
used during ACDC training or for model selection. The observations for the same book’s
accuracy under the initial print-trained model and the first ACDC-trained model are linked
by lines to show the direction of change. Figure 4b shows the next step in the process: higher
numbers of training lines extracted for a book unsurprisingly lead to higher Arabic CAR when
evaluated on that book.</p>
      <p>Figure 4: (a) Yield by Arabic CAR. (b) Book-level Arabic CAR by training lines.</p>
      <p>We also examine the variation in accuracy at the level of individual test books. In Figure 5a,
we see that almost all test books have higher accuracy after ACDC training than with the initial
print-trained model, i.e., their points are above the diagonal. Each point is coded with the first
letter of the book’s language. Books whose accuracy increased the most were in Arabic. This
is less surprising considering the results in Figure 4b, where much more training data was
extracted by ACDC for Arabic books. The exceptions to this consistent improvement were
one Persian book and two Arabic documentary texts with hands very different from the book
hands in the training set. Considering this result from another angle, Figure 5b shows that
ACDC achieved better results than supervised training for most books, with the exception of
the aforementioned documentary texts and a cluster of some of the Persian, Ottoman Turkish,
and mixed-language books.</p>
      <p>We ran an additional experiment to remove the difference in language coverage between
the supervised and ACDC training sets. Selecting the same number of lines from each
training book from among both the manually transcribed data and the lines extracted by the first
iteration of ACDC leaves us with 2786 training lines. The bottom of Table 3 shows that the
differences between the digital editions used by ACDC and the transcriptions produced manually
for these manuscripts still result in a difference in performance between these models even
when training data in exactly the same proportions is used, albeit smaller than the difference
between the full training runs. The remaining differences between the learning curves of these
training methods may be the result of ACDC training using the print-trained layout model to
identify line images, while supervised training uses manually corrected line images. Even if
ACDC could exactly recover the transcription of a line, a layout model’s cutting off some letters
in that line or erroneously including others would inhibit accurate training. Future work could
aim both to evaluate the effectiveness of training on line images with erroneous boundaries
and to bootstrap better layout models.</p>
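      <p>The balanced-coverage experiment above amounts to drawing the same number of lines per training book from each source. A toy sketch of such sampling (the function name and data layout are hypothetical, not from the paper’s code):</p>
```python
import random

def balanced_sample(lines_by_book, per_book, seed=0):
    """Draw up to `per_book` lines from every book, so that coverage
    rather than data volume drives any difference between models.

    Hypothetical illustration: `lines_by_book` maps book id -> list of lines.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for book, lines in sorted(lines_by_book.items()):
        k = min(per_book, len(lines))
        sample.extend(rng.sample(lines, k))
    return sample
```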
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The experiments in §5 show that Automatic Collation for Diversifying Corpora (ACDC) is a
promising approach to improving HTR systems on diverse manuscript collections without
additional annotated data. All that is required is that the manuscript collection have a sufficient
number of widely-copied texts so that we can align their noisy HTR transcripts with clean
digital editions. This may not be the case for many documentary archives with unique manuscript
letters, for instance. Some archives of official documents, however, may include enough
duplicated material for ACDC to work. We do not expect distant supervision to replace supervised
training in the near future for projects where a researcher can identify ahead of time the
documents or hands of interest and curate a training set for them. In any case, we reiterate
that ACDC does not assume that the test set will have manuscripts that overlap with existing
digital editions.</p>
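      <p>The core requirement named above, aligning noisy HTR output against a clean digital edition and keeping only well-matched lines, can be sketched with standard sequence matching. This is a deliberate simplification of ACDC’s collation pipeline; the function name and the 0.8 similarity threshold are our assumptions:</p>
```python
import difflib

def select_training_lines(htr_lines, edition_text, min_ratio=0.8):
    """Toy stand-in for a collation step: align each noisy HTR line
    against a digital edition and keep (htr_line, edition_span) pairs
    whose similarity clears a threshold, so the clean edition text can
    serve as a distant-supervision label for the line image."""
    selected = []
    for line in htr_lines:
        matcher = difflib.SequenceMatcher(None, edition_text, line, autojunk=False)
        match = matcher.find_longest_match(0, len(edition_text), 0, len(line))
        # Anchor on the longest shared substring, then score the whole line.
        start = max(0, match.a - match.b)
        span = edition_text[start:start + len(line)]
        ratio = difflib.SequenceMatcher(None, span, line).ratio()
        if ratio >= min_ratio:
            selected.append((line, span))  # train on the edition text, not the HTR
    return selected
```
In a real pipeline the edition side would be chunked and indexed before alignment; the point here is only that noisy lines with no close edition match are discarded rather than trained on.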
      <p>We also note that the impressive gains shown by ACDC were made despite working with
a page segmentation model trained on printed texts. This model can often fail spectacularly
(Figure 7). Even on that page, with a large amount of unrecognized marginalia, ACDC was
able to extract one line. The majority of the training data extracted by ACDC from the training
manuscripts was from pages where the print layout model worked surprisingly well (Figure 6),
despite some errors with features like rubrication or words written larger than others on the
same line. We are hopeful, therefore, that a similar distant supervision approach can be
employed to improve segmentation models by identifying and perhaps normalizing outputs on
pages like these.</p>
      <p>As the number of manuscripts with digitized page images grows, we expect that
broad-coverage methods like ACDC will complement task-specific training sets. Beyond training
HTR, we also expect that the collation methods developed here will be useful in producing
multi-text editions (Figure 3b), as well as using evidence from multiple manuscripts to model
the text-transmission process.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank their collaborators on the Open Islamicate Texts Initiative—
in particular, John Mullan, Lorenz Nigst, and Alejandro Toselli—for help annotating data and
training the print models. This work was supported in part by a National Endowment for
the Humanities Digital Humanities Advancement Grant (HAA-277203-21) and the Andrew W.
Mellon Foundation’s Scholarly Communications and Information Technology program. Any
views, findings, conclusions, or recommendations expressed do not necessarily reflect those of
the NEH or Mellon.</p>
    </sec>
  </body>
  <back>
  </back>
</article>