<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Steps Towards Mining Manuscript Images for Untranscribed Texts: A Case Study From the Syriac Collection at the Vatican Library</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Stökl Ben Ezra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Bambaci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Kiraz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christine Roughan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthieu Freyder</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beth Mardutho: The Syriac Institute</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CHR 2024: Computational Humanities Research Conference</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Center for Digital Humanities, Princeton University</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>École Pratique des Hautes Études, Université Paris Sciences &amp; Lettres</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institut Catholique d'Arts et Métiers</institution>
          ,
          <addr-line>Strasbourg</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Institute of Advanced Studies</institution>
          ,
          <addr-line>Princeton, New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Manuscript, Rare Book and Archive Studies, Princeton University</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>48</fpage>
      <lpage>64</lpage>
      <abstract>
        <p>Digital libraries and databases of texts are invaluable resources for researchers, yet their reliance on printed editions can lead to significant gaps and potentially exclude works without printed reproductions. The Simtho database of Syriac serves as a pertinent example: it is derived primarily from OCR of scholarly editions, but how representative are these of the language's extensive literary tradition, transmitted and preserved in manuscript form for centuries? Taking the Simtho database and a selection of the Vatican Library's Syriac manuscript collection as a case study, we propose a pipeline that aligns a corpus of e-texts with a set of digitised manuscript images, in order to ascertain the presence or absence of texts between the e-text and manuscript corpora and thus contribute to their enrichment. We delve into the complexities of this task, evaluating both effective tools for alignment and approaches to detect factors that can contribute to alignment failures. This case study is intended as a first step towards foundational methodologies applicable to larger-scale manuscript processing efforts.</p>
      </abstract>
      <kwd-group>
        <kwd>automatic text recognition (ATR)</kwd>
        <kwd>layout segmentation</kwd>
        <kwd>text alignment</kwd>
        <kwd>Syriac manuscripts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Digital libraries and databases have become indispensable for researchers working with
historical texts across various disciplines. Many such repositories predominantly rely on printed
works, which are usually more feasibly made machine-readable through automated text
recognition (ATR) technology. Many manuscript materials have become more accessible through
digitization efforts as well, albeit often not in the form of a machine-readable text (hereafter:
e-text) such as a plain text file – the outputs of library manuscript digitization initiatives are
usually digital images.</p>
      <p>We propose that aligning these categories of digital materials can support greater enrichment
of both e-text and manuscript collections. Identifying which manuscript images overlap with
works in a database of e-texts would allow for supplementing the database with information
about manuscript witnesses of particular works; conversely, it would facilitate the cataloguing
of manuscript contents where such data is not yet available or complete. Meanwhile, finding
spans of manuscript folios where their contents do not overlap with any part of an e-text corpus
can identify texts that are not yet present in that corpus and which might merit future inclusion
or scholarly attention – some of these might be unedited works, existing only in manuscript
form.</p>
      <p>
        This task is a form of manuscript alignment: mapping a machine-readable text to manuscript
images. Much of the previous work on this topic has explored aligning the text to image
directly: see a summary overview in [8]. Alternatively, the manuscript images and digital texts
might be aligned via an intermediary dataset: (noisy) transcriptions of the image produced by
handwritten text recognition (HTR) technology. This is the approach taken by [1,
        <xref ref-type="bibr" rid="ref11 ref4">11, 4</xref>
        ] and is
the approach explored in the present paper.
      </p>
      <p>These past studies were aimed towards the alignment of a transcription with its manuscript
images or towards alignment for the purpose of creating ground truth that could be used for
training HTR tools. The present study’s goals of detecting matching or missing texts between
an e-text corpus and a manuscript image corpus require such alignments to be leveraged on
a larger scale.</p>
      <p>A pipeline recently published by Smith, Murel, Allen, and Miller [12], Automatic Collation
for Diversifying Corpora (ACDC), similarly addresses text-to-manuscript alignments at scale
and demonstrates the feasibility of such an approach. Through the text-reuse software Passim,<sup>1</sup>
this pipeline performs extensive alignments between manuscripts and existing digital editions
of popular texts, using those results for distantly-supervised training of HTR models.</p>
      <p>Building on this groundwork, we propose a pipeline that, like ACDC, exploits the capabilities
of Passim to align automated transcriptions and digital editions, but that, unlike ACDC, offers
greater flexibility: since our goals do not require the exact line matches essential for HTR
ground truth creation, we allow for alignments that are more tolerant of the scribal variants
and transcription errors found between manuscripts and print editions.</p>
      <p>Given (1) a corpus of digital, machine-readable texts, (2) a set of digitized manuscript images,
and (3) segmentation and transcription models for the language in question, our full workflow
would proceed as follows:
1. Perform layout analysis with a segmentation model to locate the text on the manuscript
image.
2. Run a transcription model to generate automated transcriptions of the manuscript text.
These transcriptions do not have to be high quality.<sup>2</sup>
3. Use alignment software to find likely alignments between the automated transcriptions
and each e-text in the given digital corpus.
4. For each page image, select the e-text with the best alignment results. This is necessary
because the alignment software might return successful results for multiple works, as in
the case of one work using direct quotations from another work.</p>
      <p><sup>1</sup> passim: https://github.com/dasmiq/passim.</p>
      <p><sup>2</sup> Smith, Murel, Allen, and Miller [12] have already demonstrated success in aligning digital editions to automated
transcriptions produced by an ATR model that achieved only 60.5% accuracy on a test set.</p>
      <p>The results are a set of files, one for each manuscript image, where each of the lines detected
in the segmentation phase either contains a successfully-aligned line of text or contains nothing
(indicating no alignment).</p>
      <p>These alignment results can be further categorized as follows:
• True positive: the process returns an alignment where the aligned text matches what
is in the manuscript.
• False positive: the process returns an alignment where the aligned text does not match
what is found in the manuscript.
• True negative: the process does not return an alignment, and this is accurate because
the text in the manuscript is not represented in the digital corpus.
• False negative: the process does not return an alignment, but the manuscript text is
indeed represented in the digital corpus.</p>
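      <p>As an illustration only, the four categories above can be written as a small decision function. This is a sketch and not part of the pipeline described in this paper: the names <monospace>Outcome</monospace> and <monospace>classify</monospace> and the three boolean inputs are hypothetical, and in practice the second and third inputs would have to come from manual inspection or catalogue data.</p>
      <preformat><![CDATA[
```python
from enum import Enum


class Outcome(Enum):
    TRUE_POSITIVE = "true positive"
    FALSE_POSITIVE = "false positive"
    TRUE_NEGATIVE = "true negative"
    FALSE_NEGATIVE = "false negative"


def classify(aligned: bool, matches_manuscript: bool, in_corpus: bool) -> Outcome:
    """Map one alignment attempt onto the four categories (hypothetical helper).

    aligned            -- did the process return an alignment?
    matches_manuscript -- does the aligned text match the manuscript text?
    in_corpus          -- is the manuscript text represented in the e-text corpus?
    """
    if aligned:
        # An alignment was returned: it is either correct or spurious.
        return Outcome.TRUE_POSITIVE if matches_manuscript else Outcome.FALSE_POSITIVE
    # No alignment was returned: this is only accurate if the text is out of corpus.
    return Outcome.FALSE_NEGATIVE if in_corpus else Outcome.TRUE_NEGATIVE
```
]]></preformat>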
      <p>Minimizing false positives and negatives in such a pipeline is crucial. In our paper, we will
specifically focus on the factors leading to such errors, and explore methods for reducing them.
To achieve this, we will use a test dataset consisting of a random sampling of manuscript images
from the Vatican Library’s collection of Syriac manuscripts aligned with a digital corpus, the
Simtho database of Syriac texts. This experiment will serve as a feasibility test: by evaluating
the performance of our pipeline on this subset, we aim to establish robust methodologies that
can be scaled in future work to process entire collections of digitized Syriac manuscripts.</p>
      <p>The use of true/false positives/negatives here will serve a qualitative evaluation of the
challenges posed to such a process: for reasons we will explain in § 5, a full quantitative evaluation
of such a binary classification is outside the scope of the present paper.</p>
      <p>We will begin our work by providing context on the Syriac materials used for our case study
(§ 2). Next, we will detail the data preparation process, including image selection (§3.1),
automatic transcription (§3.2), and the preprocessing of Syriac e-texts from Simtho (§3.3). Finally,
we will thoroughly describe our pipeline and the alignment experiment (§4), evaluate the
results (§5), and offer concluding remarks on our findings and future work (§6).</p>
      <p>The materials used in our study – segmentation data, automatic transcriptions, links to
images from the Vatican Library, and the alignment outputs – are available in our Zenodo
repository.<sup>3</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. Syriac Manuscripts, Syriac Data</title>
      <sec id="sec-2-1">
        <title>2.1. The Syriac language and script</title>
        <p>Syriac emerged in the area of the city of Edessa (today Şanlıurfa in Turkey), the capital of the
kingdom of Osrhoene (in today’s South-East Turkey and Northern Syria) as a local daughter
script of Imperial Aramaic after the demise of the Persian Empire. Its earliest attestations
are inscriptions from the turn of the era (6 CE), while the oldest manuscripts in Syriac are
documents from the third century.</p>
        <p>Classical Syriac became the common cultural and liturgical language of Eastern Christianity
and was used for literary texts in a vast area from Edessa to China and India.</p>
        <p>Some of the local Aramaic dialects remain in use today, but with the rise of Islam and the
increase of the usage of Arabic, Syriac’s use as a spoken language decreased and gradually
became restricted mainly to liturgy and literature. In some contexts, Arabic-language texts
were written using the Syriac script – these are known as Garshuni texts. The Syriac script
was also used for other languages such as Armenian, Malayalam, and Ottoman Turkish [10].</p>
        <p>Luckily, huge amounts of Syriac and Garshuni manuscripts survived the vicissitudes of
history, wars and persecutions, so that scholars need a repertoire of libraries and manuscript
catalogues to navigate the collections [3].</p>
        <p>The Syriac writing system is an abjad, written in a connected cursive from right to left.
Vowels and other features can optionally be written with diacritical marks. Syriac makes use
of a number of scripts which differ in how letterforms are written, the most important of which
are the Estrangela, Serto and Eastern scripts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Simtho database of Syriac</title>
        <p>Over the course of several years, a team of students, postdocs, and heritage speakers at Beth
Mardutho (The Syriac Institute) have built a 16-million token corpus of Syriac texts, now hosted
at simtho.bethmardutho.org. The source for this data is printed editions digitized via OCR.
Beth Mardutho teams have been involved in developing high-accuracy OCR models for printed
Syriac; where these models do not achieve perfect results, the MelthoLab team (young women
and men from Syriac heritage communities, mostly in the Middle East) have manually corrected
the output.</p>
        <p>The texts currently included in Simtho encompass 1,214 works, almost all printed scholarly
editions, coming to a total of 19,978,900 tokens. The team is additionally OCRing
heritage-community-produced publications (both liturgical and literary texts). When complete, the
Simtho database will include the vast majority of printed texts in Syriac.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Syriac manuscript collection at the Vatican library</title>
        <p>
          With ca. 850 items, the Vatican Library (Biblioteca Apostolica Vaticana) contains one of the
largest collections of Syriac manuscripts in the world. Much of the Vatican collection was
acquired by Elia Assemani and Joseph Simon Assemani in the 18th century. On one of these
trips, the boat of Elia Assemani capsized on the Nile and the manuscripts fell into the water
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Some manuscripts could be salvaged but were badly damaged resulting in darkened pages
or ink washed away.
        </p>
        <p>
          [Figure: (a) Estrangela script (MS Sinai Syr. 15); (b) Serto script (MS Paris syr. 56); (c) Eastern script (MS Sinai Syr. 1).]
        </p>
        <p>
          Digitization campaigns started in 2000 with a project led by Brigham Young University [13,
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Most of the manuscripts are in the main collection of the Vatican Library, but some are in
the Borgiana collection. Of the former, 371 manuscripts have been digitized so far; of the latter,
45 manuscripts. 144 of the main Vatican collection and 20 of the Borgiana collection are in
high-quality color images; the rest are older, lower-quality greyscale scans. All are accessible via
a powerful IIIF-server.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preparation of the Data</title>
      <p>For our experiment, we needed to prepare three types of data. The first of these was a
small-scale slice of manuscript images representing the Vatican Syriac manuscript collection (§3.1).
From these, we then generated automated transcriptions using ATR tools (§3.2). Third, it was
necessary to preprocess the texts in the Simtho database so that they could serve as an effective
source for alignment with the automated transcriptions (§3.3).</p>
      <sec id="sec-3-1">
        <title>3.1. Image selection</title>
        <p>The present experiments have been run on selections from the full-color digitizations in the
Digital Vatican Library (so digitized microfilms have not yet been evaluated). Because our
segmentation models perform best on folio layouts with one or two columns of text, we further
narrowed our selection to only manuscripts with such layouts, which we identified manually.
This resulted in a corpus of 121 manuscripts (69 single-column, 52 double-column).</p>
        <p>For each manuscript in this collection, we randomly sampled ten images to be used in the
following experiment. We then discarded covers, empty pages, and calibration images (color
checkers, rulers), leaving us with our final test dataset of 1,115 images. The manual
classification into single, double, empty and untreatable images was quickly done on a Windows 11
system by sliding images into subfolders. For future projects, we plan to train a layout classifier
based on this data.<sup>4</sup></p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Generating automated transcriptions</title>
        <p>For the segmentation and recognition of the sample images we used the open-source ATR
engine Kraken (version 5) [9].<sup>5</sup> We chose Kraken for its strengths in right-to-left scripts and
the ease of integrating its outputs with Passim, discussed below (§4).</p>
        <p>The segmentation and transcription models were previously produced by a transcribathon
workshop organized by Christine Roughan, Daniel Stökl Ben Ezra and George Kiraz from 25-28
March 2024 at Princeton University. These models and further details on their production will
be published in a separate publication. Here it shall suffice to detail that the transcribathon
produced training data for manuscript images from three collections: the Biblioteca Apostolica
Vaticana (Vatican City, Vatican State),<sup>6</sup> the Bibliothèque Nationale de France (Paris, France), and
the Sinai collection whose digitized microfilms are housed at the Library of Congress
(Washington DC, USA). Participants corrected automatic text-to-text alignments of transcriptions that
had been generated by earlier base-models. Among the models trained on this data are
dedicated segmentation models for single- and double-column layouts as well as one generalized
transcription model.</p>
        <p>The transcription model had been trained on 108 sample folios from 38 manuscripts ranging
in date from the 6th to the 20th centuries. The distribution of dates was not precisely even: the
final training data included no manuscripts from the 18th or 19th century, and manuscripts
from the 13th century were most frequent. Approximately 60% of the transcription ground
truth were images of Serto scripts, 30% were Estrangela scripts, and 10% were Eastern scripts.
This HTR model, therefore, is a generalized one that does not specialize in one particular
manuscript but may perform better on Serto and Estrangela scripts. It demonstrated 97.4%
accuracy on the test data during training.</p>
        <p><sup>4</sup> We envision a classifier such as the one developed by Gogawale, Bambaci, Kurar-Barakat, Vasyutinsky-Shapira,
Stökl Ben Ezra, and Dershowitz [5] for the layout classification of Hebrew prints; see https://github.com/TAU-CH/midrash_layout_classification_using_multilabel_vgg.</p>
        <p><sup>5</sup> kraken: https://github.com/mittagessen/kraken/.</p>
        <p><sup>6</sup> There is an overlap of 17 images between the training data that produced the transcription model used here and
the manuscript images used to test alignment in the current experiment.</p>
        <p>Using kraken to apply these segmentation and recognition models to the experiment’s
images produced transcriptions in ALTO XML totalling 215,859 tokens.</p>
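        <p>To make the shape of this intermediate data concrete, the following minimal sketch reads the line texts out of an ALTO file and counts whitespace-separated tokens. It assumes that each <monospace>TextLine</monospace> element carries its text in the <monospace>CONTENT</monospace> attribute of one or more <monospace>String</monospace> children; the function names and the sample document (with Latin placeholder words) are hypothetical and not taken from our pipeline.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return the text of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Assumption: line text is stored in String/@CONTENT.
        words = [s.get("CONTENT", "") for s in element.iter()
                 if local_name(s.tag) == "String"]
        lines.append(" ".join(w for w in words if w))
    return lines


def token_count(lines: list[str]) -> int:
    """Count whitespace-separated tokens over all lines."""
    return sum(len(line.split()) for line in lines)


# Hypothetical two-line page; real files contain Syriac text and coordinates.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="word1 word2"/></TextLine>
    <TextLine><String CONTENT="word3"/><String CONTENT="word4"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    page_lines = alto_lines(SAMPLE)
    print(page_lines, token_count(page_lines))
```
]]></preformat>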
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Preprocessing Simtho</title>
        <p>Vocalization and diacritics may vary widely from manuscript to manuscript and are usually
absent in the printed editions such as the ones we used for alignment. To aid the alignment
process, we therefore stripped from the Simtho texts most such characters, i.e. diacritics, except
for the combining diaeresis (U+0308) and the combining dot above (U+0307); all punctuation,
except the end paragraph sign (U+0700), the period, and the semicolon; special Garshuni characters
and any letters that do not fall within the main Syriac Unicode block (U+0710 to U+072C).</p>
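        <p>The filtering rule just described can be sketched as a character whitelist. The code below is an approximation written for illustration, not the script we ran: it assumes that the period and semicolon are the ASCII characters, that whitespace is kept and collapsed, and the names <monospace>KEEP_MARKS</monospace>, <monospace>KEEP_PUNCT</monospace> and <monospace>normalize</monospace> are hypothetical.</p>
        <preformat><![CDATA[
```python
# Characters retained besides the main Syriac letters (U+0710 to U+072C).
KEEP_MARKS = {"\u0308", "\u0307"}   # combining diaeresis, combining dot above
KEEP_PUNCT = {"\u0700", ".", ";"}   # end of paragraph sign, period, semicolon (ASCII assumed)


def is_main_syriac_letter(ch: str) -> bool:
    return "\u0710" <= ch <= "\u072C"


def normalize(text: str) -> str:
    """Drop every character outside the whitelist, then collapse whitespace."""
    kept = [
        ch for ch in text
        if is_main_syriac_letter(ch)
        or ch in KEEP_MARKS
        or ch in KEEP_PUNCT
        or ch.isspace()
    ]
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    # Vowel points (U+0730, U+0733), rukkakha (U+0742) and the Arabic comma
    # (U+060C) are removed; the diaeresis (U+0308) and the period survive.
    raw = "\u0710\u0730\u0712\u0742\u0733\u0710\u060C  \u0712\u0308."
    print(normalize(raw))
```
]]></preformat>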
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The Alignment Experiment</title>
      <p>Once we obtained the ALTO files and the preprocessed texts to compare from Simtho, we
proceeded to the alignment of the two datasets.</p>
      <p>As anticipated in the introduction, we chose Passim as a key tool in our alignment pipeline
for its ability to process large corpora of text efficiently.<sup>7</sup> While Passim excels in detecting
instances of text reuse across multiple sources, it is not specifically designed for the task of
text alignment, which requires matching specifically-sized sections of text to each other. In
our case, it is necessary to align subsections of the longer text-reuse results on the level of the
line in the manuscript. In addition, it is essential for us to obtain specific and detailed reports on
the alignment results, both at the line level and the page level, for use in evaluation, statistical
analysis, and more.</p>
      <p>Recognizing these specific needs, we turned to a pipeline designed to complement Passim
by providing robust and accurate text alignment functionality. This pipeline, developed by
Matthieu Freyder,<sup>8</sup> performs the following essential tasks:
1. Text Preparation: It uses the set of texts from the input corpus (here, Simtho) as the
source texts for alignment. It processes those e-texts as well as the automatic
transcriptions from the ALTOs generated by Kraken to prepare them for ingestion into Passim.
2. Text-reuse Data Production: It runs Passim to detect instances of text reuse between
the prepared e-texts and the automated transcriptions.
3. Alignment Data Extraction: It extracts text-reuse data generated by Passim and
compares it with each line of the transcriptions. This comparison involves filtering based on
a user-defined threshold of Levenshtein percentage to identify the best alignment
candidates. These are reinserted back into the ALTOs, while unfit candidates are removed.
4. Metrics Generation: Finally, the pipeline generates dedicated metrics in the form of
TSV files to enable evaluation and perform targeted analyses.</p>
      <p>
        <sup>7</sup> See [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and kitab-project.org for some use cases with extensive corpora. Further bibliography on Passim can be
found there.</p>
      <p>
        <sup>8</sup> TABA: https://github.com/Freymat/from_eScriptorium_to_Passim_and_back.
      </p>
      <p>The first preparatory phase involves choosing the units of texts to compare. Passim offers
flexibility in structuring and specifying the length of input data, from entire books down to
finer units, making it suitable for our corpora. For our experiment, we used regions – i.e. the
text zones on the page identified by our segmentation models – as our minimal text units for
the automatic transcriptions. For the source e-texts from Simtho, we used plain text documents,
with one TXT file per edition. Phase two then executes the main processing task by running
Passim on the textual corpora (see Listing 1 in the appendix, § 7, for the arguments passed to
Passim in our experiment).</p>
      <p>The core component of our pipeline revolves around the third phase, where we aim to
transform text-reuse findings into precise text alignments. Here, we use the Levenshtein percentage
metric, meaning that we accept as potential alignments any Passim suggestions that overlap
with the automatic transcription by a given Levenshtein percentage threshold. After carefully
evaluating three thresholds (70%, 80%, and 90%), we ultimately selected the lowest (70%), so as to
manage variations and possible noise within the e-texts as well as the automatic transcriptions,
while still ensuring pertinent alignment results.</p>
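      <p>A threshold of this kind can be sketched as follows. The exact definition of the Levenshtein percentage used by the pipeline is not spelled out here, so the code assumes one common normalization (edit distance divided by the length of the longer string); the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def levenshtein_percentage(a: str, b: str) -> float:
    """Similarity in percent; normalization by the longer string is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def accept(candidate: str, transcribed_line: str, threshold: float = 70.0) -> bool:
    """Keep a candidate only if it reaches the chosen threshold."""
    return levenshtein_percentage(candidate, transcribed_line) >= threshold
```
]]></preformat>
      <p>Under this assumed definition, a ten-character line with three wrong characters scores exactly 70% and is still accepted, whereas a fourth error pushes it below the threshold.</p>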
      <p>The TSVs generated by the last phase of our pipeline provide insights into the quantity and
quality of identified alignments. We will give a demonstration of that in Section 5.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>Let us turn our attention now to evaluating the alignment results. We will start with a general
overview (§5.1), and then move on to examine successful and unsuccessful alignments (§§5.2,
5.3).</p>
      <sec id="sec-5-1">
        <title>5.1. Overview</title>
        <p>These results come from the TSV reports generated at the end of our pipeline. A portion of
one such TSV is shown in Table 1.</p>
        <p>As can be seen, all alignments found for any of the e-texts are shown – thus, folio 1v of
MS Vat. sir. 1 appears in nine rows, with nine e-texts as potential alignment candidates for
that image. This does not mean that that one folio contains excerpts from nine different texts;
rather, the entries with lower alignment ratios are likely to be false positives.</p>
        <p>This report facilitates scaling up from evaluations of alignments on the manuscript line level
to that of alignments on the folio level. Each folio image will contain a certain number of
lines, and we can measure for how many lines per image our pipeline has found a particular
alignment candidate. Total alignment success on the image level means an alignment candidate
was found covering 100% of the lines in an image; total alignment failure means none were
found covering any lines.</p>
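        <p>The image-level measure can be written down in a few lines. The sketch below is illustrative only; the function names and the three labels are hypothetical and do not correspond to columns in our TSV reports.</p>
        <preformat><![CDATA[
```python
def alignment_ratio(aligned_lines: int, total_lines: int) -> float:
    """Share of an image's detected lines covered by one alignment candidate, in percent."""
    if total_lines == 0:
        # Assumption: an image with no detected lines counts as 0% aligned.
        return 0.0
    return 100.0 * aligned_lines / total_lines


def image_status(aligned_lines: int, total_lines: int) -> str:
    """Label an image as total success, total failure, or partial alignment."""
    ratio = alignment_ratio(aligned_lines, total_lines)
    if ratio == 100.0:
        return "total success"
    if ratio == 0.0:
        return "total failure"
    return "partial"


if __name__ == "__main__":
    print(alignment_ratio(31, 32), image_status(31, 32))
```
]]></preformat>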
        <p>In the results for our test dataset, the pipeline found 100% alignment for 21 images;
conversely, it found 0% alignment for 268 images. Table 2 gives an overview of how many images
saw between 0-25%, 25-50%, 50-75%, or 75-100% successfully aligned lines. Although total folio
alignments were a small proportion of the results, for approximately half of the test dataset we
found an e-text from the Simtho database that aligned to at least 50% of the image’s lines.</p>
        <p>[Table 1: portion of a TSV alignment report. Only the file-name column survives: Vat.sir.1_0006_fa_0001v.xml (nine rows), Vat.sir.1_0007_fa_0002r.xml (two rows), Vat.sir.1_0010_fa_0003v.xml (four rows).]</p>
        <p>Of the 21 images for which the pipeline returned total alignment success, many of the aligned
texts were Biblical ones. So for example fol. 25v of MS Vat. sir. 1 fully aligned with the book
of Genesis from the Syriac version of the Old Testament (the Peshiṭta, “P_Gen[AB]”). Visual
inspection confirmed the accuracy of this alignment. 100% alignment success presents an
obviously strong case for an accurate text identification, but partial alignment successes offer text
identification candidates as well.</p>
        <p>The reasons for total alignment failure were meanwhile more varied. Alignment failure in
general will be discussed in further detail in §5.3 – here, manual inspection of the 268 images
with zero alignment results found reasons ranging from physical damage to segmentation
failures to Garshuni text contents. This inspection did additionally identify three Syriac texts in
the manuscript images of the test dataset that are not yet incorporated into the Simtho database:
the Ecclesiastical History of Socrates Scholasticus, in MS Vat. sir. 145; the Ecclesiastical History
of Theodoret of Cyrrhus, also in MS Vat. sir. 145; and the Syriac Grammar of Bar Hebraeus, in
MS Vat. sir. 193.</p>
        <p>We can also visualize the data for alignment success ratios per image (Figure 2). Across
the approximately 10 random images selected for each manuscript in the test dataset, we can
see that for some manuscripts alignment was highly successful: MS Vat. sir. 471 tops the list
with nine of its folios achieving 100% alignment success and the tenth achieving 96.9%. This
is a manuscript of the Peshiṭta New Testament, and all of the alignment results are relevant
Biblical books.</p>
        <p>For other manuscripts the pipeline returned negative results for attempted alignments. For
instance, MS Vat. sir. 424, parts 1 and 2, achieved 0% alignment across all folio images.
Inspection confirmed that the text contents were Garshuni, not Syriac.</p>
        <p>The following subsections will discuss the successful and failed alignments in more detail,
categorizing different types and discussing methods that address these categories. For various
reasons, a full quantitative evaluation of the specific identifications of the texts is beyond the
scope of this present article.<sup>9</sup></p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Successful alignments</title>
        <p>Alignments can be found by the pipeline for several reasons. The aligned text can either be:
1. A version of the automatically transcribed text, e.g. one of the manuscripts used for the
edition with more or less variant readings, resolved abbreviations, etc.
2. A different (but close enough) recension, e.g., there are different translations of the
Biblical books from Greek, Hebrew and Aramaic into Syriac.
3. A quotation or parallel, especially in the case of long biblical quotations, florilegia,
dictionaries, biographies, histories, or in anthologies such as exegetical Catenae that are
compilations of quotations from previously existing commentaries or homilies.
4. Not a true match, but perhaps a line that happens to share enough formulae between the
e-text and automatic transcription to pass the Levenshtein percentage threshold.</p>
        <p>
          Case 4 is a false positive and one we want to filter out. Similarly, case 3 does not serve
the goals of either automatic cataloguing or automatic detection of out-of-corpus texts, and
so should be filtered out as well. (Such alignments could, of course, serve HTR ground truth
creation or text reuse analyses, but these are outside the goals of the present experiment.)
        </p>
        <p>Fortunately, the majority of such cases can be filtered out simply by selecting only the
alignment result with the highest alignment ratio for that folio. So, for Table 1’s example of MS
Vat. sir. 1 folio 1v, we end up with the alignment for the book of Genesis from the Peshiṭta
(“P_Gen[AB]”) and discard the remainder.</p>
        <p>9 For some texts Simtho does not have any ground truth to be aligned with; for others the Simtho e-text covers
only a part of the composition. Many liturgical, legal and exegetical manuscripts present anthological material
that contains quotations or parallels of varying length; here a binary evaluation is not easily decided and probably
should be avoided altogether. Some e-texts in Simtho are themselves anthologies and so the issue exists both ways.
Text reuse in Syriac literature in general is quite extensive. A different challenge is posed by the limitations of
catalogue data: for at least 40 pages from 8 manuscripts (e.g. Vat. sir. 107, 122, 318, 352, 529, 560 pt. 1, 567, 623)
we do not have scholarly reports of their contents. For at least half of them, the pipeline seems to make good or
excellent suggestions, but this requires more research. We will take up these challenges elsewhere.</p>
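        <p>Selecting the best-scoring alignment per folio can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the record fields (folio, etext, ratio), the second e-text name and the ratio values are assumptions made for the example.</p>
        <preformat>
```python
from typing import Dict, List


def best_alignment_per_folio(results: List[dict]) -> Dict[str, dict]:
    """Keep, for every folio, only the result with the highest alignment ratio."""
    best: Dict[str, dict] = {}
    for row in results:
        current = best.get(row["folio"])
        # Ties keep the first result seen; such cases need manual review.
        if current is None or row["ratio"] > current["ratio"]:
            best[row["folio"]] = row
    return best


# Illustrative values only; field names and ratios are assumptions.
results = [
    {"folio": "Vat. sir. 1, fol. 1v", "etext": "P_Gen[AB]", "ratio": 0.92},
    {"folio": "Vat. sir. 1, fol. 1v", "etext": "hypothetical_other_text", "ratio": 0.41},
]
selected = best_alignment_per_folio(results)
print(selected["Vat. sir. 1, fol. 1v"]["etext"])
```
        </preformat>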
        <p>
          This works when we do have a likely true positive alignment result for that image, but
for images with less likely results we require other methods. This is where we leverage the
“max_cluster” column, which expresses the total number of consecutive aligned lines between
the automatic transcriptions and the digital edition: case [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] usually results in only one aligned
line, and short quotations from case [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] will lead to only short consecutive spans of aligned
lines. We have therefore found it effective to keep only alignment results with at least three
consecutive aligned lines.10
        </p>
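        <p>A cluster filter of this kind might look like the sketch below. It assumes that the aligned lines of one result are available as a list of line indices on the page, which is a hypothetical input format; the minimum of three consecutive lines is the value given in the text.</p>
        <preformat>
```python
def max_cluster(aligned_line_numbers):
    """Length of the longest run of consecutive aligned line indices."""
    longest = 0
    run = 0
    previous = None
    for number in sorted(set(aligned_line_numbers)):
        if previous is not None and number == previous + 1:
            run += 1
        else:
            run = 1
        longest = max(longest, run)
        previous = number
    return longest


def keep_alignment(aligned_line_numbers, min_cluster=3):
    """Keep a result only if it has at least min_cluster consecutive aligned lines."""
    return max_cluster(aligned_line_numbers) >= min_cluster


print(max_cluster([4, 5, 6, 9]), keep_alignment([1, 2, 7, 8]))
```
        </preformat>
        <p>Under these assumptions, a single formulaic line yields a cluster of 1 and a short quotation a cluster of 2, so both are discarded.</p>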
        <p>
          To return to our categories of successful alignments: we meanwhile treat cases [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
as true positives, using the alignment ratio and the total of consecutive aligned lines as key
indicators. In the first case, the identification can be a direct candidate for automatic cataloguing.
In the second case, the identification might be an indirect candidate for automatic cataloguing,
identifying the text but not the specific recension.11
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Failed alignments</title>
        <p>Alignment failure can have very different causes.</p>
        <p>1. Poor preservation of the surface (e.g. damage through ink corrosion, fragmentary state).
2. Poor preservation of the writing traces (e.g. water damage, smear or shine-through).
3. Imaging problem.
4. Poor layout segmentation leading to a bad representation of the overall text in the
recognition (missing lines or parts of lines, missing columns or parts of columns, or
erroneously joining separate lines).
5. Poor text recognition.
6. No matching text in the textual database.</p>
        <p>
          Among these cases, only case [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a true negative, indicating that the manuscript did not
contain any texts from the corpus of e-texts used for alignment. In the remainder of cases, the
pipeline may be returning false negatives. Whether or not the manuscript contains a work
from the e-text corpus cannot be automatically determined so long as these other factors are
impeding potential alignments. Further elaboration of the six cases and methods for addressing
them follows below.
        </p>
        <p>
          Cases 1-2: poor preservation of the writing or page surface. These cases lead to poor
performance further along our workflow: fragmentary or water-damaged folios lead to poor
segmentation results (since our models expect undamaged columns of text); poor visibility of
the text leads to poor automated transcription results. There is unfortunately little to be done
at this point in time for images of a manuscript whose text has been washed away because
it fell into the Nile. Therefore, for cases [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] we primarily seek to flag these problem
images so that they can be filtered out.12
        </p>
        <p>10 Handling quotations or parallels of longer length is a challenge in the present version of the experiment, but can
be addressed in future work by running full manuscripts through the pipeline rather than isolated folios. This
would allow for the use of a similar “max_cluster” metric, albeit on the folio-level rather than the line-level; such
a metric could help to disambiguate when a manuscript contains a text in full or only a (long) quotation of it.</p>
        <p>11 There are some cases where the two highest ranked alignment suggestions have equal evaluation scores; in such
cases it requires closer research to determine which one is the better identification.</p>
        <p>Poor segmentation results can be identified through the segmentation data in the ALTO
XML files. The segmentation models used in this experiment identify sets of lines contained
in regions (columns). A simple evaluation of the line segmentation quality for a given column
is the quotient of the mean line length and the column’s average width. A low value implies
broken or too short lines.</p>
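        <p>The quotient described above can be sketched as follows. The sketch assumes that the line widths and the column width have already been read from the segmentation data as pixel values; the function name and the sample numbers are illustrative only.</p>
        <preformat>
```python
from statistics import mean


def line_fill_ratio(line_widths, column_width):
    """Mean line width divided by column width; low values suggest broken lines."""
    if not line_widths or not column_width > 0:
        return 0.0
    return mean(line_widths) / column_width


# Hypothetical pixel widths for two columns that are each 1000 px wide.
well_segmented = line_fill_ratio([900, 950, 1000], 1000)
fragmented = line_fill_ratio([200, 150, 400, 250], 1000)
print(well_segmented, fragmented)
```
        </preformat>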
        <p>For poor transcription results, one possibility for flagging problem cases can be found in the
reported confidence of the recognition network used to produce the HTR transcriptions. For
142 images, the average recognition confidence for that page is below 96.25%. In most cases
(123 images, i.e. 86.6%), this indicated a problem like water damage, smeared ink, gold ink on
dark background (Vat. sir. 622), complex page layout (e.g. a table), fragmentary preservation, or
shine-through. Vice versa, for the 973 cases with a recognition confidence higher than 96.25%,
visual inspection revealed that only 197 (20.3%) presented a problem such as water damage,
low contrast or fragmentary state.</p>
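        <p>Flagging pages by mean recognition confidence could be done along the following lines. The 96.25% threshold is the one reported above; the input format (a mapping from page name to a list of confidence values between 0 and 1) and the sample values are assumptions.</p>
        <preformat>
```python
from statistics import mean

CONFIDENCE_THRESHOLD = 0.9625  # the 96.25% threshold reported in the text


def low_confidence_pages(page_confidences, threshold=CONFIDENCE_THRESHOLD):
    """Return the names of pages whose mean confidence falls below the threshold."""
    flagged = []
    for page, confidences in page_confidences.items():
        if confidences and threshold > mean(confidences):
            flagged.append(page)
    return sorted(flagged)


# Made-up confidence values for two hypothetical pages.
pages = {
    "page_a": [0.99, 0.98, 0.97],
    "page_b": [0.95, 0.90, 0.97],
}
print(low_confidence_pages(pages))
```
        </preformat>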
        <p>We can also get an indication of poor transcription results by checking the ratio of the
recognized characters in a line to the segmented line length. We have found low values tend to
correspond to water damage, smeared ink, or gold ink on dark folios.</p>
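        <p>One way to compute such a ratio is shown below. The sketch counts every recognized character, including spaces, and measures line length in pixels; both choices are assumptions, since the text does not specify them.</p>
        <preformat>
```python
def character_density(text, line_length_px):
    """Recognized characters per pixel of segmented line length."""
    if not line_length_px > 0:
        return 0.0
    return len(text) / line_length_px


def page_density(lines):
    """Average density over (text, line_length_px) pairs of one page."""
    densities = [character_density(text, length) for text, length in lines]
    return sum(densities) / len(densities) if densities else 0.0


# Hypothetical lines: (recognized text, segmented line length in pixels).
page = [("abcd", 100), ("ab", 100)]
print(page_density(page))
```
        </preformat>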
        <p>Case 3: imaging problems. In the present experiment, we avoided imaging problems by
selecting test images only from the high-quality color digitizations in the Vatican Library. In
future work, microfilm or other greyscale / black and white images in a massive corpus of
digitized manuscripts could be easily flagged by checking the color mode of each image or a
random sampling of a small number of pixels. Further automatic evaluation of imaging
problems requires further research.</p>
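        <p>The pixel-sampling check mentioned above might be sketched as follows. It assumes that the image has already been decoded into a list of (R, G, B) tuples by some imaging library, which is not shown; the sample size and tolerance are hypothetical parameters.</p>
        <preformat>
```python
import random


def looks_greyscale(pixels, sample_size=100, tolerance=0, seed=0):
    """Guess whether an image is greyscale from a random sample of RGB pixels."""
    if not pixels:
        return False
    rng = random.Random(seed)
    sample = rng.sample(pixels, min(sample_size, len(pixels)))
    # A pixel counts as grey when its three channels are (nearly) equal.
    return all(tolerance >= max(p) - min(p) for p in sample)


# Synthetic pixel lists standing in for a microfilm scan and a colour scan.
grey_scan = [(v, v, v) for v in range(0, 256, 5)]
colour_scan = [(200, 170, 120)] * 50
print(looks_greyscale(grey_scan), looks_greyscale(colour_scan))
```
        </preformat>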
        <sec id="sec-5-3-1">
          <title>Cases 4-5: poor recognition performance by the segmentation or transcription models</title>
          <p>
            We have discussed methods for detecting poor segmentation or transcription above.
Where cases [
            <xref ref-type="bibr" rid="ref1 ref2 ref3">1-3</xref>
            ] are ruled out as potential reasons for this performance, we are likely dealing
instead with a manuscript for which our models perform poorly.
          </p>
          <p>These segmentation and transcription models will of course achieve variable results on new
manuscripts. The Passim parameters used here are selected to be tolerant of noisy HTR results,
but poor enough performance will still lead to failed alignments. One option for remedying
poor HTR results is using a tool like ACDC to refine models so that they may better handle the
manuscripts with subpar results.</p>
          <p>Case 6: no aligned text found. This could itself have one of several causes,
depending on whether the image contains:
a. no text,
b. text not written in the Syriac script,
c. Syriac script used to write another language (i.e., Garshuni, see §2.1), or
d. a Syriac text not present in the e-texts used for alignment.</p>
          <p>12 Such damage, however, does not always lead to failed alignments: visual inspection showed that the current
segmentation and transcription models in fact dealt with some of these difficult cases remarkably well.</p>
          <p>Our experiment has already manually excluded case [6a] from the images in our dataset.
Such filtering could also be achieved automatically by flagging images with no (or only minor)
segmentation results.</p>
          <p>
            In our dataset, we identified examples of case [6b] as occurring in Vat. sir. 23 (left columns
are Arabic), Vat. sir. 20.2 (fol. 164r has Arabic in the same column as Syriac), and Vat. sir. 51.1
(fol. 7r is in Latin). To distinguish case [6b] from case [6d], we can take advantage again of
the reported confidence of the recognition network since the images that contain non-Syriac
scripts have a lower reported confidence. Unlike cases [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] and [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], however, case [6b] does not
result in lines with unusually few characters, and so when we have poor reported confidence
but normal line character lengths, this can indicate the presence of a non-Syriac script.
          </p>
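          <p>The decision rule described in this paragraph can be written down as a small function. This is only a sketch: the confidence threshold is the 96.25% value used earlier, while the density threshold, the function name and the labels are hypothetical and would need to be calibrated on real data.</p>
          <preformat>
```python
def classify_failed_page(mean_confidence, chars_per_px,
                         confidence_threshold, density_threshold):
    """Rough triage of a page without alignment, following the rule in the text."""
    # Unusually few characters per line length points to damaged writing.
    if density_threshold > chars_per_px:
        return "possible damage"
    # Normal line lengths but low confidence points to a non-Syriac script.
    if confidence_threshold > mean_confidence:
        return "possible non-Syriac script"
    return "no flag"


# 0.02 characters per pixel is an invented density threshold.
print(classify_failed_page(0.93, 0.04, 0.9625, 0.02))
```
          </preformat>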
          <p>Our experiment additionally seeks to flag case [6c] because the Simtho database only
contains Syriac texts, not Garshuni ones. We did not have the capacity to train a
dedicated BERT model at this point and therefore relied on a simpler NLP method, namely bigram distributions.
We calculated the 200 most common bigrams in Simtho to represent all styles and periods of
Syriac in general. Then we calculated the distribution of these 200 bigrams for each
automatically recognized page and measured the Euclidean distance to the Simtho bigram vector. We
also used the automatic transcriptions of the Garshuni pages to calculate a Garshuni bigram vector
over the same 200 bigrams. There were 82 images in our dataset
which contained Garshuni text; leveraging the calculated Euclidean distances allowed us to
successfully flag 76 of them.</p>
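          <p>The bigram comparison can be illustrated with the sketch below. It assumes character bigrams counted within words and relative frequencies normalized over the selected vocabulary; the text does not state either detail, so both are assumptions. The toy Latin strings stand in for the Simtho corpus and for page transcriptions, and k would be 200 in the setting described above.</p>
          <preformat>
```python
from collections import Counter
from math import sqrt


def bigram_counts(text):
    """Count character bigrams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts


def top_bigrams(reference_text, k=200):
    """The k most common bigrams of the reference corpus."""
    return [bigram for bigram, _ in bigram_counts(reference_text).most_common(k)]


def bigram_vector(text, vocabulary):
    """Relative frequency of each vocabulary bigram in the text."""
    counts = bigram_counts(text)
    total = sum(counts[bigram] for bigram in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[bigram] / total for bigram in vocabulary]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


reference = "abab abab"  # placeholder for the reference corpus
vocabulary = top_bigrams(reference, k=2)
reference_vector = bigram_vector(reference, vocabulary)
dist_same = euclidean(bigram_vector("abab", vocabulary), reference_vector)
dist_other = euclidean(bigram_vector("baba", vocabulary), reference_vector)
print(dist_same, dist_other)
```
          </preformat>
          <p>Pages whose distance to the reference vector exceeds a chosen cut-off would then be flagged as possible Garshuni; the cut-off itself is not given in the text and would have to be chosen empirically.</p>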
          <p>When the above methods indicate that cases [6a], [6b] and [6c] are unlikely, we are dealing with case
[6d], and so have successfully located a Syriac text in a manuscript that is not present in the input e-text
corpus (here, the Simtho database of Syriac).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The experiments of this paper have highlighted the variety of challenges that need to be
surmounted when attempting corpus-level automated indexing or detection of absent texts
through text alignment methods. Through the features discovered in the evaluation of the test
dataset that lead to false positives or false negatives, we have been able to strategize and test
methods for mitigating these challenges.</p>
      <p>Overall, this small experiment on the randomized set of folios sampled from the Vatican
Library collection of digitized Syriac manuscripts has shown that the proposed workflow has
promise. The pipeline achieved great success in aligning e-texts to noisy HTR in the test dataset,
and the results can be used to determine candidates for text identifications of the contents of
those images. Additionally, even in this small slice of the Vatican Syriac manuscript collection
we were able to identify three Syriac texts which were absent from the Simtho database. With
procedures put in place to handle the challenges described in §5.3, such a workflow could be
applied to the full corpus. To leverage this workflow on a full-sized dataset, however, we will
want to optimize our pipeline to function in a more time-efficient manner.</p>
      <p>The case study detailed in the present paper has covered Syriac corpora, but the broad
strokes of the workflow are not language-specific. Given suitable e-text repositories, digitized
manuscript collections, and preliminary ATR models, this workflow is broadly applicable to a
wide range of language traditions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Funded by the European Union (ERC, MiDRASH, Project No. 101071829). Views and opinions
expressed are however those of the author(s) only and do not necessarily reflect those of the
European Union or the European Research Council Executive Agency. Neither the European
Union nor the granting authority can be held responsible for them.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Appendices</title>
      <p>We ran our pipeline on an HPC cluster equipped with 128 CPUs (see Table 3 for the full hardware
specifications). We measured the execution time as the experiment proceeded through each
step. As can be seen in Table 4, the optimizations of Passim were clear: detecting text reuse
between a 16-million token corpus (Simtho) and a 200-thousand token corpus (automatic
transcriptions) took only three and a half minutes on our hardware. The following steps, on the
other hand, were more costly, requiring 25 minutes for our current small-scale dataset. Since in
future work we will want to apply this pipeline to a much larger dataset, we will be exploring
optimizations to improve the execution time of these steps.</p>
      <sec id="sec-8-1">
        <title>Listing 1</title>
        <p>List of Passim arguments used. A complete explanation of these arguments can be found in the Passim
documentation (see note 1).</p>
        <preformat>--master local[125]
--executor-memory 200G
--driver-memory 30G
seriatim
--docwise
--floating-ngrams
--fields ref
--filterpairs 'ref = 1 AND ref2 = 0'
--all-pairs
--complete-lines
-n 7
/TABA/data/processed/json_for_passim/passim_input.json
/TABA/data/processed/passim_output</preformat>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chammas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mokbel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Likforman-Sulem</surname>
          </string-name>
          .
          <article-title>“Handwriting Recognition of Historical Documents with Few Labeled Data”</article-title>
          .
          <source>In: International Workshop on Document Analysis Systems (DAS)</source>
          .
          <year>2018</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          . doi: 10.1109/das.2018.15.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cordell</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <source>Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines</source>
          .
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Desreumaux</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Briquel-Chatonnet</surname>
          </string-name>
          .
          <source>Répertoire des Bibliothèques et des Catalogues de Manuscrits Syriaques</source>
          . Paris: Editions du Centre National de la Recherche Scientifique,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. S. B.</given-names>
            <surname>Ezra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Brown-DeVost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pechorin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          . “
          <article-title>Transcription Alignment for Highly Fragmentary Historical Manuscripts: The Dead Sea Scrolls”</article-title>
          .
          <source>In: 17th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , Dortmund, Germany,
          <year>2020</year>
          .
          <year>2020</year>
          , pp.
          <fpage>361</fpage>
          -
          <lpage>366</lpage>
          . doi: 10.1109/icfhr2020.2020.00072.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gogawale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bambaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kurar-Barakat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vasyutinsky-Shapira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stökl Ben Ezra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          . “
          <article-title>NetLay: Layout Classification Dataset for Enhancing Layout Analysis”</article-title>
          . In:
          <source>Magazén: International Journal for Digital and Public Humanities</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Griffin</surname>
          </string-name>
          . “
          <article-title>Syriac Manuscripts from the Egyptian Desert”</article-title>
          .
          <source>In: The Newsletter of the Neal A. Maxwell Institute for Religious Scholarship 31.1</source>
          (
          <year>2011</year>
          ). url: https://scholarsarchive.byu.edu/insights/vol31/iss1/4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Heal</surname>
          </string-name>
          . “
          <article-title>Vatican Syriac Manuscripts: Volume 1”</article-title>
          .
          <source>In: Hugoye: Journal of Syriac Studies 8.1</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          . url: https://hugoye.bethmardutho.org/article/hv8n1prheal2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibn Khedher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jmila</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>El-Yacoubi</surname>
          </string-name>
          .
          <article-title>“Automatic Processing of Historical Arabic Documents: A Comprehensive Survey”</article-title>
          .
          <source>In: Pattern Recognition</source>
          <volume>100</volume>
          (
          <year>2020</year>
          ), p.
          <fpage>107144</fpage>
          . doi: 10.1016/j.patcog.2019.107144. url: https://www.sciencedirect.com/science/article/pii/S0031320319304455.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          . “
          <article-title>Kraken: A Universal Text Recognizer for the Humanities”</article-title>
          .
          <source>In: Digital Humanities (DH)</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kiraz</surname>
          </string-name>
          . “
          <article-title>Garshunography: Terminology and Some Formal Properties of Writing One Language in the Script of Another”</article-title>
          . In:
          <source>Scripts beyond Borders: A Survey of Allographic Traditions in the Euro-Mediterranean World</source>
          . Ed. by J. den Heijer, A. Schmidt, and T. Pataridze. Peeters,
          <year>2014</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>de Reuse</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Fujinaga</surname>
          </string-name>
          . “
          <article-title>Robust Transcript Alignment on Medieval Chant Manuscripts”</article-title>
          .
          <source>In: Proceedings of the 2nd International Workshop on Reading Music Systems, Delft, the Netherlands, November 2, 2019</source>
          .
          <year>2019</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Allen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Miller</surname>
          </string-name>
          . “
          <article-title>Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition”</article-title>
          .
          <source>In: Proceedings of the Computational Humanities Research Conference 2023, Paris, France, December 6-8, 2023</source>
          .
          <year>2023</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>221</lpage>
          . url: https://ceur-ws.org/Vol-3558/paper1708.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          “
          <article-title>The Digitizing of Selected Syriac MSS in the Vatican Apostolic Library”</article-title>
          .
          <source>In: Hugoye: Journal of Syriac Studies 3.2</source>
          (
          <year>2000</year>
          [2010]), pp.
          <fpage>282</fpage>
          -
          <lpage>285</lpage>
          . url: https://hugoye.bethmardutho.org/article/hv3n2crdigitizingsyrmss.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>