<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Informatique
et les Techniques Avancées, Paris, France
luigi.bambaci@ephe.psl.eu (L. Bambaci); daniel.stoekl@ephe.psl.eu (D. Stökl Ben Ezra)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing HTR of Historical Texts through Scholarly Editions: A Case Study from an Ancient Collation of the Hebrew Bible</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Bambaci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Stökl Ben Ezra</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archéologie &amp; Philologie d'Orient et d'Occident UMR 8546, École Pratique des Hautes Études, Université Paris Sciences &amp; Lettres (EPHE, PSL)</institution>
          ,
          <addr-line>Les Patios Saint-Jacques, 4-14 Rue Ferrus, 75014 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Printed critical editions of literary texts are a largely neglected source of knowledge in computational humanities. However, under certain conditions, they hold significant potential for multifaceted exploration: First, through Optical Character Recognition (OCR) of the text and its apparatus, coupled with intelligent parsing of the variant readings, it becomes possible to reconstruct comprehensive manuscript collations, which can prove invaluable for a variety of investigations, including phylogenetic analyses, redaction history studies, linguistic inquiries, and more. Second, by aligning the printed edition with manuscript images, a substantial amount of Handwritten Text Recognition (HTR) ground truth can be generated. This serves as valuable material for paleography and layout analysis, as well as for assessing the quality of the collation criteria adopted by the editor. The present paper focuses on the challenges we addressed in the processes of OCR, apparatus parsing, text reconstruction, and alignment with the manuscript images, taking as a case study the edition of the Hebrew Bible published by Kennicott in the late eighteenth century.</p>
      </abstract>
      <kwd-group>
        <kwd>layout analysis</kwd>
        <kwd>automatic transcription</kwd>
        <kwd>text encoding</kwd>
        <kwd>Hebrew Bible manuscripts</kwd>
        <kwd>textual criticism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>For centuries, critical editions have served as the backbone of the humanities far beyond
philology, offering important insights into the textual evolution of numerous historical works and
providing scholars with reliable texts for their academic inquiries. The advent of Optical
Character Recognition (OCR) and Handwritten Text Recognition (HTR) technologies as well as
Natural Language Processing (NLP) has opened a new era both in the preservation and in the
exploration of these indispensable works.</p>
      <p>
        Numerous OCR and HTR software solutions are available today, and multiple studies and
projects have contributed to the advancement of digitizing and analyzing the cultural book
heritage. Among the most well-known software, we can mention Transkribus [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ], Monk [
        <xref ref-type="bibr" rid="ref22">21</xref>
        ],
Aletheia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and Tesseract 4.0.¹ Notable research efforts relevant to ours include the work
by Toselli et al. [26], who exploited huge datasets of existing OCRed printed books for
self-supervised layout analysis, as well as projects like HORAE [5], which examined large amounts
of Biblical texts or quotations in Latin to create HTR ground truth and conduct manuscript
analysis.
      </p>
      <p>The focus of these advancements has predominantly revolved around traditions within
classical languages or modern languages in the Latin alphabet. However, there has been a recent
shift towards including Hebrew texts in such endeavors: one eminent example is the BiblIA
project [25], which examined a substantial corpus of medieval manuscripts in Hebrew script,
providing the first public dataset of transcriptions as well as efficient models for automatic
segmentation and text recognition.</p>
      <p>In this paper, we aim to contribute to the ongoing progress in digital Hebrew research,
focusing in particular on the corpus of scholarly editions and biblical manuscripts. We will delve
into the challenges faced in digitizing and encoding an ancient edition of the Hebrew Bible,
namely the eighteenth-century collation by Benjamin Kennicott, and we will elucidate how
we extracted from it a large amount of complete manuscript texts that we will be able to align
with their manuscript images.</p>
      <p>The significance of Kennicott’s collation for biblical studies remains unparalleled. The wealth
of data it offers is exceptional and its potential applications are manifold, as we will elaborate
shortly (§ 2). Yet, the sheer volume and complexity of the data constitute a significant obstacle
to analysis, compelling scholars to work on limited samples and to perform laborious manual
processing.</p>
      <p>Through the digitization of this edition, our aim is to provide the scholarly community with
a digital resource for swift, efficient, and large-scale examinations of Hebrew Bible manuscripts.
Additionally, we will leverage Kennicott’s collation for an unprecedented purpose: enhancing
the performance of HTR systems using automatically reconstructed texts derived from the
critical apparatus data. The automatic generation of these texts will afford us a massive amount
(approximately 75,000 pages) of ground truth for HTR of Hebrew manuscripts, while the
alignment with the images will enable us not only to measure the degree of discrepancy between the
original collation and the actual manuscripts, but also to correct errors, fill gaps, and produce
more faithful and updated collation data.</p>
      <p>In the next sections, we will provide a detailed account of our work. We will elaborate on
how we conducted layout analysis for the purpose of segmentation and transcription (§§ 3.2,
3.3), and how we automatically encoded the data present in the critical apparatus using a
rule-based parser (§ 3.4). Lastly, we will demonstrate how we successfully generated complete texts
of fully collated witnesses and how we are going to use them as training data to improve and
speed up the automatic transcription of a number of manuscripts of the Hebrew Bible (§ 4).</p>
      <p>The method we are about to present here is part of an ongoing project entitled Reverse
Engineering Kennicott (REK), funded by Biblissima+² and directed by the École Pratique des
Hautes Études, Paris Sciences et Lettres University. The project is carried out in close synergy
with Ktiv,³ the most important online catalog of Hebrew manuscripts, and with the National
Library of Israel,⁴ and is centered on the web application eScriptorium.⁵
¹https://github.com/tesseract-ocr/tesseract.
²https://biblissima.fr/.</p>
      <p>At the time of writing this article, the first of the two volumes of Kennicott’s work has
been encoded, and the texts of the witnesses of the book of Genesis have been automatically
generated.</p>
      <p>Before illustrating the pipeline of our project, let us briefly outline Kennicott’s work, in order
to explain why it is so important for biblical research and how it can be used to fully recover
the text of medieval manuscripts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Kennicott’s collation of the Hebrew Bible</title>
      <p>The Hebrew Bible is a compilation of texts from the first millennium before the common era,
written primarily in Hebrew with some sections in Aramaic, and totaling around 470,000 tokens.
A sacred text in Judaism and later also in Christianity, with numerous translations into many
ancient languages such as Greek, Latin, Aramaic, Armenian, Coptic, Georgian, Arabic and, since
the Reformation era, into virtually all contemporary languages, it is one of the most important
texts extant worldwide.</p>
      <p>Kennicott was the first scholar to systematically gather and collate the Hebrew textual
witnesses of the Bible.</p>
      <p>
        His two-volume collation, published at Oxford between 1776 and 1780 and titled Vetus
Testamentum Hebraicum cum variis lectionibus [
        <xref ref-type="bibr" rid="ref15 ref16">14, 15</xref>
        ], remains the largest of its kind to this day: its
extensive critical apparatus, built upon the examination of no fewer than 600 manuscripts and
70 printed editions (Fig. 1), is estimated to contain something like 1,500,000 pieces of textual
information.⁶
³https://www.nli.org.il/en/discover/manuscripts/hebrew-manuscripts.
⁴https://www.nli.org.il/en.
⁵https://msia.escriptorium.fr/.
      </p>
      <p>
        Kennicott’s work has never been replaced: De Rossi’s collations [
        <xref ref-type="bibr" rid="ref10">10, 9</xref>
        ], which were
published shortly afterwards, are highly eclectic and present only a restricted selection of variants,
while later editions either depend on these classical collations,⁷ or drastically reduce the
number of collated manuscripts,⁸ or even dispense with the testimony of medieval manuscripts
altogether.⁹
      </p>
      <p>
        The use of Kennicott’s data is not confined solely to consultation or the compilation of critical
editions. Scholars have repeatedly demonstrated how it is possible to extract relevant research
information out of Kennicott’s apparatus: from textual history, enabling the reconstruction
of the transmission process of the Hebrew Bible in the Middle Ages through stemmatological
methods, such as clustering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and phylogenetics [
        <xref ref-type="bibr" rid="ref2 ref3 ref9">3, 2</xref>
        ]; through philology, for the study of common
copying errors and scribal habits; and codicology and paleography, aiding in dating and
localizing new manuscripts [
        <xref ref-type="bibr" rid="ref13 ref20">19, 12</xref>
        ]; to linguistics, allowing the analysis of variant spelling and
orthography [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Kennicott’s collation is a valuable resource for research across all these domains. Indeed, it
stands as the first and, to date, only endeavor to provide a scholarly edition of
the medieval Hebrew Bible text. Let us take a closer look at some of its key features.</p>
      <p>The work is organized into sections, each dedicated to a biblical book (e.g. Genesis, Exodus
etc.) or to a collection of biblical books (e.g. the Five Megilloth: Song of Songs, Ruth etc.).
Each section comprises two main parts: a reference text¹⁰ printed at the top of the page and a
critical apparatus of variants printed at the bottom (Fig. 2).</p>
      <p>In the apparatus, the witnesses are cited using unique alphanumeric sigla. Keys to these
sigla are provided in the catalog reproduced in the introduction to the first volume, containing
the most relevant bibliographical information, such as date and provenance.¹¹</p>
      <p>
        In addition to this catalog, Kennicott provides recapitulative lists of witnesses at the end
of each book or collection of books. The purpose of these lists is to categorize the witnesses
into manuscripts and printed editions, as well as to signal which of them have been collated
in full (per totum collati) and which only partially (in locis selectis collati). This distinction by
degree of collation, which is absent, for example, in De Rossi, is of fundamental importance
and directly impacts our work: since only fully collated witnesses can provide the basis for
a systematic gathering of variants, it permits us to identify the witnesses for which we can
reasonably expect to obtain complete and reliable automatic transcriptions.
⁶[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 28ff.
⁷So, for example, the Biblia Hebraica Stuttgartensia [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ].
⁸Like the Hebrew University Bible [
        <xref ref-type="bibr" rid="ref24">22</xref>
        ].
⁹As the Biblia Hebraica Quinta [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ].
¹⁰Taken from the most widely used edition at the time, that of E. van der Hooght (Amsterdam 1705), which Kennicott
adopted as the basis for his collation.
¹¹Most of Kennicott’s manuscripts have been identified: in 2020, Idan Dershowitz published a comprehensive list of
these manuscripts containing URLs to Ktiv, where updated bibliographic information and, when available, images
can be found. This list is accessible on the author’s academia.edu page: https://www.academia.edu/37862623.
      </p>
      <p>Such a systematic approach towards collation is the hallmark of Kennicott’s method. In
contrast to what De Rossi would later do, Kennicott goes beyond the most conspicuous phenomena
of variation, encompassing all potential discrepancies between the reference text and each
individual witness, such as spelling, the layout of paratextual elements, and various details of the
mise en page. This choice, however philologically questionable, actually benefits us by assuring,
at least theoretically and net of inconsistencies, errors, and omissions, that we have complete
lists of variants at our disposal.</p>
      <p>Finally, but most importantly, Kennicott organizes the variants in the apparatus in an
extremely precise manner, minimizing the use of natural language and adopting a formalism
that anticipates that of most recent editions. On this aspect, which is crucial for automatically
extracting information from the critical apparatus, we will dwell at length later on (§ 3.4.1).</p>
      <p>The features we have just listed effectively make Kennicott’s work not only a rich source of
data on the textual tradition of the Hebrew Bible, but also an ideal candidate for our
computational treatment.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Pipeline</title>
      <p>REK’s main objectives are threefold:
1. to obtain a TEI-compliant encoding of both the reference text and the critical apparatus
of Kennicott;
2. to reconstruct the text of 244 manuscripts fully automatically by way of encoding, for a
total of approx. 75,000 pages;
3. to provide an accurate and complete transcription of the text of 10 of Kennicott’s
manuscripts (approx. 7,500 pages) through alignment with these automatically reconstructed
texts.
To achieve these objectives, we devised the following 4-step pipeline:
1. acquisition of images of Kennicott’s Vetus Testamentum and of the 10 chosen
manuscripts;
2. automatic segmentation and transcription;
3. parsing and encoding of Kennicott’s apparatus;
4. reconstruction of the witness texts.</p>
      <p>We will now discuss each of these points in detail, presenting the work done as well as
outlining what is yet to be accomplished. Let us begin with the first step, image acquisition.</p>
      <sec id="sec-3-1">
        <title>3.1. Image acquisition</title>
        <p>Digital copies of Kennicott’s Vetus Testamentum are freely available on the web on platforms
such as Archive.org and Google Books, both in .pdf format and in various image formats. We
chose the .jp2 images from Archive.org,¹² which are in an acceptable resolution, and converted
them to .jpeg, which is most widely supported and produces smaller file sizes that still suffice
for OCR.
¹²First volume: https://archive.org/details/vetustestamentum01kenn; second volume: https://archive.org/details/vetustestamentum02kenn.</p>
        <p>Among the manuscripts collated by Kennicott, we have identified about 20 that are important
for their variants. Among these, we have selected 10, based on criteria of convenience such
as simple layout, the absence of inline translations into targumic Aramaic, and, of course, the
availability of the images (Tab. 1).</p>
        <p>
          Among the different software mentioned in the Introduction, we have chosen to work with
eScriptorium [
          <xref ref-type="bibr" rid="ref18 ref25">24, 17, 23</xref>
          ] and its OCR/HTR engine, Kraken [
          <xref ref-type="bibr" rid="ref17">16</xref>
          ],¹³ which is optimized for
historical and non-Latin script material.
        </p>
        <p>To upload the images of these manuscripts into eScriptorium, we made use of the IIIF
standard: for each chosen manuscript, we retrieved the IIIF manifest and then we used Python
scripts to download the images and populate our database.</p>
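        <p>The manifest-walking step can be sketched as follows (a minimal sketch for IIIF Presentation API v2 manifests; the project's actual scripts are not published, and the function name is ours). Each canvas in a v2 manifest carries its full-size image URL in the image annotation's resource:</p>
        <p>
```python
def image_urls_from_manifest(manifest: dict) -> list[str]:
    """Collect full-size image URLs from a IIIF Presentation v2 manifest.

    Walks sequences, then canvases, then image annotations, taking the
    "@id" of each image resource.
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls
```
        </p>
        <p>The returned URLs can then be fetched one by one (e.g. with urllib or requests) and the files registered in the database alongside their manuscript shelfmarks.</p>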
        <p>In the next section, we will discuss the segmentation (§ 3.2) and transcription process (§ 3.3).
For the sake of clarity, we will devote separate subsections to segmentation and
transcription of the Vetus Testamentum (§§ 3.2.1, 3.3.1) and of Kennicott’s manuscripts (§§ 3.2.2, 3.3.2),
respectively.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Segmentation</title>
        <p>Once we uploaded the images of the Vetus Testamentum and the manuscripts onto eScriptorium,
we proceeded with segmentation, which is indispensable for identifying those regions on the
page where the text to be transcribed is located.</p>
        <p>For both segmentation and transcription, we used models in the .mlmodel format trained
with the Kraken software. These models can be trained with Kraken and then imported into
eScriptorium. Alternatively, as in our case, they can be trained directly within the eScriptorium
application.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Vetus Testamentum</title>
          <p>The layout of the Vetus Testamentum is complex, but the segmentation was relatively
straightforward. We started by defining a segmentation ontology, distinguishing running headers,
titles, left and right main columns, and left and right apparatus for both region types and line
types. Following this, we manually segmented approx. 30 pages and trained a model on this
sample. With this model, we were able to automatically segment the entire first volume,
keeping manual corrections to a bare minimum. Fig. 3 shows an example of segmentation of regions
(3a) and lines (3b) taken from the first volume.</p>
          <p>As can be seen, eScriptorium provides an intuitive graphical interface that allows the
creation of an ontology to distinguish between different types of regions and lines, which are
represented by different colors. This feature is extremely useful, as it enabled us to mark only
the portions of text for which we wanted to obtain a transcription, namely the reference text
(the two regions at the top) and the critical apparatus (the two regions at the bottom), while
excluding titles, headers, page numbers, and catchwords.</p>
          <p>Similarly, by marking the types of lines, we can express the order of columns and the textual
flow. This permitted us to differentiate between lines we need to transcribe and those we do
not (e.g., the Samaritan text with its variants, see Fig. 2).</p>
          <p>In addition to its user-friendly graphical interface, eScriptorium offers a rich API, which
makes it possible to automate numerous segmentation- and transcription-related operations.
Using the API functions, we opted to replace the polygonal line boundaries with
parallelogrammatic ones, as they were found to enhance transcription accuracy (Fig. 4).</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Kennicott’s manuscripts</title>
          <p>We have applied the same segmentation procedures to the medieval manuscripts. Unlike the
Vetus Testamentum, which required us to create our own models from scratch, there already
exist excellent segmenters as well as recognizers for Hebrew manuscripts, and ongoing research
in this area continually improves their accuracy.¹⁴ Only occasionally, for manuscripts with a
less regular layout, did we have to train new models on top of these standard models.</p>
          <p>An instance of automatic segmentation for one of the 10 manuscripts in our possession can
be seen in Fig. 5.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Transcription</title>
        <p>We proceeded with transcription next. Currently, we have completed the transcription of both
the reference text and the critical apparatus of Kennicott’s first volume, and we are now in the
process of transcribing the manuscript texts.
¹⁴The segmentation models we used are accessible here: https://github.com/dstoekl/sofer_mahir and the
recognition models here: https://zenodo.org/record/5167263#.YhzNEtIo-po.
Figure 3: Layout analysis from the collation of the book of Genesis. (a) Region segmentation; (b) Line segmentation.</p>
        <p>Figure 4: (a) Before repolygonization; (b) After repolygonization.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Vetus Testamentum</title>
          <p>Transcribing the text of the Vetus Testamentum posed numerous challenges. The reference
text and the critical apparatus follow two distinct textual flows, each with its own peculiarities
and complexities, and require different treatments. We opted, therefore, to transcribe them
separately.</p>
          <p>The main complexity of the critical apparatus lies in the presence of two different
alphabets (Hebrew and Latin) with distinct directionality (right-to-left and left-to-right), as well as
of punctuation, numbers, and special symbols that require exact reproduction for proper
parsing (§ 3.4.1). Dealing with directionality proved particularly demanding, since RTL and LTR
markers are invisible and therefore difficult to manage during correction. We successfully
overcame this obstacle by employing a visible LTR marker to establish proper word order. After
transcribing manually a sample of about 30 pages and training a recognition model on these
sample pages, we finally managed to achieve a satisfactory accuracy of approx. 98%.¹⁵ Thanks
to the introduction of the LTR marker, the resulting transcriptions became much more
manageable to correct.</p>
          <p>Transcribing the reference text, on the other hand, proved notably smoother, since it is in a
single alphabet, Hebrew,¹⁶ and since it reproduces a standard text, that of the Hebrew Bible, for
which excellent transcription models, as we mentioned (§ 3.2.2), already exist. The combination
of these features resulted in an accuracy of 98%.
¹⁵From here on, the accuracy percentages we provide for transcription are based on the Character Error Rate (CER)
metric, which is the one used by Kraken.
¹⁶Excluding verse and chapter numbers, which were added in post-processing, see § 3.4.</p>
          <p>As for the correction of the reference text, we took advantage of the recent integration of
passim’s¹⁷ text-to-text alignment into eScriptorium, which allows loading an external version
of the same text and aligning it with the output of the automatic transcription. This alignment
significantly expedited the correction process: As depicted in Fig. 6, differences between the
aligned versions are highlighted (deletions in red and additions in green), enabling easy
identification of errors as well as variants. The exceptional benefits of this tool are evident, and we
are confident that it will prove immensely helpful also for the correction of the reconstructed
textual witnesses (§ 4).</p>
          <p>Before going on to describe the treatment of medieval manuscripts, it is only right to spend
a few words about the manual correction process, which is by far the most time-consuming for
the human user.</p>
          <p>The graphical interface of eScriptorium is designed to make the manual correction process
easier: As shown in Fig. 7, eScriptorium enables the user to scroll through the text line by
line, with the original image alongside the result of the automatic transcription. Additionally,
eScriptorium allows for the creation of customizable keyboards, which can be used to insert
characters that are not easily reproducible otherwise. This utility proved exceptionally
convenient for correcting the critical apparatus, which, as mentioned, contains many of these special
characters.</p>
          <p>Once we obtained correct transcriptions for both the reference text and the critical apparatus,
we exported them using eScriptorium’s API, so as to have pairs of .txt files (text + apparatus)
for each treated biblical book (Figs. 8a and 8b).</p>
          <p>Finally, we post-processed these files (removing hyphenations, regularizing newlines etc.) to
obtain copies suitable for automatic encoding (§ 3.4). Examples of these post-processed texts
are visible in Figs. 9a and 9b.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Kennicott’s manuscripts</title>
          <p>We are presently working on transcribing the 10 Kennicott manuscripts (Fig. 10), using the
models mentioned in Section 3.2.2. When we have their text, we plan to utilize the same
alignment feature discussed in Section 3.3.1, which was used for transcribing Kennicott’s reference
text. This will help us locate transcription errors more effectively and speed up correction.</p>
          <p>Upon completing the transcription process, we intend to align these texts with the texts
automatically reconstructed from Kennicott’s data. Subsequent sections will explain the details
of this reconstruction.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. XML encoding</title>
        <p>In order to process the data from a scholarly edition, mere machine-readability in the provided
.txt files is insufficient. It is crucial for the data to be machine-actionable to enable automated
processing and analysis. To achieve this essential feature, we opted for XML encoding, the
most widely adopted practice in Digital Scholarly Editing.</p>
        <p>As for the encoding of Kennicott’s reference text, we used simple Python scripts to first
divide the text of each biblical book into its hierarchical units, i.e. chapters, verses, and words.
Then, we compared these segmented texts with a standard digital version of the Bible in order
to determine the exact number of chapters and verses. An extract of encoded reference text is
shown in Fig. 11.</p>
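        <p>The hierarchical encoding step can be sketched as follows (element and attribute names here are illustrative, not the project's actual TEI schema): a verse is tokenized on whitespace and each word wrapped in its own numbered element, so that lemmata can later be addressed by position.</p>
        <p>
```python
import xml.etree.ElementTree as ET


def encode_verse(book: str, chapter: int, verse: int, text: str) -> ET.Element:
    """Encode one verse as an ab element containing numbered w elements.

    Hypothetical element names; a real TEI encoding would follow the
    project's published schema.
    """
    ab = ET.Element("ab", {"n": f"{book} {chapter}:{verse}"})
    for i, token in enumerate(text.split(), start=1):
        w = ET.SubElement(ab, "w", {"n": str(i)})
        w.text = token
    return ab
```
        </p>
        <p>Numbering the words gives each token a stable address (book, chapter, verse, word position), which is exactly what the apparatus mapping in § 4 needs as a foreign key.</p>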
        <p>
          The encoding of the critical apparatus was much more complex. We decided to follow and
extend the methodology outlined in Bambaci [
          <xref ref-type="bibr" rid="ref1">2, 1</xref>
          ], which involves the development of a
rule-based parser for automatic encoding. A detailed account of this methodology can be found
there. In the following subsection, we will highlight the key points necessary for readers to
understand how we manage to obtain XML files out of Kennicott’s critical apparatus.
        </p>
        <p>3.4.1. Parsing the critical apparatus</p>
        <p>
          As anticipated in Section 2, Kennicott’s critical apparatus proves to be highly suitable for
automatic parsing due to its rigorous language and structured presentation of variants. Instead
of using Latin commentary-like notes like De Rossi, Kennicott employs a highly formalized
language, in which each element performs a precise function according to the position it
occupies in the overall structure and according to the class of strings (letters, numbers, symbols) to
which it belongs. Both the position and the class of strings can be “captured” by the rules of
a Context-Free Grammar (CFG), and these rules “fed” to the parser in order to recognize the
function of the individual apparatus components.
        </p>
        <p>Let us consider a fragment of the apparatus as shown in Fig. 12.</p>
        <p>For simplicity, let us focus on the first apparatus entry only:</p>
        <p>5. אלהים — אלהי 109.</p>
        <p>which informs us of the substitution of ‘אלהים’ with ‘אלהי’ in manuscript no. 109. The
philologist will immediately recognise the following elements: the place of variation, expressed
by the verse number (‘5’), separated by a dot (‘.’); the lemma of the reference text, expressed
in Hebrew letters (‘אלהים’) and separated by a horizontal line (‘—’); the variant (‘אלהי’); the
numerical siglum of the manuscript (‘109’); and finally a dot followed by a long white space
(encoded as a tabulation, see below), which closes the apparatus entry.</p>
        <p>
          1 grammar kennicottCFG;
2 all: app;
3 app: loc lem var appSep;
4 loc: verse locSep;
5 lem: w lemSep;
6 var: w wit;
7 verse: NUM;
8 locSep: DOT;
9 w: HEBW;
10 lemSep: DASH;
11 wit: NUM;
12 appSep: DOT TAB;
13 NUM: [
          <xref ref-type="bibr" rid="ref1 ref10 ref2 ref3 ref4 ref6 ref7 ref8 ref9">0−9</xref>
          ]+;
14 HEBW: [\u0590−\u05ff]+;
15 DASH: '—';
16 DOT: '.';
17 TAB: '\t';
18 WHITESPACE : ' ' −&gt; skip;
        </p>
        <p>A CFG such as the one shown in Fig. 13 can be formulated¹⁸ in order to describe this apparatus
entry and instruct the parser on how to recognize its individual elements correctly.</p>
        <p>With the first rule (all) we describe the structure of the entire document, which in our
example consists of a single apparatus entry, which we call app. This in turn consists of a
variant location (loc), a lemma (lem), a separator for the lemma (lemSep), a variant (var), a
witness number (wit), and finally a separator for the apparatus (appSep). A variant location
consists in turn of a sequence of numbers (NUM); lemma and variant contain Hebrew words
(HEBW); separators consist of horizontal bars (DASH), dots (DOT), and tabulations (TAB).</p>
        <p>With the first sequence of rules we established, the so-called parsing rules, we are able to
define the order of succession of the elements (that is, their syntax), as well as to express
their function using “speaking” names that make their meaning explicit for the philologist.
With the second sequence of rules, called tokenization rules, we instead indicate the class of
strings to which the individual elements belong, such as numerals ([0-9]), alphabetical letters
([\u0590-\u05ff]), punctuation etc.</p>
        <p>
          By employing a CFG akin to the illustrated fragment and using the ANTLR4 software [
          <xref ref-type="bibr" rid="ref19">18</xref>
          ],¹⁹ we
were able to automatically encode the entire critical apparatus of the first volume into XML,
with minimum cost and very high accuracy (around 98%).²⁰
¹⁸This CFG is designed just for explanation purposes. The CFG we used to parse Kennicott’s apparatus is much
more complex and will be published, along with all the relevant material, upon completion of the project (§ 4).
¹⁹https://www.antlr.org/.
²⁰This percentage is indicative and is calculated for the book of Genesis by simply dividing the total number of
XML elements correctly assigned by the parser by the total number of XML elements found in this book. To
identify errors and derive the correct elements through subtraction, we use the element &lt;lem&gt; (lemma) as the
unit of measurement. Here is the calculation: In Genesis, the total number of lemmata amounts to 6,866; out
of these, 146 are cases of lemmata erroneously interpreted as readings (&lt;rdg&gt;) due to syntactic ambiguity; the
parser correctly identified 6,720 lemmata; the accuracy is therefore equal to (6,866 − 146) / 6,866 × 100 ≈ 97.87%.
An extract of XML code of the apparatus is shown in Fig. 14.
        </p>
        <p>Once encoded, the reference text and the apparatus are ready for the
reconstruction of the witness texts, which is our ultimate goal.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Text reconstruction</title>
      <p>To reconstruct the witnesses, all variants in the apparatus must first be mapped onto the
reference text, using the lemmata, as it were, as foreign keys. Once the mapping has been performed,
our textual reconstruction proceeds simply by replacing, for each manuscript, the lemma in the
reference text with the variant in the apparatus.</p>
      <p>Reference text and apparatus (Gen 1:5):</p>
      <p>1:5 ויקרא אלהים לאור יום ולחשך קרא לילה...
5. אלהים – אלהי 109. ולחושך 152, 206. בוקר 9.</p>
      <p>Reconstructed text of ms. no. 109:</p>
      <p>1:5 ויקרא אלהי לאור יום ולחשך קרא לילה</p>
      <p>An example of such a procedure is shown in Tab. 2 (see also Fig. 12), where the lemma
‘אלהים’ corresponds to the variant ‘אלהי’ in manuscript no. 109. Textual reconstruction is
straightforward here: using Python, we simply replace the lemma with the variant, as shown.
Cases like this, where each apparatus entry corresponds to one and only one lemma, are the
easiest to deal with and constitute the majority in Kennicott, accounting for about 70% of the
entries.</p>
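      <p>A minimal sketch of this one-lemma-per-entry case in Python (the data structures and names are illustrative assumptions, not the project’s actual code):</p>

```python
# For each manuscript, replace the lemma in the reference text with that
# manuscript's variant reading, leaving all other words untouched.
reference = "ויקרא אלהים לאור יום ולחשך קרא לילה"  # Gen 1:5 (consonantal)
apparatus = {
    # lemma -> {manuscript number: variant reading}
    "אלהים": {109: "אלהי"},
    "ולחשך": {152: "ולחושך", 206: "ולחושך"},
}

def reconstruct(reference: str, apparatus: dict, ms: int) -> str:
    """Rebuild the text of one witness by substituting its variants."""
    return " ".join(apparatus.get(word, {}).get(ms, word)
                    for word in reference.split())

print(reconstruct(reference, apparatus, 109))
```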
      <p>In the remaining 30% of cases, on the other hand, we do not have any lemma provided,
and we need to deduce it from the reference text before we can map the variants. Automatic
deduction was possible in all but 3% of the cases.</p>
      <p>Let us give one example of the most common case (Tab. 3).</p>
      <p>Reconstructed text of mss. nos. 152, 206:</p>
      <p>1:5 ויקרא אלהים לאור יום ולחושך קרא לילה...</p>
      <p>In the apparatus, as shown, the variant ‘ולחושך’ is cited for manuscripts nos. 152 and 206,
but the lemma of the reference text, ‘ולחשך’, is missing. Owing to the very close proximity of the
two words (only one character of difference), the reference is immediate for the human reader,
and for this reason it is omitted. To make this information available to the machine, we use the
Levenshtein distance (or edit distance), which returns the correct solution in our example
(‘ולחושך’, with an edit distance of 1). Such an approach proved to be quite effective for our case
study: 60% of all the variants in Kennicott are in fact graphical variants that involve only a few
letters.</p>
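      <p>The deduction step can be sketched as follows, under the simplifying assumption that candidate lemmata are single words of the verse (the project’s actual implementation may differ):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def deduce_lemma(variant: str, verse_words: list) -> str:
    """Guess the omitted lemma: the reference word closest to the variant."""
    return min(verse_words, key=lambda w: levenshtein(variant, w))

verse = "ויקרא אלהים לאור יום ולחשך קרא לילה".split()
print(deduce_lemma("ולחושך", verse))  # the closest word, at distance 1
```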
      <p>There are cases, however, where this approach returns multiple outputs with equal distance
value, as well as more complex cases where the lemma spans across two or more verses, or
where the lemma is not given explicitly in Hebrew, but is rather described by Latin phrases. At
the time of writing this article, such residual cases account for about 3% of the total, but we are
further improving their automatic treatment.</p>
      <p>Using the procedures just described, we have been able to reconstruct the full text of 114
witnesses of the book of Genesis, including 97 manuscripts and 17 printed editions. We are
now working on generating transcriptions for the entire Enneateuch (from Genesis to Kings,
corresponding to Kennicott’s first volume), which means an average of 100 witnesses per
biblical book and approx. 35,000 manuscript pages obtained in a fully automatic manner (Fig. 15a
and Tab. 4).</p>
      <p>Next, we will align these automatically generated texts with the automatic transcriptions of
Kennicott’s manuscripts, and then we will correct them using the eScriptorium alignment feature
discussed in Section 3.3.1. After correction, we will have approx. 7,500 pages of text, which
will allow us to train new and accurate models for automatic text recognition of Hebrew
manuscripts.</p>
      <p>Finally, once we have the XML files of all the relevant texts (Kennicott’s reference text and
apparatus, the reconstructed witness texts, etc.), we will convert our custom XML language
to the TEI standards using XSLT, in order to ensure data interchangeability and reusability.</p>
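      <p>That conversion amounts to mapping our custom element names onto their TEI equivalents. A Python analogue of the mapping for a single, hypothetical apparatus entry (the input element names are our assumptions; <code>app</code>/<code>lem</code>/<code>rdg</code> and the <code>@wit</code> pointer are standard TEI):</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical custom encoding of one apparatus entry.
custom = ET.fromstring('<entry><lem>אלהים</lem><rdg ms="109">אלהי</rdg></entry>')

# Map it onto the TEI critical-apparatus elements.
app = ET.Element("app")
lem = ET.SubElement(app, "lem")
lem.text = custom.findtext("lem")
for r in custom.findall("rdg"):
    # TEI records witnesses in @wit as pointers to <witness> declarations.
    rdg = ET.SubElement(app, "rdg", wit="#ms" + r.get("ms"))
    rdg.text = r.text

tei = ET.tostring(app, encoding="unicode")
print(tei)  # <app><lem>אלהים</lem><rdg wit="#ms109">אלהי</rdg></app>
```

The production pipeline will perform the same mapping declaratively, with XSLT templates rather than ad hoc code.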
      <p>We plan to make all the data generated throughout the project, from the HTR models to the
XML and text files, publicly available. We envisage publishing all pertinent segmentation and
recognition models on Kraken’s Zenodo repository.21 For the HTR and OCR results, we could
either publish their different milestone stages in a separate repository on Zenodo, with pointers
from Biblissima+ (and, e.g., HTR-United), or directly on Biblissima+. Moreover, we will post all
the relevant material for the project at our GitHub address.22</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We discussed how traditional scholarly editions could offer a viable pathway to improving the
performance of current HTR models. Taking the concrete example of the REK project, we
illustrated how, by encoding the critical apparatus, we were able to generate complete
automatic transcriptions of witness texts, and how we plan to obtain from these transcriptions a
large amount of training data useful for the HTR of biblical Hebrew manuscripts.</p>
      <p>The accuracy values achieved so far are highly encouraging. All the Kraken models for
segmentation and transcription that we used have proven to be exceptionally performant, even
in handling highly complex texts such as the critical apparatus: their overall accuracy is never
lower than 97%.</p>
      <p>The decision to implement a rule-based parser for mining the apparatus has also been fruitful:
thanks to it, we have been able to automatically encode a huge amount of data (more than
65,000 apparatus entries) that it would have been unthinkable to encode manually, and this with
an accuracy of 98%.</p>
      <p>The automatic reconstruction of the texts of the witnesses has been equally efficient, being
fully automatable in 97% of the cases. The remaining portion, which necessitates manual
intervention, is still substantial, considering the number and complexity of the interventions needed for
each biblical book, but we are confident that we can increase the automation of variant
mapping (including, for example, the case of lemmata spanning multiple verses), thereby further
reducing the need for manual correction.</p>
      <p>The data and statistics presented here refer to the first volume of the Vetus Testamentum,
the processing of which is nearing completion. Our intention is to extend the methodology
discussed to the second volume (Fig. 15b), which will allow us to increase the number of
reconstructible witnesses up to 244, for a total of approx. 75,000 manuscript pages (Tab. 4).</p>
      <p>Moreover, we intend to incorporate the remaining 10 of the 20 identified
manuscripts mentioned in Section 3.1. This will enable us to double the quantity of pages with
highly accurate transcriptions, providing us with an augmented ground truth from which to
develop further enhanced HTR models specifically tailored for Hebrew Bible manuscripts.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>We would like to express our sincere thanks to Idan Dershowitz from the University of
Potsdam for his invaluable collaboration, as well as to Uriel Aiskovich for his assistance with the
manuscript segmentation task. We also wish to extend our appreciation for the support kindly
provided by the National Library of Israel’s staff.</p>
      <table-wrap id="tab4">
        <label>Tab. 4</label>
        <table>
          <thead>
            <tr><th>Book</th><th>No. mss</th><th>No. pages</th></tr>
          </thead>
          <tbody>
            <tr><th colspan="3">First volume</th></tr>
            <tr><td>Genesis</td><td>97</td><td>5,820</td></tr>
            <tr><td>Exodus</td><td>103</td><td>5,150</td></tr>
            <tr><td>Leviticus</td><td>101</td><td>3,535</td></tr>
            <tr><td>Numbers</td><td>103</td><td>5,150</td></tr>
            <tr><td>Deuteronomy</td><td>108</td><td>4,644</td></tr>
            <tr><td>Joshua</td><td>65</td><td>1,950</td></tr>
            <tr><td>Judges</td><td>67</td><td>1,943</td></tr>
            <tr><td>I-II Samuel</td><td>67</td><td>4,623</td></tr>
            <tr><td>I-II Kings</td><td>65</td><td>4,745</td></tr>
            <tr><th colspan="3">Second volume</th></tr>
            <tr><td>Isaiah</td><td>72</td><td>3,528</td></tr>
            <tr><td>Jeremiah</td><td>71</td><td>4,473</td></tr>
            <tr><td>Ezekiel</td><td>69</td><td>3,795</td></tr>
            <tr><td>Minor Prophets</td><td>69</td><td>3,243</td></tr>
            <tr><td>Psalms</td><td>102</td><td>6,426</td></tr>
            <tr><td>Job</td><td>87</td><td>2,262</td></tr>
            <tr><td>Proverbs</td><td>76</td><td>1,748</td></tr>
            <tr><td>Megilloth</td><td>126</td><td>4,032</td></tr>
            <tr><td>Daniel</td><td>68</td><td>1,224</td></tr>
            <tr><td>Ezra-Nehemiah</td><td>71</td><td>2,130</td></tr>
            <tr><td>Chronicles</td><td>68</td><td>5,168</td></tr>
            <tr><td>Total</td><td>244</td><td>75,589</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Our research received generous funding from the Agence Nationale de la Recherche as part of
the Programme d’investissements d’avenir within the France 2030 framework, under the
reference ANR-21-ESRE-0005. Additionally, we benefited from funding by the European Union
through the MiDRASH project (ERC, project number 101071829). The views and opinions
expressed in this paper are those of the authors alone and do not necessarily represent those
of the European Union or the European Research Council Executive Agency; neither the
European Union nor the granting authority can be held responsible for them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bambaci</surname>
          </string-name>
          . “
          <article-title>Critical Apparatus as Domain Specific Languages. A Rule-based Parser for Encoding an Eighteenth-Century Collation of Hebrew Manuscripts”</article-title>
          .
          <source>In: International Journal of Information Science and Technology 5.1</source>
          (
          <issue>2021</issue>
          ), pp.
          <fpage>22</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bambaci</surname>
          </string-name>
          . “
          <article-title>Digitizing Kennicott's Collation of the Hebrew Bible: Experiences of Encoding and of Computer-Assisted Stemmatic Analysis”</article-title>
          . In:
          <article-title>Jewish Studies in the Digital Age</article-title>
          . Ed. by
          <string-name>
            <given-names>G.</given-names>
            <surname>Zaagsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stökl Ben Ezra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miriam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Michelle</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Amalia S</surname>
          </string-name>
          .
          <article-title>Studies in Digital History and Hermeneutics 5</article-title>
          . De Gruyter,
          <year>2022</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>334</lpage>
          . doi: 10.1515/9783110744828-014.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bambaci</surname>
          </string-name>
          . “
          <article-title>Is a Stemma Possible for the Hebrew Bible? Towards a Genealogy of Medieval Manuscripts Through Phylogenetic Analysis”</article-title>
          .
          <source>In:Materia Giudaica - Rivista dell'Associazione Italiana per lo Studio del Giudaismo Xxvi</source>
          .
          <volume>2</volume>
          (
          <issue>2021</issue>
          ), pp.
          <fpage>3</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Barthélemy</surname>
          </string-name>
          . “
          <article-title>Les manuscrits médiévaux et le texte tibérien classique”</article-title>
          . In:
          <source>Critique textuelle de l'Ancien Testament, 3. Ézéchiel, Daniel et les 12 Prophètes</source>
          . Vol.
          <volume>3</volume>
          . Orbis Biblicus et Orientalis 50. Fribourg/Göttingen: Éditions Universitaires/Vandenhoeck &amp; Ruprecht,
          <year>1992</year>
          , pp.
          <fpage>xix</fpage>
          -xcvi.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Boillet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-L.</given-names>
            <surname>Bonhomme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stutzmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Kermorvant</surname>
          </string-name>
          . “HORAE:
          <article-title>An annotated dataset of books of hours”</article-title>
          .
          <source>In: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing</source>
          .
          <year>2019</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          . doi: 10.1145/3352631.3352633.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>P. G. Borbone.</surname>
          </string-name>
          “Appendice - La tradizione medievale”.
          <source>In: Il libro del profeta Osea - Edizione critica del testo ebraico. Torino: Zamorani</source>
          ,
          <year>1990</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clausner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pletschacher</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Antonacopoulos</surname>
          </string-name>
          . “
          <article-title>Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments”</article-title>
          .
          <source>In: Proceedings of the 2011 International Conference on Document Analysis and Recognition</source>
          .
          <year>2011</year>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>52</lpage>
          . doi: 10.1109/icdar.2011.19.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          . “
          <article-title>The 'Masoretic Text' and the Extent of Its Influence on the Transmission of the Biblical Text in the Middle Ages”</article-title>
          . In: Studies in Bible and Exegesis. Ed. by U. Simon.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          Vol.
          <volume>2</volume>
          . Ramat Gan: Bar Ilan University Press,
          <year>1986</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <surname>G. B. De Rossi</surname>
          </string-name>
          .
          <article-title>Scholia critica in V.T. libros, seu supplementa ad varias sacri textus lectiones</article-title>
          .
          <source>Parma: Ex regio typographeo</source>
          ,
          <year>1798</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>G. B. De Rossi</surname>
          </string-name>
          .
          <article-title>Variae lectiones Veteris Testamenti</article-title>
          .
          <source>Parmae: Ex regio typographeo</source>
          ,
          <year>1784-1788</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Elliger</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          .
          <source>Biblia Hebraica Stuttgartensia. 5th ed. Stuttgart: Deutsche Bibelgesellschaft</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Penkower</surname>
          </string-name>
          .
          <article-title>“A Sheet of Parchment from a 10th or 11th Century Torah Scroll: Determining its Type among Four Traditions (Oriental, Sefardi, Ashkenazi, Yemenite)”</article-title>
          .
          <source>In: Textus 21.1</source>
          (
          <issue>2002</issue>
          ), pp.
          <fpage>235</fpage>
          -
          <lpage>264</lpage>
          . doi: 10.1163/2589255x-02101012.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colutto</surname>
          </string-name>
          , G. Hackl, and
          <string-name>
            <surname>G. Mühlberger.</surname>
          </string-name>
          “
          <article-title>Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents”</article-title>
          . In: 1st International Workshop on Open Services and
          <article-title>Tools for Document Analysis</article-title>
          ,
          <source>14th IAPR International Conference on Document Analysis and Recognition</source>
          ,
          <string-name>
            <surname>OSTICDAR</surname>
          </string-name>
          <year>2017</year>
          , Kyoto, Japan, November 9-
          <issue>15</issue>
          ,
          <year>2017</year>
          . Vol.
          <volume>04</volume>
          .
          <year>2017</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          . doi: 10.1109/icdar.2017.307.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kennicott</surname>
          </string-name>
          .
          <article-title>Vetus Testamentum Hebraicum cum variis lectionibus</article-title>
          . Vol.
          <volume>1</volume>
          . Oxford: Clarendon,
          <year>1776</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kennicott</surname>
          </string-name>
          .
          <article-title>Vetus Testamentum Hebraicum cum variis lectionibus</article-title>
          . Vol.
          <volume>2</volume>
          . Oxford: Clarendon,
          <year>1780</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          . “
          <article-title>Kraken - An Universal Text Recognizer for the Humanities”</article-title>
          . In:
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tissot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stokes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Stökl Ben Ezra</surname>
          </string-name>
          .
          <article-title>“eScriptorium: An Open Source Platform for Historical Document Analysis”</article-title>
          .
          <source>In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)</source>
          . Vol.
          <volume>2</volume>
          .
          <year>2019</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          . doi: 10.1109/icdarw.2019.10032.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Parr</surname>
          </string-name>
          .
          <source>The Definitive ANTLR 4 Reference</source>
          . Dallas/Raleigh: Pragmatic Bookshelf,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Penkower</surname>
          </string-name>
          .
          <article-title>“A Tenth-century Pentateuchal MS from Jerusalem (MS C3), Corrected by Mishael ben Uzziel”</article-title>
          .
          <source>In: Tarbiz 58.1</source>
          (
          <issue>1988</issue>
          ), pp.
          <fpage>49</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A. P.</given-names>
            <surname>Goldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Norton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van der Kooij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pisano</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. De Waard</surname>
          </string-name>
          , and R. D. Weis, eds. Biblia Hebraica Quinta.
          <article-title>General Introduction and Megilloth</article-title>
          .
          <source>Stuttgart: Deutsche Bibelgesellschaft</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schomaker</surname>
          </string-name>
          . “
          <article-title>Design considerations for a large-scale image-based text search engine in historical manuscript collections”</article-title>
          .
          <source>In: it - Information Technology 58.2</source>
          (
          <issue>2016</issue>
          ), pp.
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          . doi: 10.1515/itit-2015-0049.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Segal</surname>
          </string-name>
          . “
          <article-title>Methodological Considerations in the Preparation of an Edition of the Hebrew Bible”</article-title>
          . In:
          <source>The Text of the Hebrew Bible and Its Editions</source>
          . Leiden, The Netherlands: Interactive Factory,
          <year>2017</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stokes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stökl Ben Ezra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tissot</surname>
          </string-name>
          , and
          <string-name>
            <surname>E. Gargem.</surname>
          </string-name>
          “
          <article-title>The eScriptorium VRE for Manuscript Cultures”</article-title>
          . In: Ancient Manuscripts and Virtual Research Environments, Special issue of Classics 18 (
          <year>2021</year>
          ). Ed. by
          <string-name>
            <given-names>C.</given-names>
            <surname>Clivaz</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Allen</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Stokes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stökl Ben Ezra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiessling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tissot</surname>
          </string-name>
          .
          <article-title>EScripta: A New Digital Platform for the Study of Historical Texts</article-title>
          and Writing.
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .34894/bixswx.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <article-title>“BiblIA - A General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset”</article-title>
          .
          <source>In: The 6th International Workshop on Historical Document Imaging and Processing</source>
          . HIP ’21
          . New York, NY, USA: Association for Computing Machinery,
          <year>2021</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          . doi: 10.1145/3476887.3476896.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>“Digital Editions as Distant Supervision for Layout Analysis of Printed Books”</article-title>
          . In:
          <source>Document Analysis and Recognition - ICDAR 2021</source>
          .
          <year>2021</year>
          , pp.
          <fpage>462</fpage>
          -
          <lpage>476</lpage>
          . doi: 10.1007/978-3-030-86331-9_30.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>