Computational Paleography of Medieval Hebrew Scripts⋆ Berat Kurar-Barakat∗ , Daria Vasyutinsky-Shapira, Sharva Gogawale, Mohammad Suliman and Nachum Dershowitz Tel Aviv University Abstract We present ongoing work as part of an international multidisciplinary project, called MiDRASH, on the computational analysis of medieval manuscripts. We focus here on clustering manuscripts written in Ashkenazi square script using a dataset of 206 pages from 59 manuscripts. Collaborating with expert paleographers, we identified ten critical features and trained a multi-label CNN, achieving high accuracy in feature prediction. This should make it possible to computationally predict the subclusters already known to paleographers and those yet to be discovered. We identified visible clusters using PCA and 𝜒 2 feature selection. In future work, we aim to enhance feature extraction using deep learning algorithms and provide computational tools to ease paleographers’ work. We plan to develop new methodologies for analyzing Hebrew scripts and refining our understanding of medieval Hebrew manuscripts. Keywords Medieval Hebrew manuscripts, computational paleography, convolutional neural networks, image clus- tering, recurrent neural networks 1. Introduction “MIDRASH: Migrations of Textual and Scribal Traditions via Large-Scale Computational Anal- ysis of Medieval Manuscripts in Hebrew Script,” supported by an ERC Synergy grant, is an international effort to develop a revolutionary, computational approach to manuscript studies. Among other aspects, it combines traditional, digital, and computational paleographic meth- ods to refine and potentially rewrite our understanding of Hebrew scripts, particularly their geographical variation in scribal practices [8]. The project is led by Daniel Stökl Ben Ezra (École pratique des hautes études [EPHE], Paris Sciences-Lettres University), Judith Olszowy- Schlanger (École pratique des hautes études, Paris Sciences-Lettres University and Oxford Uni- versity), Nachum Dershowitz (Tel Aviv University [TAU]), and Avi Shmidman (Bar-Ilan Univer- sity [BIU]), with the participation of the National Library of Israel (NLI) and Haifa University. The main goal of the project is to develop new methodologies for studying medieval Hebrew manuscripts. In addition to employing handwriting text recognition to extract text from images, We will analyze these manuscripts using computational tools from paleographic, codicological, CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark ∗ Corresponding author. £ berat@tauex.tau.ac.il (B. Kurar-Barakat); dariashap@tauex.tau.ac.il (D. Vasyutinsky-Shapira); sharvag@mail.tau.ac.il (S. Gogawale); suliman@mail.tau.ac.il (M. Suliman); nachum@tau.ac.il (N. Dershowitz) ç https:cs.tau.ac.il/~berat (B. Kurar-Barakat) ȉ 0000-0002-7240-7286 (B. Kurar-Barakat) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 707 CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings linguistic, and literary perspectives. This analysis will contribute to the understanding of the manuscripts’ materiality, textuality, transmission, and the historical and intellectual context of their creation and readership. By combining traditional philology with machine learning, computer vision, and computational linguistics, we will process large amounts of textual and paleographical data that traditional philology cannot handle. See Figure 1. Figure 1: Synergy in computational manuscriptology. The principal aims include: 1. Develop optical character recognition (OCR) algorithms to convert manuscript images into searchable text. 2. Implement text mining algorithms to compare a large corpus of texts and identify quo- tations, paraphrases, borrowings, allusions, and other intertextual relationships. 3. Train machine learning models to perform handwriting analysis and predict each manuscript’s geographical and temporal origins. 4. Design natural language processing (NLP) algorithms to extract and analyze linguistic features for improved textual searches and historical context placement. 5. Integrate traditional and computational methodologies for paleographic, philological, and textual analysis. 2. Computational Paleography Tasks Accessing manuscripts’ textual and non-textual information is valuable only if we can under- stand the texts in their specific context of place and time. Out of all the medieval Hebrew 708 manuscripts, only about 3,500 are dated and have colophons (scribes’ notes) or other identify- ing marks. Paleography (the study of handwriting) and codicology (the study of the physical aspects of books) are the primary methods used to determine the provenance of manuscripts. SfarData (sfardata.nli.org.il) is the only existing database that focuses primarily on codicology for dated medieval Hebrew manuscripts. However, reliance on paleography is essential for a project, like ours, studying document im- ages. The MiDRASH traditional paleography team aims to make precise regional and chrono- logical classifications. They scan well-defined manuscript samples to find correlations between their textual features and scripts. We use HebrewPal (hebrewpalaeography.com), an ongoing effort to build a comprehensive database of Hebrew paleography. Processing this data involves synergetic collaboration between the traditional and computational paleography teams. As the computational paleography team, we are currently working on solving the prob- lem of finding subgroups among the Ashkenazi square script documents. Paleographers de- scribe medieval Hebrew manuscripts according to their script mode (square, cursive, and semi- cursive) and geographical type. The six geographical types are Oriental (Egypt, Palestine, Syria, Lebanon, Iraq, Iran, Uzbekistan, and Bukhara, Eastern Turkey), Sephardic (the Iberian Penin- sula, Provence and Languedoc, North Africa, and Sicily), Italian, Ashkenazi (France and Eng- land, the Holy Roman Empire, Central and Eastern Europe), Byzantine (Greece, the Balkans, Western Asia Minor, and regions surrounding the Black Sea), and Yemenite (Figure 2). This level of codicological classification for Hebrew manuscripts and initial automatic dating has already been successfully performed using computational means [9, 10, 5]. Ashkenazi Byzantine Italian Oriental Yemenite Sephardic Figure 2: Medieval Hebrew script types in square mode. Within certain script type-modes, there are distinct subclusters. Only in rare cases have these subclusters been relatively well-studied. This is, for example, the case of the Ashkenazi square script, which has been well-studied and clustered [1, 4, 3, 6]. However, even the most experi- enced paleographers are most familiar with the manuscripts they work with frequently, and no human memory can retain thousands of script examples. Moreover, the variations within some script type-modes are very subtle. Therefore, we are working to develop computational methods to identify clusters, and subclusters within different script types that have yet to be discovered (or those that are not discoverable) by paleographers. This work will contribute to identifying the place of copying for manuscripts of unknown provenance, more exactly than the current results [2]. 709 Figure 3: Sample page images from the ASC dataset. 3. Data The Ktiv project at the National Library of Israel has led a significant digitization campaign of Hebrew-character manuscripts from collections worldwide. It has accumulated more than 80% of extant manuscripts, making tens of thousands of manuscripts accessible via a unified catalog. The Friedberg Genizah Project has contributed images and metadata of approximately 350,000 fragments from medieval book and document depositories, known as genizot (geniza or genizah in the singular. This digital corpus serves as the source material for our project. It includes relatively well-preserved scrolls and codices, as well as hundreds of thousands of fragments. For the clustering task, we used high-resolution pages from well-preserved manuscripts. We are using a dataset of images built for us by Judith Olszowy-Schlanger specifically for the Ashkenazi-square clustering problem, a style for which she is the leading expert. She challenged us (as part of the MiDRASH project) with the task of automatically clustering within this specific type-mode, and potentially revealing additional subclusters as yet uncat- egorized by traditional Hebrew paleography. This “ASC” dataset, publicly available online at github.com/TAU-CH/midrash_ASC_dataset, contains 206 images, each depicting part of a page from 59 manuscripts, with approximately four pages from each manuscript (Table 1). It also includes an annotation file for the bounding boxes of the main text regions and text lines. Sam- ples are unlabeled, but it is known that 17 manuscripts are from Germany and 11 are from France, while the origins of the remaining 31 manuscripts are currently unknown (Figure 3). All the manuscripts are written in Ashkenazi square script, and we aim to discover potential subclusters within these manuscripts based on their script types, which have slight variations. Table 1 Statistics of the ASC Dataset Germany France Unknown Total Manuscripts 17 11 31 59 Pages 62 35 109 206 Text regions 136 61 260 457 Text lines 4413 1799 8080 14292 710 4. Methods and Results Our preliminary work focused on clustering medieval manuscripts written in Ashkenazi square script using the ASC dataset. Conventional computational methods, such as the bag-of-words approach, struggle to identify the intricate features necessary for effective paleographic clus- tering, as the frequency of occurrence of paleographical features varies even within the same script type. To address this, we had expert paleographers identify ten critical features that they use in their analyses of this script type. The ten features identified in this way are vocalization marks, left (end of line) justification, vertical stretch, strings, short descenders, fishtails, left slant, biting, nesting, and shading (Figure 4). Figure 4: Figure showing all ten features identified in the dataset. Each image patch contains a spe- cific feature, highlighted by green circles. Data annotation was facilitated using the Hasty AI assisted annotation tool (hasty.cloudfactory.com). 4.1. Predicting Paleographical Features We trained a multi-label VGG-19 network [7] model to predict the presence of these features on a given page image. Treating this as a multi-label problem allowed us to account for the coexistence of multiple labels, as their spatial location and frequency of occurrence are not crucial for paleographical definition. In order to prevent overfitting, we utilized the regularization approach known as early stop- ping. This technique stops the training process when the model’s performance on the valida- tion set stops improving, thus preventing the model from simply memorizing the training data. As a result, we ended the training when the validation loss reached 0.12, resulting in the model achieving its best validation performance (see Figure 5). To evaluate model performance on unseen data, we split the dataset at the manuscript level (Figure 6). Splitting at this level ensures that entire manuscripts, rather than individual pages, were held out for testing. Unseen testing involves evaluating the model on data that contains patterns not seen during training. This is important for ensuring that the model can generalize well to unseen data, like in the real-world scenarios where new manuscripts are encountered. The prediction performance on the unseen test set, as shown in the bar graph, demonstrates that our model can effectively automate the tasks performed by a paleographer, achieving ac- curacy levels of 98%. The performance graph (Figure 7) shows three types of F1 scores: macro 711 Figure 5: Training and validation loss across epochs, demonstrating the application of early stopping to achieve the model with the best validation performance when the validation loss reached 0.12. Figure 6: Pie chart showing the split percentages of the dataset at the manuscript level, used for unseen testing to ensure the model’s generalization capabilities on unseen data. average, micro average, and weighted average. The macro average F1 score calculates the F1 score for each class individually and then takes their average. It gives equal importance to all classes, regardless of their size. Therefore, every class contributes equally to the final score. The micro average F1 score combines the contributions of all classes to calculate the F1 score. The classes with larger samples influence the micro average F1 more. The weighted average F1 score calculates the F1 score for each class and then calculates the average, weighted by the number of samples in each class. This gives a balanced view by considering the size of each class, ensuring that larger classes have more influence on the final score. During training, we monitored the prediction accuracy for each of the ten labels to identify the ease or difÏculty of learning specific features (Figure 8). For instance, we observed that the “left slanted” feature took longer to learn due to its non-binary nature. It eventually achieved high accuracy because of its frequent occurrence in a single-page image. On the other hand, features such as “nesting,” “shading,” and “string” also took longer to learn due to their gradual values and finally resulted in lower accuracies due to their less frequent appearances. 712 Figure 7: Bar chart showing the average F1 scores for the prediction performance on the unseen test set, demonstrating the model’s effectiveness in automating a paleographer task with an accuracy level of 98% Figure 8: Feature-wise training F1 scores through epochs, showing the learning progress for each of the ten labels. The “left slanted” feature, despite taking longer to learn, eventually achieved high accuracy, while features such as “nesting,” “shading,” and “string” features exhibited lower accuracies due to their gradual values and less frequent occurrences. 4.2. Exploring Subclusters To identify the subclusters, we performed a brute-force search to find the feature combinations that lead to the most cohesive subclusters. Principal component analysis (PCA) was used to visualize the samples in 2D and identify potential clusters (Figure 9). In Figure 10, you can see a sample page labeled for its regional origin by a colored frame in each cluster. We systematically 713 tested all features or selected features and found that 𝜒 2 feature selection led to visible clus- ters. This feature selection process highlighted visible clusters based on the selected features (strings, left slanted, vertical stretch, and nesting), addressing the challenge faced by paleogra- phers who can quickly identify individual features on a single page but struggle to simultane- ously remember and analyze these features across multiple pages to discern grouping patterns. The clustering algorithm mainly successfully grouped manuscripts of known provenance and suggested some meaningful grouping of other manuscripts. One of the main challenges in computational paleography is the time and effort needed to build an initial dataset. We plan to enlarge the existing dataset and experiment on other known clusters (Oriental square and non- square; Sephardic square, non-square, and cursive, etc.) and cluster the lesser studied script types such as Yemenite. We expect this to improve the results and to significantly advance our knowledge of both human and computational paleography. Figure 9: 2D PCA visualization of manuscripts based on 𝜒 2 selected features, highlighting the forma- tion of visible clusters using the identified features (strings, left slanted, vertical stretch, and nesting). Each dot is labeled with the identifier of the corresponding manuscript. 5. Conclusion and Future Work Our approach tackles the challenge encountered by paleographers. They can quickly identify individual features on a single page but find it difÏcult to remember and analyze them across multiple pages to identify grouping patterns. We identified some clusters using this method 714 Figure 10: Sample patches from the manuscripts in each of the subclusters. Frames are color-coded: cyan for Germany, magenta for France, green for England, and yellow for unknown. 715 and provided insights into the paleographic features that drive these formations. Hence, we automated some of the constraints of traditional paleographic analysis. In future work, we aim to explore methods to discover discriminative features besides those defined by paleographers. Assuming that a script type 𝑆 ′ possesses 𝑛 distinct paleographical features that are absent in a baseline script type 𝑆 (Ashkenazi square script, in our case), we will train a multi-label CNN to predict the presence of all 𝑛 features in images of script 𝑆 ′ , while predicting the absence of these features in images of the baseline 𝑆. We can visualize the spatial locations of these 𝑛 features within the images of 𝑆 ′ using gradient-weighted class activation mapping (Grad-CAM). This approach enables us to identify characteristics that may not be immediately apparent to human experts, furthering our understanding of these script types. We plan to incorporate another deep learning architecture to further enhance the representa- tion of handwriting style features. For instance, we will train a sequence-generating recurrent neural network (RNN) on the ordered sequence of contour tip points from letter strokes. The hidden state vectors from the RNN will then be used as embedding vectors, which are expected to capture stylistic features of the handwriting. Acknowledgments Funded by the European Union (ERC, MiDRASH, Project No. 101071829). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. References [1] M. Beit-Arie and E. Engel. Specimens of mediaeval Hebrew scripts. Israel Academy of Sciences and Humanities, 2017. [2] A. Droby, I. Rabaev, D. V. Shapira, B. Kurar-Barakat, and J. El-Sana. “Digital Hebrew Paleography: Script Types and Modes”. In: Journal of Imaging 8.5 (2022), p. 143. [3] E. Engel. “Between France and Germany: Gothic Characteristics in Ashkenazi Script”. In: Manuscrits hébreux et arabes: Mélanges en l’honneur de Colette Sirat. Publications de l’École Pratique des Hautes Études, 2014, pp. 197–219. [4] E. Engel. “Calamus or Chisel: On The History of the Ashkenazic Script”. In: ”Genizat Germania” – Hebrew and Aramaic Binding Fragments from Germany in Context. Leiden, The Netherlands: Brill, 2010, pp. 183–197. [5] B. Madi, N. Atamni, V. Tsitrinovich, D. Vasyutinsky-Shapira, J. El-Sana, and I. Rabaev. “Automated Dating of Medieval Manuscripts with a New Dataset”. In: Workshop on Com- putational Paleography (WCP). 2024, pp. 45–48. [6] J. Olszowy-Schlanger. “The early developments of Hebrew scripts in north-western Eu- rope”. In: Gazette du livre medieval 63.1 (2017), pp. 1–19. 716 [7] K. Simonyan and A. Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: 3rd International Conference on Learning Representations. 2015, pp. 1–14. [8] D. Vasyutinsky-Shapira, B. Kurar-Barakat, S. Gogawale, M. Suliman, and N. Dershowitz. “MiDRASH – A Project for Computational Analysis of Medieval Hebrew Manuscripts”. In: Eurographics Workshop on Graphics and Cultural Heritage. 2024, p. 0. [9] L. Wolf, N. Dershowitz, L. Potikha, T. German, R. Shweka, and Y. Choueka. “Automatic paleographic exploration of Genizah manuscripts”. In: Kodikologie und Paläographie im Digitalen Zeitalter – Codicology and Palaeography in the Digital Age. Norderstedt, Ger- many: Books on Demand, 2011, pp. 157–17. [10] L. Wolf, L. Potikha, N. Dershowitz, R. Shweka, and Y. Choueka. “Computerized paleog- raphy: Tools for historical manuscripts”. In: 18th IEEE International Conference on Image Processing. 2011, pp. 3545–3548. 717