<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computational Paleography of Medieval Hebrew Scripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Berat Kurar-Barakat</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daria Vasyutinsky-Shapira</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sharva Gogawale</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Suliman</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nachum Dershowitz</string-name>
        </contrib>
      </contrib-group>
      <fpage>707</fpage>
      <lpage>717</lpage>
      <abstract>
        <p>We present ongoing work as part of an international multidisciplinary project, called MiDRASH, on the computational analysis of medieval manuscripts. We focus here on clustering manuscripts written in Ashkenazi square script using a dataset of 206 pages from 59 manuscripts. Collaborating with expert paleographers, we identified ten critical features and trained a multi-label CNN, achieving high accuracy in feature prediction. This should make it possible to computationally predict the subclusters already known to paleographers and those yet to be discovered. We identified visible clusters using PCA and χ<sup>2</sup> feature selection. In future work, we aim to enhance feature extraction using deep learning algorithms and provide computational tools to ease paleographers' work. We plan to develop new methodologies for analyzing Hebrew scripts and refining our understanding of medieval Hebrew manuscripts.</p>
      </abstract>
      <kwd-group>
        <kwd>Medieval Hebrew manuscripts</kwd>
        <kwd>computational paleography</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>image clustering</kwd>
        <kwd>recurrent neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>linguistic, and literary perspectives. This analysis will contribute to the understanding of the
manuscripts’ materiality, textuality, transmission, and the historical and intellectual context
of their creation and readership. By combining traditional philology with machine learning,
computer vision, and computational linguistics, we will process large amounts of textual and
paleographical data that traditional philology cannot handle. See Figure 1.
        <list list-type="order">
          <list-item><p>Develop optical character recognition (OCR) algorithms to convert manuscript images into searchable text.</p></list-item>
          <list-item><p>Implement text mining algorithms to compare a large corpus of texts and identify quotations, paraphrases, borrowings, allusions, and other intertextual relationships.</p></list-item>
          <list-item><p>Train machine learning models to perform handwriting analysis and predict each manuscript’s geographical and temporal origins.</p></list-item>
          <list-item><p>Design natural language processing (NLP) algorithms to extract and analyze linguistic features for improved textual searches and historical context placement.</p></list-item>
          <list-item><p>Integrate traditional and computational methodologies for paleographic, philological, and textual analysis.</p></list-item>
        </list></p>
    </sec>
    <sec id="sec-2">
      <title>2. Computational Paleography Tasks</title>
      <p>Accessing manuscripts’ textual and non-textual information is valuable only if we can
understand the texts in their specific context of place and time. Out of all the medieval Hebrew
manuscripts, only about 3,500 are dated and have colophons (scribes’ notes) or other
identifying marks. Paleography (the study of handwriting) and codicology (the study of the physical
aspects of books) are the primary methods used to determine the provenance of manuscripts.
SfarData (sfardata.nli.org.il) is the only existing database that focuses primarily on codicology
for dated medieval Hebrew manuscripts.</p>
      <p>However, reliance on paleography is essential for a project, like ours, studying document
images. The MiDRASH traditional paleography team aims to make precise regional and
chronological classifications. They scan well-defined manuscript samples to find correlations between
their textual features and scripts. We use HebrewPal (hebrewpalaeography.com), an ongoing
effort to build a comprehensive database of Hebrew paleography. Processing this data involves
synergetic collaboration between the traditional and computational paleography teams.</p>
      <p>
        As the computational paleography team, we are currently working on solving the
problem of finding subgroups among the Ashkenazi square script documents. Paleographers
describe medieval Hebrew manuscripts according to their script mode (square, cursive, and
semicursive) and geographical type. The six geographical types are Oriental (Egypt, Palestine, Syria,
Lebanon, Iraq, Iran, Uzbekistan and Bukhara, and Eastern Turkey), Sephardic (the Iberian
Peninsula, Provence and Languedoc, North Africa, and Sicily), Italian, Ashkenazi (France and
England, the Holy Roman Empire, Central and Eastern Europe), Byzantine (Greece, the Balkans,
Western Asia Minor, and regions surrounding the Black Sea), and Yemenite (Figure 2). This
level of codicological classification for Hebrew manuscripts and initial automatic dating has
already been successfully performed using computational means [
        <xref ref-type="bibr" rid="ref10 ref5 ref9">9, 10, 5</xref>
        ].
      </p>
      <p>[Figure 2 panels: Ashkenazi, Byzantine, Italian, Oriental, Yemenite, Sephardic]</p>
      <p>
        Within certain script type-modes, there are distinct subclusters, but only rarely have these
subclusters been studied in depth. The Ashkenazi square script is one such case: it has been
well studied and clustered [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref6">1, 4, 3, 6</xref>
        ]. However, even the most
experienced paleographers are most familiar with the manuscripts they work with frequently, and
no human memory can retain thousands of script examples. Moreover, the variations within
some script type-modes are very subtle. Therefore, we are developing computational
methods to identify clusters and subclusters within different script types that have yet to be
discovered, or cannot be discovered, by paleographers. This work will help
identify the place of copying for manuscripts of unknown provenance more precisely than
current results allow [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>The Ktiv project at the National Library of Israel has led a significant digitization campaign of
Hebrew-character manuscripts from collections worldwide. It has accumulated more than 80%
of extant manuscripts, making tens of thousands of manuscripts accessible via a unified catalog.
The Friedberg Genizah Project has contributed images and metadata of approximately 350,000
fragments from medieval book and document depositories, known as genizot (geniza or genizah
in the singular). This digital corpus serves as the source material for our project. It includes
relatively well-preserved scrolls and codices, as well as hundreds of thousands of fragments.
For the clustering task, we used high-resolution pages from well-preserved manuscripts.</p>
      <p>We are using a dataset of images built for us by Judith Olszowy-Schlanger specifically for
the Ashkenazi-square clustering problem, a style for which she is the leading expert. She
challenged us (as part of the MiDRASH project) with the task of automatically clustering
within this specific type-mode, and potentially revealing additional subclusters as yet
uncategorized by traditional Hebrew paleography. This “ASC” dataset, publicly available online at
github.com/TAU-CH/midrash_ASC_dataset, contains 206 images, each depicting part of a page,
drawn from 59 manuscripts, with approximately four pages from each manuscript (Table 1). It also
includes an annotation file for the bounding boxes of the main text regions and text lines.
Samples are unlabeled, but it is known that 17 manuscripts are from Germany and 11 are from
France, while the origins of the remaining 31 manuscripts are currently unknown (Figure 3).
All the manuscripts are written in Ashkenazi square script, and we aim to discover potential
subclusters within these manuscripts based on their script types, which have slight variations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods and Results</title>
      <p>Our preliminary work focused on clustering medieval manuscripts written in Ashkenazi square
script using the ASC dataset. Conventional computational methods, such as the bag-of-words
approach, struggle to identify the intricate features necessary for effective paleographic
clustering, as the frequency of occurrence of paleographical features varies even within the same
script type. To address this, we had expert paleographers identify ten critical features that they
use in their analyses of this script type. The ten features identified in this way are vocalization
marks, left (end of line) justification, vertical stretch, strings, short descenders, fishtails, left
slant, biting, nesting, and shading (Figure 4).</p>
      <sec id="sec-4-1">
        <title>4.1. Predicting Paleographical Features</title>
        <p>
          We trained a multi-label VGG-19 network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to predict the presence of these features
on a given page image. Treating this as a multi-label problem allowed us to account for the
coexistence of multiple labels, as their spatial location and frequency of occurrence are not
crucial for paleographical definition.
        </p>
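        <p>The multi-label formulation described above can be illustrated with a small, framework-free sketch. It assumes that each page is labeled with a 0/1 vector over the ten features and that the network ends in ten independent sigmoid outputs trained with binary cross-entropy. The source does not state the loss function or the label order, so both are assumptions, and the helper names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

# Feature names come from the text; their order here is an assumption.
FEATURES = [
    "vocalization marks", "left justification", "vertical stretch", "strings",
    "short descenders", "fishtails", "left slant", "biting", "nesting", "shading",
]


def encode_labels(present):
    """Turn a set of feature names into a multi-hot 0/1 target vector."""
    unknown = set(present) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [1.0 if name in present else 0.0 for name in FEATURES]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-feature outputs."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def predict(logits, threshold=0.5):
    """Report every feature whose probability reaches the threshold."""
    return [name for name, z in zip(FEATURES, logits) if sigmoid(z) >= threshold]
```
]]></preformat>
        <p>Because each output is scored independently, a page can carry any combination of features. With all logits at zero, every probability is 0.5 and the loss equals ln 2, whatever the targets are.</p>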
        <p>To prevent overfitting, we used early stopping as a regularization technique. Early
stopping halts training when the model’s performance on the
validation set stops improving, which keeps the model from simply memorizing the training data.
We ended training when the validation loss reached 0.12, the point at which the model
achieved its best validation performance (see Figure 5).</p>
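        <p>The stopping rule can be sketched as follows. This is a minimal illustration that assumes a patience-based criterion; the source does not say which criterion or patience value was used, and the function name and parameters are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs. The weights saved at
    best_epoch are the ones that would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # The loss kept improving often enough: training ran to the end.
    return best_epoch, len(val_losses) - 1
```
]]></preformat>
        <p>A larger patience tolerates longer plateaus at the cost of extra epochs, while a smaller one stops sooner but may miss a later improvement.</p>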
        <p>To evaluate model performance on unseen data, we split the dataset at the manuscript level
(Figure 6). Splitting at this level ensures that entire manuscripts, rather than individual pages,
are held out for testing, so the model is evaluated on handwriting it has not seen during training.
This matters because the model must generalize
to new manuscripts, as it would in real-world use.</p>
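        <p>A manuscript-level split can be sketched as below. The test fraction, the seed, and the mapping from page identifiers to manuscript identifiers are illustrative assumptions, not the split used in the experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def split_by_manuscript(page_to_manuscript, test_fraction=0.2, seed=0):
    """Split page ids so that all pages of a manuscript land on one side."""
    manuscripts = sorted(set(page_to_manuscript.values()))
    rng = random.Random(seed)
    rng.shuffle(manuscripts)
    # Hold out whole manuscripts, at least one.
    n_test = max(1, round(len(manuscripts) * test_fraction))
    test_ms = set(manuscripts[:n_test])
    train = [p for p, m in page_to_manuscript.items() if m not in test_ms]
    test = [p for p, m in page_to_manuscript.items() if m in test_ms]
    return train, test
```
]]></preformat>
        <p>Splitting by page instead would let pages of the same manuscript appear in both sets, so the test score would partly measure recognition of a hand already seen in training.</p>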
        <p>The prediction performance on the unseen test set, as shown in the bar graph, demonstrates
that our model can effectively automate the feature identification otherwise performed by a paleographer, achieving
98% accuracy. The performance graph (Figure 7) shows three types of F1 scores: macro
average, micro average, and weighted average. The macro average F1 score calculates the F1
score for each class individually and then takes their average. It gives equal importance to all
classes, regardless of their size. Therefore, every class contributes equally to the final score.
The micro average F1 score pools the true positives, false positives, and false negatives of all classes
and computes a single F1 score from them, so classes with more samples influence it more. The weighted average
F1 score calculates the F1 score for each class and then averages these scores, weighted by the
number of samples in each class. It thus accounts for class size,
so that larger classes have more influence on the final score.</p>
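        <p>The three averages can be computed directly from per-label counts. The sketch below uses the standard definitions for multi-label 0/1 predictions; the function names are hypothetical and the code is not taken from the project.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_averages(y_true, y_pred):
    """y_true, y_pred: equal-length lists of 0/1 rows, one column per label."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))

    per_label = [f1(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]  # true occurrences per label

    macro = sum(per_label) / n_labels
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    total = sum(support)
    weighted = (
        sum(score * s for score, s in zip(per_label, support)) / total
        if total
        else 0.0
    )
    return {"macro": macro, "micro": micro, "weighted": weighted}
```
]]></preformat>
        <p>When a rare label is predicted poorly, the macro average drops sharply while the micro and weighted averages stay closer to the score of the frequent labels.</p>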
        <p>During training, we monitored the prediction accuracy for each of the ten labels to identify
the ease or difficulty of learning specific features (Figure 8). For instance, we observed that the
“left slanted” feature took longer to learn due to its non-binary nature. It eventually achieved
high accuracy because of its frequent occurrence in a single-page image. On the other hand,
features such as “nesting,” “shading,” and “string” also took longer to learn due to their gradual
values and finally resulted in lower accuracies due to their less frequent appearances.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Exploring Subclusters</title>
        <p>To identify the subclusters, we performed a brute-force search to find the feature combinations
that lead to the most cohesive subclusters. Principal component analysis (PCA) was used to
visualize the samples in 2D and identify potential clusters (Figure 9). In Figure 10, you can see a
sample page labeled for its regional origin by a colored frame in each cluster. We systematically
tested both the full feature set and selected subsets, and found that χ<sup>2</sup> feature selection led to visible
clusters based on the selected features
(strings, left slanted, vertical stretch, and nesting). This addresses the challenge faced by
paleographers, who can quickly identify individual features on a single page but struggle to
simultaneously remember and analyze these features across multiple pages to discern grouping patterns.
The clustering algorithm largely succeeded in grouping manuscripts of known provenance and
suggested some meaningful groupings of other manuscripts. One of the main challenges in
computational paleography is the time and effort needed to build an initial dataset. We plan to
enlarge the existing dataset and experiment on other known clusters (Oriental square and
non-square; Sephardic square, non-square, and cursive, etc.) and cluster the lesser-studied script
types such as Yemenite. We expect this to improve the results and to significantly advance our
knowledge of both human and computational paleography.</p>
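        <p>The feature-selection step can be illustrated with a plain-Python sketch. It assumes that binary feature vectors are scored against class labels, here the known regional origins, which is a guess, since the source does not say which labels were used. The statistic is the usual chi-squared score for non-negative features, where the expected count for a class is the feature total times the class frequency. The function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature column against y."""
    n_samples, n_features = len(X), len(X[0])
    class_rows = defaultdict(list)
    for row, label in zip(X, y):
        class_rows[label].append(row)

    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for rows in class_rows.values():
            observed = sum(row[j] for row in rows)
            # Expected count if the feature were independent of the class.
            expected = feature_totals[j] * len(rows) / n_samples
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(X, y, names, k):
    """Names of the k features with the highest chi-squared score."""
    ranked = sorted(
        zip(names, chi2_scores(X, y)), key=lambda pair: pair[1], reverse=True
    )
    return [name for name, _ in ranked[:k]]
```
]]></preformat>
        <p>The retained columns would then be projected to 2D with PCA for visual inspection. A feature that occurs equally often in every class scores zero and is dropped first.</p>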
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>Our approach tackles a challenge that paleographers encounter: they can quickly identify
individual features on a single page but find it difficult to remember and analyze them across
multiple pages to identify grouping patterns. We identified some clusters using this method
and provided insights into the paleographic features that drive their formation. In doing so, we
alleviated, through automation, some of the constraints of traditional paleographic analysis.</p>
      <p>In future work, we aim to explore methods to discover discriminative features besides those
defined by paleographers. Assuming that a script type <italic>S</italic>′ possesses <italic>n</italic> distinct paleographical
features that are absent in a baseline script type <italic>S</italic> (Ashkenazi square script, in our case), we
will train a multi-label CNN to predict the presence of all <italic>n</italic> features in images of script <italic>S</italic>′,
while predicting the absence of these features in images of the baseline <italic>S</italic>. We can visualize
the spatial locations of these <italic>n</italic> features within the images of <italic>S</italic>′ using gradient-weighted class
activation mapping (Grad-CAM). This approach enables us to identify characteristics that may
not be immediately apparent to human experts, furthering our understanding of these script
types.</p>
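        <p>The combination step of Grad-CAM can be sketched without a deep learning framework. The sketch assumes that the feature maps of a chosen convolutional layer and the gradients of one label's score with respect to those maps have already been extracted from a trained network. It shows the standard Grad-CAM weighting only, not the planned pipeline, and the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one H x W heat map.

    activations[k][i][j]: k-th feature map of the chosen layer.
    gradients[k][i][j]: gradient of the target label's score
    with respect to that feature map.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: the gradient averaged over all spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]

    # ReLU keeps only the regions that raise the label's score.
    return [[max(0.0, v) for v in row] for row in cam]
```
]]></preformat>
        <p>The resulting map has the resolution of the chosen layer and would be upsampled to the page image before being overlaid on it.</p>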
      <p>We plan to incorporate another deep learning architecture to further enhance the
representation of handwriting style features. For instance, we will train a sequence-generating recurrent
neural network (RNN) on the ordered sequence of contour tip points from letter strokes. The
hidden state vectors from the RNN will then be used as embedding vectors, which are expected
to capture stylistic features of the handwriting.</p>
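        <p>The idea of reading an embedding from an RNN's hidden state can be sketched with a single-layer Elman network. The weights below are random and untrained, the layer sizes are arbitrary, and the function names are hypothetical. The sketch only shows how a sequence of (x, y) points is folded into a fixed-length vector, not the sequence-generating model that is planned.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def init_rnn(input_size, hidden_size, seed=0):
    """Random, untrained weights for an Elman RNN (illustration only)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": matrix(hidden_size, input_size),
        "W_h": matrix(hidden_size, hidden_size),
        "b": [0.0] * hidden_size,
    }


def embed_sequence(points, params):
    """Run the RNN over the points and return the final hidden state."""
    hidden = [0.0] * len(params["b"])
    for point in points:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in, point))
                + sum(w * h for w, h in zip(w_h, hidden))
                + b
            )
            for w_in, w_h, b in zip(params["W_in"], params["W_h"], params["b"])
        ]
    return hidden
```
]]></preformat>
        <p>The embedding has the same length for any number of input points, which is what allows strokes of different lengths to be compared in one vector space.</p>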
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Funded by the European Union (ERC, MiDRASH, Project No. 101071829). Views and opinions
expressed are, however, those of the authors only and do not necessarily reflect those of the
European Union or the European Research Council Executive Agency. Neither the European
Union nor the granting authority can be held responsible for them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Beit-Arie</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <article-title>Specimens of mediaeval Hebrew scripts</article-title>
          .
          <source>Israel Academy of Sciences and Humanities</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Droby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rabaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Shapira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kurar-Barakat</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>El-Sana</surname>
          </string-name>
          .
          <article-title>“Digital Hebrew Paleography: Script Types and Modes”</article-title>
          .
          In: <source>Journal of Imaging</source> 8.5
          (
          <year>2022</year>
          ), p.
          <fpage>143</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <article-title>“Between France and Germany: Gothic Characteristics in Ashkenazi Script”</article-title>
          . In:
          <source>Manuscrits hébreux et arabes: Mélanges en l'honneur de Colette Sirat</source>
          . Publications de l'École Pratique des Hautes Études
          ,
          <year>2014</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <article-title>“Calamus or Chisel: On The History of the Ashkenazic Script”</article-title>
          . In:
          <source>“Genizat Germania” - Hebrew and Aramaic Binding Fragments from Germany in Context</source>
          . Leiden, The Netherlands: Brill,
          <year>2010</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Madi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Atamni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tsitrinovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vasyutinsky-Shapira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>El-Sana</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Rabaev</surname>
          </string-name>
          .
          “
          <article-title>Automated Dating of Medieval Manuscripts with a New Dataset”</article-title>
          .
          In: <source>Workshop on Computational Paleography (WCP)</source>
          .
          <year>2024</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Olszowy-Schlanger</surname>
          </string-name>
          .
          “
          <article-title>The early developments of Hebrew scripts in north-western Europe”</article-title>
          .
          In: <source>Gazette du livre médiéval</source> 63.1
          (
          <year>2017</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          . “
          <article-title>Very deep convolutional networks for large-scale image recognition”</article-title>
          .
          In: <source>3rd International Conference on Learning Representations</source>
          .
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          In: <source>Eurographics Workshop on Graphics and Cultural Heritage</source>
          .
          <year>2024</year>
          , p.
          <fpage>0</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Potikha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>German</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shweka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choueka</surname>
          </string-name>
          . “
          <article-title>Automatic paleographic exploration of Genizah manuscripts”</article-title>
          . In:
          <source>Kodikologie und Paläographie im Digitalen Zeitalter - Codicology and Palaeography in the Digital Age</source>
          . Norderstedt, Germany: Books on Demand,
          <year>2011</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Potikha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shweka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choueka</surname>
          </string-name>
          . “
          <article-title>Computerized paleography: Tools for historical manuscripts”</article-title>
          .
          In: <source>18th IEEE International Conference on Image Processing</source>
          .
          <year>2011</year>
          , pp.
          <fpage>3545</fpage>
          -
          <lpage>3548</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>