<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>approaches for medieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>SoftServe Inc</institution>
          ,
          <addr-line>201 W 5th Street, Suite 1550, Austin, TX 78701</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vasyl Stefanyk Carpathian National University</institution>
          ,
          <addr-line>Shevchenka 57, 76018 Ivano-Frankivsk</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Handwritten text recognition and optical character recognition solutions show excellent results on data of the modern era, but their efficiency drops on Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and a review of related works and studies. It presents the steps of dataset development for further training of the models, together with an exploratory data analysis of the processed data. The paper explains the pipeline of deep learning models that extracts text information from document images, from object detection to word recognition using classification models and word-image embeddings. The following results are reported: recall, precision, F1 score, intersection over union, confusion matrices, and mean string distance, with accompanying plots. The implementation is published in a GitHub repository. Keywords: handwritten text recognition, medieval document processing, object detection, image classification, computer vision, machine learning, deep learning, historical document transcription.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Before Gutenberg invented the printing press in 1439, all texts were handwritten. This
complicates reading such documents because of the variety of scripts and individual approaches to
writing text by hand. Some handwritten scripts require a scholar to have a lot of experience to
understand a document's content.
1.1. First-sight overview of the given historical documents
In the case of medieval documents, we must keep in mind the age of such texts, which might be
more than a thousand years. Because of that, some documents have impairments, for example,
lost fragments, stains over the text, etc. Examples of such deficiencies are shown in Figure 1.</p>
      <p>
        The Latin-language act documents, provided by the Center of Medieval Studies, come from the
chancellery of the Carolingian and Ottonian dynasties of the 9th-11th centuries and are publicly
available at [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The material of the documents mostly describes the history of the formation of
Latin church dioceses. Most of the documents are written on a parchment base and are relatively
well preserved. The text is written in Carolingian minuscule script with elements of Carolingian
majuscule of the 8th-9th centuries. An example of such a document is depicted in Figure 2. We
can see that the content is divided into text lines, and the script may vary within the same document.
      </p>
      <p>We can see some structure in the depicted example. The first line of the document almost always has
a distinct script, and it opens with a capital letter "C" written in a decorative style. The
beginning of a document is almost always the same: "In nomine sanctae et individuae trinitatis". The
main content of the document is written after the first line; it describes the main ideas and events.
After the main middle part, there is a section for the initials, decorative elements, and the seal.
Almost every text ends with the word "Amen", including the one in the example.</p>
      <p>You do not need to be a historian to distinguish some of the words, since they are visually clearly
separated. But often it is not so clear whether two parts of a writing are separate words;
for example, "In nomine" at the beginning of the documents is often written as a single word
without any separator. Also, there is a system of word contraction, so some of the words are not
written as they are spelled but are present in a shortened version. A historian studying such
documents needs to know the details of the scribal culture of the time
and place where the document was written.</p>
      <p>The given database includes 31 images of documents and the corresponding texts previously
extracted by historians. The image quality is sufficient to recognize words and conveniently read the
text of the document. The transcribed texts, among everything else, contain the content of
damaged parts and the full expansions of shortened words. Metadata about each document is also
included, for example, date, special notes, etc.</p>
      <p>
        Some of the documents are not originals; they are copies made in later centuries. The
overall style and script of such images are noticeably different, and their preservation state is much
better.
1.2. Related work
Handwritten Text Recognition (HTR) approaches have already done a great job of transcribing
handwritten texts. There are many studies on applying HTR methods to recognize text in
historical documents. Transformer-based models have already shown accurate results in the case
of historical HTR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The authors used 16th-century Latin texts containing around 17,500 text
lines to train a transformer-based model to transcribe the texts.
      </p>
      <p>
        The most common approach in HTR is to extract text lines and recognize text inside the lines
using models trained with the connectionist temporal classification (CTC) loss. Models that use the CTC
loss were originally built for speech recognition problems, but similar approaches are effective in the
case of recognizing handwritten symbols. OrigamiNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposes a solution able to extract
text not only from text lines but also from whole pages. The model architecture consists of a CNN
backbone and a CTC head. A bidirectional LSTM module might be used to make the model
remember previous states. Also, the attention mechanism and language models can be included,
as shown in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The extraction of text lines is preceded by semantic segmentation in most cases [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The most
common approach to achieve the goal is to build a U-Net model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The application of the
segmentation model is useful in cases of clearly distinct text lines, even if their shape is curvy, as in
the case of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which has shown good results in dealing with such scenarios.
      </p>
      <p>
        In the field of object detection, the YOLO architecture shows state-of-the-art results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The large
variety of YOLO models is widely used in real-world scenarios, including inference on edge devices
such as Raspberry Pi boards. The Ultralytics YOLO
implementation allows for easy training and deployment/export of detection models. The efficiency
of using YOLO to detect visual elements of different classes was shown in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Nowadays, YOLO
capabilities are not constrained to object detection only; there are also versions for segmentation,
tracking, and pose estimation problems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Segmentation and detection models can also be used
for document layout analysis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] demonstrates the application of deep learning for the
detection problem. Paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] introduces semantic segmentation using deep learning and artificial
neural networks.
2. Problems to be solved
The majority of HTR and optical character recognition (OCR) models are trained on modern data
whose age is no more than 300 years. In most cases, these handwritten texts are well preserved and
written in languages that are still in use today and have not changed very much. If medieval
data is fed to such models, the processing results are not accurate enough.
      </p>
      <p>The second problem concerns CTC models: these models do not provide information about the
position of extracted words, which is useful for analysis. For historians as end users, it is hard to
know what exact information the model uses to predict a specific word. OCR
systems can solve this task for modern data, but they are not effective with medieval text. Also,
medieval Latin texts often contain a significant number of contracted words, which makes it harder
to transcribe a text character by character. For example, "nostrae" ("ours" in English) is often written
using only 3 characters: "nse".</p>
      <p>To train a word classification model to make accurate predictions, a major part of the vocabulary
must be covered. This causes a problem when there is a need to work with a limited amount
of data.</p>
      <p>This paper proposes an approach that solves these problems in a single modular architecture.
It uses visual object detection models to extract object coordinates and a pair of CNN
models to interpret the detected words.
3. Data
3.1. Development of the training datasets
We have annotated our document images with words and text lines. The word coordinates are defined
as a bounding box (x, y, w, h), where x, y are the coordinates of the box's center and w, h are its width and
height, respectively. An example of a labeled document is depicted in Figure 3. We decided to label
the document using different classes for more accurate further use.</p>
      <p>The ordinary bounding box is not a proper format for text line annotation because of its strict
rectangular shape, whose sides are parallel to the image sides. Often in handwritten documents, not
only medieval ones, the text lines are not perfectly aligned, so another annotation
format is needed. The solution was oriented bounding boxes; the main difference
is that such a box is defined by 8 numbers, (x1, y1, x2, y2, x3, y3, x4, y4), where each (xi, yi) pair gives the coordinates of the
corresponding corner. This format allows for more flexible annotations, which are needed
for extracting text lines from the document image. Also, we can deduce the angle at which the text
line is inclined.</p>
      <p>To train a model to detect word instances inside cropped text lines, we need to create a
separate dataset that contains images of text lines. Our team wrote a Python script that defines which
word belongs to which line using formula (1) as a criterion.</p>
      <p>|w ∩ l| / |w| ≥ 0.5, (1)</p>
      <p>where w is the bounding box of a word, l is the bounding box of a text line, and the |·| operation
represents the area of a 2D shape. The 0.5 threshold value was sufficient for this task. The
visualization of the process of defining element relationships is shown in Figure 4.</p>
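      <p>Criterion (1) can be sketched as a short Python helper. The corner-based (x_min, y_min, x_max, y_max) box format and the function names below are illustrative assumptions, not the paper's actual script:</p>

```python
def box_area(box):
    # box: (x_min, y_min, x_max, y_max)
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def intersection_area(a, b):
    # Area of the overlap between two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def assign_words_to_lines(words, lines, threshold=0.5):
    # Criterion (1): a word belongs to a line if |w ∩ l| / |w| >= threshold.
    assignment = {}
    for wi, w in enumerate(words):
        for li, l in enumerate(lines):
            if intersection_area(w, l) / box_area(w) >= threshold:
                assignment[wi] = li
                break
    return assignment
```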
      <p>To make each text line aligned with the image sides, we need to apply rotation to the cropped text
line image. The center of the cropped image is the center of rotation. Let us consider the top and
bottom sides as vectors pointing right. The text line direction vector is defined as the mean of
the top and bottom vectors. To calculate the angle between the line direction vector and a horizontal
line, formula (2) is used.</p>
      <p>α = arctan((t_y + b_y) / (t_x + b_x)), (2)</p>
      <p>where t_x, t_y are the coordinates of the top direction vector and b_x, b_y are the coordinates of the bottom direction
vector. To keep the data consistent, we need to rotate the words' coordinates as well. To achieve this, a
rotation matrix with an inverted angle is applied to each word center point, while the width and
height keep the same values.</p>
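      <p>A minimal sketch of the angle computation and word-coordinate rotation described above, assuming direction vectors given as (x, y) pairs; `atan2` is used instead of a bare arctangent to stay quadrant-safe:</p>

```python
import math

def line_angle(top_vec, bottom_vec):
    # Formula (2): angle of the mean of the top and bottom direction vectors.
    tx, ty = top_vec
    bx, by = bottom_vec
    return math.atan2(ty + by, tx + bx)

def rotate_word_centers(centers, angle, pivot):
    # Apply the rotation matrix with the inverted angle around the crop center;
    # widths and heights are left unchanged, as in the paper.
    px, py = pivot
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    rotated = []
    for x, y in centers:
        dx, dy = x - px, y - py
        rotated.append((px + dx * cos_a - dy * sin_a,
                        py + dx * sin_a + dy * cos_a))
    return rotated
```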
      <p>For word classification, we need a separate dataset that maps word images to their text
representation. We matched labeled word annotations to words in the transcribed text. Corrupted
words and words carried over to the next line are not included in the classification dataset.
3.2. Exploratory Data Analysis</p>
      <p>In total, the dataset contains 31 documents. Mean width and height of the document images are
2733 and 2246 pixels, respectively. Each document contains 300 words on average, but there are
several outliers containing more than 1200 words. The distribution of word numbers across the
documents is shown in Figure 5. The dataset contains 610 text lines in total and 19.67 lines per
document on average. There are 4800 annotated word objects in the whole dataset. Each text line
contains 20.65 words on average.</p>
      <p>The most common words are function words such as "et", "in", "ad", "cum", and so on. The full
distribution of word occurrences is depicted in Figure 6: 2433 words occur 1-2 times, 1012 words
occur 2-5 times, 289 words occur 5-10 times, 206 words occur 10-25 times, 56 words occur 25-50
times, 17 words occur 50-100 times, and 6 words occur 100-500 times.</p>
      <p>
        Latin has an extensive declension system, so cognate words in different cases are syntactically
similar. To measure the similarity between two words, a modified Hamming distance is used [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>We propose a modified Hamming distance calculation that incorporates length-based constraints
to handle minor structural differences between words. If the absolute length difference exceeds 2, the
function assigns an infinite distance, indicating excessive dissimilarity. For words of identical length,
the traditional Hamming distance is computed as the count of differing characters. If the lengths
differ by one, the function checks whether removing a single character from the longer word results
in a match, assigning a distance of 1 if successful. Additionally, words shorter than six characters are
deemed incomparable to words of six or more characters, further refining the comparison criteria.</p>
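      <p>The rules above can be sketched as follows. The behavior for a length difference of exactly 2 and for same-length pairs with no deletion match is not specified in the text, so this sketch treats those cases as incomparable:</p>

```python
def modified_hamming(a: str, b: str) -> float:
    # Words shorter than six characters are incomparable
    # to words of six or more characters.
    if (len(a) < 6) != (len(b) < 6):
        return float("inf")
    diff = abs(len(a) - len(b))
    if diff > 2:
        return float("inf")
    if diff == 0:
        # Traditional Hamming distance: count of differing characters.
        return sum(ca != cb for ca, cb in zip(a, b))
    if diff == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Distance 1 if deleting one character from the longer word gives a match.
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return 1
    # Unspecified cases (assumption): treated as incomparable.
    return float("inf")
```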
      <p>The total number of similar word pairs is 2406. After merging the similar words (considering
them to be the same), we find that the number of rare words is significantly reduced: only
1868 words occur 1-2 times, and 989 words occur 2-5 times.
4. Deep learning approaches
To reach the goal of extracting text from medieval documents, we developed a pipeline using
detection and classification approaches. We also provide a model that maps word images onto a
linear space, making it possible to find similar words using images only.</p>
      <p>
        To do so, we use several deep learning techniques, which have shown their efficiency in a large
number of domains. Most commonly, we used the transfer learning technique [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
which is commonly used in computer vision.
      </p>
      <p>Of course, not all the transferred weights are relevant, so we might need to further train the model with the
transferred information. Such a training process, in most cases, is much quicker than training a
model from scratch: the transferred weights bring the model's loss value much closer to the optimum
at the start of training.</p>
      <p>
        Also, we used the fine-tuning technique to train our detection model on more specific
data [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
4.1. YOLO for text line detection
YOLO is a widely used detection architecture that can make fast detections in a single model
pass. The Ultralytics implementation of these models makes it easy to use detectors in different
projects.
      </p>
      <p>The line detector has been trained for 300 epochs and reached a result of 98% precision and 93%
recall. To train the model, the rotation augmentation was used, and the rotation angle was in the
range from -5° to 5°. The scaling and translation were also applied for the augmentation.</p>
      <p>The Adam optimization algorithm was used with a scheduled learning rate: 8 ∗ 10−4 as the starting
value and 10−6 as the finish value. The batch size was set to 6. Also, the loss function weights
have to be set for the most effective optimization. Since we predict a single class in the image, the
classification loss is not that important, so its weight was reduced to 1.5. The box loss weight
was raised to 9, and the distribution focal loss weight was set to 1.3.</p>
      <p>The training images were of a square shape of size 700 pixels. The model was trained on RTX
3050 mobile GPU with 4 GB of video memory.</p>
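      <p>For illustration, the described training setup might look roughly like the following Ultralytics call. The dataset YAML name and the exact scaling/translation augmentation values are assumptions (the paper does not state them), and `lrf` is expressed as the final-to-initial learning-rate factor:</p>

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # base checkpoint is an assumption
model.train(
    data="lines.yaml",   # assumed dataset config with a single text-line class
    epochs=300,
    imgsz=700,           # square training images of 700 pixels
    batch=6,
    optimizer="Adam",
    lr0=8e-4,            # starting learning rate
    lrf=1.25e-3,         # final LR factor so that lr0 * lrf = 1e-6
    box=9.0,             # raised box loss weight
    cls=1.5,             # reduced classification loss weight
    dfl=1.3,             # distribution focal loss weight
    degrees=5.0,         # rotation augmentation in [-5°, 5°]
    scale=0.5,           # scaling augmentation (value assumed)
    translate=0.1,       # translation augmentation (value assumed)
)
```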
      <p>Some of the line predictions might intersect each other, so we had to resolve such scenarios as a
post-processing step. We’ve tried several methods to solve this problem. To detect intersections, the
intersection over union (IoU) was used (3).</p>
      <p>IoU(a, b) = |a ∩ b| / |a ∪ b|, (3)</p>
      <p>where a and b are bounding boxes, and the |·| operator represents the area of a shape.</p>
      <p>We chose 0.4 as the threshold value for the IoU of two bounding boxes, and we
consider two shapes to intersect if their IoU value is higher than or equal to the threshold.</p>
      <p>We used the confidence value to resolve intersections for text line detection: if two
bounding boxes intersect, we keep the one with the higher confidence value.</p>
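      <p>Formula (3) and the confidence-based resolution rule can be sketched as follows, under the assumed (x_min, y_min, x_max, y_max) box format:</p>

```python
def iou(a, b):
    # Formula (3): intersection over union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def resolve_by_confidence(boxes, confidences, threshold=0.4):
    # Keep the higher-confidence box of every intersecting pair (IoU >= threshold).
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```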
      <p>After all the intersections are resolved, we need to extend the line bounding boxes to the image's
right and left edges while maintaining the angle of inclination. Our extending method moves the
top and bottom bounding box edges by recalculating their corner coordinates. The normalized horizontal
coordinates become 0 or 1, depending on whether it is the left or right side of the bounding box, and
the new vertical coordinates are calculated using an average of the top and bottom vectors.</p>
      <p>After the bounding box extension is done, the lines are cropped and rotated to equalize the text
they contain. To get the rotation angle, the average inclination of the bounding box’s top and bottom
corners is used. The result of line detection with post-processing is depicted in Figure 7.</p>
      <p>After the post-processing, the cropped lines are used to detect words inside. The cropped lines
are sorted from top to bottom to maintain the right text order.
4.2. YOLO for word detection
For word detection inside the text lines, we fine-tuned the "yolov8m" model (the medium YOLOv8
version). Hyperparameter tuning was the most valuable part of training the model. The model
fine-tuned with default hyperparameter values resulted in too low a recall to be usable.</p>
      <p>We trained our detection model for 400 epochs. Most of the hyperparameters are copied from the
training of text line detection, but with some significant changes. Because only 4 GB of
video memory was available, the batch size was set to 4. The image size was 1024 pixels with a
rectangular shape to maintain the aspect ratio. The augmentation was also changed; only
translation and cropping were applied during the training process. An example of a training batch is
depicted in Figure 8.</p>
      <p>We used a confusion matrix, F1-confidence curve, precision-confidence curve, recall-confidence
curve, and precision-recall curve to compare our trained models and choose the best one for further
use. To deal with the intersections, union resolving has shown the most effective results,
meaning that if two bounding boxes intersect too much, we return the smallest bounding box that
contains both of the intersecting ones. The trained model's results of detecting words inside a
cropped text line are shown in Figure 9.</p>
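      <p>The union resolving rule can be sketched as follows; the function names and the corner-based (x_min, y_min, x_max, y_max) box format are illustrative:</p>

```python
def _inter_over_union(a, b):
    # IoU of two axis-aligned boxes, as in formula (3).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def union_box(a, b):
    # Smallest axis-aligned box containing both a and b.
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def resolve_by_union(boxes, threshold=0.4):
    # Repeatedly merge any pair of boxes whose IoU reaches the threshold.
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if _inter_over_union(boxes[i], boxes[j]) >= threshold:
                    boxes[i] = union_box(boxes[i], boxes[j])
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```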
      <p>After the prediction is done, the words are cropped from the text line images and sorted from left to
right to maintain text order. Then the words are passed to the classification model.</p>
    </sec>
    <sec id="sec-2">
      <title>4.3. Word classification model</title>
      <p>The decision to choose an approach with word classification instead of processing text by characters
was motivated by the large number of abbreviated words in the medieval documents. For example,
the word "nostrae" is often contracted into 3 letters.</p>
      <p>We built our classifier by combining a pre-trained ResNet50 backbone and an MLP head
network. The head network consists of a dropout layer and three linear layers of size 2048,
separated by ReLU activation functions, with a Softmax activation at the end of the
network. The task of the model is to classify the word written in an input image. Compared to CTC
models, the classification architecture is not affected by abbreviated words or the variety of ways
to write characters.</p>
      <p>The training input images are augmented using scaling, rotations, and elastic transformations, and
resized to a shape of 200x200 pixels. Our dataset represents the distribution of words in real human
language, so some words occur with much higher frequency than others; in other words, the dataset
is unbalanced. Augmentation is a crucial part of dealing with highly unbalanced data, and another
step is applying learning weights for each class. The weights are calculated as the inverse of word
occurrences in the dataset, so a word occurring once has weight 1, a word occurring twice
has weight 0.5, and so on. The train/validation split was performed in such a way as to ensure
that words which occur only once are included in the train subset, so that during training the
model encounters representations of all classes. The output size of the model is 1765, which is the
number of known words in our dataset. This value will increase with the size of the training data.</p>
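      <p>The inverse-occurrence class weights can be computed in a couple of lines. In practice, such weights would typically be passed to a weighted cross-entropy loss, though the exact loss wiring here is an assumption:</p>

```python
from collections import Counter

def class_weights(word_labels):
    # Inverse-frequency weights: a word seen once gets 1.0,
    # a word seen twice gets 0.5, and so on.
    counts = Counter(word_labels)
    return {word: 1.0 / n for word, n in counts.items()}
```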
    </sec>
    <sec id="sec-3">
      <title>4.4. Word similarity measure model</title>
      <p>
        The word classification dataset is highly unbalanced, so besides classification, we need
additional tools to recognize words. Our proposed method is to build an embedding model that
maps a given word image into a linear space such that the vector representations of images of similar
words are the closest ones in terms of Euclidean distance. The method is based on the FaceNet
approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>We use a pretrained ResNet50 CNN model as a backbone to extract visual features. The extracted
features are then processed by a network consisting of 3 linear layers with ReLU activation
functions and a single residual connection. The model architecture is shown in Figure 10.</p>
      <p>We used the triplet margin loss as the criterion and the Adam optimizer with a 5 ∗ 10−5 learning
rate. The model was trained for 400 epochs with 400 training triplets per epoch and
a batch size of 16. The image size was set to 120, since it is half the mean size of the cropped
word images. The margin value for the triplet loss function was set to 2. The output size of the
model was set to 64.</p>
      <p>Words whose string distance is less than or equal to 1 (the same word or words differing in 1 symbol) are
defined as similar. So, the training triplets are formed in such a way that the anchor and positive images
represent similar words, and the anchor and negative images represent different words.</p>
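      <p>Triplet formation might be sketched as follows; the `is_similar` check here is a simplified stand-in for the paper's string distance (identical words, or same-length words differing in exactly one character):</p>

```python
import random

def is_similar(a: str, b: str) -> bool:
    # Simplified "string distance <= 1" check (an assumption for illustration).
    if a == b:
        return True
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    return False

def sample_triplet(samples, rng=random):
    # samples: list of (image, word) pairs.
    # Anchor/positive share a similar word; anchor/negative do not.
    while True:
        anchor = rng.choice(samples)
        positives = [s for s in samples
                     if s is not anchor and is_similar(s[1], anchor[1])]
        negatives = [s for s in samples if not is_similar(s[1], anchor[1])]
        if positives and negatives:
            return anchor, rng.choice(positives), rng.choice(negatives)
```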
      <p>
        After the training was done, all the training images were mapped into the embedding space
using the trained model. Part of the 2D t-SNE projection of the embeddings is depicted in Figure
11. For inference, the model receives an unseen image of a word and outputs its embedding
representation; to find the most similar words, we need to find the closest embedding vectors in the
created space. To make the closest-embedding search faster than linear-time brute force, the
Faiss vector database is used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
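      <p>Conceptually, the lookup is a Euclidean nearest-neighbour search; the brute-force baseline below is only a sketch of what a Faiss index accelerates at scale:</p>

```python
def nearest_embeddings(query, database, k=5):
    # Brute-force Euclidean nearest-neighbour search over embedding vectors;
    # in the actual pipeline a Faiss index plays this role.
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    ranked = sorted(range(len(database)), key=lambda i: dist(query, database[i]))
    return ranked[:k]
```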
      <p>The model can be used in scenarios where the word classifier is "not sure" about the word it
sees, meaning the Softmax layer is not pointing to a specific class but rather spreading the
probability among many classes. Our dataset does not cover the whole Latin vocabulary, so such
cases might occur.
4.5. Full pipeline
The full pipeline is built using the models described in the previous sections; the diagram of our pipeline
is depicted in Figure 12. First, we detect text lines in an image of the whole document. After
cropping, the text lines are post-processed and sorted from top to bottom.</p>
      <p>Each cropped line is processed by the word detection model to define the bounds of each word.
Detected objects are post-processed, cropped, and sorted from left to right.</p>
      <p>Each cropped word is classified to define its value. In case of too low a confidence in the classification,
the pipeline returns a list of similar words, which are found by the embedding model. The word
image is embedded by this model, and the algorithm searches for the closest embeddings in a
vector database.</p>
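      <p>The pipeline described above can be summarized as glue code over the trained components; every function argument below is a hypothetical stand-in for one model, not the paper's actual interface:</p>

```python
def transcribe_document(image, detect_lines, detect_words, classify,
                        embed, search_similar, conf_threshold=0.5):
    # Lines top-to-bottom, words left-to-right, classifier first,
    # embedding-based similarity search as the low-confidence fallback.
    words_out = []
    for line in sorted(detect_lines(image), key=lambda l: l["top"]):
        for word in sorted(detect_words(line["crop"]), key=lambda w: w["left"]):
            label, confidence = classify(word["crop"])
            if confidence < conf_threshold:
                # Fall back to the embedding model when the classifier is unsure.
                label = search_similar(embed(word["crop"]))
            words_out.append(label)
    return words_out
```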
      <p>The ordered classifications are returned as a list of predicted words. Because the main users of
our solution are professional historians, the detected objects are also shown for more clarity and
understanding of what is going on.
5. Results
Table 1 shows the results of our trained YOLO text line detection model. Figure 13 and Figure 14
show the precision-recall and F1 curves of the text line and word detection models, respectively. Table
2 and Table 3 show the confusion matrices of the text line and word detection models, respectively.</p>
      <p>To measure the results of the word image embedding model, we used a modified custom
precision metric, which works in the following way: for the resulting embedding vector, find the K
nearest embeddings in the database and calculate the fraction of correct neighbors. Table 4 shows
the connection between the number of nearest words and the modified precision values.</p>
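      <p>The metric can be stated compactly; the (embedding, word) pair layout of the database is an assumption:</p>

```python
def precision_at_k(query_emb, query_word, database, k):
    # database: list of (embedding, word) pairs.
    # Fraction of the K nearest embeddings whose word matches the query word.
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    nearest = sorted(database, key=lambda item: dist(item[0], query_emb))[:k]
    return sum(word == query_word for _, word in nearest) / k
```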
      <p>An example of word recognition can be seen in Figure 15. In this example, you can see the
result of combining the classification and word embedding models.</p>
      <p>To measure the performance of our classification and embedding models as a pair, we used the
mean string distance between predicted and ground truth words. The decision to use this metric was
made because of the high number of similar words in our data, so interpreting, for example, "nostri"
and "nostro" as two different classes is not the best approach. On our data, the classification model
was evaluated to a value of 2.4, meaning that on average the predicted word differs from the ground
truth by 2-3 symbols.
6. Discussion</p>
      <p>
        Our solution shows the effectiveness of this combination of approaches for
transcribing text from handwritten medieval documents. The implementation of the training and
inference processes, the exploratory data analysis, and the dataset building utilities can be accessed in the
GitHub repository [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>6.1. Interpretation of results</title>
        <p>Training the detection models has shown that such a relatively small amount of training data is
enough for them to generalize to unseen scenarios.</p>
        <p>The embedding model maps the syntactic information of a word using its image as input. The 2D
visualization of the training embeddings shows clusters of similar words. The vector database allows
finding the closest vector representations fast enough for convenient usage. The metric results show the
effectiveness of using the embedding model on naturally distributed data.</p>
        <p>The classification model has shown good results, given the nature of the input training
data. It performs well when predicting common words and can reduce the manual
work of professional historians.</p>
        <p>The whole pipeline consists of separate components; in other words, it is modular, which allows
for upgrading the performance of the pipeline by affecting only the components that are responsible for
the features that need to be upgraded.</p>
      </sec>
      <sec id="sec-3-2">
        <title>6.2. Comparison with previous research</title>
        <p>In comparison with previous research, the main advantage of our approach is that it is more specific
to medieval handwriting.</p>
        <p>Our solution is more applicable for professional usage because of its ability to show the transcription
of each individual word, which helps in understanding the whole text. This feature is not present in
CTC-based solutions.</p>
        <p>Text line detection has shown effective results in comparison with image segmentation approaches.
It allows for correctly distinguishing separate lines and easily cropping them. The usage of oriented bounding
boxes is a simple but fundamental decision, which makes the model capable of dealing with
the natural style of handwritten text. Despite a relatively small amount of training data, text line
detection has shown surprisingly good results.</p>
        <p>Our pipeline is better adapted to processing documents in poorer condition, so the solution can
be used for texts with stains, gaps, and other damage. Most of the public data consists of
well-preserved document images, but the solution of this study shows methods to deal with damaged
examples as well. For example, the system avoids recognizing damaged or lost parts of the
document, because only the words that were detected are processed by the last 2 models of the
pipeline.</p>
        <p>The main disadvantage of our approach is that the pipeline models do not process context
information, which might help in understanding the written language rather than separate words. Also, the
system requires more manual work to create the training datasets, especially the detection
ones. In CTC approaches, where only an image and text are required without annotated bounding boxes,
there is no need to annotate each word.
6.3. Implications and limitations
Our work introduces a new approach to historical document processing, combining detection and
recognition techniques. Even with a relatively small amount of data, the models have shown
confident results. The solution can be improved or adapted by providing more annotated data.</p>
        <p>Our classification dataset does not cover the whole Latin vocabulary, which may affect the
recognition of documents with a large number of unseen words. Also, our models are trained on
documents of a specific period and "genre", so our trained models do not apply to
documents of a different style.</p>
        <p>Our classicfiation an d embedding models perform much poorer on rare word;smore annotated
data is needed to improve this behavior.</p>
        <p>Each model in our solution is dependent on the previous one in the pipeline, so if a line is not
detected, we lose each word written in this line. Same for words, each undetected one cannot be
classiefid and transcribed. Because of that, the error might sum up from the start to the end of the
pipeline. Generative AI was not used for the study.</p>
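        <p>The compounding of misses along a sequential pipeline can be illustrated with a
back-of-the-envelope estimate; the per-stage recalls below are hypothetical numbers for
illustration, not measured results:</p>
        <preformat>
```python
def pipeline_recall(stage_recalls):
    """Upper-bound recall of a sequential pipeline: a word is recovered only
    if every preceding stage succeeds, so per-stage recalls multiply."""
    total = 1.0
    for r in stage_recalls:
        total *= r
    return total

# Hypothetical recalls: line detection, word detection, word recognition.
print(pipeline_recall([0.95, 0.93, 0.90]))  # ~0.795, below any single stage
```
        </preformat>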
        <sec id="sec-3-2-1">
          <title>7. Conclusions</title>
          <p>Our team has built an HTR solution suitable for extracting text information from handwritten
medieval Latin documents. To deal with features specific to medieval documents, such as word
contraction and damage, we have come up with our own approach, based on a combination of
detection, classification, and embedding models.</p>
          <p>We annotated training datasets for the detection, classification, and embedding models based on
Latin documents from the 9th to 11th centuries. As a result, we have 31 images with annotated words
and text lines. During post-processing we derived a classification dataset, which maps word images
to corresponding words.</p>
          <p>We trained text line and word detection models to locate and order the objects we want to
recognize. Post-processing steps were developed to transform the detection output into the needed form.
To deal with word contractions, we use word classification. Some words have few occurrences
in our data, so the classification model might struggle to predict them correctly.
Because of this, an embedding model was built, which encodes a cropped word image into an
embedding space. A vector database is then used to find the most similar words.</p>
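          <p>The embedding lookup can be sketched as a cosine-similarity search over L2-normalized
vectors. The snippet below uses a small in-memory index in place of a real vector database, and the
toy 3-dimensional vocabulary vectors are invented for illustration:</p>
          <preformat>
```python
import numpy as np

def build_index(vocab_embeddings):
    """Stack and L2-normalize vocabulary embeddings for cosine search."""
    words = list(vocab_embeddings)
    mat = np.stack([vocab_embeddings[w] for w in words]).astype(float)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    return words, mat

def most_similar(query, words, mat, k=3):
    """Return the k vocabulary words closest to `query` by cosine similarity."""
    q = np.asarray(query, dtype=float)
    q /= np.linalg.norm(q)
    scores = mat @ q                      # cosine similarity per word
    top = np.argsort(scores)[::-1][:k]
    return [(words[i], float(scores[i])) for i in top]

# Toy 3-D "embedding space"; a real model produces higher-dimensional vectors.
vocab = {"dominus": [1.0, 0.1, 0.0],
         "domino":  [0.9, 0.2, 0.1],
         "ecclesia": [0.0, 1.0, 0.3]}
words, mat = build_index(vocab)
# A query identical to the "dominus" vector ranks "dominus" first (score 1.0).
print(most_similar([1.0, 0.1, 0.0], words, mat, k=2))
```
          </preformat>
          <p>Approximate nearest-neighbor indexes in production vector databases trade a little of this
exactness for speed, but the ranking principle is the same.</p>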
          <p>The approach shows good results despite a relatively small amount of training data, which
suggests its efficiency in further use with larger datasets. There is a wide field of future improvements,
such as context-aware text recognition and more detailed document structure analysis.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Declaration on Generative AI</title>
          <p>The authors have not employed any Generative AI tools.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Haus-, Hof- und Staatsarchiv, Salzburg, Erzstift (798-1806), in: Monasterium.net. URL: https://www.monasterium.net/mom/AT-HHStA/SbgE, accessed 2025-03-22.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Haus-, Hof- und Staatsarchiv, Salzburg, Domkapitel (831-1802), in: Monasterium.net. URL: https://www.monasterium.net/mom/AT-HHStA/SbgDK/, accessed 2025-03-22.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] P. B. Ströbel, S. Clematide, M. Volk, "Transformer-based HTR for Historical Documents", arXiv preprint arXiv:2203.11008, 2022. doi:10.48550/arXiv.2203.11008.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Yousef, T. E. Bishop, "OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold", arXiv preprint arXiv:2006.07491, 2020. URL: https://arxiv.org/abs/2006.07491.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] C. Wick, J. Zöllner, T. Grüning, "Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes", arXiv preprint arXiv:2110.05909, 2021. URL: https://arxiv.org/abs/2110.05909.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. Barakat, A. Droby, M. Kassis, J. El-Sana, "Text Line Segmentation for Challenging Handwritten Document Images Using Fully Convolutional Network", arXiv preprint arXiv:2101.08299, 2021. URL: https://arxiv.org/abs/2101.08299.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] O. Ronneberger, P. Fischer, T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation", in: N. Navab, J. Hornegger, W. Wells, A. Frangi (eds.), Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, Lecture Notes in Computer Science, vol. 9351, Springer, Cham, 2015. doi:10.1007/978-3-319-24574-4_28.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 779-788. doi:10.1109/CVPR.2016.91.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Büttner, J. Martinetz, H. El-Hajj, M. Valleriani, "CorDeep and the Sacrobosco Dataset: Detection of Visual Elements in Historical Documents", Journal of Imaging 8(10) (2022) 285. doi:10.3390/jimaging8100285.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Terven, D.-M. Córdova-Esparza, J.-A. Romero-González, "A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS", Machine Learning and Knowledge Extraction 5(4) (2023) 1680-1716. doi:10.3390/make5040083.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] N. Rahal, L. Vögtlin, R. Ingold, "Historical document image analysis using controlled data for pre-training", IJDAR 26 (2023) 241-254. doi:10.1007/s10032-023-00437-8.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Kozlenko, O. Zamikhovska, V. Tkachuk, L. Zamikhovskyi, "Deep learning based fault detection of natural gas pumping unit", in: 2021 IEEE 12th International Conference on Electronics and Information Technologies (ELIT), Lviv, Ukraine, May 19-21, 2021, pp. 71-75. doi:10.1109/ELIT53502.2021.9501066.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Kozlenko, V. Sendetskyi, O. Simkiv, N. Savchenko, A. Bosyi, "Identity documents recognition and detection using semantic segmentation with convolutional neural network", in: 2021 Workshop on Cybersecurity Providing in Information and Telecommunication Systems, CEUR Workshop Proceedings, vol. 2923, Kyiv, Ukraine, Jan. 28, 2021, pp. 234-242. URL: https://ceur-ws.org/Vol-2923/paper25.pdf.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] R. W. Hamming, "Error detecting and error correcting codes", The Bell System Technical Journal 29(2) (1950) 147-160. doi:10.1002/j.1538-7305.1950.tb00463.x.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] F. Zhuang et al., "A Comprehensive Survey on Transfer Learning", Proceedings of the IEEE 109 (2019) 43-76.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Peng, J. Wang, "How to fine-tune deep neural networks in few-shot learning?", arXiv preprint arXiv:2012.00204, 2020. doi:10.48550/arXiv.2012.00204.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Voloshchuk, B. Zarembovska, Carolingus Project, 2025. URL: https://github.com/AIVMZB/Carolingus.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>