<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-automatic pipeline for constructing HTR corpora from Ukrainian-language historical documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Ivasechko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khrystyna Lipianina-Honcharenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Information Technologies, West Ukrainian National University</institution>
          ,
          <addr-line>46000 Ternopil</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Recognition of historical manuscripts is a challenging task due to multilingualism, non-standard spelling, and variability in writing styles, which limits the effectiveness of traditional OCR systems, especially for low-resource languages. This study presents an integrated system for creating a deep learning dataset that combines automated preprocessing, character segmentation, and a collaborative annotation interface. Based on documents from the State Archive of Khmelnytskyi Region, 1684 validated entries with 215 unique symbols in two languages were created. The platform proved user-friendly for untrained users: 48 students participated in collaborative labeling, ensuring high annotation quality. The challenges of segmentation, data variability, and the prospects for expanding the corpus and implementing active learning are discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>historical manuscripts</kwd>
        <kwd>handwritten text recognition</kwd>
        <kwd>dataset creation</kwd>
        <kwd>image segmentation</kwd>
        <kwd>character annotation</kwd>
        <kwd>cultural heritage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic Handwritten Text Recognition (HTR) is a key component in the digital
transformation of historical documents and cultural heritage sources. This issue is particularly
relevant in the context of the humanities, where the preservation, analysis, and reuse of archival
manuscripts are fundamental tasks. Unlike modern printed texts, manuscripts from the 14th to 19th
centuries feature high variability in handwriting, multilingualism, non-standard spelling, rare
fonts, and significant degradation of the media. These factors significantly limit the effectiveness of
traditional Optical Character Recognition (OCR) systems designed for modern printed texts.</p>
      <p>
        One of the major challenges in the field of deep learning for HTR is the limited availability of
high-quality training data, particularly for low-resource languages and historical fonts. Creating
such corpora requires substantial human resources and time. To address this issue, recent studies
propose methods such as semi-automated labeling, active learning, synthetic data generation, and
transformer architectures that rely less on large labeled datasets. At the same time, character
segmentation in irregular historical handwriting remains a critical step that requires further
improvement [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This study presents the development of an interactive system for constructing handwritten text
corpora, combining automated image preprocessing, segmentation, a collective character labeling
interface, and annotation storage in a format suitable for further training deep learning models.</p>
      <p>The system was tested on digitized multilingual documents from the State Archive of
Khmelnytskyi Region, resulting in over 1,600 validated characters and creating a foundation for
future text recognition.</p>
      <p>In this context, the paper analyzes the dataset creation process, the structure of the annotation
interface, the challenges of segmentation quality, and the organization of group collaboration to
enhance labeling reliability. The proposed approach can be scaled for other historical corpora and
serves as a valuable tool for researchers in digital humanities, automated manuscript recognition,
and the creation of intelligent access systems for cultural heritage.</p>
      <p>This article presents a semi-automatic pipeline for constructing HTR corpora from
Ukrainian-language historical documents. Section 2 reviews related work. Section 3 outlines the
system architecture, including the dataset loader, preprocessing procedures, labeling interface,
quality aggregation algorithm, and evaluation metrics. Section 4 presents the testing results,
describes the dataset, and reports on annotation quality based on the defined metrics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        The goal of this study is to develop an interactive system for constructing handwritten text
corpora, which combines automated image preprocessing, segmentation, a collective character
labeling interface, and annotation storage in a format suitable for further training deep learning
models. To justify the design of such a system and identify knowledge gaps, this section
summarizes related works in six directions: (i) the availability and quality of multilingual resources
and the consequences of their scarcity for low-resource languages, including Ukrainian [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; (ii)
end-to-end HTR pipelines for historical documents (layout recognition, segmentation,
transcription) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; (iii) synthetic data generation and augmentation to enhance models in the
absence of labeling [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3–7</xref>
        ]; (iv) interactive and semi-supervised approaches (active/self-training) to
reduce the cost of manual annotation [
        <xref ref-type="bibr" rid="ref8 ref9">8–9</xref>
        ]; (v) the capabilities and limitations of modern MLLMs
and foundation models for manuscripts and related tasks [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14">10–14</xref>
        ]; (vi) methods for targeted
collection, structuring, and tool support for annotations compatible with subsequent ML training
[
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15–20</xref>
        ]. This systematization serves as the methodological foundation for the engineering
decisions in the proposed system (automation of preprocessing and segmentation,
"human-in-the-loop" labeling of symbols, interoperable storage formats), aimed at rapidly building high-quality corpora
for further deep learning.
      </p>
      <p>
        Yu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] conduct a quantitative and qualitative analysis of multilingual NLP resources,
covering 156 public datasets, manually annotating text sources and annotations, creation tools,
tasks, and motivations. The researchers show that simply counting datasets is misleading due to
the predominance of automatically generated and English-translated corpora, and they identify a
correlation between experts' and crowdworkers' assessments of data availability and the actual
existence of resources. The authors' crowdsourcing experiments lead to practical recommendations
for collecting high-quality multilingual data for languages with limited corpora. In their
classification, Ukrainian is categorized as a low-resource language, providing a methodological
basis for justifying the scarcity of Ukrainian-language corpora and planning data collection,
particularly for tasks like fake news detection.
      </p>
      <p>
        The iForal study [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on automating the transcription of historical manuscripts to
facilitate access to cultural heritage. The developed system includes layout recognition,
segmentation, and transcription of text. A corpus of 67 Portuguese charters was used for training.
The system achieved an accuracy of 0.98 mAP@0.50 for layout, 0.91 mAP@0.50 for segmentation,
and 8.1% CER. It reduces the need for expert intervention and supports adaptation to other writing
styles through transfer learning. The dataset has been published in the HTR United catalog for
reuse.
      </p>
      <p>
        The study by Lisa Koopmans [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is dedicated to the automated dating of historical manuscripts
using SVM analysis of texture and grapheme features. To address the issue of limited labeled data,
data augmentation was applied, resulting in a 1–3% increase in accuracy. The models were tested
on several corpora, including the Medieval Paleographical Scale. The authors note the potential for
adaptation to specific handwriting styles to improve accuracy.
      </p>
      <p>
        Lars Vögtlin's work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] presents a framework for generating synthetic historical documents
with accurate labeling based on unlabeled images. A two-step approach is proposed: creating
templates with controlled content and transferring the style of historical scans. This method
ensures realism and accuracy without expert involvement. Pre-training on generated data
outperforms baseline models, opening the way for creating scalable datasets for low-resource
languages and rare writing systems.
      </p>
      <p>
        Wei Chen [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes the Fine-grained Automatic Augmentation (FgAA) method to improve
handwritten text recognition for languages with limited sample data. Unlike traditional approaches
that perform global transformations on words, FgAA operates at the level of individual strokes:
each word is segmented into strokes, approximated by Bézier curves, and local transformations are
applied. Optimal augmentation parameters are automatically selected using Bayesian optimization.
      </p>
      <p>
        In the work by Arthur Flor de Souza Neto [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a systematic review of data augmentation
methods for offline Handwritten Text Recognition (HTR) systems is presented, which are crucial
for improving model quality when labeled data is limited. The review analyzes 32 relevant studies
from 976 found in databases from 2012 to 2023. The authors note that traditional Digital Image
Processing (DIP) is still widely used, although recent interest in Generative Adversarial Networks
(GANs) is increasing, allowing for the synthesis of handwritten text with arbitrary style and
content. The paper also discusses the datasets used and recognition levels in the studies.
      </p>
      <p>
        In Yahia Hamdi's article [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], four data augmentation strategies to improve online handwritten
text recognition (OHR) with small datasets are presented. The proposed approaches include: (1)
geometric transformations (italic angle, scale, tilt), (2) frequency processing of trajectories, (3)
beta-elliptical modeling of writing dynamics, and (4) a hybrid combination of all strategies. The system
was tested on multilingual datasets (Arabic: ADAB, ALTEC-OnDB, Online_KHATT; Latin:
UNIPEN) using a CNN architecture. Results demonstrate significant improvements in recognition
accuracy compared to baseline and contemporary approaches.
      </p>
      <p>
        In the study by Alejandro Héctor Toselli [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], an interactive system for transcribing historical
manuscripts is proposed, which continuously fine-tunes based on user-validated results. The goal is
to reduce user interactions while improving their efficiency. Three approaches are considered:
adaptation through semi-supervised learning, active learning for selecting ambiguous examples,
and error probability assessment for regulating user intervention. Experiments on two historical
documents confirm the effectiveness of the approach.
      </p>
      <p>
        In Fabian Wolf's study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a self-learning method for Handwritten Text Recognition (HTR) and
word search is proposed, which eliminates the need for manual annotation. The baseline model is
trained on synthetic data, after which it generates pseudo-labels for real images and is further
trained on them in a semi-supervised manner. To improve accuracy, mechanisms for filtering
unreliable pseudo-annotations are applied. The proposed approach demonstrates better accuracy
and stability compared to other annotation-free methods.
      </p>
      <p>
        In Shukang Yin's article [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a review of the development of multimodal large language models
(MLLMs), specifically GPT-4V, is presented, which combine language processing with image, video,
and text analysis. The paper discusses architectures, training strategies, types of data, and
evaluation metrics, as well as challenges, including the issue of multimodal hallucinations. The
authors analyze the prospects for expanding MLLMs to new modalities, languages, and application
scenarios, particularly through M-ICL, M-CoT, and LAVR. The review is accompanied by an open
GitHub repository with current research.
      </p>
      <p>
        In the study by Jacob Murel and David Smith [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a method for improving the detection of
handwritten annotations in early printed books based on visual similarity between text samples is
proposed. The authors explore the impact of pseudo-labeled page images on the performance of
manuscript localization models, using pages from copies of Shakespeare's "First Folio."
Self-learning and active learning approaches with pseudo-labels for both positive and negative
examples are compared. The results show a 15% improvement in average accuracy for individual
copies, although the effectiveness on collections from multiple sources was less conclusive.
      </p>
      <p>
        Carina Geldhauser and Konstantin A. Malyshev [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduced a prototype for integrating
Handwritten Text Recognition (HTR) and semi-automated annotation of textual features in the
graphical interface eScriptorium. The solution is aimed at humanists, particularly researchers of
ancient Greek texts (majuscules) who are creating critical editions or digital collections. The
prototype allows for simultaneous transcription and annotation, an important step in
reconstructing textual variants and facilitating the analysis of manuscript traditions, such as in the
study of Homeric or biblical texts.
      </p>
      <p>
        Giorgia Crosilla, Lukas Klic, and Giovanni Colavizza [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] compare the capabilities of
multimodal large language models (MLLMs), such as Claude 3.5 Sonnet, with traditional HTR
systems like Transkribus for handwritten text recognition. Unlike classical models, which require
significant manual annotation, MLLMs can recognize various handwriting styles without
specialized training. Experiments cover modern and historical texts in four languages (English,
French, German, Italian). The results demonstrated the advantages of proprietary models in a
zero-shot setting, particularly for English, but also revealed the limitations of LLMs in independently
correcting transcriptions.
      </p>
      <p>
        In the study by Li Y. et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the efficient use of Vision Transformer (ViT) for handwritten
text recognition tasks under limited data conditions is proposed. The authors replace the standard
patch representation in ViT with features extracted using a convolutional neural network (CNN)
and apply the Sharpness-Aware Minimization (SAM) optimizer to achieve stable and generalized
loss minimization. The method also introduces span masking, a regularizer that masks related areas
in the feature map. The proposed approach demonstrates competitive results on small IAM and
READ2016 datasets and sets a new benchmark on the largest LAM dataset (19,830 text lines).
      </p>
      <p>
        One key study [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] analyzes data collection for machine learning, emphasizing the need for
large volumes of annotated data for deep models. The entire data collection cycle is discussed, from
acquisition to dataset enhancement, with a focus on integrating Big Data and AI.
      </p>
      <p>
        The study [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] examines automated data collection (ADC) in the context of Industry 4.0, where
IoT devices are used. ADC reduces the workload for users by applying AI to identify relevant data.
      </p>
      <p>
        The study [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposes formalizing the data collection process as an optimization task that
minimizes costs while maintaining a balance between the amount of data and model accuracy,
especially under semi-supervised learning conditions.
      </p>
      <p>In the study [18], a model for recognizing irregular text in images is proposed, combining
ResNet-31, an LSTM encoder-decoder, and an attention mechanism to reduce data preparation
costs. This solution demonstrates high effectiveness on test datasets.</p>
      <p>The research [19] focuses on the use of synthetic data to reduce the costs of dataset creation.
The generation of synthetic images with automatic labeling showed that they can provide
performance comparable to real data.</p>
      <p>In the work [20], a system for creating annotated datasets based on historical manuscripts is
described. The system combines semi-automated character labeling and multi-level verification to
enhance data quality. It supports Cyrillic, Latin, and Arabic scripts and utilizes a
"human-in-the-loop" approach.</p>
      <p>The study [21] proposes an intelligent system for online product promotion, which includes
keyword generation, product catalog creation, advertisement content generation, and targeting.
The experiment confirmed that the system improves advertising effectiveness by 125% while
reducing costs by 87%.</p>
      <p>The paper [22] compares three neural networks — ResNet, EfficientNet, and Xception. The
models were evaluated for accuracy, sensitivity, specificity, and F1-score. The Xception model
achieved the highest accuracy (87.7%), EfficientNet showed high efficiency under limited resources,
while ResNet faced challenges with classifying underrepresented classes, highlighting the
importance of data balance and training methods.</p>
      <p>In the study [23], a method for segmentation of atmospheric cloud images obtained via remote
sensing was presented. The authors developed an algorithm to isolate cloud structures in satellite
images, demonstrating how classical computer-vision techniques can effectively separate complex
visual objects. This approach can be adapted for preprocessing and segmenting handwritten
manuscript images.</p>
      <p>The paper [24] introduces a model for classifying information objects by combining neural
networks with fuzzy logic. This hybrid approach improves classification accuracy in conditions of
uncertainty and data heterogeneity. Such techniques can be leveraged to classify symbols or
identify languages in multilingual historical documents.</p>
      <p>
        The review revealed that most approaches either rely on large, carefully annotated corpora or
are narrowly focused on specific languages/scripts; at the same time, there is a noticeable lack of
open, symbol-level standardized resources for Cyrillic and mixed scripts [
        <xref ref-type="bibr" rid="ref1 ref2">1–2</xref>
        ]. Synthetic data and
augmentation can partially compensate for the data shortage [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3–7</xref>
        ], while interactive/self-learning
methods reduce annotation costs [
        <xref ref-type="bibr" rid="ref8 ref9">8–9</xref>
        ]. However, there is a lack of practical, reproducible solutions
that integrate these approaches into a unified workflow with transparent data quality and
interoperable formats [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref16 ref17">10–14, 15–21</xref>
        ]. The proposed system directly addresses these gaps:
automated preprocessing and segmentation reduce input costs, collective labeling ensures
symbol-level quality control, and standardized annotation storage makes the data ready for training deep
models and further expansion with synthetic/semi-supervised samples. A pilot test on digitized
multilingual documents from the State Archive of Khmelnytskyi Region resulted in over 1,600
validated symbols, confirming the viability of the approach and creating a foundation for the next
stage — building and evaluating handwritten text recognition models for Ukrainian and mixed
corpora. Thus, the results of the review directly inform the requirements for the system’s
architecture and functionality, focused on effectively solving the task at hand.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System architecture</title>
      <p>A system for building training datasets for handwritten text recognition on historical
manuscripts was developed and implemented as a dedicated application, and a test labeling campaign
was conducted. The approach consists of sequential stages of preprocessing, segmentation, and
annotation of character images, which prepare input data for training optical character
recognition models under conditions of variability and degradation of manuscript sources. The
system is schematically presented in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Loader &amp; Pre-processing</title>
        <p>The study used documents obtained from the State Archive of Khmelnytskyi Region. The sample
included digitized descriptions of materials from several fonds: Fond 442 (Kamianets-Podilskyi
County Treasury, 1861–1913), Fond 507 (Office of the Head of the Southwestern Customs District,
1907–1913), Fond 596 (Podillya Branch of the Princess Tatiana Nikolaevna Committee for Assisting
War Victims, 1914–1915), Fond 598 (Judicial Investigation Department of Kamianets-Podilskyi
County, 1875–1880), Fond 616 (Military Affairs of Kamianets-Podilskyi County, 1884–1919), and
Fond 309 (Isakovets Customs, 1915–1931). The documents are written in Ukrainian and Russian,
with a total of 80 pages.</p>
        <p>Each page of the PDF document is rendered in RGB at a density of 400 DPI using
PyMuPDF. The images are converted to grayscale and binarized using Otsu's method to prepare them
for character segmentation. Morphological opening is applied to remove noise and separate fused
components. Next, contours are detected using the cv2.findContours function, retrieving only the
outer boundaries of the characters. Each contour is then filtered by three criteria: minimum
area (contours smaller than 80 pixels are discarded), aspect ratio (0.2 ≤ w/h ≤ 4.0), and fill
factor (0.1 ≤ extent ≤ 0.25). Valid contours are normalized to a size of 64×64 pixels and saved in
PNG format. This filtering improves the quality of the sample and of the data prepared for training
text recognition models. As a result of the processing, 7464 segmented character images were
obtained, an example of which is shown in Figure 2.</p>
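        <p>The three contour filters can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the function name is_valid_component is hypothetical, the thresholds are copied from the text, and the extent is assumed to be the ratio of contour area to bounding-box area. In a real pipeline the width, height, and area would come from OpenCV's bounding-box and contour-area calls.</p>
        <preformat preformat-type="code">
```python
def is_valid_component(width, height, contour_area,
                       min_area=80.0,
                       min_aspect=0.2, max_aspect=4.0,
                       min_extent=0.1, max_extent=0.25):
    """Return True if a contour passes the area, aspect-ratio and extent filters."""
    if width == 0 or height == 0:
        return False
    # Filter 1: discard components smaller than the minimum area.
    if min_area > contour_area:
        return False
    # Filter 2: width-to-height ratio of the bounding box.
    aspect = width / height
    if not (max_aspect >= aspect >= min_aspect):
        return False
    # Filter 3: extent, assumed here to be contour area / bounding-box area.
    extent = contour_area / (width * height)
    return max_extent >= extent >= min_extent


# Hypothetical candidates as (width, height, contour_area) in pixels.
candidates = [
    (20, 30, 120.0),   # aspect 0.67, extent 0.20: kept
    (5, 5, 20.0),      # area below 80: dropped
    (200, 10, 400.0),  # aspect 20: dropped
    (20, 30, 540.0),   # extent 0.90: dropped
]
kept = [c for c in candidates if is_valid_component(*c)]
print(kept)  # [(20, 30, 120.0)]
```
        </preformat>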
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Labeling UI</title>
        <p>A local application based on Gradio has been developed for manual character annotation, which
accelerates the labeling process for images obtained at the preprocessing stage. The main goal is to
involve experts in entering labels (annotations) for the images.</p>
        <p>Authorization Section: At the beginning, the user enters their name, group, and archive team
number, which allows identifying participants during the subsequent analysis of the collected
labels.</p>
        <p>Main Labeling Interface: The interface displays the symbol and its language, provides a
field for entering the label, and offers navigation buttons (save, skip, return to the previous
symbol). The language is determined automatically from a suffix in the filename (_ukr, _eng, _pol,
_rus) and can be used for further classification.</p>
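        <p>A minimal sketch of the suffix-based language detection is given below. It assumes that the language code is the last underscore-separated part of the file stem; the function name language_from_filename and the constant KNOWN_SUFFIXES are hypothetical, and the exact naming scheme of the character images is not specified in this article.</p>
        <preformat preformat-type="code">
```python
from pathlib import PurePosixPath

# Suffixes listed in the text; any other suffix is treated as unknown.
KNOWN_SUFFIXES = ("ukr", "eng", "pol", "rus")


def language_from_filename(path):
    """Return the language suffix of a file name, or None if it has none."""
    stem = PurePosixPath(path).stem            # e.g. "230-1-1_ukr"
    _, separator, suffix = stem.rpartition("_")
    if separator and suffix in KNOWN_SUFFIXES:
        return suffix
    return None


print(language_from_filename("Manuscripts/230-1-1_ukr.pdf"))  # ukr
print(language_from_filename("Manuscripts/230-1-1.pdf"))      # None
```
        </preformat>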
        <p>Each label is saved in a CSV file, which includes:
          <list list-type="order">
            <list-item><p>label: the entered label (symbol),</p></list-item>
            <list-item><p>language: the language, determined in advance and indicated as a suffix in the filename (e.g., "Manuscripts/230-1-1_ukr.pdf"),</p></list-item>
            <list-item><p>image_path: the path to the image,</p></list-item>
            <list-item><p>user_name, user_group, team_number: metadata about the annotator.</p></list-item>
          </list>
        </p>
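        <p>The record layout can be illustrated with Python's csv module. The column names follow the list above; the helper append_label and all values in the example row are hypothetical, and an in-memory buffer stands in for the CSV file.</p>
        <preformat preformat-type="code">
```python
import csv
import io

# Column order taken from the field list in the text.
FIELDNAMES = ["label", "language", "image_path",
              "user_name", "user_group", "team_number"]


def append_label(stream, row, write_header=False):
    """Write one annotation record to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# Hypothetical record; the path and annotator details are placeholders.
buffer = io.StringIO()
append_label(buffer, {
    "label": "o",
    "language": "ukr",
    "image_path": "chars/230-1-1_ukr/0001.png",
    "user_name": "Annotator A",
    "user_group": "G-1",
    "team_number": "3",
}, write_header=True)
print(buffer.getvalue())
```
        </preformat>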
        <p>The indexing update mechanism allows for correcting labels in case of skipped or revisited
images. The application works locally, without the need to connect to external servers, making it
convenient for handling confidential data or working in areas with limited network access. The
simplicity of the interface allows even untrained users to participate in character annotation.</p>
        <p>The annotation interface runs as a local web application built with the Gradio library.
After launching the application, the user is directed to the registration page, an
example of which is shown in Figure 3. The user enters their personal data: name, surname, group
number, and archive identifier (the number of the assigned set of 200 images). This ensures user
identification and control over the quality of the entered labels.</p>
        <p>After confirming the data by clicking the "Start labeling" button, the system redirects the user to
the main character annotation page (see Figure 4). The interface displays a progress indicator
showing the number of processed and remaining images. The symbols are displayed at a fixed size
of 128×128 pixels, with the option to zoom in or download. The language of the symbol is
automatically determined by the filename.</p>
        <p>In the central part of the interface, there is a text field for entering the label. Below, there are
navigation buttons: to move to the next or previous image, as well as a "Save" button that records
the label in the CSV file and initiates the transition to the next symbol. If necessary, the user can
skip an image or return to the previous one, with the label being updated accordingly.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Quality aggregation</title>
        <p>To merge annotation data obtained from multiple user teams in the form of CSV files, a software
module was developed that ensures standardized merging of the labels based on the majority vote
principle. The code is implemented in Python using the pandas library.</p>
        <p>At the first stage, the module searches for and reads all available CSV files in a specified
directory (csv_inputs). Empty files and files that cannot be read are skipped, which makes the
processing more robust.</p>
        <p>Each record contains the language (language), image path (image_path), label (label), and
user and team information (user_name, user_group, team_number), and a check verifies that the
required set of columns is present. After that, the language names are normalized using a
pre-defined mapping dictionary (LANG_MAP), which maps variants such as "ua", "ukr", and "uk" to
the single standard code "uk".</p>
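        <p>A minimal sketch of this normalization step follows. Only the "ua", "ukr", and "uk" entries come from the text; the Russian entries and the function name normalize_language are assumptions added for illustration, and unknown codes are passed through unchanged.</p>
        <preformat preformat-type="code">
```python
# Mapping of language variants to one standard code.
LANG_MAP = {
    "ua": "uk", "ukr": "uk", "uk": "uk",   # from the text
    "rus": "ru", "ru": "ru",               # assumed for illustration
}


def normalize_language(raw):
    """Map a raw language value to its standard code."""
    key = raw.strip().lower()
    # Unknown values are returned as cleaned input rather than dropped.
    return LANG_MAP.get(key, key)


print([normalize_language(v) for v in ["ua", " UKR ", "uk", "rus", "pol"]])
# ['uk', 'uk', 'uk', 'ru', 'pol']
```
        </preformat>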
        <p>The main aggregation operation is performed at the record grouping level by the key
(team_number, image_path). For each group, the agreed-upon values for the fields label and
language are determined using majority voting (majority_vote). If there are multiple values with
the same number of votes, the one that appears first alphabetically is selected, ensuring result
stability.</p>
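        <p>The grouping and voting logic can be sketched with the standard library alone. The module described above uses pandas, so this is an equivalent illustration under stated assumptions and not the original implementation. The function aggregate and the sample rows are hypothetical, and ties are broken by Python's string ordering, which compares code points.</p>
        <preformat preformat-type="code">
```python
from collections import Counter, defaultdict


def majority_vote(values):
    """Return the most frequent value; ties go to the smallest string."""
    counts = Counter(values)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)


def aggregate(rows):
    """Merge annotation rows per (team_number, image_path) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["team_number"], row["image_path"])].append(row)
    merged = []
    for (team, path), members in sorted(groups.items()):
        merged.append({
            "team_number": team,
            "image_path": path,
            "label": majority_vote(r["label"] for r in members),
            "language": majority_vote(r["language"] for r in members),
        })
    return merged


# Hypothetical annotations: a clear majority for a.png, a tie for b.png.
rows = [
    {"team_number": "1", "image_path": "a.png", "label": "o", "language": "uk"},
    {"team_number": "1", "image_path": "a.png", "label": "o", "language": "uk"},
    {"team_number": "1", "image_path": "a.png", "label": "c", "language": "uk"},
    {"team_number": "1", "image_path": "b.png", "label": "b", "language": "uk"},
    {"team_number": "1", "image_path": "b.png", "label": "a", "language": "uk"},
]
merged = aggregate(rows)
print([m["label"] for m in merged])  # ['o', 'a']
```
        </preformat>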
        <p>The summarized results are stored in the final file merged_labels.csv with UTF-8-SIG encoding
for compatibility with local processing systems.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Metrics</title>
        <p>To evaluate the quality of annotations, two inter-rater agreement metrics were chosen —
Krippendorff's α [25] and Fleiss' κ [26].</p>
        <p>Krippendorff's α was calculated on the nominal scale of measurement, which corresponds to the
nature of the character classification task. Formally, the coefficient α is defined as:
          <disp-formula>
            <label>(1)</label>
            <tex-math>\alpha = 1 - \frac{D_o}{D_e}</tex-math>
          </disp-formula>
where D<sub>o</sub> is the observed variance (the number of disagreements between annotators), and
D<sub>e</sub> is the expected variance under random distribution of labels. In our case, α = 0.637,
which indicates an acceptable level of agreement between annotators, sufficient for
analytical conclusions.</p>
        <p>Fleiss' κ was applied to the subset of images that were annotated by exactly three annotators (n
= 3), which meets the conditions for applying the metric. The formula for Fleiss' κ is as follows:
          <disp-formula>
            <label>(2)</label>
            <tex-math>\kappa = \frac{P - P_e}{1 - P_e}</tex-math>
          </disp-formula>
where P is the average agreement proportion between annotators for all objects, and P<sub>e</sub> is the
expected proportion of agreement under random label assignment. In this study, the value of Fleiss'
κ is 0.562, which also indicates a moderate, but acceptable level of agreement.</p>
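        <p>Both coefficients can be computed with a few lines of standard-library Python. The sketch below follows the textbook definitions of nominal Krippendorff's α (via a coincidence matrix) and Fleiss' κ; it is not the evaluation code used in this study, the function names are hypothetical, and the toy inputs are made up to keep the arithmetic checkable by hand.</p>
        <preformat preformat-type="code">
```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal alpha; `units` holds one list of labels per image."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m >= 2:  # a single label carries no agreement information
            for a, b in permutations(labels, 2):
                coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no image has two or more labels")
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:  # only one label was ever used
        return 1.0
    return 1.0 - observed / expected


def fleiss_kappa(units):
    """Fleiss' kappa; every image must have the same number of labels."""
    n_raters = len(units[0])
    if any(len(labels) != n_raters for labels in units):
        raise ValueError("all images need the same number of annotations")
    category_totals = Counter()
    agreement_sum = 0.0
    for labels in units:
        counts = Counter(labels)
        category_totals.update(counts)
        pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_mean = agreement_sum / len(units)
    total = n_raters * len(units)
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:  # only one label was ever used
        return 1.0
    return (p_mean - p_expected) / (1.0 - p_expected)


alpha = krippendorff_alpha_nominal([["a", "b"], ["a", "a"], ["b", "b"]])
kappa = fleiss_kappa([["a", "a", "b"], ["b", "b", "b"]])
print(round(alpha, 3), round(kappa, 3))  # 0.444 0.25
```
        </preformat>
        <p>Unlike Fleiss' κ, the α routine accepts images with differing numbers of labels, which matches the restriction of κ to the subset with exactly three annotations.</p>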
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Using the semi-automated system, a corpus of 1684 handwritten character images was
created. Of these, 215 are unique in terms of content and label, indicating a certain level of
redundancy (≈87% repeated entries), which was intentionally built in to allow for label aggregation
based on the majority voting principle. Example records can be seen in Figure 5.</p>
      <p>The corpus covers two language domains — Ukrainian and Russian — which are identified based
on the suffixes in the original PDF document filenames. All images were pre-normalized to a size of
64×64 pixels in grayscale, maintaining the proportions of the characters and with additional
padding to avoid losing context.</p>
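      <sec id="sec-4-4">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Parameter</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>Character images in the corpus</td><td>1684</td></tr>
              <tr><td>Unique symbols</td><td>215</td></tr>
              <tr><td>Languages</td><td>2 (Ukrainian, Russian)</td></tr>
              <tr><td>Image format</td><td>PNG, 64×64 px, binarised</td></tr>
              <tr><td></td><td>112.6</td></tr>
              <tr><td></td><td>0.73</td></tr>
              <tr><td></td><td>≈ 8.2%</td></tr>
              <tr><td>Typical PDF page size</td><td>A4, 400 DPI</td></tr>
              <tr><td>Average number of characters per page</td><td>120–170</td></tr>
            </tbody>
          </table>
        </table-wrap>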
        <p>Figure 6 shows the frequency graph of characters in the collected corpus. The most common
character is "C", with over 140 occurrences, well ahead of all other characters. Other frequent
graphemes include "o" (≈100 occurrences) and the symbol "/".</p>
        <p>This uneven distribution stems both from the characteristics of the linguistic material
(e.g., the high frequency of the letter "o" in Ukrainian and Russian texts) and from segmentation
errors. In particular, the high frequency of the symbol "/" indicates that fragments or line breaks
were misclassified as separate symbols.</p>
        <p>Manual review revealed that some of the input images labeled as "c", "o", and "/" show
character fragments or image artifacts rather than complete characters. This confirms the need
to improve the preprocessing modules and to implement filtering based on context evaluation.</p>
        <p>To assess the quality of annotations, two inter-rater agreement metrics were used:
Krippendorff’s α and Fleiss’ κ. Krippendorff’s α (nominal scale) was 0.637, indicating an acceptable
level of agreement between participants. Fleiss’ κ, calculated only for images with the same
number of annotations (n = 3), was 0.562, which also suggests moderate but acceptable
agreement. These results confirm that the annotations are of sufficient quality for further analysis.</p>
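        <p>Unlike Fleiss’ κ, Krippendorff’s α on a nominal scale tolerates a varying number of annotations per image. The following is a minimal sketch of that computation from observed and expected disagreement. The function name, the input layout, and the toy labels are assumptions for illustration and do not reproduce the authors' tooling or data.</p>
        <preformat>
```python
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of per-image label lists. The lists may differ in
    length; images with fewer than two labels carry no pairing information
    and are skipped.
    """
    label_totals = Counter()   # pairable labels per category
    observed = 0.0             # weighted count of mismatching label pairs
    for labels in units:
        m = len(labels)
        if m >= 2:
            counts = Counter(labels)
            label_totals.update(counts)
            # Ordered pairs with different labels inside this image.
            mismatched = m * m - sum(c * c for c in counts.values())
            observed += mismatched / (m - 1)

    n = sum(label_totals.values())
    expected = n * n - sum(t * t for t in label_totals.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    # alpha = 1 - D_o / D_e with D_o = observed / n, D_e = expected / (n (n - 1))
    return 1 - (n - 1) * observed / expected


# Toy data (hypothetical): four images, three labels each.
toy = [["а", "а", "а"], ["а", "а", "о"], ["о", "о", "о"], ["а", "о", "о"]]
print(round(krippendorff_alpha_nominal(toy), 3))  # 0.389
```
        </preformat>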
        <p>To evaluate the effectiveness of the labeling module, an analysis was conducted on the time
spent annotating characters as well as preprocessing the images. During the annotation of 100
characters, the total time recorded was 4 minutes and 45 seconds, corresponding to an average time
of 2.85 seconds per character.</p>
        <p>The effectiveness of the image segmentation module was also analyzed. A total of 1866
characters from handwritten text pages were processed, with an average processing time of 1.98
seconds per page and a total processing time of 39.62 seconds for the entire dataset. It is important
to note that the processing time remained stable and was practically independent of the number of
segmented characters on the page. For example, when 8 characters were preserved from page 1, the
processing time was 2.07 seconds, while for 179 characters on page 18, it was only 2.00 seconds.</p>
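        <p>The timing figures in the two paragraphs above can be cross-checked with simple arithmetic. The sketch below uses only numbers reported in the text. The implied page count is an inference from the total and per-page times, not a value stated by the authors.</p>
        <preformat>
```python
# Annotation: 100 characters labeled in 4 minutes 45 seconds.
annotation_seconds = 4 * 60 + 45
seconds_per_character = annotation_seconds / 100
print(seconds_per_character)                 # 2.85

# Segmentation: 39.62 s in total at an average of 1.98 s per page.
implied_pages = 39.62 / 1.98
print(round(implied_pages))                  # 20

# Per-character cost on the two example pages: page time is nearly
# constant, so the cost per character drops as the page gets denser.
sparse_page = 2.07 / 8                       # page 1, 8 characters
dense_page = 2.00 / 179                      # page 18, 179 characters
print(round(sparse_page, 3), round(dense_page, 3))
```
        </preformat>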
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>A prototype of an interactive system for creating handwritten character corpora has been
developed and tested. The system implements a full pipeline for HTR tasks on historical documents:
preprocessing, segmentation, collective labeling, quality aggregation, and export to a standardized
format. A total of 80 pages of multilingual materials were processed, yielding 7,464 character
crops (PNG, 64×64 px). Labeling produced 1,684 validated examples, 215 of them unique. The
redundancy of approximately 87% was deliberately incorporated for majority voting. Annotation
consistency is Krippendorff’s α = 0.637 and Fleiss’ κ = 0.562 (n = 3), which corresponds to a
moderate and practically sufficient level of agreement. The average labeling time is 2.85 seconds
per symbol. The average segmentation time is 1.98 seconds per page and remains stable across a
range of 8–179 symbols per page; the total time for the analyzed subset is 39.62 seconds. The
frequency analysis reveals the dominance of the symbols "C" (&gt;140 occurrences) and "o"
(~100 occurrences), while the increased frequency of "/" indicates segmentation artifacts.</p>
      <p>The volume of validated data is currently limited to 1,684 examples, of which 215 are unique.
The language coverage includes only Ukrainian and Russian, and the distribution of graphemes is
uneven. False positives, particularly for "/", arise from imperfections in preprocessing and
segmentation, and inter-annotator agreement is only moderate. The labeling is stored in CSV format
at the symbol level. PAGE-XML/ALTO formats have not yet been integrated, which makes direct
comparison with benchmarks difficult and leaves the "line" and "word" levels uncovered.</p>
      <p>The plan is to scale the corpus to at least 10,000 validated symbols, expand language and script
coverage, and balance grapheme frequencies. Segmentation will be improved through adaptive
morphological filters, symbol/non-symbol classification, contextual filtering, and detectors based
on Mask R-CNN or ViT, with the goal of reducing false positives for "/" by at least 50%. Active
learning, Dawid–Skene label aggregation, and self-training with pseudo-labels are planned, aiming
to increase α to ≥0.75. The labeling will be converted to PAGE-XML/ALTO and COCO formats and
supplemented with "line" and "word" levels. The final stage will involve benchmarking HTR models
(CRNN+CTC and Transformer/ViT with SAM), evaluating the impact of synthetic data and
fine-tuning on CER and WER, and plotting "quality–labeling volume" curves.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <p>The authors have not employed any Generative AI tools.</p>
        <p>[18] Shi, B., Bai, X., &amp; Yao, C. (2019). An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. Proceedings of the AAAI
Conference on Artificial Intelligence.
[19] Rocco, C., de Mello, M. R., &amp; Oliveira, L. (2022). Synthetic dataset creation for computer vision
application: Pipeline proposal.
[20] Ivasechko, A. V., &amp; Lipianina-Honcharenko, K. V. (2025). Architecture of a semi-automated
annotation system for multilingual archival handwritten texts. Systems and Technologies.
[21] Lipianina-Honcharenko, K., Wolff, C., Sachenko, A., Desyatnyuk, O., Sachenko, S., &amp; Kit, I.
(2023). Intelligent information system for product promotion in internet market. Applied
sciences, 13(17), 9585.
[22] Lipianina-Honcharenko, K., Telka, M., &amp; Melnyk, N. (2024, December). Comparison of ResNet,
EfficientNet, and Xception architectures for deepfake detection. In Proceedings of the 1st
International Workshop on Advanced Applied Information Technologies CEUR-WS,
Khmelnytskyi, Ukraine, Zilina, Slovakia (pp. 26-34).
[23] Rusyn,B et al (2018) Segmentation of atmospheric clouds images obtained by remote
sensing.14th International Conrefences on Advanced Trends in Radioelectronics,
Telecommunication and Computer Engineering,TCSET,2018,Proceeding,pp.213-216.
[24] Mukhin,V,et al (2025) A model for classifying information objects using neural networks and
fuzzy logic.Scientific Reports,v.15,is.1,15904.
[25] Marzi G., Balzano M., Marchiori D. K-Alpha Calculator—A user-friendly tool for computing
Krippendorff’s Alpha. MethodsX, 12, 102545, 2024. DOI: 10.1016/j.mex.2023.102545. (Короткий
огляд α, інтерпретація порогів і практичний інструмент.)
[26] Halpin S.N. Inter-Coder Agreement in Qualitative Coding: Considerations for its use.</p>
        <p>American Journal of Qualitative Research, 2024.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>