<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Neural Framework For Handwritten Calendar Parsing and Semantic Content Categorization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoni Gagliard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rayappa David Amar Raj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rama Muni Reddy Yanamala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham</institution>
          ,
          <addr-line>Coimbatore, Tamil Nadu, 641112</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Warangal, Telangana, 506004</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>89</fpage>
      <lpage>98</lpage>
      <abstract>
<p>Digital calendars, accessible via laptops, tablets, and smartphones, offer features such as automatic reminders that improve time management and personal organization. However, older people often struggle to use these tools, preferring to rely on traditional paper calendars. This digital divide can lead to missed appointments and a subsequent negative impact on well-being. We propose an innovative application that can automatically capture and digitize a physical calendar, allowing reminders to be sent and commitments to be tracked even by third parties. By integrating the familiar interface of paper with digital features, our tool aims to improve appointment keeping and reduce the technological gap in time management for the elderly population.</p>
      </abstract>
      <kwd-group>
<kwd>Artificial Intelligence</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Optical Character Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Optical Character Recognition (OCR) refers to a set of techniques used to detect and convert characters from physical documents into editable and searchable digital text. This process typically involves capturing an image of the document using a scanner or a digital camera. The ability to convert various forms of documents can be applied in a wide range of fields, for example the recognition of human handwriting, the digital conversion of labels and manuscripts, the recognition of numerical digits in financial and banking contexts, and the validation of a particular type of handwriting to authenticate the provenance of a manuscript. Another interesting application that has been developed in recent years involves the recognition of ancient characters [<xref ref-type="bibr" rid="ref1">1</xref>]. The objective of the challenge was to digitally reconstruct ancient damaged papyrus scrolls. The scrolls were digitally "unwrapped" using computed tomography (CT) and machine-learning technology. The resulting scans were then turned into a 3D volume of voxels, which was segmented by tracing the crumpled layers of the rolled papyrus in the 3D scan, effectively flattening the images. The last step was detecting ink on papyrus by using machine learning to identify regions of ink in the flattened segments of the papyrus. A particularly remarkable aspect of this application is that the model operated without any prior knowledge of alphabets or handwriting conventions: the digital characters predicted by the model therefore result purely from plotting the local ink detection spots across a generated image. Another important application of handwriting recognition concerns cuneiform tablets [2]. Researchers, instead of using photos, relied on 3D models of the tablets, delivering significantly more reliable results than previous methods. This makes it possible to search through the content of multiple tablets and to compare them with each other. They used 3D models of nearly 2000 cuneiform tablets, many of which are more than 5000 years old and are thus among mankind's oldest surviving written records. What they discovered is an extremely wide range of topics, from shopping lists to court rulings, providing a glimpse into mankind's past several millennia ago. However, despite the results obtained, the challenge remains open: some tablets are heavily ruined, and the writing system was very complex at that age and encompassed several languages. Consequently, effective modeling requires not only higher-quality data but also more sophisticated prior knowledge [3, 4] to capture the complexity and multilingual nature of cuneiform writing.</p>
      <p>It is noteworthy that a major technology company such as Google has also developed its own OCR system in recent years. Google OCR, developed by Google AI [5], is designed to convert a variety of document types, including scanned documents, PDFs, and images captured by a digital camera, into editable text. The system's principal advantages are its high degree of accuracy, achieved through the use of sophisticated deep learning techniques for the recognition and extraction of text with remarkable precision (even in the presence of complex backgrounds or low-quality images); the incorporation of multiple languages, which leads to the capability of processing a wide range of alphabets, including ideograms; and the possibility to process not only printed characters but also handwritten texts.</p>
      <p>OCR systems also play an important social role when employed in applications that address visual impairments, giving blind people the possibility to convert written text into audio (OCR combined with speech synthesis).</p>
      <p>In conclusion, OCR represents a versatile technological
solution with broad applicability across document
processing, data management, and accessibility domains. To achieve our goal, it is necessary to create a
pipeline that can be employed to digitize the content,
assign it to a category and store it in a database or in an
existing digital calendar, for example using the Google
Calendar API or the iOS Calendar API.</p>
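      <p>As an illustration of this last step, the following minimal sketch (function and field names are ours, not part of the proposed system; credential handling is omitted) pushes a digitized reminder into Google Calendar using the official Python client:</p>
      <preformat>
# Hedged sketch: insert a digitized reminder as an all-day event via the
# Google Calendar API (fields are illustrative).
from googleapiclient.discovery import build

def push_reminder(creds, date_iso, text):
    service = build("calendar", "v3", credentials=creds)
    event = {
        "summary": text,              # digitized note content
        "start": {"date": date_iso},  # e.g. "2025-03-14"
        "end": {"date": date_iso},
    }
    return service.events().insert(calendarId="primary", body=event).execute()
      </preformat>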
      <p>Challenges. The process of recognizing and digitizing human handwriting presents several significant challenges for OCR systems. The presence of noise and distortions in images represents a considerable obstacle, as it can negatively impact the efficiency and accuracy of the system. OCR systems may also struggle to recognize characters in scanned images affected by distortions or intrinsic noise, leading to recognition errors. Furthermore, the issue of multilingual support introduces another layer of complexity, as OCR systems may face challenges in processing documents that contain multiple languages, each with its own set of characters and linguistic rules. OCR systems, like human readers, are inherently tied to specific alphabets when recognizing characters. The system learns the local features of the different characters directly from the handwritten text: the input data take the form D = (X, S), with S = {s_1, s_2, ..., s_n}, where each s_i is a symbol, or grapheme (from Gr. graphein, 'to write'), mapped into a digital encoding and decoding alphabet. Since the system learns a specific alphabet directly from the text, and since local features (shape, thickness, corners, edges, ...) are inherent to the characters of an alphabet, the system experiences a significant drop in performance when applied to characters outside its trained alphabet. This significantly limits transfer learning, since the network must be retrained from scratch to recognize an alphabet different from the one on which the model has already been trained.</p>
      <p>Moreover, the diversity of handwriting styles further
complicates the OCR process. Handwriting is highly
individual, requiring OCR systems to handle not only
personal variation but also atypical styles such as cursive
or artistic fonts. The inherent subjectivity of handwriting
makes it imperative to develop an OCR system that is
robust: it should handle as much as possible the actual
changes in font and style from text to text and decipher
the myriad ways in which individuals express themselves
on paper. Taking the Latin alphabet into account, we can
differentiate the characters into capital letters and
lowercase letters. Notably, capital letters tend to exhibit
lower variability, whereas lowercase letters—though
subject to basic calligraphic conventions—reflect more
personal handwriting traits and a broader range of stylistic
variation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>Early eforts in text digitization date back to the devel</title>
        <p>opment of LeNet [6], which demonstrated that a shallow
convolutional neural network could accurately recognize
handwritten digits in 32 × 32 grayscale images.
Building on this, [7] proposed a methodology leveraging the
MNIST dataset [8] to address more complex handwriting
recognition tasks, but also coping with image defects
and noise [9]. Their approach emphasized preprocessing
steps—including grayscale normalization, cropping, and
resizing—to improve the recognition of isolated
handwritten characters. They showed the effectiveness of
convolutional neural networks (CNNs) in extracting
local features from such inputs. The utility of CNNs for
handwriting recognition has since been widely adopted.</p>
        <p>
          In 2015, [
          <xref ref-type="bibr" rid="ref15 ref40 ref52 ref55 ref9">10</xref>
          ] introduced the CRNN architecture, which
combines CNNs for spatial feature extraction with
recurrent neural networks (RNNs) to model character
sequences. This design is particularly suited for
handwritten word recognition, as RNNs can capture sequential
dependencies—albeit with limitations such as the
vanishing gradient problem. More recently, [11] proposed a
system that combines CNNs with Error Correcting
Output Codes (ECOC) to enhance classification robustness.
        </p>
        <p>Feature extraction is performed using architectures such
as LeNet [6] and AlexNet [12], while classification is
carried out by training an ensemble of binary Support Vector
Machines (SVMs) via ECOC. This method decomposes
the multiclass problem into several binary subproblems,
yielding higher accuracy compared to CNNs followed by
a standard softmax classifier—particularly on the MNIST
dataset. In 2021, [13] introduced a more complex model
based on CRNNs for full handwritten document
recognition. This approach integrates CNNs for visual feature
extraction with Long Short-Term Memory (LSTM)
networks to model sequential dependencies across words or
phrases. The system is trained using the Connectionist
Temporal Classification (CTC) loss function [14], which
enables sequence prediction without requiring explicit
character-level alignment. CTC considers all possible
alignments and computes a summed probability,
allowing for end-to-end training even when segmentation is
ambiguous. The next leap in this domain has been driven
by the application of transformers [15] to computer
vision, particularly through the introduction of the Vision
Transformer (ViT) [16]. ViT replaces convolutional
layers with a pure attention mechanism, enabling the model
to capture long-range dependencies in visual inputs. This
shift opens new directions in handwriting recognition
by enabling the integration of global context across
entire input images, beyond the local receptive fields of
traditional CNNs.</p>
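      <p>As a minimal sketch of the CTC objective mentioned above (shapes and alphabet size are illustrative, not taken from the cited systems), PyTorch exposes it directly as nn.CTCLoss:</p>
      <preformat>
import torch
import torch.nn as nn

T, N, C = 50, 4, 80          # timesteps, batch size, alphabet size (0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(2)           # per-step log-probs
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # target transcriptions
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# CTC sums over all compatible alignments, so no explicit character-level
# alignment is required during training.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
      </preformat>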
      <p>It is worth pointing out that, before the advent of the CRNN and ViT models, the majority of the proposed approaches achieved high performance in terms of recognition accuracy but showed huge limitations: the input for the first OCR neural models was necessarily provided in the form of individual characters of an alphabet. The networks were able to classify the salient features of a character and provide a classification consisting of the corresponding digital label, but they were unable to build a context (both in terms of the previous words in a sentence and in terms of the individual characters in a single word), and profound refactoring of the software systems was often required [17, 18].</p>
      <p>Table 1 — Datasets: the calendar dataset has been created manually by using a graphic tablet; the numeric digits dataset is a collection of numbers in the range [1, 31] generated from computer system fonts, while IAM (Institut für Informatik und Angewandte Mathematik) Handwritten is a dataset belonging to the University of Bern.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation Description</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>To ensure accurate calendar digitization, the proposed
pipeline is composed of several neural modules. The ini- To conduct the experiments, we independently created a
tial component is a segmentation network based on dataset of handwritten calendar images using a graphic
the U-Net architecture [19], which enables the model to tablet. The corresponding ground truth segmentation
identify the most semantically relevant regions of the masks were annotated using the Oxford VGG Image
Ancalendar. This network produces a segmentation mask, notator tool. The dataset, although limited in size (see
allowing for the algorithmic extraction of targeted calen- Table 1), is suficient for our use case. Unlike generic
dar segments. Once the relevant segments are extracted, segmentation tasks—such as those involving the
heterothey are processed by two separate ResNet6-based con- geneous images found in datasets like ImageNet—our
volutional neural networks [20], responsible for recog- domain involves structurally homogeneous data. All
innizing the month and day digits, respectively. While put images depict the same subject: a calendar with a
the recognition of day digits benefits from standardized symmetric and well-defined layout. This structural
regupatterns—since the digits recur uniformly across sam- larity allows for meaningful learning even with a smaller
ples—the month field presents greater variability. Users number of examples. To simulate realistic acquisition
are allowed to write the month manually, which intro- conditions, we applied synthetic distortions using
Phoduces inconsistencies in handwriting style and position- toshop to introduce parallax efects (horizontal, vertical,
ing. To address these variations and improve classifi- or both). These transformations reflect the common
scecation robustness, the calendar template incorporates a nario where the camera capturing the calendar may not
distinct pair of AprilTags [21] for each month of each year, be perfectly aligned with the sheet. Incorporating such
positioned at the left and right margins of the month field. distortions during training improves model robustness in
This design introduces spatial regularity, guiding the con- real-world scenarios. To correct these geometric
distorvolutional layers toward more stable visual features and tions at inference time, we integrated a RANSAC
(Ranmitigating the ambiguity caused by user-written titles. dom Sample Consensus) [23] module into our pipeline.
The final step of the pipeline involves processing the re- The four AprilTags placed at the corners of each
calenminder segments, where a second U-Net is employed to dar template serve as keypoints, allowing RANSAC to
segment user-written sentences into individual words. estimate a homography and realign the captured image
These word segments are then passed to a digitization with the reference calendar layout.
module that combines the Vision Transformer (ViT) [16] The numeric digits dataset is assembled by saving
picand BERT [22]. While ViT captures visual features at a tures of numbers from 1 to 31 using classical fonts
availglobal scale, BERT processes the embedded representa- able at the level of the computer’s operating system. This
tions to extract semantic meaning. An additional com- choice is useful both for training a compact ResNet to
ponent of the system enables content classification at recognize calendar day numbers and for allowing the
the line level. Specifically, each textual line is processed model to generalize beyond the specific font used in the
through BERT for semantic categorization. With mini- study.</p>
        <p>Other useful data are retrieved from the public Internet. In particular, the IAM Handwriting Datasets [24] are used to pre-train the Word Parser model and the Note Digitizer. IAM Handwriting is a dataset that contains a collection of words cut from portions of written texts, and its main advantage, as discussed in the introduction, lies in enabling the model to learn relevant character features. This approach enables the system to learn the sequential features of the characters that together make up a particular word.</p>
      </sec>
      <sec id="sec-data-processing">
        <title>3.2. Data Processing</title>
        <p>Figure 1: (a) example of an original image taken from the IAM Lines dataset, "suitable to be paraded before the public"; (b) the original image filtered with the Laplacian operator to retrieve its details, in particular the edges; (c) pixel values of the detail image corrected to 0, 128, or 255, with the filtering applied using selected intensity values as thresholds; (d) final binary inverse thresholding.</p>
        <p>Standard computer vision techniques are employed to process the images contained in the datasets. Gaussian blurring is preferred over classical blurring, since using a kernel derived from a Gaussian distribution helps to attenuate image noise and details while ensuring better preservation of edges and contours. The image pixel intensity values are made uniform by applying a binary thresholding technique that sets to a maximum value all pixels above a certain threshold (represented, for example, in the case of the IAM dataset, by the grayscale levels annotated in the data), while setting the values below it to 0. Moreover, all pixel values in the image are additionally scaled by a factor of 1/255 to bring them into the interval [0, 1], and normalized around a certain mean and standard deviation. These general steps ensure that the data are in a format more suitable for model processing.</p>
        <p>Regarding the calendar dataset, to ease the
computational segmentation process, images are downsampled
using a Laplacian Pyramid. The reason why this
technique is applied to the calendar data before feeding them
into the network will be made clearer in Section 3.4.1.</p>
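        <p>A sketch of this downsampling, assuming OpenCV's pyrDown/pyrUp primitives (the number of levels is illustrative): each level stores the residual needed to restore the original resolution after segmentation.</p>
        <preformat>
import cv2
import numpy as np

def laplacian_pyramid(img, levels=2):
    current = img.astype(np.float32)
    residuals = []
    for _ in range(levels):
        down = cv2.pyrDown(current)                        # blur + downsample
        up = cv2.pyrUp(down, dstsize=current.shape[1::-1])
        residuals.append(current - up)                     # detail residual
        current = down
    return current, residuals          # low-resolution input + per-level details

def restore(low_res, residuals):
    current = low_res
    for res in reversed(residuals):
        current = cv2.pyrUp(current, dstsize=res.shape[1::-1]) + res
    return current
        </preformat>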
        <p>It is also worth noting that, since a word segmentation
block for the calendar note segments has been included
in the architecture, it needs to be trained on pairs (image,
word mask) that are not directly available in the dataset.</p>
        <p>This is why a dedicated image processing step is applied
online during dataset generation to obtain, from a single
text line image, its corresponding ground-truth word
segmentation mask. The original image (Figure 1a) is
filtered with a Laplacian operator in order to extract its
details, borders, and edges (Figure 1b). This operation is
particularly useful in the current data domain because, as
previously mentioned in Section 3.1, the images from the
IAM Line/Sentence datasets are collections of cropped
word/text lines. By further filtering the pixel values of the
detailed image, it becomes possible to highlight borders
and edges, and thus delineate the cutting boundaries
with respect to their content (Figure 1c). At this point, by
applying a binary inverse threshold function, the final
word-level binary segmentation mask can be obtained
(Figure 1d).</p>
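        <p>The following hedged sketch reproduces the described mask-generation steps (the threshold and quantization values are assumptions):</p>
        <preformat>
import cv2
import numpy as np

def word_mask(line_img):
    detail = cv2.Laplacian(line_img, cv2.CV_16S, ksize=3)  # edges and details
    detail = cv2.convertScaleAbs(detail)
    # quantize the detail intensities to three levels (0 / 128 / 255)
    quantized = np.where(detail &lt; 64, 0,
                np.where(detail &lt; 192, 128, 255)).astype(np.uint8)
    # inverse binary threshold yields the word-level ground-truth mask
    _, mask = cv2.threshold(quantized, 100, 255, cv2.THRESH_BINARY_INV)
    return mask
        </preformat>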
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Neural blocks</title>
        <p>The digitization system is composed of five neural
modules working in parallel. The first is the Calendar Parser,
a U-Net model with encoder-decoder structure and skip
connections, which takes as input a calendar image
(downsampled using a Laplacian Pyramid) and performs
pixel-wise classification into six semantic regions.
Initially, a larger number of classes was tested to directly
model the month and day digits at this stage, but this
was later simplified, as discussed in Section 3.4.</p>
        <p>The Word Parser is also based on a U-Net architecture and is pre-trained using image/mask pairs from the IAM
Line/Sentence dataset, where masks are generated as
described in Section 3.2. It produces a binary
segmentation mask for each text line, and, as shown in Section 4,
generalizes well to calendar data despite significant
visual differences from the training set. Both the Month
Digitizer and Day Digitizer are implemented as compact
ResNet architectures composed of six convolutional
layers with two residual connections. Each model concludes
with a linear classification head, consisting of either 12 or
31 output units depending on whether the task is month
or day classification, respectively. This minimal design
was chosen to ensure fast inference while maintaining
adequate performance given the low visual variability of
the segmented inputs.</p>
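        <p>An illustrative PyTorch sketch of such a compact digitizer follows; the channel widths are assumptions, while the six convolutional layers, two residual connections, and the 12- or 31-unit head come from the description above.</p>
        <preformat>
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):                      # residual (skip) connection
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class Digitizer(nn.Module):
    def __init__(self, n_classes):             # 12 for months, 31 for days
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.block1 = ResBlock(32)              # conv layers 2-3
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                 nn.ReLU())     # conv layer 4
        self.block2 = ResBlock(64)              # conv layers 5-6
        self.head = nn.Linear(64, n_classes)    # linear classification head

    def forward(self, x):
        x = self.block2(self.mid(self.block1(self.stem(x))))
        return self.head(x.mean(dim=(2, 3)))    # global average pooling
        </preformat>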
        <sec id="sec-3-2-2">
          <title>Content Text Classifier To classify the textual con</title>
          <p>3.3.1. Content classification with BERT tent, each sentence is preprocessed by lowercasing all
words and removing stopwords to reduce noise and
reOnce the handwritten text has been digitized, it may be tain only the most relevant terms. The processed text is
useful to classify it into categories to highlight the na- then passed through BERT, which maps each token to a
ture of its content. To do so, it is possible to rely on a 768-dimensional latent space. The special [CLS] token,
Transformer-based architecture modeled by a modern capturing the overall sentence meaning, is extracted and
LLM such as BERT. BERT is a Transformer architecture fed into a linear classification layer. A Softmax function
consisting of a single encoder. Its peculiarity lies in the is applied to the output to obtain a probability
distribufact that, given the nature of its training (Next Sentence tion over predefined content categories. The model is
Prediction and Question Answering tasks), it can be eas- trained using multiclass cross-entropy loss.
ily adapted to a new domain by fine-tuning the encoder
on the target dataset and stacking an additional classi- 3.4. Experiment
ifcation layer on top of its head. BERT incorporates a
strong semantic understanding of the English language The overall project is made up of a bunch of experiments
and can therefore be used to process textual content. that have been conducted on the diferent architectural
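          <p>A hedged sketch of this classifier (the checkpoint name is an assumption; the 15-category head matches Section 4.4):</p>
          <preformat>
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class NoteClassifier(nn.Module):
    def __init__(self, n_categories=15, checkpoint="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(768, n_categories)   # BERT hidden size is 768

    def forward(self, **enc):
        cls = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(cls)   # logits; train with nn.CrossEntropyLoss

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = NoteClassifier()
logits = model(**tok("dentist appointment at 10am", return_tensors="pt"))
probs = logits.softmax(dim=-1)   # distribution over content categories
          </preformat>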
        </sec>
      </sec>
      <sec id="sec-experiment">
        <title>3.4. Experiment</title>
        <p>The overall project is made up of a set of experiments that have been conducted on the different architectural components examined in Section 3.3. Since both the calendar parser and the word parser benefit from the U-Net segmentation architecture, both models have been trained using the Dice Loss [25]. The Dice-Sørensen coefficient (Equation 1), also known as the Dice Coefficient, is a statistic used to gauge the similarity of two samples. It is widely employed in the field of segmentation, where it is particularly effective in evaluating the pixel-wise accuracy of a model's output segmentation mask against the ground truth mask:</p>
        <p>DSC(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)   (1)</p>
        <p>where X and Y are the sets of pixels belonging respectively to the predicted and the ground truth masks, |X ∩ Y| is the number of pixels that the two sets have in common, and |X| and |Y| represent the total number of pixels in each set. The calendar parsing is essentially a multi-class classification task, for which Equation 1 becomes</p>
        <p>DSC(X, Y) = (1/C) Σ_{c=1..C} [ 2 Σ_{i=1..N} x_{i,c} y_{i,c} / (Σ_{i=1..N} x_{i,c} + Σ_{i=1..N} y_{i,c}) ]   (2)</p>
        <p>while the word parsing can be treated instead as a binary classification task, for which Equation 1 becomes</p>
        <p>DSC(X, Y) = 2 Σ_{i=1..N} x_i y_i / (Σ_{i=1..N} x_i + Σ_{i=1..N} y_i)   (3)</p>
        <p>Here x_i is the predicted binary value for pixel i (either x_i = 1 or x_i = 0), y_i is the corresponding value in the ground truth mask, and N is the total number of pixels; since the task deals with 2D images, N = W × H, where W and H are respectively the width and the height of the images. The value of the Dice Loss is then obtained as</p>
        <p>L_Dice(X, Y) = 1 − DSC(X, Y)   (4)</p>
        <p>Looking at Equation 4, it can be observed that minimizing the loss effectively means maximizing the Dice Coefficient, which provides a measurement of the similarity between the model's prediction and the ground truth. Once the model is trained, the most suitable prediction of the segmentation map is obtained as y* = argmax_c(p_c) in the multi-class domain and, in the simpler binary case, as y* = 1(p ≥ 0.5), where 1(·) is an indicator of whether the condition is satisfied.</p>
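        <p>A minimal PyTorch rendering of Equations 3-4 for the binary case (the smoothing constant eps is a common addition, not part of the text):</p>
        <preformat>
import torch

def dice_loss(pred, target, eps=1e-6):
    # pred: sigmoid probabilities, target: {0, 1} mask, both (B, 1, H, W)
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dsc = (2 * inter + eps) / (denom + eps)   # Equation 3
    return (1 - dsc).mean()                   # Equation 4, batch-averaged
        </preformat>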
        <sec id="sec-design">
          <title>3.4.1. Design Choices</title>
          <p>The images are converted from 3 channels (RGB) to 1 channel (black and white) and processed as described in Section 3.2. Calendar images are downsampled using a Laplacian Pyramid before being fed into the network. This technique is particularly useful because each blurring-downsampling step retains the image's residuals: once the calendar is segmented, each relevant segment can be upsampled and added back to its corresponding residual in the pyramid to restore the original resolution. This design choice is especially effective in addressing the problem of image resolution, which would otherwise be too low for proper digitization. Regarding resizing operations, day segments are scaled to 32 × 32, while month fragments are resized to 128 × 512. It is also worth noting that, since a ViT [16] is used in the note digitization phase, word images from the IAM Words dataset are resized to 224 × 224. This dimension allows the Vision Transformer to correctly apply its initial convolutional step, which extracts patches using a 16 × 16 kernel with a stride of 16 × 16.</p>
          <p>Concerning the visual encoder-decoder note model, HuggingFace provides a wrapper class combining a vision transformer with a language model decoder. Instead of using the pre-trained TrOCR [26], a new model instance was created, modifying the first convolutional layer of ViT to accommodate the specific input conditions of this experiment. This adjustment was necessary because ViT was originally trained on ImageNet [27], where input images have three channels. In our case, inputs are reduced to a single channel, both to reduce computational complexity and because the task involves calligraphy, which typically uses grayscale intensity and does not benefit from RGB information.</p>
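          <p>A hedged sketch of this construction using HuggingFace's VisionEncoderDecoderModel (the checkpoint names are assumptions in the spirit of the text):</p>
          <preformat>
import torch.nn as nn
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-uncased")

# Swap ViT's first convolution so the patch embedding accepts 1-channel input.
old = model.encoder.embeddings.patch_embeddings.projection
model.encoder.embeddings.patch_embeddings.projection = nn.Conv2d(
    1, old.out_channels, kernel_size=16, stride=16)
          </preformat>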
        </sec>
        <sec id="sec-training">
          <title>3.4.2. Training Strategy</title>
          <p>The Calendar Parser and Word Parser models are trained on the calendar dataset and the IAM Lines dataset, respectively, to develop the capability of producing suitable segmentation maps as output. For both models, the Dice Loss (see Equation 4) is used during the backpropagation phase to update the weights. The overall learning process is optimized using the Adam optimizer, with an initial learning rate tuned in the range of 0.001 to 0.0001. The batch size for the calendar parser is set to 4, while the best configuration for the word parser is obtained with a batch size of 8 and gradient accumulation set to 16 in order to simulate a larger batch size and improve generalization.</p>
          <p>The Day Digitizer model is trained on the numeric fonts dataset and fine-tuned on day segments extracted from the calendar via segmentation. The Month Digitizer, instead, is trained on a set of month segments obtained through calendar parsing and heavily augmented to improve robustness against issues related to the calendar's month field, as discussed in Section 3.1. Both models use the Adam optimizer with an initial learning rate of 0.001 and a batch size of 4.</p>
          <p>Words Parser 0.986 - - binary masks that are generated when a specific
classiHandwriting Digitizer - 0.201 0.272 ifcation segment is requested from the model. Class 0,
which corresponds to the None/Contours category, is
deTable 3 ifned during the training phase because, during manual
Tdeuxrtinclgasitssifipcraet-iotrnawiniitnhgBmERaTkeLLitMa: TpheerfsekcitllscaancqduidiraetdebfyorBEteRxTt segmentation of the calendar for ground truth
generaclassification. The model can be easily adapted to the new tion, some segments do not overlap, leaving holes in the
domain with the fine-tuning, and its special context token (i.e. mask. Class 5 corresponds to the borders of the calendar
[CLS]) allows us to achieve a good level of accuracy in text and is included to ensure that the model focuses on the
classification after just few training epochs. most relevant features—namely, the one-month, two-day,
and three-note segments. Class 4, which corresponds</p>
          <p>Text Classification Train F1 Valid F1 to empty notes, was incorporated into the segmentation
BERT Text Classifier 0.84 0.77 map to allow the system to eliminate, during digitization,
those areas whose average pixel values are close to 1 (i.e.,
white). In such cases, it can be inferred that the user has</p>
          <p>The Text Classifier is trained using Adam as well, not written any reminders for that day.
with a learning rate of 0.00015 and a batch size of 32,
accumulating gradients every 32 steps to simulate an 4.2. Text line segmentation
efective batch size of 32 × 32. This practice, commonly
adopted in NLP tasks, is particularly useful when
computational resources are limited, as larger efective batch
sizes help the model better diferentiate between data
samples and significantly enhance overall performance.</p>
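          <p>A self-contained sketch of this gradient-accumulation scheme (the model and data below are stand-ins, not the actual classifier):</p>
          <preformat>
import torch
import torch.nn as nn

model = nn.Linear(768, 15)            # stand-in for the classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=0.00015)
criterion = nn.CrossEntropyLoss()
accum_steps = 32

optimizer.zero_grad()
for step in range(accum_steps * 4):   # placeholder training loop
    x, y = torch.randn(32, 768), torch.randint(0, 15, (32,))
    loss = criterion(model(x), y) / accum_steps  # scale each contribution
    loss.backward()                              # gradients accumulate
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per 32 batches
        optimizer.zero_grad()
          </preformat>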
        </sec>
      </sec>
      <sec id="sec-results">
        <title>4. Results</title>
        <p>In the current section, the results of the experiment are discussed, highlighting both their strengths and weaknesses. The training outcomes of the Calendar Parser and the other neural modules are reported in Table 2, while the results related to text classification are presented in Table 3.</p>
        <p>Table 2 — Training outcomes of the neural modules (F1 / CER / WER): Words Parser: F1 0.986; Handwriting Digitizer: CER 0.201, WER 0.272.</p>
        <p>Table 3 — Text classification with the BERT LLM (BERT Text Classifier: Train F1 0.84, Valid F1 0.77). The skills acquired by BERT during its pre-training make it a perfect candidate for text classification: the model can be easily adapted to the new domain with fine-tuning, and its special context token (i.e., [CLS]) allows us to achieve a good level of accuracy in text classification after just a few training epochs.</p>
      <sec id="sec-3-3">
        <title>4.1. Calendar Segmentation</title>
        <sec id="sec-3-3-1">
          <title>The model performs the segmentation of the calendar in a satisfactory manner. Figure 4 illustrates the six distinct</title>
          <p>which maps the encoded features to a sequence of
output symbols. In this study, the decoder is constrained to
generate output as a sequence of individual characters,
which are subsequently reassembled into words.</p>
          <p>Although character-level decoding may seem less
efifcient than full-sentence transcription, it is justified by
the limited contextual information available in calendar
notes. In general, longer sentences allow the model to
capture richer semantic dependencies between words.
However, since calendar entries are typically short
annotations, the limited context reduces the benefit of
sequence-level modeling. In this scenario, it is more
appropriate to train the model to learn direct mappings
between visual features and their corresponding
alphabetic symbols.</p>
          <p>This approach yields strong performance at both the
character and word levels. Character-level accuracy is
measured as:
accℎ = 1 −</p>
          <p>CER
while word-level accuracy is computed as:
acc = 1 −</p>
          <p>WER
Although the model performs well on the dataset used
for pretraining, its performance degrades when applied
to our domain—likely due to diferences in resolution
and handwriting style. We conclude that, to improve
accuracy, it is necessary to expand the training dataset
with additional handwriting styles and apply further
finetuning to enhance model robustness.</p>
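          <p>For illustration, the reported accuracies can be computed from edit distances; the jiwer package used below is one common choice, not necessarily the authors' tooling:</p>
          <preformat>
import jiwer

reference = "dentist appointment at 10am"
hypothesis = "dentist appointmant at 10am"

cer = jiwer.cer(reference, hypothesis)   # character error rate
wer = jiwer.wer(reference, hypothesis)   # word error rate
acc_char, acc_word = 1 - cer, 1 - wer    # acc_char = 1 - CER, acc_word = 1 - WER
          </preformat>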
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Text Content Classification</title>
        <p>The results of the classification task are reported in
Table 3 and appear to be satisfactory. BERT has proven to
be an efective tool for achieving the desired outcome,
thanks to its ability to leverage semantic knowledge of
the English language acquired during pretraining. The
classification was performed on a domain consisting of
15 distinct classes. This setting qualifies as a fine-grained
classification task, as opposed to a more general
coarse-grained classification with broader categories. Under this
hypothesis, the trained model can be reused for related
classification tasks without requiring retraining. It is
possible to cluster fine-grained classes into broader
semantic groups and perform classification by mapping each
fine-grained label to its corresponding coarse-grained
category.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>Overall, the experiment demonstrates how a character
recognition task can be addressed through a sequence
of neural models integrated into a single operational
pipeline. The results, particularly in terms of calendar
segmentation and subsequent month and day
classification, are satisfactory. However, the same cannot be
said for the note digitizer, which still exhibits several
limitations—both in terms of Word Error Rate and
Character Error Rate. A promising direction for improvement
would be to enrich the training set with a more
heterogeneous handwriting dataset, incorporating a wider variety
of calligraphic styles. Indeed, data augmentation alone
did not significantly enhance performance when applied
to samples that differ substantially from those seen
during training and validation. A more diverse dataset could
improve the model’s ability to extract robust features and,
consequently, enhance the overall digitization process.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Declaration on Generative AI</title>
      <sec id="sec-5-1">
        <title>During the preparation of this work, the authors</title>
        <p>used ChatGPT, Grammarly in order to: Grammar and
spelling check, Paraphrase and reword. After using this
tool/service, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>GCCE.</surname>
          </string-name>
          <year>2018</year>
          .
          <volume>8574624</volume>
          . [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , L. Bottou,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hafner</surname>
          </string-name>
          , Gradient-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>Proceedings of the IEEE</source>
          , volume
          <volume>86</volume>
          ,
          <year>1998</year>
          . [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Stanford</surname>
          </string-name>
          , Appli-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <year>2014</year>
          . URL: https://api.semanticscholar.org/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          CorpusID:
          <fpage>16468741</fpage>
          . [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Grother</surname>
          </string-name>
          , Nist special database 19 handprinted
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>forms and characters database</source>
          ,
          <year>1995</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          //api.semanticscholar.org/CorpusID:59785963. [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shikler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Or-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>(a) F1 results ganic solar cells defects classification by using a</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Journal of Intelligent Systems</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>2443</fpage>
          -
          <lpage>2464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1002/int.22386. [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable neu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>telligence PP</surname>
          </string-name>
          (
          <year>2015</year>
          ). doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          2646371. [11]
          <string-name>
            <surname>M. B. Bora</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Daimary</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Amitab</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Kandar</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>(b) Loss Results ing cnn-ecoc</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>167</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2403-
          <fpage>2409</fpage>
          . URL: https://www.sciencedirect.com/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>science/article/pii/S1877050920307596. doi:https:</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          //doi.org/10.1016/j.procs.
          <year>2020</year>
          .
          <volume>03</volume>
          .293,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Intelligence</surname>
            and
            <given-names>Data</given-names>
          </string-name>
          <string-name>
            <surname>Science</surname>
            . [12]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Krizhevsky</surname>
            , I. Sutskever,
            <given-names>G. E.</given-names>
          </string-name>
          <string-name>
            <surname>Hinton</surname>
            , Imagenet [1]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Parsons</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Hayashida, classification with deep convolutional neural net-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>text from herculaneum papyri using x-ray ct</article-title>
          ,
          <year>2023</year>
          .
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Infor</source>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stötzner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Homburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bullenkamp</surname>
          </string-name>
          , H. Mara,
          <source>mation Processing Systems</source>
          <volume>25</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>R-CNN based PolygonalWedge Detection Learned Inc</article-title>
          .,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>from Annotated 3D Renderings and Mapped</source>
          Pho- [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jebadurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Jebadurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. L.</given-names>
            <surname>Paulraj</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. V.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Cultural</given-names>
            <surname>Heritage</surname>
          </string-name>
          ,
          <source>The Eurographics Association</source>
          , 2021 Third International Conference on Inven-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          2023. doi:
          <volume>10</volume>
          .2312/gch.20231157. tive Research in Computing Applications (ICIRCA), [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <source>Keeping eyes on the 2021</source>
          , pp.
          <fpage>1037</fpage>
          -
          <lpage>1042</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICIRCA51532.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>road: Understanding driver attention</article-title>
          and
          <source>its role</source>
          <year>2021</year>
          .
          <volume>9544513</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>in safe driving</article-title>
          , in: CEUR Workshop Proceedings, [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>85</fpage>
          -
          <lpage>95</lpage>
          .
          <article-title>Connectionist temporal classification:</article-title>
          <source>Labelling</source>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Boutarfaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tiber</surname>
          </string-name>
          <article-title>- unsegmented sequence data with recurrent neu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>macine, Deep learning for eeg-based motor imagery ral 'networks</article-title>
          , volume
          <year>2006</year>
          ,
          <year>2006</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>classification: Towards enhanced human-machine doi:10.1145/1143844</source>
          .1143891.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>interaction and assistive robotics</article-title>
          , in: CEUR Work- [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>shop Proceedings</source>
          , volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          . L.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>I. Polosukhin</given-names>
          </string-name>
          , At[5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujii</surname>
          </string-name>
          , [invited]
          <article-title>optical character recognition re- tention is all you need</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2017</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          search at google,
          <year>2018</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>266</lpage>
          . doi:
          <volume>10</volume>
          .1109/ //arxiv.org/abs/1706.03762. [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>senborn</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dehghani</surname>
          </string-name>
          , arXiv:
          <fpage>2109</fpage>
          .
          <fpage>10282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , J. Uszko- [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>reit</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 Fei, Imagenet: A large-scale hierarchical image</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>words: Transformers for image recognition at database</article-title>
          , in: 2009 IEEE Conference on Computer
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>scale</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
          <year>2010</year>
          .11929. Vision and Pattern Recognition,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          arXiv:
          <year>2010</year>
          .
          <volume>11929</volume>
          . [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          , E. Tramontana, Using
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>ing of large systems</article-title>
          ,
          <source>in: Proceedings - 2013 7th</source>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>and Software Intensive Systems, CISIS</source>
          <year>2013</year>
          ,
          <year>2013</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          p.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          . doi:
          <volume>10</volume>
          .1109/CISIS.
          <year>2013</year>
          .
          <volume>96</volume>
          . [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Borowik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornaia</surname>
          </string-name>
          , R. Giunta,
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>tronics and Telecommunications</source>
          <volume>61</volume>
          (
          <year>2015</year>
          )
          <fpage>17</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <source>doi:10</source>
          .1515/eletel-2015-
          <volume>0002</volume>
          . [19]
      <ref id="ref35">
        <mixed-citation>[19] <string-name><given-names>O.</given-names> <surname>Ronneberger</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Fischer</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Brox</surname></string-name>, <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>, <source>CoRR abs/1505.04597</source> (<year>2015</year>). URL: http://arxiv.org/abs/1505.04597.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          //arxiv.org/abs/1505.04597. [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>database, in: 2009 IEEE Conference on Com-</mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>puter Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          1109/CVPR.
          <year>2009</year>
          .
          <volume>5206848</volume>
          . [21]
      <ref id="ref37">
        <mixed-citation>[21] <string-name><given-names>E.</given-names> <surname>Olson</surname></string-name>, <article-title>Apriltag: A robust and flexible visual fiducial system</article-title>, in: <source>2011 IEEE International Conference on Robotics and Automation</source>, <year>2011</year>. doi:10.1109/ICRA.2011.5979561.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[22] <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>, <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>, <source>CoRR abs/1810.04805</source> (<year>2018</year>). URL: http://arxiv.org/abs/1810.04805.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[23] <string-name><given-names>M. A.</given-names> <surname>Fischler</surname></string-name>, <string-name><given-names>R. C.</given-names> <surname>Bolles</surname></string-name>, <article-title>Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography</article-title>, <source>Commun. ACM</source> <volume>24</volume> (<year>1981</year>) <fpage>381</fpage>-<lpage>395</lpage>. URL: https://doi.org/10.1145/358669.358692. doi:10.1145/358669.358692.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[24] <string-name><given-names>U.-V.</given-names> <surname>Marti</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Bunke</surname></string-name>, <article-title>The iam-database: An english sentence database for offline handwriting recognition</article-title>, <source>International Journal on Document Analysis and Recognition</source> <volume>5</volume> (<year>2002</year>) <fpage>39</fpage>-<lpage>46</lpage>. doi:10.1007/s100320200071.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[25] <string-name><given-names>C. H.</given-names> <surname>Sudre</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Vercauteren</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ourselin</surname></string-name>, <string-name><given-names>M. J.</given-names> <surname>Cardoso</surname></string-name>, <article-title>Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations</article-title>, <source>CoRR abs/1707.03237</source> (<year>2017</year>). URL: http://arxiv.org/abs/1707.03237.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[26] <string-name><given-names>M.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Lv</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Florencio</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wei</surname></string-name>, <article-title>Trocr: Transformer-based optical character recognition with pre-trained models</article-title>, <year>2022</year>. URL: https://arxiv.org/abs/2109.10282.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>