<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Neural Framework For Handwritten Calendar Parsing and Semantic Content Categorization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoni Gagliard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rayappa David Amar Raj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rama Muni Reddy Yanamala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham</institution>
          ,
          <addr-line>Coimbatore, Tamil Nadu, 641112</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Warangal, Telangana, 506004</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>89</fpage>
      <lpage>98</lpage>
      <abstract>
<p>Digital calendars, accessible via laptops, tablets, and smartphones, offer features such as automatic reminders that improve time management and personal organization. However, older people often struggle to use these tools, preferring to rely on traditional paper calendars. This digital divide can lead to missed appointments and a subsequent negative impact on well-being. We propose an innovative application that can automatically capture and digitize a physical calendar, allowing reminders to be sent and commitments to be tracked even by third parties. By integrating the familiar interface of paper with digital features, our tool aims to improve appointment keeping and reduce the technological gap in time management for the elderly population.</p>
      </abstract>
      <kwd-group>
<kwd>Artificial Intelligence</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Optical Character Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Optical Character Recognition (OCR) refers to a set of techniques used to detect and convert characters from physical documents into editable and searchable digital text. This process typically involves capturing an image of the document using a scanner or a digital camera. The ability to convert various forms of documents can be applied in a wide range of fields, for example the recognition of human handwriting, the digital conversion of labels and manuscripts, the recognition of numerical digits in financial and banking contexts, and the validation of a particular type of handwriting to authenticate the provenance of a manuscript. Another interesting application that has been developed in recent years involves the recognition of ancient characters [<xref ref-type="bibr" rid="ref1">1</xref>]. The objective of the challenge was to digitally reconstruct ancient damaged papyrus scrolls. The scrolls were digitally "unwrapped" using computed tomography (CT) and machine-learning technology. The resulting scans were then turned into a 3D volume of voxels, which was segmented by tracing the crumpled layers of the rolled papyrus in the 3D scan, effectively flattening the images. The last step was detecting ink on papyrus by using machine learning to identify regions of ink in the flattened segments of the papyrus. A particularly remarkable aspect of this application is that the model operated without any prior knowledge of alphabets or handwriting conventions: the digital characters predicted by the model therefore result purely from plotting the local ink detection spots across a generated image. Another important application of handwriting recognition concerns cuneiform tablets [2]. Researchers, instead of using photos, relied on 3D models of the tablets, delivering significantly more reliable results than previous methods. This makes it possible to search through the content of multiple tablets and to compare them with each other. They used 3D models of nearly 2000 cuneiform tablets, many of which are more than 5000 years old and are thus among mankind's oldest surviving written records. What they discovered is an extremely wide range of topics, from shopping lists to court rulings, providing a glimpse into mankind's past several millennia ago. However, despite the results obtained, the challenge remains open: some tablets are heavily ruined, and the writing system was very complex at that age and encompassed several languages. Consequently, effective modeling requires not only higher-quality data but also more sophisticated prior knowledge [3, 4] to capture the complexity and multilingual nature of cuneiform writing.</p>
      <p>It is noteworthy that a major technology company such as Google has also developed its own OCR system in recent years. Google OCR, developed by Google AI [5], is designed to convert a variety of document types, including scanned documents, PDFs, and images captured by a digital camera, into editable text. The system's principal advantages are its high degree of accuracy, achieved through the use of sophisticated deep learning techniques for the recognition and extraction of text with remarkable precision (even in the presence of complex backgrounds or low-quality images); the incorporation of multiple languages, which leads to the capability of processing a wide range of alphabets, including ideograms; and the possibility to process not only printed characters but also handwritten texts.</p>
      <p>OCR systems also play an important social role when employed in applications that address visual impairments, giving blind people the possibility to convert written text into audio (OCR combined with speech synthesis).</p>
      <p>In conclusion, OCR represents a versatile technological
solution with broad applicability across document
processing, data management, and accessibility domains. To achieve our goal, it is necessary to create a
pipeline that can be employed to digitize the content,
assign it to a category and store it in a database or in an
existing digital calendar, for example using the Google
Calendar API or the iOS Calendar API.</p>
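      <p>As an illustration of this last step, the following minimal sketch (function and field names are ours, not part of the proposed system; credential handling is omitted) pushes a digitized reminder into Google Calendar using the official Python client:</p>
      <preformat>
# Hedged sketch: insert a digitized reminder as an all-day event via the
# Google Calendar API (fields are illustrative).
from googleapiclient.discovery import build

def push_reminder(creds, date_iso, text):
    service = build("calendar", "v3", credentials=creds)
    event = {
        "summary": text,              # digitized note content
        "start": {"date": date_iso},  # e.g. "2025-03-14"
        "end": {"date": date_iso},
    }
    return service.events().insert(calendarId="primary", body=event).execute()
      </preformat>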
      <p>Challenges. The process of recognizing and digitizing human handwriting presents several significant challenges for OCR systems. The presence of noise and distortions in images represents a considerable obstacle, as it can negatively impact the efficiency and accuracy of the system. OCR systems may also struggle to recognize characters in scanned images affected by distortions or intrinsic noise, leading to recognition errors. Furthermore, the issue of multilingual support introduces another layer of complexity, as OCR systems may face challenges in processing documents that contain multiple languages, each with its own set of characters and linguistic rules. OCR systems, like human readers, are inherently tied to specific alphabets when recognizing characters. The system learns the local features of the different characters directly from the handwritten text: the input data take the form D = (X, S), with S = {s_1, s_2, ..., s_n}, where each s_i is a symbol, or grapheme (from Gr. graphein, 'to write'), mapped into a digital encoding and decoding alphabet. Since the system learns a specific alphabet directly from the text, and since local features (shape, thickness, corners, edges, ...) are inherent to the characters of an alphabet, the system experiences a significant drop in performance when applied to characters outside its trained alphabet. This significantly limits transfer learning, since the network must be retrained from scratch to recognize an alphabet different from the one on which the model has already been trained.</p>
      <p>Moreover, the diversity of handwriting styles further
complicates the OCR process. Handwriting is highly
individual, requiring OCR systems to handle not only
personal variation but also atypical styles such as cursive
or artistic fonts. The inherent subjectivity of handwriting
makes it imperative to develop an OCR system that is
robust: it should handle as much as possible the actual
changes in font and style from text to text and decipher
the myriad ways in which individuals express themselves
on paper. Taking the Latin alphabet into account, we can
differentiate the characters into capital letters and
lowercase letters. Notably, capital letters tend to exhibit
lower variability, whereas lowercase letters—though
subject to basic calligraphic conventions—reflect more
personal handwriting traits and a broader range of stylistic
variation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>Early eforts in text digitization date back to the devel</title>
        <p>opment of LeNet [6], which demonstrated that a shallow
convolutional neural network could accurately recognize
handwritten digits in 32 × 32 grayscale images.
Building on this, [7] proposed a methodology leveraging the
MNIST dataset [8] to address more complex handwriting
recognition tasks, but also coping with image defects
and noise [9]. Their approach emphasized preprocessing
steps—including grayscale normalization, cropping, and
resizing—to improve the recognition of isolated
handwritten characters. They showed the effectiveness of
convolutional neural networks (CNNs) in extracting
local features from such inputs. The utility of CNNs for
handwriting recognition has since been widely adopted.</p>
        <p>
          In 2015, [
          <xref ref-type="bibr" rid="ref15 ref40 ref52 ref55 ref9">10</xref>
          ] introduced the CRNN architecture, which
combines CNNs for spatial feature extraction with
recurrent neural networks (RNNs) to model character
sequences. This design is particularly suited for
handwritten word recognition, as RNNs can capture sequential
dependencies—albeit with limitations such as the
vanishing gradient problem. More recently, [11] proposed a
system that combines CNNs with Error Correcting
Output Codes (ECOC) to enhance classification robustness.
        </p>
        <p>Feature extraction is performed using architectures such
as LeNet [6] and AlexNet [12], while classification is
carried out by training an ensemble of binary Support Vector
Machines (SVMs) via ECOC. This method decomposes
the multiclass problem into several binary subproblems,
yielding higher accuracy compared to CNNs followed by
a standard softmax classifier—particularly on the MNIST
dataset. In 2021, [13] introduced a more complex model
based on CRNNs for full handwritten document
recognition. This approach integrates CNNs for visual feature
extraction with Long Short-Term Memory (LSTM)
networks to model sequential dependencies across words or
phrases. The system is trained using the Connectionist
Temporal Classification (CTC) loss function [14], which
enables sequence prediction without requiring explicit
character-level alignment. CTC considers all possible
alignments and computes a summed probability,
allowing for end-to-end training even when segmentation is
ambiguous. The next leap in this domain has been driven
by the application of transformers [15] to computer
vision, particularly through the introduction of the Vision
Transformer (ViT) [16]. ViT replaces convolutional
layers with a pure attention mechanism, enabling the model
to capture long-range dependencies in visual inputs. This
shift opens new directions in handwriting recognition
by enabling the integration of global context across
entire input images, beyond the local receptive fields of
traditional CNNs.</p>
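      <p>As a minimal sketch of the CTC objective mentioned above (shapes and alphabet size are illustrative, not taken from the cited systems), PyTorch exposes it directly as nn.CTCLoss:</p>
      <preformat>
import torch
import torch.nn as nn

T, N, C = 50, 4, 80          # timesteps, batch size, alphabet size (0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(2)           # per-step log-probs
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # target transcriptions
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# CTC sums over all compatible alignments, so no explicit character-level
# alignment is required during training.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
      </preformat>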
      <p>It is worth pointing out that, before the advent of the CRNN and ViT models, the majority of the proposed approaches achieved high performance in terms of recognition accuracy but showed huge limitations: the input for the first OCR neural models was necessarily provided in the form of individual characters of an alphabet. The networks were able to classify the salient features of a character and provide a classification consisting of the corresponding digital label, but they were unable to build a context (both in terms of the previous words in a sentence and in terms of the individual characters in a single word), and profound refactoring of the software systems was often required [17, 18].</p>
      <p>Table 1 — Datasets: the calendar dataset has been created manually by using a graphic tablet; the numeric digits dataset is a collection of numbers in the range [1, 31] generated from computer system fonts, while IAM (Institut für Informatik und Angewandte Mathematik) Handwritten is a dataset belonging to the University of Bern.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation Description</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>To ensure accurate calendar digitization, the proposed
pipeline is composed of several neural modules. The ini- To conduct the experiments, we independently created a
tial component is a segmentation network based on dataset of handwritten calendar images using a graphic
the U-Net architecture [19], which enables the model to tablet. The corresponding ground truth segmentation
identify the most semantically relevant regions of the masks were annotated using the Oxford VGG Image
Ancalendar. This network produces a segmentation mask, notator tool. The dataset, although limited in size (see
allowing for the algorithmic extraction of targeted calen- Table 1), is suficient for our use case. Unlike generic
dar segments. Once the relevant segments are extracted, segmentation tasks—such as those involving the
heterothey are processed by two separate ResNet6-based con- geneous images found in datasets like ImageNet—our
volutional neural networks [20], responsible for recog- domain involves structurally homogeneous data. All
innizing the month and day digits, respectively. While put images depict the same subject: a calendar with a
the recognition of day digits benefits from standardized symmetric and well-defined layout. This structural
regupatterns—since the digits recur uniformly across sam- larity allows for meaningful learning even with a smaller
ples—the month field presents greater variability. Users number of examples. To simulate realistic acquisition
are allowed to write the month manually, which intro- conditions, we applied synthetic distortions using
Phoduces inconsistencies in handwriting style and position- toshop to introduce parallax efects (horizontal, vertical,
ing. To address these variations and improve classifi- or both). These transformations reflect the common
scecation robustness, the calendar template incorporates a nario where the camera capturing the calendar may not
distinct pair of AprilTags [21] for each month of each year, be perfectly aligned with the sheet. Incorporating such
positioned at the left and right margins of the month field. distortions during training improves model robustness in
This design introduces spatial regularity, guiding the con- real-world scenarios. To correct these geometric
distorvolutional layers toward more stable visual features and tions at inference time, we integrated a RANSAC
(Ranmitigating the ambiguity caused by user-written titles. dom Sample Consensus) [23] module into our pipeline.
The final step of the pipeline involves processing the re- The four AprilTags placed at the corners of each
calenminder segments, where a second U-Net is employed to dar template serve as keypoints, allowing RANSAC to
segment user-written sentences into individual words. estimate a homography and realign the captured image
These word segments are then passed to a digitization with the reference calendar layout.
module that combines the Vision Transformer (ViT) [16] The numeric digits dataset is assembled by saving
picand BERT [22]. While ViT captures visual features at a tures of numbers from 1 to 31 using classical fonts
availglobal scale, BERT processes the embedded representa- able at the level of the computer’s operating system. This
tions to extract semantic meaning. An additional com- choice is useful both for training a compact ResNet to
ponent of the system enables content classification at recognize calendar day numbers and for allowing the
the line level. Specifically, each textual line is processed model to generalize beyond the specific font used in the
through BERT for semantic categorization. With mini- study.</p>
        <p>Other useful data are retrieved from the public Internet. In particular, the IAM Handwriting Datasets [24] are used to pre-train the Word Parser model and the Note Digitizer. IAM Handwriting is a dataset that contains a collection of words cut from portions of written texts, and its main advantage, as discussed in the introduction, lies in enabling the model to learn relevant character features. This approach enables the system to learn the sequential features of the characters that together make up a particular word.</p>
      </sec>
      <sec id="sec-data-processing">
        <title>3.2. Data Processing</title>
        <p>Figure 1: (a) example of an original image taken from the IAM Lines dataset, "suitable to be paraded before the public"; (b) the original image filtered with the Laplacian operator to retrieve its details, in particular the edges; (c) pixel values of the detail image corrected to 0, 128, or 255, with the filtering applied using selected intensity values as thresholds; (d) final binary inverse thresholding.</p>
        <p>Standard computer vision techniques are employed to process the images contained in the datasets. Gaussian blurring is preferred over classical blurring, since using a kernel derived from a Gaussian distribution helps to attenuate image noise and details while ensuring better preservation of edges and contours. The image pixel intensity values are made uniform by applying a binary thresholding technique that sets to a maximum value all pixels above a certain threshold (represented, for example, in the case of the IAM dataset, by the grayscale levels annotated in the data), while setting the values below it to 0. Moreover, all pixel values in the image are additionally scaled by a factor of 1/255 to bring them into the interval [0, 1], and normalized around a certain mean and standard deviation. These general steps ensure that the data are in a format more suitable for model processing.</p>
        <p>Regarding the calendar dataset, to ease the
computational segmentation process, images are downsampled
using a Laplacian Pyramid. The reason why this
technique is applied to the calendar data before feeding them
into the network will be made clearer in Section 3.4.1.</p>
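        <p>A sketch of this downsampling, assuming OpenCV's pyrDown/pyrUp primitives (the number of levels is illustrative): each level stores the residual needed to restore the original resolution after segmentation.</p>
        <preformat>
import cv2
import numpy as np

def laplacian_pyramid(img, levels=2):
    current = img.astype(np.float32)
    residuals = []
    for _ in range(levels):
        down = cv2.pyrDown(current)                        # blur + downsample
        up = cv2.pyrUp(down, dstsize=current.shape[1::-1])
        residuals.append(current - up)                     # detail residual
        current = down
    return current, residuals          # low-resolution input + per-level details

def restore(low_res, residuals):
    current = low_res
    for res in reversed(residuals):
        current = cv2.pyrUp(current, dstsize=res.shape[1::-1]) + res
    return current
        </preformat>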
        <p>It is also worth noting that, since a word segmentation
block for the calendar note segments has been included
in the architecture, it needs to be trained on pairs (image,
word mask) that are not directly available in the dataset.</p>
        <p>This is why a dedicated image processing step is applied
online during dataset generation to obtain, from a single
text line image, its corresponding ground-truth word
segmentation mask. The original image (Figure 1a) is
filtered with a Laplacian operator in order to extract its
details, borders, and edges (Figure 1b). This operation is
particularly useful in the current data domain because, as
previously mentioned in Section 3.1, the images from the
IAM Line/Sentence datasets are collections of cropped
word/text lines. By further filtering the pixel values of the
detailed image, it becomes possible to highlight borders
and edges, and thus delineate the cutting boundaries
with respect to their content (Figure 1c). At this point, by
applying a binary inverse threshold function, the final
word-level binary segmentation mask can be obtained
(Figure 1d).</p>
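        <p>The following hedged sketch reproduces the described mask-generation steps (the threshold and quantization values are assumptions):</p>
        <preformat>
import cv2
import numpy as np

def word_mask(line_img):
    detail = cv2.Laplacian(line_img, cv2.CV_16S, ksize=3)  # edges and details
    detail = cv2.convertScaleAbs(detail)
    # quantize the detail intensities to three levels (0 / 128 / 255)
    quantized = np.where(detail &lt; 64, 0,
                np.where(detail &lt; 192, 128, 255)).astype(np.uint8)
    # inverse binary threshold yields the word-level ground-truth mask
    _, mask = cv2.threshold(quantized, 100, 255, cv2.THRESH_BINARY_INV)
    return mask
        </preformat>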
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Neural blocks</title>
        <p>The digitization system is composed of five neural
modules working in parallel. The first is the Calendar Parser,
a U-Net model with encoder-decoder structure and skip
connections, which takes as input a calendar image
(downsampled using a Laplacian Pyramid) and performs
pixel-wise classification into six semantic regions.
Initially, a larger number of classes was tested to directly
model the month and day digits at this stage, but this
was later simplified, as discussed in Section 3.4.</p>
        <p>The Word Parser is also based on a U-Net architecture and is pre-trained using image/mask pairs from the IAM
Line/Sentence dataset, where masks are generated as
described in Section 3.2. It produces a binary
segmentation mask for each text line, and, as shown in Section 4,
generalizes well to calendar data despite significant
visual differences from the training set. Both the Month
Digitizer and Day Digitizer are implemented as compact
ResNet architectures composed of six convolutional
layers with two residual connections. Each model concludes
with a linear classification head, consisting of either 12 or
31 output units depending on whether the task is month
or day classification, respectively. This minimal design
was chosen to ensure fast inference while maintaining
adequate performance given the low visual variability of
the segmented inputs.</p>
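        <p>An illustrative PyTorch sketch of such a compact digitizer follows; the channel widths are assumptions, while the six convolutional layers, two residual connections, and the 12- or 31-unit head come from the description above.</p>
        <preformat>
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):                      # residual (skip) connection
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class Digitizer(nn.Module):
    def __init__(self, n_classes):             # 12 for months, 31 for days
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.block1 = ResBlock(32)              # conv layers 2-3
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                 nn.ReLU())     # conv layer 4
        self.block2 = ResBlock(64)              # conv layers 5-6
        self.head = nn.Linear(64, n_classes)    # linear classification head

    def forward(self, x):
        x = self.block2(self.mid(self.block1(self.stem(x))))
        return self.head(x.mean(dim=(2, 3)))    # global average pooling
        </preformat>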
        <sec id="sec-3-2-2">
          <title>Content Text Classifier To classify the textual con</title>
          <p>3.3.1. Content classification with BERT tent, each sentence is preprocessed by lowercasing all
words and removing stopwords to reduce noise and
reOnce the handwritten text has been digitized, it may be tain only the most relevant terms. The processed text is
useful to classify it into categories to highlight the na- then passed through BERT, which maps each token to a
ture of its content. To do so, it is possible to rely on a 768-dimensional latent space. The special [CLS] token,
Transformer-based architecture modeled by a modern capturing the overall sentence meaning, is extracted and
LLM such as BERT. BERT is a Transformer architecture fed into a linear classification layer. A Softmax function
consisting of a single encoder. Its peculiarity lies in the is applied to the output to obtain a probability
distribufact that, given the nature of its training (Next Sentence tion over predefined content categories. The model is
Prediction and Question Answering tasks), it can be eas- trained using multiclass cross-entropy loss.
ily adapted to a new domain by fine-tuning the encoder
on the target dataset and stacking an additional classi- 3.4. Experiment
ifcation layer on top of its head. BERT incorporates a
strong semantic understanding of the English language The overall project is made up of a bunch of experiments
and can therefore be used to process textual content. that have been conducted on the diferent architectural
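          <p>A hedged sketch of this classifier (the checkpoint name is an assumption; the 15-category head matches Section 4.4):</p>
          <preformat>
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class NoteClassifier(nn.Module):
    def __init__(self, n_categories=15, checkpoint="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(768, n_categories)   # BERT hidden size is 768

    def forward(self, **enc):
        cls = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(cls)   # logits; train with nn.CrossEntropyLoss

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = NoteClassifier()
logits = model(**tok("dentist appointment at 10am", return_tensors="pt"))
probs = logits.softmax(dim=-1)   # distribution over content categories
          </preformat>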
        </sec>
      </sec>
      <sec id="sec-experiment">
        <title>3.4. Experiment</title>
        <p>The overall project is made up of a set of experiments that have been conducted on the different architectural components examined in Section 3.3. Since both the calendar parser and the word parser benefit from the U-Net segmentation architecture, both models have been trained using the Dice Loss [25]. The Dice-Sørensen coefficient (Equation 1), also known as the Dice Coefficient, is a statistic used to gauge the similarity of two samples. It is widely employed in the field of segmentation, where it is particularly effective in evaluating the pixel-wise accuracy of a model's output segmentation mask against the ground truth mask:</p>
        <p>DSC(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)   (1)</p>
        <p>where X and Y are the sets of pixels belonging respectively to the predicted and the ground truth masks, |X ∩ Y| is the number of pixels that the two sets have in common, and |X| and |Y| represent the total number of pixels in each set. The calendar parsing is essentially a multi-class classification task, for which Equation 1 becomes</p>
        <p>DSC(X, Y) = (1/C) Σ_{c=1..C} [ 2 Σ_{i=1..N} x_{i,c} y_{i,c} / (Σ_{i=1..N} x_{i,c} + Σ_{i=1..N} y_{i,c}) ]   (2)</p>
        <p>while the word parsing can be treated instead as a binary classification task, for which Equation 1 becomes</p>
        <p>DSC(X, Y) = 2 Σ_{i=1..N} x_i y_i / (Σ_{i=1..N} x_i + Σ_{i=1..N} y_i)   (3)</p>
        <p>Here x_i is the predicted binary value for pixel i (either x_i = 1 or x_i = 0), y_i is the corresponding value in the ground truth mask, and N is the total number of pixels; since the task deals with 2D images, N = W × H, where W and H are respectively the width and the height of the images. The value of the Dice Loss is then obtained as</p>
        <p>L_Dice(X, Y) = 1 − DSC(X, Y)   (4)</p>
        <p>Looking at Equation 4, it can be observed that minimizing the loss effectively means maximizing the Dice Coefficient, which provides a measurement of the similarity between the model's prediction and the ground truth. Once the model is trained, the most suitable prediction of the segmentation map is obtained as y* = argmax_c(p_c) in the multi-class domain and, in the simpler binary case, as y* = 1(p ≥ 0.5), where 1(·) is an indicator of whether the condition is satisfied.</p>
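        <p>A minimal PyTorch rendering of Equations 3-4 for the binary case (the smoothing constant eps is a common addition, not part of the text):</p>
        <preformat>
import torch

def dice_loss(pred, target, eps=1e-6):
    # pred: sigmoid probabilities, target: {0, 1} mask, both (B, 1, H, W)
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dsc = (2 * inter + eps) / (denom + eps)   # Equation 3
    return (1 - dsc).mean()                   # Equation 4, batch-averaged
        </preformat>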
        <sec id="sec-design">
          <title>3.4.1. Design Choices</title>
          <p>The images are converted from 3 channels (RGB) to 1 channel (black and white) and processed as described in Section 3.2. Calendar images are downsampled using a Laplacian Pyramid before being fed into the network. This technique is particularly useful because each blurring-downsampling step retains the image's residuals: once the calendar is segmented, each relevant segment can be upsampled and added back to its corresponding residual in the pyramid to restore the original resolution. This design choice is especially effective in addressing the problem of image resolution, which would otherwise be too low for proper digitization. Regarding resizing operations, day segments are scaled to 32 × 32, while month fragments are resized to 128 × 512. It is also worth noting that, since a ViT [16] is used in the note digitization phase, word images from the IAM Words dataset are resized to 224 × 224. This dimension allows the Vision Transformer to correctly apply its initial convolutional step, which extracts patches using a 16 × 16 kernel with a stride of 16 × 16.</p>
          <p>Concerning the visual encoder-decoder note model, HuggingFace provides a wrapper class combining a vision transformer with a language model decoder. Instead of using the pre-trained TrOCR [26], a new model instance was created, modifying the first convolutional layer of ViT to accommodate the specific input conditions of this experiment. This adjustment was necessary because ViT was originally trained on ImageNet [27], where input images have three channels. In our case, inputs are reduced to a single channel, both to reduce computational complexity and because the task involves calligraphy, which typically uses grayscale intensity and does not benefit from RGB information.</p>
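          <p>A hedged sketch of this construction using HuggingFace's VisionEncoderDecoderModel (the checkpoint names are assumptions in the spirit of the text):</p>
          <preformat>
import torch.nn as nn
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-uncased")

# Swap ViT's first convolution so the patch embedding accepts 1-channel input.
old = model.encoder.embeddings.patch_embeddings.projection
model.encoder.embeddings.patch_embeddings.projection = nn.Conv2d(
    1, old.out_channels, kernel_size=16, stride=16)
          </preformat>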
        </sec>
        <sec id="sec-training">
          <title>3.4.2. Training Strategy</title>
          <p>The Calendar Parser and Word Parser models are trained on the calendar dataset and the IAM Lines dataset, respectively, to develop the capability of producing suitable segmentation maps as output. For both models, the Dice Loss (see Equation 4) is used during the backpropagation phase to update the weights. The overall learning process is optimized using the Adam optimizer, with an initial learning rate tuned in the range of 0.001 to 0.0001. The batch size for the calendar parser is set to 4, while the best configuration for the word parser is obtained with a batch size of 8 and gradient accumulation set to 16 in order to simulate a larger batch size and improve generalization.</p>
          <p>The Day Digitizer model is trained on the numeric fonts dataset and fine-tuned on day segments extracted from the calendar via segmentation. The Month Digitizer, instead, is trained on a set of month segments obtained through calendar parsing and heavily augmented to improve robustness against issues related to the calendar's month field, as discussed in Section 3.1. Both models use the Adam optimizer with an initial learning rate of 0.001 and a batch size of 4.</p>
          <p>Words Parser 0.986 - - binary masks that are generated when a specific
classiHandwriting Digitizer - 0.201 0.272 ifcation segment is requested from the model. Class 0,
which corresponds to the None/Contours category, is
deTable 3 ifned during the training phase because, during manual
Tdeuxrtinclgasitssifipcraet-iotrnawiniitnhgBmERaTkeLLitMa: TpheerfsekcitllscaancqduidiraetdebfyorBEteRxTt segmentation of the calendar for ground truth
generaclassification. The model can be easily adapted to the new tion, some segments do not overlap, leaving holes in the
domain with the fine-tuning, and its special context token (i.e. mask. Class 5 corresponds to the borders of the calendar
[CLS]) allows us to achieve a good level of accuracy in text and is included to ensure that the model focuses on the
classification after just few training epochs. most relevant features—namely, the one-month, two-day,
and three-note segments. Class 4, which corresponds</p>
          <p>Text Classification Train F1 Valid F1 to empty notes, was incorporated into the segmentation
BERT Text Classifier 0.84 0.77 map to allow the system to eliminate, during digitization,
those areas whose average pixel values are close to 1 (i.e.,
white). In such cases, it can be inferred that the user has</p>
          <p>The Text Classifier is trained using Adam as well, not written any reminders for that day.
with a learning rate of 0.00015 and a batch size of 32,
accumulating gradients every 32 steps to simulate an 4.2. Text line segmentation
efective batch size of 32 × 32. This practice, commonly
adopted in NLP tasks, is particularly useful when
computational resources are limited, as larger efective batch
sizes help the model better diferentiate between data
samples and significantly enhance overall performance.</p>
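          <p>A self-contained sketch of this gradient-accumulation scheme (the model and data below are stand-ins, not the actual classifier):</p>
          <preformat>
import torch
import torch.nn as nn

model = nn.Linear(768, 15)            # stand-in for the classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=0.00015)
criterion = nn.CrossEntropyLoss()
accum_steps = 32

optimizer.zero_grad()
for step in range(accum_steps * 4):   # placeholder training loop
    x, y = torch.randn(32, 768), torch.randint(0, 15, (32,))
    loss = criterion(model(x), y) / accum_steps  # scale each contribution
    loss.backward()                              # gradients accumulate
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per 32 batches
        optimizer.zero_grad()
          </preformat>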
        </sec>
      </sec>
      <sec id="sec-results">
        <title>4. Results</title>
        <p>In the current section, the results of the experiment are discussed, highlighting both their strengths and weaknesses. The training outcomes of the Calendar Parser and the other neural modules are reported in Table 2, while the results related to text classification are presented in Table 3.</p>
        <p>Table 2 — Training outcomes of the neural modules (F1 / CER / WER): Words Parser: F1 0.986; Handwriting Digitizer: CER 0.201, WER 0.272.</p>
        <p>Table 3 — Text classification with the BERT LLM (BERT Text Classifier: Train F1 0.84, Valid F1 0.77). The skills acquired by BERT during its pre-training make it a perfect candidate for text classification: the model can be easily adapted to the new domain with fine-tuning, and its special context token (i.e., [CLS]) allows us to achieve a good level of accuracy in text classification after just a few training epochs.</p>
      <sec id="sec-3-3">
        <title>4.1. Calendar Segmentation</title>
        <sec id="sec-3-3-1">
          <title>The model performs the segmentation of the calendar in a satisfactory manner. Figure 4 illustrates the six distinct</title>
          <p>which maps the encoded features to a sequence of
output symbols. In this study, the decoder is constrained to
generate output as a sequence of individual characters,
which are subsequently reassembled into words.</p>
          <p>Although character-level decoding may seem less
efifcient than full-sentence transcription, it is justified by
the limited contextual information available in calendar
notes. In general, longer sentences allow the model to
capture richer semantic dependencies between words.
However, since calendar entries are typically short
annotations, the limited context reduces the benefit of
sequence-level modeling. In this scenario, it is more
appropriate to train the model to learn direct mappings
between visual features and their corresponding
alphabetic symbols.</p>
          <p>This approach yields strong performance at both the
character and word levels. Character-level accuracy is
measured as:
accℎ = 1 −</p>
          <p>CER
while word-level accuracy is computed as:
acc = 1 −</p>
          <p>WER
Although the model performs well on the dataset used
for pretraining, its performance degrades when applied
to our domain—likely due to diferences in resolution
and handwriting style. We conclude that, to improve
accuracy, it is necessary to expand the training dataset
with additional handwriting styles and apply further
finetuning to enhance model robustness.</p>
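          <p>For illustration, the reported accuracies can be computed from edit distances; the jiwer package used below is one common choice, not necessarily the authors' tooling:</p>
          <preformat>
import jiwer

reference = "dentist appointment at 10am"
hypothesis = "dentist appointmant at 10am"

cer = jiwer.cer(reference, hypothesis)   # character error rate
wer = jiwer.wer(reference, hypothesis)   # word error rate
acc_char, acc_word = 1 - cer, 1 - wer    # acc_char = 1 - CER, acc_word = 1 - WER
          </preformat>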
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Text Content Classification</title>
        <p>The results of the classification task are reported in
Table 3 and appear to be satisfactory. BERT has proven to
be an efective tool for achieving the desired outcome,
thanks to its ability to leverage semantic knowledge of
the English language acquired during pretraining. The
classification was performed on a domain consisting of
15 distinct classes. This setting qualifies as a fine-grained
classification task, as opposed to a more general
coarse-grained classification with broader categories. Under this
hypothesis, the trained model can be reused for related
classification tasks without requiring retraining. It is
possible to cluster fine-grained classes into broader
semantic groups and perform classification by mapping each
fine-grained label to its corresponding coarse-grained
category.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>Overall, the experiment demonstrates how a character
recognition task can be addressed through a sequence
of neural models integrated into a single operational
pipeline. The results, particularly in terms of calendar
segmentation and subsequent month and day
classification, are satisfactory. However, the same cannot be
said for the note digitizer, which still exhibits several
limitations—both in terms of Word Error Rate and
Character Error Rate. A promising direction for improvement
would be to enrich the training set with a more
heterogeneous handwriting dataset, incorporating a wider variety
of calligraphic styles. Indeed, data augmentation alone
did not significantly enhance performance when applied
to samples that differ substantially from those seen
during training and validation. A more diverse dataset could
improve the model’s ability to extract robust features and,
consequently, enhance the overall digitization process.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Declaration on Generative AI</title>
      <sec id="sec-5-1">
        <title>During the preparation of this work, the authors</title>
        <p>used ChatGPT, Grammarly in order to: Grammar and
spelling check, Paraphrase and reword. After using this
tool/service, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>GCCE.</surname>
          </string-name>
          <year>2018</year>
          .
          <volume>8574624</volume>
          . [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , L. Bottou,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hafner</surname>
          </string-name>
          , Gradient-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>Proceedings of the IEEE</source>
          , volume
          <volume>86</volume>
          ,
          <year>1998</year>
          . [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Stanford</surname>
          </string-name>
          , Appli-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <year>2014</year>
          . URL: https://api.semanticscholar.org/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          CorpusID:
          <fpage>16468741</fpage>
          . [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Grother</surname>
          </string-name>
          , Nist special database 19 handprinted
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>forms and characters database</source>
          ,
          <year>1995</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          //api.semanticscholar.org/CorpusID:59785963. [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shikler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Or-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>(a) F1 results ganic solar cells defects classification by using a</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Journal of Intelligent Systems</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>2443</fpage>
          -
          <lpage>2464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1002/int.22386. [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable neu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>telligence PP</surname>
          </string-name>
          (
          <year>2015</year>
          ). doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          2646371. [11]
          <string-name>
            <surname>M. B. Bora</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Daimary</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Amitab</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Kandar</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>(b) Loss Results ing cnn-ecoc</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>167</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2403-
          <fpage>2409</fpage>
          . URL: https://www.sciencedirect.com/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>science/article/pii/S1877050920307596. doi:https:</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          //doi.org/10.1016/j.procs.
          <year>2020</year>
          .
          <volume>03</volume>
          .293,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Intelligence</surname>
            and
            <given-names>Data</given-names>
          </string-name>
          <string-name>
            <surname>Science</surname>
            . [12]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Krizhevsky</surname>
            , I. Sutskever,
            <given-names>G. E.</given-names>
          </string-name>
          <string-name>
            <surname>Hinton</surname>
            , Imagenet [1]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Parsons</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Hayashida, classification with deep convolutional neural net-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>text from herculaneum papyri using x-ray ct</article-title>
          ,
          <year>2023</year>
          .
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Infor</source>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stötzner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Homburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bullenkamp</surname>
          </string-name>
          , H. Mara,
          <source>mation Processing Systems</source>
          <volume>25</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>R-CNN based PolygonalWedge Detection Learned Inc</article-title>
          .,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>from Annotated 3D Renderings and Mapped</source>
          Pho- [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jebadurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Jebadurai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. L.</given-names>
            <surname>Paulraj</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. V.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Cultural</given-names>
            <surname>Heritage</surname>
          </string-name>
          ,
          <source>The Eurographics Association</source>
          , 2021 Third International Conference on Inven-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          2023. doi:
          <volume>10</volume>
          .2312/gch.20231157. tive Research in Computing Applications (ICIRCA), [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <source>Keeping eyes on the 2021</source>
          , pp.
          <fpage>1037</fpage>
          -
          <lpage>1042</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICIRCA51532.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>road: Understanding driver attention</article-title>
          and
          <source>its role</source>
          <year>2021</year>
          .
          <volume>9544513</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>in safe driving</article-title>
          , in: CEUR Workshop Proceedings, [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>85</fpage>
          -
          <lpage>95</lpage>
          .
          <article-title>Connectionist temporal classification:</article-title>
          <source>Labelling</source>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Boutarfaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tiber</surname>
          </string-name>
          <article-title>- unsegmented sequence data with recurrent neu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>macine, Deep learning for eeg-based motor imagery ral 'networks</article-title>
          , volume
          <year>2006</year>
          ,
          <year>2006</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>classification: Towards enhanced human-machine doi:10.1145/1143844</source>
          .1143891.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>interaction and assistive robotics</article-title>
          , in: CEUR Work- [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>shop Proceedings</source>
          , volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          . L.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>I. Polosukhin</given-names>
          </string-name>
          , At[5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujii</surname>
          </string-name>
          , [invited]
          <article-title>optical character recognition re- tention is all you need</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2017</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          search at google,
          <year>2018</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>266</lpage>
          . doi:
          <volume>10</volume>
          .1109/ //arxiv.org/abs/1706.03762. [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>senborn</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dehghani</surname>
          </string-name>
          , arXiv:
          <fpage>2109</fpage>
          .
          <fpage>10282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , J. Uszko- [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>reit</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 Fei, Imagenet: A large-scale hierarchical image</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>words: Transformers for image recognition at database</article-title>
          , in: 2009 IEEE Conference on Computer
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>scale</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
          <year>2010</year>
          .11929. Vision and Pattern Recognition,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          arXiv:
          <year>2010</year>
          .
          <volume>11929</volume>
          . [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          , E. Tramontana, Using
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>ing of large systems</article-title>
          ,
          <source>in: Proceedings - 2013 7th</source>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>and Software Intensive Systems, CISIS</source>
          <year>2013</year>
          ,
          <year>2013</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          p.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          . doi:
          <volume>10</volume>
          .1109/CISIS.
          <year>2013</year>
          .
          <volume>96</volume>
          . [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Borowik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornaia</surname>
          </string-name>
          , R. Giunta,
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>tronics and Telecommunications</source>
          <volume>61</volume>
          (
          <year>2015</year>
          )
          <fpage>17</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <source>doi:10</source>
          .1515/eletel-2015-
          <volume>0002</volume>
          . [19]
      <ref id="ref35">
        <mixed-citation>[19] <string-name><given-names>O.</given-names> <surname>Ronneberger</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Fischer</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Brox</surname></string-name>, <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>, <source>CoRR abs/1505.04597</source> (<year>2015</year>). URL: http://arxiv.org/abs/1505.04597.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          //arxiv.org/abs/1505.04597. [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>database, in: 2009 IEEE Conference on Com-</mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>puter Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          1109/CVPR.
          <year>2009</year>
          .
          <volume>5206848</volume>
          . [21]
      <ref id="ref37">
        <mixed-citation>[21] <string-name><given-names>E.</given-names> <surname>Olson</surname></string-name>, <article-title>Apriltag: A robust and flexible visual fiducial system</article-title>, in: <source>2011 IEEE International Conference on Robotics and Automation</source>, <year>2011</year>. doi:10.1109/ICRA.2011.5979561.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[22] <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>, <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>, <source>CoRR abs/1810.04805</source> (<year>2018</year>). URL: http://arxiv.org/abs/1810.04805.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[23] <string-name><given-names>M. A.</given-names> <surname>Fischler</surname></string-name>, <string-name><given-names>R. C.</given-names> <surname>Bolles</surname></string-name>, <article-title>Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography</article-title>, <source>Commun. ACM</source> <volume>24</volume> (<year>1981</year>) <fpage>381</fpage>-<lpage>395</lpage>. URL: https://doi.org/10.1145/358669.358692. doi:10.1145/358669.358692.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[24] <string-name><given-names>U.-V.</given-names> <surname>Marti</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Bunke</surname></string-name>, <article-title>The iam-database: An english sentence database for offline handwriting recognition</article-title>, <source>International Journal on Document Analysis and Recognition</source> <volume>5</volume> (<year>2002</year>) <fpage>39</fpage>-<lpage>46</lpage>. doi:10.1007/s100320200071.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[25] <string-name><given-names>C. H.</given-names> <surname>Sudre</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Vercauteren</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ourselin</surname></string-name>, <string-name><given-names>M. J.</given-names> <surname>Cardoso</surname></string-name>, <article-title>Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations</article-title>, <source>CoRR abs/1707.03237</source> (<year>2017</year>). URL: http://arxiv.org/abs/1707.03237.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[26] <string-name><given-names>M.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Lv</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Florencio</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wei</surname></string-name>, <article-title>Trocr: Transformer-based optical character recognition with pre-trained models</article-title>, <year>2022</year>. URL: https://arxiv.org/abs/2109.10282.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>