<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Low- vs High-level Lemmatization for Historical Languages. A Case Study on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simonetta Montemagni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale "Antonio Zampolli", Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Lemmatization remains a foundational yet challenging task in the processing of historical Italian texts, due to the complex interplay of orthographic, morphological, and diatopic variation. A crucial, yet often overlooked, aspect is the degree of normalization applied during lemmatization. A conservative approach preserves attested historical forms, ensuring greater linguistic fidelity but increasing data sparsity. Conversely, an abstract normalization strategy aligns historical variants with standardized contemporary lemmas, improving generalization but potentially introducing inaccurate mappings. In this paper, we present a comparative evaluation of conservative and normalized lemmatization strategies for historical Italian. To our knowledge, this is the first study to explicitly assess the impact of lemmatization strategies in the context of historical languages, particularly those that are morphologically rich. Our results indicate that high-level normalization offers a promising trade-off between precision and generalization.</p>
      </abstract>
      <kwd-group>
<kwd>Data-driven Lemmatization</kwd>
        <kwd>Historical Italian</kwd>
        <kwd>Universal Dependencies</kwd>
        <kwd>Normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Lemmatization is the task of identifying the canonical form, or lemma, of a given inflected wordform. While this mapping is often straightforward and based on well-established criteria, it can also involve a considerable degree of discretion, especially in the case of diachronic language data. In historical lexicography, lemma selection remains a well-known and unresolved challenge due to the high number of attested variant forms, many of which diverge significantly from the standard form. Choosing a specific lemma to serve as the headword — i.e. capable of effectively subsuming all its variants — is a widely debated issue. As Porter and Thompson [<xref ref-type="bibr" rid="ref1">1</xref>] and Manolessou and Katsouda [<xref ref-type="bibr" rid="ref2">2</xref>] have noted, it constitutes a genuine dilemma. In computational linguistics, by contrast, lemmatization criteria are rarely made explicit and are often taken for granted. While this may pose only minor issues in the lemmatization of contemporary language, it becomes a critical concern for historical language data. This paper investigates the role and impact of different lemma identification strategies in automatic lemmatization, with a focus on historical varieties.</p>
      <p>Lemmatization is one of the fundamental tasks that facilitate downstream Natural Language Processing (NLP) applications and is particularly relevant for highly inflected languages. Traditionally, this task has been addressed using rule-based morphological analyzers and dictionary lookup. However, recent years have seen the rise of data-driven lemmatization approaches, where models learn to produce lemmas without relying on pre-defined linguistic rules and/or lexical resources. A key turning point in this methodological shift was the SIGMORPHON 2016 Shared Task, which reconceptualized lemmatization as a special case of morphological reinflection (Cotterell et al. [<xref ref-type="bibr" rid="ref3">3</xref>]). This view paved the way for the current dominant approaches, based on neural models. Within the data-driven paradigm, two main strategies have emerged. The generative character-level approach relies on encoder-decoder architectures that generate the lemma character by character, conditioned on the input form and its context (Qi et al. [<xref ref-type="bibr" rid="ref4">4</xref>], Bergmanis and Goldwater [<xref ref-type="bibr" rid="ref5">5</xref>]). In contrast, pattern-based models treat lemmatization as a supervised classification task (Straka [<xref ref-type="bibr" rid="ref6">6</xref>]), where each class - derived from training data - corresponds to the edit operations that transform a specific wordform into its lemma. A comparative study on Estonian by Dorkin and Sirts [<xref ref-type="bibr" rid="ref7">7</xref>] found that generative encoder-decoder models trained from scratch outperform both rule-based systems and pattern-based models fine-tuned from large pre-trained language models.</p>
      <p>Among the most debated issues in lemmatization, particularly in data-driven models, there is the role of context and morphological information. Contextual information has been shown to be crucial for handling unseen and ambiguous words: see, among others, Bergmanis and Goldwater [<xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>] and McCarthy et al. [<xref ref-type="bibr" rid="ref9">9</xref>]. The actual role of morphological information in performing contextual lemmatization was investigated by Toporkov and Agerri [<xref ref-type="bibr" rid="ref10">10</xref>], who showed that fine-grained morphological information does not help to substantially improve lemmatization (not even for highly inflected languages) and that using basic part-of-speech tags (UPOS) seems to be enough for comparable performance across languages.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24 — 26, 2025, Cagliari, Italy. * Corresponding author: Chiara Alzetta. † These authors contributed equally. chiara.alzetta@cnr.it (C. Alzetta); simonetta.montemagni@cnr.it (S. Montemagni). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Although much progress has been made on lemmatization for standard, resource-rich languages, the task remains challenging in the case of historical varieties, especially for morphologically complex languages like Italian. Historical Italian presents both orthographic and morphological variation, not only over time but often in the same period and even within the same text. These challenges include, among others: alternations between etymological and phonetic spellings (e.g., haveva vs. aveva '(it) had', chupola vs. cupola 'dome'); phonetic variation (e.g. pulito vs. polito 'clean', eguale vs. uguale 'equal'); morphologically distinct variants (e.g. avria vs. avrebbe '(it) would have'); cliticized finite verbal forms (aveagli '(it) had-to-him', avevalo '(it) had-it'). Additional challenges, also relevant to contemporary Italian, include the treatment of past participles (verbal vs. adjectival use) and derivative forms (the open issue is whether they represent an independent lemma or should be associated with the corresponding base form, e.g. whether the diminutive angioletto 'little angel' is an independent lemma or should be lemmatized as angelo 'angel').</p>
      <p>A crucial but often neglected aspect of lemmatizing historical texts concerns the granularity and scope of the lemma list, as well as the criteria guiding lemma identification: in other words, the degree of normalization applied. This choice carries both theoretical and practical implications, influencing how linguistic variation is represented, how lexical continuity over time is interpreted, and how effectively the data can be searched, analyzed, or aligned across sources. Table 1 contrasts a conservative lemmatization approach - which preserves the graphical, phonological, and morpho-syntactic features of attested historical variants - with a more abstract normalization strategy that aligns such variants to a standardized contemporary (meta-)lemma. While the former offers greater linguistic precision and interpretability, it may lead to increased data sparsity. The latter, by contrast, reduces sparsity and facilitates generalization, though at the risk of introducing incorrect form–lemma associations.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Examples of conservative vs normalized lemmatization for historical Italian</p></caption>
        <table>
          <thead>
            <tr><th>wordform</th><th>POS</th><th>Conservative Lemma</th><th>Normalized Lemma</th></tr>
          </thead>
          <tbody>
            <tr><td>brieve</td><td>ADJ</td><td>brieve</td><td>breve</td></tr>
            <tr><td>sanctissimo</td><td>ADJ</td><td>sancto</td><td>santo</td></tr>
            <tr><td>chotesto</td><td>DET</td><td>cotesto</td><td>codesto</td></tr>
            <tr><td>alma</td><td>NOUN</td><td>alma</td><td>anima</td></tr>
            <tr><td>imperadori</td><td>NOUN</td><td>imperadore</td><td>imperatore</td></tr>
            <tr><td>palagio</td><td>NOUN</td><td>palagio</td><td>palazzo</td></tr>
            <tr><td>utilitati</td><td>NOUN</td><td>utilitate</td><td>utilità</td></tr>
            <tr><td>admettesse</td><td>VERB</td><td>admettere</td><td>ammettere</td></tr>
            <tr><td>diliberarono</td><td>VERB</td><td>diliberare</td><td>deliberare</td></tr>
            <tr><td>guarentir</td><td>VERB</td><td>guarentire</td><td>garantire</td></tr>
            <tr><td>surse</td><td>VERB</td><td>surgere</td><td>sorgere</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The choice between these strategies is shaped by several practical factors, including the target application and the specific language involved. Linguistic analyses, for instance, may benefit from a conservative approach, whereas information retrieval systems and downstream NLP applications may perform better with normalized lemmas. Language-specific features also play a key role. As Manjavacas et al. [<xref ref-type="bibr" rid="ref11">11</xref>] note, the highly heterogeneous nature of historical languages — marked by overlapping diachronic and diatopic variation and the absence of a stable standardized norm — makes it particularly challenging to carry out lemmatization and normalization simultaneously. In the case of diachronic Italian, a low-level lemmatization strategy was adopted by Favaro et al. [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>], deferring normalization to a later stage operating on lemma variants.</p>
      <p>In this paper, we present a comparative evaluation of these two lemmatization strategies for historical Italian, combining quantitative metrics with qualitative analysis. To our knowledge, this issue has not yet been explicitly addressed in the computational linguistics literature, where lemmatization choices are typically assumed rather than critically examined. We argue that this decision is especially relevant for morphologically rich languages, where different lemmatization strategies can have a substantial impact on both the performance and interpretability of downstream tasks.</p>
      <p>The rest of the paper is organized as follows. In Section 2, the historical corpora selected as the basis of this study are described. Section 3 illustrates the strategy adopted for generating a version of these corpora with high-level normalized lemmatization. Section 4 describes the approach employed to train two models for lemmatizing Italian historical texts. Section 5 discusses the results obtained by the lemmatization models, focusing both on the results obtained in five-fold cross-validation experiments and against an external test set. Finally, Section 6 concludes the paper and presents some future prospects.</p>
      <sec id="sec-data">
        <title>2. Data</title>
        <p>For this study, we selected three corpora covering a wide timespan, going from the 14th to the 20th century, listed below:</p>
        <p>• UD-Italian Old [<xref ref-type="bibr" rid="ref14">14</xref>]: Italian-Old is a treebank containing Dante Alighieri's Comedy, based on the 1994 Petrocchi edition and sourced from the DanteSearch corpus [<xref ref-type="bibr" rid="ref15">15</xref>]. The treebank includes lemmatization, morpho-syntactic, and syntactic</p>
      </sec>
      <p>annotation. A partial manual revision was carried out to align morpho-syntactic annotation and lemmatization with the Universal Dependencies (UD) guidelines, with particular attention to proper nouns and fixed multiword expressions. For our experiments, we used version 2.15 of the treebank, released in November 2024;</p>
      <p>• VGG - Voci della Grande Guerra [<xref ref-type="bibr" rid="ref16">16</xref>]: VGG is a corpus of texts that were written in Italian in the period of World War I or shortly afterwards (most of them date back to the years 1915-1919). The corpus includes different textual genres, namely: discourses, reports, and diaries of politicians and military chiefs; letters written by men and women, soldiers and civilians; literary works of intellectuals, poets, and philosophers; writings of journalists and lawyers. The corpus is annotated at the morpho-syntactic level and lemmatized. Annotation was carried out with UDPipe [<xref ref-type="bibr" rid="ref17">17</xref>] trained on IUDT [<xref ref-type="bibr" rid="ref18">18</xref>] v2.0; a subset was then manually revised [19]. For this study, we used the gold portion of the corpus;</p>
      <p>• GDLI-QC - GDLI Quotation Corpus [<xref ref-type="bibr" rid="ref12">12</xref>]: GDLI-QC is a corpus derived from an authoritative historical Italian dictionary, namely the Grande dizionario della lingua italiana (GDLI) edited by Salvatore Battaglia. GDLI presents a huge collection of quotations covering the entire history of the Italian language, from which a subset has been extracted, representative of the most cited authors and covering a wide chronological span (from the 14th to the 20th century). GDLI-QC has been morpho-syntactically tagged and lemmatized with Stanza [<xref ref-type="bibr" rid="ref4">4</xref>]: annotation was carried out automatically, with full manual revision.</p>
      <p>All of these corpora follow a conservative lemmatization strategy. In terms of annotation, they are all natively annotated according to the Universal Dependencies (UD) scheme (De Marneffe et al. [20]), which has become the de facto standard nowadays. Lemmatization has been manually revised for each corpus — albeit only partially for UD-Italian Old — to ensure linguistic accuracy and internal consistency. As such, these corpora can be considered gold-standard resources. Table 2 provides details on their size in terms of sentences and tokens.</p>
      <p>[Table 2: number of sentences and tokens for UD-Italian Old, GDLI-QC - GDLI Quotation Corpus, and VGG - Voci della Grande Guerra, with totals; the numeric values are not preserved here.]</p>
      <p>For the comparative study of the two lemmatization strategies, a normalized counterpart of each corpus, featuring high-level linguistic annotation, was required. To generate the normalized versions of the three corpora, we identified two historical Italian lexicons adopting this lemmatization approach.</p>
      <p>One such resource is the MIDIA lexicon, which was built starting from the balanced diachronic corpus of written Italian texts called MIDIA (D'Achille and Grossmann [21]), fully annotated with lemma and part-of-speech (POS) information. Covering the period from the early 13th to the first half of the 20th century, the corpus is organized into five chronological periods and seven textual genres, comprising approximately 7.5 million tokens drawn from about 800 texts. In MIDIA, lemmatization and POS tagging were automatically performed using a version of TreeTagger (Schmid [22]) adapted for historical Italian (Iacobini et al. [23]). To handle the linguistic variation typical of earlier stages of the language, the contemporary Italian lexicon embedded in TreeTagger was enriched with approximately 230,000 word forms, primarily dating from the 14th to the 16th centuries. This substantially expanded the original MIDIA lexicon. The version we used contains 70,083 unique lemmata, 571,779 distinct wordform–lemma pairs, and 584,041 unique wordform–lemma–POS triples. Notably, there is a high degree of overlap between the wordform–lemma pairs from the corpora under study and those in the MIDIA lexicon: 89.91% for UD-Italian Old, 86.65% for GDLI-QC, and 81.66% for VGG.</p>
      <p>Another key reference resource identified for these purposes is the Tesoro della Lingua Italiana delle Origini (TLIO) (Beltrami [24]), a historical dictionary of old Italian based on all extant documentation from the earliest texts recognizable as Italian up to the end of the 14th century, which includes manual lemmatization.</p>
      <p>To fully understand the type of lemmatization performed in these two resources, we report below the set of wordforms sharing the nominal lemma amministrazione 'administration' in the MIDIA and TLIO lexicons:</p>
      <p>MIDIA: administratione, administrationi, administrazione, aministratione, amministratione, amministrationi, amministrazione, amministrazioni, nistrazione, strazione</p>
      <p>TLIO: adminestragione, administracion, administracione, administraciuni, administragione, administratione, administrationi, administrazione, aministracione, aministraciuni, aministragione, aministrascione, aministratione, amministracione, amministragione, amministragioni, amministratione, amministrazione, amministrazioni</p>
      <sec id="sec-3">
        <title>3. Lemma Normalization</title>
        <p>To carry out lemma normalization, the first step consisted of converting the part of speech tags of the MIDIA lexicon to the UD annotation scheme. Table 3 details the correspondences between the two tagsets. The conversion was carried out automatically, and the ambiguous and underspecified cases (e.g. che and wh tags) were then revised manually.</p>
        <p>[Table 3: mapping between MIDIA and UD part-of-speech tags. Recoverable correspondences include Adjective → ADJ, Noun → NOUN, Punctuation → PUNCT, Numeral → NUM, and Interrogative → PRON,ADV,SCONJ; the MIDIA categories Adposition, Proper noun, Adverb, Articulated Prep., Verb, Auxiliary, Conjunction, and Pronoun are also listed.]</p>
        <p>The normalization process of the selected corpora was carried out in three successive phases, relying on lexicon-based validation and correction. The objective was to verify and, where appropriate, normalize wordform-lemma (WL) pairs extracted from the selected historical corpora using the MIDIA and TLIO historical lexicons.</p>
        <p>In the first phase, each WL pair was checked against the MIDIA lexicon. If the WL pair was found in MIDIA, the case was marked as f1-match-found and left unchanged. If the wordform was present in the MIDIA lexicon but was associated with a different lemma, or with both a different lemma and POS, the unmatching information was modified with the values appearing in MIDIA (case marked as f1-modified-lemma or f1-modified-lemma+pos). If the wordform was not found in MIDIA, the case was labeled f1-form-missing and passed as input to the second phase.</p>
        <p>In the second normalization phase, the wordforms labelled as missing (i.e. f1-form-missing) in MIDIA during Phase 1 were re-analyzed. For these cases, we checked whether MIDIA contained the lemma matching any other form. If the POS in the corpus and MIDIA lexicon coincided, then we marked the case as correct using the label f2-validated-lemma. If the lemma was present in MIDIA with a different POS, the original POS from the corpus was preserved, and the case was labeled f2-different-pos. If no matching form or lemma was found in MIDIA, the case was labeled f2-missing.</p>
        <p>The final phase addressed the remaining unresolved cases from Phase 2 — those labeled f2-missing and f2-different-pos — by consulting the TLIO lexicon. As a first step, we checked whether the triple (wordform, lemma, POS) was present in the lexicon. If so, we marked the case as validated (f3-valid-lemma-F) or modified the lemma to match the TLIO triple (f3-modified-lemma-F). If the lemma appeared as a wordform in TLIO with the same POS, the lemma was changed to match the lemma reported in TLIO (f3-modified-lemma-L) or validated against the lexicon (f3-valid-lemma-L). If the form was present but associated with a different POS, the case was labeled f3-different-lemma-pos. If none of the above conditions applied, the case remained unresolved and was labeled f3-missing.</p>
        <p>Table 4 exemplifies the cases treated in the different normalization steps, reporting the corpus annotation and how it was revised based on the evidence of the MIDIA / TLIO lexicons.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Examples of the cases treated in the different normalization steps, with the corpus annotation, the lexicon evidence, and the revision applied.</p></caption>
          <table>
            <thead>
              <tr><th>Label</th><th>Corpus (wordform, lemma, POS)</th><th>Lexicon (wordform, lemma, POS)</th><th>Outcome</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="4">Phase 1, Lexicon: MIDIA</td></tr>
              <tr><td>f1-match-found</td><td>(proposta, proposta, NOUN)</td><td>(proposta, proposta, NOUN)</td><td>No changes are made; the triple matches the lexicon.</td></tr>
              <tr><td>f1-modified-lemma</td><td>(altipiano, altopiano, NOUN)</td><td>(altipiano, altipiano, NOUN)</td><td>The lemma in the corpus is corrected to match the lexicon.</td></tr>
              <tr><td>f1-modified-lemma+pos</td><td>(esuberanti, esuberare, VERB)</td><td>(esuberanti, esuberante, ADJ)</td><td>Both lemma and POS are corrected to align with the lexicon.</td></tr>
              <tr><td>f1-form-missing</td><td>(prevvede, prevedere, VERB)</td><td>–</td><td>The form is missing from the lexicon and flagged for review.</td></tr>
              <tr><td colspan="4">Phase 2, Lexicon: MIDIA</td></tr>
              <tr><td>f2-validated-lemma</td><td>(com', come, ADV)</td><td>(come, come, ADV)</td><td>The corpus triple is validated despite form variation; lemma and POS match the lexicon.</td></tr>
              <tr><td>f2-different-pos</td><td>(rassicurantissime, rassicurante, ADJ)</td><td>(rassicurante, rassicurare, VERB)</td><td>The same form appears in the lexicon with a different lemma and POS; the corpus POS is retained for further analysis.</td></tr>
              <tr><td>f2-missing</td><td>(fidenti, fidente, ADJ)</td><td>–</td><td>The form and lemma are absent from the lexicon and marked as missing.</td></tr>
              <tr><td colspan="4">Phase 3, Lexicon: TLIO</td></tr>
              <tr><td>f3-valid-lemma-F</td><td>(accecamento, accecamento, NOUN)</td><td>(accecamento, accecamento, NOUN)</td><td>The triple is validated; it matches the lexicon entry.</td></tr>
              <tr><td>f3-modified-lemma-F</td><td>(disolate, disolato, ADJ)</td><td>(disolate, desolato, ADJ)</td><td>The lemma is corrected to align with the TLIO lexicon.</td></tr>
              <tr><td>f3-modified-lemma-L</td><td>(adirizar, adirizare, VERB)</td><td>(adirizare, addirizzare, VERB)</td><td>The triple is normalized using the lemma assigned to the variant in the lexicon.</td></tr>
              <tr><td>f3-valid-lemma-L</td><td>(succian, succiare, VERB)</td><td>(succiare, succiare, VERB)</td><td>The triple is validated; the lemma is found in the lexicon with matching POS.</td></tr>
              <tr><td>f3-different-lemma-pos</td><td>(ubbriachi, ubbriaco, ADJ)</td><td>(ubbriaco, ubriaco, NOUN)</td><td>Lemma and POS differ from the lexicon; no change is applied.</td></tr>
              <tr><td>f3-missing</td><td>(addobbamenti, addobbamento, NOUN)</td><td>–</td><td>Both the form and lemma are missing from the lexicon; no change is made.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For each step described above, Table 5 reports the distribution of cases in the three normalization steps. For the three historical corpora, the number of matching WL pairs is very high: the lemmatization in the corpus and the lexicon coincided in more than 96% of the cases (with minor differences across the corpora). Cases normalized during one of the three phases amount to 3.56% in the UD-Italian Old, 3.02% in VGG, and 2.97% in GDLI-QC. A negligible number of cases were not normalized, ranging from 0.09% in the UD-Italian Old, to 0.85% and 0.73% in VGG and GDLI-QC respectively.</p>
      </sec>
      <sec id="sec-4">
        <title>4. Model Training</title>
        <p>For the analysis of historical Italian texts, we trained the Stanza natural language processing neural pipeline [<xref ref-type="bibr" rid="ref4">4</xref>], developed by the Stanford NLP Group. Stanza, following a generative character-level approach, offers a modular architecture with state-of-the-art models for tokenization, lemmatization, part-of-speech tagging, morphological analysis, dependency parsing, and named entity recognition. Built on a Python interface, it supports over 70 human languages and is trained on UD treebanks. In addition to its pre-trained models, Stanza allows users to train custom models from scratch using UD-formatted data. In this study, we specifically focused on the lemmatization component.</p>
        <p>The lemmatization model was trained using the normalized versions of the selected historical corpora — UD-Italian Old, VGG, and GDLI-QC — as input data. To these, we added the contemporary Italian corpus ISDT (Italian Stanford Dependency Treebank) (Bosco et al. [<xref ref-type="bibr" rid="ref18">18</xref>]). For comparison purposes, we also trained a model using the original, non-normalized versions of the historical corpora. In the remainder of this paper, we refer to the model trained on normalized data as NORM_Lem, and to the one trained on unnormalized original data as ORIG_Lem.</p>
        <p>To evaluate the performance of the NORM_Lem and ORIG_Lem models, we conducted two sets of experiments, each with a distinct objective. The first set was designed to assess the impact of low-level versus high-level normalization on lemmatization accuracy (Section 5.1). For this purpose, we performed 5-fold cross-validation: in each fold, the dataset was divided into a training set (containing 14,419 sentences, corresponding to the 80% of the full dataset), a validation set (4,806 sentences, 10%), and a test set (4,806 sentences, 10%). As detailed in Table 6, the internal composition of the validation and test sets was representative of the four different corpora used for training in similar proportions.</p>
        <p>[Table 6: composition of the train, dev, and test sets across the five folds; the numeric values are not preserved here.]</p>
        <p>The second set of experiments aimed to evaluate the accuracy and robustness of the normalized lemmatization model on an external historical corpus (Section 5.2). In this case, the model was trained on the entire dataset and tested on a selection of sentences from the MIDIA corpus, which had been semi-automatically converted into the UD format. This evaluation allowed us to test the generalizability of the NORM_Lem model beyond the data it was trained on.</p>
      </sec>
      <p>to the ambiguous use of past participles, which often alternate between verbal and adjectival function, a frequent source of lemmatization errors. As for NOUNs, the observed errors may also be linked to the treatment of derived forms, whose lemmatization may not always be consistent across treebank sources. Regarding NUM, the category with the highest error rate, we noted that most errors involve Roman numerals, often misinterpreted as PROPN.</p>
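      <p>The three-phase, lexicon-based procedure described in Section 3 can be sketched in code. The sketch below is a simplified, hypothetical illustration rather than the authors' implementation: midia and tlio stand in for the real lexicons as dictionaries mapping a wordform to a list of (lemma, POS) entries, competing lexicon entries are resolved naively by taking the first one, and Phase 3 collapses the paper's -F/-L label distinction.</p>

```python
def normalize(wordform, lemma, pos, midia, tlio):
    """Return a (possibly revised) lemma and a phase label for a corpus triple."""
    # Phase 1: look the corpus wordform up in MIDIA.
    entries = midia.get(wordform)
    if entries:
        if (lemma, pos) in entries:
            return lemma, "f1-match-found"
        mlemma, mpos = entries[0]  # naive choice among candidate entries
        label = "f1-modified-lemma" if mpos == pos else "f1-modified-lemma+pos"
        return mlemma, label

    # Phase 2: the form is missing from MIDIA; is the corpus lemma attested
    # in MIDIA as the lemma of any other form, with the same POS?
    lemma_pos = {p for forms in midia.values() for (l, p) in forms if l == lemma}
    if pos in lemma_pos:
        return lemma, "f2-validated-lemma"

    # Phase 3: remaining cases (f2-missing / f2-different-pos) fall back to TLIO.
    entries = tlio.get(wordform) or tlio.get(lemma)
    if entries:
        if (lemma, pos) in entries:
            return lemma, "f3-valid-lemma"
        tlemma, tpos = entries[0]
        if tpos == pos:
            return tlemma, "f3-modified-lemma"
        return lemma, "f3-different-lemma-pos"
    return lemma, "f3-missing"
```

      <p>With a toy MIDIA containing the single entry (altipiano, altipiano, NOUN), the corpus triple (altipiano, altopiano, NOUN) from Table 4 is revised to the lexicon lemma and labeled f1-modified-lemma.</p>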
    </sec>
    <sec id="sec-2">
      <title>5. Lemmatization Results</title>
      <sec id="sec-2-1">
        <title>5.1. Low- vs High-level Normalization Results</title>
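        <p>The scores reported below are token-level lemma accuracy, i.e. the proportion of tokens whose predicted lemma exactly matches the gold annotation; as a minimal illustration (a hypothetical helper, not the paper's evaluation code):</p>

```python
def lemma_accuracy(gold, predicted):
    # Fraction of tokens whose predicted lemma matches the gold lemma exactly.
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)

# Two of three predictions match the gold lemmas (examples from Table 1):
print(lemma_accuracy(["breve", "santo", "anima"], ["breve", "santo", "alma"]))
# -> 0.6666666666666666
```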
        <p>The first set of experiments was conducted using 5-fold cross-validation. The NORM_Lem and the ORIG_Lem models were tested on the normalized and original versions of the treebanks respectively. Table 7 presents the accuracy scores for each fold, as well as for the entire DEV and TEST sets. In all cases, the NORM_Lem model consistently outperforms the ORIG_Lem model, both across individual folds and on average. A reduction in the number of incorrectly lemmatized tokens is observed for the source corpora, with the most notable improvement in the UD-Italian Old corpus, where NORM_Lem yields a 0.38% decrease in lemmatization errors on both the DEV and TEST sets. An exception to this trend is GDLI-QC, for which both models show a slight drop in accuracy (–0.18 on both DEV and TEST). The VGG corpus is less affected by normalization, showing a reduction in lemmatization errors of 0.11%.</p>
        <table-wrap id="tab7">
          <label>Table 7</label>
          <caption><p>Lemma accuracy obtained with the ORIG_Lem and the NORM_Lem models over 5-fold cross-validation on DEV and TEST portions.</p></caption>
          <table>
            <thead>
              <tr><th>Fold</th><th colspan="2">ORIG_Lem model</th><th colspan="2">NORM_Lem model</th></tr>
              <tr><th/><th>Lemma Acc. (DEV)</th><th>Lemma Acc. (TEST)</th><th>Lemma Acc. (DEV)</th><th>Lemma Acc. (TEST)</th></tr>
            </thead>
            <tbody>
              <tr><td>Fold 1</td><td>0.9827</td><td>0.9830</td><td>0.9851</td><td>0.9841</td></tr>
              <tr><td>Fold 2</td><td>0.9817</td><td>0.9829</td><td>0.9841</td><td>0.9847</td></tr>
              <tr><td>Fold 3</td><td>0.9824</td><td>0.9821</td><td>0.9852</td><td>0.9835</td></tr>
              <tr><td>Fold 4</td><td>0.9830</td><td>0.9825</td><td>0.9852</td><td>0.9841</td></tr>
              <tr><td>Fold 5</td><td>0.9828</td><td>0.9826</td><td>0.9847</td><td>0.9851</td></tr>
              <tr><td>Average</td><td>0.9825</td><td>0.9826</td><td>0.9848</td><td>0.9843</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We also analysed the results by part-of-speech (POS). Table 8 reports the error rates in the TEST set. Aside from NUM (numerals), which is the worst-performing category with an increase of errors with the NORM_Lem model, the POS with the highest error rates (above 3%) are ADJ, VERB, and PROPN, followed by NOUN and PRON, with error rates of 2.37% and 1.87% respectively. All other POS categories show error rates below 1%. Errors involving ADJs and VERBs are mainly ascribable</p>
        <p>[Table 8: lemmatization error rates by POS (ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, SCONJ, VERB) in the TEST set; the numeric values are not preserved here.]</p>
        <p>5.2. Testing NORM_Lem with an External Historical Corpus</p>
        <p>In the second set of experiments, we focused on the NORM_Lem model with the aim of evaluating its accuracy and robustness on an external historical corpus. The test set comprises a selection of sentences from the MIDIA corpus, for a total of 5,116 tokens. The sentences are acquired from ten different texts to ensure diversity in terms of genre and period of composition. In fact, the texts span a broad chronological range, from the early 14th century to the mid-19th century, thus offering a representative sample of linguistic variation across different evolution stages of the Italian language. In terms of genre distribution, the dataset includes three subsets of expository essays, three of scholarly or scientific texts, two of literary prose texts, and two of personal correspondence. This selection, which includes textual genres not represented in the training corpus, aims to evaluate the robustness of the NORM_Lem model in the face of stylistic, genre, and diachronic variation.</p>
        <p>The overall lemmatization accuracy achieved by the NORM_Lem model on the external test set is 96.59%. While this score is slightly lower than the average accuracy obtained in the 5-fold cross-validation experiment described above, such a difference is expected given that the test set comprises previously unseen texts that partially differ both in genre and chronological coverage from the training data. The slight performance drop reflects the increased difficulty posed by domain shift, particularly with respect to historical variation (in this MIDIA sample there are periods which are not covered in the training corpus) and text type.</p>
        <p>A closer analysis of the accuracy of lemmatization over time, shown in Figure 1, reveals that the performance remains relatively stable over the centuries, with significantly high values, ranging from 93.58% to 97.44%. The lowest accuracy is observed for the text dated 1505 by Leonardo Da Vinci (93.58%). However, this drop seems more related to the complexity and idiosyncrasies of the text's genre (i.e., technical and fragmentary scientific notes) rather than to its chronological distance. Excluding this outlier, lemmatization accuracy across the remaining texts shows limited variance, with most scores clustering around 96–97%, indicating the robustness of the model to diachronic variation.</p>
        <p>The genre-based evaluation further confirms this trend. The model performs best on personal correspondence and expository texts, achieving in both cases an accuracy of 96.94%, closely followed by literary prose (96.87%). Slightly lower accuracy is recorded for scientific texts (95.88%), very likely due to genre-specific linguistic characteristics, such as technical terminology, irregular syntax, and less standardized spelling. However, the performance remains consistently high across all genres, confirming the generalizability of the NORM_Lem model to different types of historical texts.</p>
        <p>An analysis of lemmatization errors by part-of-speech (POS) on the external test set (Table 10) reveals patterns that are largely consistent with those observed in the five-fold evaluation, while also highlighting genre- and domain-specific challenges. As in the internal evaluation, ADJ, VERB, and PROPN remain among the POS with the highest error rates, recording values of 9.59%, 6.71%, and 6.80%, respectively, in the full test set. These results confirm the persistent difficulty posed by adjectives and verbs, often due to the ambiguous status of past participles that can function both as verbal and adjectival forms. Errors in the PROPN category remain notably high, particularly in scientific texts (21.43%). However, this result should be interpreted with caution, as it is influenced by the low frequency of proper nouns in these texts. Although the proportion of incorrectly lemmatized proper nouns appears substantial, the scientific subcorpus contains only 14 PROPN tokens in total. This small sample size limits their overall impact on the test set and may inflate the observed error rate due to sampling effects.</p>
        <p>ADV, SCONJ, and DET also show minor fluctuations in accuracy, but their overall contribution to the global error rate remains limited. Errors in NOUN lemmatization reveal a range of recurrent challenges, including both lexical
variation and morphological ambiguity. Several errors
involve orthographic variants or archaic spellings that are
typical of historical texts, such as uppinione lemmatized
as uppinione (instead of opinione), or phonological or
dialectal interference, e.g. ariento lemmatized as such
instead of argento. Other errors highlight semantic
or derivational mismatches, where the model fails to
associate the inflected form with the appropriate lemma.</p>
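<p>The contrast between conservative and normalized lemma assignment can be made concrete with a small illustration. The following sketch is our own, purely illustrative code: the toy lexicons are hypothetical, built from the error cases just mentioned, and show how the two strategies diverge on historical variants such as uppinione and ariento.</p>

```python
# Illustrative sketch (not the authors' implementation): a conservative
# lookup preserves attested historical forms as lemmas, while a
# normalizing one maps variants onto modern standard lemmas.
# Both toy lexicons below are hypothetical examples.

CONSERVATIVE = {
    "uppinione": "uppinione",  # archaic spelling kept as its own lemma
    "ariento": "ariento",      # dialectal variant kept as its own lemma
}

NORMALIZED = {
    "uppinione": "opinione",   # variant mapped to the modern lemma
    "ariento": "argento",
}

def lemmatize(form, lexicon):
    """Look the form up in the lexicon; fall back to the lowercased form."""
    return lexicon.get(form.lower(), form.lower())

for form in ("uppinione", "ariento"):
    print(form, "->", lemmatize(form, CONSERVATIVE),
          "/", lemmatize(form, NORMALIZED))
```

<p>The conservative mapping maximizes fidelity to the attested text but multiplies lemma types; the normalized mapping collapses variants and thus reduces sparsity, at the cost of occasionally wrong identifications.</p>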
        <p>For example, the wordform diletti is incorrectly
lemmatized as dilettare (VERB) rather than diletto (NOUN).</p>
        <p>Finally, some errors involve mislemmatization due to
homography or syntactic ambiguity, as seen, e.g., with
mostra lemmatized as mostrare, where the model
incorrectly assumes a verbal or adjectival interpretation.</p>
        <p>Such cases may be tied to the POS-lemmatization
interaction, where contextually ambiguous forms are resolved
incorrectly, possibly due to inconsistent POS-tag/lemma
alignments in training data.</p>
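<p>The POS-lemmatization interaction can be illustrated schematically. In the hypothetical sketch below (the lexicon entries are illustrative and not part of the actual model), a lookup keyed on the (wordform, POS) pair resolves homographs such as mostra and diletti, and a wrong POS tag propagates directly into a wrong lemma.</p>

```python
# Hypothetical sketch of the POS-lemma interaction: keying the lookup on
# (wordform, POS) disambiguates homographs that a form-only lookup cannot.
# Entries are illustrative, taken from the error cases discussed above.

LEXICON = {
    ("mostra", "NOUN"): "mostra",
    ("mostra", "VERB"): "mostrare",
    ("diletti", "NOUN"): "diletto",
    ("diletti", "VERB"): "dilettare",
}

def lemmatize(form, pos):
    """Return the lemma for (form, POS); fall back to the form itself."""
    return LEXICON.get((form.lower(), pos), form.lower())

# With the correct NOUN tag the nominal reading is recovered...
print(lemmatize("mostra", "NOUN"))   # mostra
# ...while a wrong VERB tag yields the wrong lemma, as in the error above.
print(lemmatize("mostra", "VERB"))   # mostrare
```

<p>This is why inconsistent POS-tag/lemma alignments in the training data can surface directly as lemmatization errors on ambiguous forms.</p>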
        <p>Interestingly, NUM errors are less prominent in the
external test set compared to the five-fold validation,
likely due to the lower frequency of Roman numerals
or a more predictable usage context. Other categories
such as ADP, CCONJ, and AUX remain highly stable,
with error rates below 1%, suggesting that closed-class
words are generally well handled by the model, even in
previously unseen texts.</p>
        <p>Overall, the distribution of errors confirms the
robustness of the NORM_Lem model across POS categories,
while also emphasizing the influence of genre-specific
lexical and morphological variation, particularly in
scientific and early modern texts.</p>
        <p>Last but not least, we analyzed how the NORM_Lem
model handles the challenge of Out-Of-Vocabulary (OOV)
words — i.e., words not included in the pre-trained
vocabulary — which typically lead to degraded model
performance. The results reported in Table 10 are consistent
with our previous observations: the highest percentage of
incorrect predictions is found in Science and Expository
texts (35%). This percentage decreases to 30% in Literary
Prose and to 25% in Letters. We further examined the
incorrect predictions by part of speech (POS), revealing
that the most problematic categories are still NOUNs
(30%), VERBs (27%), ADJECTIVEs (22%), and PROPER
NOUNs (5%), which together account for 84% of the
errors in OOV words. A closer inspection of individual
cases suggests that there is still room for improvement:
several errors are due to case mismatches, while others
involve derivative formations.</p>
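<p>The OOV analysis above can be sketched as follows. The function below is our own illustration, not the evaluation code used in the experiments: it counts, among tokens absent from a given training vocabulary, how often the predicted lemma differs from the gold one, broken down by POS.</p>

```python
from collections import Counter

# Minimal sketch of an OOV error breakdown: among tokens whose wordform
# is absent from the training vocabulary, count lemmatization errors per
# POS category. The token tuples below are invented for illustration.

def oov_error_breakdown(tokens, vocabulary):
    """tokens: iterable of (form, pos, gold_lemma, predicted_lemma)."""
    errors = Counter()
    total_oov = 0
    for form, pos, gold, pred in tokens:
        if form in vocabulary:
            continue  # in-vocabulary tokens are excluded from this analysis
        total_oov += 1
        if pred != gold:
            errors[pos] += 1
    return total_oov, errors

tokens = [
    ("ariento", "NOUN", "argento", "ariento"),    # OOV, wrong lemma
    ("diletti", "NOUN", "diletto", "dilettare"),  # OOV, wrong lemma
    ("casa", "NOUN", "casa", "casa"),             # in-vocabulary
]
total, errors = oov_error_breakdown(tokens, vocabulary={"casa"})
print(total, dict(errors))
```
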
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Work</title>
      <p>This paper has addressed the role and impact of
different lemma definition strategies in automatic
lemmatization, with a particular focus on historical language
varieties. Specifically, we presented a comparative study
of two lemmatization strategies for historical Italian:
a conservative approach and a normalized one. The
model trained on normalized data (NORM_Lem) was
compared to a counterpart trained on unnormalized
corpora, i.e. following a conservative lemmatization
approach (ORIG_Lem). Both models were evaluated
intrinsically via five-fold cross-validation. Results consistently
favored the NORM_Lem model, which outperformed
ORIG_Lem across all folds, achieving higher accuracy
and reducing the number of incorrectly lemmatized
tokens.</p>
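<p>Schematically, the five-fold comparison summarized above can be aggregated as in the following sketch, where the per-fold accuracy values are invented for illustration only (the actual scores are those reported in the evaluation).</p>

```python
from statistics import mean

# Schematic aggregation of a five-fold comparison between two models.
# The fold accuracies below are hypothetical placeholders, not the
# numbers obtained in the experiments.

norm_lem_folds = [0.971, 0.968, 0.972, 0.969, 0.970]
orig_lem_folds = [0.958, 0.955, 0.960, 0.956, 0.957]

# NORM_Lem "wins" a fold when its accuracy exceeds ORIG_Lem's on that fold.
wins = sum(n > o for n, o in zip(norm_lem_folds, orig_lem_folds))

print(f"NORM_Lem mean accuracy: {mean(norm_lem_folds):.3f}")
print(f"ORIG_Lem mean accuracy: {mean(orig_lem_folds):.3f}")
print(f"Folds where NORM_Lem outperforms ORIG_Lem: {wins}/5")
```
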
      <p>To further evaluate the effectiveness and
generalization capacity of the NORM_Lem model, we tested it on
an external dataset including textual genres and
historical periods not represented in the training data.
Although overall accuracy on this out-of-domain test set
was slightly lower — due to domain and temporal
variation — the model maintained strong generalization
capabilities, with stable lemmatization accuracy across
different historical periods. From a genre-specific perspective,
lower accuracy was observed in scientific texts, where
challenges such as domain-specific terminology and
Latinized proper names were more prominent. A detailed
POS-based error analysis confirmed that adjectives, verbs,
and proper nouns remain problematic, often due, for example,
to morphological ambiguity or derivational complexity.
These findings align with previous observations on the
limitations of character-based neural models in
capturing morpho-syntactic regularities in low-frequency or
irregular data, especially in historical language varieties.</p>
      <p>Overall, our results provide empirical evidence that
high-level normalized lemmatization improves the
performance of data-driven models applied to
morphologically rich and orthographically variable languages like
historical Italian. In particular, high-level normalization
emerges as a valuable preprocessing step for
lemmatization tasks involving historical corpora. However, the
trade-off between normalization and linguistic fidelity
should be carefully considered, especially in
philological or interpretative contexts where access to attested
variants is essential.</p>
      <p>Future work will explore hybrid approaches that
combine normalization with variant-aware lemmatization
strategies, potentially through multitask learning or
post-lemmatization clustering techniques. Another
promising direction involves assessing the impact of different
lemmatization strategies on downstream tasks — such
as information retrieval, syntactic parsing, or historical
named entity recognition — in order to evaluate their
broader utility within practical NLP pipelines.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of the project
CHANGES – Cultural Heritage Innovation for Next-Gen
Sustainable Society (PE00000020), funded under the NRRP
program of the Italian Ministry of University and
Research (MUR) and financed by the European Union
through NextGenerationEU. Furthermore, we express
our sincere gratitude to the team who designed,
developed, and currently maintains the MIDIA corpus, and, in
particular, to Claudio Iacobini for his great support. Last
but not least, we thank Felice Dell’Orletta and Alessio
Miaschi for their precious suggestions in designing the
experiments, and Elisa Guadagnini for her helpful
comments on lemmatization criteria of historical Italian.
Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
  </back>
</article>