<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Low- vs High-level Lemmatization for Historical Languages. A Case Study on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simonetta Montemagni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale "Antonio Zampolli", Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Lemmatization remains a foundational yet challenging task in the processing of historical Italian texts, due to the complex interplay of orthographic, morphological, and diatopic variation. A crucial, yet often overlooked, aspect is the degree of normalization applied during lemmatization. A conservative approach preserves attested historical forms, ensuring greater linguistic fidelity but increasing data sparsity. Conversely, an abstract normalization strategy aligns historical variants with standardized contemporary lemmas, improving generalization but potentially introducing inaccurate mappings. In this paper, we present a comparative evaluation of conservative and normalized lemmatization strategies for historical Italian. To our knowledge, this is the first study to explicitly assess the impact of lemmatization strategies in the context of historical languages, particularly those that are morphologically rich. Our results indicate that high-level normalization offers a promising trade-off between precision and generalization.</p>
      </abstract>
      <kwd-group>
<kwd>Data-driven Lemmatization</kwd>
        <kwd>Historical Italian</kwd>
        <kwd>Universal Dependencies</kwd>
        <kwd>Normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Lemmatization is the task of identifying the canonical form, or lemma, of a given inflected wordform. While this mapping is often straightforward and based on well-established criteria, it can also involve a considerable degree of discretion, especially in the case of diachronic language data. In historical lexicography, lemma selection remains a well-known and unresolved challenge due to the high number of attested variant forms, many of which diverge significantly from the standard form. Choosing a specific lemma to serve as the headword — i.e. capable of effectively subsuming all its variants — is a widely debated issue. As Porter and Thompson [<xref ref-type="bibr" rid="ref1">1</xref>] and Manolessou and Katsouda [<xref ref-type="bibr" rid="ref2">2</xref>] have noted, it constitutes a genuine dilemma. In computational linguistics, by contrast, lemmatization criteria are rarely made explicit and are often taken for granted. While this may pose only minor issues in the lemmatization of contemporary language, it becomes a critical concern for historical language data. This paper investigates the role and impact of different lemma identification strategies in automatic lemmatization, with a focus on historical varieties.</p>
      <p>Lemmatization is one of the fundamental tasks that facilitate downstream Natural Language Processing (NLP) applications and is particularly relevant for highly inflected languages. Traditionally, this task has been addressed using rule-based morphological analyzers and dictionary lookup. However, recent years have seen the rise of data-driven lemmatization approaches, where models learn to produce lemmas without relying on pre-defined linguistic rules and/or lexical resources. A key turning point in this methodological shift was the SIGMORPHON 2016 Shared Task, which reconceptualized lemmatization as a special case of morphological reinflection (Cotterell et al. [<xref ref-type="bibr" rid="ref3">3</xref>]). This view paved the way for the current dominant approaches, based on neural models. Within the data-driven paradigm, two main strategies have emerged. The generative character-level approach relies on encoder-decoder architectures that generate the lemma character by character, conditioned on the input form and its context (Qi et al. [<xref ref-type="bibr" rid="ref4">4</xref>], Bergmanis and Goldwater [<xref ref-type="bibr" rid="ref5">5</xref>]). In contrast, pattern-based models treat lemmatization as a supervised classification task (Straka [<xref ref-type="bibr" rid="ref6">6</xref>]), where each class - derived from training data - corresponds to the edit operations that transform a specific wordform into its lemma. A comparative study on Estonian by Dorkin and Sirts [<xref ref-type="bibr" rid="ref7">7</xref>] found that generative encoder-decoder models trained from scratch outperform both rule-based systems and pattern-based models fine-tuned from large pre-trained language models.</p>
      <p>Among the most debated issues in lemmatization, particularly in data-driven models, there is the role of context and morphological information. Contextual information has been shown to be crucial for handling unseen and ambiguous words: see, among others, Bergmanis and Goldwater [<xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>] and McCarthy et al. [<xref ref-type="bibr" rid="ref9">9</xref>]. The actual role of morphological information in performing contextual lemmatization was investigated by Toporkov and Agerri [<xref ref-type="bibr" rid="ref10">10</xref>], who showed that fine-grained morphological information does not help to substantially improve lemmatization (not even for highly inflected languages) and that using basic part-of-speech tags (UPOS) seems to be enough for comparable performance across languages.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24 — 26, 2025, Cagliari, Italy. * Corresponding author: Chiara Alzetta. † These authors contributed equally. chiara.alzetta@cnr.it (C. Alzetta); simonetta.montemagni@cnr.it (S. Montemagni). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Although much progress has been made on lemmatization for standard, resource-rich languages, the task remains challenging in the case of historical varieties, especially for morphologically complex languages like Italian. Historical Italian presents both orthographic and morphological variation, not only over time but often in the same period and even within the same text. These challenges include, among others: alternations between etymological and phonetic spellings (e.g., haveva vs. aveva '(it) had', chupola vs. cupola 'dome'); phonetic variation (e.g. pulito vs. polito 'clean', eguale vs. uguale 'equal'); morphologically distinct variants (e.g. avria vs. avrebbe '(it) would have'); cliticized finite verbal forms (aveagli '(it) had-to-him', avevalo '(it) had-it'). Additional challenges, also relevant to contemporary Italian, include the treatment of past participles (verbal vs. adjectival use) and derivative forms (the open issue is whether they represent an independent lemma or should be associated with the corresponding base form, e.g. whether the diminutive angioletto 'little angel' is an independent lemma or should be lemmatized as angelo 'angel').</p>
      <p>A crucial but often neglected aspect of lemmatizing historical texts concerns the granularity and scope of the lemma list, as well as the criteria guiding lemma identification: in other words, the degree of normalization applied. This choice carries both theoretical and practical implications, influencing how linguistic variation is represented, how lexical continuity over time is interpreted, and how effectively the data can be searched, analyzed, or aligned across sources. Table 1 contrasts a conservative lemmatization approach - which preserves the graphical, phonological, and morpho-syntactic features of attested historical variants - with a more abstract normalization strategy that aligns such variants to a standardized contemporary (meta-)lemma. While the former offers greater linguistic precision and interpretability, it may lead to increased data sparsity. The latter, by contrast, reduces sparsity and facilitates generalization, though at the risk of introducing incorrect form–lemma associations.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Examples of conservative vs normalized lemmatization for historical Italian</p></caption>
        <table>
          <thead>
            <tr><th>wordform</th><th>POS</th><th>Conservative Lemma</th><th>Normalized Lemma</th></tr>
          </thead>
          <tbody>
            <tr><td>brieve</td><td>ADJ</td><td>brieve</td><td>breve</td></tr>
            <tr><td>sanctissimo</td><td>ADJ</td><td>sancto</td><td>santo</td></tr>
            <tr><td>chotesto</td><td>DET</td><td>cotesto</td><td>codesto</td></tr>
            <tr><td>alma</td><td>NOUN</td><td>alma</td><td>anima</td></tr>
            <tr><td>imperadori</td><td>NOUN</td><td>imperadore</td><td>imperatore</td></tr>
            <tr><td>palagio</td><td>NOUN</td><td>palagio</td><td>palazzo</td></tr>
            <tr><td>utilitati</td><td>NOUN</td><td>utilitate</td><td>utilità</td></tr>
            <tr><td>admettesse</td><td>VERB</td><td>admettere</td><td>ammettere</td></tr>
            <tr><td>diliberarono</td><td>VERB</td><td>diliberare</td><td>deliberare</td></tr>
            <tr><td>guarentir</td><td>VERB</td><td>guarentire</td><td>garantire</td></tr>
            <tr><td>surse</td><td>VERB</td><td>surgere</td><td>sorgere</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The choice between these strategies is shaped by several practical factors, including the target application and the specific language involved. Linguistic analyses, for instance, may benefit from a conservative approach, whereas information retrieval systems and downstream NLP applications may perform better with normalized lemmas. Language-specific features also play a key role. As Manjavacas et al. [<xref ref-type="bibr" rid="ref11">11</xref>] note, the highly heterogeneous nature of historical languages — marked by overlapping diachronic and diatopic variation and the absence of a stable standardized norm — makes it particularly challenging to carry out lemmatization and normalization simultaneously. In the case of diachronic Italian, a low-level lemmatization strategy was adopted by Favaro et al. [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>], deferring normalization to a later stage operating on lemma variants.</p>
      <p>In this paper, we present a comparative evaluation of these two lemmatization strategies for historical Italian, combining quantitative metrics with qualitative analysis. To our knowledge, this issue has not yet been explicitly addressed in the computational linguistics literature, where lemmatization choices are typically assumed rather than critically examined. We argue that this decision is especially relevant for morphologically rich languages, where different lemmatization strategies can have a substantial impact on both the performance and interpretability of downstream tasks.</p>
      <p>The rest of the paper is organized as follows. In Section 2, the historical corpora selected as the basis of this study are described. Section 3 illustrates the strategy adopted for generating a version of these corpora with high-level normalized lemmatization. Section 4 describes the approach employed to train two models for lemmatizing Italian historical texts. Section 5 discusses the results obtained by the lemmatization models, focusing both on the results obtained in five-fold cross-validation experiments and against an external test set. Finally, Section 6 concludes the paper and presents some future prospects.</p>
      <sec id="sec-data">
        <title>2. Data</title>
        <p>For this study, we selected three corpora covering a wide timespan, going from the 14th to the 20th century, listed below:</p>
        <p>• UD-Italian Old [<xref ref-type="bibr" rid="ref14">14</xref>]: Italian-Old is a treebank containing Dante Alighieri's Comedy, based on the 1994 Petrocchi edition and sourced from the DanteSearch corpus [<xref ref-type="bibr" rid="ref15">15</xref>]. The treebank includes lemmatization, morpho-syntactic, and syntactic</p>
      </sec>
      <p>annotation. A partial manual revision was carried out to align morpho-syntactic annotation and lemmatization with the Universal Dependencies (UD) guidelines, with particular attention to proper nouns and fixed multiword expressions. For our experiments, we used version 2.15 of the treebank, released in November 2024;</p>
      <p>• VGG - Voci della Grande Guerra [<xref ref-type="bibr" rid="ref16">16</xref>]: VGG is a corpus of texts that were written in Italian in the period of World War I or shortly afterwards (most of them date back to the years 1915-1919). The corpus includes different textual genres, namely: discourses, reports, and diaries of politicians and military chiefs; letters written by men and women, soldiers and civilians; literary works of intellectuals, poets, and philosophers; writings of journalists and lawyers. The corpus is annotated at the morpho-syntactic level and lemmatized. Annotation was carried out with UDPipe [<xref ref-type="bibr" rid="ref17">17</xref>] trained on IUDT [<xref ref-type="bibr" rid="ref18">18</xref>] v2.0; a subset was then manually revised [19]. For this study, we used the gold portion of the corpus;</p>
      <p>• GDLI-QC - GDLI Quotation Corpus [<xref ref-type="bibr" rid="ref12">12</xref>]: GDLI-QC is a corpus derived from an authoritative historical Italian dictionary, namely the Grande dizionario della lingua italiana (GDLI) edited by Salvatore Battaglia. GDLI presents a huge collection of quotations covering the entire history of the Italian language, from which a subset has been extracted, representative of the most cited authors and covering a wide chronological span (from the 14th to the 20th century). GDLI-QC has been morpho-syntactically tagged and lemmatized with Stanza [<xref ref-type="bibr" rid="ref4">4</xref>]: annotation was carried out automatically, with full manual revision.</p>
      <p>All of these corpora follow a conservative lemmatization strategy. In terms of annotation, they are all natively annotated according to the Universal Dependencies (UD) scheme (De Marneffe et al. [20]), which has become the de facto standard nowadays. Lemmatization has been manually revised for each corpus — albeit only partially for UD-Italian Old — to ensure linguistic accuracy and internal consistency. As such, these corpora can be considered gold-standard resources. Table 2 provides details on their size in terms of sentences and tokens.</p>
      <p>[Table 2: number of sentences and tokens for UD-Italian Old, GDLI-QC - GDLI Quotation Corpus, and VGG - Voci della Grande Guerra, with totals; the numeric values are not preserved here.]</p>
      <p>For the comparative study of the two lemmatization strategies, a normalized counterpart of each corpus, featuring high-level linguistic annotation, was required. To generate the normalized versions of the three corpora, we identified two historical Italian lexicons adopting this lemmatization approach.</p>
      <p>One such resource is the MIDIA lexicon, which was built starting from the balanced diachronic corpus of written Italian texts called MIDIA (D'Achille and Grossmann [21]), fully annotated with lemma and part-of-speech (POS) information. Covering the period from the early 13th to the first half of the 20th century, the corpus is organized into five chronological periods and seven textual genres, comprising approximately 7.5 million tokens drawn from about 800 texts. In MIDIA, lemmatization and POS tagging were automatically performed using a version of TreeTagger (Schmid [22]) adapted for historical Italian (Iacobini et al. [23]). To handle the linguistic variation typical of earlier stages of the language, the contemporary Italian lexicon embedded in TreeTagger was enriched with approximately 230,000 word forms, primarily dating from the 14th to the 16th centuries. This substantially expanded the original MIDIA lexicon. The version we used contains 70,083 unique lemmata, 571,779 distinct wordform–lemma pairs, and 584,041 unique wordform–lemma–POS triples. Notably, there is a high degree of overlap between the wordform–lemma pairs from the corpora under study and those in the MIDIA lexicon: 89.91% for UD-Italian Old, 86.65% for GDLI-QC, and 81.66% for VGG.</p>
      <p>Another key reference resource identified for these purposes is the Tesoro della Lingua Italiana delle Origini (TLIO) (Beltrami [24]), a historical dictionary of old Italian based on all extant documentation from the earliest texts recognizable as Italian up to the end of the 14th century, which includes manual lemmatization.</p>
      <p>To fully understand the type of lemmatization performed in these two resources, we report below the set of wordforms sharing the nominal lemma amministrazione 'administration' in the MIDIA and TLIO lexicons:</p>
      <p>MIDIA: administratione, administrationi, administrazione, aministratione, amministratione, amministrationi, amministrazione, amministrazioni, nistrazione, strazione</p>
      <p>TLIO: adminestragione, administracion, administracione, administraciuni, administragione, administratione, administrationi, administrazione, aministracione, aministraciuni, aministragione, aministrascione, aministratione, amministracione, amministragione, amministragioni, amministratione, amministrazione, amministrazioni</p>
      <sec id="sec-3">
        <title>3. Lemma Normalization</title>
        <p>To carry out lemma normalization, the first step consisted of converting the part of speech tags of the MIDIA lexicon to the UD annotation scheme. Table 3 details the correspondences between the two tagsets. The conversion was carried out automatically, and the ambiguous and underspecified cases (e.g. che and wh tags) were then revised manually.</p>
        <p>[Table 3: mapping between MIDIA and UD part-of-speech tags. Recoverable correspondences include Adjective → ADJ, Noun → NOUN, Punctuation → PUNCT, Numeral → NUM, and Interrogative → PRON,ADV,SCONJ; the MIDIA categories Adposition, Proper noun, Adverb, Articulated Prep., Verb, Auxiliary, Conjunction, and Pronoun are also listed.]</p>
        <p>The normalization process of the selected corpora was carried out in three successive phases, relying on lexicon-based validation and correction. The objective was to verify and, where appropriate, normalize wordform-lemma (WL) pairs extracted from the selected historical corpora using the MIDIA and TLIO historical lexicons.</p>
        <p>In the first phase, each WL pair was checked against the MIDIA lexicon. If the WL pair was found in MIDIA, the case was marked as f1-match-found and left unchanged. If the wordform was present in the MIDIA lexicon but was associated with a different lemma, or with both a different lemma and POS, the unmatching information was modified with the values appearing in MIDIA (case marked as f1-modified-lemma or f1-modified-lemma+pos). If the wordform was not found in MIDIA, the case was labeled f1-form-missing and passed as input to the second phase.</p>
        <p>In the second normalization phase, the wordforms labelled as missing (i.e. f1-form-missing) in MIDIA during Phase 1 were re-analyzed. For these cases, we checked whether MIDIA contained the lemma matching any other form. If the POS in the corpus and MIDIA lexicon coincided, then we marked the case as correct using the label f2-validated-lemma. If the lemma was present in MIDIA with a different POS, the original POS from the corpus was preserved, and the case was labeled f2-different-pos. If no matching form or lemma was found in MIDIA, the case was labeled f2-missing.</p>
        <p>The final phase addressed the remaining unresolved cases from Phase 2 — those labeled f2-missing and f2-different-pos — by consulting the TLIO lexicon. As a first step, we checked whether the triple (wordform, lemma, POS) was present in the lexicon. If so, we marked the case as validated (f3-valid-lemma-F) or modified the lemma to match the TLIO triple (f3-modified-lemma-F). If the lemma appeared as a wordform in TLIO with the same POS, the lemma was changed to match the lemma reported in TLIO (f3-modified-lemma-L) or validated against the lexicon (f3-valid-lemma-L). If the form was present but associated with a different POS, the case was labeled f3-different-lemma-pos. If none of the above conditions applied, the case remained unresolved and was labeled f3-missing.</p>
        <p>Table 4 exemplifies the cases treated in the different normalization steps, reporting the corpus annotation and how it was revised based on the evidence of the MIDIA / TLIO lexicons.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Examples of the cases treated in the different normalization steps, with the corpus annotation, the lexicon evidence, and the revision applied.</p></caption>
          <table>
            <thead>
              <tr><th>Label</th><th>Corpus (wordform, lemma, POS)</th><th>Lexicon (wordform, lemma, POS)</th><th>Outcome</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="4">Phase 1, Lexicon: MIDIA</td></tr>
              <tr><td>f1-match-found</td><td>(proposta, proposta, NOUN)</td><td>(proposta, proposta, NOUN)</td><td>No changes are made; the triple matches the lexicon.</td></tr>
              <tr><td>f1-modified-lemma</td><td>(altipiano, altopiano, NOUN)</td><td>(altipiano, altipiano, NOUN)</td><td>The lemma in the corpus is corrected to match the lexicon.</td></tr>
              <tr><td>f1-modified-lemma+pos</td><td>(esuberanti, esuberare, VERB)</td><td>(esuberanti, esuberante, ADJ)</td><td>Both lemma and POS are corrected to align with the lexicon.</td></tr>
              <tr><td>f1-form-missing</td><td>(prevvede, prevedere, VERB)</td><td>–</td><td>The form is missing from the lexicon and flagged for review.</td></tr>
              <tr><td colspan="4">Phase 2, Lexicon: MIDIA</td></tr>
              <tr><td>f2-validated-lemma</td><td>(com', come, ADV)</td><td>(come, come, ADV)</td><td>The corpus triple is validated despite form variation; lemma and POS match the lexicon.</td></tr>
              <tr><td>f2-different-pos</td><td>(rassicurantissime, rassicurante, ADJ)</td><td>(rassicurante, rassicurare, VERB)</td><td>The same form appears in the lexicon with a different lemma and POS; the corpus POS is retained for further analysis.</td></tr>
              <tr><td>f2-missing</td><td>(fidenti, fidente, ADJ)</td><td>–</td><td>The form and lemma are absent from the lexicon and marked as missing.</td></tr>
              <tr><td colspan="4">Phase 3, Lexicon: TLIO</td></tr>
              <tr><td>f3-valid-lemma-F</td><td>(accecamento, accecamento, NOUN)</td><td>(accecamento, accecamento, NOUN)</td><td>The triple is validated; it matches the lexicon entry.</td></tr>
              <tr><td>f3-modified-lemma-F</td><td>(disolate, disolato, ADJ)</td><td>(disolate, desolato, ADJ)</td><td>The lemma is corrected to align with the TLIO lexicon.</td></tr>
              <tr><td>f3-modified-lemma-L</td><td>(adirizar, adirizare, VERB)</td><td>(adirizare, addirizzare, VERB)</td><td>The triple is normalized using the lemma assigned to the variant in the lexicon.</td></tr>
              <tr><td>f3-valid-lemma-L</td><td>(succian, succiare, VERB)</td><td>(succiare, succiare, VERB)</td><td>The triple is validated; the lemma is found in the lexicon with matching POS.</td></tr>
              <tr><td>f3-different-lemma-pos</td><td>(ubbriachi, ubbriaco, ADJ)</td><td>(ubbriaco, ubriaco, NOUN)</td><td>Lemma and POS differ from the lexicon; no change is applied.</td></tr>
              <tr><td>f3-missing</td><td>(addobbamenti, addobbamento, NOUN)</td><td>–</td><td>Both the form and lemma are missing from the lexicon; no change is made.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For each step described above, Table 5 reports the distribution of cases in the three normalization steps. For the three historical corpora, the number of matching WL pairs is very high: the lemmatization in the corpus and the lexicon coincided in more than 96% of the cases (with minor differences across the corpora). Cases normalized during one of the three phases amount to 3.56% in the UD-Italian Old, 3.02% in VGG, and 2.97% in GDLI-QC. A negligible number of cases were not normalized, ranging from 0.09% in the UD-Italian Old, to 0.85% and 0.73% in VGG and GDLI-QC respectively.</p>
      </sec>
      <sec id="sec-4">
        <title>4. Model Training</title>
        <p>For the analysis of historical Italian texts, we trained the Stanza natural language processing neural pipeline [<xref ref-type="bibr" rid="ref4">4</xref>], developed by the Stanford NLP Group. Stanza, following a generative character-level approach, offers a modular architecture with state-of-the-art models for tokenization, lemmatization, part-of-speech tagging, morphological analysis, dependency parsing, and named entity recognition. Built on a Python interface, it supports over 70 human languages and is trained on UD treebanks. In addition to its pre-trained models, Stanza allows users to train custom models from scratch using UD-formatted data. In this study, we specifically focused on the lemmatization component.</p>
        <p>The lemmatization model was trained using the normalized versions of the selected historical corpora — UD-Italian Old, VGG, and GDLI-QC — as input data. To these, we added the contemporary Italian corpus ISDT (Italian Stanford Dependency Treebank) (Bosco et al. [<xref ref-type="bibr" rid="ref18">18</xref>]). For comparison purposes, we also trained a model using the original, non-normalized versions of the historical corpora. In the remainder of this paper, we refer to the model trained on normalized data as NORM_Lem, and to the one trained on unnormalized original data as ORIG_Lem.</p>
        <p>To evaluate the performance of the NORM_Lem and ORIG_Lem models, we conducted two sets of experiments, each with a distinct objective. The first set was designed to assess the impact of low-level versus high-level normalization on lemmatization accuracy (Section 5.1). For this purpose, we performed 5-fold cross-validation: in each fold, the dataset was divided into a training set (containing 14,419 sentences, corresponding to the 80% of the full dataset), a validation set (4,806 sentences, 10%), and a test set (4,806 sentences, 10%). As detailed in Table 6, the internal composition of the validation and test sets was representative of the four different corpora used for training in similar proportions.</p>
        <p>[Table 6: composition of the train, dev, and test sets across the five folds; the numeric values are not preserved here.]</p>
        <p>The second set of experiments aimed to evaluate the accuracy and robustness of the normalized lemmatization model on an external historical corpus (Section 5.2). In this case, the model was trained on the entire dataset and tested on a selection of sentences from the MIDIA corpus, which had been semi-automatically converted into the UD format. This evaluation allowed us to test the generalizability of the NORM_Lem model beyond the data it was trained on.</p>
      </sec>
      <p>to the ambiguous use of past participles, which often alternate between verbal and adjectival function, a frequent source of lemmatization errors. As for NOUNs, the observed errors may also be linked to the treatment of derived forms, whose lemmatization may not always be consistent across treebank sources. Regarding NUM, the category with the highest error rate, we noted that most errors involve Roman numerals, often misinterpreted as PROPN.</p>
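      <p>The three-phase, lexicon-based procedure described in Section 3 can be sketched in code. The sketch below is a simplified, hypothetical illustration rather than the authors' implementation: midia and tlio stand in for the real lexicons as dictionaries mapping a wordform to a list of (lemma, POS) entries, competing lexicon entries are resolved naively by taking the first one, and Phase 3 collapses the paper's -F/-L label distinction.</p>

```python
def normalize(wordform, lemma, pos, midia, tlio):
    """Return a (possibly revised) lemma and a phase label for a corpus triple."""
    # Phase 1: look the corpus wordform up in MIDIA.
    entries = midia.get(wordform)
    if entries:
        if (lemma, pos) in entries:
            return lemma, "f1-match-found"
        mlemma, mpos = entries[0]  # naive choice among candidate entries
        label = "f1-modified-lemma" if mpos == pos else "f1-modified-lemma+pos"
        return mlemma, label

    # Phase 2: the form is missing from MIDIA; is the corpus lemma attested
    # in MIDIA as the lemma of any other form, with the same POS?
    lemma_pos = {p for forms in midia.values() for (l, p) in forms if l == lemma}
    if pos in lemma_pos:
        return lemma, "f2-validated-lemma"

    # Phase 3: remaining cases (f2-missing / f2-different-pos) fall back to TLIO.
    entries = tlio.get(wordform) or tlio.get(lemma)
    if entries:
        if (lemma, pos) in entries:
            return lemma, "f3-valid-lemma"
        tlemma, tpos = entries[0]
        if tpos == pos:
            return tlemma, "f3-modified-lemma"
        return lemma, "f3-different-lemma-pos"
    return lemma, "f3-missing"
```

      <p>With a toy MIDIA containing the single entry (altipiano, altipiano, NOUN), the corpus triple (altipiano, altopiano, NOUN) from Table 4 is revised to the lexicon lemma and labeled f1-modified-lemma.</p>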
    </sec>
    <sec id="sec-2">
      <title>5. Lemmatization Results</title>
      <sec id="sec-2-1">
        <title>5.1. Low- vs High-level Normalization Results</title>
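        <p>The scores reported below are token-level lemma accuracy, i.e. the proportion of tokens whose predicted lemma exactly matches the gold annotation; as a minimal illustration (a hypothetical helper, not the paper's evaluation code):</p>

```python
def lemma_accuracy(gold, predicted):
    # Fraction of tokens whose predicted lemma matches the gold lemma exactly.
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)

# Two of three predictions match the gold lemmas (examples from Table 1):
print(lemma_accuracy(["breve", "santo", "anima"], ["breve", "santo", "alma"]))
# -> 0.6666666666666666
```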
        <p>The first set of experiments was conducted using 5-fold cross-validation. The NORM_Lem and the ORIG_Lem models were tested on the normalized and original versions of the treebanks respectively. Table 7 presents the accuracy scores for each fold, as well as for the entire DEV and TEST sets. In all cases, the NORM_Lem model consistently outperforms the ORIG_Lem model, both across individual folds and on average. A reduction in the number of incorrectly lemmatized tokens is observed for the source corpora, with the most notable improvement in the UD-Italian Old corpus, where NORM_Lem yields a 0.38% decrease in lemmatization errors on both the DEV and TEST sets. An exception to this trend is GDLI-QC, for which both models show a slight drop in accuracy (–0.18 on both DEV and TEST). The VGG corpus is less affected by normalization, showing a reduction in lemmatization errors of 0.11%.</p>
        <table-wrap id="tab7">
          <label>Table 7</label>
          <caption><p>Lemma accuracy obtained with the ORIG_Lem and the NORM_Lem models over 5-fold cross-validation on DEV and TEST portions.</p></caption>
          <table>
            <thead>
              <tr><th>Fold</th><th colspan="2">ORIG_Lem model</th><th colspan="2">NORM_Lem model</th></tr>
              <tr><th/><th>Lemma Acc. (DEV)</th><th>Lemma Acc. (TEST)</th><th>Lemma Acc. (DEV)</th><th>Lemma Acc. (TEST)</th></tr>
            </thead>
            <tbody>
              <tr><td>Fold 1</td><td>0.9827</td><td>0.9830</td><td>0.9851</td><td>0.9841</td></tr>
              <tr><td>Fold 2</td><td>0.9817</td><td>0.9829</td><td>0.9841</td><td>0.9847</td></tr>
              <tr><td>Fold 3</td><td>0.9824</td><td>0.9821</td><td>0.9852</td><td>0.9835</td></tr>
              <tr><td>Fold 4</td><td>0.9830</td><td>0.9825</td><td>0.9852</td><td>0.9841</td></tr>
              <tr><td>Fold 5</td><td>0.9828</td><td>0.9826</td><td>0.9847</td><td>0.9851</td></tr>
              <tr><td>Average</td><td>0.9825</td><td>0.9826</td><td>0.9848</td><td>0.9843</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We also analysed the results by part-of-speech (POS). Table 8 reports the error rates in the TEST set. Aside from NUM (numerals), which is the worst-performing category with an increase of errors with the NORM_Lem model, the POS with the highest error rates (above 3%) are ADJ, VERB, and PROPN, followed by NOUN and PRON, with error rates of 2.37% and 1.87% respectively. All other POS categories show error rates below 1%. Errors involving ADJs and VERBs are mainly ascribable</p>
        <p>[Table 8: lemmatization error rates by POS (ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, SCONJ, VERB) in the TEST set; the numeric values are not preserved here.]</p>
        <p>5.2. Testing NORM_Lem with an External Historical Corpus</p>
        <p>In the second set of experiments, we focused on the NORM_Lem model with the aim of evaluating its accuracy and robustness on an external historical corpus. The test set comprises a selection of sentences from the MIDIA corpus, for a total of 5,116 tokens. The sentences are acquired from ten different texts to ensure diversity in terms of genre and period of composition. In fact, the texts span a broad chronological range, from the early 14th century to the mid-19th century, thus offering a representative sample of linguistic variation across different evolution stages of the Italian language. In terms of genre distribution, the dataset includes three subsets of expository essays, three of scholarly or scientific texts, two of literary prose texts, and two of personal correspondence. This selection, which includes textual genres not represented in the training corpus, aims to evaluate the robustness of the NORM_Lem model in the face of stylistic, genre, and diachronic variation.</p>
        <p>The overall lemmatization accuracy achieved by the NORM_Lem model on the external test set is 96.59%. While this score is slightly lower than the average accuracy obtained in the 5-fold cross-validation experiment described above, such a difference is expected given that the test set comprises previously unseen texts that partially differ both in genre and chronological coverage from the training data. The slight performance drop reflects the increased difficulty posed by domain shift, particularly with respect to historical variation (in this MIDIA sample there are periods which are not covered in the training corpus) and text type.</p>
        <p>A closer analysis of the accuracy of lemmatization over time, shown in Figure 1, reveals that the performance remains relatively stable over the centuries, with significantly high values, ranging from 93.58% to 97.44%. The lowest accuracy is observed for the text dated 1505 by Leonardo Da Vinci (93.58%). However, this drop seems more related to the complexity and idiosyncrasies of the text's genre (i.e., technical and fragmentary scientific notes) rather than to its chronological distance. Excluding this outlier, lemmatization accuracy across the remaining texts shows limited variance, with most scores clustering around 96–97%, indicating the robustness of the model to diachronic variation.</p>
        <p>The genre-based evaluation further confirms this trend. The model performs best on personal correspondence and expository texts, achieving in both cases an accuracy of 96.94%, closely followed by literary prose (96.87%). Slightly lower accuracy is recorded for scientific texts (95.88%), very likely due to genre-specific linguistic characteristics, such as technical terminology, irregular syntax, and less standardized spelling. However, the performance remains consistently high across all genres, confirming the generalizability of the NORM_Lem model to different types of historical texts.</p>
        <p>An analysis of lemmatization errors by part-of-speech (POS) on the external test set (Table 10) reveals patterns that are largely consistent with those observed in the five-fold evaluation, while also highlighting genre- and domain-specific challenges. As in the internal evaluation, ADJ, VERB, and PROPN remain among the POS with the highest error rates, recording values of 9.59%, 6.71%, and 6.80%, respectively, in the full test set. These results confirm the persistent difficulty posed by adjectives and verbs, often due to the ambiguous status of past participles that can function both as verbal and adjectival forms. Errors in the PROPN category remain notably high, particularly in scientific texts (21.43%). However, this result should be interpreted with caution, as it is influenced by the low frequency of proper nouns in these texts. Although the proportion of incorrectly lemmatized proper nouns appears substantial, the scientific subcorpus contains only 14 PROPN tokens in total. This small sample size limits their overall impact on the test set and may inflate the observed error rate due to sampling effects.</p>
        <p>ADV, SCONJ, and DET also show minor fluctuations in accuracy, but their overall contribution to the global error rate remains limited. Errors in NOUN lemmatization reveal a range of recurrent challenges, including both lexical
variation and morphological ambiguity. Several errors
involve orthographic variants or archaic spellings that are
typical of historical texts, such as uppinione lemmatized
as uppinione (instead of opinione), or phonological or
dialectal interference, e.g. ariento lemmatized as such
instead of argento. Other errors highlight semantic
or derivational mismatches, where the model fails to
associate the inflected form with the appropriate lemma.</p>
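<p>The contrast between conservative and normalized lemma assignment can be made concrete with a small illustration. The following sketch is our own, purely illustrative code: the toy lexicons are hypothetical, built from the error cases just mentioned, and show how the two strategies diverge on historical variants such as uppinione and ariento.</p>

```python
# Illustrative sketch (not the authors' implementation): a conservative
# lookup preserves attested historical forms as lemmas, while a
# normalizing one maps variants onto modern standard lemmas.
# Both toy lexicons below are hypothetical examples.

CONSERVATIVE = {
    "uppinione": "uppinione",  # archaic spelling kept as its own lemma
    "ariento": "ariento",      # dialectal variant kept as its own lemma
}

NORMALIZED = {
    "uppinione": "opinione",   # variant mapped to the modern lemma
    "ariento": "argento",
}

def lemmatize(form, lexicon):
    """Look the form up in the lexicon; fall back to the lowercased form."""
    return lexicon.get(form.lower(), form.lower())

for form in ("uppinione", "ariento"):
    print(form, "->", lemmatize(form, CONSERVATIVE),
          "/", lemmatize(form, NORMALIZED))
```

<p>The conservative mapping maximizes fidelity to the attested text but multiplies lemma types; the normalized mapping collapses variants and thus reduces sparsity, at the cost of occasionally wrong identifications.</p>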
        <p>For example, the wordform diletti is incorrectly
lemmatized as dilettare (VERB) rather than diletto (NOUN).</p>
        <p>Finally, some errors involve mislemmatization due to
homography or syntactic ambiguity, as seen, e.g., with
mostra lemmatized as mostrare, where the model
incorrectly assumes a verbal or adjectival interpretation.</p>
        <p>Such cases may be tied to the POS-lemmatization
interaction, where contextually ambiguous forms are resolved
incorrectly, possibly due to inconsistent POS-tag/lemma
alignments in training data.</p>
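<p>The POS-lemmatization interaction can be illustrated schematically. In the hypothetical sketch below (the lexicon entries are illustrative and not part of the actual model), a lookup keyed on the (wordform, POS) pair resolves homographs such as mostra and diletti, and a wrong POS tag propagates directly into a wrong lemma.</p>

```python
# Hypothetical sketch of the POS-lemma interaction: keying the lookup on
# (wordform, POS) disambiguates homographs that a form-only lookup cannot.
# Entries are illustrative, taken from the error cases discussed above.

LEXICON = {
    ("mostra", "NOUN"): "mostra",
    ("mostra", "VERB"): "mostrare",
    ("diletti", "NOUN"): "diletto",
    ("diletti", "VERB"): "dilettare",
}

def lemmatize(form, pos):
    """Return the lemma for (form, POS); fall back to the form itself."""
    return LEXICON.get((form.lower(), pos), form.lower())

# With the correct NOUN tag the nominal reading is recovered...
print(lemmatize("mostra", "NOUN"))   # mostra
# ...while a wrong VERB tag yields the wrong lemma, as in the error above.
print(lemmatize("mostra", "VERB"))   # mostrare
```

<p>This is why inconsistent POS-tag/lemma alignments in the training data can surface directly as lemmatization errors on ambiguous forms.</p>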
        <p>Interestingly, NUM errors are less prominent in the
external test set compared to the five-fold validation,
likely due to the lower frequency of Roman numerals
or a more predictable usage context. Other categories
such as ADP, CCONJ, and AUX remain highly stable,
with error rates below 1%, suggesting that closed-class
words are generally well handled by the model, even in
previously unseen texts.</p>
        <p>Overall, the distribution of errors confirms the
robustness of the NORM_Lem model across POS categories,
while also emphasizing the influence of genre-specific
lexical and morphological variation, particularly in
scientific and early modern texts.</p>
        <p>Last but not least, we analyzed how the NORM_Lem
model handles the challenge of Out-Of-Vocabulary (OOV)
words — i.e., words not included in the pre-trained
vocabulary — which typically lead to degraded model
performance. The results reported in Table 10 are consistent
with our previous observations: the highest percentage of
incorrect predictions is found in Science and Expository
texts (35%). This percentage decreases to 30% in Literary
Prose and to 25% in Letters. We further examined the
incorrect predictions by part of speech (POS), revealing
that the most problematic categories are still NOUNs
(30%), VERBs (27%), ADJECTIVEs (22%), and PROPER
NOUNs (5%), which together account for 84% of the
errors in OOV words. A closer inspection of individual
cases suggests that there is still room for improvement:
several errors are due to case mismatches, while others
involve derivative formations.</p>
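<p>The OOV analysis above can be sketched as follows. The function below is our own illustration, not the evaluation code used in the experiments: it counts, among tokens absent from a given training vocabulary, how often the predicted lemma differs from the gold one, broken down by POS.</p>

```python
from collections import Counter

# Minimal sketch of an OOV error breakdown: among tokens whose wordform
# is absent from the training vocabulary, count lemmatization errors per
# POS category. The token tuples below are invented for illustration.

def oov_error_breakdown(tokens, vocabulary):
    """tokens: iterable of (form, pos, gold_lemma, predicted_lemma)."""
    errors = Counter()
    total_oov = 0
    for form, pos, gold, pred in tokens:
        if form in vocabulary:
            continue  # in-vocabulary tokens are excluded from this analysis
        total_oov += 1
        if pred != gold:
            errors[pos] += 1
    return total_oov, errors

tokens = [
    ("ariento", "NOUN", "argento", "ariento"),    # OOV, wrong lemma
    ("diletti", "NOUN", "diletto", "dilettare"),  # OOV, wrong lemma
    ("casa", "NOUN", "casa", "casa"),             # in-vocabulary
]
total, errors = oov_error_breakdown(tokens, vocabulary={"casa"})
print(total, dict(errors))
```
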
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Work</title>
      <p>This paper has addressed the role and impact of
different lemma definition strategies in automatic
lemmatization, with a particular focus on historical language
varieties. Specifically, we presented a comparative study
of two lemmatization strategies for historical Italian:
a conservative approach and a normalized one. The
model trained on normalized data (NORM_Lem) was
compared to a counterpart trained on unnormalized
corpora, i.e. following a conservative lemmatization
approach (ORIG_Lem). Both models were evaluated
intrinsically via five-fold cross-validation. Results consistently
favored the NORM_Lem model, which outperformed
ORIG_Lem across all folds, achieving higher accuracy
and reducing the number of incorrectly lemmatized
tokens.</p>
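<p>Schematically, the five-fold comparison summarized above can be aggregated as in the following sketch, where the per-fold accuracy values are invented for illustration only (the actual scores are those reported in the evaluation).</p>

```python
from statistics import mean

# Schematic aggregation of a five-fold comparison between two models.
# The fold accuracies below are hypothetical placeholders, not the
# numbers obtained in the experiments.

norm_lem_folds = [0.971, 0.968, 0.972, 0.969, 0.970]
orig_lem_folds = [0.958, 0.955, 0.960, 0.956, 0.957]

# NORM_Lem "wins" a fold when its accuracy exceeds ORIG_Lem's on that fold.
wins = sum(n > o for n, o in zip(norm_lem_folds, orig_lem_folds))

print(f"NORM_Lem mean accuracy: {mean(norm_lem_folds):.3f}")
print(f"ORIG_Lem mean accuracy: {mean(orig_lem_folds):.3f}")
print(f"Folds where NORM_Lem outperforms ORIG_Lem: {wins}/5")
```
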
      <p>To further evaluate the effectiveness and
generalization capacity of the NORM_Lem model, we tested it on
an external dataset including textual genres and
historical periods not represented in the training data.
Although overall accuracy on this out-of-domain test set
was slightly lower — due to domain and temporal
variation — the model maintained strong generalization
capabilities, with stable lemmatization accuracy across
different historical periods. From a genre-specific perspective,
lower accuracy was observed in scientific texts, where
challenges such as domain-specific terminology and
Latinized proper names were more prominent. A detailed
POS-based error analysis confirmed that adjectives, verbs,
and proper nouns remain problematic, often due, for example,
to morphological ambiguity or derivational complexity.
These findings align with previous observations on the
limitations of character-based neural models in
capturing morpho-syntactic regularities in low-frequency or
irregular data, especially in historical language varieties.</p>
      <p>Overall, our results provide empirical evidence that
high-level normalized lemmatization improves the
performance of data-driven models applied to
morphologically rich and orthographically variable languages like
historical Italian. In particular, high-level normalization
emerges as a valuable preprocessing step for
lemmatization tasks involving historical corpora. However, the
trade-off between normalization and linguistic fidelity
should be carefully considered, especially in
philological or interpretative contexts where access to attested
variants is essential.</p>
      <p>Future work will explore hybrid approaches that
combine normalization with variant-aware lemmatization
strategies, potentially through multitask learning or
post-lemmatization clustering techniques. Another
promising direction involves assessing the impact of different
lemmatization strategies on downstream tasks — such
as information retrieval, syntactic parsing, or historical
named entity recognition — in order to evaluate their
broader utility within practical NLP pipelines.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of the project
CHANGES – Cultural Heritage Innovation for Next-Gen
Sustainable Society (PE00000020), funded under the NRRP
program of the Italian Ministry of University and
Research (MUR) and financed by the European Union
through NextGenerationEU. Furthermore, we express
our sincere gratitude to the team who designed,
developed, and currently maintains the MIDIA corpus, and, in
particular, to Claudio Iacobini for his great support. Last
but not least, we thank Felice Dell’Orletta and Alessio
Miaschi for their precious suggestions in designing the
experiments, and Elisa Guadagnini for her helpful
comments on lemmatization criteria of historical Italian.
Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
  </back>
</article>