<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Cagliari, Italy
* Corresponding author.
† This paper is the result of the collaboration between the two authors.
For the specific concerns of the Italian academic attribution
system, Rachele Sprugnoli is responsible for sections</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Annotating Manzoni: Challenges in the Annotation of Lemmas, POS and Features in “I Promessi Sposi”</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arianna Redaelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università Cattolica del Sacro Cuore</institution>
          ,
          <addr-line>Largo Gemelli, 1, 20123 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Parma</institution>
          ,
          <addr-line>Via D'Azeglio, 85, 43125 Parma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>2</volume>
      <issue>3</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this paper we introduce a dataset of I Promessi Sposi annotated with lemmas, UPOS tags, and features aligned with Universal Dependencies (UD). Three representative chapters from Manzoni's 1840 edition (791 sentences, almost 26 K tokens) were automatically tagged with UDPipe and fully manually corrected. Tailored guidelines extended standard UD practice with: (i) a double lemmatization approach, one that maintains archaic spellings and altered forms and one that normalizes lemmas, (ii) novel features that capture specific important characteristics of the novel, such as the use of apocopated and altered forms. Using the resulting dataset, we retrained the Stanza pipeline to obtain an in-domain model. Augmenting training data with ISDT sentences yielded further, although smaller, gains. Finally, a CRF sequence tagger was developed to identify apocopated forms.</p>
      </abstract>
      <kwd-group>
        <kwd>annotation</kwd>
        <kwd>Italian literature</kwd>
        <kwd>computational literary studies</kwd>
        <kwd>Alessandro Manzoni</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>• A manually annotated dataset comprising three chapters of the novel, totaling 791 sentences and approximately 26,000 tokens. The annotations include lemmas, UPOS tags, and morphological features following the Universal Dependencies (UD) framework. Particular attention was given to (i) using features described in the Italian UD guidelines that are not yet widely adopted across existing treebanks, (ii) applying a dual lemmatization strategy (normalizing and conservative), and (iii) defining additional features that capture stylistic and linguistic peculiarities of the novel.</p>
      <p>• An in-domain model trained on the aforementioned annotated dataset.</p>
      <p>• A joint model trained on the combined data from I Promessi Sposi and the ISDT treebank.</p>
      <p>• A dedicated model for the recognition of apocopated forms, which are characteristic of the novel’s language.</p>
      <p>All datasets and models are publicly available in a dedicated GitHub repository: https://github.com/RacheleSprugnoli/CoNLL-U_Manzoni.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The application of NLP tools to Italian literary texts has been approached through targeted experiments since the early 2000s. Basili et al. [5] employed machine-learning techniques to semantically classify narrative fragments from Alberto Moravia’s novel Gli indifferenti, whereas Pennacchiotti and Zanzotto [6] evaluated the accuracy of a morphological analyzer and a POS tagger on a range of prose and poetry texts dating from the thirteenth century to the late nineteenth century, revealing a drop in performance compared with results obtained on contemporary Italian. More recently, within the TrAVaSI project (Trattamento Automatico di Varietà Storiche di Italiano), texts of various genres, including literary works dated from 1861 onwards, have been annotated according to the UD framework, but using the same annotation layers we adopt for Manzoni, i.e., excluding dependency parsing. As in our study, these annotated data have been exploited to train automatic models [7]. Particular attention has been devoted to lemmatization, adopting a conservative approach that preserves the original token’s graphical, phonological, and morphological characteristics [8].</p>
      <p>By contrast, dependency parsing is included in the annotation of Dante Alighieri’s Divina Commedia, which has in turn enabled the release of the Italian-Old treebank<sup>9</sup> and the development of models specifically tailored to this text [9]. In this annotation, lemmatization follows the criteria established in the DanteSearch project, from which the data were drawn [10] before applying the UD framework. In this case as well, a conservative strategy is adopted, whereby pecorelle (“little sheep”) is lemmatized as pecorella. The same methodology has also been employed in the Edizione dell’Opera Omnia di Luigi Pirandello [11] and in the Archivio Lessicografico della Poesia Italiana dell’Otto-Novecento (ALPION) [12], although in these projects the data are accessible only through concordances.<sup>10</sup></p>
      <p>Different lemmatization choices have been made in the compilation of other linguistic resources for Italian. For instance, in MIDIA, altered forms are linked to their corresponding base lemmas, but other word forms have not been normalized, resulting in distinct lemmas for each variation: for example, the archaic spelling imaginando (“imagining”) is lemmatized as imaginare, while the modern form immaginando corresponds to immaginare. In CoLFIS (Corpus e Lessico di Frequenza dell’Italiano Scritto), altered nouns and adjectives were initially lemmatized as independent lemmas and then a reference to the corresponding base form was added.<sup>11</sup> Finally, in LIPSI (Lessico di frequenza dell’italiano parlato nella Svizzera italiana), altered forms are mapped to a base lemma when weakly lexicalized: e.g., chiesina (“little church”) is lemmatized with chiesa (“church”). On the contrary, independent entries are created when there is a significant semantic divergence between the derived form and the base: e.g., lampadina (“light bulb”) is treated as a separate lemma with respect to lampada (“lamp”) [13, 14]. This same strategy is also adopted in the compilation of the Nuovo De Mauro dictionary<sup>12</sup> and in our work, as explained in detail in Section 3.</p>
      <p><sup>9</sup> https://github.com/UniversalDependencies/UD_Italian-Old</p>
      <p><sup>10</sup> https://vocabolari.pirandellonazionale.it/; https://alpion.unict.it/vocabolario/ricerca/</p>
      <p><sup>11</sup> https://linguistica.sns.it/CoLFIS/Home.htm</p>
      <p><sup>12</sup> https://dizionario.internazionale.it/avvertenze/2</p>
      <p>3. Annotation</p>
      <p>Chapters 1, 8, and 23<sup>13</sup> of the final edition of I Promessi Sposi (1840) were automatically annotated with UDPipe 2 (ISDT model, version 2.15) [15] [16] and then manually corrected.<sup>14</sup> We adopted the CoNLL-U Plus format<sup>15</sup> to accommodate specific annotation requirements designed for the novel, as explained in the following subsection (see Figure 1).</p>
      <p><sup>13</sup> These chapters were selected for their stylistic and structural variety. Chapter 1 introduces the setting of the novel and includes a long descriptive passage, some dialogic sections, and even pseudo-documentary parts marked by archaic lexical choices; chapter 8 plays a central role in the narrative, featuring multiple scenes, thematic shifts, and dialogic exchanges, as well as a semi-lyrical closing section; chapter 23 is characterized by its predominantly dialogic structure and includes a lengthy final soliloquy.</p>
      <p><sup>14</sup> As we report in Table 2, the performance of this model is not optimal.</p>
      <p><sup>15</sup> https://universaldependencies.org/ext-format.html</p>
      <p>3.1. Guidelines</p>
      <p>The annotation guidelines were developed collaboratively, discussed in multiple revision rounds, and refined to their current form. Their purpose was to guide the annotation process while remaining as consistent as possible with the official UD guidelines for Italian.<sup>16</sup> However, existing Italian treebanks do not always strictly follow UD’s recommendations. Whenever discrepancies were encountered between the UD guidelines and currently available treebanks, our guidelines prioritized the official UD specifications. This involved both substitutions and additions.</p>
      <p><sup>16</sup> https://github.com/UniversalDependencies/docs/tree/pages-source/_it</p>
      <p>Among the substitutions, we systematically replaced the use of VerbForm=Ger, commonly found in current treebanks for the traditional Italian gerund (e.g., dicendo, “saying”), with the correct label VerbForm=Conv. Similarly, for superlative adjective forms (e.g., pessimo, “very bad”), we replaced Degree=Sup with Degree=Abs.</p>
      <p>Among the additions, we decided to use the feature Reflex=Yes for reflexive forms (e.g., sé, si, proprio, “him/her/itself”, “themselves”): although this feature is listed among the ones to be used in Italian,<sup>17</sup> it is still rarely applied in most currently available treebanks.<sup>18</sup> We also annotated indefinite pronouns functioning as total quantifiers (e.g., ogni, “each”, “every”, tutto, “all”, “everything” and ciascuno, “everyone”, “each one”) with the feature PronType=Tot, in line with the UD guidelines, despite its inconsistent use across current resources.</p>
      <p><sup>17</sup> https://github.com/UniversalDependencies/docs/blob/pages-source/_it/feat/Reflex.md</p>
      <p><sup>18</sup> Reflex=Yes is currently present in the following treebanks: PUD (3 occurrences), ParTUT (14), OLD (2,346).</p>
      <p>Beyond these additions, we introduced a set of features not prescribed by the UD Italian guidelines, but intended to account for morphosyntactic phenomena of particular historical or stylistic relevance in I Promessi Sposi. All such features were annotated in the MISC field.</p>
      <p>Firstly, we used the feature Variant=Apoc to annotate apocopated forms, only excluding indefinite articles (e.g., un, “a”), which are fully grammaticalized in contemporary Italian and therefore not stylistically significant. As observed by Bianchi [17], Manzoni drew both on postconsonantal and postvocalic apocopes (e.g., respectively, fecer instead of fecero, “they did”, and cagion instead of cagione, “cause”) to evoke the rhythms and informality of spoken language, at times even extending beyond Florentine usage, which was his main language model. Unlike elisions, which involve the omission of a final vowel before an initial vowel and are graphically marked with an apostrophe, apocopes generally drop final phonemes regardless of the phonological context and are not marked. However, some apocopated forms in the novel, such as que’ instead of quei (“those”), do include an apostrophe. In such cases, the apostrophe reflects a graphic convention rather than a genuine elision, and our annotation still treats these forms as apocopated.</p>
      <p>Furthermore, we extended the set of possible values for the feature Degree to include morphological alterations, which are also frequently attested in the novel:</p>
      <p>• Degree=Dim for diminutives (e.g., casetta, “little house”);
• Degree=Aug for augmentatives (e.g., spadone, “big sword”);
• Degree=Pej for pejoratives (e.g., occhiacci, “nasty eyes”);
• Degree=End for endearments (e.g., poverina, “poor little girl”).</p>
      <p>Rather than relying exclusively on morphological structure, the annotation of this feature was guided by contextual interpretation, focusing on the expressive or affective nuance that the altered form conveys in each occurrence. As Perotti [18] noted, many of these altered forms were introduced by Manzoni only in later revisions of I Promessi Sposi, reflecting his pursuit of greater precision and expressive depth. The extended feature set was thus designed to capture and document this stylistic evolution through a consistent, context-sensitive, and fine-grained annotation approach. Altered forms were lemmatized in the third field with their standard, non-altered base forms; the altered lemma, instead, was reported in the eleventh field (e.g., occhiacci, “nasty eyes”; third field: occhio, “eye”; eleventh field: occhiaccio, “nasty eye”). By lemmatizing altered forms under their standard base lemma, the annotation facilitates lexical querying and quantitative analysis, avoiding the dispersion of occurrences across multiple lemmas while preserving the expressive variation. For the same reason, namely to ensure consistency and semantic clarity in lexical analysis, fully lexicalized altered forms whose meaning significantly diverges from that of the base lemma were instead treated as independent lemmas (e.g., cavallone, “large water wave”, was lemmatized separately from cavallo, “horse”).</p>
      <p>In a nineteenth-century corpus like I Promessi Sposi, lemmatization also required additional care to account for archaisms and diachronic variation. In all cases, we prioritized the modern form of the lemma as the primary entry, placing it in the third field, regardless of the degree of obsolescence or morphological variation. This criterion was adopted to support both practical usability and interpretive clarity: lemmatizing under a standard modern lemma ensures ease of information retrieval, even for users who may not be familiar with historical or literary Italian. However, such standardization was not pursued at the expense of losing linguistically significant traces of the novel’s historical and stylistic identity. On the contrary, we aimed to preserve this richness by systematically annotating archaic and obsolete forms through a dedicated feature in the MISC field and/or an additional lemmatization in the eleventh field.</p>
      <p>More specifically, in line with this approach, we distinguished two main cases for archaic forms:</p>
      <p>• when the form was both obsolete and corresponded to an archaic lemma whose modern counterpart differed only in orthography or morphology (not in lexical identity or meaning), we annotated the feature Style=Arch in the MISC field and reported the archaic lemma in the eleventh field (e.g., annunzio; field LEMMA: annunciare, “to announce”; MISC field: Style=Arch; eleventh field: annunziare);
• when the form was only the archaic spelling of a lemma that is still used today (i.e., the lemma itself was not obsolete), we only annotated Style=Arch in the MISC field without adding any lemma in the eleventh field (e.g., varjo; field LEMMA: vario, “various”; MISC field: Style=Arch). The same criterion was also applied to inflected forms that appear archaic but whose corresponding lemma is still current and unaltered (e.g., chieggio, which is the first person singular of chiedere, “to ask”).</p>
      <p>To help distinguish between participles and adjectives, we referred to Guasti [<xref ref-type="bibr" rid="ref1">20</xref>], who indicates three diagnostic tests, also adopted in the annotation of CoLFIS:</p>
      <p>• participles cannot be modified with the suffix -issimo or intensifying adverbs (e.g., molto, “very”), while adjectives can;
• past participles can host clitic pronouns, while adjectives cannot;
• participles can co-occur with both essere, “to be”, and venire, “to come”, while adjectives cannot.</p>
      <p>3.2. Inter-Annotator Agreement</p>
      <p>The IAA was calculated on the first 100 sentences of Chapter 38, the last one of the novel. This chapter is not part of the current dataset, and its annotation is still in progress at the time of writing. The annotators involved are two students of the Master’s degree in “Linguistic Computing” at Università Cattolica del Sacro Cuore; they are Italian native speakers who have studied UD during a couple of courses of the degree but have not participated in the writing and discussion of the guidelines and are at their first experience of extensive annotation. Before beginning their work on Chapter 38, the students read the guidelines and analyzed the annotations already made for Chapters 1, 8, and 23.</p>
      <p>The Cohen’s kappa recorded for the different annotation levels was as follows:</p>
      <p>• Lemmatization: 0.80;
• UPOS tagging: 0.97;
• Morphological features identification: 0.84;
• Other features: Degree, 0.80; Style, 0.86; Variant, 0.99.</p>
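      <p>The agreement figures above rest on Cohen’s kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch of the computation is given below. The function name and the toy UPOS labels are illustrative assumptions; they do not reproduce the script or the data actually used for the IAA study.</p>
      <preformat>
```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy UPOS labels, invented for illustration (not taken from the dataset).
ann_1 = ["SCONJ", "PRON", "ADV", "SCONJ", "NOUN", "VERB"]
ann_2 = ["SCONJ", "SCONJ", "ADV", "PRON", "NOUN", "VERB"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 4))
```
      </preformat>
      <p>In this toy example the two annotators disagree only on tokens that could be read as SCONJ or PRON, the same kind of confusion reported for che in the agreement study.</p>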
    </sec>
    <sec id="sec-3">
      <title>In case of uncertainty, we referred to Nuovo De Mauro</title>
      <p>[19], which provides mappings between obsolete or
literary forms and their modern equivalents.</p>
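      <p>The field layout described in the guidelines (modern or base lemma in the third field, stylistic features in MISC, and an optional altered or archaic lemma in an eleventh field) can be pictured with a short Python sketch that reads CoNLL-U Plus lines. The name of the eleventh column (MANZONI:ORIGLEMMA), the FEATS values and the placement of Degree=Pej in MISC are assumptions made for illustration; the released files may use different names and values.</p>
      <preformat>
```python
# Two illustrative tokens in CoNLL-U Plus; HEAD/DEPREL are empty because
# the dataset described here does not include dependency parsing.
# The eleventh column name is hypothetical.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC MANZONI:ORIGLEMMA\n"
    "1\tocchiacci\tocchio\tNOUN\t_\tGender=Masc|Number=Plur\t_\t_\t_\tDegree=Pej\tocchiaccio\n"
    "2\tvarjo\tvario\tADJ\t_\tGender=Masc|Number=Sing\t_\t_\t_\tStyle=Arch\t_\n"
)


def parse_conllu_plus(text):
    """Return one dict per token, keyed by the declared column names."""
    columns, tokens = None, []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    return tokens


def misc_features(token):
    """Split the MISC field into a feature dictionary."""
    misc = token["MISC"]
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


tokens = parse_conllu_plus(SAMPLE)
for tok in tokens:
    print(tok["FORM"], tok["LEMMA"], tok["MANZONI:ORIGLEMMA"], misc_features(tok))
```
      </preformat>
      <p>Querying on the third field retrieves both tokens under their modern lemmas (occhio, vario), while the eleventh field and MISC keep the altered lemma and the archaic marking available for stylistic analysis.</p>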
      <p>Finally, consistent with the principles outlined above, we applied a contextual approach to UPOS tagging and morphological features assignment, following the conventions of current Italian treebanks: for example, infinitives and participles were annotated as NOUN or ADJ when used as nouns or adjectives, respectively. In the case of infinitives used as nouns, no morphological features were assigned, as these forms are not inflected for gender or number. For participles, instead, the annotation also had consequences on lemmatization: when used as adjectives, they were lemmatized with the corresponding masculine singular form, in line with standard adjectives; when retaining a verbal function, they were lemmatized with the infinitive of the corresponding verb.<sup>19</sup></p>
      <p><sup>19</sup> As for present participles, their usage is almost exclusively limited to either a nominal or, more rarely, a verbal function. The nominal use is generally easy to identify, as present participles functioning as nouns are typically preceded by a determiner (e.g., an article).</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Cohen’s kappa on the first 100 sentences of Chapter 38.</p>
        </caption>
        <table>
          <thead>
            <tr><th>UPOS</th><th>Kappa</th><th>Morphological Features</th><th>Kappa</th></tr>
          </thead>
          <tbody>
            <tr><td>X</td><td>1</td><td>Polarity</td><td>0.89</td></tr>
            <tr><td>NUM</td><td>1</td><td>Definite</td><td>0.82</td></tr>
            <tr><td>INTJ</td><td>1</td><td>Gender</td><td>0.81</td></tr>
            <tr><td>PROPN</td><td>1</td><td>Foreign</td><td>0.8</td></tr>
            <tr><td>PUNCT</td><td>0.99</td><td>NumType</td><td>0.8</td></tr>
            <tr><td>NOUN</td><td>0.99</td><td>Number</td><td>0.8</td></tr>
            <tr><td>CCONJ</td><td>0.99</td><td>Person</td><td>0.8</td></tr>
            <tr><td>ADP</td><td>0.98</td><td>Clitic</td><td>0.78</td></tr>
            <tr><td>VERB</td><td>0.98</td><td>VerbForm</td><td>0.78</td></tr>
            <tr><td>PRON</td><td>0.96</td><td>Poss</td><td>0.77</td></tr>
            <tr><td>AUX</td><td>0.96</td><td>PronType</td><td>0.77</td></tr>
            <tr><td>DET</td><td>0.95</td><td>Tense</td><td>0.76</td></tr>
            <tr><td>ADV</td><td>0.94</td><td>Mood</td><td>0.76</td></tr>
            <tr><td>ADJ</td><td>0.92</td><td>Degree</td><td>0.45</td></tr>
            <tr><td>SCONJ</td><td>0.89</td><td>Reflex</td><td>0.39</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 1 provides details on the Cohen’s kappa achieved for each UPOS tag and morphological feature. Overall, the results for the various annotation levels are good, often above 0.80 (indicating substantial or almost perfect agreement), with a few exceptions only for some features.</p>
      <p>As for lemmatization, there are 27 discordant lemmas that fall into 4 categories. Some cases are clear errors due to superficial annotation: e.g., in si sana ogni piaga (“every wound is healed”), sana is lemmatized as sano (“healthy”) instead of sanare (“to heal”). A recurring issue concerns the lemmatization of unstressed personal pronouns. Sometimes, the lemma matches the token itself; other times, it corresponds to the masculine form: e.g., in l’era stata compagna (“she had been her companion”), l’ is lemmatized with le (feminine) or with lo (masculine). Another disagreement concerns the lemmatization of words in an archaic form, which also has repercussions on the feature Style=Arch. For example, pronunziar (“to pronounce”) is lemmatized alternatively as pronunciare, in this case by adding the feature Style=Arch, or as pronunziare, without the feature.</p>
      <p>Regarding the annotation of UPOS tags, the lowest agreement is recorded on subordinate conjunctions, confused with adpositions (2 times), adverbs (4 times) and pronouns (7 times, always in the annotation of che, meaning “who”, “which” or “that”).</p>
      <p>The results concerning the annotation of morphological features show greater variability. Notably, the features Degree, which is employed for marking comparative and superlative forms of adjectives and adverbs, and Reflex, which is used for reflexive pronouns, have relatively low kappa scores (0.45 and 0.39 respectively), indicating moderate and fair IAA. As mentioned in subsection 3.1, these features were subject to modifications that appear to have been insufficiently assimilated by the annotators. For instance, one annotator consistently employed the Sup value of Degree rather than Abs for absolute superlatives, and frequently omitted the Reflex=Yes feature.</p>
      <p>By contrast, the level of agreement is high for the newly introduced features in the MISC column. An interesting example of annotation divergence concerns the token figliuoli (“children”): one annotator interprets it as an archaic form of the lemma figlio (“child”) with an endearing suffix, whereas the other annotator assigns the lemma figliuolo, without marking it with either the Degree=End or Style=Arch features.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Retraining Stanza</title>
      <p>The dataset was split into training, development, and test sets using an 80/10/10 ratio, with the division based on the number of syntactic words as units, in accordance with the guidelines of the UD framework. The syntactic words were taken in equal proportion from each of the three chapters. Following this approach, the partitions are the following:</p>
      <p>• training set: 615 sentences, 20,806 tokens;
• development set: 101 sentences, 2,670 tokens;
• test set: 75 sentences, 2,457 tokens.</p>
      <p>Using this partition, a new Stanza [<xref ref-type="bibr" rid="ref2">21</xref>] model for Manzoni’s novel has been developed. Table 2 presents the performance of the retrained model on the test set, in comparison with results obtained on the same file from other models, namely the ISDT [15] and OLD [9] 2.15 models of UDPipe 2, as well as the spaCy it_core_news_lg<sup>20</sup> and the Stanza combined models. The retrained model outperforms the other evaluated ones across all tasks. This is also due, in part, to the different annotation choices, especially those related to the features (see Section 3).</p>
      <p><sup>20</sup> https://spacy.io/models/it#it_core_news_lg</p>
      <p>All models are nearly equivalent and highly reliable in token segmentation. The biggest divergence occurs for sentence splitting: as previously shown by Redaelli and Sprugnoli [<xref ref-type="bibr" rid="ref3">22</xref>], this task is challenging due to the distinctive punctuation of the novel, particularly the use of guillemets and long dashes as closing quotation marks, so the development of a dedicated model is especially necessary. Syntactic word segmentation has high scores (&gt; 90) across all models, but spaCy proved to be the least reliable.</p>
      <p>With regard to UPOS tagging, the retrained Stanza model achieves an improvement of 2.44 F1 points compared to the Stanza combined model. The tag with the lowest F1 score under the retrained setting is INTJ (F1=0.79, P=1, R=0.65). For example, the only occurrence of ohimè (an interjection roughly equivalent to “alas”) is misclassified as a NOUN, while addio, “farewell”, is classified three times as an INTJ and three times as a NOUN. All other tags have values above 0.80, but we can notice some recurring errors in the case of the SCONJ tag. Indeed, subordinating conjunctions (F1=0.85, P=0.84, R=0.85) are confused with prepositions (ADP, especially for dopo, “after”), pronouns (PRON, as in the case of che, “who/that”), or adverbs (ADV, as in the case of dove, “where”).</p>
      <p>As for Universal features (UFeats), the 3.71 point improvement over the Stanza combined model is likely due to differences in the handling of specific features such as Reflex=Yes and VerbForm=Conv. The features with the lowest F1 scores are PronType=Int (F1=0.50, P=0.50, R=0.50), which marks interrogative pronouns and determiners, and PronType=Exc (F1=0.44, P=0.67, R=0.33), which is applied to exclamative pronouns and determiners. These categories are sparsely represented in the test set, with only 8 and 6 instances respectively. However, there is evidence of confusion between the two: for example, in the sentence “Come stava allora il povero don Abbondio!” (“How was poor Don Abbondio feeling at that moment!”) the word come, “how”, is annotated as PronType=Exc in the gold data, but the model incorrectly predicts PronType=Int. The feature Mood=Cnd, indicating verbs in the conditional mood, also yields a relatively low F1 score (F1=0.73, P=1, R=0.57). Although this class includes only a small number of instances (7), misclassifications occurred, including one case where it was confused with the indicative mood (fiaterebbe, “he would breathe”) and another with the subjunctive mood (leverebbe, “he would take away”).</p>
      <p>For lemmatization, the improvement is of 2.84 points with respect to the Stanza combined model, with a total of 82 incorrect lemma predictions. Notably, lemmatization choices involving altered forms and archaic variants do not appear to be major sources of inaccuracy: indeed, only 12% of errors involve altered forms, and 4% involve archaic ones. Table 3 provides examples of these types of errors. The remaining instances mostly concern the prediction of non-existent lemmas (e.g., riunendo (gerund of “reunite”) → riunere instead of riunire; mangi (present subjunctive of “eat”) → manire instead of mangiare) and of feminine forms instead of the correct masculine ones (e.g., scure (“dark”) → scura instead of scuro; forestiera (“female foreigner”) → forestiera instead of forestiero). It is interesting to note that the UDPipe model trained on the Divina Commedia (UDPipe-OLD) exhibits low performance on lemmatization, despite the fact that the target domain is literary, as is the case for Manzoni. This discrepancy can likely be attributed to the considerable temporal and stylistic differences between the two sources: the Divina Commedia dates back to the 14th century and is composed in poetic form, whereas Manzoni’s work dates to the 19th century and is written in prose. Indeed, the lexical overlap between the lemmas in the training set of the OLD treebank and those in our corpus amounts to only 50%, compared to a higher overlap of 69% with the lemmas in the training set of the ISDT treebank.</p>
      <p>4.1. One Novel, Three Versions</p>
      <p>Alessandro Manzoni revised I Promessi Sposi multiple times, resulting in three versions. The earliest, a handwritten draft composed in 1823 and known as Fermo e Lucia, differs in both content and style from later editions. The language used, for example, is an original combination of Italian, Lombard, French and Latin calques, also rich in the author’s neologisms. In 1827, Manzoni published a revised version, commonly called the Ventisettana, which introduced substantial linguistic refinements aimed at improving clarity and accessibility for Italian readers. The definitive version, released starting from 1840 and known as the Quarantana, incorporated further stylistic and linguistic changes based on the Florentine language, reflecting Manzoni’s efforts to promote a unified Italian language.</p>
      <p>Given the linguistic differences among these versions, it is of particular interest to assess the extent to which the model trained on the Quarantana generalizes to earlier texts. Table 4 presents the F1 scores obtained in the first chapter of Fermo e Lucia (5,760 tokens) and the Ventisettana (7,407 tokens). Notably, performance on the Ventisettana is even higher in terms of morphological features and lemmatization, although there is a slight decrease in UPOS tagging. Morphological features identification is still good on the 1823 version, but UPOS tagging and lemmatization show a more evident drop.</p>
      <p>4.2. A Joint Model</p>
      <p>An additional experiment involved the creation of a combined model trained on the merged training and development sets of ISDT and the training set of I Promessi Sposi. ISDT was selected because its corresponding model achieved better results than the other off-the-shelf models, although it still underperformed compared to the in-domain retrained model. The resulting combined training set consisted of 14,300 sentences.</p>
      <p>Table 5 reports the performance of this combined model on the first chapters of Fermo e Lucia and the Ventisettana, as well as on the test set from the Quarantana. The increased training data, despite being from a different domain and not always consistent with our annotation guidelines, led to a modest overall improvement in performance, particularly on the 1840 test set.</p>
      <p>
        These generally positive results align with findings from
previous experiments conducted on the Voci della Grande
Guerra [
        <xref ref-type="bibr" rid="ref4">23</xref>
        ] and VoDIM [7] corpora. In contrast, joint
models developed for syntactic parsing of the Divina
Commedia have shown lower performance compared to
in-domain models [
        <xref ref-type="bibr" rid="ref5">24</xref>
        ].
      </p>
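      <p>The figures reported in this section can be cross-checked with a few lines of arithmetic. The Python sketch below, given for illustration only, recomputes the token shares of the three partitions (token counts are used here as a proxy for syntactic words), derives the number of ISDT sentences implied by the size of the combined training set, and checks that some of the reported F1 scores follow from the stated precision and recall. All numbers are copied from the text; the variable names are illustrative.</p>
      <preformat>
```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Partition sizes as reported in Section 4: (sentences, tokens).
partitions = {"train": (615, 20806), "dev": (101, 2670), "test": (75, 2457)}
total_sentences = sum(s for s, _ in partitions.values())
total_tokens = sum(t for _, t in partitions.values())
token_shares = {name: t / total_tokens for name, (_, t) in partitions.items()}

# Sentences contributed by ISDT (train + dev) to the joint training set.
joint_training_sentences = 14300
isdt_sentences = joint_training_sentences - partitions["train"][0]

# Reported (precision, recall) pairs and the F1 they imply.
reported = {
    "INTJ": (1.0, 0.65),
    "PronType=Exc": (0.67, 0.33),
    "Mood=Cnd": (1.0, 0.57),
}
implied_f1 = {name: round(f1(p, r), 2) for name, (p, r) in reported.items()}

print(total_sentences, total_tokens)
print({name: round(share, 3) for name, share in token_shares.items()})
print(isdt_sentences, implied_f1)
```
      </preformat>
      <p>The partition totals match the dataset size given in the abstract (791 sentences, almost 26 K tokens), and the token shares stay close to the intended 80/10/10 ratio.</p>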
      <sec id="sec-4-1">
        <title>5. Modeling Apocopes</title>
        <p>We implemented a supervised sequence labeling pipeline for identifying apocopated forms using Conditional Random Fields (CRFs) and the same train, development and test sets used for the retraining of Stanza. For the time being, we have focused on apocopated forms only, as among the three specific features we added to the annotation, Variant=Apoc is the most frequent, whereas the others are too sparsely represented.<sup>21</sup> Although more frequent than the other features, the number of instances was still insufficient to support the use of neural methods, which require larger amounts of training data to perform effectively and generalize well. Therefore, we adopted a CRF-based approach instead.</p>
        <p><sup>21</sup> The whole dataset, at the moment of writing, contains 735 apocopated forms, 109 altered forms and 106 archaic forms.</p>
        <p>The model is trained using the sklearn-crfsuite library, and its hyperparameters (the c1 and c2 regularization coefficients) are optimized via randomized search with 5-fold cross-validation. The feature set includes orthographic (e.g., lowercase form, word suffixes and prefixes), morphological (e.g., UPOS and FEATS) and lexical (lemma) features from the preceding and following tokens. The results of the model’s binary classification on the test set are reported in Table 6.</p>
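        <p>A feature set of this kind can be sketched as a function that maps each token to a dictionary of features, the per-token input format commonly used with sklearn-crfsuite. The Python sketch below is only an approximation under stated assumptions: the feature names, the one-token context window, the input layout (dicts with form, upos, feats and lemma keys) and the APOC/O label names are hypothetical and are not taken from the released code.</p>
        <preformat>
```python
def token_features(sentence, i):
    """Feature dict for token i of a sentence (a list of token dicts)."""
    tok = sentence[i]
    form = tok["form"]
    feats = {
        "bias": 1.0,
        # Orthographic features.
        "lower": form.lower(),
        "prefix2": form[:2].lower(),
        "suffix2": form[-2:].lower(),
        "suffix3": form[-3:].lower(),
        # Morphological and lexical features.
        "upos": tok["upos"],
        "feats": tok["feats"],
        "lemma": tok["lemma"],
    }
    # Context window: one token to the left and one to the right.
    for offset, name in ((-1, "prev"), (1, "next")):
        j = i + offset
        if j >= 0 and len(sentence) > j:
            ctx = sentence[j]
            feats[name + ":lower"] = ctx["form"].lower()
            feats[name + ":upos"] = ctx["upos"]
            feats[name + ":lemma"] = ctx["lemma"]
        else:
            # Mark sentence boundaries explicitly.
            feats["BOS" if offset == -1 else "EOS"] = True
    return feats


def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]


# Toy two-token sequence: "fecer" is an apocopated form of "fecero".
sentence = [
    {"form": "fecer", "upos": "VERB",
     "feats": "Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin",
     "lemma": "fare"},
    {"form": "silenzio", "upos": "NOUN",
     "feats": "Gender=Masc|Number=Sing", "lemma": "silenzio"},
]
X = sentence_features(sentence)
y = ["APOC", "O"]  # hypothetical binary labels, one per token
print(X[0]["suffix2"], X[0]["next:lower"], y[0])
```
        </preformat>
        <p>Lists of such dictionaries, one list per sentence, together with the matching label sequences, would then be passed to the CRF estimator, whose c1 and c2 coefficients are the hyperparameters tuned in the randomized search.</p>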
        <p>The test set contains 59 apocopated forms
corresponding to 41 tokens and 33 lemmas; 12 of these forms do
not appear in the training set, which includes 611
apocopated instances corresponding to 220 tokens and 169
distinct lemmas. Among the model’s 9 false negatives, 4
are apocopated forms that were not seen during training:
i.e., timor (“fear”), almen (“at least”), passan (“they pass
by”), ondeggiar (“to ripple”). As for the remaining cases,
the model fails to correctly classify par (“it seems”, seen
7 times in the training set), fra (“friar”, 3 times), star (“to
stay”, 2 times), and siam and cagion (“we are” and “cause”,
each seen once in the training data).</p>
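        <p>The counts given above also allow a quick estimate of the recall on the positive class, under the assumption that the 59 apocopated forms are the positive instances of the test set and that the 9 false negatives are counted over the same instances. The following Python lines are a back-of-the-envelope check, not a substitute for the scores in Table 6.</p>
        <preformat>
```python
positives = 59        # apocopated forms in the test set (from the text)
false_negatives = 9   # apocopated forms missed by the model (from the text)
unseen_missed = 4     # timor, almen, passan, ondeggiar

true_positives = positives - false_negatives
recall = true_positives / positives
# Missed forms that did occur in training: par, fra, star, siam, cagion.
seen_missed = false_negatives - unseen_missed

print(true_positives, round(recall, 3), seen_missed)
```
        </preformat>
        <p>Under these assumptions, more than half of the misses concern forms that were seen during training, which suggests that context, and not only lexical coverage, plays a role in the remaining errors.</p>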
      </sec>
      <sec id="sec-4-2">
        <title>6. Conclusion</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>In this paper, we have introduced several new resources:</title>
      <p>(i) a manually annotated dataset of 3 chapters of I Promessi Sposi, comprising 791 sentences and approximately 26,000 tokens, enriched with lemmas, UPOS tags, Universal Dependencies morphological features and ad hoc features designed for capturing specific stylistic characteristics of Manzoni’s novel; (ii) an in-domain NLP model trained specifically on this dataset; (iii) a joint model combining data from the novel and the ISDT treebank; (iv) a specialized model for recognizing apocopated forms, which are a distinctive feature of Manzoni’s text.</p>
      <p>All data and models developed in this study are made
publicly available in a dedicated GitHub repository,
hopefully laying the groundwork for future research on Italian
literary texts through computational approaches.</p>
      <p>As for future work, a key priority is to extend the
annotation to additional chapters. Thanks to the new
models developed in this study and their relatively
low error rates, the manual correction process is
expected to be significantly accelerated. The expansion
of the dataset will also enable the development of
models targeting the other two specific features introduced
in our annotation scheme, namely Style=Arch and
Degree=Aug/Dim/End/Pej. Another future step will
involve syntactic annotation, with the ultimate goal of
including Italy’s most important novel among the
UD treebanks. This will continue the broader effort to
integrate Italian literary texts into syntactically annotated
resources, following the precedent set by the annotation
of the Divina Commedia [9].</p>
      <sec id="sec-5-1">
        <title>Acknowledgments</title>
        <p>The authors thank Flavio Massimiliano Cecchini for
annotating chapters 1, 8 and 23 of Quarantana and Ventisettana,
Alessia Leo and Michael Mostacchi for annotating
chapter 38 of Quarantana, Chiara Febbraro for the annotation
of chapter 1 of Fermo e Lucia and Giovanni Moretti for
technical assistance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT (OpenAI) for text
translation. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Guasti</surname>
          </string-name>
          ,
          <article-title>Il sintagma aggettivale</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Renzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Salvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardinaletti</surname>
          </string-name>
          (Eds.),
          <source>Grande grammatica italiana di consultazione, vol. II</source>
          , libreriauniversitaria.it Edizioni,
          <year>2022</year>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>340</lpage>
          . First published in 1991 by Il Mulino. Anastatic reprint.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Stanza: A Python natural language processing toolkit for many human languages</article-title>
          , in:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2020</year>
          . URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Redaelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <article-title>Is sentence splitting a solved task? experiments to the intersection between NLP and Italian linguistics</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</source>
          , CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>813</fpage>
          -
          <lpage>820</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.88/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I.</given-names>
            <surname>De Felice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , et al.,
          <article-title>Italian in the trenches: linguistic annotation and analysis of texts of the great war</article-title>
          , in:
          <source>Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>
          , Accademia University Press,
          <year>2018</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Corbetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <article-title>Join together? combining data to parse Italian texts</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</source>
          , CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>257</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.30/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>