
Cagliari, Italy * Corresponding author. † This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system, Rachele Sprugnoli is responsible for sections

Annotating Manzoni: Challenges in the Annotation of Lemmas, POS and Features in “I Promessi Sposi”

Rachele Sprugnoli

Arianna Redaelli

Università Cattolica del Sacro Cuore, Largo Gemelli, 1, 20123 Milano, Italy
Università di Parma, Via D'Azeglio, 85, 43125 Parma, Italy

2025


In this paper we introduce a dataset of I Promessi Sposi annotated with lemmas, UPOS tags, and features aligned with Universal Dependencies (UD). Three representative chapters from Manzoni's 1840 edition (791 sentences, almost 26 K tokens) were automatically tagged with UDPipe and fully manually corrected. Tailored guidelines extended standard UD practice with: (i) a double lemmatization approach, one that maintains archaic spellings and altered forms and one that normalizes lemmas, (ii) novel features that capture specific important characteristics of the novel, such as the use of apocopated and altered forms. Using the resulting dataset, we retrained the Stanza pipeline to obtain an in-domain model. Augmenting training data with ISDT sentences yielded further, although smaller, gains. Finally, a CRF sequence tagger was developed to identify apocopated forms.

Keywords: annotation, Italian literature, computational literary studies, Alessandro Manzoni

• A manually annotated dataset comprising three chapters of the novel, totaling 791 sentences and approximately 26,000 tokens. The annotations include lemmas, UPOS tags, and morphological features following the Universal Dependencies (UD) framework. Particular attention was given to (i) using features described in the Italian UD guidelines that are not yet widely adopted across existing treebanks, (ii) applying a dual lemmatization strategy (normalizing and conservative), (iii) defining additional features that capture stylistic and linguistic peculiarities of the novel.
• An in-domain model trained on the aforementioned annotated dataset.
• A joint model trained on the combined data from I Promessi Sposi and the ISDT treebank.
• A dedicated model for the recognition of apocopated forms, which are characteristic of the novel’s language.

All datasets and models are publicly available in a dedicated GitHub repository: https://github.com/RacheleSprugnoli/CoNLL-U_Manzoni.

2. Related Work

The application of NLP tools to Italian literary texts has been approached through targeted experiments since the early 2000s. Basili et al. [5] employed machine-learning techniques to semantically classify narrative fragments from Alberto Moravia’s novel Gli indifferenti, whereas Pennacchiotti and Zanzotto [6] evaluated the accuracy of a morphological analyzer and a POS tagger on a range of prose and poetry texts dating from the thirteenth century to the late nineteenth century, revealing a drop in performance compared with results obtained on contemporary Italian. More recently, within the TrAVaSI project (Trattamento Automatico di Varietà Storiche di Italiano), texts of various genres, including literary works dated from 1861 onwards, have been annotated according to the UD framework, but using the same annotation layers we adopt for Manzoni, i.e., excluding dependency parsing. As in our study, these annotated data have been exploited to train automatic models [7]. Particular attention has been devoted to lemmatization, adopting a conservative approach that preserves the original token’s graphical, phonological, and morphological characteristics [8]. By contrast, dependency parsing is included in the annotation of Dante Alighieri’s Divina Commedia, which has in turn enabled the release of the Italian-Old treebank⁹ and the development of models specifically tailored to this text [9]. In this annotation, lemmatization follows the criteria established in the DanteSearch project, from which the data were drawn [10] before applying the UD framework. In this case as well, a conservative strategy is adopted, whereby pecorelle (“little sheep”) is lemmatized as pecorella. The same methodology has also been employed in the Edizione dell’Opera Omnia di Luigi Pirandello [11] and in the Archivio Lessicografico della Poesia Italiana dell’Otto-Novecento (ALPION) [12], although in these projects the data are accessible only through concordances.¹⁰ Different lemmatization choices have been made in the compilation of other linguistic resources for Italian. For instance, in MIDIA, altered forms are linked to their corresponding base lemmas, but other word forms have not been normalized, resulting in distinct lemmas for each variation: for example, the archaic spelling imaginando (“imagining”) is lemmatized as imaginare, while the modern form immaginando corresponds to immaginare. In CoLFIS (Corpus e Lessico di Frequenza dell’Italiano Scritto), altered nouns and adjectives were initially lemmatized as independent lemmas and then a reference to the corresponding base form was added.¹¹ Finally, in LIPSI (Lessico di frequenza dell’italiano parlato nella Svizzera italiana), altered forms are mapped to a base lemma when weakly lexicalized: e.g., chiesina (“little church”) is lemmatized with chiesa (“church”). On the contrary, independent entries are created when there is a significant semantic divergence between the derived form and the base: e.g., lampadina (“light bulb”) is treated as a separate lemma with respect to lampada (“lamp”) [13, 14]. This same strategy is also adopted in the compilation of the Nuovo De Mauro dictionary¹² and in our work, as explained in detail in Section 3.

⁹ https://github.com/UniversalDependencies/UD_Italian-Old
¹⁰ https://vocabolari.pirandellonazionale.it/; https://alpion.unict.it/vocabolario/ricerca/
¹¹ https://linguistica.sns.it/CoLFIS/Home.htm
¹² https://dizionario.internazionale.it/avvertenze/2

3. Annotation

Chapters 1, 8, and 23¹³ of the final edition of I Promessi Sposi (1840) were automatically annotated with UDPipe 2 (ISDT model, version 2.15) [15] [16] and then manually corrected.¹⁴ We adopted the CoNLL-U Plus format¹⁵ to arrange specific annotation requirements designed for the novel, as explained in the following subsection (see Figure 1).

¹³ These chapters were selected for their stylistic and structural variety. Chapter 1 introduces the setting of the novel and includes a long descriptive passage, some dialogic sections, and even pseudo-documentary parts marked by archaic lexical choices; chapter 8 plays a central role in the narrative, featuring multiple scenes, thematic shifts, and dialogic exchanges, as well as a semi-lyrical closing section; chapter 23 is characterized by its predominantly dialogic structure and includes a lengthy final soliloquy.
¹⁴ As we report in Table 2, the performance of this model is not optimal.
¹⁵ https://universaldependencies.org/ext-format.html

3.1. Guidelines

The annotation guidelines were developed collaboratively, discussed in multiple revision rounds, and refined to their current form. Their purpose was to guide the annotation process while remaining as consistent as possible with the official UD guidelines for Italian.¹⁶ However, existing Italian treebanks do not always strictly follow UD’s recommendations. Whenever discrepancies were encountered between the UD guidelines and currently available treebanks, our guidelines prioritized the official UD specifications. This involved both substitutions and additions.

¹⁶ https://github.com/UniversalDependencies/docs/tree/pages-source/_it

Among the substitutions, we systematically replaced the use of VerbForm=Ger, commonly found in current treebanks for the traditional Italian gerund (e.g., dicendo, “saying”), with the correct label VerbForm=Conv. Similarly, for superlative adjective forms (e.g., pessimo, “very bad”), we replaced Degree=Sup with Degree=Abs.

Among the additions, we decided to use the feature Reflex=Yes for reflexive forms (e.g., sé, si, proprio, “him/her/itself”, “themselves”): although this feature is listed among the ones to be used in Italian,¹⁷ it is still rarely applied in most currently available treebanks.¹⁸ We also annotated indefinite pronouns functioning as total quantifiers (e.g., ogni, “each”, “every”, tutto, “all”, “everything” and ciascuno, “everyone”, “each one”) with the feature PronType=Tot, in line with the UD guidelines, despite its inconsistent use across current resources.

¹⁷ https://github.com/UniversalDependencies/docs/blob/pages-source/_it/feat/Reflex.md
¹⁸ Reflex=Yes is currently present in the following treebanks: PUD (3 occurrences), ParTUT (14), OLD (2,346).

Beyond these additions, we introduced a set of features not prescribed by the UD Italian guidelines, but intended to account for morphosyntactic phenomena of particular historical or stylistic relevance in I Promessi Sposi. All such features were annotated in the MISC field.

Firstly, we used the feature Variant=Apoc to annotate apocopated forms, only excluding indefinite articles (e.g., un, “a”), which are fully grammaticalized in contemporary Italian and therefore not stylistically significant. As observed by Bianchi [17], Manzoni drew both on postconsonantal and postvocalic apocopes (e.g., respectively, fecer instead of fecero, “they did”, and cagion instead of cagione, “cause”) to evoke the rhythms and informality of spoken language, at times even extending beyond Florentine usage, which was his main language model. Unlike elisions, which involve the omission of a final vowel before an initial vowel and are graphically marked with an apostrophe, apocopes generally drop final phonemes regardless of the phonological context and are not marked. However, some apocopated forms in the novel, such as que’ instead of quei (“those”), do include an apostrophe. In such cases, the apostrophe reflects a graphic convention rather than a genuine elision, and our annotation still treats these forms as apocopated.

Furthermore, we extended the set of possible values for the feature Degree to include morphological alterations, which are also frequently attested in the novel:

• Degree=Dim for diminutives (e.g., casetta, “little house”);
• Degree=Aug for augmentatives (e.g., spadone, “big sword”);
• Degree=Pej for pejoratives (e.g., occhiacci, “nasty eyes”);
• Degree=End for endearments (e.g., poverina, “poor little girl”).

Rather than relying exclusively on morphological structure, the annotation of this feature was guided by contextual interpretation, focusing on the expressive or affective nuance that the altered form conveys in each occurrence. As Perotti [18] noted, many of these altered forms were introduced by Manzoni only in later revisions of I Promessi Sposi, reflecting his pursuit of greater precision and expressive depth. The extended feature set was thus designed to capture and document this stylistic evolution through a consistent, context-sensitive, and fine-grained annotation approach. Altered forms were lemmatized in the third field with their standard, non-altered base forms; the altered lemma, instead, was reported in the eleventh field (e.g., occhiacci, “nasty eyes”; third field: occhio, “eye”; eleventh field: occhiaccio, “nasty eye”). By lemmatizing altered forms under their standard base lemma, the annotation facilitates lexical querying and quantitative analysis, avoiding the dispersion of occurrences across multiple lemmas while preserving the expressive variation. For the same reason, namely to ensure consistency and semantic clarity in lexical analysis, fully lexicalized altered forms whose meaning significantly diverges from that of the base lemma were instead treated as independent lemmas (e.g., cavallone, “large water wave”, was lemmatized separately from cavallo, “horse”).

In a nineteenth-century corpus like I Promessi Sposi, lemmatization also required additional care to account for archaisms and diachronic variation. In all cases, we prioritized the modern form of the lemma as the primary entry, placing it in the third field, regardless of the degree of obsolescence or morphological variation. This criterion was adopted to support both practical usability and interpretive clarity: lemmatizing under a standard modern lemma ensures ease of information retrieval, even for users who may not be familiar with historical or literary Italian. However, such standardization was not pursued at the expense of losing linguistically significant traces of the novel’s historical and stylistic identity. On the contrary, we aimed to preserve this richness by systematically annotating archaic and obsolete forms through a dedicated feature in the MISC field and/or an additional lemmatization in the eleventh field.

More specifically, in line with this approach, we distinguished two main cases for archaic forms:

• when the form was both obsolete and corresponded to an archaic lemma whose modern counterpart differed only in orthography or morphology (not in lexical identity or meaning), we annotated the feature Style=Arch in MISC field and reported the archaic lemma in the eleventh field (e.g., annunzio; field LEMMA: annunciare, “to announce”; MISC field: Style=Arch; eleventh field: annunziare);
• when the form was only the archaic spelling of a lemma that is still used today (i.e., the lemma itself was not obsolete), we only annotated Style=Arch in the MISC field without adding any lemma in the eleventh field (e.g., varjo; field LEMMA: vario, “various”; MISC field: Style=Arch). The same criterion was also applied to inflected forms that appear archaic but whose corresponding lemma is still current and unaltered (e.g., chieggio, which is the first person singular of chiedere, “to ask”).

In case of uncertainty, we referred to Nuovo De Mauro [19], which provides mappings between obsolete or literary forms and their modern equivalents.

Finally, consistent with the principles outlined above, we applied a contextual approach to UPOS tagging and morphological features assignment, following the conventions of current Italian treebanks: for example, infinitives and participles were annotated as NOUN or ADJ when used as nouns or adjectives, respectively. In the case of infinitives used as nouns, no morphological features were assigned, as these forms are not inflected for gender or number. For participles, instead, the annotation also had consequences on lemmatization: when used as adjectives, they were lemmatized with the corresponding masculine singular form, in line with standard adjectives; when retaining a verbal function, they were lemmatized with the infinitive of the corresponding verb.¹⁹ To help distinguish between participles and adjectives, we referred to Guasti [20], indicating three diagnostic tests, also adopted in the annotation of CoLFIS:

• participles cannot be modified with the suffix -issimo or intensifying adverbs (e.g., molto, “very”), while adjectives can;
• past participles can host clitic pronouns, while adjectives cannot;
• participles can co-occur with both essere, “to be”, and venire, “to come”, while adjectives can’t.

¹⁹ As for present participles, their usage is almost exclusively limited to either a nominal or, more rarely, a verbal function. The nominal use is generally easy to identify, as present participles functioning as nouns are typically preceded by a determiner (e.g., an article).

3.2. Inter-Annotator Agreement

The IAA was calculated on the first 100 sentences of Chapter 38, the last one of the novel. This chapter is not part of the current dataset and the completion is in progress at the time of writing this paper. The annotators involved are two students of the Master’s degree in “Linguistic Computing” at Università Cattolica del Sacro Cuore; they are Italian native speakers who have studied UD during a couple of courses of the degree but have not participated in the writing and discussion of the guidelines and are at their first experience of extensive annotation. Before beginning their work on Chapter 38, the students read the guidelines and analyzed the annotations already made for Chapters 1, 8, and 23.

The Cohen’s kappa recorded for the different annotation levels was as follows:

• Lemmatization: 0.80;
• UPOS tagging: 0.97;
• Morphological features identification: 0.84;
• Other features: Degree, 0.80; Style, 0.86; Variant, 0.99.

Table 1
Cohen’s kappa on the first 100 sentences of Chapter 38.

UPOS     kappa     Morphological Features   kappa
X        1         Polarity                 0.89
NUM      1         Definite                 0.82
INTJ     1         Gender                   0.81
PROPN    1         Foreign                  0.8
PUNCT    0.99      NumType                  0.8
NOUN     0.99      Number                   0.8
CCONJ    0.99      Person                   0.8
ADP      0.98      Clitic                   0.78
VERB     0.98      VerbForm                 0.78
PRON     0.96      Poss                     0.77
AUX      0.96      PronType                 0.77
DET      0.95      Tense                    0.76
ADV      0.94      Mood                     0.76
ADJ      0.92      Degree                   0.45
SCONJ    0.89      Reflex                   0.39
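As an illustration of how agreement figures like those in Table 1 can be obtained, the following minimal sketch computes Cohen's kappa for two annotators who labelled the same tokens. It is not the authors' script: the function name and the toy label sequences are assumptions made only for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same, non-empty token list")
    n = len(labels_a)
    # Observed agreement: share of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:  # both annotators used one single label everywhere
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy UPOS labels (hypothetical, not taken from Chapter 38).
annotator_a = ["SCONJ", "PRON", "ADV", "SCONJ", "NOUN", "SCONJ"]
annotator_b = ["SCONJ", "SCONJ", "ADV", "PRON", "NOUN", "SCONJ"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.5
```

Per-tag or per-feature values such as those in Table 1 could be obtained in the same way by first binarizing the labels (for instance, SCONJ versus everything else), although the exact procedure used for the table is not stated in the text.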

Table 1 provides details on the Cohen’s kappa achieved for each UPOS tag and morphological feature. Overall, the results for the various annotation levels are good, often above 0.80 (indicating substantial or almost perfect agreement), with a few exceptions only for some features.

As for lemmatization, there are 27 discordant lemmas that fall into 4 categories. Some cases are clear errors due to superficial annotation: e.g., in si sana ogni piaga (“every wound is healed”), sana is lemmatized as sano (“healthy”) instead of sanare (“to heal”). A recurring issue concerns the lemmatization of unstressed personal pronouns. Sometimes, the lemma matches the token itself; other times, it corresponds to the masculine form: e.g., in l’era stata compagna (“she had been her companion”), l’ is lemmatized with le (feminine) or with lo (masculine). Another disagreement concerns the lemmatization of words in an archaic form, which also has repercussions on the feature Style=Arch. For example, pronunziar (“to pronounce”) is lemmatized alternatively as pronunciare, in this case by adding the feature Style=Arch, or as pronunziare, without the feature.

Regarding the annotation of UPOS tags, the lowest agreement is recorded on subordinate conjunctions, confused with adpositions (2 times), adverbs (4 times) and pronouns (7 times, always in the annotation of che, meaning “who”, “which” or “that”).

The results concerning the annotation of morphological features show greater variability. Notably, the features Degree, which is employed for marking comparative and superlative forms of adjectives and adverbs, and Reflex, which is used for reflexive pronouns, have relatively low kappa scores (0.45 and 0.39 respectively), indicating moderate and fair IAA. As mentioned in subsection 3.1, these features were subject to modifications that appear to have been insufficiently assimilated by the annotators. For instance, one annotator consistently employed the Sup value of Degree rather than Abs for absolute superlatives, and frequently omitted the Reflex=Yes feature.

By contrast, the level of agreement is high for the newly introduced features in the MISC column. An interesting example of annotation divergence concerns the token figliuoli (“children”): one annotator interprets it as an archaic form of the lemma figlio (“child”) with an endearing suffix, whereas the other annotator assigns the lemma figliuolo, without marking it with either the Degree=End or Style=Arch features.

4. Retraining Stanza

The dataset was split into training, development, and test sets using an 80/10/10 ratio, with the division based on the number of syntactic words as units, in accordance with the guidelines of the UD framework. The number of syntactic words was taken proportionally equally from the three chapters. Following this approach, the partitions are the following:

• training set: 615 sentences, 20,806 tokens;
• development set: 101 sentences, 2,670 tokens;
• test set: 75 sentences, 2,457 tokens.

Using this partition, a new Stanza [21] model for Manzoni’s novel has been developed.

Table 2 presents the performance of the retrained model on the test set, in comparison with results obtained on the same file from other models, namely the ISDT [15] and OLD [9] 2.15 models of UDPipe 2, as well as the spaCy it_core_news_lg²⁰ and the Stanza combined models. The retrained model outperforms the other evaluated ones across all tasks. Obviously this is also due to the different annotation choices, especially those related to the features (see Section 3).

²⁰ https://spacy.io/models/it#it_core_news_lg

All models are nearly equivalent and highly reliable in token segmentation. The biggest divergence occurs for sentence splitting: as previously shown by Redaelli and Sprugnoli [22], this task is challenging due to the distinctive punctuation of the novel, particularly the use of guillemets and long dashes as closing quotation marks, thus the development of a dedicated model is especially necessary. Syntactic word segmentation has high scores (> 90) across all models but spaCy proved to be the least reliable.

With regard to UPOS tagging, the retrained Stanza model achieves an improvement of 2.44 F1 points compared to the Stanza combined model. The tag with the lowest F1 score under the retrained setting is INTJ (F1=0.79, P=1, R=0.65). For example, the only occurrence of ohimè (a roughly equivalent interjection to “alas”) is misclassified as a NOUN, while addio, “farewell”, is classified three times as an INTJ and three times as a NOUN. All other tags have values above 0.80 but we can notice some recurring errors in the case of the SCONJ tag. Indeed, subordinating conjunctions (F1=0.85, P=0.84, R=0.85) are confused with prepositions (ADP, especially for dopo, “after”), pronouns (PRON, as in the case of che, “who/that”), or adverbs (ADV, as in the case of dove, “where”).

As for Universal features (UFeats), the 3.71 point improvement over the Stanza combined model is likely due to differences in the handling of specific features such as Reflex=Yes and VerbForm=Conv. The features with the lowest F1 scores are PronType=Int (F1=0.50, P=0.50, R=0.50), which marks interrogative pronouns and determiners, and PronType=Exc (F1=0.44, P=0.67, R=0.33), which is applied to exclamative pronouns and determiners. These categories are sparsely represented in the test set, with only 8 and 6 instances respectively. However, there is evidence of confusion between the two: for example, in the sentence “Come stava allora il povero don Abbondio!” (“How was poor Don Abbondio feeling at that moment!”) the word come, “how”, is annotated as PronType=Exc in the gold data, but the model incorrectly predicts PronType=Int. The feature Mood=Cnd, indicating verbs in the conditional mood, also yields a relatively low F1 score (F1=0.73, P=1, R=0.57). Although this class includes only a small number of instances (7), misclassifications occurred, including one case where it was confused with the indicative mood (fiaterebbe, “he would breathe”) and another with the subjunctive mood (leverebbe, “he would take away”).

For lemmatization, the improvement is of 2.84 points with respect to the Stanza combined model, with a total of 82 incorrect lemma predictions. Notably, lemmatization choices involving altered forms and archaic variants do not appear to be major sources of inaccuracy: indeed, only 12% of errors involve altered forms, and 4% involve archaic ones. Table 3 provides examples of these types of errors. The remaining instances mostly concern the prediction of non-existent lemmas (e.g., riunendo (gerund of “reunite”) → riunere instead of riunire; mangi (present subjunctive of “eat”) → manire instead of mangiare); and of feminine forms instead of the correct masculine ones (e.g., scure (“dark”) → scura instead of scuro; forestiera (“female foreigner”) → forestiera instead of forestiero). It is interesting to note that the UDPipe model trained on the Divina Commedia (UDPipe-OLD) exhibits low performance on lemmatization, despite the fact that the target domain is literary, as is the case for Manzoni. This discrepancy can likely be attributed to the considerable temporal and stylistic differences between the two sources: the Divina Commedia is dated back to the 14th century and is composed in poetic form, whereas Manzoni’s work dates to the 19th century and is written in prose. Indeed, the lexical overlap between the lemmas in the training set of the OLD treebank and those in our corpus amounts to only 50%, compared to a higher overlap of 69% with the lemmas in the training set of the ISDT treebank.

4.1. One Novel, Three Versions

Alessandro Manzoni revised I Promessi Sposi multiple times, resulting in three versions. The earliest, a handwritten draft composed in 1823 and known as Fermo e Lucia, differs in both content and style from later editions. The language used, for example, is an original combination of Italian, Lombard, French and Latin calques, also rich in author’s neologisms. In 1827, Manzoni published a revised version, commonly called the Ventisettana, which introduced substantial linguistic refinements aimed at improving clarity and accessibility for Italian readers. The definitive version, released starting from 1840 and known as the Quarantana, incorporated further stylistic and linguistic changes based on the Florentine language, reflecting Manzoni’s efforts to promote a unified Italian language.

Given the linguistic differences among these versions, it is of particular interest to assess the extent to which the model trained on the Quarantana generalizes to earlier texts. Table 4 presents the F1 scores obtained in the first chapter of Fermo e Lucia (5,760 tokens) and the Ventisettana (7,407 tokens). Notably, performance on the Ventisettana is even higher in terms of morphological features and lemmatization, although there is a slight decrease in UPOS tagging. Morphological features identification is still good on the 1823 version but UPOS tagging and lemmatization show a more evident drop.

4.2. A Joint Model

An additional experiment involved the creation of a combined model trained on the merged training and development sets of ISDT and the training set of I Promessi Sposi. ISDT was selected because its corresponding model achieved better results than the other off-the-shelf models, although it still underperformed compared to the in-domain retrained model. The resulting combined training set consisted of 14,300 sentences.
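One practical detail of building such a combined training set is that the novel is stored in CoNLL-U Plus with an extra eleventh column, whereas ISDT uses the standard ten-column CoNLL-U layout. The sketch below shows one way the two sources could be concatenated; it is an assumption about the preprocessing, not the authors' pipeline, and the column name `MANZONI:ALTLEMMA` and the toy sentences are hypothetical.

```python
def read_sentences(text):
    """Split CoNLL-U(-Plus) text into sentences, each a list of lines."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def to_ten_columns(sentence):
    """Keep comments (except the CoNLL-U Plus header) and the first 10 columns."""
    rows = []
    for line in sentence:
        if line.startswith("#"):
            if not line.startswith("# global.columns"):
                rows.append(line)
        else:
            rows.append("\t".join(line.split("\t")[:10]))
    return rows


def merge_training_data(*texts):
    """Concatenate the sentences of several files into one ten-column list."""
    merged = []
    for text in texts:
        for sentence in read_sentences(text):
            rows = to_ten_columns(sentence)
            if rows:  # skip blocks that only contained the header comment
                merged.append(rows)
    return merged


# Toy inputs (hypothetical sentences, not taken from either resource).
isdt_like = "\n".join([
    "# text = causa",
    "\t".join(["1", "causa", "causa", "NOUN", "S", "Gender=Fem|Number=Sing", "0", "root", "_", "_"]),
    "",
])
manzoni_like = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC MANZONI:ALTLEMMA",
    "# text = cagion",
    "\t".join(["1", "cagion", "cagione", "NOUN", "_", "Gender=Fem|Number=Sing", "_", "_", "_", "Variant=Apoc", "_"]),
    "",
])
merged = merge_training_data(isdt_like, manzoni_like)
print(len(merged))  # 2
```

Dropping the eleventh column keeps the MISC annotations such as Variant=Apoc but loses the second lemma, so a real setup would have to decide which of the two lemmatization layers the joint model should learn.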

Table 5 reports the performance of this combined model on the first chapters of Fermo e Lucia and the Ventisettana, as well as on the test set from the Quarantana. The increased training data, despite being from a different domain and not always consistent with our annotation guidelines, led to a modest overall improvement in performance, particularly on the 1840 test set.

These generally positive results align with findings from previous experiments conducted on the Voci della Grande Guerra [23] and VoDIM [7] corpora. In contrast, joint models developed for syntactic parsing of the Divina Commedia have shown lower performance compared to in-domain models [24].

5. Modeling Apocopes

We implemented a supervised sequence labeling pipeline for identifying apocopated forms using Conditional Random Fields (CRFs) and the same train, development and test sets used for the retraining of Stanza. For the time being, we have focused on apocopated forms only, as among the three specific features we added to the annotation, Variant=Apoc is the most frequent, whereas the others are too sparsely represented.²¹ Although more frequent than the other features, the number of instances was still insufficient to support the use of neural methods, which require larger amounts of training data to perform effectively and generalize well. Therefore, we adopted a CRF-based approach instead.

²¹ The whole dataset, at the moment of writing, contains 735 apocopated forms, 109 altered forms and 106 archaic forms.

The model is trained using the sklearn-crfsuite library and hyperparameters (c1 and c2 regularization coefficients) are optimized via randomized search with 5-fold cross-validation. The feature set includes orthographic (e.g., lowercase form, word suffixes and prefixes), morphological (e.g., UPOS and FEATS) and lexical (lemma) features from the preceding and following tokens. The results of the model’s binary classification on the test set are reported in Table 6.
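The feature extraction described above can be sketched as follows. This is a hedged illustration rather than the released implementation: the feature names, the one-token context window, the `APOC`/`O` label scheme, and the `c1`/`c2` values are assumptions, and the randomized search with cross-validation is omitted. The sklearn-crfsuite import is deferred so that the feature code runs without the library installed.

```python
def token_features(sentence, i):
    """Feature dict for token i; `sentence` is a list of dicts with
    'form', 'lemma', 'upos' and 'feats' keys (hypothetical layout)."""
    token = sentence[i]
    form = token["form"]
    features = {
        "bias": 1.0,
        "lower": form.lower(),        # orthographic
        "prefix2": form[:2].lower(),
        "suffix2": form[-2:].lower(),
        "suffix3": form[-3:].lower(),
        "upos": token["upos"],        # morphological
        "feats": token["feats"],
        "lemma": token["lemma"],      # lexical
    }
    # Context: the same kinds of cues from the previous and next token.
    for offset, name in ((-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(sentence):
            neighbour = sentence[j]
            features[name + ":lower"] = neighbour["form"].lower()
            features[name + ":upos"] = neighbour["upos"]
            features[name + ":lemma"] = neighbour["lemma"]
        else:
            features["BOS" if offset < 0 else "EOS"] = True
    return features


def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]


def train_crf(X, y, c1=0.1, c2=0.1):
    """Fit a CRF; c1 and c2 are placeholders, not the tuned values."""
    import sklearn_crfsuite  # third-party dependency, imported lazily
    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs", c1=c1, c2=c2,
        max_iterations=100, all_possible_transitions=True,
    )
    crf.fit(X, y)
    return crf


# Toy sentence: "per cagion sua", with cagion as an apocopated form.
sentence = [
    {"form": "per", "lemma": "per", "upos": "ADP", "feats": "_"},
    {"form": "cagion", "lemma": "cagione", "upos": "NOUN", "feats": "Gender=Fem|Number=Sing"},
    {"form": "sua", "lemma": "suo", "upos": "DET", "feats": "Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs"},
]
labels = ["O", "APOC", "O"]
X = [sentence_features(sentence)]
y = [labels]
```

With the library installed, `train_crf(X, y)` would fit a model on such feature sequences. The lemma is a particularly informative cue here, since an apocopated form is by definition shorter than its lemma or full inflected form.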

The test set contains 59 apocopated forms corresponding to 41 tokens and 33 lemmas; 12 of these forms do not appear in the training set, which includes 611 apocopated instances corresponding to 220 tokens and 169 distinct lemmas. Among the model’s 9 false negatives, 4 are apocopated forms that were not seen during training: i.e., timor (“fear”), almen (“at least”), passan (“they pass by”), ondeggiar (“to ripple”). As for the remaining cases, the model fails to correctly classify par (“it seems”, seen 7 times in the training set), fra (“friar”, 3 times), star (“to stay”, 2 times), and siam and cagion (“we are” and “cause”, each seen once in the training data).
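The error counts above can be related to the usual evaluation metrics with a few lines of arithmetic. The sketch below is only a reading aid: it assumes that the 59 apocopated forms are the gold positive instances, so that 9 false negatives leave 50 true positives; precision cannot be derived because the number of false positives is not given in the text. The second function checks that F1 values quoted in Section 4 are the harmonic mean of the reported precision and recall.

```python
def recall(true_positives, false_negatives):
    """Share of gold positive instances that the model recovered."""
    return true_positives / (true_positives + false_negatives)


def f1(precision, rec):
    """Harmonic mean of precision and recall."""
    if precision + rec == 0:
        return 0.0
    return 2 * precision * rec / (precision + rec)


# Apocope tagger: 59 gold forms and 9 false negatives (assumed to be instances).
gold_positives = 59
false_negatives = 9
apocope_recall = recall(gold_positives - false_negatives, false_negatives)
print(round(apocope_recall, 3))  # 0.847

# Consistency checks on figures reported for the retrained Stanza model.
print(round(f1(1.0, 0.65), 2))   # 0.79  (INTJ)
print(round(f1(0.67, 0.33), 2))  # 0.44  (PronType=Exc)
print(round(f1(1.0, 0.57), 2))   # 0.73  (Mood=Cnd)
```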

6. Conclusion

In this paper, we have introduced several new resources:

(i) a manually annotated dataset of three chapters of I Promessi Sposi, comprising 791 sentences and approximately 26,000 tokens, enriched with lemmas, UPOS tags, Universal Dependencies morphological features and ad hoc features designed for capturing specific stylistic characteristics of Manzoni’s novel; (ii) an in-domain NLP model trained specifically on this dataset; (iii) a joint model combining data from the novel and the ISDT treebank; (iv) a specialized model for recognizing apocopated forms, which are a distinctive feature of Manzoni’s text.

All data and models developed in this study are made publicly available in a dedicated GitHub repository, hopefully laying the groundwork for future research on Italian literary texts through computational approaches.

As for future work, a key priority is to extend the annotation to additional chapters. Thanks to the new models developed in this study and their relatively low error rates, the manual correction process is expected to be significantly accelerated. The expansion of the dataset will also enable the development of models targeting the other two specific features introduced in our annotation scheme, namely Style=Arch and Degree=Aug/Dim/End/Pej. Another future step will involve syntactic annotation, with the ultimate goal of incorporating Italy’s most important novel among the UD treebanks. This will continue the broader effort to integrate Italian literary texts into syntactically annotated resources, following the precedent set by the annotation of the Divina Commedia [9].

Acknowledgments

The authors thank Flavio Massimiliano Cecchini for annotating chapters 1, 8, 23 of Quarantana and Ventisettana, Alessia Leo and Michael Mostacchi for annotating chapter 38 of Quarantana, Chiara Febbraro for the annotation of chapter 1 of Fermo e Lucia and Giovanni Moretti for technical assistance.

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Text translation. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

[20] M. T. Guasti, Il sintagma aggettivale, in: L. Renzi, G. Salvi, A. Cardinaletti (Eds.), Grande grammatica italiana di consultazione, vol. II, libreriauniversitaria.it Edizioni, 2022, pp. 321-340. First published in 1991 by Il Mulino. Anastatic reprint.

[21] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.

[22] A. Redaelli, R. Sprugnoli, Is sentence splitting a solved task? experiments to the intersection between NLP and Italian linguistics, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 813-820. URL: https://aclanthology.org/2024.clicit-1.88/.

[23] I. De Felice, F. Dell'Orletta, G. Venturi, A. Lenci, S. Montemagni, et al., Italian in the trenches: linguistic annotation and analysis of texts of the great war, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Accademia University Press, 2018, pp. 160-164.

[24] C. Corbetta, G. Moretti, M. Passarotti, Join together? combining data to parse Italian texts, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 251-257. URL: https://aclanthology.org/2024.clicit-1.30/.