<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Assessing DIScourse COherence in Italian TEXts (DisCoTEX) task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Colla</string-name>
          <email>davide.colla@unito.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Dini</string-name>
          <email>irene.dini@ilc.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Paolo Radicioni</string-name>
          <email>daniele.radicioni@unito.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The Assessing DIScourse COherence in Italian TEXts (DisCoTEX) task is the first shared task focused on modelling discourse coherence for Italian real-world texts, and it has been proposed for the first time at EVALITA 2023. Providing two different datasets from different textual genres, we arranged the task into two independent sub-tasks: a more traditional one, aimed at evaluating whether models are able to distinguish well-organized documents from corrupted ones, and a less explored one, which assesses the models' performance on texts evaluated for coherence by human raters. In this paper, we describe the datasets released, discuss the different approaches tackled by the participating systems, and provide a first analysis of the obtained results.</p>
      </abstract>
      <kwd-group>
        <kwd>text coherence</kwd>
        <kwd>Italian language</kwd>
        <kwd>computational modeling</kwd>
        <kwd>evaluation campaign</kwd>
        <kwd>dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Coherence is a key property of any well-organized text and it plays a crucial role in human discourse processing. Indeed, as individuals process unfolding text, they are required to assemble information from single sentences and to draw inferences between and among them in order to create a meaningful mental representation of the whole text. According to the tripartite model developed by [1], this is the outcome of a three-step process in which readers construct multileveled memory representations of a text, encoding different, and progressively more abstract, information at each level. From this perspective, coherence is an inherently psychological construct, thus very hard to model; however, it also has a counterpart at the level of linguistic content and structure, often referred to as “cohesion”, a property of a text that is conveyed by signalling linguistic devices such as reference, ellipsis, discourse connectives, and argument overlap, which help readers make explicit the logical links between different units in texts.</p>
      <p>As regards computational modelling, coherence has been widely investigated in Natural Language Processing, often taking inspiration from frameworks like the Centering Theory [2]. One popular approach in this context is the entity-grid approach, which focuses on assessing local coherence, specifically the transitions between adjacent sentences (see, among others, [3, 4]). More recently, neural models have also been applied to deal with both structured representations of text and unstructured text, taking advantage of their ability to learn useful representations for the task, e.g. [5, 6]. Modelling coherence in natural language is of pivotal importance in a variety of downstream applications, from automatic essay scoring in language learning scenarios [7, 8] to language assessment in clinical settings [9, 10]. Additionally, from the Natural Language Generation point of view, coherence is an intrinsic evaluation metric to assess the quality of generated texts. An emerging area of interest pertains to the interpretability of modern deep neural networks. In this respect, while existing work on probing pre-trained language models has largely focused on sentence-level properties, the ability of these models to encode discourse and pragmatic phenomena is still unclear [11, 12, 13].</p>
      <p>The shared task DisCoTEX, organized in the context of the 8th evaluation campaign of NLP and speech tools for the Italian language (EVALITA 2023) [14], intends to encourage research on automatic discourse coherence modeling with emphasis on the Italian language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Definition of the Task</title>
      <p>Drawing inspiration from existing coherence modeling literature, the DisCoTEX task was designed with the intention of addressing two distinct scenarios. The first scenario involves the evaluation of models’ ability to differentiate well-structured documents from corrupted ones. The corrupted documents are typically created by either shuffling the sentence order of the original document or by replacing specific linguistic elements that contribute to coherence within and across sentences, such as personal pronouns or discourse connectives. The second scenario, which has been less explored, focuses on assessing the models’ performance in coherence evaluation by comparing their predictions to human raters’ evaluations.</p>
      <p>To capture these distinct scenarios, we proposed two independent sub-tasks:
• Sub-task 1 - Last sentence classification: This sub-task was cast as a binary classification task. Specifically, participants are presented with a prompt, which is a short paragraph consisting of approximately three consecutive sentences, and an individual sentence referred to as the target. The objective is to classify whether the target sentence, when combined with the prompt, forms a coherent or incoherent text. The negative target can either be a sentence randomly selected from a different document or a sentence extracted from the same document as the prompt, in order to introduce incremental degrees of complexity in the resolution of the task;
• Sub-task 2 - Human score prediction: This sub-task was framed as a regression task where participants were asked to predict the average coherence score assigned by human raters to short paragraphs. These paragraphs were evaluated in their original or artificially modified version.</p>
      <p>As shown in previous tasks on the automatic assessment of subjective phenomena [15, 16], this scenario is expected to be more challenging, as it requires modeling the human perception of coherence, which can be influenced by both linguistic and non-linguistic factors, as highlighted in previous studies [7].</p>
      <p>For both sub-tasks, datasets were extracted from two corpora representative of two distinct domains, as described in the following section.</p>
      <sec id="sec-2-1">
        <title>3. Datasets</title>
        <p>The dataset1 utilized for the DisCoTEX task encompasses texts sourced from two distinct origins: the Italian Wikipedia and the Italian speech transcripts section of the Multilingual TEDx corpus (mTEDx). These sources represent two different language varieties: the former is a ‘standard’ written variety, and the latter a ‘hybrid’ variety combining diverse genres (e.g., university lectures, newspaper articles, conference presentations, and TV science programs) as well as different semiotic modes, such as written, spoken, audio, and video [17]. Extensive research on genre and register variation acknowledges that written and spoken language employ distinct strategies to establish coherence within a text [18]. Therefore, we decided to evaluate systems on both these types of data.</p>
        <p>For sub-task 1, each data sample consists of a prompt, which is a paragraph comprising three sentences, followed by a target sentence. To create the written dataset, we leveraged the existing paragraph segmentation in Wikipedia to select four-sentence paragraphs. For the spoken dataset, as mTEDx speeches lacked such internal structure, we divided all the transcripts into passages of four sentences. The target sentence is determined as the immediate continuation of the prompt, forming a coherent sample. In the case of a non-coherent passage, as previously anticipated, we selected either a sentence randomly taken from a different document or the sentence that appears ten sentences after the prompt in the same document. Each final dataset consists of 8,000 training samples and 800 test samples. Examples can be found in Table 1.</p>
        <p>Regarding sub-task 2, the dataset construction differs slightly. In this case, for each source we extracted samples consisting solely of four-sentence paragraphs (we keep the term ‘prompts’ to refer to them), with half of them deliberately made incoherent through sentence perturbations. The possible perturbations, chosen with equal probability, include:
• Flip of two random sentences: each sentence of the prompt has the same probability of being flipped;
• Swap of a sentence with the 10th sentence following it in the document from which the prompt was extracted. The first and the last sentence have double the swap probability compared to the middle two sentences, to make the swapping of a first/last or a middle sentence equiprobable.</p>
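        <p>A minimal sketch of the two perturbation strategies (our assumptions: a prompt is a list of four sentence strings, the source document is a list of sentences, and the doubled edge probability is realized via weighted sampling; function names are illustrative):</p>
        <preformat>
```python
import random

def flip_two_sentences(prompt, rng=random):
    # Pick two distinct positions uniformly at random and swap the sentences.
    i, j = rng.sample(range(len(prompt)), 2)
    perturbed = list(prompt)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed

def swap_with_tenth_next(prompt, document, start, rng=random):
    # 'start' is the index in 'document' where the prompt begins.
    # First and last prompt sentences get double the selection weight.
    positions = list(range(len(prompt)))
    weights = [2 if pos in (0, len(prompt) - 1) else 1 for pos in positions]
    k = rng.choices(positions, weights=weights, k=1)[0]
    perturbed = list(prompt)
    # Replace the chosen sentence with the 10th sentence following it.
    perturbed[k] = document[start + k + 10]
    return perturbed
```
        </preformat>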
        <p>For the purposes of the DisCoTEX task, we selected 1,064 prompts equally balanced between the two domains.</p>
        <sec id="sec-2-1-2">
          <p>1: The DisCoTEX dataset is available at the following link: https://github.com/davidecolla/DisCoTex/</p>
          <p>Table 1: Examples from the sub-task 1 dataset (Prompt, Target, Class).</p>
          <p>Prompt: “Il regolamento del carcere era durissimo e le condizioni igieniche drammatiche. Agli ebrei erano negati i pochi diritti concessi agli altri prigionieri politici e comuni, ovvero l’ora d’aria in cortile, l’assistenza sanitaria, la possibilità di ricevere lettere e pacchi e di acquistare generi alimentari allo spaccio del carcere. Gli interrogatori degli arrestati erano condotti in uno stanzone a pian terreno, detto il ‘refettorio’.” Target: “Qui le sevizie di ogni genere venivano inflitte soprattutto sugli ebrei che non rivelavano i recapiti o i nascondigli dei loro parenti, della cui presenza a Milano o nei dintorni le SS erano venute a conoscenza tramite loro spie.” Class: 1</p>
          <p>Prompt: “Ci siamo trovati a Brasilia, la capitale del Brasile; e c’erano città di tutto il mondo, dall’Australia al Giappone, all’Asia, all’Africa, agli Stati Uniti. E lì abbiamo avuto la consapevolezza che siamo un movimento che sta crescendo nel mondo e che sempre più costruisce risultati e vantaggi. Una delle più grandi città del mondo che ha fatto questa scelta è San Francisco.” Target: “Vedete in questo semplicissimo grafico, il rosso è tutto quello che prima, una decina di anni fa, andava a smaltimento.” Class: 0</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>Of these, 33% (i.e. 360 prompts) are extracted from the subset of authentic prompts and 66% (i.e. 704) from perturbed ones. Examples can be found in Table 2.</p>
        <p>As anticipated, coherence was assessed through
manual annotation. Specifically, to gather human ratings
of coherence, we conducted a crowdsourcing task on
the Prolific 2 platform, involving Italian native speakers.</p>
        <p>Recognizing that coherence is a subjective concept influenced by the reader or listener's interpretation, we employed a gradual judgment approach, and asked annotators to evaluate their perception of coherence on a Likert scale ranging from 1 to 5. The number of annotations per prompt ranged from 9 to 12, with an average of 11.75.</p>
        <p>The resulting dataset was split into training and test
samples with a proportion of 80% to 20%, respectively.</p>
        <p>In Figure 1 we show some general statistics about the collected judgments, considering both the whole dataset of prompts and prompts grouped into specific subsets, according to genre and perturbation. As it can be seen, prompts derived from Wikipedia texts are generally rated as more coherent by humans compared to TEDx prompts.</p>
        <p>Figure 1: Overview of human judgments collected for the dataset used in sub-task 2. The plot shows the overall mean of human judgments for the whole dataset (all) and for the respective subsets, including both coherent (*_no_pert) and perturbed prompts (*_pert).</p>
        <p>This observation confirms previous findings with regard to the influence of genre on the perception of coherence [7]. What is particularly interesting is that this disparity is evident not only in the original form of the prompts but also in the perturbed versions. This seems to suggest that Wikipedia documents tend to exhibit a more standardized structure, including internal coherence, which remains relatively stable even with minor alterations that affect sentence order or the insertion of an intruder sentence from the same document. We plan to conduct a more in-depth analysis by examining each perturbation strategy independently, to gain a deeper understanding of their individual impact on coherence.</p>
        <sec id="sec-2-2-1">
          <title>3.1. Format</title>
          <p>The DisCoTEX dataset was released as tab-separated text files. Specifically, for sub-task 1, the two data sources (i.e. Wikipedia and TED) were kept separate and, for each source, participants were provided with a file with the following structure:</p>
          <p>• ID: a numerical identifier for the entry;
• PROMPT: a textual passage made of three consecutive sentences;
• TARGET: the sentence which participants are asked to assess as coherent with the prompt or not (i.e. whether it is the next sentence after the prompt);</p>
        </sec>
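        <p>A minimal reader for this format (a sketch; it assumes no header row and the column order ID, PROMPT, TARGET, CLASS, with CLASS being the gold label):</p>
        <preformat>
```python
import csv
import io

# Illustrative sample row in the sub-task 1 format: ID, PROMPT, TARGET, CLASS.
sample = "1\tFirst sentence. Second one. Third one.\tA candidate target.\t1\n"

def read_subtask1(stream):
    # Each row: numeric id, 3-sentence prompt, candidate target, gold class (1/0).
    rows = []
    for rec in csv.reader(stream, delimiter="\t"):
        rows.append({"id": int(rec[0]), "prompt": rec[1],
                     "target": rec[2], "label": int(rec[3])})
    return rows

data = read_subtask1(io.StringIO(sample))
```
        </preformat>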
      </sec>
      <sec id="sec-2-3">
        <p>2: https://www.prolific.co/</p>
        <p>“Le nuove idee sono una sfida che accende nel nostro cervello la stessa area che elabora le minacce fisiche. Ecco perché tendiamo a reagire con forza, a volte con aggressività, alle nuove idee. Davanti a informazioni che mettano in discussione le nostre convinzioni noi tendiamo paradossalmente a reagire rafforzandole ancora di più. Si chiamano bias cognitivi, sono molto forti e ci caschiamo tutti.” Mean: 4.83</p>
        <p>“I Romani furono scommettitori appassionati, specialmente ai tempi dell’Impero Romano, e il gioco dei dadi era popolare, seppur proibito da una ‘Lex alearia’ del 204 a.C. circa, eccetto che durante i Saturnali. Orazio derise la gioventù dell’epoca che sprecava tempo tra i pericoli del gioco invece di domare il suo cavallo e darsi alle durezze dell’inseguimento. Una di queste diceva che nessuna causa poteva essere intentata da una persona che permetteva il gioco d’azzardo nella sua casa anche se era stata imbrogliata o assalita. Le scommesse sui dadi per denaro fu l’oggetto di molte leggi Romane.” Mean: 3.3 (st. dev. 0.95)</p>
        <p>• ID: a simple identifier for the entry;
• TEXT: the 4-sentence prompt to be evaluated;</p>
      </sec>
      <sec id="sec-2-4">
        <p>• CLASS: the class to be predicted (1 if the target follows the prompt, 0 otherwise).</p>
        <p>For sub-task 2, we mixed data from the two sources and released a single dataset with the following structure:</p>
        <p>• MEAN: the coherence score of the text to be predicted, based on the mean of the human judgements collected.</p>
        <p>Table 2: Examples extracted from the dataset of sub-task 2. The first one is an original prompt taken from the mTEDx corpus. The second one is a perturbed prompt from the Wikipedia corpus, with a swap between the third and the last sentence.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Participants</title>
      <p>We received a total of 3 submissions for sub-task 1 and 2 submissions for sub-task 2, from 3 different teams.</p>
      <p>In the context of DisCoTEX, for both sub-tasks participants could leverage further external resources to enhance their models, with the exception of Wikipedia and mTEDx data.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation measures</title>
      <sec id="sec-4-1">
        <p>We defined the following evaluation metrics for each sub-task:
• For sub-task 1, the evaluation metric is Accuracy (the ratio between correctly predicted samples and all processed samples) obtained by each system on the test set. We also report Precision, Recall and F-score for the two classes;
• For sub-task 2, the evaluation metric is the harmonic mean of the Pearson and Spearman correlation coefficients between the participants' scores and the test set scores.</p>
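        <p>The sub-task 2 score can be sketched in pure Python directly from the definitions (scipy.stats.pearsonr and spearmanr would serve equally well; the function names here are ours):</p>
        <preformat>
```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation from the definition.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Average ranks, so ties are handled as in Spearman's rho.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i != len(xs):
        j = i
        while j + 1 != len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def official_score(pred, gold):
    # Harmonic mean of Pearson and Spearman correlations.
    p = pearson(pred, gold)
    s = pearson(ranks(pred), ranks(gold))
    return 2 * p * s / (p + s)
```
        </preformat>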
        <p>Baseline. The baseline for both tasks has been computed by employing one-hot vector representations. For sub-task 1, we extracted the one-hot vector for each sentence s_i in the input prompt P = {s_1, s_2, ..., s_n}, as well as for the target sentence t. The distance between the prompt P and the target sentence t is computed as the average Hamming distance between each sentence of the prompt and the target: d(P, t) = (1/n) · Σ_{i=1}^{n} d_H(s_i, t). To decide whether the target sentence t is coherent with the paragraph P, we computed the median distance value across the whole training dataset and used it as a threshold: all the test samples with a distance value under the median have been considered coherent, incoherent otherwise. For sub-task 2, we first extracted the one-hot vectors from each sentence s_i in the input prompt P: v_1 ← s_1, v_2 ← s_2, ..., v_n ← s_n. Then we computed the proximity between each consecutive vector pair ⟨v_i, v_{i+1}⟩ through the Jaccard distance metric, thereby obtaining (n−1) distance scores that capture the degree of semantic overlap between neighbouring sentences. The coherence score for the paragraph P is then the average over adjacent pairs: c(P) = (1/(n−1)) · Σ_{i=1}^{n−1} d_J(v_i, v_{i+1}).</p>
        <p>Each submission had the option to include up to three different runs. The strategies used to approach the task are all very different from each other. Teams that participated in both sub-tasks opted to use the same strategy for both challenges. None of the systems chose to utilize additional resources apart from the official datasets. Further information regarding the task participation can be found in Table 3.</p>
        <p>Table 3: Participating teams, affiliations, sub-tasks and number of runs. MPG (Sony Computer Science Laboratories Paris, France; Enrico Fermi Research Center (CREF), Rome, Italy; Sapienza University of Rome, Italy): 3 runs, sub-task 1 only. IUSSNets (IUSS Pavia, Italy): 3 runs, both sub-tasks. ExtremITA (Università degli Studi di Roma Tor Vergata, Italy; Università di Torino, Italy): 4 runs, both sub-tasks.</p>
        <p>Models per team in the leaderboards: extremita - LLaMA; IUSSNets - BERT; mpg - LGBM; baseline - Hamming.</p>
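        <p>A minimal sketch of this baseline (tokenization and vocabulary handling are not specified here, so whitespace tokens over a fixed vocabulary are our assumption):</p>
        <preformat>
```python
def one_hot(sentence, vocab):
    # Binary bag-of-words vector over a fixed vocabulary.
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

def hamming(u, v):
    # Fraction of vocabulary positions on which the two vectors disagree.
    return sum(1 for a, b in zip(u, v) if a != b) / len(u)

def jaccard_distance(u, v):
    # 1 minus intersection-over-union of the active positions.
    inter = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return 1 - inter / union if union else 0.0

def d_prompt_target(prompt_sents, target, vocab):
    # Sub-task 1: average Hamming distance between prompt sentences and target.
    t = one_hot(target, vocab)
    return sum(hamming(one_hot(s, vocab), t) for s in prompt_sents) / len(prompt_sents)

def coherence_score(prompt_sents, vocab):
    # Sub-task 2: average Jaccard distance over adjacent sentence pairs.
    vs = [one_hot(s, vocab) for s in prompt_sents]
    pairs = list(zip(vs, vs[1:]))
    return sum(jaccard_distance(u, v) for u, v in pairs) / len(pairs)
```
        </preformat>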
      </sec>
      <sec id="sec-4-2">
        <p>The MPG team [19] utilized the tree-based classifier LightGBM, incorporating a set of explicitly engineered features aimed at comparing the prompt and the target with respect to several metrics, such as TF-IDF vectors; counts of uppercase words, tenses, punctuation marks, words, and characters; as well as sentence embeddings extracted from Sentence-BERT [20]. They exclusively participated in sub-task 1 with two runs.</p>
        <p>The IUSSNets team [21] employed fine-tuning techniques on four distinct Italian language models: BERT-ita [22], Electra-ita [22], Umberto [23], and Bertino [24], separately for each sub-task. For sub-task 1, they submitted three BERT fine-tuned models: the first fine-tuned on Wikipedia (BERT 1), the second on mTEDx (BERT 2), and the third on both (BERT 3), achieving the second-place score. For sub-task 2, they submitted BERT, Bertino, and Electra fine-tuned models, once again securing the second position, primarily due to the performance of the Electra model.</p>
      </sec>
      <sec id="sec-4-3">
      <p>The ExtremITA team [25] competed using two multi-task Language Models. The first model (extremIT5) is an encoder-decoder based on iT5-small [26], while the second model (extremITLLaMA) is a decoder based on Camoscio [27], the Italian version of LLaMA [28]. These models largely differ in number of parameters: iT5-small has approximately 110 million parameters, while the used version of Camoscio has 7 billion parameters. Both models underwent joint fine-tuning on all EVALITA 2023 tasks and sub-tasks, leveraging prompting techniques. For both DisCoTEX sub-tasks, the extremIT5 model received each instance of the dataset preceded by the task and sub-task name, and it produced the predicted label or score as output. Conversely, the extremITLLaMA model, which requires a structured prompt, was provided with a textual description of the task and the desired output format specification. For sub-task 1 the prompt is: “Le due frasi precedenti, separate da ‘[SEP]’, sono coerenti tra loro? Rispondi sì o no” (“Are the two preceding sentences, separated by ‘[SEP]’, coherent with each other? Answer yes or no”); for sub-task 2 the prompt is: “Quanto è coerente questa frase in una scala da 0 a 5?” (“How coherent is this sentence on a scale from 0 to 5?”). The team emerged as the winner across both DisCoTEX sub-tasks and datasets, thanks to the LLaMA-based model. However, the iT5-based model performed considerably worse, especially in the second sub-task, where it remained below the baseline.</p>
      <sec id="sec-4-3-1">
        <title>6. Results</title>
        <p>Tables 4 and 5 report the leaderboards of the systems taking part in sub-task 1 and sub-task 2, respectively. Note that, for the purpose of the official ranking, for sub-task 1 we considered the accuracy of the best run, and we further computed the mean between the best result/run on Wikipedia and the best result/run on mTEDx data. Conversely, for sub-task 2 we first computed both the Pearson and Spearman correlations, then we applied the harmonic mean between the two measures.</p>
        <p>As can be seen, all systems outperform the baseline in both sub-tasks. The best performance was achieved by the team ExtremITA with the system based on the LLaMA model.</p>
      </sec>
      <sec id="sec-4-3-2">
        <title>7. Conclusion</title>
        <p>[...] coherence within text. The second one intended to model the human perception of text coherence by predicting the average score attributed by human raters to a text. A novel dataset was developed for this task, comprising texts from two different domains, representative of a written and a spoken language variety, in order to investigate the role of modality in the automatic modeling of coherence. Three teams participated in the task and submitted a total of 19 runs. Notably, the ExtremITA team secured the first position in both sub-tasks with their system based on the largest decoder model proposed. However, it is worth highlighting that smaller models with fewer parameters also demonstrated comparable performance, indicating their effectiveness in capturing discourse-related information. Quite surprisingly, the results of sub-task 2 revealed that systems were more proficient in predicting coherence scores for TEDx talks compared to Wikipedia texts, which calls for further investigation, also by expanding the current dataset of human-evaluated texts. Future plans involve extending the DisCoTEX task to a multilingual perspective, enabling the exploration of coherence modeling across different languages, using reproducible data collection processes in languages with available Wiki and TED resources.</p>
      </sec>
      <sec id="sec-4-3-3">
        <title>Acknowledgements</title>
        <p>The authors gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.</p>
      </sec>
      <sec id="sec-4-3-4">
        <title>References</title>
        <p>[1] T. A. Van Dijk, W. Kintsch, Strategies of discourse comprehension, Academic Press, New York, 1983.
[2] B. J. Grosz, A. K. Joshi, S. Weinstein, Centering: A framework for modeling the local coherence of discourse, Computational Linguistics 21 (1995) 203–225. URL: https://aclanthology.org/J95-2003.
[3] R. Barzilay, M. Lapata, Modeling local coherence: An entity-based approach, Computational Linguistics 34 (2008) 1–34. doi:10.1162/coli.2008.34.1.1.
[4] M. Elsner, E. Charniak, Disentangling chat with local coherence models, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 1179–1189. URL: https://aclanthology.org/P11-1118.
[5] D. Tien Nguyen, S. Joty, A neural local coherence model, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 1320–1330. doi:10.18653/v1/P17-1121.
[6] J. Li, D. Jurafsky, Neural net models of open-domain discourse coherence, ArXiv abs/1606.01545 (2017).
[7] A. Lai, J. R. Tetreault, Discourse coherence in the wild: A dataset, evaluation and methods, CoRR abs/1805.04993 (2018). URL: http://arxiv.org/abs/1805.04993.
[8] M. Mesgar, M. Strube, A neural local coherence model for text quality assessment, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4328–4339.
[9] B. Elvevåg, P. W. Foltz, D. R. Weinberger, T. E. Goldberg, Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia, Schizophrenia Research 93 (2007) 304–316.
[10] D. Iter, J. Yoon, D. Jurafsky, Automatic detection of incoherent speech for diagnosing schizophrenia, in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
[11] A. Shen, M. Mistica, B. Salehi, H. Li, T. Baldwin, J. Qi, Evaluating document coherence modeling, Transactions of the Association for Computational Linguistics 9 (2021) 621–640. doi:10.1162/tacl_a_00388.
[12] M. Chen, Z. Chu, K. Gimpel, Evaluation benchmarks and learning criteria for discourse-aware sentence representations, arXiv preprint arXiv:1909.00142 (2019).
[13] Y. Farag, J. Valvoda, H. Yannakoudakis, T. Briscoe, Analyzing neural discourse coherence models, in: Proceedings of the First Workshop on Computational Approaches to Discourse, Online, 2020, pp. 102–112. doi:10.18653/v1/2020.codi-1.11.
[14] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[15] D. Brunato, C. Chesi, F. Dell'Orletta, S. Montemagni, G. Venturi, R. Zamparelli, AcCompl-it @ EVALITA2020: Overview of the Acceptability and Complexity Evaluation Task for Italian, EVALITA Evaluation of NLP and Speech Tools for Italian, December 17th, 2020 (2020).
[16] L. Gregori, M. Montefinese, D. P. Radicioni, A. A. Ravelli, R. Varvara, CONcreTEXT @ EVALITA2020: The Concreteness in Context Task, EVALITA Evaluation of NLP and Speech Tools for Italian, December 17th, 2020 (2020).
[17] G. Caliendo, The popularisation of science in web-based genres, The language of popularisation: Theoretical and descriptive models 3 (2012) 101–132.
[18] D. Biber, S. Conrad, R. Reppen, Corpus linguistics: investigating language structure and use, Cambridge University Press, Cambridge, 1998.
[19] M. Galletti, P. Gravino, G. Prevedello, MPG at DisCoTex: Predicting text coherence by tree-based modelling of linguistic features, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[20] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[21] E. Zanoli, M. Barbini, C. Chesi, IussNets at DisCoTex: A fine-tuned approach to coherence, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[22] S. Schweter, Italian BERT and ELECTRA models, 2020. doi:10.5281/zenodo.4263142.
[23] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
[24] M. Muffo, E. Bertino, BERTino: an Italian DistilBERT model, Computational Linguistics CLiC-it 2020 (2020) 317.
[25] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[26] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.
[27] A. Santilli, Camoscio: An Italian instruction-tuned LLaMA, https://github.com/teelinsan/camoscio, 2023.
[28] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.</p>
      </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>