1. Introduction

ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme

Claudiu D. Hromei

Danilo Croce

Valerio Basile

Roberto Basili

0 0 Università degli Studi di Roma Tor Vergata , Italy 1 Università di Torino , Italy

This paper explores the potential application of a monolithic neural model for all tasks in EVALITA 2023. We evaluated two models: extremIT5, an encoder-decoder model, and extremITLLaMA an instruction-tuned Decoder-only Large Language Model, specifically designed for handling Italian instructions. Our approach revolves around representing tasks in natural language, where we provide instructions to the model using prompts that define the expected responses. Remarkably, our best-performing model achieved first place in 41% of the subtasks and showcased top-three performance in 64%. These subtasks encompass various semantic dimensions, including Afect Detection, Authorship Analysis, Computational Ethics, Named Entity Recognition, Information Extraction, and Discourse Coherence.

1. Introduction

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License (64%). The adopted LLMs (especially LLaMA-based) proCPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) posed solution strongly supports the viability and high performance of a single monolithic architecture, as it scenarios. Recently, this model was applied to hundreds only requires modeling the tasks in natural language us- of tasks in [24], while in [4] a systematic pre-training at ing prompts. This approach has been further reinforced large scale demonstrates the efectiveness within “zeroby recent work [24], which indicates the same direction. shot" or “few-shot" learning scenarios. In this paper, the

In the rest of the paper, Section 2 describes the adopted ifrst approach we adopted is based on T5, pre-trained on LLMs. Section 3 provides the results, accompanied by a Italian texts, namely IT5 [3]. brief error analysis. Finally Section 4 derives the conclu- On the other hand, Decoder models are typically sions. trained to be triggered by text, such as a natural language request or a piece of text intended for processing.

These models generate text one word at a time, produc2. Multi-task prompting in ing an output that can be an answer to a question or a ExtremITA solution to the given tasks or requests. Such models have the ability to essentially follow instructions, as exempliThe Transformer architecture [25] can be divided into ifed by the recent release of ChatGPT. This characteristic two main components, each giving rise to distinct fami- holds a greater appeal, as tasks can be linguistically delies of models. The encoder, exemplified by BERT [ 26], scribed using prompts, where the input sentence serves RoBERTa [27], and DeBERTa [28], is responsible for en- as contextual information. InstructGPT [30] is an extencoding input sequences and generating meaningful rep- sion of the GPT [6] language model explicitly designed to resentations (embeddings) using the self-attention mech- excel in multi-task scenarios when used with prompts. It anism. On the other hand, the decoder, represented by combines the power of language models with the ability models like GPT [5], GPT3 [6], and LLaMA [7], generates to follow instructions provided in the form of natural output sequences in an auto-regressive manner based language prompts. Unlike conventional language modon the input and previously generated output tokens. els that generate text freely, InstructGPT is fine-tuned Additionally, another family of models, the Encoder- using human feedback to understand and generate text Decoder models, such as T5 [1] and BART [29], com- based on a given prompt and to select the best sequence bine the strengths of both encoder and decoder com- that humans would have preferred. Another language ponents. These models maintain the integration of the model that adopts this instruction-tuning technique is two aforementioned blocks and they are usually used Alpaca [31], which builds upon the LLaMA [7] foundain tasks like machine translation, summarization, and tional models. In the case of Alpaca, the authors created question-answering, where complex input understand- 175 sets of instructions, input sentences, and corresponding as transduction is required. ing outputs. These were then used to generate variations

A first efective application of an Encoder-Decoder using GPT 3.5, resulting in a collection of approximately architecture in a multi-task scenario is presented in [1]: 52, 000 instruction examples. The LLaMA model was furin particular, the pre-training process of the so-called ther fine-tuned using this extensive dataset, a process reT5 involves training the model on a large corpus of di- ferred to as instruction-tuning. The outcome of this efort verse text data, which consists of a wide range of sources was the Stanford Alpaca [31] as an instruction-following such as books, articles, and websites, but also texts in- LLaMA model. More recently, an Italian counterpart volved in machine translation, classification and regres- called Camoscio [32] has overgone a similar intructionsion tasks. During pre-training, T5 utilizes a denoising tuning to Alpaca but on Italian data, essentially serving objective, similar to other popular Transformer-based as the Italian equivalent. It is based on the same LLaMA models like BERT and GPT. The model is trained to re- model and it was instruction-tuned on the 52.000 inconstruct masked or corrupted input text, which helps it structions that were automatically translated into Italian learn meaningful representations and capture contextual using ChatGPT as in [32]. As the size of these models information. One of the key strengths of T5 is its ver- continues to grow, reaching trillions of parameters, there satility. By casting various NLP tasks into a text-to-text is a need for a way to fine-tune them efectively using format, it can be fine-tuned on a specific task simply by modest GPU resources. The technique adopted in this providing a prefix that serves as a description of the task paper is called Low-Rank Adaptation (LoRA [33]). LoRA and appropriate input-output pairs during fine-tuning. involves freezing the weights of the pre-trained model In practice, such an architecture can be triggered by con- and introducing trainable rank decomposition matrices catenating the name of the task it is trained on with an into each layer of the Transformer architecture. This input text, and it generates in output the expected solu- approach significantly reduces the number of trainable tion to the task, e.g., a class label in a classification task parameters for downstream tasks while avoiding addior a text span that answers to a question. This flexibil- tional inference latency. ity eliminates the need for task-specific architectures or To summarize, the ExtremITA approach for the modifications, making it easier to apply T5 to diferent EVALITA challenge focuses on eficiently modeling all available tasks using a single monolithic architecture, architectures. based on two independently tested models:

Prompt Engineering in ExtremITA. The approach em• extremIT5, An Encoder-Decoder model, based ployed in this study draws inspiration from the original on IT51, consisting of approximately 110 million T5 and IT5 methodologies. Similar to those approaches, parameters. This model is trained by concatenat- each training example is converted into a text-to-text ing the name of the task and the input sentence/- format. paragraph in the input texts, each representing The model called extremIT5 is trained as a generic an example from a generic EVALITA task. Its pur- T5 model. In input, each example for an individual task is pose is to generate a piece of text that solves the given to the neural architecture as concatenated after the target task. task name. As an example, in the ACTI A task [17, 19] the • extremITLLaMA, an instruction-tuned Decoder- input is just “ACTI: Hanno votato tutti obbligo vaccinale, only model, built upon the LLaMA foundational green pass, persecuzioni varie". models2, with a total of 7 billion parameters. The The output depends on the specific task. For a cominitial model was trained using the LoRA tech- prehensive compilation of outputs for the ExtremITA nique on Italian translations3 of Alpaca instruc- models, please refer to Table 1. In classification tasks tion data. This training enables the model to com- involving only one label (such as EMit B, LangLearn, prehend instructions in Italian. After training the HaspeeDe 3, HODI A, MULTI-Fake-DetectiVE, ACTI A, adapters, they are merged into the original model ACTI B, WiC-ITA and DisCoTEX 1) the output is just the to create an instruction-based model (using the label of the target class. In the above example, the output “merge” procedure from [33]). Finally, this model would be “Cospirazione” as the input text reflects some is further fine-tuned using LoRA on instructions conspiracy theory. In some tasks, such as PoliticIT [11], that reflect the EVALITA task. For each exam- where a text is expected to be associated with the gender ple from EVALITA, an input text is paired with and the political inclination of the author, multiple labels a manually crafted question that simulates an in- reflecting such diferent dimensions are used, e.g., “ uomo struction to be solved, accurately representing sinistra centro-sinistra”. In EMit A [9] where multiple the specific task. emotions can be triggered, these are provided as a sequence of labels. In regression tasks, such as EmotivITA [10] and DisCoTEX 2 [23], the output is the number to be

The next section describes how the 22 subtasks in

EVALITA are encoded as prompts to fine-tune the above

1https://huggingface.co/it5/it5-eficient-small-el32 2https://huggingface.co/decapoda-research/llama-7b-hf 3https://github.com/teelinsan/camoscio/tree/main/data

Task Name EMit A EMit B EmotivITA

Natural language instruction “Quali emozioni sono espresse in questo testo? Puoi scegliere una o più emozioni tra ’rabbia’, ’anticipazione’, ’disgusto’, ’paura’, ’gioia’, ’amore’, ’tristezza’, ’sorpresa’, ’fiducia’, o ’neutro’." “Di cosa parla il testo, tra ’direzione’, ’argomento’, ’entrambi’, ’non specificato’?" “Scrivi quanta valenza è espressa in questo testo su una scala da 1 a 5, seguito da quanto stimolo è espresso in questo testo su una scala da 1 a 5, seguito da quanto controllo è espresso in questo testo su una scala da 1 a 5." “Scrivi se l’autore del testo è ’uomo’ o ’donna’, seguito dalla sua appartenenza politica tra ’destra’, ’sinistra’, ’centrodestra’, ’centrosinistra’." “Scrivi la regione di appartenenza di chi ha scritto questo testo, seguito dalla latitudine, seguita dalla longitudine." “Questi due testi separati da [SEP] sono presentati nell’ordine in cui sono stati scritti? Rispondi sì o no." “In questo testo si esprime odio? Rispondi sì o no." “In questo testo si esprime odio omotransfobico? Rispondi sì o no." “Con quali parole l’autore del testo precedente esprime odio omotransfobico? Separa le sequenze di parole con [gap]." “L’evento riportato nel testo è ’certamente vero’, ’probabilmente vero’, ’probabilmente falso’, o ’certamente falso’?" “In questo testo si parla di una cospirazione? Rispondi sì o no." “Di quale teoria cospirazionista parla questo testo, tra ’Covid’, ’Qanon’, ’Terrapiattista’, ’Russia’?" “Scrivi le menzioni di entità nel testo, indicandone il tipo: [PER] (persona), [LOC] (luogo), [ORG] (organizzazione)." “Trova i risultati dei test e delle misurazioni nel testo. Per ogni risultato, scrivi ’[BREL]’, seguito dal risultato seguito da ’[SEP]’, seguito dal test, seguito da ’[EREL]’. Se non trovi nessun risultato, scrivi ’[NOREL]’." “La parola compresa tra [TGTS] e [TGTE] ha lo stesso significato in entrambe le frasi? Rispondi sì o no." “Le due frasi precedenti, separate da ’[SEP]’, sono coerenti tra loro? Rispondi sì o no." “Quanto è coerente questa frase, su una scala da 0 a 5?" predicted within a specific range. In GeoLingIt [12], the [SEP] PSA [EREL]" (where 2 and 62 reflect the RML while models are requested to determine the region of origin PSA is the test event). of the tweet and the corresponding coordinates (latitude In contrast, as extremITLLaMA is pre-trained to exeand longitude) based solely on the text. For instance, cute instructions, it leverages a structured prompt, which for the extremIT5 model the task name (“GeoLingIt") comprises the textual description of the task and the is provided, while for extremITLLaMA a more in detail specification of the desired output format. For instance, prompt is given: “Scrivi la regione di appartenenza di when applied to the ACTI task, the instruction provided chi ha scritto questo testo, seguito dalla latitudine, seguita is “In questo testo si parla di una cospirazione? Rispondi dalla longitudine.". For example, if the input sentence sì o no.". The subsequent sentence to be evaluated is apis “Daje che je ’a famo!", the model should provide the pended to this instruction. A comprehensive list of such answer “Lazio 41.8984164 12.54514535", considering the instructions can be found in Table 2. use of the typical Roman dialect. This particular task The decoder is thus expected to continue the sentence combines both multi-label classification and regression, by generating the answer. In general, the same answers as it requires determining the region (classification) and used in extremIT5 are adopted. The only exception providing the precise coordinates (regression) simultane- concerns the following binary classification tasks (Lanously. In HODI B [15] where the span of the ofending gLearn, HaSpeeDe 3, HODI A, ACTI A, WiC-ITA and text is expected to be extracted, it is simply provided as DisCOTEX 1) where the instruction is only expected to output. In NERMuD [20], the list of expected Named answer yes or no, to reduce data sparseness. Entities is reported as a sequence of text spans, each associated with the corresponding entity type. CLinkaRT [21] focuses on extracting the names of medical tests 3. Experimental Results performed on patients from an input text and linking them to the corresponding test results, treating it as a Experimental Setup. Models were trained using PyRelation Extraction problem. Here the relations are en- Torch, the Huggingface library and the Peft packages coded with a slightly more complex form to summarize a to implement the LoRA technique. Both models were list of relations, each associating an Event with a corre- trained on the unified dataset of all the tasks of EVALITA. sponding measure (or RML); as an example, the sentence Generally, one example in an EVALITA task corresponds to an example in our learning setting. Below are some ex“CLinkaRT : Il PSA aumentava da 2 a 62 ng/ml." is associ- ceptions. The dataset for the ACTI task was expanded by ated with “[BREL] 2 [SEP] PSA [EREL] [BREL] 62 ng/ml incorporating some4 sentences from dataset B and vice ployed a batch size of 64 for extremIT5 and 32 for versa, resulting in an increase in the number of examples extremITLLaMA. To optimize the models’ performance, from 460 to 1, 909 for ACTI A and from 300 to 777 for a linear scheduler with warmup was applied, utilizing ACTI B. a warmup ratio of 0.1. The extremITLLaMA model’s

Since in CLinkaRT only (long) documents were made training process utilized LoRA to refine the , , available, these medical reports were segmented into and modules of the transformer (for more details smaller parts with a minimum of 50 characters and a max- please refer to the original paper [33]), incorporating a imum of 30 words using the Spacy library, respecting sen- matrix rank = 8 and a parameter = 16 for the tence boundaries. Moreover, we augmented this dataset LoRA matrices. The decoding strategy in the generation with examples derived from the dataset made available phase used a beam search equal to 4, temperature of 0.2, in TESTLINK@Ibe rLEF 2023 5 that contains medical re- with a top probability of 0.75 amongst the first 40 candiports in Spanish: although of a diferent language, these dates. Two Tesla T4 GPUs with 16GB of memory each texts contain similar phenomena about events and mea- were used in parallel. This was particularly beneficial sures that are generally language invariant and were for the extremITLLaMA model, as its training duration useful to augment the dataset. This process significantly exceeded 144 hours. The training data was divided into augmented the dataset, expanding it from 83 large docu- a 95% training set and a 5% validation set initially for ments to 3, 903 shorter sentences. In general, this process hyper-parameter optimization. We release the source recovered more than 95% of annotated relations. In the code on GitHub7 for reproducing the experiment and case of EMit, the dataset underwent a transformation dataset generation. where emoji representations were converted into textual descriptions, enhancing its compatibility with language models. GeoLingIt was modified to solve task A and task B simultaneously, enabling a single prediction for both tasks. For HODI B, only sentences expressing homotransphobia were considered, resulting in a reduction from 5, 000 to 1, 914 examples. The dataset of the LangLearn task was truncated into sentences with a maximum of 100 tokens, and additional examples with inverted sentence pairs (by flipping the label from positive to negative and vice versa), augmenting the dataset from 3, 377 to 6, 438 examples. In MULTI-Fake-DetectiVE we neglected images, and duplicate examples were removed (i.e., same text and diferent image), leading to a decrease from 1, 058 to 860 examples. NERMuD was transformed into a sequence-to-sequence task from its original token classification format. In PoliticIT, each text was divided into sentences with a maximum length of 200 tokens, enabling more manageable input for language models. At classification time, a voting strategy was applied to select the final class about gender and political ideas, grouping all sentences written by the same author. Lastly, the WiC-ITA dataset was expanded by including examples6 with inverted sentence pairs while preserving the same label, resulting in an increase from 5, 610 to 6, 600 examples. Overall, the entire dataset is composed of a total of 134, 018 examples.

The extremIT5 model underwent 10 epochs of training with a learning rate of 2 · 10− 5, while the extremITLLaMA model underwent 2 epochs of training with a learning rate of 3 · 10− 4. The models emResults Discussion. The experimental results are reported in Table 3. We presented the tasks categorized by sub-task, followed by the Evaluation Metric, and the scores and ranks achieved by our extremIT5 model, extremITLLaMA model, and the best competitor. The best-performing method for each subtask is highlighted in bold. Our systems, particularly extremITLLaMA, ranked first in 9 out of 22 subtasks (i.e., the 41% of subtasks) in EVALITA 2023 . Additionally, it ranks in the top-three position in 14 subtasks, i.e., 64% of all tasks.

However, we faced challenges in tasks such as GeoLingIt, LangLearn, and WiC-Ita, where our monolithic architectures demonstrated its limitations. These tasks specifically require a system to detect and analyze changes in the author’s writing style or the contextual meaning of words. Our models are primarily designed for sentence classification or rewriting spans of input text to justify previous decisions (e.g., HODI).

There are also important considerations regarding the computational cost of both training and inference. Training extremIT5 (made of “only” 110 million parameters) required approximately 12 hours on the entire EVALITA dataset, while extremITLLaMA (made of 7 billion parameters) took over 144 hours. In terms of inference, extremITLLaMA processes only 2 or 3 sentences per second, whereas extremIT5 handles over almost one hundred sentences per second. This significant diference in processing speed makes the extremITLLaMA model less practical, despite its superior performance across a wide range of tasks. Additionally, the number of parameters between the two models difers by one order of magnitude, with extremITLLaMA having 7 billion parameters compared to extremIT5’s 110 million.

Overall, the above results are quite impressive, espe

4Only the positive examples, i.e. the ones that involved any conspir

acy theory, are added from the dataset A to B or viceversa. 5https://e3c.fbk.eu/testlinkiberlef 6Only the positive examples underwent sentence order flipping in order to rebalance the class distribution.

7https://github.com/crux82/ExtremITA

Emit EmotivITA

Error Analysis. Since our team participated in all the tasks, it would be unfeasible to provide a deeper analysis of each individual result in this report. However, in order to gain some insight into the inner working of the two models we employed, here we present some error analysis carried out on two tasks. We selected a task where our systems ranked very high, and one where they ranked cially when considering that no task-specific architec- very low. In the EmIt task A, extremITLLaMA ranked tural designs were applied. Instead, a single LLM was ifrst in the oficial ranking, and extremIT5 was second. utilized, demonstrating competitive performance across The task is a multi-label classification problem, where almost all tasks. The key to achieving such results seems the labels are eight emotions defined by Plutchik [ 34] to lie in properly prompting the model with natural lan- plus “love" and a label for neutral texts. Table 4 reports guage requests or employing task-specific encoding tech- the performance of the two ExtremITA systems broken niques for the outputs. We can expect higher results to down by labels. It is interesting to notice that the adbe achieved using larger LLMs such as LLaMA 65B. To vantage shown by extremITLLaMA on the aggregated conduct a more comprehensive evaluation and optimiza- result comes from a skewed distribution over the labels. tion, it would have been beneficial to explore a broader In particular, extremIT5 is hardly capable of modelrange of architectures and thoroughly investigate all the ing Fear, which is also the least represented label in the hyper-parameters of the models. The estimation of these test set. An inverse correlation between the number of parameters was done hastily due to the time constraints positive instances in the test set and the gain in perforimposed by the EVALITA deadlines and the extensive mance of extremITLLaMA with respect to extremIT5 commitment required for the parallel completion of all is indeed present. This indicates that extremITLLaMA 13 tasks. is better suited than extremIT5 for the classification of sparser phenomena. Moreover, extremITLLaMA shows superior capability in modeling and correctly predicting every emotion, besides “Trust", where extremIT5 results in a better performance.

In the LangLearn task, our systems ranked quite low, respectively 8th place for extremITLLaMA and 10th place for extremIT5. LangLearn is a text pair classiLabel Anger Anticipation Disgust Fear Joy Love Neutral Sadness Surprise Trust 56 85 165 13 100 103 210 95 102 272

4. Conclusions In a recent position paper with a provocative title, Basile

[35] asks himself “is EVALITA done?”, referring to the mounting trend of LLMs and zero-shot approaches in NLP and their impact on the evaluation campaign. Judging by the results presented in this report, the answer is still the same as the original paper, i.e., no. The variety and challenge ofered by the tasks of EVALITA continue to represent a fundamental resource to understand and develop language resources and tools for the Italian language, as shown, for instance, by the variability of the ranking obtained by our transformer-based modFigure 1: Accuracy of our systems on the LangLearn test els. However, the raw performance of extremIT5 and set, with texts removed that are longer than an increasing extremITLLaMA, with minimal adaptations and tuning, threshold (horizontal axis). is undoubtedly pushing the limits of some tasks, especially text classification tasks with roots in text semantics.

In any case, these results once again confirm the huge ifcation task where the most informative features are potential of LLMs and their applicability in real-world expected to be stylistic, rather than semantic, to capture scenarios. It is important to note that this experiment, the development in language learning of the author of while not conclusive, used the smallest available models the texts. With this premise, we were anticipating a sub- due to their size limitations. Additionally, it would be par performance by our transformer-based models from worthwhile, from a sustainability standpoint, to explore the beginning. However, another relevant characteristic the results that can be achieved by significantly reducing of this task is the length of the texts. For computational the amount of annotated data available through zero or reasons, we had to cut the texts to 100 tokens or less, few-shot learning approaches. therefore leaving out a significant portion of the data — we retained exactly 24.6% of the tokens from the two training sets combined. We checked the impact of the text Acknowledgments size on the accuracy of the prediction, under the hypothesis that longer texts in the test set (which were cut by We would like to thank the “Istituto di Analisi dei Sistemi our systems to a greater extent) are penalized. The plot ed Informatica - Antonio Ruberti” (IASI) for supporting in Figure 1 shows the accuracy of our systems against the experimentations. Claudiu Daniel Hromei is a Ph.D. portions of the test set where the texts were filtered by student enrolled in the National Ph.D. in Artificial Intelsize. The number on the horizontal axis is a threshold on ligence, XXXVII cycle, course on Health and life sciences, the minimum size in terms of characters of the two texts organized by the Università Campus Bio-Medico di Roma. forming an instance of the test set. Indeed, the downward trend indicates that the predictions of our systems are more accurate on shorter pairs of texts, while more and more errors are made by both systems on longer texts. ings of the Eighth Evaluation Campaign of Natural eit, L. Jones, A. N. Gomez, L. Kaiser, I. PoloLanguage Processing and Speech Tools for Italian. sukhin, Attention is all you need, Co RR Final Workshop (EVALITA 2023 ), CEUR.org, Parma, abs/1706.03762 (2017). URL: http://a rxiv.org/abs/ Italy, 2023 . 1706.03762. arXiv:1706.03762. [17] P. Russo, N. Stoehr, M. Horta Ribeiro, Subtask [26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: a- conspirato rial content classification, 2023 . URL: pre-training of deep bidirectional transformers for https://kaggle.com/competitions/acti-subtask-a. language understanding, in: J. Burstein, C. Doran, [18] P. Russo, N. Stoehr, M. Horta Ribeiro, Subtask b - T. Solorio (Eds.), Proceedings of the NAACL 2019, conspi racy category classification, 2023 . URL: https: 2019, pp. 4171–4186. URL: https://doi.org/10.18653/ //kaggle.com/competitions/acti-subtask-b. v1/n19-1423. doi:10.18653/v1/n19-1423. [19] G. Russo, L. Verginer, M. H. Ribeiro, G. Casiraghi, [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, Spillover of antisocial behavior from fringe plat- O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, forms: The unintended consequences of commu- Roberta: A robustly optimized BERT pretraining nity banning, in: Proceedings of the International approach, CoRR abs/1907.11692 (2019). URL: http: AAAI Conference on Web and Social Media, vol- //arxiv.org/abs/1907.11692. arXiv:1907.11692. ume 17, 2023, pp. 742–753. [28] P. He, X. Liu, J. Gao, W. Chen, Deberta: decoding[20] A. Palmero Aprosio, T. Paccosi, Nermud at evalita enhanced bert with disentangled attention, in: 9th 2023: Overview of the named-entities recognition International Conference on Learning Representaon multi-domain documents task, in: Pro ceedings tions, ICLR 2021 , Virtual Event, Austria, May 3-7, of the Eighth Evaluation Campaign of Natural Lan- 2021, 2021 . URL: https://openreview.net/forum?id= guage Processing and Speech Tools for Italian. Final XPZIaotutsD.

Wo rkshop (EVALITA 2023 ), CEUR.org, Parma, Italy, [29] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo2023 . hamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: [21] B. Altuna, G. Karunakaran, A. Lavelli, B. Magnini, denoising sequence-to-sequence pre-training for M. Speranza, R. Zanoli, CLinkaRT at EVALITA natural language gene ration, translation, and com2023 : Overview of the Task on Linking a Lab Re- prehension, CoRR abs/1910.13461 (2019). URL: http: sult to its Test Event in the Clinical Domain , in: //arxiv.org/abs/1910.13461. arXiv:1910.13461. Proceedings of the Eighth Evaluation Campaign of [30] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. WainNatural Language Processing and Speech Tools for wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, Italian. Final Wo rkshop (EVALITA 2023 ), CEUR.org, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, Parma, Italy, 2023 . M. Simens, A. Askell, P. Welinder, P. Christiano, [22] P. Cassotti, L. Siciliani, L. Passaro, M. Gatto, J. Leike, R. Lowe, Training language models to P. Basil e, Wic-ita at evalita2023 : Overview of the follow instructions with human feedback, 2022. evalita2023 word-in-context for italian task, in: arXiv:2203.02155.

Proceedings of the Eighth Evaluation Campaign of [31] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, Natural Language Processing and Speech Tools for C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alItalian. Final Wo rkshop (EVALITA 2023 ), CEUR.org, paca: An instruction-following lla ma model, https: Parma, Italy, 2023 . //github.co m/tatsu-lab/stanford_alpaca, 2023 . [23] D. Brunato, D. Colla, F. Dell’Orletta, I. Dini, D. P. [32] A. Santilli, Camoscio: An italian instruction-tuned Radicioni, A. A. Ravelli, Discotex at evalita 2023 : llama, https://github.co m/teelinsan/camoscio, 2023 . Overview of the assessing discourse coherence in [33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, italian texts task, in: Proceedings of the Eighth Y. Li, S. Wang, W. Chen, Lora: Low-rank Evaluation Campaign of Natural Language Process- adaptation of large language models, CoRR ing and Speech Tools for Italian. Final Workshop abs/2106.09685 (2021). URL: https://a rxiv.org/abs/ (EVALITA 2023 ), CEUR.org, Par ma, Italy, 2023 . 2106.09685. arXiv:2106.09685. [24] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, [34] R. Plutchik, H. Kellerman, Theories of emotion 1 W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, (1980).

A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, [35] V. Basile, Is EVALITA done? on the impact of A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robin- prompting on the italian NLP evaluation campaign, son, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, Proceedings of the Sixth Workshop on Natural LanJ. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scal- guage for Artificial Intelligence (NL4AI 2022), voling instruction-finetuned language models, 2022 . ume 3287 of CEUR Workshop Proceedings, CEURarXiv:2210.11416. WS.org, 2022, pp. 127–140. URL: https://ceur-ws. [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkor- org/Vol-3287/paper13.pdf .

of the Eighth Evaluation Campaign of Natural Lan[1]

Rafel ,

Shazeer ,

Roberts ,

Lee , S. Narang, guage Processing and Speech Tools for Italian . Final

Matena ,

Zhou ,

Li ,

P. J.

Liu , Explor- Workshop (EVALITA 2023 ), CEUR.org, Parma, Italy,

ing the limits of transfer learning with a unified 2023.

text-to-text transformer , J. Mach. Learn. Res . 21 [9]

Araque ,

Frenda ,

Sprugnoli ,

Nozza ,

( 2020 ) 140 : 1 - 140 : 67 . URL: http://jmlr.org/papers/ V. Patti, EMit at EVALITA 2023: Overview of the

v21/ 20 - 074 .html. Categorical Emotion Detection in Italian Social Me [2]

Xue ,

Constant ,

Roberts ,

Kale , R. Al- dia Task , in: Proceedings of the Eighth Evalua-

Rfou , A.

Siddhant , A.

Barua , C.

Rafel, mt5: A mas- tion Campaign of Natural Language Processing and

former, in: K. Toutanova , A. Rumshisky , L. Zettle- 2023), CEUR.org, Parma, Italy, 2023 .

moyer , D.

Hakkani-Tür , I.

Beltagy , S. Bethard, [10] G.

Gafà , F.

Cutugno , M.

Venuti , Emotivita at

Cotterell ,

Chakraborty , Y. Zhou (Eds.), Pro- EVALITA2023: Overview of the dimensional and

ceedings of the NAACL-HLT 2021, Online, June multidimensional emotion analysis task , in: Pro-

6- 11 , 2021 , Association for Computational Linguis- ceedings of the Eighth Evaluation Campaign of Nat-

tics , 2021 , pp. 483 - 498 . URL: https://doi.org/10. ural Language Processing and Speech Tools for Ital-

18653 /v1/ 2021 .naacl-main. 41 . doi: 10 .18653/v1/ ian. Final Workshop (EVALITA 2023 ), CEUR.org,

2021.naacl-main. 41 . Parma , Italy, 2023 . [3]

Sarti , M. Nissim, IT5: large-scale text-to-text pre- [11]

Russo ,

S. M.

Jiménez-Zafra , J. A . García-

eration, CoRR abs/2203 .03759 ( 2022 ). URL: https:// López, R. Valencia-García, Overview of

doi.org/10.48550/arXiv.2203.03759. doi: 10 .48550/ PoliticIT2023@EVALITA: Political Ideology

arXiv. 2203 .03759. arXiv: 2203 .03759. Detection in Italian Texts, 2023 . [4]

H. W.

Chung ,

Hou ,

Longpre ,

Zoph ,

Tay , [12]

Ramponi , C. Casula, GeoLingIt at EVALITA 2023:

Chi , J.

Dean , J.

Devlin , A.

Roberts , D.

Zhou , Q. V. ( EVALITA 2023), CEUR .org, Parma, Italy, 2023 .

Le , J.

Wei , Scaling instruction-finetuned language [13] C.

Alzetta , D.

Brunato , F.

Delll'Orletta , A . Miaschi,

models, CoRR abs/2210 .11416 ( 2022 ). URL: https:// K. Sagae,

C. H.

Sánchez-Gutiérrez , G. Venturi, Lan-

doi.org/10.48550/arXiv.2210.11416. doi: 10 .48550/ glearn at evalita 2023: Overview of the language

arXiv. 2210 .11416. arXiv: 2210 .11416. learning development task , in: Proceedings of the [5]

Radford ,

Narasimhan ,

Salimans , I. Sutskever , Eighth Evaluation Campaign of Natural Language

erative pre-training ( 2018 ). shop (EVALITA 2023 ), CEUR.org, Parma, Italy, 2023 . [6] T. B. Brown , B.

Mann , N.

Ryder , M.

Subbiah , [14] M.

Lai , F.

Celli , A.

Ramponi , S.

Tonelli , C. Bosco,

Kaplan ,

Dhariwal ,

Neelakantan ,

Shyam ,

Patti , Haspeede3 at evalita 2023: Overview of the

Sigler ,

Litwin ,

Gray ,

Chess ,

Clark , Italian. Final Workshop (EVALITA 2023 ), CEUR.org,

Berner ,

McCandlish ,

Radford , I. Sutskever , Parma, Italy, 2023 .

Amodei , Language models are few-shot learners , [15]

Nozza ,

A. T.

Cignarella , G. Damo, T. Caselli,

CoRR abs/ 2005 .14165 ( 2020 ). URL: https://arxiv.org/ V. Patti, HODI at EVALITA 2023: Overview of

abs/ 2005 .14165. arXiv: 2005 . 14165. the Homotransphobia Detection in Italian Task , in: [7]

Touvron ,

Lavril ,

Izacard ,

Martinet , M. -A. Proceedings of the Eighth Evaluation Campaign of

bro , F.

Azhar , A.

Rodriguez , A.

Joulin , E. Grave, Italian. Final Workshop (EVALITA 2023 ), CEUR.org,

Lample , Llama: Open and eficient foundation Parma, Italy, 2023 .

language models , 2023 . arXiv: 2302 . 13971 . [16]

Bondielli ,

Dell'Oglio ,

Lenci ,

Marcelloni , [8]

Lai ,

Menini ,

Polignano ,

Russo ,

Sprug- L. C. Passaro ,

Sabbatini , Multi- fake-detective

noli , G. Venturi, Evalita 2023 : Overview of the 8th at evalita 2023: Overview of the multimodal fake