EVALITA 2023: Overview of the 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian

Mirko Lai (mirko.lai@unito.it)

Stefano Menini (menini@fbk.eu)

Marco Polignano (marco.polignano@uniba.it), University of Bari “Aldo Moro”

Valentina Russo (vrusso@logogramma.com)

Rachele Sprugnoli (rachele.sprugnoli@unipr.it)

Giulia Venturi (giulia.venturi@ilc.cnr.it), Institute for Computational Linguistics “A. Zampolli” (CNR-ILC), Pisa

EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 - 8, Parma, IT

EVALITA provides a shared framework for evaluating and comparing different Natural Language Processing (NLP) and speech systems across various tasks suggested and organized by the Italian research community. These tasks represent scientific challenges and allow testing of methods, resources, and systems on shared benchmarks related to linguistic open issues and real-world applications, also from multilingual and/or multi-modal perspectives. The EVALITA 2023 edition consisted of 13 different tasks grouped into four research areas: Affect, Authorship Analysis, Computational Ethics, and New Challenges in Long-standing Tasks. Participation involved 42 groups from 12 different countries, indicating an increasing international interest, partly due to the proposal of multilingual tasks. The final workshop showcases the results obtained and highlights the growing interest in using deep learning techniques based on Large Language Models as a new trend. Overall, EVALITA serves as a valuable platform for Italian and international researchers to explore NLP-related challenges, develop solutions, and foster discussions within the community.

1. Introduction

EVALITA provides a shared framework for evaluating and comparing NLP and speech systems on tasks proposed by the Italian research community. The proposed tasks represent scientific challenges where methods, resources, and systems can be tested against shared benchmarks representing linguistic open issues or real-world applications, possibly in a multilingual and/or multi-modal perspective. The collected datasets provide big opportunities for scientists to explore old and new problems concerning NLP in Italian as well as to develop solutions and discuss NLP-related issues within the community. Some tasks are traditionally present in the evaluation campaign, while others are completely new.

This paper introduces the tasks proposed at EVALITA 2023 and provides an overview of the participants and systems whose descriptions and obtained results are reported in these Proceedings. The EVALITA 2023 final workshop, held in Parma on September 7th-8th, counts 13 different tasks. In particular, the selected tasks are grouped into four research areas (tracks) according to their objective and characteristics, namely: (i) Affect; (ii) Authorship Analysis; (iii) Computational Ethics; (iv) New Challenges in Long-standing Tasks.

This edition saw the participation of 42 groups whose members have affiliations in 12 different countries. The high number of tasks is in line with a clear trend towards an increasing volume of proposed tasks at EVALITA. In fact, we have witnessed a significant progression from the 5 tasks organized in the first EVALITA campaign in 2007 to a peak of 14 tasks in the latest 2020 edition. Although EVALITA is generally promoted and targeted to the Italian research community, this edition saw increasing international participation, partly due to our strong encouragement for the submission of multilingual tasks. This confirms a general trend of internationalization for the campaign, which reached its maximum this year, as discussed further below.

This overview is organized as follows: Section 2 briefly describes the tasks belonging to the various areas. Section 3 discusses participation in the workshop from several perspectives, from the research area to the affiliation of authors. Section 4 describes the criteria used to assign the award for the best system across tasks, decided by an ad-hoc committee starting from the suggestions of task organizers and reviewers. Finally, Section 5 points out both the obtained results and the future of the workshop.

2. EVALITA 2023 Tracks and Tasks

In the 2023 edition of EVALITA, 13 different tasks were proposed, peer-reviewed, and accepted. Data were produced by the task organizers and made available to the participants. For the future availability of these data, we are going to release them on GitHub (https://github.com/evalita2023), in accordance with the terms and conditions of the respective data sources. Such a repository will also reference alternative repositories managed by the task organizers. The tasks of EVALITA 2023 are grouped according to the following tracks corresponding to four broad research areas:

Affect

EMit - Categorical Emotion Detection in Italian Social Media [1]. It aims to provide the first evaluation framework for emotion detection in Italian texts at EVALITA, following the categorical approach and offering novel annotated data. It presents two subtasks: i) Subtask A, which consists of an emotion detection challenge, and ii) Subtask B, which introduces a novel problem of target detection of the expressed emotion.

EmotivITA - Dimensional and Multi-dimensional Emotion Analysis [2]. The first shared task for Italian that follows the dimensional approach in emotion analysis. It introduces a new Italian dataset annotated with the Valence, Arousal, and Dominance dimensions and has two subtasks: i) Dimensional emotion regression and ii) Multi-dimensional emotion regression.

Authorship Analysis

PoliticIT - Political Ideology Detection in Italian Texts [3]. It aims to extract politicians’ ideology information from a set of tweets in Italian, framed as a binary and a multiclass classification. The task is designed to be privacy-preserving and is accompanied by a subtask targeting the identification of self-assigned gender as a demographic trait.

GeoLingIt - Geolocation of Linguistic Variation in Italy [4]. The first shared task on the geolocation of social media posts comprising content in language varieties other than standard Italian (i.e., regional Italian, and languages and dialects of Italy). It is articulated into two subtasks: i) coarse-grained geolocation, aiming at predicting the region in which the variety expressed in the post is spoken, and ii) fine-grained geolocation, aiming at predicting its exact coordinates.

LangLearn - Language Learning Development [5]. The first shared task on automatic language development assessment aimed at developing and evaluating systems to predict the evolution of the written language abilities of learners across several time intervals. It was conceived to be multilingual, relying on written productions of Italian and Spanish learners, and representative of L1 and L2 learning scenarios.

Computational Ethics

HaSpeeDe 3 - Political and Religious Hate Speech Detection [6]. The third edition of a shared task on the detection of hateful content in Italian tweets. Differently from the two previous editions (organized within EVALITA 2018 and 2020), it explores hate speech in strongly polarised debates concerning politics and religion. Participants are asked to predict hate speech in both in- and out-domain settings, using either only the provided textual content of the tweet or any kind of external data.

HODI - Homotransphobia Detection in Italian [7]. The first shared task for the automatic detection of homotransphobia in Italian. The challenge is organized into two subtasks: i) Subtask A focuses on the binary textual classification of homotransphobic tweets, and ii) Subtask B is concerned with the identification of rationales for explainability in the form of text spans.

MULTI-Fake-DetectiVE - MULTImodal Fake News Detection and VErification [8]. The first task on fake news detection in Italian that explores multimodality and addresses the problem from two perspectives, represented by the two subtasks: i) sub-task 1, which aims to evaluate the effectiveness of multimodal fake news detection systems, and ii) sub-task 2, which aims to gain insights into the interplay between text and images. Both perspectives were framed as classification problems.

ACTI - Automatic Conspiracy Theory Identification [9]. The first shared task based exclusively on comments published on conspiratorial channels of Telegram. It is articulated into two subtasks: i) Conspiratorial Content Classification, consisting in identifying conspiratorial content, and ii) Conspiratorial Category Classification, about specific conspiracy theory classification.

New Challenges in Long-Standing Tasks

NERMuD - Named-Entities Recognition on Multi-Domain Documents [10]. It consists in extracting and classifying persons, organizations, and locations from documents in various domains. It is articulated into two subtasks: i) Domain-agnostic classification, where participants are required to identify and classify entities from different types of texts, i.e., news, fiction, and political speeches, using a single model, and ii) Domain-specific classification, where a different model can be used for each text type.

CLinkaRT - Linking a Lab Result to its Test Event in the Clinical Domain [11]. It is a relation extraction task based on clinical cases taken from the E3C corpus, i.e., Italian written documents reporting statements of clinical practice. The task consists in identifying test results and measurements and linking them to the textual mentions of the laboratory tests and measurements from which they were obtained.

WiC-ITA - Word-in-Context task for Italian [12]. The first shared task at EVALITA on determining if a word occurring in two different sentences has the same meaning or not. It has been modeled as both a binary classification and a ranking problem.

DisCoTEX - Assessing DIScourse COherence in Italian TEXts [13]. The first shared task focused on modeling discourse coherence for Italian real-world texts. It was organized into two independent tasks: a more traditional one, aimed at evaluating whether models are able to distinguish well-organized documents from corrupted ones, and a less explored one, which assesses the models’ performance on texts evaluated for coherence by human raters.

3. Participation

EVALITA 2023 attracted the interest of a large number of researchers from academia and industry, for a total of 42 single teams composed of about 109 individuals participating in one or more of the 13 proposed tasks. After the evaluation period, 51 system descriptions were submitted (reported in these proceedings), i.e., a 12% decrease with respect to the previous EVALITA edition [14].

Moreover, task organizers allowed participants to submit more than one system result (called runs), for a total of 246 submitted runs. Table 1 shows the different tracks and tasks along with the number of participating teams and submitted runs. The data reported in the table is based on information provided by the task organizers at the end of the evaluation process. Such data represents an overestimation with respect to the systems described in the proceedings. The trends are similar, but there are differences due to groups participating in more than one task and groups that have not produced a system report.

[Table 1: Tracks and tasks of EVALITA 2023 with the number of participating teams and submitted runs. Affect: EMit, EmotivITA. Authorship Analysis: PoliticIT, GeoLingIt, LangLearn. Computational Ethics: HaSpeeDe 3, HODI, MULTI-Fake-DetectiVE, ACTI. New Challenges in Long-standing Tasks: NERMuD, CLinkaRT, WiC-ITA, DisCoTEX.]

Unlike previous EVALITA editions, the organizers were not discouraged from distinguishing the submissions between unconstrained and constrained runs (a system is considered constrained when using the provided training data only; on the contrary, it is considered unconstrained when using additional material to augment the training dataset or to acquire additional resources). In fact, some of them introduced subtasks based on external resources used for training, while others required both a constrained and an unconstrained run. Alternatively, they allowed participants the freedom to utilize external resources or augment the distributed datasets. This decision was motivated by the expectation that most participants would employ pre-trained Neural Language Models. Thus, the organizers wanted to assess the participants’ creativity in adopting strategies beyond solely relying on these models.

Participation was quite imbalanced across different tracks and tasks, as reported in Figure 1: each rectangle represents a task whose size reflects the number of participants, while the color indicates the corresponding track.

In line with the past edition of EVALITA [14], the development of systems dedicated to identifying unethical behaviors or malicious intentions in texts, spanning various aspects of human society, remains a topic of significant interest to the community. In fact, as evidenced by the high participation, the shared tasks grouped under the “Computational Ethics” track obtained the most attention. However, for the first time, this year the second most participated track was the “Authorship Analysis” one, which is focused on analyzing text writing styles to capture diverse author characteristics. This is a quite new result since the same typology of track had a relatively low number of participants during the 2020 campaign. This shows the interest of the NLP community towards new and potentially more challenging areas of natural language understanding. It is worth noting that this year for the first time we introduced a new track solely dedicated to evaluating systems of emotions detection from two diverse perspectives (the “Affect” track). Additionally, we decided to keep the “New Challenges in Long-standing Tasks” track. Even if this track was among the least participated, the rationale behind this choice was to offer benchmarks for more conventional NLP tasks to the latest generation of Large Language Models. These models served as the foundation for the majority of the approaches devised by the participants, as illustrated in Section 5.

It is worth noting that we also received a considerable number of tasks presented for the first time at EVALITA. Besides the two tasks centered around modeling different aspects of affect, namely EMit and EmotivITA, among them we can find GeoLingIt, LangLearn, HODI, and ACTI, which introduced novel problems. Interestingly, two of these newly introduced tasks received the highest number of submissions, showing the interest of the community in taking on new challenges.

In contrast to the 2020 edition, which saw a total of over 180 task organizers or participants, EVALITA 2023 experienced reduced participation. However, it is worth noting that the 172 proceedings authors, including both participants and task organizers, reflect a greater diversity in terms of their origins, spanning 15 different countries. Notably, 70% of these contributors come from Italy, while the remaining 30% come from institutions and companies abroad. The 63 task organizers have affiliations in 6 countries (79% from Italy and 21% from institutions and companies abroad). In summary, a noticeable increase was observed in the number of task organizers, particularly those affiliated with institutions abroad. In fact, the proportion of organizers with foreign affiliations more than doubled with respect to the previous edition, rising from 10% to 21% of the total organizers. This indicates a growing international interest in EVALITA. Notably, 6 out of the 13 tasks were organized by authors with mixed affiliations, combining both Italian and foreign institutions. This statistic aligns with one of the innovations we introduced this year. Indeed, during the call for tasks period, we encouraged the proposal of multilingual tasks, where participants were provided with datasets in both Italian and other languages. Up until now, only two tasks, namely LangLearn and WiC-ITA, provided participants with datasets in Spanish and English, respectively, in addition to Italian. Although only a small number of organizers embraced this suggestion, we see it as a promising first step towards achieving a more international profile for EVALITA in the future.

As a last remark, we would like to note that this year we had four teams that participated in multiple tasks. Among them, one team employed the same approach for two tasks (HODI and HaSpeeDe 3), while two other teams utilized distinct methods for two tasks each (LangLearn and WiC-ITA, and LangLearn and DisCoTEX). Particularly innovative was the approach taken by a single team, which submitted results for all 13 tasks, employing variations of the same model. In Section 5, we discuss how this feat was accomplished through the utilization of instruction-based models fine-tuned on all the EVALITA 2023 datasets using task-specific prompts.

4. Best System Across Tasks Award

In line with the previous edition, we confirmed the award for the best system across tasks. The award was introduced with the aim of fostering student participation in the evaluation campaign and in the workshop.

A committee of 5 members (Felice Dell’Orletta, Bernardo Magnini, Azzurra Mancini, Stefano Menini, Viviana Patti) was asked to choose the best system across tasks. Three of the five members come from academia while two of them are from industry. The composition of the committee is balanced with respect to the level of seniority as well as to their academic background (computer science-oriented vs. humanities-oriented). In order to select a short list of candidates, the task organizers were invited to propose one candidate system participating in their tasks (not necessarily top-ranking). The committee was provided with the list of candidate systems and the criteria for eligibility, based on:

• novelty with respect to the state of the art;

• originality, in terms of identification of new linguistic resources, identification of linguistically motivated features, and implementation of a theoretical framework grounded in linguistics;

• critical insight, paving the way to future challenges (deep error analysis, discussion on the limits of the proposed system, discussion of the inherent challenges of the task);

• technical soundness and methodological rigor.

We collected XX system nominations from the organizers of XX tasks from across all tracks. The candidate systems are authored by 20 authors, among whom 12 are students, either at the master’s or PhD level. The award recipient(s) will be announced during the final EVALITA workshop, during the plenary session, held online.

5. Final Remarks

The widespread adoption of Large Language Models (LLMs) was evident in the EVALITA 2023 challenge. LLMs, such as GPT-3 and its variants, have revolutionized the NLP landscape due to their ability to learn from large amounts of data and generate contextually relevant responses. These models have shown remarkable performance across various NLP tasks, and their usage was prominent in this edition of EVALITA. The confirmation of the massive use of LLMs underscores their effectiveness and potential in advancing NLP technology.

Traditional supervised learning approaches heavily rely on annotated data, which can be expensive and time-consuming to obtain. In response to this challenge, many participants in EVALITA 2023 proposed a semi-supervised approach using the prompting technique. The prompting technique involves providing the model with a few example inputs or a prompt to guide its response generation. This method allows leveraging limited labeled data while utilizing the model’s language understanding capabilities to generalize to unseen instances. The adoption of the prompting technique showcases the interest in exploring more efficient and resourceful ways to tackle NLP tasks.
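As a rough illustration of the prompting technique, the sketch below assembles a few-shot prompt from a task instruction, a handful of labeled examples, and an unlabeled query. It is not taken from any participant's system: the function name build_few_shot_prompt, the label names, and the example sentences are hypothetical, and the resulting string would still have to be sent to whichever LLM is being used.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The last block leaves the label empty so that the model completes it.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical examples for a binary classification task
# (illustrative only, not drawn from any EVALITA dataset).
examples = [
    ("Che bella giornata a Parma!", "not_hateful"),
    ("Quelli come te non meritano rispetto.", "hateful"),
]
prompt = build_few_shot_prompt(
    "Classify the Italian text as hateful or not_hateful.",
    examples,
    "Oggi il workshop inizia alle nove.",
)
print(prompt)
```

Only the two labeled examples carry supervision here; everything else relies on what the model already learned during pre-training, which is the trade-off the paragraph above describes.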

A noteworthy development in EVALITA 2023 was a team that participated in all tasks using the same approach, facilitated by prompt-based LLM fine-tuning. While this approach showed promise, it also highlighted an essential observation: the performance of LLMs varies significantly across different NLP tasks. Although LLMs are powerful models, they may not excel uniformly in all linguistic challenges. This underscores the need to understand the strengths and limitations of LLMs and to fine-tune them specifically for each task to achieve optimal results.

Another important outcome of the EVALITA 2023 challenge was the substantial increase in participation from groups outside Italy, making it one of the most attended editions by international teams. The rising international interest can be attributed to the growing significance of NLP and speech technologies on a global scale. The encouragement for multilingual tasks and the availability of shared datasets might have attracted researchers from different countries to participate actively. This trend signifies the growing impact and international recognition of the EVALITA initiative, facilitating collaboration and knowledge exchange among NLP communities worldwide.

To sum up, EVALITA 2023 outcomes demonstrate the dominance of LLMs in NLP, the exploration of semi-supervised approaches, the significance of task-specific fine-tuning, and the increasing internationalization of the initiative. These outcomes contribute to advancing the field of NLP, encouraging further research, and fostering a diverse and collaborative NLP community.

Acknowledgments

We would like to thank our sponsors: Talia (https://talia.cloud/), Almawave (https://www.almawave.com/it/), APTUS.AI (https://www.aptus.ai/) and Logogramma (https://www.logogramma.com/). Our gratitude goes also to the University of Parma for hosting the event. In addition, we sincerely thank the Best System award committee for providing their expertise and experience. Moreover, we acknowledge the AILC Board members for their trust and support. We warmly thank our invited speaker, Julio Gonzalo, for having shared his knowledge and insights with his talk. Last but not least, we would like to thank all the task organizers and participants who made this edition special with their enthusiasm and creativity.

References

[1] Araque, Frenda, Sprugnoli, Nozza, V. Patti, EMit at EVALITA 2023: Overview of the Categorical Emotion Detection in Italian Social Media Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[2] Gafà, Cutugno, M. Venuti, EmotivITA at EVALITA 2023: Overview of the Dimensional and Multidimensional Emotion Analysis Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[3] Russo, S. M. Jiménez Zafra, J. A. García-Díaz, T. Caselli, Guerini, L. A. Ureña López, R. Valencia-García, PoliticIT at EVALITA 2023: Overview of the Political Ideology Detection in Italian Texts Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[4] Ramponi, C. Casula, GeoLingIt at EVALITA 2023: Overview of the Geolocation of Linguistic Variation in Italy Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[5] Alzetta, Brunato, Dell'Orletta, Miaschi, Sagae, C. H. Sánchez-Gutiérrez, G. Venturi, LangLearn at EVALITA 2023: Overview of the Language Learning Development Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[6] Lai, Celli, Ramponi, Tonelli, Bosco, Patti, HaSpeeDe3 at EVALITA 2023: Overview of the Political and Religious Hate Speech Detection task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[7] Nozza, A. T. Cignarella, G. Damo, Caselli, Patti, HODI at EVALITA 2023: Overview of the first Shared Task on Homotransphobia Detection in Italian, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[8] Bondielli, Dell'Oglio, Lenci, Marcelloni, L. C. Passaro, M. Sabbatini, MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the MULTImodal Fake News Detection and VErification Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[9] Russo, Stoehr, M. Horta Ribeiro, ACTI at EVALITA 2023: Overview of the Conspiracy Theory Identification Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[10] Palmero Aprosio, T. Paccosi, NERMuD at EVALITA 2023: Overview of the Named-Entities Recognition on Multi-Domain Documents Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[11] Begoña, Karunakaran, Lavelli, Magnini, Speranza, R. Zanoli, ClinkaRT at EVALITA 2023: Overview of the Task on Linking a Lab Result to its Test Event in the Clinical Domain, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[12] Cassotti, Siciliani, L. C. Passaro, Gatto, P. Basile, WiC-ITA at EVALITA 2023: Overview of the EVALITA2023 Word-in-Context for ITAlian Task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[13] Brunato, Colla, Dell'Orletta, Dini, D. P. Radicioni, A. A. Ravelli, DisCoTEX at EVALITA 2023: Overview of the assessing DIScourse COherence in Italian TEXts task, in: M. Lai, et al. (Eds.), Proceedings of EVALITA 2023, CEUR.org, September 7th-8th 2023, Parma, 2023.

[14] Basile, Croce, M. Di Maro, L. C. Passaro, EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, in: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), CEUR.org, Online, 2020.