<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Applied Re</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2020.semeval-1.186</article-id>
      <title-group>
        <article-title>MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the MULTImodal Fake News Detection and VErification Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pietro Dell'Oglio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Marcelloni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia C. Passaro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Sabbatini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>33</volume>
      <fpage>60</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>This paper introduces the MULTI-Fake-DetectiVE shared task for the EVALITA 2023 campaign. The task was aimed at exploring multimodality within the realm of fake news and intended to address the problem from two perspectives, represented by the two sub-tasks. In sub-task 1, we aimed to evaluate the effectiveness of multimodal fake news detection systems. In sub-task 2, we sought to gain insights into the interplay between text and images, specifically how they mutually influence the interpretation of content in the context of distinguishing between fake and real news. Both perspectives were framed as classification problems. The paper presents an overview of the task. In particular, we detail the key aspects of the task, including the creation of a new dataset for fake news detection in Italian, the evaluation methodology and criteria, the participant systems, and their results. In light of the obtained results, we argue that the problem is still open and propose some future directions.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News</kwd>
        <kwd>Fake news detection</kwd>
        <kwd>Multi-modality</kwd>
        <kwd>Vision-Language models</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Recent years have seen a great increase in the online proliferation of disinformation and fake news [1]. This is especially true in the context of real-world events that are reported as breaking news. It is often the case that entities with malicious intents exploit breaking news to push their own agenda by distorting facts and intentionally publishing false or misleading information.</p>
      <p>Distorted uses of online social media have been made most evident in the last few years by the first so-called infodemic following the COVID-19 pandemic [2], in what has been defined by several authors as a Post-Truth Era [3] dominated by emotions and pseudo-facts [4]. This phenomenon has grown further with the outbreak of the Russian war against Ukraine. Like in all conflicts, disinformation has become a powerful strategic weapon.</p>
      <p>These issues have led over the years to the creation of numerous initiatives for independent fact-checking and fake news detection, and the topic has gained relevance in the research community. The literature on issues related to fake news detection, disinformation, and fact-checking is constantly growing, despite the inherent challenges and many facets of the problem.</p>
      <p>A large number of approaches and techniques have been proposed for content verification and fake news detection in a uni-modal setting. Most of the proposed approaches use either the actual content of the news (i.e., the text itself), its context (e.g., social network structures, temporal information), or a combination of both [5]. Most modern systems typically leverage transformer models with additional information [6].</p>
      <p>It is clear that the easiest way to spread disinformation is in textual form. However, online outlets and social media allow for other modalities as well. Images, for example, can be leveraged in the context of disinformation and fake news in different ways: first, the inclusion of images in malicious content can be leveraged as a way to provide more credibility for the text containing the fake news; second, images could be described in such ways that their original content is misinterpreted by readers, leading again to disinformation; finally, they can be used in an attempt to increase the attraction of the post and get the fake news shared by as many social media users as possible.</p>
      <p>We can argue that multimodal scenarios may be considered as closer in nature to real-world ones when examining social media data. Nevertheless, multimodality has received relatively less attention over the years in this context [4]. This is rapidly changing, with a number of international multimodal shared tasks being organised for fake news and propaganda detection, fact-checking, and related areas [7, 8, 9, 10]. Nevertheless, models combining multiple modalities for detecting fake news remain a major open challenge in the literature, as do datasets including different modalities and different sources of fake news [4]. Moreover, we believe that a fundamental step towards a more nuanced understanding of the problem lies in actually understanding and modelling the interplay between the different modalities in generating disinformation.</p>
      <p>In this context, we propose MULTI-Fake-DetectiVE (https://sites.google.com/unipi.it/multi-fake-detective/home) as part of the EVALITA 2023 Evaluation campaign [11]. The task is aimed at addressing both the textual and visual aspects of fake news on social media and online news outlets, from two key perspectives: we want to model fake news detection from a multimodal perspective, and we are interested in exploring how images and texts interact and influence each other in the context of real and fake news. Further, we contribute to this research area by creating a dataset of social media posts from Twitter and news articles regarding the Russian-Ukrainian war, including fake and real news.</p>
      <p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT. Corresponding author: alessandro.bondielli@unipi.it (A. Bondielli). Other author contacts: lucia.passaro@unipi.it (L. C. Passaro), marco.sabbatini@unipi.it (M. Sabbatini). ORCID: 0000-0003-3426-6643 (A. Bondielli), 0000-0002-0793-5226 (P. Dell'Oglio), 0000-0001-5790-4308 (A. Lenci). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Definition of the task</title>
      <p>MULTI-Fake-DetectiVE includes two sub-tasks. Both are formulated as multi-class classification problems. In the first sub-task, given a piece of content (i.e., a social media post or a news article) that includes both a visual and a textual component, the goal is to determine its likelihood of being real or fake news. In the second sub-task, given a text and an accompanying image, the goal is to decide whether or not their combination is aimed at misleading the reader's interpretation of one or the other. Note that for both sub-tasks we consider the visual component to be all the images provided with a given textual content (i.e., news article or social media post). Thus, for example, if a tweet includes three images and one of them is misleading, the expected label will be misleading.</p>
      <p>In the following, we describe both sub-tasks in detail.</p>
      <sec id="sec-2-1">
        <title>2.1. Sub-task 1: Multimodal Fake News Detection</title>
        <p>The first sub-task is structured as a multi-class classification problem in a multimodal setting. The problem is defined as follows: given a piece of content C = ⟨T, V⟩, which includes a textual component T and a visual component V (i.e., one or more images), classify it into one of the following classes:</p>
        <p>Certainly Fake: news that is certain to be fake, whatever the context.</p>
        <p>Probably Fake: news that is likely to be fake, but may include some real information or at the very least be somewhat credible.</p>
        <p>Probably Real: news that is very credible but still retains some degree of uncertainty about the provided information.</p>
        <p>Certainly Real: news that is certain to be real and incontestable, whatever the context.</p>
        <p>The classes refer to the informational content as a whole, and not to its single components. For example, a fake piece of news including a real image (e.g., in a misleading context) is still probably (or certainly) fake.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sub-task 2: Cross-modal relations in Fake and Real News</title>
        <p>The second sub-task is aimed at assessing how the two modalities (i.e., textual and visual) interact in the context of fake and real news. Our goal is to understand how images and texts in fake and real news can lead to misleading interpretations of the content pertaining to the other modality and to the whole news.</p>
        <p>The sub-task is a three-class classification problem, and is defined as follows: given a piece of content C = ⟨T, V⟩, which includes a textual component T and a visual component V, decide whether their combination is:</p>
        <p>Misleading: one of the textual and visual components is used deceptively to lead to misinterpretation of the other.</p>
        <p>Not Misleading: the combination of the visual and textual components does NOT lead to misinterpretation of the news.</p>
        <p>Unrelated: the visual component is not related to the textual component, or does not add information to it, or does not change its interpretation in a meaningful way.</p>
      </sec>
    </sec>
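    <p>For illustration, the content pair C = ⟨T, V⟩ and the two labelling schemes defined above can be sketched as a small data model. This is our own sketch for clarity; the class names and integer encoding below are assumptions, not the official encoding distributed with the task data.</p>
    <preformat>
```python
from dataclasses import dataclass
from enum import Enum

class FakeNewsLabel(Enum):
    """Sub-task 1: four classes with an inherent ordering."""
    CERTAINLY_FAKE = 0
    PROBABLY_FAKE = 1
    PROBABLY_REAL = 2
    CERTAINLY_REAL = 3

class CrossModalLabel(Enum):
    """Sub-task 2: three classes with no inherent ordering."""
    MISLEADING = 0
    NOT_MISLEADING = 1
    UNRELATED = 2

@dataclass
class Content:
    """A piece of content C = (T, V): a text plus ALL images attached to it."""
    text: str
    image_urls: list

# A tweet with three images where one image is misleading is labelled
# Misleading as a whole: labels refer to the full content, not to parts.
sample = Content("Breaking news ...", ["img1.jpg", "img2.jpg", "img3.jpg"])
label = CrossModalLabel.MISLEADING
```
    </preformat>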
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset for the shared task includes social media posts and news articles, containing both a textual and a visual component, concerning one or more real-world events that are known to have been subject to the generation of fake news. In particular, the dataset focuses on the Ukrainian-Russian war, and includes data in a time span going from February 2022 to December 2022.</p>
      <p>The dataset is composed of two sub-datasets, one for each sub-task. Each is further split into a training set and two different test sets. More specifically, the dataset for each sub-task is divided as follows:</p>
      <p>Training Set: the training data provided to participants. It includes data from February 2022 to September 2022.</p>
      <p>Test Set (Official): the official test set used for evaluation. It includes data from October 2022 to December 2022.</p>
      <p>Test Set (Additional): an additional batch of test data including data from the same time window as the training set.</p>
      <p>The Official Test Set was developed to challenge participating systems to classify fake news and misleading content in a more real-world scenario (i.e., different time windows that might determine different data distributions). The Additional Test Set was instead aimed at giving us a clearer picture of how resilient participating systems are to changes in the context over time [12]. Note that the evaluation on the Additional Test Set was not mandatory.</p>
      <p>The dataset is available for download on the website of the task (https://sites.google.com/unipi.it/multi-fake-detective/data).</p>
      <sec id="sec-3-1">
        <title>3.1. Data Collection and annotation</title>
        <p>The dataset was collected and annotated via crowdsourcing, following a multi-step process heavily inspired by the one proposed in [13]. First, we broadly collected Twitter data regarding the Ukrainian-Russian war in the chosen time span. To collect such data, we chose a set of keywords representative of the conflict, e.g., “Ucraina, Russia, Putin, Zelensky”. In addition to this, we collected texts and images for news articles that were linked in the tweets. At this stage, the data were collected regardless of the sub-tasks.</p>
        <p>Then, we exploited a manually collected set of verified fake news and misleading claims (henceforth referred to as seed fake news and misleading claims) to generate the dataset for each sub-task. We took into account different news outlets reporting on the fake news and independent fact-checking websites. These seed fake news and misleading claims were intended to serve a dual purpose. On the one hand, we used them to filter the original dataset by considering their similarity with data samples. This was done to ensure that: i) the resulting datasets would include only relevant elements (i.e., that actually refer to the Ukrainian-Russian war), and ii) the class distribution for both sub-tasks was not too skewed in favour of real news and not misleading claims, as it would have been in an uncontrolled scenario. On the other hand, the seed fake news and misleading claims served as context for the annotation process. Specifically, we used Prolific (prolific.co) to obtain labels for our dataset. For each sub-task, we provided annotators with the seed fake news and misleading claims as context, and asked them to label a few of the data samples. Each data sample was labelled by at least five different annotators. We collected the human annotations and kept only data samples for which at least 3 out of the 5 annotators provided the same label.</p>
        <p>Dataset sizes and class distributions are reported in Tables 1 and 2 for, respectively, sub-tasks 1 and 2. In sub-task 1, inter-annotator agreement was calculated as the average Spearman correlation coefficient between annotator pairs, considering the ordered nature of the labels. The average correlation was 0.43 (σ = 0.04). In sub-task 2 we employed Fleiss’ Kappa to measure the inter-annotator agreement, since the labels were not inherently ordered. We obtained κ = 0.25.</p>
        <p>Participants were provided with a TSV file containing IDs, URLs, and numeric labels representing the classes for sub-task 1 and sub-task 2. The label was excluded from the test set during the evaluation period. Participants had the option to download the data using their preferred method or utilise a provided download script. The script offered participants access to textual data, including metadata such as URLs, data type (e.g., tweet or article), and creation date if available, as well as associated images. Authorship information was not provided with the data. Note that while the datasets were treated separately for annotation, some data samples could be present in both sub-tasks. In such cases, the ID associated with the data point remained consistent across the two sub-tasks.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Copyright and Content Warning</title>
        <p>The dataset includes tweets and news articles. The provided download script performs a coarse-grained anonymization of the data (e.g., by removing author information).</p>
        <p>Upon download, users agree not to share the material they receive both during and after the competition. The data for the MULTI-Fake-DetectiVE tasks is to be used for research purposes only. Note that by receiving the data, users implicitly agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy for academic researchers.</p>
        <p>We do not share responsibility for the contents of the dataset. Downloaded texts and images may include copyrighted material and sensitive contents. The downloaded data and the provided labels do not reflect in any way the social and political views of the task organisers.</p>
      </sec>
    </sec>
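    <p>The label aggregation and agreement computations described in Section 3.1 can be sketched as follows. This is our own minimal pure-Python illustration, not the organisers' code; it assumes integer-encoded labels and uses a tie-aware rank computation for Spearman.</p>
    <preformat>
```python
from collections import Counter
from itertools import combinations

def majority_label(votes, min_agree=3):
    """Keep a sample only if at least min_agree of the annotators agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

def ranks(xs):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    i = 0
    while n > i:
        j = i
        while n > j and xs[order[j]] == xs[order[i]]:
            j += 1
        for k in order[i:j]:
            out[k] = (i + j + 1) / 2.0
        i = j
    return out

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def average_pairwise_spearman(annotations):
    """annotations: one list of ordered labels per annotator, aligned by sample."""
    scores = [spearman(a, b) for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)
```
    </preformat>
    <p>For sub-task 2, whose labels are unordered, a chance-corrected categorical measure such as Fleiss' Kappa would be used in place of the rank correlation.</p>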
    <sec id="sec-4">
      <title>4. Evaluation measures</title>
      <p>Participants were allowed to present up to four different systems for predicting labels on the official test set, with one system marked as primary. Results for primary systems were used as the basis for the final ranking. Specifically, the ranking was calculated based on the weighted average F1-score of the systems. The same evaluation procedure and criteria were applied to both sub-tasks. The evaluation procedure was conducted by means of an evaluation script (available to participants).</p>
      <p>Note that due to restrictions in data distribution (see
Section 3), not all participants may have had access to the
exact same test dataset. For example, articles/tweets in
the dataset may have been removed by the authors during
the evaluation window. To ensure fair competition, we
evaluated and ranked the systems only on the subsets of
the test sets for which all the participants were able to
provide a label.</p>
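      <p>The ranking procedure described above can be sketched as follows: compute each primary system's weighted average F1-score, restricted to the test-set subset that every participant was able to label. This is our own minimal re-implementation for illustration, not the organisers' evaluation script; in practice the same metric is provided by sklearn.metrics.f1_score with average="weighted".</p>
      <preformat>
```python
from collections import Counter

def weighted_f1(gold, pred):
    """Weighted average F1: per-class F1 weighted by class support in gold."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for lab in support:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (support[lab] / total) * f1
    return score

def rank_systems(gold, runs):
    """gold: {sample_id: label}; runs: {system_name: {sample_id: label}}.
    Systems are compared only on the ids that ALL of them labelled."""
    common = set(gold)
    for preds in runs.values():
        common = common.intersection(preds)
    ids = sorted(common)
    ref = [gold[i] for i in ids]
    scored = [(name, weighted_f1(ref, [preds[i] for i in ids]))
              for name, preds in runs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```
      </preformat>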
      <sec id="sec-4-1">
        <title>4.1. Baseline models</title>
        <p>Participating systems were evaluated against each other and against a set of baseline models.</p>
        <p>Specifically, we proposed two different classification models, namely a Support Vector Machine (SVM) and a Multi-Layer Perceptron (MLP), with three different feature sets as the baseline models. As for the feature sets, we considered:</p>
        <p>Text-only features, extracted with a multilingual BERT model [14].</p>
        <p>Image-only features, extracted with ResNet-18 [15].</p>
        <p>Multimodal features, obtained by concatenating the text-only and image-only features.</p>
        <p>All models were trained using the default parameters from scikit-learn (https://scikit-learn.org/stable/supervised_learning.html).</p>
        <p>To ensure fair reproducibility and comparisons, the baseline models and the evaluation scripts are available on the website of the task (https://sites.google.com/unipi.it/multi-fake-detective/tasks-and-evaluation).</p>
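        <p>The baseline setup can be sketched as follows. The feature extractors are stubbed out here for brevity (in the actual baselines they would be a multilingual BERT text encoder and a ResNet-18 image encoder, with the classifiers taken from scikit-learn using default parameters); how features from multiple images per sample are combined is our assumption, not stated by the task description.</p>
        <preformat>
```python
def encode_text(text):
    # stand-in for a multilingual BERT text embedding (real dim: 768)
    return [float(len(text)), float(text.count(" "))]

def encode_image(path):
    # stand-in for ResNet-18 penultimate-layer features (real dim: 512)
    return [float(len(path)), 1.0]

def feature_sets(text, image_paths):
    """Build the three baseline feature sets: text-only, image-only, multimodal."""
    txt = encode_text(text)
    imgs = [encode_image(p) for p in image_paths] or [[0.0, 0.0]]
    # assumption: average the features of all images attached to a sample
    img = [sum(col) / len(imgs) for col in zip(*imgs)]
    return {"text": txt, "image": img, "multi": txt + img}

# With real features, each set is fed to an SVM and an MLP, e.g.:
#   from sklearn.svm import SVC
#   from sklearn.neural_network import MLPClassifier
#   SVC().fit(X_train, y_train)            # default parameters
#   MLPClassifier().fit(X_train, y_train)  # default parameters
feats = feature_sets("A breaking news item", ["a.jpg", "b.jpg"])
```
        </preformat>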
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Participating systems and results</title>
      <p>A total of four teams participated in MULTI-Fake-DetectiVE. All four teams participated in sub-task 1 (Multimodal Fake News Detection), and two of them also participated in sub-task 2 (Cross-modal relations in Fake and Real News). The proposed approaches are quite different. We can distinguish between two truly multimodal approaches and two text-oriented ones. In the following, we broadly describe the core systems of the participating teams.</p>
      <p>PoliTo [16] participated in both sub-tasks with an approach focused on refining FND-CLIP [17], a multimodal fake news detection model based on CLIP [18]. The authors proposed several refinements to the original model via ad-hoc extensions, including sentiment-based text encoding, image transformations in the frequency domain, and data augmentation via back translation. The final model for both sub-tasks is an ensemble that combines the predictions of all the extensions.</p>
      <p>AIMH [19] participated in both sub-tasks with a vision-text dual encoder approach. They used ViT to encode images and RoBERTa/BERT to encode texts. The authors experimented with different inputs for their model. They generated image captions and automatically translated Italian texts to English. They tested various input combinations and chose to use English texts and images as inputs for their final model.</p>
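      <p>The dual encoder design boils down to encoding each modality separately and fusing the embeddings before a classification layer. The following is a deliberately tiny pure-Python stand-in for that fusion head, with hand-made vectors in place of real ViT and RoBERTa/BERT embeddings; it is our sketch, not the team's code.</p>
      <preformat>
```python
def classify_fused(text_emb, image_emb, weights, bias):
    """Late fusion: concatenate the two embeddings, apply one linear layer,
    and return the index of the highest-scoring class plus the raw scores."""
    fused = text_emb + image_emb  # list concatenation = feature concatenation
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(scores)), key=lambda i: scores[i]), scores

# Toy example: 2-dim text embedding, 2-dim image embedding, 2 classes.
W = [[1.0, 0.0, 0.0, 0.0],   # class 0 looks at the first text feature
     [0.0, 0.0, 1.0, 0.0]]   # class 1 looks at the first image feature
b = [0.0, 0.0]
pred, _ = classify_fused([0.2, 0.5], [0.9, 0.1], W, b)
```
      </preformat>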
      <p>ExtremITA [20] participated with a text-only approach aimed at solving all EVALITA tasks via prompt engineering of Large Language Models. The team proposed two Italian models: an encoder-decoder based on T5 [21] and an instruction-tuned decoder-only model based on LLaMA [22].</p>
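      <p>Such an approach reduces classification to text generation: the model receives an instruction-style prompt and must emit a label string. A generic sketch of a prompt builder is shown below; the exact prompt wording, examples, and model interface used by the team are not reproduced here, so everything in this sketch is illustrative.</p>
      <preformat>
```python
LABELS = ["certainly fake", "probably fake", "probably real", "certainly real"]

def build_prompt(text, few_shot_examples):
    """Instruction prompt: task description, a few worked examples,
    then the item to classify; the LLM is expected to emit one label."""
    parts = ["Classify the following news item as one of: " + ", ".join(LABELS) + "."]
    for ex_text, ex_label in few_shot_examples:
        parts.append("Text: " + ex_text + "\nLabel: " + ex_label)
    parts.append("Text: " + text + "\nLabel:")
    return "\n\n".join(parts)

def parse_label(generated):
    """Map the generated string back onto the closest known label."""
    g = generated.strip().lower()
    for lab in LABELS:
        if g.startswith(lab):
            return lab
    return None

prompt = build_prompt("Esempio di notizia ...", [("Old claim ...", "probably fake")])
```
      </preformat>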
      <p>HIJLI-JU-CLEF [23] proposed a text-oriented model to solve sub-task 1. The model uses a pre-trained</p>
      <sec id="sec-5-5">
        <title>5.1. Results of Sub-task 1</title>
        <p>All participating systems attempted to solve the Multimodal Fake News Detection sub-task. Tables 3 and 4 detail the results obtained by each system, including baselines, on the Official and Additional test sets, respectively.</p>
        <p>As for the Official test set, the PoliTo ensemble model and the LLaMA-based ExtremITA model ranked first and second, with close results. The AIMH vision-text dual encoder model ranked third. All three models were able to outperform all baseline systems, albeit marginally. The best performing baselines are the text-based and multimodal SVM models. The other baseline models performed significantly worse. Finally, the HIJLI-JU-CLEF text-oriented system was able to outperform two out of the six baseline models proposed, ranking fourth among participants and eighth globally.</p>
        <p>As for the Additional test set, the best performing model was the LLaMA-based ExtremITA, closely followed by the PoliTo ensemble approach. The T5-based ExtremITA model performed significantly worse.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.2. Results of Sub-task 2</title>
        <p>Only the truly multimodal models participated in the Cross-modal relations in Fake and Real News sub-task. This is due to the fact that the task is inherently multimodal and cannot be modelled properly with text-only models: the relationship between image and text features lies at the core of the task, and thus images have to be modelled in some capacity to face it effectively.</p>
        <p>Table 5 shows the results obtained by each system, including baselines, on the Official test set. The PoliTo ensemble model ranked first, outperforming all baseline models, while the AIMH vision-text dual encoder model outperformed only the image-only MLP model, ranking seventh. Among baselines, surprisingly, the best-performing ones are the text-only models, followed by the multimodal ones. We suspect that the text-only baseline performances are to be attributed to chance rather than to their effective modelling of the problem.</p>
        <p>Only the PoliTo team participated in the Additional evaluation, obtaining a weighted average F1-score of 0.61.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We can draw some interesting insights from comparing the different proposed models, both in terms of their architectures and their results.</p>
      <p>General findings. First, we can argue that multimodal fake news detection and cross-modal analysis of images and texts in the context of fake news are two rather challenging tasks. As shown by the agreement metrics, it was a challenging task for annotators as well (see Sec. 3). This is also reflected by the fact that even the best performing systems were not able to considerably improve over the results of the baseline models.</p>
      <p>As for performances between the Official and Additional test sets, we saw a rather large discrepancy among tasks. We expected systems to perform better on data from the same time period. This appears to be true for sub-task 2, but not for sub-task 1. Note however that the only true comparison can be made on the PoliTo system, as it is the only one that participated in both the Official and Additional evaluation for both tasks.</p>
      <p>Finally, we must point out the weaknesses of the baseline models. While participating systems were able to perform consistently across sub-tasks and test sets, the performances of the baseline systems exhibit significant variability, with the relative rankings and performance disparities among models varying across tasks. This suggests that the baselines are unable to adequately model features of both modalities and to leverage them for the tasks.</p>
      <p>The architectures of the systems. In sub-task 1, only two out of the four participating systems could be considered truly multimodal, since they explicitly model image-level features (i.e., with an image encoder model). They are quite similar in principle: both leverage a dual encoder architecture [24] with a classification layer. A ViT image encoder was chosen by both approaches, albeit trained on different data. The text encoders employed in the AIMH system are RoBERTa (for English translations) and an Italian version of BERT (for original texts). The PoliTo team uses the FND-CLIP text encoder, which is based on GPT instead. The popularity of CLIP-like Vision-Language models is evident, due to their versatility and ease of adaptation for various scenarios and downstream tasks, including fake news detection. The main differences are the extensions (e.g., the inclusion of sentiment-aware text features and image transformations) proposed by the PoliTo team. The authors report performance increases with all proposed extensions, with the ensemble classifier performing best. The remaining two systems either disregard images due to the model architecture (ExtremITA) or consider their automatically generated caption (HIJLI-JU-CLEF), shifting the problem to a text-only space.</p>
      <p>The importance of additional processing. Sub-task 2 was specifically developed to frame the problem as a multimodal one. Only the two truly multimodal systems participated. As previously discussed, both systems employ a similar architecture and are arguably comparable in terms of model size. Thus, we can hypothesize that the difference in performances in both sub-task 1 and sub-task 2 may be attributed mostly to the additional processing and extensions applied by the PoliTo system. We could further argue that the tasks are both very complex and nuanced, and that additional forms of processing and/or features may provide important benefits in this scenario, rather than sole reliance on textual and visual features extracted from pre-trained models.</p>
      <p>The role of textual content. The results of sub-task 1 do not directly highlight a clear advantage of a multimodal approach over a uni-modal one. Two out of the three models which outperform all the baselines are actually multimodal, with the PoliTo FND-CLIP-IT ensemble outperforming all the others. However, the runner-up was the text-only Italian LoRA model based on LLaMA, with near identical performances. We can hypothesise that while modelling images in conjunction with text is actually helpful for determining whether a piece of content is real or fake news, a large part of the key information needed to answer the question lies within the textual content. If we assume this, it is easier to understand how model scale also plays a crucial role in performances. The only Large Language Model (LLM) presented can get close to the performances of a more refined and nuanced approach in a few-shot setting via prompting. Note that this may hold true regardless of the fact that Italian pre-trained LLMs are still not consistently outperforming all other approaches as their English counterparts do thanks to their sheer size [25].</p>
    </sec>
    <sec id="sec-6-2">
      <title>7. Conclusions and future directions</title>
      <p>In this paper, we presented the MULTI-Fake-DetectiVE shared task for EVALITA 2023. The task was focused on multimodality in the context of fake news. We considered the problem from two perspectives: we wanted to assess fake news detection systems in a multimodal setting, and we wanted to understand how text and images influence each other in the interpretation of a piece of content in the context of fake and real news. We framed both as classification problems.</p>
      <p>We saw an interesting degree of variety among the proposed systems, which we categorized as truly multimodal or text-oriented. By analyzing the proposed approaches and their results, we can summarise our findings as follows. First, multimodal fake news detection is a very challenging task, especially when considering near real-world scenarios. Second, we saw that for similar vision-language models, both in terms of architecture and model scale, extending the boundaries of the problem by considering additional and/or alternative processing strategies, including affective-oriented features and image processing in the frequency domain, is highly beneficial. Third, we saw that model scale also plays an important role, with pre-trained LLMs approaching the performances of thoroughly fine-tuned systems.</p>
      <p>Our findings suggest that the problem is still open, and that moving forward it could be advantageous to jointly leverage the advantages of the best-performing approaches, for instance by focusing on large pre-trained Vision-Language models augmented with additional features (e.g., by using available emotive resources [26]) via either fine-tuning or appropriate prompt tuning/engineering.</p>
      <p>Due to the current challenges posed by deceptive or misleading content on social media, we believe that an effective understanding and modelling of such a complex problem may prove to be highly beneficial in contrasting online disinformation. In this regard, the MULTI-Fake-DetectiVE task, including the proposed approaches and the provided datasets, may serve the Italian NLP community as an initial stepping stone in addressing this issue for the Italian language.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <p>This research was partially supported by the Italian Ministry of University and Research (MUR) in the framework of the PON 2014-2021 “Research and Innovation” resources – Innovation Action - DM MUR 1062/2021, Title of the Research: “Modelli semantici multimodali per l’industria 4.0 e le digital humanities” (“Multimodal semantic models for Industry 4.0 and the digital humanities”), of PNRR M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered AI”, funded by the European Commission under the NextGeneration EU programme, and of the CrossLab and FoReLab projects (Departments of Excellence).</p>
        <p>V. Lomonaco, D. Bacciu, Continual pre-training mitigates forgetting in language and vision, arXiv preprint arXiv:2205.09357 (2022).</p>
        <p>[13] L. C. Passaro, A. Bondielli, P. Dell’Oglio, A. Lenci, F. Marcelloni, In-context annotation of topic-oriented datasets of fake news: A case study on the Notre-Dame fire event, Information Sciences (2022).</p>
        <p>[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
        <p>[16] L. D’Amico, D. Napolitano, L. Vaiani, L. Cagliero, PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for multimodal Italian fake news detection, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[17] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang, Multimodal fake news detection via CLIP-guided learning, arXiv preprint arXiv:2205.14304 (2022).</p>
        <p>[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.</p>
        <p>[19] G. Puccetti, A. Esuli, AIMH at MULTI-Fake-DetectiVE: System report, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[20] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[21] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.</p>
        <p>[22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</p>
        <p>[23] S. Sarkar, N. Tudu, D. Das, HIJLI-JU-CLEF at MULTI-Fake-DetectiVE: Multimodal fake news detection using deep learning approach, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[24] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, J. Gao, Vision-language pre-training: Basics, recent advances, and future trends, Foundations and Trends® in Computer Graphics and Vision 14 (2022) 163–352.</p>
        <p>[25] V. Basile, Is EVALITA done? On the impact of prompting on the Italian NLP evaluation campaign, in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022), Udine, November 30th, 2022, volume 3287 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 127–140.</p>
        <p>[26] L. C. Passaro, A. Lenci, Evaluating context selection strategies to build emotive vector space models, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 2185–2191.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>