<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Applied Re</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2020.semeval-1.186</article-id>
      <title-group>
        <article-title>MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the MULTImodal Fake News Detection and VErification Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pietro Dell'Oglio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Marcelloni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia C. Passaro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Sabbatini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>33</volume>
      <fpage>60</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>This paper introduces the MULTI-Fake-DetectiVE shared task for the EVALITA 2023 campaign. The task was aimed at exploring multimodality within the realm of fake news and intended to address the problem from two perspectives, represented by the two sub-tasks. In sub-task 1, we aimed to evaluate the effectiveness of multimodal fake news detection systems. In sub-task 2, we sought to gain insights into the interplay between text and images, specifically how they mutually influence the interpretation of content in the context of distinguishing between fake and real news. Both perspectives were framed as classification problems. The paper presents an overview of the task. In particular, we detail the key aspects of the task, including the creation of a new dataset for fake news detection in Italian, the evaluation methodology and criteria, the participant systems, and their results. In light of the obtained results, we argue that the problem is still open and propose some future directions.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News</kwd>
        <kwd>Fake news detection</kwd>
        <kwd>Multi-modality</kwd>
        <kwd>Vision-Language models</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Recent years have seen a great increase in the online proliferation of disinformation and fake news [1]. This is especially true in the context of real-world events that are reported as breaking news. It is often the case that entities with malicious intents exploit breaking news to push their own agenda by distorting facts and intentionally publishing false or misleading information.</p>
      <p>Distorted uses of online social media have been made most evident in the last few years by the first so-called infodemic following the COVID-19 pandemic [2], in what has been defined by several authors as a Post-Truth Era [3] dominated by emotions and pseudo-facts [4]. This phenomenon has grown further with the outbreak of the Russian war against Ukraine. Like in all conflicts, disinformation has become a powerful strategic weapon.</p>
      <p>These issues have led over the years to the creation of numerous initiatives for independent fact-checking and fake news detection, and the topic has gained relevance in the research community. The literature on issues related to fake news detection, disinformation, and fact-checking is constantly growing, despite the inherent challenges and many facets of the problem.</p>
      <p>A large number of approaches and techniques have been proposed for content verification and fake news detection in a uni-modal setting. Most of the proposed approaches use either the actual content of the news (i.e., the text itself), its context (e.g., social network structures, temporal information), or a combination of both [5]. Most modern systems typically leverage transformer models with additional information [6].</p>
      <p>It is clear that the easiest way to spread disinformation is in textual form. However, online outlets and social media allow for other modalities as well. Images, for example, can be leveraged in the context of disinformation and fake news in different ways: first, the inclusion of images in malicious content can be leveraged as a way to provide more credibility for the text containing the fake news; second, images could be described in such ways that their original content is misinterpreted by readers, leading again to disinformation; finally, they can be used in an attempt to increase the attraction of the post and get the fake news shared by as many social media users as possible.</p>
      <p>We can argue that multimodal scenarios may be considered as closer in nature to real-world ones when examining social media data. Nevertheless, multimodality has received relatively less attention over the years in this context [4]. This is rapidly changing, with a number of international multimodal shared tasks being organised for fake news and propaganda detection, fact-checking, and related areas [7, 8, 9, 10]. Nevertheless, models combining multiple modalities for detecting fake news remain a major open challenge in the literature, as do datasets including different modalities and different sources of fake news [4]. Moreover, we believe that a fundamental step towards a more nuanced understanding of the problem lies in actually understanding and modelling the interplay between the different modalities in generating disinformation.</p>
      <p>In this context, we propose MULTI-Fake-DetectiVE (https://sites.google.com/unipi.it/multi-fake-detective/home) as part of the EVALITA 2023 Evaluation campaign [11]. The task is aimed at addressing both the textual and visual aspects of fake news on social media and online news outlets, from two key perspectives: we want to model fake news detection from a multimodal perspective, and we are interested in exploring how images and texts interact and influence each other in the context of real and fake news. Further, we contribute to this research area by creating a dataset of social media posts from Twitter and news articles regarding the Russian-Ukrainian war, including fake and real news.</p>
      <p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT. Corresponding author: alessandro.bondielli@unipi.it (A. Bondielli). Other author contacts: lucia.passaro@unipi.it (L. C. Passaro), marco.sabbatini@unipi.it (M. Sabbatini). ORCID: 0000-0003-3426-6643 (A. Bondielli), 0000-0002-0793-5226 (P. Dell'Oglio), 0000-0001-5790-4308 (A. Lenci). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Definition of the task</title>
      <p>MULTI-Fake-DetectiVE includes two sub-tasks. Both are formulated as multi-class classification problems. In the first sub-task, given a piece of content (i.e., a social media post or a news article) that includes both a visual and a textual component, the goal is to determine its likelihood of being real or fake news. In the second sub-task, given a text and an accompanying image, the goal is to decide whether or not their combination is aimed at misleading the reader's interpretation of one or the other. Note that for both sub-tasks we consider the visual component to be all the images provided with a given textual content (i.e., news article or social media post). Thus, for example, if a tweet includes three images and one of them is misleading, the expected label will be misleading.</p>
      <p>In the following, we describe both sub-tasks in detail.</p>
      <sec id="sec-2-1">
        <title>2.1. Sub-task 1: Multimodal Fake News Detection</title>
        <p>The first sub-task is structured as a multi-class classification problem in a multimodal setting. The problem is defined as follows: given a piece of content C = ⟨T, V⟩, which includes a textual component T and a visual component V (i.e., one or more images), classify it into one of the following classes:</p>
        <p>Certainly Fake: news that is certain to be fake, whatever the context.</p>
        <p>Probably Fake: news that is likely to be fake, but may include some real information or at the very least be somewhat credible.</p>
        <p>Probably Real: news that is very credible but still retains some degree of uncertainty about the provided information.</p>
        <p>Certainly Real: news that is certain to be real and incontestable, whatever the context.</p>
        <p>The classes refer to the informational content as a whole, and not to its single components. For example, a fake piece of news including a real image (e.g., in a misleading context) is still probably (or certainly) fake.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sub-task 2: Cross-modal relations in Fake and Real News</title>
        <p>The second sub-task is aimed at assessing how the two modalities (i.e., textual and visual) interact in the context of fake and real news. Our goal is to understand how images and texts in fake and real news can lead to misleading interpretations of the content pertaining to the other modality and to the whole news.</p>
        <p>The sub-task is a three-class classification problem, and is defined as follows: given a piece of content C = ⟨T, V⟩, which includes a textual component T and a visual component V, decide whether their combination is:</p>
        <p>Misleading: one of the textual and visual components is used deceptively to lead to misinterpretation of the other.</p>
        <p>Not Misleading: the combination of the visual and textual components does NOT lead to misinterpretation of the news.</p>
        <p>Unrelated: the visual component is not related to the textual component, or does not add information to it, or does not change its interpretation in a meaningful way.</p>
      </sec>
    </sec>
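    <p>For illustration, the content pair C = ⟨T, V⟩ and the two labelling schemes defined above can be sketched as a small data model. This is our own sketch for clarity; the class names and integer encoding below are assumptions, not the official encoding distributed with the task data.</p>
    <preformat>
```python
from dataclasses import dataclass
from enum import Enum

class FakeNewsLabel(Enum):
    """Sub-task 1: four classes with an inherent ordering."""
    CERTAINLY_FAKE = 0
    PROBABLY_FAKE = 1
    PROBABLY_REAL = 2
    CERTAINLY_REAL = 3

class CrossModalLabel(Enum):
    """Sub-task 2: three classes with no inherent ordering."""
    MISLEADING = 0
    NOT_MISLEADING = 1
    UNRELATED = 2

@dataclass
class Content:
    """A piece of content C = (T, V): a text plus ALL images attached to it."""
    text: str
    image_urls: list

# A tweet with three images where one image is misleading is labelled
# Misleading as a whole: labels refer to the full content, not to parts.
sample = Content("Breaking news ...", ["img1.jpg", "img2.jpg", "img3.jpg"])
label = CrossModalLabel.MISLEADING
```
    </preformat>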
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset for the shared task includes social media posts and news articles, containing both a textual and a visual component, concerning one or more real-world events that are known to have been subject to the generation of fake news. In particular, the dataset focuses on the Ukrainian-Russian war, and includes data in a time span going from February 2022 to December 2022.</p>
      <p>The dataset is composed of two sub-datasets, one for each sub-task. Each is further split into a training set and two different test sets. More specifically, the dataset for each sub-task is divided as follows:</p>
      <p>Training Set: the training data provided to participants. It includes data from February 2022 to September 2022.</p>
      <p>Test Set (Official): the official test set used for evaluation. It includes data from October 2022 to December 2022.</p>
      <p>Test Set (Additional): an additional batch of test data including data from the same time window as the training set.</p>
      <p>The Official Test Set was developed to challenge participating systems to classify fake news and misleading content in a more real-world scenario (i.e., different time windows that might determine different data distributions). The Additional Test Set was instead aimed at giving us a clearer picture of how resilient participating systems are to changes in the context over time [12]. Note that the evaluation on the Additional Test Set was not mandatory.</p>
      <p>The dataset is available for download on the website of the task (https://sites.google.com/unipi.it/multi-fake-detective/data).</p>
      <sec id="sec-3-1">
        <title>3.1. Data Collection and annotation</title>
        <p>The dataset was collected and annotated via crowdsourcing, following a multi-step process heavily inspired by the one proposed in [13]. First, we broadly collected Twitter data regarding the Ukrainian-Russian war in the chosen time span. To collect such data, we chose a set of keywords representative of the conflict, e.g., “Ucraina, Russia, Putin, Zelensky”. In addition to this, we collected texts and images for news articles that were linked in the tweets. At this stage, the data were collected regardless of the sub-tasks.</p>
        <p>Then, we exploited a manually collected set of verified fake news and misleading claims (henceforth referred to as seed fake news and misleading claims) to generate the dataset for each sub-task. We took into account different news outlets reporting on the fake news and independent fact-checking websites. These seed fake news and misleading claims were intended to serve a dual purpose. On the one hand, we used them to filter the original dataset by considering their similarity with data samples. This was done to ensure that: i) the resulting datasets would include only relevant elements (i.e., that actually refer to the Ukrainian-Russian war), and ii) the class distribution for both sub-tasks was not too skewed in favour of real news and not misleading claims, as it would have been in an uncontrolled scenario. On the other hand, the seed fake news and misleading claims served as context for the annotation process. Specifically, we used Prolific (prolific.co) to obtain labels for our dataset. For each sub-task, we provided annotators with the seed fake news and misleading claims as context, and asked them to label a few of the data samples. Each data sample was labelled by at least five different annotators. We collected the human annotations and kept only data samples for which at least 3 out of the 5 annotators provided the same label.</p>
        <p>Dataset sizes and class distributions are reported in Tables 1 and 2 for, respectively, sub-tasks 1 and 2. In sub-task 1, inter-annotator agreement was calculated as the average Spearman correlation coefficient between annotator pairs, considering the ordered nature of the labels. The average correlation was 0.43 (σ = 0.04). In sub-task 2 we employed Fleiss’ Kappa to measure the inter-annotator agreement, since the labels were not inherently ordered. We obtained κ = 0.25.</p>
        <p>Participants were provided with a TSV file containing IDs, URLs, and numeric labels representing the classes for sub-task 1 and sub-task 2. The label was excluded from the test set during the evaluation period. Participants had the option to download the data using their preferred method or utilise a provided download script. The script offered participants access to textual data, including metadata such as URLs, data type (e.g., tweet or article), and creation date if available, as well as associated images. Authorship information was not provided with the data. Note that while the datasets were treated separately for annotation, some data samples could be present in both sub-tasks. In such cases, the ID associated with the data point remained consistent across the two sub-tasks.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Copyright and Content Warning</title>
        <p>The dataset includes tweets and news articles. The provided download script performs a coarse-grained anonymization of the data (e.g., by removing author information).</p>
        <p>Upon download, users agree not to share the material they receive both during and after the competition. The data for the MULTI-Fake-DetectiVE tasks is to be used for research purposes only. Note that by receiving the data, users implicitly agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy for academic researchers.</p>
        <p>We do not share responsibility for the contents of the dataset. Downloaded texts and images may include copyrighted material and sensitive contents. The downloaded data and the provided labels do not reflect in any way the social and political views of the task organisers.</p>
      </sec>
    </sec>
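    <p>The label aggregation and agreement computations described in Section 3.1 can be sketched as follows. This is our own minimal pure-Python illustration, not the organisers' code; it assumes integer-encoded labels and uses a tie-aware rank computation for Spearman.</p>
    <preformat>
```python
from collections import Counter
from itertools import combinations

def majority_label(votes, min_agree=3):
    """Keep a sample only if at least min_agree of the annotators agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

def ranks(xs):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    i = 0
    while n > i:
        j = i
        while n > j and xs[order[j]] == xs[order[i]]:
            j += 1
        for k in order[i:j]:
            out[k] = (i + j + 1) / 2.0
        i = j
    return out

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def average_pairwise_spearman(annotations):
    """annotations: one list of ordered labels per annotator, aligned by sample."""
    scores = [spearman(a, b) for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)
```
    </preformat>
    <p>For sub-task 2, whose labels are unordered, a chance-corrected categorical measure such as Fleiss' Kappa would be used in place of the rank correlation.</p>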
    <sec id="sec-4">
      <title>4. Evaluation measures</title>
      <p>Participants were allowed to present up to four different systems for predicting labels on the official test set, with one system marked as primary. Results for primary systems were used as the basis for the final ranking. Specifically, the ranking was calculated based on the weighted average F1-score of the systems. The same evaluation procedure and criteria were applied to both sub-tasks. The evaluation procedure was conducted by means of an evaluation script (available to participants).</p>
      <p>Note that due to restrictions in data distribution (see
Section 3), not all participants may have had access to the
exact same test dataset. For example, articles/tweets in
the dataset may have been removed by the authors during
the evaluation window. To ensure fair competition, we
evaluated and ranked the systems only on the subsets of
the test sets for which all the participants were able to
provide a label.</p>
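      <p>The ranking procedure described above can be sketched as follows: compute each primary system's weighted average F1-score, restricted to the test-set subset that every participant was able to label. This is our own minimal re-implementation for illustration, not the organisers' evaluation script; in practice the same metric is provided by sklearn.metrics.f1_score with average="weighted".</p>
      <preformat>
```python
from collections import Counter

def weighted_f1(gold, pred):
    """Weighted average F1: per-class F1 weighted by class support in gold."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for lab in support:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (support[lab] / total) * f1
    return score

def rank_systems(gold, runs):
    """gold: {sample_id: label}; runs: {system_name: {sample_id: label}}.
    Systems are compared only on the ids that ALL of them labelled."""
    common = set(gold)
    for preds in runs.values():
        common = common.intersection(preds)
    ids = sorted(common)
    ref = [gold[i] for i in ids]
    scored = [(name, weighted_f1(ref, [preds[i] for i in ids]))
              for name, preds in runs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```
      </preformat>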
      <sec id="sec-4-1">
        <title>4.1. Baseline models</title>
        <p>Participating systems were evaluated against each other and against a set of baseline models.</p>
        <p>Specifically, we proposed two different classification models, namely a Support Vector Machine (SVM) and a Multi-Layer Perceptron (MLP), with three different feature sets as the baseline models. As for the feature sets, we considered:</p>
        <p>Text-only features, extracted with a multilingual BERT model [14].</p>
        <p>Image-only features, extracted with ResNet-18 [15].</p>
        <p>Multimodal features, obtained by concatenating the text-only and image-only features.</p>
        <p>All models were trained using the default parameters from scikit-learn (https://scikit-learn.org/stable/supervised_learning.html).</p>
        <p>To ensure fair reproducibility and comparisons, the baseline models and the evaluation scripts are available on the website of the task (https://sites.google.com/unipi.it/multi-fake-detective/tasks-and-evaluation).</p>
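        <p>The baseline setup can be sketched as follows. The feature extractors are stubbed out here for brevity (in the actual baselines they would be a multilingual BERT text encoder and a ResNet-18 image encoder, with the classifiers taken from scikit-learn using default parameters); how features from multiple images per sample are combined is our assumption, not stated by the task description.</p>
        <preformat>
```python
def encode_text(text):
    # stand-in for a multilingual BERT text embedding (real dim: 768)
    return [float(len(text)), float(text.count(" "))]

def encode_image(path):
    # stand-in for ResNet-18 penultimate-layer features (real dim: 512)
    return [float(len(path)), 1.0]

def feature_sets(text, image_paths):
    """Build the three baseline feature sets: text-only, image-only, multimodal."""
    txt = encode_text(text)
    imgs = [encode_image(p) for p in image_paths] or [[0.0, 0.0]]
    # assumption: average the features of all images attached to a sample
    img = [sum(col) / len(imgs) for col in zip(*imgs)]
    return {"text": txt, "image": img, "multi": txt + img}

# With real features, each set is fed to an SVM and an MLP, e.g.:
#   from sklearn.svm import SVC
#   from sklearn.neural_network import MLPClassifier
#   SVC().fit(X_train, y_train)            # default parameters
#   MLPClassifier().fit(X_train, y_train)  # default parameters
feats = feature_sets("A breaking news item", ["a.jpg", "b.jpg"])
```
        </preformat>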
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Participating systems and results</title>
      <p>A total of four teams participated in MULTI-Fake-DetectiVE. All four teams participated in sub-task 1 (Multimodal Fake News Detection), and two of them also participated in sub-task 2 (Cross-modal relations in Fake and Real News). The proposed approaches are quite different. We can distinguish between two truly multimodal approaches and two text-oriented ones. In the following, we broadly describe the core systems of the participating teams.</p>
      <p>PoliTo [16] participated in both sub-tasks with an approach focused on refining FND-CLIP [17], a multimodal fake news detection model based on CLIP [18]. The authors proposed several refinements to the original model via ad-hoc extensions, including sentiment-based text encoding, image transformations in the frequency domain, and data augmentation via back translation. The final model for both sub-tasks is an ensemble that combines the predictions of all the extensions.</p>
      <p>AIMH [19] participated in both sub-tasks with a vision-text dual encoder approach. They used ViT to encode images and RoBERTa/BERT to encode texts. The authors experimented with different inputs for their model. They generated image captions and automatically translated Italian texts to English. They tested various input combinations and chose to use English texts and images as inputs for their final model.</p>
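      <p>The dual encoder design boils down to encoding each modality separately and fusing the embeddings before a classification layer. The following is a deliberately tiny pure-Python stand-in for that fusion head, with hand-made vectors in place of real ViT and RoBERTa/BERT embeddings; it is our sketch, not the team's code.</p>
      <preformat>
```python
def classify_fused(text_emb, image_emb, weights, bias):
    """Late fusion: concatenate the two embeddings, apply one linear layer,
    and return the index of the highest-scoring class plus the raw scores."""
    fused = text_emb + image_emb  # list concatenation = feature concatenation
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(scores)), key=lambda i: scores[i]), scores

# Toy example: 2-dim text embedding, 2-dim image embedding, 2 classes.
W = [[1.0, 0.0, 0.0, 0.0],   # class 0 looks at the first text feature
     [0.0, 0.0, 1.0, 0.0]]   # class 1 looks at the first image feature
b = [0.0, 0.0]
pred, _ = classify_fused([0.2, 0.5], [0.9, 0.1], W, b)
```
      </preformat>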
      <p>ExtremITA [20] participated with a text-only approach aimed at solving all EVALITA tasks via prompt engineering of Large Language Models. The team proposed two Italian models: an encoder-decoder based on T5 [21] and an instruction-tuned decoder-only model based on LLaMA [22].</p>
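      <p>Such an approach reduces classification to text generation: the model receives an instruction-style prompt and must emit a label string. A generic sketch of a prompt builder is shown below; the exact prompt wording, examples, and model interface used by the team are not reproduced here, so everything in this sketch is illustrative.</p>
      <preformat>
```python
LABELS = ["certainly fake", "probably fake", "probably real", "certainly real"]

def build_prompt(text, few_shot_examples):
    """Instruction prompt: task description, a few worked examples,
    then the item to classify; the LLM is expected to emit one label."""
    parts = ["Classify the following news item as one of: " + ", ".join(LABELS) + "."]
    for ex_text, ex_label in few_shot_examples:
        parts.append("Text: " + ex_text + "\nLabel: " + ex_label)
    parts.append("Text: " + text + "\nLabel:")
    return "\n\n".join(parts)

def parse_label(generated):
    """Map the generated string back onto the closest known label."""
    g = generated.strip().lower()
    for lab in LABELS:
        if g.startswith(lab):
            return lab
    return None

prompt = build_prompt("Esempio di notizia ...", [("Old claim ...", "probably fake")])
```
      </preformat>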
      <p>HIJLI-JU-CLEF [23] proposed a text-oriented model to solve sub-task 1. The model uses a pre-trained</p>
      <sec id="sec-5-5">
        <title>5.1. Results of Sub-task 1</title>
        <p>All participating systems attempted to solve the Multimodal Fake News Detection sub-task. Tables 3 and 4 detail the results obtained by each system, including baselines, on the Official and Additional test sets, respectively.</p>
        <p>As for the Official test set, the PoliTo ensemble model and the LLaMA-based ExtremITA model ranked first and second, with close results. The AIMH vision-text dual encoder model ranked third. All three models were able to outperform all baseline systems, albeit marginally. The best performing baselines are the text-based and multimodal SVM models. The other baseline models performed significantly worse. Finally, the HIJLI-JU-CLEF text-oriented system was able to outperform two out of the six baseline models proposed, ranking fourth among participants and eighth globally.</p>
        <p>As for the Additional test set, the best performing model was the LLaMA-based ExtremITA, closely followed by the PoliTo ensemble approach. The T5-based ExtremITA model performed significantly worse.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.2. Results of Sub-task 2</title>
        <p>Only the truly multimodal models participated in the Cross-modal relations in Fake and Real News sub-task. This is due to the fact that the task is inherently multimodal and cannot be modelled properly with text-only models: the relationship between image and text features lies at the core of the task, and thus images have to be modelled in some capacity to face it effectively.</p>
        <p>Table 5 shows the results obtained by each system, including baselines, on the Official test set. The PoliTo ensemble model ranked first, outperforming all baseline models, while the AIMH vision-text dual encoder model outperformed only the image-only MLP model, ranking seventh. Among baselines, surprisingly, the best-performing ones are the text-only models, followed by the multimodal ones. We suspect that the text-only baseline performances are to be attributed to chance rather than to their effective modelling of the problem.</p>
        <p>Only the PoliTo team participated in the Additional evaluation, obtaining a weighted average F1-score of 0.61.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We can draw some interesting insights from comparing the different proposed models, both in terms of their architectures and their results.</p>
      <p>General findings. First, we can argue that multimodal fake news detection and cross-modal analysis of images and texts in the context of fake news are two rather challenging tasks. As shown by the agreement metrics, it was a challenging task for annotators as well (see Sec. 3). This is also reflected by the fact that even the best performing systems were not able to considerably improve over the results of the baseline models.</p>
      <p>As for performances between the Official and Additional test sets, we saw a rather large discrepancy among tasks. We expected systems to perform better on data from the same time period. This appears to be true for sub-task 2, but not for sub-task 1. Note however that the only true comparison can be made on the PoliTo system, as it is the only one that participated in both the Official and Additional evaluation for both tasks.</p>
      <p>Finally, we must point out the weaknesses of the baseline models. While participating systems were able to perform consistently across sub-tasks and test sets, the performances of the baseline systems exhibit significant variability, with the relative rankings and performance disparities among models varying across tasks. This suggests that the baselines are unable to adequately model features of both modalities and to leverage them for the tasks.</p>
      <p>The architectures of the systems. In sub-task 1, only two out of the four participating systems could be considered truly multimodal, since they explicitly model image-level features (i.e., with an image encoder model). They are quite similar in principle: both leverage a dual encoder architecture [24] with a classification layer. A ViT image encoder was chosen by both approaches, albeit trained on different data. The text encoders employed in the AIMH system are RoBERTa (for English translations) and an Italian version of BERT (for original texts). The PoliTo team uses the FND-CLIP text encoder, which is based on GPT instead. The popularity of CLIP-like Vision-Language models is evident, due to their versatility and ease of adaptation for various scenarios and downstream tasks, including fake news detection. The main differences are the extensions (e.g., the inclusion of sentiment-aware text features and image transformations) proposed by the PoliTo team. The authors report performance increases with all proposed extensions, with the ensemble classifier performing best. The remaining two systems either disregard images due to the model architecture (ExtremITA) or consider their automatically generated caption (HIJLI-JU-CLEF), shifting the problem to a text-only space.</p>
      <p>The importance of additional processing. Sub-task 2 was specifically developed to frame the problem as a multimodal one. Only the two truly multimodal systems participated. As previously discussed, both systems employ a similar architecture and are arguably comparable in terms of model size. Thus, we can hypothesize that the difference in performances in both sub-task 1 and sub-task 2 may be attributed mostly to the additional processing and extensions applied by the PoliTo system. We could further argue that the tasks are both very complex and nuanced, and that additional forms of processing and/or features may provide important benefits in this scenario, rather than sole reliance on textual and visual features extracted from pre-trained models.</p>
      <p>The role of textual content. The results of sub-task 1 do not directly highlight a clear advantage of a multimodal approach over a uni-modal one. Two out of the three models which outperform all the baselines are actually multimodal, with the PoliTo FND-CLIP-IT ensemble outperforming all the others. However, the runner-up was the text-only Italian LoRA model based on LLaMA, with near identical performances. We can hypothesise that while modelling images in conjunction with text is actually helpful for determining whether a piece of content is real or fake news, a large part of the key information needed to answer the question lies within the textual content. If we assume this, it is easier to understand how model scale also plays a crucial role in performances. The only Large Language Model (LLM) presented can get close to the performances of a more refined and nuanced approach in a few-shot setting via prompting. Note that this may hold true regardless of the fact that Italian pre-trained LLMs are still not consistently outperforming all other approaches as their English counterparts do thanks to their sheer size [25].</p>
    </sec>
    <sec id="sec-6-2">
      <title>7. Conclusions and future directions</title>
      <p>In this paper, we presented the MULTI-Fake-DetectiVE shared task for EVALITA 2023. The task was focused on multimodality in the context of fake news. We considered the problem from two perspectives: we wanted to assess fake news detection systems in a multimodal setting, and we wanted to understand how text and images influence each other in the interpretation of a piece of content in the context of fake and real news. We framed both as classification problems.</p>
      <p>We saw an interesting degree of variety among the proposed systems, which we categorized as truly multimodal or text-oriented. By analyzing the proposed approaches and their results, we can summarise our findings as follows. First, multimodal fake news detection is a very challenging task, especially when considering near real-world scenarios. Second, we saw that for similar vision-language models, both in terms of architecture and model scale, extending the boundaries of the problem by considering additional and/or alternative processing strategies, including affective-oriented features and image processing in the frequency domain, is highly beneficial. Third, we saw that model scale also plays an important role, with pre-trained LLMs approaching the performances of thoroughly fine-tuned systems.</p>
      <p>Our findings suggest that the problem is still open, and that moving forward it could be advantageous to jointly leverage the advantages of the best-performing approaches, for instance by focusing on large pre-trained Vision-Language models augmented with additional features (e.g., by using available emotive resources [26]) via either fine-tuning or appropriate prompt tuning/engineering.</p>
      <p>Due to the current challenges posed by deceptive or misleading content on social media, we believe that an effective understanding and modelling of such a complex problem may prove to be highly beneficial in contrasting online disinformation. In this regard, the MULTI-Fake-DetectiVE task, including the proposed approaches and the provided datasets, may serve the Italian NLP community as an initial stepping stone in addressing this issue for the Italian language.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <p>This research was partially supported by the Italian Ministry of University and Research (MUR) in the framework of the PON 2014-2021 “Research and Innovation” resources – Innovation Action - DM MUR 1062/2021, Title of the Research: “Modelli semantici multimodali per l’industria 4.0 e le digital humanities” (“Multimodal semantic models for Industry 4.0 and the digital humanities”), of PNRR M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered AI”, funded by the European Commission under the NextGeneration EU programme, and of the CrossLab and FoReLab projects (Departments of Excellence).</p>
        <p>V. Lomonaco, D. Bacciu, Continual pre-training mitigates forgetting in language and vision, arXiv preprint arXiv:2205.09357 (2022).</p>
        <p>[13] L. C. Passaro, A. Bondielli, P. Dell’Oglio, A. Lenci, F. Marcelloni, In-context annotation of topic-oriented datasets of fake news: A case study on the Notre-Dame fire event, Information Sciences (2022).</p>
        <p>[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
        <p>[16] L. D’Amico, D. Napolitano, L. Vaiani, L. Cagliero, PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for multimodal Italian fake news detection, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[17] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang, Multimodal fake news detection via CLIP-guided learning, arXiv preprint arXiv:2205.14304 (2022).</p>
        <p>[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.</p>
        <p>[19] G. Puccetti, A. Esuli, AIMH at MULTI-Fake-DetectiVE: System report, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[20] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[21] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.</p>
        <p>[22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</p>
        <p>[23] S. Sarkar, N. Tudu, D. Das, HIJLI-JU-CLEF at MULTI-Fake-DetectiVE: Multimodal fake news detection using deep learning approach, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[24] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, J. Gao, Vision-language pre-training: Basics, recent advances, and future trends, Foundations and Trends® in Computer Graphics and Vision 14 (2022) 163–352.</p>
        <p>[25] V. Basile, Is EVALITA done? On the impact of prompting on the Italian NLP evaluation campaign, in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022), Udine, November 30th, 2022, volume 3287 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 127–140.</p>
        <p>[26] L. C. Passaro, A. Lenci, Evaluating context selection strategies to build emotive vector space models, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 2185–2191.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>