Multimodal Attention is all you need

Marco Saioni¹,*, Cristina Giannone¹,²
¹ University G. Marconi, Rome, IT
² Almawave S.p.A., Via di Casal Boccone, 188-190, 00137 Rome, IT

Abstract
In this paper, we present a multimodal model for classifying fake news. The main peculiarity of the proposed model is its cross-attention mechanism. Cross-attention is an evolution of the attention mechanism that allows the model to examine intermodal relationships to better understand information from different modalities, enabling it to simultaneously focus on the relevant parts of the data extracted from each. We tested the model using the MULTI-Fake-DetectiVE data from EVALITA 2023. The presented model is particularly effective in both classifying fake news and evaluating the intermodal relationship.

Keywords
Transformer, fake news classification, multimodal classification, cross attention

1. Introduction

The Internet has facilitated communication by enabling rapid, immersive information exchanges. However, it is also increasingly used to convey falsehoods, so today, more than ever, the rapid spread of fake news can have severe consequences, from inciting hatred to influencing financial markets or the course of political elections, to endangering world security. For this reason, mitigating the growing spread of fake news on the web has become a significant challenge.

Fake news manifests itself on the internet through text, images, video, audio or, in general, a combination of these modalities, i.e. in a multimodal way. In this article, we consider the two components of a news item, text and image, as it might be proposed, for instance, on a social network. In this work we propose an approach to automatically and promptly identify fake news. We use the dataset of the MULTI-Fake-DetectiVE¹ competition, proposed at EVALITA 2023². The competition aims to evaluate the truthfulness of news that combines text and images, an aim expressed through two tasks: the first carries out the identification of fake news (Multimodal Fake News Detection); the second seeks relationships between the two modalities, text and image, by observing the presence or absence of correlation or mutual implication (Cross-modal relations in Fake and Real News).

Our approach proposes a Transformer-based model that focuses on relating the textual and visual embeddings of the input samples (i.e., the vector representations of the text and images it receives as input). The aim was to find a way to reconcile the two different representation embeddings, which are learned separately from two different corpora (text and images), trying to capture their mutual relationships through some interaction between the respective semantic spaces.

The remainder of the paper is structured as follows: section 2 presents a brief overview of related work, and section 3 describes the architecture of the proposed model. Section 4 gives an overview of our experiments. Sections 5 and 6 present the final results and our conclusions, respectively.

2. Related Works

The Italian MULTI-Fake-DetectiVE competition [2] adds to the various datasets and challenges on multimodal fake news developed recently, for instance Factify [3] and Fakeddit [4]. The creation of these competitions shows the interest in this task. The first task of the Italian challenge saw three completely different systems placed on the podium. The first system, POLITO [5], was based on the FND-CLIP multimodal architecture [6], proposing some ad hoc extensions of CLIP [7] including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. The ExtremITA system [8], which placed second, exploited the capabilities of LLMs, focusing only on the textual component of each news item.
They fine-tuned the open-source LLM Camoscio [9] on the textual part of the dataset. The impressive results show how the textual component plays a primary role in identifying fake news. Despite the significant contribution of the textual component to the task, more and more multimodal approaches are taking hold. In [10] a CNN architecture combining texts and images to classify fake news was proposed. In the same direction, approaches such as CB-Fake [11] incorporate the encoder representations of the BERT model to extract the textual features and combine them with a model that extracts the image features. These features are combined to obtain a richer data representation that helps to determine whether the news is fake or real. Vision-language models have, in general, gained a lot of interest in recent years, in the "large models era", and have been proposed with surprising results in many visual-language interaction tasks [12], [13].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author: marco.saioni@gmail.com (M. Saioni); c.giannone@unimarconi.it (C. Giannone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://sites.google.com/unipi.it/multi-fake-detective
² https://www.evalita.it [1]

3. The proposed Model

The objective was to "engage" specialist models for natural language processing and artificial vision, making them discover and learn bimodal features from text and images collaboratively and harmoniously by applying the teachings of Vaswani et al. [14]: we decided to follow the path indicated by "Attention is all you need", Vaswani et al.'s
very famous paper, following up on the intuition that the Attention mechanism could provide important added value to a multimodal model for the identification of fake news, becoming a Multimodal Attention (hence the title of this article), i.e. an attention mechanism applied between the two modes of a news item, the textual and the visual. In fact, while Attention or Self-Attention (as described in Vaswani et al.'s paper) takes as input the embeddings of a single modality and transforms them into more informative, contextualized embeddings, Multimodal Attention takes as input the embeddings of the two different modalities, combining them and transforming them into a single embedding capable of capturing any existing relationships between the two input modes.

3.1. Architecture

Multimodal Attention is the heart that supports the proposed model, making it capable of exploring the hidden aspects of multimodal communication. As shown at a high level in Figure 1, the architecture of the proposed model consists of a hierarchical structure of three layers preceded by a pre-processing step. In order, there are: a pre-processing step, an input layer, a cross-modal layer and a fusion layer.

[Figure 1: Proposed model architecture.]

We decided to propose a network that models the consistent information between the two modalities, textual and visual, starting from state-of-the-art pre-trained neural networks. In particular, we use a pre-trained BERT [15] model to learn word embeddings from the textual component of the news and a pre-trained ResNet [16] model to learn visual embeddings from the visual component. The two embeddings, belonging to two spaces with different dimensions, are first projected into a uniform, reduced-dimensional space, then related to each other with the strategy of mutual cross-attention to obtain two embeddings that are subsequently concatenated to provide the input of the final dense classification layer.

3.1.1. Pre-processing step

As a first step it is necessary to process the data made available by the organizers of the MULTI-Fake-DetectiVE competition to produce inputs that are compatible with those expected by the pre-trained models. The choices made for the pre-processing of the dataset and the data "personalization" strategy can be summarized in the following three points:

• resolution/explosion of the 1:N relationships between text and images into N separate 1:1 relationships;
• data augmentation, with the creation of an additional image to support the original one already present in each example;
• management of the textual component, which is truncated by BERT, or rather by the relevant tokenizer, to a fixed maximum length of tokens.

Following this processing of the visual and textual components, for each single sample we move from the original pairs <t, v+>, where v+ indicates the 1:N ratio between text in natural language and images in JPEG format, to triples appropriately translated into numbers:

<t_trunc, v, v_aug>

where t_trunc indicates, for each sample, a first-order tensor with 128 values (tokens), while v and v_aug denote third-order tensors with 224 × 224 × 3 values (pixels). The first-order tensor is the representation of the text in numerical form according to the default strategy of the BERT tokenizer, while the third-order tensors are the representation of the images in numerical form according to the RGB coding expected by ResNet.

3.1.2. Input layer

This layer receives as input the previously processed dataset, i.e. the text and the images represented in numerical form, and passes it to the pre-trained BERT and ResNet models to obtain the respective embeddings, which are subsequently projected into a space with small, common dimensions to make them comparable and to allow them to collaborate in the subsequent cross-modal layer.

BERT Encoder. Each sample, pre-processed and represented in numerical form by the tokenizer, is passed as input to the pre-trained BERT model, which returns several output tensors. For the purposes of the classification task of this study, we consider the pooled_output, a compact representation of the whole token sequence given as input to the BERT model, obtained via the special token [CLS]. It is therefore a summary of the information extracted from the entire input, whose dimensions depend on the number of hidden units of BERT. Since each text supplied as input to BERT corresponds to a tensor with 768 real values, using vector notation we have:

e_t = BERT(t_trunc)[pooled_output]

where e_t ∈ R^h is the word embedding vector, t_trunc ∈ R^(N_max) is the token input vector and h = 768 is the BERT hidden size. The equation refers to a single sample but extends to the entire batch of N examples processed by BERT; indicating this batch with T_trunc ∈ R^(N×N_max), we have:

E_t = BERT(T_trunc)[pooled_output]

where E_t ∈ R^(N×h) is the text embedding matrix learned by the BERT model.

ResNet Encoder. The two images of each sample, previously represented in numerical form, are passed as input to the pre-trained ResNet model, which returns for each example a visual embedding of size h_r representing, in a compact and semantic form, the features extracted through convolutions and pooling within the ResNet network. To obtain visual embeddings from a pre-trained neural network like ResNet, we usually take the output of the penultimate layer, i.e. the global pooling. In the proposed model, ResNet50V2 was chosen, whose global pooling reduces the spatial dimensions of the output tensor to 2048 values; each input image therefore corresponds in output to a vector with h_r = 2048 values, which represents the visual embedding extracted from the network for that specific image. Using the same formalism as for the text encoder, we have:

e_v = ResNet(v)[global_pooling]

where e_v ∈ R^(h_r) is the visual embedding vector and v ∈ R^(L×H×C) the input third-order tensor. The equation refers to a single sample but extends to the entire batch of N examples; indicating the batch with V ∈ R^(N×L×H×C), we have:

E_v = ResNet(V)[global_pooling]

where E_v ∈ R^(N×h_r) is the visual embedding matrix learned by the ResNet model. A similar equation holds at batch level for the second (augmented) image:

E_vaug = ResNet(V_aug)[global_pooling]

where E_vaug ∈ R^(N×h_r). By concatenating the two embeddings, i.e. by joining, for each sample, the embedding of the original image and that of its augmented copy, we obtain a single output tensor of size 2 × h_r = 4096:

E_v ⊕ E_vaug = E_concat(v,vaug) ∈ R^(N×2h_r)

From this point on, for simplicity of notation, E_v will refer to E_concat(v,vaug), keeping in mind that this embedding is actually the concatenation of the embedding of an image and that of the copy obtained through random transformations.

Projection. The pre-trained models provide embeddings with different sizes. It is therefore necessary to transform them into a space with the same dimensionality to obtain comparable representations. The projection function carries out this task; it is introduced both to reduce the dimensions of the two embeddings and to reduce the computational load, improving the performance of the multimodal model and allowing it to learn more complex patterns.
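The encoder equations above can be sketched shape-wise in a few lines of NumPy. This is an illustration of the tensor dimensions only: `bert_pooled_output` and `resnet_global_pooling` are hypothetical stand-ins (random values and a fake feature map), not the actual pre-trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4            # batch size
N_MAX = 128      # truncated token length
H = 768          # BERT hidden size (pooled_output)
H_R = 2048       # ResNet50V2 global-pooling size

def bert_pooled_output(token_batch):
    """Stand-in for BERT(T_trunc)[pooled_output]: one h-dimensional
    vector per sample (shapes only, random values)."""
    return rng.standard_normal((token_batch.shape[0], H))

def resnet_global_pooling(image_batch):
    """Stand-in for ResNet(V)[global_pooling]: a fake 7x7x2048
    feature map per image, averaged over the spatial dimensions."""
    n = image_batch.shape[0]
    feature_map = rng.standard_normal((n, 7, 7, H_R))
    return feature_map.mean(axis=(1, 2))          # (n, 2048)

T_trunc = rng.integers(0, 30000, (N, N_MAX))      # tokenized texts
V = rng.random((N, 224, 224, 3))                  # original images
V_aug = rng.random((N, 224, 224, 3))              # augmented images

E_t = bert_pooled_output(T_trunc)                 # (N, 768)
E_v = resnet_global_pooling(V)                    # (N, 2048)
E_vaug = resnet_global_pooling(V_aug)             # (N, 2048)
E_concat = np.concatenate([E_v, E_vaug], axis=1)  # (N, 4096)
```

In the real model the two stand-ins are the pre-trained BERT and ResNet50V2 networks; only the output shapes shown in the comments matter for the layers that follow.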
The projection of embeddings is particularly useful when one wants to compare the semantic representations of two objects, ensuring that both are aligned in the same reduced semantic space, making them comparable in terms of similarity or distance and facilitating the analysis of their relationships. For this model, we selected d_prj = 128 as the projection size, reducing the sizes of both input embeddings.

3.1.3. Cross-modal layer

This layer is the heart of the model and is developed taking inspiration from the behavior of human beings when faced with news made up of text and images. Intuitively, we try to read in the image what is written in the text and to recognize in the text what is shown by the image: it can be said that cross-modal attention relations exist between image and text. This is why, to simulate this human process in a neural model, we relied on cross-attention between the two modalities, a variant of the standard multi-head attention component capable of capturing global dependencies between text and images.

In the proposed model, two blocks of crossed attention are activated, in the text-image and image-text perspectives. In the first case, we consider the textual embeddings as the queries for the multi-head attention block and the visual ones as keys and values. This should allow the characteristics of the text to guide the model to focus on regions of the image semantically coherent with the text: if the textual embeddings are the queries and the visual ones the keys and values, then attention is applied to the images based on their compatibility with the text, which is therefore considered the context against which the relevance of an image is evaluated. In this way, attention is focused on the images with respect to how relevant they are to the text, i.e. we try to give importance to the visual features in relation to the context provided by the text. Conversely, in the second case the visual embeddings are the queries, while the keys and values are the textual embeddings; this should allow the visual features to make the model pay attention to those parts of the text consistent with the images. The same as in the previous case applies, but with the roles of text and image reversed.

Formalizing the bidirectional cross-attention between the embeddings of the text E_t-projected and those of the images E_v-projected, we can write:

E_cross-tv = Attention(E_t-projected, E_v-projected)
E_cross-vt = Attention(E_v-projected, E_t-projected)

where E_cross-tv represents the attention embeddings of the image information with respect to the text, and E_cross-vt the attention embeddings of the text information with respect to the images. In this layer the dimensions of the embeddings are not modified in any way, so we remain in R^(N×128).

3.1.4. Fusion layer

Once the textual and visual embeddings learned unimodally in the network and the cross-attention embeddings learned intermodally are available, it is necessary to implement a fusion strategy that best balances their respective contributions in the multimodal classification task. Although the architecture of the model would seem to suggest a late fusion strategy, note that the cross-attention of the cross-modal layer is already a fusion strategy adopted in the network during learning, before the one explicitly implemented in the subsequent fusion layer: this allows the model to learn shared features during training while maintaining suitable flexibility between the multimodal components, i.e. without excessively influencing the learning process of each modality separately.

The concatenation preserves each modality's distinctive features, allowing the model to exploit them during learning, unlike the sum, which could lead to a loss of information due to values cancelling each other out, reducing the model's descriptive capacity. For these reasons, the fusion takes into consideration all four embeddings learned by the model, E_t-projected, E_v-projected, E_cross-tv and E_cross-vt, where the first two provide distinctive unimodal features, while the other two provide correlated, mutually "attentioned" cross-modal features. The hybrid fusion strategy then completes the recipe, providing the pinch of flexibility necessary to balance the multimodal classifier. Formally, we have the following equation, which aims to make the most of both the information provided by the individual modalities as such and that provided jointly:

E_global = (E_t-projected ⊕ E_v-projected) ⊕ E_cross-tv ⊕ E_cross-vt

where E_global ∈ R^(N×4·d_prj), N is the size of the batch of examples given as input to the network and d_prj = 128.

The final output of the multimodal model is obtained by applying a densely connected layer with C = 4 units and a softmax activation function that returns the probabilities of the four classes. Formally:

Y = E_global · W + b
O = softmax(Y)

with W ∈ R^(4·d_prj×C) and b ∈ R^(1×C), so that O ∈ R^(N×C) is a matrix in which each row is a vector of C = 4 values representing the estimated conditional probability of each class for the relevant sample.

4. Experimental Setup

4.1. Splitting the dataset into training and validation

To guarantee that the proportions of classes and sources are maintained uniformly in the two sets, the 1034 samples of the dataset are randomly divided following an 80%-20% proportion between training and validation, stratified both with respect to the labels, as also happens in the baseline model of the MULTI-Fake-DetectiVE competition, and with respect to the type of source of the news.

4.2. Training and validation

For our experiment, the model was trained for up to 80 epochs with early stopping, using the focal loss [17] function. Focal loss is a dynamically scaled cross-entropy loss in which the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and quickly focuses the model on difficult examples. As optimizer we chose AdamW, given that the models used to analyze text and images were originally pre-trained with this algorithm; it applies weight regularization directly to the model parameters during weight updating, helping to improve the stability and generalization of the model.
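The training objective just described can be sketched as a minimal NumPy implementation of the multi-class focal loss of Lin et al. [17]. The focusing parameter γ = 2 used here is the common default from that paper, not a value restated in this work, so treat it as an assumption.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Multi-class focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t),
    averaged over the batch. `probs` is the (N, C) softmax output,
    `targets` the (N,) integer class labels. gamma=2.0 is the common
    default from Lin et al. (assumed, not stated in this paper)."""
    p_t = probs[np.arange(len(targets)), targets]  # prob. of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

# With gamma = 0 the focal loss reduces to plain cross-entropy:
probs = np.array([[0.90, 0.05, 0.03, 0.02],    # easy, confident sample
                  [0.25, 0.25, 0.25, 0.25]])   # hard, uncertain sample
targets = np.array([0, 1])
ce = focal_loss(probs, targets, gamma=0.0)     # ordinary cross-entropy
fl = focal_loss(probs, targets, gamma=2.0)     # easy sample down-weighted
```

Note how the confident first sample (p_t = 0.9) contributes almost nothing when γ = 2, while the uncertain second one dominates: exactly the "focus on hard examples" behavior described above.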
5. Results

5.1. Official baseline models

In the notebook provided by the MULTI-Fake-DetectiVE organizers there is an evaluation strategy on the official dataset, developed by comparing the performance of unimodal pre-trained models with a multimodal model:

• Text-only model: a model trained only on textual features, extracted with a pre-trained BERT network.
• Image-only model: a model trained only on the visual features of images, extracted with a pre-trained ResNet18 network.
• Multi-modal model: a model trained on the concatenation of text and image features, extracted separately with the two previous unimodal models.

The F1-weighted scores of the three baseline models are shown in Table 1. The textual model is the most effective of the three baseline models in classifying fake news, and the visual one performs worse than the textual model. The multimodal model obtained an F1-weighted score lower than that of the unimodal textual model but higher than that of the unimodal visual model, indicating that the integration of visual and textual information led to an improvement over the visual model, though not enough to outperform the text model. This suggests that there may be room for additional optimizations or modality-integration strategies to achieve better performance from the multimodal model.

Table 1
Model        Accuracy  F1-weighted
Text-only    0.498     0.462
Multi-modal  0.480     0.442
Image-only   0.438     0.371
Summary and comparison of the main metrics for the three baseline models on the official dataset.

5.2. Proposed model

To evaluate the proposed model on the Multimodal Fake News Detection task, we chose to follow the approach used by the organizers in the notebook of the baseline models, i.e. we performed an ablation study on the proposed model: first a unimodal textual model was trained, then a unimodal visual one, then a multimodal one without cross-bi-attention, and finally a multimodal one with cross-bi-attention. Table 2 reports the respective accuracy and F1-weighted values.

Table 2
Model                   Accuracy  F1-weighted
Proposed Multi-modal ⊗  0.541     0.537
Proposed Text-only      0.472     0.469
Proposed Multi-modal ⊕  0.460     0.445
Proposed Image-only     0.418     0.422
Ablation study on the proposed model: accuracy and F1-weighted. The ⊗ symbol indicates cross-bi-attention enabled, while ⊕ indicates cross-bi-attention disabled (i.e. simple concatenation).

The results for the unimodal models and for the multimodal model without cross-bi-attention are in line with those of the corresponding baseline models. What catches the eye, however, are the accuracy and F1-weighted values of the multimodal model with cross-bi-attention. In particular, its F1-weighted score is almost seven percentage points higher than that of the proposed textual unimodal model, more than eleven points higher than that of the visual unimodal model and more than nine points higher than that of the multimodal model without cross-bi-attention.

Let us now compare the accuracy and F1-weighted values of the proposed multimodal model with cross-bi-attention against the finalist models. Its F1-weighted score is two and a half points higher than that of the winning model of the MULTI-Fake-DetectiVE competition, as is evident from Table 3.

Table 3
Model                   Accuracy  F1-weighted
Proposed Multi-modal    0.541     0.537
PoliTo - FND-CLIP-ITA   -         0.512
ExtremITA - Suede_LoRA  -         0.507
Baseline Multi-modal    0.480     0.442
Final comparison between all the analyzed models and the proposed model.

While the data preparation strategy in the pre-processing step provides the model with more information to learn from, the real strength can be identified in the cross-modal layer.
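F1-weighted, the comparison metric in the tables above, averages the per-class F1 scores weighted by class support. A minimal self-contained sketch (equivalent in spirit to scikit-learn's `f1_score(..., average='weighted')`; that the official notebook uses that routine is an assumption):

```python
import numpy as np

def f1_weighted(y_true, y_pred, n_classes=4):
    """Per-class F1 averaged with weights proportional to the number
    of true samples of each class (the class support)."""
    f1s, support = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        support.append(np.sum(y_true == c))
    return float(np.average(f1s, weights=support))

# Toy example with two of the four classes present:
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
score = f1_weighted(y_true, y_pred)  # (2*(2/3) + 2*0.8) / 4 = 11/15
```

Because the MULTI-Fake-DetectiVE classes are unbalanced, this support weighting is what makes F1-weighted a fairer summary than plain accuracy in the comparisons above.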
As supposed and hoped, the mechanism of crossed attention, seen from the two text-image and image-text perspectives and enriched by the skip connection provided by the simple concatenation of the two different embeddings, gives the model the extra edge that allows it to dig into the background of the relationships between textual and visual features. By combining bilateral cross-attention and a residual connection, the tasks of the cross-modal layer and the fusion layer respectively, significant semantic and semiotic interrelations are obtained in favor of the performance of the classifier, which becomes more precise and sensitive.

In fact, if on the one hand the cross-modal layer allows the model to learn multimodal semantics between text and images, on the other the fusion layer enhances it, improving its stability, capacity and performance thanks to the skip connection, which gives the gradient a direct path along which to flow during backpropagation without tending to zero, bringing significant additional information into each layer of the network.

All the results described up to this point were obtained by measuring the model on the Multimodal Fake News Detection task of the competition covered by this work. As mentioned, the organizers also proposed a second task, Cross-modal relations in Fake and Real News, which we used to verify the robustness of the model to a change of task without any human intervention. Table 4 shows the accuracy and F1-weighted values for the proposed model on the Cross-modal relations task, together with the baseline and winner models of the MULTI-Fake-DetectiVE competition.

Table 4
Model                  Accuracy  F1-weighted
Proposed Multi-modal   0.529     0.527
PoliTo - FND-CLIP-ITA  -         0.517
Baseline Multi-modal   -         0.442
Result summary on Task 2.

The results show a clear improvement in performance in solving the task, even compared to the winning model of the competition. This is a very important result, because it demonstrates the network's ability to adapt to changes in tasks and in training data, which is not at all a given.

6. Conclusions

The Internet has facilitated the multimodality of communication by enabling rapid information exchanges that are increasingly immersive but also increasingly used to convey falsehoods. In this study, a multimodal model for identifying fake news was proposed, based on the mechanism of cross attention between the representations of the features learned by the network on the textual component of the news and those learned on the associated visual component.

Many multimodal models are based on the concatenation of features learned from distinct modalities which, despite achieving good performance, limits the potential of the interaction between the features themselves. In the experiments carried out, the use of cross-attention demonstrated significant improvements in the performance of the model proposed in this work compared to the first two models classified in the MULTI-Fake-DetectiVE competition, for both tasks requested by the organizers, despite the dataset available for training being very small and unbalanced both with respect to the categories to be predicted and with respect to the source of the news. Despite the intrinsic complexity of the two tasks, the cross-modal layer of the proposed model manages to express the representations learned from the text and images of a news story in a harmonious, collaborative and synergistic way, balancing their contributions and preventing one from taking over the other.

Future developments concern the components of the model, which could use a Vision Transformer [18] instead of the ResNet, so that the textual and visual embeddings to be related are both generated by training a Transformer network.

References

[1] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473.
[2] A. Bondielli, P. Dell'Oglio, A. Lenci, F. Marcelloni, L. C. Passaro, M. Sabbatini, MULTI-Fake-DetectiVE at EVALITA 2023: Overview of the multimodal fake news detection and verification task, CEUR Workshop Proceedings 3473 (2023). URL: https://ceur-ws.org/Vol-3473/paper32.pdf.
[3] S. Suryavardan, S. Mishra, P. Patwa, M. Chakraborty, A. Rani, A. N. Reganti, A. Chadha, A. Das, A. P. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Factify 2: A multimodal fake news and satire news dataset, in: A. Das, A. P. Sheth, A. Ekbal (Eds.), DE-FACTIFY@AAAI, volume 3555 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: http://dblp.uni-trier.de/db/conf/defactify/defactify2023.html#SuryavardanMPCR23.
[4] K. Nakamura, S. Levy, W. Y. Wang, Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6149-6157. URL: https://aclanthology.org/2020.lrec-1.755.
[5] L. D'Amico, D. Napolitano, L. Vaiani, L. Cagliero, PoliTo at Multi-Fake-DetectiVE: Improving FND-CLIP for multimodal Italian fake news detection, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper35.pdf.
[6] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang, Multimodal fake news detection via CLIP-guided learning, 2022. URL: https://arxiv.org/abs/2205.14304. arXiv:2205.14304.
[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.
[8] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[9] A. Santilli, E. Rodolà, Camoscio: An Italian instruction-tuned LLaMA, 2023. URL: https://arxiv.org/abs/2307.16456. arXiv:2307.16456.
[10] I. Segura-Bedmar, S. Alonso-Bartolome, Multimodal fake news detection, Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/6/284.
[11] B. Palani, S. Elango, V. K., CB-Fake: A multimodal deep learning framework for automatic fake news detection using capsule neural network and BERT, Multimedia Tools and Applications 81 (2022). doi:10.1007/s11042-021-11782-3.
[12] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, J. Xu, B. Xu, J. Li, Y. Dong, M. Ding, J. Tang, CogVLM: Visual expert for pretrained language models, 2024. URL: https://arxiv.org/abs/2311.03079. arXiv:2311.03079.
[13] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, 2024. URL: https://arxiv.org/abs/2310.03744. arXiv:2310.03744.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, 2018. arXiv:1708.02002.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021. arXiv:2010.11929.