<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Attention is all you need</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Saioni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Giannone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Almawave S.p.A., Via di Casal Boccone</institution>
          ,
          <addr-line>188-190 00137, Rome, IT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University G. Marconi</institution>
          ,
          <addr-line>Rome, IT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present a multimodal model for classifying fake news. The main peculiarity of the proposed model is the cross-attention mechanism. Cross-attention is an evolution of the attention mechanism that allows the model to examine intermodal relationships to better understand information from different modalities, enabling it to simultaneously focus on the relevant parts of the data extracted from each. We tested the model using the MULTI-Fake-DetectiVE data from EVALITA 2023. The presented model is particularly effective both in classifying fake news and in evaluating the intermodal relationship.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformer</kwd>
        <kwd>fake news classification</kwd>
        <kwd>multimodal classification</kwd>
        <kwd>cross attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Internet has facilitated communication by enabling rapid, immersive information exchanges. However, it is also increasingly used to convey falsehoods, so today, more than ever, the rapid spread of fake news can have severe consequences, from inciting hatred to influencing financial markets or the progress of political elections to endangering world security. For this reason, mitigating the growing spread of fake news on the web has become a significant challenge.</p>
      <p>Fake news manifests itself on the internet through text, images, video, audio or, in general, a combination of these modalities, i.e. in a multimodal way. In this article, we consider the two components of news, text and image, as they are presented, for instance, on a social network. In this work we propose an approach to automatically and promptly identify fake news. We use the dataset of the MULTI-Fake-DetectiVE competition<sup>1</sup>, proposed at EVALITA 2023<sup>2</sup> [<xref ref-type="bibr" rid="ref1">1</xref>]. The competition aims to evaluate the truthfulness of news that combines text and images, an aim expressed through two tasks: the first carries out the identification of fake news (Multimodal Fake News Detection); the second seeks relationships between the two modalities, text and image, by observing the presence or absence of correlation or mutual implication (Cross-modal relations in Fake and Real News).</p>
      <p>Our approach proposes a Transformer-based model that focuses on relating the textual and visual embeddings of the input samples (i.e., the vector representations of the text and images it receives as input). The aim was to find a way to reconcile the two different representation embeddings, because they are learned separately from two different corpora, such as text and images, trying to capture their mutual relationships through some interaction between the respective semantic spaces.</p>
      <p>The remainder of the paper is structured as follows: section 2 presents a brief overview of related work, and section 3 describes the architecture of the proposed model. Section 4 gives an overview of our experiments. Sections 5 and 6 present the final results and our conclusions, respectively.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author: marco.saioni@gmail.com (M. Saioni); c.giannone@unimarconi.it (C. Giannone). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). <sup>1</sup>https://sites.google.com/unipi.it/multi-fake-detective <sup>2</sup>https://www.evalita.it</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Works</title>
      <p>The Italian MULTI-Fake-DetectiVE competition [<xref ref-type="bibr" rid="ref2">2</xref>] adds to the various datasets and challenges on multimodal fake news developed recently, for instance Factify [3] and Fakeddit [4]. The creation of these competitions shows the interest in this task. The first task of the Italian challenge saw three completely different systems placed on the podium. The first system, POLITO [5], was based on the FND-CLIP multimodal architecture [6], proposing some ad hoc extensions of CLIP [7], including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. The Extremita system [8], second classified, exploited LLM capabilities, focusing only on the textual component of each news item: they fine-tuned the open-source LLM Camoscio [9] on the textual part of the dataset. The impressive results show how the textual component plays a primary role in identifying fake news. Despite the significant contribution of the textual component to the task, more and more multimodal approaches are taking hold. [10] proposed a CNN architecture combining texts and images to classify fake news. In that direction, approaches such as CB-FAKE [11] incorporate the encoder representations from the BERT model to extract the textual features and combine them with a model that extracts the image features. These features are combined to obtain a richer data representation that helps to determine whether the news is fake or real. Vision-language models in general have also gained a lot of interest in the last years, in the "large models era", with surprising results in many visual-language interaction tasks [12], [13].</p>
    </sec>
    <sec id="sec-2">
      <title>3. The proposed Model</title>
      <p>The objective was to "engage" specialist models for natural language processing and artificial vision, making them discover and learn bimodal features from text and images collaboratively and harmoniously, by applying the teachings of Vaswani et al. [14]: we decided to follow the path indicated by the famous paper "Attention is all you need", following up on the intuition that the attention mechanism could provide important added value to a multimodal model for the identification of fake news, becoming a Multimodal Attention (hence the title of this article), i.e. an attention mechanism applied between the textual and visual modes of news. In fact, while attention or self-attention (as described in the Vaswani et al. paper) takes as input the embeddings of a single modality and transforms them into more informative embeddings (contextualized embeddings), Multimodal Attention takes as input the embeddings of the two different modalities, combining them and then transforming them into a single embedding capable of capturing any existing relationships between the two input modes.</p>
      <sec id="sec-2-1">
        <title>3.1. Architecture</title>
        <p>Multimodal Attention is the heart that supports the proposed model, making it capable of exploring the hidden aspects of multimodal communication. As shown at a high level in Figure 1, the architecture of the proposed model consists of a hierarchical structure with three layers preceded by a pre-processing step. In order, there are: a pre-processing step, an input layer, a cross-modal layer and a fusion layer. It was decided to propose a network that models the consistent information between the two modalities, textual and visual, starting from state-of-the-art pre-trained neural networks. In particular, we use a pre-trained BERT [15] model to learn word embeddings from the textual component of news and a pre-trained ResNet [16] model to learn visual embeddings from the visual component. The two embeddings, belonging to two spaces with different dimensions, are first projected into a uniform, reduced-dimensional space, then related to each other with the strategy of mutual cross-attention to obtain two embeddings that are subsequently concatenated to provide the input of the last dense classification layer.</p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.1.1. Pre-processing step</title>
        <p>As a first step it is necessary to process the data made available by the organizers of the MULTI-Fake-DetectiVE competition to produce inputs that are compatible and compliant with those expected by the pre-trained models. The choices made for this preparation, i.e. for the pre-processing of the dataset and the data "personalization" strategy, are described in the following three points:</p>
        <list list-type="bullet">
          <list-item><p>resolution/explosion of the 1 : n relationships between text and images into n separate 1 : 1 relationships;</p></list-item>
          <list-item><p>data augmentation with the creation of an additional image to support the original one already present in each example;</p></list-item>
          <list-item><p>management of the textual component, truncated by BERT, or rather by the relevant tokenizer, to a fixed maximum token length.</p></list-item>
        </list>
        <p>Therefore, following this processing, for each single sample we move from the original pairs ⟨t, v+⟩, where v+ indicates the 1 : n ratio between text in natural language and images in JPEG format, to triples appropriately translated into numbers, ⟨t, v, vaug⟩, where t indicates, for each sample, a first-order tensor with 128 values (tokens), while v and vaug denote third-order tensors with (224 × 224 × 3) values (pixels). In fact, the first-order tensor is the representation of the text in numerical form according to the default strategy of the BERT tokenizer, while the third-order tensors are the representation of the images in numerical form according to the RGB coding for ResNet.</p>
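        <p>As an illustration only, the three pre-processing choices can be sketched as follows; the sample dictionaries, field names and the tokenize/augment stand-ins are hypothetical and do not reflect the competition's actual data schema.</p>
        <preformat>
```python
# Sketch of the pre-processing step: explode each 1:n text-image pair into n
# separate 1:1 pairs, attach an augmented copy of every image, and truncate
# the text to a fixed token budget. The sample dictionaries, field names and
# the tokenize/augment stand-ins are hypothetical, for illustration only.
MAX_TOKENS = 128  # fixed maximum token length enforced by the tokenizer

def tokenize(text):
    # Stand-in for the BERT tokenizer: naive whitespace split.
    return text.split()

def augment(image):
    # Stand-in for a random transformation (flip, crop, ...) of the image.
    return ("augmented", image)

def preprocess(samples):
    exploded = []
    for sample in samples:
        tokens = tokenize(sample["text"])[:MAX_TOKENS]   # truncation
        for image in sample["images"]:                   # 1:n -> n pairs 1:1
            exploded.append({
                "tokens": tokens,
                "image": image,
                "image_aug": augment(image),             # data augmentation
                "label": sample["label"],
            })
    return exploded

news = [{"text": "breaking news example", "images": ["img0.jpg", "img1.jpg"],
         "label": "fake"}]
pairs = preprocess(news)  # two 1:1 samples, each with an augmented image
```
        </preformat>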
      </sec>
      <sec id="sec-2-2">
        <title>3.1.2. Input layer</title>
        <p>This layer receives as input the previously processed dataset, i.e. the text and the images represented in numerical form, passing it to the pre-trained BERT and ResNet models to obtain the respective embeddings, subsequently projected into a space with small and common dimensions to make them comparable and to allow them to collaborate with each other in the subsequent cross-modal layers.</p>
        <p>BERT Encoder. Each sample, pre-processed and represented in numerical form by the tokenizer, is passed as input to the pre-trained BERT model, which returns different output tensors for each of them. For the purposes of the classification task object of this study, we consider the pooled_output, a compact representation of the whole token sequence given as input to the BERT model, obtained via the special token [CLS]. It is therefore a summary of the information extracted from the entire input, whose dimensions depend on the number of hidden units of BERT. Since each text supplied as input to BERT will correspond to a tensor with 768 real values, using vector notation we have that et = BERT(ttrunc)[pooled_output], where et ∈ R^(ht) is the word embedding vector, ttrunc ∈ R^(L) is the token input vector and ht = 768 is the BERT hidden size. The equation refers to a single sample but can be extended to the entire batch of B examples processed by BERT. Indicating this batch with Ttrunc ∈ R^(B×L), we will have Et = BERT(Ttrunc)[pooled_output], where Et ∈ R^(B×ht) is the text embedding matrix learned by the BERT model.</p>
        <p>ResNet Encoder. The two images of each sample, previously represented in numerical form, are passed as input to the pre-trained ResNet model, which returns a visual embedding of size hv for each example, representing in a compact and semantic form the features extracted through convolutions and pooling within the ResNet network. In fact, to obtain visual embeddings from a pre-trained neural network like ResNet, we usually take the output of the penultimate layer, i.e. global pooling. In the proposed model, ResNet50V2 was chosen, which in global pooling reduces the spatial dimensions of the output tensor to 2048 values; therefore each input image will correspond in output to a vector with hv = 2048 values, which represents the visual embeddings extracted from the network for that specific image. Formally, ev = ResNet(v)[global_pooling], where ev ∈ R^(hv) is the visual embedding vector and v ∈ R^(224×224×3) the input third-order tensor. The equation refers to a single sample but can be extended to the entire batch of B examples; indicating the batch with V ∈ R^(B×224×224×3), we will have Ev = ResNet(V)[global_pooling], where Ev ∈ R^(B×hv) is the visual embedding matrix learned by the ResNet model. A similar relation holds at batch level for the second, augmented image: Evaug = ResNet(Vaug)[global_pooling], where Evaug ∈ R^(B×hv). After obtaining the embeddings for each of the two images, they are concatenated together to obtain a single output tensor of size 2 × hv = 4096: Ev ⊕ Evaug = Econcat(v,vaug) ∈ R^(B×2hv). From this moment, and for simplicity of notation, Ev will refer to Econcat(v,vaug), knowing that this embedding is actually the concatenation of the embeddings of an image and of the one obtained through random transformations.</p>
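        <p>A shape-level sketch of the input layer may help fix the dimensions involved; here random projections stand in for the actual pre-trained BERT (pooled_output) and ResNet50V2 (global pooling) calls, so only the tensor shapes, not the semantics, match the description.</p>
        <preformat>
```python
import numpy as np

# Shape-level sketch of the input layer. Random projections stand in for the
# real pre-trained calls (BERT pooled_output, ResNet50V2 global pooling), so
# only the tensor dimensions match the text: h_t = 768, h_v = 2048, and the
# two image embeddings are concatenated into 2 * h_v = 4096 values.
rng = np.random.default_rng(0)
B, L = 8, 128                           # batch size, truncated token length
H, W, C = 224, 224, 3                   # image height, width, RGB channels
h_t, h_v = 768, 2048

def bert_pooled(T):
    # Stand-in for BERT(T)[pooled_output]: maps (B, L) to (B, h_t).
    return T @ rng.standard_normal((L, h_t))

def resnet_pooled(V):
    # Stand-in for ResNet(V)[global_pooling]: global average pooling over the
    # spatial dimensions, then a projection to h_v values per image.
    pooled = V.mean(axis=(1, 2))        # (B, H, W, C) -> (B, C)
    return pooled @ rng.standard_normal((C, h_v))

T = rng.standard_normal((B, L))         # tokenized, truncated text batch
V = rng.standard_normal((B, H, W, C))   # original images
V_aug = rng.standard_normal((B, H, W, C))  # augmented images

E_t = bert_pooled(T)                                                    # (B, h_t)
E_v = np.concatenate([resnet_pooled(V), resnet_pooled(V_aug)], axis=1)  # (B, 2*h_v)
```
        </preformat>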
      </sec>
      <sec id="sec-2-4">
        <title>Projection</title>
        <p>The pre-trained models provide embeddings with different sizes. It is, therefore, necessary to transform them into a space with the same dimensionality to obtain comparable representations. The projection function carries out this task, introduced both to reduce the dimensions of the two embeddings and to reduce the computational load, improving the performance of the multimodal model and allowing it to learn more complex patterns. The projection of embeddings is particularly useful when you want to compare the semantic representations of two objects, ensuring that both are aligned in the same reduced semantic space, making them comparable in terms of similarity or distance and facilitating the comparison and analysis of relationships.</p>
        <p>For this model, we selected p = 128 as the projection size, reducing the sizes of both embeddings of the input components.</p>
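        <p>The projection step can be sketched as a pair of linear maps into the common 128-dimensional space; the weight matrices below are random stand-ins for the learned projection parameters.</p>
        <preformat>
```python
import numpy as np

# Sketch of the projection step: linear maps (random stand-ins for the
# learned projection weights) bring the 768-dimensional text embeddings and
# the 4096-dimensional concatenated visual embeddings into the same
# p = 128-dimensional space, where they can be compared and cross-attended.
rng = np.random.default_rng(1)
B, p = 8, 128

E_t = rng.standard_normal((B, 768))     # BERT pooled_output embeddings
E_v = rng.standard_normal((B, 4096))    # concatenated ResNet embeddings

W_t_proj = rng.standard_normal((768, p))
W_v_proj = rng.standard_normal((4096, p))

E_t_projected = E_t @ W_t_proj          # (B, p)
E_v_projected = E_v @ W_v_proj          # (B, p)
```
        </preformat>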
      </sec>
      <sec id="sec-2-5">
        <title>3.1.3. Cross-modal layer</title>
        <p>This layer is the heart of the model, and it is developed taking inspiration from the behavior of human beings when faced with news made up of text and images. Intuitively, we try to read in the image what is written in the text and to represent in the text what is shown by the image. It can be said that cross-modal attention relations exist between image and text. This is why, to simulate the described human process in a neural model, we relied on cross attention between the two modalities, a variant of the standard multi-head attention component capable of capturing global dependencies between text and images.</p>
        <p>In the proposed model, two blocks of crossed attention are activated, in the text-image and image-text perspectives. In the first case, we consider the textual embeddings as queries for the multi-head attention block, and the visual ones as keys and values. This should allow the characteristics of the text to guide the model to focus on regions of the image semantically coherent with the text: in fact, if the textual embeddings are considered as queries and the visual ones as keys and values, then the attention will be applied to the images based on their compatibility with the text, which is therefore considered the context on which to evaluate the relevance of an image. In this way, attention is focused on the images with respect to how relevant they are to the text, i.e. we try to give importance to the visual features in relation to the context provided by the text. Conversely, in the second case the visual embeddings are the queries, while the keys and values are the textual embeddings, and this should allow the visual features to make the model pay attention to those parts of the text consistent with the images. That is, the same as in the previous case applies, but with the roles of text and image reversed.</p>
        <p>Wanting to formalize the bidirectional cross-attention between the embeddings of the text Et-projected and those of the images Ev-projected, we can write Ecross-tv = Attention(Et-projected, Ev-projected) and Ecross-vt = Attention(Ev-projected, Et-projected), where Ecross-tv represents the attention embeddings of image information with respect to the text and Ecross-vt represents the attention embeddings of text information with respect to the images. In this layer the dimensions of the embeddings are not modified in any way, therefore we remain in R^(B×128).</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.1.4. Fusion layer</title>
        <p>Once the embeddings (textual and visual) learned unimodally in the network and the cross-attention embeddings learned intermodally are available, it is necessary to implement a fusion strategy that can best balance their respective contributions in the multimodal classification task. Although the architecture of the model would seem to suggest a late fusion strategy, it should be observed that the cross-attention of the cross-modal layer is already a fusion strategy adopted in the network during learning, before the one explicitly implemented in the subsequent fusion layer: this allowed the model to learn shared features during training while maintaining suitable flexibility between the multimodal components, i.e. without excessively influencing the learning process of each modality separately. The concatenation preserves each modality's distinctive features, allowing the model to exploit them during learning, unlike the sum, which could lead to a loss of information due to values that can cancel each other out, taking away the model's descriptive capacity. For these reasons, the fusion takes into consideration all four embeddings learned by the model, Et-projected, Ev-projected, Ecross-tv and Ecross-vt, where the first two provide distinctive unimodal features, while the other two provide correlated and mutually "attentioned" cross-modal features. The hybrid fusion strategy then completes the recipe, providing that pinch of flexibility necessary to give balance to the multimodal classifier. Formally we have the following equation, which aims to make the most both of the information provided by the individual modalities as such and of that provided jointly: Eglobal = (Et-projected ⊕ Ev-projected) ⊕ Ecross-tv ⊕ Ecross-vt, where Eglobal ∈ R^(B×4p), B is the size of the batch of examples given as input to the network and p = 128.</p>
        <p>The final output of the multimodal model is obtained by applying a densely connected layer with u = 4 units and a softmax activation function that returns the probabilities of the four classes. Formally, Y = EglobalW + b and O = softmax(Y), with W ∈ R^(4p×u), b ∈ R^(1×u) and therefore O ∈ R^(B×u), a matrix in which each row is a vector with u = 4 values representing the conditional (estimated) probability of each class for the relevant sample.</p>
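        <p>The cross-modal layer and the fusion layer can be sketched together as follows; single-head scaled dot-product attention with random stand-in weights replaces the learned multi-head blocks, and, since each projected embedding is treated as a length-one sequence, the attention weights are trivially one and the interaction happens through the projections (with token or region sequences the same code yields non-uniform weights).</p>
        <preformat>
```python
import numpy as np

# Sketch of the cross-modal layer and the fusion layer. Single-head scaled
# dot-product attention with random stand-in weights replaces the learned
# multi-head blocks; the real model also uses separate blocks per direction.
rng = np.random.default_rng(2)
B, p, u = 8, 128, 4                          # batch size, projection size, classes

E_t_projected = rng.standard_normal((B, p))  # projected text embeddings
E_v_projected = rng.standard_normal((B, p))  # projected visual embeddings

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V, Wq, Wk, Wv):
    # Scaled dot-product attention over (B, seq, p) tensors. Here each
    # embedding is a length-1 sequence, so the single attention weight is 1
    # and the interaction happens through the stand-in projections.
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(p)
    return softmax(scores, axis=-1) @ v

Wq, Wk, Wv = [rng.standard_normal((p, p)) for _ in range(3)]
Et, Ev = E_t_projected[:, None, :], E_v_projected[:, None, :]
E_cross_tv = cross_attention(Et, Ev, Ev, Wq, Wk, Wv)[:, 0, :]  # text queries images
E_cross_vt = cross_attention(Ev, Et, Et, Wq, Wk, Wv)[:, 0, :]  # images query text

# Fusion: concatenate the two unimodal and the two cross-modal embeddings,
# then apply a dense softmax head over the u = 4 classes.
E_global = np.concatenate(
    [E_t_projected, E_v_projected, E_cross_tv, E_cross_vt], axis=1)  # (B, 4p)
W = rng.standard_normal((4 * p, u))
b = np.zeros((1, u))
O = softmax(E_global @ W + b)                # (B, u) class probabilities
```
        </preformat>
        <p>The concatenation at the end keeps the two unimodal and the two cross-modal embeddings side by side, as in the hybrid fusion strategy discussed above.</p>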
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>4.1. Split dataset into training and validation</title>
        <p>To guarantee that the proportions relating to the classes and to the sources are maintained uniformly in the two sets, the 1034 samples of the dataset are randomly divided following an 80%-20% proportion between training and validation, in a stratified way both with respect to the labels, as also happens in the baseline model of the MULTI-Fake-DetectiVE competition, and with respect to the type of source of the news.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Training and validation</title>
        <p>For our experiment, the model was trained for up to 80 epochs with early stopping, using the focal loss [17] function. It is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically reduce the contribution of easy examples during training and quickly focus the model on difficult examples. For the optimizer we chose AdamW, given that the models used to analyze text and images were originally pre-trained with this algorithm, which applies weight regularization directly to the model parameters during weight updating, helping to improve the stability and generalization of the model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Official baseline models</title>
        <p>The notebook provided by the MULTI-Fake-DetectiVE organizers contains an evaluation strategy on the official dataset, developed by comparing the performance of the unimodal pre-trained models with a multimodal model. The F1-weighted score values of the three baseline models are shown in Table 1. The textual model is the most effective among the three baseline models in classifying fake news, and the visual one has lower performance than the textual model. The multimodal model obtained an F1-weighted score lower than that obtained by the unimodal textual model, but higher than the score of the unimodal visual model, indicating that the integration of visual and textual information led to an improvement in performance compared to the visual model, but not enough to outperform the text model. This suggests that there may be potential for additional optimizations or modality integration strategies to achieve better performance from the multimodal model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Proposed model</title>
        <p>To evaluate the proposed model on the Multimodal Fake News Detection task, we chose to follow the approach used by the organizers in the notebook of the baseline models, i.e. we performed an ablation study on the proposed model: first a unimodal textual model was trained, then a unimodal visual one, then a multimodal one without cross-bi-attention and, finally, a multimodal one with cross-bi-attention. Table 2 reports the respective accuracy and F1-weighted values for the four configurations (Proposed Text-only, Proposed Image-only, Proposed Multi-modal ⊕, Proposed Multi-modal ⊗).</p>
        <p>The mechanism of crossed attention, seen from the two text-image and image-text perspectives and enriched by the skip connection provided by the simple concatenation of the two different embeddings, gives the model that extra edge that allows it to dig into the background of the relationships between textual and visual features. By combining bilateral cross-attention and residual connection, tasks of the cross-modal layer and of the fusion layer respectively, significant semantic and semiotic interrelations are obtained in favor of the performance of the classifier, which becomes more precise and sensitive.</p>
        <p>In fact, if on the one hand the cross-modal layer allows the model to learn multimodal semantics between text and images, on the other the fusion layer enhances it by improving its stability, capacity and performance thanks to the skip connection, which provides the gradient with a useful direct path to flow through during backpropagation without tending to zero, bringing significant additional information into each layer of the network.</p>
        <sec id="sec-4-2-1">
          <title>Cross-modal relations task</title>
          <p>All the results described up to this point are obtained by measuring the model on the Multimodal Fake News Detection task of the competition covered by this work. As mentioned, the organizers also proposed a second task, Cross-modal relations in Fake and Real News, aimed at verifying the robustness of the model to a change of task without any human intervention. Table 4 shows the accuracy and F1-weighted values for the proposed model called to express itself on the Cross-modal relations task, together with the baseline and winner models of the MULTI-Fake-DetectiVE competition (Proposed Multi-modal, PoliTo - FND-CLIP-ITA, Baseline Multi-modal). The results show a clear improvement in performance in solving the task, even compared to the winning model of the competition. This is a very important result, because it demonstrates the network's ability to adapt to changes in tasks and changes in training data, which is not at all a given.</p>
          <p>While the data preparation strategy in the pre-processing step provides the model with more information to learn from, the real strength can be identified in the cross-modal layer.</p>
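        <p>The focal loss used for training can be sketched in a few lines; the focusing parameter (usually called gamma) and the class probabilities below are illustrative, and setting gamma to zero recovers ordinary cross-entropy, which makes the scaling effect easy to see.</p>
        <preformat>
```python
import numpy as np

# Sketch of the focal loss: cross-entropy scaled by (1 - p_correct)**gamma,
# so confidently classified (easy) examples contribute less and training
# focuses on the hard ones. The gamma value below is illustrative.
def focal_loss(probs, labels, gamma=2.0):
    # probs: (B, C) predicted class probabilities; labels: (B,) class ids.
    p_correct = probs[np.arange(len(labels)), labels]
    return np.mean(-((1.0 - p_correct) ** gamma) * np.log(p_correct))

probs = np.array([[0.9, 0.05, 0.03, 0.02],   # easy example, p_correct = 0.9
                  [0.3, 0.4, 0.2, 0.1]])     # hard example, p_correct = 0.4
labels = np.array([0, 1])

loss_focal = focal_loss(probs, labels)
loss_ce = focal_loss(probs, labels, gamma=0.0)  # reduces to cross-entropy
```
        </preformat>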
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>The Internet has facilitated the multimodality of communication by enabling rapid information exchanges that are increasingly immersive, but also increasingly used to convey falsehoods. In this study, a multimodal model for identifying fake news was proposed, based on the mechanism of cross attention between the representations of the features learned by the network on the textual component of the news and those learned on the visual component associated with it.</p>
      <p>Many multimodal models are based on the concatenation of features learned from distinct modalities which, despite achieving good performance, limits the potential of the interaction between the features themselves.</p>
      <p>From the experiments carried out, the use of cross-attention demonstrated significant improvements in the performance of the model proposed in this work compared to the first two models classified in the MULTI-Fake-DetectiVE competition, for both tasks requested by the organizers, even though the dataset available for training is very small and unbalanced, both with respect to the categories to be predicted and with respect to the source of the news. Despite the intrinsic complexity of the two tasks, the cross-modal layer of the proposed model manages to express the representations learned from the text and images of a news story in a harmonious, collaborative and synergistic way, balancing their contributions and preventing one from taking over the other.</p>
      <p>Future developments concern the components of the model, which could use a Visual Transformer [18] instead of the ResNet, in order to relate textual and visual embeddings that are both generated by training a Transformer network.</p>
      <p>[3] S. Suryavardan, S. Mishra, P. Patwa, M. Chakraborty, A. Rani, A. N. Reganti, A. Chadha, A. Das, A. P. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Factify 2: A multimodal fake news and satire news dataset, in: A. Das, A. P. Sheth, A. Ekbal (Eds.), DE-FACTIFY@AAAI, volume 3555 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: http://dblp.uni-trier.de/db/conf/defactify/defactify2023.html#SuryavardanMPCR23.</p>
      <p>[4] K. Nakamura, S. Levy, W. Y. Wang, Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6149-6157. URL: https://aclanthology.org/2020.lrec-1.755.</p>
      <p>[5] L. D'Amico, D. Napolitano, L. Vaiani, L. Cagliero, Polito at multi-fake-detective: Improving FND-CLIP for multimodal italian fake news detection, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper35.pdf.</p>
      <p>[6] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang, Multimodal fake news detection via clip-guided learning, 2022. URL: https://arxiv.org/abs/2205.14304. arXiv:2205.14304.</p>
      <p>[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.</p>
      <p>[8] C. D. Hromei, D. Croce, V. Basile, R. Basili, Extremita at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.</p>
      <p>[9] A. Santilli, E. Rodolà, Camoscio: an italian instruction-tuned llama, 2023. URL: https://arxiv.org/abs/2307.16456. arXiv:2307.16456.</p>
      <p>[10] I. Segura-Bedmar, S. Alonso-Bartolome, Multimodal fake news detection, Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/6/284.</p>
      <p>[11] B. Palani, S. Elango, V. K, Cb-fake: A multimodal deep learning framework for automatic fake news detection using capsule neural network and bert, Multimedia Tools and Applications 81 (2022). doi:10.1007/s11042-021-11782-3.</p>
      <p>[12] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, J. Xu, B. Xu, J. Li, Y. Dong, M. Ding, J. Tang, Cogvlm: Visual expert for pretrained language models, 2024. URL: https://arxiv.org/abs/2311.03079. arXiv:2311.03079.</p>
      <p>[13] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, 2024. URL: https://arxiv.org/abs/2310.03744. arXiv:2310.03744.</p>
      <p>[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.</p>
      <p>[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.</p>
      <p>[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.</p>
      <p>[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, 2018. arXiv:1708.02002.</p>
      <p>[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2021. arXiv:2010.11929.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi (Eds.),
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , Parma, Italy, September 7th-8th,
          <year>2023</year>
          , volume
          <volume>3473</volume>
          of
          <source>CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3473.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabbatini</surname>
          </string-name>
          ,
          <article-title>Multi-fake-detective at evalita 2023: Overview of the multimodal fake news detection and verification task</article-title>
          , CEUR Workshop Proceedings 3473 (2023). URL: https://ceur-ws.org/Vol-3473/paper32.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>