<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for Multimodal Italian Fake News Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo D'Amico</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Napolitano</string-name>
          <email>davide.napolitano@polito</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Vaiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <email>luca.cagliero@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Fake News Detection, Multimodal Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7-8, Parma, IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The MULTI-Fake-DetectiVE challenge addresses the automatic detection of Italian fake news in a multimodal setting, where both textual and visual components contribute as potential sources of fake content. This paper describes the PoliTO approach to the tasks of fake news detection and analysis of the modality contributions. Our solution turns out to be the best performer on both tasks. It leverages the established FND-CLIP multimodal architecture and proposes ad hoc extensions including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. Thanks to its efectiveness in combining visual and textual content, our solution contributes to fighting the spread of disinformation in the Italian news flow.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>forms all the baselines and competitors in both tasks,
can be efectively detected.</p>
      <sec id="sec-1-1">
        <title>The MULTI-Fake-DetectiVE challenge [2] proposed</title>
        <p>news content and investigating the influence of visual
aim of accurately discriminating between real and fakfueture directions.
and textual components on each other’s interpretation.</p>
        <p>In this work, we present the PoliTO approach to both2. Related Work
at EVALITA 2023 [3] focuses on overcoming the limita-Section3 briefly describes the dataset, task, and
mettions of existing approaches in coping with multimodarlics used in the challenge. In Sectio4nwe describe the
Italian news content. It addresses the automatic detectmioenthodology, primarily focusing on the
proposeFdNDof Italian fake news in a multimodal setting, where botChLIP extensions. Section5 presents the experimental
textual and visual components potentially contributesaestup and the obtained results. Finally, Sect6iodnraws
sources of fake content. The challenge has the twofoldthe conclusions and discusses the main limitations and</p>
      </sec>
      <sec id="sec-1-2">
        <title>The remainder of this paper is organized as follows. In</title>
      </sec>
      <sec id="sec-1-3">
        <title>Section2 we review the literature on fake news detection, considering both text-only and multimodal approaches.</title>
        <p>nEvelop-O</p>
        <p>0000-0001-9077-4103 (D. Napolitano);0000-0002-3605-1577
(L. Vaiani);0000-0002-7185-5247 (L. Cagliero)</p>
        <sec id="sec-1-3-1">
          <title>NLP-based approaches.</title>
          <p>Early approaches focused
on linguistic features, such as lexical and syntactic
patterns, to distinguish between real and fake news.
However, with the advancement of deep learning researchers
have increasingly turned to more sophisticated
methods such as recurrent neural networks4][, convolutional
neural networks 5[], and transformer models 6[] to
capture semantic and contextual information for improved
detection accuracy. In this work, we mainly rely on statoer-exhibit a certain level of credibilitPyro;bably Real (PR),
of-the-art transformers pretrained on Italian textual dawthaen the news is highly credible but retains some
deto efectively extract information from the news textuagrlee of uncertainty regarding the information provided;
component. Certainly Real (CR), when the news is most certain to
be real and indisputable, regardless of the context. It is
Multimodal Approaches. Incorporating multimodal worth noticing that these labels pertain to the overall
information such as text and images has shown to bienformational content and should not be assigned based
promising to improve the accuracy of fake news detecs-olely on the individual components.
tion systems 7[]. Recently, the adoption of multimodal
architectures and transformers has shown to be partiTca-sk 2: Analysis of Cross-Modal Relations in Fake
ularly efective in capturing the semantic relationshipasnd Real News. The purpose is to examine the
relaamong diferent modalities for fake news detection, e.g.,tionship between the textual and visual modalities within
CB-Fake [8], CAFE [9] and TTEC [10]. the context of fake and real news. The primary objective</p>
          <p>FND-CLIP, proposed by [11], is among the most re- is to gain insights into how images and texts in fake and
cently proposed multimodal architectures for fake newrseal news can potentially lead to misleading
interpretadetection. It relies on the established CLIP mod1e2l][to tions of the content, both within each modality and as a
measure the cross-modal similarity and guide the mapw- hole. The task can be formulated as a three-class
classiping and fusion of the input features. The architectu rficeation problem. Given a multimodal piece of conte nt,
develops along three main streams: a textual one, whicthhe goal is to automatically assign one of the following
extracts information using BERT and CLIP, a visual onecategories to: Misleading (M), when either the image or
which extracts features from the images using ResNet antdhe text is misleading in terms of interpreting the
inforCLIP, and a multimodal one which combines the featuresmation conveyed by the other modality or the content
extracted using CLIP from both modalitieFsN.D-CLIP as a whole;Not Misleading (NM), when both image and
sufers from the following limitations: text are related to each other, providing support to the
overall information presented, and are not intended to
• The natural language encoder neglects the pom-islead;Unrelated (U), when the image and the text are
larity of the input text, which is known to bneot related to each other.</p>
          <p>relevant to fake news detectio1n3][.
• Fake news examples are likely to be undersam-Evaluation Metrics. For both tasks, the evaluation
pled in real training data. Hence, the classifi-metrics are accuracy, average per-class precision and
cation model may sufer from class imbalance recall, and macro- and weighted- F1 score.
Weightedefects. F1 score has been selected as the reference metric for
• Multimodal fake news often contains tampered vi- ranking participants.</p>
          <p>sual content. Tampered images are more likely to
be detected in the frequency domain space.
However, FND-CLIP does not consider any frequency- 3.2. Dataset Description
based image descriptor1[4].</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Both task-specific datasets consist of a collection of Twit</title>
        <p>Our research endeavors to address the aforesaid limit-er posts and newspaper articles describing one or more
tations by proposingFND-CLIP-IT, i.e., an improved ver- real events. For Task 1 the training set contains 908
dission FND-CLIP suited to multimodal Italian fake newtsinct labeled sample1.s The labels in the training data are
detection. distributed as follows: CF 16.4%, PF 22.0%, PR 44.4%, CR
17.2%. Around 80.0% of the samples are tweets, whereas
the remaining ones are news articles. The test set
con3. Task and Dataset Description sists of 193 samples following roughly the same per-class
and per-type distributions as in the training data. For
3.1. Tasks Description Task 2, the training set contains 1309 distinct samples and
the per-class distribution Mis 26.9%, U 40.6%, NM 31.5%.</p>
        <p>Task 1: Multimodal Fake News Detection. The 66.0% of the samples are tweets, whereas the remaining
problem can be formulated as a multi-class classifica-ones are news articles. The test set contains 219 samples.
tion task, where the input cont e=nt⟨,  ⟩ , consisting Compared to the training data, the per-type sample
disof a textual componen tand a visual componen t, can tribution is slightly more biased towards tweets (75.0%)
be classified as follows: Certainly Fake (CF), when the and Non-Misleading content (45.2%).
news is very likely to be fake, regardless of the context in
which it is presentedP;robably Fake (PF), when the news
is likely to be fake but may contain some real information 1available at the time of writing, June 2023</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <title>Here we presentFND-CLIP-IT, an improved version of</title>
        <p>FND-CLIP suited to the MULTI-Fake-DetectiVE
challenge. Our solution is rooted in the originaFNlD-CLIP
model [11] and a set of unimodal language and visual
encoders described below.</p>
        <sec id="sec-2-1-1">
          <title>Unimodal language baselines. We utilize the follow</title>
          <p>ing models tailored to the Italian language: BERT2-,IT
GilBERTo3, BART-IT4 [15].</p>
          <p>Since the input text can be longer than the maximum
model size, we adopt a hierarchical approach: the text is
divided into chunks of fixed length, where each one is
fed to the transformer encoder, and then the final
representation is obtained by averaging all the [CLS] tokens.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Unimodal visual baselines. We exploit two estab</title>
          <p>lished models, i.e., ViT [16] and ResNet-152 1[7]. Since
more pictures can be associated with the same sample, at
inference time we separately evaluate all the images and
the final prediction is the average of all obtained output
logits.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Multimodal baselines. To leverage visual and textual</title>
        <p>content at the same time, we rely on (i) the standard
FND-CLIP [11] architecture, adapted to handle Italian
text rather than English, (ii) CLI1P2[], and (iii) a late
fusion approach combining BERT-IT and ResNet-152.
4.1. FND-CLIP-IT</p>
      </sec>
      <sec id="sec-2-3">
        <title>FND-CLIP-IT extends the state-of-the-aFrNtD-CLIP ar</title>
        <p>chitecture to address the current limitations of fake news
detection approaches. By incorporating the proposed
extensions, the overall eficacy and robustness
ofFNDCLIP-IT shows significant improvements compared to
the baseline versions. A detailed description of the
proposed extensions, hereafter denoted by ,  ,  ,  , and 
for the sake of brevity, is given below.</p>
        <p>A. Sentiment-based textual representation: To
consider the polarity of the input text for fake
news detection1[3], we enrich the textual
representation by adding a sentiment-based encoding
to the existing text encoders. Specifically, we use
the Italian-BERT model finetuned on a sentiment
analysis tas5k. We also consider the following
variants of sentiment-based textual
representation:</p>
      </sec>
      <sec id="sec-2-4">
        <title>2https://huggingface.co/dbmdz/bert-base-italian-xxl-cased</title>
        <p>3https://huggingface.co/idb-ita/
gilberto-uncased-from-camembert
4https://huggingface.co/morenolq/bart-it
5https://huggingface.co/neuraly/
bert-base-italian-cased-sentiment
A1. A concatenation of the sentiment-based
embedding to the initial original
representation, on top of which we apply the textual
projection head.</p>
        <p>A2. A separate stream of information with a
dedicated projection head.</p>
        <p>B. DFT-based additional stream: we convert the
image from the spatial domain to the frequency
domain by applying Discrete Fourier Transform
(DFT). The purpose is to detect tampered images,
which likely occur in multimodal fake news14[].
We encode both real and imaginary parts using a
dedicated VGG19 1[8]. The obtained
representations are then concatenated to generate a parallel
stream of information that will be then combined
with the others before applying the final
FNDCLIP classifier.</p>
        <p>C. Embedding concatenation: instead of
summing the embedding of each stream we
concatenate them. Concatenation has already been
proven to be an efective way of combining
multimodal information1[9]. The rationale behind
it is that by keeping more fine-grained pieces of
information the classification head, adapted to
handle the new encoding, can capture the most
discriminating source features in a more efective
way.</p>
        <p>D. Class rebalancing through data
augmentation: since the dataset is quite imbalanced across
the classes, we re-balance the data distribution
by penalizing the most frequent class. In
particular, we generate new samples of the minority
classes by applying a textual augmentation based
on back-translation20[], which already proved
to be beneficial in both multimodal2[1] and fake
news detection2[2] tasks. The auxiliary language
adopted is English, and the translation models
used are provided by Helsinki-NLP6.</p>
        <p>E. Additional Squeeze and Excitation Layers:
Similar to FND-CLIP, we employ a
squeeze-andexcitation operation23[] to weigh the input
embedding streams. The purpose of a
squeezeand-excitation block is to adaptively recalibrate
channel-wise feature responses by explicitly
modeling interdependencies between channels.
Unlike [11], where they weigh diferently the
textual and visual streams, we also adopt
squeezeand-excitation within each modality to weigh the
relevance of each encoder. The key idea is to
give more importance to discriminating
modalityspecific embeddings.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Beyond considering each FND-CLIP extension sepa</title>
        <p>rately, we also build both models that combine the
pro</p>
      </sec>
      <sec id="sec-2-6">
        <title>6https://huggingface.co/Helsinki-NLP</title>
        <p>Model
BERT-IT
GilBERTo
BART-IT
ResNet-152
ViT
BERT-IT+ResNet-152
CLIP-IT
FND-CLIP-IT
FND-CLIP-IT 1
FND-CLIP-IT 2
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT
FND-CLIP-IT∗ 2,
FND-CLIP-IT∗ 1,,
FND-CLIP-IT∗ 1,,
FND-CLIP-IT 1,,,,
ENSEMBLE
posed extensions - in diferent ways and ensemble individually, yield notable improvements over the
origimethods that combine best-performing individual modn-al implementation. In addition, select combinations of
els. To this end, we use a weighted average of individualthese variants produce even more promising outcomes.
logits for each class. Although both focal loss and cross-entropy were
evaluated, we chose to report only the results obtained with
focal loss, due to their overall superior performance
com5. Experimental Results pared to cross-entropy. It is worth noting, however, that
the combination of all variants does not surpass the
per5.1. Setup formance of specific combinations, indicating a potential
The models were fine-tuned for a maximum of 80 epochs, susceptibility to overfitting. Furthermore, an intriguing
using a batch size of 16, a learning rate of 1e-3, anobservation emerges with the implementation of an
enAdamW optimizer with a weight decay of 0.001 and a lins-emble model that leverages the best-performing
combiear scheduler. All baseline models were trained using anations. This ensemble model outperforms the individual
cross-entropy-loss, while all FND-CLIP-IT variants were models, further accentuating the benefits of employing
trained with both cross-entropy and focal losses. ensemble techniques to enhance overall performance.</p>
        <p>5.3. Competition
5.2. Results
Table 1 presents the results of the baselines (upper partW)e employed our ensemble method to evaluate the
perand our proposed solutions (lower part), obtained on theformance of our FND-CLIP-IT variants on the test
samTask 1 validation set. Significantly, the outcomes reveaplles. The test results are presented in Tab2l.eThe upper
an intriguing pattern wherein text-only models exhibiptart of Table2 shows the outcomes obtained for Task 1.
superior performance when compared to image-only Although these results are worse than the performance
models, underscoring the paramount importance of texa-chieved on our validation set, they surpass all other
tual information within the context of the task. Notabbalys,elines and competitors.
the multimodal CLIP baseline demonstrates results com- Furthermore, we fine-tuned the same ensemble of
modparable to the text-only model. At the same time, FNeDls- for Task 2 by replacing the classification head last
CLIP-IT architecture attains performance marginally belat-yer. The bottom of Table2 reports the achieved results
ter than the BERT-IT model. Furthermore, our diversefor Task 2. Remarkably, our approach outperforms both
extensions of the FND-CLIP-IT framework, when applied the baseline and the competitors.</p>
        <p>Team Run</p>
        <p>PoliTo-P1
1 extremITA-camoscio_lora
sk AIMH-MYPRIMARYRUN
aT Baseline-SVM-TEXT</p>
        <p>HIJLI-JU-CLEF-Multi
2 PoliTo-P1
sk Baseline-MLP-TEXT
aT AIMH-MYPRIMARYRUN
technologies. Computational resources were provided by
HPC@POLITO, a project of Academic Computing within
the Department of Control and Computer Engineering
at the Politecnico di Tori7n.o</p>
      </sec>
      <sec id="sec-2-7">
        <title>This study was carried out within the FAIR - Future Ar</title>
        <p>
          tificial Intelligence Research and received funding from
the European Union Next-GenerationEU (PNRR M4C2, [9] Y. Chen, D. Li, P. Zhang, J. Sui, Q. Lv, L. Tun,
INVESTIMENTO 1.3 D.D. 1555 11/10/2022, PE00000013). L. Shang, Cross-modal ambiguity learning for
mulThis study was carried out within the MICS (Made in timodal fake news detection, in: Proc. of the ACM
Italy – Circular and Sustainable) Extended Partnership Web C
          <xref ref-type="bibr" rid="ref28">onference 2022</xref>
          , 2022, pp. 2897–2905.
and received funding from the European Union Next-[10] J. Hua, X. Cui, X. Li, K. Tang, P. Zhu, Multimodal
GenerationEU
          <xref ref-type="bibr" rid="ref14">(PNRR M4C2, INVESTIMENTO 1.3 D.D. fake news detection through data
augmentation1551.11-10-2022, PE00000004)</xref>
          . This manuscript reflects based contrastive learning, Applied Soft Computing
only the authors’ views and opinions, neither the Eu- 136 (2023) 110125.
ropean Union nor the European Commission can be [11] Y. Zhou, Q. Ying, Z. Qian, S. Li, X. Zhang,
Multimodal fake news detection via clip-guided learning,
arXiv preprint arXiv:2205.14304 (2022).
considered responsible for them. The research leading
to these results has been partly funded by the
SmartData@PoliTO center for Big Data and Machine Learning 7https://www.hpc.polito.it/
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Aral,</surname>
          </string-name>
          <article-title>The spread of true and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Table 2 false news online</article-title>
          ,
          <source>Science</source>
          (
          <year>2018</year>
          ).
          <article-title>Oficial MULTI-Fake-DetectiVE results</article-title>
          . For the oficial base- [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dell'Oglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marcelloni</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          proach. at evalita 2023:
          <article-title>Overview of the multimodal fake</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>By leveraging our best ensemble method, we have Language Processing and Speech Tools for Italian. demonstrated the robustness and</article-title>
          versatility of FoNurD- Final
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>CLIP-IT variants across both Task 1 and Task 2</article-title>
          ,
          <fpage>surpass</fpage>
          - Italy,
          <year>2023</year>
          .
          <article-title>ing existing approaches in terms of performance</article-title>
          and [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          , R. Sprug-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          efectiveness. noli, G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          6. Conclusion and
          <article-title>Future of the Eighth Evaluation Campaign of Natural Lan-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>In this study, we introduced the FND-CLIP-IT architec- 2023. ture exploring several variants for fake news detection</article-title>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Iwendi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohan</surname>
          </string-name>
          , S. khan, E. Ibeke,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ahmadian,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>in a multimodal setting. Our findings demonstrate the ef- T. Ciano, Covid-19 fake news sentiment analysis,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>fectiveness of these variants, with notable improvements Computers and Electrical Engineering (</article-title>
          <year>2022</year>
          ).
          <article-title>observed over the original implementation</article-title>
          . Furthermore,[5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>In the future, we plan to continue refining and optimiz- news detection</article-title>
          ,
          <source>Applied Intelligence</source>
          (
          <year>2023</year>
          ).
          <article-title>ing the proposed variants to further enhance their per</article-title>
          [-6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Kang</surname>
          </string-name>
          , H. Lim, exbake:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>decision-making process will be an interesting direction ers (bert</article-title>
          ),
          <source>Applied Sciences</source>
          <volume>9</volume>
          (
          <year>2019</year>
          ).
          <article-title>for future research</article-title>
          . [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alonso-Bartolome</surname>
          </string-name>
          , Multi-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>modal fake news detection</article-title>
          ,
          <source>Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          ). Acknowledgments [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Palani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elango</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Viswanathan</surname>
          </string-name>
          <string-name>
            <surname>K</surname>
          </string-name>
          , Cb-fake: A
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          and bert,
          <source>Multimedia Tools and Applications</source>
          (
          <year>2022</year>
          ). [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ramesh,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <year>2021</year>
          . [13]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          , J. Vi-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ). [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Multi-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (
          <year>2023</year>
          ). [15]
          <string-name>
            <surname>M. La Quatra</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cagliero</surname>
          </string-name>
          , Bart-it: An eficient
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>marization</surname>
          </string-name>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ). [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , D. Weis-
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Trans-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>formers for image recognition at scale</article-title>
          , in: 9th Inter-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>ICLR</source>
          <year>2021</year>
          ,
          <year>2021</year>
          . [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep residual learn-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2016</year>
          . [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , Very deep convolu-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>in: Proc. of the 3rd International Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Learning</given-names>
            <surname>Representations</surname>
          </string-name>
          ,
          <source>ICLR</source>
          <year>2015</year>
          , San Diego,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>CA</surname>
          </string-name>
          , USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          ,
          <year>2015</year>
          . [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          , Lever-
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          tion, in
          <source>: Proc. of the 37th ACM/SIGAPP Symposium</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>on Applied Computing</source>
          ,
          <year>2022</year>
          . [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          , Under-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>2018 Conference on Empirical Methods in Natu-</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <year>2018</year>
          . [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , P. Garza, PoliTo at SemEval-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>2023 task 1: Clip-based visual-word sense disam-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>17th International Workshop on Semantic Evalua-</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>tion (SemEval-2023),</surname>
            <given-names>ACL</given-names>
          </string-name>
          , Toronto, Canada,
          <year>2023</year>
          . [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          , Data augmentation
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>in the urdu language</article-title>
          ,
          <source>in: Proc. of the 12th lan-</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>guage resources and evaluation conference</source>
          ,
          <year>2020</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          pp.
          <fpage>2537</fpage>
          -
          <lpage>2542</lpage>
          . [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <surname>Squeeze-</surname>
          </string-name>
          and
          <string-name>
            <surname>-excitation</surname>
          </string-name>
          net-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>puter vision and pattern recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>