<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team coco at Factify 2: Optimizing Modality Representation and Multimodal Fusion for Multimodal Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kangshuai Guo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shichao Luo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruipeng Ma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanru Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shenzhen Institute for Advanced Study of UESTC</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Electronic Science and Technology of China</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>While social media has changed news dissemination and how humans obtain information, it has also become the main channel for disseminating fake news. Quickly identifying fake news on social media and curbing the spread of false information is crucial to purifying cyberspace and maintaining public safety. Exploring efficient modality representation and multimodal information fusion methods has been a hot topic in the field of multimodal fake news detection and fact verification. To this end, a new multimodal fact verification method is proposed: first, deep modality representations of text and images are extracted using large-scale pre-trained models. Second, a bidirectional-hybrid attention mechanism is introduced to fuse text and image features. The hybrid mechanism reduces redundant information generated during multimodal fusion and uses bidirectional feature fusion to ensure the integrity of information. Besides, we adopt an ensemble method to achieve better performance. Our team, coco, won the sixth prize (F1-score: 75.696%) in the Factify challenge hosted by De-Factify @ AAAI 2023. Extensive experiments including comparison experiments, analysis of parameter sensitivity, and an ablation study demonstrate the effectiveness of our proposed approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal fact verification</kwd>
        <kwd>Modality representation</kwd>
        <kwd>Multi-modal fusion</kwd>
        <kwd>De-Factify</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Due to the timeliness and profitability of social media, some people deliberately fabricate or
adapt news to generate fake news in order to attract attention and traffic. Fake news can easily be
disseminated alongside real news, thereby confusing the majority of users, which has brought
certain harm to the economy and society. Especially since the COVID-19 epidemic, a large
amount of fake news has emerged on social media, growing exponentially in a short
period of time and bringing serious negative impacts on society [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The current form of news is no longer limited to text, but is a combination of text, image, video,
and other modalities. Compared with traditional single-modal text news, multi-modal news
more easily attracts people’s attention. Fake news usually has highly emotionally provocative text
and visually impactful pictures or videos. In addition, because real information and fake news
are mixed together, fake news is generally difficult for humans to identify. Therefore, multimodal
fact verification is one of the effective ways to combat fake news. Multimodal fact verification
is defined as: discriminating a given multimodal claim (text + visual) as true/false given credible
news sources.</p>
      <p>
        The challenge Factify 2 1 hosted by the De-Factify team 2 provides a more complex and efficient
multimodal fact verification task than just classifying claims as true or false [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. The goal
is to design a method to classify the given claim text and images into one of five categories:
Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, and Refute. Fig.
1 shows some examples of all five categories.
1https://codalab.lisn.upsaclay.fr/competitions/8275
2https://aiisc.ai/defactify2/index.html
      </p>
      <p>To tackle this task, in this paper we propose a new multimodal fact verification method,
which improves multimodal fact verification accuracy by optimizing modality representation
and multimodal fusion. Specifically, deep modality representations of text and images are
extracted using large-scale pre-trained models combined with a positionally encoded
self-attention mechanism. Afterward, a bidirectional-hybrid attention mechanism is introduced
to achieve homomodal and cross-modal information fusion. The hybrid mechanism reduces
redundant information generated during multimodal fusion and uses bidirectional feature
fusion to ensure the integrity of information. Finally, we adopt an ensemble method to achieve
better performance.</p>
      <p>The main results of this paper can be summarized as follows:
• First, we explore and compare different pre-trained models and embedding methods for
modality representation, enabling efficient fact verification.
• Second, we improve detection performance by building embedding pairs of modality
representations, modeling multimodal alignment relationships, and performing
multimodal information fusion.
• Finally, we improve the effectiveness and generalization of the model through ensemble
learning.</p>
      <p>The rest of the paper is organized as follows: in the next section, we review previous
studies on multimodal fact checking and pre-trained models. Then, the details of the
framework of the proposed method are introduced in Section 3. After that, experimental results
and analysis are explained in Section 4. Finally, we give our conclusion in
Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Multimodal Fact Checking</title>
        <p>
          Existing works on multimodal fake news detection or fact-checking are reviewed briefly. Early
studies [
          <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
          ] on fake news detection or fact-checking were based only on unimodal
information (text content), and these research methods can be summarized as feature-building
techniques and deep learning techniques. The current form of news is no longer limited to text,
but is composed of multiple modalities such as text, images, and videos. Recent studies about
fake news detection or fact-checking have started to take images [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref9">9, 10, 11, 12</xref>
          ] and videos [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
into consideration. Compared with single-modal fake news detection techniques, multi-modal
fake news detection is more flexible, authentic, and accurate. Numerous studies [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
          ] have shown
that multimodal fake news detection models perform better than single modality models under
the same dataset. Most approaches for multimodal fake news detection or fact-checking are
based on cross-modality consistency checking [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] or fusion of multi-modal (text + visual)
information by modeling multi-modal alignment relationships [
          <xref ref-type="bibr" rid="ref14 ref19 ref20">19, 14, 20</xref>
          ]. The former focuses
on multimodal consistency measurement [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. SAFE [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] uses an Image Captioning model to
translate images into sentences, and then computes multimodal inconsistency by measuring the
sentence similarity between the original news text and the generated image captions. MCNN
[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] transforms text and visual features into a common feature space to calculate similarity
through sub-network weight sharing. The latter improves detection performance by computing
a fused representation of multimodal information [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. attRNN [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposes a recurrent neural
network based on a neuron-level attention mechanism to fuse image and text information. CARMN [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]
uses a collaborative attention mechanism to model bidirectional augmentation between text and
images. EMAF [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] extracts the target labels of the images, and then uses the capsule network
to fuse the nouns in the text with these target labels. Our approach focuses on multimodal
consistency measures, transforming textual and visual features into a common feature space to
compute similarity, which is inspired by SAFE and MCNN.
        </p>
        <sec id="sec-2-1-1">
          <title>ULMFiT</title>
          <p>Multi-lingual</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>MultiFiT</title>
          <p>Cross-lingual
Multi-task
+ Generation</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Semi-supervised Sequence Learning context2Vec</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>Pre-trained seq2seq</title>
        </sec>
        <sec id="sec-2-1-5">
          <title>ELMo</title>
          <p>Transformer
Bidirectional LM</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>BERT</title>
          <p>XLM
UDify MT-DNN</p>
        </sec>
        <sec id="sec-2-1-7">
          <title>MASS</title>
          <p>Knowledge distillation UniLM
Span prediction
MT-DN  RemLoovnegNerStPime
Remove NSP
More data</p>
        </sec>
        <sec id="sec-2-1-8">
          <title>SpanBERT</title>
        </sec>
        <sec id="sec-2-1-9">
          <title>RoBERTa</title>
        </sec>
        <sec id="sec-2-1-10">
          <title>XLNet</title>
          <p>+Knowledge Graph Cross-modal
Permutation LM
Transformer-XL
More data</p>
        </sec>
        <sec id="sec-2-1-11">
          <title>ERNIE (Tsinghua)</title>
          <p>Neural entity linker</p>
        </sec>
        <sec id="sec-2-1-12">
          <title>KnowBert</title>
        </sec>
        <sec id="sec-2-1-13">
          <title>VideoBERT CBT</title>
        </sec>
        <sec id="sec-2-1-14">
          <title>ViLBERT</title>
        </sec>
        <sec id="sec-2-1-15">
          <title>VisualBERT B2T2</title>
        </sec>
        <sec id="sec-2-1-16">
          <title>Unicoder-VL</title>
        </sec>
        <sec id="sec-2-1-17">
          <title>LXMERT</title>
        </sec>
        <sec id="sec-2-1-18">
          <title>VL-BERT</title>
        </sec>
        <sec id="sec-2-1-19">
          <title>UNITER GPT</title>
          <p>Larger model
More data
GPT-2
Defense</p>
        </sec>
        <sec id="sec-2-1-20">
          <title>Grover</title>
          <p>Whole Word Masking</p>
        </sec>
        <sec id="sec-2-1-21">
          <title>ERNIE (Baidu)</title>
        </sec>
        <sec id="sec-2-1-22">
          <title>BERT-wwm</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Pre-trained Models</title>
        <p>
          Transformer [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] became the preferred architecture for language models in 2017, followed by
the emergence of GPT [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and BERT[
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] in 2018, bringing Pre-Trained Models (PTMs) into a
new era [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Common PTMs are shown in Fig. 4. These PTMs are very large, with a
large number of parameters; they can capture information such as polysemy, morphology,
syntactic structure, and real-world knowledge from text, and can then be fine-tuned
to achieve amazing performance on downstream tasks [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. By now, fine-tuning large-scale PTMs for specific
tasks has become an industry consensus, giving rise to a series of large-scale
PLMs such as DeBERTa [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], RoBERTa [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], XLM-RoBERTa [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], XLNet [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], and SpanBERT
[
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. Fine-tuned with a few labeled task-specific examples, these PLMs have
set a new state of the art on many downstream tasks [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ].
        </p>
        <p>
          In the past 10 years, Convolutional Neural Network (CNN) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], as a model that is good
at capturing local features, has carried high hopes in the field of computer vision and
has led an era. However, the convolution operation lacks a global understanding of the
image itself and cannot model the dependencies between features, so it cannot fully utilize
contextual information. The Vision Transformer [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] was therefore proposed, an image classification
method based entirely on the self-attention mechanism. Compared with CNN, the Transformer
architecture has achieved good results in many visual tasks: since its self-attention mechanism
is not limited to local interaction features, it can mine long-distance dependencies and learn
the most appropriate inductive bias according to different task objectives [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]. This spurred the rise of
a series of large-scale Transformer-based visual pre-training models such as ViT [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ],
DeiT [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], Swin Transformer [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], and BEiT [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>Figure 3 illustrates the overview of the proposed framework. The input of each model contains
the claim text, the document text, the claim image, and the document image. The modality
representation part adopts DeBERTa as the pre-trained NLP model and DeiT as the pre-trained
CV model, and feeds the outputs of the pre-trained models to the embedding layer, which transforms
the modality representations into corresponding embeddings to help modal information
alignment. The multi-modality fusion part fuses this information from the homomodal (text pair,
image pair) and cross-modal (text-image pair, image-text pair) perspectives based on a bidirectional-hybrid
attention mechanism. The classifier predicts the probability of each category based on the
embeddings from modality representation and the embeddings from multi-modality fusion.
Finally, the outputs of the models are ensembled according to a score-driven strategy to improve
fact-checking effectiveness and generalization.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Modality Representation</title>
        <p>3.2.1. Pre-training</p>
        <p>
          Pre-training obtains task-independent models from large-scale data through
self-supervised learning. The exploration of pre-trained models is mainly devoted to deep semantic
representation and contextual semantic representation. Extensive work [
          <xref ref-type="bibr" rid="ref26 ref30">26, 30</xref>
          ] has shown
that models pre-trained on large corpora can learn general-purpose language representations,
avoiding training new models from scratch when solving downstream tasks. Compared with
other pre-trained models, DeBERTa implements disentangled attention and an enhanced mask
decoder. The former gives it stronger representation capabilities, and the latter is
used to avoid inconsistencies between pre-training tasks and downstream tasks. To this end, we
use DeBERTa as our pre-trained NLP model and DeiT as our pre-trained CV model for modality
representation. The abstraction of modality representation is shown in Fig. 4.
        </p>
        <p>3.2.2. Embedding</p>
        <p>
          In order to adapt to downstream tasks, the output of the pre-trained model is fed into the
embedding layer. As shown in Fig. 5, the text and image modality representations are mapped
to a unified embedding space, which enhances the alignment relationship between modalities,
including homomodal and cross-modal pairs, and is conducive to multi-modal fusion.
        </p>
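        <p>As a minimal sketch (not the authors' released code), the modality-representation stage described above can be written with PyTorch and Hugging Face transformers; the checkpoint names, the 512-dimensional shared embedding space, and the single linear projection are assumptions for illustration:</p>
        <preformat>
# Sketch: encode text with DeBERTa and images with DeiT, then project both
# into a shared embedding space (checkpoints and sizes are assumptions).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
text_encoder = AutoModel.from_pretrained("microsoft/deberta-base")
processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
image_encoder = AutoModel.from_pretrained("facebook/deit-base-distilled-patch16-224")

embed = nn.Linear(768, 512)  # maps both 768-d encoder outputs to one 512-d space

def represent(text, image):
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    pixels = processor(image, return_tensors="pt")
    with torch.no_grad():
        text_repr = text_encoder(**tokens).last_hidden_state    # (1, seq_len, 768)
        image_repr = image_encoder(**pixels).last_hidden_state  # (1, patches, 768)
    return embed(text_repr), embed(image_repr)
        </preformat>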
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-Modality Fusion</title>
        <p>For the homomodal pairs, we consider their mutual relationship by performing uni-modal (text or
image) similarity matching, which only needs to attend to the common modality representation.
For cross-modality, only part of the information contained in the image is related to the text,
and there is a large amount of redundant information irrelevant to the task. Therefore,
information fusion of both same-modality and cross-modality is necessary.</p>
        <p>
          The multi-head attention mechanism endows the model with the ability to jointly attend
to information from different representation subspaces at different positions [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. The
co-attention mechanism [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] is a variant of the standard multi-head self-attention mechanism
which contributes to multimodal information fusion. Their structures are shown in Fig. 6:
(a) multi-head attention [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ],
(b) multi-head self-attention block [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ],
(c) co-attention block [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ].
        </p>
        <p>Our bidirectional-hybrid attention mechanism considers both the same-modality and
cross-modality in multimodal fusion. The hybrid mechanism reduces redundant information generated
during multimodal fusion and uses bidirectional feature fusion to ensure the integrity of
information. For the same modality, we construct text pairs and image pairs based on the output of the
embedding layer. Correspondingly, for cross-modality we construct text-image pairs and image-text
pairs. Then multi-modal fusion is achieved by adopting the co-attention block (Fig. 6c).</p>
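        <p>A minimal sketch of the co-attention idea behind this bidirectional fusion, assuming PyTorch and a 512-dimensional embedding space; the block internals (residual connection, layer norm, head count) and the tensor shapes are illustrative assumptions rather than the exact architecture:</p>
        <preformat>
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Queries come from one modality; keys/values come from the other."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        fused, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + fused)  # residual keeps original information

# Bidirectional fusion over a cross-modal pair (shapes are assumptions).
text_to_image = CoAttentionBlock()
image_to_text = CoAttentionBlock()
claim_text = torch.randn(1, 32, 512)   # text embedding sequence
doc_image = torch.randn(1, 197, 512)   # image patch embedding sequence

fused_text = text_to_image(claim_text, doc_image)   # text attends to image
fused_image = image_to_text(doc_image, claim_text)  # image attends to text
        </preformat>
        <p>The same blocks can be applied to the homomodal pairs (text-text, image-image) to realize the four-way fusion described above.</p>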
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Classifier &amp; Ensemble Method</title>
        <p>3.4.1. Classifier</p>
        <p>The multimodal fact verification task is a multi-classification problem, specifically subdivided
into 5 categories: Support_Multimodal, Support_Text, Insufficient_Multimodal,
Insufficient_Text, and Refute.</p>
        <p>
          The embedding of modality representation undergoes multi-modal fusion through the
bidirectional-hybrid attention mechanism. The output passes through a regularization layer and then
enters a classifier, which consists of a linear layer and an activation function. The activation
function provides a nonlinear transformation. Common activation functions are
shown in Fig. 7. In deep neural networks, it is crucial to use a suitable mapping for nonlinear
fitting to complete the classification task [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ]. Our approach pays particular attention to the selection of
activation functions: specifically, two activation functions, ReLU and Swish (β = 1), are used
in the classifier and embedding layers of the model to achieve efficient multimodal fact-checking.
        </p>
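        <p>A minimal sketch of such a classifier head, assuming PyTorch; the hidden size is a guess, and Swish with β = 1 is written out explicitly as f(x) = x · sigmoid(βx):</p>
        <preformat>
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation f(x) = x * sigmoid(beta * x), with beta = 1 as in the text."""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

# Normalization layer followed by a linear classifier with a nonlinearity,
# ending in 5 logits for the five categories (hidden size is an assumption).
classifier = nn.Sequential(
    nn.LayerNorm(512),
    nn.Linear(512, 128),
    Swish(),
    nn.Linear(128, 5),
)
probs = classifier(torch.randn(1, 512)).softmax(dim=-1)  # per-category probability
        </preformat>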
        <p>3.4.2. Ensemble</p>
        <p>
          Ensemble prediction combines different prediction methods to improve the
performance of the final prediction. The performance of a single model is limited in many cases;
ensemble methods combine the abilities of multiple models to accommodate variation in
the samples by combining the results of each model [46].
        </p>
        <p>As shown in Fig. 3, the final output of our method is ensembled from two different multi-class
prediction models. We adopt a score-driven ensemble strategy, where weights are determined
by the validation-set F1 score of each model, in order to balance the effect of each model and
achieve better multimodal fact-checking results.</p>
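        <p>A minimal sketch of the score-driven ensemble, assuming two models and illustrative validation F1 values; the exact weighting formula is not spelled out in the text, so simple F1-proportional weights are shown:</p>
        <preformat>
import numpy as np

# Weight each model's predicted class probabilities by its validation-set F1.
val_f1 = np.array([0.742, 0.731])      # per-model validation F1 (illustrative)
weights = val_f1 / val_f1.sum()        # normalize so weights sum to 1

probs_a = np.array([0.60, 0.10, 0.10, 0.10, 0.10])  # model A over 5 classes
probs_b = np.array([0.40, 0.30, 0.10, 0.10, 0.10])  # model B over 5 classes

ensembled = weights[0] * probs_a + weights[1] * probs_b
prediction = int(ensembled.argmax())   # index of the predicted category
        </preformat>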
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          There are many open-source datasets in the field of automated fact-checking, such as LIAR
[47], FEVER [48], the Covid-19 Fake News dataset [49], and Claim matching beyond English [50].
Compared with the Factify 2 dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which contains complete image and textual information
about claims and reference documents, the aforementioned are all unimodal (text-only)
datasets. Each sample includes claim, claim_image, claim_ocr, document, document_image,
document_ocr, and category.
        </p>
        <p>
          The description for each attribute is as follows [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
• claim: the text of the claim, sourced from tweet A.
• claim_image: the image of tweet A.
• claim_ocr: OCR text of the claim image.
• document: the article text of the given reference document, which is tweet B.
• document_image: the image of tweet B.
• document_ocr: OCR text of the document image.
        </p>
        <p>• category: the assigned category, one of the five classes.</p>
        <p>The categories include:
• Support_Multimodal: both text and images of the claim are supported by the corresponding
document.
• Support_Text: the claim text is supported by the document text, but the claim images are not
relevant.
• Insufficient_Multimodal: the claim text is neither supported nor refuted by the document
text, but the images are similar to the document images.
• Insufficient_Text: both text and images of the claim are neither supported nor refuted by
the corresponding document.</p>
        <p>• Refute: a fake claim or fake image, as inferred from the corresponding document.</p>
        <p>
          The training set 3 contains 35,000 samples (7,000 samples per category), and the
validation set contains 7,500 samples (1,500 samples per category). The test set,
which is used to evaluate the leaderboard score, has the same specifications as the validation set [
          <xref ref-type="bibr" rid="ref36 ref4">4, 36</xref>
          ].
3https://drive.google.com/drive/folders/13JwnIBzDfe8a5E1anPkt7J90r4NBIYES
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>4.2.1. Testing Performance
Table 1 shows the performance on the testing set. Our approach achieved a weighted
average F1-score of 0.75696, winning the sixth prize in multi-modal fact checking. This result outperforms
the baseline by 11%. Compared with the Pre-CoFact method, our method pays more attention
to the prediction of the support categories, while the insufficient categories are slightly weaker.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Further Analysis</title>
        <p>4.3.1. Comparison Result
In order to show that different pre-trained models affect the modality representation, we conducted
comparative experiments. As shown in Fig. 8, in modality representation, using different
pre-trained models has a certain impact on the performance of fact-checking. It can be seen
that using the DeBERTa pre-trained model for modality representation achieves better results.</p>
        <p>[Figure 8: validation-set F1 for different pre-trained model and activation combinations: DeBERTa-ReLU, RoBERTa-ReLU, DeBERTa-Swish, RoBERTa-Swish, and XLM-RoBERTa.]</p>
        <p>4.3.2. Parameter Sensitivity
The presence of embedding layers helps modality representations better adapt to downstream
tasks. Mapping different modality representations into a unified embedding space reduces the
variance between modalities, which facilitates modality representation feature alignment and
cross-modal fusion. Parameter sensitivity experiments on embedding size are performed, which
explore the effect of embedding space size on fact-checking performance, as shown in Fig. 9.</p>
        <p>[Figure 9: validation-set F1 as a function of embedding size.]</p>
        <p>4.3.3. Ablation Study
• w/o attention: Each model removes the bidirectional-hybrid attention mechanism during
training and only uses the embedding of modality representation.
• w/o score weights: Model ensemble without the score-driven strategy; each model receives
the same weight.</p>
        <p>• w/o avg: Model ensemble strategy without averaging model predictions.</p>
        <p>The results are shown in Table 2. From this table, it is observed that all the components are
indispensable for the superior performance of our method.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we optimize modality representation and multimodal fusion to achieve efficient
multimodal fact-checking. Specifically, deep modality representations of text and images
are extracted using large-scale pre-trained models. Afterward, a bidirectional-hybrid attention
mechanism is introduced to achieve homomodal and cross-modal information fusion. To achieve
better performance, we adopted an ensemble method that weights several models. Extensive
experiments including comparison experiments, analysis of parameter sensitivity, and an ablation
study demonstrate the effectiveness of our proposed approach.
4Data from the corresponding paper, also the official results of De-Factify ’22.
5The final score is obtained by a weighted average of the category-wise scores by the organizers.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>
        We appreciate previous work [
        <xref ref-type="bibr" rid="ref4 ref43">4, 51, 43, 52</xref>
        ] and open resources [
        <xref ref-type="bibr" rid="ref36 ref44">44, 36</xref>
        ]. Based on this, we
conducted our work on De-Factify 2. We also appreciate the help from the De-Factify team and
program chairs. We appreciate Wu and Wang for the open resources they provided.</p>
      <p>[45] D. Misra, Mish: A self regularized non-monotonic activation function, arXiv
preprint arXiv:1908.08681 (2019).
[46] H. Liu, trymore: Solution to spatial dynamic wind power forecasting for KDD Cup 2022
(2022).
[47] W. Y. Wang, "Liar, liar pants on fire": A new benchmark dataset for fake news detection,
arXiv preprint arXiv:1705.00648 (2017).
[48] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for
fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018).
[49] P. Patwa, S. Sharma, S. Pykl, V. Guptha, G. Kumari, M. S. Akhtar, A. Ekbal, A. Das,
T. Chakraborty, Fighting an infodemic: Covid-19 fake news dataset, in: International
Workshop on Combating Online Hostile Posts in Regional Languages during Emergency
Situation, Springer, 2021, pp. 21–29.
[50] A. Kazemi, K. Garimella, D. Gaffney, S. A. Hale, Claim matching beyond English to scale
global fact-checking, arXiv preprint arXiv:2106.00853 (2021).
[51] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das,
T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for
fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking
and Hate Speech Detection, CEUR, 2022.
[52] L. Gao, Q. Zhang, X. Zhu, J. Song, H. T. Shen, Staircase sign method for boosting adversarial
attacks, arXiv preprint arXiv:2104.09722 (2021).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-H. M.</given-names>
            <surname>Kamal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kabir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yeasmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. I. A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Anwar</surname>
          </string-name>
          , et al.,
          <article-title>Covid-19-related infodemic and its impact on public health: A global social media analysis</article-title>
          ,
          <source>The American journal of tropical medicine and hygiene 103</source>
          (
          <year>2020</year>
          )
          <fpage>1621</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chinnakotla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Factify 2: A multimodal fake news and satire news dataset</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chinnakotla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Findings of factify 2: multimodal fake news detection</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
          </string-name>
          , et al.,
          <article-title>Factify: A multi-modal fact verification dataset</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Multimodal Fact-Checking and Hate Speech Detection (DE-FACTIFY)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>A richly annotated corpus for different tasks in automated fact-checking</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>01214</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotonya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Toni</surname>
          </string-name>
          ,
          <article-title>Explainable automated fact-checking for public health claims</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>09926</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>Multifc: A real-world multi-domain dataset for evidence-based fact checking of claims</article-title>
          , arXiv preprint arXiv:
          <year>1909</year>
          .
          <volume>03242</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nørregaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Horne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adalı</surname>
          </string-name>
          ,
          <article-title>Nela-gt-2018: A large multi-labelled news dataset for the study of misinformation in news articles</article-title>
          ,
          <source>in: Proceedings of the international AAAI conference on web and social media</source>
          , volume
          <volume>13</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>630</fpage>
          -
          <lpage>638</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Boididou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Andreadou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          , et al.,
          <article-title>Verifying multimedia use at MediaEval 2015</article-title>
          ,
          <source>MediaEval</source>
          <volume>3</volume>
          (
          <year>2015</year>
          )
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>r/Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>03854</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vatsa</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Newsbag: a multi-modal benchmark dataset for fake news detection</article-title>
          ,
          <source>in: CEUR Workshop Proc.,</source>
          volume
          <volume>2560</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Garimella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eckles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benevenuto</surname>
          </string-name>
          ,
          <article-title>A dataset of fact-checked images shared on whatsapp during the brazilian and indian elections</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>14</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>903</fpage>
          -
          <lpage>908</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Papadopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>A corpus of debunked and verified user-generated videos</article-title>
          ,
          <source>Online information review 43</source>
          (
          <year>2018</year>
          )
          <fpage>72</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , G. Xun,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>EANN: Event adversarial neural networks for multi-modal fake news detection</article-title>
          ,
          <source>in: Proceedings of the 24th acm sigkdd international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>849</fpage>
          -
          <lpage>857</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Knowledge-aware multi-modal adaptive graph convolutional networks for fake news detection</article-title>
          ,
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Deap-faked: knowledge graph based approach for fake news detection</article-title>
          ,
          <source>arXiv preprint arXiv:2107.10648</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Multimodal emergent fake news detection via meta neural process networks</article-title>
          ,
          <source>in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3708</fpage>
          -
          <lpage>3716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdelnabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <article-title>Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>14940</fpage>
          -
          <lpage>14949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Luo,
          <article-title>Multimodal fusion with recurrent neural networks for rumor detection on microblogs</article-title>
          ,
          <source>in: Proceedings of the 25th ACM international conference on Multimedia</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Fmfn: Fine-grained multimodal fusion networks for fake news detection</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>1093</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Improving fake news detection by using an entity-enhanced framework to fuse diverse multimodal clues</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Multimedia</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1212</fpage>
          -
          <lpage>1220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Khattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Goud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>MVAE: Multimodal variational autoencoder for fake news detection</article-title>
          ,
          <source>The World Wide Web Conference</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Detecting fake news by exploring the consistency of multimodal data</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102610</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102437</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Entity-oriented multi-modal alignment and fusion network for fake news detection</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Pre-trained models: Past, present and future</article-title>
          ,
          <source>AI Open</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>225</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Pre-trained models for natural language processing: A survey</article-title>
          ,
          <source>Science China Technological Sciences</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>1872</fpage>
          -
          <lpage>1897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Deberta: Decoding-enhanced bert with disentangled attention</article-title>
          ,
          <source>arXiv preprint arXiv:2006.03654</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02116</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>Spanbert: Improving pre-training by representing and predicting spans</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>64</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification</article-title>
          ,
          <source>arXiv preprint arXiv:2201.11664</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Denker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hubbard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Jackel</surname>
          </string-name>
          ,
          <article-title>Backpropagation applied to handwritten zip code recognition</article-title>
          ,
          <source>Neural computation</source>
          <volume>1</volume>
          (
          <year>1989</year>
          )
          <fpage>541</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          , et al.,
          <article-title>A survey on vision transformer</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Beit: Bert pre-training of image transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2106.08254</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Point to rectangle matching for image text retrieval</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4977</fpage>
          -
          <lpage>4986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion with co-attention networks for fake news detection</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2560</fpage>
          -
          <lpage>2569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>D.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>Mish: A self regularized non-monotonic neural activation function</article-title>
          ,
          <source>arXiv preprint arXiv:1908.08681</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>