<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team gzw at Factify 2: Multimodal Attention and Fusion Networks for Multi-Modal Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhenwei Gao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Electronic and Information Engineering of UESTC in Guangdong</institution>
          ,
          <addr-line>523808</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, detecting fake news on social media platforms has become a top priority, since the widespread dissemination of fake news may mislead readers and have negative effects. To address this problem, we propose a Multimodal Attention and Fusion Network (MAFN) for multi-modal fact verification. Specifically, we employ DeBERTa and DeiT to obtain better representations for text and images, respectively. Then, we feed the obtained representations of images and text into a multi-modal attention network to model both inter-modality and intra-modality relationships. Besides, we adopt an ensemble strategy by using different pre-trained models in MAFN to achieve better performance. We conduct a series of ablation studies to verify the impact of each designed module on performance. Our method (team gzw) ranked fifth on the leaderboard of the Factify Challenge hosted by De-Factify@AAAI 2023, achieving an F1 score of 76.051%, which shows that our model achieves competitive performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal Attention</kwd>
        <kwd>Pre-trained Model</kwd>
        <kwd>Self-Attention</kwd>
        <kwd>De-Factify</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social media has become a mainstream platform for people to communicate their ideas, owing to its increasing convenience and intelligence. However, every coin has two sides: it has also gradually become an ideal place for the widespread dissemination of fake news. Since fake news maliciously distorts and fabricates facts, its extensive dissemination has extremely negative
impacts on individuals and society. In addition, multimedia intelligence [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] can help
people better understand the world. Therefore, it is urgent to detect fake news
with multimedia content on social platforms.
      </p>
      <p>
        In order to facilitate the detection of fake news, many approaches have been proposed. Early
attempts (e.g., snopes.com) mainly verified fake news through experts or institutions in
related fields, which is obviously time-consuming and labor-intensive. Therefore, automatically
detecting fake news has become a key research direction and has drawn much attention in recent
years. Basically, existing studies on automatic fake news detection can be summarized into two
categories: (1) The first is traditional learning methods [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ], which design plenty of
hand-crafted features from the media content of posts and the social context of users. With these
sophisticated features, SVM classifiers [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ] and decision trees [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] have been trained to debunk
fake news. However, the content of fake news is highly complicated and hard to fully capture
with hand-crafted features. (2) With deep neural networks having achieved immense success in
learning image and textual representations and their downstream tasks [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], researchers
have realized that deep learning plays a very important role in detecting fake news. Thus, deep
learning based methods [
        <xref ref-type="bibr" rid="ref10 ref11 ref7">7, 10, 11</xref>
        ] have been proposed to automatically capture deep features in
an end-to-end way. For example, Ma et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employ Recurrent Neural Networks (RNNs) to
learn the hidden features from posts. Yu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use Convolutional Neural Networks (CNNs)
to obtain key features and their high-level interactions from fake news. However, most of the
above methods focus only on textual content and ignore the multi-modal information of posts
(such as text and images), which is a key component of social media platforms.
      </p>
      <p>
        De-Factify2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a competition hosted at the AAAI 2023 workshop on multi-modal fact checking
and hate speech detection, and an extension of the De-Factify [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] competition. This workshop
aims to encourage researchers from inter-disciplinary domains working on multi-modality
and/or fact checking to come together and work on multi-modal (images, memes, videos) fact
checking. The goal of this competition is to design a method to classify the given text and images
into one of the five categories: Support_Multimodal, Support_Text, Insufficient_Multimodal,
Insufficient_Text, and Refute, as displayed in Figure 1. For more details, we refer readers to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
To tackle this problem, this paper proposes a Multimodal Attention and Fusion Network (MAFN)
with pre-trained models and co-attention networks for the shared task, which first
extracts features from both text and images and then fuses this information through co-attention
modules. Specifically, two powerful Transformer-based pre-trained models, DeBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and
DeiT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], are adopted to extract features of text and images, respectively, from both claims and documents.
Based on that, several co-attention modules are designed to fuse the contexts
of text and images. Afterwards, we apply a self-attention mechanism to obtain the corresponding
representative embeddings. Finally, these embeddings are concatenated to form
the final embedding used to classify the category of the news.
      </p>
      <p>The main contributions of this paper can be summarized as follows:
• We leverage an ensemble strategy based on different pre-trained models to obtain better
representations for the claims and documents.
• We design a multi-modal attention mechanism and a fusion module to learn the semantic
correlations at the intra-modality level (text or images from claims and documents) and the
inter-modality dependencies.
• Our ensemble model outperforms the baseline by 17.0% in terms of test score, while there
is still a gap of about 7.6% compared to the first prize. Besides, a series of ablation studies
were conducted to study the impact of the designed modules on the overall
performance of the model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Fake News Detection</title>
        <p>
          Recently, fake news detection with multi-modality has received considerable attention. Several
approaches [
          <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19">16, 17, 18, 19</xref>
          ] conduct fake news detection based on the multimedia content and
obtain superior performance. Jin et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] propose a multi-modality based fake news detection
model, which extracts the multi-modality information including visual, textual and social context
features, and then fuses them with an attention mechanism. Khattar et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] introduce a multimodal
variational autoencoder that learns a shared representation of text and images. Singhal et
al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] make use of the pre-trained BERT to learn text features and apply VGG-19 pre-trained
on the ImageNet dataset to learn image features. Wang et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] design a novel knowledge-driven
multimodal graph convolutional network to jointly model the textual information, knowledge
concepts and visual information into a unified framework for fake news detection. MCAN [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]
adopts a large-scale pre-trained NLP model and a pre-trained computer vision (CV) model to
obtain features from text and images, and then fuses them, together with frequency-domain features
of the images, using multiple co-attention layers.
        </p>
        <p>These methods demonstrate that multi-modal content can also help the model to detect
fake news. Thus, we design a multimodal attention and fusion network to mine the semantic
correlation among multimedia content to facilitate fact verification.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large-Scale Pre-trained Models</title>
        <p>
          Pre-trained models have achieved significant success across numerous tasks. Transformer [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], first introduced for machine translation, has inspired many competitive approaches in natural
language processing (NLP) and computer vision tasks. Specifically, Transformer-based pre-trained
language models (PLMs) have significantly improved the performance of various NLP tasks due
to their ability to understand contextualized information from the pre-training data. GPT [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
replaces bi-LSTMs with a left-to-right Transformer to better extract contextual semantics by a
global attention mechanism. DeBERTa [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] proposes a novel disentangled attention mechanism
and a new virtual adversarial training method to significantly improve the efficiency of pre-training
and the performance of downstream tasks.
        </p>
        <p>
          Vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] is a Transformer encoder architecture that splits raw images into patches and achieves
competitive image classification results compared to state-of-the-art convolutional networks,
demonstrating that convolution-free networks can still capture visual relations effectively.
Several follow-up studies based on ViT have since been conducted.
For example, DeiT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] develops a novel distillation procedure to ensure that the student learns
better knowledge from the teacher through attention.
        </p>
        <p>In short, pre-trained models can benefit the process of capturing rich information for
downstream tasks and also reduce the cost of training from scratch. These advantages drive us
to obtain better contextual embeddings of images and text with recent pre-trained models.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Attention Mechanism</title>
        <p>
          Attention mechanisms have been demonstrated to be effective in various tasks such as image captioning [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
machine translation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and recommendation systems [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Concretely, Bahdanau et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
first introduce attention in the machine translation task to allow the model to automatically
search for parts of a source sentence that are relevant to predicting a target word. Recently,
attention mechanisms have been incorporated into fake news detection. For example, Chen
et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] propose a deep attention model based on recurrent neural networks (RNNs) to
selectively learn temporal hidden representations of sequential posts for identifying fake news.
        </p>
        <p>Inspired by these successful applications of the attention mechanism, we introduce a co-attention
network to compute the intra-modality and inter-modality relationships of image
tokens and text words.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>Let $\mathcal{D} = \{(T_c^i, I_c^i, T_d^i, I_d^i)\}_{i=1}^{N}$ denote a set of $N$ training samples, where the $i$-th sample is composed of the claim text $T_c^i$, the claim image $I_c^i$, the document text $T_d^i$, and the document image $I_d^i$. Let $\mathcal{Y} = \{y_1, y_2, \cdots, y_N\}$ denote the set of corresponding labels, where $y_i \in \{$Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, Refute$\}$. The task of this competition is to classify a data sample into one of the five categories when given its claim text, claim image, document text and document image.</p>
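        <p>For concreteness, a single training sample and the label space can be represented as in the following sketch; the class and field names are illustrative and are not the official dataset schema.</p>
        <preformat>
# Minimal sketch of one training sample in our formulation (illustrative names).
from dataclasses import dataclass
from PIL import Image

LABELS = [
    "Support_Multimodal", "Support_Text",
    "Insufficient_Multimodal", "Insufficient_Text", "Refute",
]

@dataclass
class FactifySample:
    claim_text: str            # T_c
    claim_image: Image.Image   # I_c
    doc_text: str              # T_d
    doc_image: Image.Image     # I_d
    label: str                 # y, one of LABELS
        </preformat>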
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Overall Framework</title>
        <p>
          Inspired by [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], we introduce a Multimodal Attention and Fusion Network (MAFN) to improve
the performance of multimodal fact verification. By exploiting a multi-modal attention network
for multi-modal feature fusion, our model can capture the intra-modality and inter-modality
relationships of the textual and visual content of fake news. The overall architecture is illustrated in
Figure 2. Specifically, our model consists of the following components:
• Text and Image Encoding Network: The abundance of pre-trained models enables
us to extract rich information without training from scratch. We first use DeBERTa [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
as our pre-trained NLP model and DeiT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as our pre-trained CV model to precisely
capture the semantics of both the text and the image, and then employ a fully-connected
layer followed by a ReLU function to further extract the multi-modal embeddings.
• Multi-Modality Fusion Network: As the intra-modality (images/text from the claim
and document) and inter-modality (images and text from the claim/document) relationships
can facilitate the detection of fake news, we use the multi-modality fusion part to fuse
information from the same modality and from different modalities.
• Category Classifier aims to classify each piece of data in the dataset into one of the five
categories with a fully-connected layer followed by a corresponding activation function.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text and Image Encoding Network</title>
        <p>Text Encoding Network: In order to represent the rich semantic information of sentences, we employ DeBERTa as the core module of our textual language model. Given a sentence, we split it into $n$ words with a tokenization technique, $W = \{w_1, w_2, \cdots, w_n\}$, and we denote the transformed features as $T = \{t_1, \cdots, t_n\}$, with $t_i$ corresponding to the transformed feature of $w_i$. The word representations $T$ are calculated by DeBERTa:
$$T = \{t_1, \cdots, t_n\} = \mathrm{DeBERTa}(W), \quad (1)$$
where $t_i \in \mathbb{R}^{d_w}$ is the last hidden state of the corresponding token in DeBERTa, and $d_w$ is the dimension of the word embedding. Specifically, we feed the claim text and the document text into DeBERTa separately to obtain the corresponding features, i.e., $T_{ct} = \mathrm{DeBERTa}(W_{ct})$ and $T_{dt} = \mathrm{DeBERTa}(W_{dt})$, where the output dimension of DeBERTa is 768. Then we use an embedding layer to transform the pre-trained embeddings into embeddings for our task. Specifically, the output of the embedding layer is calculated as follows:
$$E_{ct} = \mathrm{Emb}(T_{ct}), \quad E_{dt} = \mathrm{Emb}(T_{dt}). \quad (2)$$</p>
        <p>Here $\mathrm{Emb}$ is composed of a fully-connected layer and an activation function, and $E_{ct}$, $E_{dt}$ are $d$-dimensional vectors. It is noted that the activation functions we use in $\mathrm{Emb}$ are ReLU and Mish [29], whose results are compared in our experiments.</p>
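        <p>As an illustration, the following is a minimal PyTorch/Hugging Face sketch of this text encoding path under the settings of Section 4.1 (frozen backbone, $d = 512$); the module and variable names are ours, not a released implementation.</p>
        <preformat>
# Illustrative sketch of the text encoding path (Eqs. 1-2): a frozen DeBERTa
# backbone followed by the Emb layer (Linear + ReLU or Mish). Names are ours.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextEncoder(nn.Module):
    def __init__(self, d=512, act="relu"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
        self.backbone = AutoModel.from_pretrained("microsoft/deberta-base")
        for p in self.backbone.parameters():   # frozen, as in Section 4.1
            p.requires_grad = False
        activation = nn.ReLU() if act == "relu" else nn.Mish()
        self.emb = nn.Sequential(nn.Linear(768, d), activation)   # Emb(.)

    def forward(self, sentences):
        tok = self.tokenizer(sentences, padding=True, truncation=True,
                             max_length=512, return_tensors="pt")
        hidden = self.backbone(**tok).last_hidden_state   # (batch, n, 768), Eq. (1)
        return self.emb(hidden)                            # (batch, n, d),   Eq. (2)
        </preformat>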
        <p>Image Encoding Network: For each input image, we use the pre-trained DeiT model to extract token features. The output is a set of token features $V = \{v_1, \cdots, v_m\}$, where $m$ denotes the number of image tokens. The parameters of the pre-trained DeiT are frozen, which means we do not update them during training. In other words, given the image $I$, the feature extraction can be expressed as:</p>
        <p>$$V = \{v_1, \cdots, v_m\} = \mathrm{DeiT}(I), \quad (3)$$
where $v_j \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of the image embedding. Specifically, we feed the claim image and the document image into DeiT separately and get the corresponding features, i.e., $V_{ci} = \mathrm{DeiT}(I_{ci})$ and $V_{di} = \mathrm{DeiT}(I_{di})$, where the output dimension of DeiT is 768. Then we use the embedding layer to transform the pre-trained embeddings into embeddings for our task. Specifically, the output of the embedding layer is calculated as follows:
$$E_{ci} = \mathrm{Emb}(V_{ci}), \quad E_{di} = \mathrm{Emb}(V_{di}), \quad (4)$$
where the $\mathrm{Emb}$ module is the same as in Equation (2), and $E_{ci}$, $E_{di}$ are $d$-dimensional vectors.</p>
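        <p>A corresponding sketch of the image encoding path is shown below; the preprocessing follows Section 4.1, while the normalization statistics and class names are our assumptions.</p>
        <preformat>
# Illustrative sketch of the image encoding path (Eqs. 3-4): a frozen DeiT
# backbone produces token features, then the Emb layer projects them to d dims.
import torch.nn as nn
from torchvision import transforms
from transformers import AutoModel

IMAGE_TRANSFORM = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

class ImageEncoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("facebook/deit-base-patch16-224")
        for p in self.backbone.parameters():   # frozen during training
            p.requires_grad = False
        self.emb = nn.Sequential(nn.Linear(768, d), nn.ReLU())   # Emb(.)

    def forward(self, pixel_values):   # (batch, 3, 224, 224)
        tokens = self.backbone(pixel_values=pixel_values).last_hidden_state  # Eq. (3)
        return self.emb(tokens)        # (batch, m, d), Eq. (4)
        </preformat>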
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-Modality Fusion</title>
        <p>
          The co-attention block has been widely used in VQA tasks [30], as it can capture dependencies
between different inputs. Thus, after generating embeddings of text and images, we adopt multiple
co-attention layers, as in [
          <xref ref-type="bibr" rid="ref20">20, 31</xref>
          ], to fuse the embeddings and exploit the intra-/inter-modality
relations for the detection of fake news.
        </p>
        <p>First, we employ a co-attention layer to separately fuse 1) claim images and document images and 2) claim text and document text (fusing features from the same modality). Then
we learn the inter-modal alignment by fusing features from different modalities (images and
text from the claim/document). Besides, the relation between the text and image of a claim
or document can be viewed as checking whether they are related. Therefore, we also
adopt the co-attention layer for fusing 3) images and text of claims and 4) images and text of
documents (fusing features from different modalities).</p>
        <p>Therefore, we use the co-attention layer for fusion. Specifically, each co-attention layer takes two inputs $X_A$ and $X_B$ and produces two outputs $O_A$ and $O_B$. We first project $X_A$/$X_B$ into query $Q \in \mathbb{R}^{l \times d}$, key $K \in \mathbb{R}^{l \times d}$ and value $V \in \mathbb{R}^{l \times d}$ matrices:
$$Q_A = X_A W_A^Q, \quad K_A = X_A W_A^K, \quad V_A = X_A W_A^V, \quad (5)$$
$$Q_B = X_B W_B^Q, \quad K_B = X_B W_B^K, \quad V_B = X_B W_B^V, \quad (6)$$
where $W_A^Q, W_A^K, W_A^V, W_B^Q, W_B^K, W_B^V \in \mathbb{R}^{d \times d}$.</p>
        <p>
          We then employ the attention mechanism together with residual connections to provide additional capacity for more complex reasoning in our aggregation functions. The specific expressions are:
$$\tilde{Z}_A = \mathrm{LN}\Big(X_A + \mathrm{softmax}\Big(\frac{Q_A K_B^{\top}}{\sqrt{d}}\Big) V_B\Big), \quad \tilde{Z}_B = \mathrm{LN}\Big(X_B + \mathrm{softmax}\Big(\frac{Q_B K_A^{\top}}{\sqrt{d}}\Big) V_A\Big), \quad (7)$$
$$O_A = \mathrm{LN}(\tilde{Z}_A + \mathrm{FFN}(\tilde{Z}_A)), \quad O_B = \mathrm{LN}(\tilde{Z}_B + \mathrm{FFN}(\tilde{Z}_B)), \quad (8)$$
where $\mathrm{LN}$ is Layer Normalization and $\mathrm{FFN}$ is the same feed-forward network as in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Now we can use the co-attention layer to fuse features from the same modality (or from different modalities):
$$(O_{ci}, O_{di}) = \mathrm{CoAtt}(E_{ci}, E_{di}), \quad (O_{ct}, O_{dt}) = \mathrm{CoAtt}(E_{ct}, E_{dt}),$$
$$(O_{ci,ct}, O_{ct,ci}) = \mathrm{CoAtt}(E_{ci}, E_{ct}), \quad (O_{di,dt}, O_{dt,di}) = \mathrm{CoAtt}(E_{di}, E_{dt}), \quad (9)$$
where $\mathrm{CoAtt}$ denotes the co-attention layer.
        </p>
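        <p>The following single-head PyTorch sketch illustrates one co-attention layer as described by Equations (5)-(8); the paper itself uses 4 attention heads, and the class and variable names here are ours.</p>
        <preformat>
# Illustrative single-head sketch of one co-attention layer (Eqs. 5-8).
import math
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, d=512, ffn_dim=1024):
        super().__init__()
        self.proj_a = nn.ModuleDict({k: nn.Linear(d, d) for k in ("q", "k", "v")})
        self.proj_b = nn.ModuleDict({k: nn.Linear(d, d) for k in ("q", "k", "v")})
        self.ln1_a, self.ln1_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ln2_a, self.ln2_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_a = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))
        self.ffn_b = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))
        self.scale = math.sqrt(d)

    def _attend(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        return torch.matmul(scores.softmax(dim=-1), v)

    def forward(self, x_a, x_b):   # (batch, len_a, d), (batch, len_b, d)
        qa, ka, va = (self.proj_a[k](x_a) for k in ("q", "k", "v"))   # Eq. (5)
        qb, kb, vb = (self.proj_b[k](x_b) for k in ("q", "k", "v"))   # Eq. (6)
        z_a = self.ln1_a(x_a + self._attend(qa, kb, vb))              # Eq. (7)
        z_b = self.ln1_b(x_b + self._attend(qb, ka, va))
        o_a = self.ln2_a(z_a + self.ffn_a(z_a))                       # Eq. (8)
        o_b = self.ln2_b(z_b + self.ffn_b(z_b))
        return o_a, o_b
        </preformat>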
        <p>
          Afterwards, the aggregation function is adopted to aggregate the fused tokens into a representative token. That is, given a fused embedding $O = \{o_1, \cdots, o_l\} \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length, we perform the self-attention mechanism [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] over the fused tokens, which adopts the average feature $\bar{o} = \frac{1}{l} \sum_{j=1}^{l} o_j$ as the query and aggregates all the tokens to obtain a representative token. Besides, we also feed $E_{ci}$, $E_{ct}$, $E_{di}$, $E_{dt}$ into the aggregation function for classification.</p>
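        <p>A minimal sketch of this aggregation step is given below; the exact implementation details (e.g., the use of nn.MultiheadAttention and 4 heads) are our assumptions.</p>
        <preformat>
# Illustrative sketch of the aggregation function: the mean of the fused tokens is
# used as the query of one attention step over all tokens, giving a single
# representative vector per sequence.
import torch.nn as nn

class AttentionAggregator(nn.Module):
    def __init__(self, d=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)

    def forward(self, tokens):                        # (batch, l, d)
        query = tokens.mean(dim=1, keepdim=True)      # average feature as the query
        pooled, _ = self.attn(query, tokens, tokens)  # attend over all fused tokens
        return pooled.squeeze(1)                      # (batch, d) representative token
        </preformat>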
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Category Classifier</title>
        <p>As the features fused by the co-attention layers can represent the complex relationship between the claim and the document, we first concatenate the 8 aggregated outputs $O_{ci}$, $O_{di}$, $O_{ci,ct}$, $O_{ct,ci}$, $O_{ct}$, $O_{dt}$, $O_{di,dt}$, $O_{dt,di}$ from the co-attention layers: $F_{co} = (O_{ci} : O_{di} : O_{ci,ct} : O_{ct,ci} : O_{ct} : O_{dt} : O_{di,dt} : O_{dt,di})$. It is worth noting that we also use the aggregated embeddings, since the original information can provide some clues for classifying the news; thus we concatenate the 4 aggregated embeddings $F_{emb} = (E_{ci} : E_{ct} : E_{di} : E_{dt})$. Then we concatenate these two features, $F = (F_{co} : F_{emb})$, and feed $F$ to the subsequent category classification network to predict the label of the given claim and document. Afterwards, the output of the classifier is the probability computed as follows:
$$H^{(1)} = \mathrm{act}(F W^{(0)}), \quad (10)$$
$$H^{(2)} = \mathrm{act}(H^{(1)} W^{(1)}), \quad (11)$$
$$\hat{y} = \mathrm{softmax}(H^{(2)} W^{(2)}), \quad (12)$$
where $W^{(0)} \in \mathbb{R}^{12d \times d}$, $W^{(1)} \in \mathbb{R}^{d \times d_1}$, and $W^{(2)} \in \mathbb{R}^{d_1 \times 5}$. Note that $\mathrm{act}$ is the same activation as in $\mathrm{Emb}$, for which both ReLU and Mish are tested.</p>
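        <p>As a concrete illustration of Equations (10)-(12), the sketch below builds the classifier as a small MLP over the 12 concatenated vectors; layer names and the fixed ReLU choice are ours.</p>
        <preformat>
# Illustrative sketch of the category classifier (Eqs. 10-12).
import torch
import torch.nn as nn

class CategoryClassifier(nn.Module):
    def __init__(self, d=512, d1=1024, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(12 * d, d), nn.ReLU(),   # W(0): 12d x d
            nn.Linear(d, d1), nn.ReLU(),       # W(1): d x d1
            nn.Linear(d1, num_classes),        # W(2): d1 x 5
        )

    def forward(self, aggregated_vectors):     # list of 12 tensors, each (batch, d)
        f = torch.cat(aggregated_vectors, dim=-1)
        return self.net(f).softmax(dim=-1)     # predicted class probabilities
        </preformat>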
        <p>In the end, we minimize the cross-entropy loss $\mathcal{L}$ to verify a multimodal claim:
$$\mathcal{L} = -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} y_i \log(\hat{y}_i). \quad (13)$$</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Ensemble Method</title>
        <p>Each classifier may have its own strengths and weaknesses, and ensemble methods have been widely
used to enhance performance. If a model achieves a higher score on the validation set, we
naturally want it to have a larger weight in the final integrated model; thus we use different
weights to integrate the models. The formula is as follows:
$$P = w_1 \times P_1 + w_2 \times P_2 + \cdots + w_k \times P_k, \quad (14)$$
where $P_1, \cdots, P_k$ are the predicted probabilities from the corresponding models, $w_1, \cdots, w_k$ are
the weights of the corresponding models, and $k$ is the number of trained models. It is noted
that the weight parameters are tuned by hand.</p>
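        <p>The weighted ensemble of Equation (14) amounts to the following short sketch; the function name is ours and the example weights are the ones reported in Section 4.1.</p>
        <preformat>
# Illustrative sketch of the weighted ensemble (Eq. 14): class probabilities from
# several trained models are combined with hand-tuned weights before the argmax.
import torch

def ensemble_predict(prob_list, weights):
    """prob_list: list of (batch, 5) probability tensors; weights: list of floats."""
    combined = sum(w * p for w, p in zip(weights, prob_list))
    return combined.argmax(dim=-1)

# e.g. preds = ensemble_predict([p1, p2, p3, p4, p5], [0.7, 0.5, 0.6, 0.7, 0.6])
        </preformat>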
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Implementation</title>
        <p>
          Dataset. Factify [
          <xref ref-type="bibr" rid="ref12">12, 32</xref>
          ] is a dataset for multi-modal fact verification, which contains claim images,
claim texts, reference textual documents and document images. Each data sample contains a reliable
source of information, called a "document", and another source whose validity must be assessed,
called a "claim". Both the document and the claim have a corresponding image. Each
data sample belongs to one of five categories: Support_Text, Support_Multimodal,
Insufficient_Text, Insufficient_Multimodal and Refute. The labels are defined as:
• Support_Multimodal: both the claim text and image are similar to those of the document.
• Support_Text: the claim text is similar or entailed, but the images of the document and claim
are not similar.
• Insufficient_Multimodal: the claim text is neither supported nor refuted by the document,
but the images are similar to those of the document.
• Insufficient_Text: both the text and images of the claim are neither supported nor refuted
by the document, although the claim text may share common words with the
document text.
• Refute: the images and/or text of the claim and document are completely contradictory,
i.e., the claim is false/fake.
        </p>
        <p>
          The training set contains 35,000 samples with 7,000 samples per class, and the validation set
includes 7,500 samples with 1,500 samples per class. The test set, which is used to evaluate the
private score, also contains 7,500 samples. For more details, we refer readers to [
          <xref ref-type="bibr" rid="ref12">12, 33</xref>
          ].
Implementation Details. The dimension $d$ was set to 512, the hidden dimension of the fully-connected
layer was set to 1024, the output dimension of DeBERTa and DeiT was 768, and the number
of attention heads was set to 4. The dropout rate was 0.1, and the max sequence length was 512. The
batch size was 64, the learning rate was set to 2e-5, the number of training epochs was 30,
and the random seed was set to 24. The weight coefficients of the different models were set to
0.7, 0.5, 0.6, 0.7 and 0.6, manually tuned according to the validation score. The pre-trained DeBERTa
was deberta-base (https://huggingface.co/microsoft/deberta-base), and the DeiT was deit-base-patch16-224
(https://huggingface.co/facebook/deit-base-patch16-224). The parameters of the two
pre-trained models were frozen, i.e., we did not update them during training.
All images were transformed by resizing to 256, center cropping to 224, and
normalizing. This image transformation was the only preprocessing step, after which the text and
processed images were stored in pickle files for training and evaluation. All experiments were
conducted on an Nvidia RTX A6000 GPU.
        </p>
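        <p>A minimal sketch of this offline preprocessing and caching step is shown below; file names, dictionary keys and normalization statistics are illustrative assumptions rather than the released pipeline.</p>
        <preformat>
# Illustrative sketch of the offline preprocessing: each image is resized,
# center-cropped and normalized once, then cached with the raw text in a pickle.
import pickle
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

def preprocess(samples, out_path="train.pkl"):
    cache = []
    for s in samples:   # each s: dict with claim/document texts and image paths
        cache.append({
            "claim_text": s["claim_text"],
            "doc_text": s["doc_text"],
            "claim_image": transform(Image.open(s["claim_image_path"]).convert("RGB")),
            "doc_image": transform(Image.open(s["doc_image_path"]).convert("RGB")),
            "label": s["label"],
        })
    with open(out_path, "wb") as f:
        pickle.dump(cache, f)
        </preformat>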
        <p>Evaluation Metric. The weighted average F1 score across 5 categories is adopted to evaluate
the performance.</p>
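        <p>For reference, the metric can be computed as in the following one-line sketch using scikit-learn.</p>
        <preformat>
# Illustrative computation of the evaluation metric: weighted average F1 over
# the five classes.
from sklearn.metrics import f1_score

def weighted_f1(y_true, y_pred):
    return f1_score(y_true, y_pred, average="weighted")
        </preformat>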
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Testing Performance</title>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>To study the impact of each module, we carried out a series of ablation studies to verify the
effectiveness of the designed modules. As shown in Table 3, applying co-attention only within the
same modality (w/o CoAtt(A, B)) is insufficient, which demonstrates the need for modeling
dependencies between different modalities. In addition, if co-attention is only applied across
different modalities (w/o CoAtt(A, A)), the model is not able to distinguish the difference
between the claim and the document, which also affects performance. Finally, if the
co-attention module is removed completely (w/o CoAtt), the performance drops drastically, which
justifies the use of co-attention on both the same modality and different modalities.</p>
        <p>We also explored the effectiveness of the self-attention module. If it is replaced by a simple
mean operation, a large performance drop can be observed (see Table 4), which proves that
the model can focus on important sequences through the self-attention module. Meanwhile,
it is evident that without concatenating $F_{emb}$ into the final embedding $F$, the performance
obviously degrades.</p>
        <p>It is noted that our ensemble method slightly improves the performance compared to
PreCoFact. Our ensemble includes MAFN (model1 in Table 2), MAFN with DeBERTa replaced by
XLM-RoBERTa (model2 in Table 2), MAFN with DeBERTa replaced by RoBERTa (model3 in Table 2),
MAFN with DeBERTa replaced by RoBERTa and ReLU replaced by Mish (model4 in Table 2), and
MAFN with ReLU replaced by Mish (model5 in Table 2). We report the performance of each model
in Table 2, and we ensemble the models using Equation (14).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Visualization</title>
        <p>Figure 3 visualizes some verification examples by our model and the baseline. It can be observed
that our model is superior to the baseline on multimodal fact verification. On the left side of
Figure 3, we can intuitively see that the content of the two pictures is similar; for the text,
the claim and document differ in length and sentence structure, but the semantics are the same.
Our model classifies this case correctly, demonstrating
that it can learn high-level semantic connections between claim and document texts,
which we attribute to the use of the co-attention module. The example on the right also shows
that our model can understand high-level semantic information.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed a multimodal fact verification method called MAFN, which utilizes
pre-trained models and multiple co-attention networks to alleviate the effect of fake news.
To further improve the performance, we adopted an ensemble method that weights several
different pre-trained models. The ablation studies demonstrate the effectiveness of our proposed
approach, and the test scores also illustrate the effectiveness of our model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhenwei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yadan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Heng</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Point to rectangle matching for image text retrieval</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , p.
          <fpage>4977</fpage>
          -
          <lpage>4986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jingjing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiangbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Discovering attractive segments in the user-generated video streams</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <fpage>102</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jingjing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Universal adversarial perturbations generative network</article-title>
          ,
          <source>World Wide Web</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>1725</fpage>
          -
          <lpage>1746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poblete</surname>
          </string-name>
          , Information credibility on twitter,
          <source>in: Proceedings of the 20th international conference on World wide web</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Prominent features of rumor propagation in online social media</article-title>
          ,
          <source>in: 2013 IEEE 13th international conference on data mining, IEEE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1103</fpage>
          -
          <lpage>1108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nourbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Real-time rumor debunking on twitter</article-title>
          ,
          <source>in: Proceedings of the 24th ACM international on conference on information and knowledge management</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1867</fpage>
          -
          <lpage>1870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , W. Gao,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-F. Wong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Cha</surname>
          </string-name>
          ,
          <article-title>Detecting rumors from microblogs with recurrent neural networks (</article-title>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Universal weighting metric learning for crossmodal retrieval</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>44</volume>
          (
          <year>2022</year>
          )
          <fpage>6534</fpage>
          -
          <lpage>6545</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2021</year>
          .
          <volume>3088863</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Universal weighting metric learning for cross-modal matching</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>13005</fpage>
          -
          <lpage>13014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-F. Wong</surname>
          </string-name>
          ,
          <article-title>Detect rumors on twitter by promoting information campaigns with generative adversarial learning</article-title>
          ,
          <source>in: The world wide Web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3049</fpage>
          -
          <lpage>3055</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          , et al.,
          <article-title>A convolutional approach for misinformation identification</article-title>
          .,
          <source>in: IJCAI</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3901</fpage>
          -
          <lpage>3907</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R. Anku</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chinnakotla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Factify 2: A multimodal fake news and satire news dataset</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-modal entailment for fact verification</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, ceur,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Deberta: decoding-enhanced bert with disentangled attention</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
            <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Luo,
          <article-title>Multimodal fusion with recurrent neural networks for rumor detection on microblogs</article-title>
          ,
          <source>in: Proceedings of the 25th ACM international conference on Multimedia</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Khattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Goud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Mvae:
          <article-title>Multimodal variational autoencoder for fake news detection</article-title>
          ,
          <source>in: The world wide web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2915</fpage>
          -
          <lpage>2921</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kabra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumaraguru</surname>
          </string-name>
          , Spotfake+:
          <article-title>A multimodal framework for fake news detection via transfer learning (student abstract)</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>13915</fpage>
          -
          <lpage>13916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Fake news detection via knowledge-driven multimodal graph convolutional networks</article-title>
          ,
          <source>in: Proceedings of the 2020 International Conference on Multimedia Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>540</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion with co-attention networks for fake news detection, in: Findings of the Association for Computational Linguistics</article-title>
          , volume ACL/IJCNLP 2021 of Findings of ACL,
          <year>2021</year>
          , pp.
          <fpage>2560</fpage>
          -
          <lpage>2569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhudinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2048</fpage>
          -
          <lpage>2057</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Bengio,</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          ,
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          , W. Liu, T.-S. Chua,
          <article-title>Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Zhang,</surname>
          </string-name>
          <article-title>Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection</article-title>
          ,
          <source>in: Pacific-Asia conference on knowledge discovery and data mining</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W.-C. Peng,
          <article-title>Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification</article-title>
          ,
          <source>arXiv preprint arXiv:2201.11664</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29] D. Misra, Mish: A self regularized non-monotonic neural activation function, CoRR abs/1908.08681 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, H. Li, Dynamic fusion with intra- and inter-modality attention flow for visual question answering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6639-6648.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31] N. Wang, Z. Wang, X. Xu, F. Shen, Y. Yang, H. T. Shen, Attention-based relation reasoning network for video-text retrieval, in: 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6. doi:10.1109/ICME51207.2021.9428215.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset (2022).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33] S. Suryavardan, S. Mishra, M. Chakraborty, P. Patwa, A. Rani, A. Chadha, A. Reganti, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Findings of factify 2: multimodal fake news detection, in: Proceedings of De-Factify 2: Second Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>