<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Yao at Factify 2022: Utilizing Pre-trained Models and Co-attention Networks for Multi-Modal Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei-Yao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen-Chih Peng</string-name>
          <email>wcpeng@nctu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National Yang Ming Chiao Tung University</institution>
          ,
          <addr-line>Hsinchu</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, social media has exposed users to a myriad of misinformation and disinformation; thus, misinformation has attracted a great deal of attention in research fields and as a social issue. To address the problem, we propose a framework, Pre-CoFact, composed of two pre-trained models for extracting features from text and images, and multiple co-attention networks for fusing the same modality from different sources as well as different modalities. Besides, we adopt an ensemble method by using different pre-trained models in Pre-CoFact to achieve better performance. We further illustrate the effectiveness of our design in an ablation study and examine different pre-trained models for comparison. Our team, Yao, won the fifth prize (F1-score: 74.585%) in the Factify challenge hosted by De-Factify @ AAAI 2022, which demonstrates that our model achieved competitive performance without using auxiliary tasks or extra information. The source code of our work is publicly available at https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fake news has become easier to spread due to the growing number of social media users. For
example, about 59% of social media news consumers expect that news spread via social media may
be inaccurate [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To influence public opinion, many fake news stories mislead
readers by replacing some true content with false details. Besides, fake
news with both textual and visual content attracts readers more easily and is harder to judge than
text-only content. Therefore, it is essential to detect multi-modal fake news to eliminate its
negative impacts.
      </p>
      <p>
        Fact-checkers aim to assess check-worthiness and to retrieve evidence or previously verified claims [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recent
works have presented a number of approaches for tackling fake news detection automatically.
In uni-modal detection, Shu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] exploited a tri-relationship (publishers, news pieces, and
users) to model the relations and interactions for detecting news disinformation. Przybyla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
utilized the style the news articles are written in to estimate their credibility. In multi-modal
detection, Jin et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed an att-RNN that combines a recurrent neural network with
an attention mechanism to fuse textual content and visual images. MCAN is proposed by
extracting spatial-domain features and textual features by pre-trained models [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Further, to
address the fact that fake images are often re-compressed or tampered images, which show
periodicity in the frequency domain, the authors used the discrete cosine transform as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], then
designed a CNN-based network for capturing frequency-domain features from images.
      </p>
      <p>
        A real-world problem, identifying if the claim entails the document, is the challenge called
Factify [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] hosted by De-Factify (https://aiisc.ai/defactify/factify.html). Figure 1 shows some examples of all five categories. The
goal is to design a method to classify the given text and images into one of five categories:
Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, and Refute. To
tackle the problem, in this paper, we propose Pre-CoFact, which uses pre-trained models and
co-attention networks to perform the shared task: it first extracts features from both text and
images, then fuses this information through the co-attention module. Specifically, two powerful
Transformer-based pre-trained models, DeBERTa [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and DeiT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], are adopted for extracting
features from both claims and documents’ text and images, respectively. Afterwards, several
co-attention modules are designed for fusing the contexts of the text and images. Finally, these
embeddings are aggregated as corresponding embeddings to classify the category of the news.
      </p>
      <p>The main results of this paper can be summarized as follows:
• Using text and images directly can achieve expressive results without any auxiliary tasks,
preprocessing methods, or extra information (e.g., optical character recognition (OCR)
from images).</p>
      <p>
        • Adopting pre-trained models helps improve performance on the shared task, and using
co-attention networks can learn the correlation within the same modality (text or images
from claims and documents) and the dependencies between different modalities (text and
images).
• Our ensemble model outperforms the machine learning models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by at least 48% and 40% in
terms of validation score and testing score, respectively. Besides, extensive experiments were further
conducted to examine the capability of the proposed model.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>Factify is a dataset for multi-modal fact verification, which contains textual claims and claim images, along with
reference textual documents and document images. Each sample includes claim_image, claim, claim_ocr,
document_image, document, document_ocr, and category. Each field is described
as follows:
• claim_image: the image of the given claim.
• claim: the text of the given claim.
• claim_ocr: the text from the claim_image detected by the host.
• document_image: the image of the given reference.
• document: the text of the given reference.
• document_ocr: the text from the document_image detected by the host.</p>
      <p>• category: the category of the data sample from a list of five classes.</p>
      <p>The category is composed of 1) Support_Multimodal: both the claim text and image are similar
to those of the document, 2) Support_Text: the claim text is similar or entailed, but the images of the
document and claim are not similar, 3) Insufficient_Multimodal: the claim text is neither
supported nor refuted by the document but the images are similar to the document, 4) Insufficient_Text:
both the text and images of the claim are neither supported nor refuted by the document, although
it is possible that the claim text has common words with the document text, and 5) Refute: the
images and/or text from the claim and document are completely contradictory.</p>
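      <p>For illustration, a minimal sketch of accessing one sample in Python (field names follow the list above; the file name and format are assumptions, since the distribution format is not described here):</p>
      <preformat>
import pandas as pd

# Hypothetical path and file format; the fields match the list above.
train = pd.read_csv("factify/train.csv")

sample = train.iloc[0]
print(sample["claim"], sample["claim_image"], sample["claim_ocr"])
print(sample["document"], sample["document_image"], sample["document_ocr"])
print(sample["category"])  # one of the five classes
      </preformat>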
      <p>
        The training set contains 35,000 samples, which has 7,000 samples of each class, and the
validation set contains 7,500 samples, which has 1,500 samples of each class. The test set, which
is used to evaluate the private score, also contains 7,500 samples. For more details, we refer
readers to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Works</title>
      <sec id="sec-3-1">
        <title>3.1. Fake News Detection</title>
        <p>
          There has been a series of studies on combating fake news to mitigate this societal crisis
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Vo and Lee [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed a novel neural ranking model which jointly utilizes textual and
visual matching signals. This is the first work using multi-modal data in social media posts to
search for verified information, which can increase users’ awareness of fact-checked information
when they are exposed to fake news. Lee et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] adopted a perplexity-based approach in
the few-shot setting, which assumes that a given claim may be fake if the corresponding
perplexity score from evidence-conditioned language models is high. BertGCN [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] was proposed
by integrating the advantages of large-scale pre-trained models and graph neural networks for
fake news detection, and is able to learn representations from the massive amount of
pre-training data and the label influence through propagation. MCAN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] adopts a large-scale
pre-trained NLP model and a pre-trained computer vision (CV) model for extracting features
from text and images, respectively. Besides, MCAN also extracts frequency domain features
from images, and then uses multiple co-attention layers to fuse this information.
        </p>
        <p>These approaches demonstrate the effectiveness of using pre-trained models for fake news
detection, which motivated us to use pre-trained models as well. Besides, MCAN inspired us
to fuse the contexts of different modalities or the same modality (e.g., text from claims and
documents).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large-Scale Pre-trained Models</title>
        <p>
          Transformer [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] has been used for machine translation and has inspired many competitive
approaches in natural language processing (NLP) tasks. Transformer-based pre-trained language
models (PLMs) have significantly improved the performance of various NLP tasks due to the
ability to understand contextualized information from the pre-trained dataset. Since BERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
was presented, we have seen the rise of a set of large-scale PLMs such as GPT-3 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], RoBERTa
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], XLNet [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], ELECTRA [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and DeBERTa [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. These PLMs have been fine-tuned using
task-specific labels and have created a new state of the art in many downstream tasks.
        </p>
        <p>
          Recently, vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is a Transformer encoder architecture applied directly
to image classification, with patches of raw images as input tokens as in NLP; it achieves competitive
results compared to state-of-the-art convolutional networks by pre-training on a large private
image dataset, JFT-300M [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. ViT demonstrates that convolution-free networks can still learn
the relations in images. To reduce the required pre-training dataset size and improve training efficiency, several
follow-up studies have been conducted. DINO was proposed by [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] to improve the standard
ViT model through self-supervised learning. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed DeiT, which used a novel distillation
procedure based on a distillation token to ensure the student learns from the teacher through
attention.
        </p>
        <p>These pre-trained models demonstrate generalization across various domains. Further, using
pre-trained models helps capture rich information for downstream tasks, which can also
reduce the burden of training from scratch. These advantages motivated us to adopt
state-of-the-art pre-trained models for transforming images and text into contextual embeddings.
Besides, we focused on using Transformer-based pre-trained models for feature extraction.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>4.1. Problem Formulation</title>
        <p>Let $\mathcal{D} = \{C^t_i, C^v_i, D^t_i, D^v_i\}_{i=1}^{|\mathcal{D}|}$ denote the corpus of the dataset, where the $i$-th sample is composed
of the claim text $C^t_i = c_1 c_2 \cdots$, the claim image $C^v_i$, the document text $D^t_i = d_1 d_2 \cdots$,
and the document image $D^v_i$. The $i$-th target $y_i \in \{Support\_Multimodal, Support\_Text,
Insufficient\_Multimodal, Insufficient\_Text, Refute\}$. The goal is to identify the support,
insufficient-evidence, or refute relation between given claims and documents.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pre-CoFact Overview</title>
        <p>
          Figure 2 illustrates the overview of the proposed Pre-CoFact framework. The input contains the
claim image, the claim text, the document image, and the document text. The feature extraction
part adopts DeiT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as the pre-trained CV model and DeBERTa [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as the pre-trained NLP
model, and feeds the outputs of pre-trained models to the image embedding layer and text
embedding layer for transforming images and texts into corresponding embeddings. The
multi-modality fusion part fuses the information from the same modality (images/text from the claim
and document) and different modalities (images and text from the claim/document) based on
multiple co-attention layers. Finally, the category classifier predicts the possible classes based
on the embeddings from feature extraction and the embeddings from multi-modality fusion.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Feature Extraction</title>
        <p>The enrichment of pre-trained models enables us to have rich information without training
from scratch. Moreover, Transformer-based pre-trained models demonstrate the success on
both NLP and CV tasks. However, it is essential to fine-tune them to fit our task. To this end,
we first use DeBERTa as our pre-trained NLP model and DeiT as our pre-trained CV model, and
then we use an embedding layer for transforming the pre-trained embeddings into embeddings for
our task. Specifically, the $i$-th outputs of the embedding layers are calculated as follows:</p>
        <p>$E_{C^v_i} = \sigma(\mathrm{MLP}(\mathrm{DeiT}(C^v_i)))$, $E_{C^t_i} = \sigma(\mathrm{MLP}(\mathrm{DeBERTa}(C^t_i)))$, $E_{D^v_i} = \sigma(\mathrm{MLP}(\mathrm{DeiT}(D^v_i)))$, $E_{D^t_i} = \sigma(\mathrm{MLP}(\mathrm{DeBERTa}(D^t_i)))$,
where the output dimensions of DeiT and DeBERTa are 768, the embedding layer is composed of an MLP
and an activation function $\sigma$, and $E_{C^v_i}$, $E_{C^t_i}$, $E_{D^v_i}$, $E_{D^t_i}$ are $d$-dimension vectors. It is noted that the
activation functions we used are ReLU and Mish [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] for testing the results.</p>
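        <p>As a minimal sketch of this step for the text side (assuming the HuggingFace checkpoint named in Section 5.1.1, microsoft/deberta-base; the embedding layer below is one linear projection plus activation, with $d$ = 512 as in Section 5.1.1):</p>
        <preformat>
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Frozen pre-trained text encoder; the image side is analogous with DeiT
# ("facebook/deit-base-patch16-224") followed by its own embedding layer.
text_encoder = AutoModel.from_pretrained("microsoft/deberta-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
for p in text_encoder.parameters():
    p.requires_grad = False  # the pre-trained parameters are frozen (Sec. 5.1.1)

d = 512
# Embedding layer: MLP (here a single linear map, 768 -> d) plus activation.
text_embedding_layer = nn.Sequential(nn.Linear(768, d), nn.ReLU())

tokens = tokenizer("a claim text", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    hidden = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 768)
E_claim_text = text_embedding_layer(hidden)            # (1, seq_len, d)
        </preformat>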
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Multi-Modality Fusion</title>
        <p>
          After generating embeddings of text and images, we adopt multiple co-attention layers as in
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to fuse the embeddings. To check the relation between claim and document, we use the
co-attention layer to separately fuse 1) images of claims and documents and 2) text of claims and
documents. Besides, the relation between text and images from the claims or documents can be
viewed as checking whether they are related or not. Therefore, we also adopt the co-attention
layer for fusing 3) images and text of claims and 4) images and text of documents.
        </p>
        <p>
          Specifically, each co-attention layer takes two inputs $X$ and $Y$ and produces two outputs
$\tilde{X}$ and $\tilde{Y}$. Here we use a single head, derived as the following equations:
$Q = X W^Q$, $K = Y W^K$, $V = Y W^V$,
$\mathrm{CA}(X, Y) = \mathrm{softmax}(Q K^\top / \sqrt{d})\, V$,
$\tilde{X} = \mathrm{FFN}(\mathrm{Norm}(X + \mathrm{CA}(X, Y)))$,
and symmetrically for $\tilde{Y}$ by exchanging the roles of $X$ and $Y$, where $\mathrm{Norm}$ and $\mathrm{FFN}$ are the same normalization
method and feed-forward network as in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The co-attention block has been widely used in VQA
tasks [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], as it can capture dependencies between different inputs. Therefore, we use the co-attention
layer for the four fusions listed above, and afterwards we aggregate the
fused tokens into a representative token. That is, given a fused embedding in $\mathbb{R}^{l \times d}$, where
$l$ is the sequence length, we use mean aggregation to output a vector in $\mathbb{R}^{1 \times d}$. Besides, we also feed
$E_{C^v}$, $E_{C^t}$, $E_{D^v}$, $E_{D^t}$ into the aggregation function for classification.
        </p>
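        <p>A minimal single-head co-attention sketch consistent with the equations above (the Add-and-Norm and feed-forward structure follows [16]; sharing one set of projections for both directions is a simplification of this sketch):</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Single-head co-attention: each input attends over the other, followed
    by Add-and-Norm and a feed-forward network as in the Transformer [16]."""
    def __init__(self, d, d_ff=1024):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def attend(self, x, y):
        # Queries from x; keys and values from y (scaled dot-product attention).
        q, k, v = self.wq(x), self.wk(y), self.wv(y)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        ctx = F.softmax(scores, dim=-1) @ v
        h = self.norm1(x + ctx)             # Add and Norm
        return self.norm2(h + self.ffn(h))  # FFN with residual

    def forward(self, x, y):
        return self.attend(x, y), self.attend(y, x)

coatt = CoAttention(512)
x, y = torch.randn(2, 16, 512), torch.randn(2, 24, 512)
x_f, y_f = coatt(x, y)    # two fused outputs per co-attention layer
x_agg = x_f.mean(dim=1)   # mean aggregation: R^(l x d) to R^(1 x d)
        </preformat>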
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Category Classifier</title>
        <p>To predict the label of the given claims and documents, we first concatenate the 8 aggregated outputs
from the co-attention layers and the original 4 aggregated embeddings
to obtain the input of the classifier $E_c$. It
is worth noting that the outputs of the embedding layers are also used since the original information can
provide some clues for classifying the news. Afterwards, the $i$-th output of the classifier is the
probability as follows:
$h_1 = \sigma(E_c W_0)$, $h_2 = \sigma(h_1 W_1)$, $\hat{y}_i = \mathrm{softmax}(h_2 W_2)$,
where $W_0 \in \mathbb{R}^{12d \times d}$, $W_1 \in \mathbb{R}^{d \times d_1}$, and $W_2 \in \mathbb{R}^{d_1 \times 5}$. Note that $\sigma$ is the same as in the embedding layer, which
uses both ReLU and Mish for testing the results.</p>
        <p>We trained our model by minimizing the cross-entropy loss $\mathcal{L}$ to learn the prediction of the
categories: $\mathcal{L} = -\sum_{i=1}^{|\mathcal{D}|} y_i \log(\hat{y}_i)$.</p>
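        <p>A sketch of the classifier and loss under these definitions (the hidden width $d_1$ is not specified, so 512 here is an assumption; nn.CrossEntropyLoss applies the softmax and log internally):</p>
        <preformat>
import torch
import torch.nn as nn

d = 512
d1 = 512  # assumed; the paper does not state d_1
classifier = nn.Sequential(
    nn.Linear(12 * d, d), nn.ReLU(),  # h1 = sigma(E_c W_0), W_0 in R^(12d x d)
    nn.Linear(d, d1), nn.ReLU(),      # h2 = sigma(h1 W_1),  W_1 in R^(d x d_1)
    nn.Linear(d1, 5),                 # logits, W_2 in R^(d_1 x 5)
)
criterion = nn.CrossEntropyLoss()     # cross-entropy over the five categories

# E_c concatenates 8 aggregated co-attention outputs and 4 aggregated embeddings.
E_c = torch.randn(32, 12 * d)         # dummy batch of 32 samples
labels = torch.randint(0, 5, (32,))
loss = criterion(classifier(E_c), labels)
        </preformat>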
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Ensemble Method</title>
        <p>Each classifier may have its strengths and weaknesses, and ensemble methods have been widely
used to enhance performance. Therefore, we follow [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in using the power weighted sum to
enhance the performance of the model. The formula is derived as follows:
$P = w_1 \times P_1^{p} + w_2 \times P_2^{p} + \cdots + w_n \times P_n^{p}$,
where $P_1, \cdots, P_n$ are the predicted probabilities from the corresponding models, $w_1, \cdots, w_n$ are
the weights with respect to the corresponding models, $n$ is the number of trained models, and $p$ is
the power weight. It is noted that these parameters are tuned by hand.</p>
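        <p>A sketch of the power weighted sum with the hand-tuned values reported in Section 5.1.1 (dummy probabilities stand in for the five model outputs):</p>
        <preformat>
import numpy as np

def power_weighted_sum(probs, weights, p=0.5):
    # P = w_1 * P_1**p + w_2 * P_2**p + ... + w_n * P_n**p
    return sum(w * np.power(P, p) for w, P in zip(weights, probs))

rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5)) for _ in range(5)]  # five models, five classes
ensembled = power_weighted_sum(probs, [0.6, 0.2, 0.1, 0.2, 0.3], p=0.5)
prediction = int(np.argmax(ensembled))
        </preformat>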
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Implementation Details</title>
          <p>The dimension $d$ was set to 512, the inner dimension of the feed-forward layer was 1024, and the
number of heads was set to 4. The dropout rate was 0.1, and the max sequence length was 512.
The batch size was 32, the learning rates were set to 3e-5 and 2e-5, the training epochs were set
to 30, and the seeds were tested with 41 and 42. The power $p$ was set to 0.5, and the weights
were set to 0.6, 0.2, 0.1, 0.2, 0.3, which were manually tuned by validation score. The pre-trained
DeBERTa was deberta-base, and the DeiT was deit-base-patch16-224. The parameters of the
two pre-trained models were frozen. All images were transformed by resizing to 256, center
cropping to 224, and normalizing. We preprocessed only for transforming images, and then we
stored the text and processed images in corresponding pickle files for training and evaluation.
All the training and evaluation phases were conducted on a machine with an Intel Xeon 4110 CPU
@ 2.10GHz, an Nvidia GeForce RTX 2080 Ti, and 252GB RAM. The source code is available at
https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.</p>
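          <p>The image pipeline described above can be written with torchvision as follows (a sketch; the normalization statistics are assumed to be the ImageNet values commonly paired with DeiT checkpoints, as they are not stated here):</p>
          <preformat>
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),      # resize so the shorter side is 256
    transforms.CenterCrop(224),  # center crop to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
# Usage: tensor = preprocess(pil_image) for each claim/document image.
          </preformat>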
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Evaluation Metric</title>
          <p>To evaluate the performance of the task, the weighted average F1 score was used across the 5
categories.</p>
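          <p>For reference, the metric can be computed with scikit-learn (a sketch with dummy labels):</p>
          <preformat>
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 4, 0]  # dummy gold labels over the five categories
y_pred = [0, 1, 2, 2, 4, 0]  # dummy predictions
score = f1_score(y_true, y_pred, average="weighted")  # weighted average F1
          </preformat>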
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Quantitative Results</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Ablation Study</title>
          <p>We first conducted an ablation study to verify the effective design of our proposed Pre-CoFact.
As shown in Table 1, it is evident that without co-attention networks (w/o CoAtt), the
performance is degraded. Further, applying co-attention only within the same modality (w/o CoAtt(text,
image)) is insufficient, which demonstrates the need for modeling dependencies between
different modalities. It is noted that our ensemble method slightly improves the performance
compared to Pre-CoFact. Our ensemble method includes Pre-CoFact, Pre-CoFact with
DeBERTa replaced by XLM-RoBERTa, Pre-CoFact with DeBERTa replaced by RoBERTa, Pre-CoFact
with DeBERTa replaced by RoBERTa and ReLU replaced by Mish, and Pre-CoFact with
ReLU replaced by Mish.</p>
          <p>[Table 1: Weighted F1 (%) of each Pre-CoFact variant in the ablation study.]</p>
          <p>We also used different pre-trained models to examine the module influence, as shown in Table
2. It can be seen that DeiT is more suitable than DINO for this task. Besides, XLM-RoBERTa [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] also
degrades the performance, while RoBERTa is slightly worse than Pre-CoFact with DeBERTa.</p>
          <p>[Test-set leaderboard: per-category and final weighted F1 (%) for our team, Yao (final rank 5), and the baseline.]</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Testing Performance</title>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Confusion Matrix</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we proposed Pre-CoFact, which utilizes pre-trained models and multiple co-attention
networks to alleviate the effect of fake news for the Factify task. To achieve better performance,
we adopted an ensemble method by weighting several models. The ablation study demonstrates
the effectiveness of our proposed approach. From the testing score, our method illustrates that
using only text and images without extra information can also achieve competitive performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shearer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>News use across social media platforms in 2020</article-title>
          ,
          <year>2021</year>
          . URL: https://www.pewresearch.org/journalism/2021/01/12/news-use-across-social-media-platforms-in-2020/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. A.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <article-title>Automated fact-checking for assisting human fact-checkers</article-title>
          ,
          <source>in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4551</fpage>
          -
          <lpage>4558</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Liu,
          <article-title>Beyond news contents: The role of social context for fake news detection</article-title>
          ,
          <source>in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyla</surname>
          </string-name>
          ,
          <article-title>Capturing the style of fake news</article-title>
          ,
          <source>in: The Thirty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Luo,
          <article-title>Multimodal fusion with recurrent neural networks for rumor detection on microblogs</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Multimedia Conference</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion with co-attention networks for fake news detection, in: Findings of the Association for Computational Linguistics</article-title>
          , volume ACL/IJCNLP 2021 of Findings of ACL,
          <year>2021</year>
          , pp.
          <fpage>2560</fpage>
          -
          <lpage>2569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Exploiting multi-domain visual information for fake news detection</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>518</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Factify: A multi-modal fact verification dataset</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and
          <article-title>Hate Speech Detection</article-title>
          ,
          CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-modal entailment for fact verification</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and
          <article-title>Hate Speech Detection</article-title>
          ,
          CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Deberta: decoding-enhanced bert with disentangled attention</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <article-title>Fake news, disinformation, propaganda, media bias, and flattening the curve of the COVID-19 infodemic</article-title>
          , in
          <source>: KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4054</fpage>
          -
          <lpage>4055</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Where are the facts? searching for fact-checked information to alleviate the spread of fake news</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7717</fpage>
          -
          <lpage>7731</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Towards few-shot fact-checking via perplexity</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1971</fpage>
          -
          <lpage>1981</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Bertgcn:
          <article-title>Transductive text classification by combining GCN and BERT</article-title>
          ,
          <source>CoRR abs/2105.05727</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          , NeurIPS
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          ,
          <source>CoRR abs/1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. G. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          ,
          <year>2019</year>
          , pp.
          <fpage>5754</fpage>
          -
          <lpage>5764</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>ELECTRA: pre-training text encoders as discriminators rather than generators</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Revisiting unreasonable effectiveness of data in deep learning era</article-title>
          ,
          <source>in: IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>843</fpage>
          -
          <lpage>852</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          , I. Misra,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <article-title>Emerging properties in self-supervised vision transformers</article-title>
          ,
          <source>CoRR abs/2104.14294</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>Mish: A self regularized non-monotonic neural activation function</article-title>
          ,
          <source>CoRR abs/1908.08681</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Dynamic fusion with intra- and inter-modality attention flow for visual question answering</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6639</fpage>
          -
          <lpage>6648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Emotiongif-yankee: A sentiment classifier with robust model based ensemble methods</article-title>
          ,
          <source>CoRR abs/2007.02259</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>