Yet at Factify 2022: Unimodal and Bimodal RoBERTa-based Models for Fact Checking

Yan Zhuang1, Yanru Zhang*1,2
1 University of Electronic Science and Technology of China
2 Shenzhen Institute for Advanced Study, UESTC

* Corresponding author
De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
delecisz@gmail.com (Y. Zhuang); yanruzhang@uestc.edu.cn (Y. Zhang)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The development of social networks makes it easier and faster to spread news among people, but the spread of uncertified news can cause great harm. The 'Factify' task of the 'De-Factify' workshop aims to solve the multi-modal fact verification problem. In this paper, unimodal and bimodal RoBERTa-based models for fact checking are proposed. The text-only model integrates disturbance on the embedding layer, a new loss function and data augmentation through sequential dropout layers into the vanilla RoBERTa. Based on the text-only model, the text-image model changes the text embedding input into the fused features of texts and images. The experimental results show that the model improves slightly after the introduction of the fused features, but the best model is still our text-only model. With the best weighted average F1 score of 75.59%, we improve the baseline (53.10%) by 22% and are finally ranked 2nd.

Keywords
Multimodality, Fake News, Text Similarity

1. Introduction

The development of social media technology allows people to express themselves and receive information anytime, anywhere. However, in order to attract the attention of users, some media often publish eye-catching but unconfirmed news. For example, 77% of supporters of Donald Trump, the former US president, held the opinion that the 2020 US presidential election was manipulated by "voter fraud" because of information spread on Twitter, even though they did not have enough evidence [1]. The situation became even worse during the COVID-19 pandemic. There is an urgent need for a model that can automatically detect whether a claim is fake or not.

Fact checking can be described as follows: given a claim and some support information, such as documents, images and other claims, we need to judge whether the claim entails the support information. Most claims are evidence-based, so their veracity can be determined with external knowledge [2]. It is of great importance to take the evidence, or support information, into consideration, since it helps a lot with reasoning in fact checking [3].

In this paper, a text-only model and a text-image model are both proposed. The unimodal model based on RoBERTa adds disturbance on the embedding layer to boost robustness, creates positive samples through sequential dropout layers to augment the data, and uses a new loss function to alleviate the difficulty of predicting the image-related labels. The bimodal model based on RoBERTa fuses the text embeddings and image features and promotes the interaction between the two modalities through the self-attention mechanism in the transformer. The experimental results show that both models are effective; the text-only model performs better and helps us rank 2nd in the multi-modal fact verification task.

The rest of the paper is organised as follows.
In section 2, related work on fact verification is reviewed, followed by the introduction of the task in section 3. In section 4, the details of our proposed models are discussed. Section 5 contains the experimental results and the analysis of the different models. Towards the end, section 6 concludes the paper along with future directions.

2. Background

A lot of effort has been put into fact checking, and related research has shifted from unimodal settings with monolingual texts, to multilingual texts, to multi-modal settings with multiple texts and images [4, 5, 6, 7, 8, 9]. A multi-level inter-sentence attention model shows competitive performance on the 'FEVER' dataset, which consists of 185k samples, each with a claim and a supporting document [4, 10]. Multilingual transformer-based models, additional metadata and evidence from news stories are combined on the multilingual dataset 'X-FACT', which contains 31k short statements in 25 languages. As for the multimodal situation, [11] shows that augmenting text with image embeddings immediately boosts performance. In Event Adversarial Neural Networks (EANN) [12], Text-CNN is adopted to extract textual features and the pre-trained VGG-19 [13] architecture with a fully connected layer is applied to extract visual features. Besides, a fake news detector and an event discriminator take the concatenated features as input, then predict the label and identify the event label respectively. The Multimodal Variational Autoencoder (MVAE) and EANN have something in common in that they use the same visual feature extractor and take the concatenated features for further prediction [14]. However, instead of Text-CNN, MVAE uses recurrent neural networks (RNNs) with bidirectional Long Short-Term Memory (LSTM) cells to extract textual features. After sampling and reconstructing the concatenation of both features, the model is trained by optimizing the sum of the reconstruction loss and the Kullback-Leibler (KL) divergence loss. However, both MVAE and EANN ignore the interactions between the textual and visual features.

The Vision Transformer (ViT) shows excellent performance in vision-related tasks [15]. Based on it, the Vision-and-Language Transformer (ViLT) takes the fusion of texts and images into consideration, runs faster while remaining competitive, and shows excellent performance in vision-language classification tasks such as VQA and MSCOCO [16]. By introducing the Mixture-of-Modality-Experts (MOME) Transformer to promote deeper modal interactions, Unified Vision-Language Pre-training with Mixture-of-Modality-Experts (VLMo) achieves state-of-the-art results on the VQA and MSCOCO tasks [17]. ViLT and VLMo focus more on the interactions between textual and visual features, and they may provide a better solution for fact checking.

Figure 1: An example of each category in the task dataset [9]. Each sample contains a claim text, claim image, OCR of the claim image and their corresponding document ones. The categories of these samples are judged based on the different entailment relations mentioned in section 'Task setup'.

3. Task setup

For the task, the dataset contains 50k claims with 100k images [9, 18]. Given a claim text, claim image and OCR of the claim image, we need to predict whether they entail the document ones.
According to the different entailment relations, the claims can be classified into 5 categories:

• Support_Text: the claim text entails the document one but the claim image does not
• Insufficient_Text: both the claim text and image are neither entailed nor refuted by the document ones
• Support_Multimodal: both the claim text and image entail the document ones
• Insufficient_Multimodal: the claim image entails the document one but the claim text does not
• Refute: both the claim image and text are contradictory to the document ones

In addition, each category accounts for the same percentage, with 7k samples for training, 1.5k samples for validation and 1.5k samples for testing. In Figure 1, a sample of each category is provided.

4. Model

Models can be divided into text-only ones and text-image ones according to the data they use.

Figure 2: Our text-only model uses RoBERTa as the backbone, and the Projected Gradient Descent (PGD) disturbance is put into the word embedding layer [22]. The output of the backbone is later used to create positive and negative samples through R-Drop [23]. The loss function here is defined as the sum of the focal loss and the bidirectional Kullback-Leibler (KL) divergence between the two distributions from R-Drop.

4.1. Text Pre-processing

There are lots of meaningless words, URLs and characters from different languages in the claim texts and OCRs, so before feeding the text into the model, the following steps are applied:

• URL removal: There is a lot of URL information in the claims; it is worthless and increases the length of the data processed by the model.
• Non-English word removal: Many non-English characters are contained in the claims, especially in the OCRs, and they rarely help increase performance.
• Short word removal: There are a lot of spaces and various characters like "\n, aa" in the original data. Words with fewer than 3 characters are removed.

Since there is little useful information in the OCRs of the images, many of which are "NaN, ANI, BBC" or the splicing of several words, the provided OCR data is not used in our model.

4.2. Text-only models

Text-only models treat the task as a sentence-pair similarity problem and solve it by classifying the cosine similarity between the embeddings of the claim and the corresponding document. SentenceBERT [19] is used for extracting the text embeddings and serves as the text-only baseline in [9]. We use the pre-trained RoBERTa as the backbone and make some modifications [20, 21]. The model structure can be seen in Figure 2: after removing URLs, non-English words and short words, the claim text and document text are fed into the transformer, and here we use vanilla BERT and RoBERTa for comparison. The robustness of the model can be boosted by introducing disturbance on the embedding layer, and we use PGD, which iterates several times to slowly find the optimal perturbation and can be formulated as Equation 1:

\[
r_{adv}^{t+1} = \alpha \, g_t / \lVert g_t \rVert_2 \qquad (1)
\]

Here $g$ denotes the input gradient and is defined as Equation 2:

\[
g = \nabla_x L(\theta, x, y) \qquad (2)
\]

There are many other adversarial training methods, such as FGM [24] and FreeAT [25]. The former can only obtain a locally optimal perturbation. Although the latter also searches for the optimal disturbance step by step, each step is updated based on the gradient and parameters of the previous step, so the perturbation found in the current step is suboptimal and does not maximize the loss.
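To make the adversarial training step concrete, the following is a minimal PyTorch-style sketch of a PGD perturbation applied to the word embedding layer, in the spirit of Equations 1 and 2. The class name, the `emb_name` string and the step-size values are illustrative assumptions rather than our exact implementation.

```python
import torch

class PGDPerturbation:
    """Sketch: iterative PGD disturbance on the embedding weights (cf. Eq. 1-2).

    Hyperparameters (epsilon, alpha) and emb_name are illustrative assumptions."""

    def __init__(self, model, emb_name="word_embeddings", epsilon=1.0, alpha=0.3):
        self.model = model
        self.emb_name = emb_name      # substring identifying the embedding parameters
        self.epsilon = epsilon        # radius of the allowed perturbation ball
        self.alpha = alpha            # step size for each PGD iteration
        self.emb_backup = {}

    def attack(self, is_first_attack=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                if is_first_attack:
                    self.emb_backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # one step of Eq. 1: alpha * g_t / ||g_t||_2
                    param.data.add_(self.alpha * param.grad / norm)
                    # keep the accumulated perturbation inside the epsilon-ball
                    param.data = self._project(name, param.data)

    def _project(self, name, data):
        r = data - self.emb_backup[name]
        if torch.norm(r) > self.epsilon:
            r = self.epsilon * r / torch.norm(r)
        return self.emb_backup[name] + r

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                param.data = self.emb_backup[name]
        self.emb_backup = {}
```

In a training step, the loss would first be back-propagated to obtain the gradient $g$, `attack` would then be called for a few iterations (each followed by another backward pass on the perturbed embeddings), and `restore` would put the original embedding weights back before the optimizer update.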
After we get the last hidden state of the CLS token in the model, we use a sequential network with two dropout layers to generate another CLS representation, so as to create positive samples, and we minimize the bidirectional KL divergence between the two CLS representations. This method is called 'R-Drop' [23] and can be formulated as Equation 3:

\[
L^{i}_{KL} = \frac{1}{2}\left[ \mathrm{KL}\!\left(Q_{\theta}(y|x_i)\,\|\,P_{\theta}(y|x_i)\right) + \mathrm{KL}\!\left(P_{\theta}(y|x_i)\,\|\,Q_{\theta}(y|x_i)\right) \right] \qquad (3)
\]

In vanilla BERT, the final loss is computed as the cross-entropy loss. However, since the text-only model only uses text information, it is hard to judge whether the images are entailed or not, so we add the focal loss to alleviate the difficulty of predicting the image-related labels [26]. Here the focal loss is defined as Equation 4:

\[
L^{i}_{FL} = -\alpha (1-p)^{\gamma} \log(p) \qquad (4)
\]

The hyper-parameter $\alpha$ is used to balance the relative importance of positive and negative samples, and $\gamma$ is applied to reduce the weight of easy-to-classify samples, so that the model focuses more on difficult-to-classify samples during training, which satisfies our need. The final loss function of our model is defined as Equation 5:

\[
Loss_i = L^{i}_{FL} + \alpha L^{i}_{KL} \qquad (5)
\]

where $\alpha$ denotes the loss weight, which we set to 4 when computing the loss. The 5-fold cross-validation is also adopted in our model and the averaged logits are used for classification.

4.3. Text-image models

Existing multimodal models focus more on classification tasks with an image and its description [12, 14, 16, 15], and there is relatively little research dealing with multiple texts and multiple images. The multimodal baseline model provided in [9] computes the cosine similarity between the text embeddings derived from the pre-trained SentenceBERT and between the image features derived from the pre-trained ResNet50, respectively [27]. The similarity values are treated as features, with the corresponding label as the target, and are then fed into several algorithms such as Random Forest, Decision Tree and Logistic Regression. This baseline model ignores the interactions between the different modalities.

Figure 3: Our text-image model uses the pre-trained RoBERTa to obtain the text embeddings and the pre-trained VGG16 to obtain the image features. The claim text embedding (CT Embed) and claim image features (CI features), and the document ones, are concatenated and fed into a Multi-Layer Perceptron respectively before being put into the main model. The main model is the same as the one in our text-only model.

Our text-image method shares something in common with the baseline in that both use pre-trained models to extract embeddings and features. However, the pre-trained RoBERTa instead of the pre-trained SentenceBERT, and the pre-trained VGG16 instead of the pre-trained ResNet50, are applied in our model for their better representation ability. The difference between our text-image model and our text-only model is the input: in the former, the image features are concatenated with the text ones and then put into a Multi-Layer Perceptron for interaction. The whole structure of our text-image model is shown in Figure 3.
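To illustrate how the fused input of the text-image model could be assembled, here is a minimal sketch assuming the HuggingFace `transformers` and `torchvision` libraries. The class name, the projection sizes and the 4096-dimensional image feature taken from VGG16's penultimate layer are illustrative assumptions, not a description of our exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16
from transformers import RobertaModel

class FusedInputBuilder(nn.Module):
    """Sketch: concatenate RoBERTa token embeddings with VGG16 image features
    and project them back to the hidden size with an MLP (cf. Figure 3)."""

    def __init__(self, hidden_size=768, image_dim=4096):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        backbone = vgg16(weights="IMAGENET1K_V1")
        # keep VGG16 up to its 4096-d penultimate fully connected layer
        self.image_encoder = nn.Sequential(
            backbone.features, backbone.avgpool, nn.Flatten(),
            *list(backbone.classifier)[:-1],
        )
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size + image_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, input_ids, image):
        # token embeddings from RoBERTa's embedding layer: (batch, seq_len, hidden_size)
        text = self.text_encoder.embeddings(input_ids=input_ids)
        # one feature vector per image: (batch, image_dim)
        img = self.image_encoder(image)
        # broadcast the image vector over the sequence and concatenate per token
        img = img.unsqueeze(1).expand(-1, text.size(1), -1)
        fused = torch.cat([text, img], dim=-1)
        return self.mlp(fused)            # (batch, seq_len, hidden_size)
```

The claim pair and the document pair would each be fused in this way, and the projected representations could then replace the plain text embeddings as the input to the main RoBERTa-based model, for example through its `inputs_embeds` argument.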
5. Experiments and evaluations

Here we choose the text-only baseline, pre-trained BERT and RoBERTa as comparisons with our text-only model, and use the multimodal baseline and a mixed_input RoBERTa without our modifications as comparisons with our text-image model. The mixed_input RoBERTa model takes the fusion of text embeddings and image features, in the same way as our text-image model, as the model input. The results of all models are obtained by classifying the averaged logits from 5-fold cross-validation, except for the two baselines. Besides, all hyperparameters are the same in the BERT and RoBERTa models for a fair comparison, as shown in Table 1.

Table 1
The hyperparameters of the BERT and RoBERTa models.

    learning_rate        4e-5
    dropout_rate         0.1
    train_batch_size     32
    adam ε               1e-8
    max_sequence_length  128
    epoch                3
    test_batch_size      32
    seed                 42

The official evaluation metric for this task is Macro-F1 and the final ranking is based on the weighted average F1 score; a small sketch of the fold-averaging and scoring procedure is given at the end of this section. The Macro-F1 scores of the models are shown in Table 2. With the best average F1 score of 75.59%, we improve the baseline (53.10%) by 22% and are finally ranked second in this task. Note that the models above our text-image model in Table 2 are all text-only models, and that the column names in the first row of the table, except 'Model', are abbreviations of the corresponding labels, such as 'Sup_Text' for 'Support_Text' and 'Insuffi_Multi' for 'Insufficient_Multimodal'. The figures in the 'Final' column denote the weighted average F1 scores over the former 5 categories.

It can be seen that most models perform better on the 'Insufficient_Text' and 'Support_Multimodal' labels than on 'Support_Text' and 'Insufficient_Multimodal', because the judgment basis for the first two labels is that the claim text and claim image are either both entailed or both not entailed by the document ones. This shows that the models have not learned enough information about the interaction between the two modalities. Besides, all models except the baselines perform almost perfectly in predicting 'Refute', because it is relatively easy to distinguish texts with opposite meanings. The multimodal baseline exceeds the text-only baseline by over 10% and achieves the highest score in 'Support_Text' prediction, but its final score is over 20% lower than that of our text-only model. The mixed_input RoBERTa model that combines the two modalities performs better than the single-modality one. Our text-only model shows the best performance among all models and is 1% higher than the vanilla text-only RoBERTa. Our text-image model scores higher than the mixed_input RoBERTa but does not show competitive performance in image-related label prediction and scores 0.64% less than our text-only model. This is because the introduction of the image features into RoBERTa decreases its representation ability, and the results may remain the same after different text embeddings interact with the image features. Besides, the difference in magnitude between the two feature types may also introduce bias and variance. In addition, the ensemble model, which only ensembles the first 3 models in Table 2, performs as well as our text-only model but costs too much time.

Classifying multiple texts and images is a tough task, for it involves not only the entailment between texts and texts, and between images and images, but also between the many texts and images at the same time. Combining the two modalities slightly improves the understanding of the text and image pairs, but better interaction and understanding of the two modalities may further improve the results in future work.
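For clarity, the following is a minimal sketch of the fold-averaging and scoring procedure referred to above, assuming scikit-learn; the function and variable names and the array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

LABELS = ["Support_Text", "Insufficient_Text", "Support_Multimodal",
          "Insufficient_Multimodal", "Refute"]

def predict_from_logits(list_of_logits):
    """Average logits over the 5 CV folds (or over several models when
    ensembling) and take the arg-max class; each array is (num_samples, 5)."""
    mean_logits = np.mean(np.stack(list_of_logits, axis=0), axis=0)
    return mean_logits.argmax(axis=-1)

def report_f1(y_true, y_pred):
    # per-class F1 scores, as reported in the label columns of Table 2
    for name, score in zip(LABELS, f1_score(y_true, y_pred, average=None)):
        print(f"{name}: {score:.2%}")
    # weighted average F1, used for the 'Final' column and the official ranking
    print(f"Final: {f1_score(y_true, y_pred, average='weighted'):.2%}")
```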
Table 2
The results of the experiments.

Model                  Sup_Text  Insuffi_Text  Sup_Multi  Insuffi_Multi  Refute   Final
BERT                   58.14%    68.08%        71.06%     64.20%         99.24%   72.14%
RoBERTa                62.01%    70.24%        73.18%     67.96%         99.57%   74.59%
Our text-only model    63.39%    70.85%        74.79%     69.33%         99.60%   75.59%
Text_Baseline          -         -             -          -              -        41.33%
mixed_input RoBERTa    62.69%    70.05%        73.90%     68.61%         99.50%   74.95%
Our text-image model   62.90%    70.59%        73.72%     68.41%         99.60%   75.04%
Multimodal_Baseline    82.68%    75.47%        74.42%     69.68%         42.35%   53.10%
Ensemble               75.52%    89.38%        82.12%     80.81%         99.87%   75.59%

6. Conclusion

In this paper, unimodal and bimodal RoBERTa-based models are discussed to solve the multi-modal fact checking task of the De-Factify workshop. The major challenge of the fact checking task derives from the entailment between multiple texts and images, and existing approaches showed unsatisfactory performance. To address the problem, our model integrates PGD, the focal loss and R-Drop into the RoBERTa model, which shows better effectiveness. Besides, our text-image model shows better performance than the vanilla model by fusing the text embeddings and image features, but it is still worse than our text-only model, which helped us rank 2nd in this task. Better multi-modal feature fusion and interaction strategies would be conducive to better solving this challenge.

References

[1] G. Pennycook, D. G. Rand, Examining false beliefs about voter fraud in the wake of the 2020 presidential election, The Harvard Kennedy School Misinformation Review (2021).
[2] S. Shaar, G. D. S. Martino, N. Babulkov, P. Nakov, That is a known lie: Detecting previously fact-checked claims, arXiv preprint arXiv:2005.06058 (2020).
[3] C. Hansen, C. Hansen, L. C. Lima, Automatic fake news detection: Are models learning to reason?, arXiv preprint arXiv:2105.07698 (2021).
[4] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018).
[5] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, Fakenewsnet: A data repository with news content, social context and spatialtemporal information for studying fake news on social media, arXiv preprint arXiv:1809.01286 (2018).
[6] K. Nakamura, S. Levy, W. Y. Wang, r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, arXiv preprint arXiv:1911.03854 (2019).
[7] J. C. Reis, P. Melo, K. Garimella, J. M. Almeida, D. Eckles, F. Benevenuto, A dataset of fact-checked images shared on whatsapp during the brazilian and indian elections, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020, pp. 903–908.
[8] P. Patwa, S. Sharma, S. Pykl, V. Guptha, G. Kumari, M. S. Akhtar, A. Ekbal, A. Das, T. Chakraborty, Fighting an infodemic: Covid-19 fake news dataset, in: International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer, 2021, pp. 21–29.
[9] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[10] C. Kruengkrai, J. Yamagishi, X. Wang, A multi-level attention model for evidence-based fact checking, arXiv preprint arXiv:2106.00950 (2021).
[11] F. Yang, X. Peng, G. Ghosh, R. Shilon, H. Ma, E. Moore, G. Predovic, Exploring deep multimodal fusion of text and photo for hate speech classification, in: Proceedings of the Third Workshop on Abusive Language Online, 2019, pp. 11–18.
[12] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 849–857.
[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[14] D. Khattar, J. S. Goud, M. Gupta, V. Varma, Mvae: Multimodal variational autoencoder for fake news detection, in: The World Wide Web Conference, 2019, pp. 2915–2921.
[15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[16] W. Kim, B. Son, I. Kim, Vilt: Vision-and-language transformer without convolution or region supervision, arXiv preprint arXiv:2102.03334 (2021).
[17] W. Wang, H. Bao, L. Dong, F. Wei, Vlmo: Unified vision-language pre-training with mixture-of-modality-experts, arXiv preprint arXiv:2111.02358 (2021).
[18] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[19] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[22] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[23] X. Liang, L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, T.-Y. Liu, R-drop: Regularized dropout for neural networks, arXiv preprint arXiv:2106.14448 (2021).
[24] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial training methods for semi-supervised text classification, arXiv preprint arXiv:1605.07725 (2016).
[25] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, T. Goldstein, Adversarial training for free!, Advances in Neural Information Processing Systems 32 (2019).
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[27] P. Kasnesis, R. Heartfield, X. Liang, L. Toumanidis, G. Sakellari, C. Patrikakis, G. Loukas, Transformer-based identification of stochastic information cascades in social networks using text and image similarity, Applied Soft Computing 108 (2021) 107413.