<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Yao at Factify 2022: Utilizing Pre-trained Models and Co-attention Networks for Multi-Modal Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei-Yao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen-Chih Peng</string-name>
          <email>wcpeng@nctu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National Yang Ming Chiao Tung University</institution>
          ,
          <addr-line>Hsinchu</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, social media has exposed users to a myriad of misinformation and disinformation; thus, misinformation has attracted a great deal of attention in research fields and as a social issue. To address the problem, we propose a framework, Pre-CoFact, composed of two pre-trained models for extracting features from text and images, and multiple co-attention networks for fusing the same modality from different sources as well as different modalities. Besides, we adopt an ensemble method by using different pre-trained models in Pre-CoFact to achieve better performance. We further illustrate the effectiveness of our design in an ablation study and examine different pre-trained models for comparison. Our team, Yao, won the fifth prize (F1-score: 74.585%) in the Factify challenge hosted by De-Factify @ AAAI 2022, which demonstrates that our model achieved competitive performance without using auxiliary tasks or extra information. The source code of our work is publicly available at https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fake news has become easier to spread due to the growing number of social media users. For
example, about 59% of social media news consumers expect that news spread via social media may
be inaccurate [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To influence public opinion, many fake news stories mislead
readers by replacing some true content with false details. Besides, fake
news with both textual and visual content attracts readers more easily and is harder to judge than
text-only content. Therefore, it is essential to detect multi-modal fake news to eliminate its
negative impacts.
      </p>
      <p>
        Fact-checkers aim to assess check-worthiness and to retrieve evidence or previously verified claims [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recent
works have presented a number of approaches for tackling fake news detection automatically.
In uni-modal detection, Shu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] exploited a tri-relationship (publishers, news pieces, and
users) to model the relations and interactions for detecting news disinformation. Przybyla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
utilized the style the news articles are written in to estimate their credibility. In multi-modal
detection, Jin et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed an att-RNN that combines a recurrent neural network with
an attention mechanism to fuse textual content and visual images. MCAN is proposed by
extracting spatial-domain features and textual features by pre-trained models [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Further, to
address the fact that fake images are often re-compressed or tampered images, which show
periodicity in the frequency domain, the authors used the discrete cosine transform as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], then
designed a CNN-based network for capturing frequency-domain features from images.
      </p>
      <p>
        A real-world problem, identifying if the claim entails the document, is the challenge called
Factify [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] hosted by De-Factify (https://aiisc.ai/defactify/factify.html). Figure 1 shows some examples of all five categories. The
goal is to design a method to classify the given text and images into one of five categories:
Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, and Refute. To
tackle the problem, in this paper, we propose Pre-CoFact, which uses pre-trained models and
co-attention networks to perform the shared task: it first extracts features from both text and
images, then fuses this information through the co-attention module. Specifically, two powerful
Transformer-based pre-trained models, DeBERTa [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and DeiT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], are adopted for extracting
features from both claims and documents’ text and images, respectively. Afterwards, several
co-attention modules are designed for fusing the contexts of the text and images. Finally, these
embeddings are aggregated as corresponding embeddings to classify the category of the news.
      </p>
      <p>The main results of this paper can be summarized as follows:
• Using text and images directly can achieve expressive results without any auxiliary tasks,
preprocessing methods, or extra information (e.g., optical character recognition (OCR)
from images).</p>
      <p>
        • Adopting pre-trained models helps improve performance on the shared task, and using
co-attention networks can learn the correlation within the same modality (text or images
from claims and documents) and the dependencies between different modalities (text and
images).
• Our ensemble model outperforms the machine learning models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by at least 48% and 40% in
terms of validation score and testing score, respectively. Besides, extensive experiments were further
conducted to examine the capability of the proposed model.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>Factify is a dataset for multi-modal fact verification, which contains textual claims and claim images, along with
reference textual documents and document images. Each sample includes claim_image, claim, claim_ocr,
document_image, document, document_ocr, and category. Each field is described
as follows:
• claim_image: the image of the given claim.
• claim: the text of the given claim.
• claim_ocr: the text from the claim_image detected by the host.
• document_image: the image of the given reference.
• document: the text of the given reference.
• document_ocr: the text from the document_image detected by the host.</p>
      <p>• category: the category of the data sample from a list of five classes.</p>
      <p>The category is composed of 1) Support_Multimodal: both the claim text and image are similar
to those of the document, 2) Support_Text: the claim text is similar or entailed, but the images of the
document and claim are not similar, 3) Insufficient_Multimodal: the claim text is neither
supported nor refuted by the document but the images are similar to the document, 4) Insufficient_Text:
both the text and images of the claim are neither supported nor refuted by the document, although
it is possible that the claim text has common words with the document text, and 5) Refute: the
images and/or text from the claim and document are completely contradictory.</p>
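      <p>For illustration, a minimal sketch of accessing one sample in Python (field names follow the list above; the file name and format are assumptions, since the distribution format is not described here):</p>
      <preformat>
import pandas as pd

# Hypothetical path and file format; the fields match the list above.
train = pd.read_csv("factify/train.csv")

sample = train.iloc[0]
print(sample["claim"], sample["claim_image"], sample["claim_ocr"])
print(sample["document"], sample["document_image"], sample["document_ocr"])
print(sample["category"])  # one of the five classes
      </preformat>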
      <p>
        The training set contains 35,000 samples, which has 7,000 samples of each class, and the
validation set contains 7,500 samples, which has 1,500 samples of each class. The test set, which
is used to evaluate the private score, also contains 7,500 samples. For more details, we refer
readers to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Works</title>
      <sec id="sec-3-1">
        <title>3.1. Fake News Detection</title>
        <p>
          There has been a series of studies on combating fake news to mitigate this societal crisis
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Vo and Lee [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed a novel neural ranking model which jointly utilizes textual and
visual matching signals. This is the first work using multi-modal data in social media posts to
search for verified information, which can increase users’ awareness of fact-checked information
when they are exposed to fake news. Lee et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] adopted a perplexity-based approach in
the few-shot setting, which assumes that a given claim may be fake if the corresponding
perplexity score from evidence-conditioned language models is high. BertGCN [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] was proposed
by integrating the advantages of large-scale pre-trained models and graph neural networks for
fake news detection, and is able to learn representations from the massive amount of
pre-training data and the label influence through propagation. MCAN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] adopts a large-scale
pre-trained NLP model and a pre-trained computer vision (CV) model for extracting features
from text and images, respectively. Besides, MCAN also extracts frequency domain features
from images, and then uses multiple co-attention layers to fuse this information.
        </p>
        <p>These approaches demonstrate the effectiveness of using pre-trained models for fake news
detection, which motivated us to use pre-trained models as well. Besides, MCAN inspired us
to fuse the contexts of different modalities or the same modality (e.g., text from claims and
documents).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large-Scale Pre-trained Models</title>
        <p>
          Transformer [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] has been used for machine translation and has inspired many competitive
approaches in natural language processing (NLP) tasks. Transformer-based pre-trained language
models (PLMs) have significantly improved the performance of various NLP tasks due to the
ability to understand contextualized information from the pre-trained dataset. Since BERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
was presented, we have seen the rise of a set of large-scale PLMs such as GPT-3 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], RoBERTa
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], XLNet [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], ELECTRA [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and DeBERTa [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. These PLMs have been fine-tuned using
task-specific labels and have created a new state of the art in many downstream tasks.
        </p>
        <p>
          Recently, vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is a Transformer encoder architecture applied directly
to image classification, with patches of raw images as input tokens as in NLP; it achieves competitive
results compared to state-of-the-art convolutional networks by pre-training on a large private
image dataset, JFT-300M [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. ViT demonstrates that convolution-free networks can still learn
the relations in images. To reduce the required pre-training dataset size and improve training efficiency, several
follow-up studies have been conducted. DINO was proposed by [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] to improve the standard
ViT model through self-supervised learning. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed DeiT, which used a novel distillation
procedure based on a distillation token to ensure the student learns from the teacher through
attention.
        </p>
        <p>These pre-trained models demonstrate generalization across various domains. Further, using
pre-trained models helps capture rich information for downstream tasks, which can also
reduce the burden of training from scratch. These advantages motivated us to adopt
state-of-the-art pre-trained models for transforming images and text into contextual embeddings.
Besides, we focused on using Transformer-based pre-trained models for feature extraction.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>4.1. Problem Formulation</title>
        <p>Let $\mathcal{D} = \{C^t_i, C^v_i, D^t_i, D^v_i\}_{i=1}^{|\mathcal{D}|}$ denote the corpus of the dataset, where the $i$-th sample is composed
of the claim text $C^t_i = c_1 c_2 \cdots$, the claim image $C^v_i$, the document text $D^t_i = d_1 d_2 \cdots$,
and the document image $D^v_i$. The $i$-th target $y_i \in \{Support\_Multimodal, Support\_Text,
Insufficient\_Multimodal, Insufficient\_Text, Refute\}$. The goal is to identify the support,
insufficient-evidence, or refute relation between given claims and documents.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pre-CoFact Overview</title>
        <p>
          Figure 2 illustrates the overview of the proposed Pre-CoFact framework. The input contains the
claim image, the claim text, the document image, and the document text. The feature extraction
part adopts DeiT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as the pre-trained CV model and DeBERTa [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as the pre-trained NLP
model, and feeds the outputs of pre-trained models to the image embedding layer and text
embedding layer for transforming images and texts into corresponding embeddings. The
multi-modality fusion part fuses the information from the same modality (images/text from the claim
and document) and different modalities (images and text from the claim/document) based on
multiple co-attention layers. Finally, the category classifier predicts the possible classes based
on the embeddings from feature extraction and the embeddings from multi-modality fusion.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Feature Extraction</title>
        <p>The enrichment of pre-trained models enables us to have rich information without training
from scratch. Moreover, Transformer-based pre-trained models demonstrate the success on
both NLP and CV tasks. However, it is essential to fine-tune them to fit our task. To this end,
we first use DeBERTa as our pre-trained NLP model and DeiT as our pre-trained CV model, and
then we use an embedding layer for transforming the pre-trained embeddings into embeddings for
our task. Specifically, the $i$-th outputs of the embedding layers are calculated as follows:</p>
        <p>$E_{C^v_i} = \sigma(\mathrm{MLP}(\mathrm{DeiT}(C^v_i)))$, $E_{C^t_i} = \sigma(\mathrm{MLP}(\mathrm{DeBERTa}(C^t_i)))$, $E_{D^v_i} = \sigma(\mathrm{MLP}(\mathrm{DeiT}(D^v_i)))$, $E_{D^t_i} = \sigma(\mathrm{MLP}(\mathrm{DeBERTa}(D^t_i)))$,
where the output dimensions of DeiT and DeBERTa are 768, the embedding layer is composed of an MLP
and an activation function $\sigma$, and $E_{C^v_i}$, $E_{C^t_i}$, $E_{D^v_i}$, $E_{D^t_i}$ are $d$-dimension vectors. It is noted that the
activation functions we used are ReLU and Mish [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] for testing the results.</p>
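        <p>As a minimal sketch of this step for the text side (assuming the HuggingFace checkpoint named in Section 5.1.1, microsoft/deberta-base; the embedding layer below is one linear projection plus activation, with $d$ = 512 as in Section 5.1.1):</p>
        <preformat>
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Frozen pre-trained text encoder; the image side is analogous with DeiT
# ("facebook/deit-base-patch16-224") followed by its own embedding layer.
text_encoder = AutoModel.from_pretrained("microsoft/deberta-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
for p in text_encoder.parameters():
    p.requires_grad = False  # the pre-trained parameters are frozen (Sec. 5.1.1)

d = 512
# Embedding layer: MLP (here a single linear map, 768 -> d) plus activation.
text_embedding_layer = nn.Sequential(nn.Linear(768, d), nn.ReLU())

tokens = tokenizer("a claim text", return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    hidden = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 768)
E_claim_text = text_embedding_layer(hidden)            # (1, seq_len, d)
        </preformat>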
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Multi-Modality Fusion</title>
        <p>
          After generating embeddings of text and images, we adopt multiple co-attention layers as in
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to fuse the embeddings. To check the relation between claim and document, we use the
co-attention layer to separately fuse 1) images of claims and documents and 2) text of claims and
documents. Besides, the relation between text and images from the claims or documents can be
viewed as checking whether they are related or not. Therefore, we also adopt the co-attention
layer for fusing 3) images and text of claims and 4) images and text of documents.
        </p>
        <p>
          Specifically, each co-attention layer takes two inputs $X$ and $Y$ and produces two outputs
$\tilde{X}$ and $\tilde{Y}$. Here we use a single head, derived as the following equations:
$Q = X W^Q$, $K = Y W^K$, $V = Y W^V$,
$\mathrm{CA}(X, Y) = \mathrm{softmax}(Q K^\top / \sqrt{d})\, V$,
$\tilde{X} = \mathrm{FFN}(\mathrm{Norm}(X + \mathrm{CA}(X, Y)))$,
and symmetrically for $\tilde{Y}$ by exchanging the roles of $X$ and $Y$, where $\mathrm{Norm}$ and $\mathrm{FFN}$ are the same normalization
method and feed-forward network as in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The co-attention block has been widely used in VQA
tasks [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], as it can capture dependencies between different inputs. Therefore, we use the co-attention
layer for the four fusions listed above, and afterwards we aggregate the
fused tokens into a representative token. That is, given a fused embedding in $\mathbb{R}^{l \times d}$, where
$l$ is the sequence length, we use mean aggregation to output a vector in $\mathbb{R}^{1 \times d}$. Besides, we also feed
$E_{C^v}$, $E_{C^t}$, $E_{D^v}$, $E_{D^t}$ into the aggregation function for classification.
        </p>
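        <p>A minimal single-head co-attention sketch consistent with the equations above (the Add-and-Norm and feed-forward structure follows [16]; sharing one set of projections for both directions is a simplification of this sketch):</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Single-head co-attention: each input attends over the other, followed
    by Add-and-Norm and a feed-forward network as in the Transformer [16]."""
    def __init__(self, d, d_ff=1024):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def attend(self, x, y):
        # Queries from x; keys and values from y (scaled dot-product attention).
        q, k, v = self.wq(x), self.wk(y), self.wv(y)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        ctx = F.softmax(scores, dim=-1) @ v
        h = self.norm1(x + ctx)             # Add and Norm
        return self.norm2(h + self.ffn(h))  # FFN with residual

    def forward(self, x, y):
        return self.attend(x, y), self.attend(y, x)

coatt = CoAttention(512)
x, y = torch.randn(2, 16, 512), torch.randn(2, 24, 512)
x_f, y_f = coatt(x, y)    # two fused outputs per co-attention layer
x_agg = x_f.mean(dim=1)   # mean aggregation: R^(l x d) to R^(1 x d)
        </preformat>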
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Category Classifier</title>
        <p>To predict the label of the given claims and documents, we first concatenate the 8 aggregated outputs
from the co-attention layers and the original 4 aggregated embeddings
to obtain the input of the classifier $E_c$. It
is worth noting that the outputs of the embedding layers are also used since the original information can
provide some clues for classifying the news. Afterwards, the $i$-th output of the classifier is the
probability as follows:
$h_1 = \sigma(E_c W_0)$, $h_2 = \sigma(h_1 W_1)$, $\hat{y}_i = \mathrm{softmax}(h_2 W_2)$,
where $W_0 \in \mathbb{R}^{12d \times d}$, $W_1 \in \mathbb{R}^{d \times d_1}$, and $W_2 \in \mathbb{R}^{d_1 \times 5}$. Note that $\sigma$ is the same as in the embedding layer, which
uses both ReLU and Mish for testing the results.</p>
        <p>We trained our model by minimizing the cross-entropy loss $\mathcal{L}$ to learn the prediction of the
categories: $\mathcal{L} = -\sum_{i=1}^{|\mathcal{D}|} y_i \log(\hat{y}_i)$.</p>
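        <p>A sketch of the classifier and loss under these definitions (the hidden width $d_1$ is not specified, so 512 here is an assumption; nn.CrossEntropyLoss applies the softmax and log internally):</p>
        <preformat>
import torch
import torch.nn as nn

d = 512
d1 = 512  # assumed; the paper does not state d_1
classifier = nn.Sequential(
    nn.Linear(12 * d, d), nn.ReLU(),  # h1 = sigma(E_c W_0), W_0 in R^(12d x d)
    nn.Linear(d, d1), nn.ReLU(),      # h2 = sigma(h1 W_1),  W_1 in R^(d x d_1)
    nn.Linear(d1, 5),                 # logits, W_2 in R^(d_1 x 5)
)
criterion = nn.CrossEntropyLoss()     # cross-entropy over the five categories

# E_c concatenates 8 aggregated co-attention outputs and 4 aggregated embeddings.
E_c = torch.randn(32, 12 * d)         # dummy batch of 32 samples
labels = torch.randint(0, 5, (32,))
loss = criterion(classifier(E_c), labels)
        </preformat>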
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Ensemble Method</title>
        <p>Each classifier may have its strengths and weaknesses, and ensemble methods have been widely
used to enhance performance. Therefore, we follow [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in using the power weighted sum to
enhance the performance of the model. The formula is derived as follows:
$P = w_1 \times P_1^{p} + w_2 \times P_2^{p} + \cdots + w_n \times P_n^{p}$,
where $P_1, \cdots, P_n$ are the predicted probabilities from the corresponding models, $w_1, \cdots, w_n$ are
the weights with respect to the corresponding models, $n$ is the number of trained models, and $p$ is
the power weight. It is noted that these parameters are tuned by hand.</p>
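        <p>A sketch of the power weighted sum with the hand-tuned values reported in Section 5.1.1 (dummy probabilities stand in for the five model outputs):</p>
        <preformat>
import numpy as np

def power_weighted_sum(probs, weights, p=0.5):
    # P = w_1 * P_1**p + w_2 * P_2**p + ... + w_n * P_n**p
    return sum(w * np.power(P, p) for w, P in zip(weights, probs))

rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5)) for _ in range(5)]  # five models, five classes
ensembled = power_weighted_sum(probs, [0.6, 0.2, 0.1, 0.2, 0.3], p=0.5)
prediction = int(np.argmax(ensembled))
        </preformat>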
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Implementation Details</title>
          <p>The dimension $d$ was set to 512, the inner dimension of the feed-forward layer was 1024, and the
number of heads was set to 4. The dropout rate was 0.1, and the max sequence length was 512.
The batch size was 32, the learning rates were set to 3e-5 and 2e-5, the training epochs were set
to 30, and the seeds were tested with 41 and 42. The power $p$ was set to 0.5, and the weights
were set to 0.6, 0.2, 0.1, 0.2, 0.3, which were manually tuned by validation score. The pre-trained
DeBERTa was deberta-base, and the DeiT was deit-base-patch16-224. The parameters of the
two pre-trained models were frozen. All images were transformed by resizing to 256, center
cropping to 224, and normalizing. We preprocessed only for transforming images, and then we
stored the text and processed images in corresponding pickle files for training and evaluation.
All the training and evaluation phases were conducted on a machine with an Intel Xeon 4110 CPU
@ 2.10GHz, an Nvidia GeForce RTX 2080 Ti, and 252GB RAM. The source code is available at
https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.</p>
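          <p>The image pipeline described above can be written with torchvision as follows (a sketch; the normalization statistics are assumed to be the ImageNet values commonly paired with DeiT checkpoints, as they are not stated here):</p>
          <preformat>
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),      # resize so the shorter side is 256
    transforms.CenterCrop(224),  # center crop to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
# Usage: tensor = preprocess(pil_image) for each claim/document image.
          </preformat>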
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Evaluation Metric</title>
          <p>To evaluate the performance of the task, the weighted average F1 score was used across the 5
categories.</p>
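          <p>For reference, the metric can be computed with scikit-learn (a sketch with dummy labels):</p>
          <preformat>
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 4, 0]  # dummy gold labels over the five categories
y_pred = [0, 1, 2, 2, 4, 0]  # dummy predictions
score = f1_score(y_true, y_pred, average="weighted")  # weighted average F1
          </preformat>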
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Quantitative Results</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Ablation Study</title>
          <p>We first conducted an ablation study to verify the effective design of our proposed Pre-CoFact.
As shown in Table 1, it is evident that without co-attention networks (w/o CoAtt), the
performance is degraded. Further, applying co-attention only within the same modality (w/o CoAtt(text,
image)) is insufficient, which demonstrates the need for modeling dependencies between
different modalities. It is noted that our ensemble method slightly improves the performance
compared to Pre-CoFact. Our ensemble method includes Pre-CoFact, Pre-CoFact with
DeBERTa replaced by XLM-RoBERTa, Pre-CoFact with DeBERTa replaced by RoBERTa, Pre-CoFact
with DeBERTa replaced by RoBERTa and ReLU replaced by Mish, and Pre-CoFact with
ReLU replaced by Mish.</p>
          <p>[Table 1: Weighted F1 (%) of each Pre-CoFact variant in the ablation study.]</p>
          <p>We also used different pre-trained models to examine the module influence, as shown in Table
2. It can be seen that DeiT is more suitable than DINO for this task. Besides, XLM-RoBERTa [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] also
degrades the performance, while RoBERTa is slightly worse than Pre-CoFact with DeBERTa.</p>
          <p>[Test-set leaderboard: per-category and final weighted F1 (%) for our team, Yao (final rank 5), and the baseline.]</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Testing Performance</title>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Confusion Matrix</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we proposed Pre-CoFact, which utilizes pre-trained models and multiple co-attention
networks to alleviate the effect of fake news for the Factify task. To achieve better performance,
we adopted an ensemble method by weighting several models. The ablation study demonstrates
the effectiveness of our proposed approach. From the testing score, our method illustrates that
using only text and images without extra information can also achieve competitive performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shearer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>News use across social media platforms in 2020</article-title>
          ,
          <year>2021</year>
          . URL: https://www.pewresearch.org/journalism/2021/01/12/news-use-across-social-media-platforms-in-2020/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. A.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <article-title>Automated fact-checking for assisting human fact-checkers</article-title>
          ,
          <source>in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4551</fpage>
          -
          <lpage>4558</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Liu,
          <article-title>Beyond news contents: The role of social context for fake news detection</article-title>
          ,
          <source>in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyla</surname>
          </string-name>
          ,
          <article-title>Capturing the style of fake news</article-title>
          ,
          <source>in: The Thirty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Luo,
          <article-title>Multimodal fusion with recurrent neural networks for rumor detection on microblogs</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Multimedia Conference</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion with co-attention networks for fake news detection, in: Findings of the Association for Computational Linguistics</article-title>
          , volume ACL/IJCNLP 2021 of Findings of ACL,
          <year>2021</year>
          , pp.
          <fpage>2560</fpage>
          -
          <lpage>2569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Exploiting multi-domain visual information for fake news detection</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>518</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Factify: A multi-modal fact verification dataset</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and
          <article-title>Hate Speech Detection</article-title>
          ,
          CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-modal entailment for fact verification</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and
          <article-title>Hate Speech Detection</article-title>
          ,
          CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Deberta: decoding-enhanced bert with disentangled attention</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <article-title>Fake news, disinformation, propaganda, media bias, and flattening the curve of the COVID-19 infodemic</article-title>
          , in
          <source>: KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4054</fpage>
          -
          <lpage>4055</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Where are the facts? searching for fact-checked information to alleviate the spread of fake news</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7717</fpage>
          -
          <lpage>7731</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Towards few-shot fact-checking via perplexity</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1971</fpage>
          -
          <lpage>1981</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Bertgcn:
          <article-title>Transductive text classification by combining GCN and BERT</article-title>
          ,
          <source>CoRR abs/2105.05727</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          , NeurIPS
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          ,
          <source>CoRR abs/1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. G. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          ,
          <year>2019</year>
          , pp.
          <fpage>5754</fpage>
          -
          <lpage>5764</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>ELECTRA: pre-training text encoders as discriminators rather than generators</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Revisiting unreasonable effectiveness of data in deep learning era</article-title>
          ,
          <source>in: IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>843</fpage>
          -
          <lpage>852</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          , I. Misra,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <article-title>Emerging properties in self-supervised vision transformers</article-title>
          ,
          <source>CoRR abs/2104.14294</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>Mish: A self regularized non-monotonic neural activation function</article-title>
          ,
          <source>CoRR abs/1908.08681</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Dynamic fusion with intra- and inter-modality attention flow for visual question answering</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6639</fpage>
          -
          <lpage>6648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Emotiongif-yankee: A sentiment classifier with robust model based ensemble methods</article-title>
          ,
          <source>CoRR abs/2007.02259</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>