Team Yao at Factify 2022: Utilizing Pre-trained Models and Co-attention Networks for Multi-Modal Fact Verification

Wei-Yao Wang^1, Wen-Chih Peng^1
^1 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
sf1638.cs05@nctu.edu.tw (W. Wang); wcpeng@nctu.edu.tw (W. Peng)

De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In recent years, social media has exposed users to a myriad of misinformation and disinformation; as a result, misinformation has attracted a great deal of attention both as a research topic and as a social issue. To address the problem, we propose Pre-CoFact, a framework composed of two pre-trained models for extracting features from text and images, and multiple co-attention networks for fusing the same modality from different sources as well as different modalities. In addition, we adopt an ensemble method that combines variants of Pre-CoFact with different pre-trained models to achieve better performance. We further illustrate the effectiveness of the design through an ablation study and a comparison of different pre-trained models. Our team, Yao, won the fifth prize (F1-score: 74.585%) in the Factify challenge hosted by De-Factify @ AAAI 2022, which demonstrates that our model achieves competitive performance without using auxiliary tasks or extra information. The source code of our work is publicly available at https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.

Keywords: Multi-modal fact verification, Transformer, Co-attention, De-Factify

1. Introduction

Fake news has become easier to spread due to the growing number of social media users. For example, about 59% of social media news consumers expect the news they encounter there to be largely inaccurate [1]. To influence public opinion, many fake news stories mislead readers by replacing parts of the true content with false details. Moreover, fake news that combines textual and visual content attracts readers more easily and is harder to judge than text alone. It is therefore essential to detect multi-modal fake news in order to eliminate its negative impact.

Figure 1: A screenshot from [8], which illustrates sample examples of the five categories in the Factify dataset.

Fact checkers aim to assess check-worthiness and to retrieve evidence or previously verified claims [2]. Recent works have presented a number of approaches for tackling fake news detection automatically. In uni-modal detection, Shu et al. [3] exploited a tri-relationship (publishers, news pieces, and users) to model the relations and interactions for detecting news disinformation. Przybyla [4] utilized the style in which news articles are written to estimate their credibility. In multi-modal detection, Jin et al. [5] proposed att-RNN, which combines a recurrent neural network with an attention mechanism to fuse textual content and visual images. MCAN [6] extracts spatial-domain image features and textual features with pre-trained models.
Further, to address the fact that fake images are often re-compressed or tampered images, which show periodicity in the frequency domain, the authors applied the discrete cosine transform as in [7] and designed a CNN-based network to capture frequency-domain features from images.

A real-world instance of this problem, identifying the entailment relation between a claim and a reference document, is the Factify challenge [8, 9] hosted by De-Factify (https://aiisc.ai/defactify/factify.html). Figure 1 shows examples of all five categories. The goal is to design a method that classifies the given text and images into one of five categories: Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, and Refute.

To tackle the problem, in this paper we propose Pre-CoFact, which combines pre-trained models and co-attention networks to perform the shared task: it first extracts features from both text and images, and then fuses this information through co-attention modules. Specifically, two powerful Transformer-based pre-trained models, DeBERTa [10] and DeiT [11], are adopted to extract features from the text and images of both claims and documents, respectively. Afterwards, several co-attention modules fuse the contexts of the text and images. Finally, these embeddings are aggregated into corresponding representations to classify the category of the news.

The main results of this paper can be summarized as follows:

• Using text and images directly can achieve competitive results without any auxiliary tasks, preprocessing methods, or extra information (e.g., optical character recognition (OCR) results from images).
• Adopting pre-trained models helps improve the performance on the shared task, and co-attention networks can learn the correlation within the same modality (text or images from claims and documents) and the dependencies between different modalities (text and images).
• Our ensemble model outperforms the machine learning models of [8] by at least 48% and 40% in terms of validation score and testing score, respectively. Besides, extensive experiments were conducted to further examine the capability of the proposed model.

2. Dataset

Factify is a dataset for multi-modal fact verification, which contains textual claims and claim images as well as reference textual documents and document images. Each sample includes claim_image, claim, claim_ocr, document_image, document, document_ocr, and category. The fields are described as follows:

• claim_image: the image of the given claim.
• claim: the text of the given claim.
• claim_ocr: the text extracted from claim_image by the challenge organizers.
• document_image: the image of the given reference.
• document: the text of the given reference.
• document_ocr: the text extracted from document_image by the challenge organizers.
• category: the category of the data sample, from a list of five classes.

The categories are: 1) Support_Multimodal: both the claim text and image are similar to those of the document; 2) Support_Text: the claim text is similar or entailed, but the images of the document and claim are not similar; 3) Insufficient_Multimodal: the claim text is neither supported nor refuted by the document, but the images are similar to those of the document; 4) Insufficient_Text: both the text and images of the claim are neither supported nor refuted by the document, although the claim text may share common words with the document text; and 5) Refute: the images and/or text of the claim and document are completely contradictory.
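For orientation, the short sketch below loads a split and checks the fields described above. It assumes the data is distributed as CSV files with exactly these column names, which is an assumption on our part; the actual release format may differ.

```python
import pandas as pd

# Hypothetical paths and file format: we assume each split ships as a CSV
# whose columns match the field names listed above.
train_df = pd.read_csv("factify/train.csv")
val_df = pd.read_csv("factify/val.csv")

fields = ["claim", "claim_image", "claim_ocr",
          "document", "document_image", "document_ocr", "category"]
assert all(col in train_df.columns for col in fields)

# Inspect the class balance over the five categories.
print(train_df["category"].value_counts())
```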
The training set contains 35,000 samples (7,000 per class), and the validation set contains 7,500 samples (1,500 per class). The test set, which is used to compute the private score, also contains 7,500 samples. For more details, we refer readers to [8].

3. Related Works

3.1. Fake News Detection

There has been a series of studies on fake news detection aimed at mitigating this societal crisis [12]. Vo and Lee [13] proposed a novel neural ranking model that jointly utilizes textual and visual matching signals; it is the first work to use multi-modal data in social media posts to search for verified information, which can increase users' awareness of fact-checked information when they are exposed to fake news. Lee et al. [14] adopted a perplexity-based approach in a few-shot setting, which assumes that a claim is likely fake if its perplexity score under evidence-conditioned language models is high. BertGCN [15] integrates the advantages of large-scale pre-trained models and graph neural networks for fake news detection; it learns representations from the massive amount of pre-training data and captures label influence through propagation. MCAN [6] adopts a large-scale pre-trained NLP model and a pre-trained computer vision (CV) model to extract features from text and images, respectively. Besides, MCAN also extracts frequency-domain features from images and then uses multiple co-attention layers to fuse this information. These approaches demonstrate the effectiveness of pre-trained models for fake news detection, which motivated us to use pre-trained models as well. MCAN further inspired us to fuse the contexts of different modalities as well as the same modality (e.g., text from claims and documents).

3.2. Large-Scale Pre-trained Models

The Transformer [16] was originally applied to machine translation and has inspired many competitive approaches in natural language processing (NLP). Transformer-based pre-trained language models (PLMs) have significantly improved the performance of various NLP tasks thanks to their ability to capture contextualized information from the pre-training data. Since BERT [17] was presented, we have seen the rise of a set of large-scale PLMs such as GPT-3 [18], RoBERTa [19], XLNet [20], ELECTRA [21], and DeBERTa [10]. These PLMs have been fine-tuned with task-specific labels and have set a new state of the art on many downstream tasks. Recently, the Vision Transformer (ViT) [22] applied a Transformer encoder directly to image classification by splitting raw images into patches treated like tokens in NLP; pre-trained on the large private image dataset JFT-300M [23], it achieves competitive results compared to state-of-the-art convolutional networks. ViT demonstrates that convolution-free networks can still learn the relations within images. To reduce the required amount of pre-training data and improve training efficiency, several follow-up studies have been conducted. DINO [24] improves the standard ViT model through self-supervised learning. DeiT [11] uses a novel distillation procedure based on a distillation token to ensure the student learns from the teacher through attention. These pre-trained models generalize across various domains. Moreover, using pre-trained models helps capture rich information for downstream tasks, which also reduces the burden of training from scratch.
These advantages motivated us to adopt state-of-the-art pre-trained models for transforming images and text into contextual embeddings. Specifically, we focus on Transformer-based pre-trained models for feature extraction.

4. Method

4.1. Problem Formulation

Let $C = \{C_{T_i}, C_{I_i}, D_{T_i}, D_{I_i}\}_{i=1}^{|C|}$ denote the corpus of the dataset, where the $i$-th sample is composed of the claim text $C_{T_i} = w_1^{C_{T_i}} w_2^{C_{T_i}} \cdots$, the claim image $C_{I_i}$, the document text $D_{T_i} = w_1^{D_{T_i}} w_2^{D_{T_i}} \cdots$, and the document image $D_{I_i}$. The $i$-th target is $y_i \in$ {Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, Refute}. The goal is to determine support, insufficient evidence, or refutation between the given claims and documents.

Figure 2: Illustration of the Pre-CoFact framework. Each square can be seen as a token with a $d$-dimensional vector. The feature extraction part transforms text and images into corresponding embeddings. The multi-modality fusion part fuses information from the same modality (images/text from the claim and document) and from different modalities (images and text from the claim/document) to obtain contexts. Finally, the category classifier predicts the possible classes based on the embeddings from feature extraction and the embeddings from multi-modality fusion.

4.2. Pre-CoFact Overview

Figure 2 illustrates the overview of the proposed Pre-CoFact framework. The input contains the claim image, the claim text, the document image, and the document text. The feature extraction part adopts DeiT [11] as the pre-trained CV model and DeBERTa [10] as the pre-trained NLP model, and feeds the outputs of the pre-trained models into an image embedding layer and a text embedding layer that transform the images and text into corresponding embeddings. The multi-modality fusion part fuses information from the same modality (images/text from the claim and document) and from different modalities (images and text from the claim/document) using multiple co-attention layers. Finally, the category classifier predicts the possible classes based on the embeddings from feature extraction and the embeddings from multi-modality fusion.

4.3. Feature Extraction

The richness of pre-trained models enables us to obtain informative representations without training from scratch. Moreover, Transformer-based pre-trained models have demonstrated success on both NLP and CV tasks. However, their representations still need to be adapted to our task. To this end, we first use DeBERTa as our pre-trained NLP model and DeiT as our pre-trained CV model, and then use embedding layers to transform the pre-trained embeddings into task-specific embeddings. Specifically, the $i$-th output of the embedding layers is calculated as follows:

$E_{C_{I_i}} = \mathrm{Emb}_{CI}(\mathrm{DeiT}(C_{I_i})), \quad E_{D_{I_i}} = \mathrm{Emb}_{DI}(\mathrm{DeiT}(D_{I_i})),$  (1)

$E_{C_{T_i}} = \mathrm{Emb}_{CT}(\mathrm{DeBERTa}(C_{T_i})), \quad E_{D_{T_i}} = \mathrm{Emb}_{DT}(\mathrm{DeBERTa}(D_{T_i})),$  (2)

where the output dimensions of DeiT and DeBERTa are 768, $\mathrm{Emb}$ is composed of an MLP and an activation function, and $E_{C_{I_i}}, E_{C_{T_i}}, E_{D_{I_i}}, E_{D_{T_i}}$ are $d$-dimensional embeddings. Note that we tested both ReLU and Mish [25] as the activation function in $\mathrm{Emb}$.
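To make Eq. (1)-(2) concrete, below is a minimal sketch of the feature extraction stage in PyTorch with the Hugging Face transformers library. It assumes the checkpoints named in Section 5.1.1 (microsoft/deberta-base and facebook/deit-base-patch16-224); the Embed class, the variable names, and the random placeholder image are ours for illustration and are not taken from the released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class Embed(nn.Module):
    """Emb(.) in Eq. (1)-(2): a linear projection followed by an activation
    (ReLU here; Mish is the alternative mentioned in the text)."""
    def __init__(self, in_dim: int = 768, d: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, d)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, seq_len, 768)
        return self.act(self.proj(x))                     # (batch, seq_len, d)

# Frozen backbones (checkpoint names from Section 5.1.1).
text_encoder = AutoModel.from_pretrained("microsoft/deberta-base")
image_encoder = AutoModel.from_pretrained("facebook/deit-base-patch16-224")
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
emb_claim_text, emb_claim_image = Embed(), Embed()  # one Emb per input stream

tokens = tokenizer("Some claim text ...", return_tensors="pt",
                   truncation=True, max_length=512)
pixels = torch.randn(1, 3, 224, 224)  # stand-in for a resized/cropped/normalized image

with torch.no_grad():
    text_hidden = text_encoder(**tokens).last_hidden_state               # (1, L, 768)
    image_hidden = image_encoder(pixel_values=pixels).last_hidden_state  # (1, 197, 768)

E_CT = emb_claim_text(text_hidden)    # claim text embeddings, Eq. (2)
E_CI = emb_claim_image(image_hidden)  # claim image embeddings, Eq. (1)
```

The same encoders and separate Embed layers would be applied to the document text and document image to obtain the remaining two embedding streams.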
4.4. Multi-Modality Fusion

After generating the embeddings of text and images, we adopt multiple co-attention layers as in [6] to fuse them (a code sketch of the co-attention block is provided at the end of this section). To check the relation between claim and document, we use the co-attention layer to separately fuse 1) the images of claims and documents and 2) the text of claims and documents. Besides, the relation between the text and images of a claim or a document can be viewed as checking whether they are related. Therefore, we also adopt the co-attention layer to fuse 3) the images and text of claims and 4) the images and text of documents. Specifically, each co-attention layer takes two inputs $E_A$ and $E_B$ and produces two outputs $H_A$ and $H_B$. Here we use a single head to derive the following equations:

$Q_A = E_A W^{Q_A}, \quad K_A = E_A W^{K_A}, \quad V_A = E_A W^{V_A}, \quad Q_B = E_B W^{Q_B}, \quad K_B = E_B W^{K_B}, \quad V_B = E_B W^{V_B},$  (3)

$\tilde{H}_A = \mathrm{Norm}\left(E_A + \mathrm{softmax}\left(\frac{Q_A K_B^T}{\sqrt{d}}\right) V_B\right), \quad \tilde{H}_B = \mathrm{Norm}\left(E_B + \mathrm{softmax}\left(\frac{Q_B K_A^T}{\sqrt{d}}\right) V_A\right),$  (4)

$H_A = \mathrm{Norm}(\tilde{H}_A + \mathrm{FFN}(\tilde{H}_A)), \quad H_B = \mathrm{Norm}(\tilde{H}_B + \mathrm{FFN}(\tilde{H}_B)),$  (5)

where $W^{Q_A}, W^{K_A}, W^{V_A}, W^{Q_B}, W^{K_B}, W^{V_B} \in \mathbb{R}^{d \times d}$, and $\mathrm{Norm}$ and $\mathrm{FFN}$ are the same normalization method and feed-forward network as in [16]. The co-attention block has been widely used in VQA tasks [26], as it can capture dependencies between different inputs. Therefore, we use the co-attention layer for fusing:

$H_{C_I D_{I_i}}, H_{D_I C_{I_i}} = \mathrm{CoAtt}(E_{C_{I_i}}, E_{D_{I_i}}), \quad H_{C_T D_{T_i}}, H_{D_T C_{T_i}} = \mathrm{CoAtt}(E_{C_{T_i}}, E_{D_{T_i}}),$  (6)

$H_{C_I D_{T_i}}, H_{D_T C_{I_i}} = \mathrm{CoAtt}(E_{C_{I_i}}, E_{D_{T_i}}), \quad H_{C_T D_{I_i}}, H_{D_I C_{T_i}} = \mathrm{CoAtt}(E_{C_{T_i}}, E_{D_{I_i}}),$  (7)

where $\mathrm{CoAtt}$ denotes the co-attention layer. After applying the co-attention mechanism, an aggregation function condenses the fused tokens into a representative token. That is, given a fused embedding in $\mathbb{R}^{N \times d}$, where $N$ is the sequence length, we use mean aggregation to produce an output in $\mathbb{R}^{1 \times d}$. Besides, we also feed $E_{C_{I_i}}, E_{C_{T_i}}, E_{D_{I_i}}, E_{D_{T_i}}$ into the aggregation function for classification.

4.5. Category Classifier

To predict the label of the given claims and documents, we first concatenate the 8 aggregated outputs $H_{C_I D_{I_i}}, H_{D_I C_{I_i}}, H_{C_T D_{T_i}}, H_{D_T C_{T_i}}, H_{C_I D_{T_i}}, H_{D_T C_{I_i}}, H_{C_T D_{I_i}}, H_{D_I C_{T_i}}$ from the co-attention layers and the 4 aggregated original embeddings $E_{C_{I_i}}, E_{C_{T_i}}, E_{D_{I_i}}, E_{D_{T_i}}$ to obtain the classifier input $Z_i$. It is worth noting that the original embeddings are also used since they can provide additional clues for classifying the news. Afterwards, the $i$-th output of the classifier is the probability computed as follows:

$Z_i^{M_1} = \sigma(Z_i W^{Z}), \quad Z_i^{M_2} = \sigma(Z_i^{M_1} W^{M_1}),$  (8)

$\hat{y}_i = \mathrm{softmax}(Z_i^{M_2} W^{M_2}),$  (9)

where $W^{Z} \in \mathbb{R}^{12d \times d}$, $W^{M_1} \in \mathbb{R}^{d \times d_{M_1}}$, and $W^{M_2} \in \mathbb{R}^{d_{M_1} \times 5}$. Note that $\sigma$ is the same activation as in $\mathrm{Emb}$, for which both ReLU and Mish were tested. We trained our model by minimizing the cross-entropy loss $\mathbb{L}$ to learn the prediction of the categories:

$\mathbb{L} = -\sum_{i=1}^{|C|} y_i \log(\hat{y}_i).$  (10)

4.6. Ensemble Method

Each classifier has its own strengths and weaknesses, and ensemble methods have been widely used to enhance performance. Therefore, we follow [27] and use a power weighted sum to enhance the performance of the model. The formula is as follows:

$p = p_1^N \times w_1 + p_2^N \times w_2 + \cdots + p_k^N \times w_k,$  (11)

where $p_1, \cdots, p_k$ are the predicted probabilities from the corresponding models, $w_1, \cdots, w_k$ are the weights of the corresponding models, $k$ is the number of trained models, and $N$ is the power. Note that these parameters are tuned by hand.
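As a concrete reference for the fusion module of Section 4.4, the sketch below implements a single-head co-attention block following Eq. (3)-(5). It is a minimal PyTorch sketch; the class and variable names are ours, and the released code may organize this differently (e.g., with the 4 attention heads noted in Section 5.1.1).

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Single-head co-attention block following Eq. (3)-(5): each stream
    attends over the other, with residual connections, LayerNorm, and a
    position-wise feed-forward network."""
    def __init__(self, d: int = 512, d_ff: int = 1024):
        super().__init__()
        self.wq_a, self.wk_a, self.wv_a = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.wq_b, self.wk_b, self.wv_b = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.norm1_a, self.norm2_a = nn.LayerNorm(d), nn.LayerNorm(d)
        self.norm1_b, self.norm2_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_a = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ffn_b = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.scale = d ** 0.5

    def forward(self, e_a: torch.Tensor, e_b: torch.Tensor):
        # e_a: (batch, N_a, d), e_b: (batch, N_b, d)
        q_a, k_a, v_a = self.wq_a(e_a), self.wk_a(e_a), self.wv_a(e_a)   # Eq. (3)
        q_b, k_b, v_b = self.wq_b(e_b), self.wk_b(e_b), self.wv_b(e_b)
        attn_ab = torch.softmax(q_a @ k_b.transpose(-2, -1) / self.scale, dim=-1)
        attn_ba = torch.softmax(q_b @ k_a.transpose(-2, -1) / self.scale, dim=-1)
        h_a = self.norm1_a(e_a + attn_ab @ v_b)                          # Eq. (4)
        h_b = self.norm1_b(e_b + attn_ba @ v_a)
        h_a = self.norm2_a(h_a + self.ffn_a(h_a))                        # Eq. (5)
        h_b = self.norm2_b(h_b + self.ffn_b(h_b))
        return h_a, h_b
```

Each of the four pairings in Eq. (6)-(7) would use its own instance of such a block, and the outputs are mean-aggregated over the token dimension before being concatenated for the category classifier of Section 4.5.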
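The power weighted sum of Eq. (11) likewise reduces to a few lines once the per-model softmax outputs are available; the helper below is a sketch and its name is ours.

```python
import numpy as np

def power_weighted_ensemble(probs, weights, n=0.5):
    """Eq. (11): p = w_1 * p_1^N + ... + w_k * p_k^N.
    probs: list of (num_samples, 5) softmax outputs, one per trained model.
    weights, n: hand-tuned ensemble weights and the power N."""
    combined = sum(w * np.power(p, n) for p, w in zip(probs, weights))
    return combined.argmax(axis=-1)  # predicted category index per sample
```

With the hand-tuned values reported in Section 5.1.1, weights would be [0.6, 0.2, 0.1, 0.2, 0.3] and n = 0.5.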
5. Results and Analysis

5.1. Experimental Setup

5.1.1. Implementation Details

The dimension $d$ was set to 512, the inner dimension of the feed-forward layer was 1024, and the number of heads was set to 4. The dropout rate was 0.1, and the max sequence length was 512. The batch size was 32, the learning rates were set to 3e-5 and 2e-5, the number of training epochs was 30, and the seeds tested were 41 and 42. The power $N$ was set to 0.5, and the weights were set to 0.6, 0.2, 0.1, 0.2, and 0.3, which were manually tuned on the validation score. The pre-trained DeBERTa was deberta-base (https://huggingface.co/microsoft/deberta-base), and the DeiT was deit-base-patch16-224 (https://huggingface.co/facebook/deit-base-patch16-224). The parameters of the two pre-trained models were frozen. All images were transformed by resizing to 256, center cropping to 224, and normalizing; this image transformation was the only preprocessing applied. The text and processed images were then stored in pickle files for training and evaluation. All training and evaluation phases were conducted on a machine with an Intel Xeon 4110 CPU @ 2.10GHz, an Nvidia GeForce RTX 2080 Ti, and 252GB of RAM. The source code is available at https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021.

5.1.2. Evaluation Metric

To evaluate the performance on the task, the weighted average F1 score across the 5 categories was used.

5.2. Quantitative Results

5.2.1. Ablation Study

We first conducted an ablation study to verify the design of the proposed Pre-CoFact. As shown in Table 1, without the co-attention networks (w/o CoAtt), the performance is degraded. Further, applying co-attention only within the same modality (w/o CoAtt(text, image)) is insufficient, which demonstrates the need for modeling dependencies between different modalities. It is noted that our ensemble method slightly improves the performance compared to Pre-CoFact alone. The ensemble includes Pre-CoFact, Pre-CoFact with DeBERTa replaced by XLM-RoBERTa, Pre-CoFact with DeBERTa replaced by RoBERTa, Pre-CoFact with DeBERTa replaced by RoBERTa and ReLU replaced by Mish, and Pre-CoFact with ReLU replaced by Mish.

Table 1: Ablation study of our model in terms of validation score. w/o CoAtt denotes using only the four embeddings for classification, and w/o CoAtt(text, image) denotes using only same-modality co-attention (Eq. 6).

Model                    | Weighted F1 (%)
w/o CoAtt                | 75.68 (-4.34)
w/o CoAtt(text, image)   | 76.43 (-3.59)
Pre-CoFact (Ours)        | 78.46
Ensemble (Ours)          | 80.02 (+1.56)

We also examined different pre-trained models to study each module's influence, as shown in Table 2. It can be seen that DeiT is more suitable than DINO for this task. Besides, XLM-RoBERTa degrades the performance, while RoBERTa is only slightly worse than Pre-CoFact with DeBERTa.

Table 2: Variant pre-trained models in terms of validation score. Pre-CoFact uses DeiT and DeBERTa as pre-trained models. In the DINO variant, DeiT is replaced by DINO [24]; in the XLM-RoBERTa and RoBERTa variants, DeBERTa is replaced by XLM-RoBERTa [28] and RoBERTa [19], respectively.

Model              | Weighted F1 (%)
DINO [24]          | 73.94 (-4.52)
XLM-RoBERTa [28]   | 74.11 (-4.35)
RoBERTa [19]       | 77.53 (-0.93)
Pre-CoFact (Ours)  | 78.46

5.2.2. Testing Performance

Table 3 shows the performance on the testing set. Our approach achieved an F1-score of 74.585%, winning the fifth prize in detecting fake news. This result outperforms the baseline by 40.5%, with only about a 2.2% gap to the first prize. Despite the gap, our approach demonstrates that using only text and images can achieve competitive performance.

Table 3: Performance of our model in terms of testing score. Our method achieved the fifth prize with only about a 2.2% gap to the winner, while outperforming the baseline by 40.5%.

Rank | Team     | Support_Text (%) | Support_Multimodal (%) | Insufficient_Text (%) | Insufficient_Multimodal (%) | Refute (%) | Final (%)
5    | Yao      | 68.881           | 81.610                 | 84.836                | 88.309                      | 100.00     | 74.585
-    | Baseline | 82.675           | 75.466                 | 74.424                | 69.678                      | 42.354     | 53.098
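All scores above are weighted-average F1 values (Section 5.1.2). As a reference point, such a score can be computed with scikit-learn as in the sketch below; the labels are hypothetical and the official evaluation script may differ.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted category labels for a handful of samples.
y_true = ["Support_Multimodal", "Refute", "Insufficient_Text", "Refute"]
y_pred = ["Support_Multimodal", "Refute", "Insufficient_Multimodal", "Refute"]

# Weighted-average F1 across the five categories.
print(f1_score(y_true, y_pred, average="weighted"))
```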
5.2.3. Confusion Matrix

Figure 3 shows the confusion matrices of the validation set and testing set. It can be observed that our model classifies Refute precisely on both the validation and testing sets, while it tends to misjudge whether the text is entailed when the image is not entailed.

Figure 3: Confusion matrix of the validation set and testing set.

6. Conclusion

In this paper, we proposed Pre-CoFact, which utilizes pre-trained models and multiple co-attention networks to alleviate the effect of fake news in the Factify task. To achieve better performance, we adopted an ensemble method that weights several models. The ablation study demonstrates the effectiveness of the proposed approach. The testing score illustrates that using only text and images, without extra information, can achieve competitive performance.

References

[1] E. Shearer, A. Mitchell, News use across social media platforms in 2020, 2021. URL: https://www.pewresearch.org/journalism/2021/01/12/news-use-across-social-media-platforms-in-2020/.
[2] P. Nakov, D. P. A. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. D. S. Martino, Automated fact-checking for assisting human fact-checkers, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 4551–4558.
[3] K. Shu, S. Wang, H. Liu, Beyond news contents: The role of social context for fake news detection, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 312–320.
[4] P. Przybyla, Capturing the style of fake news, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 490–497.
[5] Z. Jin, J. Cao, H. Guo, Y. Zhang, J. Luo, Multimodal fusion with recurrent neural networks for rumor detection on microblogs, in: Proceedings of the 2017 ACM on Multimedia Conference, 2017, pp. 795–816.
[6] Y. Wu, P. Zhan, Y. Zhang, L. Wang, Z. Xu, Multimodal fusion with co-attention networks for fake news detection, in: Findings of the Association for Computational Linguistics, volume ACL/IJCNLP 2021 of Findings of ACL, 2021, pp. 2560–2569.
[7] P. Qi, J. Cao, T. Yang, J. Guo, J. Li, Exploiting multi-domain visual information for fake news detection, in: 2019 IEEE International Conference on Data Mining, 2019, pp. 518–527.
[8] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[9] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[10] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: 9th International Conference on Learning Representations, 2021.
[11] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, in: Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 2021, pp. 10347–10357.
[12] P. Nakov, G. D. S. Martino, Fake news, disinformation, propaganda, media bias, and flattening the curve of the COVID-19 infodemic, in: KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 4054–4055.
[13] N. Vo, K. Lee, Where are the facts? Searching for fact-checked information to alleviate the spread of fake news, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 7717–7731.
[14] N. Lee, Y. Bang, A. Madotto, P. Fung, Towards few-shot fact-checking via perplexity, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 1971–1981.
[15] Y. Lin, Y. Meng, X. Sun, Q. Han, K. Kuang, J. Li, F. Wu, BertGCN: Transductive text classification by combining GCN and BERT, CoRR abs/2105.05727 (2021).
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017, pp. 5998–6008.
[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2019, pp. 4171–4186.
[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).
[20] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019, pp. 5754–5764.
[21] K. Clark, M. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: 8th International Conference on Learning Representations, 2020.
[22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations, 2021.
[23] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Revisiting unreasonable effectiveness of data in deep learning era, in: IEEE International Conference on Computer Vision, 2017, pp. 843–852.
[24] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, CoRR abs/2104.14294 (2021).
[25] D. Misra, Mish: A self regularized non-monotonic neural activation function, CoRR abs/1908.08681 (2019).
[26] P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, H. Li, Dynamic fusion with intra- and inter-modality attention flow for visual question answering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6639–6648.
[27] W. Wang, K. Chang, Y. Tang, EmotionGIF-Yankee: A sentiment classifier with robust model based ensemble methods, CoRR abs/2007.02259 (2020).
[28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 8440–8451.