<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team gzw at Factify 2: Multimodal Attention and Fusion Networks for Multi-Modal Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhenwei Gao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Electronic and Information Engineering of UESTC in Guangdong</institution>
          ,
          <addr-line>523808</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, detecting fake news on social media platforms has become a top priority, since the widespread dissemination of fake news may mislead readers and have negative effects. To address this problem, we propose a Multimodal Attention and Fusion Network (MAFN) for multi-modal fact verification. Specifically, we employ DeBERTa and DeiT to obtain better representations for text and images, respectively. Then, we feed the obtained representations of images and text into a multi-modal attention network to model both inter-modality and intra-modality relationships. Besides, we adopt an ensemble strategy by using different pre-trained models in MAFN to achieve better performance. We conduct a series of ablation studies to verify the impact of each designed module on performance. Our method (team gzw) ranked fifth on the leaderboard of the Factify Challenge hosted by De-Factify@AAAI 2023, achieving an F1 score of 76.051%, which shows that our model achieves competitive performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal Attention</kwd>
        <kwd>Pre-trained Model</kwd>
        <kwd>Self-Attention</kwd>
        <kwd>De-Factify</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social media has become a mainstream platform for people to communicate their ideas, owing to its increasing convenience and intelligence. However, every coin has two sides: it has also gradually become an ideal place for the widespread dissemination of fake news. Since fake news maliciously distorts and fabricates facts, its extensive dissemination has extremely negative
impacts on individuals and society. In addition, multimedia intelligence [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] can help
people better understand the world. Therefore, it is urgent to detect fake news
with multimedia content on social platforms.
      </p>
      <p>
        In order to facilitate the detection of fake news, many approaches have been proposed. Early
attempts (e.g., snopes.com) mainly verified fake news through experts or institutions in
related fields, which is obviously time-consuming and labor-intensive. Therefore, automatically
detecting fake news has become a key research direction and has drawn much attention in recent
years. Basically, existing studies on automatic fake news detection can be summarized into two
categories: (1) The first is traditional learning methods [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ], which design plenty of
hand-crafted features from the media content of posts and the social context of users. With these
sophisticated features, SVM classifiers [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ] and decision trees [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] have been trained to debunk
fake news. However, the content of fake news is highly complicated and hard to fully capture
with hand-crafted features. (2) With deep neural networks having achieved immense success in
learning image and textual representations and their downstream tasks [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], researchers
have realized that deep learning plays a very important role in detecting fake news. Thus, deep
learning based methods [
        <xref ref-type="bibr" rid="ref10 ref11 ref7">7, 10, 11</xref>
        ] have been proposed to automatically capture deep features in
an end-to-end way. For example, Ma et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employ Recurrent Neural Networks (RNNs) to
learn the hidden features from posts. Yu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use Convolutional Neural Networks (CNNs)
to obtain key features and their high-level interactions from fake news. However, most of the
above methods focus only on textual content and ignore the multi-modal information of posts
(such as text and images), which is a key component of social media platforms.
      </p>
      <p>
        De-Factify2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a competition hosted at the AAAI 2023 workshop on multi-modal fact checking
and hate speech detection, and an extension of the De-Factify [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] competition. This workshop
aims to encourage researchers from inter-disciplinary domains working on multi-modality
and/or fact checking to come together and work on multi-modal (images, memes, videos) fact
checking. The goal of this competition is to design a method to classify the given text and images
into one of the five categories: Support_Multimodal, Support_Text, Insufficient_Multimodal,
Insufficient_Text, and Refute, as displayed in Figure 1. For more details, we refer readers to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
To tackle this problem, this paper proposes a Multimodal Attention and Fusion Network (MAFN)
with pre-trained models and co-attention networks for the shared task, which first
extracts features from both text and images and then fuses this information through co-attention
modules. Specifically, two powerful Transformer-based pre-trained models, DeBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and
DeiT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], are adopted to extract features of text and images, respectively, from both claims and documents.
Based on that, several co-attention modules are designed to fuse the contexts
of text and images. Afterwards, we apply a self-attention mechanism to obtain the corresponding
representative embeddings. Finally, these embeddings are concatenated to form
the final embedding used to classify the category of the news.
      </p>
      <p>The main contributions of this paper can be summarized as follows:
• We leverage an ensemble strategy based on different pre-trained models to obtain better
representations for the claims and documents.
• We design a multi-modal attention mechanism and a fusion module to learn the semantic
correlations at the intra-modality level (text or images from claims and documents) and the
inter-modality dependencies.
• Our ensemble model outperforms the baseline by 17.0% in terms of test score, while there
is still a gap of about 7.6% compared to the first prize. Besides, a series of ablation studies
were conducted to study the impact of the designed modules on the overall
performance of the model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Fake News Detection</title>
        <p>
          Recently, fake news detection with multi-modality has received considerable attention. Several
approaches [
          <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19">16, 17, 18, 19</xref>
          ] conduct fake news detection based on the multimedia content and
obtain superior performance. Jin et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] propose a multi-modality based fake news detection
model, which extracts the multi-modality information including visual, textual and social context
features, and then fuses them with an attention mechanism. Khattar et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] introduce a multimodal
variational autoencoder that learns a shared representation of text and images. Singhal et
al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] make use of the pre-trained BERT to learn text features and apply VGG-19 pre-trained
on the ImageNet dataset to learn image features. Wang et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] design a novel knowledge-driven
multimodal graph convolutional network to jointly model the textual information, knowledge
concepts and visual information into a unified framework for fake news detection. MCAN [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]
adopts a large-scale pre-trained NLP model and a pre-trained computer vision (CV) model to
obtain features from text and images, and then fuses them, together with frequency-domain features
of the images, using multiple co-attention layers.
        </p>
        <p>These methods demonstrate that multi-modal content can also help the model to detect
fake news. Thus, we design a multimodal attention and fusion network to mine the semantic
correlation among multimedia content to facilitate fact verification.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large-Scale Pre-trained Models</title>
        <p>
          Pre-trained models have achieved significant success across numerous tasks. Transformer [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], first introduced for machine translation, has inspired many competitive approaches in natural
language processing (NLP) and computer vision tasks. Specifically, Transformer-based pre-trained
language models (PLMs) have significantly improved the performance of various NLP tasks due
to their ability to understand contextualized information from the pre-training data. GPT [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
replaces bi-LSTMs with a left-to-right Transformer to better extract contextual semantics by a
global attention mechanism. DeBERTa [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] proposes a novel disentangled attention mechanism
and a new virtual adversarial training method to significantly improve the efficiency of pre-training
and the performance of downstream tasks.
        </p>
        <p>
          Vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] is a Transformer encoder architecture that splits raw images into patches and achieves
competitive image classification results compared to state-of-the-art convolutional networks,
demonstrating that convolution-free networks can still capture visual relations effectively.
Several follow-up studies based on ViT have since been conducted.
For example, DeiT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] develops a novel distillation procedure to ensure that the student learns
better knowledge from the teacher through attention.
        </p>
        <p>In short, pre-trained models can benefit the process of capturing rich information for
downstream tasks and also reduce the cost of training from scratch. These advantages drive us
to obtain better contextual embeddings of images and text with recent pre-trained models.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Attention Mechanism</title>
        <p>
          Attention mechanisms have been demonstrated to be effective in various tasks such as image captioning [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
machine translation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and recommendation systems [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Concretely, Bahdanau et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
first introduce attention in the machine translation task to allow the model to automatically
search for parts of a source sentence that are relevant to predicting a target word. Recently,
attention mechanisms have been incorporated into fake news detection. For example, Chen
et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] propose a deep attention model based on recurrent neural networks (RNNs) to
selectively learn temporal hidden representations of sequential posts for identifying fake news.
        </p>
        <p>Inspired by these successful applications of the attention mechanism, we introduce a co-attention
network to compute the intra-modality and inter-modality relationships of image
tokens and text words.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>Let $\mathcal{D} = \{(T_c^i, I_c^i, T_d^i, I_d^i)\}_{i=1}^{N}$ denote a set of $N$ training samples, where the $i$-th sample is composed of the claim text $T_c^i$, the claim image $I_c^i$, the document text $T_d^i$, and the document image $I_d^i$. Let $\mathcal{Y} = \{y_1, y_2, \cdots, y_N\}$ denote the set of corresponding labels, where $y_i \in \{$Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text, Refute$\}$. The task of this competition is to classify a data sample into one of the five categories when given its claim text, claim image, document text and document image.</p>
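        <p>For concreteness, a single training sample and the label space can be represented as in the following sketch; the class and field names are illustrative and are not the official dataset schema.</p>
        <preformat>
# Minimal sketch of one training sample in our formulation (illustrative names).
from dataclasses import dataclass
from PIL import Image

LABELS = [
    "Support_Multimodal", "Support_Text",
    "Insufficient_Multimodal", "Insufficient_Text", "Refute",
]

@dataclass
class FactifySample:
    claim_text: str            # T_c
    claim_image: Image.Image   # I_c
    doc_text: str              # T_d
    doc_image: Image.Image     # I_d
    label: str                 # y, one of LABELS
        </preformat>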
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Overall Framework</title>
        <p>
          Inspired by [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], we introduce a Multimodal Attention and Fusion Network (MAFN) to improve
the performance of multimodal fact verification. By exploiting a multi-modal attention network
for multi-modal feature fusion, our model can capture the intra-modality and inter-modality
relationships of the textual and visual content of fake news. The overall architecture is illustrated in
Figure 2. Specifically, our model consists of the following components:
• Text and Image Encoding Network: The abundance of pre-trained models enables
us to extract rich information without training from scratch. We first use DeBERTa [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
as our pre-trained NLP model and DeiT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as our pre-trained CV model to precisely
capture the semantics of both the text and the image, and then employ a fully-connected
layer followed by a ReLU function to further extract the multi-modal embeddings.
• Multi-Modality Fusion Network: As the intra-modality (images/text from the claim
and document) and inter-modality (images and text from the claim/document) relationships
can facilitate the detection of fake news, we use the multi-modality fusion part to fuse
information from the same modality and from different modalities.
• Category Classifier aims to classify each piece of data in the dataset into one of the five
categories with a fully-connected layer followed by a corresponding activation function.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text and Image Encoding Network</title>
        <p>Text Encoding Network: In order to represent the rich semantic information of sentences, we employ DeBERTa as the core module of our textual language model. Given a sentence, we split it into $n$ words with a tokenization technique, $W = \{w_1, w_2, \cdots, w_n\}$, and we denote the transformed features as $T = \{t_1, \cdots, t_n\}$, with $t_i$ corresponding to the transformed feature of $w_i$. The word representations $T$ are calculated by DeBERTa:
$$T = \{t_1, \cdots, t_n\} = \mathrm{DeBERTa}(W), \quad (1)$$
where $t_i \in \mathbb{R}^{d_w}$ is the last hidden state of the corresponding token in DeBERTa, and $d_w$ is the dimension of the word embedding. Specifically, we feed the claim text and the document text into DeBERTa separately to obtain the corresponding features, i.e., $T_{ct} = \mathrm{DeBERTa}(W_{ct})$ and $T_{dt} = \mathrm{DeBERTa}(W_{dt})$, where the output dimension of DeBERTa is 768. Then we use an embedding layer to transform the pre-trained embeddings into embeddings for our task. Specifically, the output of the embedding layer is calculated as follows:
$$E_{ct} = \mathrm{Emb}(T_{ct}), \quad E_{dt} = \mathrm{Emb}(T_{dt}). \quad (2)$$</p>
        <p>Here $\mathrm{Emb}$ is composed of a fully-connected layer and an activation function, and $E_{ct}$, $E_{dt}$ are $d$-dimensional vectors. It is noted that the activation functions we use in $\mathrm{Emb}$ are ReLU and Mish [29], whose results are compared in our experiments.</p>
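        <p>As an illustration, the following is a minimal PyTorch/Hugging Face sketch of this text encoding path under the settings of Section 4.1 (frozen backbone, $d = 512$); the module and variable names are ours, not a released implementation.</p>
        <preformat>
# Illustrative sketch of the text encoding path (Eqs. 1-2): a frozen DeBERTa
# backbone followed by the Emb layer (Linear + ReLU or Mish). Names are ours.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextEncoder(nn.Module):
    def __init__(self, d=512, act="relu"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
        self.backbone = AutoModel.from_pretrained("microsoft/deberta-base")
        for p in self.backbone.parameters():   # frozen, as in Section 4.1
            p.requires_grad = False
        activation = nn.ReLU() if act == "relu" else nn.Mish()
        self.emb = nn.Sequential(nn.Linear(768, d), activation)   # Emb(.)

    def forward(self, sentences):
        tok = self.tokenizer(sentences, padding=True, truncation=True,
                             max_length=512, return_tensors="pt")
        hidden = self.backbone(**tok).last_hidden_state   # (batch, n, 768), Eq. (1)
        return self.emb(hidden)                            # (batch, n, d),   Eq. (2)
        </preformat>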
        <p>Image Encoding Network: For each input image, we use the pre-trained DeiT model to extract token features. The output is a set of token features $V = \{v_1, \cdots, v_m\}$, where $m$ denotes the number of image tokens. The parameters of the pre-trained DeiT are frozen, which means we do not update them during training. In other words, given the image $I$, the feature extraction can be expressed as:</p>
        <p>$$V = \{v_1, \cdots, v_m\} = \mathrm{DeiT}(I), \quad (3)$$
where $v_j \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of the image embedding. Specifically, we feed the claim image and the document image into DeiT separately and get the corresponding features, i.e., $V_{ci} = \mathrm{DeiT}(I_{ci})$ and $V_{di} = \mathrm{DeiT}(I_{di})$, where the output dimension of DeiT is 768. Then we use the embedding layer to transform the pre-trained embeddings into embeddings for our task. Specifically, the output of the embedding layer is calculated as follows:
$$E_{ci} = \mathrm{Emb}(V_{ci}), \quad E_{di} = \mathrm{Emb}(V_{di}), \quad (4)$$
where the $\mathrm{Emb}$ module is the same as in Equation (2), and $E_{ci}$, $E_{di}$ are $d$-dimensional vectors.</p>
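        <p>A corresponding sketch of the image encoding path is shown below; the preprocessing follows Section 4.1, while the normalization statistics and class names are our assumptions.</p>
        <preformat>
# Illustrative sketch of the image encoding path (Eqs. 3-4): a frozen DeiT
# backbone produces token features, then the Emb layer projects them to d dims.
import torch.nn as nn
from torchvision import transforms
from transformers import AutoModel

IMAGE_TRANSFORM = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

class ImageEncoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("facebook/deit-base-patch16-224")
        for p in self.backbone.parameters():   # frozen during training
            p.requires_grad = False
        self.emb = nn.Sequential(nn.Linear(768, d), nn.ReLU())   # Emb(.)

    def forward(self, pixel_values):   # (batch, 3, 224, 224)
        tokens = self.backbone(pixel_values=pixel_values).last_hidden_state  # Eq. (3)
        return self.emb(tokens)        # (batch, m, d), Eq. (4)
        </preformat>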
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-Modality Fusion</title>
        <p>
          The co-attention block has been widely used in VQA tasks [30], as it can capture dependencies
between different inputs. Thus, after generating embeddings of text and images, we adopt multiple
co-attention layers, as in [
          <xref ref-type="bibr" rid="ref20">20, 31</xref>
          ], to fuse the embeddings and exploit the intra-/inter-modality
relations for the detection of fake news.
        </p>
        <p>First, we employ a co-attention layer to separately fuse 1) claim images and document images and 2) claim text and document text (fusing features from the same modality). Then
we learn the inter-modal alignment by fusing features from different modalities (images and
text from the claim/document). Besides, the relation between the text and image of a claim
or document can be viewed as checking whether they are related. Therefore, we also
adopt the co-attention layer for fusing 3) images and text of claims and 4) images and text of
documents (fusing features from different modalities).</p>
        <p>Therefore, we use the co-attention layer for fusion. Specifically, each co-attention layer takes two inputs $X_A$ and $X_B$ and produces two outputs $O_A$ and $O_B$. We first project $X_A$/$X_B$ into query $Q \in \mathbb{R}^{l \times d}$, key $K \in \mathbb{R}^{l \times d}$ and value $V \in \mathbb{R}^{l \times d}$ matrices:
$$Q_A = X_A W_A^Q, \quad K_A = X_A W_A^K, \quad V_A = X_A W_A^V, \quad (5)$$
$$Q_B = X_B W_B^Q, \quad K_B = X_B W_B^K, \quad V_B = X_B W_B^V, \quad (6)$$
where $W_A^Q, W_A^K, W_A^V, W_B^Q, W_B^K, W_B^V \in \mathbb{R}^{d \times d}$.</p>
        <p>
          We then employ the attention mechanism together with residual connections to provide additional capacity for more complex reasoning in our aggregation functions. The specific expressions are:
$$\tilde{Z}_A = \mathrm{LN}\Big(X_A + \mathrm{softmax}\Big(\frac{Q_A K_B^{\top}}{\sqrt{d}}\Big) V_B\Big), \quad \tilde{Z}_B = \mathrm{LN}\Big(X_B + \mathrm{softmax}\Big(\frac{Q_B K_A^{\top}}{\sqrt{d}}\Big) V_A\Big), \quad (7)$$
$$O_A = \mathrm{LN}(\tilde{Z}_A + \mathrm{FFN}(\tilde{Z}_A)), \quad O_B = \mathrm{LN}(\tilde{Z}_B + \mathrm{FFN}(\tilde{Z}_B)), \quad (8)$$
where $\mathrm{LN}$ is Layer Normalization and $\mathrm{FFN}$ is the same feed-forward network as in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Now we can use the co-attention layer to fuse features from the same modality (or from different modalities):
$$(O_{ci}, O_{di}) = \mathrm{CoAtt}(E_{ci}, E_{di}), \quad (O_{ct}, O_{dt}) = \mathrm{CoAtt}(E_{ct}, E_{dt}),$$
$$(O_{ci,ct}, O_{ct,ci}) = \mathrm{CoAtt}(E_{ci}, E_{ct}), \quad (O_{di,dt}, O_{dt,di}) = \mathrm{CoAtt}(E_{di}, E_{dt}), \quad (9)$$
where $\mathrm{CoAtt}$ denotes the co-attention layer.
        </p>
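        <p>The following single-head PyTorch sketch illustrates one co-attention layer as described by Equations (5)-(8); the paper itself uses 4 attention heads, and the class and variable names here are ours.</p>
        <preformat>
# Illustrative single-head sketch of one co-attention layer (Eqs. 5-8).
import math
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, d=512, ffn_dim=1024):
        super().__init__()
        self.proj_a = nn.ModuleDict({k: nn.Linear(d, d) for k in ("q", "k", "v")})
        self.proj_b = nn.ModuleDict({k: nn.Linear(d, d) for k in ("q", "k", "v")})
        self.ln1_a, self.ln1_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ln2_a, self.ln2_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_a = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))
        self.ffn_b = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))
        self.scale = math.sqrt(d)

    def _attend(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        return torch.matmul(scores.softmax(dim=-1), v)

    def forward(self, x_a, x_b):   # (batch, len_a, d), (batch, len_b, d)
        qa, ka, va = (self.proj_a[k](x_a) for k in ("q", "k", "v"))   # Eq. (5)
        qb, kb, vb = (self.proj_b[k](x_b) for k in ("q", "k", "v"))   # Eq. (6)
        z_a = self.ln1_a(x_a + self._attend(qa, kb, vb))              # Eq. (7)
        z_b = self.ln1_b(x_b + self._attend(qb, ka, va))
        o_a = self.ln2_a(z_a + self.ffn_a(z_a))                       # Eq. (8)
        o_b = self.ln2_b(z_b + self.ffn_b(z_b))
        return o_a, o_b
        </preformat>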
        <p>
          Afterwards, the aggregation function is adopted to aggregate the fused tokens into a representative token. That is, given a fused embedding $O = \{o_1, \cdots, o_l\} \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length, we perform the self-attention mechanism [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] over the fused tokens, which adopts the average feature $\bar{o} = \frac{1}{l} \sum_{j=1}^{l} o_j$ as the query and aggregates all the tokens to obtain a representative token. Besides, we also feed $E_{ci}$, $E_{ct}$, $E_{di}$, $E_{dt}$ into the aggregation function for classification.</p>
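        <p>A minimal sketch of this aggregation step is given below; the exact implementation details (e.g., the use of nn.MultiheadAttention and 4 heads) are our assumptions.</p>
        <preformat>
# Illustrative sketch of the aggregation function: the mean of the fused tokens is
# used as the query of one attention step over all tokens, giving a single
# representative vector per sequence.
import torch.nn as nn

class AttentionAggregator(nn.Module):
    def __init__(self, d=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)

    def forward(self, tokens):                        # (batch, l, d)
        query = tokens.mean(dim=1, keepdim=True)      # average feature as the query
        pooled, _ = self.attn(query, tokens, tokens)  # attend over all fused tokens
        return pooled.squeeze(1)                      # (batch, d) representative token
        </preformat>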
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Category Classifier</title>
        <p>As the features fused by the co-attention layers can represent the complex relationship between the claim and the document, we first concatenate the 8 aggregated outputs $O_{ci}$, $O_{di}$, $O_{ci,ct}$, $O_{ct,ci}$, $O_{ct}$, $O_{dt}$, $O_{di,dt}$, $O_{dt,di}$ from the co-attention layers: $F_{co} = (O_{ci} : O_{di} : O_{ci,ct} : O_{ct,ci} : O_{ct} : O_{dt} : O_{di,dt} : O_{dt,di})$. It is worth noting that we also use the aggregated embeddings, since the original information can provide some clues for classifying the news; thus we concatenate the 4 aggregated embeddings $F_{emb} = (E_{ci} : E_{ct} : E_{di} : E_{dt})$. Then we concatenate these two features, $F = (F_{co} : F_{emb})$, and feed $F$ to the subsequent category classification network to predict the label of the given claim and document. Afterwards, the output of the classifier is the probability computed as follows:
$$H^{(1)} = \mathrm{act}(F W^{(0)}), \quad (10)$$
$$H^{(2)} = \mathrm{act}(H^{(1)} W^{(1)}), \quad (11)$$
$$\hat{y} = \mathrm{softmax}(H^{(2)} W^{(2)}), \quad (12)$$
where $W^{(0)} \in \mathbb{R}^{12d \times d}$, $W^{(1)} \in \mathbb{R}^{d \times d_1}$, and $W^{(2)} \in \mathbb{R}^{d_1 \times 5}$. Note that $\mathrm{act}$ is the same activation as in $\mathrm{Emb}$, for which both ReLU and Mish are tested.</p>
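        <p>As a concrete illustration of Equations (10)-(12), the sketch below builds the classifier as a small MLP over the 12 concatenated vectors; layer names and the fixed ReLU choice are ours.</p>
        <preformat>
# Illustrative sketch of the category classifier (Eqs. 10-12).
import torch
import torch.nn as nn

class CategoryClassifier(nn.Module):
    def __init__(self, d=512, d1=1024, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(12 * d, d), nn.ReLU(),   # W(0): 12d x d
            nn.Linear(d, d1), nn.ReLU(),       # W(1): d x d1
            nn.Linear(d1, num_classes),        # W(2): d1 x 5
        )

    def forward(self, aggregated_vectors):     # list of 12 tensors, each (batch, d)
        f = torch.cat(aggregated_vectors, dim=-1)
        return self.net(f).softmax(dim=-1)     # predicted class probabilities
        </preformat>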
        <p>In the end, we minimize the cross-entropy loss $\mathcal{L}$ to verify a multimodal claim:
$$\mathcal{L} = -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} y_i \log(\hat{y}_i). \quad (13)$$</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Ensemble Method</title>
        <p>Each classifier may have its own strengths and weaknesses, and ensemble methods have been widely
used to enhance performance. If a model achieves a higher score on the validation set, we
naturally want it to have a larger weight in the final integrated model; thus we use different
weights to integrate the models. The formula is as follows:
$$P = w_1 \times P_1 + w_2 \times P_2 + \cdots + w_k \times P_k, \quad (14)$$
where $P_1, \cdots, P_k$ are the predicted probabilities from the corresponding models, $w_1, \cdots, w_k$ are
the weights of the corresponding models, and $k$ is the number of trained models. It is noted
that the weight parameters are tuned by hand.</p>
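        <p>The weighted ensemble of Equation (14) amounts to the following short sketch; the function name is ours and the example weights are the ones reported in Section 4.1.</p>
        <preformat>
# Illustrative sketch of the weighted ensemble (Eq. 14): class probabilities from
# several trained models are combined with hand-tuned weights before the argmax.
import torch

def ensemble_predict(prob_list, weights):
    """prob_list: list of (batch, 5) probability tensors; weights: list of floats."""
    combined = sum(w * p for w, p in zip(weights, prob_list))
    return combined.argmax(dim=-1)

# e.g. preds = ensemble_predict([p1, p2, p3, p4, p5], [0.7, 0.5, 0.6, 0.7, 0.6])
        </preformat>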
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Implementation</title>
        <p>
          Dataset. Factify [
          <xref ref-type="bibr" rid="ref12">12, 32</xref>
          ] is a dataset for multi-modal fact verification, which contains claim images,
claim texts, reference textual documents and document images. Each data sample contains a reliable
source of information, called a "document", and another source whose validity must be assessed,
called a "claim". Both the document and the claim have a corresponding image. Each
data sample belongs to one of five categories: Support_Text, Support_Multimodal,
Insufficient_Text, Insufficient_Multimodal and Refute. The labels are defined as:
• Support_Multimodal: both the claim text and image are similar to those of the document.
• Support_Text: the claim text is similar or entailed, but the images of the document and claim
are not similar.
• Insufficient_Multimodal: the claim text is neither supported nor refuted by the document,
but the images are similar to those of the document.
• Insufficient_Text: both the text and images of the claim are neither supported nor refuted
by the document, although the claim text may share common words with the
document text.
• Refute: the images and/or text of the claim and document are completely contradictory,
i.e., the claim is false/fake.
        </p>
        <p>
          The training set contains 35,000 samples with 7,000 samples per class, and the validation set
includes 7,500 samples with 1,500 samples per class. The test set, which is used to evaluate the
private score, also contains 7,500 samples. For more details, we refer readers to [
          <xref ref-type="bibr" rid="ref12">12, 33</xref>
          ].
Implementation Details. The dimension $d$ was set to 512, the hidden dimension of the fully-connected
layer was set to 1024, the output dimension of DeBERTa and DeiT was 768, and the number
of attention heads was set to 4. The dropout rate was 0.1, and the max sequence length was 512. The
batch size was 64, the learning rate was set to 2e-5, the number of training epochs was 30,
and the random seed was set to 24. The weight coefficients of the different models were set to
0.7, 0.5, 0.6, 0.7 and 0.6, manually tuned according to the validation score. The pre-trained DeBERTa
was deberta-base (https://huggingface.co/microsoft/deberta-base), and the DeiT was deit-base-patch16-224
(https://huggingface.co/facebook/deit-base-patch16-224). The parameters of the two
pre-trained models were frozen, i.e., we did not update them during training.
All images were transformed by resizing to 256, center cropping to 224, and
normalizing. This image transformation was the only preprocessing step, after which the text and
processed images were stored in pickle files for training and evaluation. All experiments were
conducted on an Nvidia RTX A6000 GPU.
        </p>
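        <p>A minimal sketch of this offline preprocessing and caching step is shown below; file names, dictionary keys and normalization statistics are illustrative assumptions rather than the released pipeline.</p>
        <preformat>
# Illustrative sketch of the offline preprocessing: each image is resized,
# center-cropped and normalized once, then cached with the raw text in a pickle.
import pickle
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

def preprocess(samples, out_path="train.pkl"):
    cache = []
    for s in samples:   # each s: dict with claim/document texts and image paths
        cache.append({
            "claim_text": s["claim_text"],
            "doc_text": s["doc_text"],
            "claim_image": transform(Image.open(s["claim_image_path"]).convert("RGB")),
            "doc_image": transform(Image.open(s["doc_image_path"]).convert("RGB")),
            "label": s["label"],
        })
    with open(out_path, "wb") as f:
        pickle.dump(cache, f)
        </preformat>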
        <p>Evaluation Metric. The weighted average F1 score across 5 categories is adopted to evaluate
the performance.</p>
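        <p>For reference, the metric can be computed as in the following one-line sketch using scikit-learn.</p>
        <preformat>
# Illustrative computation of the evaluation metric: weighted average F1 over
# the five classes.
from sklearn.metrics import f1_score

def weighted_f1(y_true, y_pred):
    return f1_score(y_true, y_pred, average="weighted")
        </preformat>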
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Testing Performance</title>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>To study the impact of each module, we carried out a series of ablation studies to verify the
effectiveness of the designed modules. As shown in Table 3, applying co-attention only within the
same modality (w/o CoAtt(A, B)) is insufficient, which demonstrates the need for modeling
dependencies between different modalities. In addition, if co-attention is only applied across
different modalities (w/o CoAtt(A, A)), the model is not able to distinguish the difference
between the claim and the document, which also affects performance. Finally, if the
co-attention module is removed completely (w/o CoAtt), the performance drops drastically, which
justifies the use of co-attention on both the same modality and different modalities.</p>
        <p>We also explored the effectiveness of the self-attention module. If it is replaced by a simple
mean operation, a large performance drop can be observed (see Table 4), which proves that
the model can focus on important sequences through the self-attention module. Meanwhile,
it is evident that without concatenating $F_{emb}$ into the final embedding $F$, the performance
obviously degrades.</p>
        <p>It is noted that our ensemble method slightly improves the performance compared to
PreCoFact. Our ensemble includes MAFN (model1 in Table 2), MAFN with DeBERTa replaced by
XLM-RoBERTa (model2 in Table 2), MAFN with DeBERTa replaced by RoBERTa (model3 in Table 2),
MAFN with DeBERTa replaced by RoBERTa and ReLU replaced by Mish (model4 in Table 2), and
MAFN with ReLU replaced by Mish (model5 in Table 2). We report the performance of each model
in Table 2, and we ensemble the models using Equation (14).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Visualization</title>
        <p>Figure 3 visualizes some verification examples by our model and the baseline. It can be observed
that our model is superior to the baseline on multimodal fact verification. On the left side of
Figure 3, we can intuitively see that the content of the two pictures is similar; for the text,
the claim and document differ in length and sentence structure, but the semantics are the same.
Our model classifies this case correctly, demonstrating
that it can learn high-level semantic connections between claim and document texts,
which we attribute to the use of the co-attention module. The example on the right also shows
that our model can understand high-level semantic information.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed a multimodal fact verification method called MAFN, which utilizes
pre-trained models and multiple co-attention networks to alleviate the effect of fake news.
To further improve the performance, we adopted an ensemble method that weights several
different pre-trained models. The ablation studies demonstrate the effectiveness of our proposed
approach, and the test scores also illustrate the effectiveness of our model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhenwei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yadan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Heng</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Point to rectangle matching for image text retrieval</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , p.
          <fpage>4977</fpage>
          -
          <lpage>4986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jingjing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiangbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Discovering attractive segments in the user-generated video streams</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <fpage>102</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jingjing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Universal adversarial perturbations generative network</article-title>
          ,
          <source>World Wide Web</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>1725</fpage>
          -
          <lpage>1746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poblete</surname>
          </string-name>
          , Information credibility on twitter,
          <source>in: Proceedings of the 20th international conference on World wide web</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Prominent features of rumor propagation in online social media</article-title>
          ,
          <source>in: 2013 IEEE 13th international conference on data mining, IEEE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1103</fpage>
          -
          <lpage>1108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nourbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Real-time rumor debunking on twitter</article-title>
          ,
          <source>in: Proceedings of the 24th ACM international on conference on information and knowledge management</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1867</fpage>
          -
          <lpage>1870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , W. Gao,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-F. Wong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Cha</surname>
          </string-name>
          ,
          <article-title>Detecting rumors from microblogs with recurrent neural networks (</article-title>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Universal weighting metric learning for crossmodal retrieval</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>44</volume>
          (
          <year>2022</year>
          )
          <fpage>6534</fpage>
          -
          <lpage>6545</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2021</year>
          .
          <volume>3088863</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Universal weighting metric learning for cross-modal matching</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>13005</fpage>
          -
          <lpage>13014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-F. Wong</surname>
          </string-name>
          ,
          <article-title>Detect rumors on twitter by promoting information campaigns with generative adversarial learning</article-title>
          ,
          <source>in: The world wide Web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3049</fpage>
          -
          <lpage>3055</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          , et al.,
          <article-title>A convolutional approach for misinformation identification</article-title>
          .,
          <source>in: IJCAI</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3901</fpage>
          -
          <lpage>3907</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R. Anku</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chinnakotla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Factify 2: A multimodal fake news and satire news dataset</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-modal entailment for fact verification</article-title>
          , in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, ceur,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Deberta: decoding-enhanced bert with disentangled attention</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
            <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Luo,
          <article-title>Multimodal fusion with recurrent neural networks for rumor detection on microblogs</article-title>
          ,
          <source>in: Proceedings of the 25th ACM international conference on Multimedia</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>795</fpage>
          -
          <lpage>816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Khattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Goud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          , Mvae:
          <article-title>Multimodal variational autoencoder for fake news detection</article-title>
          ,
          <source>in: The world wide web conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2915</fpage>
          -
          <lpage>2921</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kabra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumaraguru</surname>
          </string-name>
          , Spotfake+:
          <article-title>A multimodal framework for fake news detection via transfer learning (student abstract)</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>13915</fpage>
          -
          <lpage>13916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Fake news detection via knowledge-driven multimodal graph convolutional networks</article-title>
          ,
          <source>in: Proceedings of the 2020 International Conference on Multimedia Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>540</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion with co-attention networks for fake news detection, in: Findings of the Association for Computational Linguistics</article-title>
          , volume ACL/IJCNLP 2021 of Findings of ACL,
          <year>2021</year>
          , pp.
          <fpage>2560</fpage>
          -
          <lpage>2569</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhudinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2048</fpage>
          -
          <lpage>2057</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Bengio,</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          ,
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          , W. Liu, T.-S. Chua,
          <article-title>Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Zhang,</surname>
          </string-name>
          <article-title>Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection</article-title>
          ,
          <source>in: Pacific-Asia conference on knowledge discovery and data mining</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W.-C. Peng,
          <article-title>Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification</article-title>
          ,
          <source>arXiv preprint arXiv:2201.11664</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29] D. Misra, Mish: A self regularized non-monotonic neural activation function, CoRR abs/1908.08681 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] P. Gao, Z. Jiang, H. You, P. Lu, S. C. H. Hoi, X. Wang, H. Li, Dynamic fusion with intra- and inter-modality attention flow for visual question answering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6639-6648.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31] N. Wang, Z. Wang, X. Xu, F. Shen, Y. Yang, H. T. Shen, Attention-based relation reasoning network for video-text retrieval, in: 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6. doi:10.1109/ICME51207.2021.9428215.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset (2022).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33] S. Suryavardan, S. Mishra, M. Chakraborty, P. Patwa, A. Rani, A. Chadha, A. Reganti, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Findings of factify 2: multimodal fake news detection, in: Proceedings of De-Factify 2: Second Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>