<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Fact Verification Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shreyash Mishra</string-name>
          <email>shreyash.m19@iiits.in</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S Suryavardan</string-name>
          <email>suryavardan.s19@iiits.in</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amrit Bhaskar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parul Chopra</string-name>
          <email>parulcho@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aishwarya Reganti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parth Patwa</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amitava Das</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanmoy Chakraborty</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Sheth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asif Ekbal</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chaitanya Ahuja</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Institute, University of South Carolina</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Arizona State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>IIIT Delhi</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>IIIT Sri City</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>IIT Patna</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>University of California Los Angeles</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>Wipro AI labs</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Combating fake news is one of the most pressing societal challenges. It is difficult to expose false claims before they create a lot of damage. Automatic fact/claim verification has recently become a topic of interest among diverse research communities. Forums such as FEVER and FNC [1, 2] foster work on automatic fact-checking of text. Research efforts and datasets on textual fact verification exist, but there is much less attention on multi-modal or cross-modal fact verification. In order to draw the attention of the research community towards understanding multimodal misinformation, we release a multimodal fact checking dataset named FACTIFY. It is notably the largest multimodal fact verification public dataset, consisting of 50K data points covering news from India and the US. FACTIFY contains images, textual claims, and reference textual documents and images, labeled with three broad categories, namely support, no-evidence, and refute.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, automatic fact checking has emerged as an important problem in the AI
community, as the dangers of fraudulent claims masquerading as declarations of reality have
become common. Although the birth of this problem goes back to the early years of the printing
press, it has attracted increasing interest with the rise of social media. The rapid distribution
of news across numerous media sources has resulted in the fast spread of erroneous
and fake content. It is tough to uncover misleading statements before they cause significant
harm. According to statistics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], about 67% of the American population believes that fake news
produces a lot of uncertainty, and 10% of them knowingly propagate fake news. On the contrary,
only 26% of respondents said they feel confident in their ability to recognize bogus news.
      </p>
      <p>
        The scarcity of available training data has been a fundamental obstacle in automated
fact-checking. Recently, significant progress has been made with the release of two of the largest
datasets, FEVER [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and LIAR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], among several others. LIAR contains 12.8K claims along
with their metadata (i.e., the speaker of the claim, the speaker's political affiliations, and the medium
through which the claim was first published) collected from real fact-checking websites such as
PolitiFact. Substantial advances have been made since the release of LIAR. A significantly
larger dataset, FEVER, includes evidence and extensive metadata to contextualize the claims
even more. FEVER consists of 185K claims manually curated from Wikipedia.
Although FEVER is a large dataset, it was purpose-made for research, which limits its ability
to capture patterns in real-world data. We release a multimodal fact checking dataset,
called FACTIFY, which helps resolve this problem, as it consists of original samples
with no post-processing or manual data creation involved. Additionally, the visual cues that
accompany textual claims help a system detect fake content with greater confidence.
The dataset is released at https://competitions.codalab.org/competitions/35153 and the baselines
are available at https://github.com/Shreyashm16/Factify.
      </p>
      <p>
        Although there are research initiatives [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] and datasets [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ] on textual fact verification,
there is less focus on multi-modal or cross-modal fact verification. The majority of present
fact-checking research relies on unimodal techniques, synthetic data production, and limited
annotated datasets. We therefore believe that FACTIFY can serve as a stepping stone towards
novel multimodal fact verification systems. The dataset contains images, textual claims, and reference
textual documents/images. The task is to assign support, no-evidence, or refute labels to given
claims; each of these categories is explained in the next section. The first two categories are
further sub-divided into text and multimodal components; thus, in total, each data sample
is labeled with one of five classes. We chose Twitter handles of popular news channels
from two large nations, the US and India. The dataset therefore entirely consists of real
samples gathered from different social media news handles popular in India and the US.
      </p>
      <p>To summarize, in this paper, we release a novel multimodal fact-checking dataset that can
be used as a benchmark for researchers. We also propose unimodal and multimodal baseline
models for our dataset. The paper is organised as follows: The proposed task is described in
Section 2. Related work is described in Section 3. Data collection and data distribution are
explained in Section 4 while Section 5 demonstrates the baseline model. Section 6 shows the
results of our baseline models. Finally, we summarise our task along with the further scope and
open-ended pointers in Section 7.
</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. The Factify Task</title>
      <p>To detect multimodal fake news, we model the task as multimodal entailment. We assume
that each data point contains a reliable source of information, called “document”, and its
associated image and another source whose validity must be assessed, called the “claim” which
also contains a respective image. The goal is to identify if the claim entails the document.
Since we are interested in a multimodal scenario with both image and text, entailment has two
verticals, namely textual entailment and visual entailment and their respective combinations.
This data format is a stepping stone for the fact checking problem where we have one reliable
source of news and want to identify the fake/real claims given a large set of multimodal claims.
Therefore, the task essentially is: given a textual claim, claim image, document and document
image, the system has to classify the data sample into one of five categories: Support_Text,
Support_Multimodal, Insufficient_Text, Insufficient_Multimodal and Refute. The images are also
accompanied by text obtained by running OCR. The descriptions of the labels are as follows:
• Support_Text: the claim text is similar to or entailed by the document text, but the images of the document and claim
are not similar.
• Support_Multimodal: both the claim text and image are similar to those of the document.
• Insufficient_Text: both the text and images of the claim are neither supported nor refuted
by the document, although the claim text may have words in common with the
document text.
• Insufficient_Multimodal: the claim text is neither supported nor refuted by the
document, but the images are similar to those of the document.
• Refute: the images and/or text from the claim and document are completely
contradictory, i.e., the claim is false/fake.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <p>Over the last few years, various fact checking and fact verification datasets have been published.
The majority of them are text-based and only a few are multi-modal. The textual
datasets can broadly be grouped into two categories based on the information they provide.</p>
      <p>
        The first category includes datasets that aim to predict the veracity based on the claim alone.
LIAR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] contains 12.8k manually labeled claims from PolitiFact with 6 fine-grained labels and
metadata such as speaker name. CREDBANK [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] focuses on checking credibility by providing
tweets related to 1k events with manual credibility annotation. The Lie Detector dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
approaches the task with 600 'true' and 'deceptive' text samples. Another such dataset
uses Claim Matching [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and has 2k pairs of multi-lingual text with labels based on text pair
similarity. A dataset on Covid-19 fake news is provided by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The second category includes datasets where the claim is accompanied with documents
annotated with labels indicating whether the document supports the claim or is unrelated to
it. A very well known dataset of this type is FEVER [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It contains 185k samples with a claim
and a supporting document from Wikipedia, but, these claims were manually generated and
then altered before being classified as ’ Support’, ’Refute’ or ’NotEnoughInfo’. MultiFC [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a
multi-domain dataset of size 35k with claims and rich metadata from 26 different websites. It has
a wide range of labels preserved from these websites such as ’correct’, ’incorrect’, ’mis-attributed’
and ’not the whole story’.
      </p>
      <p>
        Textual datasets are no longer enough in the social media age. It is important to consider
both the image and text when detecting fake news. Fakeddit [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is a multi-modal dataset
providing an image associated with a text. The image can be used as evidence for the text or
vice-versa. Each of its 1 million samples has both high-level and fine-grained labels. It is similar
to an image-caption dataset, which could result in a disjoint claim and image. FakeNewsNet
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] contains 23k articles with context and spatio-temporal information focused on fake news
source and mitigation. The data and their labels have been obtained from fact checking websites
such as Politifact and GossipCop. A dataset of fact-checked images shared on WhatsApp
during the 2018 Brazilian and 2019 Indian Elections [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] provides two sets of 135 and 897
images containing misinformation from Brazil and India, respectively. These fact-checked fake
images from WhatsApp are supported by data from fact checking websites and manual expert
annotations. Table 1 summarizes datasets and their statistics. To the best of our knowledge,
Factify is the largest real-world multimodal fake news detection dataset. The dataset has five
categories based on the entailment of the text and image pairs. It supports the automation of
fact checking using an entailment approach.
      </p>
      <p>Table 1: summary of fact-checking datasets (number of claims, number of labels, and data modality) for LIAR [4], CREDBANK [8], The Lie Detector [9], Claim matching beyond English [10], FEVER [1], MultiFC [12], Fakeddit [13], the Covid-19 Fake News dataset [11], FakeNewsNet [14], the WhatsApp fact-checking dataset [15], and Factify (ours).</p>
    </sec>
    <sec id="sec-3">
      <title>4. Data</title>
      <sec id="sec-3-1">
        <title>4.1. Data Collection</title>
        <p>We collected date-wise tweets from Twitter handles of Indian and US news sources: Hindustan
Times (https://twitter.com/htTweets) and ANI (https://twitter.com/ANI) for India, and ABC (https://twitter.com/ABC)
and CNN (https://twitter.com/CNN) for the US, chosen based on accessibility, popularity and posts per
day. Moreover, these Twitter handles are known for their objective and impartial approach.
From each tweet, we extracted the tweet text and the tweet image(s). Then, for each tweet, we
did the following:
• For each tweet of account A, we retrieved similar tweets from account B. Similarity is measured
on the basis of text: text similarity is measured using Sentence BERT first, and then the
extent of common words is measured as a second metric.
• Next, the image similarity for the corresponding images of the tweet pair was
calculated. Image similarity is measured using histogram similarity and cosine similarity on a
pre-trained ResNet50 model.
• According to the scores for each of these measures, the tweet pair is classified into
4 categories: Support_Multimodal, Support_Text, Insufficient_Multimodal and
Insufficient_Text. The various thresholds used for classification are listed in
Figure 3.
• From this tweet pair, we selected a tweet (say tweet B) and obtained the URL of the
corresponding article published on the source's website from the tweet text. We then
replaced the tweet text with the article contents after scraping it (the document in the dataset). We
do this to mimic the real-world fact checking process, i.e., manually comparing claims
with documents or articles.</p>
        <p>• The image OCR text was obtained using the Google Cloud Vision API (https://cloud.google.com/vision/docs/ocr).</p>
        <p>The final description of each attribute in the dataset is as follows:
• Claim: tweet A text
• Claim_image: tweet A image
• Claim_ocr: tweet A image OCR
• Document: tweet B article text
• Document_image: tweet B image
• Document_ocr: tweet B image OCR
• Category
Figure 2 explains the five classes in our dataset.</p>
        <p>For appropriate classification of the dataset, two similarity measures were computed.
Sentence Comparison: We use 2 methods to check similarity between sentences:
• Sentence BERT: Sentence BERT [16] is a modification of the BERT model that uses siamese
and triplet network structures to obtain sentence embeddings. These sentence embeddings
can be compared with each other to get a similarity score. We use
cosine similarity as the textual similarity metric. We use Sentence BERT (SBERT) over
the pre-trained BERT and RoBERTa models mainly because it is much faster without
compromising accuracy. For our application, we used the
'paraphrase-MiniLM-L6-v2' pre-trained model (https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2).
For each text pair, we derive the corresponding embeddings
using the SBERT model and compute their cosine similarity. We manually decide on a
threshold value T1 for cosine similarity and classify the text pair accordingly. If the cosine
similarity score is greater than T1, the pair is classified into the support category. On the
other hand, if the cosine similarity score is lower than T1, the news may or may not be
the same (the evidence at hand is insufficient to judge whether the news is the same or not).</p>
        <p>Hence it is sent for another check before being classified into the Insufficient category.</p>
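        <p>The T1 gate described above can be sketched as follows. This is a minimal illustration: the vectors are toy stand-ins for SBERT sentence embeddings (the real pipeline derives them from the 'paraphrase-MiniLM-L6-v2' model), and the threshold value is an assumed number, since the paper's manually chosen T1 is not stated.</p>

```python
import math

# Toy sketch of the T1 cosine-similarity gate. The vectors are hypothetical
# stand-ins for SBERT sentence embeddings; T1 is an assumed value.
T1 = 0.7

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def first_gate(emb_claim, emb_doc):
    """Classify as support above T1, else send the pair to the common-words check."""
    if cosine_similarity(emb_claim, emb_doc) > T1:
        return "support"
    return "needs_common_word_check"

print(first_gate([1.0, 0.0, 1.0], [1.0, 0.1, 0.9]))  # prints support
```

        <p>Pairs that fall below T1 are not classified immediately; they flow into the NLTK common-words check described next.</p>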
        <p>• NLTK: If the cosine similarity of the sentence pair is below T1, we use the NLTK library
[17] to check for common words between the two sentences. Only if the common words score
is above a different manually decided threshold T2 is the news pair classified
into the insufficient category. Common words are checked to ensure that the
classification task is challenging. To check for common words, both texts in the pair
are preprocessed, including stemming and stopword removal. The processed
texts are then checked for common and similar words, and the corresponding scores
are determined. If the common words score is greater than T2, the pair is classified as
Insufficient; otherwise the pair is dropped.</p>
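        <p>A minimal sketch of this second check is given below. The tiny stopword list and crude suffix stemmer are simplified stand-ins for NLTK's stopword corpus and stemmers, the overlap score is one plausible formulation (the paper does not give its exact formula), and T2 is an assumed value.</p>

```python
# Simplified sketch of the common-words gate. STOPWORDS and stem() are crude
# stand-ins for NLTK components; T2 and the Jaccard-style score are assumptions.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}
T2 = 0.4

def stem(word):
    # Very crude suffix stripping, standing in for an NLTK stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return {stem(w) for w in tokens}

def common_words_score(text_a, text_b):
    a, b = preprocess(text_a), preprocess(text_b)
    return len(a.intersection(b)) / max(len(a.union(b)), 1)

def second_gate(text_a, text_b):
    """Keep the pair as Insufficient only if it shares enough vocabulary."""
    if common_words_score(text_a, text_b) > T2:
        return "insufficient"
    return "dropped"
```

        <p>For example, a pair like "the minister is visiting delhi" / "minister visiting delhi today" survives the gate, while two texts with no shared content words are dropped, keeping the Insufficient classes non-trivial.</p>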
        <p>Image Comparison: We use two metrics for determining whether images are similar:
• Histogram Similarity: the images are converted to normalized histograms and
similarity is measured using the correlation metric.
• Cosine Similarity: the images are converted to feature vectors using a pre-trained ResNet50
model, and these feature vectors are used to calculate the cosine similarity score. Manually
decided thresholds, as described in Figure 3, are used to judge whether the text and image
pairs are similar.</p>
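        <p>The histogram-correlation metric can be sketched as below. Real images would be read with an imaging library; here two tiny grayscale "images" are given directly as pixel lists, and similarity is the Pearson correlation of their normalized intensity histograms, which matches the correlation metric named above.</p>

```python
# Sketch of histogram-based image similarity on toy grayscale pixel lists.
def histogram(pixels, bins=4, max_val=256):
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_val] += 1
    total = float(len(pixels))
    return [c / total for c in counts]  # normalized histogram

def correlation(h1, h2):
    # Pearson correlation between the two normalized histograms.
    n = len(h1)
    mean1, mean2 = sum(h1) / n, sum(h2) / n
    num = sum((a - mean1) * (b - mean2) for a, b in zip(h1, h2))
    den1 = sum((a - mean1) ** 2 for a in h1) ** 0.5
    den2 = sum((b - mean2) ** 2 for b in h2) ** 0.5
    return num / (den1 * den2)

img_a = [10, 20, 200, 210, 30, 220]
img_b = [12, 25, 205, 215, 28, 222]
print(correlation(histogram(img_a), histogram(img_b)))  # prints 1.0
```

        <p>Two images with the same intensity distribution score 1.0 even if individual pixels differ slightly, which is why the pipeline pairs this metric with ResNet50 feature similarity.</p>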
        <p>The text pairs are first classified into either the Support or Insufficient category, and then further
sub-classified into Support_Text/Support_Multimodal or Insufficient_Text/
Insufficient_Multimodal based on the similarity of the image pairs. If the corresponding images for
the texts are similar, they can be used to judge whether the news is the same. The
category where both the images and the texts are similar is called Support_Multimodal. The
category where the images are similar but the texts are not is called Insufficient_Multimodal.
If the corresponding images for the texts are not similar, they cannot be used to judge
whether the news is the same. The category where neither the images nor the texts are
similar is called Insufficient_Text. The category where the texts are similar but the images are
not is called Support_Text.</p>
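        <p>The sub-classification logic above reduces to a small decision rule over the two similarity scores. The threshold values below are hypothetical stand-ins for the manually chosen thresholds listed in Figure 3; the Refute class is handled separately via scraped fact-check articles.</p>

```python
# Decision rule mapping a (text, image) similarity pair to one of the four
# constructed categories. T_TEXT and T_IMAGE are assumed threshold values.
T_TEXT = 0.75
T_IMAGE = 0.80

def label_pair(text_sim, image_sim):
    text_similar = text_sim >= T_TEXT
    image_similar = image_sim >= T_IMAGE
    if text_similar and image_similar:
        return "Support_Multimodal"
    if text_similar:
        return "Support_Text"   # texts match, images do not
    if image_similar:
        return "Insufficient_Multimodal"  # images match, texts do not
    return "Insufficient_Text"  # neither matches

print(label_pair(0.9, 0.1))  # prints Support_Text
```
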
        <p>For the refute category, we scrape several reliable fact-check websites, namely Vishvas News, Times of
India Fact Check, India Today Fact Check, AFP India, AFP USA, AltNews, BOOM, Factly, NewsChecker,
NewsMobile and WebQoof (URLs listed below). For each article on these websites, we collect the claim (the sentence
that states the fake news), the document (the text that proves the claim is false), the claim image (the fake news
image, possibly a screenshot of the fake post), and the document image (an image that is proof of the fakeness of the claim).
The dataset is released at https://competitions.codalab.org/competitions/35153.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Data Statistics And Analysis</title>
        <p>In order to understand its nature and distribution, we provide a preliminary analysis of the
Factify dataset. The dataset has a total of 50000 samples, with an equal number of samples in each of the 5
categories. The dataset has a Train-Val-Test split of 70:15:15, as shown in Table 2.</p>
        <p>To identify and predict the veracity of the claim, a common method is to collate a given
claim and the corresponding news article or document. We analyze the word occurrence and
distribution of the claims in Figure 4. We can observe that most fake news is related to politics
and religion.</p>
        <p>7https://www.vishvasnews.com
8https://timesofindia.indiatimes.com/times-fact-check
9https://www.indiatoday.in/fact-check
10https://factcheck.afp.com/afp-india
11https://factcheck.afp.com/afp-usa
12https://www.altnews.in/
13https://www.boomlive.in/fact-check
14https://factly.in/category/english/
15https://newschecker.in/
16https://newsmobile.in/
17https://www.thequint.com/news/webqoof</p>
        <p>Table 2: category-wise split of the dataset. Each of the five categories (Support_Multimodal, Support_Text, Insufficient_Multimodal, Insufficient_Text and Refute) has 7000 train, 1500 validation and 1500 test samples, i.e., 10000 per category; in total there are 35000 train, 7500 validation and 7500 test samples, 50000 overall.</p>
        <p>The claims in the dataset are mostly associated with politics and governance. Claims from
both the USA and India mention political parties and leaders, as shown by the top 20 most
frequent entities listed in Table 3. The data also captures other past or present affairs such as
"Covid-19".</p>
        <p>We show the number of unique n-grams for the Factify dataset in Table 4. This shows the
lexical diversity of the dataset.</p>
        <p>[Figure panels: (a) Support, (b) Insufficient, (c) Refute]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Baseline model</title>
      <p>We explore 2 different settings to establish baselines, i.e., text-only &amp; multimodal. The goal
is to measure the difference between using only the prime modality, text, and then
augmenting it with image information to gauge the performance boost.</p>
      <p>Text Only Model: This model (shown in Figure 5) ignores the information given by the image.
Instead of focusing on the multimodal aspect of the data, it focuses only on the textual
aspect. To do so, the model creates sentence embeddings of the claim and document attributes using
a pretrained Sentence BERT model [16], ’paraphrase-MiniLM-L6-v2’. Then, cosine similarity
is measured on the embeddings. This score is used as the only feature for the dataset, and
classification is performed using traditional machine learning classifiers like Support Vector
Machine and Decision Tree.</p>
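      <p>A minimal sketch of this single-feature setup is shown below, with a hand-rolled decision stump standing in for the scikit-learn classifiers and a binary simplification of the label set; the similarity scores are hypothetical, not drawn from the dataset.</p>

```python
# Stand-in for the text-only baseline: a one-feature decision stump trained
# on the claim-document cosine similarity. The paper uses standard classifiers
# (SVM, Decision Tree) on this same single feature; scores here are synthetic.
def fit_stump(scores, labels):
    """Pick the similarity threshold that maximizes training accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(scores)):
        preds = ["support" if s >= t else "insufficient" for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hypothetical cosine-similarity scores for claim-document pairs.
scores = [0.91, 0.85, 0.40, 0.35, 0.88, 0.20]
labels = ["support", "support", "insufficient", "insufficient", "support", "insufficient"]
print(fit_stump(scores, labels))  # prints 0.85
```

      <p>With only one scalar feature, any classifier effectively learns one or more such cut points, which is why this baseline is deliberately weak.</p>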
      <p>Multi-Modal Model: Information shared online is very often of multi-modal nature. Images
can change the context of a claim and lead to misinformation. Thus, it is important that we
consider both the image and text when classifying the claims. As it is an entailment based
approach, features from both claim and document image-text pairs must be extracted. This
is done using the pre-trained ResNet50 model [18]. The cosine similarity score is computed
between the claim and document image features. The cosine similarity for the text
embeddings is computed in the same way as in the textual baseline model. The model diagram is shown in
Figure 6. The final classification F1 scores are shown in Table 5 below for different classifiers
trained on these two scores as attributes. There is an improvement in performance compared
to the text-only model. The baselines are available at https://github.com/Shreyashm16/Factify.</p>
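      <p>The multimodal baseline thus reduces each sample to two numbers, the text cosine similarity and the image cosine similarity, before classification. The sketch below illustrates this with a toy nearest-centroid classifier in place of the scikit-learn models; the centroid values are hypothetical and the Refute class is omitted for brevity.</p>

```python
# Each sample becomes a two-feature vector: [text_similarity, image_similarity].
def make_features(text_sim, image_sim):
    return [text_sim, image_sim]

# Hypothetical class centroids in the (text_sim, image_sim) plane; a toy
# nearest-centroid rule stands in for the trained classifiers of Table 5.
CENTROIDS = {
    "Support_Multimodal": [0.9, 0.9],
    "Support_Text": [0.9, 0.2],
    "Insufficient_Multimodal": [0.3, 0.9],
    "Insufficient_Text": [0.3, 0.2],
}

def classify(features):
    def dist(c):
        return sum((f - x) ** 2 for f, x in zip(features, c))
    return min(CENTROIDS, key=lambda k: dist(CENTROIDS[k]))

print(classify(make_features(0.85, 0.15)))  # prints Support_Text
```
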
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <p>The results obtained for each of the settings described above are presented in Table 5. We experiment
with various classification models for both the text and multimodal settings. In the text-only
setting, our best performing model, a decision tree, achieves an F1 score of 41.3% on the
test set, while the multimodal setting achieves a best performance of 53.09%. Note that
there is an improvement of more than 9% when image features are used, which suggests
that task performance relies heavily on multi-modal information. However, we use quite
naive approaches to establish baselines, to encourage more innovative approaches, and there
is huge scope for improvement. The results also indicate that off-the-shelf models do not
perform very well on the task, since the best performing model achieves only 53.09%. More
comprehensive approaches, such as using vision-language pre-trained models, training on other
related datasets/tasks and fine-tuning on Factify, and innovative attention and fusion techniques,
should boost performance. We leave such methods as future work.</p>
      <p>Table 5: F1 scores of Logistic Regression, KNN, SVM, Decision Tree and Random Forest classifiers in the text-only and multimodal settings.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion and Future Work</title>
      <p>In this work, we take a leap towards developing machine learning techniques for multimodal
fact verification by releasing a large real-world dataset with cues from two modalities, namely
text and image. We also release unimodal and multimodal baselines to emphasize the
difficulty of the problem and the scope for improvement. However, our work only scratches the
surface, and many follow-up research directions can be pursued. In the current dataset, we
assume that claims have a binary class, i.e., either fake or true, but there can be cases where
a claim is partially true or fake. We aim to incorporate such classes in our future work.
We also envision understanding deeper relationships between text and image with the help of
attention methods in the future.
</p>
      <p>elections, 2020. arXiv:2005.02443.
[16] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[17] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.
[18] P. Kasnesis, R. Heartfield, X. Liang, et al., Transformer-based identification of stochastic information cascades in social networks using text and image similarity, Applied Soft Computing, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>Fever: a large-scale dataset for fact extraction and verification</article-title>
          , arXiv preprint arXiv:1803.05355 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>PVS</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Caspelherr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          , C. M. Meyer, I. Gurevych,
          <article-title>A retrospective analysis of the fake news challenge stance-detection task</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics</source>
          , Association for Computational Linguistics, Santa Fe, New Mexico, USA,
          <year>2018</year>
          , pp.
          <fpage>1859</fpage>
          -
          <lpage>1874</lpage>
          . URL: https://aclanthology.org/C18-1158.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <article-title>Fake news in the U.S. - statistics &amp; facts</article-title>
          ,
          <source>Statista</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>"Liar, liar pants on fire": A new benchmark dataset for fake news detection</article-title>
          ,
          <source>arXiv preprint arXiv:1705.00648</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sorokin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>UKP-Athene: Multi-sentence textual entailment for claim verification</article-title>
          ,
          <source>arXiv preprint arXiv:1809.01479</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fine-grained fact verification with kernel graph attention network</article-title>
          ,
          <source>arXiv preprint arXiv:1910.09796</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Guptha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>PYKL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation (CONSTRAINT)</source>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <article-title>CredBank: A large-scale social media corpus with associated credibility annotations</article-title>
          ,
          <source>in: ICWSM</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Strapparava</surname>
          </string-name>
          ,
          <article-title>The lie detector: Explorations in the automatic recognition of deceptive language</article-title>
          ,
          <source>in: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09</source>
          ,
          Association for Computational Linguistics, USA,
          <year>2009</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Garimella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gafney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <article-title>Claim matching beyond English to scale global fact-checking</article-title>
          ,
          <year>2021</year>
          . arXiv:2106.00853.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pykl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Guptha</surname>
          </string-name>
          , G. Kumari,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Fighting an infodemic: COVID-19 fake news dataset</article-title>
          ,
          <source>in: Combating Online Hostile Posts in Regional Languages during Emergency Situation (CONSTRAINT) 2021</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          . URL: http://dx.doi.org/10.1007/978-3-030-73696-5_3. doi:10.1007/978-3-030-73696-5_3.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Chaves</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Grue</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims</article-title>
          ,
          <source>in: EMNLP</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>r/Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection</article-title>
          ,
          <source>arXiv preprint arXiv:1911.03854</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahudeswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>FakeNewsNet: A data repository with news content, social context and spatiotemporal information for studying fake news on social media</article-title>
          ,
          <year>2019</year>
          . arXiv:1809.01286.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. C. S.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>de Freitas Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Garimella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eckles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benevenuto</surname>
          </string-name>
          ,
          <article-title>A dataset of fact-checked images shared on WhatsApp during the Brazilian and Indian elections</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>