=Paper=
{{Paper
|id=Vol-3199/paper6
|storemode=property
|title=Logically at Factify 2022: Multimodal Fact Verification
|pdfUrl=https://ceur-ws.org/Vol-3199/paper6.pdf
|volume=Vol-3199
|authors=Jie Gao,Hella-Franziska Hoffmann,Stylianos Oikonomou,David Kiskovski,Anil Bandhakavi
|dblpUrl=https://dblp.org/rec/conf/aaai/GaoHOKB22
}}
==Logically at Factify 2022: Multimodal Fact Verification==
Logically at Factify 2022: Multimodal Fact Verification

Jie Gao, Hella-Franziska Hoffmann, Stylianos Oikonomou, David Kiskovski and Anil Bandhakavi

Brookfoot Mills, Brookfoot Industrial Estate, Brighouse, HD6 2RW, United Kingdom

Abstract
This paper describes our participant system for the multimodal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and in large pre-trained multimodal models across vision and language, very limited work has been done on applying multimodal techniques to automate fact-checking processes, particularly considering the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored, including an ensemble model (combining two uni-modal models) and a multimodal attention network (modeling the interaction between the image and text pairs from claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first on the leaderboard and obtains a weighted average F-measure of 0.77 on both the validation and the test set. Exploratory analysis is also carried out on the Factify data set and uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypotheses. Finally, we highlight challenges of the task and the multimodal dataset for future research.

Keywords: fact verification, multimodal representation learning, multimodal entailment, text entailment, attention mechanism

De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
jie@logically.ai (J. Gao); hella.h@logically.ai (H. Hoffmann); stylianos@logically.ai (S. Oikonomou); david.k@logically.ai (D. Kiskovski); anil@logically.ai (A. Bandhakavi)
https://www.logically.ai/team/leadership/anil-bandhakavi (A. Bandhakavi)
ORCID: 0000-0002-3610-8748 (J. Gao)

1. Introduction

The rapidly growing volume of misinformation and fake news has become a pressing challenge and causes severe consequences for society. Significant joint efforts have been undertaken by a wide range of parties (represented by journalists, researchers and independent fact checkers) to protect communities from false information. It has never been more important to have a versatile ecosystem that scales up and speeds up fact checking against misinformation using technology, which can be broadly categorised into claim detection and claim validation [1]. The former supports fact checkers in content prioritisation by assessing check-worthiness, while the latter automates the process of evidence retrieval from large knowledge bases and the veracity prediction of detected claims in order to assist manual fact-checking tasks. Claim matching [2] is another emerging trend, addressing the need for timely identification of previously fact-checked claims. Prior efforts focused mostly on text from news media articles and on the English language.
In recent years, with the growth of user-generated content and increasingly polarised social platforms, the challenges of fact checking have become increasingly multilingual and multimodal, as misinformation is now pervasive in user-generated multimedia content [3]. As a consequence, many new problems arise, typically false context, false connections or misleading content [4, 5, 6, 7, 8]. Another understudied process, known as amplification, is leveraged by coordinated disinformation campaigns [2]: it deliberately spreads large volumes of repeated claims in many different ways in order to stimulate unintentional spread as false rumors [8]. Thus, there is an imperative need to develop algorithms that group the same claims residing in various multimodal contexts and automate the verification process at scale.

Compared to text-based fact checking, multimodal verification is an under-explored area of research. Image and text both contain rich information but reside in heterogeneous modalities. Compared to representation learning within a single modality, cross-modal architectures need to not only learn features for image and text that express their respective content but, importantly, also capture a measure of cross-modal semantic integrity [8, 9]. We study multimodal entailment in this paper. As a newly introduced subtask, it poses additional critical challenges. Simple image similarity cannot resolve fine-grained image differences and performs poorly for adversarial images [10], SAR images [11], etc. To exemplify this challenge, two pairs of claim and document from the insufficient multimodal samples in the Factify dataset [12] are presented in Table 1. The first sample shows two separate images of a politician taken from a direct point of view, sitting at the exact same table, in the exact same room, giving a televised speech on different days about different issues. In both images the politician is wearing a suit, in one image black and in the other white. In this case, the images are likely to yield high similarity with respect to their content, but they should be considered different images representative of different contextual information. The second sample presents two images of the same nature, where the politician is wearing the same white suit and ear plugs, but with a news broadcasting logo overlaid on the upper right corner of the claim image, while the document image has no news channel logo visible. The main discrepancy lies between the text and image of the document, which reports that the politician was wearing a white mask during the video conference. Therefore, although the document text provides supporting evidence for the claim, the image is missing important context information. On the contrary, the sample in Table 2 presents two images with low content overlap, but the document image corresponds to its textual content, which supports the information about the politician's death presented in the claim image. Thus, the document image should be considered a supporting image that is contextually representative of the same information as the corresponding claim image.

Relying on visual similarity analysis alone for multimodal fact verification is naturally prone to false positives, because images related to branding and advertisements (e.g., a "breaking news" image or a company's logo) are often reused. This may cause erroneous detection when there is no real connection between two items other than the reuse of a generic image. The problem becomes more complex with images exploited in disinformation on social media.
Table 1: Insufficient Multimodal examples in the Factify dataset. Claim image+text (left), Document image+text (right).
Claim: "In the demise of Union Minister Ram Vilas Paswan, President Ram Nath Kovind said on Wednesday, the nation has lost a visionary leader. He was among the most active and longest-serving members of parliament..." | Document: "... Addressing the fourth annual convocation of the Jawaharlal Nehru University, he said Indian scholars of today ..."
Claim: "Prime Minister Narendra Modi holds a meeting via video-conferencing with the Chief Ministers over #COVID19..." | Document: "... Prime Minister Narendra Modi on Saturday held a video conference with ... showed Modi wearing a white mask during the interaction ..."

Table 2: Support Multimodal example in the Factify dataset. Claim image+text (left), Document image+text (right).
Claim: "She was appointed to the Supreme Court by Bill Clinton in 1993. Remembering Supreme Court Justice Ruth Bader ... has died at the age of 87. ... lost a cherished colleague," Chief Justice John Roberts said ..." | Document: "Here's a look back at the life and legacy of Ruth Bader Ginsburg, the second woman to serve on the US Supreme Court, in photos. Ginsburg died Friday due to ..."

Our work in this competition responds to this multimodal online misinformation issue and focuses on solving the above challenges. Two different algorithms are designed for the task, which is framed as a multimodal entailment prediction problem, following two different frameworks: an ensemble learning approach and an end-to-end attention network. The ensemble model approach is implemented with a decision tree classifier that combines the predictions of two uni-modal models with a few data-specific heuristic features. The two uni-modal models are a 3-way text entailment model based on State-of-the-Art (SoTA) pre-trained transformer language architectures fine-tuned on the task dataset, and a pre-trained CNN model (ResNet-50) for image similarity. A SoTA multimodal attention network for 5-way end-to-end entailment classification is implemented as an alternative solution, in an attempt to infer the combined entailment relation from joint representations of language and vision. Global-level multimodal interactions are modeled with a popular multi-branch attention network framework in order to fuse multimodal information. Strong baselines are implemented for both the 3-way and 5-way entailment models to demonstrate the advantage of our proposed methods. Exploratory data analysis and bias-test experiments are conducted to understand potential data issues and present the challenge of creating high-quality multimodal datasets for this real-world problem. The best results from the ensemble model were submitted for the competition.

In the remainder of this paper, we first present a brief overview of related work (Section 2), then the task definition and our proposed methods in detail (Section 3), followed by experiments on the task dataset (Section 5). Exploratory data analysis is elaborated in Section 4. Finally, the results discussion (including 3-way and 5-way models) and the conclusion are provided in Sections 6 and 7 respectively.

2. Related Work

Text Entailment. Recognising Textual Entailment (RTE) is the earliest and most closely related line of work to the Factify challenge; it aims to determine an inferential relationship between a natural language hypothesis and premise. On the basis of a given sentence pair, the task is to predict 3-way labels: Support, Refute or NotEnoughInfo.
Well-known shared tasks include FEVER [13] and SCIVER [14], which have advanced RTE research for claim validation in recent years. This line of work performs some form of evidence retrieval and then applies claim validation based on that evidence. In contrast, evidence retrieval is not required in the Factify task (although the practice of sentence retrieval [15, 16], as a classic NLI technique for the long document text in the Factify data, could be considered good practice and is applicable). Stance detection is another direction of work, supported by shared tasks such as UKP Snopes [17] and SemEval-2017 RumourEval [18]; it has also been exploited for RTE by retrieving texts relevant to a claim or story and then determining the stance of those texts, so as to ultimately predict the veracity of a given claim. The common practice of RTE for claim verification [1] is also incorporated in our ensemble model (as one of the proposed solutions) and treated as a three-way text classification task on the text data. Sentence retrieval for evidence aggregation and stance detection are not exploited in this work.

Multi/cross-modal representation learning. In the field of multimodal reasoning and matching, the success of the attention mechanism in the NLP community motivated computer vision techniques to shift from traditional twin networks (typically Siamese nets [19, 20, 21]) to models pre-trained in multimodal settings for a wide range of downstream tasks, such as visual question answering (VQA), visual reasoning and image captioning. Similar to BERT [22], one recent approach is to use a single transformer architecture to jointly encode text and image, such as VisualBERT [23], UNITER [24] and VL-BERT [25]. Alternatively, ViLBERT [26] and LXMERT [27] introduced a two-stream architecture, where two transformers are applied to images and text independently and are fused by a third transformer at a later stage. These models typically rely on region-based image features extracted by pre-trained object detectors, based on commonly used two-stage detectors (typically the Faster R-CNN model [28] or its extension Mask R-CNN [29]), single-stage detectors (typically SSD and YOLOv3 [30]) or anchor-free detectors (e.g., [31]). Another direction is patch embeddings [32, 33, 34, 35, 36]. This line of work operates directly on image patches (as a sequence of tokens of fixed length); image patches and text token embeddings are fed into a transformer or self-attention model to learn fused cross-modal attention. The great progress of these recently developed models can be witnessed on the leaderboards of various tasks, without using ensembling, such as VQA, GQA [37] and NLVR2 [38], which can mainly be attributed to the availability of large-scale, weakly correlated multimodal data (typically captioned images or video clips and accompanying subtitles [39]) that can be utilised to learn cross-modal representations via contrastive learning [40]. However, existing pre-trained models mostly use scene-limited image-text pairs with short and relatively simple descriptive captions for images, while ignoring richer uni-modal text data and domain-specific information. This leads to difficulties in comprehending long paragraphs as opposed to short text [41]. Thus, most such tasks (e.g., VQA, VAC, image retrieval) still need an additional fusion layer to model the interaction between visual and linguistic content. Moreover, limited ground truth information forces many tasks to use evaluation metrics based on binary relevance.
Different from most current cross-modal reasoning tasks, our work aims to model long text sequences and images for claim verification. In contrast to these multimodal architectures, we utilize the individual components of uni-modal pre-trained architectures. An equivalent architecture is employed by [42] for image-text pair interaction; however, we exploit richer cross-modal interactions among the vision and text pairs. Inspired by the practice in [43], a stacked attention mechanism is exploited in our solution for cross-modal matching, by inferring the latent language-vision alignments at a global level. Recent advances in fine-grained cross-modal representation learning for region-word correspondence are not exploited in this work.

Relevance matching technique. Relevance matching (RM) is the core problem of information retrieval (IR) and has also been applied to detecting the entailment relation [44, 45] by computing the best alignment of a hypothesis to a premise based on local and global interactions. Vo and Lee [46] exploited a neural ranking model using textual and visual modalities to match a multimodal claim with fact-checked information. Their model unifies the textual and visual interaction between a claim and a collection of candidate articles, while the Factify task aims to match a claim with one given candidate document. In our proposed solution, we extend the matching module introduced in Vo and Lee [46] in order to better handle text of varying length.

Visual Entailment. Visual Entailment (VE) [47] is a variant of the traditional RTE task that consists of image-sentence pairs, whereby the premise is defined by an image rather than a natural language sentence. The problem that VE tries to solve is to reason about the relationship between an image as premise $P_{image}$ and a text as hypothesis $H_{text}$. This is different from the Factify task, which aims to reason about the multimodal relationship between a hypothesis and premise pair of both textual and visual content with respect to five categories. Moreover, the premise text is of varying length, rather than the short sentences in the SNLI-VE dataset.

3. Methodology

3.1. Problem Statement

We frame the Factify task as a problem of multimodal entailment, which is to reason about the relationship between a multimodal claim as hypothesis and a multimodal document as premise. Specifically, given a multimodal hypothesis (e.g., a tweet) denoted by $Q = q_{image} + q_{text}$ and a document (typically one or more fact-checking articles) denoted by $D = d_{image} + d_{text}$, both of which contain one image and a text, we aim to derive a function $f(Q, D)$ that infers their entailment relation over five categories ("Support_Multimodal", "Support_Text", "Insufficient_Multimodal", "Insufficient_Text", "Refute").
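To make the task formulation concrete, the following is a minimal sketch of how a claim-document pair and the five-way label space could be represented; the class and field names are illustrative assumptions and not part of the authors' system.

```python
from dataclasses import dataclass

# The five entailment categories defined by the Factify task.
LABELS = (
    "Support_Multimodal",
    "Support_Text",
    "Insufficient_Multimodal",
    "Insufficient_Text",
    "Refute",
)

@dataclass
class MultimodalPair:
    """One hypothesis/premise pair: a multimodal claim Q and document D."""
    q_text: str      # claim text
    q_image: str     # path or URL of the claim image
    d_text: str      # document (premise) text
    d_image: str     # path or URL of the document image
    label: str = ""  # one of LABELS for training data, empty at inference time

def is_valid(sample: MultimodalPair) -> bool:
    """Basic sanity check that a training sample carries a known label."""
    return sample.label in LABELS
```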
The label assignment is based on the relationship conveyed by $(D, Q)$:

• Support_Multimodal holds if there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is true, and $d_{image}$ is relevant to $q_{image}$ and $q_{text}$ in the same information context;
• Support_Text holds if there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is true, but $d_{image}$ is irrelevant to $q_{image}$ and does not provide supplemental information for $q_{text}$;
• Insufficient_Multimodal holds if the evidence in $d_{text}$ is insufficient to draw a conclusion about $q_{text}$, but $d_{image}$ is relevant to $q_{image}$ and $q_{text}$ in the same information context;
• Insufficient_Text holds if the evidence in $d_{text}$ is insufficient to draw a conclusion about $q_{text}$, and $d_{image}$ is irrelevant to $q_{image}$ and does not provide supplemental information for $q_{text}$;
• otherwise, the relationship is Refute, implying that there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is false and $d_{image}$ is irrelevant to $Q$ in both its visual and textual content.

Additional details of the task definition can be found in [12].

3.2. 3-way Text Entailment

Recognizing entailment in natural language is a straightforward application for fact verification. In this section, we study how well a SoTA textual entailment model can be fine-tuned on the textual data pairs in the Factify data set and then used for a three-way RTE task. This also allows us to assess and benchmark our proposed solution of combining the two uni-modal models' predictions in an ensemble model for the final 5-way multimodal entailment prediction.

Pretrained transformer fine-tuning: Pretrained transformer models [22, 48, 49] have become the de facto models for a wide range of NLP tasks and provide SoTA results for RTE tasks [50]. More specifically, in this work we investigate how a pretrained model can learn to conduct RTE on the given dataset without exploiting hidden dataset bias, and how efficiently it can learn and generalise to the test set. The problem differs from existing benchmark datasets (MultiNLI [51], SNLI [52], Adversarial-NLI [53]), which mostly consist of short sentences: the fact verification task requires applying natural language inference (NLI) to long paragraphs or articles. As mentioned above, to simplify the problem, the practice of evidence sentence selection [54] that is commonly adopted in SoTA evidence-aware fact-checking systems is not included in our study. Thus, the supported maximum sequence length and the optimum document context size are two of the key factors to be considered. Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP, but one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. For that reason, Google's BigBird model is selected in this study; it is one of the most successful long-sequence transformers and supports sequence lengths of up to 4,096 tokens. To deal with the limitations that other models face, BigBird uses a sparse attention mechanism that reduces the quadratic dependency to linear [55]. That means it can handle sequences up to 8x longer than what was previously possible using similar hardware. As a consequence of the capability to handle longer context, such models drastically improve performance on various NLP tasks, such as claim verification for long sequences [56]. The model is fine-tuned as a pair-wise classification task on re-purposed data samples converted from the 5-way categories to three-way categories.
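A minimal sketch of this pair-wise fine-tuning set-up with the Hugging Face transformers library is shown below. The checkpoint name, label mapping and hyperparameters mirror the settings reported later in Section 5.3, but the surrounding data handling is an illustrative assumption rather than the authors' exact implementation.

```python
from transformers import (BigBirdForSequenceClassification, BigBirdTokenizerFast,
                          Trainer, TrainingArguments)

# 3-way label mapping obtained by collapsing the five Factify categories.
LABEL2ID = {"support": 0, "insufficient": 1, "refute": 2}

tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=len(LABEL2ID))

def encode(claim_text: str, doc_text: str):
    # Claim and document are joined with [SEP] by passing them as a text pair;
    # 1,396 tokens is the mean input length used in the paper's experiments.
    return tokenizer(claim_text, doc_text, truncation=True,
                     max_length=1396, padding="max_length")

args = TrainingArguments(
    output_dir="bigbird-factify-3way",
    num_train_epochs=2,                 # settings reported in Section 5.3
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    adam_epsilon=1e-8,                  # AdamW is the Trainer's default optimizer
)

# `train_dataset` / `eval_dataset` are assumed to yield the encoded pairs plus an
# integer `labels` field; their construction is omitted from this sketch.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```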
Formally, given a pair of text sequences (denoted $q_{text}$ and $d_{text}$) from $Q$ and $D$, we aim to fine-tune a pre-trained model that maps any pair $(Q, D)$ to a label $y$, which determines the pre-defined textual entailment relationship ("support", "refute", "insufficient") between $q_{text}$ and $d_{text}$. The problem is treated as a supervised learning task and a set of training examples of the form $(Q, D, y)$ is given. $[SEP]$ is added as a separator between the two inputs during pre-processing, and a softmax classifier is added on top of the $[CLS]$ token of the last layer to make predictions.

MatchPyramid: In contrast to computationally expensive transformer models, we also propose a simple baseline text entailment model based on a relevance matching technique. Intuitively, an article may be relevant to a claim if they have overlapping or similar words. A strong interaction model, known as MatchPyramid [57, 58, 59, 44], is adopted in our baseline model. This technique builds a similarity matrix from the pairwise similarities between two sequences and applies a CNN with pooling strategies to extract hierarchical interaction patterns. The CNN's strength in modeling spatial (position-aware) correlations is utilised to handle the varying lengths in the data. This deep neural network enables us to find matching patterns between a short claim text and a long document, which is critical for our task. Multiple layers of 2D convolutions and pooling are used, followed by a feed-forward network. [59] experimented with four similarity functions (indicator function, dot product, cosine and Gaussian kernel) and found that, when using embeddings, the Gaussian kernel similarity function is better than the others. A proper kernel size captures more information and generates a better result. Pooling is used to reduce the dimension of the feature maps and to pick out the most important information for the later layers. Especially in the ad-hoc retrieval task, documents often contain hundreds of words, most of which might be background words (exactly the same problem as in our task), so the pooling layers may be even more important for distilling useful information from the noisy background.

Inspired by [47, 46], $q_{text}$ and $d_{text}$ are pre-processed and embedded with a pre-trained word embedding model. The embeddings are used to initialise the network. A self-attention layer is applied to the embeddings of the two inputs, since the premise document ($D$) in Factify can be very long and complex. Intuitively, self-attention can help capture structural information and focus on important keywords, particularly for long-distance dependencies. Specifically, scaled dot product (SDP) attention [60] is used to capture this hidden information:

$\mathrm{Attn}_{sdp}(q) = \mathrm{softmax}\left(\frac{qK^{T}}{\sqrt{d_k}}\right)V, \quad Q_{text\_attn} = \mathrm{Attn}_{sdp}(Q_{text}), \quad D_{text\_attn} = \mathrm{Attn}_{sdp}(D_{text}),$

where $q$ represents $Q$ or $D$ (which, in self-attention, also provides the keys $K$ and values $V$), $d_k$ denotes the embedding dimension, $Q \in \mathbb{R}^{M \times d_k}$ is the claim text ($Q_{text}$) feature matrix and $D \in \mathbb{R}^{N \times d_k}$ is the document text ($D_{text}$) feature matrix. $M$ and $N$ are the sequence lengths of the matrices $Q$ and $D$, and $\mathrm{Attn}_{sdp}$ computes the resulting self-attention weights for $Q$ and $D$ respectively. Subsequently, the self-attended $Q_{text}$ and $D_{text}$ feature matrices are fed into a GRU layer to obtain contextual representations. Finally, a dot product is applied to build the similarity matrix between the two GRU output sequences for the MatchPyramid model, in an attempt to measure the semantic relevance between claim and document more accurately at a higher level of word semantics (a minimal sketch of this attention-and-matching step is given below).
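The sketch below illustrates the SDP self-attention and the dot-product interaction matrix in NumPy. It omits the learned GRU layer and uses the raw embeddings as queries, keys and values, so it is a simplification for illustration rather than the authors' implementation.

```python
import numpy as np

def sdp_self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention: softmax(x x^T / sqrt(d_k)) x.

    x: (seq_len, d_k) embedding matrix for the claim or document text.
    Queries, keys and values are all the input embeddings themselves here
    (no learned projections), which simplifies the layer used in the paper.
    """
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ x                                     # attended embeddings

def interaction_matrix(q_emb: np.ndarray, d_emb: np.ndarray) -> np.ndarray:
    """Dot-product similarity matrix fed to the MatchPyramid CNN layers."""
    q_att = sdp_self_attention(q_emb)   # (M, d_k)
    d_att = sdp_self_attention(d_emb)   # (N, d_k)
    # In the full model a GRU is applied to each attended sequence first;
    # that step is omitted in this sketch.
    return q_att @ d_att.T              # (M, N) interaction features

# Toy usage with random 50-d "GloVe-like" embeddings.
rng = np.random.default_rng(0)
sim = interaction_matrix(rng.normal(size=(10, 50)), rng.normal(size=(100, 50)))
print(sim.shape)  # (10, 100)
```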
The output of MatchPyramid is flattened into a 1D vector and fed into a fully connected multi-layer perceptron (MLP), followed by a softmax layer to perform the 3-way classification.

3.3. 5-way Ensemble Model

One way to utilize multiple models is to combine uni-modal model predictions in an ensemble classifier that predicts the final labels. As elaborated in Section 3.2, the 3-way textual entailment model is helpful for distinguishing the three-way entailment relationship based on linguistic and semantic clues between $q_{text}$ and $d_{text}$. To address multimodal entailment (as defined in Section 3.1), a simple relatedness measurement for the visual content of $q_{image}$ and $d_{image}$ is adopted in our approach, based on pairwise image similarity computed with a pre-trained CNN model (ResNet) as a proxy for visual entailment. As with text entailment, this is based on the hypothesis, supported by the salient correlation patterns observed in this dataset, that an article is relevant to a claim if the article contains images similar to the claim's images. Hence, an ensemble approach is proposed to combine textual entailment, the visual relatedness measurement and additional data-specific features. More specifically, the proposed ensemble model uses a basic decision tree classifier with the following feature encoding to provide end-to-end five-category classification, as depicted in Fig. 1:

• Length of text and OCR: four text length features of $(D, Q)$, representing the lengths of $q_{text}$, $OCR(q_{image})$, $d_{text}$ and $OCR(d_{image})$ respectively. OCR texts are measured and used as independent features here (cf. Section 4.4).
• Text entailment: two features consisting of a numeric representation (0, 1, 2) of the text entailment prediction (i.e., "insufficient", "support", "refute") along with the corresponding probability (Section 3.2).
• Image similarity: the pairwise cosine similarity score between $q_{image}$ and $d_{image}$, computed from the features obtained with a pre-trained ResNet-50 model.
• Image domain: two features encoded with a one-hot-encoding scheme on the source domain names of $q_{image}$ and $d_{image}$ (cf. Section 4.5).

Figure 1: Ensemble Model Architecture

3.4. 5-way Multimodal Entailment

We consider how to obtain attended multimodal information that can effectively capture the consistency and integrity of the multimedia content between $D$ and $Q$. Inspired by current advances in attention techniques [60, 61, 46, 47], we apply multiple attention mechanisms to learn the multimodal interaction between pairs of visual and textual content. Instead of local alignment approaches that model visual objects and textual words, we focus in this study on global alignment based methods that aim to map whole images and sentences into a joint semantic space. A popular framework for modeling the multimodal relationship is a multi-branch attention network, where typically one branch projects the image and another models the text; similarity is measured by a dot product of the normalised feature vectors. The extended MatchPyramid (elaborated in Section 3.2) is applied to model the high-level relevance between the text pair. In general, our end-to-end multimodal entailment architecture consists of an embedding layer, a text matching layer, a multimodal matching layer and a classification layer. Formally, as input representations, the image pair ($q_{image}$ and $d_{image}$) is represented by the top layer (before softmax) of a pre-trained convolutional network, and the text pair ($q_{text}$ and $d_{text}$) is mapped into vectors $t \in \mathbb{R}^{j}$ by a fixed word embedding layer initialised with GloVe embeddings.
$j$ denotes the word embedding dimension (e.g., $j$ is set to 50 for GloVe-50). There is no restriction on the choice of the image encoder, but the pre-trained ResNet-50 model is used in our experiments because of its simplicity. In the embedding layer, let $l$ be the dimension of an image visual vector (i.e., $l = 2048$ for ResNet-50), and let $m$ and $n$ be the number of words in $q_{text}$ and $d_{text}$ respectively. Let $q_i \in \mathbb{R}^{l}$ and $q_t \in \mathbb{R}^{j \times m}$ be the claim image embedding vector and word embedding matrix, respectively. Likewise, let $d_i \in \mathbb{R}^{l}$ and $d_t \in \mathbb{R}^{j \times n}$ be the document image embedding vector and word embedding matrix, respectively. Each 2048-dimensional feature vector ($q_i$ and $d_i$) is fed into a (non-trainable) linear layer that reduces the visual features from 2048 dimensions to a 512-dimensional vector space in this work. For the embeddings of the text pair, self-attention (SDP) is applied (as specified in Section 3.2) to both $q_t$ and $d_t$ before feeding them into a separate GRU layer to obtain their context sequence representations ($q_{t\_cxt} \in \mathbb{R}^{j \times o}$ and $d_{t\_cxt} \in \mathbb{R}^{j \times o}$) and the corresponding global representations (i.e., final states), denoted by $q_{t\_g} \in \mathbb{R}^{o}$ and $d_{t\_g} \in \mathbb{R}^{o}$, where $o$ denotes the GRU output dimension.

In the subsequent text matching layer, the same pipeline as specified above for the extended MatchPyramid is applied, in an attempt to model the high-level relevance of the article content ($d_{text}$) to the claim text ($q_{text}$) based on contextual word embedding interactions. The interaction feature matrix is calculated as the matrix dot product between $q_{t\_cxt}$ and $d_{t\_cxt}$, to which the deep hierarchical convolution layers of the MatchPyramid model are applied in order to extract an aggregated similarity feature vector $Z_{q\_d\_text} \in \mathbb{R}^{f}$, where $f$ is the output dimension of the flattened feature maps. The high-level matching patterns are then fed into a multi-layer perceptron (MLP) with dropout to produce the final matching score with learnable weights.

Multimodal latent interaction features are derived in the multimodal matching layer, which mainly consists of a visual matching layer and a cross-modal attention layer. Fundamentally, the multimodal matching layer aims to find the potential relevance of the document visual vector ($d_i$) to the claim vectors of visual or text context or both ($q_i$ and $d_{t\_g}$), which is hence critical for predicting the multimodal entailment relation for the target claim. As for the text embeddings, our visual matching layer applies self-attention (SDP) to the image pair embeddings ($q_i$ and $d_i$) in an attempt to capture the important features in each image. Then, a visual similarity feature is computed by applying a dot product to the two saliency-guided image embeddings with $L_2$ normalisation:

$V^{i}_{q,d} = \frac{\vec{q}_i \cdot \vec{d}_i}{\lVert \vec{q}_i \rVert \, \lVert \vec{d}_i \rVert}.$

This practice follows [62, 63, 47] and is a simple yet very efficient SoTA technique for handling variations in image illumination, viewpoint, texture and season. It also links to feature whitening, linear discriminant analysis and image saliency. In addition, a separate image Euclidean similarity feature

$E^{i}_{q,d} = \frac{1}{1 + \lVert q_i - d_i \rVert}$

is computed in an attempt to account for potential adversarial images, the same people in different scenes, cropped images, etc. [64, 65]. (A minimal sketch of these two similarity features is given below.)
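The two image-level similarity features can be sketched as follows. The code assumes the 2048-dimensional ResNet-50 embeddings are already available as NumPy arrays and is illustrative rather than the authors' implementation.

```python
import numpy as np

def visual_similarity(q_i: np.ndarray, d_i: np.ndarray) -> float:
    """V_{q,d}: dot product of the L2-normalised claim/document image embeddings."""
    return float(np.dot(q_i, d_i) / (np.linalg.norm(q_i) * np.linalg.norm(d_i)))

def euclidean_similarity(q_i: np.ndarray, d_i: np.ndarray) -> float:
    """E_{q,d} = 1 / (1 + ||q_i - d_i||), an inverse-distance similarity."""
    return float(1.0 / (1.0 + np.linalg.norm(q_i - d_i)))

# Toy usage with random stand-ins for ResNet-50 claim/document image embeddings.
rng = np.random.default_rng(0)
q_img_emb, d_img_emb = rng.normal(size=2048), rng.normal(size=2048)
print(visual_similarity(q_img_emb, d_img_emb),
      euclidean_similarity(q_img_emb, d_img_emb))
```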
In the cross-modal attention layer, multiple attention mechanisms are applied, aiming to learn to find and attend to the most relevant elements, in terms of both feature importance and relational mapping, among the self-attended visual and content vectors of claim and document. Specifically, the text-image interaction feature between the document image and the claim text is computed using the same SDP attention: $q^{i,t}_{cross\_attn} = \mathrm{Attn}_{sdp}(d_i, q_{t\_g})$. Likewise, the document image and document text interaction is computed via $d^{i,t}_{cross\_attn} = \mathrm{Attn}_{sdp}(d_i, d_{t\_g})$. Note that, due to the mismatched feature spaces of vision and text, a GRU is employed initially to align the image features with the text features in order to perform cross-modal learning. Finally, the resulting multimodal features $V_{multi} = V^{i}_{q,d} \oplus E^{i}_{q,d} \oplus q^{i,t}_{cross\_attn} \oplus d^{i,t}_{cross\_attn}$ are obtained by merging (with concatenation) the outputs of the two intermediate layers, before being fed into an MLP with dropout to generate fused higher-level features.

Finally, in the classification layer, the outputs of the two MLP layers are merged, along with the corresponding hypothesis multimodal representation ($q_{merge}$), into a combined representation with batch normalisation. $q_{merge}$ is the output of a separate MLP layer applied to the concatenation of $q_{t\_g}$ and the normalised $q_i$ ($q_{merge} \in \mathbb{R}^{o+l}$). The final representation is passed through dropout regularization before being fed to a softmax layer that outputs the probabilities of the five categories.

4. Factify Dataset

4.1. Data statistics

The five entailment categories are balanced in both the train and validation sets. As mentioned above, both the claim (hypothesis) and the document (premise) contain an image and a text of varying length. Optical character recognition (OCR) text extracted from the images is not counted separately here and is combined with the corresponding claim and document text respectively; the large claim word counts in the table are thus due to extracted OCR text. The data details for each set are shown in Table 3. More details about the dataset and the task can be found in [12].

Table 3: Data statistics
Data set sizes: Train pairs: 35,000; Validation pairs: 7,500; Test pairs: 7,500
Claim (Hypothesis) incl. OCR: min token count 1; max token count 19,105; mean token count 51.5
Document (Premise) incl. OCR: min token count 1; max token count 44,542; mean token count 1,010.5

4.2. Word Overlap distribution

Word overlap is an important indicator for modeling textual entailment, as well as of potential data bias. Naturally, when pairing claims with evidence sentences, the word overlap ratio will be higher on average for claims paired with their supporting evidence. However, models relying on word overlap perform poorly when dealing with the complexity of real-world examples (typically antonymous examples and adversarial attacks). In the VITAMINC [66] and FEVER [67] datasets, this bias is deliberately minimised in order to create challenging examples that require sentence-pair inference and cannot be solved by simple word matching techniques. Here, the word overlap distribution per class in the train and validation sets is presented in Table 4. The distribution indicates that the evidential premise data in Factify have a clearly higher word overlap ratio than the two insufficient-evidence categories.

Table 4: (Q, D) pair text word overlap distribution in train/val set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.0 | 1.0 | 0.299 | 0.273
Support_Text | 0.0 | 1.0 | 0.316 | 0.294
Insufficient_Multi. | 0.0 | 0.92 | 0.221 | 0.192
Insufficient_Text | 0.0 | 1.0 | 0.238 | 0.176
Refute | 0.0 | 1.0 | 0.406 | 0.346
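The exact overlap measure is not spelled out in the paper; the sketch below shows one plausible token-overlap ratio (shared distinct claim tokens over the claim vocabulary size), purely to illustrate the kind of statistic summarised in Table 4.

```python
import re

def word_overlap_ratio(claim_text: str, doc_text: str) -> float:
    """Fraction of distinct claim tokens that also appear in the document.

    This is an assumed definition for illustration; the paper does not give
    the exact formula used to produce Table 4.
    """
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9']+", s.lower()))
    claim_tokens, doc_tokens = tokenize(claim_text), tokenize(doc_text)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & doc_tokens) / len(claim_tokens)

print(word_overlap_ratio("The nation has lost a visionary leader",
                         "President Kovind said the nation has lost a visionary leader ..."))
```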
4.3. Image Similarity distribution

Image similarity is the most basic indicator of multimodal entailment. To empirically validate the intuition of a potential similarity correlation between $q_{image}$ and $d_{image}$ in the dataset (as mentioned in Section 3.3), an image relatedness analysis is conducted in this section. Similar to the potential bias from word overlap, we compute the image pairwise similarity distribution in the embedding space of a pre-trained ResNet-50 model over the train/val sets, as presented in Tables 5 and 6. As seen from the distribution over the five categories, the two text-related entailment categories have clearly lower pairwise image similarity than the multimodal evidential entailment categories.

Table 5: (Q, D) image pairwise similarity distribution in train set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.533 | 1.0 | 0.864 | 0.865
Support_Text | 0.327 | 1.0 | 0.704 | 0.725
Insufficient_Multi. | 0.428 | 0.999 | 0.835 | 0.833
Insufficient_Text | 0.408 | 0.971 | 0.703 | 0.722
Refute | 0.41 | 1.0 | 0.82 | 0.835

Table 6: (Q, D) image pairwise similarity distribution in val set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.533 | 1.0 | 0.855 | 0.856
Support_Text | 0.393 | 1.0 | 0.72 | 0.74
Insufficient_Multi. | 0.578 | 0.996 | 0.846 | 0.844
Insufficient_Text | 0.383 | 0.936 | 0.71 | 0.73
Refute | 0.426 | 1.0 | 0.828 | 0.842

4.4. Text Length Distribution

The text and OCR text length distributions between (Q, D) in the train set are presented in Figs. 2 and 3. Clearly separable distribution patterns can be seen across the five categories in the claim and document text and their corresponding OCR text. As shown in Fig. 2, document text length varies most for 'Support_Multimodal' among the five entailment categories. The document lengths of the two insufficient categories share a similar range, and the 'Refute' category has the shortest documents. The claim length distribution shows a clear bias of 'Refute' examples towards shorter claims. While the remaining classes present similar ranges, 'Insufficient_Text' and 'Support_Multimodal' tend to include slightly shorter claims. In comparison, the OCR text of both claim and document in 'Refute' samples is surprisingly longer than in the other four categories. Motivated by these observations, we adopted text lengths as features in our ensemble model, as illustrated in Section 3.3.

Figure 2: Text length distribution over the five entailment categories
Figure 3: OCR text length distribution over the five entailment categories

4.5. Image Domain Bias

Source bias is one of the known and common problems in machine learning datasets [68]; it arises when most data samples are collected from the same source. This problem has several facets, mainly including selection bias, capture bias (bias due to particular data collection methods), label bias and negative set bias. We are interested in probing potential source bias in the Factify dataset, and a potential bias in image domains across the five categories came to our attention during data analysis. This is important, since multimedia metadata have proved to be valuable information and signals for fact verification [69, 5, 70] in real-world applications, such as domain/source credibility and the detection of image manipulation and tampering. Image (link) domains are extracted from all document samples of the train set and their distributions are computed across the five categories. As shown in Fig. 4, the analysis reveals a surprisingly strong correlation between image domains and each entailment category, in both claim and document. Motivated by this correlation analysis, image domains are employed as features in our ensemble model, as illustrated in Section 3.3.

Figure 4: Label Distribution by Image Domain in claim and document
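A minimal sketch of how the image-domain feature and its per-class distribution could be derived is shown below; the URLs are hypothetical and the choice of urllib.parse plus a simple counter is an illustrative assumption about tooling, not a description of the authors' exact code.

```python
from collections import Counter
from urllib.parse import urlparse

def image_domain(image_url: str) -> str:
    """Extract the host name used as the categorical 'image domain' feature."""
    return urlparse(image_url).netloc.lower()

# Toy per-class tally of document-image domains, the kind of distribution shown in Fig. 4.
samples = [  # (label, document image URL) pairs; URLs are made up for illustration
    ("Refute", "https://factchecker.example.org/img/claim-123.jpg"),
    ("Support_Multimodal", "https://news.example.com/photos/speech.jpg"),
    ("Refute", "https://factchecker.example.org/img/claim-456.jpg"),
]
per_class = Counter((label, image_domain(url)) for label, url in samples)
print(per_class)
```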
5. Experiments

5.1. Experimental Setting

To evaluate the performance of our baseline solutions, we use the weighted average F1 for benchmarking on the validation set. All experiments are run on a single NVIDIA A100 GPU with up to 20 GiB RAM.

5.2. Hypothesis Only Test

We conducted a hypothesis-only reliance test by using only the hypothesis information to train a baseline model. This is a commonly adopted approach [71, 72] in SNLI/RTE to verify the presence of data bias. The assumption is that, without any premise information, such a baseline should make a random guess among the five classes. We train two models, with and without images, and test the resulting accuracy of each model on the val set. The two models (Hypo_text and Hypo_text+img) are implemented with a similar architecture, consisting of a text processing component and/or a ResNet embedding layer, followed by two fully-connected (FC) layers. The text processing component extracts text features from the given hypothesis: it first generates a sequence of word embeddings for the given claim text, and the embedding sequence is then fed into a GRU [73] to output text context features of dimension 300. The image processing component involves BGR-to-RGB conversion, resizing images (to [300, 300, 3]) and feature extraction with ResNet-50, with a linear layer projecting the pre-trained embeddings to a 512-dimensional vector. The input and output dimensions of the two FC layers of the text-only model are [300, 300] and [300, 3] respectively; for the text+image model, the hidden layer dimensions are [300, 300] and [300, 3] respectively. Our experiments show that accuracy and weighted F1 reach the same value for each model on the val set: 0.60 for the text-only model and 0.64 for the text+image model, implying the existence of bias in the Factify dataset. The details are presented in the results section. Our proposed solutions outperform the two hypothesis-only baselines.

5.3. 3-way text entailment

Transformer fine-tuning settings: For the best entailment model, the pre-trained BigBird model from Huggingface and its implementation for pair-wise classification fine-tuning were used. For our experiments, the model was fine-tuned for 2 epochs, using the AdamW optimizer with learning rate 2e-5 and epsilon 1e-8, with batch size 4. The maximum sentence length was set to the mean length of the input texts, namely 1,396 tokens. To train the 3-way entailment model, the 5-way data categories were converted to 3-way categories: "Support" ("Support_Multimodal" + "Support_Text"), "Refute" and "Insufficient" ("Insufficient_Multimodal" + "Insufficient_Text"). OCR text from both $Q$ and $D$ was excluded.

MatchPyramid baseline model settings: We use GloVe embeddings with 50 dimensions (GloVe 6B 50d) for the text input, and the GRU output dimension is set to 50. The number of CNN layers is set to 2, each with kernel size 3 × 3, pooling size 5 × 10 and 'valid' mode (i.e., no padding). ReLU activation is placed between all convolutional layers. The convolution channels are set to 16 and 32 respectively. A 2-layer MLP is used, with hidden dimensions of 128 and 64 respectively. The maximum text lengths for $Q_{text}$ and $D_{text}$ are set to 100 and 1,000 respectively. Limited further experiments were conducted, including on the content context size of claim and document, global average pooling and convolution padding schemes.

5.4. 5-way Ensemble Model

The ensemble model is implemented using scikit-learn's DecisionTreeClassifier class (a minimal sketch is given below).
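The sketch below assumes the per-sample features described in Section 3.3 have already been computed; the feature ordering and dictionary keys are illustrative assumptions, while the classifier settings (gini criterion, best splitter, depth limit of 8) follow the description in the next paragraph.

```python
from sklearn.tree import DecisionTreeClassifier

def build_feature_vector(sample: dict) -> list:
    """Assemble the hand-crafted features of Section 3.3 for one claim/document pair.

    The dict keys are assumed names for the upstream outputs (text entailment
    prediction, ResNet-50 image similarity, one-hot image domains, text/OCR lengths).
    """
    return (
        [sample["len_q_text"], sample["len_q_ocr"],
         sample["len_d_text"], sample["len_d_ocr"]]          # text/OCR lengths
        + [sample["entail_label_id"], sample["entail_prob"]]  # 3-way entailment output
        + [sample["image_cosine_sim"]]                        # ResNet-50 similarity
        + list(sample["q_domain_onehot"])                     # one-hot claim image domain
        + list(sample["d_domain_onehot"])                     # one-hot document image domain
    )

# Decision tree settings reported in Section 5.4.
clf = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=8)

# X = [build_feature_vector(s) for s in train_samples]
# y = [s["label"] for s in train_samples]
# clf.fit(X, y)
# predictions = clf.predict([build_feature_vector(s) for s in val_samples])
```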
The best pre-trained transformer ("BigBird") based text entailment classifier is adopted, and the pairwise image similarity is computed based on ResNet-50. For image pre-processing and feature extraction, the same practice introduced in Section 5.2 is applied. The one-hot encoding function provided in scikit-learn is used to convert the two categorical features (i.e., the URL domains of $q_{image}$ and $d_{image}$) into a one-hot numeric array learnt from the train set. Text is pre-processed separately for the BigBird model and the ensemble model. No pre-processing is applied to the four text length features. For ensemble model training, we use the 'best' split with the 'gini' impurity criterion and limit the depth of the tree to 8 layers to avoid overfitting.

5.5. 5-way Multimodal classification

The 5-way end-to-end multimodal entailment model Multimodal_ent is implemented with Keras and TensorFlow (v2.4), using the Adam optimiser with an adaptive learning rate scheduler. The initial learning rate and weight decay are both set to 0.0001. The batch size is 32 and the maximum number of training epochs is set to 80. Optimal parameters and settings from the MatchPyramid baseline experiments are applied. A checkpoint callback is used to save the model that achieves the best validation accuracy. ReLU activation is applied to all convolution layers and fully-connected layers. The uniform He initialization ("he_uniform") is used for all ReLU layers. The same settings (layer size, hidden dimension, activation, etc.) are applied to the three separate MLP layers. A few parameters and architecture variants were experimented with, including varying lengths of claim and document content, MLP depth (1-2 layers), MLP hidden dimensions (64, 256, 512, 768, 1024) and merge strategies (concatenation and multiplication) for the three MLP outputs in the classification layer. An ablation study is conducted by removing individual sub-components, including the hypothesis MLP ($q_{merge}$), the document cross-modal interaction ($d^{i,t}_{cross\_attn}$) and the document-claim cross-modal interaction ($q^{i,t}_{cross\_attn}$); the ablation experiments show the effectiveness of the full model architecture. The best model (as reported in Section 6.2) is obtained at training epoch 9, with training stopped at epoch 14, using the optimal settings of the three-MLP architecture: 1-layer MLPs with dimension 256, MLP outputs merged by concatenation, and text input lengths for $q_{text}$ and $d_{text}$ of 100 and 1,000 respectively.

Table 7: Three-way classification results on val set (P / R / F1)
Category | MatchPyramid_glove50d | BigBird | LongFormer
Support | 0.77 / 0.74 / 0.76 | 0.83 / 0.86 / 0.85 | 0.83 / 0.86 / 0.84
Refute | 0.99 / 0.99 / 0.99 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00
Insufficient | 0.75 / 0.77 / 0.76 | 0.85 / 0.83 / 0.84 | 0.85 / 0.82 / 0.84
Weighted Avg. | 0.81 / 0.81 / 0.81 | 0.88 / 0.87 / 0.87 | 0.87 / 0.87 / 0.87

6. Results and Discussion

6.1. 3-way text entailment

The results of the 3-way text entailment models are presented in Table 7. To validate our model choice, we evaluated a few SoTA pre-trained transformer models, including BERT, RoBERTa, BigBird and LongFormer. The best performing models are BigBird and LongFormer, with the overall winner being BigBird because of its slightly better results and the smaller input context size required (1,396 vs 1,484 tokens respectively). For the architecture of the extended MatchPyramid baseline, we experimented with different parameters, such as longer context lengths for $Q_{text}$ (including 1,500, 2,000 and 3,000), a GloVe model with 300 dimensions, a larger GRU output dimension of 300 and various pooling sizes ([3, 10]); none of these attempts provided major improvements.
Overall, our baseline implementation with self-attention and GloVe-50d based contextual representation learning achieves optimal performance, which is competitive with the large transformer model based approaches, as presented in Table 7.

6.2. 5-way classification

The results of the 5-way Ensemble and Multimodal_ent models on the val set are presented in Table 8. Four of our baseline methods outperform all baseline models proposed by the task organisers, as reported in [12]. The result of the best baseline model (Multimodal_factify) from the Factify data paper is included in the table; the corresponding class-wise performance figures are not provided by the organisers.

Table 8: 5-way classification results on val set (P / R / F1; only F1 is available for Multimodal_factify)
Category | Multimodal_factify | Hypo_text | Hypo_text+img | Multimodal_ent | Ensemble
Support_Multimodal | n/a | 0.60 / 0.60 / 0.60 | 0.59 / 0.63 / 0.61 | 0.84 / 0.57 / 0.68 | 0.74 / 0.78 / 0.76
Support_Text | n/a | 0.48 / 0.43 / 0.45 | 0.55 / 0.51 / 0.53 | 0.51 / 0.66 / 0.58 | 0.71 / 0.71 / 0.71
Insufficient_Multimodal | n/a | 0.45 / 0.52 / 0.61 | 0.56 / 0.61 / 0.56 | 0.62 / 0.52 / 0.57 | 0.68 / 0.65 / 0.66
Insufficient_Text | n/a | 0.61 / 0.50 / 0.55 | 0.62 / 0.51 / 0.56 | 0.57 / 0.69 / 0.62 | 0.74 / 0.73 / 0.73
Refute | n/a | 0.87 / 0.90 / 0.88 | 0.93 / 0.93 / 0.93 | 0.99 / 0.97 / 0.98 | 1.0 / 1.0 / 1.0
Weighted Avg. | 0.54 | 0.60 / 0.60 / 0.60 | 0.64 / 0.64 / 0.64 | 0.71 / 0.68 / 0.69 | 0.77 / 0.77 / 0.77

Table 9: 5-way classification results on test set (P / R / F1)
Category | Multimodal_ent | Ensemble
Support_Multimodal | 0.81 / 0.60 / 0.69 | 0.76 / 0.78 / 0.77
Support_Text | 0.47 / 0.59 / 0.52 | 0.65 / 0.69 / 0.67
Insufficient_Multimodal | 0.61 / 0.53 / 0.57 | 0.73 / 0.64 / 0.68
Insufficient_Text | 0.56 / 0.66 / 0.60 | 0.71 / 0.73 / 0.72
Refute | 0.99 / 0.96 / 0.98 | 1.0 / 1.0 / 1.0
Weighted Avg. | 0.69 / 0.67 / 0.67 | 0.77 / 0.77 / 0.77

Unsurprisingly, our ensemble model achieved the best results on the val set with 0.77 F1, which is 8% higher than the result of the Multimodal_ent model. The experimental results demonstrate a large performance gain from the large pre-trained text entailment model, which works effectively on long paragraphs and contributes the most towards predicting the final 5-way categories. This is particularly obvious for the "Refute" label, whose samples mostly rely on text-based inference. It is not surprising that the features derived from heuristics and biases learned from the dataset proved effective for this multimodal prediction task. We found that differentiating between "Insufficient_Multimodal" and "Support_Text", or between "Insufficient_Text" and "Support_Multimodal", was the most challenging task without relying on data-specific features. In other words, when a sample contains supporting document text for the claim but the image is irrelevant, our model has low confidence in predicting the label as "Support_Text" or "Insufficient_Multimodal". Likewise, when the document image is relevant to the claim image in the same information context but the document text is irrelevant, our model has low confidence in predicting the correct label. The decision is highly dependent on the annotation bias. Of all the labels, "Refute" is the most distinguishable category and depends strongly on the text. Its performance is highly consistent among all our models and the participant systems in this competition (as seen in the leaderboard, Table 10). This is possibly mainly attributable to the article samples being selected from very few fact-checking sources that have highly distinctive linguistic clues (typically frequent negative words and the same verdict sentences appearing repeatedly in this category, such as "The claim is false").
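For reference, all model comparisons above use the weighted-average F1 described in Section 5.1; it can be computed with scikit-learn as in the following minimal sketch, where the label lists are toy placeholders rather than real predictions.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted 5-way labels; the real evaluation uses the full val/test sets.
gold = ["Refute", "Support_Text", "Support_Multimodal", "Insufficient_Text"]
pred = ["Refute", "Support_Text", "Insufficient_Multimodal", "Insufficient_Text"]

print(f1_score(gold, pred, average="weighted"))
```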
6.3. Competition Result

The final test set results and the competition leaderboard are presented in Tables 9 and 10 respectively. Our best model ("Ensemble") outperforms all competition systems and the best baseline models [74]. The test result of the Ensemble model is 0.77 average F1, which is the same as its result on the val set and 10% higher than the result of Multimodal_ent.

Table 10: Factify Official Leaderboard
Rank | Team | Support_Text | Support_Multi. | Insufficient_Text | Insufficient_Multi. | Refute | Final
1 | Logically | 81.843% | 87.429% | 84.437% | 78.345% | 99.899% | 76.819%
2 | Yet | 75.518% | 89.38% | 82.121% | 80.81% | 99.866% | 75.591%
3 | Truthformers | 77.65% | 85.057% | 79.421% | 84.482% | 98.819% | 74.862%
4 | UofA-Truth | 78.493% | 89.786% | 82.995% | 75.981% | 98.339% | 74.807%
5 | Yao | 68.881% | 81.61% | 84.836% | 88.309% | 100.0% | 74.585%
6 | Greeny | 74.947% | 86.018% | 80.382% | 82.858% | 99.125% | 74.28%
7 | GPTs | 71.575% | 79.032% | 75.363% | 79.275% | 100.0% | 69.461%
8 | Tyche | 75.0% | 75.259% | 85.496% | 68.823% | 99.159% | 69.203%
9 | MUM_NLP | 64.803% | 80.857% | 69.848% | 66.548% | 93.465% | 61.165%
- | BASELINE | 82.675% | 75.466% | 74.424% | 69.678% | 42.354% | 53.098%

7. Conclusion

We described our participation in the Multimodal Fact Verification (Factify) challenge with two proposed baseline solutions: an ensemble model and an end-to-end multimodal entailment model. The ensemble model based system outperforms the end-to-end model on both the val and test sets. The best performing model in this competition combines the results of a 3-way text entailment classifier, visual similarity from a pre-trained CNN model and heuristics learnt from the dataset. A multimodal fusion technique is also explored in this paper to model the interaction between the different modalities (i.e., text and image) in claim and document pairs and to combine information from them to learn the multimodal entailment relationship end-to-end. We found that the multimodal entailment based system suffers from overfitting. Apart from the limited training set size and the identified data bias, our experiments suggest that fine-grained image and text interaction models need to be explored further. We also found that ambiguous labels in the Factify dataset undermine the performance of our deep learning architecture. Creating a dataset for a complex real-world multimodal NLP problem, particularly natural language inference for multimodal verification, raises emergent challenges [75, 76] and is indeed a cumbersome task, and we appreciate the work of the Factify organizers; yet a more elaborate and unbiased dataset, along with well-defined annotation criteria, would make this dataset more suitable as a benchmark. More effort is required to tackle the dataset challenge of minimising hypotheses from human annotators and making the dataset better reflect real-world challenges. As an emergent research field, we hope our extensive data analysis and proposed baseline solutions can inspire further work.

References

[1] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, Language and Linguistics Compass 15 (2021) e12438.
[2] A. Kazemi, K. Garimella, D. Gaffney, S. A. Hale, Claim matching beyond English to scale global fact-checking, arXiv preprint arXiv:2106.00853 (2021).
[3] Y. Jang, C.-H. Park, Y.-S. Seo, Fake news analysis modeling using quote retweet, Electronics 8 (2019) 1377.
[4] K. Nakamura, S. Levy, W. Y. Wang, r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, arXiv preprint arXiv:1911.03854 (2019).
[5] D. Zlatkova, P. Nakov, I. Koychev, Fact-checking meets fauxtography: Verifying claims about images, arXiv preprint arXiv:1908.11722 (2019).
[6] M. K. Elhadad, K. F. Li, F. Gebali, Detecting misleading information on covid-19, IEEE Access 8 (2020) 165201–165215.
[7] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. D. S. Martino, S. Shaar, H. Firooz, P. Nakov, A survey on multimodal disinformation detection, arXiv preprint arXiv:2103.12541 (2021).
[8] M. Sun, X. Zhang, J. Ma, Y. Liu, Inconsistency matters: A knowledge-guided dual-inconsistency network for multi-modal rumor detection, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1412–1423.
[9] E. Müller-Budack, J. Theiner, S. Diering, M. Idahl, S. Hakimov, R. Ewerth, Multimodal news analytics using measures of cross-modal entity and context consistency, International Journal of Multimedia Information Retrieval 10 (2021) 111–125.
[10] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: a simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
[11] C.-A. Deledalle, L. Denis, G. Poggi, F. Tupin, L. Verdoliva, Exploiting patch similarity for sar image processing: The nonlocal paradigm, IEEE Signal Processing Magazine 31 (2014) 69–78.
[12] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, S. Amit, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of the First Workshop on Multimodal Fact-Checking and Hate Speech Detection (DE-FACTIFY), 2022.
[13] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018).
[14] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction: Verifying scientific claims, arXiv preprint arXiv:2004.14974 (2020).
[15] J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Gear: Graph-based evidence aggregating and reasoning for fact verification, arXiv preprint arXiv:1908.01843 (2019).
[16] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, D. Inkpen, Enhanced lstm for natural language inference, arXiv preprint arXiv:1609.06038 (2016).
[17] A. Hanselowski, C. Stab, C. Schulz, Z. Li, I. Gurevych, A richly annotated corpus for different tasks in automated fact-checking, arXiv preprint arXiv:1911.01214 (2019).
[18] L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. W. S. Hoi, A. Zubiaga, Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours, arXiv preprint arXiv:1704.05972 (2017).
[19] J. Gu, J. Cai, S. R. Joty, L. Niu, G. Wang, Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7181–7189.
[20] S. Wang, Y. Chen, J. Zhuo, Q. Huang, Q. Tian, Joint global and co-attentive representation learning for image-sentence retrieval, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1398–1406.
[21] H. Nam, J.-W. Ha, J. Kim, Dual attention networks for multimodal reasoning and matching, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 299–307.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline for vision and language, arXiv preprint arXiv:1908.03557 (2019).
[24] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Learning universal image-text representations (2019).
[25] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, Vl-bert: Pre-training of generic visual-linguistic representations, arXiv preprint arXiv:1908.08530 (2019).
[26] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, arXiv preprint arXiv:1908.02265 (2019).
[27] H. Tan, M. Bansal, Lxmert: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[28] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
[29] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[30] P. Adarsh, P. Rathi, M. Kumar, Yolo v3-tiny: Object detection and recognition using one stage improved model, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 687–694.
[31] G. Yu, Q. Chang, W. Lv, C. Xu, C. Cui, W. Ji, Q. Dang, K. Deng, G. Wang, Y. Du, et al., Pp-picodet: A better real-time object detector on mobile devices, arXiv preprint arXiv:2111.00902 (2021).
[32] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al., Mlp-mixer: An all-mlp architecture for vision, arXiv preprint arXiv:2105.01601 (2021).
[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[34] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, S. Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, arXiv preprint arXiv:2101.11986 (2021).
[35] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, C. Shen, Conditional positional encodings for vision transformers, arXiv preprint arXiv:2102.10882 (2021).
[36] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, Z. He, A survey of visual transformers, arXiv preprint arXiv:2111.06091 (2021).
[37] D. A. Hudson, C. D. Manning, Gqa: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709.
[38] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, arXiv preprint arXiv:1811.00491 (2018).
[39] K. Desai, G. Kaul, Z. T. Aysola, J. Johnson, Redcaps: Web-curated image-text data created by the people, for the people, in: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
[40] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, IEEE, 2006, pp. 1735–1742.
[41] F. Schneider, Ö. Alaçam, X. Wang, C. Biemann, Towards multi-modal text-image retrieval to improve human reading, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, 2021.
[42] D. Kiela, S. Bhooshan, H. Firooz, E. Perez, D. Testuggine, Supervised multimodal bitransformers for classifying images and text, arXiv preprint arXiv:1909.02950 (2019).
[43] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
[44] P. Liu, X. Qiu, J. Chen, X.-J. Huang, Deep fusion lstms for text semantic matching, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1034–1043.
[45] M.-C. De Marneffe, B. MacCartney, T. Grenager, D. Cer, A. Rafferty, C. D. Manning, Learning to distinguish valid textual entailments, in: Second Pascal RTE Challenge Workshop, volume 62, Citeseer, 2006.
[46] N. Vo, K. Lee, Where are the facts? searching for fact-checked information to alleviate the spread of fake news, arXiv preprint arXiv:2010.03159 (2020).
[47] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task for fine-grained image understanding, arXiv preprint arXiv:1901.06706 (2019).
[48] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[50] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, arXiv preprint arXiv:2012.15723 (2020).
[51] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.
[52] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, arXiv preprint arXiv:1508.05326 (2015).
[53] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adversarial nli: A new benchmark for natural language understanding, arXiv preprint arXiv:1910.14599 (2019).
[54] J. Thorne, M. Glockner, G. Vallejo, A. Vlachos, I. Gurevych, Evidence-based verification for real world information needs, arXiv preprint arXiv:2104.00640 (2021).
[55] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big bird: Transformers for longer sequences, arXiv preprint arXiv:2007.14062 (2021).
[56] D. Stammbach, Evidence selection as a token-level prediction task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 2021, pp. 14–20.
[57] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[58] S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, X. Cheng, Match-srnn: Modeling the recursive matching structure with spatial rnn, arXiv preprint arXiv:1604.04378 (2016).
[59] L. Pang, Y. Lan, J. Guo, J. Xu, X. Cheng, A study of matchpyramid models on ad-hoc retrieval, arXiv preprint arXiv:1606.04648 (2016).
[60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[61] Z. Li, Y. Li, H. Lu, Improve image captioning by self-attention, in: International Conference on Neural Information Processing, Springer, 2019, pp. 91–98.
[62] T. Malisiewicz, A. Gupta, A. A. Efros, Ensemble of exemplar-svms for object detection and beyond, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 89–96.
[63] M. Gharbi, T. Malisiewicz, S. Paris, F. Durand, A Gaussian approximation of feature space for fast image similarity (2012).
[64] L. Wang, Y. Zhang, J. Feng, On the euclidean distance of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1334–1339.
[65] A. Pedraza, O. Deniz, G. Bueno, Really natural adversarial examples, International Journal of Machine Learning and Cybernetics (2021) 1–13.
[66] T. Schuster, A. Fisch, R. Barzilay, Get your vitamin c! robust fact verification with contrastive evidence, arXiv preprint arXiv:2103.08541 (2021).
[67] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Evaluating adversarial attacks against multiple fact verification systems, Association for Computational Linguistics, 2020.
[68] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: CVPR 2011, IEEE, 2011, pp. 1521–1528.
[69] B.-C. Chen, L. S. Davis, Deep representation learning for metadata verification, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), IEEE, 2019, pp. 73–82.
[70] T. Prabhakar, A. Gupta, K. Nadig, D. George, Check mate: Prioritizing user generated multimedia content for fact-checking, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 15, 2021, pp. 1025–1033.
[71] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, N. A. Smith, Annotation artifacts in natural language inference data, arXiv preprint arXiv:1803.02324 (2018).
[72] H. T. Vu, C. Greco, A. Erofeeva, S. Jafaritazehjan, G. Linders, M. Tanti, A. Testoni, R. Bernardi, A. Gatt, Grounded textual entailment, arXiv preprint arXiv:1806.05645 (2018).
[73] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[74] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[75] R. Le Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. Peters, A. Sabharwal, Y. Choi, Adversarial filters of dataset biases, in: International Conference on Machine Learning, PMLR, 2020, pp. 1078–1088.
[76] S. Sharma, M. Dey, K. Sinha, Evaluating gender bias in natural language inference, arXiv preprint arXiv:2105.05541 (2021).