=Paper=
{{Paper
|id=Vol-3199/paper6
|storemode=property
|title=Logically at Factify 2022: Multimodal Fact Verification
|pdfUrl=https://ceur-ws.org/Vol-3199/paper6.pdf
|volume=Vol-3199
|authors=Jie Gao,Hella-Franziska Hoffmann,Stylianos Oikonomou,David Kiskovski,Anil Bandhakavi
|dblpUrl=https://dblp.org/rec/conf/aaai/GaoHOKB22
}}
==Logically at Factify 2022: Multimodal Fact Verification==
Logically at Factify 2022: Multimodal Fact Verification

Jie Gao, Hella-Franziska Hoffmann, Stylianos Oikonomou, David Kiskovski and Anil Bandhakavi

Brookfoot Mills, Brookfoot Industrial Estate, Brighouse, HD6 2RW, United Kingdom

Abstract
This paper describes our participant system for the multimodal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and in large pre-trained multimodal models across vision and language, very limited work has been done on applying multimodal techniques to automate fact-checking processes, particularly considering the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored, including an ensemble model (combining two uni-modal models) and a multimodal attention network (modeling the interaction between the image and text pairs from claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first on the leaderboard and obtains a weighted average F-measure of 0.77 on both the validation and the test set. Exploratory analysis is also carried out on the Factify data set and uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypotheses. Finally, we highlight challenges of the task and the multimodal dataset for future research.

Keywords: fact verification, multimodal representation learning, multimodal entailment, text entailment, attention mechanism

De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
jie@logically.ai (J. Gao); hella.h@logically.ai (H. Hoffmann); stylianos@logically.ai (S. Oikonomou); david.k@logically.ai (D. Kiskovski); anil@logically.ai (A. Bandhakavi)
https://www.logically.ai/team/leadership/anil-bandhakavi (A. Bandhakavi)
ORCID: 0000-0002-3610-8748 (J. Gao)

1. Introduction

The rapidly growing volume of misinformation and fake news has become a pressing challenge and causes severe consequences for society. Significant joint efforts have been undertaken by a wide range of parties (represented by journalists, researchers and independent fact checkers) to protect communities from false information. It has never been more important to have a versatile ecosystem that scales up and speeds up fact checking against misinformation using technology, which can be broadly categorised into claim detection and claim validation [1]. The former supports fact checkers in content prioritisation by assessing check-worthiness, while the latter automates the process of evidence retrieval from large knowledge bases and the veracity prediction of detected claims in order to assist manual fact-checking tasks. Claim matching [2] is another emerging trend, addressing the need for timely identification of previously fact-checked claims. Prior efforts focused mostly on text from news media articles and on the English language.
In recent years, with the growth of user-generated content and increasingly polarised social platforms, the challenges of fact checking have become increasingly multilingual and multimodal, as misinformation is now pervasive in user-generated multimedia content [3]. As a consequence, many new problems arise, typically false context, false connections or misleading content [4, 5, 6, 7, 8]. Another understudied process, known as amplification, is leveraged by coordinated disinformation campaigns [2]: it deliberately spreads large volumes of repeated claims in many different ways in order to stimulate unintentional spread as false rumors [8]. Thus, there is an imperative need to develop algorithms that group the same claims residing in various multimodal contexts and automate the verification process at scale.

Compared to text-based fact checking, multimodal verification is an under-explored area of research. Image and text both contain rich information but reside in heterogeneous modalities. Compared to representation learning within a single modality, cross-modal architectures need to not only learn features for image and text that express their respective content but, importantly, also capture a measure of cross-modal semantic integrity [8, 9]. We study multimodal entailment in this paper. As a newly introduced subtask, it poses additional critical challenges. Simple image similarity cannot resolve fine-grained image differences and performs poorly for adversarial images [10], SAR images [11], etc. To exemplify this challenge, two pairs of claim and document from the insufficient multimodal samples in the Factify dataset [12] are presented in Table 1. The first sample shows two separate images of a politician taken from a direct point of view, sitting at the exact same table, in the exact same room, giving a televised speech on different days about different issues. In both images the politician is wearing a suit, in one image black and in the other white. In this case, the images are likely to yield high similarity with respect to their content, but they should be considered different images representative of different contextual information. The second sample presents two images of the same nature, where the politician is wearing the same white suit and ear plugs, but with a news broadcasting logo overlaid on the upper right corner of the claim image, while the document image has no news channel logo visible. The main discrepancy lies between the text and image of the document, which reports that the politician was wearing a white mask during the video conference. Therefore, although the document text provides supporting evidence for the claim, the image is missing important context information. On the contrary, the sample in Table 2 presents two images with low content overlap, but the document image corresponds to its textual content, which supports the information about the politician's death presented in the claim image. Thus, the document image should be considered a supporting image that is contextually representative of the same information as the corresponding claim image.

Relying on visual similarity analysis alone for multimodal fact verification is naturally prone to false positives, because images related to branding and advertisements (e.g., a "breaking news" image or a company's logo) are often reused. This may cause erroneous detection when there is no real connection between two items other than the reuse of a generic image. The problem becomes more complex with images exploited in disinformation on social media.
Table 1: Insufficient Multimodal examples in the Factify dataset. Claim image+text (left), Document image+text (right).
Claim: "In the demise of Union Minister Ram Vilas Paswan, President Ram Nath Kovind said on Wednesday, the nation has lost a visionary leader. He was among the most active and longest-serving members of parliament..." | Document: "... Addressing the fourth annual convocation of the Jawaharlal Nehru University, he said Indian scholars of today ..."
Claim: "Prime Minister Narendra Modi holds a meeting via video-conferencing with the Chief Ministers over #COVID19..." | Document: "... Prime Minister Narendra Modi on Saturday held a video conference with ... showed Modi wearing a white mask during the interaction ..."

Table 2: Support Multimodal example in the Factify dataset. Claim image+text (left), Document image+text (right).
Claim: "She was appointed to the Supreme Court by Bill Clinton in 1993. Remembering Supreme Court Justice Ruth Bader ... has died at the age of 87. ... lost a cherished colleague," Chief Justice John Roberts said ..." | Document: "Here's a look back at the life and legacy of Ruth Bader Ginsburg, the second woman to serve on the US Supreme Court, in photos. Ginsburg died Friday due to ..."

Our work in this competition responds to this multimodal online misinformation issue and focuses on solving the above challenges. Two different algorithms are designed for the task, which is framed as a multimodal entailment prediction problem, following two different frameworks: an ensemble learning approach and an end-to-end attention network. The ensemble model approach is implemented with a decision tree classifier that combines the predictions of two uni-modal models with a few data-specific heuristic features. The two uni-modal models are a 3-way text entailment model based on State-of-the-Art (SoTA) pre-trained transformer language architectures fine-tuned on the task dataset, and a pre-trained CNN model (ResNet-50) for image similarity. A SoTA multimodal attention network for 5-way end-to-end entailment classification is implemented as an alternative solution, in an attempt to infer the combined entailment relation from joint representations of language and vision. Global-level multimodal interactions are modeled with a popular multi-branch attention network framework in order to fuse multimodal information. Strong baselines are implemented for both the 3-way and 5-way entailment models to demonstrate the advantage of our proposed methods. Exploratory data analysis and bias-test experiments are conducted to understand potential data issues and present the challenge of creating high-quality multimodal datasets for this real-world problem. The best results from the ensemble model were submitted for the competition.

In the remainder of this paper, we first present a brief overview of related work (Section 2), then the task definition and our proposed methods in detail (Section 3), followed by experiments on the task dataset (Section 5). Exploratory data analysis is elaborated in Section 4. Finally, the results discussion (including 3-way and 5-way models) and the conclusion are provided in Sections 6 and 7 respectively.

2. Related Work

Text Entailment. Recognising Textual Entailment (RTE) is the earliest and most closely related line of work to the Factify challenge; it aims to determine an inferential relationship between a natural language hypothesis and premise. On the basis of a given sentence pair, the task is to predict 3-way labels: Support, Refute or NotEnoughInfo.
Well-known shared tasks include FEVER [13] and SCIVER [14], which have advanced RTE research for claim validation in recent years. This line of work performs some form of evidence retrieval and then applies claim validation based on that evidence. In contrast, evidence retrieval is not required in the Factify task (although the practice of sentence retrieval [15, 16], as a classic NLI technique for the long document text in the Factify data, could be considered good practice and is applicable). Stance detection is another direction of work, supported by shared tasks such as UKP Snopes [17] and SemEval-2017 RumourEval [18]; it has also been exploited for RTE by retrieving texts relevant to a claim or story and then determining the stance of those texts, so as to ultimately predict the veracity of a given claim. The common practice of RTE for claim verification [1] is also incorporated in our ensemble model (as one of the proposed solutions) and treated as a three-way text classification task on the text data. Sentence retrieval for evidence aggregation and stance detection are not exploited in this work.

Multi/cross-modal representation learning. In the field of multimodal reasoning and matching, the success of the attention mechanism in the NLP community motivated computer vision techniques to shift from traditional twin networks (typically Siamese nets [19, 20, 21]) to models pre-trained in multimodal settings for a wide range of downstream tasks, such as visual question answering (VQA), visual reasoning and image captioning. Similar to BERT [22], one recent approach is to use a single transformer architecture to jointly encode text and image, such as VisualBERT [23], UNITER [24] and VL-BERT [25]. Alternatively, ViLBERT [26] and LXMERT [27] introduced a two-stream architecture, where two transformers are applied to images and text independently and are fused by a third transformer at a later stage. These models typically rely on region-based image features extracted by pre-trained object detectors, based on commonly used two-stage detectors (typically the Faster R-CNN model [28] or its extension Mask R-CNN [29]), single-stage detectors (typically SSD and YOLOv3 [30]) or anchor-free detectors (e.g., [31]). Another direction is patch embeddings [32, 33, 34, 35, 36]. This line of work operates directly on image patches (as a sequence of tokens of fixed length); image patches and text token embeddings are fed into a transformer or self-attention model to learn fused cross-modal attention. The great progress of these recently developed models can be witnessed on the leaderboards of various tasks, without using ensembling, such as VQA, GQA [37] and NLVR2 [38], which can mainly be attributed to the availability of large-scale, weakly correlated multimodal data (typically captioned images or video clips and accompanying subtitles [39]) that can be utilised to learn cross-modal representations via contrastive learning [40]. However, existing pre-trained models mostly use scene-limited image-text pairs with short and relatively simple descriptive captions for images, while ignoring richer uni-modal text data and domain-specific information. This leads to difficulties in comprehending long paragraphs as opposed to short text [41]. Thus, most such tasks (e.g., VQA, VAC, image retrieval) still need an additional fusion layer to model the interaction between visual and linguistic content. Moreover, limited ground truth information forces many tasks to use evaluation metrics based on binary relevance.
Different from most current cross-modal reasoning tasks, our work aims to model long text sequences and images for claim verification. In contrast to these multimodal architectures, we utilize the individual components of uni-modal pre-trained architectures. An equivalent architecture is employed by [42] for image-text pair interaction; however, we exploit richer cross-modal interactions among the vision and text pairs. Inspired by the practice in [43], a stacked attention mechanism is exploited in our solution for cross-modal matching, by inferring the latent language-vision alignments at a global level. Recent advances in fine-grained cross-modal representation learning for region-word correspondence are not exploited in this work.

Relevance matching technique. Relevance matching (RM) is the core problem of information retrieval (IR) and has also been applied to detecting the entailment relation [44, 45] by computing the best alignment of a hypothesis to a premise based on local and global interactions. Vo and Lee [46] exploited a neural ranking model using textual and visual modalities to match a multimodal claim with fact-checked information. Their model unifies the textual and visual interaction between a claim and a collection of candidate articles, while the Factify task aims to match a claim with one given candidate document. In our proposed solution, we extend the matching module introduced in Vo and Lee [46] in order to better handle text of varying length.

Visual Entailment. Visual Entailment (VE) [47] is a variant of the traditional RTE task that consists of image-sentence pairs, whereby the premise is defined by an image rather than a natural language sentence. The problem that VE tries to solve is to reason about the relationship between an image as premise $P_{image}$ and a text as hypothesis $H_{text}$. This is different from the Factify task, which aims to reason about the multimodal relationship between a hypothesis and premise pair of both textual and visual content with respect to five categories. Moreover, the premise text is of varying length, rather than the short sentences in the SNLI-VE dataset.

3. Methodology

3.1. Problem Statement

We frame the Factify task as a problem of multimodal entailment, which is to reason about the relationship between a multimodal claim as hypothesis and a multimodal document as premise. Specifically, given a multimodal hypothesis (e.g., a tweet) denoted by $Q = q_{image} + q_{text}$ and a document (typically one or more fact-checking articles) denoted by $D = d_{image} + d_{text}$, both of which contain one image and a text, we aim to derive a function $f(Q, D)$ that infers their entailment relation over five categories ("Support_Multimodal", "Support_Text", "Insufficient_Multimodal", "Insufficient_Text", "Refute").
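To make the task formulation concrete, the following is a minimal sketch of how a claim-document pair and the five-way label space could be represented; the class and field names are illustrative assumptions and not part of the authors' system.

```python
from dataclasses import dataclass

# The five entailment categories defined by the Factify task.
LABELS = (
    "Support_Multimodal",
    "Support_Text",
    "Insufficient_Multimodal",
    "Insufficient_Text",
    "Refute",
)

@dataclass
class MultimodalPair:
    """One hypothesis/premise pair: a multimodal claim Q and document D."""
    q_text: str      # claim text
    q_image: str     # path or URL of the claim image
    d_text: str      # document (premise) text
    d_image: str     # path or URL of the document image
    label: str = ""  # one of LABELS for training data, empty at inference time

def is_valid(sample: MultimodalPair) -> bool:
    """Basic sanity check that a training sample carries a known label."""
    return sample.label in LABELS
```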
The label assignment is based on the relationship conveyed by $(D, Q)$:

• Support_Multimodal holds if there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is true, and $d_{image}$ is relevant to $q_{image}$ and $q_{text}$ in the same information context;
• Support_Text holds if there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is true, but $d_{image}$ is irrelevant to $q_{image}$ and does not provide supplemental information for $q_{text}$;
• Insufficient_Multimodal holds if the evidence in $d_{text}$ is insufficient to draw a conclusion about $q_{text}$, but $d_{image}$ is relevant to $q_{image}$ and $q_{text}$ in the same information context;
• Insufficient_Text holds if the evidence in $d_{text}$ is insufficient to draw a conclusion about $q_{text}$, and $d_{image}$ is irrelevant to $q_{image}$ and does not provide supplemental information for $q_{text}$;
• otherwise, the relationship is Refute, implying that there is enough evidence in $d_{text}$ to conclude that $q_{text}$ is false and $d_{image}$ is irrelevant to $Q$ in both its visual and textual content.

Additional details of the task definition can be found in [12].

3.2. 3-way Text Entailment

Recognizing entailment in natural language is a straightforward application for fact verification. In this section, we study how well a SoTA textual entailment model can be fine-tuned on the textual data pairs in the Factify data set and then used for a three-way RTE task. This also allows us to assess and benchmark our proposed solution of combining the two uni-modal models' predictions in an ensemble model for the final 5-way multimodal entailment prediction.

Pretrained transformer fine-tuning: Pretrained transformer models [22, 48, 49] have become the de facto models for a wide range of NLP tasks and provide SoTA results for RTE tasks [50]. More specifically, in this work we investigate how a pretrained model can learn to conduct RTE on the given dataset without exploiting hidden dataset bias, and how efficiently it can learn and generalise to the test set. The problem differs from existing benchmark datasets (MultiNLI [51], SNLI [52], Adversarial-NLI [53]), which mostly consist of short sentences: the fact verification task requires applying natural language inference (NLI) to long paragraphs or articles. As mentioned above, to simplify the problem, the practice of evidence sentence selection [54] that is commonly adopted in SoTA evidence-aware fact-checking systems is not included in our study. Thus, the supported maximum sequence length and the optimum document context size are two of the key factors to be considered. Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP, but one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. For that reason, Google's BigBird model is selected in this study; it is one of the most successful long-sequence transformers and supports sequence lengths of up to 4,096 tokens. To deal with the limitations that other models face, BigBird uses a sparse attention mechanism that reduces the quadratic dependency to linear [55]. That means it can handle sequences up to 8x longer than what was previously possible using similar hardware. As a consequence of the capability to handle longer context, such models drastically improve performance on various NLP tasks, such as claim verification for long sequences [56]. The model is fine-tuned as a pair-wise classification task on re-purposed data samples converted from the 5-way categories to three-way categories.
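A minimal sketch of this pair-wise fine-tuning set-up with the Hugging Face transformers library is shown below. The checkpoint name, label mapping and hyperparameters mirror the settings reported later in Section 5.3, but the surrounding data handling is an illustrative assumption rather than the authors' exact implementation.

```python
from transformers import (BigBirdForSequenceClassification, BigBirdTokenizerFast,
                          Trainer, TrainingArguments)

# 3-way label mapping obtained by collapsing the five Factify categories.
LABEL2ID = {"support": 0, "insufficient": 1, "refute": 2}

tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=len(LABEL2ID))

def encode(claim_text: str, doc_text: str):
    # Claim and document are joined with [SEP] by passing them as a text pair;
    # 1,396 tokens is the mean input length used in the paper's experiments.
    return tokenizer(claim_text, doc_text, truncation=True,
                     max_length=1396, padding="max_length")

args = TrainingArguments(
    output_dir="bigbird-factify-3way",
    num_train_epochs=2,                 # settings reported in Section 5.3
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    adam_epsilon=1e-8,                  # AdamW is the Trainer's default optimizer
)

# `train_dataset` / `eval_dataset` are assumed to yield the encoded pairs plus an
# integer `labels` field; their construction is omitted from this sketch.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```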
Formally, given a pair of text sequences (denoted $q_{text}$ and $d_{text}$) from $Q$ and $D$, we aim to fine-tune a pre-trained model that maps any pair $(Q, D)$ to a label $y$, which determines the pre-defined textual entailment relationship ("support", "refute", "insufficient") between $q_{text}$ and $d_{text}$. The problem is treated as a supervised learning task and a set of training examples of the form $(Q, D, y)$ is given. $[SEP]$ is added as a separator between the two inputs during pre-processing, and a softmax classifier is added on top of the $[CLS]$ token of the last layer to make predictions.

MatchPyramid: In contrast to computationally expensive transformer models, we also propose a simple baseline text entailment model based on a relevance matching technique. Intuitively, an article may be relevant to a claim if they have overlapping or similar words. A strong interaction model, known as MatchPyramid [57, 58, 59, 44], is adopted in our baseline model. This technique builds a similarity matrix from the pairwise similarities between two sequences and applies a CNN with pooling strategies to extract hierarchical interaction patterns. The CNN's strength in modeling spatial (position-aware) correlations is utilised to handle the varying lengths in the data. This deep neural network enables us to find matching patterns between a short claim text and a long document, which is critical for our task. Multiple layers of 2D convolutions and pooling are used, followed by a feed-forward network. [59] experimented with four similarity functions (indicator function, dot product, cosine and Gaussian kernel) and found that, when using embeddings, the Gaussian kernel similarity function is better than the others. A proper kernel size captures more information and generates a better result. Pooling is used to reduce the dimension of the feature maps and to pick out the most important information for the later layers. Especially in the ad-hoc retrieval task, documents often contain hundreds of words, most of which might be background words (exactly the same problem as in our task), so the pooling layers may be even more important for distilling useful information from the noisy background.

Inspired by [47, 46], $q_{text}$ and $d_{text}$ are pre-processed and embedded with a pre-trained word embedding model. The embeddings are used to initialise the network. A self-attention layer is applied to the embeddings of the two inputs, since the premise document ($D$) in Factify can be very long and complex. Intuitively, self-attention can help capture structural information and focus on important keywords, particularly for long-distance dependencies. Specifically, scaled dot product (SDP) attention [60] is used to capture this hidden information:

$\mathrm{Attn}_{sdp}(q) = \mathrm{softmax}\left(\frac{qK^{T}}{\sqrt{d_k}}\right)V, \quad Q_{text\_attn} = \mathrm{Attn}_{sdp}(Q_{text}), \quad D_{text\_attn} = \mathrm{Attn}_{sdp}(D_{text}),$

where $q$ represents $Q$ or $D$ (which, in self-attention, also provides the keys $K$ and values $V$), $d_k$ denotes the embedding dimension, $Q \in \mathbb{R}^{M \times d_k}$ is the claim text ($Q_{text}$) feature matrix and $D \in \mathbb{R}^{N \times d_k}$ is the document text ($D_{text}$) feature matrix. $M$ and $N$ are the sequence lengths of the matrices $Q$ and $D$, and $\mathrm{Attn}_{sdp}$ computes the resulting self-attention weights for $Q$ and $D$ respectively. Subsequently, the self-attended $Q_{text}$ and $D_{text}$ feature matrices are fed into a GRU layer to obtain contextual representations. Finally, a dot product is applied to build the similarity matrix between the two GRU output sequences for the MatchPyramid model, in an attempt to measure the semantic relevance between claim and document more accurately at a higher level of word semantics (a minimal sketch of this attention-and-matching step is given below).
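The sketch below illustrates the SDP self-attention and the dot-product interaction matrix in NumPy. It omits the learned GRU layer and uses the raw embeddings as queries, keys and values, so it is a simplification for illustration rather than the authors' implementation.

```python
import numpy as np

def sdp_self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention: softmax(x x^T / sqrt(d_k)) x.

    x: (seq_len, d_k) embedding matrix for the claim or document text.
    Queries, keys and values are all the input embeddings themselves here
    (no learned projections), which simplifies the layer used in the paper.
    """
    d_k = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ x                                     # attended embeddings

def interaction_matrix(q_emb: np.ndarray, d_emb: np.ndarray) -> np.ndarray:
    """Dot-product similarity matrix fed to the MatchPyramid CNN layers."""
    q_att = sdp_self_attention(q_emb)   # (M, d_k)
    d_att = sdp_self_attention(d_emb)   # (N, d_k)
    # In the full model a GRU is applied to each attended sequence first;
    # that step is omitted in this sketch.
    return q_att @ d_att.T              # (M, N) interaction features

# Toy usage with random 50-d "GloVe-like" embeddings.
rng = np.random.default_rng(0)
sim = interaction_matrix(rng.normal(size=(10, 50)), rng.normal(size=(100, 50)))
print(sim.shape)  # (10, 100)
```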
The output of MatchPyramid is flattened into a 1D vector and fed into a fully connected multi-layer perceptron (MLP), followed by a softmax layer to perform the 3-way classification.

3.3. 5-way Ensemble Model

One way to utilize multiple models is to combine uni-modal model predictions in an ensemble classifier that predicts the final labels. As elaborated in Section 3.2, the 3-way textual entailment model is helpful for distinguishing the three-way entailment relationship based on linguistic and semantic clues between $q_{text}$ and $d_{text}$. To address multimodal entailment (as defined in Section 3.1), a simple relatedness measurement for the visual content of $q_{image}$ and $d_{image}$ is adopted in our approach, based on pairwise image similarity computed with a pre-trained CNN model (ResNet) as a proxy for visual entailment. As with text entailment, this is based on the hypothesis, supported by the salient correlation patterns observed in this dataset, that an article is relevant to a claim if the article contains images similar to the claim's images. Hence, an ensemble approach is proposed to combine textual entailment, the visual relatedness measurement and additional data-specific features. More specifically, the proposed ensemble model uses a basic decision tree classifier with the following feature encoding to provide end-to-end five-category classification, as depicted in Fig. 1:

• Length of text and OCR: four text length features of $(D, Q)$, representing the lengths of $q_{text}$, $OCR(q_{image})$, $d_{text}$ and $OCR(d_{image})$ respectively. OCR texts are measured and used as independent features here (cf. Section 4.4).
• Text entailment: two features consisting of a numeric representation (0, 1, 2) of the text entailment prediction (i.e., "insufficient", "support", "refute") along with the corresponding probability (Section 3.2).
• Image similarity: the pairwise cosine similarity score between $q_{image}$ and $d_{image}$, computed from the features obtained with a pre-trained ResNet-50 model.
• Image domain: two features encoded with a one-hot-encoding scheme on the source domain names of $q_{image}$ and $d_{image}$ (cf. Section 4.5).

Figure 1: Ensemble Model Architecture

3.4. 5-way Multimodal Entailment

We consider how to obtain attended multimodal information that can effectively capture the consistency and integrity of the multimedia content between $D$ and $Q$. Inspired by current advances in attention techniques [60, 61, 46, 47], we apply multiple attention mechanisms to learn the multimodal interaction between pairs of visual and textual content. Instead of local alignment approaches that model visual objects and textual words, we focus in this study on global alignment based methods that aim to map whole images and sentences into a joint semantic space. A popular framework for modeling the multimodal relationship is a multi-branch attention network, where typically one branch projects the image and another models the text; similarity is measured by a dot product of the normalised feature vectors. The extended MatchPyramid (elaborated in Section 3.2) is applied to model the high-level relevance between the text pair. In general, our end-to-end multimodal entailment architecture consists of an embedding layer, a text matching layer, a multimodal matching layer and a classification layer. Formally, as input representations, the image pair ($q_{image}$ and $d_{image}$) is represented by the top layer (before softmax) of a pre-trained convolutional network, and the text pair ($q_{text}$ and $d_{text}$) is mapped into vectors $t \in \mathbb{R}^{j}$ by a fixed word embedding layer initialised with GloVe embeddings.
$j$ denotes the word embedding dimension (e.g., $j$ is set to 50 for GloVe-50). There is no restriction on the choice of the image encoder, but the pre-trained ResNet-50 model is used in our experiments because of its simplicity. In the embedding layer, let $l$ be the dimension of an image visual vector (i.e., $l = 2048$ for ResNet-50), and let $m$ and $n$ be the number of words in $q_{text}$ and $d_{text}$ respectively. Let $q_i \in \mathbb{R}^{l}$ and $q_t \in \mathbb{R}^{j \times m}$ be the claim image embedding vector and word embedding matrix, respectively. Likewise, let $d_i \in \mathbb{R}^{l}$ and $d_t \in \mathbb{R}^{j \times n}$ be the document image embedding vector and word embedding matrix, respectively. Each 2048-dimensional feature vector ($q_i$ and $d_i$) is fed into a (non-trainable) linear layer that reduces the visual features from 2048 dimensions to a 512-dimensional vector space in this work. For the embeddings of the text pair, self-attention (SDP) is applied (as specified in Section 3.2) to both $q_t$ and $d_t$ before feeding them into a separate GRU layer to obtain their context sequence representations ($q_{t\_cxt} \in \mathbb{R}^{j \times o}$ and $d_{t\_cxt} \in \mathbb{R}^{j \times o}$) and the corresponding global representations (i.e., final states), denoted by $q_{t\_g} \in \mathbb{R}^{o}$ and $d_{t\_g} \in \mathbb{R}^{o}$, where $o$ denotes the GRU output dimension.

In the subsequent text matching layer, the same pipeline as specified above for the extended MatchPyramid is applied, in an attempt to model the high-level relevance of the article content ($d_{text}$) to the claim text ($q_{text}$) based on contextual word embedding interactions. The interaction feature matrix is calculated as the matrix dot product between $q_{t\_cxt}$ and $d_{t\_cxt}$, to which the deep hierarchical convolution layers of the MatchPyramid model are applied in order to extract an aggregated similarity feature vector $Z_{q\_d\_text} \in \mathbb{R}^{f}$, where $f$ is the output dimension of the flattened feature maps. The high-level matching patterns are then fed into a multi-layer perceptron (MLP) with dropout to produce the final matching score with learnable weights.

Multimodal latent interaction features are derived in the multimodal matching layer, which mainly consists of a visual matching layer and a cross-modal attention layer. Fundamentally, the multimodal matching layer aims to find the potential relevance of the document visual vector ($d_i$) to the claim vectors of visual or text context or both ($q_i$ and $d_{t\_g}$), which is hence critical for predicting the multimodal entailment relation for the target claim. As for the text embeddings, our visual matching layer applies self-attention (SDP) to the image pair embeddings ($q_i$ and $d_i$) in an attempt to capture the important features in each image. Then, a visual similarity feature is computed by applying a dot product to the two saliency-guided image embeddings with $L_2$ normalisation:

$V^{i}_{q,d} = \frac{\vec{q}_i \cdot \vec{d}_i}{\lVert \vec{q}_i \rVert \, \lVert \vec{d}_i \rVert}.$

This practice follows [62, 63, 47] and is a simple yet very efficient SoTA technique for handling variations in image illumination, viewpoint, texture and season. It also links to feature whitening, linear discriminant analysis and image saliency. In addition, a separate image Euclidean similarity feature

$E^{i}_{q,d} = \frac{1}{1 + \lVert q_i - d_i \rVert}$

is computed in an attempt to account for potential adversarial images, the same people in different scenes, cropped images, etc. [64, 65]. (A minimal sketch of these two similarity features is given below.)
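The two image-level similarity features can be sketched as follows. The code assumes the 2048-dimensional ResNet-50 embeddings are already available as NumPy arrays and is illustrative rather than the authors' implementation.

```python
import numpy as np

def visual_similarity(q_i: np.ndarray, d_i: np.ndarray) -> float:
    """V_{q,d}: dot product of the L2-normalised claim/document image embeddings."""
    return float(np.dot(q_i, d_i) / (np.linalg.norm(q_i) * np.linalg.norm(d_i)))

def euclidean_similarity(q_i: np.ndarray, d_i: np.ndarray) -> float:
    """E_{q,d} = 1 / (1 + ||q_i - d_i||), an inverse-distance similarity."""
    return float(1.0 / (1.0 + np.linalg.norm(q_i - d_i)))

# Toy usage with random stand-ins for ResNet-50 claim/document image embeddings.
rng = np.random.default_rng(0)
q_img_emb, d_img_emb = rng.normal(size=2048), rng.normal(size=2048)
print(visual_similarity(q_img_emb, d_img_emb),
      euclidean_similarity(q_img_emb, d_img_emb))
```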
In the cross-modal attention layer, multiple attention mechanisms are applied, aiming to learn to find and attend to the most relevant elements, in terms of both feature importance and relational mapping, among the self-attended visual and content vectors of claim and document. Specifically, the text-image interaction feature between the document image and the claim text is computed using the same SDP attention: $q^{i,t}_{cross\_attn} = \mathrm{Attn}_{sdp}(d_i, q_{t\_g})$. Likewise, the document image and document text interaction is computed via $d^{i,t}_{cross\_attn} = \mathrm{Attn}_{sdp}(d_i, d_{t\_g})$. Note that, due to the mismatched feature spaces of vision and text, a GRU is employed initially to align the image features with the text features in order to perform cross-modal learning. Finally, the resulting multimodal features $V_{multi} = V^{i}_{q,d} \oplus E^{i}_{q,d} \oplus q^{i,t}_{cross\_attn} \oplus d^{i,t}_{cross\_attn}$ are obtained by merging (with concatenation) the outputs of the two intermediate layers, before being fed into an MLP with dropout to generate fused higher-level features.

Finally, in the classification layer, the outputs of the two MLP layers are merged, along with the corresponding hypothesis multimodal representation ($q_{merge}$), into a combined representation with batch normalisation. $q_{merge}$ is the output of a separate MLP layer applied to the concatenation of $q_{t\_g}$ and the normalised $q_i$ ($q_{merge} \in \mathbb{R}^{o+l}$). The final representation is passed through dropout regularization before being fed to a softmax layer that outputs the probabilities of the five categories.

4. Factify Dataset

4.1. Data statistics

The five entailment categories are balanced in both the train and validation sets. As mentioned above, both the claim (hypothesis) and the document (premise) contain an image and a text of varying length. Optical character recognition (OCR) text extracted from the images is not counted separately here and is combined with the corresponding claim and document text respectively; the large claim word counts in the table are thus due to extracted OCR text. The data details for each set are shown in Table 3. More details about the dataset and the task can be found in [12].

Table 3: Data statistics
Data set sizes: Train pairs: 35,000; Validation pairs: 7,500; Test pairs: 7,500
Claim (Hypothesis) incl. OCR: min token count 1; max token count 19,105; mean token count 51.5
Document (Premise) incl. OCR: min token count 1; max token count 44,542; mean token count 1,010.5

4.2. Word Overlap distribution

Word overlap is an important indicator for modeling textual entailment, as well as of potential data bias. Naturally, when pairing claims with evidence sentences, the word overlap ratio will be higher on average for claims paired with their supporting evidence. However, models relying on word overlap perform poorly when dealing with the complexity of real-world examples (typically antonymous examples and adversarial attacks). In the VITAMINC [66] and FEVER [67] datasets, this bias is deliberately minimised in order to create challenging examples that require sentence-pair inference and cannot be solved by simple word matching techniques. Here, the word overlap distribution per class in the train and validation sets is presented in Table 4. The distribution indicates that the evidential premise data in Factify have a clearly higher word overlap ratio than the two insufficient-evidence categories.

Table 4: (Q, D) pair text word overlap distribution in train/val set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.0 | 1.0 | 0.299 | 0.273
Support_Text | 0.0 | 1.0 | 0.316 | 0.294
Insufficient_Multi. | 0.0 | 0.92 | 0.221 | 0.192
Insufficient_Text | 0.0 | 1.0 | 0.238 | 0.176
Refute | 0.0 | 1.0 | 0.406 | 0.346
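The exact overlap measure is not spelled out in the paper; the sketch below shows one plausible token-overlap ratio (shared distinct claim tokens over the claim vocabulary size), purely to illustrate the kind of statistic summarised in Table 4.

```python
import re

def word_overlap_ratio(claim_text: str, doc_text: str) -> float:
    """Fraction of distinct claim tokens that also appear in the document.

    This is an assumed definition for illustration; the paper does not give
    the exact formula used to produce Table 4.
    """
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9']+", s.lower()))
    claim_tokens, doc_tokens = tokenize(claim_text), tokenize(doc_text)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & doc_tokens) / len(claim_tokens)

print(word_overlap_ratio("The nation has lost a visionary leader",
                         "President Kovind said the nation has lost a visionary leader ..."))
```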
4.3. Image Similarity distribution

Image similarity is the most basic indicator of multimodal entailment. To empirically validate the intuition of a potential similarity correlation between $q_{image}$ and $d_{image}$ in the dataset (as mentioned in Section 3.3), an image relatedness analysis is conducted in this section. Similar to the potential bias from word overlap, we compute the image pairwise similarity distribution in the embedding space of a pre-trained ResNet-50 model over the train/val sets, as presented in Tables 5 and 6. As seen from the distribution over the five categories, the two text-related entailment categories have clearly lower pairwise image similarity than the multimodal evidential entailment categories.

Table 5: (Q, D) image pairwise similarity distribution in train set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.533 | 1.0 | 0.864 | 0.865
Support_Text | 0.327 | 1.0 | 0.704 | 0.725
Insufficient_Multi. | 0.428 | 0.999 | 0.835 | 0.833
Insufficient_Text | 0.408 | 0.971 | 0.703 | 0.722
Refute | 0.41 | 1.0 | 0.82 | 0.835

Table 6: (Q, D) image pairwise similarity distribution in val set
Category | min. | max. | mean | mdn.
Support_Multi. | 0.533 | 1.0 | 0.855 | 0.856
Support_Text | 0.393 | 1.0 | 0.72 | 0.74
Insufficient_Multi. | 0.578 | 0.996 | 0.846 | 0.844
Insufficient_Text | 0.383 | 0.936 | 0.71 | 0.73
Refute | 0.426 | 1.0 | 0.828 | 0.842

4.4. Text Length Distribution

The text and OCR text length distributions between (Q, D) in the train set are presented in Figs. 2 and 3. Clearly separable distribution patterns can be seen across the five categories in the claim and document text and their corresponding OCR text. As shown in Fig. 2, document text length varies most for 'Support_Multimodal' among the five entailment categories. The document lengths of the two insufficient categories share a similar range, and the 'Refute' category has the shortest documents. The claim length distribution shows a clear bias of 'Refute' examples towards shorter claims. While the remaining classes present similar ranges, 'Insufficient_Text' and 'Support_Multimodal' tend to include slightly shorter claims. In comparison, the OCR text of both claim and document in 'Refute' samples is surprisingly longer than in the other four categories. Motivated by these observations, we adopted text lengths as features in our ensemble model, as illustrated in Section 3.3.

Figure 2: Text length distribution over the five entailment categories
Figure 3: OCR text length distribution over the five entailment categories

4.5. Image Domain Bias

Source bias is one of the known and common problems in machine learning datasets [68]; it arises when most data samples are collected from the same source. This problem has several facets, mainly including selection bias, capture bias (bias due to particular data collection methods), label bias and negative set bias. We are interested in probing potential source bias in the Factify dataset, and a potential bias in image domains across the five categories came to our attention during data analysis. This is important, since multimedia metadata have proved to be valuable information and signals for fact verification [69, 5, 70] in real-world applications, such as domain/source credibility and the detection of image manipulation and tampering. Image (link) domains are extracted from all document samples of the train set and their distributions are computed across the five categories. As shown in Fig. 4, the analysis reveals a surprisingly strong correlation between image domains and each entailment category, in both claim and document. Motivated by this correlation analysis, image domains are employed as features in our ensemble model, as illustrated in Section 3.3.

Figure 4: Label Distribution by Image Domain in claim and document
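A minimal sketch of how the image-domain feature and its per-class distribution could be derived is shown below; the URLs are hypothetical and the choice of urllib.parse plus a simple counter is an illustrative assumption about tooling, not a description of the authors' exact code.

```python
from collections import Counter
from urllib.parse import urlparse

def image_domain(image_url: str) -> str:
    """Extract the host name used as the categorical 'image domain' feature."""
    return urlparse(image_url).netloc.lower()

# Toy per-class tally of document-image domains, the kind of distribution shown in Fig. 4.
samples = [  # (label, document image URL) pairs; URLs are made up for illustration
    ("Refute", "https://factchecker.example.org/img/claim-123.jpg"),
    ("Support_Multimodal", "https://news.example.com/photos/speech.jpg"),
    ("Refute", "https://factchecker.example.org/img/claim-456.jpg"),
]
per_class = Counter((label, image_domain(url)) for label, url in samples)
print(per_class)
```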
5. Experiments

5.1. Experimental Setting

To evaluate the performance of our baseline solutions, we use the weighted average F1 for benchmarking on the validation set. All experiments are run on a single NVIDIA A100 GPU with up to 20 GiB RAM.

5.2. Hypothesis Only Test

We conducted a hypothesis-only reliance test by using only the hypothesis information to train a baseline model. This is a commonly adopted approach [71, 72] in SNLI/RTE to verify the presence of data bias. The assumption is that, without any premise information, such a baseline should make a random guess among the five classes. We train two models, with and without images, and test the resulting accuracy of each model on the val set. The two models (Hypo_text and Hypo_text+img) are implemented with a similar architecture, consisting of a text processing component and/or a ResNet embedding layer, followed by two fully-connected (FC) layers. The text processing component extracts text features from the given hypothesis: it first generates a sequence of word embeddings for the given claim text, and the embedding sequence is then fed into a GRU [73] to output text context features of dimension 300. The image processing component involves BGR-to-RGB conversion, resizing images (to [300, 300, 3]) and feature extraction with ResNet-50, with a linear layer projecting the pre-trained embeddings to a 512-dimensional vector. The input and output dimensions of the two FC layers of the text-only model are [300, 300] and [300, 3] respectively; for the text+image model, the hidden layer dimensions are [300, 300] and [300, 3] respectively. Our experiments show that accuracy and weighted F1 reach the same value for each model on the val set: 0.60 for the text-only model and 0.64 for the text+image model, implying the existence of bias in the Factify dataset. The details are presented in the results section. Our proposed solutions outperform the two hypothesis-only baselines.

5.3. 3-way text entailment

Transformer fine-tuning settings: For the best entailment model, the pre-trained BigBird model from Huggingface and its implementation for pair-wise classification fine-tuning were used. For our experiments, the model was fine-tuned for 2 epochs, using the AdamW optimizer with learning rate 2e-5 and epsilon 1e-8, with batch size 4. The maximum sentence length was set to the mean length of the input texts, namely 1,396 tokens. To train the 3-way entailment model, the 5-way data categories were converted to 3-way categories: "Support" ("Support_Multimodal" + "Support_Text"), "Refute" and "Insufficient" ("Insufficient_Multimodal" + "Insufficient_Text"). OCR text from both $Q$ and $D$ was excluded.

MatchPyramid baseline model settings: We use GloVe embeddings with 50 dimensions (GloVe 6B 50d) for the text input, and the GRU output dimension is set to 50. The number of CNN layers is set to 2, each with kernel size 3 × 3, pooling size 5 × 10 and 'valid' mode (i.e., no padding). ReLU activation is placed between all convolutional layers. The convolution channels are set to 16 and 32 respectively. A 2-layer MLP is used, with hidden dimensions of 128 and 64 respectively. The maximum text lengths for $Q_{text}$ and $D_{text}$ are set to 100 and 1,000 respectively. Limited further experiments were conducted, including on the content context size of claim and document, global average pooling and convolution padding schemes.

5.4. 5-way Ensemble Model

The ensemble model is implemented using scikit-learn's DecisionTreeClassifier class (a minimal sketch is given below).
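The sketch below assumes the per-sample features described in Section 3.3 have already been computed; the feature ordering and dictionary keys are illustrative assumptions, while the classifier settings (gini criterion, best splitter, depth limit of 8) follow the description in the next paragraph.

```python
from sklearn.tree import DecisionTreeClassifier

def build_feature_vector(sample: dict) -> list:
    """Assemble the hand-crafted features of Section 3.3 for one claim/document pair.

    The dict keys are assumed names for the upstream outputs (text entailment
    prediction, ResNet-50 image similarity, one-hot image domains, text/OCR lengths).
    """
    return (
        [sample["len_q_text"], sample["len_q_ocr"],
         sample["len_d_text"], sample["len_d_ocr"]]          # text/OCR lengths
        + [sample["entail_label_id"], sample["entail_prob"]]  # 3-way entailment output
        + [sample["image_cosine_sim"]]                        # ResNet-50 similarity
        + list(sample["q_domain_onehot"])                     # one-hot claim image domain
        + list(sample["d_domain_onehot"])                     # one-hot document image domain
    )

# Decision tree settings reported in Section 5.4.
clf = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=8)

# X = [build_feature_vector(s) for s in train_samples]
# y = [s["label"] for s in train_samples]
# clf.fit(X, y)
# predictions = clf.predict([build_feature_vector(s) for s in val_samples])
```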
The best pre-trained transformer ("BigBird") based text entailment classifier is adopted, and the pairwise image similarity is computed based on ResNet-50. For image pre-processing and feature extraction, the same practice introduced in Section 5.2 is applied. The one-hot encoding function provided in scikit-learn is used to convert the two categorical features (i.e., the URL domains of $q_{image}$ and $d_{image}$) into a one-hot numeric array learnt from the train set. Text is pre-processed separately for the BigBird model and the ensemble model. No pre-processing is applied to the four text length features. For ensemble model training, we use the 'best' split with the 'gini' impurity criterion and limit the depth of the tree to 8 layers to avoid overfitting.

5.5. 5-way Multimodal classification

The 5-way end-to-end multimodal entailment model Multimodal_ent is implemented with Keras and TensorFlow (v2.4), using the Adam optimiser with an adaptive learning rate scheduler. The initial learning rate and weight decay are both set to 0.0001. The batch size is 32 and the maximum number of training epochs is set to 80. Optimal parameters and settings from the MatchPyramid baseline experiments are applied. A checkpoint callback is used to save the model that achieves the best validation accuracy. ReLU activation is applied to all convolution layers and fully-connected layers. The uniform He initialization ("he_uniform") is used for all ReLU layers. The same settings (layer size, hidden dimension, activation, etc.) are applied to the three separate MLP layers. A few parameters and architecture variants were experimented with, including varying lengths of claim and document content, MLP depth (1-2 layers), MLP hidden dimensions (64, 256, 512, 768, 1024) and merge strategies (concatenation and multiplication) for the three MLP outputs in the classification layer. An ablation study is conducted by removing individual sub-components, including the hypothesis MLP ($q_{merge}$), the document cross-modal interaction ($d^{i,t}_{cross\_attn}$) and the document-claim cross-modal interaction ($q^{i,t}_{cross\_attn}$); the ablation experiments show the effectiveness of the full model architecture. The best model (as reported in Section 6.2) is obtained at training epoch 9, with training stopped at epoch 14, using the optimal settings of the three-MLP architecture: 1-layer MLPs with dimension 256, MLP outputs merged by concatenation, and text input lengths for $q_{text}$ and $d_{text}$ of 100 and 1,000 respectively.

Table 7: Three-way classification results on val set (P / R / F1)
Category | MatchPyramid_glove50d | BigBird | LongFormer
Support | 0.77 / 0.74 / 0.76 | 0.83 / 0.86 / 0.85 | 0.83 / 0.86 / 0.84
Refute | 0.99 / 0.99 / 0.99 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00
Insufficient | 0.75 / 0.77 / 0.76 | 0.85 / 0.83 / 0.84 | 0.85 / 0.82 / 0.84
Weighted Avg. | 0.81 / 0.81 / 0.81 | 0.88 / 0.87 / 0.87 | 0.87 / 0.87 / 0.87

6. Results and Discussion

6.1. 3-way text entailment

The results of the 3-way text entailment models are presented in Table 7. To validate our model choice, we evaluated a few SoTA pre-trained transformer models, including BERT, RoBERTa, BigBird and LongFormer. The best performing models are BigBird and LongFormer, with the overall winner being BigBird because of its slightly better results and the smaller input context size required (1,396 vs 1,484 tokens respectively). For the architecture of the extended MatchPyramid baseline, we experimented with different parameters, such as longer context lengths for $Q_{text}$ (including 1,500, 2,000 and 3,000), a GloVe model with 300 dimensions, a larger GRU output dimension of 300 and various pooling sizes ([3, 10]); none of these attempts provided major improvements.
Overall, our baseline implementation with self-attention and GloVe-50d based contextual representation learning achieves optimal performance, which is competitive with the large transformer model based approaches, as presented in Table 7.

6.2. 5-way classification

The results of the 5-way Ensemble and Multimodal_ent models on the val set are presented in Table 8. Four of our baseline methods outperform all baseline models proposed by the task organisers, as reported in [12]. The result of the best baseline model (Multimodal_factify) from the Factify data paper is included in the table; the corresponding class-wise performance figures are not provided by the organisers.

Table 8: 5-way classification results on val set (P / R / F1; only F1 is available for Multimodal_factify)
Category | Multimodal_factify | Hypo_text | Hypo_text+img | Multimodal_ent | Ensemble
Support_Multimodal | n/a | 0.60 / 0.60 / 0.60 | 0.59 / 0.63 / 0.61 | 0.84 / 0.57 / 0.68 | 0.74 / 0.78 / 0.76
Support_Text | n/a | 0.48 / 0.43 / 0.45 | 0.55 / 0.51 / 0.53 | 0.51 / 0.66 / 0.58 | 0.71 / 0.71 / 0.71
Insufficient_Multimodal | n/a | 0.45 / 0.52 / 0.61 | 0.56 / 0.61 / 0.56 | 0.62 / 0.52 / 0.57 | 0.68 / 0.65 / 0.66
Insufficient_Text | n/a | 0.61 / 0.50 / 0.55 | 0.62 / 0.51 / 0.56 | 0.57 / 0.69 / 0.62 | 0.74 / 0.73 / 0.73
Refute | n/a | 0.87 / 0.90 / 0.88 | 0.93 / 0.93 / 0.93 | 0.99 / 0.97 / 0.98 | 1.0 / 1.0 / 1.0
Weighted Avg. | 0.54 | 0.60 / 0.60 / 0.60 | 0.64 / 0.64 / 0.64 | 0.71 / 0.68 / 0.69 | 0.77 / 0.77 / 0.77

Table 9: 5-way classification results on test set (P / R / F1)
Category | Multimodal_ent | Ensemble
Support_Multimodal | 0.81 / 0.60 / 0.69 | 0.76 / 0.78 / 0.77
Support_Text | 0.47 / 0.59 / 0.52 | 0.65 / 0.69 / 0.67
Insufficient_Multimodal | 0.61 / 0.53 / 0.57 | 0.73 / 0.64 / 0.68
Insufficient_Text | 0.56 / 0.66 / 0.60 | 0.71 / 0.73 / 0.72
Refute | 0.99 / 0.96 / 0.98 | 1.0 / 1.0 / 1.0
Weighted Avg. | 0.69 / 0.67 / 0.67 | 0.77 / 0.77 / 0.77

Unsurprisingly, our ensemble model achieved the best results on the val set with 0.77 F1, which is 8% higher than the result of the Multimodal_ent model. The experimental results demonstrate a large performance gain from the large pre-trained text entailment model, which works effectively on long paragraphs and contributes the most towards predicting the final 5-way categories. This is particularly obvious for the "Refute" label, whose samples mostly rely on text-based inference. It is not surprising that the features derived from heuristics and biases learned from the dataset proved effective for this multimodal prediction task. We found that differentiating between "Insufficient_Multimodal" and "Support_Text", or between "Insufficient_Text" and "Support_Multimodal", was the most challenging task without relying on data-specific features. In other words, when a sample contains supporting document text for the claim but the image is irrelevant, our model has low confidence in predicting the label as "Support_Text" or "Insufficient_Multimodal". Likewise, when the document image is relevant to the claim image in the same information context but the document text is irrelevant, our model has low confidence in predicting the correct label. The decision is highly dependent on the annotation bias. Of all the labels, "Refute" is the most distinguishable category and depends strongly on the text. Its performance is highly consistent among all our models and the participant systems in this competition (as seen in the leaderboard, Table 10). This is possibly mainly attributable to the article samples being selected from very few fact-checking sources that have highly distinctive linguistic clues (typically frequent negative words and the same verdict sentences appearing repeatedly in this category, such as "The claim is false").
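For reference, all model comparisons above use the weighted-average F1 described in Section 5.1; it can be computed with scikit-learn as in the following minimal sketch, where the label lists are toy placeholders rather than real predictions.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted 5-way labels; the real evaluation uses the full val/test sets.
gold = ["Refute", "Support_Text", "Support_Multimodal", "Insufficient_Text"]
pred = ["Refute", "Support_Text", "Insufficient_Multimodal", "Insufficient_Text"]

print(f1_score(gold, pred, average="weighted"))
```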
6.3. Competition Result

The final test set results and the competition leaderboard are presented in Tables 9 and 10 respectively. Our best model ("Ensemble") outperforms all competition systems and the best baseline models [74]. The test result of the Ensemble model is 0.77 average F1, which is the same as its result on the val set and 10% higher than the result of Multimodal_ent.

Table 10: Factify Official Leaderboard
Rank | Team | Support_Text | Support_Multi. | Insufficient_Text | Insufficient_Multi. | Refute | Final
1 | Logically | 81.843% | 87.429% | 84.437% | 78.345% | 99.899% | 76.819%
2 | Yet | 75.518% | 89.38% | 82.121% | 80.81% | 99.866% | 75.591%
3 | Truthformers | 77.65% | 85.057% | 79.421% | 84.482% | 98.819% | 74.862%
4 | UofA-Truth | 78.493% | 89.786% | 82.995% | 75.981% | 98.339% | 74.807%
5 | Yao | 68.881% | 81.61% | 84.836% | 88.309% | 100.0% | 74.585%
6 | Greeny | 74.947% | 86.018% | 80.382% | 82.858% | 99.125% | 74.28%
7 | GPTs | 71.575% | 79.032% | 75.363% | 79.275% | 100.0% | 69.461%
8 | Tyche | 75.0% | 75.259% | 85.496% | 68.823% | 99.159% | 69.203%
9 | MUM_NLP | 64.803% | 80.857% | 69.848% | 66.548% | 93.465% | 61.165%
- | BASELINE | 82.675% | 75.466% | 74.424% | 69.678% | 42.354% | 53.098%

7. Conclusion

We described our participation in the Multimodal Fact Verification (Factify) challenge with two proposed baseline solutions: an ensemble model and an end-to-end multimodal entailment model. The ensemble model based system outperforms the end-to-end model on both the val and test sets. The best performing model in this competition combines the results of a 3-way text entailment classifier, visual similarity from a pre-trained CNN model and heuristics learnt from the dataset. A multimodal fusion technique is also explored in this paper to model the interaction between the different modalities (i.e., text and image) in claim and document pairs and to combine information from them to learn the multimodal entailment relationship end-to-end. We found that the multimodal entailment based system suffers from overfitting. Apart from the limited training set size and the identified data bias, our experiments suggest that fine-grained image and text interaction models need to be explored further. We also found that ambiguous labels in the Factify dataset undermine the performance of our deep learning architecture. Creating a dataset for a complex real-world multimodal NLP problem, particularly natural language inference for multimodal verification, raises emergent challenges [75, 76] and is indeed a cumbersome task, and we appreciate the work of the Factify organizers; yet a more elaborate and unbiased dataset, along with well-defined annotation criteria, would make this dataset more suitable as a benchmark. More effort is required to tackle the dataset challenge of minimising hypotheses from human annotators and making the dataset better reflect real-world challenges. As an emergent research field, we hope our extensive data analysis and proposed baseline solutions can inspire further work.

References

[1] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, Language and Linguistics Compass 15 (2021) e12438.
[2] A. Kazemi, K. Garimella, D. Gaffney, S. A. Hale, Claim matching beyond English to scale global fact-checking, arXiv preprint arXiv:2106.00853 (2021).
[3] Y. Jang, C.-H. Park, Y.-S. Seo, Fake news analysis modeling using quote retweet, Electronics 8 (2019) 1377.
[4] K. Nakamura, S. Levy, W. Y. Wang, r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection, arXiv preprint arXiv:1911.03854 (2019).
[5] D. Zlatkova, P. Nakov, I. Koychev, Fact-checking meets fauxtography: Verifying claims about images, arXiv preprint arXiv:1908.11722 (2019).
[6] M. K. Elhadad, K. F. Li, F. Gebali, Detecting misleading information on covid-19, IEEE Access 8 (2020) 165201–165215.
[7] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. D. S. Martino, S. Shaar, H. Firooz, P. Nakov, A survey on multimodal disinformation detection, arXiv preprint arXiv:2103.12541 (2021).
[8] M. Sun, X. Zhang, J. Ma, Y. Liu, Inconsistency matters: A knowledge-guided dual-inconsistency network for multi-modal rumor detection, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1412–1423.
[9] E. Müller-Budack, J. Theiner, S. Diering, M. Idahl, S. Hakimov, R. Ewerth, Multimodal news analytics using measures of cross-modal entity and context consistency, International Journal of Multimedia Information Retrieval 10 (2021) 111–125.
[10] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: a simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
[11] C.-A. Deledalle, L. Denis, G. Poggi, F. Tupin, L. Verdoliva, Exploiting patch similarity for sar image processing: The nonlocal paradigm, IEEE Signal Processing Magazine 31 (2014) 69–78.
[12] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, S. Amit, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of the First Workshop on Multimodal Fact-Checking and Hate Speech Detection (DE-FACTIFY), 2022.
[13] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018).
[14] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction: Verifying scientific claims, arXiv preprint arXiv:2004.14974 (2020).
[15] J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Gear: Graph-based evidence aggregating and reasoning for fact verification, arXiv preprint arXiv:1908.01843 (2019).
[16] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, D. Inkpen, Enhanced lstm for natural language inference, arXiv preprint arXiv:1609.06038 (2016).
[17] A. Hanselowski, C. Stab, C. Schulz, Z. Li, I. Gurevych, A richly annotated corpus for different tasks in automated fact-checking, arXiv preprint arXiv:1911.01214 (2019).
[18] L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. W. S. Hoi, A. Zubiaga, Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours, arXiv preprint arXiv:1704.05972 (2017).
[19] J. Gu, J. Cai, S. R. Joty, L. Niu, G. Wang, Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7181–7189.
[20] S. Wang, Y. Chen, J. Zhuo, Q. Huang, Q. Tian, Joint global and co-attentive representation learning for image-sentence retrieval, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1398–1406.
[21] H. Nam, J.-W. Ha, J. Kim, Dual attention networks for multimodal reasoning and matching, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 299–307.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline for vision and language, arXiv preprint arXiv:1908.03557 (2019).
[24] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Learning universal image-text representations (2019).
[25] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, Vl-bert: Pre-training of generic visual-linguistic representations, arXiv preprint arXiv:1908.08530 (2019).
[26] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, arXiv preprint arXiv:1908.02265 (2019).
[27] H. Tan, M. Bansal, Lxmert: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[28] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
[29] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[30] P. Adarsh, P. Rathi, M. Kumar, Yolo v3-tiny: Object detection and recognition using one stage improved model, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 687–694.
[31] G. Yu, Q. Chang, W. Lv, C. Xu, C. Cui, W. Ji, Q. Dang, K. Deng, G. Wang, Y. Du, et al., Pp-picodet: A better real-time object detector on mobile devices, arXiv preprint arXiv:2111.00902 (2021).
[32] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al., Mlp-mixer: An all-mlp architecture for vision, arXiv preprint arXiv:2105.01601 (2021).
[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[34] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, S. Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, arXiv preprint arXiv:2101.11986 (2021).
[35] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, C. Shen, Conditional positional encodings for vision transformers, arXiv preprint arXiv:2102.10882 (2021).
[36] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, Z. He, A survey of visual transformers, arXiv preprint arXiv:2111.06091 (2021).
[37] D. A. Hudson, C. D. Manning, Gqa: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709.
[38] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, arXiv preprint arXiv:1811.00491 (2018).
[39] K. Desai, G. Kaul, Z. T. Aysola, J. Johnson, Redcaps: Web-curated image-text data created by the people, for the people, in: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
[40] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, IEEE, 2006, pp. 1735–1742.
[41] F. Schneider, Ö. Alaçam, X. Wang, C. Biemann, Towards multi-modal text-image retrieval to improve human reading, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, 2021.
[42] D. Kiela, S. Bhooshan, H. Firooz, E. Perez, D. Testuggine, Supervised multimodal bitransformers for classifying images and text, arXiv preprint arXiv:1909.02950 (2019).
[43] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
[44] P. Liu, X. Qiu, J. Chen, X.-J. Huang, Deep fusion lstms for text semantic matching, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1034–1043.
[45] M.-C. De Marneffe, B. MacCartney, T. Grenager, D. Cer, A. Rafferty, C. D. Manning, Learning to distinguish valid textual entailments, in: Second Pascal RTE Challenge Workshop, volume 62, Citeseer, 2006.
[46] N. Vo, K. Lee, Where are the facts? searching for fact-checked information to alleviate the spread of fake news, arXiv preprint arXiv:2010.03159 (2020).
[47] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task for fine-grained image understanding, arXiv preprint arXiv:1901.06706 (2019).
[48] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[50] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, arXiv preprint arXiv:2012.15723 (2020).
[51] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.
[52] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, arXiv preprint arXiv:1508.05326 (2015).
[53] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adversarial nli: A new benchmark for natural language understanding, arXiv preprint arXiv:1910.14599 (2019).
[54] J. Thorne, M. Glockner, G. Vallejo, A. Vlachos, I. Gurevych, Evidence-based verification for real world information needs, arXiv preprint arXiv:2104.00640 (2021).
[55] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big bird: Transformers for longer sequences, arXiv preprint arXiv:2007.14062 (2021).
[56] D. Stammbach, Evidence selection as a token-level prediction task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), 2021, pp. 14–20.
[57] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[58] S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, X. Cheng, Match-srnn: Modeling the recursive matching structure with spatial rnn, arXiv preprint arXiv:1604.04378 (2016).
[59] L. Pang, Y. Lan, J. Guo, J. Xu, X. Cheng, A study of matchpyramid models on ad-hoc retrieval, arXiv preprint arXiv:1606.04648 (2016).
[60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[61] Z. Li, Y. Li, H. Lu, Improve image captioning by self-attention, in: International Conference on Neural Information Processing, Springer, 2019, pp. 91–98.
[62] T. Malisiewicz, A. Gupta, A. A. Efros, Ensemble of exemplar-svms for object detection and beyond, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 89–96.
[63] M. Gharbi, T. Malisiewicz, S. Paris, F. Durand, A Gaussian approximation of feature space for fast image similarity (2012).
[64] L. Wang, Y. Zhang, J. Feng, On the euclidean distance of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1334–1339.
[65] A. Pedraza, O. Deniz, G. Bueno, Really natural adversarial examples, International Journal of Machine Learning and Cybernetics (2021) 1–13.
[66] T. Schuster, A. Fisch, R. Barzilay, Get your vitamin c! robust fact verification with contrastive evidence, arXiv preprint arXiv:2103.08541 (2021).
[67] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Evaluating adversarial attacks against multiple fact verification systems, Association for Computational Linguistics, 2020.
[68] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: CVPR 2011, IEEE, 2011, pp. 1521–1528.
[69] B.-C. Chen, L. S. Davis, Deep representation learning for metadata verification, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), IEEE, 2019, pp. 73–82.
[70] T. Prabhakar, A. Gupta, K. Nadig, D. George, Check mate: Prioritizing user generated multimedia content for fact-checking, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 15, 2021, pp. 1025–1033.
[71] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, N. A. Smith, Annotation artifacts in natural language inference data, arXiv preprint arXiv:1803.02324 (2018).
[72] H. T. Vu, C. Greco, A. Erofeeva, S. Jafaritazehjan, G. Linders, M. Tanti, A. Testoni, R. Bernardi, A. Gatt, Grounded textual entailment, arXiv preprint arXiv:1806.05645 (2018).
[73] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[74] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[75] R. Le Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. Peters, A. Sabharwal, Y. Choi, Adversarial filters of dataset biases, in: International Conference on Machine Learning, PMLR, 2020, pp. 1078–1088.
[76] S. Sharma, M. Dey, K. Sinha, Evaluating gender bias in natural language inference, arXiv preprint arXiv:2105.05541 (2021).