=Paper=
{{Paper
|id=Vol-3199/paper11
|storemode=property
|title=GPTs at Factify 2022: Prompt Aided Fact-Verification (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3199/paper11.pdf
|volume=Vol-3199
|authors=Saksham Aggarwal,Pawan Sahu,Taneesh Gupta,Gyanendra Das
|dblpUrl=https://dblp.org/rec/conf/aaai/AggarwalSGD22
}}
==GPTs at Factify 2022: Prompt Aided Fact-Verification (short paper)==
GPTs at Factify 2022: Prompt Aided Fact-Verification

Saksham Aggarwal1,2, Pawan Kumar Sahu1,2, Taneesh Gupta1,2 and Gyanendra Das1,2
1 Indian Institute of Technology (Indian School of Mines), Dhanbad, India
2 The authors contributed equally.

DE-FACTIFY: Workshop on Multi-Modal Fact Checking and Hate-Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
sakshamaggarwal20@gmail.com (S. Aggarwal); pawankumar.s.2001@gmail.com (P. K. Sahu); tanishgupta34@gmail.com (T. Gupta); gyanendralucky9337@gmail.com (G. Das)

Abstract

One of the most pressing societal issues is the fight against fake news. False claims, as difficult as they are to expose, cause a great deal of damage. To tackle the problem, fact verification becomes crucial and has therefore been a topic of interest among diverse research communities. Using only the textual form of the data, we propose our solution to the problem and achieve results competitive with other approaches. We present our solution based on two approaches - a PLM (pre-trained language model) based method and a prompt based method. The PLM based approach uses traditional supervised learning, where the model is trained to take an input 'x' and output a prediction 'y' as P(y|x). Prompt-based learning, in contrast, reflects the idea of designing the input to fit the model, so that the original objective can be re-framed as a (masked) language modelling problem. We can further stimulate the rich knowledge in PLMs to better serve downstream tasks by employing such prompts when fine-tuning them. Our experiments showed that the proposed method performs better than simply fine-tuning PLMs. We achieved an F1 score of 0.6946 on the FACTIFY dataset and 7th position on the competition leaderboard.

Keywords

Deep Learning, Factify, Prompt based learning, PLM, NLP, RoBERTa, DeBERTa, Stacking, Ensembling

1. Introduction

When it comes to news consumption, social media has two faces. On one hand, individuals consume news via social media because of its low cost, quick access, and rapid transmission of information. On the other hand, it facilitates the widespread circulation of fake news, i.e., low-quality news containing false or misleading information. As per a study, Facebook engagements with fake news sites average roughly 70 million per month [1]. Fake news impacts government, media, individuals, health, law and order, as well as the economy. Its spread has been blamed for incidents ranging from ethnic violence, inter-racial violence, and religious conflicts to mass riots. Thus, combating fake news is one of the burning societal crises.

Our work in the competition tries to deal with the above-mentioned fact-checking problem using the Factify dataset. FACTIFY is a multi-modal fact verification dataset consisting of five classes - "Support-Multimodal", "Support-Text", "Insufficient-Multimodal", "Insufficient-Text" and "Refute" - categorized based on the cross-relationship of visual and textual data (see Section 3). Though visual data was provided, our solution uses only textual information and achieves results competitive with the solutions that leverage visual information.
In a standard supervised learning setting for NLP, we take an input 'x', generally text, and output a prediction 'y' based on a model P(y|x). 'y' might be a label, a text string, or any other type of output. We use a dataset of input-output pairs to learn the parameters of this model, training it to predict the conditional probability. This is generally done by following a pretrain-finetune strategy, adapting pretrained language models (PLMs) with additional task-specific data. Prompt-based learning, in contrast, builds on language models that directly estimate the likelihood of text. To use these models for prediction tasks, the original input 'x' is converted into a textual prompt 'xprompt' with some unfilled slots using a template, and the language model is then used to probabilistically fill in the missing information to produce a final string, from which the final output 'y' can be derived.

We model our solution on both approaches and show a way to use the prompt-based learning technique to aid in the classification. The first approach - the traditional pretrain-finetune method (also referred to as the PLM based method below) - focuses on fine-tuning different pretrained models with some pre-processing, later aggregated using ensembling techniques. Our second approach - the prompt based method - divides the 5-class classification problem into two parts: first, a binary classification task to efficiently separate one of the classes from the other four, and second, a 4-class classification task which is handled similarly to the first approach. We observe that merging the prompt based technique with the traditional approach boosts the score by aiding in the efficient separation of one of the classes. The task report for Factify 2022 can be found in [2].

2. Related Work

Pretrained language models (PLMs) such as RoBERTa [3], GPT [4], T5 [5] and BERT [6] have proven themselves as powerful tools for text generation and language understanding. These models are capable of capturing a plethora of semantic [7], linguistic [8] and syntactic [9] knowledge by leveraging the large-scale corpora now available. By finetuning these PLMs on additional task-specific data, the rich knowledge in the language models can be propagated to multiple downstream tasks. The demonstration of outstanding performance on nearly all key language tasks by simply finetuning pretrained language models has led to a consensus in the community to finetune PLMs rather than learn models from scratch [10].

Despite the effectiveness of fine-tuning pre-trained language models, several recent studies have found that one of the most significant challenges is the substantial gap between the 'pre-training' and 'fine-tuning' objectives, which prevents taking full advantage of the knowledge in PLMs. Although 'pre-training' is typically formalized as a cloze-style task (e.g. MLM), downstream tasks in 'fine-tuning' exhibit different objective forms such as sequence labelling, classification and generation. This discrepancy obstructs the transfer and adaptation of PLM knowledge to downstream tasks. [11] proposes prompt tuning to bridge this gap between the pretraining strategy and subsequent finetuning on downstream tasks.

Figure 1: An example of pre-training, fine-tuning, and prompt tuning.

To understand the basic idea, consider the example of a dance performance classification task (see Figure 1 for reference). A typical prompt would consist of a template (e.g. ". The guests appreciated the dance performance. It was [MASK].") and a set of label words (e.g. "outstanding" and "worse"). The set of label words serves as the candidate set for predicting [MASK]. In this way, the original input is modified with the help of the prompt template to predict [MASK], which is then mapped to the corresponding labels, thereby converting a classification task into a cloze-style task. Simply put, we can make the PLM predict "outstanding" or "worse" at the masked position, which is then used to derive the sentiment (i.e. positive or negative).
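A minimal sketch of this cloze-style scoring follows; the roberta-base checkpoint, template wording and single-token label words are illustrative assumptions, not an exact implementation.

```python
# Minimal cloze-style classification sketch (illustrative, not the exact implementation).
# A masked LM scores candidate label words at the [MASK] position; the label word
# with the higher probability decides the class (positive vs. negative sentiment).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "roberta-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

# Template T(x): the original input followed by a cloze slot.
x = "The guests appreciated the dance performance."
prompt = f"{x} It was {tokenizer.mask_token}."

# Verbalizer: label words -> classes (illustrative choice).
label_words = {"outstanding": "positive", "worse": "negative"}

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
probs = logits.softmax(dim=-1)

scores = {}
for word, cls in label_words.items():
    # Take the first sub-token of the label word (single-token words assumed).
    word_id = tokenizer(" " + word, add_special_tokens=False)["input_ids"][0]
    scores[cls] = probs[word_id].item()

print(max(scores, key=scores.get), scores)  # e.g. 'positive'
```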
Prompt tuning methods have also achieved promising results on other few-class classification tasks, such as natural language inference [12]. Taking inspiration from both approaches, i.e. Pre-train->Fine-tune and Pre-train->Prompt->Predict, we present our solution for the multi-modal fact verification dataset based on both of them.

3. Dataset

Factify is a dataset on multi-modal fact verification [13]. It contains the textual claim, an image of the claim, a reference textual document and a reference image. The images are accompanied by their respective OCR texts. The train data includes 35,000 instances with a proper balance among all five classes - 'support-text', 'insufficient-multimodal', 'support-multimodal', 'insufficient-text' and 'refute'. The validation dataset contains 7,500 instances with an equal distribution across the classes. A short description of the classes is provided in Table 1.

Table 1: Description of the classes of the dataset.
Class | Text | Image
Support Multimodal | Text is entailed | Image is entailed
Support Text | Text is entailed | Image is not entailed
Insufficient Multimodal | Text is not entailed | Image is entailed
Insufficient Text | Text is not entailed | Image is not entailed
Refute | Fake claim | Fake image

4. Method

In our solution we have used pretrained RoBERTa, DeBERTa, XLM-RoBERTa and ALBERT models.

RoBERTa [3] (Robustly Optimized BERT Pre-training Approach) improves on BERT [6] by modifying and optimizing its architecture and training procedure. It makes some key changes to BERT, including the removal of the Next Sentence Prediction (NSP) objective. RoBERTa also dynamically changes the masking pattern by duplicating the training data and masking it 10 times, each time with a different strategy. It uses larger mini-batches for training, which improves perplexity on the masked language modelling objective and also makes it easier to parallelize via distributed data parallel training.

DeBERTa [14] (Decoding-enhanced BERT with disentangled attention) is a pretrained Transformer-based neural language model which improves on BERT and RoBERTa using two novel techniques. It employs disentangled self-attention which, unlike BERT, uses two vectors to encode the word (content) embedding and the positional embedding instead of using their summation as a single vector. Secondly, it proposes the Enhanced Mask Decoder (EMD), which takes into account both the relative and the absolute positions of the words.

XLM-RoBERTa [15] is a pretrained multilingual version of RoBERTa which outperforms multilingual BERT. It is pretrained with the masked language modelling objective on 2.5TB of filtered CommonCrawl data covering 100 languages, which makes it capable of producing results in 100 different languages.

ALBERT [16] is a Transformer architecture based on BERT which reduces its model size (18x fewer parameters) without deteriorating performance. It uses two parameter reduction techniques, namely factorized embedding parameterization and cross-layer parameter sharing. ALBERT showcases an excellent trade-off between a huge size reduction and a slight performance drop.
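As a rough illustration of this setup, the sketch below loads the four backbones as 5-way sequence classifiers with the Hugging Face transformers library; the specific checkpoint names are illustrative assumptions, since the exact model variants are not central to the description.

```python
# Illustrative sketch: the four pretrained backbones used in the PLM based method,
# each wrapped with a 5-way classification head (checkpoint names are assumed).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 5  # support-text, support-multimodal, insufficient-text,
                 # insufficient-multimodal, refute

CHECKPOINTS = {
    "roberta": "roberta-base",
    "deberta": "microsoft/deberta-base",
    "xlm-roberta": "xlm-roberta-base",
    "albert": "albert-base-v2",
}

models, tokenizers = {}, {}
for name, ckpt in CHECKPOINTS.items():
    tokenizers[name] = AutoTokenizer.from_pretrained(ckpt)
    models[name] = AutoModelForSequenceClassification.from_pretrained(
        ckpt, num_labels=NUM_CLASSES
    )
```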
4.1. Data Preprocessing

The train data consists of the claim text, claim OCR text, document text and document OCR text. For training, we concatenated the claim text with the OCR text and clipped the maximum length to 256 tokens. We also tried training on the document text, but the results were poor. We used a stratified 5-fold cross-validation strategy for training all of our models.

Table 2: Validation scores for the various models trained, along with the score obtained by stacking these models.
Model-Name | Validation-Score
DeBERTa | 0.7305
RoBERTa | 0.7082
XLM-RoBERTa | 0.6985
ALBERT | 0.6871
Stacking Ensemble | 0.7360

4.2. PLM Based Method

In this method we used the pretrained RoBERTa, DeBERTa, XLM-RoBERTa and ALBERT models and finetuned them on the given dataset. This approach uses traditional supervised learning, which trains a model to take an input x and predict an output y as P(y|x); here x denotes the textual data, y denotes the class set and P(y|x) denotes the conditional probability. The predictions from all four models are then ensembled to boost the final score. The validation scores are given in Table 2. The training strategy and the ensembling details are described in the Experiments section.

4.3. Prompt Based Method

This approach makes use of the 'prompt-based learning' methodology to aid in the classification. A prompt consists of a template T(·) and a set of label words V. For each instance x, the template is first used to map x to the prompt xprompt = T(x). For each v ∈ V, we produce the probability that the token v can fill the masked position. The predicted token v is then mapped to its defined class. This step, known as answer mapping, maps the distribution over the label set V to a distribution over the class set Y.

Using this technique we filter out the 'refute' class more efficiently, leveraging the rich knowledge stored in the pretrained models. We transformed the multi-class classification task into a binary classification in which the 'refute' class represents the negative class and all the other classes ('support-text', 'insufficient-multimodal', 'support-multimodal' and 'insufficient-text') combined represent the positive class. The template we used was " . The statement is ". The label set V comprised words like 'false', 'irrelevant', 'incorrect', etc. mapped to the negative class of Y, and words like 'true', 'relevant', 'correct', etc. mapped to the positive class of Y. The model used was pretrained RoBERTa. This prompt setting was motivated by the nature of the 'refute' class and the fact that prompt-based learning is efficient for binary classification. Once the instances of the 'refute' class are separated, the task at hand becomes a multi-class classification with 4 classes instead of 5, which is then completed using the traditional supervised learning (PLM based) approach.
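A minimal sketch of this refute filter follows; the template wording, label words, checkpoint and truncation handling are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Illustrative sketch of the prompt-based refute filter: a masked LM scores a set
# of label words at the mask slot and the probability mass is aggregated per class.
# Template wording, label words and checkpoint are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

# Verbalizer V -> Y: negative label words map to 'refute', positive ones to 'other'.
VERBALIZER = {
    "refute": ["false", "irrelevant", "incorrect"],
    "other": ["true", "relevant", "correct"],
}

def classify(claim: str, document: str) -> str:
    # T(x): claim and reference document followed by the cloze statement.
    prompt = f"{claim} {document} The statement is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, truncation=True, max_length=256, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()
    if mask_pos.numel() == 0:          # mask truncated away: fall back to 'other'
        return "other"
    with torch.no_grad():
        probs = model(**inputs).logits[0, mask_pos[-1].item()].softmax(-1)
    scores = {}
    for cls, words in VERBALIZER.items():
        # First sub-token of each label word (single-token words assumed).
        ids = [tokenizer(" " + w, add_special_tokens=False)["input_ids"][0] for w in words]
        scores[cls] = probs[ids].sum().item()
    return max(scores, key=scores.get)   # instances scored as 'refute' are filtered out

print(classify("The earth is flat.", "Satellite imagery shows a spherical planet."))
```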
5. Experiments

5.1. Pretraining

To pretrain our baselines we employed the Masked Language Model (MLM) method, which substitutes a special mask token for randomly picked tokens in the input sequence. Using BCE loss, the MLM attempts to predict the masked tokens. 15% of the input tokens were chosen uniformly for replacement, with the mask token replacing 80% of them, a randomly selected vocabulary token replacing 10%, and the remaining 10% kept unmodified. The model is optimised with AdamW using an L2 weight decay of 0.01. The learning rate is warmed up to a peak value of 1e-4 over the first 500 iterations and then linearly decreased. Models are pre-trained for 3000 iterations with a mini-batch size of 64 and a maximum sequence length of 256.

5.2. Finetuning

We took the pretrained models from above and finetuned them on the given dataset using AdamW as the optimizer with a weight decay of 0.01. The learning rate is warmed up over 100 iterations to a peak value of 5e-6 and decayed using cosine annealing. Models are finetuned for 2000 iterations with a mini-batch size of 32 and a maximum length of 256 tokens.

5.3. Ensembling

We used the following ensembling techniques, which gave a significant boost to our scores.

5.3.1. Snapshot

Training multiple deep models for bagging requires heavy computation. Using snapshot ensembling [17], we can obtain different models by training our model only once, converging to several local minima along its optimization path. An ensemble of these models makes more general-purpose predictions, thereby boosting the system's performance. To obtain repeated convergence we used cyclic learning rate schedules, e.g. the OneCycleLR scheduler.

5.3.2. Stacking

Out-of-fold predictions were generated for all the models, i.e. while performing the 5-fold cross-validation training strategy, we predicted on the held-out fold of each split and concatenated the results. These out-of-fold predictions can be used for ensembling: a 3-layer neural network was trained on them to predict the target labels, and this network was then used as a head on the test predictions generated by the transformer-based models. This method, known as stacking, gave a significant boost to our validation score (0.7360).
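To illustrate the stacking head, the sketch below trains a small 3-layer network on concatenated out-of-fold class probabilities and applies it to the concatenated test predictions; the hidden sizes, learning rate and epoch count are illustrative assumptions.

```python
# Illustrative stacking sketch: a 3-layer network is trained on the concatenated
# out-of-fold probabilities of the base models and then applied to their test
# predictions. Shapes, sizes and hyperparameters here are assumptions.
import torch
import torch.nn as nn

N_MODELS, N_CLASSES = 4, 5

stacker = nn.Sequential(                      # 3-layer meta-classifier head
    nn.Linear(N_MODELS * N_CLASSES, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, N_CLASSES),
)

def train_stacker(oof_probs: torch.Tensor, labels: torch.Tensor, epochs: int = 50):
    """oof_probs: (n_train, N_MODELS * N_CLASSES) out-of-fold probabilities;
    labels: (n_train,) class indices as a long tensor."""
    opt = torch.optim.AdamW(stacker.parameters(), lr=1e-3, weight_decay=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(stacker(oof_probs), labels)
        loss.backward()
        opt.step()

def predict(test_probs: torch.Tensor) -> torch.Tensor:
    """test_probs: (n_test, N_MODELS * N_CLASSES) concatenated test predictions."""
    with torch.no_grad():
        return stacker(test_probs).argmax(dim=-1)
```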
6. Result

We ranked 7th on the final leaderboard. Our results are given in Table 3. The solution consists of two different approaches: Method 1 refers to the stacking of the finetuned models, including DeBERTa, RoBERTa, XLM-RoBERTa and ALBERT, while Method 2 refers to the prompt-based approach explained in the Method section. Method 2 gave us better results on the final test dataset, owing its boost to prompt based learning.

Table 3: Results based on our two methods. Method 1 refers to stacking of the finetuned models. Method 2 refers to the prompt based approach.
Method | Support-Text | Support-Multimodal | Insufficient-Text | Insufficient-Multimodal | Refute | Final
Method 1 | 0.726 | 0.7912 | 0.7437 | 0.7755 | 0.996 | 0.6902
Method 2 | 0.715 | 0.7903 | 0.7536 | 0.7927 | 1.0 | 0.6946

7. Conclusion

The task was to classify the given claim into one of the five labels: 'support-text', 'insufficient-multimodal', 'support-multimodal', 'insufficient-text' and 'refute'. We followed two approaches for this task: 1) finetuning transformer-based models on our dataset and 2) using prompt based learning to decide whether the claim is a refute and then using transformer-based models for the downstream task. We carefully evaluated the different methods we used and found that the prompt based method gave us a significant boost. Pretraining with MLM on the given dataset also gave a small boost compared to simply finetuning the pretrained models. We also found that pretraining with MLM for more iterations with a bigger batch size, and dynamically changing the masking pattern applied to the training data, improved the score. A snapshot ensemble of various checkpoints also boosted our single-model score. Performing stacking on our predictions further improved the results and outperformed the mean ensemble of our predictions.

References

[1] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, 2021. arXiv:2109.11427.
[2] P. Patwa, S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Benchmarking multi-modal entailment for fact verification, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[3] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[4] A. Radford, K. Narasimhan, Improving language understanding by generative pre-training, 2018.
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[7] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? A closer look at polysemous words, in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Online, 2020, pp. 156-162. URL: https://aclanthology.org/2020.blackboxnlp-1.15. doi:10.18653/v1/2020.blackboxnlp-1.15.
[8] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651-3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.
[9] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4129-4138. URL: https://aclanthology.org/N19-1419. doi:10.18653/v1/N19-1419.
[10] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models for natural language processing: A survey, Science China Technological Sciences 63 (2020) 1872-1897. URL: http://dx.doi.org/10.1007/s11431-020-1647-3. doi:10.1007/s11431-020-1647-3.
[11] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[12] T. Schick, H. Schütze, Exploiting cloze questions for few shot text classification and natural language inference, 2021. arXiv:2001.07676.
[13] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[14] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. arXiv:2006.03654.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020. arXiv:1911.02116.
[16] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. arXiv:1909.11942.
[17] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, K. Q. Weinberger, Snapshot ensembles: Train 1, get M for free, 2017. arXiv:1704.00109.