=Paper=
{{Paper
|id=Vol-3180/paper-32
|storemode=property
|title=SCUoL at CheckThat! 2022: Fake News Detection Using Transformer-Based Models
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-32.pdf
|volume=Vol-3180
|authors=Saud Althabiti,Mohammad Ammar Alsalka,Eric Atwell
|dblpUrl=https://dblp.org/rec/conf/clef/AlthabitiAA22
}}
==SCUoL at CheckThat! 2022: Fake News Detection Using Transformer-Based Models==
Saud Althabiti (University of Leeds, Leeds, United Kingdom, scssal@leeds.ac.uk; King Abdulaziz University, Jeddah, Saudi Arabia, salthabiti@kau.edu.sa), Mohammad Ammar Alsalka (University of Leeds, Leeds, United Kingdom, m.a.alsalka@leeds.ac.uk) and Eric Atwell (University of Leeds, Leeds, United Kingdom, e.s.atwell@leeds.ac.uk)

(CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, CEUR-WS.org.)

Abstract
The fifth edition of the CheckThat! Lab is part of the 2022 Conference and Labs of the Evaluation Forum (CLEF) and aims to evaluate advances in three factuality-related tasks covering several languages. Our team (SCUoL) participated in Task 3A, which concentrates on multi-class fake news detection for English news articles. This paper describes our approach, including several experiments exploring different machine learning and transformer-based models. Furthermore, we employed an additional dataset to support our proposed model. On the validation set, the best-performing machine learning classifier, LinearSVC, achieved cross-validation scores of over 60%, while the pre-trained BERT model exceeded all other models in this task. On the test set, we obtained an F1 score of approximately 0.305, compared to the other participants' average F1 of 0.252.

Keywords
Fake News Detection, Misinformation, Misleading Information, CLEF 2022, CheckThat! Lab

1. Introduction
Fake news has evolved into a severe threat that may cause increased political, financial and societal losses, making the detection of such news significant because, unlike traditional newspapers, online information can mislead a broader range of communities. In addition, a single statement may mix facts with incorrect content, which can be challenging to identify. Therefore, various research studies have developed models to tackle this difficulty. The CheckThat! Lab [1]–[5] provides annual competitions divided into three categories, namely check-worthiness estimation, verified claim retrieval, and fake news detection (FND). In this experiment, we participated in the third task [6]. The task aims to detect fake news articles written in English and to identify their topical domains; we focused only on the first part (Task 3A). This subtask provided a dataset of almost 1,300 articles, divided into training and development sets, in which each article's claim is to be determined as false, partially false, true, or other [7]–[9]. The main objective of this experiment is therefore to classify real-world news articles into predefined categories. Each text has been labelled with a specific rating; hence, this is a supervised text classification problem aiming to classify upcoming news based on the article's content.
In this paper, we first discuss related topics and previous studies in Section 2. Then, we analyze the datasets and describe the methodology in Sections 3 and 4, respectively. The following section discusses the results obtained after training the model, and finally, we conclude and suggest future work in the last section.
2. Related Work
Defining the credibility of claims is a research problem that has drawn considerable attention in recent years, and various studies have developed methods to address it [9]–[14]. Since this topic has become prominent in many languages, several surveys have attempted to systematically review the suggested techniques and practical approaches. For instance, the study by Shu and Liu [15] details fake news detection methods in five categories: linguistic, topic-agnostic, knowledge-based, traditional machine learning, and hybrid approaches. Furthermore, they examined how these methods could be interlinked and used jointly. Other studies [12], [16] provide a descriptive tutorial that reviews FND techniques and methods from four different viewpoints. The first viewpoint evaluates authenticity by extracting the stated facts and comparing them with existing knowledge. A second perspective examines the writing style of the given news, since manipulators who aim to distribute fake news usually spread distorted messages intended to persuade others [16]. The third perspective concerns the propagation of information: the route along which a news message spreads forms a network that could hold indications for early fake news detection. Finally, a source-based method is another employed idea; it relies mainly on the source of a particular post, such as the original news authors, the publishers who conveyed the news, or the person who shared the posted news [17].
The CheckThat! Lab is part of the Conference and Labs of the Evaluation Forum (CLEF) [4], [7]. It has run contests since 2018 [1] and each year assesses competitors' systems on tasks related to factuality in different languages. It is divided into three challenges [4]: the first is to predict which tweets are worth fact-checking [18]. Our team (SCUoL) participated in this competition last year and achieved the third-best result among eight other participating teams [18], [19]. The second challenge is to decide whether a posted claim has been previously verified [20]. The last one is Task 3, which aims to predict the veracity of a news article [7].

3. Datasets
3.1. Dataset provided by the competition
In this competition, the data was collected from 2010 to 2022 and covers several topics, such as elections and COVID-19 [21]. The provided English training dataset contains about 1,300 articles with four main features: the public id, the text, the title, and the rating (class label) of each article. In addition, the testing dataset includes more than 600 English articles with the same features except for the rating, which our proposed model aims to predict. The text provides the most important features for solving this multi-class classification problem and determining the veracity of an article [22]. Therefore, we decided to analyze only the article text together with the rating. There are four different classes in the provided dataset (false, partially false, true, and other). We initially represent each category as a number, since the proposed transformer model only accepts numerical classes.

3.2. External datasets
Team NoFake, the winner of last year's competition, used additional datasets [23], which increased their model's performance. Therefore, we used an external dataset in our experiments, the Fakenews Classification Datasets from a Kaggle competition (https://www.kaggle.com/datasets/liberoliber/onion-notonion-datasets). It contains more than 21,000 factual articles and over 23,000 fake articles.

4. Methods
4.1. Text pre-processing and machine learning models
We conducted several experiments using various methods, including traditional machine learning and transformer-based models. Initially, the training datasets were divided into features and labels. For machine learning (ML) algorithms to make predictions, all words within each article's text have to be transformed into vectors. In this experiment, we used a statistical measure called TF-IDF (Term Frequency – Inverse Document Frequency), which evaluates how relevant each word is to a specific document within a collection of documents [24]. Words that appear in most or all documents, such as "that", "who", and "which", receive a lower weight because they do not add valuable information for our predictions. The importance of each word is determined by multiplying TF by IDF. In addition, we set some hyperparameters in the vectorizer to remove unnecessary words, such as stop words. Then, we unified labels that indicate the same meaning and converted them into numerical features; for example, "False", "false", and "untrue" are all represented as "1", and we applied the same to the other labels. The last pre-processing step is splitting the data into training and testing sets with approximate ratios of 70% and 30%, respectively. After that, we used four machine learning algorithms frequently applied to classification problems: Random Forest Classifier (RFC), Linear Support Vector Classifier (SVC), Multinomial Naive Bayes (MNB), and Logistic Regression (LR). We then trained these models on the split training dataset and estimated the average prediction score over five folds of cross-validation.
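As an illustration of this pipeline, the following minimal sketch (Python, with pandas and scikit-learn) shows TF-IDF vectorization with stop-word removal, label unification, a 70/30 split, and five-fold cross-validation over the four classifiers. The file name, column names, label mapping, and vectorizer settings are illustrative assumptions, not our exact configuration.

```python
# Minimal sketch of the Section 4.1 baseline; file/column names are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("task3a_train.csv")  # hypothetical file name

# Unify label variants and convert them to numerical classes.
label_map = {"false": 1, "untrue": 1, "partially false": 2, "true": 3, "other": 4}
y = df["rating"].str.strip().str.lower().map(label_map)
mask = y.notna()  # drop rows whose label variant is not in the map

# TF-IDF vectorization; English stop words are removed so uninformative words get no weight.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X = vectorizer.fit_transform(df.loc[mask, "text"].fillna(""))
y = y[mask]

# Approximate 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare the four classifiers using 5-fold cross-validation on the training split.
models = {
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```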
4.2. Transformer-based models
In addition to the traditional machine learning algorithms, we aimed to utilize transformer-based models. To simplify the process of using such models, we employed an NLP library called "Simple Transformers", which includes multiple model classes, each intended for a particular task. For example, the "ClassificationModel" is designed for binary and multi-class text classification; the library also supports other tasks, such as named entity recognition, multi-label text classification, question answering, and language generation. The applied Simple Transformers model adds a classification layer on top of the chosen transformer model. This layer has four output neurons, one for each class (true, false, partially false, or other). After creating the model, we specified the selected pre-trained model types and architectures from the following supported models:
● BERT: Bidirectional Encoder Representations from Transformers. BERT differs from other language models because it is open-source and developed to pre-train deep bidirectional language representations exclusively from plain unsupervised text [25].
● XLNet: a generalised autoregressive pretraining model. It combines bidirectional context with avoiding independent predictions, overcoming the limitations of BERT through its autoregressive formulation [26].
● RoBERTa: a Robustly Optimised BERT Pretraining Approach, built on BERT with modified hyperparameters; for instance, it is trained for a longer time with bigger batches and larger learning rates [27].
● DistilBERT: another pre-trained transformer model. Unlike the previous models, it is a distilled version that reduces the size of a BERT model by 40 per cent and makes it 60 per cent faster while retaining more than 95 per cent of its language understanding abilities [28].
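The following sketch shows how such a classifier can be set up with the Simple Transformers library, using the hyperparameters reported in Section 5 (bert-large-cased, a learning rate of 1e-5, and five epochs). The DataFrame contents and the remaining settings are illustrative assumptions rather than our exact training script.

```python
# Minimal sketch of the Section 4.2 set-up with Simple Transformers.
# The toy DataFrame is a placeholder; labels must be integers in {0, 1, 2, 3},
# one per class (true, false, partially false, other).
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

train_df = pd.DataFrame(
    {"text": ["example article text ..."], "labels": [1]}
)

model_args = ClassificationArgs(
    num_train_epochs=5,       # best value found in our hyperparameter search
    learning_rate=1e-5,       # best value found in our hyperparameter search
    overwrite_output_dir=True,
)

# A classification head with four output neurons on top of bert-large-cased.
model = ClassificationModel(
    "bert", "bert-large-cased", num_labels=4, args=model_args, use_cuda=False
)

model.train_model(train_df)
predictions, raw_outputs = model.predict(["another news article to classify"])
```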
5. Results and Discussion
This section illustrates and discusses the results of the various experiments we conducted in this competition. As described in Section 4, we investigated multiple ML algorithms (RFC, SVC, MNB, and LR) and transformer-based models (BERT, XLNet, RoBERTa, and DistilBERT). First, we assessed the average prediction scores over five cross-validation folds. We used only the accuracy metric in this first experiment to choose the best-performing ML classifier, and then compared the selected classifier with the transformer-based models using the macro F1 score. The initial results indicate that the SVC usually outperforms the other traditional ML models, as exhibited in Figure 1, with an average accuracy score of 0.61 compared to LR, MNB, and RFC with 0.57, 0.52, and 0.52, respectively.

Figure 1: Applying the cross-validation score function to the four ML models

After that, we employed a Simple Transformers multi-class classification model and fine-tuned both the learning rate and the number of epochs. Since the BERT model has been shown to provide state-of-the-art results in many studies, we employed it as the Simple Transformers model type. We conducted over 20 experiments using bert-large-cased with learning rates ranging from 1.00E-4 to 1.00E-6 and numbers of epochs ranging from 5 to 25. We concluded that the best combination of these hyperparameters was a learning rate of 1.00E-5 with five epochs. We then examined the other transformer-based models using these parameters. However, the comparison shows that this combination delivers better scores with the BERT model than with the other transformer-based models, as presented in Table 1.

Table 1: The highest scores received from the employed models on the validation set

Model Name              Acc.    F1 Score
LinearSVC               0.62    0.51
bert-base-cased         0.43    0.46
bert-large-cased        0.63    0.53
xlnet-large-cased       0.53    0.38
roberta-large           0.59    0.47
distilbert-base-cased   0.58    0.45

In addition, we attempted to enhance the model using additional datasets. We used the Fakenews Classification Datasets described in subsection 3.2 with different numbers of samples. In the first attempt, we used only 500 fake and 500 real news articles and combined them with the provided dataset; we then increased the amount of added news to 1,000 and 2,000 samples in two further attempts. Nevertheless, the outcomes still indicate that the previous fine-tuned model trained only on the dataset provided by the contest outperformed all other attempts. The reason could be that the additional dataset only includes binary labels (fake or real), while this task is a multi-class classification problem. As a result, we decided to train the final model with only the dataset provided by the competition, so that the model only sees similar data in the same format.
The testing results released on the leaderboard for this task show that our team achieved an F1 of 0.305. Our score nearly reached the highest, which is 0.339, and it is higher than the average F1 of the other participants, 0.252.

Figure 2: F1 scores for evaluating the model on the validation and testing sets
Figure 3: Confusion matrix on the testing set

For evaluation, we used a classification report and a confusion matrix to assess our model during the validation phase. We also used these metrics to evaluate the submitted predictions against the released gold labels. Our observations show that our model performed better during the validation phase than in testing, as illustrated in Figure 2. The validation F1 scores are 0.77, 0.5, 0.57, and 0.28 for the labels 'false', 'true', 'partially false', and 'other', respectively. However, on the testing set, the model mispredicted most of the 'partially false' articles, and none of the 'other' labels was predicted correctly, although it performed reasonably well when predicting the 'false' and 'true' articles. Moreover, we observed from the confusion matrix in Figure 3 that the model mostly predicted the label 'false', accounting for about 65% of the total predictions. In contrast, the correctly predicted labels are much fewer than the 'predicted-as-false' labels. One possible reason is that the labelling criteria in the training set differ from those of the later released testing set.
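As a brief illustration of this evaluation step, the sketch below computes a per-class classification report and a confusion matrix with scikit-learn; the gold and predicted label lists are placeholders, not our actual predictions.

```python
# Minimal sketch of the evaluation described above; label lists are placeholders.
from sklearn.metrics import classification_report, confusion_matrix

labels = ["false", "true", "partially false", "other"]
gold = ["false", "partially false", "true", "other", "false"]       # placeholder gold labels
predicted = ["false", "false", "true", "false", "partially false"]  # placeholder predictions

# Per-class precision/recall/F1 plus the macro-averaged F1 used on the leaderboard.
print(classification_report(gold, predicted, labels=labels, zero_division=0))

# Rows correspond to gold labels, columns to predicted labels.
print(confusion_matrix(gold, predicted, labels=labels))
```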
6. Conclusion and Future Work
Obtaining reliable information is an essential factor in our daily lives, especially when it comes to reading news. Due to the extensive number of published online articles, many developers have investigated and built various models to tackle the infodemic. This paper describes our system and participation in the CLEF CheckThat! Lab Task 3A competition. We examined an English dataset in which each article is labelled as 'true', 'false', 'partially false' or 'other'. We investigated four ML algorithms and pre-trained transformers to solve this multi-class classification problem. Additionally, we attempted to use an external dataset from Kaggle to help improve the model; however, the additional dataset did not increase the performance, even though we used a different number of samples in each attempt. Finally, our findings from over 30 experiments show that the BERT model outperforms the other models. The testing results on the leaderboard indicate that we obtained an F1 of around 0.305, which differs from the highest participant's score by only about 0.03. In future work, finding an additional dataset with a similar format may help improve the model. Also, using an ensemble method that combines both rule-based and deep learning approaches could significantly enhance the proposed system.

7. Acknowledgements
We would like to acknowledge and give our warmest thanks to the Ministry of Education in Saudi Arabia, King Abdulaziz University, and the University of Leeds for their continuous support.

8. References
[1] P. Nakov et al., "Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims," in International Conference of the Cross-Language Evaluation Forum for European Languages, 2018, pp. 372–387.
[2] T. Elsayed et al., "Overview of the CLEF-2019 CheckThat! Lab: Automatic identification and verification of claims," 2019, doi: 10.1007/978-3-030-28577-7_25.
[3] A. Barrón-Cedeño et al., "CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media," 2020, doi: 10.1007/978-3-030-45442-5_65.
[4] P. Nakov et al., "The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news," in European Conference on Information Retrieval, 2021, pp. 639–649.
[5] P. Nakov et al., "Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection," 2022.
[6] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, and M. Schütz, "Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection," 2022.
[7] G. K. Shahi, J. M. Struß, and T. Mandl, "Overview of the CLEF-2021 CheckThat! Lab: Task 3 on fake news detection," 2021.
[8] P. Nakov et al., "The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection," in European Conference on Information Retrieval, 2022, pp. 416–428.
[9] G. K. Shahi, A. Dirkson, and T. A. Majchrzak, "An exploratory study of COVID-19 misinformation on Twitter," Online Soc. Networks Media, vol. 22, p. 100104, 2021.
[10] R. Mouty and A. Gazdar, "Survey on steps of truth detection on Arabic tweets," 2018, doi: 10.1109/NCG.2018.8593060.
[11] G. Jardaneh, H. Abdelhaq, M. Buzz, and D. Johnson, "Classifying Arabic tweets based on credibility using content and user features," 2019, doi: 10.1109/JEEIT.2019.8717386.
[12] X. Zhou and R. Zafarani, "Fake news: A survey of research, detection methods, and opportunities," ACM Comput. Surv., pp. 1–40, 2018.
[13] S. Gaonkar, S. Itagi, R. Chalippatt, A. Gaonkar, S. Aswale, and P. Shetgaonkar, "Detection of online fake news: A survey," 2019, doi: 10.1109/ViTECoN.2019.8899556.
[14] G. K. Shahi and D. Nandini, "FakeCovid – A multilingual cross-domain fact check news dataset for COVID-19," arXiv preprint arXiv:2006.11343, 2020.
[15] K. Shu and H. Liu, "Detecting fake news on social media," Synth. Lect. Data Min. Knowl. Discov., 2019, doi: 10.2200/s00926ed1v01y201906dmk018.
[16] X. Zhou and R. Zafarani, "A survey of fake news: Fundamental theories, detection methods, and opportunities," ACM Comput. Surv., 2020, doi: 10.1145/3395046.
[17] N. Y. Hassan, W. H. Gomaa, G. A. Khoriba, and M. H. Haggag, "Credibility detection in Twitter using word n-gram analysis and supervised machine learning techniques," 2020, doi: 10.22434/IFAMR2019.0020.
[18] S. Shaar et al., "Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates," Working Notes of CLEF, 2021.
[19] S. Althabiti, M. Alsalka, and E. Atwell, "SCUoL at CheckThat! 2021: An AraBERT model for check-worthiness of Arabic tweets," 2021.
[20] S. Shaar et al., "Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates," in CEUR Workshop Proceedings, 2021, vol. 2936, pp. 393–405.
[21] G. K. Shahi, "AMUSED: An annotation framework of multi-modal social media data," arXiv preprint arXiv:2010.00502, 2020.
[22] J. Z. Pan, S. Pavlova, C. Li, N. Li, Y. Li, and J. Liu, "Content based fake news detection using knowledge graphs," in International Semantic Web Conference, 2018, pp. 669–683.
[23] S. Kumari, "NoFake at CheckThat! 2021: Fake news detection using BERT," arXiv preprint arXiv:2108.05419, 2021.
[24] K. Poddar and K. S. Umadevi, "Comparison of various machine learning models for accurate detection of fake news," in 2019 Innovations in Power and Advanced Computing Technologies (i-PACT), 2019, vol. 1, pp. 1–5.
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[26] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[27] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[28] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.