A Survey on COVID-19 related Fake News Detection using
Machine Learning Models
Rayees Ahmad Dar1, Dr. Rana Hashmy2
1,2 University of Kashmir, Hazratbal, Srinagar 190006, J&K, India


                Abstract
                The explosion of uncensored data on social media platforms has, on the one hand, enabled fast and
                easy dissemination of news and facts, but at the same time it poses serious threats because of
                its highly unreliable nature. Misinformation and disinformation are most prevalent during
                important events that people are curious about, e.g., elections, or
                during untoward events such as the COVID-19 pandemic. Because of the unprecedented
                nature of these events, people are susceptible to bogus and potentially hazardous
                claims and articles. Therefore, we need an early detection mechanism to stop the spread of
                intentionally and unintentionally written fake news or claims.
                Past research has suggested various models based on machine learning, deep learning and
                pretrained language models to detect false news over the years. This study
                assesses the effectiveness of various relevant methods on the task of detecting fake news
                and false claims related to the COVID-19 pandemic. We use the
                combined corpus of two of the largest datasets available. We explore various pretrained language
                models in addition to deep learning and conventional machine learning approaches and
                compare their performance. We find that RoBERTa in particular and BERT-based models in
                general outperform all other models. We believe this study will help the research
                community explore this domain further.

                Keywords 1
                fake news detection, social media fake news, misinformation, COVID-19, machine
                learning, language models

1. Introduction
           Fake news can broadly be defined as “a news article or message published and propagated
      through media, carrying false information regardless of the means and motives behind it” [1-8].
      Fake news is at its worst during a pandemic, when knowledge and research about the
      disease are scarce and people tend to believe false information amid the chaos.
      The problem is aggravated on social media platforms, where content is
      unauthenticated. This can inflict damage at both the individual and societal levels. Thus,
      early detection and blocking of such posts is crucial on social media platforms. During the
      COVID-19 outbreak, certain fringe elements exacerbated the
      uncertainty and social disruption by spreading false information, mostly on social media
      platforms. This misinformation mostly concerned the disease itself as well as vaccines, medication, mask
      usage, etc. Hence, mitigating this infodemic is as important as fighting
      the pandemic itself. Different machine learning methods have been employed for this purpose.


1
 MoMLeT+DS 2023: 5th International Workshop on Modern Machine Learning Technologies and Data Science, June 3, 2023, Lviv,
Ukraine
EMAIL: Rayees.csscholar@kashmiruniversity.net (R. Ahmad); ranahashmy@gmail.com (Dr. R. Hashmy);
ORCID: 0000-0002-4424-6593 (R. Ahmad);
             © 2023 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
   We analyze the performance of various relevant classical machine learning, deep
learning and, most importantly, pre-trained language models on the combined COVID-19 dataset,
which we assembled from existing publicly available datasets.
   As the labeled data in this case is sparse, BERT-based and other pre-trained language models perform
comparatively better. In this comparative study, we analyze how well these approaches perform
on the said datasets. Because of the distinct nature of COVID-19 related fake news, we feel
the behavior of these models needs to be observed separately in this domain of fake news.


2. Related Work
    Due to the omnipresence of the Internet and its ease of access, social media has become an integral
part of our lives. However, its unauthenticated nature simultaneously poses a serious threat. A large
number of machine learning based approaches have been proposed for the automatic detection of
false news and claims.
    Conventional machine learning based approaches have yielded good results when used for fake news
detection. Reis et al. [9] approached this problem as a binary classification
task; various syntactic and semantic features are extracted through feature engineering and then
passed to conventional ML classifiers such as K-Nearest Neighbor (KNN), Naïve Bayes, Random Forest
(RF), XGBoost (XGB), and Support Vector Machine (SVM) for training and classification. Of
these, XGB and RF yielded promising results.
    Wani et al. [12] evaluated deep learning methods for the fake news detection task. They trained various DL
models on the COVID-19 fake news detection dataset [11] from Constraint@AAAI 2021, including
LSTM, CNN, HAN, bi-LSTM+attention, DistilBERT and BERT-base.
They treated the problem as a binary classification task and focused mainly on news
content (text). They gave the pre-trained BERT and DistilBERT models domain context by further
pre-training them on a corpus of COVID-19 related tweets, which improved performance
compared with models trained on the dataset only. An ensemble of COVID-Twitter-BERT
and a BERT-cased model outperformed the other approaches. Furthermore, HAN
outperformed the other non-transformer-based models.
    In [10], the authors trained an ensemble of Bi-LSTM and Bi-GRU-dense models on the LIAR [13]
dataset and classified the news items as fake or real. The outputs of the two models were averaged
to obtain a single output value. In their experiments, the proposed model
performed better than other studies that used the LIAR [13] dataset for fake-news
detection.
    The Transformer [14] architecture is the basis for the majority of current state-of-the-art approaches to fake
news detection. Because these models employ self-attention, wherein every word in a
sentence is weighted according to its significance, and are pretrained on a large collection of data,
they have proven superior to earlier non-transformer-based models. One
transformer-based pretrained language model from Google is BERT [15], which has 340 million
trainable parameters (BERT-LARGE) and is a state-of-the-art architecture for various downstream tasks
such as text classification. Kaliyar et al. [16] proposed FakeBERT, a BERT-based deep learning approach combining
CNN with BERT, which helps in reducing ambiguity. The authors of [17] proposed an ensemble
of BERT, ALBERT, and XLNet, which they fine-tuned and tested on the Constraint@AAAI
2021 fake news detection dataset [11]. A variant of BERT, CT-BERT (COVID-Twitter-BERT)
[18], was created by pre-training the BERT model on a large collection of tweets related to COVID-19
and has shown promising results for fake-news detection on COVID-19 related news. Various
language models (XLNet, ERNIE 2.0, XLM-RoBERTa, DeBERTa, RoBERTa, and ELECTRA) were
ensembled by [19] for COVID-19 related fake-news detection. Apart from features based on news
content, they incorporated social-context features, e.g., authors, source, username, and URL.
3. Dataset Description
   The following two datasets were used in this comparative study:

3.1.    COVID-19-FNIR DATASET:
    The COVID19-FNIR dataset [20] (COVID-19 Fake News Infodemic Research Dataset) consists
of true and fake news as separate files with a total of 7588 items, which are class balanced (49.99%
real and 50.01% fake). The fake news items were collected from Poynter and the true
items from authentic Twitter handles of news publishers. The dataset has
various columns such as Text, Date, Region, Country, Explanation, Origin, Label, etc., but we
use only the Text and Label columns in this study.

3.2.    COVID19 Fake News Dataset:
    The COVID-19 fake news detection dataset [11] was published as a collection of articles and
posts related to COVID-19 from social media, with fake and real labels. Real news items in the dataset
were gathered from various verified news sources, and the false items from fact-checking
platforms such as NewsChecker and PolitiFact, where they had been verified to be false. This dataset originally
comprises 10700 social media posts with a vocabulary size of 37505. Of the news items in
this dataset, 52.34% are real and the remaining 47.66% are fake; hence it is class balanced.
    We created a combined corpus from these two datasets for this comparative study.
We renamed various columns to make them uniform, and replaced the label values 'fake' with 0 and
'real' with 1 in the first dataset (Section 3.1) to match the label field of the other
dataset (Section 3.2). The combined corpus has 18288 news items in total, of which 51%
are real samples and the remaining 49% are fake. Finally, we split this combined
dataset into train:test:validation sets in the ratio 8:1:1, as shown in Figure 1.


Figure 1: Combined corpus split of real and false news distribution
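
The merging and 8:1:1 split described in this section can be illustrated with a minimal sketch. It assumes each dataset has already been reduced to (text, label) pairs; the function name, the fixed random seed, and the plain (non-stratified) shuffle are illustrative assumptions and are not taken from the original experiments.

```python
import random

def combine_and_split(fnir_rows, constraint_rows, seed=42):
    """Merge two labelled corpora and split them 8:1:1 (train:test:validation).

    Each row is a (text, label) pair. Labels of the first corpus are the
    strings 'fake'/'real' and are mapped to 0/1; labels of the second corpus
    are assumed to be 0/1 already.
    """
    label_map = {"fake": 0, "real": 1}
    combined = [(text, label_map[label]) for text, label in fnir_rows]
    combined += [(text, int(label)) for text, label in constraint_rows]

    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(combined)      # plain shuffle; stratification is not modelled here

    n_train = int(0.8 * len(combined))
    n_test = int(0.1 * len(combined))
    train = combined[:n_train]
    test = combined[n_train:n_train + n_test]
    validation = combined[n_train + n_test:]  # remainder goes to validation
    return train, test, validation

# Toy usage with 100 synthetic items (hypothetical data).
fnir = [("fnir item %d" % i, "fake" if i % 2 else "real") for i in range(60)]
other = [("post %d" % i, i % 2) for i in range(40)]
train, test, validation = combine_and_split(fnir, other)
print(len(train), len(test), len(validation))
```

Because the split sizes are obtained by truncation, any rounding remainder ends up in the validation set.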

4. Methods
   In this research, we approach the COVID-19 fake-news detection task as a binary classification problem
in which each news piece is classified as real or fake.

4.1.    Data preprocessing
    We apply some initial preprocessing to the raw text. We remove
URLs, HTML tags, extra spaces, stop words and special characters from the text, which
is then tokenized and fed into the models.
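
The cleaning steps listed above could look roughly like the following sketch. The stop-word list is a tiny placeholder, and the lowercasing step and the exact regular expressions are assumptions; the paper does not specify which stop-word list or patterns were used.

```python
import re

# Tiny illustrative stop-word list (an assumption; the actual list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}

def preprocess(text):
    """Strip HTML tags, URLs, special characters, stop words and extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # special characters
    # Lowercasing is an added assumption so that stop-word matching is case-insensitive.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)                              # split/join collapses extra spaces

print(preprocess("<b>Masks</b> are the   key! See https://example.com/x"))
```

HTML tags are removed before URLs so that a link inside a tag attribute does not leave a dangling fragment of the tag behind.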
4.2.    Studied features
   We studied the following features: word-level and n-gram-level TF-IDF features; pre-trained fastText [21] embeddings, an extension of
the word2vec model that represents words as bags of character n-grams; GloVe [22] embeddings, an
unsupervised learning algorithm that learns word embeddings based on the observation that
ratios of word-word co-occurrence probabilities can encode meaning; and word
embedding features from language models such as BERT [15], which is pre-trained on English Wikipedia
(2500M words) and BooksCorpus (800M words). 128-dimensional BERT word embeddings are
used in this study. We used BERT word embeddings because they capture contextual
meaning and produce high-quality feature inputs that are dynamically informed by the
surrounding words. These pre-trained embeddings were used with deep learning models such as CNN and,
of course, with the respective language models, e.g., RoBERTa embeddings for the RoBERTa model. We take the
embeddings from the corresponding tokenizers. We further experimented with combining
conventional machine learning models with word embeddings from fine-tuned BERT and with
fastText embeddings. The TF-IDF features outperformed the other embeddings on the majority of
the traditional ML models analyzed.
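
To make the word-level and n-gram-level TF-IDF features concrete, the following is a minimal pure-Python sketch. The weighting variant (term frequency normalized by document length, idf = log(N/df) + 1) and whitespace tokenization are assumptions; the paper does not state which TF-IDF variant or library settings were used.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Compute TF-IDF weights for word n-grams, one dict per document.

    tf  = count of the term / number of terms in the document
    idf = log(N / df) + 1   (assumed variant, where df is the document frequency)
    """
    term_lists = [ngrams(doc.lower().split(), n) for doc in docs]
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors

docs = ["covid vaccine causes harm", "covid vaccine is safe"]
word_level = tfidf(docs, n=1)     # unigram features
bigram_level = tfidf(docs, n=2)   # bigram features
print(word_level[0])
print(sorted(bigram_level[0]))
```

Terms that occur in every document (here "covid" and "vaccine") receive a lower weight than terms specific to one document (such as "harm"), which is the property that makes TF-IDF useful as a classifier input.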

4.3.    Studied models
   We analyzed various models centered on classical machine learning, deep learning, and
pretrained language model approaches:

4.3.1. Conventional Machine Learning methods
   Traditional classifiers, namely Logistic Regression, Random Forest (RF), K-Nearest Neighbors
(KNN), Support Vector Machines (SVM), Multinomial Naïve Bayes, XG-Boost and Decision Trees
(DT), were examined in this study.
   We analyzed the results of these approaches using TF-IDF features, fastText word vectors, and
BERT word embeddings. We find that the TF-IDF features performed better than both the fastText
and the BERT embeddings with traditional machine learning approaches. We used
SelectKBest from scikit-learn to select the k = 1200 best features for training.
   Of the analyzed models, the SVM combined with TF-IDF features
performed best on the test data of the combined corpus, with an accuracy of 84.29%.

4.3.2. Deep learning models
    CNN: We used a one-dimensional convolutional model with two convolutional layers, each containing 128
filters of size 5. The embedding layer is the first layer. It is initialized with pre-trained
GloVe embeddings of dimension 300; we also experimented with BERT embeddings and compared
their performance. The outputs of the Conv1D layers are passed through the ReLU activation function,
which outputs 0 for negative inputs and passes positive inputs through unchanged.
    A max-pooling layer of pool size 2 is stacked after each convolutional layer to reduce the size of
the model. The outputs of these max-pooling layers are concatenated into a single layer before being fed
into a dropout layer (dropout = 0.4). During compilation, the learning rate of the Adam optimizer is set
to 0.0001. Finally, as we are dealing with a binary classification problem, we pass the final outputs
to a dense layer (1 unit) with sigmoid as the activation function.
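
The building blocks named in the CNN description (a size-5 convolution, ReLU, max pooling with pool size 2, and a sigmoid output) can be illustrated on a toy one-dimensional signal. This is only a sketch with a single hand-picked filter and scalar inputs; it is not the 128-filter model over 300-dimensional embeddings used in the study, and the kernel values are arbitrary.

```python
import math

def conv1d(seq, kernel):
    """Valid 1D convolution (cross-correlation) of a sequence with one kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    """ReLU: 0 for negative inputs, identity for positive inputs."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def sigmoid(x):
    """Squash a real-valued score into (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-x))

signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0, 1.0]  # toy input, not real embeddings
kernel = [1, 0, -1, 0, 1]                            # arbitrary filter of size 5

features = max_pool(relu(conv1d(signal, kernel)), size=2)
print(features)             # pooled feature map
print(sigmoid(0.0))         # a score of 0 maps to probability 0.5
```

A filter of size 5 over an input of length 8 yields 4 positions, and pooling with size 2 halves that to 2, which is the size reduction the max-pooling layers provide.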
    LSTM: We initialized the embedding layer with pre-trained GloVe embeddings of size 300. The
LSTM layer’s output dimension was set to 300, and we added a dropout layer (dropout = 0.6)
before feeding the output into a sigmoid-activated dense layer for classification. We also
experimented with initializing the embedding layer with BERT embeddings of length 128. The model
is compiled using the Adam optimizer and was trained for 10 epochs with a
batch size of 64.
    CNN+LSTM: We analyzed the performance of a hybrid model consisting of a CNN layer with an
LSTM layer on top of it. We define a CNN model as described above and, before passing its outputs
to the final dense layer, pass them through an LSTM layer of output dimension 300.

4.3.3. Pretrained language models
         Here, we describe the experimental setup of the two advanced language models used in our
study.
    DistilBERT: DistilBERT [23] was built using the knowledge distillation compression technique:
knowledge is distilled from the BERT base model into a model that is about 40% smaller while
retaining 97% of BERT’s performance on the GLUE benchmark. The token type embeddings and
the pooler were removed from the original architecture by its creators to make it lighter. DistilBERT
is less resource intensive while retaining performance close to that of the BERT model and is thus
suited for production-level usage. We add a sigmoid-activated dense layer as a classification head to
the DistilBERT model.
    RoBERTa: RoBERTa (Robustly optimized BERT pretraining approach) [24] modifies key
design choices of BERT's pretraining, such as removing the next-sentence-prediction (NSP) objective and
training with a higher learning rate on larger mini-batches, which significantly
improved performance. RoBERTa uses byte-pair encoding (BPE) as its tokenization algorithm
instead of BERT’s WordPiece tokenization.
A dropout of 0.4 is applied to the output of the transformer before it is fed into the classification
head, which is a sigmoid-activated dense layer.
    The corresponding word embeddings are used with each pre-trained language model, i.e.,
DistilBERT embeddings for DistilBERT and RoBERTa embeddings for RoBERTa. These models were
trained for 18 epochs with a batch size of 128. To avoid overfitting, we used early
stopping (with validation loss as the monitored metric). The models were trained with the Adam optimizer
with learning rate = 1e-4, b1 = .8, b2 = .898 and epsilon = 1e-7. As the loss function, we used
sparse categorical cross-entropy. The experiments were performed on a Tesla P100-PCIE-16GB
GPU provided by Kaggle.
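
The early-stopping rule mentioned above (monitoring validation loss) can be sketched as follows. The patience value, the `min_delta` threshold, and the function name are illustrative assumptions; the paper does not report the settings it used.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. If that never happens, the
    last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss                      # new best validation loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation-loss curve: improves for 3 epochs, then plateaus.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
print(early_stopping_epoch(losses, patience=3))
```

With a small patience, training on this hypothetical curve halts during the plateau; a larger patience would let it run long enough to reach the later improvement, which is the usual trade-off when choosing this setting.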

5. Evaluation metrics
   We used the following evaluation metrics to measure the performance of these models:
Accuracy: The percentage of correctly classified tweets, calculated as:
   accuracy = (Tps' + Tns') / (Tps' + Tns' + Fps' + Fns')

Precision: The percentage of true positive predictions out of all positive predictions, calculated as:
   precision = Tps' / (Tps' + Fps')

Recall: The percentage of true positive predictions out of all actual positive tweets, calculated as:
   recall = Tps' / (Tps' + Fns')

F1 score: The harmonic mean of precision and recall, which provides a single performance metric
for the model, calculated as:
   F1-score = 2 * (precision * recall) / (precision + recall)

   Here, Tps' denotes the number of true positive predictions, which represents the number of fake
tweets correctly identified as fake by the model. Similarly, Tns' represents the number of true
negative predictions, which corresponds to the number of real tweets correctly identified as real by
the model. Fps' refers to the number of false positive predictions, indicating the number of real tweets
incorrectly identified as fake by the model, and Fns' denotes the number of false negative predictions,
representing the number of fake tweets incorrectly identified as real by the model.
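
The four metrics defined above follow directly from the confusion-matrix counts. The sketch below computes them for hypothetical counts (not results from this study); the zero-division guards are an added convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp, tn, fp, fn correspond to Tps', Tns', Fps' and Fns' in the text.
    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a 200-item test set.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```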
6. Experiments and Results
   Much research has been done on the automatic detection of fake news on social media using various
machine learning and deep learning models. Some of it has focused on comparative analyses of these
models on fake news datasets. Given the unique nature of fake news about COVID-19, it
is worthwhile to investigate various machine-learning models on COVID-19 datasets, which is
the aim of this study. We first analyze different conventional machine-learning
and deep-learning models on the combined corpus of COVID-19 related datasets and
then evaluate some pre-trained language models as well.
   Further, we analyze the efficacy of the classical machine-learning and deep-learning
based approaches with different input representations. We evaluate the traditional
machine learning approaches using three different representations (fastText, TF-IDF, BERT) and the deep
learning approaches using GloVe and BERT embeddings.

Table 1
Performance of traditional machine learning models (all values in %)

Method                     Accuracy                      Precision                     Recall                        F1 score
                           BERT emb. fastText  TF-IDF    BERT emb. fastText  TF-IDF    BERT emb. fastText  TF-IDF    BERT emb. fastText  TF-IDF
Logistic Regression        81.84     82.34     83.78     83.19     79.91     84.36     81.79     85.68     83.96     82.48     82.69     84.16
Multinomial Naive Bayes    81.73     78.73     82.17     82.94     73.40     79.38     81.90     89.12     87.80     82.42     80.50     83.38
K-Nearest Neighbor         81.79     76.10     78.76     82.41     73.57     80.28     82.84     80.35     77.28     82.62     76.81     78.75
XG-Boost                   81.84     82.68     81.78     82.63     79.79     82.77     82.63     86.79     81.10     82.63     83.14     81.93
Random Forest              81.62     81.30     81.78     82.56     79.57     81.78     82.21     83.46     85.01     82.38     81.47     83.36
Decision Tree              81.57     71.35     78.16     83.03     73.14     78.16     81.38     70.03     77.75     82.19     73.14     77.95
Support Vector Machine     81.73     82.39     84.29     82.59     79.57     83.83     82.42     86.45     85.70     82.51     82.87     84.75


Table 2
Performance of deep learning approaches (all values in %)

Method       Accuracy            Precision           Recall              F1 score
             GloVe     BERT      GloVe     BERT      GloVe     BERT      GloVe     BERT
CNN          84.53     85        85.40     83.31     79.83     85.40     83.02     84.34
LSTM         85        85.5      84        85        84        85.8      84        85.39
CNN+LSTM     86        87.3      85        86        85        86.40     85        86.50

Table 3
Performance of pre-trained language models (all values in %)

Method              Accuracy  Precision  Recall  F1 score
DistilBERT          88.51     88.46      88.61   88.53
RoBERTa             89.66     90.20      89.77   89.98
RoBERTa + BiLSTM    92.34     92.34      92.38   92.36


6.1.     Results on machine learning models
    In this subsection, we present the results obtained for the classical machine-learning models, which
Table 1 summarizes. We experimented with fastText word vectors, TF-IDF features and
BERT word embeddings. Of the analyzed models, Random Forest, Multinomial
Naïve Bayes, Support Vector Machine, and Logistic Regression achieved their best accuracy when
trained on TF-IDF feature vectors, K-Nearest Neighbor and Decision Tree when
trained on pre-trained BERT embeddings, and XG-Boost when trained
on fastText word embeddings. Hence, the majority of the analyzed classical machine learning models
perform best with TF-IDF feature vectors on this specific dataset, especially when using TF-IDF
weighted average. The performance of the classical machine learning models is depicted in Figure 2.


Figure 2: Performance of machine learning models on various embeddings

6.2.     Results on deep learning models
    We analyzed CNN, LSTM, and a hybrid of the two for this study. Two
embeddings, BERT word embeddings and GloVe word vectors, were used to evaluate their
performance. The results are summarized in Table 2. The deep-learning based
approaches generally beat the classical machine-learning approaches on this
particular dataset. Figure 3 plots the performance of the analyzed deep learning models on GloVe and BERT
embeddings.
    CNN: We examine CNN with BERT word embeddings as well as GloVe word embeddings. We
initialize the embedding layer with the pretrained embeddings (BERT or GloVe). Using BERT embeddings
resulted in slightly higher performance (85% accuracy and 84.34% F1 score) than
using GloVe word embeddings (84.53% accuracy and 83.02% F1 score).

   LSTM: As the next deep-learning approach, we evaluate LSTM. We recreate the same setup as
defined for CNN above; that is, we use BERT and GloVe word embeddings. The LSTM model generally
outperforms the CNN model. BERT word embeddings again perform best in this case
(Table 2).

   CNN+LSTM: We finally explore a hybrid model based on a CNN followed by an LSTM layer.
This hybrid model performs best among the deep learning models, particularly when initialized with BERT word embeddings
(see Table 2).


Figure 3: Performance of Deep learning models on GloVe and BERT embeddings.


6.3.    Results on pre-trained models
   As can be observed from Table 3, the advanced language models clearly outperform all the
machine-learning and deep-learning based approaches. The language models do not need large
datasets because they start from pretrained weights, and hence they perform well
from the early stages of fine-tuning, whereas the deep learning models require large
datasets for satisfactory performance.

Figure 4: DistilBERT training and validation accuracy.

Figure 5: RoBERTa training and validation accuracy.

Figure 6: RoBERTa + BiLSTM ensemble training and validation accuracy.

   The DistilBERT (66M parameters) and RoBERTa (125M parameters) models achieve accuracies of 88.5%
and 89.7% respectively, as shown in Figures 4 and 5. This suggests that, for
these models, performance increases with parameter count.

   Lastly, we analyzed the performance of the RoBERTa model with a bi-LSTM layer stacked
on top of it. The bi-LSTM-attention layer extracts the sentence features automatically. This hybrid model
performed best on the combined corpus, with an accuracy of 92.3%, as depicted in Figure 6.

  The confusion matrices showing the number of real and fake predictions made by these
models are depicted in Figure 7.

Figure 7: Confusion matrices of the DistilBERT, RoBERTa, and RoBERTa + BiLSTM models, from left to right
respectively.

7. Conclusion
   In this comparative study, we analyzed classical machine-learning, deep learning, and
pretrained language models on fake news related to COVID-19 on social media platforms. It is
evident from the study that the transformer-based approaches perform best overall. The pre-trained
models perform significantly better, even on comparatively small data samples, than
deep learning models, which suffer from over-fitting on smaller datasets. Support Vector Machines
combined with TF-IDF feature vectors attained performance close to that of the deep learning-based
approaches. The CNN-LSTM model showed performance close to that of the pretrained language models. The
CNN layer learns the spatial and invariant features of the news items.

    Findings from this study can facilitate future research in this direction. In this study, we addressed
the problem of fake news about COVID-19 to examine how well different models perform on this
subtask of fake news detection. We aim to design a generalized fake-news detection model
in our future work.

8. References
[1] Mustafaraj E, Metaxas PT. The fake news spreading plague: was it preventable? In: Proc. of
    the 9th ACM Conference on Web Science (WebSci); 2017. p. 235–239.
[2] Balmas M. When fake news becomes real: Combined exposure to multiple news sources and
    political attitudes of inefficacy, alienation, and cynicism. Communication Research.
    2014;41(3):430–454.
[3] Brewer PR, Young DG, Morreale M. The impact of real news about “fake news”: Intertextual
    processes and political satire. International Journal of Public Opinion Research.
    2013;25(3):323–343.
[4] Jin Z, Cao J, Zhang Y, Luo J. News verification by exploiting conflicting social viewpoints in
    microblogs. In: Proc. of the 13th AAAI Conference on Artificial Intelligence (AAAI); 2016.
    p. 2972–2978.
[5] Rubin VL, Conroy N, Chen Y, Cornwell S. Fake news or truth? Using satirical cues to detect
    potentially misleading news. In: Proc. of the Second Workshop on Computational
    Approaches to Deception Detection; 2016. p. 7–17.
[6] Kshetri N, Voas J. The economics of “fake news”. IT Professional. 2017;19(6):8–12.
[7] Gelfert A. Fake news: A definition. Informal Logic. 2018;38(1):84–117.
[8] Sharma K, Qian F, Jiang H, Ruchansky N, Zhang M, Liu Y. Combating fake news: A survey
    on Identification and mitigation techniques. ACM Transactions on Intelligent Systems and
    Technology (TIST). 2019;10(3):1–42.
[9] Reis, Julio CS, André Correia, Fabrício Murai, Adriano Veloso, and Fabrício Benevenuto.
     “Supervised learning for fake news detection.” IEEE Intelligent Systems 34, no. 2 (2019): 76-81.
[10] Aslam N., Khan I., Alotaibi F., Aldaej L. and Aldubaikil A. Fake Detect: A Deep Learning
     Ensemble Model for Fake News Detection. Complexity, 2021 (2021), 5557784, 1-8.
[11] Patwa, Parth, Shivam Sharma, Srinivas PYKL, Vineeth Guptha, Gitanjali Kumari, Md Shad
     Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. “Fighting an Infodemic: COVID-
     19 Fake News Dataset.” arXiv preprint arXiv:2011.03327 (2020).
[12] A. Wani, I. Joshi, S. Khandve, V. Wagh, and R. Joshi, “Evaluating deep learning approaches
     for COVID-19 fake news detection,” 2021, http://arxiv.org/abs/2101.04012.
[13] Wang, W. Y. (2017). "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News
     Detection. arXiv. https://doi.org/10.48550/arXiv.1705.00648
[14] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.Gomez,
     Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances in neural
     information processing systems 30 (2017): 5998-6008.
[15] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of
     deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805
     (2018).
[16] Kaliyar, R.K., Goswami, A. & Narang, P. FakeBERT: Fake news detection in social media with
     a BERT-based deep learning approach. Multimed Tools Appl 80, 11765–11788 (2021).
     https://doi.org/10.1007/s11042-020-10183-2
[17] Sunil Gundapu and Radhika Mamidi. Transformer based automatic COVID-19 fake news
     detection system. CoRR, abs/2101.00180, 2021.
[18] Muller, M., Salathe, M., Kummervold, P. E.: (2020). COVID-Twitter-BERT: A natural
     language processing model to analyse COVID-19 content on Twitter. arXiv preprint
     arXiv:2005.07503.
[19] Dipta, S., Basak, A., Dutta, S. (2021). A heuristic driven ensemble framework for COVID-19
     fake news detection. In Combating Online Hostile Posts
[20] Julio A. Saenz, Sindhu Reddy Kalathur Gopal, Diksha Shukla, June 12, 2021, "Covid-19 Fake
     News Infodemic Research Dataset (CoVID19-FNIR Dataset)", IEEE Dataport, doi:
     https://dx.doi.org/10.21227/b5bt-5244.
[21] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, "Enriching Word Vectors with Subword
     Information", Transactions of the Association for Computational Linguistics, vol. 5, no. 1, pp.
     135-146, 2017.
[22] Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In
     Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics:
     Doha, Qatar, 2014; pp. 1532–1543.
[23] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. “DistilBERT, a distilled version of BERT:
     smaller, faster, cheaper and lighter”. In: arXiv preprint arXiv:1910.01108 (2019).
[24] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V
     (2019) Roberta: a robustly optimized bert pretraining approach. arXiv: Computation and
     language