CoulterOzler at CheckThat! 2022: Detecting fake news with transformers

Kadir Bulut Ozler, Riah Coulter
University of Arizona

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
kbozler@email.arizona.edu (K. B. Ozler); riahcoulter@email.arizona.edu (R. Coulter)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In the age of the internet, people interact with each other more than ever before, and almost everybody with internet access is affiliated with a social media website. With this popularity, the spread of misinformation has inevitably become a major problem of the current age. In recent years, the 2016 US Presidential Election brought attention to fake news, and with the Coronavirus pandemic, misinformation became an increasingly popular research area in academia. To contribute to the research on detecting misinformation on the internet, we participated in task 3: Fake News Detection of the CheckThat! Lab at CLEF2022. In this paper, we describe our system, which consists of data collection, extensive preprocessing, and fine-tuning transformer-based pre-trained models. We achieved a macro F1-score of 0.328 against a top score of 0.339 on the official test set.

Keywords
Fake News, Multi-class Classification, Transformers, Fine-tuning, Misinformation

1. Introduction

In the information age, the internet has become the main source of news about what is happening in the world. Individual access to the internet became easy and affordable, which gave people massive freedom to obtain and share information online. Although this freedom has brought major benefits, it often comes with a cost: misinformation. Misinformation is seen in a variety of forms [1]. It can be a Facebook post with fake content, a tweet from a fake profile of a credible source, a news article with a manipulative narrative, or a misleading title that tells a different story than the article itself.

In recent years, misinformation has become a significant research area in natural language processing. Past studies have focused on rumor detection [2, 3, 4, 5, 6], fake news detection [7, 8, 9, 10, 11, 12], spam detection [13, 14, 15, 16, 17], and bot detection [18, 19, 20, 21]. There have also been several shared tasks related to misinformation detection. Recent SemEval tasks [22, 23, 24] aimed to determine the stance and veracity of given texts and categorize them into pre-defined classes. MediaEval [25] focused on misinformation regarding the Coronavirus pandemic and 5G. Task 3 of the CheckThat! Lab at CLEF2022 [26, 27, 28] is another shared task that focuses on fake news detection. The task's goal is to determine whether the main claim of an article belongs to one of the following categories [29]: true, partially false, false, or other (label descriptions can be found in table 1). The task has two sub-tasks: mono-lingual in English, and cross-lingual for English and German, where the training set is in English and the test set is in German. This year, we participated in the mono-lingual sub-task.

Table 1
Task's label descriptions

  Label             Description
  False             The main claim made in an article is untrue.
  Partially False   The main claim of an article is a mixture of true and false information.
  True              The primary elements of the main claim are demonstrably true.
  Other             The claim is open to discussion regarding its misinformation status.
In the previous edition of the lab [30], sub-task 3A [31] was very similar to the mono-lingual sub-task of this year's task 3. Participants took many different approaches to that sub-task. [32] employed several transformer-based [33] models and obtained their best results with ALBERT [34]. [35] used an ensemble of RoBERTa [36] and Longformer [37]. [38] showed that gradient boosting with extensive preprocessing performed better than widely popular deep learning architectures such as LSTMs [39] and BERT [40].

In the following sections, we describe the datasets used in this work, our methods, experiments, results, error analysis, and conclusions.

2. Dataset Analysis

There are multiple datasets with mixed domains that focus on fake news detection. In this section, we describe the datasets used in our work. Table 2 gives the label counts for all of these datasets; label counts are calculated after dropping NaN values, bad lines, and duplicates, except for the official test set. Table 3 gives the final label distribution of the training, dev, and test sets used in our work.

Table 2
Label details of each dataset

  Dataset                 # of samples   False   Partially False   True    Other
  politifact              21898          4699    14311             2888    -
  true-fake               44689          23478   -                 21211   -
  fakenewskdd2020         1068           434     -                 634     -
  official-training-set   1183           571     312               206     94
  official-test-set       612            315     56                210     31

Table 3
Label details of final train-dev-test distribution

  Dataset        # of samples   False   Partially False   True    Other
  training set   46855          19870   10032             16891   62
  dev set        396            191     104               69      32
  test set       612            315     56                210     31

2.1. politifact

Introduced in [41], this dataset consists of fact-checking articles from politifact.com. It includes the article title and the article text. The available labels are true, false, and partially false. The exact version we used in our work can be found on Kaggle. We randomly chose 15k of the samples and included them in our training set.

2.2. true-fake

This combined dataset consists of two separate datasets, each of which includes article titles and article texts. In the true dataset, all samples are considered true; in the fake dataset, all samples are considered fake. We randomly chose 30k of the samples and included them in our training set. The exact versions we used in our work can be found here and here. Unfortunately, we could not find the original source that introduced these datasets.

2.3. fakenewskdd2020

This dataset has only article texts and their labels; the fake label is defined as "potentially unreliable". We randomly chose 1k of the samples and included them in our training set. The dataset can be found on Kaggle and was provided by Kai Shu to the competition organizers.

2.4. official-training-set

This is the official dataset released privately by the task organizers to the participants [42]. It contains the article text, the article title, and all four labels. During the model development phase, 2/3 of it was used in our training set and 1/3 in our dev set. It was built by following the steps in [43].

2.5. official-test-set

This is the official test set released by the task organizers. It contains the article id, the article text, and all four labels. It can be found on Zenodo. We calculated our final scores from our predictions on this dataset.
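To make the data assembly concrete, the following is a minimal sketch of how a combined training set like the one in Table 3 could be built. The file names, column names, and label normalization below are illustrative assumptions, not the exact formats of the original dataset releases.

```python
# Minimal sketch of assembling the combined training set described in Section 2.
# File names, column names, and the label normalization are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_clean(path, text_col, title_col=None, label_col="label"):
    """Read a CSV, drop NaN rows, bad lines, and duplicates, and return (text, label)."""
    df = pd.read_csv(path, on_bad_lines="skip")
    df = df.dropna(subset=[text_col, label_col]).drop_duplicates()
    if title_col is not None and title_col in df.columns:
        # Concatenate title and article body when a title column is available.
        df[text_col] = df[title_col].fillna("") + " " + df[text_col]
    return df.rename(columns={text_col: "text", label_col: "label"})[["text", "label"]]

# Additional datasets, subsampled as described above (15k / 30k / 1k samples).
politifact = load_and_clean("politifact.csv", "article_text", "article_title").sample(15_000, random_state=0)
true_fake = load_and_clean("true_fake.csv", "text", "title").sample(30_000, random_state=0)
kdd2020 = load_and_clean("fakenewskdd2020.csv", "text").sample(1_000, random_state=0)

# Official training set: 2/3 used for training, 1/3 held out as the dev set.
official = load_and_clean("official_training_set.csv", "text", "title", label_col="rating")
official_train, dev = train_test_split(official, test_size=1/3, random_state=0)

# Combine everything and normalize label spellings to one scheme (assumed mapping;
# numeric labels such as those in fakenewskdd2020 would need their own mapping).
train = pd.concat([official_train, politifact, true_fake, kdd2020], ignore_index=True)
train["label"] = train["label"].astype(str).str.lower().replace({"fake": "false"})
print(train["label"].value_counts())
```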
3. Methods

In this section, we give details about the methods we employed for the task. They consist of text preprocessing and fine-tuning pre-trained language models.

3.1. Preprocessing

• Concatenating title and content when available: Some datasets contain a column with the article titles. In this case, we merged the title and the article content into one sequence of text.
• Converting to lower case: It is usually unhelpful to keep characters in both lowercase and uppercase form.
• Removing stop words: Stop words are usually the most frequently occurring words in natural language and do not contribute much to the meaning. Removing them before feeding the sequences to the model is a widely used practice.
• Removing punctuation: Like stop words, punctuation marks are usually unnecessary to keep.
• Standardizing certain words: To make the text as clean as possible, we used pre-defined tokens to replace urls, email addresses, phone numbers, names, numbers, digits, and currency characters.
• Lemmatizing: To simplify words, we used lemmatizing rather than stemming in order to avoid creating words that are not in the dictionary or that have lost their meaning.
• Shortening: We shortened the sequences to 500 or 4000 tokens, depending on the model's capacity, so that they fit into the models.

We used the Natural Language Toolkit (NLTK) [44] for lemmatizing, name standardization, and stop word removal, unicodedata (https://docs.python.org/3/library/unicodedata.html) for punctuation removal, and the clean-text project (https://pypi.org/project/clean-text/) for lowercasing and for standardizing urls, email addresses, phone numbers, numbers, digits, and currency characters.

3.2. Fine-tuning LMs

Transformer-based pre-trained language models have become significantly popular in recent years because they have led to state-of-the-art improvements in many natural language processing tasks [40]. They also do not require in-domain training from scratch, which would need more data and more GPU time. Therefore, we decided to fine-tune a pre-trained language model to develop our in-domain model for this shared task. We used Huggingface's transformers [45] during this stage, and our code repository has been shared (https://github.com/kbulutozler/clef2022-checkthat-task3). We explored multiple LMs, training/eval batch sizes, numbers of epochs, and learning rates. We chose distilbert [46] because it is a smaller, yet promising model. We also chose longformer, anticipating that feeding the model longer sequences (which is normal for articles) could lead to more accurate predictions. Our hyperparameter space can be found in table 4. A sketch of the preprocessing and fine-tuning pipeline is given below.
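The following is a minimal sketch of the preprocessing and fine-tuning steps described above, using the best setup reported in Section 4 (distilbert-base-uncased, learning rate 2e-05, batch size 64, 20 epochs). The `train` and `dev` DataFrames are assumed to have "text" and "label" columns (e.g., as assembled in the sketch in Section 2); the label mapping and variable names are assumptions, and punctuation is removed here via clean-text rather than unicodedata for brevity. It is an illustration, not an exact reproduction of the submitted system.

```python
# Sketch of the preprocessing and fine-tuning pipeline (simplified illustration).
import nltk
from cleantext import clean                      # the clean-text project
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

nltk.download("stopwords"); nltk.download("wordnet"); nltk.download("punkt")
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
LABEL2ID = {"false": 0, "partially false": 1, "true": 2, "other": 3}  # assumed mapping

def preprocess(text: str) -> str:
    # Lowercase; standardize urls, emails, phone numbers, numbers, digits, currency;
    # punctuation is also dropped here via clean-text for brevity.
    text = clean(text, lower=True, no_urls=True, no_emails=True, no_phone_numbers=True,
                 no_numbers=True, no_digits=True, no_currency_symbols=True, no_punct=True)
    # Remove stop words, lemmatize, and shorten to roughly 500 tokens for distilbert.
    tokens = [LEMMATIZER.lemmatize(t) for t in nltk.word_tokenize(text) if t not in STOP_WORDS]
    return " ".join(tokens[:500])

def to_dataset(df):
    df = df.assign(text=df["text"].map(preprocess), labels=df["label"].map(LABEL2ID))
    return Dataset.from_pandas(df[["text", "labels"]], preserve_index=False)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                           num_labels=len(LABEL2ID))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = to_dataset(train).map(tokenize, batched=True)
dev_ds = to_dataset(dev).map(tokenize, batched=True)

args = TrainingArguments(output_dir="checkpoints", learning_rate=2e-5,
                         per_device_train_batch_size=64, num_train_epochs=20,
                         evaluation_strategy="epoch", save_strategy="epoch")
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=dev_ds, tokenizer=tokenizer)
trainer.train()
```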
4. Experiments and Results

In the model development stage, we used our custom split of the training and dev sets described in section 2, and we used four 32GB Nvidia V100 GPUs. After the initial experiments on the dev set, we fixed the learning rate at 2e-05 and the batch size at 64 in order to focus on the number of epochs for the rest of the experiments. For the longformer model, we had to reduce the batch size to 2 due to GPU limitations. The results obtained in the development stage can be found in table 5, reported as micro F1 and macro F1.

Table 4
Hyperparameter space

  model           distilbert-base-uncased, allenai/longformer-base-4096
  # of epochs     4, 12, 16, 20, 32
  learning rate   2e-05, 5e-05
  batch size      2, 4, 16, 64

Table 5
Results on custom dev set

  model                     # of epochs   micro F1   macro F1
  longformer-base-4096      16            0.482      0.163
  longformer-base-4096      32            0.487      0.177
  distilbert-base-uncased   4             0.477      0.364
  distilbert-base-uncased   12            0.495      0.381
  distilbert-base-uncased   16            0.515      0.399
  distilbert-base-uncased   20            0.538      0.408

We can sum up our findings during model development as follows:
• Using longformer did not lead to the results we anticipated.
• Longer training had diminishing returns.
• Macro F1 scores are lower than micro F1 scores because performance on certain label(s) is significantly worse.

As seen in Table 5, the best setup is the distilbert model trained for 20 epochs with a batch size of 64 and a learning rate of 2e-05. Furthermore, we explored how the model would perform without the additional data: we used the best setup and trained on just the official training set with no further hyperparameter search. We also explored the effect of the preprocessing methods by training the same setup on just the official training set without preprocessing. Table 6 shows the results obtained on the official test set in terms of accuracy, macro precision, macro recall, and macro F1. The results show that the additional data hurt performance the most; one cause might be the domain mismatch between the official dataset and the additional datasets. We had anticipated that combining multiple domains might lead the model to make better predictions for sequence classification, as shown in [47], but we could not obtain parallel results following a similar intuition. Moreover, training on preprocessed data led to better precision, recall, and F1 scores than training on unpreprocessed data.

Table 6
Results on official test set

  preprocessing   training set                      accuracy   precision   recall   F1
  yes             official data + additional data   0.462      0.327       0.295    0.262
  yes             official data                     0.451      0.345       0.359    0.328
  no              official data                     0.464      0.337       0.318    0.299

5. Error Analysis

In table 7, we present the confusion matrix on the official test set for the model trained on the preprocessed official training set (the second model in table 6). The table shows that the model learned to detect false claims best and struggled with the remaining labels. This can be explained by the challenging nature of the data and the dominance of the "False" label in the training set.

Table 7
Confusion matrix (rows: gold labels, columns: predicted labels)

  gold label        count   False   Partially False   True   Other
  False             315     204     48                36     27
  Partially False   56      20      13                17     6
  True              210     72      68                49     21
  Other             31      13      5                 3      10

6. Conclusion and Future Scope

We participated in task 3: Fake News Detection of the CheckThat! Lab at CLEF2022 and developed models to detect and classify misinformation on the internet. We applied extensive preprocessing methods and fine-tuned several pre-trained language models on the released dataset and additional datasets. We found that neither the ability to feed longer sequences nor additional data from mixed domains improved performance, while preprocessing and a smaller model led to better predictions. For future work, one might explore further pre-training an already pre-trained language model on data similar in nature to the official training set to develop better models. Class imbalance also appears to be a significant issue in this task, so another direction might be data augmentation methods to increase in-domain data, or modifying the loss function to increase the penalty for mispredicting samples of the underrepresented label(s).
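As an illustration of the loss-modification direction mentioned above (this was not part of our submitted system), the following sketch shows one way class weights derived from the official training-set counts in Table 2 could be plugged into a Huggingface Trainer; the class order and helper names are assumptions.

```python
# Sketch of a class-weighted loss for the label imbalance discussed above; illustrative only.
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant that penalizes mispredictions on rare labels more heavily."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Weights inversely proportional to the official training-set counts in Table 2
# (False 571, Partially False 312, True 206, Other 94), in the assumed label order.
counts = torch.tensor([571.0, 312.0, 206.0, 94.0])
class_weights = counts.sum() / (len(counts) * counts)
```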
References

[1] D. Bawden, L. Robinson, The dark side of information: overload, anxiety and other paradoxes and pathologies, Journal of Information Science 35 (2009) 180–191.
[2] J. Yu, J. Jiang, L. M. S. Khoo, H. L. Chieu, R. Xia, Coupled hierarchical transformer for stance-aware rumor verification in social media conversations, Association for Computational Linguistics, 2020.
[3] S. Kwon, M. Cha, K. Jung, Rumor detection over varying time windows, PloS one 12 (2017) e0168344.
[4] Q. Zhang, S. Zhang, J. Dong, J. Xiong, X. Cheng, Automatic detection of rumor on social network, in: Natural Language Processing and Chinese Computing, Springer, 2015, pp. 113–122.
[5] S. Hamidian, M. T. Diab, Rumor detection and classification for twitter data, arXiv preprint arXiv:1912.08926 (2019).
[6] T. Takahashi, N. Igata, Rumor detection on twitter, in: The 6th International Conference on Soft Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems, IEEE, 2012, pp. 452–457.
[7] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news, arXiv preprint arXiv:1708.07104 (2017).
[8] J. C. Reis, A. Correia, F. Murai, A. Veloso, F. Benevenuto, Supervised learning for fake news detection, IEEE Intelligent Systems 34 (2019) 76–81.
[9] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, ACM SIGKDD Explorations Newsletter 19 (2017) 22–36.
[10] R. K. Kaliyar, A. Goswami, P. Narang, Fakebert: Fake news detection in social media with a bert-based deep learning approach, Multimedia Tools and Applications 80 (2021) 11765–11788.
[11] C. Liu, X. Wu, M. Yu, G. Li, J. Jiang, W. Huang, X. Lu, A two-stage model based on bert for short fake news detection, in: International Conference on Knowledge Science, Engineering and Management, Springer, 2019, pp. 172–183.
[12] J. C. B. Cruz, J. A. Tan, C. Cheng, Localization of fake news detection via multitask transfer learning, arXiv preprint arXiv:1910.09295 (2019).
[13] A. Gupta, R. Kaushal, Improving spam detection in online social networks, in: 2015 International Conference on Cognitive Computing and Information Processing (CCIP), IEEE, 2015, pp. 1–6.
[14] T. Wu, S. Liu, J. Zhang, Y. Xiang, Twitter spam detection based on deep learning, in: Proceedings of the Australasian Computer Science Week Multiconference, 2017, pp. 1–8.
[15] G. Jain, M. Sharma, B. Agarwal, Spam detection on social media using semantic convolutional neural network, International Journal of Knowledge Discovery in Bioinformatics (IJKDB) 8 (2018) 12–26.
[16] G. Jain, M. Sharma, B. Agarwal, Optimizing semantic lstm for spam detection, International Journal of Information Technology 11 (2019) 239–250.
[17] J. Cao, C. Lai, A bilingual multi-type spam detection model based on m-bert, in: GLOBECOM 2020 - 2020 IEEE Global Communications Conference, IEEE, 2020, pp. 1–6.
[18] M. Heidari, J. H. Jones, Using bert to extract topic-independent sentiment features for social media bot detection, in: 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), IEEE, 2020, pp. 0542–0547.
[19] S. Feng, Z. Tan, R. Li, M. Luo, Heterogeneity-aware twitter bot detection with relational graph transformers, arXiv preprint arXiv:2109.02927 (2021).
[20] D. Martín-Gutiérrez, G. Hernández-Peñaloza, A. B. Hernández, A. Lozano-Diez, F. Álvarez, A deep learning approach for robust detection of bots in twitter using transformers, IEEE Access 9 (2021) 54591–54601.
[21] M. Heidari, S. Zad, P. Hajibabaee, M. Malekzadeh, S. HekmatiAthar, O. Uzuner, J. H. Jones, Bert model for fake news detection based on social bot activities in the covid-19 pandemic, in: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), IEEE, 2021, pp. 0103–0109.
[22] G. Da San Martino, A. Barrón-Cedeno, H. Wachsmuth, R. Petrov, P. Nakov, Semeval-2020 task 11: Detection of propaganda techniques in news articles, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1377–1414.
[23] T. Mihaylova, G. Karadjov, P. Atanasova, R. Baly, M. Mohtarami, P. Nakov, Semeval-2019 task 8: Fact checking in community question answering forums, arXiv preprint arXiv:1906.01727 (2019).
[24] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski, Semeval-2019 task 7: Rumoureval, determining rumour veracity and support for rumours, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 845–854.
[25] K. Pogorelov, D. T. Schroeder, L. Burchard, J. Moe, S. Brenner, P. Filkukova, J. Langguth, Fakenews: Corona virus and 5g conspiracy task at mediaeval 2020, in: MediaEval 2020 Workshop, 2020.
[26] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[27] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, F. Nicola (Eds.), Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[28] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, M. Schütz, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[29] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of covid-19 misinformation on twitter, Online Social Networks and Media 22 (2021) 100104.
[30] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 12th International Conference of the CLEF Association: Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization, CLEF '2021, Bucharest, Romania (online), 2021. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_19.
[31] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021. URL: http://ceur-ws.org/Vol-2936/paper-30.pdf.
[32] J. R. Martinez-Rico, J. Martinez-Romo, L. Araujo, NLP&IR@UNED at CheckThat! 2021: Check-worthiness estimation and fake news detection using transformer models, Faggioli et al. (2021).
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[35] H. Lekshmiammal, A. K. Madasamy, NITK_NLP at CheckThat! 2021: Ensemble transformer model for fake news classification, in: Conference and Labs of the Evaluation Forum (CLEF 2021), 2021.
[36] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[37] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[38] C. G. Cusmuliuc, M. A. Amarandei, I. Pelin, V. I. Cociorva, A. Iftene, UAICS at CheckThat! 2021: Fake news detection, Faggioli et al. (2021).
[39] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[41] N. Vo, K. Lee, Where are the facts? Searching for fact-checked information to alleviate the spread of fake news, arXiv preprint arXiv:2010.03159 (2020).
[42] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, Working Notes of CLEF (2021).
[43] G. K. Shahi, Amused: An annotation framework of multi-modal social media data, arXiv preprint arXiv:2010.00502 (2020).
[44] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Huggingface's transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[46] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[47] K. B. Ozler, K. Kenski, S. Rains, Y. Shmargad, K. Coe, S. Bethard, Fine-tuning for multi-domain and multi-label uncivil language detection, in: Proceedings of the Fourth Workshop on Online Abuse and Harms, 2020, pp. 28–33.