AIT_FHSTP at CheckThat! 2022: Cross-Lingual Fake News Detection with a Large Pre-Trained Transformer

Mina Schütz1, Jaqueline Böck2, Medina Andresel1, Armin Kirchknopf2, Daria Liakhovets1, Djordje Slijepčević2 and Alexander Schindler1
1 Austrian Institute of Technology, Giefinggasse 4, 1210 Vienna, Austria
2 St. Pölten University of Applied Sciences, 3100 St. Pölten, Austria

Abstract
The increase of fake news, partially due to the accelerating digital transformation, is a major problem in today's society. This year's CheckThat! Lab 2022 addresses this problem as a Natural Language Processing (NLP) task aiming to detect fake news in English and German texts. In this paper, we present our methodology and results for both the monolingual (English) and the cross-lingual (German) task of the CheckThat! 2022 challenge. We applied the multilingual transformer model XLM-RoBERTa to solve these tasks by pre-training the models on additional datasets and fine-tuning them on the original data as well as its translations for the cross-lingual task. For the monolingual task, our final model achieves a macro F1-score of 15.48% and ranks 22nd in the benchmark. For the second task, i.e., the cross-lingual German classification, our final model achieves an F1-score of 19.46% and reaches the 4th rank in the benchmark.

Keywords
Fake News Detection, Pre-Training, Transformer, Cross-Lingual

1. Introduction
Due to the information overload on the web and the rapid spread of content on social media platforms, fake news articles circulate faster and are difficult to distinguish from journalistic articles [1]. The term fake news has been in common use since the US presidential election in 2016 [2] and can cover multiple aspects of incorrect information propagation, such as propaganda, pure fabrications, hoaxes, click-bait, and rumors [2, 3, 4]. In this year's shared task at the CLEF 2022 CheckThat! Lab [5], the third task is fake news detection with four classes [6, 7, 8]: false, partially false, true, and other. We decided to take part in both fake news detection sub-tasks: a) English and b) German. The latter was proposed as a cross-lingual task without training data in German. We propose a large pre-trained XLM-RoBERTa model [9], which we additionally pre-trained on a non-publicly available dataset of roughly 200,000 news articles from journalistic as well as citizen sources, such as blogs. After pre-training the model, we fine-tuned it with the given English training data as well as its translations into German to increase its generalization ability.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Contact: Mina.Schuetz@ait.ac.at (M. Schütz); Jaqueline.Boeck@fhstp.ac.at (J. Böck); Medina.Andresel@ait.ac.at (M. Andresel); Armin.Kirchknopf@fhstp.ac.at (A. Kirchknopf); Daria.Liakhovets@ait.ac.at (D. Liakhovets); Djordje.Slijepcevic@fhstp.ac.at (D. Slijepčević); Alexander.Schindler@ait.ac.at (A. Schindler)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Our paper is structured as follows: Section 1.1 presents the current state of the art and related work. Section 1.2 describes our methodological approach, including the employed datasets and models.
Our experimental setup is explained in Section 2, followed by a documentation of the results (Section 3) and a discussion with final conclusions (Section 4).

1.1. Related Work
Detecting fake news is a task that is becoming increasingly important due to digitalization and the rapid spread of information. Current approaches to fake news detection can be categorized into feature-based approaches, in which the model learns different writing styles, and knowledge-based approaches, in which the model learns latent information about the text and its domain [10]. Gasparetto et al. [11] provide a structured and comprehensive overview of existing methods for text classification. In the past, supervised machine learning (ML) models such as stochastic gradient descent (SGD), support vector machines (SVM), linear support vector machines (LSVM), k-nearest neighbors (KNN), and decision trees (DT) have been used for this task [12]. These methods were soon overtaken by deep learning (DL) models such as long short-term memory (LSTM) networks, convolutional neural networks (CNN), and attention-based bidirectional LSTMs (BiLSTM with attention). More recently, transformer models including BERT, ALBERT, and XLNet have outperformed these ML and DL algorithms [13]. Transformers are becoming increasingly popular as models pre-trained on large text corpora [14] are made publicly available (e.g., https://huggingface.co/). Another popular transformer is T5, which was used by Sabry et al. [15]. The authors trained an English T5 transformer (t5-base) for an English hate speech classification task and compared the results to several other state-of-the-art classification models; they report that their T5 model outperformed the RoBERTa model in all tasks. These results show that sequence-to-sequence models like T5 can be beneficial for text classification.

Fine-tuning a pre-trained model on the target data is common practice in many deep learning applications, especially for small datasets. Previous studies have shown that pre-training and fine-tuning on data similar to the task-specific data improves model performance [16, 17]. We also demonstrated this in our earlier work, where we used additional datasets (external data and translations) for pre-training and fine-tuning [18].

1.2. Methodological Approach
In this paper we propose a feature-based approach for fake news detection. We define pre-training as unsupervised re-training of a transformer model and fine-tuning as supervised training on the specific classification task. For training our models we used additional as well as translated data.

• Pre-Training Strategy: Transformer models are usually already pre-trained on a large set of generic text data [14]. However, to adapt these models to a specific classification task, we experiment with further pre-training on domain-related data (which might be relevant to the classification task).
• Fine-Tuning Strategy: Training a pre-trained model on the given training data for a downstream classification task is called fine-tuning. This can be performed either on the upper layers of the model only or on all layers.

1.3. CheckThat! 2022 Data (CT)
This year's training data consisted of 900 news articles and an additional development set containing 364 instances, both only in English [19]. The test set in English (sub-task a) contained 612 articles, the one in German 586 (sub-task b). The dataset contains four classes with the following distribution in the training set [20, 21]: partially false (217), false (465), true (142), and other (76); and in the development set: partially false (141), false (113), true (69), and other (41). Since the provided datasets do not include German data, we translated the original English CheckThat data into German using Google Translate.
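To illustrate this translation step, the following is a minimal sketch that uses the deep-translator package as a stand-in for the translation service; the file and column names are assumptions and this is not the original translation script.

```python
# Minimal sketch of translating the English CheckThat! articles into German.
# Assumes a CSV with "title" and "text" columns (hypothetical names); the
# deep-translator package stands in for the Google translation service.
import pandas as pd
from deep_translator import GoogleTranslator

translator = GoogleTranslator(source="en", target="de")

def translate_long(text: str, chunk_size: int = 4500) -> str:
    """Translate long articles chunk by chunk to stay within API limits."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return " ".join(translator.translate(c) for c in chunks if c.strip())

df = pd.read_csv("checkthat_train_en.csv")   # hypothetical file name
df["title"] = df["title"].fillna("").map(translate_long)
df["text"] = df["text"].fillna("").map(translate_long)
df.to_csv("checkthat_train_de.csv", index=False)
```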
1.4. External Data
In this section, we briefly describe the external data we used for pre-training our models. The abbreviations AD and FN indicate which datasets were used for pre-training; in the following sections, FN refers to the combination of the two fake news datasets presented below.

Article Dataset (AD): This dataset was collected over a period of 1.5 years as part of a nationally funded Austrian research project, Defalsif-AI (https://science.apa.at/project/defalsifai/), and is therefore not publicly available. It contains 194,332 news articles gathered from different sources. The articles are multilingual; however, the majority are either in English or German. The articles are not annotated as to whether they are fake news and are only used for pre-training the transformer models.

Fake News Dataset German (FN): The Fake News Dataset German [22] contains approximately 63,000 fake and non-fake news articles from the fields of economics and sports. As some of the texts include HTML and JavaScript snippets, we removed the lines containing such snippets.

Fake and real news dataset (FN): This dataset can be found on Kaggle [23] and consists of text data from news, political, and other articles. It comprises approximately 19,000 fake and 21,000 non-fake texts. More details on the described data can be found in [24, 25].

2. Experimental Setup
We employed and experimented with two different transformer models: XLM-R [9] and T5 [26]. The experimental setup is depicted in Figure 1. We evaluated each experiment with the same training (90%) and validation (10%) split.

• XLM-R is a multilingual model, trained on 100 languages, which is designed for standard NLP tasks. The underlying architecture is a combination of RoBERTa [27] and XLM [28], which leads to very good performance, outperforming other state-of-the-art models such as the multilingual version of BERT (mBERT) [29]. It relies only on masked language modeling as its pre-training objective, without next sentence prediction.
• T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, in which each task is converted into a text-to-text format. The small variant of the English T5 model is pre-trained on the English C4 [26] dataset as well as the Wiki-DPR [30] data. This publicly available model has been fine-tuned in the past for several downstream tasks using different datasets. In our work, this small version of T5 was further fine-tuned for detecting fake news.

2.1. Unsupervised Pre-Training
For pre-training our models, which we obtained from HuggingFace¹, we experimented with two additional datasets (AD, FN).

• T5-PRET: The smaller version of the T5 model (T5-small) available on HuggingFace [26] was further trained on: 1. the original CheckThat data (T5-PRET-CT), 2. the original plus translated CheckThat data (T5-PRET-CT-TL), and 3. a combination of the original CheckThat data and the additional fake news datasets (T5-PRET-CT-FN). Since the additional fake news datasets (FN) are relatively large, experiments were conducted on smaller subsets. In this paper, only the results for one of these splits are reported, since similar results were obtained for all splits. The additional data did not yield better results than the model re-trained only on the original CheckThat data. The T5-PRET-CT-FN model was re-trained on a split of the fake news datasets (FN) of about 10 million characters, with English and German texts each accounting for roughly 50%. All models were re-trained with a batch size of 8 and a learning rate of 1e-4. Each model was trained for about 8 to 15 epochs.
• XLM-R-PRET-AD: We further pre-trained the XLM-R model provided by HuggingFace on the AD dataset. It was trained for 5 epochs with a batch size of 16 and a learning rate of 2e-5. The masking probability for masked language modeling was 15%, as in the original BERT paper [14], where this type of pre-training was introduced. We trained for only this small number of epochs because training on all articles took roughly 55 hours on a single GPU.
• XLM-R-PRET-FN: The second pre-trained XLM-R transformer was trained on the fake news datasets (FN). We used a similar strategy, again with a 15% masking probability for masked language modeling, and a text length of 40 million characters. Since the fake news datasets (FN) are smaller than the additional dataset (AD), the training time on the GPU was only around 13 minutes. We trained with a learning rate of 2e-5 for 5 epochs and a smaller batch size of 8.

¹ https://huggingface.co/

Figure 1: Overview of the experimental setup for training the two transformer architectures, including both training strategies, i.e., unsupervised pre-training and supervised fine-tuning.
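As a reference for this pre-training step, the following is a minimal sketch of domain-adaptive masked language modeling with the HuggingFace Trainer. The checkpoint size (xlm-roberta-base), file name, and sequence length are assumptions; the masking probability and the other hyperparameters follow the description above.

```python
# Minimal sketch of the unsupervised pre-training step (masked language
# modeling on additional news articles). The checkpoint size, data file, and
# sequence length are assumptions; hyperparameters follow Section 2.1.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "xlm-roberta-base"                      # assumed checkpoint size
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled news articles (e.g. the AD corpus), one article per line.
raw = load_dataset("text", data_files={"train": "ad_articles.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# 15% masking probability, as in the original BERT pre-training setup.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="xlmr-pret-ad", num_train_epochs=5,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

model.save_pretrained("xlmr-pret-ad")
tokenizer.save_pretrained("xlmr-pret-ad")
```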
2.2. Supervised Fine-Tuning
For fine-tuning our pre-trained models we experimented with different hyperparameters and data combinations. For fine-tuning XLM-R we used the titles as well as the article content. If a title was available, it was prepended to the content. This is due to the maximum sequence length of the models: with transformers, input beyond the maximum sequence length is usually truncated (while shorter inputs are padded), so prepending the title ensures that it is always retained. This approach has been shown to improve performance on fake news classification tasks in other settings [31] and was also used for the official baseline of this year's shared task [32]. It was not used for the T5 model, for which only the article content was used for training. Our experiments show that T5 did not perform well, presumably due to the small amount of data, even though it was pre-trained as well. Hence, after some experiments with the re-trained T5 models fine-tuned on the original CheckThat data, we did not continue with further experiments and focused on the better-performing XLM-R models.
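A minimal sketch of this fine-tuning step is given below, assuming the pre-trained checkpoint from the previous section and a CSV file with title, text, and rating columns; the file and column names are placeholders rather than the original code, and the hyperparameters correspond to runs 1-6 in Tables 1 and 2.

```python
# Minimal sketch of the supervised fine-tuning step: the title is prepended
# to the article body, inputs are truncated at 512 tokens, and the pre-trained
# XLM-R encoder is trained as a 4-class classifier. File and column names are
# assumptions; hyperparameters correspond to runs 1-6 in Tables 1 and 2.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["false", "partially false", "true", "other"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("xlmr-pret-ad")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlmr-pret-ad", num_labels=len(labels))

df = pd.read_csv("checkthat_train_en_de.csv")   # original + translated data
df["input"] = (df["title"].fillna("") + " " + df["text"].fillna("")).str.strip()
df["label"] = df["rating"].str.lower().map(label2id)   # "rating" is assumed

# 90/10 train/validation split, tokenization with truncation at 512 tokens.
ds = Dataset.from_pandas(df[["input", "label"]]).train_test_split(test_size=0.1)
ds = ds.map(lambda b: tokenizer(b["input"], truncation=True, max_length=512,
                                padding="max_length"), batched=True)

args = TrainingArguments(output_dir="xlmr-finetuned", num_train_epochs=30,
                         per_device_train_batch_size=8, learning_rate=1e-5)
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["test"]).train()
```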
Table 1: Experiment results (in %). The models were all evaluated on a merge of the original English development set and its German translations. The reported performance metrics are accuracy and macro-averaged precision, recall, and F1-score.

No.  Model           Dataset                  Accuracy  Precision  Recall  F1
1    XLM-R-PRET-AD   CheckThat                53.80     56.77      51.68   51.68
2    XLM-R-PRET-AD   CheckThat                42.93     39.94      36.29   34.00
3    XLM-R-PRET-AD   CheckThat + Translated   53.85     54.31      52.04   50.65
4    XLM-R-PRET-AD   CheckThat + Translated   54.81     52.56      51.59   50.29
5    XLM-R-PRET-FN   CheckThat + Translated   53.30     53.45      50.67   50.16
6    XLM-R-PRET-FN   CheckThat + Translated   43.96     29.03      35.32   29.09
7    T5-PRET         CheckThat                48.35     48.53      44.70   44.24
8    T5-PRET-CT      CheckThat                54.40     49.70      50.89   49.76
9    T5-PRET-CT-TL   CheckThat                49.18     45.59      46.14   45.26
10   T5-PRET-CT-FN   CheckThat                53.02     55.06      48.48   48.98

Since we only had a small dataset for the English language, we found that the best number of epochs for fine-tuning the XLM-R models is 30, even though 3-5 (or sometimes 5-10) epochs are usually considered sufficient, depending on the available dataset size [31]. We additionally experimented with using only the original training data as well as with adding the German translations. Table 1 shows the final results of our experiments, which we used to determine the best model for our submission. Experiment 4 (XLM-R-PRET-AD) shows the best performance in terms of accuracy. For a more detailed overview, all investigated hyperparameters are documented in Table 2. For each experiment we evaluated two learning rates, i.e., 1e-5 and 2e-5. Our assumption was that a pre-trained model needs a lower learning rate during fine-tuning because of the additional data it was pre-trained on. As shown in Table 2 and Table 1, the higher learning rate of 2e-5 resulted in significantly worse predictions on the development set in two setups. For the second pre-trained model, XLM-R-PRET-FN, we only performed experiments with the translated data, as preliminary results showed that using the translated data yielded more stable results with both learning rates.

3. Results
Our submitted models for subtask 3a and subtask 3b are both pre-trained on the large additional dataset (AD) and fine-tuned only on the CheckThat data and its translations into German (see Table 3). For subtask 3a we rank 22nd out of 25 and for the cross-lingual task we rank 4th out of 8. The best-performing models from other teams achieved an F1-score of 33.91% for subtask 3a and 29.98% for subtask 3b. These results indicate that none of the models achieves a performance that is suitable for real-world applications. An interesting observation is that our model performed better in the cross-lingual task, even without using training data in German.

Table 2: Investigated hyperparameters.

No.  Epochs  Batch Size  Learning Rate  Max. Seq. Length
1    30      8           1e-5           512
2    30      8           2e-5           512
3    30      8           1e-5           512
4    30      8           2e-5           512
5    30      8           1e-5           512
6    30      8           2e-5           512
7    7       4           1e-4           512
8    10      4           1e-4           512
9    4       4           1e-4           512
10   9       4           1e-4           512

Table 3: Results, model, and rank per subtask. The F1-score is macro-averaged and shown in percent (%).

Model           Dataset                  Task        F1     Accuracy  Rank
XLM-R-PRET-AD   CheckThat + Translated   Subtask 3a  15.48  19.93     22nd
XLM-R-PRET-AD   CheckThat + Translated   Subtask 3b  19.46  25.42     4th

Table 4: Per-class results for subtask 3a, shown in percent (%).

Class            Precision  Recall  F1
False            35.80      9.20    14.65
Other            2.81       6.45    3.92
Partially False  11.50      23.21   15.38
True             22.47      37.14   28.07
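For completeness, the reported scores (accuracy and macro-averaged precision, recall, and F1 in Table 1, as well as the per-class values in Tables 4 and 5) correspond to standard scikit-learn metrics; the following is a minimal sketch with toy labels, not the original evaluation script.

```python
# Minimal sketch of how the reported scores (accuracy plus macro-averaged
# precision, recall, and F1) can be computed with scikit-learn from lists of
# gold and predicted class labels. The label lists below are a toy example.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

y_true = ["false", "true", "partially false", "other", "false"]
y_pred = ["false", "true", "false", "partially false", "false"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy_score(y_true, y_pred):.4f} "
      f"macro P={precision:.4f} R={recall:.4f} F1={f1:.4f}")

# Per-class breakdown analogous to Tables 4 and 5.
print(classification_report(y_true, y_pred, zero_division=0))
```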
3.1. Subtask 3a: English
Table 4 shows the results per class for subtask 3a (English data). The proposed model performs best for the class True (F1: 28.07%). The classes False (F1: 14.65%) and Partially False (F1: 15.38%) are classified considerably worse. However, XLM-R fails to model the class Other at all (F1: 3.92%). The low results for some classes are probably due to the unbalanced class distribution. In general, studies have shown that even humans have difficulty distinguishing between the different fake-news-related categories, even in binary classification tasks [33].

3.2. Subtask 3b: Cross-Lingual
To train one model for both subtasks, we translated the original English data into German to train and fine-tune our multilingual XLM-R model on the specific classes. In comparison to our approach, the organizers' baseline was a standard BERT model trained on the English CheckThat dataset; for the cross-lingual task, they translated the German test data into English and then evaluated the performance [32]. Even though no German training data were available, our model performs slightly better for each class than in subtask 3a. The results are shown in Table 5.

Table 5: Per-class results for subtask 3b, shown in percent (%).

Class            Precision  Recall  F1
False            22.03      13.61   16.82
Other            9.09       7.27    8.08
Partially False  13.28      17.52   15.11
True             34.45      41.97   37.84

The per-class results follow a similar pattern as in subtask 3a: the model performs best for the True class (F1: 37.84%) and worst for the Other class (F1: 8.08%). We assume that this behavior is due to the fact that the class Other was significantly underrepresented in the training set.

4. Discussion & Conclusion
In this paper, we provide the details of our submission to the CheckThat! 2022 Lab for Task 3: Fake News Detection, which consists of two subtasks on the classification of fake content. Our experiments show that unsupervised pre-training of the XLM-R model on additional generic (not task-specific) data with more instances is more promising than using domain-specific data with fewer training instances. Our model, XLM-R-PRET-AD, achieves an F1-score of 15.48% in subtask 3a and 19.46% in subtask 3b. However, the model shows clear signs of overfitting, especially on the class Other. We conclude that using translations of the original data and using similar content for fine-tuning increases the performance of these models compared to fine-tuning them only on the provided training data. In future work, we want to compare the influence of pre-training models with more domain-specific data as opposed to general content data.

Acknowledgments
This contribution has been funded by the FFG project "Defalsif-AI" (Austrian security research programme KIRAS of the Federal Ministry of Agriculture, Regions and Tourism (BMLRT), grant no. 879670) and the project "Young People Against Online Hate: Computer-assisted Strategies for Facilitating Citizen-generated Counter Speech" (WWTF Austria, grant no. ICT-20-016).

References
[1] Z. I. Mahid, S. Manickam, S. Karuppayah, Fake news on social media: Brief review on detection techniques, in: 2018 Fourth International Conference on Advances in Computing, Communication Automation (ICACCA), 2018, pp. 1–5.
[2] S. A. Khan, M. H. Alkawaz, H. M. Zangana, The use and abuse of social media for spreading fake news, in: 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), 2019, pp. 145–148.
[3] E. Tandoc, Z. W. Lim, R. Ling, Defining "fake news": A typology of scholarly definitions, Digital Journalism 6 (2018) 137–153. doi:10.1080/21670811.2017.1360143.
[4] K. Sharma, F. Qian, H. Jiang, N. Ruchansky, M. Zhang, Y. Liu, Combating fake news: A survey on identification and mitigation techniques, ACM Trans. Intell. Syst. Technol. 10 (2019). doi:10.1145/3305260.
[5] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[6] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[7] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[8] G. K. Shahi, D. Nandini, FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19, in: Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. URL: http://workshop-proceedings.icwsm.org/pdf/2020_14.pdf.
[9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[10] Z. Khanam, B. N. Alwasel, H. Sirafi, M. Rashid, Fake news detection using machine learning approaches, IOP Conf. Ser. Mater. Sci. Eng. 1099 (2021) 012040. doi:10.1088/1757-899X/1099/1/012040.
[11] A. Gasparetto, M. Marcuzzo, A. Zangari, A. Albarelli, A survey on text classification algorithms: From text to predictions, Information 13 (2022) 83.
[12] R. Malhotra, A. Mahur, Achint, Covid-19 fake news detection system, in: 2022 12th International Conference on Cloud Computing, Data Science Engineering (Confluence), 2022, pp. 428–433. doi:10.1109/Confluence52989.2022.9734144.
[13] S. Gundapu, R. Mamidi, Transformer based automatic covid-19 fake news detection system, 2021. URL: https://arxiv.org/abs/2101.00180. doi:10.48550/ARXIV.2101.00180.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[15] S. S. Sabry, T. Adewumi, N. Abid, G. Kovacs, F. Liwicki, M. Liwicki, HaT5: Hate language identification using text-to-text transfer transformer, 2022. URL: https://arxiv.org/abs/2202.05690. doi:10.48550/ARXIV.2202.05690.
[16] Z. Liu, Y. Xu, Y. Xu, Q. Qian, H. Li, A. B. Chan, R. Jin, Improved fine-tuning by leveraging pre-training data: Theory and practice, 2022. URL: https://openreview.net/forum?id=kQns9y_JH6.
[17] M. E. Peters, S. Ruder, N. A. Smith, To tune or not to tune? Adapting pretrained representations to diverse tasks, in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Association for Computational Linguistics, Florence, Italy, 2019, pp. 7–14. URL: https://aclanthology.org/W19-4302. doi:10.18653/v1/W19-4302.
[18] M. Schütz, J. Boeck, D. Liakhovets, D. Slijepcevic, A. Kirchknopf, M. Hecht, J. Bogensperger, S. Schlarb, A. Schindler, M. Zeppelzauer, Automatic sexism detection with multilingual transformer models AIT_FHSTP@EXIST2021, in: IberLEF@SEPLN, 2021.
[19] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, Working Notes of CLEF (2021).
[20] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of covid-19 misinformation on Twitter, Online Social Networks and Media 22 (2021) 100104.
[21] G. K. Shahi, AMUSED: An annotation framework of multi-modal social media data, arXiv preprint arXiv:2010.00502 (2020).
[22] A. Ströckl, Fake news dataset German, 2020. URL: https://www.kaggle.com/datasets/astoeckl/fake-news-dataset-german.
[23] C. Bisaillon, Fake and real news dataset, 2020. URL: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?resource=download.
[24] H. Ahmed, I. Traore, S. Saad, Detecting opinion spams and fake news using text classification, Security and Privacy 1 (2018).
[25] H. Ahmed, I. Traore, S. Saad, Detection of online fake news using n-gram analysis and machine learning techniques, in: Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 127–138.
[26] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv e-prints (2019). arXiv:1910.10683.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[28] A. Conneau, G. Lample, Cross-lingual language model pretraining, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf.
[29] I. Turc, M.-W. Chang, K. Lee, K. Toutanova, Well-read students learn better: On the importance of pre-training compact models, arXiv preprint arXiv:1908.08962 (2019).
[30] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W. tau Yih, Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906.
[31] M. Schütz, A. Schindler, M. Siegel, K. Nazemi, Automatic fake news detection with pre-trained transformer models, in: A. Del Bimbo, et al. (Eds.), Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, volume 12667, Springer, Cham, 2021. doi:10.1007/978-3-030-68787-8_45.
[32] M. Schütz, M. Siegel, Baseline for CLEF 2022 – CheckThat! lab task 3, 2022. URL: https://doi.org/10.5281/zenodo.6362498. doi:10.5281/zenodo.6362498.
[33] X. Zhou, R. Zafarani, Fake news: A survey of research, detection methods, and opportunities, CoRR abs/1812.00315 (2018). URL: http://arxiv.org/abs/1812.00315. arXiv:1812.00315.