Mirela at CheckThat! 2024: Check-Worthiness of Tweets with Multilingual Embeddings and Adversarial Training
Notebook for the CheckThat! Lab at CLEF 2024

Mirela Dryankova1,*, Dimitar Dimitrov1, Ivan Koychev1 and Preslav Nakov2
1 Sofia University “St. Kliment Ohridski”, Bulgaria
2 Mohamed bin Zayed University of Artificial Intelligence, UAE

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
mireladryankova959@gmail.com (M. Dryankova); mitko.bg.ss@gmail.com (D. Dimitrov); koychev@fmi.uni-sofia.bg (I. Koychev); preslav.nakov@mbzuai.ac.ae (P. Nakov)

Abstract
Accurately assessing the credibility and significance of texts is crucial in today’s digital age, where misinformation and disinformation abound, especially on social media. In this paper, we propose an approach for estimating the check-worthiness of tweets that integrates adversarial learning techniques to optimize classification accuracy and language identification simultaneously. We fine-tune DistilBERT-multilingual and XLM-RoBERTa-base for English, Dutch, Spanish, and Arabic, allowing the models to adapt to the intricacies of different languages. Furthermore, we introduce an adversarial training approach to enhance the performance of multilingual sentence transformers, ensuring their effectiveness across linguistic contexts. The proposed approach ranks 4th in Dutch, 11th in Arabic, and 16th in English with an F1-score (positive class) of 0.65, 0.48, and 0.66, respectively.

Keywords
Check-worthiness, Misinformation, Disinformation, Social Media, Multilingual Classification, Sentence Transformers

1. Introduction

The rapid development of social media platforms in recent years has greatly accelerated information dissemination. This advancement allows society to stay up-to-date with emerging news and follow the latest trends, fostering a more informed and connected global community. For instance, people can access real-time events, participate in online discussions and seminars, and engage with content from all over the world, encouraging public sharing of opinions. Thus, social media have become one of the main communication channels for information dissemination and consumption, and nowadays many people rely on them as their primary source of news [1]. However, the ease with which information can be shared and the often unchecked nature of user-generated content have led to the wide and rapid spread of false or misleading information, which can have negative societal consequences [2]. This dual-edged situation highlights the need for effective fact-checking mechanisms to distinguish between reliable and dubious sources.

The CheckThat! Lab Task 1 [3] at the Conference and Labs of the Evaluation Forum (CLEF) 2024 focuses on developing models that automatically determine whether tweets are worth fact-checking. The task is designed to assist fact-checkers by identifying tweets that contain potentially false claims and could have a significant impact if left unchecked, thus streamlining the fact-checking process and helping to mitigate the spread of misinformation [4]. The task covers four languages (English, Spanish, Arabic, and Dutch), with data collected from Twitter. We participated in the check-worthiness sub-task with a focus on Dutch, English, and Arabic. For the submission phase, we proposed a multi-language text classification strategy that incorporates language adversarial learning into the training process of sentence transformers for Arabic and Dutch, and used BERT-base-uncased for English. Our approach mainly relies on a pre-trained DistilBERT-multilingual model, which is lighter and faster at inference time, while also requiring a smaller computational training budget [5]. For further experiments, we fine-tuned XLM-RoBERTa-base, a transformer-based multilingual masked language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling, and question answering [6]. The methodologies mentioned above are essential to automate the fact-checking process and address misinformation in diverse linguistic contexts.
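For illustration, the following minimal sketch shows how such pre-trained encoders can be loaded with the Hugging Face transformers library; the checkpoint names (distilbert-base-multilingual-cased, xlm-roberta-base, bert-base-uncased) are assumptions, as this paper names only the model families, not specific checkpoints.

```python
# Minimal sketch: loading the pre-trained encoders discussed above.
# Checkpoint names are assumptions; the paper names only the model families.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "distilbert-multilingual": "distilbert-base-multilingual-cased",
    "xlm-roberta-base": "xlm-roberta-base",
    "bert-base-uncased": "bert-base-uncased",  # English-only baseline
}

def load_encoder(name: str):
    """Load a tokenizer and encoder backbone for later fine-tuning."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # The configuration is set to output attentions and hidden states,
    # mirroring the model setup described in Section 3.2.
    model = AutoModel.from_pretrained(
        checkpoint, output_hidden_states=True, output_attentions=True)
    return tokenizer, model
```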
2. Related Work

Traditionally, fact-checking has been a manual process, relying heavily on human effort; some of the leaders in the field are FactCheck.org1, Snopes2, PolitiFact3, and FullFact4. This meticulous work is very time-consuming and labor-intensive, so the need to automate it emerged. Automated fact-checking appeared as an approach where methods of Natural Language Processing (NLP) and Machine Learning (ML) are used to assist experts in making these decisions [4]. One of the earlier efforts in this direction is ClaimBuster [8], an end-to-end system that uses machine learning, natural language processing, and database query techniques to aid in the process of fact-checking [8].

In their study [7], the authors address the check-worthiness task by comparing traditional models (Random Forest and SVM) with the transformer-based models BERT and XLM-RoBERTa. The evaluation shows that the transformer models (BERT-multilingual and XLM-RoBERTa-base) outperform the SVM and Random Forest for Dutch and English [7], while the results for Spanish are better with Random Forest. In another study [9], Fraunhofer SIT took first place in CLEF-2023 CheckThat! Task 1A and second place in CLEF-2023 CheckThat! Task 1B. To determine whether a claim in a tweet that contains both a snippet of text and an image is worth fact-checking [9], they combined BERT with an OCR analysis. To determine whether a text snippet from a political debate should be assessed for check-worthiness, the team ran multiple experiments; their best approach for this task was an ensemble classification scheme centered on model souping [9]. Another interesting approach [10] compares GPT models with BERT models and uses zero-shot, few-shot, and fine-tuning techniques in the context of the check-worthiness problem. As a result, the participants managed to outperform the winning model of CheckThat! Lab 2022 Task 1 by fine-tuning DeBERTa v3 base. Additionally, other methods have been explored, such as using Word2Vec [11], and many participants applied machine learning methods such as k-nearest neighbors [12] and gradient boosting [13]. All of the above methodologies focus specifically on addressing the check-worthiness task for English.

1 http://www.factcheck.org/
2 http://www.snopes.com/fact-check/
3 http://www.politifact.com/
4 http://fullfact.org/

3. Methodology

3.1. Data

The dataset used for the check-worthiness task is provided by the organizers of the CheckThat! Lab. The training data covers four languages (English, Spanish, Arabic, and Dutch), while the test datasets are in English, Arabic, and Dutch. The number of training examples for English and Spanish is considerably higher than for Arabic and Dutch. The datasets contain the text, ID, and link of each tweet, as well as a class label ("Yes"/"No") indicating whether or not the text is worth fact-checking. Table 1 displays the distributions of the given datasets. As can be seen, the dataset suffers from class imbalance [9]: in almost all splits, the total number of "No" labels is considerably higher than the number of "Yes" labels.
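For illustration, the class distributions in Table 1 can be obtained with a short script such as the sketch below; the file name and the column names (tweet_text, class_label) are assumptions about the organizers' tab-separated format rather than a documented specification.

```python
# Minimal sketch: counting "Yes"/"No" labels in one data split.
# File and column names are assumptions about the organizers' TSV format.
import pandas as pd

def label_distribution(path: str) -> pd.Series:
    """Return the count of each class label in a tab-separated split file."""
    df = pd.read_csv(path, sep="\t")
    return df["class_label"].value_counts()

# Example usage (hypothetical file name):
# print(label_distribution("CT24_checkworthy_english_train.tsv"))
```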
Table 1
Class label distribution for the Train, Dev, Dev-Test, and Test datasets.

Language   Class label   Train     Dev      Dev-Test   Test
Arabic     Yes            2,243      411        377      218
           No             5,090      682        123      392
           Total          7,333    1,093        500      610
Dutch      Yes              405      102        316      397
           No               590      150        350      603
           Total            995      252        666    1,000
English    Yes            5,413      238        108       88
           No            17,088      794        210      253
           Total         22,501    1,032        318      341
Spanish    Yes            3,122      704        509        -
           No            16,826    4,296      4,491        -
           Total         19,948    5,000      5,000        -

3.2. Models

We implemented a multilingual sentence classification system, designed to classify sentences across multiple languages and to ensure that the learned representations are independent of the language of the input sentences. We introduced two classes of models: Sentence Transformer, a basic sentence embedding model, and Sentence Transformer Adversarial, which extends the original model with adversarial training. The key difference lies in the additional language classification head in the Sentence Transformer Adversarial architecture, which predicts the language of the input sentence and is trained in an adversarial manner. Both models build on pre-trained transformer-based architectures, specifically DistilBERT-multilingual and XLM-RoBERTa-base. The model configuration is set to output both attention and hidden states. The architecture (Figure 1) includes a fully connected neural network to refine the representations for the classification task. During training, we employ cross-entropy loss for the classification tasks and linear scheduling of the learning rate to stabilize training and improve convergence.

For evaluation, the key metric is the F1-score with respect to the positive class, as proposed by the organizers of the CheckThat! Lab. Furthermore, accuracy, precision, and recall with respect to the positive and negative classes are also reported in this paper for a better understanding of model performance.

Figure 1: Model Pipeline.

Table 2
Performance metrics for different languages (all values in %).

Language   Model                     Class   Accuracy   Precision   Recall   F1-score
Arabic     DistilBERT-multilingual   Yes       51.97      39.07      61.47     47.77
                                     No                   68.54      46.68     55.54
           XLM-RoBERTa-base          Yes       51.80      41.08      80.28     54.35
                                     No                   76.63      35.97     48.96
Dutch      DistilBERT-multilingual   Yes       71.50      63.39      66.75     65.03
                                     No                   77.32      74.63     75.95
           XLM-RoBERTa-base          Yes       72.80      63.56      73.80     68.29
                                     No                   80.71      72.14     76.18
English    DistilBERT-multilingual   Yes       86.22      79.71      62.50     70.06
                                     No                   87.87      94.47     91.05
           XLM-RoBERTa-base          Yes       86.22      78.08      64.77     70.81
                                     No                   88.43      93.68     90.98
           BERT-base-uncased         Yes       84.46      75.12      57.95     65.80
                                     No                   86.49      93.68     89.94
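For illustration, the sketch below shows one possible implementation of the Sentence Transformer Adversarial model described in Section 3.2. The gradient-reversal layer, the mean-pooling of token embeddings, and the unweighted sum of the two losses are assumptions introduced for this sketch; the paper specifies only that a language classification head is trained in an adversarial manner alongside the check-worthiness classifier.

```python
# Minimal sketch of the "Sentence Transformer Adversarial" idea: a shared encoder
# feeds (i) a check-worthiness classifier and (ii) a language classifier trained
# adversarially, so that sentence representations carry little language information.
# The gradient-reversal mechanism and pooling strategy are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SentenceTransformerAdversarial(nn.Module):
    def __init__(self, checkpoint="distilbert-base-multilingual-cased",
                 num_labels=2, num_languages=4, dropout=0.1, lambd=1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(
            checkpoint, output_hidden_states=True, output_attentions=True)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        # Fully connected head for the check-worthiness ("Yes"/"No") decision.
        self.classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_labels))
        # Adversarial head predicting the language of the input sentence.
        self.language_head = nn.Linear(hidden, num_languages)
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token embeddings into a single sentence representation
        # (the pooling strategy is an assumption; the paper does not state it).
        mask = attention_mask.unsqueeze(-1).float()
        sentence = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        sentence = self.dropout(sentence)
        logits = self.classifier(sentence)
        # Reversed gradients push the encoder to remove language-specific cues.
        lang_logits = self.language_head(GradientReversal.apply(sentence, self.lambd))
        return logits, lang_logits


def total_loss(logits, lang_logits, labels, lang_labels):
    """Cross-entropy for both heads; the unweighted sum is an assumption."""
    ce = nn.CrossEntropyLoss()
    return ce(logits, labels) + ce(lang_logits, lang_labels)
```

With gradient reversal, the language head is trained to recognize the language of the input while the encoder receives the opposite gradient, encouraging sentence representations that carry as little language information as possible.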
3.3. Experiments

All models are trained on a Google Colab Pro T4 GPU. The T4 GPU offers significant computational capabilities, 16 GB of memory, and CUDA cores, which are crucial for our experiments. DistilBERT-multilingual is trained for 5 epochs, while XLM-RoBERTa-base is trained for 3 epochs due to its higher computational demands. The training and validation data are processed in batches of 32 and 16, respectively. Additional hyperparameters include a learning rate of 2e-5, an Adam epsilon of 1e-8, and a dropout rate of 0.1.

For the official competition submission, we provided a DistilBERT-multilingual model, which demonstrated promising results in the fact-checking task, particularly for Dutch. Initially, our approach focused exclusively on English using BERT-base-uncased. Later, we upgraded our methodology with a multilingual model. Due to time constraints, we ran only the DistilBERT-multilingual model for Arabic and Dutch and the BERT-base-uncased model for English, and submitted those results. Following this initial phase, we expanded our approach by incorporating the XLM-RoBERTa-base model. After the gold labels were released at the end of the submission period, we evaluated all the experiments and report the statistics here.

The results on the test set for all models are presented in Table 2. Our original submission model achieved the highest F1-score for the positive class in Dutch. For all languages, XLM-RoBERTa-base outperforms DistilBERT-multilingual, with the largest increase in positive-class F1-score on the Arabic dataset. Moreover, the XLM-RoBERTa-base model generally provides a better balance between precision and recall, making it slightly more reliable for identifying check-worthy tweets across multiple languages. The official results and ranking from the competition submission are presented in Table 3.

Table 3
Overall ranking on the test set of Task 1: Check-worthiness of tweets.

Language   Model                     F1 (positive class)   Rank
Arabic     DistilBERT-multilingual   0.478                 11th / 14th
Dutch      DistilBERT-multilingual   0.65                  4th / 16th
English    BERT-base-uncased         0.658                 16th / 27th

4. Conclusion

In this paper, we presented our experiments and the insights gained from the check-worthiness task at CheckThat! 2024. Our methodology employs state-of-the-art transformer models, enhanced with adversarial training techniques, to improve fact-checking accuracy across multiple languages. By incorporating language classification into the training process, we ensure that the models are capable of handling diverse linguistic inputs effectively. The proposed approach achieves 4th place in Dutch with an F1-score (positive class) of 0.65, 11th place in Arabic with 0.48, and 16th place in English with 0.66.

Overall, our study shows that XLM-RoBERTa-base consistently outperforms DistilBERT-multilingual for all languages in terms of F1-score over the positive class. The superior performance of XLM-RoBERTa-base can be attributed to its larger size and more intricate architecture, which is capable of capturing complex linguistic patterns more effectively. Moreover, XLM-RoBERTa-base benefits from extensive multilingual pretraining on a diverse corpus, enhancing its ability to generalize across different languages and understand diverse linguistic patterns. We can also conclude that our approach achieves its best results in Dutch due to the more balanced distribution of "Yes" and "No" class labels in the training dataset; in all other languages, the negative labels outnumber the positive labels by more than two to one.

Further experiments can be conducted using larger versions of the discussed models or by exploring a bigger hyperparameter space, which can potentially lead to better results. Due to resource constraints, large transformer model architectures were not used in this research.
Moreover, the current model can be expanded by incorporating additional contextual features, enhancing its capability to capture additional information from the input text and improving check-worthiness detection performance.

Acknowledgments

The work is partially financed by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project SUMMIT, No BG-RRP-2.004-0008.

References

[1] A. Perrin, Social media usage, Pew Research Center (2015) 52–68.
[2] J. Vladika, F. Matthes, Scientific fact-checking: A survey of resources and approaches (2023).
[3] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content (2024).
[4] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online) (2021).
[5] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Hugging Face (2020).
[6] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale (2019).
[7] P. Tarannum, M. A. Hasan, F. Alam, S. R. H. Noori, Z-index at CheckThat! lab 2022: Check-worthiness identification on tweet text (2022).
[8] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: the first-ever end-to-end fact-checking system (2017).
[9] R. A. Frick, I. Vogel, J.-E. Choi, Fraunhofer SIT at CheckThat! 2023: Enhancing the detection of multimodal and multigenre check-worthiness using optical character recognition and model souping (2023).
[10] M. Sawiński, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, OpenFact at CheckThat! 2023: Head-to-head GPT vs. BERT - a comparative study of transformers language models for the detection of check-worthy claims (2023).
[11] M. Z. Ullah, An ML model for predicting information check-worthiness using a variety of features (2018).
[12] B. Ghanem, M. Montes-y-Gómez, F. Rangel, P. Rosso, UPV-INAOE - Check That: Preliminary approach for checking worthiness of claims, in: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France (2018).
[13] K. Yasser, M. Kutlu, T. Elsayed, bigIR at CLEF 2018: Detection and verification of check-worthy political claims (2018).