DSHacker at CheckThat! 2024: LLMs and BERT for Check-Worthy Claims Detection with Propaganda Co-occurrence Analysis

Paweł Golik†, Arkadiusz Modzelewski 1,2,*,† and Aleksander Jochym
1 Polish-Japanese Academy of Information Technology, Poland
2 University of Padua, Italy

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
golik.pawel@gmail.com (P. Golik); arkadiusz.modzelewski@pja.edu.pl (A. Modzelewski); aleksanderjochym@gmail.com (A. Jochym)
https://amodzelewski.com/ (A. Modzelewski)
ORCID: 0009-0003-1254-6879 (P. Golik); 0009-0003-1169-831X (A. Modzelewski)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This paper presents our approach to check-worthiness detection, one of the main tasks in the CheckThat! Lab 2024 at the Conference and Labs of the Evaluation Forum. The challenge was to create a system to determine whether a claim found in Dutch and Arabic tweets or English debate snippets needs fact-checking. We explored fine-tuning pre-trained BERT-based models and employing a few-shot prompting technique with OpenAI GPT models. Our study compared monolingual models based on the BERT architecture with a multilingual XLM-RoBERTa-large model capable of processing data in multiple languages. Additionally, we investigated the link between propaganda detection and the check-worthiness of content. We also incorporated the recently released OpenAI GPT-4o model. Our systems' impressive performance, surpassing baseline results across all languages, is highlighted by our high-ranking positions: 3rd in Arabic, 2nd in Dutch, and 8th in English, with even better outcomes in post-deadline experiments.

Keywords
Check-Worthiness, Fact-Checking, XLM-RoBERTa, GPT-3.5, GPT-4o, Propaganda

1. Introduction

1.1. Problem Overview

Nowadays, information spreads from many online sources. As a result, it is crucial to ensure the information is accurate, as it affects public discourse and people's decisions. The spread of disinformation and misinformation threatens the integrity of public discussions, impacting areas such as news reporting, political debates, and social media interactions. It is therefore essential to have solid fact-checking systems.
Fact-checking is not just about verifying that something is true. It is also essential for helping people make good decisions based on accurate information and ensuring that the shared information is trustworthy [1]. Such efforts help create an environment where people can have thoughtful discussions and make deliberate choices. Due to the recent rapid advancements in Artificial Intelligence, automated systems can now offer assistance and support the fact-checking process.

1.2. Task Description

CheckThat! Lab at CLEF 2024 addresses issues that aid research and decision-making throughout the fact-checking process [2]. In the initial editions, CheckThat! Lab focused on developing an automated system to assist journalist fact-checkers during the key stages of the text verification process, which follows a structured pipeline [3, 4, 5, 6, 7]:

1. Assessing whether a document or claim is check-worthy, i.e., determining if its veracity should be checked by a journalist.
2. Retrieving previously verified claims that could aid in fact-checking the current claim.
3. Gathering further evidence from the Web, if necessary, to support the verification.
4. Making a final decision on the factual accuracy of the claim based on the collected evidence.

CheckThat! Lab Task 1 at CLEF 2024 focuses on the first step of the pipeline. Its goal is to provide an automated system for deciding whether a tweet or transcription claim needs fact-checking [2, 8]. Traditionally, this decision involves experts or human reviewers considering whether the claim can be proven true and if it could cause harm before labeling it as worth checking [2, 8].
In this scenario, we are dealing with a binary classification task. For each instance, which is a short text such as a tweet or a caption from a political debate transcription, we aim to predict one of two labels: "Yes" or "No," indicating whether the text is worth fact-checking.

1.3. Our Experiments

Our approach to check-worthiness detection involved experimenting with both monolingual and multilingual models, as well as leveraging Large Language Models (LLMs) through few-shot prompting. For the monolingual models, we fine-tuned BERT-based architectures tailored to specific languages, namely English, Dutch, and Arabic. In the multilingual approach, we utilized the XLM-RoBERTa-large model, fine-tuning it on combined datasets from multiple languages. Additionally, we employed OpenAI's GPT-3.5 and the recently released GPT-4o models for few-shot prompting, which demonstrated results comparable to the fine-tuned BERT-based models. Furthermore, we explored the relationship between propaganda detection and check-worthiness by fine-tuning models on a propaganda detection dataset before applying them to the check-worthiness task. Our comprehensive experiments enabled us to surpass baseline results and achieve high-ranking positions across different languages: 3rd in Arabic, 2nd in Dutch, and 8th in English, with even better outcomes in post-deadline experiments.

2. Related Work

Detecting disinformation has become a crucial area of research. Researchers work not only on disinformation in general but also address particular challenges associated with identifying disinformation, misinformation, and fake news. One such challenge is determining which claims are worthy of checking [7]. Hassan et al. [9] introduced a dataset from U.S. presidential debates and created classification models to differentiate among three distinct categories: check-worthy factual claims, non-factual claims, and insignificant factual claims. Jaradat et al. [10] developed ClaimRank, an online tool designed to identify check-worthy claims, with support for two languages: English and Arabic. Kartal and Kutlu [11] proposed a hybrid model that combines BERT with various features to prioritize claims based on their check-worthiness.
The detection of check-worthy claims has also been a research focus in past years within CheckThat! Labs [12, 13, 1, 7, 14]. Alam et al. [1] introduced a task in 2023 for check-worthiness detection in multimodal and multigenre content with a multilingual dataset covering three languages: Arabic, Spanish, and English. Team ES-VRAI proposed different methods based on pre-trained transformer models and sampling techniques for detecting check-worthiness in multigenre content [15]. Their approach resulted in the first position for the Arabic language [15]. Team OpenFact was the best-performing team for English [1]. They fine-tuned the GPT-3 curie model using more than 7K instances of sentences from debates and speeches annotated for check-worthiness [16].
For Spanish, the best performance was achieved by Team DSHacker [17]. Modzelewski et al. [17] presented a system based on fine-tuning XLM-RoBERTa on all languages with additional data augmentation. For data augmentation, Team DSHacker utilized the GPT-3.5 model to translate and paraphrase the available training data [17].

3. Dataset

The dataset consists of texts and their corresponding gold labels annotated by human experts, forming a multilingual dataset. It includes data in four languages: English, Spanish, Dutch, and Arabic. The dataset was divided into training (D_train), validation (D_dev), and dev-test (D_dev-test) subsets with gold labels. We made final predictions for the test datasets (D_test) published in English, Dutch, and Arabic. The English texts depict captions from political debates, while in the other languages they represent tweets. In most datasets, the classes are imbalanced, with a predominance of texts that are not check-worthy. Table 1 provides detailed information about the datasets. For more information, refer to Hasanain et al. [8].

4. Our Approach

We experimented with two approaches to text classification: fine-tuning BERT-based models and utilizing few-shot prompting with GPT models, including the recently released GPT-4o model. These methodologies are both widely used today and represent the state of the art in the industry [18]. However, deciding whether fine-tuning or in-context learning yields better results is not trivial. Many scientists have addressed the challenge of comparing the two approaches fairly [18].

Table 1
Data characteristics. For average values, we report the arithmetic mean along with the standard deviation. (The '#' symbol stands for 'count.')

Dataset    Language  #samples  %check-worthy  Avg. #chars      Avg. #words
TRAIN      EN        22,500    24.06%         97.24 ± 70.50    20.59 ± 13.96
           ES        19,948    15.65%         166.02 ± 120.10  30.57 ± 20.66
           NL        995       40.70%         188.22 ± 77.48   33.56 ± 14.73
           AR        7,333     30.59%         180.05 ± 74.23   32.03 ± 13.62
DEV        EN        1,032     23.06%         89.31 ± 67.14    19.09 ± 13.27
           ES        5,000     14.08%         208.89 ± 77.84   37.14 ± 14.52
           NL        252       40.48%         194.30 ± 75.74   35.40 ± 15.23
           AR        1,093     37.60%         164.14 ± 71.84   28.90 ± 13.27
DEV-TEST   EN        318       33.96%         67.78 ± 49.41    15.55 ± 10.54
           ES        5,000     10.18%         151.49 ± 100.27  27.93 ± 18.07
           NL        666       47.45%         193.74 ± 77.99   40.20 ± 16.85
           AR        500       75.40%         200.05 ± 77.04   35.77 ± 13.82
TEST       EN        341       25.81%         78.25 ± 58.03    17.41 ± 11.69
           NL        1,000     39.70%         221.98 ± 74.03   39.29 ± 13.65
           AR        610       35.74%         272.78 ± 30.39   46.32 ± 6.74

4.1. Fine-tuning BERT-based Models

To provide a comprehensive overview of the effectiveness of BERT-based models in identifying check-worthy content across multiple languages, we conducted experiments using two types of models. First, we employed monolingual models that we fine-tuned exclusively on the training data from a single language. Secondly, we used multilingual models that we fine-tuned on different combinations of the training datasets from multiple languages. Our experiments allowed us to compare the performance of both multilingual and monolingual models in the context of check-worthiness detection.
We started with comprehensive hyperparameter tuning for all chosen models (mono- and multilingual). This involved fine-tuning each model on the training dataset D_train using every combination of hyperparameter values we specified. We used the proposed D_dev dataset for validation. Then, we assessed each model's performance by measuring the F1 score to determine the most effective hyperparameter values.
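For concreteness, the listing below is a minimal sketch of what one such fine-tuning run could look like with the Hugging Face Trainer API. It is not our exact training code: the file names, column names, and the particular hyperparameter configuration (taken from the MULTI-ALL row of Table 5) are illustrative assumptions.

```python
# Minimal sketch (not the exact competition code) of fine-tuning one BERT-based
# checkpoint for binary check-worthiness classification with the Hugging Face Trainer.
# File names, column names, and hyperparameter values are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "FacebookAI/xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSV files with "text" and "label" (0 = No, 1 = Yes) columns.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Positive-class F1, the metric used for model selection.
    return {"f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="checkworthiness-model",
    learning_rate=1e-5,              # one point of the hyperparameter grid
    per_device_train_batch_size=32,
    num_train_epochs=4,
    warmup_steps=900,
    weight_decay=0.001,
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["validation"],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```

In the hyperparameter search, a run like this would be repeated for every combination of values in the grid, and the configuration with the best F1 score on D_dev would be retained.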
To obtain the final models for submission, we merged the training D_train and validation D_dev datasets to form the ultimate training dataset. We then retrained the model with the best hyperparameter values, using the D_dev-test dataset as the final validation set. Please refer to Appendix A, which presents the optimal hyperparameters of each final submission.

4.1.1. Monolingual Models

For each language, we chose a single pretrained monolingual model available on HuggingFace that we later fine-tuned on the corresponding language:

• ENGLISH (MONO-EN) - FacebookAI/roberta-large - a language model (355M parameters) trained on English data in a self-supervised fashion [19].
• DUTCH (MONO-NL) - DTAI-KULeuven/robbert-2023-dutch-large - the first Dutch large (355M parameters) model, trained on the OSCAR2023 dataset [20].
• ARABIC (MONO-AR) - UBC-NLP/MARBERT - trained on 1B randomly sampled Arabic tweets (with at least 3 Arabic words) from a large in-house dataset of about 6B tweets [21].

4.1.2. Multilingual Models

We also fine-tuned a multilingual FacebookAI/xlm-roberta-large [22] model on a combined dataset from all available languages (MULTI-ALL). Since Spanish was not included in the final submission, we performed another fine-tuning using only English, Dutch, and Arabic data (MULTI-NO-ES).
Additionally, we experimented with fine-tuning a multilingual model previously fine-tuned on a different, propaganda-related dataset. We first fine-tuned the FacebookAI/xlm-roberta-large model on a binary propaganda-presence classification task and then fine-tuned the model again on the CheckThat! Task 1 data (MULTI-PROP2). Refer to Section 5 for more information.

4.2. Few-shot Prompting with GPT Models

We also employed Large Language Models (LLMs) to generate check-worthiness predictions. Our experiments included OpenAI's gpt-4o (GPT-4o) and gpt-3.5-turbo-1106 (GPT-3.5) generative models. We implemented the few-shot prompting technique using the OpenAI Chat Completions API. Few-shot prompting with GPT models leverages pre-trained language models to perform specific tasks without retraining. Instead, the model is guided by a few examples and their expected responses provided within the input prompt.
Each prediction request sent to the GPT model consisted of a list of messages presented to the model. Each message contains a role and a content attribute. There are three roles available:

1. The system message helps set the behavior of the model (assistant) by providing it with context and guidelines.
2. User messages can provide exemplary requests for the assistant; in our case, example requests to evaluate a provided text's check-worthiness.
3. Assistant messages indicate the expected output of the assistant.

In our experiments, the conversation starts with a system message that clarifies the task and the concept of check-worthiness. This is followed by alternating pairs of user and assistant messages, one pair for each few-shot example: the user message asks whether the example's content is check-worthy, and the corresponding assistant message provides the gold label for the example, either 'Yes' or 'No'. The final message following these pairs is a single user message with the actual text to be classified by the model (see Appendix B). For each instance to be classified, we included four few-shot examples from the training dataset, two of which are check-worthy. The chosen few-shot examples were kept consistent within a given language.
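As a minimal sketch of this message construction, assuming the openai Python client, the request for one instance could be assembled roughly as follows. The example texts, the English prompt wording, and the decoding settings here are placeholders; the actual prompt templates we used are given in Appendix B.

```python
# Minimal sketch of the few-shot Chat Completions request structure described above.
# Example texts, prompt wording, and decoding settings are placeholders; see Appendix B
# for the actual templates.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a fact-checking expert. Your task aims to assess the check-worthiness "
    "of a presented text. Please provide only the final label: 'Yes' if the text is "
    "check-worthy and 'No' otherwise."
)
USER_TEMPLATE = (
    "Answer whether the following text is worth fact-checking. "
    "Answer using only a single English word: Yes or No. TEXT: {text}"
)

# Four few-shot examples per language (two of them check-worthy), fixed across requests.
few_shot_examples = [
    ("<non-check-worthy example tweet>", "No"),
    ("<check-worthy example tweet>", "Yes"),
    ("<non-check-worthy example tweet>", "No"),
    ("<check-worthy example tweet>", "Yes"),
]

def classify(text: str, model: str = "gpt-4o") -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, gold_label in few_shot_examples:
        messages.append({"role": "user", "content": USER_TEMPLATE.format(text=example_text)})
        messages.append({"role": "assistant", "content": gold_label})
    # The final user message carries the actual instance to classify.
    messages.append({"role": "user", "content": USER_TEMPLATE.format(text=text)})
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

print(classify("Example claim to be assessed."))
```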
The prompt templates remained consistent for both the GPT-4o and GPT-3.5 experiments.

5. Propaganda Co-occurrence Analysis

Since propaganda often involves misleading, biased, or manipulative information, such content is naturally more likely to warrant fact-checking [23, 24]. Therefore, we decided to indirectly analyze whether propaganda co-occurs with check-worthy claims. For that purpose, we predicted the presence of propaganda using a model fine-tuned on a propaganda dataset. The underlying assumption was that check-worthy statements are more likely to contain propaganda techniques, given their potential to persuade or manipulate public opinion.
We leveraged a multilingual FacebookAI/xlm-roberta-large [22] model fine-tuned by Modzelewski et al. [25] on IberLEF DIPROMATS 2024 Task 1a [26] and employed it on the check-worthiness D_dev-test dataset to evaluate this hypothesis (MULTI-PROP1). IberLEF DIPROMATS 2024 Task 1a is a binary classification task for propaganda detection in English and Spanish tweets [26].
Table 2 shows the performance metrics of the MULTI-PROP1 model calculated on the D_dev-test dataset, i.e., the propaganda predictions are scored against the check-worthiness gold labels. The relatively high precision shows that many texts with detected propaganda are indeed worth fact-checking. However, the low recall illustrates that many check-worthy texts do not contain propaganda detectable by our model. These results lead to an intuitive conclusion: propaganda often signals the need for fact-checking, but not all check-worthy statements rely on propaganda methods.
We then further fine-tuned this model on the CheckThat! Task 1 data and utilized it to predict check-worthiness (MULTI-PROP2). Due to time constraints, we did not perform hyperparameter tuning for this model. Instead, we used the hyperparameter values obtained from the search conducted during the MULTI-ALL experiment.

Table 2
Performance metrics of the MULTI-PROP1 experiment obtained on the DEV-TEST dataset.

Language  F1 Score  Accuracy  Precision  Recall
Arabic    0.2208    0.294     0.6579     0.1326
Dutch     0.3597    0.551     0.5563     0.2658
English   0.1940    0.660     0.5000     0.1204
Spanish   0.1385    0.736     0.1037     0.2083
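A minimal sketch of this co-occurrence evaluation is given below: the binary propaganda predictions from MULTI-PROP1 are scored against the check-worthiness gold labels, so precision answers "how many texts flagged as propaganda are check-worthy?" and recall answers "how many check-worthy texts are flagged as propaganda?". The file name and column names are hypothetical, not the actual data format.

```python
# Sketch of the Section 5 co-occurrence analysis: propaganda predictions (MULTI-PROP1)
# evaluated against check-worthiness gold labels, per language, as in Table 2.
# The CSV file and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

df = pd.read_csv("dev_test_with_propaganda_predictions.csv")

for language, group in df.groupby("language"):
    y_true = group["check_worthy"]          # gold check-worthiness label (1 = Yes, 0 = No)
    y_pred = group["propaganda_predicted"]  # binary propaganda prediction from MULTI-PROP1
    print(language,
          f"F1={f1_score(y_true, y_pred):.4f}",
          f"Acc={accuracy_score(y_true, y_pred):.4f}",
          f"P={precision_score(y_true, y_pred):.4f}",
          f"R={recall_score(y_true, y_pred):.4f}")
```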
6. Results

Since the number of allowed submissions was limited to one per language, we selected the models for the final predictions based on the F1 scores obtained on the D_dev-test dataset. The models selected for the final submission are: MONO-EN, which ranked 8th on the final leaderboard for English, and MULTI-NO-ES, which ranked 2nd on the final leaderboard for Dutch and 3rd on the final leaderboard for Arabic. Table 3 provides the details of the models we submitted. Additionally, the table shows the baseline provided by the organisers of CheckThat! Lab 2024 Task 1 and the score of the best team for each language.
After the submission deadline, we experimented with Large Language Models, namely GPT-3.5 and GPT-4o. Moreover, our experiments combining knowledge from propaganda classification with check-worthiness detection through the MULTI-PROP1 and MULTI-PROP2 models also took place after the deadline; due to time constraints and strict deadlines, we could not complete them before submission. Therefore, we did not consider their results on the D_dev-test dataset when selecting models for submission. However, some of our post-deadline results are superior to our final submission results and, in the case of the Dutch language, even surpass the leaderboard winner.

Table 3
Our final results from the CheckThat! Lab Task 1 official leaderboards. The Winner, Baseline, DEV-TEST, and TEST columns report F1 scores.

Language  Model        Winner  Baseline  DEV-TEST  TEST    Official Rank
English   MONO-EN      0.8020  0.3070    0.9118    0.7600  8
Dutch     MULTI-NO-ES  0.7320  0.4380    0.6907    0.7300  2
Arabic    MULTI-NO-ES  0.5690  0.4180    0.8599    0.5380  3

Table 4 shows the results of all experiments we conducted. We report the F1 scores calculated on the D_dev-test and D_test datasets. We did not have access to the ground truth for the test data while developing our systems. Nevertheless, we calculated the D_test scores after the organizers released the labels for this dataset following the submission deadline. Each record in the DEV-TEST columns represents the F1 score yielded by a model fine-tuned once on the combined training dataset (D_train + D_dev), with D_dev-test used as the validation dataset.

Table 4
Check-worthiness classification results (F1 score) on the D_dev-test and D_test sets for all languages.

Language  Model        DEV-TEST  TEST   | Model        DEV-TEST  TEST
English   MONO-EN      0.9118    0.7600 | GPT-3.5      0.7343    0.6529
          MULTI-ALL    0.8932    0.7429 | GPT-4o       0.8376    0.7207
          MULTI-NO-ES  0.8867    0.7647 | MULTI-PROP2  0.8571    0.7368
Dutch     MONO-NL      0.6571    0.6182 | GPT-3.5      0.4235    0.6937
          MULTI-ALL    0.6657    0.7401 | GPT-4o       0.5844    0.7915
          MULTI-NO-ES  0.6907    0.7300 | MULTI-PROP2  0.6667    0.7336
Arabic    MONO-AR      0.7740    0.5254 | GPT-3.5      0.8640    0.5539
          MULTI-ALL    0.8040    0.5568 | GPT-4o       0.8916    0.5523
          MULTI-NO-ES  0.8599    0.5380 | MULTI-PROP2  0.8347    0.5680

6.1. Results

6.1.1. English Language Results

The submission of the MONO-EN model earned us the 8th position on the leaderboard. Notably, the difference between the winning model's F1 score and ours was small, just 0.042. Interestingly, the MULTI-NO-ES model, an xlm-roberta-large fine-tuned on English, Dutch, and Arabic data, achieved a better F1 score than our chosen submission, with a score of 0.7647. However, the difference may not be statistically significant. Most of the remaining models nearly matched our best result.

6.1.2. Dutch Language Results

In Dutch, we came very close to winning with the MULTI-NO-ES model, achieving an F1 score of 0.73, just 0.002 points behind the actual winner. Two of our post-deadline models surpassed our submission result: MULTI-PROP2, with a slightly better F1 score of 0.7336, and GPT-4o, with an impressive 0.7915. The results from GPT-4o even outperformed those of the leaderboard winner. Interestingly, the monolingual approach MONO-NL performed relatively poorly, with an F1 score of 0.6182.

6.1.3. Arabic Language Results

Similarly, we submitted the predictions of the MULTI-NO-ES model for the Arabic language. This submission secured us third place on the leaderboard. The difference in F1 scores between the MULTI-NO-ES model and the 1st ranked model on the leaderboard is 0.031. The monolingual model performed worse than the multilingual systems, but the difference is minor. Interestingly, GPT-3.5 produced one of the best results: 0.5539. Our top post-deadline model, MULTI-PROP2, nearly matched the F1 score of the leaderboard winner, with a difference of just 0.002.

6.1.4. GPT Few-shot Prompting Results

The few-shot prompting approach using GPT models showed results similar to the fine-tuning approach with BERT-based models. Across all languages, the GPT-4o model consistently outperformed or performed similarly to the GPT-3.5 model.
Specifically, for English, GPT-4o performed noticeably better than GPT-3.5 and achieved results only slightly lower than the fine-tuned BERT-based models. In the case of Dutch, GPT-4o was also superior to GPT-3.5, and it emerged as the best model among those we tested for this language. An interesting observation from our experiments is that for Dutch the differences between the D_dev-test and D_test results were more pronounced for the GPT models than for the other models. For Arabic, GPT-3.5 and GPT-4o showed nearly identical performance, comparable to the experiments using BERT-based models.

6.1.5. Propaganda Detection Transfer Learning Results

The multilingual model fine-tuned on DIPROMATS 2024 Task 1a data and then further fine-tuned on English, Dutch, Arabic, and Spanish CheckThat! 2024 Task 1 data (MULTI-PROP2) outperformed all other approaches in one case, but the difference was most probably statistically insignificant. We observed negligibly better results for Arabic and slightly worse results for the remaining languages compared to the MULTI-ALL model.

6.2. Results Discussion

Our results show that the outcomes on the D_test and D_dev-test datasets are significantly different. The explanation for this phenomenon is not straightforward. One possible reason is overfitting to the D_dev-test data when selecting the best model during the final fine-tuning. However, this does not explain the discrepancy observed with the GPT models and few-shot prompting, which were not fine-tuned yet still showed significant mismatches between the D_dev-test and D_test results. A potential reason for the differences observed with the GPT models is that the D_dev-test and D_test data may have differed significantly.
The GPT models, particularly GPT-4o, showed results comparable to the fine-tuned BERT-based models. The few-shot prompting approach with the OpenAI API offers a more straightforward solution for such tasks, delivering similar performance at a lower time cost.
The monolingual models generally performed worse than the multilingual models in our experiments. Thus, using a language-specific model is not always the best choice. Adding Spanish to the training data did not always improve performance for the multilingual models.
The model obtained by fine-tuning on the IberLEF DIPROMATS 2024 propaganda detection task and then further fine-tuning on check-worthiness data did not yield any significant performance improvement compared to the multilingual models fine-tuned exclusively on the check-worthiness data. A limitation of this approach was that we did not perform hyperparameter tuning for MULTI-PROP2; hyperparameter optimization could significantly change these results.

7. Conclusions and Future Work

In our work, we experimented with monolingual and multilingual approaches to check-worthy claims detection in various languages. For the monolingual approach, we utilized BERT models tailored to specific languages, optimizing hyperparameters and fine-tuning each model separately for each language. For the multilingual approach, we used XLM-RoBERTa-large. First, we optimized and fine-tuned it on the entire dataset. Then, we excluded Spanish from the training data in a second experiment. Additionally, we employed two LLMs, namely GPT-3.5-turbo and the recently released GPT-4o, for each language, using few-shot prompting to classify texts.
We also fine-tuned a model on the IberLEF DIPROMATS 2024 Task 1 dataset and used this model to predict whether the data from CheckThat! Lab 2024 Task 1 contained propaganda.
With this analysis, we aimed to indirectly determine whether check-worthy data also includes propaganda. We later took the mentioned model (XLM-RoBERTa-large fine-tuned on IberLEF DIPROMATS 2024 Task 1a for binary propaganda classification) and further fine-tuned it for check-worthiness classification.
Our final submissions for all languages were: English - RoBERTa-large, fine-tuned and optimized exclusively on English data; Dutch - XLM-RoBERTa-large, fine-tuned and optimized on all data except Spanish; Arabic - XLM-RoBERTa-large, fine-tuned and optimized on all data except Spanish.
From our experiments, we can conclude that GPT models, particularly GPT-4o, showed results comparable to the fine-tuned BERT-based models. Moreover, the language-specific BERT-based model performed better only on the English dataset; for the other languages, we obtained better results with the multilingual model.
Future work could incorporate a detailed linguistic analysis of texts to understand the different linguistic features among check-worthy texts and those that do not require checking. By identifying specific linguistic features and patterns, we could develop more nuanced systems that better differentiate between these types of texts. This analysis could involve examining rhetorical devices and stylistic elements that are prevalent in check-worthy claims.

8. Limitations and Ethics

We acknowledge that our research may raise a number of ethical issues. The first shortcoming is a lack of clear explainability of our models' results. Each model generates check-worthiness ratings without explanations as to why a statement was rated check-worthy or not. Users may need explanations to understand the basis for the models' decisions.
One of the most crucial tools in our research was fine-tuning BERT-based models and utilizing GPT-based Large Language Models. As a result, if the BERT- or GPT-based models were trained on data containing any bias, disinformation, or misinformation, these problems may affect the results of our experiments.
The next potential shortcoming is a possible specialization of our systems to detect only specific types of check-worthy information. Our systems may not handle more subtle content or off-topic statements well. We did not check whether the check-worthy information in the dataset was limited to a single topic, and we relied on the workshop organizers to cover more than one specific type.

References

[1] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, G. K. Shahi, W. Zaghouani, et al., Overview of the clef-2023 checkthat! lab task 1 on check-worthiness in multimodal and multigenre content, Working Notes of CLEF (2023).
[2] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[3] P. Nakov, A. Barrón-Cedeno, T. Elsayed, R. Suwaileh, L. Màrquez, W. Zaghouani, P. Atanasova, S. Kyuchukov, G. Da San Martino, Overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9, Springer, 2018, pp. 372–387.
[4] T. Elsayed, P. Nakov, A. Barrón-Cedeno, M. Hasanain, R. Suwaileh, G. Da San Martino, P. Atanasova, Overview of the clef-2019 checkthat! lab: automatic identification and verification of claims, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019, Proceedings 10, Springer, 2019, pp. 301–321.
[5] A. Barrón-Cedeno, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, Checkthat! at clef 2020: Enabling the automatic identification and verification of claims in social media, in: Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42, Springer, 2020, pp. 499–507.
[6] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, et al., Overview of the clef–2021 checkthat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21–24, 2021, Proceedings 12, Springer, 2021, pp. 264–291.
[7] P. Nakov, A. Barrón-Cedeño, G. da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, et al., Overview of the clef–2022 checkthat! lab on fighting the covid-19 infodemic and fake news detection, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022, pp. 495–520.
[8] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.
[9] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015, pp. 1835–1838.
[10] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, Claimrank: Detecting check-worthy claims in arabic and english, arXiv preprint arXiv:1804.07587 (2018).
[11] Y. S. Kartal, M. Kutlu, Re-think before you share: a comprehensive study on prioritizing check-worthy claims, IEEE Transactions on Computational Social Systems 10 (2022) 362–375.
[12] P. Atanasova, L. Marquez, A. Barron-Cedeno, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, et al., Overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims. task 1: Check-worthiness, in: CEUR Workshop Proceedings, volume 2125, CEUR-WS, 2018, pp. 1–13.
[13] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the clef-2019 checkthat! lab: Automatic identification and verification of claims. task 1: Check-worthiness (2019).
[14] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, Y. S. Kartal, F. Alam, G. Da San Martino, et al., Overview of the clef-2021 checkthat! lab task 1 on check-worthiness estimation in tweets and political debates, 2021.
[15] H. T. Sadouk, F. Sebbak, H. E. Zekiri, Es-vrai at checkthat! 2023: Analyzing checkworthiness in multimodal and multigenre (2023).
[16] M. Sawiński, K. Węcel, E. P. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, Openfact at checkthat! 2023: head-to-head gpt vs. bert - a comparative study of transformers language models for the detection of check-worthy claims, Working Notes of CLEF (2023).
[17] A. Modzelewski, W. Sosnowski, A. Wierzbicki, Dshacker at checkthat! 2023: Check-worthiness in multigenre and multilingual content with gpt-3.5 data augmentation, Working Notes of CLEF (2023).
[18] M. Mosbach, T. Pimentel, S. Ravfogel, D. Klakow, Y. Elazar, Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 12284–12314. URL: https://aclanthology.org/2023.findings-acl.779. doi:10.18653/v1/2023.findings-acl.779.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[20] P. Delobelle, F. Remy, Robbert-2023: Keeping dutch language models up-to-date at a lower cost thanks to model conversion, 2023.
[21] M. Abdul-Mageed, A. Elmadany, E. M. B. Nagoudi, ARBERT & MARBERT: Deep bidirectional transformers for Arabic, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 7088–7105. URL: https://aclanthology.org/2021.acl-long.551. doi:10.18653/v1/2021.acl-long.551.
[22] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[23] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2931–2937. URL: https://aclanthology.org/D17-1317. doi:10.18653/v1/D17-1317.
[24] S. Shaar, F. Alam, G. Da San Martino, A. Nikolov, W. Zaghouani, P. Nakov, A. Feldman, Findings of the NLP4IF-2021 shared tasks on fighting the COVID-19 infodemic and censorship detection, in: A. Feldman, G. Da San Martino, C. Leberknight, P. Nakov (Eds.), Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, Association for Computational Linguistics, Online, 2021, pp. 82–92. URL: https://aclanthology.org/2021.nlp4if-1.12. doi:10.18653/v1/2021.nlp4if-1.12.
[25] A. Modzelewski, P. Golik, A. Wierzbicki, Bilingual propaganda detection in diplomats' tweets using language models and linguistic features, in: IberLEF@SEPLN, 2024.
[26] P. Moral, J. Fraile, G. Marco, A. Peñas, J. Gonzalo, Overview of dipromats 2024: Detection, characterization and tracking of propaganda in messages from diplomats and authorities of world powers, Procesamiento del Lenguaje Natural 73 (2024).

A. Optimal Hyperparameter Values

This appendix includes the optimal hyperparameter values for our best models.

Table 5
Optimal hyperparameter values used in our models. Legend: lr - learning_rate; bs - batch_size; nte - num_train_epochs; ws - warmup_steps; wd - weight_decay.

Language  Model            lr     bs  nte  ws   wd
English   MONO-EN          10^-5  32  6    300  0.001
Dutch     MONO-NL          10^-5  8   6    12   0.040
Arabic    MONO-AR          10^-5  16  6    165  0.001
ALL       MULTI-ALL        10^-5  32  4    900  0.001
ALL       MULTI-ALL-NO-ES  10^-5  32  4    900  0.001
ALL       MULTI-PROP2      10^-5  32  4    900  0.001

B. Few-shot Prompting Templates

In this appendix, we present the prompt messages included with each text classification request. The prompts are provided only in Spanish for brevity.

1. System prompt:
You are a fact-checking expert. Your task aims to assess the check-worthiness of a presented text. As a fact-checker, you know that to decide whether a text is check-worthy, you must answer several auxiliary questions such as "Does the text contain a verifiable factual claim?" or "Is the text harmful?". Please provide only the final label: "Yes" if the text is check-worthy and "No" otherwise.

2. Pairs of user and assistant prompts for few-shot prompting:

a) Example 1:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer using only a single English word: Yes or No. TEXT: Mañana, viernes, no puedes perderte el gran acto de cierre de campaña en Madrid. A las 19.00 h en el Pabellón 1 de IFEMA (Madrid). Con Kiko Veneno y O'Funk'illo en concierto y la intervención de @Pablo_Iglesias_, @AdaColau, @Irene_Montero_, @agarzoṅ.. ¡Te esperamos!
• Assistant content: No

b) Example 2:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer using only a single English word: Yes or No. TEXT: tve_tve vuelve a quedar en evidencia. Desplaza al minuto 18 la denuncia del #CGPJ ante las críticas de Iglesias y habla de "diferencias". Exigimos al responsable de edición del telediario explicaciones y a Sánchez que deje de utilizar rtve a su antojo @Enric_Hernandez
• Assistant content: Yes

c) Example 3:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer using only a single English word: Yes or No. TEXT: El PSOE demostró su apoyo a Torra durante la moción #PorLaConvivencia Lroldansu "Cs sigue siendo el referente incontestable del Constitucionalismo en Cataluña. No podemos dejar el futuro de 7,5 millones de catalanes en manos de Torra" #ActualidadCs
• Assistant content: No

d) Example 4:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer using only a single English word: Yes or No. TEXT: Pedro Sánchez ha dado el visto bueno a la apertura de las 'embajadas' catalanas en Argentina, México y Túnez. Empiezan las cesiones a sus socios separatistas. Incrementan gasto en sus majaderías, despreciando la urgencia de las política sociales.
• Assistant content: Yes

3. Final user prompt used for the text to be classified:
Answer whether the following text in Spanish is worth fact-checking. Answer using only a single English word: Yes or No. TEXT: