Overview of the CLEF-2024 CheckThat! Lab Task 1 on Check-Worthiness Estimation of Multigenre Content

Maram Hasanain1,†, Reem Suwaileh2, Sanne Weering3, Chengkai Li4, Tommaso Caselli3, Wajdi Zaghouani5, Alberto Barrón-Cedeño6, Preslav Nakov7 and Firoj Alam1,*,†

1 Qatar Computing Research Institute, HBKU, Qatar
2 Hamad Bin Khalifa University, Qatar
3 University of Groningen, Netherlands
4 University of Texas at Arlington, USA
5 Northwestern University in Qatar, Qatar
6 DIT, Università di Bologna, Forlì, Italy
7 Mohamed bin Zayed University of Artificial Intelligence, UAE

Abstract
We present an overview of the CheckThat! Lab 2024 Task 1, part of CLEF 2024. Task 1 involves determining whether a text item is check-worthy, with a special emphasis on COVID-19, political news, and political debates and speeches. It is conducted in three languages: Arabic, Dutch, and English. Additionally, Spanish was offered for extra training data during the development phase. A total of 75 teams registered, with 37 teams submitting 236 runs and 17 teams submitting system description papers. Out of these, 13, 15 and 26 teams participated for Arabic, Dutch and English, respectively. Among these teams, the use of pre-trained transformer language models (PLMs) was the most frequent. A few teams also employed large language models (LLMs). We provide a description of the dataset and the task setup, including the evaluation settings, and a brief overview of the participating systems. As is customary in the CheckThat! lab, we release all the datasets as well as the evaluation scripts to the research community. This will enable further research on identifying relevant check-worthy content that can assist various stakeholders, such as fact-checkers, journalists, and policymakers.

Keywords
Check-worthiness, fact-checking, multilinguality

1. Introduction

Check-worthiness estimation is a crucial component of the fact-checking pipeline. It helps to alleviate the burden on fact-checkers by reducing the need to verify every claim posted or shared across multiple online and social media platforms, which contain different types of content and modalities. This content can include news reports, citizen journalism, political debates, and posts from social media platforms. Identifying and debunking misleading claims is crucial to prevent the spread of misinformation, enabling individuals to make informed decisions where false information could lead to harmful consequences. For example, in critical areas such as health, finance, natural disasters, and public policy, making well-informed decisions is especially important.

The CheckThat! 2024 lab was held in the framework of CLEF 2024 [1, 2].1 Figure 1 shows the full CheckThat! identification and verification pipeline, highlighting the six tasks targeted in this seventh edition of the lab: Task 1 on check-worthiness estimation (this paper), Task 2 on subjectivity, Task 3 on persuasion technique detection, Task 4 on detecting hero, villain, and victim from memes, Task 5 on rumor verification using evidence from authorities, and Task 6 on robustness of credibility assessment with adversarial examples.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
mhasanain@hbku.edu.qa (M. Hasanain); rsuwaileh@hbku.edu.qa (R. Suwaileh); s.weering@student.rug.nl (S. Weering); cli@uta.edu (C. Li); t.caselli@rug.nl (T. Caselli); wajdi.zaghouani@northwestern.edu (W. Zaghouani); a.barron@unibo.it (A. Barrón-Cedeño); preslav.nakov@mbzuai.ac.ae (P. Nakov); fialam@hbku.edu.qa (F. Alam)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://checkthat.gitlab.io

Figure 1: The CheckThat! lab verification pipeline. The 2024 edition of the lab covers six tasks: (T1) check-worthiness estimation (this paper), (T2) subjectivity, (T3) persuasion technique detection, (T4) detecting hero, villain, and victim from memes, (T5) rumor verification using evidence from authorities, and (T6) robustness of credibility assessment with adversarial examples.

In this paper, we describe Task 1, which asks to detect whether a given text snippet from multigenre content, in the form of a tweet or a sentence from a political debate or speech, is worth fact-checking. Check-worthiness estimation simplifies and speeds up the process of fact-checking by prioritizing the more important claims to be verified. In order to make that decision, one would need to consider questions such as "does it contain a verifiable factual claim?" or "is it harmful?", before deciding on the final check-worthiness label [3]. We provided manually annotated data in three languages: Arabic, Dutch, and English. Additionally, we included Spanish as an extra dataset. Among the various languages, English was the most popular target for participants. Across the submitted systems, pre-trained language models (PLMs) were widely used, with BERT, RoBERTa, and XLM-RoBERTa being the most popular models. Moreover, some teams used large language models (LLMs). The top-ranked systems also employed data augmentation and additional preprocessing steps.

The remainder of the paper is organized as follows: Section 2 describes the datasets released with the task. We present the evaluation setup in Section 3. Section 4 discusses the system submissions and the official results. Section 5 presents some related work. Finally, we provide some concluding remarks in Section 6.

2. Datasets

The dataset contains multigenre content in Arabic, English, Dutch, and Spanish. The Spanish subset was only offered for training purposes; the evaluation focuses on the other three languages. For all languages but English, the dataset consists of tweets collected using keywords related to a variety of topics, such as COVID-19 and vaccines, climate change, political news, and the war on Gaza. The choice of topics was language-specific and was based on current events at different points in time when the dataset was being constructed. Additionally, the Spanish subset included transcriptions from Spanish politicians, and the subset was manually annotated by professional journalists who are experts in fact-checking. To annotate the Arabic and Dutch data, we followed the scheme described by Alam et al. [3]. As for the English subset, it was sourced from the annotated dataset described by Arslan et al. [4], and consists of transcribed sentences from candidates during the US Presidential election debates.

We created the training, development and dev-test subsets for the 2024 edition by re-using all the data released in 2023 (or 2022 when the language was not run in the 2023 edition). Regarding the testing data, for Arabic we collected tweets using keywords relevant to the war on Gaza, which started in October 2023. For Dutch, we collected 1k messages posted between January 2021 and December 2022 on climate change and its associated debate.
The English test set was constructed by manually annotating transcribed sentences that did not appear in Arslan et al. [4]. Table 1 shows statistics for all languages and partitions.

Table 1
Check-worthiness in multigenre content. Statistics about the CT–CWT–24 corpus for all four languages.

Data Splits   Arabic         Dutch          English         Spanish
              Yes     No     Yes     No     Yes     No      Yes     No
Train         2,243   5,090  405     590    5,413   17,087  3,128   16,862
Dev           411     682    102     150    238     794     704     4,296
Dev-test      377     123    316     350    108     210     509     4,491
Test          218     392    397     603    88      253     -       -
Total         3,249   6,287  1,220   1,693  5,847   18,344  4,341   25,649

3. Evaluation Settings

We provided training, development, and dev-test subsets. The latter was intended to allow participants to validate their systems internally, while they could use the development set for hyper-parameter tuning and model selection. The test set was used for the final evaluation and ranking. The participants were allowed to submit multiple runs on the test set (without seeing the scores), and the last valid run was considered official.

This is a binary classification task, and we evaluate it on the basis of the F1-measure on the check-worthiness class (yes) to account for class imbalance. The data and the evaluation scripts are available online.2 The submission system was hosted on the CodaLab platform.3
2 https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task1
3 https://codalab.lisn.upsaclay.fr/competitions/18893

4. Results and Overview of the Systems

A total of 13, 15 and 26 teams submitted systems for Arabic, Dutch, and English, respectively. Table 3 reports the performance results for all systems and languages. For all languages, the participating systems outperformed the baseline, except for one team in Arabic and two teams in Dutch.

Table 2 summarizes the approaches. Transformer models were the most popular. Some teams used language-specific transformers, while others opted for multilingual ones. Several teams also used large language models, including variations of LLaMA, Mistral, Mixtral, and GPT. Standard preprocessing and data augmentation were also very common. Below, we briefly describe the systems across all languages.

Team Fired_from_NLP [11] leveraged several groups of models: classical classifiers such as Random Forest, SVM, and XGBoost; deep learning models such as LSTM and Bi-LSTM; and pre-trained language models (PLMs), including AraBERT for Arabic, RobBERT for Dutch, BERT-uncased for English, and Multilingual-BERT-uncased for all three languages. They trained and fine-tuned the models using the original datasets. Experiments showed that the PLMs outperformed all other models.

Team Fraunhofer SIT [12] proposed an adapter fusion approach that combines a task adapter model with a Named Entity Recognition (NER) adapter, offering a resource-efficient alternative to fully fine-tuned PLMs. The task adapter was trained using the original training data without any preprocessing or cleaning. This method demonstrated superior performance and achieved third place in the task.

Team Mirela [15] used the DistilBERT-multilingual and XLM-RoBERTa-base PLMs. DistilBERT-multilingual was chosen for its lightweight and fast performance during inference, as well as its low computational training requirements. XLM-RoBERTa-base was selected due to its pre-training on 100 languages, achieving state-of-the-art performance on various NLP tasks in multilingual setups. Both models were fine-tuned on the original training data for English, Spanish, Arabic, and Dutch.
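Most of the PLM-based systems follow essentially this recipe: fine-tune a (multilingual) transformer on the released training split and monitor the official metric on the development data. The following is a minimal sketch of that recipe, assuming xlm-roberta-base, TSV-formatted data files with text and label columns, and generic hyperparameters; none of these choices reproduce a specific team's configuration. The metric function computes the F1-measure on the positive (check-worthy) class, as defined in Section 3.

```python
# Minimal sketch: fine-tuning a multilingual PLM for binary check-worthiness
# classification and scoring it with F1 on the positive ("Yes") class, the
# official task metric. Model name, hyperparameters, and file format are
# illustrative assumptions, not any team's exact configuration.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"  # assumed; teams also used DistilBERT-multilingual, etc.

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Assumed TSV files with "text" and "label" columns (1 = check-worthy, 0 = not).
data = load_dataset("csv", data_files={"train": "train.tsv", "dev": "dev.tsv"},
                    delimiter="\t")
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # F1 on the positive (check-worthy) class only, matching the official metric.
    return {"f1_yes": f1_score(labels, preds, pos_label=1)}

args = TrainingArguments(output_dir="ckpt", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=data["train"],
                  eval_dataset=data["dev"], tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```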
Team SSN-NLP [19] used a range of machine learning algorithms, including Support Vector Machine (SVM), Random Forest Classifier, Logistic Regression, XGBoost Classifier, CatBoost Classifier, K-Nearest Neighbors (KNN), and Passive Aggressive Classifier. Additionally, they fine-tuned several PLMs, including BERT-base-uncased, RoBERTa-base, XLM-RoBERTa-base, and DeBERTa-v3-base. Hyperparameters were optimized using GridSearchCV on the original data. Their preprocessing pipeline included text cleaning, tokenization, stopword removal, punctuation removal, URL removal, and spelling correction. For feature extraction, they used POS tagging and dependency parsing. These features were aggregated into vectors and combined with sentence embeddings generated using the Sentence-BERT PLM. The combined features were then normalized and reduced using Principal Component Analysis (PCA) to minimize computational requirements.

Table 2
Overview of the approaches. The numbers in the language columns refer to the position of the team in the official ranking. Data aug: Data augmentation. Each § marks one of the models (DistilBERT, BERTweet, DeBERTa, RoBERTa, AraBERT, LLama 3, ALBERT, LLama2, GPT-3.5, BERTje, Gemini, Mixtral, Mistral, Electra, GEITje, XLM-r, GPT-4, GPT-3, BERT) or miscellaneous techniques (Info. Extraction, Preprocessing, Data aug, Data Pruning) used by the team.

Team                    Arabic  Dutch  English  Models and Misc
Checker Hacker [5]      -       -      14       §
CLaC [6]                -       -      25       § §
DataBees [7]            12      10     18       § § § § § § § §
DSHacker [8]            3       2      8        § § § §
FactFinders [9]         -       -      1        § § § §
FC_RUG [10]             -       6      -        §
Fired_from_NLP [11]     7       12     10       § § §
Fraunhofer SIT [12]     -       -      3        §
HYBRINFOX [13]          10      8      12       § § §
IAI Group [14]          1       3      9        § § § §
Mirela [15]             11      4      16       § §
OpenFact [16]           2       7      2        §
SemanticCuetSync [17]   5       16     6        § § §
SINAI [18]              -       -      7        § § § §
SSN-NLP [19]            -       -      13       § § § § § §
Trio_Titans [20]        -       -      19       § § § §
TurQUaz [21]            4       1      11       § § § § § §

Team FactFinders [9] fine-tuned Llama2 7b on the original training data, using prompts generated by ChatGPT. A similar performance was achieved through a two-step data pruning technique, which reduced the training data by 44% without compromising performance. The pruning involved filtering informative sentences and applying the Condensed Nearest Neighbor undersampling technique. Despite a slight performance drop (<0.5%) with the pruned dataset, results were submitted using the model fine-tuned on the original data. The models showed variability in results across different runs, so the final predictions were based on a majority vote over five iterations. Other open-source LLMs, such as Mistral, Mixtral, Llama2 13b, Llama3 8b, and CommandR, were also evaluated. Mixtral achieved the highest F1-score in the dev-test phase, followed by Llama2 7b. Due to training time considerations, Llama2 7b was used for the remainder of the study. Experiments with data expansion techniques yielded models with high precision but lower recall.

Team SemanticCuetSync [22] fine-tuned language-specific models, namely RoBERTa, AraBERT, and DistilBERT for English, Arabic, and Dutch, respectively.

Team Checker Hacker [5] employed an ensemble approach integrating BERT-base-uncased and XLM-RoBERTa to improve the detection of check-worthy claims. Preprocessing steps, including tokenization and normalization, were implemented, along with data augmentation techniques to ensure the model was exposed to varied textual representations.
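A common way to realize such an ensemble is to average the class probabilities of the individual fine-tuned models at inference time. The sketch below illustrates this with two classifiers; the checkpoint paths and the unweighted averaging scheme are assumptions for illustration, not Checker Hacker's exact configuration.

```python
# Minimal sketch: probability-averaging ensemble of two fine-tuned binary
# classifiers. The checkpoint paths are placeholders for locally fine-tuned
# models; the unweighted average is an illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = ["./bert-base-uncased-cw", "./xlm-roberta-base-cw"]  # assumed paths

models, tokenizers = [], []
for ckpt in CHECKPOINTS:
    tokenizers.append(AutoTokenizer.from_pretrained(ckpt))
    models.append(AutoModelForSequenceClassification.from_pretrained(ckpt).eval())

def predict_checkworthy(text: str) -> str:
    """Average the softmax outputs of all ensemble members and return a label."""
    probs = []
    with torch.no_grad():
        for tok, model in zip(tokenizers, models):
            enc = tok(text, return_tensors="pt", truncation=True, max_length=128)
            logits = model(**enc).logits
            probs.append(torch.softmax(logits, dim=-1))
    avg = torch.stack(probs).mean(dim=0)  # shape: (1, 2)
    return "Yes" if avg[0, 1] > avg[0, 0] else "No"

print(predict_checkworthy("The unemployment rate fell to 3.5% last month."))
```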
Team IAI Group [14] trained several PLMs. For English, RoBERTa-Large was fine-tuned, and for Dutch and Arabic, XLM-RoBERTa and GPT-3.5-Turbo were fine-tuned. The best models among them were selected based on their performance on the dev-test subsets. They reported that in some cases, GPT-4 in a zero-shot setting also performed well.

Table 3
Multigenre check-worthiness estimation. The F1 score is calculated with respect to the positive class.

Arabic
 1  IAI Group          0.569
 2  OpenFact           0.557
 3  DSHacker           0.538
 4  TurQUaz            0.533
 5  SemanticCuetSync   0.532
 6  mjmanas54          0.531
 7  Fired_from_NLP     0.530
 8  Madussree          0.530
 9  pandas             0.520
10  HYBRINFOX          0.519
11  Mirela             0.478
12  DataBees           0.460
13  Baseline           0.418
14  JUNLP              0.212

Dutch
 1  TurQUaz            0.732
 2  DSHacker           0.730
 3  IAI Group          0.718
 4  Mirela             0.650
 5  Zamoranesis        0.601
 6  FC_RUG             0.594
 7  OpenFact           0.590
 8  HYBRINFOX          0.589
 9  mjmanas54          0.577
10  DataBees           0.563
11  JUNLP              0.550
12  Fired_from_NLP     0.543
13  Madussree          0.482
14  Baseline           0.438
15  pandas             0.308
16  SemanticCuetSync   0.218

English
 1  FactFinders        0.802
 2  OpenFact           0.796
 3  Fraunhofer SIT     0.780
 4  mjmanas54          0.778
 5  ZHAW_Students      0.771
 6  SemanticCuetSync   0.763
 7  SINAI              0.761
 8  DSHacker           0.760
 9  IAI Group          0.753
10  Fired_from_NLP     0.745
11  TurQUaz            0.718
12  HYBRINFOX          0.711
13  SSN-NLP            0.706
14  Checker Hacker     0.696
15  NapierNLP          0.675
16  Mirela             0.658
18  DataBees           0.619
19  Trio_Titans        0.600
20  Madussree          0.583
21  pandas             0.579
22  JUNLP              0.541
23  Sinai and UG       0.517
24  grig95             0.497
25  CLaC               0.494
26  Aqua_Wave          0.339
27  Baseline           0.307

Team OpenFact [16] fine-tuned DeBERTa and mDeBERTa on multiple versions of the task dataset. This included training one model per language using the corresponding language train subset. The team also experimented with multilingual models by training over concatenated train subsets of all (or a subset) of the four task languages.

Team HYBRINFOX [13] developed a classification pipeline consisting of three parts: a standard language model (RoBERTa for English and multilingual BERT for the other languages), a component for extracting and encoding triples using OpenIE6 and Multi2OIE, and a merging neural network with a softmax layer for output. Early results indicated that including the triple encoding component improved performance over using the language model alone, especially for English. Challenges were noted in evaluating the approach for Dutch and Arabic due to the team's limited proficiency in these languages.

Team DSHacker [8] conducted experiments with both monolingual and multilingual approaches. For the monolingual approach, BERT models were fine-tuned for specific languages. For the multilingual approach, XLM-RoBERTa-large was used, initially optimized and fine-tuned on the entire dataset. In a subsequent experiment, Spanish was excluded from the training data. Additionally, two LLMs, GPT-3.5-turbo and the recently released GPT-4o, were employed for each language using few-shot prompting to classify texts. A model was also fine-tuned on the DIPROMATS 2024 Task 1 dataset to predict whether the data from CheckThat! Lab 2024 Task 1 contained propaganda. This analysis aimed to indirectly determine whether check-worthy data also included propaganda. The XLM-RoBERTa-large model, fine-tuned for binary propaganda classification, was further fine-tuned for check-worthiness classification.

Team FC_RUG [10] tested GEITje, an LLM for Dutch based on Mistral-7B. They experimented with different prompts, varying the learning settings (zero-shot vs. few-shot) and the personas (helpful assistant vs. fact-checker). The best model, with few-shot in-context learning, was selected based on the development data from the companion task of the CheckThat! 2022 lab edition.
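Several teams (e.g., DSHacker with GPT-3.5-turbo and GPT-4o, and FC_RUG with GEITje) thus cast check-worthiness as a few-shot prompting problem. A minimal sketch of such a setup using the OpenAI chat API is shown below; the persona, prompt wording, in-context examples, and model name are assumptions for illustration, and FC_RUG in particular ran the locally hosted GEITje model rather than a hosted API.

```python
# Minimal sketch: few-shot prompting for check-worthiness classification via a
# chat-completions API. Persona, examples, and model name are illustrative;
# requires the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are a professional fact-checker. Decide whether a sentence "
          "contains a factual claim that is worth fact-checking. "
          "Answer with exactly one word: Yes or No.")

# Hypothetical in-context examples; real systems sampled them from the training split.
FEW_SHOT = [
    ("We have created 1.5 million new jobs since January.", "Yes"),
    ("Thank you all so much for being here tonight.", "No"),
]

def classify(sentence: str, model: str = "gpt-3.5-turbo") -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for example, label in FEW_SHOT:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(model=model, messages=messages,
                                              temperature=0)
    answer = response.choices[0].message.content.strip()
    return "Yes" if answer.lower().startswith("yes") else "No"

print(classify("The new vaccine causes more deaths than the virus itself."))
```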
Team CLaC [6] approached the task as a binary classification task, leveraging an LLM (Google's Gemini4) to classify whether a sentence is True or False, without specifying the underlying task. The task was modeled as a multi-annotator scenario in which Gemini was used to create two semantically similar sentences for each test sentence. Then, Gemini was prompted to predict one of the labels, True or False, for each of the three sentences, using a single prompt. Finally, a majority vote over the three annotations was used as the final label. Additionally, to improve performance, the prompt was contextualized by providing 600 randomly selected samples from the training subset.
4 https://gemini.google.com

Team SINAI [24] attempted two different approaches: (i) RoBERTa-base was fine-tuned using the original English data, and data augmentation was tried with Spanish transcription-sourced texts; (ii) a prompting approach with GPT-3.5-turbo was conducted, involving two experiments: one concatenating previous consecutive examples from the data (using the sentence_id) and the other using only the original text. Finally, after analyzing the results obtained from both approaches, the RoBERTa-base fine-tuning approach with the original English data was selected.

Team Trio Titans [20] fine-tuned different transformer models, including DistilBERT, ALBERT, and RoBERTa, with the latter performing the best.

Team DataBees [7] fine-tuned various pre-trained models such as BERT, RoBERTa, and language-specific models like AraBERT for Arabic, along with traditional classifiers like MultinomialNB and Logistic Regression. The system was designed to work across the three languages. Their best F1 scores were achieved with DistilBERT for English, AraBERT for Arabic, and MultinomialNB for Dutch.

Team TurQUaz [21] developed different models for each language. For Arabic and English, a two-stage approach was proposed to determine check-worthy statements. This method combined a fine-tuned RoBERTa classifier with in-context learning (ICL) using multiple different instruct-tuned models. The aggregation method varied between the Arabic and English datasets. For the Dutch dataset, the fine-tuned classifier was excluded, and reliance was placed solely on in-context learning due to time constraints.

5. Related Work

5.1. Check-Worthiness in Fact-Checking

Due to the significant surge of disinformative content online, improving the capabilities of the fact-checking pipeline is paramount. As depicted in Figure 1, the first part of the pipeline is finding claims that are important to fact-check [25]. The overall idea is to help human fact-checkers seamlessly streamline their daily fact-checking activities. To address and improve the capabilities of the different components of the fact-checking pipeline, there has been a considerable surge in research, including work exploring fact-checking perspectives on fake news and associated issues [26], examining attitudes towards the detection of misinformation and disinformation [27], automating fact-checking to support human fact-checkers [28], predicting the factuality and the bias of entire news outlets [29], detecting disinformation across multiple modalities [30], and focusing on the use of abusive language on social media [31].
5.2. LLMs for the Check-Worthiness Task

Given that large language models (LLMs) have been demonstrating significant capabilities across various disciplines and many downstream NLP tasks, efforts have been made to utilize such models for detecting claims and their check-worthiness. Majer and Šnajder [32] evaluated gpt-4-turbo and demonstrated its potential for claim check-worthiness detection with minimal prompt engineering. Sawiński et al. [33] used GPT-3.5 and GPT-4 models in zero-shot and few-shot learning setups, comparing them with GPT-3, BERT, and RoBERTa-based fine-tuned models. Their findings demonstrate that the fine-tuned GPT-3 model performed the best across the different models. Abdelali et al. [34] benchmarked various open and closed models for the Arabic check-worthiness task using the CT–CWT–22 dataset [35] and demonstrated that the performance of few-shot learning using GPT-4 is relatively higher; however, it is still far from state-of-the-art performance.

Table 4
Check-worthiness tasks offered from 2018 to 2024 in different languages and content types.

CT! Lab           Content Type       Modality     Language                Papers
CT-2018 [40]      Debate             Text         Ar, En                  5
CT-2019 [41]      Debate, Web pages  Text         Ar, En                  8
CT-2020 [42]      Tweet              Text         Ar, En                  10
CT-2021 [43, 44]  Tweet, debate      Text         Ar, Bg, En, Es, Tr      10
CT-2022 [35, 45]  Tweet              Text         Ar, Bg, En, Nl, Es, Tr  13
CT-2023 [46, 47]  Tweet              Text, Image  Ar, En                  12
CT-2024           Tweet, debate      Text         Ar, En, Nl              19

5.3. Previous Editions of the Check-Worthiness Shared Task

Since the seminal work by Hassan et al. [36], the task of check-worthiness estimation has gained broader interest. The task, as proposed by Hassan et al. [36], involves assessing whether a sentence from a political debate is non-factual, trivially factual, or significantly factual enough to warrant verification. Since then, several notable studies have focused on political debates [37], tweets and transcripts from political debates [38], as well as cross-lingual studies over tweets [39]. Significant research interest has been sparked since the inception of the CLEF CheckThat! lab initiatives. The initial focus was primarily on political debates and speeches. This focus has since expanded to include social media, transcriptions, and various languages and modalities.

In Table 4, we report a summary of the check-worthiness tasks over the years from 2018 to 2024. The focus has mainly been on debates and tweets, mostly in the text modality. As for languages, Arabic and English have been offered in all editions. The number of participants and system description paper submissions has increased over the years.

6. Conclusion and Future Work

We presented an overview of Task 1 of the CLEF-2024 CheckThat! lab, which focused on check-worthiness estimation of multigenre content, covering three languages: Arabic, Dutch, and English. The task attracted significant participation, with 75 registered teams and 28 teams submitting system description papers. The majority of the participating systems leveraged transformer-based models, showcasing their effectiveness in this domain. Notable approaches included the fine-tuning of language-specific models such as AraBERT for Arabic and RobBERT for Dutch, as well as the use of multilingual models like XLM-RoBERTa.
Several teams experimented with large language models, including GPT-3.5 and Llama2, while others implemented ensemble approaches combining multiple models. Data augmentation and preprocessing techniques were widely employed to enhance performance, and some teams incorporated named entity recognition and other linguistic features into their systems. The results show significant improvements over the baselines across all languages, highlighting the progress made in check-worthiness estimation. Future work may include covering other modalities and domains.

Acknowledgments

The work of F. Alam, M. Hasanain, R. Suwaileh and W. Zaghouani is partially supported by NPRP 14C-0916-210015 from the Qatar National Research Fund, which is a part of the Qatar Research Development and Innovation Council (QRDI). The findings achieved herein are solely the responsibility of the authors.

References

[1] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, 2024, pp. 449–458.
[2] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. D. S. Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al-Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, P. Nakov, Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society, in: Findings of EMNLP 2021, 2021, pp. 611–649.
[4] F. Arslan, N. Hassan, C. Li, M. Tremayne, A benchmark dataset of check-worthy factual claims, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020, pp. 821–829.
[5] K. Chandani, D. E. Z. Syeda, Checker Hacker at CheckThat! 2024: Ensemble models for check-worthy tweet identification, in: [48], 2024.
[6] S. Gruman, L. Kosseim, CLaC at CheckThat! 2024: A zero-shot model for check-worthiness and subjectivity classification, in: [48], 2024.
[7] T. Sriram, S. Anand, Y. Venkatesh, Databees at CheckThat! 2024: Check worthiness estimation, in: [48], 2024.
[8] P. Golik, A. Modzelewski, A. Jochym, DSHacker at CheckThat! 2024: LLMs and BERT for check-worthy claims detection with propaganda co-occurrence analysis, in: [48], 2024.
[9] Y. Li, R. Panchendrarajan, A. Zubiaga, FactFinders at CheckThat! 2024: Refining check-worthy statement detection with LLMs through data pruning, in: [48], 2024.
[10] S. Weering, T. Caselli, FC_RUG at CheckThat! 2024: Few-shot learning using GEITje for check-worthiness detection in Dutch, in: [48], 2024.
[11] M. S. A. Chowdhury, A. M. Shanto, M. M. Chowdhury, H. Murad, U. Das, Fired_from_NLP at CheckThat! 2024: Estimating the check-worthiness of tweets using a fine-tuned transformer-based approach, in: [48], 2024.
[12] I. Vogel, P. Möhle, Fraunhofer SIT at CheckThat! 2024: Adapter fusion for check-worthiness detection, in: [48], 2024.
[13] G. Faye, M. Casanova, B. Icard, J. Chanson, G. Gadek, G. Gravier, P. Égré, HYBRINFOX at CheckThat! 2024: Enhancing language models with structured information for checkworthiness estimation, in: [48], 2024.
[14] P. R. Aarnes, V. Setty, P. Galuščáková, IAI group at CheckThat! 2024: Transformer models and data augmentation for checkworthy claim detection, in: [48], 2024.
[15] M. Dryankova, D. Dimitrov, I. Koychev, P. Nakov, Mirela at CheckThat! 2024: Check-worthiness of tweets with multilingual embeddings and adversarial training, in: [48], 2024.
[16] M. Sawinski, OpenFact at CheckThat! 2024: Optimizing training data selection through undersampling techniques, in: [48], 2024.
[17] A. I. Paran, M. S. Hossain, S. H. Shohan, J. Hossain, S. Ahsan, M. M. Hoque, SemanticCuetSync at CheckThat! 2024: Finding subjectivity in news article using Llama, in: [48], 2024.
[18] J. Valle Aguilera, A. J. Gutiérrez Megías, S. M. Jiménez Zafra, L. A. Ureña López, E. Martínez Cámara, SINAI at CheckThat! 2024: Stealthy character-level adversarial attacks using homoglyphs and search, iterative, in: [48], 2024.
[19] S. B. K. Giridharan, S. Sounderrajan, B. Bharathi, N. R. Salim, SSN-NLP at CheckThat! 2024: Assessing the check-worthiness of tweets and debate excerpts using traditional machine learning and transformer models, in: [48], 2024.
[20] M. Prarthna, V. V. Chiranjeev Prasannaa, M. Sai Geetha, Trio Titans at CheckThat! 2024: Check worthiness estimation, in: [48], 2024.
[21] M. E. Bulut, K. E. Keleş, M. Kutlu, TurQUaz at CheckThat! 2024: A hybrid approach of fine-tuning and in-context learning for check-worthiness estimation, in: [48], 2024.
[22] S. H. Shohan, A. I. Paran, M. S. Hossain, J. Hossain, M. M. Hoque, SemanticCuetSync at CheckThat! 2024: Finetuning transformer models for checkworthy tweet identification, in: [48], 2024.
[23] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content, in: [48], 2024.
[24] S. Stoia, J. Montañez-Collado, C. Ibáñez-Bautista, A. Montejo-Ráez, M. T. Martín-Valdivia, M. C. Díaz-Galiano, SINAI at CheckThat! 2024: Transformer-based approaches for check-worthiness classification, in: [48], 2024.
[25] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. Da San Martino, Automated fact-checking for assisting human fact-checkers, in: Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI '21, 2021, pp. 4551–4558.
[26] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: Proceedings of the 27th International Conference on Computational Linguistics, COLING '18, Association for Computational Linguistics, Santa Fe, NM, USA, 2018, pp. 3346–3359.
[27] M. Hardalov, A. Arora, P. Nakov, I. Augenstein, A survey on stance detection for mis- and disinformation identification, in: Findings of the Association for Computational Linguistics: NAACL 2022, 2022, pp. 1259–1277.
[28] G. K. Shahi, FakeKG: A knowledge graph of fake claims for improving automated fact-checking (student abstract), Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 16320–16321. doi:10.1609/aaai.v37i13.27020.
[29] P. Nakov, H. T. Sencar, J. An, H. Kwak, A survey on predicting the factuality and the bias of news media, arXiv:2103.12506 (2021).
[30] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. Da San Martino, S. Shaar, H. Firooz, P. Nakov, A survey on multimodal disinformation detection, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 6625–6643.
[31] A. Arora, P. Nakov, M. Hardalov, S. M. Sarwar, V. Nayak, Y. Dinkov, D. Zlatkova, K. Dent, A. Bhatawdekar, G. Bouchard, I. Augenstein, Detecting harmful content on online platforms: What platforms need vs. where research efforts go, ACM Computing Surveys 5 (2023).
[32] L. Majer, J. Šnajder, Claim check-worthiness detection: How well do LLMs grasp annotation guidelines?, arXiv:2404.12174 (2024).
[33] M. Sawiński, K. Węcel, E. P. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, OpenFact at CheckThat! 2023: Head-to-head GPT vs. BERT - a comparative study of transformers language models for the detection of check-worthy claims, in: CEUR Workshop Proceedings, volume 3497, 2023.
[34] A. Abdelali, H. Mubarak, S. Chowdhury, M. Hasanain, B. Mousi, S. Boughorbel, S. Abdaljalil, Y. El Kheir, D. Izham, F. Dalvi, M. Hawasly, N. Nazar, Y. Elshahawy, A. Ali, N. Durrani, N. Milic-Frayling, F. Alam, LAraBench: Benchmarking Arabic AI with large language models, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 487–520.
[35] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[36] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, 2015, pp. 1835–1838.
[37] S. Vasileva, P. Atanasova, L. Màrquez, A. Barrón-Cedeño, P. Nakov, It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '19, 2019, pp. 1229–1239.
[38] Y. S. Kartal, M. Kutlu, Re-think before you share: A comprehensive study on prioritizing check-worthy claims, IEEE Transactions on Computational Social Systems (2022).
[39] M. Hasanain, T. Elsayed, Cross-lingual transfer learning for check-worthy claim identification over Twitter, arXiv:2211.05087 (2022).
[40] P. Atanasova, L. Marquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness, CEUR Workshop Proceedings, 2018.
[41] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness, CEUR Workshop Proceedings, 2019.
[42] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. Sheikh Ali, Overview of CheckThat! 2020: Automatic identification and verification of claims in social media, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), LNCS (12260), Springer, 2020, pp. 215–236.
[43] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, 2021.
[44] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: K. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Twelfth International Conference of the CLEF Association, LNCS (12880), 2021.
[45] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[46] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, G. K. Shahi, W. Zaghouani, P. Nakov, Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF 2023, Thessaloniki, Greece, 2023.
[47] A. Barrón-Cedeño, F. Alam, T. Caselli, G. Da San Martino, T. Elsayed, A. Galassi, F. Haouari, F. Ruggeri, J. M. Struß, R. N. Nandi, G. S. Cheema, D. Azizov, P. Nakov, The CLEF-2023 CheckThat! Lab: Checkworthiness, subjectivity, political bias, factuality, and authority, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2023, pp. 506–517.
[48] G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, 2024.