<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF-2025 CheckThat! Lab Task 2 on Claim Normalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Megha Sundriyal</string-name>
          <email>meghas@iiitd.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanmoy Chakraborty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <email>Preslav.Nakov@mbzuai.ae</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Delhi</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indraprastha Institute of Information Technology Delhi</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mohamed bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>UAE</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an overview of Task 2 of the CheckThat! lab at CLEF 2025, which focuses on claim normalization. The task asks systems to transform informal and often noisy social media posts into clear, concise, and verifiable statements known as normalized claims. These capture the core factual assertion of a post, making it much easier to verify and fact-check. The task is especially relevant in multilingual and low-resource contexts, where the diversity of languages and limited labelled data pose serious challenges. Task 2 was conducted in two distinct settings: (i) monolingual, where systems were trained and tested on the same language, and (ii) zero-shot, where models had to normalize claims in a new target language without any in-language training data. The monolingual track covered thirteen languages: English, German, French, Spanish, Portuguese, Hindi, Marathi, Punjabi, Tamil, Arabic, Thai, Indonesian, and Polish. The zero-shot setting introduced seven more languages: Dutch, Romanian, Bengali, Telugu, Korean, Greek, and Czech. This structure allowed us to evaluate both language-specific performance and cross-lingual generalization. In total, 18 teams participated in Task 2, submitting 1,226 valid runs across the two settings. The submissions were evaluated using the METEOR score. Many teams leveraged transformer-based models, multilingual embeddings, and retrieval-augmented strategies. In this paper, we outline the task setup, give details about the datasets, and provide a detailed summary of the diverse approaches adopted by the participating teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Claim Normalization</kwd>
        <kwd>Social Media Posts</kwd>
        <kwd>Multilinguality</kwd>
        <kwd>Claims</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social media has revolutionized global communication, removing geographical barriers and enabling
global knowledge exchange. However, it has also become a breeding ground for misinformation,
spreading false claims quickly across languages and cultures [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These false claims jeopardize the
integrity of online discourse and public trust. For instance, they have affected various critical events,
including the 45th US Presidential Election [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the COVID-19 pandemic [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], the Russia–Ukraine
conflict [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], etc. While journalists and fact-checkers work tirelessly to ensure the accuracy of online
content, the sheer volume and linguistic diversity of social media posts make it difficult to identify
and debunk every single claim across diverse languages effectively [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In recent years, several studies
have examined the needs of fact-checkers and have identified tasks that could be automated to reduce
their manual effort and to improve the effectiveness of their work [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ]. These tasks include
looking for the source of evidence for verification [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], exploring other versions of misinformation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
and searching within existing fact-checking datasets [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Social media posts are often written in vague, informal language, frequently mixing opinions,
rhetorical questions, and incomplete thoughts. This makes it difficult to extract clear, check-worthy
claims, defined as factual statements that can be verified or disproven [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Recently, Sundriyal et al.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced the task of claim normalization, which aims to simplify a given text containing a claim,
such as a long, noisy social media post, into a concise and precise statement.
      </p>
      <p>The task is a precursor to fact-checking, distilling the essence of the claim and removing any
unnecessary information, thereby increasing the efficiency and reliability of the fact-checking process.</p>
      <p>
        Despite efforts to combat misinformation across a variety of languages [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ], research into
claim normalization has been predominantly English-centred. Task 2 of the CheckThat! lab at CLEF
2025 aims to bridge this gap by offering the task in a multilingual setting.
      </p>
      <p>
        The CheckThat! Lab aims to accelerate the development of tools and datasets that enable different
phases of the fact-checking pipeline. Since its beginning, the lab has organized several shared tasks that
represent real-world issues in misinformation detection and verification, with a focus on multilingual,
cross-domain, and practical applications. The 2025 edition of the lab included four tasks in monolingual,
multilingual, and cross-lingual settings, covering over 20 languages across these tasks [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This paper
presents Task 2 on Claim Normalization, which addresses the problem of converting informal, noisy
social media posts into clear, concise, and verifiable claims. The task plays a vital role in bridging
unstructured content with structured fact-checking workflows, especially in multilingual and
low-resource settings.
      </p>
      <p>Task Description. In this year’s CheckThat! Lab, Task 2 addressed the growing need to extract
verifiable claims from the informal language found on social media. Unlike standard fact-checking
pipelines that rely on well-formed input, our task aimed to rewrite user-generated content—often
imprecise, opinionated, or fragmented—into clear, concise, and factual statements, the way that human
fact-checkers formulate the claims they are checking.</p>
      <p>The task is especially timely and relevant in multilingual and low-resource settings. To simulate
realistic fact-checking scenarios, Task 2 was conducted in two settings:
• Monolingual: In the monolingual setting, training and development datasets are provided for the
language used for testing. The model is trained, validated, and tested on the same language,
allowing it to learn language-specific structures and patterns. The languages included in this
setup are English, German, French, Spanish, Portuguese, Hindi, Marathi, Punjabi, Tamil, Arabic,
Thai, Indonesian, and Polish.
• Zero-shot: The zero-shot setting provides only the test data for the target language, without
any corresponding training or development data for it. The participants may train their models
using data from other languages or conduct zero-shot experiments with large language models
(LLMs), evaluating the performance in the target language without prior exposure. This setup
tests the model’s ability to generalize to unseen languages. The languages in this setting are
Dutch, Romanian, Bengali, Telugu, Korean, Greek, and Czech.</p>
      <p>In the following sections, we give details about our dataset, provide a detailed overview of the
participating systems, and discuss their approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Fact-checking is critical for combating the spread of false claims. As manual fact-checking is
very time-consuming, researchers have worked on specific subtasks that can help human
fact-checkers. This encompasses a spectrum of tasks, including claim detection [
        <xref ref-type="bibr" rid="ref10 ref19">10, 19</xref>
        ], claim
check-worthiness assessment [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ], claim span identification [
        <xref ref-type="bibr" rid="ref17 ref8">8, 17</xref>
        ], claim verification [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ], etc.
      </p>
      <p>
        The proliferation of false claims on social media platforms has led to the development of specialized
systems tailored for handling informal texts from these platforms [
        <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
        ]. These systems are designed
to quickly identify and debunk potentially misleading information, allowing for timely intervention
by human fact-checkers. Within the larger context of fact-checking, claim normalization has recently
emerged as an important novel research direction. Sundriyal et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced the task of claim
normalization, which distils the key claim from long, noisy social media posts.
      </p>
      <p>
        Most existing methods aimed at combating misinformation have primarily focused on English
[
        <xref ref-type="bibr" rid="ref14 ref25 ref26">14, 26, 25</xref>
        ]. However, there has been a recent surge of interest in advancing
fact-checking techniques for various languages. Jaradat et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] developed ClaimRank, an online system
to identify sentences with credible claims in Arabic and English. Gupta and Srikumar [27] developed
X-FACT, a multilingual dataset for factual verification of real-world claims across 25 languages. Mittal
et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] released X-CLAIM, a multilingual dataset for claim span identification, consisting of 7,000
real-world claims collected from various social media platforms in five Indian languages and English. Pikuliak
et al. [28] introduced MultiClaim, a multilingual dataset for the retrieval of previously fact-checked claims.
They gathered 28k social media posts in 27 languages, 206k professional fact-checks in 39 languages,
and 31k connections between these two groups. Chang et al. [29] introduced a multilingual version
of the FEVER dataset. Over the past seven years, the CheckThat! Lab organized several multilingual
claim-related tasks as part of CLEF, gradually expanding language support and attracting an increasing
number of submissions [
        <xref ref-type="bibr" rid="ref9">30, 31, 32, 9, 33, 34</xref>
        ]. The most recent edition of the CheckThat! lab included
six tasks in fifteen languages, including Arabic, Bulgarian, English, Dutch, French, Georgian, German,
Greek, Italian, Polish, Portuguese, Russian, Slovene, Spanish, and code-mixed Hindi-English [34].
      </p>
      <p>
        Despite the growing interest in fact-checking across multiple languages, the task of claim
normalization has been largely unexplored beyond English [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This narrow focus presents challenges as
multilingual social media platforms host content in multiple languages, and thus claims originate in
many languages. Moreover, linguistic nuances and cultural contexts complicate the task, emphasizing
the need for multilingual approaches. This motivated our multilingual claim normalization task this
year.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>Below, we describe the dataset for our task, which we call mCLAN.</p>
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>3.1. Data Compilation</title>
          <p>Inspired by the principle of dataset recycling advocated by Koch et al. [35], we identified and reused four datasets,
which we repurposed for the task of claim normalization. This reduces annotation effort as well as
subjective annotation biases. Below, we describe each dataset in detail:</p>
          <p>
            (a) CLAN [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]: It contains 6,388 social media posts, each with normalized claims from various
fact-checking websites. Notably, every example in the dataset is in English. We use all the pairs of a
post and its corresponding normalized claim.
          </p>
          <p>(b) MultiClaim [28]: It contains multilingual fact-checking pairs obtained from 142 fact-checking
sites, making it the largest dataset of fact-checks released to date, encompassing 39 languages. Each
fact-checking article is represented in the dataset by its claim, title, publication date, and URL. However,
the entire text of the articles has not been published. In addition, the dataset includes relevant social
media posts with text, OCR of attached images (if any), publication date, social media platform, and
fact-checker rating for each post. We used this dataset to collect claims from fact-checking websites
and corresponding social media posts for our study. This allowed us to extract 21k post-claim pairs.
It is worth noting that we only use monolingual pairs from this dataset in our work.</p>
          <p>
            (c) X-Claim [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]: This is a multilingual dataset labeled for claim spans and includes six languages,
primarily focusing on low-resource languages. The authors collected social media posts and
corresponding claims from several fact-checking websites. They used a variety of filtering rules to eliminate
posts containing videos, Instagram reels, or excessively short or long text. Using awesome-align [36],
they found word tokens in the post sentences that matched those in the normalized claim. The claim
span was then calculated as a sequence of word tokens that began with the first aligned word token
and ended with the last aligned word token in the sentence. Given that each example in this dataset
included social media posts and the corresponding claims obtained from the fact-check sites, we used
all the examples in the dataset: 5,840 post-claim pairs in six languages.</p>
          <p>(d) Twitter Dataset [37]: The authors proposed an abstractive text summarization dataset consisting
of noisy claims from Twitter and their gold summaries, aimed at efficiently detecting previously
fact-checked claims by generating crisp queries from abstractive summaries. They crawled Twitter for URLs from
fact-checking organizations such as Snopes, PolitiFact, The Quint, etc., resulting in a preliminary collection
of Tweet and Claim Review (a short summary of the claim written by the fact-checker) pairs. Pairs with
tweets in languages other than English were discarded, as were those with only image or video content.
They also ensured that each tweet included a claim and could be textually summarized to match the
corresponding Claim Review. The final dataset only included &lt;Social Media Content, Claim Review&gt;
pairs with both components in English. We used all the 567 pairs provided in this dataset.
          </p>
          <p>To ensure the data quality of the final compiled corpus, we randomly selected 50 examples from each
language and asked native speakers to verify the posts and the corresponding normalized claims. For
languages where we could not find native speakers, we used the Google Translate API to translate the
examples into English and cross-checked their quality. We consolidated all examples from these datasets
and performed a combined analysis. Table 1 shows a few examples from our mCLAN dataset in different
languages.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Data Statistics and Analysis</title>
          <p>Through data compilation, we obtained a total of 28,012 instances in twenty languages from all datasets.
To maintain uniformity, we used the train/dev/test splits from the original datasets. For languages with
a small number of instances, e.g., around 100, we only kept the test sets with no training data. Table 2
gives details about the final dataset and the train/dev/test splits.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>To better understand the distribution of languages, we analysed the dataset linguistically. With its
diverse vocabulary and flexible syntax, English is the primary language used on several social media
platforms [38]. Thus, our dataset is also primarily composed of English examples. While German
and Dutch are less dominant, they still benefit from a shared Latin script and similar grammatical
structure. The Indic languages in the dataset encompass Hindi, Marathi, Punjabi, and Bengali. Hindi
uses the Devanagari script. Marathi also uses the Devanagari script, albeit with some diferences in the
characters. The Gurmukhi script is used for Punjabi, and Bengali is written using the Bengali script.
Due to their diverse scripts and extensive use of diacritics, Indic languages pose unique computational
challenges. The dataset also includes two members of the Dravidian language family, Tamil and
Telugu, whose scripts derive from the ancient Brahmi script.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Submissions</title>
      <p>We received submissions from 18 teams, totalling 1,226 valid runs across all the languages; 12 of these
teams submitted their working notes. Table 3 lists all teams and their ranking for each language.</p>
      <p>Baseline and Evaluation Metric. We used mT5-large as our baseline. For the monolingual setting,
we fine-tuned the model using language-specific training data, translating the instruction “Identify
the central claim in the given post: &lt; &gt;” into the language of the test claim. This allowed the
model to operate directly in the target language. We used METEOR as an evaluation measure.</p>
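      <p>As a rough illustration of how the metric behaves, the following is a simplified, exact-match-only sketch of METEOR scoring: a recall-weighted unigram F-mean discounted by a fragmentation penalty over contiguous matched chunks. The full metric (e.g., NLTK’s implementation) additionally credits stem and synonym matches, so this sketch only approximates the scores reported here.</p>

```python
def simple_meteor(hypothesis, reference):
    """Exact-match-only sketch of METEOR: unigram F-mean with a
    fragmentation penalty (real METEOR also matches stems/synonyms)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    # Greedy alignment: map each hypothesis token to the first
    # unused matching reference position.
    used = [False] * len(ref)
    align = []  # (hyp_index, ref_index) pairs, in hypothesis order
    for i, tok in enumerate(hyp):
        for j, rtok in enumerate(ref):
            if not used[j] and rtok == tok:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    precision = m / len(hyp)
    recall = m / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A "chunk" is a maximal run of adjacent hypothesis tokens that
    # align to adjacent reference tokens; fewer chunks = better order.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

      <p>For instance, a word-for-word match scores near 1.0, while a hypothesis with the right words in the wrong order is penalized through the chunk count.</p>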
      <p>Table 6 presents the results for the monolingual setup, while Table 4 reports the scores for the
zero-shot setup. Most of the teams outperformed the baseline, while dfkinit2b [39], DS@GT [40], TIFIN
[41], and AKCIT-FN [42] consistently ranked among the top-performers across most languages. Team
dfkinit2b [39] was ranked first in 6 out of 13 languages in the monolingual setting. In the zero-shot
setting, they were first across all seven unseen languages.</p>
      <sec id="sec-4-1">
        <title>4.1. Overview of the Systems</title>
        <p>Most teams used sequence-to-sequence generation strategies for claim normalization, typically relying
on transformer-based models. The most prevalent approach involved fine-tuning pretrained models
such as BART, T5, mBART, and LLaMA on monolingual data.</p>
        <p>Team dfkinit2b [39] participated in both settings, testing zero- and few-shot prompting with models
such as Gemma-3, Qwen-3, Qwen-2.5, Llama-3.3, and Mistral. They explored various prompts and used
cosine similarity to select demonstrations for few-shot learning. They also included adapter fine-tuning,
data pre-processing with language checks and emoji removal, and data augmentation via translation.
For the final submission, they ensembled top-performing model outputs by computing embedding
centroids with multilingual SentenceTransformers and selecting claims closest to these centroids.</p>
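        <p>The centroid-based selection step can be sketched as follows. The vectors below stand in for multilingual SentenceTransformer embeddings, and the function name is ours, not the team’s; the sketch only shows the selection logic.</p>

```python
import numpy as np

def pick_centroid_candidate(candidates, embeddings):
    """Given one candidate claim per system and a sentence embedding for
    each, return the candidate whose embedding lies closest (by cosine
    similarity) to the centroid of all candidate embeddings."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    centroid = X.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = X @ centroid  # cosine similarity of each candidate to the centroid
    return candidates[int(np.argmax(sims))]
```

        <p>Intuitively, the candidate nearest the centroid is the one most “agreed upon” by the ensemble, which filters out outlier generations.</p>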
        <p>
          Team DS@GT [40] embedded the unnormalized claims from the pooled train and development
datasets, as well as from the test set, using state-of-the-art embeddings for each language. For testing,
a GPT-4o mini model was prompted following the approach discussed in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], using the top-3 most
similar examples from the train and development sets as in-context examples. The final response for the
monolingual task was derived by combining the best-matching answer from the train and development
sets, based on cosine similarity, and the output of the GPT-4 model. For zero-shot, they used a modified
version of CACN [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], essentially using the prompting method with standard examples.
        </p>
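        <p>A minimal sketch of such retrieval-based in-context example selection is shown below. The embeddings, pair format, and prompt wording are illustrative assumptions, not the team’s exact setup.</p>

```python
import numpy as np

def top_k_examples(query_vec, train_vecs, train_pairs, k=3):
    """Retrieve the k training (post, normalized_claim) pairs whose
    embeddings are most cosine-similar to the query post's embedding."""
    Q = np.asarray(query_vec, dtype=float)
    T = np.asarray(train_vecs, dtype=float)
    sims = (T @ Q) / (np.linalg.norm(T, axis=1) * np.linalg.norm(Q))
    order = np.argsort(-sims)[:k]  # indices of the k most similar examples
    return [train_pairs[i] for i in order]

def build_prompt(post, examples):
    """Assemble a few-shot prompt from retrieved examples (wording here
    is illustrative, not the team's template)."""
    lines = [f"Post: {p}\nNormalized claim: {c}" for p, c in examples]
    lines.append(f"Post: {post}\nNormalized claim:")
    return "\n\n".join(lines)
```

        <p>The retrieved pairs serve as in-context demonstrations, so the model sees normalizations of posts similar to the one being processed.</p>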
        <p>Team TIFIN [41] fine-tuned Qwen-14B using LoRA with 4-bit precision for efficiency. They
preprocessed the data by filtering meaningful post-claim pairs, removing duplicates, and creating a unified
multilingual dataset. Instruction-based fine-tuning incorporated Chain-of-Thought prompting with
5W1H questions to guide claim extraction. During inference, context resolution replaced partial posts
with complete ones, and few-shot prompting with similar examples improved claim structure. This
approach aimed to boost claim extraction accuracy and multilingual performance.</p>
        <p>Team AKCIT-FN [42] adopted a dual-strategy approach tailored to data availability. For the 13
supervised languages, they fine-tuned various language-specific and multilingual Small Language
Models (SLMs) such as PTT5, AraT5, and Varta T5. For the seven zero-shot languages, they used
prompting with Large Language Models (LLMs) such as the GPT series, Gemini, and Qwen 2.5. Their
methodology also included a data cleaning algorithm to remove repetitive content and trailing None
placeholders, as well as cross-split deduplication. Few-shot prompting experiments for monolingual
settings involved selecting examples randomly, based on difficulty (METEOR score), or using HDBSCAN
cluster prototypes for semantic diversity.</p>
        <p>Team Factiverse and IAI [43] focused on the monolingual setting, comparing four main approaches:
zero-shot prompting, fine-tuning, Fixed In-Context Learning (FICL), and Adaptive In-Context Learning
(AICL). For the ICL methods, they used a ChromaDB vector store with all-MiniLM-L6-v2 embeddings
to retrieve semantically similar examples from the training data based on cosine distance. While FICL
used a fixed number of top-K examples, the team’s novel AICL approach dynamically selected examples
by applying a cosine distance threshold, eliminating the need to pre-determine the number of shots.
They also explored data augmentation via machine translation for low-resource languages.</p>
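        <p>The AICL idea of thresholding on cosine distance, rather than fixing the number of shots in advance, can be sketched as follows; the threshold value and names are illustrative, not the team’s configuration.</p>

```python
import numpy as np

def adaptive_examples(query_vec, train_vecs, train_pairs, max_dist=0.4):
    """Adaptive In-Context Learning sketch: keep every training example
    whose cosine distance to the query is below a threshold, so the
    number of shots varies per query instead of being fixed (top-K)."""
    Q = np.asarray(query_vec, dtype=float)
    T = np.asarray(train_vecs, dtype=float)
    sims = (T @ Q) / (np.linalg.norm(T, axis=1) * np.linalg.norm(Q))
    dists = 1.0 - sims            # cosine distance
    nearest_first = np.argsort(dists)
    return [train_pairs[i] for i in nearest_first if dists[i] < max_dist]
```

        <p>A query with many close neighbours in the training data thus receives many demonstrations, while an unusual query may receive none.</p>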
        <p>The MMA team [44] focused on the monolingual setting, exploring several model architectures and
training strategies. Their approaches included fine-tuning a unified multilingual umt5 model on all
languages, as well as training separate umt5 models for each language. They also tested zero-shot
prompting with Qwen2.5 models and employed a parameter-eficient fine-tuning (PEFT) method using
LoRA, which involved a two-stage process of first extracting key points and then generating a claim from
those points. For Arabic, they conducted specific experiments by fine-tuning ara-t5 and augmenting
the training data with scraped post-claim pairs from the Google Fact Check Tools API.</p>
        <p>The UmuTeam [49] used a generative approach based on the Flan-T5-Base model for the Claim
Extraction and Normalization task. Their strategy varied based on the data setting: for the monolingual
scenarios, they fine-tuned a separate instance of Flan-T5-Base for each language, using only that
language’s specific training data to allow the models to specialize. For the zero-shot languages, they
fine-tuned a single Flan-T5-Base model on the concatenated training data from all other languages,
aiming to leverage cross-lingual transfer for generalization.</p>
        <p>The UNH team [45] only experimented with the English language. Their fine-tuning experiments
included fully fine-tuning a Flan-T5 Large model, using LoRA for a Flan-T5 Base model, and fine-tuning
a DeepSeek-Llama-8b model. Their prompting strategies involved few-shot prompting with
keyword-based example selection, iterative self-refinement to improve claim quality, and a Max Multi-Prompt
method that simulated choosing the best output from several targeted prompts.</p>
        <p>Saivineetha [50] focused on Hindi and Telugu. For Hindi, which was in the monolingual setting, they
performed Parameter-Efficient Fine-Tuning (PEFT) using QLoRA with 4-bit quantization on the Gemma
2 9B instruct model. The model was instruction fine-tuned on the provided Hindi dataset of posts and
normalized claims. For Telugu, which was in the zero-shot setting, they used zero-shot prompting with
the Gemma 3 12B instruct model, using a prompt template designed to convert unstructured Telugu
posts into normalized claims.</p>
        <p>The JU_NLP@M&amp;S team [48] framed the claim normalization task as a monolingual
sequence-to-sequence generation problem, centered on fine-tuning a BART-Large transformer model. Their
methodology included a preprocessing module for tokenization using byte-level BPE, padding inputs
to a fixed length, and truncating where necessary. Model training was conducted for 5 epochs using
Hugging Face’s Seq2SeqTrainer, employing mixed-precision (FP16) to optimize memory usage and a
learning rate of 3e-5. For inference, they used beam search with four beams to enhance the quality of
the generated claims.</p>
        <p>Team Investigators [46] focused on the claim normalization task by fine-tuning several models,
including LLaMA-3.2, BART, and T5, with a particular focus on the flan-t5-base model for the final
submission. Their methodology was primarily monolingual, with extensive experiments on the English
and Spanish datasets. Before training, they implemented a pre-processing pipeline to filter out records
that were not in the target language. For the zero-shot setting, they experimented with cross-lingual
transfer by training a model on the Spanish dataset and then evaluating it on the Korean test data.</p>
        <p>Team OpenFact [47] experimented with several decoder-only LLMs, including LLaMA 3.1,
DeepSeek-R1, and GPT-4.1-mini. Their methodology had three steps: (1) generating up to three initial claim
candidates, (2) iteratively refining each candidate using a self-reflection technique where the model
provides feedback on its output, and (3) using an LLM as a judge to select the best among the refined
candidates. They also performed supervised fine-tuning on the GPT-4.1-mini model using the cleaned
training data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion of Approaches</title>
      <p>The participating teams in CheckThat! 2025 Task 2 tried several strategies for multilingual claim
normalization. These approaches can be analyzed along four major dimensions: model architecture,
fine-tuning vs. in-context learning paradigms, data handling, and performance across monolingual and
zero-shot settings. An overview of the approaches is given in Table 5.</p>
      <sec id="sec-5-1">
        <title>5.1. Model Architectures</title>
        <p>The primary area of divergence among the teams was their choice of model architecture. Some teams
handled the task as a typical sequence-to-sequence problem, using encoder-decoder models that excel
at summarization. For instance, the JU_NLP@M&amp;S team fine-tuned BART-Large for monolingual
text-to-text generation. UmuTeam, Investigators, and MMA explored variants of T5, including multilingual
models such as Flan-T5 and UMT5. In contrast, other teams used decoder-only large language models
to leverage their in-context learning and reasoning abilities. OpenFact evaluated models such as
LLaMA 3.1, DeepSeek-R1, and GPT-4.1-mini. Similarly, TIFIN and dfkinit2b chose Qwen models for their
multilingual performance and efficiency in fine-tuning. This distinction highlights the trade-off between
the recognised strengths of encoder-decoder models in generation tasks and the growing potential of
decoder-only models for flexible reasoning.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Adaptation Strategies</title>
        <p>Fine-tuning was a common choice among the participating teams. Several teams employed
parameter-efficient fine-tuning (PEFT) approaches, such as LoRA or QLoRA. For example, Saivineetha fine-tuned
Gemma 2 for Hindi, while TIFIN and dfkinit2b applied LoRA to Qwen models. OpenFact’s supervised
fine-tuning of GPT-4.1-mini was reported to be their most effective configuration. In contrast, teams
using decoder-only models emphasized in-context learning (ICL). DS@GT used retrieval-based ICL,
pulling the top-3 similar examples from the training set as dynamic prompts. dfkinit2b also adopted semantic
similarity-driven selection for few-shot prompts. Factiverse used Adaptive In-Context Learning (AICL),
which dynamically modifies the number of in-context examples depending on similarity thresholds.
TIFIN implemented a 5W1H prompting strategy, structuring claim-related information into six categories
(Who, What, Where, When, Why, and How) to guide model reasoning. OpenFact and UNH also used
self-refinement, where an LLM iteratively critiques and improves its outputs.</p>
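        <p>A 5W1H-style prompt of the kind described above might be assembled as follows; the wording is illustrative, not TIFIN’s actual template.</p>

```python
def build_5w1h_prompt(post):
    """Illustrative Chain-of-Thought prompt in the 5W1H style: the model
    first answers the six questions, then states the central claim."""
    questions = ["Who is involved?", "What is claimed?", "Where did it happen?",
                 "When did it happen?", "Why is it claimed?", "How did it happen?"]
    steps = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return (f"Post: {post}\n"
            f"First answer the following questions about the post:\n{steps}\n"
            "Then state the single central, verifiable claim in one sentence.")
```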
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Data Handling and Hybrid Methods</title>
        <p>Due to the noisy nature of social media data, preprocessing becomes crucial. OpenFact used
GPT-4.1-mini to filter out training instances that mismatched the ground truth. To augment
the training data, MMA scraped additional Arabic samples using Google’s Fact Check Tools API, while
Investigators used the Gemini API to generate synthetic examples. DS@GT created a retrieval-first
pipeline that reused existing normalizations when similar posts were found. dfkinit2b employed an
ensemble method to generate claims based on five different approaches. The output closest to the
centroid of all created embeddings was then chosen. This strategy worked well in both monolingual
and zero-shot settings.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Adapting to Monolingual and Zero-Shot Scenarios</title>
        <p>In the monolingual setting, where training data for 13 languages was available, the participating teams
either trained language-specific models or used the data to retrieve information for ICL. Saivineetha,
for example, trained a dedicated Hindi model, while DS@GT and Factiverse retrieved similar examples
to construct prompts dynamically. In the zero-shot setting, the teams had to rely on cross-lingual
generalization. UmuTeam and MMA developed multilingual models based on merged monolingual data
and applied them to zero-shot languages. Other teams, such as TIFIN and DS@GT, used English-centric
prompts and relied on the inherent multilingual capacity of the LLMs to handle the target languages.
Among the most effective zero-shot strategies were dfkinit2b’s ensemble approach and OpenFact’s
fine-tuned GPT-4.1-mini, both of which performed consistently well across languages without labeled
data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented a detailed overview of Task 2 from the CheckThat! Lab at CLEF 2025. It focused on
claim normalization, the task of transforming informal and noisy social media content into clear,
concise, and verifiable statements. In total, 18 teams participated in the task. Most of the participants
used Transformer-based models, with a clear trend towards leveraging large language models from
the T5, Qwen, and Llama families. Common and effective strategies included parameter-efficient
fine-tuning, retrieval-augmented in-context learning, and sophisticated data preprocessing. The dual
setting for monolingual and zero-shot evaluation provided a valuable framework for assessing both
language-specific adaptation and cross-lingual generalization.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>In this study, we employed mT5-large as the baseline system. All experiments were carried out under
controlled conditions. To help with spell check suggestions, OpenAI GPT-4o was accessed through
a plugin on Overleaf. The authors thoroughly evaluated and edited all of the tool’s suggestions. No
generative AI tools were used to generate the content of the main manuscript. The authors take full
responsibility for the final content of the publication.</p>
      <p>Social Media, PNAS Nexus 3 (2024) pgae217.
[27] A. Gupta, V. Srikumar, X-Fact: A New Benchmark Dataset for Multilingual Fact Checking, in:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers),
Association for Computational Linguistics, Online, 2021, pp. 675–682.
[28] M. Pikuliak, I. Srba, R. Moro, T. Hromadka, T. Smoleň, M. Melišek, I. Vykopal, J. Simko, J. Podroužek,
M. Bielikova, Multilingual Previously Fact-Checked Claim Retrieval, in: H. Bouamor, J. Pino,
K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, Singapore, 2023, pp. 16477–16500.
[29] Y.-C. Chang, C. Kruengkrai, J. Yamagishi, XFEVER: Exploring Fact Verification across Languages,
in: Proceedings of the 35th Conference on Computational Linguistics and Speech Processing
(ROCLING 2023), 2023, pp. 1–11.
[30] P. Nakov, A. Barrón-Cedeno, T. Elsayed, R. Suwaileh, L. Màrquez, W. Zaghouani, P. Atanasova,
S. Kyuchukov, G. Da San Martino, Overview of the CLEF-2018 CheckThat! Lab on Automatic
Identification and Verification of Political Claims, in: Experimental IR Meets Multilinguality,
Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018,
Avignon, France, September 10-14, 2018, Proceedings 9, Springer, 2018, pp. 372–387.
[31] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeno, T. Elsayed, M. Hasanain, R. Suwaileh,
F. Haouari, G. Da San Martino, et al., Overview of CheckThat! 2020 English: Automatic
Identification and Verification of Claims in Social Media., CLEF (Working Notes) 2696 (2020).
[32] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari,
M. Hasanain, W. Mansour, et al., Overview of the CLEF-2021 CheckThat! Lab on Detecting
Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News, in: Proceedings of the
12th International Conference of the CLEF Association: Information Access Evaluation Meets
Multilinguality, Multimodality, and Visualization, CLEF ’2021, Bucharest, Romania (online), 2021,
pp. 264–291.
[33] A. Barrón-Cedeño, F. Alam, A. Galassi, G. Da San Martino, P. Nakov, T. Elsayed, D. Azizov,
T. Caselli, G. S. Cheema, F. Haouari, et al., Overview of the CLEF–2023 CheckThat! Lab on
Checkworthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and
their Source, in: International conference of the cross-language evaluation forum for European
languages, Springer, 2023, pp. 251–275.
[34] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli,
G. Da San Martino, F. Haouari, et al., Overview of the CLEF-2024 CheckThat! Lab:
CheckWorthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness, in:
International Conference of the Cross-Language Evaluation Forum for European Languages, Springer,
2024, pp. 28–52.
[35] B. Koch, E. Denton, A. Hanna, J. G. Foster, Reduced, Reused and Recycled: The Life of a Dataset
in Machine Learning Research, in: Thirty-fifth Conference on Neural Information Processing
Systems Datasets and Benchmarks Track (Round 2), 2021.
[36] Z.-Y. Dou, G. Neubig, Word Alignment by Fine-tuning Embeddings on Parallel Corpora, in:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational
Linguistics: Main Volume, 2021, pp. 2112–2128.
[37] V. Bhatnagar, D. Kanojia, K. Chebrolu, Harnessing Abstractive Summarization for Fact-Checked
Claim Detection, in: Proceedings of the 29th International Conference on Computational
Linguistics, 2022, pp. 2934–2945.
[38] A. Petrosyan, Most used languages online by share of websites 2024, 2024. Accessed: 01 June 2024.
[39] T. Anikina, I. Vykopal, S. Kula, R. K. Chikkala, N. Skachkova, J. Yang, V. Solopova, V. Schmitt,
S. Ostermann, dfkinit2b at CheckThat! 2025: Leveraging LLMs and Ensemble of Methods for
Multilingual Claim Normalization, in: [51], 2025.
[40] A. Pramov, J. Ma, B. Patel, DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed</p>
      <p>Framework for Claim Normalization, in: [51], 2025.
[41] M. Sharma, A. Suneesh, M. Jain, P. K. Rajpoot, P. Devadiga, B. Hazarika, A. Shrivastava, K.
Gurumurthy, A. B. Suresh, A. U. Baliga, TIFIN at CheckThat! 2025: Reasoning-Guided Claim
Normalization for Noisy Multilingual Social Media Posts, in: [51], 2025.
[42] F. L. N. Almada, K. D. P. Mariano, M. A. Dutra, V. E. d. S. Monteiro, J. R. S. Gomes, A. R. Galvão Filho,
A. d. S. Soares, Akcit-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for
Multilingual Claim Normalization, in: [51], 2025.
[43] P. Amatya, V. Setty, Factiverse and IAI at CheckThat! 2025: Adaptive ICL for Claim Extraction, in:
[51], 2025.
[44] M. Saeed, M. Yasser, M. Torki, N. Elmakky, MMA at CheckThat! 2025: Multilingual Claim</p>
      <p>Normalization of Social-Media Posts, in: [51], 2025.
[45] J. Wilder, N. Kadapala, Y. Xu, M. Alsaadi, M. Rogers, P. Agrawal, A. Hassick, L. Dietz, UNH at</p>
      <p>Check That! 2025: Fine-tuning Vs Prompting, in: [51], 2025.
[46] S. M. A. Hashmi, S. Aamir, M. Anas, T. Usmani, F. Alvi, A. Samad, Investigators at CheckThat!
2025: Using LLMs to Improve Fact-Checking, in: [51], 2025.
[47] M. Sawiński, K. Węcel, E. Księżniak, OpenFact at CheckThat! 2025: Application of self-reflecting
and reasoning LLMs for fact-checking claim normalization, in: [51], 2025.
[48] M. Mondal, S. Saha, D. Saha, D. Das, JU_NLP@M&amp;S at CheckThat! 2025: Automated Claim
Extraction and Normalization for Misinformation Detection in Social Media Content, in: [51],
2025.
[49] T. B. Beltrán, R. Pan, J. A. García Díaz, R. Valencia García, UmuTeam at CheckThat! 2025:</p>
      <p>Language-specific versus multilingual models for Fact-Checking, in: [51], 2025.
[50] S. V. Baddepudi Venkata Naga Sri, Saivineetha at CheckThat! 2025: Exploring Fine-Tuning and</p>
      <p>Zero-Shot Approaches for Claim Normalization, in: [51], 2025.
[51] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs
of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.</p>
      <table-wrap>
        <label>Thai</label>
        <table>
          <thead>
            <tr><th>Team</th><th>METEOR</th></tr>
          </thead>
          <tbody>
            <tr><td>DS@GT</td><td>0.5859</td></tr>
            <tr><td>AKCIT-FN</td><td>0.3179</td></tr>
            <tr><td>dfkinit2b</td><td>0.2999</td></tr>
            <tr><td>Baseline</td><td>0.2015</td></tr>
            <tr><td>Factiverse and IAI</td><td>0.0965</td></tr>
            <tr><td>OpenFact</td><td>0.0872</td></tr>
            <tr><td>aryasuneesh</td><td>0.0464</td></tr>
            <tr><td>UmuTeam</td><td>0.0147</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Arabic</label>
        <table>
          <thead>
            <tr><th>Team</th><th>METEOR</th></tr>
          </thead>
          <tbody>
            <tr><td>dfkinit2b</td><td>0.5037</td></tr>
            <tr><td>DS@GT</td><td>0.5035</td></tr>
            <tr><td>MMA</td><td>0.4584</td></tr>
            <tr><td>OpenFact</td><td>0.4175</td></tr>
            <tr><td>TIFIN</td><td>0.3705</td></tr>
            <tr><td>AKCIT-FN</td><td>0.3277</td></tr>
            <tr><td>Factiverse and IAI</td><td>0.2457</td></tr>
            <tr><td>Baseline</td><td>0.2186</td></tr>
            <tr><td>UmuTeam</td><td>0.0003</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muhammed T</surname>
          </string-name>
          , S. K. Mathew,
          <article-title>The Disaster of Misinformation: A Review of Research in Social Media</article-title>
          ,
          <source>International Journal of Data Science and Analytics</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>271</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Allcott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gentzkow</surname>
          </string-name>
          ,
          <article-title>Social Media and Fake News in the 2016 Election</article-title>
          ,
          <source>Journal of Economic Perspectives</source>
          <volume>31</volume>
          (
          <year>2017</year>
          )
          <fpage>211</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Durrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Homaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Danoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stolk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bruntink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, FactCheckers, Social Media Platforms, Policy Makers, and the Society</article-title>
          , in: M.-F. Moens,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          Association for Computational Linguistics
          , Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>611</fpage>
          -
          <lpage>649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beltrán</surname>
          </string-name>
          ,
          <article-title>The CLEF-2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection</article-title>
          , in:
          <source>Proceedings of the 44th European Conference on IR Research: Advances in Information Retrieval, ECIR '22</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2022</year>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Khaldarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantti</surname>
          </string-name>
          , Fake news,
          <source>Journalism Practice</source>
          <volume>10</volume>
          (
          <year>2016</year>
          )
          <fpage>891</fpage>
          -
          <lpage>901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Tsang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>How Fact-Checkers Delimit Their Scope of Practices and Use Sources: Comparing Professional and Partisan Practitioners</article-title>
          ,
          <source>Journalism</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>2232</fpage>
          -
          <lpage>2251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          , et al.,
          <article-title>The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness</article-title>
          , in:
          <source>European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulastya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Empowering the Fact-checkers! Automatic Identification of Claim Spans on Twitter</article-title>
          , in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>7701</fpage>
          -
          <lpage>7715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beltrán</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection</article-title>
          , in:
          <source>Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</source>
          , CLEF '
          <year>2022</year>
          , Bologna, Italy,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content</article-title>
          , in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.),
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          Association for Computational Linguistics
          , Online,
          <year>2021</year>
          , pp.
          <fpage>3178</fpage>
          -
          <lpage>3188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>FEVER: a Large-scale Dataset for Fact Extraction and VERification</article-title>
          , in: M. Walker, H. Ji, A. Stent (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers), Association for Computational Linguistics
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>809</fpage>
          -
          <lpage>819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Garimella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gaffney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <article-title>Claim Matching Beyond English to Scale Global Fact-Checking</article-title>
          , in:
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4504</fpage>
          -
          <lpage>4517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          , G. Da San Martino, P. Nakov,
          <article-title>That is a Known Lie: Detecting Previously Fact-Checked Claims</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3607</fpage>
          -
          <lpage>3618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>From Chaos to Clarity: Claim Normalization to Empower Fact-Checking</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          Association for Computational Linguistics
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>6594</fpage>
          -
          <lpage>6609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jaradat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gencheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          , P. Nakov,
          <article-title>ClaimRank: Detecting Check-Worthy Claims in Arabic and English</article-title>
          , in:
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations</source>
          , Association for Computational Linguistics
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Nandi</surname>
          </string-name>
          , et al.,
          <article-title>The CLEF-2023 CheckThat! Lab: Checkworthiness, Subjectivity, Political Bias, Factuality, and Authority</article-title>
          , in:
          <source>European Conference on Information Retrieval</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media</article-title>
          , in:
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3887</fpage>
          -
          <lpage>3902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>V.</surname>
          </string-name>
          ,
          <article-title>The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>DESYR: Definition and Syntactic Representation Based Claim Detection on the Web</article-title>
          , in:
          <source>Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, CIKM '21</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>1764</fpage>
          -
          <lpage>1773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Leveraging Rationality Labels for Explainable Claim Check-Worthiness</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gencheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <article-title>A context-aware approach for detecting worth-checking claims in political debates</article-title>
          , in:
          <source>Proc. of RANLP</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          , G. Malhotra,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Document Retrieval and Claim Verification to Mitigate COVID-19 Misinformation</article-title>
          , in:
          <source>Proc. of Workshop on CONSTRAINT, ACL</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Glockner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Staliūnaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vallejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>AmbiFC: Fact-Checking Ambiguous Claims with Evidence</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chernyavskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ilvovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          : Long Papers), Online only,
          <year>2022</year>
          , pp.
          <fpage>266</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <article-title>FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs</article-title>
          , in:
          <source>Companion Proceedings of the ACM Web Conference 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>883</fpage>
          -
          <lpage>886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Drolsbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Solovev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pröllochs</surname>
          </string-name>
          ,
          <article-title>Community Notes Increase Trust in Fact-Checking on Social Media</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>