Overview of the CLEF-2021 CheckThat! Lab Task 1 on Check-Worthiness Estimation in Tweets and Political Debates

Shaden Shaar1, Maram Hasanain2, Bayan Hamdan3, Zien Sheikh Ali2, Fatima Haouari2, Alex Nikolov4, Mucahid Kutlu5, Yavuz Selim Kartal5, Firoj Alam1, Giovanni Da San Martino6, Alberto Barrón-Cedeño7, Rubén Míguez8, Javier Beltrán8, Tamer Elsayed2 and Preslav Nakov1

1 Qatar Computing Research Institute, HBKU, Doha, Qatar
2 Qatar University, Qatar
3 Independent Researcher
4 Sofia University, Bulgaria
5 TOBB University of Economics and Technology, Turkey
6 University of Padova, Italy
7 DIT, Università di Bologna, Italy
8 Newtral Media Audiovisual, Spain

Abstract
We present an overview of Task 1 of the fourth edition of the CheckThat! Lab, part of the 2021 Conference and Labs of the Evaluation Forum (CLEF). The task asks to predict which posts in a Twitter stream are worth fact-checking, focusing on COVID-19 and politics in five languages: Arabic, Bulgarian, English, Spanish, and Turkish. A total of 15 teams participated in this task, and most submissions managed to achieve sizable improvements over the baselines using Transformer-based models such as BERT and RoBERTa. Here, we describe the process of data collection and the task setup, including the evaluation measures, and we give a brief overview of the participating systems. We release to the research community all datasets from the lab, as well as the evaluation scripts, which should enable further research in check-worthiness estimation for tweets and political debates.

Keywords
Check-Worthiness Estimation, Fact-Checking, Veracity, Verified Claims Retrieval, Detecting Previously Fact-Checked Claims, Social Media Verification, Computational Journalism, COVID-19

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

sshaar@hbku.edu.qa (S. Shaar); maram.hasanain@qu.edu.qa (M. Hasanain); bayan.hamdan995@gmail.com (B. Hamdan); zs1407404@qu.edu.qa (Z. S. Ali); 200159617@qu.edu.qa (F. Haouari); alexnickolow@gmail.com (A. Nikolov); m.kutlu@etu.edu.tr (M. Kutlu); ykartal@etu.edu.tr (Y. S. Kartal); fialam@hbku.edu.qa (F. Alam); dasan@math.unipd.it (G. D. S. Martino); a.barron@unibo.it (A. Barrón-Cedeño); ruben.miguez@newtral.es (R. Míguez); javier.beltran@newtral.es (J. Beltrán); telsayed@qu.edu.qa (T. Elsayed); pnakov@hbku.edu.qa (P. Nakov)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The spread of fake news, misinformation and disinformation on the web, in social media, and in other communication channels has become an urgent social and political issue. There has been growing interest in fighting such false or misleading information both in academia and in industry. To address the issue, a number of initiatives have been launched to perform manual claim verification, with over 200 fact-checking organizations worldwide,1 such as PolitiFact, FactCheck, Snopes, Full Fact, and Newtral, among others. Unfortunately, these efforts are insufficient, given the scale of disinformation propagating in various communication channels, which, in the time of COVID-19, has grown into the First Global Infodemic (according to the World Health Organization). To deal with this problem, we launched the CheckThat! Lab, which features a number of tasks aiming to help automate the fact-checking process and to reduce the spread of misinformation and disinformation. The CheckThat! lab was run for the fourth time in the framework of CLEF 2021.2 The purpose of the 2021 edition of the lab was to foster the development of technology that would enable finding check-worthy claims, detecting claims that have been previously fact-checked, and predicting the veracity of news articles and their topics.
Figure 1 shows the full CheckThat! identification and verification pipeline, including the tasks on detecting previously fact-checked claims and predicting the veracity and the topic of news articles. Here, we focus on Task 1: check-worthiness estimation.3 This task focuses on tweets and political debates and speeches. It consists of the following two subtasks:

Subtask 1A: Check-worthiness of tweets. Given a topic and a stream of potentially related tweets, rank the tweets according to their check-worthiness for the topic.

Subtask 1B: Check-worthiness of debates or speeches. Given a political debate/speech, return a list of its sentences, ranked by their check-worthiness.

Subtask 1A was offered in Arabic, Bulgarian, English, Spanish, and Turkish. We focused on COVID-19, vaccines, and politics, and we crawled and manually annotated tweets between January 2020 and March 2021. The participants were free to work on any language(s) of their interest, and they could also use multilingual approaches that learn from all datasets. However, the evaluation was per language. This subtask attracted 15 teams, and the most successful approaches used Transformers or a combination of embeddings, manually engineered features, and neural networks. Section 3 offers more details.

Subtask 1B was offered in English, and it used PolitiFact as a source of previously fact-checked claims. The task attracted two teams, and a combination of pre-processing, data augmentation, and Transformer-based models performed best. Section 4 gives more details.

The remainder of the paper is organized as follows: Sections 3 and 4 describe the datasets, the evaluation results, and the participating systems for subtasks 1A and 1B, respectively, Section 2 discusses related work, and Section 5 concludes with final remarks.

1 http://tiny.cc/zd1fnz; last visited 02/07/2021.
2 http://sites.google.com/view/clef2021-checkthat/
3 Refer to [1] for an overview of the full 2021 edition of the CheckThat!
lab, including the other two tasks on detecting previously fact-checked claims [2] and on fake news detection [3].

Figure 1: The full verification pipeline including the three tasks addressed in the CheckThat! lab 2021. Task 1 on check-worthiness estimation focuses on Twitter in five languages (subtask 1A) and debates and speeches in English (subtask 1B). See [2, 3] for a discussion on tasks 2 and 3. The grayed tasks were addressed in previous editions of the lab [4, 5].

2. Related Work

Misinformation and disinformation are rapidly spreading in social media, and sometimes false and misleading claims originate in political debates and speeches. To fight the problem, automatic fact-checking has emerged as an important research area, and researchers have worked on a number of subtasks: from automatic identification and verification of claims [6, 7, 8, 9, 5, 10, 11], to identifying check-worthy claims [12, 13, 14, 15], detecting whether a claim has been previously fact-checked [16, 17, 18], retrieving evidence to accept or to reject a claim [19, 20], checking whether the evidence supports or denies the claim [21, 22], and inferring the veracity of the claim [23, 24, 25, 26, 27, 28, 20, 29, 30, 31].

Check-worthiness estimation for debates/speeches. The ClaimBuster system [13] was a pioneering work on check-worthiness estimation. Given a sentence in the context of a political debate, it classified it into one of the following manually annotated categories: non-factual, unimportant factual, or check-worthy factual. In later work, Gencheva et al. [12] also focused on the 2016 US Presidential debates, for which they obtained binary (check-worthy vs. non-check-worthy) annotations from various fact-checking organizations. An extension of this work resulted in the development of the ClaimRank system, which was trained on more data, also including Arabic content [14]. Other related work also focused on political debates and speeches. For example, Patwari et al.
[32] predicted whether a sentence would be selected by a fact-checking organization using a boosting-like model. Similarly, Vasileva et al. [15] used a multi-task learning neural network that predicts whether a sentence would be selected for fact-checking by each individual fact-checking organization (from a set of nine such organizations). Last but not least, the task was part of the CheckThat! lab in CLEF 2018, 2019, and 2020, where the focus was once again on political debates and speeches, from a single fact-checking organization. In the 2018 edition of the task, a total of seven teams submitted runs for Task 1 (which corresponds to Subtask 1B in 2021), with systems based on word embeddings and RNNs [33, 34, 35, 36]. In the 2019 edition of the task, eleven teams submitted runs for the corresponding Task 1, again using word embeddings and RNNs, and further trying a number of interesting representations [37, 38, 39, 40, 41, 42, 43, 44]. In the 2020 edition of the task, three teams submitted runs for the corresponding Task 5, with systems based on word embeddings and BiLSTMs [45], TF.IDF representations with Naïve Bayes, logistic regression, and decision trees [46], and BERT prediction scores and word embeddings with logistic regression [47].

Check-worthiness estimation for tweets. There has been less effort in identifying check-worthy claims in social media. Previous work in this direction includes Task 1 in the 2020 edition of the lab, and the work of Alam et al. [48, 49], who developed a multi-question annotation schema for tweets about COVID-19, organized around seven questions, including one about claim check-worthiness. For some languages in the 2021 Subtask 1A, we used the setup and the annotations for one of the questions in their schema, as well as their data for that question, which we further extended with additional data. In the 2020 edition of the lab, the focus was on English and Arabic.
For the Arabic task, several teams fine-tuned pre-trained models such as AraBERT and multilingual BERT [50, 51, 47]. Other approaches relied on pre-trained models such as GloVe and Word2vec [52, 45] to obtain embeddings for the tweets, which were fed into a neural network or an SVM. In addition to text representations, some teams used other features, namely morphological and syntactic features, part-of-speech (POS) tags, named entities, and sentiment features [53, 54]. As for the English task, we also observed the popularity of pre-trained Transformers, namely BERT and RoBERTa [50, 52, 55, 56, 47, 57, 58]. Other approaches relied on word embeddings such as GloVe to obtain embeddings for the tweets, which were fed into a neural network [45]. There were also systems that used more traditional machine learning models such as random forests [46].

An indirectly related research line is on credibility assessment of tweets [59], including the CREDBANK tweet corpus [60], which has credibility annotations, a multilingual fact-checking corpus [61], as well as work on fake news [62] and on rumor detection in social media [63]; unlike that work, here we focus on detecting check-worthiness rather than on predicting the credibility/factuality of the claims in the tweets. Another, less related research line is on the development of datasets of tweets about COVID-19 [64, 65, 66, 67, 68]; however, none of these datasets has focused on check-worthiness estimation.

3. Subtask 1A: Check-Worthiness Estimation for Tweets

The aim of Task 1 is to determine whether a piece of text is worth fact-checking. In order to do that, we either resort to the judgments of professional fact-checkers or we ask human annotators to answer several auxiliary questions [49, 48], such as “does it contain a verifiable factual claim?”, “is it harmful?”, and “is it of general interest?”, before deciding on the final check-worthiness label.

Table 1
Subtask 1A: statistics about the CT–CWT–21 dataset for all five languages.
The bottom part of the table shows the main topics.

Partition      Arabic  Bulgarian  English  Spanish  Turkish   Total
Training        3,444      3,000      822    2,495    1,899  11,660
Development       661        350      140    1,247      388   2,786
Testing           600        357      350    1,248    1,013   3,568
Total           4,705      3,707    1,312    4,990    3,300  18,014
Main topics
COVID-19            ■          ■        ■        –        ■
Politics            ■          –        –        ■        ■

Subtask 1A focused on Twitter, and it is defined as follows: “Given a topic and a stream of potentially related tweets, rank the tweets according to their check-worthiness for the topic.” The task was offered in Arabic, Bulgarian, English, Spanish, and Turkish. We created and released an independent labeled dataset per language, as explained in the following section. The participants were free to work on any language(s) of their interest, and they could also use multilingual approaches that make use of all datasets for training.

3.1. Datasets

Although all languages tackled major topics such as COVID-19 and politics, the crawling and the annotation were done differently across the languages, due to the different resources available to the team leading the annotation for each language. Eventually, for each language, we released a tweet dataset with each tweet labeled for check-worthiness. Below, we provide more detail about how the crawling and the annotation were done for each language. Table 1 shows some statistics about the datasets, which are split into training, development, and testing partitions.

3.1.1. Arabic Dataset

In order to construct the Arabic dataset, we first manually created several topics over a period of several months. Examples of topic titles include “Coronavirus in the Arab World”, “GCC Reconciliation”, and “Deal of the century”. We augmented each topic with a set of keywords, hashtags, and usernames to track on Twitter.4 Once we had created a topic, we immediately crawled a one-week stream of tweets using the constructed search terms, where we searched Twitter (via the Twitter search API) using each term at the end of each day.
We limited the search to original Arabic tweets (i.e., we excluded retweets). We then de-duplicated the tweets and dropped those failing our qualification filter, which excludes tweets containing terms from a blacklist of explicit terms, as well as tweets with more than four hashtags or more than two URLs. Afterwards, we ranked the tweets by popularity (defined as the sum of their retweets and likes), and we selected the top-500 to be annotated.

4 Keywords used to crawl tweets: http://gitlab.com/checkthat_lab/clef2021-checkthat-lab/-/tree/master/task1

Table 2
Subtask 1A, Arabic: tweets with their check-worthiness marked (English translations of the original Arabic tweets).

✓ Jordan prevents Chinese citizens from entering the country due to the Corona virus. #Chinese_Corona #Jordan
✘ Al-Masirah TV was accompanying the Houthis on the war fronts, before Al Jazeera channel officially joined the Houthi war media against the Yemenis! https://t.co/bziqkP4CPI
✘ Nancy Pelosi: President Trump is a danger to the constitution and democracy https://t.co/pobQlZJs9G
✘ The number of siege days is 1310 #gulf_reconciliation #41_gulf_summit https://t.co/I25oW4jHBe

The training and the development sets include 12 topics crawled in January, February, and March 2020 and borrowed from the last edition of the CheckThat! lab, considering the topics with the highest inter-annotator agreement; refer to [51] for further details on the annotation process.
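The tweet selection pipeline described above (de-duplication, qualification filtering, and popularity ranking) can be sketched as follows. This is an illustrative re-implementation, not the lab's actual crawling code; the dictionary field names and the blacklist contents are hypothetical placeholders.

```python
# Sketch of the tweet selection pipeline: de-duplicate, drop tweets that
# fail the qualification filter (blacklisted term, >4 hashtags, or >2 URLs),
# rank by popularity (retweets + likes), and keep the top-500 for annotation.
# Field names and the blacklist are hypothetical, not taken from the lab's code.

BLACKLIST = {"badword1", "badword2"}  # placeholder for the explicit-terms list

def qualifies(tweet):
    """A tweet qualifies if it contains no blacklisted term,
    at most 4 hashtags, and at most 2 URLs."""
    text = tweet["text"].lower()
    if any(term in text for term in BLACKLIST):
        return False
    if len(tweet["hashtags"]) > 4 or len(tweet["urls"]) > 2:
        return False
    return True

def select_for_annotation(tweets, k=500):
    # De-duplicate by tweet text, keeping the first occurrence.
    seen, unique = set(), []
    for t in tweets:
        if t["text"] not in seen:
            seen.add(t["text"])
            unique.append(t)
    qualified = [t for t in unique if qualifies(t)]
    # Popularity = sum of retweets and likes; keep the top-k.
    qualified.sort(key=lambda t: t["retweets"] + t["likes"], reverse=True)
    return qualified[:k]
```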
For this year’s edition, for each topic, we only kept tweets that were relevant and had full inter-rater agreement on the check-worthiness label. For the test set, we crawled using two topics in January 2021, and we annotated the resulting tweets as follows. We first recruited one annotator to annotate each tweet for its relevance with respect to the target topic. Then, we labeled for check-worthiness the tweets that were found relevant. This second annotation was done by two expert annotators, and it was followed by a consolidation step, where the annotators talked to each other to resolve potential disagreements. Due to the subjective nature of check-worthiness, we chose to represent the check-worthiness criteria by several questions, to help the annotators think about different aspects of check-worthiness. First, the annotators were asked to answer the following question:

1. Does the tweet contain a verifiable factual claim?

If the answer to the above question is positive, the annotator is asked to answer the following additional questions:

2. Does the claim in the tweet appear to be false?
3. Do you think the claim in the tweet is of interest to or would have an impact on the public?
4. To what extent do you think the claim can morally or physically harm an entity, a country, etc.?
5. Do you think that journalists will be interested in covering the spread of the claim or the information discussed by the claim?

Once the annotator has answered the above questions, s/he is further required to answer a final question, considering all the answers given previously:

6. Do you think that a professional fact-checker or a journalist should verify the claim in the tweet?

This is a yes/no question, and its answer is the label that we use to represent the check-worthiness of the target tweet. Table 2 shows examples of annotated tweets in Arabic.

3.1.2.
Bulgarian and English Datasets

We collected tweets that matched the COVID-19 topic using language-specific keywords and hashtags.5 We ran all the data collection from January 2020 to February 2021, and we selected the most retweeted posts for the manual annotation. We considered a number of factors for the annotation, including tweet popularity in terms of retweets, which is already taken into account as part of the data collection process. We further asked the annotators to answer the following five questions:6

• Q1: Does the tweet contain a verifiable factual claim? This is an objective question. Positive examples include tweets that state a definition, mention a quantity in the present or the past, make a verifiable prediction about the future, reference laws, procedures, and rules of operation, discuss images or videos, and state correlation or causation, among others. This is influenced by [69].

• Q2: To what extent does the tweet appear to contain false information? This question asks for a subjective judgment; it does not ask for annotating the actual factuality of the claim in the tweet, but rather whether the claim appears to be false.

• Q3: Will the tweet have an impact on or be of interest to the general public? This question asks for an objective judgment. Generally, claims that contain information related to potential cures, updates on the number of cases, on measures taken by governments, or discussing rumors and spreading conspiracy theories should be of general public interest.

• Q4: To what extent is the tweet harmful to the society, a person(s), a company(s), or a product(s)? This question also asks for an objective judgment: to identify tweets that can cause harm.

• Q5: Do you think that a professional fact-checker should verify the claim in the tweet? This question asks for a subjective judgment.
Yet, its answer should be informed by the answers to questions Q2, Q3, and Q4, as a check-worthy factual claim is probably one that is likely to be false, is of public interest, and/or appears to be harmful. Note that we stress that a professional fact-checker should verify the claim, which rules out claims that are easy to fact-check by a layman.

5 The keywords used to crawl the tweets are available at http://gitlab.com/checkthat_lab/clef2021-checkthat-lab/-/tree/master/task1
6 We used the following MicroMappers setup for the annotations: http://micromappers.qcri.org/project/covid19-tweet-labelling/

We considered as check-worthy the tweets that received a positive answer to both Q1 and Q5; if there was a negative answer to either Q1 or Q5, the tweet was considered not worth fact-checking. The answers to Q2, Q3, and Q4 were not considered directly, but they helped the annotators make a better decision for Q5. For the task, we did not provide the labels for Q2–Q4. The annotations were performed by 2–5 annotators independently, and they were consolidated in cases of disagreement. The annotation setup was part of a broader initiative; see [48] for details.

Table 3
Subtask 1A, Bulgarian: tweets with their check-worthiness marked.

Table 4
Subtask 1A, English: tweets with their check-worthiness marked.

✓ Breaking: Congress prepares to shutter Capitol Hill for coronavirus, opens telework center
✗ China has 24 times more people than Italy...
✗ Everyone coming out of corona as barista
✗ Lord, please protect my family & the Philippines from the corona virus

Tables 3 and 4 show examples of annotated tweets for Bulgarian and English, respectively. The first English example, ‘Breaking: Congress prepares to shutter Capitol Hill for coronavirus, opens telework center’, contains a verifiable factual claim and is of high interest to society, and thus it is check-worthy.
The second one, ‘China has 24 times more people than Italy...’, contains a verifiable factual claim, but it is trivial to check. The third example, ‘Everyone coming out of corona as barista’, is a joke, and thus not check-worthy. The fourth one, ‘Lord, please protect my family & the Philippines from the corona virus’, does not contain a verifiable factual claim.

Table 5
Subtask 1A, Spanish: tweets with their check-worthiness marked (including translations to English).

✓ Un nigeriano de 28 años (aun no sabemos si ilegal o no) es detenido tras intentar raptar a una niña de 11 años en un parque en Getafe. Dicen en la prensa que se desconocen las intenciones del sujeto. Por cierto, Qué montón de hechos aislados! https://t.co/B4ey4rdg8I
(A 28-year-old Nigerian (we do not know whether illegal or not yet) is arrested after trying to kidnap an 11-year-old girl in a park in Getafe. The press say that they ignore the intentions of the subject. By the way, what a lot of isolated cases! https://t.co/B4ey4rdg8I)

✗ Muy preocupado por el futuro de la plantilla de Endesa en As Pontes. Están en juego los proyectos de 750 familias, el futuro de una comarca y el 50% del tráfico del Puerto de Ferrol. El Gobierno central no puede seguir actuando en todo sin pensar en lo que se lleva por delante
(Quite worried about the future of the staff of Endesa in As Pontes. The projects of 750 families, the future of a region, and 50% of the traffic of Puerto de Ferrol are at stake. The central government cannot keep acting without considering the consequences.)

✓ 70.000.000€ más en guarderías, 65.500.000€ más en escuelas, 129.500.000€ más en universidades, 261 profesores más, 7.365 becas comedor más, un 30% menos en tasas universitarias, 124.000.000€ más en I+D+I, 19.000.000€ más en el Sincrotón... O somos útiles o no somos.
(70,000,000€ more for nurseries, 65,500,000€ more for schools, 129,500,000€ more for universities, 261 extra professors, 7,365 extra lunch scholarships, 30% less in university tuition fees, 124,000,000€ more for R&D&I, 19,000,000€ more for the Synchrotron... Either we are useful or we are not.)

✗ Una ley muy necesaria, que llevamos defendiendo mucho tiempo, para garantizar el derecho a una muerte digna en nuestro país. https://t.co/vvpvUd3PDY
(A much-needed law, which we have been defending for a long time, to guarantee the right to a dignified death in our country. https://t.co/vvpvUd3PDY)

3.1.3. Spanish Dataset

The dataset consists of 4,990 tweets sampled from the accounts of 350 well-known Spanish politicians. Professional fact-checkers reviewed all tweets published by these accounts over a period of one month to determine whether each tweet contained a check-worthy claim or not. The human fact-checkers considered different editorial criteria to determine whether a tweet is check-worthy, including factuality, the public relevance of the person behind the claim, and the potential impact on the general audience. Factuality and relevance were the main factors in making the decision. All tweets were annotated independently by three experts, and the final decision was made by majority voting. It is worth noting that only about 10% of the tweets are considered check-worthy, which is close to a realistic distribution.

Table 5 shows some examples. We can see that these tweets tend to be fairly long and often provide some context on top of the claim. The third tweet, which is check-worthy, actually contains multiple claims about the impact of a political party on investment. The fourth tweet does contain a claim, but it refers to the support of a new law and was not considered check-worthy by the experts.

Table 6
Subtask 1A, Turkish: tweets with their check-worthiness marked (including translations to English).
✗ Günün sorusu: 1,5 milyon doz Biontech aşısı Türkiye’ye geldi mi? Parasını kim ödedi, ne kadar ve bu aşı kimlere yapıldı?
(The question of the day: Did 1.5M Biontech vaccine doses arrive in Turkey? Who paid for them? How much? And who has been vaccinated with them?)

✓ Hükümete vuracağım diye inaktif aşı hakkındaki bilgileri çarpıtıyorsunuz. Lakin Türkiye’nin kullandığı aşı, diğer aşıların içerisinde hem en güveniliri hem de en etkilisi Kocan Hocanın eline, emeğine sağlık Süreci maniple edenleri de Allah nasıl biliyorsa öyle yapsın!
(You are distorting information about inactive vaccines in order to criticize the government. However, the vaccine used in Turkey is the most reliable and the most effective one among the other vaccines. Thanks to Kocan Hodga for his efforts. May Allah treat those manipulators as He wants.)

✗ Önümüzdeki iki hafta çok kritik Korona ŞehirEfsaneleri
(The following two weeks are very critical for Corona UrbanLegends)

✓ 45 gün önce aşı olan doktor koronavirüsten hayatını kaybetti https://t.co/aJqJ8I8CMb https://t.co/ZiitFoHJmS
(The doctor who was vaccinated 45 days ago lost his life to coronavirus https://t.co/aJqJ8I8CMb https://t.co/ZiitFoHJmS)

3.1.4. Turkish Dataset

For the training and the development sets for Turkish, we used the TrClaim-19 dataset [70]. For the test set, we crawled tweets using keywords related to health and to COVID-19 from February 26, 2021 until March 29, 2021. After de-duplication, we randomly selected tweets for manual annotation. Each tweet was annotated by three annotators, who were asked to say whether the corresponding tweet contains a check-worthy claim or not. The labels were aggregated based on majority voting.

Table 6 shows some examples. The first example does not contain a verifiable factual claim, just questions, and thus it is judged not to be check-worthy.
The second tweet contains a factual claim about the effectiveness of the vaccine used in Turkey, which is of wide public interest, and it is thus check-worthy. The third claim makes a statement about the future that is not particularly interesting. The last claim is of public interest, as it contains a factual claim about a possible mortality case due to a COVID-19 vaccine, and it is thus considered check-worthy.

3.2. Evaluation

This is a ranking task, where the tweets have to be ranked according to their check-worthiness. Therefore, we consider mean average precision (MAP) as the official evaluation measure, which we complement with reciprocal rank (RR), R-precision (R-P), and P@k for k ∈ {1, 3, 5, 10, 20, 30}. The data and the evaluation scripts are available online.7

7 https://gitlab.com/checkthat_lab/clef2021-checkthat-lab/

Table 7
Subtask 1A: summary of the approaches used by the participating systems, in terms of models (BERT, RoBERTa, AraBERT, Ara-ALBERT, ALBERT, DistilBERT, BERTurk, Electra, SBERT, BETO), other resources (e.g., LIWC), data augmentation, and preprocessing. The participating teams were: 1. abaruah; 2. Accenture [71]; 3. bigIR; 4. csum112; 5. DamascusTeam; 6. Fight for 4230 [72]; 7. GPLSI [73]; 8. iCompass [74]; 9. NLP&IR@UNED [75]; 10. NLytics [76]; 11. QMUL-SDS [77]; 12. SCUoL [78]; 13. SU-NLP [79]; 14. TOBB ETU [80]; 15. UPV [81].

3.3. Overview of the Systems

Fifteen teams took part in this task, with English and Arabic being the most popular languages. Four out of these fifteen teams submitted runs for all five languages, most of them having trained independent models for each language (yet, team UPV trained a single multilingual model). Most of the systems were based on state-of-the-art pre-trained Transformers such as BERT [82] and RoBERTa [83]. Table 7 summarizes the approaches used by the primary submissions of the participating teams. We can see that BERT, AraBERT, and RoBERTa were by far the most popular pre-trained language models among the participants.
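The ranking measures used in the evaluation of Section 3.2 can be sketched in a few lines. This is a simplified single-list illustration, not the official scoring script (which is available in the lab repository); it assumes the full ranked stream is scored, so the number of relevant items retrieved equals the total number of relevant items.

```python
def average_precision(ranked_labels):
    """Average precision over one ranked list of binary labels
    (1 = check-worthy). Assumes the whole collection is ranked, so the
    final hit count equals the total number of relevant items."""
    hits, score = 0, 0.0
    for i, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            score += hits / i  # precision at each relevant rank
    return score / max(hits, 1)

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant item, or 0 if none."""
    for i, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of relevant items among the top-k."""
    return sum(ranked_labels[:k]) / k
```

MAP is then the mean of `average_precision` over the evaluation topics.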
Below, we provide a short summary of the systems submitted by the participating teams. For each team, we indicate in subscript the languages they took part in and the corresponding rank.

Team Accenture [71] (ar:1 bg:4 en:9 es:5 tr:5) used pre-trained language models such as BERT and RoBERTa. They further used data augmentation; in particular, they generated synthetic training data using lexical substitution to create additional synthetic examples for the positive class. To find the most probable substitutions, they used BERT-based contextual embeddings. They further added a mean-pooling layer and a dropout layer on top of the model, before the final classification layer.

Team Fight for 4230 [72] (en:2) focused on augmenting the data by means of machine translation and WordNet-based substitutions. Pre-processing included link removal and punctuation cleaning, as well as quantity and contraction expansion. All hashtags related to COVID-19 were normalized into one, and hashtags were further expanded. Their best approach was based on BERTweet with a dropout layer and the above-described pre-processing.

Team GPLSI [73] (en:5 es:2) applied the RoBERTa and the BETO transformers together with different manually engineered features, such as the occurrence of dates and numbers, or words from LIWC. A thorough exploration of parameters was performed using the Weights & Biases tool. They also tried to split the four-class classification into two binary classifications and one three-class classification. Finally, they tried oversampling and undersampling.

Team iCompass [74] (ar:4) used preprocessing, including (i) removing English words, (ii) removing URLs and mentions, and (iii) normalizing the data by removing tashkeel and the letter madda from the text, as well as removing duplicates and replacing some characters to prevent mixing. They proposed a simple ensemble of two BERT-based models: AraBERT and Arabic-ALBERT.
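The lexical-substitution augmentation used by team Accenture can be sketched as follows. This is a schematic stand-in: a toy synonym table replaces the BERT masked-language-model substitution candidates used in the actual system, and the function names and synonym entries are illustrative only.

```python
import random

# Toy synonym table standing in for BERT-based substitution candidates;
# the actual system selected substitutions with contextual embeddings.
SYNONYMS = {
    "vaccine": ["jab", "shot"],
    "cases": ["infections"],
}

def augment(text, rng=random.Random(0)):
    """Create one synthetic variant of `text` by substituting a word that
    has known substitution candidates; return None if nothing can change.
    The shared default RNG makes the sketch reproducible."""
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

def oversample_positive(examples):
    """examples: list of (text, label) pairs; add one synthetic variant
    per positive (check-worthy) example to rebalance the classes."""
    out = list(examples)
    for text, label in examples:
        if label == 1:
            variant = augment(text)
            if variant:
                out.append((variant, 1))
    return out
```

In the actual system, the synonym lookup would be replaced by querying a masked language model for in-context substitutes.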
Team NLP&IR@UNED [75] (en:1 es:4) used several transformer models, such as BERT, ALBERT, RoBERTa, DistilBERT, and Funnel-Transformer. For their official submissions, they used BERT trained on tweets for English, and Electra for Spanish.

Team NLytics [76] (en:8) used RoBERTa with a regression function in the final layer, treating the problem as a ranking task.

Team QMUL-SDS [77] (ar:4) used the AraBERT pre-processing (i) to replace URLs, email addresses, and user mentions with standard words, (ii) to remove line breaks, HTML markup, repeated characters, and unwanted characters such as emotion icons, and (iii) to handle white spaces between words and digits (non-Arabic or English) and before/after two brackets; they also (iv) removed unnecessary punctuation. They addressed the task as a ranking problem, and fine-tuned an Arabic transformer (AraBERTv0.2-base) on a combination of the data from this year and the data from the CheckThat! lab 2020 (the CT20-AR dataset).

Team SCUoL [78] (ar:3) used typical preprocessing steps and fine-tuned different AraBERT models; eventually, they used AraBERTv2-base.

Team SU-NLP [79] (tr:2) used preprocessing, including (i) removing emojis and hashtags, and (ii) replacing all mentions with a special token (@USER), and all URLs with the website’s domain. If the URL pointed to a tweet, they replaced it with TWITTER and the user account name. Finally, they used an ensemble of BERTurk models fine-tuned using different seed values.

Team TOBB ETU [80] (ar:6 bg:5 en:10 es:1 tr:1) used data augmentation by machine translation, weak supervision, and cross-lingual training. They removed URLs and user mentions, and fine-tuned a separate BERT-based model for each language. In particular, they fine-tuned BERTurk8, AraBERT, BETO9, and BERT-base for Turkish, Arabic, Spanish, and English, respectively.
For Bulgarian, they fine-tuned a RoBERTa model pre-trained on Bulgarian documents.10

8 http://huggingface.co/dbmdz/bert-base-turkish-cased
9 https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased
10 http://huggingface.co/iarfmoose/roberta-base-bulgarian

Team UPV [81] (ar:8 bg:2 en:3 es:6 tr:4) used a multilingual sentence transformer (SBERT) with knowledge distillation, originally intended for question answering. They introduced an auxiliary language identification task alongside the downstream check-worthiness task.

3.4. Performance for Different Languages

Table 8 shows the performance of the official submissions on the test set, in addition to that of the 𝑛-gram baseline. The official run was the last valid blind submission by each team. The table shows the runs ranked on the basis of the official MAP measure, for all five languages.

Arabic. Eight teams participated for Arabic, submitting a total of 17 runs (recall that only the last submission counts). All participating teams fine-tuned existing pre-trained models, such as AraBERT and multilingual BERT. We can see that the top two systems additionally worked on improving their training data: team Accenture used label augmentation to increase the number of positive examples, while team bigIR augmented the training set with the Turkish training set, which they automatically translated to Arabic.

Bulgarian. Four teams took part for Bulgarian, submitting a total of 11 runs. The top-ranked team was bigIR. As they did not submit a task description paper, we cannot give much detail about their system. Team UPV had the second-best system: they used a multilingual sentence transformer representation (SBERT) with knowledge distillation, and they introduced an auxiliary language identification task alongside the downstream check-worthiness task.

English. Ten teams took part in Task 1A for English, submitting a total of 21 runs.
The top-ranked team was NLP&IR@UNED, who used several pre-trained transformer models. They reported that BERTweet performed best on the development set; this model was pre-trained using the RoBERTa procedure on 850 million English tweets and 23 million COVID-19 English tweets. The second-best system (team Fight for 4230) also used BERTweet with a dropout layer, together with pre-processing and data augmentation.

Spanish. Six teams took part for Spanish, submitting a total of 13 runs. The top-ranked team, TOBB ETU, explored different data augmentation strategies, including machine translation and weak supervision; still, their submitted model was a fine-tuned BETO model without data augmentation. The first runner-up, GPLSI, opted for the BETO Spanish transformer together with a number of hand-crafted features, such as the occurrence of numbers or of words from the LIWC lexicon.

Turkish. Five teams participated for Turkish, submitting a total of 9 runs. All participants used BERT-based models. The top-ranked team, TOBB ETU, fine-tuned BERTurk after removing mentions and URLs. The runner-up team, SU-NLP, applied a pre-processing step that included removing hashtags and emojis, and replacing URLs and mentions with special tokens; they then used an ensemble of BERTurk models fine-tuned with different seed values. The third-ranked team, bigIR, machine-translated the Turkish text to Arabic and fine-tuned AraBERT on the translated text.

Table 8
Subtask 1A: results for the official submissions for all five languages.
      Team                   MAP    MRR    RP     P@1    P@3    P@5    P@10   P@20   P@30

Arabic
  1   Accenture [71]         0.658  1.000  0.599  1.000  1.000  1.000  1.000  0.950  0.840
  2   bigIR                  0.615  0.500  0.579  0.000  0.667  0.600  0.600  0.800  0.740
  3   SCUoL [78]             0.612  1.000  0.599  1.000  1.000  1.000  1.000  0.950  0.780
  4   iCompass [74]          0.597  0.333  0.624  0.000  0.333  0.400  0.400  0.500  0.640
  4   QMUL-SDS [77]          0.597  0.500  0.603  0.000  0.667  0.600  0.700  0.650  0.720
  6   TOBB ETU [80]          0.575  0.333  0.574  0.000  0.333  0.400  0.400  0.500  0.680
  7   DamascusTeam           0.571  0.500  0.558  0.000  0.667  0.600  0.800  0.700  0.640
  8   UPV [81]               0.548  1.000  0.550  1.000  0.667  0.600  0.500  0.400  0.580
      𝑛-gram baseline        0.428  0.500  0.409  0.000  0.667  0.600  0.500  0.450  0.440

Bulgarian
  1   bigIR                  0.737  1.000  0.632  1.000  1.000  1.000  1.000  1.000  0.800
  2   UPV [81]               0.673  1.000  0.605  1.000  1.000  1.000  1.000  0.800  0.700
      𝑛-gram baseline        0.588  1.000  0.474  1.000  1.000  1.000  0.900  0.750  0.640
  3   Accenture [71]         0.497  1.000  0.474  1.000  1.000  0.800  0.700  0.600  0.440
  4   TOBB ETU [80]          0.149  0.143  0.039  0.000  0.000  0.000  0.200  0.100  0.060

English
  1   NLP&IR@UNED [75]       0.224  1.000  0.211  1.000  0.667  0.400  0.300  0.200  0.160
  2   Fight for 4230 [72]    0.195  0.333  0.263  0.000  0.333  0.400  0.400  0.250  0.160
  3   UPV [81]               0.149  1.000  0.105  1.000  0.333  0.200  0.200  0.100  0.120
  4   bigIR                  0.136  0.500  0.105  0.000  0.333  0.200  0.100  0.100  0.120
  5   GPLSI [73]             0.132  0.167  0.158  0.000  0.000  0.000  0.200  0.150  0.140
  6   csum112                0.126  0.250  0.158  0.000  0.000  0.200  0.200  0.150  0.160
  7   abaruah                0.121  0.200  0.158  0.000  0.000  0.200  0.200  0.200  0.140
  8   NLytics [84]           0.111  0.071  0.053  0.000  0.000  0.000  0.000  0.050  0.120
  9   Accenture [71]         0.101  0.143  0.158  0.000  0.000  0.000  0.200  0.200  0.100
  10  TOBB ETU [80]          0.081  0.077  0.053  0.000  0.000  0.000  0.000  0.050  0.080
      𝑛-gram baseline        0.052  0.020  0.000  0.000  0.000  0.000  0.000  0.000  0.020

Spanish
  1   TOBB ETU [80]          0.537  1.000  0.525  1.000  1.000  0.800  0.900  0.700  0.680
  2   GPLSI [73]             0.529  0.500  0.533  0.000  0.667  0.600  0.800  0.750  0.620
  3   bigIR                  0.496  1.000  0.483  1.000  1.000  0.800  0.800  0.600  0.620
  4   NLP&IR@UNED [75]       0.492  1.000  0.475  1.000  1.000  1.000  0.800  0.800  0.620
  5   Accenture [71]         0.491  1.000  0.508  1.000  0.667  0.800  0.900  0.700  0.620
      𝑛-gram baseline        0.450  1.000  0.450  1.000  0.667  0.800  0.700  0.700  0.660
  6   UPV                    0.446  0.333  0.475  0.000  0.333  0.600  0.800  0.650  0.580

Turkish
  1   TOBB ETU [80]          0.581  1.000  0.585  1.000  1.000  0.800  0.700  0.750  0.660
  2   SU-NLP [79]            0.574  1.000  0.585  1.000  1.000  1.000  0.800  0.650  0.680
  3   bigIR                  0.525  1.000  0.503  1.000  1.000  1.000  0.800  0.700  0.720
  4   UPV [81]               0.517  1.000  0.508  1.000  1.000  1.000  1.000  0.850  0.700
  5   Accenture [71]         0.402  0.250  0.415  0.000  0.000  0.400  0.400  0.650  0.660
      𝑛-gram baseline        0.354  1.000  0.311  1.000  0.667  0.600  0.700  0.600  0.460

All languages. Table 9 summarizes the MAP performance of all teams that submitted predictions for all five languages in Subtask 1A. We can see that team bigIR performed best overall.

Table 9
Subtask 1A: MAP performance of the official submissions for the teams participating in all five languages. 𝜇 is the unweighted mean of the five MAP scores; 𝜇𝑤 is a weighted mean, where each MAP score is weighted by the size of the corresponding test set.

      Team                   ar     bg     en     es     tr     𝜇      𝜇𝑤
  1   bigIR                  0.615  0.737  0.136  0.496  0.525  0.502  0.513
  2   UPV [81]               0.548  0.673  0.149  0.446  0.517  0.467  0.477
  3   TOBB ETU [80]          0.575  0.149  0.081  0.537  0.581  0.385  0.472
  4   Accenture [71]         0.658  0.497  0.101  0.491  0.402  0.430  0.456
      𝑛-gram baseline        0.428  0.588  0.052  0.450  0.354  0.374  0.394

4. Subtask 1B: Check-Worthiness of Debates or Speeches

Subtask 1B is a legacy task: it evolved from the first edition of the CheckThat! lab and was run again in 2018, 2019, and 2020 [85, 7, 86]. In each edition, more training data from more diverse sources were added, all consisting of political speeches and debates. The task aims to mimic the selection strategy that fact-checking organizations such as PolitiFact use when choosing the sentences and the claims to fact-check.
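Both Subtask 1A and Subtask 1B are evaluated with ranking measures computed over the predicted check-worthiness scores, with MAP as the official measure and P@𝑘 reported for several values of 𝑘. The snippet below is a minimal sketch of these two measures over binary gold labels; it is for illustration only, not the official scorer released with the data.

```python
def average_precision(ranked_labels):
    """AP over one ranked list of binary labels (1 = check-worthy)."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant position
    return sum(precisions) / hits if hits else 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of check-worthy items among the top-k ranked ones."""
    return sum(ranked_labels[:k]) / k

def mean_average_precision(ranked_lists):
    """MAP: mean of AP over several ranked lists (e.g., one per topic/debate)."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Toy example: two ranked lists of gold labels, ordered by predicted score.
runs = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(round(mean_average_precision(runs), 3))  # 0.667
print(round(precision_at_k(runs[0], 3), 3))    # 0.667
```

In the first list, relevant items appear at ranks 1 and 3, so its AP is (1/1 + 2/3)/2 = 5/6; the second list has a single relevant item at rank 2, giving AP = 1/2, and MAP = 2/3.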
The task is defined as follows: "Given a transcript of a speech or a political debate, rank the sentences in the transcript according to the priority with which they should be fact-checked."

Table 10
Subtask 1B: debate fragments, with the check-worthy sentences marked with ✓.

(a) Fragment from the 2019 Democratic Debate in Detroit
  ✗  C. Booker: We have systemic racism that is eroding our nation from health care to the criminal justice system.
  ✓  C. Booker: And it's nice to go all the way back to slavery, but dear God, we have a criminal justice system that is so racially biased, we have more African-Americans under criminal supervision today than all the slaves in 1850.

(b) Fragment from the 2018 CBS' 60 Minutes interview with President Trump
  ✓  L. Stahl: Do you still think that climate change is a hoax?
  ✗  D. Trump: I think something's happening.
  ✗  D. Trump: Something's changing and it'll change back again.
  ✓  D. Trump: I don't think it's a hoax, I think there's probably a difference.
  ✓  D. Trump: But I don't know that it's manmade.

(c) Fragment from the 2016 third presidential debate
  ✗  D. Trump: We have no country if we have no border.
  ✓  D. Trump: Hillary wants to give amnesty.
  ✓  D. Trump: She wants to have open borders.

Table 7
Subtask 1B: statistics about the CT–CWT–21 corpus.

  Dataset       # of debates   Check-worthy sentences   Non-check-worthy sentences
  Training      40             429                      41,604
  Development    9              69                       3,517
  Test           8             298                       5,002
  Total         57             796                      50,123

4.1. Dataset

Often, after a major political event, such as a public debate or a speech by a government official, a professional fact-checker would go through the event transcript and select a few claims to fact-check. Since those claims were selected for verification, we consider them check-worthy. This is the process we used to collect our data, focusing on PolitiFact as the fact-checking source.
For each political event (debate or speech), we collected the corresponding article from PolitiFact and we obtained the official transcript of the event, e.g., from ABC, the Washington Post, CSPAN, etc. We then manually matched the sentences from the PolitiFact articles to the exact statements made in the debate/speech. We collected a total of 57 debates/speeches from 2012-2018, and we selected the sentences from the transcripts that were actually fact-checked by human fact-checkers; that is, we relied on PolitiFact to identify which sentences could be fact-checked. As fact-checking is a time-consuming process, PolitiFact journalists only fact-check a few claims per event, and thus the dataset contains an abundance of false negative examples. We addressed this issue at test time: we manually went over the debates in the test set and, for each sentence, we checked whether it matched a claim that had already been verified by a fact-checker, using BM25-based suggestions. Table 10 shows some annotated examples, and Table 7 gives some statistics; note the higher proportion of positive examples in the test set compared to the training and the development sets. Further details about the construction of the CT–CWT–21 corpus can be found in [87].

4.2. Overview of the Systems

Two teams took part in this subtask, submitting a total of 3 runs. Table 8 shows the performance of the official submissions on the test set, in addition to that of an 𝑛-gram baseline. As for Subtask 1A, the official run was the last valid blind submission by each team, and the runs are ranked on the basis of the official MAP measure. The top-ranked team, Fight for 4230, fine-tuned BERTweet after normalizing the claims, augmenting the data using WordNet-based substitutions, and removing punctuation; they beat the 𝑛-gram baseline by 18 MAP points absolute. The other team, NLytics [76], fine-tuned RoBERTa, but could not beat the 𝑛-gram baseline.
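The BM25 suggestions used when re-annotating the test set (Section 4.1) can be illustrated with a small, self-contained Okapi BM25 scorer; this is a sketch under simplifying assumptions (whitespace tokenization, a toy claim collection), not the annotation tooling actually used. Each transcript sentence is matched against a collection of already fact-checked claims, and the top-scoring claims are shown to the annotator.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy collection of already fact-checked claims (tokenized).
claims = [
    "more african americans under criminal supervision than slaves in 1850".split(),
    "climate change is a hoax".split(),
]
sentence = "do you still think that climate change is a hoax".split()

scores = bm25_scores(sentence, claims)
best = max(range(len(claims)), key=scores.__getitem__)
print(best)  # 1 -> the climate-change claim is the top suggestion
```

A sentence with no suggested match above some score threshold would be labeled as not check-worthy by this heuristic, subject to the annotator's manual review.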
Table 8
Task 1B (English): official evaluation results, in terms of MAP, MRR, R-Precision (RP), and Precision@𝑘 (P@𝑘). The teams are ranked by the official evaluation measure: MAP.

      Team                   MAP    MRR    RP     P@1    P@3    P@5    P@10   P@20   P@30
  1   Fight for 4230 [72]    0.402  0.917  0.403  0.875  0.833  0.750  0.600  0.475  0.350
      𝑛-gram baseline        0.235  0.792  0.263  0.625  0.583  0.500  0.400  0.331  0.217
  2   NLytics [84]           0.135  0.345  0.130  0.250  0.125  0.100  0.137  0.156  0.135

4.3. Evaluation

As this task is very similar to Subtask 1A, but for a different genre, we used the same evaluation measures: MAP as the official measure, and we also report P@𝑘 for various values of 𝑘. Table 8 shows the performance of the primary submissions of the participating teams. We can see that the overall results are low, and that only one of the two teams managed to beat the 𝑛-gram baseline. Once again, the data and the evaluation scripts are available online.11

5. Conclusion and Future Work

We have presented an overview of Task 1 of the CLEF-2021 CheckThat! lab. The lab featured tasks that span the full verification pipeline: from spotting check-worthy claims to checking whether they have been fact-checked before. Task 1 focused on check-worthiness in tweets about COVID-19 and politics (Subtask 1A), and in political debates and speeches (Subtask 1B). In line with the general mission of CLEF, we promoted multilinguality by offering the task in five different languages. The participating systems relied on transformer models such as BERT and RoBERTa, some of them used data augmentation, and applying standard pre-processing was common; almost all systems outperformed the 𝑛-gram baseline for Subtask 1A. In future work, we plan a new iteration of the CLEF CheckThat! lab and of this task, where we would offer larger training sets, as well as more fine-grained tasks.
Acknowledgments

The work of Tamer Elsayed, Maram Hasanain, and Zien Sheikh Ali was made possible by NPRP grant #NPRP-11S-1204-170060 from the Qatar National Research Fund (a member of Qatar Foundation). The work of Fatima Haouari is supported by GSRA grant #GSRA6-1-0611-19074 from the Qatar National Research Fund. The statements made herein are solely the responsibility of the authors. This work is also part of the Tanbih mega-project,12 developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news", propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking.

11 http://gitlab.com/checkthat_lab/clef2021-checkthat-lab/
12 http://tanbih.qcri.org

References

[1] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, CLEF 2021, 2021.
[2] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. Da San Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates, in: [88], 2021.
[3] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab: Task 3 on fake news detection, in: [88], 2021.
[4] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. Sheikh Ali, Overview of CheckThat! 2020: Automatic identification and verification of claims in social media, LNCS (12260), 2020.
[5] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. Da San Martino, P.
Atanasova, Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, LNCS, 2019, pp. 301–321.
[6] P. Atanasova, L. Màrquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 1: Check-worthiness, in: CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2018.
[7] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness, in: [89], 2019.
[8] A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. Màrquez, P. Atanasova, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 2: Factuality, in: CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2018.
[9] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. Da San Martino, P. Atanasova, CheckThat! at CLEF 2019: Automatic identification and verification of claims, in: L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff, D. Hiemstra (Eds.), Advances in Information Retrieval, ECIR '19, 2019, pp. 309–315.
[10] M. Hasanain, R. Suwaileh, T. Elsayed, A. Barrón-Cedeño, P. Nakov, Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 2: Evidence and factuality, in: [89], 2019.
[11] P. Nakov, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. Màrquez, W. Zaghouani, P. Atanasova, S. Kyuchukov, G. Da San Martino, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, in: Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, 2018, pp. 372–387.
[12] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, A context-aware approach for detecting worth-checking claims in political debates, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017, pp. 267–276.
[13] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: J. Bailey, A. Moffat, C. C. Aggarwal, M. de Rijke, R. Kumar, V. Murdock, T. K. Sellis, J. X. Yu (Eds.), Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM, 2015, pp. 1835–1838.
[14] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, ClaimRank: Detecting check-worthy claims in Arabic and English, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 2018, pp. 26–30.
[15] S. Vasileva, P. Atanasova, L. Màrquez, A. Barrón-Cedeño, P. Nakov, It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP '19, 2019, pp. 1229–1239.
[16] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a known lie: Detecting previously fact-checked claims, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3607–3618.
[17] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. Da San Martino, Automated fact-checking for assisting human fact-checkers (2021).
[18] S. Shaar, F. Alam, G. Da San Martino, P.
Nakov, The role of context in detecting previously fact-checked claims, arXiv:2104.07423 (2021). [19] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen, MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP- IJCNLP 2019, 2019, pp. 4685–4697. [20] G. Karadzhov, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, Fully automated fact checking using external sources, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017, pp. 344–353. [21] M. Mohtarami, R. Baly, J. Glass, P. Nakov, L. Màrquez, A. Moschitti, Automatic stance detection using end-to-end memory networks, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, 2018, pp. 767–776. [22] M. Mohtarami, J. Glass, P. Nakov, Contrastive language adaptation for cross-lingual stance detection, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, 2019, pp. 4442–4452. [23] P. Atanasova, P. Nakov, L. Màrquez, A. Barrón-Cedeño, G. Karadzhov, T. Mihaylova, M. Mohtarami, J. Glass, Automatic fact-checking using context and discourse information, Journal of Data and Information Quality (JDIQ) 11 (2019) 12. [24] C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, R. Kumar (Eds.), Proceedings of the 20th International Conference on World Wide Web, WWW 2011, 2011, pp. 675–684. [25] D. Kopev, A. Ali, I. Koychev, P. 
Nakov, Detecting deception in political debates using acoustic and textual features, in: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, ASRU ’19, 2019, pp. 652–659. [26] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937. [27] R. Baly, G. Karadzhov, D. Alexandrov, J. Glass, P. Nakov, Predicting factuality of reporting and bias of news media sources, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3528–3539. [28] R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, P. Nakov, Integrating stance detection and fact checking in a unified corpus, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 21–27. [29] V. Nguyen, K. Sugiyama, P. Nakov, M. Kan, FANG: leveraging social context for fake news detection using graph representation, in: M. d’Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré- Mauroux (Eds.), CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, 2020, pp. 1165–1174. [30] K. Popat, S. Mukherjee, J. Strötgen, G. Weikum, Where the truth lies: Explaining the credibility of emerging claims on the web and social media, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17, 2017, pp. 1003–1012. [31] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and VERification, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 809–819. [32] A. Patwari, D. 
Goldwasser, S. Bagchi, TATHYA: A multi-classifier system for detecting check-worthy statements in political debates, in: E. Lim, M. Winslett, M. Sanderson, A. W. Fu, J. Sun, J. S. Culpepper, E. Lo, J. C. Ho, D. Donato, R. Agrawal, Y. Zheng, C. Castillo, A. Sun, V. S. Tseng, C. Li (Eds.), Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM, 2017, pp. 2259–2262. [33] R. Agez, C. Bosc, C. Lespagnol, J. Mothe, N. Petitcol, IRIT at CheckThat! 2018, in: [90], 2018. [34] B. Ghanem, M. Montes-y Gómez, F. Rangel, P. Rosso, UPV-INAOE-Autoritas - Check That: Preliminary approach for checking worthiness of claims, in: [90], 2018. [35] C. Hansen, C. Hansen, J. Simonsen, C. Lioma, The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 fact checking lab, in: [90], 2018. [36] C. Zuo, A. Karakas, R. Banerjee, A hybrid recognition system for check-worthy claims using heuristics and supervised learning, in: [90], 2018. [37] B. Altun, M. Kutlu, TOBB-ETU at CLEF 2019: Prioritizing claims based on check- worthiness, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [38] L. Coca, C.-G. Cusmuliuc, A. Iftene, CheckThat! 2019 UAICS, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [39] R. Dhar, S. Dutta, D. Das, A hybrid model to rank sentences for check-worthiness, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [40] L. Favano, M. Carman, P. Lanzi, TheEarthIsFlat’s submission to CLEF’19 CheckThat! challenge, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [41] J. Gasior, P. 
Przybyła, The IPIPAN team participation in the check-worthiness task of the CLEF2019 CheckThat! lab, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [42] C. Hansen, C. Hansen, J. Simonsen, C. Lioma, Neural weakly supervised fact check- worthiness detection with contrastive sampling-based ranking loss, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [43] S. Mohtaj, T. Himmelsbach, V. Woloszyn, S. Möller, The TU-Berlin team participation in the check-worthiness task of the CLEF-2019 CheckThat! lab, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [44] T. Su, C. Macdonald, I. Ounis, Entity detection for check-worthiness prediction: Glasgow Terrier at CLEF CheckThat! 2019, in: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [45] J. Martinez-Rico, L. Araujo, J. Martinez-Romo, NLP&IR@UNED at CheckThat! 2020: A preliminary approach for check-worthiness and claim retrieval tasks using neural networks and graphs, in: [91], 2020. [46] T. McDonald, Z. Dong, Y. Zhang, R. Hampson, J. Young, Q. Cao, J. Leidner, M. Stevenson, The University of Sheffield at CheckThat! 2020: Claim identification and verification on Twitter, in: [91], 2020. [47] Y. S. Kartal, M. Kutlu, TOBB ETU at CheckThat! 2020: Prioritizing English and Arabic claims based on check-worthiness, in: [91], 2020. [48] F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. Da San Martino, A. Abdelali, N. Durrani, K. Darwish, P. Nakov, Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society, ArXiv preprint 2005.00033 (2020). [49] F. Alam, F. Dalvi, S. Shaar, N. Durrani, H. 
Mubarak, A. Nikolov, G. Da San Martino, A. Abdelali, H. Sajjad, K. Darwish, P. Nakov, Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 15, 2021, pp. 913–922. [50] E. Williams, P. Rodrigues, V. Novak, Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models, in: [91], 2020. [51] M. Hasanain, T. Elsayed, bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic tweets by check-worthiness, in: [91], 2020. [52] G. S. Cheema, S. Hakimov, R. Ewerth, Check_square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features, in: [91], 2020. [53] A. Hussein, A. Hussein, N. Ghneim, A. Joukhadar, DamascusTeam at CheckThat! 2020: Check worthiness on Twitter with hybrid CNN and RNN models, in: [91], 2020. [54] I. Touahri, A. Mazroui, EvolutionTeam at CheckThat! 2020: Integration of linguistic and sentimental features in a fake news detection approach, in: [91], 2020. [55] R. Alkhalifa, T. Yoong, E. Kochkina, A. Zubiaga, M. Liakata, QMUL-SDS at CheckThat! 2020: Determining COVID-19 tweet check-worthiness using an enhanced CT-BERT with numeric expressions, in: [91], 2020. [56] A. Nikolov, G. Da San Martino, I. Koychev, P. Nakov, Team_Alex at CheckThat! 2020: Identifying check-worthy tweets with transformer models, in: [91], 2020. [57] C.-G. Cusmuliuc, L.-G. Coca, A. Iftene, UAICS at CheckThat! 2020: Fact-checking claim prioritization, in: [91], 2020. [58] S. Krishan T, K. S, T. D, R. Vardhan K, A. Chandrabose, Tweet check worthiness using transformers, CNN and SVM, in: [91], 2020. [59] A. Gupta, P. Kumaraguru, C. Castillo, P. Meier, TweetCred: Real-time credibility assessment of content on Twitter, in: Proceeding of the 6th International Social Informatics Conference, SocInfo ’14, 2014, pp. 228–243. [60] T. Mitra, E. 
Gilbert, CREDBANK: A large-scale social media corpus with associated credibility annotations, in: Proceedings of the Ninth International AAAI Conference on Web and Social Media, ICWSM ’15, 2015, pp. 258–267. [61] A. Gupta, V. Srikumar, X-fact: A new benchmark dataset for multilingual fact checking, ArXiv:2106.09248 (2021). [62] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, SIGKDD Explor. Newsl. 19 (2017) 22–36. [63] Z. Zhao, P. Resnick, Q. Mei, Enquiring minds: Early detection of rumors in social media from enquiry posts, in: A. Gangemi, S. Leonardi, A. Panconesi (Eds.), Proceedings of the 24th International Conference on World Wide Web WWW, 2015, pp. 1395–1405. [64] M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, A. Scala, The COVID-19 social media infodemic, arXiv:2003.05004 (2020). [65] X. Song, J. Petrak, Y. Jiang, I. Singh, D. Maynard, K. Bontcheva, Classification aware neural topic model and its application on a new COVID-19 disinformation corpus, arXiv:2006.03354 (2020). [66] X. Zhou, A. Mulay, E. Ferrara, R. Zafarani, Recovery: A multimodal repository for COVID- 19 news credibility research, in: M. d’Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, 2020, pp. 3205–3212. [67] F. Haouari, M. Hasanain, R. Suwaileh, T. Elsayed, ArCOV-19: The first Arabic COVID-19 Twitter dataset with propagation networks, arXiv preprint arXiv:2004.05861 (2020). [68] F. Haouari, M. Hasanain, R. Suwaileh, T. Elsayed, ArCOV19-rumors: Arabic COVID-19 Twitter dataset for misinformation detection, arXiv preprint arXiv:2010.08768 (2020). [69] L. Konstantinovskiy, O. Price, M. Babakar, A. 
Zubiaga, Towards automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection, arXiv:1809.08193 (2018).
[70] Y. S. Kartal, M. Kutlu, TrClaim-19: The first collection for Turkish check-worthy claim detection with annotator rationales, in: Proceedings of the 24th Conference on Computational Natural Language Learning, 2020, pp. 386–395.
[71] E. Williams, P. Rodrigues, S. Tran, Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation, in: [88], 2021.
[72] X. Zhou, B. Wu, P. Fung, Fight for 4230 at CLEF CheckThat! 2021: Domain-specific preprocessing and pretrained model for ranking claims by check-worthiness, in: [88], 2021.
[73] R. Sepúlveda-Torres, E. Saquete, GPLSI team at CLEF CheckThat! 2021: Fine-tuning BETO and RoBERTa, in: [88], 2021.
[74] O. Rjab, H. Haddad, W. Henia, C. Fourati, iCompass at CheckThat! 2021: Identifying check-worthy Arabic tweets, in: [88], 2021.
[75] J. R. Martinez-Rico, J. Martinez-Romo, L. Araujo, NLP&IR@UNED at CheckThat! 2021: Check-worthiness estimation and fake news detection using transformer models, in: [88], 2021.
[76] A. Pritzkau, NLytics at CheckThat! 2021: Check-worthiness estimation as a regression problem on transformers, in: [88], 2021.
[77] A. S. Abumansour, A. Zubiaga, QMUL-SDS at CheckThat! 2021: Enriching pre-trained language models for the estimation of check-worthiness of Arabic tweets, in: [88], 2021.
[78] S. Althabiti, M. Alsalka, E. Atwell, An AraBERT model for check-worthiness of Arabic tweets, in: [88], 2021.
[79] B. Carik, R. Yeniterzi, SU-NLP at CheckThat! 2021: Check-worthiness of Turkish tweets, in: [88], 2021.
[80] M. S. Zengin, Y. S. Kartal, M. Kutlu, TOBB ETU at CheckThat! 2021: Data engineering for detecting check-worthy claims, in: [88], 2021.
[81] I. Baris Schlicht, A. Magnossão de Paula, P. Rosso, UPV at CheckThat!
2021: Mitigating cultural differences for identifying multilingual check-worthy claims, in: [88], 2021. [82] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, 2019, pp. 4171–4186. [83] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, Arxiv:1907.11692 (2019). [84] A. Pritzkau, NLytics at CheckThat! 2021: Multi-class fake news detection of news articles and domain identification with RoBERTa - a baseline model, in: [88], 2021. [85] P. Atanasova, L. Marquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness, in: [90], 2018. [86] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media, in: [91], 2020. [87] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: [88], 2021. [88] G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Working Notes. Working Notes of CLEF 2021–Conference and Labs of the Evaluation Forum, 2021. [89] L. Cappellato, N. Ferro, D. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2019. [90] L. Cappellato, N. 
Ferro, J.-Y. Nie, L. Soulier (Eds.), Working Notes of CLEF 2018–Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2018. [91] L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Working Notes, CEUR Workshop Proceedings, 2020.