Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 1: Check-Worthiness*

Pepa Atanasova1, Preslav Nakov2, Georgi Karadzhov3, Mitra Mohtarami4, and Giovanni Da San Martino2

1 Department of Computer Science, University of Copenhagen, Denmark
pepa@di.ku.dk
2 Qatar Computing Research Institute, HBKU, Doha, Qatar
{pnakov, gmartino}@hbku.edu.qa
3 SiteGround Hosting EOOD, Sofia, Bulgaria
georgi.m.karadjov@gmail.com
4 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA
mitram@mit.edu

Abstract. We present an overview of the 2nd edition of the CheckThat! Lab, part of CLEF 2019, with focus on Task 1: Check-Worthiness in political debates. The task asks to predict which claims in a political debate should be prioritized for fact-checking. In particular, given a debate or a political speech, the goal is to produce a ranked list of its sentences based on their worthiness for fact-checking. This year, we extended the 2018 dataset with 16 more debates and speeches. A total of 47 teams registered to participate in the lab, and eleven of them actually submitted runs for Task 1 (compared to seven last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. The best system achieved mean average precision of 0.166 (0.250 on the speeches, and 0.054 on the debates). This leaves large room for improvement, and thus we release all datasets and scoring scripts, which should enable further research in check-worthiness estimation.

Keywords: Computational journalism · Check-worthiness estimation · Fact-checking · Veracity

* This paper only focuses on Task 1 (Check-Worthiness). For an overview of Task 2 (Factuality), see [18].
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The current coverage of the political landscape in both the press and in social media has led to an unprecedented situation. Like never before, a statement in an interview, a press release, a blog note, or a tweet can spread almost instantaneously across the globe. This speed of propagation has left little time for double-checking claims against the facts, which has proven critical.

The problem caught the public attention in connection with the 2016 US Presidential Campaign, which was influenced by fake news in social media and by false claims. Indeed, some politicians were quick to notice that when it comes to shaping public opinion, facts are secondary, and that appealing to emotions and beliefs works better, especially in social media. It has even been proposed that this marked the dawn of a Post-Truth Age.

As the problem became evident, a number of fact-checking initiatives were started, led by organizations such as FactCheck and Snopes, among many others. Yet, this proved to be a very demanding manual effort, which means that only a relatively small number of claims could be fact-checked.5 This makes it important to prioritize the claims that fact-checkers should consider first. Task 1 of the CheckThat! Lab at CLEF-2019 [10,11] aims to help in that respect, asking participants to build systems that can mimic the selection strategies of a particular fact-checking organization: factcheck.org.

5 Full automation is not yet a viable alternative, partly because of limitations of the existing technology, and partly due to low trust in such methods by the users.
The task is defined as follows: Given a political debate, interview, or speech, transcribed and segmented into sentences, rank the sentences according to the priority with which they should be fact-checked. This is a ranking task, and the participating systems are asked to produce one score per sentence, according to which the sentences are to be ranked. This year, Task 1 was offered for English only (it was also offered in Arabic in the 2018 edition of the lab [2]).

The dataset for this task is an extension of the CT-CWC-18 dataset [2]. We added annotations from three press conferences, six public speeches, six debates, and one post, all fact-checked by experts from factcheck.org.

Figure 1 shows examples of annotated debate fragments. In Figure 1a, Hillary Clinton discusses the performance of her husband, Bill Clinton, as US president. Donald Trump fires back with a claim that is worth fact-checking, namely that Bill Clinton approved NAFTA. In Figure 1b, Donald Trump is accused of having filed for bankruptcy six times, which is also a claim that is worth fact-checking. In Figure 1c, Donald Trump claims that border walls work. In a real-world scenario, the intervention by Donald Trump in Figure 1a, the second one by Hillary Clinton in Figure 1b, and the first two by Donald Trump in Figure 1c should be ranked at the top of the list in order to get the attention of the fact-checker.

H. Clinton: I think my husband did a pretty good job in the 1990s.
H. Clinton: I think a lot about what worked and how we can make it work again...
D. Trump: Well, he approved NAFTA... ○
(a) Fragment from the First 2016 US Presidential Debate.

H. Clinton: He provided a good middle-class life for us, but the people he worked for, he expected the bargain to be kept on both sides.
H. Clinton: And when we talk about your business, you've taken business bankruptcy six times. ○
(b) Another fragment from the First 2016 US Presidential Debate.

D. Trump: It's a lot of murders, but it's not close to 2,000 murders right on the other side of the wall, in Mexico. ○
D. Trump: So everyone knows that walls work. ○
D. Trump: And there are better examples than El Paso, frankly.
(c) Fragment from Trump's National Emergency Remarks in February 2019.

Fig. 1: English debate fragments: check-worthy sentences are marked with ○.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the evaluation framework and the task setup. Section 4 provides an overview of the participating systems, followed by the official results in Section 5, and discussion in Section 6, before we conclude in Section 7.

2 Related Work

Automatic fact-checking is envisioned in [38] as a multi-step process that includes (i) identifying check-worthy statements [14,19,22], (ii) generating questions to be asked about these statements [23], (iii) retrieving relevant information to create a knowledge base [28,35], and (iv) inferring the veracity of the statements, e.g., using text analysis [3,6,34] or external sources [4,5,23,33].

The first work to target check-worthiness was the ClaimBuster system [19]. It was trained on data that was manually annotated by students, professors, and journalists, where each sentence was annotated as non-factual, unimportant factual, or check-worthy factual.
The data consisted of transcripts of 30 historical US election debates covering the period from 1960 until 2012, for a total of 28,029 transcribed sentences. ClaimBuster used an SVM classifier and features such as sentiment, TF.IDF word representations, part-of-speech (POS) tags, and named entities. It did not try to mimic the check-worthiness decisions of any specific fact-checking organization; yet, it was later evaluated against CNN and PolitiFact [20]. In contrast, our dataset is based on actual annotations by a fact-checking organization, and we freely release all data and associated scripts.

More relevant to the setup of Task 1 of this lab is the work of [14], who focused on debates from the US 2016 Presidential Campaign and used pre-existing annotations from nine respected fact-checking organizations (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington Post): a total of four debates and 5,415 sentences. Besides borrowing many of the features of ClaimBuster (together with sentiment, tense, and some other features), their model pays special attention to the context of each sentence. This includes whether it is part of a long intervention by one of the actors and even its position within such an intervention. The authors predicted both (i) whether any of the fact-checking organizations would select the target sentence, and (ii) whether a specific organization would select it.

In follow-up work, [22] developed ClaimRank, which can mimic the claim selection strategies of each and any of the nine fact-checking organizations, as well as of the union of them all. Even though trained on English, it further supports Arabic, which is achieved via cross-language English-Arabic embeddings. In yet another follow-up work, [37] proposed a multi-task learning neural network that learns from nine fact-checking organizations simultaneously and tries to predict for each sentence whether it would be selected for fact-checking by each of these organizations.

The work of [32] also focused on the 2016 US Election campaign, and also used data from nine fact-checking organizations (but a slightly different dataset). They used three Presidential, one Vice-Presidential, and several primary debates (7 Republican and 8 Democratic), for a total of 21,700 sentences. Their setup asked to predict whether any of the fact-checking sources would select the target sentence. They used a boosting-like model that combines SVMs focusing on different clusters of the dataset, and the final outcome was taken from the most confident classifier. The features considered ranged from LDA-based topic modeling to POS tuples and bag-of-words representations.

In the 2018 edition of the task [2,30], we followed a setup similar to that of [14,22,32], but we manually verified the selected sentences, e.g., to adjust the boundaries of the check-worthy claim, and also to include all instances of a selected check-worthy claim (as fact-checkers would only comment on one instance of a claim). We further had an Arabic version of the dataset. Finally, we chose to focus on a single fact-checking organization. The interested reader can also check the system description papers from last year's lab. Notably, the best primary submission last year was that of the Prise de Fer team [41], which used a multilayer perceptron and a feature-rich representation.
The present CLEF-2019 Lab is an extension of the CLEF-2018 CheckThat! Lab [29], and the subtask of identifying check-worthy claims is an extension of last year's subtask [2]. This year, we provide more training and testing data, and we focus on English only.

Finally, there have been some recent related shared tasks. For example, at SemEval-2019 there was a task focusing on fact-checking in community question answering fora [24,25,31] and another one that targeted rumor detection [15].

Table 1: Total number of sentences and number of check-worthy ones in the CT19-T1 corpus.

Type               Partition  Sentences  Check-worthy
Debates            Train      10,648     256
                   Test        4,584      46
Speeches           Train       2,718     282
                   Test        1,883      50
Press Conferences  Train       3,011      36
                   Test          612      14
Posts              Train          44       8
Total              Train      16,421     433
                   Test        7,079     110

3 Evaluation Framework

In this section, we describe the evaluation framework, which includes the dataset and the evaluation measure used.

3.1 Data

The dataset for Task 1 is an extension of the CT-CWC-18 dataset [2]. The full English part of CT-CWC-18 (training and test) has become the training data this year. For the new test set, we produced labelled data from three press conferences, six public speeches, six debates, and one post.

As last year, the annotations for the new instances were derived from the publicly available analysis carried out by factcheck.org. We considered as check-worthy those claims whose factuality was challenged by the fact-checkers, and we made them positive instances in our CT19-T1 dataset. Note that our annotation is at the sentence level. Therefore, if only part of a sentence was fact-checked, we annotated the entire sentence as a positive instance. If a claim spanned more than one sentence, we annotated all these sentences as positive. Moreover, in some cases, the same claim was made multiple times in a debate/speech, and thus we annotated all the sentences that referred to it rather than only the one that was fact-checked. Finally, we manually refined the annotations by moving them to a neighbouring sentence (e.g., in case of an argument) or by adding/excluding some annotations.

Table 1 shows some statistics about the CT19-T1 corpus. Note that the participating systems were allowed to use external datasets with fact-checking related annotations, as well as to extract information from the Web, from social media, etc.

3.2 Evaluation Measures

Recall that we defined Task 1 as an information retrieval problem, where we asked the participating systems to rank the sentences in the input document, so that the check-worthy sentences are ranked at the top of the list. Hence, we use mean average precision (MAP) as the official evaluation measure, which is defined as follows:

MAP = \frac{\sum_{d=1}^{D} AveP(d)}{D}    (1)

where d ∈ D is one of the debates/speeches, and AveP is the average precision, which in turn is defined as follows:

AveP = \frac{\sum_{k=1}^{K} P(k) \times \delta(k)}{\text{# check-worthy claims}}    (2)

where P(k) refers to the value of precision at rank k and δ(k) = 1 iff the claim at that position is actually check-worthy.

As in the 2018 edition of the task [2], following [14] we further report some other measures: (i) mean reciprocal rank (MRR), (ii) mean R-Precision (MR-P), and (iii) mean precision@k (P@k). Here, mean refers to macro-averaging over the testing debates/speeches.
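To make the official measure concrete, the following is a minimal illustrative sketch of how AveP and MAP can be computed from gold labels and system scores. It is not the released scoring script (pointed to in the conclusion); the function and variable names are ours.

```python
from typing import List

def average_precision(gold: List[int], scores: List[float]) -> float:
    """AveP for one debate/speech: gold[i] is 1 if sentence i is check-worthy,
    scores[i] is the system's check-worthiness score for sentence i."""
    # Rank sentences by descending score, as required by the task.
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    num_checkworthy = sum(gold)
    if num_checkworthy == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(ranking, start=1):
        if gold[i] == 1:                    # delta(k) = 1
            hits += 1
            precision_sum += hits / rank    # P(k) at this position
    return precision_sum / num_checkworthy

def mean_average_precision(gold_per_doc, scores_per_doc) -> float:
    """Macro-average AveP over all debates/speeches."""
    aps = [average_precision(g, s) for g, s in zip(gold_per_doc, scores_per_doc)]
    return sum(aps) / len(aps)

# Toy example: one debate with five sentences, two of them check-worthy.
print(mean_average_precision([[0, 1, 0, 0, 1]], [[0.1, 0.9, 0.3, 0.2, 0.8]]))  # 1.0
```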
4 Overview of Participants' Approaches

Eleven teams took part in Task 1. The most successful approaches relied on training supervised classification models to assign a check-worthiness score to each of the sentences. Some participants tried to model the context of each sentence, e.g., by considering the neighbouring sentences when representing an instance [12,16]. Yet, the most successful systems analyzed each sentence in isolation, ignoring the rest of the input text.

Table 2 shows an overview of the approaches used by the participating systems. While many systems relied on embedding representations, feature engineering was also popular this year. The most popular features were bag-of-words representations, part-of-speech (PoS) tags, named entities (NEs), sentiment analysis, and statistics about word use. Two systems also made use of co-reference resolution. The most popular classifiers included SVM, linear regression, Naïve Bayes, decision trees, and neural networks.

The best performing system achieved a score of 0.166 in terms of MAP. This is a clear improvement over the best score from last year's edition of the task, 0.1332 MAP, but of course the results are not directly comparable, as we have a new test set. Apart from improvements in the participants' solutions, the increased performance of the systems can also be attributed to the fact that we provided twice as much data as we did last year, both for training and for testing purposes.

Table 2: Overview of the approaches to Task 1: check-worthiness. The learning model and the representations for the best system [17] are highlighted. Learning models: neural networks (LSTM, feed-forward), SVM, Naïve Bayes, logistic regression, regression trees. Representations: embeddings (word, PoS, syntactic dependency, SUSE), bag of words/n-grams/NEs/PoS, readability, syntactic n-grams, sentiment, subjectivity, sentence context, topics. Teams: [1] TOBB ETU, [8] UAICS, [9] JUNLP, [12] TheEarthIsFlat, [13] IPIPAN, [17] Copenhagen, [27] é proibido cochilar, [36] Terrier, [–] IIT (ISM) Dhanbad, [–] Factify, [–] nlpir01.

Team Copenhagen achieved the best overall performance by building upon their approach from 2018 [16,17]. Their system learned dual token embeddings (domain-specific word embeddings and syntactic dependencies) and used them in an LSTM recurrent neural network. They further pre-trained this network with previous Trump and Clinton debates, and then supervised it weakly with the ClaimBuster system.6 In their primary submission, they used a contrastive ranking loss (excluded in their contrastive1 run); a minimal illustration of such a pairwise ranking loss is sketched below. For their contrastive2 submission, they concatenated the representations for the current and for the previous sentence.

6 http://idir.uta.edu/claimbuster/

Team TheEarthIsFlat [12] trained a feed-forward neural network with two hidden layers, which takes as input Standard Universal Sentence Encoder (SUSE) embeddings [7] for the current sentence as well as for the two previous sentences as context. In their contrastive1 run, they replaced these embeddings with those of the Large Universal Sentence Encoder, and in their contrastive2 run, they trained the model for 1,350 epochs rather than for 1,500 epochs.

Team IPIPAN first extracted various features about the claims, including bag-of-words n-grams, word2vec vector representations [26], named entity types, part-of-speech tags, sentiment scores, and features from a statistical analysis of the sentences [13]. Then, they used these features in an L1-regularized logistic regression to predict the check-worthiness of the sentences.

Team Terrier represented the sentences using bag-of-words and named entities [36]. They used co-reference resolution to substitute the pronouns by the referring entity/person name. They further computed entity similarity [39] and entity relatedness [40]. For prediction, they used an SVM classifier.
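The exact architecture and loss of team Copenhagen are described in [17]; the sketch below is only a minimal illustration of the general idea of a pairwise (contrastive) ranking objective, in which a check-worthy sentence should receive a higher score than a non-check-worthy sentence from the same document. The linear scorer, the embedding dimension, and the margin are illustrative assumptions of ours, not the team's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative scorer: a linear layer over fixed-size sentence representations
# (e.g., precomputed sentence embeddings); the actual Copenhagen model is an LSTM.
EMB_DIM = 128
scorer = nn.Linear(EMB_DIM, 1)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def pairwise_ranking_loss(pos_emb, neg_emb, margin=1.0):
    """Hinge loss encouraging score(check-worthy) > score(non-check-worthy) + margin."""
    pos_scores = scorer(pos_emb).squeeze(-1)
    neg_scores = scorer(neg_emb).squeeze(-1)
    return torch.clamp(margin - (pos_scores - neg_scores), min=0.0).mean()

# Toy batch: 32 (check-worthy, non-check-worthy) sentence pairs sampled from
# the same debate, represented here by random embeddings.
pos_batch = torch.randn(32, EMB_DIM)
neg_batch = torch.randn(32, EMB_DIM)

loss = pairwise_ranking_loss(pos_batch, neg_batch)
loss.backward()
optimizer.step()

# At test time, the sentences of a debate are simply ranked by their scalar scores.
with torch.no_grad():
    test_scores = scorer(torch.randn(5, EMB_DIM)).squeeze(-1)
    ranking = torch.argsort(test_scores, descending=True)
```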
Team UAICS used a Naïve Bayes classifier with bag-of-words features [8]. In their contrastive submissions, they used other models, e.g., logistic regression.

Team Factify used the pre-trained ULMFiT model [21] and fine-tuned it on the training set. They further over-sampled the minority class by randomly replacing words with similar words based on word2vec similarity. They also used data augmentation based on back-translation, where each sentence was translated to French, Arabic, and Japanese and then back to English.

Team JUNLP extracted various features, including syntactic n-grams, sentiment polarity, text subjectivity, and the LIX readability score, and used them to train a logistic regression classifier with high recall [9]. Then, they trained an LSTM model fed with word representations from GloVe and part-of-speech tags. The sentence representations from the LSTM model were concatenated with the extracted features and used for prediction by a fully connected layer, which had high precision. Finally, they averaged the posterior probabilities of the two models in order to come up with the final check-worthiness score for each sentence.

Team nlpir01 extracted features such as tf-idf word vectors, tf-idf PoS vectors, and word, character, and PoS tag counts. Then, they used these features in a multilayer perceptron regressor with two hidden layers, each of size 2,000. For their contrastive1 run, they oversampled the minority class, and for their contrastive2 run, they reduced the number of units in each layer to 480.

Team TOBB ETU used linguistic features such as named entities, topics extracted with IBM Watson's NLP tools, PoS tags, bigram counts, and indicators of the type of sentence to train a multiple additive regression tree [1]. They further decreased the ranks of some sentences using hand-crafted rules. In their contrastive1 run, they added the speaker as a feature, while in their contrastive2 run they used logistic regression.

Team IIT (ISM) Dhanbad trained an LSTM-based recurrent neural network. They fed the network with word2vec embeddings and features extracted from constituency parse trees, as well as features based on named entities and sentiment analysis.

Team é proibido cochilar trained an SVM model on bag-of-words representations of the sentences, after performing co-reference resolution and removing all digits [27]. They further used an additional corpus of labelled claims, which they extracted from fact-checking websites, aiming at a more balanced training corpus and potentially better generalization.7

7 Their claim crawling tool: http://github.com/vwoloszyn/fake_news_extractor

Compared to last year, a number of new features were introduced, such as context features, readability scores, topics, and subjectivity. In terms of representation, we see not only word embeddings (as last year), but also new representations based on PoS tags, syntactic dependencies, and SUSE embeddings. Moreover, this year the participants actively explored ways of using external data and re-ranking techniques, with systems going beyond simple classification and introducing specialized ranking losses and regression models.
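As an illustration of the recipe shared by several of the feature-based systems above (bag-of-words style features fed to a linear classifier whose positive-class probability is used as the ranking score), the sketch below shows a minimal baseline. It does not reproduce any particular team's system; the toy data and all names are ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: sentences from training debates with 0/1 check-worthiness labels.
train_sentences = [
    "Well, he approved NAFTA.",
    "I think a lot about what worked.",
    "You've taken business bankruptcy six times.",
    "Thank you very much.",
]
train_labels = [1, 0, 1, 0]

# Bag-of-words (tf-idf) features plus logistic regression, a recurring combination
# among the feature-based submissions.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X_train = vectorizer.fit_transform(train_sentences)
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, train_labels)

# Ranking a test debate: score every sentence and sort by descending probability
# of the check-worthy class, as required by the task.
test_sentences = [
    "So everyone knows that walls work.",
    "And there are better examples than El Paso, frankly.",
]
scores = clf.predict_proba(vectorizer.transform(test_sentences))[:, 1]
ranked = sorted(zip(test_sentences, scores), key=lambda p: p[1], reverse=True)
for sentence, score in ranked:
    print(f"{score:.3f}\t{sentence}")
```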
Table 3: Results for Task 1: Check-worthiness. The results for the primary submission appear next to the team's identifier, followed by the contrastive submissions (if any). The numbers in parentheses indicate the rank of each primary submission with respect to the corresponding evaluation measure.

Team                      MAP        RR         R-P        P@1        P@3        P@5        P@10       P@20       P@50
[17] Copenhagen           .1660 (1)  .4176 (3)  .1387 (4)  .2857 (2)  .2381 (1)  .2571 (1)  .2286 (2)  .1571 (2)  .1229 (2)
  contr.-1                .1496      .3098      .1297      .1429      .2381      .2000      .2000      .1429      .1143
  contr.-2                .1580      .2740      .1622      .1429      .1905      .2286      .2429      .1786      .1200
[12] TheEarthIsFlat       .1597 (2)  .1953 (11) .2052 (1)  .0000 (4)  .0952 (3)  .2286 (2)  .2143 (3)  .1857 (1)  .1457 (1)
  contr.-1                .1453      .3158      .1101      .2857      .2381      .1429      .1429      .1357      .1171
  contr.-2                .1821      .4187      .1937      .2857      .2381      .2286      .2286      .2143      .1400
[13] IPIPAN               .1332 (3)  .2864 (6)  .1481 (2)  .1429 (3)  .0952 (3)  .1429 (5)  .1714 (5)  .1500 (3)  .1171 (3)
[36] Terrier              .1263 (4)  .3253 (5)  .1088 (8)  .2857 (2)  .2381 (1)  .2000 (3)  .2000 (4)  .1286 (6)  .0914 (7)
[8] UAICS                 .1234 (5)  .4650 (1)  .1460 (3)  .4286 (1)  .2381 (1)  .2286 (2)  .2429 (1)  .1429 (4)  .0943 (6)
  contr.-1                .0649      .2817      .0655      .1429      .2381      .1429      .1143      .0786      .0343
  contr.-2                .0726      .4492      .0547      .4286      .2857      .1714      .1143      .0643      .0257
Factify                   .1210 (6)  .2285 (8)  .1292 (5)  .1429 (3)  .0952 (3)  .1143 (6)  .1429 (6)  .1429 (4)  .1086 (4)
[9] JUNLP                 .1162 (7)  .4419 (2)  .1128 (7)  .2857 (2)  .1905 (2)  .1714 (4)  .1714 (5)  .1286 (6)  .1000 (5)
  contr.-1                .0976      .3054      .0814      .1429      .2381      .1429      .0857      .0786      .0771
  contr.-2                .1226      .4465      .1357      .2857      .2381      .2000      .1571      .1286      .0886
nlpir01                   .1000 (8)  .2840 (7)  .1063 (9)  .1429 (3)  .2381 (1)  .1714 (4)  .1000 (8)  .1214 (7)  .0943 (6)
  contr.-1                .0966      .3797      .0849      .2857      .1905      .2286      .1429      .1071      .0886
  contr.-2                .0965      .3391      .1129      .1429      .2381      .2286      .1571      .1286      .0943
[1] TOBB ETU              .0884 (9)  .2028 (10) .1150 (6)  .0000 (4)  .0952 (3)  .1429 (5)  .1286 (7)  .1357 (5)  .0829 (8)
  contr.-1                .0898      .2013      .1150      .0000      .1429      .1143      .1286      .1429      .0829
  contr.-2                .0913      .3427      .1007      .1429      .1429      .1143      .0714      .1214      .0829
IIT (ISM) Dhanbad         .0835 (10) .2238 (9)  .0714 (11) .0000 (4)  .1905 (2)  .1143 (6)  .0857 (9)  .0857 (9)  .0771 (9)
[27] é proibido cochilar  .0796 (11) .3514 (4)  .0886 (10) .1429 (3)  .2381 (1)  .1429 (5)  .1286 (7)  .1071 (8)  .0714 (10)
  contr.-1                .1357      .5414      .1595      .4286      .2381      .2571      .2714      .1643      .1200

5 Evaluation

The participants were allowed to submit one primary and up to two contrastive runs in order to test variations of their primary models or entirely different alternative models. Only the primary runs were considered for the official ranking. A total of eleven teams submitted 21 runs. Table 3 shows the results.

The best-performing system was the one by team Copenhagen. They achieved a strong MAP score using a ranking loss based on contrastive sampling. Indeed, this is the only team that modelled the task as a ranking one, and the decrease in performance without the ranking loss (see their contrastive1 run) shows the importance of using this loss.

Two teams made use of external datasets: team Copenhagen used a weakly supervised dataset for pre-training, and team é proibido cochilar included claims scraped from several fact-checking websites.

In order to address the class imbalance in the training dataset, team nlpir01 used oversampling in their contrastive1 run, but could not gain any improvements. Oversampling and augmenting with additional data points did not help team é proibido cochilar either.

Many systems used pre-trained sentence or word embedding models. Team TheEarthIsFlat, which had the second-best performing system, used the Standard Universal Sentence Encoder embeddings, which performed well on the task. The best MAP score overall was achieved by the contrastive2 run of this team: the only difference with respect to their primary submission was the number of training epochs. Some teams also used fine-tuning, e.g., team Factify fine-tuned the ULMFiT model on the training dataset.
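For context on the class-imbalance handling mentioned above, the snippet below shows the simplest form of minority-class oversampling: random duplication of check-worthy examples before training. It is a generic illustration with our own names, not team nlpir01's actual implementation.

```python
import random

random.seed(0)

def oversample_minority(sentences, labels, target_ratio=0.5):
    """Randomly duplicate check-worthy (label 1) examples until they make up
    roughly `target_ratio` of the training set."""
    positives = [(s, l) for s, l in zip(sentences, labels) if l == 1]
    if not positives:
        return list(sentences), list(labels)
    data = list(zip(sentences, labels))
    while sum(l for _, l in data) / len(data) < target_ratio:
        data.append(random.choice(positives))
    random.shuffle(data)
    return [s for s, _ in data], [l for _, l in data]

# Toy example: 1 check-worthy sentence out of 4 becomes roughly half the data.
sents = ["claim A", "chit-chat B", "chit-chat C", "chit-chat D"]
labs = [1, 0, 0, 0]
os_sents, os_labs = oversample_minority(sents, labs)
print(sum(os_labs), "positives out of", len(os_labs))
```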
Interestingly, the top-performing run was an unofficial one, namely the contrastive2 run by team TheEarthIsFlat [12]. As described in Section 4, this model consists of a feed-forward neural network fed with Standard Universal Sentence Encoder embeddings. The only difference with their primary run is the number of epochs they trained the network for.

6 Discussion

Debates vs. Speeches. While the training data included debates only, the test data also contained speeches. Thus, it is interesting to see how the systems perform on debates vs. speeches. Table 4 shows the MAP for the primary submissions. This year again, the performance on speeches was better than on debates. The best MAP on speeches last year was .1460, while on debates it was .1011. We can see that this year the performance on speeches improved by more than 10% absolute, while the performance on debates decreased by almost 5%. We are not sure why this is the case, but we note that the speeches in our test dataset contain about twice as many check-worthy claims as the debates do (see Table 1).

Table 4: MAP for the primary submissions for debates vs. speeches. The numbers in parentheses indicate the rank of each submission on the corresponding test subset.

Team                        Debates     Speeches
[17] Copenhagen             .0538 (2)   .2502 (1)
[12] TheEarthIsFlat         .0487 (3)   .2430 (2)
[13] IPIPAN                 .0632 (1)   .1858 (5)
[36] Terrier                .0210 (11)  .2053 (3)
[8] UAICS                   .0235 (10)  .1983 (4)
Factify                     .0437 (4)   .1790 (6)
[9] JUNLP                   .0387 (5)   .1743 (7)
nlpir01                     .0329 (8)   .1504 (8)
[1] TOBB ETU                .0314 (9)   .1311 (9)
IIT (ISM) Dhanbad, India    .0364 (6)   .1188 (10)
[27] é proibido cochilar    .0351 (7)   .1130 (11)

Ensembles. We further experimented with constructing an ensemble using the scores of the individual systems. In particular, we first performed min-max normalization of the predictions of each individual system, and then we summed these normalized scores.8 (A minimal sketch of this procedure is shown below.) The overall results are shown in Table 5. We can see that there is a small improvement for the ensemble over the best individual system in terms of MAP (compare the first to the last line). The results for the other evaluation measures are somewhat mixed, which might be due to contradictions or duplication of the information coming from the different systems.

8 We also tried summing the reciprocal ranks that the systems assigned to each sentence, but this yielded much worse results.

Ablation. Table 5 further shows the results of ablation experiments, where we add one system at a time to the ensemble. We can see that the ensemble of the top-4 systems works best. However, as we keep adding systems, the score does not always improve. This can be due to various reasons, e.g., noise or wrong signals coming from the added systems can harm rather than improve the final score. The best combination of teams in terms of MAP is shown on the last line of the table; it includes the teams Copenhagen, TheEarthIsFlat, and Terrier.
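The ensembling procedure described above is simple enough to sketch. The code below reflects our reading of it (min-max normalization of each system's scores, followed by a sum); it is not the organizers' actual script, and the names and toy values are ours.

```python
import numpy as np

def min_max(scores):
    """Scale one system's scores for one debate/speech to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # constant scores: return zeros to avoid division by zero
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)

def ensemble(system_scores):
    """Sum the min-max normalized scores of the individual systems."""
    return np.sum([min_max(s) for s in system_scores], axis=0)

# Toy example: three systems scoring the same five sentences of one debate.
scores_a = np.array([0.90, 0.10, 0.40, 0.30, 0.80])
scores_b = np.array([2.10, 0.50, 1.90, 0.20, 1.00])
scores_c = np.array([0.30, 0.20, 0.90, 0.10, 0.60])

combined = ensemble([scores_a, scores_b, scores_c])
ranking = np.argsort(-combined)  # rank sentences by descending ensemble score
print(ranking)
```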
Table 5: Ablation results from adding one team at a time to the ensemble. The ensemble score is the sum of the scores of the teams in the ensemble. The order in which the teams are added to the ensemble is based on their official rank.

Team                                    MAP     RR      R-P     P@1     P@3     P@5     P@10    P@20    P@50
Copenhagen (best team)                  .1660   .4176   .1387   .2857   .2381   .2571   .2286   .1571   .1229
+ TheEarthIsFlat                        .1694   .3959   .1613   .2857   .1905   .2000   .2286   .1714   .1286
+ IPIPAN                                .1647   .3681   .1548   .1429   .1905   .2571   .2000   .1714   .1314
+ Terrier                               .1707   .4474   .1694   .2857   .2857   .2571   .2143   .1929   .1286
+ UAICS                                 .1601   .4025   .1557   .2857   .2381   .2286   .2143   .1714   .1257
+ Factify                               .1605   .3983   .1655   .2857   .2381   .2286   .2143   .1571   .1286
+ JUNLP                                 .1547   .3211   .1873   .1429   .2381   .2571   .2000   .1929   .1229
+ nlpir01                               .1538   .3376   .1930   .1429   .2381   .2286   .1714   .2071   .1257
+ TOBB ETU                              .1530   .3386   .1918   .1429   .2381   .2286   .1714   .2000   .1257
+ IIT (ISM) Dhanbad, India              .1476   .3148   .1918   .1429   .1905   .2000   .1714   .1857   .1257
+ é proibido cochilar                   .1514   .4585   .1750   .2857   .2381   .2571   .2143   .1786   .1200
Copenhagen + TheEarthIsFlat + Terrier   .1747   .3771   .1877   .2857   .1905   .2000   .2429   .1929   .1257

7 Conclusion and Future Work

We have presented an overview of the CLEF-2019 CheckThat! Lab Task 1 on Automatic Identification of Claims. The task asked the participating teams to predict which claims in a political debate should be prioritized for fact-checking. As part of the CheckThat! lab, we release the dataset and the evaluation tools in order to enable further research in check-worthiness estimation.9

9 http://github.com/apepa/clef2019-factchecking-task1

A total of 11 teams participated in Task 1 (compared to 7 in 2018). The evaluation results show that the most successful approaches used various neural networks and logistic regression.

In future work, we want to expand the dataset with more annotations, which should enable more interesting neural network architectures. We also plan to include information from multiple fact-checking organizations. As noted in [14], the agreement between the claim selection choices of different fact-checking organizations is low, which points to a certain bias in the selection of claims by each organization; aggregating the annotations from multiple sources could potentially help in that respect. It would further enable multi-task learning [37].

Acknowledgments

We want to thank Spas Kyuchukov and Oliver Ren for taking part in the annotation process of the new speeches and debates for this year's extended dataset.

This work is part of the Tanbih project,10 which aims to limit the effect of "fake news", propaganda, and media bias by making users aware of what they are reading. The project is developed in collaboration between the Qatar Computing Research Institute (QCRI), HBKU and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

10 http://tanbih.qcri.org/

References

1. Altun, B., Kutlu, M.: TOBB-ETU at CLEF 2019: Prioritizing claims based on check-worthiness. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
2. Atanasova, P., Màrquez, L., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 1: Check-worthiness. In: CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (2018)
3. Atanasova, P., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Karadzhov, G., Mihaylova, T., Mohtarami, M., Glass, J.: Automatic fact-checking using context and discourse information. Journal of Data and Information Quality (JDIQ) 11(3), 12 (2019)
4. Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., Nakov, P.: Predicting factuality of reporting and bias of news media sources. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 3528–3539. EMNLP '18, Brussels, Belgium (2018)
5. Baly, R., Mohtarami, M., Glass, J., Màrquez, L., Moschitti, A., Nakov, P.: Integrating stance detection and fact checking in a unified corpus. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 21–27. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
6. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web. pp. 675–684. WWW '11, Hyderabad, India (2011)
7. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
8. Coca, L., Cusmuliuc, C.G., Iftene, A.: 2019 UAICS. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
9. Dhar, R., Dutta, S., Das, D.: A hybrid model to rank sentences for check-worthiness. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
10. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: CheckThat! at CLEF 2019: Automatic identification and verification of claims. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 309–315. ECIR '19, Springer International Publishing (2019)
11. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. LNCS, Lugano, Switzerland (2019)
12. Favano, L., Carman, M., Lanzi, P.: TheEarthIsFlat's submission to CLEF'19 CheckThat! challenge. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
13. Gasior, J., Przybyla, P.: The IPIPAN team participation in the check-worthiness task of the CLEF2019 CheckThat! lab. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
14. Gencheva, P., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Koychev, I.: A context-aware approach for detecting worth-checking claims in political debates. In: Proceedings of the International Conference Recent Advances in Natural Language Processing. pp. 267–276. RANLP '17, Varna, Bulgaria (2017)
15. Gorrell, G., Aker, A., Bontcheva, K., Derczynski, L., Kochkina, E., Liakata, M., Zubiaga, A.: SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 845–854. Minneapolis, Minnesota, USA (2019)
16. Hansen, C., Hansen, C., Alstrup, S., Simonsen, J.G., Lioma, C.: Neural check-worthiness ranking with weak supervision: Finding sentences for fact-checking. arXiv preprint arXiv:1903.08404 (2019)
17. Hansen, C., Hansen, C., Simonsen, J., Lioma, C.: Neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
18. Hasanain, M., Suwaileh, R., Elsayed, T., Barrón-Cedeño, A., Nakov, P.: Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
19. Hassan, N., Li, C., Tremayne, M.: Detecting check-worthy factual claims in presidential debates. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1835–1838. CIKM '15 (2015)
20. Hassan, N., Tremayne, M., Arslan, F., Li, C.: Comparing automated factual claim detection against judgments of journalism organizations. In: Computation + Journalism Symposium. Stanford, California, USA (2016)
21. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
22. Jaradat, I., Gencheva, P., Barrón-Cedeño, A., Màrquez, L., Nakov, P.: ClaimRank: Detecting check-worthy claims in Arabic and English. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 26–30. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
23. Karadzhov, G., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Koychev, I.: Fully automated fact checking using external sources. In: Proceedings of the Conference on Recent Advances in Natural Language Processing. pp. 344–353. RANLP '17, Varna, Bulgaria (2017)
24. Mihaylova, T., Karadzhov, G., Atanasova, P., Baly, R., Mohtarami, M., Nakov, P.: SemEval-2019 task 8: Fact checking in community question answering forums. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 860–869 (2019)
25. Mihaylova, T., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Mohtarami, M., Karadjov, G., Glass, J.: Fact checking in community forums. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. pp. 5309–5316. AAAI '18, New Orleans, Louisiana, USA (2018)
26. Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 746–751. NAACL-HLT '13, Atlanta, Georgia, USA (2013)
27. Mohtaj, S., Himmelsbach, T., Woloszyn, V., Möller, S.: The TU-Berlin team participation in the check-worthiness task of the CLEF-2019 CheckThat! lab. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
28. Mohtarami, M., Baly, R., Glass, J., Nakov, P., Màrquez, L., Moschitti, A.: Automatic stance detection using end-to-end memory networks. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 767–776. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
29. Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Atanasova, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In: Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 372–387. Lecture Notes in Computer Science, Springer, Avignon, France (2018)
30. Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Gencheva, P., Kyuchukov, S., Da San Martino, G.: CLEF-2018 lab on automatic identification and verification of claims in political debates. In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CLEF '18, Avignon, France (2018)
31. Nakov, P., Mihaylova, T., Màrquez, L., Shiroya, Y., Koychev, I.: Do not trust the trolls: Predicting credibility in community question answering forums. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing. pp. 551–560. RANLP '17, Varna, Bulgaria (2017)
32. Patwari, A., Goldwasser, D., Bagchi, S.: TATHYA: a multi-classifier system for detecting check-worthy statements in political debates. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 2259–2262. CIKM '17, Singapore (2017)
33. Popat, K., Mukherjee, S., Strötgen, J., Weikum, G.: Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In: Proceedings of the 26th International Conference on World Wide Web Companion. pp. 1003–1012. WWW '17, Perth, Australia (2017)
34. Rashkin, H., Choi, E., Jang, J.Y., Volkova, S., Choi, Y.: Truth of varying shades: Analyzing language in fake news and political fact-checking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 2931–2937. EMNLP '17, Copenhagen, Denmark (2017)
35. Shiralkar, P., Flammini, A., Menczer, F., Ciampaglia, G.L.: Finding streams in knowledge graphs to support fact checking. In: Proceedings of the IEEE International Conference on Data Mining. pp. 859–864. ICDM '17, New Orleans, Louisiana, USA (2017)
36. Su, T., Macdonald, C., Ounis, I.: Entity detection for check-worthiness prediction: Glasgow Terrier at CLEF CheckThat! 2019. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
37. Vasileva, S., Atanasova, P., Màrquez, L., Barrón-Cedeño, A., Nakov, P.: It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing. RANLP '19, Varna, Bulgaria (2019)
38. Vlachos, A., Riedel, S.: Fact checking: Task definition and dataset construction. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. pp. 18–22. Baltimore, Maryland, USA (2014)
39. Zhu, G., Iglesias, C.A.: Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering 29(1), 72–85 (2016)
40. Zhu, G., Iglesias, C.A.: Sematch: Semantic entity search from knowledge graph. In: Cheng, G., Gunaratna, K., Thalhammer, A., Paulheim, H., Voigt, M., García, R. (eds.) Joint Proceedings of the 1st International Workshop on Summarizing and Presenting Entities and Ontologies and the 3rd International Workshop on Human Semantic Web Interfaces. CEUR Workshop Proceedings, vol. 1556. CEUR-WS.org, Portoroz, Slovenia (2015)
41. Zuo, C., Karakas, A., Banerjee, R.: A hybrid recognition system for check-worthy claims using heuristics and supervised learning. In: Cappellato, L., Ferro, N., Nie, J., Soulier, L. (eds.) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018)