Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality

Maram Hasanain1, Reem Suwaileh1, Tamer Elsayed1, Alberto Barrón-Cedeño2, and Preslav Nakov3

1 Computer Science and Engineering Department, Qatar University, Doha, Qatar {maram.hasanain,rs081123,telsayed}@qu.edu.qa
2 DIT, Università di Bologna, Forlì, Italy a.barron@unibo.it
3 Qatar Computing Research Institute, HBKU, Doha, Qatar pnakov@qf.org.qa

Abstract. We present an overview of Task 2 of the second edition of the CheckThat! Lab at CLEF 2019. Task 2 asked (A) to rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) to classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) to identify useful passages from these pages, and (D) to use the useful pages to predict the claim's factuality. Task 2 at CheckThat! provided a full evaluation framework, consisting of data in Arabic (gathered and annotated from scratch) and evaluation based on normalized discounted cumulative gain (nDCG) for ranking and F1 for classification. Four teams submitted runs. The most successful approach to subtask A used learning-to-rank, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important task of evidence-based automatic claim verification.

Keywords: Fact-Checking · Veracity · Evidence-based Verification · Fake News Detection · Computational Journalism

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The spread of "fake news" in all types of online media created a pressing need for automatic fake news detection systems [23]. The problem has various aspects [24], but here we are interested in identifying the information that is useful for fact-checking a given claim, and then also in predicting its factuality [5,13,20,22,25,28].

Fig. 1: Information verification pipeline with the two tasks in the CheckThat! lab: check-worthiness estimation and factuality verification.

Evidence-based fake news detection systems can serve fact-checking in two ways: (i) by facilitating the job of a human fact-checker, but not replacing her, and (ii) by increasing her trust in a system's decision [19,22,25]. We focus on the problem of checking the factuality of a claim, which has been studied before, but rarely in the context of evidence-based fake news detection systems [3,4,7,15,17,21,27,29].

There are several challenges that make the development of automatic fake news detection systems difficult:

1. A fact-checking system is effective if it is able to identify a false claim before it reaches a large audience. Thus, the current speed at which claims spread on the Internet and social media imposes strict efficiency constraints on fact-checking systems.

2. The problem is difficult to the extent that, in some cases, even humans can hardly distinguish between fake and true news [24].
3. There are very few large-scale benchmark datasets that could be used to test and improve fake news detection systems [24,25].

Thus, in 2018 we started the CheckThat! lab on Automatic Identification and Verification of Political Claims [1,6,18]. We organized a second edition of the lab in 2019 [2,8,9], which aims at providing a full evaluation framework along with large-scale evaluation datasets. The lab this year is organized around two different tasks, which correspond to the main blocks in the verification pipeline, as depicted in Figure 1.

This paper describes Task 2: Evidence and Factuality. This task focuses on extracting evidence from the Web to support the making of a veracity judgment for a given target claim. We divide Task 2 into the following four subtasks: (A) ranking Web pages with respect to a check-worthy claim based on their potential usefulness for fact-checking that claim; (B) classifying Web pages according to their degree of usefulness for fact-checking the target claim; (C) extracting passages from these Web pages that would be useful for fact-checking the target claim; and (D) using these useful pages to verify whether the target claim is factually true or not.

Since Task 2 in this edition of the lab had a different goal from last year's [18], we built a new dataset from scratch by manually curating claims, retrieving Web pages through a commercial search engine, and then hiring both in-house and crowd annotators to collect judgments for the four subtasks. As a result of our efforts, we release the CT19-T2 dataset, which contains Arabic claims as well as retrieved Web pages, along with three sets of annotations for the four subtasks.

Four teams participated in this year's Task 2, and they submitted 55% more runs compared to the 2018 edition [18]. The most successful systems relied on supervised machine learning models for both ranking and classification. We believe that there is still large room for improvement, and thus we release the annotated corpora and the evaluation scripts, which should enable further research on evidence-supported automatic claim verification.1

The remainder of this paper is organized as follows. Section 2 discusses the task in detail. Section 3 describes the dataset. Section 4 describes the participants' approaches and their performance on the four subtasks. Finally, Section 5 draws some conclusions and points to possible directions for future work.

2 Task Definition

Task 2 focuses on building tools to verify the factuality of a given claim. This is the first-ever version of this task, and we run it in Arabic.2 The task is formally defined as follows:

Given a check-worthy claim c and a set of Web pages P (the retrieved results of Web search in response to a search query representing c), identify which of the Web pages (and passages of those Web pages) can be useful for assisting a human in fact-checking the claim. Finally, determine the factuality of the claim according to the supporting information in the useful pages and passages.

As Figure 2 shows, the task is divided into four subtasks that target different aspects of the problem.

Fig. 2: A zoom into the four subtasks in Task 2.

1 http://sites.google.com/view/clef2019-checkthat/datasets-tools
2 In 2018, we had a different fact-checking task, where no retrieved Web pages were provided [6].
Subtask A, Webpage ranking: Rank the Web pages P based on how useful they are for verifying the target claim. The systems are asked to produce a score for each page, based on which the pages would be ranked. See the definition of "useful" below.

Subtask B, Webpage classification: Classify each Web page p ∈ P as "very useful for verification", "useful", "not useful", or "not relevant." A page p is considered very useful for verification if it is relevant with respect to c (i.e., on-topic and discussing the claim) and it provides sufficient evidence to verify the veracity of c, such that there is no need for another document to be considered for verifying this claim. A page is useful for verification if it is relevant to the claim and provides some valid evidence, but it is not sufficient on its own to determine the veracity of c. The evidence can be a source, some statistics, a quote, etc. A particular piece of evidence is considered not valid if the source cannot be verified or is ambiguous (e.g., expressing that "experts say that. . ." without mentioning who those experts are), or if it is just an opinion of a person/expert instead of an objective analysis. Notice that this is different from stance detection: a page might agree with a claim, but still lack the evidence needed to verify it.

Subtask C, Passage identification: Find passages within the Web pages P that are useful for claim verification. Again, notice that this is different from stance detection.

Subtask D, Claim classification: Classify the claim's factuality as "true" or "false." The claim is considered true if it is accurate as stated (or there is sufficient reliable evidence supporting it); otherwise, it is considered false.

Figure 3 shows an example: a Web page that is considered useful for verifying the given claim, since it contains evidence showing the claim to be true and it is an official United Kingdom page on national statistics. The useful passage in the page is the one reporting the supporting statistics. For the sake of readability, the example is given in English, but this year the task was offered only in Arabic.

Fig. 3: English claim ("e-commerce sales in UK increased by 8 billions between 2015 and 2016"), a useful Web page, and a useful passage (in the orange rectangle on the right).

Figure 4 shows an Arabic example of an actual claim, a useful Web page, and a paragraph from our training dataset. The claim translates to English as follows: "The Confederation of African Football has withdrawn the organization of the Africa Cup of Nations from Cameroon." The page shows a news article reporting the news; it is useful for fact-checking since it contains a quotation of an official statement confirming the claim.

Fig. 4: Arabic claim, a useful Web page, and a useful passage (in the orange rectangle on the right) from the training data.
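To summarize the setup of this section, the inputs and the expected system outputs of the four subtasks can be pictured as simple data structures. The sketch below is purely illustrative: the class and field names are our own and do not correspond to the official CT19-T2 file formats.

```python
from dataclasses import dataclass
from typing import Dict, List

# Illustrative sketch only: names below are hypothetical, not the official CT19-T2 format.

@dataclass
class WebPage:
    page_id: str          # identifier of a retrieved page
    passages: List[str]   # the page split into passages (Subtask C operates on these)

@dataclass
class ClaimInstance:
    claim_id: str
    claim_text: str       # the check-worthy claim c (in Arabic)
    pages: List[WebPage]  # the retrieved Web pages P for the claim

@dataclass
class SystemOutput:
    page_scores: Dict[str, float]    # Subtask A: a usefulness score per page, used for ranking
    page_labels: Dict[str, str]      # Subtask B: "very useful" / "useful" / "not useful" / "not relevant"
    passage_labels: Dict[str, bool]  # Subtask C: useful vs. not useful, per passage
    claim_veracity: bool             # Subtask D: True if the claim is judged factually true
```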
3 Dataset

Collecting claims. Subtasks A, B, and C are all new to the lab this year. As a result, we built a new evaluation dataset to support all subtasks: the CT19-T2 corpus. We selected 69 claims from multiple sources, including a pre-existing set of Arabic claims [5], a survey in which we asked the public to provide examples of claims they have heard of, and some headlines from six Arabic news agencies that we rewrote into claims. The selected news agencies are well-known in the Arab world: Al Jazeera, BBC Arabic, CNN Arabic, Al Youm Al Sabea, Al Arabiya, and RT Arabic. We made sure the claims span different topical domains, e.g., health or sports, besides politics. Ten claims were released for training and the rest were used for testing.

Labeling claims. We acquired the veracity labels for the claims in two steps. First, two of the lab organizers labeled each of the 69 claims independently. Then, they met to resolve any disagreements and thus reached consensus on the veracity labels for all claims.

Labeling pages and passages. We formulated a query representing each claim, and we issued it against the Google search engine in order to retrieve the top 100 Web pages. We used a language detection tool to filter out non-Arabic pages, and we eventually used the top 50 of the remaining pages. The labeling pipeline was carried out as follows:

1. Relevance. We first identified relevant pages, since we assume that non-relevant pages cannot be useful for claim verification, and thus should be filtered out from any further labeling. In order to speed up the relevance labeling process, we hired two types of annotators: Amazon Mechanical Turk crowd-workers and in-house annotators. Each page was labeled by three annotators, and the majority label was used as the final page label.

2. Usefulness as a whole. Relevant pages were then given to in-house annotators to be labeled for usefulness using a two-way classification scheme: useful (including very useful, but not distinguishing between the two) and not useful. Similarly to relevance labeling, each page was labeled by three annotators, and the final page label was the majority label.

3. Useful vs. very useful. One of the lab organizers went over the useful pages from step 2 and further classified them into useful and very useful. We opted for this design since we found through pilot studies that the annotators found it difficult to differentiate between useful and very useful pages.

4. Splitting into passages. We manually split the useful and the very useful pages into passages, as we found that the automatic techniques for splitting pages into passages were not accurate enough.

5. Useful passages. Finally, one of the lab organizers labeled each passage for usefulness.

Due to time constraints, we could not split the pages and label the resulting passages for all the claims in the testing set. Thus, we only release labels for passages of pages corresponding to 33 out of the 59 testing claims. Note that this only affects Subtask C.

Table 1 summarizes the statistics about the training and the test data. Note that the passages in the test set are for 33 claims only (see above).

Table 1: Statistics about the CT19-T2 corpus for Task 2.

Set        Claims (Total / True)   Pages (Total / Useful)   Passages (Total / Useful)
Training   10 / 5                  395 / 32                 167 / 54
Test       59 / 30                 2,641 / 575              1,722 / 578

4 Evaluation

In this section, we describe the participants' approaches to the different subtasks; Table 2 summarizes them. We also present the evaluation set-up used for each subtask, and then we present and discuss the results.

Table 2: Summary of participants' approaches. Columns correspond to the runs submitted to Subtask A ([10], [12], [26]), Subtask B ([10], [11], [12], [26]), Subtask C ([10], [12]), and Subtask D, cycle 2 ([11], [12], [26]). Rows cover the representations used (BERT embeddings, word embeddings, bag of words), the models (feed-forward DNN, Naïve Bayes, Random Forest, Gradient Boosting, support vector machine, Enhanced Sequential Inference, rule-based), and the feature types (content, credibility, similarity, statistical, external data). Teams: [10] TheEarthIsFlat, [11] UPV-UMA, [12] bigIR, [26] EvolutionTeam.
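As Table 2 shows, most submissions combined bag-of-words or embedding representations with off-the-shelf classifiers such as Naïve Bayes, Random Forest, Gradient Boosting, or SVMs. As a rough illustration of this family of approaches (not any particular team's system), a page-usefulness classifier over the provided training data could be sketched as follows; the feature extraction and model choice here are our own simplifications.

```python
# Illustrative sketch of a bag-of-words + classical classifier approach to page usefulness.
# This is NOT a participant's actual system; data loading and variable names are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def train_usefulness_classifier(train_pages, train_labels):
    """train_pages: list of page texts; train_labels: 'useful' / 'not useful' labels."""
    model = make_pipeline(
        TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),   # bag-of-words features
        RandomForestClassifier(n_estimators=300, random_state=0),  # one of the classifiers in Table 2
    )
    model.fit(train_pages, train_labels)
    return model

# Usage (hypothetical variables):
# clf = train_usefulness_classifier(train_texts, train_labels)
# predictions = clf.predict(test_texts)
```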
4.1 Subtask A

Runs. Three teams participated in this subtask, submitting a total of seven runs [10,12,26]. There were two kinds of approaches. In the first kind, token-level BERT embeddings were used with a text classifier to rank the pages [10]. In the second kind, the runs used a learning-to-rank model based on different classifiers, including Naïve Bayes and Random Forest, with a variety of features [12]. In one run, external data was used to train the text classifier [10], while all other runs represent systems trained on the provided labeled data only.

Table 3: Results for Subtask 2.A, ordered by nDCG@10 score. The runs that used external data are marked with *.

Team               Run   nDCG@5   nDCG@10   nDCG@15   nDCG@20
Baseline           –     0.52     0.55      0.58      0.61
bigIR              1     0.47     0.50      0.54      0.55
bigIR              3     0.41     0.47      0.50      0.52
EvolutionTeam      1     0.40     0.45      0.48      0.51
bigIR              4     0.39     0.45      0.48      0.51
bigIR              2     0.38     0.41      0.45      0.47
TheEarthIsFlat2A   1     0.08     0.10      0.12      0.14
TheEarthIsFlat2A*  2     0.05     0.07      0.10      0.12

Evaluation measures. Subtask A was modeled as a ranking problem, in which very useful and useful pages should be ranked on top. Since this is a graded usefulness problem, we evaluate it using the mean of Normalized Discounted Cumulative Gain (nDCG) [14,16]. In particular, we consider nDCG@10 (i.e., nDCG computed at cutoff 10) as the official evaluation measure for this subtask, but we also report nDCG at cutoffs 5, 15, and 20. For all measures, we used macro-averaging over the testing claims.
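For reference, a minimal sketch of how nDCG@k can be computed from graded usefulness labels is given below. The mapping of labels to gains (e.g., very useful = 2, useful = 1, not useful / not relevant = 0) is an assumption of ours; the official scorer may use a different mapping.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount, as in [14])."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example for one claim: gains of the pages in the order a system ranked them
# (2 = very useful, 1 = useful, 0 = not useful / not relevant; an assumed mapping).
print(ndcg_at_k([1, 0, 2, 0, 1, 0, 0, 0, 0, 0], k=10))
# The official score macro-averages nDCG@10 over the test claims.
```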
Results. Table 3 shows the results for all seven runs. It also includes the results for a simple baseline: the original ranking in the search result list. We can see that, surprisingly, the baseline performs best. This is due to the fact that in our definition of usefulness, useful pages must be relevant, and Google, as an effective search engine, has managed to rank relevant pages (and consequently, many of the useful pages) first. This result indicates that the task of ranking pages by usefulness is not easy, and systems need to be further developed in order to differentiate between relevance and usefulness, while also benefiting from the relevance-based rank of a page.

4.2 Subtask B

Runs. Four teams participated in this subtask, submitting a total of eight runs [10,11,12,26]. All runs used supervised text classification models, such as Random Forest and Gradient Boosting [12]. Two teams opted for embedding-based language representations: one considered word embeddings [11] and another BERT-based token-level embeddings [10]. In one run, external data was used to train the model [10], while all the remaining runs were trained on the provided training data only.

Evaluation measures. Similarly to Subtask A, Subtask B also aims at identifying useful pages for claim verification, but it is modeled as a classification rather than a ranking problem. Thus, here we use standard evaluation measures for text classification: Precision, Recall, F1, and Accuracy, with F1 being the official score for the task.

Table 4: Results for Subtask 2.B for 2-way and 4-way classification. The runs are ranked by F1 score. Runs tagged with * used external data.

(a) 2-way classification
Team             Run   F1     P      R      Acc
Baseline         –     0.42   0.30   0.72   0.57
UPV-UMA          1     0.38   0.26   0.73   0.49
bigIR            1     0.08   0.40   0.04   0.78
bigIR            3     0.07   0.39   0.04   0.78
bigIR            4     0.07   0.57   0.04   0.78
bigIR            2     0.04   0.22   0.02   0.77
TheEarthIsFlat   1     0.00   0.00   0.00   0.78
TheEarthIsFlat*  2     0.00   0.00   0.00   0.78
EvolutionTeam    1     0.00   0.00   0.00   0.78

(b) 4-way classification
Team             Run   F1     P      R      Acc
TheEarthIsFlat   1     0.31   0.28   0.36   0.59
bigIR            3     0.31   0.37   0.33   0.58
TheEarthIsFlat*  2     0.30   0.27   0.35   0.60
bigIR            4     0.30   0.41   0.32   0.57
EvolutionTeam    1     0.29   0.26   0.33   0.58
Baseline         –     0.28   0.32   0.32   0.30
UPV-UMA          1     0.23   0.30   0.29   0.24
bigIR            1     0.16   0.25   0.23   0.26
bigIR            2     0.16   0.25   0.22   0.25

Results. Table 4a reports the results for 2-way classification (useful/very useful vs. not useful/not relevant), reporting results for predicting the useful class. Table 4b shows the results for 4-way classification (very useful vs. useful vs. not useful vs. not relevant), reporting macro-averaged scores over the four classes for each of the evaluation measures. We include a baseline based on the original ranking from the search result list. For the 2-way classification, the baseline assumes the top 50% of the results to be useful and the rest not useful. For the 4-way classification, the baseline assumes the top 25% to be very useful, the next 25% to be useful, the third 25% to be not useful, and the rest to be not relevant.

Table 4a shows that almost all systems struggled to retrieve any useful pages at all. Team UPV-UMA is the only one that managed to achieve high recall. This is probably due to the useful class being under-represented in the training dataset, while being much more frequent in the test dataset: we can see in Table 1 that it covers just 8% of the training examples, but 22% of the testing ones. Training the models with a limited number of useful pages might have caused them to learn to underpredict this class. Similarly to Subtask A, the simple baseline that assumes the top-ranked pages to be more useful is the most effective. This again can be attributed to the correlation between usefulness and relevance.

Comparing the results in Table 4a to those in Table 4b, we notice a very different performance ranking: runs that had the worst performance at finding useful pages are actually among the best runs in the 4-way classification. These runs were able to effectively detect the not relevant and not useful pages, as compared to the useful ones. The baseline, which was effective at identifying useful pages, is not as effective at identifying pages in the other classes. This might indicate that not useful and not relevant pages are not always at the bottom of the ranked list, as this baseline assumes, which sheds some light on the importance of usefulness estimation as an aid to fact-checking.

One additional factor that might have caused such a varied ranking of runs is our own observation of the difficulty and subjectivity of differentiating between useful and very useful pages. At annotation time, we observed that annotators and even lab organizers were not able to easily distinguish between these two types of pages.
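The rank-based baselines described above are straightforward to reproduce. A sketch, assuming each claim comes with its pages in the original search-result order (the function and variable names are ours, not the official baseline script):

```python
# Sketch of the rank-based baselines for Subtask B.
def two_way_baseline(ranked_page_ids):
    """Top 50% of the search results labeled useful, the rest not useful."""
    half = len(ranked_page_ids) // 2
    return {pid: ("useful" if i < half else "not useful")
            for i, pid in enumerate(ranked_page_ids)}

def four_way_baseline(ranked_page_ids):
    """Quartiles of the search results mapped to the four usefulness classes."""
    n = len(ranked_page_ids)
    labels = ["very useful", "useful", "not useful", "not relevant"]
    return {pid: labels[min(4 * i // n, 3)] for i, pid in enumerate(ranked_page_ids)}
```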
4.3 Subtask C

Runs. Two teams participated in this subtask [10,12], submitting a total of seven runs. One of the teams used text classifiers, including Naïve Bayes and SVM, with a variety of features such as bag-of-words and named entities [12]. All runs also considered the similarity between the claim and the passages as a feature in their models.

Evaluation measures. Subtask C aims at identifying useful passages for claim verification, and we modeled it as a classification problem. As in typical classification problems, we evaluated it using Precision, Recall, F1, and Accuracy, with F1 being the official evaluation measure.

Results. Table 5 shows the evaluation results, including a simple baseline that assumes the first passage in a page to be not useful, the next two passages to be useful, and the remaining passages to be not useful. This baseline is motivated by our observation that useful passages are typically located at the heart of the document, following some introductory passage(s).

Table 5: Performance of the models when predicting useful passages for Subtask 2.C. Precision, recall, and F1 are calculated with respect to the positive class, i.e., useful. The runs are ranked by F1.

Team                   Run   F1     P      R      Acc
TheEarthIsFlat2Cnoext  1     0.56   0.40   0.94   0.51
TheEarthIsFlat2Cnoext  2     0.55   0.41   0.87   0.53
bigIR                  2     0.40   0.39   0.42   0.58
bigIR                  1     0.39   0.38   0.41   0.58
bigIR                  4     0.37   0.37   0.38   0.57
Baseline               –     0.37   0.42   0.39   0.57
bigIR                  3     0.19   0.33   0.14   0.61

Team TheEarthIsFlat managed to identify most useful passages, thus achieving a very high recall (0.94 for its run 1), with precision relatively similar to that of the other runs and the baseline. Note that in all the runs by the bigIR system, as well as in the baseline system, the precision and the recall are fairly balanced. The baseline performs almost as well as the four runs by bigIR. This indicates that the position of a passage in a page might be a useful feature when predicting passage usefulness, and thus it should be considered when addressing the problem.
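The position-based baseline just described takes only a few lines to express; a sketch of the rule as we formulated it above (not the official script):

```python
# Sketch of the position-based passage baseline for Subtask C:
# the first passage of a page is labeled not useful, the next two useful, the rest not useful.
def passage_position_baseline(passages):
    """passages: the passages of one page, in document order. Returns one label per passage."""
    return ["useful" if 1 <= i <= 2 else "not useful" for i in range(len(passages))]

# Example: a page with five passages.
# passage_position_baseline(["p0", "p1", "p2", "p3", "p4"])
# -> ['not useful', 'useful', 'useful', 'not useful', 'not useful']
```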
4.4 Subtask D

The main aim of Task 2 was to study the effect of using the identified useful and very useful pages for claim verification. Thus, we had two evaluation cycles for Subtask D. In the first cycle, the teams were asked to fact-check the claims using all the Web pages, without knowing which of them were useful/very useful. In the second cycle, the usefulness labels were released in order to allow the systems to fact-check the claims using only the useful/very useful Web pages.

Runs. Two teams participated in cycle 1, submitting one run each [12,26], but one of the runs was invalid, and thus there is only one official run. Cycle 2 attracted more participation: three teams with nine runs [11,12,26]. Thus, we will focus our discussion on cycle 2. One team opted for using textual entailment with embedding-based representations for classification [11]. Another team used text classifiers such as Gradient Boosting and Random Forest [12]. External data was used to train the textual entailment component of the system in four runs, whereas the remaining runs were trained on the provided data only.

Evaluation measures. Subtask D aims at predicting a claim's veracity. It is a classification task, and thus we evaluate it using Precision, Recall, F1, and Accuracy, with F1 being the official measure.

Results. Table 6 shows the results for cycles 1 and 2, where we macro-average precision, recall, and F1 over the two classes. We show the results for a simple majority-class baseline, which all runs manage to beat in both cycles. Due to the low participation in cycle 1, it is difficult to draw conclusions about whether providing systems with useful pages helps to improve their performance.

Table 6: Results for Subtask 2.D for both cycles 1 and 2. The runs are ranked by F1 score. The runs tagged with * used external data.

(a) Cycle 1, where the usefulness of the Web pages was unknown.
Team            F1     P      R      Acc
EvolutionTeam   0.48   0.55   0.53   0.53
Baseline        0.34   0.25   0.50   0.51

(b) Cycle 2, where the usefulness of the Web pages was known.
Team            Run   F1     P      R      Acc
UPV-UMA*        21    0.62   0.63   0.63   0.63
UPV-UMA*        11    0.55   0.56   0.56   0.56
UPV-UMA*        22    0.54   0.60   0.57   0.58
bigIR           1     0.53   0.55   0.55   0.54
bigIR           3     0.53   0.55   0.54   0.54
bigIR           2     0.51   0.53   0.53   0.53
bigIR           4     0.51   0.53   0.53   0.53
UPV-UMA*        12    0.51   0.65   0.57   0.58
EvolutionTeam   1     0.43   0.45   0.46   0.46
Baseline        –     0.34   0.25   0.50   0.51
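The scores in Table 6 macro-average precision, recall, and F1 over the two classes (true and false). A minimal sketch of this computation is given below; it is our own code, not the official evaluation script.

```python
# Sketch of macro-averaged precision, recall, and F1 for the binary true/false setting.
def macro_prf(gold, predicted, classes=("true", "false")):
    """gold, predicted: lists of class labels, one per claim."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```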
5 Conclusion and Future Work

We have presented an overview of Task 2 of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims, which is the second edition of the lab. Task 2 was designed to aid a human who is fact-checking a claim. It asked systems (A) to rank Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) to classify the Web pages according to their degree of usefulness, (C) to identify useful passages from these pages, and (D) to use the useful pages to predict the claim's factuality. As part of the lab, we release a dataset in Arabic in order to enable further research in automatic claim verification.

A total of four teams participated in the task (compared to two in 2018), submitting a total of 31 runs. The evaluation results show that the most successful approaches to Task 2 used learning-to-rank for subtask A, while different classifiers were used in the other subtasks.

Although one of the aims of the lab was to study the effect of using useful pages for claim verification, the low participation in the first cycle of subtask D hindered carrying out such a study. In the future, we plan to set up this subtask so that the teams would need to participate in both cycles in order for their runs to be considered valid. We also plan to extend the dataset for Task 2 to include claims in at least one language other than Arabic.

Acknowledgments

This work was made possible in part by grant# NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. This research is also part of the Tanbih project,3 which aims to limit the effect of "fake news", propaganda and media bias by making users aware of what they are reading. The project is developed in collaboration between the Qatar Computing Research Institute (QCRI), HBKU, and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

3 http://tanbih.qcri.org/

References

1. Atanasova, P., Màrquez, L., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims, Task 1: Check-worthiness. In: CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org (2018)
2. Atanasova, P., Nakov, P., Karadzhov, G., Mohtarami, M., Da San Martino, G.: Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 1: Check-Worthiness. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
3. Ba, M.L., Berti-Equille, L., Shah, K., Hammady, H.M.: VERA: A platform for veracity estimation over web data. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 159–162. WWW '16 (2016)
4. Baly, R., Karadzhov, G., Saleh, A., Glass, J., Nakov, P.: Multi-task ordinal regression for jointly predicting the trustworthiness and the leading political ideology of news media. In: Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 2109–2116. NAACL-HLT '19, Minneapolis, MN, USA (2019)
5. Baly, R., Mohtarami, M., Glass, J., Màrquez, L., Moschitti, A., Nakov, P.: Integrating stance detection and fact checking in a unified corpus. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 21–27. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
6. Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Atanasova, P., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims, Task 2: Factuality. In: CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org (2018)
7. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web. pp. 675–684. WWW '11, Hyderabad, India (2011)
8. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: CheckThat! at CLEF 2019: Automatic identification and verification of claims. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 309–315. Springer International Publishing (2019)
9. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. LNCS, Lugano, Switzerland (2019)
10. Favano, L., Carman, M., Lanzi, P.: TheEarthIsFlat's submission to CLEF'19 CheckThat! challenge. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
11. Ghanem, B., Glavaš, G., Giachanou, A., Ponzetto, S., Rosso, P., Rangel, F.: UPV-UMA at CheckThat! Lab: Verifying Arabic claims using cross lingual approach. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
12. Haouari, F., Ali, Z., Elsayed, T.: bigIR at CLEF 2019: Automatic verification of Arabic claims over the web. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
13. Jaradat, I., Gencheva, P., Barrón-Cedeño, A., Màrquez, L., Nakov, P.: ClaimRank: Detecting check-worthy claims in Arabic and English. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 26–30. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
14. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20(4), 422–446 (2002)
15. Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B.J., Wong, K.F., Cha, M.: Detecting rumors from microblogs with recurrent neural networks. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence. pp. 3818–3824. IJCAI '16, New York, New York, USA (2016)
16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
17. Mukherjee, S., Weikum, G.: Leveraging joint interactions for credibility analysis in news communities. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. pp. 353–362. CIKM '15, Melbourne, Australia (2015)
18. Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Atanasova, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims. In: Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science, Springer (2018)
19. Nguyen, A.T., Kharosekar, A., Lease, M., Wallace, B.: An interpretable joint graphical model for fact-checking from crowds. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 1511–1518. AAAI '18, New Orleans, Louisiana, USA (2018)
20. Nie, Y., Chen, H., Bansal, M.: Combining fact extraction and verification with neural semantic matching networks. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI '19, Honolulu, Hawaii, USA (2019)
21. Popat, K., Mukherjee, S., Strötgen, J., Weikum, G.: Credibility assessment of textual claims on the web. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management. pp. 2173–2178. CIKM '16, Indianapolis, Indiana, USA (2016)
22. Popat, K., Mukherjee, S., Yates, A., Weikum, G.: DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 22–32. EMNLP '18, Brussels, Belgium (2018)
23. Rubin, V.L., Chen, Y., Conroy, N.J.: Deception detection for news: three types of fakes. In: Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community. p. 83. American Society for Information Science (2015)
24. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19(1), 22–36 (2017)
25. Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: FEVER: a large-scale dataset for Fact Extraction and VERification. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 809–819. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)
26. Touahri, I., Mazroui, A.: Automatic identification and verification of political claims. In: CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (2019)
27. Yasser, K., Kutlu, M., Elsayed, T.: Re-ranking web search results for better fact-checking: A preliminary study. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 1783–1786. CIKM '18, Turin, Italy (2018)
28. Yoneda, T., Mitchell, J., Welbl, J., Stenetorp, P., Riedel, S.: UCL machine reading group: Four factor framework for fact finding (HexaF). In: Proceedings of the First Workshop on Fact Extraction and VERification. pp. 97–102. FEVER '18, Brussels, Belgium (2018)
29. Zubiaga, A., Liakata, M., Procter, R., Hoi, G.W.S., Tolmie, P.: Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11(3), e0150989 (2016)