=Paper=
{{Paper
|id=Vol-2699/paper34
|storemode=property
|title=Too Many Claims to Fact-Check: Prioritizing Political Claims Based on Check-Worthiness
|pdfUrl=https://ceur-ws.org/Vol-2699/paper34.pdf
|volume=Vol-2699
|authors=Yavuz Selim Kartal,Mucahid Kutlu,Busra Guvenen
|dblpUrl=https://dblp.org/rec/conf/cikm/KartalKG20
}}
==Too Many Claims to Fact-Check: Prioritizing Political Claims Based on Check-Worthiness==
Yavuz Selim Kartal, Mucahid Kutlu, and Busra Guvenen
Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey
{ykartal, m.kutlu, bguvenen}@etu.edu.tr

Abstract

The massive amount of misinformation spreading on the Internet on a daily basis has enormous negative impacts on societies. Therefore, we need automated systems that help fact-checkers in the combat against misinformation. In this paper, we propose a model that prioritizes claims based on their check-worthiness. We use a BERT model with additional features including domain-specific controversial topics, word embeddings, and others. In our experiments, we show that our proposed model outperforms all state-of-the-art models on both test collections of the CLEF Check That! Lab in 2018 and 2019. We also conduct a qualitative analysis to shed light on detecting check-worthy claims. We suggest that requesting rationales behind judgments is needed to understand the subjective nature of the task and problematic labels.

1 Introduction

The World Economic Forum (WEF) ranked massive digital misinformation as one of the top global risks in 2013 (http://reports.weforum.org/global-risks-2013). Unfortunately, the foresight of the WEF seems right, as we have encountered many unpleasant incidents due to misinformation spread on the Internet since 2013, such as the gunfight caused by the "Pizzagate" fake news (www.nytimes.com/2016/12/05/business/media/comet-ping-pong-pizza-shooting-fake-news-consequences.html) and increased mistrust towards vaccines (www.washingtonpost.com/news/wonk/wp/2014/10/13/the-inevitable-rise-of-ebola-conspiracy-theories).

In order to combat misinformation and its negative outcomes, fact-checking websites (e.g., Snopes, https://www.snopes.com/) detect the veracity of claims spread over the Internet and share their findings with their readers [5]. However, fact-checking is an extremely time-consuming process, taking around one day for a single claim [12]. While these invaluable journalistic efforts help to reduce the spread of misinformation, Vosoughi et al. [22] report that false news spreads eight times faster than true news. Therefore, systems helping fact-checkers are urgently needed in the combat against misinformation.

As human fact-checkers are not able to detect the veracity of all claims spread on the Internet, it is vital that they spend their precious time fact-checking the most important claims. Therefore, an automatic system that monitors social media posts, news articles, and statements of politicians, and detects the check-worthy claims, is needed. A number of researchers have focused on this important problem (e.g., [12, 19, 13]). Furthermore, the Conference and Labs of the Evaluation Forum (CLEF) Check That! Lab (CTL) has been organizing shared tasks on detecting check-worthy claims since 2018 [18, 2, 4]. In CTL tasks, a political debate or a transcribed speech is split into sentences, and participants are asked to rank the sentences according to their priority to be fact-checked. In CTL'20 [3], tweets have also been used for this task.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland". Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi.
In this paper, we propose a ranking model that prioritizes claims based on their check-worthiness. We propose a BERT-based hybrid system in which we first fine-tune a BERT [6] model for this task, and then use its prediction together with other features we define in a logistic regression model to prioritize the claims. The features we use include word embeddings, the presence of comparative and superlative adjectives, domain-specific controversial topics, and others. Our model achieves 0.255 and 0.176 mean average precision (MAP) scores on the CTL'18 and CTL'19 datasets, respectively, outperforming all state-of-the-art models including participants of the corresponding shared tasks, ClaimBuster [12], BERT, XLNET [24], and the model of Lespagnol et al. [15]. We share our code for the reproducibility of our results (https://github.com/YSKartal/political-claims-checkworthiness).

2 Related Work

As the US presidential election in 2016 is one of the main motivating reasons for fact-checking studies, prior work mostly used debates and other speeches of US politicians as datasets (e.g., [12, 15]). Therefore, the majority of studies focused on English. The Arabic datasets used in prior work ([13, 18]) are just translations of English datasets.

ClaimBuster [12] is one of the first studies on check-worthiness. ClaimBuster is a supervised model using many features including part-of-speech (POS) tags, named entities, sentiment, and TF-IDF representations of claims. TATHYA [19] uses topics, POS tuples, entity history, and bag-of-words as features. Its topics are detected by an LDA model trained on transcripts of all presidential debates from 1976 to 2016. Gencheva et al. [8] propose a neural network model with a long list of sentence-level and contextual features including sentiment, named entities, word embeddings, topics, contradictions, and others. Jaradat et al. [13] use roughly the same features as Gencheva et al., but extend the model to Arabic. In follow-up work, Vasileva et al. [21] propose a multi-task learning model to detect whether a claim will be fact-checked by at least five (out of nine) pre-selected reputable fact-checking organizations.

CLEF has been organizing Check That! Labs (CTL) since 2018. Seven teams participated in the check-worthiness task of CTL'18. The participating teams used various learning models such as recurrent neural networks (RNN) [10], multilayer perceptrons [26], random forests (RF) [1], k-nearest neighbors (kNN) [9], and Support Vector Machines (SVM) [25], with different sets of features such as bag-of-words [26], character n-grams [9], POS tags [26, 10, 25], verbal forms [26], named entities [26, 25], syntactic dependencies [26, 10], and word embeddings [26, 10, 25]. On the English dataset, the Prise de Fer team [26] achieved the best MAP scores, using almost every feature mentioned before with SVM-multilayer perceptron learning.

In 2019, 11 teams participated in the check-worthiness task of CTL'19. Participants used varying models such as LSTM, SVM, naive Bayes, and logistic regression (LR), with many features including the readability of sentences and their context [2]. The Copenhagen team [11] achieved the best overall performance using syntactic dependencies and word embeddings with a weakly supervised LSTM model.

Lespagnol et al. [15] investigated various learning models such as SVM, LR, and random forests, with a long list of features including word embeddings, POS tags, syntactic dependency tags, entities, and "information nutritional" features which represent the factuality, emotion, controversy, credibility, and technicality of statements. In our experiments we show that our model outperforms Lespagnol et al. on both test collections.

Our proposed approach differs from existing studies as follows. 1) We propose a BERT-based hybrid model which uses the fine-tuned BERT's output together with many other features. 2) As the topic might be a strong indicator of check-worthiness, many studies used various types of topics, such as general topics [25], globally controversial topics [15], and topics discussed in old US presidential debates [19]. However, we believe that the check-worthiness of a claim depends on local and present controversial topics. Thus, we use a list of hand-crafted controversial topics related to US elections. 3) We also use two additional sets of features: a hand-crafted list of words, and the presence of comparative and superlative adjectives and adverbs.
3 Proposed Approach

We propose a supervised model with a number of features described below. We investigate various learning models including LR, SVM, random forest, MART [7], and LambdaMART [23]. Now we explain the features we use.

BERT: We first fine-tune BERT using the respective training data. Next, we use its prediction value as one of our features.

Word Embeddings (WE): Words that are semantically and syntactically similar tend to be close in the embedding space, allowing us to capture similarities between claims. We represent a sentence as the average vector of its words, excluding the out-of-vocabulary ones. Word embedding vectors are extracted from the pre-trained word2vec model [17], which has a feature vector size of 300.
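As a concrete illustration of the WE feature, the following is a minimal sketch assuming the pre-trained 300-dimensional Google News word2vec vectors loaded with gensim; helper names such as sentence_vector are ours, not taken from the authors' code.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed setup: pre-trained 300-dimensional word2vec vectors [17].
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(tokens, dim=300):
    """Average the word2vec vectors of a sentence, skipping OOV tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    if not vecs:                 # sentence with no in-vocabulary word
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```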
Controversial Topics (CT): Sentences about controversial topics might include check-worthy claims. Lespagnol et al. [15] use a list of controversial issues compiled from the Wikipedia article "Wikipedia:List of controversial issues". However, the list they use covers many controversial issues with very limited coverage in current US media, such as "Lebanon", "Chernobyl", and "Spanish Civil War", while the data we use are about recent US politics.

We believe that the controversy of a topic depends on the society. For instance, US politicians propose different policies for immigrants, yielding heated discussions among them and their supporters. On the other hand, US domestic politics is much less interested in the refugee crisis in the Mediterranean Sea than European countries are. Therefore, a claim about Mexican immigrants might be check-worthy for people living in the US, while they might find claims about refugees taking a dangerous path to reach Europe not check-worthy. In contrast, people living in Europe might consider the latter case check-worthy and the former one not. In addition, the controversy of a topic might change over time. For instance, the Cold War (which also appears in that Wikipedia list) might have been one of the most discussed topics in US politics before the collapse of the Soviet Union in 1991; nowadays, however, it is rarely covered by US media. Therefore, we propose using controversial issues related to the data we use, instead of every controversial issue around the globe and throughout history.

Firstly, we identified 11 major topics in current US politics: immigration, gun policy, racism, education, Islam, climate change, health policy, abortion, LGBT, terror, and the wars in Afghanistan and Iraq. For each topic, we identified related words and calculated their average word embedding vector. For instance, for the immigration topic, we used the words "immigrants", "illegal", "borders", "Mexican", "Latino", and "Hispanic". In this feature set of size 11, we calculate the cosine similarity between each sentence and each topic using their vector representations. We use the average of word embeddings for sentences, excluding stopwords with NLTK [16].
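Below is a minimal sketch of the CT feature, reusing the sentence_vector helper from the previous sketch; only the immigration word list is quoted from the paper, and the remaining topic lists are placeholders.

```python
# Seed words per topic; only "immigration" is quoted from the paper,
# the other 10 topic lists would be filled in analogously.
TOPICS = {
    "immigration": ["immigrants", "illegal", "borders", "Mexican", "Latino", "Hispanic"],
    # "gun policy": [...], "racism": [...], etc.
}

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# One centroid per topic: the average embedding of its seed words.
centroids = {name: sentence_vector(words) for name, words in TOPICS.items()}

def ct_features(tokens):
    """11-dimensional vector of sentence-topic cosine similarities."""
    sv = sentence_vector(tokens)     # stopwords assumed already removed
    return [cosine(sv, centroids[name]) for name in sorted(centroids)]
```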
Comparative & Superlative (CS): Politicians frequently use sentences comparing themselves with others, because each candidate tries to convince the public that s/he is better than his/her opponent. Therefore, the comparisons in political speeches might impact people's voting decisions and, thereby, it might be important to check their veracity. Thus, in this feature, we use the number of comparative and superlative adjectives and adverbs in sentences.

Handcrafted Word List (HW): Particular words convey important information about check-worthiness because 1) a word might be related to an important topic (e.g., "unemployment"), 2) it might represent a numerical value, increasing the factuality of the sentence (e.g., "percent"), and 3) its semantics might represent a comparison between two cases (e.g., "increase" and "decrease"). Thus, we first identified 66 such words by analyzing the training datasets of CTL'18 and CTL'19. In this feature, we check whether there is an overlap between the lemmas of the selected words and the lemmas of the words in the respective sentence.

Verb Tense (VT): We cannot detect the veracity of claims about the future; we can only verify claims about the present or past. Thus, the verb tense of a sentence might be an effective indicator of the check-worthiness of claims. This feature vector represents the existence or absence of each tense in the predicate of the claim.

Part-of-speech (POS) Tags: If a sentence does not contain any informative words, then it is less likely to be check-worthy. To represent the information load of a claim, we use the numbers of nouns, verbs, adverbs, and adjectives, separately.
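The CS, HW, and POS features reduce to simple counts over a parsed sentence. Below is a minimal sketch using spaCy (the library the paper uses for syntactic analysis); the Penn Treebank tag set for comparatives/superlatives is our assumption, and the HW list is abbreviated to a few illustrative entries out of the 66.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Excerpt only; the paper's hand-crafted list contains 66 words.
HW_LEMMAS = {"unemployment", "percent", "increase", "decrease"}

def surface_features(sentence):
    doc = nlp(sentence)
    # CS: count of comparative/superlative adjectives and adverbs
    # (Penn Treebank tags JJR, JJS, RBR, RBS).
    cs = sum(tok.tag_ in {"JJR", "JJS", "RBR", "RBS"} for tok in doc)
    # HW: does any lemma overlap with the hand-crafted word list?
    hw = int(any(tok.lemma_.lower() in HW_LEMMAS for tok in doc))
    # POS: separate counts of nouns, verbs, adverbs, and adjectives.
    pos = [sum(tok.pos_ == p for tok in doc) for p in ("NOUN", "VERB", "ADV", "ADJ")]
    return [cs, hw] + pos
```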
4 Experiments

4.1 Experimental Setup

Implementation: We use the ktrain library (https://pypi.org/project/ktrain/) to fine-tune the BERT model with the 1cycle learning rate policy and a maximum learning rate of 2e-5 [20]. We use SpaCy (https://spacy.io/) for all syntactic and semantic analyses. We use the Scikit toolkit (https://scikit-learn.org) for the implementations of SVM, Random Forest (RF), and LR. The parameter settings of the learning algorithms are as follows. We use default parameters for SVM. We set the number of trees to 50 and the maximum depth to 5 for RF. We use the multinomial and lbfgs settings for LR. For the MART and LambdaMART models, we use the RankLib library (https://sourceforge.net/p/lemur/wiki/RankLib/), and set the numbers of trees and leaves to 50 and 2, respectively.
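In scikit-learn terms, the stated settings correspond to roughly the following configuration. This is a sketch under our reading of the setup, not the authors' code; ranking claims by the positive-class probability is our assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "LR":  LogisticRegression(multi_class="multinomial", solver="lbfgs"),
    "RF":  RandomForestClassifier(n_estimators=50, max_depth=5),
    "SVM": SVC(probability=True),  # otherwise default parameters
}

# Claims would then be ranked by predicted check-worthiness probability:
# lr = models["LR"].fit(X_train, y_train)
# scores = lr.predict_proba(X_test)[:, 1]
```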
Data: We evaluate the performance of our system with the two datasets used in CTL'18 and CTL'19. Details about them are given in Table 1. CTL'18 consists of transcripts of debates and speeches, while CTL'19 also contains press conferences and posts.

Table 1: Details about the CTL'18 and CTL'19 datasets.

                       CTL'18      CTL'19
Train  # Docs          3           19
       # Sentences     4,064       16,421
       # CW Claims     90 (2.2%)   433 (2.6%)
Test   # Docs          7           7
       # Sentences     4,882       7,079
       # CW Claims     192 (3.9%)  110 (1.6%)

Baselines: We compare our model against the following models.

• Lespagnol et al. [15]: Lespagnol et al. report the best results on CTL'18 so far. Therefore, we use their model as one of our baselines. In order to get its results for CTL'19, we contacted the authors. They provided us with the values of the "information nutrition" features and instructions on how to generate the WE embeddings. We implemented their method using the values they shared and following their instructions. (It is noteworthy that we obtain a 0.2115 MAP score on CTL'18 with our implementation of their method, while they report a 0.23 MAP score in their paper. We are not aware of any bug in our code, but the performance difference might be due to different versions of the same library. Nevertheless, the results we present for their method on CTL'19 should be taken with a grain of salt.)

• ClaimBuster: We use the popular pretrained ClaimBuster API (https://idir.uta.edu/claimbuster/) [12], which is trained on a dataset covering different debates that do not exist in CTL'18 and CTL'19.

• BERT: As it has been reported that BERT-based models outperform state-of-the-art models in various NLP tasks, we compare our model against using only BERT. We fine-tune the BERT model using the respective training dataset and predict the check-worthiness of claims using the fine-tuned model.

• XLNET: It has been reported that XLNet outperforms BERT in various NLP tasks [24]. Thus, we use XLNet for this task by fine-tuning it with the respective training dataset.

• Best of CTL'18 and CTL'19: For each dataset, we also report the performance of the best systems that participated in the shared tasks, i.e., the Prise de Fer team [26] for CTL'18 and the Copenhagen team [11] for CTL'19.

Training & Testing: We use the same setup as CTL'18 and CTL'19 to maintain a fair comparison with the baselines. We follow the evaluation method used in CTL'18 and CTL'19: we calculate average precision (AP), R-precision (RP), precision@5 (P@5), and precision@10 (P@10) for each file (i.e., debate, speech) and then report the average performance.
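For clarity, here is a minimal sketch of that per-file evaluation protocol as we read it (not the official CTL scorer): AP is computed over each file's ranked sentences and then averaged across files into MAP.

```python
def average_precision(ranked_labels):
    """ranked_labels: binary check-worthiness labels of one file's
    sentences, sorted by the model's predicted score (descending)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(per_file_labels):
    """MAP over the files (debates, speeches) of a test collection."""
    aps = [average_precision(labels) for labels in per_file_labels]
    return sum(aps) / len(aps)
```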
4.2 Experimental Results

In this section, we present experimental results on the test data using different sets of features and varying learning algorithms.

Comparison of Learning Algorithms. In our first set of experiments, we evaluate logistic regression (LR), SVM, random forest (RF), MART, and LambdaMART models using all features defined in Section 3. Table 2 shows the MAP scores of each model. Interestingly, LR outperforms all other models. In a similar experiment conducted by Lespagnol et al. [15], they also report that LR yields higher results than the other models they used. Therefore, we use LR in our following experiments.

Table 2: MAP scores for varying models using all features.

Learning Model   CTL'18   CTL'19
LR               .2303    .1775
RF               .1468    .1542
SVM              .1716    .1346
MART             .1764    .1732
LambdaMART       .0671    .0564

Feature Ablation. In order to analyze the effectiveness of the features we use, we apply two techniques: 1) a leave-one-out methodology, in which we exclude one feature group and calculate the model's performance without it, and 2) a use-only-one methodology, in which only a single feature group is used for prediction. The results are shown in Table 3.

Table 3: MAP scores for varying feature sets.

Leave-One-Out                      Use-Only-One
Features   CTL'18   CTL'19         Features   CTL'18   CTL'19
All        .2303    .1775
All-CS     .2239    .1765          CS         .0751    .0604
All-BERT   .2211    .1580          BERT       .1850    .1701
All-VT     .2547    .1761          VT         .1007    .0598
All-HW     .2126    .1727          HW         .1530    .1043
All-WE     .1756    .1786          WE         .2068    .1356
All-CT     .2170    .1739          CT         .1363    .1046
All-POS    .2283    .1767          POS        .1048    .0631

From the results in Table 3, we see that the features have different effects on each dataset. BERT is the most effective feature on CTL'19. However, in contrast to our expectations, WE appears to be a more effective feature than BERT on CTL'18. On CTL'18, the performance decreases by nearly 25% when WE is excluded. In addition, we achieve the highest MAP score when we use only WE. On CTL'19, we achieve a 0.1356 MAP score using only WE, showing that it is more effective than the other features except BERT. However, the performance of our model increases when we exclude WE (0.1775 vs. 0.1786 in Table 3), suggesting that the information it contributes is covered by other features on CTL'19.

Excluding the hand-crafted word list (HW) features causes a performance decrease on both test collections. In addition, using only HW features outperforms all participants of CTL'18 (0.1530 vs. 0.1332 in Table 3). These promising results suggest that expanding this list might lead to further performance increases.

Our results also suggest that Controversial Topics (CT) are effective features. Excluding them decreases the performance of the model on both collections, while using only CT features yields high scores, slightly outperforming the best performing system on CTL'18 (0.1363 vs. 0.1332 in Table 3).

Excluding CS and POS features also slightly decreases the performance of the model on both test collections. Regarding the verb tense features, our results are mixed: excluding them causes a slight performance decrease on CTL'19, but yields a higher performance score on CTL'18.

Comparison Against Baselines. We pick the model that includes all features except VT as our primary model, because it achieves the highest MAP score on average. We compare our primary model with the baselines. The results are presented in Table 4.

Table 4: Comparison with competing models. The * sign indicates results obtained from our implementation of the respective competing model.

                        CTL'18                           CTL'19
Model                   MAP    RP     P@5    P@10        MAP     RP      P@5     P@10
BERT                    .1850  .2218  .3142  .2857       .1701   .1945   .2571   .2429
XLNET                   .1974  .2393  .2857  .2571       .0932   .0770   .1429   .1143
Lespagnol et al. [15]   .230   .254   .314   .2857*      .1292*  .1347*  .1714*  .2000*
Prise de Fer Team       .1332  .1352  .2000  .1429       -       -       -       -
Copenhagen Team         -      -      -      -           .1660   .4176   .2571   .2286
ClaimBuster             .2003  .2162  .2571  .2429       .1329   .1555   .1714   .2000
Our Model               .2547  .2579  .4000  .3429       .1761   .2028   .2571   .2143

Our proposed model outperforms all other models on all evaluation metrics on CTL'18. On CTL'19, our proposed model achieves the highest MAP score, which is the official metric used in CTL. The BERT model outperforms the other models on P@10 on CTL'19. Regarding the P@5 metric, our model, BERT, and the Copenhagen team achieve the same highest score of 0.2571. Regarding RP, the Copenhagen team achieves the highest score. Overall, our model outperforms all other models on the official evaluation metric of CTL, while BERT and the Copenhagen team [11] also achieve comparable performance on CTL'19.

5 Qualitative Analysis

In this section, we present our qualitative analysis of the output of our primary model. For each input file, we rank the claims based on their check-worthiness and then detect the not-check-worthy claim with the highest rank. Table 5 shows these not-check-worthy statements for each file, with our system's ranking and the speaker of the statement.

Table 5: Highest ranked non-check-worthy statements from each test document by our primary model.

Row | Rank | File Name | Speaker | Statement
1 | 4 | task1-en-file1 | CLINTON | The plan he has will cost us jobs and possibly lead to another Great Recession.
2 | 1 | task1-en-file2 | CLINTON | Then he doubled down on that in the New York Daily News interview, when asked whether he would support the Sandy Hook parents suing to try to do something to rein in the advertising of the AR-15, which is advertised to young people as being a combat weapon, killing on the battlefield.
3 | 1 | task1-en-file3 | TRUMP | Jobs, jobs, jobs.
4 | 2 | task1-en-file4 | TRUMP | Before that, Democrat President John F. Kennedy championed tax cuts that surged the economy and massively reduced unemployment.
5 | 3 | task1-en-file5 | TRUMP | The world's largest company, Apple, announced plans to bring $245 billion in overseas profits home to America.
6 | 1 | task1-en-file6 | TRUMP | America has lost nearly one-third of its manufacturing jobs since 1997, following the enactment of disastrous trade deals supported by Bill and Hillary Clinton.
7 | 1 | task1-en-file7 | TRUMP | Our trade deficit in goods with the world last year was nearly $800 billion dollars.
8 | 1 | 20151219 3 dem | O'MALLEY | We increased education funding by 37 percent.
9 | 1 | 20160129 7 gop | KASICH | We're up 400,000 jobs.
10 | 1 | 20160311 12 gop | TAPPER | Critics say these deals are great for corporate America's bottom line, but have cost the U.S. at least 1 million jobs.
11 | 3 | 20180131 state union | TRUMP | Unemployment claims have hit a 45-year low.
12 | 1 | 20181015 60 min | TRUMP | –if you think about it, so far, I put 25% tariffs on steel dumping, and aluminum dumping 10%.
13 | 3 | 20190205 trump state | TRUMP | Unemployment for Americans with disabilities has also reached an all-time low.
14 | 1 | 20190215 trump emergency | TRUMP | They have the largest number of murders that they've ever had in their history - almost 40,000 murders.

The statement in Row 1 is a claim about the future. Our model with the verb tense feature could rank this statement lower, but our primary model does not use verb tense features because they yield lower performance on average. In Row 2, the statement is very complex with many relative clauses, perhaps decreasing the performance of the BERT model and WE features in representing the statement. In Row 3, our model makes an obvious mistake and ranks a statement which does not even have a predicate at a very high rank. Perhaps our model falls short here because the word "jobs" indicates that the statement is about unemployment, which is one of the controversial topics we defined.

As reported by Vasileva et al. [21], fact-checking organizations investigate different claims with very minimal overlap between the selected claims. We observe this subjective nature of annotations in Rows 4-14, because all of these statements are actually factual claims and some of them might also be considered check-worthy. For instance, the statements in Rows 8, 11, and 13 are clearly said to change people's voting decisions. In addition, almost all statements are about economics, which is an important factor in people's votes. Therefore, checking their veracity might also be important in order not to misinform the public. Nevertheless, these examples show the subjective nature of check-worthiness annotations.
In addition to subjective judgments, we also noticed inconsistencies within the annotations. For instance, the statement in Row 9 ("We're up 400,000 jobs") also exists in the "20160311 12 gop" file, but is annotated there as "check-worthy". In addition, there exist semantically very similar statements with different labels. For instance, Donald Trump's statement "I did not support the war in Iraq" in line 1079 of the 20160926 1pres file is labeled as "not-check-worthy", while his statement "I was against the war in Iraq" in line 1086 of the same file is labeled as "check-worthy". Both statements have similar meanings and occur in the same context (i.e., their positions in the file are very close). Therefore, both might be expected to have the same label. As a counter-argument, "being against" suggests an action while "not supporting" does not require any action to be taken. Thus, different annotations for similar statements might again be due to the subjective nature of check-worthiness judgments.

Furthermore, there are also annotations whose labels we strongly disagree with. For instance, in the 20170315 nashville file (training data of CTL'19), Donald Trump's statement "We're going to put our auto industry back to work" is labeled as check-worthy. However, the statement is about the future and cannot be verified.

Overall, our qualitative analysis suggests that annotating the check-worthiness of claims is a subjective task and that the annotations might be noisy. Kutlu et al. [14] show that using text excerpts within documents as rationales helps in understanding disagreements in relevance judging. Similarly, we might request rationales behind check-worthiness annotations to understand whether a label is due to a human judging error or to the subjective nature of the annotation task. Furthermore, rationales behind these annotations might help us develop effective solutions for this challenging problem.

6 Conclusion

In this paper, we presented a supervised method which prioritizes claims based on check-worthiness. We use a logistic regression classifier with features including the state-of-the-art language model BERT, domain-specific controversial topics, pretrained word embeddings, a handcrafted word list, POS tags, and comparative-superlative clauses. In our experiments on CTL'18 and CTL'19, we show that our proposed model outperforms all state-of-the-art models on both collections. We show that BERT's performance can be increased by using additional features for this task. In our feature ablation study, the BERT model and word embeddings appear to be the most effective features, while the handcrafted word list and domain-specific controversial topics also seem effective. Based on our qualitative analysis, we believe that requesting rationales for the check-worthiness annotations would further help in developing effective systems.

In the future, we plan to work on weak supervision techniques to extend the training dataset. With the increased data, we will be able to explore using deep learning techniques for this task. In addition, we plan to extend our study to detect check-worthy claims on social media platforms, because these are the channels through which most people are affected by misinformation. Moreover, working on different languages and building a multilingual model is an important research direction in the combat against misinformation.

References

[1] R. Agez, C. Bosc, C. Lespagnol, N. Petitcol, and J. Mothe. IRIT at CheckThat! 2018. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.

[2] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, and G. Da San Martino. Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness. In CEUR Workshop Proceedings, 2019.

[3] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, and Z. S. Ali. Overview of CheckThat! 2020: Automatic identification and verification of claims in social media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 215-236, Cham, 2020. Springer International Publishing.

[4] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. D. S. Martino, M. Hasanain, R. Suwaileh, and F. Haouari. CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media. Advances in Information Retrieval, 12036:499-507, 2020.

[5] F. Cherubini and L. Graves. The rise of fact-checking sites in Europe. Reuters Institute for the Study of Journalism, University of Oxford, 2016.

[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, 2019.

[7] J. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.

[8] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, and I. Koychev. A context-aware approach for detecting worth-checking claims in political debates. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 267-276, 2017.

[9] B. Ghanem, M. Montes-y-Gómez, F. M. R. Pardo, and P. Rosso. UPV-INAOE - Check That: Preliminary approach for checking worthiness of claims. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.

[10] C. Hansen, C. Hansen, J. G. Simonsen, and C. Lioma. The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab. In CLEF, 2018.

[11] C. Hansen, C. Hansen, J. G. Simonsen, and C. Lioma. Neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, 2019.

[12] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, and M. Tremayne. ClaimBuster: The first-ever end-to-end fact-checking system. PVLDB, 10:1945-1948, 2017.
[13] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, and P. Nakov. ClaimRank: Detecting check-worthy claims in Arabic and English. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 26-30, 2018.

[14] M. Kutlu, T. McDonnell, Y. Barkallah, T. Elsayed, and M. Lease. Crowd vs. expert: What can relevance judgment rationales teach us about assessor disagreement? In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 805-814. ACM, 2018.

[15] C. Lespagnol, J. Mothe, and M. Z. Ullah. Information nutritional label and word embedding to estimate information check-worthiness. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 941-944. ACM, 2019.

[16] E. Loper and S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, 2002.

[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[18] P. Nakov, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. Màrquez, W. Zaghouani, P. Atanasova, S. Kyuchukov, and G. Da San Martino. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 372-387, 2018.

[19] A. Patwari, D. Goldwasser, and S. Bagchi. TATHYA: A multi-classifier system for detecting check-worthy statements in political debates. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2259-2262. ACM, 2017.

[20] L. N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. arXiv, abs/1803.09820, 2018.

[21] S. Vasileva, P. Atanasova, L. Màrquez, A. Barrón-Cedeño, and P. Nakov. It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2019.

[22] S. Vosoughi, D. Roy, and S. Aral. The spread of true and false news online. Science, 359(6380):1146-1151, 2018.

[23] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254-270, June 2010.

[24] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754-5764, 2019.

[25] K. Yasser, M. Kutlu, and T. Elsayed. bigIR at CLEF 2018: Detection and verification of check-worthy political claims. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, 2018.

[26] C. Zuo, A. Karakas, and R. Banerjee. A hybrid recognition system for check-worthy claims using heuristics and supervised learning. In CLEF, 2018.