Real-time Claim Detection from News Articles and Retrieval of Semantically-Similar Factchecks Ben Adler Giacomo Boscaini-Gilroy London UK London UK ben@thelogically.co.uk giacomo@logically.co.uk Logically the truth, they make those in positions of power ac- countable. This is a result of labour intensive work Abstract that involves monitoring the news for spurious claims and carrying out rigorous research to judge credibility. Factchecking has always been a part of the So far, it has only been possible to scale their output journalistic process. However with newsroom upwards by hiring more personnel. This is problem- budgets shrinking [Pew16] it is coming un- atic because newsrooms need significant resources to der increasing pressure just as the amount employ factcheckers. Publication budgets have been of false information circulating is on the rise decreasing, resulting in a steady decline in the size of [MAGM18]. We therefore propose a method their workforce [Pew16]. Factchecking is not a directly to increase the efficiency of the factcheck- profitable activity, which negatively affects the alloca- ing process, using the latest developments tion of resources towards it in for-profit organisations. in Natural Language Processing (NLP). This It is often taken on by charities and philanthropists method allows us to compare incoming claims instead. to an existing corpus and return similar, factchecked, claims in a live system—allowing factcheckers to work simultaneously without duplicating their work. To compensate for this shortfall, our strategy is to harness the latest developments in NLP to make 1 Introduction factchecking more efficient and therefore less costly. To this end, the new field of automated factcheck- In recent years, the spread of misinformation has be- ing has captured the imagination of both non-profits come a growing concern for researchers and the pub- and start-ups [Gra18, BM16, TV18]. It aims to speed lic at large [MAGM18]. Researchers at MIT found up certain aspects of the factchecking process rather that social media users are more likely to share false than create AI that can replace factchecking person- information than true information [VRA18]. Due to nel. This includes monitoring claims that are made renewed focus on finding ways to foster healthy polit- in the news, aiding decisions about which statements ical conversation, the profile of factcheckers has been are the most important to check and automatically re- raised. trieving existing factchecks that are relevant to a new Factcheckers positively influence public debate by claim. publishing good quality information and asking politi- cians and journalists to retract misleading or false statements. By calling out lies and the blurring of The claim detection and claim clustering methods Copyright c 2019 for the individual papers by the papers’ au- that we set out in this paper can be applied to each of thors. Copying permitted for private and academic purposes. these. We sought to devise a system that would auto- This volume is published and copyrighted by its editors. matically detect claims in articles and compare them In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the to previously submitted claims. Storing the results to NewsIR’19 Workshop at SIGIR, Paris, France, 25-July-2019, allow a factchecker’s work on one of these claims to be published at http://ceur-ws.org easily transferred to others in the same cluster. 2 Claim Detection Table 1: Examples of claims taken from real articles. 2.1 Related Work Sentence Claim? It is important to decide what sentences are claims be- In its 2015 order, the NGT had banned Yes fore attempting to cluster them. The first such claim the plying of petrol vehicles older than detection system to have been created is ClaimBuster 15 years and diesel vehicles older than [HNS+ 17], which scores sentences with an SVM to 10 years in the National Capital Region determine how likely they are to be politically per- (NCR). tinent statements. Similarly, ClaimRank [JGBC+ 18] In my view, farmers should not just No uses real claims checked by factchecking institutions rely on agriculture but also adopt as training data in order to surface sentences that are dairy farming. worthy of factchecking. These methods deal with the question of what is a a simple classifier such as logistic regression, which is politically interesting claim. In order to classify the what we used. They used Facebook’s sentence embed- objective qualities of what set apart different types dings, InferSent [CKS+ 17], which was a recent break- of claims, the ClaimBuster team created PolitiTax through at the time. Such is the speed of new devel- [Car18], a taxonomy of claims, and factchecking organ- opment in the field that since then, several papers de- isation Full Fact [KPBZ18] developed their preferred scribing textual embeddings have been published. Due annotation schema for statements in consultation with to the fact that we had already evaluated embeddings their own factcheckers. This research provides a more for clustering, and therefore knew our system would solid framework within which to construct claim de- rely on Google USE Large [CYK+ 18], we decided to tection classifiers. use this instead. We compared this to TFIDF and Full The above considers whether or not a sentence Fact’s results as baselines. The results are displayed is a claim, but often claims are subsections of sen- in Table 2. tences and multiple claims might be found in one However, ClaimBuster and Full Fact focused on live sentence. In order to accommodate this, [LGS+ 17] factchecking of TV debates. Logically is a news ag- proposes extracting phrases called Context Dependent gregator and we analyse the bodies of published news Claims (CDC) that are relevant to a certain ‘Topic’. stories. We found that in our corpus, the majority of Along these lines, [AJC+ 19] proposes new definitions sentences are claims and therefore our model needed for frames to be incorporated into FrameNet [BFL98] to be as selective as possible. In practice, we choose that are specific to facts, in particular those found in to filter out sentences that are predictions since gener- a political context. ally the substance of the claim cannot be fully checked until after the event has occurred. Likewise, we try to 2.2 Method remove claims based on personal experience or anec- dotal evidence as they are difficult to verify. It is much easier to build a dataset and reliably eval- uate a model if the starting definitions are clear and Table 2: Claim Detection Results. objective. Questions around what is an interesting or Embedding Method P R F1 pertinent claim are inherently subjective. For exam- Google USE Large 0.90 0.89 0.89 ple, it is obvious that a politician will judge their oppo- [CYK+ 18] nents’ claims to be more important to factcheck than Full Fact (not on 0.88 0.80 0.83 their own. the same data) [KPBZ18] Therefore, we built on the methodologies that dealt TFIDF (Baseline) 0.84 0.84 0.84 with the objective qualities of claims, which were the [Jon72] PolitiTax and Full Fact taxonomies. We annotated sentences from our own database of news articles based on a combination of these. We also used the Full Fact definition of a claim as a statement about the world 3 Claim Clustering that can be checked. Some examples of claims accord- ing to this definition are shown in Table 1. We decided 3.1 Related Work the first statement was a claim since it declares the oc- Traditional text clustering methods, using TFIDF and currence of an event, while the second was considered some clustering algorithm, are poorly suited to the not to be a claim as it is an expression of feeling. problem of clustering and comparing short texts, as Full Fact’s approach centred around using sentence they can be semantically very similar but use dif- embeddings as a feature engineering step, followed by ferent words. This is a manifestation of the the data sparsity problem with Bag-of-Words (BoW) mod- unlikely ever to approach such scale as they require els. [SR15]. Dimensionality reduction methods such human annotations which can be expensive to assem- as Latent Dirichlet Allocation (LDA) can help solve ble. The SNLI entailment dataset is an example of this problem by giving a dense approximation of this a large open source dataset [BAPM15]. It features sparse representation [BNJ03]. More recently, efforts pairs of sentences along with labels specifying whether in this area have used text embedding-based systems or not one entails the other. Google’s Universal Sen- in order to capture dense representation of the texts tence Encoder (USE) [CYK+ 18] is a sentence embed- [WXX+ 15]. Much of this recent work has relied on the ding created with a hybrid supervised/unsupervised increase of focus in word and text embeddings. Text method, leveraging both the vast amounts of unsuper- embeddings have been an increasingly popular tool in vised training data and the extra detail that can be NLP since the introduction of Word2Vec [MCCD13], derived from a supervised method. The SNLI dataset and since then the number of different embeddings has and the related MultiNLI dataset are often used for exploded. While many focus on giving a vector repre- this because textual entailment is seen as a good basis sentation of a word, an increasing number now exist for general Natural Language Understanding (NLU) that will give a vector representation of a entire sen- [WNB18]. tence or text. Following on from this work, we seek to devise a system that can run online, performing text 3.2 Choosing an embedding clustering on the embeddings of texts one at a time In order to choose an embedding, we sought a dataset 3.1.1 Text Embeddings to represent our problem. Although no perfect matches exist, we decided upon the Quora duplicate Some considerations to bear in mind when deciding question dataset [SIC17] as the best match. To study on an embedding scheme to use are: the size of the the embeddings, we computed the euclidean distance final vector, the complexity of the model itself and, if between the two questions using various embeddings, using a pretrained implementation, the data the model to study the distance between semantically similar and has been trained on and whether it is trained in a dissimilar questions. supervised or unsupervised manner. The size of the embedding can have numerous re- sults downstream. In our example we will be doing dis- tance calculations on the resultant vectors and there- fore any increase in length will increase the complex- ity of those distance calculations. We would therefore like as short a vector as possible, but we still wish to capture all salient information about the claim; longer vectors have more capacity to store information, both salient and non-salient. A similar effect is seen for the complexity of the model. A more complicated model, with more train- able parameters, may be able to capture finer details about the text, but it will require a larger corpus to achieve this, and will require more computational time to calculate the embeddings. We should therefore at- tempt to find the simplest embedding system that can accurately solve our problem. When attempting to use pretrained models to help in other areas, it is always important to ensure that the models you are using are trained on similar ma- terial, to increase the chance that their findings will generalise to the new problem. Many unsupervised text embeddings are trained on the CommonCrawl 1 dataset of approx. 840 billion tokens. This gives a huge amount of data across many domains, but re- quires a similarly huge amount of computing power to train on the entire dataset. Supervised datasets are Figure 1: Analysis of Different Embeddings on the Quora Question Answering Dataset 1 CommonCrawl found at http://commoncrawl.org/ Table 3: Comparing Sentence Embeddings for Clustering News Claims. Embedding Time Number Number Percentage of Percentage of method taken (s) of claims of clusters claims in claims in clusters clustered majority clusters of one story Elmo [PNI+ 18] 122.87 156 21 57.05% 3.84% Googe USE [CYK+ 18] 117.16 926 46 57.95% 4.21% Google USE Large [CYK+ 18] 95.06 726 63 60.74% 7.02% Infersent [CKS+ 17] 623.00 260 34 63.08% 10.0% TFIDF (Baseline) [Jon72] 25.97 533 58 62.85% 7.12% The graphs in figure 1 show the distances between a check that the findings we obtained from the Quora duplicate and non-duplicate questions using different dataset will generalise to our domain. We ran code embedding systems. The X axis shows the euclidean which vectorized 2,000 sentences and then used the distance between vectors and the Y axis frequency. A DBScan clustering method [EKSX96] to cluster using perfect result would be a blue peak to the left and an a grid search to find the best  value, maximizing this entirely disconnected orange spike to the right, show- formula. We used DBScan as it mirrored the cluster- ing that all non-duplicate questions have a greater eu- ing method used to derive the original article clusters. clidean distance than the least similar duplicate pair of The results for this experiment can be found in Ta- questions. As can be clearly seen in the figure above, ble 3. We included TFIDF in the experiment as a Elmo [PNI+ 18] and Infersent [CKS+ 17] show almost baseline to judge other results. It is not suitable for no separation and therefore cannot be considered good our eventual purposes, but it the basis of the origi- models for this problem. A much greater disparity is nal keyword-based model used to build the clusters 2 . shown by the Google USE models [CYK+ 18], and even That being said, TFIDF performs very well, with only more for the Google USE Large model. In fact the Google USE Large and Infersent coming close in terms Google USE Large achieved a F1 score of 0.71 for this of ‘accuracy’. In the case of Infersent, this comes with task without any specific training, simply by choosing the penalty of a much smaller number of claims in- a threshold below which all sentence pairs are consid- cluded in the clusters. Google USE Large, however, ered duplicates. clusters a greater number and for this reason we chose In order to test whether these results generalised to to use Google’s USE Large. 3 our domain, we devised a test that would make use Since Google USE Large was the best-performing of what little data we had to evaluate. We had no embedding in both the tests we devised, this was our original data on whether sentences were semantically chosen embedding to use for clustering. However as similar, but we did have a corpus of articles clustered can be seen from the results shown above, this is not a into stories. Working on the assumption that similar perfect solution and the inaccuracy here will introduce claims would be more likely to be in the same story, inaccuracy further down the clustering pipeline. we developed an equation to judge how well our corpus of sentences was clustered, rewarding clustering which 3.3 Clustering Method matches the article clustering and the total number of claims clustered. The precise formula is given below, We decided to follow a methodology upon the DBScan where Pos is the proportion of claims in clusters from method of clustering [EKSX96]. DBScan considers all one story cluster, Pcc is the proportion of claims in the distances between pairs of points. If they are under  correct claim cluster, where they are from the most then those two are linked. Once the number of con- common story cluster, and Nc is the number of claims nected points exceeds a minimum size threshold, they placed in clusters. A,B and C are parameters to tune. are considered a cluster and all other points are consid- ered to be unclustered. This method is advantageous for our purposes because unlike other methods, such   A × Pos + B × Pcc × (C × Nc ) as K-Means, it does not require the number of clusters Figure 2: Formula to assess the correctness of claim to be specified. To create a system that can build clus- clusters based on article clusters ters dynamically, adding one point at a time, we set 2 Described in the newslens paper [LH17] This method is limited in how well it can represent 3 Google USE Large is the Transformer based model, the problem, but it can give indications as to a good or found at https://tfhub.dev/google/universal-sentence-encoder- bad clustering method or embedding, and can act as large/3, whereas Google USE uses a DAN architecture the minimum cluster size to one, meaning that every frames. Computation + Journalism Sym- point is a member of a cluster. posium, 2019. A potential disadvantage of this method is that be- cause points require only one connection to a cluster [BAPM15] Samuel R. Bowman, Gabor Angeli, to join it, they may only be related to one point in the Christopher Potts, and Christopher D. cluster, but be considered in the same cluster as all Manning. A large annotated corpus of them. In small examples this is not a problem as for learning natural language inference. all points in the cluster should be very similar. How- In Proceedings of the 2015 Conference ever as the number of points being considered grows, on Empirical Methods in Natural Lan- this behaviour raises the prospect of one or several guage Processing (EMNLP). Association borderline clustering decisions leading to massive clus- for Computational Linguistics, 2015. ters made from tenuous connections between genuine [BFL98] Collin F. Baker, Charles J. Fillmore, and clusters. To mitigate this problem we used a method John B. Lower. The Berkeley FrameNet described in the Newslens paper [LH17] to solve a sim- Project. 1998. ilar problem when clustering entire articles. We stored all of our claims in a graph with the connections be- [BGLL08] Vincent D. Blondel, Jean-Loup Guil- tween them added when the distance between them laume, Renaud Lambiotte, and Etienne was determined to be less than . To determine the Lefebvre. Fast unfolding of communities final clusters we run a Louvain Community Detection in large networks. 2008. [BGLL08] over this graph to split it into defined com- [BM16] Mevan Babakar and Will Moy. The state munities. This improved the compactness of a cluster. of automated factchecking, 2016. When clustering claims one by one, this algorithm can be performed on the connected subgraph featuring the [BNJ03] David M. Blei, Andrew Y. Ng, and new claim, to reduce the computation required. Michael I. Jordan. Latent dirichlet allo- As this method involves distance calculations be- cation. volume 3, pages 993–1022, 2003. tween the claim being added and every existing claim, the time taken to add one claim will increase roughly [Car18] Josue Caraballo. A taxonomy of political linearly with respect to the number of previous claims. claims. 2018. Through much optimization we have brought the com- [CKS+ 17] Alexis Conneau, Douwe Kiela, Holger putational time down to approximately 300ms per Schwenk, Loic Barrault, and Antoine Bor- claim, which stays fairly static with respect to the des. Supervised learning of universal sen- number of previous claims. tence representations from natural lan- guage inference data, 2017. 4 Next Steps [CYK+ 18] Daniel Cer, Yinfei Yang, Sheng-yi Kong, The clustering described above is heavily dependent Nan Hua, Nicole Limtiaco, Rhomni St. on the embedding used. The rate of advances in this John, Noah Constant, Mario Guajardo- field has been rapid in recent years, but an embedding Cespedes, Steve Yuan, Chris Tar, Yun- will always be an imperfect representation of an claim Hsuan Sung, Brian Strope, and Ray and therefore always an area of improvement. A do- Kurzweil. Universal sentence encoder. main specific-embedding will likely offer a more accu- 2018. rate representation but creates problems with cluster- ing claims from different domains. They also require [EKSX96] Martin Ester, Hans-Peter Kriegel, Jrg a huge amount of data to give a good model and that Sander, and Xiaowei Xu. A density-based is not possible in all domains. algorithm for discovering clusters in large spatial databases with noise. pages 226– Acknowledgements 231. AAAI Press, 1996. Thanks to Anil Bandhakavi, Tom Dakin and Felicity [Gra18] Lucas Graves. Understanding the promise Handley for their time, advice and proofreading. and limits of automated fact-checking, 2018. References [HNS+ 17] Naeemul Hassan, Anil Kumar Nayak, + [AJC 19] Fatma Arslan, Damian Jimenez, Jo- Vikas Sable, Chengkai Li, Mark sue Caraballo, Gensheng Zhang, and Tremayne, Gensheng Zhang, Fatma Ar- Chengkai Li. Modeling factual claims by slan, Josue Caraballo, Damian Jimenez, Siddhant Gawsane, Shohedul Hasan, Clark, Kenton Lee, and Luke Zettle- Minumol Joseph, and Aaditya Kulkarni. moyer. Deep contextualized word repre- ClaimBuster. Proceedings of the VLDB sentations. In Proc. of NAACL, 2018. Endowment, 10(12):1945–1948, 2017. [SIC17] Nikhil Dandekar Shankar Iyer and Ko- + rnl Csernai. First quora dataset release: [JGBC 18] Israa Jaradat, Pepa Gencheva, Al- berto Barrón-Cedeño, Lluı́s Màrquez, and Question pairs, 2017. Preslav Nakov. ClaimRank: Detecting [SR15] Angqiu Song and Dan Roth. Unsuper- check-worthy claims in arabic and english. vised sparse vector densification for short In Proceedings of the 2018 Conference of text similarity. pages 1275–1280, 2015. the North American Chapter of the As- sociation for Computational Linguistics: [TV18] James Thorne and Andreas Vlachos. Au- Demonstrations. Association for Compu- tomated fact checking: Task formula- tational Linguistics, 2018. tions, methods and future directions. 2018. [Jon72] Karen Sprck Jones. A statistical interpre- tation of term specificity and its applica- [VRA18] Soroush Vosoughi, Deb Roy, and Sinan tion in retrieval. Journal of Documenta- Aral. The spread of true and false news tion, 28:11–21, 1972. online. Science, 359(6380):1146–1151, 2018. [KPBZ18] Lev Konstantinovskiy, Oliver Price, Mevan Babakar, and Arkaitz Zubiaga. [WNB18] Adina Williams, Nikita Nangia, and Towards automated factchecking: De- Samuel R. Bowman. A broad-coverage veloping an annotation schema and challenge corpus forsentence understand- benchmark for consistent automated ing through inference. Proceedings of claim detection. 2018. NAACL-HLT 2018, 2018. [LGS+ 17] Ran Levy, Shai Gretz, Benjamin Szna- [WXX+ 15] Peng Wang, Jiaming Xu, Bo Xu, Cheng- jder, Shay Hummel, Ranit Aharonov, and Lin Liu, Heng ZhangFangyuan Wang, Noam Slonim. Unsupervised corpus–wide and Hongwei Hao. Semantic clustering claim detection. In Proceedings of the 4th and convolutional neural networkfor short Workshop on Argument Mining. Associa- text categorization. Number 6, pages 352– tion for Computational Linguistics, 2017. 357, 2015. [LH17] Philippe Laban and Marti Hearst. newslens: building and visualizing long- ranging news stories. In Proceedings of the Events and Stories in the News Work- shop, pages 1–9, Vancouver, Canada, 2017. [MAGM18] Bertin Martens, Luis Aguiar, Estrella Gomez-Herrera, and Frank Mueller- Langer. The digital transformation of news media and the rise of disinformation and fakenews. 2018. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013. [Pew16] Pew Research Center. State of the news media, 2016. [PNI+ 18] Matthew E. Peters, Mark Neumann, Mo- hit Iyyer, Matt Gardner, Christopher