Real-time Claim Detection from News Articles and
         Retrieval of Semantically-Similar Factchecks

                               Ben Adler                          Giacomo Boscaini-Gilroy
                              London UK                                London UK
                          ben@thelogically.co.uk                  giacomo@logically.co.uk
                                                           Logically


                                                                 the truth, they make those in positions of power ac-
                                                                 countable. This is a result of labour intensive work
                        Abstract                                 that involves monitoring the news for spurious claims
                                                                 and carrying out rigorous research to judge credibility.
    Factchecking has always been a part of the                   So far, it has only been possible to scale their output
    journalistic process. However with newsroom                  upwards by hiring more personnel. This is problem-
    budgets shrinking [Pew16] it is coming un-                   atic because newsrooms need significant resources to
    der increasing pressure just as the amount                   employ factcheckers. Publication budgets have been
    of false information circulating is on the rise              decreasing, resulting in a steady decline in the size of
    [MAGM18]. We therefore propose a method                      their workforce [Pew16]. Factchecking is not a directly
    to increase the efficiency of the factcheck-                 profitable activity, which negatively affects the alloca-
    ing process, using the latest developments                   tion of resources towards it in for-profit organisations.
    in Natural Language Processing (NLP). This                   It is often taken on by charities and philanthropists
    method allows us to compare incoming claims                  instead.
    to an existing corpus and return similar,
    factchecked, claims in a live system—allowing
    factcheckers to work simultaneously without
    duplicating their work.                                         To compensate for this shortfall, our strategy is
                                                                 to harness the latest developments in NLP to make
1    Introduction                                                factchecking more efficient and therefore less costly.
                                                                 To this end, the new field of automated factcheck-
In recent years, the spread of misinformation has be-            ing has captured the imagination of both non-profits
come a growing concern for researchers and the pub-              and start-ups [Gra18, BM16, TV18]. It aims to speed
lic at large [MAGM18]. Researchers at MIT found                  up certain aspects of the factchecking process rather
that social media users are more likely to share false           than create AI that can replace factchecking person-
information than true information [VRA18]. Due to                nel. This includes monitoring claims that are made
renewed focus on finding ways to foster healthy polit-           in the news, aiding decisions about which statements
ical conversation, the profile of factcheckers has been          are the most important to check and automatically re-
raised.                                                          trieving existing factchecks that are relevant to a new
   Factcheckers positively influence public debate by            claim.
publishing good quality information and asking politi-
cians and journalists to retract misleading or false
statements. By calling out lies and the blurring of
                                                                    The claim detection and claim clustering methods
Copyright c 2019 for the individual papers by the papers’ au-    that we set out in this paper can be applied to each of
thors. Copying permitted for private and academic purposes.      these. We sought to devise a system that would auto-
This volume is published and copyrighted by its editors.
                                                                 matically detect claims in articles and compare them
In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen,
M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the
                                                                 to previously submitted claims. Storing the results to
NewsIR’19 Workshop at SIGIR, Paris, France, 25-July-2019,        allow a factchecker’s work on one of these claims to be
published at http://ceur-ws.org                                  easily transferred to others in the same cluster.
2     Claim Detection                                         Table 1: Examples of claims taken from real articles.

2.1   Related Work                                                Sentence                                    Claim?
It is important to decide what sentences are claims be-           In its 2015 order, the NGT had banned       Yes
fore attempting to cluster them. The first such claim             the plying of petrol vehicles older than
detection system to have been created is ClaimBuster              15 years and diesel vehicles older than
[HNS+ 17], which scores sentences with an SVM to                  10 years in the National Capital Region
determine how likely they are to be politically per-              (NCR).
tinent statements. Similarly, ClaimRank [JGBC+ 18]                In my view, farmers should not just         No
uses real claims checked by factchecking institutions             rely on agriculture but also adopt
as training data in order to surface sentences that are           dairy farming.
worthy of factchecking.
    These methods deal with the question of what is a         a simple classifier such as logistic regression, which is
politically interesting claim. In order to classify the       what we used. They used Facebook’s sentence embed-
objective qualities of what set apart different types         dings, InferSent [CKS+ 17], which was a recent break-
of claims, the ClaimBuster team created PolitiTax             through at the time. Such is the speed of new devel-
[Car18], a taxonomy of claims, and factchecking organ-        opment in the field that since then, several papers de-
isation Full Fact [KPBZ18] developed their preferred          scribing textual embeddings have been published. Due
annotation schema for statements in consultation with         to the fact that we had already evaluated embeddings
their own factcheckers. This research provides a more         for clustering, and therefore knew our system would
solid framework within which to construct claim de-           rely on Google USE Large [CYK+ 18], we decided to
tection classifiers.                                          use this instead. We compared this to TFIDF and Full
    The above considers whether or not a sentence             Fact’s results as baselines. The results are displayed
is a claim, but often claims are subsections of sen-          in Table 2.
tences and multiple claims might be found in one                 However, ClaimBuster and Full Fact focused on live
sentence. In order to accommodate this, [LGS+ 17]             factchecking of TV debates. Logically is a news ag-
proposes extracting phrases called Context Dependent          gregator and we analyse the bodies of published news
Claims (CDC) that are relevant to a certain ‘Topic’.          stories. We found that in our corpus, the majority of
Along these lines, [AJC+ 19] proposes new definitions         sentences are claims and therefore our model needed
for frames to be incorporated into FrameNet [BFL98]           to be as selective as possible. In practice, we choose
that are specific to facts, in particular those found in      to filter out sentences that are predictions since gener-
a political context.                                          ally the substance of the claim cannot be fully checked
                                                              until after the event has occurred. Likewise, we try to
2.2   Method                                                  remove claims based on personal experience or anec-
                                                              dotal evidence as they are difficult to verify.
It is much easier to build a dataset and reliably eval-
uate a model if the starting definitions are clear and                   Table 2: Claim Detection Results.
objective. Questions around what is an interesting or
                                                                    Embedding Method            P      R       F1
pertinent claim are inherently subjective. For exam-
                                                                    Google USE Large            0.90   0.89    0.89
ple, it is obvious that a politician will judge their oppo-
                                                                    [CYK+ 18]
nents’ claims to be more important to factcheck than
                                                                    Full Fact (not on           0.88   0.80    0.83
their own.
                                                                    the same data) [KPBZ18]
   Therefore, we built on the methodologies that dealt
                                                                    TFIDF (Baseline)            0.84   0.84    0.84
with the objective qualities of claims, which were the
                                                                    [Jon72]
PolitiTax and Full Fact taxonomies. We annotated
sentences from our own database of news articles based
on a combination of these. We also used the Full Fact
definition of a claim as a statement about the world          3     Claim Clustering
that can be checked. Some examples of claims accord-
ing to this definition are shown in Table 1. We decided       3.1    Related Work
the first statement was a claim since it declares the oc-     Traditional text clustering methods, using TFIDF and
currence of an event, while the second was considered         some clustering algorithm, are poorly suited to the
not to be a claim as it is an expression of feeling.          problem of clustering and comparing short texts, as
   Full Fact’s approach centred around using sentence         they can be semantically very similar but use dif-
embeddings as a feature engineering step, followed by         ferent words. This is a manifestation of the the
data sparsity problem with Bag-of-Words (BoW) mod-         unlikely ever to approach such scale as they require
els. [SR15]. Dimensionality reduction methods such         human annotations which can be expensive to assem-
as Latent Dirichlet Allocation (LDA) can help solve        ble. The SNLI entailment dataset is an example of
this problem by giving a dense approximation of this       a large open source dataset [BAPM15]. It features
sparse representation [BNJ03]. More recently, efforts      pairs of sentences along with labels specifying whether
in this area have used text embedding-based systems        or not one entails the other. Google’s Universal Sen-
in order to capture dense representation of the texts      tence Encoder (USE) [CYK+ 18] is a sentence embed-
[WXX+ 15]. Much of this recent work has relied on the      ding created with a hybrid supervised/unsupervised
increase of focus in word and text embeddings. Text        method, leveraging both the vast amounts of unsuper-
embeddings have been an increasingly popular tool in       vised training data and the extra detail that can be
NLP since the introduction of Word2Vec [MCCD13],           derived from a supervised method. The SNLI dataset
and since then the number of different embeddings has      and the related MultiNLI dataset are often used for
exploded. While many focus on giving a vector repre-       this because textual entailment is seen as a good basis
sentation of a word, an increasing number now exist        for general Natural Language Understanding (NLU)
that will give a vector representation of a entire sen-    [WNB18].
tence or text. Following on from this work, we seek to
devise a system that can run online, performing text       3.2   Choosing an embedding
clustering on the embeddings of texts one at a time
                                                           In order to choose an embedding, we sought a dataset
3.1.1   Text Embeddings                                    to represent our problem.       Although no perfect
                                                           matches exist, we decided upon the Quora duplicate
Some considerations to bear in mind when deciding          question dataset [SIC17] as the best match. To study
on an embedding scheme to use are: the size of the         the embeddings, we computed the euclidean distance
final vector, the complexity of the model itself and, if   between the two questions using various embeddings,
using a pretrained implementation, the data the model      to study the distance between semantically similar and
has been trained on and whether it is trained in a         dissimilar questions.
supervised or unsupervised manner.
   The size of the embedding can have numerous re-
sults downstream. In our example we will be doing dis-
tance calculations on the resultant vectors and there-
fore any increase in length will increase the complex-
ity of those distance calculations. We would therefore
like as short a vector as possible, but we still wish to
capture all salient information about the claim; longer
vectors have more capacity to store information, both
salient and non-salient.
   A similar effect is seen for the complexity of the
model. A more complicated model, with more train-
able parameters, may be able to capture finer details
about the text, but it will require a larger corpus to
achieve this, and will require more computational time
to calculate the embeddings. We should therefore at-
tempt to find the simplest embedding system that can
accurately solve our problem.
   When attempting to use pretrained models to help
in other areas, it is always important to ensure that
the models you are using are trained on similar ma-
terial, to increase the chance that their findings will
generalise to the new problem. Many unsupervised
text embeddings are trained on the CommonCrawl 1
dataset of approx. 840 billion tokens. This gives a
huge amount of data across many domains, but re-
quires a similarly huge amount of computing power to
train on the entire dataset. Supervised datasets are       Figure 1: Analysis of Different Embeddings on the
                                                           Quora Question Answering Dataset
  1 CommonCrawl found at http://commoncrawl.org/
                     Table 3: Comparing Sentence Embeddings for Clustering News Claims.


   Embedding                         Time        Number       Number        Percentage of         Percentage of
   method                            taken (s)   of claims    of clusters   claims in             claims in clusters
                                                 clustered                  majority clusters     of one story
   Elmo [PNI+ 18]                    122.87      156          21            57.05%                3.84%
   Googe USE [CYK+ 18]               117.16      926          46            57.95%                4.21%
   Google USE Large [CYK+ 18]        95.06       726          63            60.74%                7.02%
   Infersent [CKS+ 17]               623.00      260          34            63.08%                10.0%
   TFIDF (Baseline) [Jon72]          25.97       533          58            62.85%                7.12%

   The graphs in figure 1 show the distances between         a check that the findings we obtained from the Quora
duplicate and non-duplicate questions using different        dataset will generalise to our domain. We ran code
embedding systems. The X axis shows the euclidean            which vectorized 2,000 sentences and then used the
distance between vectors and the Y axis frequency. A         DBScan clustering method [EKSX96] to cluster using
perfect result would be a blue peak to the left and an       a grid search to find the best  value, maximizing this
entirely disconnected orange spike to the right, show-       formula. We used DBScan as it mirrored the cluster-
ing that all non-duplicate questions have a greater eu-      ing method used to derive the original article clusters.
clidean distance than the least similar duplicate pair of    The results for this experiment can be found in Ta-
questions. As can be clearly seen in the figure above,       ble 3. We included TFIDF in the experiment as a
Elmo [PNI+ 18] and Infersent [CKS+ 17] show almost           baseline to judge other results. It is not suitable for
no separation and therefore cannot be considered good        our eventual purposes, but it the basis of the origi-
models for this problem. A much greater disparity is         nal keyword-based model used to build the clusters 2 .
shown by the Google USE models [CYK+ 18], and even           That being said, TFIDF performs very well, with only
more for the Google USE Large model. In fact the             Google USE Large and Infersent coming close in terms
Google USE Large achieved a F1 score of 0.71 for this        of ‘accuracy’. In the case of Infersent, this comes with
task without any specific training, simply by choosing       the penalty of a much smaller number of claims in-
a threshold below which all sentence pairs are consid-       cluded in the clusters. Google USE Large, however,
ered duplicates.                                             clusters a greater number and for this reason we chose
   In order to test whether these results generalised to     to use Google’s USE Large. 3
our domain, we devised a test that would make use               Since Google USE Large was the best-performing
of what little data we had to evaluate. We had no            embedding in both the tests we devised, this was our
original data on whether sentences were semantically         chosen embedding to use for clustering. However as
similar, but we did have a corpus of articles clustered      can be seen from the results shown above, this is not a
into stories. Working on the assumption that similar         perfect solution and the inaccuracy here will introduce
claims would be more likely to be in the same story,         inaccuracy further down the clustering pipeline.
we developed an equation to judge how well our corpus
of sentences was clustered, rewarding clustering which
                                                             3.3   Clustering Method
matches the article clustering and the total number of
claims clustered. The precise formula is given below,        We decided to follow a methodology upon the DBScan
where Pos is the proportion of claims in clusters from       method of clustering [EKSX96]. DBScan considers all
one story cluster, Pcc is the proportion of claims in the    distances between pairs of points. If they are under 
correct claim cluster, where they are from the most          then those two are linked. Once the number of con-
common story cluster, and Nc is the number of claims         nected points exceeds a minimum size threshold, they
placed in clusters. A,B and C are parameters to tune.        are considered a cluster and all other points are consid-
                                                             ered to be unclustered. This method is advantageous
                                                             for our purposes because unlike other methods, such
                              
              A × Pos + B × Pcc × (C × Nc )
                                                             as K-Means, it does not require the number of clusters
Figure 2: Formula to assess the correctness of claim         to be specified. To create a system that can build clus-
clusters based on article clusters                           ters dynamically, adding one point at a time, we set
                                                               2 Described in the newslens paper [LH17]
   This method is limited in how well it can represent          3 Google USE Large is the Transformer based model,
the problem, but it can give indications as to a good or     found at https://tfhub.dev/google/universal-sentence-encoder-
bad clustering method or embedding, and can act as           large/3, whereas Google USE uses a DAN architecture
the minimum cluster size to one, meaning that every                    frames. Computation + Journalism Sym-
point is a member of a cluster.                                        posium, 2019.
   A potential disadvantage of this method is that be-
cause points require only one connection to a cluster      [BAPM15] Samuel R. Bowman, Gabor Angeli,
to join it, they may only be related to one point in the            Christopher Potts, and Christopher D.
cluster, but be considered in the same cluster as all               Manning.     A large annotated corpus
of them. In small examples this is not a problem as                 for learning natural language inference.
all points in the cluster should be very similar. How-              In Proceedings of the 2015 Conference
ever as the number of points being considered grows,                on Empirical Methods in Natural Lan-
this behaviour raises the prospect of one or several                guage Processing (EMNLP). Association
borderline clustering decisions leading to massive clus-            for Computational Linguistics, 2015.
ters made from tenuous connections between genuine         [BFL98]     Collin F. Baker, Charles J. Fillmore, and
clusters. To mitigate this problem we used a method                    John B. Lower. The Berkeley FrameNet
described in the Newslens paper [LH17] to solve a sim-                 Project. 1998.
ilar problem when clustering entire articles. We stored
all of our claims in a graph with the connections be-      [BGLL08]    Vincent D. Blondel, Jean-Loup Guil-
tween them added when the distance between them                        laume, Renaud Lambiotte, and Etienne
was determined to be less than . To determine the                     Lefebvre. Fast unfolding of communities
final clusters we run a Louvain Community Detection                    in large networks. 2008.
[BGLL08] over this graph to split it into defined com-
                                                           [BM16]      Mevan Babakar and Will Moy. The state
munities. This improved the compactness of a cluster.
                                                                       of automated factchecking, 2016.
When clustering claims one by one, this algorithm can
be performed on the connected subgraph featuring the       [BNJ03]     David M. Blei, Andrew Y. Ng, and
new claim, to reduce the computation required.                         Michael I. Jordan. Latent dirichlet allo-
   As this method involves distance calculations be-                   cation. volume 3, pages 993–1022, 2003.
tween the claim being added and every existing claim,
the time taken to add one claim will increase roughly      [Car18]     Josue Caraballo. A taxonomy of political
linearly with respect to the number of previous claims.                claims. 2018.
Through much optimization we have brought the com-
                                                           [CKS+ 17]   Alexis Conneau, Douwe Kiela, Holger
putational time down to approximately 300ms per
                                                                       Schwenk, Loic Barrault, and Antoine Bor-
claim, which stays fairly static with respect to the
                                                                       des. Supervised learning of universal sen-
number of previous claims.
                                                                       tence representations from natural lan-
                                                                       guage inference data, 2017.
4   Next Steps
                                                           [CYK+ 18] Daniel Cer, Yinfei Yang, Sheng-yi Kong,
The clustering described above is heavily dependent                  Nan Hua, Nicole Limtiaco, Rhomni St.
on the embedding used. The rate of advances in this                  John, Noah Constant, Mario Guajardo-
field has been rapid in recent years, but an embedding               Cespedes, Steve Yuan, Chris Tar, Yun-
will always be an imperfect representation of an claim               Hsuan Sung, Brian Strope, and Ray
and therefore always an area of improvement. A do-                   Kurzweil. Universal sentence encoder.
main specific-embedding will likely offer a more accu-               2018.
rate representation but creates problems with cluster-
ing claims from different domains. They also require       [EKSX96]    Martin Ester, Hans-Peter Kriegel, Jrg
a huge amount of data to give a good model and that                    Sander, and Xiaowei Xu. A density-based
is not possible in all domains.                                        algorithm for discovering clusters in large
                                                                       spatial databases with noise. pages 226–
Acknowledgements                                                       231. AAAI Press, 1996.
Thanks to Anil Bandhakavi, Tom Dakin and Felicity          [Gra18]     Lucas Graves. Understanding the promise
Handley for their time, advice and proofreading.                       and limits of automated fact-checking,
                                                                       2018.
References
                                                           [HNS+ 17]   Naeemul Hassan, Anil Kumar Nayak,
     +
[AJC 19]    Fatma Arslan, Damian Jimenez, Jo-                          Vikas Sable, Chengkai Li, Mark
            sue Caraballo, Gensheng Zhang, and                         Tremayne, Gensheng Zhang, Fatma Ar-
            Chengkai Li. Modeling factual claims by                    slan, Josue Caraballo, Damian Jimenez,
             Siddhant Gawsane, Shohedul Hasan,                        Clark, Kenton Lee, and Luke Zettle-
             Minumol Joseph, and Aaditya Kulkarni.                    moyer. Deep contextualized word repre-
             ClaimBuster. Proceedings of the VLDB                     sentations. In Proc. of NAACL, 2018.
             Endowment, 10(12):1945–1948, 2017.
                                                           [SIC17]    Nikhil Dandekar Shankar Iyer and Ko-
         +                                                            rnl Csernai. First quora dataset release:
[JGBC 18] Israa Jaradat, Pepa Gencheva, Al-
          berto Barrón-Cedeño, Lluı́s Màrquez, and                 Question pairs, 2017.
          Preslav Nakov. ClaimRank: Detecting
                                                           [SR15]     Angqiu Song and Dan Roth. Unsuper-
          check-worthy claims in arabic and english.
                                                                      vised sparse vector densification for short
          In Proceedings of the 2018 Conference of
                                                                      text similarity. pages 1275–1280, 2015.
          the North American Chapter of the As-
          sociation for Computational Linguistics:         [TV18]     James Thorne and Andreas Vlachos. Au-
          Demonstrations. Association for Compu-                      tomated fact checking: Task formula-
          tational Linguistics, 2018.                                 tions, methods and future directions.
                                                                      2018.
[Jon72]      Karen Sprck Jones. A statistical interpre-
             tation of term specificity and its applica-   [VRA18]    Soroush Vosoughi, Deb Roy, and Sinan
             tion in retrieval. Journal of Documenta-                 Aral. The spread of true and false news
             tion, 28:11–21, 1972.                                    online.  Science, 359(6380):1146–1151,
                                                                      2018.
[KPBZ18]     Lev Konstantinovskiy, Oliver Price,
             Mevan Babakar, and Arkaitz Zubiaga.           [WNB18]    Adina Williams, Nikita Nangia, and
             Towards automated factchecking: De-                      Samuel R. Bowman. A broad-coverage
             veloping an annotation schema and                        challenge corpus forsentence understand-
             benchmark for consistent automated                       ing through inference. Proceedings of
             claim detection. 2018.                                   NAACL-HLT 2018, 2018.

[LGS+ 17]    Ran Levy, Shai Gretz, Benjamin Szna-          [WXX+ 15] Peng Wang, Jiaming Xu, Bo Xu, Cheng-
             jder, Shay Hummel, Ranit Aharonov, and                  Lin Liu, Heng ZhangFangyuan Wang,
             Noam Slonim. Unsupervised corpus–wide                   and Hongwei Hao. Semantic clustering
             claim detection. In Proceedings of the 4th              and convolutional neural networkfor short
             Workshop on Argument Mining. Associa-                   text categorization. Number 6, pages 352–
             tion for Computational Linguistics, 2017.               357, 2015.

[LH17]       Philippe Laban and Marti Hearst.
             newslens: building and visualizing long-
             ranging news stories. In Proceedings of
             the Events and Stories in the News Work-
             shop, pages 1–9, Vancouver, Canada,
             2017.

[MAGM18] Bertin Martens, Luis Aguiar, Estrella
         Gomez-Herrera, and Frank Mueller-
         Langer. The digital transformation of
         news media and the rise of disinformation
         and fakenews. 2018.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado,
         and Jeffrey Dean. Efficient estimation
         of word representations in vector space.
         2013.

[Pew16]      Pew Research Center. State of the news
             media, 2016.

[PNI+ 18]    Matthew E. Peters, Mark Neumann, Mo-
             hit Iyyer, Matt Gardner, Christopher