Using External Knowledge Bases and Coreference Resolution for Detecting Check-Worthy Statements

CLEF-2019 Shared Task: Automatic Identification and Verification of Claims

Salar Mohtaj1, Tilo Himmelsbach1, Vinicius Woloszyn1, and Sebastian Möller1,2

1 Quality and Usability Lab, Technische Universität Berlin
2 DFKI Projektbüro Berlin
Berlin, Germany
{salar.mohtaj, tilo.himmelsbach, woloszyn, sebastian.moeller}@tu-berlin.de

Abstract. With the proliferation of online information sources, it has become increasingly difficult to judge the trustworthiness of a statement on the Web. Nevertheless, recent advances in natural language processing allow us to analyze information more objectively according to certain criteria, e.g. whether a proposition is factual or opinionated, or the authority and credibility of an author on a certain topic. In this paper, we formulate a ranking scheme for textual claims that can speed up the human fact-checking process. Our experiments have shown that our proposed method statistically outperformed the baseline. Additionally, this work describes a multilingual data set of claims collected from several fact-checking websites, which was used to fine-tune our model.

Keywords: fact-checking · check-worthiness · fake news · coreference resolution · political debates

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The 2016 American presidential elections were a source of growing public awareness of what has since been denominated "fake news". The term started to be used from different positions within the social space as a means of discrediting, attacking and delegitimizing political opponents. However, the task of assessing the credibility of a claim is time-consuming for the user. For example, Kumar's work [12] reports that even humans are not always able to distinguish hoaxes from authentic claims, and that quite a few people fail to differentiate satirical articles from true news.

With the increasing number of false claims and rumors, fact-checking websites like snopes.com, politifact.com, and fullfact.org have become popular. These websites compile articles written by experts who manually investigate controversial claims to determine their veracity, providing pieces of evidence for the verdict (e.g. true or false). However, with the quick proliferation of such false statements, especially in the context of a political debate, it becomes very difficult for a single person to assess the validity of all claims made.

In this paper, our team "É Proibido Cochilar" ("É Proibido Cochilar" is the title of a Brazilian song and means "it is forbidden to take a nap") has put forward a new supervised approach to ranking claims by their check-worthiness. In addition to the Presidential debates from the 2016 US campaign [3], we have also created a large multilingual data set of statements extracted from several different fact-checking websites. The experiments have shown that our proposed method statistically outperformed the baseline in terms of imitating human experts' judgments about the check-worthiness of a claim.

The remainder of this paper is organized as follows. Section 2 discusses previous work on fake news detection. Section 3 presents the details of our approach. Sections 4 and 5 describe the design of our experiments and the results.
Section 6 summarizes our conclusions and presents future research directions.

2 Related Work

Several studies have addressed the task of assessing the credibility of a claim. For instance, Popat et al. [16] proposed a new approach to identify the credibility of a claim in a text. For a certain claim, it retrieves the corresponding articles from news and/or social media and feeds them into a distantly supervised classifier that assesses their credibility. Experiments with claims from the website snopes.com and with popular cases of Wikipedia hoaxes demonstrate the viability of Popat et al.'s proposed method. Another example is TrustRank [9]. This work presents a semi-supervised approach to separate reputable good pages from spam. To discover good pages, it relies on the observation that good pages seldom point to bad ones, i.e. people creating good pages have little reason to point to bad pages. Finally, it employs a biased PageRank that uses this empirical observation to discover other pages that are likely to be good.

Controversial subjects can also be indicative of dispute or debate involving different opinions about the same subject. Detecting and alerting users when they are reading a controversial web page is one way to make them aware of the quality of the information they are consuming. One example of controversy detection is [6], which relies on supervised k-nearest-neighbor classification that maps a web page to a set of neighboring controversial articles extracted from Wikipedia. In this approach, a page adjacent to controversial pages is likely to be controversial itself. Another work along these lines is [13], which aims to generate contrastive summaries of different viewpoints in opinionated texts. It proposes a comparative LexRank that relies on a random-walk formulation to score each sentence based on its difference from the other sentences.

Factuality assessment is another way to assess information quality. Yu et al.'s work [21] aims to separate opinions from facts at both the document and the sentence level. It uses a Bayesian classifier to discriminate between documents with a preponderance of opinions, such as editorials, and regular news stories. The main goal of this approach is to classify a document or sentence as factual or opinionated text from the perspective of the author. The evaluation of the proposed system reported promising results at both the document and the sentence level. Another work along the same lines is [17], which proposes a two-stage framework to extract opinionated sentences from news articles. In the first stage, a supervised learning model gives each sentence a score based on the probability that the sentence is opinionated. In the second stage, it uses these probabilities within the HITS schema, treating the opinionated sentences as hubs and the facts around these opinions as authorities. The proposed method extracts opinions, grouping them with supporting facts as well as other supporting opinions.

There are also works that analyze how a piece of information flows over the internet. For instance, [7] presents an interesting analysis of how Twitter bots can send spam tweets, manipulate public opinion and be used for online fraud. It reports the discovery of the 'Star Wars' botnet on Twitter, which consists of more than 350,000 bots tweeting random quotations exclusively from Star Wars novels. It analyzes and reveals rich details on how the botnet is designed and gives insights into how to detect virality on Twitter.
Other works analyze the writing style in order to detect a false claim. [10] reports that fake news is, in most cases, more similar to satire than to real news, leading to the conclusion that persuasion in fake news is achieved through heuristics rather than the strength of arguments. It shows that the overall title structure and the use of proper nouns in titles are very significant in differentiating fake from real news. This suggests that fake news is targeted at audiences who are not likely to read beyond titles and that it aims at creating mental associations between entities and claims.

Decreasing the readability of a text is another way to obscure false claims on the internet. Many automatic methods to evaluate the readability of texts have been proposed. For instance, Coh-Metrix [8] is a computational tool that measures cohesion, discourse, and text difficulty.

Most of the works just cited rely on supervised learning strategies aimed at assessing news articles along a few different aspects, such as credibility, controversy, factuality and virality of information. Nonetheless, a common drawback of supervised learning approaches is that the quality of the results is heavily influenced by the availability of a large, domain-dependent annotated corpus to train the model. Unsupervised and semi-supervised learning techniques, on the other hand, are attractive because they do not incur the cost of corpus annotation. In short, our method uses a semi-supervised strategy where only a small set of unreliable news websites is used to spot other unreliable news websites using a biased PageRank.

3 Proposed Approach

In order to rank statements according to their estimated check-worthiness, we relied on an important empirical observation: there is a significant number of claims with pronouns referring back to nouns mentioned in previous statements. For example, "I beat her, and I beat her badly. She's raising your taxes really high"; the pronouns her and she refer to the same person, namely Hillary Clinton. More examples are given in Table 1.

Table 1. Sample sentences from the training data that contain pronouns referring back to nouns mentioned in previous statements.

Speaker   Sentence                                        Label
SANDERS   They are working longer hours for low wages.    1
TRUMP     I beat her, and I beat her badly.                1
CLINTON   They're interested in keeping Assad in power.    0
SANDERS   Listen to what I told them then.                 0
TRUMP     She's raising your taxes really high.            1

Sentences that contain pronouns are normally an issue for statistical models and can significantly decrease prediction quality. To overcome this issue, a coreference resolution technique is applied to replace pronouns with their original referents. We used a feed-forward neural network to compute the coreference score for each pair of potential mentions [1], e.g. Hillary Clinton ← she. We considered the last 30 sentences (a sliding window) when computing the coreferences. Table 2 illustrates the coreference resolution of the examples presented in Table 1. Resolving coreferences leads to more clear-cut statements, which in our experiments improved the performance of our predictions.

Table 2. The result of applying coreference resolution to the sentences in Table 1.

Speaker   Sentence                                                         Label
SANDERS   Millions of Americans are working longer hours for low wages.    1
TRUMP     I beat Hillary Clinton, and I beat Hillary Clinton badly.        1
CLINTON   Russia is interested in keeping Bashar al-Assad in power.        0
SANDERS   Listen to what I told YouTube then.                              0
TRUMP     Hillary Clinton's raising your taxes really high.                1
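This step can be reproduced with the neuralcoref extension for spaCy referenced in [1]. The listing below is a minimal sketch of the idea rather than our exact pipeline; the spaCy model name and the way the sliding window is handled are assumptions made only for illustration.

    import spacy
    import neuralcoref  # neuralcoref targets spaCy 2.x

    nlp = spacy.load("en_core_web_sm")  # any English spaCy model
    neuralcoref.add_to_pipe(nlp)        # registers doc._.coref_resolved

    def resolve_coreferences(sentences, window=30):
        """Replace pronouns in each sentence, using up to `window`
        preceding sentences as context (the sliding window above)."""
        resolved = []
        for i, _ in enumerate(sentences):
            context = sentences[max(0, i - window):i + 1]
            doc = nlp(" ".join(context))
            # coref_resolved is the context text with every mention
            # replaced by the main mention of its coreference cluster.
            resolved_doc = nlp(doc._.coref_resolved)
            # keep only the current (last) sentence of the resolved context
            resolved.append(list(resolved_doc.sents)[-1].text)
        return resolved

    print(resolve_coreferences([
        "Hillary Clinton spoke first.",
        "I beat her, and I beat her badly.",
    ]))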
Additionally, we normalized the corpus using standard techniques: lowercasing, lemmatization, number removal, white-space removal, stop-word removal, and tokenization. In addition to preparing the data set for the training phase, we used external fact-checking collections to tackle some issues in the provided data set. Firstly, since the provided data is highly imbalanced (less than 3% of the data is labeled as 1), we add external data to make it more balanced. Moreover, more diverse training data can lead to improved generalization of the classification model. Before adding the external data set to the training data, the same pre-processing steps, including coreference resolution, are applied to it.

For this purpose, we have created a tool, called Fake News Extractor [2], to automatically extract claims from fact-checking websites and consolidate them into a large data set for machine learning purposes. It extracts claims in three different languages: English, Portuguese and German. Table 3 gives some statistics about the data set created by our tool.

Table 3. Claims used to train our model.

URL                                        Language     # of claims (per language)
http://fullfact.org                        English
http://www.snopes.com                      English
http://www.politifact.com                  English      27594
http://TruthOrFiction.com                  English
http://checkyourfact.com                   English
http://piaui.folha.uol.com.br/lupa/        Portuguese
http://aosfatos.org/aos-fatos-e-noticia/   Portuguese
http://apublica.org/checagem/              Portuguese    1463
http://g1.globo.com/e-ou-nao-e/            Portuguese
http://www.e-farsas.com/                   Portuguese
http://www.mimikama.at/                    German        5193
http://correctiv.org/                      German

We used Support Vector Machine regression (SVM) [18] with Term Frequency-Inverse Document Frequency (TF-IDF) features. Additionally, we used the scikit-learn [14] library for feature extraction, e.g. uni-grams, bi-grams and tri-grams. In a nutshell, the main contributions to tackle the challenge are as follows:

– the use of coreference resolution in political debates
– the creation of an external collection of claims extracted from fact-checking websites, employed as a training set

4 Experiment Design

For the validation of our experiments, we used 5-fold cross-validation at the document level. In other words, we split the training data into 5 folds, where each document as a whole is considered as belonging to either the training or the testing set. The reason for splitting the data into training and testing folds at the document level is to preserve the sequence of sentences of each debate. We created three different models, as follows; a sketch of the shared feature extraction and evaluation setup is given at the end of this section.

Resolving coreference (ReCo): we tested the performance of our model using the coreference resolution technique previously described in Section 3.

Resolving coreference + further pre-processing (ReCo+pre): as described in the previous section, in this experiment the coreference resolution technique is used to replace pronouns with the right references. We also employed the normalization of the corpus in this model.

Using an external fact-checking data set (ExtDat): in this model we used the external data set of claims described previously. Additionally, all of the aforementioned text normalization techniques were used in this experiment.
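The following is a minimal sketch of the feature extraction, regression model, and document-level folds described above, using scikit-learn [14]. The toy data, the parameter values (e.g. the linear kernel), and the use of three folds are illustrative assumptions, not the tuned configuration of our runs, which used 5 folds on the full debates.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GroupKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVR

    # Toy data: every sentence belongs to a debate (its document), and the
    # label marks whether human experts considered it check-worthy.
    sentences = [
        "I beat Hillary Clinton, and I beat Hillary Clinton badly.",
        "Listen to what I told them then.",
        "Hillary Clinton is raising your taxes really high.",
        "They are working longer hours for low wages.",
        "Thank you all for coming tonight.",
        "Russia is interested in keeping Assad in power.",
    ]
    labels = np.array([1, 0, 1, 1, 0, 0])
    debates = np.array([0, 0, 1, 1, 2, 2])  # document id of each sentence

    # TF-IDF over uni-, bi- and tri-grams, fed into support vector
    # regression; the predicted score ranks sentences by check-worthiness.
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 3)),
        SVR(kernel="linear"),
    )

    # Document-level folds: all sentences of a debate stay on the same side
    # of the split, preserving the sentence sequence of each debate.
    for train_idx, test_idx in GroupKFold(n_splits=3).split(
            sentences, labels, groups=debates):
        model.fit([sentences[i] for i in train_idx], labels[train_idx])
        scores = model.predict([sentences[i] for i in test_idx])
        print(np.argsort(-scores))  # most check-worthy test sentence first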
5 Results

In this section, we present the results and discuss the evaluation of our proposed approach to ranking claims by their check-worthiness. Figure 1 shows that our models yield better results than the baseline. The improvements over the baseline range from 4.02 percentage points (pp) for ReCo to 8.86 pp for ExtDat, the best-performing model. Using a Wilcoxon statistical test [19] with a significance level of 0.05, we verified that the results of our models are statistically superior to the baseline.

Fig. 1. MAP obtained in each experiment compared to the baseline (Baseline: 26.13, ReCo: 30.15, ReCo+pre: 32.35, ExtDat: 34.99).

Regarding the final submission, we used the two best models, namely ReCo+pre and ExtDat, as our contrastive and primary submissions, respectively. Table 4 presents our results on the test data under different evaluation measures.

Table 4. Our primary and contrastive results on the test data.

Submission    MAP    RR     R-P    P@1    P@3    P@5    P@10   P@20   P@50
Primary       .079   .351   .088   .142   .238   .142   .128   .107   .0714
Contrastive   .135   .541   .159   .428   .238   .257   .271   .164   .120

6 Conclusions and Future Work

The performance of a machine learning model trained in a supervised manner is mostly determined by the amount and quality of the training data. The paradigm of transfer learning can be a remedy for the problem of having only small amounts of human-labeled data [11]. Language models that are trained in an unsupervised fashion on a large but unlabeled corpus from a similar domain tend to learn abstract, high-level features that can benefit supervised training [15]. We assume that the basic understanding of a language that is learned by language models like ELMo [15], XLNet [20], and BERT [5] can be of particular use for teaching the machine the concept of check-worthiness. Furthermore, check-worthiness could be interpreted as more than a pure language understanding problem. The overall goal of reducing the human workload of checking claims could be further approached by a fact-checking system based on the ideas of question answering over knowledge bases [4]. This way, obviously true or false claims could be filtered out. Factual claims like "Homicides last year increased by 17 percent in America's fifty largest cities." are relatively easy to verify compared to "[...] NAFTA [is] one of the worst economic deals ever made by our country.".
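As a pointer for this future direction, the listing below is a minimal sketch of fine-tuning a pre-trained language model such as BERT [5] as a binary check-worthiness classifier, written with the HuggingFace Transformers library. The model name, optimizer settings, toy data, and training loop are illustrative assumptions rather than an implementation we evaluated.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    sentences = [
        "I beat Hillary Clinton, and I beat Hillary Clinton badly.",
        "Thank you all for coming tonight.",
    ]
    labels = torch.tensor([1, 0])  # 1 = check-worthy
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few toy epochs on the toy batch
        outputs = model(**batch, labels=labels)  # returns loss and logits
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
    print(probs)  # higher probability = more check-worthy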
References

1. GitHub - huggingface/neuralcoref: Fast coreference resolution in spaCy with neural networks. https://github.com/huggingface/neuralcoref, accessed: 2019-06-30
2. GitHub - vwoloszyn/fake_news_extractor: This project is a collective effort to automatically extract claims from fact-checking websites and then consolidate a large data set for machine learning purposes. Currently, these claims are available for English, Portuguese and German. https://github.com/vwoloszyn/fake_news_extractor, accessed: 2019-06-25
3. Atanasova, P., Nakov, P., Karadzhov, G., Mohtarami, M., Da San Martino, G.: Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 1: Check-Worthiness
4. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on Freebase from question-answer pairs. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1533–1544. Association for Computational Linguistics, Seattle, Washington, USA (Oct 2013)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
6. Dori-Hacohen, S., Allan, J.: Detecting controversy on the web. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. pp. 1845–1848. ACM (2013)
7. Echeverria, J., Zhou, S.: Discovery, retrieval, and analysis of the 'Star Wars' botnet in Twitter. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. pp. 1–8. ACM (2017)
8. Graesser, A.C., McNamara, D.S., Kulikowich, J.M.: Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher 40(5), 223–234 (2011)
9. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. pp. 576–587. VLDB Endowment (2004)
10. Horne, B.D., Adali, S.: This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. arXiv preprint arXiv:1703.09398 (2017)
11. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 328–339. Association for Computational Linguistics, Melbourne, Australia (Jul 2018)
12. Kumar, S., West, R., Leskovec, J.: Disinformation on the web: Impact, characteristics, and detection of Wikipedia hoaxes. In: Proceedings of the 25th International Conference on World Wide Web. pp. 591–602. International World Wide Web Conferences Steering Committee (2016)
13. Paul, M.J., Zhai, C., Girju, R.: Summarizing contrastive viewpoints in opinionated text. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. pp. 66–76. Association for Computational Linguistics (2010)
14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
15. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018)
16. Popat, K., Mukherjee, S., Strötgen, J., Weikum, G.: Credibility assessment of textual claims on the web. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. pp. 2173–2178. ACM (2016)
17. Rajkumar, P., Desai, S., Ganguly, N., Goyal, P.: A novel two-stage framework for extracting opinionated sentences from news articles. In: Proceedings of TextGraphs-9: the Workshop on Graph-based Methods for Natural Language Processing. pp. 25–33 (2014)
18. Vapnik, V.: The Support Vector Method of Function Estimation. pp. 55–85. Springer US, Boston, MA (1998). https://doi.org/10.1007/978-1-4615-5703-6_3
19. Wilcoxon, F., Katti, S., Wilcox, R.A.: Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics 1, 171–259 (1970)
20. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237 (2019)
21. Yu, H., Hatzivassiloglou, V.: Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. pp. 129–136. Association for Computational Linguistics (2003)