    Too Many Claims to Fact-Check: Prioritizing Political
           Claims Based on Check-Worthiness

                        Yavuz Selim Kartal, Mucahid Kutlu, and Busra Guvenen
                                 Department of Computer Engineering
                            TOBB University of Economics and Technology
                                            Ankara, Turkey
                            {ykartal, m.kutlu, bguvenen}@etu.edu.tr



Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Title of the Proceedings: "Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland". Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi.

                       Abstract

The massive amount of misinformation spreading on the Internet on a daily basis has enormous negative impacts on societies. Therefore, we need automated systems that help fact-checkers in the combat against misinformation. In this paper, we propose a model that prioritizes claims based on their check-worthiness. We use a BERT model with additional features including domain-specific controversial topics, word embeddings, and others. In our experiments, we show that our proposed model outperforms all state-of-the-art models on both test collections of the CLEF Check That! Lab in 2018 and 2019. We also conduct a qualitative analysis to shed light on detecting check-worthy claims. We suggest that requesting rationales behind judgments is needed to understand the subjective nature of the task and problematic labels.

1    Introduction

The World Economic Forum (WEF) ranked massive digital misinformation as one of the top global risks in 2013.^1 Unfortunately, the foresight of the WEF seems right, as we have encountered many unpleasant incidents due to misinformation spread on the Internet since 2013, such as the gunfight due to the "Pizzagate" fake news^2 and increased mistrust towards vaccines.^3

In order to combat misinformation and its negative outcomes, fact-checking websites (e.g., Snopes^4) detect the veracity of claims spread over the Internet and share their findings with their readers [5]. However, fact-checking is an extremely time-consuming process, taking around one day for a single claim [12]. While these invaluable journalistic efforts help to reduce the spread of misinformation, Vosoughi et al. [22] report that false news spreads eight times faster than true news. Therefore, systems helping fact-checkers are urgently needed in the combat against misinformation.

As human fact-checkers are not able to detect the veracity of all claims spread on the Internet, it is vital that they spend their precious time fact-checking the most important claims. Therefore, an automatic system that monitors social media posts, news articles, and statements of politicians, and detects the check-worthy claims, is needed. A number of researchers have focused on this important problem (e.g., [12, 19, 13]). Furthermore, the Conference and Labs of the Evaluation Forum (CLEF) Check That! Lab (CTL) has been organizing shared tasks on detecting check-worthy claims since 2018 [18, 2, 4]. In CTL tasks, a political debate or a transcribed speech is separated into sentences, and participants are asked to rank the sentences according to their priority to be fact-checked. In CTL'20 [3], tweets have also been used for this task.

In this paper, we propose a ranking model that prioritizes claims based on their check-worthiness. We propose a BERT-based hybrid system in which we first fine-tune a BERT [6] model for this task, and then use its prediction and other features we define in a logistic regression model to prioritize the claims. The features we use include word embeddings, the presence of comparative and superlative adjectives, domain-specific controversial topics, and others. Our model achieves 0.255 and 0.176 mean average precision (MAP) scores on the CTL'18 and CTL'19 datasets, respectively, outperforming all state-of-the-art models including participants of the corresponding shared tasks, ClaimBuster [12], BERT, XLNET [24], and Lespagnol et al.'s [15] model. We share our code for the reproducibility of our results.^5

^1 http://reports.weforum.org/global-risks-2013
^2 www.nytimes.com/2016/12/05/business/media/comet-ping-pong-pizza-shooting-fake-news-consequences.html
^3 www.washingtonpost.com/news/wonk/wp/2014/10/13/the-inevitable-rise-of-ebola-conspiracy-theories
^4 https://www.snopes.com/
^5 https://github.com/YSKartal/political-claims-checkworthiness
2    Related Work

As the US presidential election in 2016 is one of the main motivating reasons for fact-checking studies, prior work mostly used debates and other speeches of US politicians as their datasets (e.g., [12, 15]). Therefore, the majority of studies have focused on English. The Arabic datasets used in prior work ([13, 18]) are just translations of English datasets.

ClaimBuster [12] is one of the first studies on check-worthiness. ClaimBuster is a supervised model using many features including part-of-speech (POS) tags, named entities, sentiment, and TF-IDF representations of claims. TATHYA [19] uses topics, POS tuples, entity history, and bag-of-words as features. The topics are detected by an LDA model trained on transcripts of all presidential debates from 1976 to 2016.

Gencheva et al. [8] propose a neural network model with a long list of sentence-level and contextual features including sentiment, named entities, word embeddings, topics, contradictions, and others. Jaradat et al. [13] use roughly the same features as Gencheva et al., but extend the model for Arabic. In follow-up work, Vasileva et al. [21] propose a multi-task learning model to detect whether a claim will be fact-checked by at least five (out of nine) pre-selected reputable fact-checking organizations.

CLEF has been organizing Check That! Labs (CTL) since 2018. Seven teams participated in the check-worthiness task of CTL'18. The participating teams used various learning models such as recurrent neural networks (RNN) [10], multilayer perceptrons [26], random forests (RF) [1], k-nearest neighbors (kNN) [9], and Support Vector Machines (SVM) [25] with different sets of features such as bag-of-words [26], character n-grams [9], POS tags [26, 10, 25], verbal forms [26], named entities [26, 25], syntactic dependencies [26, 10], and word embeddings [26, 10, 25]. On the English dataset, the Prise de Fer team [26] achieved the best MAP scores using almost every feature mentioned before with SVM-multilayer perceptron learning.

In 2019, 11 teams participated in the check-worthiness task of CTL'19. Participants used varying models such as LSTM, SVM, naive Bayes, and logistic regression (LR) with many features including the readability of sentences and their context [2]. The Copenhagen team [11] achieved the best overall performance using syntactic dependencies and word embeddings with a weakly supervised LSTM model.

Lespagnol et al. [15] investigated various learning models such as SVM, LR, and Random Forests, with a long list of features including word embeddings, POS tags, syntactic dependency tags, entities, and "information nutritional" features which represent the factuality, emotion, controversy, credibility, and technicality of statements. In our experiments, we show that our model outperforms Lespagnol et al.'s on both test collections.

Our proposed approach is distinguished from existing studies as follows. 1) We propose a BERT-based hybrid model which uses the fine-tuned BERT's output together with many other features. 2) As the topic might be a strong indicator of check-worthiness, many studies used various types of topics such as general topics [25], globally controversial topics [15], and topics discussed in old US presidential debates [19]. However, we believe that the check-worthiness of a claim depends on local and present controversial topics. Thus, we use a list of hand-crafted controversial topics related to US elections. 3) We also use two further sets of features: a hand-crafted list of words and the presence of comparative and superlative adjectives and adverbs.

3    Proposed Approach

We propose a supervised model with a number of features described below. We investigate various learning models including LR, SVM, random forests, MART [7], and LambdaMART [23]. Now we explain the features we use.

BERT: We first fine-tune BERT using the respective training data. Next, we use its prediction value as one of our features.
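To make the BERT feature concrete, the following is a minimal sketch of the fine-tuning step, assuming the ktrain settings reported in Section 4.1 (1cycle policy, maximum learning rate of 2e-5); the sequence length, epoch count, batch size, and variable names such as train_sents are our own illustrative assumptions, not values from the paper.

```python
# Sketch of the BERT feature: fine-tune with ktrain, then use the predicted
# probability as a feature. train_sents/train_labels/test_sents are
# hypothetical; maxlen, epochs, and batch size are assumptions.
import ktrain
from ktrain import text

(x_tr, y_tr), (x_va, y_va), preproc = text.texts_from_array(
    x_train=train_sents, y_train=train_labels,
    class_names=["not-check-worthy", "check-worthy"],
    preprocess_mode="bert", maxlen=64, val_pct=0.1)

model = text.text_classifier("bert", train_data=(x_tr, y_tr), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_tr, y_tr),
                             val_data=(x_va, y_va), batch_size=32)
learner.fit_onecycle(2e-5, 3)        # 1cycle policy, max learning rate 2e-5 [20]

predictor = ktrain.get_predictor(learner.model, preproc)
bert_feature = predictor.predict_proba(test_sents)[:, 1]  # P(check-worthy)
```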
Word Embeddings (WE): Words that are semantically and syntactically similar tend to be close in the embedding space, allowing us to capture similarities between claims. We represent a sentence as the average vector of its words, excluding the out-of-vocabulary ones. Word embedding vectors are extracted from the pre-trained word2vec model [17], which has a feature vector size of 300.

Controversial Topics (CT): Sentences about controversial topics might include check-worthy claims. Lespagnol et al. [15] use a list of controversial issues compiled from the Wikipedia article "Wikipedia:List of controversial issues". However, the list they use covers many controversial issues which have very limited coverage in current US media, such as "Lebanon", "Chernobyl", and "Spanish Civil War", while the data we use are about recent US politics. We believe that the controversy of a topic depends on the society. For instance, US politicians propose different policies for immigrants, yielding heated discussions among them and their supporters. On the other hand, US domestic politics is much less interested in the refugee crisis in the Mediterranean Sea than European countries are. Therefore, a claim about Mexican immigrants might be check-worthy for people living in the US, while they might find claims about refugees taking a dangerous path to reach Europe not check-worthy. In contrast, people living in Europe might consider the latter case check-worthy and the former one not check-worthy. In addition, the controversy of a topic might change over time. For instance, the Cold War (which also exists in that Wikipedia list) might have been one of the most discussed topics in US politics before the collapse of the Soviet Union in 1991. However, nowadays it is rarely covered by US media. Therefore, we propose using controversial issues related to the data we use, instead of any controversial issue around the globe and throughout history.

Firstly, we identified 11 major topics in current US politics including immigration, gun policy, racism, education, Islam, climate change, health policy, abortion, LGBT, terror, and the wars in Afghanistan and Iraq. For each topic, we identified related words and calculated the average of these words using their word embedding vectors. For instance, for the immigration topic, we used the words "immigrants", "illegal", "borders", "Mexican", "Latino", and "Hispanic".

In this feature set of size 11, we calculate the cosine similarity between sentences and each topic using their vector representations. We use the average of word embeddings for sentences, excluding stopwords with NLTK [16].
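As an illustration of the WE and CT features described above, the sketch below averages word2vec vectors into a sentence vector and scores it against each topic centroid. The word2vec file path and the whitespace tokenization are assumptions; the seed list shown is the immigration example from the text, standing in for all 11 topics.

```python
# Sketch of the WE and CT features: average word2vec vectors [17] and
# cosine similarity to topic centroids. The model path, tokenization, and
# the single seed list are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords          # stopword filtering via NLTK [16]

kv = KeyedVectors.load_word2vec_format("word2vec-300d.bin", binary=True)
STOP = set(stopwords.words("english"))
TOPICS = {"immigration": ["immigrants", "illegal", "borders",
                          "Mexican", "Latino", "Hispanic"]}  # 1 of 11 topics

def avg_vector(words):
    """WE: mean of in-vocabulary word vectors (zeros if none remain)."""
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def ct_features(sentence):
    """CT: cosine similarity to each topic centroid, stopwords removed."""
    tokens = [t for t in sentence.split() if t.lower() not in STOP]
    sent_vec = avg_vector(tokens)
    return [cosine(sent_vec, avg_vector(seeds)) for seeds in TOPICS.values()]
```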
Comparative & Superlative (CS): Politicians frequently use sentences comparing themselves with others, because each candidate tries to convince the public that s/he is better than his/her opponent. Therefore, the comparisons in political speeches might impact people's voting decisions and, thereby, it might be important to check their veracity. Thus, in this feature, we use the number of comparative and superlative adjectives and adverbs in sentences.

Handcrafted Word List (HW): Particular words convey important information about check-worthiness because 1) a word might be related to an important topic (e.g., "unemployment"), 2) it might represent a numerical value, increasing the factuality of the sentence (e.g., "percent"), or 3) its semantics might represent a comparison between two cases (e.g., "increase" and "decrease"). Thus, we first identified 66 words by analyzing the training datasets of CTL'18 and CTL'19. In this feature, we check whether there is an overlap between the lemmas of the selected words and the lemmas of the words in the respective sentence.

Verb Tense (VT): We cannot detect the veracity of claims about the future; we can only verify claims about the present or past. Thus, the verb tense of sentences might be an effective indicator for the check-worthiness of claims. This feature vector represents the existence or absence of each tense in the predicate of the claims.

Part-of-Speech (POS) Tags: If a sentence does not contain any informative words, then it is less likely to be check-worthy. To represent the information load of a claim, we use the numbers of nouns, verbs, adverbs, and adjectives, separately.
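The remaining count-based feature groups can all be derived from a single SpaCy parse. Below is a minimal sketch under stated assumptions: the model name, the three-word stand-in for the (unpublished) 66-word HW list, and the tense values checked are our own choices.

```python
# Sketch of the CS, HW, VT, and POS feature groups from one SpaCy parse.
# The model name, the tiny HW sample, and the tense mapping are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")
HW_LEMMAS = {"unemployment", "percent", "increase"}  # stand-in for the 66 words

def surface_features(sentence):
    doc = nlp(sentence)
    # CS: number of comparative/superlative adjectives and adverbs
    cs = sum(t.tag_ in {"JJR", "JJS", "RBR", "RBS"} for t in doc)
    # HW: lemma overlap between the sentence and the handcrafted word list
    hw = int(any(t.lemma_.lower() in HW_LEMMAS for t in doc))
    # VT: presence/absence of each tense on the root predicate
    root_tenses = set(doc[:].root.morph.get("Tense"))
    vt = [int(tense in root_tenses) for tense in ("Past", "Pres", "Fut")]
    # POS: separate counts of nouns, verbs, adverbs, and adjectives
    pos = [sum(t.pos_ == p for t in doc) for p in ("NOUN", "VERB", "ADV", "ADJ")]
    return cs, hw, vt, pos

print(surface_features("Unemployment is lower than ever before."))
```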
4    Experiments

4.1    Experimental Setup

Implementation: We use the ktrain library^6 to fine-tune the BERT model with the 1cycle learning rate policy and a maximum learning rate of 2e-5 [20]. We use SpaCy^7 for all syntactic and semantic analyses. We use the Scikit toolkit^8 for the implementations of SVM, Random Forest (RF), and LR. The parameter settings of the learning algorithms are as follows. We use default parameters for SVM. We set the number of trees to 50 and the maximum depth to 5 for RF. We use the multinomial and lbfgs settings for LR. For the MART and LambdaMART models, we use the RankLib^9 library, and set the number of trees and leaves to 50 and 2, respectively.

^6 https://pypi.org/project/ktrain/
^7 https://spacy.io/
^8 https://scikit-learn.org
^9 https://sourceforge.net/p/lemur/wiki/RankLib/

Data: We evaluate the performance of our system with the two datasets used in CTL'18 and CTL'19. Details about them are given in Table 1. CTL'18 consists of transcripts of debates and speeches, while CTL'19 also contains press conferences and posts.

Table 1: Details about the CTL'18 and CTL'19 datasets.

                          CTL'18       CTL'19
  Train   # Docs              3           19
          # Sentences      4,064       16,421
          # CW Claims    90 (2.2%)   433 (2.6%)
  Test    # Docs              7            7
          # Sentences      4,882        7,079
          # CW Claims   192 (3.9%)   110 (1.6%)

Baselines: We compare our model against the following models.

  • Lespagnol et al. [15]: Lespagnol et al. report the best results on CTL'18 so far. Therefore, we use their model as one of our baselines. In order to get its results for CTL'19, we contacted the authors to obtain their code. The authors provided us with the values of the "information nutrition" features and instructions on how to generate the WE embeddings. We implemented their method using the values they shared and following their instructions.^10

  • ClaimBuster: We use the popular pretrained ClaimBuster API^11 [12], which is trained on a dataset covering different debates that do not exist in CTL'18 and CTL'19.

  • BERT: As it has been reported that BERT-based models outperform state-of-the-art models in various NLP tasks, we compare our model against using only BERT. We fine-tune the BERT model using the respective training dataset and predict the check-worthiness of claims using the fine-tuned model.

  • XLNET: It has been reported that XLNet outperforms BERT in various NLP tasks [24]. Thus, we use XLNet for this task, fine-tuning it with the respective training dataset.

  • Best of CTL'18 and CTL'19: For each dataset, we also report the performance of the best systems that participated in the shared tasks, i.e., the Prise de Fer team [26] and the Copenhagen team [11] for CTL'18 and CTL'19, respectively.

^10 It is noteworthy that we obtain a 0.2115 MAP score on CTL'18 with our implementation of their method, while they report a 0.23 MAP score in their paper. We are not aware of any bug in our code; the performance difference might be due to different versions of the same library. Nevertheless, the results we present for their method on CTL'19 should be taken with a grain of salt.
^11 https://idir.uta.edu/claimbuster/

Training & Testing: We use the same setup as CTL'18 and CTL'19 to maintain a fair comparison with the baselines. We follow the evaluation method used in CTL'18 and CTL'19: we calculate average precision (AP), R-precision (RP), precision@5 (P@5), and precision@10 (P@10) for each file (i.e., debate, speech) and then report the average performance.
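For concreteness, the per-file evaluation just described can be sketched as below; this is our own helper for illustration, not the official CTL scoring script.

```python
# Sketch of the evaluation protocol: AP per file, then averaged into MAP.
import numpy as np

def average_precision(scores, labels):
    """AP for one file: sentences ranked by score (desc), labels are 0/1."""
    order = np.argsort(scores)[::-1]           # highest-scored sentence first
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank               # precision at this relevant rank
    return total / max(sum(labels), 1)

def mean_average_precision(files):
    """MAP: mean of per-file APs; `files` = [(scores, labels), ...]."""
    return float(np.mean([average_precision(s, l) for s, l in files]))
```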
                                                                     causes performance decrease in both test collections.
4.2    Experimental Results

In this section, we present experimental results on the test data using different sets of features and varying learning algorithms.

Comparison of Learning Algorithms. In our first set of experiments, we evaluate logistic regression (LR), SVM, random forest (RF), MART, and LambdaMART models using all features defined in Section 3. Table 2 shows the MAP scores of each model. Interestingly, LR outperforms all other models. In a similar experiment conducted by Lespagnol et al. [15], they also report that LR yields higher results than the other models they used. Therefore, we use LR in our subsequent experiments.

Table 2: MAP scores for varying models using all features.

  Learning Model    CTL'18    CTL'19
  LR                 .2303     .1775
  RF                 .1468     .1542
  SVM                .1716     .1346
  MART               .1764     .1732
  LambdaMART         .0671     .0564

Feature Ablation. In order to analyze the effectiveness of the features we use, we apply two techniques: 1) the leave-one-out methodology, in which we exclude one feature group and calculate the model's performance without it, and 2) the use-only-one methodology, in which only a single feature group is used for prediction. The results are shown in Table 3.

Table 3: MAP scores for varying feature sets.

           Leave-One-Out                    Use-Only-One
  Features    CTL'18    CTL'19    Features    CTL'18    CTL'19
  All          .2303     .1775
  All-CS       .2239     .1765    CS           .0751     .0604
  All-BERT     .2211     .1580    BERT         .1850     .1701
  All-VT       .2547     .1761    VT           .1007     .0598
  All-HW       .2126     .1727    HW           .1530     .1043
  All-WE       .1756     .1786    WE           .2068     .1356
  All-CT       .2170     .1739    CT           .1363     .1046
  All-POS      .2283     .1767    POS          .1048     .0631

From the results in Table 3, we see that the features have different effects on each dataset. BERT is the most effective feature on CTL'19. However, in contrast to our expectations, WE seems to be a more effective feature than BERT on CTL'18. On CTL'18, the performance decreases by nearly 25% when WE is excluded. In addition, we achieve the highest MAP score when we use only WE. On CTL'19, we achieve a 0.1356 MAP score using only WE, showing that it is more effective than the other features except BERT. However, the performance of our model increases when we exclude WE (0.1775 vs. 0.1786 in Table 3), suggesting that the information it contributes is covered by other features on CTL'19.

Excluding the handcrafted word list (HW) features causes a performance decrease on both test collections. In addition, using only the HW features outperforms all participants of CTL'18 (0.1530 vs. 0.1332 in Table 3). These promising results suggest that expanding this list might lead to further performance increases.

Our results also suggest that Controversial Topics (CT) are effective features. Excluding them decreases the performance of the model on both collections, while using only the CT features yields high scores, slightly outperforming the best performing system on CTL'18 (0.1363 vs. 0.1332 in Table 3).

Excluding the CS and POS features also slightly decreases the performance of the model on both test collections. Regarding the verb tense (VT) features, our results are mixed: excluding them causes a slight performance decrease on CTL'19, but yields a higher performance score on CTL'18.
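The two ablation protocols behind Table 3 amount to a simple loop over feature-group column indices. The sketch below illustrates this; the column layout is hypothetical, since the exact dimensionality of each feature group is not reported in the paper.

```python
# Sketch of the leave-one-out / use-only-one protocols (Table 3).
# GROUPS maps each feature group to hypothetical columns of the matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

GROUPS = {"BERT": range(0, 1), "WE": range(1, 301), "CT": range(301, 312),
          "CS": range(312, 313), "HW": range(313, 314),
          "VT": range(314, 317), "POS": range(317, 321)}

def fit_lr(X, y):
    # multinomial + lbfgs, the LR settings reported in Section 4.1
    return LogisticRegression(multi_class="multinomial", solver="lbfgs",
                              max_iter=1000).fit(X, y)

def ablation(X_tr, y_tr, score_fn):
    """score_fn(model, cols) -> MAP over the test files for those columns."""
    results = {}
    for name in GROUPS:
        rest = [c for g, r in GROUPS.items() if g != name for c in r]
        only = list(GROUPS[name])
        results[name] = (score_fn(fit_lr(X_tr[:, rest], y_tr), rest),
                         score_fn(fit_lr(X_tr[:, only], y_tr), only))
    return results   # {group: (leave-one-out MAP, use-only-one MAP)}
```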



Table 4: Comparison with competing models. The * sign indicates results obtained from our implementation of the respective competing model.

                                    CTL'18                          CTL'19
  Model                   MAP     RP      P@5     P@10     MAP     RP      P@5     P@10
  BERT                   .1850   .2218   .3142   .2857    .1701   .1945   .2571   .2429
  XLNET                  .1974   .2393   .2857   .2571    .0932   .0770   .1429   .1143
  Lespagnol et al. [15]  .230    .254    .314    .2857*   .1292*  .1347*  .1714*  .2000*
  Prise de Fer Team      .1332   .1352   .2000   .1429      -       -       -       -
  Copenhagen Team          -       -       -       -      .1660   .4176   .2571   .2286
  ClaimBuster            .2003   .2162   .2571   .2429    .1329   .1555   .1714   .2000
  Our Model              .2547   .2579   .4000   .3429    .1761   .2028   .2571   .2143

Comparison Against Baselines. We pick the model that includes all features except VT as our primary model, because it achieves the highest MAP score on average. We compare our primary model with the baselines. The results are presented in Table 4.

Our proposed model outperforms all other models on all evaluation metrics on CTL'18. On CTL'19, our proposed model achieves the highest MAP score, which is the official metric used in CTL. The BERT model outperforms the other models on P@10 on CTL'19. Regarding the P@5 metric, our model, BERT, and the Copenhagen team achieve the same highest score of 0.2571. Regarding RP, the Copenhagen team achieves the highest score. Overall, our model outperforms all other models on the official evaluation metric of CTL, while BERT and the Copenhagen team [11] also achieve comparable performance on CTL'19.

5    Qualitative Analysis

In this section, we present our qualitative analysis of the output of our primary model. For each input file, we rank the claims based on their check-worthiness and then detect the not-check-worthy claim with the highest rank. Table 5 shows these not-check-worthy statements for each file, with our system's ranking and the speaker of the statement.

The statement in Row 1 is a claim about the future. Our model with verb tense could rank this statement lower, but our primary model does not use the verb tense features because they yield lower performance on average. In Row 2, the statement is very complex with many relative clauses, perhaps decreasing the performance of the BERT model and WE features in representing the statement. In Row 3, our model makes an obvious mistake and ranks a statement which does not even have a predicate at a very high rank. Perhaps our model falls short here because the word "jobs" indicates that the statement is about unemployment, which is one of the controversial topics we defined.

As reported by Vasileva et al. [21], fact-checking organizations investigate different claims, with very minimal overlap between selected claims. We observe this subjective nature of annotations in Rows 4-14, because all statements are actually factual claims and some of them might also be considered check-worthy. For instance, the statements in Rows 8, 11, and 13 are clearly said to change people's voting decisions. In addition, almost all statements are about economics, which is an important factor in people's votes. Therefore, checking their veracity might also be important to avoid misinforming the public. Nevertheless, these examples show the subjective nature of check-worthiness annotations.
Table 5: Highest ranked not-check-worthy statements from each test document by our primary model.

  Row  Rank  File Name                 Speaker    Statement
  1    4     task1-en-file1            CLINTON    The plan he has will cost us jobs and possibly lead to another Great Recession.
  2    1     task1-en-file2            CLINTON    Then he doubled down on that in the New York Daily News interview, when asked whether he would support the Sandy Hook parents suing to try to do something to rein in the advertising of the AR-15, which is advertised to young people as being a combat weapon, killing on the battlefield.
  3    1     task1-en-file3            TRUMP      Jobs, jobs, jobs.
  4    2     task1-en-file4            TRUMP      Before that, Democrat President John F. Kennedy championed tax cuts that surged the economy and massively reduced unemployment.
  5    3     task1-en-file5            TRUMP      The world's largest company, Apple, announced plans to bring $245 billion in overseas profits home to America.
  6    1     task1-en-file6            TRUMP      America has lost nearly one-third of its manufacturing jobs since 1997, following the enactment of disastrous trade deals supported by Bill and Hillary Clinton.
  7    1     task1-en-file7            TRUMP      Our trade deficit in goods with the world last year was nearly $800 billion dollars.
  8    1     20151219 3 dem            O'MALLEY   We increased education funding by 37 percent.
  9    1     20160129 7 gop            KASICH     We're up 400,000 jobs.
  10   1     20160311 12 gop           TAPPER     Critics say these deals are great for corporate America's bottom line, but have cost the U.S. at least 1 million jobs.
  11   3     20180131 state union      TRUMP      Unemployment claims have hit a 45-year low.
  12   1     20181015 60 min           TRUMP      --if you think about it, so far, I put 25% tariffs on steel dumping, and aluminum dumping 10%.
  13   3     20190205 trump state      TRUMP      Unemployment for Americans with disabilities has also reached an all-time low.
  14   1     20190215 trump emergency  TRUMP      They have the largest number of murders that they've ever had in their history - almost 40,000 murders.

In addition to subjective judgments, we also noticed inconsistencies within the annotations. For instance, the statement in Row 9 ("We're up 400,000 jobs.") also exists in the "20160311 12 gop" file, but is annotated there as "check-worthy". In addition, there exist semantically very similar statements with different labels. For instance, Donald Trump's statement "I did not support the war in Iraq" in the 1079th line of the 20160926 1pres file is labeled as "not-check-worthy", while his statement in the 1086th line of the same file, "I was against the war in Iraq", is labeled as "check-worthy". Both statements have similar meanings and exist in the same context (i.e., their positions in the file are very close). Therefore, both might be expected to carry the same label. As a counter-argument, "being against" suggests an action, while "not supporting" does not require any action to be taken. Thus, different annotations for similar statements might again be due to the subjective nature of check-worthiness judgments.

Furthermore, there are also annotations whose labels we strongly disagree with. For instance, in the 20170315 nashville file (training data of CTL'19), Donald Trump's statement "We're going to put our auto industry back to work" is labeled as check-worthy. However, the statement is about the future and cannot be verified.

Overall, our qualitative analysis suggests that annotating the check-worthiness of claims is a subjective task and that the annotations might be noisy. Kutlu et al. [14] show that using text excerpts within documents as rationales helps in understanding disagreements in relevance judging. Similarly, we might request rationales behind check-worthiness annotations to understand whether a label is due to a human judging error or to the subjective nature of the annotation task. Furthermore, rationales behind these annotations might help us develop effective solutions for this challenging problem.

6    Conclusion

In this paper, we presented a supervised method which prioritizes claims based on their check-worthiness. We use a logistic regression classifier with features including the state-of-the-art language model BERT, domain-specific controversial topics, pretrained word embeddings, a handcrafted word list, POS tags, and comparative-superlative clauses. In our experiments on CTL'18 and CTL'19, we show that our proposed model outperforms all state-of-the-art models on both collections. We show that BERT's performance can be increased by using additional features for this task. In our feature ablation study, the BERT model and word embeddings appear to be the most effective features, while the handcrafted word list and domain-specific controversial topics also seem effective. Based on our qualitative analysis, we believe that requesting rationales for the check-worthiness annotations would further help in developing effective systems.

In the future, we plan to work on weak supervision techniques to extend the training dataset. With the increased data, we will be able to explore using deep learning techniques for this task. In addition, we plan to extend our study to detect check-worthy claims on social media platforms, because these are the channels where most people are affected by misinformation. Moreover, working on different languages and building a multilingual model is an important research direction in the combat against misinformation.
                                                                and P. Rosso. UPV-INAOE - check that: Prelim-
References

 [1] R. Agez, C. Bosc, C. Lespagnol, N. Petitcol, and J. Mothe. IRIT at CheckThat! 2018. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.

 [2] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, and G. Da San Martino. Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness. In CEUR Workshop Proceedings, 2019.

 [3] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, and Z. S. Ali. Overview of CheckThat! 2020: Automatic identification and verification of claims in social media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 215–236, Cham, 2020. Springer International Publishing.

 [4] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. D. S. Martino, M. Hasanain, R. Suwaileh, and F. Haouari. CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media. Advances in Information Retrieval, 12036:499–507, 2020.

 [5] F. Cherubini and L. Graves. The rise of fact-checking sites in Europe. Reuters Institute for the Study of Journalism, University of Oxford, 2016.

 [6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

 [7] J. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.

 [8] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, and I. Koychev. A context-aware approach for detecting worth-checking claims in political debates. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 267–276, 2017.

 [9] B. Ghanem, M. Montes-y-Gómez, F. M. R. Pardo, and P. Rosso. UPV-INAOE - Check That: Preliminary approach for checking worthiness of claims. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.

[10] C. Hansen, C. Hansen, J. G. Simonsen, and C. Lioma. The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab. In CLEF, 2018.

[11] C. Hansen, C. Hansen, J. G. Simonsen, and C. Lioma. Neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, 2019.

[12] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, and M. Tremayne. ClaimBuster: The first-ever end-to-end fact-checking system. PVLDB, 10:1945–1948, 2017.

[13] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, and P. Nakov. ClaimRank: Detecting check-worthy claims in Arabic and English. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 26–30, 2018.

[14] M. Kutlu, T. McDonnell, Y. Barkallah, T. Elsayed, and M. Lease. Crowd vs. expert: What can relevance judgment rationales teach us about assessor disagreement? In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 805–814. ACM, 2018.

[15] C. Lespagnol, J. Mothe, and M. Z. Ullah. Information nutritional label and word embedding to estimate information check-worthiness. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 941–944. ACM, 2019.

[16] E. Loper and S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics, 2002.

[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[18] P. Nakov, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. Màrquez, W. Zaghouani, P. Atanasova, S. Kyuchukov, and G. Da San Martino. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 372–387, 2018.

[19] A. Patwari, D. Goldwasser, and S. Bagchi. TATHYA: A multi-classifier system for detecting check-worthy statements in political debates. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2259–2262. ACM, 2017.

[20] L. N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. ArXiv, abs/1803.09820, 2018.

[21] S. Vasileva, P. Atanasova, L. Màrquez, A. Barrón-Cedeño, and P. Nakov. It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2019.

[22] S. Vosoughi, D. Roy, and S. Aral. The spread of true and false news online. Science, 359(6380):1146–1151, 2018.

[23] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures. Inf. Retr., 13(3):254–270, June 2010.

[24] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764, 2019.

[25] K. Yasser, M. Kutlu, and T. Elsayed. bigIR at CLEF 2018: Detection and verification of check-worthy political claims. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, 2018.

[26] C. Zuo, A. Karakas, and R. Banerjee. A hybrid recognition system for check-worthy claims using heuristics and supervised learning. In CLEF, 2018.