         The IPIPAN Team Participation in the
        Check-Worthiness Task of the CLEF2019
                  CheckThat! Lab

                          Jakub Gąsior and Piotr Przybyła

      Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
                               j.gasior@ipipan.waw.pl
                             p.przybyla@ipipan.waw.pl



        Abstract. This paper describes the participation of the IPIPAN team
        in the CLEF-2019 CheckThat! Lab, which focuses on the automatic identification
        and verification of claims. We participated in Task 1, which concerns assessing
        the check-worthiness of claims in political debates by identifying and ranking
        the sentences that should be prioritized for fact-checking. We proposed a
        logistic regression-based classifier using features such as vector representations
        of sentences, Part-of-Speech (POS) tags, named entities, and sentiment scores.
        In the official evaluation, our best performing run was ranked 3rd out of
        12 teams.


        Keywords: Information retrieval, Fact-checking, Logistic regression.


1     Introduction

The recent spread of misinformation in political debates and media has stimu-
lated further research in fact-checking: the task of assessing the truthfulness of
a claim.
    The CLEF-2019 CheckThat! Lab [6] aims at streamlining a typical fact-
checking pipeline consisting of the following steps:

 – Identifying check-worthy text fragments (Task 1) [4];
 – Retrieving evidence supporting the selected claims (Task 2A) [7];
 – Determining whether a claim is likely true or likely false by comparing it
   against the retrieved evidence (Task 2B) [7].

   This report details our proposed methods and results for Task 1, where we
focused on the English part [4]. The overall aim of this task was to identify
check-worthy claims and rank them according to perceived worthiness for fact-
checking.
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 Septem-
    ber 2019, Lugano, Switzerland.
    The remainder of this paper is organized as follows: In Section 2, we review
work related to identifying check-worthy claims and to fact-checking itself. In
Section 3, we provide a detailed description of the task and discuss the datasets
and performance metrics. Section 4 details the proposed approach and the
evaluation results. Finally, Section 5 concludes the paper.


2   State of the Art

Automating the process of fact-checking was first discussed in the context
of computational journalism [5], where the authors outline a vision for a system
to support mass collaboration of investigative journalists and concerned citizens.
They discuss several features of such a system and highlight important database
research challenges such as privacy, trust, authority, data mining, and information
retrieval.
    Thorne and Vlachos [16] survey automated fact-checking research stemming
from natural language processing and related disciplines, unifying the task
formulations and methodologies across papers and authors. Similar work in the
area of political debates was introduced in [18]. The authors detail the construction
of a publicly available dataset of statements fact-checked by journalists online
and discuss baseline approaches and the challenges that need to be addressed.
Similar datasets were later released [13, 19], where the authors collated labeled
claims from PolitiFact. Wang created a dataset of almost 13 thousand claims
with additional meta-data such as the speaker's affiliation and the context in
which the claim appears [19]. Rashkin and Choi later supplemented the PolitiFact
dataset with numerous news articles deemed hoaxes according to US News &
World Report in order to build a prediction model [13].
    One of the biggest datasets of this kind was released in [17], where the authors
presented a dataset containing over 185 thousand claims generated by altering
sentences extracted from Wikipedia and subsequently verified without knowledge
of the sentences they were derived from. These claims were classified as Supported,
Refuted, or NotEnoughInfo by annotators; the accompanying baseline system
achieves 31.87% accuracy when a claim must be labeled together with the correct
evidence and 50.91% accuracy when the evidence is ignored.
    Similar work was introduced by Redi and Fetahu [14], who provided an
algorithmic assessment of Wikipedia’s verifiability. They developed models to
determine whether a statement requires a citation and to predict the citation
reason based on a custom taxonomy. The authors provided a complete evaluation
of the robustness of the proposed models across classes of Wikipedia articles
of varying quality, as well as on an additional dataset of claims annotated for
fact-checking. Unfortunately, the model could not reliably detect check-worthy
claims in these datasets, labeling most of them as negatives.
    One of the first complete tools in the area of assessing check-worthiness
was presented in [8, 9], where the authors proposed ClaimBuster, a fact-checking
platform that uses natural language processing and supervised learning to detect
important factual claims in political discourse. ClaimBuster uses sentiment,
sentence length, Part-of-Speech (POS) tags, and entity types as features in order
to rank claims from the least to the most check-worthy. The authors report an
average accuracy of 0.433 for sentences fact-checked by CNN and 0.438 for
sentences fact-checked by PolitiFact.
    A similar tool was introduced in [10], where the authors presented ClaimRank,
a multilingual automatic system for detecting check-worthy claims in a given
text. The model is trained on annotations from nine reputable fact-checking
organizations (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune,
The Guardian, and Washington Post), and thus it can mimic their claim selection
strategies. The authors achieved a Mean Average Precision of 0.323 for English
and 0.302 for Arabic.
    Finally, in [11] the authors present an approach based on universal sentence
representations, created in collaboration with Full Fact, an independent fact-checking
charity. They report an F1 score of 0.83, a 5% relative improvement over the
state-of-the-art ClaimBuster and ClaimRank methods discussed above.

3     Task Description
The objective of the task was to identify check-worthy claims in order to facilitate
manual fact-checking efforts by prioritizing the claims that fact-checkers should
consider first.

3.1   Datasets
The task organizers provided two datasets: a training set comprising 19 political
debates and speeches (16,421 sentences in total) and a testing set comprising
7 files (7,080 sentences in total). Each sentence was annotated with its speaker
and a binary check-worthiness label (0 or 1) determined by experts.
    The training set contained 440 annotated check-worthy sentences (2.68% of
the total), while the final testing set contained only 136 check-worthy sentences
(1.92% of the total).

3.2   Evaluation metrics
The task was evaluated according to the following metrics:
 – Average Precision - Precision computed at the rank of each check-worthy
   sentence in the ranked list and then averaged over the total number of
   check-worthy sentences;
 – Reciprocal Rank - The reciprocal of the rank of the first check-worthy
   sentence in the list of predictions sorted by score (in descending order);
 – Precision@N - Precision estimated over the first N sentences of the
   provided ranked list;
 – R-Precision - Precision at rank R, where R is the number of check-worthy
   (relevant) sentences in the evaluated set.
    The official measure used to rank the teams' submissions was the Mean
Average Precision (MAP), calculated over the provided debates (each with its
own separate prediction file).
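    To make the official metric concrete, the following minimal Python sketch
computes Average Precision for a single debate and MAP over several debates,
as defined above (function and variable names are ours, not the official scorer's):

```python
from typing import List


def average_precision(labels_by_rank: List[int]) -> float:
    """AP for one debate: labels_by_rank holds the gold 0/1 check-worthiness
    labels ordered by descending predicted score."""
    hits, precisions = 0, []
    for rank, label in enumerate(labels_by_rank, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this check-worthy rank
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(debates: List[List[int]]) -> float:
    """MAP: the mean of the per-debate Average Precision values."""
    return sum(average_precision(d) for d in debates) / len(debates)


# Toy example: check-worthy sentences ranked 1st and 3rd in the first debate
# and 2nd in the second debate -> MAP = (0.8333 + 0.5) / 2 ≈ 0.667.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0]]))
```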
4     Proposed Approach

In this section, we describe the details of our approach and present the evaluation
results.


4.1   Feature Design and Selection

The features we extracted for each sentence in the dataset can be divided into
the following categories:

 – Bag-of-Words N-Gram Representation of Sentences: The first step
   was vocabulary-based vectorization. We applied a term frequency–inverse
   document frequency (TF-IDF) transformation and built n-gram models (up
   to trigrams) over the dataset. After pruning the most common terms, we
   ended up with 1006 unigram features, 1177 bigram features, and 1186
   trigram features.
 – Vector Representation of Sentences: We employed the word2vec tool,
   which takes a text corpus as input and produces word vectors as output,
   using a model pretrained on the Google News archive [1]. To represent a
   sentence from the provided dataset, we first retrieve the 300-dimensional
   vector of each term in the sentence and then take the element-wise minimum,
   maximum, and average, resulting in a feature vector of 900 elements (a
   minimal sketch of this pooling is given after this list).
 – Types of Named Entities Detected: Recognizing named entities is one
   of the first steps towards information extraction; it seeks to locate and
   classify entities in text into predefined categories such as names of persons,
   organizations and locations, expressions of time, quantities, monetary
   values, and percentages. We employ the NLTK library [2] to extract a
   subset of 18 NER tags for each sentence.
 – Part-of-Speech (POS) Tags: We employ NLTK's POS tagger [2] to mark
   up individual words in a sentence with the corresponding Penn Treebank
   tag, based on both the word's definition and its relationship with adjacent
   and related words in the sentence. This results in a 36-dimensional feature
   vector.
 – Sentiment Scores: To determine the sentiment of each sentence, we use
   the BING, AFINN, and NRC lexicons [3] to extract the sentiment score of
   each term in the sentence, as well as an averaged sentiment score of the
   whole sentence. This results in a feature vector of 15 elements (11 tags from
   NRC indicating feelings or emotions, e.g., anger, fear, or joy; 2 tags from
   BING and AFINN; and the overall averaged sentiment of the whole sentence).
 – Statistical Analysis of Sentences: We also calculate basic metrics of each
   sentence after tokenization, such as word count, character count, average
   word density, punctuation count, upper case count, and title word count. It
   results in a feature vector consisting of 6 elements.
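    As an illustration of the sentence-embedding feature described above, the
following sketch builds the 900-element min/max/average vector for a sentence.
It assumes the gensim library and a local copy of the pretrained Google News
vectors [1]; the function name and file path are ours, chosen for illustration:

```python
import nltk
import numpy as np
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models needed by word_tokenize

# Assumes the pretrained 300-dimensional Google News embeddings [1] have been
# downloaded to the path below (an illustrative, not prescribed, location).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)


def sentence_vector(sentence: str) -> np.ndarray:
    """900-dim feature: element-wise min, max and mean of the word vectors."""
    vectors = [w2v[token] for token in word_tokenize(sentence) if token in w2v]
    if not vectors:
        return np.zeros(900)                 # fallback for out-of-vocabulary input
    stacked = np.vstack(vectors)             # shape: (n_tokens, 300)
    return np.concatenate([stacked.min(axis=0),
                           stacked.max(axis=0),
                           stacked.mean(axis=0)])
```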
4.2   Classifier
As our classifier, we selected an L1-regularized logistic regression model (LASSO),
also known as a sparse logistic regression model, in which the weight vector of
the classifier has a small number of nonzero values. Adding an L1 penalty on the
weights w,

\[
    \operatorname*{arg\,min}_{w,\,v} \; l_{\mathrm{avg}}(w, v) + \lambda \lVert w \rVert_1 , \tag{1}
\]

where l_avg is the average logistic loss, v is the intercept, and λ is a regularization
parameter, yields attractive properties such as feature selection and robustness
to noise [12, 15, 20].
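    As a concrete (though simplified) instance of such a sparse model, the sketch
below uses scikit-learn's logistic regression with an L1 penalty; its parameter C
corresponds to 1/λ in Eq. (1). The synthetic data and the value C=1.0 are
placeholders for illustration, not the settings used in our submissions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))       # stand-in feature matrix (rows = sentences)
y_train = rng.integers(0, 2, size=200)     # stand-in 0/1 check-worthiness labels

# L1-penalized (sparse) logistic regression; C is the inverse of λ in Eq. (1).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_train)[:, 1]  # scores used to rank sentences
print((clf.coef_ != 0).sum(), "nonzero weights out of", clf.coef_.size)
```

The sparsity of clf.coef_ is what gives the implicit feature selection mentioned
above: weights of uninformative features are driven exactly to zero.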
    We carried out multiple evaluation runs with different subsets of the features
discussed in Section 4.1 in order to select the best combination of predictors.
To select the model to employ in the testing phase, we performed Leave-One-Out
(LOO) cross-validation over the whole set of N = 19 training debates for various
combinations of features, training the model on N − 1 debates and testing it on
the remaining one. This process was repeated N times.
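    A leave-one-debate-out split of this kind can be expressed with scikit-learn's
LeaveOneGroupOut, treating the debate identifier as the group label. The sketch
below uses synthetic stand-in data and our own variable names:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))             # stand-in sentence features
y = rng.integers(0, 2, size=300)           # stand-in check-worthiness labels
debate_id = rng.integers(0, 19, size=300)  # which of the 19 debates a sentence belongs to

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=debate_id):
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    clf.fit(X[train_idx], y[train_idx])
    held_out_scores = clf.predict_proba(X[test_idx])[:, 1]
    # held_out_scores for the held-out debate feed the MAP/RR/R-Precision evaluation
```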
    Table 1 shows the Mean Average Precision, Reciprocal Rank, and R-Precision
scores achieved during the training phase. As can be seen, none of the analyzed
models achieved the best score on all measured performance metrics. As a result,
the top-scoring model for each performance metric was selected for the final
submission, i.e.:
 – Text2vec + NER + POS + Sentiment - Primary submission;
 – Text2vec + N-Gram (2) - Contrastive submission No. 1;
 – N-Gram (1) + NER + Sentiment - Contrastive submission No. 2.

4.3   Final Results
Twelve teams submitted a total of 25 runs to this task. Table 2 presents the
results of our three submission runs as well as the top-ranked submission from
the Copenhagen team.
    Overall, our primary submission was ranked third according to the official
measure (Mean Average Precision), sixth according to Reciprocal Rank, and
second according to R-Precision. These results allow us to conclude that the
proposed model was better at finding the most check-worthy claims than at
finding all the check-worthy claims in the provided texts. The precise reasons
for this behavior require further analysis.
    Surprisingly, our best performing model was one of the contrastive runs,
employing only text vectors and bigram features (Text2vec + N-Gram (2)). Our
primary submission instead combined text vectors with NER tags, POS tags, and
sentiment scores, which had a negative impact on the Mean Average Precision
and Reciprocal Rank scores.
    We attribute this result to a lower representation of NER features in the
final testing set. Furthermore, analysis of the testing set revealed that most of
the check-worthy claims had significantly more positive sentiment scores than
the claims in the training dataset (see Table 3). This also impacted the overall
performance of the submitted primary model.
       Table 1: Mean Average Precision (MAP), Reciprocal Rank (RR) and
       R-Precision scores for each model across the provided training
       datasets. The best result in each column is marked with an asterisk.

Model                                                 MAP       RR     R-Precision
N-Gram (1) + NER + POS + Sentiment                   .1788    .5100       .1957
N-Gram (1) + NER + Sentiment                         .1860    .6382*      .1972
N-Gram (1) + POS + Sentiment                         .1781    .5104       .2029
N-Gram (1) + NER + POS                               .1756    .5043       .1834
Text2vec                                             .2367    .4743       .2495
Text2vec + N-Gram (1)                                .2127    .4173       .2501
Text2vec + N-Gram (2)                                .2200    .4957       .2623*
Text2vec + N-Gram (3)                                .2181    .4604       .2528
Text2vec + N-Gram (1+2+3) + NER + POS + Sentiment    .2185    .4638       .2463
Text2vec + N-Gram (1+2+3) + NER + Sentiment          .2151    .4361       .2513
Text2vec + N-Gram (1+2+3) + POS + Sentiment          .2166    .4649       .2544
Text2vec + NER                                       .2403    .4959       .2551
Text2vec + NER + POS                                 .2383    .5543       .2537
Text2vec + NER + POS + Sentiment                     .2415*   .5488       .2469
Text2vec + NER + Sentiment                           .2364    .4879       .2498
Text2vec + POS                                       .2408    .5562       .2519
Text2vec + POS + Sentiment                           .2310    .4789       .2526



5   Conclusion and Future Work


In this paper, we presented our solution to Task 1 of the CLEF-2019 Check-
That! Lab. For the task of detecting check-worthy claims, we employed an L1-
regularized logistic regression (LASSO) classifier. We selected features such as
vector representations of sentences, named entities, POS tags, and averaged
sentiment values, achieving 3rd place on the English dataset.
    This work opens up several possible avenues for future research. First, we in-
tend to employ syntactic parsing and sentence dependency mapping in order to
extract additional information regarding the stance of claims, as well as contra-
dicting or confirming sentences during debates. Secondly, we plan to extend the
vector representation of sentences to larger segments of text in order to capture
additional nuances of longer debates or speeches.
       Table 2: Mean Average Precision (MAP), Reciprocal Rank (RR) and
       R-Precision scores for each model across the provided testing
       datasets. The best overall result in each column is marked with an
       asterisk; the best result among our submissions is marked with a
       dagger (†).

Model                                           MAP        RR      R-Precision
Text2vec + NER + POS + Sentiment (primary)     .1332     .2864        .1481
Text2vec + N-Gram (2)                          .1365†    .3079†       .1490*†
N-Gram (1) + NER + Sentiment                   .1013     .2791        .1002
Copenhagen (primary)                           .1660*    .4176*       .1387


       Table 3: Average sentiment scores (negative, overall and positive)
       calculated for check-worthy sentences in the training and testing
       datasets, respectively.

                        Negative        Overall        Positive
                       Sentiment      Sentiment       Sentiment
Training Dataset        -2.06676       0.017379         1.23742
Testing Dataset         -0.97227       0.045449         1.04570


Acknowledgment

This work was supported by the Polish National Agency for Academic Exchange
through a Polish Return grant number PPN/PPO/2018/1/00006.




                             Bibliography


[1] Word2vec Project. https://code.google.com/archive/p/word2vec/. Accessed:
    2019-05-24.
[2] Natural Language Toolkit. https://www.nltk.org/. Accessed: 2019-05-24.
[3] NRC Word-Emotion Association Lexicon. https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm.
    Accessed: 2019-05-24.
[4] Pepa Atanasova, Preslav Nakov, Georgi Karadzhov, Mitra Mohtarami, and Gio-
    vanni Da San Martino. Overview of the CLEF-2019 CheckThat! Lab on Automatic
    Identification and Verification of Claims. Task 1: Check-Worthiness.
[5] Sarah Cohen, Chengkai Li, Jun Yang, and Cong Yu. Computational Journalism:
    A Call to Arms to Database Researchers. In Proceedings of the Conference on
    Innovative Data Systems Research, pages 148–151, 04 2011.
 [6] Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem
     Suwaileh, Giovanni Da San Martino, and Pepa Atanasova. Overview of the CLEF-
     2019 CheckThat!: Automatic Identification and Verification of Claims. In Experi-
     mental IR Meets Multilinguality, Multimodality, and Interaction, LNCS, Lugano,
     Switzerland, September 2019.
 [7] Maram Hasanain, Reem Suwaileh, Tamer Elsayed, Alberto Barrón-Cedeño, and
     Preslav Nakov. Overview of the CLEF-2019 CheckThat! Lab on Automatic Iden-
     tification and Verification of Claims. Task 2: Evidence and Factuality.
 [8] Naeemul Hassan, Chengkai Li, and Mark Tremayne. Detecting Check-worthy
     Factual Claims in Presidential Debates. In Proceedings of the 24th ACM In-
     ternational on Conference on Information and Knowledge Management, CIKM
     ’15, pages 1835–1838, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3794-6.
     doi: 10.1145/2806416.2806652. URL http://doi.acm.org/10.1145/2806416.2806652.
 [9] Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. Toward Au-
     tomated Fact-Checking: Detecting Check-worthy Factual Claims by ClaimBuster.
     In Proceedings of the 23rd ACM SIGKDD International Conference on Knowl-
     edge Discovery and Data Mining, KDD ’17, pages 1803–1812, New York, NY,
     USA, 2017. ACM. ISBN 978-1-4503-4887-4. doi: 10.1145/3097983.3098131. URL
     http://doi.acm.org/10.1145/3097983.3098131.
[10] Israa Jaradat, Pepa Gencheva, Alberto Barrón-Cedeño, Lluís Màrquez, and
     Preslav Nakov. ClaimRank: Detecting Check-Worthy Claims in Arabic and En-
     glish. In Proceedings of the 2018 Conference of the North American Chapter of
     the Association for Computational Linguistics: Demonstrations, pages 26–30, New
     Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:
     10.18653/v1/N18-5006. URL https://www.aclweb.org/anthology/N18-5006.
[11] Lev Konstantinovskiy, Oliver Price, Mevan Babakar, and Arkaitz Zubiaga. To-
     wards Automated Factchecking: Developing an Annotation Schema and Bench-
     mark for Consistent Automated Claim Detection. CoRR, abs/1809.08193, 2018.
     URL http://arxiv.org/abs/1809.08193.
[12] Andrew Y. Ng. Feature Selection, L1 vs. L2 Regularization, and Rotational In-
     variance. In Proceedings of the Twenty-first International Conference on Ma-
     chine Learning, ICML ’04, pages 78–, New York, NY, USA, 2004. ACM. ISBN
     1-58113-838-5. doi: 10.1145/1015330.1015435. URL http://doi.acm.org/10.1145/1015330.1015435.
[13] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi.
     Truth of varying shades: Analyzing language in fake news and political fact-
     checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural
     Language Processing, pages 2931–2937, Copenhagen, Denmark, September 2017.
     Association for Computational Linguistics. doi: 10.18653/v1/D17-1317. URL
     https://www.aclweb.org/anthology/D17-1317.
[14] Miriam Redi, Besnik Fetahu, Jonathan T. Morgan, and Dario Taraborelli. Citation
     Needed: A Taxonomy and Algorithmic Assessment of Wikipedia’s Verifiability.
     CoRR, abs/1902.11116, 2019. URL http://arxiv.org/abs/1902.11116.
[15] Jianing Shi, Wotao Yin, Stanley Osher, and Paul Sajda. A Fast Hybrid Algorithm
     for Large-Scale L1-Regularized Logistic Regression. J. Mach. Learn. Res., 11:713–
     741, March 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1756029.
[16] James Thorne and Andreas Vlachos. Automated Fact Checking: Task formu-
     lations, methods and future directions. CoRR, abs/1806.07687, 2018. URL
     http://arxiv.org/abs/1806.07687.
[17] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mit-
     tal. FEVER: a large-scale dataset for Fact Extraction and VERification. CoRR,
     abs/1803.05355, 2018. URL http://arxiv.org/abs/1803.05355.
[18] Andreas Vlachos and Sebastian Riedel. Fact Checking: Task definition and dataset
     construction. In Proceedings of the ACL 2014 Workshop on Language Technolo-
     gies and Computational Social Science, pages 18–22, Baltimore, MD, USA, June
     2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-2508. URL
     https://www.aclweb.org/anthology/W14-2508.
[19] William Yang Wang. "Liar, Liar Pants on Fire": A New Benchmark Dataset for
     Fake News Detection. CoRR, abs/1705.00648, 2017. URL http://arxiv.org/abs/1705.00648.
[20] Peng Zhao and Bin Yu. On Model Selection Consistency of Lasso. J. Mach. Learn.
     Res., 7:2541–2563, December 2006. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1248547.1248637.