CheckThat! 2019 UAICS

Lucia-Georgiana Coca, Ciprian-Gabriel Cusmuliuc, Adrian Iftene

Alexandru Ioan Cuza University, Faculty of Computer Science, Iasi, Romania
{georgiana.coca, ciprian.cusmuliuc, adiftene}@info.uaic.ro

Abstract. Investigative journalists or detectives lose a lot of time proving a certain claim, often searching by hand for the source or evidence that supports the assertion. In this context, in order to address this problem, the 2019 CLEF CheckThat! lab comes with two tasks: (1) Check-Worthiness and (2) Evidence & Factuality. Our group participated in the first task, whose aim is to evaluate the check-worthiness of political claims made in a debate. Our approach was to represent each claim by a feature vector and feed it to a machine learning classification algorithm in order to classify whether the claim is check-worthy or not. We submitted three runs, one primary and two contrastive: the primary is a Naive Bayes model, the first contrastive a Logistic Regression and the second an SVM. Measured with the official MAP metric, the best result was obtained by the Naive Bayes, the second best by the SVM and the third by the Logistic Regression. This paper presents the details of our approaches.

Keywords: CheckThat!, SVM, Naive Bayes, Logistic Regression.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The popularity of social networks has increased significantly in recent years, and reading news on these networks has become a natural activity for all users. News is transmitted instantly to these networks, where it is read quickly, marked with opinions (see Facebook1) and retransmitted (retweets on Twitter2, shares on Facebook), often without anyone checking whether it is true or false. This problem has also affected the political environment; growing political unrest in many countries has made politicians exchange accusations in diverse political debates, some more accurate than others. The challenge presented to us is thus to first decide whether a political claim needs to be verified and then to verify whether it is factually true.

1 https://www.facebook.com/
2 https://twitter.com/

The first task consists of classifying claims from presidential political debates: each candidate can make a claim, the others have a short time to issue a response, and each of them has the right to make accusations in order to convince the audience of their political wit (Atanasova et al., 2019). In order to investigate the check-worthiness of a claim, we were provided with multiple transcripts of the presidential debates from the last elections in the United States. The goal is to provide a score for each line in the transcript, a score that signifies its priority for fact checking and serves as input for Task 2 (Elsayed et al., 2019).

This paper describes the participation of team UAICS, from the Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, in Task 1 at CLEF 2019. The remainder of this paper is organized as follows: Section 2 gives a description of the task, Section 3 details the model we developed and the submitted runs, Section 4 presents the results we obtained, and Section 5 concludes this paper.
2 Task Description

2.1 Objectives

The objective is to provide a score for each line of a presidential debate3, the score signifying how worthy the line is of being passed on for fact checking4 (which is Task 2); the objective is thus to create a filtering layer for the second task. Given that in a political debate things evolve quickly, manual checking would be very cumbersome and slow, so the need arises for automated checking in order to keep the public informed about the discussion and to discourage fake information.

3 https://sites.google.com/view/clef2019-checkthat/task-1-check-worthiness
4 https://sites.google.com/view/clef2019-checkthat/task-2-evidence-factuality

2.2 Dataset

Two datasets were available, one to train the model and one to test it. Both consist of transcripts of political debates from the last United States presidential elections. The dataset has the following columns: line number, speaker, text and label; the test files do not include the label. The label is binary: zero signifies that the sentence should not be fact checked and one that it should. A concrete example with label 1 is Trump's "So Ford is leaving", and one with label 0 is BLITZER's "Let's begin with Senator Sanders". The training set had 19 files and the test set had seven files.

2.3 Evaluation metric

The task was evaluated according to the organizers' official measures. The official measure is MAP (Beitzel et al., 2009), the mean over debates of the average precision of the ranked lines. Other measures are the Mean Reciprocal Rank (Craswell, 2009), the mean of the reciprocal of the rank at which the first relevant (check-worthy) line is found, and Mean Precision at k, the mean of the precision computed over the k top-ranked candidates. Details on the measures used can be found in the task overview. Evaluations are carried out on primary and contrastive runs with the metrics described above, and each participant has the right to submit three models, one primary and two secondary (contrastive).
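For reference, the standard textbook definitions of these ranking measures are sketched below; the official scores are, of course, computed with the organizers' own implementation.

```latex
\[
P@k(q) = \frac{1}{k}\sum_{i=1}^{k} \mathrm{rel}_q(i), \qquad
\mathrm{AP}(q) = \frac{1}{|R_q|}\sum_{k} P@k(q)\,\mathrm{rel}_q(k), \qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q\in Q} \mathrm{AP}(q), \qquad
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q\in Q} \frac{1}{\mathrm{rank}_q}
\]
```

Here Q is the set of debates, rel_q(k) indicates whether the line ranked k-th in debate q is check-worthy, R_q is the set of check-worthy lines in debate q, and rank_q is the position of the first check-worthy line in the ranking produced for debate q.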
3 Methods and runs

Right from the start, we decided to use only the training data provided and no other external information for our models; thus, only the presidential debates were available. We selected multiple machine learning algorithms in order to test which one would be best for our problem. We started with the classical ones, Decision Trees (Dobra, 2009) and Naive Bayes (Rennie, 2003), then moved to more advanced algorithms such as SVM (Liu, 2009), Random Forest (Ho, 1995) and Logistic Regression (Hosmer, 2000), and finally we tested a neural network, a Multilayer Perceptron classifier. We had used machine learning algorithms in previous editions of CLEF as well (Iftene et al., 2009), (Iftene et al., 2012), (Iftene et al., 2013), (Cristea et al., 2016).

In order to validate our algorithms, given that we had no validation data, we decided to split the training data 70-30: 70% was used for training and 30% for measuring performance.

To ease our implementation we used PySpark together with PySpark MLlib, which contains prebuilt, ready-to-use machine learning algorithms; we chose this tool in order to benefit from the parallel processing power of PySpark and scale the application to large amounts of data.

Our own metrics were based on sklearn, so in the end we had the Precision, Recall and F1 of the classifiers as well as the confusion matrix. In addition, in order to comply with the organizers' requirements, we also used their provided metrics: R-Precision, Average Precision, Reciprocal Rank and Precision@k. In the end we therefore had a multitude of metrics that helped us better measure our classification efforts.

3.1 Training and test data

The data provided contained transcripts of the 2016 United States presidential election debates. The data is of two main categories, training and test; the training set had 19 files while the test set had seven files. The main difference between them is that the test files are missing the label column, that is, the classification category of the phrase. A training example with the available columns is the following:

Table 1. Training example.
Line no.  Speaker  Text                 Label
1         Trump    So Ford is leaving.  1

The test files had the following format:

Table 2. Test example.
Line no.  Speaker  Text
1         Sanders  They went to the DNC quietly.

From the training data we made several decisions: the speaker is not relevant for the algorithms and would bias them towards certain decisions (which we do not want), so we excluded it; we did not exclude the speaker "SYSTEM", whose phrases such as "(APPLAUSE)" are all very similar, since the label 0 is enough for the algorithms to learn to predict 0 for them; and, besides tokenization, we can run the feature extraction algorithms without much pre-processing at all.

3.2 Preprocessing and feature extraction

Before feeding the data to the machine learning algorithms, we had to preprocess the text and extract features. This section describes the process in detail so that the training data fed to the algorithms is fully understood.

As stated in the previous section, we did not take the speaker column into consideration in the classification process, working only on the "text" and "label" columns. For preprocessing, we only applied a Tokenizer to each line, taking the text (such as a sentence) and breaking it into individual terms (words). After tokenization we removed the stop words with the StopWordsRemover class; given that the texts are in English, we only removed stop words for this language. We did not take irony or sentiment analysis into account in the preprocessing step, as we believed the feature extraction could represent this indirectly; however, we were aware this could affect certain edge cases of the classification.

After preprocessing the text, it was necessary to extract features from it; we did this using a multitude of methods, trying to find the best fit for every algorithm. The best results were obtained with two main methods: TF-IDF and CountVectorizer (which converts a collection of text documents to a matrix of token counts).

The first method, TF-IDF, was used with Logistic Regression and the Multilayer Perceptron; in PySpark the implementation is HashingTF followed by IDF in the pipeline. TF-IDF creates a feature model in which term frequency carries informational value for the classification algorithms. We decided to use HashingTF in order to make the implementation faster: it creates a feature map in which a raw feature (term) is mapped to an index by applying a hash function, after which IDF is fit on the generated term frequency vectors, scaling each feature and down-weighing features that appear frequently in the corpus. As for settings, we had to fine-tune the preprocessing methods; in their final form, the number of features of HashingTF was 262,144 for Logistic Regression and 5,000 for the Multilayer Perceptron (since this value determines the number of input neurons of the network, we had to scale it down). The IDF settings are the same for both algorithms: the minimum number of documents in which a term must appear in order to be kept is 0.
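A minimal sketch of this tokenization and TF-IDF pipeline, assuming the standard PySpark ML API (the column names and DataFrame variables are illustrative; the parameter values are the ones reported above):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("checkthat-tfidf").getOrCreate()

# Tokenize the "text" column, drop English stop words (the default list),
# hash the tokens into a 262,144-dimensional term-frequency vector and apply IDF.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="raw_features", numFeatures=262144)
idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=0)

# Classifier fed by these features; the same pipeline with numFeatures=5000
# was used for the Multilayer Perceptron.
lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=100, regParam=0.0)

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, lr])
# model = pipeline.fit(train_df)          # train_df: DataFrame with "text" and "label"
# predictions = model.transform(test_df)  # test_df: DataFrame with "text"
```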
For the second method, CountVectorizer, we used it with Naive Bayes and SVM; the implementation bears the same name, and after the CountVectorizer we applied IDF. CountVectorizer is very similar to HashingTF, the main differences being that it is reversible (because it does not hash) and more computationally intensive; on the other hand, it does not reduce the dimensionality, so the informational loss is lower. We also tried HashingTF with these models but obtained worse results, which is the reason for switching the feature extraction algorithm. The settings of the CountVectorizer are the following: the minimum term frequency is 1 and so is the minimum document frequency, the maximum document frequency is 2^63-1, and the vocabulary size is 2^18.

3.3 Models

After preprocessing the data and extracting the features with the methods described in the previous sections, in this section we present the algorithms used to fit the extracted features. Our approach was to try a few diverse algorithms and select the best three: one primary and two contrastive. We started investigating these models by looking into what the participants of CLEF CheckThat! 2018 did, the most notable algorithms being the Multilayer Perceptron and SVM, but also Random Forests. We started with the classical ones, Decision Trees and Naive Bayes, then moved to more advanced algorithms such as SVM, Random Forest and Logistic Regression, and finally we tested a neural network, the Multilayer Perceptron classifier. As stated at the beginning of this section, the implementations of the algorithms are based on PySpark5. After we trained the models and measured their performance using the aforementioned metrics, only four of the initial six algorithms were left: Logistic Regression, Naive Bayes, SVM and Multilayer Perceptron; these had the best results and we could further improve them.

5 https://www.tutorialspoint.com/pyspark/index.html

The Naive Bayes was right from the start one of the top performers in our tests, so naturally it received the most attention. The settings used for this algorithm were rather slim: the smoothing was set to 1 and the model type was multinomial (given that we classify on word counts from the text).

The next algorithm that caught our eye was the SVM, which had very interesting results; granted, not as good as the Naive Bayes, but very notable. We used the SVM's hyperplane to classify the multidimensional feature matrix. As for settings, the SVM uses a linear kernel, the maximum number of iterations is set to 100 and the regularization parameter is set to zero.

The third best algorithm was the Logistic Regression, with good results, close to the SVM. Its settings are: the maximum number of iterations is set to 100, the regularization parameter is set to 0 and the label distribution (binomial or multinomial) is identified automatically. The configurations of the three submitted models are sketched below.
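The following is a minimal sketch, not the exact submission code, of how these three configurations look with the PySpark ML estimators, assuming the tokenized, stop-word-filtered tokens are already in a "filtered" column as in the previous sketch:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.classification import NaiveBayes, LinearSVC, LogisticRegression

spark = SparkSession.builder.appName("checkthat-models").getOrCreate()

# Token counts with the settings reported above: vocabulary 2^18, minTF = minDF = 1
# (the maximum document frequency is left at its default, 2^63 - 1), then IDF weighting.
cv = CountVectorizer(inputCol="filtered", outputCol="raw_features",
                     vocabSize=2**18, minTF=1.0, minDF=1.0)
idf = IDF(inputCol="raw_features", outputCol="features")

# Primary run: multinomial Naive Bayes with smoothing 1 on the count-based features.
nb = NaiveBayes(labelCol="label", featuresCol="features",
                smoothing=1.0, modelType="multinomial")

# Contrastive 2: linear SVM, 100 iterations, regularization parameter 0.
svm = LinearSVC(labelCol="label", featuresCol="features", maxIter=100, regParam=0.0)

# Contrastive 1: Logistic Regression, 100 iterations, regParam 0, label family auto-detected.
lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=100, regParam=0.0, family="auto")
```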
In contrast to the three algorithms that performed well, one yielded less than satisfactory results despite long training times. The Multilayer Perceptron had very low accuracy, which is why we chose not to include it in the submitted runs. Its maximum number of iterations was set to 1,500, and no matter how long we let it train, the accuracy remained low. The network had 5,000 input neurons (matching the number of features) and two hidden layers with 1,000 and 2,000 neurons, while the output layer had two neurons corresponding to the two labels. The unsatisfactory results are probably largely due to the low number of extracted features, but the long training times and the fact that the other algorithms produced good results very quickly made us leave this model aside.

To conclude this section, for the submission we chose the Naive Bayes as primary, the Logistic Regression as contrastive 1 and the SVM as contrastive 2. We made only two submissions, UAICS-1 and UAICS-2, the latter being the final version of our system.

4 Results

In this section, the results of the three submitted runs are discussed. The official results of our team (UAICS), ranking fifth out of 12 on the primary MAP metric, are:

Table 3. Results.
Submission  MAP    RR     R-P    P@1    P@3    P@5    P@10   P@20   P@50
primary     .1234  .4650  .1460  .4286  .2381  .2286  .2429  .1429  .0943
contr.-1    .0649  .2817  .0655  .1429  .2381  .1429  .1143  .0786  .0343
contr.-2    .0726  .4492  .0547  .4286  .2857  .1714  .1143  .0643  .0257

In detail, we obtained the following places:
• MAP (Mean Average Precision) - 5th place;
• RR (Reciprocal Rank) - 1st place;
• R-P (R-Precision) - 3rd place;
• P@1 (Precision@1) - 1st place;
• P@3 (Precision@3) - 1st place;
• P@5 (Precision@5) - 2nd place;
• P@10 (Precision@10) - 1st place;
• P@20 (Precision@20) - 4th place;
• P@50 (Precision@50) - 6th place.

With the official labeled test files (7,080 test lines), we can inspect further metrics such as Precision, Recall and F1:

Table 4. Precision, Recall and F1 results for the primary run.
Primary       Precision  Recall  F1    Support
0             0.98       0.96    0.97  6944
1             0.11       0.24    0.15  136
Micro avg     0.95       0.95    0.95  7080
Macro avg     0.55       0.60    0.56  7080
Weighted avg  0.97       0.95    0.96  7080

Table 5. Precision, Recall and F1 results for contrastive 1.
Contr.-1      Precision  Recall  F1    Support
0             0.98       0.97    0.98  6944
1             0.06       0.09    0.07  136
Micro avg     0.96       0.96    0.96  7080
Macro avg     0.52       0.53    0.52  7080
Weighted avg  0.96       0.96    0.96  7080

Table 6. Precision, Recall and F1 results for contrastive 2.
Contr.-2      Precision  Recall  F1    Support
0             0.98       0.99    0.99  6944
1             0.10       0.06    0.08  136
Micro avg     0.97       0.97    0.97  7080
Macro avg     0.54       0.52    0.53  7080
Weighted avg  0.96       0.97    0.97  7080

In addition, the confusion matrices are:

Table 7. Confusion matrix for the primary run.
Primary      Predicted No  Predicted Yes
Actual No    6694          250
Actual Yes   104           32

Table 8. Confusion matrix for contrastive 1.
Contr.-1     Predicted No  Predicted Yes
Actual No    6751          193
Actual Yes   124           12

Table 9. Confusion matrix for contrastive 2.
Contr.-2     Predicted No  Predicted Yes
Actual No    6875          69
Actual Yes   128           8
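The per-class scores and confusion matrices above were computed with sklearn; a minimal sketch of this evaluation step, with placeholder label vectors standing in for the gold labels and our models' predictions, is:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative placeholders: in our setup, y_true comes from the official labeled
# test files and y_pred from the predictions of one of the PySpark models.
y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1]

print(classification_report(y_true, y_pred, digits=2))
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
```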
If we analyze the performance of each system, the finding is that the primary Naive Bayes has the best overall performance, with a very high detection rate for phrases that are not worth checking and also the best rate of detecting the cases that are worth fact checking. The confusion matrix also confirms that Naive Bayes is a very performant algorithm: out of the 7,080 test lines, only 354 were labeled wrongly.

The next contender for the title of best algorithm is contrastive 1, the Logistic Regression, which has a good capability of predicting non-priority fact checking cases but is much worse than Naive Bayes at predicting the cases that actually have to be fact checked; it is thus a close second. The confusion matrix goes hand in hand with the above-mentioned metrics, as the algorithm wrongly classified 317 lines of the test data.

The final contender is contrastive 2, the Support Vector Machine, with the highest detection rate for non-priority cases but unfortunately very low performance in detecting priority cases.

Comparing the Precision, Recall, F1 and confusion matrices with the official results, it can clearly be seen that the best algorithm is still the primary Naive Bayes; however, in the official ranking the second best is contrastive 2 followed by contrastive 1, which may be explained by contrastive 1 and contrastive 2 being very close to each other.

5 Conclusions and future work

In this paper, we proposed three models to solve the CLEF 2019 CheckThat! challenge (Task 1, Check-Worthiness), which deals with the evaluation of the check-worthiness of statements in political debates. We used Naive Bayes, SVM and Logistic Regression to train models on the features extracted with TF-IDF and CountVectorizer. We obtained good results with these three models, ranking 5th out of 12. We are currently trying to further improve the feature extraction methods and to feed more data into the algorithms in order to obtain even better results. It is also very important for us to improve our algorithms and to find new ones: we would like to improve the Multilayer Perceptron and to test everything with a Convolutional Neural Network, so that in the future we get even better results.

Acknowledgement. This work is partially supported by the POC-A1-A1.2.3-G-2015 program, as part of the PrivateSky project (P 40 371/13/01.09.2016).

References

1. Atanasova, P., Nakov, P., Karadzhov, G., Mohtarami, M., Da San Martino, G. (2019) Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 1: Check-Worthiness. CLEF 2019 Working Notes.
2. Beitzel, S.M., Jensen, E.C., Frieder, O. (2009) MAP. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
3. Craswell, N. (2009) Mean Reciprocal Rank. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
4. Cristea, A.G., Savoaia, M.M., Martac, M.A., Pătraș, I.C., Scutaru, A.O., Covrig, C.E., Iftene, A. (2016) Using Machine Learning Techniques, Textual and Visual Processing in Scalable Concept Image Annotation Challenge. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum - ImageCLEF 2016, 5-8 September 2016, Evora, Portugal.
5. Dobra, A. (2009) Decision Tree Classification. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
6. Elsayed, T., Nakov, P., Barron-Cedeno, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P. (2019) Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, LNCS, Springer, Lugano, Switzerland, September 2019.
7. Ho, T.K. (1995) Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14-16 August 1995, pp. 278-282.
8. Hosmer, D.W., Lemeshow, S. (2000) Applied Logistic Regression (2nd ed.). Wiley.
9. Iftene, A., Gînscă, A.L., Moruz, A., Trandabăț, D., Husarciuc, M., Boroș, E. (2012) Enhancing a Question Answering System with Textual Entailment for Machine Reading Evaluation. Notebook Paper for the CLEF 2012 LABs Workshop - QA4MRE, 17-20 September, Rome, Italy.
10. Iftene, A., Moruz, A., Ignat, E. (2013) Using Anaphora Resolution in a Question Answering System for Machine Reading Evaluation. Notebook Paper for the CLEF 2013 LABs Workshop - QA4MRE, 23-26 September, Valencia, Spain.
11. Iftene, A., Trandabăț, D., Pistol, I., Moruz, A., Husarciuc, M., Cristea, D. (2009) UAIC Participation at QA@CLEF2008. In Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 5706/2009, pp. 448-451.
12. Liu, L., Özsu, M.T. (2009) SVM. In Encyclopedia of Database Systems. Springer, Boston, MA.
13. Rennie, J., Shih, L., Teevan, J., Karger, D. (2003) Tackling the Poor Assumptions of Naive Bayes Classifiers. ICML.