<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IRIT at CheckThat! 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Romain Agez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clement Bosc</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cedric Lespagnol</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <email>Josiane.Mothe@irit.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noemie Petitcol</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ESPE, IRIT, UMR5505, CNRS &amp; Universite de Toulouse</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universite Paul Sabatier de Toulouse, UPS</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The 2018 CLEF CheckThat! lab is composed of two tasks: (1) Check-Worthiness and (2) Factuality. We participated in task (1) only, whose purpose is to evaluate the check-worthiness of claims in political debates. Our method represents each claim by a vector of five computed values that correspond to scores on five criteria. These vectors are then used with machine learning algorithms to classify claims as check-worthy or not. We submitted three runs using different machine learning algorithms. According to the official measure, MAP, our run that uses a non-linear SVM is ranked 12th over the 16 submitted runs. Our run that uses a linear SVM is ranked 2nd with the Mean Precision@1 measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>fact-checking</kwd>
        <kwd>Information Nutritional Label</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The first CLEF CheckThat! task aims at predicting which claims in political
debates should be prioritized for fact-checking. All the background and detailed
information about the task is available in the task description paper provided
by the organizers of the task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>To achieve this goal, the task organizers released several textual transcripts
of political debates with each sentence being annotated according to whether it
is check-worthy or not.</p>
      <p>This paper describes the participation of the Universite de Toulouse team
(official name RNCC) in the CLEF 2018 CheckThat! pilot task on check-worthiness.</p>
      <p>We preprocessed the data by representing each sentence, corresponding to a
transcription of what a speaker said in the debate, by a vector containing the
score of the sentence for five different criteria. We then trained three classifiers
on these vectors to submit three different runs.</p>
      <p>The remainder of this paper is organized as follows: Section 2 gives a
description of the pilot task. Section 3 details the model we developed and the submitted
runs. Then Section 4 details the results we obtained. Finally, Section 5 concludes
this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task Description</title>
      <sec id="sec-2-1">
        <title>Objectives</title>
        <p>
          The Check-Worthiness task aims to predict which statements in a political
debate should be fact-checked. Indeed, nowadays, information objects
spread faster and faster on the Internet, and especially on social networks. This
spreading is known as the virality of information [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>During a political debate, any of the statements made by the participants
can be reused without checking its factuality, and it can even become viral.
CheckThat! aims at providing journalists with a list of statements made by
debate participants that should be checked before they are reused by others.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>There are two datasets: one to train the model and one to test it. Both
consist of political debates transcribed into text.</p>
        <p>
          They are annotated so that each row indicates the sentence number, the
speaker, and the transcription of the sentence the speaker said. The training
dataset additionally includes a label that indicates whether the sentence is to be
fact-checked or not. The training set contains three political debates, while the
test set contains seven debates [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
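        <p>As an illustration only, reading one of these annotated transcripts could be sketched as below; the tab-separated column layout (sentence number, speaker, transcription, 0/1 label) and the sample lines are assumptions for the sketch, not the official file format.</p>
        <preformat>
```python
import csv
import io

# Invented sample mimicking an annotated debate transcript: one sentence per
# line with a sentence number, speaker, transcription and 0/1 label.
sample = (
    "1\tTRUMP\tSo we're losing our good jobs, so many of them.\t0\n"
    "2\tTRUMP\tOur jobs are fleeing the country.\t1\n"
)

def load_debate(fileobj, labeled=True):
    """Parse a tab-separated debate transcript into a list of dicts."""
    rows = []
    for cols in csv.reader(fileobj, delimiter="\t"):
        row = {"number": int(cols[0]), "speaker": cols[1], "text": cols[2]}
        if labeled:  # the test set comes without labels
            row["label"] = int(cols[3])
        rows.append(row)
    return rows

debate = load_debate(io.StringIO(sample))
```
        </preformat>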
      </sec>
      <sec id="sec-2-3">
        <title>Evaluation metric</title>
        <p>
          The task has been evaluated according to different measures. The official measure
is MAP, which calculates the usual mean of the average precision. Other
measures were also used, such as the Mean Reciprocal Rank, which averages the
reciprocal of the rank of the first relevant document, and Mean Precision@x, which
averages precision over the top x candidates. Details on the measures used can be
found in the task overview [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
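        <p>For concreteness, the three measures can be sketched on toy ranked lists of 0/1 check-worthiness labels (the lists below are invented):</p>
        <preformat>
```python
def average_precision(ranked_labels):
    """Average precision over one ranked list of 0/1 relevance labels."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_average_precision(lists):
    """MAP: mean of the average precision over the ranked lists."""
    return sum(average_precision(l) for l in lists) / len(lists)

def mean_reciprocal_rank(lists):
    """MRR: mean reciprocal rank of the first relevant item."""
    rrs = [next((1.0 / i for i, rel in enumerate(l, start=1) if rel), 0.0)
           for l in lists]
    return sum(rrs) / len(rrs)

def mean_precision_at(lists, k):
    """Mean Precision@k over the ranked lists."""
    return sum(sum(l[:k]) / k for l in lists) / len(lists)

runs = [[1, 0, 1, 0], [0, 1, 0, 0]]  # toy rankings for two debates
print(round(mean_average_precision(runs), 3))  # 0.667
print(mean_reciprocal_rank(runs))              # 0.75
print(mean_precision_at(runs, 1))              # 0.5
```
        </preformat>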
        <p>Evaluations are carried out on primary and contrastive runs. The primary run
corresponds to the results file of the participant's main model; the decision of the
main run was the participant's decision. Contrastive runs correspond to the secondary
models the participant used.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method and runs</title>
      <p>
        We computed five of the criteria from the Information Nutritional Label for
Online Documents proposed by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These criteria and the methods we developed
in this work to calculate their scores are as follows:
- Factuality and Opinion: determines whether a sentence represents a
fact or a personal opinion. These two features are based on the same
algorithm; each value is the opposite of the other, and it is either 0 or 1. We use a
Multi-layer Perceptron classifier trained with the LBFGS solver [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This
neural network is composed of 500 neurons in the first hidden layer and
5 neurons in the second hidden layer. The activation function used is the
rectified linear unit ("relu"). We used an MLP classifier because it
was the best performing classifier compared with Random Forest, Support Vector
Machine and Linear Regression. The datasets used to train the neural network come
from various Wikipedia articles3 for factual sentences and from Opinosis4 for
opinion sentences. The features used to classify a sentence are fine-grained
part-of-speech tags extracted with spaCy5.
- Controversy: determines the degree of controversy in a text. We count
the number of controversial issues in the text based on the Wikipedia article
List of controversial issues6. For each issue referenced in the wiki article, we
also take into account the anchor text labels7 to find the synonyms and other
appellations of the issues across the whole Wikipedia database. For example,
Donald Trump is in the list of controversial issues. Other names can link to
his Wikipedia page, such as "45th President of America". These names are
called anchor text labels and are also recognized as a controversial issue.
- Emotion: determines the intensity of emotion in a sentence. We use the
list of 2,477 emotional words and their valuations from AFINN8 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (e.g., abusive
= -3, proud = 2). We sum the absolute values of the positive and negative
valuations of the emotional words found in the sentence and divide the sum by
the total number of words in the sentence:
      </p>
      <p>
        (∑ posWordValue + ∑ |negWordValue|) / totalNumberWords
- Technicality: determines the degree of technicality in a text. We count
the number of domain-specific terms in the text. For that, we use NLTK9 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
to perform part-of-speech tagging (adjective = JJ, noun = NN, etc.). Then
we use the re library10 to match, over the tags, a regular expression defined
in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] which identifies the terminological noun phrases (NPs). NPs represent
domain-specific terms in the text. We extract all the NPs from the text and
keep those which appear more than once. We then calculate the ratio of the
number of these NPs over the number of words in the text.
      </p>
      <p>3 Each of the following URLs should be preceded by https://en.wikipedia.org/wiki:
/World War I, /Industrial Revolution, /October Revolution, /Fermi paradox,
/Steam engine, /Barack Obama, /Amazon (company), /Netherlands,
/Triangular trade, /Song dynasty, /Nanking Massacre, /The Holocaust
4 http://kavita-ganesan.com/opinosis/
5 spaCy is a library for Natural Language Processing in Python. It provides NER,
POS tagging, dependency parsing, word vectors and more. https://spacy.io/
6 https://en.wikipedia.org/wiki/Wikipedia:List of controversial issues
7 https://en.wikipedia.org/wiki/Anchor text
8 http://www2.imm.dtu.dk/pubdb/p.php?6010
9 Natural Language ToolKit, https://www.nltk.org/
10 Regular Expression, https://docs.python.org/3/library/re.html</p>
      <p>(∑ NPs) / totalNumberWords</p>
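      <p>The two ratio scores above (Emotion and Technicality) can be sketched together; the AFINN-style dictionary below is a tiny invented subset of the real 2,477-word list, and the noun-phrase pattern is a simplified stand-in for the Justeson-Katz regular expression:</p>
      <preformat>
```python
import re
from collections import Counter

# Tiny invented subset of AFINN-style valuations (the real list has 2,477 words).
AFINN = {"abusive": -3, "proud": 2, "disaster": -2, "great": 3}

def emotion_score(sentence):
    """(sum of |valuations| of emotional words) / total number of words."""
    words = sentence.lower().split()
    return sum(abs(AFINN.get(w, 0)) for w in words) / len(words)

def technicality_score(tagged):
    """(number of repeated terminological NPs) / total number of words.

    `tagged` is a list of (word, tag) pairs as a POS tagger would return.
    The regular expression, applied to a "word/TAG" string, matches runs of
    adjectives/nouns ending in a noun, a simplified stand-in for the
    Justeson-Katz pattern cited in the paper.
    """
    text = " ".join(f"{w}/{t}" for w, t in tagged)
    matches = re.findall(r"(?:\S+/(?:JJ|NN)\S*\s)+\S+/NN\S*", text)
    nps = [" ".join(tok.split("/")[0] for tok in m.split()) for m in matches]
    counts = Counter(nps)
    repeated = sum(c for c in counts.values() if c > 1)  # NPs seen more than once
    return repeated / len(tagged)

print(emotion_score("I am proud of this great country"))  # (2 + 3) / 7 words
```
      </preformat>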
      <p>We decided to use only these criteria as features because our goal was to test
the Information Nutritional Label on a concrete task.</p>
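      <p>The Factuality/Opinion classifier described above can be sketched as follows; the toy part-of-speech count vectors are invented for illustration (the paper extracts fine-grained tags with spaCy), while the architecture matches the one described:</p>
      <preformat>
```python
from sklearn.neural_network import MLPClassifier

# Invented toy features: counts of part-of-speech tags per sentence,
# e.g. [nouns, verbs, adjectives, adverbs].
X = [
    [3, 1, 0, 0],  # noun-heavy, factual-looking sentences
    [4, 2, 0, 1],
    [2, 1, 1, 0],
    [1, 0, 3, 2],  # adjective/adverb-heavy, opinion-looking sentences
    [0, 1, 4, 1],
    [1, 0, 2, 3],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = fact, 0 = opinion

# Architecture from the paper: 500 then 5 hidden units, relu, LBFGS solver.
clf = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(500, 5),
                    activation="relu", max_iter=500, random_state=0)
clf.fit(X, y)

factuality = clf.predict([[3, 2, 0, 0]])[0]
opinion = 1 - factuality  # the two scores are complements of each other
```
      </preformat>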
      <sec id="sec-3-1">
        <title>Models</title>
        <p>Each of our three runs uses its own model to compute a check-worthiness score.
For each of our models, we preprocessed the data using the criteria previously
described. We computed the five features for each sentence that has to be
evaluated for check-worthiness. These sentences are then represented by a vector
of five values, one for each criterion score.</p>
        <p>For our INL SVM RBF (primary run) and INL SVM Lin (first contrastive)
runs, we decided to use the Support Vector Machine in sklearn11 with the
probability setting set to "True". We used an RBF kernel for the INL SVM RBF
run and a linear kernel for the INL SVM Lin run. For our INL RF (second
contrastive) run, we used the random forest classifier in sklearn.</p>
        <p>To train our models, we used the three annotated debates provided by the
clef2018-factchecking github repository12.</p>
        <p>To obtain a check-worthiness score, we computed the probability for each
sentence to be check-worthy using the classifiers. The score of a sentence was
then normalized by dividing it by the highest probability computed, so that the
scores are between 0 and 1.</p>
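        <p>Putting the pieces together, the three runs and the score normalization can be sketched on invented five-dimensional criterion vectors (the labeling rule and run names below are illustrative only):</p>
        <preformat>
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for the data: one five-value criterion vector per sentence
# (factuality, opinion, controversy, emotion, technicality).
X_train = rng.random((40, 5))
y_train = (X_train[:, 3] > 0.5).astype(int)  # invented labeling rule
X_test = rng.random((10, 5))

# The three runs: RBF-kernel SVM (primary), linear SVM and random forest
# (contrastive), with probability estimates enabled for the SVMs.
models = {
    "INL_SVM_RBF": SVC(kernel="rbf", probability=True, random_state=0),
    "INL_SVM_Lin": SVC(kernel="linear", probability=True, random_state=0),
    "INL_RF": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # P(check-worthy)
    # Divide by the highest probability so the scores lie in [0, 1].
    scores[name] = proba / proba.max() if proba.max() > 0 else proba
```
        </preformat>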
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Seven teams submitted runs to this task for a total of 16 runs.</p>
      <p>
        Table 1 presents the results of our three runs and the best submitted run
according to the MAP measure, which is from the Copenhagen team [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
11 http://scikit-learn.org/stable/modules/svm.html
12 https://github.com/clef2018-factchecking/clef2018-factchecking/tree/master/data/task1/English
      </p>
      <p>Overall, the INL SVM Lin run obtained better results than the INL SVM RBF
run; that was somewhat unexpected, since non-linear kernels have been shown to
work better in other information retrieval applications. The INL SVM Lin run
was ranked twelfth according to the main measure (Mean Average
Precision), but obtained better ranks on other measures: it is ranked
fifth according to the Mean Reciprocal Rank and second according to the Mean
Precision@1. These ranks mean that our INL SVM Lin run would do well if the
purpose of the task were finding the most check-worthy claim instead of finding
all the check-worthy claims. However, we need to analyse the results more deeply
to understand why.</p>
      <p>Post-hoc experiments showed that the least important criterion is
Technicality. This may be because the method we use to compute this
feature was designed for large texts and is not appropriate for a single
sentence. The most important criterion is Emotion. We can assume that a claim
has a greater chance of being check-worthy if it is highly emotional: the speaker
thinks less about what they say, and it is more likely that their claims are not fully
accurate. We will check this hypothesis in future work.</p>
      <p>Table 2 presents the weight of the 5 features for our INL SVM Lin model.
The weights of the features for our INL RF model are similar.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and perspectives for future works</title>
      <p>
        In this paper we proposed three models to solve the CLEF 2018 CheckThat!
challenge (task 1, Check-Worthiness), which deals with evaluating the
check-worthiness of statements in political debates. We used random forest and
support vector machine classifiers to learn models that make use of the Information
Nutritional Label features [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We show that these models perform quite well when
considering the Mean Precision@1 measure, which ranks our run that uses a
support vector machine with a linear kernel 2nd over the 16 submitted runs.
      </p>
      <p>
        We are currently working on better calculations of the five features. We would
also like to complete the representations of the texts with content-based
components, as is done in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While the objective there is different (virality prediction),
some of the features may also be useful for the task tackled by CheckThat!. To
further improve our models, we would also like to investigate the use of
word embeddings, since we are using this approach successfully in other tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
it also worked well according to Hansen et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in the CheckThat!
context. As future work, we will also take into consideration the sentences around
the one to be classified and who said those sentences.
      </p>
      <p>Finally, we will test these models on other datasets such as social networks.
For example, we will consider a Twitter-based dataset where each tweet would
have a score indicating its worthiness for fact-checking taking into account
hashtags and tweet sources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milkman</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          :
          <article-title>What makes online content viral</article-title>
          ?
          <source>Journal of marketing research 49(2)</source>
          ,
          <volume>192</volume>
          –
          <fpage>205</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Natural language processing with Python</article-title>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fuhr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanselowski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarvelin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>An information nutritional label for online documents</article-title>
          .
          <source>In: ACM SIGIR Forum</source>
          . vol.
          <volume>51</volume>
          , pp.
          <fpage>46</fpage>
          –
          <lpage>66</lpage>
          . ACM (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonsen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The Copenhagen Team Participation in the Check-Worthiness Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 Fact Checking Lab</article-title>
          . In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.)
          <source>CLEF 2018 Working Notes, Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Avignon, France (
          <year>September 2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>T.B.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
          </string-name>
          , J.:
          <article-title>Predicting information diffusion on Twitter – analysis of predictive features</article-title>
          .
          <source>Journal of Computational Science</source>
          (
          <year>2017</year>
          ), https://doi.org/10.1016/j.jocs.2017.10.010
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Justeson</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>S.M.:</given-names>
          </string-name>
          <article-title>Technical terminology: some linguistic properties and an algorithm for identification in text</article-title>
          .
          <source>Natural language engineering 1</source>
          (
          <issue>1</issue>
          ),
          <volume>9</volume>
          –
          <fpage>27</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramiandrisoa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>IRIT at TRAC</article-title>
          .
          <source>In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbulling (TRAC)</source>
          ,
          <source>27th International Conference on Computational Linguistics, COLIN 18. International Committee on Computational Linguistics</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barron-Cedeno</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elsayed</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suwaileh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marquez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaghouani</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atanasova</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kyuchukov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Da San Martino, G.:
          <article-title>Overview of the CLEF2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims</article-title>
          . In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J.Y., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N. (eds.)
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</source>
          .
          <source>Lecture Notes in Computer Science (LNCS) 11018</source>
          , Springer, Heidelberg, Germany (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          :
          <article-title>AFINN</article-title>
          (Mar
          <year>2011</year>
          ), http://www2.imm.dtu.dk/pubdb/p.php?6010
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <volume>2825</volume>
          –
          <fpage>2830</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>