IRIT at CheckThat! 2018

Romain Agez1, Clement Bosc1, Cedric Lespagnol1, Josiane Mothe2[0000−0001−9273−2193], and Noemie Petitcol1

1 Université P. Sabatier de Toulouse, UPS, France
FirstName.LastName@univ-tlse3.fr
2 ESPE, IRIT, UMR5505, CNRS & Université de Toulouse, France
Josiane.Mothe@irit.fr

Abstract. The 2018 CLEF CheckThat! lab is composed of two tasks: (1) Check-Worthiness and (2) Factuality. We participated in Task 1 only, whose purpose is to evaluate the check-worthiness of claims in political debates. Our method represents each claim by a vector of five computed values that correspond to scores on five criteria. These vectors are then used with machine learning algorithms to classify claims as check-worthy or not. We submitted three runs using different machine learning algorithms. According to the official measure, MAP, our best run, which uses a non-linear SVM, ranks 12th out of the 16 submitted runs. Our run that uses a linear SVM is ranked 2nd on the Mean Precision@1 measure.

Keywords: Information retrieval · fact-checking · information nutritional label.

1 Introduction

The first CLEF CheckThat! task aims at predicting which claims in political debates should be prioritized for fact-checking. The background and detailed information about the task are available in the task description paper provided by the organizers [8]. To support this goal, the task organizers released several textual transcripts of political debates, with each sentence annotated according to whether it is check-worthy or not. This paper describes the participation of the Université de Toulouse team (official name RNCC) in the CLEF 2018 CheckThat! pilot task for check-worthiness. We preprocessed the data by representing each sentence, corresponding to a transcription of what a speaker said in the debate, by a vector containing the score of this sentence for five different criteria.
We then trained three classifiers using these vectors to produce three different runs.

The remainder of this paper is organized as follows: Section 2 gives a description of the pilot task. Section 3 details the model we developed and the submitted runs. Section 4 then details the results we obtained. Finally, Section 5 concludes this paper.

2 Task Description

2.1 Objectives

The Check-Worthiness task aims to predict which statements in a political debate should be fact-checked. Nowadays, information objects spread faster and faster on the Internet, and especially on social networks; this spreading is called the virality of the information [1]. During a political debate, any statement made by the participants can be reused without checking its factuality, and it can even become viral. CheckThat! aims at providing journalists with a list of statements made by the debate participants that should be checked before they are reused by others.

2.2 Dataset

There are two datasets: one to train the model and one to test it. Both sets consist of political debates transcribed into text. They are annotated so that each row indicates the sentence number, the speaker, and the transcription of the sentence the speaker said. The training dataset additionally includes a label that indicates whether the sentence is to be fact-checked or not. The training set contains three political debates, while the test set contains seven debates [8].

2.3 Evaluation metric

The task has been evaluated according to different measures. The official measure is MAP, which computes the mean of the average precision over the debates. Other measures were also used, such as the Mean Reciprocal Rank, which averages the reciprocal of the rank of the first relevant document, and Mean Precision@x, which averages the precision over the x top-ranked candidates. Details on the measures used can be found in the task overview [8]. Evaluations are carried out on primary and contrastive runs.
The primary run corresponds to the results file of the participant's main model; contrastive runs correspond to the secondary models the participant used.

3 Method and runs

We computed five of the criteria from the Information Nutritional Label for Online Documents proposed by [3]. These criteria, and the methods we developed in this work to compute their scores, are as follows:

– Factuality and Opinion: Determines whether a sentence represents a fact or a personal opinion. These two features are based on the same algorithm; each value is the opposite of the other, and is either 0 or 1. We use a Multi-layer Perceptron (MLP) classifier trained with the LBFGS solver [10]. This neural network has 500 neurons in the first hidden layer and 5 neurons in the second hidden layer; the activation function is the rectified linear unit ("relu"). We chose an MLP classifier because it outperformed Random Forest, Support Vector Machine, and Linear Regression. The data used to train the neural network come from various Wikipedia articles3 for factual sentences and from Opinosis4 for opinion sentences. The features used to classify a sentence are fine-grained part-of-speech tags extracted with spaCy5.
– Controversy: Determines the degree of controversy in a text. We count the number of controversial issues in the text based on the Wikipedia List of controversial issues6. For each issue referenced in that article, we also take into account the anchor text labels7 to find the synonyms and other names of the issues across the whole of Wikipedia. For example, Donald Trump is in the list of controversial issues; other names, such as "45th President of America", can link to his Wikipedia page. These names are called anchor text labels and are also recognized as controversial issues.
– Emotion: Determines the intensity of emotion in a sentence.
We use the list of 2,477 emotional words and their valuations from AFINN8 [9] (e.g., abusive = -3, proud = 2). We sum the positive valuations and the absolute values of the negative valuations of the emotional words found in the sentence, and divide by the total number of words in the sentence:

(∑ posWordValue + ∑ |negWordValue|) / totalNumberWords

– Technicality: Determines the degree of technicality in a text. We count the number of domain-specific terms in the text. For that, we use NLTK9 [2] to perform part-of-speech tagging (adjective = JJ, noun = NN, etc.). Then, we use the re library10 to match, over the tag sequence, a regular expression defined in [6] which identifies terminological noun phrases (NPs); NPs represent the domain-specific terms in the text. We extract all the NPs from the text and keep those which appear more than once. We then compute the ratio of the number of these NPs to the number of words in the text:

(∑ NPs) / totalNumberWords

We decided to use only these criteria as features because our goal was to test the Information Nutritional Label on a concrete task.

3 Each of the following URLs should be preceded by https://en.wikipedia.org/wiki: /World War I, /Industrial Revolution, /October Revolution, /Fermi paradox, /Steam engine, /Barack Obama, /Amazon (company), /Netherlands, /Triangular trade, /Song dynasty, /Nanking Massacre, /The Holocaust
4 http://kavita-ganesan.com/opinosis/
5 spaCy is a library for Natural Language Processing in Python. It provides NER, POS tagging, dependency parsing, word vectors and more. https://spacy.io/
6 https://en.wikipedia.org/wiki/Wikipedia:List of controversial issues
7 https://en.wikipedia.org/wiki/Anchor text
8 http://www2.imm.dtu.dk/pubdb/p.php?6010
9 Natural Language ToolKit, https://www.nltk.org/
10 Regular Expression, https://docs.python.org/3/library/re.html

3.1 Models

Each of our three runs uses its own model to compute a check-worthiness score.
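The Factuality/Opinion criterion described above can be sketched as follows. This is a minimal illustration only: the paper trains on Wikipedia and Opinosis sentences with fine-grained spaCy POS tags, while here a toy tag vocabulary and hand-made tag-count vectors stand in for the real data; the MLP architecture (two hidden layers of 500 and 5 neurons, relu activation, LBFGS solver) matches the description in the text.

```python
# Sketch of the Factuality/Opinion MLP classifier; the training vectors below
# are illustrative stand-ins, not the paper's actual Wikipedia/Opinosis data.
from sklearn.neural_network import MLPClassifier

# Toy tag vocabulary (a small subset of the Penn Treebank tags spaCy produces).
TAGS = ["NNP", "NN", "VBD", "CD", "JJ", "PRP", "VBP", "RB"]

def tag_counts(tags):
    """Bag-of-POS-tags feature vector for one sentence."""
    return [tags.count(t) for t in TAGS]

# Illustrative data: "factual" sentences (label 1) lean on proper nouns,
# past-tense verbs and numbers; "opinions" (label 0) on pronouns and adjectives.
X = [
    tag_counts(["NNP", "VBD", "CD", "NN"]),
    tag_counts(["NNP", "NNP", "VBD", "CD"]),
    tag_counts(["PRP", "VBP", "RB", "JJ"]),
    tag_counts(["PRP", "VBP", "JJ", "JJ"]),
]
y = [1, 1, 0, 0]

# Architecture from the paper: hidden layers of 500 and 5 neurons, relu, LBFGS.
clf = MLPClassifier(hidden_layer_sizes=(500, 5), activation="relu",
                    solver="lbfgs", random_state=0, max_iter=500)
clf.fit(X, y)

# Factuality is the predicted label; Opinion is its complement (1 - Factuality).
factuality = clf.predict([tag_counts(["NNP", "VBD", "CD", "NN"])])[0]
opinion = 1 - factuality
```

The two features being exact complements, only one classification is needed per sentence.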
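The Emotion score above can be sketched in a few lines of plain Python. The real feature uses the full 2,477-word AFINN lexicon; the mini lexicon below is a hand-picked illustrative subset on AFINN's actual -5..+5 valuation scale.

```python
# Sketch of the Emotion criterion: sum of positive valuations plus absolute
# values of negative valuations, divided by the total number of words.
AFINN_SUBSET = {"abusive": -3, "proud": 2, "disaster": -2, "great": 3}

def emotion_score(sentence):
    """(sum posWordValue + sum |negWordValue|) / totalNumberWords."""
    words = sentence.lower().split()
    total = sum(abs(AFINN_SUBSET.get(w, 0)) for w in words)
    return total / len(words)

# "proud" (+2) and "abusive" (-3) contribute 2 + |-3| = 5 over 7 words.
score = emotion_score("he is proud of his abusive remark")
```

Note that both polarities contribute positively to the score: the criterion measures emotional intensity, not sentiment direction.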
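The Technicality criterion above can be sketched as follows. The Justeson and Katz [6] filter accepts tag sequences of the form ((A|N)+ | (A|N)*(N P)(A|N)*) N; here each tagged token is mapped to one letter so the filter becomes a plain regular expression. The paper tags text with NLTK's part-of-speech tagger; the sketch takes pre-tagged (word, tag) pairs so it stays self-contained.

```python
import re
from collections import Counter

# Justeson & Katz terminological NP pattern over one-letter tag codes:
# A = adjective, N = noun, P = preposition, O = anything else.
JK_PATTERN = re.compile(r"(?:[AN]+|[AN]*NP[AN]*)N")

def tag_to_letter(tag):
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "O"

def technicality(tagged_tokens):
    """Ratio of repeated terminological NPs to total words.
    tagged_tokens: list of (word, Penn-Treebank-tag) pairs, e.g. the output
    of nltk.pos_tag (pre-tagged here to avoid the NLTK dependency)."""
    letters = "".join(tag_to_letter(t) for _, t in tagged_tokens)
    phrases = Counter()
    for m in JK_PATTERN.finditer(letters):
        words = [w for w, _ in tagged_tokens[m.start():m.end()]]
        phrases[" ".join(words).lower()] += 1
    # Keep only the NPs that appear more than once, as described above.
    repeated = [p for p, c in phrases.items() if c > 1]
    return len(repeated) / len(tagged_tokens)
```

For instance, a text in which the NP "steam engine" (tagged NN NN) occurs twice among eight words scores 1/8 on this sketch.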
For each of our models, we preprocessed the data using the criteria previously described: we computed the five features for each sentence to be evaluated for check-worthiness. Each sentence is thus represented by a vector of five features, one per criterion score. For our INL SVM RBF (primary) and INL SVM Lin (first contrastive) runs, we used the Support Vector Machine implementation in sklearn11 with the probability setting set to "True"; we used an RBF kernel for the INL SVM RBF run and a linear kernel for the INL SVM Lin run. For our INL RF (second contrastive) run, we used the random forest classifier in sklearn. To train our models, we used the three annotated debates provided in the clef2018-factchecking GitHub repository12. To obtain a check-worthiness score, we computed the probability for each sentence to be check-worthy using the classifiers. Each sentence's score was then normalized by dividing it by the highest probability computed, so that the scores are between 0 and 1.

4 Results

Seven teams submitted runs to this task, for a total of 16 runs. Table 1 presents the results of our three runs and the best submitted run according to the MAP measure, which is from the Copenhagen team [4].

Table 1. Results for each of our runs and the best run submitted. Values in parentheses correspond to the ranks of the runs among the 16 that were submitted.

Name            MAP          MRR          Mean Prec@1
INL SVM RBF     .0632 (16)   .3775 (9)    .2857 (6)
INL SVM Lin     .0886 (12)   .4844 (5)    .4286 (2)
INL RF          .0747 (15)   .2198 (15)   .0000 (14)
Copenhagen [4]  .1810 (1)    .6224 (1)    .5714 (1)

11 http://scikit-learn.org/stable/modules/svm.html
12 https://github.com/clef2018-factchecking/clef2018-factchecking/tree/master/data/task1/English

Overall, the INL SVM Lin run obtained better results than the INL SVM RBF run; this was somewhat unexpected, since non-linear kernels have been shown to work better in other information retrieval applications.
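The scoring pipeline of Section 3.1 (probability estimates from an SVM, then normalization by the highest probability) can be sketched as follows. The 5-dimensional vectors are illustrative stand-ins for the real criterion scores, not data from the task.

```python
# Sketch of the run pipeline: SVC with probability=True scores each sentence,
# and scores are normalized by the maximum probability to lie in [0, 1].
from sklearn.svm import SVC

# Illustrative [factuality, opinion, controversy, emotion, technicality]
# vectors; label 1 = check-worthy.
X_train = [
    [0.0, 1.0, 2.0, 0.40, 0.10], [0.0, 1.0, 1.0, 0.60, 0.20],
    [0.0, 1.0, 3.0, 0.55, 0.00], [0.0, 1.0, 2.0, 0.70, 0.10],
    [0.0, 1.0, 1.0, 0.45, 0.05], [1.0, 0.0, 0.0, 0.05, 0.00],
    [1.0, 0.0, 0.0, 0.10, 0.05], [1.0, 0.0, 1.0, 0.00, 0.10],
    [1.0, 0.0, 0.0, 0.15, 0.00], [1.0, 0.0, 0.0, 0.20, 0.05],
]
y_train = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X_test = [
    [0.0, 1.0, 2.0, 0.50, 0.10],
    [1.0, 0.0, 0.0, 0.00, 0.00],
]

# INL SVM Lin uses kernel="linear"; INL SVM RBF would use kernel="rbf".
clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X_train, y_train)

# Probability of the check-worthy class for each test sentence...
proba = clf.predict_proba(X_test)[:, 1]
# ...normalized by the highest probability so the scores lie in [0, 1].
scores = proba / proba.max()
```

Swapping SVC for sklearn's RandomForestClassifier, which exposes the same predict_proba interface, yields the INL RF variant.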
The INL SVM Lin run was ranked twelfth according to the main measure (Mean Average Precision), but obtained better ranks on the other measures: it is ranked fifth according to the Mean Reciprocal Rank and second according to the Mean Precision@1. These ranks suggest that our INL SVM Lin run would do well if the purpose of the task were finding the most check-worthy claim rather than finding all the check-worthy claims. However, we need to analyse the results more deeply to understand why.

Post-hoc experiments showed that the least important criterion is Technicality. This may be because the method we use to compute this feature was designed for large texts and is not appropriate for a single sentence. The most important criterion is Emotion. We can assume that a claim has a greater chance of being check-worthy if it is highly emotional: the speaker thinks less about what he says, and it is more likely that his claims are not fully accurate. We will check this hypothesis in future work. Table 2 presents the weights of the five features in our INL SVM Lin model. The weights of the features in our INL RF model are similar.

Table 2. Weights of the features used in our INL SVM Lin model.

Feature                  Weight
Controversy              -2.08e-05
Factuality and Opinion   -1.03e-05 and 1.03e-05
Technicality             2.22e-06
Emotion                  2.56e-05

5 Conclusion and perspectives for future work

In this paper we proposed three models for the CLEF 2018 CheckThat! challenge (Task 1, Check-Worthiness), which deals with evaluating the check-worthiness of statements in political debates. We used random forest and support vector machine classifiers to learn models that make use of the Information Nutritional Label features [3]. We show that these models perform fairly well on the Mean Precision@1 measure, which ranks our run that uses a support vector machine with a linear kernel 2nd out of the 16 submitted runs. We are currently working on a better calculation of the five features.
We would like to complete the representations of the texts with content-based components, as is done in [5]. While the objective there is different (virality prediction), some of the features may also be useful for the task tackled by CheckThat!. To further improve our models, we would also like to investigate the use of word embeddings, since we have successfully used this approach in other tasks [7], and it also worked well according to Hansen et al. [4] in the CheckThat! context. As future work, we will also take into consideration the sentences surrounding the one to be classified, as well as who said them. Finally, we will test these models on other datasets, such as social networks. For example, we will consider a Twitter-based dataset where each tweet would have a score indicating its worthiness for fact-checking, taking into account hashtags and tweet sources.

References

1. Berger, J., Milkman, K.L.: What makes online content viral? Journal of Marketing Research 49(2), 192–205 (2012)
2. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python (2009)
3. Fuhr, N., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., Mothe, J., Nejdl, W., et al.: An information nutritional label for online documents. In: ACM SIGIR Forum. vol. 51, pp. 46–66. ACM (2018)
4. Hansen, C., Hansen, C., Simonsen, J., Lioma, C.: The Copenhagen Team Participation in the Check-Worthiness Task of the Competition of Automatic Identification and Verification of Claims in Political Debates of the CLEF-2018 Fact Checking Lab. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Working Notes, Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 2018)
5. Hoang, T.B.N., Mothe, J.: Predicting information diffusion on Twitter – analysis of predictive features. Journal of Computational Science (2017), https://doi.org/10.1016/j.jocs.2017.10.010
6.
Justeson, J.S., Katz, S.M.: Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1), 9–27 (1995)
7. Mothe, J., Ramiandrisoa, F.: IRIT at TRAC. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC), 27th International Conference on Computational Linguistics, COLING 2018. International Committee on Computational Linguistics (2018)
8. Nakov, P., Barron-Cedeno, A., Elsayed, T., Suwaileh, R., Marquez, L., Zaghouani, W., Atanasova, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J.Y., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Lecture Notes in Computer Science (LNCS) 11018, Springer, Heidelberg, Germany (2018)
9. Nielsen, F.Å.: AFINN (Mar 2011), http://www2.imm.dtu.dk/pubdb/p.php?6010
10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)