Linguistic parameters and word embeddings for protest news detection in text

Chedi Bechikh Ali
LISI laboratory, Université de Carthage, Tunisie
chedi.bechikh@gmail.com

Abstract. We present our participation in tasks 1 and 2 of the ProtestNews lab at CLEF 2019. In task 1, the objective is to predict whether an article contains protest news. In task 2, we must decide whether a sentence contains a protest event. For both tasks, we used a supervised machine learning approach based on the logistic regression model. We combine the supervised learning algorithm with two different natural language processing techniques. The first relies on text processing with linguistic properties. The second expands the text with related terms using word embedding similarity.

Keywords: Linguistic parameters · compound nouns · word embeddings · supervised learning.

1 Introduction

This paper describes the participation of the LISI laboratory in the ProtestNews lab of the Conference and Labs of the Evaluation Forum (CLEF) 2019, which targets the detection of protest events in news articles. We submitted results obtained from different approaches. In this paper, we describe these approaches and discuss the results.

The ProtestNews lab proposed three tasks:
– Task 1 is a classification task: it consists of identifying which texts contain protest news.
– Task 2 consists of classifying whether a sentence contains an event trigger of a protest.
– Task 3 is an information extraction task: the objective is to extract the locations, participants and times of protest events.

The remainder of this paper is organized as follows. In section 2, we describe our methodology, based on a supervised classification algorithm. In section 3, we describe the linguistic characteristics that we use to process the documents. Then, in section 4, we present our document expansion approach. In section 5, the experimental results are presented.
Finally, we conclude and present future work.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

2 Methodology

This section describes the model used to classify the test data of both task 1 and task 2. The same model is applied to both tasks, with the same set of linguistic settings.

The overall architecture of our framework consists of two main phases: training and testing. In the training phase, the classifier learns from a set of labeled texts. In the testing phase, the trained classifier assigns labels to new, unlabeled documents. Each phase consists of the following steps: document preparation, feature extraction/selection, and classification.

We opt for a supervised approach based on a classification algorithm [5]. The classification algorithm was implemented using scikit-learn (footnote 1), a machine learning library for the Python programming language.

We notice that ProtestNews documents present different characteristics for each task. Documents in task 1 are composed of long sentences, which may cause drift in the classification process. Documents in task 2, on the other hand, are short, so they do not contain enough context for the training step. This can lead to two problems: word ambiguity and word mismatch between training data and test data.

To deal with these two tasks, we compared different classification algorithms, among them logistic regression, random forest, and naive Bayes. Preliminary experiments were carried out on the development data after a training step. Based on these findings, we decided to use logistic regression for the rest of the experiments because it gave the best results.
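The classifier comparison described in this section can be sketched as follows. This is a minimal illustration, not the code used for the submitted runs: the toy texts and labels are hypothetical, the hyperparameters are scikit-learn defaults apart from those shown, and raw unigram counts from CountVectorizer are assumed as features.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the lab data; 1 = protest, 0 = other.
train_texts = [
    "workers protest against wage cuts",
    "students march and protest in the capital",
    "farmers stage a strike and protest",
    "the team won the cricket match",
    "the company reported quarterly profits",
    "a new film opened in theatres",
]
train_labels = [1, 1, 1, 0, 0, 0]
dev_texts = ["teachers protest over pay", "the match ended in a draw"]
dev_labels = [1, 0]


def compare_classifiers(train_texts, train_labels, dev_texts, dev_labels):
    """Train each candidate on unigram counts and return its F1 on the dev set."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, classifier in candidates.items():
        # CountVectorizer defaults to unigrams with raw counts (no tf.idf).
        model = make_pipeline(CountVectorizer(), classifier)
        model.fit(train_texts, train_labels)
        scores[name] = f1_score(dev_labels, model.predict(dev_texts))
    return scores


scores = compare_classifiers(train_texts, train_labels, dev_texts, dev_labels)
for name, score in scores.items():
    print(f"{name}: F1 = {score:.4f}")
```

On real data, the algorithm with the highest development F1 would be kept for the remaining experiments, which is how logistic regression was selected here.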
3 Linguistic preprocessing

Before extracting the feature vectors, the data must be preprocessed with stop word removal and text lemmatization. We rely on linguistic processing because it led to good results in previous work on sentiment analysis [4].

– Stop word removal: We used an English stop word list provided by the Terrier information retrieval team of the School of Computing Science of Glasgow University. The list contains 733 stop words.
– Lemmatization: We chose to lemmatize document words to handle morphological variation and thus increase recall. Lemmatization transforms each word into a reduced form, the lemma, so that variations in number and gender are ignored. We rely on the part-of-speech tagger TreeTagger (footnote 2) to lemmatize the text.
– Named entity removal: We eliminate named entities (persons, places, organizations) from the text content, because they do not characterize protest news. Named entities lead to a drift in the classification process because they can appear in both protest news and regular news. In this study, we define named entities as the words annotated with the tag "NP" by TreeTagger.
– Compound noun (CN) annotation: CN can capture important concepts in the content of a document, such as events or concepts related to protest events (e.g., district violence, army's action, tense situation) [1]. To ensure that we extract CN related to protest events, we first extract CN from the documents classified as protest news in the training set. Then, we annotate each corresponding CN in the test data. In this work, we only use CN composed of two words, because in preliminary experiments they yielded better results than longer CN.

1 https://scikit-learn.org
2 https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

After text processing, we proceed to feature extraction. The aim of feature selection is to find the most relevant features for the classification task.
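The preprocessing steps of this section can be sketched as below. This is an illustrative sketch under stated assumptions, not the original implementation: the input is assumed to be already tagged as (word, tag, lemma) triples in the style of TreeTagger output, the stop word set is a tiny hypothetical stand-in for the Terrier list, the compound noun pattern (an adjective or noun followed by a noun) is an assumption, and joining the two words with an underscore is one possible way to annotate a CN.

```python
# Hypothetical stand-in for the 733-word Terrier stop word list.
STOP_WORDS = {"the", "a", "of", "in", "be", "and"}
NOUN_TAGS = {"NN", "NNS"}
MODIFIER_TAGS = NOUN_TAGS | {"JJ"}  # assumed pattern: (adjective|noun) + noun


def preprocess(tagged):
    """Return (lemma, tag) pairs without named entities (tag NP) or stop words."""
    kept = []
    for _word, tag, lemma in tagged:
        if tag == "NP":
            continue
        lemma = lemma.lower()
        if lemma in STOP_WORDS:
            continue
        kept.append((lemma, tag))
    return kept


def extract_compounds(pairs):
    """Collect two-word compound noun candidates from adjacent (lemma, tag) pairs."""
    return {
        (first, second)
        for (first, tag_first), (second, tag_second) in zip(pairs, pairs[1:])
        if tag_first in MODIFIER_TAGS and tag_second in NOUN_TAGS
    }


def annotate(pairs, compounds):
    """Return the lemmas, followed by one 'a_b' token per known compound found."""
    lemmas = [lemma for lemma, _tag in pairs]
    annotated = list(lemmas)
    for first, second in zip(lemmas, lemmas[1:]):
        if (first, second) in compounds:
            annotated.append(f"{first}_{second}")
    return annotated


# Hypothetical tagged sentence from a training document labeled as protest news.
protest_sentence = [
    ("Police", "NNS", "police"), ("dispersed", "VVD", "disperse"),
    ("the", "DT", "the"), ("tense", "JJ", "tense"),
    ("situation", "NN", "situation"), ("in", "IN", "in"),
    ("Delhi", "NP", "Delhi"),
]
compounds = extract_compounds(preprocess(protest_sentence))

# Hypothetical tagged sentence from the test data.
test_sentence = [
    ("A", "DT", "a"), ("tense", "JJ", "tense"),
    ("situation", "NN", "situation"), ("persists", "VVZ", "persist"),
]
annotated = annotate(preprocess(test_sentence), compounds)
print(annotated)  # ['tense', 'situation', 'persist', 'tense_situation']
```

Because the compounds are collected only from protest-labeled training documents, a CN token appears in a test document only when it was previously seen in a protest context.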
As features, we used unigrams because they gave the best results. We did not apply any weighting scheme, because the tf.idf scheme degraded classification performance in preliminary experiments on the development data.

4 Document expansion with word embeddings

To deal with the term mismatch problem, we decided to expand documents with the most similar word for each token. Previous work has shown that pretrained word embeddings have a positive impact on different natural language processing tasks: word sense disambiguation, relational similarity, and semantic relatedness [2]. We hypothesize that adding similar or related terms can help to improve recall and thus the overall performance of the classification process.

For this approach we trained two word embeddings [3]:
– wiki-emb: word embeddings trained on the text8 dataset (footnote 3), which is a sample of a Wikipedia dump (footnote 4).
– protest-emb: word embeddings trained on the India training dataset given for task 1.

We chose to train two word embeddings on different datasets to see whether a specialized dataset has an impact on classification performance, in comparison with the text8 dataset, which is a cleaned dataset of 100 megabytes. For the word embedding training, we rely on the Gensim Python library.

3 http://mattmahoney.net/dc/textdata.html
4 English Wikipedia dump of Mar 3, 2006

5 Experiments and results

We studied the performance of the proposed approach through different experiments using different settings and processing:
– Run1: applying lemmatization and stop word removal to the training and test sets.
– Run2: combining lemmatization, stop word removal and named entity removal on both the training and test sets.
– Run3: expanding every word in the sentences with the most similar word from the protest-emb word embeddings.
– Run4: expanding every word in the sentences with the most similar word from the wiki-emb word embeddings.
– Run5: combining the Run2 settings with the annotation of all CN.

Table 1 presents the official submitted runs; some runs have results only for task 2. The first run (lemmatization and stop word removal) obtains 0.7612 for task 1, which corresponds to our second best run. The best result for task 1 was achieved when we expanded the content of the text with bigrams extracted from the same span of text (Run5). This run does not obtain the best overall result, but it gives the best results for task 1 (China and India) and the best result for task 2 (India). We note a degradation of the results for task 2 (China). The best overall result for task 2 is obtained by a simple approach that consists of lemmatizing the text and eliminating stop words.

In a preliminary study phase with development data, we found that expanding text with the most similar word is only beneficial for task 2 and degrades results for task 1. We therefore decided to study the impact of this approach only for task 2. We notice that the best overall results are obtained when training the word embeddings on the training set. The best result for task 2 is obtained with word embeddings trained on the same data, but the best result for China data is obtained when we used word embeddings from another, general corpus. This can be explained by the fact that the text8 dataset is bigger and contains more tokens than the task 1 training dataset.

Table 1. Classification results for task 1 and task 2 based on the F1 measure

model  task1 test  china task1  avg task1  task2 test  china task2  avg task2  avg task
Run1   0.7612      0.3846       0.5729     0.5657      0.4788       0.5223     0.5476
Run2   0.7612      0.4418       0.6015     0.4727      0.3960       0.4343     0.5179
Run3   -           -            -          0.5692      0.4615       0.5150     -
Run4   -           -            -          0.5748      0.4143       0.4945     -
Run5   0.7676      0.5032       0.6354     0.5877      0.3086       0.44819    0.5418

Our official final run was the Run6 and it was ranked sixth among 12 teams.
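The document expansion used in Run3 and Run4 (section 4) can be sketched as follows. This is a minimal sketch under assumptions, not the original code: the nearest-neighbour lookup is abstracted as a function, the toy neighbour table is hypothetical and stands in for a trained embedding model, and placing each added word right after its source token is an arbitrary choice (with unigram features, token order does not matter).

```python
def expand_tokens(tokens, most_similar):
    """Append to each token its most similar word, when the lookup finds one."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        neighbour = most_similar(token)
        if neighbour is not None:
            expanded.append(neighbour)
    return expanded


# Hypothetical nearest-neighbour table standing in for trained word embeddings.
TOY_NEIGHBOURS = {"protest": "demonstration", "police": "officer"}


def toy_most_similar(token):
    """Return the nearest neighbour of a token, or None if it is out of vocabulary."""
    return TOY_NEIGHBOURS.get(token)


# With Gensim, the lookup could instead wrap KeyedVectors.most_similar, e.g.
#   def gensim_most_similar(token, wv):
#       return wv.most_similar(token, topn=1)[0][0] if token in wv else None
# where wv is the .wv attribute of a trained Word2Vec model (assumed here).

expanded = expand_tokens(["police", "stop", "protest"], toy_most_similar)
print(expanded)  # ['police', 'officer', 'stop', 'protest', 'demonstration']
```

Tokens missing from the embedding vocabulary are left unexpanded, which matters for a small specialized corpus such as the task 1 training set.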
With our official run, we achieved our best results in task 1 and in task 2 with India data. A degradation of performance was observed for task 2 with China data. This can be explained by a term mismatch between the CN extracted from the training set, which consists of India data, and the CN in the China data: the CN in the India data represent other concepts than those in the China dataset.

6 Conclusion

This paper describes our participation in the ProtestNews detection lab at CLEF 2019. The aim of this work is to decide whether a text contains protest news, with the broader objective of developing text classification tools. For this purpose, we used a classifier based on the logistic regression algorithm. As a first step, we applied linguistic preprocessing to the data. Then, we used word embeddings to expand the text with the most similar words. We also proposed to add CN extracted from the same text.

This work is still in progress and needs further investigation. For future work, we plan to use deep neural networks, since they have achieved good results on other NLP tasks.

References

1. Bechikh-Ali, C., Haddad, H., Slimani, Y.: Empirical evaluation of compounds indexing for Turkish texts. Computer Speech & Language 56, 95–106 (2019)
2. Li, J., Jurafsky, D.: Do multi-sense embeddings improve natural language understanding? In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. pp. 1722–1732 (2015)
3. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems: 27th Annual Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, United States. December 5-8, 2013. pp. 3111–3119 (2013)
4. Mulki, H., Ali, C.B., Haddad, H., Babaoglu, I.: Tw-StAR at SemEval-2018 task 1: Preprocessing impact on multi-label emotion classification.
In: Proceedings of the 12th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT, New Orleans, Louisiana, June 5-6, 2018. pp. 167–171 (2018)
5. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, EMNLP 2002, Philadelphia, PA, USA, July 6-7, 2002 (2002)