TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 111-115

INGEOTEC solution for Task 4 in TASS'18 competition
Solución del grupo INGEOTEC para la tarea 4 de la competencia TASS'18

Daniela Moctezuma1, José Ortiz-Bejar3, Eric S. Tellez2, Sabino Miranda-Jiménez2, Mario Graff2
1 CONACYT-CentroGEO, 2 CONACYT-INFOTEC, 3 UMSNH
dmoctezuma@centrogeo.edu.mx, jortiz@umich.mx, eric.tellez@infotec.mx, sabino.miranda@infotec.mx, mario.graff@infotec.mx

Abstract: This paper describes a classification system based on our generic classifier B4MSA, document vectors computed from pre-trained Spanish word embeddings, and a set of specialized resources to detect aggressiveness and affectivity in text. These resources, along with the official training set, were trained independently and then combined into a single model using Genetic Programming with our EvoMSA classifier. Using this approach, our system achieved this year's best performance on two of the three test corpora in Task 4.
Keywords: text categorization, genetic programming, safe-unsafe classification of news.
ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

The news classification problem is closely related to traditional text classification applications such as topic classification (e.g., classifying a news-like text into sports, politics, or economy). Any kind of news categorization can reflect the problems of society in several domains. For instance, discovering negative news over time can give leaders and decision-makers a reference for acting on the current situation (Wang et al., 2018).

This year, the TASS competition (Martínez-Cámara et al., 2018) proposed a new task (Task 4), concerned with the emotional categorization of news articles. To determine whether a news article is SAFE or UNSAFE, a corpus was built from the RSS feeds of a number of online newspapers in different varieties of Spanish (Argentina, Chile, Colombia, Cuba, Spain, USA, Mexico, Peru, and Venezuela). For the purpose of classifying these news items as SAFE or UNSAFE, only the headlines were provided.

Task 4 specifies two sub-tasks: Subtask-1, monolingual classification, and Subtask-2, multilingual classification. The main difference between them is that, in the first case, the algorithm must be trained and tested on the same Spanish variety, whereas in the second case the algorithm can be trained on one Spanish variety and tested on a different one. For more details about Task 4, see (Martínez-Cámara et al., 2018).

In this paper, the solution proposed by the INGEOTEC team is presented.
This solution is based on our B4MSA classifier and a number of specialized resources for aggressiveness and affectivity detection. Finally, our EvoMSA classifier, based on Genetic Programming, is used to combine all the resources and the available training data. It is worth mentioning that our scheme for building our systems for Task 1 (monolingual and cross-lingual polarity classification) and Task 4 (this one) is quite similar; of course, we use each task's training set to learn and optimize for that task.

The manuscript is organized as follows. Section 2 describes our solution in detail. Section 3 presents the results we achieved in Task 4. Finally, Section 4 gives our conclusions.

2 System Description

As mentioned before, we use a combination of several sub-systems to tackle the (un)safeness categorization of the given news. First, we use our generic text classifier B4MSA (Tellez et al., 2017) and a vocabulary of pre-trained word vectors from FastText (Mikolov et al., 2013). We also use two domain-specific lexicon resources, one designed to detect aggressiveness and the other to detect emotions in text. All these sub-systems and resources are combined with our genetic programming scheme (EvoMSA) over the decision functions of several classifiers built on top of these resources. The rest of this section details the use of these sub-systems and resources.

2.1 EvoMSA

EvoMSA¹ has two stages. In the first one, B4MSA (Tellez et al., 2017) uses SVMs to predict the decision function values of a given text. In the second one, EvoDAG (Graff et al., 2016; Graff et al., 2017), a classifier based on Genetic Programming with semantic operators, makes the final prediction through a combination of all the decision function values.

Furthermore, EvoMSA is open to being fed with different models, such as B4MSA (Tellez et al., 2018), lexicon-based models, and EvoDAG. It is a two-phase architecture for solving classification tasks, see Figure 1. In the first phase, a set of different classifiers is trained with the datasets provided by the contests and with others as additional knowledge, i.e., whatever knowledge can be integrated into EvoMSA. In this case, we used tailor-made lexicons for identifying aggressiveness, positiveness, and negativeness in texts; see Section 2.2 for more details. The precise configuration of our benchmarked system is described in Section 3.

2.1.1 B4MSA

B4MSA² (a.k.a. µTC) is a minimalistic system able to tackle general text classification tasks independently of domain and language; for complete details of the model see (Tellez et al., 2018). Roughly speaking, µTC creates text classifiers by searching for the best models in a given configuration space. A configuration consists of instructions that enable several preprocessing functions, a combination of tokenizers taken from the power set of several possible ones (character q-grams, n-word grams, and skip-grams), and a weighting scheme such as TF, TF-IDF, or several distributional schemes. µTC uses an SVM (Support Vector Machine) classifier with a linear kernel. A text transformation feature can have binary options (yes/no) or ternary options (group/delete/none). Tokenizers denote how texts must be split after each text transformation has been applied; all generated tokens are part of the text representation. Table 1 shows details of the preprocessing, tokenizers, and term weighting scheme.

2.2 Lexicon-based models

To introduce extra knowledge into our approach, we used two lexicon-based models. The first, the Up-Down model, counts affective words; that is, it produces two indexes for a given text: one for positive words and another for negative words. We created the positive-negative lexicon from several Spanish affective lexicons (de Albornoz, Plaza, and Gervás, 2012; Sidorov et al., 2013; Perez-Rosas, Banea, and Mihalcea, 2012); we also enriched this lexicon with the Spanish WordNet (Fernández-Montraveta, Vázquez, and Fellbaum, 2008).
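The Up-Down model described above can be pictured as a simple word counter over two lexicons. The following is a minimal sketch of that idea; the tiny lexicons here are illustrative placeholders, not the actual resources built by the authors:

```python
# Sketch of an Up-Down-style lexicon model: count how many tokens of a
# text appear in a positive lexicon and in a negative lexicon.
# POSITIVE and NEGATIVE are toy placeholders for the real resources.

POSITIVE = {"bueno", "feliz", "excelente", "paz"}
NEGATIVE = {"malo", "triste", "violencia", "crimen"}

def up_down_indexes(text: str) -> tuple[int, int]:
    """Return (positive_count, negative_count) for a given text."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos, neg

print(up_down_indexes("ola de violencia y crimen"))  # (0, 2)
```

These two counts are exactly the kind of per-text indexes that are later fed, together with the other decision values, into the second stage.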
The other, a Bernoulli model, was created to predict aggressiveness using a lexicon of aggressive words; we created this lexicon by gathering common aggressive words in Spanish. These indexes and this prediction, along with B4MSA's (µTC) outputs, are the input to the EvoDAG system.

¹ https://github.com/INGEOTEC/EvoMSA
² https://github.com/INGEOTEC/microTC

Figure 1: Architecture of our EvoMSA framework

Table 1: Example of a set of configurations for text modeling

Text transformation    Value
remove diacritics      yes
remove duplicates      yes
remove punctuation     yes
emoticons              group
lowercase              yes
numbers                group
urls                   group
users                  group
hashtags               none
entities               none
Term weighting
TF-IDF                 yes
Entropy                no
Tokenizers
n-words                {1, 2}
q-grams                {2, 3, 4}
skip-grams             —

2.3 EvoDAG

EvoDAG³ (Graff et al., 2016; Graff et al., 2017) is a Genetic Programming system specifically tailored to tackle classification problems on very large and high-dimensional vector spaces. EvoDAG uses the principles of Darwinian evolution to create models represented as directed acyclic graphs (DAGs). Due to lack of space, we refer the reader to (Graff et al., 2016), where EvoDAG is described in depth. It is important to mention that EvoDAG has no information about whether an input Xi comes from a particular class decision function; consequently, from EvoDAG's point of view, all inputs are equivalent.

³ https://github.com/mgraffg/EvoDAG
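As a rough illustration of the idea of searching for a good combination of per-model decision-function values, the toy sketch below samples small arithmetic programs over made-up scores and keeps the most accurate one. The real EvoDAG evolves directed acyclic graphs with genetic operators over populations; this random search, with invented decision values, only conveys the flavor:

```python
# Toy stand-in for EvoDAG's search: combine per-model decision-function
# values with small arithmetic programs and keep the best-scoring one.
# DECISIONS and LABELS are made-up numbers, not data from the paper.
import random

# Decision-function values from three hypothetical sub-models (rows = texts);
# gold labels: +1 = UNSAFE, -1 = SAFE.
DECISIONS = [(-0.8, -0.2, 0.1), (0.9, 0.4, -0.3),
             (-0.5, -0.9, -0.2), (0.3, 0.7, 0.8)]
LABELS = [-1, 1, -1, 1]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "max": max}

def random_program(n_inputs, rng):
    """A program is (op_name, input_index_a, input_index_b)."""
    return (rng.choice(list(OPS)), rng.randrange(n_inputs), rng.randrange(n_inputs))

def accuracy(program, decisions, labels):
    op, i, j = program
    preds = [1 if OPS[op](row[i], row[j]) > 0 else -1 for row in decisions]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
best = max((random_program(3, rng) for _ in range(200)),
           key=lambda prog: accuracy(prog, DECISIONS, LABELS))
print(best, accuracy(best, DECISIONS, LABELS))
```

The point is the interface, not the search strategy: the second stage only sees a matrix of decision values, one column per sub-model, which is why all inputs look equivalent to it.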
2.4 FastText

FastText (Joulin et al., 2017) is a tool that creates text classifiers and learns a semantic vocabulary from a given collection of documents; this vocabulary is represented as a collection of high-dimensional vectors, one per word. It is worth mentioning that FastText is robust to lexical errors, since out-of-vocabulary words are represented as the combination of the vectors of their sub-words, that is, a kind of character q-gram limited to the context of a word. Nonetheless, the main reason for including FastText in our system is to overcome the small training set that comes with Task 4, which we do by using the pre-trained vectors computed on the Spanish content of Wikipedia (Bojanowski et al., 2016). We use these vectors to create document vectors, one vector per document. A document vector is, roughly speaking, a linear combination of the word vectors that compose the document, collapsed into a single vector of the same dimension. These document vectors were used as input to an SVM with a linear kernel, and we use its decision function as input to EvoMSA.

3 Experiments and results

In order to test all the approaches in Task 4, the SANSE (Spanish brANd Safe Emotion) corpus was established. The SANSE corpus is composed of 2,000 news headlines written in Spanish across several Spanish-speaking countries: Spain, Mexico, Cuba, Chile, Colombia, Argentina, Venezuela, Peru, and the USA.

In the case of Subtask-1, Monolingual Classification, the goal was to train with one Spanish variety, e.g., Mexico, and then test with the same variety. Our results, together with those of the best five teams ranked by the Macro-F1 metric, are presented in Table 2.

Table 2: Subtask-1: Monolingual Classification results: SANSE-TEST-500

Team's name    Macro-F1    Accuracy
INGEOTEC       0.795       0.802
ELiRF-UPV      0.790       0.800
rbnUGR         0.774       0.786
MeaningCloud   0.767       0.776
SINAI          0.728       0.742
lone wolf      0.700       0.718
TNT-UA-WFU     0.492       0.518

Table 3 shows the best five teams on the SANSE-TEST-13152 corpus. With this corpus, our team reached the third position, with a Macro-F1 of 0.866 and an accuracy of 0.871.

Table 3: Subtask-2: Multilingual Classification: SANSE-TEST-13152

Team's name    Macro-F1    Accuracy
ELiRF          0.883       0.893
rbnUGR         0.873       0.888
INGEOTEC       0.866       0.871
MeaningCloud   0.793       0.801
SINAI          0.773       0.793
TNT-UA-WFU     0.544       0.552
lone wolf      0           0

Regarding the Multilingual Classification subtask, Table 4 reports the results obtained by the best five teams, ranked by the Macro-F1 metric.

Table 4: Subtask-2: Multilingual Classification: SANSE-408

Team's name    Macro-F1    Accuracy
INGEOTEC       0.719       0.737
ELiRF-UPV      0.699       0.722
rbnUGR         0.683       0.631
MeaningCloud   0.651       0.658
ITAINNOVA      0.617       0.575

In a nutshell, across the three datasets, our solution reached the highest Macro-F1 on two corpora and a middle position on the other.

4 Conclusions

Our solution based on Genetic Programming reached the best results in Subtask-1, Monolingual Classification (SANSE-TEST-500), and in Subtask-2, Multilingual Classification, on the SANSE-408 corpus. On the largest corpus of Subtask-2 (SANSE-TEST-13152), our system reached the third-best team solution.

Our approach, EvoMSA, is able to deal with several data sources through an ensemble of decision functions, one per source of data, such as extra knowledge coded into lexicons for sentiment analysis and aggressiveness identification, and semantic information from word vectors.

Acknowledgements

The authors would like to thank Laboratorio Nacional de GeoInteligencia for partially funding this work.

References

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

de Albornoz, J. C., L. Plaza, and P. Gervás. 2012. SentiSense: An easily scalable concept-based affective lexicon for sentiment analysis. In Proceedings of LREC 2012, pages 3562-3567.
Fernández-Montraveta, A., G. Vázquez, and C. Fellbaum. 2008. The Spanish version of WordNet 3.0. Text Resources and Lexical Knowledge. Mouton de Gruyter, pages 175-182.

Graff, M., E. S. Tellez, S. Miranda-Jiménez, and H. J. Escalante. 2016. EvoDAG: A semantic genetic programming Python library. In 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1-6, November.

Graff, M., E. S. Tellez, H. J. Escalante, and S. Miranda-Jiménez. 2017. Semantic Genetic Programming for Sentiment Analysis. In O. Schütze, L. Trujillo, P. Legrand, and Y. Maldonado, editors, NEO 2015, number 663 in Studies in Computational Intelligence. Springer International Publishing, pages 43-65. DOI: 10.1007/978-3-319-44003-3_2.

Joulin, A., E. Grave, P. Bojanowski, and T. Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics, April.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Perez-Rosas, V., C. Banea, and R. Mihalcea. 2012. Learning sentiment lexicons in Spanish. In LREC, volume 12, page 73.

Sidorov, G., S. Miranda-Jiménez, F. Viveros-Jiménez, A. Gelbukh, N. Castro-Sánchez, F. Velásquez, I. Díaz-Rangel, S. Suárez-Guerra, A. Treviño, and J. Gordon. 2013. Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence - Volume Part I, MICAI'12, pages 1-14, Berlin, Heidelberg. Springer-Verlag.

Tellez, E. S., S. Miranda-Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, and O. S. Siordia. 2017. A simple approach to multilingual polarity classification in Twitter. Pattern Recognition Letters, 94:68-74.

Tellez, E. S., D. Moctezuma, S. Miranda-Jiménez, and M. Graff. 2018. An automated text categorization framework based on hyperparameter optimization. Knowledge-Based Systems, 149:110-123.

Wang, B., L. Gao, T. An, M. Meng, and T. Zhang. 2018. A method of educational news classification based on emotional dictionary. In 2018 Chinese Control And Decision Conference (CCDC), pages 3547-3551, June.