TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 45-49

INGEOTEC solution for Task 1 in TASS'18 competition
Solución del grupo INGEOTEC para la tarea 1 de la competencia TASS'18

Daniela Moctezuma (1), José Ortiz-Bejar (3), Eric S. Tellez (2), Sabino Miranda-Jiménez (2), Mario Graff (2)
(1) CONACYT-CentroGEO, (2) CONACYT-INFOTEC, (3) UMSNH
dmoctezuma@centrogeo.edu.mx, jortiz@umich.mx, eric.tellez@infotec.mx, sabino.miranda@infotec.mx, mario.graff@infotec.mx

Resumen: Sentiment analysis on social networks consists of analyzing the messages that users publish on those networks and determining their polarity (e.g., positive, negative, or a similar but finer-grained range of sentiments). Every language has characteristics that can hinder polarity analysis, such as the natural ambiguity of pronouns, synonymy, and polysemy; additionally, since social networks are a rather informal communication medium, messages usually carry a large number of errors and lexical variants that hinder analysis with traditional approaches. This paper presents the participation of the INGEOTEC team in TASS'18. The proposed solution is based on several subsystems orchestrated by our genetic programming system EvoMSA.
Palabras clave: automatic text categorization, genetic programming, sentiment analysis, polarity classification

Abstract: Sentiment analysis over social networks determines the polarity of messages published by users. In this sense, a message can be classified as positive or negative, or under a similar scheme with more fine-grained labels. Each language has characteristics that hinder the correct determination of sentiment, such as the natural ambiguity of pronouns, synonymy, and polysemy.
Additionally, given that messages in social networks are quite informal, they tend to be plagued with lexical errors and lexical variations that make it difficult to determine sentiment using traditional approaches. This paper describes our participating system in TASS'18. Our solution is composed of several subsystems, independently collected and trained, combined with our EvoMSA genetic programming system.
Keywords: text categorization, genetic programming, sentiment analysis, polarity classification

ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

Sentiment Analysis is an active research area that performs the computational analysis of people's feelings or beliefs expressed in texts, such as emotions, opinions, attitudes, and appraisals, among others (Liu and Zhang, 2012). In social media, people share their opinions and sentiments. In addition to the inherent polarity, these feelings also have an intensity. As in previous years, TASS'18 organizes a task on four-level polarity classification in tweets. This year, the InterTASS corpus has been expanded with two more subsets, namely a dataset containing tweets from Costa Rica and another one coming from Peruvian tweeters. Therefore, there are three varieties of the Spanish language: Spain (ES), Peru (PE), and Costa Rica (CR). Moreover, several subtasks are also introduced:

• Subtask-1: Monolingual ES: training and test using the InterTASS ES dataset.
• Subtask-2: Monolingual PE: training and test using the InterTASS PE dataset.
• Subtask-3: Monolingual CR: training and test using the InterTASS CR dataset.
• Subtask-4: Cross-lingual: training on one dataset and testing on a different one.

These subtasks are mostly based on separating language varieties between the training and test datasets. Martínez-Cámara et al. (2018) detail TASS'18 Task 1 and its associated datasets.

This paper details the Task 1 solution of our INGEOTEC team. Our approach consists of a number of subsystems whose individual predictions are combined through a non-linear expression using our EvoMSA genetic programming system. It is worth mentioning that we tackle both Task 1 (this one) and Task 4 (good or bad news) with a similar scheme, that is, the same resources, the same portfolio of algorithms, and the same hyper-parameters; of course, we use each task's training set to learn and optimize for that task.

The manuscript is organized as follows. Section 2 details the subsystems that compose our solution. Section 3 presents our results, and finally, Section 4 summarizes and concludes this report.

2 System Description

Our participating system is a combination of several sub-systems that tackle the polarity categorization of tweets independently; all these independent predictions are then combined using our EvoMSA genetic programming system. The rest of this section details these sub-systems and resources.

2.1 EvoMSA

EvoMSA (https://github.com/INGEOTEC/EvoMSA) is a multilingual sentiment analysis system based on genetic text classifiers, domain-specific resources, and a genetic programming combiner of the parts. The first one, namely B4MSA (Tellez et al., 2017), performs a hyper-parameter optimization over a large search space of possible models. It uses a meta-heuristic to solve a combinatorial optimization problem over the configuration space; the selected model is described in Table 1. On the other hand, EvoDAG (Graff et al., 2016; Graff et al., 2017) is a classifier based on Genetic Programming with semantic operators, which makes the final prediction through a combination of all the decision-function values. Domain-specific resources can also be added under the same scheme. Figure 1 shows the architecture of EvoMSA. In the first part, a set of different classifiers is trained with the datasets provided by the contest and with other resources as additional knowledge, i.e., the idea is to be able to integrate any other kind of related knowledge into the model. In this case, we used tailor-made lexicons for the aggressiveness task: aggressive words and affective words (positive and negative); see Section 2.2 for more details. The precise configuration of our benchmarked system is described in Section 3.

Figure 1: Architecture of our EvoMSA framework

Table 1: Example of the set of configurations for text modeling

  Text transformation   Value
  remove diacritics     yes
  remove duplicates     yes
  remove punctuation    yes
  emoticons             group
  lowercase             yes
  numbers               group
  urls                  group
  users                 group
  hashtags              none
  entities              none

  Term weighting
  TF-IDF                yes
  Entropy               no

  Tokenizers
  n-words               {1, 2}
  q-grams               {2, 3, 4}
  skip-grams            —

2.2 Lexicon-based models

To introduce extra knowledge into our approach, we used two lexicon-based models. The first, the Up-Down model, counts affective words; that is, it produces two indexes for a given text: one for positive words and another for negative words. We created the positive-negative lexicon from several Spanish affective lexicons (de Albornoz, Plaza, and Gervás, 2012; Sidorov et al., 2013; Perez-Rosas, Banea, and Mihalcea, 2012); we also enriched this lexicon with the Spanish WordNet (Fernández-Montraveta, Vázquez, and Fellbaum, 2008).
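The Up-Down counting just described reduces a text to two integers. A minimal sketch, where the word lists are illustrative placeholders rather than the actual lexicon (which is built from the cited Spanish affective resources and the Spanish WordNet):

```python
# Minimal sketch of the Up-Down model: two indexes per text, one counting
# positive-lexicon hits and one counting negative-lexicon hits.
# The word sets below are illustrative only, not the real lexicon.
POSITIVE = {"feliz", "bueno", "excelente"}
NEGATIVE = {"triste", "malo", "horrible"}

def up_down_indexes(text):
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg

print(up_down_indexes("Un día feliz pero con final triste"))  # (1, 1)
```

These two indexes are part of the feature set that EvoDAG later combines with the other subsystems' outputs.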
The other, a Bernoulli model, was created to predict aggressiveness using a lexicon of aggressive words; we built this lexicon by gathering common aggressive words in Spanish. These indexes and this prediction, along with B4MSA's (µTC) outputs, are the input to the EvoDAG system.

2.3 EvoDAG

EvoDAG (https://github.com/mgraffg/EvoDAG) is a Genetic Programming system specifically tailored to tackle classification problems on very large and high-dimensional vector spaces. EvoDAG uses the principles of Darwinian evolution to create models represented as a directed acyclic graph (DAG). Due to lack of space, we refer the reader to (Graff et al., 2016), where EvoDAG is broadly described. It is important to mention that EvoDAG has no information regarding whether input Xi comes from a particular class decision function; consequently, from EvoDAG's point of view, all inputs are equivalent.

2.4 FastText

FastText (Joulin et al., 2017) is a tool to create text classifiers and learn a semantic vocabulary from a given collection of documents; this vocabulary is represented with a collection of high-dimensional vectors, one per word. It is worth mentioning that FastText is robust to lexical errors, since out-of-vocabulary words are represented as the combination of vectors of sub-words, that is, a kind of character q-grams limited in context to words. Nonetheless, the main reason for including FastText in our system is to overcome the small training set that comes with Task 4, which is accomplished using the pre-trained vectors computed on the Spanish content of Wikipedia (Bojanowski et al., 2016). We use these vectors to create document vectors, one vector per document. A document vector is, roughly speaking, a linear combination of the word vectors that compose the document into a single vector of the same dimension. These document vectors were used as input to an SVM with a linear kernel, and we use the decision function as input to EvoMSA.
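The document-vector construction just described can be sketched as follows; the toy two-dimensional word vectors stand in for the pre-trained Spanish-Wikipedia fastText vectors, so all names and values here are illustrative assumptions:

```python
import numpy as np

# Toy stand-ins for pre-trained word vectors; the actual system uses
# fastText vectors trained on the Spanish Wikipedia (Bojanowski et al., 2016).
WORD_VECS = {
    "buen": np.array([0.9, 0.1]),
    "dia": np.array([0.4, 0.4]),
    "malo": np.array([-0.8, 0.2]),
}

def doc_vector(tokens, vecs, dim=2):
    """Average the word vectors of a document into one vector of the same
    dimension; unknown words are simply skipped in this sketch."""
    known = [vecs[t] for t in tokens if t in vecs]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(doc_vector(["buen", "dia"], WORD_VECS))  # [0.65 0.25]
```

In the full system, these document vectors feed a linear-kernel SVM, and the SVM's decision-function values become one of EvoMSA's inputs.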
3 Experiments and results

The following tables show the performance of our system on the InterTASS dataset. We also show the performance of a number of selected systems to provide context for our solution. The tables always show the top-k best results that include our system, i.e., we always show the best systems, but we sometimes omit results ranked below ours. All our results are marked in bold to improve readability.

Please recall that the InterTASS dataset is split according to each sub-task. Table 2 shows the performance on the monolingual datasets. For instance, the results of training with Spain-InterTASS and testing on tweets generated by people from Spain are shown in Table 2a, where we reached the seventh position out of nine participating teams. For the other Spanish varieties, Tables 2b and 2c show the results of training with the CR and PE subsets, respectively. Our team achieved the fourth position among eight teams in CR, and the third among eight participants in PE.

Table 2: Monolingual subtasks

(a) Subtask-1, Spain dataset (ES)

  Team's name     Macro-F1  Accuracy
  ELiRF-UPV       0.503     0.612
  RETUYT-InCo     0.499     0.549
  Atalaya         0.476     0.544
  UNSA dajo       0.472     0.6
  UNSA UCSP DaJo  0.472     0.6
  MEFaMAF         0.46      0.55
  INGEOTEC        0.445     0.53
  ABBOT           0.409     0.482
  ITAINNOVA       0.383     0.433

(b) Subtask-2, Costa Rica's dataset (CR)

  Team's name     Macro-F1  Accuracy
  RETUYT-InCo     0.504     0.537
  ELiRF-UPV       0.482     0.561
  Atalaya         0.475     0.582
  INGEOTEC        0.474     0.522
  MEFaMAF         0.418     0.512
  ABBOT           0.408     0.46

In contrast, the results of training with the ES subset and testing on the ES, CR, and PE subsets are presented in Tables 3a, 3b, and 3c, respectively.
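The ranking metric in these tables, Macro-F1, is the unweighted mean of per-class F1 scores, so each of the four polarity levels weighs equally regardless of its frequency. A self-contained sketch, with labels illustrative of a four-level polarity scheme:

```python
# Macro-F1: unweighted mean of per-class F1, computed from scratch.
def macro_f1(y_true, y_pred, labels):
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative gold and predicted four-level polarity labels.
y_true = ["P", "N", "NEU", "P", "NONE", "N"]
y_pred = ["P", "N", "P",   "P", "NONE", "NEU"]
print(round(macro_f1(y_true, y_pred, ["P", "N", "NEU", "NONE"]), 3))  # 0.617
```

Because the mean is unweighted, a class with zero F1 (here NEU) drags the score down as much as a frequent class would, which is why the competition tables can diverge noticeably between Macro-F1 and Accuracy.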
(c) Subtask-3, Peruvian dataset (PE)

  Team's name     Macro-F1  Accuracy
  RETUYT-InCo     0.472     0.494
  Atalaya         0.462     0.451
  INGEOTEC        0.439     0.447
  ELiRF-UPV       0.438     0.461
  UNSA dajo       0.413     0.319

Our team achieved the best result in the cross-lingual task with Peruvian tweets, and also reached the second-best results on the ES (Spain) and CR (Costa Rica) subsets. The performance of our method in the cross-lingual Task 4 is shown in Table 3. For instance, Table 3a shows our performance on the ES subset; here, we achieved the second position among three teams. In general, the number of participants was smaller than in the monolingual tasks. Table 3b shows the ranking of the four participating teams on the Peruvian subset, where we reached the best Macro-F1 score. Finally, we reached the second rank on the Costa Rica subset, just below RETUYT-InCo.

4 Conclusions

It is worth mentioning that we used the same scheme, explained in Section 2, to tackle all subtasks. Note that EvoMSA allows the training set to be changed as specified for each subtask, so we can optimize the pipeline for each particular objective.

Regarding the obtained results, our approach performs better when it is trained with tweets from Spain and tested on other Spanish varieties. However, it is not clear whether this performance is due to the data or to an inherent feature of the Spanish variation.

Acknowledgements

The authors would like to thank the Laboratorio Nacional de GeoInteligencia for partially funding this work.
Table 3: Performance comparison of the cross-lingual (subtask-4) benchmark over three different test corpora

(a) Spain's variation (ES)

  Team's name     Macro-F1  Accuracy
  RETUYT-InCo     0.471     0.555
  INGEOTEC        0.445     0.53
  Atalaya         0.441     0.485

(b) Peruvian variation (PE)

  Team's name     Macro-F1  Accuracy
  INGEOTEC        0.447     0.506
  RETUYT-InCo     0.445     0.514
  Atalaya         0.438     0.523
  ITAINNOVA       0.367     0.382

(c) Costa Rica's variation (CR)

  Team's name     Macro-F1  Accuracy
  RETUYT-InCo     0.476     0.569
  INGEOTEC        0.454     0.538
  Atalaya         0.453     0.565
  ITAINNOVA       0.409     0.440

References

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

de Albornoz, J. C., L. Plaza, and P. Gervás. 2012. SentiSense: An easily scalable concept-based affective lexicon for sentiment analysis. In Proceedings of LREC 2012, pages 3562-3567.

Fernández-Montraveta, A., G. Vázquez, and C. Fellbaum. 2008. The Spanish version of WordNet 3.0. In Text Resources and Lexical Knowledge, pages 175-182. Mouton de Gruyter.

Graff, M., E. S. Tellez, S. Miranda-Jiménez, and H. J. Escalante. 2016. EvoDAG: A semantic genetic programming Python library. In 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1-6, November.

Graff, M., E. S. Tellez, H. J. Escalante, and S. Miranda-Jiménez. 2017. Semantic Genetic Programming for Sentiment Analysis. In O. Schütze, L. Trujillo, P. Legrand, and Y. Maldonado, editors, NEO 2015, number 663 in Studies in Computational Intelligence, pages 43-65. Springer International Publishing. DOI: 10.1007/978-3-319-44003-3_2.

Joulin, A., E. Grave, P. Bojanowski, and T. Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics, April.

Liu, B. and L. Zhang. 2012. A Survey of Opinion Mining and Sentiment Analysis, pages 415-463. Springer US, Boston, MA.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Perez-Rosas, V., C. Banea, and R. Mihalcea. 2012. Learning sentiment lexicons in Spanish. In LREC, volume 12, page 73.

Sidorov, G., S. Miranda-Jiménez, F. Viveros-Jiménez, A. Gelbukh, N. Castro-Sánchez, F. Velásquez, I. Díaz-Rangel, S. Suárez-Guerra, A. Treviño, and J. Gordon. 2013. Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence - Volume Part I, MICAI'12, pages 1-14, Berlin, Heidelberg. Springer-Verlag.

Tellez, E. S., S. Miranda-Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, and O. S. Siordia. 2017. A simple approach to multilingual polarity classification in Twitter. Pattern Recognition Letters, 94:68-74.