<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3</article-id>
      <title-group>
        <article-title>INGEOTEC solution for Task 1 in TASS'18 competition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniela Moctezuma</string-name>
          <aff>CONACYT-CentroGEO</aff>
          <email>dmoctezuma@centrogeo.edu.mx</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Ortiz-Bejar</string-name>
          <aff>UMSNH</aff>
          <email>jortiz@umich.mx</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric S. Tellez</string-name>
          <aff>CONACYT-INFOTEC</aff>
          <email>eric.tellez@infotec.mx</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabino Miranda-Jimenez</string-name>
          <aff>CONACYT-INFOTEC</aff>
          <email>sabino.miranda@infotec.mx</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Graff</string-name>
          <aff>CONACYT-INFOTEC</aff>
          <email>mario.graff@infotec.mx</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>45</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>Sentiment analysis over social networks determines the polarity of messages published by users. In this sense, a message can be classified as positive or negative, or under a similar scheme with more fine-grained labels. Each language has characteristics that make it difficult to determine the sentiment correctly, such as the natural ambiguity of pronouns, synonymy, and polysemy. Additionally, given that messages in social networks are quite informal, they tend to be plagued with lexical errors and lexical variations that make it difficult to determine the sentiment using traditional approaches. This paper describes our participating system in TASS'18. Our solution is composed of several subsystems independently collected and trained, combined with our EvoMSA genetic programming system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sentiment analysis is an active research area
that performs the computational analysis of
people's feelings or beliefs expressed in texts,
such as emotions, opinions, attitudes, and
appraisals, among others (Liu and Zhang, 2012).</p>
      <p>In social media, people share their opinions
and sentiments. In addition to their
inherent polarity, these feelings also have an
intensity. As in previous years, TASS'18
organizes a task on four-level polarity
classification of tweets. This year, the
InterTASS corpus has been expanded with two
more subsets, namely, a dataset containing
tweets from Costa Rica and another one
coming from Peruvian tweeters. Therefore, there
are three varieties of the Spanish language,
namely, Spain (ES), Peru (PE), and Costa
Rica (CR). Moreover, several subtasks are
also introduced:</p>
      <p>Subtask-1: Monolingual ES:
Training and testing using the InterTASS ES
dataset.</p>
      <p>Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>Subtask-2: Monolingual PE:
Training and testing using the InterTASS PE
dataset.</p>
      <p>Subtask-3: Monolingual CR:
Training and testing using the InterTASS CR
dataset.</p>
      <p>Subtask-4: Cross-lingual: Here,
training can be done with one specific dataset
while a different one is used for testing.</p>
      <p>These subtasks are mostly based on
separating language variations between the training
and test datasets. Martínez-Cámara et al.
(Martínez-Cámara et al., 2018) detail TASS'18 Task 1
and its associated datasets.</p>
      <p>This paper details our INGEOTEC team's
solution to Task 1. Our approach
consists of a number of subsystems whose
individual predictions are combined through
a non-linear expression produced by our EvoMSA
genetic programming system. It is worth
mentioning that we tackled both Task 1 (this one)
and Task 4 (good or bad news) with a
similar scheme, that is, the same resources,
the same portfolio of algorithms, and the
same hyper-parameters; of course, we used
each task's training set to learn and
optimize for that task.</p>
      <p>The manuscript is organized as follows.</p>
      <p>Section 2 details the subsystems that compose
our solution, Section 3 presents our results,
and finally, Section 4 summarizes and
concludes this report.</p>
    </sec>
    <sec id="sec-2">
      <title>System Description</title>
      <p>Our participating system is a combination of
several subsystems that tackle the polarity
categorization of tweets independently;
all these independent predictions
are then combined using our EvoMSA genetic
programming system. The rest of this section
details the use of these subsystems and
resources.</p>
      <sec id="sec-2-1">
        <title>EvoMSA</title>
        <p>
          EvoMSA (https://github.com/INGEOTEC/EvoMSA) is a multilingual sentiment
analysis system based on generic text classifiers,
domain-specific resources, and a genetic
programming combiner of the parts. The first
one, namely B4MSA
          <xref ref-type="bibr" rid="ref6">(Tellez et al., 2017)</xref>
          ,
performs a hyper-parameter optimization over a
large search space of possible models. It uses
a meta-heuristic to solve a combinatorial
optimization problem over the configuration
space; the selected model is described in
Table 1. Second, EvoDAG (Graff
et al., 2016; Graff et al., 2017) is a
classifier based on Genetic Programming with
semantic operators that makes the final
prediction through a combination of all the
decision function values. Domain-specific
resources can also be added under the same
scheme. Figure 1 shows the architecture of
EvoMSA. In the first part, a set of
different classifiers is trained with the datasets
provided by the contest and other resources as
additional knowledge, i.e., the idea is to be
able to integrate any other kind of related
knowledge into the model. In this case, we
used tailor-made lexicons for the
aggressiveness task: aggressive words and affective
words (positive and negative); see Section 2.2
for more details. The precise configuration of
our benchmarked system is described in
Section 3.
        </p>
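        <p>This stacking architecture can be sketched as follows: independently trained classifiers expose their decision-function values, which become the features of a final combiner. The sketch below uses scikit-learn models as stand-ins for B4MSA and the EvoDAG combiner; the toy texts and model choices are illustrative, not the actual system.</p>
        <preformat>
```python
# Sketch of the EvoMSA-style architecture: an independently trained
# text classifier produces decision-function values that are stacked
# as input features for a final combiner. scikit-learn stands in for
# B4MSA (first stage) and EvoDAG (second stage); data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

train = ["muy buen dia", "odio esto", "excelente noticia", "que horror"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# First stage: a text classifier trained on its own representation.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit(train)
base = LinearSVC().fit(vec.transform(train), labels)

# Its decision-function values (not hard labels) become features.
X_stack = base.decision_function(vec.transform(train)).reshape(-1, 1)

# Second stage: the combiner learns the final decision over those values.
combiner = LogisticRegression().fit(X_stack, labels)
print(combiner.predict(X_stack))
```
        </preformat>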
      </sec>
      <sec id="sec-2-2">
        <title>Lexicon-based models</title>
        <p>
          To introduce extra knowledge into our
approach, we used two lexicon-based
models. The rst, Up-Down model produces a
counting of a ective words, that is, it
produces two indexes for a given text: one
for positive words, and another for negative
words. We created the positive-negative
lexicon based on the several Spanish a ective
lexicons
          <xref ref-type="bibr" rid="ref2 ref4 ref5">(de Albornoz, Plaza, y Gervas, 2012;
Sidorov et al., 2013; Perez-Rosas, Banea,
y Mihalcea, 2012)</xref>
          ; we also enriched this
lexicon with Spanish WordNet
          <xref ref-type="bibr" rid="ref3">(FernandezMontraveta, Vazquez, y Fellbaum, 2008)</xref>
          .
The other Bernoulli model was created to
predict aggressiveness using a lexicon with
aggressive words. We created this lexicon
gathering common aggressive words for
Spanish. These indexes and prediction along with
B4MSA's ( TC) outputs are the input for
EvoDAG system.
2.3
        </p>
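        <p>The Up-Down model described above can be sketched in a few lines: for a given text it returns two indexes, one counting positive words and one counting negative words. The tiny lexicons below are illustrative placeholders, not the paper's actual Spanish affective lexicon.</p>
        <preformat>
```python
# Minimal sketch of the Up-Down lexicon model: two indexes per text,
# one for positive words and one for negative words. The lexicons here
# are toy placeholders for the Spanish affective lexicons cited above.
POSITIVE = {"bueno", "excelente", "feliz"}
NEGATIVE = {"malo", "terrible", "triste"}

def up_down(text):
    """Return (positive_count, negative_count) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    up = sum(1 for t in tokens if t in POSITIVE)
    down = sum(1 for t in tokens if t in NEGATIVE)
    return up, down

print(up_down("un dia excelente pero con un final triste"))  # → (1, 1)
```
        </preformat>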
      </sec>
      <sec id="sec-2-3">
        <title>EvoDAG</title>
        <p>EvoDAG (https://github.com/mgraffg/EvoDAG) (Graff et al., 2016; Graff et al.,
2017) is a Genetic Programming system
specifically tailored to tackle classification
problems on very large and high-dimensional
vector spaces. EvoDAG uses the principles
of Darwinian evolution to create models
represented as a directed acyclic graph (DAG).
Due to lack of space, we refer the reader to
(Graff et al., 2016), where EvoDAG is broadly
described. It is important to mention that
EvoDAG has no information regarding
whether input Xi comes from a
particular class decision function; consequently, from
EvoDAG's point of view, all inputs are
equivalent.</p>
      </sec>
      <sec id="sec-2-4">
        <title>FastText</title>
        <p>
          FastText (Joulin et al., 2017) is a tool to
create text classifiers and to learn a semantic
vocabulary from a given collection
of documents; this vocabulary is represented
as a collection of high-dimensional vectors,
one per word. It is worth mentioning that
FastText is robust to lexical errors since
out-of-vocabulary words are represented as the
combination of the vectors of their sub-words, that is, a
kind of character q-grams limited in context
to words. Nonetheless, the main reason for
including FastText in our system is to
overcome the small training set that comes with
Task 4, which is fulfilled using the pre-trained
vectors computed on the Spanish content of
Wikipedia
          <xref ref-type="bibr" rid="ref1">(Bojanowski et al., 2016)</xref>
          . We use
these vectors to create document vectors, one
vector per document. A document vector is,
roughly speaking, a linear combination of the
word vectors that compose the document,
collapsed into a single vector of the same dimension. These
document vectors were used as input to an
SVM with a linear kernel, and we use its
decision function as input to EvoMSA.
        </p>
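        <p>A minimal sketch of this step: average the per-word vectors into a document vector and feed the result to a linear SVM whose decision function would then be passed on to EvoMSA. The 4-dimensional toy embeddings below stand in for the pre-trained Spanish fastText vectors; names and data are illustrative.</p>
        <preformat>
```python
# Sketch: document vectors as an average (a simple linear combination)
# of per-word vectors, then a linear SVM over those vectors. Toy 4-d
# embeddings replace the ~300-d pre-trained Spanish fastText vectors.
import numpy as np
from sklearn.svm import LinearSVC

emb = {
    "bueno": np.array([1.0, 0.2, 0.0, 0.1]),
    "genial": np.array([0.9, 0.1, 0.1, 0.0]),
    "malo": np.array([-1.0, 0.3, 0.0, 0.2]),
    "fatal": np.array([-0.8, 0.2, 0.1, 0.1]),
}

def doc_vector(text, dim=4):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [emb[t] for t in text.lower().split() if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

docs = ["bueno genial", "malo fatal", "muy bueno", "tan fatal"]
y = [1, 0, 1, 0]
X = np.vstack([doc_vector(d) for d in docs])

svm = LinearSVC().fit(X, y)
# The decision-function values are what feed the EvoMSA combiner.
print(svm.decision_function(X))
```
        </preformat>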
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <p>The following tables show the performance
of our system on the InterTASS dataset. We
also show the performance of a number of
selected systems to provide context for our
solution. The tables always show
the top-k best results that include our
system, i.e., we always show the best ones, but
sometimes we do not show all the results below
our system.</p>
      <p>Please recall that the InterTASS dataset
is split according to each subtask.
Table 2 shows the performance on the monolingual
datasets. For instance, the results of training
with Spain-InterTASS and testing on tweets
generated by people from Spain are shown in
Table 2a, where we reached the seventh position
out of a total of nine participating teams. For
training and testing on the other
Spanish varieties, Tables 2b and 2c
show the results of training with the CR and PE
subsets, respectively. Our team achieved the
fourth position among eight teams in CR,
and the third among eight participants in PE.
Notice that all our results are marked in bold
to improve readability.</p>
      <p>In contrast, the results of training with the
ES subset and testing with the ES, CR, and
PE subsets are presented in Tables 3a, 3b, and 3c,
respectively. Our team achieved the best result
in the cross-lingual task with Peruvian tweets,
and also reached the second-best results on the
ES (Spain) and CR (Costa Rica) subsets.</p>
      <p>The performance of our method in the
cross-lingual task is shown in Table 3. For
instance, Table 3a shows our performance on
the ES subset; here, we achieved the second
position among three teams. In general, the
number of participants was smaller than in the
monolingual tasks. Table 3b shows the rank of
the four participating teams on the Peruvian
test subset, where we reached the best
position on the Macro-F1 score. Finally, we
reached the second rank on the Costa Rica
subset, just below RETUYT-InCo.</p>
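      <p>The rankings above use the Macro-F1 score, the unweighted mean of the per-class F1 values, so minority polarity classes weigh as much as the majority ones. A small illustrative computation over the four-level polarity labels, using scikit-learn (labels and predictions here are made up for the example):</p>
      <preformat>
```python
# Macro-F1: the unweighted average of per-class F1 scores, as used to
# rank TASS systems. Labels follow the four-level polarity scheme;
# the predictions below are invented purely for illustration.
from sklearn.metrics import f1_score

y_true = ["P", "P", "N", "N", "NEU", "NONE"]
y_pred = ["P", "N", "N", "N", "NEU", "NONE"]

macro = f1_score(y_true, y_pred, average="macro")
print(round(macro, 3))  # → 0.867
```
      </preformat>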
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>It is worth mentioning that we used the same
scheme, explained in Section 2, to tackle all
subtasks. Note that EvoMSA allows
changing the training set as specified for each
subtask, so we can optimize the pipeline for
each particular objective.</p>
      <p>Regarding the obtained results, our
approach performs better when it is trained
with tweets from Spain and tested with other
Spanish varieties. However, it is not clear whether
this performance is due to the data or to an
inherent feature of the Spanish variation.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors would like to thank
Laboratorio Nacional de GeoInteligencia for partially
funding this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , E. Grave,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Joulin, y
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607</source>
          .
          <fpage>04606</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>de Albornoz</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          , L. Plaza,
          <string-name>
            <given-names>y P.</given-names>
            <surname>Gervas</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Sentisense: An easily scalable concept-based a ective lexicon for sentiment analysis</article-title>
          .
          <source>En Proceedings of LREC</source>
          <year>2012</year>
          , paginas
          <volume>3562</volume>
          {
          <fpage>3567</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Fernandez-Montraveta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , G. Vazquez, y
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The spanish version of Mart nez-</article-title>
          <string-name>
            <surname>Camara</surname>
            , E.,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Almeida-Cruz</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          <article-title>D az-</article-title>
          <string-name>
            <surname>Galiano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Estevez-Velarde</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Garc</surname>
            a-Cumbreras, M. Garc aVega,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>A. Montejo</given-names>
          </string-name>
          <string-name>
            <surname>Raez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Montoyo</surname>
          </string-name>
          , R. Mun~oz, A. PiadMor s, y J.
          <string-name>
            <surname>Villena-Roman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of TASS 2018: Opinions, health and emotions</article-title>
          . En E. Mart nezCamara
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almeida-Cruz M. C. D azGaliano S. Estevez-Velarde M. A. Garc aCumbreras M. Garc</surname>
          </string-name>
          a-Vega
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutierrez A. Montejo Raez A. Montoyo R. Mun</surname>
          </string-name>
          <article-title>~oz A. Piad-Mor s</article-title>
          , y J.
          <article-title>Villena-Roman, editores</article-title>
          ,
          <source>Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS</source>
          <year>2018</year>
          ), volumen 2172 de CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Perez-Rosas</surname>
            , V., C. Banea, y
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Mihalcea</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Learning sentiment lexicons in spanish</article-title>
          .
          <source>En LREC</source>
          , volumen
          <volume>12</volume>
          , pagina 73.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Sidorov</surname>
            , G.,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Miranda-Jimenez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>ViverosJimenez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Castro-Sanchez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Velasquez</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>D az-</article-title>
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>S. SuarezGuerra</given-names>
          </string-name>
          , A. Trevin~o,
          <string-name>
            <given-names>y J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Empirical study of machine learning based approach for opinion mining in tweets</article-title>
          .
          <source>En Proceedings of the 11th Mexican International Conference on Advances in Articial Intelligence - Volume Part I, MICAI'12, paginas</source>
          <volume>1</volume>
          {
          <fpage>14</fpage>
          , Berlin, Heidelberg. Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E. S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miranda-Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moctezuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Suarez</surname>
          </string-name>
          , y
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Siordia</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A simple approach to multilingual polarity classi cation in Twitter</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>94</volume>
          :
          <fpage>68</fpage>
          {
          <fpage>74</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Liu, B. and L. Zhang. 2012. A Survey of Opinion Mining and Sentiment Analysis, pages 415-463. Springer US, Boston, MA.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>