

RETUYT-InCo at TASS 2018: Sentiment Analysis in Spanish Variants using Neural Networks and SVM

Luis Chiruzzo, Aiala Rosa

Facultad de Ingeniería, Universidad de la República
Montevideo, Uruguay

URL references were replaced by the token

2018


This paper presents three approaches for classifying the sentiment of tweets for different Spanish variants in the TASS 2018 challenge. The classifiers are based on Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM). Although different classifiers worked better for different language variants, the use of word embeddings was key to obtaining performance improvements. Also, using a mixed-balanced training method for the LSTM resulted in a significant improvement in the detection of neutral tweets.

1 Introduction

Sentiment analysis is one of the most important tasks related to subjectivity analysis within Natural Language Processing. The sentiment analysis of tweets is especially interesting due to the large volume of information generated every day, the subjective nature of most messages, and the easy access to this material for analysis and processing.

The existence of specific shared tasks in this field for several years now shows the interest of the NLP community in this subject. The International Workshop on Semantic Evaluation (SemEval) has included a task on sentiment analysis of tweets since 2013¹.

For Spanish, the TASS workshop, organized by SEPLN (Sociedad Española para el Procesamiento del Lenguaje Natural), has focused on this task since 2012².

¹ https://www.cs.york.ac.uk/semeval-2013/task2.html
² http://www.sepln.org/workshops/tass/2012/

In the TASS editions prior to 2017, most participants presented machine learning systems based on hand-crafted features. For example, in TASS 2016 (García-Cumbreras et al., 2016) the best results were obtained by a system based on an ensemble of Logistic Regression classifiers, with features derived from a subjective lexicon, negation processing, and n-grams (Cerón-Guzmán, 2016), and by a system based on a set of SVM classifiers with morpho-syntactic information and n-grams as features (Hurtado and Pla, 2016). Other authors (Montejo-Ráez and Díaz-Galiano, 2016; Quirós, Segura-Bedmar, and Martínez, 2016) used word embeddings, reaching lower results.

In TASS 2017 (Martínez-Cámara et al., 2017), several systems for Task 1 used deep learning approaches. The best results were obtained by: Hurtado, Pla, and González (2017), who experimented with different deep neural network architectures, using domain-specific and general-domain sets of embeddings as input; Cerón-Guzmán (2017), who presented an ensemble of SVM and Logistic Regression classifiers; Rosa et al. (2017), who presented an SVM classifier based on the centroid of the tweet's word embeddings, a deep neural network (CNN), and a combination of both; and Moctezuma et al. (2017), who combined an SVM classifier with genetic programming.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

On the other hand, SemEval 2018 (Mohammad et al., 2018) included, for the first time, a dataset for sentiment analysis of Spanish tweets. The corpus used in Task 1.4 (ordinal classification of sentiment) is annotated with 7 values, indicating different levels of positive or negative sentiment. The best results for Spanish were obtained by systems based on deep neural networks.

In this paper we describe the approaches to Spanish tweet classification presented by the RETUYT-InCo team for the TASS 2018 sentiment analysis challenge (Martínez-Cámara et al., 2018): an SVM-based classifier that uses a set of features, including word embeddings, and two deep neural network approaches, a CNN and an LSTM.

2 Corpus pre-processing

For this year's edition of the challenge, the organizers provided three corpora for Spanish variants spoken in different countries: Spain (ES), Costa Rica (CR) and Peru (PE). For each variant, training, development and test data were provided. The training and development sets were annotated with one of four polarity categories per tweet: P, N, NEU or NONE. The test corpora had no annotations.

For some of our experiments, we also used the general TASS training corpus from a previous edition of the competition. This corpus was divided into training (85%) and development (15%) subsets. Table 1 shows the sizes of the different corpora and the number of tweets in each class.

[Table 1: number of tweets per category (N, NEU, NONE, P, Total) in the General, InterTASS-ES, InterTASS-CR and InterTASS-PE corpora.]

Each corpus was pre-processed as follows:

- Redundant space characters and ellipses were removed.
- Sequences of three or more occurrences of the same character were replaced by a single occurrence of that character. For instance, "holaaaa" was replaced by "hola".
- Interjections denoting laughter ("jajajaja", "jejeje", "jajaj") were replaced by the token "jaja".
- The text was converted to lowercase.
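The pre-processing steps above can be sketched with a few regular expressions. This is a minimal illustration, not the authors' code: the laughter pattern (two or more "ja"/"je"/"ji" syllables, optionally followed by a trailing "j") and the order in which the steps are applied are our assumptions.

```python
import re

ELLIPSIS = re.compile(r"\.{3,}|\u2026")   # "..." or the single ellipsis character
REPEATED = re.compile(r"(.)\1{2,}")        # any character occurring 3+ times in a row
# Assumed laughter pattern: covers "jajajaja", "jejeje" and "jajaj".
LAUGHTER = re.compile(r"\b(?:j[aei]){2,}j?\b", re.IGNORECASE)


def preprocess(tweet: str) -> str:
    text = ELLIPSIS.sub(" ", tweet)              # drop ellipses
    text = re.sub(r"\s+", " ", text).strip()    # drop redundant whitespace
    # Taken literally, the rule also shortens digit runs such as "1000" to "10".
    text = REPEATED.sub(r"\1", text)
    text = LAUGHTER.sub("jaja", text)           # normalize laughter
    return text.lower()


print(preprocess("Holaaaa   amigos... JAJAJAJA jejeje"))  # hola amigos jaja jaja
```

If digit runs should survive, the repetition pattern can be restricted to letters, for example `([^\W\d_])\1{2,}`.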

We did not include any grammatical information, such as lemmas, POS tags, or morphological or syntactic information.

3 Resources

3.1 Positive and Negative Lexicons

We built a subjective lexicon as the union of three subjective lexicons available for Spanish (Cruz et al., 2014; Saralegi and San Vicente, 2013; Brooke, Tofiloski, and Taboada, 2009). The lexicon, containing 6875 negative lemmas and 4853 positive lemmas, was expanded with the inflectional forms of each lemma, reaching a total of 76291 words (48959 negative and 27332 positive). This was done to compensate for the fact that the tweets were not lemmatized. For the lexicon expansion we used the FreeLing dictionary (Padró and Stanilovsky, 2012).
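The union-then-expand procedure can be sketched as follows. The `(form, lemma)` pairs stand in for entries read from a morphological dictionary such as FreeLing's; that input format and the function name `build_lexicon` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_lexicon(lexicons, form_lemma_pairs):
    """Union of several (positive, negative) lemma lexicons, expanded with the
    inflected forms of each lemma.

    lexicons: iterable of (positive_lemmas, negative_lemmas) pairs of sets.
    form_lemma_pairs: iterable of (form, lemma) tuples (assumed input format).
    """
    lexicons = list(lexicons)
    positive = set().union(*(pos for pos, _ in lexicons))
    negative = set().union(*(neg for _, neg in lexicons))

    forms_of = defaultdict(set)          # lemma -> its inflected forms
    for form, lemma in form_lemma_pairs:
        forms_of[lemma].add(form)

    def expand(lemmas):
        words = set(lemmas)
        for lemma in lemmas:
            words |= forms_of.get(lemma, set())
        return words

    return expand(positive), expand(negative)
```

Lemmas listed as positive in one source and negative in another end up in both lists here; the paper does not say how such conflicts were handled.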

In previous work (Rosa et al., 2017), we used the same three Spanish lexicons, but took their intersection instead of their union, obtaining a lexicon of 4730 words. Some experiments showed that the larger lexicon provides a small improvement in results when used to compute some of the SVM features (described below).

3.2 Word embeddings set

We used a set of 300-dimensional word embeddings trained by Azzinnari and Martínez (2016) using word2vec (Mikolov et al., 2013). These embeddings were trained on a corpus of almost six billion words in Spanish, most of which come from Internet media sites.

3.3 Word Polarity Predictor

We built a regression model based on SVM, using the subjective lexicon as the training set. This model should be able to predict a real number representing the polarity of each word: it takes as input the 300 real values of the word's embedding vector and returns a real value for its polarity. For training, we assigned the value 1 to positive words and the value -1 to negative words. Table 2 shows the result of applying this model to some words.
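A minimal sketch of such a predictor with scikit-learn's `SVR`, assuming embeddings are available as a `dict` from word to vector. The RBF kernel, `C`, `epsilon`, and returning 0.0 for out-of-vocabulary words are assumptions; the paper does not report these settings.

```python
import numpy as np
from sklearn.svm import SVR


def train_polarity_predictor(embeddings, positive, negative):
    """Fit an SVM regressor mapping a word vector to a polarity score.

    Positive lexicon words get target 1, negative words get -1; words without
    an embedding are skipped. Kernel and hyperparameters are assumptions.
    """
    X, y = [], []
    for words, target in ((positive, 1.0), (negative, -1.0)):
        for word in words:
            if word in embeddings:
                X.append(embeddings[word])
                y.append(target)
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
    return model.fit(np.array(X), np.array(y))


def word_polarity(model, embeddings, word):
    """Predicted polarity of a word; 0.0 for out-of-vocabulary words (assumption)."""
    if word not in embeddings:
        return 0.0
    return float(model.predict(np.asarray(embeddings[word]).reshape(1, -1))[0])
```

Because the targets are only 1 and -1, words far from both lexicon clusters in embedding space tend to receive scores nearer 0, which matches the behaviour described for neutral words.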

Table 2: Polarity predicted for some example words.

Word          Prediction
apoyamos       1.09973945
amigo          0.89985318
excelente      1.04574863
cansancio     -0.98582263
abatían       -1.02370082
horrible      -0.91882273
apartamento   -0.30991363
teléfono      -0.48884958

As these examples show, words expected to be positive have values close to 1 and words expected to be negative have values close to -1, while neutral words such as "apartamento" and "teléfono" have values closer to 0 than to 1 or -1.

3.4 Category Markers

We obtained the list of all the words in the training corpus and, for each one, calculated the distribution of the four categories over all the tweets where the word occurs. We consider a word to be a marker of a category if at least 75% of the tweets containing it belong to that category. Using this information, we built marker lists for the four categories: 429 positive words, 438 negative words, 12 neutral words, and 33 no-opinion (NONE) words.
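The marker extraction can be sketched as below. Each tweet counts once per word, matching "the distribution of the four categories over all the tweets where the word occurs". The `min_count` filter is an added assumption (off by default) that the paper does not mention.

```python
from collections import Counter, defaultdict


def category_markers(tweets, threshold=0.75, min_count=1):
    """Words whose tweets fall in a single category at least `threshold` of the time.

    tweets: iterable of (tokens, label) pairs, label in {"P", "N", "NEU", "NONE"}.
    min_count: assumed extra filter for words occurring in very few tweets.
    Returns a dict label -> set of marker words.
    """
    counts = defaultdict(Counter)        # word -> label counts over tweets containing it
    for tokens, label in tweets:
        for word in set(tokens):         # each tweet counts once per word
            counts[word][label] += 1

    markers = defaultdict(set)
    for word, by_label in counts.items():
        total = sum(by_label.values())
        label, freq = by_label.most_common(1)[0]
        if total >= min_count and freq / total >= threshold:
            markers[label].add(word)
    return dict(markers)
```

With `min_count=1`, every word that appears in a single tweet qualifies as a marker of that tweet's category, so in practice some frequency cut-off is likely useful.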

4 Classifiers

4.1 SVM-based approach

The SVM classifier configurations are almost the same as those described in (Rosa et al., 2017). However, the use of the new positive and negative lexicons required retraining the word polarity predictor and rebuilding the feature sets. The SVM classifiers use the following features:

- Centroid of the tweet's word embeddings. Previous work showed that, although using the centroid (or mean vector) is a simple technique, it achieves good results for several NLP problems, particularly sentiment analysis (White et al., 2015). (300 real values)
- Polarity of the nine most relevant words of the tweet according to the polarity predictor. The number nine is the average length of the tweets in the training corpus after filtering out stop words. We consider the most relevant words to be those whose polarities have the highest absolute value. If the tweet has fewer than nine words, the nine values are completed by repeating the polarities of the words in the tweet. (9 real values)
- Number of words belonging to the positive lexicon and to the negative lexicon. (2 natural values)
- Number of words whose vector representations are close to the mean vector of the positive and of the negative lexicon. (2 natural values)
- Number of words belonging to each of the lists of category markers. (4 natural values)
- Features indicating whether the original tweet has repeated characters or some word written entirely in upper case. (2 boolean values)
- Tentative polarity (P, N, NEU, NONE) of the tweet, based on the number of positive and negative words in the tweet and taking into account negation markers (from a list): the polarity of words occurring between a negation marker and a punctuation mark is inverted. (4 classes)
- The five most relevant words from the training corpus, according to a bag-of-words classifier. The value five was defined experimentally. We filtered out words belonging to a list of stop words adapted for this task (some words relevant for sentiment analysis, such as "no" and "pero", were removed from the stop word list). (5 boolean values)
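Three of these features can be sketched as follows: the embedding centroid, the nine most relevant polarities with padding by repetition, and the tentative polarity with negation scope. The negator and punctuation lists are hypothetical, and so is the rule that maps the positive/negative counts to P, N, NEU or NONE; `polarity` is any callable returning a word's score, such as the regressor above.

```python
import numpy as np

# Hypothetical negation and punctuation lists; the paper uses its own list.
NEGATORS = {"no", "nunca", "ni", "tampoco"}
PUNCTUATION = set(".,;:!?")


def centroid(tokens, embeddings, dim=300):
    """Mean of the embeddings of the tweet's in-vocabulary tokens (zeros if none)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def top_polarities(tokens, polarity, k=9):
    """Polarities of the k words with the largest absolute polarity.

    Shorter tweets are padded by cycling through their own polarities.
    """
    values = sorted((polarity(t) for t in tokens), key=abs, reverse=True)[:k]
    if not values:
        return [0.0] * k                 # assumption: empty tweets get zeros
    while len(values) < k:
        values += values[: k - len(values)]
    return values


def tentative_polarity(tokens, positive, negative):
    """Lexicon vote; a negator flips polarity up to the next punctuation mark.

    Mapping the counts to P / N / NEU / NONE is an assumption.
    """
    n_pos = n_neg = 0
    negated = False
    for t in tokens:
        if t in PUNCTUATION:
            negated = False
        elif t in NEGATORS:
            negated = True
        elif t in positive or t in negative:
            is_pos = (t in positive) != negated    # negation flips polarity
            n_pos += is_pos
            n_neg += not is_pos
    if n_pos == n_neg == 0:
        return "NONE"
    if n_pos == n_neg:
        return "NEU"
    return "P" if n_pos > n_neg else "N"
```

The tentative polarity would then be one-hot encoded to give the 4 class values listed above.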

As in previous editions, the SVM experiments were run using the scikit-learn toolkit (Pedregosa et al., 2011), with the multiclass probability estimation method of Wu, Lin, and Weng (2004).
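In scikit-learn, these probability estimates are obtained by passing `probability=True` to `SVC`; for more than two classes, the library documents that libsvm couples the pairwise estimates as in Wu, Lin and Weng (2004). The sketch below uses synthetic feature vectors as a stand-in for the real ones, and the RBF kernel and `C=1.0` are assumptions.

```python
import numpy as np
from sklearn.svm import SVC


def train_svm(X, y, seed=0):
    """SVM with multiclass probability outputs (kernel and C are assumptions)."""
    clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=seed)
    return clf.fit(X, y)


# Synthetic stand-in for the tweet feature vectors: 4 classes, 25 tweets each.
rng = np.random.default_rng(0)
y = np.repeat(["P", "N", "NEU", "NONE"], 25)
X = rng.normal(size=(100, 20))
X[:, :4] += 3.0 * np.repeat(np.eye(4), 25, axis=0)   # shift one dimension per class

clf = train_svm(X, y)
proba = clf.predict_proba(X[:5])    # columns follow clf.classes_
print(clf.classes_, proba.round(2))
```

The scikit-learn documentation notes that, with `probability=True`, `predict` can disagree with the argmax of `predict_proba`, because the probabilities are calibrated with an internal cross-validation.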

4.2 CNN-based approach

Our CNN approach uses a simpler network than the one used in (Rosa et al., 2017). That network had three convolutional branches, considering two, three and four words of context, whereas here a single convolutional branch considering three words of context is used, as shown in Figure 1. The input of the network is the sequence of word embeddings of the words in the tweet, up to a maximum of 32 words. This input is fed to the convolutional layer, whose output goes to a max pooling layer and then to a dense layer with a dropout of 0.2, before a softmax output layer. For training this network we use a 70%-30% train/validation split and apply early stopping on the validation set.
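To make the data flow concrete, here is a NumPy sketch of the forward pass only (random, untrained weights). It is not the authors' Keras code. The number of filters, the hidden size, the ReLU activations, global max pooling over positions and zero post-padding are all assumptions.

```python
import numpy as np

MAX_WORDS, EMB_DIM, N_CLASSES = 32, 300, 4
WINDOW = 3          # three words of context per convolution window


def pad(word_vectors):
    """Truncate/zero-pad a tweet's embedding sequence to MAX_WORDS rows (assumed post-padding)."""
    x = np.zeros((MAX_WORDS, EMB_DIM))
    seq = np.asarray(word_vectors, dtype=float).reshape(-1, EMB_DIM)[:MAX_WORDS]
    x[: len(seq)] = seq
    return x


def init_params(rng, n_filters=64, hidden=64):
    """Random initial weights; layer sizes are assumptions."""
    return {
        "Wc": rng.normal(0, 0.01, (WINDOW * EMB_DIM, n_filters)), "bc": np.zeros(n_filters),
        "Wd": rng.normal(0, 0.01, (n_filters, hidden)), "bd": np.zeros(hidden),
        "Wo": rng.normal(0, 0.01, (hidden, N_CLASSES)), "bo": np.zeros(N_CLASSES),
    }


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def cnn_forward(p, x, dropout_rng=None, dropout=0.2):
    """Class probabilities for one padded tweet x of shape (MAX_WORDS, EMB_DIM)."""
    # Each row of `windows` concatenates the embeddings of 3 consecutive words.
    windows = np.stack([x[t:t + WINDOW].ravel() for t in range(len(x) - WINDOW + 1)])
    conv = np.maximum(0.0, windows @ p["Wc"] + p["bc"])   # ReLU feature maps (assumed)
    pooled = conv.max(axis=0)                             # max pooling over positions
    hidden = np.maximum(0.0, pooled @ p["Wd"] + p["bd"])  # dense layer
    if dropout_rng is not None:                           # dropout only when training
        keep = dropout_rng.random(hidden.shape) >= dropout
        hidden = hidden * keep / (1.0 - dropout)
    return softmax(hidden @ p["Wo"] + p["bo"])
```

In Keras, the same stack would roughly correspond to `Conv1D(kernel_size=3)`, `GlobalMaxPooling1D`, `Dense`, `Dropout(0.2)` and a final `Dense(4, activation="softmax")`.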

4.3 LSTM-based approach

Our LSTM network architecture takes the embedding of each word as input, up to a maximum of 32 words. This input is passed through an LSTM layer and then a dense layer with a dropout of 0.2, before the output is obtained through a softmax layer, as shown in Figure 2.
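The corresponding forward pass can be sketched in NumPy with a standard LSTM cell whose final hidden state feeds the dense and softmax layers. This is an illustration with untrained weights, not the authors' Keras model. The unit counts, the gate ordering, the ReLU on the dense layer, and processing only the real words (no padding) are assumptions.

```python
import numpy as np

MAX_WORDS, EMB_DIM, N_CLASSES = 32, 300, 4


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, units=64, hidden=64):
    """Random initial weights; sizes are assumptions. Gate order: i, f, o, g."""
    return {
        "W": rng.normal(0, 0.01, (EMB_DIM, 4 * units)),
        "U": rng.normal(0, 0.01, (units, 4 * units)),
        "b": np.zeros(4 * units),
        "Wd": rng.normal(0, 0.01, (units, hidden)), "bd": np.zeros(hidden),
        "Wo": rng.normal(0, 0.01, (hidden, N_CLASSES)), "bo": np.zeros(N_CLASSES),
    }


def lstm_last_state(p, word_vectors):
    """Run a standard LSTM over at most MAX_WORDS word vectors; return the final h."""
    units = p["U"].shape[0]
    h, c = np.zeros(units), np.zeros(units)
    for x in list(word_vectors)[:MAX_WORDS]:
        z = np.asarray(x) @ p["W"] + h @ p["U"] + p["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h


def lstm_forward(p, word_vectors, dropout_rng=None, dropout=0.2):
    """Class probabilities: LSTM -> dense (+ dropout when training) -> softmax."""
    hidden = np.maximum(0.0, lstm_last_state(p, word_vectors) @ p["Wd"] + p["bd"])
    if dropout_rng is not None:                        # dropout only when training
        keep = dropout_rng.random(hidden.shape) >= dropout
        hidden = hidden * keep / (1.0 - dropout)
    logits = hidden @ p["Wo"] + p["bo"]
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With zero biases, an empty tweet yields a uniform distribution over the four classes, which is a quick sanity check for the shapes.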

The initial experiments with this network yielded good accuracy, but the macro-F measure was very low because the network never predicted the class NEU. This class proved to be the most difficult to learn throughout our experiments. We started to get better results with a different training strategy: we created two versions of the training corpus, one with all the tweets and another with the same number of tweets for each category (exactly the number of tweets in the NEU category, which was the smallest). We call the latter the balanced corpus.

The training strategy consists of training for one epoch on the whole corpus and one epoch on the balanced corpus, and repeating this cycle until performance on the development set stops improving. Training the network in this fashion yields slightly lower accuracy, but it compensates with a higher macro-F measure, as the network captures many more tweets of the NEU category.
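The balanced corpus and the alternating schedule can be sketched independently of the network. `train_epoch` and `evaluate` are hypothetical callbacks wrapping the real model. Random sampling of the larger classes and stopping at the first non-improving round (no patience) are our assumptions.

```python
import random
from collections import defaultdict


def balanced_corpus(examples, seed=0):
    """Same number of tweets per class: the size of the smallest class.

    examples: list of (tweet, label) pairs. Random sampling is an assumption.
    """
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    n = min(len(group) for group in by_label.values())
    rnd = random.Random(seed)
    subset = [ex for group in by_label.values() for ex in rnd.sample(group, n)]
    rnd.shuffle(subset)
    return subset


def mixed_balanced_training(train_epoch, evaluate, examples, max_rounds=100, seed=0):
    """Alternate one epoch on the full corpus and one on the balanced corpus
    until the development score stops improving.

    train_epoch(data) and evaluate() -> dev macro-F are hypothetical callbacks.
    Returns (best dev score, number of rounds that improved it).
    """
    balanced = balanced_corpus(examples, seed)
    best, improving_rounds = float("-inf"), 0
    for _ in range(max_rounds):
        train_epoch(examples)      # epoch on the whole corpus
        train_epoch(balanced)      # epoch on the balanced corpus
        score = evaluate()
        if score <= best:
            break                  # no improvement on the development set
        best, improving_rounds = score, improving_rounds + 1
    return best, improving_rounds
```

When the loop stops, the model has already been trained for one non-improving round; a practical implementation would usually keep a copy of the weights from the best round.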

Both neural network approaches (CNN and LSTM) were implemented using the Keras library (Chollet, 2015) and trained with the Adam optimization algorithm (Kingma and Ba, 2014).

5 Results

Three different corpora, corresponding to three Spanish variants, were used for this task: Spain (ES), Costa Rica (CR) and Peru (PE). The systems could be trained with the training data for the corresponding Spanish variant (monolingual case) or with data from other variants (cross-lingual case). We decided to submit the two best systems of each classifier family for each combination of variant and training setting. Our results are shown in Tables 3 and 4.

Considering the macro-F measure, our systems achieved good performance in all the test variants, ranking first for monolingual CR and PE and for cross-lingual ES and CR, and second for monolingual ES and cross-lingual PE. In the monolingual training case, our best results were achieved by the neural network approaches: in two cases the best system was an LSTM, and in the other it was a CNN. In the cross-lingual training case, on the other hand, the best system for each of the three variants was an SVM.

We also submitted a system that combined the output probabilities of the best LSTM and SVM, in order to exploit the information from both classifiers. This approach had yielded good results in the past (Rosa et al., 2017). In this case, although the combined approach performed well (49.1% macro-F on the ES corpus), it was still slightly below the LSTM approaches.
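The paper does not specify how the two probability distributions were combined. A weighted average is one simple option, sketched below under that assumption (the default weight of 0.5 is also an assumption). Both inputs must use the same column order; for these labels, scikit-learn's sorted `classes_` order is N, NEU, NONE, P.

```python
import numpy as np

# scikit-learn orders classes alphabetically, which for these labels gives:
CLASSES = ["N", "NEU", "NONE", "P"]


def combine(p_lstm, p_svm, weight=0.5, classes=CLASSES):
    """Weighted average of two classifiers' class probabilities.

    p_lstm, p_svm: arrays of shape (n_tweets, 4) whose columns follow `classes`.
    The averaging rule and the default weight are assumptions.
    Returns the predicted labels and the combined probabilities.
    """
    p = weight * np.asarray(p_lstm, dtype=float) + (1.0 - weight) * np.asarray(p_svm, dtype=float)
    return [classes[i] for i in p.argmax(axis=1)], p


labels, probs = combine([[0.1, 0.5, 0.1, 0.3]], [[0.1, 0.1, 0.1, 0.7]])
print(labels)  # ['P']
```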

As can be seen in Table 5, one of the reasons why the LSTM may have obtained better results on the test set is that it could

6 Conclusions

We presented three approaches for TASS 2018 Task 1, which consists of classifying the sentiment of tweets in different Spanish variants. The approaches are: an SVM using word embedding centroids and manually crafted features, a CNN using word embeddings as input, and an LSTM using word embeddings, trained with a focus on improving the recognition of neutral tweets. None of the classifiers was a clear winner in our experiments, as different classifiers worked better for different Spanish variants. However, we found that the training method used for the LSTMs significantly improved their macro-F measure by improving the detection of neutral tweets. In all cases, the use of word embeddings was key to improving the performance of the methods.

[Table 5: Dev and Test results (P, F1).]

References

Azzinnari, A. and A. Martínez. 2016. Representación de Palabras en Espacios de Vectores. Undergraduate thesis (Proyecto de grado), Universidad de la República, Uruguay.

Brooke, J., M. Tofiloski, and M. Taboada. 2009. Cross-linguistic sentiment analysis: From English to Spanish. In RANLP, pages 50–54.

Cerón-Guzmán, J. A. 2016. JACERONG at TASS 2016: An ensemble classifier for sentiment tweets at global level. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain.

Cerón-Guzmán, J. A. 2017. Classifier ensembles that push the state-of-the-art in sentiment analysis of Spanish tweets. In Proceedings of TASS.

Chollet, F. 2015. Keras. https://github.com/fchollet/keras.

Cruz, F. L., J. A. Troyano, B. Pontes, and F. J. Ortega. 2014. Building layered, multilingual sentiment lexicons at synset and lemma levels. Expert Systems with Applications, 41(13):5984–5994.

García-Cumbreras, M., J. Villena-Román, E. Martínez-Cámara, M. Díaz-Galiano, M. Martín-Valdivia, and L. Ureña-López. 2016. Overview of TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain, September.

Hurtado, L.-F., F. Pla, and J. González. 2017. ELiRF-UPV at TASS 2017: Sentiment Analysis in Twitter based on Deep Learning. In Proceedings of TASS.

Hurtado, L. F. and F. Pla. 2016. ELiRF-UPV en TASS 2016: Análisis de sentimientos en Twitter. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain.

Kingma, D. P. and J. Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Martínez-Cámara, E., Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Martínez-Cámara, E., M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román. 2017. Overview of TASS 2017. In J. Villena Román, M. A. García Cumbreras, E. Martínez-Cámara, M. C. Díaz Galiano, and M. García Vega, editors, Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume 1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Moctezuma, D., M. Graff, S. Miranda-Jiménez, E. Tellez, A. Coronado, C. Sánchez, and J. Ortiz-Bejar. 2017. A genetic programming approach to sentiment analysis for Twitter: TASS'17. In Proceedings of TASS.

Mohammad, S., F. Bravo-Marquez, M. Salameh, and S. Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17.

Montejo-Ráez, A. and M. C. Díaz-Galiano. 2016. Participación de SINAI en TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain.

Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Quirós, A., I. Segura-Bedmar, and P. Martínez. 2016. LABDA at the 2016 TASS challenge task: Using word embeddings for the sentiment analysis task. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain.

Rosa, A., L. Chiruzzo, M. Etcheverry, and S. Castro. 2017. RETUYT en TASS 2017: Análisis de Sentimientos de Tweets en Español utilizando SVM y CNN. In Proceedings of TASS.

Saralegi, X. and I. San Vicente. 2013. Elhuyar at TASS 2013. XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural, Workshop on Sentiment Analysis at SEPLN (TASS 2013), pages 143–150.

White, L., R. Togneri, W. Liu, and M. Bennamoun. 2015. How well sentence embeddings capture meaning. In Proceedings of the 20th Australasian Document Computing Symposium, page 9. ACM.

Wu, T.-F., C.-J. Lin, and R. C. Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug):975–1005.