=Paper=
{{Paper
|id=Vol-1896/p8_gsi_tass2017
|storemode=property
|title=Applying Recurrent Neural Networks to Sentiment Analysis of Spanish Tweets
|pdfUrl=https://ceur-ws.org/Vol-1896/p8_gsi_tass2017.pdf
|volume=Vol-1896
|authors=Oscar Araque,Rodrigo Barbado,J. Fernando Sánchez-Rada,Carlos A. Iglesias
}}
==Applying Recurrent Neural Networks to Sentiment Analysis of Spanish Tweets==
TASS 2017: Workshop on Semantic Analysis at SEPLN, September 2017, pages 71-76.

Oscar Araque, Rodrigo Barbado, J. Fernando Sánchez-Rada and Carlos A. Iglesias
Intelligent Systems Group, Universidad Politécnica de Madrid
Av. Complutense 30, 28040 Madrid
o.araque@upm.es, rodrigo.barbado.esteban@alumnos.upm.es, {jfernando, carlosangel.iglesias}@upm.es

Abstract: This article presents the participation of the Intelligent Systems Group (GSI) at Universidad Politécnica de Madrid (UPM) in the Sentiment Analysis workshop focused on Spanish tweets, TASS 2017. We have worked on Task 1, aiming to classify the sentiment polarity of Spanish tweets. For this task we propose a Recurrent Neural Network (RNN) architecture composed of Long Short-Term Memory (LSTM) cells followed by a feedforward network. The architecture makes use of two different types of features: word embeddings and sentiment lexicon values. The recurrent architecture allows us to process text sequences of different lengths, while the lexicon inserts sentiment information directly into the system. The results indicate that this feature combination leads to enhanced sentiment analysis performance.

Keywords: Deep Learning, Natural Language Processing, Sentiment Analysis, Recurrent Neural Network, TensorFlow

===1 Introduction===

Recent developments in the area of deep learning are strongly impacting sentiment analysis techniques. While traditional methods based on feature engineering are still prevalent, new deep learning approaches are succeeding and reduce the need for labeled corpora and feature definition. Moreover, traditional and deep learning approaches can be combined, obtaining improved results (Araque et al., 2017).

This paper describes our participation in TASS 2017 (Martínez-Cámara et al., 2017). Taller de Análisis de Sentimientos en la SEPLN (TASS) is a workshop that fosters research on sentiment analysis in Spanish for short texts such as tweets. The first task of this challenge, Task 1, consists in determining the global polarity at message level. The dataset for the evaluation of this task consists of tweets annotated with 4 polarity labels (P, N, NEU, NONE). P stands for positive, N means negative and NEU is neutral, while NONE indicates the absence of sentiment polarity. The task provides a corpus that contains a total of 1,514 tweets written in Spanish, covering a diversity of subjects.

We have faced this challenge as an opportunity to evaluate how these techniques can be applied in the TASS domain, and to compare their results with the traditional techniques we used in a previous participation in this challenge (Araque et al., 2015).

The remainder of this paper is organized as follows. Sect. 2 introduces related work. Sect. 3 describes the proposed polarity classification model and its implementation, which is evaluated in Sects. 4 and 5. Finally, conclusions are drawn in Sect. 6.
===2 Related work===

Many works in recent years involve the use of neural architectures to learn text classification problems and, more specifically, to perform Sentiment Analysis. A relevant example is the Recursive Neural Tensor Network (Socher et al., 2013). This architecture makes use of the structure of parse trees to effectively capture the negation phenomenon and its scope. A similar work (Tai, Socher, and Manning, 2015) introduces the use of LSTMs in tree structures, leveraging both the information contained in these trees and the representation capabilities of gated units. Although parse trees can be very useful in sentiment analysis, many works do not make use of them, as they introduce an additional computation overhead. In (Wang et al., 2015) a data-driven approach is described that learns from noisily annotated data, also making use of LSTM units and an error-signal processing scheme to avoid the vanishing gradient problem. Another useful technique is attention (Bahdanau, Cho, and Bengio, 2014), which enables weighting the importance of the different words in a given piece of text. Attention has been used successfully for Sentiment Analysis in a recurrent architecture, as presented in (Wang et al., 2016).

In the context of the TASS challenge, this is not the first time that neural architectures have been proposed for solving the different tasks. In (Vilares et al., 2015), the authors propose an LSTM architecture that is compared to linear classifiers. Also, word embeddings have been leveraged in previous editions, as shown in (Martínez-Cámara et al., 2015). Nevertheless, neural networks have not been thoroughly studied in TASS, and many potentially interesting techniques remain unused.

===3 Sentiment Analysis Task===

====3.1 Model architecture====

The approach followed for the Sentiment Analysis at Tweet level Task consists of an RNN composed of LSTM cells that parse the input into a fixed-size vector representation. This representation of the text is then used to perform the sentiment classification. Two variations of this architecture are used: (i) an LSTM that iterates over the input word vectors, or (ii) an LSTM that iterates over a combination of the input word vectors and polarity values from a sentiment lexicon.

The general architecture of the model takes as inputs the word vectors and the lexicon values for each word of an input tweet. The inputs are passed through a one-layer LSTM with a tunable number of hidden units. The generated representation is then used to determine the polarity of the input text using a feedforward layer with a softmax activation as output function. The output of this last layer encodes the probability that the input text belongs to each class. Fig. 1 shows this architecture, which is further described as follows (a code sketch follows the list):

[Figure 1: Recurrent Neural Network (RNN) architecture]

1. The input vector is the word embedding of each word in a given tweet. It contains word-level information or sentiment word-level information. Each specific case is described in more detail below.

2. The number of RNN units is chosen during training for optimization purposes. In this work we use a one-layer LSTM to avoid overfitting of the network to the training data.

3. The weight matrix has the RNN size as input dimension and the number of classes as output dimension. This means that, taking the last LSTM output as input, we obtain a vector whose length is the number of classes. This matrix is also optimized during the training process.

4. The final probability vector is obtained by passing the result of the previous matrix multiplication through a softmax function, which converts the components of this result vector into probabilities. Finally, the predicted label for the tweet is the component of the output vector with the highest probability.
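As an illustration of this pipeline, the sketch below assembles the described stack (trainable embeddings, a one-layer LSTM and a softmax feedforward layer) with the high-level tf.keras API. It is a minimal sketch rather than the authors' original graph-style TensorFlow code; the layer sizes simply reuse the hyper-parameter values reported in Sect. 4.

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # vocabulary limit (Sect. 4)
EMBED_DIM = 16       # word embedding dimensionality (Sect. 4)
LSTM_UNITS = 16      # hidden units of the one-layer LSTM (Sect. 4)
NUM_CLASSES = 4      # polarity labels: P, N, NEU, NONE

model = tf.keras.Sequential([
    # (1) trainable word embeddings for the index-encoded tweet tokens
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    # (2) one-layer LSTM folding the variable-length sequence into a fixed-size vector
    tf.keras.layers.LSTM(LSTM_UNITS),
    # (3)-(4) feedforward layer with softmax output over the four polarity classes
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```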
In the following, the two types of inputs used are described in detail.

====3.2 Word-level RNN====

For this input, the tweet text is tokenized into word tokens, which are then expressed in a one-hot representation. That is, each token is represented as a vector of dimension |V| (the vocabulary size) with all components set to 0 and a single 1 at the index of that token in the sorted token vocabulary. For example, the representations of the tokens a, antes and zebra would appear as:

w_a = [1, 0, 0, ..., 0],  w_antes = [0, 1, 0, ..., 0],  w_zebra = [0, 0, 0, ..., 1]

We limit the number of words to a certain vocabulary size in order to limit the computational cost of this preprocessing step. Before feeding this data to the network, each tweet is represented by the one-hot representations of all the tokens in the tweet.

====3.3 Sentiment word-level RNN====

Additionally, we include sentiment information in the word representations by means of a sentiment lexicon. In this case, an approach similar to the word-level RNN is followed, but in addition to the information about the different words contained in each tweet, information about the sentiment of each word is used. The preprocessing process is modified accordingly (a sketch of this preprocessing is given below):

1. First, each tweet is split into tokens.

2. Secondly, a sentiment dictionary is used to map words to sentiment polarity values. In this way, each word is mapped to a positive, neutral or negative value.

3. Finally, the representation of a word consists of its word vector concatenated with its sentiment polarity label.
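The following toy sketch illustrates both input types: tokens are mapped to vocabulary indices (the dense equivalent of the one-hot vectors of Sect. 3.2), and each word vector is concatenated with a lexicon polarity value as in Sect. 3.3. The vocabulary, lexicon and embedding matrix shown here are hypothetical placeholders, not the resources used by the authors.

```python
import numpy as np

vocab = {"<unk>": 0, "a": 1, "antes": 2, "zebra": 3}   # toy sorted vocabulary
lexicon = {"genial": 1.0, "horrible": -1.0}            # toy polarity lexicon

def one_hot(token):
    """|V|-dimensional one-hot vector; unknown tokens map to <unk>."""
    vec = np.zeros(len(vocab))
    vec[vocab.get(token, 0)] = 1.0
    return vec

def word_features(token, embeddings):
    """Word vector concatenated with its lexicon polarity (0.0 if absent)."""
    word_vec = embeddings[vocab.get(token, 0)]
    polarity = np.array([lexicon.get(token, 0.0)])
    return np.concatenate([word_vec, polarity])

# toy 16-dimensional embedding matrix, one row per vocabulary entry
embeddings = np.random.normal(0.0, 1.0, size=(len(vocab), 16))

tweet = "antes era genial".split()
features = np.stack([word_features(t, embeddings) for t in tweet])
print(features.shape)  # (3, 17): three tokens, 16-dim embedding + 1 polarity value
```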
====3.4 Regularization====

Given the reduced number of training examples available for this task (Sect. 4), a number of regularization techniques have been used in the experiments. Regularization is used in machine learning to control the complexity of a learning model so that it does not overfit the training data and generalizes better to the test data.

It is known that Recurrent Neural Networks tend to heavily overfit the training set (Zaremba, Sutskever, and Vinyals, 2014). We employ two regularization techniques to prevent this (see the sketch after this list):

1. L2 regularization (Ng, 2004). This technique is applied to the weights of the feedforward layer of the network. Being W_MLP the weights of this layer, this regularization adds the following value to the cost function:

   λ ||W_MLP^T W_MLP||

   where λ is a parameter that represents the importance assigned to this regularization in the overall cost function.

2. Dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016b). This strategy consists in randomly setting a fraction of units to 0 at each step of the training process to prevent overfitting. At test time, the outputs are averaged by this fraction. Dropout has recently been found to be theoretically similar to applying a Bayesian prior on the network weights (Gal and Ghahramani, 2016a).
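A minimal sketch of where the two regularizers would attach in a tf.keras version of the model is given below. The λ = 0.05 and dropout rate = 0.7 values are those reported in Sect. 4; this is an illustration under those assumptions, not the authors' original implementation.

```python
import tensorflow as tf

regularized_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=16),
    tf.keras.layers.LSTM(16),
    # Dropout: randomly zeroes a fraction of the LSTM output units during training
    # (the paper reports a rate of 0.7)
    tf.keras.layers.Dropout(rate=0.7),
    # L2 penalty on the feedforward (MLP) weights, added to the cost function
    tf.keras.layers.Dense(4, activation="softmax",
                          kernel_regularizer=tf.keras.regularizers.l2(0.05)),
])
```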
====3.5 TensorFlow implementation====

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms (Abadi, Agarwal, et al., 2015). To implement the model described above, we first had to define a computation graph composed of the RNN architecture and the required matrices and operations. Once the graph is defined, the training process consists in iteratively adjusting numerical values in order to reach the best results. This task was carried out following these ideas (condensed into the sketch after this list):

• The values to optimize are the internal parameters and matrices that form the network. These are: the word embedding representations, the LSTM internal weights and the feedforward weight matrix used as last layer. At the beginning of training, these values are randomly initialized using a normal distribution N(µ, σ) with µ = 0, and are treated as variables to be optimized at each training step by TensorFlow.

• Having these variables defined, the training process iteratively modifies them in order to reach the best results. In order to obtain an error signal that can be used to modify the learning parameters, we use a cost function that has to be minimized. This minimization problem is solved by applying gradient descent via backpropagation (LeCun et al., 2012). In this work we employ the Adam algorithm (Kingma and Ba, 2014).

• In each iteration of the training process, known as an epoch, data from the training set flows through the computation graph, yielding a prediction result. The cost metric is computed by comparing the obtained result with the true training labels. When the backpropagation is finished, the variable values are updated and the following iteration proceeds.

• In order to enhance performance, we use early stopping on the accuracy on the development set. That is, for each epoch we monitor the performance of the network on the development set. If it has not improved for a number of epochs (in this work, 3 epochs), the training process is stopped and the model weights are frozen.

• The number of iterations can be chosen, as well as other parameters such as the RNN size. For testing new examples, we use the test data as input, passing all the tweets through our model and obtaining as a result the vector of class probabilities for each tweet; the class with the highest probability value is chosen for each tweet.
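The bullet points above can be condensed into the following sketch, which trains the model from the earlier sketch with Adam, a development-set early-stopping criterion with a patience of 3 epochs, and the batch size and epoch limit reported in Sect. 4. The arrays are random placeholders standing in for the index-encoded, padded tweets, not real TASS data, and the callback-based loop is an assumed modern equivalent of the original hand-written training loop.

```python
import numpy as np
import tensorflow as tf

# model as in the earlier sketch: embeddings -> one-layer LSTM -> softmax layer
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=16),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",                       # Adam (Kingma and Ba, 2014)
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# random placeholder data standing in for the index-encoded, padded tweets
x_train = np.random.randint(0, 20000, size=(1514, 30))
y_train = np.random.randint(0, 4, size=(1514,))
x_dev = np.random.randint(0, 20000, size=(200, 30))
y_dev = np.random.randint(0, 4, size=(200,))

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",      # development-set accuracy
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True)   # keep ("freeze") the best weights seen

model.fit(x_train, y_train,
          validation_data=(x_dev, y_dev),
          batch_size=256,        # batch size from Sect. 4
          epochs=20,             # maximum number of epochs from Sect. 4
          callbacks=[early_stopping])
```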
===4 Experimental setup===

For the development of Task 1, a training and development dataset was made public, containing 1,514 labeled tweets belonging to the InterTASS corpus. Additionally, we use the TASS 2015 edition training dataset, which was extracted from the General Corpus (García Cumbreras et al., 2016). We train the system with the InterTASS and the TASS General Corpus training datasets, and adjust the hyper-parameters with the InterTASS development set. For the lexicon, we used the ElhPolar dictionary (Urizar and Roncal, 2013), as it has been previously used in TASS competitions.

There are three test datasets, one belonging to the InterTASS corpus and two belonging to the General Corpus of TASS: the full version, with all 60,798 tweets, and the 1k version, which contains a subset of 1,000 tweets.

In order to enhance the classification performance, several hyper-parameters have been explored, and the values that yield better performance are selected for the testing phase. The vocabulary size is set to 20,000, with a batch size of 256 and 20 epochs. With this value, the early stopping mechanism stopped the training before its completion. Regarding the size of the layers, the number of dimensions of the word embeddings is set to 16, as is the number of units in the LSTM layer. The dimensionality of the feedforward layer is given by the output of the LSTM, which is 16, and the number of classes of the classification task (in this case, 4). Note that these values are smaller than in usual neural architectures in order to further prevent overfitting. Also, we set the λ parameter to 0.05 and the dropout rate to 0.7.

===5 Experimental Results===

Table 1 shows the results of the two variations of the proposed model: the LSTM + MLP stack with or without lexicon values.

Model                  Corpus       Accuracy  Macro-F1
LSTM + MLP             InterTASS    53.70     37.1
LSTM + MLP             TASS (1k)    60.1      45.6
LSTM + MLP             TASS (Full)  63.1      50.9
LSTM + MLP + Lexicon   InterTASS    56.2      38.7
LSTM + MLP + Lexicon   TASS (1k)    63.6      46.8
LSTM + MLP + Lexicon   TASS (Full)  63.1      49.7

Table 1: Results in TASS 2017

In light of these results, it is possible to affirm that the proposed architecture shows promising performance in the task of sentiment analysis of tweets. However, the achieved performance is below the best in this year's challenge, which indicates that further work should be done to improve the results.

The experimental results confirm the idea that the introduction of a sentiment lexicon into the word representations is, in general, beneficial for the final performance. We see this improvement in the InterTASS and 1k corpora. Nevertheless, on the Full corpus a decrease in Macro-F1 is observed.

===6 Conclusions and Future Work===

In this paper we have described the participation of the GSI in the TASS 2017 challenge. Our proposal relies on a Recurrent Neural Network architecture for Sentiment Analysis with Long Short-Term Memory cells. This network can be fed with both word vectors and sentiment lexicon values. The approach is able to represent an arbitrarily long sequence of text thanks to the dynamic recurrent structure of the architecture. Also, several techniques have been used to avoid overfitting. From the experiments, it is seen that adding a sentiment lexicon can enhance the classification performance.

However, the proposed model does not compare with the best results in the TASS competition. This can be due to a number of reasons, but the training process suggests that overfitting is a relevant issue. Although the regularization techniques are beneficial, the network is not able to generalize well. To address this, we think that future work in this direction should include the expansion of the training set.

Another possible improvement for future work is better preprocessing of the input texts at the word level. In addition, Convolutional Neural Networks could be used for feature extraction in combination with the Recurrent Neural Network architecture. This could lead to the computation of more complex features, which could also yield better results.

===Acknowledgement===

The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used in this research. This research work is partially supported through the projects Semola (TEC2015-68284-R), EmoSpaces (RTC-2016-5053-7), MOSI-AGIL (S2013/ICE-3019), Somedi (ITEA3 15011) and Trivalent (H2020 Action Grant No. 740934, SEC-06-FCT-2016).

===Bibliography===

Abadi, M., A. Agarwal, P. Barham, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Araque, O., I. Corcuera, C. Román, C. A. Iglesias, and J. F. Sánchez-Rada. 2015. Aspect based sentiment analysis of Spanish tweets. In TASS@SEPLN, pages 29–34.

Araque, O., I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias. 2017. Enhancing Deep Learning Sentiment Analysis with Ensemble Techniques in Social Applications. Expert Systems with Applications, June.

Bahdanau, D., K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Gal, Y. and Z. Ghahramani. 2016a. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.

Gal, Y. and Z. Ghahramani. 2016b. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

García Cumbreras, M. Á., E. Martínez Cámara, J. Villena Román, and J. García Morera. 2016. TASS 2015 – the evolution of the Spanish opinion mining systems. Procesamiento del Lenguaje Natural, 56:33–40.
Kingma, D. and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

LeCun, Y. A., L. Bottou, G. B. Orr, and K.-R. Müller. 2012. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, pages 9–48.

Martínez-Cámara, E., M. C. Díaz-Galiano, M. Á. García-Cumbreras, M. García-Vega, and J. Villena-Román. 2017. Overview of TASS 2017. In Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume 1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.

Martínez-Cámara, E., Y. Gutiérrez-Vázquez, J. Fernández, A. Montejo-Ráez, and R. Muñoz-Guillena. 2015. Ensemble classifier for Twitter sentiment analysis. In R. Izquierdo, editor, Proceedings of the Workshop on NLP Applications: Completing the Puzzle, number 1386 in CEUR Workshop Proceedings, Aachen.

Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM.

Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Srivastava, N., G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Tai, K. S., R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Urizar, X. S. and I. S. V. Roncal. 2013. Elhuyar at TASS 2013. In Proceedings of the Workshop on Sentiment Analysis at SEPLN (TASS 2013), pages 143–150.

Vilares, D., Y. Doval, M. A. Alonso, and C. Gómez-Rodríguez. 2015. LyS at TASS 2015: Deep learning experiments for sentiment analysis on Spanish tweets. In TASS@SEPLN, pages 47–52.

Wang, X., Y. Liu, C. Sun, B. Wang, and X. Wang. 2015. Predicting polarities of tweets by composing word embeddings with long short-term memory. In ACL (1), pages 1343–1353.

Wang, Y., M. Huang, X. Zhu, and L. Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In EMNLP, pages 606–615.

Zaremba, W., I. Sutskever, and O. Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.