TASS 2015, September 2015, pp. 47-52. Received 10-07-15, revised 24-07-15, accepted 26-07-15.

LyS at TASS 2015: Deep Learning Experiments for Sentiment Analysis on Spanish Tweets∗

David Vilares, Yerai Doval, Miguel A. Alonso and Carlos Gómez-Rodríguez
Grupo LyS, Departamento de Computación, Campus de A Coruña s/n
Universidade da Coruña, 15071, A Coruña, Spain
{david.vilares, yerai.doval, miguel.alonso, carlos.gomez}@udc.es

Abstract: This paper describes the participation of the LyS group at tass 2015. In this year's edition, we used a long short-term memory neural network to address the two proposed challenges: (1) sentiment analysis at a global level and (2) aspect-based sentiment analysis on football and political tweets. The performance of this deep learning approach is compared to our last-year model, based on a square-regularized logistic regression.
Experimental results show that strategies such as unsupervised pre-training, sentiment-specific word embeddings or modifying the current architecture might be needed to achieve state-of-the-art results.
Keywords: deep learning, long short-term memory, sentiment analysis, Twitter

1 Introduction

The 4th edition of the tass workshop addresses two of the most popular tasks in sentiment analysis (sa), focusing on Spanish tweets: (1) polarity classification at a global level and (2) a simplified version of aspect-based sentiment analysis, where the goal is to predict the polarity of a set of predefined and identified aspects (Villena-Román et al., b).

The challenge of polarity classification has typically been tackled from two different angles: lexicon-based and machine learning (ml) approaches. The first group relies on sentiment dictionaries to detect the subjective words or phrases of the text, and defines lexical- (Brooke, Tofiloski, and Taboada, 2009; Thelwall et al., 2010) or syntactic-based rules (Vilares, Alonso, and Gómez-Rodríguez, 2015c) to deal with phenomena such as negation, intensification or irrealis. The second group focuses on training classifiers through supervised learning algorithms that are fed a number of features (Pang, Lee, and Vaithyanathan, 2002; Mohammad, Kiritchenko, and Zhu, 2013; Hurtado and Pla, 2014). Although competitive when labelled data is provided, these classifiers have shown weakness when interpreting the compositionality of complex phrases (e.g. adversative subordinate clauses). In this respect, some studies have evaluated the impact of syntactic-based features on these supervised learning techniques (Vilares, Alonso, and Gómez-Rodríguez, 2015b; Joshi and Penstein-Rosé, 2009) or on other related tasks, such as multi-topic detection on tweets (Vilares, Alonso, and Gómez-Rodríguez, 2015a).

∗ This research is supported by the Ministerio de Economía y Competitividad and FEDER (FFI2014-51978-C2) and by the Xunta de Galicia (R2014/034). The first author is funded by the Ministerio de Educación, Cultura y Deporte (FPU13/01180). Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with recognised ISSN 1613-0073.
More recently, deep learning (Bengio, 2009) has shown its competitiveness in polarity classification. Bespalov et al. (2011) introduce a word-embedding approach for higher-order n-grams, using a multi-layer perceptron and a linear function as the output layer. Socher et al. (2013) introduce a new deep learning architecture, a Recursive Neural Tensor Network, which improved the state of the art on the Pang and Lee (2005) movie reviews corpus when trained together with the Stanford Sentiment Treebank. Tang et al. (2014) suggest that currently existing word embedding methods are not adequate for sa, because words with completely different sentiment might appear in similar contexts (e.g. 'good' and 'bad'). They propose a sentiment-specific word embedding (sswe) model, using a deep learning architecture trained on massive distant-supervised tweets. For Spanish, Montejo-Ráez, García-Cumbreras, and Díaz-Galiano (2014) apply word embedding using Word2Vec (Mikolov et al., 2013) and then use those vectors as features for traditional machine learning techniques.

In this paper we also rely on a deep learning architecture, a long short-term memory (lstm) recurrent neural network, to solve the challenges of this tass edition. The results are compared with those of our model for last year's edition, a logistic regression approach fed with hand-crafted features.

2 Task 1: Sentiment Analysis at a global level

Let L={l0, l1, ..., ln} be the set of polarity labels and T={t0, t1, ..., tm} the set of labelled texts. The aim of the task is to define a hypothesis function h : T → L.

To train and evaluate the task, the collection from tass 2014 (Villena-Román et al., 2015) was used. It contains a training set of 7 128 tweets, intended to build and tune the models, and two test sets: (1) a pooling-labelled collection of 60 798 tweets and (2) a manually-labelled test set of 1 000 tweets. The collection is annotated using two different criteria. The first one considers a set of 6 polarities (L6): no opinion (none), positive (p), strongly positive (p+), negative (n), strongly negative (n+) and mixed (neu), i.e. tweets that mix both negative and positive ideas. A simplified version with 4 classes (L4) is also proposed, where the polarities p+ and n+ are merged into p and n, respectively.

In the rest of the paper, we will use h4 and h6 to refer to our prediction models for 4 and 6 classes, respectively.

3 Task 2: Sentiment Analysis at the aspect level

Let L={l0, l1, ..., ln} be the set of polarity labels, A={a0, a1, ..., ao} the set of aspects and T={t0, t1, ..., tm} the set of texts. The aim of the task is to define a hypothesis function h : A × T → L. Two different corpora are provided to evaluate this task: a social-tv corpus with football tweets (1 773 training and 1 000 test tweets) and a political corpus, called stompol (784 training and 500 test tweets).
Each aspect can be assigned the p, n or neu polarities (L3). The tass organisation provided both A and the identification of the aspects that appear in each tweet, so the task can be seen as identifying the scope s(a, t) of an aspect a in the tweet t ∈ T, with s a substring of t and a ∈ A, and then predicting the polarity using the hypothesis function h3(s) → L3.

To identify the scope we followed a naïve approach: given an aspect a that appears at position i in a text t=[w0, ..., wi−x, ..., ai, ..., wi+x, ..., wp], we created a snippet of length x around the aspect, which is considered to be its scope. Preliminary experiments showed that x = 4 and taking the entire tweet were the best options for the social-tv and the stompol collections, respectively.
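The snippet extraction just described can be sketched as follows (a minimal illustration; the function name and the whitespace tokenisation are ours, not part of the system description):

```python
def aspect_scope(tokens, i, x=None):
    """Return the scope of the aspect occurring at position i of a tweet.

    tokens: the tokenised tweet; i: index of the aspect token; x: window
    radius, or None to take the whole tweet (the best setting we found
    for the stompol corpus; x = 4 worked best for social-tv).
    """
    if x is None:
        return tokens
    # Keep up to x tokens on each side of the aspect, clipped at the edges.
    return tokens[max(0, i - x):i + x + 1]

# Hypothetical example with x = 4 over a whitespace-tokenised tweet:
tweet = "el portero de el Barcelona ha hecho una gran parada hoy".split()
scope = aspect_scope(tweet, tweet.index("Barcelona"), x=4)
```

The snippet is then passed to the polarity classifier in place of the full tweet.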
4 Supervised sentiment analysis models

Our aim this year was to compare our last-year model to a deep learning architecture that was initially available for binary polarity classification.

4.1 Long Short-Term Memory

Long Short-Term Memory (lstm) is a recurrent neural network (rnn) proposed by Hochreiter and Schmidhuber (1997). Traditional rnn were conceived with the objective of being able to store representations of inputs in the form of activations, showing temporal capacities and helping to learn short-term dependencies. However, they may suffer from the problem of exploding gradients¹. The lstm tries to solve these problems using a different type of unit, called a memory cell, which can remember a value for an arbitrary period of time.

In this work, we use a model composed of a single lstm and a logistic function as the output layer, which has an available implementation² in Theano (Bastien et al., 2012). To train the model, the tweets were tokenised (Gimpel et al., 2011), lemmatised (Taulé, Martí, and Recasens, 2008), converted to lowercase to reduce sparsity and finally indexed.

¹ The gradient signal becomes either too small or too large, causing very slow learning or a diverging situation, respectively.
² http://deeplearning.net/tutorial/
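The single-lstm-plus-logistic-output model can be sketched in plain numpy as follows. This is an illustrative reconstruction, not the Theano code actually used: all parameter names are our own, and the binary logistic output stands in for the multi-class setting of the tasks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_logistic_forward(x_seq, params):
    """Run a single lstm layer over a tweet and apply a logistic output.

    x_seq: T x d array of word-embedding vectors for the indexed tokens.
    params: dict with gate weights W_* (h x d), recurrent weights U_*
    (h x h) and biases b_* (h,) for * in {i, f, o, c}, plus the output
    layer w_out (h,) and scalar b_out. All names are ours.
    """
    h = np.zeros_like(params["b_i"])
    c = np.zeros_like(params["b_i"])  # memory cell: retains values over time
    for x in x_seq:
        i = sigmoid(params["W_i"] @ x + params["U_i"] @ h + params["b_i"])  # input gate
        f = sigmoid(params["W_f"] @ x + params["U_f"] @ h + params["b_f"])  # forget gate
        o = sigmoid(params["W_o"] @ x + params["U_o"] @ h + params["b_o"])  # output gate
        g = np.tanh(params["W_c"] @ x + params["U_c"] @ h + params["b_c"])  # candidate value
        c = f * c + i * g   # the cell keeps (f close to 1) or overwrites its value
        h = o * np.tanh(c)
    # Logistic output layer on the final hidden state (binary case).
    return sigmoid(params["w_out"] @ h + params["b_out"])
```

The gating is what lets the memory cell hold a value for an arbitrary period of time, mitigating the gradient problems of plain rnn.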
To train the lstm-rnn, we relied on adadelta (Zeiler, 2012), an adaptive learning rate method, using stochastic training (batch size = 16) to speed up the learning process. Experiments with non-stochastic training runs did not show an improvement in terms of accuracy. We empirically explored the size of the word embedding³ and the number of words to keep in the vocabulary⁴, obtaining the best performance with a choice of 128 and 10 000, respectively.

³ The size of the vector obtained for each word and the number of hidden units in the lstm layer.
⁴ The number of words to be indexed. The remaining words are mapped to an unknown token, so they all share the same index.

4.2 L2 logistic regression

Our last-year model relied on the simple and well-known squared-regularised logistic regression (l2-lg), which performed very competitively for all polarity classification tasks. A detailed description of this model can be found in Vilares et al. (2014a); here we just list the features that were used: lemmas (Taulé, Martí, and Recasens, 2008), psychometric properties (Pennebaker, Francis, and Booth, 2001) and subjective lexicons (Saralegi and San Vicente, 2013).
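The adadelta update applied at each minibatch can be sketched as follows (our own minimal implementation of the update rule from Zeiler (2012); function and variable names, and the rho/eps defaults, are ours):

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One adadelta step (Zeiler, 2012); no global learning rate is needed.

    state is a pair of running averages, E[g^2] and E[dx^2], both
    initialised to zeros with param's shape.
    """
    eg2, edx2 = state
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                # accumulate E[g^2]
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad  # RMS-scaled step
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                # accumulate E[dx^2]
    return param + dx, (eg2, edx2)

# In stochastic training, this update would be applied once per minibatch
# (batch size = 16 in our runs), with one state pair per parameter array.
```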
This architecture also obtained robust and competitive performance for English tweets at SemEval 2014 (Vilares et al., 2014b).

Penalising neutral tweets

Previous editions of tass have shown that the performance on neu tweets is much lower than for the rest of the classes (Villena-Román et al., a). This year we proposed a small variation on our l2-lg model: a penalising system for neu tweets to determine the polarities under the L6 configuration, where, given an L4 and an L6 lg-classifier and a tweet t, if h6(t) = neu and h4(t) ≠ neu then h6(t) := h4(t). On the test set, this strategy yielded an improvement of over one percentage point (from 55.2% to the 56.8% reported in the Experiments section).
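The back-off rule above can be written directly as code (a sketch; h6 and h4 stand for any trained L6 and L4 classifiers, represented here as plain callables):

```python
def penalised_prediction(h6, h4, tweet):
    """Back off from the six-class model when it predicts 'neu'.

    h6 and h4 are the trained L6 and L4 classifiers; the rule replaces
    an h6 'neu' answer with the h4 answer whenever the latter is not
    'neu', and keeps the h6 answer otherwise.
    """
    label6 = h6(tweet)
    if label6 == "neu":
        label4 = h4(tweet)
        if label4 != "neu":
            return label4
    return label6
```

Since the L4 labels p, n and none are also valid L6 labels, the substituted answer stays within the six-class label set.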
5 Experimental results

Table 1 compares our models with the best performing run of each of the other participants (out-of-date runs are not included). The performance of our current deep learning model is still far from the top-ranking systems, and from our last-year model too, although it worked acceptably on the L6 manually-labelled test set.

System        Ac 6       Ac 6-1k    Ac 4       Ac 4-1k
lif           0.672 (1)  0.516 (1)  0.725 (1)  0.692 (1)
elirf         0.659 (2)  0.488 (3)  0.722 (2)  0.645 (5)
gsi           0.618 (3)  0.487 (4)  0.690 (4)  0.658 (3)
dlsi          0.595 (4)  0.385 (14) 0.655 (6)  0.637 (7)
gti-grad      0.592 (5)  0.509 (2)  0.695 (3)  0.674 (2)
lys-lg•       0.568 (6)  0.434 (5)  0.664 (5)  0.634 (9)
dt            0.557 (7)  0.408 (10) 0.625 (7)  0.601 (11)
itainnova     0.549 (8)  0.405 (11) 0.610 (10) 0.484 (14)
BittenPotato  0.535 (9)  0.418 (8)  0.602 (11) 0.632 (10)
lys-lstm•     0.505 (9)* 0.430 (6)* 0.599 (11)* 0.605 (10)*
sinai-esma    0.502 (10) 0.411 (9)  -          -
cu            0.495 (11) 0.419 (7)  0.481 (13) 0.600 (12)
ingeotec      0.488 (12) 0.431 (6)  -          -
sinai         0.474 (13) 0.389 (13) 0.619 (8)  0.641 (6)
tid-spark     0.462 (14) 0.400 (12) 0.594 (12) 0.649 (4)
gas-ucr       0.342 (15) 0.338 (15) 0.446 (14) 0.556 (13)
ucsp          0.273 (16) -          0.613 (9)  0.636 (8)

Table 1: Accuracy comparison for Task 1 between the best run of each participant and our machine- and deep-learning models. Runs marked with • are our l2-lg and lstm runs. The number in parentheses indicates the ranking of each group's best run.

Tables 2 and 3 show the f1 score for each polarity for the lstm-rnn and l2-lg models, respectively. The results reflect the lack of capacity of the current lstm model to learn the minority classes in the training data (p, n+ and neu). In this respect, we plan to explore how balanced and bigger corpora can help diminish this problem.

Corpus   n+     n      neu    none   p      p+
L6       0.000  0.486  0.000  0.582  0.049  0.575
L6-1k    0.090  0.462  0.093  0.508  0.209  0.603
L4       -      0.623  0.000  0.437  0.688  -
L4-1k    -      0.587  0.000  0.515  0.679  -

Table 2: F1 score of our lstm-rnn model for each test set proposed at Task 1. 1k refers to the manually-labelled corpus containing 1 000 tweets.

Corpus   n+     n      neu    none   p      p+
L6       0.508  0.464  0.135  0.613  0.205  0.682
L6-1k    0.451  0.370  0.000  0.446  0.232  0.628
L4       -      0.674  0.071  0.569  0.747  -
L4-1k    -      0.642  0.028  0.518  0.714  -

Table 3: F1 score of our l2-lg model for each test set proposed at Task 1.

Finally, Table 4 compares the performance of the participating systems in Task 2, both for football and for political tweets. The trend remains in this case, and the machine learning approaches again outperformed our deep learning proposal.

System      social-tv  stompol
elirf       0.633 (1)  0.655 (1)
lys-lg•     0.599 (2)  0.610 (4)
gsi         -          0.635 (2)
tid-spark   0.557 (3)  0.631 (3)
lys-lstm•   0.540 (3)* 0.522 (4)*

Table 4: Accuracy comparison for Task 2 between the best run of the rest of the participants and our machine and deep learning models.

6 Conclusions and future research

In the 4th edition of tass, held in 2015, we tried a long short-term memory neural network to determine the polarity of tweets at the global and aspect levels. The performance of this model has been compared with that of our last-year system, based on an l2 logistic regression. Experimental results suggest that we need to explore new architectures and specific word embedding representations to obtain state-of-the-art results on sentiment analysis tasks.
In this respect, we believe sentiment-specific word embeddings and other deep learning approaches (Tang et al., 2014) can help enrich our current model. Unsupervised pre-training has also been shown to improve the performance of deep learning architectures (Severyn and Moschitti, 2015).

References

Bastien, F., P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. 2012. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.

Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bespalov, D., B. Bai, Y. Qi, and A. Shokoufandeh. 2011. Sentiment classification based on supervised latent n-gram analysis. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 375-382. ACM.

Brooke, J., M. Tofiloski, and M. Taboada. 2009. Cross-Linguistic Sentiment Analysis: From English to Spanish. In Proceedings of the International Conference RANLP-2009, pages 50-54, Borovets, Bulgaria. ACL.

Gimpel, K., N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11), short papers, volume 2, pages 42-47.

Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Hurtado, L. and F. Pla. 2014. ELiRF-UPV en TASS 2014: Análisis de sentimientos, detección de tópicos y análisis de sentimientos de aspectos en Twitter. In Proceedings of the TASS workshop at SEPLN.

Joshi, M. and C. Penstein-Rosé. 2009. Generalizing dependency features for opinion mining. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09, pages 313-316, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mohammad, S. M., S. Kiritchenko, and X. Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013), Atlanta, Georgia, USA, June.

Montejo-Ráez, A., M. A. García-Cumbreras, and M. C. Díaz-Galiano. 2014. Participación de SINAI word2vec en TASS 2014. In Proceedings of the TASS workshop at SEPLN.

Pang, B. and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115-124. Association for Computational Linguistics.

Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79-86.

Pennebaker, J. W., M. E. Francis, and R. J. Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates, page 71.

Saralegi, X. and I. San Vicente. 2013. Elhuyar at TASS 2013. In Alberto Díaz Esteban, Iñaki Alegría Loinaz, and Julio Villena Román, editors, XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural (SEPLN 2013). TASS 2013 - Workshop on Sentiment Analysis at SEPLN 2013, pages 143-150, Madrid, Spain, September.

Severyn, A. and A. Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464-469, Denver, Colorado. Association for Computational Linguistics.

Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1631-1642, Seattle, Washington, USA. ACL.

Tang, D., F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1555-1565.

Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 96-101, Marrakech, Morocco.

Thelwall, M., K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. 2010. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology, 61(12):2544-2558, December.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015a. A linguistic approach for determining the topics of Spanish Twitter messages. Journal of Information Science, 41(2):127-145.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015b. On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages. Journal of the Association for Information Science and Technology, to appear.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015c. A syntactic approach for opinion mining on Spanish reviews. Natural Language Engineering, 21(01):139-163.

Vilares, D., Y. Doval, M. A. Alonso, and C. Gómez-Rodríguez. 2014a. LyS at TASS 2014: A prototype for extracting and analysing aspects from Spanish tweets. In Proceedings of the TASS workshop at SEPLN.

Vilares, D., M. Hermo, M. A. Alonso, C. Gómez-Rodríguez, and Y. Doval. 2014b. LyS: Porting a Twitter Sentiment Analysis Approach from Spanish to English. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 411-415.

Villena-Román, J., J. García-Morera, C. Moreno-García, S. Lana-Serrano, and J. C. González-Cristóbal. TASS 2013 - a second step in reputation analysis in Spanish. Procesamiento del Lenguaje Natural, pages 37-44.

Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. Overview of TASS 2015.

Villena-Román, J., E. Martínez-Cámara, J. García-Morera, and S. M. Jiménez-Zafra. 2015. TASS 2014 - the challenge of aspect-based sentiment analysis. Procesamiento del Lenguaje Natural, 54:61-68.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.