TASS 2015, September 2015, pp. 47-52. Received 10-07-15, revised 24-07-15, accepted 26-07-15.

LyS at TASS 2015: Deep Learning Experiments for Sentiment Analysis on Spanish Tweets∗

David Vilares, Yerai Doval, Miguel A. Alonso and Carlos Gómez-Rodríguez
Grupo LyS, Departamento de Computación, Campus de A Coruña s/n
Universidade da Coruña, 15071, A Coruña, Spain
{david.vilares, yerai.doval, miguel.alonso, carlos.gomez}@udc.es

Abstract: This paper describes the participation of the LyS group at tass 2015. In this year's edition, we used a long short-term memory neural network to address the two proposed challenges: (1) sentiment analysis at a global level and (2) aspect-based sentiment analysis on football and political tweets. The performance of this deep learning approach is compared to our last-year model, based on a square-regularized logistic regression.
Experimental results show that strategies such as unsupervised pre-training, sentiment-specific word embeddings or modifying the current architecture might be needed to achieve state-of-the-art results.
Keywords: deep learning, long short-term memory, sentiment analysis, Twitter

1 Introduction

The 4th edition of the tass workshop addresses two of the most popular tasks in sentiment analysis (sa), focusing on Spanish tweets: (1) polarity classification at a global level and (2) a simplified version of aspect-based sentiment analysis, where the goal is to predict the polarity of a set of predefined and identified aspects (Villena-Román et al., b).

The challenge of polarity classification has typically been tackled from two different angles: lexicon-based and machine learning (ml) approaches. The first group relies on sentiment dictionaries to detect the subjective words or phrases of the text, and defines lexical- (Brooke, Tofiloski, and Taboada, 2009; Thelwall et al., 2010) or syntactic-based rules (Vilares, Alonso, and Gómez-Rodríguez, 2015c) to deal with phenomena such as negation, intensification or irrealis. The second group focuses on training classifiers through supervised learning algorithms that are fed a number of features (Pang, Lee, and Vaithyanathan, 2002; Mohammad, Kiritchenko, and Zhu, 2013; Hurtado and Pla, 2014). Although competitive when labelled data is provided, these classifiers have shown weakness when interpreting the compositionality of complex phrases (e.g. adversative subordinate clauses). In this respect, some studies have evaluated the impact of syntactic-based features on these supervised learning techniques (Vilares, Alonso, and Gómez-Rodríguez, 2015b; Joshi and Penstein-Rosé, 2009) or on other related tasks, such as multi-topic detection on tweets (Vilares, Alonso, and Gómez-Rodríguez, 2015a).

∗ This research is supported by the Ministerio de Economía y Competitividad and FEDER (FFI2014-51978-C2) and by the Xunta de Galicia (R2014/034). The first author is funded by the Ministerio de Educación, Cultura y Deporte (FPU13/01180). Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with recognised ISSN 1613-0073.
More recently, deep learning (Bengio, 2009) has shown its competitiveness in polarity classification. Bespalov et al. (2011) introduce a word-embedding approach for higher-order n-grams, using a multi-layer perceptron and a linear function as the output layer. Socher et al. (2013) introduce a new deep learning architecture, a Recursive Neural Tensor Network, which improved the state of the art on the Pang and Lee (2005) movie reviews corpus when trained together with the Stanford Sentiment Treebank. Tang et al. (2014) suggest that currently existing word embedding methods are not adequate for sa, because words with completely different sentiment might appear in similar contexts (e.g. 'good' and 'bad'). They propose a sentiment-specific word embedding (sswe) model, using a deep learning architecture trained on massive distant-supervised tweets. For Spanish, Montejo-Ráez, García-Cumbreras, and Díaz-Galiano (2014) apply word embedding using Word2Vec (Mikolov et al., 2013) and then use those vectors as features for traditional machine learning techniques.

In this paper we also rely on a deep learning architecture, a long short-term memory (lstm) recurrent neural network, to solve the challenges of this tass edition. The results are compared with those of our model for last year's edition, a logistic regression approach fed with hand-crafted features.

2 Task 1: Sentiment Analysis at a global level

Let L={l0, l1, ..., ln} be the set of polarity labels and T={t0, t1, ..., tm} the set of labelled texts. The aim of the task is to define a hypothesis function h : T → L.

To train and evaluate the task, the collection from tass 2014 (Villena-Román et al., 2015) was used. It contains a training set of 7 128 tweets, intended to build and tune the models, and two test sets: (1) a pooling-labelled collection of 60 798 tweets and (2) a manually-labelled test set of 1 000 tweets. The collection is annotated using two different criteria. The first one considers a set of 6 polarities (L6): no opinion (none), positive (p), strongly positive (p+), negative (n), strongly negative (n+) and mixed (neu), i.e. tweets that mix both negative and positive ideas. A simplified version with 4 classes (L4) is also proposed, where the polarities p+ and n+ are merged into p and n, respectively.

In the rest of the paper, we will use h4 and h6 to refer to our prediction models for 4 and 6 classes, respectively.

3 Task 2: Sentiment Analysis at the aspect level

Let L={l0, l1, ..., ln} be the set of polarity labels, A={a0, a1, ..., ao} the set of aspects and T={t0, t1, ..., tm} the set of texts. The aim of the task is to define a hypothesis function h : A × T → L. Two different corpora are provided to evaluate this task: a social-tv corpus with football tweets (1 773 training and 1 000 test tweets) and a political corpus, called stompol (784 training and 500 test tweets).
Each aspect can be assigned the p, n or neu polarities (L3). The tass organisation provided both A and the identification of the aspects that appear in each tweet, so the task can be seen as identifying the scope s(a, t) of an aspect a in the tweet t ∈ T, with s a substring of t and a ∈ A, and then predicting the polarity using the hypothesis function h3(s) → L3.

To identify the scope we followed a naïve approach: given an aspect a that appears at position i in a text t=[w0, ..., wi−x, ..., ai, ..., wi+x, ..., wp], we created a snippet of length x around the aspect, which is considered to be its scope. Preliminary experiments showed that x = 4 and taking the entire tweet were the best options for the social-tv and the stompol collections, respectively.
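The snippet extraction just described can be sketched as follows (a minimal illustration; the function name and the whitespace tokenisation are ours, not part of the system description):

```python
def aspect_scope(tokens, i, x=None):
    """Return the scope of the aspect occurring at position i of a tweet.

    tokens: the tokenised tweet; i: index of the aspect token; x: window
    radius, or None to take the whole tweet (the best setting we found
    for the stompol corpus; x = 4 worked best for social-tv).
    """
    if x is None:
        return tokens
    # Keep up to x tokens on each side of the aspect, clipped at the edges.
    return tokens[max(0, i - x):i + x + 1]

# Hypothetical example with x = 4 over a whitespace-tokenised tweet:
tweet = "el portero de el Barcelona ha hecho una gran parada hoy".split()
scope = aspect_scope(tweet, tweet.index("Barcelona"), x=4)
```

The snippet is then passed to the polarity classifier in place of the full tweet.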
4 Supervised sentiment analysis models

Our aim this year was to compare our last-year model to a deep learning architecture that was initially available for binary polarity classification.

4.1 Long Short-Term Memory

Long Short-Term Memory (lstm) is a recurrent neural network (rnn) proposed by Hochreiter and Schmidhuber (1997). Traditional rnn were conceived with the objective of being able to store representations of inputs in the form of activations, showing temporal capacities and helping to learn short-term dependencies. However, they may suffer from the problem of exploding gradients¹. The lstm tries to solve these problems using a different type of unit, called a memory cell, which can remember a value for an arbitrary period of time.

In this work, we use a model composed of a single lstm and a logistic function as the output layer, which has an available implementation² in Theano (Bastien et al., 2012). To train the model, the tweets were tokenised (Gimpel et al., 2011), lemmatised (Taulé, Martí, and Recasens, 2008), converted to lowercase to reduce sparsity and finally indexed.

¹ The gradient signal becomes either too small or too large, causing very slow learning or a diverging situation, respectively.
² http://deeplearning.net/tutorial/
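The single-lstm-plus-logistic-output model can be sketched in plain numpy as follows. This is an illustrative reconstruction, not the Theano code actually used: all parameter names are our own, and the binary logistic output stands in for the multi-class setting of the tasks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_logistic_forward(x_seq, params):
    """Run a single lstm layer over a tweet and apply a logistic output.

    x_seq: T x d array of word-embedding vectors for the indexed tokens.
    params: dict with gate weights W_* (h x d), recurrent weights U_*
    (h x h) and biases b_* (h,) for * in {i, f, o, c}, plus the output
    layer w_out (h,) and scalar b_out. All names are ours.
    """
    h = np.zeros_like(params["b_i"])
    c = np.zeros_like(params["b_i"])  # memory cell: retains values over time
    for x in x_seq:
        i = sigmoid(params["W_i"] @ x + params["U_i"] @ h + params["b_i"])  # input gate
        f = sigmoid(params["W_f"] @ x + params["U_f"] @ h + params["b_f"])  # forget gate
        o = sigmoid(params["W_o"] @ x + params["U_o"] @ h + params["b_o"])  # output gate
        g = np.tanh(params["W_c"] @ x + params["U_c"] @ h + params["b_c"])  # candidate value
        c = f * c + i * g   # the cell keeps (f close to 1) or overwrites its value
        h = o * np.tanh(c)
    # Logistic output layer on the final hidden state (binary case).
    return sigmoid(params["w_out"] @ h + params["b_out"])
```

The gating is what lets the memory cell hold a value for an arbitrary period of time, mitigating the gradient problems of plain rnn.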
To train the lstm-rnn, we relied on adadelta (Zeiler, 2012), an adaptive learning rate method, using stochastic training (batch size = 16) to speed up the learning process. Experiments with non-stochastic training runs did not show an improvement in terms of accuracy. We empirically explored the size of the word embedding³ and the number of words to keep in the vocabulary⁴, obtaining the best performance with a choice of 128 and 10 000, respectively.

³ The size of the vector obtained for each word and the number of hidden units in the lstm layer.
⁴ The number of words to be indexed. The remaining words are mapped to an unknown token, so they all share the same index.

4.2 L2 logistic regression

Our last-year model relied on the simple and well-known squared-regularised logistic regression (l2-lg), which performed very competitively for all polarity classification tasks. A detailed description of this model can be found in Vilares et al. (2014a); here we just list the features that were used: lemmas (Taulé, Martí, and Recasens, 2008), psychometric properties (Pennebaker, Francis, and Booth, 2001) and subjective lexicons (Saralegi and San Vicente, 2013).
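The adadelta update applied at each minibatch can be sketched as follows (our own minimal implementation of the update rule from Zeiler (2012); function and variable names, and the rho/eps defaults, are ours):

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One adadelta step (Zeiler, 2012); no global learning rate is needed.

    state is a pair of running averages, E[g^2] and E[dx^2], both
    initialised to zeros with param's shape.
    """
    eg2, edx2 = state
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                # accumulate E[g^2]
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad  # RMS-scaled step
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                # accumulate E[dx^2]
    return param + dx, (eg2, edx2)

# In stochastic training, this update would be applied once per minibatch
# (batch size = 16 in our runs), with one state pair per parameter array.
```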
This architecture also obtained robust and competitive performance for English tweets at SemEval 2014 (Vilares et al., 2014b).

Penalising neutral tweets

Previous editions of tass have shown that the performance on neu tweets is much lower than for the rest of the classes (Villena-Román et al., a). This year we proposed a small variation on our l2-lg model: a penalising system for neu tweets to determine the polarities under the L6 configuration, where, given an L4 and an L6 lg-classifier and a tweet t, if h6(t) = neu and h4(t) ≠ neu then h6(t) := h4(t). On the test set, this strategy yielded an improvement of over one percentage point (from 55.2% to the 56.8% reported in the Experiments section).
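The back-off rule above can be written directly as code (a sketch; h6 and h4 stand for any trained L6 and L4 classifiers, represented here as plain callables):

```python
def penalised_prediction(h6, h4, tweet):
    """Back off from the six-class model when it predicts 'neu'.

    h6 and h4 are the trained L6 and L4 classifiers; the rule replaces
    an h6 'neu' answer with the h4 answer whenever the latter is not
    'neu', and keeps the h6 answer otherwise.
    """
    label6 = h6(tweet)
    if label6 == "neu":
        label4 = h4(tweet)
        if label4 != "neu":
            return label4
    return label6
```

Since the L4 labels p, n and none are also valid L6 labels, the substituted answer stays within the six-class label set.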
5 Experimental results

Table 1 compares our models with the best performing run of each of the other participants (out-of-date runs are not included). The performance of our current deep learning model is still far from the top-ranking systems, and from our last-year model too, although it worked acceptably on the L6 manually-labelled test set.

System        Ac 6       Ac 6-1k    Ac 4       Ac 4-1k
lif           0.672 (1)  0.516 (1)  0.725 (1)  0.692 (1)
elirf         0.659 (2)  0.488 (3)  0.722 (2)  0.645 (5)
gsi           0.618 (3)  0.487 (4)  0.690 (4)  0.658 (3)
dlsi          0.595 (4)  0.385 (14) 0.655 (6)  0.637 (7)
gti-grad      0.592 (5)  0.509 (2)  0.695 (3)  0.674 (2)
lys-lg•       0.568 (6)  0.434 (5)  0.664 (5)  0.634 (9)
dt            0.557 (7)  0.408 (10) 0.625 (7)  0.601 (11)
itainnova     0.549 (8)  0.405 (11) 0.610 (10) 0.484 (14)
BittenPotato  0.535 (9)  0.418 (8)  0.602 (11) 0.632 (10)
lys-lstm•     0.505 (9)* 0.430 (6)* 0.599 (11)* 0.605 (10)*
sinai-esma    0.502 (10) 0.411 (9)  -          -
cu            0.495 (11) 0.419 (7)  0.481 (13) 0.600 (12)
ingeotec      0.488 (12) 0.431 (6)  -          -
sinai         0.474 (13) 0.389 (13) 0.619 (8)  0.641 (6)
tid-spark     0.462 (14) 0.400 (12) 0.594 (12) 0.649 (4)
gas-ucr       0.342 (15) 0.338 (15) 0.446 (14) 0.556 (13)
ucsp          0.273 (16) -          0.613 (9)  0.636 (8)

Table 1: Accuracy comparison for Task 1 between the best run of each participant and our machine- and deep-learning models. Runs marked with • are our l2-lg and lstm runs. The number in parentheses indicates the ranking of each group's best run.

Tables 2 and 3 show the f1 score for each polarity for the lstm-rnn and l2-lg models, respectively. The results reflect the lack of capacity of the current lstm model to learn the minority classes in the training data (p, n+ and neu). In this respect, we plan to explore how balanced and bigger corpora can help diminish this problem.

Corpus   n+     n      neu    none   p      p+
L6       0.000  0.486  0.000  0.582  0.049  0.575
L6-1k    0.090  0.462  0.093  0.508  0.209  0.603
L4       -      0.623  0.000  0.437  0.688  -
L4-1k    -      0.587  0.000  0.515  0.679  -

Table 2: F1 score of our lstm-rnn model for each test set proposed at Task 1. 1k refers to the manually-labelled corpus containing 1 000 tweets.

Corpus   n+     n      neu    none   p      p+
L6       0.508  0.464  0.135  0.613  0.205  0.682
L6-1k    0.451  0.370  0.000  0.446  0.232  0.628
L4       -      0.674  0.071  0.569  0.747  -
L4-1k    -      0.642  0.028  0.518  0.714  -

Table 3: F1 score of our l2-lg model for each test set proposed at Task 1.

Finally, Table 4 compares the performance of the participating systems in Task 2, both for football and for political tweets. The trend remains in this case, and the machine learning approaches again outperformed our deep learning proposal.

System      social-tv  stompol
elirf       0.633 (1)  0.655 (1)
lys-lg•     0.599 (2)  0.610 (4)
gsi         -          0.635 (2)
tid-spark   0.557 (3)  0.631 (3)
lys-lstm•   0.540 (3)* 0.522 (4)*

Table 4: Accuracy comparison for Task 2 between the best run of the rest of the participants and our machine and deep learning models.

6 Conclusions and future research

In the 4th edition of tass, held in 2015, we tried a long short-term memory neural network to determine the polarity of tweets at the global and aspect levels. The performance of this model has been compared with that of our last-year system, based on an l2 logistic regression. Experimental results suggest that we need to explore new architectures and specific word embedding representations to obtain state-of-the-art results on sentiment analysis tasks.
In this respect, we believe sentiment-specific word embeddings and other deep learning approaches (Tang et al., 2014) can help enrich our current model. Unsupervised pre-training has also been shown to improve the performance of deep learning architectures (Severyn and Moschitti, 2015).

References

Bastien, F., P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. 2012. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.

Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bespalov, D., B. Bai, Y. Qi, and A. Shokoufandeh. 2011. Sentiment classification based on supervised latent n-gram analysis. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 375-382. ACM.

Brooke, J., M. Tofiloski, and M. Taboada. 2009. Cross-Linguistic Sentiment Analysis: From English to Spanish. In Proceedings of the International Conference RANLP-2009, pages 50-54, Borovets, Bulgaria. ACL.

Gimpel, K., N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11), short papers, volume 2, pages 42-47.

Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Hurtado, L. and F. Pla. 2014. ELiRF-UPV en TASS 2014: Análisis de sentimientos, detección de tópicos y análisis de sentimientos de aspectos en Twitter. In Proceedings of the TASS workshop at SEPLN.

Joshi, M. and C. Penstein-Rosé. 2009. Generalizing dependency features for opinion mining. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09, pages 313-316, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mohammad, S. M., S. Kiritchenko, and X. Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013), Atlanta, Georgia, USA, June.

Montejo-Ráez, A., M. A. García-Cumbreras, and M. C. Díaz-Galiano. 2014. Participación de SINAI word2vec en TASS 2014. In Proceedings of the TASS workshop at SEPLN.

Pang, B. and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115-124. Association for Computational Linguistics.

Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79-86.

Pennebaker, J. W., M. E. Francis, and R. J. Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates, page 71.

Saralegi, X. and I. San Vicente. 2013. Elhuyar at TASS 2013. In Alberto Díaz Esteban, Iñaki Alegría Loinaz, and Julio Villena Román, editors, XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural (SEPLN 2013). TASS 2013 - Workshop on Sentiment Analysis at SEPLN 2013, pages 143-150, Madrid, Spain, September.

Severyn, A. and A. Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464-469, Denver, Colorado. Association for Computational Linguistics.

Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1631-1642, Seattle, Washington, USA. ACL.

Tang, D., F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1555-1565.

Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 96-101, Marrakech, Morocco.

Thelwall, M., K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. 2010. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology, 61(12):2544-2558, December.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015a. A linguistic approach for determining the topics of Spanish Twitter messages. Journal of Information Science, 41(2):127-145.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015b. On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages. Journal of the Association for Information Science and Technology, to appear.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015c. A syntactic approach for opinion mining on Spanish reviews. Natural Language Engineering, 21(01):139-163.

Vilares, D., Y. Doval, M. A. Alonso, and C. Gómez-Rodríguez. 2014a. LyS at TASS 2014: A prototype for extracting and analysing aspects from Spanish tweets. In Proceedings of the TASS workshop at SEPLN.

Vilares, D., M. Hermo, M. A. Alonso, C. Gómez-Rodríguez, and Y. Doval. 2014b. LyS: Porting a Twitter Sentiment Analysis Approach from Spanish to English. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 411-415.

Villena-Román, J., J. García-Morera, C. Moreno-García, S. Lana-Serrano, and J. C. González-Cristóbal. TASS 2013 - a second step in reputation analysis in Spanish. Procesamiento del Lenguaje Natural, pages 37-44.

Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. Overview of TASS 2015.

Villena-Román, J., E. Martínez-Cámara, J. García-Morera, and S. M. Jiménez-Zafra. 2015. TASS 2014 - the challenge of aspect-based sentiment analysis. Procesamiento del Lenguaje Natural, 54:61-68.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.