TASS 2017: Workshop on Semantic Analysis at SEPLN, September 2017, pp. 71-76




    Applying Recurrent Neural Networks to Sentiment
              Analysis of Spanish Tweets
    Aplicación de Redes Neuronales Recurrentes al Análisis de
               Sentimientos sobre Tweets en Español
           Oscar Araque, Rodrigo Barbado, J. Fernando Sánchez-Rada y
                                     Carlos A. Iglesias
               Intelligent Systems Group, Universidad Politécnica de Madrid
                              Av. Complutense 30, 28040 Madrid
                o.araque@upm.es, rodrigo.barbado.esteban@alumnos.upm.es
                           {jfernando, carlosangel.iglesias}@upm.es

       Abstract: This article presents the participation of the Intelligent Systems Group
       (GSI) at Universidad Politécnica de Madrid (UPM) in the Sentiment Analysis workshop
       focused on Spanish tweets, TASS 2017. We have worked on Task 1, which aims to
       classify the sentiment polarity of Spanish tweets. For this task we propose a Recurrent
       Neural Network (RNN) architecture composed of Long Short-Term Memory (LSTM)
       cells followed by a feedforward network. The architecture makes use of two different
       types of features: word embeddings and sentiment lexicon values. The recurrent
       architecture allows us to process text sequences of different lengths, while the lexicon
       injects sentiment information directly into the system. The results indicate that this
       feature combination leads to improved sentiment analysis performance.
       Keywords: Deep Learning, Natural Language Processing, Sentiment Analysis, Recurrent
       Neural Network, TensorFlow
       Resumen: En este artículo se presenta la participación del Grupo de Sistemas
       Inteligentes (GSI) de la Universidad Politécnica de Madrid (UPM) en el taller de
       Análisis de Sentimientos centrado en tweets en Español: el TASS2017. Hemos trabajado
       en la Tarea 1, tratando de predecir correctamente la polaridad del sentimiento
       de tweets en español. Para esta tarea hemos propuesto una arquitectura consistente
       en una Red Neuronal Recurrente (RNN) compuesta de celdas Long Short-Term
       Memory (LSTM) seguida por una red neuronal prealimentada. La arquitectura
       hace uso de dos tipos distintos de características: word embeddings y los valores de
       un diccionario de sentimientos. La recurrencia de la arquitectura permite procesar
       secuencias de texto de distintas longitudes, mientras que el diccionario inserta
       información de sentimiento directamente en el sistema. Los resultados obtenidos indican
       que esta combinación de características lleva a mejorar los resultados en análisis de
       sentimientos.
       Palabras clave: Aprendizaje Profundo, Procesamiento de Lenguaje Natural,
       Análisis de Sentimientos, Red Neuronal Recurrente, TensorFlow


1    Introduction

Recent developments in the area of deep learning are strongly impacting sentiment analysis techniques. While traditional methods based on feature engineering are still prevalent, new deep learning approaches are succeeding and reduce the need for labeled corpora and feature definition. Moreover, traditional and deep learning approaches can be combined, obtaining improved results (Araque et al., 2017).

This paper describes our participation in TASS 2017 (Martínez-Cámara et al., 2017). The Taller de Análisis de Sentimientos en la SEPLN (TASS) is a workshop that fosters research on sentiment analysis in Spanish for short texts such as tweets. The first task of this challenge, Task 1, consists in determining the global polarity at the message level. The dataset for the evaluation of this task contains tweets annotated with 4 polarity labels (P, N, NEU, NONE). P stands for positive, N for negative, and NEU for neutral, while NONE indicates the absence of sentiment polarity. The task provides a corpus containing a total of 1,514 tweets written in Spanish and covering a diversity of subjects.

We have faced this challenge as an opportunity to evaluate how these techniques can be applied in the TASS domain, and to compare their results with those of the traditional techniques we used in a previous participation in this challenge (Araque et al., 2015).

The remainder of this paper is organized as follows. Sect. 2 introduces related work. Sect. 3 then describes the proposed polarity classification model and its implementation, which is evaluated in Sects. 4 and 5. Finally, conclusions are drawn in Sect. 6.

2    Related work

Many works in recent years involve the use of neural architectures to learn text classification problems and, more specifically, to perform Sentiment Analysis. A relevant example of this are Recursive Neural Tensor Networks (Socher et al., 2013). This architecture makes use of the structure of parse trees to effectively capture the negation phenomenon and its scope. A similar work (Tai, Socher, and Manning, 2015) introduces the use of LSTMs in tree structures, leveraging both the information contained in these trees and the representation capabilities of gated units. Although parse trees can be very useful in sentiment analysis, many works do not make use of them, as they introduce an additional computation overhead. In (Wang et al., 2015) a data-driven approach is described that learns from noisy annotated data, also making use of LSTM units and an error signal processing scheme to avoid the vanishing gradient problem. Another useful technique is attention (Bahdanau, Cho, and Bengio, 2014), which enables weighting the importance of the different words in a given piece of text. Attention has been used successfully for Sentiment Analysis in a recurrent architecture, as presented in (Wang et al., 2016).

In the context of the TASS challenge, this is not the first time that neural architectures have been proposed for solving the different tasks. In (Vilares et al., 2015), the authors propose an LSTM architecture that is compared to linear classifiers. Also, word embeddings have been leveraged in previous editions, as shown in (Martínez-Cámara et al., 2015). Nevertheless, neural networks have not been thoroughly studied in TASS, and many potentially interesting techniques remain unused.

3    Sentiment analysis Task

3.1   Model architecture

The approach followed for the Sentiment Analysis at Tweet level task consists of an RNN composed of LSTM cells that parses the input into a fixed-size vector representation. This representation of the text is then used to perform the sentiment classification. Two variations of this architecture are used: (i) an LSTM that iterates over the input word vectors, or (ii) one that iterates over a combination of the input word vectors and polarity values from a sentiment lexicon.

The general architecture of the model takes as inputs the word vectors and the lexicon values for each word of an input tweet. The inputs are passed through a one-layer LSTM with a tunable number of hidden units. The generated representation is then used to determine the polarity of the input text using a feedforward layer with a softmax output activation. The output of this last layer encodes the probability that the input text belongs to each class. Fig. 1 shows this architecture, which is further described as follows:

  1. The input vector is the word embedding of each word in a given tweet. It contains word-level information or sentiment word-level information. Each specific case is described in more detail below.

  2. The number of RNN units is chosen during training for optimization purposes. In this work we use a one-layer LSTM to avoid overfitting the network to the training data.

  3. The weight matrix has the RNN size as input dimension and the number of classes as output dimension. This means that, taking the last LSTM output as input, we obtain a vector whose length is the number of classes. This matrix is also optimized during the training process.

  4. The final probability vector is obtained by passing the result of the previous matrix multiplication through a softmax function, which converts the components of this result vector into probabilities. Finally, the predicted label for the tweet is the component of the output vector with the highest probability.

[Figure 1: Recurrent Neural Network (RNN) architecture]

In the following, the two types of inputs used are described thoroughly.
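As an illustration of this architecture, the sketch below builds an equivalent LSTM-plus-feedforward classifier with the tf.keras API. It is a minimal approximation, not the exact graph-mode code used in our experiments; the layer sizes simply reuse the values later reported in Sect. 4.

```python
import tensorflow as tf

def build_model(vocab_size=20000, embedding_dim=16, lstm_units=16, num_classes=4):
    """Minimal sketch of the LSTM + feedforward (softmax) architecture."""
    model = tf.keras.Sequential([
        # Token indices -> word embeddings, one vector per word (item 1).
        # Index 0 is assumed to be reserved for padding.
        tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
        # One-layer LSTM that folds the variable-length tweet into a fixed-size vector (item 2).
        tf.keras.layers.LSTM(lstm_units),
        # Feedforward layer with softmax output: one probability per polarity class (items 3-4).
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The predicted polarity is then simply the argmax over the four output probabilities (P, N, NEU, NONE).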
3.2   Word-level RNN

For this input, the tweet text is tokenized into word tokens, which are then expressed in a one-hot representation. That is, each token is represented as a vector in ℝ^{|V|×1} with all zeros and a single 1 at the index of that token in the sorted token vocabulary. For example, the representations for the tokens a, antes and zebra would appear as:

    w_a = [1 0 0 · · · 0],   w_antes = [0 1 0 · · · 0],   w_zebra = [0 0 0 · · · 1]

We limit the number of words to a certain vocabulary size in order to limit the computational cost of this preprocessing step. Before feeding this data to the network, each tweet is represented by the one-hot representations of all its tokens.
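A toy NumPy illustration of this one-hot scheme is given below; the vocabulary and tokens are invented for the example, and in practice the network receives token indices and learns the embedding matrix on top of them.

```python
import numpy as np

def one_hot_encode(tokens, vocab):
    """Encode tokens as |V|-dimensional vectors with a single 1 (one-hot)."""
    index = {word: i for i, word in enumerate(sorted(vocab))}  # sorted token vocabulary
    vectors = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for row, token in enumerate(tokens):
        if token in index:                    # out-of-vocabulary tokens stay all-zero
            vectors[row, index[token]] = 1.0
    return vectors

# one_hot_encode(["antes", "a"], {"a", "antes", "zebra"})
# -> [[0., 1., 0.],
#     [1., 0., 0.]]
```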
3.3   Sentiment word-level RNN

Additionally, we include sentiment information in the word representations by means of a sentiment lexicon. A similar approach to the word-level RNN is followed, but instead of using only the information about the different words contained in each tweet, information about the sentiment of each word is also used. In this case, the preprocessing is modified:

  1. First, each tweet is split into tokens.

  2. Secondly, a sentiment dictionary is used to map words to sentiment polarity values. In this way, each word is mapped to a positive, neutral or negative value.

  3. Finally, the representation of a word consists of its word vector concatenated with its sentiment polarity label.
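The sketch below shows what this concatenation could look like; the lexicon entries and the +1/0/−1 polarity encoding are illustrative assumptions, not the exact format of the lexicon used in Sect. 4.

```python
import numpy as np

def sentiment_word_vectors(tokens, embeddings, lexicon):
    """Concatenate each token's word vector with its lexicon polarity (step 3 above)."""
    rows = []
    for token in tokens:
        word_vec = embeddings[token]            # e.g. a 16-dimensional word embedding
        polarity = lexicon.get(token, 0.0)      # +1 positive, -1 negative, 0 neutral/unknown
        rows.append(np.concatenate([word_vec, [polarity]]))
    return np.stack(rows)

# Hypothetical usage with a tiny lexicon:
#   lexicon = {"feliz": 1.0, "triste": -1.0}
#   sentiment_word_vectors(["hoy", "estoy", "feliz"], embeddings, lexicon)
```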
3.4   Regularization

Given the reduced number of training examples available for this task (Sect. 4), a number of regularization techniques have been used in the experiments. Regularization is used in machine learning to control the complexity of a learning model so that it does not overfit the training data and generalizes better to the test data.

It is known that Recurrent Neural Networks tend to heavily overfit the training set (Zaremba, Sutskever, and Vinyals, 2014). We employ two regularization techniques to prevent this:

  1. L2 regularization (Ng, 2004). This technique is applied to the weights of the feedforward layer of the network. Being W_MLP the weights of this layer, this regularization adds the following value to the cost function:

        λ ‖ W_MLP^T W_MLP ‖

     where λ is a parameter that represents the importance assigned to this regularization in the overall cost function.

  2. Dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016b). This strategy consists in randomly setting a fraction of units to 0 at each step of the training process to prevent overfitting. At test time, the outputs are scaled by this fraction. Dropout has recently been shown to be theoretically similar to applying a Bayesian prior to the network weights (Gal and Ghahramani, 2016a).
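A sketch of how these two terms could enter the training objective is shown below; tf.norm is used to mirror the penalty written above, and λ = 0.05 and a dropout rate of 0.7 are the values reported in Sect. 4.

```python
import tensorflow as tf

def regularized_loss(labels, probs, w_mlp, lam=0.05):
    """Cross-entropy plus the L2 term of item 1 above."""
    ce = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, probs))
    l2_term = lam * tf.norm(tf.matmul(w_mlp, w_mlp, transpose_a=True))  # lambda * ||W^T W||
    return ce + l2_term

# Dropout (item 2) amounts to a single call on the LSTM output during training, e.g.:
#   dropped = tf.nn.dropout(lstm_output, rate=0.7)
```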
3.5   TensorFlow implementation

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms (Abadi, Agarwal, and et al., 2015). To implement the model described above, we first had to define a computation graph composed of the RNN architecture and the matrices and operations needed. Once the graph is defined, the training process consists in iteratively adjusting numerical values in order to reach the best results. This task was done following these ideas:

  • The values to optimize are the internal parameters and matrices that form the network: the word embedding representations, the LSTM internal weights, and the feedforward weight matrix used as the last layer. At the beginning of training, these values are initialized randomly using a normal distribution ∼ N(µ, σ), with µ = 0, and are treated as variables to be optimized at each training step by TensorFlow.

  • With these variables defined, the training process iteratively modifies them in order to reach better results. To obtain an error signal that can be used to modify the learning parameters, we use a cost function that has to be minimized. This minimization problem is solved by applying gradient descent via backpropagation (LeCun et al., 2012). In this work we employ the Adam algorithm (Kingma and Ba, 2014).

  • In each iteration of the training process, known as an epoch, data from the training set flows through the whole computation graph, yielding a prediction result. The cost metric is computed by comparing the obtained result with the true training labels. When backpropagation is finished, the variable values are updated and the next iteration proceeds.

  • In order to enhance performance, we use early stopping based on the accuracy on the development set. That is, for each epoch we monitor the performance of the network on the development set. If it has not improved for a number of epochs (in this work, 3 epochs), the training process is stopped and the model weights are frozen.

  • The number of iterations can be chosen, as well as other parameters such as the RNN size. For testing new examples, we use the test data as input, passing all the tweets through our model to obtain, for each tweet, a vector of class probabilities, and choosing the class with the highest probability value.
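In tf.keras terms, this training procedure reduces to a few lines. The sketch below assumes the build_model helper from the Sect. 3.1 sketch and placeholder arrays x_train, y_train, x_dev, y_dev and x_test; the batch size, epoch limit and patience are the values reported in Sect. 4.

```python
import numpy as np
import tensorflow as tf

model = build_model()  # sketch from Sect. 3.1

# Early stopping on development-set accuracy with a patience of 3 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=3, restore_best_weights=True)

# Adam optimization of the cross-entropy cost, batch size 256, at most 20 epochs.
model.fit(x_train, y_train,
          validation_data=(x_dev, y_dev),
          batch_size=256, epochs=20,
          callbacks=[early_stop])

# At test time, each tweet yields a probability per class; the prediction is the argmax.
probs = model.predict(x_test)
predicted_labels = np.argmax(probs, axis=-1)
```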
4    Experimental setup

For the development of Task 1, a training and a development dataset are made public, containing 1,514 labeled tweets belonging to the InterTASS corpus. Additionally, we use the training dataset of the TASS 2015 edition, which was extracted from the General Corpus (García Cumbreras et al., 2016). We train the system with the InterTASS and TASS General Corpus training datasets, and adjust the hyper-parameters with the InterTASS development set. For the lexicon, we use the ElhPolar dictionary (Urizar and Roncal, 2013), as it has been previously used in TASS competitions.

There are three test datasets: one belonging to the InterTASS corpus and two belonging to the General Corpus of TASS: the full version, with all 60,798 tweets, and the 1k version, which contains a subset of 1,000 tweets.

In order to enhance the classification performance, several hyper-parameters have been explored, and the values that yield the best performance are selected for the testing phase. The vocabulary size is set to 20,000, with a batch size of 256 and a maximum of 20 epochs. With this value, the early stopping mechanism stopped the training before its completion. Regarding the size of the layers, the number of dimensions of the word embeddings is set to 16, as is the number of units in the LSTM layer. The dimensionality of the feedforward layer is given by the output of the LSTM, which is 16, and the number of classes of the classification task (in this case, 4). Note that these values are smaller than in usual neural architectures in order to further prevent overfitting. Also, we set the λ parameter to 0.05 and the dropout rate to 0.7.
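For reference, the selected values can be grouped into a single configuration; this is only a restatement of the numbers above, gathered for convenience.

```python
# Hyper-parameters selected for the testing phase (Sect. 4).
HYPERPARAMS = {
    "vocabulary_size": 20000,
    "batch_size": 256,
    "max_epochs": 20,
    "early_stopping_patience": 3,
    "embedding_dim": 16,
    "lstm_units": 16,
    "num_classes": 4,        # P, N, NEU, NONE
    "l2_lambda": 0.05,
    "dropout_rate": 0.7,
}
```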
5    Experimental Results

Table 1 shows the results of the two variations of the proposed model: the LSTM+MLP stack with or without lexicon values. In light of these results it is possible to affirm that the proposed architecture shows promising performance in the task of sentiment analysis of tweets, although the achieved scores are below the best in this year's challenge. This indicates that further work should be done in order to improve the results.

    Model                   Corpus         Accuracy    Macro-F1
    LSTM + MLP              InterTASS      53.70       37.1
    LSTM + MLP              TASS (1k)      60.1        45.6
    LSTM + MLP              TASS (Full)    63.1        50.9
    LSTM + MLP + Lexicon    InterTASS      56.2        38.7
    LSTM + MLP + Lexicon    TASS (1k)      63.6        46.8
    LSTM + MLP + Lexicon    TASS (Full)    63.1        49.7

    Table 1: Results in TASS 2017

The experimental results confirm the idea that the introduction of a sentiment lexicon into the word representations is, in general, beneficial for the final performance. We see this improvement in the InterTASS and 1k corpora. Nevertheless, on the Full corpus a decrease in Macro-F1 is observed.

6    Conclusions and Future Work

In this paper we have described the participation of the GSI in the TASS 2017 challenge. Our proposal relies on a Recurrent Neural Network architecture for Sentiment Analysis with Long Short-Term Memory cells. This network can be fed with both word vectors and sentiment lexicon values. The approach is able to represent arbitrarily long sequences of text thanks to the dynamic recurrent structure of the architecture. Also, several techniques have been used to avoid overfitting. The experiments show that adding a sentiment lexicon can enhance the classification performance.

However, the proposed model does not match the best results in the TASS competition. This can be due to a number of reasons, but the training process suggests that overfitting is a relevant issue. Although the regularization techniques are beneficial, the network is not able to generalize well. To address this, we think that future work in this direction should include the expansion of the training set.

Another possible improvement for future work is a better preprocessing of input texts at the word level. In addition, Convolutional Neural Networks could be used for feature extraction in combination with the Recurrent Neural Network architecture. This could lead to the computation of more complex features, which could also yield better results.

Acknowledgement

The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used in this research. This research work is partially supported through the projects Semola (TEC2015-68284-R), EmoSpaces (RTC-2016-5053-7), MOSI-AGIL (S2013/ICE-3019), Somedi (ITEA3 15011) and Trivalent (H2020 Action Grant No. 740934, SEC-06-FCT-2016).
Bibliography

Abadi, M., A. Agarwal, and P. B. et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Araque, O., I. Corcuera, C. Román, C. A. Iglesias, and J. F. Sánchez-Rada. 2015. Aspect based sentiment analysis of Spanish tweets. In TASS@SEPLN, pages 29–34.

Araque, O., I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications, June.

Bahdanau, D., K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Gal, Y. and Z. Ghahramani. 2016a. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.

Gal, Y. and Z. Ghahramani. 2016b. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

García Cumbreras, M. Á., E. Martínez Cámara, J. Villena Román, and J. García Morera. 2016. TASS 2015 – the evolution of the Spanish opinion mining systems. Procesamiento del Lenguaje Natural, 56:33–40.

Kingma, D. and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

LeCun, Y. A., L. Bottou, G. B. Orr, and K.-R. Müller. 2012. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, pages 9–48.

Martínez-Cámara, E., M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román. 2017. Overview of TASS 2017. In J. Villena Román, M. A. García Cumbreras, E. Martínez-Cámara, M. C. Díaz-Galiano, and M. García Vega, editors, Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume 1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.

Martínez-Cámara, E., Y. Gutiérrez-Vázquez, J. Fernández, A. Montejo-Ráez, and R. Munoz-Guillena. 2015. Ensemble classifier for Twitter sentiment analysis. In R. Izquierdo, editor, Proceedings of the Workshop on NLP Applications: Completing the Puzzle, number 1386 in CEUR Workshop Proceedings, Aachen.

Ng, A. Y. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM.

Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Srivastava, N., G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Tai, K. S., R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Urizar, X. S. and I. S. V. Roncal. 2013. Elhuyar at TASS 2013. In Proceedings of the Workshop on Sentiment Analysis at SEPLN (TASS 2013), pages 143–150.

Vilares, D., Y. Doval, M. A. Alonso, and C. Gómez-Rodríguez. 2015. LyS at TASS 2015: Deep learning experiments for sentiment analysis on Spanish tweets. In TASS@SEPLN, pages 47–52.

Wang, X., Y. Liu, C. Sun, B. Wang, and X. Wang. 2015. Predicting polarities of tweets by composing word embeddings with long short-term memory. In ACL (1), pages 1343–1353.

Wang, Y., M. Huang, X. Zhu, and L. Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In EMNLP, pages 606–615.

Zaremba, W., I. Sutskever, and O. Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.