TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pages 117-123

SCI²S at TASS 2018: Emotion Classification with Recurrent Neural Networks

SCI²S en TASS 2018: Clasificación de Emociones con Redes Neuronales Recurrentes

Nuria Rodríguez Barroso, Eugenio Martínez-Cámara, Francisco Herrera
Instituto de Investigación Andaluz en Ciencia de Datos e Inteligencia Computacional (DaSCI)
Universidad de Granada, España
rbnuria@gmail.com, {emcamara, fherrera}@decsai.ugr.es

Abstract: In this paper, we describe the participation of the SCI²S team in all the Subtasks of Task 4 of TASS 2018. We claim that the use of external emotional knowledge is not required for the development of an emotion classification system. Accordingly, we propose three Deep Learning models that are based on a sequence encoding layer built on the Long Short-Term Memory gated architecture of Recurrent Neural Network. The results reached by the systems are above the average in the two Subtasks, which shows that our claim holds.
Keywords: Deep Learning, Recurrent Neural Networks, LSTM, Emotion Classification

Resumen: En este artículo se presenta la participación del equipo SCI²S en la Tarea 4 de TASS 2018. Partiendo de la asunción de que no es necesario el uso de características emocionales para el desarrollo de un sistema de clasificación de emociones, se proponen tres modelos de redes neuronales basados en el uso de una capa de Red Neuronal Recurrente de tipo Long Short-Term Memory. Los sistemas han alcanzado una posición por encima de la media en las dos Subtareas en las que se ha participado, lo cual ha permitido confirmar nuestra hipótesis.
Palabras clave: Redes Neuronales, Redes Neuronales Recurrentes, LSTM, Clasificación de Emociones

1 Introduction

People usually have a look at the advertisements when they read traditional newspapers. These advertisements generally fit the news on the same, previous or next page, because the match between the news and the ads is carefully decided at editing time, which is before the newspaper is printed. Nowadays, online newspapers are as widely read as traditional ones, hence companies also want to show their brands in online newspapers, and they invest money to buy ads in them. However, one of the differences between traditional and online newspapers is the moment when the correspondence between the news and the advertisements is established, which is at reading time. Thus, the news and the ads likely do not match.

The lack of correspondence between a news item and an advertisement means that the topic of the news is not suitable for the advertisement, or that the emotion it may elicit from the reader is not positive. If the readers are disgusted by the news, they may be revolted by the advertisement too, which is highly detrimental for the advertised brand. The advertising spots in online newspapers are fixed beforehand, and the advertisement that appears in each spot does not depend on the decision of the editor or the journalist, but on an automatic ad broadcasting system of an online marketing company. Consequently, companies are not able to control whether the reputation of their brands may be damaged, which is known by marketing experts as the brand safety issue.¹

¹ https://www.thedrum.com/opinion/2018/07/09/brand-safety-the-importance-quality-media-fake-news-and-staying-vigilant

Task 4 of TASS 2018 (Martínez-Cámara et al., 2018) is focused on this brand safety issue, and it proposes the classification of whether a news item is safe for a brand according to the emotion elicited from the readers when they read its headline.
The organization provided an annotated corpus of news headlines from newspapers written in Spanish from around the world, so the SANSE corpus is a global representation of the written Spanish language. In this paper, we present the systems submitted by the SCI²S team to the two Subtasks of Task 4 of TASS 2018.²

² The details about Task 4 of TASS 2018 are in (Martínez-Cámara et al., 2018).

We claim that emotion classification can be tackled without the use of emotional features or any other kind of handcrafted linguistic feature. We thus propose the generation of dense, high-quality features following a sentence encoding approach, and then the use of a non-linear classifier. We submitted three systems based on the encoding of the input headline with a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). Our submitted systems are above the average in the competition, which shows that our claim holds.

2 Architecture of the models

The organization proposed two Subtasks: the first one is defined in a monolingual context, and the second in a multilingual one. The first Subtask has two levels of evaluation, which differ in the size of the evaluation set. We designed the neural architecture without taking into account the specific characteristics of the Subtasks, because our aim was the evaluation of our claim on the SANSE corpus.

The architecture of the three submitted systems is composed of three modules: (1) language representation or, for the sake of simplicity, the embeddings lookup module; (2) the sequence encoding module, in which the three architectures differ; and (3) non-linear classification. The details of each module are explained in the following subsections.

2.1 Embeddings lookup layer

Following our claim, we defined a feature vector space for training and evaluation that is composed of unsupervised word embedding vectors. A set of word embedding vectors is a representation of the ideal semantic space of words in a real-valued continuous vector space, hence the relationships between the word vectors mirror the linguistic relationships of the words. Word embedding vectors are a dense representation of the meaning of a word, thus each word is linked to a real-valued continuous vector of dimension d_emb.

There are different algorithms in the literature to build word embedding vectors, among which C&W (Collobert et al., 2011), word2vec (Mikolov et al., 2013) and GloVe (Pennington, Socher, and Manning, 2014) stand out. Likewise, several sets of pre-trained word embedding vectors built with these algorithms are freely available. However, those pre-trained sets were generated from documents written in English, thus they cannot be used for representing Spanish words. We used the pre-trained set of word embeddings SBW³ (Cardellino, 2016). SBW was built upon several Spanish corpora, and the most relevant characteristics of its development are: (1) the capitalization of the words was kept unchanged; (2) the word2vec algorithm used was skip-gram; (3) the minimum allowed word frequency was 5; and (4) the dimension of the word vectors is 300 (d_emb = 300).

³ https://crscardellino.github.io/SBWCE/

We tokenized the input headlines with the default tokenizer of NLTK⁴ in order to project them into the feature vector space defined by the word embedding vectors. Consequently, each headline (h) is transformed into a sequence of n words (w_{1:n} = {w_1, ..., w_n}). The size of the input sequence (n) was defined by the maximum length of the inputs in the training data, hence longer sequences were truncated to n.

⁴ https://www.nltk.org/api/nltk.tokenize.html
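As an illustration of this preprocessing step, the following is a minimal sketch, assuming gensim (version 4 or later) with the SBW vectors available locally in word2vec text format; the file name, the helper names and the padding of short sequences are our own additions and not part of the original system.

    import numpy as np
    from gensim.models import KeyedVectors
    from nltk.tokenize import word_tokenize

    # Hypothetical local path to the SBW vectors in word2vec text format.
    sbw = KeyedVectors.load_word2vec_format("SBW-vectors-300-min5.txt")

    # Vocabulary: word -> row in the embedding matrix (index 0 reserved for padding).
    vocab = {word: i + 1 for i, word in enumerate(sbw.index_to_key)}

    # Embedding matrix that will initialise the frozen lookup layer of Section 2.1.
    emb_matrix = np.zeros((len(vocab) + 1, sbw.vector_size))
    for word, idx in vocab.items():
        emb_matrix[idx] = sbw[word]

    def headline_to_indices(headline, n=20):
        """Tokenise a headline with NLTK and map it to a fixed-size index sequence."""
        tokens = word_tokenize(headline, language="spanish")
        indices = [vocab[t] for t in tokens if t in vocab][:n]  # truncate to n tokens
        return indices + [0] * (n - len(indices))               # pad (our addition)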
After the tokenization, the first layer of our architecture is an embeddings lookup layer, which projects the sequence of tokens into the feature vector space. Therefore, the output of the embeddings lookup layer is the matrix WE_{1:n} = (we_1, ..., we_n) ∈ ℝ^{d×n}, where we_i ∈ ℝ^d. The parameters of the embeddings lookup layer are not updated during training.

2.2 Sequence encoding layer

The aim of the sequence encoding layer is the generation of high-level features that condense the semantics of the entire sentence. We used an RNN layer because RNNs can represent sequential input in a fixed-size vector while paying attention to the structural properties of the input (Goldberg, 2017). An RNN is defined as a recursive function R applied to an input sequence. The input of the function R is a state vector s_{i-1} and an element of the input sequence, in our case a word vector (we_i). The output of R is a new state vector (s_i), which is transformed into the output vector y_i by a deterministic function O. Equation 1 summarizes this definition.⁵

    RNN(we_{1:n}, s_0) = y_{1:n}
    y_i = O(s_i)                                                        (1)
    s_i = R(we_i, s_{i-1})
    we_i ∈ ℝ^{d_in}, s_i ∈ ℝ^{f(d_out)}, y_i ∈ ℝ^{d_out}

⁵ The definition of RNN states that the dimension of s_i is a function of the output dimension, but some architectures, such as the LSTM, do not allow that flexibility.

From a linguistic point of view, each vector (y_i) of the output sequence of an RNN condenses the semantic information of the word w_i and of the previous words ({w_1, ..., w_{i-1}}). However, according to the distributional hypothesis of language (Harris, 1954), semantically similar words tend to have similar contextual distributions or, in other words, the meaning of a word is defined by its contexts. An RNN can only encode the previous context of a word when its input is the sequence we_{1:n}. However, the input of the RNN can also be the reverse of that sequence (we_{n:1}). Consequently, we can build a composition of two RNNs, the first one encoding the sequence from the beginning to the end (forward, f) and the second one from the end to the beginning (backward, b), so that both the previous and the following context of a word are encoded. This composition is known as a bidirectional RNN (biRNN), whose definition is given in Equation 2.

    biRNN(we_{1:n}) = [RNN_f(we_{1:n}, s_0^f); RNN_b(we_{n:1}, s_0^b)]   (2)

The three submitted systems are based on the use of a specific gated architecture of RNN, namely the LSTM (Hochreiter and Schmidhuber, 1997), which has reached strong results in several Natural Language Processing tasks (Tang, Qin, and Liu, 2015; Kiperwasser and Goldberg, 2016; Martínez-Cámara et al., 2017). The specific details of the sequence encoding layer of each submitted system are described in what follows.

Single LSTM (SLSTM). The layer is composed of one LSTM, whose input is the sequence we_{1:n}, and its output is a single vector, namely the last output vector (y_n ∈ ℝ^{d_out}). In this case, the semantic information of the entire headline is condensed in the last output vector of the LSTM, which corresponds to the last word.

Single biLSTM (SbLSTM). In order to encode the previous and the following context of the words of the input sequence, the sequence encoding layer of this system is a biLSTM. The output is the concatenation of the last output vectors of the two LSTMs of the biLSTM (y_n = [y_n^f; y_n^b] ∈ ℝ^{2×d_out}).

Sequence LSTM (SeLSTM). The encoding is carried out by an LSTM, but the output is composed of the output vectors of all the words of the sequence, hence the output is not a vector but the sequence y_{1:n}, with y_i ∈ ℝ^{d_out}.

The semantic information returned by SeLSTM is greater than that of the other two layers, because it returns the output vector of each word, therefore the subsequent layers receive more semantic information from the sequence encoding layer.
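The following is a minimal Keras sketch of the embeddings lookup layer and of the three sequence encoding variants with the output dimensions of Table 1. It is our reading of the description above, not the authors' released code (which is available in their repository); emb_matrix comes from the preprocessing sketch of Section 2.1.

    from tensorflow.keras import initializers, layers

    # Frozen embeddings lookup layer, initialised with the SBW matrix of Section 2.1.
    embedding = layers.Embedding(
        input_dim=emb_matrix.shape[0],    # vocabulary size
        output_dim=emb_matrix.shape[1],   # d_emb = 300
        embeddings_initializer=initializers.Constant(emb_matrix),
        trainable=False,                  # parameters are not updated during training
    )

    # SLSTM: a single LSTM; only the last output vector y_n (d_out = 512) is kept.
    slstm_encoder = layers.LSTM(512)

    # SbLSTM: a biLSTM; the last forward and backward outputs are concatenated,
    # y_n = [y_n^f; y_n^b] with dimension 2 x 256.
    sblstm_encoder = layers.Bidirectional(layers.LSTM(256), merge_mode="concat")

    # SeLSTM: a single LSTM returning the whole output sequence y_{1:n} (n x 512).
    selstm_encoder = layers.LSTM(512, return_sequences=True)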
2.3 Non-linear classification layer

Since RNNs, and specifically the LSTM, have the ability to encode the semantic information of the input sequence, the output of the sequence encoding layer is a high-level representation of the semantic information of the input headline.

The sequence representation of the headline is then classified by three fully connected layers with ReLU as activation function, followed by an additional layer activated by the softmax function. The layers activated by ReLU have different numbers of hidden units or output neurons (see Table 1). The SeLSTM layer does not return an output vector but an output sequence y_{1:n} ∈ ℝ^{n×d_out}; thus, after the second fully connected layer, the sequence is flattened into a single vector y ∈ ℝ^{n×d_out}. Since the task is a binary classification task, the number of hidden units of the softmax layer is 2.

In order to avoid overfitting, we add a dropout layer after each fully connected layer with a dropout rate value (d_r). Besides, we applied an L2 regularization function to the output of each fully connected layer with a regularization value (r). Moreover, the training is stopped if the loss value does not improve in 5 epochs.

The training of the network was performed by minimizing the cross-entropy function, and the learning process was optimized with the Adam algorithm (Kingma and Ba, 2015) with its default learning rate. The training followed the minibatch approach with a batch size of 25, and the number of epochs was 40.
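A sketch of the classification head and of the training setup, under several assumptions: the paper does not report the number of hidden units of the ReLU layers, so the values below are placeholders; we read "L2 regularization applied to the output of each fully connected layer" as an activity regularizer; and we assume the early stopping monitors the validation loss. The dropout and L2 values follow the SLSTM column of Table 1, and slstm_encoder and embedding come from the previous sketch.

    from tensorflow.keras import Model, callbacks, layers, regularizers

    def classification_head(encoded, units=(256, 128, 64),
                            drop=(0.35, 0.35, 0.5), l2=(1e-4, 1e-3, 1e-2)):
        """Three ReLU dense layers with dropout and L2 on their outputs,
        followed by a 2-unit softmax layer (hidden sizes are placeholders)."""
        x = encoded
        for i, (u, d, r) in enumerate(zip(units, drop, l2)):
            x = layers.Dense(u, activation="relu",
                             activity_regularizer=regularizers.l2(r))(x)
            x = layers.Dropout(d)(x)
            if i == 1 and len(x.shape) == 3:
                # SeLSTM variant: flatten the sequence after the second dense layer.
                x = layers.Flatten()(x)
        return layers.Dense(2, activation="softmax")(x)

    inputs = layers.Input(shape=(20,), dtype="int32")   # n = 20 token indices
    outputs = classification_head(slstm_encoder(embedding(inputs)))
    model = Model(inputs, outputs)

    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5)
    # model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
    #           batch_size=25, epochs=40, callbacks=[early_stop])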
For the sake of the replicability of the experiments, Table 1 shows the values of the hyperparameters of the network, and the source code of our experiments is publicly available.⁶

⁶ https://github.com/rbnuria/TASS-2018

    Hyperparameter   SLSTM    SbLSTM   SeLSTM
    n                20       20       20
    d_emb            300      300      300
    d_out            512      256×2    512
    d_r1             0.35     0.35     0.35
    d_r2             0.35     0.35     0.5
    d_r3             0.5      0.5      0.5
    L2 r1            0.0001   0.0001   0.0001
    L2 r2            0.001    0.001    0.001
    L2 r3            0.01     0.01     0.01

Table 1: Hyperparameter values of the submitted systems
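For convenience, the hyperparameters of Table 1 can be gathered in a small configuration structure from which the three variants are built; the keys and this layout are ours and simply mirror the table.

    # Hyperparameters of Table 1, one entry per submitted system (our layout).
    CONFIGS = {
        "SLSTM":  {"n": 20, "d_emb": 300, "d_out": 512, "bidirectional": False,
                   "return_sequences": False, "dropout": (0.35, 0.35, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
        "SbLSTM": {"n": 20, "d_emb": 300, "d_out": 256, "bidirectional": True,
                   "return_sequences": False, "dropout": (0.35, 0.35, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
        "SeLSTM": {"n": 20, "d_emb": 300, "d_out": 512, "bidirectional": False,
                   "return_sequences": True, "dropout": (0.35, 0.5, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
    }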
3 Results and Analysis

The organization provided a development set of the SANSE corpus with the aim that the teams would use the same data to tune the classification models. We participated in the two levels of Subtask 1 and in Subtask 2, and we present in Tables 2, 3 and 4 the results reached with the development set (development time) and the official results with the test set of SANSE (evaluation time).

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM¹    73.89     74.74      74.10   74.80    78.40     76.40      77.40   78.60
    SbLSTM³   75.24     75.15      75.19   76.40    77.40     75.20      76.30   77.60
    SeLSTM²   76.08     76.35      76.21   77.20    76.30     76.50      76.40   77.20

Table 2: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 1, level 1. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM¹    73.89     74.74      74.10   74.80    88.80     86.70      87.30   88.80
    SbLSTM²   75.24     75.15      75.19   76.40    86.80     85.70      86.30   87.80
    SeLSTM³   76.08     76.35      76.21   77.20    83.80     87.00      85.30   85.30

Table 3: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 1, level 2. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM³    74.54     72.05      72.67   75.00    68.30     66.10      67.20   70.00
    SbLSTM²   75.60     71.14      71.87   75.90    67.90     67.20      67.60   69.80
    SeLSTM¹   72.47     69.41      69.98   77.20    68.70     67.80      68.30   63.11

Table 4: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 2. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

The main differences among the submitted systems are: (1) the semantic information encoded; and (2) the number of parameters. SLSTM is the model with the least semantic information encoded, because its LSTM is only run in one direction and only the last output vector of the LSTM is processed by the subsequent layers. Although SbLSTM encodes more semantic information than SLSTM, the two have the same number of parameters, because SbLSTM, like SLSTM, only passes the last output vector of the sequence encoding layer to the subsequent layers. In contrast, SeLSTM is the model that uses the most parameters, because it processes the output vectors of the sequence encoding layer of every input word.

We expected that the models with a higher number of parameters and a higher capacity of encoding semantic information would reach higher results in the competition or, in other words, would have a higher generalization capacity. However, the comparison of the results reached on the development and test sets shows an unexpected performance. Regarding the two main differences among the models, we highlight the following two facts:

Generalization capacity. The model that reached the highest results in the two levels of Subtask 1 is SLSTM. The performance of SLSTM stands out in the second level of Subtask 1, because it is the second highest ranked system. Since the test set of the second level is larger than the one of the first level, it demands a higher generalization capacity from the systems, thus the good performance of SLSTM is more relevant. In contrast, SbLSTM and SeLSTM are in the fifth and sixth positions respectively in the second level, and in the sixth and seventh positions in the first level of Subtask 1, which was not expected because they have more parameters and condense more semantic information.

Concerning Subtask 2, the results reached were the expected ones, because SeLSTM, which has more parameters and condenses more semantic information, reached the best results among our three systems. The generalization demand in this task is high too, because the language or the domain of the training and test sets are different: the training set is composed of headlines written in the Spanish language used in America, and the test set is written in the Spanish language used in Spain.

Although the generalization capacity of our systems is high, the different performance in Subtask 1 and Subtask 2 allows us to conclude that, to reach a good generalization capacity, a balance between the number of parameters and the complexity or depth of the neural network is required, as it is also asserted in (Conneau et al., 2017).

Differences among datasets. SLSTM and SbLSTM reached a value of Macro-Recall higher than the value of Macro-Precision on the development set of Subtask 1 in the two levels of evaluation. However, they reached the inverse relation on the test set of both levels of Subtask 1. In contrast, SeLSTM had the same trend in both datasets, thus the performance of SeLSTM shows a higher stability. On the other hand, the three systems had the same behaviour in the development and test sets of Subtask 2, that is to say, the value of Macro-Precision was higher than the value of Macro-Recall at development and evaluation time.

Regarding the differences between the datasets, the performance of the models with more parameters and more semantic information is more stable, which means that the results at development time follow a trend similar to the results at evaluation time, which is a desirable characteristic of a classification system.

Regarding the competition, the rank positions of our systems are shown in Table 5. In Subtask 1, the systems reached a rank position above the average, and SLSTM stands out in Level 2 of Subtask 1. In Subtask 2, the systems are on the average, and their performance is close to that of their competitors. Regarding our claim and the high results reached by the three systems, we conclude that our claim holds, hence we can obtain strong results in the task of emotion classification without the use of emotional features.

                       Rank
    System    Sub. 1, L1   Sub. 1, L2   Sub. 2
    SLSTM     4/13         2/10         6/8
    SbLSTM    7/13         5/10         5/8
    SeLSTM    6/13         6/10         4/8

Table 5: Rank position of the submitted systems in the competition
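The macro-averaged scores reported in Tables 2-4 can be reproduced from a system's predictions with scikit-learn; the following is a small sketch under the assumption that the official macro averages correspond to the standard scikit-learn ones and that gold labels and predictions are binary (the function and variable names are ours).

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def report(y_true, y_pred):
        """Macro-averaged precision, recall, F1 and accuracy, as in Tables 2-4."""
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro")
        acc = accuracy_score(y_true, y_pred)
        return {"M. Prec.": 100 * prec, "M. Recall": 100 * rec,
                "M. F1": 100 * f1, "Acc.": 100 * acc}

    # Example with dummy labels (0 = unsafe, 1 = safe); real labels come from SANSE.
    print(report([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))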
4 Conclusions

We described the three systems submitted by the SCI²S team to all the Subtasks of Task 4 of TASS 2018. Our proposal is based on the claim that emotion classification can be performed without the use of external emotional knowledge or handcrafted features. The three systems are neural networks grounded in a sentence classification approach, namely the use of an LSTM and a biLSTM. The three systems reached a rank position above the average in the two Subtasks of Task 4, thus we conclude that our claim holds.

Our future work will go in the direction defined by the analysis of the results (see Section 3), hence we are going to study the balance between the depth and the generalization capacity of our emotion classification model. Likewise, we will work on the addition of an Attention layer (Bahdanau, Cho, and Bengio, 2015) to the model, with the aim of automatically selecting the most relevant features.

Acknowledgements

This work was partially supported by the Spanish Ministry of Science and Technology under the project TIN2017-89517-P, and by a grant from the Fondo Europeo de Desarrollo Regional (FEDER). Eugenio Martínez-Cámara was supported by the Juan de la Cierva Formación Programme (FJCI-2016-28353) from the Spanish Government.

References

Bahdanau, D., K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, San Diego, 2015.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings, March.

Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537, November.

Conneau, A., H. Schwenk, L. Barrault, and Y. Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107-1116. Association for Computational Linguistics.

Goldberg, Y. 2017. Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers.

Harris, Z. S. 1954. Distributional structure. WORD, 10(2-3):146-162.

Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780, November.

Kingma, D. P. and J. Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, 2015.

Kiperwasser, E. and Y. Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313-327.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Martínez-Cámara, E., V. Shwartz, I. Gurevych, and I. Dagan. 2017. Neural disambiguation of causal lexical markers based on context. In IWCS 2017 - 12th International Conference on Computational Semantics - Short papers.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. Association for Computational Linguistics.

Tang, D., B. Qin, and T. Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422-1432. Association for Computational Linguistics.