TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pages 117-123

SCI²S at TASS 2018: Emotion Classification with Recurrent Neural Networks

SCI²S en TASS 2018: Clasificación de Emociones con Redes Neuronales Recurrentes

Nuria Rodríguez Barroso, Eugenio Martínez-Cámara, Francisco Herrera
Instituto de Investigación Andaluz en Ciencia de Datos e Inteligencia Computacional (DaSCI)
Universidad de Granada, España
rbnuria@gmail.com, {emcamara, fherrera}@decsai.ugr.es

Abstract: In this paper, we describe the participation of the SCI²S team in all the Subtasks of Task 4 of TASS 2018. We claim that the use of external emotional knowledge is not required for the development of an emotion classification system. Accordingly, we propose three Deep Learning models that are based on a sequence encoding layer built on the Long Short-Term Memory gated architecture of Recurrent Neural Network. The results reached by the systems are above the average in the two Subtasks, which shows that our claim holds.
Keywords: Deep Learning, Recurrent Neural Networks, LSTM, Emotion Classification

Resumen: En este artículo se presenta la participación del equipo SCI²S en la Tarea 4 de TASS 2018. Partiendo de la asunción de que no es necesario el uso de características emocionales para el desarrollo de un sistema de clasificación de emociones, se proponen tres modelos de redes neuronales basados en el uso de una capa de Red Neuronal Recurrente de tipo Long Short-Term Memory. Los sistemas han alcanzado una posición por encima de la media en las dos Subtareas en las que se ha participado, lo cual ha permitido confirmar nuestra hipótesis.
Palabras clave: Redes Neuronales, Redes Neuronales Recurrentes, LSTM, Clasificación de Emociones

1 Introduction

People usually have a look at the advertisements when they read traditional newspapers. These advertisements generally fit the news on the same, previous or next page, because the match between the news and the ads is carefully decided at editing time, which is before the newspaper is printed. Nowadays, online newspapers are as widely read as traditional ones, hence companies also want to show their brands in online newspapers, and they invest money to buy ads in them. However, one of the differences between traditional and online newspapers is the moment when the correspondence between the news and the advertisements is established, which is at reading time. Thus, the news and the ads likely do not match.

The lack of correspondence between a news item and an advertisement means that the topic of the news is not suitable for the advertisement, or that the emotion it may elicit from the reader is not positive. If the readers are disgusted by the news, they may be revolted by the advertisement too, which is highly detrimental for the advertised brand. The advertising spots in online newspapers are fixed beforehand, and the advertisement that appears in each spot does not depend on the decision of the editor or the journalist, but on an automatic ad broadcasting system of an online marketing company. Consequently, companies are not able to control whether the reputation of their brands may be damaged, which is known by marketing experts as the brand safety issue.¹

¹ https://www.thedrum.com/opinion/2018/07/09/brand-safety-the-importance-quality-media-fake-news-and-staying-vigilant

Task 4 of TASS 2018 (Martínez-Cámara et al., 2018) is focused on this brand safety issue, and it proposes the classification of whether a news item is safe for a brand according to the emotion elicited from the readers when they read its headline.
The organization provided an annotated corpus of news headlines from newspapers written in Spanish from around the world, so the SANSE corpus is a global representation of the written Spanish language. In this paper, we present the systems submitted by the SCI²S team to the two Subtasks of Task 4 of TASS 2018.²

² The details about Task 4 of TASS 2018 are in (Martínez-Cámara et al., 2018).

We claim that emotion classification can be tackled without the use of emotional features or any other kind of handcrafted linguistic feature. We thus propose the generation of dense, high-quality features following a sentence encoding approach, and then the use of a non-linear classifier. We submitted three systems based on the encoding of the input headline with a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). Our submitted systems are above the average in the competition, which shows that our claim holds.

2 Architecture of the models

The organization proposed two Subtasks: the first one is defined in a monolingual context, and the second in a multilingual one. The first Subtask has two levels of evaluation, which differ in the size of the evaluation set. We designed the neural architecture without taking into account the specific characteristics of the Subtasks, because our aim was the evaluation of our claim on the SANSE corpus.

The architecture of the three submitted systems is composed of three modules: (1) language representation or, for the sake of simplicity, the embeddings lookup module; (2) the sequence encoding module, in which the three architectures differ; and (3) non-linear classification. The details of each module are explained in the following subsections.

2.1 Embeddings lookup layer

Following our claim, we defined a feature vector space for training and evaluation that is composed of unsupervised word embedding vectors. A set of word embedding vectors is a representation of the ideal semantic space of words in a real-valued continuous vector space, hence the relationships between the word vectors mirror the linguistic relationships of the words. Word embedding vectors are a dense representation of the meaning of a word, thus each word is linked to a real-valued continuous vector of dimension d_emb.

There are different algorithms in the literature to build word embedding vectors, among which C&W (Collobert et al., 2011), word2vec (Mikolov et al., 2013) and GloVe (Pennington, Socher, and Manning, 2014) stand out. Likewise, several sets of pre-trained word embedding vectors built with these algorithms are freely available. However, those pre-trained sets were generated from documents written in English, thus they cannot be used for representing Spanish words. We used the pre-trained set of word embeddings SBW³ (Cardellino, 2016). SBW was built upon several Spanish corpora, and the most relevant characteristics of its development are: (1) the capitalization of the words was kept unchanged; (2) the word2vec algorithm used was skip-gram; (3) the minimum allowed word frequency was 5; and (4) the dimension of the word vectors is 300 (d_emb = 300).

³ https://crscardellino.github.io/SBWCE/

We tokenized the input headlines with the default tokenizer of NLTK⁴ in order to project them into the feature vector space defined by the word embedding vectors. Consequently, each headline (h) is transformed into a sequence of n words (w_{1:n} = {w_1, ..., w_n}). The size of the input sequence (n) was defined by the maximum length of the inputs in the training data, hence longer sequences were truncated to n.

⁴ https://www.nltk.org/api/nltk.tokenize.html
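As an illustration of this preprocessing step, the following is a minimal sketch, assuming gensim (version 4 or later) with the SBW vectors available locally in word2vec text format; the file name, the helper names and the padding of short sequences are our own additions and not part of the original system.

    import numpy as np
    from gensim.models import KeyedVectors
    from nltk.tokenize import word_tokenize

    # Hypothetical local path to the SBW vectors in word2vec text format.
    sbw = KeyedVectors.load_word2vec_format("SBW-vectors-300-min5.txt")

    # Vocabulary: word -> row in the embedding matrix (index 0 reserved for padding).
    vocab = {word: i + 1 for i, word in enumerate(sbw.index_to_key)}

    # Embedding matrix that will initialise the frozen lookup layer of Section 2.1.
    emb_matrix = np.zeros((len(vocab) + 1, sbw.vector_size))
    for word, idx in vocab.items():
        emb_matrix[idx] = sbw[word]

    def headline_to_indices(headline, n=20):
        """Tokenise a headline with NLTK and map it to a fixed-size index sequence."""
        tokens = word_tokenize(headline, language="spanish")
        indices = [vocab[t] for t in tokens if t in vocab][:n]  # truncate to n tokens
        return indices + [0] * (n - len(indices))               # pad (our addition)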
After the tokenization, the first layer of our architecture is an embeddings lookup layer, which projects the sequence of tokens into the feature vector space. Therefore, the output of the embeddings lookup layer is the matrix WE_{1:n} = (we_1, ..., we_n) ∈ ℝ^{d×n}, where we_i ∈ ℝ^d. The parameters of the embeddings lookup layer are not updated during training.

2.2 Sequence encoding layer

The aim of the sequence encoding layer is the generation of high-level features that condense the semantics of the entire sentence. We used an RNN layer because RNNs can represent sequential input in a fixed-size vector while paying attention to the structural properties of the input (Goldberg, 2017). An RNN is defined as a recursive function R applied to an input sequence. The input of the function R is a state vector s_{i-1} and an element of the input sequence, in our case a word vector (we_i). The output of R is a new state vector (s_i), which is transformed into the output vector y_i by a deterministic function O. Equation 1 summarizes this definition.⁵

    RNN(we_{1:n}, s_0) = y_{1:n}
    y_i = O(s_i)                                                        (1)
    s_i = R(we_i, s_{i-1})
    we_i ∈ ℝ^{d_in}, s_i ∈ ℝ^{f(d_out)}, y_i ∈ ℝ^{d_out}

⁵ The definition of RNN states that the dimension of s_i is a function of the output dimension, but some architectures, such as the LSTM, do not allow that flexibility.

From a linguistic point of view, each vector (y_i) of the output sequence of an RNN condenses the semantic information of the word w_i and of the previous words ({w_1, ..., w_{i-1}}). However, according to the distributional hypothesis of language (Harris, 1954), semantically similar words tend to have similar contextual distributions or, in other words, the meaning of a word is defined by its contexts. An RNN can only encode the previous context of a word when its input is the sequence we_{1:n}. However, the input of the RNN can also be the reverse of that sequence (we_{n:1}). Consequently, we can build a composition of two RNNs, the first one encoding the sequence from the beginning to the end (forward, f) and the second one from the end to the beginning (backward, b), so that both the previous and the following context of a word are encoded. This composition is known as a bidirectional RNN (biRNN), whose definition is given in Equation 2.

    biRNN(we_{1:n}) = [RNN_f(we_{1:n}, s_0^f); RNN_b(we_{n:1}, s_0^b)]   (2)

The three submitted systems are based on the use of a specific gated architecture of RNN, namely the LSTM (Hochreiter and Schmidhuber, 1997), which has reached strong results in several Natural Language Processing tasks (Tang, Qin, and Liu, 2015; Kiperwasser and Goldberg, 2016; Martínez-Cámara et al., 2017). The specific details of the sequence encoding layer of each submitted system are described in what follows.

Single LSTM (SLSTM). The layer is composed of one LSTM, whose input is the sequence we_{1:n}, and its output is a single vector, namely the last output vector (y_n ∈ ℝ^{d_out}). In this case, the semantic information of the entire headline is condensed in the last output vector of the LSTM, which corresponds to the last word.

Single biLSTM (SbLSTM). In order to encode the previous and the following context of the words of the input sequence, the sequence encoding layer of this system is a biLSTM. The output is the concatenation of the last output vectors of the two LSTMs of the biLSTM (y_n = [y_n^f; y_n^b] ∈ ℝ^{2×d_out}).

Sequence LSTM (SeLSTM). The encoding is carried out by an LSTM, but the output is composed of the output vectors of all the words of the sequence, hence the output is not a vector but the sequence y_{1:n}, with y_i ∈ ℝ^{d_out}.

The semantic information returned by SeLSTM is greater than that of the other two layers, because it returns the output vector of each word, therefore the subsequent layers receive more semantic information from the sequence encoding layer.
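The following is a minimal Keras sketch of the embeddings lookup layer and of the three sequence encoding variants with the output dimensions of Table 1. It is our reading of the description above, not the authors' released code (which is available in their repository); emb_matrix comes from the preprocessing sketch of Section 2.1.

    from tensorflow.keras import initializers, layers

    # Frozen embeddings lookup layer, initialised with the SBW matrix of Section 2.1.
    embedding = layers.Embedding(
        input_dim=emb_matrix.shape[0],    # vocabulary size
        output_dim=emb_matrix.shape[1],   # d_emb = 300
        embeddings_initializer=initializers.Constant(emb_matrix),
        trainable=False,                  # parameters are not updated during training
    )

    # SLSTM: a single LSTM; only the last output vector y_n (d_out = 512) is kept.
    slstm_encoder = layers.LSTM(512)

    # SbLSTM: a biLSTM; the last forward and backward outputs are concatenated,
    # y_n = [y_n^f; y_n^b] with dimension 2 x 256.
    sblstm_encoder = layers.Bidirectional(layers.LSTM(256), merge_mode="concat")

    # SeLSTM: a single LSTM returning the whole output sequence y_{1:n} (n x 512).
    selstm_encoder = layers.LSTM(512, return_sequences=True)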
2.3 Non-linear classification layer

Since RNNs, and specifically the LSTM, have the ability to encode the semantic information of the input sequence, the output of the sequence encoding layer is a high-level representation of the semantic information of the input headline.

The sequence representation of the headline is then classified by three fully connected layers with ReLU as activation function, followed by an additional layer activated by the softmax function. The layers activated by ReLU have different numbers of hidden units or output neurons (see Table 1). The SeLSTM layer does not return an output vector but an output sequence y_{1:n} ∈ ℝ^{n×d_out}; thus, after the second fully connected layer, the sequence is flattened into a single vector y ∈ ℝ^{n×d_out}. Since the task is a binary classification task, the number of hidden units of the softmax layer is 2.

In order to avoid overfitting, we add a dropout layer after each fully connected layer with a dropout rate value (d_r). Besides, we applied an L2 regularization function to the output of each fully connected layer with a regularization value (r). Moreover, the training is stopped if the loss value does not improve in 5 epochs.

The training of the network was performed by minimizing the cross-entropy function, and the learning process was optimized with the Adam algorithm (Kingma and Ba, 2015) with its default learning rate. The training followed the minibatch approach with a batch size of 25, and the number of epochs was 40.
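A sketch of the classification head and of the training setup, under several assumptions: the paper does not report the number of hidden units of the ReLU layers, so the values below are placeholders; we read "L2 regularization applied to the output of each fully connected layer" as an activity regularizer; and we assume the early stopping monitors the validation loss. The dropout and L2 values follow the SLSTM column of Table 1, and slstm_encoder and embedding come from the previous sketch.

    from tensorflow.keras import Model, callbacks, layers, regularizers

    def classification_head(encoded, units=(256, 128, 64),
                            drop=(0.35, 0.35, 0.5), l2=(1e-4, 1e-3, 1e-2)):
        """Three ReLU dense layers with dropout and L2 on their outputs,
        followed by a 2-unit softmax layer (hidden sizes are placeholders)."""
        x = encoded
        for i, (u, d, r) in enumerate(zip(units, drop, l2)):
            x = layers.Dense(u, activation="relu",
                             activity_regularizer=regularizers.l2(r))(x)
            x = layers.Dropout(d)(x)
            if i == 1 and len(x.shape) == 3:
                # SeLSTM variant: flatten the sequence after the second dense layer.
                x = layers.Flatten()(x)
        return layers.Dense(2, activation="softmax")(x)

    inputs = layers.Input(shape=(20,), dtype="int32")   # n = 20 token indices
    outputs = classification_head(slstm_encoder(embedding(inputs)))
    model = Model(inputs, outputs)

    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5)
    # model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
    #           batch_size=25, epochs=40, callbacks=[early_stop])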
For the sake of the replicability of the experiments, Table 1 shows the values of the hyperparameters of the network, and the source code of our experiments is publicly available.⁶

⁶ https://github.com/rbnuria/TASS-2018

    Hyperparameter   SLSTM    SbLSTM   SeLSTM
    n                20       20       20
    d_emb            300      300      300
    d_out            512      256×2    512
    d_r1             0.35     0.35     0.35
    d_r2             0.35     0.35     0.5
    d_r3             0.5      0.5      0.5
    L2 r1            0.0001   0.0001   0.0001
    L2 r2            0.001    0.001    0.001
    L2 r3            0.01     0.01     0.01

Table 1: Hyperparameter values of the submitted systems
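For convenience, the hyperparameters of Table 1 can be gathered in a small configuration structure from which the three variants are built; the keys and this layout are ours and simply mirror the table.

    # Hyperparameters of Table 1, one entry per submitted system (our layout).
    CONFIGS = {
        "SLSTM":  {"n": 20, "d_emb": 300, "d_out": 512, "bidirectional": False,
                   "return_sequences": False, "dropout": (0.35, 0.35, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
        "SbLSTM": {"n": 20, "d_emb": 300, "d_out": 256, "bidirectional": True,
                   "return_sequences": False, "dropout": (0.35, 0.35, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
        "SeLSTM": {"n": 20, "d_emb": 300, "d_out": 512, "bidirectional": False,
                   "return_sequences": True, "dropout": (0.35, 0.5, 0.5),
                   "l2": (1e-4, 1e-3, 1e-2)},
    }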
3 Results and Analysis

The organization provided a development set of the SANSE corpus with the aim that the teams would use the same data to tune the classification models. We participated in the two levels of Subtask 1 and in Subtask 2, and we present in Tables 2, 3 and 4 the results reached with the development set (development time) and the official results with the test set of SANSE (evaluation time).

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM¹    73.89     74.74      74.10   74.80    78.40     76.40      77.40   78.60
    SbLSTM³   75.24     75.15      75.19   76.40    77.40     75.20      76.30   77.60
    SeLSTM²   76.08     76.35      76.21   77.20    76.30     76.50      76.40   77.20

Table 2: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 1, level 1. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM¹    73.89     74.74      74.10   74.80    88.80     86.70      87.30   88.80
    SbLSTM²   75.24     75.15      75.19   76.40    86.80     85.70      86.30   87.80
    SeLSTM³   76.08     76.35      76.21   77.20    83.80     87.00      85.30   85.30

Table 3: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 1, level 2. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

                        Development                            Test (official)
    System    M. Prec.  M. Recall  M. F1   Acc.     M. Prec.  M. Recall  M. F1   Acc.
    SLSTM³    74.54     72.05      72.67   75.00    68.30     66.10      67.20   70.00
    SbLSTM²   75.60     71.14      71.87   75.90    67.90     67.20      67.60   69.80
    SeLSTM¹   72.47     69.41      69.98   77.20    68.70     67.80      68.30   63.11

Table 4: The macro-average and accuracy values in % reached by the three systems on the development and test sets in Subtask 2. The superscript is the official rank (ranked by the M. F1 value) among the three submitted systems in the official results.

The main differences among the submitted systems are: (1) the semantic information encoded; and (2) the number of parameters. SLSTM is the model with the least semantic information encoded, because its LSTM is only run in one direction and only the last output vector of the LSTM is processed by the subsequent layers. Although SbLSTM encodes more semantic information than SLSTM, the two have the same number of parameters, because SbLSTM, like SLSTM, only passes the last output vector of the sequence encoding layer to the subsequent layers. In contrast, SeLSTM is the model that uses the most parameters, because it processes the output vectors of the sequence encoding layer of every input word.

We expected that the models with a higher number of parameters and a higher capacity of encoding semantic information would reach higher results in the competition or, in other words, would have a higher generalization capacity. However, the comparison of the results reached on the development and test sets shows an unexpected performance. Regarding the two main differences among the models, we highlight the following two facts:

Generalization capacity. The model that reached the highest results in the two levels of Subtask 1 is SLSTM. The performance of SLSTM stands out in the second level of Subtask 1, because it is the second highest ranked system. Since the test set of the second level is larger than the one of the first level, it demands a higher generalization capacity from the systems, thus the good performance of SLSTM is more relevant. In contrast, SbLSTM and SeLSTM are in the fifth and sixth positions respectively in the second level, and in the sixth and seventh positions in the first level of Subtask 1, which was not expected because they have more parameters and condense more semantic information.

Concerning Subtask 2, the results reached were the expected ones, because SeLSTM, which has more parameters and condenses more semantic information, reached the best results among our three systems. The generalization demand in this task is high too, because the language or the domain of the training and test sets are different: the training set is composed of headlines written in the Spanish language used in America, and the test set is written in the Spanish language used in Spain.

Although the generalization capacity of our systems is high, the different performance in Subtask 1 and Subtask 2 allows us to conclude that, to reach a good generalization capacity, a balance between the number of parameters and the complexity or depth of the neural network is required, as it is also asserted in (Conneau et al., 2017).

Differences among datasets. SLSTM and SbLSTM reached a value of Macro-Recall higher than the value of Macro-Precision on the development set of Subtask 1 in the two levels of evaluation. However, they reached the inverse relation on the test set of both levels of Subtask 1. In contrast, SeLSTM had the same trend in both datasets, thus the performance of SeLSTM shows a higher stability. On the other hand, the three systems had the same behaviour in the development and test sets of Subtask 2, that is to say, the value of Macro-Precision was higher than the value of Macro-Recall at development and evaluation time.

Regarding the differences between the datasets, the performance of the models with more parameters and more semantic information is more stable, which means that the results at development time follow a trend similar to the results at evaluation time, which is a desirable characteristic of a classification system.

Regarding the competition, the rank positions of our systems are shown in Table 5. In Subtask 1, the systems reached a rank position above the average, and SLSTM stands out in Level 2 of Subtask 1. In Subtask 2, the systems are on the average, and their performance is close to that of their competitors. Regarding our claim and the high results reached by the three systems, we conclude that our claim holds, hence we can obtain strong results in the task of emotion classification without the use of emotional features.

                       Rank
    System    Sub. 1, L1   Sub. 1, L2   Sub. 2
    SLSTM     4/13         2/10         6/8
    SbLSTM    7/13         5/10         5/8
    SeLSTM    6/13         6/10         4/8

Table 5: Rank position of the submitted systems in the competition
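The macro-averaged scores reported in Tables 2-4 can be reproduced from a system's predictions with scikit-learn; the following is a small sketch under the assumption that the official macro averages correspond to the standard scikit-learn ones and that gold labels and predictions are binary (the function and variable names are ours).

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def report(y_true, y_pred):
        """Macro-averaged precision, recall, F1 and accuracy, as in Tables 2-4."""
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro")
        acc = accuracy_score(y_true, y_pred)
        return {"M. Prec.": 100 * prec, "M. Recall": 100 * rec,
                "M. F1": 100 * f1, "Acc.": 100 * acc}

    # Example with dummy labels (0 = unsafe, 1 = safe); real labels come from SANSE.
    print(report([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))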
4 Conclusions

We described the three systems submitted by the SCI²S team to all the Subtasks of Task 4 of TASS 2018. Our proposal is based on the claim that emotion classification can be performed without the use of external emotional knowledge or handcrafted features. The three systems are neural networks grounded in a sentence classification approach, namely the use of an LSTM and a biLSTM. The three systems reached a rank position above the average in the two Subtasks of Task 4, thus we conclude that our claim holds.

Our future work will go in the direction defined by the analysis of the results (see Section 3), hence we are going to study the balance between the depth and the generalization capacity of our emotion classification model. Likewise, we will work on the addition of an Attention layer (Bahdanau, Cho, and Bengio, 2015) to the model, with the aim of automatically selecting the most relevant features.

Acknowledgements

This work was partially supported by the Spanish Ministry of Science and Technology under the project TIN2017-89517-P, and by a grant from the Fondo Europeo de Desarrollo Regional (FEDER). Eugenio Martínez-Cámara was supported by the Juan de la Cierva Formación Programme (FJCI-2016-28353) from the Spanish Government.

References

Bahdanau, D., K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, San Diego, 2015.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings, March.

Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537, November.

Conneau, A., H. Schwenk, L. Barrault, and Y. Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107-1116. Association for Computational Linguistics.

Goldberg, Y. 2017. Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers.

Harris, Z. S. 1954. Distributional structure. WORD, 10(2-3):146-162.

Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780, November.

Kingma, D. P. and J. Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, 2015.

Kiperwasser, E. and Y. Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313-327.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Martínez-Cámara, E., V. Shwartz, I. Gurevych, and I. Dagan. 2017. Neural disambiguation of causal lexical markers based on context. In IWCS 2017 - 12th International Conference on Computational Semantics - Short papers.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. Association for Computational Linguistics.

Tang, D., B. Qin, and T. Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422-1432. Association for Computational Linguistics.