ELiRF-UPV at TASS 2019: Transformer Encoders for Twitter Sentiment Analysis in Spanish

José-Ángel González, Lluís-Felip Hurtado, and Ferran Pla
VRAIN: Valencian Research Institute for Artificial Intelligence
Universitat Politècnica de València
{jogonba2,lhurtado,fpla}@dsic.upv.es

Abstract. This paper describes the participation of the ELiRF research group of the Universitat Politècnica de València in the TASS 2019 Workshop, framed within the XXXV edition of the International Congress of the Spanish Society for the Processing of Natural Language. We present the approach used for the Monolingual InterTASS task of the workshop, as well as the results obtained and a discussion of them. Our participation has focused mainly on employing the encoders of the Transformer model, based on self-attention mechanisms, achieving competitive results on the task addressed.

Keywords: Twitter · Sentiment Analysis · Transformer Encoders.

1 Introduction

The Sentiment Analysis workshop at SEPLN (TASS) has been proposing a set of tasks related to Twitter sentiment analysis in order to evaluate the different approaches presented by the participants. In addition, it develops free resources, such as corpora annotated with polarity, topic, political tendency, or aspects, which are very useful for comparing different approaches to the proposed tasks.

In this eighth edition of TASS [3], several tasks are proposed on global sentiment analysis over different Spanish variants. The organizers propose two different tasks: 1) monolingual sentiment analysis and 2) crosslingual sentiment analysis. In the first task, only a specific language variant can be used to train and to evaluate the system; in contrast, in the second task, any combination of the corpora can be used to train the systems. For both tasks, the organizers provide five different corpora of tweets written in Spanish variants from Spain, Costa Rica, Peru, Uruguay, and Mexico.
This article summarizes the participation of the ELiRF-UPV team of the Universitat Politècnica de València only in the first task. Our approach uses state-of-the-art techniques that have provided competitive results in English sentiment analysis and machine translation [8][1].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).

The rest of the article is structured as follows. Section 2 presents a description of the addressed task. In Section 3 we describe the architecture of the proposed system. Section 4 summarizes the experimental evaluation conducted and the results achieved. Finally, some conclusions and possible future work are presented in Section 5.

2 Task description

The organization has defined two subtasks: Task 1, monolingual SA, and Task 2, crosslingual SA. These tasks consist of assigning a global polarity (N, NEU, NONE, or P) to tweets. In Task 1, only one Spanish variant can be used, both for training and for testing the system. In contrast, in Task 2, any combination of Spanish variants can be considered, with the only restriction that the variants used in the training set cannot be used in the test set.

For both subtasks, five different corpora were considered, covering several Spanish variants. First, the InterTASS-ES corpus (Spain), composed of a training partition of 1125 samples, a validation set of 581 samples, and a test set of 1706 samples. InterTASS-CR (Costa Rica) is composed of 777 training samples, 390 for validation, and 1166 for testing. InterTASS-PE (Peru) is formed by 966 training samples, 498 validation samples, and 1464 test samples. InterTASS-UY (Uruguay) contains 943 training samples, 486 validation samples, and 1428 test samples. Finally, InterTASS-MX (Mexico) has 989 training samples, 510 validation samples, and 1500 test samples.
The tweet distribution according to polarity in the InterTASS training sets is shown in Table 1.

Table 1: Distribution of tweets in the training sets of InterTASS for all the Spanish variants.

        ES    CR    PE    UY    MX
N      474   310   228   367   505
NEU    140    91   170   192    79
NONE   157   155   352    94    93
P      354   221   216   290   312
Σ     1125   777   966   943   989

As can be seen in Table 1, the training corpora are unbalanced and biased towards the N and P classes, except in the InterTASS-PE corpus, where the most frequent class is NONE. Moreover, the NEU class is always the least populated, except in the case of Uruguay.

3 System

In this section, we discuss the system architecture proposed to address the first task of TASS 2019, as well as the resources used and the preprocessing applied to the tweets.

3.1 Resources and preprocessing

In order to learn a word embedding model from Spanish tweets, we downloaded 87 million tweets from several Spanish variants. To provide the embedding layer of our system with a rich semantic representation of the Twitter domain, we use 300-dimensional word embeddings extracted from a skip-gram model [9] trained on the 87 million tweets using the Word2Vec framework [4].

3.2 Transformer Encoders

Our system is based on the Transformer model [11]. Initially proposed for machine translation, the Transformer dispenses with convolutions and recurrences to learn long-range relationships. Instead of these mechanisms, it relies on multi-head self-attention, where multiple attentions among the terms of a sequence are computed in parallel to take into account different relationships among them.

Concretely, we use only the encoder part in order to extract vector representations that are useful for sentiment analysis. We denote this encoding part of the Transformer model as the Transformer Encoder.
Figure 1 shows a representation of the proposed architecture for sentiment analysis. The input of the model is a tweet X = {x1, x2, ..., xT : xi ∈ {0, ..., V}}, where T is the maximum length of the tweet and V is the vocabulary size. This tweet is fed to a d-dimensional fixed embedding layer, E, initialized with the weights of our embedding model. Moreover, to take into account positional information, we also experimented with the sine and cosine functions proposed in [11]. After combining the word embeddings with the positional information, dropout [10] is used to drop input words with a certain probability p.

On top of these representations, Nx transformer encoders are applied, which rely on multi-head scaled dot-product attention with h different heads. For this, we used an architecture similar to the one described in [11], including layer normalization [2] and residual connections. Since a vector representation is required to train classifiers on top of these encoders, global average pooling is applied to the output of the last encoder, and the result is used as input to a feed-forward neural network with one hidden layer, whose output layer computes a probability distribution over the four classes of the task, C = {P, N, NEU, NONE}.

We use Adam as the update rule, with β1 = 0.9 and β2 = 0.999, and Noam [11] as the learning rate schedule with 5 warmup steps. Weighted cross-entropy is used as the loss function; for all language variants, only the class distribution of the Spanish variant is considered to weight the cross-entropy.

Fig. 1: The Transformer Encoder system for TASS 2019.

4 Experiments

We fixed some hyper-parameters to carry out the experimentation, concretely: batch size = 32, dk = 64, dff = d, and T = 50.
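The architecture just described can be sketched in PyTorch. This is an approximation under stated assumptions, not the authors' implementation: PyTorch's built-in `TransformerEncoder` stands in for the encoder of [11] (it requires the model dimension to be divisible by h, so the 300-dimensional embeddings are projected to d = h · dk = 512), element-wise dropout approximates the word-level dropout described above, and the hidden size of the classifier head is invented for illustration.

```python
# Sketch of the Fig. 1 architecture, assuming PyTorch's built-in
# TransformerEncoder as a stand-in for the paper's implementation.
# T, h, d_k, Nx, p and the class set follow the paper; the embedding
# projection and the classifier hidden size (128) are assumptions.
import math
import torch
import torch.nn as nn

class TransformerEncoderClassifier(nn.Module):
    def __init__(self, V=30000, T=50, emb_d=300, h=8, d_k=64, Nx=1,
                 n_classes=4, p=0.7):
        super().__init__()
        d = h * d_k  # 512; PyTorch needs d_model divisible by h
        self.embed = nn.Embedding(V, emb_d)      # would be loaded from the
        self.embed.weight.requires_grad = False  # fixed pretrained embeddings
        self.proj = nn.Linear(emb_d, d)  # assumed: 300-d embeddings -> d
        # Sinusoidal positional encodings, as in Vaswani et al. [11].
        pos = torch.arange(T, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2).float()
                        * (-math.log(10000.0) / d))
        pe = torch.zeros(T, d)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Element-wise dropout, approximating the word-level dropout
        # with probability p described in the paper.
        self.dropout = nn.Dropout(p)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=h,
            dim_feedforward=d,   # d_ff = d, as in the paper
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=Nx)
        self.head = nn.Sequential(        # one hidden layer, then an output
            nn.Linear(d, 128), nn.ReLU(), # layer over the 4 classes
            nn.Linear(128, n_classes))    # (hidden size 128 is assumed)

    def forward(self, x):                  # x: (batch, T) token ids
        z = self.dropout(self.proj(self.embed(x)) + self.pe)
        z = self.encoder(z).mean(dim=1)    # global average pooling
        return self.head(z)                # logits over {P, N, NEU, NONE}

model = TransformerEncoderClassifier()
logits = model(torch.randint(0, 30000, (32, 50)))  # batch = 32, T = 50
```

Training would then combine these logits with a class-weighted cross-entropy loss (weights from the ES class distribution) and Adam under the Noam schedule, as described above.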
Other hyper-parameters, such as p or the number of warmup steps, were set following results obtained in preliminary experiments: p = 0.7, warmup steps = 5 epochs, and h = 8.

Moreover, we compare our proposal, based on Transformer Encoders (TE), with other deep learning systems, such as Deep Averaging Networks (DAN) [7] and Attention Long Short-Term Memory networks (Att-LSTM) [6], which are commonly used in related text classification tasks with very competitive results. Concretely, these implementations are the systems proposed by our team in the TASS 2018 edition, which achieved very competitive results [5].

In order to study how certain system mechanisms (positional encodings) and hyper-parameters (Nx) affect the results in terms of macro-F1 (MF1), macro-recall (MR), macro-precision (MP), and accuracy (Acc), we conducted some additional experiments. Concretely, we removed the positional information and used Nx ∈ {1, 2} encoders. All configurations were applied only to the Spanish subtask, and the two best configurations were also used in the remaining subtasks. All these results are shown in Table 2.

As can be seen in Table 2 for systems 1-TE-Pos and 2-TE-Pos on subtask ES, the use of positional information decreases the system performance. This seems to indicate that the positional information, represented by sine and cosine functions added to the word embeddings, is not useful to the classifier. However, Att-LSTM, which takes positional information into account through its internal memory, obtains better results than 1-TE-Pos and 2-TE-Pos on almost all metrics. These results show that the way positional information is considered affects the performance of the systems in this task.

Table 2: Results of the experimentation in the different variants.
                  MP     MR     MF1    Acc
ES  DAN          47.66  48.46  47.94  56.28
    Att-LSTM     50.00  48.14  48.83  58.00
    1-TE-NoPos   52.80  54.38  53.34  60.75
    1-TE-Pos     46.26  46.56  46.25  55.94
    2-TE-NoPos   52.85  53.03  51.47  61.27
    2-TE-Pos     47.31  48.79  47.71  56.11
PE  1-TE-NoPos   49.06  50.43  49.51  54.62
    2-TE-NoPos   46.29  46.00  44.92  46.79
CR  1-TE-NoPos   55.36  56.10  54.56  58.46
    2-TE-NoPos   52.14  52.36  51.71  55.13
UY  1-TE-NoPos   54.71  56.63  54.83  57.20
    2-TE-NoPos   55.82  53.56  54.29  58.64
MX  1-TE-NoPos   53.59  55.03  54.10  63.52
    2-TE-NoPos   52.78  57.34  54.07  60.78

The best results in terms of MR are achieved by the 1-TE-NoPos model. For this reason, the 1-TE-NoPos model also outperforms the 2-TE-NoPos model in terms of MF1, although the 2-TE-NoPos model achieves better results on the MP measure. This behavior is observed in almost all the Spanish variants, except on the MX subtask, where both models obtain similar results in terms of MF1.

Moreover, in the ES variant, several configurations of the TE model outperform the systems proposed by our team in previous editions of TASS (DAN and Att-LSTM) by a margin of ∼5 points of MF1, mainly due to the improvements in MR (∼6 points) and MP (∼3 points).

In Table 3, the results at class level for each variant, obtained with our best model (1-TE-NoPos), are shown. It is interesting to observe the improvements achieved by our system for the NONE class compared to our results for this class in previous editions. Generally, the results for the N class are better than those obtained for the other classes, except in the PE variant; in this case, the NONE class is the easiest to detect, since it is the most frequent class in the corpus. The results for the P class are generally better than those for the NEU and NONE classes, except on the PE variant. As observed in all previous editions of TASS [5], the NEU class obtains the worst results.

The confusion matrix of our best system (1-TE-NoPos) for the ES variant is shown in Table 4.
It is possible to see that the worst classified class (NEU) is usually confused with the N and P classes. This seems to indicate that our model detects the presence of sentiment (positive or negative), but is unable to detect when both polarities neutralize each other.

Table 3: Results at class level, for each subtask, of the best model 1-TE-NoPos (0 refers to the N class, 1 to NEU, 2 to NONE, and 3 to P).

     P0     P1     P2     P3     R0     R1     R2     R3     F10    F11    F12    F13
ES   73.03  30.56  46.34  61.25  73.31  26.51  59.38  58.33  73.17  28.39  52.05  59.76
PE   51.40  27.27  64.88  52.67  51.40  26.79  57.83  65.71  51.40  27.03  61.15  58.47
CR   74.58  27.87  46.09  72.92  61.54  30.91  73.61  58.33  67.43  29.31  56.68  64.81
UY   69.70  34.51  50.00  64.64  47.92  43.33  58.85  76.47  56.79  38.42  54.05  70.06
MX   73.93  30.91  44.07  65.47  75.40  33.33  54.17  57.23  74.66  32.08  48.60  61.07

Table 4: Confusion matrix for the 1-TE-NoPos model on the ES development set.

        N    NEU  NONE   P
N      195   25    18    28
NEU     25   22    13    23
NONE     9    6    38    11
P       38   19    13    98

Finally, the 1-TE-NoPos system was used to label the test set of each variant. The results obtained by this model (MF1, MP, and MR) and the ranking of our system in the competition are shown in Table 5. As can be seen, our system ranked first in the ES subtask and second in all the remaining variants.

Table 5: Results and ranking of our system on the test sets.

     MF1    MP     MR     Rank
ES   50.70  50.50  50.80  1/9
CR   49.60  49.80  49.30  2/9
PE   44.70  45.60  43.90  2/9
UY   51.50  49.70  53.60  2/7
MX   50.10  49.00  51.20  N/A

5 Conclusions

We have proposed a system based on the encoder part of the Transformer architecture in order to extract word representations that are discriminative for performing sentiment analysis on tweets from several Spanish variants.
The results obtained by our system are very promising, it being the first- or second-ranked system on almost all the Spanish variants. This is especially significant considering that these results were obtained without extensive experimentation on the hyper-parameters of the model, which were tuned only on the ES subtask. This opens the door to future improvements by exploring modifications of the architecture and its hyper-parameters.

Acknowledgements

This work has been partially supported by the Spanish MINECO and FEDER funds under the project AMIC (TIN2017-85854-C4-2-R) and by the GiSPRO project (PROMETEU/2018/176). The work of José-Ángel González is financed by the Universitat Politècnica de València under grant PAID-01-17.

References

1. Ambartsoumian, A., Popowich, F.: Self-attention: A better building block for sentiment analysis neural network classifiers. In: WASSA@EMNLP (2018)
2. Ba, L.J., Kiros, R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016), http://arxiv.org/abs/1607.06450
3. Díaz-Galiano, M.C., et al.: Overview of TASS 2019. CEUR-WS, Bilbao, Spain (2019)
4. González, J., Hurtado, L., Pla, F.: ELiRF-UPV en TASS 2017: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2017: Sentiment Analysis in Twitter based on Deep Learning). In: Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN 2017, co-located with the 33rd SEPLN Conference (SEPLN 2017), Murcia, Spain, September 18th, 2017. pp. 29–34 (2017), http://ceur-ws.org/Vol-1896/p2_elirf_tass2017.pdf
5. González, J., Hurtado, L., Pla, F.: ELiRF-UPV en TASS 2018: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2018: Sentiment Analysis in Twitter based on Deep Learning).
In: Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN 2018, co-located with the 34th SEPLN Conference (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 37–44 (2018), http://ceur-ws.org/Vol-2172/p2_elirf_tass2018.pdf
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
7. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1681–1691. Association for Computational Linguistics, Beijing, China (Jul 2015). https://doi.org/10.3115/v1/P15-1162, https://www.aclweb.org/anthology/P15-1162
8. Letarte, G., Paradis, F., Giguère, P., Laviolette, F.: Importance of self-attention for sentiment analysis. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 267–275. Association for Computational Linguistics, Brussels, Belgium (Nov 2018), https://www.aclweb.org/anthology/W18-5429
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. pp. 3111–3119. NIPS'13, Curran Associates Inc., USA (2013), http://dl.acm.org/citation.cfm?id=2999792.2999959
10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research 15, 1929–1958 (2014), http://jmlr.org/papers/v15/srivastava14a.html
11. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017)