<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF-UPV at TASS 2019: Transformer Encoders for Twitter Sentiment Analysis in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose-Angel Gonzalez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-Felip Hurtado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ferran Pla</string-name>
          <email>fplag@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VRAIN: Valencian Research Institute for Artificial Intelligence Universitat Politecnica de Valencia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>571</fpage>
      <lpage>578</lpage>
      <abstract>
        <p>This paper describes the participation of the ELiRF research group of the Universitat Politecnica de Valencia in the TASS 2019 Workshop, framed within the XXXV edition of the International Congress of the Spanish Society for the Processing of Natural Language. We present the approach used for the Monolingual InterTASS task of the workshop, as well as the results obtained and a discussion of them. Our participation has focused mainly on employing the encoders of the Transformer model, based on self-attention mechanisms, achieving competitive results in the task addressed.</p>
      </abstract>
      <kwd-group>
        <kwd>Twitter</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Transformer Encoders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Sentiment Analysis workshop at SEPLN (TASS) has been proposing a set of
tasks related to Twitter sentiment analysis in order to evaluate different
approaches presented by the participants. In addition, it develops free resources,
such as corpora annotated with polarity, topic, political tendency, or aspects,
which are very useful for comparing different approaches to the proposed
tasks.</p>
      <p>
        In this eighth edition of TASS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], several tasks are proposed for global
sentiment analysis on different Spanish variants. The organizers propose two
different tasks: 1) monolingual sentiment analysis and 2) crosslingual sentiment
analysis. In the first task, only a specific language can be used to train
and to evaluate the system; in contrast, in the second task, any combination of
the corpora can be used to train the systems. For both tasks, the organizers
provide five different corpora of tweets written in Spanish variants from Spain,
Costa Rica, Peru, Uruguay, and Mexico.
      </p>
      <p>
        This article summarizes the participation of the ELiRF-UPV team of the
Universitat Politecnica de Valencia only in the first task. Our approach uses
two state-of-the-art approaches that have provided competitive results in English
sentiment analysis and machine translation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The rest of the article is structured as follows. Section 2 presents a description
of the addressed task. In Section 3, we describe the architecture of the proposed
system. Section 4 summarizes the conducted experimental evaluation and the
achieved results. Finally, some conclusions and possible future work are presented
in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Task description</title>
      <p>The organization has defined two subtasks: Task 1, monolingual SA, and Task
2, crosslingual SA. These tasks consist of assigning a global polarity to tweets
(N, NEU, NONE, or P). In Task 1, only one Spanish variant can be used,
both for training and testing the system. In contrast, in Task 2, any
combination of Spanish variants can be considered, with the only restriction that those
considered in the training set cannot be used in the test set.</p>
      <p>For both subtasks, five different corpora were considered for several Spanish
variants. First, the InterTASS-ES corpus (Spain) is composed of a training
partition of 1125 samples, a validation set of 581 samples, and a test set of 1706
samples. InterTASS-CR (Costa Rica) is composed of 777 training samples, 390 for
validation, and 1166 for testing. InterTASS-PE (Peru) is formed by 966 training
samples, 498 validation samples, and 1464 test samples. InterTASS-UY (Uruguay)
contains 943 training samples, 486 validation samples, and 1428 test samples. Finally,
InterTASS-MX (Mexico) has 989 training samples, 510 validation samples, and 1500 test samples.</p>
      <p>The tweet distribution according to their polarity in the InterTASS corpus
training sets is shown in Table 1.</p>
      <p>As can be seen in Table 1, the training corpora are unbalanced and
biased toward the N and P classes, except in the InterTASS-PE corpus, where
the most frequent class is NONE. Moreover, the NEU class is always the least
populated, except in the case of Uruguay.</p>
    </sec>
    <sec id="sec-3">
      <title>System</title>
      <p>In this section, we discuss the system architecture proposed to address the first
task of TASS 2019, as well as the resources used and the
preprocessing applied to the tweets.</p>
      <sec id="sec-3-1">
        <title>Resources and preprocessing</title>
        <p>
          In order to learn a word embedding model from Spanish tweets, we downloaded
87 million tweets of several Spanish variants. To provide the embedding layer of
our system with a rich semantic representation of the Twitter domain, we use
300-dimensional word embeddings extracted from a skip-gram model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] trained
on the 87 million tweets using the Word2Vec framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
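        <p>To make the skip-gram objective concrete, the (center, context) training pairs it is built on can be extracted as in the following sketch. This is illustrative only: the actual embeddings were trained with the Word2Vec framework, and the function names, whitespace tokenization, and window size here are assumptions.</p>
        <preformat>
```python
from collections import Counter

def build_vocab(tweets, min_count=1):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for tweet in tweets for tok in tweet.split())
    kept = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: i for i, tok in enumerate(kept)}

def skipgram_pairs(tweet, vocab, window=2):
    """(center, context) training pairs used by the skip-gram objective."""
    ids = [vocab[t] for t in tweet.split() if t in vocab]
    pairs = []
    for i, center in enumerate(ids):
        # every neighbor within the window, excluding the center itself
        for j in range(max(0, i - window), min(len(ids), i + window + 1)):
            if j != i:
                pairs.append((center, ids[j]))
    return pairs
```
        </preformat>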
      </sec>
      <sec id="sec-3-2">
        <title>Transformer Encoders</title>
        <p>
          Our system is based on the Transformer [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] model. Initially proposed for
machine translation, the Transformer model dispenses with convolutions and
recurrences to learn long-range relationships. Instead of these kinds of mechanisms, it
relies on multi-head self-attention, where multiple attentions among the terms of
a sequence are computed in parallel to take into account different relationships
among them.
        </p>
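        <p>The multi-head scaled dot-product attention mechanism can be sketched in a few lines of NumPy. This is a minimal sketch: the learned query, key, and value projection matrices of each head are omitted for brevity, which is a simplification of the full mechanism.</p>
        <preformat>
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # row-wise softmax
    return weights @ V                                      # (h, T, d_k)

def multi_head_self_attention(X, h=8):
    """Split the model dimension d into h heads that attend in parallel,
    then concatenate the head outputs back to dimension d."""
    T, d = X.shape
    heads = X.reshape(T, h, d // h).transpose(1, 0, 2)      # (h, T, d_k)
    out = scaled_dot_product_attention(heads, heads, heads)
    return out.transpose(1, 0, 2).reshape(T, d)
```
        </preformat>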
        <p>Concretely, we use only the encoder part in order to extract vector
representations that are useful to perform sentiment analysis. We denote this encoding
part of the Transformer model as Transformer Encoder. Figure 1 shows a
representation of the proposed architecture for sentiment analysis.</p>
        <p>
          The input of the model is a tweet X = {x1, x2, ..., xT : xi ∈ {0, ..., V}}, where
T is the maximum length of the tweet and V is the vocabulary size. This tweet
is sent to a d-dimensional fixed embedding layer, E, initialized with the weights
of our embedding model. Moreover, to take into account positional information,
we also experimented with the sine and cosine functions proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
After the combination of the word embeddings with the positional information,
dropout [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is used to drop input words with a certain probability p. On top
of these representations, Nx transformer encoders are applied, which rely on
multi-head scaled dot-product attention with h different heads. To do this, we
used an architecture similar to the one described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It includes layer
normalization [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and residual connections.
        </p>
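        <p>The sine and cosine positional encodings we experimented with can be computed as in the following self-contained sketch, whose default shapes match our setting of T = 50 and d = 300.</p>
        <preformat>
```python
import numpy as np

def positional_encoding(T, d):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d))"""
    pos = np.arange(T)[:, None]                  # (T, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe
```
        </preformat>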
        <p>Since a vector representation is required to train classifiers on top of these
encoders, a global average pooling mechanism is applied to the output of the
last encoder, and its result is used as input to a feed-forward neural network, with only
one hidden layer, whose output layer computes a probability distribution over
the four classes of the task, C = {P, N, NEU, NONE}.</p>
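        <p>The pooling and classification head can be sketched as follows. The ReLU activation of the hidden layer is an assumption, since the text does not fix the activation function.</p>
        <preformat>
```python
import numpy as np

def classifier_head(H, W_h, b_h, W_o, b_o):
    """Global average pooling over the T encoder outputs, followed by a
    one-hidden-layer feed-forward network with a softmax output over the
    four polarity classes {P, N, NEU, NONE}."""
    v = H.mean(axis=0)                        # (d,)  global average pooling
    hidden = np.maximum(0.0, v @ W_h + b_h)   # ReLU activation (assumed)
    logits = hidden @ W_o + b_o               # (4,)  one score per class
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    return probs / probs.sum()
```
        </preformat>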
        <p>
          We use Adam as the update rule, with β1 = 0.9 and β2 = 0.999, and Noam [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as the
learning rate schedule. The weighted cross entropy is used
as the loss function. Only the class distribution of the Spanish variant is considered
to weight the cross entropy, which is used for all language variants.
We fixed some hyper-parameters to carry out the experimentation, concretely:
batch size = 32, dk = 64, dff = d, and T = 50. Other hyper-parameters, such
as p or the warmup steps, were set following results obtained in preliminary
experiments: p = 0.7, warmup steps = 5 epochs, and h = 8.
        </p>
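        <p>The Noam schedule increases the learning rate linearly during warmup and afterwards decays it with the inverse square root of the step number. A sketch follows; since we specify the warmup in epochs, the step-based default value below is an assumption for illustration.</p>
        <preformat>
```python
def noam_lr(step, d_model=300, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5);
    the peak learning rate is reached exactly at step == warmup."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```
        </preformat>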
        <p>
          Moreover, we compare our proposal, which is based on transformer encoders
(TE), with other deep learning systems, such as Deep Averaging Networks
(DAN) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Attention Long Short-Term Memory Networks [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (Att-LSTM),
that are commonly used in related text classification tasks, obtaining very
competitive results. Concretely, these implementations are the systems proposed by
our team in the TASS 2018 edition, which achieved very competitive results [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>In order to study how some system mechanisms (positional encodings) or
hyper-parameters (Nx) affect the results obtained in terms of macro-F1 (MF1),
macro-recall (MR), macro-precision (MP), and Accuracy (Acc), we conducted
some additional experiments. Concretely, we removed the positional
information and used Nx ∈ {1, 2} encoders. All the configurations were applied
only to the Spanish subtask, and the best two configurations were also used in the
remaining subtasks. All these results are shown in Table 2.</p>
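        <p>The evaluation metrics can be computed as in the following sketch. We assume here the TASS convention of obtaining MF1 as the harmonic mean of MP and MR, rather than averaging per-class F1 scores; this convention is an assumption of the sketch.</p>
        <preformat>
```python
def macro_scores(y_true, y_pred, classes=("P", "N", "NEU", "NONE")):
    """Macro-precision, macro-recall, and macro-F1: per-class precision and
    recall averaged with equal weight, regardless of class frequency."""
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)   # predicted as class c
        true_c = sum(t == c for t in y_true)   # gold labels of class c
        precisions.append(tp / pred_c if pred_c else 0.0)
        recalls.append(tp / true_c if true_c else 0.0)
    mp = sum(precisions) / len(classes)
    mr = sum(recalls) / len(classes)
    mf1 = 2 * mp * mr / (mp + mr) if (mp + mr) else 0.0
    return mp, mr, mf1
```
        </preformat>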
        <p>As can be seen in Table 2 for systems 1-TE-Pos and 2-TE-Pos on subtask
ES, the use of positional information decreases the system performance. This
seems to indicate that the positional information, represented by sine and cosine
functions added to the word embeddings, is useless to the classifier. However,
Att-LSTM, which takes into account the positional
information through its internal memory, obtains better results than 1-TE-Pos and
2-TE-Pos in almost all the metrics. These results show that the way the
positional information is considered affects the performance of the systems in this
task.</p>
        <p>The best results in terms of MR are achieved by the 1-TE-NoPos model.
Due to this fact, the 1-TE-NoPos model also outperforms the 2-TE-NoPos model
in terms of MF1, although the 2-TE-NoPos model achieves better results in
the MP measure. This behavior is observed in almost all the Spanish variants,
except on the MX subtask, where both models obtain similar results in terms of
MF1.</p>
        <p>Moreover, in the ES variant, several configurations of the TE model
outperform the systems proposed by our team in previous editions of TASS (DAN and
Att-LSTM) by a margin of 5 points of MF1, mainly due to the improvements
in terms of MR (6 points) and MP (3 points).</p>
        <p>
          Table 3 shows the results at the class level for each variant, obtained with our best
model (1-TE-NoPos). It is interesting to observe the improvements
achieved by our system for the NONE class compared to our results in previous
editions for this class. Generally, the results for the N class are better than those
obtained for the other classes, except in the PE variant. In this case, the NONE
class is the easiest to detect, since it is the most frequent class in the corpus.
The results for the P class are generally better than those for the NEU and
NONE classes, except in the PE variant. As observed in all the previous editions
of TASS [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the NEU class obtains the worst results.
        </p>
        <p>The confusion matrix of our best system (1-TE-NoPos) for the ES variant
is shown in Table 4. It can be seen that the worst classified class (NEU)
is usually confused with the N and P classes. This seems to indicate that our
model detects the presence of sentiment (positive or negative), but is unable to
detect when both polarities neutralize each other.</p>
        <p>Finally, the 1-TE-NoPos system was used to label the test set of each
variant. The results obtained by this model (MF1, MP, and MR) and the
ranking of our system in the competition are shown in Table 5. As can be
seen, our system ranked first in the ES subtask and second in all the
remaining variants.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have proposed a system based on the encoder part of the Transformer
architecture in order to extract useful word representations that are discriminative for
sentiment analysis on tweets from several Spanish variants. The results
obtained by our system are very promising, being the first- or second-ranked
system on almost all the Spanish variants. This is especially significant, considering
that these results were obtained without extensive experimentation on
the hyper-parameters of the model, and that these hyper-parameters were only tuned
on the ES subtask. This opens the door to future improvements by exploring
modifications to the architecture and its hyper-parameters.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the Spanish MINECO and FEDER
funds under project AMIC (TIN2017-85854-C4-2-R) and by the GiSPRO project
(PROMETEU/2018/176). The work of Jose-Angel Gonzalez is financed by the
Universitat Politecnica de Valencia under grant PAID-01-17.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Ambartsoumian, A., Popowich, F.: Self-attention: A better building block for sentiment analysis neural network classifiers. In: WASSA@EMNLP (2018)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Ba, L.J., Kiros, R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016), http://arxiv.org/abs/1607.06450</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Díaz-Galiano, M.C., et al.: Overview of TASS 2019. CEUR-WS, Bilbao, Spain (2019)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Gonzalez, J., Hurtado, L., Pla, F.: ELiRF-UPV en TASS 2017: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2017: Sentiment Analysis in Twitter based on Deep Learning). In: Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN 2017, co-located with the 33rd SEPLN Conference (SEPLN 2017), Murcia, Spain, September 18th, 2017. pp. 29-34 (2017), http://ceur-ws.org/Vol-1896/p2_elirf_tass2017.pdf</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Gonzalez, J., Hurtado, L., Pla, F.: ELiRF-UPV en TASS 2018: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2018: Sentiment Analysis in Twitter based on Deep Learning). In: Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN 2018, co-located with the 34th SEPLN Conference (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 37-44 (2018), http://ceur-ws.org/Vol-2172/p2_elirf_tass2018.pdf</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1681-1691. Association for Computational Linguistics, Beijing, China (Jul 2015). https://doi.org/10.3115/v1/P15-1162, https://www.aclweb.org/anthology/P15-1162</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Letarte, G., Paradis, F., Giguère, P., Laviolette, F.: Importance of self-attention for sentiment analysis. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 267-275. Association for Computational Linguistics, Brussels, Belgium (Nov 2018), https://www.aclweb.org/anthology/W18-5429</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. pp. 3111-3119. NIPS'13, Curran Associates Inc., USA (2013), http://dl.acm.org/citation.cfm?id=2999792.2999959</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929-1958 (2014), http://jmlr.org/papers/v15/srivastava14a.html</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998-6008. Curran Associates, Inc. (2017)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>