<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>URJC-Team at EmoEvalEs 2021: BERT for Emotion Classification in Spanish Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Alberto Flores Sanchez</string-name>
          <email>jorgeflores8185@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soto Montalvo Herranz</string-name>
          <email>soto.montalvo@urjc.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raquel Martínez Unanue</string-name>
          <email>raquel@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Nacional de Educacion a Distancia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Rey Juan Carlos</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the URJC-Team in the EmoEvalEs 2021 task of the IberLEF evaluation campaign. The task consists of classifying the emotion expressed in a tweet into one of seven different emotion classes. Our proposal is based on transfer learning using BERT language modeling. We trained three fine-tuned BERT models, finally selecting two of them for the submitted runs, along with a system that combines all the models by means of an ensemble method. We obtained competitive results in the challenge, ranking fifth. Additional work needs to be done to improve the results.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion Classification</kwd>
        <kwd>Tweets</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Material and Methods</title>
      <p>
        <bold>Data.</bold> The EmoEvalEs dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is based on events that took place in April 2019, related to different domains: entertainment, catastrophe, politics, global commemoration, and global strike. For the task, the data were divided into training, development, and test partitions. The distribution of emotions for each partition is shown in Table 1.
      </p>
      <p>To develop our proposal we used the training and development partitions, since the test partition was only provided later by the organizers to evaluate the participating systems and determine the winner of the challenge. We merged the training and development partitions into a larger training set, hereafter referred to as the training data.</p>
      <p>
        We randomly selected 90% of each emotion class to train the model and kept the remaining 10% to test it. Table 2 shows the final distribution of these data. We explore the use of Bidirectional Encoder Representations from Transformers (BERT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a deep learning approach that has proven very successful across several NLP tasks. In particular, we experimented with a pre-trained BERT model, BETO [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as the core for the semantic representation of the input tokens. BETO is a BERT model trained on over 300M lines of a Spanish corpus, and it is similar in size to a BERT-Base model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        BETO has 12 self-attention layers with 16 attention heads each and a hidden size of 1024. In total, the model has 110M parameters. Two versions of BETO were trained: one on cased data and one on uncased data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The proposed system has been implemented in Python 3.7 with HuggingFace's transformers library [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Three models were trained with different data and configuration parameters. First, a basic pre-processing step was carried out, eliminating the special character '#'. The text was then tokenized by mapping words to the subwords found in the 32k-token vocabulary. The Adam optimizer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was used with its standard parameters (β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.999). We applied a linear decay function to decrease the initial learning rate to 0. Finally, the maximum sequence length was set to 128 tokens.
      </p>
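      <p>A minimal sketch of the pre-processing and tokenization step. The BETO checkpoint name is an assumption (the paper only says a pre-trained BETO model was used), and loading it requires the transformers library and a network connection, so that part is kept behind a main guard.</p>

```python
def preprocess(tweet: str) -> str:
    # Basic pre-processing described in the paper: remove the '#' character.
    return tweet.replace("#", "")

MAX_LEN = 128  # maximum sequence length used in the paper

if __name__ == "__main__":
    from transformers import AutoTokenizer

    # Checkpoint name is an assumption on our part; the tokenizer splits
    # words into subwords from BETO's 32k-token vocabulary.
    tokenizer = AutoTokenizer.from_pretrained(
        "dccuchile/bert-base-spanish-wwm-cased"
    )
    encoded = tokenizer(
        preprocess("#EmoEvalEs clasificar emociones en tuits"),
        truncation=True,
        max_length=MAX_LEN,
    )
    print(encoded["input_ids"])
```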
      <p>We fixed some hyper-parameters for the different models:</p>
      <list list-type="simple">
        <list-item>
          <p>Model 1. The cased model was trained with all the data; batch size = 32, learning rate = 2e-5, epochs = 4, and weight decay = 0.1.</p>
        </list-item>
        <list-item>
          <p>Model 2. The uncased model was trained with all the data; batch size = 32, learning rate = 5e-5, epochs = 3, and weight decay = 0.1.</p>
        </list-item>
        <list-item>
          <p>Model 3. The cased model was trained with all the data except 30% of the "others" class, which was removed because it was the majority class; batch size = 32, learning rate = 2e-5, epochs = 4, and weight decay = 0.1.</p>
        </list-item>
      </list>
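      <p>The three configurations above can be captured as plain dictionaries; the key names are ours, and "drop_others_frac" encodes Model 3's removal of 30% of the "others" class.</p>

```python
# Hyper-parameters of the three fine-tuned BETO models, as reported in the
# paper. Key names are illustrative, not the authors' actual code.
MODEL_CONFIGS = {
    "model_1": {"beto_variant": "cased", "batch_size": 32,
                "learning_rate": 2e-5, "epochs": 4,
                "weight_decay": 0.1, "drop_others_frac": 0.0},
    "model_2": {"beto_variant": "uncased", "batch_size": 32,
                "learning_rate": 5e-5, "epochs": 3,
                "weight_decay": 0.1, "drop_others_frac": 0.0},
    # Model 3: 30% of the majority "others" class removed from training data.
    "model_3": {"beto_variant": "cased", "batch_size": 32,
                "learning_rate": 2e-5, "epochs": 4,
                "weight_decay": 0.1, "drop_others_frac": 0.3},
}
```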
      <p>We submitted three different runs. One combines the results of the three previous models by means of a voting system: the final prediction is the class with the most votes and, in the event of a tie, the prediction of Model 1. The other two runs contain the results of Models 2 and 3, respectively.</p>
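      <p>The voting rule with the Model 1 tie-break can be sketched as follows (a minimal illustration, not the submitted implementation):</p>

```python
from collections import Counter

def ensemble_vote(preds):
    """Majority vote over the per-model predictions for one tweet.

    `preds[0]` must be Model 1's prediction: when all three models
    disagree (a tie), the fallback is Model 1, as described in the paper.
    """
    label, count = Counter(preds).most_common(1)[0]
    return label if count > 1 else preds[0]

assert ensemble_vote(["joy", "joy", "anger"]) == "joy"    # clear majority
assert ensemble_vote(["fear", "joy", "anger"]) == "fear"  # tie -> Model 1
```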
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The evaluation measures used by the organizers are the following: accuracy and the weighted-averaged versions of Precision, Recall, and F1. The participating systems are ranked by the weighted-averaged F1 and accuracy measures in a multi-class evaluation scenario.</p>
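      <p>For reference, the weighted-averaged F1 used for ranking averages the per-class F1 scores with weights proportional to each class's support; scikit-learn exposes it as f1_score(..., average="weighted"). A dependency-free sketch:</p>

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support
    (the support-weighted average used to rank the participant systems)."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += f1 * sum(t == c for t in y_true) / total
    return score
```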
      <p>Table 3 shows the results obtained by the three runs in the challenge. The best results correspond to the ensemble method, because the predictions of the multiple models are combined, taking advantage of the performance of each of them. Table 4 contains the results of the three submitted runs for each emotion. The system achieves its best results for the others, sadness, joy, fear, and anger classes. However, for the disgust and surprise classes it performs poorly, because the system confuses these emotions with similar ones, such as disgust with anger, and surprise with joy or others. In addition, the small number of samples available for these classes may be a factor: the system does not have enough data to train the model so that it can differentiate between these classes.</p>
      <p>Comparing Model 2 and Model 3, it can be seen that when the model is trained with less data for the others class, its performance increases for the joy class and decreases for the others class. With the new data distribution, the model is better able to differentiate the joy class from the others class, thus classifying a greater number of tweets correctly. Conversely, when this reduction of the others class is not performed, the performance on the joy class decreases, since several tweets belonging to that class are predicted by the model as others.</p>
      <p>Moreover, although the best overall results were obtained with the voting system, there are classes such as fear and disgust where this is not the case: for the fear class, Model 2 reaches an F1 of 0.7 while the voting system reaches 0.6, and likewise for the disgust class, where Model 2 reaches 0.11 and the voting system 0.09.</p>
      <p>Finally, it is important to note that although the results are slightly worse
in some classes, overall robustness is gained.</p>
      <p>
        On the other hand, a comprehensive comparison and ranking of the results of all the shared task participants can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Table 5 summarizes these results. Our system reached position number four among the fifteen participants. Compared with the best system on the accuracy metric, the difference is 2.475%, which is equivalent to that system correctly classifying 41 more tweets than ours (the evaluation set is composed of 1656 tweets).
      </p>
      <p>This paper describes the system presented by the URJC-Team at the EmoEvalEs 2021 task of the IberLEF evaluation campaign. Several deep-learning models were trained and ensembled to automatically detect and classify the emotions expressed in Spanish tweets about associated events. Although this is a complex task, our system achieves good results for certain emotions and is competitive with the systems of the other participants, with a difference of 2.475% from the best system of the workshop.</p>
      <p>As future work, we intend to carry out further experiments with BETO and other pre-trained language models in order to improve the results in the task, paying particular attention to the classes that were more difficult to detect, disgust and surprise. In addition, it might be interesting to apply some preprocessing to deal with the unbalanced data.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was supported by MCI/AEI/FEDER, UE DOTT-HEALTH Project
(MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Perez, J.: Spanish pre-trained BERT model and evaluation data. In: Proceedings of the Practical ML for Developing Countries Workshop at the Eighth International Conference on Learning Representations (ICLR 2020). Addis Ababa, Ethiopia (Apr 2020)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171-4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6980</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Montes, M., Rosso, P., Gonzalo, J., Aragon, E., Agerri, R., Alvarez-Carmona, M.A., Alvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gomez Adorno, H., Gutierrez, Y., Jimenez-Zafra, S.M., Lima, S., Plaza-de Arco, F.M., Taule, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Plaza-del-Arco, F.M., Jimenez-Zafra, S.M., Montejo-Raez, A., Molina-Gonzalez, M.D., Ureña-Lopez, L.A., Martin-Valdivia, M.T.: Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021. Procesamiento del Lenguaje Natural 67 (2021)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Plaza-del-Arco, F., Strapparava, C., Ureña-Lopez, L.A., Martin-Valdivia, M.T.: EmoEvent: A multilingual emotion corpus based on different events. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 1492-1498. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.186</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. CoRR abs/1910.03771 (2019), http://arxiv.org/abs/1910.03771</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>