Emotion Detection for Spanish with Data Augmentation and Transformer-Based Models

Hongxin Luo
School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China
1104792873@qq.com

Abstract. In this paper we describe the participation of the Yeti team in the IberLEF EmoEvalEs task, which builds on the Spanish sentiment analysis tasks of TASS 2020 and was proposed as a separate task at IberLEF 2021. We introduce the methods we used for the emotion detection task and the results obtained. First, we used back-translation data augmentation to address data scarcity and class imbalance. Our method is based on transfer learning, using the BETO language model for sentiment classification in Spanish. The system performed very well, achieving an accuracy of 0.7125; we placed third in the final ranking, only 0.0151 points behind the best result.

Keywords: Natural Language Processing, Transformers, Data Augmentation, Sentiment Analysis

1 Introduction

Sentiment analysis of tweets is a challenging task because a large amount of subjective text is generated every day. These messages are difficult to process because of the language phenomena they may contain [6], and such subjective language can be used to express private states beyond opinions [1]. Researchers have long sought efficient algorithms for sentiment analysis of tweets [15]. In the past few years, most work on sentiment analysis has combined neural network models with word embedding techniques to achieve better results [4][11]. The present work aims to promote the development of Twitter sentiment classification systems for Spanish.

The Iberian Languages Evaluation Forum (IberLEF) is a comparative evaluation campaign for Natural Language Processing systems in Spanish and other Iberian languages [12]. The EmoEvalEs task [13] consists of classifying the emotion expressed in a tweet as the one of Ekman's six basic emotions [5] (Anger, Disgust, Fear, Joy, Sadness, Surprise) that best represents the mental state of its author, or as Others. The task data set is divided into a training set, a development set, and a test set.

This article summarizes our participation in the Emotion Detection and Evaluation task [13]. The results of TASS 2020 [6] show that BERT-based models [3] are very competitive on this kind of task. We considered a number of state-of-the-art neural network models; our final approach adaptively fine-tunes pre-trained Transformer architectures, with ALBERT as the baseline for comparison.

The rest of this paper is organized as follows. Section 2 describes the task and the corpus. Section 3 presents our system in detail. Section 4 introduces the experimental setup. Section 5 outlines the evaluation, and Section 6 draws conclusions.

2 Corpus description

The organizers proposed an emotion detection task, a single-label multi-class classification problem in which each tweet is assigned one of seven emotion labels: Anger, Disgust, Fear, Joy, Sadness, Surprise, and Others.
The data set was collected from events in different domains that took place in April 2019, including entertainment, catastrophes, politics, global commemorations, and global strikes [14]. The corpus is divided into three parts, training, development, and test, with a total of 8223 tweets. Each instance in the training and development sets has five attributes: id, event, tweet, offensive, and emotion. The test set does not contain the emotion label. To prevent classifiers from relying on hashtags when predicting the emotion of a tweet, the organizers replaced every hashtag in the data set with the keyword "HASHTAG" [6].

The challenges we have to face are as follows [14]:

– Lack of context: tweets are short (up to 280 characters).
– Informal language: misspellings, emojis, and onomatopoeias are common.
– Multi-class classification: the data set is labeled with seven different classes.

Table 1 shows the distribution of the labels in the data set. As the table makes clear, the distribution is extremely unbalanced: the largest category is Others and the smallest is Fear, and the former is close to 43 times larger than the latter.

Table 1. Distribution of sentiment labels in the corpus.

        Joy  Sadness  Anger  Surprise  Disgust  Fear  Others  Total
Train  1227      693    589       238      111    65    2800   5723
Dev     181      104     85        35       16     9     414    844
Test      -        -      -         -        -     -       -   1656
Total  1408      797    674       273      127    74    3214   8223

3 Materials and methods

3.1 Pre-processing

Data preprocessing is particularly important for reducing the noise in tweets: high-quality input data can improve the output performance of the model [8]. Before conducting our experiments, we applied the following preprocessing steps to the data. First, we deleted URLs and punctuation marks from the text. To remove unnecessary semantic information, we removed stop words with the NLTK toolkit and converted the tweets to lowercase. Finally, we used the emoji library to convert the emojis in the tweets into text. We also kept the original, unprocessed version of the data set and, in the experiments, compared the results obtained with the different preprocessing variants.

3.2 Data augmentation

Because of the extremely unbalanced distribution of the data set, the model tends to overfit and predict the most frequent category. We decided to use data augmentation to address this problem. A simple and effective method is back-translation [16]: a sentence is translated into another language (e.g., Spanish to English) and then translated back into Spanish. If the newly generated sentence differs from the original, it is kept as an augmented version of the original text. Running back-translation through several different pivot languages generates more variants. This augmentation technique introduces lexical and syntactic variation into the tweets while, most of the time, preserving their original meaning [10]. We used two representative pivot languages (Chinese and English) to expand the training data, since we found during the experiments that using more languages did not significantly improve the results. To obtain the translations, we used the Baidu Translation API, available at https://api.fanyi.baidu.com/. The sketch below illustrates the round-trip procedure.
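For illustration, the following is a minimal sketch of this round trip in Python, assuming the Hugging Face transformers library. In our experiments the translations were produced by the Baidu Translation API; the open MarianMT checkpoints used here are merely self-contained stand-ins, and the helper functions are illustrative, not part of our submitted system.

```python
# Back-translation sketch: Spanish -> English -> Spanish with MarianMT models,
# standing in for the Baidu Translation API used in our experiments.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

es_en = load("Helsinki-NLP/opus-mt-es-en")  # Spanish -> English
en_es = load("Helsinki-NLP/opus-mt-en-es")  # English -> Spanish

def translate(texts, pair):
    tok, model = pair
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, max_new_tokens=128)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(tweets):
    """Round-trip each tweet and keep only paraphrases that differ from the source."""
    pivots = translate(tweets, es_en)
    round_trips = translate(pivots, en_es)
    return [bt for src, bt in zip(tweets, round_trips) if bt != src]

print(back_translate(["No puedo creer lo que pasó ayer en el concierto."]))
```

Only sentences that come back different from their source are kept, so the augmented set contains genuinely new surface forms rather than duplicates.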
3.3 ALBERT

We used the ALBERT model as the baseline of our work, because ALBERT [9] is a recently released model with excellent performance on various Natural Language Processing (NLP) tasks. ALBERT addresses memory consumption and training speed by means of a "Lite BERT" architecture with far fewer parameters than the traditional BERT architecture [9]. Its structure is essentially the same as BERT's, with three specific improvements: factorized embedding parameterization, cross-layer parameter sharing, and the replacement of the Next Sentence Prediction (NSP) objective with a Sentence Order Prediction (SOP) objective. The hyperparameter settings of the model are as follows (the settings were found to perform well across several fine-tuning runs; parameters not mentioned keep their default values):

– albert_model: albert-base-v2
– max_seq_length: 128
– optimizer: AdamW
– warmup_steps: 200
– learning_rate: 3e-5
– train_steps: 800
– train_batch_size: 64

3.4 BETO

Inspired by the results of the TASS 2020 workshop and by the nature of our emotion classification task, we decided to use the BETO model for this challenging Emotion Detection and Evaluation for Spanish task. BETO is a BERT model trained on a large Spanish corpus. It combines the pre-trained model with the downstream task model: the same BETO model is used directly for downstream tasks, it naturally supports text classification, and no modification of the model is needed for classification tasks. Because BETO is trained on a large Spanish corpus, it represents the textual features of Spanish more accurately and addresses the task's dependence on the Spanish language. Two variants are available, BETO-uncased and BETO-cased; we used BETO-cased as our Language Model (LM).

BETO is similar in size to BERT-base. According to the description of Cañete et al. [2], BETO was trained with Whole Word Masking (WWM), uses a vocabulary of about 31k Byte-Pair Encoding (BPE) subwords built with SentencePiece, and was trained for 2M steps [2]. Training introduces dynamic masking, in which ten different masks are applied to the same sentence of the corpus. When WWM masks a specific token and that token corresponds to a subword of a word, all consecutive tokens that make up the same word are masked as well. We used the Adam optimizer [7]. Our aim was to use BETO as the initial LM of a robust approach to this challenge and to achieve excellent performance in the final results. The optimal hyperparameter settings found in our experiments are as follows (parameters not mentioned keep their default values); a fine-tuning sketch is given after the list:

– beto_model: BETO-cased
– max_seq_length: 128
– train_batch_size: 32
– learning_rate: 2e-5
– num_train_epochs: 3.0
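To make this configuration concrete, here is a minimal fine-tuning sketch, assuming the publicly released cased BETO checkpoint on the Hugging Face hub (dccuchile/bert-base-spanish-wwm-cased); the two tweets and their labels are placeholders, and the loop illustrates the setup above rather than reproducing our exact training script.

```python
# Minimal BETO fine-tuning sketch with the hyperparameters listed above.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "others"]

tok = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=len(LABELS)
)

texts = ["¡Qué alegría verte!", "Esto me da mucho miedo."]  # placeholder tweets
labels = torch.tensor([LABELS.index("joy"), LABELS.index("fear")])

batch = tok(texts, max_length=128, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                       # num_train_epochs = 3.0
    out = model(**batch, labels=labels)  # cross-entropy loss over the 7 classes
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```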
4 Experimental Setup

In this section we describe our experimental procedure. To put the results in context, we use ALBERT as the baseline (with the hyperparameter settings given in Section 3.3) and compare it with the best BETO model obtained by fine-tuning (Section 3.4); both are pre-trained deep models. The IberLEF organizers released three corpora for training, development, and testing, and the label of each annotated tweet corresponds to one of the seven classes.

Through an exhaustive search with the BETO model as the main object of study, we determined the model configuration parameters (Section 3.4). Inspired by TASS 2020 Task 1, we observed in the data that each tweet also carries an Offensive label, so, with the model configuration nearly fixed, we tried feeding the Offensive + Tweet content into the model for prediction and compared it with the results of feeding in the Tweet alone. We then handled the unbalanced data in the corpus with back-translation data augmentation, mainly using Chinese and English as pivot languages. We augmented the two smallest categories (Fear and Disgust) to expand their data volume and to keep the model from overfitting towards the most frequent categories. Finally, we also tried converting the emojis in the data into the corresponding text to explore whether this improves model performance. The processed data were fed into both models for comparison. All experiments were run on a machine equipped with an Nvidia Tesla V100 GPU.
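EmoEvalEs ranks systems by accuracy and also reports macro-averaged precision, recall, and F1 (see Table 5). For reference, these measures can be computed as in the following minimal sketch, which assumes scikit-learn rather than the official evaluation script; the label lists are placeholders.

```python
# Sketch of the reported measures: accuracy (Acc) and macro-averaged
# precision (MP), recall (MR), and F1 (MF1).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["joy", "fear", "others", "anger"]    # gold labels (placeholder)
y_pred = ["joy", "others", "others", "anger"]  # system output (placeholder)

acc = accuracy_score(y_true, y_pred)
mp, mr, mf1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Acc={acc:.4f}  MP={mp:.4f}  MR={mr:.4f}  MF1={mf1:.4f}")
```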
5 Results

The results of our model on the development set of the Emotion Detection and Evaluation for Spanish task are shown in Table 2, and our final submission and ranking on the official test set are shown in Table 5. The final result of our system is quite competitive: the submitted run placed third overall with an accuracy of 0.7125, only 0.0151 behind the best result.

Table 2. Results of the BETO model on the development set. Pre-1 denotes preprocessing that deletes URLs, punctuation, and stop words; Pre-2 deletes URLs and punctuation; Pre-3 deletes only URLs. Input indicates whether the input data contain only the Tweet column or the two columns Offensive + Tweet.

Back-translation  Input              Pre-processing  Acc
No                Offensive + Tweet  No              0.7215
No                Offensive + Tweet  Pre-1           0.6919
No                Offensive + Tweet  Pre-2           0.6990
No                Offensive + Tweet  Pre-3           0.6919
No                Tweet              No              0.7156
No                Tweet              Pre-1           0.7097
No                Tweet              Pre-2           0.6954
No                Tweet              Pre-3           0.7014
Yes               Offensive + Tweet  No              0.7322
Yes               Offensive + Tweet  Pre-1           0.6931
Yes               Offensive + Tweet  Pre-2           0.6978
Yes               Offensive + Tweet  Pre-3           0.7298
Yes               Tweet              No              0.7132
Yes               Tweet              Pre-1           0.6966
Yes               Tweet              Pre-2           0.7002
Yes               Tweet              Pre-3           0.7061

Table 3. Comparison of results before and after data augmentation.

Model  Data                               Acc
BETO   raw                                0.7049
BETO   back-translation                   0.7322
BETO   back-translation + emojis-to-text  0.7315

Table 4. Comparison of the results of our models on the validation set.

Model     Data augmentation  Pre-process  Acc
Baseline  Yes                No           0.6866
BERT      Yes                No           0.6646
BETO      Yes                No           0.7322

Table 5. Our final submitted results and ranking.

Team Name  Acc     MP      MR      MF1     Rank
GSI-UPM    0.7276  0.7094  0.7276  0.7170  1
Yeti       0.7125  0.7044  0.7125  0.7054  3
qu         0.4498  0.6188  0.4498  0.4469  15

The results in Table 4 show that the BETO model outperforms the baseline and is far better than the multilingual BERT model, which indicates that a model pre-trained on the specific language yields better results than a multilingual pre-trained model. Table 2 also shows that our data preprocessing does not improve the performance of the model: compared with the unprocessed raw data, the results drop. We believe the drop may be related to the model's pre-training. BERT-style pre-training is designed to learn contextual and semantic relations from raw text that has not undergone any preprocessing, so applying preprocessing when fine-tuning a downstream task may destroy contextual relationships in the text and thus hurt the results. Finally, Table 3 shows that the back-translation data augmentation we used does improve model performance, whereas converting emojis into text yields a result close to, but not above, that of back-translation alone.

6 Conclusions

We have presented a BETO-cased sentiment classification system for the IberLEF 2021 EmoEvalEs task. The method is based on transfer learning with BETO, is applied to sentiment analysis of Spanish tweets, includes an additional data augmentation step, and achieved good results in this Spanish task. We are very satisfied with the outcome of our first participation in the IberLEF workshop. Although the method is relatively simple, by exploring the model's hyperparameters and configuring the model sensibly we achieved very good results in the task. Careful selection of language models and data augmentation techniques plays an important role in sentiment analysis on small data sets. However, sentiment analysis of tweet content still poses great challenges, and our system still has considerable room for improvement. In future work we hope to use more powerful data augmentation techniques to address data scarcity, and we look forward to exploring more advanced methods for sentiment analysis of Spanish tweets.

Acknowledgments

First of all, we thank the organizers for the valuable opportunity they provided. We would also like to thank our teachers for supporting this research, and the reviewers for their patience.

References

1. Algeo, J.: A comprehensive grammar of the English language, by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. London: Longman, 1985. Journal of English Linguistics 20(1), 122–136 (1987)
2. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR 2020 (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Díaz Galiano, M.C., Martínez Cámara, E., García Cumbreras, M.Á., García Vega, M., Villena Román, J.: The democratization of deep learning in TASS 2017 (2018)
5. Ekman, P.: Are there basic emotions? (1992)
6. García-Vega, M., Díaz-Galiano, M.C., García-Cumbreras, M.Á., Plaza-del-Arco, F.M., Montejo-Ráez, A., Jiménez-Zafra, S.M., Martínez Cámara, E., Aguilar, C.A., Sobrevilla Cabezudo, M.A., et al.: Overview of TASS 2020: Introducing emotion detection (2020)
7. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Data preprocessing for supervised learning. International Journal of Computer, Electrical, Automation, Control and Information Engineering 1, 4104–4109 (2007)
9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
10. Luque, F.M.: Atalaya at TASS 2019: Data augmentation and robust embeddings for sentiment analysis. arXiv preprint arXiv:1909.11241 (2019)
11. Martínez Cámara, E., Almeida-Cruz, Y., Díaz Galiano, M.C., Estévez-Velarde, S., García Cumbreras, M.Á., García Vega, M., Gutiérrez, Y., Montejo Ráez, A., Montoyo, A., Muñoz, R., et al.: Overview of TASS 2018: Opinions, health and emotions (2018)
12. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona, M., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-del-Arco, F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)
13. Plaza-del-Arco, F.M., Jiménez-Zafra, S.M., Montejo-Ráez, A., Molina-González, M.D., Ureña-López, L.A., Martín-Valdivia, M.T.: Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)
14. Plaza-del-Arco, F., Strapparava, C., Ureña-López, L.A., Martín-Valdivia, M.T.: EmoEvent: A multilingual emotion corpus based on different events. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 1492–1498. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.186
15. Villena Román, J., Lana Serrano, S., Martínez Cámara, E., González Cristóbal, J.C.: TASS - Workshop on sentiment analysis at SEPLN (2013)
16. Xie, Q., Dai, Z., Hovy, E., Luong, M.T., Le, Q.V.: Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 (2019)