Emotion Detection for Spanish by Combining LASER Embeddings, Topic Information, and Offense Features

Fedor Vitiugin and Giorgio Barnabò
Universitat Pompeu Fabra, Barcelona, Spain
fedor.vitiugin@upf.edu

Abstract. This paper describes the system submitted by the WSSC Team to the EmoEvalEs@IberLEF 2021 emotion detection competition. We propose a novel model for Emotion Detection that combines transformer embeddings with topic information and offense features. The system classifies the emotions of social media texts by leveraging their context representations. Our results show that, for this kind of task, our model outperforms baselines and state-of-the-art text classification methods. On the leader-board, our classification model achieved a macro weighted averaged F1 score of 0.661427 and an overall accuracy of 0.675725, reaching the 9th and 10th place, respectively.

Keywords: Natural language processing · Emotion detection · Deep learning.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Emotion Detection is a branch of sentiment analysis that seeks to extract fine-grained emotions from speech/voice, image, or text data. Detecting emotions in texts has proven to be quite a challenging task, regardless of the quantity of available data [1]. Understanding the emotions expressed by users on social media is particularly hard due to the absence of voice modulation, facial expressions, and other features that serve as clues during context and relation extraction.

Besides that, the need to disambiguate emotion-conveying words in order to verify classified emotions as real emotions still represents a significant obstacle, since texts often contain expressions that could refer to different emotions. For example, a phrase like "I can't stand it" could convey either anger or disgust depending on the context. Nonetheless, state-of-the-art results have recently been obtained with pre-trained transformer-based models. In the past three years, pre-trained language models such as BERT [6] revolutionized the NLP world, making it possible to achieve extraordinary results in almost any known task. These models are particularly effective because they generate word embeddings that capture the semantic and contextual information of texts.

Existing state-of-the-art emotion detection models usually extract only context features from texts and pay less attention to external features such as the kind of event the messages were posted about. In our work, we tried to fill this gap by including additional context information and by also considering the presence of offenses inside these messages. LASER [4] embeddings were used to encode the social media texts and were then combined with topic features and offense features.

The main contribution of this study is an approach, based on a combination of contextualized word embeddings, topic information, and offense features, specifically tailored to improve the emotion detection process. We evaluated our methodology on the EmoEvalEs@IberLEF 2021 [11] competition dataset, showing that our model outperforms the baselines. We also analyzed the most frequent mistakes that our model made.

The remainder of this paper is organized as follows. We first present the related work, then we introduce our approach, and finally we show the experimental results and the error analysis.
2 Related work

There are five classes of approaches to recognizing emotions in texts: keyword-based approaches, rule-based approaches, classical learning-based approaches, hybrid approaches, and deep learning approaches [3]. Recent approaches to emotion detection propose solutions that use deep learning techniques to classify emotions in texts.

2.1 LSTM

Deep learning is a branch of machine learning in which deep neural network architectures learn from experience and understand the world in terms of a hierarchy of concepts, where each concept is defined in terms of its relation to simpler concepts. This approach allows a model to incrementally learn complex concepts by putting together simpler ones [7]. In this context, the long short-term memory (LSTM) architecture has proven particularly effective. LSTM is a special form of recurrent neural network (RNN) capable of handling long-term dependencies, and it overcomes the vanishing or exploding gradient problem common in other types of RNNs.

The main steps when using LSTMs for emotion recognition in texts are:

1. text pre-processing, i.e. tokenization, stopword removal, and lemmatization;
2. encoding texts through an embedding layer and feeding these embeddings to one or more LSTM layers;
3. delivering the outputs to a dense neural network (DNN) with as many units as emotion labels and a sigmoid activation function to perform classification.

2.2 Transformers

The encoder block of transformers, initially designed for machine translation, has become the de-facto standard pre-trained language modeling architecture for solving most NLP tasks, such as text classification, text generation, document summarization, and question answering, just to name a few [2]. Up to now, several state-of-the-art models for detecting text-based emotions already use BERT and its variants.

One way to improve the performance of emotion classification is to extend the BERT model with a linear transformation layer with sigmoid activation. Such a model was evaluated on the EmoBank data and obtained a micro F1 score of 0.688 and 0.695 when fine-tuned on the ISEAR and SemEval datasets, respectively [12]. Another way of using BERT for emotion classification is a two-step approach that first encodes texts into vectors and then classifies them into emotions with a softmax classifier [10]. Yet another way of using BERT is to extract contextualized word embeddings from text data and subsequently use an SVM to perform classification. The authors of this approach [8] fed the model with text passages of an average length of 650 tokens. Since BERT can only process 512 input tokens, the essays were divided into sub-documents. The sub-documents were pre-processed and fed into the BERT base model. Feature vectors for a document were obtained by computing the mean of each of the 12 BERT layers' contextual token representations. The last four layer representations were then concatenated with the corresponding 84 Mairesse features for the essay. The feature vector was then fed into the SVM classifier, producing a prediction. The final prediction was obtained through majority voting.

3 Model

3.1 Pre-processing

During the pre-processing step, we only detected and replaced all emojis with their respective short-codes using the freely available Python library emoji.
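For instance, this step can be reproduced with the emoji package's demojize function; a minimal sketch, noting that the exact package version and options used in our pipeline may differ:

```python
import emoji  # pip install emoji


def replace_emojis(text: str) -> str:
    """Replace every emoji with its textual short-code."""
    return emoji.demojize(text)


print(replace_emojis("¡Qué partido! 😂🔥"))
# ¡Qué partido! :face_with_tears_of_joy::fire:
```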
In the data provided by the competition organizers, hashtags, which are often strongly polarized, had already been replaced with the keyword "HASHTAG" in order to prevent automatic classifiers from relying on them to categorize the emotion associated with a tweet. Likewise, user mentions had been replaced by "@USER".

3.2 LASER Embeddings

To represent the input data, we used embeddings generated by two pre-trained models: DistilBERT and Language-Agnostic SEntence Representations (LASER) [4]. The main difference between LASER and transformer models such as DistilBERT is that it generates sentence-level embeddings instead of word/token-level embeddings.

Given an input sentence, LASER provides a sentence embedding obtained by applying a max-pooling operation over the output of a bidirectional LSTM (BiLSTM) encoder. The BiLSTM output is constructed by concatenating the outputs of two individual LSTMs working in opposite directions (forward and backward). This way, more contextual information is included in the output than with a single LSTM reading the text from left to right. In our experiments, we used LASER to embed all tweet sentences into 1024-dimensional fixed-size vectors.

Fig. 1. Combining the transformer embeddings, topic information, and offense features using a deep MLP.

3.3 Proposed Model

As additional features, we detected offenses and extracted the topics of tweets. Both types of features are provided in the EmoEvent corpus. The LASER embeddings are passed as input to a long short-term memory network that encodes the social media texts. Finally, we combine all features through the architecture originally proposed for the detection of fake news articles [5]. The full architecture is shown in Figure 1.

4 Experiment

4.1 Dataset Description

We use the dataset released for the EmoEvalEs@IberLEF 2021 competition [14], the shared task on "Emotion detection and Evaluation for Spanish". The task consists of classifying the emotion expressed in a tweet into one of the following emotion classes:

– anger (also includes annoyance and rage);
– disgust (also includes disinterest, dislike, and loathing);
– fear (also includes apprehension, anxiety, concern, and terror);
– joy (also includes serenity and ecstasy);
– sadness (also includes pensiveness and grief);
– surprise (also includes distraction and amazement);
– others: the tweet expresses a neutral emotion or no emotion at all.

The dataset is based on events that took place in April 2019 and covers different domains: entertainment, catastrophes, politics, global commemorations, and global strikes. In total, the messages span 8 different topics. For the task, the dataset was split into training, development, and testing partitions. The distribution of the EmoEvalEs@IberLEF 2021 dataset is shown in Table 1.

Table 1. EmoEvalEs@IberLEF 2021 dataset description.

        anger  disgust  fear   joy  sadness  surprise  others  total
train     589      111    65  1227      693       238    2800   5723
dev        85       16     9   181      104        35     414    844
test      168       33    21   354      199        67     814   1657

4.2 Training parameters

The proposed model computes the feature vectors separately and then combines them with the help of an MLP layer. We use categorical cross-entropy as the loss function to optimize our architecture, with a softmax layer that classifies any given social media text into one of the seven emotion classes. The hyper-parameter setting is shown in Table 2. The full code is provided in the project repository https://github.com/vitiugin/ComboLASER.
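For illustration, sentence embeddings of this kind can be obtained with the laserembeddings Python package; this is a hypothetical sketch with made-up example tweets, and the original pipeline may instead use Facebook's reference LASER implementation:

```python
# pip install laserembeddings
# python -m laserembeddings download-models
from laserembeddings import Laser

laser = Laser()

tweets = [
    "No puedo creer lo que acaba de pasar",   # surprise
    "¡Qué alegría volver a verte!",           # joy
]

# Each tweet is mapped to a fixed-size 1024-dimensional vector.
embeddings = laser.embed_sentences(tweets, lang="es")
print(embeddings.shape)  # (2, 1024)
```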
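To make the combination architecture concrete, below is a minimal Keras sketch parameterized as in Table 2. The input dimensionalities of the topic and offense features are assumptions for illustration, and the LSTM encoder over the LASER embeddings mentioned in Section 3.3 is omitted for brevity; this is a sketch of the general design, not the exact implementation from the repository.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers


def build_combo_model(laser_dim=1024, topic_dim=8, offense_dim=2, n_classes=7):
    # Offense branch: two ReLU MLP layers (128; 24), as in Table 2.
    offense_in = layers.Input(shape=(offense_dim,), name="offense")
    h_off = layers.Dense(128, activation="relu")(offense_in)
    h_off = layers.Dense(24, activation="relu")(h_off)

    # LASER branch: sigmoid MLP (256; 128) with dropout 0.5.
    laser_in = layers.Input(shape=(laser_dim,), name="laser")
    h_las = layers.Dense(256, activation="sigmoid")(laser_in)
    h_las = layers.Dropout(0.5)(h_las)
    h_las = layers.Dense(128, activation="sigmoid")(h_las)

    # Topic branch: two ReLU MLP layers (128; 24).
    topic_in = layers.Input(shape=(topic_dim,), name="topic")
    h_top = layers.Dense(128, activation="relu")(topic_in)
    h_top = layers.Dense(24, activation="relu")(h_top)

    # Feature-combination layer: concatenate and classify with softmax.
    merged = layers.concatenate([h_off, h_las, h_top])
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = Model([offense_in, laser_in, topic_in], out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then call model.fit on the three feature matrices with one-hot emotion labels and batch_size=100, matching Table 2.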
Table 2. Values of the hyper-parameters. The upper part of the table describes the parameters used for extracting the individual features; the lower part shows the parameter setting of the feature-combination layer.

               Offense features  LASER embeddings  Topic features
MLP layers                    2                 1               2
MLP neurons              128;24           256;128          128;24
Dropout                       -               0.5               -
Activation                 relu           sigmoid            relu

Combination layer
MLP layers     1
MLP neurons    7
Activation     softmax
Optimizer      Adam
Learning rate  0.001
Batch size     100
Loss           categorical cross-entropy

4.3 Baselines and compared methods

In the current work, we also evaluated schemes built on DistilBERT embeddings. The concept of distillation in neural networks aims at speeding up models: the key idea is to replace massive architectures with countless parameters by a lightweight version of the same architecture that possesses fewer parameters [15]. DistilBERT takes the architecture of the initial version of BERT, reduces the number of layers of the BERT-base model by a factor of 2, and removes the token-type embeddings and the pooler to yield a much smaller and faster version of BERT for general-purpose use. It applies dynamic masking and drops the next-sentence-prediction objective for better inference [9].

According to recent surveys, the SVM is the most popular machine learning scheme for emotion detection from text [3]. Consequently, one of our baselines is a model that concatenates the transformer embeddings (LASER or DistilBERT) with the topic and offense feature vectors and passes them to an SVM classifier. To assess the need for the additional topic and offense feature vectors, we also used the transformer embeddings alone as input to an LSTM model.

4.4 Results

As evaluation measures, we used two multi-class classification metrics: accuracy and the macro weighted averaged F1 score. The full results on the development and test splits are shown in Table 3.

We can observe that the SVM-based models with concatenated feature vectors perform well even compared with the LSTM-based networks trained only on transformer embeddings. Furthermore, the LASER embeddings yield higher performance than the DistilBERT embeddings. The proposed Combo LASER model shows the highest performance, which is perhaps due to the fact that it takes into consideration the sentence-level context encoded in the LASER embeddings. In terms of performance, the proposed solution is 4.5% below the solution that took first place.

Table 3. Comparison with baselines for multi-class classification (5-fold CV). ∗ denotes the proposed model, which achieves the best performance on all measures.

Model              Split  ACC          F1
SVM+DistilBERT     dev    66.89±0.17   65.57±0.14
                   test   66.99±0.12   65.34±0.14
SVM+LASER          dev    67.48±0.16   65.62±0.12
                   test   66.49±0.11   64.76±0.12
LSTM+DistilBERT    dev    67.63±0.52   58.63±0.19
                   test   64.76±1.13   60.84±0.36
LSTM+LASER         dev    67.84±0.84   60.61±0.46
                   test   66.86±0.49   61.82±0.58
Combo DistilBERT   dev    64.00±1.38   61.63±1.17
                   test   62.68±0.49   62.68±0.46
∗Combo LASER       dev    68.10±1.68   66.16±0.67
                   test   67.54±0.78   66.32±0.76

Analysing our model's mistakes, we found that it often (in more than 50% of cases, relative to the size of the class in the test data) misclassified Disgust as Anger and Fear as Sadness. On the other hand, the best results were achieved for Sadness, Surprise, and Others (less than 25% of mistakes). We also found that two pairs of emotions were confused in both directions: Anger–Disgust and Joy–Others. While the similarity of the first pair can be explained by the close nature of these emotions, the second can only be explained by the size of the training and test data: the Joy and Others classes are over-represented in both.
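The class confusions discussed above, together with the two evaluation measures, can be inspected with scikit-learn. A minimal sketch with placeholder labels; the choice of weighted averaging for F1 is an assumption about the task metric:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# The seven EmoEvalEs classes.
LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "others"]

# gold and pred are hypothetical placeholders for the test-split labels.
gold = ["anger", "joy", "others", "fear", "disgust", "joy"]
pred = ["anger", "others", "others", "sadness", "anger", "joy"]

print("Accuracy:   ", accuracy_score(gold, pred))
print("Weighted F1:", f1_score(gold, pred, average="weighted"))

# Rows are gold labels, columns are predictions; off-diagonal cells expose
# confusions such as Disgust -> Anger or Fear -> Sadness.
print(confusion_matrix(gold, pred, labels=LABELS))
```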
5 Conclusion

In this paper, we explored the benefit of adding topic information and offense features to deep neural networks built on transformer embeddings for the task of multi-class emotion detection. We also presented our model based on pre-trained LASER embeddings. Experiments on the dataset released for the EmoEvalEs@IberLEF 2021 competition demonstrate that our Combo LASER model performs better than several baselines and that the additional features improve performance compared with models based only on transformer embeddings [13]. We also presented an analysis of the mistakes that our model made at classification time, which can inform future studies on emotion detection.

References

1. Acheampong, F.A., Nunoo-Mensah, H., Chen, W.: Transformer models for text-based emotion detection: a review of BERT-based approaches. Artificial Intelligence Review pp. 1–41 (2021)
2. Al-Rfou, R., Choe, D., Constant, N., Guo, M., Jones, L.: Character-level language modeling with deeper self-attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 3159–3166 (2019)
3. Alswaidan, N., Menai, M.E.B.: A survey of state-of-the-art approaches for emotion recognition in text. Knowledge and Information Systems pp. 1–51 (2020)
4. Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7, 597–610 (2019)
5. Bhatt, G., Sharma, A., Sharma, S., Nagpal, A., Raman, B., Mittal, A.: On the benefit of combining neural, statistical and external features for fake news identification. arXiv preprint arXiv:1712.03935 (2017)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning, vol. 1. MIT Press, Cambridge (2016)
8. Kazameini, A., Fatehi, S., Mehta, Y., Eetemadi, S., Cambria, E.: Personality trait detection using bagged SVM over BERT word embedding ensembles. arXiv preprint arXiv:2010.01309 (2020)
9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
10. Luo, L., Wang, Y.: EmotionX-HSU: Adopting pre-trained BERT for emotion classification. arXiv preprint arXiv:1907.09669 (2019)
11. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona, M.Á., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-de Arco, F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)
12. Park, S., Kim, J., Jeon, J., Park, H., Oh, A.: Toward dimensional emotion detection from categorical emotion annotations. arXiv preprint arXiv:1911.02499 (2019)
13. Plaza-del-Arco, F.M., Jiménez-Zafra, S.M., Montejo-Ráez, A., Molina-González, M.D., Ureña-López, L.A., Martín-Valdivia, M.T.: Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)
14. Plaza-del-Arco, F., Strapparava, C., Ureña-López, L.A., Martín-Valdivia, M.T.: EmoEvent: A multilingual emotion corpus based on different events. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 1492–1498. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.186
15. Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O., Lin, J.: Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136 (2019)