=Paper=
{{Paper
|id=Vol-3124/paper23
|storemode=property
|title=Emotion Recognition in Real-World Support Call Center Data for Latvian Language
|pdfUrl=https://ceur-ws.org/Vol-3124/paper23.pdf
|volume=Vol-3124
|authors=Eduards Blumentals,Askars Salimbajevs
|dblpUrl=https://dblp.org/rec/conf/iui/BlumentalsS22
}}
==Emotion Recognition in Real-World Support Call Center Data for Latvian Language==
Eduards Blumentals (1), Askars Salimbajevs (1, 2)

(1) Tilde SIA, Vienibas gatve 75a, Riga, Latvia
(2) Faculty of Computing, University of Latvia, Raina bulvaris 19, Riga, Latvia

eduards.blumentals@tilde.lv (E. Blumentals); askars.salimbajevs@tilde.lv (A. Salimbajevs)

'''Abstract.''' Emotion recognition from speech is a research area that focuses on grasping genuine feelings from audio data. It makes it possible to extract various useful data points from sound that are further used to improve decision-making. This research was conducted to test an emotion recognition toolkit on real-world recordings of phone calls in the Latvian language. This scenario presents at least two significant challenges: a mismatch between real-world data and "artificially" created data, and a lack of training data for the Latvian language. The study mainly focuses on investigating training data requirements for successful emotion recognition.

'''Keywords:''' datasets, neural networks, speech, emotion recognition

===1. Introduction===

Nowadays, emotion recognition from speech is a highly relevant topic. It is used in a wide variety of applications, from businesses to governmental bodies. For example, in call centers it helps to monitor client support quality and to study clients' reactions to certain emotional triggers. Multiple studies have been conducted on emotion recognition from the speech signal. However, most papers investigate machine learning model performance on public, artificially created datasets such as EMODB [1], IEMOCAP [2], TESS [3], and RAVDESS [4]. Although this approach ensures a common benchmark, it ignores the fact that real-world speech data is not as clean or well-defined. A few papers, such as Kostoulas et al. [5], Dhall et al. [6] and Tawari et al. [7], aim to address this issue.

When it comes to emotion recognition from speech in the Latvian language, the literature is even sparser: neither datasets nor well-recognized research on the topic exist for Latvian. Therefore, this study investigates how a deep learning emotion recognition model performs on a Latvian-language dataset comprised of real phone calls. The goal of this research is to see whether a typical deep learning architecture can handle the task of detecting emotions from real speech. The paper evaluates model performance with different training data setups. In addition, it estimates the error of an untrained human to gauge the difficulty of the task for a person.

===2. Data===

The main dataset used in this paper consists of technical support phone calls. Recordings are made with an 8 kHz sampling rate and a single channel. The dataset includes audio recordings of 39 conversations held in Latvian, which were further separated into 6,171 segments. A qualitative analysis of the telephone conversations was performed, annotating potential affective features (affect dimensions, linguistic units, paralinguistic units, etc.) in several layers. A total of 11 synchronous annotation layers were created for each segment.

Each segment was assigned two parameters, valence and activation, where valence measures how positive or negative the emotion is and activation measures its magnitude. These parameters were assigned by trained individuals (pedagogy and psychology students and professors). A detailed description of the dataset creation process and a qualitative analysis of the dataset are presented in [8].

Based on these two dimensions, segments were assigned to nine categories: happy, surprised, angry, disappointed, sad, bored, calm, satisfied and neutral, following the approach proposed by Russell and Barrett [9]. The problem is thereby transformed from regression into multiclass classification.

Because several emotion categories contained only an insignificant number of observations, the work proceeded with the five most represented emotions. Table 1 summarizes the final dataset used in this paper. This dataset is further divided into train (80%) and test (20%) sets.

Table 1: Data Summary
{| class="wikitable"
! Emotion !! Observation count
|-
| Surprised || 2281
|-
| Angry || 2208
|-
| Neutral || 591
|-
| Happy || 359
|-
| Sad || 105
|}

Additionally, several public emotional speech datasets were included in the research. EMODB, IEMOCAP, TESS, and RAVDESS were used to increase the dataset size, as well as to see how our model performs compared to state-of-the-art models.
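The exact rule for collapsing an annotated (valence, activation) pair into the nine categories is not spelled out in the paper; the snippet below is only a hypothetical, circumplex-style assignment (assumed scores in [-1, 1], hypothetical prototype angles and neutral radius) that illustrates how such regression targets can be turned into a single multiclass label.

<pre>
import math

# Hypothetical prototype angles (degrees) on the valence/activation plane,
# loosely following the Russell & Barrett circumplex; NOT taken from the paper.
PROTOTYPES = {
    "happy": 45, "surprised": 80, "angry": 135, "disappointed": 170,
    "sad": 215, "bored": 250, "calm": 290, "satisfied": 325,
}

def to_category(valence: float, activation: float, neutral_radius: float = 0.2) -> str:
    """Collapse a (valence, activation) annotation into one of nine labels."""
    # Low-magnitude affect falls into the neutral class.
    if math.hypot(valence, activation) < neutral_radius:
        return "neutral"
    angle = math.degrees(math.atan2(activation, valence)) % 360
    # Nearest prototype by angular distance, wrapping around 360 degrees.
    return min(PROTOTYPES, key=lambda c: min(abs(angle - PROTOTYPES[c]),
                                             360 - abs(angle - PROTOTYPES[c])))

print(to_category(0.7, 0.6))    # -> happy
print(to_category(-0.6, 0.7))   # -> angry
print(to_category(0.05, -0.1))  # -> neutral
</pre>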
===3. Methodology===

This paper investigates a deep learning approach to emotion recognition. Each input audio segment was converted into a sequence of 39-dimensional mel-frequency cepstral coefficient (MFCC) feature vectors and passed through the model. Due to the relatively small dataset size, a shallow neural network was used to prevent overfitting. The final model was comprised of two LSTM layers, two fully connected layers and a softmax output layer. For additional regularization, a 30% dropout after each layer was added. Figure 1 displays the model architecture.

[Figure 1: Model Architecture]

Categorical cross-entropy was used as the loss function and the Adam optimizer was used for backpropagation. Model training was performed using batches of 64 observations. Each model was trained for 200 epochs with validation after each epoch, and the model weights that yielded the highest validation accuracy were retrieved. Finally, all trained models were compared based on their accuracy on the test set.
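The description above can be implemented in several ways; the following is a minimal sketch assuming librosa for MFCC extraction and Keras for the model. The hidden-layer widths, the MFCC framing and the sequence padding are not given in the paper and are placeholders.

<pre>
import numpy as np
import librosa
import tensorflow as tf

N_CLASSES = 5   # surprised, angry, neutral, happy, sad
N_MFCC = 39     # 39-dimensional MFCC features, as in the paper

def extract_mfcc(path: str, sr: int = 8000) -> np.ndarray:
    """Load an audio segment and return an (n_frames, 39) MFCC sequence."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # time-major for the LSTM

def build_model(hidden: int = 128) -> tf.keras.Model:
    """Two LSTM layers, two dense layers, a softmax output and 30% dropout after each layer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, N_MFCC)),
        tf.keras.layers.LSTM(hidden, return_sequences=True),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.LSTM(hidden),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(hidden // 2, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(model, X_train, y_train, X_val, y_val):
    """Batches of 64, 200 epochs, validation after each epoch, keep the best weights."""
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_accuracy", save_best_only=True)
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     batch_size=64, epochs=200,
                     callbacks=[checkpoint])
</pre>

Here X_train/X_val would be padded MFCC sequences and y_train/y_val one-hot labels; the ModelCheckpoint callback mirrors the procedure of keeping the weights with the highest validation accuracy.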
===4. Results===

====4.1. Validating the Model====

First, model performance on public datasets was evaluated. TESS and RAVDESS were combined into one dataset, separated into train (80%) and test (20%) sets and used to train the model for 200 epochs. The final test accuracy (Figure 2) was 86.02%. In addition, a similar experiment was conducted with the IEMOCAP dataset, which is comprised of conversations between actors. The final test accuracy (Figure 3) was 60.65%, which is slightly lower than the state of the art [10]. From this, one can conclude that the deep learning architecture used in this paper performs reasonably well on public "Wizard of Oz" datasets.

[Figure 2: Model Accuracy on TESS & RAVDESS Test Set]

[Figure 3: Model Accuracy on IEMOCAP Test Set]

====4.2. Model Accuracy on Real-World Latvian Data====

Next, the impact of training data volume on test accuracy was evaluated to see whether the performance of the model can be improved by simply supplying additional data. In this experiment, the model was trained and evaluated on real-world audio recordings from a Latvian support call center. The model was trained on different portions of the train set collected from phone call data, starting at 50% of the volume and moving towards the full train set in steps of 10 percentage points. The results of this experiment are displayed in Figure 4. Seemingly, in the case of this research, increasing the data volume would not yield significantly better results.

[Figure 4: Model Accuracy Depending on Training Data Volume]

====4.3. Using Additional Training Data Sets====

The main goal of the following experiments was to understand whether adding data from public datasets can improve the performance of the model. Because the phone call data is recorded with an 8 kHz sampling rate while the public datasets use 16 kHz, the following four experiments were performed. In the first experiment, the model was trained on the phone call data in its original format. In the second experiment, IEMOCAP, TESS, RAVDESS and EMODB were downsampled to 8 kHz and added to the train set. In the third experiment, the phone call data (both train and test) was upsampled to 16 kHz. In the fourth experiment, IEMOCAP, TESS, RAVDESS and EMODB were added to the upsampled train set. The results of those experiments are summarized in Table 2.

Table 2: Training with Additional Data
{| class="wikitable"
! Training Data !! Sampling Rate !! Accuracy
|-
| Phone calls only || 8 kHz || 53.83%
|-
| All data || 8 kHz || 50.68%
|-
| Phone calls only || 16 kHz || 53.47%
|-
| All data || 16 kHz || 50.41%
|}
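A minimal sketch of the resampling step used to mix the corpora, assuming librosa and soundfile; the file names are illustrative only.

<pre>
import librosa
import soundfile as sf

def resample_file(src: str, dst: str, target_sr: int) -> None:
    """Resample an audio file to target_sr and write it to dst."""
    y, sr = librosa.load(src, sr=None)                           # keep the original rate
    y_rs = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    sf.write(dst, y_rs, target_sr)

# e.g. downsample a 16 kHz RAVDESS clip before adding it to the 8 kHz train set
resample_file("ravdess_clip.wav", "ravdess_clip_8k.wav", target_sr=8000)
</pre>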
====4.4. Obtaining Human Error====

Finally, an untrained human was asked to guess the emotions in the same test set to estimate the human error. The audio segments were presented in random order, so that the person could not analyse the overall semantics and context of the conversations and had to rely solely on the acoustics, similarly to the deep learning model. The resulting test accuracy was 22.72%, which indicates that predicting emotions in random segments of phone calls is not a trivial exercise even for a human.

===5. Conclusions===

This research investigated emotion recognition from real-world phone call data. The output of the research can be summarized in the following points:

* Emotion recognition on real-world data is a more difficult exercise than emotion recognition on artificially created datasets.
* The model architecture proposed in this paper is capable of surpassing untrained human-level performance on the given task.
* Augmenting the training phone call data with artificially created datasets does not seem to improve model performance.
* At this stage, increasing the data volume twofold only marginally improves model performance.
* Upsampling and downsampling the audio data neither improves nor worsens the performance of the models.

For further research it might be worth trying to increase the training dataset substantially (by at least 500-1000%). Given the untrained human-level error, it seems that in order to predict emotions accurately even a human needs more context than a single utterance, preferably whole conversations. Therefore, increasing the input context is an interesting avenue for follow-up work. Furthermore, defining an emotion as a set of dimensions and predicting each dimension separately might improve forecasting accuracy.

===Acknowledgments===

The research leading to these results has received funding from the research project "Competence Centre of Information and Communication Technologies" of EU Structural Funds, contract No. 1.2.1.1/18/A/003 signed between the IT Competence Centre and the Central Finance and Contracting Agency, Research No. 2.9 "Automated multilingual subtitling".

===References===

[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, A database of German emotional speech, in: Ninth European Conference on Speech Communication and Technology, 2005.

[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (2008) 335–359.

[3] M. K. Pichora-Fuller, K. Dupuis, Toronto emotional speech set (TESS), 2020. URL: https://doi.org/10.5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF.

[4] S. R. Livingstone, F. A. Russo, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLoS ONE 13 (2018) e0196391.

[5] T. Kostoulas, T. Ganchev, N. Fakotakis, Study on speaker-independent emotion recognition from speech on real-world data, in: Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, Springer, 2008, pp. 235–242.

[6] A. Dhall, R. Goecke, J. Joshi, M. Wagner, T. Gedeon, Emotion recognition in the wild challenge 2013, in: Proceedings of the 15th ACM International Conference on Multimodal Interaction, 2013, pp. 509–516.

[7] A. Tawari, M. M. Trivedi, Speech emotion analysis in noisy real-world environment, in: 2010 20th International Conference on Pattern Recognition, IEEE, 2010, pp. 4605–4608.

[8] E. Vanags, A qualitative analysis of affect signs in telecommunication dialogues, in: The 79th International Scientific Conference of the UL, section Psychological Well-Being, 2021.

[9] J. A. Russell, L. F. Barrett, Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant, Journal of Personality and Social Psychology 76 (1999) 805.

[10] Y. Wang, J. Zhang, J. Ma, S. Wang, J. Xiao, Contextualized emotion recognition in conversation as sequence tagging, in: Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2020, pp. 186–195.