Emotion Recognition in Real-World Support Call Center
Data for Latvian Language
Eduards Blumentals1 , Askars Salimbajevs1,2
1 Tilde SIA, Vienibas gatve 75a, Riga, Latvia
2 Faculty of Computing, University of Latvia, Raina bulvaris 19, Riga, Latvia

eduards.blumentals@tilde.lv (E. Blumentals); askars.salimbajevs@tilde.lv (A. Salimbajevs)


Abstract
Emotion recognition from speech is a research area that focuses on grasping genuine feelings from audio data. It makes it possible to extract various useful data points from sound that can be further used to improve decision-making. This research was conducted to test an emotion recognition toolkit on real-world recordings of phone calls in the Latvian language. This scenario presents at least two significant challenges: the mismatch between real-world data and "artificially" created data, and the lack of training data for the Latvian language. The study mainly focuses on investigating the training data requirements for successful emotion recognition.

Keywords
datasets, neural networks, speech, emotion recognition



1. Introduction

Nowadays, emotion recognition from speech is a highly relevant topic. It is used in a wide variety of applications, from businesses to governmental bodies. For example, in call centers it helps to monitor client support quality and to study clients' reactions to certain emotional triggers. Multiple studies have been conducted on emotion recognition from the speech signal. However, most papers investigate machine learning model performance on public, artificially created datasets such as EMODB [1], IEMOCAP [2], TESS [3], and RAVDESS [4]. Although this approach ensures a common benchmark, it ignores the fact that real-world speech data is not as clean or well-defined. A few papers, such as Kostoulas et al. [5], Dhall et al. [6] and Tawari et al. [7], aim to address this issue.
   When it comes to emotion recognition from speech in the Latvian language, the literature is even more sparse. Neither datasets nor well-recognized research on the topic exist for Latvian. Therefore, this study investigates how a deep learning emotion recognition model performs on a Latvian-language dataset comprised of real phone calls. The goal of this research is to see whether a typical deep learning architecture can handle the task of detecting emotions from real speech. The paper evaluates model performance with different training data setups. In addition, it evaluates human error to estimate the difficulty of the exercise for an untrained person.


2. Data

The main dataset used in this paper consists of technical support phone calls. Recordings are done with an 8 kHz sampling rate and a single channel. The dataset includes audio recordings of 39 conversations held in Latvian, which were further separated into 6,171 segments. A qualitative analysis of the telephone conversations was performed, annotating potential affective features in several layers: affect dimensions, linguistic units, paralinguistic units, etc. A total of 11 synchronous annotation layers were created for each segment.
   Each segment had two parameters, valence and activation, where valence measures how positive or negative the emotion is, and activation measures its magnitude. These parameters were assigned by trained individuals (pedagogy and psychology students and professors). A detailed description of the dataset creation process and the qualitative analysis of the dataset is presented in [8].
   Based on these dimensions, segments were assigned to nine categories: happy, surprised, angry, disappointed, sad, bored, calm, satisfied and neutral, following the approach proposed by Russell and Barrett [9]. Therefore, the problem is transformed from a regression into a multiclass classification task, as illustrated by the sketch below.
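The paper does not give the exact assignment rule, but a quadrant-style mapping in the spirit of the Russell and Barrett circumplex could look roughly like the following Python sketch; the [-1, 1] value range, the neutral band and the reduced label set are illustrative assumptions, not the annotation scheme actually used. A full mapping would subdivide each quadrant further to recover all nine categories (for example, separating angry from disappointed by activation strength).

# Hypothetical sketch: map (valence, activation) annotations to coarse emotion
# categories in the spirit of Russell and Barrett [9]. The [-1, 1] scale, the
# 0.25 neutral band and the reduced label set are assumptions for illustration.

def to_category(valence: float, activation: float, neutral_band: float = 0.25) -> str:
    """Assign a coarse emotion label from valence and activation in [-1, 1]."""
    if abs(valence) < neutral_band and abs(activation) < neutral_band:
        return "neutral"
    if valence >= 0:
        return "happy" if activation >= 0 else "calm"   # positive valence
    return "angry" if activation >= 0 else "sad"        # negative valence

print(to_category(0.8, 0.6))    # -> "happy"
print(to_category(-0.7, 0.9))   # -> "angry"
print(to_category(0.1, -0.1))   # -> "neutral"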
   An insignificant number of observations in several emotion categories necessitated proceeding with only the five most represented emotions. Table 1 summarizes the final dataset used in this paper. This dataset was further divided into train (80%) and test (20%) sets.
   Additionally, several public emotional speech datasets were included in the research. EMODB, IEMOCAP, TESS, and RAVDESS were used to increase the dataset size, as well as to see how our model performs compared to state-of-the-art models.
Table 1
Data Summary

             Emotion     Observation count
             Surprised                   2281
             Angry                       2208
             Neutral                      591
             Happy                        359
             Sad                          105



3. Methodology
This paper investigates a deep learning approach to emo-
tion recognition. Each input audio segment was converted into a sequence of
39-dimensional mel-frequency cepstral coefficient (MFCC)
feature vectors and passed through the model.
Due to the relatively small dataset size, a shallow neural
network was used to prevent overfitting. The final model
was comprised of two LSTM layers, two fully connected
layers and a softmax output layer. For additional regular-
ization, a 30% dropout after each layer was added. Figure
1 displays the model architecture.
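As a rough illustration of the pipeline described above, the following Python sketch extracts 39-dimensional MFCC features and builds the two-LSTM model. It assumes a Keras implementation; the 39 dimensions are assumed to be 13 MFCCs plus delta and delta-delta features, and the hidden-layer sizes are placeholders, since neither detail is stated in the paper.

import librosa
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5  # surprised, angry, neutral, happy, sad (Table 1)


def extract_mfcc(path: str, sr: int = 8000) -> np.ndarray:
    """Load an audio segment and return a (frames, 39) MFCC feature matrix.
    39 dims = 13 MFCCs + deltas + delta-deltas (assumed configuration)."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T  # (frames, 39)


def build_model(hidden: int = 128) -> tf.keras.Model:
    """Two LSTM layers, two fully connected layers and a softmax output,
    with 30% dropout after each layer, as in Section 3 (sizes assumed)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, 39)),
        tf.keras.layers.LSTM(hidden, return_sequences=True),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.LSTM(hidden),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(hidden // 2, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])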
Figure 1: Model Architecture

   Categorical cross-entropy was used as the loss function and the Adam optimizer was used for backpropagation. The model training was performed using batches of 64 observations. Each model was trained for 200 epochs with validation after each epoch, and the model weights that yielded the highest validation accuracy were retained. Finally, all trained models were compared based on their accuracy on the test set.
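Assuming the build_model sketch above and pre-padded feature tensors X_train/y_train (with corresponding validation and test splits, all hypothetical names), the training procedure just described could be set up roughly as follows.

# Sketch of the training setup in Section 3: categorical cross-entropy, Adam,
# batches of 64, 200 epochs, keeping the weights with the best validation
# accuracy. X_train, y_train, X_val, y_val, X_test, y_test are assumed to be
# padded MFCC sequences and one-hot labels prepared elsewhere.
import tensorflow as tf

model = build_model()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best.weights.h5", monitor="val_accuracy",
    save_best_only=True, save_weights_only=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=64, epochs=200,
          callbacks=[checkpoint])

model.load_weights("best.weights.h5")   # retrieve the best-scoring weights
print(model.evaluate(X_test, y_test))   # accuracy on the held-out test set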


4. Results
4.1. Validating the Model
First, model performance on public datasets was evaluated. TESS and RAVDESS were combined into one dataset, separated into train (80%) and test (20%) sets and used to train the model for 200 epochs. The final test accuracy (Figure 2) was 86.02%. In addition, a similar experiment was conducted with the IEMOCAP dataset, which consists of conversations between actors. The final test accuracy (Figure 3) was 60.65%, which is slightly lower than the state of the art [10]. From this, one can conclude that the deep learning architecture used in this paper performs reasonably well on public "Wizard of Oz" datasets.

Figure 2: Model Accuracy on TESS & RAVDESS Test Set

Figure 3: Model Accuracy on IEMOCAP Test Set


4.2. Model Accuracy on Real-World Latvian Data

Next, the impact of changes in training data volume on test accuracy was evaluated, to see whether the performance of the models can be improved by simply supplying additional data. In this experiment, the model was trained and evaluated on real-world audio recordings from a Latvian support call center. The model was trained on different portions of the train set collected from phone call data, starting at 50% of the volume and moving towards the full train set in steps of 10 percentage points. The results of this experiment are displayed in Figure 4. Seemingly, in the case of this research, increasing the data volume would not yield significantly better results.

Figure 4: Model Accuracy Depending on Training Data Volume
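A simple way to reproduce this kind of data-volume experiment is to retrain the same architecture on growing subsets of the phone-call train set. The sketch below reuses the build_model helper and the padded tensors assumed in the earlier sketches; the random subsampling strategy is also an assumption, since the paper does not state how the subsets were drawn.

# Sketch of the training-data-volume experiment in Section 4.2: train the same
# architecture on 50%, 60%, ..., 100% of the phone-call train set and record
# test accuracy for each subset size.
import numpy as np

results = {}
rng = np.random.default_rng(seed=0)
for fraction in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    n = int(fraction * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = build_model()
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train[idx], y_train[idx], batch_size=64, epochs=200, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    results[fraction] = acc
print(results)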
4.3. Using Additional Training Data Sets

The main goal of the following experiments was to understand if adding data from public datasets can improve the performance of the models. Because the phone call data is recorded with an 8 kHz sampling rate, while the public datasets use 16 kHz, the following four experiments were performed.
   In the first experiment, the model was trained on the phone call data in its original format. In the second experiment, IEMOCAP, TESS, RAVDESS and EMODB were downsampled to 8 kHz and added to the train set. In the third experiment, the phone call data (both train and test) was upsampled to 16 kHz. In the fourth experiment, IEMOCAP, TESS, RAVDESS and EMODB were added to the upsampled train set. The results of these experiments are summarized in Table 2.

Table 2
Training with Additional Data

       Training Data     Sampling Rate     Accuracy
      Phone calls only           8 kHz      53.83%
         All data                8 kHz      50.68%
      Phone calls only          16 kHz      53.47%
         All data               16 kHz      50.41%
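The resampling used in these experiments can be done with standard audio tooling; the sketch below uses librosa as an assumed tool choice, since the paper does not name the resampler actually used.

# Sketch of the sampling-rate handling in Section 4.3: public corpora (16 kHz)
# are downsampled to match the 8 kHz phone calls, or the phone calls are
# upsampled to 16 kHz. File names below are placeholders.
import librosa

def resample_file(path: str, orig_sr: int, target_sr: int):
    audio, _ = librosa.load(path, sr=orig_sr)
    return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)

# Downsample a 16 kHz IEMOCAP/TESS/RAVDESS/EMODB clip to 8 kHz:
clip_8k = resample_file("public_corpus_clip.wav", orig_sr=16000, target_sr=8000)
# Upsample an 8 kHz phone-call segment to 16 kHz:
call_16k = resample_file("phone_call_segment.wav", orig_sr=8000, target_sr=16000)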
4.4. Obtaining Human Error

Finally, an untrained person was asked to guess the emotions in the same test set, to estimate the human error. Audio segments were presented in random order, so that the person could not analyse the overall semantics and context of the conversations and had to rely solely on the acoustics, similarly to the deep learning model. The test accuracy ended up being 22.72%, which indicates that predicting emotions in random segments of phone calls is not a trivial exercise even for a human.


5. Conclusions

This research investigated emotion recognition from real-world phone call data. The output of the research can be summarized in the following points:

     • Emotion recognition on real-world data is a more difficult exercise than emotion recognition on artificially created datasets
     • The model architecture proposed in this paper is capable of surpassing untrained human-level error on the given exercise
     • Augmenting training phone call data with artificially created datasets does not seem to help improve model performance
     • At this stage, increasing the data volume twofold only marginally improves model performance
     • Upsampling and downsampling the audio data neither improves nor worsens the performance of the models

   For further research, it might be worth trying to increase the training dataset further (an increase of at least 500-1000%). Given the untrained human-level error, it seems that in order to predict emotions accurately, even a human needs more context than a single utterance, preferably whole conversations. Therefore, increasing the input context is an interesting avenue for follow-up work. Furthermore, defining an emotion as a set of dimensions and predicting each dimension separately might improve forecasting accuracy.
Acknowledgments
The research leading to these results has received fund-
ing from the research project "Competence Centre of
Information and Communication Technologies" of EU
Structural funds, contract No. 1.2.1.1/18/A/003 signed
between IT Competence Centre and Central Finance and
Contracting Agency, Research No. 2.9. “Automated mul-
tilingual subtitling”.


References
 [1] F. Burkhardt, A. Paeschke, M. Rolfes, W. F.
     Sendlmeier, B. Weiss, A database of german emo-
     tional speech, in: Ninth European Conference on
     Speech Communication and Technology, 2005.
 [2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh,
     E. Mower, S. Kim, J. N. Chang, S. Lee, S. S.
     Narayanan, Iemocap: Interactive emotional dyadic
     motion capture database, Language resources and
     evaluation 42 (2008) 335–359.
 [3] M. K. Pichora-Fuller, K. Dupuis, Toronto emotional
     speech set (TESS) (2020). URL: https://doi.org/10.
     5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF.
 [4] S. R. Livingstone, F. A. Russo, The ryerson audio-
     visual database of emotional speech and song
     (ravdess): A dynamic, multimodal set of facial and
     vocal expressions in north american english, PloS
     one 13 (2018) e0196391.
 [5] T. Kostoulas, T. Ganchev, N. Fakotakis, Study
     on speaker-independent emotion recognition from
     speech on real-world data, in: Verbal and nonver-
     bal features of human-human and human-machine
     interaction, Springer, 2008, pp. 235–242.
 [6] A. Dhall, R. Goecke, J. Joshi, M. Wagner, T. Gedeon,
     Emotion recognition in the wild challenge 2013, in:
     Proceedings of the 15th ACM on International con-
     ference on multimodal interaction, 2013, pp. 509–
     516.
 [7] A. Tawari, M. M. Trivedi, Speech emotion analysis
     in noisy real-world environment, in: 2010 20th
     International Conference on Pattern Recognition,
     IEEE, 2010, pp. 4605–4608.
 [8] E. Vanags, A qualitative analysis of affect signs
     in telecommunication dialogues, in: The 79th In-
     ternational Scientific Conference of the UL section
     Psychological well-being, 2021.
 [9] J. A. Russell, L. F. Barrett, Core affect, prototypical
     emotional episodes, and other things called emo-
     tion: dissecting the elephant., Journal of personality
     and social psychology 76 (1999) 805.
[10] Y. Wang, J. Zhang, J. Ma, S. Wang, J. Xiao, Con-
     textualized emotion recognition in conversation
     as sequence tagging, in: Proceedings of the 21st
     Annual Meeting of the Special Interest Group on
     Discourse and Dialogue, 2020, pp. 186–195.