<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Emotion Recognition in Real-World Support Call Center Data for Latvian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eduards Blumentals</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Askars Salimbajevs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computing, University of Latvia</institution>
          ,
          <addr-line>Raina bulvaris 19, Riga</addr-line>
          ,
          <country country="LV">Latvia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tilde SIA</institution>
          ,
          <addr-line>Vienibas gatve 75a, Riga</addr-line>
          ,
          <country country="LV">Latvia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Emotion recognition from speech is a research area that focuses on grasping genuine feelings from audio data. It makes it possible to extract useful data points from sound that can further be used to improve decision-making. This research was conducted to test an emotion recognition toolkit on real-world recordings of phone calls in the Latvian language. This scenario presents at least two significant challenges: the mismatch between real-world data and "artificially" created data, and the lack of training data for the Latvian language. The study mainly focuses on investigating the training data requirements for successful emotion recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>datasets</kwd>
        <kwd>neural networks</kwd>
        <kwd>speech</kwd>
        <kwd>emotion recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        Nowadays, emotion recognition from speech is a highly The main dataset used in this paper consists of technical
relevant topic. It is used in a wide variety of applications support phone calls. Recordings are done with an 8 kHz
from businesses to governmental bodies. For example, sampling rate and a single channel. The dataset included
in call centers it helps to monitor client support quality audio recordings of 39 conversations, that held in
Latand to study clients’ reaction to certain emotional trig- vian which were further separated into 6,171 segments.
gers. Multiple studies have been conducted on emotion Qualitative analysis of telephone conversations was
perrecognition from speech signal. However, most of the formed, annotating in several layers’ potential afective
papers investigate machine learning model performance features - afect dimensions, linguistic units,
paralinguison public artificially created datasets such as EMODB[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], tic units etc. A total of 11 synchronous annotation layers
IEMOCAP[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], TESS[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and RAVDESS[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Although this were created for each segment.
approach ensures a common benchmark, it ignores the Each segment had two parameters valence and
activafact that in the real world speech data is not as clear or tion, where valence measures how positive or negative
well-defined. A few papers such as Kostulas et al.[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the emotion is, and activation measures its magnitude.
Dhall et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Tawari et al.[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] aim to address this These parameters were assigned by trained individuals
issue. (pedagogy and psychology students and professors).
De
      </p>
      <p>
        Based on the given dimensions, segments were assigned to nine categories: happy, surprised, angry, disappointed, sad, bored, calm, satisfied and neutral, following the approach proposed by Russell and Barrett [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Therefore, the problem is transformed from regression into multiclass classification.
      </p>
of detecting emotions from real speech. The paper eval- An insignificant number of observations in several
uates model performance with diferent training data emotion categories necessitated proceeding with the five
setups. In addition it evaluates human error to estimate most represented emotions. Table 1 summarizes the final
the dificulty of the exercise for an untrained person. dataset used in this paper. This dataset is further divided
into train (80%) and test (20%) sets.
      </p>
      <p>Additionally, several public emotional speech datasets were included in the research. EMODB, IEMOCAP, TESS, and RAVDESS were used to increase the dataset size, as well as to see how our model performs compared to state-of-the-art models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>This paper investigates a deep learning approach to emo</title>
        <p>tion recognition. Each input audio was converted into
the 39-dimensional mel-frequency cepstral coeficients
(MFCC) feature vector and passed through the model.
Due to the relatively small dataset size, a shallow neural
network was used to prevent overfitting. The final model
was comprised of two LSTM layers, two fully connected
layers and a softmax output layer. For additional
regularization, a 30% dropout after each layer was added. Figure
1 displays the model architecture.</p>
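      <p>As a concrete illustration, the following is a minimal sketch of the described pipeline in Python, assuming librosa for feature extraction and Keras for the model; the paper does not name a framework, and the hidden-layer sizes are illustrative assumptions.</p>
      <preformat>
# Minimal sketch of the feature extraction and model architecture described
# above. librosa/Keras and the layer widths (64, 64, 64, 32) are assumptions.
import librosa
import numpy as np
from tensorflow.keras import layers, models

N_MFCC = 39       # 39-dimensional MFCC features
NUM_CLASSES = 5   # the five most represented emotions (Section 2)

def extract_mfcc(path: str, sr: int = 8000) -> np.ndarray:
    """Load one audio segment and return a (frames, 39) MFCC sequence."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=N_MFCC).T

def build_model() -> models.Model:
    """Two LSTMs, two dense layers and a softmax output, 30% dropout after each."""
    return models.Sequential([
        layers.Input(shape=(None, N_MFCC)),      # variable-length MFCC sequences
        layers.LSTM(64, return_sequences=True),  # first LSTM feeds the second
        layers.Dropout(0.3),
        layers.LSTM(64),                         # second LSTM summarizes the sequence
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
      </preformat>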
      <p>Categorical cross-entropy was used as the loss function and the Adam optimizer was used for training. The model training was performed using batches of 64 observations. Each model was trained for 200 epochs with validation after each epoch. Next, the model weights that yielded the highest validation accuracy were retrieved. Finally, all trained models were compared based upon their accuracy on the test set.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Validating the Model</title>
        <p>First, model performance on public datasets was evaluated. TESS and RAVDESS were combined into one dataset, separated into train (80%) and test (20%) sets, and used to train the model for 200 epochs. The final test accuracy (Figure 2) was 86.02%. In addition, a similar experiment was conducted with the IEMOCAP dataset, which is comprised of conversations between actors.</p>
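        <p>As a sketch, the 80/20 split of the combined data can be produced with scikit-learn; the tool choice and variable names are assumptions, not stated in the paper.</p>
        <preformat>
# Illustrative 80/20 split of the combined TESS + RAVDESS data.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    features, labels,   # assumed arrays of MFCC sequences and emotion labels
    test_size=0.2,      # hold out 20% for testing
    random_state=0,     # fixed seed for a reproducible split
)
        </preformat>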
        <p>
          The final test accuracy (Figure 3) was 60.65%, which is slightly lower than the state of the art [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. From this, one can conclude that the deep learning architecture used in this paper performs reasonably well on public "Wizard of Oz" datasets.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Accuracy on Real-World Latvian Data</title>
        <p>Next, the impact of changes in training data volume on test accuracy was evaluated, to see whether the performance of models can be improved by simply supplying additional data. In this experiment, the model was trained and evaluated on real-world audio recordings from a Latvian support call center. The model was trained on different portions of the train set collected from phone call data, starting at 50% of the volume and moving towards the full train set in steps of 10 percentage points. The results of this experiment are displayed in Figure 4. Seemingly, in the case of this research, increasing the data volume would not yield significantly better results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Using Additional Training Data Sets</title>
        <sec id="sec-4-2-1">
          <title>The main goal of the following experiments was to un</title>
          <p>derstand if adding additional data from public datasets
can improve the performance of models. Because phone
call data is recorded with 8 kHz sampling rate, but
public datasets are 16 kHz, following 4 experiments were
performed.</p>
          <p>In the first experiment, the model was trained on the
phone call data in its original format. In the second
experiment, IEMOCAP, TESS, RAVDESS and EMODB were
downsampled to 8 kHz and added to the train set. In the
third experiment, phone call data (both train and test)
were upsampled to 16 kHz. In the fourth experiment,
IEMOCAP, TESS, RAVDESS and EMODB were added to
the upsampled train set. The results of those experiments
are summarized in Table 2.</p>
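          <p>The paper does not state which resampler was used; as one possibility, both rate conversions can be done with scipy's polyphase resampler, as sketched below.</p>
          <preformat>
# Sketch of the two sampling-rate conversions used in the experiments;
# scipy is an assumed tool choice, not the one named by the paper.
from scipy.signal import resample_poly

def downsample_16k_to_8k(audio):
    """16 kHz public-dataset audio to the 8 kHz phone-call rate."""
    return resample_poly(audio, up=1, down=2)

def upsample_8k_to_16k(audio):
    """8 kHz phone-call audio to 16 kHz. Upsampling adds no information
    above 4 kHz; the signal stays telephone-band limited."""
    return resample_poly(audio, up=2, down=1)
          </preformat>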
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Obtaining Human Error</title>
        <p>Finally, an untrained person was asked to guess the emotions in the same test set to estimate the human error. Audio segments were presented in random order, so that the person could not analyse the overall semantics and context of the conversations and had to rely solely on the acoustics, similarly to the deep learning model. The test accuracy ended up being 22.72%, which indicates that predicting emotions in random segments of phone calls is not a trivial exercise even for a human.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This research investigated emotion recognition from real-world phone call data. The output of the research can be summarized in the following points:</p>
      <list list-type="bullet">
        <list-item>
          <p>Emotion recognition on real-world data is a more difficult exercise than emotion recognition on artificially created datasets.</p>
        </list-item>
        <list-item>
          <p>The model architecture proposed in this paper is capable of surpassing untrained human-level error on the given exercise.</p>
        </list-item>
        <list-item>
          <p>Augmenting the phone call training data with artificially created datasets does not seem to improve model performance.</p>
        </list-item>
        <list-item>
          <p>At this stage, increasing the data volume twofold only marginally improves model performance.</p>
        </list-item>
        <list-item>
          <p>Upsampling and downsampling the audio data neither improves nor worsens the performance of the models.</p>
        </list-item>
      </list>
      <p>For further research, it might be worth trying to increase the training dataset further (by at least 500-1000%). Given the untrained human-level error, it seems that in order to predict emotions accurately, even a human needs more context than a single utterance, preferably whole conversations. Therefore, increasing the input context is an interesting avenue for follow-up work. Furthermore, defining an emotion as a set of dimensions and predicting each dimension separately might improve forecasting accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>The research leading to these results has received fund</title>
        <p>ing from the research project "Competence Centre of
Information and Communication Technologies" of EU
Structural funds, contract No. 1.2.1.1/18/A/003 signed
between IT Competence Centre and Central Finance and
Contracting Agency, Research No. 2.9. “Automated
multilingual subtitling”.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Burkhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paeschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rolfes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. F.</given-names>
            <surname>Sendlmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <article-title>A database of German emotional speech</article-title>
          ,
          <source>in: Ninth European Conference on Speech Communication and Technology</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bulut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazemzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>IEMOCAP: Interactive emotional dyadic motion capture database</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>42</volume>
          (
          <year>2008</year>
          )
          <fpage>335</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Pichora-Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dupuis</surname>
          </string-name>
          ,
          <article-title>Toronto emotional speech set (TESS)</article-title>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.5683/SP2/E8H2MF. doi:10.5683/SP2/E8H2MF.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Livingstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English</article-title>
          ,
          <source>PloS one 13</source>
          (
          <year>2018</year>
          )
          <elocation-id>e0196391</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kostoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ganchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fakotakis</surname>
          </string-name>
          ,
          <article-title>Study on speaker-independent emotion recognition from speech on real-world data</article-title>
          ,
          <source>in: Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction</source>
          , Springer,
          <year>2008</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gedeon</surname>
          </string-name>
          ,
          <article-title>Emotion recognition in the wild challenge 2013</article-title>
          ,
          <source>in: Proceedings of the 15th ACM on International conference on multimodal interaction</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>509</fpage>
          -
          <lpage>516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tawari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <article-title>Speech emotion analysis in noisy real-world environment</article-title>
          ,
          <source>in: 2010 20th International Conference on Pattern Recognition</source>
          , IEEE,
          <year>2010</year>
          , pp.
          <fpage>4605</fpage>
          -
          <lpage>4608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>A qualitative analysis of affect signs in telecommunication dialogues</article-title>
          ,
          <source>in: The 79th International Scientific Conference of the UL section Psychological well-being</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <article-title>Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant</article-title>
          ,
          <source>Journal of personality and social psychology 76</source>
          (
          <year>1999</year>
          )
          <fpage>805</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Contextualized emotion recognition in conversation as sequence tagging</article-title>
          ,
          <source>in: Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>