<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Matching for Syllable-Level Prosody Encoding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdul Rehman</string-name>
          <email>arehman@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Jun Zhang</string-name>
          <email>jzhang@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaosong Yang</string-name>
          <email>xyang@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Media and Communications, Bournemouth University</institution>
          ,
          <addr-line>Bournemouth</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We address the challenge of machine interpretation of subtle speech intonations that convey complex meanings. We assume that emotions and interrogative statements follow regular prosodic patterns, allowing us to create an unsupervised intonation template dictionary. These templates can then serve as encoding mechanisms for higher-level labels. We use piecewise interpolation of syllable-level formant features to create intonation templates and evaluate their effectiveness on three speech emotion recognition datasets and on declarative-interrogative utterances. The results indicate that basic emotions can be detected from individual syllables with nearly double the accuracy of chance. Additionally, certain intonation templates exhibit a correlation with interrogative implications.</p>
      </abstract>
      <kwd-group>
        <kwd>intonations</kwd>
        <kwd>speech processing</kwd>
        <kwd>emotion recognition</kwd>
        <kwd>computational paralinguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Paralingual intonations can alter sentence meanings without changing the words, such as adding
sarcasm or conveying politeness. These nuances are often challenging for machine speech
recognition due to the ambiguity of implications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Nonetheless, some implied meanings in
speech are thought to be consistent across cultures and languages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This research aims to simplify the mel-spectrum into a few explainable variables that capture
paralingual cues in temporal sequences. Traditional sequence learning methods like RNN have
limitations due to dataset variations and domain-specific issues [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. To address this, we
propose using the quantifiable flux of syllables rather than the discrete sequence of phonemes.
We create standard templates from common syllable feature patterns and use them to match test
syllables, offering a way to mathematically quantify syllable flux without relying on a standard
paralingual symbol dictionary.
      </p>
      <p>
        Few studies have compared the applicability of prosodic and lexical features for extracting
emotional cues; they point to the higher importance of prosodic features as compared to lexical
content [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>[Figure: Mel spectrogram, formant attention, and syllable segmentation applied to training clips and a test syllable.]</p>
      <p>
        Moreover, prosodic cues have been studied alongside lexical features for the detection of interrogative
intonation [
        <xref ref-type="bibr" rid="ref10 ref3 ref8 ref9">3, 8, 9, 10</xref>
        ]. These studies all report better detection using the lexicon than acoustics.
Margolis et al. reported that their text-trained model recognized declarative
questions poorly even when the prosodic features were considered [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The poor recognition is thought
to be due to the ambiguous implication of a rising pitch at the end of an utterance: it does not
always signal interrogation; it can also signal confirmation seeking or uncertainty [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Method</title>
      <p>
        The proposed method is based on the syllable formant attention mechanism, which relies on the
first two formants of vowel sounds as effective descriptors of speech [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. We also
incorporate the combined amplitude of the top 3 formants as a third frame-level feature, assuming
that loudness variation conveys intonation information. Figure 3 illustrates these formants in a
4-syllable speech segment, segmented using onset and offset detection as previously described
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To quantify these formant patterns as a measure of intonation, we create a template
dictionary from common formant patterns using syllable feature-time data interpolation. We
then assess how well other syllables align with these templates.
      </p>
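      <p>
        As a rough illustration of these frame-level features, the sketch below approximates the two formant frequencies with a standard LPC-root method and uses frame energy as a stand-in for the loudness feature; the librosa-based extraction and all names are our assumptions, not the authors' exact pipeline.
      </p>
      <preformat>
import numpy as np
import librosa

def frame_features(y, sr, frame_len=1024, hop=256, lpc_order=12):
    """Per-frame (formant 1, formant 2, loudness) features for one syllable clip.

    Resonance frequencies are taken from the roots of an LPC polynomial (a common
    approximation); loudness is approximated here by frame RMS energy.
    """
    feats = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * np.hamming(frame_len)
        a = librosa.lpc(frame, order=lpc_order)            # LPC coefficients
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]                  # one root per conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        if len(freqs) >= 2:
            loud = np.sqrt(np.mean(frame ** 2))            # RMS as a loudness proxy
            feats.append((freqs[0], freqs[1], loud))
    return np.array(feats)                                 # shape: (n_frames, 3)
      </preformat>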
      <p>The final output of the interpolation process is the residual errors and correlations with the
feature templates, which can be used to estimate useful paralingual cues. For a syllable template
$t$ of a feature $h$ (out of $F_0$, $F_1$, $M$) there are five coefficients $c_{t,h,d}$ in a 4-degree polynomial
equation that creates a template model:
$$P_{t,h}(x) = c_{t,h,0} + c_{t,h,1} x + c_{t,h,2} x^{2} + c_{t,h,3} x^{3} + c_{t,h,4} x^{4} \quad (1)$$
where $s$ is the index of the syllable being fitted, $P_{t,h}(x)$ is the interpolated value of the formant
feature $h$ at frame index $x$ of syllable $s$, and $t$ is the incremental index of the equation in the feature template
dictionary. We want $P_{t,h}(x)$ to be as close to the actual value of that feature as possible. A loss
minimization method is used that fits the known formant features onto a polynomial curve in
two stages. The first stage creates an estimate of best fit for the polynomial coefficient matrix $C_{t,h}$ using
matrix manipulations as
$$C_{t,h} = \left[X^{\top} X\right]^{-1} \left[X^{\top} Y\right] \quad (2)$$
where $X$ is a matrix with the frame indices $x$ of all the syllables to be fitted as rows and the polynomial
degrees ($d \in \{0, \dots, 4\}$ for quartic fitting) as columns, where each element has the value
$X_{x,d} = x^{d}$, and $Y$ is the single-column matrix of the expected formant feature values at each
frame of the syllable. Then, at the second stage, a loss minimization by gradient descent is used
to improve the fitting of the polynomials. We used the simplex algorithm for unconstrained
minimization of the polynomial regression loss [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
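      <p>
        For concreteness, a minimal sketch of the first stage: it builds the quartic design matrix and solves the least-squares problem of Eq. (2) with NumPy (the function and variable names are illustrative assumptions).
      </p>
      <preformat>
import numpy as np

def initial_template_coeffs(frame_indices, feature_values, degree=4):
    """First-stage estimate of the template coefficients (Eq. 2).

    frame_indices:  concatenated frame indices x of all syllables fitted to this template
    feature_values: the corresponding formant-feature values (the single-column Y)
    """
    x = np.asarray(frame_indices, dtype=float)
    y = np.asarray(feature_values, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)      # columns x^0 ... x^4
    # Equivalent to solving the normal equation C = (X^T X)^-1 X^T Y, but numerically stabler.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs                                       # c_0 ... c_4
      </preformat>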
      <p>A sum-squared error is used to estimate $E_{t,h}$ during the unconstrained minimization as
$$E_{t,h} = \sum_{s=0}^{N_{t}} \sum_{x=0}^{L_{s}} \left[P_{t,h}(x) - f_{s,h}(x)\right]^{2} \quad (3)$$
where $P_{t,h}(x)$ is the value calculated by the polynomial equation, $f_{s,h}(x)$ is the actual value of the
formant feature $h$ of syllable $s$ at frame $x$, $N_{t}$ is the total number of syllables being fitted for
this template, and $L_{s}$ is the total frame length of an individual syllable $s$. The initial coefficients
$C_{t,h}$ produced in the first stage are only a rough estimate that helps reach the minimum
in less time. The actual unconstrained minimization of the polynomial curve is performed
using the simplex algorithm. The algorithm moves towards the minimum of $E_{t,h}$ by
adjusting the parameters of the worst points. The simplex algorithm usually converges in 5 to
20 iterations because of the initialization; otherwise, it takes many more iterations. Figure 3
shows the regression values of the quartic curves of the first formants of each syllable laid over the
actual frequencies.</p>
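      <p>
        A minimal sketch of this second stage, reusing the helper above and assuming SciPy's Nelder-Mead implementation of the simplex algorithm as the unconstrained minimizer of the loss in Eq. (3):
      </p>
      <preformat>
import numpy as np
from scipy.optimize import minimize

def refine_template_coeffs(init_coeffs, syllables):
    """Refine template coefficients by simplex minimization of the sum-squared error (Eq. 3).

    syllables: list of (frame_indices, feature_values) pairs, one pair per syllable
    """
    def sse(coeffs):
        total = 0.0
        for x, f in syllables:
            pred = np.polyval(coeffs[::-1], x)    # polyval expects the highest degree first
            total += np.sum((pred - f) ** 2)
        return total

    # The first-stage coefficients serve as the starting point, so few iterations are needed.
    result = minimize(sse, init_coeffs, method="Nelder-Mead")
    return result.x
      </preformat>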
      <sec id="sec-3-1">
        <title>2.1. Matching Syllables with Intonation Template</title>
        <p>The sequences of the first two formants, $F_0$ and $F_1$, and the magnitude $M$ are fitted using
audio speech recordings to create 3 separate sets of curve templates using the curve fitting
method described above. The 3 sets are further divided into 10 categories (for each of the 3 features)
by length, because we assume that the duration of a syllable is also one of the key
discriminating features. Feature sequences of various durations have their own sets of template
curves. The total number of templates for a feature-duration class depends on the variety in the
recordings and on the coefficient of determination ($R^2$) threshold set for a match to be considered,
or for a new template to be created if the match score is lower than the threshold.</p>
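        <p>
          A possible layout for the duration split is sketched below: each of the 3 features gets 10 length classes, each holding its own list of templates. The frame-count boundaries are illustrative assumptions, not values from the paper.
        </p>
        <preformat>
import numpy as np

N_DURATION_CLASSES = 10
# Illustrative frame-count boundaries separating the 10 length classes.
DURATION_EDGES = np.linspace(5, 50, N_DURATION_CLASSES - 1)

def duration_class(n_frames):
    """Map a syllable's frame length to one of the 10 duration categories."""
    return int(np.searchsorted(DURATION_EDGES, n_frames))

# One template list per (feature, duration-class) pair.
template_dict = {(feat, d): [] for feat in ("F0", "F1", "M")
                 for d in range(N_DURATION_CLASSES)}
        </preformat>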
        <p>Once all the template coefficients have been estimated, they can be used as a match
predictor model for various paralingual tasks such as emotion recognition or interrogative speech
detection. For example, if there are a total of 100 templates of various shapes and sizes for the 3
formant features, the $R^2$ scores for all 100 templates can be used as a feature vector to
train a classifier to predict any paralingual label for a syllable.</p>
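        <p>
          A sketch of that encoding step follows; the $R^2$ computation and the dictionary layout are assumptions carried over from the earlier sketches.
        </p>
        <preformat>
import numpy as np

def r2_score(template_coeffs, frame_indices, feature_values):
    """Coefficient of determination between a template curve and one feature sequence."""
    pred = np.polyval(np.asarray(template_coeffs)[::-1], frame_indices)
    ss_res = np.sum((feature_values - pred) ** 2)
    ss_tot = np.sum((feature_values - np.mean(feature_values)) ** 2)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

def encode_syllable(syllable_feats, all_templates):
    """Feature vector of match scores: one R^2 entry per template in the dictionary.

    syllable_feats: dict mapping "F0", "F1", "M" to per-frame value arrays
    all_templates:  list of dicts, each with "feature" and "coeffs" keys
    """
    x = np.arange(len(syllable_feats["F0"]), dtype=float)
    return np.array([r2_score(t["coeffs"], x, syllable_feats[t["feature"]])
                     for t in all_templates])
        </preformat>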
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experimentation</title>
      <p>We evaluated the proposed method on the tasks of speech emotion recognition and interrogative
intonation analysis. We make 2 assumptions to evaluate our proposed approach:
• Basic emotions can be recognized from individual syllables without needing a huge dataset
for regularization. We test this in two ways: 1) By training speech emotion recognition
models on one dataset and testing on another, and 2) by decreasing the size of data used
for training to check if the proposed method can be trained with a small sample.
• Interrogative intonation can be distinguished from individual syllables when the same
statement is said with a declarative versus an interrogative tone. Due to the lack of a dataset
suitable for machine learning, we used a small dataset that was only big enough
for observational analysis.</p>
      <p>
        For speech emotion recognition, we used 3 widely used databases recorded in
scripted or improvised scenarios in English: the IEMOCAP database [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the MSP-Improv database [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
and the RAVDESS speech dataset [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The validation was performed using a 5-fold method
when the training and testing data are from the same database, whereas for
cross-corpus validation, two different datasets were used for training and testing.
      </p>
      <p>
        For interrogative intonation analysis, we used a small set of 72 utterances that were originally
collected for the purpose of controlled stimuli by Xie et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There are 6 two-word sentences,
each ending with a continuous verb with an “ing” sound, said 12 times with an increasingly questioning tone
by the same speaker on a continuum between statements and declarative questions, e.g., “It’s
raining.” and “It’s raining?”. The lack of variability in other factors makes this dataset ideal for
an analysis of the effectiveness of the proposed template-matching method.
      </p>
      <p>Using the method described in Section 2, a set of feature templates was derived from the
training dataset. The templates’ polynomial coefficients are used to estimate correlations $R^2$ of
each formant feature ($F_0$, $F_1$, $M$) with their respective sets of templates. The template that
best matches the test syllable’s feature sequence has the highest correlation among
all the template curves. The number of template curves is not fixed and depends on the error
threshold set during training. In relative terms, the error threshold was set to 0.08, i.e., if the
nearest match for a new syllable has $R^2 \le 0.92$ with the already known templates, then a new
template is created using the new syllable. Otherwise, the new syllable data is appended to the
data series of the nearest matching template, which is then refitted using the simplex
algorithm.</p>
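      <p>
        A sketch of this training-time update, reusing the fitting and scoring helpers sketched in Section 2 (the bookkeeping details are our assumptions):
      </p>
      <preformat>
import numpy as np

R2_THRESHOLD = 0.92   # relative error threshold of 0.08, as described above

def update_templates(templates, x, values):
    """Match a new syllable's feature sequence to a template, or create a new one.

    templates: the per-feature, per-duration template list being built during training
    """
    scores = [r2_score(t["coeffs"], x, values) for t in templates]
    best = max(scores) if scores else -1.0
    if best > R2_THRESHOLD:
        # Close enough: append the syllable to the nearest template and refit it.
        t = templates[int(np.argmax(scores))]
        t["data"].append((x, values))
        t["coeffs"] = refine_template_coeffs(t["coeffs"], t["data"])
    else:
        # No sufficiently close match: start a new template from this syllable alone.
        c0 = initial_template_coeffs(x, values)
        templates.append({"coeffs": refine_template_coeffs(c0, [(x, values)]),
                          "data": [(x, values)]})
    return templates
      </preformat>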
      <p>The encoded feature vectors extracted from the proposed method (the match scores with
the templates) were then used to train a single hidden layer MLP (Multi-Layer Perceptron) with
8 units to perform the emotion classification task. For the cross-corpus validation tasks, the
width of the syllable feature vector for training on RAVDESS was 133, 150 for IEMOCAP, and
146 for MSP-Improv.</p>
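      <p>
        The classification stage maps onto a standard single-hidden-layer MLP; a minimal scikit-learn sketch with placeholder data (the random arrays only stand in for the real match-score vectors and labels):
      </p>
      <preformat>
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: per-syllable R^2 match-score vectors (width 133, as for RAVDESS)
# and 4-class emotion labels; in practice both come from the template-matching stage.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 133)), rng.integers(0, 4, 200)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
emotion_pred = clf.predict(rng.random((10, 133)))
      </preformat>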
      <sec id="sec-4-1">
        <title>3.1. Results</title>
        <p>
          The results in Table 1 show that the UA (Unweighted Accuracy) for the proposed method is
significantly better than chance (25%), which reflects the amount of prosodic information
captured by the templates. More importantly, the cross-corpus accuracies show that templates
learned from one corpus can predict the emotions in other corpora. For comparison, Alex et
al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] reported a syllable-level accuracy of 37.57% and an utterance-level accuracy of 63.83%
for the IEMOCAP dataset. Most other works report accuracies at the utterance level, therefore a valid
comparison cannot be made [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <p>We introduced a method for analyzing syllable-level intonation patterns by matching formant
features in syllables with common patterns. Using template similarity scores, we predicted
the emotional tone of each syllable and explored the correlation between match scores and
questioning intonation levels. Our findings revealed that only a few templates effectively
captured interrogative intonation in the sample recordings.</p>
      <p>This research had limitations, including the absence of syllable-level annotated data, which
affected learning precision due to variations within utterances. Future work involves collecting
and annotating syllable-level data to enhance computational paralinguistics. Another challenge
was the computational heaviness of polynomial optimization for large datasets. Future efforts
will explore alternative approaches for syllable template extraction to improve efficiency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Kanakaraddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Nandyal</surname>
          </string-name>
          ,
          <article-title>Survey on parts of speech tagger techniques</article-title>
          ,
          <source>in: 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>del Mar Vanrell</surname>
          </string-name>
          ,
          <article-title>Intonational polar question markers and implicature in American English and Majorcan Catalan</article-title>
          ,
          <source>Speech Prosody</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
          <fpage>158</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Question detection from acoustic features using recurrent neural network with gated recurrent unit</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>6125</fpage>
          -
          <lpage>6129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.-T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence</article-title>
          ,
          <source>Information Sciences 563</source>
          (
          <year>2021</year>
          )
          <fpage>309</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Grichkovtsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lacheret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morel</surname>
          </string-name>
          ,
          <article-title>The role of intonation and voice quality in the affective speech perception</article-title>
          , in: Eighth Annual Conference of the International Speech Communication Association,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Grichkovtsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lacheret</surname>
          </string-name>
          ,
          <article-title>The role of voice quality and prosodic contour in affective speech perception</article-title>
          ,
          <source>Speech Communication</source>
          <volume>54</volume>
          (
          <year>2012</year>
          )
          <fpage>414</fpage>
          -
          <lpage>429</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buxó-Lugo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kurumada</surname>
          </string-name>
          ,
          <article-title>Encoding and decoding of meaning through structured variability in intonational speech prosody</article-title>
          ,
          <source>Cognition</source>
          <volume>211</volume>
          (
          <year>2021</year>
          )
          <fpage>104619</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Margolis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          ,
          <article-title>Question detection in spoken conversations using textual conversations</article-title>
          ,
          <source>in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Boakye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tür</surname>
          </string-name>
          ,
          <article-title>Any questions? automatic question detection in meetings</article-title>
          ,
          <source>in: 2009 IEEE Workshop on Automatic Speech Recognition &amp; Understanding</source>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Asakawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Masumura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kobashikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Aono</surname>
          </string-name>
          ,
          <article-title>Automatic question detection from acoustic and phonetic features using feature-wise pre-training</article-title>
          .,
          <source>in: INTERSPEECH</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1731</fpage>
          -
          <lpage>1735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Šafářová</surname>
          </string-name>
          ,
          <article-title>The semantics of rising intonation in interrogatives and declaratives</article-title>
          ,
          <source>in: Proceedings of sinn und bedeutung</source>
          , volume
          <volume>9</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rehman</surname>
          </string-name>
          , Z.-T. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Cao</surname>
          </string-name>
          , C.-S. Jiang,
          <article-title>Speech emotion recognition based on syllable-level feature extraction</article-title>
          ,
          <source>Applied Acoustics</source>
          <volume>211</volume>
          (
          <year>2023</year>
          )
          <fpage>109444</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rehman</surname>
          </string-name>
          , Z.-T. Liu,
          <string-name>
            <surname>J.-M. Xu</surname>
          </string-name>
          ,
          <article-title>Syllable level speech emotion recognition based on formant attention</article-title>
          ,
          <source>in: CAAI International Conference on Artificial Intelligence</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Kent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Vorperian</surname>
          </string-name>
          ,
          <article-title>Static measurements of vowel formant frequencies and bandwidths: A review</article-title>
          ,
          <source>Journal of communication disorders 74</source>
          (
          <year>2018</year>
          )
          <fpage>74</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Dennis Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Woods</surname>
          </string-name>
          ,
          <article-title>Optimization on microcomputers: The Nelder-Mead simplex algorithm</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bulut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazemzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>IEMOCAP: Interactive emotional dyadic motion capture database</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>42</volume>
          (
          <year>2008</year>
          )
          <fpage>335</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burmania</surname>
          </string-name>
          , M. AbdelWahab,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sadoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Provost</surname>
          </string-name>
          ,
          <article-title>MSP-Improv: An acted corpus of dyadic interactions to study emotion perception</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <fpage>67</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Livingstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English</article-title>
          ,
          <source>PloS one 13</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Alex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <article-title>Attention and feature selection for automatic speech emotion recognition using utterance and syllable-level prosodic features</article-title>
          ,
          <source>Circuits, Systems, and Signal Processing</source>
          <volume>39</volume>
          (
          <year>2020</year>
          )
          <fpage>5681</fpage>
          -
          <lpage>5709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Aldeneh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Provost</surname>
          </string-name>
          ,
          <article-title>Using regional saliency for speech emotion recognition</article-title>
          ,
          <source>in: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>2741</fpage>
          -
          <lpage>2745</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>