<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HEDCM: Human Emotions Detection and Classification Model from Speech using CNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anjali Tripathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Upasana Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Garima Bansal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishabh Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashutosh Kumar Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Kurukshetra, Haryana</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Emotion detection and classification from speech is still an emerging area in the research community. In this paper, a speech-based classification model is proposed that categorizes speech into six basic emotion categories: anger, surprise, sadness, fear, happiness, and neutral. Different techniques are used for speaker discrimination and sentiment analysis tasks, each with its own pros and cons. In the proposed model, Mel Frequency Cepstrum Coefficient (MFCC) features capture the various attributes of the speech signal, and a Convolutional Neural Network (CNN) model classifies the different types of emotion. The numerical evaluation was performed on the RAVDESS dataset, which consists of audio files recorded by 12 actors and 12 actresses. For emotion classification, a prediction accuracy of 70.22% was obtained with the proposed model, along with a model accuracy of 96%. The accuracy is improved compared to other similar implementations, as deep learning works better than older ML classification methods.</p>
      </abstract>
      <kwd-group>
        <kwd>MFCC</kwd>
        <kwd>Deep neural network</kwd>
        <kwd>Emotion recognition</kwd>
        <kwd>Feature extraction</kwd>
        <kwd>CNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Humans can infer emotions just by hearing, viewing, or speaking to a person; the need of the hour is to provide similar functionality to machines. In such a case, after identifying a human's emotions, the device could act accordingly, keeping the user's requirements and priorities in mind. Understanding human sentiment has always been a captivating area of research, and it is gaining a lot of attention from researchers across disciplines.</p>
      <p>
        Automatic emotion recognition has a natural application in this space, since it can be used not only for automatic user feedback but also to construct more pleasant and natural conversation partners. Emotion recognition technology is essential for such assistants to become more seamlessly integrated into users' daily lives. Detecting emotion from audio is not an easy task, for the following reasons: it is not clear which specific features of speech are most valuable for separating different feelings, and variation across sentences, speakers' talking styles, and speaking rates directly influences the speech signal features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Sentiment analysis is the study of individuals' feelings or frames of mind towards an event, a topic of discussion, or things in general. Among the main distinguishing qualities humans possess is the ability to demonstrate and understand emotions through various modes of communication. Humans can comprehend even complex emotions, and these emotions guide their understanding of interpersonal relationships daily. As the popularity of speech-based assistants surges, one of the most visibly apparent deficiencies of such systems is their inability to understand their users' emotions and, further, to demonstrate any emotion in return [13] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Numerous researchers have proposed useful speech features that contain emotion information, for example, pitch frequency [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], energy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Linear Prediction Coefficients (LPC), MFCC, and Linear Prediction Cepstrum Coefficients (LPCC). Moreover, numerous analysts have investigated classification methods such as Neural Networks (NN) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Kernel Regression, K-Nearest Neighbors (KNN), and Hidden Markov Models (HMM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Gaussian Mixture Models (GMM), the Maximum Likelihood Bayesian Classifier (MLC), and Support Vector Machines (SVM). Although these systems can perform fundamental sentiment analysis, they do not inherently capture the richness of emotion in everyday speech. We propose a model that not only detects sentiment but also classifies it into categories such as happy, sad, anger, surprise, neutral, and fearful. In the HEDC model, the MFCC method and a CNN classifier are used to classify the emotions. The system gives better accuracy than other pre-existing models, works well on noisy data, and classifies sentiment into the different categories above. Previous work handled only fixed-size input, but the HEDC model can also take dynamic-size input, and it can easily detect emotions even in the case of homophone words.
      </p>
      <p>The paper is structured as follows: Section 2 compares the techniques used in previous models. Section 3 contains the system model description. Section 4 presents the experimentation and results, and finally, Section 5 gives the conclusion and some further future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Understanding human sentiment has always been a captivating area of research, and it is gaining a lot of consideration from researchers across disciplines. Communication over long distances is pervasive in the current scenario, and the focus is on conveying precisely what the person wants to say. But it is hard to evaluate the actual emotions of a person while communicating over a long distance, so work on such an initiative will positively affect conveying the correct message [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In recent years, knowledge of human behavior has received a lot of attention, and many techniques have been used to understand human emotions and their polarity. Table 1 compares various speech-based emotion detection methods from papers published between 2005 and 2018.
      </p>
      <p>
        Kagalkar et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] worked with two types of datasets, training data and testing data, using MFCC for feature extraction in both. From the extracted features, GMM and SVM classifiers place the speaker's age into various age ranges, and the emotion is then predicted based on the trained data. Fung et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presented a model that uses a real-time CNN to detect emotions. This model can distinguish three emotion classes: happy, sad, and angry. The average accuracy given by the model is 66.1%. Since the number of emotions detected is small, it is hard to predict any feeling outside them. Chavhan et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed a model that classifies emotion into four types, using MEDC and MFCC as feature extraction techniques with an SVM classifier, applied to three kinds of input speech signal: gender-independent, male, and female. Yet it gives 100% accuracy only for female speech.
      </p>
      <p>
        Huang et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] suggested another method for an emotion detection system. They utilized five layers of DBNs for feature extraction and classified emotion into four types with the assistance of a nonlinear SVM classifier. Yet the downside of this new strategy was that the DBN model's time cost was much higher than that of other feature extraction techniques. Zheng et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] utilized the PCA-DCNNs-SER approach for emotion detection and classification on the IEMOCAP database. This method was found to perform better than SVM classification, but because of the uneven distribution of emotion classes in that database, the resulting accuracy is low. Chen et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed an emotion recognition system using acoustic and linguistic features. Different feature representations are used for emotion detection in both the acoustic and linguistic modalities, and the resulting classification accuracies are compared on a single database, USC-IEMOCAP.
      </p>
      <sec id="sec-2-1">
        <title>Table 1: Pros and cons of the classification techniques</title>
        <p>SVM. Pros: no issues with local minima and over-training; able to manage high-dimensional input vectors. Cons: does not work with variable-length input; the number of classes increases the computational expense; not able to manage massive databases.</p>
        <p>HMM. Pros: models the time distribution of a sound signal; development is easy; supports variable-length input; able to model both discrete and continuous signals. Cons: assumes the likelihood of being in a given state is independent of its past states.</p>
        <p>NN. Pros: the capacity for self-organization and self-learning; adaptability to various conditions; suitable for pattern recognition. Cons: requires broad training; supervised classification.</p>
        <p>Naive Bayes (MLC). Pros: simplicity of implementation and understanding. Cons: features in a single class are assumed to be independent of the others; rigid assumptions; overfitting due to the feature-independence assumption.</p>
        <p>CNN. Pros: automatically detects the important features of an image; speed is good on short text; easy to implement; gives the best results on image-based data. Cons: high computational cost; needs a lot of training data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Architecture</title>
      <p>In this section, the proposed model, as shown in Fig. 1, is discussed in detail. The whole system is divided into three phases.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Speaker discrimination</title>
      <p>The objective of this phase is to determine whether the voice is male or female. The input is first divided into small chunks to make processing easier and more efficient, because evaluating a whole audio file takes more time than assessing the little chunks. Those chunks are compared with our training data to determine gender. Evaluating this first is important because the characteristics of voice differ between genders for the same emotion, and accounting for this can provide a more accurate result in the end. The output of this phase is a speaker id.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Speech recognition</title>
      <p>Along with the first phase, this phase's process is also being executed. The audio file goes through a pre-processing step, filtration: an attempt to filter noise and remove background sound from the voice, so that the voice can be understood clearly and the emotions determined more accurately. After the filtration process, the MFCC graph of the file is plotted. MFCC stands for Mel Frequency Cepstral Coefficient; it is one of the standard strategies used for feature extraction from any speech signal.</p>
      <p>
        MFCC utilizes a non-linear frequency scale to approximate the human auditory system, and it is the most widely used voice characteristic in emotion detection systems. The next step is to plot the spectrogram with the help of the MFCCs; ranges are defined in it for the different polarities, and it is also the output of this phase. A spectrogram is a pictorial technique for representing the strength, or "loudness," of a signal over time at the various frequencies present in a particular waveform. Not only can one see whether there is more energy at, for example, 2 Hz versus 10 Hz; one can also see how energy levels fluctuate over time. When applied to an audio signal, spectrograms are called sonographs, voicegrams, or voiceprints; when the information is rendered on 3D axes, they may be called waterfalls [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
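      <p>A minimal numpy sketch of the spectrogram computation described above (framing, windowing, and a per-frame FFT); the frame and hop sizes here are illustrative choices, not settings taken from the paper:</p>

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: split into overlapping frames, window
    each frame, and take the one-sided FFT magnitude."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Rows are time frames, columns are frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1-second 8 kHz sine at 1 kHz concentrates its energy in one bin
sr = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
spec = spectrogram(tone)
print(spec.shape)  # (61, 129)
```

      <p>Plotting this matrix on a log scale over time and frequency yields the picture that this phase outputs.</p>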
      <p>When both phases have executed, their outputs are combined. Depending on the spectrogram, the polarity is determined: whether the audio file carries a positive, negative, or neutral outlook.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Emotion classification</title>
      <p>This final phase determines the audio's emotion, or sentiment, with a Convolutional Neural Network (CNN). The motive for using a CNN is its high accuracy in image classification and recognition, and the output of the second phase is the image of a graph. In the CNN, the polarity is further divided into different emotions based on their characteristics: for instance, positive polarity is classified into happy and surprise, negative polarity into anger and sad, and so on.</p>
      <p>When implemented, this newly proposed model proves to be more accurate than the previously proposed models.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Experiments and Results</title>
      <p>The dataset used in this work is The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This dataset includes: 1. Recorded speech and song versions from 12 actors and 12 actresses, respectively.</p>
      <p>2. Song-version data for Actor 18 is not present in the dataset.</p>
      <p>3. The disgust, surprised, and neutral emotions are not present in the song version of the dataset.</p>
      <p>In this dataset, every one of the actors performs eight types of emotion, saying and singing two sentences with two repetitions each. Therefore, every actor yields four examples for every emotion other than neutral, disgust, and surprise, given that there is no song data for those emotions. Each sound wave is around 4 seconds long, and the first and last second are likely to be quiet.</p>
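      <p>Since each clip runs about 4 seconds with the first and last second likely quiet, a simple pre-processing sketch is to slice those seconds off (the sample rate below is an assumed value, not one stated in the paper):</p>

```python
import numpy as np

def trim_quiet_ends(samples, sr):
    """Drop the (typically quiet) first and last second of a clip."""
    return samples[sr:-sr]

sr = 22050
clip = np.zeros(4 * sr)               # stand-in for a 4-second waveform
trimmed = trim_quiet_ends(clip, sr)
print(len(trimmed) / sr)              # 2.0 seconds retained
```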
      <p>The input is an audio file that passes through the first two phases simultaneously. In the Speaker Discrimination phase, the speech goes through a pre-processing step in which it is divided into small chunks. Then the comparison algorithm executes, determines whether the voice is female or male, and provides a speaker id, which is the output of this phase. Now, how does the comparison algorithm work? How does it differentiate whether the audio file is in a female or male voice? In the dataset, every file is given a number, arranged such that every alternate file is female: if the file number is even, it is a female voice, and an odd-numbered file is a male voice.</p>
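      <p>The comparison rule described above reduces to a parity check on the file number; a minimal sketch, assuming the number has already been parsed from the filename:</p>

```python
def speaker_gender(file_number):
    """Even file numbers are female voices, odd ones male,
    per the dataset arrangement described above."""
    return "female" if file_number % 2 == 0 else "male"

print(speaker_gender(12))  # female
print(speaker_gender(7))   # male
```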
      <p>
        Simultaneously, the second phase, i.e., the speech recognition phase, is also executed. In this phase, the audio file again goes through a pre-processing step, in which an attempt is made to remove noise from the audio file. After this, a graph is plotted with the help of MFCC. MFCC has been a state-of-the-art feature in speech recognition tasks since it was introduced in the 1980s. The MFCC shape, as shown in Fig. 2, governs what sound comes out; if one can find the shape correctly, it should deliver a genuine depiction of the produced phoneme. With the MFCC graph's help, the next step is to plot the spectrogram of the audio, which is also the output of this phase and in which ranges are defined for the different polarities. Loading audio data and converting it to MFCC format can be performed speedily using the Python package Librosa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
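      <p>The non-linear frequency scale behind MFCC is the mel scale; its standard conversion from hertz can be sketched as follows (the Librosa call in the comment is indicative of the usual API, with an illustrative coefficient count, not the paper's setting):</p>

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale warp: roughly linear below 1 kHz,
    logarithmic above, mimicking human pitch perception."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(round(hz_to_mel(700.0), 2))   # 781.17
# In practice, Librosa wraps the whole feature computation, e.g.:
# mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
```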
      <p>After the execution of the first two phases, the outputs of both phases are parsed. Depending on the spectrogram, the polarity is determined: whether the particular audio file carries a positive, negative, or neutral stance.</p>
      <p>
        The final phase is the sentiment analysis phase. Here a Convolutional Neural Network (CNN) is used to further classify the polarity into different emotions. For instance, positive polarity is classified into happy and surprise, while negative polarity is further divided into anger and sad, and so on [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The CNN model is built with the assistance of Keras and consists of seven layers: six Conv1D layers followed by a dense layer. The proposed model was trained for only 700 epochs, without any learning-rate schedule. The model is assessed via its loss function and accuracy as the evaluation metric.</p>
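      <p>The paper does not give the layer widths or kernel sizes, so as a minimal numpy illustration of the Conv1D operation those six layers apply (cross-correlating each filter along the feature sequence):</p>

```python
import numpy as np

def conv1d(x, kernels):
    """Valid-mode 1-D convolution: one output channel per kernel.
    np.convolve flips its second argument, so reversing the kernel
    yields the cross-correlation that Conv1D layers compute."""
    return np.stack([np.convolve(x, k[::-1], mode="valid") for k in kernels])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # a toy feature sequence
kernels = [np.array([1.0, 0.0, -1.0])]     # one difference-like filter
out = conv1d(x, kernels)
print(out)  # [[-2. -2. -2.]]
```

      <p>Stacking several such layers, each followed by a non-linearity, and ending with a dense softmax layer gives the overall shape of the classifier described above.</p>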
      <p>The followed procedure attains a model accuracy of 93.96%. The overall prediction accuracy is 75%. Fig. 5 shows the model-loss graph, which illustrates the gap in prediction between the training and the testing data used; the gap is smaller than with the other basic ML classifiers. Fig. 3 and Fig. 4 show the difference between the real and predicted values.</p>
      <p>The confusion matrix compares the expected results with the predicted results of a model. The confusion matrix shown in Fig. 6 shows that 107 times out of 160, the model predicts the correct outcome, whereas 53 times it gives a different result for a particular audio file. In Fig. 7 the accuracy graph for the same is shown. The comparison of negative and positive emotions of a male speaker is shown in the given confusion matrix. Though the CNN model needs a large data file and requires more memory, the overall accuracy is better than all the other techniques.</p>
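      <p>The 107-out-of-160 figure is simply the diagonal mass of the confusion matrix; the computation can be sketched as follows (the 2x2 matrix is a toy example with the same totals, not the paper's actual matrix):</p>

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Correct predictions sit on the diagonal of a confusion matrix."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

cm = np.array([[60, 20],
               [33, 47]])    # 60 + 47 = 107 correct of 160 total
print(accuracy_from_confusion(cm))  # 0.66875
```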
    </sec>
    <sec id="sec-8">
      <title>5. Conclusion and Future Scope</title>
      <p>
        In emotion classification, the choice of database plays an essential role in the exactness of the result. Feature extraction from the audio signal is the second most crucial step in this field [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are many methods for feature extraction, briefly covered in this paper; amongst them, the MFCC algorithm is widely used because it performs better than the other feature extraction techniques in the presence of noise. The next important part of sentiment analysis is the choice of classifier. CNN is the most often used for sentiment analysis and gives the best results on image-based data. As the confusion matrix shows, some emotions are easy to identify, while others are confused with other emotions, making it challenging for the model to identify which class the speech belongs to.
      </p>
      <p>Many other issues remain to be solved, such as diversity in emotion, recognizing spontaneous emotion, and speaker recognition in simultaneous conversation. The future scope of this work is to investigate different strategies for the ongoing issues in sentiment analysis. Also, extracting more useful features of the speech signal will enhance the model's accuracy and make it work better in a real-time system.</p>
    </sec>
    <sec id="sec-9">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Zhang, Shiqing, Shiliang Zhang, Tiejun Huang, and
          <string-name>
            <given-names>Wen</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <article-title>"Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching</article-title>
          .
          <source>" IEEE Transactions on Multimedia</source>
          <volume>20</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chaudhari</surname>
            ,
            <given-names>Shivaji J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ramesh</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Kagalkar</surname>
          </string-name>
          .
          <article-title>"Automatic speaker age estimation and gender-dependent emotion recognition."</article-title>
          <source>International Journal of Computer Applications</source>
          <volume>117</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Souraya</given-names>
            <surname>Ezzat</surname>
          </string-name>
          , Neamat El Gayar, and
          <string-name>
            <surname>Moustafa</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ghanem</surname>
          </string-name>
          . “
          <article-title>Investigating analysis of speech content through Text Classification</article-title>
          .”
          <source>International Conference of Soft Computing and Pattern Recognition</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Esraa</given-names>
            <surname>Ali</surname>
          </string-name>
          <string-name>
            <given-names>Hassan</given-names>
            , Neamat El Gayar, and
            <surname>Moustafa</surname>
          </string-name>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          . “
          <article-title>Emotions analysis of speech for call classification</article-title>
          .
          <source>” 10th International Conference on Intelligent Systems Design and Applications</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Abdul</given-names>
            <surname>Malik</surname>
          </string-name>
          <string-name>
            <surname>Badshah</surname>
          </string-name>
          , Jamil Ahmad, Nasir Rahim, and Sung Wook Baik. “
          <article-title>Speech emotion recognition from spectrograms with the deep convolutional neural network." IEEE(</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Bertero</surname>
            , Dario, and
            <given-names>Pascale</given-names>
          </string-name>
          <string-name>
            <surname>Fung</surname>
          </string-name>
          .
          <article-title>"A first look into a convolutional neural network for speech emotion detection." In 2017 IEEE international conference on acoustics, speech, and signal processing (ICASSP), IEEE (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Basu</surname>
            , Saikat, Jaybrata Chakraborty, Arnab Bag, and
            <given-names>Md</given-names>
          </string-name>
          <string-name>
            <surname>Aftabuddin</surname>
          </string-name>
          .
          <article-title>"A review on emotion recognition using speech."</article-title>
          <source>In 2017 International Conference on Inventive Communicationand Computational Technologies (ICICCT)</source>
          ,
          <source>IEEE</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Huang</surname>
            , Chenchen, Wei Gong, Wenlong Fu, and
            <given-names>Dongyu</given-names>
          </string-name>
          <string-name>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>"A research of speech emotion recognition based on deep belief network and SVM." Mathematical Problems in Engineering 2014 (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Chavhan</surname>
            , Yashpalsing,
            <given-names>M. L.</given-names>
          </string-name>
          <string-name>
            <surname>Dhore</surname>
            , and
            <given-names>Pallavi</given-names>
          </string-name>
          <string-name>
            <surname>Yesaware</surname>
          </string-name>
          .
          <article-title>"Speech emotion recognition using support vector machine."</article-title>
          <source>International Journal of Computer Applications</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jin</surname>
            , Qin,
            <given-names>Chengxin</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Shizhe</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , and
            <given-names>Huimin</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>"Speech emotion recognition with acoustic and lexical features." In 2015 IEEE international conference on acoustics, speech, and signal processing (ICASSP), IEEE (</article-title>
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>W. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. X.</given-names>
            <surname>Zou</surname>
          </string-name>
          .
          <article-title>"An experimental study of speech emotion recognition based on deep convolutional neural networks." In 2015 international conference on affective computing and intelligent interaction (ACII), IEEE (</article-title>
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Tripathi</surname>
            , Anjali, Upasana Singh,
            <given-names>Garima Bansal</given-names>
          </string-name>
          , Rishabh Gupta, and Ashutosh Kumar Singh.
          <article-title>"A Review on Emotion Detection and Classification using Speech."</article-title>
          <source>Available at SSRN</source>
          <volume>3601803</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>