                 WISE: Web-based Interactive Speech Emotion Classification

        Sefik Emre Eskimez*, Melissa Sturge-Apple†, Zhiyao Duan* and Wendi Heinzelman*
                           * Dept. of Electrical and Computer Engineering
                          † Dept. of Clinical and Social Sciences in Psychology
                               University of Rochester, Rochester, NY


                          Abstract

     The ability to classify emotions from speech is beneficial in a number of domains, including
     the study of human relationships. However, manual classification of emotions from speech is
     time consuming. Current technology supports the automatic classification of emotions from
     speech, but these systems have some limitations. In particular, existing systems are trained
     with a given data set and cannot adapt to new data, nor can they adapt to different users'
     notions of emotions. In this study, we introduce WISE, a web-based interactive speech emotion
     classification system. WISE has a web-based interface that allows users to upload speech data
     and automatically classify the emotions within this speech using pre-trained models. The user
     can then adjust the emotion label if the system's classification of the emotion does not agree
     with the user's perception, and this updated label is then fed back into the system to retrain
     the models. In this way, WISE enables the emotion classification models to be adapted over
     time. We evaluate WISE by simulating user interactions with the system using the LDC dataset,
     which has known, ground-truth labels. We evaluate the benefit of the user feedback enabled by
     WISE in situations where manually classifying emotions in a large dataset is costly, yet
     trained models alone cannot accurately classify the data.

1 Introduction

Accurately estimating the emotions of conversational partners plays a vital role in successful human
communication. A social-functional approach to human emotion emphasizes the interpersonal function of
emotion for the establishment and maintenance of social relationships [Campos et al., 1989],
[Ekman, 1992], [Keltner and Kring, 1998]. According to [Campos et al., 1989], "Emotions are not mere
feelings, but rather are processes of establishing, maintaining, or disrupting relations between the
person and the internal or external environment, when such relations are significant to the individual."
Thus, the expression and recognition of emotions facilitates social bonds through the conveyance of
information about one's internal state, disposition, intentions, and needs.

   In many situations, audio is the only recorded data for a social interaction, and estimating emotions
from speech becomes a critical task for psychological analysis. Today's technology allows for gathering
vast amounts of emotional speech data from the web, yet analyzing this content manually is impractical.
This prevents many interesting large-scale investigations.

   Given the amount of speech data being produced, there have been many attempts to create automatic
emotion classification systems. However, the performance of these systems is not as high as necessary in
many situations. Many potential applications would benefit from automated emotion classification systems,
such as call-center monitoring [Petrushin, 1999; Gupta, 2007], service robot interactions [Park et al.,
2009; Liu et al., 2013] and driver assistance systems [Jones and Jonsson, 2005; Tawari and Trivedi, 2010].
Indeed, there are many automated systems today that focus on speech [Sethu et al., 2008; Busso et al.,
2009; Rachuri et al., 2010; Bitouk et al., 2010; Stuhlsatz et al., 2011; Yang, 2015]. However, the emotion
classification accuracy of fully automated systems is still not satisfactory in many practical situations.

   In this study, we propose WISE, a web-based interactive speech emotion classification system. The
system uses a web-based interface that allows users to easily upload a speech file to the server for
emotion analysis, without the need to install any additional software. Once the speech files are uploaded,
the system classifies the emotions using a model trained on previously labeled training samples. Each
classification is also associated with a confidence value. The user can either accept or correct the
classification, to "teach" the system the user's specific concept of emotions. Over time, the system
adapts its emotion classification models to the user's concept, and can increase its classification
accuracy with respect to the user's concept of emotions.

   The key contribution of our work is that we provide an interactive speech-based emotion analysis
framework. This framework combines the machine's computational power with human users' high emotion
classification accuracy. Compared to purely manual labeling, it is much more efficient. Compared to fully
automated systems, it is much more accurate. This opens up possibilities for large-scale speech emotion
analysis with high accuracy.

   The proposed framework only considers offline labeling and returns labels in three categories:
emotion, arousal and valence, with time codes.



To evaluate our system, we have simulated the user-interface interactions in several settings by providing
ground-truth labels on behalf of the user. One of the scenarios is designed to be a baseline against which
we can compare the remaining scenarios. In another scenario, we test whether the system can adapt to
samples whose speaker is unknown to the system. The next scenario tests how the system's classification
confidence in a sample affects the system's accuracy. The full system is available for researchers to use
at http://www.ece.rochester.edu/projects/wcng.

   The rest of the paper is organized as follows. Section 2 contains a review of the related work.
Section 3 describes the WISE web user interface, while Section 4 explains the automated speech-based
emotion recognition system used in this work. We evaluate the WISE system in Section 5, and conclude our
work in Section 6.

2 Related Work

All-in-one frameworks for automatic emotion classification from speech, such as EmoVoice [Vogt et al.,
2008] and OpenEar [Eyben et al., 2009], are standalone software packages with various capabilities,
including audio recording, audio file reading, feature extraction, and emotion classification.

   EmoVoice allows the user to create a personal speech-based emotion recognizer, and it can track the
emotional state of the user in real time. Each user records their own speech emotion corpus to train the
system, and the system can then be used for real-time emotion classification for the same user. The system
outputs the x- and y-coordinates of an arousal-valence coordinate system with time codes. It is reported
in [Vogt et al., 2008] that EmoVoice has been used in several systems, including humanoid robot-human and
virtual agent-human interactions. EmoVoice does not consider user feedback once the classifier is trained,
whereas in our system, the user can continually train and improve the system.

   OpenEar is a multi-platform emotion classification software package that includes feature extraction
libraries written in C++, pre-trained models, and scripts to support model building. One of its main
modules is named SMILE (Speech and Music Interpretation by Large-Space Extraction), and it can extract
more than 500K features in real time. The other main module allows external classifiers and libraries such
as LibSVM [Chang and Lin, 2011] to be integrated and used in classification. OpenEar also supports popular
machine learning frameworks' data formats, such as the Hidden Markov Model Toolkit (HTK) [Young et al.,
2006], WEKA [Hall et al., 2009], and scikit-learn for Python [Pedregosa et al., 2011], and therefore allows
easy transition between frameworks. OpenEar's capability for batch processing, combined with its ease of
transitioning to other learning frameworks, makes it appealing for large databases.

   ANNEMO (ANNotating EMOtions) [Ringeval et al., 2013] is a web-based annotation tool that allows
labeling arousal, valence and social dimensions in audio-visual data. The states are represented between
-1 and 1, where the user changes the values using a slider. The social dimension is represented by
categories rather than numerical values: agreement, dominance, engagement, performance and rapport. No
automatic classification/labeling modules are included in ANNEMO.

   In contrast, WISE is a web-based system and can be used easily without installing any software, unlike
EmoVoice and OpenEar. WISE is similar to ANNEMO in terms of the web-based labeling aspect; however, WISE
only considers audio data and provides automatic classification as well.

                     Figure 1: Flow chart showing the operation of WISE.

3 Web-based Interaction

Our system's interface, shown in Figure 2, is web-based, allowing easy, secure access and use without
installing any software other than a modern browser.

   When a user uploads an audio file, the waveform appears on the main screen, allowing the user to select
different parts of the waveform. Selected parts can be played and labeled independently. These selected
parts are also added to a list, as shown in the bottom-left side of Figure 2. The user can download this
list by clicking on the "save" button in the interface.

   The labeling scheme is restricted to three categories: emotion, arousal and valence. Emotion category
elements are anger, disgust, fear, happy, neutral and sadness. Arousal category elements are active,
passive and neutral, and valence category elements are positive, negative and neutral. Our future work
includes adding user-defined emotion labels to the system.

   The user can request labels from the automated emotion classifier by clicking on the "request label"
button. The system then shows the suggested labels to the user.
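As a concrete illustration of the labels a user might receive for one selected segment, the snippet below
sketches one possible shape for the data behind the "request label" button. The field names and the use of
a Python dictionary are our own assumptions for illustration; the paper does not specify WISE's actual
data format.

# Hypothetical structure for one selected segment's suggested labels: one
# element from each of the three categories, with time codes and the
# classifier's confidence. Field names are illustrative, not WISE's format.
suggested_labels = {
    "start_time": 12.4,   # seconds from the start of the uploaded file
    "end_time": 15.9,
    "emotion":  {"label": "happy",    "confidence": 0.71},
    "arousal":  {"label": "active",   "confidence": 0.83},
    "valence":  {"label": "positive", "confidence": 0.66},
}

# The user either accepts the suggestion or overrides a field before the
# corrected label is fed back for retraining.
suggested_labels["emotion"]["label"] = "neutral"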



                                           Figure 2: WISE user interface screenshot.


   The next section describes the automated speech-based emotion classification system used in WISE.

4 Automated Emotion Classification System

There are various automated speech-based emotion classification systems [Sethu et al., 2008; Busso et al.,
2009; Rachuri et al., 2010; Bitouk et al., 2010; Stuhlsatz et al., 2011] that consider different features,
feature selection methods, classifiers and decision mechanisms. Our system is based on [Yang, 2015], which
provides a confidence value along with the classification label.

4.1   Features

Speech samples are divided into overlapping frames for feature extraction. The window and hop sizes are
set to 60 ms and 10 ms, respectively. For every frame that contains speech, the following features are
calculated: fundamental frequency (F0), 12 mel-frequency cepstral coefficients (MFCCs), energy, frequency
and bandwidth of the first four formants, zero-crossing rate, spectral roll-off, brightness, centroid,
spread, skewness, kurtosis, flatness, entropy, roughness, and irregularity, in addition to the derivatives
of these features. Statistical values such as minimum, maximum, mean, standard deviation and range
(i.e., max-min) are calculated from all frames within the sample. Additionally, speaking rate is calculated
over the entire sample. Hence, the final feature vector length is 331.
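The paper does not name the toolkit used for feature extraction. As a rough illustration of the
frame-level extraction followed by per-sample pooling described above, the sketch below computes a small
subset of the listed features (F0, 12 MFCCs, zero-crossing rate, spectral roll-off and centroid) with
librosa, which is our assumption, and pools them with the min/max/mean/std/range statistics. Only the
60 ms window and 10 ms hop come from the text; everything else is illustrative.

# A sketch of frame-level feature extraction and per-sample pooling, using
# librosa as an assumed toolkit. Only a subset of the features listed in
# Section 4.1 is computed here.
import numpy as np
import librosa

def extract_sample_features(path):
    y, sr = librosa.load(path, sr=None)
    frame = int(0.060 * sr)   # 60 ms window
    hop = int(0.010 * sr)     # 10 ms hop

    # Frame-level feature tracks (each row is one feature over time).
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr,
                     frame_length=frame, hop_length=hop)[np.newaxis, :]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame,
                                               hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame,
                                                 hop_length=hop)

    # Align track lengths, then append first-order derivatives (deltas).
    n = min(t.shape[1] for t in (f0, mfcc, zcr, rolloff, centroid))
    tracks = np.vstack([t[:, :n] for t in (f0, mfcc, zcr, rolloff, centroid)])
    tracks = np.vstack([tracks, np.diff(tracks, axis=1, prepend=tracks[:, :1])])

    # Pool each track over all frames with the statistics used in the paper.
    stats = [tracks.min(axis=1), tracks.max(axis=1), tracks.mean(axis=1),
             tracks.std(axis=1), np.ptp(tracks, axis=1)]
    return np.concatenate(stats)   # one fixed-length vector per sample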
4.2   Feature Selection

The system employs the support vector machine (SVM) recursive feature elimination method [Guyon et al.,
2002]. This approach uses the SVM weights to determine which features are more useful than others. After
the SVM is trained, the features are ranked according to their weights, the lowest-ranked feature is
eliminated from the list, and the process starts again until no features are left. Features are then
ranked in reverse order of elimination, and the top 80 features are used in the classification system.
Note that in Section 5.2, the features are selected beforehand and are not updated when a new sample is
added to the system.
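A minimal sketch of this selection step with scikit-learn's RFE is shown below. The linear SVM used for
the ranking stage and the placeholder data are assumptions; the paper does not state which SVM variant
drives the elimination.

# Recursive feature elimination down to the top 80 of the 331 features,
# sketched with scikit-learn. The linear kernel for the ranking SVM and the
# random placeholder data are assumptions.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 331))    # placeholder 331-dim feature vectors
y_train = rng.integers(0, 2, size=120)   # placeholder binary (one-vs-all) labels

ranker = LinearSVC(C=1.0, max_iter=10000)             # provides coef_ weights for ranking
selector = RFE(estimator=ranker, n_features_to_select=80, step=1)
selector.fit(X_train, y_train)

X_train_selected = selector.transform(X_train)        # keep the 80 top-ranked features
keep_mask = selector.support_                         # reuse this mask at classification time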



4.3   Classifier

Our system uses a one-against-all (OAA) binary SVM with a radial basis function (RBF) kernel for each
emotion, arousal and valence category element, for a total of 12 SVMs. The trained SVMs calculate
confidence scores for any sample that is being classified, and the system labels the sample with the class
of the binary classifier that has the maximum classification confidence on the considered sample.
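A sketch of this one-against-all arrangement with scikit-learn is given below. Taking the signed distance
to the margin (decision_function) as the per-class confidence is our assumption; the paper only states
that a confidence value accompanies each label.

# One-against-all RBF SVMs with a confidence score per class, sketched with
# scikit-learn. Data and class lists are placeholders.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sadness"]

def train_oaa_svms(X, labels, classes):
    """Train one binary RBF SVM per class (class vs. rest)."""
    models = {}
    for cls in classes:
        y_bin = np.array([1 if lab == cls else 0 for lab in labels])
        models[cls] = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_bin)
    return models

def classify_with_confidence(models, x):
    """Return the class whose binary SVM is most confident, plus its score."""
    scores = {cls: float(m.decision_function(x.reshape(1, -1))[0])
              for cls, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

The same construction is repeated for the arousal and valence categories (three classes each), which
accounts for the 12 binary SVMs in total.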
Figure 3: The results of emotion category for Scenarios I-III.

Figure 4: The results of arousal category for Scenarios I-III.

Figure 5: The results of valence category for Scenarios I-III.

5 Evaluation

To evaluate WISE and the benefit of user-assisted labeling of the data, we have simulated user-interface
interactions using the LDC database as the source of data for training, validation and testing.

5.1   Dataset

We use the Linguistic Data Consortium (LDC) Emotional Prosody Speech and Transcripts [Liberman et al.,
2002] database in our simulations. The LDC database contains samples from 15 emotion categories; however,
in our evaluation, we only use the 6 emotions listed in Section 3. The LDC database contains acted speech,
voiced by 7 professionals, 4 female and 3 male. The transcripts are in English and contain semantically
neutral utterances, such as dates and times.

5.2   Simulations

We have simulated user-interface interactions in different scenarios for which WISE can be used to enable
user feedback to improve classification accuracy. In these simulations, there are three data groups:
training, test and validation. We assume that the validation data represents the samples for which the
user provides the "correct" label. In each iteration, the system evaluates the test data using the current
models, and at the end of each iteration, a sample from the validation data is added to the training data
to update the models. Next, we describe the different scenarios in detail.

Scenario 0 - Baseline

In this scenario, the data from 1 of the 7 speakers is used for testing, while the remaining 6 speakers'
data are used for training and validation. Since only a limited amount of data is available from each
speaker in the next scenarios, we also limit the amount of validation data in this scenario. In this way,
the baseline becomes more comparable to the other scenarios.
   The training data starts with N samples from each class for each category. For the emotion
classification, there are only 2 samples available per class (emotion) for the validation data. However,
the arousal and valence categories have half the number of classes that the emotion category has;
therefore, 3 samples per class are available for the validation data in these categories. After the data
are chosen randomly, the system simulates the interaction process. This process is repeated for all
speakers, and the results are averaged over all 7 speakers and 200 trials.

Scenario I

This scenario has the same settings as Scenario 0, except that this time the testing data, as well as the
validation data, are chosen from a single speaker, and the training data is chosen from the remaining 6
speakers' data.

Scenario II

This scenario has the same settings as Scenario I with a single difference: in each round, the validation
data is ordered by ascending classifier confidence. Therefore, in each iteration, the validation sample on
which the system has the least confidence is the one added to the training data.
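The interaction loop of Section 5.2 and the confidence-based ordering of Scenario II can be sketched as
follows. This builds on the hypothetical train_oaa_svms and classify_with_confidence helpers from the
Section 4.3 sketch; it is an illustration under those assumptions, not the authors' evaluation code.

# Simulated user interaction: evaluate on the test set, then move the
# validation sample the simulated user "labels" into the training pool and
# retrain. least_confidence_first=True reproduces the Scenario II ordering.
import numpy as np

def simulate_interaction(X_train, y_train, X_val, y_val, X_test, y_test,
                         classes, least_confidence_first=False):
    X_train, y_train = list(X_train), list(y_train)
    pool = list(zip(X_val, y_val))       # samples the simulated user will label
    accuracies = []

    while pool:
        models = train_oaa_svms(np.array(X_train), y_train, classes)

        # Accuracy on the held-out test speaker with the current models.
        preds = [classify_with_confidence(models, x)[0] for x in X_test]
        accuracies.append(np.mean([p == t for p, t in zip(preds, y_test)]))

        # Pick the next "user-labeled" sample from the validation pool.
        if least_confidence_first:        # Scenario II
            confs = [classify_with_confidence(models, x)[1] for x, _ in pool]
            idx = int(np.argmin(confs))
        else:                             # Scenarios 0 and I
            idx = 0
        x_new, y_new = pool.pop(idx)
        X_train.append(x_new)
        y_train.append(y_new)             # simulated ground-truth feedback

    return accuracies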
Discussion

Figures 3-5 show the classification accuracy versus the number of added samples for each scenario, for the
emotion, arousal and valence categories, respectively. Note that the error bars represent the standard
deviation of the results over the 7 speakers and 200 trials.

   Scenario I shows the ability of WISE to enable adaptation of the models. In many situations, the
trained models of automatic systems have no information on the speaker to be classified. The comparison of
classification accuracy between Scenario 0 and Scenario I shows that adaptation to unknown data is vital
for accurate emotion estimation, as the accuracy increases greatly when data from the new user are added.

   For example, in Scenarios I and II, when N is 4 for the emotion category, the system's initial accuracy
starts around 37% and increases to approximately 63%, as can be seen in Figure 3, whereas in Scenario 0
the accuracy only increases to approximately 41%. In Scenarios I and II, when N is 10, the classification
accuracy starts higher than in the previous case, yet with the same number of added samples, both converge
to the same percentage. This suggests that our system can start from pre-trained models built on available
databases.

   The results of Scenario II suggest that adding the samples with low classification confidence is
slightly more beneficial than adding samples for which the system already has more confidence. Figures 3-5
show that the classifier in Scenario II converges to a slightly higher classification accuracy than the
one in Scenario I. This can be seen especially in the arousal category results.
                                                                          Emotion, social function, and psychopathology. Rev. Gen.
6 Conclusion

This study introduced and evaluated the WISE system, an interactive web-based emotion analysis framework
that assists in the classification of human emotion from voice data. The full system is available for the
community to use. The evaluation results show that the system can adapt to the user's choices and can
increase future classification accuracy when the speaker of a sample is unknown. Hence, WISE will enable
adaptive, large-scale emotion classification.



References

[Bitouk et al., 2010] Dmitri Bitouk, Ragini Verma, and Ani Nenkova. Class-level spectral features for
   emotion recognition. Speech Communication, 52(7):613–625, 2010.

[Busso et al., 2009] C. Busso, S. Lee, and S. Narayanan. Analysis of emotionally salient aspects of
   fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech, and Language
   Processing, 17(4):582–596, May 2009.

[Campos et al., 1989] Joseph J. Campos, Rosemary G. Campos, and Karen C. Barrett. Emergent themes in the
   study of emotional development and emotion regulation. Developmental Psychology, 25(3):394, 1989.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
   ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[Ekman, 1992] Paul Ekman. An argument for basic emotions. Cognition and Emotion, 6(3-4):169–200, 1992.

[Eyben et al., 2009] Florian Eyben, Martin Wöllmer, and Björn Schuller. openEAR - Introducing the Munich
   open-source emotion and affect recognition toolkit. In ACII, pages 576–581, 2009.

[Gupta, 2007] Purnima Gupta. Two-stream emotion recognition for call center monitoring. In Interspeech
   2007, 2007.

[Guyon et al., 2002] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection
   for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, March 2002.

[Hall et al., 2009] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and
   Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18,
   November 2009.

[Jones and Jonsson, 2005] Christian Martyn Jones and Ing-Marie Jonsson. Automatic recognition of affective
   cues in the speech of car drivers to allow appropriate responses. In Proceedings of the 17th Australia
   Conference on Computer-Human Interaction: Citizens Online: Considerations for Today and the Future
   (OZCHI '05), pages 1–10, Narrabundah, Australia, 2005.

[Keltner and Kring, 1998] Dacher Keltner and Ann M. Kring. Emotion, social function, and psychopathology.
   Review of General Psychology, 2(3):320, 1998.

[Liberman et al., 2002] Mark Liberman, Kelly Davis, M. Grossman, N. Martey, and J. Bell. Emotional Prosody
   Speech and Transcripts. Linguistic Data Consortium, 2002.

[Liu et al., 2013] Chih-Yin Liu, Tzu-Hsin Hung, Kai-Chung Cheng, and Tzuu-Hseng S. Li. HMM and BPNN based
   speech recognition system for home service robot. In 2013 International Conference on Advanced Robotics
   and Intelligent Systems (ARIS), pages 38–43. IEEE, 2013.

[Park et al., 2009] J. S. Park, J. H. Kim, and Y. H. Oh. Feature vector classification based speech
   emotion recognition for service robots. IEEE Transactions on Consumer Electronics, 55(3):1590–1596,
   August 2009.

[Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
   M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
   M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning
   Research, 12:2825–2830, 2011.

[Petrushin, 1999] Valery A. Petrushin. Emotion in speech: Recognition and application to call centers.
   In Engr, pages 7–10, 1999.

[Rachuri et al., 2010] Kiran K. Rachuri, Mirco Musolesi, Cecilia Mascolo, Peter J. Rentfrow, Chris
   Longworth, and Andrius Aucinas. EmotionSense: A mobile phones based adaptive platform for experimental
   social psychology research. In Proc. 12th ACM International Conference on Ubiquitous Computing, pages
   281–290, 2010.

[Ringeval et al., 2013] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. Introducing the RECOLA
   multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE International
   Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8, April 2013.

[Sethu et al., 2008] Vidhyasaharan Sethu, Eliathamby Ambikairajah, and Julien Epps. Empirical mode
   decomposition based weighted frequency feature for speech-based emotion classification. In Proc. IEEE
   ICASSP, pages 5017–5020, 2008.

[Stuhlsatz et al., 2011] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller. Deep
   neural networks for acoustic emotion recognition: Raising the benchmarks. In 2011 IEEE International
   Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5688–5691, May 2011.

[Tawari and Trivedi, 2010] A. Tawari and M. Trivedi. Speech based emotion classification framework for
   driver assistance system. In 2010 IEEE Intelligent Vehicles Symposium (IV), pages 174–178, June 2010.

[Vogt et al., 2008] Thurid Vogt, Elisabeth André, and Nikolaus Bee. EmoVoice - A framework for online
   recognition of emotions from voice. In Proceedings of the Workshop on Perception and Interactive
   Technologies for Speech-Based Systems, Springer, Kloster Irsee, 2008.

[Yang, 2015] Na Yang. Algorithms for Affective and Ubiquitous Sensing Systems and for Protein Structure
   Prediction. PhD thesis, University of Rochester, 2015. http://hdl.handle.net/1802/29666.

[Young et al., 2006] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell,
   D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University
   Engineering Department, Cambridge, UK, 2006.