=Paper=
{{Paper
|id=Vol-1619/paper1
|storemode=property
|title=WISE: Web-based Interactive Speech Emotion Classification
|pdfUrl=https://ceur-ws.org/Vol-1619/paper1.pdf
|volume=Vol-1619
|authors=Sefik Emre Eskimez,Melissa Sturge-Apple,Zhiyao Duan,Wendi Heinzelman
|dblpUrl=https://dblp.org/rec/conf/ijcai/EskimezSDH16
}}
==WISE: Web-based Interactive Speech Emotion Classification==
Sefik Emre Eskimez*, Melissa Sturge-Apple†, Zhiyao Duan* and Wendi Heinzelman*

*Dept. of Electrical and Computer Engineering, †Dept. of Clinical and Social Sciences in Psychology, University of Rochester, Rochester, NY

Abstract

The ability to classify emotions from speech is beneficial in a number of domains, including the study of human relationships. However, manual classification of emotions from speech is time consuming. Current technology supports the automatic classification of emotions from speech, but these systems have some limitations. In particular, existing systems are trained with a given data set and cannot adapt to new data, nor can they adapt to different users' notions of emotions. In this study, we introduce WISE, a web-based interactive speech emotion classification system. WISE has a web-based interface that allows users to upload speech data and automatically classify the emotions within this speech using pre-trained models. The user can then adjust the emotion label if the system's classification of the emotion does not agree with the user's perception, and this updated label is then fed back into the system to retrain the models. In this way, WISE enables the emotion classification models to be adapted over time. We evaluate WISE by simulating the user interactions with the system using the LDC dataset, which has known, ground-truth labels. We evaluate the benefit of the user feedback enabled by WISE in situations where manually classifying emotions in a large dataset is costly, yet trained models alone will not be able to accurately classify the data.
1 Introduction

Accurately estimating the emotions of conversational partners plays a vital role in successful human communication. A social-functional approach to human emotion emphasizes the interpersonal function of emotion for the establishment and maintenance of social relationships [Campos et al., 1989], [Ekman, 1992], [Keltner and Kring, 1998]. According to [Campos et al., 1989], "Emotions are not mere feelings, but rather are processes of establishing, maintaining, or disrupting relations between the person and the internal or external environment, when such relations are significant to the individual." Thus, the expression and recognition of emotions allows for the facilitation of social bonds through the conveyance of information about one's internal state, disposition, intentions, and needs.

In many situations, audio is the only recorded data for a social interaction, and estimating emotions from speech becomes a critical task for psychological analysis. Today's technology allows for gathering vast amounts of emotional speech data from the web, yet manually analyzing this content is impractical. This prevents many interesting large-scale investigations.

Given the amount of speech data that proliferates, there have been many attempts to create automatic emotion classification systems. However, the performance of these systems is not as high as necessary in many situations. Many potential applications would benefit from automated emotion classification systems, such as call-center monitoring [Petrushin, 1999; Gupta, 2007], service robot interactions [Park et al., 2009; Liu et al., 2013] and driver assistance systems [Jones and Jonsson, 2005; Tawari and Trivedi, 2010]. Indeed, there are many automated systems today that focus on speech [Sethu et al., 2008; Busso et al., 2009; Rachuri et al., 2010; Bitouk et al., 2010; Stuhlsatz et al., 2011; Yang, 2015]. However, the emotion classification accuracy of fully automated systems is still not satisfactory in many practical situations.

In this study, we propose WISE, a web-based interactive speech emotion classification system. This system uses a web-based interface that allows users to easily upload a speech file to the server for emotion analysis, without the need to install any additional software. Once the speech files are uploaded, the system classifies the emotions using a model trained on previously labeled training samples. Each classification is also associated with a confidence value. The user can either accept or correct the classification, to "teach" the system the user's specific concept of emotions. Over time, the system adapts its emotion classification models to the user's concept, and can increase its classification accuracy with respect to the user's concept of emotions.

The key contribution of our work is that we provide an interactive speech-based emotion analysis framework. This framework combines the machine's computational power with human users' high emotion classification accuracy. Compared to purely manual labeling, it is much more efficient; compared to fully automated systems, it is much more accurate. This opens up possibilities for large-scale speech emotion analysis with high accuracy.

The proposed framework only considers offline labeling and returns labels in three categories: emotion, arousal and valence, with time codes. To evaluate our system, we have simulated the user-interface interactions in several settings by providing ground-truth labels on behalf of the user. One of the scenarios is designed to be a baseline against which we can compare the remaining scenarios. In another scenario, we test whether the system can adapt to samples whose speaker is unknown to the system. The next scenario tests how the system's classification confidence in a sample affects the system's accuracy. The full system is available for researchers to use (http://www.ece.rochester.edu/projects/wcng).

The rest of the paper is organized as follows. Section 2 contains a review of the related work. Section 3 describes the WISE web user interface, while Section 4 explains the automated speech-based emotion recognition system used in this work. We evaluate the WISE system in Section 5, and conclude our work in Section 6.

Figure 1: Flow chart showing the operation of WISE.

2 Related Work

All-in-one frameworks for automatic emotion classification from speech, such as EmoVoice [Vogt et al., 2008] and OpenEar [Eyben et al., 2009], are standalone software packages with various capabilities, including audio recording, audio file reading, feature extraction, and emotion classification.

EmoVoice allows the user to create a personal speech-based emotion recognizer, and it can track the emotional state of the user in real time. Each user records their own speech emotion corpus to train the system, and the system can then be used for real-time emotion classification for the same user. The system outputs the x- and y-coordinates of an arousal-valence coordinate system with time codes. It is reported in [Vogt et al., 2008] that EmoVoice has been used in several systems, including humanoid robot-human and virtual agent-human interactions. EmoVoice does not consider user feedback once the classifier is trained, whereas in our system the user can continually train and improve the system.

OpenEar is a multi-platform emotion classification software package that includes feature extraction libraries written in C++, pre-trained models, and scripts to support model building. One of its main modules, SMILE (Speech and Music Interpretation by Large-Space Extraction), can extract more than 500K features in real time. The other main module allows external classifiers and libraries such as LibSVM [Chang and Lin, 2011] to be integrated and used in classification. OpenEar also supports the data formats of popular machine learning frameworks, such as the Hidden Markov Model Toolkit (HTK) [Young et al., 2006], WEKA [Hall et al., 2009], and scikit-learn for Python [Pedregosa et al., 2011], and therefore allows easy transitions between frameworks. OpenEar's batch-processing capability, combined with this ease of transitioning to other learning frameworks, makes it appealing for large databases.

ANNEMO (ANNotating EMOtions) [Ringeval et al., 2013] is a web-based annotation tool that allows labeling arousal, valence and social dimensions in audio-visual data. The states are represented between -1 and 1, where the user changes the values using a slider. The social dimension is represented by categories rather than numerical values: agreement, dominance, engagement, performance and rapport.
No automatic classification or labeling modules are included in ANNEMO.

In contrast, WISE is a web-based system and can be used easily without installing any software, unlike EmoVoice and OpenEar. WISE is similar to ANNEMO in terms of web-based labeling; however, WISE only considers audio data and provides automatic classification as well.

3 Web-based Interaction

Our system's interface, shown in Figure 2, is web-based, allowing easy, secure access and use without installing any software other than a modern browser.

Figure 2: WISE user interface screenshot.

When a user uploads an audio file, the waveform appears on the main screen, allowing the user to select different parts of the waveform. Selected parts can be played and labeled independently. These selected parts are also added to a list, shown at the bottom-left of Figure 2. The user can download this list by clicking the "save" button in the interface.

The labeling scheme is restricted to three categories: emotion, arousal and valence. The emotion category elements are anger, disgust, fear, happy, neutral and sadness. The arousal category elements are active, passive and neutral, and the valence category elements are positive, negative and neutral. Our future work includes adding user-defined emotion labels to the system.

The user can request labels from the automated emotion classifier by clicking the "request label" button. The system then shows the suggested labels to the user.
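To make the labeling scheme concrete, the sketch below shows one way the three label vocabularies and a suggested label for a selected region could be represented. The label sets come from the description above; the field names, endpoint behavior and numeric values are illustrative assumptions only, not the actual WISE server interface.

```python
# Illustrative sketch only: label vocabularies are from the paper, but the
# suggested-label structure and field names are hypothetical, not WISE's API.

EMOTIONS = {"anger", "disgust", "fear", "happy", "neutral", "sadness"}
AROUSAL = {"active", "passive", "neutral"}
VALENCE = {"positive", "negative", "neutral"}

def validate_user_label(label: dict) -> bool:
    """Check that a label uses only the three restricted categories."""
    return (label.get("emotion") in EMOTIONS
            and label.get("arousal") in AROUSAL
            and label.get("valence") in VALENCE)

# A hypothetical "request label" result for one selected part of the waveform:
# one label per category, time codes for the selection, and a confidence value
# that the user can accept or correct before saving the list.
suggested_label = {
    "start_sec": 3.2,     # selection start (time code)
    "end_sec": 5.9,       # selection end (time code)
    "emotion": "happy",
    "arousal": "active",
    "valence": "positive",
    "confidence": 0.71,   # classifier confidence for the winning classes
}

if validate_user_label(suggested_label):
    print("Suggested label is well-formed; the user may accept or correct it.")
```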
The next section describes the automated speech-based emotion classification system used in WISE.

4 Automated Emotion Classification System

There are various automated speech-based emotion classification systems [Sethu et al., 2008; Busso et al., 2009; Rachuri et al., 2010; Bitouk et al., 2010; Stuhlsatz et al., 2011] that consider different features, feature selection methods, classifiers and decision mechanisms. Our system is based on [Yang, 2015], which provides a confidence value along with the classification label.

4.1 Features

Speech samples are divided into overlapping frames for feature extraction. The window and hop sizes are set to 60 ms and 10 ms, respectively. For every frame that contains speech, the following features are calculated: fundamental frequency (F0), 12 mel-frequency cepstral coefficients (MFCCs), energy, frequency and bandwidth of the first four formants, zero-crossing rate, spectral roll-off, brightness, centroid, spread, skewness, kurtosis, flatness, entropy, roughness, and irregularity, in addition to the derivatives of these features. Statistical values such as the minimum, maximum, mean, standard deviation and range (i.e., max minus min) are calculated over all frames within the sample. Additionally, the speaking rate is calculated over the entire sample. Hence, the final feature vector length is 331.

4.2 Feature Selection

The system employs the support vector machine (SVM) recursive feature elimination method [Guyon et al., 2002]. This approach takes advantage of the SVM weights to detect which features are more informative than others. After the SVM is trained, the features are ranked according to their weights. The lowest-ranked feature is eliminated from the list and the process starts again, until there are no features left. Features are then ranked in reverse order of elimination, and the top 80 features are chosen for use in the classification system. Note that in Section 5.2, the features are selected beforehand and are not updated when a new sample is added to the system.
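As an illustration of Sections 4.1 and 4.2, the sketch below aggregates frame-level features into a fixed-length utterance vector and then ranks features with SVM recursive feature elimination, keeping the top 80. This is a minimal sketch, not the authors' implementation: it uses scikit-learn's RFE with a linear SVM for the ranking step (the weight-based ranking of [Guyon et al., 2002]), and the dimensionality bookkeeping in the comments (33 frame-level features plus their derivatives, five statistics each, plus speaking rate) is one breakdown consistent with the reported length of 331, not a breakdown stated in the paper.

```python
# Minimal sketch (not the authors' code) of Sections 4.1-4.2: per-utterance
# statistics over frame-level features, followed by SVM-RFE feature ranking.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

def utterance_vector(frame_feats: np.ndarray, speaking_rate: float) -> np.ndarray:
    """frame_feats: (n_frames, 66) array of frame-level features and their
    derivatives (33 + 33 is one reading consistent with the reported 331).
    Returns min, max, mean, std and range per dimension, plus speaking rate."""
    stats = np.concatenate([
        frame_feats.min(axis=0),
        frame_feats.max(axis=0),
        frame_feats.mean(axis=0),
        frame_feats.std(axis=0),
        frame_feats.max(axis=0) - frame_feats.min(axis=0),  # range
    ])
    return np.append(stats, speaking_rate)  # 66 * 5 + 1 = 331 dimensions

def select_top_features(X: np.ndarray, y: np.ndarray, k: int = 80) -> np.ndarray:
    """SVM-RFE: drop the weakest feature (by linear-SVM weight) each round
    (step=1) until none remain, then keep the k best-ranked features."""
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=k, step=1)
    rfe.fit(X, y)
    return rfe.support_  # boolean mask of the selected features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: 60 utterances, 6 emotion classes.
    X = np.stack([utterance_vector(rng.normal(size=(200, 66)), rng.uniform(2, 6))
                  for _ in range(60)])
    y = rng.integers(0, 6, size=60)
    mask = select_top_features(X, y)
    print("selected", int(mask.sum()), "of", X.shape[1], "features")
```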
4.3 Classifier

Our system uses a one-against-all (OAA) binary SVM with a radial basis function (RBF) kernel for each emotion, arousal and valence category element, for a total of 12 SVMs. The trained SVMs calculate confidence scores for any sample being classified. The system labels the sample with the class of the binary classifier that has the maximum classification confidence on the considered sample.

5 Evaluation

To evaluate WISE and the benefit of user-assisted labeling of the data, we have simulated user-interface interactions using the LDC database as the source of data for training, validation and testing.

5.1 Dataset

We use the Linguistic Data Consortium (LDC) Emotional Prosody Speech and Transcripts database [Liberman et al., 2002] in our simulations. The LDC database contains samples from 15 emotion categories; however, in our evaluation, we only use the 6 emotions listed in Section 3. The LDC database contains acted speech, voiced by 7 professionals, 4 female and 3 male. The transcripts are in English and contain semantically neutral utterances, such as dates and times.

5.2 Simulations

We have simulated user-interface interactions in different scenarios for which WISE can be used to enable user feedback to improve classification accuracy. In these simulations, there are three data groups: training, test and validation. We assume that the validation data represents the samples for which the user provides the "correct" label. In each iteration, the system evaluates the test data using the current models, and at the end of each iteration, a sample from the validation data is added to the training data to update the models. Next, we describe the different scenarios in detail.

Scenario 0 - Baseline

In this scenario, the data from 1 of the 7 speakers is used for testing, while the remaining 6 speakers' data are used for training and validation. Since only a limited amount of data is available from each speaker in the next scenarios, we also limit the amount of validation data in this scenario. In this way, the baseline becomes more comparable to the other scenarios.

The training data starts with N samples from each class for each category. For emotion classification, there are only 2 samples available per class (emotion) for the validation data. However, the arousal and valence categories have half the number of classes that the emotion category has; therefore, there are 3 samples available per class that can be used as validation data for these categories. After the data are chosen randomly, the system simulates the interaction process. This process is repeated for all speakers, and the results are averaged over all 7 speakers and 200 trials.

Scenario I

This scenario has the same settings as Scenario 0, except that this time the testing data as well as the validation data are chosen from a single speaker, and the training data is chosen from the remaining 6 speakers' data.

Scenario II

This scenario has the same settings as Scenario I with a single difference: in each round, the validation data is ordered in ascending order of the classifier's confidence in classifying it. Therefore, in each iteration, the sample on which the system has the least confidence is moved from the validation data to the training data.
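The sketch below illustrates this interaction loop for a single category: one-against-all RBF SVMs (Section 4.3) provide a confidence score per class, the label of the most confident binary classifier is assigned, and in each round the validation sample on which the current models are least confident is moved into the training set, as in Scenario II. It is a simplified sketch under assumed data shapes and function names, not the evaluation code used for the paper.

```python
# Simplified sketch (not the paper's evaluation code) of the Scenario II loop
# for one category: OAA RBF SVMs with confidence scores, adding the
# least-confident validation sample to the training data in each iteration.
import numpy as np
from sklearn.svm import SVC

def fit_oaa(X, y, classes):
    """Train one binary RBF SVM per class (one-against-all)."""
    return {c: SVC(kernel="rbf").fit(X, (y == c).astype(int)) for c in classes}

def confidences(models, X):
    """Signed distance to each binary decision boundary, one column per class."""
    return np.column_stack([models[c].decision_function(X) for c in models])

def simulate_scenario_II(X_tr, y_tr, X_val, y_val, X_te, y_te, classes):
    class_arr = np.array(list(classes))
    accuracies = []
    while len(y_val) > 0:
        models = fit_oaa(X_tr, y_tr, classes)
        # Label each test sample with the class of the most confident binary SVM.
        pred = class_arr[confidences(models, X_te).argmax(axis=1)]
        accuracies.append((pred == y_te).mean())
        # Scenario II: move the validation sample with the lowest maximum
        # confidence into the training set (the user supplies its true label).
        i = confidences(models, X_val).max(axis=1).argmin()
        X_tr = np.vstack([X_tr, X_val[i:i + 1]])
        y_tr = np.append(y_tr, y_val[i])
        X_val = np.delete(X_val, i, axis=0)
        y_val = np.delete(y_val, i)
    return accuracies
```

Scenario I corresponds to the same loop with the validation sample chosen without regard to the classifier's confidence.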
Discussion

Figure 3: The results of the emotion category for Scenarios I-III.
Figure 4: The results of the arousal category for Scenarios I-III.
Figure 5: The results of the valence category for Scenarios I-III.

Figures 3-5 show the classification accuracy versus the number of added samples for each scenario for the emotion, arousal and valence categories, respectively. Note that the error bars represent the standard deviation of the results over the 7 speakers and 200 trials.

Scenario I shows the ability of WISE to enable adaptation of the models. In many situations, the trained models of automatic systems have no information on the speaker to be classified. The comparison of classification accuracy between Scenario 0 and Scenario I shows that adaptation to unknown data is vital for accurate emotion estimation, as the accuracy increases greatly when data from the new user are added. For example, in Scenarios I and II, when N is 4 for the emotion category, the system's initial accuracy starts around 37% and increases to approximately 63%, as can be seen in Figure 3, whereas in Scenario 0 the accuracy only increases to approximately 41%. In Scenarios I and II, when N is 10, the classification accuracy starts higher than in the previous case, yet with the same number of added samples it converges to the same percentage. This enables the possibility of using pre-trained models in our system that are trained on available databases.

The results of Scenario II suggest that adding the samples with low classification confidence is slightly more beneficial than adding samples for which the system already has more confidence. Figures 3-5 show that the classifier in Scenario II converges to a slightly higher classification accuracy than the one in Scenario I. This can be seen especially in the arousal category results.

6 Conclusion

This study introduced and evaluated the WISE system, an interactive web-based emotion analysis framework to assist in the classification of human emotion from voice data. The full system is available for the community to use. The evaluation results show that the system can adapt to the user's choices and can increase future classification accuracy when the speaker of the sample is unknown. Hence, WISE will enable adaptive, large-scale emotion classification.

References

[Bitouk et al., 2010] Dmitri Bitouk, Ragini Verma, and Ani Nenkova. Class-level spectral features for emotion recognition. Speech Communication, 52(7):613–625, 2010.

[Busso et al., 2009] C. Busso, S. Lee, and S. Narayanan. Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):582–596, May 2009.

[Campos et al., 1989] Joseph J. Campos, Rosemary G. Campos, and Karen C. Barrett. Emergent themes in the study of emotional development and emotion regulation. Developmental Psychology, 25(3):394, 1989.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[Ekman, 1992] Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992.

[Eyben et al., 2009] Florian Eyben, Martin Wöllmer, and Björn Schuller. openEAR - introducing the Munich open-source emotion and affect recognition toolkit. In ACII, pages 576–581, 2009.

[Gupta, 2007] Purnima Gupta. Two-stream emotion recognition for call center monitoring. In Interspeech 2007, 2007.

[Guyon et al., 2002] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, March 2002.

[Hall et al., 2009] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18, November 2009.

[Jones and Jonsson, 2005] Christian Martyn Jones and Ing-Marie Jonsson. Automatic recognition of affective cues in the speech of car drivers to allow appropriate responses. In Proceedings of the 17th Australia Conference on Computer-Human Interaction (OZCHI '05), pages 1–10, Narrabundah, Australia, 2005.

[Keltner and Kring, 1998] Dacher Keltner and Ann M. Kring. Emotion, social function, and psychopathology. Review of General Psychology, 2(3):320, 1998.

[Liberman et al., 2002] Mark Liberman, Kelly Davis, M. Grossman, N. Martey, and J. Bell. Emotional Prosody Speech and Transcripts. Linguistic Data Consortium, 2002.

[Liu et al., 2013] Chih-Yin Liu, Tzu-Hsin Hung, Kai-Chung Cheng, and Tzuu-Hseng S. Li. HMM and BPNN based speech recognition system for home service robot. In 2013 International Conference on Advanced Robotics and Intelligent Systems (ARIS), pages 38–43. IEEE, 2013.

[Park et al., 2009] J. S. Park, J. H. Kim, and Y. H. Oh. Feature vector classification based speech emotion recognition for service robots. IEEE Transactions on Consumer Electronics, 55(3):1590–1596, August 2009.
[Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Petrushin, 1999] Valery A. Petrushin. Emotion in speech: Recognition and application to call centers. In Proceedings of Artificial Neural Networks in Engineering (ANNIE), pages 7–10, 1999.

[Rachuri et al., 2010] Kiran K. Rachuri, Mirco Musolesi, Cecilia Mascolo, Peter J. Rentfrow, Chris Longworth, and Andrius Aucinas. EmotionSense: A mobile phones based adaptive platform for experimental social psychology research. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing, pages 281–290, 2010.

[Ringeval et al., 2013] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8, April 2013.

[Sethu et al., 2008] Vidhyasaharan Sethu, Eliathamby Ambikairajah, and Julien Epps. Empirical mode decomposition based weighted frequency feature for speech-based emotion classification. In Proceedings of IEEE ICASSP, pages 5017–5020, 2008.

[Stuhlsatz et al., 2011] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5688–5691, May 2011.

[Tawari and Trivedi, 2010] A. Tawari and M. Trivedi. Speech based emotion classification framework for driver assistance system. In 2010 IEEE Intelligent Vehicles Symposium (IV), pages 174–178, June 2010.

[Vogt et al., 2008] Thurid Vogt, Elisabeth André, and Nikolaus Bee. EmoVoice - a framework for online recognition of emotions from voice. In Proceedings of the Workshop on Perception and Interactive Technologies for Speech-Based Systems, Springer, Kloster Irsee, 2008.

[Yang, 2015] Na Yang. Algorithms for Affective and Ubiquitous Sensing Systems and for Protein Structure Prediction. PhD thesis, University of Rochester, 2015. http://hdl.handle.net/1802/29666.

[Young et al., 2006] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.