Workshop on Multimodal Semantics for Robotic Systems (MuSRobS)
IEEE/RSJ International Conference on Intelligent Robots and Systems 2015

A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction

Felipe Cid, Luis J. Manso and Pedro Núñez

L. J. Manso and P. Núñez are members of the Robotics and Artificial Vision Lab, Robolab Group, University of Extremadura, Spain (e-mail: lmanso@unex.es; pnuntru@unex.es). F. Cid is with the Institute of Electrical and Electronics, Universidad Austral de Chile, Chile (e-mail: felipe.cid@uach.cl).

Abstract— Facial expressions and speech are elements that provide emotional information about the user through multiple communication channels. In this paper, a novel multimodal emotion recognition system based on visual and auditory information processing is proposed. The proposed approach is used in real affective human robot communication in order to estimate five different emotional states (i.e., happiness, anger, fear, sadness and neutral), and it consists of two subsystems with similar structure. The first subsystem achieves robust facial feature extraction based on filters consecutively applied to the edge image and the use of a Dynamic Bayesian Classifier. A similar classifier is used in the second subsystem, where the input is associated with a set of speech descriptors, such as speech-rate, energy and pitch. Both subsystems are finally combined in real time. The results of this multimodal approach show the robustness and accuracy of the methodology with respect to single-mode emotion recognition systems.

I. INTRODUCTION

In the last decade, Human Robot Interaction (HRI) has become one of the most important issues in social robotics. Within HRI, one of the main objectives is the development of methodologies focused on non-invasive techniques based on natural language. This would allow the robot to interact with users in a similar way to humans, through multimodal systems that combine information from different channels. In order to socially interact with humans, a robotic system should be able not only to understand users' behaviour and intentions, but also to estimate their emotional state. Knowing and understanding these human emotions helps social robots adapt the communication in real time, improving and enriching the interaction [1]. This kind of HRI is usually known as affective HRI.

The design of social robots aims at developing natural interfaces for an affective interaction. In this context, most of the current affective HRI techniques use a single information channel, called mode, such as speech, facial expressiveness or body language. However, an emotion is generally expressed through several modalities. In multimodal affective interaction, the user can communicate his/her emotional state to the robot using several different input channels [2]. Contrary to approaches where each channel introduces complementary information into the emotion recognition system, most multimodal systems use these channels as redundant information. This redundancy is useful in real interaction scenarios, where, for instance, errors associated with noise or occlusions can be reduced.

This work presents a novel multimodal emotion recognition system for affective HRI. The proposed approach is based on a real-time multimodal system that integrates speech and facial expression analysis. The main contribution of this work is a robust feature extraction approach for facial expression recognition. In this paper, these facial features are a combination of independent and antagonistic distortions of the face. Besides, a novel acoustic analysis of speech is used to extract features relevant to emotion. Both visual and acoustic features constitute the input of two Dynamic Bayesian Networks (DBNs) [3], which classify them into a set of basic emotions. The output of each DBN estimates the user's emotional state based on the extracted features, the emotional states available in each DBN being: happiness, sadness, anger, fear and neutral (non-emotional state). Finally, the purpose of this system is to integrate the output information of each subsystem in a third DBN that analyzes the information associated with each mode. This multimodal methodology uses redundant information to estimate accurate and robust results on the user's emotional state.

This paper is organized as follows: after discussing known approaches to emotion recognition systems from facial expressions, speech or multimodal systems in Section II, Section III presents an overview of the proposed Multimodal Emotion Recognition System. In Section IV, the experimental results are pointed out, and finally, Section V describes the conclusions and future work of the approach.

II. RELATED WORKS

In the field of HRI, different automatic emotion recognition systems have been studied. Most of these approaches are based on single information channel analysis, such as video sequences or audio signals. Independent of the nature of the information source, the raw data is processed and a set of features is extracted. Then, these features are classified into different categories, i.e., emotions. On one hand, facial expressions have been commonly used to detect and recognize human emotions. An interesting and up-to-date review is presented in [4]. Commonly, these frameworks use the Facial Action Coding System (FACS) proposed by Ekman et al. [5], which is based on facial muscle deformations. On the other hand, speech has also been used for emotion recognition (see the review [6]). These systems are usually focused on variations of acoustic variables that are related to emotions.

Approaches that only use visual or speech information individually usually fail in real scenarios. Light conditions, shadows or occlusions, among others, are typical situations where the accuracy of the results decreases for visual systems. In a similar way, environmental noise or people moving while talking are error sources in audio systems. Therefore, for an efficient affective HRI, several authors have focused their attention on multimodal systems that recognize the emotional state of the user from different modalities or information sources such as: face, gesture, body language, speech or physiological signals, among others. For instance, in [3] the authors develop an architecture for multimodal emotion recognition, where facial expressiveness and speech are used. In [7], Kessous et al. propose a system that fuses body and facial language, and speech. Also, interesting reviews are presented in [2], [8]. Most of these approaches use a dominant mode in the classifier strategy, that is, when the probability of the detected emotion is low, other channel information is used. In contrast to these works, this paper presents a multimodal approach where all the input modes are analyzed in real time and the fusion strategy consists of a Dynamic Bayesian Network classifier. This multimodal fusion is produced at the decision level (i.e., the information is integrated from the single modalities after being interpreted by their own classifiers).

III. MULTIMODAL EMOTION RECOGNITION SYSTEM

In this section, the proposed multimodal emotion recognition system is presented. The framework consists of two subsystems running in parallel that estimate human emotions using two independent DBNs. The facial expression recognition subsystem (see Fig. 1) is based on a fast and robust feature extraction approach where morphological and convolutional filters are applied consecutively to reduce the noise and the dependence on luminosity changes. After that, a Gabor filter is used for efficient edge detection. The output edge image of this filter bank is used to detect and extract scale-invariant facial features, which will be the input variables of the DBN to estimate the user's emotional state. In the second subsystem, the user's speech is analyzed in order to extract a set of independent descriptors. These descriptors are the input of a second DBN. Finally, the proposed system integrates the information associated with both methods in a third DBN, which estimates the final user emotion. The combination of both subsystems in this third DBN produces an estimate of the emotional state of the user, within four possible emotional states (happiness, sadness, anger, fear) and a non-emotional state (neutral). Each subsystem is described in detail in the following subsections.
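The paper does not detail the internal structure of this third DBN, but the decision-level fusion can be illustrated with a minimal Python sketch, assuming that each subsystem exposes a posterior distribution over the five emotional states and that the two channels are conditionally independent given the emotion; the function name, the uniform prior and the example values are illustrative assumptions, not the authors' implementation.

# Hypothetical decision-level fusion of the two subsystem outputs.
# Assumes conditional independence of the channels given the emotion
# and a uniform prior; this is NOT the exact third DBN of the paper.
EMOTIONS = ("neutral", "happiness", "sadness", "fear", "anger")

def fuse_posteriors(p_face, p_speech, prior=None):
    """Combine the per-channel posteriors P(FE|face) and P(FE|speech)."""
    if prior is None:
        prior = {e: 1.0 / len(EMOTIONS) for e in EMOTIONS}
    # P(FE|face, speech) is proportional to P(FE|face) * P(FE|speech) / P(FE)
    fused = {e: p_face[e] * p_speech[e] / prior[e] for e in EMOTIONS}
    total = sum(fused.values())
    return {e: v / total for e, v in fused.items()}

# Example: the visual channel is confident, the audio channel is ambiguous.
p_face = {"neutral": 0.10, "happiness": 0.70, "sadness": 0.10, "fear": 0.05, "anger": 0.05}
p_speech = {"neutral": 0.30, "happiness": 0.40, "sadness": 0.10, "fear": 0.10, "anger": 0.10}
print(fuse_posteriors(p_face, p_speech))

With such a rule, a confident channel dominates an ambiguous one, which reflects the redundancy argument made in the Introduction.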
A. Emotion recognition from Facial Expressions

The facial expression recognition system proposed in this work uses a video sequence acquired by the robot in real time. An overview of the proposed methodology is shown in Fig. 1. Each video frame is processed and a set of robust and invariant features of the user's face is detected. The proposed method consists of the following stages:

Fig. 1. Overview of the proposed facial expression recognition system. The approach flows from left to right. See the text for more details.

1) Data acquisition: Data acquisition for the recognition of facial expressions is based on the processing of a sequence of RGB images S_fr obtained from the robot camera for real-time interaction. First, the system recognizes the user's face within each frame fr(t) at the instant of time t, using the well-known Viola and Jones method [9] to obtain the initial region of interest of the face, ROI_I. The ROI_I is normalized to a fixed size and converted to gray scale to be pre-processed.
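As a rough illustration of this acquisition stage, the sketch below uses OpenCV's implementation of the Viola and Jones detector. The cascade model, the normalized ROI size and the detector parameters are assumptions made for the example; the paper does not report them.

import cv2

# Hypothetical parameters: the cascade model and the normalized ROI size
# are not specified in the paper.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
ROI_SIZE = (128, 128)

def acquire_face_roi(frame):
    """Detect the user's face in a BGR frame and return a normalized
    gray-scale ROI_I, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    roi = gray[y:y + h, x:x + w]
    return cv2.resize(roi, ROI_SIZE)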
2) Pre-processing: In this stage, the ROI_I is processed to remove noise, reducing its light dependence and eliminating unnecessary information. To eliminate errors in the detection and reduce the processing time of the system, the ROI_I is divided into two sub-regions, ROI_top and ROI_bottom, respectively. The usage of both sub-regions allows the elimination of irrelevant information (e.g., the nose) and divides the feature extraction process into two elements (eyes and mouth). Let ROI_I be the face image of size W x H, and let p_0 = (n, m) be the central pixel in the image, which estimates the approximate position of the nose in the image. Then, ROI_top and ROI_bottom are defined as selective copies of ROI_I as follows: ROI_top of size W x (m - N_Th) and ROI_bottom of size W x (m + N_Th), where N_Th is a user-fixed threshold. In order to detect and extract robust facial features in different environments, both ROI_top and ROI_bottom images are processed to reduce the effects of the light dependence. The method is based on the approach described in [10]. The processing sequence follows a set of consecutive stages: 1) gamma correction; 2) difference of Gaussians (DoG) filtering; 3) masking; and 4) contrast equalization. Next, a filter bank of median, blur and Gaussian filters is applied to mitigate the noise introduced in the images by beards, wounds or similar facial elements.
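A compact sketch of this pre-processing chain is given below, in the spirit of the illumination normalization of [10]. The gamma value, the DoG scales and the kernel sizes are illustrative choices, and the masking step is omitted; none of these parameters are reported in the paper.

import cv2
import numpy as np

def preprocess_roi(roi_gray, gamma=0.2, sigma0=1.0, sigma1=2.0):
    """Gamma correction, DoG filtering and contrast equalization,
    followed by a small median/blur/Gaussian filter bank.
    All parameter values are assumptions, not taken from the paper."""
    img = roi_gray.astype(np.float32) / 255.0
    # 1) Gamma correction.
    img = np.power(img, gamma)
    # 2) Difference of Gaussians (band-pass) filtering.
    dog = cv2.GaussianBlur(img, (0, 0), sigma0) - cv2.GaussianBlur(img, (0, 0), sigma1)
    # 4) Simple contrast equalization back to the 8-bit range
    #    (the masking step 3) is omitted in this sketch).
    norm = cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Noise-reduction filter bank: median, averaging and Gaussian filters.
    norm = cv2.medianBlur(norm, 3)
    norm = cv2.blur(norm, (3, 3))
    norm = cv2.GaussianBlur(norm, (3, 3), 0)
    return norm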
3) Gabor Filter: The Gabor filter is a fast and effective linear filter for the detection of edges with different orientations. In the proposed approach, the Gabor filter is used as a stage prior to the detection and extraction of facial features, which are extracted using the contours of the facial elements (i.e., the eyes, the mouth or the eyebrows). The Gabor impulse response in the spatial domain consists of a sinusoidal plane wave of some orientation and frequency, modulated by a two-dimensional Gaussian envelope. Let I(u, v) be the input image; then the output of the Gabor filter, G(u, v), is given by:

G(u, v) = \exp\left( -\frac{1}{2} \, \frac{u_\theta^2 + v_\theta^2}{\sigma^2} \right) \cos\left( 2\pi \frac{u_\theta}{\lambda} + \psi \right)    (1)

where \theta, \lambda and \psi are associated with the sinusoidal plane wave (orientation, wavelength and phase, respectively), and u_\theta and v_\theta are described as:

u_\theta = u \cos\theta + v \sin\theta
v_\theta = -u \sin\theta + v \cos\theta    (2)
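Equations (1) and (2) can be implemented directly. The following sketch builds a spatial Gabor kernel with NumPy and applies a small bank of orientations with OpenCV; the kernel size, the parameter values and the per-pixel maximum over orientations are assumptions made for illustration.

import cv2
import numpy as np

def gabor_kernel(ksize=21, sigma=4.0, theta=0.0, lambd=10.0, psi=0.0):
    """Spatial Gabor kernel following eqs. (1)-(2); the parameter values
    are examples, not those used in the paper."""
    half = ksize // 2
    v, u = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    u_t = u * np.cos(theta) + v * np.sin(theta)       # eq. (2)
    v_t = -u * np.sin(theta) + v * np.cos(theta)
    envelope = np.exp(-0.5 * (u_t ** 2 + v_t ** 2) / sigma ** 2)
    carrier = np.cos(2.0 * np.pi * u_t / lambd + psi)
    return envelope * carrier                         # eq. (1)

def gabor_edges(image, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter the pre-processed ROI with a small bank of orientations
    and keep the strongest response per pixel."""
    responses = [cv2.filter2D(image.astype(np.float32), -1, gabor_kernel(theta=t))
                 for t in orientations]
    return np.max(np.abs(np.stack(responses)), axis=0)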
4) Feature extraction: The crucial step in an automatic facial expression recognition system is the extraction of relevant features from the image, F_I = {f_i^I | i = 1..m}. In the proposed work, a set of edge-based features is extracted, which is invariant to scale or distance from the user to the robot. Each of the features is directly related to the Action Units (AUs) described by the Facial Action Coding System (FACS) [5]. A set of independent and antagonistic AUs has been used in this paper (see Fig. 2; AU1 and AU4 are related to distortions of the eyebrows, and they are antagonistic and independent). In this approach, only three features are defined in the edge face image, which are associated with the Euclidean distances between the upper contour of the eyebrows and the lower edge of the eyes (dA), between the lip corners (dB), and between the upper and lower contours of the mouth (dC), respectively. These features are normalized using the values extracted from the neutral state, which allows the system to be independent of the scale or distance of the user to the sensor. Fig. 3(a) illustrates the ROI of the face in the image. The image is processed according to the method described in this section. Results after applying light normalization, noise removal methods and Gabor filtering are shown in Fig. 3(b). Fig. 3(b) also illustrates the set of extracted features, labelled as dA (yellow), dB (brown) and dC (violet).

Fig. 2. Action Units (AUs) used in this paper.

Fig. 3. Edge-based facial feature extraction: a) Region of interest in the face image (ROI_top and ROI_bottom); and b) Features extracted in the image.

5) Dynamic Bayesian Network classifier: In order to classify the Facial Expression (FE) produced by the user's face, a Dynamic Bayesian Network is proposed, where the overall classification result achieved is the one foreseen by the belief variable FE, in the scope (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]).

This Bayesian approach is based on the detection of 11 AUs with antagonistic and exclusive features as leaves of the DBN, which allows us to reduce the size of the DBN to 7 variables that group these AUs. Thus, these 7 variables are obtained from combinations of the extracted features of the previous process: dA, dB and dC. In this case, it is assumed that these 7 leaf variables are independent given the facial expression (FE). Although some muscular movements from one area of the face may slightly affect other areas, this small influence cannot be detected by the robot cameras. Besides, for correct detection of each emotional state it is necessary that each AU achieves a minimum threshold intensity of B (i.e., slight evidence), within the intensity range described in FACS [5].

Fig. 4 illustrates the structure of the two-level Bayesian network, and the time influence that characterizes this DBN. The first level shows the estimation of the user's emotional state FE, while the second level shows the 7 variables based on the AUs. In addition, one of the main features worth mentioning about the Bayesian network is the need to provide it with learning data. The most common method is that each new sample is matched using a threshold. In this work, to avoid the existing gaps, a pre-processing stage is done before the learning stage, fitting a Gaussian distribution to the data. The learning data acquisition was performed by a supervisor, who collected the samples of the 7 random variables manually, correctly classifying them. The leaf random variables of the model, and their respective virtual scopes, are shown in Table I.

Fig. 4. Facial Expression Dynamic Bayesian Network. Two time intervals are shown.

TABLE I
LEAF VARIABLES OF THE DBN

Variable | Action Units (AUs) | Element of the face
EB       | AU1, AU4           | Eyebrows
Ch       | AU6                | Cheeks
LE       | AU7                | Lower Eyelids
LC       | AU12, AU15         | Lip Corners
CB       | AU17               | Chin Boss
MF       | AU20, AU23         | Mouth's Form
MA       | AU24, AU25         | Mouth's Aperture

Table II shows the relationship between the basic emotional states of a user and the Action Units associated with different facial distortions. In this case, these AUs depend on the few deformable or mobile elements of the face, such as the mouth, eyes and eyebrows. This property of deformation or movement of facial elements is what allows the analysis of a muscular change associated with a specific emotion; otherwise there would be no facial expressions. However, elements such as the nose do not present specific changes in facial expressions.

TABLE II
RELATIONSHIP BETWEEN ACTION UNITS (FACIAL FEATURES) AND THE DIFFERENT EMOTIONAL STATES OF THIS SYSTEM (INFORMATION COLLECTED FROM [3])

Emotion   | Action Units (AUs)
Sadness   | AU1 - AU4 - AU15 - AU17
Happiness | AU6 - AU12 - AU25
Fear      | AU1 - AU4 - AU20 - AU25
Anger     | AU4 + AU7 - AU17 - AU23 - AU24
Neutral   | —

In this approach, the data (D) in the classification process is obtained according to the following setup:

D = ((x_1, y_1), ..., (x_n, y_n)), \quad x_i \in \mathbb{R}^d, \; y_i \in \mathbb{R}    (3)

Consider that y_1 to y_5 are the five possible emotional states (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]), and each dimension of x corresponds to one of the previously described random variables, namely: EB, Ch, LE, LC, CB, MF and MA. Since the learning data may have gaps between its samples, a model is built assuming that (X_1, ..., X_n) are independent given FE, and

X_i \sim N(prior^T x_i, \sigma^2)    (4)

At first, prior \sim U(1/n); however, throughout the iterations, the posterior at t - 1 becomes the prior at t. Finally, the posterior equation is obtained using Bayes' rule:

P(FE | x_m) = \frac{ \prod_{i=1}^{n} P(x_i | FE) \cdot P(FE) }{ P(x_m) }    (5)

where x_m is the most recent visual information acquired. The denominator can be computed using the Bayesian marginalization rule:

P(x_m) = \sum_{FE} \prod_{i=1}^{q} P(x_i | FE) \cdot P(FE)    (6)

q = 7 being the number of random variables of the system.

The dynamic properties of the network cause a convergence over time. The resultant histogram from the previous frame is used as prior knowledge for the current frame. Each classification is considered correct if it converges within a maximum of 5 frames, exceeding a threshold of 80%. Otherwise, if after 5 frames no value is higher than the threshold, the classifier selects the highest probability value (usually referred to as the Maximum a Posteriori decision in Bayesian theory) as the result.
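The recursive update of eqs. (5)-(6), together with the prior propagation and the 80% / 5-frame convergence rule, can be sketched as follows. The Gaussian likelihood parameters would come from the supervised learning data described above; here the class name and all parameter values are placeholders, not the authors' trained model.

import numpy as np

EMOTIONS = ("neutral", "happiness", "sadness", "fear", "anger")

class RecursiveEmotionClassifier:
    """Naive-Bayes style recursive update: the posterior of frame t-1
    becomes the prior of frame t (eqs. (5)-(6)). Placeholder parameters."""

    def __init__(self, means, variances, threshold=0.8, max_frames=5):
        # means[e], variances[e]: arrays of length 7, one value per leaf
        # variable (EB, Ch, LE, LC, CB, MF, MA), learned per emotion e.
        self.means, self.variances = means, variances
        self.threshold, self.max_frames = threshold, max_frames
        self.reset()

    def reset(self):
        self.prior = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))
        self.frames = 0

    def update(self, x):
        """x: the 7 observed leaf variables for the current frame."""
        # Gaussian likelihood of each variable given the emotion (cf. eq. (4)).
        likelihood = np.array([
            np.prod(np.exp(-0.5 * (x - self.means[e]) ** 2 / self.variances[e])
                    / np.sqrt(2.0 * np.pi * self.variances[e]))
            for e in EMOTIONS])
        posterior = likelihood * self.prior
        posterior /= posterior.sum()          # marginalization, eq. (6)
        self.prior = posterior                # posterior(t-1) becomes prior(t)
        self.frames += 1
        if posterior.max() >= self.threshold or self.frames >= self.max_frames:
            result = EMOTIONS[int(posterior.argmax())]   # MAP decision
            self.reset()
            return result
        return None                           # keep accumulating evidence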
B. Emotion recognition from Speech

Emotion recognition using speech as input poses several subproblems, such as the detection, extraction and characterization of a set of significant acoustic features to recognize emotions. Therefore, the proposed approach has a structure similar to that of the subsystem described in Sec. III-A; thus, like the former, a set of acoustic features is extracted and used as input to a DBN classifier. The proposed methodology is based on the work presented in [11], which studied the influence of pitch, energy and speech-rate on the activation and also the intensity of some emotions. Fig. 5 illustrates an overview of the proposed system. The system is divided into the following stages:

Fig. 5. Overview of the proposed emotion recognition system from speech. (Image acquired from the publication [12].)

1) Detection of the Speech: The audio signal is pre-processed in order to detect the presence or absence of speech in real time, within a communication between a human and a robot. This stage was achieved by using the audio library SoX [13], which allows the detection of voice and silences in accordance with the sensitivity of the microphone. Specifically, the library analyzes and processes the original audio stream from the microphone through a Voice Activity Detection (VAD) feature based on the measurement of the power cepstrum, since this function is responsible for removing noise, silence or any type of sound not related to the human voice. Thus, for a time signal x(t) corresponding to an audio signal acquired by the robot, the power cepstrum C(\tau) [14] is given by the equation:

C(\tau) = F^{-1}\left\{ \log\left( |F(x(t))|^2 \right) \right\}    (7)

... according to eq. (8) for each spectral frame:

Y(\omega) = \prod^{i=R} |X(\omega r)|^2, \quad 0 \leq \omega <
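Although the implementation relies on SoX for voice activity detection, the power cepstrum of eq. (7) can be computed directly from a frame of audio samples. The following NumPy sketch, together with a very rough energy-based speech decision, is only an illustration; the frame handling and the threshold are assumptions and do not reproduce the SoX VAD criterion.

import numpy as np

def power_cepstrum(frame):
    """Cepstrum of one audio frame following eq. (7): inverse FFT of the
    log power spectrum of the signal."""
    spectrum = np.fft.rfft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)   # small offset avoids log(0)
    return np.fft.irfft(log_power)

def is_speech(frame, energy_threshold=1e-3):
    """Very rough per-frame speech decision on normalized samples; the
    threshold is an illustrative assumption, not the SoX criterion."""
    return float(np.mean(frame.astype(np.float64) ** 2)) > energy_threshold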