                                                                    Workshop on Multimodal Semantics for Robotic Systems (MuSRobS)
                                                                    IEEE/RSJ International Conference on Intelligent Robots and Systems 2015




        A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction

                          Felipe Cid, Luis J. Manso and Pedro Núñez


   L. J. Manso and P. Núñez are members of the Robotics and Artificial Vision Lab, Robolab Group, University of Extremadura, Spain (e-mail: lmanso@unex.es; pnuntru@unex.es). F. Cid is with the Institute of Electrical and Electronics, Universidad Austral de Chile, Chile (e-mail: felipe.cid@uach.cl).

   Abstract— Facial expressions and speech provide emotional information about the user through multiple communication channels. In this paper, a novel multimodal emotion recognition system based on visual and auditory information processing is proposed. The proposed approach is used in real affective human robot communication to estimate five different emotional states (i.e., happiness, anger, fear, sadness and neutral), and it consists of two subsystems with a similar structure. The first subsystem achieves robust facial feature extraction based on filters applied consecutively to the edge image and the use of a Dynamic Bayesian Classifier. A similar classifier is used in the second subsystem, where the input is a set of speech descriptors, such as speech rate, energy and pitch. Both subsystems are finally combined in real time. The results of this multimodal approach show the robustness and accuracy of the methodology with respect to single-mode emotion recognition systems.

                          I. INTRODUCTION

   In the last decade, Human Robot Interaction (HRI) has become one of the most important issues in social robotics. Within HRI, one of the main objectives is the development of methodologies focused on non-invasive techniques based on natural language. This would allow the robot to interact with users in a similar way to humans, through multimodal systems that combine information from different channels. In order to socially interact with humans, a robotic system should be able not only to understand users' behaviour and intentions, but also to estimate their emotional state. Knowing and understanding these human emotions helps social robots adapt the communication in real time, improving and enriching the interaction [1]. This kind of HRI is usually known as affective HRI.

   The design of social robots aims at developing natural interfaces for affective interaction. In this context, most current affective HRI techniques use a single information channel, called a mode, such as speech, facial expressiveness or body language. However, an emotion is generally expressed through several modalities. In multimodal affective interaction, the user can communicate his/her emotional state to the robot using several different input channels [2]. Contrary to approaches where each channel introduces complementary information into the emotion recognition system, most multimodal systems use these channels as redundant information. This redundancy is useful in real interaction scenarios, where, for instance, errors associated with noise or occlusions can be reduced.

   This work presents a novel multimodal emotion recognition system for affective HRI. The proposed approach is based on a real-time multimodal system that integrates speech and facial expression analysis. The main contribution of this work is a robust feature extraction approach for facial expression recognition. In this paper, these facial features are a combination of independent and antagonistic distortions of the face. Besides, a novel acoustic analysis of speech is used to extract features relevant to emotion. The visual and acoustic features constitute the inputs of two Dynamic Bayesian Networks (DBN) [3], which classify them into a set of basic emotions. The output of each DBN estimates the user's emotional state based on the extracted features, the emotional states available in each DBN being: happiness, sadness, anger, fear and neutral (non-emotional state). Finally, the purpose of this system is to integrate the output information of each subsystem in a third DBN that analyzes the information associated with each mode. This multimodal methodology uses redundant information to obtain an accurate and robust estimate of the user's emotional state.

   This paper is organized as follows: after discussing known approaches to emotion recognition from facial expressions, speech or multimodal systems in Section II, Section III presents an overview of the proposed Multimodal Emotion Recognition System. In Section IV, the experimental results are pointed out, and finally, Section V describes the conclusions and future work of the approach.

                         II. RELATED WORKS

   In the field of HRI, different automatic emotion recognition systems have been studied. Most of these approaches are based on the analysis of a single information channel, such as video sequences or audio signals. Independently of the nature of the information source, the raw data is processed and a set of features is extracted. Then, these features are classified into different categories, i.e., emotions. On the one hand, facial expressions have been commonly used to detect and recognize human emotions. An interesting and updated review was presented in [4]. Commonly, these frameworks use the Facial Action Coding System (FACS) proposed by Ekman et al. [5], which is based on facial muscle deformations. On the other hand, speech has also been used for emotion recognition (see the review [6]). These systems are usually focused on variations of acoustic variables that are related to emotions.





   Approaches that only use visual or speech information individually usually fail in real scenarios. Light conditions, shadows or occlusions, among others, are typical situations where the accuracy of the results decreases for visual systems. In a similar way, environmental noise or people moving while talking are error sources in audio systems. Therefore, for an efficient affective HRI, several authors have focused their attention on multimodal systems that recognize the emotional state of the user from different modalities or information sources, such as the face, gestures, body language, speech or physiological signals, among others. For instance, in [3] the authors develop an architecture for multimodal emotion recognition where facial expressiveness and speech are used. In [7], Kessous et al. propose a system that fuses body and facial language, and speech. Interesting reviews are also presented in [2], [8]. Most of these approaches use a dominant mode in the classifier strategy, that is, when the probability of the detected emotion is low, the information of another channel is used. In contrast to these works, this paper presents a multimodal approach where all the input modes are analyzed in real time and the fusion strategy consists of a Dynamic Bayesian Network classifier. This multimodal fusion is produced at the decision level (i.e., the information is integrated from the single modalities after being interpreted by their own classifiers).

              III. MULTIMODAL EMOTION RECOGNITION SYSTEM

   In this section, the proposed multimodal emotion recognition system is presented. The framework consists of two subsystems running in parallel that estimate human emotions using two independent DBNs. The facial expression recognition subsystem (see Fig. 1) is based on a fast and robust feature extraction approach where morphological and convolutional filters are applied consecutively to reduce noise and the dependence on luminosity changes. After that, a Gabor filter is used for efficient edge detection. The output edge image of this filter bank is used to detect and extract scale-invariant facial features, which are the input variables of the DBN that estimates the user's emotional state. In the second subsystem, the user's speech is analyzed in order to extract a set of independent descriptors. These descriptors are the input of a second DBN. Finally, the proposed system integrates the information associated with both methods in a third DBN, which estimates the final user emotion. The combined output of both subsystems and the third DBN is an estimate of the emotional state of the user, within four possible emotional states (happiness, sadness, anger, fear) and a non-emotional state (neutral). Each subsystem is described in detail in the following subsections.
A. Emotion recognition from Facial Expressions

   The facial expression recognition system proposed in this work uses a video sequence acquired by the robot in real time. An overview of the proposed methodology is shown in Fig. 1. Each video frame is processed and a set of robust and invariant features of the user's face is detected. The proposed method consists of the following stages:

Fig. 1. Overview of the proposed facial expression recognition system. The approach flows from left to right. See the text for more details.

   1) Data acquisition: Data acquisition for the recognition of facial expressions is based on the processing of a sequence of RGB images S_fr obtained from the robot camera for real-time interaction. First, the system recognizes the user's face within each frame fr(t) at time instant t, using the well-known Viola and Jones method [9] to obtain the initial region of interest of the face, ROI_I. The ROI_I is normalized to a fixed size and converted to gray scale to be pre-processed.
   2) Pre-processing: In this stage, the ROI_I is processed to remove noise, reducing its light dependence and eliminating unnecessary information. To eliminate errors in the detection and to reduce the processing time of the system, the ROI_I is divided into two sub-regions, ROI_top and ROI_bottom. The use of both sub-regions allows the elimination of irrelevant information (e.g., the nose) and divides the feature extraction process into two elements (eyes and mouth). Let ROI_I be the face image of size W x H, and let p0 = (n, m) be the central pixel of the image, which approximates the position of the nose. Then, ROI_top and ROI_bottom are defined as selective copies of ROI_I as follows: ROI_top of size W x (m − N_Th) and ROI_bottom of size W x (m + N_Th), where N_Th is a user-fixed threshold. In order to detect and extract robust facial features in different environments, both ROI_top and ROI_bottom images are processed to reduce the effects of light dependence. The method is based on the approach described in [10]. The processing sequence follows a set of consecutive stages: 1) gamma correction; 2) difference of Gaussians (DoG) filtering; 3) masking; and 4) contrast equalization. Next, a filter bank of Median, Blur and Gaussian filters is applied to mitigate the noise introduced in the images by beards, wounds or similar facial elements.
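The following sketch outlines this pre-processing chain under simplifying assumptions: the nose row is approximated by the centre of the ROI, and the masking and contrast-equalization steps of [10] are reduced to a plain standardization. All parameter values are illustrative.

```python
import cv2
import numpy as np

def preprocess_roi(roi_i, n_th=10, gamma=0.2):
    """Split ROI_I around the approximate nose row and normalize illumination.

    Simplified version of the pipeline in the text: gamma correction, DoG
    filtering and a crude contrast equalization, followed by a Median/Blur/
    Gaussian filter bank. All parameter values are illustrative.
    """
    h, _w = roi_i.shape
    m = h // 2                                    # approximate nose row p0 = (n, m)
    regions = {"top": roi_i[:m - n_th, :],        # eyes / eyebrows
               "bottom": roi_i[m - n_th:, :]}     # mouth
    out = {}
    for name, roi in regions.items():
        img = np.power(roi.astype(np.float32) / 255.0, gamma)          # 1) gamma correction
        dog = (cv2.GaussianBlur(img, (0, 0), 1.0)
               - cv2.GaussianBlur(img, (0, 0), 2.0))                    # 2) DoG filtering
        dog = (dog - dog.mean()) / (dog.std() + 1e-6)                   # 3)-4) masking and contrast
                                                                        #       equalization, simplified
        img8 = cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        img8 = cv2.medianBlur(img8, 3)                                  # Median,
        img8 = cv2.blur(img8, (3, 3))                                   # Blur and
        out[name] = cv2.GaussianBlur(img8, (3, 3), 0)                   # Gaussian filter bank
    return out["top"], out["bottom"]
```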





   3) Gabor Filter: The Gabor filter is a fast and effective linear filter for the detection of edges with different orientations. In the proposed approach, the Gabor filter is used as a stage previous to the detection and extraction of facial features, which are extracted using the contours of the facial elements (i.e., the eyes, the mouth or the eyebrows). The Gabor impulse response in the spatial domain consists of a sinusoidal plane wave of some orientation and frequency, modulated by a two-dimensional Gaussian envelope. Let I(u, v) be the input image; then the output of the Gabor filter, G(u, v), is given by:

               G(u, v) = exp(−(u_θ² + v_θ²) / (2σ²)) · cos(2π·u_θ/λ + ψ)                (1)

where θ, λ and ψ are associated with the sinusoidal plane wave (orientation, wavelength and phase, respectively), and u_θ and v_θ are described as:

               u_θ =  u·cos θ + v·sin θ
               v_θ = −u·sin θ + v·cos θ                                                 (2)
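A small Gabor bank built with OpenCV illustrates how the edge image can be obtained from eq. (1); the kernel size, wavelength and number of orientations are illustrative choices, and gamma = 1.0 reproduces the isotropic Gaussian envelope of the equation.

```python
import cv2
import numpy as np

def gabor_edges(img, orientations=4, lambd=8.0, sigma=3.0, psi=0.0):
    """Edge image as the per-pixel maximum response of a small Gabor bank.

    Uses OpenCV's getGaborKernel to realize eq. (1); gamma=1.0 gives the
    isotropic Gaussian envelope of the equation. Kernel size and parameter
    values are illustrative choices.
    """
    img = img.astype(np.float32)
    responses = []
    for k in range(orientations):
        theta = k * np.pi / orientations              # orientation of the plane wave
        kernel = cv2.getGaborKernel((21, 21), sigma, theta, lambd,
                                    1.0, psi, ktype=cv2.CV_32F)
        kernel -= kernel.mean()                       # zero-mean kernel
        responses.append(cv2.filter2D(img, cv2.CV_32F, kernel))
    edges = np.abs(np.stack(responses)).max(axis=0)
    return cv2.normalize(edges, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```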
Fig. 2. Action Units (AUs) used in this paper.

Fig. 3. Edge-based facial feature extraction: a) Region of interest in the face image (ROI_top and ROI_bottom); and b) Features extracted in the image.

   4) Feature extraction: The crucial step in an automatic facial expression recognition system is the extraction of relevant features from the image, F_I = {f_i^I | i = 1..m}. In the proposed work, a set of edge-based features is extracted, which is invariant to the scale or distance from the user to the robot. Each of the features is directly related to the Action Units (AUs) described by the Facial Action Coding System (FACS) [5]. A set of independent and antagonistic AUs has been used in this paper (see Fig. 2; AU1 and AU4 are related to distortions of the eyebrows, and they are antagonistic and independent). In this approach, only three features are defined in the edge face image, which are associated with the Euclidean distances between the upper contour of the eyebrows and the lower edge of the eyes (dA), between the lip corners (dB), and between the upper and lower contours of the mouth (dC), respectively. These features are normalized using the values extracted in the neutral state, which allows the system to be independent of the scale or distance of the user to the sensor. Fig. 3(a) illustrates the ROI of the face in the image. The image is processed according to the method described in this section. Results after applying light normalization, noise removal methods and Gabor filtering are shown in Fig. 3(b). Fig. 3(b) also illustrates the set of extracted features, labelled as dA (yellow), dB (brown) and dC (violet).
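Assuming the contour points of the eyebrows, eyes and mouth are already available from the previous stage, the three features and their neutral-state normalization can be sketched as follows (the dictionary keys are hypothetical names, not identifiers from the paper):

```python
import numpy as np

def facial_features(points, neutral):
    """Scale-invariant features dA, dB and dC from detected contour points.

    `points` maps names to 2-D image coordinates assumed to come from a
    prior contour/landmark step (hypothetical keys): eyebrow_top, eye_bottom,
    lip_corner_left, lip_corner_right, mouth_top, mouth_bottom. `neutral`
    holds the same distances measured on a neutral frame, which makes the
    features independent of scale and of the distance to the sensor.
    """
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

    raw = {
        "dA": dist(points["eyebrow_top"], points["eye_bottom"]),            # eyebrows vs. eyes
        "dB": dist(points["lip_corner_left"], points["lip_corner_right"]),  # lip corners
        "dC": dist(points["mouth_top"], points["mouth_bottom"]),            # mouth aperture
    }
    return {k: v / max(neutral[k], 1e-6) for k, v in raw.items()}
```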
                                                                            of a user, and the Action U nits associated with different
values extracted from the neutral state, which allows the
                                                                            facial distortions. In this case, these AU s depend on the
system to be independent of the scale or distance of the
                                                                            few deformable or mobile elements of the face, such as: the
user to the sensor. Fig. 3(a) illustrates the ROI of the
                                                                            mouth, eyes and eyebrows. This feature of deformation or
face in the image. The image is processed according to the
                                                                            movement of facial elements is what allows analyze a mus-
method described in this section. Results after applying light
                                                                            cular change associated with a specific emotion, otherwise
normalization, noise removal methods and Gabor filtering
                                                                            there would be no facial expressions. However, elements
are shown in Fig. 3(b). Fig. 3(b) also illustrates the set of
                                                                            such as the nose does not present specific changes in facial
extracted features, labelled as dA (yellow), dB (brown) and
                                                                            expressions.
dC (violet).
                                                                            In this approach, the data (D) in the classification process is
   5) Dynamic Bayesian Network classifier : In order to clas-
                                                                            obtained according to the following setup:
sify the Facial Expression (F E) produced by the user’s face,
a Dynamic Bayesian network is proposed, where the overall                             D = ((x1 , y1 )...(xn , yn )), xi ∈ Rd , yi ∈ R             (3)
classification result achieved is the one foreseen by the
belief variable F E, in the scope (F E[neutral] , F E[happiness] ,          Consider that y1 to y5 are the five possible emotional
F E[sadness] , F E[f ear] , F E[anger] ).                                   states (F E[neutral] , F E[happiness] , F E[sadness] , F E[f ear] ,




   Fig. 4 illustrates the structure of the two-level Bayesian network and the time influence that characterizes this DBN. The first level shows the estimation of the user's emotional state FE, while the second level shows the 7 variables based on the AUs. In addition, one of the main features worth mentioning about the Bayesian network is the need to provide it with learning data. The most common method is that each new sample is matched using a threshold. In this work, to avoid the existing gaps, a pre-processing stage is carried out before the learning stage, fitting a Gaussian distribution to the data. The learning data acquisition was performed by a supervisor, who collected the samples of the 7 random variables manually and classified them correctly. The leaf random variables of the model, and their respective scopes, are shown in Table I.

Fig. 4. Facial Expression Dynamic Bayesian Network. Two time intervals are shown.

         Variable    Action Units AUs    Element of the face
           EB          AU1, AU4          Eye-Brows
           Ch          AU6               Cheeks
           LE          AU7               Lower Eyelids
           LC          AU12, AU15        Lips Corners
           CB          AU17              Chin Boss
           MF          AU20, AU23        Mouth's Form
           MA          AU24, AU25        Mouth's Aperture

                              TABLE I
                    LEAF VARIABLES OF THE DBN

   Table II shows the relationship between the basic emotional states of a user and the Action Units associated with the different facial distortions. In this case, these AUs depend on the few deformable or mobile elements of the face, such as the mouth, the eyes and the eyebrows. It is this deformation or movement of facial elements that allows analyzing a muscular change associated with a specific emotion; otherwise there would be no facial expressions. However, elements such as the nose do not present specific changes in facial expressions.

         Emotion          Action Units AUs
         Sadness          AU1 - AU4 - AU15 - AU17
         Happiness        AU6 - AU12 - AU25
         Fear             AU1 - AU4 - AU20 - AU25
         Anger            AU4 + AU7 - AU17 - AU23 - AU24
         Neutral          —

                             TABLE II
   RELATIONSHIP BETWEEN ACTION UNITS (FACIAL FEATURES) AND THE
    DIFFERENT EMOTIONAL STATES OF THIS SYSTEM. (INFORMATION
                      COLLECTED FROM: [3]).



   In this approach, the data D in the classification process is obtained according to the following setup:

               D = ((x_1, y_1), ..., (x_n, y_n)),   x_i ∈ R^d, y_i ∈ R                  (3)

Consider that y_1 to y_5 are the five possible emotional states (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]), and that each dimension of x corresponds to one of the previously described random variables, namely: EB, Ch, LE, LC, CB, MF and MA. Since the learning data may have gaps between its samples, a model is built assuming that (X_1, ..., X_n) are independent given FE, and

               X_i ∼ N(prior^T · x_i, σ²)                                               (4)

At first, prior ∼ U(1/n); however, throughout the iterations the posterior at t − 1 becomes the prior at t.
   Finally, the posterior equation is obtained using Bayes' rule:

               P(FE | x_m) = ( ∏_{i=1}^{n} P(x_i | FE) · P(FE) ) / P(x_m)               (5)

where x_m is the most recent visual information acquired. The denominator can be computed using the Bayesian marginalization rule:

               P(x_m) = Σ_FE ∏_{i=1}^{q} P(x_i | FE) · P(FE)                            (6)

where q = 7 is the number of random variables of the system.
   The dynamic properties of the network cause a convergence over time. The resulting histogram from the previous frame is used as prior knowledge for the current frame. Each classification is considered correct if it converges within a maximum of 5 frames, exceeding a threshold of 80%. Otherwise, if after 5 frames no value is higher than the threshold, the classifier selects the highest probability value (usually referred to as the Maximum a Posteriori decision in Bayesian theory) as the result.
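The per-frame update of eqs. (4)-(6), together with the temporal prior propagation and the 5-frame/80% convergence rule, can be sketched as follows; the Gaussian parameters are assumed to come from the supervised training stage, and the class ordering and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

EMOTIONS = ["neutral", "happiness", "sadness", "fear", "anger"]
VARIABLES = ["EB", "Ch", "LE", "LC", "CB", "MF", "MA"]

class DynamicNaiveBayes:
    """Per-frame core of the classifier in eqs. (4)-(6).

    `means` and `stds` are 5 x 7 arrays with the Gaussian parameters of the
    7 leaf variables for each emotion, assumed to have been fitted to the
    supervised training data beforehand (illustrative, not the paper's data).
    """

    def __init__(self, means, stds):
        self.means = np.asarray(means, dtype=float)
        self.stds = np.asarray(stds, dtype=float)
        self.prior = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))   # uniform at t = 0

    def update(self, x):
        """Return P(FE | x) for one frame; the posterior becomes the next prior."""
        likelihood = norm.pdf(np.asarray(x, float), self.means, self.stds).prod(axis=1)
        joint = likelihood * self.prior            # numerator of eq. (5)
        posterior = joint / joint.sum()            # eq. (6) supplies the normalizer
        self.prior = posterior                     # posterior at t-1 is the prior at t
        return posterior

def classify(clf, frames, threshold=0.8, max_frames=5):
    """Accept the first emotion whose posterior exceeds the threshold within
    max_frames frames; otherwise fall back to the MAP decision."""
    posterior = clf.prior
    for x in frames[:max_frames]:
        posterior = clf.update(x)
        if posterior.max() >= threshold:
            break
    return EMOTIONS[int(posterior.argmax())], posterior
```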





B. Emotion recognition from Speech

Fig. 5. Overview of the proposed emotion recognition system from speech. (Image acquired from the publication: [12])

   Emotion recognition using speech as input poses several subproblems, such as the detection, extraction and characterization of a set of significant acoustic features to recognize emotions. Therefore, the proposed approach has a structure similar to the subsystem described in Sec. III-A: like the former, a set of acoustic features is extracted and used as input to a DBN classifier. The proposed methodology is based on the work presented in [11], which studied the influence of pitch, energy and speech rate on the activation and intensity of some emotions. Fig. 5 illustrates an overview of the proposed system. The system is divided into the following stages:

   1) Detection of the Speech: The audio signal is pre-processed in order to detect the presence or absence of speech in real time within a communication between a human and a robot. This stage is achieved by using the audio library SoX [13], which allows the detection of voice and silences in agreement with the sensitivity of the microphone. Specifically, the library analyzes and processes the original audio stream from the microphone through a VAD (Voice Activity Detection) feature based on the measurement of the power cepstrum, since this function is responsible for removing noise, silence or any type of sound not related to the human voice. Thus, for a signal in time, x(t), corresponding to an audio signal acquired by the robot, the power cepstrum C(τ) [14] is given by:

               C(τ) = F⁻¹{ log( |F(x(t))|² ) }                                          (7)
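For illustration, the power cepstrum of eq. (7) can be computed directly with NumPy; this only reproduces the quantity on which the VAD is based, not the SoX implementation itself, and the frame length and sampling rate in the usage example are illustrative.

```python
import numpy as np

def power_cepstrum(x):
    """Power cepstrum of eq. (7): inverse transform of the log power spectrum.

    `x` is one real-valued audio frame; a small epsilon avoids log(0) on
    silent frames.
    """
    spectrum = np.fft.rfft(x)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    return np.fft.irfft(log_power, n=len(x))

# Usage on a synthetic 32 ms frame sampled at 16 kHz (illustrative values).
fs = 16000
t = np.arange(512) / fs
cep = power_cepstrum(np.sin(2 * np.pi * 150 * t))
print(cep.shape)   # (512,)
```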
                                                
... according to eq. 8 for each spectral frame:

               Y(ω) = ∏_{r=1}^{R} |X(ω·r)|²,    0 ≤ ω