Workshop on Multimodal Semantics for Robotic Systems (MuSRobS)
IEEE/RSJ International Conference on Intelligent Robots and Systems 2015

A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction

Felipe Cid, Luis J. Manso and Pedro Núñez

L. J. Manso and P. Núñez are members of the Robotics and Artificial Vision Lab, Robolab Group, University of Extremadura, Spain (e-mail: lmanso@unex.es; pnuntru@unex.es). F. Cid is with the Institute of Electrical and Electronics, Universidad Austral de Chile, Chile (e-mail: felipe.cid@uach.cl).

Abstract— Facial expressions and speech are elements that provide emotional information about the user through multiple communication channels. In this paper, a novel multimodal emotion recognition system based on visual and auditory information processing is proposed. The proposed approach is used in real affective human robot communication in order to estimate five different emotional states (i.e., happiness, anger, fear, sadness and neutral), and it consists of two subsystems with similar structure. The first subsystem achieves robust facial feature extraction based on filters consecutively applied to the edge image and the use of a Dynamic Bayesian Classifier. A similar classifier is used in the second subsystem, where the input is associated with a set of speech descriptors, such as speech-rate, energy and pitch. Both subsystems are finally combined in real time. The results of this multimodal approach show the robustness and accuracy of the methodology with respect to single-mode emotion recognition systems.

I. INTRODUCTION

In the last decade, Human Robot Interaction (HRI) has become one of the most important issues in social robotics. Within HRI, one of the main objectives is the development of methodologies focused on non-invasive techniques based on natural language. This would allow the robot to interact with users in a similar way to humans, through multimodal systems that combine information from different channels. In order to socially interact with humans, a robotic system should be able not only to understand users' behaviour and intentions, but also to estimate their emotional state. Knowing and understanding these human emotions helps social robots adapt the communication in real time, improving and enriching the interaction [1]. This kind of HRI is usually known as affective HRI.

The design of social robots aims at developing natural interfaces for an affective interaction. In this context, most of the current affective HRI techniques use a single information channel, called mode, such as speech, facial expressiveness or body language. However, an emotion is generally expressed through several modalities. In multimodal affective interaction, the user can communicate his/her emotional state to the robot using several different input channels [2]. Contrary to approaches where each channel introduces complementary information into the emotion recognition system, most multimodal systems use these channels as redundant information. This redundancy is useful in real interaction scenarios, where, for instance, errors associated with noise or occlusions can be reduced.

This work presents a novel multimodal emotion recognition system for affective HRI. The proposed approach is based on a real-time multimodal system that integrates speech and facial expression analysis. The main contribution of this work is a robust feature extraction approach for facial expression recognition. In this paper, these facial features are a combination of independent and antagonistic distortions of the face. Besides, a novel acoustic analysis of speech is used to extract features relevant to emotion. Both visual and acoustic features constitute the input of two Dynamic Bayesian Networks (DBNs) [3], which classify them into a set of basic emotions. The output of each DBN estimates the user's emotional state based on the extracted features, the emotional states available in each DBN being: happiness, sadness, anger, fear and neutral (non-emotional state). Finally, the purpose of this system is to integrate the output information of each subsystem in a third DBN that analyzes the information associated with each mode. This multimodal methodology uses redundant information to estimate accurate and robust results on the user's emotional state.

This paper is organized as follows: after discussing known approaches to emotion recognition systems from facial expressions, speech or multimodal systems in Section II, Section III presents an overview of the proposed Multimodal Emotion Recognition System. In Section IV, the experimental results are pointed out, and finally, Section V describes the conclusions and future work of the approach.

II. RELATED WORKS

In the field of HRI, different automatic emotion recognition systems have been studied. Most of these approaches are based on single information channel analysis, such as video sequences or audio signals. Independent of the nature of the information source, the raw data is processed and a set of features is extracted. Then, these features are classified into different categories, i.e., emotions. On one hand, facial expressions have been commonly used to detect and recognize human emotions. An interesting and up-to-date review is presented in [4]. Commonly, these frameworks use the Facial Action Coding System (FACS) proposed by Ekman et al. [5], which is based on facial muscle deformations. On the other hand, speech has also been used for emotion recognition (see the review [6]). These systems are usually focused on variations of acoustic variables that are related to emotions.

Approaches that only use visual or speech information individually usually fail in real scenarios. Light conditions, shadows or occlusions, among others, are typical situations where the accuracy of the results decreases for visual systems. In a similar way, environmental noise or people moving while talking are error sources in audio systems. Therefore, for an efficient affective HRI, several authors have focused their attention on multimodal systems that recognize the emotional state of the user from different modalities or information sources such as: face, gesture, body language, speech or physiological signals, among others. For instance, in [3] the authors develop an architecture for multimodal emotion recognition, where facial expressiveness and speech are used. In [7], Kessous et al. propose a system that fuses body and facial language, and speech. Also, interesting reviews are presented in [2], [8]. Most of these approaches use a dominant mode in the classifier strategy, that is, when the probability of the detected emotion is low, other channel information is used. In contrast to these works, this paper presents a multimodal approach where all the input modes are analyzed in real time and the fusion strategy consists of a Dynamic Bayesian Network classifier. This multimodal fusion is produced at the decision level (i.e., the information is integrated from the single modalities after being interpreted by their own classifiers).

III. MULTIMODAL EMOTION RECOGNITION SYSTEM

In this section, the proposed multimodal emotion recognition system is presented. The framework consists of two subsystems running in parallel that estimate human emotions using two independent DBNs. The facial expression recognition subsystem (see Fig. 1) is based on a fast and robust feature extraction approach where morphological and convolutional filters are applied consecutively to reduce the noise and the dependence on luminosity changes. After that, a Gabor filter is used for efficient edge detection. The output edge image of this filter bank is used to detect and extract scale-invariant facial features, which will be the input variables of the DBN to estimate the user's emotional state. In the second subsystem, the user's speech is analyzed in order to extract a set of independent descriptors. These descriptors are the input of a second DBN. Finally, the proposed system integrates the information associated with both methods in a third DBN, which estimates the final user emotion. The combination of both subsystems in this third DBN produces an estimate of the emotional state of the user, within four possible emotional states (happiness, sadness, anger, fear) and a non-emotional state (neutral). Each subsystem is described in detail in the following subsections.
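The paper does not detail the internal structure of this third DBN, but the decision-level fusion can be illustrated with a minimal Python sketch, assuming that each subsystem exposes a posterior distribution over the five emotional states and that the two channels are conditionally independent given the emotion; the function name, the uniform prior and the example values are illustrative assumptions, not the authors' implementation.

# Hypothetical decision-level fusion of the two subsystem outputs.
# Assumes conditional independence of the channels given the emotion
# and a uniform prior; this is NOT the exact third DBN of the paper.
EMOTIONS = ("neutral", "happiness", "sadness", "fear", "anger")

def fuse_posteriors(p_face, p_speech, prior=None):
    """Combine the per-channel posteriors P(FE|face) and P(FE|speech)."""
    if prior is None:
        prior = {e: 1.0 / len(EMOTIONS) for e in EMOTIONS}
    # P(FE|face, speech) is proportional to P(FE|face) * P(FE|speech) / P(FE)
    fused = {e: p_face[e] * p_speech[e] / prior[e] for e in EMOTIONS}
    total = sum(fused.values())
    return {e: v / total for e, v in fused.items()}

# Example: the visual channel is confident, the audio channel is ambiguous.
p_face = {"neutral": 0.10, "happiness": 0.70, "sadness": 0.10, "fear": 0.05, "anger": 0.05}
p_speech = {"neutral": 0.30, "happiness": 0.40, "sadness": 0.10, "fear": 0.10, "anger": 0.10}
print(fuse_posteriors(p_face, p_speech))

With such a rule, a confident channel dominates an ambiguous one, which reflects the redundancy argument made in the Introduction.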
A. Emotion recognition from Facial Expressions

The facial expression recognition system proposed in this work uses a video sequence acquired by the robot in real time. An overview of the proposed methodology is shown in Fig. 1. Each video frame is processed and a set of robust and invariant features of the user's face is detected. The proposed method consists of the following stages:

Fig. 1. Overview of the proposed facial expression recognition system. The approach flows from left to right. See the text for more details.

1) Data acquisition: Data acquisition for the recognition of facial expressions is based on the processing of a sequence of RGB images S_fr obtained from the robot camera for real-time interaction. First, the system recognizes the user's face within each frame fr(t) at the instant of time t, using the well-known Viola and Jones method [9] to obtain the initial region of interest of the face, ROI_I. The ROI_I is normalized to a fixed size and converted to gray scale to be pre-processed.
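As a rough illustration of this acquisition stage, the sketch below uses OpenCV's implementation of the Viola and Jones detector. The cascade model, the normalized ROI size and the detector parameters are assumptions made for the example; the paper does not report them.

import cv2

# Hypothetical parameters: the cascade model and the normalized ROI size
# are not specified in the paper.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
ROI_SIZE = (128, 128)

def acquire_face_roi(frame):
    """Detect the user's face in a BGR frame and return a normalized
    gray-scale ROI_I, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    roi = gray[y:y + h, x:x + w]
    return cv2.resize(roi, ROI_SIZE)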
2) Pre-processing: In this stage, the ROI_I is processed to remove noise, reducing its light dependence and eliminating unnecessary information. To eliminate errors in the detection and reduce the processing time of the system, the ROI_I is divided into two sub-regions, ROI_top and ROI_bottom, respectively. The usage of both sub-regions allows the elimination of irrelevant information (e.g., the nose) and divides the feature extraction process into two elements (eyes and mouth). Let ROI_I be the face image of size W x H, and let p_0 = (n, m) be the central pixel in the image, which estimates the approximate position of the nose in the image. Then, ROI_top and ROI_bottom are defined as selective copies of ROI_I as follows: ROI_top of size W x (m - N_Th) and ROI_bottom of size W x (m + N_Th), where N_Th is a user-fixed threshold. In order to detect and extract robust facial features in different environments, both ROI_top and ROI_bottom images are processed to reduce the effects of the light dependence. The method is based on the approach described in [10]. The processing sequence follows a set of consecutive stages: 1) gamma correction; 2) difference of Gaussians (DoG) filtering; 3) masking; and 4) contrast equalization. Next, a filter bank of median, blur and Gaussian filters is applied to mitigate the noise introduced in the images by beards, wounds or similar facial elements.
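A compact sketch of this pre-processing chain is given below, in the spirit of the illumination normalization of [10]. The gamma value, the DoG scales and the kernel sizes are illustrative choices, and the masking step is omitted; none of these parameters are reported in the paper.

import cv2
import numpy as np

def preprocess_roi(roi_gray, gamma=0.2, sigma0=1.0, sigma1=2.0):
    """Gamma correction, DoG filtering and contrast equalization,
    followed by a small median/blur/Gaussian filter bank.
    All parameter values are assumptions, not taken from the paper."""
    img = roi_gray.astype(np.float32) / 255.0
    # 1) Gamma correction.
    img = np.power(img, gamma)
    # 2) Difference of Gaussians (band-pass) filtering.
    dog = cv2.GaussianBlur(img, (0, 0), sigma0) - cv2.GaussianBlur(img, (0, 0), sigma1)
    # 4) Simple contrast equalization back to the 8-bit range
    #    (the masking step 3) is omitted in this sketch).
    norm = cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Noise-reduction filter bank: median, averaging and Gaussian filters.
    norm = cv2.medianBlur(norm, 3)
    norm = cv2.blur(norm, (3, 3))
    norm = cv2.GaussianBlur(norm, (3, 3), 0)
    return norm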
3) Gabor Filter: The Gabor filter is a fast and effective linear filter for the detection of edges with different orientations. In the proposed approach, the Gabor filter is used as a stage prior to the detection and extraction of facial features, which are extracted using the contours of the facial elements (i.e., the eyes, the mouth or the eyebrows). The Gabor impulse response in the spatial domain consists of a sinusoidal plane wave of some orientation and frequency, modulated by a two-dimensional Gaussian envelope. Let I(u, v) be the input image; then the output of the Gabor filter, G(u, v), is given by:

G(u, v) = \exp\left( -\frac{1}{2} \, \frac{u_\theta^2 + v_\theta^2}{\sigma^2} \right) \cos\left( 2\pi \frac{u_\theta}{\lambda} + \psi \right)    (1)

where \theta, \lambda and \psi are associated with the sinusoidal plane wave (orientation, wavelength and phase, respectively), and u_\theta and v_\theta are described as:

u_\theta = u \cos\theta + v \sin\theta
v_\theta = -u \sin\theta + v \cos\theta    (2)
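Equations (1) and (2) can be implemented directly. The following sketch builds a spatial Gabor kernel with NumPy and applies a small bank of orientations with OpenCV; the kernel size, the parameter values and the per-pixel maximum over orientations are assumptions made for illustration.

import cv2
import numpy as np

def gabor_kernel(ksize=21, sigma=4.0, theta=0.0, lambd=10.0, psi=0.0):
    """Spatial Gabor kernel following eqs. (1)-(2); the parameter values
    are examples, not those used in the paper."""
    half = ksize // 2
    v, u = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    u_t = u * np.cos(theta) + v * np.sin(theta)       # eq. (2)
    v_t = -u * np.sin(theta) + v * np.cos(theta)
    envelope = np.exp(-0.5 * (u_t ** 2 + v_t ** 2) / sigma ** 2)
    carrier = np.cos(2.0 * np.pi * u_t / lambd + psi)
    return envelope * carrier                         # eq. (1)

def gabor_edges(image, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter the pre-processed ROI with a small bank of orientations
    and keep the strongest response per pixel."""
    responses = [cv2.filter2D(image.astype(np.float32), -1, gabor_kernel(theta=t))
                 for t in orientations]
    return np.max(np.abs(np.stack(responses)), axis=0)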
4) Feature extraction: The crucial step in an automatic facial expression recognition system is the extraction of relevant features from the image, F_I = {f_i^I | i = 1..m}. In the proposed work, a set of edge-based features is extracted, which is invariant to scale or distance from the user to the robot. Each of the features is directly related to the Action Units (AUs) described by the Facial Action Coding System (FACS) [5]. A set of independent and antagonistic AUs has been used in this paper (see Fig. 2; AU1 and AU4 are related to distortions of the eyebrows, and they are antagonistic and independent). In this approach, only three features are defined in the edge face image, which are associated with the Euclidean distances between the upper contour of the eyebrows and the lower edge of the eyes (dA), between the lip corners (dB), and between the upper and lower contours of the mouth (dC), respectively. These features are normalized using the values extracted from the neutral state, which allows the system to be independent of the scale or distance of the user to the sensor. Fig. 3(a) illustrates the ROI of the face in the image. The image is processed according to the method described in this section. Results after applying light normalization, noise removal methods and Gabor filtering are shown in Fig. 3(b). Fig. 3(b) also illustrates the set of extracted features, labelled as dA (yellow), dB (brown) and dC (violet).

Fig. 2. Action Units (AUs) used in this paper.

Fig. 3. Edge-based facial feature extraction: a) Region of interest in the face image (ROI_top and ROI_bottom); and b) Features extracted in the image.

5) Dynamic Bayesian Network classifier: In order to classify the Facial Expression (FE) produced by the user's face, a Dynamic Bayesian Network is proposed, where the overall classification result achieved is the one foreseen by the belief variable FE, in the scope (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]).

This Bayesian approach is based on the detection of 11 AUs with antagonistic and exclusive features as leaves of the DBN, which allows us to reduce the size of the DBN to 7 variables that group these AUs. Thus, these 7 variables are obtained from combinations of the extracted features of the previous process: dA, dB and dC. In this case, it is assumed that these 7 leaf variables are independent given the facial expression (FE). Although some muscular movements from one area of the face may slightly affect other areas, this small influence cannot be detected by the robot cameras. Besides, for correct detection of each emotional state it is necessary that each AU achieves a minimum threshold intensity of B (i.e., slight evidence), within the intensity range described in FACS [5].

Fig. 4 illustrates the structure of the two-level Bayesian network, and the time influence that characterizes this DBN. The first level shows the estimation of the user's emotional state FE, while the second level shows the 7 variables based on the AUs. In addition, one of the main features worth mentioning about the Bayesian network is the need to provide it with learning data. The most common method is that each new sample is matched using a threshold. In this work, to avoid the existing gaps, a pre-processing stage is done before the learning stage, fitting a Gaussian distribution to the data. The learning data acquisition was performed by a supervisor, who collected the samples of the 7 random variables manually, correctly classifying them. The leaf random variables of the model, and their respective virtual scopes, are shown in Table I.

Fig. 4. Facial Expression Dynamic Bayesian Network. Two time intervals are shown.

TABLE I
LEAF VARIABLES OF THE DBN

Variable | Action Units (AUs) | Element of the face
EB       | AU1, AU4           | Eyebrows
Ch       | AU6                | Cheeks
LE       | AU7                | Lower Eyelids
LC       | AU12, AU15         | Lip Corners
CB       | AU17               | Chin Boss
MF       | AU20, AU23         | Mouth's Form
MA       | AU24, AU25         | Mouth's Aperture

Table II shows the relationship between the basic emotional states of a user and the Action Units associated with different facial distortions. In this case, these AUs depend on the few deformable or mobile elements of the face, such as the mouth, eyes and eyebrows. This property of deformation or movement of facial elements is what allows the analysis of a muscular change associated with a specific emotion; otherwise there would be no facial expressions. However, elements such as the nose do not present specific changes in facial expressions.

TABLE II
RELATIONSHIP BETWEEN ACTION UNITS (FACIAL FEATURES) AND THE DIFFERENT EMOTIONAL STATES OF THIS SYSTEM (INFORMATION COLLECTED FROM [3])

Emotion   | Action Units (AUs)
Sadness   | AU1 - AU4 - AU15 - AU17
Happiness | AU6 - AU12 - AU25
Fear      | AU1 - AU4 - AU20 - AU25
Anger     | AU4 + AU7 - AU17 - AU23 - AU24
Neutral   | —

In this approach, the data (D) in the classification process is obtained according to the following setup:

D = ((x_1, y_1), ..., (x_n, y_n)), \quad x_i \in \mathbb{R}^d, \; y_i \in \mathbb{R}    (3)

Consider that y_1 to y_5 are the five possible emotional states (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]), and each dimension of x corresponds to one of the previously described random variables, namely: EB, Ch, LE, LC, CB, MF and MA. Since the learning data may have gaps between its samples, a model is built assuming that (X_1, ..., X_n) are independent given FE, and

X_i \sim N(prior^T x_i, \sigma^2)    (4)

At first, prior \sim U(1/n); however, throughout the iterations, the posterior at t - 1 becomes the prior at t. Finally, the posterior equation is obtained using Bayes' rule:

P(FE | x_m) = \frac{ \prod_{i=1}^{n} P(x_i | FE) \cdot P(FE) }{ P(x_m) }    (5)

where x_m is the most recent visual information acquired. The denominator can be computed using the Bayesian marginalization rule:

P(x_m) = \sum_{FE} \prod_{i=1}^{q} P(x_i | FE) \cdot P(FE)    (6)

q = 7 being the number of random variables of the system.

The dynamic properties of the network cause a convergence over time. The resultant histogram from the previous frame is used as prior knowledge for the current frame. Each classification is considered correct if it converges within a maximum of 5 frames, exceeding a threshold of 80%. Otherwise, if after 5 frames no value is higher than the threshold, the classifier selects the highest probability value (usually referred to as the Maximum a Posteriori decision in Bayesian theory) as the result.
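The recursive update of eqs. (5)-(6), together with the prior propagation and the 80% / 5-frame convergence rule, can be sketched as follows. The Gaussian likelihood parameters would come from the supervised learning data described above; here the class name and all parameter values are placeholders, not the authors' trained model.

import numpy as np

EMOTIONS = ("neutral", "happiness", "sadness", "fear", "anger")

class RecursiveEmotionClassifier:
    """Naive-Bayes style recursive update: the posterior of frame t-1
    becomes the prior of frame t (eqs. (5)-(6)). Placeholder parameters."""

    def __init__(self, means, variances, threshold=0.8, max_frames=5):
        # means[e], variances[e]: arrays of length 7, one value per leaf
        # variable (EB, Ch, LE, LC, CB, MF, MA), learned per emotion e.
        self.means, self.variances = means, variances
        self.threshold, self.max_frames = threshold, max_frames
        self.reset()

    def reset(self):
        self.prior = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))
        self.frames = 0

    def update(self, x):
        """x: the 7 observed leaf variables for the current frame."""
        # Gaussian likelihood of each variable given the emotion (cf. eq. (4)).
        likelihood = np.array([
            np.prod(np.exp(-0.5 * (x - self.means[e]) ** 2 / self.variances[e])
                    / np.sqrt(2.0 * np.pi * self.variances[e]))
            for e in EMOTIONS])
        posterior = likelihood * self.prior
        posterior /= posterior.sum()          # marginalization, eq. (6)
        self.prior = posterior                # posterior(t-1) becomes prior(t)
        self.frames += 1
        if posterior.max() >= self.threshold or self.frames >= self.max_frames:
            result = EMOTIONS[int(posterior.argmax())]   # MAP decision
            self.reset()
            return result
        return None                           # keep accumulating evidence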
B. Emotion recognition from Speech

Emotion recognition using speech as input poses several subproblems, such as the detection, extraction and characterization of a set of significant acoustic features to recognize emotions. Therefore, the proposed approach has a structure similar to that of the subsystem described in Sec. III-A; thus, like the former, a set of acoustic features is extracted and used as input to a DBN classifier. The proposed methodology is based on the work presented in [11], which studied the influence of pitch, energy and speech-rate on the activation and also the intensity of some emotions. Fig. 5 illustrates an overview of the proposed system. The system is divided into the following stages:

Fig. 5. Overview of the proposed emotion recognition system from speech. (Image acquired from the publication [12].)

1) Detection of the Speech: The audio signal is pre-processed in order to detect the presence or absence of speech in real time, within a communication between a human and a robot. This stage was achieved by using the audio library SoX [13], which allows the detection of voice and silences in accordance with the sensitivity of the microphone. Specifically, the library analyzes and processes the original audio stream from the microphone through a Voice Activity Detection (VAD) feature based on the measurement of the power cepstrum, since this function is responsible for removing noise, silence or any type of sound not related to the human voice. Thus, for a time signal x(t) corresponding to an audio signal acquired by the robot, the power cepstrum C(\tau) [14] is given by the equation:

C(\tau) = F^{-1}\left\{ \log\left( |F(x(t))|^2 \right) \right\}    (7)

... according to eq. (8) for each spectral frame:

Y(\omega) = \prod^{i=R} |X(\omega r)|^2, \quad 0 \leq \omega <
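Although the implementation relies on SoX for voice activity detection, the power cepstrum of eq. (7) can be computed directly from a frame of audio samples. The following NumPy sketch, together with a very rough energy-based speech decision, is only an illustration; the frame handling and the threshold are assumptions and do not reproduce the SoX VAD criterion.

import numpy as np

def power_cepstrum(frame):
    """Cepstrum of one audio frame following eq. (7): inverse FFT of the
    log power spectrum of the signal."""
    spectrum = np.fft.rfft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)   # small offset avoids log(0)
    return np.fft.irfft(log_power)

def is_speech(frame, energy_threshold=1e-3):
    """Very rough per-frame speech decision on normalized samples; the
    threshold is an illustrative assumption, not the SoX criterion."""
    return float(np.mean(frame.astype(np.float64) ** 2)) > energy_threshold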