=Paper=
{{Paper
|id=Vol-1540/paper_01
|storemode=property
|title=A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction
|pdfUrl=https://ceur-ws.org/Vol-1540/paper_01.pdf
|volume=Vol-1540
|dblpUrl=https://dblp.org/rec/conf/iros/BurgosMT15a
}}
==A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction==
Workshop on Multimodal Semantics for Robotic Systems (MuSRobS)
IEEE/RSJ International Conference on Intelligent Robots and Systems 2015
A Novel Multimodal Emotion Recognition Approach for Affective Human Robot Interaction
Felipe Cid, Luis J. Manso and Pedro Núñez
L. J. Manso and P. Núñez are members of the Robotics and Artificial Vision Lab (RoboLab Group), University of Extremadura, Spain (e-mail: lmanso@unex.es; pnuntru@unex.es). F. Cid is with the Institute of Electrical and Electronics, Universidad Austral de Chile, Chile (e-mail: felipe.cid@uach.cl).

Abstract— Facial expressions and speech are elements that provide emotional information about the user through multiple communication channels. In this paper, a novel multimodal emotion recognition system based on visual and auditory information processing is proposed. The proposed approach is used in real affective human robot communication in order to estimate five different emotional states (i.e., happiness, anger, fear, sadness and neutral), and it consists of two subsystems with a similar structure. The first subsystem achieves a robust facial feature extraction based on filters applied consecutively to the edge image and the use of a Dynamic Bayesian Classifier. A similar classifier is used in the second subsystem, where the input is associated to a set of speech descriptors, such as speech-rate, energy and pitch. Both subsystems are finally combined in real time. The results of this multimodal approach show the robustness and accuracy of the methodology with respect to single emotion recognition systems.

I. INTRODUCTION

In the last decade, Human Robot Interaction (HRI) has become one of the most important issues in social robotics. Within HRI, one of the main objectives is the development of methodologies focused on non-invasive techniques based on natural language. This would allow the robot to interact with users in a similar way to humans, through multimodal systems that combine information from different channels. In order to socially interact with humans, a robotic system should be able not only to understand users' behaviour and intentions, but also to estimate their emotional state. Knowing and understanding these human emotions helps social robots adapt the communication in real time, improving and enriching the interaction [1]. This kind of HRI is usually known as affective HRI.

The design of social robots looks for developing natural interfaces for an affective interaction. In this context, most of the current affective HRI techniques use a single information channel, called mode, such as speech, facial expressiveness or body language. However, an emotion is generally expressed through several modalities. In multimodal affective interaction, the user can communicate his/her emotional state to the robot using several different input channels [2]. Contrary to approaches where each channel introduces complementary information into the emotion recognition system, most multimodal systems use these channels as redundant information. This redundancy is useful in real interaction scenarios, where, for instance, errors associated to noise or occlusions can be reduced.

This work presents a novel multimodal emotion recognition system for affective HRI. The proposed approach is based on a real-time multimodal system that integrates speech and facial expression analysis. The main contribution of this work is a robust feature extraction approach for facial expression recognition. In this paper, these facial features are a combination of independent and antagonistic distortions of the face. Besides, a novel acoustic analysis of speech is used to extract features relevant to emotion. Both visual and acoustic features constitute the input of two Dynamic Bayesian Networks (DBN) [3], which classify them into a set of basic emotions. The output of each DBN estimates the user's emotional state based on the extracted features, the emotional states available in each DBN being: happiness, sadness, anger, fear and neutral (non-emotional state). Finally, the purpose of this system is to integrate the output information of each subsystem in a third DBN that analyzes the information associated with each mode. This multimodal methodology uses redundant information to estimate accurate and robust results on the user's emotional state.

This paper is organized as follows: after discussing known approaches to emotion recognition systems from facial expressions, speech or multimodal systems in Section II, Section III presents an overview of the proposed Multimodal Emotion Recognition System. In Section IV, the experimental results are pointed out, and finally, Section V describes the conclusions and future work of the approach.

II. RELATED WORKS

In the field of HRI, different automatic emotion recognition systems have been studied. Most of these approaches are based on single information channel analysis, such as video sequences or audio signals. Independently of the nature of the information source, the raw data is processed and a set of features is extracted. Then, these features are classified into different categories, i.e., emotions. On one hand, facial expressions have been commonly used to detect and recognize human emotions. An interesting and updated review was shown in [4]. Commonly, these frameworks use the Facial Action Coding System (FACS) proposed by Ekman et al. [5], which is based on facial muscle deformations. On the other hand, speech has also been used for emotion recognition (see the review [6]). These systems are usually focused on variations of acoustic variables that are related to emotions.
Approaches that only use visual or speech information individually usually fail in real scenarios. Light conditions, shadows or occlusions, among others, are typical situations where the accuracy of the results decreases for visual systems. In a similar way, environmental noise or people moving while talking are error sources in audio systems. Therefore, for an efficient affective HRI, several authors have focused their attention on multimodal systems that recognize the emotional state of the user from different modalities or information sources, such as face, gesture, body language, speech or physiological signals, among others. For instance, in [3] the authors develop an architecture for multimodal emotion recognition where facial expressiveness and speech are used. In [7], Kessous et al. propose a system that fuses body and facial language, and speech. Also, interesting reviews are presented in [2], [8]. Most of these approaches use a dominant mode in the classifier strategy, that is, when the probability of the detected emotion is low, the information of the other channel is used. In contrast to these works, this paper presents a multimodal approach where all the input modes are analyzed in real time and the fusion strategy consists of a Dynamic Bayesian Network classifier. This multimodal fusion is produced at the decision level (i.e., the information is integrated from the single modalities after being interpreted by their own classifiers).

III. MULTIMODAL EMOTION RECOGNITION SYSTEM

In this section, the proposed multimodal emotion recognition system is presented. The framework consists of two subsystems running in parallel that estimate human emotions using two independent DBNs. The facial expression recognition subsystem (see Fig. 1) is based on a fast and robust feature extraction approach, where morphological and convolutional filters are applied consecutively to reduce the noise and the dependence on luminosity changes. After that, a Gabor filter is used for efficient edge detection. The output edge image of this filter bank is used to detect and extract scale-invariant facial features, which will be the input variables of the DBN that estimates the user's emotional state. In the second subsystem, the user's speech is analyzed in order to extract a set of independent descriptors. These descriptors are the input of a second DBN. Finally, the proposed system integrates the information associated to both methods in a third DBN, which estimates the final user emotion. The output of both subsystems and of the third DBN is an estimate of the user's emotional state, within four possible emotional states (happiness, sadness, anger, fear) and a non-emotional state (neutral). Each subsystem is described in detail in the following subsections.
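The decision-level fusion just described can be pictured with a small sketch: each single-mode classifier outputs a posterior distribution over the five emotional states and a third classifier combines them. The code below is only an illustrative sketch and not the authors' implementation; the product-of-posteriors combination, the fallback to averaging and the example numbers are assumptions.

```python
import numpy as np

EMOTIONS = ["neutral", "happiness", "sadness", "fear", "anger"]

def fuse_decisions(p_face, p_speech):
    """Decision-level fusion sketch: combine the per-mode posteriors
    (already produced by the face and speech classifiers) into a single
    distribution over the five emotional states."""
    p_face = np.asarray(p_face, dtype=float)
    p_speech = np.asarray(p_speech, dtype=float)
    fused = p_face * p_speech              # element-wise product of posteriors
    if fused.sum() == 0.0:                 # degenerate case: fall back to the average
        fused = 0.5 * (p_face + p_speech)
    return fused / fused.sum()             # renormalize to a probability vector

# Example (illustrative values): face strongly suggests happiness, speech is less certain.
p_face = [0.05, 0.70, 0.10, 0.05, 0.10]
p_speech = [0.20, 0.40, 0.20, 0.10, 0.10]
print(EMOTIONS[int(np.argmax(fuse_decisions(p_face, p_speech)))])  # -> happiness
```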
A. Emotion recognition from Facial Expressions

The facial expression recognition system proposed in this work uses a video sequence acquired by the robot in real time. An overview of the proposed methodology is shown in Fig. 1. Each video frame is processed and a set of robust and invariant features of the user's face is detected. The proposed method consists of the following stages:

Fig. 1. Overview of the proposed facial expression recognition system. The approach flows from left to right. See the text for more details.

1) Data acquisition: Data acquisition for the recognition of facial expressions is based on the processing of a sequence of RGB images S_fr obtained from the robot camera for real-time interaction. First, the system recognizes the user's face within each frame fr(t) at time instant t, using the well-known Viola and Jones method [9] to obtain the initial region of interest of the face, ROI_I. The ROI_I is normalized to a fixed size and converted to gray scale before being pre-processed.
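As a rough illustration of this acquisition stage, the sketch below detects a face with OpenCV's stock Viola-Jones cascade and crops a fixed-size grayscale region of interest. The file of the cascade, the 128x128 output size and the "keep the largest detection" policy are assumptions, not values taken from the paper.

```python
import cv2

# Stock Haar cascade shipped with OpenCV (Viola-Jones detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def acquire_face_roi(frame_bgr, size=(128, 128)):
    """Detect the largest face in a frame and return it as a fixed-size
    grayscale region of interest (ROI_I in the paper's notation), or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], size)
```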
2) Pre-processing: In this stage, the ROI_I is processed to remove noise, reducing its light dependence and eliminating unnecessary information. To eliminate errors in the detection and to reduce the processing time of the system, the ROI_I is divided into two sub-regions, ROI_top and ROI_bottom, respectively. The usage of both sub-regions allows the elimination of irrelevant information (e.g., the nose) and divides the feature extraction process into two elements (eyes and mouth). Let ROI_I be the face image of size W x H, and let p0 = (n, m) be the central pixel in the image, which estimates the approximate position of the nose. Then, ROI_top and ROI_bottom are defined as selective copies of ROI_I as follows: ROI_top of size W x (m − N_Th) and ROI_bottom of size W x (m + N_Th), where N_Th is a user-fixed threshold. In order to detect and extract robust facial features in different environments, both ROI_top and ROI_bottom images are processed to reduce the effects of the light dependence. The method is based on the approach described in [10]. The processing sequence follows a set of consecutive stages: 1) gamma correction; 2) difference of Gaussians (DoG) filtering; 3) masking; and 4) contrast equalization. Next, a filter bank of median, blur and Gaussian filters is applied to mitigate the noise introduced in the images by beards, wounds or similar facial elements.
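A minimal sketch of this pre-processing chain is given below, assuming the illumination-normalization stages listed above (gamma correction, DoG filtering, contrast equalization) in the spirit of [10]. The concrete parameter values (gamma, Gaussian sigmas, kernel sizes, N_Th) and the tanh compression step are assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def split_roi(roi, n_th=10):
    """Split ROI_I around its vertical centre m into ROI_top (eyes/eyebrows)
    and ROI_bottom (mouth); n_th plays the role of the threshold N_Th."""
    h, _ = roi.shape
    m = h // 2
    return roi[:m - n_th, :], roi[m + n_th:, :]

def normalize_illumination(roi, gamma=0.2, sigma0=1.0, sigma1=2.0):
    """Illumination normalization: gamma correction, difference-of-Gaussians
    filtering and a crude contrast equalization."""
    img = np.power(roi.astype(np.float32) / 255.0, gamma)      # 1) gamma correction
    dog = (cv2.GaussianBlur(img, (0, 0), sigma0)
           - cv2.GaussianBlur(img, (0, 0), sigma1))             # 2) DoG filtering
    dog /= np.mean(np.abs(dog)) + 1e-6                          # 4) contrast equalization
    dog = np.tanh(dog)                                          # compress extreme values
    return cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def denoise(roi):
    """Median / blur / Gaussian filter bank to attenuate beard-like noise."""
    roi = cv2.medianBlur(roi, 3)
    roi = cv2.blur(roi, (3, 3))
    return cv2.GaussianBlur(roi, (3, 3), 0)
```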
3) Gabor Filter: The Gabor filter is a fast and effective linear filter for the detection of edges with different orientations. In the proposed approach, the Gabor filter is used as a stage prior to the detection and extraction of facial features, which are extracted using the contours of the facial elements (i.e., the eyes, the mouth or the eyebrows). The Gabor impulse response in the spatial domain consists of a sinusoidal plane wave of some orientation and frequency, modulated by a two-dimensional Gaussian envelope. Let I(u, v) be the input image; then the output of the Gabor filter, G(u, v), is given by:

G(u, v) = exp( −(u_θ² + v_θ²) / (2σ²) ) · cos( 2π·u_θ/λ + ψ )    (1)

where θ, λ and ψ are associated to the sinusoidal plane wave (orientation, wavelength and phase, respectively), and u_θ and v_θ are described as:

u_θ = u·cos θ + v·sin θ,  v_θ = −u·sin θ + v·cos θ    (2)
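For illustration, OpenCV's getGaborKernel builds exactly the kind of kernel defined by eqs. (1)-(2). The sketch below applies a small bank of such filters and keeps the strongest per-pixel response as the edge image; the specific kernel size, sigma, wavelength and the choice of four orientations are assumptions rather than the authors' settings.

```python
import cv2
import numpy as np

def gabor_edge_image(roi, ksize=9, sigma=2.0, lambd=6.0, psi=0.0):
    """Apply a bank of Gabor filters (eqs. 1-2) at several orientations and
    keep, per pixel, the strongest magnitude response as the edge image."""
    responses = []
    for theta in np.arange(0, np.pi, np.pi / 4):   # 0, 45, 90, 135 degrees
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 1.0, psi)
        responses.append(cv2.filter2D(roi.astype(np.float32), -1, kernel))
    edges = np.max(np.abs(np.stack(responses)), axis=0)
    return cv2.normalize(edges, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```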
4) Feature extraction: The crucial step in an automatic facial expression recognition system is the extraction of relevant features from the image, F_I = {f_i^I | i = 1..m}. In the proposed work, a set of edge-based features is extracted, which is invariant to the scale or distance from the user to the robot. Each one of the features is directly related to the Action Units (AUs) described by the Facial Action Coding System (FACS) [5]. A set of independent and antagonistic AUs has been used in this paper (see Fig. 2; AU1 and AU4 are related to distortions of the eyebrows, and they are antagonistic and independent). In this approach, only three features are defined in the edge face image, which are associated to the Euclidean distance between the upper contour of the eyebrows and the lower edge of the eyes (dA), the lip corners (dB), and the upper and lower contour of the mouth (dC), respectively. These features are normalized using the values extracted from the neutral state, which allows the system to be independent of the scale or distance of the user to the sensor. Fig. 3(a) illustrates the ROI of the face in the image. The image is processed according to the method described in this section. Results after applying light normalization, noise removal methods and Gabor filtering are shown in Fig. 3(b). Fig. 3(b) also illustrates the set of extracted features, labelled as dA (yellow), dB (brown) and dC (violet).

Fig. 2. Action Units (AUs) used in this paper.

Fig. 3. Edge-based facial feature extraction: a) Region of interest in the face image (ROI_top and ROI_bottom); and b) Features extracted in the image.
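A sketch of how dA, dB and dC could be computed and normalized is shown below. The localization of the contour points on the Gabor edge image is abstracted away, and the dictionary keys ('brow', 'eye', 'lip_left', etc.) are hypothetical names introduced only for this illustration; only the distance definitions and the normalization by the neutral state come from the text above.

```python
import numpy as np

def euclidean(p, q):
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def facial_features(points, neutral):
    """Scale-invariant features dA, dB, dC from already-located contour points.
    `points` and `neutral` map hypothetical landmark names to (x, y) pairs;
    each distance is divided by its value in the neutral expression."""
    def raw(p):
        dA = euclidean(p["brow"], p["eye"])                 # eyebrow-to-eye distance
        dB = euclidean(p["lip_left"], p["lip_right"])       # lip-corner distance
        dC = euclidean(p["mouth_top"], p["mouth_bottom"])   # mouth aperture
        return np.array([dA, dB, dC])
    return raw(points) / raw(neutral)                       # normalization by the neutral state
```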
5) Dynamic Bayesian Network classifier: In order to classify the Facial Expression (FE) produced by the user's face, a Dynamic Bayesian Network is proposed, where the overall classification result is the one foreseen by the belief variable FE, in the scope (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]).

This Bayesian approach is based on the detection of 11 AUs with antagonistic and exclusive features as leaves of the DBN, which allows us to reduce the size of the DBN to 7 variables that group these AUs. Thus, these 7 variables are obtained from combinations of the features extracted in the previous process: dA, dB and dC. In this case, it is assumed that these 7 leaf variables are independent given the facial expression (FE). Although some muscular movements from one area of the face may slightly affect other areas, this small influence could not be detected by the robot cameras. Besides, for the correct detection of each emotional state it is necessary that each AU achieves a minimum threshold intensity of B (i.e., slight evidence), within the intensity range described in FACS [5].

Fig. 4 illustrates the structure of the two-level Bayesian network and the time influence that characterizes this DBN. The first level shows the estimation of the user's emotional state FE, while the second level shows the 7 variables based on the AUs. In addition, one of the main features worth mentioning about the Bayesian network is the need to provide it with learning data. The most common method is that each new sample is matched using a threshold. In this work, to avoid the extant gaps, a pre-processing stage is done before the learning stage, fitting a Gaussian distribution to the data. The learning data acquisition was performed by a supervisor, who collected the samples of the 7 random variables manually, classifying them correctly. The leaf random variables of the model, and their respective virtual scopes, are shown in Table I.

Fig. 4. Facial Expression Dynamic Bayesian Network. Two time intervals are shown.

TABLE I. Leaf variables of the DBN
Variable | Action Units (AUs) | Element of the face
EB | AU1, AU4 | Eyebrows
Ch | AU6 | Cheeks
LE | AU7 | Lower Eyelids
LC | AU12, AU15 | Lip Corners
CB | AU17 | Chin Boss
MF | AU20, AU23 | Mouth's Form
MA | AU24, AU25 | Mouth's Aperture

Table II shows the relationship between the basic emotional states of a user and the Action Units associated with different facial distortions. In this case, these AUs depend on the few deformable or mobile elements of the face, such as the mouth, eyes and eyebrows. This deformation or movement of facial elements is what allows analyzing a muscular change associated with a specific emotion; otherwise there would be no facial expressions. However, elements such as the nose do not present specific changes in facial expressions.

TABLE II. Relationship between Action Units (facial features) and the different emotional states of this system (information collected from [3]).
Emotion | Action Units (AUs)
Sadness | AU1 - AU4 - AU15 - AU17
Happiness | AU6 - AU12 - AU25
Fear | AU1 - AU4 - AU20 - AU25
Anger | AU4 - AU7 - AU17 - AU23 - AU24
Neutral | —
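Purely as an illustration, Tables I and II can be encoded as plain data structures for use by a classifier. This is not the authors' code; it is just a direct transcription of the two tables above.

```python
# Leaf variables of the DBN and the AUs they group (Table I).
LEAF_VARIABLES = {
    "EB": ("AU1", "AU4"),        # eyebrows
    "Ch": ("AU6",),              # cheeks
    "LE": ("AU7",),              # lower eyelids
    "LC": ("AU12", "AU15"),      # lip corners
    "CB": ("AU17",),             # chin boss
    "MF": ("AU20", "AU23"),      # mouth's form
    "MA": ("AU24", "AU25"),      # mouth's aperture
}

# AU patterns associated with each emotional state (Table II).
EMOTION_AUS = {
    "sadness":   {"AU1", "AU4", "AU15", "AU17"},
    "happiness": {"AU6", "AU12", "AU25"},
    "fear":      {"AU1", "AU4", "AU20", "AU25"},
    "anger":     {"AU4", "AU7", "AU17", "AU23", "AU24"},
    "neutral":   set(),
}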
In this approach, the data D in the classification process is obtained according to the following setup:

D = ((x_1, y_1), ..., (x_n, y_n)),  x_i ∈ R^d,  y_i ∈ R    (3)

Consider that y_1 to y_5 are the five possible emotional states (FE[neutral], FE[happiness], FE[sadness], FE[fear], FE[anger]), and that each dimension of x corresponds to one of the previously described random variables, namely: EB, Ch, LE, LC, CB, MF and MA. Since the learning data may have gaps between its samples, a model is built assuming that (X_1, ..., X_n) are independent given FE, and

X_i ∼ N(prior^T · x_i, σ²)    (4)

At first, prior ∼ U(1/n); however, throughout the iterations, the posterior at t − 1 becomes the prior at t. Finally, the posterior equation is obtained using Bayes' rule:

P(FE | x_m) = ( ∏_{i=1}^{n} P(x_i | FE) · P(FE) ) / P(x_m)    (5)

where x_m is the most recent visual information acquired. The denominator can be computed using the Bayesian marginalization rule:

P(x_m) = Σ_{FE} ∏_{i=1}^{q} P(x_i | FE) · P(FE)    (6)

with q = 7 being the number of random variables of the system. The dynamic properties of the network cause a convergence over time: the resultant histogram from the previous frame is used as prior knowledge for the current frame. Each classification is considered correct if it converges within a maximum of 5 frames, exceeding a threshold of 80%. Otherwise, if after 5 frames no value is higher than the threshold, the classifier selects the highest probability value (usually referred to as the Maximum a Posteriori decision in Bayesian theory) as the result.
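The recursive update of eqs. (5)-(6) together with the 5-frame / 80% decision rule can be sketched as follows. This is a simplified naive-Bayes reading of the classifier, not the authors' implementation; the Gaussian means and variances are placeholders that would come from the supervised learning data, and the emotion ordering is arbitrary.

```python
import numpy as np

EMOTIONS = ["neutral", "happiness", "sadness", "fear", "anger"]

def gaussian_likelihood(x, mean, var):
    """P(x_i | FE) under a Gaussian model fitted to the supervised samples."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def update_posterior(prior, x, means, variances):
    """One step of eqs. (5)-(6): naive-Bayes update of P(FE | x_m), where
    `means` and `variances` have shape (5 emotions, 7 leaf variables)."""
    likelihood = np.prod(gaussian_likelihood(x, means, variances), axis=1)
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()      # division by P(x_m), eq. (6)

def classify_sequence(frames, means, variances, threshold=0.8, max_frames=5):
    """Feed up to `max_frames` feature vectors; the posterior of one frame is
    the prior of the next. Stop early if any emotion exceeds the threshold,
    otherwise return the MAP decision after the last frame."""
    posterior = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))   # uniform initial prior
    for x in frames[:max_frames]:
        posterior = update_posterior(posterior, np.asarray(x), means, variances)
        if posterior.max() >= threshold:
            break
    return EMOTIONS[int(np.argmax(posterior))], posterior
```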
B. Emotion recognition from Speech

Fig. 5. Overview of the proposed emotion recognition system from speech. (Image acquired from the publication [12].)

Emotion recognition using speech as input poses several subproblems, such as the detection, extraction and characterization of a set of significant acoustic features to recognize emotions. Therefore, the proposed approach has a structure similar to that of the subsystem described in Sec. III-A: like the former, a set of acoustic features is extracted and used as input to a DBN classifier. The proposed methodology is based on the work presented in [11], which studied the influence of pitch, energy and speech-rate on the activation and also the intensity of some emotions. Fig. 5 illustrates an overview of the proposed system. The system is divided into the following stages:

1) Detection of the Speech: The audio signal is pre-processed in order to detect the presence or absence of speech in real time, within a communication between a human and a robot. This stage was achieved by using the audio library SoX [13], which allows the detection of voice and silences in agreement with the sensitivity of the microphone. Specifically, the library analyzes and processes the original audio stream from the microphone through a VAD (Voice Activity Detection) feature based on the measurement of the power cepstrum; this function is responsible for removing the noise, silence or any type of sound not related to the human voice.
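SoX does ship a `vad` effect for trimming leading non-speech, but the paper does not give the exact invocation, so the sketch below is only an assumed illustration: it applies `vad`, reverses the audio, applies `vad` again and reverses back to also remove trailing non-speech. File names and default effect parameters are assumptions.

```python
import subprocess

def trim_non_speech(in_wav, out_wav):
    """Illustrative use of SoX's voice activity detection to keep only the
    speech portion of a recording (paths and parameters are assumptions)."""
    subprocess.run(
        ["sox", in_wav, out_wav, "vad", "reverse", "vad", "reverse"],
        check=True)

# Example (hypothetical file names):
# trim_non_speech("utterance_raw.wav", "utterance_speech.wav")
```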
Thus, for a signal in time, x(t), corresponding to an audio signal acquired by the robot, the power cepstrum C(τ) [14] is given by:

C(τ) = | F⁻¹{ log( |F{x(t)}|² ) } |²    (7)

… according to eq. 8 for each spectral frame:

Y(ω) = ∏_{r=1}^{R} |X(ωr)|²,  0 ≤ ω
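A direct NumPy transcription of the power cepstrum of eq. (7) is sketched below. The frame length, sampling rate, Hann windowing and the small constant added before the logarithm are assumptions made only to keep the example runnable.

```python
import numpy as np

def power_cepstrum(frame):
    """Power cepstrum of one audio frame, as in eq. (7):
    C(tau) = | IFFT( log |FFT(x)|^2 ) |^2."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))  # windowing is an assumption
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)       # avoid log(0)
    return np.abs(np.fft.irfft(log_power)) ** 2

# Example on a synthetic 25 ms frame at 16 kHz (values are illustrative only):
frame = np.sin(2 * np.pi * 200 * np.arange(400) / 16000.0)
cep = power_cepstrum(frame)
```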