=Paper=
{{Paper
|id=Vol-3092/p11
|storemode=property
|title=Automatic RGB Inference Based on Facial Emotion Recognition
|pdfUrl=https://ceur-ws.org/Vol-3092/p11.pdf
|volume=Vol-3092
|authors=Nicolo’ Brandizzi,Valerio Bianco,Giulia Castro,Samuele Russo,Agata Wajda
|dblpUrl=https://dblp.org/rec/conf/system/BrandizziBCRW21
}}
==Automatic RGB Inference Based on Facial Emotion Recognition==
Nicolo’ Brandizzi 1, Valerio Bianco 1, Giulia Castro 1, Samuele Russo 2 and Agata Wajda 3
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
2 Department of Psychology, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
3 Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland
Abstract
Recently, Facial Emotion Recognition (FER) has become one of the most promising and fastest-growing fields in computer vision and human-robot interaction. In this work, a deep neural network is introduced to address the problem of facial emotion recognition. In particular, a CNN+RNN architecture is designed to capture both the spatial features and the temporal dynamics of facial expressions. Experiments are performed on the CK+ dataset. Furthermore, we present a possible application of the proposed Facial Emotion Recognition system in human-robot interaction: a method for dynamically changing ambient light or LED colors based on the recognized emotion. Indeed, it has been shown that equipping robots with the ability to perceive emotions and to react accordingly through suitable empathic strategies significantly improves human-robot interaction. Possible application scenarios include education, healthcare and autism therapy, where such empathic strategies play a fundamental role.
Keywords
Facial Emotion Recognition, Human-Robot Interaction
SYSTEM 2021 @ Scholar’s Yearly Symposium of Technology, Engineering and Mathematics. July 27–29, 2021, Catania, IT
brandizzi@diag.uniroma1.it (N. Brandizzi); castro.1742813@studenti.uniroma1.it (G. Castro); samuele.russo@studenti.uniroma1.it (S. Russo); agata.wajda@polsl.pl (A. Wajda)
ORCID: 0000-0002-1846-9996 (S. Russo); 0000-0002-1667-3328 (A. Wajda)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Facial Emotion Recognition (FER) is the process of identifying human feelings and emotions from facial expressions. Nowadays, automatic emotion recognition plays a key role in a wide range of applications, with particular interest in cognitive science [1], human-robot and human-computer interaction. In human-robot interaction, for example, the ability to recognize intentions and emotions and to react according to the particular motivational state of the user is crucial for making the interaction more friendly and natural, improving both the usability and the acceptability of the new technology.

Ekman in [2] defined a set of six universal emotions: anger, disgust, fear, happiness, sadness, and surprise, which can be recognized and described regardless of culture and context by using the set of action units (AUs) described in Table 1. Each AU is the action of a facial muscle that is typically activated when a given facial expression is produced. For example, AU 1, which corresponds to "Inner Brow Raiser", typically appears, together with a jaw drop, when people show a surprised emotional state. Facial emotion recognition is a challenging task due to inter-class similarity problems. Different people can show emotions in a different, personal way and with a different level of intensity, which makes the problem particularly hard. On the other hand, different motivational states can show very similar features and similar facial expressions.

In this work, a standard CNN-RNN architecture is used to learn both spatial and temporal cues from human facial expressions built up gradually across time. A method to express emotions by dynamically changing the RGB components of the ambient light based on the recognized emotional state is also proposed. Many studies have investigated the relationship between colors and emotions, and in particular how a given color can evoke positive feelings in the observer. Depending on the recognized human emotion, a different ambient light color will therefore be set, following Nijdam's color-emotion mapping [3].

The remainder of this paper is structured as follows. Section 2 analyzes the existing literature and discusses the state of the art in facial emotion recognition. Section 3 is dedicated to the description of the dataset and the data refinement process. Section 4 formalizes the problem and explains the proposed CNN+RNN model architecture. Section 5 shows experiments, implementation details and the achieved results. Section 6 proposes a real-time human-robot interaction application that can benefit from the proposed Facial Emotion Recognition system. Finally, in Section 7, we discuss conclusions and future work.
Table 1
Facial Action Unit (AU) code and corresponding description.
AU Description AU Description AU Description
1 Inner Brow Raiser 13 Cheek Puller 25 Lips Part
2 Outer Brow Raiser 14 Dimpler 26 Jaw Drop
4 Brow Lowerer 15 Lip Corner Depressor 27 Mouth Stretch
5 Upper Lid Raiser 16 Lower Lip Depressor 28 Lip Suck
6 Cheek Raiser 17 Chin Raiser 29 Jaw Thrust
7 Lip Tightener 18 Lip Puckerer 31 Jaw Clencher
9 Nose Wrinkler 20 Lip Stretcher 34 Cheek Puff
10 Upper Lip Raiser 21 Neck Tightener 38 Nostril Dilator
11 Nasolabial Deepener 23 Lip Tightener 39 Nostril Compressor
12 Lip Corner Puller 24 Lip Pressor 43 Eyes Closed
2. Related Work

Several techniques and deep learning models have been investigated over the last decade to address the problem of facial analysis and emotion recognition from RGB images and videos. Most of them use Convolutional Neural Networks (CNNs) to extract geometric features from facial landmark points. To train such high-capacity classifiers with the very small FER datasets available, one of the most common approaches is transfer learning. The works [4, 5] use pre-trained models such as VGG16 and AlexNet to initialize the weights of the CNN, which can improve accuracy and reduce over-fitting. Nguyen et al. in [4] introduced a two-stage supervised fine-tuning: a first-stage fine-tuning is applied using auxiliary facial expression datasets, followed by a final fine-tuning on the target AFEW dataset [6]. In [5] a pre-trained VGG-16 model with redefined dense layers is used for FER, identifying essential and optional convolutional blocks in the fine-tuning step. During training, the selected blocks of the VGG-16 model are included step by step, instead of being trained all at once, to diminish the effect of the initial random weights.

Instead of analyzing static images independently, thus ignoring the temporal relations between the frames of a video, 3D CNNs have also been explored to extract spatio-temporal features, with outstanding results. In [7], 3D CNNs are used to model the appearance and motion of videos, learning spatial and temporal aspects from image sequences simultaneously. In [8], two deep networks are combined: a 3D CNN captures the temporal appearance of facial expressions, while a deep temporal geometry network extracts the geometrical behaviour of the facial landmark points.

Similarly, some recent works have proposed combinations of CNNs and RNNs capable of keeping track of arbitrarily long-term dependencies in input sequences. In [9] multiple LSTM layers are stacked on top of CNNs, and the temporal and spatial representations are then aggregated in a fusion network to produce per-frame predictions of 12 facial action units (AUs). Fan et al. in [10] propose a hybrid network combining a CNN-features-based spatio-temporal RNN model with a 3D Convolutional Neural Network (C3D), also including audio features to maximize prediction accuracy. To deal with expression variations and intra-class variations, namely intensity and subject-identity variations, [11] introduces objective functions on the CNN that improve the expression-class separability of the spatial feature representation and minimize the intra-class variation within the same expression class. Differently from previous works, which first apply CNN architectures or pre-trained image classifiers as visual feature extractors and then train the RNNs separately on the extracted spatial representations, the proposed network analyzes the spatio-temporal behaviour of facial emotions with an end-to-end trainable, computationally efficient CNN+RNN architecture.

Several experiments have also been conducted to show how a facial emotion recognition system can improve the performance and the potential of human-robot interaction. Jimenez et al. in [12] show how controlling colored lights based on the user's emotion can represent a real communication channel between humans and robots. They introduce a self-sufficient model system that recognizes and empathizes with human emotions using colored lights on a robot's face. Feldmaier et al. in [13] also show the effectiveness of displaying color combinations and color patterns in affective agents, adjusting variations of intensity, brightness and frequency to obtain a psychological influence on the user. A similar strategy is adopted in this work as well.

Many other works have recently been published in the field of emotion recognition, face detection, and related classification tasks [14, 15].
Table 2
Ekman's six basic emotions described in terms of facial Action Units. With reference to Table 1: 1: Inner Brow Raiser, 2: Outer Brow Raiser, 4: Brow Lowerer, 7: Lip Tightener, 9: Nose Wrinkler, 10: Upper Lip Raiser, 12: Lip Corner Puller, 15: Lip Corner Depressor, 17: Chin Raiser, 20: Lip Stretcher, 24: Lip Pressor, 25: Lips Part, 26: Jaw Drop.
Emotional state    Action Units
Anger              4, 7, 24
Disgust            9, 10, 17
Fear               1, 4, 20, 25
Happiness          12, 25
Sadness            4, 15
Surprise           1, 2, 25, 26

Table 3
Final number of samples per category in the dataset. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).
Emotion label    Ang   Neu   Dis   Fea   Hap   Sad   Sur
#Videos          51    52    70    61    108   82    87

3. Dataset

For training, validating and testing the Facial Emotion Recognition model, the Extended Cohn-Kanade dataset (CK+) [16] was used. It contains 593 sequences across 123 subjects, and each sequence contains images from the onset (neutral frame) to the peak expression (last frame). The image sequences vary in duration from a minimum of 10 to a maximum of 60 frames. Images have frontal and 30-degree views and were digitized into either 640x490 or 640x480 pixel arrays with 8-bit gray-scale or 24-bit color values. Each image sequence is labelled with an Action Unit combination; a complete list of the possible AUs is reported in Table 1.

For each data sequence, if the action unit list is consistent with one of the basic emotion categories among Anger, Disgust, Fear, Happiness, Sadness, Surprise and Contempt, a nominal emotion label is associated to the sequence. To this aim, Table 2 shows a complete mapping between Ekman's six basic emotions and AUs. Note that only the six basic emotions are considered in the proposed FER model and reported in Table 2, while the Contempt category, which is very similar to the Disgust class and does not belong to the basic emotions group, is excluded. As a result of this selection process, only 296 of the 593 sequences fit the prototypic definitions and meet the criteria for one of the six discrete emotions. The prototype definitions used for translating AU scores into emotion terms are shown in Table 6 in the Appendix; the comparison with the emotion prediction rules of Table 6 was done by applying them very strictly.
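To make the strict labelling rule concrete, the sketch below checks a sequence's annotated AUs against the simplified prototypes of Table 2. It is only an illustration: the actual conversion rules of Table 6 are richer, since they also account for major variants and AU intensities.

```python
# Simplified AU-to-emotion check based on Table 2 (illustrative only; the
# full CK+ rules in Table 6 also allow major variants and AU intensities).
EMOTION_PROTOTYPES = {
    "Anger":     {4, 7, 24},
    "Disgust":   {9, 10, 17},
    "Fear":      {1, 4, 20, 25},
    "Happiness": {12, 25},
    "Sadness":   {4, 15},
    "Surprise":  {1, 2, 25, 26},
}

def label_from_aus(sequence_aus):
    """Return the first emotion whose prototype AUs are all present in the
    sequence's annotation, or None if no prototype matches (sequence discarded)."""
    aus = set(sequence_aus)
    for emotion, prototype in EMOTION_PROTOTYPES.items():
        if prototype <= aus:   # strict rule: every prototype AU must appear
            return emotion
    return None

# Example: a sequence annotated with AUs {1, 2, 5, 25, 26} is labelled "Surprise".
```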
In order to maximize the amount of data, which appears too scarce for training the model, another 35% of the dataset was hand-labelled by using not only the Prototypes but also their Major Variants, as shown in the emotion prediction Table 6. Compared with the prototypes, the variants allow for subsets of AUs; as a consequence they are less strict definitions, but still truly representative of the given emotion. As a result, a total of 511 video sequences is collected by including also the major-variant definitions in the conversion rules. Neutral facial expressions were also included, for a total of 7 emotion categories: Neutral (Neu), Anger (Ang), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad) and Surprise (Sur). Notice that even if the dataset grows in size, it remains unbalanced: some categories, such as Surprise and Happiness, are more represented than the others. The final distribution of data samples among the seven categories is reported in Table 3.

The minimum length of the data samples in the dataset is 10 frames. For this reason, in order to maximize the number of input sequences and collect the largest amount of training data, the sequence length, i.e. the number of frames per video, is set to 10. For sequences of greater length, only the last 10 frames are considered, to ensure that the apex (peak) frame, which is the most representative for the given emotion, is captured. For each frame, the pixel values are rescaled so that each pixel lies in [0, 1]. A central crop is then applied to each frame in order to focus on the human face. After resizing and cropping, the final dimension of each input sequence is

(n_frames, width, height, n_channels) = (10, 48, 48, 3).
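A minimal sketch of this per-sequence preprocessing is shown below. The central-crop fraction is an assumption; the paper only states that frames are centrally cropped to focus on the face, resized and rescaled to [0, 1].

```python
import tensorflow as tf

def preprocess_sequence(frames):
    """frames: uint8 tensor of shape (n_frames, H, W, 3) with n_frames >= 10."""
    frames = frames[-10:]                          # keep the last 10 frames (apex last)
    frames = tf.cast(frames, tf.float32) / 255.0   # rescale each pixel to [0, 1]
    frames = tf.image.central_crop(frames, 0.8)    # crop fraction is illustrative
    frames = tf.image.resize(frames, (48, 48))     # target spatial resolution
    return frames                                  # shape (10, 48, 48, 3)
```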
4. Model Architecture

The emotion recognition task is modelled as a multi-class classification problem over a set of 7 categories, Y = {Anger, Neutral, Disgust, Fear, Happiness, Sadness, Surprise}.

As shown in Section 2, Recurrent Neural Networks (RNNs) in combination with Convolutional Neural Networks (CNNs) and 3D Convolutional Neural Networks (C3D) provide powerful results when dealing with sequential image data. Following this approach, a standard, computationally efficient CNN+RNN architecture is proposed here. From the input sequence of frames, spatial feature representations are learned by a Convolutional Neural Network (CNN). To capture the facial expression dynamics, the temporal representation of the facial expression is then learned by a Recurrent Neural Network (RNN).
Both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells have been tested for the given problem and provide comparable results. However, GRUs have a less complex structure and are computationally more efficient; given also that the model does not have a huge amount of data, GRUs are preferred. Moreover, such a 2D CNN + GRU approach yields an end-to-end trainable model that is computationally cheaper than the alternatives. Indeed, the whole model counts a very small number of parameters, around 8k, which makes it very light.

The model takes as input a window of 10 RGB frames of size 48x48, and the data are processed in batches of 24 image sequences. The full architecture of the proposed model is shown in Figure 1. It consists of a four-layer 2D CNN: the first two convolutional layers have a 5x5 kernel size, the last two a 3x3 kernel size, and all 2D max-pooling layers have a 2x2 kernel size. To extract the temporal correlation in the extracted features, the CNN output is fed directly as input to a GRU with 8 units, which is also its output dimension. Finally, the output layer is a Dense (fully connected) layer with a Softmax activation function. The Softmax layer has one node per emotion category, so that it assigns decimal probabilities summing to 1 to the emotion categories. For each value z_i produced by the neurons of the output layer, the per-category probability is computed as

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)    (1)

such that the probability values always sum to 1 and only one emotion label is activated.

Figure 1: Proposed model architecture for the Facial Emotion Recognition system, which takes as input a 10-frame sequence and outputs a single emotion label.
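To make the architecture concrete, the following is a minimal Keras sketch of a 2D-CNN + GRU model of this kind. The paper does not report the number of convolutional filters, nor where exactly the L1/L2 regularizers and the dropout mentioned in Section 5.1 are applied, so those choices (and the use of a TimeDistributed per-frame encoder) are assumptions; the kernel sizes, pooling, GRU width and softmax output follow the description above.

```python
from tensorflow.keras import layers, models, regularizers

NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 10, 48, 48, 3
NUM_CLASSES = 7  # Neutral, Anger, Disgust, Fear, Happiness, Sadness, Surprise

def build_fer_model():
    # Per-frame feature extractor: four 2D conv layers, the first two with
    # 5x5 kernels and the last two with 3x3 kernels, each followed by 2x2
    # max pooling. Filter counts (4/8/8/8) are assumptions, chosen to keep
    # the parameter count in the few-thousand range described in the text.
    frame_encoder = models.Sequential([
        layers.Conv2D(4, 5, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01),
                      input_shape=(HEIGHT, WIDTH, CHANNELS)),
        layers.MaxPooling2D(2),
        layers.Conv2D(8, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(8, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(8, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
    ])
    model = models.Sequential([
        # Apply the same CNN to each of the 10 frames of the input window.
        layers.TimeDistributed(frame_encoder,
                               input_shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
        layers.Dropout(0.4),
        # A GRU with 8 units summarizes the temporal dynamics of the sequence.
        layers.GRU(8),
        # Dense softmax layer: one probability per emotion category, summing to 1.
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    return model
```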
5. Experiments

5.1. Implementation Details

To verify the effectiveness of the proposed model, experiments have been conducted on the modified CK+ dataset described in Section 3.

For efficient data parsing, the input sequences of 10 frames are first transformed into TFRecord files. The TensorFlow TFRecordDataset class is then used to standardize the data and generate batches of 24 image sequences to be fed as input to the model. 80% of the data samples are used for training, while the remaining 20% form the test set.

For training the model we used the Adam optimizer with a learning rate of 1 × 10^-3. Since in a multi-class classification problem each input sample can only belong to one out of many possible categories, a Categorical Cross-Entropy loss was used. Because the dataset is unbalanced, Categorical Accuracy alone is not enough to truly evaluate the model, so the Precision and Recall metrics were also considered. To prevent over-fitting, the L1 and L2 regularizers are set to 0.01 and the dropout rate to 0.4. The model was trained for a total of 400 epochs, with a mean training time of only 4 seconds per epoch.
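A possible training setup matching these implementation details is sketched below. The TFRecord feature names and the serialization format are assumptions for illustration (the paper only states that 10-frame sequences are stored as TFRecords, batched in groups of 24, and trained with Adam, a 1e-3 learning rate and a categorical cross-entropy loss); build_fer_model refers to the architecture sketch in Section 4.

```python
import tensorflow as tf

BATCH_SIZE = 24
NUM_CLASSES = 7

# Hypothetical TFRecord schema: one serialized uint8 tensor of shape
# (10, 48, 48, 3) per example, plus an integer emotion label.
FEATURE_SPEC = {
    "frames": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(example_proto):
    parsed = tf.io.parse_single_example(example_proto, FEATURE_SPEC)
    frames = tf.io.decode_raw(parsed["frames"], tf.uint8)
    frames = tf.reshape(frames, (10, 48, 48, 3))
    frames = tf.cast(frames, tf.float32) / 255.0          # pixels in [0, 1]
    label = tf.one_hot(parsed["label"], depth=NUM_CLASSES)
    return frames, label

# "ckplus_train.tfrecord" is a placeholder filename.
train_ds = (tf.data.TFRecordDataset("ckplus_train.tfrecord")
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(512)
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))

model = build_fer_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    # Accuracy alone is misleading on an unbalanced dataset, so precision
    # and recall are tracked as well.
    metrics=["categorical_accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
model.fit(train_ds, epochs=400)
```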
5.2. Results analysis

In this section, the experimental results are presented. Since the dataset is not balanced, precision and recall are considered together with the accuracy score in order to evaluate the performance of the model. For the best validation epoch, the model achieves 66% mean Precision and 65% mean Recall across categories; these evaluation metrics are summarized in Table 5, while Table 4 shows the per-category validation accuracy rates. The mean validation accuracy over categories is not very high: the per-category accuracy reaches very high scores for some emotion classes and very low values for others, meaning that the model almost fails when it tries to recognize the Anger and Fear emotions, while it behaves very well on all the other categories. In particular, the model is very effective in recognizing Happy and Sad facial expressions, both with 86% accuracy. High accuracy scores are also obtained for the Neutral (83%) and Surprise (79%) categories. As a result, even if the model shows very high performance for most of the emotions, the difficulty in learning some specific facial expressions, in particular Anger with only 10% accuracy, leads to a lower mean accuracy score across categories.

Table 4
Per-category validation accuracy. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).
Emotion label    Ang   Neu   Dis   Fea   Hap   Sad   Sur
Val. Accuracy    10%   83%   50%   30%   86%   86%   79%

Table 5
Evaluation metrics for the model: mean validation accuracy, precision and recall across categories.
Validation Accuracy     66%
Validation Precision    66%
Validation Recall       65%

To better understand the model, the confusion matrix over the CK+ validation set is reported in Figure 2. The main diagonal is highlighted by high rates of correctly classified samples, while some off-diagonal elements reflect samples mislabeled by the classifier.

Figure 2: Confusion matrix over the CK+ validation set. On the axes, the emotion categories are arranged as follows: {1: Anger, 2: Neutral, 3: Disgust, 4: Fear, 5: Happy, 6: Sad, 7: Surprise}.
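A minimal sketch of how these per-category accuracies and the confusion matrix could be computed with scikit-learn is given below. The use of scikit-learn, and the model / val_ds names carried over from the earlier sketches, are assumptions; the paper does not specify its evaluation tooling.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

LABELS = ["Anger", "Neutral", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"]

# Collect ground-truth labels and predictions over the validation set
# (val_ds is assumed to yield (frames, one_hot_label) batches).
y_true = np.concatenate([np.argmax(y.numpy(), axis=1) for _, y in val_ds])
y_prob = model.predict(val_ds)                      # (n_samples, 7) softmax outputs
y_pred = np.argmax(y_prob, axis=1)

cm = confusion_matrix(y_true, y_pred, labels=range(len(LABELS)))   # Figure 2
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)                # Table 4
mean_precision = precision_score(y_true, y_pred, average="macro")  # Table 5
mean_recall = recall_score(y_true, y_pred, average="macro")
```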
In particular, prediction errors mainly concern the Anger and Fear emotion categories, which appear to be harder to learn and recognize. As shown in the confusion matrix, Angry facial expressions are frequently confused with Sad ones, which indeed look very similar. If we look at some data examples, as shown in Figure 3, it is quite difficult even for humans to infer the correct labeling. This is because both Sadness and Anger are characterized by very similar features, such as lowered eyebrows and a tight mouth, as also highlighted in Table 2, which describes the emotional states in terms of facial Action Units. As a consequence, even if Sad facial expressions typically show much more lowered lip corners, more discreet and shyly performed emotions may be easily confused due to inter-class similarity and intensity variations.

Figure 3: Left: apex (peak) frame of an Angry facial expression from the CK+ dataset. Right: apex frame of a Sad facial expression performed by a different subject in the CK+ dataset.

In an analogous way, Fear is often wrongly classified as Happiness, since the two show a very similar behaviour of the lips and share Action Unit 25 (Lips Part), as reported in Table 2. As shown in the top row of Figure 4, a Fear facial expression is performed with the jaw dropped open and the lips stretched horizontally. As a result, if we look at Figure 4, for both the Fear and Happiness facial expressions (top and bottom rows respectively) the lips are closed at the onset (first frame) and then progressively part until they are fully apart in the apex frame, showing a very similar temporal behaviour which makes it difficult to distinguish between the two.

Figure 4: Top row: Fear emotion data sample from the CK+ dataset; 4 of the 10 frames are shown to illustrate the temporal evolution of the facial expression. Bottom row: Happiness emotion evolution over time, performed by the same subject.

In summary, the network can precisely recognize the Happiness, Sadness, Surprise and Neutral emotional states, while it is not as reliable for Anger and Fear. Therefore, only the emotion categories with a validation accuracy above 50% will be used in the human-robot interaction application.

6. Facial Emotion Recognition in Human-Robot Interaction

In this section, a possible application of the proposed facial emotion recognition system in human-robot and human-computer interaction is presented. Several studies [17, 18, 19, 20] have proven that equipping a robot with the ability to perceive the user's feelings and emotions and to react to them accordingly can significantly improve the quality of the interaction and, more importantly, user acceptability. Affective communication for social robots has been deeply investigated in recent years, showing that, beyond social intelligence, emotional intelligence also plays a crucial role in successful interactions. In particular, the need to recognize emotions and properly introduce empathic strategies turned out to be fundamental in vulnerable scenarios such as education, healthcare, autism therapy and driving support.

6.1. Application

A very effective and powerful solution in human-robot interaction is to introduce an empathic strategy based on colored lights. Indeed, colors and emotions are closely linked: several physical and psychological studies have shown that playing with colors and dynamic lights can be very effective, since they can evoke feelings and emotions in human observers. For this reason, a method for dynamically changing the ambient light or LED colors based on the recognized emotion is presented. The aim is to establish a very intuitive and natural way of communicating emotions, and also to provide a positive influence on the user by means of evocative associations.
Table 6
Emotion prediction. Conversion Rules for translating AU scores into Emotions. [16]
Table note: * means in this combination the AU may be at any level of intensity.
Emotion Prototypes Major Variants
Surprise 1+2+5B+26 1+2+5B
1+2+5B+27 1+2+26
1+2+27
5B+26
5B+27
Fear 1+2+4+5*+20*+25 1+2+4+5*+L or R20*+25, 26, or 27
1+2+4+5*+25 1+2+4+5*
1+2+5Z, with or without 25, 26, 27
5*+20* with or without 25, 26, 27
Happiness 6+12*
12C/D
Sadness 1+4+11+15B with or without 54+64 1+4+11 with or without 54+64
1+4+15* with or without 54+64 1+4+15B with or without 54+64
6+15* with or without 54+64 1+4+15B+17 with or without 54+64
11+17
25 or 26 may occur with all prototypes
or major variants
Disgust 9
9+16+15, 26
9+17
10*
10*+16+25, 26
10+17
Anger 4+5*+7+10*+22+23+25,26
4+5*+7+10*+23+25,26
4+5*+7+23+25, 26
4+5*+7+17+23
4+5*+7+17+24
4+5*+7+23
4+5*+7+24
In order to use colors to stimulate a certain positive feeling, such as calm, energy or happiness, we first need a semantic mapping between colors and emotions. Plenty of experiments have been carried out to find a precise and reliable mapping, and they have all provided almost the same results in terms of color meaning. To achieve a simple and affordable model, the proposed method relies on Naz Kaya's research [21], whose results are summarized in Figure 5.

Figure 5: Naz Kaya's emotion-color mapping.

Once a good emotion-color mapping is available, an empathic strategy that selects a specific evocative color depending on the particular recognized emotion can be adopted. For example, when the robot perceives that the user is in trouble or is afraid, the lights can automatically switch to green or green shades in order to recall a sense of peace and create a more comfortable environment. Indeed, green is usually associated with the most positive emotions, such as hopefulness, peacefulness and satisfaction. The same applies to blue, which is typically associated with calm and relaxed emotional states and, like green, can be useful to counter negative emotions such as Anger. Studies have shown that yellow is able to evoke happy and joyful emotional states, therefore yellow lights can be set whenever the robot recognizes that the user is sad, or even happy and surprised, in order to emphasize and agree with these positive emotions. Strong colors such as red are usually associated with anger and very intense, aggressive emotions, but mixed with yellow shades they can evoke active, energetic and powerful motivational states. It is therefore reasonable to activate yellow-red shades when the user is recognized to be sad and demotivated.
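As an illustration of this empathic strategy, the sketch below maps each recognized emotion to an RGB triple along the lines just described. The exact RGB values are assumptions: the paper and the cited studies associate emotions with color families (green and blue to soothe fear and anger, yellow and yellow-red shades for sadness, happiness and surprise) rather than with specific values.

```python
# Illustrative empathic color policy; RGB triples are placeholder values.
EMOTION_TO_RGB = {
    "Fear":      (0, 170, 60),    # green shades: recall peace and comfort
    "Anger":     (60, 120, 220),  # blue shades: calming, counter negative states
    "Sadness":   (250, 180, 40),  # yellow-red shades: energetic, motivating
    "Happiness": (255, 220, 0),   # yellow: emphasize joy
    "Surprise":  (255, 220, 0),
    "Neutral":   (255, 255, 255), # plain white ambient light
}

def ambient_color_for(emotion):
    """Return the RGB triple to set on the ambient light or robot LEDs,
    or None to leave the current color unchanged."""
    return EMOTION_TO_RGB.get(emotion)

# Example: ambient_color_for("Fear") -> (0, 170, 60), i.e. switch to green shades.
```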
7. Conclusion

In this work, we address the problem of Facial Emotion Recognition by introducing an end-to-end trainable deep learning model which can reason on both the spatial and the temporal features of facial expressions.
Results have shown that the proposed model can effectively learn to distinguish between most of the basic emotions. Furthermore, a method for setting the RGB light components according to the recognized user emotion is presented: in particular, an empathic strategy that exploits the proposed emotion recognition system to identify user emotions and accordingly control ambient colored lights can be used to improve the quality of human-robot interactions.

In the future, the proposed strategy can be developed and validated on a social robot such as Pepper from SoftBank Robotics [22], by controlling the colors of its body LEDs, which are placed in the chest, eyes, shoulders and ears and allow for a more friendly and engaging interaction. Another possible direction of future work could explore how this empathic model can become adaptive to the user's preferences, so that the robot can learn the impact of the different colors on a particular user from his or her previous reactions.

7.1. Ethical Impacts

The system presented in this paper has a wide field of employment. Its aim is to detect and point out a person's emotion in a very simple and informative way through colors. Applications of such a system can be useful whenever the environment needs to adapt intelligently to the user, e.g. changing the light color in response to people's mood in a room full of music stimuli. But we also acknowledge the possibility of misuse. Indeed, emotions are a fundamental part of people's lives and are therefore private. Advancing the state of the art in this field also means exposing every human inner self to the world. As for most research, the pros and cons must be weighed and evaluated considering every possible use-case scenario. We believe that our work does not hold enough practical ground to be misused by third actors, but we are still concerned about this possibility.

References

[1] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[2] P. Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology 17 (1971) 124.
[3] N. A. Nijdam, Mapping emotion to color, Book Mapping emotion to color (2009) 2–9.
[4] H.-W. Ng, V. D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, pp. 443–449.
[5] M. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, T. Shimamura, Facial emotion recognition using transfer learning in the deep CNN, Electronics 10 (2021) 1036.
[6] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Acted facial expressions in the wild database, Australian National University, Canberra, Australia, Technical Report TR-CS-11 2 (2011) 1.
[7] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural networks for facial expression recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2983–2991.
[8] J. Haddad, O. Lézoray, P. Hamel, 3D-CNN for facial emotion recognition in videos, in: International Symposium on Visual Computing, Springer, 2020, pp. 298–309.
[9] W.-S. Chu, F. De la Torre, J. F. Cohn, Learning spatial and temporal cues for multi-label facial action unit detection, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 25–32.
[10] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition using CNN-RNN and C3D hybrid networks, in: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445–450.
[11] D. H. Kim, W. J. Baddar, J. Jang, Y. M. Ro, Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing 10 (2017) 223–236.
[12] F. Jimenez, T. Ando, M. Kanoh, T. Nakamura, Psychological effects of a synchronously reliant agent on human beings, Journal of Advanced Computational Intelligence 17 (2013).
[13] J. Feldmaier, T. Marmat, J. Kuhn, K. Diepold, Evaluation of a RGB-LED-based emotion display for affective agents, arXiv preprint arXiv:1612.07303 (2016).
[14] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, YOLOv3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[15] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41–47.
[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, IEEE, 2010, pp. 94–101.
[17] I. Leite, G. Castellano, A. Pereira, C. Martinho, A. Paiva, Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings, in: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 367–374.
[18] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[19] M. Nalin, L. Bergamini, A. Giusti, I. Baroni, A. Sanna, Children's perception of a robotic companion in a mildly constrained setting, in: IEEE/ACM Human-Robot Interaction 2011 Conference (Robots with Children Workshop) Proceedings, Citeseer, 2011.
[20] M. Wozniak, D. Polap, G. Borowik, C. Napoli, A first attempt to cloud-based user verification in distributed system, in: 2015 Asia-Pacific Conference on Computer Aided System Engineering, IEEE, 2015, pp. 226–231.
[21] N. Kaya, H. H. Epps, Relationship between color and emotion: A study of college students, College Student Journal 38 (2004) 396–405.
[22] A. K. Pandey, R. Gelin, A mass-produced sociable humanoid robot: Pepper: The first machine of its kind, IEEE Robotics & Automation Magazine 25 (2018) 40–48.