Automatic RGB Inference Based on Facial Emotion
Recognition
Nicolo’ Brandizzi1 , Valerio Bianco1 , Giulia Castro1 , Samuele Russo2 and Agata Wajda3
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
2 Department of Psychology, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
3 Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland


Abstract
Recently, Facial Emotion Recognition (FER) has become one of the most promising and fastest-growing fields in computer vision and human-robot interaction. In this work, a deep learning neural network is introduced to address the problem of facial emotion recognition. In particular, a CNN+RNN architecture is designed to capture both the spatial features and the temporal dynamics of facial expressions. Experiments are performed on the CK+ dataset. Furthermore, we present a possible application of the proposed Facial Emotion Recognition system in human-robot interaction: a method for dynamically changing ambient light or LED colors based on the recognized emotion. Indeed, it has been shown that equipping robots with the ability to perceive emotions and to react accordingly through suitable empathic strategies significantly improves human-robot interaction. Possible application scenarios are education, healthcare and autism therapy, where such empathic strategies play a fundamental role.

Keywords
Facial Emotion Recognition, Human-Robot Interaction



SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27–29, 2021, Catania, IT
Email: brandizzi@diag.uniroma1.it (N. Brandizzi); castro.1742813@studenti.uniroma1.it (G. Castro); samuele.russo@studenti.uniroma1.it (S. Russo); agata.wajda@polsl.pl (A. Wajda)
ORCID: 0000-0002-1846-9996 (S. Russo); 0000-0002-1667-3328 (A. Wajda)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Facial Emotion Recognition (FER) is the process of identifying human feelings and emotions from facial expressions. Nowadays, automatic emotion recognition plays a key role in a wide range of applications, with particular interest in cognitive science [1] and in human-robot and human-computer interaction. In human-robot interaction, for example, the ability to recognize intentions and emotions and to react accordingly to the particular motivational state of the user is crucial for making the interaction more friendly and natural, improving both the usability and the acceptability of the new technology.

Ekman in [2] defined a set of six universal emotions: anger, disgust, fear, happiness, sadness and surprise, which can be recognized and described regardless of culture and context by means of a set of Action Units (AUs), listed in Table 1. Each AU is the action of a facial muscle that is typically activated when a given facial expression is produced. For example, AU 1, which corresponds to the "Inner Brow Raiser", typically appears, together with a jaw drop, when people show a surprised emotional state. Facial emotion recognition is a challenging task: different people can show the same emotion in a personal way and with a different level of intensity, which makes the problem particularly hard, while, on the other hand, different motivational states can produce very similar features and facial expressions (inter-class similarity).

In this work, a standard CNN+RNN architecture is used to learn both spatial and temporal cues from human facial expressions built up gradually over time. A method to express emotions by dynamically changing the RGB components of the ambient light based on the recognized emotional state is also proposed. Many studies have investigated the relationship between colors and emotions, and in particular how a given color can evoke positive feelings in the observer. Depending on the recognized human emotion, a different ambient light color is therefore set according to the Nijdam color-emotion mapping [3].

The remainder of this paper is structured as follows. Section 2 analyzes the existing literature and discusses the state of the art in facial emotion recognition. Section 3 describes the dataset and the data refinement process. Section 4 formalizes the problem and explains the proposed CNN+RNN model architecture. Section 5 presents the experiments, implementation details and achieved results. Section 6 proposes a real-time human-robot interaction application which can benefit from the proposed Facial Emotion Recognition system. Finally, Section 7 discusses conclusions and future work.







Table 1
Facial Action Unit (AU) code and corresponding description.
 AU        Description                       AU       Description                            AU        Description
 1         Inner Brow Raiser                 13       Cheek Puller                           25        Lips Part
 2         Outer Brow Raiser                 14       Dimpler                                26        Jaw Drop
 4         Brow Lowerer                      15       Lip Corner Depressor                   27        Mouth Stretch
 5         Upper Lid Raiser                  16       Lower Lip Depressor                    28        Lip Suck
 6         Cheek Raiser                      17       Chin Raiser                            29        Jaw Thrust
 7         Lip Tightener                     18       Lip Puckerer                           31        Jaw Clencher
 9         Nose Wrinkler                     20       Lip Stretcher                          34        Cheek Puff
 10        Upper Lip Raiser                  21       Neck Tightener                         38        Nostril Dilator
 11        Nasolabial Deepener               23       Lip Tightener                          39        Nostril Compressor
 12        Lip Corner Puller                 24       Lip Pressor                            43        Eyes Closed



2. Related Work

Several techniques and deep learning models have been investigated over the last decade to address the problem of facial analysis and emotion recognition from RGB images and videos. Most of them use Convolutional Neural Networks (CNNs) to extract geometric features from facial landmark points. In order to train such high-capacity classifiers with the very small FER datasets available, one of the most common approaches is transfer learning. The works [4, 5] use pre-trained models such as VGG16 and AlexNet to initialize the weights of the CNN, which can improve accuracy and reduce over-fitting. Ng et al. [4] introduced a two-stage supervised fine-tuning: a first fine-tuning stage on auxiliary facial expression datasets is followed by a final fine-tuning on the target AFEW dataset [6]. In [5] a pre-trained VGG-16 model with redefined dense layers is used for FER, identifying essential and optional convolutional blocks in the fine-tuning step. During training, the selected VGG-16 blocks are included step by step, instead of being trained all at once, to reduce the effect of the initial random weights.

Instead of analyzing static images independently, thus ignoring the temporal relations between the frames of a video, 3D CNNs have also been explored to extract spatio-temporal features, with outstanding results. In [7], 3D CNNs are used to model the appearance and motion of videos, learning spatial and temporal aspects of the image sequences simultaneously. In [8], two deep networks are combined: a 3D CNN captures the temporal appearance of facial expressions, while a deep temporal geometry network extracts the geometrical behaviour of the facial landmark points.

Similarly, some recent works have proposed combinations of CNNs and RNNs capable of keeping track of arbitrarily long-term dependencies in the input sequences. In [9] multiple LSTM layers are stacked on top of CNNs; the temporal and spatial representations are then aggregated into a fusion network to produce per-frame predictions of 12 facial action units (AUs). Fan et al. [10] propose a hybrid network combining a CNN-feature-based spatio-temporal RNN model with a 3D Convolutional Neural Network (C3D), also including audio features in order to maximize prediction accuracy. To deal with expression variations and intra-class variations, namely intensity and subject identity variations, [11] introduces objective functions on the CNN that improve the expression class separability of the spatial feature representation and minimize the intra-class variation within the same expression class. Differently from previous works, which first apply CNN architectures or pre-trained image classifiers as visual feature extractors and then use the extracted spatial feature representation to train the RNNs separately, the network proposed here analyzes the spatio-temporal behaviour of facial emotions with an end-to-end trainable, computationally efficient CNN+RNN architecture.

Several experiments have also been carried out to show how a facial emotion recognition system can improve the performance and potential of human-robot interaction. Jimenez et al. [12] show how controlling colored lights based on the user's emotion can represent a real communication channel between humans and robots. They introduce a self-sufficient model system that recognizes and empathizes with human emotions using colored lights on a robot's face. Feldmaier et al. [13] also show the effectiveness of displaying color combinations and color patterns in affective agents, adjusting intensity, brightness and frequency variations to obtain a psychological influence on the user. A similar strategy is adopted in this work as well.

Many other works have been recently published in the field of emotion recognition, face detection, and related classification tasks [14, 15].







Table 2
Ekman's six basic emotions described in terms of facial Action Units. With reference to Table 1: 1: Inner Brow Raiser, 2: Outer Brow Raiser, 4: Brow Lowerer, 7: Lip Tightener, 9: Nose Wrinkler, 10: Upper Lip Raiser, 12: Lip Corner Puller, 15: Lip Corner Depressor, 17: Chin Raiser, 20: Lip Stretcher, 24: Lip Pressor, 25: Lips Part, 26: Jaw Drop.
          Emotional state      Action Units
          Anger                4, 7, 24
          Disgust              9, 10, 17
          Fear                 1, 4, 20, 25
          Happiness            12, 25
          Sadness              4, 15
          Surprise             1, 2, 25, 26

Table 3
Final number of samples per category in the dataset. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).
          Emotion label    Ang    Neu    Dis    Fea    Hap    Sad    Sur
          #Videos          51     52     70     61     108    82     87

3. Dataset

For training, validating and testing the Facial Emotion Recognition model, the Extended Cohn-Kanade dataset (CK+) [16] was used. It contains 593 sequences across 123 subjects, and each sequence contains images from the onset (neutral frame) to the peak expression (last frame). The sequences vary in duration from a minimum of 10 to a maximum of 60 frames. Images have frontal and 30-degree views and were digitized into either 640x490 or 640x480 pixel arrays with 8-bit gray-scale or 24-bit color values. Each image sequence is labelled with Action Unit combinations; a complete list of the possible AUs is reported in Table 1.

For each data sequence, if the Action Unit list is consistent with one of the emotion categories Anger, Disgust, Fear, Happiness, Sadness, Surprise or Contempt, a nominal emotion label is associated with the sequence. To this aim, Table 2 shows a complete mapping between Ekman's six basic emotions and AUs. Note that only the six basic emotions are considered in the proposed FER model and reported in Table 2, while the Contempt category, which is very similar to the Disgust class and does not belong to the basic emotions, is excluded. As a result of this selection process, only 296 of the 593 sequences fit the prototypic definitions and meet the criteria for one of the six discrete emotions. The prototype definitions used for translating AU scores into emotion terms are shown in Table 6 in the Appendix; the comparison against the emotion prediction rules of Table 6 was done by applying them very strictly.

In order to enlarge the amount of data, which would otherwise be too scarce to train the model, another 35% of the dataset was labelled by hand using not only the Prototypes but also their Major Variants, as shown in the emotion prediction Table 6. Compared with the Prototypes, the variants allow for subsets of AUs: they are less strict definitions but still truly representative of the given emotion. As a result, a total of 511 video sequences are collected by also including the major variant definitions in the conversion rules. Neutral facial expressions were included as well, for a total of 7 emotion categories: Neutral (Neu), Anger (Ang), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad) and Surprise (Sur). Notice that even if the dataset grows in size, it remains unbalanced: some categories, such as Surprise and Happiness, are more represented than the others. The final distribution of data samples among the seven categories is reported in Table 3.

The minimum length of the data samples in the dataset is 10 frames. For this reason, in order to maximize the number of input sequences and collect the largest amount of training data, the sequence length, i.e. the number of frames per video, is set to 10. For sequences of greater length, only the last 10 frames are considered, to ensure that the apex (peak) frame, the most representative of the given emotion, is captured. The pixel values of each frame are rescaled so that each pixel lies in [0, 1]. A central crop is applied to each frame to focus on the human face. After resizing and cropping, the final dimension of each input sequence is

(n_frames, width, height, n_channels) = (10, 48, 48, 3).
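The labelling and preprocessing described above can be summarized with the following sketch. It is a simplified illustration, not the authors' code: the prototype matching uses only the condensed Table 2 prototypes (the full Table 6 rules with major variants are omitted), subset matching is an assumption about how "consistency" with a prototype is checked, and the central-crop fraction is likewise an assumption since the paper only states that a central crop is applied before resizing to 48x48.

```python
import numpy as np
import tensorflow as tf

# Simplified AU prototypes from Table 2 (the full CK+ rules in Table 6
# also allow "major variants"; they are omitted here for brevity).
PROTOTYPES = {
    "Anger":     {4, 7, 24},
    "Disgust":   {9, 10, 17},
    "Fear":      {1, 4, 20, 25},
    "Happiness": {12, 25},
    "Sadness":   {4, 15},
    "Surprise":  {1, 2, 25, 26},
}

def label_from_aus(au_codes):
    """Return an emotion label if the annotated AUs contain a prototype, else None."""
    aus = set(au_codes)
    for emotion, proto in PROTOTYPES.items():
        if proto.issubset(aus):
            return emotion
    return None

def preprocess_sequence(frames, n_frames=10, size=48, central_fraction=0.8):
    """Keep the last n_frames, centrally crop, resize to size x size, scale to [0, 1].

    `frames` is a list/array of RGB frames (H, W, 3) with values in [0, 255].
    """
    frames = np.asarray(frames[-n_frames:], dtype=np.float32) / 255.0  # rescale to [0, 1]
    frames = tf.image.central_crop(frames, central_fraction)           # focus on the face
    frames = tf.image.resize(frames, (size, size))                     # -> (n_frames, 48, 48, 3)
    return frames.numpy()
```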







4. Model Architecture

The emotion recognition task is modelled as a multi-class classification problem over a set of 7 categories Y = {Anger, Neutral, Disgust, Fear, Happiness, Sadness, Surprise}.

As shown in Section 2, Recurrent Neural Networks (RNNs) in combination with Convolutional Neural Networks (CNNs) and 3D Convolutional Neural Networks (C3D) provide powerful results when dealing with sequential image data. Following this approach, a standard, computationally efficient CNN+RNN architecture is proposed here. From the input sequence of frames, spatial feature representations are learned by a Convolutional Neural Network (CNN). In order to capture the facial expression dynamics, temporal feature representations of the facial expression are learned via a Recurrent Neural Network (RNN). Both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells have been tested on the given problem and provide comparable results. However, GRUs have a less complex structure and are computationally more efficient; given also that the amount of available data is limited, GRUs are preferred to obtain a good accuracy. Moreover, this 2D CNN + GRU approach yields an end-to-end trainable model which is computationally cheaper than the alternatives: the whole model counts a very small number of parameters, around 8k, which makes it very light.

The model takes as input a window of 10 RGB frames of size 48x48. Data are processed in batches of 24 image sequences. The full architecture of the proposed model is shown in Figure 1. It consists of a four-layer 2D CNN: the first two convolutional layers have a 5x5 kernel size, while the last two have a 3x3 kernel size, and all 2D max pooling layers use a 2x2 kernel. In order to extract the temporal correlation in the extracted features, the CNN output is fed directly into a GRU of 8 units, which is also its output dimension. Finally, the output layer is a Dense (fully connected) layer with a Softmax activation function. The Softmax layer has as many nodes as output categories, so as to assign decimal probabilities, summing to 1, to each of the emotion categories. For each value z_i produced by the neurons of the output layer, the per-category probability is computed as

    softmax(z_i) = exp(z_i) / Σ_j exp(z_j)                      (1)

so that the probabilities always sum to 1 and only one emotion label is activated.

Figure 1: Proposed model architecture for the Facial Emotion Recognition system, which takes as input a 10-frame sequence and outputs a single emotion label.

5. Experiments

5.1. Implementation Details

To verify the effectiveness of the proposed model, experiments have been conducted on the modified CK+ dataset described in Section 3.

For efficient data parsing, the input sequences of 10 frames are first transformed into TFRecord files. The TensorFlow TFRecordDataset class is then used to standardize the data and generate batches of 24 image sequences to be fed as input to the model. 80% of the data samples are used for training, while the remaining 20% form the test set.

For training the model we used the Adam optimizer with learning rate 1e-3. In a multi-class classification problem each input sample can only belong to one out of many possible categories, therefore a Categorical Cross Entropy loss was used. Since the dataset is unbalanced, the Categorical Accuracy alone is not enough to evaluate the model properly, so Precision and Recall metrics were also considered. In order to prevent over-fitting, L1 and L2 regularization are set to 0.01 and dropout to 0.4. The model was trained for a total of 400 epochs, with a mean training time of only about 4 seconds per epoch.
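As an illustration of the data handling described above, the following sketch parses fixed-length (10, 48, 48, 3) sequences from TFRecord files with tf.data.TFRecordDataset and batches them in groups of 24. It is a hedged sketch: the feature names ("frames", "label") and the choice of serializing each sequence with tf.io.serialize_tensor are assumptions, not the authors' record format.

```python
import tensorflow as tf

N_FRAMES, SIZE, N_CLASSES, BATCH_SIZE = 10, 48, 7, 24

# Assumed TFRecord layout: each Example stores a serialized float32 tensor of
# shape (10, 48, 48, 3) under "frames" and an integer class id under "label".
FEATURE_SPEC = {
    "frames": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    frames = tf.io.parse_tensor(example["frames"], out_type=tf.float32)
    frames = tf.reshape(frames, (N_FRAMES, SIZE, SIZE, 3))
    label = tf.one_hot(example["label"], N_CLASSES)  # target for categorical cross-entropy
    return frames, label

def make_dataset(tfrecord_files, shuffle=True):
    ds = tf.data.TFRecordDataset(tfrecord_files)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    if shuffle:
        ds = ds.shuffle(512)
    return ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# train_ds = make_dataset(train_files)             # 80% of the samples
# val_ds = make_dataset(val_files, shuffle=False)  # remaining 20%
```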




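A corresponding Keras sketch of the 2D CNN + GRU architecture of Section 4, compiled with the training settings reported above, could look as follows. The number of filters per convolutional layer and the exact placement of pooling, dropout and L1/L2 regularization are assumptions; the paper only specifies the kernel sizes, 2x2 max pooling, a GRU of 8 units, a softmax output over 7 classes, roughly 8k parameters in total, Adam with learning rate 1e-3, L1/L2 = 0.01 and dropout 0.4.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

N_FRAMES, SIZE, N_CLASSES = 10, 48, 7
reg = regularizers.l1_l2(l1=0.01, l2=0.01)

def conv_block(filters, kernel):
    # One TimeDistributed 2D convolution followed by 2x2 max pooling,
    # applied independently to each of the 10 frames.
    return [
        layers.TimeDistributed(
            layers.Conv2D(filters, kernel, padding="same",
                          activation="relu", kernel_regularizer=reg)),
        layers.TimeDistributed(layers.MaxPooling2D(2)),
    ]

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(N_FRAMES, SIZE, SIZE, 3))]
    + conv_block(8, 5)       # first two conv layers use 5x5 kernels
    + conv_block(8, 5)
    + conv_block(16, 3)      # last two conv layers use 3x3 kernels
    + conv_block(16, 3)
    + [
        layers.TimeDistributed(layers.Flatten()),
        layers.Dropout(0.4),
        layers.GRU(8),                                  # temporal aggregation over frames
        layers.Dense(N_CLASSES, activation="softmax"),  # per-category probabilities, Eq. (1)
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.CategoricalAccuracy(),
             tf.keras.metrics.Precision(),
             tf.keras.metrics.Recall()],
)

# model.fit(train_ds, validation_data=val_ds, epochs=400)
```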




5.2. Results analysis

In this section the experimental results are presented. Since the dataset is not balanced, precision and recall are considered together with the accuracy score in order to evaluate the performance of the model. At the best validation epoch the model achieves 66% mean Precision and 65% mean Recall across categories, as summarized in Table 5.

Table 5
Evaluation metrics for the model: mean validation accuracy, precision and recall across categories.
              Validation Accuracy     66%
              Validation Precision    66%
              Validation Recall       65%

Table 4 shows the per-category validation accuracy rates. The mean validation accuracy over the categories is not particularly high: the per-category accuracy reaches very high scores for some emotion classes and very low values for others, meaning that the model almost fails when trying to recognize the Anger and Fear emotions, while it behaves very well on all the other categories. In particular, the model is very good at recognizing Happy and Sad facial expressions, with 86% accuracy; high accuracy scores are also obtained for the Neutral (83%) and Surprise (79%) categories. As a result, even if the model performs very well for most of the emotions, the difficulty in learning some specific facial expressions, in particular Anger with only 10% accuracy, leads to a lower mean accuracy score across categories.

Table 4
Per-category validation accuracy. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).
  Emotion label    Ang    Neu    Dis    Fea    Hap    Sad    Sur
  Val Accuracy     10%    83%    50%    30%    86%    86%    79%

To better understand the model, the confusion matrix over the CK+ validation set is reported in Figure 2. The main diagonal shows high rates of correctly classified samples; however, some off-diagonal elements reflect predictions mislabeled by the classifier. In particular, prediction errors mainly concern the Anger and Fear categories, which appear to be harder to learn and recognize.

Figure 2: Confusion matrix over the CK+ validation set. On the axes the emotion categories are arranged as follows: {1: Anger, 2: Neutral, 3: Disgust, 4: Fear, 5: Happy, 6: Sad, 7: Surprise}.

Figure 3: Left: apex (peak) frame of an Angry facial expression from the CK+ dataset. Right: apex frame of a Sad facial expression performed by a different subject in the CK+ dataset.
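For reference, per-category accuracies such as those in Table 4 and a confusion matrix like the one in Figure 2 can be derived from the validation predictions along the following lines. This is a hedged sketch using scikit-learn rather than the authors' evaluation code; `model` and the validation dataset are assumed to come from the previous sketches, and the class index order is assumed to match Table 4.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CLASS_NAMES = ["Ang", "Neu", "Dis", "Fea", "Hap", "Sad", "Sur"]  # assumed label order

def per_class_report(model, val_ds):
    """Per-category accuracy and confusion matrix computed over validation batches.

    `val_ds` is assumed to yield (frames, one_hot_label) batches.
    """
    y_true, y_pred = [], []
    for frames, labels in val_ds:
        probs = model.predict(frames, verbose=0)
        y_true.extend(np.argmax(labels.numpy(), axis=1))
        y_pred.extend(np.argmax(probs, axis=1))

    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASS_NAMES)))
    per_class_acc = cm.diagonal() / cm.sum(axis=1)  # recall of each class
    for name, acc in zip(CLASS_NAMES, per_class_acc):
        print(f"{name}: {acc:.0%}")
    return cm

# cm = per_class_report(model, make_dataset(val_files, shuffle=False))
```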







As shown in the confusion matrix, Angry facial expressions are frequently confused with Sad ones, which indeed look very similar. If we look at some data examples, as shown in Figure 3, it is quite difficult even for humans to infer the correct label. This is because both Sadness and Anger are characterized by very similar features, such as lowered eyebrows and a tight mouth, as also highlighted in Table 2, which describes the emotional states in terms of facial Action Units. As a consequence, even if Sad facial expressions typically show much more lowered lip corners, more discreetly and shyly performed emotions may easily be confused due to inter-class similarity and variations in emotion intensity. In an analogous way, Fear is often wrongly classified as Happiness, since the two show a very similar behaviour of the lips and share Action Unit 25 (Lips Part), as reported in Table 2. As shown in the top row of Figure 4, a fear facial expression is performed with the jaw dropped open and the lips stretched horizontally. Looking at Figure 4, for both the Fear and Happiness facial expressions (top and bottom rows respectively) the lips are closed at the onset (first frame) and then gradually move apart until the apex frame, showing a very similar temporal behaviour which makes it difficult to distinguish between the two.

Figure 4: Top row: Fear emotion data sample from the CK+ dataset; 4 of the 10 frames are shown to illustrate the temporal evolution of the facial expression. Bottom row: Happiness emotion evolution over time, performed by the same subject.

In summary, the network can precisely recognize the Happiness, Sadness, Surprise and Neutral emotional states, while it is not as reliable for Anger and Fear. Therefore, only emotion categories with a validation accuracy above 50% will be used in the human-robot interaction application.

6. Facial Emotion Recognition in Human-Robot Interaction

In this section, a possible application of the proposed facial emotion recognition system in human-robot and human-computer interaction is presented. Several studies [17, 18, 19, 20] have shown that equipping a robot with the ability to perceive user feelings and emotions and to react to them accordingly can significantly improve the quality of the interaction and, more importantly, user acceptability. Affective communication for social robots has been deeply investigated in recent years, showing that, besides social intelligence, emotional intelligence also plays a crucial role in successful interactions. In particular, the need to recognize emotions and to properly introduce empathic strategies has turned out to be fundamental in vulnerable scenarios such as education, healthcare, autism therapy and driving support.

6.1. Application

A very effective and powerful solution in human-robot interaction is to introduce an empathic strategy based on colored lights. Indeed, colors and emotions are closely linked. Several physical and psychological studies have shown that playing with colors and dynamic lights can be very effective, since they can evoke feelings and emotions in human observers. For this reason, a method for dynamically changing ambient light or LED colors based on the recognized emotion is presented.







Table 6
                  Emotion prediction. Conversion Rules for translating AU scores into Emotions. [16]
                    Table note: * means in this combination the AU may be at any level of intensity.
              Emotion         Prototypes                              Major Variants
              Surprise        1+2+5B+26                               1+2+5B
                              1+2+5B+27                               1+2+26
                                                                      1+2+27
                                                                      5B+26
                                                                      5B+27
              Fear            1+2+4+5*+20*+25                         1+2+4+5*+L or R20*+25, 26, or 27
                              1+2+4+5*+25                             1+2+4+5*
                                                                      1+2+5Z, with or without 25, 26, 27
                                                                      5*+20* with or without 25, 26, 27
              Happiness       6+12*
                              12C/D
              Sadness         1+4+11+15B with or without 54+64        1+4+11 with or without 54+64
                              1+4+15* with or without 54+64           1+4+15B with or without 54+64
                              6+15* with or without 54+64             1+4+15B+17 with or without 54+64
                                                                      11+17
                                                                      25 or 26 may occur with all prototypes
                                                                      or major variants
              Disgust         9
                              9+16+15, 26
                              9+17
                              10*
                              10*+16+25, 26
                              10+17
              Anger           4+5*+7+10*+22+23+25,26
                              4+5*+7+10*+23+25,26
                              4+5*+7+23+25, 26
                              4+5*+7+17+23
                              4+5*+7+17+24
                              4+5*+7+23
                              4+5*+7+24


The aim is to establish a very intuitive and natural way of communicating emotions, and also to exert a positive influence on the user by means of evocative associations.

In order to use colors to stimulate a certain positive feeling, such as calm, energy or happiness, we first need a semantic mapping between colors and emotions. Plenty of experiments have been conducted to find a precise and reliable mapping, and they have all provided almost the same results in terms of color meaning. To achieve a simple and affordable model, the proposed method relies on the research of Naz Kaya [21], whose results are summarized in Figure 5.

Figure 5: Naz Kaya emotions-colors mapping.

Once a good emotion-color mapping is found, an empathic strategy that selects a specific evocative color depending on the recognized emotion can be adopted. For example, when the robot perceives that the user is in trouble or is afraid, the lights can automatically switch to green or green shades in order to recall a sense of peace and create a more comfortable environment; indeed, green is usually associated with the most positive emotions, such as hopefulness, peacefulness and satisfaction. The same applies to blue, which is typically associated with calm and relaxed emotional states and, like green, can be useful to counter negative emotions such as Anger. Studies have shown that yellow is able to evoke happy and joyful emotional states, therefore yellow lights can be set whenever the robot recognizes that the user is sad, or even happy or surprised, in order to emphasize and agree with these positive emotions. Strong colors such as red are usually associated with anger and very intense, aggressive emotions and, mixed with yellow shades, can evoke active, energetic and powerful motivational states. It is therefore reasonable to activate yellow-red shades when the user is recognized to be sad and demotivated.
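A minimal sketch of such an empathic color strategy is shown below. The specific RGB triples and the emotion-to-color assignments follow the qualitative description above rather than an exact published palette, and set_ambient_light is a hypothetical placeholder for whatever LED or smart-light API is available (e.g. a robot's LED controller).

```python
# Hedged sketch: map the recognized emotion to an RGB ambient-light color,
# following the qualitative strategy described above. RGB values are
# illustrative assumptions, not an exact palette taken from [21].
EMOTION_TO_RGB = {
    "Fear":      (0, 200, 80),     # green shades: sense of peace and comfort
    "Anger":     (40, 120, 255),   # blue: calm, counters negative emotions
    "Happiness": (255, 220, 0),    # yellow: emphasize joy
    "Surprise":  (255, 220, 0),    # yellow: agree with the positive emotion
    "Sadness":   (255, 140, 0),    # yellow-red shades: active, energetic
    "Neutral":   (255, 255, 255),  # neutral white light (assumption)
}

def react_to_emotion(emotion, set_ambient_light):
    """Set the ambient/LED color for a recognized emotion.

    `set_ambient_light` is a hypothetical callback wrapping the concrete
    LED or smart-light API.
    """
    rgb = EMOTION_TO_RGB.get(emotion)
    if rgb is not None:  # only react to emotions the model recognizes reliably
        set_ambient_light(*rgb)

# Example usage with a stand-in light controller:
react_to_emotion("Fear", lambda r, g, b: print(f"LED color set to ({r}, {g}, {b})"))
```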







7. Conclusion

In this work, we address the problem of Facial Emotion Recognition by introducing an end-to-end trainable deep learning model which can reason on both the spatial and the temporal features of facial expressions. The results show that the proposed model can effectively learn to distinguish between most of the basic emotions. Furthermore, a method for setting the RGB components of ambient lights according to the recognized user emotion is presented. In particular, an empathic strategy that exploits the proposed emotion recognition system to identify user emotions and accordingly control ambient colored lights can be used to improve the quality of human-robot interactions.

In the future, the proposed strategy can be developed and validated on a social robot such as Pepper from SoftBank Robotics [22] by controlling the colors of its body LEDs, which are placed in the chest, eyes, shoulders and ears, allowing for a more friendly and engaging interaction. Another possible direction of future work is to make the empathic model adaptive to user preferences, so that the robot can learn the impact of the different colors on a particular user depending on his or her previous reactions.

7.1. Ethical Impacts

The system presented in this paper has a wide field of employment. Its aim is to detect and point out a person's emotion in a very simple and informative way through colors. Such systems can be useful when the environment needs to adapt intelligently to the user, e.g. changing the light color in response to people's mood in a room full of music stimuli. However, we also acknowledge the possibility of misuse. Emotions are a fundamental part of people's lives and are therefore private. Advancing the state of the art in this field also means exposing every human inner self to the world. As for most research, the pros and cons must be weighed and evaluated considering every possible use case scenario. We believe that our work does not hold enough practical ground to be misused by third actors, but we remain aware of this possibility.

References

[1] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[2] P. Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology 17 (1971) 124.
[3] N. A. Nijdam, Mapping emotion to color, Book Mapping emotion to color (2009) 2–9.
[4] H.-W. Ng, V. D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 443–449.
[5] M. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, T. Shimamura, Facial emotion recognition using transfer learning in the deep CNN, Electronics 10 (2021) 1036.
[6] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Acted facial expressions in the wild database, Australian National University, Canberra, Australia, Technical Report TR-CS-11 2 (2011) 1.
[7] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural networks for facial expression recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2983–2991.
[8] J. Haddad, O. Lézoray, P. Hamel, 3D-CNN for facial emotion recognition in videos, in: International Symposium on Visual Computing, Springer, 2020, pp. 298–309.
[9] W.-S. Chu, F. De la Torre, J. F. Cohn, Learning spatial and temporal cues for multi-label facial action unit detection, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 25–32.
[10] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition using CNN-RNN and C3D hybrid networks, in: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445–450.
[11] D. H. Kim, W. J. Baddar, J. Jang, Y. M. Ro, Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing 10 (2017) 223–236.
[12] F. Jimenez, T. Ando, M. Kanoh, T. Nakamura, Psychological effects of a synchronously reliant agent on human beings, Journal of Advanced Computational Intelligence 17 (2013).
[13] J. Feldmaier, T. Marmat, J. Kuhn, K. Diepold, Evaluation of a RGB-LED-based emotion display for affective agents, arXiv preprint arXiv:1612.07303 (2016).
[14] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[15] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41–47.
[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, IEEE, 2010, pp. 94–101.
[17] I. Leite, G. Castellano, A. Pereira, C. Martinho, A. Paiva, Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings, in: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 367–374.
[18] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[19] M. Nalin, L. Bergamini, A. Giusti, I. Baroni, A. Sanna, Children's perception of a robotic companion in a mildly constrained setting, in: IEEE/ACM Human-Robot Interaction 2011 Conference (Robots with Children Workshop) Proceedings, Citeseer, 2011.
[20] M. Wozniak, D. Polap, G. Borowik, C. Napoli, A first attempt to cloud-based user verification in distributed system, in: 2015 Asia-Pacific Conference on Computer Aided System Engineering, IEEE, 2015, pp. 226–231.
[21] N. Kaya, H. H. Epps, Relationship between color and emotion: A study of college students, College Student Journal 38 (2004) 396–405.
[22] A. K. Pandey, R. Gelin, A mass-produced sociable humanoid robot: Pepper: The first machine of its kind, IEEE Robotics & Automation Magazine 25 (2018) 40–48.



