<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic RGB Inference Based on Facial Emotion Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolo' Brandizzi</string-name>
          <email>brandizzi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Bianco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Castro</string-name>
          <email>castro.1742813@studenti.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
          <email>samuele.russo@studenti.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agata Wajda</string-name>
          <email>agata.wajda@polsl.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, 00135, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>44-100 Gliwice</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Psychology, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, 00135, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>66</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>Facial Emotion Recognition (FER) has recently become one of the most promising and fastest-growing fields in computer vision and human-robot interaction. In this work, a deep neural network is introduced to address the problem of facial emotion recognition. In particular, a CNN+RNN architecture is designed to capture both the spatial features and the temporal dynamics of facial expressions. Experiments are performed on the CK+ dataset. Furthermore, we present a possible application of the proposed FER system in human-robot interaction: a method for dynamically changing ambient light or LED colors based on the recognized emotions. Indeed, it has been shown that equipping robots with the ability to perceive emotions and to react accordingly through suitable empathic strategies significantly improves human-robot interaction. Possible application scenarios are education, healthcare and autism therapy, where such empathic strategies play a fundamental role.</p>
      </abstract>
      <kwd-group>
        <kwd>Facial Emotion Recognition</kwd>
        <kwd>Human Robot interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>
          Several techniques and deep learning models have been
investigated over the last decade to address the
problem of facial analysis and emotion recognition from
RGB images and videos. Most of them use Convolutional
Neural Networks (CNNs) to extract geometric
features from facial landmark points. To train such
high-capacity classifiers on the very small FER datasets
available, one of the most common approaches is
transfer learning. Works [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ] use pre-trained models
such as VGG16 and AlexNet to initialize the weights of
the CNN, which can improve accuracy and reduce
overfitting. Ng et al. in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] introduced a two-stage
supervised fine-tuning: a first fine-tuning stage is applied
using auxiliary facial expression datasets, followed by a
final fine-tuning on the target AFEW dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
a pre-trained VGG-16 model with redefined dense
layers is used for FER, identifying essential and
optional convolutional blocks in the fine-tuning step. During
training, the selected blocks of the VGG-16 model are
included step by step, instead of being trained all at once, to
diminish the effect of the initial random weights.
        </p>
        <p>
          Since analyzing static images independently
ignores the temporal relations between the frames of a
video, 3D CNNs have also been explored to extract
spatio-temporal features, with outstanding results. In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
a 3D CNN is used to model the appearance and motion
of videos, simultaneously learning spatial and temporal
aspects from image sequences. In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], two deep networks
are combined: a 3D CNN captures the temporal
appearance of facial expressions, while a deep temporal
geometry network extracts the geometrical behaviour of
the facial landmark points.
        </p>
        <p>
          Similarly, some recent works have proposed
combining CNNs with RNNs, which are capable of keeping
track of arbitrary long-term dependencies in input
sequences. In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], multiple LSTM layers are stacked on top
of CNNs. The temporal and spatial representations are then
aggregated in a fusion network to produce per-frame
predictions of 12 facial action units (AUs). Fan et al. in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
propose a hybrid network combining a
CNN-feature-based spatio-temporal RNN model with a 3-dimensional
Convolutional Neural Network (C3D), also including
audio features in order to maximize prediction accuracy.
        </p>
        <p>
          To deal with expression variations and intra-class
variations, namely intensity and subject identity variations,
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] introduces objective functions on the CNN to improve
the expression class separability of the spatial feature
representation and to minimize the intra-class variation within the
same expression class. Differently from previous works,
which first apply CNN architectures or pre-trained
image classifiers as visual feature extractors and then use
the extracted spatial feature representations to train the
RNNs separately, the proposed network analyzes
the spatio-temporal behaviour of facial emotions
using an end-to-end trainable, computationally
efficient CNN+RNN architecture.
        </p>
        <p>Several experiments have also been carried out to show
how a facial emotion recognition system can improve
the performance and potential of human-robot interaction.</p>
        <p>
          Jimenez et al. in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] show how controlling colored lights
based on the user's emotion can provide a real
communication channel between humans and robots. They
introduce a self-sufficiency model system that recognizes and
empathizes with human emotions using colored lights
on a robot’s face. Feldmaier et al. in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] also show the
effectiveness of displaying color combinations and color
patterns in Affective Agents, also adjusting variations of
intensity, brightness and frequency to obtain a
psychological influence on the user. A similar strategy is adopted
in this work as well.
        </p>
        <p>
          Many other works have recently been published in the
field of emotion recognition, face detection, and related
classification tasks [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Dataset</title>
        <p>
          For training, validating and testing the Facial Emotion
Recognition model, the Extended Cohn-Kanade Dataset
(CK+) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] was used. It contains 593 sequences across 123
subjects, and each sequence contains images from
onset (neutral frame) to peak expression (last frame). The
image sequences vary in duration from a minimum
of 10 to a maximum of 60 frames. Images have frontal
views and 30-degree views and were digitized into either
640x490 or 640x480 pixel arrays with 8-bit gray-scale
or 24-bit color values. Each image sequence is
labelled with Action Unit (AU) combinations. A complete list
of the possible AUs is reported in table 1.
        </p>
        <p>For each data sequence, if the list of action units
is consistent with one of the emotion categories
among Anger, Disgust, Fear, Happiness, Sadness,
Surprise and Contempt, a nominal emotion label is
associated with the sequence. To this end, table 2 shows a
complete mapping between Ekman’s six basic emotions
and AUs. Note that only the six basic emotions are
considered in the proposed FER model and reported
in table 2, while the Contempt category, which is very
similar to the Disgust class and does not belong to the basic
emotions group, is excluded. As a result of this selection
process, only 296 of the 593 sequences fit the prototypic
definition and meet the criteria for one of the six discrete
emotions. The Prototype definitions used for translating AU
scores into emotion terms are shown in Table 6 in the
Appendix. The comparison with the emotion prediction
Table 6 was done by applying the emotion prediction rule
very strictly.</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Ekman’s six basic emotions described in terms of facial Action Units. With reference to table 1 we define 1: Inner Brow Raiser, 2: Outer Brow Raiser, 4: Brow Lowerer, 7: Lip Tightener, 9: Nose Wrinkler, 10: Upper Lip Raiser, 12: Lip Corner Puller, 15: Lip Corner Depressor, 17: Chin Raiser, 20: Lip Stretcher, 24: Lip Pressor, 25: Lips Part, 26: Jaw Drop.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Emotional state</th><th>Action Units</th></tr>
            </thead>
            <tbody>
              <tr><td>Anger</td><td>4, 7, 24</td></tr>
              <tr><td>Disgust</td><td>9, 10, 17</td></tr>
              <tr><td>Fear</td><td>1, 4, 20, 25</td></tr>
              <tr><td>Happiness</td><td>12, 25</td></tr>
              <tr><td>Sadness</td><td>4, 15</td></tr>
              <tr><td>Surprise</td><td>1, 2, 25, 26</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In order to increase the amount of data, which
appears to be too small for training the model, another 35%
of the dataset was labelled by hand using not only
Prototypes but also their Major Variants, as shown in
the Emotion Prediction Table 6. Compared with
Prototypes, variants allow for subsets of AUs. As a consequence
they are less strict definitions, but still truly
representative of the given emotion. As a result, a total of
511 video sequences are collected by also including the
major variant definitions in the conversion rules. Neutral
facial expressions were also included, for a total of
7 emotion categories: Neutral (Neu), Anger (Ang),
Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad) and
Surprise (Sur). Notice that even though the dataset grows
in size, it remains unbalanced: some of
the categories, such as surprise and happiness, are more
represented in the dataset than the others. The
final distribution of data samples among the seven
categories is reported in table 3.</p>
        <table-wrap>
          <label>Table 3</label>
          <caption>
            <p>Final number of video samples per category in the dataset. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Emotion label</th><th>Ang</th><th>Neu</th><th>Dis</th><th>Fea</th><th>Hap</th><th>Sad</th><th>Sur</th></tr>
            </thead>
            <tbody>
              <tr><td>#Videos</td><td>51</td><td>52</td><td>70</td><td>61</td><td>108</td><td>82</td><td>87</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The minimum length of the data
samples in the dataset is 10 frames. For this reason, in
order to maximize the number of input sequences and
collect the largest amount of training data, the sequence
length, i.e. the number of frames per video, is set to 10.
For sequences of greater length, only the last 10 frames are
considered, in order to ensure that the apex (peak) frame,
which is the most representative of the given emotion, is
captured. For each frame, pixel values are
rescaled so that each pixel value ∈ [0, 1]. A central
cropping is applied to each frame in order to
focus more on the human face. After resizing and cropping,
the final dimension of each input sequence is
(n_frames, width, height, n_channels) = (10, 48, 48, 3).</p>
      </sec>
      <sec>
        <title>4. Model Architecture</title>
        <p>The emotion recognition task is modelled as a multi-class
classification problem over a set of 7 different categories:
{Anger, Neutral, Disgust, Fear, Happiness,
Sadness, Surprise}.
As shown in Section 2, Recurrent Neural Networks (RNNs) in
combination with Convolutional Neural Networks (CNNs)
and 3-dimensional Convolutional Neural Networks (C3D)
provide powerful results when dealing with sequential
image data. Following this approach, a standard, computationally
efficient CNN + RNN architecture is proposed here.
From the input sequence of frames, spatial feature
representations are learned by a Convolutional Neural
Network (CNN). In order to capture the facial
expression dynamics, a temporal feature representation of the
facial expression is learned via a Recurrent Neural Network (RNN).</p>
        <p>The last two convolutional layers of the CNN have a 3x3 filter size. All 2D Max Pooling
layers have a kernel size of 2x2. In order to extract temporal
correlations from the extracted features, the CNN output
is fed directly as input to a GRU with 8 units, which is its
output dimension. Finally, the output layer is a Dense
(fully connected) layer with a Softmax
activation function. The Softmax layer has the same number
of nodes as the output layer, in order to assign to each emotion category
a decimal probability, with the probabilities summing to 1.</p>
        <p>For each value <inline-formula><tex-math><![CDATA[z_i]]></tex-math></inline-formula> from the neurons of the output layer,
the per-category probability is computed as:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[\sigma(z_i) = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}]]></tex-math>
        </disp-formula>
        <p>such that the probability values always sum to 1 and only
one emotion label is activated.</p>
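        <p>As an illustration of Equation (1), the following minimal Python sketch computes softmax probabilities for seven output values. The values in logits are made up for illustration and are not taken from the trained model; the names softmax, logits and EMOTIONS are ours, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.</p>
        <preformat><![CDATA[
```python
import math

EMOTIONS = ["Anger", "Neutral", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"]


def softmax(values):
    """Return softmax probabilities for a list of output-layer values."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-layer values, one per emotion category.
logits = [0.2, 1.5, -0.3, 0.1, 2.4, 0.9, 1.1]
probs = softmax(logits)

# The category with the largest probability is the predicted emotion.
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, round(sum(probs), 6))
```
]]></preformat>
        <p>The category with the largest probability is taken as the predicted emotion.</p>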
      </sec>
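      <sec>
        <p>The preprocessing described in the Dataset section (keeping the last 10 frames, rescaling pixel values to [0, 1] and applying a central crop) can be sketched in plain Python as follows. This is only a sketch under stated assumptions: frames are nested lists of 8-bit RGB tuples, the synthetic 64x64 input is hypothetical, the function names are ours, and the resizing step is omitted, so a real pipeline would also need to resize the 640x490 or 640x480 CK+ frames.</p>
        <preformat><![CDATA[
```python
SEQ_LEN = 10  # frames kept per sequence, as stated in the text
CROP = 48     # side of the square crop; resizing is omitted in this sketch


def center_crop(frame, size):
    """Crop a size x size window from the centre of a 2D list of pixels."""
    height, width = len(frame), len(frame[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def rescale(frame):
    """Map 8-bit channel values (0-255) to floats in [0, 1]."""
    return [[tuple(c / 255.0 for c in pixel) for pixel in row] for row in frame]


def preprocess(sequence, seq_len=SEQ_LEN, crop=CROP):
    """Keep the last seq_len frames, then crop and rescale each one."""
    if len(sequence) < seq_len:
        raise ValueError("sequence shorter than the required length")
    last_frames = sequence[-seq_len:]  # the apex frame is the last one
    return [rescale(center_crop(frame, crop)) for frame in last_frames]


# Synthetic 12-frame sequence of 64x64 RGB frames (hypothetical data).
sequence = [[[(i, 128, 255) for _ in range(64)] for _ in range(64)] for i in range(12)]
clip = preprocess(sequence)
shape = (len(clip), len(clip[0]), len(clip[0][0]), len(clip[0][0][0]))
print(shape)
```
]]></preformat>
      </sec>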
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <sec id="sec-3-1">
        <title>5.1. Implementation Details</title>
        <sec id="sec-3-1-1">
          <p>To verify the effectiveness of the proposed model, experiments
have been conducted on the modified CK+ dataset
as described in Section 3.</p>
          <p>For efficient data parsing, the input image sequences
of 10 frames are first transformed into TFRecord files. Then
the tensorflow TFRecordDataset class is used to standardize
the data and generate batches of 24 image sequences to be fed
as input to the model. 80% of the data samples are used
in the training phase, while the remaining 20% are used as the test set.
For training the model we used the Adam optimizer with
learning rate 1 × 10<sup>-3</sup>. In a multi-class classification
problem, each input sample can only belong to
one out of many possible categories, therefore a
Categorical Cross Entropy Loss was used. Since the dataset is
unbalanced, the Categorical Accuracy alone is not enough to
give a true evaluation of the model, so the Precision
and Recall metrics were also considered. In order to prevent
over-fitting, the L1 and L2 regularizers are set to 0.01
and dropout to 0.4. The model was trained for a
total of 400 epochs, with a mean training time for a single
epoch of only 4 s.</p>
          <p>The temporal dynamics of the
facial expression are learned via the Recurrent Neural Network
(RNN). Both Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU) cells have been tested for the
given problem and provide comparable results. However,
GRUs have a less complex structure and are computationally
more efficient. Therefore, given also that
the model does not have a huge amount of data, GRUs
are preferred to get a good accuracy. Moreover, such a 2D
CNN + GRU approach allows an end-to-end trainable model
which is computationally less expensive than
the others. Indeed, the whole model has a very small
number of parameters, around 8k, which makes it very
light.</p>
          <p>The model takes as input a window of 10 RGB frames
of size 48x48. Data are processed in batches of 24 image
sequences. The full architecture of the proposed model
is shown in Figure 1. It consists of a four-layer 2D CNN.
The first two convolutional layers have a 5x5 kernel size.</p>
        </sec>
        <sec>
          <title>5.2. Results analysis</title>
          <p>In this section, the experimental results are presented.
Since the dataset is not balanced, both precision and recall,
together with the accuracy score, are considered in order
to evaluate the performance of the model. For the best
validation epoch, the model achieves 66% mean
Precision and 65% mean Recall across categories. Table 4
shows the per-category validation accuracy rates. As we can
notice, the mean validation accuracy over categories is not
so high: per-category validation accuracy reaches
very high scores for some of the emotion classes and
very low values for others, meaning that the model
almost fails when it tries to recognize anger and fear
emotions, while it behaves very well when dealing with all
the other categories. In particular, the model appears
very powerful in recognizing Happy and Sad facial
expressions, with 86% accuracy. High accuracy scores
are also obtained for the Neutral (83%) and Surprise (79%)
categories. As a result, even if the model shows
very high performance for most of the emotions,
difficulties in learning some specific facial expressions, in
particular Angry with only 10% accuracy, lead to a
lower mean accuracy score across categories. To better
understand the model, the confusion matrix over the CK+
validation set is reported in Fig 2. As we can notice, the
main diagonal is highlighted by high rates of correctly
classified samples. However, some off-diagonal elements
reflect mislabeled predictions by the classifier.</p>
          <table-wrap>
            <label>Table 5</label>
            <caption>
              <p>Evaluation metrics for the model: mean validation Accuracy, Precision and Recall across categories.</p>
            </caption>
            <table>
              <tbody>
                <tr><td>Validation Accuracy</td><td>66%</td></tr>
                <tr><td>Validation Precision</td><td>66%</td></tr>
                <tr><td>Validation Recall</td><td>65%</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <fig>
            <label>Figure 3</label>
            <caption>
              <p>Left: Apex (peak) frame from an Angry facial expression from the CK+ dataset. Right: Apex frame from a Sad facial expression performed by a different subject in the CK+ dataset.</p>
            </caption>
          </fig>
          <p>In particular, prediction errors mainly concern the Anger
and Fear categories, which appear to be harder
to learn and recognize, as shown in the confusion matrix.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Facial Emotion Recognition in human-robot interaction</title>
      <sec id="sec-4-1">
        <p>Angry facial expressions are frequently confused with
Sad ones, which actually look very similar. Indeed,
if we look at some data examples, as shown in figure 3, it
would be quite difficult even for humans to infer the correct
labeling. This is because both Sadness and Anger
are characterized by very similar features, such as lowered
eyebrows and a tight mouth, as also highlighted in table
2, which describes the emotional states in terms of
facial Action Units. As a consequence, even if Sad facial
expressions typically show much more lowered lip corners,
more discreetly and shyly performed emotions may be
easily confused due to inter-class similarity and variations
in emotion intensity. Analogously, Fear
is often wrongly classified as Happiness, since
they show a very similar behaviour of the lips and share
action unit number 25 - Lips Part, as reported in table
2. As shown in the top row of fig 4, the fear facial expression is
represented with the jaw dropped open and the lips stretched
horizontally. As a result, if we look at fig 4, for both the Fear
and Happiness facial expressions (top and bottom rows
respectively) the lips are closed at the onset (first frame),
then start to move farther apart until they are parted in the apex
frame, showing a very similar temporal behaviour which
makes it difficult to distinguish between the two.</p>
        <p>In summary, the network can precisely recognize the
Happiness, Sadness, Surprise and Neutral emotional states,
while it is not so reliable for Anger and Fear.
Therefore, only emotion categories with a validation accuracy above
50% will be used in human-robot interaction
applications.</p>
        <p>
          In this section, a possible application of the proposed
Facial emotion recognition system in Human-Robot and
Human-Computer interaction is presented. Several
studies [
          <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20">17, 18, 19, 20</xref>
          ] have shown that equipping a robot with
the ability to perceive user feelings and emotions and
to react accordingly can significantly improve
the quality of the interaction and, more importantly, user
acceptance. Affective communication for social robots
has been deeply investigated in recent years, showing
that, beyond social intelligence, emotional intelligence also
plays a crucial role in successful interactions. In
particular, the need to recognize emotions and properly
introduce empathic strategies has turned out to be fundamental
in vulnerable scenarios such as education, healthcare,
autism therapy and driving support.
        </p>
      </sec>
      <sec>
        <title>6.1. Application</title>
        <p>A very effective and powerful solution in human-robot
interaction can be to introduce an empathic strategy based on
colored lights. Indeed, colors and emotions are closely
linked. Several physical and psychological studies have
shown that playing with colors and dynamic lights can be
very effective, since they can evoke feelings and
emotions in human observers. For this reason, a method
for dynamically changing ambient light or LED colors
based on recognized emotions is presented. The aim is to
establish a very intuitive and natural way of communicating
emotions, and also to provide a positive influence
on the user by means of evocative associations.</p>
        <p>
          In order to use colors to stimulate a certain positive
feeling, such as calm, energy or happiness, we first need
a semantic mapping between colors and emotions.
Plenty of experiments have been done in order
to find a precise and reliable mapping, and all have
provided almost the same results in terms of color meaning.
To achieve a simple and affordable model, the proposed
method relies on Naz Kaya's research [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], whose results
are summarized in table 5.
        </p>
        <p>Once a good emotion-color mapping has been found, an
empathic strategy that selects a specific evocative color
depending on the particular recognized emotion can be
adopted.</p>
        <p>Green is usually associated with the most positive
emotions, such as hope, peace and satisfaction.
The same applies to blue, which is typically
associated with calm and relaxed emotional states and, like green,
can be useful to contrast negative emotions such
as anger. Studies have shown that yellow is able
to evoke happy and joyful emotional states; therefore
yellow lights can be set whenever the robot recognizes that
the user may be sad, or even happy or surprised, in order
to emphasize and agree with these positive emotions.
Strong colors such as red are usually associated with anger,
aggressiveness and very intense emotions, and mixed with
yellow shades can evoke active, energetic and
powerful motivational states. It is then reasonable to activate
Yellow-Red shades when the user is recognized to be sad
and demotivated.</p>
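        <p>The empathic strategy described above can be sketched as a simple lookup that is applied only to reliably recognized emotions. The RGB triples, the handling of Neutral (lights left unchanged) and the names RELIABLE, EMOTION_TO_RGB and select_color are illustrative assumptions, not the authors' implementation.</p>
        <preformat><![CDATA[
```python
# Emotions the text reports as reliably recognized (validation accuracy above 50%).
RELIABLE = {"Happiness", "Sadness", "Surprise", "Neutral"}

# Hypothetical emotion-to-RGB table loosely following the associations in the text;
# the exact RGB triples are illustrative assumptions.
EMOTION_TO_RGB = {
    "Happiness": (255, 255, 0),  # yellow: emphasize a positive state
    "Surprise": (255, 255, 0),   # yellow
    "Sadness": (255, 128, 0),    # yellow-red shade: motivating
    "Fear": (0, 255, 0),         # green: sense of peace
    "Anger": (0, 0, 255),        # blue: calm
}


def select_color(emotion, current_rgb):
    """Return the RGB colour to display for a recognized emotion.

    Unreliable or unmapped emotions leave the current colour unchanged.
    """
    if emotion not in RELIABLE:
        return current_rgb
    return EMOTION_TO_RGB.get(emotion, current_rgb)


current = (255, 255, 255)  # start from white light
for emotion in ["Happiness", "Anger", "Sadness", "Neutral"]:
    current = select_color(emotion, current)
    print(emotion, current)
```
]]></preformat>
        <p>Gating on the reliable categories means that a misrecognized Anger or Fear expression never changes the lights, at the cost of not reacting to those emotions at all.</p>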
        <p>For example, when the robot perceives that the user is
in trouble or afraid, the lights can automatically switch
to green or shades of green in order to evoke a sense
of peace and create a more comfortable environment,
since green is usually associated with the most positive emotions.</p>
      </sec>
      <sec>
        <title>7. Conclusion</title>
        <p>In this work, we address the problem of Facial Emotion
Recognition by introducing an end-to-end trainable deep
learning model which can reason on both the spatial and
temporal features of facial expressions. Results have
shown that the proposed model can effectively learn to
distinguish between most basic emotions. Furthermore,
a method for setting the RGB light components according
to the recognized user emotions is presented. In particular,
an empathic strategy that exploits the proposed emotion
recognition system to identify user emotions and
control ambient colored lights accordingly can be used to
improve the quality of human-robot interactions.</p>
        <p>
          In the future, the proposed strategy can be developed
and validated on a social robot such as Pepper from SoftBank
Robotics [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], by controlling the colors of its body LEDs.
These are placed on the chest, eyes, shoulders and ears,
allowing for a more friendly and engaging interaction.
        </p>
        <p>Another possible direction of future work could explore
how this empathic model can become adaptive to the user's
preferences, so that the robot can learn the impact
of the different colors on a particular user, depending on
their previous reactions.</p>
        <sec id="sec-4-1-1">
          <title>7.1. Ethical Impacts</title>
          <p>The system presented in this paper has a wide field of
application. Its aim is to detect and point out a person’s
emotion in a very simple and informative way through
colors. Such systems can be useful when
the environment needs to adapt intelligently to the user,
e.g. by changing the light color in response to people's mood
in a room full of music stimuli. However, we also
acknowledge the possibility of misuse. Indeed, emotions are a
fundamental part of people's lives and thus are private.</p>
          <p>Advancing the state of the art in this field also means
exposing every human's inner self to the world. As for
most research, the pros and cons must be weighed
and evaluated considering every possible use case
scenario. We believe that our work does not hold enough
practical ground to be misused by third parties, but we
are still concerned about this possibility.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Illari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Avanzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Reducing the psychological burden of isolated oncological patients by means of decision trees</article-title>
          , volume
          <volume>2768</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. V.</given-names>
            <surname>Friesen</surname>
          </string-name>
          ,
          <article-title>Constants across cultures in the face and emotion</article-title>
          .,
          <source>Journal of personality and social psychology 17</source>
          (
          <year>1971</year>
          )
          <fpage>124</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Nijdam</surname>
          </string-name>
          ,
          <article-title>Mapping emotion to color</article-title>
          ,
          <source>Book Mapping emotion to color</source>
          (
          <year>2009</year>
          )
          <fpage>2</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vonikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Winkler</surname>
          </string-name>
          ,
          <article-title>Deep learning for emotion recognition on small datasets using transfer learning</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM on international conference on multimodal interaction</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akhand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. S.</given-names>
            <surname>Kamal</surname>
          </string-name>
          , T. Shimamura,
          <article-title>Facial emotion recognition using transfer learning in the deep cnn</article-title>
          ,
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>1036</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lucey</surname>
          </string-name>
          , T. Gedeon,
          <article-title>Acted facial expressions in the wild database</article-title>
          , Australian National University, Canberra, Australia,
          <source>Technical Report TR-CS-11 2</source>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Joint finetuning in deep neural networks for facial expression recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2983</fpage>
          -
          <lpage>2991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Haddad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lézoray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hamel</surname>
          </string-name>
          ,
          <article-title>3d-cnn for facial emotion recognition in videos</article-title>
          ,
          <source>in: International Symposium on Visual Computing</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>298</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.-S.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De la Torre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <article-title>Learning spatial and temporal cues for multi-label facial action unit detection</article-title>
          ,
          <source>in: 2017 12th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2017)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Video-based emotion recognition using cnn-rnn and c3d hybrid networks</article-title>
          ,
          <source>in: Proceedings of the 18th ACM international conference on multimodal interaction</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Baddar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Ro</surname>
          </string-name>
          ,
          <article-title>Multiobjective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>223</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kanoh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Psychological effects of a synchronously reliant agent on human beings</article-title>
          ,
          <source>Journal of Advanced Computational Intelligence</source>
          Vol
          <volume>17</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Feldmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Marmat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Diepold</surname>
          </string-name>
          ,
          <article-title>Evaluation of a rgb-led-based emotion display for affective agents</article-title>
          ,
          <source>arXiv preprint arXiv:1612.07303</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Avanzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Beritelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vaccaro</surname>
          </string-name>
          ,
          <article-title>Yolov3-based mask and face recognition algorithm for individual protection applications</article-title>
          , volume
          <volume>2768</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing</article-title>
          , volume
          <volume>2472</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lucey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kanade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saragih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ambadar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <article-title>The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression</article-title>
          ,
          <source>in: 2010 IEEE computer society conference on computer vision and pattern recognition-workshops</source>
          , IEEE
          ,
          <year>2010</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Castellano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paiva</surname>
          </string-name>
          ,
          <article-title>Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings</article-title>
          ,
          <source>in: Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>Using modularity metrics to assist move method refactoring of large systems</article-title>
          ,
          <year>2013</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CISIS.2013.96</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nalin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bergamini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giusti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanna</surname>
          </string-name>
          ,
          <article-title>Children's perception of a robotic companion in a mildly constrained setting</article-title>
          ,
          <source>in: IEEE/ACM human-robot interaction 2011 conference (robots with children workshop) proceedings</source>
          , Citeseer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wozniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Borowik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A first attempt to cloud-based user verification in distributed system</article-title>
          ,
          <source>in: 2015 Asia-Pacific Conference on Computer Aided System Engineering</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <article-title>Relationship between color and emotion: A study of college students</article-title>
          ,
          <source>College student journal</source>
          <volume>38</volume>
          (
          <year>2004</year>
          )
          <fpage>396</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gelin</surname>
          </string-name>
          ,
          <article-title>A mass-produced sociable humanoid robot: Pepper: The first machine of its kind</article-title>
          ,
          <source>IEEE Robotics &amp; Automation Magazine</source>
          <volume>25</volume>
          (
          <year>2018</year>
          )
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>