1. Introduction

Automatic RGB Inference Based on Facial Emotion Recognition

Nicolo' Brandizzi

brandizzi@diag.uniroma1.it 0

Valerio Bianco

Giulia Castro

castro.1742813@studenti.uniroma1.it 0

Samuele Russo

samuele.russo@studenti.uniroma1.it 2

Agata Wajda

agata.wajda@polsl.pl 1

0 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
1 Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland
2 Department of Psychology, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy


Recently, Facial Emotion Recognition (FER) has been one of the most promising and fastest-growing fields in computer vision and human-robot interaction. In this work, a deep neural network is introduced to address the problem of facial emotion recognition. In particular, a CNN+RNN architecture has been designed to capture both the spatial features and the temporal dynamics of facial expressions. Experiments are performed on the CK+ dataset. Furthermore, we present a possible application of the proposed Facial Emotion Recognition system in human-robot interaction: a method for dynamically changing ambient light or LED colors based on recognized emotions. Indeed, it has been shown that equipping robots with the ability to perceive emotions and to react accordingly by introducing suitable empathic strategies significantly improves human-robot interaction performance. Possible application scenarios are education, healthcare and autism therapy, where such empathic strategies play a fundamental role.

Keywords: Facial Emotion Recognition, Human-Robot Interaction

2. Related Work

Several techniques and deep learning models have been investigated over the last decade in order to address the problem of facial analysis and emotion recognition from RGB images and videos. Most of them use Convolutional Neural Networks (CNNs) for extracting geometric features from facial landmark points. In order to train such high-capacity classifiers on the very small FER datasets available, one of the most common approaches is transfer learning. Works [ 4, 5 ] use pre-trained models such as VGG16 and AlexNet to initialize the weights of the CNN, which can improve accuracy and reduce overfitting. Ng et al. [ 4 ] introduced a two-stage supervised fine-tuning: a first-stage fine-tuning is applied using auxiliary facial expression datasets, followed by a final fine-tuning on the target AFEW dataset [ 6 ]. In [ 5 ], a pre-trained VGG-16 model plus redefined dense layers is used for FER, identifying essential and optional convolutional blocks in the fine-tuning step. During training, the selected blocks of the VGG-16 model are included step by step, instead of training them all at once, to diminish the effect of the initial random weights.

Instead of analyzing static images independently, thus ignoring the temporal relations among the frames of a video sequence, 3D CNNs have also been explored to extract spatio-temporal features, with outstanding results. In [ 7 ], 3D CNNs were used to model the appearance and motion of videos, simultaneously learning spatial and temporal aspects from image sequences. In [ 8 ], two deep networks are combined: a 3D CNN captures the temporal appearance of facial expressions, while a deep temporal geometry network extracts the geometrical behaviour of the facial landmark points.

Similarly, some recent works have proposed combinations of CNNs and RNNs, which are capable of keeping track of arbitrary long-term dependencies in input sequences. In [ 9 ], multiple LSTM layers are stacked on top of CNNs. Temporal and spatial representations are then aggregated by a fusion network to produce per-frame predictions of 12 facial action units (AUs). Fan et al. [ 10 ] propose a hybrid network combining a spatio-temporal RNN model based on CNN features with a 3-dimensional Convolutional Neural Network (C3D), also including audio features in order to maximize prediction accuracy.

To deal with expression variations and intra-class variations, namely intensity and subject identity variations, [ 11 ] introduces objective functions on the CNN to improve the expression class separability of the spatial feature representation and to minimize intra-class variation within the same expression class. Differently from previous works, which first apply CNN architectures or pre-trained image classifiers as visual feature extractors and then use the extracted spatial feature representation to train the RNNs separately, the proposed network aims to analyze the spatio-temporal behaviour of facial emotions with an end-to-end trainable, computationally efficient CNN+RNN architecture.

Several experiments have also been carried out to show how a facial emotion recognition system can improve the performance and potential of human-robot interaction.

Jimenez et al. [ 12 ] show how controlling colored lights based on user emotion can represent a real communication channel between humans and robots. They introduce a self-sufficiency model system that recognizes and empathizes with human emotions using colored lights on a robot's face. Feldmaier et al. [ 13 ] also show the effectiveness of displaying color combinations and color patterns in Affective Agents, also adjusting intensity, brightness and frequency to obtain a psychological influence on the user. A similar strategy is adopted in this work as well.

Many other works have recently been published in the fields of emotion recognition, face detection, and related classification tasks [ 14, 15 ].

3. Dataset

For training, validating and testing the Facial Emotion Recognition model, the Extended Cohn-Kanade Dataset (CK+) [ 16 ] was used. It contains 593 sequences across 123 subjects, and each sequence contains images from onset (neutral frame) to peak expression (last frame). An image sequence can vary in duration from a minimum of 10 to a maximum of 60 frames. Images have frontal views and 30-degree views and were digitized into either 640x490 or 640x480 pixel arrays with 8-bit gray-scale or 24-bit color values. Each image sequence is labelled with Action Unit (AU) combinations. A complete list of the possible AUs is reported in Table 1.

For each data sequence, if the list of action units is consistent with one of the emotion categories among Anger, Disgust, Fear, Happiness, Sadness, Surprise and Contempt, a nominal emotion label is associated with the sequence. To this aim, Table 2 shows a complete mapping between Ekman's six basic emotions and AUs. Note that only the six basic emotions are considered in the proposed FER model and reported in Table 2, while the Contempt category, which is very similar to the Disgust class and does not belong to the group of basic emotions, is excluded. As a result of this selection process, only 296 of the 593 sequences fit the prototypic definition and meet the criteria for one of the six discrete emotions. The Prototype definitions used for translating AU scores into emotion terms are shown in Table 6 in the Appendix. The comparison with the emotion prediction Table 6 was done by applying the emotion prediction rule very strictly.

Table 2
Ekman's six basic emotions described in terms of facial Action Units. With reference to Table 1 we define 1: Inner Brow Raiser, 2: Outer Brow Raiser, 4: Brow Lowerer, 7: Lip Tightener, 9: Nose Wrinkler, 10: Upper Lip Raiser, 12: Lip Corner Puller, 15: Lip Corner Depressor, 17: Chin Raiser, 20: Lip Stretcher, 24: Lip Pressor, 25: Lips Part, 26: Jaw Drop.

Emotional state    Action Units
Anger              4, 7, 24
Disgust            9, 10, 17
Fear               1, 4, 20, 25
Happiness          12, 25
Sadness            4, 15
Surprise           1, 2, 25, 26

In order to enlarge the amount of data, which appears to be too poor for training the model, another 35% of the dataset was labelled by hand using not only Prototypes but also their Major Variants, as shown in the emotion prediction Table 6. Compared with Prototypes, variants allow for subsets of AUs. As a consequence, they are less strict definitions but are still truly representative of the given emotion. As a result, a total of 511 video sequences are collected by also including the major variant definitions in the conversion rules. Neutral facial expressions were also included, for a total of 7 emotion categories: Neutral (Neu), Anger (Ang), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad) and Surprise (Sur). Notice that even if the dataset grows in size, it remains unbalanced: some of the categories, such as Surprise and Happiness, are more represented in the dataset than the others. The final distribution of data samples among the seven categories is reported in Table 3.

Table 3
Final number of video samples per category in the dataset. In order: Anger (Ang), Neutral (Neu), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad), Surprise (Sur).

Emotion label    Ang   Neu   Dis   Fea   Hap   Sad   Sur
#Videos          51    52    70    61    108   82    87

The minimum length of the data samples in the dataset is 10 frames. For this reason, in order to maximize the number of input sequences and collect the largest amount of training data, the sequence length, i.e. the number of frames per video, is set to 10. For sequences of greater length, only the last 10 frames are considered, in order to ensure that the apex (peak) frame, which is the most representative for the given emotion, is captured. For each frame, pixel values are rescaled so that each pixel ∈ [ 0, 1 ]. A central crop is applied to each frame in order to focus more on the human face. After resizing and cropping, the final dimension of each input sequence is (n_frames, width, height, n_channels) = (10, 48, 48, 3).

4. Model Architecture

The emotion recognition task is modelled as a multi-class classification problem over a set of 7 different categories C = {Anger, Neutral, Disgust, Fear, Happiness, Sadness, Surprise}.

As shown in Section 2, Recurrent Neural Networks (RNN) in combination with Convolutional Neural Networks (CNN) and 3-dimensional Convolutional Neural Networks (C3D) provide powerful results when dealing with sequential image data. Following this approach, a standard, computationally efficient CNN + RNN architecture is proposed here. From the input sequence of frames, spatial feature representations are learned by a Convolutional Neural Network (CNN). In order to capture the facial expression dynamics, the temporal feature representation of the facial expression is learned via a Recurrent Neural Network (RNN). Both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been tested for the given problem and provide comparable results. However, GRUs have a less complex structure and are computationally more efficient. Therefore, also given that the model does not have a huge amount of data, GRUs are preferred to get a good accuracy. Moreover, such a 2D CNN + GRU approach allows an end-to-end trainable model that is less computationally expensive than the others. Indeed, the whole model has a very small number of parameters, around 8k, which makes it very light.

The model takes as input a window of 10 RGB frames of size 48x48. Data are processed in batches of 24 image sequences. The full architecture of the proposed model is shown in Figure 1. It consists of a four-layer 2D CNN. The first two convolutional layers have a 5x5 kernel size, while the last two have a 3x3 filter size. All 2D Max Pooling layers have a kernel size of 2x2. In order to extract temporal correlations in the extracted input features, the CNN output is directly fed as input to a GRU of 8 units, which is the output dimension. Finally, the output layer is a Dense layer, i.e. a fully connected layer, with a Softmax activation function. The Softmax layer has the same number of nodes as the output layer in order to assign decimal probabilities summing to 1 to each of the emotion categories.

For each value y_i from the neurons of the output layer, the per-category probability is computed as:

$$P(y_i) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)} \qquad (1)$$

such that the probability values always sum to 1 and only one emotion label is activated.

5. Experiments

5.1. Implementation Details

To verify the effectiveness of the proposed model, experiments have been conducted on the modified CK+ dataset described in Section 3.

For efficient data parsing, input image sequences of 10 frames are first transformed into TFRecord files. Then the TensorFlow TFRecordDataset class is used to standardize the data and generate batches of 24 image sequences to be fed as input to the model. 80% of the data samples are used in the training phase, while the remaining 20% are used as the test set. For training the model we used the Adam optimizer with a learning rate of 1 × 10^-3. In a multi-class classification problem, each input sample can only belong to one out of many possible categories, therefore a Categorical Cross Entropy loss was used. Since the dataset is unbalanced, Categorical Accuracy alone is not enough for a true evaluation of the model, so Precision and Recall metrics were also considered. In order to prevent over-fitting, L1 and L2 regularizers are set to 0.01 and dropout to 0.4. The model was trained for a total of 400 epochs, with a mean training time for a single epoch of only 4 s.

5.2. Results analysis

In this section, the experimental results are presented. Since the dataset is not balanced, both precision and recall, together with the accuracy score, are considered in order to evaluate the performance of the model. For the best validation epoch the model achieves 66% mean Precision and 65% mean Recall across categories. Table 4 shows per-category validation accuracy rates. As we can notice, the mean validation accuracy over categories is not so high: per-category validation accuracy reaches very high scores for some of the emotion classes and very low values for others, meaning that the model almost fails when it tries to recognize anger and fear emotions, while it behaves very well when dealing with all the other categories. In particular, the model appears very powerful in recognizing Happy and Sad facial expressions, with 86% accuracy. High accuracy scores are also obtained for the Neutral (83%) and Surprise (79%) categories. As a result, even if the model shows very high performance for most of the emotions, difficulties in learning some specific facial expressions, in particular Angry with only 10% accuracy, lead to a lower mean accuracy score across categories.

Table 5
Evaluation metrics for the model. Mean Validation Accuracy, Precision and Recall across categories.

Validation Accuracy     66%
Validation Precision    66%
Validation Recall       65%

To better understand the model, the confusion matrix over the CK+ validation set is reported in Fig. 2. As we can notice, the main diagonal is highlighted by high rates of correctly classified samples. However, some off-diagonal elements reflect mislabeled predictions by the classifier. In particular, prediction errors mainly concern the Anger and Fear categories, which appear to be harder to learn and recognize. As shown in the confusion matrix, Angry facial expressions are frequently confused with Sad ones, which actually look very similar. Indeed, if we look at some data examples, as shown in Figure 3, it is quite difficult even for humans to infer the correct labeling. This is because both Sadness and Anger are characterized by very similar features, such as lowered eyebrows and a tight mouth, as also highlighted in Table 2, which describes emotional states in terms of facial Action Units. As a consequence, even if Sad facial expressions typically show much more lowered lip corners, more discreet and shyly performed emotions may be easily confused due to inter-class similarity and emotion intensity variations.

Figure 3: Left: Apex (peak) frame from an Angry facial expression from the CK+ dataset. Right: Apex frame from a Sad facial expression performed by a different subject in the CK+ dataset.

In an analogous way, Fear emotions are often wrongly classified as Happiness, since they show a very similar behaviour of the lips and share action unit number 25 (Lips Part), as reported in Table 2. As shown in Fig. 4, top row, the fear facial expression is represented with the jaw dropped open and the lips stretched horizontally. As a result, if we look at Fig. 4, for both Fear and Happiness facial expressions (top and bottom rows respectively) the lips are closed at the onset (first frame), then start to move farther apart until they are parted in the apex frame, showing a very similar temporal behaviour which makes it difficult to distinguish between the two.

In summary, the network can precisely recognize Happiness, Sadness, Surprise and Neutral emotional states, while it is not so reliable for Anger and Fear. Therefore only emotion categories with a validation accuracy over 50% will be used in human-robot interaction applications.

6. Facial Emotion Recognition in human-robot interaction

In this chapter, a possible application of the proposed Facial Emotion Recognition system in Human-Robot and Human-Computer interaction is presented. Several studies [ 17, 18, 19, 20 ] have proven that equipping the robot with the ability to perceive user feelings and emotions and to react accordingly can significantly improve the quality of the interaction and, more importantly, user acceptability. Affective communication for social robots has been deeply investigated in recent years, showing that beyond social intelligence, emotional intelligence also plays a crucial role in successful interactions. In particular, the need to recognize emotions and properly introduce empathic strategies turned out to be fundamental in vulnerable scenarios such as education, healthcare, autism therapy and driving support.

6.1. Application

A very effective and powerful solution in human-robot interaction can be to introduce an empathic strategy based on colored lights. Indeed, colors and emotions are closely linked. Several physical and psychological studies have shown that playing with colors and dynamic lights can have very effective outcomes, since they can evoke feelings and emotions in human observers. For this reason, a method for dynamically changing ambient light or LED colors based on recognized emotions is presented. The aim is to establish a very intuitive and natural way of communicating emotions, and also to provide a positive influence on the user by means of evocative associations.

In order to use colors to stimulate a certain positive feeling, such as calm, energy or happiness, we first need a semantic mapping between colors and emotions. Plenty of experiments have been done in order to find a precise and reliable mapping, and all have provided almost the same results in terms of color meaning. To achieve a simple and affordable model, the proposed method relies on the research of Naz Kaya [ 21 ], whose results are summarized in Table 5.

Once a good emotion-color mapping is found, an empathic strategy that selects a specific evocative color depending on the particular recognized emotion can be adopted.

For example, when the robot perceives that the user is in trouble or is afraid, the lights can automatically be switched to green or green shades in order to recall a sense of peace and create a more comfortable environment. Indeed, green is usually associated with the most positive emotions, such as hopefulness, peacefulness and satisfaction. The same applies to blue, which is typically associated with calm and relaxed emotional states and, as for green, can be useful to contrast negative emotions such as Anger. Studies have proven that yellow is able to evoke happy and joyful emotional states; therefore yellow lights can be set whenever the robot recognizes that the user may be sad, or even happy and surprised, in order to emphasize and agree with these positive emotions. Strong colors such as red are usually associated with anger, aggressiveness and very intense emotions, and mixed with yellow shades they can evoke active, energetic and powerful motivational states. It is then reasonable to activate yellow-red shades when the user is recognized to be sad and demotivated.

7. Conclusion

In this work, we address the problem of Facial Emotion Recognition by introducing an end-to-end trainable deep learning model which can reason on both spatial and temporal features of facial expressions. Results have shown that the proposed model can effectively learn to distinguish between most basic emotions. Furthermore, a method for setting RGB light components according to recognized user emotions is presented. In particular, an empathic strategy that exploits the proposed emotion recognition system to identify user emotions and control ambient colored lights accordingly can be used to improve the quality of human-robot interactions.
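As an illustration of the RGB inference step summarized above, the following minimal Python sketch maps a recognized emotion to a light color and keeps only the categories whose validation accuracy exceeds 50%. The emotion-color pairs follow the examples of Section 6.1 and the accuracies are those reported in Section 5.2. The numeric RGB triples, the names `COLOR_TO_RGB`, `EMOTION_TO_COLOR` and `infer_rgb`, and the choice of the yellow-red shade for Sadness are assumptions made for this example only.

```python
from typing import Dict, Optional, Tuple

RGB = Tuple[int, int, int]

# Assumed 8-bit RGB triples: the text names colors only, not numeric values.
COLOR_TO_RGB: Dict[str, RGB] = {
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "yellow_red": (255, 128, 0),
}

# Emotion -> evocative color, following the examples given in Section 6.1.
# Sadness uses the yellow-red shade (an assumption; plain yellow is also
# mentioned). Neutral has no associated color in the text.
EMOTION_TO_COLOR: Dict[str, str] = {
    "Fear": "green",
    "Anger": "blue",
    "Happiness": "yellow",
    "Surprise": "yellow",
    "Sadness": "yellow_red",
}

# Per-category validation accuracies reported in Section 5.2.
# Fear is omitted because no figure is given for it in the text.
CLASS_ACCURACY: Dict[str, float] = {
    "Happiness": 0.86,
    "Sadness": 0.86,
    "Neutral": 0.83,
    "Surprise": 0.79,
    "Anger": 0.10,
}

# Only categories recognized with accuracy above this threshold drive the lights.
MIN_ACCURACY = 0.5


def infer_rgb(emotion: str,
              class_accuracy: Dict[str, float] = CLASS_ACCURACY) -> Optional[RGB]:
    """Return an RGB triple for the emotion, or None to leave the lights unchanged."""
    # Categories with unknown or low accuracy are treated as unreliable.
    if class_accuracy.get(emotion, 0.0) <= MIN_ACCURACY:
        return None
    color = EMOTION_TO_COLOR.get(emotion)
    return COLOR_TO_RGB[color] if color is not None else None


if __name__ == "__main__":
    for label in ("Happiness", "Sadness", "Anger", "Neutral"):
        print(label, infer_rgb(label))
```

Returning `None` for unreliable or unmapped categories is one possible design choice: it keeps the current lighting instead of reacting to a prediction the classifier is likely to get wrong.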

In the future, the proposed strategy can be developed and validated on a social robot like Pepper from SoftBank Robotics [ 22 ], by controlling the colors of its body LEDs. They are placed in the chest, eyes, shoulders and ears, allowing for a more friendly and engaging interaction.

Another possible direction for future work is to explore how this empathic model can become adaptive to user preferences, so that the robot can learn the impact of the different colors on a particular user, depending on the user's previous reactions.

7.1. Ethical Impacts

The system presented in this paper has a wide field of application. Its aim is to detect and point out a person's emotion in a very simple and informative way through colors. Such systems can be useful when the environment needs to adapt intelligently to the user, e.g. by changing the light color in response to people's mood in a room full of music stimuli. But we also acknowledge the possibility of misuse. Indeed, emotions are a fundamental part of people's lives and are thus private. Advancing the state of the art in this field also means exposing every human's inner self to the world. As for most research, the pros and cons must be weighed and evaluated considering every possible use case scenario. We believe that our work does not hold enough practical ground to be misused by third parties, but we are still concerned about this possibility.

[1] Russo, Illari, Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46-53.

[2] Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of personality and social psychology 17 (1971) 124.

[3] N. A. Nijdam, Mapping emotion to color, Book Mapping emotion to color (2009) 2-9.

[4] H.-W. Ng, V. D. Nguyen, Vonikakis, Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: Proceedings of the 2015 ACM on international conference on multimodal interaction, 2015, pp. 443-449.

[5] Akhand, Roy, Siddique, M. A. S. Kamal, T. Shimamura, Facial emotion recognition using transfer learning in the deep cnn, Electronics 10 (2021) 1036.

[6] Dhall, Goecke, Lucey, T. Gedeon, Acted facial expressions in the wild database, Australian National University, Canberra, Australia, Technical Report TR-CS-11 2 (2011) 1.

[7] Jung, Lee, Yim, Park, J. Kim, Joint finetuning in deep neural networks for facial expression recognition, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2983-2991.

[8] Haddad, Lézoray, Hamel, 3d-cnn for facial emotion recognition in videos, in: International Symposium on Visual Computing, Springer, 2020, pp. 298-309.

[9] W.-S. Chu, De la Torre, J. F. Cohn, Learning spatial and temporal cues for multi-label facial action unit detection, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 25-32.

[10] Fan, Lu, Li, Liu, Video-based emotion recognition using cnn-rnn and c3d hybrid networks, in: Proceedings of the 18th ACM international conference on multimodal interaction, 2016, pp. 445-450.

[11] D. H. Kim, W. J. Baddar, Jang, Y. M. Ro, Multiobjective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing 10 (2017) 223-236.

[12] Jimenez, Ando, Kanoh, T. Nakamura, Psychological effects of a synchronously reliant agent on human beings, Journal of Advanced Computational Intelligence Vol 17 (2013).

[13] Feldmaier, Marmat, Kuhn, Diepold, Evaluation of a rgb-led-based emotion display for affective agents, arXiv preprint arXiv:1612.07303 (2016).

[14] Avanzato, Beritelli, Russo, Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41-45.

[15] Russo, Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41-47.

[16] Lucey, J. F. Cohn, Kanade, Saragih, Ambadar, I. Matthews, The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression, in: 2010 ieee computer society conference on computer vision and pattern recognition-workshops, IEEE, 2010, pp. 94-101.

[17] Leite, Castellano, Pereira, Martinho, Paiva, Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings, in: Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, 2012, pp. 367-374.

[18] Napoli, Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, 2013, pp. 529-534. doi:10.1109/CISIS.2013.96.

[19] Nalin, Bergamini, Giusti, I. Baroni, Sanna, Children's perception of a robotic companion in a mildly constrained setting, in: IEEE/ACM human-robot interaction 2011 conference (robots with children workshop) proceedings, Citeseer, 2011.

[20] Wozniak, Polap, G. Borowik, Napoli, A first attempt to cloud-based user verification in distributed system, in: 2015 Asia-Pacific Conference on Computer Aided System Engineering, IEEE, 2015, pp. 226-231.

[21] Kaya, H. H. Epps, Relationship between color and emotion: A study of college students, College student journal 38 (2004) 396-405.

[22] A. K. Pandey, R. Gelin, A mass-produced sociable humanoid robot: Pepper: The first machine of its kind, IEEE Robotics & Automation Magazine 25 (2018) 40-48.