Automatic RGB Inference Based on Facial Emotion Recognition

Nicolo' Brandizzi (1), Valerio Bianco (1), Giulia Castro (1), Samuele Russo (2) and Agata Wajda (3)

(1) Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
(2) Department of Psychology, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
(3) Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27–29, 2021, Catania, IT
Contacts: brandizzi@diag.uniroma1.it (N. Brandizzi); castro.1742813@studenti.uniroma1.it (G. Castro); samuele.russo@studenti.uniroma1.it (S. Russo); agata.wajda@polsl.pl (A. Wajda)
ORCID: 0000-0002-1846-9996 (S. Russo); 0000-0002-1667-3328 (A. Wajda)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In recent years, Facial Emotion Recognition (FER) has become one of the most promising and fastest-growing fields in computer vision and human-robot interaction. In this work, a deep learning neural network is introduced to address the problem of facial emotion recognition. In particular, a CNN+RNN architecture is designed to capture both the spatial features and the temporal dynamics of facial expressions. Experiments are performed on the CK+ dataset. Furthermore, we present a possible application of the proposed Facial Emotion Recognition system in human-robot interaction: a method for dynamically changing ambient light or LED colors based on the recognized emotion. Indeed, it has been shown that equipping robots with the ability to perceive emotions and to react with suitable empathic strategies significantly improves human-robot interaction. Possible application scenarios are education, healthcare and autism therapy, where such empathic strategies play a fundamental role.

Keywords
Facial Emotion Recognition, Human-Robot Interaction

1. Introduction

Facial Emotion Recognition (FER) is the process of identifying human feelings and emotions from facial expressions. Nowadays, automatic emotion recognition plays a key role in a wide range of applications, with particular interest in cognitive science [1], human-robot and human-computer interaction. In human-robot interaction, for example, the ability to recognize intentions and emotions, and to react accordingly to the particular motivational state of the user, is crucial for making the interaction more friendly and natural, improving both the usability and the acceptability of the new technology.

Ekman [2] defined a set of six universal emotions: anger, disgust, fear, happiness, sadness, and surprise, which can be recognized and described regardless of culture and context by means of a set of Action Units (AUs), listed in table 1. Each AU is the action of a facial muscle that is typically activated when a given facial expression is produced. For example, AU number 1, which corresponds to "Inner Brow Raiser", typically appears, together with a jaw drop, when people show a surprised emotional state. Facial emotion recognition is a challenging task because of inter-class similarity problems. Different people can show emotions in their own personal way and with different levels of intensity, which makes the problem particularly hard. On the other hand, different motivational states may produce very similar features and very similar facial expressions.

In this work, a standard CNN-RNN architecture is used to learn both the spatial and the temporal cues of human facial expressions, which build up gradually over time. A method to express emotions by dynamically changing the RGB components of the ambient light according to the recognized emotional state is also proposed. Many studies have investigated the relationship between colors and emotions, and in particular how a given color can evoke positive feelings in the observer. Depending on the recognized human emotion, a different ambient light color is therefore set, following Nijdam's color-emotion mapping [3].

The remainder of this paper is structured as follows. Section 2 analyzes the existing literature and discusses the state of the art in facial emotion recognition. Section 3 describes the dataset and the data refinement process. Section 4 formalizes the problem and explains the proposed CNN+RNN model architecture. Section 5 presents the experiments, the implementation details, and the achieved results. Section 6 proposes a real-time human-robot interaction application that can benefit from the proposed Facial Emotion Recognition system. Finally, Section 7 discusses conclusions and future work.

Table 1
Facial Action Unit (AU) codes and corresponding descriptions.

AU  Description           AU  Description            AU  Description
1   Inner Brow Raiser     13  Cheek Puller           25  Lips Part
2   Outer Brow Raiser     14  Dimpler                26  Jaw Drop
4   Brow Lowerer          15  Lip Corner Depressor   27  Mouth Stretch
5   Upper Lid Raiser      16  Lower Lip Depressor    28  Lip Suck
6   Cheek Raiser          17  Chin Raiser            29  Jaw Thrust
7   Lid Tightener         18  Lip Puckerer           31  Jaw Clencher
9   Nose Wrinkler         20  Lip Stretcher          34  Cheek Puff
10  Upper Lip Raiser      21  Neck Tightener         38  Nostril Dilator
11  Nasolabial Deepener   23  Lip Tightener          39  Nostril Compressor
12  Lip Corner Puller     24  Lip Pressor            43  Eyes Closed
2. Related Work

Several techniques and deep learning models have been investigated over the last decade to address the problem of facial analysis and emotion recognition from RGB images and videos. Most of them use Convolutional Neural Networks (CNNs) to extract geometric features from facial landmark points. In order to train such high-capacity classifiers with the very small FER datasets that are available, one of the most common approaches is transfer learning. The works in [4, 5] use pre-trained models such as VGG16 and AlexNet to initialize the weights of the CNN, which improves accuracy and reduces over-fitting. Nguyen et al. [4] introduced a two-stage supervised fine-tuning: a first fine-tuning stage is applied using auxiliary facial expression datasets, followed by a final fine-tuning on the target AFEW dataset [6]. In [5], a pre-trained VGG-16 model with redefined dense layers is used for FER, identifying essential and optional convolutional blocks in the fine-tuning step. During training, the selected blocks of the VGG-16 model are included step by step, instead of being trained all at once, to diminish the effect of the initial random weights.

Instead of analyzing static images independently, thus ignoring the temporal relations between the frames of a video, 3D CNNs have also been explored to extract spatio-temporal features, with outstanding results. In [7], 3D CNNs are used to model the appearance and motion of videos, learning spatial and temporal aspects of image sequences simultaneously. In [8], two deep networks are combined: a 3D CNN captures the temporal appearance of facial expressions, while a deep temporal geometry network extracts the geometrical behaviour of the facial landmark points.

Similarly, some recent works combine CNNs and RNNs, which are capable of keeping track of arbitrarily long-term dependencies in the input sequences. In [9], multiple LSTM layers are stacked on top of CNNs; the temporal and spatial representations are then aggregated in a fusion network to produce per-frame predictions of 12 facial action units (AUs). Fan et al. [10] propose a hybrid network combining a CNN-features-based spatio-temporal RNN model with a 3D Convolutional Neural Network (C3D), also including audio features in order to maximize prediction accuracy. To deal with expression variations and intra-class variations, namely intensity and subject identity variations, [11] introduces objective functions on the CNN that improve the expression-class separability of the spatial feature representation and minimize the intra-class variation within the same expression class. Differently from previous works, which first apply CNN architectures or pre-trained image classifiers as visual feature extractors and then train the RNNs separately on the extracted spatial features, the proposed network analyzes the spatio-temporal behaviour of facial emotions with an end-to-end trainable and computationally efficient CNN+RNN architecture.

Several experiments have also been conducted to show how a facial emotion recognition system can improve the performance and the potential of human-robot interaction. Jimenez et al. [12] show how modulating colored lights based on the user's emotion can constitute a real communication channel between humans and robots; they introduce a self-sufficient model system that recognizes and empathizes with human emotions using colored lights on a robot's face. Feldmaier et al. [13] also show the effectiveness of displaying color combinations and color patterns in affective agents, adjusting intensity, brightness and frequency to obtain a psychological influence on the user. A similar strategy is adopted in this work as well.

Many other works have recently been published in the field of emotion recognition, face detection, and related classification tasks [14, 15].
3. Dataset

For training, validating and testing the Facial Emotion Recognition model, the Extended Cohn-Kanade dataset (CK+) [16] was used. It contains 593 sequences across 123 subjects, and each sequence contains images from onset (neutral frame) to peak expression (last frame). The image sequences vary in duration from a minimum of 10 to a maximum of 60 frames. Images have frontal and 30-degree views and were digitized into either 640x490 or 640x480 pixel arrays with 8-bit gray-scale or 24-bit color values. Each image sequence is labelled with a combination of Action Units; the complete list of possible AUs is reported in table 1.

For each data sequence, if the list of action units is consistent with one of the basic emotion categories among Anger, Disgust, Fear, Happiness, Sadness, Surprise and Contempt, a nominal emotion label is associated with the sequence. To this aim, table 2 shows the complete mapping between Ekman's six basic emotions and AUs. Note that only the six basic emotions are considered in the proposed FER model and reported in table 2; the Contempt category, which is very similar to the Disgust class and does not belong to the basic emotions group, is excluded. As a result of this selection process, only 296 of the 593 sequences fit the prototypic definitions and meet the criteria for one of the six discrete emotions. The prototype definitions used for translating AU scores into emotion terms are shown in Table 6 in the Appendix; these emotion prediction rules were applied very strictly.

Table 2
Ekman's six basic emotions described in terms of facial Action Units. With reference to table 1: 1: Inner Brow Raiser, 2: Outer Brow Raiser, 4: Brow Lowerer, 7: Lid Tightener, 9: Nose Wrinkler, 10: Upper Lip Raiser, 12: Lip Corner Puller, 15: Lip Corner Depressor, 17: Chin Raiser, 20: Lip Stretcher, 24: Lip Pressor, 25: Lips Part, 26: Jaw Drop.

Emotional state   Action Units
Anger             4, 7, 24
Disgust           9, 10, 17
Fear              1, 4, 20, 25
Happiness         12, 25
Sadness           4, 15
Surprise          1, 2, 25, 26

In order to maximize the amount of data, which would otherwise be too scarce for training the model, another 35% of the dataset was hand-labelled using not only the Prototypes but also their Major Variants, as listed in the emotion prediction rules of Table 6. Compared with the Prototypes, the variants allow for subsets of AUs; they are therefore less strict definitions, but still truly representative of the given emotion. As a result, a total of 511 video sequences were collected by also including the major variant definitions in the conversion rules. Neutral facial expressions were included as well, for a total of 7 emotion categories: Neutral (Neu), Anger (Ang), Disgust (Dis), Fear (Fea), Happiness (Hap), Sadness (Sad) and Surprise (Sur). Notice that, even though the dataset grows in size, it remains unbalanced: some categories, such as surprise and happiness, are more represented than the others. The final distribution of data samples among the seven categories is reported in table 3.

Table 3
Final number of video samples per category in the dataset.

Emotion label   Ang   Neu   Dis   Fea   Hap   Sad   Sur
#Videos          51    52    70    61   108    82    87

The minimum length of the data samples in the dataset is 10 frames. For this reason, in order to maximize the number of input sequences and collect the largest amount of training data, the sequence length, i.e. the number of frames per video, is set to 10. For longer sequences, only the last 10 frames are considered, to ensure that the apex (peak) frame, which is the most representative of the given emotion, is captured. The pixel values of each frame are rescaled so that each pixel ∈ [0, 1]. A central crop is then applied to each frame in order to focus on the human face. After resizing and cropping, the final dimension of each input sequence is

(n_frames, width, height, n_channels) = (10, 48, 48, 3).
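The preprocessing just described can be condensed into a short sketch. The snippet below is a minimal illustration, assuming the raw frames of one CK+ sequence are available as a NumPy array of shape (T, H, W, 3) with 8-bit values; the crop fraction and the helper name `preprocess_sequence` are illustrative choices, not taken from the paper.

```python
import numpy as np
import tensorflow as tf

def preprocess_sequence(frames: np.ndarray,
                        n_frames: int = 10,
                        size: int = 48,
                        crop_fraction: float = 0.8) -> np.ndarray:
    """Turn a raw (T, H, W, 3) uint8 sequence into a (10, 48, 48, 3) float32 tensor in [0, 1]."""
    # Keep only the last n_frames so the apex (peak) frame is always included.
    frames = frames[-n_frames:]
    # Rescale pixel values to [0, 1].
    frames = frames.astype(np.float32) / 255.0
    # Central crop to focus on the face (crop_fraction is an assumed value).
    h, w = frames.shape[1], frames.shape[2]
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    top, left = (h - ch) // 2, (w - cw) // 2
    frames = frames[:, top:top + ch, left:left + cw, :]
    # Resize every frame to the model input resolution.
    frames = tf.image.resize(frames, (size, size)).numpy()
    return frames  # shape: (n_frames, size, size, 3)
```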
4. Model Architecture

The emotion recognition task is modelled as a multi-class classification problem over a set of 7 categories, Y = {Anger, Neutral, Disgust, Fear, Happiness, Sadness, Surprise}.

As shown in Section 2, Recurrent Neural Networks (RNNs) in combination with Convolutional Neural Networks (CNNs) and 3D Convolutional Neural Networks (C3D) provide powerful results when dealing with sequential image data. Following this approach, a standard, computationally efficient CNN+RNN architecture is proposed here. From the input sequence of frames, spatial feature representations are learned by a Convolutional Neural Network (CNN). To capture the facial expression dynamics, the temporal feature representation of the facial expression is learned by a Recurrent Neural Network (RNN). Both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells have been tested on this problem and provide comparable results; however, GRUs have a simpler structure and are computationally more efficient. Given also that the model does not have a huge amount of data, GRUs are preferred to obtain good accuracy. Moreover, the 2D CNN + GRU approach allows for an end-to-end trainable model which is computationally less expensive than the alternatives: the whole model counts a very small number of parameters, around 8k, which makes it very light.

The model takes as input a window of 10 RGB frames of size 48x48, and data are processed in batches of 24 image sequences. The full architecture of the proposed model is shown in Figure 1. It consists of a four-layer 2D CNN: the first two convolutional layers have a 5x5 kernel size, while the last two have a 3x3 kernel size, and all 2D max-pooling layers have a 2x2 kernel. To extract the temporal correlations in the extracted features, the CNN output is fed directly into a GRU with 8 units, which is also its output dimension. Finally, the output layer is a Dense (fully connected) layer with a Softmax activation function. The Softmax layer has as many nodes as there are emotion categories, so that it assigns to each category a probability, with all probabilities summing to 1. For each value z_i produced by the neurons of the output layer, the per-category probability is computed as

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)    (1)

so that the probabilities always sum to 1 and a single emotion label is selected.

Figure 1: Proposed model architecture for the Facial Emotion Recognition system, which takes as input a 10-frame sequence and outputs a single emotion label.
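A minimal Keras sketch of this architecture is given below, only to make the layer layout concrete. The paper does not report the number of convolutional filters or the exact pooling/flattening step between the CNN and the GRU, so the filter counts, the global average pooling and the function name `build_model` are assumptions; the kernel sizes (5x5, 5x5, 3x3, 3x3), the 2x2 pooling, the 8-unit GRU and the 7-way softmax follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_frames: int = 10, size: int = 48, n_classes: int = 7) -> tf.keras.Model:
    """CNN+GRU sketch: a small per-frame 2D CNN followed by an 8-unit GRU and a softmax head."""
    # Per-frame feature extractor (filter counts are placeholders, not reported in the paper).
    frame = layers.Input(shape=(size, size, 3))
    x = layers.Conv2D(8, 5, padding="same", activation="relu")(frame)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 5, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)          # one small feature vector per frame
    cnn = models.Model(frame, x, name="frame_cnn")

    # Apply the CNN to every frame, then model the temporal dynamics with a GRU.
    clip = layers.Input(shape=(n_frames, size, size, 3))
    seq = layers.TimeDistributed(cnn)(clip)          # (n_frames, feature_dim)
    h = layers.GRU(8)(seq)                           # 8-unit GRU, as in the paper
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(clip, out, name="cnn_gru_fer")
```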
5. Experiments

5.1. Implementation Details

To verify the effectiveness of the proposed model, experiments were conducted on the modified CK+ dataset described in Section 3. For efficient data parsing, the input sequences of 10 frames are first converted into TFRecord files; the TensorFlow TFRecordDataset class is then used to standardize the data and generate batches of 24 image sequences to be fed to the model. 80% of the data samples are used for training, while the remaining 20% form the test set.

The model is trained with the Adam optimizer and a learning rate of 10^-3. Since this is a multi-class classification problem, in which each input sample can belong to exactly one of the possible categories, a Categorical Cross-Entropy loss is used. Because the dataset is unbalanced, Categorical Accuracy alone is not enough to properly evaluate the model, so the Precision and Recall metrics are also considered. To prevent over-fitting, the L1 and L2 regularizers are set to 0.01 and dropout to 0.4. The model was trained for a total of 400 epochs, with a mean training time of only 4 seconds per epoch.
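The sketch below shows how such a training setup could look with tf.data and Keras. The TFRecord feature layout, the shuffle buffer size and the function names (`parse_example`, `make_dataset`, `compile_and_train`) are assumptions made only for illustration; the batch size, optimizer, learning rate, loss, metrics and number of epochs follow the values reported above (the L1/L2 regularization and dropout would be attached to the model's layers).

```python
import tensorflow as tf

# Assumed TFRecord layout: one serialized float32 tensor of frames plus an integer label.
FEATURE_SPEC = {
    "frames": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(record: tf.Tensor):
    ex = tf.io.parse_single_example(record, FEATURE_SPEC)
    frames = tf.reshape(tf.io.parse_tensor(ex["frames"], tf.float32), (10, 48, 48, 3))
    label = tf.one_hot(ex["label"], depth=7)
    return frames, label

def make_dataset(tfrecord_path: str, batch_size: int = 24, training: bool = True) -> tf.data.Dataset:
    ds = tf.data.TFRecordDataset(tfrecord_path).map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.shuffle(512)  # buffer size is an arbitrary choice
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

def compile_and_train(model: tf.keras.Model, train_path: str, val_path: str, epochs: int = 400):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",
        metrics=["categorical_accuracy",
                 tf.keras.metrics.Precision(name="precision"),
                 tf.keras.metrics.Recall(name="recall")],
    )
    return model.fit(make_dataset(train_path),
                     validation_data=make_dataset(val_path, training=False),
                     epochs=epochs)
```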
5.2. Results analysis

In this section the experimental results are presented. Since the dataset is not balanced, precision and recall are considered, together with the accuracy score, to evaluate the performance of the model. At the best validation epoch, the model achieves a mean validation accuracy of 66%, a mean Precision of 66% and a mean Recall of 65% across categories, as summarized in table 5.

Table 5
Evaluation metrics for the model: mean validation accuracy, precision and recall across categories.

Validation Accuracy    66%
Validation Precision   66%
Validation Recall      65%

Table 4 shows the per-category validation accuracy rates.

Table 4
Per-category validation accuracy.

Emotion label    Ang   Neu   Dis   Fea   Hap   Sad   Sur
Val Accuracy     10%   83%   50%   30%   86%   86%   79%

As can be noticed, the mean validation accuracy across categories is not very high: the per-category accuracy reaches very high scores for some emotion classes and very low values for others, meaning that the model almost fails when trying to recognize the anger and fear emotions, while it behaves very well on all the other categories. In particular, the model is very good at recognizing Happy and Sad facial expressions, with 86% accuracy; high accuracy scores are also obtained for the Neutral (83%) and Surprise (79%) categories. As a result, even if the model performs very well on most of the emotions, the difficulty in learning some specific facial expressions, in particular Anger with only 10% accuracy, lowers the mean accuracy score across categories.

To better understand the model, the confusion matrix over the CK+ validation set is reported in Figure 2. The main diagonal is highlighted by high rates of correctly classified samples; however, some off-diagonal elements reflect samples mislabeled by the classifier.

Figure 2: Confusion matrix over the CK+ validation set. On the axes, the emotion categories are ordered as {1: Anger, 2: Neutral, 3: Disgust, 4: Fear, 5: Happy, 6: Sad, 7: Surprise}.

In particular, the prediction errors mainly concern the Anger and Fear categories, which appear to be harder to learn and recognize. As shown in the confusion matrix, Angry facial expressions are frequently confused with Sad ones, which actually look very similar. Indeed, looking at some data examples, as shown in figure 3, it would be quite difficult even for humans to infer the correct label. This is because both Sadness and Anger are characterized by very similar features, such as lowered eyebrows and a tight mouth, as also highlighted in table 2, which describes the emotional states in terms of facial Action Units. As a consequence, even if Sad facial expressions typically show much more lowered lip corners, more subtle and shy performances of these emotions may easily be confused, due to inter-class similarity and variations in emotion intensity.

Figure 3: Left: apex (peak) frame of an Angry facial expression from the CK+ dataset. Right: apex frame of a Sad facial expression performed by a different subject in the CK+ dataset.

In an analogous way, Fear is often wrongly classified as Happiness, since the two show a very similar behaviour of the lips and share action unit 25, Lips Part, as reported in table 2. As shown in the top row of figure 4, a fear facial expression is performed with the jaw dropped open and the lips stretched horizontally. Looking at figure 4, for both the Fear and Happiness expressions (top and bottom rows respectively) the lips are closed at the onset (first frame) and then gradually move apart until they are fully parted in the apex frame, showing a very similar temporal behaviour which makes it difficult to distinguish between the two.

Figure 4: Top row: Fear data sample from the CK+ dataset; 4 of the 10 frames are shown to illustrate the temporal evolution of the facial expression. Bottom row: Happiness emotion evolving over time, performed by the same subject.

In summary, the network can precisely recognize the Happiness, Sadness, Surprise and Neutral emotional states, while it is not as reliable for Anger and Fear. Therefore, only the emotion categories with a validation accuracy above 50% will be used in the human-robot interaction application.
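A short sketch of how the per-category figures and the confusion matrix can be derived from the trained model is given below; the function name `per_class_report` and the use of argmax over the softmax outputs are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

LABELS = ["Anger", "Neutral", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"]

def per_class_report(model: tf.keras.Model, dataset: tf.data.Dataset):
    """Confusion matrix and per-category accuracy over a batched (frames, one-hot label) dataset."""
    y_true, y_pred = [], []
    for frames, labels in dataset:
        y_true.append(np.argmax(labels.numpy(), axis=1))
        y_pred.append(np.argmax(model.predict(frames, verbose=0), axis=1))
    y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)
    cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=len(LABELS)).numpy()
    # Per-category accuracy = correctly classified samples / samples of that category.
    per_class_acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    return cm, dict(zip(LABELS, per_class_acc.round(2)))
```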
6. Facial Emotion Recognition in human-robot interaction

In this section, a possible application of the proposed facial emotion recognition system in human-robot and human-computer interaction is presented. Several studies [17, 18, 19, 20] have shown that equipping a robot with the ability to perceive the user's feelings and emotions, and to react to them accordingly, can significantly improve the quality of the interaction and, more importantly, user acceptability. Affective communication for social robots has been deeply investigated in recent years, showing that, beyond social intelligence, emotional intelligence also plays a crucial role in successful interactions. In particular, the need to recognize emotions and properly introduce empathic strategies turns out to be fundamental in vulnerable scenarios such as education, healthcare, autism therapy and driving support.

6.1. Application

A very effective and powerful solution in human-robot interaction is to introduce an empathic strategy based on colored lights. Indeed, colors and emotions are closely linked: several physical and psychological studies have shown that playing with colors and dynamic lights can be very effective, since they can evoke feelings and emotions in human observers. For this reason, a method for dynamically changing the ambient light or LED colors based on the recognized emotion is presented. The aim is to establish a very intuitive and natural way of communicating emotions, and also to provide a positive influence on the user by means of evocative associations.

In order to use colors to stimulate a certain positive feeling, such as calm, energy or happiness, we first need a semantic mapping between colors and emotions. Plenty of experiments have been conducted to find a precise and reliable mapping, and they all provide nearly the same results in terms of color meaning. To achieve a simple and affordable model, the proposed method relies on Naz Kaya's research [21], whose results are summarized in figure 5.

Figure 5: Naz Kaya emotion-color mapping.

Once a good emotion-color mapping is found, an empathic strategy that selects a specific evocative color depending on the recognized emotion can be adopted. For example, when the robot perceives that the user is in trouble or is afraid, the lights can automatically switch to green or green shades in order to evoke a sense of peace and create a more comfortable environment; indeed, green is usually associated with the most positive emotions, such as hopefulness, peacefulness and satisfaction. The same applies to blue, which is typically associated with calm and relaxed emotional states and, like green, can be useful to counteract negative emotions such as anger. Studies have shown that yellow can evoke happy and joyful emotional states, therefore yellow lights can be set whenever the robot recognizes that the user is sad, or even happy or surprised, in order to emphasize and agree with these positive emotions. Strong colors such as red are usually associated with anger and aggressive, very intense emotions and, mixed with yellow shades, can evoke active, energetic and powerful motivational states; it is therefore reasonable to activate yellow-red shades when the user is recognized to be sad and demotivated.
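A sketch of how the recognized label could drive an RGB light is shown below. The specific RGB triples, the `EMOTION_TO_RGB` table, the reliability filter and the function names are illustrative assumptions that only loosely follow the green/blue/yellow/yellow-red associations discussed above; they are not values prescribed by the paper or by the Nijdam/Kaya mappings.

```python
from typing import Callable, Dict, Optional, Tuple

RGB = Tuple[int, int, int]

# Only reliably recognized categories are mapped (low-accuracy ones such as Anger and Fear are skipped).
# The color values are rough illustrations of the associations described in the text.
EMOTION_TO_RGB: Dict[str, RGB] = {
    "Happiness": (255, 220, 0),    # yellow: emphasize joy
    "Surprise":  (255, 200, 40),   # warm yellow
    "Sadness":   (255, 120, 0),    # yellow-red: energize a demotivated user
    "Neutral":   (80, 130, 255),   # soft blue: calm, relaxed baseline (assumed choice)
}

def color_for_emotion(label: str) -> Optional[RGB]:
    """Return the RGB triple to display for a recognized emotion, or None to keep the current light."""
    return EMOTION_TO_RGB.get(label)

def update_ambient_light(label: str, set_led_rgb: Callable[[int, int, int], None]) -> None:
    """Push the mapped color to a LED controller callback (set_led_rgb is an assumed interface)."""
    rgb = color_for_emotion(label)
    if rgb is not None:
        set_led_rgb(*rgb)
```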
7. Conclusion

In this work, we address the problem of Facial Emotion Recognition by introducing an end-to-end trainable deep learning model which can reason on both the spatial and the temporal features of facial expressions. The results show that the proposed model can effectively learn to distinguish between most of the basic emotions. Furthermore, a method for setting the RGB light components according to the recognized user emotion is presented. In particular, an empathic strategy that exploits the proposed emotion recognition system to identify user emotions and accordingly adjust ambient colored lights can be used to improve the quality of human-robot interactions.

In the future, the proposed strategy can be developed and validated on a social robot like Pepper from SoftBank Robotics [22] by controlling the colors of its body LEDs, which are placed in the chest, eyes, shoulders and ears and allow for a more friendly and engaging interaction. Another possible direction for future work is to make this empathic model adaptive to the user's preferences, so that the robot can learn the impact of the different colors on a particular user based on their previous reactions.

7.1. Ethical Impacts

The system presented in this paper has a wide field of application. Its aim is to detect and point out a person's emotion in a very simple and informative way through colors. Such systems can be useful when the environment needs to adapt intelligently to the user, e.g. changing the light color in response to people's mood in a room full of musical stimuli. But we also acknowledge the possibility of misuse. Emotions are a fundamental part of people's lives and are therefore private; advancing the state of the art in this field also means exposing every human's inner self to the world. As for most research, the pros and cons must be weighed and evaluated considering every possible use case scenario. We believe that our work does not hold enough practical ground to be misused by third actors, but we remain concerned with this possibility.

Appendix

Table 6
Emotion prediction: conversion rules for translating AU scores into emotions [16]. Note: * means that in this combination the AU may be at any level of intensity.

Emotion     Prototypes                             Major Variants
Surprise    1+2+5B+26                              1+2+5B
            1+2+5B+27                              1+2+26
                                                   1+2+27
                                                   5B+26
                                                   5B+27
Fear        1+2+4+5*+20*+25                        1+2+4+5*+L or R20*+25, 26, or 27
            1+2+4+5*+25                            1+2+4+5*
                                                   1+2+5Z, with or without 25, 26, 27
                                                   5*+20*, with or without 25, 26, 27
Happiness   6+12*                                  12C/D
Sadness     1+4+11+15B, with or without 54+64      1+4+11, with or without 54+64
            1+4+15*, with or without 54+64         1+4+15B, with or without 54+64
            6+15*, with or without 54+64           1+4+15B+17, with or without 54+64
                                                   11+17
                                                   (25 or 26 may occur with all prototypes or major variants)
Disgust     9                                      9+16+15, 26
            9+17                                   10*+16+25, 26
            10*
            10+17
Anger       4+5*+7+10*+22+23+25, 26                4+5*+7+10*+23+25, 26
            4+5*+7+23+25, 26                       4+5*+7+17+23
            4+5*+7+17+24                           4+5*+7+23
                                                   4+5*+7+24

References

[1] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[2] P. Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology 17 (1971) 124.
[3] N. A. Nijdam, Mapping emotion to color, 2009, pp. 2–9.
[4] H.-W. Ng, V. D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, pp. 443–449.
[5] M. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, T. Shimamura, Facial emotion recognition using transfer learning in the deep CNN, Electronics 10 (2021) 1036.
[6] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Acted facial expressions in the wild database, Australian National University, Canberra, Australia, Technical Report TR-CS-11 2 (2011) 1.
[7] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural networks for facial expression recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2983–2991.
[8] J. Haddad, O. Lézoray, P. Hamel, 3D-CNN for facial emotion recognition in videos, in: International Symposium on Visual Computing, Springer, 2020, pp. 298–309.
[9] W.-S. Chu, F. De la Torre, J. F. Cohn, Learning spatial and temporal cues for multi-label facial action unit detection, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 25–32.
[10] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition using CNN-RNN and C3D hybrid networks, in: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 445–450.
[11] D. H. Kim, W. J. Baddar, J. Jang, Y. M. Ro, Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing 10 (2017) 223–236.
[12] F. Jimenez, T. Ando, M. Kanoh, T. Nakamura, Psychological effects of a synchronously reliant agent on human beings, Journal of Advanced Computational Intelligence 17 (2013).
[13] J. Feldmaier, T. Marmat, J. Kuhn, K. Diepold, Evaluation of a RGB-LED-based emotion display for affective agents, arXiv preprint arXiv:1612.07303 (2016).
[14] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, YOLOv3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[15] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, pp. 41–47.
[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2010, pp. 94–101.
[17] I. Leite, G. Castellano, A. Pereira, C. Martinho, A. Paiva, Modelling empathic behaviour in a robotic game companion for children: an ethnographic study in real-world settings, in: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 367–374.
[18] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[19] M. Nalin, L. Bergamini, A. Giusti, I. Baroni, A. Sanna, Children's perception of a robotic companion in a mildly constrained setting, in: IEEE/ACM Human-Robot Interaction 2011 Conference (Robots with Children Workshop) Proceedings, Citeseer, 2011.
[20] M. Wozniak, D. Polap, G. Borowik, C. Napoli, A first attempt to cloud-based user verification in distributed systems, in: 2015 Asia-Pacific Conference on Computer Aided System Engineering, IEEE, 2015, pp. 226–231.
[21] N. Kaya, H. H. Epps, Relationship between color and emotion: A study of college students, College Student Journal 38 (2004) 396–405.
[22] A. K. Pandey, R. Gelin, A mass-produced sociable humanoid robot: Pepper: The first machine of its kind, IEEE Robotics & Automation Magazine 25 (2018) 40–48.