<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Aiuti</string-name>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Education, Roma Tre University</institution>
          ,
          <addr-line>Viale del Castro Pretorio 20, 00185 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Engineering, Roma Tre University</institution>
          ,
          <addr-line>Via della Vasca Navale 79, 00146 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Personalized systems are becoming more and more popular in everyday life. Their goal is to adapt their output to the characteristics (i.e., interests and preferences) of the active user. To achieve this, a process for inferring those characteristics is needed. In this paper, we verify whether significant correlations exist between the facial micro-expressions of individuals and their emotional state. If so, we could monitor the user while she enjoys a certain visual stimulus in order to understand her emotional response. For example, we could comprehend whether a visitor to a museum or an exhibition likes or dislikes the object she is observing, thus deriving her interests and tastes, regardless of the reality from which she comes. This could foster the role of the museum/exhibition as a vehicle of aggregation for a broad range of users, thus favoring their cultural and social inclusion. It could also allow us to design and realize recommender systems for enhancing the experience of users who have difficulty in explicitly expressing their interests, such as people belonging to vulnerable groups (e.g., elderly people, children, people with disabilities) or to different cultures. Although the sample analyzed is limited and concerns a specific context (i.e., music video clips), the experimental results have been encouraging, thus spurring us to carry on with our research activities.</p>
      </abstract>
      <kwd-group>
        <kwd>User interfaces</kwd>
        <kwd>Computer vision</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Museum visitors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>Nowadays, technology accompanies and often affects our everyday life [1], as also witnessed by the ever-growing attention that user modeling and personalization receive from the research community [2]. For example, Machine Learning models and methods [3] (e.g., Deep Learning [4]) allow the realization of increasingly effective customized systems [5]. Among these, there exist also systems capable of integrating user profiles with additional information about their personality [6], their emotional state [7], as well as the temporal dynamics [8] and the actual nature [9] of their interests. Even the user’s activities on the Web can be considered in the user modeling process [10]. This work concerns the application of Computer Vision techniques [11] to the analysis of a user’s facial micro-expressions while viewing video sequences, in order to identify her emotional state. Action units (AUs) are the individual components of muscle movement into which facial expressions can be broken down and which, in certain combinations, allow us to analyze a person’s emotional state. The study of action units originated with the research work by Ekman and Friesen, who proposed the Facial Action Coding System (FACS) [12].</p>
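      <p>To make the idea concrete, the following minimal sketch encodes a few prototypical AU sets in the style of the commonly cited EMFACS groupings; the specific combinations and names here are illustrative assumptions, not part of the approach described in this paper.</p>
      <preformat>
# Minimal sketch: prototypical AU combinations for some basic emotions,
# in the style of the commonly cited EMFACS groupings (illustrative only).
EMOTION_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid movements + lip tightener
    "disgust":   {9, 15, 16},    # nose wrinkler + lip depressors
}

def match_emotions(active_aus):
    """Return the emotions whose prototypical AU set is fully active."""
    return [e for e, aus in EMOTION_AUS.items() if aus.issubset(active_aus)]

print(match_emotions({1, 2, 5, 6, 12, 26}))  # ['happiness', 'surprise']
      </preformat>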
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Initially, we wanted to carry out a live analysis with</title>
        <p>real users (e.g., see [17]). We intended to monitor the
user while viewing certain visual stimuli both through
the RGB video recording of her facial expressions and
the recording of her electroencephalogram (EEG) signal.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Unfortunately, the pandemic situation we are currently</title>
        <p>experiencing has not allowed us to proceed as planned.</p>
      </sec>
      <sec id="sec-2-3">
        <title>We, therefore, decided to use datasets that are publicly</title>
        <p>available online [18]. In particular, we chose the DEAP
dataset proposed by Koelstra et al. [19]. To collect the</p>
      </sec>
      <sec id="sec-2-4">
        <title>DEAP dataset, music videos were used as visual stimuli</title>
        <p>to arouse diferent emotions. One minute of each music
video was selected, more specifically, the one with the
highest level of solicitation. A frontal face video was
recorded for 22 participants while viewing those videos.
Participants evaluated each video in terms of valence
and arousal through the use of Self-Assessment Manikin
with values ranging from 1 to 9. Once the dataset was
obtained, it was necessary to process the videos obtained by
recording the facial expressions of the users while they
observed the music videos. There exist several facial
recognition software tools, some of which are free. For
this purpose, we used OpenFace1, an opensource toolkit
capable of capturing and analyzing the action units.
Initially, we considered the mean and standard deviation of
the action units for each user. Then, we calculated the
correlation between those values and their positions in
the Cartesian plane (see Figure 1). Considering the
analysis of the average values, as regards the action units alone,
the most significant correlations were found between
AU25 (lips part) and AU-17 (chin raiser) and between AU-01
(inner brow raiser) and AU-02 (outer brow raiser). On the
other hand, we did not find any noteworthy correlation
between the AUs and their positions in the quadrants.
So setting aside this analysis and the quadrant
classification, we decided to move on to the signal analysis of the
various action units. Figure 2 shows the activation and
variation of AU-12 (lip corner puller). We can see one
activation and increase in the intensity of the AU signal
when the user smiles and, therefore, the angle of his lips
varies. Specifically, we have extracted several features
of a higher order than the previous ones, which we have
classified into four main types, as proposed in [ 20]:
• Statistical features
• Discrete features
• Dynamic features
• Quantitative features</p>
      </sec>
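      <p>As a minimal sketch of the basic OpenFace processing described above, the snippet below loads the per-frame AU intensity signals produced by OpenFace’s FeatureExtraction tool and computes their mean and standard deviation; the CSV file name is a hypothetical placeholder, while column names such as AU12_r follow OpenFace’s output convention.</p>
      <preformat>
# Sketch: per-user mean and standard deviation of the AU intensity signals
# produced by OpenFace (the CSV path is a hypothetical placeholder).
import pandas as pd

df = pd.read_csv("user01_trial01.csv")
df.columns = df.columns.str.strip()          # OpenFace pads column names with spaces

au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
au_stats = df[au_cols].agg(["mean", "std"])  # one column per AU, e.g., AU12_r
print(au_stats)
      </preformat>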
      <sec id="sec-2-5">
        <title>The statistical features represent the first four statistical</title>
        <p>moments (mean, variance, skewness, and kurtosis) and
1https://github.com/TadasBaltrusaitis/OpenFace
were computed for each AU intensity signal obtained
from each facial recording. Then, we quantized the
intensity signal over time for each action unit. The
quantization was obtained through the k-means algorithm,
with four clusters, a value obtained through the Elbow
method. Figure 3 shows an example of a quantized signal.
From each quantized signal, we calculated the following</p>
      </sec>
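      <p>A minimal sketch of these two steps follows, assuming a NumPy array holding one AU intensity signal; the k-means implementation is scikit-learn’s, and all variable names are ours.</p>
      <preformat>
# Sketch: the four statistical moments of one AU intensity signal, plus its
# quantization into four levels via k-means (four clusters, per the Elbow method).
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.cluster import KMeans

def statistical_features(signal):
    return {
        "mean": np.mean(signal),
        "variance": np.var(signal),
        "skewness": skew(signal),
        "kurtosis": kurtosis(signal),
    }

def quantize(signal, n_levels=4):
    """Map each sample to one of n_levels intensity levels (0 = lowest)."""
    km = KMeans(n_clusters=n_levels, n_init=10).fit(signal.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())   # sort clusters by center
    level_of = {int(c): level for level, c in enumerate(order)}
    return np.array([level_of[int(c)] for c in km.labels_])

signal = np.abs(np.random.randn(1500))   # stand-in for a real AU intensity signal
print(statistical_features(signal))
print(quantize(signal)[:20])
      </preformat>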
      <sec id="sec-2-6">
        <title>For each quantized signal, we then generated a transition</title>
        <p>matrix that measures the number of transitions between
the diferent levels. From this matrix, for each action
unit, we extracted three dynamic features:
• Change ratio: Proportion of transitions between
levels;
• Slow change ratio: Proportion of slow transitions posed approach could be in museums [24] and cities [25],
(diference of one level); intended as open-air exhibitions, because it would allow
• Fast change ratio: Proportion of fast transitions the automatic detection of visitors with similar tastes and
(diference of two or more levels). interests [26], regardless of their social, demographic, and
cultural peculiarities. In this way, museums and cities
could represent possible inclusive places through the
sharing of common interests and preferences [27].</p>
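      <p>The following sketch shows one way to compute the transition matrix and the three ratios from a quantized signal; it reflects our reading of the definitions above, and all names are ours.</p>
      <preformat>
# Sketch: transition matrix between quantization levels and the three
# dynamic features derived from it.
import numpy as np

def dynamic_features(levels, n_levels=4):
    trans = np.zeros((n_levels, n_levels), dtype=int)
    for a, b in zip(levels[:-1], levels[1:]):
        trans[a, b] += 1                       # count each frame-to-frame step
    total = trans.sum()
    # Absolute level difference associated with each matrix cell.
    jump = np.abs(np.subtract.outer(np.arange(n_levels), np.arange(n_levels)))
    return {
        "change_ratio": trans[jump >= 1].sum() / total,
        "slow_change_ratio": trans[jump == 1].sum() / total,   # one-level steps
        "fast_change_ratio": trans[jump >= 2].sum() / total,   # two or more levels
    }

print(dynamic_features(np.array([0, 0, 1, 3, 3, 2, 0, 1, 1, 2])))
      </preformat>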
      <p>As a quantitative feature, we considered the ratio between the number of activations of each action unit and the number of frames (act/frame). The criterion with which we calculated the number of activations is that peaks occur when they have a value higher than the variance and a duration longer than three frames. Once all features were calculated, we determined the Pearson correlation matrix between the extracted features and the arousal and valence values. The resulting matrix is shown in Figure 4. Analyzing the correlation values between the extracted features, the three strongest correlations are between kurtosis and skewness, between activation ratio and activation level, and between change ratio and the number of activations per number of frames (act/frame). As for the correlation values between the extracted features and the valence values, it can be noted that the strongest (negative) correlations occur between valence and change ratio, and between valence and fast change ratio. As for the correlation values between the extracted features and the arousal values, the highest correlation values are with the number of activations per number of frames (act/frame), change ratio, and fast change ratio.</p>
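      <p>A sketch of the activation criterion and of the correlation step follows; the data are randomly generated stand-ins (not real DEAP recordings), and the thresholding is our reading of the criterion described above.</p>
      <preformat>
# Sketch: the act/frame quantitative feature (peaks above the signal variance
# lasting more than three frames) and a Pearson correlation with valence.
import numpy as np
from scipy.stats import pearsonr

def activations_per_frame(signal, min_len=3):
    above = signal > np.var(signal)        # frames whose value exceeds the variance
    count, run = 0, 0
    for flag in above:
        run = run + 1 if flag else 0
        if run == min_len + 1:             # run has just exceeded min_len frames
            count += 1
    return count / len(signal)

rng = np.random.default_rng(0)
act_frame = [activations_per_frame(np.abs(rng.standard_normal(1500)))
             for _ in range(22)]           # one value per participant
valence = rng.uniform(1, 9, size=22)       # stand-in SAM ratings
r, p = pearsonr(act_frame, valence)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
      </preformat>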
      <p>The obtained findings are interesting because they show the possibility of inferring information relating to a user’s emotional state by monitoring her facial expressions. From a practical point of view, this can make it possible to automatically derive the degree of appreciation that the user has towards the displayed object, without having to resort to long and annoying questionnaires. This information can be usefully exploited to design and develop recommender systems [21] capable of suggesting, for example, points of interest [22] and itineraries between them [23]. A possible application of the proposed approach could be in museums [24] and cities [25], intended as open-air exhibitions, because it would allow the automatic detection of visitors with similar tastes and interests [26], regardless of their social, demographic, and cultural peculiarities. In this way, museums and cities could represent possible inclusive places through the sharing of common interests and preferences [27].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusions and Future Works</title>
      <p>In this article, we have described an approach that consists in analyzing facial recordings of users subjected to visual stimuli to extract the action units relating to their facial micro-expressions. From these, we obtained some features and verified their correlation with the valence and arousal values. Although the results obtained are encouraging and suggest that there is indeed a significant correlation between the features extracted from the action units and the user’s emotional response, this study has some limitations. First of all, it was carried out on a small sample of users and video sequences. User reactions were collected using the Self-Assessment Manikin which, while having the advantage of simplifying the evaluation by the user, cannot be considered completely reliable. Furthermore, the users’ reactions were collected by showing them music video clips and not, for example, artworks. Among future possible developments, there is, therefore, certainly a live analysis on real users, showing them different visual stimuli and collecting their emotional responses more accurately, for example, also using the EEG signal [28]. Moreover, in our study, we only considered valence and arousal. Further developments could include, for example, the use of other emotional dimensions, such as likability and rewatch.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>[1]</label><mixed-citation>M. Gordon, Solitude and privacy: How technology is destroying our aloneness and why it matters, Technology in Society 68 (2022) 101858.</mixed-citation></ref>
      <ref id="ref2"><label>[2]</label><mixed-citation>G. D’Aniello, M. Gaeta, F. Orciuoli, G. Sansonetti, F. Sorgente, Knowledge-based smart city service system, Electronics (Switzerland) 9 (2020) 1–22.</mixed-citation></ref>
      <ref id="ref3"><label>[3]</label><mixed-citation>L. Vaccaro, G. Sansonetti, A. Micarelli, An empirical review of automated machine learning, Computers 10 (2021).</mixed-citation></ref>
      <ref id="ref4"><label>[4]</label><mixed-citation>G. Sansonetti, F. Gasparetti, G. D’Aniello, A. Micarelli, Unreliable users detection in social media: Deep learning techniques for automatic detection, IEEE Access 8 (2020) 213154–213167.</mixed-citation></ref>
      <ref id="ref5"><label>[5]</label><mixed-citation>H. A. M. Hassan, G. Sansonetti, F. Gasparetti, A. Micarelli, J. Beel, Bert, elmo, use and infersent sentence encoders: The panacea for research-paper recommendation?, in: M. Tkalcic, S. Pera (Eds.), Proceedings of ACM RecSys 2019 Late-Breaking Results, volume 2431, CEUR-WS.org, 2019, pp. 6–10.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>M. Onori, A. Micarelli, G. Sansonetti, A comparative analysis of personality-based music recommender systems, in: CEUR Workshop Proceedings, volume 1680, CEUR-WS.org, Aachen, Germany, 2016.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>M. Tkalcic, B. De Carolis, M. de Gemmis, A. Odic, A. Kosir, Introduction to emotions and personality in personalized systems, in: M. Tkalcic, B. D. Carolis, M. de Gemmis, A. Odic, A. Kosir (Eds.), Emotions and Personality in Personalized Services - Models, Evaluation and Applications, Springer, 2017.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>S. Caldarelli, D. F. Gurini, A. Micarelli, G. Sansonetti, A signal-based approach to news recommendation, in: CEUR Workshop Proceedings, volume 1618, CEUR-WS.org, Aachen, Germany, 2016, pp. 1–4.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>D. Feltoni Gurini, F. Gasparetti, A. Micarelli, G. Sansonetti, Temporal people-to-people recommendation on social networks with sentiment-based matrix factorization, Future Generation Computer Systems 78 (2018) 430–439.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>F. Gasparetti, A. Micarelli, G. Sansonetti, Exploiting web browsing activities for user needs identification, in: Proc. of the 2014 CSCI, volume 2, 2014.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>A. Micarelli, A. Neri, G. Sansonetti, A case-based approach to image recognition, in: Proceedings of the 5th European Workshop on Advances in Case-Based Reasoning, EWCBR ’00, Springer-Verlag, Berlin, Heidelberg, 2000, pp. 443–454.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>P. Ekman, W. V. Friesen, Facial action coding system, 1978.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>J. F. Cohn, K. Schmidt, R. Gross, P. Ekman, Individual differences in facial expression: stability over time, relation to self-reported emotion, and ability to inform person identification, in: Proc. of the 4th IEEE ICMI, IEEE Computer Society, USA, 2002.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>D. Kollias, S. Zafeiriou, Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework, CoRR abs/2103.15792 (2021).</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>S. Zhao, G. Jia, J. Yang, G. Ding, K. Keutzer, Emotion recognition from multiple modalities: Fundamentals and methodologies, IEEE Signal Processing Magazine 38 (2021) 59–73.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>J. A. Russell, A circumplex model of affect, Journal of Personality and Social Psychology 39 (1980).</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>S. Park, S. W. Lee, M. Whang, The analysis of emotion authenticity based on facial micromovements, Sensors 21 (2021).</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>Y.-H. Oh, J. See, A. C. Le Ngo, R. C. W. Phan, V. M. Baskaran, A survey of automatic facial micro-expression analysis: Databases, methods, and challenges, Frontiers in Psychology 9 (2018).</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras, Deap: A database for emotion analysis using physiological signals, IEEE TAC 3 (2012).</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>D. Hadar, T. Tron, D. Weinshall, Implicit media tagging and affect prediction from rgb-d video of spontaneous facial expressions, in: Proc. of the 12th IEEE Int. Conf. on Automatic Face &amp; Gesture Recognition, IEEE Computer Society, USA, 2017.</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>G. Sansonetti, F. Gasparetti, A. Micarelli, F. Cena, C. Gena, Enhancing cultural recommendations through social and linked open data, User Modeling and User-Adapted Interaction 29 (2019) 121–159.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>G. Sansonetti, Point of interest recommendation based on social and linked open data, Personal and Ubiquitous Computing 23 (2019) 199–214.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>A. Fogli, G. Sansonetti, Exploiting semantics for context-aware itinerary recommendation, Personal and Ubiquitous Computing 23 (2019) 215–231.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>A. Ferrato, C. Limongelli, M. Mezzini, G. Sansonetti, Using deep learning for collecting data about museum visitor behavior, Applied Sciences 12 (2022).</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>D. D’Agostino, F. Gasparetti, A. Micarelli, G. Sansonetti, A social context-aware recommender of itineraries between relevant points of interest, in: HCI International 2016, volume 618, Springer International Publishing, Cham, 2016, pp. 354–359.</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>F. Gasparetti, G. Sansonetti, A. Micarelli, Community detection in social recommender systems: a survey, Applied Intelligence 51 (2021) 3975–3995.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>K. Coffee, Cultural inclusion, exclusion and the formative roles of museums, Museum Management and Curatorship 23 (2008) 261–279.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>F. Galvão, S. M. Alarcão, M. J. Fonseca, Predicting exact valence and arousal values from eeg, Sensors 21 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>