<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Aiuti</string-name>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Education, Roma Tre University</institution>
          ,
          <addr-line>Viale del Castro Pretorio 20, 00185 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Engineering, Roma Tre University</institution>
          ,
          <addr-line>Via della Vasca Navale 79, 00146 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Personalized systems are becoming more and more popular in everyday life. Their goal is to adapt their output to the characteristics (i.e., interests and preferences) of the active user. To achieve this, a process for inferring those characteristics is needed. In this paper, we verify whether significant correlations exist between the facial micro-expressions of individuals and their emotional state. If so, we could monitor the user while she enjoys a certain visual stimulus in order to understand her emotional response. For example, we could comprehend whether a visitor to a museum or an exhibition likes or dislikes the object she is observing, thus deriving her interests and tastes, regardless of the reality from which she comes. This could foster the role of the museum/exhibition as a vehicle of aggregation for a broad range of users, thus favoring their cultural and social inclusion. It could also allow us to design and realize recommender systems for enhancing the experience of users who have difficulty in explicitly expressing their interests, such as people belonging to vulnerable groups (e.g., elderly people, children, people with disabilities) or to different cultures. Although the sample analyzed is limited and concerns a specific context (i.e., music video clips), the experimental results have been encouraging, thus spurring us to carry on with our research activities.</p>
      </abstract>
      <kwd-group>
        <kwd>User interfaces</kwd>
        <kwd>Computer vision</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Museum visitors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>Nowadays, technology accompanies and often affects our everyday life [1], as also witnessed by the ever-growing attention that user modeling and personalization receive from the research community [2]. For example, Machine Learning models and methods [3] (e.g., Deep Learning [4]) allow the realization of increasingly effective customized systems [5]. Among these, there exist also systems capable of integrating user profiles with additional information about their personality [6], their emotional state [7], as well as the temporal dynamics [8] and the actual nature [9] of their interests. Even the user’s activities on the Web can be considered in the user modeling process [10]. This work concerns the application of Computer Vision techniques [11] to the analysis of a user’s facial micro-expressions while viewing video sequences, in order to identify her emotional state. Action units (AUs) are the individual components of muscle movement into which facial expressions can be broken down and which, in certain combinations, allow us to analyze a person’s emotional state. The study of action units originated with the research work by Ekman and Friesen, who proposed the Facial Action Coding System (FACS) [12].</p>
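      <p>To make the idea concrete, the following minimal sketch encodes a few prototypical AU sets in the style of the commonly cited EMFACS groupings; the specific combinations and names here are illustrative assumptions, not part of the approach described in this paper.</p>
      <preformat>
# Minimal sketch: prototypical AU combinations for some basic emotions,
# in the style of the commonly cited EMFACS groupings (illustrative only).
EMOTION_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid movements + lip tightener
    "disgust":   {9, 15, 16},    # nose wrinkler + lip depressors
}

def match_emotions(active_aus):
    """Return the emotions whose prototypical AU set is fully active."""
    return [e for e, aus in EMOTION_AUS.items() if aus.issubset(active_aus)]

print(match_emotions({1, 2, 5, 6, 12, 26}))  # ['happiness', 'surprise']
      </preformat>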
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Initially, we wanted to carry out a live analysis with</title>
        <p>real users (e.g., see [17]). We intended to monitor the
user while viewing certain visual stimuli both through
the RGB video recording of her facial expressions and
the recording of her electroencephalogram (EEG) signal.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Unfortunately, the pandemic situation we are currently</title>
        <p>experiencing has not allowed us to proceed as planned.</p>
      </sec>
      <sec id="sec-2-3">
        <title>We, therefore, decided to use datasets that are publicly</title>
        <p>available online [18]. In particular, we chose the DEAP
dataset proposed by Koelstra et al. [19]. To collect the</p>
      </sec>
      <sec id="sec-2-4">
        <title>DEAP dataset, music videos were used as visual stimuli</title>
        <p>to arouse diferent emotions. One minute of each music
video was selected, more specifically, the one with the
highest level of solicitation. A frontal face video was
recorded for 22 participants while viewing those videos.
Participants evaluated each video in terms of valence
and arousal through the use of Self-Assessment Manikin
with values ranging from 1 to 9. Once the dataset was
obtained, it was necessary to process the videos obtained by
recording the facial expressions of the users while they
observed the music videos. There exist several facial
recognition software tools, some of which are free. For
this purpose, we used OpenFace1, an opensource toolkit
capable of capturing and analyzing the action units.
Initially, we considered the mean and standard deviation of
the action units for each user. Then, we calculated the
correlation between those values and their positions in
the Cartesian plane (see Figure 1). Considering the
analysis of the average values, as regards the action units alone,
the most significant correlations were found between
AU25 (lips part) and AU-17 (chin raiser) and between AU-01
(inner brow raiser) and AU-02 (outer brow raiser). On the
other hand, we did not find any noteworthy correlation
between the AUs and their positions in the quadrants.
So setting aside this analysis and the quadrant
classification, we decided to move on to the signal analysis of the
various action units. Figure 2 shows the activation and
variation of AU-12 (lip corner puller). We can see one
activation and increase in the intensity of the AU signal
when the user smiles and, therefore, the angle of his lips
varies. Specifically, we have extracted several features
of a higher order than the previous ones, which we have
classified into four main types, as proposed in [ 20]:
• Statistical features
• Discrete features
• Dynamic features
• Quantitative features</p>
      </sec>
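      <p>As a minimal sketch of the basic OpenFace processing described above, the snippet below loads the per-frame AU intensity signals produced by OpenFace’s FeatureExtraction tool and computes their mean and standard deviation; the CSV file name is a hypothetical placeholder, while column names such as AU12_r follow OpenFace’s output convention.</p>
      <preformat>
# Sketch: per-user mean and standard deviation of the AU intensity signals
# produced by OpenFace (the CSV path is a hypothetical placeholder).
import pandas as pd

df = pd.read_csv("user01_trial01.csv")
df.columns = df.columns.str.strip()          # OpenFace pads column names with spaces

au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
au_stats = df[au_cols].agg(["mean", "std"])  # one column per AU, e.g., AU12_r
print(au_stats)
      </preformat>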
      <sec id="sec-2-5">
        <title>The statistical features represent the first four statistical</title>
        <p>moments (mean, variance, skewness, and kurtosis) and
1https://github.com/TadasBaltrusaitis/OpenFace
were computed for each AU intensity signal obtained
from each facial recording. Then, we quantized the
intensity signal over time for each action unit. The
quantization was obtained through the k-means algorithm,
with four clusters, a value obtained through the Elbow
method. Figure 3 shows an example of a quantized signal.
From each quantized signal, we calculated the following</p>
      </sec>
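      <p>A minimal sketch of these two steps follows, assuming a NumPy array holding one AU intensity signal; the k-means implementation is scikit-learn’s, and all variable names are ours.</p>
      <preformat>
# Sketch: the four statistical moments of one AU intensity signal, plus its
# quantization into four levels via k-means (four clusters, per the Elbow method).
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.cluster import KMeans

def statistical_features(signal):
    return {
        "mean": np.mean(signal),
        "variance": np.var(signal),
        "skewness": skew(signal),
        "kurtosis": kurtosis(signal),
    }

def quantize(signal, n_levels=4):
    """Map each sample to one of n_levels intensity levels (0 = lowest)."""
    km = KMeans(n_clusters=n_levels, n_init=10).fit(signal.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())   # sort clusters by center
    level_of = {int(c): level for level, c in enumerate(order)}
    return np.array([level_of[int(c)] for c in km.labels_])

signal = np.abs(np.random.randn(1500))   # stand-in for a real AU intensity signal
print(statistical_features(signal))
print(quantize(signal)[:20])
      </preformat>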
      <sec id="sec-2-6">
        <title>For each quantized signal, we then generated a transition</title>
        <p>matrix that measures the number of transitions between
the diferent levels. From this matrix, for each action
unit, we extracted three dynamic features:
• Change ratio: Proportion of transitions between
levels;
• Slow change ratio: Proportion of slow transitions posed approach could be in museums [24] and cities [25],
(diference of one level); intended as open-air exhibitions, because it would allow
• Fast change ratio: Proportion of fast transitions the automatic detection of visitors with similar tastes and
(diference of two or more levels). interests [26], regardless of their social, demographic, and
cultural peculiarities. In this way, museums and cities
could represent possible inclusive places through the
sharing of common interests and preferences [27].</p>
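      <p>The following sketch shows one way to compute the transition matrix and the three ratios from a quantized signal; it reflects our reading of the definitions above, and all names are ours.</p>
      <preformat>
# Sketch: transition matrix between quantization levels and the three
# dynamic features derived from it.
import numpy as np

def dynamic_features(levels, n_levels=4):
    trans = np.zeros((n_levels, n_levels), dtype=int)
    for a, b in zip(levels[:-1], levels[1:]):
        trans[a, b] += 1                       # count each frame-to-frame step
    total = trans.sum()
    # Absolute level difference associated with each matrix cell.
    jump = np.abs(np.subtract.outer(np.arange(n_levels), np.arange(n_levels)))
    return {
        "change_ratio": trans[jump >= 1].sum() / total,
        "slow_change_ratio": trans[jump == 1].sum() / total,   # one-level steps
        "fast_change_ratio": trans[jump >= 2].sum() / total,   # two or more levels
    }

print(dynamic_features(np.array([0, 0, 1, 3, 3, 2, 0, 1, 1, 2])))
      </preformat>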
      <p>As a quantitative feature, we considered the ratio between the number of activations of each action unit and the number of frames (act/frame). The criterion with which we calculated the number of activations is that peaks occur when they have a value higher than the variance and a duration longer than three frames. Once all features were calculated, we determined the Pearson correlation matrix between the extracted features and the arousal and valence values. The resulting matrix is shown in Figure 4. Analyzing the correlation values between the extracted features, the three strongest correlations are between kurtosis and skewness, between activation ratio and activation level, and between change ratio and the number of activations per number of frames (act/frame). As for the correlation values between the extracted features and the valence values, it can be noted that the strongest (negative) correlations occur between valence and change ratio, and between valence and fast change ratio. As for the correlation values between the extracted features and the arousal values, the highest correlation values are with the number of activations per number of frames (act/frame), change ratio, and fast change ratio.</p>
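      <p>A sketch of the activation criterion and of the correlation step follows; the data are randomly generated stand-ins (not real DEAP recordings), and the thresholding is our reading of the criterion described above.</p>
      <preformat>
# Sketch: the act/frame quantitative feature (peaks above the signal variance
# lasting more than three frames) and a Pearson correlation with valence.
import numpy as np
from scipy.stats import pearsonr

def activations_per_frame(signal, min_len=3):
    above = signal > np.var(signal)        # frames whose value exceeds the variance
    count, run = 0, 0
    for flag in above:
        run = run + 1 if flag else 0
        if run == min_len + 1:             # run has just exceeded min_len frames
            count += 1
    return count / len(signal)

rng = np.random.default_rng(0)
act_frame = [activations_per_frame(np.abs(rng.standard_normal(1500)))
             for _ in range(22)]           # one value per participant
valence = rng.uniform(1, 9, size=22)       # stand-in SAM ratings
r, p = pearsonr(act_frame, valence)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
      </preformat>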
      <p>The obtained findings are interesting because they show the possibility of inferring information relating to a user’s emotional state by monitoring her facial expressions. From a practical point of view, this can make it possible to automatically derive the degree of appreciation that the user has towards the displayed object, without having to resort to long and annoying questionnaires. This information can be usefully exploited to design and develop recommender systems [21] capable of suggesting, for example, points of interest [22] and itineraries between them [23]. A possible application of the proposed approach could be in museums [24] and cities [25], intended as open-air exhibitions, because it would allow the automatic detection of visitors with similar tastes and interests [26], regardless of their social, demographic, and cultural peculiarities. In this way, museums and cities could represent possible inclusive places through the sharing of common interests and preferences [27].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusions and Future Works</title>
      <p>In this article, we have described an approach that consists in analyzing facial recordings of users subjected to visual stimuli to extract the action units relating to their facial micro-expressions. From these, we obtained some features and verified their correlation with the valence and arousal values. Although the results obtained are encouraging and suggest that there is indeed a significant correlation between the features extracted from the action units and the user’s emotional response, this study has some limitations. First of all, it was carried out on a small sample of users and video sequences. User reactions were collected using the Self-Assessment Manikin which, while having the advantage of simplifying the evaluation by the user, cannot be considered completely reliable. Furthermore, the users’ reactions were collected by showing them music video clips and not, for example, artworks. Among future possible developments, there is, therefore, certainly a live analysis on real users, showing them different visual stimuli and collecting their emotional responses more accurately, for example, also using the EEG signal [28]. Moreover, in our study, we only considered valence and arousal. Further developments could include, for example, the use of other emotional dimensions, such as likability and rewatch.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>[1]</label><mixed-citation>M. Gordon, Solitude and privacy: How technology is destroying our aloneness and why it matters, Technology in Society 68 (2022) 101858.</mixed-citation></ref>
      <ref id="ref2"><label>[2]</label><mixed-citation>G. D’Aniello, M. Gaeta, F. Orciuoli, G. Sansonetti, F. Sorgente, Knowledge-based smart city service system, Electronics (Switzerland) 9 (2020) 1–22.</mixed-citation></ref>
      <ref id="ref3"><label>[3]</label><mixed-citation>L. Vaccaro, G. Sansonetti, A. Micarelli, An empirical review of automated machine learning, Computers 10 (2021).</mixed-citation></ref>
      <ref id="ref4"><label>[4]</label><mixed-citation>G. Sansonetti, F. Gasparetti, G. D’Aniello, A. Micarelli, Unreliable users detection in social media: Deep learning techniques for automatic detection, IEEE Access 8 (2020) 213154–213167.</mixed-citation></ref>
      <ref id="ref5"><label>[5]</label><mixed-citation>H. A. M. Hassan, G. Sansonetti, F. Gasparetti, A. Micarelli, J. Beel, Bert, elmo, use and infersent sentence encoders: The panacea for research-paper recommendation?, in: M. Tkalcic, S. Pera (Eds.), Proceedings of ACM RecSys 2019 Late-Breaking Results, volume 2431, CEUR-WS.org, 2019, pp. 6–10.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>M. Onori, A. Micarelli, G. Sansonetti, A comparative analysis of personality-based music recommender systems, in: CEUR Workshop Proceedings, volume 1680, CEUR-WS.org, Aachen, Germany, 2016.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>M. Tkalcic, B. De Carolis, M. de Gemmis, A. Odic, A. Kosir, Introduction to emotions and personality in personalized systems, in: M. Tkalcic, B. D. Carolis, M. de Gemmis, A. Odic, A. Kosir (Eds.), Emotions and Personality in Personalized Services - Models, Evaluation and Applications, Springer, 2017.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>S. Caldarelli, D. F. Gurini, A. Micarelli, G. Sansonetti, A signal-based approach to news recommendation, in: CEUR Workshop Proceedings, volume 1618, CEUR-WS.org, Aachen, Germany, 2016, pp. 1–4.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>D. Feltoni Gurini, F. Gasparetti, A. Micarelli, G. Sansonetti, Temporal people-to-people recommendation on social networks with sentiment-based matrix factorization, Future Generation Computer Systems 78 (2018) 430–439.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>F. Gasparetti, A. Micarelli, G. Sansonetti, Exploiting web browsing activities for user needs identification, in: Proc. of the 2014 CSCI, volume 2, 2014.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>A. Micarelli, A. Neri, G. Sansonetti, A case-based approach to image recognition, in: Proceedings of the 5th European Workshop on Advances in Case-Based Reasoning, EWCBR ’00, Springer-Verlag, Berlin, Heidelberg, 2000, pp. 443–454.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>P. Ekman, W. V. Friesen, Facial action coding system, 1978.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>J. F. Cohn, K. Schmidt, R. Gross, P. Ekman, Individual differences in facial expression: stability over time, relation to self-reported emotion, and ability to inform person identification, in: Proc. of the 4th IEEE ICMI, IEEE Computer Society, USA, 2002.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>D. Kollias, S. Zafeiriou, Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework, CoRR abs/2103.15792 (2021).</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>S. Zhao, G. Jia, J. Yang, G. Ding, K. Keutzer, Emotion recognition from multiple modalities: Fundamentals and methodologies, IEEE Signal Processing Magazine 38 (2021) 59–73.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>J. A. Russell, A circumplex model of affect, Journal of Personality and Social Psychology 39 (1980).</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>S. Park, S. W. Lee, M. Whang, The analysis of emotion authenticity based on facial micromovements, Sensors 21 (2021).</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>Y.-H. Oh, J. See, A. C. Le Ngo, R. C. W. Phan, V. M. Baskaran, A survey of automatic facial micro-expression analysis: Databases, methods, and challenges, Frontiers in Psychology 9 (2018).</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras, Deap: A database for emotion analysis using physiological signals, IEEE TAC 3 (2012).</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>D. Hadar, T. Tron, D. Weinshall, Implicit media tagging and affect prediction from rgb-d video of spontaneous facial expressions, in: Proc. of the 12th IEEE Int. Conf. on Automatic Face &amp; Gesture Recognition, IEEE Computer Society, USA, 2017.</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>G. Sansonetti, F. Gasparetti, A. Micarelli, F. Cena, C. Gena, Enhancing cultural recommendations through social and linked open data, User Modeling and User-Adapted Interaction 29 (2019) 121–159.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>G. Sansonetti, Point of interest recommendation based on social and linked open data, Personal and Ubiquitous Computing 23 (2019) 199–214.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>A. Fogli, G. Sansonetti, Exploiting semantics for context-aware itinerary recommendation, Personal and Ubiquitous Computing 23 (2019) 215–231.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>A. Ferrato, C. Limongelli, M. Mezzini, G. Sansonetti, Using deep learning for collecting data about museum visitor behavior, Applied Sciences 12 (2022).</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>D. D’Agostino, F. Gasparetti, A. Micarelli, G. Sansonetti, A social context-aware recommender of itineraries between relevant points of interest, in: HCI International 2016, volume 618, Springer International Publishing, Cham, 2016, pp. 354–359.</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>F. Gasparetti, G. Sansonetti, A. Micarelli, Community detection in social recommender systems: a survey, Applied Intelligence 51 (2021) 3975–3995.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>K. Coffee, Cultural inclusion, exclusion and the formative roles of museums, Museum Management and Curatorship 23 (2008) 261–279.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>F. Galvão, S. M. Alarcão, M. J. Fonseca, Predicting exact valence and arousal values from eeg, Sensors 21 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>