=Paper= {{Paper |id=Vol-2450/paper4 |storemode=property |title=Using Facial Recognition Services as Implicit Feedback for Recommenders |pdfUrl=https://ceur-ws.org/Vol-2450/paper4.pdf |volume=Vol-2450 |authors=Stoon De Pessemier,Ine Coppens, Luc Martens |dblpUrl=https://dblp.org/rec/conf/recsys/PessemierCM19 }} ==Using Facial Recognition Services as Implicit Feedback for Recommenders== https://ceur-ws.org/Vol-2450/paper4.pdf
                                     Using facial recognition services as
                                    implicit feedback for recommenders
             Toon De Pessemier                                            Ine Coppens                                  Luc Martens
      imec - WAVES - Ghent University                        imec - WAVES - Ghent University                imec - WAVES - Ghent University
        toon.depessemier@ugent.be                                 ine.coppens@ugent.be                           luc1.martens@ugent.be

ABSTRACT                                                                            the video is playing. However, classic recommender systems are
User authentication and feedback gathering are crucial aspects                      not adjusted to this dynamic situation. Typically, recommendations
for recommender systems. The most common implementations, a                         are generated based on the profile of the individual who initiates
username / password login and star rating systems, require user                     the video session. Family profiles can be created, but do not take
interaction and a cognitive effort from the user. As a result, users                into account who is actually in front of the screen or changes in the
opt to save their password in the interface and optional feedback                   number of spectators during the video watching. However, man-
with a star rating system is often skipped, especially for applica-                 ually logging in each individual user, one by one, would be time
tions such as video watching in a home environment. In this article,                consuming and user-unfriendly. The same issues are applicable for
we propose an alternative method for user authentication based                      the feedback process. Explicit feedback is not requested separately
on facial recognition and an automatic feedback gathering method                    for each individual. For implicit feedback, such as viewing time, it is
by detecting various face characteristics. Using facial recognition                 unclear to whom this refers. Moreover, since star rating systems are
with a camera in a tablet, smartphone, or smart TV, the persons in                  often ignored by the user, an automatic implicit feedback system
front of the screen can be identified in order to link video watch-                 would be more suitable.
ing sessions to their user profile. During video watching, implicit                    This article presents a more user-friendly and practical approach
feedback is automatically gathered through emotion recognition,                     based on facial recognition to log in automatically every viewer and
attention measurements, and behavior analysis. An emotion finger-                   fetch their preferences to compose a dynamic group for group rec-
print, which is defined as a unique spectrum of expected emotions                   ommendations. These preferences are derived from their implicit
for a video scene, is compared to the recognized emotions in order                  feedback, which is gathered automatically by detecting various
to estimate the experience of a user while watching. An evaluation                  facial characteristics during past video watching sessions. We eval-
with a test panel showed that happiness can be most accurately                      uated this implicit feedback gathering by using facial recognition
detected and the recognized emotions are correlated with the user’s                 services based on a dataset of photos as well as with a user test.
star rating.
                                                                                    2   RELATED WORK
CCS CONCEPTS                                                                        Face detection is the technique that locates the face of a person
• Information systems → Information systems applications;                           in a photo. It is the prerequisite of all facial analysis and different
Data analytics; Data mining.                                                        approaches for the detection have been studied [4]. Facial recogni-
                                                                                    tion is the process of matching a detected face to a person who was
KEYWORDS                                                                            previously detected by the system. In the study of Yang et al., this
Feedback, Emotion recognition, Facial analysis, Recommendation                      is also called face authentication and defined as the identification
ACM Reference Format:
                                                                                    of an individual in a photo [18]. Related to this is the analysis of
Toon De Pessemier, Ine Coppens, and Luc Martens. 2019. Using facial recog-          faces for the purpose of age detection and gender detection. Au-
nition services as implicit feedback for recommenders. In IntRS ’19: Joint          tomatically detecting the gender and age group of the user (child,
Workshop on Interfaces and Human Decision Making for Recommender Sys-               young-adult, adult, or senior) can be useful for initial profiling of
tems, 8 pages.                                                                      the user. In this paper, various commercial services for gender and
                                                                                    age detection are used: Microsoft’s Facial Recognition Software:
1    INTRODUCTION                                                                   Face [3], Face++ [8], and Kairos [11]. Even more recognition ser-
Many video services generate personal recommendations for their                     vices exist, such as FaceReader [15], but some are rather expensive
customers to assist them in the content selection process that be-                  or are not available as a web service that can be queried from a
comes more difficult by the abundance of available content. In the                  mobile device. So, the first research question of this study is: “How
application domain of video watching, the content is often con-                     accurately are these commercial services for age detection and gender
sumed simultaneously by multiple people (e.g., a family watching                    detection in view of an initial user profile for video watching?”
together) or the device is shared by multiple people (e.g., a tablet is                While watching video content or using an app or service in gen-
used by multiple people of the family). Moreover, in the context of                 eral, facial expressions of users might reveal their feelings about the
a household, people may join and leave the watching activity while                  content or their usage. In the field of psychology, the relationship
                                                                                    between distinctive patterns of the facial muscles and particular
Copyright ©2019 for this paper by its authors. Use permitted under Creative Com-    emotions have been demonstrated to be universal across different
mons License Attribution 4.0 International (CC BY 4.0). IntRS ’19: Joint Workshop
on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019,      cultures [6]. The psychologists conducted experiments in which
Copenhagen, DK.                                                                     they showed still photographs of faces to people from different
IntRS Workshop, September 2019, Copenhagen, DK.                                                                             De Pessemier, et al.


cultures in order to determine whether the same facial behavior              Figure 1 shows the data flow. The research focus of this article
would be judged as the same emotion, regardless of the observers’         is on the first and the third phase. In the first phase, the goal is
culture. These studies demonstrated the recognizability of emotions       to identify and recognize each face in the photo. For new faces,
(happiness, sadness, anger, fear, surprise, disgust, interest).           age and gender will be detected to create an initial user profile. In
   Based on these concepts, facial expression recognition is de-          the third phase, the photos will be used for emotion recognition,
scribed as the identification of the emotions. The automatic recog-       attention measurements, and behavior analysis in view of deriving
nition of facial expressions, and especially emotions, enables the        automatic feedback. The second phase, offering personalized recom-
automatic exploitation of emotions for profiling and recommen-            mendations, is used to help users in the content selection process
dation purposes. Therefore, the same three commercial services            and demonstrate the added value of facial expression recognition.
are used for facial expression recognition during video watching in
this study.
   Various researchers have investigated the role of emotions in
recommender systems. Emotions can be used to improve the quality
of recommender systems in three different stages [17]:                     Picture of
                                                                              user
    (1) The entry stage: when a user starts to use a content delivery                                       Groups:
                                                                                                       Avg without misery
        system with or without recommendations, the user is in an
        affective state, the entry mood. The user’s decision making
        process is influenced by this entry mood. A recommender
        can adapt the list of recommended items to the user’s entry
        mood by considering this as contextual information [1].
    (2) The consumption stage: after the user starts to consume
                                                                          Figure 1: Data flow-3 phases: login, recommender, feedback.
        content, the user experiences affective responses that are
        induced by the content [17]. Moreover, by automatic emo-
        tion detection from facial expressions, an affective profile
        of movie scenes can be constructed. Such an item profile
        structure labels changes of users’ emotions through time,
                                                                          3.1       Phase 1: User authentication and profiling
        relative to the video timestamp [10].
    (3) The exit stage: after the user has finished with the content      Although facial recognition is often used to unlock smartphones
        consumption, the user is in the exit mood. The exit mood          automatically, the applications in a group context, to identify multi-
        will influence the user’s next decisions. In case that the user   ple people simultaneously, are less common. In other words, facial
        continues to use the content delivery system, the exit mood       recognition is used to give an answer to the question: “who is in
        for the content just consumed is the entry mood for the next      front of the screen?”. In a real world scenario, it can be several
        content to be consumed [17].                                      people, all of whom will be individually identified.
                                                                              For the authentication of recurring users (who have been identi-
   In this paper the focus is on the consumption stage. Users watch       fied by our app in a previous session), our Android app uses Face
movies and their facial expressions are captured as a vector of           Storage of the Microsoft service. This saves persons with their faces
emotions that change over time. The facial expressions, such as           in a Person Group, which is trained based on the photos of the
emotions, are used as an indicator of the user’s satisfaction with        camera. This enables to link the user in front of the screen with one
the content. The assumption is that users appreciate a video if they      of the existing user profiles. For new users, the age and gender is
sympathize with the video and express their emotions in accordance        estimated (Section 4.1). To cope with the cold-start problem, initial
with the expected emotions.                                               recommendations are based on these demographics.
   Therefore, the second research question of this study is: “Can             In practice, user authentication and profiling works as follows.
facial expression recognition during video watching be used as an         Using the app, users can log in by ensuring that their face is visible
unobtrusive (implicit) feedback collection technique?”                    for the front-facing camera when they push the start button. A
                                                                          photo is made that is used as input for the facial recognition services.
3    METHOD                                                               Recurring users are logged in automatically; their existing profile
To facilitate human-computer interaction for video watching ser-          (age, gender, and watching history) is retrieved, and the new photo
vices, an Android application has been developed with the following       is a new training sample for Face Storage. For new users, a profile is
three subsequent phases: 1) User authentication with an automated         created based on their estimated age and gender. After every login,
login procedure and user profiling (gender and age) based on facial       the age estimation is adjusted based on the new photo. This update
recognition to identify all people who are in front of the screen. 2)     can correct age estimations based on previous photos, but also takes
Personalized recommendations (group recommendations in case               into account the aging of users when using the system for multiple
multiple people are in front of the screen). 3) Automatic feedback        years. This is especially useful for children who can get access to
gathering while the chosen video is playing. Using the front-facing       more content as they fulfill the minimum age requirements over
camera of the tablet/smartphone or a camera connected to a smart          time. Moreover, storing a photo for every session has the advantage
TV, the app takes photos of all people in front of the screen and         that changes to the user’s appearance (e.g., different hairstyle) can
sends requests to different facial recognition services.                  be taken into account.
De Pessemier, et al.                                                                         IntRS Workshop, September 2019, Copenhagen, DK.


3.2    Phase 2: Group recommendations                                          used as implicit feedback [2], since different videos evoke different
Group recommendations are generated by aggregating individual                  emotions. One can assume that users appreciate a video if they
user models (consisting of age, gender, ratings, and watching his-             sympathize with the video and express their emotions in accordance
tory), one for every user in front of the screen. From the eleven              with the expected emotions. E.g., during a comedy scene users may
strategies proposed by Judith Masthoff [14], the “Average without              laugh (‘happy’ emotion), whereas during a horror scene ‘fear’ can
misery” strategy was adopted in our group recommender algorithm.               be expected. Recognized emotions that are not expected, might be
This strategy takes into account the (predicted) rating score of every         due to external influences (e.g., other people in the room) or reflect
user by calculating a group average, while avoiding misery by elim-            contempt for the video (e.g., laughing with terrifying scenes of a
inating videos that are really hated by some group members, and                horror movie). Therefore, unexpected emotions are not taken into
therefore considered as unacceptable for the group. The Lenskit [7]            account.
recommendation framework was used to calculate these rating pre-                  Thus, the similarity between the expressed emotions (=recog-
diction scores and transform them into a Top-N recommendation                  nized emotions) and the expected emotions is calculated to deter-
list.                                                                          mine the user’s experience while watching the video. The expected
    Besides personal preferences, other criteria, such as the age and          emotions are based on the emotion fingerprint, which is defined
historical viewing activities of the users, are taken into account. Age        as a unique spectrum of expected emotions for a video scene. For
is modeled as classes of age ranges, firstly to filter out inappropriate       every second of the video, the emotion spectrum of the fingerprint
content for minors, secondly for estimating the ratings for cold-              specifies the probability value of each of the six possible emotions:
start users based on other users of the same class. We used the age            anger, disgust, fear, happiness, sadness, and surprise. These emotion
ranges that are also used by IMDb: <18, 18-29, 30-44, 45+.                     dimensions have been identified in the field of psychology [6]. So,
    The age of the users is used to determine whether a video is               the emotion fingerprint shows which emotions the video typically
suitable for the viewers. For every video, the advised minimum                 provokes among viewers at every second of the video. The emo-
age is retrieved from the Common Sense Media website [16]. If at               tion fingerprint is composed by aggregating emotions expressed by
least one of the users is younger than this age threshold, the video           many users during watching this specific video. Section 4.4 explains
is marked as unsuitable for the group according to the average                 in detail how the fingerprint of a video scene is computed based on
without misery strategy. Likewise, if at least one of the users has            an example.
already seen the video, it is considered as unsuitable for the group              The distance between expressed and expected emotions is calcu-
since this person probably does not want to see the video again.               lated based on the euclidean distance between the values of these
    If a new user is present in front of the screen, i.e. a cold-start user,   two emotion spectra for every second i of the video and each emo-
user preferences for a movie are estimated based on demographics.              tion j. For the expressed emotions, the output of the Microsoft
An estimation of the user’s age and gender, as provided by the facial          service is used in our online experiment (Section 4.4) because of
recognition services, is used to find users with similar demographics.         the results of the offline evaluation (Section 4.3). The similarity
The preferences of that demographic group (age & gender class)                 between expressed and expected emotions is calculated based on
are used to estimate the preferences of the cold-start user. In case           the inverse of the emotion distance and an additional constant to
of an explicit rating for example, we use the mean rating of that              avoid a division by zero.
demographic group for the movie, as mentioned by IMDb [9]. The
                                                                                                        v
                                                                                                        u
                                                                                                        tÕ n Õ 6
mean rating provided by the demographic group is compared with                    emotionDistance =               (expectedi, j − expressedi, j )2 (1)
the mean rating over all users for this specific movie. This difference                                  i=0 j=1
(demographic group mean - global mean) indicates if the movie is
less or more suitable for a specific age/gender.                                                                            1
                                                                                           emotionSimilarity =                                   (2)
                                                                                                                  1 + emotionDistance
                                                                                  Besides emotions, also the attention level and user behavior are
3.3    Phase 3: Automatic feedback                                             analyzed during video watching as an additional implicit feedback
Commercial services that perform emotion recognition, attention                mechanism. The Microsoft service has an additional interesting
measurements, and behavior analysis are often based on the analy-              feature that recognizes occluded areas of the face. This occlusion is
sis of photos. Therefore, our Android app continuously takes photos            used to recognize negative feedback during video watching in case
from the users with the front-facing camera during video watching.             users respond to the video by holding their hands in front of their
Every second, a photo is taken and sent to the Microsoft recognition           mouth or eyes (typical for shocking content).
service for face detection and authentication in order to check if                Face++ is the only service that can detect closed eyes, which can
all viewers are still in front of the screen. Subsequently, for each           be an indication of sleeping. Also the user’s head pose is derived
identified face, the area of the photo containing the face is selected         from Face++. Although other services can recognize the head pose
and the photo is cropped so that only one person’s face is visible.            as well, the estimation of Face++ showed to be the most accurate
Next, the cropped photo is sent to each of the three recognition               one. In case users do not want to see a scene (negative feedback),
services. Since photos are sent for every identified face, facial ex-          they might close their eyes or turn their head.
pressions will be recognized for all identified individuals in front              The Kairos recognition service offers a feature that represents
of the screen.                                                                 the attention level of the user, which is estimated based on eye
   For recognizing the emotions on the users’ face, the Microsoft              tracking and head pose. In our application, these behavioral aspects
service was used. But these recognized emotions cannot be directly             are combined into the overallAttention level by aggregating the
IntRS Workshop, September 2019, Copenhagen, DK.                                                                                                                                     De Pessemier, et al.


service results over all photos taken during video watching. The                                               Table 1: Evaluation of gender & age estimation
overall attention level is calculated as the percentage of photos in
which the user pays attention and following conditions are met:                                                                         Kairos          Microsoft              Face++                Aggregation
Kairos’ attention level > 0.5, both eyes open, no occlusion, and
                                                                                                      Detection failed                    4                  2                     2                        2
head pose angles are between the margins: 30 degrees for the yaw
                                                                                                     Avg abs. age error                  8.88               4.31                 13.14                     7.91
angle and 20 degrees for the pitch angle. The assumption is that the
                                                                                                     Median age error                    6.0                2.9                  11.0                      8.1
user is not paying attention to the video if one of these conditions
                                                                                                     Gender error (%)                    11.9               15.9                 13.6                      11.3
is not met.

                     #Photos(attention & eyes & noOcclusion & headPose)
overallAttention =                                                          with the front facing camera are used as input for the recognition
                                           #Photos
                                                                     (3)    services.
   An implicitFeedbackScore on a scale ranging from 0 to 10 is cal-
culated by aggregating the different facial analysis features. The          4.1                                Age & gender estimation
similarity with the expected emotions has a contribution of six
                                                                            Firstly, the authentication was evaluated: recognizing the user who
points out of ten points. The overall attention level counts for the
                                                                            used the app in the past. The automatic authentication of the 46
remaining four points.
                                                                            users (login process) showed to be very accurate: 4 undetected faces
                                                                            with Kairos (9%), 2 with Microsoft and Face++ (4%).
 implicitFeedbackScore = 6 · emotionSimilarity + 4 · overallAttention (4)      Subsequently, for the recognized faces, the services were used
                                                                            to estimate the users’ age and gender based on photos of the test
4   EVALUATION                                                              users taken while holding the tablet. The estimated age and gender,
                                                                            as provided by the recognition services, were compared to the
Evaluations of commercial facial recognition services have been
                                                                            people’s real age and gender. Figure 2 shows the differences between
performed in literature, but are typically based on datasets with
                                                                            estimation and real age, sorted according to the real age of the users.
high-quality photos that enable an accurate recognition: sufficiently
                                                                            The largest errors were obtained for estimating the age of children.
illuminated, no shadow or reflections, high resolution, and a perfect
                                                                            Kairos and Face++ typically estimate the children to be older than
position of the face in the middle of the photo [2, 5]. In contrast,
                                                                            they are. Table 1 reports the number of photos for which a detection
for facial recognition and analysis during (mobile) video watching,
                                                                            was not possible, the average absolute age error, the median age
the front-facing camera of the device is used without flash, which
                                                                            error, and the percentage of photos for which the gender estimation
yields not always ideal photos.
                                                                            was wrong.
    Therefore, we evaluated the three facial recognition services in
                                                                               The three facial recognition services were compared with a hy-
an offline test (based on a publicly available dataset of photos) in
                                                                            brid solution that aggregates the results of the three. For the age
Section 4.3 as well as in an online setting (with real users). For the
                                                                            estimation, the aggregation is the average of the results of the three.
evaluation of the age & gender estimation (Section 4.1), 46 users
                                                                            For the gender, the three gender estimations are aggregated using
with age ranging from 0 to 66 were involved in our test. For the
                                                                            a voting process. For each photo, the gender with the most votes is
evaluation of the attention level (Section 4.2), we used 76 photos of
                                                                            the result of the aggregation. This voting aggregation showed to be
our test users with different levels of attention. The evaluation of
                                                                            more reliable than each individual service for estimating gender.
emotion recognition during video playback (Section 4.4) requires
                                                                               Microsoft has the best results in the age test, so we decided to
more time from the user and was therefore performed by only 20
                                                                            use only this service to estimate the user’s age. For the gender
users. Since the focus of this study was on age & gender estimation
                                                                            estimation, the aggregation method is used.
and emotion recognition, the group recommendations were not
                                                                               So as an answer to the first research question, we can say that
evaluated, and all users used the app alone.
                                                                            the facial recognition services provide an accurate age and gender
    The overall aim of this study is to improve the user friendliness
                                                                            detection for creating an initial profile for a cold-start user.
of devices for video watching in the living room. This evaluation
is the first step to reach the future goal of multi-user recognition
                                                                                                     60
and is therefore carried out with a tablet, in a rather controlled                                   50
                                                                                                                Difference between real age and age estimation of facial recognition services
environment, with one person at a time. During the test, photos of                                   40
                                                                                                                                                                             Kairos
the test user were taken with the front-facing camera of the tablet
                                                                            Age difference (years)




                                                                                                     30
                                                                                                                                                                             Microsoft
(Samsung Galaxy tab A6). If the tablet would have captured two                                       20                                                                      Face++

people in the photo, the recognition process would be performed                                      10

                                                                                                      0
for both recognized faces.                                                                                 0         10            20              30               40               50               60          70
                                                                                                     -10
    To have a realistic camera angle, the users were asked to hold                                   -20
                                                                                                                                             Test persons, sorted by their real age (age in years)
the tablet in front of them, as they would usually do for watch-                                     -30

ing a video. The users were sitting on a chair and the room was
sufficiently illuminated. However, no guidelines were provided re-                        Figure 2: Age estimation using facial recognition services.
garding their behavior, head position, or attention; e.g., nothing
was said about looking away or closing eyes. The photos taken
De Pessemier, et al.                                                                                                                                                                                                                                IntRS Workshop, September 2019, Copenhagen, DK.


                                                         Anger Kairos                                                                                                                Anger Microsoft                                                                                                                 Anger Face++
    120                                                                                                                             1.2                                                                                                                           120

    100                                                                                                                               1                                                                                                                           100

     80                                                                                                                             0.8                                                                                                                            80

     60                                                                                                                             0.6                                                                                                                            60

     40                                                                                                                             0.4                                                                                                                            40

     20                                                                                                                             0.2                                                                                                                            20

      0                                                                                                                               0                                                                                                                             0
          0     5              10            15           20         25         30         35         40         45            50         0     5             10           15          20         25         30         35          40         45            50         0     5              10           15           20         25         30         35          40         45             50

              Anger                Fear                Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise              Sadness               Disgust




                                                             Fear Kairos                                                                                                              Fear Microsoft                                                                                                                   Fear Face++
    120                                                                                                                             1.2                                                                                                                           120

    100                                                                                                                              1                                                                                                                            100

     80                                                                                                                             0.8                                                                                                                            80

     60                                                                                                                             0.6                                                                                                                            60

     40                                                                                                                             0.4                                                                                                                            40

     20                                                                                                                             0.2                                                                                                                            20

     0                                                                                                                               0                                                                                                                              0
          0                    5                          10                    15                    20                       25         0               5                     10                15               20                    25                  30         0                5                     10                 15               20                    25                   30

              Anger                Fear                Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise              Sadness               Disgust




                                                   Happiness Kairos                                                                                                             Happiness Microsoft                                                                                                            Happiness Face++
    120                                                                                                                             1.2                                                                                                                           120

    100                                                                                                                              1                                                                                                                            100

     80                                                                                                                             0.8                                                                                                                            80

     60                                                                                                                             0.6                                                                                                                            60

     40                                                                                                                             0.4                                                                                                                            40

     20                                                                                                                             0.2                                                                                                                            20

      0                                                                                                                              0                                                                                                                              0
          0         10                  20              30           40          50              60             70             80         0         10              20               30           40          50              60              70             80         0         10               20                30           40          50              60              70              80

              Anger                Fear                Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise              Sadness               Disgust




                                                       Surprise Kairos                                                                                                           Surprise Microsoft                                                                                                                 Surprise Face++
    120                                                                                                                             1.2                                                                                                                           120

    100                                                                                                                               1                                                                                                                           100

     80                                                                                                                             0.8                                                                                                                            80

     60                                                                                                                             0.6                                                                                                                            60

     40                                                                                                                             0.4                                                                                                                            40

     20                                                                                                                             0.2                                                                                                                            20

      0                                                                                                                               0                                                                                                                             0
          0      10                20             30            40        50          60           70            80            90         0     10             20               30           40        50          60              70          80            90         0     10              20               30            40        50          60              70          80             90

              Anger                 Fear               Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise              Sadness               Disgust




                                                       Sadness Kairos                                                                                                            Sadness Microsoft                                                                                                                  Sadness Face++
    120                                                                                                                             1.2                                                                                                                           120                                                                                                                               1.2

    100                                                                                                                               1                                                                                                                           100

     80                                                                                                                             0.8                                                                                                                            80                                                                                                                               0.8

     60                                                                                                                             0.6                                                                                                                            60                                                                                                                               0.6

     40                                                                                                                             0.4                                                                                                                            40                                                                                                                               0.4

     20                                                                                                                             0.2                                                                                                                            20                                                                                                                               0.2

      0                                                                                                                               0                                                                                                                             0
          0                5                      10                 15               20                   25                  30         0               5                     10                15               20                    25                  30         0                5                     10                 15               20                    25                    30

              Anger                Fear                Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise               Sadness               Disgust




                                                        Disgust Kairos                                                                                                           Disgust Microsoft                                                                                                                  Disgust Face++
    120                                                                                                                             1.2                                                                                                                           120
    100                                                                                                                               1                                                                                                                           100

     80                                                                                                                             0.8                                                                                                                            80

     60                                                                                                                             0.6                                                                                                                            60

     40                                                                                                                             0.4                                                                                                                            40

     20                                                                                                                             0.2                                                                                                                            20

      0                                                                                                                               0                                                                                                                            0
          0           10                     20                30          40               50              60                 70         0          10                   20                30          40               50               60                 70         0           10                   20                 30          40               50               60                  70

              Anger                 Fear               Happiness          Surprise              Sadness              Disgust                  Anger                Fear              Happiness         Surprise              Sadness               Disgust                  Anger                 Fear              Happiness          Surprise              Sadness               Disgust




               Figure 3: Output of the recognition services: recognized emotions in photos of people expressing emotions.
IntRS Workshop, September 2019, Copenhagen, DK.                                                                                                                   De Pessemier, et al.

 Table 2: Attention level: percentage correctly recognized                                                                 Aggregated Emotion Values
                                                                                                   1




                                                                         Emotion probability
                                                                                                  0.8
                                      Kairos   Microsoft    Face++
                                                                                                  0.6
         Covering eyes                N/A        97.37%       N/A
                                                                                                  0.4
        Covering mouth                N/A        94.74%       N/A
                                                                                                  0.2
       Covering forehead              N/A        98.68%       N/A
          Closed eyes                 N/A         N/A        97.37%                                0
                                                                                                        0           20           40          60          80       100        120        140
           Attention                 82.97%       N/A         N/A
                                                                                                                                                Time (s)
       Head pose attention           60.53%       N/A        72.37%                                         Anger         Fear         Happiness       Surprise     Sadness         Disgust

 No detection: Face turned away      11.84%       7.89%      2.36%
                                                                                                  Figure 4: Emotion values aggregated over all test users.


4.2    Attention level offline                                                                     1
                                                                                                                                 Emotion fingerprint




                                                                            Emotion probability
The features that constitute the attention score of the user (equa-                               0.8
tion 3) are evaluated based on a dataset that we created with photos                              0.6
of the users taken during the test. In addition, some photos were
                                                                                                  0.4
added for which users were explicitly asked to cover part of their
                                                                                                  0.2
face. The photos were manually annotated with the features (e.g.,
eyes closed or not) to obtain the ground truth. The result was a                                   0
                                                                                                        0           20           40          60         80        100        120        140
dataset of 76 photos with a focus on these attention features (e.g.,
                                                                                                                                                Time(s)
multiple users covering their eyes, mouth, etc.).                                                       Anger            Fear         Happiness      Surprise      Sadness         Disgust
   Table 2 shows the percentage correctly recognized photos for
each attention feature. However, not all attention features are avail-
                                                                         Figure 5: The emotion fingerprint based on the aggregated
able for the three services. Features that are not available are indi-
                                                                         emotions.
cated with N/A.
   Face++ provides two probability values for closed eyes (for left
and right eye). If both values have a probability of 40% or more, we     been recognized with great certainty). For the Microsoft service,
consider this as “Closed eyes”.                                          the output values range from 0 to 1 (with the same interpretation).
   Kairos estimates the attention of the user and expresses this with        Figure 3 shows for each of the six photo sets how the emotions
a value between 0 (no attention at all) and 1 (full attention). Kairos   are recognized by the services. The emotion values are shown on
attention feature is based on eye tracking and head pose. To convert     the Y-axis for each photo set that was used as input (photo index
this to a binary value (attention or not), we used a threshold of 0.5.   on the X-axis). Each recognized emotion has a different color. For a
   Kairos and Face++ can recognize the head pose of the user. If         specific photo set, the ideal emotion recognition should result in
the head position is outside the boundaries (30 degrees for the          the detection of only one emotion with a value of 1 for Microsoft
yaw angle and 20 degrees for the pitch angle), we interpret this         and 100 for Kairos and Face++, while the other emotion values are
as “head turned away and not paying attention”. The estimation           0. For a limited number of photos, the person’s face could not be
of Face++ is more accurate than this of Kairos. Therefore, the head      detected. This resulted in no output of the service. Therefore, not
pose specified by Face++ is used in the app.                             all indices have an emotion value in the graphs of Kairos. In general,
   If the face is turned away too much from the camera or a large        the results clearly show that some emotions, such as happiness and
part of the face is covered, then face detection might fail. The         surprise, are more easy to detect with a high certainty, whereas
percentage of “no detections” is also indicated in Table 2. Remember     other emotions, such as fear, are more difficult to detect and can
that this dataset was created with the focus on attention level. For     easily be confused. Although the people of these photos are expres-
many photos, users were explicitly asked to turn their head away.        sively showing their emotions, the automatic recognition of these
Therefore, the number of no detections is rather high.                   emotions is not yet perfect.
                                                                             Anger is accurately recognized by Kairos and Microsoft, whereas
4.3    Emotion recognition offline                                       Face++ confuses anger with disgust and sadness for some photos.
The emotion recognition ability of the three facial recognition ser-     Fear is the most difficult to detect: Kairos detects fear in most photos;
vices was evaluated using the Cohn Kanade dataset [12, 13], which        but Microsoft and Face++ sometimes incorrectly recognize sadness
contains photos of people showing different emotions evolving            and disgust. Happiness is very accurately detected by all three
from neutral to a very explicit emotion. Six photo sets with the         services. With the Microsoft service, the results are almost perfect:
very explicit emotions (one set for each emotion) are used as input      only happiness is detected and no other emotions. Also surprise
for the facial recognition services. The output of the recognition       is very well recognized by all three service with high emotion
services is a vector of 6 values, one for each emotion. For Kairos and   values. Sadness is recognized for most photos, but in comparison to
Face++, these output values range from 0 (meaning this emotion           happiness and surprise, the emotion values are lower. This indicates
has not been recognized at all) to 100 (meaning this emotion has         that sadness is less clearly recognizable for emotion recognition
  De Pessemier, et al.                                                                                                                                                                            IntRS Workshop, September 2019, Copenhagen, DK.


                                         Emotions expressed by user 3                                                                          Emotions expressed by user 4                                                                            Emotions expressed by user 13
                       1                                                                                                    1                                                                                                            1




                                                                                                     Emotion probability
Emotion probability




                                                                                                                                                                                                                  Emotion probability
                      0.8                                                                                                  0.8                                                                                                          0.8

                      0.6                                                                                                  0.6                                                                                                          0.6

                      0.4                                                                                                  0.4                                                                                                          0.4

                      0.2                                                                                                  0.2                                                                                                          0.2

                       0                                                                                                    0                                                                                                            0
                            0       20       40       60         80      100        120        140                               0       20          40          60         80        100        120        140                               0       20      40       60         80      100        120        140
                                                      Time (s)                                                                                                    Time (s)                                                                                              Time (s)
                            Anger    Fear     Happiness       Surprise    Sadness         Disgust                                Anger        Fear        Happiness       Surprise     Sadness         Disgust                                Anger    Fear    Happiness       Surprise    Sadness         Disgust




                                    Figure 6: Emotions expressed by 3 users during video watching. Users 4 and 13 like the video, user 3 doesn’t.


 services. Disgust is sometimes confused with anger, but Microsoft                                                                                                        resulting emotion fingerprint. We consider this emotion fingerprint
 and Face++ rightly assign a much lower value to anger for most                                                                                                           as the expected emotion spectrum for the specific video.
 photos.                                                                                                                                                                     To discuss the results, we elaborate on the emotion spectrum of
    In conclusion, the comparison between the recognized emotions                                                                                                         three users of the test. Figure 6 shows the expressed emotions of
 and the true emotion labels of the photos, revealed that the Mi-                                                                                                         users 3, 4, and 13, while watching the comedy video. The expressed
 crosoft service has the most accurate emotion recognition. There-                                                                                                        emotions of users 4 and 13 clearly show some similarities with the
 fore, the Microsoft service was chosen as solution for emotion recog-                                                                                                    emotion fingerprint. Happiness is the most dominant emotion, but
 nition in Section 4.4. The evaluation based on the Cohn Kanade                                                                                                           also some sad and surprising aspects are in the movie. The video
 dataset also indicated that - even with the most explicit emotion                                                                                                        contains the most expressive emotions (funny scene) from second
 photos - anger, disgust, and fear are always detected with a low                                                                                                         30, which is visible in the expressed emotions of user 4 and 13.
 probability value. Happiness can be detected with high probability                                                                                                          The explicit ratings for the video of users 3, 4, and 13 were
 values. So, happiness can be considered as the emotion that is rather                                                                                                    respectively: 3, 6, and 6.5 stars on a scale from 1 to 10. The low
 easy to detect with a high confidence, whereas anger, disgust, and                                                                                                       explicit rating of user 3 is reflected in the emotion values of this
 fear are much harder to detect.                                                                                                                                          user (implicit feedback), which are significantly lower than with
                                                                                                                                                                          the other users.
  4.4                           Emotion recognition online                                                                                                                   For the test with 20 users, we achieved a significant positive
                                                                                                                                                                          correlation of 0.37 between the explicit rating given by the user,
 Emotion recognition as a tool for gathering automatic feedback,
                                                                                                                                                                          and the similarity between the user’s expressed emotions and the
 was evaluated with a test panel consisting of 20 users between the
                                                                                                                                                                          expected emotion fingerprint (equation 2). Since the rating process
 ages of 5 and 72. During the test, each user watched six videos
                                                                                                                                                                          and emotion recognition are characterized by a lot of noise, the
 on a tablet. For each of the six basic emotions, one characteristic
                                                                                                                                                                          correlation between both will never be very high. However, the
 video was chosen (e.g., for happiness a comedy, for fear a scary
                                                                                                                                                                          positive correlation indicates that expressed emotions clearly are
 horror movie, etc.). During video watching, the front-facing camera
                                                                                                                                                                          a form of implicit feedback that can be used as input for a recom-
 continuously took photos that were analyzed, and for which an
                                                                                                                                                                          mender system. Moreover, we expect that the correlation might
 emotion score (based on equation 1 and 2), overall attention score
                                                                                                                                                                          improve if users watch full movies or tv shows instead of movie
 (equation 3), and implicit feedback score based on a complete facial
                                                                                                                                                                          trailers, as in our user test. Therefore, we can consider the rec-
 analysis (equation 4) were calculated.
                                                                                                                                                                          ognized emotions as a valid alternative feedback method in case
    The emotion fingerprint of the video was obtained by aggregat-
                                                                                                                                                                          ratings are not available, or as a feedback method ‘during’ con-
 ing the expressed emotions over all the test users. Figure 4 gives an
                                                                                                                                                                          tent consumption instead of ‘after’ finishing the consumption. This
 example of this aggregation for a comedy video (a scene from the
                                                                                                                                                                          answers our second research question.
 movie “Dude, Where’s My Car?”). The emotion signal of the fin-
                                                                                                                                                                             Besides the emotion score, we also studied the implicit feedback
 gerprint is the average emotion value over all users at each second
                                                                                                                                                                          score (equation 4), which is the combination of emotion and atten-
 of the video. Because of the aggregation of emotions of multiple
                                                                                                                                                                          tion score. However, the variation in the attention score was limited
 test persons, the emotion fingerprint was constructed after the user
                                                                                                                                                                          for our user test, since all trailers are rather short (2-3 minutes).
 test. Subsequently, irrelevant emotion values are removed and only
                                                                                                                                                                          We suspect that the duration of the trailers is too short to build
 the most dominant emotions are retained (e.g., happiness, surprise,
                                                                                                                                                                          up intense emotional moments that make users inclined to cover
 and sadness in this comedy movie). Key scenes of the video that
                                                                                                                                                                          their eyes or mouth. Moreover, the trailers are too short to witness
 may provoke emotions are manually selected. During periods of
                                                                                                                                                                          a decreasing level of attention (e.g., falling asleep). Therefore, we
 the video without expressive emotions, the fingerprint values are
                                                                                                                                                                          expect that the attention score and implicit feedback score might
 set to zero. During these periods, we assume that the emotions rec-
                                                                                                                                                                          be better suited as implicit feedback for content items with a longer
 ognized from the users’ face are due to external factors. As visible
                                                                                                                                                                          duration.
 in Figure 5, the video contains no emotional scene from second 0
 until 30. Next, the fluctuations of the emotion signal are reduced by
 using the maximum observed emotion value over a time window of                                                                                                           5          DISCUSSION
 10 seconds. This takes into account that an expression of emotions                                                                                                       During the user test, it became clear that people do not express
 typically takes multiple seconds. Figure 5 shows an example of a                                                                                                         their emotions much during video watching, even not if the videos
IntRS Workshop, September 2019, Copenhagen, DK.                                                                                                     De Pessemier, et al.


contain scenes with intense emotions as selected in our test. Happi-                    [3] Microsoft Azure. 2019. Face API - Facial Recognition Software. Available at
ness is expressed most clearly, and is the only emotion that reached                        https://azure.microsoft.com/en-us/services/cognitive-services/face/.
                                                                                        [4] Mayank Chauhan and Mukesh Sakle. 2014. Study & analysis of different face
the maximum probability value of 1, e.g., for person 13 as visible                          detection techniques. International Journal of Computer Science and Information
in Figure 6. For the other basic emotions, the recognition services                         Technologies 5, 2 (2014), 1615–1618.
                                                                                        [5] Toon De Pessemier, Damien Verlee, and Luc Martens. 2016. Enhancing recom-
typically register probabilities that are much lower. The second                            mender systems for TV by face recognition. In 12th International Conference on
most recognizable emotion was sadness. It has a maximum value                               Web Information Systems and Technologies (WEBIST 2016), Vol. 2. 243–250.
over all users of 0.68, with only 15% of the test users scoring a                       [6] Paul Ekman and Wallace V Friesen. 1971. Constants across cultures in the face
                                                                                            and emotion. Journal of personality and social psychology 17, 2 (1971), 124.
sadness value of 0.60 or higher (for the sad video). For fear, the                      [7] Michael D. Ekstrand. 2018. The LKPY Package for Recommender Systems Experi-
maximum registered value over all test users was only 0.27 (during                          ments. Computer Science Faculty Publications and Presentations 147. Boise State
the fearful video). Fear is the most difficult emotion to recognize,                        University. https://doi.org/10.18122/cs_facpubs/147/boisestate
                                                                                        [8] Face++. 2019. Cognitive Services - Leading Facial Recognition Technology. Avail-
as was also discussed in the offline test.                                                  able at https://www.faceplusplus.com/.
   For this experiment, the emotion fingerprint was constructed by                      [9] IMDb. 2019. Ratings and reviews for new movies and TV shows. Available at
                                                                                            https://www.imdb.com/.
aggregating the emotion values of all users. A big challenge is to                     [10] Hideo Joho, Joemon M Jose, Roberto Valenti, and Nicu Sebe. 2009. Exploiting
identify the correct expected emotions and their probability values                         facial expressions for affective video summarisation. In Proceedings of the ACM
for the fingerprint spectrum. For this, we propose the following                            international conference on image and video retrieval. ACM, 31.
                                                                                       [11] Kairos. 2019. Serving Businesses with Face Recognition. Available at https:
guidelines: 1) Limit the fingerprint to a few emotions that are clearly                     //www.kairos.com/.
expressed in the video. 2) Some emotions, such as fear, are more                       [12] Takeo Kanade, Jeffrey F Cohn, and Yingli Tian. 2000. Comprehensive database
difficult to detect than others, such as happiness. The emotion                             for facial expression analysis. In Proceedings Fourth IEEE International Conference
                                                                                            on Automatic Face and Gesture Recognition (Cat. No. PR00580). IEEE, 46–53.
probabilities from the facial recognition services are often much                      [13] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and
lower for the difficult emotions. This should be reflected in the                           Iain Matthews. 2010. The extended cohn-kanade dataset (ck+): A complete dataset
                                                                                            for action unit and emotion-specified expression. In 2010 IEEE Computer Society
values of the fingerprint. 3) Limit the comparison of expected and                          Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, 94–101.
expressed emotions to the key scenes of the movie. Recognized                          [14] Judith Masthoff. 2011. Group recommender systems: Combining individual
emotions during scenes without emotions might be due to other                               models. In Recommender systems handbook. Springer, 677–702.
                                                                                       [15] Noldus. 2019. FaceReader - Facial Expression Recognition Software. Available at
causes than the video.                                                                      https://www.noldus.com/human-behavior-research/products/facereader.
                                                                                       [16] Commom sense media. 2019. You know your kids. We know media and tech.
                                                                                            Together we can build a digital world where our kids can thrive. Available at
6    CONCLUSION                                                                             https://www.commonsensemedia.org/about-us/our-mission.
                                                                                       [17] Marko Tkalc̆ic̆, Andrej Kos̆ir, and Jurij Tasic̆. 2011. Affective recommender sys-
An Android app was developed to investigate if facial recognition                           tems: the role of emotions in recommender systems. In Joint proceedings of the
services can be used as a tool for automatic authentication, user                           RecSys 2011 Workshop on Human Decision Making in Recommender Systems (Deci-
profiling, and feedback gathering during video watching. The idea                           sions@RecSys’11) and User-Centric Evaluation of Recommender Systems and Their
                                                                                            Interfaces-2 (UCERSTI 2) affiliated with the 5th ACM Conference on Recommender.
is to use this feedback as input for a recommender systems. In                              9–13.
contrast to ratings, this feedback is available during content play-                   [18] Ming-Hsuan Yang, David J Kriegman, and Narendra Ahuja. 2002. Detecting
                                                                                            faces in images: A survey. IEEE Transactions on pattern analysis and machine
back. An evaluation with a test panel of 20 users showed that the                           intelligence 24, 1 (2002), 34–58.
authentication is almost perfect. Estimation of gender and age are
in most cases accurate enough to cope with the cold-start problem
by recommending movies typical for the user’s age and gender.
Facial analysis can be used to derive automatic feedback from the
user during video watching. Closed eyes, looking away (head pose,
attention level), covering eyes or mouth (occlusion), etc., are typical
indications that the user does not want to see the video, and can
be considered as negative implicit feedback for the recommender.
By emotion recognition and a comparison with an emotion fin-
gerprint, we calculated a user feedback value, which is positively
correlated to the user’s star rating. This indicates that recognized
emotions can be considered as valuable implicit feedback for the
recommender. Happiness can be most accurately detected. Taking
photos or making videos with the front-facing camera has been
expressed as a privacy-sensitive aspect by our test users and will
be further tackled in future research.

REFERENCES
 [1] Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, and Alexan-
     der Tuzhilin. 2005. Incorporating contextual information in recommender sys-
     tems using a multidimensional approach. ACM Transactions on Information
     Systems (TOIS) 23, 1 (2005), 103–145.
 [2] Ioannis Arapakis, Yashar Moshfeghi, Hideo Joho, Reede Ren, David Hannah, and
     Joemon M Jose. 2009. Integrating facial expressions into user profiling for the
     improvement of a multimodal recommender system. In 2009 IEEE International
     Conference on Multimedia and Expo. IEEE, 1440–1443.