Using facial recognition services as implicit feedback for recommenders

Toon De Pessemier, Ine Coppens, Luc Martens
imec - WAVES - Ghent University
toon.depessemier@ugent.be, ine.coppens@ugent.be, luc1.martens@ugent.be

ABSTRACT
User authentication and feedback gathering are crucial aspects of recommender systems. The most common implementations, a username / password login and star rating systems, require user interaction and a cognitive effort from the user. As a result, users opt to save their password in the interface, and optional feedback with a star rating system is often skipped, especially for applications such as video watching in a home environment. In this article, we propose an alternative method for user authentication based on facial recognition and an automatic feedback gathering method by detecting various face characteristics. Using facial recognition with a camera in a tablet, smartphone, or smart TV, the persons in front of the screen can be identified in order to link video watching sessions to their user profile. During video watching, implicit feedback is automatically gathered through emotion recognition, attention measurements, and behavior analysis. An emotion fingerprint, which is defined as a unique spectrum of expected emotions for a video scene, is compared to the recognized emotions in order to estimate the experience of a user while watching. An evaluation with a test panel showed that happiness can be most accurately detected and that the recognized emotions are correlated with the user's star rating.

CCS CONCEPTS
• Information systems → Information systems applications; Data analytics; Data mining.

KEYWORDS
Feedback, Emotion recognition, Facial analysis, Recommendation

ACM Reference Format:
Toon De Pessemier, Ine Coppens, and Luc Martens. 2019. Using facial recognition services as implicit feedback for recommenders. In IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 8 pages.

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 Sept 2019, Copenhagen, DK.

1 INTRODUCTION
Many video services generate personal recommendations for their customers to assist them in a content selection process that becomes more difficult with the abundance of available content. In the application domain of video watching, the content is often consumed simultaneously by multiple people (e.g., a family watching together) or the device is shared by multiple people (e.g., a tablet is used by multiple people of the family). Moreover, in the context of a household, people may join and leave the watching activity while the video is playing. However, classic recommender systems are not adjusted to this dynamic situation. Typically, recommendations are generated based on the profile of the individual who initiates the video session. Family profiles can be created, but do not take into account who is actually in front of the screen or changes in the number of spectators during the video watching. However, manually logging in each individual user, one by one, would be time consuming and user-unfriendly. The same issues apply to the feedback process. Explicit feedback is not requested separately for each individual. For implicit feedback, such as viewing time, it is unclear to whom this refers. Moreover, since star rating systems are often ignored by the user, an automatic implicit feedback system would be more suitable.

This article presents a more user-friendly and practical approach based on facial recognition to log in every viewer automatically and fetch their preferences to compose a dynamic group for group recommendations. These preferences are derived from their implicit feedback, which is gathered automatically by detecting various facial characteristics during past video watching sessions. We evaluated this implicit feedback gathering by using facial recognition services based on a dataset of photos as well as with a user test.
2 RELATED WORK
Face detection is the technique that locates the face of a person in a photo. It is the prerequisite of all facial analysis, and different approaches for the detection have been studied [4]. Facial recognition is the process of matching a detected face to a person who was previously detected by the system. In the study of Yang et al., this is also called face authentication and defined as the identification of an individual in a photo [18]. Related to this is the analysis of faces for the purpose of age detection and gender detection. Automatically detecting the gender and age group of the user (child, young-adult, adult, or senior) can be useful for initial profiling of the user. In this paper, various commercial services for gender and age detection are used: Microsoft's Facial Recognition Software: Face [3], Face++ [8], and Kairos [11]. Even more recognition services exist, such as FaceReader [15], but some are rather expensive or are not available as a web service that can be queried from a mobile device. So, the first research question of this study is: "How accurate are these commercial services for age detection and gender detection in view of an initial user profile for video watching?"

While watching video content or using an app or service in general, facial expressions of users might reveal their feelings about the content or their usage. In the field of psychology, the relationship between distinctive patterns of the facial muscles and particular emotions has been demonstrated to be universal across different cultures [6]. The psychologists conducted experiments in which they showed still photographs of faces to people from different cultures in order to determine whether the same facial behavior would be judged as the same emotion, regardless of the observers' culture. These studies demonstrated the recognizability of emotions (happiness, sadness, anger, fear, surprise, disgust, interest). Based on these concepts, facial expression recognition is described as the identification of these emotions. The automatic recognition of facial expressions, and especially emotions, enables the automatic exploitation of emotions for profiling and recommendation purposes. Therefore, the same three commercial services are used for facial expression recognition during video watching in this study.

Various researchers have investigated the role of emotions in recommender systems. Emotions can be used to improve the quality of recommender systems in three different stages [17]:
(1) The entry stage: when a user starts to use a content delivery system with or without recommendations, the user is in an affective state, the entry mood. The user's decision making process is influenced by this entry mood. A recommender can adapt the list of recommended items to the user's entry mood by considering this as contextual information [1].
(2) The consumption stage: after the user starts to consume content, the user experiences affective responses that are induced by the content [17]. Moreover, by automatic emotion detection from facial expressions, an affective profile of movie scenes can be constructed. Such an item profile structure labels changes of users' emotions through time, relative to the video timestamp [10].
(3) The exit stage: after the user has finished with the content consumption, the user is in the exit mood. The exit mood will influence the user's next decisions. If the user continues to use the content delivery system, the exit mood for the content just consumed is the entry mood for the next content to be consumed [17].

In this paper the focus is on the consumption stage. Users watch movies and their facial expressions are captured as a vector of emotions that change over time. The facial expressions, such as emotions, are used as an indicator of the user's satisfaction with the content. The assumption is that users appreciate a video if they sympathize with the video and express their emotions in accordance with the expected emotions. Therefore, the second research question of this study is: "Can facial expression recognition during video watching be used as an unobtrusive (implicit) feedback collection technique?"
3 METHOD
To facilitate human-computer interaction for video watching services, an Android application has been developed with the following three subsequent phases: 1) User authentication with an automated login procedure and user profiling (gender and age) based on facial recognition to identify all people who are in front of the screen. 2) Personalized recommendations (group recommendations in case multiple people are in front of the screen). 3) Automatic feedback gathering while the chosen video is playing. Using the front-facing camera of the tablet/smartphone or a camera connected to a smart TV, the app takes photos of all people in front of the screen and sends requests to different facial recognition services.

Figure 1 shows the data flow. The research focus of this article is on the first and the third phase. In the first phase, the goal is to identify and recognize each face in the photo. For new faces, age and gender will be detected to create an initial user profile. In the third phase, the photos will be used for emotion recognition, attention measurements, and behavior analysis in view of deriving automatic feedback. The second phase, offering personalized recommendations, is used to help users in the content selection process and to demonstrate the added value of facial expression recognition.

[Figure 1: Data flow - 3 phases: login, recommender, feedback. Recoverable labels: a picture of the user(s) as input; groups aggregated with "Average without misery".]

3.1 Phase 1: User authentication and profiling
Although facial recognition is often used to unlock smartphones automatically, applications in a group context, to identify multiple people simultaneously, are less common. In other words, facial recognition is used to answer the question: "who is in front of the screen?". In a real-world scenario, this can be several people, all of whom will be individually identified.

For the authentication of recurring users (who have been identified by our app in a previous session), our Android app uses Face Storage of the Microsoft service. This saves persons with their faces in a Person Group, which is trained based on the photos of the camera. This enables linking the user in front of the screen with one of the existing user profiles. For new users, the age and gender are estimated (Section 4.1). To cope with the cold-start problem, initial recommendations are based on these demographics.

In practice, user authentication and profiling work as follows. Using the app, users can log in by ensuring that their face is visible to the front-facing camera when they push the start button. A photo is taken that is used as input for the facial recognition services. Recurring users are logged in automatically; their existing profile (age, gender, and watching history) is retrieved, and the new photo is a new training sample for Face Storage. For new users, a profile is created based on their estimated age and gender. After every login, the age estimation is adjusted based on the new photo. This update can correct age estimations based on previous photos, but also takes into account the aging of users when using the system for multiple years. This is especially useful for children, who can get access to more content as they fulfill the minimum age requirements over time. Moreover, storing a photo for every session has the advantage that changes to the user's appearance (e.g., a different hairstyle) can be taken into account.
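To make the login flow concrete, the sketch below (in Python for readability) walks through the procedure described above: detect all faces in the login photo, identify recurring users against the stored Person Group, create a cold-start profile with estimated age and gender for unknown faces, and add every new photo as an extra training sample. The FaceService interface and the running-average age update are illustrative assumptions standing in for the Microsoft Face Storage calls used by the app, whose exact API is not shown in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class DetectedFace:
    face_id: str
    age: float    # age estimate returned by the recognition service
    gender: str   # gender estimate returned by the recognition service

class FaceService:
    """Stand-in for the remote recognition service (the app uses Microsoft's
    Face Storage / Person Groups); these method names are hypothetical."""
    def detect(self, photo: bytes) -> List[DetectedFace]: ...
    def identify(self, face_id: str, group_id: str) -> Optional[str]: ...
    def create_person(self, group_id: str, photo: bytes) -> str: ...
    def add_face(self, group_id: str, person_id: str, photo: bytes) -> None: ...

@dataclass
class UserProfile:
    person_id: str
    gender: str
    age_estimates: List[float] = field(default_factory=list)
    watch_history: List[str] = field(default_factory=list)

    @property
    def age(self) -> float:
        # one possible update rule: average the estimates of all login photos
        return sum(self.age_estimates) / len(self.age_estimates)

def login(photo: bytes, service: FaceService, group_id: str,
          profiles: Dict[str, UserProfile]) -> List[UserProfile]:
    """Identify every face in front of the screen and return their profiles."""
    session_users = []
    for face in service.detect(photo):
        person_id = service.identify(face.face_id, group_id)
        if person_id is None:                        # new user: cold-start profile
            person_id = service.create_person(group_id, photo)
            profiles[person_id] = UserProfile(person_id, face.gender)
        profile = profiles[person_id]
        profile.age_estimates.append(face.age)       # adjust the age after every login
        service.add_face(group_id, person_id, photo) # extra training sample
        session_users.append(profile)
    return session_users
```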
3.2 Phase 2: Group recommendations
Group recommendations are generated by aggregating individual user models (consisting of age, gender, ratings, and watching history), one for every user in front of the screen. From the eleven strategies proposed by Judith Masthoff [14], the "Average without misery" strategy was adopted in our group recommender algorithm. This strategy takes into account the (predicted) rating score of every user by calculating a group average, while avoiding misery by eliminating videos that are really hated by some group members, and therefore considered as unacceptable for the group. The Lenskit [7] recommendation framework was used to calculate these rating prediction scores and transform them into a Top-N recommendation list.
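The aggregation itself is small. Below is a minimal sketch of the "Average without misery" strategy on top of per-user rating predictions (e.g., obtained from Lenskit); the misery threshold of 2.0 is an illustrative assumption, since the paper only states that strongly disliked videos are eliminated.

```python
def average_without_misery(predicted_ratings, misery_threshold=2.0):
    """Group score per video: mean of the members' (predicted) ratings,
    where a video is dropped when any member rates it below the threshold.

    predicted_ratings: dict video_id -> {user_id: predicted rating}
    """
    group_scores = {}
    for video, ratings in predicted_ratings.items():
        if min(ratings.values()) < misery_threshold:
            continue                      # misery: unacceptable for the group
        group_scores[video] = sum(ratings.values()) / len(ratings)
    return group_scores

def top_n(group_scores, n=10):
    """Turn the group scores into a Top-N recommendation list."""
    return sorted(group_scores, key=group_scores.get, reverse=True)[:n]
```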
Besides personal preferences, other criteria, such as the age and historical viewing activities of the users, are taken into account. Age is modeled as classes of age ranges, firstly to filter out inappropriate content for minors, and secondly to estimate the ratings for cold-start users based on other users of the same class. We used the age ranges that are also used by IMDb: <18, 18-29, 30-44, 45+.

The age of the users is used to determine whether a video is suitable for the viewers. For every video, the advised minimum age is retrieved from the Common Sense Media website [16]. If at least one of the users is younger than this age threshold, the video is marked as unsuitable for the group, in line with the average without misery strategy. Likewise, if at least one of the users has already seen the video, it is considered unsuitable for the group, since this person probably does not want to see the video again.

If a new user is present in front of the screen, i.e., a cold-start user, user preferences for a movie are estimated based on demographics. An estimation of the user's age and gender, as provided by the facial recognition services, is used to find users with similar demographics. The preferences of that demographic group (age & gender class) are used to estimate the preferences of the cold-start user. For explicit ratings, for example, we use the mean rating of that demographic group for the movie, as reported by IMDb [9]. The mean rating provided by the demographic group is compared with the mean rating over all users for this specific movie. This difference (demographic group mean - global mean) indicates whether the movie is more or less suitable for a specific age/gender.
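A sketch of the two suitability filters and the demographic offset described above, under the assumption that the advised minimum ages, watch histories, and per-class mean ratings are available as simple lookups:

```python
def video_suitable(video, group, min_age_by_video, watch_history):
    """Extra 'without misery' filters: reject the video if any group member is
    below the advised minimum age (Common Sense Media) or has already seen it.

    group: list of (person_id, age) tuples
    watch_history: dict person_id -> set of video ids already seen
    """
    min_age = min_age_by_video[video]
    if any(age < min_age for _, age in group):
        return False
    if any(video in watch_history[person_id] for person_id, _ in group):
        return False
    return True

def demographic_offset(video, demo_class, demo_mean, global_mean):
    """Difference 'demographic group mean - global mean': positive values
    suggest the movie fits this age/gender class better than average."""
    return demo_mean[(video, demo_class)] - global_mean[video]
```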
3.3 Phase 3: Automatic feedback
Commercial services that perform emotion recognition, attention measurements, and behavior analysis are often based on the analysis of photos. Therefore, our Android app continuously takes photos of the users with the front-facing camera during video watching. Every second, a photo is taken and sent to the Microsoft recognition service for face detection and authentication in order to check if all viewers are still in front of the screen. Subsequently, for each identified face, the area of the photo containing the face is selected and the photo is cropped so that only one person's face is visible. Next, the cropped photo is sent to each of the three recognition services. Since photos are sent for every identified face, facial expressions will be recognized for all identified individuals in front of the screen.

For recognizing the emotions on the users' faces, the Microsoft service was used. But these recognized emotions cannot be directly used as implicit feedback [2], since different videos evoke different emotions. One can assume that users appreciate a video if they sympathize with the video and express their emotions in accordance with the expected emotions. E.g., during a comedy scene users may laugh ('happy' emotion), whereas during a horror scene 'fear' can be expected. Recognized emotions that are not expected might be due to external influences (e.g., other people in the room) or reflect contempt for the video (e.g., laughing at terrifying scenes of a horror movie). Therefore, unexpected emotions are not taken into account.

Thus, the similarity between the expressed emotions (= recognized emotions) and the expected emotions is calculated to determine the user's experience while watching the video. The expected emotions are based on the emotion fingerprint, which is defined as a unique spectrum of expected emotions for a video scene. For every second of the video, the emotion spectrum of the fingerprint specifies the probability value of each of the six possible emotions: anger, disgust, fear, happiness, sadness, and surprise. These emotion dimensions have been identified in the field of psychology [6]. So, the emotion fingerprint shows which emotions the video typically provokes among viewers at every second of the video. The emotion fingerprint is composed by aggregating emotions expressed by many users while watching this specific video. Section 4.4 explains in detail how the fingerprint of a video scene is computed, based on an example.

The distance between expressed and expected emotions is calculated as the Euclidean distance between the values of these two emotion spectra for every second i of the video and each emotion j. For the expressed emotions, the output of the Microsoft service is used in our online experiment (Section 4.4) because of the results of the offline evaluation (Section 4.3). The similarity between expressed and expected emotions is calculated as the inverse of the emotion distance, with an additional constant to avoid a division by zero.

emotionDistance = \sqrt{\sum_{i=0}^{n} \sum_{j=1}^{6} (expected_{i,j} - expressed_{i,j})^2}    (1)

emotionSimilarity = 1 / (1 + emotionDistance)    (2)
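Equations (1) and (2) translate directly into code. In the sketch below, both spectra are represented as one dictionary of emotion probabilities per second of the video; this representation is an assumption for illustration.

```python
import math

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def emotion_distance(expected, expressed):
    """Equation (1): Euclidean distance between the expected spectrum (the
    emotion fingerprint) and the expressed (recognized) emotions, summed over
    every second i of the video and every emotion j.
    Both arguments: list over seconds of dicts {emotion: probability}."""
    return math.sqrt(sum(
        (exp_t.get(e, 0.0) - obs_t.get(e, 0.0)) ** 2
        for exp_t, obs_t in zip(expected, expressed)
        for e in EMOTIONS))

def emotion_similarity(expected, expressed):
    """Equation (2): inverse of the distance, with +1 to avoid division by zero."""
    return 1.0 / (1.0 + emotion_distance(expected, expressed))
```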
Besides emotions, the attention level and user behavior are also analyzed during video watching as an additional implicit feedback mechanism. The Microsoft service has an additional interesting feature that recognizes occluded areas of the face. This occlusion is used to recognize negative feedback during video watching in case users respond to the video by holding their hands in front of their mouth or eyes (typical for shocking content).

Face++ is the only service that can detect closed eyes, which can be an indication of sleeping. The user's head pose is also derived from Face++. Although other services can recognize the head pose as well, the estimation of Face++ proved to be the most accurate. In case users do not want to see a scene (negative feedback), they might close their eyes or turn their head.

The Kairos recognition service offers a feature that represents the attention level of the user, which is estimated based on eye tracking and head pose. In our application, these behavioral aspects are combined into the overallAttention level by aggregating the service results over all photos taken during video watching. The overall attention level is calculated as the percentage of photos in which the user pays attention and the following conditions are met: Kairos' attention level > 0.5, both eyes open, no occlusion, and head pose angles between the margins: 30 degrees for the yaw angle and 20 degrees for the pitch angle. The assumption is that the user is not paying attention to the video if one of these conditions is not met.

overallAttention = #Photos(attention & eyes & noOcclusion & headPose) / #Photos    (3)

An implicitFeedbackScore on a scale ranging from 0 to 10 is calculated by aggregating the different facial analysis features. The similarity with the expected emotions contributes six points out of ten; the overall attention level counts for the remaining four points.

implicitFeedbackScore = 6 · emotionSimilarity + 4 · overallAttention    (4)
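The per-photo conditions and the two scores of equations (3) and (4) can be sketched as follows; the field names of the per-photo record (Kairos attention value, eye and occlusion flags, head pose angles) are assumptions about how the service responses are stored.

```python
YAW_LIMIT, PITCH_LIMIT = 30.0, 20.0   # degrees, as in the paper

def pays_attention(photo):
    """Per-photo attention test combining the features of the three services.
    `photo` is assumed to be a dict with the fields used below."""
    return (photo["kairos_attention"] > 0.5            # Kairos attention level
            and photo["left_eye_open"] and photo["right_eye_open"]
            and not photo["occluded"]                   # occlusion (Microsoft)
            and abs(photo["yaw"]) <= YAW_LIMIT          # head pose (Face++)
            and abs(photo["pitch"]) <= PITCH_LIMIT)

def overall_attention(photos):
    """Equation (3): fraction of photos in which all conditions are met."""
    return sum(pays_attention(p) for p in photos) / len(photos)

def implicit_feedback_score(emo_similarity, attention):
    """Equation (4): score on a 0-10 scale; emotions count for 6 points,
    attention for the remaining 4."""
    return 6 * emo_similarity + 4 * attention
```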
4 EVALUATION
Evaluations of commercial facial recognition services have been performed in the literature, but are typically based on datasets with high-quality photos that enable an accurate recognition: sufficiently illuminated, no shadow or reflections, high resolution, and a perfect position of the face in the middle of the photo [2, 5]. In contrast, for facial recognition and analysis during (mobile) video watching, the front-facing camera of the device is used without flash, which does not always yield ideal photos.

Therefore, we evaluated the three facial recognition services in an offline test (based on a publicly available dataset of photos) in Section 4.3 as well as in an online setting (with real users). For the evaluation of the age & gender estimation (Section 4.1), 46 users with ages ranging from 0 to 66 were involved in our test. For the evaluation of the attention level (Section 4.2), we used 76 photos of our test users with different levels of attention. The evaluation of emotion recognition during video playback (Section 4.4) requires more time from the user and was therefore performed by only 20 users. Since the focus of this study was on age & gender estimation and emotion recognition, the group recommendations were not evaluated, and all users used the app alone.

The overall aim of this study is to improve the user friendliness of devices for video watching in the living room. This evaluation is the first step towards the future goal of multi-user recognition and is therefore carried out with a tablet, in a rather controlled environment, with one person at a time. During the test, photos of the test user were taken with the front-facing camera of the tablet (Samsung Galaxy Tab A6). If the tablet had captured two people in the photo, the recognition process would have been performed for both recognized faces.

To have a realistic camera angle, the users were asked to hold the tablet in front of them, as they would usually do for watching a video. The users were sitting on a chair and the room was sufficiently illuminated. However, no guidelines were provided regarding their behavior, head position, or attention; e.g., nothing was said about looking away or closing eyes. The photos taken with the front-facing camera are used as input for the recognition services.

4.1 Age & gender estimation
Firstly, the authentication was evaluated: recognizing the user who used the app in the past. The automatic authentication of the 46 users (login process) proved to be very accurate: 4 undetected faces with Kairos (9%), 2 with Microsoft and Face++ (4%).

Subsequently, for the recognized faces, the services were used to estimate the users' age and gender based on photos of the test users taken while holding the tablet. The estimated age and gender, as provided by the recognition services, were compared to the people's real age and gender. Figure 2 shows the differences between estimated and real age, sorted according to the real age of the users. The largest errors were obtained for estimating the age of children. Kairos and Face++ typically estimate the children to be older than they are. Table 1 reports the number of photos for which a detection was not possible, the average absolute age error, the median age error, and the percentage of photos for which the gender estimation was wrong.

[Figure 2: Age estimation using facial recognition services. Difference between estimated and real age (in years) per test person, sorted by real age, for Kairos, Microsoft, and Face++.]

Table 1: Evaluation of gender & age estimation

                     Kairos   Microsoft   Face++   Aggregation
Detection failed     4        2           2        2
Avg abs. age error   8.88     4.31        13.14    7.91
Median age error     6.0      2.9         11.0     8.1
Gender error (%)     11.9     15.9        13.6     11.3

The three facial recognition services were compared with a hybrid solution that aggregates the results of the three. For the age estimation, the aggregation is the average of the results of the three services. For the gender, the three gender estimations are aggregated using a voting process. For each photo, the gender with the most votes is the result of the aggregation. This voting aggregation proved to be more reliable than each individual service for estimating gender. Microsoft has the best results in the age test, so we decided to use only this service to estimate the user's age. For the gender estimation, the aggregation method is used.

So as an answer to the first research question, we can say that the facial recognition services provide an accurate age and gender detection for creating an initial profile for a cold-start user.
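The hybrid aggregation described above amounts to an average for age and a majority vote for gender, as in this small sketch (None marks a service that failed to detect the face):

```python
from collections import Counter

def aggregate_age(estimates):
    """Hybrid age estimate: average of the per-service estimates
    (Kairos, Microsoft, Face++) that returned a detection."""
    valid = [a for a in estimates if a is not None]
    return sum(valid) / len(valid) if valid else None

def aggregate_gender(estimates):
    """Hybrid gender estimate: majority vote over the services."""
    valid = [g for g in estimates if g is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None

# e.g. aggregate_age([23.0, 27.5, 31.0]) -> about 27.2
#      aggregate_gender(["female", "female", "male"]) -> "female"
```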
4.2 Attention level offline
The features that constitute the attention score of the user (equation 3) are evaluated based on a dataset that we created with photos of the users taken during the test. In addition, some photos were added for which users were explicitly asked to cover part of their face. The photos were manually annotated with the features (e.g., eyes closed or not) to obtain the ground truth. The result was a dataset of 76 photos with a focus on these attention features (e.g., multiple users covering their eyes, mouth, etc.).

Table 2 shows the percentage of correctly recognized photos for each attention feature. However, not all attention features are available for the three services. Features that are not available are indicated with N/A.

Table 2: Attention level: percentage correctly recognized

                                  Kairos    Microsoft   Face++
Covering eyes                     N/A       97.37%      N/A
Covering mouth                    N/A       94.74%      N/A
Covering forehead                 N/A       98.68%      N/A
Closed eyes                       N/A       N/A         97.37%
Attention                         82.97%    N/A         N/A
Head pose attention               60.53%    N/A         72.37%
No detection: Face turned away    11.84%    7.89%       2.36%

Face++ provides two probability values for closed eyes (for the left and right eye). If both values have a probability of 40% or more, we consider this as "Closed eyes". Kairos estimates the attention of the user and expresses this with a value between 0 (no attention at all) and 1 (full attention). The Kairos attention feature is based on eye tracking and head pose. To convert this to a binary value (attention or not), we used a threshold of 0.5. Kairos and Face++ can recognize the head pose of the user. If the head position is outside the boundaries (30 degrees for the yaw angle and 20 degrees for the pitch angle), we interpret this as "head turned away and not paying attention". The estimation of Face++ is more accurate than that of Kairos. Therefore, the head pose specified by Face++ is used in the app.

If the face is turned away too much from the camera or a large part of the face is covered, then face detection might fail. The percentage of "no detections" is also indicated in Table 2. Remember that this dataset was created with a focus on the attention level. For many photos, users were explicitly asked to turn their head away. Therefore, the number of no detections is rather high.
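The threshold rules above convert the raw service outputs into the binary attention features; a sketch follows, where the 0-100 scale assumed for the Face++ closed-eye probabilities is inferred from the 40% rule.

```python
def closed_eyes(left_eye_close_prob, right_eye_close_prob, threshold=40.0):
    """Face++ reports a closed-eye probability per eye (assumed 0-100 here);
    the photo is labelled 'Closed eyes' when both values are at least 40%."""
    return left_eye_close_prob >= threshold and right_eye_close_prob >= threshold

def head_turned_away(yaw, pitch, yaw_limit=30.0, pitch_limit=20.0):
    """Head pose (Face++) outside the margins is read as 'not paying attention'."""
    return abs(yaw) > yaw_limit or abs(pitch) > pitch_limit

def kairos_attentive(attention_value, threshold=0.5):
    """Kairos attention in [0, 1] converted to a binary attention flag."""
    return attention_value > threshold
```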
4.3 Emotion recognition offline
The emotion recognition ability of the three facial recognition services was evaluated using the Cohn-Kanade dataset [12, 13], which contains photos of people showing different emotions, evolving from neutral to a very explicit emotion. Six photo sets with very explicit emotions (one set for each emotion) are used as input for the facial recognition services. The output of the recognition services is a vector of 6 values, one for each emotion. For Kairos and Face++, these output values range from 0 (meaning this emotion has not been recognized at all) to 100 (meaning this emotion has been recognized with great certainty). For the Microsoft service, the output values range from 0 to 1 (with the same interpretation).

Figure 3 shows for each of the six photo sets how the emotions are recognized by the services. The emotion values are shown on the Y-axis for each photo set that was used as input (photo index on the X-axis). Each recognized emotion has a different color. For a specific photo set, the ideal emotion recognition should result in the detection of only one emotion, with a value of 1 for Microsoft and 100 for Kairos and Face++, while the other emotion values are 0. For a limited number of photos, the person's face could not be detected. This resulted in no output of the service. Therefore, not all indices have an emotion value in the graphs of Kairos. In general, the results clearly show that some emotions, such as happiness and surprise, are easier to detect with a high certainty, whereas other emotions, such as fear, are more difficult to detect and can easily be confused. Although the people in these photos are expressively showing their emotions, the automatic recognition of these emotions is not yet perfect.

[Figure 3: Output of the recognition services: recognized emotions in photos of people expressing emotions. One panel per emotion (anger, fear, happiness, surprise, sadness, disgust) and per service (Kairos, Microsoft, Face++); Y-axis: recognized emotion values (0-100 for Kairos and Face++, 0-1 for Microsoft), X-axis: photo index.]

Anger is accurately recognized by Kairos and Microsoft, whereas Face++ confuses anger with disgust and sadness for some photos. Fear is the most difficult to detect: Kairos detects fear in most photos, but Microsoft and Face++ sometimes incorrectly recognize sadness and disgust. Happiness is very accurately detected by all three services. With the Microsoft service, the results are almost perfect: only happiness is detected and no other emotions. Surprise is also very well recognized by all three services, with high emotion values. Sadness is recognized for most photos, but in comparison to happiness and surprise, the emotion values are lower. This indicates that sadness is less clearly recognizable for emotion recognition services. Disgust is sometimes confused with anger, but Microsoft and Face++ rightly assign a much lower value to anger for most photos.

In conclusion, the comparison between the recognized emotions and the true emotion labels of the photos revealed that the Microsoft service has the most accurate emotion recognition. Therefore, the Microsoft service was chosen as the solution for emotion recognition in Section 4.4. The evaluation based on the Cohn-Kanade dataset also indicated that - even with the most explicit emotion photos - anger, disgust, and fear are always detected with a low probability value. Happiness can be detected with high probability values. So, happiness can be considered as the emotion that is rather easy to detect with a high confidence, whereas anger, disgust, and fear are much harder to detect.
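Since the services report their emotion vectors on different scales (0-100 for Kairos and Face++, 0-1 for Microsoft), a small normalization step is needed before the outputs can be compared; the dictionary representation below is an assumption for illustration.

```python
def normalize_emotions(raw, scale):
    """Bring a service's 6-value emotion vector onto a common 0-1 scale.
    Kairos and Face++ report values in 0-100, Microsoft already in 0-1."""
    return {emotion: value / scale for emotion, value in raw.items()}

def dominant_emotion(spectrum):
    """Emotion with the highest (normalized) value; ideal recognition would
    give one emotion a value of 1 and all others 0."""
    return max(spectrum, key=spectrum.get)

# e.g. dominant_emotion(normalize_emotions({"happiness": 87.0, "fear": 4.0}, 100.0))
#      -> "happiness"
```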
4.4 Emotion recognition online
Emotion recognition as a tool for gathering automatic feedback was evaluated with a test panel consisting of 20 users between the ages of 5 and 72. During the test, each user watched six videos on a tablet. For each of the six basic emotions, one characteristic video was chosen (e.g., for happiness a comedy, for fear a scary horror movie, etc.). During video watching, the front-facing camera continuously took photos that were analyzed, and for which an emotion score (based on equations 1 and 2), an overall attention score (equation 3), and an implicit feedback score based on a complete facial analysis (equation 4) were calculated.

The emotion fingerprint of the video was obtained by aggregating the expressed emotions over all the test users. Figure 4 gives an example of this aggregation for a comedy video (a scene from the movie "Dude, Where's My Car?"). The emotion signal of the fingerprint is the average emotion value over all users at each second of the video. Because of the aggregation of emotions of multiple test persons, the emotion fingerprint was constructed after the user test. Subsequently, irrelevant emotion values are removed and only the most dominant emotions are retained (e.g., happiness, surprise, and sadness in this comedy movie). Key scenes of the video that may provoke emotions are manually selected. During periods of the video without expressive emotions, the fingerprint values are set to zero. During these periods, we assume that the emotions recognized from the users' faces are due to external factors. As visible in Figure 5, the video contains no emotional scene from second 0 until 30. Next, the fluctuations of the emotion signal are reduced by using the maximum observed emotion value over a time window of 10 seconds. This takes into account that an expression of emotions typically takes multiple seconds. Figure 5 shows an example of a resulting emotion fingerprint. We consider this emotion fingerprint as the expected emotion spectrum for the specific video.

[Figure 4: Emotion values aggregated over all test users. Emotion probability over time (s) for anger, fear, happiness, surprise, sadness, and disgust.]

[Figure 5: The emotion fingerprint based on the aggregated emotions. Emotion probability over time (s) for the retained emotions.]
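A sketch of this fingerprint construction is given below, assuming the recognized emotion values have already been collected per user and per second. The paper does not specify whether the 10-second maximum is taken over a trailing or a centered window; a trailing window is used here.

```python
def build_fingerprint(per_user_emotions, key_seconds, retained_emotions, window=10):
    """Construct an emotion fingerprint:
    1) average the recognized emotion values over all test users per second,
    2) keep only the dominant emotions and the manually selected key scenes
       (all other fingerprint values are set to zero),
    3) smooth by taking the maximum value within a trailing 10-second window.

    per_user_emotions: list (users) of lists (seconds) of {emotion: value}
    key_seconds: set of seconds belonging to manually selected key scenes
    """
    n_users = len(per_user_emotions)
    n_seconds = len(per_user_emotions[0])

    # 1) per-second average over all users
    averaged = [
        {e: sum(user[t].get(e, 0.0) for user in per_user_emotions) / n_users
         for e in retained_emotions}
        for t in range(n_seconds)
    ]
    # 2) zero out the seconds outside the key scenes
    pruned = [avg if t in key_seconds else {e: 0.0 for e in retained_emotions}
              for t, avg in enumerate(averaged)]
    # 3) maximum over a trailing window to reduce short fluctuations
    return [
        {e: max(pruned[s][e] for s in range(max(0, t - window + 1), t + 1))
         for e in retained_emotions}
        for t in range(n_seconds)
    ]
```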
To discuss the results, we elaborate on the emotion spectrum of three users of the test. Figure 6 shows the expressed emotions of users 3, 4, and 13 while watching the comedy video. The expressed emotions of users 4 and 13 clearly show some similarities with the emotion fingerprint. Happiness is the most dominant emotion, but there are also some sad and surprising aspects in the movie. The video contains the most expressive emotions (funny scene) from second 30 onwards, which is visible in the expressed emotions of users 4 and 13.

[Figure 6: Emotions expressed by 3 users during video watching. Users 4 and 13 like the video, user 3 doesn't. Emotion probability over time (s) per user.]

The explicit ratings for the video of users 3, 4, and 13 were respectively 3, 6, and 6.5 stars on a scale from 1 to 10. The low explicit rating of user 3 is reflected in the emotion values of this user (implicit feedback), which are significantly lower than those of the other users.

For the test with 20 users, we achieved a significant positive correlation of 0.37 between the explicit rating given by the user and the similarity between the user's expressed emotions and the expected emotion fingerprint (equation 2). Since the rating process and emotion recognition are characterized by a lot of noise, the correlation between both will never be very high. However, the positive correlation indicates that expressed emotions clearly are a form of implicit feedback that can be used as input for a recommender system. Moreover, we expect that the correlation might improve if users watch full movies or TV shows instead of movie trailers, as in our user test. Therefore, we can consider the recognized emotions as a valid alternative feedback method in case ratings are not available, or as a feedback method 'during' content consumption instead of 'after' finishing the consumption. This answers our second research question.

Besides the emotion score, we also studied the implicit feedback score (equation 4), which is the combination of the emotion and attention scores. However, the variation in the attention score was limited for our user test, since all trailers are rather short (2-3 minutes). We suspect that the duration of the trailers is too short to build up intense emotional moments that make users inclined to cover their eyes or mouth. Moreover, the trailers are too short to witness a decreasing level of attention (e.g., falling asleep). Therefore, we expect that the attention score and implicit feedback score might be better suited as implicit feedback for content items with a longer duration.
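The correlation reported above is a plain Pearson coefficient between the explicit star ratings and the per-user emotion similarities of equation (2). A dependency-free version is sketched below; scipy.stats.pearsonr would give the same coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation, e.g. between explicit star ratings and the emotion
    similarity of equation (2); the paper reports r = 0.37 for 20 users."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. pearson(star_ratings, [emotion_similarity(fingerprint, e) for e in expressed_per_user])
```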
5 DISCUSSION
During the user test, it became clear that people do not express their emotions much during video watching, even when the videos contain scenes with intense emotions, as selected in our test. Happiness is expressed most clearly and is the only emotion that reached the maximum probability value of 1, e.g., for person 13 as visible in Figure 6. For the other basic emotions, the recognition services typically register probabilities that are much lower. The second most recognizable emotion was sadness. It has a maximum value over all users of 0.68, with only 15% of the test users scoring a sadness value of 0.60 or higher (for the sad video). For fear, the maximum registered value over all test users was only 0.27 (during the fearful video). Fear is the most difficult emotion to recognize, as was also discussed in the offline test.

For this experiment, the emotion fingerprint was constructed by aggregating the emotion values of all users. A big challenge is to identify the correct expected emotions and their probability values for the fingerprint spectrum. For this, we propose the following guidelines: 1) Limit the fingerprint to a few emotions that are clearly expressed in the video. 2) Some emotions, such as fear, are more difficult to detect than others, such as happiness. The emotion probabilities from the facial recognition services are often much lower for the difficult emotions. This should be reflected in the values of the fingerprint. 3) Limit the comparison of expected and expressed emotions to the key scenes of the movie. Recognized emotions during scenes without emotions might be due to causes other than the video.

6 CONCLUSION
An Android app was developed to investigate if facial recognition services can be used as a tool for automatic authentication, user profiling, and feedback gathering during video watching. The idea is to use this feedback as input for a recommender system. In contrast to ratings, this feedback is available during content playback. An evaluation with a test panel of 20 users showed that the authentication is almost perfect. Estimations of gender and age are in most cases accurate enough to cope with the cold-start problem by recommending movies typical for the user's age and gender.

Facial analysis can be used to derive automatic feedback from the user during video watching. Closed eyes, looking away (head pose, attention level), covering eyes or mouth (occlusion), etc., are typical indications that the user does not want to see the video, and can be considered as negative implicit feedback for the recommender. By emotion recognition and a comparison with an emotion fingerprint, we calculated a user feedback value, which is positively correlated with the user's star rating. This indicates that recognized emotions can be considered as valuable implicit feedback for the recommender. Happiness can be most accurately detected. Taking photos or making videos with the front-facing camera has been flagged as a privacy-sensitive aspect by our test users and will be further tackled in future research.

REFERENCES
[1] Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin. 2005. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems (TOIS) 23, 1 (2005), 103–145.
[2] Ioannis Arapakis, Yashar Moshfeghi, Hideo Joho, Reede Ren, David Hannah, and Joemon M Jose. 2009. Integrating facial expressions into user profiling for the improvement of a multimodal recommender system. In 2009 IEEE International Conference on Multimedia and Expo. IEEE, 1440–1443.
[3] Microsoft Azure. 2019. Face API - Facial Recognition Software. Available at https://azure.microsoft.com/en-us/services/cognitive-services/face/.
[4] Mayank Chauhan and Mukesh Sakle. 2014. Study & analysis of different face detection techniques. International Journal of Computer Science and Information Technologies 5, 2 (2014), 1615–1618.
[5] Toon De Pessemier, Damien Verlee, and Luc Martens. 2016. Enhancing recommender systems for TV by face recognition. In 12th International Conference on Web Information Systems and Technologies (WEBIST 2016), Vol. 2. 243–250.
[6] Paul Ekman and Wallace V Friesen. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology 17, 2 (1971), 124.
[7] Michael D. Ekstrand. 2018. The LKPY Package for Recommender Systems Experiments. Computer Science Faculty Publications and Presentations 147. Boise State University. https://doi.org/10.18122/cs_facpubs/147/boisestate
[8] Face++. 2019. Cognitive Services - Leading Facial Recognition Technology. Available at https://www.faceplusplus.com/.
[9] IMDb. 2019. Ratings and reviews for new movies and TV shows. Available at https://www.imdb.com/.
[10] Hideo Joho, Joemon M Jose, Roberto Valenti, and Nicu Sebe. 2009. Exploiting facial expressions for affective video summarisation. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 31.
[11] Kairos. 2019. Serving Businesses with Face Recognition. Available at https://www.kairos.com/.
[12] Takeo Kanade, Jeffrey F Cohn, and Yingli Tian. 2000. Comprehensive database for facial expression analysis. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580). IEEE, 46–53.
[13] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. 2010. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops. IEEE, 94–101.
[14] Judith Masthoff. 2011. Group recommender systems: Combining individual models. In Recommender Systems Handbook. Springer, 677–702.
[15] Noldus. 2019. FaceReader - Facial Expression Recognition Software. Available at https://www.noldus.com/human-behavior-research/products/facereader.
[16] Common Sense Media. 2019. You know your kids. We know media and tech. Together we can build a digital world where our kids can thrive. Available at https://www.commonsensemedia.org/about-us/our-mission.
[17] Marko Tkalčič, Andrej Košir, and Jurij Tasič. 2011. Affective recommender systems: the role of emotions in recommender systems. In Joint Proceedings of the RecSys 2011 Workshop on Human Decision Making in Recommender Systems (Decisions@RecSys'11) and User-Centric Evaluation of Recommender Systems and Their Interfaces-2 (UCERSTI 2), affiliated with the 5th ACM Conference on Recommender Systems. 9–13.
[18] Ming-Hsuan Yang, David J Kriegman, and Narendra Ahuja. 2002. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 1 (2002), 34–58.