Item Familiarity Effects in User-Centric Evaluations of Recommender Systems Dietmar Jannach Lukas Lerche Michael Jugovac TU Dortmund, Germany TU Dortmund, Germany TU Dortmund, Germany dietmar.jannach@tu- lukas.lerche@tu- michael.jugovac@tu- dortmund.de dortmund.de dortmund.de ABSTRACT parisons. One typical problem in such setups is that the rec- Laboratory studies are a common way of comparing recom- ommendation lists contain both movies that the users already mendation approaches with respect to different quality di- know and movies unknown to the participants. In the second mensions that might be relevant for real users. One typical case, additional information about the movies is often pro- experimental setup is to first present the participants with vided [3, 5] and users have to make their assessment based recommendation lists that were created with different algo- on plot summaries or movie trailers. This situation may in rithms and then ask the participants to assess these recom- turn lead to two possible effects. First, in case unknown mendations individually or to compare two item lists. The movies are displayed, the cognitive load for the participants cognitive effort required by the participants for the evalua- to assess, e.g., the suitability of the recommended movies, tion of item recommendations in such settings depends on is higher, which can result in a reduced overall satisfaction whether or not they already know the (features of the) rec- with the system. Second, an assessment like “Would I enjoy ommended items. Furthermore, lists containing popular and this movie?” based only on the meta-information or a trailer broadly known items are correspondingly easier to evaluate. could be an unreliable predictor of the assessment of a movie In this paper we report the results of a user study in which after a participant has actually watched it. To our knowl- participants recruited on a crowdsourcing platform assessed edge, how item familiarity can impact the users’ perception system-provided recommendations in a between-subjects ex- of a recommendation system in different dimensions has not perimental design. The results surprisingly showed that been discussed explicitly in the literature before. The study users found non-personalized recommendations of popular in [5] does not consider item familiarity as a factor; the au- items the best match for their preferences. An analysis re- thors of [2] cover item familiarity in their “novelty” construct vealed a measurable correlation between item familiarity but base it on the self-reported familiarity with the recom- and user acceptance. Overall, the observations indicate that mendation list as a whole and do not explicitly ask users to item familiarity can be a potential confounding factor in such indicate if they know the individual movies. studies and should be considered in experimental designs. 2. EXPERIMENT Keywords We conducted a user study1 in the style of [2] and [5]. The participants were first asked to rate a set of movies Recommender Systems; User-centric Evaluation; Bias known to them using a specifically designed web application based on MovieLens data. In the second step, they were pre- 1. INTRODUCTION sented with movie recommendations created with five differ- Studies with users in a controlled environment are a pow- ent algorithms including Matrix Factorization (Funk-SVD), erful means to assess qualities of a recommendation system Bayesian Personalized Ranking (BPR), SlopeOne, a content- which can often not be evaluated in offline experimental de- based technique (CB), and a non-personalized popularity- signs. A common setup in the research literature is that based baseline (PopRank). The participants had to rate the the participants of an experiment use a software tool that presented movies individually (based on meta-information) implements two or more variations of a certain recommen- and furthermore assessed the lists as a whole regarding fac- dation functionality. After interacting with the system, the tors like diversity, transparency, or surprise. For each pre- participants are asked to explicitly evaluate certain aspects sented movie, the users had to state if they already knew of the system, including, e.g., the suitability or the perceived the movie or not. The participants were recruited via Me- diversity of the recommendations or other aspects like the chanical Turk. From the 175 “Turkers” we filtered unreli- value of system-provided explanations [1, 2, 3, 5]. able ones through different automated and comparably strict In the recent studies presented in [2] and [5], the sub- measures. At the end 96 participants (about 20 per treat- jects were asked to assess the presented movie recommen- ment) were considered as being reliable. dations in dimensions such as diversity, novelty or perceived accuracy and the participants had to either evaluate lists of 3. OBSERVATIONS recommended movies individually or make side-by-side com- Accuracy. Fig. 1(a) shows how the participants an- swered the question how well the presented list of movies Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16-20, 1 2015, Austria, Vienna. Details are described in [4]. 1.0 0.8 0.6 0.4 0.2 0.0 (a) (b) Perceived diversity Surprise Transparency Figure 1: Self-reported preference match and avg. ratings. Figure 2: Perceived Diversity, Surprise and Transparency. as a whole matched their preferences; Fig. 1(b) displays the (“transparency”) in particular when popular items were pre- average rating assigned to the recommended movies. sented (both in a non-personalized and personalized way). To some surprise, the popularity-based method PopRank is User Acceptance. Figure 3 finally reports the average perceived by the users as the most accurate method, followed answers regarding user acceptance in terms of ease-of-use, by the BPR technique, which has a comparably strong bias intention-to-reuse, and intention to recommend the system to recommend popular items to everyone. Movie recommen- to a friend. “Ease of use” is generally high, but users had dations that contained only blockbusters – about 94% of the more trouble using the system (assigning ratings) when un- recommendations made by these two methods were known familiar movies were presented. The other two satisfaction to the users – were considered the best preference match for indicators in Fig. 3 are correlated with the assessment of the participants (significant at p 0.05). the preference match shown in Fig. 1. In comparison, recommendation lists that were actually 1.0 personalized and contained various niche items2 received 0.8 0.6 lower scores. The preference match is inversely related to 0.4 0.2 the number of items known to the user. Funk-SVD users 0.0 knew about 50% of the items and SlopeOne users even less. We contrasted these findings with an offline accuracy anal- Ease of use Intention to reuse Recommendation to a friend ysis of the underlying MovieLens dataset, which led to the expected superiority of the Funk-SVD method in terms of Figure 3: User Acceptance Results. precision and the RMSE. We then computed the accuracy of the different algorithms for those recommended items that were rated by the participants. We took individual measure- 4. DISCUSSION ments for the set of known and unknown movies for the algo- Recommending popular and familiar items has shown to rithms which did not only contain popular movies (Fig. 1). be a well-suited strategy in this user study to achieve high The “offline” accuracy measurement – except for SlopeOne – satisfaction with the system and the presented recommenda- shows to be a comparably good predictor for the movies that tions, even though in practice recommending only popular the users already knew (“seen”). When applied to the un- items is typically of limited value. seen movies, the predictions made by the algorithms largely Our preliminary study – experiments with more partici- deviate from the user’s ratings3 . pants, non-Turkers, and a more specific questionnaire focus- ing on item familiarity are still required – suggests that item Table 1: Offline accuracy analysis (RMSE). familiarity can be a possible confounding factor in user stud- ies. Specifically, lab experiments in which users are asked to RMSE CB Funk-SVD SlopeOne assess items unknown to them might have limited predictive MovieLens (offline) 1.92 1.65 1.72 power with respect to the true usability of the tested system. Survey, all 2.19 3.46 4.03 Survey, only not seen 2.90 4.55 4.45 5. REFERENCES Survey, only seen 1.87 1.99 2.59 [1] P. Cremonesi, F. Garzotto, and R. Turrin. User-centric % of seen movies 0.74 0.52 0.27 vs. system-centric evaluation of recommender systems. In Proc. INTERACT 2013, pages 334–351, 2013. Diversity, Surprise, Transparency. Fig. 2 shows the [2] M. D. Ekstrand, F. M. Harper, M. C. Willemsen, and averaged questionnaire answers regarding perceived diver- J. A. Konstan. User perception of differences in sity, surprise and transparency. Again, we see unexpected recommender algorithms. In Proc. RecSys ’14, pages results, in particular that the content-based (CB) recom- 161–168, 2014. mendations were perceived to be diverse. When measuring [3] F. Gedikli, D. Jannach, and M. Ge. How should I the inverse Intra-List-Similarity (ILS) of the recommenda- explain? A comparison of different explanation types tions using TF-IDF vectors of the movie descriptions, the for recommender systems. Int. J. Hum.-Comput. Stud., CB method as expected led to the lowest diversity, which 72(4):367–382, 2014. raises the question if the ILS measure is a suitable proxy for [4] D. Jannach, L. Lerche, and M. Jugovac. Item perceived diversity. The surprise factor for the popularity- familiarity as a possible confounding factor in biased methods was low, as expected. Finally, users felt user-centric recommender systems evaluation. i-com that they could understand the logic of the recommendations Journal for Interactive Media, 14(1):29–40, 2015. 2 In [2], unpopular items recommended by Funk-SVD were filtered. [5] A. Said, B. Fields, B. J. Jain, and S. Albayrak. 3 The absolute RMSE values are comparably high as the participants User-centric evaluation of a k-furthest neighbor only had to rate 15 items in the first phase. The observations for precision are comparable; RMSE values for PopRank and BPR are collaborative filtering recommender algorithm. In Proc. missing as these methods generate no rating predictions. CSCW ’13, pages 1399–1408, 2013.