Item Familiarity Effects in User-Centric Evaluations
                       of Recommender Systems

                    Dietmar Jannach                                        Lukas Lerche                              Michael Jugovac
                  TU Dortmund, Germany                               TU Dortmund, Germany                         TU Dortmund, Germany
                 dietmar.jannach@tu-                                    lukas.lerche@tu-                         michael.jugovac@tu-
                     dortmund.de                                          dortmund.de                                dortmund.de

ABSTRACT                                                                               parisons. One typical problem in such setups is that the rec-
Laboratory studies are a common way of comparing recom-                                ommendation lists contain both movies that the users already
mendation approaches with respect to different quality di-                             know and movies unknown to the participants. In the second
mensions that might be relevant for real users. One typical                            case, additional information about the movies is often pro-
experimental setup is to first present the participants with                           vided [3, 5] and users have to make their assessment based
recommendation lists that were created with different algo-                            on plot summaries or movie trailers. This situation may in
rithms and then ask the participants to assess these recom-                            turn lead to two possible effects. First, in case unknown
mendations individually or to compare two item lists. The                              movies are displayed, the cognitive load for the participants
cognitive effort required by the participants for the evalua-                          to assess, e.g., the suitability of the recommended movies,
tion of item recommendations in such settings depends on                               is higher, which can result in a reduced overall satisfaction
whether or not they already know the (features of the) rec-                            with the system. Second, an assessment like “Would I enjoy
ommended items. Furthermore, lists containing popular and                              this movie?” based only on the meta-information or a trailer
broadly known items are correspondingly easier to evaluate.                            could be an unreliable predictor of the assessment of a movie
   In this paper we report the results of a user study in which                        after a participant has actually watched it. To our knowl-
participants recruited on a crowdsourcing platform assessed                            edge, how item familiarity can impact the users’ perception
system-provided recommendations in a between-subjects ex-                              of a recommendation system in different dimensions has not
perimental design. The results surprisingly showed that                                been discussed explicitly in the literature before. The study
users found non-personalized recommendations of popular                                in [5] does not consider item familiarity as a factor; the au-
items the best match for their preferences. An analysis re-                            thors of [2] cover item familiarity in their “novelty” construct
vealed a measurable correlation between item familiarity                               but base it on the self-reported familiarity with the recom-
and user acceptance. Overall, the observations indicate that                           mendation list as a whole and do not explicitly ask users to
item familiarity can be a potential confounding factor in such                         indicate if they know the individual movies.
studies and should be considered in experimental designs.
                                                                                       2.       EXPERIMENT
Keywords                                                                                 We conducted a user study1 in the style of [2] and [5].
                                                                                       The participants were first asked to rate a set of movies
Recommender Systems; User-centric Evaluation; Bias
                                                                                       known to them using a specifically designed web application
                                                                                       based on MovieLens data. In the second step, they were pre-
1.     INTRODUCTION                                                                    sented with movie recommendations created with five differ-
   Studies with users in a controlled environment are a pow-                           ent algorithms including Matrix Factorization (Funk-SVD),
erful means to assess qualities of a recommendation system                             Bayesian Personalized Ranking (BPR), SlopeOne, a content-
which can often not be evaluated in offline experimental de-                           based technique (CB), and a non-personalized popularity-
signs. A common setup in the research literature is that                               based baseline (PopRank). The participants had to rate the
the participants of an experiment use a software tool that                             presented movies individually (based on meta-information)
implements two or more variations of a certain recommen-                               and furthermore assessed the lists as a whole regarding fac-
dation functionality. After interacting with the system, the                           tors like diversity, transparency, or surprise. For each pre-
participants are asked to explicitly evaluate certain aspects                          sented movie, the users had to state if they already knew
of the system, including, e.g., the suitability or the perceived                       the movie or not. The participants were recruited via Me-
diversity of the recommendations or other aspects like the                             chanical Turk. From the 175 “Turkers” we filtered unreli-
value of system-provided explanations [1, 2, 3, 5].                                    able ones through different automated and comparably strict
   In the recent studies presented in [2] and [5], the sub-                            measures. At the end 96 participants (about 20 per treat-
jects were asked to assess the presented movie recommen-                               ment) were considered as being reliable.
dations in dimensions such as diversity, novelty or perceived
accuracy and the participants had to either evaluate lists of                          3.       OBSERVATIONS
recommended movies individually or make side-by-side com-
                                                                                         Accuracy. Fig. 1(a) shows how the participants an-
                                                                                       swered the question how well the presented list of movies
Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16-20,
                                                                                       1
2015, Austria, Vienna.                                                                     Details are described in [4].
                                                                     1.0
                                                                     0.8
                                                                     0.6
                                                                     0.4
                                                                     0.2
                                                                     0.0


        (a)                               (b)                              Perceived diversity         Surprise               Transparency


Figure 1: Self-reported preference match and avg. ratings.           Figure 2: Perceived Diversity, Surprise and Transparency.

as a whole matched their preferences; Fig. 1(b) displays the         (“transparency”) in particular when popular items were pre-
average rating assigned to the recommended movies.                   sented (both in a non-personalized and personalized way).
   To some surprise, the popularity-based method PopRank is             User Acceptance. Figure 3 finally reports the average
perceived by the users as the most accurate method, followed         answers regarding user acceptance in terms of ease-of-use,
by the BPR technique, which has a comparably strong bias             intention-to-reuse, and intention to recommend the system
to recommend popular items to everyone. Movie recommen-              to a friend. “Ease of use” is generally high, but users had
dations that contained only blockbusters – about 94% of the          more trouble using the system (assigning ratings) when un-
recommendations made by these two methods were known                 familiar movies were presented. The other two satisfaction
to the users – were considered the best preference match for         indicators in Fig. 3 are correlated with the assessment of
the participants (significant at p 0.05).                            the preference match shown in Fig. 1.
   In comparison, recommendation lists that were actually            1.0
personalized and contained various niche items2 received             0.8
                                                                     0.6
lower scores. The preference match is inversely related to           0.4
                                                                     0.2
the number of items known to the user. Funk-SVD users                0.0

knew about 50% of the items and SlopeOne users even less.
   We contrasted these findings with an offline accuracy anal-                Ease of use        Intention to reuse   Recommendation to a friend
ysis of the underlying MovieLens dataset, which led to the
expected superiority of the Funk-SVD method in terms of                              Figure 3: User Acceptance Results.
precision and the RMSE. We then computed the accuracy of
the different algorithms for those recommended items that
were rated by the participants. We took individual measure-
                                                                     4.    DISCUSSION
ments for the set of known and unknown movies for the algo-             Recommending popular and familiar items has shown to
rithms which did not only contain popular movies (Fig. 1).           be a well-suited strategy in this user study to achieve high
The “offline” accuracy measurement – except for SlopeOne –           satisfaction with the system and the presented recommenda-
shows to be a comparably good predictor for the movies that          tions, even though in practice recommending only popular
the users already knew (“seen”). When applied to the un-             items is typically of limited value.
seen movies, the predictions made by the algorithms largely             Our preliminary study – experiments with more partici-
deviate from the user’s ratings3 .                                   pants, non-Turkers, and a more specific questionnaire focus-
                                                                     ing on item familiarity are still required – suggests that item
         Table 1: Offline accuracy analysis (RMSE).                  familiarity can be a possible confounding factor in user stud-
                                                                     ies. Specifically, lab experiments in which users are asked to
    RMSE                     CB      Funk-SVD      SlopeOne          assess items unknown to them might have limited predictive
    MovieLens (offline)      1.92       1.65          1.72           power with respect to the true usability of the tested system.
    Survey, all              2.19       3.46          4.03
    Survey, only not seen    2.90       4.55          4.45           5.    REFERENCES
    Survey, only seen        1.87       1.99          2.59           [1] P. Cremonesi, F. Garzotto, and R. Turrin. User-centric
    % of seen movies         0.74       0.52          0.27               vs. system-centric evaluation of recommender systems.
                                                                         In Proc. INTERACT 2013, pages 334–351, 2013.
   Diversity, Surprise, Transparency. Fig. 2 shows the               [2] M. D. Ekstrand, F. M. Harper, M. C. Willemsen, and
averaged questionnaire answers regarding perceived diver-                J. A. Konstan. User perception of differences in
sity, surprise and transparency. Again, we see unexpected                recommender algorithms. In Proc. RecSys ’14, pages
results, in particular that the content-based (CB) recom-                161–168, 2014.
mendations were perceived to be diverse. When measuring              [3] F. Gedikli, D. Jannach, and M. Ge. How should I
the inverse Intra-List-Similarity (ILS) of the recommenda-               explain? A comparison of different explanation types
tions using TF-IDF vectors of the movie descriptions, the                for recommender systems. Int. J. Hum.-Comput. Stud.,
CB method as expected led to the lowest diversity, which                 72(4):367–382, 2014.
raises the question if the ILS measure is a suitable proxy for       [4] D. Jannach, L. Lerche, and M. Jugovac. Item
perceived diversity. The surprise factor for the popularity-             familiarity as a possible confounding factor in
biased methods was low, as expected. Finally, users felt                 user-centric recommender systems evaluation. i-com
that they could understand the logic of the recommendations              Journal for Interactive Media, 14(1):29–40, 2015.
2
 In [2], unpopular items recommended by Funk-SVD were filtered.      [5] A. Said, B. Fields, B. J. Jain, and S. Albayrak.
3
 The absolute RMSE values are comparably high as the participants        User-centric evaluation of a k-furthest neighbor
only had to rate 15 items in the first phase. The observations for
precision are comparable; RMSE values for PopRank and BPR are
                                                                         collaborative filtering recommender algorithm. In Proc.
missing as these methods generate no rating predictions.                 CSCW ’13, pages 1399–1408, 2013.