Measuring the Impact of Recommender Systems – A Position Paper on Item Consumption in User Studies

Benedikt Loepp, Jürgen Ziegler
University of Duisburg-Essen, Duisburg, Germany
benedikt.loepp@uni-due.de, juergen.ziegler@uni-due.de

ImpactRS ’19, September 19, 2019, Copenhagen, Denmark
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
While participants of recommender systems user studies usually cannot experience recommended items, it is common practice for researchers to ask them to fill in questionnaires regarding the quality of systems and recommendations. While this has been shown to work well under certain circumstances, it sometimes does not seem possible to assess user experience without enabling users to consume items, raising the question of whether the impact of recommender systems has always been measured adequately in past user studies. In this position paper, we explore this question by means of a literature review and identify aspects whose influence on assessments in user studies needs to be further investigated, for instance, the difference between consuming products and consuming only related information, as well as the effect of domain, domain knowledge, and other possibly confounding factors.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender Systems; Experimentation; User Studies

1 THE PROBLEM WITH USER STUDIES
Questionnaires for assessing the quality of recommendations and the user experience of recommender systems (RS) have been proposed, for instance, in [4, 5, 8]. These established instruments are often employed in academic user studies, where participants usually first use a RS and are subsequently asked to fill in a questionnaire. However, recommended items in these scenarios are almost always represented through “proxy presentations”, i.e. items are only shown to users by means of images, descriptive texts, metadata, etc. The actual consumption of items is, in contrast to real-world situations, rarely possible. There, it is usually required to have, for instance, bought a product, visited a hotel, or watched a movie before even being able to provide an opinion.

Previously, we investigated whether the consumption of items during user studies has an impact on the subsequent assessment of recommendations by means of questionnaires [6]. In other studies, e.g. on explanations [1, 9], the impact of consumption has never been directly addressed. We found, among others, that whether participants can adequately assess recommendation quality and aspects related to user experience without being able to consume recommended items strongly depends on the domain as well as on the type and amount of presented information. Accordingly, depending on the circumstances, allowing participants to experience items may be a necessity for ensuring the validity of RS user studies.

While we were able to derive important conclusions for future user studies (e.g. results appear to provide at least a lower bound), there are many open questions that are strongly related but go beyond what we have already investigated in the music and movie domains [6]. Regardless of the success of A/B tests in industry, user studies are especially important in academia, where they are more and more acknowledged as an indispensable means for holistically capturing the qualities of RS [3]. Considering this and the generally increasing efforts towards reproducibility, it thus seems of particular interest to study the impact of item consumption and of other possibly confounding factors on the assessment of recommendations in user studies in more depth.

2 LITERATURE REVIEW
First, to put our findings from [6] into context, we performed a literature review. We analyzed all 46 papers accepted to the five editions of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS)¹, which were held from 2014 to 2018 in conjunction with the ACM Conference on Recommender Systems (RecSys). In 66 % of these papers, a user study was reported (there were a few more, which however did not focus on recommendation issues but e.g. “only” on comparing different interfaces). In some of the papers without a user study, applying such an evaluation method would not have been appropriate for investigating the respective research question (or would even have been impossible). Accordingly, this number seems actually quite high, especially considering that user studies are still rarely used in broader recommender research [3].

¹ Website of this year’s edition: https://intrs19.wordpress.com/

Taking a closer look at the procedure of the user studies, however, sheds a somewhat different light: As far as we were able to grasp the details from the papers, actually consuming products was possible in only 44 % of the reported user studies (i.e. in 30 % of all papers; see Figure 1). Admittedly, in some papers, this would have made no sense or consumption would have been unrealistic (e.g. hotel or date recommendations). Sometimes, it simply was not necessary for answering the underlying research question. We decided not to count the consumption of movie trailers [2] or song excerpts [7], but included cases where, for instance, recommended research papers were only accessible via a link to an external website [10], making it less likely that many participants took that chance.

[Figure 1 (bar chart): Results from our literature review showing how many papers were accepted to past IntRS workshop editions, how many of these papers contained user studies, and in how many studies item consumption was possible.]

In summary, bearing in mind the smaller number of user studies presented at less user-centric venues, which likely allowed consuming items in even fewer cases, the question arises whether evaluation results would have been the same overall if item consumption had been possible. While our literature review is certainly limited, the impact of RS has most likely not always been measured accurately, since participants might not have had everything they needed to adequately assess recommendation quality and user experience.

3 ASPECTS TO INVESTIGATE
With the importance of item consumption in mind, the literature review points to possible omissions in past research, emphasizing the need to take this aspect more into account when designing future user experiments. To do so, a number of research questions still need to be answered. Answering them may help to decide in a more structured manner, for example, whether it is necessary to give participants the possibility to consume items at all, or which substitutes may be used otherwise. It may also indicate which factors that might confound the assessment, and thus lead to a distorted impression of the recommender’s impact, need to be considered when planning a study, analyzing results, and drawing conclusions.

The following (non-exhaustive) list contains aspects we think are generally important and possibly mediate the effects of item consumption. Concretely, we suggest investigating the influence:
• of item consumption in further domains, depending on the domain knowledge of participants as well as product type and attributes (e.g. search vs. experience products),
• of presenting different kinds of information (subjective vs. objective item descriptions) as possible substitutes for item consumption at varying levels of detail (only metadata or additional content descriptions, other item-related information such as user reviews, system-generated explanations, paper abstracts, song excerpts, movie trailers, etc.),
• of user characteristics such as personality or decision-making style (making decisions in either a rational or an intuitive way might affect the need for actual item consumption),
• and of the point in time at which assessments take place (since the effect of item consumption might diminish over time).

Beyond that, there are certainly many other aspects that may influence study results when trying to quantify the impact of RS. For instance, the improvements made regarding user experience in the past couple of years have led to higher perceived recommendation quality without any changes to the recommendations themselves [3]. However, apart from such attempts intended to positively affect the impact of RS, some aspects may unintentionally cause differences due to the specific characteristics of user experiments. First, the experimental situation itself (e.g. the presence of a supervisor in a lab study), with systems specifically designed for the purpose of the study and thus also limited to this purpose, might affect ecological validity: the assessment might differ from when a recommendation set is integrated into a real-world e-commerce platform. Among others, economic reasons (real money needs to be spent) or the different relationships between students and researchers vs. customers and commercial system providers might affect e.g. reported purchase intention or perceived trustworthiness. Also, questionnaires might interfere with internal validity, as item formulations can be ambiguous, e.g. regarding whether recommended products are novel (recently released vs. only new to the participant) or whether the set appears well-chosen (because products fit together or actually represent the participant’s taste). More generally, using such instruments at all might be an issue, as they might provoke a more conscious assessment, possibly affecting decision-making (i.e. participants could settle for different items if not confronted with a questionnaire).

4 CONCLUSIONS AND PERSPECTIVES
We have positioned our work on the effects of item consumption [6] in the context of the broader question of how the impact of RS can adequately be measured by means of user studies in academia. We identified a number of aspects that still need to be investigated in order to pursue the superordinate goal of deriving a set of guidelines for promoting the validity of future experiments and fostering reproducibility. Currently, we are planning a study to investigate the influence of the aspects listed in the previous section. In addition, we would like to address the questions that go beyond these aspects, and encourage others to do so as well, possibly also by employing novel means for assessing the impact of RS. For instance, developing methods that use eye-tracking to determine which items are of most interest to participants might help to avoid interventions that make participants switch decision-making styles. Overall, the insights that may be gained could also have broader impact, for example, by finding solutions that allow algorithms to adequately deal both with ratings provided in real-world systems without prior experience of the products (e.g. “This recipe sounds awesome” → 5-star rating) and with ratings resulting from actual consumption.

REFERENCES
[1] M. Bilgic and R. J. Mooney. 2005. Explaining recommendations: Satisfaction vs. promotion. In Proc. Beyond Personalization Workshop.
[2] M. P. Graus and M. C. Willemsen. 2016. Can trailers help to alleviate popularity bias in choice-based preference elicitation? In Proc. IntRS ’16. 22–27.
[3] B. P. Knijnenburg and M. C. Willemsen. 2015. Evaluating recommender systems with user experiments. In Recommender Systems Handbook. Springer US, 309–352.
[4] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell. 2012. Explaining the user experience of recommender systems. User Model. User-Adap. 22, 4-5 (2012), 441–504.
[5] B. P. Knijnenburg, M. C. Willemsen, and A. Kobsa. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. In Proc. RecSys ’11. ACM, New York, NY, USA, 321–324.
[6] B. Loepp, T. Donkers, T. Kleemann, and J. Ziegler. 2018. Impact of item consumption on assessment of recommendations in user studies. In Proc. RecSys ’18. ACM, New York, NY, USA, 49–53.
[7] F. Lu and N. Tintarev. 2018. A diversity adjusting strategy with personality for music recommendation. In Proc. IntRS ’18. 7–14.
[8] P. Pu, L. Chen, and R. Hu. 2011. A user-centric evaluation framework for recommender systems. In Proc. RecSys ’11. ACM, New York, NY, USA, 157–164.
[9] N. Tintarev and J. Masthoff. 2012. Evaluating the effectiveness of explanations for recommender systems. User Model. User-Adap. 22, 4-5 (2012), 399–439.
[10] K. Verbert, D. Parra, and P. Brusilovsky. 2014. The effect of different set-based visualizations on user exploration of recommendations. In Proc. IntRS ’14. 37–44.