Measuring the Impact of Recommender Systems – A Position Paper on Item Consumption in User Studies

Benedikt Loepp, Jürgen Ziegler
University of Duisburg-Essen, Duisburg, Germany
benedikt.loepp@uni-due.de, juergen.ziegler@uni-due.de

ImpactRS ’19, September 19, 2019, Copenhagen, Denmark
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
While participants of recommender systems user studies usually cannot experience recommended items, it is common practice for researchers to ask them to fill in questionnaires regarding the quality of systems and recommendations. While this has been shown to work well under certain circumstances, it sometimes does not seem possible to assess user experience without enabling users to consume items, raising the question of whether the impact of recommender systems has always been measured adequately in past user studies. In this position paper, we explore this question by means of a literature review and identify aspects whose influence on assessments in user studies needs to be further investigated, for instance, the difference between consuming products and consuming only related information, as well as the effect of domain, domain knowledge, and other possibly confounding factors.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender Systems; Experimentation; User Studies

1 THE PROBLEM WITH USER STUDIES
Questionnaires for assessing the quality of recommendations and the user experience of recommender systems (RS) have been proposed, for instance, in [4, 5, 8]. These established instruments are often employed in academic user studies, where participants usually first use a RS and are subsequently asked to fill in a questionnaire. However, recommended items in these scenarios are almost always represented through “proxy presentations”, i.e. items are only shown to users by means of images, descriptive texts, metadata, etc. The actual consumption of items is, in contrast to real-world situations, rarely possible. There, it is usually required to have, for instance, bought a product, visited a hotel, or watched a movie before even being able to provide an opinion.

Previously, we investigated whether the consumption of items during user studies has an impact on the subsequent assessment of recommendations by means of questionnaires [6]. In other studies, e.g. on explanations [1, 9], the impact of consumption has never been directly addressed. We found, among others, that whether participants can adequately assess recommendation quality and aspects related to user experience without being able to consume recommended items strongly depends on the domain as well as on the type and amount of presented information. Accordingly, depending on the circumstances, allowing participants to experience items may be a necessity for ensuring the validity of RS user studies.

While we were able to derive important conclusions for future user studies (e.g. results appear to provide at least a lower bound), there are many open questions that are strongly related but go beyond what we have already investigated in the music and movie domains [6]. Regardless of the success of A/B tests in industry, user studies are especially important in academia, where they are more and more acknowledged as an indispensable means for holistically capturing the qualities of RS [3]. Considering this and the generally increasing efforts towards reproducibility, it thus seems of particular interest to study the impact of item consumption and of other possibly confounding factors on the assessment of recommendations in user studies in more depth.

2 LITERATURE REVIEW
First, to put our findings from [6] into context, we performed a literature review. We analyzed all 46 papers accepted to the five editions of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS)¹, which were held from 2014 to 2018 in conjunction with the ACM Conference on Recommender Systems (RecSys). In 66 % of these papers, a user study was reported (there were a few more, which however did not focus on recommendation issues but e.g. “only” on comparing different interfaces). In some of the papers without a user study, applying such an evaluation method would not have been appropriate for investigating the respective research question (or would even have been impossible). Accordingly, this number seems actually quite high, especially considering that user studies are still rarely used in broader recommender research [3].

¹ Website of this year’s edition: https://intrs19.wordpress.com/

Taking a closer look at the procedure of the user studies, however, sheds a somewhat different light: As far as we were able to grasp the details from the papers, actually consuming products was possible in only 44 % of the reported user studies (i.e. in 30 % of all papers; see Figure 1). Admittedly, in some papers, this would have made no sense or consumption would have been unrealistic (e.g. hotel or date recommendations). Sometimes, it simply was not necessary for answering the underlying research question. We decided not to count the consumption of movie trailers [2] or song excerpts [7], but included cases where, for instance, recommended research papers were only accessible via a link to an external website [10], making it less likely that many participants took that chance.

[Figure 1 (bar chart): Results from our literature review showing how many papers were accepted to past IntRS workshop editions, how many of these papers contained user studies, and in how many studies item consumption was possible.]

In summary, bearing in mind the smaller number of user studies presented at less user-centric venues, which likely allowed consuming items in even fewer cases, the question arises whether evaluation results would have been the same overall if item consumption had been possible. While our literature review is certainly limited, the impact of RS has most likely not always been measured accurately, since participants might not have had everything they needed to adequately assess recommendation quality and user experience.

3 ASPECTS TO INVESTIGATE
With the importance of item consumption in mind, the literature review points to possible omissions in past research, emphasizing the need to take this aspect more into account when designing future user experiments. To do so, a number of research questions still need to be answered. Answering them may help to decide in a more structured manner, for example, whether it is necessary to give participants the possibility to consume items at all, or which substitutes may be used otherwise. It may also indicate which factors that might confound the assessment, and thus lead to a distorted impression of the recommender’s impact, need to be considered when planning a study, analyzing results, and drawing conclusions.

The following (non-exhaustive) list contains aspects we think are generally important and possibly mediate the effects of item consumption. Concretely, we suggest investigating the influence:
• of item consumption in further domains, depending on the domain knowledge of participants as well as product type and attributes (e.g. search vs. experience products),
• of presenting different kinds of information (subjective vs. objective item descriptions) as possible substitutes for item consumption at varying levels of detail (only metadata or additional content descriptions, other item-related information such as user reviews, system-generated explanations, paper abstracts, song excerpts, movie trailers, etc.),
• of user characteristics such as personality or decision-making style (making decisions in either a rational or an intuitive way might affect the need for actual item consumption),
• and of the point in time at which assessments take place (since the effect of item consumption might diminish over time).

Beyond that, there are certainly many other aspects that may influence study results when trying to quantify the impact of RS. For instance, the improvements made regarding user experience in the past couple of years have led to higher perceived recommendation quality without any changes to the recommendations themselves [3]. However, apart from such attempts intended to positively affect the impact of RS, some aspects may unintentionally cause differences due to the specific characteristics of user experiments. First, the experimental situation itself (e.g. the presence of a supervisor in a lab study), with systems specifically designed for the purpose of the study and thus also limited to this purpose, might affect ecological validity: the assessment might differ from when a recommendation set is integrated into a real-world e-commerce platform. Among others, economic reasons (real money needs to be spent) or the different relationships between students and researchers vs. customers and commercial system providers might affect e.g. reported purchase intention or perceived trustworthiness. Also, questionnaires might interfere with internal validity, as item formulations can be ambiguous, e.g. regarding whether recommended products are novel (recently released vs. only new to the participant) or whether the set appears well-chosen (because products fit together or actually represent the participant’s taste). More generally, using such instruments at all might be an issue, as they might provoke a more conscious assessment, possibly affecting decision-making (i.e. participants could settle for different items if not confronted with a questionnaire).

4 CONCLUSIONS AND PERSPECTIVES
We have positioned our work on the effects of item consumption [6] in the context of the broader question of how the impact of RS can adequately be measured by means of user studies in academia. We identified a number of aspects that still need to be investigated in order to pursue the superordinate goal of deriving a set of guidelines for promoting the validity of future experiments and fostering reproducibility. Currently, we are planning a study to investigate the influence of the aspects listed in the previous section. In addition, we would like to address the questions that go beyond these aspects, and encourage others to do so as well, possibly also by employing novel means for assessing the impact of RS. For instance, developing methods that use eye-tracking to determine which items are of most interest to participants might help to avoid interventions that make participants switch decision-making styles. Overall, the insights that may be gained could also have broader impact, for example, by finding solutions that allow algorithms to adequately deal both with ratings provided in real-world systems without prior experience of the products (e.g. “This recipe sounds awesome” → 5-star rating) and with ratings resulting from actual consumption.

REFERENCES
[1] M. Bilgic and R. J. Mooney. 2005. Explaining recommendations: Satisfaction vs. promotion. In Proc. Beyond Personalization Workshop.
[2] M. P. Graus and M. C. Willemsen. 2016. Can trailers help to alleviate popularity bias in choice-based preference elicitation? In Proc. IntRS ’16. 22–27.
[3] B. P. Knijnenburg and M. C. Willemsen. 2015. Evaluating recommender systems with user experiments. In Recommender Systems Handbook. Springer US, 309–352.
[4] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell. 2012. Explaining the user experience of recommender systems. User Model. User-Adap. 22, 4-5 (2012), 441–504.
[5] B. P. Knijnenburg, M. C. Willemsen, and A. Kobsa. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. In Proc. RecSys ’11. ACM, New York, NY, USA, 321–324.
[6] B. Loepp, T. Donkers, T. Kleemann, and J. Ziegler. 2018. Impact of item consumption on assessment of recommendations in user studies. In Proc. RecSys ’18. ACM, New York, NY, USA, 49–53.
[7] F. Lu and N. Tintarev. 2018. A diversity adjusting strategy with personality for music recommendation. In Proc. IntRS ’18. 7–14.
[8] P. Pu, L. Chen, and R. Hu. 2011. A user-centric evaluation framework for recommender systems. In Proc. RecSys ’11. ACM, New York, NY, USA, 157–164.
[9] N. Tintarev and J. Masthoff. 2012. Evaluating the effectiveness of explanations for recommender systems. User Model. User-Adap. 22, 4-5 (2012), 399–439.
[10] K. Verbert, D. Parra, and P. Brusilovsky. 2014. The effect of different set-based visualizations on user exploration of recommendations. In Proc. IntRS ’14. 37–44.