Recommender Systems Alone Are Not Everything: Towards a Broader Perspective in the Evaluation of Recommender Systems

Benedikt Loepp
University of Duisburg-Essen, Duisburg, Germany

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2022), September 22nd, 2022, co-located with the 16th ACM Conference on Recommender Systems, Seattle, WA, USA.

Abstract
Thus far, in most of the user experiments conducted in the area of recommender systems, the respective system is considered as an isolated component, i.e., participants can only interact with the recommender under investigation. This fails to reflect the situation of users in real-world settings, where the recommender usually represents only one part of a larger system that offers many other ways to find suitable items than the mechanisms belonging to the recommender itself, e.g., liking, rating, or critiquing. For example, in current web applications, users can often choose from a wide range of decision aids, from text-based search and faceted filtering to intelligent conversational agents. This variety of methods, which may equally support users in their decision making, raises the question of whether the current practice in recommender evaluation is sufficient to fully capture the user experience. In this position paper, we discuss the need to take a broader perspective in future evaluations of recommender systems, and raise awareness of evaluation methods that we think may help to achieve this goal, but have not yet gained the attention they deserve.

Keywords
Recommender systems, Information filtering, Conversational user interfaces, Decision aids, Evaluation, User experience, User studies, User-centered design

1. Problem statement

Over the last few years, user-centered evaluation of recommender systems has become more and more accepted in the research community [1]. However, it has thus far mostly been ignored that, in real-world settings, recommender systems alone are not everything: It is widely accepted that recommendations are responsible for a large share of the products bought on Amazon or the content watched on Netflix [2, 3], but there exists a broad range of other methods that help users make a decision when confronted with an overwhelmingly large item space. These decision aids, however, are studied and developed mostly independently of recommender systems, e.g., text-based search and faceted filtering in the field of information retrieval [4, 5], and dialog-based assistants and intelligent chatbots in the conversational user interfaces community [6, 7]. This is mirrored in commercial environments, where one can rarely observe that users are supported in a holistic fashion: Even if multiple decision aids are available, they are often isolated, i.e., the input provided to one component hardly affects the results generated by another.
For instance, when a user constrains the set of items by selecting several filter criteria, a recommender that is part of the same system will not necessarily reflect this selection in the generated recommendations, but only take the user’s implicit or explicit item feedback into account (and vice versa). Similarly, the interaction with a chatbot will usually start from scratch, ignoring any other interests or needs that have been expressed before. This separation is difficult to justify, given that different decision aids are known to contribute differently to the user’s progress in accomplishing typical choice and decision-making tasks [8, 9]. Recent studies on systems that provide multiple support components have confirmed that users rely on different mechanisms before settling on an item, and that this is (partly) due to personal and situational characteristics [10, 11, 12, 13].

For these reasons, there have, of course, been several calls over the years to bring the methods and their fields closer together [14, 15, 16, 17]. These calls, however, have largely focused on methodological aspects, whereas it should not be overlooked that this narrow perspective is also reflected in the evaluation of the methods, regardless of whether they are tightly integrated with other decision aids or appear in a decoupled fashion, as is currently common. The work presented in [18] is one of the few exceptions from information retrieval research that argues for a more holistic evaluation approach, in particular, “to think outside the (search) box.” In general, however, this aspect has, to the best of our knowledge, not yet received much attention, neither in information retrieval, nor in conversational user interfaces, nor in recommender systems research [cf. 19, 20, 1, 21]. This means that whenever the user experience of recommender systems is studied in empirical experiments, participants can typically only interact with the recommendation component, i.e., the subject of the evaluation, which is often designed specifically for the purpose of the study, especially in academia. Switching to a different method, which participants may find more appealing depending on their personal preferences and progress in the decision-making process, is, however, not possible. Hence, there is a lack of data indicating which method works best for which user. As a consequence, we argue that it is necessary for all kinds of decision aids “to think outside the box.”

In the following, as another motivating example, we present qualitative feedback from a user experiment on multiple decision aids that we recently conducted. Afterwards, we provide an overview of methods that may help to apply a broader perspective when evaluating recommender systems, and thus, to obtain a more accurate picture of real-world scenarios, in which these systems usually represent only one out of many available decision aids.

2. Motivating example and potential future directions

For the reasons described above, we argue that a broader perspective is required when evaluating recommender systems. Otherwise, there is the danger of continuing to follow certain paths in the development of interactive and conversational recommendation approaches (cf. the surveys in [22, 23]) without knowing whether these approaches are really what users want.
To highlight that this is already an issue, we refer to a recent experiment in which we asked participants (n = 100, 47 female, 2 non-binary, age: M = 35.36, SD = 12.60) to use different decision aids. For this purpose, we confronted them with one of two tasks, related either to a goal-driven or to an explorative scenario. In both scenarios, participants had to find at least two suitable laptops. To accomplish the respective task, they were allowed to choose (and to switch) between: a faceted filtering component, a content-based recommender with the option to like and dislike items, a product advisor with a dialog showing a limited number of guiding questions, and a natural language chatbot implemented using Google Dialogflow.

2.1. Qualitative insights from an experiment with multiple decision aids

While the study is described in more detail in [13], we here want to share additional comments made by participants when they were asked why they did not use one of the components. One frequent reason was a general reluctance to use chatbots. One participant stated: “I generally do not like interacting with chatbots. I feel whatever input I give to select a product, I might as well use a filter component.” In contrast to the increasing popularity of conversational agents for recommendation purposes, another participant indicated that he or she uses chatbots only when there is “a problem with the product or a payment issue, [but that a] chatbot is not required [for] browsing.” Others were even more direct in their criticism, writing: “I hate chatbots. I feel I spend a lot of time typing, and I hate the fake cheeriness of them,” or that they turn “something really simple like searching for a new laptop into a dark comedy.” Other participants cited personal characteristics as reasons for their concerns, such as domain knowledge (“I assume the chatbot would be more useful for someone who does not know what to look for.”) or need for control (“I prefer doing things myself, I can chat to the bot when I cannot seem to find what I am looking for using other [...] options.”)

For the other components, however, we obtained similar feedback. For example, with respect to the advisor, one participant stated that he or she “should have tried this component, but [likes] to shop for these products using filters.” Although it was clearly an academic study, another participant refrained from using the recommender because he or she suspected “that this component uses sponsored companies which pay money to have their products included in the recommendation section.” Here as well, others were generally reluctant, writing: “I never use recommendations as they are never in line with what I need. It may be the bestseller for the store in question, but not for my needs.” Domain knowledge again played a role, notably in both directions, with participants stating to be “knowledgeable enough [that they] do not need recommendations,” but also that they “do not know enough about computers to give a thumbs up or thumbs down.” A lack of knowledge was also a major reason to stay away from faceted filtering, for which participants “thought you need technical understanding of laptops,” or mentioned that they “often feel overwhelmed using filters, and generally do not know where to start in regard to buying laptops.”

All these comments suggest that users often have an idea of which decision aid to use.
However, this is not necessarily the one offered by a system, or the one that is the subject of the experiment they participate in. For a holistic user-centered evaluation, this means that the perspective is too narrow, as only the user experience with the specific recommender for a given task is addressed, ignoring the interdependencies with other methods, which may exist and may be more suitable depending on the current circumstances. Some participants explicitly indicated that they would like to “specify certain filters and let the recommendations pick only from the result set of the filters,” or to “combine components [so that] the advisor/chatbot works within the result set created by the must-have filters.” However, this is exactly what participants usually cannot do, since most experiments are strongly focused on individual decision aids. To evaluate recommender systems from a broader perspective, we therefore suggest improving current practice by applying the following evaluation methods, which, to date, are used only for single methods, or not at all.

2.2. Offline experiments and simulation studies: Richer data, entire systems

Offline experiments are well established in recommender research [24]. However, they are increasingly criticized because they do not allow obtaining insights into the quality dimensions that are relevant from a user perspective [1, 25, 26]. Nevertheless, with the large datasets that are available today (e.g., MovieLens, Netflix, Amazon), they remain essential for making objective decisions about whether or not to use a specific recommendation method. Most datasets, however, are limited to implicit or explicit user-item feedback. Even though they represent different domains, contain varying amounts of side information, and are nowadays often available in sequential form, this limits what can be concluded from the corresponding experiments, namely, which algorithm generates better item recommendations. Therefore, we argue that future offline experiments should be conducted on richer datasets, which include data from all components of a system, such as both user-item feedback and the search queries issued. Thus, provided adequate metrics are found, one could also determine which decision aids work best for individual users and keep them engaged longer. More generally, one could examine system support above the item level, e.g., with respect to the objective quality of recommendations for item features, or even of recommendations for switching to other decision aids.

The same applies to simulation studies, which have only recently gained more attention in recommender research [27]. By simulating typical user behavior, e.g., with respect to critiquing mechanisms [28] or interaction with items over time [29], this type of experiment has shown strong potential, not least in economic terms. With multiple decision aids, and thus a larger design space, this will become even more important, also on a global level, e.g., to study long-term user behavior with respect to the question of when a recommender is used, and when another component is perceived as more suitable and may contribute more to the user’s progress. In addition, domain and other factors such as product type (search vs. experience), product category (cheap streaming content vs. expensive high-risk products), and the given task (goal-oriented or explorative) may affect which method works best. Thus, simulation studies may be the only way to investigate how preferences evolve over time when using different decision aids, and to understand which interaction effects can occur between a recommender and other components, something that would never be possible with actual users.
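To make this direction more tangible, the following minimal Python sketch illustrates one possible shape of such a simulation study: synthetic users, characterized by assumed personal traits (domain knowledge, need for control), repeatedly choose between several decision aids until the item space is narrowed down far enough to make a decision, and the aggregated log shows how often each component contributes to their progress. All component names, trait effects, reduction rates, and other parameters are purely illustrative assumptions, not empirical findings; in a real study they would have to be grounded in logged multi-component interaction data such as that collected in [13].

```python
"""Illustrative simulation sketch (hypothetical model, not the system from Section 2.1):
synthetic users choose between several decision aids until the candidate set is small."""
import random
from collections import Counter
from dataclasses import dataclass

# Assumed factor by which each aid narrows the remaining candidate set per use.
REDUCTION = {"filtering": 0.5, "recommender": 0.6, "advisor": 0.7, "chatbot": 0.8}


@dataclass
class SimulatedUser:
    domain_knowledge: float  # 0 = novice, 1 = expert (assumed to favor filtering)
    need_for_control: float  # 0 = low, 1 = high (assumed to disfavor recommender/chatbot)

    def choose_aid(self, remaining: int) -> str:
        # Hypothetical choice model: weights depend on user traits and on how far
        # the item space has already been narrowed down.
        weights = {
            "filtering": 1.0 + 2.0 * self.domain_knowledge,
            "recommender": 1.0 + 1.5 * (1.0 - self.need_for_control),
            "advisor": 1.0 + 1.0 * (1.0 - self.domain_knowledge),
            "chatbot": 0.5 + 1.0 * (1.0 - self.need_for_control),
        }
        if remaining < 20:  # late stage: item-level feedback assumed to become attractive
            weights["recommender"] *= 2.0
        aids, w = zip(*weights.items())
        return random.choices(aids, weights=w, k=1)[0]


def simulate_session(user: SimulatedUser, catalog_size: int = 500) -> Counter:
    """Run one session and count how often each decision aid was used."""
    remaining, usage = catalog_size, Counter()
    while remaining > 5:  # stop once the candidate set is small enough to decide
        aid = user.choose_aid(remaining)
        usage[aid] += 1
        remaining = max(1, int(remaining * REDUCTION[aid]))
    return usage


if __name__ == "__main__":
    random.seed(42)
    totals = Counter()
    for _ in range(1000):  # population with randomly drawn personal characteristics
        user = SimulatedUser(domain_knowledge=random.random(),
                             need_for_control=random.random())
        totals += simulate_session(user)
    print("Simulated usage per decision aid:", dict(totals))
```

Extending such a sketch, e.g., with a model of preference drift or with dependencies between the components’ outputs, would allow exploring exactly the kind of interaction effects discussed above before committing to a costly study with real users.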
2.3. User-centered evaluation: Multiple decision aids, insightful methods

Well-known qualitative methods from human-computer interaction research, which are frequently used at the beginning of user-centered design processes, e.g., focus groups, interviews, or contextual inquiries [cf. 30], are rarely used in the area of recommender systems. We argue, however, that these techniques could be useful for obtaining insights into users’ actual needs with respect to the interaction with such a system, in particular, when it comes to the relation to other decision aids. In this context, it is worth noting that it has only recently been found that users’ mental models do not necessarily correspond to the implementations of recommender systems, and are subject to large inter-individual differences [31]. Identifying the understanding users have of the system behavior, however, is considered highly important for evaluating the impact of a recommender and improving it [32]. Accordingly, to better inform the design of applications that embed multiple support components, it will be inevitable to explore these models in more depth, in particular, with a focus on the users’ comprehension of possible interactions between the components, by means of both qualitative methods such as grounded theory [33] and quantitative approaches such as the one proposed in [34].

Once a (prototypical) recommender is implemented, questionnaire-based assessment is the most common way of measuring the different qualities related to user experience. For this purpose, well-established frameworks and questionnaires exist [e.g., 35, 36, 37], which, however, are strongly focused on dimensions that are specific to recommender systems. On the other hand, general usability questionnaires, e.g., the SUS [38] and the UEQ [39], are too broad to draw conclusions about the suitability of this or other decision aids for preferential choice and decision-making tasks. Therefore, existing instruments need to be extended to allow for a more global, subjective assessment of recommendation components in the context of the applications in which they are embedded, and thus, of the interplay with other methods. Otherwise, it will hardly be possible to gain insights into why users prefer a specific method to perform a certain task, and, more generally, whether they would use more strongly connected decision aids, or dislike this idea, e.g., because they expect higher complexity or have greater privacy concerns.

Finally, we would like to highlight two further aspects that do not receive much attention in current recommender research: in-situ and long-term evaluation. The former is important, among others, because questionnaires suffer from the problem that they typically require self-reflection disconnected from actual system usage, and, worse, often from the consumption or experience of the items, which has been shown to influence the assessment of recommendations [40].
Therefore, we deem it necessary to develop methods for a quantitative in-situ assessment of users’ motivation to use different components, e.g., via questionnaires embedded directly into the respective applications, as has been done to study the reasons for switching between search engines [41] or between selected decision aids [13]. This, however, may need to be complemented by qualitative methods, such as the think-aloud procedure, systematic user observation, or shadowing: techniques that are generally popular but have rarely been applied in recommender research. Moreover, eye tracking may be considered a useful alternative. Being less disruptive, it has become more popular in recent years, e.g., in studies on recommender interfaces and recommendation presentation [42, 43], critiquing [44, 45], and the effects of personal characteristics [46]. However, also in these cases, participants’ behavior was observed only in relation to the recommendation component, largely ignoring its surroundings.

Either way, following these directions will not be sufficient to understand how users interact with these surroundings over a longer period of time. Therefore, while repeated study designs, longitudinal studies, and field studies are still rare in recommender research, with only very few exceptions [e.g., 47, 48], these methods appear to be of particular importance when taking a broader perspective: Goals and tasks may vary over time, which can have a substantial impact on the usage of different components. Thus, experimental data that represent long-term user behavior only with respect to a single decision aid may distort the picture, since other components may be perceived as more appropriate in other contexts or stages of the decision-making process.

3. Conclusions

Overall, it seems important for future research in the recommender area, but also for other communities, to face the challenge of evaluating the respective methods in a context that is more similar to real-world settings, where decision aids rarely stand on their own. In this position paper, we explained why we think this way, and outlined how this challenge may be addressed. By this means, we hope to raise awareness that contemporary decision aids not only need to be brought together from a methodological perspective, but that a broader perspective is also required when evaluating the methods.

Of course, this may open up new issues. For instance, the more holistic the evaluation, the higher the effort and costs for running an experiment, which are factors that already limit many studies in academia. Moreover, the difficulty of designing an experiment and analyzing its results, but also the number of application-specific parameters and possible confounding factors, increase when more than one decision aid is considered. For these reasons, it will remain important to keep in mind the specific circumstances in which most experiments take place, and not to assume that a broader perspective in the evaluation of the methods automatically allows more general conclusions to be drawn. Nevertheless, we are certain that “thinking outside the (recommendations) box” will help gain a better understanding not only of the degree to which a recommender can satisfy a user in a given situation, but, in particular, of how the interplay with other decision aids can affect the assessment of the system.
In the end, this may shift the focus away from further improving decision aids that are less effective or that users do not want to use for specific tasks, towards those that exhibit the greatest potential for providing support at the respective stage of the decision-making process. For now, however, we hope to encourage at least a discussion about using the mentioned evaluation methods more extensively to gain more in-depth insights into the users’ understanding of and preference for recommendation components in relation to other decision aids. Of course, other methods may equally well be used, but we leave it for future work to provide more concrete suggestions on which methods to use and in which order.

Acknowledgments

Thanks to Timm Kleemann, who implemented the system for the study mentioned here, and contributed to this study to the same extent as the author of the present paper. The study was partially supported by the Eurostars project ACODA (grant no. 01QE1946C).

References

[1] B. P. Knijnenburg, M. C. Willemsen, Evaluating recommender systems with user experiments, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook, Springer US, Boston, MA, USA, 2015, pp. 309–352.
[2] C. A. Gomez-Uribe, N. Hunt, The Netflix recommender system: Algorithms, business value, and innovation, ACM Transactions on Management Information Systems 6 (2015) 13:1–13:19.
[3] B. Smith, G. Linden, Two decades of recommender systems at Amazon.com, IEEE Internet Computing 21 (2017) 12–18.
[4] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM, New York, NY, USA, 1999.
[5] M. A. Hearst, Search User Interfaces, Cambridge University Press, Cambridge, UK, 2009.
[6] K. Ramesh, S. Ravishankaran, A. Joshi, K. Chandrasekaran, A survey of design techniques for conversational agents, in: ICICCT ’17: Proceedings of the 2nd International Conference on Information, Communication and Computing Technology, Springer Singapore, Singapore, 2017, pp. 336–350.
[7] A. Anand, L. Cavedon, H. Joho, M. Sanderson, B. Stein, Conversational search (Dagstuhl Seminar 19461), Dagstuhl Reports 9 (2020) 34–83.
[8] G. Häubl, V. Trifts, Consumer decision making in online shopping environments: The effects of interactive decision aids, Marketing Science 19 (2000) 4–21.
[9] S. Castagnos, N. Jones, P. Pu, Recommenders’ influence on buyers’ decision process, in: RecSys ’09: Proceedings of the 3rd ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2009, pp. 361–364.
[10] J. Schaffer, J. Humann, J. O’Donovan, T. Höllerer, Quantitative modeling of dynamic human-agent cognition, in: M. D. McNeese, E. Salas, M. R. Endsley (Eds.), Contemporary Research: Models, Methodologies, and Measures in Distributed Team Cognition, CRC Press, Boca Raton, FL, USA, 2020, pp. 137–186.
[11] P. Virdi, A. D. Kalro, D. Sharma, Online decision aids: The role of decision-making styles and decision-making stages, International Journal of Retail & Distribution Management 48 (2020) 555–574.
[12] T. Kleemann, M. Wagner, B. Loepp, J. Ziegler, Modeling user interaction at the convergence of filtering mechanisms, recommender algorithms and advisory components, in: Mensch & Computer 2021 – Tagungsband, ACM, New York, NY, USA, 2021, pp. 531–543.
[13] T. Kleemann, B. Loepp, J. Ziegler, Towards multi-method support for product search and recommending, in: UMAP ’22: Adjunct Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2022, pp. 74–79.
[14] H. Garcia-Molina, G. Koutrika, A. Parameswaran, Information seeking: Convergence of search, recommendations, and advertising, Communications of the ACM 54 (2011) 121–130.
[15] E. H. Chi, Blurring of the boundary between interactive search and recommendation, in: IUI ’15: Proceedings of the 20th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA, 2015, p. 2.
[16] B. Loepp, On the convergence of intelligent decision aids, in: UCAI ’21: Proceedings of the 2nd Workshop on User-Centered Artificial Intelligence, 2021.
[17] A. D. Starke, M. Lee, Unifying recommender systems and conversational user interfaces, in: CUI ’22: Proceedings of the 4th International Conference on Conversational User Interfaces, ACM, New York, NY, USA, 2022.
[18] P. Clough, Evaluation: Thinking outside the (search) box, in: FIRE ’14: Proceedings of the Forum for Information Retrieval Evaluation, ACM, New York, NY, USA, 2015, pp. 1–9.
[19] D. Kelly, Methods for evaluating interactive information retrieval systems with users, Foundations and Trends in Information Retrieval 3 (2009) 1–224.
[20] C. Mulwa, S. Lawless, M. Sharp, V. Wade, The evaluation of adaptive and personalised information retrieval systems: A review, International Journal of Knowledge and Web Intelligence 2 (2011) 138–156.
[21] A. B. Kocaballi, L. Laranjo, E. Coiera, Understanding and measuring user experience in conversational interfaces, Interacting with Computers 31 (2019) 192–207.
[22] M. Jugovac, D. Jannach, Interacting with recommenders – Overview and research directions, ACM Transactions on Interactive Intelligent Systems 7 (2017) 10:1–10:46.
[23] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2022) 105:1–105:36.
[24] A. Gunawardana, G. Shani, S. Yogev, Evaluating recommender systems, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook, Springer, New York, NY, USA, 2022, pp. 547–601.
[25] M. Rossetti, F. Stella, M. Zanker, Contrasting offline and online results when evaluating recommendation algorithms, in: RecSys ’16: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2016, pp. 31–34.
[26] T. Rehorek, O. Biza, R. Bartyzal, P. Kordik, I. Povalyev, O. Podstavek, Comparing offline and online evaluation results of recommender systems, in: REVEAL ’18: Proceedings of the Workshop on Offline Evaluation for Recommender Systems, 2018.
[27] N. Hazrati, F. Ricci, Simulating users’ interactions with recommender systems, in: Adjunct Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2022, pp. 95–98.
[28] H. Xie, D. D. Wang, Y. Rao, T.-L. Wong, L. Y. K. Raymond, L. Chen, F. L. Wang, Incorporating user experience into critiquing-based recommender systems: A collaborative approach based on compound critiquing, International Journal of Machine Learning and Cybernetics 9 (2018) 837–852.
[29] J. McInerney, E. Elahi, J. Basilico, Y. Raimond, T. Jebara, Accordion: A trainable simulator for long-term interactive systems, in: RecSys ’21: Proceedings of the 15th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2021, pp. 102–113.
[30] G. Cockton, Usability evaluation, The Encyclopedia of Human-Computer Interaction (2nd Edition) (2013).
[31] T. Ngo, J. Kunkel, J. Ziegler, Exploring mental models for transparent and controllable recommender systems: A qualitative study, in: UMAP ’20: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2020, pp. 183–191.
[32] M. M. Ghori, A. Dehpanah, J. Gemmell, H. Qahri-Saremi, B. Mobasher, Does the user have a theory of the recommender? A grounded theory study, in: Adjunct Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2022, pp. 167–174.
[33] J. Corbin, A. Strauss, Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory, 3 ed., Sage Publications, Inc., Thousand Oaks, CA, USA, 2008.
[34] J. Kunkel, T. Ngo, J. Ziegler, N. Krämer, Identifying group-specific mental models of recommender systems: A novel quantitative approach, in: C. Ardito, R. Lanzilotti, A. Malizia, H. Petrie, A. Piccinno, G. Desolda, K. Inkpen (Eds.), Human-Computer Interaction – INTERACT 2021, volume 12935 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2021, pp. 383–404.
[35] B. P. Knijnenburg, M. C. Willemsen, A. Kobsa, A pragmatic procedure to support the user-centric evaluation of recommender systems, in: RecSys ’11: Proceedings of the 5th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2011, pp. 321–324.
[36] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: RecSys ’11: Proceedings of the 5th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2011, pp. 157–164.
[37] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, C. Newell, Explaining the user experience of recommender systems, User Modeling and User-Adapted Interaction 22 (2012) 441–504.
[38] J. Brooke, SUS – A quick and dirty usability scale, in: Usability Evaluation in Industry, Taylor & Francis, London, UK, 1996, pp. 189–194.
[39] B. Laugwitz, T. Held, M. Schrepp, Construction and evaluation of a user experience questionnaire, in: A. Holzinger (Ed.), HCI and Usability for Education and Work, volume 5298 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2008, pp. 63–76.
[40] B. Loepp, T. Donkers, T. Kleemann, J. Ziegler, Impact of item consumption on assessment of recommendations in user studies, in: RecSys ’18: Proceedings of the 12th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2018, pp. 49–53.
[41] Q. Guo, R. W. White, Y. Zhang, B. Anderson, S. T. Dumais, Why searchers switch: Understanding and predicting engine switching rationales, in: SIGIR ’11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2011, pp. 335–344.
[42] Q. Zhao, S. Chang, F. M. Harper, J. A. Konstan, Gaze prediction for recommender systems, in: RecSys ’16: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2016, pp. 131–138.
[43] P. Gaspar, M. Kompan, J. Simko, M. Bielikova, Analysis of user behavior in interfaces with recommended items – An eye-tracking study, in: IntRS ’18: Proceedings of the 5th Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 2018, pp. 32–36.
[44] L. Chen, F. Wang, An eye-tracking study: Implication to implicit critiquing feedback elicitation in recommender systems, in: UMAP ’16: Proceedings of the 24th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2016, pp. 163–167.
[45] L. Chen, F. Wang, W. Wu, Inferring users’ critiquing feedback on recommendations from eye movements, in: A. Goel, M. B. Díaz-Agudo, T. Roth-Berghofer (Eds.), Case-Based Reasoning Research and Development, volume 9969 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2016, pp. 62–76.
[46] M. Millecamp, N. N. Htun, C. Conati, K. Verbert, What’s in a user? Towards personalising transparency for music recommender interfaces, in: UMAP ’20: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, ACM, New York, NY, USA, 2020, pp. 173–182.
[47] Y. Zhong, T. L. S. Menezes, V. Kumar, Q. Zhao, F. M. Harper, A field study of related video recommendations: Newest, most similar, or most relevant?, in: Proceedings of the 12th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2018, pp. 274–278.
[48] Y. Liang, M. C. Willemsen, Exploring the longitudinal effects of nudging on users’ music genre exploration behavior and listening preferences, in: RecSys ’22: Proceedings of the 16th ACM Conference on Recommender Systems, ACM, New York, NY, USA, to appear.