Designing an Interpretable Interface for Contextual Bandits

Andrew Maher*, Matia Gobbo, Lancelot Lachartre, Subash Prabanantham, Rowan Swiers and Puli Liyanagama
Metica, London

Abstract
Contextual bandits have become an increasingly popular solution for personalized recommender systems. Despite their growing use, the interpretability of these systems remains a significant challenge, particularly for the often non-expert operators tasked with ensuring their optimal performance. In this paper, we address this challenge by designing a new interface to explain to domain experts the underlying behaviour of a bandit. Central is a metric we term "value gain", a measure derived from off-policy evaluation to quantify the real-world impact of sub-components within a bandit. We conduct a qualitative user study to evaluate the effectiveness of our interface. Our findings suggest that by carefully balancing technical rigour with accessible presentation, it is possible to empower non-experts to manage complex machine learning systems. We conclude by outlining guiding principles that other researchers should consider when building similar interfaces in future.

Keywords
User interfaces for decision-making, Contextual bandits, Off-policy evaluation, Interpretable machine learning

1. Introduction

Complex personalized recommender systems have become vital to building engaging, modern user experiences across a variety of domains [1–3]. Although powerful, these systems cannot function properly without a human operator in place who can deploy and manage their correct running. These people – typically non-experts in statistics and machine learning – are expected to make reasoned, higher-order decisions about the recommender system, to maximize its performance and ensure it adds holistic value to the broader environment in which it sits.

By default, however, modern recommender systems are complex and hard to interpret [4, 5]. They comprise multiple interlocking parts, each of which requires a strong mathematical background to understand. Take contextual bandits. They are an increasingly popular methodological approach that addresses known challenges such as the cold-start problem [6] and non-stationary environments [7, 8]. Despite their efficacy as a recommender system method, providing a robust interpretation of their decisions is an unsolved problem. Similar to traditional supervised learning systems, they depend on black-box models to estimate the expected performance of recommendable items given a context. This difficulty is compounded by several factors: observational data only becomes available for the items selected by the bandit; interpretation is required not only for a single output but for multiple items from which the bandit is choosing; and the ongoing modulation between exploration and exploitation means a bandit system does not always select the arm it predicts as most valuable.

For the non-expert human operator, several higher-order considerations need to be made to ensure each bandit-based recommender is continually well-tuned. Is it performing well enough to keep it in production? Should new arms be added, or existing ones removed? Are the context fields considered by the bandit sufficiently discriminatory to yield interesting results? Answering these questions requires an understanding of the underlying system that is both deep and broad. Yet there is a gap between this need for interpretation and the tools and interfaces that exist to provide it.
IntRS'24: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, October 18, 2024, Bari (Italy)
* Corresponding author.
andrew@metica.com (A. Maher); matia@metica.com (M. Gobbo); lancelot@metica.com (L. Lachartre); subash@metica.com (S. Prabanantham); rowan@metica.com (R. Swiers); puli@metica.com (P. Liyanagama)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A comparison with a sister decision-making domain is apt. In A/B testing, there is a well-known set of metrics and visualizations that determine which of the options being tested is best [9, 10]. Statistical significance and MLE-based uplift charts dominate the field. Moreover, there exist dozens of commercially available platforms with interfaces designed for easy interpretation [11], and probably thousands of proprietary in-house equivalents [12, 13]. Few such equivalent interfaces exist for contextual bandits; none are publicly available.

To address these challenges, we have developed an intuitive interface designed to explain the behavior of a contextual bandit system. It is in production and being used in a commercial setting. By leveraging techniques from data visualization, off-policy evaluation and user-centric design, our interface aims to make the inner workings of a bandit system comprehensible to domain-expert operators. Central to the interface is a generic metric framework we term "value gain" – a measure derived from off-policy evaluation that provides a clear indication of the real-world value of different elements of the system.

It is important to clarify that our target audience is not the end-users who receive recommendations; significant research has already been conducted on boosting interpretability for these people [4, 14–17]. We focus instead on the people who choose the inputs to the recommender system, and who are responsible for its proper functioning. We assume nothing about their knowledge of statistics, only that they can read a quantitative dashboard. It is also worth noting that although the interface has been designed for contextual bandits, and works with any underlying bandit algorithm, the ideas apply equally well to any recommender methodology. The sole requirement is that the system comprise a (relatively) limited number of options from which to choose – and that there is value in understanding their respective performance. We do not try to solve the problem of choosing from an ever-changing and very large library of options (as in, say, a video recommender system).

The rest of the paper is organised as follows. In Section 2 we discuss related work. We then present our developed interface in Section 3, followed by a user study in Section 4 to evaluate its effectiveness. Finally, in Section 5 we outline guiding principles for future practitioners looking to build similar dashboards, then discuss future directions in this space.

2. Related work

Contextual Bandits can be viewed as an extension of traditional experimentation in which the arm-assignment decision is both automated (hence bandit) and personalized (hence contextual). Typically, they comprise a reward model and a policy. The former governs the bandit's understanding of the world, with common choices including linear regression [18] and neural networks [19]. The latter dictates how the bandit modulates between exploration and exploitation.
Canonical examples are UCB (Upper Confidence Bound) [20] and Thompson sampling [21], with numerous applications across various domains such as online advertising [22], personalized news feeds [23], customer support [24], and e-commerce recommendations [25, 26].

Off-Policy Evaluation is the main paradigm through which the efficacy of different bandit approaches is evaluated. It is a counterfactual estimation procedure in which the logged policy – the one for which real-world data is observed and measured – is compared to a hypothetical target policy. Numerous estimators exist to facilitate this comparison, including inverse propensity scoring, the direct method, and doubly robust estimators [27–29]. These techniques allow for the assessment of new policies without the need for costly and time-consuming online experimentation.

In contrast to the "how good", Explainable AI attempts to elucidate the purer "how" and "why" of machine learning approaches [30]. Recent research in this field has focused on identifying new methods, different presentation approaches, and better ways to judge the goodness of these explanations. The vast majority focus on interpretability concerns with respect to the recipient of a recommendation, something that is out of the scope of this work.

Of particular relevance are a number of User Interfaces designed to enable the proper understanding of different ML systems. For example, ActiVis and LSTMVis are two different visualisation interfaces for interpreting deep learning models and results [31, 32], and InterpretML is a holistic system to help understand ensembles of decision trees [33]. A number of similar explanations and interfaces exist for more general reinforcement learning policies [34, 35]. Although many organisations deploy contextual bandits – and some offer them as services to other companies – we could not find any equivalent bandit interfaces in the literature.

3. Interface

Figure 1 shows the interface we designed, through which human operators can interpret the workings of a contextual bandit system. It contains multiple, ordered visualizations that aim to provide increasing detail about the performance of different aspects of the bandit.

Figure 1: User interface for contextual bandits.

The interface comprises three main components: top level performance, variant performance and performance per context. They each describe different elements of the performance of the bandit system, at increasing levels of granularity.

The audience of the interface are the people responsible for launching – and potentially altering – the bandit. They need to understand not just its holistic performance, but also how each component of the bandit contributed to that performance. One way of providing this understanding is through comparison: supposing that component wasn't included in the system, how much less value would the bandit generate? In other words, what is the value gained by the inclusion of that component? To this end, we introduce the value gain metric – an estimate of the value of the production bandit, with respect to a simpler one in which certain components are ablated. Below, we define this metric in its general form. We then describe the interface itself in detail – considering as a prototypical example a use-case in which a mobile game wishes to increase dollars spent on in-app purchases. Although we consider this example for the purposes of the paper, the interface applies equally well to any use-case served by a contextual bandit.
3.1. Value gain

Let $r$ denote the reward observed by the contextual bandit from one of its actions, and let $\pi$ be the combination of policy-and-context-model that chooses actions yielding $r$. We can define the value of $\pi$ as

$v_\pi = \mathbb{E}_\pi[r]$.   (1)

Here $v_\pi$ is measured in the same units as the optimisation goal of the bandit. It describes, for example, the average revenue per user achieved by the bandit.

This policy-and-context-model $\pi$ comprises multiple elements: the choice of exploration agent; the fields introduced to represent user context; the arms selected for inclusion in the bandit. Each of these individual components contributes to the value of the overall system. We attempt to measure that contribution through ablation. This is the principle behind the value gain metric.

Suppose we had an alternative policy-and-context-model $\tilde{\pi}$, and further suppose that $\tilde{\pi}$ is equivalent to $\pi$ in all but some elements (which we denote as $\tau$) – i.e., its components are a subset of those present in $\pi$. We define the value gain – the gain in value produced by those missing elements – as

$g(\tau) = v_\pi - v_{\tilde{\pi}} = \mathbb{E}_\pi[r] - \mathbb{E}_{\tilde{\pi}}[r]$.   (2)

It is impractical to calculate $v_{\tilde{\pi}}$ online across the range of elements in which we might be interested. Doing so would require diverting a large proportion of traffic to alternative assignment algorithms, reducing the volume of data from which the main policy can learn and exposing multiple users to a potentially inferior bandit. Instead, we use methods introduced in the off-policy evaluation literature [28, 29]. Although these methods are not perfect, they nonetheless represent the gold standard by which bandit policies are evaluated offline. In practice any off-policy estimator can be used. To make the ideas within this paper concrete, we consider the inverse-propensity score estimator [28]. It estimates $v_{\tilde{\pi}}$ as

$\hat{v}_{\tilde{\pi}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde{\pi}(a \mid x_i)}{\pi(a \mid x_i)} \, r(x_i)$,   (3)

where the two probabilities in the fraction represent the probability of assigning arm $a$ to context $x_i$ under the ablated and actual policy respectively.

For example: suppose an e-commerce website launched a bandit to improve conversion rate on its landing page, by serving different variants of its home page. And suppose further that one of the variants is the pre-existing home page, i.e., the one that existed before the launch of the bandit. Here, $g(\tau)$ can be used to measure the value gained from including all non-baseline arms. In this case, $\tau$ denotes the set of non-baseline arms, $\tilde{\pi}$ is a trivial policy that assigns the baseline homepage to all incoming traffic, and $\pi$ is the contextual bandit launched by the e-commerce website. The value gain $g(\tau)$ shows the extra conversion rate introduced by the bandit with respect to the original baseline experience.
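To make the calculation behind Equations (2) and (3) more tangible, the following is a minimal Python sketch of an IPS-based value gain estimate. It is not the production implementation; the function names, data layout and toy numbers are illustrative assumptions.

```python
import numpy as np

def ips_value(rewards, logged_propensities, target_propensities):
    """Inverse-propensity-score estimate (Eq. 3) of the value of a target policy.

    rewards[i]             : reward r(x_i) observed for the action the production
                             policy actually took in context x_i
    logged_propensities[i] : probability pi(a | x_i) the production policy assigned
                             to that action
    target_propensities[i] : probability the ablated policy would have assigned
                             to the same action in the same context
    """
    weights = np.asarray(target_propensities) / np.asarray(logged_propensities)
    return float(np.mean(weights * np.asarray(rewards)))

def value_gain(rewards, logged_propensities, target_propensities):
    """Value gain g(tau) from Eq. 2: production value minus ablated-policy value.

    The production policy's value is estimated directly as the mean of the logged
    rewards, since those rewards were generated by that policy.
    """
    v_production = float(np.mean(rewards))
    v_ablated = ips_value(rewards, logged_propensities, target_propensities)
    return v_production - v_ablated

# Toy usage: three logged interactions; the ablated policy is a baseline-only
# policy, so its propensity is 1 when the baseline arm was shown and 0 otherwise.
rewards = [0.00, 9.99, 2.99]
logged_propensities = [0.50, 0.25, 0.25]
ablated_propensities = [1.00, 0.00, 0.00]
print(value_gain(rewards, logged_propensities, ablated_propensities))
```

The same helper can be reused for any ablation of interest: only the propensities of the counterfactual policy change.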
3.2. User interface

The interface in Figure 1 comprises three main sections:

1. Top level performance: A short summary of the bandit's overall performance. It provides an overview that can be digested in as short a time as possible.
2. Variant performance: A description of how each arm in the bandit is performing. It enables an operator to determine which arms perform best, and which are candidates for removal from the system.
3. Performance per context: More granular information about the relationship between expected performance and the different contexts used within the bandit. It provides further understanding of the way in which the bandit is personalising the underlying experience.

3.2.1. Top level performance

The top level performance section presents three distinct metrics:

1. Uplift vs original offer: What is the percentage increase provided by the bandit on the goal metric, when compared to a baseline arm?
2. Players: How many people have been exposed to the bandit so far?
3. Dollars spent per player: What is the performance of the bandit on its revenue-maximising goal metric so far?

The metrics are explicitly presented in order of importance. First, how much value is the bandit adding to its use-case? Second, is it reaching a sufficiently large population? Third, what is its average performance?

The first metric is based on the previously defined value gain. Here, we assume that one of the arms can be considered a "baseline" – the arm that would be shown to all traffic if the recommendation policy were not in use. In this case, we show how the policy performs relative to another one that contains only that baseline arm. Following the notation above – and assuming the production policy comprises the set of arms $a_k \in \mathcal{A}$ – we calculate $g(\tau)$ for

$\tau = \{a_k \in \mathcal{A} : a_k \neq a_{\text{baseline}}\}$,   (4)

i.e., the counterfactual policy contains only the baseline arm $a_{\text{baseline}}$. Players and Dollars spent per player are calculated directly from logged assignment and rewards data.

3.2.2. Variant performance

The variant performance table shows arm-level information. It is a table comprising one row per arm and four columns. One of the columns describes the arm itself (with a name or other distinguishing information); the other three summarise performance information about that arm.

The first metric column – Dollars spent – shows both (A) the expected performance of the arm for all people exposed to the bandit, and (B) a range of potential performance values. We calculate the expectation and range by first estimating, for a given arm, its reward across all players. Then, on that distribution of estimates, we compute three summary statistics: the mean, and the 10th and 90th percentiles.

The second column is Expected benefit (Dollars spent). It measures the achieved value that can be attributed to the arm in question. We again use value gain to evaluate this quantity. In this instance, we compare the production policy to a counterfactual one whose ablated element is $\tau = \{a_k\}$, where $a_k$ is the arm being measured. Take variant $0.99 as an example. It has an expected benefit of +0.17. This means the bandit gains 17 cents more per user thanks to its ability to show this arm.

The third column shows the proportion of contexts for which that arm was displayed to users.

3.2.3. Performance per context

The performance per context component contains two visualizations. The first is a radar chart in which a circle is split into multiple segments, with each segment representing a single arm. Dots are plotted onto the segments. Each dot represents a distinct context vector encountered by the bandit. The dots are placed into the segment corresponding to the expected best arm for that context vector. Their distance from the chart's origin is defined by the relative uplift of that arm compared to the original offer. We again use the value gain to calculate this distance.

The second visualisation is a bar chart showing the value gain attributable to each context field. Here we compare the production policy to a counterfactual one in which the context field in question is removed. Each bar hence describes how much better the bandit is thanks to the inclusion of that context field.
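As an illustration of how these views can be driven by the same value gain machinery, here is a hedged sketch that constructs the counterfactual propensities for the top-level uplift (Equation 4) and for the per-arm expected benefit. The arm names, logged data and the choice to renormalise the production propensities when an arm is removed are all assumptions made for the example; the paper does not prescribe that last detail.

```python
import numpy as np

# Logged data from the production bandit: for each interaction we store the
# chosen arm, the observed reward, and the production policy's propensities
# over all arms in that context. Values below are purely illustrative.
logged = [
    # (chosen_arm, reward, {arm: production propensity})
    ("$2.99",    2.99, {"baseline": 0.10, "$0.99": 0.20, "$2.99": 0.50, "$9.99": 0.20}),
    ("baseline", 0.00, {"baseline": 0.40, "$0.99": 0.30, "$2.99": 0.20, "$9.99": 0.10}),
    ("$9.99",    9.99, {"baseline": 0.05, "$0.99": 0.15, "$2.99": 0.30, "$9.99": 0.50}),
]

def ips_value(ablated_propensities_fn):
    """IPS estimate (Eq. 3) of the value of an ablated policy, defined here by a
    function mapping the production propensities to the ablated ones."""
    terms = []
    for arm, reward, props in logged:
        ablated = ablated_propensities_fn(props)
        terms.append(ablated.get(arm, 0.0) / props[arm] * reward)
    return float(np.mean(terms))

def baseline_only(props):
    # Top-level uplift (Eq. 4): the counterfactual policy always shows the baseline.
    return {"baseline": 1.0}

def without_arm(removed_arm):
    # Expected benefit of a single arm: drop it and renormalise the remaining
    # production propensities (one simple choice of ablated policy).
    def fn(props):
        kept = {a: p for a, p in props.items() if a != removed_arm}
        total = sum(kept.values())
        return {a: p / total for a, p in kept.items()}
    return fn

v_production = float(np.mean([reward for _, reward, _ in logged]))
print("Uplift vs original offer:", v_production - ips_value(baseline_only))
for arm in ["$0.99", "$2.99", "$9.99"]:
    print(f"Expected benefit of {arm}:", v_production - ips_value(without_arm(arm)))
```

The per-context-field bar chart follows the same pattern, with the ablated policy obtained by re-estimating propensities after dropping the field in question.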
4. User study

To better understand the ability of our interface to meaningfully represent interpretable results from a contextual bandit system, we conducted a user study.

Table 1
Results from the self-guided component of the user study.

Introduction to interview and page: All interviewees quickly discerned the meaning of the top-level performance metrics and how they would help in measuring performance.

Variant performance table: The variant performance table was the second element of the page at which they each arrived. All three understood the dollars spent and winner frequency columns, though they needed some prompting with the latter. More difficult was the expected benefit column. Each interviewee correctly stated that it denoted the value of the variant, and that the measure was comparing the variant to something else. No interviewee could state what that something else was. At first, each said it might be the baseline variant before deducing that to be impossible (as the baseline variant also had a non-zero value). Even after extensive prompting, they couldn't correctly define the metric.

Radar chart: The radar chart was the last component of the page each interviewee discovered. They all found it somewhat daunting to explore at first, but quickly established (A) what each point represented and (B) that each segment related to an individual variant. Two of the interviewees stated the correct definition of the dots' placement. All three worked out how to evaluate the different variants using the chart. One candidate noted the chart was pretty but potentially superficial.

Desire for contextualisation: Beyond the specific sub-components of the page, the three interviewees each expressed a desire for more context beyond the base numbers shown. All three explicitly requested information on the "significance" of the results. Two wanted to understand the number of observations relating to each number. One interviewee requested filters to gain more granular information about the data.

4.1. Format

As in [31] and [36], we performed a qualitative evaluation built on deep-dive interviews with candidates who would use this system as part of their daily workload. The interviews each took forty-five minutes. We started with a short explanation of how contextual bandits work, to ensure candidates had sufficient familiarity with the topic. Interviewees were then encouraged to explore the UI on their own and, whenever they focused on a specific component, they were asked to explain their perception of what it meant.

During the self-guided exploration, we also asked each interviewee the following specific knowledge-based questions to probe the extent of their ability to correctly interpret the bandit using the interface:
• Bandit value: How is the bandit performing compared to a default experience? When do you think the optimization would/should stop?
• Variant performance: What are the best / worst performing variants? Why? Given the information presented, would you intervene to change anything about the variants? If so, what changes would you make?
• Context contribution: Given the information presented, would you intervene to change anything about the context fields being used? If so, what changes would you make?

Table 2
Results from the knowledge-based questions in the user study.

Bandit value: All interviewees correctly stated the bandit provided value above the baseline. They relied only on the top-left uplift metric to make this point. When asked whether they'd let the bandit continue running, all three replied yes. They again depended on the top-left uplift metric.

Variant performance: All three interviewees could reason about which variants were worse-performing. They used a combination of the winner frequency, the radar chart and the expected benefit to answer – with no clear preference among these elements. They all determined a best set of variants ($2.99 and $9.99) using the same elements. None tried to contrast the quality of these two variants (using the expected benefit column and winner frequency, for example).

Context contribution: Two interviewees used the context contribution chart to reason that removing poor-performing context fields would improve the bandit (by avoiding opportunity cost and/or simplifying the system). One interviewee couldn't reason effectively about which contexts best contributed to the bandit. They instead leant on their practical experience (of user behaviour in different countries) and didn't try to use the context contribution chart for their answer.

4.2. Results

We conducted three such deep-dive interviews in total. Each interviewee was a marketing professional who would be the person responsible for interpreting bandit choices and outcomes to make operational decisions. All of them had extensive experience with A/B testing, but little practical experience of using a contextual bandit. Each interviewee was shown the interface as depicted in Figure 1. Results from the self-guided exploration and knowledge-based questions are summarised in Tables 1 and 2.

5. Conclusions

5.1. Guiding principles

We have presented a visual interface to explain the workings of an in-operation contextual bandit system, built using novel metrics underpinned by methods from off-policy evaluation. Through this exercise, we can identify a number of broad, guiding principles to inform the useful design of similar interfaces in future. These principles are outlined in Table 3. Future researchers and practitioners should use them to help direct their own design processes.

Of these, the two most crucial are the complementary pair: Feel empowered to use technical tools / Use clear non-technical language. In the context of machine learning and recommender systems, the most insightful metrics can be simple to understand but inherently complicated to calculate. If they add the most value – use them. For example, we introduced ideas from off-policy evaluation to our dashboard. Despite their relevance, off-policy evaluation is not particularly well known outside the machine learning and statistics community. As we built the interface, we held concerns internally that our audience might feel uncomfortable with these metrics, and not trust them sufficiently. However, none of our interviewees raised a problem.
Table 3
Guiding principles for an interpretable interface for a contextual bandit. Future researchers and practitioners should consider these when building their own, similar interfaces.

Feel empowered to use technical tools: Sometimes the most relevant metrics are highly technical. Don't shy from their usage. We used techniques from off-policy evaluation, an alien field to our interviewees. No one raised a concern; our candidates trusted and accepted the information we shared.

Use clear, non-technical language: Describe results in a way the audience can easily reason about. We named one column "expected benefit", as it related to statistical expectation and indirectly conveyed meaning. The title made sense to us as statisticians; our interviewees didn't get it. By contrast, all candidates understood "uplift vs original".

Consciously order information: Different results exist within the hierarchy of complexity in a recommender system. Carefully consider what to show and when. Our interviewees could digest our most complex visualisation – an information-dense radar chart – precisely because they'd been carefully shown simpler results earlier on.

Contextualise results: A repeated criticism of our interface was a lack of statistical significance or volume information. People couldn't reason about the importance of the results they were seeing. Providing contextual information of this sort enables operators to respond proportionately to insights.

Facilitate decision-making: Fundamentally, insights are useful only if they lead to decision-making. Consider what will guide choices, and present it in complementary formats where useful. Our interviewees successfully answered our task questions by combining multiple elements of our interface. It was this action-oriented approach they praised most.

To caveat the above point, it's important to consider the audience that will read the results. Think about the implication the statistical machinery conveys, then use that as the description of the metric. Don't blindly use the jargon term that aligns best with the literature if an end-user will not understand its meaning – an error we made in our design.

5.2. Future work

In this paper, we explore ideas and designs to enable the useful interpretation of a single contextual bandit system containing a (relatively) limited number of meaningful options. Relaxing these two constraints – interpreting more than a single bandit, and considering a much wider set of options – would require additional design considerations.

Take multiple bandits: our explorations here relate to providing a deep-dive on a narrow, single optimisation. This necessitates an abundance of information that becomes hard to parse when multiplied across use-cases. This is what would be found in practice – one bandit managing a search experience, another bandit the delivery of product details, and a third bandit the creatives to display. We've so far considered an initial solution of surfacing a single key metric for each bandit (the uplift vs original offer metric). Future research could improve on this by (A) more robustly determining which metric is the most salient to display, (B) ascertaining how to usefully triage among running bandits and (C) explaining the health of all systems in parallel.

Considering a much wider number of arms is another interesting technical challenge. Components we've introduced here – the radar chart, the table of performance per variant – do not naturally extend to the case where there are more than, say, twenty options. But modern contextual bandit systems, particularly with the advent of generative AI, can easily be designed to have much larger numbers of meaningful variants. Conveying information from across a multitude of potentially differing variants is something we hope to consider in future work.

References

[1] M. De Nadai, F. Fabbri, P. Gigioli, A. Wang, A. Li, F. Silvestri, L. Kim, S. Lin, V. Radosavljevic, S. Ghael, D. Nyhan, H. Bouchard, M. Lalmas, A. Damianou, Personalized audiobook recommendations at spotify through graph neural networks, Association for Computing Machinery, New York, NY, USA, 2024. URL: https://doi.org/10.1145/3589335.3648339. doi:10.1145/3589335.3648339.
[2] G. Tang, J. Pan, H. Wang, J. Basilico, Reward innovation for long-term member satisfaction, in: Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 396–399. URL: https://doi.org/10.1145/3604915.3608873. doi:10.1145/3604915.3608873.
[3] X. Liu, Z. Li, Y. Gao, J. Yang, T. Cao, Z. Wang, B. Yin, Y. Song, Enhancing user intent capture in session-based recommendation with attribute patterns, in: NeurIPS 2023, 2023. URL: https://www.amazon.science/publications/enhancing-user-intent-capture-in-session-based-recommendation-with-attribute-patterns.
[4] D. Afchar, A. Melchiorre, M. Schedl, R. Hennequin, E. Epure, M. Moussallam, Explainability in music recommender systems, AI Magazine 43 (2022) 190–208.
[5] H. Steck, L. Baltrunas, E. Elahi, D. Liang, Y. Raimond, J. Basilico, Deep learning for recommender systems: A netflix case study, AI Magazine 42 (2021) 7–18.
[6] H. T. Nguyen, J. Mary, P. Preux, Cold-start problems in recommendation systems via contextual-bandit algorithms, 2014. URL: https://arxiv.org/abs/1405.7544. arXiv:1405.7544.
[7] J. Hong, B. Kveton, M. Zaheer, Y. Chow, A. Ahmed, M. Ghavamzadeh, C. Boutilier, Non-stationary latent bandits, 2020. URL: https://arxiv.org/abs/2012.00386. arXiv:2012.00386.
[8] C. Li, Q. Wu, H. Wang, Unifying clustered and non-stationary bandits, 2020. URL: https://arxiv.org/abs/2009.02463. arXiv:2009.02463.
[9] S. Greenland, S. J. Senn, K. J. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, D. G. Altman, Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations, European Journal of Epidemiology 31 (2016) 337–350.
[10] D. J. Biau, B. M. Jolles, R. Porcher, P value and the theory of hypothesis testing: an explanation for new researchers, Clinical Orthopaedics and Related Research® 468 (2010) 885–892.
[11] A. Fabijan, P. Dmitriev, B. Arai, A. Drake, S. Kohlmeier, A. Kwong, A/B integrations: 7 lessons learned from enabling A/B testing as a product feature, in: 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE, 2023, pp. 304–314.
[12] D. K. Vasthimal, P. K. Srirama, A. K. Akkinapalli, Scalable data reporting platform for A/B tests, in: 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), 2019, pp. 230–238. doi:10.1109/BigDataSecurity-HPSC-IDS.2019.00052.
[13] R. L. Kaufman, J. Pitchforth, L. Vermeer, Democratizing online controlled experiments at booking.com, 2017. URL: https://arxiv.org/abs/1710.08217. arXiv:1710.08217.
[14] C.-H. Tsai, P. Brusilovsky, Evaluating visual explanations for similarity-based recommendations: User perception and performance, in: Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, 2019, pp. 22–30.
[15] C. Musto, G. Rossiello, M. de Gemmis, P. Lops, G. Semeraro, Combining text summarization and aspect-based sentiment analysis of users' reviews to justify recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 383–387.
[16] Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, J. Zhang, Chat-Rec: Towards interactive and explainable LLMs-augmented recommender system, arXiv preprint arXiv:2303.14524 (2023).
[17] J. Tan, S. Xu, Y. Ge, Y. Li, X. Chen, Y. Zhang, Counterfactual explainable recommendation, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 1784–1793.
[18] W. Chu, L. Li, L. Reyzin, R. Schapire, Contextual bandits with linear payoff functions, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 208–214.
[19] O. Nabati, T. Zahavy, S. Mannor, Online limited memory neural-linear bandits with likelihood matching, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 7905–7915. URL: https://proceedings.mlr.press/v139/nabati21a.html.
[20] Y. Abbasi-Yadkori, D. Pál, C. Szepesvári, Improved algorithms for linear stochastic bandits, Advances in Neural Information Processing Systems 24 (2011).
[21] S. Agrawal, N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in: International Conference on Machine Learning, PMLR, 2013, pp. 127–135.
[22] B. Han, J. Gabor, Contextual bandits for advertising budget allocation, Proceedings of the ADKDD 17 (2020).
[23] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670.
[24] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3522–3530.
[25] D. N. Hill, H. Nassif, Y. Liu, A. Iyer, S. Vishwanathan, An efficient bandit algorithm for realtime multivariate optimization, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1813–1821.
[26] X. He, B. An, Y. Li, H. Chen, Q. Guo, X. Li, Z. Wang, Contextual user browsing bandits for large-scale online mobile recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 63–72. URL: https://doi.org/10.1145/3383313.3412234. doi:10.1145/3383313.3412234.
[27] M. Farajtabar, Y. Chow, M. Ghavamzadeh, More robust doubly robust off-policy evaluation, 2018. arXiv:1802.03493.
[28] M. Dudík, J. Langford, L. Li, Doubly robust policy evaluation and learning, arXiv preprint arXiv:1103.4601 (2011).
[29] Y.-X. Wang, A. Agarwal, M. Dudík, Optimal and adaptive off-policy evaluation in contextual bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 3589–3597.
[30] S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable AI systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 11 (2021) 1–45.
[31] M. Kahng, P. Y. Andrews, A. Kalro, D. H. Chau, ActiVis: Visual exploration of industry-scale deep neural network models, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 88–97. doi:10.1109/TVCG.2017.2744718.
[32] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, IEEE Transactions on Visualization and Computer Graphics 24 (2017) 667–676.
[33] H. Nori, S. Jenkins, P. Koch, R. Caruana, InterpretML: A unified framework for machine learning interpretability, arXiv preprint arXiv:1909.09223 (2019).
[34] A. Mishra, U. Soni, J. Huang, C. Bryan, Why? Why not? When? Visual explanations of agent behaviour in reinforcement learning, in: 2022 IEEE 15th Pacific Visualization Symposium (PacificVis), IEEE, 2022, pp. 111–120.
[35] S. Milani, N. Topin, M. Veloso, F. Fang, Explainable reinforcement learning: A survey and comparative review, ACM Computing Surveys 56 (2024) 1–36.
[36] E. Purificato, B. A. Manikandan, P. V. Karanam, M. V. Pattadkal, E. W. De Luca, Evaluating explainable interfaces for a knowledge graph-based recommender system, in: IntRS@RecSys, 2021, pp. 73–88.