Explaining health recommendations to lay users: The dos and don'ts

Maxwell Szymanski¹, Vero Vanden Abeele¹ and Katrien Verbert¹

¹ Department of Computer Science, KU Leuven, Leuven, Belgium

Abstract
In recent years, mobile health recommendations have been used in an increasing number of applications. Researchers have highlighted the importance of explaining these recommendations to lay users, with benefits such as increased trust and a higher tendency to follow up on these recommendations. However, the choice of explanation modality can affect the way users perceive a recommendation, either positively or negatively. This paper explores and evaluates six explanation designs through a qualitative user study, and derives general design guidelines and considerations for explaining pain-related health recommendations to lay users.

Keywords
explainable AI, explainable recommender systems, explanation interpretation, lay users, health recommendations, HRS

Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland
maxwell.szymanski@kuleuven.be (M. Szymanski); vero.vandenabeele@kuleuven.be (V. V. Abeele); katrien.verbert@kuleuven.be (K. Verbert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, pp. 1–10

1. Introduction & Related Work

Recommender systems are becoming more prevalent in health-related domains. However, several key aspects have to be taken into account when designing recommender systems, such as transparency through explanations and end user expertise.

1.1. RecSys in Health

Recommender systems (RS) have become prominent in health applications, where they help retrieve relevant information or recommend possible next actions tailored to the needs of the end user. These health recommender systems (HRS) are used both in clinical settings and in personal contexts where health applications aid users in their daily lives. A recent systematic review [1] of HRS for lay users shows that the majority of HRS that used a graphical user interface focus on mobile applications. These mobile HRS span several fields, such as sports, mental health and nutrition, and include applications that e.g. suggest the appropriate action to take for users with diabetes [2], recommend activities to promote healthier lifestyles [3], or help with anxiety by recommending external apps that suit the user's needs [4]. These recommender systems all share the main goal of potentially steering the user towards a better and healthier lifestyle.

However, the increased use of HRS is also paralleled by certain barriers. One such issue is a mismatch of recommendations with the user's expectations. Such a mismatch can lead not only to a decrease in system effectiveness [5], but also to a decrease in trust towards the system, potentially steering the user away from future use of such HRS. Early research mainly focused on increasing the accuracy of RS in order to mitigate this issue. However, Valdez et al. [6] explain that recent research has shifted its focus from improving accuracy to exploring the effects of human factors. This broader approach to reasoning about RS should allow researchers to improve RS effectiveness beyond quantitative algorithmic capability. The new approach includes research on, and the addition of, explanations to increase transparency, human-in-the-loop feedback to correct misunderstandings, and conversational RS to increase familiarity with the system's interface.

In this paper, we focus on the explanation aspect; more specifically, on designing and assessing different explanation types for a mobile health recommender system. The research is conducted in the context of a personal coaching app that guides users with chronic musculoskeletal pain through various informative and interactive topics, such as activity and stress management, pain education, etc. Additionally, the app includes a pain logbook that can be used for logging pain flare-ups. Using this logged information, which consists of the context in which the pain occurred as well as the thoughts and reactions users had, the app is able to give personalised recommendations to better cope with pain flare-ups in the future. In this study, we look into several designs that are deemed fit for explaining these pain-related recommendations to end users. There remain, however, several research challenges that need to be addressed, such as explanation interpretability and end user expertise, which are discussed in the next related work sections.

1.2. Explaining health recommendations

As highlighted earlier, adding explanations to recommendations can improve overall effectiveness. Explanations make the system interpretable, which in turn can improve trust towards the system [7]. There exist HRS that explain their rationale to the end user, such as the food recommender system of Wayman et al. that explains why certain recipes are recommended based on the user's nutritional intake [8], or a visualisation for medical experts that is able to explain breast cancer similarities [9]. However, the systematic review of De Croon et al. states that only 10% of HRS that focus on lay users make use of explanations. This makes HRS explanations for lay users a novel but under-explored topic. Additionally, a study by Bussone et al. points out that providing overly detailed explanations for health recommenders can create unforeseen effects, such as over-reliance on explanations [10], which shows that health recommender explanations should be designed with sufficient care. This makes designing explanations with non-expert users in mind, and evaluating them with end users, paramount.

1.3. End user expertise

An increasing amount of research has pointed out that the expertise of end users should be taken into account when designing explanations. Ribera et al. [11] have proposed three main categories of end users: non-experts (lay users), domain experts (in our context medical professionals or health coaches), and software and AI experts. Each category of users comes with its own needs, goals and limitations. AI expert users, for example, use XAI to verify or improve the underlying AI system, whereas domain experts can leverage explanations to gain additional insights and learn from the system. Lay users have their own set of goals, but more interestingly their own array of limitations as well. Wang et al. have pointed out several shortcomings in non-expert users related to cognitive biases, such as confirmation and anchoring bias, due to a backward-oriented, hypothesis-driven reasoning process [12]. Tsai et al. also noticed a reinforcing effect, where users avoid interacting with content they are not familiar with [13]. Szymanski et al. additionally pointed out that non-expert users, despite having these biases and incorrectly interpreting certain complex explanations, can still prefer them over other, simpler explanation modalities [14].

Thus we see that interpretability through explanations has multiple benefits and can result in increased trust towards the system. However, as previously mentioned, the adoption of explanations in HRS is still low. Furthermore, most health-related AI explanations are studied with AI and domain expert users in mind [15], which leaves a big gap for explanations aimed at lay users. Keeping in mind the aforementioned biases that lay users are prone to, it is therefore paramount to assess whether explanations are indeed interpretable, to make sure no misalignment in trust is created.

With these considerations in mind, we investigate the following research questions:

RQ1 What explanation design do lay users prefer when explaining health recommendations, and why?

RQ2 What design considerations are essential when explaining health recommendations to lay users?

2. Explanation designs

As mentioned in section 1.1, we focus on designing different explanations that explain why users are receiving specific recommendations for their pain flare-ups. Keeping the context and type of end users in mind, the following design guidelines have to be kept in mind for all variants of explanations:

• Mobile-friendly: as the explanations will be offered within the context of the mobile health app, they have to be well suited for display on a small mobile screen.
• Summative: the explanations should be able to summarise categorical data, as the input consists of (semi-)unstructured user input.
• Suited for non-experts: as the end users are non-experts, the explanations should not use any advanced or statistical concepts to explain why the recommendation is suggested.

Keeping these criteria in mind, we came up with the following designs (Figure 1), based on well-known and widely used explanation types:

• Text-based: briefly explains why the recommendation is related to the most prevalent input. The wording is based on the guidelines for communicating health-related news to patients described by [16], and these explanations were collaboratively designed for the purpose of this study by six ergo- and physiotherapists.
• Text-based + inline reply: an addition to the textual explanation, where the inline reply shows which specific user message contributed most to the recommendation.
• Tags: tags are a common method of communicating all topics that are relevant to a recommendation (e.g. Bidargaddi et al. [17]).
• Word clouds: in addition to showing all relevant topics, word clouds are able to communicate the relative importance/relevance of these topics (e.g. [18, 19]).
• Feature importances (FI): feature importance bars communicate the contributing themes of the user input, as well as their relevance, albeit in a more specific way compared to word clouds.
• Feature importances (FI) + percentages: adds percentages to the FI bars to communicate exact topic importances.

Figure 1: Explanation designs for pain-related health recommendations used throughout the user study: (a) Purely textual, (b) Inline reply, (c) Tags, (d) Word cloud, (e) Feature importance, (f) Feature importance + %.

These explanation designs are sorted from least to most by the amount of information they convey regarding the inputs relevant to the recommendation. The textual explanation only focuses on one input, with the inline reply additionally showing which specific input triggered the recommendation, whereas the tags display all relevant input categories related to the recommendation. The word cloud further builds on this by also displaying the relative importance of each input, and the FI shows the exact ordering of inputs according to importance. The added percentages give the most transparency regarding the inputs, by also displaying the exact values used by the underlying RS.

2.1. Participants

For the user study, we recruited 11 participants out of a pool of 286 people who were already using the mobile health coaching application without the pain logbook and its recommender system (as mentioned in section 1.1), and thus knew and had interacted with its content and different modules. The group consisted of nine women and two men, of whom four finished graduate school, six college, and one high school. Age-wise, 2 participants were between 21-30, 5 between 31-40, 3 between 41-50 and 1 between 51-60. All 11 users reported using the internet on a regular basis, with 6 participants stating they were average computer and IT users, and 5 stating they were advanced computer and IT users.

2.2. Protocol of the evaluation study

At the start of the study, users were briefed on the purpose and context of the think-aloud study and gave their consent to having the audio recorded, after which they filled in the ResQue demographics questionnaire [20]. Afterwards, they were guided through the pain logbook, which they had to fill in with a recent pain episode they experienced in mind. Having done so, they received some information regarding the recommendations that were going to be given, along with the explanations. We briefly went over the six explanation designs in a fixed order, after which we asked the participants to "explain what they like or dislike about the explanation" separately for each design once they had seen them all. To conclude this preference elicitation, the users had to sort the explanations by preference, with 1 being their most preferred and 6 their least preferred. They also had to give (or repeat) a key reason as to why they gave each explanation a certain ranking. The audio recordings of both the preference elicitation and the ranking were used afterwards for a thematic analysis.

2.3. Data analysis

The thematic analysis was done in two phases, with the first phase consisting of deriving granular themes with two researchers, and the second phase focusing on merging them into higher-level themes with a third researcher. The resulting higher-level themes are displayed in Figure 3, along with the frequencies with which they occur per explanation design. The agreement percentage of the first-phase two-coder thematic analysis is 88.1%, with Cohen's kappa being κ = 0.66, indicating substantial inter-coder agreement [21].
3. Results

Taking the average ranking scores of all explanation designs, we are able to rank the 6 explanation modalities from best to worst, along with the results from the thematic analysis to explain why each explanation type scored poorly or adequately. Figure 2 shows the frequencies of the rankings given to each explanation design.

Figure 2: Frequencies of rankings per explanation type

Figure 3: TA themes per explanation design and their frequencies

3.1. Feature importance + percentage

Rank: 1 (best) · This explanation type was favored by most users, mainly due to the fact that it provided the most insight and transparency (n = 10). Only three out of 11 people found the addition of percentages to the feature importance bars to be inefficacious.

Insights through XAI (+)
Six users liked the fact that they were able to gain more insight through this explanation modality. Four users also stated that the percentages were a "nice-to-know", making the explanation more useful and informative.

Negative sentiment towards XAI (-)
On the flip side, two users disliked the addition of displayed percentages, stating that when it comes to emotions and feelings, certain aspects are not quantifiable. U4 stated: "Personally I think feelings are not quantifiable. The bars are good, but don't put an exact number on it. It's okay if you're communicating frequencies, like how often an emotion occurred for example."

Visual/information overload (-)
Two users also stated that the addition of percentages is unnecessary, mentioning that only using bars to communicate importances is sufficient.

3.2. Feature importance

Rank: 2 · The feature importance explanation was among the most preferred explanations, liked for the fact that it was able to give a summary of the user input (n = 11), as well as being able to give additional insights (n = 2).

Provides summary (+)
Six users found the feature importance bars to be a clear way of communicating input topics and their importance. Four users stated that it gave them a nice overview of their input.

Insights through XAI (+)
Two users specifically liked the additional insights that they were able to get from the feature importances. U4 mentioned: "There are of course no numbers given, but I can assume that I am really frustrated, and a bit less angry. I find it interesting to reflect on results that come out of a questionnaire."

Negative sentiment towards XAI (-)
Three users were unsure about the ranking of some topics, stating that they agreed with the general content, but not with why one topic was deemed more important than others. This caused these users to slightly dislike and distrust the system, and give it a lower ranking.

Visual/information overload (-)
Two users found the bars to be unnecessary, giving them information as to what contributed towards the recommendation, but not why, like the textual explanation did. U6 stated: "There is not a lot of background given. It shows that these inputs contributed to my recommendation, but not why."

3.3. Tags

Rank: 3 · Tags scored relatively better than the three lower-ranked explanations in terms of average ranking, and were liked for their summative ability (n = 8). Only people who disliked having a lot of information were less in favor of the tag explanation (n = 2).

Provides summary (+)
Four users found tags to be a nice way of providing a summary of their input. Four users also stated that doing so is a clear and concise method of explaining why the recommendation is given.

Insights through explanation (+)
Three users were fond of the additional insights they got from the tags and the general themes that were present in their input. U3 stated: "When inputting my feelings I did not necessarily perceive them as negative or angry. But based on these tags, I'm able to see: okay, this is how the app interprets my feelings."

Visual/information overload (-)
Only two users stated that tags were unnecessary or provided too much information. U6 stated: "Yes it's clear, but less practical. I tend to focus on one thing at a time."

3.4. Purely textual

Rank: 4 · Purely textual explanations received mixed reactions during the think-aloud study. When users liked or agreed with the recommendation, the textual explanation was a welcome addition, helping them understand the recommendation process and the recommendation itself, and giving them a nice summary of why the recommendation matched their inputs (n = 8). However, when the recommendation wasn't in line with the user's expectations, the textual explanation highlighted the mismatch even more and caused a poor reception of the recommender system in general (n = 5). Here is an overview of these topics:

Provides summary (+)
Six users found that the textual explanation was able to summarize their input quite well, albeit only focusing on one topic (the most relevant one) surrounding the recommendation.

Positive sentiment towards explanation (+)
Two users stated that the written explanation was confirming and comforting. One user also stated that the wording of the textual explanation felt less confronting regarding their negative input.

Negative sentiment towards explanation (-)
On the other hand, three users mentioned that they could not relate to the recommendation, and that the textual explanation highlighted this fact. U4 also found the explanation provoking, stating: "I know that I'm frustrated and that it does not help. However, explaining that acts like waving a red flag in front of a bull."

3.5. Inline reply

Rank: 5 · During the think-aloud study, the inline reply received relatively positive feedback and comments regarding the succinct summary it gave of the user's input (n = 7), with only some minor remarks regarding the presentation of the explanation (n = 3). However, it scored quite low during the preference ranking itself, due to other explanation modalities simply being preferred over the inline reply.

Provides summary (+)
Six users found the explanation modality to be clear and more concrete, and one user additionally stated that showing which message triggered the recommendation requires less analysis from the user.

Insights through explanation (+)
Three users liked the fact that the inline reply raises awareness that the recommendation is related to one of their own inputs. U3 stated: "I find it better than the textual explanation. There, they state 'You seem to be frustrated', and here you really are made aware of the fact that it's your own input."

Problem with representation (-)
Only some minor and infrequent negative remarks were given surrounding inline replies. Three users disliked the fact that by highlighting or repeating their negative input, they are confronted with it more. One user additionally mentioned that this explanation makes it feel like the recommendation is only tuned to one input instead of multiple user inputs, making it feel too specific.

3.6. Word cloud

Rank: 6 (last) · The word cloud received the lowest average score. In general, users liked the addition of displaying keyword or topic importance; however, using a word cloud to do so proved to be an inferior solution. The thematic analysis points out two main negative themes as to why this explanation was disliked, problems with representation and content (n = 9) and visual/information overload (n = 4), and one positive theme, insights through explanation (n = 4).

Problems with representation (-)
Three users pointed out that having keyword size communicate importance was unclear, and would rather have something concrete like bars indicating exact relevance. Three users also pointed out that the inconsistent sizes inherent to the design of word clouds were visually displeasing. Two users additionally stated that highlighting important keywords might be too confronting with respect to their own input; e.g. if a user inputs that they are feeling sad, having it displayed as a large word might confront the user too much with their state of mind.

Visual/information overload (-)
Three users found the addition of displaying relevance in such a way unnecessary, one of whom additionally stated that adding the information in such a way is too distracting.

Insights through explanation (+)
Four users stated, however, that adding this information about keyword relevance gives more insight, due to not only showing the relevant topics but their importance as well.
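The best-to-worst ordering used throughout section 3 — each participant ranks every design from 1 (most preferred) to 6 (least preferred), and designs are sorted by their mean rank — can be sketched as follows. The rankings below are illustrative placeholders, not the study's actual data:

```python
from statistics import mean

# Each participant ranks all six designs: 1 = most preferred, 6 = least.
# Illustrative rankings for three participants (not the study's data).
rankings = [
    {"FI+%": 1, "FI": 2, "Tags": 3, "Textual": 4, "Inline": 5, "Word cloud": 6},
    {"FI+%": 2, "FI": 1, "Tags": 3, "Textual": 5, "Inline": 4, "Word cloud": 6},
    {"FI+%": 1, "FI": 3, "Tags": 2, "Textual": 4, "Inline": 6, "Word cloud": 5},
]

# Average rank per design; a lower mean rank means a better overall placement.
designs = rankings[0].keys()
avg_rank = {d: mean(r[d] for r in rankings) for d in designs}
ordered = sorted(avg_rank, key=avg_rank.get)
```

Note that a mean-rank ordering alone hides disagreement between participants, which is why Figure 2 additionally reports the full frequency of ranks per design.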
4. Discussion

We will now discuss some of the most prevalent observations that were present across several explanation designs, as well as suggest guidelines on how to design health explanations for lay users experiencing (chronic) pain.

4.1. Beware of confronting people with negative sentiments

People experiencing (chronic) pain or illness can feel distress when receiving negative information surrounding their state. In our study, we noticed that highlighting keywords that are potentially negative (e.g. negative emotions, reactions, etc.) can cause distress with users and therefore make them dislike the explanation. This was apparent with the inline reply and word cloud explanations, where visually highlighting negative sentiments that relate to the recommendation caused users to dislike the explanation.

4.2. Use tags or feature importance when control is needed

Because tags and FI/FI+% are able to display multiple input categories, users expressed that this would provide them more control over the recommendation process, if the design or implementation allows for it. One user suggested that tapping certain topics could be useful to request recommendations in a more user-controlled way. Other users made additional suggestions, U9: "It's nice if you can individually remove certain topics", and U7: "... especially if you notice something that wasn't interpreted the way you intended it".

4.3. Design FI through a lay user's perspective

The FI and FI+% designs were favored by most users, giving most of them the insight and summary they needed. However, as mentioned in section 3.2, U4 interpreted the FI bars as "... I can assume that I am really frustrated, and a bit less angry", indicating that they saw it as an overview of their input, and not as how strongly their input relates to the recommendation. In total, 10 out of 11 lay users interpreted FI differently than intended. Only U4 was able to correctly interpret the bars (after reading the text above the FI bars, "This is how your inputs relate to the recommendation"), saying "The frustrated bar is the biggest, okay, so that contributes most to my recommendation". Having a wrong interpretation could lead to confusion towards the system when, for example, a next recommendation is shown and the input keywords and their relevance change with respect to this new recommendation. However, overcoming biases and changing the mental models of lay users often proves to be difficult. A possible design adaptation to the FI and FI+% design may show a general overview/summary of the user input, in line with what users were interpreting, and then highlight the keywords that are relevant to the recommendation being shown. This can be seen in Figure 4. Keeping the control aspect from the previous section in mind, users are also able to tap on different topics to request recommendations regarding said topic.

Figure 4: Adapted feature importance explanation design

4.4. Insight vs. information overload

Users generally liked the holistic approach of the feature importances, and were more inclined to look into the recommendation itself. When asked why they liked the recommendations more when explained using FI compared to the purely textual explanation, they stated that the FI were able to show them a general overview of themselves as a person.

On the other hand, there were also some users who disagreed with the ordering of keyword importances that the feature importance bars displayed, causing a slight increase in distrust towards the recommender system and a lower ranking for the explanation. This is to be expected, as increasing the transparency of explanations can cause a larger drop in trust towards the system if the content of the explanation or recommendation does not align with the user's expectations. However, the effect of a misaligned textual explanation is still stronger, as users who did not agree with either the recommendation or the explanation expressed a more negative sentiment towards the recommendation, and gave the textual recommendation a lower ranking. This is in line with similar research by Balog et al. [5], in which they state that misaligned recommendations that focus on a single topic or item are more susceptible to a lower perceived quality of explanation compared to multi-item recommendations.

5. Conclusion

This paper introduced several explanation designs for mobile pain-related health recommendations, and compared them among lay users. Most users preferred the added transparency provided by the tags and FI/FI+% designs, stating that these gave them a brief and clear overview of their input, which helped them understand why they received certain recommendations. Another interesting finding is that designs should be careful with visually highlighting negative sentiments of users. Designs that did so, i.e. the inline reply and word cloud, were received poorly by users. Lastly, we confirmed that lay users might interpret certain visual explanations differently than intended, yet still prefer them over others. Given their feedback, we presented an adapted design of the favoured FI/FI+% explanation to be in line with what lay users expect.

6. Limitations & Future work

The qualitative aspect of this study was already able to point out several key aspects related to designing health explanations for patients experiencing chronic pain. However, a larger-scale quantitative user study is needed to further investigate these results. One such aspect is the fact that some users preferred textual explanations over explanations that offered more information. Investigating whether this correlates with the user's need for cognition (NFC), and what its implications are, can prove to be an interesting research direction, similar to the research of Millecamp et al. [22]. Another aspect is the fact that while most users disliked being confronted with their negative input, some did not mind. This could be related to the "warriors vs. worriers" research, in which some users experiencing chronic pain actually prefer being exposed to negative feedback so they can address it, and could prove useful for further research [23]. Future research should also consider other designs to explain health recommendations and elaborate design guidelines that can be used by researchers and practitioners in this exciting domain. In addition, an interesting further line of research is to personalise these explanations on the fly, based on interaction data of end users. As in the work of [24], clicks and hover interactions, as well as eye-gaze data, can be considered for such personalisation.

Acknowledgments

This work is part of the research projects Personal Health Empowerment (PHE), project number HBC.2018.2012, financed by Flanders Innovation & Entrepreneurship, and IMPERIUM, project number G0A3319N, financed by Research Foundation Flanders (FWO).

References

[1] R. De Croon, L. Van Houdt, N. N. Htun, G. Štiglic, V. Vanden Abeele, K. Verbert, Health recommender systems: Systematic review, J Med Internet Res 23 (2021) e18035. doi:10.2196/18035.

[2] F. Torrent-Fontbona, B. Lopez, Personalized adaptive CBR bolus recommender system for type 1 diabetes, IEEE Journal of Biomedical and Health Informatics 23 (2019) 387–394. doi:10.1109/JBHI.2018.2813424.

[3] R. Gouveia, E. Karapanos, M. Hassenzahl, How do we engage with activity trackers? A longitudinal study of Habito, in: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '15, 2015, pp. 1305–1316. doi:10.1145/2750858.2804290.

[4] K. Cheung, W. Ling, C. J. Karr, K. Weingardt, S. M. Schueller, D. C. Mohr, Evaluation of a recommender app for apps for the treatment of depression and anxiety: An analysis of longitudinal user engagement, Journal of the American Medical Informatics Association 25 (2018) 955–962. doi:10.1093/jamia/ocy023.

[5] K. Balog, F. Radlinski, Measuring recommendation explanation quality: The conflicting goals of explanations, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 329–338. doi:10.1145/3397271.3401032.

[6] A. Calero Valdez, M. Ziefle, K. Verbert, HCI for recommender systems: The past, the present and the future, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 123–126. doi:10.1145/2959100.2959158.

[7] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey on methods and metrics, Electronics 8 (2019). doi:10.3390/electronics8080832.

[8] E. Wayman, S. Madhvanath, Nudging grocery shoppers to make healthier choices, in: Proceedings of the Ninth ACM Conference on Recommender Systems, RecSys '15, ACM, 2015, pp. 289–292. doi:10.1145/2792838.2799669.

[9] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, B. Séroussi, Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach, Artificial Intelligence in Medicine 94 (2019) 42–53. doi:10.1016/j.artmed.2019.01.001.

[10] A. Bussone, S. Stumpf, D. M. O'Sullivan, The role of explanations on trust and reliance in clinical decision support systems, in: 2015 International Conference on Healthcare Informatics, 2015, pp. 160–169.

[11] M. Ribera, A. Lapedriza, Can we do better explanations? A proposal of user-centered explainable AI, CEUR Workshop Proceedings 2327 (2019).

[12] D. Wang, Q. Yang, A. Abdul, B. Y. Lim, Designing theory-driven user-centric explainable AI, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–15. doi:10.1145/3290605.3300831.

[13] C.-H. Tsai, P. Brusilovsky, Beyond the ranked list: User-driven exploration and diversification of social recommendation, in: 23rd International Conference on Intelligent User Interfaces, IUI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 239–250. doi:10.1145/3172944.3172959.

[14] M. Szymanski, M. Millecamp, K. Verbert, Visual, textual or hybrid: The effect of user expertise on different explanations, in: 26th International Conference on Intelligent User Interfaces, IUI '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 109–119. doi:10.1145/3397481.3450662.

[15] J. Ooge, G. Stiglic, K. Verbert, Explaining artificial intelligence with visual analytics in healthcare, WIREs Data Mining and Knowledge Discovery 12 (2021). doi:10.1002/widm.1427.

[16] M. Schmid Mast, A. Kindlimann, W. Langewitz, Recipients' perspective on breaking bad news: How you put it really makes a difference, Patient Education and Counseling 58 (2005) 244–251. doi:10.1016/j.pec.2005.05.005.

[17] N. Bidargaddi et al., Recommendation service for a curated list of readily available mental health and well-being mobile apps for young people: Randomized controlled trial, Journal of Medical Internet Research 19 (2017). doi:10.2196/jmir.6775.

[18] Y. Wu, M. Ester, FLAME: A probabilistic model combining aspect based opinion mining and collaborative filtering, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 199–208. doi:10.1145/2684822.2685291.

[19] C.-H. Tsai, P. Brusilovsky, Evaluating visual explanations for similarity-based recommendations: User perception and performance, in: Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, UMAP '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 22–30. doi:10.1145/3320435.3320465.

[20] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 157–164. doi:10.1145/2043932.2043962.

[21] N. J.-M. Blackman, J. J. Koval, Interval estimation for Cohen's kappa as a measure of agreement, Statistics in Medicine 19 (2000) 723–741. doi:10.1002/(SICI)1097-0258(20000315)19:5<723::AID-SIM379>3.0.CO;2-A.

[22] M. Millecamp, N. N. Htun, C. Conati, K. Verbert, To explain or not to explain: The effects of personal characteristics when explaining music recommendations, in: Proceedings of the 24th International Conference on Intelligent User Interfaces, IUI '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 397–407. doi:10.1145/3301275.3302313.

[23] J. Geuens, T. Swinnen, L. Geurts, R. Westhovens, R. De Croon, V. Vanden Abeele, Worriers versus warriors: Tailoring mHealth to address differences in patients with chronic arthritis, in: 2020 IEEE International Conference on Healthcare Informatics (ICHI), 2020, pp. 1–12. doi:10.1109/ICHI48887.2020.9374322.
medical Education and Training in Communication. [24] M. Millecamp, T. Willemot, K. Verbert, Your eyes ex- [17] N. Bidargaddi, P. Musiat, M. Winsall, G. Vogl, plain everything: exploring the use of eye tracking V. Blake, S. Quinn, S. Orlowski, G. Antezana, to provide explanations on-the-fly, in: Proceedings G. Schrader, Efficacy of a web-based guided rec- of the 8th Joint Workshop on Interfaces and Hu- 9 Maxwell Szymanski et al. CEUR Workshop Proceedings 1–10 man Decision Making for Recommender Systems co-located with 15th ACM Conference on Recom- mender Systems (RecSys 2021), volume 2948, CEUR Workshop Proceedings, 2021, pp. 89–100. 10