=Paper= {{Paper |id=Vol-3124/paper1 |storemode=property |title=Explaining Health Recommendations to Lay Users: The Dos and Dont's |pdfUrl=https://ceur-ws.org/Vol-3124/paper1.pdf |volume=Vol-3124 |authors=Maxwell Szymanski,Vero Vanden Abeele,Katrien Verbert |dblpUrl=https://dblp.org/rec/conf/iui/SzymanskiAV22 }} ==Explaining Health Recommendations to Lay Users: The Dos and Dont's== https://ceur-ws.org/Vol-3124/paper1.pdf
Explaining health recommendations to lay users: The dos
and don’ts
Maxwell Szymanski1 , Vero Vanden Abeele1 and Katrien Verbert1
1
    Department of Computer Science, KU Leuven, Leuven, Belgium


                                             Abstract
In recent years, mobile health recommendations have been used in an increasing number of applications. Researchers have highlighted the importance of explaining these recommendations to lay users, citing benefits such as increased trust and a higher tendency to follow up on the recommendations. However, the choice of explanation modality can affect the way users perceive a recommendation, either positively or negatively. This paper explores and evaluates six explanation designs through a qualitative user study, and presents general design guidelines and considerations for explaining pain-related health recommendations to lay users.

                                             Keywords
                                             explainable AI, explainable recommender systems, explanation interpretation, lay users, health recommendations, HRS



1. Introduction & Related Work

Recommender systems are becoming more prevalent in health-related domains. However, several key aspects have to be taken into account when designing recommender systems, such as transparency through explanations and end user expertise.

1.1. RecSys in Health

Recommender systems (RS) have become prominent in health applications, where they help retrieve relevant information or recommend possible next actions tailored to the needs of the end user. These health recommender systems (HRS) are used both in clinical settings and in personal contexts where health applications aid users in their daily lives. A recent systematic review [1] of HRS for lay users shows that the majority of HRS that use a graphical user interface focus on mobile applications. These mobile HRS span several fields, such as sports, mental health and nutrition, and include applications that e.g. suggest the appropriate action to take for users with diabetes [2], recommend activities to promote healthier lifestyles [3] or help with anxiety by recommending external apps that suit the user's needs [4]. These recommender systems all share the main goal of steering the user towards a better and healthier lifestyle.
   However, the increased use of HRS is also paralleled by certain barriers. One such issue is a mismatch between recommendations and the user's expectations. Such a mismatch can not only lead to a decrease in system effectiveness [5], but to a decrease in trust towards the system as well, potentially steering the user away from future use of such HRS. Early research mainly focused on increasing the accuracy of RS in order to mitigate this issue. However, Valdez et al. [6] explain that recent research has undergone a shift in focus from improving accuracy to exploring the effects of human factors. This broader approach to reasoning about RS should allow researchers to improve RS effectiveness beyond quantitative algorithmic capability. The new approach includes research on, and the addition of, explanations to increase transparency, human-in-the-loop feedback to correct misunderstandings, and conversational RS to increase familiarity with the system's interface.
   In this paper, we focus on the explanation aspect, more specifically on designing and assessing different explanation types for a mobile health recommender system. The research is conducted in the context of a personal coaching app that guides users with chronic musculoskeletal pain through various informative and interactive topics, such as activity- and stress-management, pain education, etc. Additionally, the app includes a pain logbook that can be used for logging pain flare-ups. Using this logged information, which consists of the context in which the pain occurred, as well as the thoughts and reactions users had, the app is able to give personalised recommendations to better cope with pain flare-ups in the future. In this study, we look into several designs that are deemed fit for explaining these pain-related recommendations to end users. There remain, however, several research challenges that need to be addressed, such as explanation interpretability and end user expertise, which are discussed in the next related work section.

Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland
maxwell.szymanski@kuleuven.be (M. Szymanski); vero.vandenabeele@kuleuven.be (V. V. Abeele); katrien.verbert@kuleuven.be (K. Verbert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org




Maxwell Szymanski et al. CEUR Workshop Proceedings                                                                1–10



1.2. Explaining health recommendations

As highlighted earlier, adding explanations to recommendations can improve overall effectiveness. Explanations make the system interpretable, which in turn can improve trust towards the system [7]. There exist HRS that explain their rationale to the end user, such as the food recommender system of Wayman et al., which explains why certain recipes are recommended based on the user's nutritional intake [8], or a visualisation for medical experts that is able to explain breast cancer similarities [9]. However, the systematic review of De Croon et al. states that only 10% of HRS that focus on lay users make use of explanations. This makes HRS explanations for lay users a novel but under-explored topic. Additionally, a study by Bussone et al. points out that providing overly detailed explanations for health recommenders can create unforeseen effects, such as over-reliance on explanations [10], which indicates that health recommender explanations should be designed with sufficient care. This makes designing explanations with non-expert users in mind, and evaluating them with end users, paramount.

1.3. End user expertise

An increasing amount of research has pointed out that the expertise of end users should be taken into account when designing explanations. Ribera et al. [11] have proposed three main categories of end users: non-experts (lay users), domain experts (in our context, medical professionals or health coaches) and software- and AI-experts. Each category of users comes with its own needs, goals and limitations. AI expert users, for example, use XAI to verify or improve the underlying AI system, whereas domain experts can leverage explanations to gain additional insights and learn from the system. Lay users have their own set of goals, but more interestingly their own array of limitations as well. Wang et al. have pointed out several shortcomings in non-expert users related to cognitive biases, such as confirmation and anchoring bias, due to a backward-oriented, hypothesis-driven reasoning process [12]. Tsai et al. also noticed a reinforcing effect, where users avoid interacting with content they are not familiar with [13]. Szymanski et al. additionally pointed out that non-expert users, despite having these biases and incorrectly interpreting certain complex explanations, can still prefer them over other, simpler explanation modalities [14].
   Thus we see that interpretability through explanations has multiple benefits and can result in increased trust towards the system. However, as previously mentioned, the adoption of explanations in HRS is still low. Furthermore, most health-related AI explanations are researched with AI and domain expert users in mind [15], which leaves a big gap for explanations aimed at lay users. Keeping in mind the aforementioned biases that lay users are prone to, it is therefore essential to assess whether explanations are indeed interpretable, to make sure no misalignment in trust is created.
   With these considerations in mind, we investigate the following research questions:

  RQ1 What explanation design do lay users prefer when explaining health recommendations, and why?

  RQ2 What design considerations are substantial when explaining health recommendations to lay users?

2. Explanation designs

As mentioned in section 1.1, we focus on designing different explanations that explain why users are receiving specific recommendations for their pain flare-ups. Keeping the context and type of end users in mind, the following design guidelines have to be respected for all variants of explanations:

     • Mobile-friendly: as the explanations will be offered within the context of the mobile health app, they have to be well-suited for display on a small mobile screen.
     • Summative: the explanations should be able to summarise categorical data, as input consists of (semi-)unstructured user input.
     • Suited for non-experts: as the end users are non-experts, the explanations should not use any advanced statistical concepts to explain why the recommendation is suggested.

   Keeping these criteria in mind, we came up with the designs in Figure 1, based on well-known and widely used explanation types:

     • Text-based: briefly explains why the recommendation is related to the most prevalent input. The wording is based on the "communicating health-related news to patients" guidelines described by [16], and these explanations were collaboratively designed for the purpose of this study by six ergo- and physiotherapists.
     • Text-based + inline reply: an addition to the textual explanation, where the inline reply shows which specific user message contributed most to the recommendation.
     • Tags: tags are a common method of communicating all topics that are relevant to a recommendation (e.g. Bidargaddi et al. [17]).
     • Word clouds: in addition to showing all relevant topics, word clouds additionally communicate the relative importance/relevance of these topics (e.g. [18, 19]).



                                                          2
Maxwell Szymanski et al. CEUR Workshop Proceedings                                                                       1–10




         (a) Purely textual                          (b) Inline reply                                (c) Tags




          (d) Word cloud                         (e) Feature importance                     (f) Feature importance + %

Figure 1: Explanation designs for pain-related health recommendations used throughout the user study



     • Feature-importances (FI): feature importance bars communicate the contributing themes of the user input, as well as their relevance, albeit in a more specific way compared to word clouds.
     • Feature-importances (FI) + percentages: adds percentages to the FI bars to communicate exact topic importances.

   These explanation designs are sorted from least to most by the amount of information they convey regarding the inputs relevant to the recommendation. The textual explanation only focuses on one input, with the inline reply also being able to show which specific input triggered the recommendation, whereas the tags are able to display all relevant input categories that are related to the recommendation. The word cloud further builds on this by also displaying the relative importance of each input related to the recommendation, and the FI design shows the exact ordering of inputs according to importance. The added percentages give the most transparency regarding the inputs, by also displaying the exact values used by the underlying RS.

2.1. Participants

For the user study, we recruited 11 participants out of a pool of 286 people who were already using the mobile health coaching application without the pain logbook and its recommender system, as mentioned in section 1.1, and thus knew and had interacted with the content and different modules. The group consisted of nine women and two men, of whom four finished graduate school, six college, and one high school. Age-wise, 2 participants were between 21-30, 5 between 31-40, 3 between 41-50 and 1 between 51-60. All 11 users reported using the internet on a regular basis, with 6 participants stating they are average computer and IT users, and 5 participants stating they are advanced computer and IT users.



                                                              3
Maxwell Szymanski et al. CEUR Workshop Proceedings                                                                       1–10
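As a rough, purely illustrative sketch of how the FI and FI + percentage designs described above could be produced: per-topic relevance scores can be normalized to percentages and sorted by importance. The topic names, score values, and normalization step below are assumptions for illustration; the paper does not specify the underlying recommender.

```python
# Hypothetical per-topic relevance scores for one logbook entry;
# names and values are illustrative only, not the study's data.
scores = {"frustrated": 0.42, "angry": 0.27, "stressed": 0.21, "tired": 0.10}

def to_percentages(scores):
    """Normalize raw topic scores to whole percentages, sorted by importance."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(topic, round(100 * value / total)) for topic, value in ranked]

for topic, pct in to_percentages(scores):
    # The bar mirrors the FI design; the trailing number mirrors FI + %.
    print(f"{topic:<12} {'#' * (pct // 5):<20} {pct}%")
```

Dropping the printed percentage while keeping the bars would yield the plain FI design, and keeping only the topic names would yield the tag design, matching the least-to-most information ordering described above.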



2.2. Protocol of the evaluation study

At the start of the study, users were briefed on the purpose and context of the think-aloud study, and gave their consent to having the audio recorded, after which they filled in the ResQue demographics questionnaire [20]. Afterwards, they were guided through the pain logbook, which they had to fill in with a recent pain episode they experienced in mind. Having done so, they received some information regarding the recommendations that were going to be given, along with the explanations. We briefly went over the six explanation designs in a fixed order, after which we asked the participants to "explain what they like or dislike about the explanation" separately for each design once they had seen them all. To conclude this preference elicitation, the users had to sort the explanations by preference, with 1 being their most preferred one, and 6 their least preferred. They also had to give (or repeat) a key reason as to why they gave each explanation a certain ranking. The audio recordings of both the preference elicitation and the ranking were used afterwards for a thematic analysis.

2.3. Data analysis

The thematic analysis was done in two phases, with the first phase consisting of deriving granular themes from the thematic analysis with two researchers, and the second phase focusing on merging them into higher-level themes with a third researcher. The resulting higher-level themes are displayed in Figure 3, along with the frequencies in which they occur per explanation design. The agreement percentage of the first-phase two-coder thematic analysis is 88.1%, with Cohen's kappa being 𝜅 = 0.66, resulting in a substantial inter-coder agreement [21].

3. Results

Taking the average ranking scores of all explanation designs, we are now able to rank the 6 explanation modalities from best to worst, along with the results from the thematic analysis explaining why each explanation type scored poorly or adequately. Figure 2 shows the frequencies of the rankings given to each explanation design.

3.1. Feature importance + percentage

Rank: 1 (best) · This explanation type was favored by most users, mainly due to the fact that it provided the most insight and transparency (𝑛 = 10). Only three out of 11 people found the addition of percentages to the feature importance bars to be inefficacious.

Insights through XAI (+)

Six users liked the fact that they were able to gain more insight through this explanation modality. Four users also stated that the percentages were a "nice-to-know", making the explanation more useful and informative.

Negative sentiment towards XAI (-)

On the flip side, two users disliked the addition of displaying percentages, stating that when it comes to emotions and feelings, certain aspects are not quantifiable. U4 stated: "Personally I think feelings are not quantifiable. The bars are good, but don't put an exact number on it. It's okay if you're communicating frequencies, like how often an emotion occurred for example."

Visual/information overload (-)

Two users also stated that the addition of percentages is unnecessary, mentioning that only using bars to communicate importances is sufficient.

3.2. Feature importance

Rank: 2 · The feature importance explanation was among the most preferred explanations, liked for the fact that it was able to give a summary of the user input (𝑛 = 11), as well as being able to give additional insights (𝑛 = 2).

Provides summary (+)

Six users found the feature importance bars to be a clear way of communicating input topics and their importance. Four users stated that it gives them a nice overview of their input.

Insights through XAI (+)

Two users specifically liked the additional insights that they were able to get from the feature importances. U4 mentioned: "There are of course no numbers given, but I can assume that I am really frustrated, and a bit less angry. I find it interesting to reflect on results that come out of a questionnaire."

Negative sentiment towards XAI (-)

Three users were unsure of the ranking of some topics, stating that they agreed with the general content, but not with why one topic was deemed more important than others. This caused these users to slightly dislike and distrust the system, and give it a lower ranking.
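The "best to worst" ordering used throughout these results comes from averaging each design's 1-6 preference ranks across participants, as described at the start of Section 3. A minimal sketch, with hypothetical per-participant rankings for two of the designs (the real ranking frequencies are shown in Figure 2):

```python
# Hypothetical rank each of the 11 participants gave to two designs
# (1 = most preferred, 6 = least preferred); values are illustrative only.
rankings = {
    "FI + percentage": [1, 1, 2, 1, 3, 2, 1, 4, 2, 1, 2],
    "Word cloud":      [6, 5, 6, 6, 4, 6, 5, 6, 6, 5, 6],
}

def rank_designs(rankings):
    """Sort designs from best (lowest mean rank) to worst (highest)."""
    means = {design: sum(ranks) / len(ranks) for design, ranks in rankings.items()}
    return sorted(means.items(), key=lambda kv: kv[1])

for design, mean_rank in rank_designs(rankings):
    print(f"{design}: {mean_rank:.2f}")
```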








Figure 2: Frequencies of rankings per explanation type




Figure 3: TA themes per explanation design and their frequencies
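The substantial inter-coder agreement behind the themes in Figure 3 (88.1% raw agreement, 𝜅 = 0.66, Section 2.3) follows the standard two-coder Cohen's kappa: observed agreement corrected for chance. A minimal sketch with hypothetical theme codings (labels and values are illustrative, not the study's data):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Two-coder Cohen's kappa: observed agreement corrected for chance."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical theme codings for ten participant quotes.
a = ["summary", "insight", "overload", "summary", "insight",
     "summary", "overload", "insight", "summary", "summary"]
b = ["summary", "insight", "overload", "summary", "overload",
     "summary", "overload", "insight", "summary", "insight"]
print(round(cohens_kappa(a, b), 2))
```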



Visual/information overload (-)

Two users found the bars to be unnecessary, giving them information as to what contributed towards the recommendation, but not why, like the textual explanation did. U6 stated: "There is not a lot of background given. It shows that these inputs contributed to my recommendation, but not why."

3.3. Tags

Rank: 3 · Tags scored relatively better than the following three explanations in terms of average ranking, and were liked for their summative ability (𝑛 = 8). Only people who disliked having a lot of information were less in favor of the tag explanation (𝑛 = 2).

Provides summary (+)

Four users found tags to be a nice way of providing a summary of their input. Four users also stated that doing so is a clear and concise method of explaining why the recommendation is given.

Insights through explanation (+)

Three users were fond of the additional insights they got from the tags and the general themes that were present in their input. U3 stated: "When inputting my feelings I did not necessarily perceive them as negative or angry. But based on these tags, I'm able to see: okay, this is how the app interprets my feelings."

Visual/information overload (-)

Only two users stated that tags were unnecessary or provided too much information. U6 stated: "Yes it's clear, but less practical. I tend to focus on one thing at a time."

3.4. Purely textual

Rank: 4 · Purely textual explanations received mixed reactions during the think-aloud study. When users liked or agreed with the recommendation, the textual explanation was a welcome addition, helping them understand the recommendation process and the recommendation itself, and gave users a nice summary of why the recommendation matched their inputs (𝑛 = 8). However, when the recommendation wasn't in line with the user's expectations, the textual explanation highlighted the mismatch






even more and caused a poor reception of the recommender system in general (𝑛 = 5). Here is an overview of these topics:

Provides summary (+)

Six users found that the textual explanation was able to summarize their input quite well, albeit only focusing on one topic (the most relevant one) surrounding the recommendation.

Positive sentiment towards explanation (+)

Two users stated that the written explanation was confirming and comforting. One user also stated that the wording of the textual explanation felt less confronting regarding their negative input.

Negative sentiment towards explanation (-)

On the other hand, three users mentioned that they cannot relate to the recommendation, and that the textual explanation highlighted this fact. U4 also found the explanation to be provoking, stating the following: "I know that I'm frustrated and that it does not help. However, explaining that acts like waving a red flag in front of a bull."

3.5. Inline-reply

Rank: 5 · During the think-aloud study, the inline reply received relatively positive feedback and comments regarding the succinct summary it gave of the user's input (𝑛 = 7), with only some minor remarks regarding the presentation of the explanation (𝑛 = 3). However, it scored quite low during the preference ranking itself, due to other explanation modalities simply being preferred over the inline reply.

Provides summary (+)

Six users found the explanation modality to be clear and more concrete, and one user additionally stated that showing which message triggered the recommendation requires less analysis from the user.

Insights through explanation (+)

Three users liked the fact that the inline reply raises awareness that the recommendation is related to one of their own inputs. U3 stated: "I find it better than the textual explanation. There, they state 'You seem to be frustrated', and here you really are made aware of the fact that it's your own input."

Problem with representation (-)

Only some minor and infrequent negative remarks were given surrounding inline replies. Three users disliked the fact that by highlighting or repeating their negative input, they are more confronted with it. One user additionally mentioned that this explanation feels like the recommendation is only tuned to one input instead of multiple user inputs, making it feel too specific.

3.6. Word cloud

Rank: 6 (last) · The word cloud received the lowest average score. In general, users liked the addition of displaying keyword or topic importance; however, using a word cloud to do so proved to be an inferior solution. The thematic analysis points out two main negative themes as to why this explanation is disliked: problems with representation and content (𝑛 = 9) and visual/information overload (𝑛 = 4), and one positive theme, insights through explanation (𝑛 = 4).

Problems with representation (-)

Three users pointed out that having keyword size communicate importance was unclear, and would rather have something concrete like bars indicating exact relevance. Three users also pointed out that the inconsistent sizes inherent to the design of word clouds were visually displeasing. Two users additionally stated that highlighting important keywords might be too confronting with respect to their own input; e.g. if a user inputs that they are feeling sad, having it displayed as a large word might confront the user too much with their state of mind.

Visual/information overload (-)

Three users found the addition of displaying relevance in such a way unnecessary, one of whom additionally stated that adding the information in such a way is too distracting.

Insights through explanation (+)

Four users stated, however, that adding this information of keyword relevance gives more insight, due to not only showing the relevant topics, but their importance as well.

4. Discussion

We will now discuss some of the most prevalent observations that were present across several explanation designs, as well as suggest guidelines on how to design health explanations for lay users experiencing (chronic) pain.




                                                            6
Maxwell Szymanski et al. CEUR Workshop Proceedings                                                                      1–10



4.1. Beware of confronting people with negative sentiments

People experiencing (chronic) pain or illness can feel distress when receiving negative information surrounding their state. In our study, we noticed that highlighting keywords that are potentially negative (e.g. negative emotions, reactions, etc.) can cause distress with users and therefore make them dislike the explanation. This was apparent with the inline-reply and word cloud explanations, where visually highlighting negative sentiments that relate to the recommendation caused users to dislike the explanation.

4.2. Use tags or feature importance when control is needed

Because tags and FI/FI+% are able to display multiple input categories, users expressed positively that this would provide them with more control over the recommendation process, if the design or implementation allows for it. One user suggested that tapping certain topics could be useful to request recommendations in a more user-controlled way. Other users made additional suggestions; U9: “It’s nice if you can individually remove certain topics”, and U7: “... especially if you notice something that wasn’t interpreted the way you intended it”.

4.3. Design FI through a lay user’s perspective

The FI and FI+% designs were favored by most users, giving them the insight and summary they needed. However, as mentioned in section 3.2, U4 interpreted the FI bars as “... I can assume that I am really frustrated, and a bit less angry”, indicating that they saw them as an overview of their input, and not as how strongly their input relates to the recommendation. In total, 10 out of 11 lay users interpreted FI differently than intended. Only U4 was able to correctly interpret the bars (after reading the text above the FI bars - “This is how your inputs relate to the recommendation”), saying “The frustrated bar is the biggest, okay, so that contributes most to my recommendation”. Having a wrong interpretation could lead to confusion towards the system when, for example, a next recommendation is shown and the input keywords and their relevance change with respect to this new recommendation. However, overcoming biases and changing mental models of lay users often proves to be difficult. A possible design adaptation to the FI and FI+% design may show a general overview/summary of the user input, in line with what users were interpreting, and then highlight the keywords that are relevant to the recommendation that is being shown. This can be seen in Figure 4. Keeping the control aspect in mind from the previous section, users are also able to tap on different topics to request recommendations regarding said topic.

Figure 4: Adapted feature importance explanation design

4.4. Insight vs. information overload

Users generally liked the holistic approach of the feature importances, and were more inclined to look into the recommendation itself. When asked why they liked the recommendations more when explained using FI compared to the purely textual explanation, they stated that the FI were able to show them a general overview of them as a person.

On the other hand, some users disagreed with the ordering of keyword importances that the feature importance bars displayed, causing a slight increase in distrust towards the recommender system and a lower ranking of the explanation. This is to be expected, as increasing the transparency of explanations can cause a larger drop in trust towards the system if the content of the explanation or recommendation does not align with the user’s expectations. However, the effect of a misaligned textual explanation is still stronger: users who did not agree with either the recommendation or the explanation expressed a more negative sentiment towards the recommendation, and gave the textual recommendation a lower ranking. This is in line with similar research by Balog et al. [5], in which they state that misaligned recommendations that focus on a single topic or item are more susceptible to a lower perceived quality of explanation compared to multi-item recommendations.







5. Conclusion

This paper introduced several explanation designs for mobile pain-related health recommendations, and compared them among lay users. Most users preferred the added transparency provided by the tags and FI/FI+% designs, stating that these gave them a brief and clear overview of their input, which helped them understand why they received certain recommendations. Another interesting finding is that designs should be careful when visually highlighting users’ negative sentiments. Designs that did so, i.e. the inline-reply and word cloud, were received poorly by users. Lastly, we confirmed that lay users might interpret certain visual explanations differently than intended, yet still prefer them over others. Given their feedback, we presented an adapted design of the favoured FI/FI+% explanation to be in line with what lay users expect.

6. Limitations & Future work

The qualitative aspect of this study was already able to point out several key aspects related to designing health explanations for patients experiencing chronic pain. However, a larger-scale quantitative user study is needed to further investigate these results. One such aspect is the fact that some users preferred textual explanations over explanations that offered more information. Investigating whether this correlates with the user’s need for cognition (NFC), and what its implications are, could prove to be an interesting research direction, similar to the research of Millecamp et al. [22]. Another aspect is the fact that while most users disliked being confronted with their negative input, some did not mind. This could be related to the “worriers vs. warriors” research, in which some users experiencing chronic pain actually prefer being exposed to negative feedback so that they can address it, and could prove useful for further research [23]. Future research should also consider other designs to explain health recommendations and elaborate design guidelines that can be used by researchers and practitioners in this exciting domain. In addition, an interesting further line of research is to personalise these explanations on-the-fly, based on interaction data of end-users. As in the work of Millecamp et al. [24], clicks and hover interactions, as well as eye gaze data, can be considered for such personalisation.

Acknowledgments

This work is part of the research projects Personal Health Empowerment (PHE) with project number HBC.2018.2012, financed by Flanders Innovation & Entrepreneurship, and IMPERIUM with project number G0A3319N, financed by Research Foundation Flanders (FWO).

References

 [1] R. De Croon, L. Van Houdt, N. N. Htun, G. Štiglic, V. Vanden Abeele, K. Verbert, Health recommender systems: Systematic review, J Med Internet Res 23 (2021) e18035. URL: https://www.jmir.org/2021/6/e18035. doi:10.2196/18035.
 [2] F. Torrent-Fontbona, B. Lopez, Personalized adaptive CBR bolus recommender system for type 1 diabetes, IEEE Journal of Biomedical and Health Informatics 23 (2019) 387–394. doi:10.1109/JBHI.2018.2813424.
 [3] R. Gouveia, E. Karapanos, M. Hassenzahl, How do we engage with activity trackers? A longitudinal study of Habito, UbiComp 2015 - Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (2015) 1305–1316. doi:10.1145/2750858.2804290.
 [4] K. Cheung, W. Ling, C. J. Karr, K. Weingardt, S. M. Schueller, D. C. Mohr, Evaluation of a recommender app for apps for the treatment of depression and anxiety: An analysis of longitudinal user engagement, Journal of the American Medical Informatics Association 25 (2018) 955–962. doi:10.1093/jamia/ocy023.
 [5] K. Balog, F. Radlinski, Measuring recommendation explanation quality: The conflicting goals of explanations, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 329–338. URL: https://doi.org/10.1145/3397271.3401032. doi:10.1145/3397271.3401032.
 [6] A. Calero Valdez, M. Ziefle, K. Verbert, HCI for recommender systems: The past, the present and the future, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 123–126. URL: https://doi.org/10.1145/2959100.2959158. doi:10.1145/2959100.2959158.
 [7] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey on methods and metrics, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/8/832. doi:10.3390/electronics8080832.
 [8] E. Wayman, S. Madhvanath, Nudging grocery shoppers to make healthier choices, in: Proceedings of the Ninth Conference on Recommender







     Systems, ACM, 2015, pp. 289–292. doi:10.1145/2792838.2799669.
 [9] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, B. Séroussi, Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach, Artificial Intelligence in Medicine 94 (2019) 42–53. URL: https://www.sciencedirect.com/science/article/pii/S0933365718304846. doi:10.1016/j.artmed.2019.01.001.
[10] A. Bussone, S. Stumpf, D. M. O’Sullivan, The role of explanations on trust and reliance in clinical decision support systems, 2015 International Conference on Healthcare Informatics (2015) 160–169.
[11] M. Ribera, A. Lapedriza, Can we do better explanations? A proposal of user-centered explainable AI, CEUR Workshop Proceedings 2327 (2019).
[12] D. Wang, Q. Yang, A. Abdul, B. Y. Lim, Designing theory-driven user-centric explainable AI, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–15. URL: https://doi.org/10.1145/3290605.3300831.
[13] C.-H. Tsai, P. Brusilovsky, Beyond the ranked list: User-driven exploration and diversification of social recommendation, in: 23rd International Conference on Intelligent User Interfaces, IUI ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 239–250. URL: https://doi.org/10.1145/3172944.3172959. doi:10.1145/3172944.3172959.
[14] M. Szymanski, M. Millecamp, K. Verbert, Visual, textual or hybrid: The effect of user expertise on different explanations, in: 26th International Conference on Intelligent User Interfaces, IUI ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 109–119. URL: https://doi.org/10.1145/3397481.3450662. doi:10.1145/3397481.3450662.
[15] J. Ooge, G. Stiglic, K. Verbert, Explaining artificial intelligence with visual analytics in healthcare, WIREs Data Mining and Knowledge Discovery 12 (2021). URL: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1427. doi:10.1002/widm.1427.
[16] M. Schmid Mast, A. Kindlimann, W. Langewitz, Recipients’ perspective on breaking bad news: How you put it really makes a difference, Patient Education and Counseling 58 (2005) 244–251. URL: https://www.sciencedirect.com/science/article/pii/S0738399105001473. doi:10.1016/j.pec.2005.05.005.
[17] N. Bidargaddi, P. Musiat, M. Winsall, G. Vogl, V. Blake, S. Quinn, S. Orlowski, G. Antezana, G. Schrader, Efficacy of a web-based guided recommendation service for a curated list of readily available mental health and well-being mobile apps for young people: Randomized controlled trial, Journal of Medical Internet Research 19 (2017). doi:10.2196/jmir.6775.
[18] Y. Wu, M. Ester, FLAME: A probabilistic model combining aspect based opinion mining and collaborative filtering, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 199–208. URL: https://doi.org/10.1145/2684822.2685291. doi:10.1145/2684822.2685291.
[19] C.-H. Tsai, P. Brusilovsky, Evaluating visual explanations for similarity-based recommendations: User perception and performance, in: Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 22–30. URL: https://doi.org/10.1145/3320435.3320465. doi:10.1145/3320435.3320465.
[20] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 157–164. URL: https://doi.org/10.1145/2043932.2043962. doi:10.1145/2043932.2043962.
[21] N. J.-M. Blackman, J. J. Koval, Interval estimation for Cohen’s kappa as a measure of agreement, Statistics in Medicine 19 (2000) 723–741. doi:10.1002/(SICI)1097-0258(20000315)19:5<723::AID-SIM379>3.0.CO;2-A.
[22] M. Millecamp, N. N. Htun, C. Conati, K. Verbert, To explain or not to explain: The effects of personal characteristics when explaining music recommendations, in: Proceedings of the 24th International Conference on Intelligent User Interfaces, IUI ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 397–407. URL: https://doi.org/10.1145/3301275.3302313. doi:10.1145/3301275.3302313.
[23] J. Geuens, T. Swinnen, L. Geurts, R. Westhovens, R. De Croon, V. Vanden Abeele, Worriers versus warriors: Tailoring mHealth to address differences in patients with chronic arthritis, in: 2020 IEEE International Conference on Healthcare Informatics (ICHI), 2020, pp. 1–12. doi:10.1109/ICHI48887.2020.9374322.
[24] M. Millecamp, T. Willemot, K. Verbert, Your eyes explain everything: Exploring the use of eye tracking to provide explanations on-the-fly, in: Proceedings of the 8th Joint Workshop on Interfaces and Human Decision Making for Recommender Systems co-located with 15th ACM Conference on Recommender Systems (RecSys 2021), volume 2948, CEUR Workshop Proceedings, 2021, pp. 89–100.