=Paper=
{{Paper
|id=None
|storemode=property
|title=Interactive Explanations in Mobile Shopping Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-1253/paper3.pdf
|volume=Vol-1253
}}
==Interactive Explanations in Mobile Shopping Recommender Systems==
Béatrice Lamche (lamche@in.tum.de), Uğur Adıgüzel (adiguzel@in.tum.de) and Wolfgang Wörndl (woerndl@in.tum.de), TU München, Boltzmannstr. 3, 85748 Garching, Germany

'''Abstract.''' This work presents a concept featuring interactive explanations for mobile shopping recommender systems in the domain of fashion. It combines previous research on explanations in recommender systems and critiquing systems. It is tailored to a modern smartphone platform, exploits the benefits of the mobile environment and incorporates a touch-based interface for convenient user input. Explanations have the potential to be more conversational when the user can change the system behavior by interacting with them. However, in traditional recommender systems, explanations are used for one-way communication only. We therefore design a system which generates personalized interactive explanations using the current state of the user's inferred preferences and the mobile context. An Android application was developed and evaluated following the proposed concept. The application proved to outperform the previous version without interactive and personalized explanations in terms of transparency, scrutability, perceived efficiency and user acceptance.

'''Categories and Subject Descriptors:''' H.5.2 [Information Interfaces and Presentation]: User Interfaces (Interaction styles, User-centered design)

'''General Terms:''' Design, Experimentation, Human Factors

'''Keywords:''' mobile recommender systems, explanations, user interaction, Active Learning, content-based, scrutability

==1. Introduction==

In today's world, we are constantly dealing with complex information spaces in which we often have trouble either finding what we want or making decisions. Mobile recommender systems address this problem in a mobile environment by providing their users with potentially useful suggestions that can support their decisions to find what they are looking for or to discover new interesting things. Explanations of recommendations help users make better decisions in contrast to recommendations without explanations, while also increasing the transparency between the system and the user [8]. However, recommender systems employing explanations have so far not leveraged their interactivity aspect. Touch-based interfaces in smartphones reduce user effort while giving input, which can empower the interactivity of explanations. There are two main goals of this work. One is to study whether a mobile recommender model with interactive explanations leads to more user control and transparency in critique-based mobile recommender systems. The second is to develop a strategy to generate interactive explanations in a content-based recommender system. A mobile shopping recommender system is chosen as the application scenario. The rest of the paper is organized as follows. We first start off with some definitions relevant for explanations in recommender systems and summarize related work. The next section explains the reasoning behind and the path towards a final mobile application, detailing the vision guiding the process. The user study evaluating the developed system is discussed in section 4. We close by suggesting opportunities for future research.

==2. Background & Related Work==

An important aspect of explanations is the benefit they can bring to a system. Tintarev et al. define the following seven goals for explanations in recommender systems [8]:
# '''Transparency''' to help users understand how the recommendations are generated and how the system works.
# '''Scrutability''' to help users correct wrong assumptions made by the system.
# '''Trust''' to increase users' confidence in the system.
# '''Persuasiveness''' to convince users to try or buy items and enhance user acceptance of the system.
# '''Effectiveness''' to help users make better decisions.
# '''Efficiency''' to help users decide faster which recommended item is the best for them.
# '''Satisfaction''' to increase the user's satisfaction with the system.

However, meeting all these criteria is unlikely; some of these aims are even contradictory, such as persuasiveness and effectiveness. Thus, choosing which criteria to improve is a trade-off. Explanations might also differ by the degree of personalization. While non-personalized explanations use general information to indicate the relevance of a recommendation, personalized explanations clarify how a user might relate to a recommended item [8].

Due to the benefits of explanations in mobile recommender systems, a lot of research has been conducted in this context. Since our work focuses on explanations aiming at improving transparency and scrutability in a recommender system, we investigated previous research in these two areas.

The work of Vig et al. [9] separates justification from transparency. While transparency should give an honest statement of how the recommendation set is generated and how the system works in general, justification can be detached from the recommendation algorithm and explain why a recommendation was selected. Vig et al. developed a web-based Tagsplanations system where the recommendation is justified using the relevance of tags. Their approach, as the authors noted, lacked the ability to let users override their inferred tag preferences.
Cramer et al. [3] applied transparent explanations in the web-based CHIP (Cultural Heritage Information Personalization) system, which recommends artworks based on the individual user's ratings of artworks. The main goal of the work was to make the criteria the system uses to recommend artworks more transparent. It did so by showing the users the criteria on which the system based its recommendation. The authors argue that transparency increased the acceptance of the system.

An interesting approach to increase scrutability has been taken by Czarkowski [4]. The author developed SASY, a web-based holiday recommender system which has scrutinization tools that aim not only to enable users to understand what is going on in the system, but also to let them take control over recommendations by enabling them to modify the data that is stored about them.

TasteWeights is a web-based social recommender system developed by Knijnenburg et al. [5] aiming at increasing inspectability and control. The system provides inspectability by displaying a graph of the user's items, friends and recommendations. It allows control over recommendations by allowing users to adjust the weights of their items and friends. The authors evaluated the system with 267 participants. Their results showed that users appreciated the inspectability and control over recommendations. The control given via the weighting of items and friends made the system more understandable. Finally, the authors concluded that such interactive control results in scrutability.

Wasinger et al. [10] apply scrutinization in a mobile restaurant recommender system named Menu Mentor. In this system, users can see the personalized score of a recommended restaurant and the details of how the system computed that score. However, users can change the recommendation behavior only by critiquing presented items via meal star ratings; no granular control over meal content is provided. A conducted user study showed that participants perceived enhanced personal control over the given recommendations.

In summary, although previous research focused on increasing either scrutability or transparency in recommender systems, no research was conducted on how interactive explanations can increase transparency as well as scrutability in mobile recommender systems.

==3. Designing the Prototype==

Our system aims at offering shoppers a way to find nearby shopping locations with interesting clothing items while also supporting them in decision making by providing interactive explanations. Mobile recommender systems use a lot of situational information to generate recommendations, so it might not always be clear to the user how the recommendations are generated. Introducing transparency can help solve this problem. However, mobile devices require even more considerations in design and development (e.g. due to the small display size). Thus, these should also be taken into account when generating transparent explanations. Moreover, the explanation framework should generate textual explanations that make it clear to the user how her preferences are modeled. In order not to bore the user, explanations must be concise and include variations in wording. Furthermore, introducing transparency alone might not be enough, because users often want to feel in control of the recommendation process. The explanation goal scrutability addresses this issue by letting users correct system mistakes. There have been several approaches to incorporate scrutable explanations into traditional web-based recommender systems. However, more investigation is required in the area of mobile recommender systems. First of all, the system should highlight the areas of textual explanations that can be interacted with. Second, the system should allow the user to easily make changes and get new recommendations. While transparent and scrutable explanations are the main focus of this work, there are also some side goals, such as satisfaction and efficiency.

===3.1 The Baseline===

Shopr, a previously developed mobile recommender system, serves as the baseline in our user study [6]. The system uses a conversation-based Active Learning strategy that involves users in ongoing sessions of recommendations by getting feedback on one of the items in each session. Thus, the system learns the user's preferences in the current context. An important point is that the system initially recommends very diverse items without asking its users to input their initial preferences. After a recommendation set is presented, the user is expected to give feedback on one of the items in the form of a like or dislike of item features (e.g. the price of the item or its color) and can state which features she particularly likes or dislikes. In case the user submitted positive feedback, the refine algorithm shows more similar items. Otherwise, the system concludes that negative progress has been made, refocuses on another item region and shows more diverse items. The algorithm keeps the previously critiqued item in the new recommendation set in order to allow the user to further critique it for better recommendations. The explanation strategy used in this system is very simple: an explanation text is put on top of all items, which tries to convey the current profile of the user's preferences. It allows the user to observe the effect of her critiques and to compare the current profile against the actually displayed items. An example of such an explanation text is "avoid grey, only female, preferably shirt/dress".
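To make the baseline's conversational cycle concrete, here is a minimal Python sketch of one critique round as described above. The names `refine` and `refocus` follow the terminology of the text, but the signatures are our own illustration and not taken from the Shopr code base.

```python
def next_recommendations(critiqued_item, liked, refine, refocus, k=10):
    """One critique cycle of the baseline (a sketch, not Shopr code).

    refine(item, k):  returns k items similar to the critiqued item.
    refocus(item, k): returns k diverse items from another item region.
    """
    if liked:
        # Positive feedback: narrow the search around the liked item.
        new_set = refine(critiqued_item, k)
    else:
        # Negative feedback: refocus on another item region and
        # show more diverse items.
        new_set = refocus(critiqued_item, k)
    # Keep the previously critiqued item in the new set so the user
    # can continue critiquing it.
    if critiqued_item not in new_set:
        new_set = [critiqued_item] + new_set[:k - 1]
    return new_set
```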
===3.2 How Explicit Feedback Affects Weights===

The modeling of the user's preferences is an important part of the proposed explanation generation strategy and feedback model; it is adapted from the approach of Shopr [6], described in the Baseline section. Preferences are modeled as a search query q with weights for the values of features (e.g. red is a possible value of the feature color). For each feature, there is a weight vector that allows the prioritization of one feature value over another. A query q for a user looking for only red dresses from open shops within 2000m reach would look like this (we here assume that each item has only the two features 'color' and 'type'):

$q = ((distance \leq 2000\,m) \wedge (time\ open = now + 30\,min)),\ \{color_{red,blue,green}(1.0, 0, 0),\ type_{blouse,dress,trousers}(0, 1.0, 0)\}$ (1)

Our system uses two types of user feedback. One of them is critiquing the recommended items on their features (which was already provided in the baseline system, see section 3.1). The other is correcting mistakes regarding the user's preferences via an explicit preference statement. Explanations are designed to be interactive, so that the user can state her actual preference over feature values after tapping on the explanation. If the user states interest in some feature values, a new value vector is initialized for the query, with all stated values being assigned equal weights summing to 1.0 and the rest having 0.0 weight. That means that the system will focus on the stated feature values, whereas the other values will be avoided. For example, if a user interacts with the explanation associated with the query presented in equation 1 and states that she is actually only interested in blue and green, then the resulting new weight vector would look like the following (assuming that we only distinguish between three colors), which will influence the search query and thus the new recommendations:

$feedback_{positive}(blue, green):\ color_{red,blue,green}(0.0, 0.5, 0.5)$ (2)

===3.3 Generating Interactive Explanations===

The main vision behind interactive explanations is to use them not only as a booster for the transparency and understandability of the recommendation process but also as an enabler for user control. In order to explain the current state of the user model (which stores the user's preferences) and the reasoning behind recommendations, two types of explanations are defined: recommendation explanations and preference explanations.

====3.3.1 Interactive Recommendation Explanations====

Recommendation explanations are interactive textual explanations. Their first aim is to justify why an item in the recommendation set is relevant for the user. Second, they let the user make direct changes to her inferred preferences. The generation is based on the set of recommended items, the user model and the mobile context.

'''Argument Assessment.''' Argument assessment is used to determine the quality of every possible argument about an item. The argument assessment method is based on the method described in [1]. It uses Multi-Criteria Decision Making (MCDM) methods to assess items I on multiple decision dimensions D (e.g. features that an item can have) by means of utility functions. Dimensions in the context of this recommender system are features and contexts. The method described in [1] uses four scores, which lay a good foundation for the method in this work. However, their calculations have to be adapted to the underlying recommendation infrastructure to produce meaningful explanations.

'''Local score''' $LS_{I,D}$ measures the performance of a dimension without taking into account how much the user values that dimension. Our system uses feature value weight vectors to represent both item features and features in a query, which represents the current preferences of the user. The local score of a feature is the scalar product of the weight vector (for that feature) in the query with the respective weight vector in the item's representation. It is formalized as below, where $w_{I,D}$ represents the feature value weight vector for item dimension D, $w_{Q,D}$ the feature value weight vector for query dimension D, and n the number of feature values for that dimension:

$LS_{I,D} = \sum_{i=0}^{n-1} w_{I,D}(i) \cdot w_{Q,D}(i)$ (3)
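As a worked illustration of equations (2) and (3), the following sketch (our own, not code from the paper) re-initializes a feature's weight vector after an explicit preference statement and computes the local score as a scalar product.

```python
def apply_explicit_feedback(feature_values, stated_values):
    """Eq. (2): stated values share the weight 1.0 equally;
    all other values of the feature are set to 0.0 (avoided)."""
    share = 1.0 / len(stated_values)
    return [share if v in stated_values else 0.0 for v in feature_values]

def local_score(w_item, w_query):
    """Eq. (3): scalar product of the item's and the query's
    feature value weight vectors for one dimension."""
    return sum(wi * wq for wi, wq in zip(w_item, w_query))

# The example from the text: the user taps the explanation and
# states that she is only interested in blue and green.
colors = ["red", "blue", "green"]
w_query_color = apply_explicit_feedback(colors, {"blue", "green"})
print(w_query_color)                                # [0.0, 0.5, 0.5]
# Local score of a purely blue item against the updated query:
print(local_score([0.0, 1.0, 0.0], w_query_color))  # 0.5
```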
'''Explanation score''' $ES_{I,D}$ describes the explaining performance of a dimension. The weight for each dimension is calculated dynamically by using a function that decreases the effect of the number of feature values in each dimension. It is formalized as follows, where $length_{w_D}$ denotes the number of feature values in a specific dimension D and $length_{total\ attribute\ values}$ the total number of feature values over all dimensions. Using the square root produced good results, since it limits the effect of the number of feature values on the calculation of weights:

$w_D = \sqrt{\frac{length_{w_D}}{length_{total\ attribute\ values}}}$ (4)

With this dynamically calculated weight for a dimension, the explanation score of the dimension can be calculated by multiplying it with the local score of that dimension:

$ES_{I,D} = LS_{I,D} \cdot w_D$ (5)
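A minimal sketch of equations (4) and (5); the argument names mirror the symbols above and are otherwise our own.

```python
from math import sqrt

def dimension_weight(num_values_in_dimension, total_num_values):
    """Eq. (4): the square root dampens the effect of the number
    of feature values a dimension has."""
    return sqrt(num_values_in_dimension / total_num_values)

def explanation_score(local, w_d):
    """Eq. (5): the local score weighted by the dynamically
    calculated dimension weight."""
    return local * w_d

# Example: 'color' has 3 values out of 6 feature values in total.
w_color = dimension_weight(3, 6)
print(explanation_score(0.5, w_color))  # approx. 0.354
```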
'''Information score''' $IS_D$ measures the amount of information provided by a dimension. The calculation of the information score suggested by [1] is preserved, as it already lays a good foundation for reasoning about whether explaining an item from a given dimension provides good value. It can be defined as follows, where R denotes the range of explanation scores for that dimension over all recommended items and I denotes the information that dimension provides for an item:

$IS_D = \frac{R + I}{2}$ (6)

Range R is calculated as the difference between the maximum and minimum explanation score for the given dimension over all recommended items, namely $R = \max(ES_{I,D}) - \min(ES_{I,D})$. Information I, however, is calculated quite differently from the strategy proposed by [1]. In their system, a dimension provides less and less information as the number of items to be explained from the same dimension increases. This does not apply to the context of the clothing recommender developed for this work. An item can still provide good information even if there are not many items that can be explained from the same feature value. For instance, it is still informative to explain an item from the color blue, although another item is also explained by the same dimension (color) but from a different value, say green. Therefore, I is calculated as a function of the size of the recommendation set (n) and the number of items in the set that have the same value for a dimension (h): $I = \frac{n-h}{n-1}$.

'''Global score''' $GS_I$ measures the overall quality of an item over all dimensions. It is the mean of the explanation scores of all of its dimensions. The following formula demonstrates how it is formalized, where n denotes the total number of dimensions and $ES_{I,D_i}$ the explanation score of an item on the ith dimension:

$GS_I = \frac{\sum_{i=0}^{n-1} ES_{I,D_i}}{n}$ (7)

The above-defined methods for calculating explanation and information scores are only valid for item features. Explanations should also include relevant context arguments. In order to support that, every context instance that is captured and used by the system in the computation of the recommendation set should also be assessed. The explanation score of a context dimension is calculated using domain knowledge. The most important values for the context get the highest explanation score, and it becomes lower as the relevance of the value of the context decreases. For example, for the location context, the explanation score is inversely proportional to the distance between the current location of the user and the shop where the explained item is sold: the explanation score gets higher as the distance gets lower. The information score is calculated with the same formula defined earlier for features, $IS_D = \frac{R+I}{2}$, but the information I changes slightly. As proposed earlier, it is calculated using the formula $I = \frac{n-h}{n-1}$, but in this case h stands for the number of items with a similar explanation score.
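The remaining scores, equations (6) and (7) together with R and I, can be sketched the same way (again an illustration under the definitions above, not the original implementation):

```python
def information(n, h):
    """I = (n - h) / (n - 1): n is the size of the recommendation
    set, h the number of items sharing the value (or, for context
    dimensions, having a similar explanation score)."""
    return (n - h) / (n - 1)

def information_score(expl_scores_of_dim, i):
    """Eq. (6): IS = (R + I) / 2, where R is the range of the
    dimension's explanation scores over all recommended items."""
    r = max(expl_scores_of_dim) - min(expl_scores_of_dim)
    return (r + i) / 2.0

def global_score(expl_scores_of_item):
    """Eq. (7): mean explanation score of one item over all of
    its dimensions."""
    return sum(expl_scores_of_item) / len(expl_scores_of_item)

# Ten recommended items, two of which are blue:
i_blue = information(10, 2)                           # 8/9
print(information_score([0.10, 0.35, 0.20], i_blue))  # approx. 0.57
print(global_score([0.35, 0.60, 0.10]))               # 0.35
```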
'''Argument Types.''' In order to generate explanations with convincing arguments, different argument aspects are defined by following the guidelines for evaluative arguments described in [2]. Moreover, the types of arguments described in [1] are taken as a basis. First of all, arguments can be either positive or negative. While positive arguments are used to convince the user of the relevance of recommendations, negative arguments are computed so that the system can give an honest statement about the quality of the recommended item. The second aspect of arguments is the type of dimension they explain, feature or context. Lastly, they can be primary or supporting arguments. Primary arguments alone are used to generate concise explanations. A combination of primary and supporting arguments is used to generate detailed explanations. We distinguish between five argument types: strong primary feature arguments, weak primary feature arguments, supporting feature arguments, context arguments and negative arguments.

'''Explanation Process.''' The explanation process is based on the approach described in [1], but it is adapted to use the previously defined argument types. Different from the system in [1], explanations are designed to contain multiple positive arguments on features. Negative arguments are generated but only displayed when necessary by using a ramping strategy. Figure 1 shows the process to select arguments. It follows the framework for explanation generation described in [2], as the process is divided into the selection and organization of explanation content and the transformation into a human readable form.

''Figure 1: Generation of explanations.''

'''Content Selection.''' The argumentation strategy selects arguments for every item I separately. One or more primary arguments are selected first to help the user instantly recognize why the item is relevant. There are four alternative ways to select the primary arguments (alternatives 1 to 4 in figure 1; a sketch of the decision chain follows below). The first alternative is that the item is in the recommendation set because it was the last critique and was carried over (1). Another is that the system has enough strong arguments to explain an item (2). If there are no strong arguments, the strategy checks if there are any weak arguments (3). In case there are one or more weak arguments, the system also adds supporting arguments to make the explanation more convincing. Finally, if there are no weak arguments either, the item is checked for being a good average by comparing its global score $GS_I$ to a threshold β (4). If so, similar to alternative (3), supporting arguments are also added to increase the competence of the explanation. Otherwise the strategy supposes that the recommended item is serendipitous and was added to the set to explore the user's preferences. With one or more primary arguments selected, the system checks if there are any negative arguments and context arguments to add (5 and 6).
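The decision chain of the content selection step can be summarized in a few lines. The sketch below is our reading of alternatives (1) to (4) and steps (5) and (6); the data holder and the threshold name `beta` are illustrative, not identifiers from the implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AssessedItem:
    was_last_critique: bool = False
    strong: list = field(default_factory=list)      # strong primary feature arguments
    weak: list = field(default_factory=list)        # weak primary feature arguments
    supporting: list = field(default_factory=list)  # supporting feature arguments
    negative: list = field(default_factory=list)    # negative arguments
    context: list = field(default_factory=list)     # context arguments
    global_score: float = 0.0

def select_content(item: AssessedItem, beta: float) -> list:
    if item.was_last_critique:
        primary = ["last critique"]                    # alternative (1)
    elif item.strong:
        primary = item.strong                          # alternative (2)
    elif item.weak:
        primary = item.weak + item.supporting          # alternative (3)
    elif item.global_score >= beta:
        primary = ["average item"] + item.supporting   # alternative (4)
    else:
        primary = ["serendipity"]  # item explores the user's preferences
    # Steps (5) and (6): add negative and context arguments if any.
    return primary + item.negative + item.context
```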
'''Surface Generation.''' The result of the content selection is an abstract explanation, which needs to be resolved into something the user understands. This is done in the surface generation phase. Various explanation sentence templates are decorated with either feature values or context values (7 and 8). Explanation templates are sentences with placeholders for feature and context values, stored in XML format. The previously determined primary argument type is used to determine which type of explanation template to use. Feature values in the generated textual output are then highlighted and their interaction endpoints are defined (9). The resulting output is a textual explanation, highlighted in the parts where feature values are mentioned. The explanations are interactive such that, after the user taps on the highlighted areas, she can specify what exactly she wants.

'''Table 1: Text templates for recommendation explanations.'''
* Strong argument: "Mainly because you currently like X."
* Weak argument: "Partially as you are currently interested in X."
* Supporting argument: "Also, slightly because of your current interest in X."
* Location context: "And it is just Y meters away from you."
* Average item: "An average item but might be interesting for you."
* Last critique: "Kept so that you can keep track of your critiques."
* Serendipity: "This might help us discovering your preferences." or "A serendipitous item that you perhaps like."
* Negative argument: "However, it has the following feature(s) you don't like: X, Y [...]."

====3.3.2 Interactive Preference Explanations====

Preference explanations have two main goals. First, they aim to let the user inspect the current state of the system's understanding of the user's preferences. Second, they intend to let the user make direct changes to the preferences. Two main types of preference explanations are defined: interactive textual explanations and interactive visual explanations.

''Figure 2: Recommendation list (a) and explicit preference feedback screen (b).''

'''Generating Textual Preference Explanations.''' The only input to the textual preference explanation generation algorithm is the user model. For each dimension D the algorithm can generate interactive explanations. Dimensions are features that an item can have. The algorithm distinguishes between four feature value weight vectors, indicating different user preferences: First, the user is indifferent to any feature value. Second, the user is only interested in a set of feature values. Third, the user is avoiding a set of feature values. And fourth, the user prefers a set of feature values over others.

'''Generating Visual Preference Explanations.''' Visual preference explanations are also generated by using the user model, more specifically by making use of the array of feature value weight vectors, which represents the user's current preferences. For each feature, there is already a feature value weight vector, which indicates the priorities of the user among feature values. All those weights are between 0.0 and 1.0, summing up to 1.0. They can be scaled to a percentage to generate charts showing the distribution of interest over feature values. In order to generate charts, it is also required to determine with which color and description a feature value will be represented in a chart. In order to support that, a feature value appearing in the chart is modeled with its weight (scaled to a percentage), color and description in the user interface. Figure 5 illustrates this chart representation.

====3.3.3 Using Text Templates Supporting Variation====

XML templates are used to generate explanation sentences for the different user preference types in the English language. Those templates contain placeholders for feature and context values which are replaced during the explanation generation process. For recommendation explanations, there are a few sentence variations for almost every type of argument. See table 1 for examples of the different text templates for recommendation explanations. These templates can be used in combination with each other. For example, supporting arguments can support a weak argument. In such cases, argument sentences are connected using conjunctions. A similar mechanism is also used for the preference explanations. However, to keep it simple, variation is not provided, as the number of features to explain is already limited. See table 2 for selected examples of text templates for preference explanations.

'''Table 2: Text templates for preference explanations.'''
* Only some values: "You are currently interested only in X, Y [...]." (the word "only" is emphasized in bold)
* Avoiding some values: "You are currently avoiding X, Y [...]." (the word "avoiding" is emphasized in bold)
* Preferably some values: "It seems, you currently prefer X, Y [...]."
* Indifferent to feature: "You are currently indifferent to X feature."
===3.4 Interaction and Interface Design===

The first issue was to clarify how to integrate the interaction process with textual explanations. It was envisioned to give the user the opportunity to tap on the highlighted areas of the explanation text to state her actual preferences on a feature. This leads to a two-step process. First, the user sees an item with an explanation including highlighted words (highlighted words are always associated with a feature, see figure 2a) and taps on one of them (e.g. in figure 2b, "t-shirt" was tapped). Then the system directs the user to the screen where she can make changes. In this second step, she specifies which feature values she is currently interested in. Lastly, the system updates the list of recommendations, which completes a recommendation cycle. Note that the critiquing process and associated screens from the project Shopr, which is taken as a basis (see section 3.1), are kept in the developed system. Eventually, the interaction is a hybrid of critiquing and explicitly stating current preferences. On top of each explicit feedback screen, a text description of what is expected from the user is given.

Due to the applied ramping strategy mentioned in section 3.3.1, all extra arguments in explanations that are not important are not shown as explanations in the list of recommendations but on the screen where item details are presented. Tapping on an item picture accesses that screen. Here, the user can also browse through several pictures of an item by swiping the current picture from right to left (see figure 3b). In order to make it obvious for the user, sentences with positive arguments always start with a green "+" sign. Negative argument sentences, on the other hand, always start with a red "-" sign (see figure 3).

''Figure 3: Detailed information screens of items.''

The next issue was to implement preference explanations, which we call the Mindmap feature. The Mindmap feature is the way the system explains its mental map of the user's preferences. The overview screen for the mindmap was designed to quickly show the system's assumptions about the user's current preferences. To keep it simple yet usable, only textual explanations are used for each feature (see figure 4b). In order to make it easy for the user to notice what is important, the feature values used in the explanation text are highlighted. Moreover, every element representing a feature is made interactive. This lets the user access the explicit feedback screen to provide her actual preferences.

''Figure 4: Navigation Drawer (a) and Overview (b).''

The user should also be able to get more detailed visual information for all the features. In order to achieve that, a different "drill down" screen was developed for each feature as part of the mindmap feature. Figure 5 shows the mindmap detail screens for the clothing color feature. The user's preferences on feature values are represented as a chart. Every feature value is displayed in a different color in the charts. One of the most important features is that the highlighted parts of the explanation texts and the charts are interactive as well, which lets the user access the explicit feedback screen to provide her actual preferences. The full source code and resources for the Android app and the algorithm are available online (https://github.com/adiguzel/Shopr).

''Figure 5: Mindmap detail screens for color.''

==4. User Study==

The three main goals of the evaluation are: First, to find out whether transparency and user control can be improved by feature-based personalized explanations supported by scrutable interfaces in recommender systems. Second, to find out whether side goals such as higher satisfaction are achieved, and lastly, to see whether other important system goals such as efficiency are not damaged.

===4.1 Setup===

The test hardware is a 4.3 inch 480 x 800 resolution Android smartphone (Samsung Galaxy S2) running the Jelly Bean version of the Android operating system (4.1.2).

Two variants of the system are put to the test. In order to refrain from the effects of different recommender algorithms, both variants use the same recommendation algorithm, which uses diversity-based Active Learning [6]. Moreover, the critiquing and item details interfaces are exactly the same. The difference lies in the explanations: the EXP variant refers to the proposed system, described in the previous section. In order to test the value of the developed explanations and scrutinization tools, a baseline (BASE variant) to compare against is needed (see subsection 3.1). The study is designed as within-subject to keep the number of testers at a reasonable size. Thus one group of people tests both variants. Which system is tested first was flipped between subjects so that a bias because of learning effects could be reduced.

In order to create a realistic setup, it is necessary to generate a data set that represents real-world items. For that purpose, we developed a data set creation tool as an open-source project (https://github.com/adiguzel/pickpocket). The tool crawls clothing items from a well-known online clothing retailer website. To keep the amount of work reasonable, items were associated with an id, one of 19 types of clothing, one of 18 colors, one of 5 brands, the price (in Euro), the gender (male, female or unisex) and a list of image links for the item. The resulting set is 2318 items large, with 1141 items for the male and 1177 for the female gender.

For the study, participants of various ages, educational backgrounds and current professions were sought. Overall 30 people participated, of whom 33% were female and 67% male.

The actual testing procedure used in the evaluation was structured as follows: We first asked the participants to provide background information about themselves, such as demographic information and their knowledge about mobile systems and recommender systems. Next, the idea of the system was introduced and the purpose of the user study was made clear. We chose a realistic scenario instead of asking users to find an item they could like:

Task: Imagine you want to buy yourself new clothes for an event in a summer evening. You believe that the following types of clothes would be appropriate for this event: shirt, t-shirt, polo shirt, dress, blouse or top. As per color you consider shades of blue, green, white, black and red. You have a budget of up to €100. You use the Shopr app to look for a product you might want to purchase.
After introducing them to the task, users were given hands-on time to familiarize themselves with the user interface and grasp how the app works. After selecting and confirming the choice for a product, the task was completed. Then testers were asked to rate statements about transparency, user control, efficiency and satisfaction based on their experience with the system on a five-point Likert scale (from 1, strongly disagree, to 5, strongly agree) and to offer any general feedback and observations. After having tested both variants, participants stated which variant they preferred and why that was the case.

===4.2 Results===

The testing framework applied in the user study is a subset of the aspects that are relevant for critiquing recommenders and explanations in critiquing recommenders. It follows the user-centric approach presented in [7]. The measured data is divided into four areas: transparency, user control, efficiency and satisfaction.

The means of the measured values for the most important metrics of the two systems, BASE denoting the variant using only simple non-interactive explanations and EXP the version with interactive explanations, are shown in table 3. Next to the mean, the standard deviation is shown, with the last column denoting the p-value of a one-tail paired t-test with 29 degrees of freedom (30 participants - 1).

In order to measure actual understanding after using a variant, users were asked to describe how the underlying recommendation system of that variant works. In general, almost all of the participants could explain for both recommenders that the system builds a model of the user's preferences in each cycle and uses it to generate recommendations that can be interesting for the user.

On average, when asked if a tester understands the system's reasoning behind its recommendations, EXP performs better than BASE (mean of 4.63 compared to 4.3 on a 1-5 Likert scale). Further analysis suggests that the variant with interactive explanations (EXP) is perceived as significantly more transparent than the variant with baseline explanations (one-tail t-test, p<0.05 with p=0.018).

Users were asked about the ease of telling the system what they want in order to measure the overall user control they perceived. The average rating of participants was better with EXP (4.33 versus 3.23). In a further analysis, EXP proved significantly better in terms of perceived overall control than BASE (one-tail t-test, p<0.05 with p=0.0003).

When asked about the ease of correcting system mistakes, EXP performs considerably better than BASE (mean of 4.36 compared to 3 on a 1-5 Likert scale). Further analysis reveals that EXP is significantly better in terms of perceived scrutability than BASE (one-tail t-test, p<0.05 with p=6.08E-06).

Participants completed their task on average in one cycle less using EXP than BASE (6.5 with EXP, 7.46 with BASE). However, a one-tail t-test shows that EXP is not significantly better than BASE (p>0.05 with p=0.14).

The next part of measuring objective effort is done by tracking the time it took each participant from seeing the initial set of recommendations until the target item was selected. On average, BASE seems to be better, with a mean session length of 160 seconds against 165 seconds. However, it was found not to be significantly more time efficient (one-tail t-test, p>0.05 with p=0.39). One reason for this could be that although EXP gives its users tools to update preferences over several features quickly, it has more detailed explanations. Thus, users spent more time reading.

Users were asked about the ease of finding information and the effort required to use the system in order to get an idea about the system's efficiency. The participants' average rating was better with EXP, with 4.33 against 3.43 with BASE. Further analysis revealed that users perceived EXP as significantly more efficient than BASE (one-tail t-test, p<0.05 with p=0.0003).
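The significance analysis throughout this section is a one-tail paired t-test over the participants' paired ratings. A minimal sketch of such a test with SciPy is shown below; the ratings are placeholders, not the study data.

```python
from scipy import stats

# Placeholder Likert ratings (NOT the study data): one pair of
# ratings per participant for the same statement under BASE and EXP.
base = [4, 3, 4, 5, 3, 4, 4, 5, 4, 3]
exp = [5, 4, 4, 5, 4, 5, 4, 5, 5, 4]

# One-tailed paired t-test: is EXP rated significantly higher?
t, p = stats.ttest_rel(exp, base, alternative="greater")
print(f"t = {t:.3f}, one-tailed p = {p:.4f}")
```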
'''Table 3: The means of some important measured values comparing both variants of the system''' (mean, standard deviation and the p-value of a one-tail paired t-test).
* Perceived transparency: BASE 4.3 (stdev 0.70), EXP 4.63 (stdev 0.49), p = 0.018
* Perceived overall control: BASE 3.23 (1.04), EXP 4.33 (0.71), p = 0.0003
* Scrutability: BASE 3 (1.31), EXP 4.36 (0.85), p = 6.08E-06
* Cycles: BASE 7.46 (3.64), EXP 6.5 (3.28), p = 0.14
* Time consumption: BASE 160 s (74 s), EXP 165 s (83 s), p = 0.39
* Perceived efficiency: BASE 3.43 (1.13), EXP 4.33 (0.75), p = 0.0003
* Satisfaction: BASE 3.76 (0.85), EXP 4.43 (0.56), p = 0.0004

When asked how satisfied participants were with the system overall, EXP performs better with 4.43 against 3.76. A one-tail t-test suggests that this is a significant result (p<0.05 with p=0.0004).

Finally, participants were asked to pick a favorite from the two evaluated variants. 90% preferred the variant with interactive explanations (EXP) over the variant with simple non-interactive explanations (BASE), mostly because of the increased perception of control over recommendations.

==5. Conclusion and Future Work==

This work investigated the development and impact of a concept featuring interactive explanations for Active Learning critique-based mobile recommender systems in the fashion domain. The developed concept proposes the generation of explanations to make the system more transparent while also using them as an enabler for user control in the recommendation process. Furthermore, the concept defines the user feedback as a hybrid of critiquing and explicit statements of current interests. A method is developed to generate explanations based on a content-based recommendation approach. The explanations are always made interactive to give the user a chance to correct possible system mistakes. In order to measure the applicability of the concept, a mobile Android app using the proposed concept and the explanation generation algorithm was developed. Several aspects regarding the display and interaction design of explanations in mobile recommender systems are discussed, and solutions to the problems faced during the development process are summarized. The prototype was evaluated in a study with 30 real users. The proposed concept performed significantly better than the approach with simple non-interactive explanations in terms of our main goals of increasing transparency and scrutability and our side goals of increasing perceived efficiency and satisfaction. Overall, the developed interactive explanations approach demonstrated the users' appreciation of transparency and control over the recommendation process in a conversation-based Active Learning mobile recommender system tailored to a modern smartphone platform.

Some changes, such as increasing the number of recommendations, skipping to the next list of recommendations without critiquing and having more item attributes for critiquing, could make the application even more appealing. Future development may also include the creation of more complex recommendation scenarios to test the capability of the proposed concept even further. One could add more item features to critique and also take the user's mobile context (e.g. mood and seasonal conditions) into account during the recommendation process. Furthermore, future research might study the generation of interactive explanations for systems with rather complex recommendation algorithms. Interactive explanations might make adjustable parts of the algorithm transparent and allow the user to change them.

==6. References==

[1] R. Bader, W. Woerndl, A. Karitnig, and G. Leitner. Designing an explanation interface for proactive recommendations in automotive scenarios. In Proceedings of the 19th International Conference on Advances in User Modeling, UMAP'11, pages 92–104, Berlin, Heidelberg, 2012. Springer-Verlag.

[2] G. Carenini and J. D. Moore. Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952, Aug. 2006.

[3] H. Cramer, V. Evers, S. Ramlal, M. Someren, L. Rutledge, N. Stash, L. Aroyo, and B. Wielinga. The effects of transparency on trust in and acceptance of a content-based art recommender. User Modeling and User-Adapted Interaction, 18(5):455–496, Nov. 2008.

[4] M. Czarkowski. A Scrutable Adaptive Hypertext. PhD thesis, University of Sydney, 2006.

[5] B. P. Knijnenburg, S. Bostandjiev, J. O'Donovan, and A. Kobsa. Inspectability and control in social recommenders. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 43–50, New York, NY, USA, 2012. ACM.

[6] B. Lamche, U. Trottman, and W. Wörndl. Active learning strategies for exploratory mobile recommender systems. In Proceedings of the CaRR Workshop, 36th European Conference on Information Retrieval, Amsterdam, Netherlands, Apr. 2014.

[7] P. Pu, L. Chen, and R. Hu. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 157–164, New York, NY, USA, 2011. ACM.

[8] N. Tintarev and J. Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):399–439, Oct. 2012.

[9] J. Vig, S. Sen, and J. Riedl. Tagsplanations: Explaining recommendations using tags. In Proceedings of the 14th International Conference on Intelligent User Interfaces, IUI '09, pages 47–56, New York, NY, USA, 2009. ACM.

[10] R. Wasinger, J. Wallbank, L. Pizzato, J. Kay, B. Kummerfeld, M. Böhmer, and A. Krüger. Scrutable user models and personalised item recommendation in mobile lifestyle applications. In User Modeling, Adaptation, and Personalization, volume 7899 of Lecture Notes in Computer Science, pages 77–88. Springer Berlin Heidelberg, 2013.