=Paper=
{{Paper
|id=None
|storemode=property
|title=Interactive Explanations in Mobile Shopping Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-1253/paper3.pdf
|volume=Vol-1253
}}
==Interactive Explanations in Mobile Shopping Recommender Systems==
Béatrice Lamche (lamche@in.tum.de), Uğur Adıgüzel (adiguzel@in.tum.de) and Wolfgang Wörndl (woerndl@in.tum.de), TU München, Boltzmannstr. 3, 85748 Garching, Germany

'''Abstract.''' This work presents a concept featuring interactive explanations for mobile shopping recommender systems in the domain of fashion. It combines previous research on explanations in recommender systems and critiquing systems. It is tailored to a modern smartphone platform, exploits the benefits of the mobile environment and incorporates a touch-based interface for convenient user input. Explanations have the potential to be more conversational when the user can change the system behavior by interacting with them. However, in traditional recommender systems, explanations are used for one-way communication only. We therefore design a system which generates personalized interactive explanations using the current state of the user's inferred preferences and the mobile context. An Android application was developed and evaluated following the proposed concept. The application proved to outperform the previous version without interactive and personalized explanations in terms of transparency, scrutability, perceived efficiency and user acceptance.

'''Categories and Subject Descriptors:''' H.5.2 [Information Interfaces and Presentation]: User Interfaces (Interaction styles, User-centered design)

'''General Terms:''' Design, Experimentation, Human Factors

'''Keywords:''' mobile recommender systems, explanations, user interaction, Active Learning, content-based, scrutability

==1. Introduction==

In today's world, we are constantly dealing with complex information spaces in which we often have trouble either finding what we want or making decisions. Mobile recommender systems address this problem in a mobile environment by providing their users with potentially useful suggestions that can support their decisions to find what they are looking for or to discover new interesting things. Explanations of recommendations help users make better decisions in contrast to recommendations without explanations, while also increasing the transparency between the system and the user [8]. However, recommender systems employing explanations have so far not leveraged their interactivity aspect. Touch-based interfaces in smartphones reduce user effort while giving input, which can empower the interactivity of explanations. There are two main goals of this work. One is to study whether a mobile recommender model with interactive explanations leads to more user control and transparency in critique-based mobile recommender systems. The second is to develop a strategy to generate interactive explanations in a content-based recommender system. A mobile shopping recommender system is chosen as the application scenario. The rest of the paper is organized as follows. We first start off with some definitions relevant for explanations in recommender systems and summarize related work. The next section explains the reasoning behind and the path towards a final mobile application, detailing the vision guiding the process. The user study evaluating the developed system is discussed in section 4. We close by suggesting opportunities for future research.

==2. Background & Related Work==

An important aspect of explanations is the benefit they can bring to a system. Tintarev et al. define the following seven goals for explanations in recommender systems [8]:
# '''Transparency''' to help users understand how the recommendations are generated and how the system works.
# '''Scrutability''' to help users correct wrong assumptions made by the system.
# '''Trust''' to increase users' confidence in the system.
# '''Persuasiveness''' to convince users to try or buy items and enhance user acceptance of the system.
# '''Effectiveness''' to help users make better decisions.
# '''Efficiency''' to help users decide faster which recommended item is the best for them.
# '''Satisfaction''' to increase the user's satisfaction with the system.

However, meeting all these criteria is unlikely; some of these aims are even contradictory, such as persuasiveness and effectiveness. Thus, choosing which criteria to improve is a trade-off. Explanations might also differ by the degree of personalization. While non-personalized explanations use general information to indicate the relevance of a recommendation, personalized explanations clarify how a user might relate to a recommended item [8].

Due to the benefits of explanations in mobile recommender systems, a lot of research has been conducted in this context. Since our work focuses on explanations aiming at improving transparency and scrutability in a recommender system, we investigated previous research in these two areas.

The work of Vig et al. [9] separates justification from transparency. While transparency should give an honest statement of how the recommendation set is generated and how the system works in general, justification can be detached from the recommendation algorithm and explain why a recommendation was selected. Vig et al. developed a web-based Tagsplanations system where the recommendation is justified using the relevance of tags. Their approach, as the authors noted, lacked the ability to let users override their inferred tag preferences.
Cramer et al. [3] applied transparent explanations in the web-based CHIP (Cultural Heritage Information Personalization) system, which recommends artworks based on the individual user's ratings of artworks. The main goal of the work was to make the criteria the system uses to recommend artworks more transparent. It did so by showing the users the criteria on which the system based its recommendation. The authors argue that transparency increased the acceptance of the system.

An interesting approach to increase scrutability has been taken by Czarkowski [4]. The author developed SASY, a web-based holiday recommender system which has scrutinization tools that aim not only to enable users to understand what is going on in the system, but also to let them take control over recommendations by enabling them to modify the data that is stored about them.

TasteWeights is a web-based social recommender system developed by Knijnenburg et al. [5] aiming at increasing inspectability and control. The system provides inspectability by displaying a graph of the user's items, friends and recommendations. It allows control over recommendations by allowing users to adjust the weights of their items and friends. The authors evaluated the system with 267 participants. Their results showed that users appreciated the inspectability and control over recommendations. The control given via the weighting of items and friends made the system more understandable. Finally, the authors concluded that such interactive control results in scrutability.

Wasinger et al. [10] apply scrutinization in a mobile restaurant recommender system named Menu Mentor. In this system, users can see the personalized score of a recommended restaurant and the details of how the system computed that score. However, users can change the recommendation behavior only by critiquing presented items via meal star ratings; no granular control over meal content is provided. A conducted user study showed that participants perceived enhanced personal control over the given recommendations.

In summary, although previous research focused on increasing either scrutability or transparency in recommender systems, no research was conducted on how interactive explanations can increase transparency as well as scrutability in mobile recommender systems.

==3. Designing the Prototype==

Our system aims at offering shoppers a way to find nearby shopping locations with interesting clothing items while also supporting them in decision making by providing interactive explanations. Mobile recommender systems use a lot of situational information to generate recommendations, so it might not always be clear to the user how the recommendations are generated. Introducing transparency can help solve this problem. However, mobile devices require even more considerations in design and development (e.g. due to the small display size). Thus, these should also be taken into account when generating transparent explanations. Moreover, the explanation framework should generate textual explanations that make it clear to the user how her preferences are modeled. In order not to bore the user, explanations must be concise and include variations in wording. Furthermore, introducing transparency alone might not be enough, because users often want to feel in control of the recommendation process. The explanation goal scrutability addresses this issue by letting users correct system mistakes. There have been several approaches to incorporate scrutable explanations into traditional web-based recommender systems. However, more investigation is required in the area of mobile recommender systems. First of all, the system should highlight the areas of textual explanations that can be interacted with. Second, the system should allow the user to easily make changes and get new recommendations. While transparent and scrutable explanations are the main focus of this work, there are also some side goals, such as satisfaction and efficiency.

===3.1 The Baseline===

Shopr, a previously developed mobile recommender system, serves as the baseline in our user study [6]. The system uses a conversation-based Active Learning strategy that involves users in ongoing sessions of recommendations by getting feedback on one of the items in each session. Thus, the system learns the user's preferences in the current context. An important point is that the system initially recommends very diverse items without asking its users to input their initial preferences. After a recommendation set is presented, the user is expected to give feedback on one of the items in the form of a like or dislike of item features (e.g. the price of the item or its color) and can state which features she particularly likes or dislikes. In case the user submitted positive feedback, the refine algorithm shows more similar items. Otherwise, the system concludes that negative progress has been made, refocuses on another item region and shows more diverse items. The algorithm keeps the previously critiqued item in the new recommendation set in order to allow the user to further critique it for better recommendations. The explanation strategy used in this system is very simple: an explanation text is put on top of all items, which tries to convey the current profile of the user's preferences. It allows the user to observe the effect of her critiques and to compare the current profile against the actually displayed items. An example of such an explanation text is "avoid grey, only female, preferably shirt/dress".
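To make the baseline's conversational cycle concrete, here is a minimal Python sketch of one critique round as described above. The names `refine` and `refocus` follow the terminology of the text, but the signatures are our own illustration and not taken from the Shopr code base.

```python
def next_recommendations(critiqued_item, liked, refine, refocus, k=10):
    """One critique cycle of the baseline (a sketch, not Shopr code).

    refine(item, k):  returns k items similar to the critiqued item.
    refocus(item, k): returns k diverse items from another item region.
    """
    if liked:
        # Positive feedback: narrow the search around the liked item.
        new_set = refine(critiqued_item, k)
    else:
        # Negative feedback: refocus on another item region and
        # show more diverse items.
        new_set = refocus(critiqued_item, k)
    # Keep the previously critiqued item in the new set so the user
    # can continue critiquing it.
    if critiqued_item not in new_set:
        new_set = [critiqued_item] + new_set[:k - 1]
    return new_set
```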
===3.2 How Explicit Feedback Affects Weights===

The modeling of the user's preferences is an important part of the proposed explanation generation strategy and feedback model; it is adapted from the approach of Shopr [6], described in the Baseline section. Preferences are modeled as a search query q with weights for the values of features (e.g. red is a possible value of the feature color). For each feature, there is a weight vector that allows the prioritization of one feature value over another. A query q for a user looking for only red dresses from open shops within 2000m reach would look like this (we here assume that each item has only the two features 'color' and 'type'):

$q = ((distance \leq 2000\,m) \wedge (time\ open = now + 30\,min)),\ \{color_{red,blue,green}(1.0, 0, 0),\ type_{blouse,dress,trousers}(0, 1.0, 0)\}$ (1)

Our system uses two types of user feedback. One of them is critiquing the recommended items on their features (which was already provided in the baseline system, see section 3.1). The other is correcting mistakes regarding the user's preferences via an explicit preference statement. Explanations are designed to be interactive, so that the user can state her actual preference over feature values after tapping on the explanation. If the user states interest in some feature values, a new value vector is initialized for the query, with all stated values being assigned equal weights summing to 1.0 and the rest having 0.0 weight. That means that the system will focus on the stated feature values, whereas the other values will be avoided. For example, if a user interacts with the explanation associated with the query presented in equation 1 and states that she is actually only interested in blue and green, then the resulting new weight vector would look like the following (assuming that we only distinguish between three colors), which will influence the search query and thus the new recommendations:

$feedback_{positive}(blue, green):\ color_{red,blue,green}(0.0, 0.5, 0.5)$ (2)

===3.3 Generating Interactive Explanations===

The main vision behind interactive explanations is to use them not only as a booster for the transparency and understandability of the recommendation process but also as an enabler for user control. In order to explain the current state of the user model (which stores the user's preferences) and the reasoning behind recommendations, two types of explanations are defined: recommendation explanations and preference explanations.

====3.3.1 Interactive Recommendation Explanations====

Recommendation explanations are interactive textual explanations. Their first aim is to justify why an item in the recommendation set is relevant for the user. Second, they let the user make direct changes to her inferred preferences. The generation is based on the set of recommended items, the user model and the mobile context.

'''Argument Assessment.''' Argument assessment is used to determine the quality of every possible argument about an item. The argument assessment method is based on the method described in [1]. It uses Multi-Criteria Decision Making (MCDM) methods to assess items I on multiple decision dimensions D (e.g. features that an item can have) by means of utility functions. Dimensions in the context of this recommender system are features and contexts. The method described in [1] uses four scores, which lay a good foundation for the method in this work. However, their calculations have to be adapted to the underlying recommendation infrastructure to produce meaningful explanations.

'''Local score''' $LS_{I,D}$ measures the performance of a dimension without taking into account how much the user values that dimension. Our system uses feature value weight vectors to represent both item features and features in a query, which represents the current preferences of the user. The local score of a feature is the scalar product of the weight vector (for that feature) in the query with the respective weight vector in the item's representation. It is formalized as below, where $w_{I,D}$ represents the feature value weight vector for item dimension D, $w_{Q,D}$ the feature value weight vector for query dimension D, and n the number of feature values for that dimension:

$LS_{I,D} = \sum_{i=0}^{n-1} w_{I,D}(i) \cdot w_{Q,D}(i)$ (3)
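As a worked illustration of equations (2) and (3), the following sketch (our own, not code from the paper) re-initializes a feature's weight vector after an explicit preference statement and computes the local score as a scalar product.

```python
def apply_explicit_feedback(feature_values, stated_values):
    """Eq. (2): stated values share the weight 1.0 equally;
    all other values of the feature are set to 0.0 (avoided)."""
    share = 1.0 / len(stated_values)
    return [share if v in stated_values else 0.0 for v in feature_values]

def local_score(w_item, w_query):
    """Eq. (3): scalar product of the item's and the query's
    feature value weight vectors for one dimension."""
    return sum(wi * wq for wi, wq in zip(w_item, w_query))

# The example from the text: the user taps the explanation and
# states that she is only interested in blue and green.
colors = ["red", "blue", "green"]
w_query_color = apply_explicit_feedback(colors, {"blue", "green"})
print(w_query_color)                                # [0.0, 0.5, 0.5]
# Local score of a purely blue item against the updated query:
print(local_score([0.0, 1.0, 0.0], w_query_color))  # 0.5
```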
'''Explanation score''' $ES_{I,D}$ describes the explaining performance of a dimension. The weight for each dimension is calculated dynamically by using a function that decreases the effect of the number of feature values in each dimension. It is formalized as follows, where $length_{w_D}$ denotes the number of feature values in a specific dimension D and $length_{total\ attribute\ values}$ the total number of feature values over all dimensions. Using the square root produced good results, since it limits the effect of the number of feature values on the calculation of weights:

$w_D = \sqrt{\frac{length_{w_D}}{length_{total\ attribute\ values}}}$ (4)

With this dynamically calculated weight for a dimension, the explanation score of the dimension can be calculated by multiplying it with the local score of that dimension:

$ES_{I,D} = LS_{I,D} \cdot w_D$ (5)
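A minimal sketch of equations (4) and (5); the argument names mirror the symbols above and are otherwise our own.

```python
from math import sqrt

def dimension_weight(num_values_in_dimension, total_num_values):
    """Eq. (4): the square root dampens the effect of the number
    of feature values a dimension has."""
    return sqrt(num_values_in_dimension / total_num_values)

def explanation_score(local, w_d):
    """Eq. (5): the local score weighted by the dynamically
    calculated dimension weight."""
    return local * w_d

# Example: 'color' has 3 values out of 6 feature values in total.
w_color = dimension_weight(3, 6)
print(explanation_score(0.5, w_color))  # approx. 0.354
```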
'''Information score''' $IS_D$ measures the amount of information provided by a dimension. The calculation of the information score suggested by [1] is preserved, as it already lays a good foundation for reasoning about whether explaining an item from a given dimension provides good value. It can be defined as follows, where R denotes the range of explanation scores for that dimension over all recommended items and I denotes the information that dimension provides for an item:

$IS_D = \frac{R + I}{2}$ (6)

Range R is calculated as the difference between the maximum and minimum explanation score for the given dimension over all recommended items, namely $R = \max(ES_{I,D}) - \min(ES_{I,D})$. Information I, however, is calculated quite differently from the strategy proposed by [1]. In their system, a dimension provides less and less information as the number of items to be explained from the same dimension increases. This does not apply to the context of the clothing recommender developed for this work. An item can still provide good information even if there are not many items that can be explained from the same feature value. For instance, it is still informative to explain an item from the color blue, although another item is also explained by the same dimension (color) but from a different value, say green. Therefore, I is calculated as a function of the size of the recommendation set (n) and the number of items in the set that have the same value for a dimension (h): $I = \frac{n-h}{n-1}$.

'''Global score''' $GS_I$ measures the overall quality of an item over all dimensions. It is the mean of the explanation scores of all of its dimensions. The following formula demonstrates how it is formalized, where n denotes the total number of dimensions and $ES_{I,D_i}$ the explanation score of an item on the ith dimension:

$GS_I = \frac{\sum_{i=0}^{n-1} ES_{I,D_i}}{n}$ (7)

The above-defined methods for calculating explanation and information scores are only valid for item features. Explanations should also include relevant context arguments. In order to support that, every context instance that is captured and used by the system in the computation of the recommendation set should also be assessed. The explanation score of a context dimension is calculated using domain knowledge. The most important values for the context get the highest explanation score, and it becomes lower as the relevance of the value of the context decreases. For example, for the location context, the explanation score is inversely proportional to the distance between the current location of the user and the shop where the explained item is sold: the explanation score gets higher as the distance gets lower. The information score is calculated with the same formula defined earlier for features, $IS_D = \frac{R+I}{2}$, but the information I changes slightly. As proposed earlier, it is calculated using the formula $I = \frac{n-h}{n-1}$, but in this case h stands for the number of items with a similar explanation score.
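The remaining scores, equations (6) and (7) together with R and I, can be sketched the same way (again an illustration under the definitions above, not the original implementation):

```python
def information(n, h):
    """I = (n - h) / (n - 1): n is the size of the recommendation
    set, h the number of items sharing the value (or, for context
    dimensions, having a similar explanation score)."""
    return (n - h) / (n - 1)

def information_score(expl_scores_of_dim, i):
    """Eq. (6): IS = (R + I) / 2, where R is the range of the
    dimension's explanation scores over all recommended items."""
    r = max(expl_scores_of_dim) - min(expl_scores_of_dim)
    return (r + i) / 2.0

def global_score(expl_scores_of_item):
    """Eq. (7): mean explanation score of one item over all of
    its dimensions."""
    return sum(expl_scores_of_item) / len(expl_scores_of_item)

# Ten recommended items, two of which are blue:
i_blue = information(10, 2)                           # 8/9
print(information_score([0.10, 0.35, 0.20], i_blue))  # approx. 0.57
print(global_score([0.35, 0.60, 0.10]))               # 0.35
```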
'''Argument Types.''' In order to generate explanations with convincing arguments, different argument aspects are defined by following the guidelines for evaluative arguments described in [2]. Moreover, the types of arguments described in [1] are taken as a basis. First of all, arguments can be either positive or negative. While positive arguments are used to convince the user of the relevance of recommendations, negative arguments are computed so that the system can give an honest statement about the quality of the recommended item. The second aspect of arguments is the type of dimension they explain, feature or context. Lastly, they can be primary or supporting arguments. Primary arguments alone are used to generate concise explanations. A combination of primary and supporting arguments is used to generate detailed explanations. We distinguish between five argument types: strong primary feature arguments, weak primary feature arguments, supporting feature arguments, context arguments and negative arguments.

'''Explanation Process.''' The explanation process is based on the approach described in [1], but it is adapted to use the previously defined argument types. Different from the system in [1], explanations are designed to contain multiple positive arguments on features. Negative arguments are generated but only displayed when necessary by using a ramping strategy. Figure 1 shows the process to select arguments. It follows the framework for explanation generation described in [2], as the process is divided into the selection and organization of explanation content and the transformation into a human readable form.

''Figure 1: Generation of explanations.''

'''Content Selection.''' The argumentation strategy selects arguments for every item I separately. One or more primary arguments are selected first to help the user instantly recognize why the item is relevant. There are four alternative ways to select the primary arguments (alternatives 1 to 4 in figure 1; a sketch of the decision chain follows below). The first alternative is that the item is in the recommendation set because it was the last critique and was carried over (1). Another is that the system has enough strong arguments to explain an item (2). If there are no strong arguments, the strategy checks if there are any weak arguments (3). In case there are one or more weak arguments, the system also adds supporting arguments to make the explanation more convincing. Finally, if there are no weak arguments either, the item is checked for being a good average by comparing its global score $GS_I$ to a threshold β (4). If so, similar to alternative (3), supporting arguments are also added to increase the competence of the explanation. Otherwise the strategy supposes that the recommended item is serendipitous and was added to the set to explore the user's preferences. With one or more primary arguments selected, the system checks if there are any negative arguments and context arguments to add (5 and 6).
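The decision chain of the content selection step can be summarized in a few lines. The sketch below is our reading of alternatives (1) to (4) and steps (5) and (6); the data holder and the threshold name `beta` are illustrative, not identifiers from the implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AssessedItem:
    was_last_critique: bool = False
    strong: list = field(default_factory=list)      # strong primary feature arguments
    weak: list = field(default_factory=list)        # weak primary feature arguments
    supporting: list = field(default_factory=list)  # supporting feature arguments
    negative: list = field(default_factory=list)    # negative arguments
    context: list = field(default_factory=list)     # context arguments
    global_score: float = 0.0

def select_content(item: AssessedItem, beta: float) -> list:
    if item.was_last_critique:
        primary = ["last critique"]                    # alternative (1)
    elif item.strong:
        primary = item.strong                          # alternative (2)
    elif item.weak:
        primary = item.weak + item.supporting          # alternative (3)
    elif item.global_score >= beta:
        primary = ["average item"] + item.supporting   # alternative (4)
    else:
        primary = ["serendipity"]  # item explores the user's preferences
    # Steps (5) and (6): add negative and context arguments if any.
    return primary + item.negative + item.context
```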
'''Surface Generation.''' The result of the content selection is an abstract explanation, which needs to be resolved into something the user understands. This is done in the surface generation phase. Various explanation sentence templates are decorated with either feature values or context values (7 and 8). Explanation templates are sentences with placeholders for feature and context values, stored in XML format. The previously determined primary argument type is used to determine which type of explanation template to use. Feature values in the generated textual output are then highlighted and their interaction endpoints are defined (9). The resulting output is a textual explanation, highlighted in the parts where feature values are mentioned. The explanations are interactive such that, after the user taps on the highlighted areas, she can specify what exactly she wants.

'''Table 1: Text templates for recommendation explanations.'''
* Strong argument: "Mainly because you currently like X."
* Weak argument: "Partially as you are currently interested in X."
* Supporting argument: "Also, slightly because of your current interest in X."
* Location context: "And it is just Y meters away from you."
* Average item: "An average item but might be interesting for you."
* Last critique: "Kept so that you can keep track of your critiques."
* Serendipity: "This might help us discovering your preferences." or "A serendipitous item that you perhaps like."
* Negative argument: "However, it has the following feature(s) you don't like: X, Y [...]."

====3.3.2 Interactive Preference Explanations====

Preference explanations have two main goals. First, they aim to let the user inspect the current state of the system's understanding of the user's preferences. Second, they intend to let the user make direct changes to the preferences. Two main types of preference explanations are defined: interactive textual explanations and interactive visual explanations.

''Figure 2: Recommendation list (a) and explicit preference feedback screen (b).''

'''Generating Textual Preference Explanations.''' The only input to the textual preference explanation generation algorithm is the user model. For each dimension D the algorithm can generate interactive explanations. Dimensions are features that an item can have. The algorithm distinguishes between four feature value weight vectors, indicating different user preferences: First, the user is indifferent to any feature value. Second, the user is only interested in a set of feature values. Third, the user is avoiding a set of feature values. And fourth, the user prefers a set of feature values over others.

'''Generating Visual Preference Explanations.''' Visual preference explanations are also generated by using the user model, more specifically by making use of the array of feature value weight vectors, which represents the user's current preferences. For each feature, there is already a feature value weight vector, which indicates the priorities of the user among feature values. All those weights are between 0.0 and 1.0, summing up to 1.0. They can be scaled to a percentage to generate charts showing the distribution of interest over feature values. In order to generate charts, it is also required to determine with which color and description a feature value will be represented in a chart. In order to support that, a feature value appearing in the chart is modeled with its weight (scaled to a percentage), color and description in the user interface. Figure 5 illustrates this chart representation.

====3.3.3 Using Text Templates Supporting Variation====

XML templates are used to generate explanation sentences for the different user preference types in the English language. Those templates contain placeholders for feature and context values which are replaced during the explanation generation process. For recommendation explanations, there are a few sentence variations for almost every type of argument. See table 1 for examples of the different text templates for recommendation explanations. These templates can be used in combination with each other. For example, supporting arguments can support a weak argument. In such cases, argument sentences are connected using conjunctions. A similar mechanism is also used for the preference explanations. However, to keep it simple, variation is not provided, as the number of features to explain is already limited. See table 2 for selected examples of text templates for preference explanations.

'''Table 2: Text templates for preference explanations.'''
* Only some values: "You are currently interested only in X, Y [...]." (the word "only" is emphasized in bold)
* Avoiding some values: "You are currently avoiding X, Y [...]." (the word "avoiding" is emphasized in bold)
* Preferably some values: "It seems, you currently prefer X, Y [...]."
* Indifferent to feature: "You are currently indifferent to X feature."
===3.4 Interaction and Interface Design===

The first issue was to clarify how to integrate the interaction process with textual explanations. It was envisioned to give the user the opportunity to tap on the highlighted areas of the explanation text to state her actual preferences on a feature. This leads to a two-step process. First, the user sees an item with an explanation including highlighted words (highlighted words are always associated with a feature, see figure 2a) and taps on one of them (e.g. in figure 2b, "t-shirt" was tapped). Then the system directs the user to the screen where she can make changes. In this second step, she specifies which feature values she is currently interested in. Lastly, the system updates the list of recommendations, which completes a recommendation cycle. Note that the critiquing process and associated screens from the project Shopr, which is taken as a basis (see section 3.1), are kept in the developed system. Eventually, the interaction is a hybrid of critiquing and explicitly stating current preferences. On top of each explicit feedback screen, a text description of what is expected from the user is given.

Due to the applied ramping strategy mentioned in section 3.3.1, all extra arguments in explanations that are not important are not shown as explanations in the list of recommendations but on the screen where item details are presented. Tapping on an item picture accesses that screen. Here, the user can also browse through several pictures of an item by swiping the current picture from right to left (see figure 3b). In order to make it obvious for the user, sentences with positive arguments always start with a green "+" sign. Negative argument sentences, on the other hand, always start with a red "-" sign (see figure 3).

''Figure 3: Detailed information screens of items.''

The next issue was to implement preference explanations, which we call the Mindmap feature. The Mindmap feature is the way the system explains its mental map of the user's preferences. The overview screen for the mindmap was designed to quickly show the system's assumptions about the user's current preferences. To keep it simple yet usable, only textual explanations are used for each feature (see figure 4b). In order to make it easy for the user to notice what is important, the feature values used in the explanation text are highlighted. Moreover, every element representing a feature is made interactive. This lets the user access the explicit feedback screen to provide her actual preferences.

''Figure 4: Navigation Drawer (a) and Overview (b).''

The user should also be able to get more detailed visual information for all the features. In order to achieve that, a different "drill down" screen was developed for each feature as part of the mindmap feature. Figure 5 shows the mindmap detail screens for the clothing color feature. The user's preferences on feature values are represented as a chart. Every feature value is displayed in a different color in the charts. One of the most important features is that the highlighted parts of the explanation texts and the charts are interactive as well, which lets the user access the explicit feedback screen to provide her actual preferences. The full source code and resources for the Android app and the algorithm are available online (https://github.com/adiguzel/Shopr).

''Figure 5: Mindmap detail screens for color.''

==4. User Study==

The three main goals of the evaluation are: First, to find out whether transparency and user control can be improved by feature-based personalized explanations supported by scrutable interfaces in recommender systems. Second, to find out whether side goals such as higher satisfaction are achieved, and lastly, to see whether other important system goals such as efficiency are not damaged.

===4.1 Setup===

The test hardware is a 4.3 inch 480 x 800 resolution Android smartphone (Samsung Galaxy S2) running the Jelly Bean version of the Android operating system (4.1.2).

Two variants of the system are put to the test. In order to refrain from the effects of different recommender algorithms, both variants use the same recommendation algorithm, which uses diversity-based Active Learning [6]. Moreover, the critiquing and item details interfaces are exactly the same. The difference lies in the explanations: the EXP variant refers to the proposed system, described in the previous section. In order to test the value of the developed explanations and scrutinization tools, a baseline (BASE variant) to compare against is needed (see subsection 3.1). The study is designed as within-subject to keep the number of testers at a reasonable size. Thus one group of people tests both variants. Which system is tested first was flipped between subjects so that a bias because of learning effects could be reduced.

In order to create a realistic setup, it is necessary to generate a data set that represents real-world items. For that purpose, we developed a data set creation tool as an open-source project (https://github.com/adiguzel/pickpocket). The tool crawls clothing items from a well-known online clothing retailer website. To keep the amount of work reasonable, items were associated with an id, one of 19 types of clothing, one of 18 colors, one of 5 brands, the price (in Euro), the gender (male, female or unisex) and a list of image links for the item. The resulting set is 2318 items large, with 1141 items for the male and 1177 for the female gender.

For the study, participants of various ages, educational backgrounds and current professions were sought. Overall 30 people participated, of whom 33% were female and 67% male.

The actual testing procedure used in the evaluation was structured as follows: We first asked the participants to provide background information about themselves, such as demographic information and their knowledge about mobile systems and recommender systems. Next, the idea of the system was introduced and the purpose of the user study was made clear. We chose a realistic scenario instead of asking users to find an item they could like:

Task: Imagine you want to buy yourself new clothes for an event in a summer evening. You believe that the following types of clothes would be appropriate for this event: shirt, t-shirt, polo shirt, dress, blouse or top. As per color you consider shades of blue, green, white, black and red. You have a budget of up to €100. You use the Shopr app to look for a product you might want to purchase.
After introducing them to the task, users were given hands-on time to familiarize themselves with the user interface and grasp how the app works. After selecting and confirming the choice for a product, the task was completed. Then testers were asked to rate statements about transparency, user control, efficiency and satisfaction based on their experience with the system on a five-point Likert scale (from 1, strongly disagree, to 5, strongly agree) and to offer any general feedback and observations. After having tested both variants, participants stated which variant they preferred and why that was the case.

===4.2 Results===

The testing framework applied in the user study is a subset of the aspects that are relevant for critiquing recommenders and explanations in critiquing recommenders. It follows the user-centric approach presented in [7]. The measured data is divided into four areas: transparency, user control, efficiency and satisfaction.

The means of the measured values for the most important metrics of the two systems, BASE denoting the variant using only simple non-interactive explanations and EXP the version with interactive explanations, are shown in table 3. Next to the mean, the standard deviation is shown, with the last column denoting the p-value of a one-tail paired t-test with 29 degrees of freedom (30 participants - 1).

In order to measure actual understanding after using a variant, users were asked to describe how the underlying recommendation system of that variant works. In general, almost all of the participants could explain for both recommenders that the system builds a model of the user's preferences in each cycle and uses it to generate recommendations that can be interesting for the user.

On average, when asked if a tester understands the system's reasoning behind its recommendations, EXP performs better than BASE (mean of 4.63 compared to 4.3 on a 1-5 Likert scale). Further analysis suggests that the variant with interactive explanations (EXP) is perceived as significantly more transparent than the variant with baseline explanations (one-tail t-test, p<0.05 with p=0.018).

Users were asked about the ease of telling the system what they want in order to measure the overall user control they perceived. The average rating of participants was better with EXP (4.33 versus 3.23). In a further analysis, EXP proved significantly better in terms of perceived overall control than BASE (one-tail t-test, p<0.05 with p=0.0003).

When asked about the ease of correcting system mistakes, EXP performs considerably better than BASE (mean of 4.36 compared to 3 on a 1-5 Likert scale). Further analysis reveals that EXP is significantly better in terms of perceived scrutability than BASE (one-tail t-test, p<0.05 with p=6.08E-06).

Participants completed their task on average in one cycle less using EXP than BASE (6.5 with EXP, 7.46 with BASE). However, a one-tail t-test shows that EXP is not significantly better than BASE (p>0.05 with p=0.14).

The next part of measuring objective effort is done by tracking the time it took each participant from seeing the initial set of recommendations until the target item was selected. On average, BASE seems to be better, with a mean session length of 160 seconds against 165 seconds. However, it was found not to be significantly more time efficient (one-tail t-test, p>0.05 with p=0.39). One reason for this could be that although EXP gives its users tools to update preferences over several features quickly, it has more detailed explanations. Thus, users spent more time reading.

Users were asked about the ease of finding information and the effort required to use the system in order to get an idea about the system's efficiency. The participants' average rating was better with EXP, with 4.33 against 3.43 with BASE. Further analysis revealed that users perceived EXP as significantly more efficient than BASE (one-tail t-test, p<0.05 with p=0.0003).
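The significance analysis throughout this section is a one-tail paired t-test over the participants' paired ratings. A minimal sketch of such a test with SciPy is shown below; the ratings are placeholders, not the study data.

```python
from scipy import stats

# Placeholder Likert ratings (NOT the study data): one pair of
# ratings per participant for the same statement under BASE and EXP.
base = [4, 3, 4, 5, 3, 4, 4, 5, 4, 3]
exp = [5, 4, 4, 5, 4, 5, 4, 5, 5, 4]

# One-tailed paired t-test: is EXP rated significantly higher?
t, p = stats.ttest_rel(exp, base, alternative="greater")
print(f"t = {t:.3f}, one-tailed p = {p:.4f}")
```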
'''Table 3: The means of some important measured values comparing both variants of the system''' (mean, standard deviation and the p-value of a one-tail paired t-test).
* Perceived transparency: BASE 4.3 (stdev 0.70), EXP 4.63 (stdev 0.49), p = 0.018
* Perceived overall control: BASE 3.23 (1.04), EXP 4.33 (0.71), p = 0.0003
* Scrutability: BASE 3 (1.31), EXP 4.36 (0.85), p = 6.08E-06
* Cycles: BASE 7.46 (3.64), EXP 6.5 (3.28), p = 0.14
* Time consumption: BASE 160 s (74 s), EXP 165 s (83 s), p = 0.39
* Perceived efficiency: BASE 3.43 (1.13), EXP 4.33 (0.75), p = 0.0003
* Satisfaction: BASE 3.76 (0.85), EXP 4.43 (0.56), p = 0.0004

When asked how satisfied participants were with the system overall, EXP performs better with 4.43 against 3.76. A one-tail t-test suggests that this is a significant result (p<0.05 with p=0.0004).

Finally, participants were asked to pick a favorite from the two evaluated variants. 90% preferred the variant with interactive explanations (EXP) over the variant with simple non-interactive explanations (BASE), mostly because of the increased perception of control over recommendations.

==5. Conclusion and Future Work==

This work investigated the development and impact of a concept featuring interactive explanations for Active Learning critique-based mobile recommender systems in the fashion domain. The developed concept proposes the generation of explanations to make the system more transparent while also using them as an enabler for user control in the recommendation process. Furthermore, the concept defines the user feedback as a hybrid of critiquing and explicit statements of current interests. A method is developed to generate explanations based on a content-based recommendation approach. The explanations are always made interactive to give the user a chance to correct possible system mistakes. In order to measure the applicability of the concept, a mobile Android app using the proposed concept and the explanation generation algorithm was developed. Several aspects regarding the display and interaction design of explanations in mobile recommender systems are discussed, and solutions to the problems faced during the development process are summarized. The prototype was evaluated in a study with 30 real users. The proposed concept performed significantly better than the approach with simple non-interactive explanations in terms of our main goals of increasing transparency and scrutability and our side goals of increasing perceived efficiency and satisfaction. Overall, the developed interactive explanations approach demonstrated the users' appreciation of transparency and control over the recommendation process in a conversation-based Active Learning mobile recommender system tailored to a modern smartphone platform.

Some changes, such as increasing the number of recommendations, skipping to the next list of recommendations without critiquing and having more item attributes for critiquing, could make the application even more appealing. Future development may also include the creation of more complex recommendation scenarios to test the capability of the proposed concept even further. One could add more item features to critique and also take the user's mobile context (e.g. mood and seasonal conditions) into account during the recommendation process. Furthermore, future research might study the generation of interactive explanations for systems with rather complex recommendation algorithms. Interactive explanations might make adjustable parts of the algorithm transparent and allow the user to change them.

==6. References==

[1] R. Bader, W. Woerndl, A. Karitnig, and G. Leitner. Designing an explanation interface for proactive recommendations in automotive scenarios. In Proceedings of the 19th International Conference on Advances in User Modeling, UMAP'11, pages 92–104, Berlin, Heidelberg, 2012. Springer-Verlag.

[2] G. Carenini and J. D. Moore. Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952, Aug. 2006.

[3] H. Cramer, V. Evers, S. Ramlal, M. Someren, L. Rutledge, N. Stash, L. Aroyo, and B. Wielinga. The effects of transparency on trust in and acceptance of a content-based art recommender. User Modeling and User-Adapted Interaction, 18(5):455–496, Nov. 2008.

[4] M. Czarkowski. A Scrutable Adaptive Hypertext. PhD thesis, University of Sydney, 2006.

[5] B. P. Knijnenburg, S. Bostandjiev, J. O'Donovan, and A. Kobsa. Inspectability and control in social recommenders. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 43–50, New York, NY, USA, 2012. ACM.

[6] B. Lamche, U. Trottman, and W. Wörndl. Active learning strategies for exploratory mobile recommender systems. In Proceedings of the CaRR Workshop, 36th European Conference on Information Retrieval, Amsterdam, Netherlands, Apr. 2014.

[7] P. Pu, L. Chen, and R. Hu. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 157–164, New York, NY, USA, 2011. ACM.

[8] N. Tintarev and J. Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):399–439, Oct. 2012.

[9] J. Vig, S. Sen, and J. Riedl. Tagsplanations: Explaining recommendations using tags. In Proceedings of the 14th International Conference on Intelligent User Interfaces, IUI '09, pages 47–56, New York, NY, USA, 2009. ACM.

[10] R. Wasinger, J. Wallbank, L. Pizzato, J. Kay, B. Kummerfeld, M. Böhmer, and A. Krüger. Scrutable user models and personalised item recommendation in mobile lifestyle applications. In User Modeling, Adaptation, and Personalization, volume 7899 of Lecture Notes in Computer Science, pages 77–88. Springer Berlin Heidelberg, 2013.