Post-hoc Explanations for Complex Model Recommendations using Simple Methods

Dorin Shmaryahu, Guy Shani, Bracha Shapira
Ben-Gurion University of the Negev, Israel
dorins@post.bgu.ac.il, shanigu@bgu.ac.il, bshapira@bgu.ac.il

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '20 - Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, September 26, 2020, Virtual Event.

ABSTRACT
Many leading approaches for generating recommendations, such as matrix factorization and autoencoders, compute a complex model composed of latent variables. As such, explaining the recommendations generated by these models is a difficult task. In this paper, instead of attempting to explain the latent variables, we provide post-hoc explanations for why a recommended item may be appropriate for the user, by using a set of simple, easily explainable recommendation algorithms. When the output of the simple explainable recommender agrees with the complex model on a recommended item, we consider the explanation of the simple model to be applicable. We suggest both simple collaborative filtering and content-based approaches for generating these explanations. We conduct a user study in the movie recommendation domain, showing that users accept our explanations, and react positively to simple and short explanations, even if they do not truly explain the mechanism leading to the generated recommendations.

Author Keywords
Recommender Systems, Explainable Recommendation, content-based explanations, collaborative filtering explanations, user study

INTRODUCTION
Recommendation systems that suggest items to users can be found in many modern applications, from online newspapers and movie streaming applications, to e-commerce [2, 26, 18]. Research has shown that in many applications, users may be interested in understanding why a particular recommended item is appropriate for them [27, 11, 31]. Thus, it is beneficial to be able to generate explanations for the recommended items.

Early simple recommendation algorithms often yield a natural explanation for their recommendations. For example, the recommendations of a neighborhood-based collaborative filtering approach [20] can be explained as: "users similar to you often choose this item". Item-item collaborative filtering algorithms [23, 3] provide recommendations that can be explained as "users who choose the item that you have chosen often also choose the recommended item". Content-based algorithms [17], that learn for each user a set of content features that the user prefers, generate recommendations that can be explained by "the recommended item has a content feature that you prefer".

However, these simple algorithms often provide recommendations of lower accuracy than modern approaches. In recent years, two collaborative filtering approaches became popular for generating good recommendations — the matrix factorization (MF) approach [13, 14, 15], and the artificial neural network (ANN) approach [29]. Algorithms of these families have shown the capacity to generate accurate recommendations for users.

One of the downsides of both approaches is that they compute the recommendations through a set of latent variables and their possibly non-linear relations. For example, in the MF approach one computes a vector of latent variables for each user, and a vector of latent variables for each item, and then computes a recommendation score using the inner product between the vectors of a particular user and a particular item. The values of the latent variables do not have an understandable meaning to humans.

Several researchers have attempted to provide explanations by understanding the behavior of the latent variables [32, 6]. Such efforts may be possible in some cases, but it is unlikely that all, or even most, latent variables represent an easy to understand structure. The problem becomes even more difficult with deep ANNs, that may contain thousands of such variables with complex connections between them.
Alternatively, one can take a post-hoc approach to explanations [12, 4], that takes the model recommendations as input, and attempts to identify reasons as to why these recommended items are appropriate to the user. For example, [21] used association rule mining to identify explanations for the recommendations directly from the data. These explanations cannot be considered to be transparent [28], as they do not shed light on the choices made within the model in recommending the particular item, but may still provide value to the user. They can be effective, helping the user in making decisions. They may be persuasive, convincing the user to explore the recommended item. They may also increase trust, by, e.g., providing a reasonable explanation for a recommendation that the user dislikes.

In this paper we also take a post-hoc explanation generation approach. Given the output of any black-box recommender, we run a set of easy-to-explain recommendation algorithms, such as the simple collaborative filtering and content-based methods suggested above. These algorithms provide a score for the items recommended by the black-box recommender. When this score is sufficiently high, it means that the explainable recommender agrees with the black-box recommender. In this case, we can present the explanation of the explaining recommender to the user.

Our approach is model agnostic — we can generate explanations for any recommender. Our approach is also flexible, in that the explanations can be generated post-hoc by any easy-to-explain recommendation algorithm that outputs a recommendation score for each item. Although in this paper we study only the simple recommenders mentioned above, given any other easy-to-explain recommender, one can use it to generate new explanations, that would be candidate explanations for the items recommended by the black-box recommender.

We study the user perception of explanations generated by simple easy-to-explain recommenders for the items recommended by complex models. We evaluate the user's response to recommended items with and without explanations of different types. We also measure participant preference over the various types of explanations. To study these questions we conduct a user study in the movie domain. We use two popular recommendation models, an MF and an autoencoder, as black boxes to generate recommendations. For each recommended item we run a set of 6 easy-to-explain approaches to produce explanations for the recommendation — item-item content based, user-item content based, item-item collaborative filtering, user-user collaborative filtering, movie overview textual similarity, and a popularity recommender. We show only explanations which are sufficiently relevant, that is, whose score passes a method-dependent threshold.

We first ask participants to rank the generated recommendations without any explanation. Then, we ask their opinion about recommended items with explanations, showing a single, randomly chosen, explanation for every movie.

In the next stage of the user study, the participants were shown additional recommended movies. In this stage we presented all explanations that passed a threshold to the participants, and asked them to rate each explanation. The results in this stage show that participants preferred content-based explanations to collaborative filtering explanations, and that popularity explanations are rated the lowest.

Finally, the participants completed an online survey, asking their opinion about recommendation explanations in general. Our results indicate that participants prefer short and easy to understand explanations to transparent explanations that fully disclose the mechanism behind the computed recommendations.

BACKGROUND
Recommender systems actively suggest items to users, to help them to rapidly discover relevant items, and to increase item consumption [22]. Such systems can be found in many applications, including TV streaming services [2], online e-commerce [26], smart tutoring [8], and many more [18]. We focus here on one important recommendation task [24] — top-N recommendation, where the system computes a list of N recommended items that the user may choose.

There are two dominant approaches for computing recommendations for the active user — the user that is currently interacting with the application and the recommender system. First, the collaborative filtering approach [5, 10] assumes that users who agreed on preferred items in the past will tend to agree in the future too. Many such methods rely on a matrix R of user-item ratings to predict unknown matrix entries, and thus to decide which items to recommend.

A simple method in this family [20], commonly referred to as user-user collaborative filtering, identifies a neighborhood of users that are similar to the active user. A common method for computing user similarity is the Jaccard correlation Jaccard(u1, u2) = |I_u1 ∩ I_u2| / |I_u1 ∪ I_u2|, where I_u is the set of items consumed by a user u. This set of neighbors is based on the similarity of observed preferences between these users and the active user. Then, items that were preferred by users in the neighborhood are recommended to the active user. Another approach [23, 3], known as item-item collaborative filtering, relies on the set of users that consumed two items i1 and i2. One can compute, e.g., the Jaccard correlation between the items: Jaccard(i1, i2) = |U_i1 ∩ U_i2| / |U_i1 ∪ U_i2|, where U_i is the set of users who consumed item i. Then, the system can recommend to a user u an item i2 that has high Jaccard similarity to an item i1 that u has previously consumed.
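To make these two similarity measures concrete, here is a small illustrative sketch (ours, not code from the paper), assuming the consumption history is given as a dictionary mapping each user to the set of items she liked; the function names and toy data are hypothetical.

# Illustrative sketch of the two Jaccard-based collaborative filtering scores.
# `liked` maps each user id to the set of items that user consumed.

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_user_score(active_user, item, liked, k=20):
    """User-user CF: portion of the k most similar users who consumed `item`."""
    sims = sorted(
        ((jaccard(liked[active_user], items), u)
         for u, items in liked.items() if u != active_user),
        reverse=True)[:k]
    if not sims:
        return 0.0
    return sum(1 for _, u in sims if item in liked[u]) / len(sims)

def item_item_score(active_user, item, liked):
    """Item-item CF: best Jaccard correlation between `item` and a consumed item."""
    users_of = {}
    for u, items in liked.items():
        for i in items:
            users_of.setdefault(i, set()).add(u)
    return max((jaccard(users_of.get(item, set()), users_of.get(j, set()))
                for j in liked[active_user]), default=0.0)

liked = {"u1": {"Heat", "Ronin"}, "u2": {"Heat", "Alien"}, "u3": {"Alien", "Ronin"}}
print(user_user_score("u1", "Alien", liked), item_item_score("u1", "Alien", liked))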
A second popular approach is known as content-based recommendation [17]. In this approach, the system has access to a set of item features. The system then learns the user preferences over features, and uses these computed preferences to recommend new items with similar features. Such recommendations are typically titled "similar items".

In content-based recommendations one can again take an item-item approach, computing the similarity between items based on shared feature values, such as the leading actors, the same director, or the same genre. Then, one can recommend an item that has high similarity to an item that was previously consumed by the user. One can also take a user-item approach, by computing a user profile — the set of feature values that often appear in items consumed by the user, such as actors that repeatedly appear in movies that the user has consumed, or genres that the user often watches. Then, one can compute the similarity of an item to the user profile to decide whether to recommend the item to the user.

It is widely agreed in the recommendation system research community that in many domains, collaborative filtering approaches produce better recommendations than content-based methods.
A collaborative filtering approach that has gained much attention in the recommender system community is matrix factorization [13, 14, 15], where the system attempts to factor the rating matrix R of size |U| x |I| into two matrices, P of size |U| x k and Q of size k x |I|, for some small number k, where R ≈ P x Q. One can consider the matrix P as a set of latent user features, and Q as a set of latent item features. An item i is considered to be appropriate for a user u when the inner product p_u · q_i is high. The resulting latent feature vectors p_u and q_i typically do not have a meaning that can be translated into content features, such as actors or genres, but are associated with the user like-dislike pattern of items. As such, explaining to the user why a particular item was recommended to her, beyond the vague statement that the system predicts that the item is a good match for the user, is difficult.
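As a toy illustration of how such a factored model is used for top-N recommendation — a sketch under the assumption that P and Q have already been learned (here they are random placeholders, not the paper's model):

import numpy as np

# Matrix factorization scoring: once R ≈ P x Q is available, the score of item i
# for user u is the inner product p_u · q_i, and the highest-scoring unseen items
# are recommended.
rng = np.random.default_rng(0)
n_users, n_items, k = 5, 8, 3
P = rng.normal(size=(n_users, k))      # latent user factors (|U| x k), stand-ins
Q = rng.normal(size=(k, n_items))      # latent item factors (k x |I|), stand-ins

def recommend(u, seen, n=2):
    scores = P[u] @ Q                  # p_u · q_i for every item
    ranked = np.argsort(-scores)       # items ordered by decreasing score
    return [int(i) for i in ranked if i not in seen][:n]

print(recommend(u=0, seen={1, 3}))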
Another state of the art collaborative filtering approach is the variational autoencoder (VAE). An autoencoder (AE) neural network is an unsupervised learning algorithm, attempting to produce target values equal to the input values, y(i) = x(i). The autoencoder tries to learn a function h_W,b(x) ≈ x, where W and b are the sets of weights and biases corresponding to the hidden units in the deep network. While the input and output layers of the network are large, there is an inner low dimensional layer within the network. Thus, the network learns a lower dimension representation of the input, the latent space. The autoencoder operates in two phases, an encoder that reduces the input into a compact representation in the low dimension layer, and a decoder, responsible for reconstructing the encoded representation into the original input.

In the recommendation system task, the input is a user partial item choice vector r(u), e.g., a vector of all movies in the system, where only movies that the user has watched receive a value of 1. The reconstruction of the input at the output layer contains higher scores for items that the user is likely to choose.
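A minimal sketch of this encode-decode scoring idea, with a single hidden layer and random stand-in weights (a trained VAE such as [16] is far more elaborate; this only illustrates how the reconstructed vector is used as scores):

import numpy as np

# The binary choice vector r(u) is encoded into a small latent layer and decoded
# back; output entries for unseen items serve as recommendation scores.
rng = np.random.default_rng(1)
n_items, latent = 10, 3
W_enc, b_enc = rng.normal(size=(n_items, latent)), np.zeros(latent)
W_dec, b_dec = rng.normal(size=(latent, n_items)), np.zeros(n_items)

def scores(r_u):
    z = np.tanh(r_u @ W_enc + b_enc)   # encoder: compress r(u) into the latent space
    return z @ W_dec + b_dec           # decoder: reconstruction, one score per item

r_u = np.zeros(n_items); r_u[[2, 5]] = 1.0   # the user watched items 2 and 5
print(np.argsort(-scores(r_u))[:3])          # highest-scoring items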
RELATED WORK
Explainable recommendations provided to users may help them understand why certain items are appropriate for them. By clarifying these reasons, explanations can improve the transparency, persuasiveness, effectiveness, trustworthiness, and user satisfaction of the recommender system [27, 11, 31]. While earlier recommenders were often naturally explainable, modern models are more complex, and do not yield natural explanations. Studies in explainable recommendations hence address the challenge of providing human understandable explanations for items recommended by complex models.

There are two main approaches to providing explainable recommendations [31]. The first approach attempts to create interpretable recommendation models whose results can be naturally explained. However, many modern models are not naturally explainable, and making them more explainable often results in reduced recommendation accuracy. This line of research therefore aims at mitigating the trade-off between accuracy and explainability by including explainable components, layers or external information into non-linear complex and deep accurate models to make them explainable. Examples of such solutions for MF-based recommendation models include the work by [32], who applied sentiment analysis over user reviews, to learn user preference-related features of items that served as a basis for latent feature tables. Additional examples can be found for deep learning recommendation models, such as the work by [6], that learned the distribution of user attention over features of different items that serve as explanations. These algorithms try to analyse the meaning of each latent component in a neural network, and how they interact with each other to generate the final results.

The second approach is post-hoc and model-agnostic [12, 4]. It treats the model as a black box and explains the recommendation results in a rational way by identifying relations between the data provided as input to the recommender system and its recommended items. This analysis is decoupled from the recommendation model, considering only the model input and output. The post-hoc approach has the advantage of enabling explanations in scenarios where the recommendation model cannot be exposed. Although the post-hoc explanations presented to a user are not transparent, i.e., they do not reflect the computation used by the underlying model to provide recommendations, they commonly present rational, plausible information for the user.

Some post-hoc explainable recommendation models use statistical methods to analyze the influence of the input on the output [7]. These methods often require heavy computations to provide explanations. Other studies apply various deep learning and reinforcement learning methods to build explanation models using various types of networks. These studies [30, 19] are commonly based on static explanation templates, result in complex models, and require parameter tuning.

Post-hoc methods are built on the assumption, that we investigate in this paper, that an explanation that makes sense to the user, even if it is not the exact reason that the recommendation was indeed issued, is acceptable to users and may have a beneficial effect for the recommendation system.

[4] suggested that providing explanations to users alongside a recommendation can help users to make more informed decisions about consuming the item. They used 3 post-hoc methods — keyword similarity, neighbors ratings, and what they call influential item computation — to explain recommendations generated by a hybrid content-based and collaborative rating prediction system. They ran a small scale user study in a books domain, attempting to understand which explanation provided the most information for the user to best understand the quality of the recommended item for her. Our paper can be seen as an extension of their preliminary work, describing a general framework for post-hoc explanations using simple methods, suggesting additional explanation types, and conducting a thorough user study in the movies domain, evaluating many more research questions.

[21] also extended the work of [4] by suggesting a different post-hoc method, applying association rule mining on the input data — the user-item rating table. The mining results in association rules, sorted by their confidence and support, that reflect links between items. Those links form the explanations that are provided to users whose input data include antecedents of the rule. The explanations, however, unlike our approach, are limited to item-based collaboration-like statements (i.e., "item X is recommended because item Y was consumed"), and require the application of some association rule mining algorithm (e.g., the a-priori algorithm that the authors used [1]). Rule mining algorithms typically require heavier computations than our simple similarity-based computations. They also defined Model Fidelity, the portion of recommendations that can be explained. Post-hoc explanations may not always apply to all recommendations, and the goal is to provide high model fidelity.

In a gaming application, Frogger, [9] created a system that generated simple rational explanations of the agent state and actions rather than complex detailed explanations. They showed good perception of the rationales by users, further supporting our hypothesis that simple post-hoc explanations are well received by users.

The post-hoc explanation approach that we propose in this paper emphasizes simplicity, flexibility, and the ease of its application. Our method supports simple similarity-based models, collaborative and content-based, as well as other simple post-hoc methods. This allows users to choose their preferred type of explanation. The main tunable parameter in our approach is the method-specific threshold for deciding which explanation is sufficiently supported to be presented to the user.

GENERATING POST-HOC EXPLANATIONS USING SIMPLE METHODS
We now present our framework for providing post-hoc explanations for complex model recommendations. The framework is presented in Figure 1.

Figure 1: Generating explanation method. The method receives as input the user-item rating matrix, and additional data as input for the explanation algorithms, such as item content, item overview, user system profile, etc. The method outputs a recommendation and the chosen explanation.

Our method for generating a recommendation along with a plausible explanation operates in several stages. First, a black box recommendation model receives as input the user-item rating matrix and outputs a recommendation. Although in this paper we focus on collaborative filtering methods, this approach can be applied to other methods, such as content-based recommenders, that employ data sources other than the user-item matrix.

In the second stage, the recommended item is given as input to several explaining algorithms. In addition, each explanation algorithm receives as input additional required data sources. These explanation methods can access the data sources available to the recommender, but also other data sources as needed. For example, a possible explanation is the popularity of the item. The algorithm which produces this explanation requires data over item popularity. Another possible explanation approach is a content-based item-item method, which requires as input item content information.

The explaining algorithm is also a recommendation method, that produces a recommendation score for items, or a ranking of recommended items for the user. We use the explaining algorithm to generate such a score for the recommended item. The algorithm returns an explanation only if the recommendation score is sufficiently high. We use a method-specific threshold to decide whether the explanation is sufficiently relevant.

The explanations provided by all explaining algorithms are fed into a filter. All plausible explanations received from the explaining algorithms are filtered and one explanation is chosen to be shown to the user. For example, such a filter can be based on user preferences, or on the observed response of the user to different types of explanations. Choosing the explanation with the best score from the explanation algorithms is problematic, because these scores are not calibrated, that is, each explaining algorithm may use a different scale of scores.
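The following sketch illustrates this overall flow under the assumptions above; the explainer objects, their fields, and the random choice among acceptable explanations are illustrative names and choices, not the authors' implementation.

import random

# Post-hoc explanation pipeline: each explainer scores the recommended item, only
# explanations whose score passes that explainer's own threshold are kept, and one
# of them is shown (scores of different explainers are not calibrated, so they are
# not compared directly).
def explain(user, recommended_item, explainers, default_text):
    candidates = []
    for ex in explainers:
        score = ex["score"](user, recommended_item)   # explainer acts as a recommender
        if score >= ex["threshold"]:                  # method-specific threshold
            candidates.append(ex["render"](user, recommended_item, score))
    return random.choice(candidates) if candidates else default_text

explainers = [
    {"score": lambda u, i: 0.7, "threshold": 0.5,
     "render": lambda u, i, s: f"{i} is popular. Many users have watched it."},
    {"score": lambda u, i: 0.2, "threshold": 0.4,
     "render": lambda u, i, s: f"Users similar to you often like {i}."},
]
print(explain("u1", "Heat", explainers,
              "Our system predicts that this movie is a good match for you."))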
USER STUDY
As we have explained above, we study the participant perception of the provided explanations. We now describe a user study applying our approach to a movie recommendation application, in which participants evaluate recommended movies, with and without explanations. The participants also provide their preferences over possible explanations for a recommended movie.

More formally, we study two hypotheses:

• Users prefer short post-hoc explanations generated by simple methods over a complete explanation of the mechanism of complex models.

• Presenting a post-hoc explanation to the user influences the user acceptance of a recommended movie.

We now explain the structure and process of the user study — the dataset and algorithms used to generate the recommendations and the explanations, and the different parts of the study. We then discuss the results that we observed.

Dataset and Algorithms
Our study is implemented in the movie recommendation domain using the Kaggle movies dataset (https://www.kaggle.com/rounakbanik/the-movies-dataset), containing both ratings from MovieLens, as well as movie content data from TMDB (https://www.themoviedb.org/).

The dataset originally contains 45,000 movies. We filtered the dataset for two reasons — first, as we are interested in participants' opinion over the presented movies, we prefer to limit our attention to relatively popular movies, to increase the likelihood that the participant is familiar with a recommended movie. Moreover, we observed that the complex models that we use provide less appropriate recommendations when the input movies have a relatively low number of user opinions. As we are not truly interested in evaluating the quality of the complex models, but rather the participant perception of the recommended movies, with and without explanations, we prefer to limit the models to items that are easier to recommend. We hence choose to use only movies with more than 500 ratings, resulting in 3878 movies. We used all users who rated at least one of these movies, resulting in 122,147 users and 5.7 million user-movie ratings.
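A possible way to perform this filtering step, sketched with pandas; the file and column names follow the public Kaggle dataset layout but are assumptions rather than the script actually used in the study.

import pandas as pd

# Keep only movies with more than 500 ratings, and all users who rated at least
# one of the remaining movies (illustrative sketch).
ratings = pd.read_csv("ratings.csv")              # assumed columns: userId, movieId, rating, timestamp
counts = ratings.groupby("movieId").size()
popular = counts[counts > 500].index
filtered = ratings[ratings["movieId"].isin(popular)]

print(filtered["movieId"].nunique(), "movies,",
      filtered["userId"].nunique(), "users,",
      len(filtered), "ratings")                   # the paper reports 3878 / 122,147 / ~5.7M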
For generating the recommendations, we use two complex models, an MF recommender that we implemented locally, and a variational autoencoder (VAE) [16].

For generating the explanations, we implemented 6 simple-to-explain algorithms. Each algorithm receives as input a user profile, and an item (ir) that was recommended by a complex model (VAE or MF), and generates a recommendation score for that item. In addition, different algorithms take as input different data sources.

• Popularity (denoted POP in the tables below): we compute the popularity of ir in our dataset. If the movie is sufficiently popular, we can explain the recommendation by the movie being popular. The resulting explanation reads "This movie is popular. Many users have watched it." We set the threshold here to the 50 most popular movies.

• Item-item content based (denoted I2ICB): for each movie j that the user has rated, we compute a content similarity score between j and ir, and take the item in the user profile with the maximal score. The content similarity is computed using the Jaccard score between the movies' cast (top 5 actors only), genres and director. The resulting explanation is based on the particular content features that the items share. For example, "This movie was recommended to you because you liked j in the past, and the actor c played in both movies, and both movies are of genre g."

• User-item content based (denoted UICB): we generate a user profile from the list of movies that the user has liked. The profile contains a score for each actor, director, and genre, based on the number of times that a content attribute value, e.g., a specific actor, appeared in the movies that the user has liked. We then compute a weighted Jaccard score between the user profile and ir's content attributes. The resulting explanation is based on the specific content attributes that the user profile and the item have in common. For example, such an explanation may be "This movie was recommended to you because it was directed by d, and you have liked other movies that d directed."

• Item-item overview (denoted I2ID): for each movie j that the user has rated, we compute the description similarity between j and ir, and take the item in the user profile with the maximal score. The textual similarity between item descriptions is computed using TF-IDF. The explanation in this case is: "This movie was recommended to you because you liked j in the past, and both movies have a similar description."

• Item-item collaborative filtering (denoted I2ICF): we compute the item-item Jaccard score, that is, the number of users who have watched both movies, divided by the number of users who have watched at least one of the movies. The explanation here reads "This movie was recommended to you because you have watched movie m, and many people who like m also like this movie."

• User-user collaborative filtering (denoted U2UCF): we compute the user neighborhood using the Jaccard similarity between the sets of movies that each user has liked. Then, we compute the portion of similar users who have watched the recommended movie. This explanation reads "This movie was recommended to you because x% of users who like the same movies that you did, also like this movie."

• Default explanation (denoted DEF): this is a strawman explanation that provides no additional information to the user, reading "Our system predicts that this movie is a good match for you."

For each explanation algorithm we manually tune a threshold specifying whether the explanation is sufficiently relevant to be shown to the user. We leave a smarter tuning of these thresholds to future research.
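For illustration, here is a sketch of two of these scores — the I2ICB Jaccard score over cast, genres and director, and the I2ID TF-IDF overview similarity — using hypothetical movie dictionaries; this is our sketch, not the study's code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_features(movie):
    # Top-5 cast members, genres, and the director, as in the I2ICB description.
    return set(movie["cast"][:5]) | set(movie["genres"]) | {movie["director"]}

def i2icb_score(profile_movies, rec_movie):
    """Best content similarity between the recommended movie and a liked movie."""
    return max((jaccard(content_features(m), content_features(rec_movie)), m["title"])
               for m in profile_movies)

def i2id_score(profile_movies, rec_movie):
    """Best TF-IDF overview similarity between the recommended movie and a liked movie."""
    texts = [m["overview"] for m in profile_movies] + [rec_movie["overview"]]
    n = len(profile_movies)
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(X[n], X[:n]).ravel()
    best = int(sims.argmax())
    return sims[best], profile_movies[best]["title"]

liked = [{"title": "Heat", "cast": ["Al Pacino", "Robert De Niro"], "genres": ["Crime"],
          "director": "Michael Mann", "overview": "A detective pursues a crew of bank robbers."}]
rec = {"title": "Ronin", "cast": ["Robert De Niro", "Jean Reno"], "genres": ["Crime"],
       "director": "John Frankenheimer", "overview": "Mercenaries chase a briefcase across France."}
print(i2icb_score(liked, rec), i2id_score(liked, rec))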
Population
We recruited to the study mostly engineering students from different academic institutes. The subjects who completed the study entered a raffle for a cash prize. Some subjects were given, in addition, a credit in an academic course. We recruited the subjects by sending an email to several mailing lists, asking people to participate in a study over recommender systems for movies.

Overall, we recruited 207 participants, 131 males and 73 females (3 preferred not to specify gender). 24% of the participants were graduate students, 53% were undergrad students, and 23% had high school education only. 55% of the participants were 25 years old or younger, 35% were between 25-30, and 10% were above 30 years old.

Some of the participants have a background in recommender systems or related fields. 103 have taken a course in machine learning, 67 have taken a course in deep learning, 56 participants have taken a course in information retrieval, and 52 have taken a course in recommendation systems. 40% of the participants have not taken any course in those fields.

16% reported watching a few movies each week, 46% reported watching a movie once a week, 32% once a month, and the rest (6%) almost never watch a movie. Netflix is the leading movie watching channel (75%). 46% reported watching movies at the theater, 40% watch downloaded movies, and 36% watch movies on broadcast channels. We did not detect any significant difference between the various populations in the participant behavior and answers below.

We asked the participants how they decide which movie to watch. 78% use recommendations from friends, 56% read movie reviews online or in the newspaper, 25% report using some automated system to recommend movies, and 20% watch whatever is currently on. 81% are familiar with personal movie recommendations in Netflix. When asked about the quality of the Netflix recommendations, 62% reported that they sometimes liked the recommendations, 18% almost always like Netflix recommendations, 13% mostly do not like the recommendations, and 7% reported never liking these recommendations. Netflix presents some shows or movies under the title "Because you watched X". 58% of the participants claimed that they are likely to explore recommended movies under this title, whereas 35% said that they may explore these recommendations, and 7% will not explore such a recommended movie.

Method
We now describe the process of the user study, explaining the different tasks that the test subjects performed. As we explained above, the subjects were asked to participate in a user study over movie recommendations. The invitation email, as well as the instructions at the beginning of the study, did not mention explanations. Specifically, the subjects were told that they are asked to evaluate the recommendations of a system.

Step 1: Creating a User Profile. After an instruction screen, we asked each subject to choose 5 movies which she likes (Figure 2). Using these movies, we created a CF user profile that is used as input to the black box recommendation algorithms — MF and VAE.

Figure 2: Choosing preferred movies. The drop-down lists on the left contain all movies in the study, and can be used to search for a movie name. Clicking on the movie poster allowed the participants to explore the movie details on the IMDB website.

Once the participant clicks on the "Let's go" button, we compute two lists of 3 recommendations. For each black box algorithm, we compute two recommended movies using the provided user profile. In addition we add to each of the two recommendation lists a randomly selected movie from the top 100 popular movies according to the IMDB popularity score. These popular movies allow us to evaluate the participant opinion over non-personalized recommendations.

Step 2: Rating Recommendations Without Explanations. During this step, the participant was presented with the above two sets of recommended movies, without any explanations, and was requested to evaluate them. The opinions of the participants over these sets serve as a baseline for the performance of the recommendation algorithms, without the influence of an explanation.

We present the recommended movies to the participant in two different screens, one containing 2 MF recommendations and one popular movie, and the other containing 2 VAE recommendations and one popular movie (Figure 3). The order of the systems, as well as the ordering within the 3 movies, is random.

Figure 3: Presenting two recommendations from either MF or VAE, and one popular, non-personal movie, ordered randomly. The subject must rate each recommendation.

Throughout the study, we avoid presenting to the subject recommended movies that were previously shown to her. If both algorithms agree on a recommended movie, we take the next movie on the recommendation list.

The subject rates each recommended movie on a 1-5 scale. Again, clicking on the movie poster allowed the subject to explore movie data from IMDB.

Step 3: Rating Recommendations with Explanations. We now use the black box recommenders to produce two additional recommended movies. We enrich the user profile by adding all recommended movies that the subject rated 4 or 5 in the first step, and avoid recommendations that were already presented in the first step.

In addition, we apply all the explanation generation algorithms above. We use only an explanation whose score is higher than the method-specific threshold required to be considered acceptable. From all acceptable explanations, we choose one explanation randomly. In cases where none of the algorithms returned a plausible explanation, we show a default explanation.

We used in this step 3 different methods for showing the explanations:

• Hidden: we place below the recommended movie a button saying "Why is this movie appropriate for me?". Clicking on the button opened a popup window containing the explanation.

• Teaser: we place below the recommended movie the beginning of the explanation, followed by an ellipsis. Clicking on the ellipsis opened a popup window containing the explanation.

• Visible: we place below the recommended movie the explanation.

Figure 4: Step 3: rating recommended movies with explanations — (a) Hidden explanation, (b) Explanation teaser, (c) Explicit explanation. A possible simple explanation is presented for each recommendation. Each subject was shown one of the three alternative explanation displays.

This allows us to check whether the participants are interested in an explanation, and whether they actively seek an explanation. We use a between-subjects setting here, that is, each participant was allocated to one of the 3 groups, to avoid over emphasizing the explanations due to the variations in presentation. We again ask each participant to rate 2 sets of recommendations, each containing 3 recommended movies, as in the previous step.
Step 4: Rating Explanations. In the final step of the user study we explicitly ask the subject to rate possible explanations. We again add to the user profile the successful recommendations from the previous steps, and ask for additional recommendations from the two black box algorithms, MF and VAE.

In this step, unlike the previous steps, we present to the participant a single recommended movie. In addition, we present movie content information, such as the actors, the genres, and the description, without requiring the participants to explicitly request such details (Figure 5).

Figure 5: Step 4: rate the explanations generated by the simple models for the recommended movie of a complex model. All possible simple explanations are presented for each recommendation.

We first ask the participant to state whether she likes the recommended movie, and then present a set of explanations as to why the movie was recommended. We show all explanations that are deemed sufficiently appropriate, achieving a score higher than the method-specific threshold. The participant is asked to rate each explanation on a scale of 1-5.

Each participant is shown 6 different movies in this step as well, 3 of which were generated by each of the two black box recommenders, and ordered randomly.

To summarize, we present to each participant in steps 2-4 18 different recommendations, 7 recommendations from each of the complex models, MF and VAE, and 4 additional popular movies.

Post Study Questionnaire. After finishing step 4 above, the subject is transferred to an online questionnaire. We first ask a set of questions concerning demographic details. Then, we ask the participants questions about their movie watching habits, and their previous interactions with recommender systems. Finally, we ask questions regarding their opinions about the presented explanations.

Results
We now discuss the user study results. We first review the effects of explanations on the subject perception of a recommended movie, then we discuss subject opinion over the various explanation types. Finally, we discuss the fidelity of the various explanation methods.

Effects of Various Explanations on Movie Ratings
We now study the effect that the explanations had on the subject opinion over the recommended movies, comparing the average rating for movies without explanations and with explanations. As we explained above, in Step 3 there were 3 options for explanation presentation — hidden, requiring a click on a button; teaser, showing only the beginning of the explanation; and visible, fully presenting the explanation.

Somewhat surprisingly, the participant clicked on the button for only 24% of the recommendations in the first case, and clicked on the teaser for only 14% of the recommendations in the second case. That is, for most recommendations the participants did not look at the explanations. In informal discussions following the study, participants indicated that they did not see the option to request an explanation, or did not think that they needed an explanation to decide whether the recommended movie is appropriate.
Thus, we group together here both movies in Step 2 for which no explanation was shown, and movies in Step 3 shown to participants who did not click on an explanation button or teaser. We compare this group to recommendations for which the explanation was shown. Below, when we discuss significance, we base our claims on a paired t-test.

Table 1 compares the average rating for each one of the plausible explanations, and without explanations. First, although this is not the focus of the study, the VAE method produced better recommendations than the MF method, which produced better results than recommending a random popular movie.

Table 1: Average ratings for movie recommendations with different explanations, and without explanations.

             AE            MF            Popular       All
Explanation  Count  Avg    Count  Avg    Count  Avg    Count  Avg
None         126    4.29   126    3.65   126    3.13   378    3.69
POP          1      5      1      2      122    3.25   124    3.26
I2ICB        25     4.52   14     4.21   -      -      39     4.41
UICB         4      4.75   -      -      -      -      4      4.75
I2ID         16     4.33   5      4      -      -      21     4.25
I2ICF        43     4.19   28     3.82   -      -      71     4.04
U2UCF        19     4.11   7      3.86   -      -      26     4.04
DEF          14     3.43   67     3.04   -      -      81     3.11

Looking at the explanations, we can see that the user-item content-based explanation was shown only 4 times, and is hence not statistically different than other explanations. The popularity and the default explanations result in lower ratings than all other explanations. That is, movies presented with either CF or CB explanations produce significantly higher ratings than the non-personal popularity explanation and the non-informative default explanation.

While the differences between the ratings can be attributed to the presented explanation, there is another plausible reason for these differences. It might be that recommended items for which a specific recommendation type applies are better recommendations. For example, it may be that when a recommended item has a strong item-item Jaccard correlation with an item in the user profile, it is considered as a better recommendation for the user, whether we explicitly tell the user about it or not.

Table 2 shows the average rating over movies that we were able to explain through one of the methods although the explanation was not shown to the subject. This occurs either in Step 2, or in Step 3 where the subject did not click on the explanation button or teaser. As can be seen, similar to the ratings in Table 1, movies for which a user based explanation exists, as well as movies with similar descriptions, receive a statistically significant (t-test p-value=0.046) higher user rating than movies for which an item-item based explanation holds. These, in turn, receive a statistically significant higher rating than movies for which only the popularity explanation holds. Finally, movies for which none of our explanation types hold receive the lowest rating.

Table 2: Average rating for movies for which each recommendation type applies.

Explanation   Count  Avg
I2ICB         376    4.4
UICB          62     4.73
I2ID          237    4.63
I2ICF         508    4.27
U2UCF         109    4.62
Only POP      81     3.98
None applies  386    3.55

To conclude, on the one hand it is unclear whether the explanations that we suggest themselves truly affect the subject behavior. On the other hand, it appears that these explanations are well correlated with the way that participants perceive a recommended movie, and decide whether to rate it higher. As such, it may be that our explanations indeed capture a part of the subject decision process for her opinion over a recommended movie.
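For completeness, the kind of significance check referred to above can be sketched as follows; the numbers are made up and only illustrate the paired t-test usage, not the study's data or analysis script.

from scipy import stats

# Paired t-test over per-participant average ratings with and without explanations
# (toy, hypothetical values).
with_expl = [4.2, 3.8, 4.5, 3.9, 4.1]
without_expl = [3.7, 3.6, 4.0, 3.5, 3.9]
t, p = stats.ttest_rel(with_expl, without_expl)
print(t, p)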
User Ratings for Explanations
As we explained above, in Step 4 we asked the participants to rate various explanations for a given recommendation. Table 3 shows the explanation ratings provided by the participants.

Table 3: Average user ratings for different explanations (Step 4).

             AE            MF            All
Explanation  Count  Avg    Count  Avg    Count  Avg
POP          46     3.35   20     3.05   66     3.26
I2ICB        98     3.71   38     3.79   136    3.74
UICB         52     3.29   4      3.75   56     3.32
I2ID         74     3.61   14     4.36   88     3.73
I2ICF        114    3.75   41     4.05   155    3.83
U2UCF        82     3.45   23     4.09   105    3.59
DEF          140    3.68   79     3.62   219    3.66

Somewhat surprisingly, all explanations, including the default explanation, received a relatively positive (above 3 on a 1-5 scale) rating. The only explanations that the participants significantly liked less are the popularity explanation and the user-item content-based explanation.

The latter is especially surprising, given that movies for which this explanation was shown, or for which this explanation holds, receive the highest user ratings in the results reported in Table 1 and Table 2. We believe that the relatively lower subject opinion for this type of explanation may be attributed not to its content, but rather to its length. As we discuss below, in the post-study questionnaire, participants reported that they prefer short explanations. This explanation is by far the longest. Note that while the item-item content-based explanation may also appear to be long, in practice it is not; for the content-based explanations we report all properties (actors, genres, director) that apply. A recommended movie typically has more joint property values with the user profile, containing all the movies that the user has liked (i.e., UICB), than with a single movie that the user has liked, which entails longer explanations for UICB.

Explanation Fidelity
Finally, we evaluate the explanation fidelity — the portion of recommended items for which each explanation type holds. Table 4 shows the empirical fidelity of the various explanations with respect to all recommended items in our study in Steps 2-4. We note that the fidelity is highly sensitive to the thresholds that we set to decide which explanation is sufficiently valid to be presented. We leave an automated careful tuning of these thresholds to future research.

Table 4: Model Fidelity.

                      AE               MF               All
Explanation           Count  Fidelity  Count  Fidelity  Count  Fidelity
POP                   155    0.35      104    0.23      259    0.29
I2ICB                 263    0.59      128    0.29      391    0.44
UICB                  113    0.25      7      0.02      120    0.13
I2ID                  194    0.43      36     0.08      230    0.26
I2ICF                 312    0.7       153    0.34      465    0.52
U2UCF                 181    0.4       65     0.15      246    0.28
At least one
explanation by
methods 2-6           365    0.82      233    0.52      579    0.65

As can be seen, collaborative filtering fidelity is always higher than its content-based counterpart, which is not surprising, because the black box recommenders are collaborative filtering methods. Item-item explanations have higher fidelity than user-based explanations. This is not surprising, given the relatively small user profiles that we use.

It is especially interesting to look at the difference in content-based fidelity between the MF method that we use and VAE. Together with Table 2, this may explain the lower quality of recommendations computed by our MF implementation. The movies recommended by the MF method have very low content similarity to the movies that the subject has liked, and this may be the reason that participants rate them lower.

Overall, as can be seen in the bottom line of Table 4, 65% of the recommended movies could be explained by at least one of our suggested methods (except for the popularity explanation). [21] report a model fidelity of 84% at most for their created association rules. Our model fidelity is sensitive to the thresholds that we set to accept an explanation. We may be able to increase the model fidelity with more accurate and personal tuning of these thresholds.
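A small sketch of how such empirical fidelity can be computed from the scored recommendations (the data structures and names are illustrative, not the study's code):

# For each explainer, the portion of recommended (user, item) pairs whose
# explanation score passes that explainer's threshold.
def fidelity(recommendations, explainers):
    out = {}
    for name, ex in explainers.items():
        hits = sum(1 for user, item in recommendations
                   if ex["score"](user, item) >= ex["threshold"])
        out[name] = hits / len(recommendations) if recommendations else 0.0
    return out

# e.g. fidelity(all_shown_recs, {"POP": pop_explainer, "I2ICF": i2icf_explainer})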
Post Study Questionnaire Results
We now discuss the participant answers to the questions concerning the explanations at the post study questionnaire. The responses below are hence biased given the explanations shown throughout the study, and may not reflect the subject opinion prior to the study.

70% of the participants reported noticing the explanations in our study, 24% noticed them only sometimes, and 6% reported not noticing the explanations at all. 60% of the participants felt that the explanations were mostly appropriate, 26% felt they were sometimes appropriate, only 1% felt that the explanations were always appropriate, and 3% felt that they were never appropriate. 71% of the participants thought that explanations can help understand the recommendation, and may influence the decision on considering the recommended item. 23% of the participants said that an explanation is interesting, but would not change their opinion over the movie. 5% responded that an explanation is not important at all, and 1% said they ignore all recommendations and hence the explanations are not relevant. Similar results were reported before for the importance of explanations in recommendation systems [12, 27].

We also asked, in an open, non-obligatory question, to state the explanation that they liked the best. 93 of the participants chose to answer. We categorized their free text answers into groups. 52% of the responses were related to content-based explanations. 33% preferred the collaborative filtering explanations. 10% liked the popularity explanations, and 4% liked the default explanation. [4] reports a similar preference for content-based explanations over CF explanations.

Figure 6: Explanation properties importance.

Figure 6 shows the participants' responses about the importance of various properties of an explanation. We can see that the property that was deemed most important is that an explanation should be easy to understand. Participants also thought that an explanation should be accurate, convincing, and short. We believe that this explains the relatively low opinion of the participants concerning the content-based user-item explanation which we reported above, as this explanation is quite long.

The only property that was not deemed as important by the participants is whether the explanation fully explains the recommendation mechanism. This is somewhat in conflict with many research attempts in the recommender system community [25, 27] that focus on providing an explanation of the way that the models operate. It appears that users, at least the participants of our study, prefer an explanation that will help them decide whether the recommended item is appropriate for them, rather than one that explains the mechanism behind the recommendation engine.

When asked if they would like to get such explanations in a system that they use (e.g. Netflix), 62% answered positively, 31% answered maybe, and the rest (7%) answered no.

These findings — that 94% of the participants found many of our explanations to be appropriate, and that most people would have liked to see such explanations in a system that they use — together with the relatively low importance of revealing the recommendation engine behavior, further support our intuition that post-hoc explanations generated by simple methods can provide valuable information that users appreciate.
CONCLUSION
In this paper we suggest a simple method for generating post-hoc explanations for recommendations generated by complex, difficult to explain, models. We use a set of easy-to-explain recommendation algorithms, and when their output agrees with the recommendation of the complex model, consider the explanation of the simple model as a valid explanation for the recommended item. While these explanations are clearly not transparent, we argue that they provide valuable information for the users in making decisions concerning the recommended items.

We study two research questions. First, whether users prefer our simple post-hoc explanations to explanations of the mechanism of the neural network or the matrix factorization model. Indeed, in our post study questionnaire, users stated that it is more important for an explanation to be short and clear than to fully explain the algorithm.

Second, we checked whether presenting a post-hoc explanation influences the behavior of users. For some of our explanations, namely, the I2ICB explanation and the I2ID explanation, the average rating was higher when an explanation was presented than the average rating when no explanation was presented. For other explanations, this did not hold. We speculate that this was due to the explanation length and complexity. Perhaps a future, simpler phrasing of the explanations would lead to more pronounced effects.

To support our claims, we use a user study in the movie domain, showing that some explanations may affect the user opinion over the recommended item. We also show that movies that can be explained by our method may be better items to recommend. We evaluate subject opinion over the different explanations that we suggest, showing that participants preferred item-item explanations to user-based explanations. The subjects also stated that it is more important for an explanation to be easy to understand, convincing, and short, than to uncover the underlying operation of the recommendation engine.

Our method can be easily extended by using additional explainable recommenders. In the future we will apply more methods. We will also study methods for automatically selecting a method-specific threshold for deciding if an explanation is valid, instead of the manually tuned threshold that we currently use.
REFERENCES
[1] Rakesh Agrawal, Ramakrishnan Srikant, and others. 1994. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215. 487–499.
[2] Fernando Amat, Ashok Chandrashekar, Tony Jebara, and Justin Basilico. 2018. Artwork personalization at Netflix. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 487–488.
[3] Oren Barkan and Noam Koenigstein. 2016. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
[4] Mustafa Bilgic and Raymond J Mooney. 2005. Explaining recommendations: Satisfaction vs promotion. In Beyond Personalization Workshop, IUI, Vol. 5. 153.
[5] John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 43–52.
[6] Jingwu Chen, Fuzhen Zhuang, Xin Hong, Xiang Ao, Xing Xie, and Qing He. 2018. Attention-driven factor model for explainable personalized recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 909–912.
[7] Weiyu Cheng, Yanyan Shen, Linpeng Huang, and Yanmin Zhu. 2019. Incorporating Interpretability into Latent Factor Models via Fast Influence Analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 885–893.
[8] Hendrik Drachsler, Katrien Verbert, Olga C Santos, and Nikos Manouselis. 2015. Panorama of recommender systems to support learning. In Recommender Systems Handbook. Springer, 421–451.
[9] Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O Riedl. 2019. Automated rationale generation: a technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 263–274.
[10] Michael D Ekstrand, John T Riedl, Joseph A Konstan, and others. 2011. Collaborative filtering recommender systems. Foundations and Trends in Human–Computer Interaction 4, 2 (2011), 81–173.
[11] Bruce Ferwerda, Kevin Swelsen, and Emily Yang. 2018. Explaining Content-Based Recommendations. New York (2018), 1–24.
[12] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. 241–250.
[13] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[14] Yehuda Koren. 2010. Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD) 4, 1 (2010), 1.
[15] Yehuda Koren and Robert Bell. 2015. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 77–118.
[16] Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
[17] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook. Springer, 73–105.
[18] Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, and Guangquan Zhang. 2015. Recommender system application developments: a survey. Decision Support Systems 74 (2015), 12–32.
[19] James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra. 2018. Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, New York, NY, USA, 31–39. DOI: http://dx.doi.org/10.1145/3240323.3240354
[20] Xia Ning, Christian Desrosiers, and George Karypis. 2015. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer, 37–76.
[21] Georgina Peake and Jun Wang. 2018. Explanation mining: Post hoc interpretability of latent factor models for recommendation systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2060–2069.
[22] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender Systems: Introduction and Challenges. In Recommender Systems Handbook. 1–34.
[23] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. 285–295.
[24] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. In Recommender Systems Handbook. Springer, 257–297.
[25] Rashmi Sinha and Kirsten Swearingen. 2002. The role of transparency in recommender systems. In CHI '02 Extended Abstracts on Human Factors in Computing Systems. 830–831.
[26] Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.
[27] Nava Tintarev and Judith Masthoff. 2007. A survey of explanations in recommender systems. In 2007 IEEE 23rd International Conference on Data Engineering Workshop. IEEE, 801–810.
[28] Nava Tintarev and Judith Masthoff. 2015. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer, 353–382. DOI: http://dx.doi.org/10.1007/978-1-4899-7637-6_10
[29] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1235–1244.
[30] Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018. A Reinforcement Learning Framework for Explainable Recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). 587–596.
[31] Yongfeng Zhang and Xu Chen. 2018. Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192 (2018).
[32] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 83–92.