Post-hoc Explanations for Complex Model Recommendations using Simple Methods

Dorin Shmaryahu, Guy Shani, Bracha Shapira
Ben-Gurion University of the Negev, Israel
dorins@post.bgu.ac.il, shanigu@bgu.ac.il, bshapira@bgu.ac.il

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '20 - Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, September 26, 2020, Virtual Event.

ABSTRACT
Many leading approaches for generating recommendations, such as matrix factorization and autoencoders, compute a complex model composed of latent variables. As such, explaining the recommendations generated by these models is a difficult task. In this paper, instead of attempting to explain the latent variables, we provide post-hoc explanations for why a recommended item may be appropriate for the user, by using a set of simple, easily explainable recommendation algorithms. When the output of the simple explainable recommender agrees with the complex model on a recommended item, we consider the explanation of the simple model to be applicable. We suggest both simple collaborative filtering and content-based approaches for generating these explanations. We conduct a user study in the movie recommendation domain, showing that users accept our explanations, and react positively to simple and short explanations, even if they do not truly explain the mechanism leading to the generated recommendations.

Author Keywords
Recommender Systems, Explainable Recommendation, content-based explanations, collaborative filtering explanations, user study

INTRODUCTION
Recommendation systems that suggest items to users can be found in many modern applications, from online newspapers and movie streaming applications, to e-commerce [2, 26, 18]. Research has shown that in many applications, users may be interested in understanding why a particular recommended item is appropriate for them [27, 11, 31]. Thus, it is beneficial to be able to generate explanations for the recommended items.

Early simple recommendation algorithms often yield a natural explanation for their recommendations. For example, the recommendations of a neighborhood-based collaborative filtering approach [20] can be explained as: "users similar to you often choose this item". Item-item collaborative filtering algorithms [23, 3] provide recommendations that can be explained as "users who choose the item that you have chosen often also choose the recommended item". Content-based algorithms [17], that learn for each user a set of content features that the user prefers, generate recommendations that can be explained by "the recommended item has a content feature that you prefer".

However, these simple algorithms often provide recommendations of lower accuracy than modern approaches. In recent years, two collaborative filtering approaches became popular for generating good recommendations — the matrix factorization (MF) approach [13, 14, 15], and the artificial neural network (ANN) approach [29]. Algorithms of these families have shown the capacity to generate accurate recommendations for users.

One of the downsides of both approaches is that they compute the recommendations through a set of latent variables and their possibly non-linear relations. For example, in the MF approach one computes a vector of latent variables for each user, and a vector of latent variables for each item, and then computes a recommendation score using the inner product between the vectors of a particular user and a particular item. The values of the latent variables do not have an understandable meaning to humans.

Several researchers have attempted to provide explanations by understanding the behavior of the latent variables [32, 6]. Such efforts may be possible in some cases, but it is unlikely that all, or even most, latent variables represent an easy to understand structure. The problem becomes even more difficult with deep ANNs, that may contain thousands of such variables with complex connections between them.
Alternatively, one can take a post-hoc approach to explanations [12, 4], that takes the model recommendations as input, and attempts to identify reasons as to why these recommended items are appropriate to the user. For example, [21] used association rule mining to identify explanations for the recommendations directly from the data. These explanations cannot be considered to be transparent [28], as they do not shed light on the choices made within the model in recommending the particular item, but may still provide value to the user. They can be effective, helping the user in making decisions. They may be persuasive, convincing the user to explore the recommended item. They may also increase trust, by, e.g., providing a reasonable explanation for a recommendation that the user dislikes.

In this paper we also take a post-hoc explanation generation approach. Given the output of any black-box recommender, we run a set of easy-to-explain recommendation algorithms, such as the simple collaborative filtering and content-based methods suggested above. These algorithms provide a score for the items recommended by the black-box recommender. When this score is sufficiently high, it means that the explainable recommender agrees with the black-box recommender. In this case, we can present the explanation of the explaining recommender to the user.

Our approach is model agnostic — we can generate explanations for any recommender. Our approach is also flexible, in that the explanations can be generated post-hoc by any easy-to-explain recommendation algorithm that outputs a recommendation score for each item. Although in this paper we study only the simple recommenders mentioned above, given any other easy-to-explain recommender, one can use it to generate new explanations, that would be candidate explanations for the items recommended by the black-box recommender.

We study the user perception of explanations generated by simple easy-to-explain recommenders for the items recommended by complex models. We evaluate the user's response to recommended items with and without explanations of different types. We also measure participant preference over the various types of explanations. To study these questions we conduct a user study in the movie domain. We use two popular recommendation models, an MF and an autoencoder, as black boxes to generate recommendations. For each recommended item we run a set of 6 easy-to-explain approaches to produce explanations for the recommendation — item-item content based, user-item content based, item-item collaborative filtering, user-user collaborative filtering, movie overview textual similarity, and a popularity recommender. We show only explanations which are sufficiently relevant, that is, whose score passes a method-dependent threshold.

We first ask participants to rank the generated recommendations without any explanation. Then, we ask their opinion about recommended items with explanations, showing a single, randomly chosen, explanation for every movie.

In the next stage of the user study, the participants were shown additional recommended movies. In this stage we presented all explanations that passed a threshold to the participants, and asked them to rate each explanation. The results in this stage show that participants preferred content-based explanations to collaborative filtering explanations, and that popularity explanations are rated the lowest.

Finally, the participants completed an online survey, asking their opinion about recommendation explanations in general. Our results indicate that participants prefer short and easy to understand explanations to transparent explanations that fully disclose the mechanism behind the computed recommendations.

BACKGROUND
Recommender systems actively suggest items to users, to help them to rapidly discover relevant items, and to increase item consumption [22]. Such systems can be found in many applications, including TV streaming services [2], online e-commerce [26], smart tutoring [8], and many more [18]. We focus here on one important recommendation task [24] — top-N recommendation, where the system computes a list of N recommended items that the user may choose.

There are two dominant approaches for computing recommendations for the active user — the user that is currently interacting with the application and the recommender system. First, the collaborative filtering approach [5, 10] assumes that users who agreed on preferred items in the past will tend to agree in the future too. Many such methods rely on a matrix R of user-item ratings to predict unknown matrix entries, and thus to decide which items to recommend.

A simple method in this family [20], commonly referred to as user-user collaborative filtering, identifies a neighborhood of users that are similar to the active user. A common method for computing user similarity is the Jaccard correlation Jaccard(u1, u2) = |I_u1 ∩ I_u2| / |I_u1 ∪ I_u2|, where I_u is the set of items consumed by a user u. This set of neighbors is based on the similarity of observed preferences between these users and the active user. Then, items that were preferred by users in the neighborhood are recommended to the active user. Another approach [23, 3], known as item-item collaborative filtering, relies on the set of users that consumed two items i1 and i2. One can compute, e.g., the Jaccard correlation between the items: Jaccard(i1, i2) = |U_i1 ∩ U_i2| / |U_i1 ∪ U_i2|, where U_i is the set of users who consumed item i. Then, the system can recommend to a user u an item i2 that has high Jaccard similarity to an item i1 that u has previously consumed.
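To make these two similarity measures concrete, here is a small illustrative sketch (ours, not code from the paper), assuming the consumption history is given as a dictionary mapping each user to the set of items she liked; the function names and toy data are hypothetical.

# Illustrative sketch of the two Jaccard-based collaborative filtering scores.
# `liked` maps each user id to the set of items that user consumed.

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_user_score(active_user, item, liked, k=20):
    """User-user CF: portion of the k most similar users who consumed `item`."""
    sims = sorted(
        ((jaccard(liked[active_user], items), u)
         for u, items in liked.items() if u != active_user),
        reverse=True)[:k]
    if not sims:
        return 0.0
    return sum(1 for _, u in sims if item in liked[u]) / len(sims)

def item_item_score(active_user, item, liked):
    """Item-item CF: best Jaccard correlation between `item` and a consumed item."""
    users_of = {}
    for u, items in liked.items():
        for i in items:
            users_of.setdefault(i, set()).add(u)
    return max((jaccard(users_of.get(item, set()), users_of.get(j, set()))
                for j in liked[active_user]), default=0.0)

liked = {"u1": {"Heat", "Ronin"}, "u2": {"Heat", "Alien"}, "u3": {"Alien", "Ronin"}}
print(user_user_score("u1", "Alien", liked), item_item_score("u1", "Alien", liked))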
A second popular approach is known as content-based recommendation [17]. In this approach, the system has access to a set of item features. The system then learns the user preferences over features, and uses these computed preferences to recommend new items with similar features. Such recommendations are typically titled "similar items".

In content-based recommendations one can again take an item-item approach, computing the similarity between items based on shared feature values, such as the leading actors, the same director, or the same genre. Then, one can recommend an item that has high similarity to an item that was previously consumed by the user. One can also take a user-item approach, by computing a user profile — the set of feature values that often appear in items consumed by the user, such as actors that repeatedly appear in movies that the user has consumed, or genres that the user often watches. Then, one can compute the similarity of an item to the user profile to decide whether to recommend the item to the user.

It is widely agreed in the recommendation system research community that in many domains, collaborative filtering approaches produce better recommendations than content-based methods.
A collaborative filtering approach that has gained much attention in the recommender system community is matrix factorization [13, 14, 15], where the system attempts to factor the rating matrix R of size |U| x |I| into two matrices, P of size |U| x k and Q of size k x |I|, for some small number k, where R ≈ P x Q. One can consider the matrix P as a set of latent user features, and Q as a set of latent item features. An item i is considered to be appropriate for a user u when the inner product p_u · q_i is high. The resulting latent feature vectors p_u and q_i typically do not have a meaning that can be translated into content features, such as actors or genres, but are associated with the user like-dislike pattern of items. As such, explaining to the user why a particular item was recommended to her, beyond the vague statement that the system predicts that the item is a good match for the user, is difficult.
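As a toy illustration of how such a factored model is used for top-N recommendation — a sketch under the assumption that P and Q have already been learned (here they are random placeholders, not the paper's model):

import numpy as np

# Matrix factorization scoring: once R ≈ P x Q is available, the score of item i
# for user u is the inner product p_u · q_i, and the highest-scoring unseen items
# are recommended.
rng = np.random.default_rng(0)
n_users, n_items, k = 5, 8, 3
P = rng.normal(size=(n_users, k))      # latent user factors (|U| x k), stand-ins
Q = rng.normal(size=(k, n_items))      # latent item factors (k x |I|), stand-ins

def recommend(u, seen, n=2):
    scores = P[u] @ Q                  # p_u · q_i for every item
    ranked = np.argsort(-scores)       # items ordered by decreasing score
    return [int(i) for i in ranked if i not in seen][:n]

print(recommend(u=0, seen={1, 3}))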
Another state of the art collaborative filtering approach is the variational autoencoder (VAE). An autoencoder (AE) neural network is an unsupervised learning algorithm, attempting to produce target values equal to the input values, y(i) = x(i). The autoencoder tries to learn a function h_W,b(x) ≈ x, where W and b are the sets of weights and biases corresponding to the hidden units in the deep network. While the input and output layers of the network are large, there is an inner low dimensional layer within the network. Thus, the network learns a lower dimension representation of the input, the latent space. The autoencoder operates in two phases, an encoder that reduces the input into a compact representation in the low dimension layer, and a decoder, responsible for reconstructing the encoded representation into the original input.

In the recommendation system task, the input is a user partial item choice vector r(u), e.g., a vector of all movies in the system, where only movies that the user has watched receive a value of 1. The reconstruction of the input at the output layer contains higher scores for items that the user is likely to choose.
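A minimal sketch of this encode-decode scoring idea, with a single hidden layer and random stand-in weights (a trained VAE such as [16] is far more elaborate; this only illustrates how the reconstructed vector is used as scores):

import numpy as np

# The binary choice vector r(u) is encoded into a small latent layer and decoded
# back; output entries for unseen items serve as recommendation scores.
rng = np.random.default_rng(1)
n_items, latent = 10, 3
W_enc, b_enc = rng.normal(size=(n_items, latent)), np.zeros(latent)
W_dec, b_dec = rng.normal(size=(latent, n_items)), np.zeros(n_items)

def scores(r_u):
    z = np.tanh(r_u @ W_enc + b_enc)   # encoder: compress r(u) into the latent space
    return z @ W_dec + b_dec           # decoder: reconstruction, one score per item

r_u = np.zeros(n_items); r_u[[2, 5]] = 1.0   # the user watched items 2 and 5
print(np.argsort(-scores(r_u))[:3])          # highest-scoring items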
RELATED WORK
Explainable recommendations provided to users may help them understand why certain items are appropriate for them. By clarifying these reasons, explanations can improve the transparency, persuasiveness, effectiveness, trustworthiness, and user satisfaction of the recommender system [27, 11, 31]. While earlier recommenders were often naturally explainable, modern models are more complex, and do not yield natural explanations. Studies in explainable recommendations hence address the challenge of providing human understandable explanations for items recommended by complex models.

There are two main approaches to providing explainable recommendations [31]. The first approach attempts to create interpretable recommendation models whose results can be naturally explained. However, many modern models are not naturally explainable, and making them more explainable often results in reduced recommendation accuracy. This line of research therefore aims at mitigating the trade-off between accuracy and explainability by including explainable components, layers or external information into non-linear complex and deep accurate models to make them explainable. Examples of such solutions for MF-based recommendation models include the work by [32], who applied sentiment analysis over user reviews, to learn user preference-related features of items that served as a basis for latent feature tables. Additional examples can be found for deep learning recommendation models, such as the work by [6], that learned the distribution of user attention over features of different items that serve as explanations. These algorithms try to analyse the meaning of each latent component in a neural network, and how they interact with each other to generate the final results.

The second approach is post-hoc and model-agnostic [12, 4]. It treats the model as a black box and explains the recommendation results in a rational way by identifying relations between the data provided as input to the recommender system and its recommended items. This analysis is decoupled from the recommendation model, considering only the model input and output. The post-hoc approach has the advantage of enabling explanations in scenarios where the recommendation model cannot be exposed. Although the post-hoc explanations presented to a user are not transparent, i.e., they do not reflect the computation used by the underlying model to provide recommendations, they commonly present rational, plausible information for the user.

Some post-hoc explainable recommendation models use statistical methods to analyze the influence of the input on the output [7]. These methods often require heavy computations to provide explanations. Other studies apply various deep learning and reinforcement learning methods to build explanation models using various types of networks. These studies [30, 19] are commonly based on static explanation templates, result in complex models, and require parameter tuning.

Post-hoc methods are built on the assumption, that we investigate in this paper, that an explanation that makes sense to the user, even if it is not the exact reason that the recommendation was indeed issued, is acceptable to users and may have a beneficial effect for the recommendation system.

[4] suggested that providing explanations to users alongside a recommendation can help users to make more informed decisions about consuming the item. They used 3 post-hoc methods — keyword similarity, neighbors ratings, and what they call influential item computation — to explain recommendations generated by a hybrid content-based and collaborative rating prediction system. They ran a small scale user study in a books domain, attempting to understand which explanation provided the most information for the user to best understand the quality of the recommended item for her. Our paper can be seen as an extension of their preliminary work, describing a general framework for post-hoc explanations using simple methods, suggesting additional explanation types, and conducting a thorough user study in the movies domain, evaluating many more research questions.

[21] also extended the work of [4] by suggesting a different post-hoc method, applying association rule mining on the input data — the user-item rating table. The mining results in association rules, sorted by their confidence and support, that reflect links between items. Those links form the explanations that are provided to users whose input data include antecedents of the rule. The explanations, however, unlike our approach, are limited to item-based collaboration-like statements (i.e., "item X is recommended because item Y was consumed"), and require the application of some association rule mining algorithm (e.g., the a-priori algorithm that the authors used [1]). Rule mining algorithms typically require heavier computations than our simple similarity-based computations. They also defined Model Fidelity, the portion of recommendations that can be explained. Post-hoc explanations may not always apply to all recommendations, and the goal is to provide high model fidelity.

In a gaming application, Frogger, [9] created a system that generated simple rational explanations of the agent state and actions rather than complex detailed explanations. They showed good perception of the rationales by users, further supporting our hypothesis that simple post-hoc explanations are well received by users.

The post-hoc explanation approach that we propose in this paper emphasizes simplicity, flexibility, and the ease of its application. Our method supports simple similarity-based models, collaborative and content-based, as well as other simple post-hoc methods. This allows users to choose their preferred type of explanation. The main tunable parameter in our approach is the method-specific threshold for deciding which explanation is sufficiently supported to be presented to the user.

GENERATING POST-HOC EXPLANATIONS USING SIMPLE METHODS
We now present our framework for providing post-hoc explanations for complex model recommendations. The framework is presented in Figure 1.

Figure 1: Generating explanation method. The method receives as input the user-item rating matrix, and additional data as input for the explanation algorithms, such as item content, item overview, user system profile, etc. The method outputs a recommendation and the chosen explanation.

Our method for generating a recommendation along with a plausible explanation operates in several stages. First, a black box recommendation model receives as input the user-item rating matrix and outputs a recommendation. Although in this paper we focus on collaborative filtering methods, this approach can be applied to other methods, such as content-based recommenders, that employ data sources other than the user-item matrix.

In the second stage, the recommended item is given as input to several explaining algorithms. In addition, each explanation algorithm receives as input additional required data sources. These explanation methods can access the data sources available to the recommender, but also other data sources as needed. For example, a possible explanation is the popularity of the item. The algorithm which produces this explanation requires data over item popularity. Another possible explanation approach is a content-based item-item method, which requires as input item content information.

The explaining algorithm is also a recommendation method, that produces a recommendation score for items, or a ranking of recommended items for the user. We use the explaining algorithm to generate such a score for the recommended item. The algorithm returns an explanation only if the recommendation score is sufficiently high. We use a method-specific threshold to decide whether the explanation is sufficiently relevant.

The explanations provided by all explaining algorithms are fed into a filter. All plausible explanations received from the explaining algorithms are filtered and one explanation is chosen to be shown to the user. For example, such a filter can be based on user preferences, or on the observed response of the user to different types of explanations. Choosing the explanation with the best score from the explanation algorithms is problematic, because these scores are not calibrated, that is, each explaining algorithm may use a different scale of scores.
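The following sketch illustrates this overall flow under the assumptions above; the explainer objects, their fields, and the random choice among acceptable explanations are illustrative names and choices, not the authors' implementation.

import random

# Post-hoc explanation pipeline: each explainer scores the recommended item, only
# explanations whose score passes that explainer's own threshold are kept, and one
# of them is shown (scores of different explainers are not calibrated, so they are
# not compared directly).
def explain(user, recommended_item, explainers, default_text):
    candidates = []
    for ex in explainers:
        score = ex["score"](user, recommended_item)   # explainer acts as a recommender
        if score >= ex["threshold"]:                  # method-specific threshold
            candidates.append(ex["render"](user, recommended_item, score))
    return random.choice(candidates) if candidates else default_text

explainers = [
    {"score": lambda u, i: 0.7, "threshold": 0.5,
     "render": lambda u, i, s: f"{i} is popular. Many users have watched it."},
    {"score": lambda u, i: 0.2, "threshold": 0.4,
     "render": lambda u, i, s: f"Users similar to you often like {i}."},
]
print(explain("u1", "Heat", explainers,
              "Our system predicts that this movie is a good match for you."))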
USER STUDY
As we have explained above, we study the participant perception of the provided explanations. We now describe a user study applying our approach to a movie recommendation application, in which participants evaluate recommended movies, with and without explanations. The participants also provide their preferences over possible explanations for a recommended movie.

More formally, we study two hypotheses:

• Users prefer short post-hoc explanations generated by simple methods over a complete explanation of the mechanism of complex models.

• Presenting a post-hoc explanation to the user influences the user acceptance of a recommended movie.

We now explain the structure and process of the user study — the dataset and algorithms used to generate the recommendations and the explanations, and the different parts of the study. We then discuss the results that we observed.

Dataset and Algorithms
Our study is implemented in the movie recommendation domain using the Kaggle movies dataset (https://www.kaggle.com/rounakbanik/the-movies-dataset), containing both ratings from MovieLens, as well as movie content data from TMDB (https://www.themoviedb.org/).

The dataset originally contains 45,000 movies. We filtered the dataset for two reasons — first, as we are interested in participants' opinion over the presented movies, we prefer to limit our attention to relatively popular movies, to increase the likelihood that the participant is familiar with a recommended movie. Moreover, we observed that the complex models that we use provide less appropriate recommendations when the input movies have a relatively low number of user opinions. As we are not truly interested in evaluating the quality of the complex models, but rather the participant perception of the recommended movies, with and without explanations, we prefer to limit the models to items that are easier to recommend. We hence choose to use only movies with more than 500 ratings, resulting in 3878 movies. We used all users who rated at least one of these movies, resulting in 122,147 users and 5.7 million user-movie ratings.
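A possible way to perform this filtering step, sketched with pandas; the file and column names follow the public Kaggle dataset layout but are assumptions rather than the script actually used in the study.

import pandas as pd

# Keep only movies with more than 500 ratings, and all users who rated at least
# one of the remaining movies (illustrative sketch).
ratings = pd.read_csv("ratings.csv")              # assumed columns: userId, movieId, rating, timestamp
counts = ratings.groupby("movieId").size()
popular = counts[counts > 500].index
filtered = ratings[ratings["movieId"].isin(popular)]

print(filtered["movieId"].nunique(), "movies,",
      filtered["userId"].nunique(), "users,",
      len(filtered), "ratings")                   # the paper reports 3878 / 122,147 / ~5.7M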
For generating the recommendations, we use two complex models, an MF recommender that we implemented locally, and a variational autoencoder (VAE) [16].

For generating the explanations, we implemented 6 simple-to-explain algorithms. Each algorithm receives as input a user profile, and an item (ir) that was recommended by a complex model (VAE or MF), and generates a recommendation score for that item. In addition, different algorithms take as input different data sources.

• Popularity (denoted POP in the tables below): we compute the popularity of ir in our dataset. If the movie is sufficiently popular, we can explain the recommendation by the movie being popular. The resulting explanation reads "This movie is popular. Many users have watched it." We set the threshold here to the 50 most popular movies.

• Item-item content based (denoted I2ICB): for each movie j that the user has rated, we compute a content similarity score between j and ir, and take the item in the user profile with the maximal score. The content similarity is computed using the Jaccard score between the movies' cast (top 5 actors only), genres and director. The resulting explanation is based on the particular content features that the items share. For example, "This movie was recommended to you because you liked j in the past, and the actor c played in both movies, and both movies are of genre g."

• User-item content based (denoted UICB): we generate a user profile from the list of movies that the user has liked. The profile contains a score for each actor, director, and genre, based on the number of times that a content attribute value, e.g., a specific actor, appeared in the movies that the user has liked. We then compute a weighted Jaccard score between the user profile and ir's content attributes. The resulting explanation is based on the specific content attributes that the user profile and the item have in common. For example, such an explanation may be "This movie was recommended to you because it was directed by d, and you have liked other movies that d directed."

• Item-item overview (denoted I2ID): for each movie j that the user has rated, we compute the description similarity between j and ir, and take the item in the user profile with the maximal score. The textual similarity between item descriptions is computed using TF-IDF. The explanation in this case is: "This movie was recommended to you because you liked j in the past, and both movies have a similar description."

• Item-item collaborative filtering (denoted I2ICF): we compute the item-item Jaccard score, that is, the number of users who have watched both movies, divided by the number of users who have watched at least one of the movies. The explanation here reads "This movie was recommended to you because you have watched movie m, and many people who like m also like this movie."

• User-user collaborative filtering (denoted U2UCF): we compute the user neighborhood using the Jaccard similarity between the sets of movies that each user has liked. Then, we compute the portion of similar users who have watched the recommended movie. This explanation reads "This movie was recommended to you because x% of users who like the same movies that you did, also like this movie."

• Default explanation (denoted DEF): this is a strawman explanation that provides no additional information to the user, reading "Our system predicts that this movie is a good match for you."

For each explanation algorithm we manually tune a threshold specifying whether the explanation is sufficiently relevant to be shown to the user. We leave a smarter tuning of these thresholds to future research.
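For illustration, here is a sketch of two of these scores — the I2ICB Jaccard score over cast, genres and director, and the I2ID TF-IDF overview similarity — using hypothetical movie dictionaries; this is our sketch, not the study's code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_features(movie):
    # Top-5 cast members, genres, and the director, as in the I2ICB description.
    return set(movie["cast"][:5]) | set(movie["genres"]) | {movie["director"]}

def i2icb_score(profile_movies, rec_movie):
    """Best content similarity between the recommended movie and a liked movie."""
    return max((jaccard(content_features(m), content_features(rec_movie)), m["title"])
               for m in profile_movies)

def i2id_score(profile_movies, rec_movie):
    """Best TF-IDF overview similarity between the recommended movie and a liked movie."""
    texts = [m["overview"] for m in profile_movies] + [rec_movie["overview"]]
    n = len(profile_movies)
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(X[n], X[:n]).ravel()
    best = int(sims.argmax())
    return sims[best], profile_movies[best]["title"]

liked = [{"title": "Heat", "cast": ["Al Pacino", "Robert De Niro"], "genres": ["Crime"],
          "director": "Michael Mann", "overview": "A detective pursues a crew of bank robbers."}]
rec = {"title": "Ronin", "cast": ["Robert De Niro", "Jean Reno"], "genres": ["Crime"],
       "director": "John Frankenheimer", "overview": "Mercenaries chase a briefcase across France."}
print(i2icb_score(liked, rec), i2id_score(liked, rec))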
Population
We recruited to the study mostly engineering students from different academic institutes. The subjects who completed the study entered a raffle for a cash prize. Some subjects were given, in addition, a credit in an academic course. We recruited the subjects by sending an email to several mailing lists, asking people to participate in a study over recommender systems for movies.

Overall, we recruited 207 participants, 131 males and 73 females (3 preferred not to specify gender). 24% of the participants were graduate students, 53% were undergrad students, and 23% had high school education only. 55% of the participants were 25 years old or younger, 35% were between 25-30, and 10% were above 30 years old.

Some of the participants have a background in recommender systems or related fields. 103 have taken a course in machine learning, 67 have taken a course in deep learning, 56 participants have taken a course in information retrieval, and 52 have taken a course in recommendation systems. 40% of the participants have not taken any course in those fields.

16% reported watching a few movies each week, 46% reported watching a movie once a week, 32% once a month, and the rest (6%) almost never watch a movie. Netflix is the leading movie watching channel (75%). 46% reported watching movies at the theater, 40% watch downloaded movies, and 36% watch movies on broadcast channels. We did not detect any significant difference between the various populations in the participant behavior and answers below.

We asked the participants how they decide which movie to watch. 78% use recommendations from friends, 56% read movie reviews online or in the newspaper, 25% report using some automated system to recommend movies, and 20% watch whatever is currently on. 81% are familiar with personal movie recommendations in Netflix. When asked about the quality of the Netflix recommendations, 62% reported that they sometimes liked the recommendations, 18% almost always like Netflix recommendations, 13% mostly do not like the recommendations, and 7% reported never liking these recommendations. Netflix presents some shows or movies under the title "Because you watched X". 58% of the participants claimed that they are likely to explore recommended movies under this title, whereas 35% said that they may explore these recommendations, and 7% will not explore such a recommended movie.

Method
We now describe the process of the user study, explaining the different tasks that the test subjects performed. As we explained above, the subjects were asked to participate in a user study over movie recommendations. The invitation email, as well as the instructions at the beginning of the study, did not mention explanations. Specifically, the subjects were told that they are asked to evaluate the recommendations of a system.

Step 1: Creating a User Profile. After an instruction screen, we asked each subject to choose 5 movies which she likes (Figure 2). Using these movies, we created a CF user profile that is used as input to the black box recommendation algorithms — MF and VAE.

Figure 2: Choosing preferred movies. The drop-down lists on the left contain all movies in the study, and can be used to search for a movie name. Clicking on the movie poster allowed the participants to explore the movie details on the IMDB website.

Once the participant clicks on the "Let's go" button, we compute two lists of 3 recommendations. For each black box algorithm, we compute two recommended movies using the provided user profile. In addition we add to each of the two recommendation lists a randomly selected movie from the top 100 popular movies according to the IMDB popularity score. These popular movies allow us to evaluate the participant opinion over non-personalized recommendations.

Step 2: Rating Recommendations Without Explanations. During this step, the participant was presented with the above two sets of recommended movies, without any explanations, and was requested to evaluate them. The opinions of the participants over these sets serve as a baseline for the performance of the recommendation algorithms, without the influence of an explanation.

We present the recommended movies to the participant in two different screens, one containing 2 MF recommendations and one popular movie, and the other containing 2 VAE recommendations and one popular movie (Figure 3). The order of the systems, as well as the ordering within the 3 movies, is random.

Figure 3: Presenting two recommendations from either MF or VAE, and one popular, non-personal movie, ordered randomly. The subject must rate each recommendation.

Throughout the study, we avoid presenting to the subject recommended movies that were previously shown to her. If both algorithms agree on a recommended movie, we take the next movie on the recommendation list.

The subject rates each recommended movie on a 1-5 scale. Again, clicking on the movie poster allowed the subject to explore movie data from IMDB.

Step 3: Rating Recommendations with Explanations. We now use the black box recommenders to produce two additional recommended movies. We enrich the user profile by adding all recommended movies that the subject rated 4 or 5 in the first step, and avoid recommendations that were already presented in the first step.

In addition, we apply all the explanation generation algorithms above. We use only an explanation whose score is higher than the method-specific threshold required to be considered acceptable. From all acceptable explanations, we choose one explanation randomly. In cases where none of the algorithms returned a plausible explanation, we show a default explanation.

We used in this step 3 different methods for showing the explanations:

• Hidden: we place below the recommended movie a button saying "Why is this movie appropriate for me?". Clicking on the button opened a popup window containing the explanation.

• Teaser: we place below the recommended movie the beginning of the explanation, followed by an ellipsis. Clicking on the ellipsis opened a popup window containing the explanation.

• Visible: we place below the recommended movie the explanation.

Figure 4: Step 3: rating recommended movies with explanations — (a) Hidden explanation, (b) Explanation teaser, (c) Explicit explanation. A possible simple explanation is presented for each recommendation. Each subject was shown one of the three alternative explanation displays.

This allows us to check whether the participants are interested in an explanation, and whether they actively seek an explanation. We use a between-subjects setting here, that is, each participant was allocated to one of the 3 groups, to avoid over emphasizing the explanations due to the variations in presentation. We again ask each participant to rate 2 sets of recommendations, each containing 3 recommended movies, as in the previous step.
Step 4: Rating Explanations. In the final step of the user study we explicitly ask the subject to rate possible explanations. We again add to the user profile the successful recommendations from the previous steps, and ask for additional recommendations from the two black box algorithms, MF and VAE.

In this step, unlike the previous steps, we present to the participant a single recommended movie. In addition, we present movie content information, such as the actors, the genres, and the description, without requiring the participants to explicitly request such details (Figure 5).

Figure 5: Step 4: rate the explanations generated by the simple models for the recommended movie of a complex model. All possible simple explanations are presented for each recommendation.

We first ask the participant to state whether she likes the recommended movie, and then present a set of explanations as to why the movie was recommended. We show all explanations that are deemed sufficiently appropriate, achieving a score higher than the method-specific threshold. The participant is asked to rate each explanation on a scale of 1-5.

Each participant is shown 6 different movies in this step as well, 3 of which were generated by each of the two black box recommenders, and ordered randomly.

To summarize, we present to each participant in steps 2-4 18 different recommendations, 7 recommendations from each of the complex models, MF and VAE, and 4 additional popular movies.

Post Study Questionnaire. After finishing step 4 above, the subject is transferred to an online questionnaire. We first ask a set of questions concerning demographic details. Then, we ask the participants questions about their movie watching habits, and their previous interactions with recommender systems. Finally, we ask questions regarding their opinions about the presented explanations.

Results
We now discuss the user study results. We first review the effects of explanations on the subject perception of a recommended movie, then we discuss subject opinion over the various explanation types. Finally, we discuss the fidelity of the various explanation methods.

Effects of Various Explanations on Movie Ratings
We now study the effect that the explanations had on the subject opinion over the recommended movies, comparing the average rating for movies without explanations and with explanations. As we explained above, in Step 3 there were 3 options for explanation presentation — hidden, requiring a click on a button; teaser, showing only the beginning of the explanation; and visible, fully presenting the explanation.

Somewhat surprisingly, the participant clicked on the button for only 24% of the recommendations in the first case, and clicked on the teaser for only 14% of the recommendations in the second case. That is, for most recommendations the participants did not look at the explanations. In informal discussions following the study, participants indicated that they did not see the option to request an explanation, or did not think that they needed an explanation to decide whether the recommended movie is appropriate.
Thus, we group together here both movies in Step 2 for which no explanation was shown, and movies in Step 3 shown to participants who did not click on an explanation button or teaser. We compare this group to recommendations for which the explanation was shown. Below, when we discuss significance, we base our claims on a paired t-test.

Table 1 compares the average rating for each one of the plausible explanations, and without explanations. First, although this is not the focus of the study, the VAE method produced better recommendations than the MF method, which produced better results than recommending a random popular movie.

Table 1: Average ratings for movie recommendations with different explanations, and without explanations.

             AE            MF            Popular       All
Explanation  Count  Avg    Count  Avg    Count  Avg    Count  Avg
None         126    4.29   126    3.65   126    3.13   378    3.69
POP          1      5      1      2      122    3.25   124    3.26
I2ICB        25     4.52   14     4.21   -      -      39     4.41
UICB         4      4.75   -      -      -      -      4      4.75
I2ID         16     4.33   5      4      -      -      21     4.25
I2ICF        43     4.19   28     3.82   -      -      71     4.04
U2UCF        19     4.11   7      3.86   -      -      26     4.04
DEF          14     3.43   67     3.04   -      -      81     3.11

Looking at the explanations, we can see that the user-item content-based explanation was shown only 4 times, and is hence not statistically different than other explanations. The popularity and the default explanations result in lower ratings than all other explanations. That is, movies presented with either CF or CB explanations produce significantly higher ratings than the non-personal popularity explanation and the non-informative default explanation.

While the differences between the ratings can be attributed to the presented explanation, there is another plausible reason for these differences. It might be that recommended items for which a specific recommendation type applies are better recommendations. For example, it may be that when a recommended item has a strong item-item Jaccard correlation with an item in the user profile, it is considered as a better recommendation for the user, whether we explicitly tell the user about it or not.

Table 2 shows the average rating over movies that we were able to explain through one of the methods although the explanation was not shown to the subject. This occurs either in Step 2, or in Step 3 where the subject did not click on the explanation button or teaser. As can be seen, similar to the ratings in Table 1, movies for which a user based explanation exists, as well as movies with similar descriptions, receive a statistically significant (t-test p-value=0.046) higher user rating than movies for which an item-item based explanation holds. These, in turn, receive a statistically significant higher rating than movies for which only the popularity explanation holds. Finally, movies for which none of our explanation types hold receive the lowest rating.

Table 2: Average rating for movies for which each recommendation type applies.

Explanation   Count  Avg
I2ICB         376    4.4
UICB          62     4.73
I2ID          237    4.63
I2ICF         508    4.27
U2UCF         109    4.62
Only POP      81     3.98
None applies  386    3.55

To conclude, on the one hand it is unclear whether the explanations that we suggest themselves truly affect the subject behavior. On the other hand, it appears that these explanations are well correlated with the way that participants perceive a recommended movie, and decide whether to rate it higher. As such, it may be that our explanations indeed capture a part of the subject decision process for her opinion over a recommended movie.
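For completeness, the kind of significance check referred to above can be sketched as follows; the numbers are made up and only illustrate the paired t-test usage, not the study's data or analysis script.

from scipy import stats

# Paired t-test over per-participant average ratings with and without explanations
# (toy, hypothetical values).
with_expl = [4.2, 3.8, 4.5, 3.9, 4.1]
without_expl = [3.7, 3.6, 4.0, 3.5, 3.9]
t, p = stats.ttest_rel(with_expl, without_expl)
print(t, p)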
User Ratings for Explanations
As we explained above, in Step 4 we asked the participants to rate various explanations for a given recommendation. Table 3 shows the explanation ratings provided by the participants.

Table 3: Average user ratings for different explanations (Step 4).

             AE            MF            All
Explanation  Count  Avg    Count  Avg    Count  Avg
POP          46     3.35   20     3.05   66     3.26
I2ICB        98     3.71   38     3.79   136    3.74
UICB         52     3.29   4      3.75   56     3.32
I2ID         74     3.61   14     4.36   88     3.73
I2ICF        114    3.75   41     4.05   155    3.83
U2UCF        82     3.45   23     4.09   105    3.59
DEF          140    3.68   79     3.62   219    3.66

Somewhat surprisingly, all explanations, including the default explanation, received a relatively positive (above 3 on a 1-5 scale) rating. The only explanations that the participants significantly liked less are the popularity explanation and the user-item content-based explanation.

The latter is especially surprising, given that movies for which this explanation was shown, or for which this explanation holds, receive the highest user ratings in the results reported in Table 1 and Table 2. We believe that the relatively lower subject opinion for this type of explanation may be attributed not to its content, but rather to its length. As we discuss below, in the post-study questionnaire, participants reported that they prefer short explanations. This explanation is by far the longest. Note that while the item-item content-based explanation may also appear to be long, in practice it is not; for the content-based explanations we report all properties (actors, genres, director) that apply. A recommended movie typically has more joint property values with the user profile, containing all the movies that the user has liked (i.e., UICB), than with a single movie that the user has liked, which entails longer explanations for UICB.

Explanation Fidelity
Finally, we evaluate the explanation fidelity — the portion of recommended items for which each explanation type holds. Table 4 shows the empirical fidelity of the various explanations with respect to all recommended items in our study in Steps 2-4. We note that the fidelity is highly sensitive to the thresholds that we set to decide which explanation is sufficiently valid to be presented. We leave an automated careful tuning of these thresholds to future research.

Table 4: Model Fidelity.

                      AE               MF               All
Explanation           Count  Fidelity  Count  Fidelity  Count  Fidelity
POP                   155    0.35      104    0.23      259    0.29
I2ICB                 263    0.59      128    0.29      391    0.44
UICB                  113    0.25      7      0.02      120    0.13
I2ID                  194    0.43      36     0.08      230    0.26
I2ICF                 312    0.7       153    0.34      465    0.52
U2UCF                 181    0.4       65     0.15      246    0.28
At least one
explanation by
methods 2-6           365    0.82      233    0.52      579    0.65

As can be seen, collaborative filtering fidelity is always higher than its content-based counterpart, which is not surprising, because the black box recommenders are collaborative filtering methods. Item-item explanations have higher fidelity than user-based explanations. This is not surprising, given the relatively small user profiles that we use.

It is especially interesting to look at the difference in content-based fidelity between the MF method that we use and VAE. Together with Table 2, this may explain the lower quality of recommendations computed by our MF implementation. The movies recommended by the MF method have very low content similarity to the movies that the subject has liked, and this may be the reason that participants rate them lower.

Overall, as can be seen in the bottom line of Table 4, 65% of the recommended movies could be explained by at least one of our suggested methods (except for the popularity explanation). [21] report a model fidelity of 84% at most for their created association rules. Our model fidelity is sensitive to the thresholds that we set to accept an explanation. We may be able to increase the model fidelity with more accurate and personal tuning of these thresholds.
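A small sketch of how such empirical fidelity can be computed from the scored recommendations (the data structures and names are illustrative, not the study's code):

# For each explainer, the portion of recommended (user, item) pairs whose
# explanation score passes that explainer's threshold.
def fidelity(recommendations, explainers):
    out = {}
    for name, ex in explainers.items():
        hits = sum(1 for user, item in recommendations
                   if ex["score"](user, item) >= ex["threshold"])
        out[name] = hits / len(recommendations) if recommendations else 0.0
    return out

# e.g. fidelity(all_shown_recs, {"POP": pop_explainer, "I2ICF": i2icf_explainer})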
Post Study Questionnaire Results
We now discuss the participant answers to the questions concerning the explanations at the post study questionnaire. The responses below are hence biased given the explanations shown throughout the study, and may not reflect the subject opinion prior to the study.

70% of the participants reported noticing the explanations in our study, 24% noticed them only sometimes, and 6% reported not noticing the explanations at all. 60% of the participants felt that the explanations were mostly appropriate, 26% felt they were sometimes appropriate, only 1% felt that the explanations were always appropriate, and 3% felt that they were never appropriate. 71% of the participants thought that explanations can help understand the recommendation, and may influence the decision on considering the recommended item. 23% of the participants said that an explanation is interesting, but would not change their opinion over the movie. 5% responded that an explanation is not important at all, and 1% said they ignore all recommendations and hence the explanations are not relevant. Similar results were reported before for the importance of explanations in recommendation systems [12, 27].

We also asked, in an open, non-obligatory question, to state the explanation that they liked the best. 93 of the participants chose to answer. We categorized their free text answers into groups. 52% of the responses were related to content-based explanations. 33% preferred the collaborative filtering explanations. 10% liked the popularity explanations, and 4% liked the default explanation. [4] reports a similar preference for content-based explanations over CF explanations.

Figure 6: Explanation properties importance.

Figure 6 shows the participants' responses about the importance of various properties of an explanation. We can see that the property that was deemed most important is that an explanation should be easy to understand. Participants also thought that an explanation should be accurate, convincing, and short. We believe that this explains the relatively low opinion of the participants concerning the content-based user-item explanation which we reported above, as this explanation is quite long.

The only property that was not deemed as important by the participants is whether the explanation fully explains the recommendation mechanism. This is somewhat in conflict with many research attempts in the recommender system community [25, 27] that focus on providing an explanation of the way that the models operate. It appears that users, at least the participants of our study, prefer an explanation that will help them decide whether the recommended item is appropriate for them, rather than one that explains the mechanism behind the recommendation engine.

When asked if they would like to get such explanations in a system that they use (e.g. Netflix), 62% answered positively, 31% answered maybe, and the rest (7%) answered no.

These findings — that 94% of the participants found many of our explanations to be appropriate, and that most people would have liked to see such explanations in a system that they use — together with the relatively low importance of revealing the recommendation engine behavior, further support our intuition that post-hoc explanations generated by simple methods can provide valuable information that users appreciate.
CONCLUSION
In this paper we suggest a simple method for generating post-hoc explanations for recommendations generated by complex, difficult to explain, models. We use a set of easy-to-explain recommendation algorithms, and when their output agrees with the recommendation of the complex model, consider the explanation of the simple model as a valid explanation for the recommended item. While these explanations are clearly not transparent, we argue that they provide valuable information for the users in making decisions concerning the recommended items.

We study two research questions. First, whether users prefer our simple post-hoc explanations to explanations of the mechanism of the neural network or the matrix factorization model. Indeed, in our post study questionnaire, users stated that it is more important for an explanation to be short and clear than to fully explain the algorithm.

Second, we checked whether presenting a post-hoc explanation influences the behavior of users. For some of our explanations, namely, the I2ICB explanation and the I2ID explanation, the average rating was higher when an explanation was presented than the average rating when no explanation was presented. For other explanations, this did not hold. We speculate that this was due to the explanation length and complexity. Perhaps a future, simpler phrasing of the explanations would lead to more pronounced effects.

To support our claims, we use a user study in the movie domain, showing that some explanations may affect the user opinion over the recommended item. We also show that movies that can be explained by our method may be better items to recommend. We evaluate subject opinion over the different explanations that we suggest, showing that participants preferred item-item explanations to user-based explanations. The subjects also stated that it is more important for an explanation to be easy to understand, convincing, and short, than to uncover the underlying operation of the recommendation engine.

Our method can be easily extended by using additional explainable recommenders. In the future we will apply more methods. We will also study methods for automatically selecting a method-specific threshold for deciding if an explanation is valid, instead of the manually tuned threshold that we currently use.
REFERENCES
[1] Rakesh Agrawal, Ramakrishnan Srikant, and others. 1994. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, Vol. 1215. 487–499.
[2] Fernando Amat, Ashok Chandrashekar, Tony Jebara, and Justin Basilico. 2018. Artwork personalization at Netflix. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 487–488.
[3] Oren Barkan and Noam Koenigstein. 2016. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
[4] Mustafa Bilgic and Raymond J Mooney. 2005. Explaining recommendations: Satisfaction vs promotion. In Beyond Personalization Workshop, IUI, Vol. 5. 153.
[5] John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 43–52.
[6] Jingwu Chen, Fuzhen Zhuang, Xin Hong, Xiang Ao, Xing Xie, and Qing He. 2018. Attention-driven factor model for explainable personalized recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 909–912.
[7] Weiyu Cheng, Yanyan Shen, Linpeng Huang, and Yanmin Zhu. 2019. Incorporating Interpretability into Latent Factor Models via Fast Influence Analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 885–893.
[8] Hendrik Drachsler, Katrien Verbert, Olga C Santos, and Nikos Manouselis. 2015. Panorama of recommender systems to support learning. In Recommender Systems Handbook. Springer, 421–451.
[9] Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O Riedl. 2019. Automated rationale generation: a technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 263–274.
[10] Michael D Ekstrand, John T Riedl, Joseph A Konstan, and others. 2011. Collaborative filtering recommender systems. Foundations and Trends in Human–Computer Interaction 4, 2 (2011), 81–173.
[11] Bruce Ferwerda, Kevin Swelsen, and Emily Yang. 2018. Explaining Content-Based Recommendations. New York (2018), 1–24.
[12] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. 241–250.
[13] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[14] Yehuda Koren. 2010. Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD) 4, 1 (2010), 1.
[15] Yehuda Koren and Robert Bell. 2015. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 77–118.
[16] Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
[17] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook. Springer, 73–105.
[18] Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, and Guangquan Zhang. 2015. Recommender system application developments: a survey. Decision Support Systems 74 (2015), 12–32.
[19] James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra. 2018. Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, New York, NY, USA, 31–39. DOI: http://dx.doi.org/10.1145/3240323.3240354
[20] Xia Ning, Christian Desrosiers, and George Karypis. 2015. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer, 37–76.
[21] Georgina Peake and Jun Wang. 2018. Explanation mining: Post hoc interpretability of latent factor models for recommendation systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2060–2069.
[22] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender Systems: Introduction and Challenges. In Recommender Systems Handbook. 1–34.
[23] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. 285–295.
[24] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. In Recommender Systems Handbook. Springer, 257–297.
[25] Rashmi Sinha and Kirsten Swearingen. 2002. The role of transparency in recommender systems. In CHI '02 Extended Abstracts on Human Factors in Computing Systems. 830–831.
[26] Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.
[27] Nava Tintarev and Judith Masthoff. 2007. A survey of explanations in recommender systems. In 2007 IEEE 23rd International Conference on Data Engineering Workshop. IEEE, 801–810.
[28] Nava Tintarev and Judith Masthoff. 2015. Explaining Recommendations: Design and Evaluation. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer, 353–382. DOI: http://dx.doi.org/10.1007/978-1-4899-7637-6_10
[29] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1235–1244.
[30] Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. 2018. A Reinforcement Learning Framework for Explainable Recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). 587–596.
[31] Yongfeng Zhang and Xu Chen. 2018. Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192 (2018).
[32] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 83–92.