=Paper= {{Paper |id=Vol-2911/paper3 |storemode=property |title=Learning Causal Explanations for Recommendation |pdfUrl=https://ceur-ws.org/Vol-2911/paper3.pdf |volume=Vol-2911 |authors=Shuyuan Xu,Yunqi Li,Shuchang Liu,Zuohui Fu,Yingqiang Ge,Xu Chen,Yongfeng Zhang }} ==Learning Causal Explanations for Recommendation== https://ceur-ws.org/Vol-2911/paper3.pdf
Learning Causal Explanations for Recommendation
Shuyuan Xu1 , Yunqi Li1 , Shuchang Liu1 , Zuohui Fu1 , Yingqiang Ge1 , Xu Chen2 and
Yongfeng Zhang1
1
    Department of Computer Science, Rutgers University, New Brunswick, NJ 08901, US
2
    Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, 100872, China


                                             Abstract
                                             State-of-the-art recommender systems have the ability to generate high-quality recommendations, but usually cannot pro-
                                             vide explanations to humans due to the usage of black-box prediction models. The lack of transparency has highlighted the
                                             critical importance of improving the explainability of recommender systems. In this paper, we propose to construct causal
                                             explainable recommendation which aims to provide post-hoc explanations for the recommendations by answering “what
                                             if” questions, e.g., “what would the recommendation result change if the user’s behavior history had been different?” Our
                                             approach first obtains counterfactual user histories and counterfactual recommendation items with the aid of a perturbation
                                             model, and then extracts personalized causal relationships for the recommendation model through a causal rule mining
                                             algorithm. Different from some existing explainable recommendation models that aim to provide persuasive explanations,
                                             our model aims to find out the true explanations for the recommendation of an item. Therefore, in addition to evaluating
                                             the fidelity of discovered causal explanations, we adopt the average causal effect to measure the quality of explanations.
                                             Here by quality we mean whether they are true explanations rather than their persuasiveness. We conduct experiments for
                                             several state-of-the-art sequential recommendation models on real-world datasets to verify the performance of our model
                                             on generating causal explanations.

                                             Keywords
                                             Sequential Recommendation, Explainable Recommendation, Post-hoc Explanation, Causal Analysis



1. Introduction                                                                                                                              Real History                   Recommend


As widely used in decision-making, recommender sys-                                                                              User                             Recommender

tems have been recognized for its ability to provide high-                                                                               Counterfactual History
                                                                                                                                                                            Recommend
                                                                                                                                                                                        Result has
                                                                                                                                                                                        been changed,
quality services that reduce the gap between products and                                                                What if the
                                                                                                                        user's history
                                                                                                                                                                                        item 3 could
                                                                                                                                                                                          the reason.
customers. And many state-of-the-art models achieves                                                                      has been
                                                                                                                          different?
outstanding expressiveness by using high-dimensional
user/item representations and deep learning models with
                                                                                                                                                                  Recommender


thousands or even millions of parameters [1, 2]. How-
ever, this excessive complexity easily go beyond the com-
prehension of a human who may demand for intuitive                                                                    Figure 1: An example of causal explanation. Comparing the
explanations for why the model made a specific decision.                                                              recommendation of real history and counterfactual histories,
                                                                                                                      if replacing one certain item will result in the change of rec-
Moreover, providing supportive information and inter-
                                                                                                                      ommendation, the certain item could be the true reason that
pretation along with the recommendation can be helpful
                                                                                                                      the system recommends the original item.
for both the customers and the platform, since it improves
the transparency, persuasiveness, trustworthiness, effec-
tiveness, and user satisfaction of the recommendation
systems, while facilitating system designers to refine the                                                            dation.
algorithms [3]. Thus, people are looking for solutions                                                                   One typical method to solve explainable recommenda-
that can generate explanations along with the recommen-                                                               tion is to construct a model-intrinsic explanation mod-
                                                                                                                      ule that also serves as an intermediate recommendation
                                                                                                                      stage[4, 5]. However, this approach has to redesign the
The 1st International Workshop on Causality in Search and
Recommendation (CSR’21), July 15, 2021, Virtual Event, Canada                                                         original recommendation model and thus may sacrifice
" shuyuan.xu@rutgers.edu (S. Xu); yunqi.li@rutgers.edu (Y. Li);                                                       model accuracy in order to obtain good explanations
shuchang.syt.liu@rutgers.edu (S. Liu); zuohui.fu@rutgers.edu                                                          [6]. Moreover, for complex deep models, it is even more
(Z. Fu); yingqiang.ge@rutgers.edu (Y. Ge); xu.chen@ruc.edu.cn                                                         challenging to integrate an explainable method into the
(X. Chen); yongfeng.zhang@rutgers.edu (Y. Zhang)
                                                                                                                      original design while maintaining recommendation per-
~ https://zuohuif.github.io/ (Z. Fu); https://yingqiangge.github.io/
(Y. Ge); http://xu-chen.com/ (X. Chen); http://yongfeng.me                                                            formance [3]. In contrast, post-hoc models (a.k.a model-
(Y. Zhang)                                                                                                            agnostic explanation) consider the underlying recommen-
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative
                                       Commons License Attribution 4.0 International (CC BY 4.0).                     dation model as a black-box, and provide explanations
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
after the recommendation decision has been made. Al-                 able framework for a wide range of sequential
though such explanations may not strictly follow the                 recommendations.
exact mechanism that generated the corresponding rec-              • We show that this framework can generate per-
ommendations, they offer the flexibility to be applied to a          sonalized post-hoc explanations based on item-
wide range of recommendation models. Furthermore, the                level causal rules.
explanation model and recommendation model work sep-               • We conduct several experiments on real-world
arately, we obtain the benefit of explainability without             data to demonstrate that our explanation model
hurting the prediction performance.                                  outperforms state-of-the-art baselines in terms of
   While it is still not fully understood what informa-              fidelity.
tion is useful to generate the explanations for a certain          • We apply average causal effect to illustrate that
recommendation result, Peake [7] argued that one can                 the causal explanations provided by our frame-
provide post-hoc item-level explanations. Specifically, in-          work are essential component for most sequential
teracted items (the causes) in a user’s history can be used          recommendation model.
as explanations for the future item recommendations (the
effect).The authors propose to solve this by association         For the remainder of this paper, we first review related
rule mining which finds co-occurred items as explanation.     work in Section 2, and then introduce our model in Sec-
However, explanations generated by association rules are      tion 3. Experimental settings and results are provided in
not personalized, i.e., different users would receive the     Section 4. Finally, we conclude this work in Section 5.
same explanation as long as the rules are only applied
to their overlapped histories. This makes it incompatible
with modern recommender systems, which aim to pro-            2. Related Work
vide personalized services to users. Moreover, we believe
that the true explanation of a recommendation model           2.1. Sequential Recommendation
should be able to answer the questions like “which item       Sequential recommendation takes into account the his-
contribute to the system’s decision?” as well as “ Will       torical order of items interacted by a user and aims to
the system change the decision if a different set of items    capture useful sequential patterns to make consecutive
were purchased by the same user? ” In other words,            predictions of the user’s future behaviors. Rendle et al.
the explanation should be aware of the counterfactual         [10] proposed Factorized Personalized Markov Chains
world of the unobserved user histories and their corre-       (FPMC) to combine Markov chain and matrix factoriza-
sponding recommendation when analyzing the cause of           tion for next basket recommendation. The Hierarchical
a recommendation in real world.                               Representation Model (HRM) [11] further extended this
   In this paper, we explore a counterfactual analysis        idea by leveraging representation learning as latent fac-
framework to provide post-hoc causal explanations for         tors in a hierarchical model. However, these methods can
any given black-box sequential recommendation algo-           only model the local sequential patterns of very limited
rithm. Fig.1 shows an example to illustrate our intuition.    number of adjacent records. To model multi-step sequen-
Technically, we first create several counterfactual histo-    tial behaviors, He et al. [12] adopted Markov chain to
ries which are different but similar to the real history      provide recommendations with sparse sequences. Later
through a Variational Auto-Encoder (VAE) based per-           on, the rapid development of representation learning
turbation model, and obtain the recommendation for            and neural networks introduced many new techniques
the counterfactual data. Then we apply causal analysis        that further pushed the research of sequential recom-
on the combined data to extract causal rules between a        mendation to a new level. For example, Hidasi et. al.
user’s history and future behaviors as explanations. Un-      [13] used an RNN-based model to learn the user his-
like other explainable recommendation models [4, 8, 9]        tory representation, Yu et. al. [14] provided a dynamic
that focus on persuading users to keep engaged with           recurrent model, Li et. al. [15] proposed an attention-
the system, this type of explanation focuses on model         based GRU model, Chen et. al. [16] developed user- and
transparency and finds out the true reason or the most        item-level memory networks, and Huang et. al. [17] fur-
essential item that leads to a specific recommendation.       ther integrated knowledge graphs into memory networks.
Therefore, instead of taking user studies or online evalu-    However, most of the models exhibit complicated neural
ations to evaluate the persuasiveness or effectiveness of     network architectures, and it is usually difficult to inter-
explanations, we use the average causal effect to measure     pret their prediction results. To make up for this, we plan
whether the item used for explanation can explain how         to generate explanations for these black box sequential
the system works.                                             recommendation models.
   The key contributions of this paper are as follows:

     • We design and study a counterfactual explain-
2.2. Explainable Recommendation                             and reinforcement learning [43], etc. With respect to
                                                            recommendation tasks, large amount of work is about
Explainable recommendation focuses on developing mod-
                                                            how to achieve de-bias matrix factorization with causal
els that can generate not only high-quality recommen-
                                                            inference. The probabilistic approach ExpoMF proposed
dations but also intuitive explanations, which help to
                                                            in [44] directly incorporated user exposure to items into
improve the transparency of the recommendation sys-
                                                            collaborative filtering, where the exposure is modeled as
tems [3]. Generally, the explainable models can be either
                                                            a latent variable. Liang et. al. [45] followed to develop
model-intrinsic or model-agnostic. As for model-intrinsic
                                                            a causal inference approach to recommender systems
approaches, lots of popular explainable recommendation
                                                            which believed that the exposure and click data came
methods, such as factorization models [4, 18, 9, 19], deep
                                                            from different models, thus using the click data alone to
learning models [20, 16, 21, 22], knowledge graph models
                                                            infer the user preferences would be biased by the expo-
[23, 5, 17, 24, 25, 26], explanation ranking models [27],
                                                            sure data. They used causal inference to correct for this
logical reasoning models [1, 28, 29], dynamic explana-
                                                            bias for improving generalization of recommendation
tion models [30, 31], visual explanation models [8] and
                                                            systems to new data. Bonner et. al. [40] proposed a new
natural language generation models [32, 33, 34] have
                                                            domain adaptation algorithm which was learned from
been proposed. A more complete review of the related
                                                            logged data including outcomes from a biased recom-
models can be seen in [3]. However, they mix the recom-
                                                            mendation policy, and predicted recommendation results
mendation mechanism with interpretable components,
                                                            according to random exposure. Besides de-bias recom-
which often results in over-complicated systems to make
                                                            mendation, Ghazimatin et. al. [46] proposed PRINCE
successful explanations. Moreover, the increased model
                                                            model to explore counterfactual evidence for discovering
complexity may reduce the interpretability. A natural
                                                            causal explanations in a heterogeneous information net-
way to avoid this dilemma is to rely on model-agnostic
                                                            work. Differently, this paper focuses on learning causal
post-hoc approaches so that the recommendation system
                                                            rules to provide more intuitive explanation for the black-
is free from the noises of the down-stream explanation
                                                            box sequential recommendation models. Additionally,
generator. Examples include [35] that proposed a bandit
                                                            we consider [47] as a highly related work though it is
approach, [36] that proposed a reinforcement learning
                                                            originally proposed for natural language processing tasks.
framework to generate sentence explanations, and [7]
                                                            As we will discuss in the later sections, we utilize some
that developed an association rule mining approach. Ad-
                                                            of the key ideas of its model construction, and show why
ditionally, some work distinguish the model explanations
                                                            it works in sequential recommendation scenarios.
by their purpose [37]: while persuasive explanations aim
to improve user engagement, model explanation reflexes
how the system really works and may not necessarily be 3. Proposed Approach
persuasive. Our study fall into the later case and aims
to find causal explanations for a given sequential recom- In this section, we first define the explanation problem
mendation model.                                            and then introduce our model as a combination of two
                                                            parts: a VAE-based perturbation model that generates the
2.3. Causal Inference in Recommendation counterfactual samples for causal analysis, and a causal
                                                            rule mining model that can extract causal dependencies
Originated as statistical problems, causal inference [38, between the cause-effect items.
39] aims at understanding and explaining the causal ef-
fect of one variable on another. While the observational
                                                            3.1. Problem Setting
data is considered as the factual world, causal effect in-
ferences should be aware of the counterfactual world, We denote the set of users as 𝒰 = {𝑢1 , 𝑢2 , · · · , 𝑢|𝒰 | }
thus often being regarded as the questions of "what-if". and set of items as ℐ = {𝑖1 , 𝑖2 . · · · , 𝑖|ℐ| }. Each user 𝑢
The challenge is that it is often expensive or even im- is associated with a purchase history represented as a
possible to obtain counterfactual data. For example, it sequence of items ℋ𝑢 . The 𝑗-th interacted item in the
is immoral to re-do the experiment on a patient to find history is denoted as 𝐻𝑗𝑢 ∈ ℐ. Without specification, the
out what will happen if we have not given the medicine. calligraphic ℋ in the paper represents user history, and
Though the majority of causal inference study resides in a straight 𝐻 represents an item. A black-box sequential
the direction of statistics and philosophy, it has recently recommendation model ℱ : ℋ → ℐ is a function that
attracted the attention from AI community for its great takes a sequence of items (as will discuss later, it can be
power of explainablity and bias elimination ability. Ef- the counterfactual user history) as input and outputs the
forts have managed to bring causal inference to several recommended item. In practice, the underlying mecha-
machine learning areas, including recommendation [40], nism usually consists of two steps: a ranking function
learning to rank [41], natural language processing [42], first scores all candidate items based on the user history,
and then it selects the item with the highest score as the         3.2.1. Perturbation Model
final output. Note that it only uses user-item interac-
                                                              To capture the causal dependency between items in his-
tion without any content or context information, and the
                                                              tory and the recommended items, we want to know what
scores predicted by the ranking function may differ ac-
                                                              would take place if the user history had been different. To
cording to the tasks (e.g. {1, . . . , 5} for rating prediction,
                                                              avoid unknown influences caused by the length of input
while [0, 1] for Click Through Rate (CTR) prediction).
                                                              sequence (i.e., user history), we keep the input length
Our goal is to find an item-level post-hoc model that cap-
                                                              unchanged, and only replace items in the sequence to
tures the causal relation between the history items and
                                                              create counterfactual histories. Ideally, for each item
the recommended item for each user.
                                                              𝐻𝑗𝑢 in a user’s history ℋ𝑢 , it will be replaced by all pos-
Definition 1. (Causal Relation) For two variables 𝑋 and sible items in ℐ to fully explore the influence that 𝐻𝑗𝑢
𝑌 , if 𝑋 triggers 𝑌 , then we say that there is a causal re- makes in the history. However, the number of possible
lation 𝑋 ⇒ 𝑌 , where 𝑋 is the cause and 𝑌 is the effect. combinations will become impractical for the learning
                                                              system, since recommender systems usually deal with
                                                              hundreds of thousands or even tens of millions items. In
   When a given recommendation model ℱ maps a user fact, counterfactual examples that are closest to the orig-
history ℋ to a recommended item 𝑌 ∈ ℐ, all items inal input can be the most useful to a user as shown in
           𝑢                               𝑢

in ℋ𝑢 are considered as potential causes of 𝑌 𝑢 . Thus [48]. Therefore, we pursue a perturbation-based method
we can formulate the set of causal relation candidates as that generate counterfactual examples, which replaces
𝒮 𝑢 = {(𝐻, 𝑌 𝑢 )|𝐻 ∈ ℋ𝑢 }.                                    items in the original user history ℋ𝑢 .
Definition 2. (Causal Explanation for Sequential Rec-            There are various ways to obtain the counterfactual
ommendation Model) Given a causal relation candidate history, as long as they are similar to the real history.
set 𝒮 𝑢 for user 𝑢, if there exists a true causal relation The      simplest solution is randomly selecting an item in
       𝑢        𝑢
(𝐻, 𝑌 ) ∈ 𝒮 , then the causal explanation for recom-          ℋ  𝑢
                                                                    and replacing it with a randomly selected item from
mending 𝑌 𝑢 is described as “Because you purchased 𝐻, ℐ ∖ ℋ . However, user histories are far from random.
                                                                      𝑢

the model recommends you 𝑌 𝑢 ”, denoted as 𝐻 ⇒ 𝑌 𝑢 .          Thus, we assume that their exists a ground truth user
                                                              history distribution, and we adopt VAE to learn such a dis-
   Then the remaining problem is to determine whether tribution. As is shown in Figure 2, we design a VAE-based
a candidate pair is a true causal relation.                   perturbation method, which creates item sequences that
   We can mitigate the problem by allowing a likelihood are similar to but slightly different from a user’s genuine
estimation for a candidate pair being a causal relation.      history sequence, by sampling from a distribution in the
Definition 3. (Causal Dependency) For a given candi- latent embedding space centered around the user’s true
date pair of causal relation (𝐻, 𝑌 𝑢 ), the causal depen- history sequence.
dency 𝜃𝐻,𝑌 𝑢 of that pair is the likelihood of the pair being    In detail, the VAE component consists of a probabilis-
a true causal relation.                                       tic  encoder (𝜇, 𝜎) = ENC(𝒳 ) and a decoder 𝒳         ˜ =
                                                              DEC(𝑧). The encoder ENC(·) takes a sequence of item
   In other words, we would like to find a ranking func- embeddings 𝒳 into latent embedding space, and extracts
tion that predicts the likelihood for each candidate pair, the variational information for the sequence, i.e., mean
and the causal explanation is generated by selecting the and variance of the latent embeddings under independent
pair with top ranking score from these candidates. One Gaussian distribution. The decoder DEC(·) generates a
advantage of this formulation is that it allows the possibil- sequence of item embeddings 𝒳     ˜ given a latent embed-
ity of giving no causal relation between a user’s history ding 𝑧 sampled from the Gaussian distribution. Here,
and the recommended item, e.g., when algorithm rec- both 𝒳 and 𝒳             ˜ are ordered concatenations of pre-trained
ommends the most popular items regardless of the user item embeddings based on pair-wise matrix factorization
history.                                                      (BPR-MF) [49]. We follow the standard training regime
                                                              of VAE by maximizing the variational lower bound of the
3.2. Causal Model for Post-Hoc Explanationdata likelihood [50]. Specifically, the reconstruction error
                                                              involved in this lower bound is calculated by a softmax
In this section, we introduce our counterfactual explana- across all items for each position of the input sequence.
tion framework for recommendation. Inspired by [47], We observe that VAE can reconstruct the original data
we divide our framework into two models: a perturbation set accurately, while offering the power of perturbation.
model and a causal rule mining model. The overview of            After pretraining ENC(·) and DEC(·), the variational
the model framework is shown in Fig.2.                        nature of this model allows us to obtain counterfactual
                                                              history ℋ̃ for any real history ℋ. More specifically, it
                                                              first extracts the mean and variance of the encoded item
                                                 Per tur bation M odel                                                   Causality M ining M odel

                      Re al
                   hi s t o r y         ...
                                                                                                                  Co u nt e r -                              m+1
                                                                         Co u nt e r -                             f ac t u al
                                                                          f ac t u al                                pai r s
                                       Encoder                           hi s t o r y
                                                            Sample m
     user                                                     Times
                                                                                     Recommendation                                Causality Mining
    history




                                                                                                                                             ...


                                                                                                                                                       ...
                                                                                         Model




                                                                                              ...
                                                                                                                                  Causal Dependencies
                                       Decoder
                                                                         Co u nt e r -                                 Rank and Selection
                  Co u nt e r -                                           f ac t u al
                                                                                                                                       Per sonalized
                   f ac t u al
                  hi s t o r y
                                        ...                                Re s u l t
                                                                                                                                       Explanation




                                                                                              ...
Figure 2: Model framework. 𝑥 is the concatenation of the item embeddings of the user history. 𝑥
                                                                                              ˜ is the perturbed embedding.


                                                                                                      𝑢
sequences in the latent space, and then the perturbation                           and output item 𝑌ˆ 𝑖 . We consider that the occurrence of
model samples 𝑚 latent embeddings 𝑧 based on the above                             a single output can be modeled as a logistic regression
variational information. These sampled embeddings 𝑧                                on causal dependencies from all the input items in the
are then passed to the decoder DEC(·) to obtain the                                sequence:
perturbed versions 𝒳 ˜ . For now, each item embedding in                                                              𝑛
 ˜ may not represent an actual item since it is a sampled                                           𝑢   𝑢
                                                                                                                  (︁ ∑︁                                )︁
𝒳                                                                                            𝑃 (𝑌ˆ 𝑖 |ℋ̂𝑖 ) = 𝜎               𝜃𝐻    ^𝑢 · 𝛾
                                                                                                                               ^ 𝑢 ,𝑌
                                                                                                                                           𝑛−𝑗
                                                                                                                                                                   (1)
vector from the latent space, as a result, we find its nearest                                                        𝑗=1
                                                                                                                                  𝑖𝑗     𝑖

neighbor in the candidate item set ℐ ∖ ℋ through dot
product similarity as the actual item. In this way, 𝒳     ˜ is                     where 𝜎 is the sigmoid function defined as 𝜎(𝑥) =
transformed into the final counterfactual history ℋ̃. One                          (1 + exp(−𝑥))−1 in order to scale the score to [0, 1].
should keep in mind that the variance should be kept                               Additionally, in recommendation task, the order of a
small during sampling, so that the resulting sequences                             user’s previously interacted items may affect their causal
can be similar to the original sequence.                                           dependency with the user’s next interaction. A closer
   Finally, the generated counterfactual data ℋ̃ together                          behavior tends to have a stronger effect on user’s future
with the original ℋ will be injected into the black-box                            behaviors, and behaviors are discounted if they happened
recommendation model ℱ to obtain the recommenda-                                   earlier [13]. Therefore, we involve a weight decay param-
tion results 𝑌˜ and 𝑌 , correspondingly. For any user 𝑢,                           eter 𝛾 to represent the time effect. Here 𝛾 is a positive
after completing this process, we will have 𝑚 different                            value less than one.
                                             𝑢    𝑢
counterfactual input-output pairs: {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚                                   For an input-output pair in 𝒟𝑢 , the probability of its
                                                       𝑖=1 , as
well as the original pair (ℋ , 𝑌 ). Here the value of 𝑚
                              𝑢    𝑢                                               occurrence generated by Eq.(1) should be close to one.
is manually set, but it cannot exceed the number of all                            As a result, we learn the causal dependencies 𝜃 by maxi-
possible item combinations.                                                        mizing the probability over 𝒟𝑢 . When optimizing 𝜃, they
                                                                                   are always initialized as zero to allow for no causation
                                                                                   between two items. When learning this regression model,
3.2.2. Causal Rule Learning Model
                                                                                   we are able to gradually increase 𝜃 until they converge to
Denote 𝒟𝑢 as the combined records of counterfactual                                the point where the data likelihood of 𝒟𝑢 is maximized.
                              𝑢       𝑢
input-output pairs {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚        𝑖=1 and the original pair                  After gathering all the causal dependencies, we select
(ℋ , 𝑌 ) for user 𝑢. We aim to develop a causal model
    𝑢    𝑢                                                                         the items that have high 𝜃 scores to build causal explana-
that first extracts causal dependencies between input and                          tions. This involves a three-step procedure.
outputs items appeared in 𝒟𝑢 , and then selects the causal                               1. We select those causal dependencies 𝜃𝐻         ^ 𝑢 ,𝑌
                                                                                                                                                ^𝑢
rule based on these inferred causal dependencies.                                                                                        𝑢
                                                                                                                                             𝑖𝑗   𝑖
          𝑢
   Let ℋ̂𝑖 = [𝐻 ˆ 𝑢𝑖1 , 𝐻
                        ˆ 𝑢𝑖2 , · · · , 𝐻
                                        ˆ 𝑢𝑖𝑛 ] be the input sequence                       whose output is the original 𝑌 (i.e., 𝑌 𝑖 = 𝑌 𝑢 ).
                                                                                                                                 𝑢     ˆ
                                              ˆ 𝑢𝑖𝑗 is the j-th item in                     Note that these (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) pairs may come from
of the 𝑖-th record of 𝒟𝑢 , where 𝐻
   𝑢         𝑢                                                                              either the original sequence or counterfactual se-
ℋ̂𝑖 . Let 𝑌ˆ 𝑖 represent the corresponding output. Note
                                                                                            quences, because when a counterfactual sequence
that this includes the original real pair (ℋ𝑢 , 𝑌 𝑢 ). The
                                                                                            is fed into the black-box recommendation model,
model should be able to infer the causal dependency
                                                                    ˆ𝑢                      the output may happen to be the same as the
(refer to Definition 3) 𝜃𝐻     ^ 𝑢 ,𝑌 ^ 𝑢 between input item 𝐻 𝑖𝑗
                                  𝑖𝑗    𝑖                                                   original sequence 𝑌 𝑢 .
Algorithm 1 Causal Explanation Model                           Table 1
                                                               Summary of the Datasets
Input: users 𝒰, items ℐ, user history ℋ ,  𝑢

counterfactual number 𝑚, black-box model ℱ,                      Dataset # users # items # interactions # train # test sparsity
embedding model ℰ, causal mining model ℳ
                                                                Movielens 943           1682      100,000      95,285 14,715 6.3%
Output: causal explanations 𝐻 ⇒ 𝑌 𝑢 where 𝐻 ∈ ℋ𝑢                Amazon 573               478      13,062        9,624 3,438 4.7%
 1: Use embedding model ℰ to get item embeddings ℰ(ℐ)

 2: Use ℰ(ℐ) and true user history to train perturbation       can serve as an intuitive explanation for the black-box
    model 𝒫                                                    recommendation model.
 3: for each user 𝑢 do
 4:   for 𝑖 from 1 to 𝑚 do                                     4.1. Dataset Description
             𝑢                𝑢         𝑢
 5:      ℋ̃𝑖 ← 𝒫(ℋ𝑢 ); 𝑌˜ 𝑖 ← ℱ (ℋ̃𝑖 )
                                                               We evaluate our proposed causal explanation framework
 6:   end for
                                                               against baselines on two datasets. The first dataset is
 7:   Construct counterfactual input-output pairs
            𝑢     𝑢                                            MovieLens100k1 . This dataset consists of information
      {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚𝑖=1
            𝑢     𝑢               𝑢
                                     ˜𝑢 𝑚                      about users, movies and ratings. In this dataset, each user
 8:   {(ℋ̂𝑖 , 𝑌ˆ 𝑖 )}𝑚+1                          𝑢   𝑢
                      𝑖=1 ← {(ℋ̃𝑖 , 𝑌 𝑖 )}𝑖=1 ∪ (ℋ , 𝑌 )       has rated at least 20 movies, and each movie can belong to
                          (︁   𝑢
                                 ˆ 𝑢 𝑚+1
                                           )︁                  several genres. The second dataset is the office product
 9:   𝜃𝐻      ^ 𝑢 ← ℳ {(ℋ̂𝑖 , 𝑌 𝑖 )}𝑖=1
        ^ 𝑢 ,𝑌
          𝑖𝑗    𝑖                                              dataset from Amazon2 , which contains the user-item
10:   Rank 𝜃𝐻      ^ 𝑢 ,𝑌 𝑢  and select top-𝑘 pairs            interactions from May 1996 to July 2014. The original
                  𝑖𝑗
      {(𝐻𝑗 , 𝑌 𝑢 )}𝑘𝑗=1                                        dataset is 5-core. To achieve sequential recommendation
11:   if ∃𝐻min{𝑗} ∈ ℋ𝑢 then                                    with input length of 5, we select the users with at least
12:      Generate causal explanation 𝐻min{𝑗} ⇒ 𝑌 𝑢             15 purchases and the items with at least 10 interactions.
13:   else                                                        Since our framework is used to explain sequential rec-
14:      No explanation for the recommended item 𝑌 𝑢 ommendation models, we split the dataset chronologi-
15:   end if                                                   cally. Further, to learn the pre-trained item embeddings
16: end for                                                    based on BPR-MF [49] (section 3.2.1), we take the last
17: return all causal explanations 𝐻 ⇒ 𝑌
                                             𝑢                 6 interactions from each user to construct the testing
                                                               set, and use all previous interactions from each user as
                                                               the training set. To avoid data leakage, when testing the
    2. We sort the above selected causal dependencies black-box recommendation models and our VAE-based
       in descending order and take the top-𝑘 (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) perturbation model, we only use the last 6 interactions
       pairs.                                                  of each user (i.e., the testing set of the pre-training stage).
    3. If there exist one or more pairs in these top-𝑘 Following common practice, we adopt the leave-one-out
                               ˆ 𝑢𝑖𝑗 appears in the user’s protocol, i.e., among the 6 interactions in test set, we use
       pairs, which cause item 𝐻
       input sequence ℋ , then we pick such pair of the the last one for testing, and the previous five interactions
                         𝑢

       highest rank, and construct 𝐻  ˆ 𝑢𝑖𝑗 ⇒ 𝑌 𝑢 as the will serve as input to the recommendation models. A
       causal explanation for the user. Otherwise, i.e., brief summary of the data is shown in Table 1.
       no cause item appears in the user history, then
       we output no causal explanation for the user.           4.2. Experimental Settings
   Note that the extracted causal explanation is personal-     We adopt the following methods to train black-box se-
ized since the algorithm is applied on 𝒟𝑢 , which only con-    quential recommendation models and extract traditional
tains records centered around the user’s original record       association rules as comparative explanations. Mean-
(ℋ𝑢 , 𝑌 𝑢 ), while collaborative learning among users is       while, we further conduct different variants of the per-
indirectly modeled by the VAE-based perturbation model.        turbation model to analyze our model. We include both
The overall algorithm is provided in Alg.1. For each user,     shallow and deep models for the experiment.
there are two phases - perturbation phase (line 4-7) and         FPMC [10]: The Factorized Personalized Markov
causal rule mining phase (line 8-15).                          Chain model, which combines matrix factorization and
                                                               Markov chains to capture user’s personalized sequential
                                                               behavior patterns for prediction3 .
4. Experiments
                                                                    1
                                                                        https://grouplens.org/datasets/movielens/
In this section, we conduct experiments to show what                2
                                                                        https://nijianmo.github.io/amazon/
causal relationships our model can capture and how they             3
                                                                        https://github.com/khesui/FPMC
Table 2
Results of Model Fidelity. Our causal explanation framework is tested under the number of candidate causal explanations
𝑘 = 1. The association explanation framework is tested under support, confidence, and lift thresholds, respectively. The best
fidelity on each column is highlighted in bold.

           Dataset                    Movielens 100k                                      Amazon
           Models       FPMC       GRU4Rec        NARM         Caser     FPMC      GRU4Rec       NARM        Caser
           AR-sup       0.3160       0.1453       0.4581       0.1569    0.2932      0.1449      0.4066     0.2024
           AR-conf      0.2959       0.1410       0.4305       0.1559    0.2949      0.1449      0.4031     0.1885
            AR-lift     0.2959       0.1410       0.4305       0.1559    0.2949      0.1449      0.4031     0.1885
           CR-AE        0.5631       0.7413       0.7084       0.6151   0.6981       0.8255      0.8970     0.7260
           CR-VAE       0.9650       0.9852       0.9714       0.9703   0.9511       0.9721      0.9791     0.9599



   GRU4Rec [13]: A session-based recommendation                   decoder are Multi-Layer Perceptrons (MLP) with two
model, which uses recurrent neural networks – in partic-          hidden layers, and each layer consists of 1024 neurons.
ular, Gated Recurrent Units (GRU) – to capture sequential         The only difference between our model and the vari-
patterns for prediction4 .                                        ant model CR-AE is that the variant model applies fixed
   NARM [15]: A sequential recommendation model                   normal distribution as variance instead of learned person-
which utilizes GRU and attention mechanism to estimate            alized variance. The default number of counterfactual
the importance of each interactions5 .                            input-output pairs is 𝑚 = 500 on both datasets. The
   Caser [51]: The ConvolutionAl Sequence Embedding               default time decay factor is 𝛾 = 0.7. We will discuss the
Recommendation (Caser) model, which adopts convo-                 influence of counterfactual number 𝑚 and time decay
lutional filters over recent items to learn the sequential        factor 𝛾 in the experiments.
patterns for prediction6 .                                           In the following, we will apply our model and all base-
   AR-sup [7]: A post-hoc explanation model, which                lines on the black-box recommendation models to evalu-
extract association rules from interactions from all users        ate and compare the generated explanations. In particu-
and rank based on support value to generate item-level            lar, we evaluate our framework from three perspectives.
explanations.                                                     First, a explanation model should at least be able to offer
   AR-conf [7]: Extracting association rules and rank             explanations for most recommendations, we will show it
based on confidence value to get explanations.                    in the result (explanation fidelity). Second, if our model is
   AR-lift [7]: Rank based on lift value among extracted          capable of generating explanations for most recommen-
association rules to generate explanations.                       dations, we need to verify that the causal explanations
   CR-AE: A variant of our causal rule model which                learned by our framework represent the key component
applies fixed variance in hidden layer of AutoEncoder             of recommendation mechanism (explanation quality). Fi-
model as the perturbation model. Compared with our                nally, since counterfactual examples are involved in our
VAE-based perturbation model, this variant apply non-             framework, our framework should be able to generate
personalized variance.                                            closer counterfactual examples (counterfactual quality).
   For black-box recommendation models FPMC,                      Additionally, we shed light on how our model differs
GRU4Rec, NARM and Caser, we adopt their best                      from other models on statistical metrics.
parameter selection in their corresponding public imple-
mentation. For the association rule-based explanation             4.3. Model Fidelity
model, we follow the recommendations in [7] to set
the optimal parameters: support = 0.1, confidence                 A very basic purpose of designing a explanation model
= 0.1, lift = 0.1, length = 2 for MovieLens100k, and              is to generate explanations for most recommendations.
support = 0.01, confidence = 0.01, lift = 0.01, length            Therefore, an important evaluation measure for explana-
= 2 for Amazon dataset due to its smaller scale. We               tion models is model fidelity, i.e., what’s the percentage
accept top 100 rules based on corresponding values (i.e.          of the recommendation results can be explained by the
support/confidence/lift) as explanations                          model [3]. The results of model fidelity are shown in
   For our causal rule learning framework, we set the             Table 2. In this experiment, we only report the results of
item embedding size as 16, both the VAE encoder and               keeping the number of candidate causal explanations 𝑘
    4
                                                                  as 1 for our framework and variant. For the association
      https://github.com/hungthanhpham94/GRU4REC-pytorch
    5
      https://github.com/Wang-Shuo/Neural-Attentive-Session-
                                                                  rule explanation model (section 4.2), we apply the global
Based-Recommendation-PyTorch                                      association rules [7] ranking by support, confidence, and
    6
      https://github.com/graytowne/caser_pytorch                  lift, respectively.
   We can see that on both datasets, our causal explana-             Here 𝑑𝑜() represents an external intervention, which
tion framework is able to generate explanations for most          forces a variable to take a specific value. Specifically, in
of the recommended items (including the variant), while           our case, for an extracted causal rule 𝐻 ⇒ 𝑌 𝑢 , we define
                                                                                                                𝑢
the association explanation approach can only provide ex-         the binary random variable as 1 if 𝐻 ∈ ℋ̃𝑖 , 0 else. We
planations for significantly fewer recommendations. The           also define another variable 𝑦 as a binary random variable,
                                                                                   𝑢
underlying reason is that association explanations have           which is 1 if 𝑌˜ 𝑖 = 𝑌𝑢 , otherwise it will be 0. We then
to be extracted based on the original input-output pairs,         report average ACE on all generated explanations. Note
which limits the number of pairs that we can use for rule         that since the ACE value is used for causal related models,
extraction. However, based on the perturbation model,             we cannot report it on the association rule baseline.
our causal explanation framework is capable of creating              Suppose the perturbation model (section 3.2.1) cre-
many counterfactual examples to assist causal rule learn-         ates 𝑚 counterfactual input-output pairs for each user 𝑢:
                                                                       𝑢     𝑢                𝑢
ing, which makes it possible to go beyond the limited                            𝑖=1 . Here ℋ̃ is created by our perturbation
                                                                  {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚
original data to extract causal explanations. Moreover,           model (i.e. not observed in the original data), and thus
                                                                                        𝑢
when the number of input and output items are limited             observing 𝐻 ∈ ℋ̃ implies we have 𝑑𝑜(𝑥 = 1) in ad-
(e.g. five history items as input and the model recom-            vance. Let 𝐻 ⇒ 𝑌 𝑢 be the causal explanation extracted
mends only one item in our case), it is harder to match           by the casual rule learning model (section 3.2.2). Then
global rules with personal interactions and recommen-             we estimate the ACE based on these 𝑚 counterfactual
dation, which limits the flexibility of global association        pairs as,
rules.
                                                                       E[𝑦|𝑑𝑜(𝑥 = 1)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 1))
   Another interesting observation is that GRU4Rec and                                                   𝑢
Caser have significantly (𝑝 < 0.01) lower fidelity than                                    #Pairs(𝐻 ∈ ℋ̃ ∧ 𝑌 = 𝑌 𝑢 )
                                                                                       =                       𝑢
FPMC and NARM when explained by the association                                                #Pairs(𝐻 ∈ ℋ̃ )
                                                                                                                           (2)
model. This is reasonable because FPMC is a Markov-                    E[𝑦|𝑑𝑜(𝑥 = 0)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 0))
based model that consider input as a basket and directly                                                 𝑢
                                                                                           #Pairs(𝐻 ∈
                                                                                                    / ℋ̃ ∧ 𝑌 = 𝑌 𝑢 )
learns the correlation between candidate items and each                                =                       𝑢
items in a sequence, as a result, it is easier to extract asso-                                #Pairs(𝐻 ∈
                                                                                                        / ℋ̃ )
ciation rules between inputs and outputs for the model. We report the ACE value of our model and variants in
NARM combines the whole session information and in-     Table.3. While showing the ACE value, we still keep the
fluence of each individual item in the session, therefore,
                                                        number of candidate causal explanations 𝑘 as 1.
association rules which involve individual information     We can see that our model can achieve higher ACE
will be easier to be extracted for this model. However, value than the variant for most recommendation models
it also means that the fidelity performance of the asso-on both dataset. But here we can observe an interesting
ciation approach highly depends on the recommenda-      results that the ACE value for FPMC model is much lower
tion model being explained. Meanwhile, we see that our  than other recommendation models (GRU4Rec, NARM,
causal approach achieves comparably good fidelity on allCaser). Meanwhile, the variant model has slightly larger
three recommendation models, because the perturbation   ACE than our model when applying on FPC model.
model is able to create sufficient counterfactual exam-    The difference between FPMC and other recommen-
ples to break the correlation of frequently co-occurringdation models is that FPMC is based on Markov chain
items in the input sequence. This indicates the robust- that only considers the last behavior while other mod-
ness of our causal explanation framework in terms of    els involve the whole session information. For FPMC
model fidelity.                                         model, although we take a session as input to recom-
                                                        mend next item, this model actually considers it as a
4.4. Average Causal Effect                              basket and linearly combines the influence of each item
                                                        from the basket. In this case, every part of the session
We then verify our causal explanations are true expla- will have independent influence towards next item pre-
nations that explanation are important component for diction. So changing a small part of input items may not
recommending original item. A common way is to mea- significantly change the next item prediction which high
sure the causal effect on the outcome of the model[52]. likely results in same recommendation item. Based on
First of all, we show the definition of Average Causal our experiment, when we keep counterfactual histories
Effect.                                                 same for all recommendation models, FPMC model only
Definition 4. (Average Causal Effect) The Average gets 98 counterfactual histories (19.6%) in average with
Causal Effect (ACE) of a binary random variable 𝑥 on different recommendation (different from the recommen-
another random variable 𝑦 is define as E[𝑦|𝑑𝑜(𝑥 = 1)] − dation item based on real history), while other models
E[𝑦|𝑑𝑜(𝑥 = 0)]                                          have at least 315 counterfactual histories (63%) in aver-
                                                        age with different recommendation item. This difference
Table 3                                                        Table 4
Results of Average Causal Effect. Our causal explanation       Results of Proximity. The value of proximity is calculated by
framework is tested under the number of candidate causal       Eq.(3)
explanations 𝑘 = 1.
                                                                    Dataset                 Movielens 100k
    Dataset                Movielens 100k
                                                                    Models      FPMC      GRU4Rec       NARM       Caser
    Models     FPMC      GRU4Rec         NARM     Caser
                                                                    CR-AE       -22.69      -22.37       -22.35    -22.40
    CR-AE       0.0184     0.1479        0.1108   0.1199            CR-VAE      -17.35      -16.88       -16.83    -16.93
    CR-VAE      0.0178     0.1862        0.1274   0.1388
                                                                    Dataset                      Amazon
    Dataset                    Amazon
                                                                    Models      FPMC      GRU4Rec       NARM       Caser
    Models     FPMC      GRU4Rec         NARM     Caser
                                                                    CR-AE       -21.83      -21.28       -21.20    -21.33
    CR-AE       0.0230     0.1150        0.1101   0.1347            CR-VAE      -18.01      -17.40       -17.31    -17.51
    CR-VAE      0.0212     0.1434        0.1511   0.1563



makes the FPMC model has much lower ACE value com-
pared with other recommendation models. Comparing
our model with CR-AE, the variant model will generate
less similar counterfactual histories which more likely
result in different recommendation item than our model.
Therefore, CR-AE has slightly higher ACE values than           (a) Model Fidelity on Movielens   (b) Model Fidelity on Amazon
CR-VAE.
                                                               Figure 3: Model fidelity on different time decay parameters
                                                               𝛾 . 𝑥-axis is the time decay parameter 𝛾 ∈ {0.1, 0.3, 0.7, 1}
4.5. Proximity                                                 and 𝑦 -axis is the model fidelity. The left side pictures are on
                                                               Movielens and the right side pictures are on Amazon.
As we mentioned before, counterfactual examples that
are closest to the original can be the most useful to users.
Similar with [48], we define the proximity as the distance
between negative counterfactual examples (i.e. generate        terfactual examples of our model have higher quality and
recommendation item different from original item) and          be more useful.
original real history. Intuitively, a counterfactual example
that close enough but get totally different results will be    4.6. Influence of Parameters
more helpful. For a given user, the proximity can be
expressed as                                                   In this section, we discuss the influence of two important
                                                               parameters. The first one is time decay parameter 𝛾 – in
                                                𝑢
                                                               our framework, when explaining the sequential recom-
                                 ∑︁
  𝑃 𝑟𝑜𝑥𝑖𝑚𝑖𝑡𝑦𝑢 = −𝑚𝑒𝑎𝑛(                  𝑑𝑖𝑠𝑡(ℋ̃𝑖 , ℋ𝑢 )) (3)
                              ˜ 𝑢 ̸=𝑌𝑢
                              𝑌
                                                               mendation models, earlier interactions in the sequence
                                𝑖
                                                               will have discounted effects to the recommended item.
Here the distance is defined in latent space. The repre-       A proper time decay parameter helps the framework to
sentation of any history sequence is the concatenation         reduce noise signals when learning patterns from the
of the latent representation of each item in the sequence.     sequence. The second parameter is the number of per-
The latent representations of items are learned from pre-      turbed input-output pairs 𝑚 – in our framework, we
trained BPRMF [49] model. The distance of any two              use perturbations to create counterfactual examples for
sequence is defined as Euclidean distance between the          causal learning, but there may exist trade-off between ef-
representation of two sequence. The reported proximity         ficiency and performance. We will analyze the influence
value would be the average over all users.                     of these two parameters.
   Given that association rule model does not involve             Time Decay Effect: Figure 3 shows the influence of 𝛾
counterfactual examples, this metric can only be reported      on different recommendation models and datasets. From
on our model and the variant model on both datasets, as        the result we can see that the time decay effect 𝛾 indeed
shown in Table.4 We can observe that our model can             affects the model performance on fidelity. In particular,
achieve higher proximity compared with the variant             when 𝛾 is small, the previous interactions in a sequence
model. In other words, counterfactual examples gen-            are more likely to be ignored, which thus reduces the
erated with learned latent variance is more similar with       performance on model fidelity. When 𝛾 is large (e.g.,
real history. Therefore, higher proximity implies coun-        𝛾 = 1), old interactions will have equal importance with
                                                                               Association Rule                      Our Explanation
                                                                                 Explanation




                                                                         unknown




(a) Model Fidelity on Movielens   (b) Model Fidelity on Amazon

Figure 4: Model fidelity on different number of counterfac-                                       Association Rule       Our Explanation
                                                                                                    Explanation
tual pairs 𝑚. 𝑥-axis is the number of counterfactual pairs 𝑚.
𝑦 -axis is model fidelity.                                       Figure 5: A case study on MovieLens by the Caser model.
                                                                 The first movie for 𝑢1 is unknown in the dataset.


latest interactions, which also hurts the performance. We
can see from the results that the best performance is            gives >95% confidence that the estimated probability er-
achieved at about 𝛾 = 0.7 on both datasets.                      ror is <0.1.
    Number of Counterfactual Examples: Figure 4
shows the influence for the number of counterfactual             4.7. Case Study
input-output pairs 𝑚. A basic observation from Figure 4
is that when 𝑚 increases, model fidelity will decrease first     In this section, we provide a simple case study to com-
and then increase. The underlying reason is as follows.          pare causal explanations and association explanations.
    When 𝑚 is small, the variance of the counterfactual          Compared with the association explanation model, our
input-output pairs will be small, and fewer counterfac-          model is capable of generating personalized explanations,
tual items will be involved. Then the model is more likely       which means that even if the recommendation model rec-
to select original item as explanation. For example, sup-        ommends the same item for two different users and the
pose the original input-output pair is 𝐴, 𝐵, 𝐶 → 𝑌 . In          users have overlapped histories, our model still has the
the extreme case where 𝑚 = 1, we will have only one              potential to generate different explanations for differ-
counterfactual pair, e.g., 𝐴, 𝐵
                              ˜ , 𝐶 → 𝑌˜ . According to the      ent users. However, the association model will provide
causal rule learning model (section 3.2.2), if 𝑌˜ ̸= 𝑌 , then    the same explanation since the association rules are ex-
𝐵 ⇒ 𝑌 will be the causal explanation since the change            tracted based on global records. An example by the Caser
of 𝐵 results in a different output, while if 𝑌˜ = 𝑌 , then       [51] recommendation model on MovieLens100k dataset is
either 𝐴 ⇒ 𝑌 or 𝐶 ⇒ 𝑌 will be the causal explanation             shown in Figure 5, where two users with one commonly
since their 𝜃 scores will be higher than 𝐵 or 𝐵  ˜ . In either   watched movie (The Sound of Music) get exactly same
case, the model fidelity and percentage of verified causal       recommendation (Pulp Fiction). The association model
rules will be 100%. However, in this case, the results do        provides the overlapped movie as an explanation for the
not present statistical meanings since they are estimated        two different users, while our model can generate per-
on a very small amount of examples.                              sonalized explanation for different users even when they
    When 𝑚 increases but not large enough, then random           got the same recommendation.
noise examples created by the perturbation model will re-
duce the model fidelity. Still consider the above example,       5. Conclusions
if many pairs with the same output 𝑌 are created, then
the model may find other items beyond 𝐴, 𝐵, 𝐶 as the             Recommender systems are widely used in our daily life.
cause, which will result in no explanation for the origi-        Effective recommendation mechanisms usually work
nal sequence. However, if we continue to increase 𝑚 to           through black-box models, resulting in the lack of trans-
sufficiently large numbers, such noise will be statistically     parency. In this paper, we extract causal rules from user
offset, and thus the model fidelity and percentages will         history to provide personalized, item-level, post-hoc ex-
increase again. In the most ideal case, we would create all      planations for the black-box sequential recommendation
of the |ℋ||ℐ| sequences for causal rule learning, where          models. The causal explanations are extracted through a
|ℋ| is the number of item slots in the input sequence, and       perturbation model and a causal rule learning model. We
|ℐ| is the total number of items in the dataset. However,        conduct several experiments on real-world datasets, and
|ℋ||ℐ| would be a huge number that makes it compu-               apply our explanation framework to several state-of-the-
tational infeasible for causal rule learning. In practice,       art sequential recommendation models. Experimental
we only need to specify 𝑚 sufficiently large. Based on           results verified the quality and fidelity of the causal ex-
Chebyshev’s Inequality, we find that 𝑚 = 500 already             planations extracted by our framework.
   In this work, we only considered item-level causal          [9] N. Wang, H. Wang, Y. Jia, Y. Yin, Explainable recom-
relationships, while in the future, it would be interesting        mendation via multi-task learning in opinionated
to explore causal relations on feature-level external data         text data, in: The 41st International ACM SIGIR
such as textual user reviews, which can help to generate           Conference on Research & Development in Infor-
finer-grained causal explanations.                                 mation Retrieval, ACM, 2018.
                                                              [10] S. Rendle, C. Freudenthaler, L. Schmidt-Thieme,
                                                                   Factorizing personalized markov chains for next-
Acknowledgments                                                    basket recommendation, in: Proceedings of the
                                                                   19th international conference on World wide web,
This work was partly supported by NSF IIS-1910154 and
                                                                   ACM, 2010, pp. 811–820.
IIS-2007907. Any opinions, findings, conclusions or rec-
                                                              [11] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, X. Cheng,
ommendations expressed in this material are those of
                                                                   Learning hierarchical representation model for
the authors and do not necessarily reflect those of the
                                                                   nextbasket recommendation, in: Proceedings of the
sponsors.
                                                                   38th International ACM SIGIR conference on Re-
                                                                   search and Development in Information Retrieval,
References                                                         ACM, 2015, pp. 403–412.
                                                              [12] R. He, J. McAuley, Fusing similarity models with
 [1] H. Chen, S. Shi, Y. Li, Y. Zhang, Neural collaborative        markov chains for sparse sequential recommenda-
     reasoning, in: Proceedings of the Web Conference              tion, in: 2016 IEEE 16th International Conference
     2021, 2021, pp. 1516–1527.                                    on Data Mining (ICDM), IEEE, 2016, pp. 191–200.
 [2] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning          [13] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk,
     based recommender system: A survey and new                    Session-based recommendations with recurrent
     perspectives, ACM Computing Surveys (CSUR) 52                 neural networks, in: International Conference on
     (2019) 1–38.                                                  Learning Representations, 2016.
 [3] Y. Zhang, X. Chen, Explainable recommendation:           [14] F. Yu, Q. Liu, S. Wu, L. Wang, T. Tan, A dynamic
     A survey and new perspectives, Foundations and                recurrent model for next basket recommendation,
     Trends® in Information Retrieval (2020).                      in: Proceedings of the 39th International ACM SI-
 [4] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, S. Ma,          GIR conference on Research and Development in
     Explicit factor models for explainable recommenda-            Information Retrieval, ACM, 2016, pp. 729–732.
     tion based on phrase-level sentiment analysis, in:       [15] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, J. Ma, Neu-
     Proceedings of the 37th international ACM SIGIR               ral attentive session-based recommendation, in:
     conference on Research & development in informa-              Proceedings of the 2017 ACM on Conference on
     tion retrieval, ACM, 2014, pp. 83–92.                         Information and Knowledge Management, ACM,
 [5] Y. Xian, Z. Fu, S. Muthukrishnan, G. De Melo,                 2017, pp. 1419–1428.
     Y. Zhang, Reinforcement knowledge graph reason-          [16] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin,
     ing for explainable recommendation, in: Proceed-              H. Zha, Sequential recommendation with user mem-
     ings of the 42nd International ACM SIGIR Confer-              ory networks, in: Proceedings of the eleventh ACM
     ence on Research and Development in Information               international conference on WSDM, 2018, pp. 108–
     Retrieval, 2019, pp. 285–294.                                 116.
 [6] A. Theodorou, R. H. Wortham, J. J. Bryson, De-           [17] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, E. Y.
     signing and implementing transparency for real                Chang, Improving sequential recommendation
     time inspection of autonomous robots, Connection              with knowledge-enhanced memory networks, in:
     Science 29 (2017) 230–241.                                    The 41st International ACM SIGIR Conference on
 [7] G. Peake, J. Wang, Explanation mining: Post hoc               Research & Development in Information Retrieval,
     interpretability of latent factor models for recom-           ACM, 2018, pp. 505–514.
     mendation systems, in: Proceedings of the 24th           [18] J. Chen, F. Zhuang, X. Hong, X. Ao, X. Xie, Q. He,
     ACM SIGKDD International Conference on Knowl-                 Attention-driven factor model for explainable per-
     edge Discovery & Data Mining, 2018.                           sonalized recommendation, in: The 41st Interna-
 [8] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin,            tional ACM SIGIR Conference on Research & De-
     H. Zha, Personalized fashion recommendation with              velopment in Information Retrieval, ACM, 2018, pp.
     visual explanations based on multimodal attention             909–912.
     network: Towards visually explainable recommen-          [19] X. Chen, Z. Qin, Y. Zhang, T. Xu, Learning to rank
     dation, in: Proceedings of the 42nd International             features for recommendation over multiple cate-
     ACM SIGIR Conference on Research and Develop-                 gories, in: Proceedings of the 39th International
     ment in Information Retrieval, 2019, pp. 765–774.             ACM SIGIR conference on Research and Develop-
     ment in Information Retrieval, 2016, pp. 305–314.               (TOIS) 38 (2019) 1–29.
[20] S. Seo, J. Huang, H. Yang, Y. Liu, Interpretable           [32] L. Li, Y. Zhang, L. Chen, Personalized transformer
     convolutional neural networks with dual local and               for explainable recommendation, ACL (2021).
     global attention for review rating prediction, in:         [33] L. Li, Y. Zhang, L. Chen, Generate neural template
     Proceedings of the Eleventh ACM Conference on                   explanations for recommendation, in: Proceed-
     RecSys, 2017, pp. 297–305.                                      ings of the 29th ACM International Conference on
[21] C. Li, C. Quan, L. Peng, Y. Qi, Y. Deng, L. Wu, A cap-          Information & Knowledge Management, 2020, pp.
     sule network for recommendation and explaining                  755–764.
     what you like and dislike, in: Proceedings of the          [34] H. Chen, X. Chen, S. Shi, Y. Zhang, Generate natural
     42nd International ACM SIGIR Conference on Re-                  language explanations for recommendation, SIGIR
     search and Development in Information Retrieval,                2019 Workshop on ExplainAble Recommendation
     ACM, 2019, pp. 275–284.                                         and Search (2019).
[22] F. Costa, S. Ouyang, P. Dolog, A. Lawlor, Automatic        [35] J. McInerney, B. Lacker, S. Hansen, K. Higley,
     generation of natural language explanations, in:                H. Bouchard, A. Gruson, R. Mehrotra, Explore,
     Proceedings of the 23rd International Conference                exploit, and explain: personalizing explainable rec-
     on Intelligent User Interfaces Companion, ACM,                  ommendations with bandits, in: Proceedings of the
     2018, p. 57.                                                    12th ACM Conference on Recommender Systems,
[23] Q. Ai, V. Azizi, X. Chen, Y. Zhang, Learning hetero-            ACM, 2018, pp. 31–39.
     geneous knowledge base embeddings for explain-             [36] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, X. Xie,
     able recommendation, Algorithms 11 (2018) 137.                  A reinforcement learning framework for explain-
[24] Z. Fu, Y. Xian, R. Gao, J. Zhao, Q. Huang, Y. Ge, S. Xu,        able recommendation, in: 2018 IEEE International
     S. Geng, C. Shah, Y. Zhang, et al., Fairness-aware ex-          Conference on Data Mining (ICDM), IEEE, 2018, pp.
     plainable recommendation over knowledge graphs,                 587–596.
     SIGIR (2020).                                              [37] N. Tintarev, Explanations of recommendations,
[25] W. Ma, M. Zhang, Y. Cao, W. Jin, C. Wang, Y. Liu,               in: Proceedings of the 2007 ACM conference on
     S. Ma, X. Ren, Jointly learning explainable rules for           Recommender systems, 2007, pp. 203–206.
     recommendation with knowledge graph, in: The               [38] J. Pearl, Causality: models, reasoning and inference,
     World Wide Web Conference, 2019, pp. 1210–1221.                 volume 29, Springer, 2000.
[26] Y. Xian, Z. Fu, H. Zhao, Y. Ge, X. Chen, Q. Huang,         [39] G. W. Imbens, D. B. Rubin, Causal inference in statis-
     S. Geng, Z. Qin, G. De Melo, S. Muthukrishnan,                  tics, social, and biomedical sciences, Cambridge
     et al., Cafe: Coarse-to-fine neural symbolic reason-            University Press, 2015.
     ing for explainable recommendation, in: Proceed-           [40] S. Bonner, F. Vasile, Causal embeddings for rec-
     ings of the 29th ACM International Conference on                ommendation, in: Proceedings of the 12th ACM
     Information & Knowledge Management, 2020, pp.                   Conference on Recommender Systems, ACM, 2018.
     1645–1654.                                                 [41] T. Joachims, A. Swaminathan, T. Schnabel, Un-
[27] L. Li, Y. Zhang, L. Chen, Extra: Explanation ranking            biased learning-to-rank with biased feedback, in:
     datasets for explainable recommendation, SIGIR                  Proceedings of the Tenth ACM International Con-
     (2021).                                                         ference on Web Search and Data Mining, ACM,
[28] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang,             2017, pp. 781–789.
     Neural logic reasoning, in: Proceedings of the 29th        [42] Z. Wood-Doughty, I. Shpitser, M. Dredze, Chal-
     ACM International Conference on Information &                   lenges of using text classifiers for causal inference,
     Knowledge Management, 2020, pp. 1365–1374.                      in: Proceedings of the 2018 Conference on Empiri-
[29] Y. Zhu, Y. Xian, Z. Fu, G. de Melo, Y. Zhang, Faith-            cal Methods in Natural Language Processing, 2018,
     fully explainable recommendation via neural logic               pp. 4586–4598.
     reasoning, in: Proceedings of the 2021 Conference          [43] L. Buesing, T. Weber, Y. Zwols, S. Racaniere,
     of the North American Chapter of the Association                A. Guez, J.-B. Lespiau, N. Heess, Woulda, coulda,
     for Computational Linguistics: Human Language                   shoulda: Counterfactually-guided policy search, in:
     Technologies, 2021, pp. 3083–3090.                              ICLR, 2019.
[30] X. Chen, Y. Zhang, Z. Qin, Dynamic explainable             [44] D. Liang, L. Charlin, J. McInerney, D. M. Blei, Mod-
     recommendation based on neural attentive mod-                   eling user exposure in recommendation, in: Pro-
     els, in: Proceedings of the AAAI Conference on                  ceedings of the 25th WWW, 2016.
     Artificial Intelligence, volume 33, 2019, pp. 53–60.       [45] D. Liang, L. Charlin, D. M. Blei, Causal inference
[31] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable                for recommendation, in: Causation: Foundation to
     product search with a dynamic relation embedding                Application, Workshop at UAI, 2016.
     model, ACM Transactions on Information Systems             [46] A. Ghazimatin, O. Balalau, R. Saha Roy, G. Weikum,
     Prince: provider-side interpretability with counter-
     factual explanations in recommender systems, in:
     Proceedings of the 13th International Conference
     on Web Search and Data Mining, 2020, pp. 196–204.
[47] D. Alvarez-Melis, T. S. Jaakkola, A causal frame-
     work for explaining the predictions of black-box
     sequence-to-sequence models, Proceedings of the
     2017 Conference on Empirical Methods in Natural
     Language Processing (2017).
[48] R. K. Mothilal, A. Sharma, C. Tan, Explaining ma-
     chine learning classifiers through diverse counter-
     factual explanations, in: Proceedings of the 2020
     Conference on Fairness, Accountability, and Trans-
     parency, 2020.
[49] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-
     Thieme, Bpr: Bayesian personalized ranking from
     implicit feedback, UAI (2012).
[50] D. P. Kingma, M. Welling, Auto-encoding varia-
     tional bayes, 2014.
[51] J. Tang, K. Wang, Personalized top-n sequential
     recommendation via convolutional sequence em-
     bedding, in: Proceedings of the Eleventh ACM
     International Conference on Web Search and Data
     Mining, ACM, 2018, pp. 565–573.
[52] R. Moraffah, M. Karami, R. Guo, A. Raglin,
     H. Liu, Causal interpretability for machine learning-
     problems, methods and evaluation, ACM SIGKDD
     Explorations Newsletter 22 (2020) 18–33.