=Paper= {{Paper |id=Vol-2911/paper3 |storemode=property |title=Learning Causal Explanations for Recommendation |pdfUrl=https://ceur-ws.org/Vol-2911/paper3.pdf |volume=Vol-2911 |authors=Shuyuan Xu,Yunqi Li,Shuchang Liu,Zuohui Fu,Yingqiang Ge,Xu Chen,Yongfeng Zhang }} ==Learning Causal Explanations for Recommendation== https://ceur-ws.org/Vol-2911/paper3.pdf

Learning Causal Explanations for Recommendation
Shuyuan Xu1 , Yunqi Li1 , Shuchang Liu1 , Zuohui Fu1 , Yingqiang Ge1 , Xu Chen2 and
Yongfeng Zhang1
1
Department of Computer Science, Rutgers University, New Brunswick, NJ 08901, US
2
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, 100872, China

Abstract
State-of-the-art recommender systems have the ability to generate high-quality recommendations, but usually cannot pro-
vide explanations to humans due to the usage of black-box prediction models. The lack of transparency has highlighted the
critical importance of improving the explainability of recommender systems. In this paper, we propose to construct causal
explainable recommendation which aims to provide post-hoc explanations for the recommendations by answering “what
if” questions, e.g., “what would the recommendation result change if the user’s behavior history had been different?” Our
approach first obtains counterfactual user histories and counterfactual recommendation items with the aid of a perturbation
model, and then extracts personalized causal relationships for the recommendation model through a causal rule mining
algorithm. Different from some existing explainable recommendation models that aim to provide persuasive explanations,
our model aims to find out the true explanations for the recommendation of an item. Therefore, in addition to evaluating
the fidelity of discovered causal explanations, we adopt the average causal effect to measure the quality of explanations.
Here by quality we mean whether they are true explanations rather than their persuasiveness. We conduct experiments for
several state-of-the-art sequential recommendation models on real-world datasets to verify the performance of our model
on generating causal explanations.

Keywords
Sequential Recommendation, Explainable Recommendation, Post-hoc Explanation, Causal Analysis

1. Introduction Real History Recommend

As widely used in decision-making, recommender sys- User Recommender

tems have been recognized for its ability to provide high- Counterfactual History
Recommend
Result has
been changed,
quality services that reduce the gap between products and What if the
user's history
item 3 could
the reason.
customers. And many state-of-the-art models achieves has been
different?
outstanding expressiveness by using high-dimensional
user/item representations and deep learning models with
Recommender

thousands or even millions of parameters [1, 2]. How-
ever, this excessive complexity easily go beyond the com-
prehension of a human who may demand for intuitive Figure 1: An example of causal explanation. Comparing the
explanations for why the model made a specific decision. recommendation of real history and counterfactual histories,
if replacing one certain item will result in the change of rec-
Moreover, providing supportive information and inter-
ommendation, the certain item could be the true reason that
pretation along with the recommendation can be helpful
the system recommends the original item.
for both the customers and the platform, since it improves
the transparency, persuasiveness, trustworthiness, effec-
tiveness, and user satisfaction of the recommendation
systems, while facilitating system designers to refine the dation.
algorithms [3]. Thus, people are looking for solutions One typical method to solve explainable recommenda-
that can generate explanations along with the recommen- tion is to construct a model-intrinsic explanation mod-
ule that also serves as an intermediate recommendation
stage[4, 5]. However, this approach has to redesign the
The 1st International Workshop on Causality in Search and
Recommendation (CSR’21), July 15, 2021, Virtual Event, Canada original recommendation model and thus may sacrifice
" shuyuan.xu@rutgers.edu (S. Xu); yunqi.li@rutgers.edu (Y. Li); model accuracy in order to obtain good explanations
shuchang.syt.liu@rutgers.edu (S. Liu); zuohui.fu@rutgers.edu [6]. Moreover, for complex deep models, it is even more
(Z. Fu); yingqiang.ge@rutgers.edu (Y. Ge); xu.chen@ruc.edu.cn challenging to integrate an explainable method into the
(X. Chen); yongfeng.zhang@rutgers.edu (Y. Zhang)
original design while maintaining recommendation per-
~ https://zuohuif.github.io/ (Z. Fu); https://yingqiangge.github.io/
(Y. Ge); http://xu-chen.com/ (X. Chen); http://yongfeng.me formance [3]. In contrast, post-hoc models (a.k.a model-
(Y. Zhang) agnostic explanation) consider the underlying recommen-
© 2021 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). dation model as a black-box, and provide explanations
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org)
after the recommendation decision has been made. Al- able framework for a wide range of sequential
though such explanations may not strictly follow the recommendations.
exact mechanism that generated the corresponding rec- • We show that this framework can generate per-
ommendations, they offer the flexibility to be applied to a sonalized post-hoc explanations based on item-
wide range of recommendation models. Furthermore, the level causal rules.
explanation model and recommendation model work sep- • We conduct several experiments on real-world
arately, we obtain the benefit of explainability without data to demonstrate that our explanation model
hurting the prediction performance. outperforms state-of-the-art baselines in terms of
While it is still not fully understood what informa- fidelity.
tion is useful to generate the explanations for a certain • We apply average causal effect to illustrate that
recommendation result, Peake [7] argued that one can the causal explanations provided by our frame-
provide post-hoc item-level explanations. Specifically, in- work are essential component for most sequential
teracted items (the causes) in a user’s history can be used recommendation model.
as explanations for the future item recommendations (the
effect).The authors propose to solve this by association For the remainder of this paper, we first review related
rule mining which finds co-occurred items as explanation. work in Section 2, and then introduce our model in Sec-
However, explanations generated by association rules are tion 3. Experimental settings and results are provided in
not personalized, i.e., different users would receive the Section 4. Finally, we conclude this work in Section 5.
same explanation as long as the rules are only applied
to their overlapped histories. This makes it incompatible
with modern recommender systems, which aim to pro- 2. Related Work
vide personalized services to users. Moreover, we believe
that the true explanation of a recommendation model 2.1. Sequential Recommendation
should be able to answer the questions like “which item Sequential recommendation takes into account the his-
contribute to the system’s decision?” as well as “ Will torical order of items interacted by a user and aims to
the system change the decision if a different set of items capture useful sequential patterns to make consecutive
were purchased by the same user? ” In other words, predictions of the user’s future behaviors. Rendle et al.
the explanation should be aware of the counterfactual [10] proposed Factorized Personalized Markov Chains
world of the unobserved user histories and their corre- (FPMC) to combine Markov chain and matrix factoriza-
sponding recommendation when analyzing the cause of tion for next basket recommendation. The Hierarchical
a recommendation in real world. Representation Model (HRM) [11] further extended this
In this paper, we explore a counterfactual analysis idea by leveraging representation learning as latent fac-
framework to provide post-hoc causal explanations for tors in a hierarchical model. However, these methods can
any given black-box sequential recommendation algo- only model the local sequential patterns of very limited
rithm. Fig.1 shows an example to illustrate our intuition. number of adjacent records. To model multi-step sequen-
Technically, we first create several counterfactual histo- tial behaviors, He et al. [12] adopted Markov chain to
ries which are different but similar to the real history provide recommendations with sparse sequences. Later
through a Variational Auto-Encoder (VAE) based per- on, the rapid development of representation learning
turbation model, and obtain the recommendation for and neural networks introduced many new techniques
the counterfactual data. Then we apply causal analysis that further pushed the research of sequential recom-
on the combined data to extract causal rules between a mendation to a new level. For example, Hidasi et. al.
user’s history and future behaviors as explanations. Un- [13] used an RNN-based model to learn the user his-
like other explainable recommendation models [4, 8, 9] tory representation, Yu et. al. [14] provided a dynamic
that focus on persuading users to keep engaged with recurrent model, Li et. al. [15] proposed an attention-
the system, this type of explanation focuses on model based GRU model, Chen et. al. [16] developed user- and
transparency and finds out the true reason or the most item-level memory networks, and Huang et. al. [17] fur-
essential item that leads to a specific recommendation. ther integrated knowledge graphs into memory networks.
Therefore, instead of taking user studies or online evalu- However, most of the models exhibit complicated neural
ations to evaluate the persuasiveness or effectiveness of network architectures, and it is usually difficult to inter-
explanations, we use the average causal effect to measure pret their prediction results. To make up for this, we plan
whether the item used for explanation can explain how to generate explanations for these black box sequential
the system works. recommendation models.
The key contributions of this paper are as follows:

• We design and study a counterfactual explain-
2.2. Explainable Recommendation and reinforcement learning [43], etc. With respect to
recommendation tasks, large amount of work is about
Explainable recommendation focuses on developing mod-
how to achieve de-bias matrix factorization with causal
els that can generate not only high-quality recommen-
inference. The probabilistic approach ExpoMF proposed
dations but also intuitive explanations, which help to
in [44] directly incorporated user exposure to items into
improve the transparency of the recommendation sys-
collaborative filtering, where the exposure is modeled as
tems [3]. Generally, the explainable models can be either
a latent variable. Liang et. al. [45] followed to develop
model-intrinsic or model-agnostic. As for model-intrinsic
a causal inference approach to recommender systems
approaches, lots of popular explainable recommendation
which believed that the exposure and click data came
methods, such as factorization models [4, 18, 9, 19], deep
from different models, thus using the click data alone to
learning models [20, 16, 21, 22], knowledge graph models
infer the user preferences would be biased by the expo-
[23, 5, 17, 24, 25, 26], explanation ranking models [27],
sure data. They used causal inference to correct for this
logical reasoning models [1, 28, 29], dynamic explana-
bias for improving generalization of recommendation
tion models [30, 31], visual explanation models [8] and
systems to new data. Bonner et. al. [40] proposed a new
natural language generation models [32, 33, 34] have
domain adaptation algorithm which was learned from
been proposed. A more complete review of the related
logged data including outcomes from a biased recom-
models can be seen in [3]. However, they mix the recom-
mendation policy, and predicted recommendation results
mendation mechanism with interpretable components,
according to random exposure. Besides de-bias recom-
which often results in over-complicated systems to make
mendation, Ghazimatin et. al. [46] proposed PRINCE
successful explanations. Moreover, the increased model
model to explore counterfactual evidence for discovering
complexity may reduce the interpretability. A natural
causal explanations in a heterogeneous information net-
way to avoid this dilemma is to rely on model-agnostic
work. Differently, this paper focuses on learning causal
post-hoc approaches so that the recommendation system
rules to provide more intuitive explanation for the black-
is free from the noises of the down-stream explanation
box sequential recommendation models. Additionally,
generator. Examples include [35] that proposed a bandit
we consider [47] as a highly related work though it is
approach, [36] that proposed a reinforcement learning
originally proposed for natural language processing tasks.
framework to generate sentence explanations, and [7]
As we will discuss in the later sections, we utilize some
that developed an association rule mining approach. Ad-
of the key ideas of its model construction, and show why
ditionally, some work distinguish the model explanations
it works in sequential recommendation scenarios.
by their purpose [37]: while persuasive explanations aim
to improve user engagement, model explanation reflexes
how the system really works and may not necessarily be 3. Proposed Approach
persuasive. Our study fall into the later case and aims
to find causal explanations for a given sequential recom- In this section, we first define the explanation problem
mendation model. and then introduce our model as a combination of two
parts: a VAE-based perturbation model that generates the
2.3. Causal Inference in Recommendation counterfactual samples for causal analysis, and a causal
rule mining model that can extract causal dependencies
Originated as statistical problems, causal inference [38, between the cause-effect items.
39] aims at understanding and explaining the causal ef-
fect of one variable on another. While the observational
3.1. Problem Setting
data is considered as the factual world, causal effect in-
ferences should be aware of the counterfactual world, We denote the set of users as 𝒰 = {𝑢1 , 𝑢2 , · · · , 𝑢|𝒰 | }
thus often being regarded as the questions of "what-if". and set of items as ℐ = {𝑖1 , 𝑖2 . · · · , 𝑖|ℐ| }. Each user 𝑢
The challenge is that it is often expensive or even im- is associated with a purchase history represented as a
possible to obtain counterfactual data. For example, it sequence of items ℋ𝑢 . The 𝑗-th interacted item in the
is immoral to re-do the experiment on a patient to find history is denoted as 𝐻𝑗𝑢 ∈ ℐ. Without specification, the
out what will happen if we have not given the medicine. calligraphic ℋ in the paper represents user history, and
Though the majority of causal inference study resides in a straight 𝐻 represents an item. A black-box sequential
the direction of statistics and philosophy, it has recently recommendation model ℱ : ℋ → ℐ is a function that
attracted the attention from AI community for its great takes a sequence of items (as will discuss later, it can be
power of explainablity and bias elimination ability. Ef- the counterfactual user history) as input and outputs the
forts have managed to bring causal inference to several recommended item. In practice, the underlying mecha-
machine learning areas, including recommendation [40], nism usually consists of two steps: a ranking function
learning to rank [41], natural language processing [42], first scores all candidate items based on the user history,
and then it selects the item with the highest score as the 3.2.1. Perturbation Model
final output. Note that it only uses user-item interac-
To capture the causal dependency between items in his-
tion without any content or context information, and the
tory and the recommended items, we want to know what
scores predicted by the ranking function may differ ac-
would take place if the user history had been different. To
cording to the tasks (e.g. {1, . . . , 5} for rating prediction,
avoid unknown influences caused by the length of input
while [0, 1] for Click Through Rate (CTR) prediction).
sequence (i.e., user history), we keep the input length
Our goal is to find an item-level post-hoc model that cap-
unchanged, and only replace items in the sequence to
tures the causal relation between the history items and
create counterfactual histories. Ideally, for each item
the recommended item for each user.
𝐻𝑗𝑢 in a user’s history ℋ𝑢 , it will be replaced by all pos-
Definition 1. (Causal Relation) For two variables 𝑋 and sible items in ℐ to fully explore the influence that 𝐻𝑗𝑢
𝑌 , if 𝑋 triggers 𝑌 , then we say that there is a causal re- makes in the history. However, the number of possible
lation 𝑋 ⇒ 𝑌 , where 𝑋 is the cause and 𝑌 is the effect. combinations will become impractical for the learning
system, since recommender systems usually deal with
hundreds of thousands or even tens of millions items. In
When a given recommendation model ℱ maps a user fact, counterfactual examples that are closest to the orig-
history ℋ to a recommended item 𝑌 ∈ ℐ, all items inal input can be the most useful to a user as shown in
𝑢 𝑢

in ℋ𝑢 are considered as potential causes of 𝑌 𝑢 . Thus [48]. Therefore, we pursue a perturbation-based method
we can formulate the set of causal relation candidates as that generate counterfactual examples, which replaces
𝒮 𝑢 = {(𝐻, 𝑌 𝑢 )|𝐻 ∈ ℋ𝑢 }. items in the original user history ℋ𝑢 .
Definition 2. (Causal Explanation for Sequential Rec- There are various ways to obtain the counterfactual
ommendation Model) Given a causal relation candidate history, as long as they are similar to the real history.
set 𝒮 𝑢 for user 𝑢, if there exists a true causal relation The simplest solution is randomly selecting an item in
𝑢 𝑢
(𝐻, 𝑌 ) ∈ 𝒮 , then the causal explanation for recom- ℋ 𝑢
and replacing it with a randomly selected item from
mending 𝑌 𝑢 is described as “Because you purchased 𝐻, ℐ ∖ ℋ . However, user histories are far from random.
𝑢

the model recommends you 𝑌 𝑢 ”, denoted as 𝐻 ⇒ 𝑌 𝑢 . Thus, we assume that their exists a ground truth user
history distribution, and we adopt VAE to learn such a dis-
Then the remaining problem is to determine whether tribution. As is shown in Figure 2, we design a VAE-based
a candidate pair is a true causal relation. perturbation method, which creates item sequences that
We can mitigate the problem by allowing a likelihood are similar to but slightly different from a user’s genuine
estimation for a candidate pair being a causal relation. history sequence, by sampling from a distribution in the
Definition 3. (Causal Dependency) For a given candi- latent embedding space centered around the user’s true
date pair of causal relation (𝐻, 𝑌 𝑢 ), the causal depen- history sequence.
dency 𝜃𝐻,𝑌 𝑢 of that pair is the likelihood of the pair being In detail, the VAE component consists of a probabilis-
a true causal relation. tic encoder (𝜇, 𝜎) = ENC(𝒳 ) and a decoder 𝒳 ˜ =
DEC(𝑧). The encoder ENC(·) takes a sequence of item
In other words, we would like to find a ranking func- embeddings 𝒳 into latent embedding space, and extracts
tion that predicts the likelihood for each candidate pair, the variational information for the sequence, i.e., mean
and the causal explanation is generated by selecting the and variance of the latent embeddings under independent
pair with top ranking score from these candidates. One Gaussian distribution. The decoder DEC(·) generates a
advantage of this formulation is that it allows the possibil- sequence of item embeddings 𝒳 ˜ given a latent embed-
ity of giving no causal relation between a user’s history ding 𝑧 sampled from the Gaussian distribution. Here,
and the recommended item, e.g., when algorithm rec- both 𝒳 and 𝒳 ˜ are ordered concatenations of pre-trained
ommends the most popular items regardless of the user item embeddings based on pair-wise matrix factorization
history. (BPR-MF) [49]. We follow the standard training regime
of VAE by maximizing the variational lower bound of the
3.2. Causal Model for Post-Hoc Explanationdata likelihood [50]. Specifically, the reconstruction error
involved in this lower bound is calculated by a softmax
In this section, we introduce our counterfactual explana- across all items for each position of the input sequence.
tion framework for recommendation. Inspired by [47], We observe that VAE can reconstruct the original data
we divide our framework into two models: a perturbation set accurately, while offering the power of perturbation.
model and a causal rule mining model. The overview of After pretraining ENC(·) and DEC(·), the variational
the model framework is shown in Fig.2. nature of this model allows us to obtain counterfactual
history ℋ̃ for any real history ℋ. More specifically, it
first extracts the mean and variance of the encoded item
Per tur bation M odel Causality M ining M odel

Re al
hi s t o r y ...
Co u nt e r - m+1
Co u nt e r - f ac t u al
f ac t u al pai r s
Encoder hi s t o r y
Sample m
user Times
Recommendation Causality Mining
history

...

...
Model

...
Causal Dependencies
Decoder
Co u nt e r - Rank and Selection
Co u nt e r - f ac t u al
Per sonalized
f ac t u al
hi s t o r y
... Re s u l t
Explanation

...
Figure 2: Model framework. 𝑥 is the concatenation of the item embeddings of the user history. 𝑥
˜ is the perturbed embedding.

𝑢
sequences in the latent space, and then the perturbation and output item 𝑌ˆ 𝑖 . We consider that the occurrence of
model samples 𝑚 latent embeddings 𝑧 based on the above a single output can be modeled as a logistic regression
variational information. These sampled embeddings 𝑧 on causal dependencies from all the input items in the
are then passed to the decoder DEC(·) to obtain the sequence:
perturbed versions 𝒳 ˜ . For now, each item embedding in 𝑛
˜ may not represent an actual item since it is a sampled 𝑢 𝑢
(︁ ∑︁ )︁
𝒳 𝑃 (𝑌ˆ 𝑖 |ℋ̂𝑖 ) = 𝜎 𝜃𝐻 ^𝑢 · 𝛾
^ 𝑢 ,𝑌
𝑛−𝑗
(1)
vector from the latent space, as a result, we find its nearest 𝑗=1
𝑖𝑗 𝑖

neighbor in the candidate item set ℐ ∖ ℋ through dot
product similarity as the actual item. In this way, 𝒳 ˜ is where 𝜎 is the sigmoid function defined as 𝜎(𝑥) =
transformed into the final counterfactual history ℋ̃. One (1 + exp(−𝑥))−1 in order to scale the score to [0, 1].
should keep in mind that the variance should be kept Additionally, in recommendation task, the order of a
small during sampling, so that the resulting sequences user’s previously interacted items may affect their causal
can be similar to the original sequence. dependency with the user’s next interaction. A closer
Finally, the generated counterfactual data ℋ̃ together behavior tends to have a stronger effect on user’s future
with the original ℋ will be injected into the black-box behaviors, and behaviors are discounted if they happened
recommendation model ℱ to obtain the recommenda- earlier [13]. Therefore, we involve a weight decay param-
tion results 𝑌˜ and 𝑌 , correspondingly. For any user 𝑢, eter 𝛾 to represent the time effect. Here 𝛾 is a positive
after completing this process, we will have 𝑚 different value less than one.
𝑢 𝑢
counterfactual input-output pairs: {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚 For an input-output pair in 𝒟𝑢 , the probability of its
𝑖=1 , as
well as the original pair (ℋ , 𝑌 ). Here the value of 𝑚
𝑢 𝑢 occurrence generated by Eq.(1) should be close to one.
is manually set, but it cannot exceed the number of all As a result, we learn the causal dependencies 𝜃 by maxi-
possible item combinations. mizing the probability over 𝒟𝑢 . When optimizing 𝜃, they
are always initialized as zero to allow for no causation
between two items. When learning this regression model,
3.2.2. Causal Rule Learning Model
we are able to gradually increase 𝜃 until they converge to
Denote 𝒟𝑢 as the combined records of counterfactual the point where the data likelihood of 𝒟𝑢 is maximized.
𝑢 𝑢
input-output pairs {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚 𝑖=1 and the original pair After gathering all the causal dependencies, we select
(ℋ , 𝑌 ) for user 𝑢. We aim to develop a causal model
𝑢 𝑢 the items that have high 𝜃 scores to build causal explana-
that first extracts causal dependencies between input and tions. This involves a three-step procedure.
outputs items appeared in 𝒟𝑢 , and then selects the causal 1. We select those causal dependencies 𝜃𝐻 ^ 𝑢 ,𝑌
^𝑢
rule based on these inferred causal dependencies. 𝑢
𝑖𝑗 𝑖
𝑢
Let ℋ̂𝑖 = [𝐻 ˆ 𝑢𝑖1 , 𝐻
ˆ 𝑢𝑖2 , · · · , 𝐻
ˆ 𝑢𝑖𝑛 ] be the input sequence whose output is the original 𝑌 (i.e., 𝑌 𝑖 = 𝑌 𝑢 ).
𝑢 ˆ
ˆ 𝑢𝑖𝑗 is the j-th item in Note that these (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) pairs may come from
of the 𝑖-th record of 𝒟𝑢 , where 𝐻
𝑢 𝑢 either the original sequence or counterfactual se-
ℋ̂𝑖 . Let 𝑌ˆ 𝑖 represent the corresponding output. Note
quences, because when a counterfactual sequence
that this includes the original real pair (ℋ𝑢 , 𝑌 𝑢 ). The
is fed into the black-box recommendation model,
model should be able to infer the causal dependency
ˆ𝑢 the output may happen to be the same as the
(refer to Definition 3) 𝜃𝐻 ^ 𝑢 ,𝑌 ^ 𝑢 between input item 𝐻 𝑖𝑗
𝑖𝑗 𝑖 original sequence 𝑌 𝑢 .
Algorithm 1 Causal Explanation Model Table 1
Summary of the Datasets
Input: users 𝒰, items ℐ, user history ℋ , 𝑢

counterfactual number 𝑚, black-box model ℱ, Dataset # users # items # interactions # train # test sparsity
embedding model ℰ, causal mining model ℳ
Movielens 943 1682 100,000 95,285 14,715 6.3%
Output: causal explanations 𝐻 ⇒ 𝑌 𝑢 where 𝐻 ∈ ℋ𝑢 Amazon 573 478 13,062 9,624 3,438 4.7%
1: Use embedding model ℰ to get item embeddings ℰ(ℐ)

2: Use ℰ(ℐ) and true user history to train perturbation can serve as an intuitive explanation for the black-box
model 𝒫 recommendation model.
3: for each user 𝑢 do
4: for 𝑖 from 1 to 𝑚 do 4.1. Dataset Description
𝑢 𝑢 𝑢
5: ℋ̃𝑖 ← 𝒫(ℋ𝑢 ); 𝑌˜ 𝑖 ← ℱ (ℋ̃𝑖 )
We evaluate our proposed causal explanation framework
6: end for
against baselines on two datasets. The first dataset is
7: Construct counterfactual input-output pairs
𝑢 𝑢 MovieLens100k1 . This dataset consists of information
{(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚𝑖=1
𝑢 𝑢 𝑢
˜𝑢 𝑚 about users, movies and ratings. In this dataset, each user
8: {(ℋ̂𝑖 , 𝑌ˆ 𝑖 )}𝑚+1 𝑢 𝑢
𝑖=1 ← {(ℋ̃𝑖 , 𝑌 𝑖 )}𝑖=1 ∪ (ℋ , 𝑌 ) has rated at least 20 movies, and each movie can belong to
(︁ 𝑢
ˆ 𝑢 𝑚+1
)︁ several genres. The second dataset is the office product
9: 𝜃𝐻 ^ 𝑢 ← ℳ {(ℋ̂𝑖 , 𝑌 𝑖 )}𝑖=1
^ 𝑢 ,𝑌
𝑖𝑗 𝑖 dataset from Amazon2 , which contains the user-item
10: Rank 𝜃𝐻 ^ 𝑢 ,𝑌 𝑢 and select top-𝑘 pairs interactions from May 1996 to July 2014. The original
𝑖𝑗
{(𝐻𝑗 , 𝑌 𝑢 )}𝑘𝑗=1 dataset is 5-core. To achieve sequential recommendation
11: if ∃𝐻min{𝑗} ∈ ℋ𝑢 then with input length of 5, we select the users with at least
12: Generate causal explanation 𝐻min{𝑗} ⇒ 𝑌 𝑢 15 purchases and the items with at least 10 interactions.
13: else Since our framework is used to explain sequential rec-
14: No explanation for the recommended item 𝑌 𝑢 ommendation models, we split the dataset chronologi-
15: end if cally. Further, to learn the pre-trained item embeddings
16: end for based on BPR-MF [49] (section 3.2.1), we take the last
17: return all causal explanations 𝐻 ⇒ 𝑌
𝑢 6 interactions from each user to construct the testing
set, and use all previous interactions from each user as
the training set. To avoid data leakage, when testing the
2. We sort the above selected causal dependencies black-box recommendation models and our VAE-based
in descending order and take the top-𝑘 (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) perturbation model, we only use the last 6 interactions
pairs. of each user (i.e., the testing set of the pre-training stage).
3. If there exist one or more pairs in these top-𝑘 Following common practice, we adopt the leave-one-out
ˆ 𝑢𝑖𝑗 appears in the user’s protocol, i.e., among the 6 interactions in test set, we use
pairs, which cause item 𝐻
input sequence ℋ , then we pick such pair of the the last one for testing, and the previous five interactions
𝑢

highest rank, and construct 𝐻 ˆ 𝑢𝑖𝑗 ⇒ 𝑌 𝑢 as the will serve as input to the recommendation models. A
causal explanation for the user. Otherwise, i.e., brief summary of the data is shown in Table 1.
no cause item appears in the user history, then
we output no causal explanation for the user. 4.2. Experimental Settings
Note that the extracted causal explanation is personal- We adopt the following methods to train black-box se-
ized since the algorithm is applied on 𝒟𝑢 , which only con- quential recommendation models and extract traditional
tains records centered around the user’s original record association rules as comparative explanations. Mean-
(ℋ𝑢 , 𝑌 𝑢 ), while collaborative learning among users is while, we further conduct different variants of the per-
indirectly modeled by the VAE-based perturbation model. turbation model to analyze our model. We include both
The overall algorithm is provided in Alg.1. For each user, shallow and deep models for the experiment.
there are two phases - perturbation phase (line 4-7) and FPMC [10]: The Factorized Personalized Markov
causal rule mining phase (line 8-15). Chain model, which combines matrix factorization and
Markov chains to capture user’s personalized sequential
behavior patterns for prediction3 .
4. Experiments
1
https://grouplens.org/datasets/movielens/
In this section, we conduct experiments to show what 2
https://nijianmo.github.io/amazon/
causal relationships our model can capture and how they 3
https://github.com/khesui/FPMC
Table 2
Results of Model Fidelity. Our causal explanation framework is tested under the number of candidate causal explanations
𝑘 = 1. The association explanation framework is tested under support, confidence, and lift thresholds, respectively. The best
fidelity on each column is highlighted in bold.

Dataset Movielens 100k Amazon
Models FPMC GRU4Rec NARM Caser FPMC GRU4Rec NARM Caser
AR-sup 0.3160 0.1453 0.4581 0.1569 0.2932 0.1449 0.4066 0.2024
AR-conf 0.2959 0.1410 0.4305 0.1559 0.2949 0.1449 0.4031 0.1885
AR-lift 0.2959 0.1410 0.4305 0.1559 0.2949 0.1449 0.4031 0.1885
CR-AE 0.5631 0.7413 0.7084 0.6151 0.6981 0.8255 0.8970 0.7260
CR-VAE 0.9650 0.9852 0.9714 0.9703 0.9511 0.9721 0.9791 0.9599

GRU4Rec [13]: A session-based recommendation decoder are Multi-Layer Perceptrons (MLP) with two
model, which uses recurrent neural networks – in partic- hidden layers, and each layer consists of 1024 neurons.
ular, Gated Recurrent Units (GRU) – to capture sequential The only difference between our model and the vari-
patterns for prediction4 . ant model CR-AE is that the variant model applies fixed
NARM [15]: A sequential recommendation model normal distribution as variance instead of learned person-
which utilizes GRU and attention mechanism to estimate alized variance. The default number of counterfactual
the importance of each interactions5 . input-output pairs is 𝑚 = 500 on both datasets. The
Caser [51]: The ConvolutionAl Sequence Embedding default time decay factor is 𝛾 = 0.7. We will discuss the
Recommendation (Caser) model, which adopts convo- influence of counterfactual number 𝑚 and time decay
lutional filters over recent items to learn the sequential factor 𝛾 in the experiments.
patterns for prediction6 . In the following, we will apply our model and all base-
AR-sup [7]: A post-hoc explanation model, which lines on the black-box recommendation models to evalu-
extract association rules from interactions from all users ate and compare the generated explanations. In particu-
and rank based on support value to generate item-level lar, we evaluate our framework from three perspectives.
explanations. First, a explanation model should at least be able to offer
AR-conf [7]: Extracting association rules and rank explanations for most recommendations, we will show it
based on confidence value to get explanations. in the result (explanation fidelity). Second, if our model is
AR-lift [7]: Rank based on lift value among extracted capable of generating explanations for most recommen-
association rules to generate explanations. dations, we need to verify that the causal explanations
CR-AE: A variant of our causal rule model which learned by our framework represent the key component
applies fixed variance in hidden layer of AutoEncoder of recommendation mechanism (explanation quality). Fi-
model as the perturbation model. Compared with our nally, since counterfactual examples are involved in our
VAE-based perturbation model, this variant apply non- framework, our framework should be able to generate
personalized variance. closer counterfactual examples (counterfactual quality).
For black-box recommendation models FPMC, Additionally, we shed light on how our model differs
GRU4Rec, NARM and Caser, we adopt their best from other models on statistical metrics.
parameter selection in their corresponding public imple-
mentation. For the association rule-based explanation 4.3. Model Fidelity
model, we follow the recommendations in [7] to set
the optimal parameters: support = 0.1, confidence A very basic purpose of designing a explanation model
= 0.1, lift = 0.1, length = 2 for MovieLens100k, and is to generate explanations for most recommendations.
support = 0.01, confidence = 0.01, lift = 0.01, length Therefore, an important evaluation measure for explana-
= 2 for Amazon dataset due to its smaller scale. We tion models is model fidelity, i.e., what’s the percentage
accept top 100 rules based on corresponding values (i.e. of the recommendation results can be explained by the
support/confidence/lift) as explanations model [3]. The results of model fidelity are shown in
For our causal rule learning framework, we set the Table 2. In this experiment, we only report the results of
item embedding size as 16, both the VAE encoder and keeping the number of candidate causal explanations 𝑘
4
as 1 for our framework and variant. For the association
https://github.com/hungthanhpham94/GRU4REC-pytorch
5
https://github.com/Wang-Shuo/Neural-Attentive-Session-
rule explanation model (section 4.2), we apply the global
Based-Recommendation-PyTorch association rules [7] ranking by support, confidence, and
6
https://github.com/graytowne/caser_pytorch lift, respectively.
We can see that on both datasets, our causal explana- Here 𝑑𝑜() represents an external intervention, which
tion framework is able to generate explanations for most forces a variable to take a specific value. Specifically, in
of the recommended items (including the variant), while our case, for an extracted causal rule 𝐻 ⇒ 𝑌 𝑢 , we define
𝑢
the association explanation approach can only provide ex- the binary random variable as 1 if 𝐻 ∈ ℋ̃𝑖 , 0 else. We
planations for significantly fewer recommendations. The also define another variable 𝑦 as a binary random variable,
𝑢
underlying reason is that association explanations have which is 1 if 𝑌˜ 𝑖 = 𝑌𝑢 , otherwise it will be 0. We then
to be extracted based on the original input-output pairs, report average ACE on all generated explanations. Note
which limits the number of pairs that we can use for rule that since the ACE value is used for causal related models,
extraction. However, based on the perturbation model, we cannot report it on the association rule baseline.
our causal explanation framework is capable of creating Suppose the perturbation model (section 3.2.1) cre-
many counterfactual examples to assist causal rule learn- ates 𝑚 counterfactual input-output pairs for each user 𝑢:
𝑢 𝑢 𝑢
ing, which makes it possible to go beyond the limited 𝑖=1 . Here ℋ̃ is created by our perturbation
{(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚
original data to extract causal explanations. Moreover, model (i.e. not observed in the original data), and thus
𝑢
when the number of input and output items are limited observing 𝐻 ∈ ℋ̃ implies we have 𝑑𝑜(𝑥 = 1) in ad-
(e.g. five history items as input and the model recom- vance. Let 𝐻 ⇒ 𝑌 𝑢 be the causal explanation extracted
mends only one item in our case), it is harder to match by the casual rule learning model (section 3.2.2). Then
global rules with personal interactions and recommen- we estimate the ACE based on these 𝑚 counterfactual
dation, which limits the flexibility of global association pairs as,
rules.
E[𝑦|𝑑𝑜(𝑥 = 1)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 1))
Another interesting observation is that GRU4Rec and 𝑢
Caser have significantly (𝑝 < 0.01) lower fidelity than #Pairs(𝐻 ∈ ℋ̃ ∧ 𝑌 = 𝑌 𝑢 )
= 𝑢
FPMC and NARM when explained by the association #Pairs(𝐻 ∈ ℋ̃ )
(2)
model. This is reasonable because FPMC is a Markov- E[𝑦|𝑑𝑜(𝑥 = 0)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 0))
based model that consider input as a basket and directly 𝑢
#Pairs(𝐻 ∈
/ ℋ̃ ∧ 𝑌 = 𝑌 𝑢 )
learns the correlation between candidate items and each = 𝑢
items in a sequence, as a result, it is easier to extract asso- #Pairs(𝐻 ∈
/ ℋ̃ )
ciation rules between inputs and outputs for the model. We report the ACE value of our model and variants in
NARM combines the whole session information and in- Table.3. While showing the ACE value, we still keep the
fluence of each individual item in the session, therefore,
number of candidate causal explanations 𝑘 as 1.
association rules which involve individual information We can see that our model can achieve higher ACE
will be easier to be extracted for this model. However, value than the variant for most recommendation models
it also means that the fidelity performance of the asso-on both dataset. But here we can observe an interesting
ciation approach highly depends on the recommenda- results that the ACE value for FPMC model is much lower
tion model being explained. Meanwhile, we see that our than other recommendation models (GRU4Rec, NARM,
causal approach achieves comparably good fidelity on allCaser). Meanwhile, the variant model has slightly larger
three recommendation models, because the perturbation ACE than our model when applying on FPC model.
model is able to create sufficient counterfactual exam- The difference between FPMC and other recommen-
ples to break the correlation of frequently co-occurringdation models is that FPMC is based on Markov chain
items in the input sequence. This indicates the robust- that only considers the last behavior while other mod-
ness of our causal explanation framework in terms of els involve the whole session information. For FPMC
model fidelity. model, although we take a session as input to recom-
mend next item, this model actually considers it as a
4.4. Average Causal Effect basket and linearly combines the influence of each item
from the basket. In this case, every part of the session
We then verify our causal explanations are true expla- will have independent influence towards next item pre-
nations that explanation are important component for diction. So changing a small part of input items may not
recommending original item. A common way is to mea- significantly change the next item prediction which high
sure the causal effect on the outcome of the model[52]. likely results in same recommendation item. Based on
First of all, we show the definition of Average Causal our experiment, when we keep counterfactual histories
Effect. same for all recommendation models, FPMC model only
Definition 4. (Average Causal Effect) The Average gets 98 counterfactual histories (19.6%) in average with
Causal Effect (ACE) of a binary random variable 𝑥 on different recommendation (different from the recommen-
another random variable 𝑦 is define as E[𝑦|𝑑𝑜(𝑥 = 1)] − dation item based on real history), while other models
E[𝑦|𝑑𝑜(𝑥 = 0)] have at least 315 counterfactual histories (63%) in aver-
age with different recommendation item. This difference
Table 3 Table 4
Results of Average Causal Effect. Our causal explanation Results of Proximity. The value of proximity is calculated by
framework is tested under the number of candidate causal Eq.(3)
explanations 𝑘 = 1.
Dataset Movielens 100k
Dataset Movielens 100k
Models FPMC GRU4Rec NARM Caser
Models FPMC GRU4Rec NARM Caser
CR-AE -22.69 -22.37 -22.35 -22.40
CR-AE 0.0184 0.1479 0.1108 0.1199 CR-VAE -17.35 -16.88 -16.83 -16.93
CR-VAE 0.0178 0.1862 0.1274 0.1388
Dataset Amazon
Dataset Amazon
Models FPMC GRU4Rec NARM Caser
Models FPMC GRU4Rec NARM Caser
CR-AE -21.83 -21.28 -21.20 -21.33
CR-AE 0.0230 0.1150 0.1101 0.1347 CR-VAE -18.01 -17.40 -17.31 -17.51
CR-VAE 0.0212 0.1434 0.1511 0.1563

makes the FPMC model has much lower ACE value com-
pared with other recommendation models. Comparing
our model with CR-AE, the variant model will generate
less similar counterfactual histories which more likely
result in different recommendation item than our model.
Therefore, CR-AE has slightly higher ACE values than (a) Model Fidelity on Movielens (b) Model Fidelity on Amazon
CR-VAE.
Figure 3: Model fidelity on different time decay parameters
𝛾 . 𝑥-axis is the time decay parameter 𝛾 ∈ {0.1, 0.3, 0.7, 1}
4.5. Proximity and 𝑦 -axis is the model fidelity. The left side pictures are on
Movielens and the right side pictures are on Amazon.
As we mentioned before, counterfactual examples that
are closest to the original can be the most useful to users.
Similar with [48], we define the proximity as the distance
between negative counterfactual examples (i.e. generate terfactual examples of our model have higher quality and
recommendation item different from original item) and be more useful.
original real history. Intuitively, a counterfactual example
that close enough but get totally different results will be 4.6. Influence of Parameters
more helpful. For a given user, the proximity can be
expressed as In this section, we discuss the influence of two important
parameters. The first one is time decay parameter 𝛾 – in
𝑢
our framework, when explaining the sequential recom-
∑︁
𝑃 𝑟𝑜𝑥𝑖𝑚𝑖𝑡𝑦𝑢 = −𝑚𝑒𝑎𝑛( 𝑑𝑖𝑠𝑡(ℋ̃𝑖 , ℋ𝑢 )) (3)
˜ 𝑢 ̸=𝑌𝑢
𝑌
mendation models, earlier interactions in the sequence
𝑖
will have discounted effects to the recommended item.
Here the distance is defined in latent space. The repre- A proper time decay parameter helps the framework to
sentation of any history sequence is the concatenation reduce noise signals when learning patterns from the
of the latent representation of each item in the sequence. sequence. The second parameter is the number of per-
The latent representations of items are learned from pre- turbed input-output pairs 𝑚 – in our framework, we
trained BPRMF [49] model. The distance of any two use perturbations to create counterfactual examples for
sequence is defined as Euclidean distance between the causal learning, but there may exist trade-off between ef-
representation of two sequence. The reported proximity ficiency and performance. We will analyze the influence
value would be the average over all users. of these two parameters.
Given that association rule model does not involve Time Decay Effect: Figure 3 shows the influence of 𝛾
counterfactual examples, this metric can only be reported on different recommendation models and datasets. From
on our model and the variant model on both datasets, as the result we can see that the time decay effect 𝛾 indeed
shown in Table.4 We can observe that our model can affects the model performance on fidelity. In particular,
achieve higher proximity compared with the variant when 𝛾 is small, the previous interactions in a sequence
model. In other words, counterfactual examples gen- are more likely to be ignored, which thus reduces the
erated with learned latent variance is more similar with performance on model fidelity. When 𝛾 is large (e.g.,
real history. Therefore, higher proximity implies coun- 𝛾 = 1), old interactions will have equal importance with
Association Rule Our Explanation
Explanation

unknown

(a) Model Fidelity on Movielens (b) Model Fidelity on Amazon

Figure 4: Model fidelity on different number of counterfac- Association Rule Our Explanation
Explanation
tual pairs 𝑚. 𝑥-axis is the number of counterfactual pairs 𝑚.
𝑦 -axis is model fidelity. Figure 5: A case study on MovieLens by the Caser model.
The first movie for 𝑢1 is unknown in the dataset.

latest interactions, which also hurts the performance. We
can see from the results that the best performance is gives >95% confidence that the estimated probability er-
achieved at about 𝛾 = 0.7 on both datasets. ror is <0.1.
Number of Counterfactual Examples: Figure 4
shows the influence for the number of counterfactual 4.7. Case Study
input-output pairs 𝑚. A basic observation from Figure 4
is that when 𝑚 increases, model fidelity will decrease first In this section, we provide a simple case study to com-
and then increase. The underlying reason is as follows. pare causal explanations and association explanations.
When 𝑚 is small, the variance of the counterfactual Compared with the association explanation model, our
input-output pairs will be small, and fewer counterfac- model is capable of generating personalized explanations,
tual items will be involved. Then the model is more likely which means that even if the recommendation model rec-
to select original item as explanation. For example, sup- ommends the same item for two different users and the
pose the original input-output pair is 𝐴, 𝐵, 𝐶 → 𝑌 . In users have overlapped histories, our model still has the
the extreme case where 𝑚 = 1, we will have only one potential to generate different explanations for differ-
counterfactual pair, e.g., 𝐴, 𝐵
˜ , 𝐶 → 𝑌˜ . According to the ent users. However, the association model will provide
causal rule learning model (section 3.2.2), if 𝑌˜ ̸= 𝑌 , then the same explanation since the association rules are ex-
𝐵 ⇒ 𝑌 will be the causal explanation since the change tracted based on global records. An example by the Caser
of 𝐵 results in a different output, while if 𝑌˜ = 𝑌 , then [51] recommendation model on MovieLens100k dataset is
either 𝐴 ⇒ 𝑌 or 𝐶 ⇒ 𝑌 will be the causal explanation shown in Figure 5, where two users with one commonly
since their 𝜃 scores will be higher than 𝐵 or 𝐵 ˜ . In either watched movie (The Sound of Music) get exactly same
case, the model fidelity and percentage of verified causal recommendation (Pulp Fiction). The association model
rules will be 100%. However, in this case, the results do provides the overlapped movie as an explanation for the
not present statistical meanings since they are estimated two different users, while our model can generate per-
on a very small amount of examples. sonalized explanation for different users even when they
When 𝑚 increases but not large enough, then random got the same recommendation.
noise examples created by the perturbation model will re-
duce the model fidelity. Still consider the above example, 5. Conclusions
if many pairs with the same output 𝑌 are created, then
the model may find other items beyond 𝐴, 𝐵, 𝐶 as the Recommender systems are widely used in our daily life.
cause, which will result in no explanation for the origi- Effective recommendation mechanisms usually work
nal sequence. However, if we continue to increase 𝑚 to through black-box models, resulting in the lack of trans-
sufficiently large numbers, such noise will be statistically parency. In this paper, we extract causal rules from user
offset, and thus the model fidelity and percentages will history to provide personalized, item-level, post-hoc ex-
increase again. In the most ideal case, we would create all planations for the black-box sequential recommendation
of the |ℋ||ℐ| sequences for causal rule learning, where models. The causal explanations are extracted through a
|ℋ| is the number of item slots in the input sequence, and perturbation model and a causal rule learning model. We
|ℐ| is the total number of items in the dataset. However, conduct several experiments on real-world datasets, and
|ℋ||ℐ| would be a huge number that makes it compu- apply our explanation framework to several state-of-the-
tational infeasible for causal rule learning. In practice, art sequential recommendation models. Experimental
we only need to specify 𝑚 sufficiently large. Based on results verified the quality and fidelity of the causal ex-
Chebyshev’s Inequality, we find that 𝑚 = 500 already planations extracted by our framework.
In this work, we only considered item-level causal [9] N. Wang, H. Wang, Y. Jia, Y. Yin, Explainable recom-
relationships, while in the future, it would be interesting mendation via multi-task learning in opinionated
to explore causal relations on feature-level external data text data, in: The 41st International ACM SIGIR
such as textual user reviews, which can help to generate Conference on Research & Development in Infor-
finer-grained causal explanations. mation Retrieval, ACM, 2018.
[10] S. Rendle, C. Freudenthaler, L. Schmidt-Thieme,
Factorizing personalized markov chains for next-
Acknowledgments basket recommendation, in: Proceedings of the
19th international conference on World wide web,
This work was partly supported by NSF IIS-1910154 and
ACM, 2010, pp. 811–820.
IIS-2007907. Any opinions, findings, conclusions or rec-
[11] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, X. Cheng,
ommendations expressed in this material are those of
Learning hierarchical representation model for
the authors and do not necessarily reflect those of the
nextbasket recommendation, in: Proceedings of the
sponsors.
38th International ACM SIGIR conference on Re-
search and Development in Information Retrieval,
References ACM, 2015, pp. 403–412.
[12] R. He, J. McAuley, Fusing similarity models with
[1] H. Chen, S. Shi, Y. Li, Y. Zhang, Neural collaborative markov chains for sparse sequential recommenda-
reasoning, in: Proceedings of the Web Conference tion, in: 2016 IEEE 16th International Conference
2021, 2021, pp. 1516–1527. on Data Mining (ICDM), IEEE, 2016, pp. 191–200.
[2] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning [13] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk,
based recommender system: A survey and new Session-based recommendations with recurrent
perspectives, ACM Computing Surveys (CSUR) 52 neural networks, in: International Conference on
(2019) 1–38. Learning Representations, 2016.
[3] Y. Zhang, X. Chen, Explainable recommendation: [14] F. Yu, Q. Liu, S. Wu, L. Wang, T. Tan, A dynamic
A survey and new perspectives, Foundations and recurrent model for next basket recommendation,
Trends® in Information Retrieval (2020). in: Proceedings of the 39th International ACM SI-
[4] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, S. Ma, GIR conference on Research and Development in
Explicit factor models for explainable recommenda- Information Retrieval, ACM, 2016, pp. 729–732.
tion based on phrase-level sentiment analysis, in: [15] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, J. Ma, Neu-
Proceedings of the 37th international ACM SIGIR ral attentive session-based recommendation, in:
conference on Research & development in informa- Proceedings of the 2017 ACM on Conference on
tion retrieval, ACM, 2014, pp. 83–92. Information and Knowledge Management, ACM,
[5] Y. Xian, Z. Fu, S. Muthukrishnan, G. De Melo, 2017, pp. 1419–1428.
Y. Zhang, Reinforcement knowledge graph reason- [16] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin,
ing for explainable recommendation, in: Proceed- H. Zha, Sequential recommendation with user mem-
ings of the 42nd International ACM SIGIR Confer- ory networks, in: Proceedings of the eleventh ACM
ence on Research and Development in Information international conference on WSDM, 2018, pp. 108–
Retrieval, 2019, pp. 285–294. 116.
[6] A. Theodorou, R. H. Wortham, J. J. Bryson, De- [17] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, E. Y.
signing and implementing transparency for real Chang, Improving sequential recommendation
time inspection of autonomous robots, Connection with knowledge-enhanced memory networks, in:
Science 29 (2017) 230–241. The 41st International ACM SIGIR Conference on
[7] G. Peake, J. Wang, Explanation mining: Post hoc Research & Development in Information Retrieval,
interpretability of latent factor models for recom- ACM, 2018, pp. 505–514.
mendation systems, in: Proceedings of the 24th [18] J. Chen, F. Zhuang, X. Hong, X. Ao, X. Xie, Q. He,
ACM SIGKDD International Conference on Knowl- Attention-driven factor model for explainable per-
edge Discovery & Data Mining, 2018. sonalized recommendation, in: The 41st Interna-
[8] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin, tional ACM SIGIR Conference on Research & De-
H. Zha, Personalized fashion recommendation with velopment in Information Retrieval, ACM, 2018, pp.
visual explanations based on multimodal attention 909–912.
network: Towards visually explainable recommen- [19] X. Chen, Z. Qin, Y. Zhang, T. Xu, Learning to rank
dation, in: Proceedings of the 42nd International features for recommendation over multiple cate-
ACM SIGIR Conference on Research and Develop- gories, in: Proceedings of the 39th International
ment in Information Retrieval, 2019, pp. 765–774. ACM SIGIR conference on Research and Develop-
ment in Information Retrieval, 2016, pp. 305–314. (TOIS) 38 (2019) 1–29.
[20] S. Seo, J. Huang, H. Yang, Y. Liu, Interpretable [32] L. Li, Y. Zhang, L. Chen, Personalized transformer
convolutional neural networks with dual local and for explainable recommendation, ACL (2021).
global attention for review rating prediction, in: [33] L. Li, Y. Zhang, L. Chen, Generate neural template
Proceedings of the Eleventh ACM Conference on explanations for recommendation, in: Proceed-
RecSys, 2017, pp. 297–305. ings of the 29th ACM International Conference on
[21] C. Li, C. Quan, L. Peng, Y. Qi, Y. Deng, L. Wu, A cap- Information & Knowledge Management, 2020, pp.
sule network for recommendation and explaining 755–764.
what you like and dislike, in: Proceedings of the [34] H. Chen, X. Chen, S. Shi, Y. Zhang, Generate natural
42nd International ACM SIGIR Conference on Re- language explanations for recommendation, SIGIR
search and Development in Information Retrieval, 2019 Workshop on ExplainAble Recommendation
ACM, 2019, pp. 275–284. and Search (2019).
[22] F. Costa, S. Ouyang, P. Dolog, A. Lawlor, Automatic [35] J. McInerney, B. Lacker, S. Hansen, K. Higley,
generation of natural language explanations, in: H. Bouchard, A. Gruson, R. Mehrotra, Explore,
Proceedings of the 23rd International Conference exploit, and explain: personalizing explainable rec-
on Intelligent User Interfaces Companion, ACM, ommendations with bandits, in: Proceedings of the
2018, p. 57. 12th ACM Conference on Recommender Systems,
[23] Q. Ai, V. Azizi, X. Chen, Y. Zhang, Learning hetero- ACM, 2018, pp. 31–39.
geneous knowledge base embeddings for explain- [36] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, X. Xie,
able recommendation, Algorithms 11 (2018) 137. A reinforcement learning framework for explain-
[24] Z. Fu, Y. Xian, R. Gao, J. Zhao, Q. Huang, Y. Ge, S. Xu, able recommendation, in: 2018 IEEE International
S. Geng, C. Shah, Y. Zhang, et al., Fairness-aware ex- Conference on Data Mining (ICDM), IEEE, 2018, pp.
plainable recommendation over knowledge graphs, 587–596.
SIGIR (2020). [37] N. Tintarev, Explanations of recommendations,
[25] W. Ma, M. Zhang, Y. Cao, W. Jin, C. Wang, Y. Liu, in: Proceedings of the 2007 ACM conference on
S. Ma, X. Ren, Jointly learning explainable rules for Recommender systems, 2007, pp. 203–206.
recommendation with knowledge graph, in: The [38] J. Pearl, Causality: models, reasoning and inference,
World Wide Web Conference, 2019, pp. 1210–1221. volume 29, Springer, 2000.
[26] Y. Xian, Z. Fu, H. Zhao, Y. Ge, X. Chen, Q. Huang, [39] G. W. Imbens, D. B. Rubin, Causal inference in statis-
S. Geng, Z. Qin, G. De Melo, S. Muthukrishnan, tics, social, and biomedical sciences, Cambridge
et al., Cafe: Coarse-to-fine neural symbolic reason- University Press, 2015.
ing for explainable recommendation, in: Proceed- [40] S. Bonner, F. Vasile, Causal embeddings for rec-
ings of the 29th ACM International Conference on ommendation, in: Proceedings of the 12th ACM
Information & Knowledge Management, 2020, pp. Conference on Recommender Systems, ACM, 2018.
1645–1654. [41] T. Joachims, A. Swaminathan, T. Schnabel, Un-
[27] L. Li, Y. Zhang, L. Chen, Extra: Explanation ranking biased learning-to-rank with biased feedback, in:
datasets for explainable recommendation, SIGIR Proceedings of the Tenth ACM International Con-
(2021). ference on Web Search and Data Mining, ACM,
[28] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang, 2017, pp. 781–789.
Neural logic reasoning, in: Proceedings of the 29th [42] Z. Wood-Doughty, I. Shpitser, M. Dredze, Chal-
ACM International Conference on Information & lenges of using text classifiers for causal inference,
Knowledge Management, 2020, pp. 1365–1374. in: Proceedings of the 2018 Conference on Empiri-
[29] Y. Zhu, Y. Xian, Z. Fu, G. de Melo, Y. Zhang, Faith- cal Methods in Natural Language Processing, 2018,
fully explainable recommendation via neural logic pp. 4586–4598.
reasoning, in: Proceedings of the 2021 Conference [43] L. Buesing, T. Weber, Y. Zwols, S. Racaniere,
of the North American Chapter of the Association A. Guez, J.-B. Lespiau, N. Heess, Woulda, coulda,
for Computational Linguistics: Human Language shoulda: Counterfactually-guided policy search, in:
Technologies, 2021, pp. 3083–3090. ICLR, 2019.
[30] X. Chen, Y. Zhang, Z. Qin, Dynamic explainable [44] D. Liang, L. Charlin, J. McInerney, D. M. Blei, Mod-
recommendation based on neural attentive mod- eling user exposure in recommendation, in: Pro-
els, in: Proceedings of the AAAI Conference on ceedings of the 25th WWW, 2016.
Artificial Intelligence, volume 33, 2019, pp. 53–60. [45] D. Liang, L. Charlin, D. M. Blei, Causal inference
[31] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable for recommendation, in: Causation: Foundation to
product search with a dynamic relation embedding Application, Workshop at UAI, 2016.
model, ACM Transactions on Information Systems [46] A. Ghazimatin, O. Balalau, R. Saha Roy, G. Weikum,
Prince: provider-side interpretability with counter-
factual explanations in recommender systems, in:
Proceedings of the 13th International Conference
on Web Search and Data Mining, 2020, pp. 196–204.
[47] D. Alvarez-Melis, T. S. Jaakkola, A causal frame-
work for explaining the predictions of black-box
sequence-to-sequence models, Proceedings of the
2017 Conference on Empirical Methods in Natural
Language Processing (2017).
[48] R. K. Mothilal, A. Sharma, C. Tan, Explaining ma-
chine learning classifiers through diverse counter-
factual explanations, in: Proceedings of the 2020
Conference on Fairness, Accountability, and Trans-
parency, 2020.
[49] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-
Thieme, Bpr: Bayesian personalized ranking from
implicit feedback, UAI (2012).
[50] D. P. Kingma, M. Welling, Auto-encoding varia-
tional bayes, 2014.
[51] J. Tang, K. Wang, Personalized top-n sequential
recommendation via convolutional sequence em-
bedding, in: Proceedings of the Eleventh ACM
International Conference on Web Search and Data
Mining, ACM, 2018, pp. 565–573.
[52] R. Moraffah, M. Karami, R. Guo, A. Raglin,
H. Liu, Causal interpretability for machine learning-
problems, methods and evaluation, ACM SIGKDD
Explorations Newsletter 22 (2020) 18–33.