=Paper=
{{Paper
|id=Vol-2911/paper3
|storemode=property
|title=Learning Causal Explanations for Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2911/paper3.pdf
|volume=Vol-2911
|authors=Shuyuan Xu,Yunqi Li,Shuchang Liu,Zuohui Fu,Yingqiang Ge,Xu Chen,Yongfeng Zhang
}}
==Learning Causal Explanations for Recommendation==
Learning Causal Explanations for Recommendation

Shuyuan Xu¹, Yunqi Li¹, Shuchang Liu¹, Zuohui Fu¹, Yingqiang Ge¹, Xu Chen² and Yongfeng Zhang¹
¹ Department of Computer Science, Rutgers University, New Brunswick, NJ 08901, US
² Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, 100872, China

Abstract
State-of-the-art recommender systems can generate high-quality recommendations, but they usually cannot provide explanations to humans because they rely on black-box prediction models. This lack of transparency has highlighted the importance of improving the explainability of recommender systems. In this paper, we propose causal explainable recommendation, which aims to provide post-hoc explanations for recommendations by answering "what if" questions, e.g., "how would the recommendation result change if the user's behavior history had been different?" Our approach first obtains counterfactual user histories and counterfactual recommendation items with the aid of a perturbation model, and then extracts personalized causal relationships for the recommendation model through a causal rule mining algorithm. Different from existing explainable recommendation models that aim to provide persuasive explanations, our model aims to find the true explanations for the recommendation of an item. Therefore, in addition to evaluating the fidelity of the discovered causal explanations, we adopt the average causal effect to measure the quality of explanations; here by quality we mean whether they are true explanations rather than how persuasive they are. We conduct experiments with several state-of-the-art sequential recommendation models on real-world datasets to verify the performance of our model on generating causal explanations.

Keywords
Sequential Recommendation, Explainable Recommendation, Post-hoc Explanation, Causal Analysis

1. Introduction

As widely used tools for decision-making, recommender systems have been recognized for their ability to provide high-quality services that reduce the gap between products and customers. Many state-of-the-art models achieve outstanding expressiveness by using high-dimensional user/item representations and deep learning models with thousands or even millions of parameters [1, 2]. However, this complexity easily goes beyond the comprehension of a human, who may demand intuitive explanations for why the model made a specific decision. Moreover, providing supportive information and interpretation along with the recommendation can be helpful for both the customers and the platform, since it improves the transparency, persuasiveness, trustworthiness, effectiveness, and user satisfaction of the recommendation system, while also helping system designers refine their algorithms [3].

Figure 1: An example of causal explanation. Comparing the recommendations produced from the real history and from counterfactual histories ("What if the user's history had been different?"), if replacing one certain item results in a change of the recommendation, that item could be the true reason that the system recommends the original item.
Thus, people are looking for solutions One typical method to solve explainable recommenda- that can generate explanations along with the recommen- tion is to construct a model-intrinsic explanation mod- ule that also serves as an intermediate recommendation stage[4, 5]. However, this approach has to redesign the The 1st International Workshop on Causality in Search and Recommendation (CSR’21), July 15, 2021, Virtual Event, Canada original recommendation model and thus may sacrifice " shuyuan.xu@rutgers.edu (S. Xu); yunqi.li@rutgers.edu (Y. Li); model accuracy in order to obtain good explanations shuchang.syt.liu@rutgers.edu (S. Liu); zuohui.fu@rutgers.edu [6]. Moreover, for complex deep models, it is even more (Z. Fu); yingqiang.ge@rutgers.edu (Y. Ge); xu.chen@ruc.edu.cn challenging to integrate an explainable method into the (X. Chen); yongfeng.zhang@rutgers.edu (Y. Zhang) original design while maintaining recommendation per- ~ https://zuohuif.github.io/ (Z. Fu); https://yingqiangge.github.io/ (Y. Ge); http://xu-chen.com/ (X. Chen); http://yongfeng.me formance [3]. In contrast, post-hoc models (a.k.a model- (Y. Zhang) agnostic explanation) consider the underlying recommen- © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). dation model as a black-box, and provide explanations CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) after the recommendation decision has been made. Al- able framework for a wide range of sequential though such explanations may not strictly follow the recommendations. exact mechanism that generated the corresponding rec- • We show that this framework can generate per- ommendations, they offer the flexibility to be applied to a sonalized post-hoc explanations based on item- wide range of recommendation models. Furthermore, the level causal rules. explanation model and recommendation model work sep- • We conduct several experiments on real-world arately, we obtain the benefit of explainability without data to demonstrate that our explanation model hurting the prediction performance. outperforms state-of-the-art baselines in terms of While it is still not fully understood what informa- fidelity. tion is useful to generate the explanations for a certain • We apply average causal effect to illustrate that recommendation result, Peake [7] argued that one can the causal explanations provided by our frame- provide post-hoc item-level explanations. Specifically, in- work are essential component for most sequential teracted items (the causes) in a user’s history can be used recommendation model. as explanations for the future item recommendations (the effect).The authors propose to solve this by association For the remainder of this paper, we first review related rule mining which finds co-occurred items as explanation. work in Section 2, and then introduce our model in Sec- However, explanations generated by association rules are tion 3. Experimental settings and results are provided in not personalized, i.e., different users would receive the Section 4. Finally, we conclude this work in Section 5. same explanation as long as the rules are only applied to their overlapped histories. This makes it incompatible with modern recommender systems, which aim to pro- 2. Related Work vide personalized services to users. Moreover, we believe that the true explanation of a recommendation model 2.1. 
Sequential Recommendation should be able to answer the questions like “which item Sequential recommendation takes into account the his- contribute to the system’s decision?” as well as “ Will torical order of items interacted by a user and aims to the system change the decision if a different set of items capture useful sequential patterns to make consecutive were purchased by the same user? ” In other words, predictions of the user’s future behaviors. Rendle et al. the explanation should be aware of the counterfactual [10] proposed Factorized Personalized Markov Chains world of the unobserved user histories and their corre- (FPMC) to combine Markov chain and matrix factoriza- sponding recommendation when analyzing the cause of tion for next basket recommendation. The Hierarchical a recommendation in real world. Representation Model (HRM) [11] further extended this In this paper, we explore a counterfactual analysis idea by leveraging representation learning as latent fac- framework to provide post-hoc causal explanations for tors in a hierarchical model. However, these methods can any given black-box sequential recommendation algo- only model the local sequential patterns of very limited rithm. Fig.1 shows an example to illustrate our intuition. number of adjacent records. To model multi-step sequen- Technically, we first create several counterfactual histo- tial behaviors, He et al. [12] adopted Markov chain to ries which are different but similar to the real history provide recommendations with sparse sequences. Later through a Variational Auto-Encoder (VAE) based per- on, the rapid development of representation learning turbation model, and obtain the recommendation for and neural networks introduced many new techniques the counterfactual data. Then we apply causal analysis that further pushed the research of sequential recom- on the combined data to extract causal rules between a mendation to a new level. For example, Hidasi et. al. user’s history and future behaviors as explanations. Un- [13] used an RNN-based model to learn the user his- like other explainable recommendation models [4, 8, 9] tory representation, Yu et. al. [14] provided a dynamic that focus on persuading users to keep engaged with recurrent model, Li et. al. [15] proposed an attention- the system, this type of explanation focuses on model based GRU model, Chen et. al. [16] developed user- and transparency and finds out the true reason or the most item-level memory networks, and Huang et. al. [17] fur- essential item that leads to a specific recommendation. ther integrated knowledge graphs into memory networks. Therefore, instead of taking user studies or online evalu- However, most of the models exhibit complicated neural ations to evaluate the persuasiveness or effectiveness of network architectures, and it is usually difficult to inter- explanations, we use the average causal effect to measure pret their prediction results. To make up for this, we plan whether the item used for explanation can explain how to generate explanations for these black box sequential the system works. recommendation models. The key contributions of this paper are as follows: • We design and study a counterfactual explain- 2.2. Explainable Recommendation and reinforcement learning [43], etc. With respect to recommendation tasks, large amount of work is about Explainable recommendation focuses on developing mod- how to achieve de-bias matrix factorization with causal els that can generate not only high-quality recommen- inference. 
The probabilistic approach ExpoMF proposed dations but also intuitive explanations, which help to in [44] directly incorporated user exposure to items into improve the transparency of the recommendation sys- collaborative filtering, where the exposure is modeled as tems [3]. Generally, the explainable models can be either a latent variable. Liang et. al. [45] followed to develop model-intrinsic or model-agnostic. As for model-intrinsic a causal inference approach to recommender systems approaches, lots of popular explainable recommendation which believed that the exposure and click data came methods, such as factorization models [4, 18, 9, 19], deep from different models, thus using the click data alone to learning models [20, 16, 21, 22], knowledge graph models infer the user preferences would be biased by the expo- [23, 5, 17, 24, 25, 26], explanation ranking models [27], sure data. They used causal inference to correct for this logical reasoning models [1, 28, 29], dynamic explana- bias for improving generalization of recommendation tion models [30, 31], visual explanation models [8] and systems to new data. Bonner et. al. [40] proposed a new natural language generation models [32, 33, 34] have domain adaptation algorithm which was learned from been proposed. A more complete review of the related logged data including outcomes from a biased recom- models can be seen in [3]. However, they mix the recom- mendation policy, and predicted recommendation results mendation mechanism with interpretable components, according to random exposure. Besides de-bias recom- which often results in over-complicated systems to make mendation, Ghazimatin et. al. [46] proposed PRINCE successful explanations. Moreover, the increased model model to explore counterfactual evidence for discovering complexity may reduce the interpretability. A natural causal explanations in a heterogeneous information net- way to avoid this dilemma is to rely on model-agnostic work. Differently, this paper focuses on learning causal post-hoc approaches so that the recommendation system rules to provide more intuitive explanation for the black- is free from the noises of the down-stream explanation box sequential recommendation models. Additionally, generator. Examples include [35] that proposed a bandit we consider [47] as a highly related work though it is approach, [36] that proposed a reinforcement learning originally proposed for natural language processing tasks. framework to generate sentence explanations, and [7] As we will discuss in the later sections, we utilize some that developed an association rule mining approach. Ad- of the key ideas of its model construction, and show why ditionally, some work distinguish the model explanations it works in sequential recommendation scenarios. by their purpose [37]: while persuasive explanations aim to improve user engagement, model explanation reflexes how the system really works and may not necessarily be 3. Proposed Approach persuasive. Our study fall into the later case and aims to find causal explanations for a given sequential recom- In this section, we first define the explanation problem mendation model. and then introduce our model as a combination of two parts: a VAE-based perturbation model that generates the 2.3. Causal Inference in Recommendation counterfactual samples for causal analysis, and a causal rule mining model that can extract causal dependencies Originated as statistical problems, causal inference [38, between the cause-effect items. 
39] aims at understanding and explaining the causal ef- fect of one variable on another. While the observational 3.1. Problem Setting data is considered as the factual world, causal effect in- ferences should be aware of the counterfactual world, We denote the set of users as 𝒰 = {𝑢1 , 𝑢2 , · · · , 𝑢|𝒰 | } thus often being regarded as the questions of "what-if". and set of items as ℐ = {𝑖1 , 𝑖2 . · · · , 𝑖|ℐ| }. Each user 𝑢 The challenge is that it is often expensive or even im- is associated with a purchase history represented as a possible to obtain counterfactual data. For example, it sequence of items ℋ𝑢 . The 𝑗-th interacted item in the is immoral to re-do the experiment on a patient to find history is denoted as 𝐻𝑗𝑢 ∈ ℐ. Without specification, the out what will happen if we have not given the medicine. calligraphic ℋ in the paper represents user history, and Though the majority of causal inference study resides in a straight 𝐻 represents an item. A black-box sequential the direction of statistics and philosophy, it has recently recommendation model ℱ : ℋ → ℐ is a function that attracted the attention from AI community for its great takes a sequence of items (as will discuss later, it can be power of explainablity and bias elimination ability. Ef- the counterfactual user history) as input and outputs the forts have managed to bring causal inference to several recommended item. In practice, the underlying mecha- machine learning areas, including recommendation [40], nism usually consists of two steps: a ranking function learning to rank [41], natural language processing [42], first scores all candidate items based on the user history, and then it selects the item with the highest score as the 3.2.1. Perturbation Model final output. Note that it only uses user-item interac- To capture the causal dependency between items in his- tion without any content or context information, and the tory and the recommended items, we want to know what scores predicted by the ranking function may differ ac- would take place if the user history had been different. To cording to the tasks (e.g. {1, . . . , 5} for rating prediction, avoid unknown influences caused by the length of input while [0, 1] for Click Through Rate (CTR) prediction). sequence (i.e., user history), we keep the input length Our goal is to find an item-level post-hoc model that cap- unchanged, and only replace items in the sequence to tures the causal relation between the history items and create counterfactual histories. Ideally, for each item the recommended item for each user. 𝐻𝑗𝑢 in a user’s history ℋ𝑢 , it will be replaced by all pos- Definition 1. (Causal Relation) For two variables 𝑋 and sible items in ℐ to fully explore the influence that 𝐻𝑗𝑢 𝑌 , if 𝑋 triggers 𝑌 , then we say that there is a causal re- makes in the history. However, the number of possible lation 𝑋 ⇒ 𝑌 , where 𝑋 is the cause and 𝑌 is the effect. combinations will become impractical for the learning system, since recommender systems usually deal with hundreds of thousands or even tens of millions items. In When a given recommendation model ℱ maps a user fact, counterfactual examples that are closest to the orig- history ℋ to a recommended item 𝑌 ∈ ℐ, all items inal input can be the most useful to a user as shown in 𝑢 𝑢 in ℋ𝑢 are considered as potential causes of 𝑌 𝑢 . Thus [48]. Therefore, we pursue a perturbation-based method we can formulate the set of causal relation candidates as that generate counterfactual examples, which replaces 𝒮 𝑢 = {(𝐻, 𝑌 𝑢 )|𝐻 ∈ ℋ𝑢 }. 
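To make the notation above concrete, the following is a minimal Python sketch (ours, not from the paper) of the objects defined in Section 3.1. The type aliases and the `candidate_causal_relations` helper are hypothetical names; the black-box recommender is assumed to expose only a history-in, top-1-item-out interface, as described in the problem setting.

```python
from typing import Callable, List, Tuple

ItemId = int
History = List[ItemId]                              # ordered user history H^u
BlackBoxRecommender = Callable[[History], ItemId]   # F: takes a history, returns the top-1 item

def candidate_causal_relations(history: History,
                               recommender: BlackBoxRecommender
                               ) -> Tuple[ItemId, List[Tuple[ItemId, ItemId]]]:
    """Build the candidate set S^u = {(H, Y^u) | H in H^u} of Section 3.1."""
    y_u = recommender(history)                  # observed recommendation Y^u
    candidates = [(h, y_u) for h in history]    # every history item is a potential cause
    return y_u, candidates
```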
items in the original user history ℋ𝑢 . Definition 2. (Causal Explanation for Sequential Rec- There are various ways to obtain the counterfactual ommendation Model) Given a causal relation candidate history, as long as they are similar to the real history. set 𝒮 𝑢 for user 𝑢, if there exists a true causal relation The simplest solution is randomly selecting an item in 𝑢 𝑢 (𝐻, 𝑌 ) ∈ 𝒮 , then the causal explanation for recom- ℋ 𝑢 and replacing it with a randomly selected item from mending 𝑌 𝑢 is described as “Because you purchased 𝐻, ℐ ∖ ℋ . However, user histories are far from random. 𝑢 the model recommends you 𝑌 𝑢 ”, denoted as 𝐻 ⇒ 𝑌 𝑢 . Thus, we assume that their exists a ground truth user history distribution, and we adopt VAE to learn such a dis- Then the remaining problem is to determine whether tribution. As is shown in Figure 2, we design a VAE-based a candidate pair is a true causal relation. perturbation method, which creates item sequences that We can mitigate the problem by allowing a likelihood are similar to but slightly different from a user’s genuine estimation for a candidate pair being a causal relation. history sequence, by sampling from a distribution in the Definition 3. (Causal Dependency) For a given candi- latent embedding space centered around the user’s true date pair of causal relation (𝐻, 𝑌 𝑢 ), the causal depen- history sequence. dency 𝜃𝐻,𝑌 𝑢 of that pair is the likelihood of the pair being In detail, the VAE component consists of a probabilis- a true causal relation. tic encoder (𝜇, 𝜎) = ENC(𝒳 ) and a decoder 𝒳 ˜ = DEC(𝑧). The encoder ENC(·) takes a sequence of item In other words, we would like to find a ranking func- embeddings 𝒳 into latent embedding space, and extracts tion that predicts the likelihood for each candidate pair, the variational information for the sequence, i.e., mean and the causal explanation is generated by selecting the and variance of the latent embeddings under independent pair with top ranking score from these candidates. One Gaussian distribution. The decoder DEC(·) generates a advantage of this formulation is that it allows the possibil- sequence of item embeddings 𝒳 ˜ given a latent embed- ity of giving no causal relation between a user’s history ding 𝑧 sampled from the Gaussian distribution. Here, and the recommended item, e.g., when algorithm rec- both 𝒳 and 𝒳 ˜ are ordered concatenations of pre-trained ommends the most popular items regardless of the user item embeddings based on pair-wise matrix factorization history. (BPR-MF) [49]. We follow the standard training regime of VAE by maximizing the variational lower bound of the 3.2. Causal Model for Post-Hoc Explanationdata likelihood [50]. Specifically, the reconstruction error involved in this lower bound is calculated by a softmax In this section, we introduce our counterfactual explana- across all items for each position of the input sequence. tion framework for recommendation. Inspired by [47], We observe that VAE can reconstruct the original data we divide our framework into two models: a perturbation set accurately, while offering the power of perturbation. model and a causal rule mining model. The overview of After pretraining ENC(·) and DEC(·), the variational the model framework is shown in Fig.2. nature of this model allows us to obtain counterfactual history ℋ̃ for any real history ℋ. More specifically, it first extracts the mean and variance of the encoded item Per tur bation M odel Causality M ining M odel Re al hi s t o r y ... 
Co u nt e r - m+1 Co u nt e r - f ac t u al f ac t u al pai r s Encoder hi s t o r y Sample m user Times Recommendation Causality Mining history ... ... Model ... Causal Dependencies Decoder Co u nt e r - Rank and Selection Co u nt e r - f ac t u al Per sonalized f ac t u al hi s t o r y ... Re s u l t Explanation ... Figure 2: Model framework. 𝑥 is the concatenation of the item embeddings of the user history. 𝑥 ˜ is the perturbed embedding. 𝑢 sequences in the latent space, and then the perturbation and output item 𝑌ˆ 𝑖 . We consider that the occurrence of model samples 𝑚 latent embeddings 𝑧 based on the above a single output can be modeled as a logistic regression variational information. These sampled embeddings 𝑧 on causal dependencies from all the input items in the are then passed to the decoder DEC(·) to obtain the sequence: perturbed versions 𝒳 ˜ . For now, each item embedding in 𝑛 ˜ may not represent an actual item since it is a sampled 𝑢 𝑢 (︁ ∑︁ )︁ 𝒳 𝑃 (𝑌ˆ 𝑖 |ℋ̂𝑖 ) = 𝜎 𝜃𝐻 ^𝑢 · 𝛾 ^ 𝑢 ,𝑌 𝑛−𝑗 (1) vector from the latent space, as a result, we find its nearest 𝑗=1 𝑖𝑗 𝑖 neighbor in the candidate item set ℐ ∖ ℋ through dot product similarity as the actual item. In this way, 𝒳 ˜ is where 𝜎 is the sigmoid function defined as 𝜎(𝑥) = transformed into the final counterfactual history ℋ̃. One (1 + exp(−𝑥))−1 in order to scale the score to [0, 1]. should keep in mind that the variance should be kept Additionally, in recommendation task, the order of a small during sampling, so that the resulting sequences user’s previously interacted items may affect their causal can be similar to the original sequence. dependency with the user’s next interaction. A closer Finally, the generated counterfactual data ℋ̃ together behavior tends to have a stronger effect on user’s future with the original ℋ will be injected into the black-box behaviors, and behaviors are discounted if they happened recommendation model ℱ to obtain the recommenda- earlier [13]. Therefore, we involve a weight decay param- tion results 𝑌˜ and 𝑌 , correspondingly. For any user 𝑢, eter 𝛾 to represent the time effect. Here 𝛾 is a positive after completing this process, we will have 𝑚 different value less than one. 𝑢 𝑢 counterfactual input-output pairs: {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚 For an input-output pair in 𝒟𝑢 , the probability of its 𝑖=1 , as well as the original pair (ℋ , 𝑌 ). Here the value of 𝑚 𝑢 𝑢 occurrence generated by Eq.(1) should be close to one. is manually set, but it cannot exceed the number of all As a result, we learn the causal dependencies 𝜃 by maxi- possible item combinations. mizing the probability over 𝒟𝑢 . When optimizing 𝜃, they are always initialized as zero to allow for no causation between two items. When learning this regression model, 3.2.2. Causal Rule Learning Model we are able to gradually increase 𝜃 until they converge to Denote 𝒟𝑢 as the combined records of counterfactual the point where the data likelihood of 𝒟𝑢 is maximized. 𝑢 𝑢 input-output pairs {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚 𝑖=1 and the original pair After gathering all the causal dependencies, we select (ℋ , 𝑌 ) for user 𝑢. We aim to develop a causal model 𝑢 𝑢 the items that have high 𝜃 scores to build causal explana- that first extracts causal dependencies between input and tions. This involves a three-step procedure. outputs items appeared in 𝒟𝑢 , and then selects the causal 1. We select those causal dependencies 𝜃𝐻 ^ 𝑢 ,𝑌 ^𝑢 rule based on these inferred causal dependencies. 
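The sketch below illustrates one plausible implementation of the causal dependency estimation in Eq. (1) for a single user and a single target output item. Item ids are assumed to be integer indices into the item set, records whose output differs from the target are treated as implicit negative examples, and the projected-gradient optimizer and its hyperparameters are our own assumptions rather than details given in the paper.

```python
import numpy as np

def learn_causal_dependencies(records, target_item, n_items, gamma=0.7,
                              lr=0.1, epochs=200):
    """Estimate theta_{H, target_item} for every item H from the combined
    records D^u = [(history, recommended_item), ...] of one user.

    Eq. (1):  P(Y | H) = sigmoid( sum_j theta_{H_j, Y} * gamma**(n - j) )
    Records whose output equals `target_item` are treated as positives
    (their probability should be close to one); the remaining counterfactual
    records act as implicit negatives.  This labeling choice and the plain
    projected-gradient optimizer are assumptions, not taken verbatim from the paper.
    """
    n = len(records[0][0])                       # history length
    X = np.zeros((len(records), n_items))        # time-discounted item indicators
    y = np.zeros(len(records))
    for i, (history, rec) in enumerate(records):
        for j, item in enumerate(history, start=1):
            X[i, item] = gamma ** (n - j)        # later positions get larger weight
        y[i] = 1.0 if rec == target_item else 0.0

    theta = np.zeros(n_items)                    # initialized at zero: no causation assumed
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ theta))     # sigmoid of Eq. (1)
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        theta = np.maximum(theta + lr * grad, 0) # keep causal dependencies non-negative
    return theta                                 # theta[h]: dependency of item h on target_item
```

With the learned `theta`, the selection procedure described next (keep dependencies whose output is the original recommendation, rank them, and take the top-k that appear in the user's real history) reduces to a simple sort over this vector.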
𝑢 𝑖𝑗 𝑖 𝑢 Let ℋ̂𝑖 = [𝐻 ˆ 𝑢𝑖1 , 𝐻 ˆ 𝑢𝑖2 , · · · , 𝐻 ˆ 𝑢𝑖𝑛 ] be the input sequence whose output is the original 𝑌 (i.e., 𝑌 𝑖 = 𝑌 𝑢 ). 𝑢 ˆ ˆ 𝑢𝑖𝑗 is the j-th item in Note that these (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) pairs may come from of the 𝑖-th record of 𝒟𝑢 , where 𝐻 𝑢 𝑢 either the original sequence or counterfactual se- ℋ̂𝑖 . Let 𝑌ˆ 𝑖 represent the corresponding output. Note quences, because when a counterfactual sequence that this includes the original real pair (ℋ𝑢 , 𝑌 𝑢 ). The is fed into the black-box recommendation model, model should be able to infer the causal dependency ˆ𝑢 the output may happen to be the same as the (refer to Definition 3) 𝜃𝐻 ^ 𝑢 ,𝑌 ^ 𝑢 between input item 𝐻 𝑖𝑗 𝑖𝑗 𝑖 original sequence 𝑌 𝑢 . Algorithm 1 Causal Explanation Model Table 1 Summary of the Datasets Input: users 𝒰, items ℐ, user history ℋ , 𝑢 counterfactual number 𝑚, black-box model ℱ, Dataset # users # items # interactions # train # test sparsity embedding model ℰ, causal mining model ℳ Movielens 943 1682 100,000 95,285 14,715 6.3% Output: causal explanations 𝐻 ⇒ 𝑌 𝑢 where 𝐻 ∈ ℋ𝑢 Amazon 573 478 13,062 9,624 3,438 4.7% 1: Use embedding model ℰ to get item embeddings ℰ(ℐ) 2: Use ℰ(ℐ) and true user history to train perturbation can serve as an intuitive explanation for the black-box model 𝒫 recommendation model. 3: for each user 𝑢 do 4: for 𝑖 from 1 to 𝑚 do 4.1. Dataset Description 𝑢 𝑢 𝑢 5: ℋ̃𝑖 ← 𝒫(ℋ𝑢 ); 𝑌˜ 𝑖 ← ℱ (ℋ̃𝑖 ) We evaluate our proposed causal explanation framework 6: end for against baselines on two datasets. The first dataset is 7: Construct counterfactual input-output pairs 𝑢 𝑢 MovieLens100k1 . This dataset consists of information {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚𝑖=1 𝑢 𝑢 𝑢 ˜𝑢 𝑚 about users, movies and ratings. In this dataset, each user 8: {(ℋ̂𝑖 , 𝑌ˆ 𝑖 )}𝑚+1 𝑢 𝑢 𝑖=1 ← {(ℋ̃𝑖 , 𝑌 𝑖 )}𝑖=1 ∪ (ℋ , 𝑌 ) has rated at least 20 movies, and each movie can belong to (︁ 𝑢 ˆ 𝑢 𝑚+1 )︁ several genres. The second dataset is the office product 9: 𝜃𝐻 ^ 𝑢 ← ℳ {(ℋ̂𝑖 , 𝑌 𝑖 )}𝑖=1 ^ 𝑢 ,𝑌 𝑖𝑗 𝑖 dataset from Amazon2 , which contains the user-item 10: Rank 𝜃𝐻 ^ 𝑢 ,𝑌 𝑢 and select top-𝑘 pairs interactions from May 1996 to July 2014. The original 𝑖𝑗 {(𝐻𝑗 , 𝑌 𝑢 )}𝑘𝑗=1 dataset is 5-core. To achieve sequential recommendation 11: if ∃𝐻min{𝑗} ∈ ℋ𝑢 then with input length of 5, we select the users with at least 12: Generate causal explanation 𝐻min{𝑗} ⇒ 𝑌 𝑢 15 purchases and the items with at least 10 interactions. 13: else Since our framework is used to explain sequential rec- 14: No explanation for the recommended item 𝑌 𝑢 ommendation models, we split the dataset chronologi- 15: end if cally. Further, to learn the pre-trained item embeddings 16: end for based on BPR-MF [49] (section 3.2.1), we take the last 17: return all causal explanations 𝐻 ⇒ 𝑌 𝑢 6 interactions from each user to construct the testing set, and use all previous interactions from each user as the training set. To avoid data leakage, when testing the 2. We sort the above selected causal dependencies black-box recommendation models and our VAE-based in descending order and take the top-𝑘 (𝐻 ˆ 𝑢𝑖𝑗 , 𝑌 𝑢 ) perturbation model, we only use the last 6 interactions pairs. of each user (i.e., the testing set of the pre-training stage). 3. 
If there exist one or more pairs in these top-𝑘 Following common practice, we adopt the leave-one-out ˆ 𝑢𝑖𝑗 appears in the user’s protocol, i.e., among the 6 interactions in test set, we use pairs, which cause item 𝐻 input sequence ℋ , then we pick such pair of the the last one for testing, and the previous five interactions 𝑢 highest rank, and construct 𝐻 ˆ 𝑢𝑖𝑗 ⇒ 𝑌 𝑢 as the will serve as input to the recommendation models. A causal explanation for the user. Otherwise, i.e., brief summary of the data is shown in Table 1. no cause item appears in the user history, then we output no causal explanation for the user. 4.2. Experimental Settings Note that the extracted causal explanation is personal- We adopt the following methods to train black-box se- ized since the algorithm is applied on 𝒟𝑢 , which only con- quential recommendation models and extract traditional tains records centered around the user’s original record association rules as comparative explanations. Mean- (ℋ𝑢 , 𝑌 𝑢 ), while collaborative learning among users is while, we further conduct different variants of the per- indirectly modeled by the VAE-based perturbation model. turbation model to analyze our model. We include both The overall algorithm is provided in Alg.1. For each user, shallow and deep models for the experiment. there are two phases - perturbation phase (line 4-7) and FPMC [10]: The Factorized Personalized Markov causal rule mining phase (line 8-15). Chain model, which combines matrix factorization and Markov chains to capture user’s personalized sequential behavior patterns for prediction3 . 4. Experiments 1 https://grouplens.org/datasets/movielens/ In this section, we conduct experiments to show what 2 https://nijianmo.github.io/amazon/ causal relationships our model can capture and how they 3 https://github.com/khesui/FPMC Table 2 Results of Model Fidelity. Our causal explanation framework is tested under the number of candidate causal explanations 𝑘 = 1. The association explanation framework is tested under support, confidence, and lift thresholds, respectively. The best fidelity on each column is highlighted in bold. Dataset Movielens 100k Amazon Models FPMC GRU4Rec NARM Caser FPMC GRU4Rec NARM Caser AR-sup 0.3160 0.1453 0.4581 0.1569 0.2932 0.1449 0.4066 0.2024 AR-conf 0.2959 0.1410 0.4305 0.1559 0.2949 0.1449 0.4031 0.1885 AR-lift 0.2959 0.1410 0.4305 0.1559 0.2949 0.1449 0.4031 0.1885 CR-AE 0.5631 0.7413 0.7084 0.6151 0.6981 0.8255 0.8970 0.7260 CR-VAE 0.9650 0.9852 0.9714 0.9703 0.9511 0.9721 0.9791 0.9599 GRU4Rec [13]: A session-based recommendation decoder are Multi-Layer Perceptrons (MLP) with two model, which uses recurrent neural networks – in partic- hidden layers, and each layer consists of 1024 neurons. ular, Gated Recurrent Units (GRU) – to capture sequential The only difference between our model and the vari- patterns for prediction4 . ant model CR-AE is that the variant model applies fixed NARM [15]: A sequential recommendation model normal distribution as variance instead of learned person- which utilizes GRU and attention mechanism to estimate alized variance. The default number of counterfactual the importance of each interactions5 . input-output pairs is 𝑚 = 500 on both datasets. The Caser [51]: The ConvolutionAl Sequence Embedding default time decay factor is 𝛾 = 0.7. We will discuss the Recommendation (Caser) model, which adopts convo- influence of counterfactual number 𝑚 and time decay lutional filters over recent items to learn the sequential factor 𝛾 in the experiments. 
patterns for prediction6 . In the following, we will apply our model and all base- AR-sup [7]: A post-hoc explanation model, which lines on the black-box recommendation models to evalu- extract association rules from interactions from all users ate and compare the generated explanations. In particu- and rank based on support value to generate item-level lar, we evaluate our framework from three perspectives. explanations. First, a explanation model should at least be able to offer AR-conf [7]: Extracting association rules and rank explanations for most recommendations, we will show it based on confidence value to get explanations. in the result (explanation fidelity). Second, if our model is AR-lift [7]: Rank based on lift value among extracted capable of generating explanations for most recommen- association rules to generate explanations. dations, we need to verify that the causal explanations CR-AE: A variant of our causal rule model which learned by our framework represent the key component applies fixed variance in hidden layer of AutoEncoder of recommendation mechanism (explanation quality). Fi- model as the perturbation model. Compared with our nally, since counterfactual examples are involved in our VAE-based perturbation model, this variant apply non- framework, our framework should be able to generate personalized variance. closer counterfactual examples (counterfactual quality). For black-box recommendation models FPMC, Additionally, we shed light on how our model differs GRU4Rec, NARM and Caser, we adopt their best from other models on statistical metrics. parameter selection in their corresponding public imple- mentation. For the association rule-based explanation 4.3. Model Fidelity model, we follow the recommendations in [7] to set the optimal parameters: support = 0.1, confidence A very basic purpose of designing a explanation model = 0.1, lift = 0.1, length = 2 for MovieLens100k, and is to generate explanations for most recommendations. support = 0.01, confidence = 0.01, lift = 0.01, length Therefore, an important evaluation measure for explana- = 2 for Amazon dataset due to its smaller scale. We tion models is model fidelity, i.e., what’s the percentage accept top 100 rules based on corresponding values (i.e. of the recommendation results can be explained by the support/confidence/lift) as explanations model [3]. The results of model fidelity are shown in For our causal rule learning framework, we set the Table 2. In this experiment, we only report the results of item embedding size as 16, both the VAE encoder and keeping the number of candidate causal explanations 𝑘 4 as 1 for our framework and variant. For the association https://github.com/hungthanhpham94/GRU4REC-pytorch 5 https://github.com/Wang-Shuo/Neural-Attentive-Session- rule explanation model (section 4.2), we apply the global Based-Recommendation-PyTorch association rules [7] ranking by support, confidence, and 6 https://github.com/graytowne/caser_pytorch lift, respectively. We can see that on both datasets, our causal explana- Here 𝑑𝑜() represents an external intervention, which tion framework is able to generate explanations for most forces a variable to take a specific value. Specifically, in of the recommended items (including the variant), while our case, for an extracted causal rule 𝐻 ⇒ 𝑌 𝑢 , we define 𝑢 the association explanation approach can only provide ex- the binary random variable as 1 if 𝐻 ∈ ℋ̃𝑖 , 0 else. We planations for significantly fewer recommendations. 
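As a concrete reading of the fidelity metric used above, the short sketch below (with a hypothetical `explain` callable) computes the share of users whose recommendation receives a non-empty explanation.

```python
def model_fidelity(users, explain):
    """Fraction of recommendations that receive an explanation.
    `explain(user)` is a hypothetical callable that returns a
    (cause_item, recommended_item) pair, or None when no causal
    explanation can be found for that user."""
    explained = sum(1 for u in users if explain(u) is not None)
    return explained / len(users)
```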
The also define another variable 𝑦 as a binary random variable, 𝑢 underlying reason is that association explanations have which is 1 if 𝑌˜ 𝑖 = 𝑌𝑢 , otherwise it will be 0. We then to be extracted based on the original input-output pairs, report average ACE on all generated explanations. Note which limits the number of pairs that we can use for rule that since the ACE value is used for causal related models, extraction. However, based on the perturbation model, we cannot report it on the association rule baseline. our causal explanation framework is capable of creating Suppose the perturbation model (section 3.2.1) cre- many counterfactual examples to assist causal rule learn- ates 𝑚 counterfactual input-output pairs for each user 𝑢: 𝑢 𝑢 𝑢 ing, which makes it possible to go beyond the limited 𝑖=1 . Here ℋ̃ is created by our perturbation {(ℋ̃𝑖 , 𝑌˜ 𝑖 )}𝑚 original data to extract causal explanations. Moreover, model (i.e. not observed in the original data), and thus 𝑢 when the number of input and output items are limited observing 𝐻 ∈ ℋ̃ implies we have 𝑑𝑜(𝑥 = 1) in ad- (e.g. five history items as input and the model recom- vance. Let 𝐻 ⇒ 𝑌 𝑢 be the causal explanation extracted mends only one item in our case), it is harder to match by the casual rule learning model (section 3.2.2). Then global rules with personal interactions and recommen- we estimate the ACE based on these 𝑚 counterfactual dation, which limits the flexibility of global association pairs as, rules. E[𝑦|𝑑𝑜(𝑥 = 1)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 1)) Another interesting observation is that GRU4Rec and 𝑢 Caser have significantly (𝑝 < 0.01) lower fidelity than #Pairs(𝐻 ∈ ℋ̃ ∧ 𝑌 = 𝑌 𝑢 ) = 𝑢 FPMC and NARM when explained by the association #Pairs(𝐻 ∈ ℋ̃ ) (2) model. This is reasonable because FPMC is a Markov- E[𝑦|𝑑𝑜(𝑥 = 0)] = Pr(𝑦 = 1|𝑑𝑜(𝑥 = 0)) based model that consider input as a basket and directly 𝑢 #Pairs(𝐻 ∈ / ℋ̃ ∧ 𝑌 = 𝑌 𝑢 ) learns the correlation between candidate items and each = 𝑢 items in a sequence, as a result, it is easier to extract asso- #Pairs(𝐻 ∈ / ℋ̃ ) ciation rules between inputs and outputs for the model. We report the ACE value of our model and variants in NARM combines the whole session information and in- Table.3. While showing the ACE value, we still keep the fluence of each individual item in the session, therefore, number of candidate causal explanations 𝑘 as 1. association rules which involve individual information We can see that our model can achieve higher ACE will be easier to be extracted for this model. However, value than the variant for most recommendation models it also means that the fidelity performance of the asso-on both dataset. But here we can observe an interesting ciation approach highly depends on the recommenda- results that the ACE value for FPMC model is much lower tion model being explained. Meanwhile, we see that our than other recommendation models (GRU4Rec, NARM, causal approach achieves comparably good fidelity on allCaser). Meanwhile, the variant model has slightly larger three recommendation models, because the perturbation ACE than our model when applying on FPC model. model is able to create sufficient counterfactual exam- The difference between FPMC and other recommen- ples to break the correlation of frequently co-occurringdation models is that FPMC is based on Markov chain items in the input sequence. This indicates the robust- that only considers the last behavior while other mod- ness of our causal explanation framework in terms of els involve the whole session information. 
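Eq. (2) amounts to a simple counting estimator over a user's m counterfactual pairs. The sketch below (function and variable names are ours) makes this explicit for one extracted causal rule.

```python
def average_causal_effect(counterfactual_pairs, cause_item, original_rec):
    """Estimate the ACE of the causal rule `cause_item => original_rec`
    from m counterfactual (history, recommendation) pairs, following Eq. (2):
    E[y | do(x=1)] - E[y | do(x=0)], where x = 1 iff cause_item is in the
    counterfactual history and y = 1 iff the recommendation equals original_rec."""
    with_cause = [(h, r) for h, r in counterfactual_pairs if cause_item in h]
    without_cause = [(h, r) for h, r in counterfactual_pairs if cause_item not in h]
    if not with_cause or not without_cause:
        return None  # one of the interventions never occurs; ACE is undefined here
    p1 = sum(r == original_rec for _, r in with_cause) / len(with_cause)
    p0 = sum(r == original_rec for _, r in without_cause) / len(without_cause)
    return p1 - p0
```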
For FPMC model fidelity. model, although we take a session as input to recom- mend next item, this model actually considers it as a 4.4. Average Causal Effect basket and linearly combines the influence of each item from the basket. In this case, every part of the session We then verify our causal explanations are true expla- will have independent influence towards next item pre- nations that explanation are important component for diction. So changing a small part of input items may not recommending original item. A common way is to mea- significantly change the next item prediction which high sure the causal effect on the outcome of the model[52]. likely results in same recommendation item. Based on First of all, we show the definition of Average Causal our experiment, when we keep counterfactual histories Effect. same for all recommendation models, FPMC model only Definition 4. (Average Causal Effect) The Average gets 98 counterfactual histories (19.6%) in average with Causal Effect (ACE) of a binary random variable 𝑥 on different recommendation (different from the recommen- another random variable 𝑦 is define as E[𝑦|𝑑𝑜(𝑥 = 1)] − dation item based on real history), while other models E[𝑦|𝑑𝑜(𝑥 = 0)] have at least 315 counterfactual histories (63%) in aver- age with different recommendation item. This difference Table 3 Table 4 Results of Average Causal Effect. Our causal explanation Results of Proximity. The value of proximity is calculated by framework is tested under the number of candidate causal Eq.(3) explanations 𝑘 = 1. Dataset Movielens 100k Dataset Movielens 100k Models FPMC GRU4Rec NARM Caser Models FPMC GRU4Rec NARM Caser CR-AE -22.69 -22.37 -22.35 -22.40 CR-AE 0.0184 0.1479 0.1108 0.1199 CR-VAE -17.35 -16.88 -16.83 -16.93 CR-VAE 0.0178 0.1862 0.1274 0.1388 Dataset Amazon Dataset Amazon Models FPMC GRU4Rec NARM Caser Models FPMC GRU4Rec NARM Caser CR-AE -21.83 -21.28 -21.20 -21.33 CR-AE 0.0230 0.1150 0.1101 0.1347 CR-VAE -18.01 -17.40 -17.31 -17.51 CR-VAE 0.0212 0.1434 0.1511 0.1563 makes the FPMC model has much lower ACE value com- pared with other recommendation models. Comparing our model with CR-AE, the variant model will generate less similar counterfactual histories which more likely result in different recommendation item than our model. Therefore, CR-AE has slightly higher ACE values than (a) Model Fidelity on Movielens (b) Model Fidelity on Amazon CR-VAE. Figure 3: Model fidelity on different time decay parameters 𝛾 . 𝑥-axis is the time decay parameter 𝛾 ∈ {0.1, 0.3, 0.7, 1} 4.5. Proximity and 𝑦 -axis is the model fidelity. The left side pictures are on Movielens and the right side pictures are on Amazon. As we mentioned before, counterfactual examples that are closest to the original can be the most useful to users. Similar with [48], we define the proximity as the distance between negative counterfactual examples (i.e. generate terfactual examples of our model have higher quality and recommendation item different from original item) and be more useful. original real history. Intuitively, a counterfactual example that close enough but get totally different results will be 4.6. Influence of Parameters more helpful. For a given user, the proximity can be expressed as In this section, we discuss the influence of two important parameters. 
The first one is time decay parameter 𝛾 – in 𝑢 our framework, when explaining the sequential recom- ∑︁ 𝑃 𝑟𝑜𝑥𝑖𝑚𝑖𝑡𝑦𝑢 = −𝑚𝑒𝑎𝑛( 𝑑𝑖𝑠𝑡(ℋ̃𝑖 , ℋ𝑢 )) (3) ˜ 𝑢 ̸=𝑌𝑢 𝑌 mendation models, earlier interactions in the sequence 𝑖 will have discounted effects to the recommended item. Here the distance is defined in latent space. The repre- A proper time decay parameter helps the framework to sentation of any history sequence is the concatenation reduce noise signals when learning patterns from the of the latent representation of each item in the sequence. sequence. The second parameter is the number of per- The latent representations of items are learned from pre- turbed input-output pairs 𝑚 – in our framework, we trained BPRMF [49] model. The distance of any two use perturbations to create counterfactual examples for sequence is defined as Euclidean distance between the causal learning, but there may exist trade-off between ef- representation of two sequence. The reported proximity ficiency and performance. We will analyze the influence value would be the average over all users. of these two parameters. Given that association rule model does not involve Time Decay Effect: Figure 3 shows the influence of 𝛾 counterfactual examples, this metric can only be reported on different recommendation models and datasets. From on our model and the variant model on both datasets, as the result we can see that the time decay effect 𝛾 indeed shown in Table.4 We can observe that our model can affects the model performance on fidelity. In particular, achieve higher proximity compared with the variant when 𝛾 is small, the previous interactions in a sequence model. In other words, counterfactual examples gen- are more likely to be ignored, which thus reduces the erated with learned latent variance is more similar with performance on model fidelity. When 𝛾 is large (e.g., real history. Therefore, higher proximity implies coun- 𝛾 = 1), old interactions will have equal importance with Association Rule Our Explanation Explanation unknown (a) Model Fidelity on Movielens (b) Model Fidelity on Amazon Figure 4: Model fidelity on different number of counterfac- Association Rule Our Explanation Explanation tual pairs 𝑚. 𝑥-axis is the number of counterfactual pairs 𝑚. 𝑦 -axis is model fidelity. Figure 5: A case study on MovieLens by the Caser model. The first movie for 𝑢1 is unknown in the dataset. latest interactions, which also hurts the performance. We can see from the results that the best performance is gives >95% confidence that the estimated probability er- achieved at about 𝛾 = 0.7 on both datasets. ror is <0.1. Number of Counterfactual Examples: Figure 4 shows the influence for the number of counterfactual 4.7. Case Study input-output pairs 𝑚. A basic observation from Figure 4 is that when 𝑚 increases, model fidelity will decrease first In this section, we provide a simple case study to com- and then increase. The underlying reason is as follows. pare causal explanations and association explanations. When 𝑚 is small, the variance of the counterfactual Compared with the association explanation model, our input-output pairs will be small, and fewer counterfac- model is capable of generating personalized explanations, tual items will be involved. Then the model is more likely which means that even if the recommendation model rec- to select original item as explanation. For example, sup- ommends the same item for two different users and the pose the original input-output pair is 𝐴, 𝐵, 𝐶 → 𝑌 . 
In users have overlapped histories, our model still has the the extreme case where 𝑚 = 1, we will have only one potential to generate different explanations for differ- counterfactual pair, e.g., 𝐴, 𝐵 ˜ , 𝐶 → 𝑌˜ . According to the ent users. However, the association model will provide causal rule learning model (section 3.2.2), if 𝑌˜ ̸= 𝑌 , then the same explanation since the association rules are ex- 𝐵 ⇒ 𝑌 will be the causal explanation since the change tracted based on global records. An example by the Caser of 𝐵 results in a different output, while if 𝑌˜ = 𝑌 , then [51] recommendation model on MovieLens100k dataset is either 𝐴 ⇒ 𝑌 or 𝐶 ⇒ 𝑌 will be the causal explanation shown in Figure 5, where two users with one commonly since their 𝜃 scores will be higher than 𝐵 or 𝐵 ˜ . In either watched movie (The Sound of Music) get exactly same case, the model fidelity and percentage of verified causal recommendation (Pulp Fiction). The association model rules will be 100%. However, in this case, the results do provides the overlapped movie as an explanation for the not present statistical meanings since they are estimated two different users, while our model can generate per- on a very small amount of examples. sonalized explanation for different users even when they When 𝑚 increases but not large enough, then random got the same recommendation. noise examples created by the perturbation model will re- duce the model fidelity. Still consider the above example, 5. Conclusions if many pairs with the same output 𝑌 are created, then the model may find other items beyond 𝐴, 𝐵, 𝐶 as the Recommender systems are widely used in our daily life. cause, which will result in no explanation for the origi- Effective recommendation mechanisms usually work nal sequence. However, if we continue to increase 𝑚 to through black-box models, resulting in the lack of trans- sufficiently large numbers, such noise will be statistically parency. In this paper, we extract causal rules from user offset, and thus the model fidelity and percentages will history to provide personalized, item-level, post-hoc ex- increase again. In the most ideal case, we would create all planations for the black-box sequential recommendation of the |ℋ||ℐ| sequences for causal rule learning, where models. The causal explanations are extracted through a |ℋ| is the number of item slots in the input sequence, and perturbation model and a causal rule learning model. We |ℐ| is the total number of items in the dataset. However, conduct several experiments on real-world datasets, and |ℋ||ℐ| would be a huge number that makes it compu- apply our explanation framework to several state-of-the- tational infeasible for causal rule learning. In practice, art sequential recommendation models. Experimental we only need to specify 𝑚 sufficiently large. Based on results verified the quality and fidelity of the causal ex- Chebyshev’s Inequality, we find that 𝑚 = 500 already planations extracted by our framework. In this work, we only considered item-level causal [9] N. Wang, H. Wang, Y. Jia, Y. Yin, Explainable recom- relationships, while in the future, it would be interesting mendation via multi-task learning in opinionated to explore causal relations on feature-level external data text data, in: The 41st International ACM SIGIR such as textual user reviews, which can help to generate Conference on Research & Development in Infor- finer-grained causal explanations. mation Retrieval, ACM, 2018. [10] S. Rendle, C. Freudenthaler, L. 
Schmidt-Thieme, Factorizing personalized markov chains for next- Acknowledgments basket recommendation, in: Proceedings of the 19th international conference on World wide web, This work was partly supported by NSF IIS-1910154 and ACM, 2010, pp. 811–820. IIS-2007907. Any opinions, findings, conclusions or rec- [11] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, X. Cheng, ommendations expressed in this material are those of Learning hierarchical representation model for the authors and do not necessarily reflect those of the nextbasket recommendation, in: Proceedings of the sponsors. 38th International ACM SIGIR conference on Re- search and Development in Information Retrieval, References ACM, 2015, pp. 403–412. [12] R. He, J. McAuley, Fusing similarity models with [1] H. Chen, S. Shi, Y. Li, Y. Zhang, Neural collaborative markov chains for sparse sequential recommenda- reasoning, in: Proceedings of the Web Conference tion, in: 2016 IEEE 16th International Conference 2021, 2021, pp. 1516–1527. on Data Mining (ICDM), IEEE, 2016, pp. 191–200. [2] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning [13] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, based recommender system: A survey and new Session-based recommendations with recurrent perspectives, ACM Computing Surveys (CSUR) 52 neural networks, in: International Conference on (2019) 1–38. Learning Representations, 2016. [3] Y. Zhang, X. Chen, Explainable recommendation: [14] F. Yu, Q. Liu, S. Wu, L. Wang, T. Tan, A dynamic A survey and new perspectives, Foundations and recurrent model for next basket recommendation, Trends® in Information Retrieval (2020). in: Proceedings of the 39th International ACM SI- [4] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, S. Ma, GIR conference on Research and Development in Explicit factor models for explainable recommenda- Information Retrieval, ACM, 2016, pp. 729–732. tion based on phrase-level sentiment analysis, in: [15] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, J. Ma, Neu- Proceedings of the 37th international ACM SIGIR ral attentive session-based recommendation, in: conference on Research & development in informa- Proceedings of the 2017 ACM on Conference on tion retrieval, ACM, 2014, pp. 83–92. Information and Knowledge Management, ACM, [5] Y. Xian, Z. Fu, S. Muthukrishnan, G. De Melo, 2017, pp. 1419–1428. Y. Zhang, Reinforcement knowledge graph reason- [16] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, ing for explainable recommendation, in: Proceed- H. Zha, Sequential recommendation with user mem- ings of the 42nd International ACM SIGIR Confer- ory networks, in: Proceedings of the eleventh ACM ence on Research and Development in Information international conference on WSDM, 2018, pp. 108– Retrieval, 2019, pp. 285–294. 116. [6] A. Theodorou, R. H. Wortham, J. J. Bryson, De- [17] J. Huang, W. X. Zhao, H. Dou, J.-R. Wen, E. Y. signing and implementing transparency for real Chang, Improving sequential recommendation time inspection of autonomous robots, Connection with knowledge-enhanced memory networks, in: Science 29 (2017) 230–241. The 41st International ACM SIGIR Conference on [7] G. Peake, J. Wang, Explanation mining: Post hoc Research & Development in Information Retrieval, interpretability of latent factor models for recom- ACM, 2018, pp. 505–514. mendation systems, in: Proceedings of the 24th [18] J. Chen, F. Zhuang, X. Hong, X. Ao, X. Xie, Q. He, ACM SIGKDD International Conference on Knowl- Attention-driven factor model for explainable per- edge Discovery & Data Mining, 2018. 
sonalized recommendation, in: The 41st Interna- [8] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin, tional ACM SIGIR Conference on Research & De- H. Zha, Personalized fashion recommendation with velopment in Information Retrieval, ACM, 2018, pp. visual explanations based on multimodal attention 909–912. network: Towards visually explainable recommen- [19] X. Chen, Z. Qin, Y. Zhang, T. Xu, Learning to rank dation, in: Proceedings of the 42nd International features for recommendation over multiple cate- ACM SIGIR Conference on Research and Develop- gories, in: Proceedings of the 39th International ment in Information Retrieval, 2019, pp. 765–774. ACM SIGIR conference on Research and Develop- ment in Information Retrieval, 2016, pp. 305–314. (TOIS) 38 (2019) 1–29. [20] S. Seo, J. Huang, H. Yang, Y. Liu, Interpretable [32] L. Li, Y. Zhang, L. Chen, Personalized transformer convolutional neural networks with dual local and for explainable recommendation, ACL (2021). global attention for review rating prediction, in: [33] L. Li, Y. Zhang, L. Chen, Generate neural template Proceedings of the Eleventh ACM Conference on explanations for recommendation, in: Proceed- RecSys, 2017, pp. 297–305. ings of the 29th ACM International Conference on [21] C. Li, C. Quan, L. Peng, Y. Qi, Y. Deng, L. Wu, A cap- Information & Knowledge Management, 2020, pp. sule network for recommendation and explaining 755–764. what you like and dislike, in: Proceedings of the [34] H. Chen, X. Chen, S. Shi, Y. Zhang, Generate natural 42nd International ACM SIGIR Conference on Re- language explanations for recommendation, SIGIR search and Development in Information Retrieval, 2019 Workshop on ExplainAble Recommendation ACM, 2019, pp. 275–284. and Search (2019). [22] F. Costa, S. Ouyang, P. Dolog, A. Lawlor, Automatic [35] J. McInerney, B. Lacker, S. Hansen, K. Higley, generation of natural language explanations, in: H. Bouchard, A. Gruson, R. Mehrotra, Explore, Proceedings of the 23rd International Conference exploit, and explain: personalizing explainable rec- on Intelligent User Interfaces Companion, ACM, ommendations with bandits, in: Proceedings of the 2018, p. 57. 12th ACM Conference on Recommender Systems, [23] Q. Ai, V. Azizi, X. Chen, Y. Zhang, Learning hetero- ACM, 2018, pp. 31–39. geneous knowledge base embeddings for explain- [36] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, X. Xie, able recommendation, Algorithms 11 (2018) 137. A reinforcement learning framework for explain- [24] Z. Fu, Y. Xian, R. Gao, J. Zhao, Q. Huang, Y. Ge, S. Xu, able recommendation, in: 2018 IEEE International S. Geng, C. Shah, Y. Zhang, et al., Fairness-aware ex- Conference on Data Mining (ICDM), IEEE, 2018, pp. plainable recommendation over knowledge graphs, 587–596. SIGIR (2020). [37] N. Tintarev, Explanations of recommendations, [25] W. Ma, M. Zhang, Y. Cao, W. Jin, C. Wang, Y. Liu, in: Proceedings of the 2007 ACM conference on S. Ma, X. Ren, Jointly learning explainable rules for Recommender systems, 2007, pp. 203–206. recommendation with knowledge graph, in: The [38] J. Pearl, Causality: models, reasoning and inference, World Wide Web Conference, 2019, pp. 1210–1221. volume 29, Springer, 2000. [26] Y. Xian, Z. Fu, H. Zhao, Y. Ge, X. Chen, Q. Huang, [39] G. W. Imbens, D. B. Rubin, Causal inference in statis- S. Geng, Z. Qin, G. De Melo, S. Muthukrishnan, tics, social, and biomedical sciences, Cambridge et al., Cafe: Coarse-to-fine neural symbolic reason- University Press, 2015. ing for explainable recommendation, in: Proceed- [40] S. 
Bonner, F. Vasile, Causal embeddings for rec- ings of the 29th ACM International Conference on ommendation, in: Proceedings of the 12th ACM Information & Knowledge Management, 2020, pp. Conference on Recommender Systems, ACM, 2018. 1645–1654. [41] T. Joachims, A. Swaminathan, T. Schnabel, Un- [27] L. Li, Y. Zhang, L. Chen, Extra: Explanation ranking biased learning-to-rank with biased feedback, in: datasets for explainable recommendation, SIGIR Proceedings of the Tenth ACM International Con- (2021). ference on Web Search and Data Mining, ACM, [28] S. Shi, H. Chen, W. Ma, J. Mao, M. Zhang, Y. Zhang, 2017, pp. 781–789. Neural logic reasoning, in: Proceedings of the 29th [42] Z. Wood-Doughty, I. Shpitser, M. Dredze, Chal- ACM International Conference on Information & lenges of using text classifiers for causal inference, Knowledge Management, 2020, pp. 1365–1374. in: Proceedings of the 2018 Conference on Empiri- [29] Y. Zhu, Y. Xian, Z. Fu, G. de Melo, Y. Zhang, Faith- cal Methods in Natural Language Processing, 2018, fully explainable recommendation via neural logic pp. 4586–4598. reasoning, in: Proceedings of the 2021 Conference [43] L. Buesing, T. Weber, Y. Zwols, S. Racaniere, of the North American Chapter of the Association A. Guez, J.-B. Lespiau, N. Heess, Woulda, coulda, for Computational Linguistics: Human Language shoulda: Counterfactually-guided policy search, in: Technologies, 2021, pp. 3083–3090. ICLR, 2019. [30] X. Chen, Y. Zhang, Z. Qin, Dynamic explainable [44] D. Liang, L. Charlin, J. McInerney, D. M. Blei, Mod- recommendation based on neural attentive mod- eling user exposure in recommendation, in: Pro- els, in: Proceedings of the AAAI Conference on ceedings of the 25th WWW, 2016. Artificial Intelligence, volume 33, 2019, pp. 53–60. [45] D. Liang, L. Charlin, D. M. Blei, Causal inference [31] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable for recommendation, in: Causation: Foundation to product search with a dynamic relation embedding Application, Workshop at UAI, 2016. model, ACM Transactions on Information Systems [46] A. Ghazimatin, O. Balalau, R. Saha Roy, G. Weikum, Prince: provider-side interpretability with counter- factual explanations in recommender systems, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 196–204. [47] D. Alvarez-Melis, T. S. Jaakkola, A causal frame- work for explaining the predictions of black-box sequence-to-sequence models, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017). [48] R. K. Mothilal, A. Sharma, C. Tan, Explaining ma- chine learning classifiers through diverse counter- factual explanations, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Trans- parency, 2020. [49] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt- Thieme, Bpr: Bayesian personalized ranking from implicit feedback, UAI (2012). [50] D. P. Kingma, M. Welling, Auto-encoding varia- tional bayes, 2014. [51] J. Tang, K. Wang, Personalized top-n sequential recommendation via convolutional sequence em- bedding, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, ACM, 2018, pp. 565–573. [52] R. Moraffah, M. Karami, R. Guo, A. Raglin, H. Liu, Causal interpretability for machine learning- problems, methods and evaluation, ACM SIGKDD Explorations Newsletter 22 (2020) 18–33.