Utilising Crowdsourcing to Assess the Effectiveness of Item-based Explanations of Merchant Recommendations

Denis Krasilnikov1,3, Oleg Lashinin1, Maksim Tsygankov1, Marina Ananyeva1,2 and Sergey Kolesnikov1
1 Tinkoff, 2-Ya Khutorskaya Ulitsa, 38A, bld. 26, Moscow, 117198, Russian Federation
2 National Research University Higher School of Economics, Myasnitskaya Ulitsa, 20, Moscow, 101000, Russian Federation
3 Lomonosov Moscow State University, Ulitsa Kolmogorova, 1, bld. 52, Moscow, 119234, Russian Federation

WSDM 2023 Crowd Science Workshop on Collaboration of Humans and Learning Algorithms for Data Labeling, March 3, 2023, Singapore
d.i.krasilnikov@tinkoff.ru (D. Krasilnikov); o.a.lashinin@tinkoff.ru (O. Lashinin); m.r.tsygankov@tinkoff.ru (M. Tsygankov); m.ananyeva@tinkoff.ru (M. Ananyeva); scitator@gmail.com (S. Kolesnikov)
ORCID: 0009-0009-6864-392X (D. Krasilnikov); 0000-0001-8894-9592 (O. Lashinin); 0000-0002-9885-2230 (M. Ananyeva); 0000-0002-4820-987X (S. Kolesnikov)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The explainability of recommendations is a common research topic among researchers and providers of recommender systems. Numerous approaches and inference types have been developed to find explanations for recommendations. For example, we can send users a recommendation with the following explanation: "Since you recently made a purchase from merchant X, we suggest merchant Y". A variety of methods can be used to produce the (X, Y) item pairs behind this explanation logic. However, some users might not understand the logical connection between the recommendation Y and the explanation X. In this study, we validate 23,000 recommendation explanations with the help of 400 crowdworkers. Additionally, we suggest a novel method for evaluating the quality of (X, Y) explanation pairs based on the crowdworkers' responses. Finally, we evaluate 9 different approaches and report several interesting findings. We hope that, in future research, our method will be expanded upon and further studied for additional types of explanations and domains.

Keywords
recommender systems, explainable recommendations, evaluation study, crowdsourcing

1. Introduction
These days, recommender systems are an integral part of many areas of people's lives. They are capable of influencing people's choice of films, clothing, tourist destinations, food- and health-related habits, and much more. As a rule, such algorithms leverage users' previous actions in order to show them items that can potentially be interesting to them. Content that is personalized in such a way is attractive to users, driving them to interact more with various online stores, dating services, music and video platforms, and other services.

Recent works highlight the main problems of merchant recommendations. For instance, some banks provide merchant reward systems [1, 2, 3]. When bank clients make transactions with particular merchants, they receive cashback automatically. With many such offers available, personalization of the rewards section increases user benefit [1]. In order to do so, it is possible to find the most relevant offers for each individual user based on their transaction history. For example, some recent works demonstrated experimental results on different real-world transaction datasets [4, 5, 6].
These works showed that recommendation models are capable of accurately predicting users' future behavior by analysing past transactions. However, it is important to note that providing explanations for such merchant recommendations is not well studied yet.

In this paper, we research the problem of offering users explanations alongside merchant recommendations. Formally, we have a dataset D with transaction histories of anonymous users with a number of merchants. Our task is to not only suggest the most appropriate merchants for each user, but to also explain each personalized suggestion. There are various types of explanations [7] for results provided by recommender systems. One of them is item-based explanations, where textual patterns consist of a few items connected by certain conditions. For instance, we can show the user the following message: "We recommend you merchant Y because you purchased from merchant X". This communication can be fully defined by the (X, Y) merchant pair. Both merchants should be represented in the dataset D. The item Y is obtained from a recommender model M built on historical transactions. The item X must be in the history of the user for whom we are providing the recommendation explanation. Otherwise, the statement in the communication will obviously be wrong.

We chose this method due to a number of advantages. Firstly, it does not require additional knowledge about merchants. Secondly, it is quite simple to implement in an interface for testing with real users. Finally, there are many different approaches and heuristics to retrieve the (X, Y) merchant pairs. However, not all pairs may be valid. For example, a merchant pair consisting of a bar and a children's store can be perceived negatively by real users. To avoid such situations, we suggest pre-screening merchant pairs using crowdsourcing platforms. If some of them look questionable together when considered by real users, then they should be filtered out using additional labeling. This idea is illustrated in Figure 1.

In this paper, we research the validation of X → Y pairs for further use in recommender systems for real clients. We provide an extensive survey covering 23,000 merchant pairs: 400 crowdworkers share their opinions as if they were seeing these pairs in a real recommender system scenario. Based on the results of these surveys, we evaluate 9 approaches for explainable recommendations. This helps us estimate the quality of recommendation explanations in offline experiments based on the opinions of real people. Specifically, the main contributions of this paper can be listed as follows:

• We have asked 400 real people on a crowdsourcing platform to evaluate 23,000 explanation pairs. We share our results in an anonymized dataset.
• We describe a new way to evaluate explanations of merchant recommendations based on users' feedback. Our approach makes it possible to separate the development of algorithms and the evaluation of explanation quality into independent processes. As a result, we can collect user feedback once and then use it many times for different approaches.
• We provide the results of experiments with 9 recommender algorithms as well as heuristics that generate the most appropriate explanation pairs (X, Y). We demonstrate how the ranking quality of these pairs matches the opinions of real people.

Figure 1: In order to provide explanations for the (X, Y) merchant pair, we suggest validating such pairs using crowdsourcing platforms. This will help avoid questionable pairs and possible negative feedback from users.
2. Related Work
Explanations based on X → Y pairs can be considered a relatively well-studied topic in the research community. For instance, iALS [8] is capable of not only generating high-quality recommendations [9], but can also provide X → Y explanations. To do so, its optimization algorithm learns the weights of item-item similarities. Moreover, iALS takes into account the significance of interactions to each user. Another work [10] demonstrated a way of mining post-hoc explanations. This method is based on association rules and can be applied to any latent factor recommender model. A recent study [11] used a causal rule learning model to retrieve personal post-hoc explanations. Finally, there are studies that analyze the influence of input data on output scores [12, 13].

Crowdsourcing can be very helpful in the field of explainable recommendations. For instance, crowdworkers can generate textual explanations [14], provide information about cold-start items [15] and evaluate the explanations offered for recommendations [16]. In a recent work [17], crowdworkers helped improve the quality of recommendations via a human-in-the-loop framework. The crowdsourced opinions helped increase the accuracy of personalized suggestions. Moreover, according to [18, 19], crowdworkers are capable of evaluating different explanations in flexible experiment settings.

To the best of our knowledge, explainable merchant recommendations are not well studied in the broad research community. Recent works tend to consider only the quality of merchant recommendations [4, 5, 6]. However, we are of the opinion that the explainability of such models is a topic that is worth investigating in future works.

3. Dataset Collection
In this section, we describe our approach to collecting and processing data. The TTRS dataset (Tinkoff Transaction Recommender System) served as the source for all of our data. This open-source dataset with detailed statistics was provided in [5]. TTRS contains real clients' transactions with merchants such as brands, retail chains and services. In the open-source version of the dataset, each transaction contains the user id, merchant id (both id columns are anonymized), merchant category and transaction timestamp.

As this dataset was provided by us, we can enrich it with additional information. Concretely, the merchant brand names were used in our crowdsourcing experiments. We chose only a set TOP of the 350 most popular merchants. According to the original data, these merchants accounted for about 87% of all transactions. The specifications of this dataset can be found in Table 1.

Table 1: Descriptive statistics of the transaction dataset.
Version                #users    #merchants    #transactions    density
Original               88,800    5,400         12M              2.51%
After preprocessing    88,500    350           10M              33.32%

Additionally, we create a square sparse matrix C with a shape of (350, 350) to represent the "relevance" of item pairs. In order to collect this data, we sampled around 23,000 unique (X, Y) pairs randomly, where X ≠ Y and X, Y ∈ TOP. This number was determined by our study's limited budget. Since it is possible to create 350 × 349 = 122,150 ordered pairs in total, we selected about 19% of all possible pairs.

To conduct our survey, we used an internal crowdsourcing platform at Tinkoff. Since machine learning algorithms require a large amount of labeled data [20], machine learning developers can use this platform for their needs. The Tinkoff crowdsourcing platform makes it possible to label datasets based on people's opinions. We used this platform to ask crowdworkers the following question: "We recommend you merchant Y because you bought from merchant X. Based on this explanation, do you think the model works correctly?". This question is preceded by a brief description of both merchants. In addition, respondents were given instructions before starting to answer the questions. The instructions asked crowdworkers to imagine a situation in which they actually made a purchase from merchant X and afterwards received a communication recommending merchant Y. The wording of the question was chosen based on the analysis of previous works [18, 19, 16]. The basic intuition is that if the (X, Y) pair seems illogical, the crowdworker will answer in the negative. If the pair is perceived as logical by the respondent, then we will get a positive answer from them.

During the experiment, we asked 400 people to respond to our questions. To improve the quality of the experiment, we asked five different crowdworkers to answer every question. This made it possible for us to determine how many votes are needed to consider a recommendation explanation correct (3 out of 5, 4 out of 5, or 5 out of 5). We received 2,350 pairs where at least 3 out of 5 people said that the model works correctly. Thus, the cell C[X][Y] of matrix C equals 1 if at least 3 out of 5 respondents label the pair as correct. Otherwise, C[X][Y] is 0.
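To make this labeling step concrete, below is a minimal Python sketch of how the matrix C could be assembled from the collected votes. It assumes the answers are already grouped as a mapping from (X, Y) merchant-index pairs to lists of binary worker answers; the names and the use of -1 for unlabeled pairs are illustrative choices for this sketch, not part of the original pipeline (the paper stores C as a sparse matrix).

import numpy as np

N_MERCHANTS = 350       # size of the TOP merchant set
MIN_POSITIVE_VOTES = 3  # at least 3 of 5 workers must accept the pair

def build_label_matrix(votes):
    """Build the ground-truth matrix C from crowdsourced votes.

    votes: dict mapping (x, y) merchant-index pairs to lists of 0/1 answers
           (five answers per pair in this setup).
    Unlabeled pairs are marked with -1 so they can be told apart from pairs
    that were labeled as invalid (0).
    """
    C = np.full((N_MERCHANTS, N_MERCHANTS), -1, dtype=np.int8)
    for (x, y), answers in votes.items():
        C[x, y] = 1 if sum(answers) >= MIN_POSITIVE_VOTES else 0
    return C

# Example: a pair accepted by 4 of 5 workers and a pair rejected by all workers.
C = build_label_matrix({(0, 1): [1, 1, 0, 1, 1], (2, 3): [0, 0, 0, 0, 0]})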
4. Evaluating Explanation Quality
In this section, we describe a new method of evaluating X → Y explanations. We summarize our notation in Table 2.

Table 2: Notation used in the paper.
U            a set of users
I            a set of items
u, i, t      a particular user/item/timestamp, respectively
X            an item from the user's history, used to explain a recommendation
Y            an item to be recommended to the user
B_{u,i,t}    a set of transactions from the dataset
S_{u,i,t}    a sample of B_{u,i,t} with distinct (u, i) pairs
K(u)         the number of items consumed by user u
r_{u,i,t}    an ordered set of transactions made by user u, with timestamps
I_u          an unordered set of items consumed by user u
M            a fitted recommender model
E            an approach ranking (X, Y) explanation pairs
C            a matrix which contains aggregated (majority-vote) answers from respondents
s_{u,X,Y}    a set of scores for ranking (X, Y) pairs for a user u

Let us assume that we have a set of users U and a set of items I = {i_1, …, i_|I|}. The training part of the dataset can be represented by the set of transactions B_{u,i,t} = {B_{u_k,i_k,t_k}}, where u_k, i_k, t_k are the user, item and timestamp, respectively. For simplicity, we consider only the subset S_{u,i,t} ⊂ B_{u,i,t} that keeps, for each (u, i) pair, only the transaction with the maximum timestamp. As a result, the recommender system only needs to predict new merchants for users; the modeling of recurring transactions remains to be researched in future works.

A user u interacted with a set of K(u) unique items r_{u,i,t} = {B_{u,1,t_1}, …, B_{u,K(u),t_{K(u)}}}, ordered by timestamps. Let us assume that there is a method E that can generate explainability scores s_{u,X,Y} = {s_{u,1,Y}, …, s_{u,K(u),Y}}. If we have items i and j with s_{u,i,Y} < s_{u,j,Y}, we will explain the recommendation of Y with item j. It is therefore important that valid pairs receive higher scores and invalid (illogical) pairs receive lower scores.

Therefore, it is possible to compute ranking quality metrics under s_{u,X,Y} for some subset of (X, Y) pairs and users from U. The proposed approach is described in detail in Algorithm 1. The key idea of this method is to take all the items user u interacted with. Then, we keep only the pairs of merchants that meet the following conditions: (a) Y was recommended, (b) X is in the user's purchase history, (c) the (X, Y) pair was validated by the respondents. If for user u there are at least two candidates x_k for explaining a recommended item Y, we can sort these candidates by s_{u,x_k,Y}. Finally, ranking quality metrics compare the sorted lists of candidates with people's opinions. Higher metric values indicate that a method E can effectively retrieve explanation pairs, while low values may indicate that users might dislike certain explanations because they find them incorrect.

Algorithm 1: The algorithm proposed for evaluating explanation pairs
Data: a ground-truth matrix C, a set of transactions r_{u,i,t} for each user from U, a trained recommender model M, a method E for explanation generation, selected ranking quality metrics.
Result: calculated ranking quality metrics
foreach user u ∈ U do
    generate top-K recommendations Y_j with model M
    foreach y_j ∈ Y_j do
        compute s_{u,x_k,y_j} = {s_{u,x_q,y_j} | q = 1, …, K(u) ∧ C[x_q][y_j] ∈ {0, 1}}
        if |s_{u,x_k,y_j}| = 0 then
            continue
        else
            let g_{u,x_k} = {C[x_q][y_j] | x_q ∈ items from r_{u,i,t} with C[x_q][y_j] ∈ {0, 1}}
            sort s_{u,x_k,y_j} and g_{u,x_k} by s_{u,x_k,y_j}
            foreach metric ∈ metrics do
                calculate metric(s_{u,x_k,y_j}, g_{u,x_k})
            end
        end
    end
end
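For readers who prefer code, the following Python sketch mirrors Algorithm 1 under a few simplifying assumptions: the matrix C uses -1 for unlabeled pairs (as in the sketch in Section 3), recommend stands in for the trained model M, score stands in for the explanation method E, and NDCG is used as the example ranking metric. All function and argument names are illustrative rather than part of the original implementation.

import numpy as np

def ndcg_at_k(gains, k):
    """Binary-gain NDCG for a ranked list of 0/1 relevance labels."""
    gains = np.asarray(gains[:k], dtype=float)
    if gains.sum() == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    ideal = np.sort(gains)[::-1]
    return float((gains * discounts).sum() / (ideal * discounts).sum())

def evaluate_explanations(users, history, recommend, score, C, k_values=(1, 3, 10)):
    """Sketch of Algorithm 1.

    users     : iterable of user ids
    history   : dict user -> list of item indices the user interacted with
    recommend : callable user -> list of top-K recommended item indices (model M)
    score     : callable (user, x, y) -> explainability score s_{u,x,y} (method E)
    C         : (|I|, |I|) matrix with 1 = valid, 0 = invalid, -1 = not labeled
    """
    results = {f"NDCG@{k}": [] for k in k_values}
    for u in users:
        for y in recommend(u):
            # keep only history items whose (x, y) pair was labeled by crowdworkers
            candidates = [x for x in history[u] if C[x, y] >= 0]
            if not candidates:
                continue
            # rank candidates by the explanation scores, best first
            ranked = sorted(candidates, key=lambda x: score(u, x, y), reverse=True)
            gains = [C[x, y] for x in ranked]
            for k in k_values:
                results[f"NDCG@{k}"].append(ndcg_at_k(gains, k))
    return {name: float(np.mean(vals)) for name, vals in results.items() if vals}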
5. Methods to Rank Explanation Pairs
In this section, we briefly describe the methods for generating explanation pairs which we included in our work.

• Random. This method simply generates random scores s_{u,X,Y}. It is included in order to calculate the relative improvement of the other approaches.

• Chrono. Some sequential recommenders assume that a user's future interactions are caused by their recent interactions [21, 22]. Chrono is a heuristic approach that works under this assumption. Specifically, it gives higher scores to the most recent items: the most recently purchased item receives the highest score s_{u,x_{K(u)},Y}, because future interactions are assumed to be mostly caused by the last interaction. Formally, s_{u,i,Y} = 1 / rank_{i ∈ I_u}(max_{date ∈ t_i} date), where t_i is the set of dates on which user u made transactions with item i.

• MostPop. This baseline ranks items according to their popularity in the training part of the dataset. Popularity is defined as the total number of transactions.

• PersonPop. In this approach, we calculate the personal frequencies of user interactions with every item. The more often a user purchased from a certain merchant, the higher the value of s_{u,i,Y}. If a user buys from two different merchants an equal number of times, the order between them is defined by their popularity, as in MostPop.

• Similar Category MostPop. Merchants in our dataset have categories. People may consider it reasonable to see merchants X and Y from the same category together. Therefore, we assign higher scores to merchants from the same category and lower scores to those from different ones. Formally, s_i = 2 + mp_i if X and Y belong to the same category, and s_i = mp_i if X and Y are from different categories, where mp_i ∈ [0, 1] is the score from MostPop.

• Implicit ALS (iALS) [8]. This model not only shows competitive performance on top-n recommendations [9], but is also capable of generating explanations for its recommendations.

• Similar Category + iALS. This method is similar to Similar Category MostPop, with items sorted according to iALS scores instead of popularity.

• Association Rules. This method makes it possible to generate explanations for any recommender model [10]. There are different metrics to score the rules; in our work, we use confidence, support and leverage.

• EASE [23]. This approach is a shallow autoencoder that learns an item-item weight matrix W. This matrix can be used directly to calculate the explanation scores: s_{u,X,Y} = W[X][Y].

It is important to note that Random, MostPop, Association Rules and EASE do not depend on r_{u,i,t}: they rank all items in I_u, but the relative order of items does not change when new interactions are made. In contrast, Chrono, PersonPop, iALS and Similar Category + iALS take the user's interaction history I_u into account and produce personalized rankings of explanation pairs.
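As an illustration, the two heuristics Chrono and Similar Category MostPop can be expressed in a few lines of Python. The helper structures (a per-user transaction list, a category lookup and a precomputed MostPop score in [0, 1]) are assumptions made for this sketch and are not prescribed by the methods themselves.

def chrono_scores(history):
    """Chrono: score each item in the user's history by the recency of its last purchase.

    history: list of (item, timestamp) transactions for one user.
    Returns a dict item -> 1 / rank, where rank 1 is the most recently purchased item.
    """
    last_seen = {}
    for item, ts in history:
        last_seen[item] = max(ts, last_seen.get(item, ts))
    ordered = sorted(last_seen, key=last_seen.get, reverse=True)
    return {item: 1.0 / (rank + 1) for rank, item in enumerate(ordered)}

def simcat_mostpop_score(x, y, category, mostpop):
    """Similar Category MostPop: boost history items sharing the category of the recommended item y.

    category: dict item -> category id; mostpop: dict item -> popularity score in [0, 1].
    """
    bonus = 2.0 if category[x] == category[y] else 0.0
    return bonus + mostpop[x]

# Example usage with toy data.
scores = chrono_scores([(5, 100), (7, 250), (5, 300)])   # item 5 is the most recent, so it scores 1.0
s = simcat_mostpop_score(5, 7, category={5: "food", 7: "food"}, mostpop={5: 0.4, 7: 0.9})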
6. Experiments
We use the feedback collected from the crowdworkers to evaluate the explanation quality of different heuristics and algorithms. The choice of the recommender model that provides the recommendations is not the focus of our work. Therefore, we take the MF-based method iALS [8], which is a powerful recommender model for the top-n recommendation task [9, 24] and is capable of generating explanations for its recommendations. We used the last month of user transactions as a test set and the penultimate month as a validation set to determine the best model hyperparameters. To evaluate the ranking quality, we use standard metrics such as Recall@K, NDCG@K and MAP@K. If a list of candidates is smaller than a particular K, we simply pad the end of the list with zeros. Furthermore, we do not calculate ranking metrics for lists of candidates that lack any examples that can form a valid (X, Y) pair.
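A minimal sketch of the temporal split described above is given below. It assumes the transactions are stored in a pandas DataFrame with a datetime column named timestamp; the column names are ours and not part of the released dataset schema.

import pandas as pd

def temporal_split(df):
    """Hold out the last calendar month for testing and the penultimate month for validation."""
    month = df["timestamp"].dt.to_period("M")
    last = month.max()
    test = df[month == last]
    val = df[month == last - 1]
    train = df[month < last - 1]
    return train, val, test

# Example:
# df = pd.DataFrame({"user_id": [...], "merchant_id": [...], "timestamp": pd.to_datetime([...])})
# train, val, test = temporal_split(df)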
6.1. Results
The results of our experiments are provided in Table 3. The rows are sorted by Recall@1.

Table 3: Explanation quality of different methods. N@K, R@K and M@K denote NDCG@K, Recall@K and MAP@K, respectively. The best value is boldfaced.
                 N@1    N@3    N@10   R@1    R@3    R@10   M@1    M@5    M@10
MostPop          0.34   0.54   0.63   0.22   0.65   0.84   0.34   0.46   0.52
Random           0.37   0.53   0.62   0.23   0.63   0.82   0.37   0.45   0.51
AR_support       0.38   0.55   0.64   0.24   0.65   0.84   0.38   0.47   0.53
Chrono           0.4    0.55   0.64   0.25   0.64   0.83   0.4    0.47   0.53
AR_confidence    0.39   0.55   0.64   0.25   0.64   0.83   0.39   0.47   0.53
SimCatMostPop    0.4    0.56   0.64   0.26   0.64   0.83   0.4    0.48   0.54
EASE             0.42   0.58   0.66   0.27   0.66   0.84   0.42   0.5    0.56
PersonPop        0.42   0.58   0.66   0.27   0.67   0.85   0.42   0.5    0.56
AR_leverage      0.46   0.59   0.66   0.29   0.66   0.84   0.46   0.51   0.56
SimCatIALS       0.5    0.62   0.7    0.3    0.7    0.87   0.5    0.54   0.6
IALS             0.58   0.71   0.76   0.37   0.77   0.91   0.58   0.64   0.68

Since it is difficult to explain the recommendation of any merchant Y by the most popular merchant X, MostPop clearly performs the poorest at ordering explanations. As AR_support uses pairs of items, its results are slightly superior to MostPop. Chrono, AR_confidence, SimCatMostPop, EASE and PersonPop take into account some logical heuristics, including different item-to-item relations or the notion that customers frequently purchase from particular merchants. Therefore, these approaches produced relatively good results. AR_leverage performed well because it takes into account both X and Y separately in addition to the pair X → Y. However, iALS-based models take the top of the leaderboard. The most accurate ranking was produced by implicit ALS. An interesting point to consider is that this method was able to generate rankings that were more accurate than category-based sorting. SimCatIALS performed worse than iALS, possibly because two merchants in the same category can have different target audiences.

It is important to note that the best-performing iALS method has a Recall@1 of only 0.37. This means that in 63% of cases, this model retrieves a pair (X, Y) that is labeled as incorrect by crowdworkers. On the other hand, this result is 0.15 higher in terms of Recall@1 than the result of MostPop.

7. Limitations and Future Work
Our research has some limitations that we plan to overcome in future works. Firstly, the number of available merchants can be very large, and it can be expensive to label most of the possible item pairs. This problem could potentially be addressed by finding a way to predict respondents' answers from partial data labeling. Secondly, we did not study the use of unlabeled pairs. For instance, people's opinions could be predicted by factorising the matrix C. Finally, we considered only a small set of approaches for ranking explanation candidates. Approaches with causal explanations [25] are something to consider in future work.

8. Conclusion
In this paper, we studied the validation of explanations for merchant recommendations. We validated the explanation pairs using a crowdsourcing platform. This made it possible for us to propose a new approach to evaluating the quality of recommendation explanations in an offline scenario. We also considered 9 different approaches for generating explanations and compared them based on the data gathered from crowdworkers. The results of our experiments have shown that even well-known approaches may generate invalid explanations that are considered illogical by real users. We hope that this method will allow researchers to develop explainable models and test them in an offline scenario based on data gathered from crowdsourcing.

References
[1] N. Ranjbar Kermany, L. Pizzato, T. Min, C. Scott, A. Leontjeva, A multi-stakeholder recommender system for rewards recommendations, in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 484–487. doi:10.1145/3523227.3547388.
[2] W. Neussner, E. Ginina, N. Kryvinska, et al., Novel approaches to increasing customer loyalty: Example of "cashback" in Austria, J Fin Mark 6 (2) (2022) 1–14.
[3] R. Poleshchuk, Increasing bank customers' loyalty through innovative loyalty programs, 2022.
[4] X. Chen, A. Reibman, S. Arora, Sequential recommendation model for next purchase prediction, arXiv preprint arXiv:2207.06225 (2022).
[5] S. Kolesnikov, O. Lashinin, M. Pechatov, A. Kosov, TTRS: Tinkoff transactions recommender system benchmark, arXiv preprint arXiv:2110.05589 (2021).
[6] M. Du, R. Christensen, W. Zhang, F. Li, Pcard: Personalized restaurants recommendation from card payment transaction records, in: The World Wide Web Conference, 2019, pp. 2687–2693.
[7] P. Kouki, J. Schaffer, J. Pujara, J. O'Donovan, L. Getoor, Personalized explanations for hybrid recommender systems, in: Proceedings of the 24th International Conference on Intelligent User Interfaces, 2019, pp. 379–390.
[8] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 263–272.
[9] S. Rendle, W. Krichene, L. Zhang, Y. Koren, Revisiting the performance of iALS on item recommendation benchmarks, in: Proceedings of the 16th ACM Conference on Recommender Systems, 2022, pp. 427–435.
[10] G. Peake, J. Wang, Explanation mining: Post hoc interpretability of latent factor models for recommendation systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2060–2069.
[11] S. Xu, Y. Li, S. Liu, Z. Fu, X. Chen, Y. Zhang, Learning post-hoc causal explanations for recommendation, arXiv preprint arXiv:2006.16977 (2020).
[12] W. Cheng, Y. Shen, L. Huang, Y. Zhu, Incorporating interpretability into latent factor models via fast influence analysis, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 885–893. doi:10.1145/3292500.3330857.
[13] V. W. Anelli, A. Bellogín, T. Di Noia, F. M. Donini, V. Paparella, C. Pomo, An analysis of local explanation with LIME-RS, 2022.
[14] S. Chang, F. M. Harper, L. G. Terveen, Crowd-based personalized natural language explanations for recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 175–182.
[15] D.-G. Hong, Y.-C. Lee, J. Lee, S.-W. Kim, Crowdstart: Warming up cold-start items using crowdsourcing, Expert Systems with Applications 138 (2019) 112813. doi:10.1016/j.eswa.2019.07.030.
[16] P. Kouki, J. Schaffer, J. Pujara, J. O'Donovan, L. Getoor, Personalized explanations for hybrid recommender systems, in: Proceedings of the 24th International Conference on Intelligent User Interfaces, 2019, pp. 379–390.
[17] A. Ghazimatin, S. Pramanik, R. Saha Roy, G. Weikum, Elixir: Learning from user feedback on explanations to improve recommender models, in: Proceedings of the Web Conference 2021, 2021, pp. 3850–3860.
[18] K. Balog, F. Radlinski, Measuring recommendation explanation quality: The conflicting goals of explanations, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 329–338.
[19] X. Chen, Y. Zhang, J.-R. Wen, Measuring "why" in recommender systems: A comprehensive survey on the evaluation of explainable recommendation, arXiv preprint arXiv:2202.06466 (2022).
[20] A. Drutsa, V. Farafonova, V. Fedorova, O. Megorskaya, E. Zerminova, O. Zhilinskaya, Practice of efficient data collection via crowdsourcing at large-scale, arXiv preprint arXiv:1912.04444 (2019).
[21] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.
[22] R. He, J. McAuley, Fusing similarity models with Markov chains for sparse sequential recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 191–200.
[23] H. Steck, Embarrassingly shallow autoencoders for sparse data, arXiv preprint arXiv:1905.03375 (2019).
[24] M. Ferrari Dacrema, S. Boglio, P. Cremonesi, D. Jannach, A troubling analysis of reproducibility and progress in recommender systems research, ACM Transactions on Information Systems (TOIS) 39 (2021) 1–49.
[25] S. Xu, Y. Li, S. Liu, Z. Fu, Y. Ge, X. Chen, Y. Zhang, Learning causal explanations for recommendation, in: The 1st International Workshop on Causality in Search and Recommendation, 2021.