=Paper=
{{Paper
|id=Vol-2960/paper11
|storemode=property
|title=Adherence and Constancy in LIME-RS Explanations for Recommendation (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper11.pdf
|volume=Vol-2960
|authors=Vito Walter Anelli,Alejandro Bellogin,Tommaso Di Noia,Francesco Maria Donini,Vincenzo Paparella,Claudio Pomo
|dblpUrl=https://dblp.org/rec/conf/recsys/AnelliBNDPP21
}}
==Adherence and Constancy in LIME-RS Explanations for Recommendation (Long paper)==
Adherence and Constancy in LIME-RS Explanations for Recommendation

Vito Walter Anelli¹, Alejandro Bellogín², Tommaso Di Noia¹, Francesco Maria Donini³, Vincenzo Paparella¹ and Claudio Pomo¹

¹ Politecnico di Bari, via Orabona 4, 70125 Bari, Italy
² Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco, 28049 Madrid, Spain
³ Università degli Studi della Tuscia, via Santa Maria in Gradi 4, 01100 Viterbo, Italy

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, September 27 – October 1, 2021, Amsterdam, Netherlands. Contact: claudio.pomo@poliba.it (C. Pomo). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Explainable Recommendation has attracted a lot of attention due to a renewed interest in explainable artificial intelligence. In particular, post-hoc approaches have proved to be the most easily applicable ones to increasingly complex recommendation models, which are then treated as black boxes. The most recent literature has shown that, for post-hoc explanations based on local surrogate models, there are problems related to the robustness of the approach itself. This consideration becomes even more relevant in human-related tasks like recommendation. The explanation also has the arduous task of enhancing increasingly relevant aspects of user experience such as transparency or trustworthiness. This paper aims to show that the characteristics of a classical post-hoc model based on surrogates are strongly model-dependent and do not prove to be accountable for the explanations generated.

Keywords: explainable recommendation, post-hoc explanation, local surrogate model

1. Introduction

The explanation of a recommendation list plays an increasingly important role in the interaction of a user with a recommender system: the pervasiveness of economic interest and the inscrutability of most Artificial Intelligence systems make users ask for some form of accountability in the behavior of the systems they interact with. Given the explanation that a system can provide to a user, we identify at least two characteristics that the explanation should enforce [1, 2, 3]:

• Adherence to reality: the explanation should mention only features that really pertain to the recommended item. For instance, if the system recommends the movie "Titanic", it should not explain this recommendation by saying "because it is a War Movie", since this is by no means an adherent description of that movie;

• Constancy in the behavior: when the explanation is generated based on some sample, and such a sample is drawn with a probability distribution, the entire process should not exhibit a random behavior to the user. For instance, if the explanation for recommending the movie "The Matrix" to the same user is first "because it is a Dystopian Science Fiction" and then "because it is an Acrobatic Duels Movie", this behavior would be perceived as nondeterministic, thus reducing its trustworthiness.

Among several ways of generating explanations, we study here the application of LIME [4] to the recommendation process. LIME is an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model. LIME belongs to the category of post-hoc algorithms, and it sees the prediction system as a black box by ignoring its underlying operations and algorithms. Since we can consider the recommendation task as a particular Machine Learning task, the LIME approach can also be applied to recommendation. LIME-RS [5] is an adaptation of the general algorithm to the recommendation task and can be considered in all respects a black-box explainer. This means that it generates an explanation by drawing a huge number of (random) calls to the system, collecting the answers, building a model of the behavior of the system, and then constructing the explanation for the particular recommended item.

While adopting a black-box approach lets LIME-RS be applicable to every recommender system, the way of building a model by drawing a huge random sample of system behaviors makes it lose both adherence and constancy, as our experiments show later in this paper. This suggests that the direct application of LIME-RS to recommender systems is not advisable, and that further research is needed to assess the usefulness of LIME-RS in explaining recommendations.

The paper is organized as follows: Section 2 reviews the state of the art on explanation in recommendation; Section 3 details LIME to make the paper self-contained; Section 4 shows the results of experiments with two mainstream recommendation models: Attribute Item-kNN and Vector Space Model. We discuss the outcomes of the experiments in Section 5, and conclude with Section 6.

2. Related Work

In recent years, the theme of Explanation in Artificial Intelligence has come to the foreground, capturing the attention not only of the Machine Learning and related communities – which deal more specifically with the algorithmic part – but also of fields closer to the Social Sciences, such as Sociology or Cognitivism, which look with great interest at this area of research [1]. The growing interest in this area is also dictated by new regulations in both Europe [6] and the US [7] with respect to sensitive issues in the field of personal data processing and legal responsibility. This trend has also touched the research field of recommender systems [8, 9, 10, 11]. However, topics such as explanation are by no means new to this field.

Trying to provide an explanation for increasingly complex recommendation models opens the way to new challenges [1, 18, 19, 20]. There are two different approaches to address this type of issue.

• On the one hand, the model-intrinsic explanation strategy aims to create a user-friendly recommendation model or to encapsulate an explaining mechanism. However, as Lipton [21] points out, this strategy weighs on the trade-off between the transparency and the accuracy of the model. Indeed, if the goal becomes to justify recommendations, the purpose of the system is no longer to provide only personalized recommendations, resulting in a distortion of the recommendation process.

• On the other hand, we have the model-agnostic [22] approach, also known as post-hoc [23], which does not require intervening on the internal mechanisms of the recommendation model and therefore does not affect its performance in terms of accuracy. A minimal sketch of this black-box access pattern is given below.
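As a concrete illustration of what "treating the recommender as a black box" means in practice, the following minimal sketch (hypothetical names, not tied to LIME-RS or to any specific library) shows the only access a post-hoc explainer needs: a scoring call on the trained model, whose internals are never inspected.

```python
from typing import Protocol


class BlackBoxRecommender(Protocol):
    """The only interface a post-hoc explainer relies on."""

    def score(self, user: str, item: str) -> float:
        ...


def probe(model: BlackBoxRecommender, user: str, items: list[str]) -> dict[str, float]:
    """Collect black-box predictions for a batch of items.

    A model-agnostic explainer builds its entire picture of the recommender
    from calls like this one; weights, gradients, and latent factors are
    never inspected.
    """
    return {item: model.score(user, item) for item in items}
```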
In fact, we can date back to 2014 the introduction of the Most recommendation algorithms take an MF-approach, term “explainable recommendation” [12], although the and thus the entire recommendation process is based on need to provide an explanation that accompanies the rec- the interaction of latent factors that bring out the level of ommendation is a need that emerged as early as 1999 liking for an item with respect to a user. Many post-hoc by Schafer et al. [13], when people began trying to ex- explanation methods have been proposed for precisely plain a recommendation with other similar items familiar these types of recommendation models. It seems evi- to the user who received that recommendation. dent that the most difficult challenge for this type of Catalyzation of interest around the topic of explana- approach lies in making these latent factors explicit and tion of recommendations coincides also with the aware- understandable for the user [9]. Peake and Wang [23] ness achieved in considering metrics beyond accuracy generate an explanation by exploiting the association as fundamental in evaluating a recommendation sys- rules between features; Tao et al. [24] in their work, find tem [14, 15]. Indeed, all of the well-known metrics of benefit from regression trees to drive learning, and then novelty, diversity, and serendipity are intended to im- explain the latent space; instead, Gao et al. [25] try a deep prove the user experience, and in this respect, a key role model based on attention mechanisms to make relevant is played by explanation [3, 16]. “Why are you recom- features emerge. Along the same lines are Pan et al. [11], mending that?”—this is the question that usually accom- who present a feature mapping approach that maps the panies the user when a suggestion is provided. Tintarev uninterpretable general features onto the interpretable as- and Masthoff [2] detailed in a scrupulous way the as- pect features. Among other approaches to consider, [12] pects involved in the process of explanation when we proposes an explicit factor model that builds a mapping talk about recommendation. They identified 7 aspects: between the interpretable features and the latent space. user’s trust, satisfaction, persuasiveness, efficiency, effec- On the same line we also find the work by Fusco et al. tiveness, scrutability, and transparency. [26]. In their work, they provide an approach to identify, This is the starting point to define Explainable Rec- in a neural model, which features contribute most to the ommendation as a task that aims to provide suggestions recommendation. However, these post-hoc explanation to the users and make them aware of the recommen- approaches turn out to be built for very specific mod- dation process, explaining also why that specific object els. Purely model-agnostic approaches include the recent has been suggested. Gedikli et al. [3] evaluated differ- work of Tsang et al. [27], who present GLIDER, an ap- ent types of explanations and drew a set of guidelines proach to estimate interactions between features rather to decide what the best explanation that should equip a than on the significance of features as in the original recommendation system is. This is due to the fact that LIME [4] algorithm. This type of solution is constructed popular recommendation systems are based on Matrix regardless of the recommendation model. 
Factorization (MF) [17]; for this type of model, trying Our paper focuses on the operation of LIME, a model- agnostic method for a surrogate-based local explanation. When a user-item pair is provided, this model returns as an outcome of the explanation a set of feature weights, 𝜉 (𝑥) = argminℒ (𝑓 , 𝑒, 𝜋𝑥 ) + Ω(𝑒) (1) for any recommender system. However, the recommen- 𝑒∈𝐸 dation task is very specific, so there is a version called where ℒ represents the fidelity of the surrogate model to LIME-RS [5] that applies the explanation model tech- the original 𝑓, and 𝑒 represents a particular instance of the nique to the recommendation domain. In this way, any class 𝐸 of all possible explainable models. Among all the recommender is seen as a black box, so LIME-RS plays the possible models, the one most frequently chosen is based role of a model-agnostic explainer whose result is a set on a linear prediction. In this case, an explanation refers of interpretable features and their relative importance. to the weights of the most important interpretable fea- The goal of LIME-RS is to exploit the predictive power tures, which, when combined, minimize the divergence of the recommendation (black box) model to generate from the black-box model. The function 𝜋𝑥 measures the an explanation about the suggestion of a particular item distance between the instance to be explained 𝑥 ∈ 𝒳, and for a user. In this respect, it exploits a neighborhood the samples 𝑥 ′ ∈ 𝒳 extracted from the training set to drawn according to a generic distribution compared to train the model 𝑒. Finally, Ω(𝑒) represents the complexity the candidate item for the explanation. It seems obvious of the explanation model. that the choice of the neighborhood plays a crucial role Two pieces of evidence make the application of LIME within the process of explanation generation by LIME-RS. possible: (i) the existence of a feature space 𝒵 on which We can compare this sample extraction action to a per- to train the surrogate model of 𝑓, (ii) and the presence turbation of the user-item pair we are using to generate of a surjective function that maps the space mentioned the explanation. In the case of LIME-RS this perturba- above (𝒵) to the original space of instances (𝒳). Going tion must generate consistent samples with respect to into more detail, we consider the fidelity function ℒ as the source dataset. We see that this choice represents a the mean square deviation between the prediction for a critical issue for all the post-hoc models which base their generic instance 𝑥 ′ ∈ 𝒳 of the black-box model and that expressiveness on the locality of the instance to explain. generated for the counterpart 𝑧 ′ ∈ 𝒵 by the surrogate This trend is confirmed in several papers addressing model. Starting from these considerations we can express this issue of surrogate-based explanation systems such as ℒ with the following formula: LIME and SHAP [28]. In two recent papers, Alvarez-Melis and Jaakkola [29] have shown how the explanations gen- ℒ (𝑓 , 𝑒, 𝜋𝑥 ) = ∑ 𝜋𝑥 (𝑥 ′ ) ⋅ (𝑓 (𝑥 ′ ) − 𝑒 (𝑧 ′ ))2 (2) erated with LIME are not very robust: their contribution 𝑥 ′ ∈𝒳 ,𝑧 ′ ∈𝒵 aims to bring out how small variations or perturbations in the input data cause significant variations in the expla- In the formula above 𝜋𝑥 plays a fundamental role as nation of that specific input [30]. 
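To make the surrogate-fitting step concrete, the following sketch shows one way Equations (1) and (2) can be instantiated with a weighted linear model. It is not the authors' implementation: the callables `black_box_predict` and `distance`, as well as the kernel width `sigma`, are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge


def fit_local_surrogate(z_samples, x_samples, x_target,
                        black_box_predict, distance, sigma=1.0):
    """Fit the interpretable model e of Equations (1)-(2).

    z_samples: (n, d) array of interpretable (e.g., one-hot genre) encodings
    x_samples: the same n instances in the original space X
    x_target:  the instance x to be explained
    """
    # pi_x(x') = exp(-D(x, x')^2 / sigma^2): proximity of each sample to x
    dists = np.array([distance(x_target, x_prime) for x_prime in x_samples])
    weights = np.exp(-(dists ** 2) / sigma ** 2)
    # f(x'): black-box predictions act as regression targets
    targets = np.array([black_box_predict(x_prime) for x_prime in x_samples])
    # Weighted least squares minimizes Eq. (2); the L2 penalty is a simple
    # stand-in for the complexity term Omega(e) in Eq. (1)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(np.asarray(z_samples), targets, sample_weight=weights)
    return surrogate.coef_  # one weight per interpretable feature
```

Ridge regression is used here only because its penalty conveniently bounds the complexity of the surrogate; any sparse linear learner would serve the same purpose.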
In their paper, a new it expresses the distance between the instance to be strategy is introduced to strengthen these methods by explained and the sampled instance used to build the exploiting local Lipschitz continuity. By deeply inves- surrogate model. From a generic perspective, we can tigating this drawback, they introduced self-explaining express this function as a kernel function like 𝜋𝑥 = models in stages, progressively generalizing linear classi- 𝑒𝑥𝑝(−𝐷(𝑥, 𝑥 ′ )2 /𝜎 2 ), where 𝐷 is any measure of distance. fiers to complex yet architecturally explicit models. The full impact of this distance is captured when the Saito et al. [31] also explored this issue by turning their fidelity function also considers the transformation of the gaze to different types of sampling to make the result of surrogate sample in the original space. As mentioned an explanation generated through LIME more robust. In earlier, we consider a surjective function 𝑝 that maps the particular, in their work, they introduce the possibility original space into the feature space 𝑝 ∶ 𝒳 → 𝒵. We of generating realistic samples produced with a Genera- can also consider the function that allows us to move tive Adversarial Network. Finally, Slack et al. [32] adopt in the opposite direction 𝑝 −1 ∶ 𝒳 → 𝒵. At this point, a similar solution in order to control the perturbation Equation (2) becomes: generating neighborhood data points by attempting to mitigate the generation of unreliable explanations while 2 ℒ (𝑓 , 𝑒, 𝜋𝑥 , 𝑝) = ∑𝑧 ′ ∈𝒵 𝜋𝑥 (𝑝 −1 (𝑧 ′ )) ⋅ (𝑓 (𝑝 −1 (𝑧 ′ )) − 𝑒 (𝑧 ′ )) (3) maintaining a stable black-box model of prediction. From this last equation, we can grasp the criticality of the surjective mapping function. Indeed, the neighbor- 3. Background Technology hood in 𝒵-space cannot be guaranteed with the transfor- From a formal point of view, we can define a LIME- mation in 𝒳-space. Thus, some samples selected to train generated explanation for a generic instance 𝑥 ∈ 𝒳 pro- the surrogate model could not satisfy the neighborhood duced by a model 𝑓 as: criterion for which they were chosen. We must therefore stress on the centrality of the sam- pling function: how do we extract the neighborhood of our instance to be explained? If we look at the application lens Small [33], and Yahoo! Movies1 . Their characteristics of LIME to the recommendation domain, we can compare are shown in Table 1. this sampling action to a local perturbation around our instance 𝑥; however, this perturbation aims to generate Table 1 𝑛 samples 𝑥 ′ , which might contain inconsistencies: as Characteristics of the datasets involved in the experiments. an example, suppose we want to explain James’s feeling Users Items Transactions Sparsity about the movie The Matrix. The original triple of the in- stance to be explained associates to the user-item pair the Movielens 1M 6040 3675 797758 0,9640 Movielens Small 610 8990 80419 0,9853 genre of the movie (representing the explainable space) Yahoo! Movies 7636 8429 160604 0,9975 and in this case it is of the type ⟨𝐽 𝑎𝑚𝑒𝑠, 𝑇 ℎ𝑒𝑀𝑎𝑡𝑟𝑖𝑥, 𝑆𝑐𝑖-𝐹 𝑖⟩. A perturbation around this instance could generate incon- sistencies of the type ⟨𝐽 𝑎𝑚𝑒𝑠, 𝑇 ℎ𝑒𝑀𝑎𝑡𝑟𝑖𝑥, 𝑊 𝑒𝑠𝑡𝑒𝑟𝑛⟩. For As for the choice of the models to be used in this work this reason, in LIME-RS the perturbation considers only is concerned, we selected two well-known recommenda- real and not synthetic data. 
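The consistency requirement discussed above can be sketched as follows: sampling the neighborhood from real catalog items, rather than perturbing feature values synthetically, keeps every sampled triple aligned with the dataset. Names such as `item_genres` are hypothetical and only serve the illustration.

```python
import random


def sample_real_neighborhood(user, item_genres, n_samples, seed=42):
    """Draw a neighborhood of real catalog items for a (user, item) explanation.

    Because only observed items are drawn, every sampled triple
    <user, item, genres> stays consistent with the dataset, so inconsistent
    perturbations such as <James, The Matrix, Western> cannot arise.
    """
    rng = random.Random(seed)
    catalog = list(item_genres)            # items with known content features
    sampled = rng.choices(catalog, k=n_samples)
    return [(user, item, item_genres[item]) for item in sampled]
```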
This choice is dictated by tion models that are able to exploit the information con- the avoidance of the out-of-sample (OOS) process phe- tent of the items to produce a recommendation: Attribute nomenon. Closely related to this problem predicted by Item kNN (Att-Item-kNN) and Vector Space Model (VSM). OOS is that the interpretation examples selected in LIME- The two chosen models represent the simplest solution RS represent the ability to capture the locality through to address the recommendation problem by exploiting disturbance mechanisms effectively. One of the disad- the content associated with the items in the catalog. vantages of LIME-like methods is that they sometimes Att-Item-kNN exploits the characteristics of fail to estimate an appropriate local replacement model neighborhood-based models but expresses the represen- but instead generate a model that focuses on explaining tation of the items in terms of their content and, based the examples and is also affected by more general trends on this representation, it computes a similarity between in the data. users. Starting from this similarity and exploiting This issue is central to our work, and it involves two the collaborative contribution in terms of interactions aspects: (i) the first one concerns the sampling function between users and items, Att-Item-kNN tries to estimate of the samples precisely. In the LIME-RS implementation, the level of liking of the items in the catalog. VSM this function is driven by the popularity distribution of represents both users and items in a new space to the items within the dataset. (ii) The second critical issuelink users and items to the considered information concerns the model’s ability to wittily discriminate the content. Once obtained this new representation, with user’s taste from the neighborhood extracted to build the an appropriate function of similarity, VSM estimates surrogate model. A model that squashes too much on bias which are the most appealing items for a specific user. or is inaccurate cannot bring out the peculiarities of user The implementation of both models are available in the taste that are critical in building the explainable model ELLIOT [34] evaluation framework. This benchmarking which are, in turn, useful in generating the explanation framework was used to select the best configuration for the instance of interest. for the two recommendation models by exploiting the These observations dictate the two research questions corresponding configuration file2 . that motivated our work: Our experiments start by selecting the best configu- rations based on nDCG [35, 36] for the two models on RQ1 Can we consider the surrogate-based model on the considered datasets. Then, we generate the top-10 which LIME-RS is built to generate always the list of recommendations for each user, and we take into same explanations, or does the extraction of a dif- account the first item 𝑖 on these lists for each user 𝑢. Fi- 1 ferent neighborhood severely impact the system’s nally, each recommendation pair (𝑢, 𝑖 ) is explained with 1 constancy? LIME-RS. The explanation consists of a weighted vector RQ2 Are LIME-RS explanations adherent to item con- (𝑔, 𝑤)𝑖 where 𝑔 is the genre of the movies in the dataset tent, despite the fact that the sampling function – i.e., the features – and 𝑤 is the weight associated to 𝑔 is uncritical and based only on popularity? by LIME-RS within the explanation. Then, this vector is sorted by descending weights. 
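A compact sketch of this experimental protocol is given below; `recommend_top_k` and `lime_rs_explain` are placeholders for the actual ELLIOT and LIME-RS entry points, whose real signatures may differ.

```python
def explain_top_recommendation(user, recommend_top_k, lime_rs_explain,
                               n_runs=10, top_features=5):
    """Explain the first item of the user's top-10 list, n_runs times.

    Every run uses a different sampling seed; each explanation is kept as a
    genre list sorted by descending LIME-RS weight and truncated to the
    first `top_features` entries.
    """
    top_item = recommend_top_k(user, k=10)[0]                  # i_1 in the paper
    explanations = []
    for seed in range(n_runs):                                 # change seed per run
        genre_weights = lime_rs_explain(user, top_item, seed=seed)   # {genre: w}
        ranked = sorted(genre_weights.items(), key=lambda gw: gw[1], reverse=True)
        explanations.append([genre for genre, _ in ranked[:top_features]])
    return top_item, explanations
```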
In this way, the genres of the movies which play a key role within the recommen- 4. Experiments dation, as explained by LIME-RS, are highlighted at the first positions of the vector. These operations are then This section is devoted to illustrating how the experimen- repeated 𝑛 = 10 times and changing the seed each time, tal campaign was conducted. The datasets used for this 1 phase of experimentation are Movielens 1M [33], Movie- R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 http://webscope.sandbox.yahoo.com/. 2 https://tny.sh/basic_limers as 10 is likely to be a good choice to detect a general pat- Table 2 tern in the behavior of LIME-RS. At this point, for each Constancy of LIME-RS. A value equal to 0 means that the pair (𝑢, 𝑖1 ), we have a group of 10 explanations ordered genre(s) provided by LIME-RS in the first 𝑘 position(s) is always by descending values of 𝑤, which we exploit to answer different (worst case: completely inconstant behavior); A value our two research questions. equal to 1 means that the genre(s) provided by LIME-RS in RQ1. Empirically, since in a real scenario of recommenda- the first 𝑘 position(s) is always the same (total constancy). tion a too verbose explanation is not useful, we consider 𝜇1 𝜇2 𝜇3 𝜇4 𝜇5 only the first five features in the sorted vector represent- Att-Item-kNN ing the explanation of each recommendation. In order to Movielens 1M 0,9130 0,7822 0,6927 0,6288 0,5727 verify the constancy of the behavior of LIME-RS, given a Movielens Small 0,8830 0,7426 0,6639 0,60459 0,5616 Yahoo! Movies 0,9230 0,8016 0,7232 0,6528 0,5830 (𝑢, 𝑖1 ) pair, we exploit the 𝑛 previously generated expla- VSM nations for this pair. Then for 𝑘 = 1, 2, … , 5, we define Movielens 1M 0,8929 0,7953 0,7729 0,7726 0,7801 𝐺𝑘 as the multiset of genres that appear in 𝑘-th position Movielens Small 0,9464 0,8636 0,8343 0,8138 0,8049 – for instance, if “Sci-Fi” occurs in the first position of 7 Yahoo! Movies 0,9732 0,9209 0,8887 0,8884 0,9056 explanations, then “Sci-Fi” occurs 7 times in the multiset 𝐺1 , and similarly for other genres and multisets. Then, we compute the frequency of genres in each position as the values are much more stable. In this case, we have a follows: given a position 𝑘, a genre 𝑔, and the number constancy that, regardless of the length of the weighted 𝑛 of generated explanations for a given pair (𝑢, 𝑖1 ), the vector of the explanation, stabilizes on average around frequency 𝑓𝑔𝑘 of 𝑔 in 𝑘-th position is computed as: 80%. An aspect emerges that will be discussed in de- tail later: LIME-RS is conditioned by the ability of the ||{𝑔 | 𝑔 ∈ 𝐺𝑘 }|| 𝑓𝑔𝑘 = (4) black-box model to discriminate the user’s tastes locally. 𝑛 RQ2. With the aim of providing an answer about the where || ⋅ || denotes the cardinality of a multiset. Then, all adherence to reality of LIME-RS, we make a comparison this information is collected for each user in five lists — between the genres claimed to explain a recommended one for each of the 𝑘 positions — of pairs ⟨𝑔, 𝑓𝑔𝑘 ⟩ sorted by item and its actual genres. Indeed, the explanations about frequency. One can observe that the computed frequency an item should fit the list of genres the item is charac- is an estimation of the probability that a given genre is terized by. This means that, in an ideal case, all highly put in that position within the explanation generated by weighted features within the explanation should match LIME-RS sorted by values. Hence, the pair ⟨𝑔, max (𝑓𝑔𝑘 )⟩ the genres of the item. 
From the results in Table 2, we no- describes the genre with the highest frequency in the tice that using Att-Item-kNN the constancy of LIME-RS 𝑘-th position of the explanation for a pair (𝑢, 𝑖1 ). Finally, reaches a low value after the third feature. Hence, it is a it makes sense to compute the mean 𝜇𝑘 of the highest futile effort to go deeper in the study of the explanation. probability values in each position 𝑘 of the explanations To this aim, we intersected each explanation limited to for each pair (𝑢, 𝑖1 ). Formally, by setting a position 𝑘, the the set 𝐸𝑘 of its first 𝑘 genres with the set of genres 𝐹𝑖1 mean 𝜇𝑘 is computed as: characterizing the first recommended item, for 𝑘 = 1, 2, 3. Upon completion of this operation for all the 𝑛 expla- |𝑈 | nations generated for each (𝑢, 𝑖1 ) pair, we computed the ∑𝑗=1 max (𝑓𝑔𝑘 )𝑗 𝜇𝑘 = (5) number of times we obtained an empty intersection of |𝑈 | these sets, normalized by the total number of explana- where 𝑈 is the set of users for whom it was possible to tions 𝑛 × |𝑈 |, in order to understand to what extent an generate a recommendation for. Observing the value explanation is (not) adherent to the item. Formally, for a of 𝜇𝑘 , we can state to what extent LIME-RS is constant given value of 𝑘, the value 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 is computed as: in providing the explanations until the 𝑘-th feature: the 𝑛×|𝑈 | higher the value of 𝜇𝑘 , the higher the constancy of LIME- ∑𝑗=1 [(𝐸𝑘 ∩ 𝐹𝑖1 )𝑗 = ∅] RS concerning the 𝑘-th feature. 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 = (6) 𝑛 × |𝑈 | By looking at Table 2, we can see that for Att-Item- kNN the LIME-RS explanation model is reliable as long where 𝑈 is the set of users of the dataset for whom it as it considers at most three features in the weighted was possible to generate a recommendation, 𝑛 is the vector presented as an explanation of the recommen- number of generated explanations for each pair (𝑢, 𝑖1 ), dation. Extending the explanation to four features, we and by Σ[⋯] we mean that we sum 1 if the condition have a constancy that falls below 65%, while arriving at inside [⋯] is true, and 0 otherwise. One can note that an explanation with five features is more likely to run 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 ∈ [0, 1], where a value of 1 indicates the worst into explanations that exhibit an unacceptably random case in which for none of the 𝑛 explanations under con- behavior. On the other hand, we can see that for VSM sideration at least one genre of the item is in the first 𝑘 features of the explanation. In contrast, the lower the (except for the first feature with Movielens 1M) better per- value of 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 , the higher the adherence of LIME-RS. formance than Att-Item-kNN in terms of constancy, with peaks up to 97%. A straightforward consequence of these Table 3 observations could be analyzed in terms of confidence or Adherence of LIME-RS. For value equals to 1 no genre provided probability. If the constancy steadily decreases, it means by LIME-RS in the first 𝑘 real genres of the movie (worst case); that the probability that LIME-RS suggests the same ex- For value equals to 0 at least one genre provided by LIME-RS planatory feature decreases. In practical terms, we could in the first 𝑘 genres is always among the real genres of the say that LIME-RS is less confident about its explanation. movie. In fact, this is the behavior of Att-Item-kNN. Conversely, 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒1 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒2 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒3 VSM shows high values of constancy, resulting in a more Att-Item-kNN ”deterministic” behavior. With VSM, LIME-RS is more confident of its explanations. 
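Both measures can be computed directly from the collected explanation vectors. The sketch below follows the definitions of Equations (4)-(6) under the assumption that each explanation is stored as a genre list sorted by descending weight; it is not the authors' code.

```python
from collections import Counter


def constancy_mu(explanations_per_user, k_max=5):
    """mu_k of Eq. (5): mean over users of the highest positional frequency.

    explanations_per_user: {user: [expl_1, ..., expl_n]}, each expl_j a list
    of genres already sorted by descending LIME-RS weight.
    """
    mu = {}
    for k in range(1, k_max + 1):
        maxima = []
        for expls in explanations_per_user.values():
            counts = Counter(e[k - 1] for e in expls if len(e) >= k)    # multiset G_k
            n = len(expls)
            maxima.append(max(counts.values()) / n if counts else 0.0)  # max_g f_g^k
        mu[k] = sum(maxima) / len(maxima)
    return mu


def adherence(explanations_per_user, real_genres_per_user, k_max=3):
    """adherence_k of Eq. (6): share of explanations whose first k genres do
    NOT intersect the recommended item's real genres (1 is the worst case)."""
    scores = {}
    for k in range(1, k_max + 1):
        empty, total = 0, 0
        for user, expls in explanations_per_user.items():
            real = set(real_genres_per_user[user])         # F_{i_1} for (u, i_1)
            for e in expls:
                total += 1
                if not set(e[:k]) & real:                  # (E_k ∩ F_{i_1}) = ∅
                    empty += 1
        scores[k] = empty / total
    return scores
```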
This could increase user’s Movielens 1M 0,2774 0,1105 0,0488 Movielens Small 0,2364 0,0651 0,0180 trustworthiness, since LIME-RS behavior is more reliable. Yahoo! Movies 0,3597 0,1202 0,0476 However, these results could also be interpreted to- VSM gether with the ones from Table 3. They show how often at least one feature – out of 𝑘 features provided Movielens 1M 0,5357 0,2539 0,1088 by LIME-RS– adheres to the features that describe the Movielens Small 0,4384 0,1674 0,0403 Yahoo! Movies 0,1013 0,01348 0,0021 item being explained. In other words, they measure the probability that LIME-RS succeeds in reconstructing at least one feature of a specific item. Combining the re- Observing the results from Table 3, Att-Item-KNN per- sults of Table 2 and those of Table 3, Att-Item-kNN, as forms well in terms of adherence since, in approximately already mentioned, shows good performance regarding 75% of cases, even considering only the main feature of adherence and identifies 3 times out of 4 the first funda- the explanation, it falls into the set of the item genres, mental feature of the explanation among those present as for Movielens dataset family. This performance is in the set of features originally associated with the item. a 10% lower for Yahoo! Movies. In contrast with this As expected, if the number 𝑘 of LIME-RS-reconstructed result, VSM shows poor performances on both dataset features increases, the number of times such a set has a of the Movielens family, by failing half the time about nonempty intersection (with the features belonging to Movielens 1M as regards adherence. A surprising result the item) – i.e., adherence – increases. It could be noted is achieved for Yahoo! Movies dataset because, enlarging that Att-Item-kNN on Yahoo! Movies shows the worst the study to the first three features among the explana- behavior in terms of adherence. VSM shows a different tion, the error is almost completely absent. The reasons behavior. Despite the excellent performance regarding we found to explain this difference in the performances constancy, it could be observed that on both Movielens concern the characteristics and the quality of the dataset, datasets, the performance in terms of adherence is poor, as we highlight later on. and worse for Movielens 1M than for Movielens Small. Surprisingly, on Yahoo! Movies, VSM performs much 5. Discussion better, and the errors are almost negligible. The difference between the two models could be due This work investigates how well a post-hoc approach to many reasons. In the following we analyze possible based on local surrogates – such as the LIME-RS algo- relations between such behaviors and two of them: pop- rithm – explains a recommendation. Instead of studying ularity bias in the dataset and characteristics of side in- the impact of explanations on users (that is a well-studied formation. On the one hand, if the dataset is affected topic in the literature and is beyond our scope), we fo- by popularity bias, it would be a well-studied cause of cus on objective evidences that could emerge. In this confusion for LIME-RS. On the other hand, the character- respect, we have designed specific experiments, which istics of the side information associated with the datasets introduced two different metrics, to evaluate adherence could dramatically influence the performance of the two and constancy for this kind of algorithms. For instance, recommendation models. 
To assess these hypotheses, we Table 2 shows a different behavior for Att-Item-kNN and have evaluated (see Table 4) the recommendation lists VSM. On the one hand, Att-Item-kNN seems to guarantee produced by Att-Item-kNN and VSM considering nDCG, a good constancy in explanations up to the third feature. Hit Rate (HR), Mean Average Precision (MAP), and Mean This suggests that an explanation that exploits the first Reciprocal Rank (MRR). Table 4 shows that the chosen three features of the list produced by LIME-RS could be datasets are strongly affected by popularity bias. Indeed, barely considered as reliable (i.e., reaching a constancy of MostPop is the best performing approach, and the two 0.69 on Movielens 1M). On the other hand, VSM exhibits ”personalized” models fail to produce accurate results. a much more ”stable” behavior, demonstrating in all cases This triggers the second aspect that concerns the quality Table 4 ommendation scenario. We propose two different mea- Results of the experiments on the models involved in the sures to understand how reliable an explanation based experiments. Models are optimized according to the value of on LIME-RS is: (i) constancy was used to assess the im- nDCG. pact of the random sampling phase of LIME-RS on the model nDCG Recall HR Precision MAP MRR provided explanation – ideally the explanation should Movielens 1m remain constant in spite of the sample used to obtain Random 0,0051 0,0028 0,0869 0,0098 0,0094 0,0264 it; (ii) adherence was proposed to understand the recon- MostPop 0,0845 0,0379 0,4548 0,104 0,115 0,2205 Att-Item-kNN 0,0229 0,0165 0,2425 0,0383 0,0387 0,0888 structive power of LIME-RS with respect to the features VSM 0,0173 0,0109 0,2106 0,0292 0,0306 0,0741 that belong to the item involved in the explanation – ide- Movielens Small ally, LIME-RS should provide an explanation that always Random 0,0030 0,0013 0,0492 0,0049 0,0068 0,0205 MostPop 0,0715 0,0389 0,3902 0,0748 0,0912 0,1961 adheres to the actual features of the recommended item. Att-Item-kNN 0,0124 0,0068 0,1459 0,0197 0,0191 0,0484 To test both constancy and adherence, we trained and VSM 0,0085 0,0056 0,1000 0,0111 0,0123 0.0350 optimized two content-based recommendation models: Yahoo! Movies Random 0,0005 0,0008 0,0051 0,0005 0,0005 0,0015 Attribute Item-kNN (Att-Item-kNN), and a classical Vec- MostPop 0,2188 0,2589 0,596 0,1067 0,1501 0,3447 tor Space Model. For each model, and for all datasets Att-Item-kNN 0,0215 0,0262 0,1198 0,0132 0,0155 0,0435 VSM 0,0131 0,0171 0,0754 0,0081 0,0092 0,0261 exploited in the study, we generated recommendation lists for all users. We exploited the first item of these top-10 lists to produce the explanations that were then the subject of our investigation. It turned out that for of the content. The results suggest that the side infor- models built with a large collaborative input such as mation is not good enough to boost the recommenda- Att-Item-kNN, LIME-RS produces fairly constant expla- tion systems in producing meaningful recommendations. nations up to a length of three features. Moreover, these In fact, the three datasets seem to have an informative explanations turn out to be adherent with respect to the content that is not adequate to generate appealing rec- item between 65% and 75% of the cases in which only ommendations. We observe that, from an informative the first feature of the weighted vector of explanations point of view, the Yahoo! Movies dataset is slightly more is considered. 
VSM shows a different behavior where complete: 22 genres against the 18 genres available on explanations are much more constant, but suffer a lot in Movielens. Although the VSM model does not show ex- terms of adherence, except for the Yahoo! Movies dataset cellent performance, in combination with LIME-RS, it for which the explanation model showed outstanding provides explanations that are very reliable in terms of performance despite the poor ability of VSM to provide constancy (see Table 2) and adherence (see Table 3) to sound recommendations to users. the actual content of the items being explained. In our experiments, some evidence started to emerge From the designer perspective, there is also a prag- highlighting that the adopted explanation model is condi- matic way to look at the experimental results. Suppose a tioned not only by the accuracy of the black-box model it developer needs an off-the-shelf way of generating expla- tries to explain but also by the quality of the side informa- nations for recommendations, and chooses LIME-RS to tion used to train the model. The latter result deserves to do that. Our results suggest that if the explainer employs be adequately investigated to search for a link at a higher a Movielens dataset with Att-Item-kNN model, then it level of detail. We plan to apply our experiments also to is better to run the explainer several times. Indeed, the other recommendation models, to see whether the prob- first feature obtained for the explanation could change lems with adherence and constancy that we found for the around 1 time every 5 trials (first column of Table 2), two tested models show up also in other situations. We and once such a feature is obtained, it is better to check will also investigate what impact structured knowledge whether this feature is really among the ones describing has on this performance by exploiting models capable of the item, since 1 time out of 4 the feature can be wrong leveraging this type of content. In addition, it would also (first column of Table 3). Moreover, if the explainer em- be the case to try different reference domains with richer ploys the Yahoo! Movies dataset with VSM model, then datasets of side information to understand what impact probably there is no need to run the explainer twice, since content quality has on this type of explainer. its behavior is constant 97% of the times, while the fea- ture is wrong only 10% of the times. However, the low performance of such a model is to be taken into account. Acknowledgments The authors acknowledge partial support of PID2019- 6. Conclusion 108965GB-I00, PON ARS01_00876 BIO-D, Casa delle Tec- nologie Emergenti della Città di Matera, PON ARS01_00821 In this paper we shed a first light on the effectiveness FLET4.0, PIA Servizi Locali 2.0, H2020 Passapartout - Grant n. of LIME-RS as a black-box explanation model in a rec- 101016956, PIA ERP4.0, and IPZS-PRJ4_IA_NORMATIVO. References dation via interpretable feature mapping and evalua- tion of explainability, in: C. Bessiere (Ed.), Proceed- [1] T. Miller, Explanation in artificial intelligence: Insights ings of the Twenty-Ninth International Joint Conference from the social sciences, Artif. Intell. 267 (2019) 1–38. on Artificial Intelligence, IJCAI 2020, ijcai.org, 2020, pp. URL: https://doi.org/10.1016/j.artint.2018.07.007. doi:1 0 . 2690–2696. URL: https://doi.org/10.24963/ijcai.2020/373. 1016/j.artint.2018.07.007. doi:1 0 . 2 4 9 6 3 / i j c a i . 2 0 2 0 / 3 7 3 . [2] N. Tintarev, J. 
Masthoff, Explaining recommendations: [12] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, S. Ma, Design and evaluation, in: Recommender Systems Hand- Explicit factor models for explainable recommendation book, Springer, 2015, pp. 353–382. based on phrase-level sentiment analysis, in: S. Geva, [3] F. Gedikli, D. Jannach, M. Ge, How should I explain? A. Trotman, P. Bruza, C. L. A. Clarke, K. Järvelin (Eds.), A comparison of different explanation types for recom- The 37th International ACM SIGIR Conference on Re- mender systems, Int. J. Hum. Comput. Stud. 72 (2014) search and Development in Information Retrieval, SIGIR 367–382. URL: https://doi.org/10.1016/j.ijhcs.2013.12.007. ’14, Gold Coast , QLD, Australia - July 06 - 11, 2014, ACM, doi:1 0 . 1 0 1 6 / j . i j h c s . 2 0 1 3 . 1 2 . 0 0 7 . 2014, pp. 83–92. URL: https://doi.org/10.1145/2600428. [4] M. T. Ribeiro, S. Singh, C. Guestrin, ”why should I trust 2609579. doi:1 0 . 1 1 4 5 / 2 6 0 0 4 2 8 . 2 6 0 9 5 7 9 . you?”: Explaining the predictions of any classifier, in: [13] J. B. Schafer, J. A. Konstan, J. Riedl, Recommender sys- B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, tems in e-commerce, in: S. I. Feldman, M. P. Wellman D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM (Eds.), Proceedings of the First ACM Conference on Elec- SIGKDD International Conference on Knowledge Discov- tronic Commerce (EC-99), Denver, CO, USA, November ery and Data Mining, San Francisco, CA, USA, August 13- 3-5, 1999, ACM, 1999, pp. 158–166. URL: https://doi.org/ 17, 2016, ACM, 2016, pp. 1135–1144. URL: https://doi.org/ 10.1145/336992.337035. doi:1 0 . 1 1 4 5 / 3 3 6 9 9 2 . 3 3 7 0 3 5 . 10.1145/2939672.2939778. doi:1 0 . 1 1 4 5 / 2 9 3 9 6 7 2 . 2 9 3 9 7 7 8 . [14] S. M. McNee, J. Riedl, J. A. Konstan, Being accurate is [5] C. Nóbrega, L. B. Marinho, Towards explaining rec- not enough: how accuracy metrics have hurt recom- ommendations through local surrogate models, in: mender systems, in: G. M. Olson, R. Jeffries (Eds.), C. Hung, G. A. Papadopoulos (Eds.), Proceedings of Extended Abstracts Proceedings of the 2006 Confer- the 34th ACM/SIGAPP Symposium on Applied Com- ence on Human Factors in Computing Systems, CHI puting, SAC 2019, Limassol, Cyprus, April 8-12, 2019, 2006, Montréal, Québec, Canada, April 22-27, 2006, ACM, 2019, pp. 1671–1678. URL: https://doi.org/10.1145/ ACM, 2006, pp. 1097–1101. URL: https://doi.org/10.1145/ 3297280.3297443. doi:1 0 . 1 1 4 5 / 3 2 9 7 2 8 0 . 3 2 9 7 4 4 3 . 1125451.1125659. doi:1 0 . 1 1 4 5 / 1 1 2 5 4 5 1 . 1 1 2 5 6 5 9 . [6] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual [15] S. Vargas, Novelty and diversity enhancement and eval- explanations without opening the black box: Automated uation in recommender systems and information re- decisions and the gdpr, Harv. JL & Tech. 31 (2017) 841. trieval, in: S. Geva, A. Trotman, P. Bruza, C. L. A. Clarke, [7] J. Chakraborty, K. Peng, T. Menzies, Making fair ML soft- K. Järvelin (Eds.), The 37th International ACM SIGIR Con- ware using trustworthy explanation, in: 35th IEEE/ACM ference on Research and Development in Information International Conference on Automated Software Engi- Retrieval, SIGIR ’14, Gold Coast , QLD, Australia - July neering, ASE 2020, Melbourne, Australia, September 21- 06 - 11, 2014, ACM, 2014, p. 1281. URL: https://doi.org/ 25, 2020, IEEE, 2020, pp. 1229–1233. URL: https://doi.org/ 10.1145/2600428.2610382. doi:1 0 . 1 1 4 5 / 2 6 0 0 4 2 8 . 2 6 1 0 3 8 2 . 10.1145/3324884.3418932. doi:1 0 . 1 1 4 5 / 3 3 2 4 8 8 4 . 
3 4 1 8 9 3 2 . [16] N. Tintarev, J. Masthoff, A survey of explanations in [8] Y. Zhang, X. Chen, Explainable recommendation: A recommender systems, in: ICDE Workshops, IEEE Com- survey and new perspectives, Found. Trends Inf. Retr. 14 puter Society, 2007, pp. 801–810. (2020) 1–101. URL: https://doi.org/10.1561/1500000066. [17] Y. Koren, R. M. Bell, C. Volinsky, Matrix factorization doi:1 0 . 1 5 6 1 / 1 5 0 0 0 0 0 0 6 6 . techniques for recommender systems, Computer 42 [9] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta, (2009) 30–37. URL: https://doi.org/10.1109/MC.2009.263. How to make latent factors interpretable by feeding doi:1 0 . 1 1 0 9 / M C . 2 0 0 9 . 2 6 3 . factorization machines with knowledge graphs, in: [18] K. Tsukuda, M. Goto, Dualdiv: diversifying items C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. F. and explanation styles in explainable hybrid recom- Cruz, A. Hogan, J. Song, M. Lefrançois, F. Gandon (Eds.), mendation, in: T. Bogers, A. Said, P. Brusilovsky, The Semantic Web - ISWC 2019 - 18th International Se- D. Tikk (Eds.), Proceedings of the 13th ACM Confer- mantic Web Conference, Auckland, New Zealand, Oc- ence on Recommender Systems, RecSys 2019, Copen- tober 26-30, 2019, Proceedings, Part I, volume 11778 of hagen, Denmark, September 16-20, 2019, ACM, 2019, pp. Lecture Notes in Computer Science, Springer, 2019, pp. 398–402. URL: https://doi.org/10.1145/3298689.3347063. 38–56. URL: https://doi.org/10.1007/978-3-030-30793-6_ doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . 3 3 4 7 0 6 3 . 3. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 0 3 0 - 3 0 7 9 3 - 6 \ _ 3 . [19] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin, H. Zha, [10] G. P. Polleti, H. N. Munhoz, F. G. Cozman, Explanations Personalized fashion recommendation with visual ex- within conversational recommendation systems: improv- planations based on multimodal attention network: To- ing coverage through knowledge graph embedding, in: wards visually explainable recommendation, in: B. Pi- 2020 AAAI Workshop on Interactive and Conversational wowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie, Recommendation System. AAAI Press, New York City, F. Scholer (Eds.), Proceedings of the 42nd International New York, USA, 2020. ACM SIGIR Conference on Research and Development in [11] D. Pan, X. Li, X. Li, D. Zhu, Explainable recommen- Information Retrieval, SIGIR 2019, Paris, France, July 21- 25, 2019, ACM, 2019, pp. 765–774. URL: https://doi.org/ detection, in: 8th International Conference on Learn- 10.1145/3331184.3331254. doi:1 0 . 1 1 4 5 / 3 3 3 1 1 8 4 . 3 3 3 1 2 5 4 . ing Representations, ICLR 2020, Addis Ababa, Ethiopia, [20] G. Cornacchia, F. M. Donini, F. Narducci, C. Pomo, April 26-30, 2020, OpenReview.net, 2020. URL: https: A. Ragone, Explanation in multi-stakeholder recom- //openreview.net/forum?id=BkgnhTEtDS. mendation for enterprise decision support systems, in: [28] E. Strumbelj, I. Kononenko, Explaining prediction A. Polyvyanyy, S. Rinderle-Ma (Eds.), Advanced Informa- models and individual predictions with feature con- tion Systems Engineering Workshops - CAiSE 2021 Inter- tributions, Knowl. Inf. Syst. 41 (2014) 647–665. URL: national Workshops, Melbourne, VIC, Australia, June 28 https://doi.org/10.1007/s10115-013-0679-x. doi:1 0 . 1 0 0 7 / - July 2, 2021, Proceedings, volume 423 of Lecture Notes s10115- 013- 0679- x. in Business Information Processing, Springer, 2021, pp. [29] D. Alvarez-Melis, T. S. Jaakkola, Towards ro- 39–47. 
URL: https://doi.org/10.1007/978-3-030-79022-6_ bust interpretability with self-explaining neural net- 4. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 0 3 0 - 7 9 0 2 2 - 6 \ _ 4 . works, in: S. Bengio, H. M. Wallach, H. Larochelle, [21] Z. C. Lipton, The mythos of model interpretability, Com- K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Ad- mun. ACM 61 (2018) 36–43. URL: https://doi.org/10.1145/ vances in Neural Information Processing Systems 3233231. doi:1 0 . 1 1 4 5 / 3 2 3 3 2 3 1 . 31: Annual Conference on Neural Information Pro- [22] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, X. Xie, A cessing Systems 2018, NeurIPS 2018, December 3- reinforcement learning framework for explainable rec- 8, 2018, Montréal, Canada, 2018, pp. 7786–7795. ommendation, in: IEEE International Conference on URL: https://proceedings.neurips.cc/paper/2018/hash/ Data Mining, ICDM 2018, Singapore, November 17-20, 3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html. 2018, IEEE Computer Society, 2018, pp. 587–596. URL: [30] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of https://doi.org/10.1109/ICDM.2018.00074. doi:1 0 . 1 1 0 9 / interpretability methods, CoRR abs/1806.08049 (2018). ICDM.2018.00074. URL: http://arxiv.org/abs/1806.08049. a r X i v : 1 8 0 6 . 0 8 0 4 9 . [23] G. Peake, J. Wang, Explanation mining: Post hoc in- [31] S. Saito, E. Chua, N. Capel, R. Hu, Improving LIME terpretability of latent factor models for recommenda- robustness with smarter locality sampling, CoRR tion systems, in: Y. Guo, F. Farooq (Eds.), Proceed- abs/2006.12302 (2020). URL: https://arxiv.org/abs/2006. ings of the 24th ACM SIGKDD International Confer- 12302. a r X i v : 2 0 0 6 . 1 2 3 0 2 . ence on Knowledge Discovery & Data Mining, KDD [32] D. Slack, S. Hilgard, E. Jia, S. Singh, H. Lakkaraju, Fooling 2018, London, UK, August 19-23, 2018, ACM, 2018, LIME and SHAP: adversarial attacks on post hoc expla- pp. 2060–2069. URL: https://doi.org/10.1145/3219819. nation methods, in: A. N. Markham, J. Powles, T. Walsh, 3220072. doi:1 0 . 1 1 4 5 / 3 2 1 9 8 1 9 . 3 2 2 0 0 7 2 . A. L. Washington (Eds.), AIES ’20: AAAI/ACM Confer- [24] Y. Tao, Y. Jia, N. Wang, H. Wang, The fact: Taming ence on AI, Ethics, and Society, New York, NY, USA, latent factor models for explainability with factoriza- February 7-8, 2020, ACM, 2020, pp. 180–186. URL: https: tion trees, in: B. Piwowarski, M. Chevalier, É. Gaussier, //doi.org/10.1145/3375627.3375830. doi:1 0 . 1 1 4 5 / 3 3 7 5 6 2 7 . Y. Maarek, J. Nie, F. Scholer (Eds.), Proceedings of 3375830. the 42nd International ACM SIGIR Conference on Re- [33] F. M. Harper, J. A. Konstan, The movielens datasets: search and Development in Information Retrieval, SI- History and context, ACM Trans. Interact. Intell. Syst. 5 GIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. (2016) 19:1–19:19. URL: https://doi.org/10.1145/2827872. 295–304. URL: https://doi.org/10.1145/3331184.3331244. doi:1 0 . 1 1 4 5 / 2 8 2 7 8 7 2 . doi:1 0 . 1 1 4 5 / 3 3 3 1 1 8 4 . 3 3 3 1 2 4 4 . [34] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, [25] J. Gao, X. Wang, Y. Wang, X. Xie, Explainable recom- F. A. Merra, C. Pomo, F. M. Donini, T. D. Noia, El- mendation through attentive multi-view learning, in: liot: A comprehensive and rigorous framework for The Thirty-Third AAAI Conference on Artificial Intel- reproducible recommender systems evaluation, in: ligence, AAAI 2019, The Thirty-First Innovative Appli- F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. 
Sakai cations of Artificial Intelligence Conference, IAAI 2019, (Eds.), SIGIR ’21: The 44th International ACM SIGIR The Ninth AAAI Symposium on Educational Advances Conference on Research and Development in Informa- in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, tion Retrieval, Virtual Event, Canada, July 11-15, 2021, USA, January 27 - February 1, 2019, AAAI Press, 2019, ACM, 2021, pp. 2405–2414. URL: https://doi.org/10.1145/ pp. 3622–3629. URL: https://doi.org/10.1609/aaai.v33i01. 3404835.3463245. doi:1 0 . 1 1 4 5 / 3 4 0 4 8 3 5 . 3 4 6 3 2 4 5 . 33013622. doi:1 0 . 1 6 0 9 / a a a i . v 3 3 i 0 1 . 3 3 0 1 3 6 2 2 . [35] V. W. Anelli, T. D. Noia, E. D. Sciascio, C. Pomo, [26] F. Fusco, M. Vlachos, V. Vasileiadis, K. Wardatzky, A. Ragone, On the discriminative power of hyper- J. Schneider, Reconet: An interpretable neural ar- parameters in cross-validation and how to choose them, chitecture for recommender systems, in: S. Kraus in: T. Bogers, A. Said, P. Brusilovsky, D. Tikk (Eds.), Pro- (Ed.), Proceedings of the Twenty-Eighth International ceedings of the 13th ACM Conference on Recommender Joint Conference on Artificial Intelligence, IJCAI 2019, Systems, RecSys 2019, Copenhagen, Denmark, Septem- Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. ber 16-20, 2019, ACM, 2019, pp. 447–451. URL: https: 2343–2349. URL: https://doi.org/10.24963/ijcai.2019/325. //doi.org/10.1145/3298689.3347010. doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . doi:1 0 . 2 4 9 6 3 / i j c a i . 2 0 1 9 / 3 2 5 . 3347010. [27] M. Tsang, D. Cheng, H. Liu, X. Feng, E. Zhou, Y. Liu, [36] W. Krichene, S. Rendle, On sampled metrics for item Feature interaction interpretability: A case for explain- recommendation, in: R. Gupta, Y. Liu, J. Tang, B. A. ing ad-recommendation systems via neural interaction Prakash (Eds.), KDD ’20: The 26th ACM SIGKDD Con- ference on Knowledge Discovery and Data Mining, Vir- tual Event, CA, USA, August 23-27, 2020, ACM, 2020, pp. 1748–1757. URL: https://doi.org/10.1145/3394486. 3403226. doi:1 0 . 1 1 4 5 / 3 3 9 4 4 8 6 . 3 4 0 3 2 2 6 .