<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3394486.3403226</article-id>
      <title-group>
        <article-title>Constancy in LIME-RS Explanations for Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vito Walter Anelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Bellogín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Maria Donini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Paparella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Pomo</string-name>
          <email>claudio.pomo@poliba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge-aware and Conversational Recommender Systems (KaRS) &amp; 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Bari</institution>
          ,
          <addr-line>via Orabona, 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Autónoma de Madrid</institution>
          ,
          <addr-line>Ciudad Universitaria de Cantoblanco, 28049 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università degli Studi della Tuscia</institution>
          ,
          <addr-line>via Santa Maria in Gradi, 4, 01100 Viterbo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Explainable Recommendation has attracted a lot of attention due to a renewed interest in explainable artificial intelligence. In particular, post-hoc approaches have proved to be the most easily applicable ones to increasingly complex recommendation models, which are then treated as black boxes. The most recent literature has shown that for post-hoc explanations based on local surrogate models, there are problems related to the robustness of the approach itself. This consideration becomes even more relevant in human-related tasks like recommendation. The explanation also has the arduous task of enhancing increasingly relevant aspects of user experience such as transparency or trustworthiness. This paper aims to show how the characteristics of a classical post-hoc model based on surrogates are strongly model-dependent, and how such a model does not prove to be accountable for the explanations it generates.</p>
      </abstract>
      <kwd-group>
        <kwd>explainable recommendation</kwd>
        <kwd>post-hoc explanation</kwd>
        <kwd>local surrogate model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The explanation of a recommendation list plays an
increasingly important role in the interaction of a user
with a recommender system: the pervasiveness of
economic interest and the inscrutability of most Artificial
countability in the behavior of the systems they interact
with. Given the explanation that a system can provide
to a user we identify at least two characteristics that the
explanation part should enforce [
        <xref ref-type="bibr" rid="ref6">1, 2, 3</xref>
        ]:
• Adherence to reality: the explanation should mention
only features that really pertain to the recommended
item. For instance, if the system recommends the
dation by saying “because it is a War Movie” since it is
by no means an adherent description of that movie;
• Constancy in the behavior: when the explanation is
generated based on some sample, and such a sample is
drawn with a probability distribution, the entire
process should not exhibit a random behavior to the user.
      </p>
      <p>For instance, if the explanation for recommending the movie “The Matrix” to the same user is first “because it is a Dystopian Science Fiction”, and then a completely different one, the entire process appears random to the user. Given a particular Machine Learning task, the LIME approach [4] can also be applied to recommendation: LIME-RS [5] applies LIME to the recommendation task and can be considered in all respects as a black-box explainer. This means that it generates an explanation by drawing a huge number of (random) calls to the system, collecting the answers, building a model of the behavior of the system, and then constructing the explanation for the particular recommended item. While adopting a black-box approach lets LIME-RS be applicable to every recommender system, building a model by drawing a huge random sample of system behaviors makes it lose both adherence and constancy, as our experiments show later in this paper.</p>
      <p>This suggests that the direct application of LIME-RS to recommender systems is not advisable, and that further research is needed to assess the usefulness of LIME-RS in explaining recommendations.</p>
      <p>The paper is organized as follows: Section 2 reviews the state of the art on explanation in recommendation; Section 3 details LIME to make the paper self-contained. Section 4 shows the results of experiments with two mainstream recommendation models: Attribute Item-kNN and Vector Space Model. We discuss the outcomes of the experiments in Section 5, and conclude with Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>
          In recent years, the theme of Explanation in Artificial Intelligence has come to the foreground, capturing the
attention not only of the Machine Learning and related
communities – that deal more specifically with the
algorithmic part – but also of fields closer to Social Sciences,
such as Sociology or Cognitivism, which look with great
interest to this area of research [1]. The growing interest
in this area is also dictated by new regulations of both
Europe [6] and US [7] with respect to sensitive issues in
the field of personal data processing, and legal
responsibility. This trend has also touched the research field
of recommender systems [
          <xref ref-type="bibr" rid="ref2 ref31 ref42">8, 9, 10, 11</xref>
          ]. However, topics
such as explanation are by no means new to this field.
In fact, we can date back to 2014 the introduction of the
term “explainable recommendation” [12], although the
need to provide an explanation that accompanies the
recommendation is a need that emerged as early as 1999
by Schafer et al. [13], when people began trying to
explain a recommendation with other similar items familiar
to the user who received that recommendation.
        </p>
        <p>
          The surge of interest around the topic of the
explanation of recommendations coincides also with the
awareness achieved in considering metrics beyond accuracy
as fundamental in evaluating a recommendation
system [14, 15]. Indeed, all of the well-known metrics of
novelty, diversity, and serendipity are intended to
improve the user experience, and in this respect, a key role
is played by explanation [3, 16]. “Why are you
recommending that?”—this is the question that usually
accompanies the user when a suggestion is provided. Tintarev
and Masthoff [
          <xref ref-type="bibr" rid="ref6">2</xref>
          ] detailed in a scrupulous way the
aspects involved in the process of explanation when we
talk about recommendation. They identified 7 aspects:
user’s trust, satisfaction, persuasiveness, efficiency,
effectiveness, scrutability, and transparency.
        </p>
        <p>This is the starting point to define Explainable Recommendation as a task that aims to provide suggestions to the users and make them aware of the recommendation process, explaining also why that specific object has been suggested. Gedikli et al. [3] evaluated different types of explanations and drew a set of guidelines to decide what the best explanation to equip a recommendation system with is. This is due to the fact that popular recommendation systems are based on Matrix Factorization (MF) [17]; for this type of model, trying to provide an explanation opens the way to new challenges [1, 18, 19, 20]. There are two different approaches to address this type of issue.
• On the one hand, the model-intrinsic explanation strategy aims to create a user-friendly recommendation model or to encapsulate an explaining mechanism. However, as Lipton [21] points out, this strategy weighs on the trade-off between the transparency and the accuracy of the model. Indeed, if the goal becomes to justify recommendations, the purpose of the system is no longer to provide only personalized recommendations, resulting in a distortion of the recommendation process.
• On the other hand, we have the model-agnostic [22] approach, also known as post-hoc [23], which does not require intervening on the internal mechanisms of the recommendation model and therefore does not affect its performance in terms of accuracy.</p>
        <p>
          Most recommendation algorithms take an MF-approach,
and thus the entire recommendation process is based on
the interaction of latent factors that bring out the level of
liking for an item with respect to a user. Many post-hoc
explanation methods have been proposed for precisely
these types of recommendation models. It seems
evident that the most difficult challenge for this type of
approach lies in making these latent factors explicit and
understandable for the user [9]. Peake and Wang [23]
generate an explanation by exploiting the association
rules between features; Tao et al. [24] in their work, find
benefit from regression trees to drive learning, and then
explain the latent space; instead, Gao et al. [
          <xref ref-type="bibr" rid="ref1">25</xref>
          ] try a deep
model based on attention mechanisms to make relevant
features emerge. Along the same lines are Pan et al. [11],
who present a feature mapping approach that maps the
uninterpretable general features onto the interpretable
aspect features. Among other approaches to consider, [12]
proposes an explicit factor model that builds a mapping
between the interpretable features and the latent space.
On the same line we also find the work by Fusco et al.
[26]. In their work, they provide an approach to identify,
in a neural model, which features contribute most to the
recommendation. However, these post-hoc explanation
approaches turn out to be built for very specific
models. Purely model-agnostic approaches include the recent
work of Tsang et al. [27], who present GLIDER, an
approach to estimate interactions between features rather
than focusing on the significance of individual features as in the original
LIME [
          <xref ref-type="bibr" rid="ref9">4</xref>
          ] algorithm. This type of solution is constructed
regardless of the recommendation model.
        </p>
        <p>Our paper focuses on the operation of LIME, a model-agnostic method for surrogate-based local explanation.</p>
      </sec>
      <sec id="sec-2-2">
        <p>When a user-item pair is provided, this model returns as an outcome of the explanation a set of feature weights, for any recommender system. However, the recommendation task is very specific, so there is a version called LIME-RS [5] that applies the explanation model technique to the recommendation domain. In this way, any recommender is seen as a black box, so LIME-RS plays the role of a model-agnostic explainer whose result is a set of interpretable features and their relative importance.</p>
        <p>The goal of LIME-RS is to exploit the predictive power of the recommendation (black-box) model to generate an explanation about the suggestion of a particular item for a user. In this respect, it exploits a neighborhood drawn according to a generic distribution around the candidate item for the explanation. It seems obvious that the choice of the neighborhood plays a crucial role within the process of explanation generation by LIME-RS. We can compare this sample extraction action to a perturbation of the user-item pair we are using to generate the explanation. In the case of LIME-RS this perturbation must generate consistent samples with respect to the source dataset. We see that this choice represents a critical issue for all the post-hoc models which base their expressiveness on the locality of the instance to explain.</p>
        <p>This trend is confirmed in several papers addressing this issue of surrogate-based explanation systems such as LIME and SHAP [28]. In two recent papers, Alvarez-Melis and Jaakkola [29] have shown how the explanations generated with LIME are not very robust: their contribution aims to bring out how small variations or perturbations in the input data cause significant variations in the explanation of that specific input [30]. In their paper, a new strategy is introduced to strengthen these methods by exploiting local Lipschitz continuity. By deeply investigating this drawback, they introduced self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models.</p>
        <p>Saito et al. [31] also explored this issue by turning their gaze to different types of sampling to make the result of an explanation generated through LIME more robust. In particular, in their work, they introduce the possibility of generating realistic samples produced with a Generative Adversarial Network. Finally, Slack et al. [32] adopt a similar solution in order to control the perturbation generating neighborhood data points, attempting to mitigate the generation of unreliable explanations while maintaining a stable black-box model of prediction.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background Technology</title>
      <sec id="sec-3-1">
        <p>From a formal point of view, we can define a LIME-generated explanation for a generic instance x ∈ X produced by a model f as:</p>
        <p>ξ(x) = argmin g∈G ℒ(f, g, πx) + Ω(g)   (1)</p>
        <p>where ℒ represents the fidelity of the surrogate model g to the original f, and g represents a particular instance of the class G of all possible explainable models. Among all the possible models, the one most frequently chosen is based on a linear prediction. In this case, an explanation refers to the weights of the most important interpretable features, which, when combined, minimize the divergence from the black-box model. The function πx measures the distance between the instance to be explained x ∈ X and the samples x′ ∈ X extracted from the training set to train the model g. Finally, Ω(g) represents the complexity of the explanation model.</p>
        <p>Two pieces of evidence make the application of LIME possible: (i) the existence of a feature space Z on which to train the surrogate model of f, and (ii) the presence of a surjective function that maps the space mentioned above (Z) to the original space of instances (X). Going into more detail, we consider the fidelity function ℒ as the mean square deviation between the prediction for a generic instance x′ ∈ X of the black-box model and that generated for the counterpart z′ ∈ Z by the surrogate model. Starting from these considerations we can express ℒ with the following formula:</p>
        <p>ℒ(f, g, πx) = Σ x′∈X, z′∈Z πx(x′) ⋅ (f(x′) − g(z′))²   (2)</p>
        <p>In the formula above, πx plays a fundamental role as it expresses the distance between the instance to be explained and the sampled instance used to build the surrogate model. From a generic perspective, we can express this function as a kernel function like πx = exp(−D(x, x′)²/σ²), where D is any measure of distance. The full impact of this distance is captured when the fidelity function also considers the transformation of the surrogate sample in the original space. As mentioned earlier, we consider a surjective function h that maps the original space into the feature space, h : X → Z. We can also consider the function that allows us to move in the opposite direction, h⁻¹ : Z → X. At this point, Equation (2) becomes:</p>
        <p>ℒ(f, g, πx, h) = Σ z′∈Z πx(h⁻¹(z′)) ⋅ (f(h⁻¹(z′)) − g(z′))²   (3)</p>
      </sec>
      <sec id="sec-3-2">
        <p>From this last equation, we can grasp the criticality of the surjective mapping function. Indeed, the neighborhood in Z-space cannot be guaranteed after the transformation into X-space. Thus, some samples selected to train the surrogate model could not satisfy the neighborhood criterion for which they were chosen.</p>
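        <p>As an illustration of Equations (1)–(2), the local surrogate is a weighted least-squares fit in which each sampled point is weighted by the proximity kernel. The following is a minimal, self-contained sketch (the function names are ours, not part of LIME or LIME-RS, and a single interpretable feature is assumed):

```python
import math

def pi_x(d, sigma=1.0):
    """Proximity kernel pi_x = exp(-D(x, x')^2 / sigma^2) from Section 3."""
    return math.exp(-(d * d) / (sigma * sigma))

def fit_linear_surrogate(zs, ys, ds, sigma=1.0):
    """Fit g(z) = slope * z + intercept by weighted least squares,
    minimizing the fidelity loss of Eq. (2):
        sum_i pi_x(d_i) * (y_i - g(z_i))^2
    zs: interpretable (e.g. binary) features of the sampled points,
    ys: black-box predictions for those points,
    ds: distances of each sample from the instance to explain."""
    ws = [pi_x(d, sigma) for d in ds]
    sw = sum(ws)
    # weighted means of the feature and of the black-box response
    zbar = sum(w * z for w, z in zip(ws, zs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    # weighted covariance / variance give the closed-form solution
    cov = sum(w * (z - zbar) * (y - ybar) for w, z, y in zip(ws, zs, ys))
    var = sum(w * (z - zbar) ** 2 for w, z in zip(ws, zs))
    slope = cov / var
    return slope, ybar - slope * zbar
```

The returned slope is the weight that a linear LIME-style explanation would report for that interpretable feature.</p>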
        <p>We must therefore stress the centrality of the sampling function: how do we extract the neighborhood of the instance to be explained? If we look at the application of LIME to the recommendation domain, we can compare this sampling action to a local perturbation around our instance x; however, this perturbation generates samples x′ which might contain inconsistencies. As an example, suppose we want to explain James's feeling about the movie The Matrix. The original triple of the instance to be explained associates to the user-item pair the genre of the movie (representing the explainable space), and in this case it is of the type ⟨James, TheMatrix, Sci-Fi⟩.</p>
        <p>A perturbation around this instance could generate inconsistencies, i.e., triples of the type ⟨James, TheMatrix, g⟩ where the genre g does not pertain to the movie. For this reason, in LIME-RS the perturbation considers only real and not synthetic data. This choice is dictated by the avoidance of the out-of-sample (OOS) phenomenon. Closely related to this problem is whether the interpretation examples selected in LIME-RS are able to capture the locality through disturbance mechanisms effectively. One of the disadvantages of LIME-like methods is that they sometimes fail to estimate an appropriate local replacement model but instead generate a model that focuses on explaining the examples and is also affected by more general trends in the data.</p>
        <p>This issue is central to our work, and it involves two aspects: (i) the first one concerns precisely the sampling function. In the LIME-RS implementation, this function is driven by the popularity distribution of the items within the dataset. (ii) The second critical issue concerns the model's ability to wittily discriminate the user's taste from the neighborhood extracted to build the surrogate model. A model that squashes too much on bias, or is inaccurate, cannot bring out the peculiarities of user taste that are critical in building the explainable model, which are, in turn, useful in generating the explanation for the instance of interest.</p>
        <p>These observations dictate the two research questions that motivated our work:
RQ1 Can we consider the surrogate-based model on which LIME-RS is built to always generate the same explanations, or does the extraction of a different neighborhood severely impact the system's constancy?
RQ2 Are LIME-RS explanations adherent to item content, despite the fact that the sampling function is uncritical and based only on popularity?</p>
        <p>4. Experiments</p>
        <p>This section is devoted to illustrating how the experimental campaign was conducted. The datasets used for this phase of experimentation are Movielens 1M [33], Movielens Small [33], and Yahoo! Movies¹. Their characteristics are shown in Table 1.</p>
        <p>Table 1
Characteristics of the datasets involved in the experiments.
                 Users  Items  Transactions  Sparsity
Movielens 1M     6040   3675   797758        0,9640
Movielens Small  610    8990   80640         0,9853
Yahoo! Movies    7636   8429   160149        0,9975</p>
        <p>As for the choice of the models to be used in this work, we selected two well-known recommendation models that are able to exploit the information content of the items to produce a recommendation: Attribute Item kNN (Att-Item-kNN) and Vector Space Model (VSM). The two chosen models represent the simplest solution to address the recommendation problem by exploiting the content associated with the items in the catalog. Att-Item-kNN exploits the characteristics of neighborhood-based models but expresses the representation of the items in terms of their content and, based on this representation, it computes a similarity between users. Starting from this similarity and exploiting the collaborative contribution in terms of interactions between users and items, Att-Item-kNN tries to estimate the level of liking of the items in the catalog. VSM represents both users and items in a new space to link users and items to the considered information content. Once obtained this new representation, with an appropriate function of similarity, VSM estimates which are the most appealing items for a specific user. The implementations of both models are available in the ELLIOT [34] evaluation framework. This benchmarking framework was used to select the best configuration for the two recommendation models by exploiting the corresponding configuration file².</p>
        <p>Our experiments start by selecting the best configurations based on nDCG [35, 36] for the two models on the considered datasets. Then, we generate the top-10 list of recommendations for each user, and we take into account the first item i1 on these lists for each user u. Finally, each recommendation pair (u, i1) is explained with LIME-RS. The explanation consists of a weighted vector of pairs (g, w), where g is a genre of the movies in the dataset – i.e., a feature – and w is the weight associated to g by LIME-RS within the explanation. Then, this vector is sorted by descending weights. In this way, the genres of the movies which play a key role within the recommendation, as explained by LIME-RS, are highlighted at the first positions of the vector. These operations are then repeated n = 10 times, changing the seed each time, as 10 is likely to be a good choice to detect a general pattern in the behavior of LIME-RS.</p>
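        <p>The experimental protocol described above (explain each pair n = 10 times under different seeds and keep the weight-sorted genre vector) can be sketched as follows; toy_lime_rs_explain is a hypothetical stand-in for the real LIME-RS call, used only to make the loop runnable:

```python
import random

GENRES = ["Action", "Comedy", "Drama", "Sci-Fi", "War"]

def toy_lime_rs_explain(user, item, seed):
    """Hypothetical stand-in for LIME-RS: returns (genre, weight) pairs
    sorted by descending weight. The real system would fit a local
    surrogate on samples drawn from the dataset."""
    rng = random.Random(hash((user, item, seed)))
    return sorted(((g, rng.random()) for g in GENRES),
                  key=lambda gw: gw[1], reverse=True)

def collect_explanations(pairs, n=10):
    """Repeat the explanation n times, changing the seed each time."""
    runs = {}
    for user, item in pairs:
        runs[(user, item)] = [toy_lime_rs_explain(user, item, s)
                              for s in range(n)]
    return runs
```

The resulting groups of n sorted explanations per (u, i1) pair are the input of the constancy and adherence analyses.</p>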
      </sec>
      <sec id="sec-3-3">
        <p>¹ R4 - Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 http://webscope.sandbox.yahoo.com/.</p>
        <p>² https://tny.sh/basic_limers</p>
        <p>At this point, for each pair (u, i1), we have a group of 10 explanations ordered by descending values of w, which we exploit to answer our two research questions.</p>
        <p>RQ1. Empirically, since in a real scenario of recommendation a too verbose explanation is not useful, we consider only the first five features in the sorted vector representing the explanation of each recommendation. In order to verify the constancy of the behavior of LIME-RS, given a (u, i1) pair, we exploit the n previously generated explanations for this pair. Then, for p = 1, 2, …, 5, we define Gp as the multiset of genres that appear in the p-th position – for instance, if “Sci-Fi” occurs in the first position of 7 explanations, then “Sci-Fi” occurs 7 times in the multiset G1, and similarly for the other genres and multisets. Then, we compute the frequency of genres in each position as follows: given a position p, a genre g, and the number n of generated explanations for a given pair (u, i1), the frequency fgp of g in the p-th position is computed as:</p>
        <p>fgp = ||{g | g ∈ Gp}|| / n   (4)</p>
        <p>where ||⋅|| denotes the cardinality of a multiset. Then, all this information is collected for each user in five lists — one for each of the p positions — of pairs ⟨g, fgp⟩ sorted by frequency. One can observe that the computed frequency is an estimation of the probability that a given genre is put in that position within the explanation generated by LIME-RS sorted by w values. Hence, the pair ⟨g, maxg(fgp)⟩ describes the genre with the highest frequency in the p-th position of the explanation for a pair (u, i1). Finally, it makes sense to compute the mean Mp of the highest probability values in each position p of the explanations for each pair (u, i1). Formally, by setting a position p, the mean Mp is computed as:</p>
        <p>Mp = ( Σ u=1..|U| maxg(fgp) ) / |U|   (5)</p>
      </sec>
      <sec id="sec-3-4">
        <p>where U is the set of users for whom it was possible to generate a recommendation. Observing the value of Mp, we can state to what extent LIME-RS is constant in providing the explanations up to the p-th feature: the higher the value of Mp, the higher the constancy of LIME-RS concerning the p-th feature.</p>
      </sec>
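      <p>The position-wise frequency of Equation (4) and the mean constancy of Equation (5) reduce to counting genres per position. A minimal sketch (the data layout – one list of weight-sorted genre lists per user – is our assumption):

```python
from collections import Counter

def constancy(explanations_per_user, positions=5):
    """Mean highest genre frequency per position (Eqs. (4)-(5)).
    explanations_per_user: {user: [explanation, ...]}, where each
    explanation is a list of genres sorted by descending weight and
    is assumed to contain at least `positions` genres."""
    means = []
    for p in range(positions):
        best = []
        for expls in explanations_per_user.values():
            counts = Counter(e[p] for e in expls)   # multiset G_p
            n = sum(counts.values())                # explanations for the pair
            best.append(max(counts.values()) / n)   # max_g f_gp
        means.append(sum(best) / len(best))         # M_p, Eq. (5)
    return means
```

A value of 1.0 at position p means every repetition placed the same genre there; lower values indicate a more erratic behavior.</p>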
      <sec id="sec-3-5">
        <p>Table 2
Constancy of LIME-RS. A value equal to 0 means that the genre(s) provided by LIME-RS in the first p position(s) is always different (worst case: completely inconstant behavior); a value equal to 1 means that the genre(s) provided by LIME-RS in the first p position(s) is always the same (total constancy).
Att-Item-kNN: Movielens 1M, Movielens Small, Yahoo! Movies
VSM: Movielens 1M, Movielens Small, Yahoo! Movies</p>
        <p>RQ2. To verify the adherence of LIME-RS explanations to the item content, for each generated explanation we compared the set Gp of its first p genres with the set of genres Gi1 characterizing the first recommended item, for p = 1, 2, 3.</p>
        <p>Upon completion of this operation for all the n explanations generated for each (u, i1) pair, we computed the number of times we obtained an empty intersection of these sets, normalized by the total number of explanations n × |U|, in order to understand to what extent an explanation is (not) adherent to the item. Formally, for a given value of p, the value hp is computed as:</p>
        <p>hp = ( Σ [(Gp ∩ Gi1) = ∅] ) / (n × |U|)   (6)</p>
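        <p>The adherence error of Equation (6) can be sketched in the same setting (the data layout and the item_genres mapping are our naming, used for illustration):

```python
def adherence_error(explanations, item_genres, p=3):
    """h_p of Eq. (6): fraction of explanations whose first p genres
    share no genre with the explained item (1 = worst, 0 = best).
    explanations: {(user, item): [[(genre, weight), ...], ...]}."""
    total, empty = 0, 0
    for (user, item), runs in explanations.items():
        truth = set(item_genres[item])
        for run in runs:
            top = set(g for g, w in run[:p])  # first p explanatory genres
            total += 1
            if not top.intersection(truth):   # empty intersection with G_i1
                empty += 1
    return empty / total
```

Lower values mean that the explanation more often mentions at least one genre that really describes the item.</p>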
        <p>where U is the set of users of the dataset for whom it was possible to generate a recommendation, n is the number of generated explanations for each pair (u, i1), and by Σ[⋯] we mean that we sum 1 if the condition inside [⋯] is true, and 0 otherwise. One can note that hp ∈ [0, 1], where a value of 1 indicates the worst case, in which for none of the n explanations under consideration at least one genre of the item is in the first p features of the explanation. In contrast, the lower the value of hp, the higher the adherence of LIME-RS.</p>
        <p>By looking at Table 2, we can see that for Att-Item-kNN the LIME-RS explanation model is reliable as long as it considers at most three features in the weighted vector presented as an explanation of the recommendation. Extending the explanation to four features, we have a constancy that falls below 65%, while arriving at an explanation with five features is more likely to run into explanations that exhibit an unacceptably random behavior. On the other hand, we can see that VSM obtains (except for the first feature with Movielens 1M) better performance than Att-Item-kNN in terms of constancy, with peaks up to 97%. A straightforward consequence of these observations could be analyzed in terms of confidence or probability. If the constancy steadily decreases, it means that the probability that LIME-RS suggests the same explanatory feature decreases. In practical terms, we could say that LIME-RS is less confident about its explanation. In fact, this is the behavior of Att-Item-kNN. Conversely, VSM shows high values of constancy, resulting in a more ”deterministic” behavior. With VSM, LIME-RS is more confident of its explanations. This could increase the user's trustworthiness, since LIME-RS behavior is more reliable. However, these results could also be interpreted together with the ones from Table 3. They show how often at least one feature – out of the p features provided by LIME-RS – adheres to the features that describe the item being explained. In other words, they measure the probability that LIME-RS succeeds in reconstructing at least one feature of a specific item.</p>
        <p>Table 3
Adherence of LIME-RS. A value equal to 1 means that no genre provided by LIME-RS in the first p features is among the real genres of the movie (worst case); a value equal to 0 means that at least one genre provided by LIME-RS in the first p features is always among the real genres of the movie.
                 h1      h2      h3
Att-Item-kNN
Movielens 1M     0,2764  0,1151  0,0480
Movielens Small  0,2374  0,0605  0,0188
Yahoo! Movies    0,3597  0,1202  0,0476
VSM
Movielens 1M     0,5357  0,2539  0,1088
Movielens Small  0,4384  0,1674  0,0403
Yahoo! Movies    0,1013  0,01348 0,0021</p>
        <p>Observing the results from Table 3, Att-Item-kNN performs well in terms of adherence since, in approximately 75% of cases, even considering only the main feature of the explanation, it falls into the set of the item genres, as for the Movielens dataset family. This performance is about 10% lower for Yahoo! Movies. In contrast with this result, VSM shows poor performance on both datasets of the Movielens family, failing half the time on Movielens 1M as regards adherence. A surprising result is achieved for the Yahoo! Movies dataset because, enlarging the study to the first three features of the explanation, the error is almost completely absent. The reasons we found to explain this difference in the performances concern the characteristics and the quality of the dataset, as we highlight later on.</p>
        <p>5. Discussion</p>
        <p>This work investigates how well a post-hoc approach based on local surrogates – such as the LIME-RS algorithm – explains a recommendation. Instead of studying the impact of explanations on users (that is a well-studied topic in the literature and is beyond our scope), we focus on the objective evidence that could emerge. In this respect, we have designed specific experiments, which introduced two different metrics, to evaluate adherence and constancy for this kind of algorithm. For instance, Table 2 shows a different behavior for Att-Item-kNN and VSM. On the one hand, Att-Item-kNN seems to guarantee a good constancy in explanations up to the third feature. This suggests that an explanation that exploits the first three features of the list produced by LIME-RS could be barely considered as reliable (i.e., reaching a constancy of 0.69 on Movielens 1M). On the other hand, VSM exhibits a much more ”stable” behavior, demonstrating in all cases high values of constancy.</p>
        <p>Combining the results of Table 2 and those of Table 3, Att-Item-kNN, as already mentioned, shows good performance regarding adherence and identifies 3 times out of 4 the first fundamental feature of the explanation among those present in the set of features originally associated with the item. As expected, if the number p of LIME-RS-reconstructed features increases, the number of times such a set has a nonempty intersection (with the features belonging to the item) – i.e., adherence – increases. It could be noted that Att-Item-kNN on Yahoo! Movies shows the worst behavior in terms of adherence. VSM shows a different behavior. Despite the excellent performance regarding constancy, it could be observed that on both Movielens datasets the performance in terms of adherence is poor, and worse for Movielens 1M than for Movielens Small. Surprisingly, on Yahoo! Movies, VSM performs much better, and the errors are almost negligible.</p>
        <p>The difference between the two models could be due to many reasons. In the following we analyze possible relations between such behaviors and two of them: popularity bias in the dataset and characteristics of side information. On the one hand, if the dataset is affected by popularity bias, it would be a well-studied cause of confusion for LIME-RS. On the other hand, the characteristics of the side information associated with the datasets could dramatically influence the performance of the two recommendation models. To assess these hypotheses, we have evaluated (see Table 4) the recommendation lists produced by Att-Item-kNN and VSM considering nDCG, Hit Rate (HR), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR). Table 4 shows that the chosen datasets are strongly affected by popularity bias. Indeed, MostPop is the best performing approach, and the two ”personalized” models fail to produce accurate results. This triggers the second aspect, which concerns the quality of the side information.</p>
        <p>Table 4
Results of the experiments on the models involved in the experiments. Models are optimized according to the value of nDCG.
model  nDCG  Recall  HR  Precision  MAP  MRR</p>
        <p>We propose two different measures to understand how reliable an explanation based on LIME-RS is: (i) constancy was used to assess the impact of the random sampling phase of LIME-RS on the provided explanation – ideally the explanation should
        <p>Movielens 1m remain constant in spite of the sample used to obtain
RMaonsdtPomop 00,,00804551 00,,00307298 00,,40584689 00,0,100948 00,0,101954 00,,20220654 it; (ii) adherence was proposed to understand the
reconAtt-Item-kNN 0,0229 0,0165 0,2425 0,0383 0,0387 0,0888 structive power of LIME-RS with respect to the features
VSM 0,0173 0,0109 0,2106 0,0292 0,0306 0,0741 that belong to the item involved in the explanation –
ide</p>
        <p>Movielens Small ally, LIME-RS should provide an explanation that always
RMaonsdtPomop 00,,00701350 00,,00308193 00,,30940922 00,,00704489 00,,00901628 00,,10926015 adheres to the actual features of the recommended item.
Att-Item-kNN 0,0124 0,0068 0,1459 0,0197 0,0191 0,0484 To test both constancy and adherence, we trained and
VSM 0,0085 0,0056 0,1000 0,0111 0,0123 0.0350 optimized two content-based recommendation models:
Random 0,0005 0,00Y0a8hoo!0M,00o5v1ies 0,0005 0,0005 0,0015 Attribute Item-kNN (Att-Item-kNN), and a classical
VecMostPop 0,2188 0,2589 0,596 0,1067 0,1501 0,3447 tor Space Model. For each model, and for all datasets
AVtStM-Item-kNN 00,,00123115 00,,00127612 00,,01715948 00,,00018312 00,,00019525 00,,00246315 exploited in the study, we generated recommendation
lists for all users. We exploited the first item of these
top-10 lists to produce the explanations that were then
of the content. The results suggest that the side infor- the subject of our investigation. It turned out that for
mation is not good enough to boost the recommenda- models built with a large collaborative input such as
tion systems in producing meaningful recommendations. Att-Item-kNN, LIME-RS produces fairly constant
explaIn fact, the three datasets seem to have an informative nations up to a length of three features. Moreover, these
content that is not adequate to generate appealing rec- explanations turn out to be adherent with respect to the
ommendations. We observe that, from an informative item between 65% and 75% of the cases in which only
point of view, the Yahoo! Movies dataset is slightly more the first feature of the weighted vector of explanations
complete: 22 genres against the 18 genres available on is considered. VSM shows a diferent behavior where
Movielens. Although the VSM model does not show ex- explanations are much more constant, but sufer a lot in
cellent performance, in combination with LIME-RS, it terms of adherence, except for the Yahoo! Movies dataset
provides explanations that are very reliable in terms of for which the explanation model showed outstanding
constancy (see Table 2) and adherence (see Table 3) to performance despite the poor ability of VSM to provide
the actual content of the items being explained. sound recommendations to users.
From the designer perspective, there is also a prag- In our experiments, some evidence started to emerge
matic way to look at the experimental results. Suppose a highlighting that the adopted explanation model is
condideveloper needs an of-the-shelf way of generating expla- tioned not only by the accuracy of the black-box model it
nations for recommendations, and chooses LIME-RS to tries to explain but also by the quality of the side
informado that. Our results suggest that if the explainer employs tion used to train the model. The latter result deserves to
a Movielens dataset with Att-Item-kNN model, then it be adequately investigated to search for a link at a higher
is better to run the explainer several times. Indeed, the level of detail. We plan to apply our experiments also to
ifrst feature obtained for the explanation could change other recommendation models, to see whether the
probaround 1 time every 5 trials (first column of Table 2), lems with adherence and constancy that we found for the
and once such a feature is obtained, it is better to check two tested models show up also in other situations. We
whether this feature is really among the ones describing will also investigate what impact structured knowledge
the item, since 1 time out of 4 the feature can be wrong has on this performance by exploiting models capable of
(first column of Table 3). Moreover, if the explainer em- leveraging this type of content. In addition, it would also
ploys the Yahoo! Movies dataset with VSM model, then be the case to try diferent reference domains with richer
probably there is no need to run the explainer twice, since datasets of side information to understand what impact
its behavior is constant 97% of the times, while the fea- content quality has on this type of explainer.
ture is wrong only 10% of the times. However, the low
performance of such a model is to be taken into account.</p>
      </sec>
    </sec>
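      <p>To make the two measures concrete, the following is a minimal sketch of how constancy and adherence could be computed from repeated explainer runs. It is our illustration, not the paper's code: the function names and the majority-agreement reading of constancy are assumptions, and the explainer is stubbed by plain lists of ranked feature names.</p>

```python
from collections import Counter

def constancy(runs, k=1):
    """Fraction of runs whose top-k feature set matches the modal top-k set.

    `runs` is a list of ranked feature lists returned by repeated executions
    of an explainer (e.g., LIME-RS) on the same (user, item) pair; repeated
    runs may differ because of the random sampling phase.
    """
    top_sets = [frozenset(run[:k]) for run in runs]
    modal_count = Counter(top_sets).most_common(1)[0][1]
    return modal_count / len(top_sets)

def adherence(explanation, item_features, k=1):
    """True if the top-k explanation features intersect the item's real features."""
    return len(set(explanation[:k]).intersection(item_features)) > 0

# Example: three runs agree twice on the first feature, so constancy(k=1) is 2/3.
runs = [["Comedy", "Drama"], ["Comedy", "Action"], ["Drama", "Comedy"]]
agreement = constancy(runs, k=1)                       # 2/3
ok = adherence(runs[0], {"Comedy", "Romance"}, k=1)    # True: "Comedy" is a real genre
```

      <p>Under this sketch, the pragmatic advice above amounts to calling the explainer several times, keeping the modal first feature, and accepting it only if adherence holds for the recommended item.</p>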
    <sec id="sec-4">
      <title>Acknowledgments</title>
The authors acknowledge partial support of PID2019-108965GB-I00, PON ARS01_00876 BIO-D, Casa delle Tecnologie Emergenti della Città di Matera, PON ARS01_00821 FLET4.0, PIA Servizi Locali 2.0, H2020 Passapartout - Grant n. 101016956, PIA ERP4.0, and IPZS-PRJ4_IA_NORMATIVO.</p>
    </sec>
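      <p>The popularity-bias check behind Table 4 can be sketched as follows. This is an illustrative re-implementation under our own simplifying assumptions (binary relevance, one ranked list per user), not the evaluation code used in the experiments.</p>

```python
import math
from collections import Counter

def hit_rate(recommended, relevant):
    """HR for one user: 1 if the top-n list contains at least one relevant item."""
    return int(any(item in relevant for item in recommended))

def ndcg(recommended, relevant):
    """Binary-relevance nDCG for one user's ranked list."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended) if item in relevant)
    ideal_hits = min(len(relevant), len(recommended))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def most_pop(train_interactions, n=10):
    """MostPop baseline: the n items with the most (user, item) training interactions."""
    counts = Counter(item for _, item in train_interactions)
    return [item for item, _ in counts.most_common(n)]
```

      <p>Averaging these scores over users and finding that the unpersonalized most_pop() list beats the personalized models, as in Table 4, is the symptom of popularity bias discussed above.</p>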
  </body>
  <back>
  </back>
</article>