=Paper= {{Paper |id=Vol-2960/paper11 |storemode=property |title=Adherence and Constancy in LIME-RS Explanations for Recommendation (Long paper) |pdfUrl=https://ceur-ws.org/Vol-2960/paper11.pdf |volume=Vol-2960 |authors=Vito Walter Anelli,Alejandro Bellogin,Tommaso Di Noia,Francesco Maria Donini,Vincenzo Paparella,Claudio Pomo |dblpUrl=https://dblp.org/rec/conf/recsys/AnelliBNDPP21 }} ==Adherence and Constancy in LIME-RS Explanations for Recommendation (Long paper)== https://ceur-ws.org/Vol-2960/paper11.pdf
Adherence and Constancy in LIME-RS Explanations for
Recommendation
Vito Walter Anelli1 , Alejandro Bellogín2 , Tommaso Di Noia1 , Francesco Maria Donini3 ,
Vincenzo Paparella1 and Claudio Pomo1
1
  Politecnico di Bari, via Orabona, 4, 70125 Bari, Italy
2
  Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco, 28049 Madrid, Spain
3
  Università degli Studi della Tuscia, via Santa Maria in Gradi, 4, 01100 Viterbo, Italy


                                       Abstract
                                       Explainable Recommendation has attracted a lot of attention due to a renewed interest in explainable artificial intelligence. In
                                       particular, post-hoc approaches have proved to be the most easily applicable ones to increasingly complex recommendation
                                       models, which are then treated as black boxes. The most recent literature has shown that for post-hoc explanations based
                                       on local surrogate models, there are problems related to the robustness of the approach itself. This consideration becomes
                                       even more relevant in human-related tasks like recommendation. The explanation also has the arduous task of enhancing
                                       increasingly relevant aspects of user experience such as transparency or trustworthiness. This paper aims to show how
                                       the characteristics of a classical post-hoc model based on surrogates is strongly model-dependent and does not prove to be
                                       accountable for the explanations generated.

                                       Keywords
                                       explainable recommendation, post-hoc explanation, local surrogate model



1. Introduction                                                                                     is an Acrobatic Duels Movie”, for the same user, this
                                                                                                    behavior would be perceived as nondeterministic, and
The explanation of a recommendation list plays an in-                                               thus reducing its trustworthiness.
creasingly important role in the interaction of a user
with a recommender system: the pervasiveness of eco-                                                 Among several ways of generating explanations, we
nomic interest and the inscrutability of most Artificial                                          study here the application of LIME [4] to the recommen-
Intelligence systems make users ask for some form of ac-                                          dation process. LIME is an algorithm that can explain
countability in the behavior of the systems they interact                                         the predictions of any classifier or regressor in a faithful
with. Given the explanation that a system can provide                                             way, by approximating it locally with an interpretable
to a user we identify at least two characteristics that the                                       model. LIME belongs to the category of post-hoc algo-
explanation part should enforce [1, 2, 3]:                                                        rithms and it sees the prediction system as a black box
                                                                                                  by ignoring its underlying operations and algorithms.
• Adherence to reality: the explanation should mention Since we can consider the recommendation task as a
              only features that really pertain to the recommended particular Machine Learning task, the LIME approach
              item. For instance, if the system recommends the can also be applied to recommendation. LIME-RS [5]
              movie “Titanic”, it should not explain this recommen- is an adaptation of the general algorithm to the recom-
              dation by saying “because it is a War Movie” since it is mendation task and can be considered in all respects as
              by no means an adherent description of that movie;                                  a black-box explainer. This means that it generates an
                                                                                                  explanation by drawing a huge number of (random) calls
• Constancy in the behavior: when the explanation is
                                                                                                  to the system, collecting the answers, building a model
              generated based on some sample, and such a sample is
                                                                                                  of behavior of the system, and then constructing the ex-
              drawn with a probability distribution, the entire pro-
                                                                                                  planation for the particular recommended item. While
              cess should not exhibit a random behavior to the user.
                                                                                                  the fact of adopting a black-box approach lets LIME-RS
              For instance, if the explanation for recommending the
                                                                                                  to be applicable for every recommender system, the way
              movie “The Matrix” to the same user is first “because
                                                                                                  of building a model by drawing a huge random sample
              it is a Dystopian Science Fiction”, and then “because it
                                                                                                  of system behaviors makes it lose both adherence and
3rd Edition of Knowledge-aware and Conversational Recommender                                     constancy, as our experiments show later on this paper.
Systems (KaRS) & 5th Edition of Recommendation in Complex                                         This suggests that the direct application of LIME-RS to
Environments (ComplexRec) Joint Workshop @ RecSys 2021,                                           recommender systems is not advisable, and that further
September 27–1 October 2021, Amsterdam, Netherlands                                               research is needed to assess the usefulness of LIME-RS
Envelope-Open claudio.pomo@poliba.it (C. Pomo)
                     © 2021 Copyright for this paper by its authors. Use permitted under Creative in explaining recommendations.
                     Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings      CEUR Workshop Proceedings (CEUR-WS.org)
                  http://ceur-ws.org
                  ISSN 1613-0073
                                                                                                     The paper is organized as follows: Section 2 reviews
the state of the art on explanation in recommendation; to provide an explanation opens the way to new chal-
Section 3 details LIME to make the paper self-contained. lenges [1, 18, 19, 20].
Section 4 shows the results of experiments with two main-        There are two different approaches to address this type
stream recommendation models: Attribute Item-kNN of issue.
and Vector Space Model. We discuss the outcomes of the               • On the one hand, the model-intrinsic explanation
experiments in Section 5, and conclude with Section 6.                 strategy aims to create a user-friendly recommen-
                                                                       dation model or encapsulates an explaining mech-
2. Related Work                                                        anism. However, as Lipton [21] points out, this
                                                                       strategy will weigh in on the trade-off between
In recent years, the theme of Explanation in Artificial                the transparency and accuracy of the model. In-
Intelligence has come to the foreground, capturing the                 deed, if the goal becomes to justify recommen-
attention not only of the Machine Learning and related                 dations, the purpose of the system is no longer
communities – that deal more specifically with the algo-               to provide only personalized recommendations,
rithmic part – but also of fields closer to Social Sciences,           resulting in a distortion of the recommendation
such as Sociology or Cognitivism, which look with great                process.
interest to this area of research [1]. The growing interest          • On the other hand, we have a model-agnostic [22]
in this area is also dictated by new regulations of both               approach, also known as post-hoc [23], which
Europe [6] and US [7] with respect to sensitive issues in              does not require to intervene on the internal
the field of personal data processing, and legal respon-               mechanisms of the recommendation model and
sibility. This trend has also touched the research field               therefore does not affect its performance in terms
of recommender systems [8, 9, 10, 11]. However, topics                 of accuracy.
such as explanation are by no means new to this field.
In fact, we can date back to 2014 the introduction of the Most recommendation algorithms take an MF-approach,
term “explainable recommendation” [12], although the and thus the entire recommendation process is based on
need to provide an explanation that accompanies the rec- the interaction of latent factors that bring out the level of
ommendation is a need that emerged as early as 1999 liking for an item with respect to a user. Many post-hoc
by Schafer et al. [13], when people began trying to ex- explanation methods have been proposed for precisely
plain a recommendation with other similar items familiar these types of recommendation models. It seems evi-
to the user who received that recommendation.                  dent that the most difficult challenge for this type of
   Catalyzation of interest around the topic of explana- approach lies in making these latent factors explicit and
tion of recommendations coincides also with the aware- understandable for the user [9]. Peake and Wang [23]
ness achieved in considering metrics beyond accuracy generate an explanation by exploiting the association
as fundamental in evaluating a recommendation sys- rules between features; Tao et al. [24] in their work, find
tem [14, 15]. Indeed, all of the well-known metrics of benefit from regression trees to drive learning, and then
novelty, diversity, and serendipity are intended to im- explain the latent space; instead, Gao et al. [25] try a deep
prove the user experience, and in this respect, a key role model based on attention mechanisms to make relevant
is played by explanation [3, 16]. “Why are you recom- features emerge. Along the same lines are Pan et al. [11],
mending that?”—this is the question that usually accom- who present a feature mapping approach that maps the
panies the user when a suggestion is provided. Tintarev uninterpretable general features onto the interpretable as-
and Masthoff [2] detailed in a scrupulous way the as- pect features. Among other approaches to consider, [12]
pects involved in the process of explanation when we proposes an explicit factor model that builds a mapping
talk about recommendation. They identified 7 aspects: between the interpretable features and the latent space.
user’s trust, satisfaction, persuasiveness, efficiency, effec- On the same line we also find the work by Fusco et al.
tiveness, scrutability, and transparency.                      [26]. In their work, they provide an approach to identify,
   This is the starting point to define Explainable Rec- in a neural model, which features contribute most to the
ommendation as a task that aims to provide suggestions recommendation. However, these post-hoc explanation
to the users and make them aware of the recommen- approaches turn out to be built for very specific mod-
dation process, explaining also why that specific object els. Purely model-agnostic approaches include the recent
has been suggested. Gedikli et al. [3] evaluated differ- work of Tsang et al. [27], who present GLIDER, an ap-
ent types of explanations and drew a set of guidelines proach to estimate interactions between features rather
to decide what the best explanation that should equip a than on the significance of features as in the original
recommendation system is. This is due to the fact that LIME [4] algorithm. This type of solution is constructed
popular recommendation systems are based on Matrix regardless of the recommendation model.
Factorization (MF) [17]; for this type of model, trying          Our paper focuses on the operation of LIME, a model-
                                                               agnostic method for a surrogate-based local explanation.
When a user-item pair is provided, this model returns as
an outcome of the explanation a set of feature weights,                          𝜉 (𝑥) = argminℒ (𝑓 , 𝑒, 𝜋𝑥 ) + Ω(𝑒)                                (1)
for any recommender system. However, the recommen-                                             𝑒∈𝐸
dation task is very specific, so there is a version called       where ℒ represents the fidelity of the surrogate model to
LIME-RS [5] that applies the explanation model tech-             the original 𝑓, and 𝑒 represents a particular instance of the
nique to the recommendation domain. In this way, any             class 𝐸 of all possible explainable models. Among all the
recommender is seen as a black box, so LIME-RS plays the         possible models, the one most frequently chosen is based
role of a model-agnostic explainer whose result is a set         on a linear prediction. In this case, an explanation refers
of interpretable features and their relative importance.         to the weights of the most important interpretable fea-
   The goal of LIME-RS is to exploit the predictive power        tures, which, when combined, minimize the divergence
of the recommendation (black box) model to generate              from the black-box model. The function 𝜋𝑥 measures the
an explanation about the suggestion of a particular item         distance between the instance to be explained 𝑥 ∈ 𝒳, and
for a user. In this respect, it exploits a neighborhood          the samples 𝑥 ′ ∈ 𝒳 extracted from the training set to
drawn according to a generic distribution compared to            train the model 𝑒. Finally, Ω(𝑒) represents the complexity
the candidate item for the explanation. It seems obvious         of the explanation model.
that the choice of the neighborhood plays a crucial role            Two pieces of evidence make the application of LIME
within the process of explanation generation by LIME-RS.         possible: (i) the existence of a feature space 𝒵 on which
We can compare this sample extraction action to a per-           to train the surrogate model of 𝑓, (ii) and the presence
turbation of the user-item pair we are using to generate         of a surjective function that maps the space mentioned
the explanation. In the case of LIME-RS this perturba-           above (𝒵) to the original space of instances (𝒳). Going
tion must generate consistent samples with respect to            into more detail, we consider the fidelity function ℒ as
the source dataset. We see that this choice represents a         the mean square deviation between the prediction for a
critical issue for all the post-hoc models which base their      generic instance 𝑥 ′ ∈ 𝒳 of the black-box model and that
expressiveness on the locality of the instance to explain.       generated for the counterpart 𝑧 ′ ∈ 𝒵 by the surrogate
   This trend is confirmed in several papers addressing          model. Starting from these considerations we can express
this issue of surrogate-based explanation systems such as        ℒ with the following formula:
LIME and SHAP [28]. In two recent papers, Alvarez-Melis
and Jaakkola [29] have shown how the explanations gen-
                                                                    ℒ (𝑓 , 𝑒, 𝜋𝑥 ) =         ∑            𝜋𝑥 (𝑥 ′ ) ⋅ (𝑓 (𝑥 ′ ) − 𝑒 (𝑧 ′ ))2        (2)
erated with LIME are not very robust: their contribution
                                                                                         𝑥 ′ ∈𝒳 ,𝑧 ′ ∈𝒵
aims to bring out how small variations or perturbations
in the input data cause significant variations in the expla-        In the formula above 𝜋𝑥 plays a fundamental role as
nation of that specific input [30]. In their paper, a new        it expresses the distance between the instance to be
strategy is introduced to strengthen these methods by            explained and the sampled instance used to build the
exploiting local Lipschitz continuity. By deeply inves-          surrogate model. From a generic perspective, we can
tigating this drawback, they introduced self-explaining          express this function as a kernel function like 𝜋𝑥 =
models in stages, progressively generalizing linear classi-      𝑒𝑥𝑝(−𝐷(𝑥, 𝑥 ′ )2 /𝜎 2 ), where 𝐷 is any measure of distance.
fiers to complex yet architecturally explicit models.               The full impact of this distance is captured when the
   Saito et al. [31] also explored this issue by turning their   fidelity function also considers the transformation of the
gaze to different types of sampling to make the result of        surrogate sample in the original space. As mentioned
an explanation generated through LIME more robust. In            earlier, we consider a surjective function 𝑝 that maps the
particular, in their work, they introduce the possibility        original space into the feature space 𝑝 ∶ 𝒳 → 𝒵. We
of generating realistic samples produced with a Genera-          can also consider the function that allows us to move
tive Adversarial Network. Finally, Slack et al. [32] adopt       in the opposite direction 𝑝 −1 ∶ 𝒳 → 𝒵. At this point,
a similar solution in order to control the perturbation          Equation (2) becomes:
generating neighborhood data points by attempting to
mitigate the generation of unreliable explanations while                                                                                        2
                                                                  ℒ (𝑓 , 𝑒, 𝜋𝑥 , 𝑝) = ∑𝑧 ′ ∈𝒵 𝜋𝑥 (𝑝 −1 (𝑧 ′ )) ⋅ (𝑓 (𝑝 −1 (𝑧 ′ )) − 𝑒 (𝑧 ′ ))       (3)
maintaining a stable black-box model of prediction.
                                                           From this last equation, we can grasp the criticality of
                                                        the surjective mapping function. Indeed, the neighbor-
3. Background Technology                                hood in 𝒵-space cannot be guaranteed with the transfor-
From a formal point of view, we can define a LIME- mation in 𝒳-space. Thus, some samples selected to train
generated explanation for a generic instance 𝑥 ∈ 𝒳 pro- the surrogate model could not satisfy the neighborhood
duced by a model 𝑓 as:                                  criterion for which they were chosen.
                                                           We must therefore stress on the centrality of the sam-
                                                        pling function: how do we extract the neighborhood of
our instance to be explained? If we look at the application         lens Small [33], and Yahoo! Movies1 . Their characteristics
of LIME to the recommendation domain, we can compare                are shown in Table 1.
this sampling action to a local perturbation around our
instance 𝑥; however, this perturbation aims to generate             Table 1
𝑛 samples 𝑥 ′ , which might contain inconsistencies: as             Characteristics of the datasets involved in the experiments.
an example, suppose we want to explain James’s feeling
                                                                                         Users    Items   Transactions       Sparsity
about the movie The Matrix. The original triple of the in-
stance to be explained associates to the user-item pair the          Movielens 1M         6040     3675         797758        0,9640
                                                                     Movielens Small       610     8990          80419        0,9853
genre of the movie (representing the explainable space)              Yahoo! Movies        7636     8429         160604        0,9975
and in this case it is of the type ⟨𝐽 𝑎𝑚𝑒𝑠, 𝑇 ℎ𝑒𝑀𝑎𝑡𝑟𝑖𝑥, 𝑆𝑐𝑖-𝐹 𝑖⟩.
A perturbation around this instance could generate incon-
sistencies of the type ⟨𝐽 𝑎𝑚𝑒𝑠, 𝑇 ℎ𝑒𝑀𝑎𝑡𝑟𝑖𝑥, 𝑊 𝑒𝑠𝑡𝑒𝑟𝑛⟩. For     As for the choice of the models to be used in this work
this reason, in LIME-RS the perturbation considers only     is concerned, we selected two well-known recommenda-
real and not synthetic data. This choice is dictated by     tion models that are able to exploit the information con-
the avoidance of the out-of-sample (OOS) process phe-       tent of the items to produce a recommendation: Attribute
nomenon. Closely related to this problem predicted by       Item kNN (Att-Item-kNN) and Vector Space Model (VSM).
OOS is that the interpretation examples selected in LIME-   The two chosen models represent the simplest solution
RS represent the ability to capture the locality through    to address the recommendation problem by exploiting
disturbance mechanisms effectively. One of the disad-       the content associated with the items in the catalog.
vantages of LIME-like methods is that they sometimes           Att-Item-kNN exploits the characteristics of
fail to estimate an appropriate local replacement model     neighborhood-based models but expresses the represen-
but instead generate a model that focuses on explaining     tation of the items in terms of their content and, based
the examples and is also affected by more general trends    on this representation, it computes a similarity between
in the data.                                                users. Starting from this similarity and exploiting
   This issue is central to our work, and it involves two   the collaborative contribution in terms of interactions
aspects: (i) the first one concerns the sampling function   between users and items, Att-Item-kNN tries to estimate
of the samples precisely. In the LIME-RS implementation,    the level of liking of the items in the catalog. VSM
this function is driven by the popularity distribution of   represents both users and items in a new space to
the items within the dataset. (ii) The second critical issuelink users and items to the considered information
concerns the model’s ability to wittily discriminate the    content. Once obtained this new representation, with
user’s taste from the neighborhood extracted to build the   an appropriate function of similarity, VSM estimates
surrogate model. A model that squashes too much on bias     which are the most appealing items for a specific user.
or is inaccurate cannot bring out the peculiarities of user The implementation of both models are available in the
taste that are critical in building the explainable model   ELLIOT [34] evaluation framework. This benchmarking
which are, in turn, useful in generating the explanation    framework was used to select the best configuration
for the instance of interest.                               for the two recommendation models by exploiting the
   These observations dictate the two research questions    corresponding configuration file2 .
that motivated our work:                                       Our experiments start by selecting the best configu-
                                                            rations based on nDCG [35, 36] for the two models on
 RQ1 Can we consider the surrogate-based model on the considered datasets. Then, we generate the top-10
        which LIME-RS is built to generate always the list of recommendations for each user, and we take into
        same explanations, or does the extraction of a dif- account the first item 𝑖 on these lists for each user 𝑢. Fi-
                                                                                    1
        ferent neighborhood severely impact the system’s nally, each recommendation pair (𝑢, 𝑖 ) is explained with
                                                                                                    1
        constancy?                                          LIME-RS. The explanation consists of a weighted vector
 RQ2 Are LIME-RS explanations adherent to item con- (𝑔, 𝑤)𝑖 where 𝑔 is the genre of the movies in the dataset
        tent, despite the fact that the sampling function – i.e., the features – and 𝑤 is the weight associated to 𝑔
        is uncritical and based only on popularity?         by LIME-RS within the explanation. Then, this vector is
                                                            sorted by descending weights. In this way, the genres of
                                                            the movies which play a key role within the recommen-
4. Experiments                                              dation, as explained by LIME-RS, are highlighted at the
                                                            first positions of the vector. These operations are then
This section is devoted to illustrating how the experimen- repeated 𝑛 = 10 times and changing the seed each time,
tal campaign was conducted. The datasets used for this          1
phase of experimentation are Movielens 1M [33], Movie-            R4 - Yahoo! Movies User Ratings and Descriptive Content
                                                                    Information, v.1.0 http://webscope.sandbox.yahoo.com/.
                                                                        2
                                                                          https://tny.sh/basic_limers
as 10 is likely to be a good choice to detect a general pat-    Table 2
tern in the behavior of LIME-RS. At this point, for each        Constancy of LIME-RS. A value equal to 0 means that the
pair (𝑢, 𝑖1 ), we have a group of 10 explanations ordered       genre(s) provided by LIME-RS in the first 𝑘 position(s) is always
by descending values of 𝑤, which we exploit to answer           different (worst case: completely inconstant behavior); A value
our two research questions.                                     equal to 1 means that the genre(s) provided by LIME-RS in
RQ1. Empirically, since in a real scenario of recommenda-       the first 𝑘 position(s) is always the same (total constancy).
tion a too verbose explanation is not useful, we consider                            𝜇1         𝜇2       𝜇3      𝜇4        𝜇5
only the first five features in the sorted vector represent-                              Att-Item-kNN
ing the explanation of each recommendation. In order to          Movielens 1M      0,9130    0,7822   0,6927   0,6288    0,5727
verify the constancy of the behavior of LIME-RS, given a         Movielens Small   0,8830    0,7426   0,6639   0,60459   0,5616
                                                                 Yahoo! Movies     0,9230 0,8016 0,7232        0,6528    0,5830
(𝑢, 𝑖1 ) pair, we exploit the 𝑛 previously generated expla-                                    VSM
nations for this pair. Then for 𝑘 = 1, 2, … , 5, we define       Movielens 1M      0,8929    0,7953   0,7729   0,7726    0,7801
𝐺𝑘 as the multiset of genres that appear in 𝑘-th position        Movielens Small   0,9464    0,8636   0,8343   0,8138    0,8049
– for instance, if “Sci-Fi” occurs in the first position of 7    Yahoo! Movies     0,9732 0,9209 0,8887        0,8884    0,9056
explanations, then “Sci-Fi” occurs 7 times in the multiset
𝐺1 , and similarly for other genres and multisets. Then,
we compute the frequency of genres in each position as          the values are much more stable. In this case, we have a
follows: given a position 𝑘, a genre 𝑔, and the number          constancy that, regardless of the length of the weighted
𝑛 of generated explanations for a given pair (𝑢, 𝑖1 ), the      vector of the explanation, stabilizes on average around
frequency 𝑓𝑔𝑘 of 𝑔 in 𝑘-th position is computed as:             80%. An aspect emerges that will be discussed in de-
                                                                tail later: LIME-RS is conditioned by the ability of the
                            ||{𝑔 | 𝑔 ∈ 𝐺𝑘 }||
                      𝑓𝑔𝑘 =                                 (4) black-box model to discriminate the user’s tastes locally.
                                    𝑛                           RQ2. With the aim of providing an answer about the
where || ⋅ || denotes the cardinality of a multiset. Then, all adherence to reality of LIME-RS, we make a comparison
this information is collected for each user in five lists — between the genres claimed to explain a recommended
one for each of the 𝑘 positions — of pairs ⟨𝑔, 𝑓𝑔𝑘 ⟩ sorted by item and its actual genres. Indeed, the explanations about
frequency. One can observe that the computed frequency an item should fit the list of genres the item is charac-
is an estimation of the probability that a given genre is terized by. This means that, in an ideal case, all highly
put in that position within the explanation generated by weighted features within the explanation should match
LIME-RS sorted by values. Hence, the pair ⟨𝑔, max (𝑓𝑔𝑘 )⟩ the genres of the item. From the results in Table 2, we no-
describes the genre with the highest frequency in the tice that using Att-Item-kNN the constancy of LIME-RS
𝑘-th position of the explanation for a pair (𝑢, 𝑖1 ). Finally, reaches a low value after the third feature. Hence, it is a
it makes sense to compute the mean 𝜇𝑘 of the highest futile effort to go deeper in the study of the explanation.
probability values in each position 𝑘 of the explanations To this aim, we intersected each explanation limited to
for each pair (𝑢, 𝑖1 ). Formally, by setting a position 𝑘, the the set 𝐸𝑘 of its first 𝑘 genres with the set of genres 𝐹𝑖1
mean 𝜇𝑘 is computed as:                                         characterizing the first recommended item, for 𝑘 = 1, 2, 3.
                                                                Upon completion of this operation for all the 𝑛 expla-
                             |𝑈 |                               nations generated for each (𝑢, 𝑖1 ) pair, we computed the
                           ∑𝑗=1 max (𝑓𝑔𝑘 )𝑗
                    𝜇𝑘 =                                    (5) number of times we obtained an empty intersection of
                                   |𝑈 |                         these sets, normalized by the total number of explana-
where 𝑈 is the set of users for whom it was possible to tions 𝑛 × |𝑈 |, in order to understand to what extent an
generate a recommendation for. Observing the value explanation is (not) adherent to the item. Formally, for a
of 𝜇𝑘 , we can state to what extent LIME-RS is constant given value of 𝑘, the value 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 is computed as:
in providing the explanations until the 𝑘-th feature: the                                  𝑛×|𝑈 |
higher the value of 𝜇𝑘 , the higher the constancy of LIME-                               ∑𝑗=1 [(𝐸𝑘 ∩ 𝐹𝑖1 )𝑗 = ∅]
RS concerning the 𝑘-th feature.                                            𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 =                                (6)
                                                                                                  𝑛 × |𝑈 |
   By looking at Table 2, we can see that for Att-Item-
kNN the LIME-RS explanation model is reliable as long where 𝑈 is the set of users of the dataset for whom it
as it considers at most three features in the weighted was possible to generate a recommendation, 𝑛 is the
vector presented as an explanation of the recommen- number of generated explanations for each pair (𝑢, 𝑖1 ),
dation. Extending the explanation to four features, we and by Σ[⋯] we mean that we sum 1 if the condition
have a constancy that falls below 65%, while arriving at inside [⋯] is true, and 0 otherwise. One can note that
an explanation with five features is more likely to run 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 ∈ [0, 1], where a value of 1 indicates the worst
into explanations that exhibit an unacceptably random case in which for none of the 𝑛 explanations under con-
behavior. On the other hand, we can see that for VSM sideration at least one genre of the item is in the first 𝑘
features of the explanation. In contrast, the lower the (except for the first feature with Movielens 1M) better per-
value of 𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒𝑘 , the higher the adherence of LIME-RS. formance than Att-Item-kNN in terms of constancy, with
                                                                 peaks up to 97%. A straightforward consequence of these
Table 3                                                          observations could be analyzed in terms of confidence or
Adherence of LIME-RS. For value equals to 1 no genre provided probability. If the constancy steadily decreases, it means
by LIME-RS in the first 𝑘 real genres of the movie (worst case); that the probability that LIME-RS suggests the same ex-
For value equals to 0 at least one genre provided by LIME-RS planatory feature decreases. In practical terms, we could
in the first 𝑘 genres is always among the real genres of the say that LIME-RS is less confident about its explanation.
movie.                                                           In fact, this is the behavior of Att-Item-kNN. Conversely,
                       𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒1    𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒2    𝑎𝑑ℎ𝑒𝑟𝑒𝑛𝑐𝑒3    VSM shows high values of constancy, resulting in a more
                         Att-Item-kNN                            ”deterministic” behavior. With VSM, LIME-RS is more
                                                                 confident of its explanations. This could increase user’s
  Movielens 1M           0,2774        0,1105        0,0488
  Movielens Small        0,2364        0,0651        0,0180
                                                                 trustworthiness, since LIME-RS behavior is more reliable.
  Yahoo! Movies          0,3597        0,1202        0,0476         However, these results could also be interpreted to-
                              VSM
                                                                 gether   with the ones from Table 3. They show how
                                                                 often at least one feature – out of 𝑘 features provided
  Movielens 1M           0,5357        0,2539        0,1088
                                                                 by LIME-RS– adheres to the features that describe the
  Movielens Small        0,4384        0,1674        0,0403
  Yahoo! Movies          0,1013       0,01348        0,0021      item being explained. In other words, they measure the
                                                                 probability that LIME-RS succeeds in reconstructing at
                                                                 least one feature of a specific item. Combining the re-
   Observing the results from Table 3, Att-Item-KNN per- sults of Table 2 and those of Table 3, Att-Item-kNN, as
forms well in terms of adherence since, in approximately already mentioned, shows good performance regarding
75% of cases, even considering only the main feature of adherence and identifies 3 times out of 4 the first funda-
the explanation, it falls into the set of the item genres, mental feature of the explanation among those present
as for Movielens dataset family. This performance is in the set of features originally associated with the item.
a 10% lower for Yahoo! Movies. In contrast with this As expected, if the number 𝑘 of LIME-RS-reconstructed
result, VSM shows poor performances on both dataset features increases, the number of times such a set has a
of the Movielens family, by failing half the time about nonempty intersection (with the features belonging to
Movielens 1M as regards adherence. A surprising result the item) – i.e., adherence – increases. It could be noted
is achieved for Yahoo! Movies dataset because, enlarging that Att-Item-kNN on Yahoo! Movies shows the worst
the study to the first three features among the explana- behavior in terms of adherence. VSM shows a different
tion, the error is almost completely absent. The reasons behavior. Despite the excellent performance regarding
we found to explain this difference in the performances constancy, it could be observed that on both Movielens
concern the characteristics and the quality of the dataset, datasets, the performance in terms of adherence is poor,
as we highlight later on.                                        and worse for Movielens 1M than for Movielens Small.
                                                                 Surprisingly, on Yahoo! Movies, VSM performs much
5. Discussion                                                    better, and the errors are almost negligible.
                                                                 The difference between the two models could be due
This work investigates how well a post-hoc approach to many reasons. In the following we analyze possible
based on local surrogates – such as the LIME-RS algo- relations between such behaviors and two of them: pop-
rithm – explains a recommendation. Instead of studying ularity bias in the dataset and characteristics of side in-
the impact of explanations on users (that is a well-studied formation. On the one hand, if the dataset is affected
topic in the literature and is beyond our scope), we fo- by popularity bias, it would be a well-studied cause of
cus on objective evidences that could emerge. In this confusion for LIME-RS. On the other hand, the character-
respect, we have designed specific experiments, which istics of the side information associated with the datasets
introduced two different metrics, to evaluate adherence could dramatically influence the performance of the two
and constancy for this kind of algorithms. For instance, recommendation models. To assess these hypotheses, we
Table 2 shows a different behavior for Att-Item-kNN and have evaluated (see Table 4) the recommendation lists
VSM. On the one hand, Att-Item-kNN seems to guarantee produced by Att-Item-kNN and VSM considering nDCG,
a good constancy in explanations up to the third feature. Hit Rate (HR), Mean Average Precision (MAP), and Mean
This suggests that an explanation that exploits the first Reciprocal Rank (MRR). Table 4 shows that the chosen
three features of the list produced by LIME-RS could be datasets are strongly affected by popularity bias. Indeed,
barely considered as reliable (i.e., reaching a constancy of MostPop is the best performing approach, and the two
0.69 on Movielens 1M). On the other hand, VSM exhibits ”personalized” models fail to produce accurate results.
a much more ”stable” behavior, demonstrating in all cases This triggers the second aspect that concerns the quality
Table 4                                                                  ommendation scenario. We propose two different mea-
Results of the experiments on the models involved in the                 sures to understand how reliable an explanation based
experiments. Models are optimized according to the value of              on LIME-RS is: (i) constancy was used to assess the im-
nDCG.                                                                    pact of the random sampling phase of LIME-RS on the
 model          nDCG     Recall     HR     Precision   MAP      MRR      provided explanation – ideally the explanation should
                            Movielens 1m                                 remain constant in spite of the sample used to obtain
 Random         0,0051   0,0028 0,0869      0,0098     0,0094   0,0264   it; (ii) adherence was proposed to understand the recon-
 MostPop        0,0845   0,0379 0,4548       0,104      0,115   0,2205
 Att-Item-kNN   0,0229   0,0165 0,2425      0,0383     0,0387   0,0888   structive power of LIME-RS with respect to the features
 VSM            0,0173   0,0109 0,2106      0,0292     0,0306   0,0741   that belong to the item involved in the explanation – ide-
                           Movielens Small
                                                                         ally, LIME-RS should provide an explanation that always
 Random         0,0030   0,0013 0,0492      0,0049     0,0068   0,0205
 MostPop        0,0715   0,0389 0,3902      0,0748     0,0912   0,1961   adheres to the actual features of the recommended item.
 Att-Item-kNN   0,0124   0,0068 0,1459      0,0197     0,0191   0,0484       To test both constancy and adherence, we trained and
 VSM            0,0085   0,0056 0,1000      0,0111     0,0123   0.0350
                                                                         optimized two content-based recommendation models:
                            Yahoo! Movies
 Random         0,0005   0,0008 0,0051      0,0005     0,0005   0,0015
                                                                         Attribute Item-kNN (Att-Item-kNN), and a classical Vec-
 MostPop        0,2188   0,2589    0,596    0,1067     0,1501   0,3447   tor Space Model. For each model, and for all datasets
 Att-Item-kNN   0,0215   0,0262 0,1198      0,0132     0,0155   0,0435
 VSM            0,0131   0,0171 0,0754      0,0081     0,0092   0,0261
                                                                         exploited in the study, we generated recommendation
                                                                         lists for all users. We exploited the first item of these
                                                                         top-10 lists to produce the explanations that were then
                                                                         the subject of our investigation. It turned out that for
of the content. The results suggest that the side infor-
                                                                         models built with a large collaborative input such as
mation is not good enough to boost the recommenda-
                                                                         Att-Item-kNN, LIME-RS produces fairly constant expla-
tion systems in producing meaningful recommendations.
                                                                         nations up to a length of three features. Moreover, these
In fact, the three datasets seem to have an informative
                                                                         explanations turn out to be adherent with respect to the
content that is not adequate to generate appealing rec-
                                                                         item between 65% and 75% of the cases in which only
ommendations. We observe that, from an informative
                                                                         the first feature of the weighted vector of explanations
point of view, the Yahoo! Movies dataset is slightly more
                                                                         is considered. VSM shows a different behavior where
complete: 22 genres against the 18 genres available on
                                                                         explanations are much more constant, but suffer a lot in
Movielens. Although the VSM model does not show ex-
                                                                         terms of adherence, except for the Yahoo! Movies dataset
cellent performance, in combination with LIME-RS, it
                                                                         for which the explanation model showed outstanding
provides explanations that are very reliable in terms of
                                                                         performance despite the poor ability of VSM to provide
constancy (see Table 2) and adherence (see Table 3) to
                                                                         sound recommendations to users.
the actual content of the items being explained.
                                                                             In our experiments, some evidence started to emerge
From the designer perspective, there is also a prag-
                                                                         highlighting that the adopted explanation model is condi-
matic way to look at the experimental results. Suppose a
                                                                         tioned not only by the accuracy of the black-box model it
developer needs an off-the-shelf way of generating expla-
                                                                         tries to explain but also by the quality of the side informa-
nations for recommendations, and chooses LIME-RS to
                                                                         tion used to train the model. The latter result deserves to
do that. Our results suggest that if the explainer employs
                                                                         be adequately investigated to search for a link at a higher
a Movielens dataset with Att-Item-kNN model, then it
                                                                         level of detail. We plan to apply our experiments also to
is better to run the explainer several times. Indeed, the
                                                                         other recommendation models, to see whether the prob-
first feature obtained for the explanation could change
                                                                         lems with adherence and constancy that we found for the
around 1 time every 5 trials (first column of Table 2),
                                                                         two tested models show up also in other situations. We
and once such a feature is obtained, it is better to check
                                                                         will also investigate what impact structured knowledge
whether this feature is really among the ones describing
                                                                         has on this performance by exploiting models capable of
the item, since 1 time out of 4 the feature can be wrong
                                                                         leveraging this type of content. In addition, it would also
(first column of Table 3). Moreover, if the explainer em-
                                                                         be the case to try different reference domains with richer
ploys the Yahoo! Movies dataset with VSM model, then
                                                                         datasets of side information to understand what impact
probably there is no need to run the explainer twice, since
                                                                         content quality has on this type of explainer.
its behavior is constant 97% of the times, while the fea-
ture is wrong only 10% of the times. However, the low
performance of such a model is to be taken into account.                 Acknowledgments
                                                                         The authors acknowledge partial support of PID2019-
6. Conclusion                                                            108965GB-I00, PON ARS01_00876 BIO-D, Casa delle Tec-
                                                                         nologie Emergenti della Città di Matera, PON ARS01_00821
In this paper we shed a first light on the effectiveness                 FLET4.0, PIA Servizi Locali 2.0, H2020 Passapartout - Grant n.
of LIME-RS as a black-box explanation model in a rec-                    101016956, PIA ERP4.0, and IPZS-PRJ4_IA_NORMATIVO.
References                                                                             dation via interpretable feature mapping and evalua-
                                                                                       tion of explainability, in: C. Bessiere (Ed.), Proceed-
 [1] T. Miller, Explanation in artificial intelligence: Insights                       ings of the Twenty-Ninth International Joint Conference
     from the social sciences, Artif. Intell. 267 (2019) 1–38.                         on Artificial Intelligence, IJCAI 2020, ijcai.org, 2020, pp.
     URL: https://doi.org/10.1016/j.artint.2018.07.007. doi:1 0 .                      2690–2696. URL: https://doi.org/10.24963/ijcai.2020/373.
     1016/j.artint.2018.07.007.                                                        doi:1 0 . 2 4 9 6 3 / i j c a i . 2 0 2 0 / 3 7 3 .
 [2] N. Tintarev, J. Masthoff, Explaining recommendations: [12] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, S. Ma,
     Design and evaluation, in: Recommender Systems Hand-                              Explicit factor models for explainable recommendation
     book, Springer, 2015, pp. 353–382.                                                based on phrase-level sentiment analysis, in: S. Geva,
 [3] F. Gedikli, D. Jannach, M. Ge, How should I explain?                              A. Trotman, P. Bruza, C. L. A. Clarke, K. Järvelin (Eds.),
     A comparison of different explanation types for recom-                            The 37th International ACM SIGIR Conference on Re-
     mender systems, Int. J. Hum. Comput. Stud. 72 (2014)                              search and Development in Information Retrieval, SIGIR
     367–382. URL: https://doi.org/10.1016/j.ijhcs.2013.12.007.                        ’14, Gold Coast , QLD, Australia - July 06 - 11, 2014, ACM,
     doi:1 0 . 1 0 1 6 / j . i j h c s . 2 0 1 3 . 1 2 . 0 0 7 .                       2014, pp. 83–92. URL: https://doi.org/10.1145/2600428.
 [4] M. T. Ribeiro, S. Singh, C. Guestrin, ”why should I trust                         2609579. doi:1 0 . 1 1 4 5 / 2 6 0 0 4 2 8 . 2 6 0 9 5 7 9 .
     you?”: Explaining the predictions of any classifier, in: [13] J. B. Schafer, J. A. Konstan, J. Riedl, Recommender sys-
     B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal,                            tems in e-commerce, in: S. I. Feldman, M. P. Wellman
     D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM                           (Eds.), Proceedings of the First ACM Conference on Elec-
     SIGKDD International Conference on Knowledge Discov-                              tronic Commerce (EC-99), Denver, CO, USA, November
     ery and Data Mining, San Francisco, CA, USA, August 13-                           3-5, 1999, ACM, 1999, pp. 158–166. URL: https://doi.org/
     17, 2016, ACM, 2016, pp. 1135–1144. URL: https://doi.org/                         10.1145/336992.337035. doi:1 0 . 1 1 4 5 / 3 3 6 9 9 2 . 3 3 7 0 3 5 .
     10.1145/2939672.2939778. doi:1 0 . 1 1 4 5 / 2 9 3 9 6 7 2 . 2 9 3 9 7 7 8 . [14] S. M. McNee, J. Riedl, J. A. Konstan, Being accurate is
 [5] C. Nóbrega, L. B. Marinho, Towards explaining rec-                                not enough: how accuracy metrics have hurt recom-
     ommendations through local surrogate models, in:                                  mender systems, in: G. M. Olson, R. Jeffries (Eds.),
     C. Hung, G. A. Papadopoulos (Eds.), Proceedings of                                Extended Abstracts Proceedings of the 2006 Confer-
     the 34th ACM/SIGAPP Symposium on Applied Com-                                     ence on Human Factors in Computing Systems, CHI
     puting, SAC 2019, Limassol, Cyprus, April 8-12, 2019,                             2006, Montréal, Québec, Canada, April 22-27, 2006,
     ACM, 2019, pp. 1671–1678. URL: https://doi.org/10.1145/                           ACM, 2006, pp. 1097–1101. URL: https://doi.org/10.1145/
     3297280.3297443. doi:1 0 . 1 1 4 5 / 3 2 9 7 2 8 0 . 3 2 9 7 4 4 3 .              1125451.1125659. doi:1 0 . 1 1 4 5 / 1 1 2 5 4 5 1 . 1 1 2 5 6 5 9 .
 [6] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual [15] S. Vargas, Novelty and diversity enhancement and eval-
     explanations without opening the black box: Automated                             uation in recommender systems and information re-
     decisions and the gdpr, Harv. JL & Tech. 31 (2017) 841.                           trieval, in: S. Geva, A. Trotman, P. Bruza, C. L. A. Clarke,
 [7] J. Chakraborty, K. Peng, T. Menzies, Making fair ML soft-                         K. Järvelin (Eds.), The 37th International ACM SIGIR Con-
     ware using trustworthy explanation, in: 35th IEEE/ACM                             ference on Research and Development in Information
     International Conference on Automated Software Engi-                              Retrieval, SIGIR ’14, Gold Coast , QLD, Australia - July
     neering, ASE 2020, Melbourne, Australia, September 21-                            06 - 11, 2014, ACM, 2014, p. 1281. URL: https://doi.org/
     25, 2020, IEEE, 2020, pp. 1229–1233. URL: https://doi.org/                        10.1145/2600428.2610382. doi:1 0 . 1 1 4 5 / 2 6 0 0 4 2 8 . 2 6 1 0 3 8 2 .
     10.1145/3324884.3418932. doi:1 0 . 1 1 4 5 / 3 3 2 4 8 8 4 . 3 4 1 8 9 3 2 . [16] N. Tintarev, J. Masthoff, A survey of explanations in
 [8] Y. Zhang, X. Chen, Explainable recommendation: A                                  recommender systems, in: ICDE Workshops, IEEE Com-
     survey and new perspectives, Found. Trends Inf. Retr. 14                          puter Society, 2007, pp. 801–810.
     (2020) 1–101. URL: https://doi.org/10.1561/1500000066. [17] Y. Koren, R. M. Bell, C. Volinsky, Matrix factorization
     doi:1 0 . 1 5 6 1 / 1 5 0 0 0 0 0 0 6 6 .                                         techniques for recommender systems, Computer 42
 [9] V. W. Anelli, T. D. Noia, E. D. Sciascio, A. Ragone, J. Trotta,                   (2009) 30–37. URL: https://doi.org/10.1109/MC.2009.263.
     How to make latent factors interpretable by feeding                               doi:1 0 . 1 1 0 9 / M C . 2 0 0 9 . 2 6 3 .
     factorization machines with knowledge graphs, in: [18] K. Tsukuda, M. Goto, Dualdiv: diversifying items
     C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. F.                            and explanation styles in explainable hybrid recom-
     Cruz, A. Hogan, J. Song, M. Lefrançois, F. Gandon (Eds.),                         mendation, in: T. Bogers, A. Said, P. Brusilovsky,
     The Semantic Web - ISWC 2019 - 18th International Se-                             D. Tikk (Eds.), Proceedings of the 13th ACM Confer-
     mantic Web Conference, Auckland, New Zealand, Oc-                                 ence on Recommender Systems, RecSys 2019, Copen-
     tober 26-30, 2019, Proceedings, Part I, volume 11778 of                           hagen, Denmark, September 16-20, 2019, ACM, 2019, pp.
     Lecture Notes in Computer Science, Springer, 2019, pp.                            398–402. URL: https://doi.org/10.1145/3298689.3347063.
     38–56. URL: https://doi.org/10.1007/978-3-030-30793-6_                            doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 . 3 3 4 7 0 6 3 .
     3. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 0 3 0 - 3 0 7 9 3 - 6 \ _ 3 .             [19] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin, H. Zha,
[10] G. P. Polleti, H. N. Munhoz, F. G. Cozman, Explanations                           Personalized fashion recommendation with visual ex-
     within conversational recommendation systems: improv-                             planations based on multimodal attention network: To-
     ing coverage through knowledge graph embedding, in:                               wards visually explainable recommendation, in: B. Pi-
     2020 AAAI Workshop on Interactive and Conversational                              wowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie,
     Recommendation System. AAAI Press, New York City,                                 F. Scholer (Eds.), Proceedings of the 42nd International
     New York, USA, 2020.                                                              ACM SIGIR Conference on Research and Development in
[11] D. Pan, X. Li, X. Li, D. Zhu, Explainable recommen-                               Information Retrieval, SIGIR 2019, Paris, France, July 21-
     25, 2019, ACM, 2019, pp. 765–774. URL: https://doi.org/                             detection, in: 8th International Conference on Learn-
     10.1145/3331184.3331254. doi:1 0 . 1 1 4 5 / 3 3 3 1 1 8 4 . 3 3 3 1 2 5 4 .        ing Representations, ICLR 2020, Addis Ababa, Ethiopia,
[20] G. Cornacchia, F. M. Donini, F. Narducci, C. Pomo,                                  April 26-30, 2020, OpenReview.net, 2020. URL: https:
     A. Ragone, Explanation in multi-stakeholder recom-                                  //openreview.net/forum?id=BkgnhTEtDS.
     mendation for enterprise decision support systems, in:                         [28] E. Strumbelj, I. Kononenko, Explaining prediction
     A. Polyvyanyy, S. Rinderle-Ma (Eds.), Advanced Informa-                             models and individual predictions with feature con-
     tion Systems Engineering Workshops - CAiSE 2021 Inter-                              tributions, Knowl. Inf. Syst. 41 (2014) 647–665. URL:
     national Workshops, Melbourne, VIC, Australia, June 28                              https://doi.org/10.1007/s10115-013-0679-x. doi:1 0 . 1 0 0 7 /
     - July 2, 2021, Proceedings, volume 423 of Lecture Notes                            s10115- 013- 0679- x.
     in Business Information Processing, Springer, 2021, pp.                        [29] D. Alvarez-Melis, T. S. Jaakkola,                          Towards ro-
     39–47. URL: https://doi.org/10.1007/978-3-030-79022-6_                              bust interpretability with self-explaining neural net-
     4. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 0 3 0 - 7 9 0 2 2 - 6 \ _ 4 .                    works, in: S. Bengio, H. M. Wallach, H. Larochelle,
[21] Z. C. Lipton, The mythos of model interpretability, Com-                            K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Ad-
     mun. ACM 61 (2018) 36–43. URL: https://doi.org/10.1145/                             vances in Neural Information Processing Systems
     3233231. doi:1 0 . 1 1 4 5 / 3 2 3 3 2 3 1 .                                        31: Annual Conference on Neural Information Pro-
[22] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, X. Xie, A                                  cessing Systems 2018, NeurIPS 2018, December 3-
     reinforcement learning framework for explainable rec-                               8, 2018, Montréal, Canada, 2018, pp. 7786–7795.
     ommendation, in: IEEE International Conference on                                   URL: https://proceedings.neurips.cc/paper/2018/hash/
     Data Mining, ICDM 2018, Singapore, November 17-20,                                  3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html.
     2018, IEEE Computer Society, 2018, pp. 587–596. URL:                           [30] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of
     https://doi.org/10.1109/ICDM.2018.00074. doi:1 0 . 1 1 0 9 /                        interpretability methods, CoRR abs/1806.08049 (2018).
     ICDM.2018.00074.                                                                    URL: http://arxiv.org/abs/1806.08049. a r X i v : 1 8 0 6 . 0 8 0 4 9 .
[23] G. Peake, J. Wang, Explanation mining: Post hoc in-                            [31] S. Saito, E. Chua, N. Capel, R. Hu, Improving LIME
     terpretability of latent factor models for recommenda-                              robustness with smarter locality sampling, CoRR
     tion systems, in: Y. Guo, F. Farooq (Eds.), Proceed-                                abs/2006.12302 (2020). URL: https://arxiv.org/abs/2006.
     ings of the 24th ACM SIGKDD International Confer-                                   12302. a r X i v : 2 0 0 6 . 1 2 3 0 2 .
     ence on Knowledge Discovery & Data Mining, KDD                                 [32] D. Slack, S. Hilgard, E. Jia, S. Singh, H. Lakkaraju, Fooling
     2018, London, UK, August 19-23, 2018, ACM, 2018,                                    LIME and SHAP: adversarial attacks on post hoc expla-
     pp. 2060–2069. URL: https://doi.org/10.1145/3219819.                                nation methods, in: A. N. Markham, J. Powles, T. Walsh,
     3220072. doi:1 0 . 1 1 4 5 / 3 2 1 9 8 1 9 . 3 2 2 0 0 7 2 .                        A. L. Washington (Eds.), AIES ’20: AAAI/ACM Confer-
[24] Y. Tao, Y. Jia, N. Wang, H. Wang, The fact: Taming                                  ence on AI, Ethics, and Society, New York, NY, USA,
     latent factor models for explainability with factoriza-                             February 7-8, 2020, ACM, 2020, pp. 180–186. URL: https:
     tion trees, in: B. Piwowarski, M. Chevalier, É. Gaussier,                           //doi.org/10.1145/3375627.3375830. doi:1 0 . 1 1 4 5 / 3 3 7 5 6 2 7 .
     Y. Maarek, J. Nie, F. Scholer (Eds.), Proceedings of                                3375830.
     the 42nd International ACM SIGIR Conference on Re-                             [33] F. M. Harper, J. A. Konstan, The movielens datasets:
     search and Development in Information Retrieval, SI-                                History and context, ACM Trans. Interact. Intell. Syst. 5
     GIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp.                           (2016) 19:1–19:19. URL: https://doi.org/10.1145/2827872.
     295–304. URL: https://doi.org/10.1145/3331184.3331244.                              doi:1 0 . 1 1 4 5 / 2 8 2 7 8 7 2 .
     doi:1 0 . 1 1 4 5 / 3 3 3 1 1 8 4 . 3 3 3 1 2 4 4 .                            [34] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta,
[25] J. Gao, X. Wang, Y. Wang, X. Xie, Explainable recom-                                F. A. Merra, C. Pomo, F. M. Donini, T. D. Noia, El-
     mendation through attentive multi-view learning, in:                                liot: A comprehensive and rigorous framework for
     The Thirty-Third AAAI Conference on Artificial Intel-                               reproducible recommender systems evaluation, in:
     ligence, AAAI 2019, The Thirty-First Innovative Appli-                              F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai
     cations of Artificial Intelligence Conference, IAAI 2019,                           (Eds.), SIGIR ’21: The 44th International ACM SIGIR
     The Ninth AAAI Symposium on Educational Advances                                    Conference on Research and Development in Informa-
     in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii,                            tion Retrieval, Virtual Event, Canada, July 11-15, 2021,
     USA, January 27 - February 1, 2019, AAAI Press, 2019,                               ACM, 2021, pp. 2405–2414. URL: https://doi.org/10.1145/
     pp. 3622–3629. URL: https://doi.org/10.1609/aaai.v33i01.                            3404835.3463245. doi:1 0 . 1 1 4 5 / 3 4 0 4 8 3 5 . 3 4 6 3 2 4 5 .
     33013622. doi:1 0 . 1 6 0 9 / a a a i . v 3 3 i 0 1 . 3 3 0 1 3 6 2 2 .        [35] V. W. Anelli, T. D. Noia, E. D. Sciascio, C. Pomo,
[26] F. Fusco, M. Vlachos, V. Vasileiadis, K. Wardatzky,                                 A. Ragone, On the discriminative power of hyper-
     J. Schneider, Reconet: An interpretable neural ar-                                  parameters in cross-validation and how to choose them,
     chitecture for recommender systems, in: S. Kraus                                    in: T. Bogers, A. Said, P. Brusilovsky, D. Tikk (Eds.), Pro-
     (Ed.), Proceedings of the Twenty-Eighth International                               ceedings of the 13th ACM Conference on Recommender
     Joint Conference on Artificial Intelligence, IJCAI 2019,                            Systems, RecSys 2019, Copenhagen, Denmark, Septem-
     Macao, China, August 10-16, 2019, ijcai.org, 2019, pp.                              ber 16-20, 2019, ACM, 2019, pp. 447–451. URL: https:
     2343–2349. URL: https://doi.org/10.24963/ijcai.2019/325.                            //doi.org/10.1145/3298689.3347010. doi:1 0 . 1 1 4 5 / 3 2 9 8 6 8 9 .
     doi:1 0 . 2 4 9 6 3 / i j c a i . 2 0 1 9 / 3 2 5 .                                 3347010.
[27] M. Tsang, D. Cheng, H. Liu, X. Feng, E. Zhou, Y. Liu,                          [36] W. Krichene, S. Rendle, On sampled metrics for item
     Feature interaction interpretability: A case for explain-                           recommendation, in: R. Gupta, Y. Liu, J. Tang, B. A.
     ing ad-recommendation systems via neural interaction                                Prakash (Eds.), KDD ’20: The 26th ACM SIGKDD Con-
ference on Knowledge Discovery and Data Mining, Vir-
tual Event, CA, USA, August 23-27, 2020, ACM, 2020,
pp. 1748–1757. URL: https://doi.org/10.1145/3394486.
3403226. doi:1 0 . 1 1 4 5 / 3 3 9 4 4 8 6 . 3 4 0 3 2 2 6 .