A Clustering-based Approach for Interpreting Black-box Models (Discussion Paper)

Luca Ferragina¹,†, Simona Nisticò¹,*,†
¹ DIMES, University of Calabria, 87036 Rende (CS), Italy

SEBD '24: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
† These authors contributed equally.
luca.ferragina@unical.it (L. Ferragina); simona.nistico@unical.it (S. Nisticò)
ORCID: 0000-0003-3184-4639 (L. Ferragina); 0000-0002-7386-2512 (S. Nisticò)

Abstract
Classification and regression tasks involving image data are often connected to critical domains or operations. In this context, Machine and Deep Learning techniques have achieved astonishing performances. Unfortunately, the models resulting from such techniques are so complex that they act as black boxes, even when we have full access to the model's information. This limits the experts who leverage these tools to make decisions and lowers the trust of the users who are subjected to their outcomes. Several methods have been proposed to explain a black box, both for generic data domains and specifically for images. Nevertheless, the most widely used explanation tools for image data have limitations: they consider pixel-level explanations (SHAP), involve an image segmentation phase (LIME), or apply only to specific neural architectures (Grad-CAM). In this work, we introduce CLAIM, a model-agnostic explanation approach that interprets black boxes by leveraging clustering to produce interpretation-dependent higher-level features. Additionally, we perform a preliminary analysis aimed at probing the potential of the proposed approach.

Keywords
eXplainable AI, Post-hoc explanations, Local Explanations, Model-agnostic Explanations

1. Introduction

The pervasive use of Machine and Deep Learning models in everyday life processes has raised the problem of models' trustworthiness. The root of this problem lies in the level of complexity characterizing them. Indeed, such complexity makes it difficult to understand the logic followed by the model to perform its prediction, and this is true not only for final users but also for Machine and Deep Learning experts, who have difficulties inspecting and debugging their own models. When predictive models take part in decisions that affect users' lives, the above-stated problem acquires even a legal dimension: the GDPR enshrines the right of explanation [1], which requires that users can be provided with an intelligible explanation of the model outcome.

All the above-stated issues have led to the birth of the eXplainable Artificial Intelligence (XAI) field, which collects the research efforts in providing instruments for a more aware use of artificial intelligence solutions. Many types of approaches have been developed to face the model explainability problem [2, 3]. From the taxonomy described in [1], it emerges that some works focus on building models which are interpretable by design, whose most challenging aspect is to provide explainability without affecting model performance too much. Others, known as post-hoc explanation methods, aim to explain already designed and trained models.
Post-hoc methods can further be distinguished by the level of information available about the model, ranging from access only to the model output to complete access to all of its internals. In this work, we focus on post-hoc explanation methods.

Our contributions can be summarized as follows:
• we analyze the main algorithms for explaining models working on image data;
• we introduce CLAIM, a CLustering-based Approach for Interpreting black-box Models;
• we provide preliminary experimental results.

The structure of the paper is the following. In Section 2 we describe the background and discuss related works. In Section 3 we introduce the CLAIM algorithm and provide an example to illustrate its working principles. In Section 4 the results of the experiments are reported. Finally, Section 5 concludes the paper.

2. Preliminaries and Related Works

Regarding post-hoc explainability, various settings arise from different levels of information availability. One of them considers model-agnostic explanations, in which only the information provided by the model output is exploited to understand its behaviour. To explain the prediction for a certain data instance, some post-hoc methodologies perturb the input, collect information about how the outcome changes, and then exploit it to estimate the importance of each feature. Among them, SHAP [4] is a game-theory-inspired method that enhances interpretability by computing an importance value for each feature. While SHAP applies to different data types, the RISE [5] method focuses on image data. To explain a prediction, it generates an importance map indicating how salient each pixel is for the model's prediction, probing the model with randomly masked versions of the input image and observing how the output changes.

Explanations can also be given by examples, in which case counterfactuals, i.e. instances similar to the considered sample that bring the model to produce a different outcome, are provided to users as justifications. Among the methods of this category it is possible to find LORE [6], which learns an interpretable classifier in a neighbourhood of the sample to explain; this neighbourhood is generated by a genetic algorithm, using the model outcomes as labels, and from the interpretable classifier an explanation consisting of a decision rule and a set of counterfactuals is extracted. DiCE, proposed in [7], generates counterfactual examples that are both actionable and diverse; the diversity requirement aims at increasing the richness of the information delivered to the user. MILE [8] exploits an adversarial-like neural network architecture to learn a transformation able to change the black-box model outcome for the considered sample. It simultaneously provides two kinds of explanation: a counterfactual, which is the result of applying the transformation, and a score for each object feature, derived from the transformation itself.

Finally, some methodologies explain models via local surrogates, which are self-interpretable functions that mimic the decisions of black-box models locally. LIME [9] explains black-box models by mapping the data object into a domain of interpretable features, perturbing it in that domain, querying the black box on the perturbed samples, and learning a simple model (the local surrogate) on this generated dataset, as sketched below.
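As an illustration of the local-surrogate idea (a minimal sketch, not the implementation of any of the cited methods), the following code fits a proximity-weighted linear surrogate in a binary interpretable-feature space. The callables `black_box` and `reconstruct`, as well as the kernel shape, are assumptions made only for this sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(black_box, reconstruct, n_features,
                    n_samples=1000, kernel_width=0.25, seed=0):
    """Fit a weighted linear surrogate in an interpretable (binary) feature space.

    black_box:   callable mapping an image to a scalar score (e.g. a class probability);
                 hypothetical signature, assumed for this sketch.
    reconstruct: callable mapping a binary mask over interpretable features
                 (e.g. superpixels kept or greyed out) back to an image.
    """
    rng = np.random.default_rng(seed)
    # Random on/off patterns over the interpretable features.
    Z = rng.integers(0, 2, size=(n_samples, n_features))
    Z[0] = 1  # keep the original instance in the neighbourhood
    y = np.array([black_box(reconstruct(z)) for z in Z])
    # Proximity kernel: perturbations closer to the original instance weigh more.
    dist = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_  # one importance value per interpretable feature
```

The coefficients of the surrogate then serve as the importance scores of the interpretable features for the prediction being explained.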
A variant of LIME, called 𝒮-LIME, has been proposed in [10] to address the problem of out-of-distribution generated samples; in particular, it exploits semantic features extracted through unsupervised learning to generate the neighbourhood. The type of explanation changes in Anchor [11], where high-precision if-then rules, called anchors, are created and used to represent locally sufficient conditions for the prediction.

Other approaches, known as model-specific methods, exploit the peculiarities of the class of models they target. For example, some methods focus on neural networks and exploit the information carried by the gradient. The authors of [12] propose two methodologies to explain ConvNets through visualization, which leverage the computation of the gradient of the class score with respect to the input image. Grad-CAM [13, 14] uses the gradients of any target concept, which can be related to classification or other tasks, flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image that are important for predicting the concept.

Among the aforementioned methods, those that specifically address the post-hoc explanation of black-box models dealing with image data are Grad-CAM and RISE. However, Grad-CAM has the weakness of applying only to a very specific type of model, i.e. convolutional neural networks. As for RISE, being designed to compute the importance of every single feature, it provides explanations that are hard to exploit, since the number of features is very large. Generally, to make these explanations more user-friendly when image data is considered, the explanation is shaped as a heatmap h that assigns a continuous importance value to each pixel. Regrettably, this is not sufficient, since the explanations obtained in this way may be so scattered as to remain incomprehensible to users.

Even if it is not specifically tailored to image data, one of the most widespread methods for model-agnostic explanation is LIME, because of its versatility and ease of use. When dealing with images, LIME considers as interpretable features for the surrogate model the output of a segmentation algorithm. The issue with this approach is that the segmentation and explanation steps are separated from each other, therefore the aggregation of different pixels is not based on their importance for the model. This may lead to a rough explanation that identifies the most important portions of the image (according to the black box) with low precision.

In the following section, we introduce CLAIM, an algorithm for the post-hoc explanation of black-box models specifically designed for image data. CLAIM faces the issues described above by aggregating pixels that the model considers of similar importance, i.e. that produce a similar effect when they are perturbed.

3. Methodology

Let 𝑓 : ℝ^𝑑 → ℝ be a black box assigning a real value to each data point belonging to the input space. In principle, 𝑓 may be the function obtained from a machine learning method achieving any specific task. Thus, for example, 𝑓(x) may represent the result of a regression analysis on x, the probability that x belongs to a certain class, the anomaly degree of the point x, and so on.

Given a sample x, in order to understand which features are the most important, according to the model 𝑓, for the elaboration of the output 𝑓(x), we investigate how feature-wise perturbations of x affect the value of 𝑓(x). Thus, for each fixed 𝑖 ∈ {1, . . . , 𝑑}, we consider

    𝑝_𝑛^(𝑖) = 𝑓(x) − 𝑓(x + 𝑛𝜀 e_𝑖),        (1)
where 𝜀 is a perturbation step, 𝑛 ∈ {−𝑁, . . . , −1, 1, . . . , 𝑁} determines the number of perturbations we perform, and e_𝑖 is the vector whose components are all equal to 0 except for the 𝑖-th, which is equal to 1. The value in Equation (1) expresses how much the output of the black-box model 𝑓 varies if we perturb by 𝑛𝜀 the feature 𝑖 of the input data point x.

By collecting the variations obtained on the feature 𝑖 for all the different 𝑛 ∈ {−𝑁, . . . , −1, 1, . . . , 𝑁}, we obtain an embedded representation p^(𝑖) ∈ ℝ^(2𝑁) relative to the feature 𝑖, whose components are

    p^(𝑖) = [ 𝑝_{−𝑁}^(𝑖), . . . , 𝑝_{−1}^(𝑖), 𝑝_1^(𝑖), . . . , 𝑝_𝑁^(𝑖) ].

The same reasoning applied to each feature of x produces a finite set of embedded points

    𝑃 = { p^(1), . . . , p^(𝑑) } ⊆ ℝ^(2𝑁)

such that each p^(𝑖) represents how the model behaves when we perturb x on the feature 𝑖. Each of the 2𝑁 dimensions of the space we build contains information about how the pixels of the sample behave when they are subjected to a fixed perturbation.

The norm of a given p^(𝑖) in the 2𝑁-dimensional space is a score measuring the importance of the feature 𝑖 for the elaboration of the output provided by 𝑓 on the sample x. Indeed, if ‖p^(𝑖)‖ is relatively low, perturbations on the feature 𝑖 of the input do not substantially modify the output provided by 𝑓, and thus the contribution of the feature 𝑖 to the output 𝑓(x) is poor. On the other hand, if it is high, perturbing the feature 𝑖 has a huge effect on the output, and thus 𝑖 must be a very important feature according to the model.

In order to provide a more understandable heatmap, we apply a clustering algorithm on 𝑃 with the aim of aggregating features with similar behaviours. Here we consider the 𝐾-MEANS algorithm [15], which builds a set of 𝐾 centroids 𝐶 = {c^(1), . . . , c^(𝐾)} representative of the 𝐾 clusters {𝐶_1, . . . , 𝐶_𝐾} into which 𝑃 is partitioned. From our perspective, each cluster 𝐶_𝑘, with 𝑘 ∈ {1, . . . , 𝐾}, constitutes a subset of features of the input point x on which the model 𝑓 behaves similarly in the presence of perturbations. Thus, each 𝐶_𝑘 can be seen as a macro-feature and the norm of the corresponding centroid ‖c^(𝑘)‖ indicates the importance of this macro-feature for the output provided by 𝑓 on x.

To better visualize the explanation obtained, we build a heatmap h whose 𝑖-th element is given by

    ℎ_𝑖 = ‖𝒞(p^(𝑖))‖,        (2)

where we indicate by 𝒞 : 𝑃 → 𝐶 the function that assigns each point in 𝑃 to the centroid of the cluster to which it belongs.

Algorithm 1: CLAIM
Input: Black box 𝑓, point x, perturbation step 𝜀, number of perturbations 𝑁, number of macro-features 𝐾
Output: A heatmap h highlighting the most important macro-features of x according to 𝑓
1  foreach feature 𝑖 = 1, . . . , 𝑑 do
2      foreach 𝑛 ∈ {−𝑁, . . . , −1, 1, . . . , 𝑁} do
3          Compute the perturbation 𝑝_𝑛^(𝑖) using Equation (1);
4  Apply the 𝐾-MEANS algorithm to the set 𝑃, obtaining the clusters {𝐶_1, . . . , 𝐶_𝐾};
5  Build the value of each feature 𝑖 of the heatmap h using Equation (2);

To better illustrate each step of CLAIM (also reported in Algorithm 1), in the next section we provide a detailed example.
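Before moving to the example, we give a minimal sketch of the whole procedure. It follows Algorithm 1 line by line, assuming a black box `f` that maps a flattened image in ℝ^𝑑 to a scalar; the use of scikit-learn's KMeans and the specific function signature are illustrative choices of this sketch, not part of the original description.

```python
import numpy as np
from sklearn.cluster import KMeans

def claim_heatmap(f, x, eps=0.5, N=1, K=2, seed=0):
    """CLAIM sketch: perturbation embeddings + K-MEANS + centroid-norm heatmap."""
    d = x.shape[0]
    steps = [n for n in range(-N, N + 1) if n != 0]   # n in {-N, ..., -1, 1, ..., N}
    fx = f(x)
    # Lines 1-3 of Algorithm 1: build P, one 2N-dimensional embedding per feature.
    P = np.empty((d, 2 * N))
    for i in range(d):
        for j, n in enumerate(steps):
            x_pert = x.copy()
            x_pert[i] += n * eps                      # perturb only feature i
            P[i, j] = fx - f(x_pert)                  # Equation (1)
    # Line 4: cluster the embeddings into K macro-features.
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(P)
    centroids = km.cluster_centers_
    # Line 5 / Equation (2): each feature gets the norm of its cluster centroid.
    h = np.linalg.norm(centroids[km.labels_], axis=1)
    return h  # reshape to the image size for display
```

For the example below, x would be a flattened 28×28 MNIST image, f the predicted probability of class 9, and h would be reshaped to 28×28 for visualization.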
3.1. Motivating Example

When dealing with a black-box model, an analysis of performance metrics such as accuracy is not always sufficient to effectively assess its quality. One common situation is that in which the presence of some bias in the data used for training a model 𝑓 affects the output provided by 𝑓, making it focus on non-relevant features. This is potentially dangerous because, if the data used for the quality assessment presents the same issue, the resulting model performances do not reveal its poor quality.

To reproduce this kind of scenario, we set up the following experiment. We consider as 𝑓 a logistic regression model that must classify the images belonging to the MNIST [16] data set as 3 or 9; more specifically, given an image x, 𝑓 outputs the probability of x belonging to the class 9. We then train 𝑓 with samples of classes 3 and 9 in which we add a black rectangle in the bottom-left corner of all the training images belonging to the class 3, as shown in Figure 1a. In the following, we refer to the images with this modification as biased.

At inference time we pass to the model an input sample x representing a non-biased 3 (Figure 1b). The model fails to recognize it as a 3, since it outputs 𝑓(x) = 0.97. This happens because the added sign fully discriminates between the classes in the training set, thus the model is clearly focusing only on the portion of the image where the sign is (or should be) located, as depicted in Figure 1c, which shows the magnitude of the model's weights.

We now apply CLAIM to explain the behaviour of 𝑓 on the image x; in particular, we set 𝜀 = 0.5, 𝐾 = 2, and 𝑁 = 1. Figure 2a shows the bi-dimensional data space 𝑃 built in line 3 of Algorithm 1 and Figure 2b shows the two centroids (the red triangles) and the split into the two clusters (purple and yellow points) obtained in line 4. We can observe that the cluster in yellow is the one farther from the origin, which means that its centroid has a larger norm and, thus, it corresponds to the macro-feature that contributes the most to the output of 𝑓.

Figure 1: Explanation of the classification outcome for an unbiased sample of class 3 (Figure 1b) classified by the black box as belonging to class 9. (a) Biased train sample. (b) Unbiased test sample. (c) Model's weights. (d) CLAIM heatmap. For the latter two figures, the colour indicates the level of importance of the corresponding pixel/cluster.

Figure 2: Plot of the 2-dimensional data set collected by CLAIM to explain the black-box outcome for an unbiased element of class 3. Figure 2a depicts the points carrying the collected information. Figure 2b shows how the samples have been clustered into 2 groups; the red triangles highlight the cluster centroids.

Figure 1d shows the heatmap provided in output; we can see that it is almost identical to the model's weights, which means that CLAIM has been able to identify with great precision the most expressive (according to 𝑓) macro-feature of x.

4. Experimental Results

The example described before is quite simple, since the bias is induced by a patch with a regular shape that is quite easy to handle for most algorithms. Therefore, in order to analyze the behaviour of CLAIM in more challenging scenarios, in the following we consider settings similar to the one described in Section 3.1, but in which the shapes of the patches are more elaborate. To have guidance in assessing the adherence of the explanation to the behaviour of the explained model, in all the experiments we again consider a logistic regression as the black box 𝑓. We also compare the results obtained by CLAIM with those obtained by LIME.
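All the experiments share the same construction of a biased black box as in Section 3.1, only with different patch shapes. The following sketch reproduces that construction for the single-square case; the patch position, size, intensity, and the use of scikit-learn's LogisticRegression are assumptions of this sketch rather than details reported in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def add_patch(img, size=5, value=1.0):
    """Stamp a square patch in the bottom-left corner of a 28x28 image (illustrative bias)."""
    out = img.copy()
    out[-size:, :size] = value   # patch intensity chosen arbitrarily for the sketch
    return out

def build_biased_black_box(X3, X9):
    """Train the black box f on biased data: every 3 carries the patch, no 9 does.

    X3, X9: arrays of shape (n, 28, 28) with MNIST digits 3 and 9 scaled to [0, 1].
    Returns f mapping a flattened image to the predicted probability of class 9.
    """
    X3_biased = np.stack([add_patch(img) for img in X3])
    X = np.concatenate([X3_biased, X9]).reshape(len(X3) + len(X9), -1)
    y = np.concatenate([np.zeros(len(X3)), np.ones(len(X9))])   # 1 = class "9"
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return lambda x: clf.predict_proba(x.reshape(1, -1))[0, 1]
```

Given such an `f`, the explanation of a test digit `x_test` can then be obtained as in the earlier sketch, e.g. `claim_heatmap(f, x_test.ravel(), eps=0.5, N=1, K=2)`.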
In the first experiment, instead of a single square, the bias is given by five squares placed in a sort of "chessboard"-shaped patch (Figure 3a). Figure 3 reports the visual result of the experiment on an image (Figure 3a) of a 9 that, differently from the training set, contains the bias. Similarly to before, the model has mainly focused on the features involved in the bias; indeed, all the weights associated with the other features are very close to 0 (Figure 3b).

Figure 3: A biased 9 from the test set of the first experiment and its associated explanations. (a) Test sample. (b) Model's weights. (c) CLAIM heatmap. (d) LIME heatmap.

Figure 4: An unbiased 3 from the test set of the first experiment and its associated explanations. (a) Test sample. (b) Model's weights. (c) CLAIM heatmap. (d) LIME heatmap.

As we can see from Figure 3d, CLAIM is able to perfectly isolate the bias, aggregating all its pixels into a single macro-feature and assigning to it a large value in the heatmap. On the other hand, LIME is only able to roughly identify the portion of the image containing the bias, but it fails to precisely determine its shape. This happens because the segmentation algorithm used by LIME does not exactly separate the bias from the rest of the image.

A similar behaviour can be observed on an image of a 3 that does not contain the patch (Figure 4a). Even in this case, in which the bias is invisible, our method succeeds in capturing it, obtaining a heatmap (Figure 4d) consistent with the weights of 𝑓 (Figure 4b). As for LIME (Figure 4c), its behaviour on this sample is worse. Indeed, since the segmentation step is performed before the rest of the algorithm and only considers the image content, it is impossible for it to obtain a partition of the image that is suitable to identify this bias.

In the second experiment, we add another difficulty by inserting a disconnected bias into the training set. In particular, we place one "chessboard" patch in the bottom-left corner and one in the top-right corner. As we can see from Figure 5, the heatmap provided by CLAIM is extremely precise, since it includes both patches in a single macro-feature and judges it as the most relevant macro-feature for 𝑓. As for LIME, also in this case the segmentation step causes some issues. In particular, the two patches belong to different portions of the image and thus they are assigned to two different macro-features. This type of result is not desirable, since the pixels belonging to the two patches contribute in exactly the same way to the output provided by 𝑓 and, for this reason, they conceptually belong to a unique macro-feature.

Figure 5: A biased 9 from the test set of the second experiment and its associated explanations. (a) Test sample. (b) Model's weights. (c) CLAIM heatmap. (d) LIME heatmap.

Still concerning this experiment, Table 1 shows Precision and Recall over the features considered important by the weights of the logistic regressor.
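One way such feature-level Precision and Recall can be computed is sketched below; the choice of thresholding the magnitude of the logistic-regression weights to obtain the set of "relevant" features, and of a boolean mask for the features marked important by the explainer, is an assumption of this sketch and not necessarily the exact protocol used for Table 1.

```python
import numpy as np

def precision_recall(explained_mask, weights, weight_threshold=0.1):
    """Compare an explanation with the features the black box actually relies on.

    explained_mask: boolean array, True where the explainer marks a pixel as important
                    (e.g. the pixels of the top-ranked macro-feature).
    weights:        flattened logistic-regression coefficients of the black box.
    """
    # Relevant features: weights whose magnitude is a sizeable fraction of the maximum.
    relevant = np.abs(weights) > weight_threshold * np.abs(weights).max()
    tp = np.sum(explained_mask & relevant)
    precision = tp / max(explained_mask.sum(), 1)
    recall = tp / max(relevant.sum(), 1)
    return precision, recall
```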
Table 1: Precision and Recall over the features considered important by the weights of the logistic regression trained on the training data set containing the disconnected bias.

              CLAIM           LIME 1          LIME 2          LIME 3
  Precision   1.000 ± 0.000   0.072 ± 0.002   0.034 ± 0.000   0.023 ± 0.000
  Recall      1.000 ± 0.000   0.816 ± 0.001   0.840 ± 0.003   0.865 ± 0.004

The metrics are computed on an unbiased test set and we report the mean and the standard deviation over all the samples in the test set. The first column refers to our method, while the others refer to the heatmap of LIME when the 1, 2, or 3 most important macro-features are considered. The numerical results confirm that CLAIM outperforms LIME in detecting this kind of bias. In particular, the Precision of LIME is always much smaller than its Recall, which means that it judges as important many pixels that are actually irrelevant for 𝑓.

5. Conclusion

In this work, we deal with the post-hoc explanation of models that work with image data and introduce the CLAIM algorithm. Our goal is to provide users with heatmaps that include higher-level features built under the guidance of the black-box model, without exploiting segmentation algorithms, which only consider the image content. The preliminary analysis performed on CLAIM's explanations leads to promising results about the potential of this method in producing faithful explanations that lead users to focus only on important image regions. In future developments, we will focus on enriching the expressive power of the information CLAIM extracts to compute explanations. We will also investigate CLAIM's robustness with respect to its parameters, to analyze how they affect explanation quality. Furthermore, we plan to enlarge the experiments performed to include more competitors and to consider richer data sets.

Acknowledgments

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 9 - Green-aware AI, under the NRRP MUR program funded by the NextGenerationEU.

References

[1] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. 51 (2019) 93:1–93:42. doi:10.1145/3236009.
[2] Q. Zhang, Y. N. Wu, S.-C. Zhu, Interpretable convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8827–8836.
[3] J. Donnelly, A. J. Barnett, C. Chen, Deformable ProtoPNet: An interpretable image classifier using deformable prototypes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10265–10275.
[4] S. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, arXiv preprint arXiv:1705.07874 (2017).
[5] V. Petsiuk, A. Das, K. Saenko, RISE: Randomized input sampling for explanation of black-box models, arXiv preprint arXiv:1806.07421 (2018).
[6] R. Guidotti, A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, F. Turini, Factual and counterfactual explanations for black box decision making, IEEE Intelligent Systems 34 (2019) 14–23.
[7] R. K. Mothilal, A. Sharma, C. Tan, Explaining machine learning classifiers through diverse counterfactual explanations, in: Proceedings of the 2020 ACM FAccT, 2020, pp. 607–617.
[8] F. Angiulli, F. Fassetti, S. Nisticò, Local interpretable classifier explanations with self-generated semantic features, in: DS, Springer, 2021, pp. 401–410.
[9] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD KDD, 2016, pp. 1135–1144.
[10] F. Angiulli, F. Fassetti, S. Nisticò, Finding local explanations through masking models, in: IDEAL 2021, Springer, 2021, pp. 467–475.
[11] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[12] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE ICCV, 2017, pp. 618–626.
[14] A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 839–847.
[15] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
[16] L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Processing Magazine 29 (2012) 141–142.