=Paper=
{{Paper
|id=Vol-3318/paper7
|storemode=property
|title=Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations
|pdfUrl=https://ceur-ws.org/Vol-3318/paper7.pdf
|volume=Vol-3318
|authors=Shea Cardozo,Gabriel Islas Montero,Dmitry Kazhdan,Botty Dimanov,Maleakhi Wijaya,Mateja Jamnik,Pietro Lio
|dblpUrl=https://dblp.org/rec/conf/cikm/CardozoMKDWJL22
}}
==Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations==
Shea Cardozo (Tenyks, University of Toronto; equal contribution), Gabriel Islas Montero (Tenyks, University of Toronto; equal contribution), Dmitry Kazhdan (Tenyks, University of Cambridge), Botty Dimanov (Tenyks), Maleakhi Wijaya (Tenyks), Mateja Jamnik (University of Cambridge), Pietro Lio (University of Cambridge)

AIMLAI @ CIKM'22: Advances in Interpretable Machine Learning and Artificial Intelligence, October 21, 2022, Atlanta, GA.
Contact: shea.cardozo@tenyks.ai (S. Cardozo); gabriel.montero@tenyks.ai (G. I. Montero); dmitry.kazhdan@tenyks.ai (D. Kazhdan); botty.dimanov@tenyks.ai (B. Dimanov); maleakhi.wijaya@tenyks.ai (M. Wijaya); mateja.jamnik@cl.cam.ac.uk (M. Jamnik); pl219@cam.ac.uk (P. Lio).
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

'''Abstract.''' Recent work has suggested post-hoc explainers might be ineffective for detecting spurious correlations in Deep Neural Networks (DNNs). However, we show there are serious weaknesses with the existing evaluation frameworks for this setting. Previously proposed metrics are extremely difficult to interpret and are not directly comparable between explainer methods. To alleviate these constraints, we propose a new evaluation methodology, Explainer Divergence Scores (EDS), grounded in an information theory approach to evaluating explainers. EDS is easy to interpret and naturally comparable across explainers. We use our methodology to compare the detection performance of three different explainers (feature attribution methods, influential examples and concept extraction) on two different image datasets. We discover post-hoc explainers often contain substantial information about a DNN's dependence on spurious artifacts, but in ways often imperceptible to human users. This suggests the need for new techniques that can use this information to better detect a DNN's reliance on spurious correlations.

'''Keywords.''' explainability, interpretability, XAI, spurious correlations, explainer evaluation, post-hoc explanations, shortcut learning

===1. Introduction===

Spurious correlations pose a serious risk to the application of Deep Neural Networks (DNNs), especially in critical applications such as medical imaging and security [1, 2, 3, 4]. This phenomenon, also known as shortcut learning or the Clever Hans effect, results from DNNs' tendency to overfit to subtle patterns that are difficult for a human user to identify, causing trained models to form decision rules that fail to generalise [5, 6, 7, 8].

Consequently, detecting a model's dependency on a spurious signal (or 'model spuriousness') in computer vision tasks has become an active area of research. Explainable AI (XAI) methods have been proposed as a potential avenue to address this challenge [5, 6, 9, 10]. One of these methods, post-hoc explanations, aims to describe the inference process of a pre-trained DNN in a human-interpretable manner [2, 11].

Past work has suggested human users may struggle to use post-hoc explanations to detect spurious signals if said spurious signal is not known ahead of time [12]. In this work, we ask deeper questions: Do post-hoc explanations contain any information that can be used to detect spurious signals even if the signal is not known ahead of time? If so, can we quantify and compare the amount of information different post-hoc explainers can extract?

In particular, we make the following contributions:

* We propose Explainer Divergence Scores (EDS): a novel way to evaluate a post-hoc explainer's ability to detect spurious correlations, based on an information theory foundation.
* We show our method's effectiveness by evaluating and comparing three different types of post-hoc explainers (feature attribution methods [13, 14], influential examples [15], and concept extraction [16]) across multiple datasets [17, 18] and spurious artifacts.
* We compare the amount of information regarding the presence of a spurious signal between different post-hoc explainers, which existing approaches fail to address, and discover that post-hoc explainers contain a significant amount of information on model spuriousness. Since this information is frequently not visible to human users, our findings suggest that future research into post-hoc explanations should focus on discovering and utilising this information.

Figure 1: Explainer Divergence Score (EDS). (a) With one engineered spurious dataset and one clean dataset, (b) we train two separate classification models. (c) These models evaluate different combinations of spurious and non-spurious examples. (d) EDS can assess a post-hoc explainer's ability to detect spurious correlations. (e) In comparison to previous work, our approach allows us to compare the performance of different types of post-hoc explainers directly.

===2. Related Work===

'''Spurious Correlations.''' Spurious correlations in DNNs have been the subject of an increasingly diverse body of work, with contributors analysing them through the lenses of distribution shift [19, 20], shortcut learning [6] and causal inference [21, 22]. Spurious correlations have raised issues in areas as diverse as privacy [23], fairness [24], and adversarial attacks [25]. Recent work has focused on identifying where spurious correlations manifest and on their properties, finding they often appear in practical settings [5, 6, 26, 7, 8].

'''Post-Hoc Explainers.''' Post-hoc explainability methods generate explanations of the inference process of an arbitrary trained DNN, and numerous post-hoc explainers have been proposed. Feature attribution methods, or 'heatmaps' in computer vision domains, measure the effect of each individual input (e.g., pixel) on the output of a DNN by leveraging either input perturbation [14] or gradient information [13, 27]. Influential example (or influence function) methods [28, 15] instead quantify the effect of specific training examples on a given output. Concept extraction methods [29, 16] seek to measure a DNN's reliance on a set of understandable concepts; these methods are naturally interpretable and extendable to many DNN architectures [30, 31].

Recent work has called into question the effectiveness of post-hoc explainers in both adversarial [32, 33] and non-adversarial [34, 35, 36, 37] settings. Given these deficiencies and their widespread usage, systematic methods of comparing and evaluating post-hoc explanations have become increasingly needed.

'''Evaluating Explainers.''' There is no generally agreed method for comparing and evaluating post-hoc explainers. The majority of previous work has focused on feature attribution methods, proposing metrics that measure desirable qualities of the attribution method [38, 39, 40]. These metrics often rely on semi-synthetic datasets containing 'ground truth' explanations that correspond to the presence of known spurious signals [41, 42].
Metrics for other explainers remain limited, with few exceptions [43], and human trials are often still the only viable approach.

The closest work to ours is Adebayo et al. [12], which formulates a paradigm for evaluating DNN explainers for the purpose of identifying spurious correlations. Similarly to our work, they focus on analysing spurious correlations in settings where the spurious signal is not known ahead of time, by comparing explainers from spurious and non-spurious models. However, their framework does not allow for direct comparison between different types of explainers, as their proposed quantities have different units for different types of explainers. To the best of our knowledge, this work is the only presentation of a method for evaluating a post-hoc explainer's ability to detect spurious correlations that is comparable across all types of explainers while remaining focused on the context where the spurious signal is unknown.

===3. Explainer Divergence Score===

We motivate our approach by considering the setting where a user seeks to determine whether a given model depends on a spurious signal using a post-hoc explainer. They inspect an explanation generated from a model prediction and use it to predict whether the model is spurious or not. Similarly to Adebayo et al. [12], we expect a high-quality explainer to generate very different explanations from spurious models compared to non-spurious models.

This can be framed as a binary classification problem, where a classifier outputs a binary label corresponding to a prediction of a model's dependence on a spurious signal, based upon an explanation as input. The classifier under this formulation is a machine learning model that takes the place of the user, and is trained to distinguish between explanations generated by spurious models and explanations generated by non-spurious models. A visual summary of our approach can be found in Figure 1.

Critically, the classifier is trained to distinguish between explanations generated by all spurious and non-spurious models produced by a specified training strategy, instead of any individual pair. This allows the classifier to generalize to unseen models, much like a human user would be expected to. We detail how we accomplish this in Section 4.1.

EDS is defined as the performance of this binary classifier in predicting model spuriousness on explanations generated using unseen models, and can be interpreted as a measure of explainer quality.

We can view our trained binary classifier's loss as an estimate of the distance between the distributions of explanations from spurious and non-spurious models respectively. Assume we have a trained binary classifier $f_{\hat{\theta}}$ parameterized by $\theta \in \Theta$. We train this classifier by minimizing the loss $\ell$ consisting of the cross-entropy $H$ between the distribution represented by the output of the model and $Y|x$, that is, the distribution $Y$ conditioned on the random variable $x$, where $Y$ is the Bernoulli distribution of binary labels of a given explanation in our training set (whether the model that generated it is spurious or non-spurious) and $x$ is distributed according to the mixture distribution $X$ of equally weighted explanations from both the spurious and non-spurious models. We then have:

$$\ell(f_{\hat{\theta}}) = \min_{\theta \in \Theta} \mathbb{E}_{x \sim X}\left[ H\big(Y|x,\; f_{\theta}(x)\big) \right] \tag{1}$$
$$= 1 - D_{\mathrm{JS}}\big(X|y{=}0,\; X|y{=}1\big) \tag{2}$$
$$\quad + \min_{\theta \in \Theta} \mathbb{E}_{x \sim X}\left[ D_{\mathrm{KL}}\big(Y|x \,\|\, f_{\theta}(x)\big) \right] \tag{3}$$

where $D_{\mathrm{JS}}$ denotes the Jensen-Shannon divergence, $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, and all quantities are measured in bits of entropy. The full derivation of this expression is presented in the Supplementary Material (Section 6). Ideally, for a well-trained classifier $f_{\hat{\theta}}$ of sufficient expressiveness, we would expect the distribution represented by the output of our classifier $f_{\hat{\theta}}(x)$ to approximate the true distribution of $Y|x$, meaning the Kullback-Leibler divergence between them is close to 0:

$$\mathbb{E}_{x \sim X}\left[ D_{\mathrm{KL}}\big(Y|x \,\|\, f_{\hat{\theta}}(x)\big) \right] \simeq 0 \tag{4}$$

And thus:

$$\ell(f_{\hat{\theta}}) \simeq 1 - D_{\mathrm{JS}}\big(X|y{=}0,\; X|y{=}1\big) \tag{5}$$

In this case the loss of our trained model can be seen as approximating the Jensen-Shannon divergence between the distribution of explanations generated by spurious models and the distribution of explanations generated by non-spurious models. Moreover, as all quantities share the same unit (information), they are directly comparable across explainers.
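To make Equation (5) concrete, the short numerical sketch below (our own illustration, not code from the paper) checks the identity for a toy pair of discrete 'explanation' distributions: the expected cross-entropy of the Bayes-optimal spurious-vs-non-spurious classifier equals one minus the Jensen-Shannon divergence of the two explanation distributions, with everything measured in bits.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

# Toy distributions over a small discrete "explanation" alphabet:
# p_ns = explanations from non-spurious models (y = 0),
# p_s  = explanations from spurious models (y = 1).
p_ns = np.array([0.70, 0.20, 0.05, 0.05])
p_s = np.array([0.10, 0.30, 0.40, 0.20])

# Equally weighted mixture X and its Jensen-Shannon divergence in bits.
m = 0.5 * (p_ns + p_s)
d_js = 0.5 * kl_bits(p_ns, m) + 0.5 * kl_bits(p_s, m)

# The best possible classifier outputs the true posterior P(y = 1 | x),
# so its expected cross-entropy loss is the conditional entropy H(Y | X).
posterior = 0.5 * p_s / m
optimal_loss = sum(m_x * entropy_bits([1 - q, q]) for m_x, q in zip(m, posterior))

print(f"1 - D_JS     = {1 - d_js:.6f} bits")
print(f"optimal loss = {optimal_loss:.6f} bits")  # matches Equation (5)
```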
In practice, sufficient classifier accuracy for Equation (4) to hold appears to be uncommon, leading to an average loss that is unbounded above and difficult to estimate. Hence we define EDS as the classification accuracy of the binary classifier instead. This has the added advantage of providing an interpretable baseline for our metric: if the classifier cannot do better than random guessing (an EDS of 0.5), then it has failed to capture any information in the explanations useful for determining model spuriousness, and there is thus a very low likelihood that the explainer captures any information about the spurious signal.

===4. Experiments===

Using a similar setup to Adebayo et al. [12] and Yang and Chaudhuri [44], we investigated three different types of spurious artifacts:

* Square - a small square in the top left corner of the image
* Stripe - a vertical stripe 9 pixels from the left of the image
* Noise - uniform Gaussian noise applied to every pixel value of the image

Examples of each spurious artifact on both the dSprites and 3dshapes datasets are present in the Supplementary Material (Section 6); a schematic sketch of how such artifacts can be injected is shown below.

We experiment to determine the effect of the intensity of each spurious artifact on a model's spurious behaviour, and then train models to maximize this spuriousness. The details of this experiment and of the overall model training procedure can be found in the Supplementary Material (Section 6).
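For illustration only, the following minimal NumPy sketch shows how artifacts of these three kinds could be injected into a 64x64 greyscale image. The square size, stripe width, and noise scale are our own assumptions; the paper defers the exact artifact parameters and intensities to its Supplementary Material.

```python
import numpy as np

def add_square(img, size=4, value=1.0):
    """'Square' artifact: a small patch in the top-left corner (patch size assumed)."""
    out = img.copy()
    out[:size, :size] = value
    return out

def add_stripe(img, column=9, value=1.0):
    """'Stripe' artifact: a vertical stripe 9 pixels from the left (1-pixel width assumed)."""
    out = img.copy()
    out[:, column] = value
    return out

def add_noise(img, sigma=0.1, rng=None):
    """'Noise' artifact: Gaussian noise added to every pixel value (scale assumed)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

# Inject each artifact into a blank 64x64 dSprites-style image.
img = np.zeros((64, 64))
spurious_variants = {
    "square": add_square(img),
    "stripe": add_stripe(img),
    "noise": add_noise(img),
}
print({name: float(v.sum()) for name, v in spurious_variants.items()})
```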
====4.1. EDS Experimental Setup====

For all datasets and explainers, we evaluate the Explainer Divergence Score (EDS) as follows. We split the dataset into three partitions: an 80% partition used for model training, a 14% partition used for binary classifier training, and a 6% partition used for validation.

Recall that in Section 3 we defined EDS using a binary classifier trained to distinguish between explanations generated across all spurious and non-spurious models. Training a new model for every explanation is far too computationally intensive. To rectify this, for each spurious artifact we train 100 spurious and 100 non-spurious models on our model training dataset partition, using different weight initializations, and use this sample as an estimate of the complete distributions of trained spurious and non-spurious models respectively. We train models and ensure they are spurious or non-spurious using the procedure detailed in the Supplementary Material (Section 6).

We reserve 30 spurious and 30 non-spurious models for validation and use the remaining 70 of each set to generate training data for our binary classifier. Images from the respective dataset partition are combined with a randomly selected model to generate an explanation, as well as a binary class label corresponding to whether the model came from the spurious or non-spurious set. A classifier is then trained on this data to predict this class label from the explanations.

Finally, our remaining 30 spurious and 30 non-spurious models are combined with the validation dataset partition to generate explanations in the same fashion as in training. The label prediction accuracy of the binary classifier on this set is then our estimate of the Explainer Divergence Score of the given explainer for this spurious signal. Further experimental setup details are noted in the Supplementary Material (Section 6).
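To show the data flow of this pipeline end to end, here is a self-contained sketch under stated assumptions: the explanations are replaced by synthetic feature vectors and the binary classifier by logistic regression, since the real explainers and classifier architecture are specified in the paper's Supplementary Material. The 70/30 split of spurious and non-spurious models mirrors the setup above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EXPL_DIM, EXPLS_PER_MODEL = 32, 50   # explanation size and images per model (assumed)

def explanations_for_model(spurious):
    """Stand-in for running a real explainer on one trained model; spurious models
    get a small mean shift so that the two explanation pools differ slightly."""
    shift = 0.3 if spurious else 0.0
    return rng.normal(shift, 1.0, size=(EXPLS_PER_MODEL, EXPL_DIM))

def explanation_pool(spurious, n_models):
    X = np.vstack([explanations_for_model(spurious) for _ in range(n_models)])
    y = np.full(len(X), int(spurious))
    return X, y

# Explanations from 70 spurious + 70 non-spurious models train the classifier;
# 30 + 30 held-out models provide the validation explanations.
Xs_tr, ys_tr = explanation_pool(True, 70)
Xn_tr, yn_tr = explanation_pool(False, 70)
Xs_va, ys_va = explanation_pool(True, 30)
Xn_va, yn_va = explanation_pool(False, 30)

clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([Xs_tr, Xn_tr]), np.concatenate([ys_tr, yn_tr]))

# EDS = accuracy of the spuriousness classifier on explanations from unseen models.
eds = clf.score(np.vstack([Xs_va, Xn_va]), np.concatenate([ys_va, yn_va]))
print(f"EDS estimate: {eds:.3f} (0.5 = no usable information)")
```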
====4.2. Subclass Definitions====

For all EDS results we display accuracy not just over the entire dataset (noted as 'Overall' in figures), but also subdivided by the task label and the presence of the spurious artifact in the image. There are four subclasses in total:

* Images from the Spurious Class without the Spurious Artifact (abbreviated as 'S/NA' in figures)
* Images from Non-Spurious Classes without the Spurious Artifact ('NS/NA')
* Images from the Spurious Class with the Spurious Artifact ('S/A')
* Images from Non-Spurious Classes with the Spurious Artifact ('NS/A')

For example, say we had a class consisting of images of 'circles' and another of images of 'squares', and we trained a classification model between the two where we injected spurious Gaussian noise into the 'circles' class. 'S/NA' would correspond to images of circles without Gaussian noise, 'NS/NA' to images of squares without Gaussian noise, 'S/A' to images of circles with Gaussian noise, and 'NS/A' to images of squares with Gaussian noise.

This subdivision allows us to interpret the type of images the explainer can effectively use to determine model spuriousness. It is analogous to what is done in Adebayo et al. [12] via the 'Cause-for-Concern Metric' (CCM) and 'False Alarm Metric' (FAM), which split results by whether the spurious artifact is present in the image, but we present results in even finer detail with added class information.

====4.3. Synthetic Explainer Comparison====

We compare our EDS method to the approach in Adebayo et al. [12], starting with a simple example. We consider a toy classification task with two simple classes (the dSprites classes 'heart' and 'oval') with the 'stripe' spurious artifact injected. Instead of using a specific spurious detection method, we construct synthetic explainers that represent the expected behaviour of each method under ideal circumstances. We construct these 'ideal' explainers as follows:

* Heatmaps - for the spurious model, the explainer places all emphasis on the stripe for all images where it is present, and on the area where the stripe would be for images from the spurious class without the stripe. The explainer puts all emphasis on the shape in all cases for the non-spurious model, and for the spurious model on non-spurious classes without the stripe.
* Influential Examples - for the spurious model, the explainer selects influential examples of the spurious class with the stripe for all images, unless the image is from a non-spurious class without the stripe. For the non-spurious model, the explainer always selects examples of the correct class with the correct presence of the spurious artifact.
* Concept Extraction - we specify two binary concepts, one for the class label and one for the presence of the spurious artifact. We assume the spurious model can detect both perfectly, and thus extracts both accurately. The non-spurious model, on the other hand, is invariant to the spurious artifact in all circumstances, and thus always extracts that it is not present.

In addition to these ideal explainers, we also create 'noisy' variants where we inject noise across every explanation, as well as purely random variants where the corresponding explanations consist purely of noise. For heatmaps we inject uniform Gaussian noise into the heatmap, for influential examples we specify a chance (100% in the purely random variant) of randomly selecting a training image, and for concept extraction we specify a chance (100% in the purely random variant) of predicting a random concept label.

We evaluate both our EDS and the KSSD, CCM and FAM metrics [12] on these examples. For these metrics we specify similarity functions as follows: for heatmaps we use the SSIM similarity function as specified in [12]; for influential examples we use the Bhattacharyya coefficient [45] between the distributions of the class labels and the presence of a spurious artifact in the influential examples; and for concept extraction we use the negative of the L2 distance between concept labels as a 'similarity' function. The results are shown in Table 1.

Table 1: Results of evaluating EDS on the specified synthetic explainers, averaged over 5 runs. We observe the clear outperformance of heatmaps and influential examples over concept extraction, as well as the complete failure of the 'random' explainers, in the EDS results; however, these are not visible in the KSSD, CCM and FAM metric results [12]. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.03.

Explainer          | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Heatmaps (Ideal)   | 0.823       | 0.943    | 0.492     | 0.959   | 0.896    | 1.000  | 0.991  | 0.990
Heatmaps (Noisy)   | 0.702       | 0.771    | 0.459     | 0.805   | 0.771    | 0.970  | 0.993  | 0.993
Heatmaps (Random)  | 0.512       | 0.518    | 0.510     | 0.537   | 0.484    | 0.965  | 0.996  | 0.996
Influence (Ideal)  | 0.824       | 0.955    | 0.496     | 0.949   | 0.896    | 1.000  | 0.500  | 0.000
Influence (Noisy)  | 0.668       | 0.734    | 0.490     | 0.736   | 0.713    | 0.587  | 0.712  | 0.656
Influence (Random) | 0.515       | 0.510    | 0.516     | 0.520   | 0.514    | 0.000  | 0.915  | 0.927
Concept (Ideal)    | 0.750       | 0.500    | 0.500     | 1.000   | 1.000    | 0.000  | 0.000  | -0.500
Concept (Noisy)    | 0.617       | 0.488    | 0.535     | 0.723   | 0.721    | -0.284 | -0.412 | -0.514
Concept (Random)   | 0.491       | 0.490    | 0.475     | 0.527   | 0.473    | -0.494 | -0.481 | -0.509
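For reference, here is a minimal sketch of the two non-SSIM similarity functions named above, under our own assumption about how the (class label, artifact present) distribution of the selected influential examples is binned, which the paper does not spell out; for heatmaps the SSIM function from [12] is used instead.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient between two discrete distributions [45]:
    1.0 for identical distributions, 0.0 for distributions with disjoint support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def concept_similarity(c1, c2):
    """Negative L2 distance between concept label vectors, used as a 'similarity'."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    return -float(np.linalg.norm(c1 - c2))

# Influential examples: distributions over the four (class, artifact) subclasses
# of the examples selected for a spurious model and for a non-spurious model.
selected_by_spurious = np.array([0.70, 0.10, 0.10, 0.10])
selected_by_clean = np.array([0.25, 0.25, 0.25, 0.25])
print(bhattacharyya_coefficient(selected_by_spurious, selected_by_clean))  # ~0.89

# Concept extraction: (class label, artifact present) predictions from two models.
print(concept_similarity([1, 1], [1, 0]))  # -1.0
```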
Synthetic 'ideal' explainers are useful because we can specify in advance exactly how our explainers should behave for the spurious and non-spurious models in each subclass. Cases where the explanations generated from spurious and non-spurious models are drawn from the same distribution should result in the worst possible metrics; conversely, cases where the explanations are always radically different should result in close to perfect metrics.

This is exactly what we observe with EDS. Our approach finds that the ideal heatmap and influential example explainers almost perfectly identify model spuriousness, failing only on explanations generated from images from a non-spurious class without the spurious artifact. The ideal concept extraction explainer additionally falls short on images from the spurious class without the spurious artifact, indicating that this specification is a worse explainer for detecting spurious correlations than the competing methods.

We observe that the KSSD, CCM and FAM metrics from Adebayo et al. [12] fall short in this type of analysis: different types of explainers use different similarity functions with different units that are not directly comparable. This is a major innovation of our method over the existing state of the art.

Our method comes to the expected conclusion that the ideal explainers capture more information about model spuriousness than the noisy explainers, while the random explainers completely fail to capture any information about model spuriousness. This declining performance can also be seen in the KSSD, CCM and FAM metrics, but the utter failure of the random explainers is not visible with these metrics. With EDS, if the trained classifier fails to achieve at least 50% accuracy, we can interpret the explainer as having no information about the model's spuriousness. This is not possible using the KSSD, CCM and FAM metrics without explicitly running a baseline for every type of explainer evaluated.

====4.4. Real Explainer Comparison====

To test EDS on real explainer methods, we conduct experiments on reduced versions of both the dSprites [17] and 3dshapes [18] datasets. We train models to perform a shape classification task and arbitrarily select one class to be the spurious class for each experiment.

We chose some commonly used methods as representatives for each explainer type of interest. We use Integrated Gradients [13] as our chosen feature attribution method. For influential examples we use the TracInCP method [15], and for concept extraction we use Concept Model Extraction (CME) [46]. Examples of each explanation on images from the dSprites dataset are present in the Supplementary Material (Section 6).
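As an illustration of how the feature attribution explanations could be produced for the EDS classifier, here is a sketch using Captum's Integrated Gradients on a stand-in CNN; the actual model architecture, IG settings, and preprocessing are described in the paper's Supplementary Material, so everything beyond the library call itself is an assumption.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients  # pip install captum

# Stand-in CNN for 64x64 single-channel dSprites-style images (architecture assumed).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
    nn.Flatten(), nn.Linear(8 * 16 * 16, 2),
).eval()

images = torch.rand(4, 1, 64, 64)  # a batch of (possibly artifact-carrying) images
ig = IntegratedGradients(model)
heatmaps = ig.attribute(images, target=0, n_steps=32)  # per-pixel attributions

# Flattened heatmaps become inputs to the EDS binary classifier, labelled by
# whether `model` was drawn from the spurious or the non-spurious pool.
explanations = heatmaps.flatten(start_dim=1).detach().numpy()
print(explanations.shape)  # (4, 4096)
```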
More detailed information about the configuration setup for each experiment is present in the Supplementary Material (Section 6).

We display results for dSprites in Table 2 and results for 3dshapes in Table 3. For comparison, we also evaluate the KSSD, CCM, and FAM metrics formulated in Adebayo et al. [12] on both dSprites and 3dshapes.

Table 2: Results of evaluating EDS on the dSprites dataset, averaged over 5 runs. We observe the outperformance of influential examples over heatmaps and concept extraction, visible in the EDS results but not in the comparative metrics. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.04.

Artifact | Explainer | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Square   | Heatmap   | 0.799       | 0.837    | 0.590     | 0.902   | 0.916    | 0.851  | 0.877  | 0.837
Square   | Influence | 0.887       | 0.937    | 0.860     | 0.891   | 0.887    | 0.562  | 0.991  | 0.989
Square   | Concept   | 0.715       | 0.645    | 0.578     | 0.827   | 0.831    | -0.062 | -0.074 | -0.076
Stripe   | Heatmap   | 0.831       | 0.901    | 0.689     | 0.958   | 0.870    | 0.877  | 0.880  | 0.878
Stripe   | Influence | 0.881       | 0.892    | 0.829     | 0.909   | 0.913    | 0.561  | 0.991  | 0.980
Stripe   | Concept   | 0.707       | 0.618    | 0.596     | 0.788   | 0.815    | -0.061 | -0.074 | -0.077
Noise    | Heatmap   | 0.717       | 0.857    | 0.610     | 0.872   | 0.682    | 0.728  | 0.877  | 0.804
Noise    | Influence | 0.795       | 0.884    | 0.707     | 0.890   | 0.796    | 0.566  | 0.970  | 0.966
Noise    | Concept   | 0.744       | 0.650    | 0.652     | 0.863   | 0.808    | -0.062 | -0.078 | -0.076

Table 3: Results of evaluating EDS on the 3dshapes dataset, averaged over 5 runs. We observe the extreme outperformance of heatmaps over the remaining explainers, visible in the EDS results but not in the comparative metrics. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.04.

Artifact | Explainer | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Square   | Heatmap   | 1.00        | 1.00     | 1.00      | 1.00    | 1.00     | 0.680  | 0.828  | 0.826
Square   | Influence | 0.675       | 0.817    | 0.615     | 0.696   | 0.682    | 0.562  | 0.991  | 0.989
Square   | Concept   | 0.595       | 0.532    | 0.562     | 0.651   | 0.628    | -0.156 | -0.080 | -0.072
Stripe   | Heatmap   | 0.996       | 0.993    | 0.997     | 0.998   | 0.994    | 0.644  | 0.810  | 0.807
Stripe   | Influence | 0.867       | 0.922    | 0.844     | 0.903   | 0.861    | 0.561  | 0.991  | 0.980
Stripe   | Concept   | 0.720       | 0.653    | 0.636     | 0.783   | 0.795    | -0.152 | -0.075 | -0.070
Noise    | Heatmap   | 0.987       | 1.00     | 0.984     | 0.994   | 0.984    | 0.673  | 0.846  | 0.847
Noise    | Influence | 0.703       | 0.908    | 0.645     | 0.887   | 0.627    | 0.566  | 0.970  | 0.966
Noise    | Concept   | 0.569       | 0.600    | 0.566     | 0.587   | 0.555    | -0.150 | -0.074 | -0.074

We observe Explainer Divergence Scores significantly above the 0.5 theoretical baseline for all explainers and spurious artifacts in both datasets. This indicates that all of our explainers succeed in capturing information about the model's spuriousness in both tasks. The key advantage of EDS over previous work is that we can now directly compare the performance of explainers for detecting model spuriousness for the specified task and spurious artifact, and we interpret our results with this aim in mind.

We find the strongest performance for heatmaps and influential examples. EDS was highest for images in the spurious class without the spurious artifact, lowest for images in non-spurious classes without the spurious artifact, and somewhat high for images with the spurious artifact regardless of class. These findings appear consistent across all three of our chosen spurious artifacts, and in both datasets. We notice a sharp drop in performance for our Gaussian noise spurious artifact compared to the more localized spurious artifacts.

Concept extraction consistently performs worse than the other two explainers, operating well only on images with the explicit presence of the spurious artifact. This follows our expectations: we would expect concept extraction to more effectively identify the presence of the spurious signal concept from the activations of spurious models than from the activations of non-spurious models that have learned to become invariant to it. Moreover, the dimensionality of the concept predictions is much lower than that of the explanations from the other two explainers, limiting their expressiveness. Interestingly, while performance on images without the spurious artifact is poor, it is still above our 0.5 theoretical baseline, despite there being no obvious reason for concept predictions to shift between spurious and non-spurious models. This is further discussed in Section 4.5.

We notice significant differences in explainer performance between the dSprites and 3dshapes datasets. While in dSprites we find slightly higher performance for influential examples over heatmaps, in 3dshapes we find significantly stronger performance for heatmaps across all experiments. In 3dshapes, our EDS binary classifier often identifies the spuriousness of a given model from a heatmap with 100% accuracy. This is in sharp contrast to the other two explainers, which perform worse with 3dshapes, performing only comparably using the 'stripe' spurious artifact. Despite this diminished performance, both influential examples and concept extraction still perform above our 0.5 theoretical baseline for EDS.
====4.5. Discussion====

These results favour heatmaps and influential examples, which are very effective at detecting model spuriousness in both experiments with real explainer methods. Conversely, concept extraction consistently performed the worst, and is only useful on images in which the spurious artifact is present. As expected, performance is sensitive to the dataset and specified task.

We conduct further experiments to confirm that Explainer Divergence Scores are robust to our choice of optimization procedure and model architecture. These are expanded upon in the Supplementary Material (Section 6).

In both datasets, we observe EDS performance significantly above the 0.5 theoretical baseline for all explainers, spurious artifacts, and subclasses. Notably, this is seen even with images from unrelated, non-spurious classes without the presence of the spurious artifact.

This has interesting implications for the utility of post-hoc explainers in detecting model spuriousness. For example, heatmaps generated from 3dshapes images in non-spurious classes without the spurious artifact do not show any obvious signal that a human could use to identify that the respective model has some sort of spurious dependency. Yet a trained classifier with sufficient prior knowledge can diagnose whether the model depends upon a spurious signal with extremely high certainty. Information indicating spuriousness present in our explanations may not always be perceptible by a human observer, and identifying ways to extract or isolate this information may prove useful in designing more effective explainers.

This is particularly evident in the case of concept extraction, where there is no clear hypothesis for why spurious and non-spurious models would have differing information about the underlying concepts in images from a non-spurious class without the spurious artifact. This suggests that the presence of a spurious correlation can affect a model's ability to extract features in entirely unrelated image classes.

===5. Conclusion===

We present Explainer Divergence Scores, a novel method for evaluating post-hoc explainers for the purpose of detecting unknown spurious correlations.

Across three experiments we show EDS's superior capabilities over state-of-the-art post-hoc explainer evaluation methods. EDS provides an interpretable estimate of the amount of information an explainer can capture about a DNN's dependence on an unknown spurious signal. Moreover, EDS allows direct comparisons between different types of explainers, unlike previous methods, letting us quantitatively identify and evaluate the best explainer for a given dataset and spurious signal.

In contrast to previous work [12], our results reveal that commonly used post-hoc explainers contain a substantial amount of information about a model's dependence on unknown spurious signals. This information is often unidentifiable by human observers, yet it can be used by a well-trained classifier to detect dependencies even on images seemingly unrelated to the spurious signal. Our findings suggest that future research into post-hoc explanations should focus on identifying and utilizing this unseen information.
===6. Supplementary Material===

Additional information about our work, including a more detailed mathematical justification, ancillary experiments, and standard error estimates for all our results, is detailed in the Appendix available at this link.

===References===

[1] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163. doi:10.1089/big.2016.0047.
[2] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, 2017. arXiv:1702.08608.
[3] A. Datta, M. C. Tschantz, A. Datta, Automated experiments on ad privacy settings, Proc. Priv. Enhancing Technol. 2015 (2015) 92–112. doi:10.1515/popets-2015-0007.
[4] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on Fairness, Accountability and Transparency (FAT 2018), PMLR 81, 2018, pp. 77–91.
[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K. Müller, Unmasking Clever Hans predictors and assessing what machines really learn, CoRR abs/1902.10178 (2019). arXiv:1902.10178.
[6] R. Geirhos, J. Jacobsen, C. Michaelis, R. S. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nat. Mach. Intell. 2 (2020) 665–673. doi:10.1038/s42256-020-00257-z.
[7] S. Sagawa, A. Raghunathan, P. W. Koh, P. Liang, An investigation of why overparameterization exacerbates spurious correlations, in: ICML 2020, PMLR 119, 2020, pp. 8346–8356.
[8] U. Mahmood, R. Shrestha, D. Bates, L. Mannelli, G. Corrias, Y. Erdi, C. Kanan, Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems, Frontiers in Digital Health 3 (2021). doi:10.3389/fdgth.2021.671015.
[9] J. Adebayo, M. Muelly, I. Liccardi, B. Kim, Debugging tests for model explanations, in: NeurIPS 2020.
[10] X. Han, B. C. Wallace, Y. Tsvetkov, Explaining black box predictions and unveiling data artifacts through influence functions, in: ACL 2020, pp. 5553–5563. doi:10.18653/v1/2020.acl-main.492.
[11] B. Dimanov, Interpretable Deep Learning: Beyond Feature-Importance with Concept-based Explanations, Ph.D. thesis, University of Cambridge, 2021.
[12] J. Adebayo, M. Muelly, H. Abelson, B. Kim, Post hoc explanations may be ineffective for detecting unknown spurious correlation, in: ICLR 2022. https://openreview.net/forum?id=xNOVfCCvDpM.
[13] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: ICML 2017, PMLR 70, 2017, pp. 3319–3328.
[14] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: KDD 2016, ACM, 2016, pp. 1135–1144. doi:10.1145/2939672.2939778.
[15] G. Pruthi, F. Liu, S. Kale, M. Sundararajan, Estimating training data influence by tracing gradient descent, in: NeurIPS 2020.
[16] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, A. Weller, Now you see me (CME): Concept-based model extraction, in: CIKM 2020 Workshops, CEUR Workshop Proceedings 2699, 2020. http://ceur-ws.org/Vol-2699/paper02.pdf.
[17] L. Matthey, I. Higgins, D. Hassabis, A. Lerchner, dSprites: Disentanglement testing sprites dataset, https://github.com/deepmind/dsprites-dataset/, 2017.
[18] C. Burgess, H. Kim, 3D shapes dataset, https://github.com/deepmind/3dshapes-dataset/, 2018.
[19] C. Zhou, X. Ma, P. Michel, G. Neubig, Examining and combating spurious features under distribution shift, in: ICML 2021, PMLR 139, 2021, pp. 12857–12867.
[20] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks, in: ICLR 2020.
[21] M. Arjovsky, L. Bottou, I. Gulrajani, D. Lopez-Paz, Invariant risk minimization, CoRR abs/1907.02893 (2019). arXiv:1907.02893.
[22] L. Moneda, Spurious correlation machine learning and causality, blog post at lgmoneda.github.io, 2021.
[23] K. Leino, M. Fredrikson, Stolen memories: Leveraging model memorization for calibrated white-box membership inference, in: USENIX Security 2020, pp. 1605–1622.
[24] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2021) 115:1–115:35. doi:10.1145/3457607.
[25] X. Chen, C. Liu, B. Li, K. Lu, D. Song, Targeted backdoor attacks on deep learning systems using data poisoning, CoRR abs/1712.05526 (2017). arXiv:1712.05526.
[26] K. Y. Xiao, L. Engstrom, A. Ilyas, A. Madry, Noise or signal: The role of image backgrounds in object recognition, in: ICLR 2021.
[27] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, D. Batra, Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization, CoRR abs/1610.02391 (2016). arXiv:1610.02391.
[28] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML 2017, PMLR 70, 2017, pp. 1885–1894.
[29] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: ICML 2018, PMLR 80, 2018, pp. 2673–2682.
[30] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, MEME: Generating RNN model explanations via model extraction, CoRR abs/2012.06954 (2020). arXiv:2012.06954.
[31] L. C. Magister, D. Kazhdan, V. Singh, P. Liò, GCExplainer: Human-in-the-loop concept-based explanations for graph neural networks, CoRR abs/2107.11889 (2021). arXiv:2107.11889.
[32] P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, B. Kim, The (un)reliability of saliency methods, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, LNCS 11700, Springer, 2019, pp. 267–280. doi:10.1007/978-3-030-28954-6_14.
[33] B. Dimanov, U. Bhatt, M. Jamnik, A. Weller, You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods, in: SafeAI@AAAI 2020, CEUR Workshop Proceedings 2560, 2020, pp. 63–73. http://ceur-ws.org/Vol-2560/paper8.pdf.
[34] A. Ghorbani, A. Abid, J. Y. Zou, Interpretation of neural networks is fragile, CoRR abs/1710.10547 (2017). arXiv:1710.10547.
[35] S. Basu, P. Pope, S. Feizi, Influence functions in deep learning are fragile, in: ICLR 2021.
[36] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, O. Bachem, Challenging common assumptions in the unsupervised learning of disentangled representations, in: Reproducibility in Machine Learning Workshop, ICLR 2019.
[37] Y. Zhou, S. Booth, M. T. Ribeiro, J. Shah, Do feature attribution methods correctly attribute features?, in: AAAI 2022, pp. 9623–9633.
[38] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, CoRR abs/1806.08049 (2018). arXiv:1806.08049.
[39] J. Dai, S. Upadhyay, U. Aïvodji, S. H. Bach, H. Lakkaraju, Fairness via explanation quality: Evaluating disparities in the quality of post hoc explanations, in: AIES 2022, ACM, 2022, pp. 203–214. doi:10.1145/3514094.3534159.
[40] J. Zhou, A. H. Gandomi, F. Chen, A. Holzinger, Evaluating the quality of machine learning explanations: A survey on methods and metrics, Electronics 10 (2021). doi:10.3390/electronics10050593.
[41] C. Agarwal, E. Saxena, S. Krishna, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju, OpenXAI: Towards a transparent evaluation of model explanations, CoRR abs/2206.11104 (2022). arXiv:2206.11104.
[42] Y. Liu, S. Khandagale, C. White, W. Neiswanger, Synthetic benchmarks for scientific research in explainable machine learning, in: NeurIPS Datasets and Benchmarks Track, 2021.
[43] M. E. Zarlenga, P. Barbiero, Z. Shams, D. Kazhdan, U. Bhatt, M. Jamnik, On the quality assurance of concept-based representations, 2022. https://openreview.net/forum?id=Ehhk6jyas6v.
[44] Y. Yang, K. Chaudhuri, Understanding rare spurious correlations in neural networks, CoRR abs/2202.05189 (2022). arXiv:2202.05189.
[45] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Transactions on Communication Technology 15 (1967) 52–60. doi:10.1109/TCOM.1967.1089532.
[46] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, A. Weller, Now you see me (CME): Concept-based model extraction, in: CIKM 2020 Workshops, CEUR Workshop Proceedings 2699, 2020. http://ceur-ws.org/Vol-2699/paper02.pdf.