=Paper=
{{Paper
|id=Vol-3318/paper7
|storemode=property
|title=Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations
|pdfUrl=https://ceur-ws.org/Vol-3318/paper7.pdf
|volume=Vol-3318
|authors=Shea Cardozo,Gabriel Islas Montero,Dmitry Kazhdan,Botty Dimanov,Maleakhi Wijaya,Mateja Jamnik,Pietro Lio
|dblpUrl=https://dblp.org/rec/conf/cikm/CardozoMKDWJL22
}}
==Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations==
Shea Cardozo (Tenyks, University of Toronto; equal contribution), Gabriel Islas Montero (Tenyks, University of Toronto; equal contribution), Dmitry Kazhdan (Tenyks, University of Cambridge), Botty Dimanov (Tenyks), Maleakhi Wijaya (Tenyks), Mateja Jamnik (University of Cambridge), Pietro Lio (University of Cambridge)

AIMLAI @ CIKM'22: Advances in Interpretable Machine Learning and Artificial Intelligence, October 21, 2022, Atlanta, GA.
Contact: shea.cardozo@tenyks.ai (S. Cardozo); gabriel.montero@tenyks.ai (G. I. Montero); dmitry.kazhdan@tenyks.ai (D. Kazhdan); botty.dimanov@tenyks.ai (B. Dimanov); maleakhi.wijaya@tenyks.ai (M. Wijaya); mateja.jamnik@cl.cam.ac.uk (M. Jamnik); pl219@cam.ac.uk (P. Lio).
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

'''Abstract.''' Recent work has suggested post-hoc explainers might be ineffective for detecting spurious correlations in Deep Neural Networks (DNNs). However, we show there are serious weaknesses with the existing evaluation frameworks for this setting. Previously proposed metrics are extremely difficult to interpret and are not directly comparable between explainer methods. To alleviate these constraints, we propose a new evaluation methodology, Explainer Divergence Scores (EDS), grounded in an information theory approach to evaluating explainers. EDS is easy to interpret and naturally comparable across explainers. We use our methodology to compare the detection performance of three different explainers (feature attribution methods, influential examples and concept extraction) on two different image datasets. We discover post-hoc explainers often contain substantial information about a DNN's dependence on spurious artifacts, but in ways often imperceptible to human users. This suggests the need for new techniques that can use this information to better detect a DNN's reliance on spurious correlations.

'''Keywords.''' explainability, interpretability, XAI, spurious correlations, explainer evaluation, post-hoc explanations, shortcut learning

===1. Introduction===

Spurious correlations pose a serious risk to the application of Deep Neural Networks (DNNs), especially in critical applications such as medical imaging and security [1, 2, 3, 4]. This phenomenon, also known as shortcut learning or the Clever Hans effect, results from DNNs' tendency to overfit to subtle patterns that are difficult for a human user to identify, causing trained models to form decision rules that fail to generalise [5, 6, 7, 8].

Consequently, detecting a model's dependency on a spurious signal (or 'model spuriousness') in computer vision tasks has become an active area of research. Explainable AI (XAI) methods have been proposed as a potential avenue to address this challenge [5, 6, 9, 10]. One of these methods, post-hoc explanations, aims to describe the inference process of a pre-trained DNN in a human-interpretable manner [2, 11].

Past work has suggested human users may struggle to use post-hoc explanations to detect spurious signals if said spurious signal is not known ahead of time [12]. In this work, we ask deeper questions: Do post-hoc explanations contain any information that can be used to detect spurious signals even if the signal is not known ahead of time? If so, can we quantify and compare the amount of information different post-hoc explainers can extract?

In particular, we make the following contributions:

* We propose Explainer Divergence Scores (EDS): a novel way to evaluate a post-hoc explainer's ability to detect spurious correlations, based on an information theory foundation.
* We show our method's effectiveness by evaluating and comparing three different types of post-hoc explainers (feature attribution methods [13, 14], influential examples [15], and concept extraction [16]) across multiple datasets [17, 18] and spurious artifacts.
* We compare the amount of information regarding the presence of a spurious signal between different post-hoc explainers, which existing approaches fail to address, and discover that post-hoc explainers contain a significant amount of information on model spuriousness. Since this information is frequently not visible to human users, our findings suggest that future research into post-hoc explanations should focus on discovering and utilising this information.

Figure 1: Explainer Divergence Score (EDS). (a) With one engineered spurious dataset and one clean dataset, (b) we train two separate classification models. (c) These models evaluate different combinations of spurious and non-spurious examples. (d) EDS can assess a post-hoc explainer's ability to detect spurious correlations. (e) In comparison to previous work, our approach allows us to compare the performance of different types of post-hoc explainers directly.

===2. Related Work===

'''Spurious Correlations.''' Spurious correlations in DNNs have been the subject of an increasingly diverse body of work, with contributors analysing them through the lenses of distribution shift [19, 20], shortcut learning [6] and causal inference [21, 22]. Spurious correlations have raised issues in areas as diverse as privacy [23], fairness [24], and adversarial attacks [25]. Recent work has focused on identifying where spurious correlations manifest and on their properties, finding they often appear in practical settings [5, 6, 26, 7, 8].

'''Post-Hoc Explainers.''' Post-hoc explainability methods generate explanations of the inference process of an arbitrary trained DNN, and numerous post-hoc explainers have been proposed. Feature attribution methods, or 'heatmaps' in computer vision domains, measure the effect of each individual input (e.g., pixel) on the output of a DNN by leveraging either input perturbation [14] or gradient information [13, 27]. Influential example (or influence function) methods [28, 15] instead quantify the effect of specific training examples on a given output. Concept extraction methods [29, 16] seek to measure a DNN's reliance on a set of understandable concepts; these methods are naturally interpretable and extendable to many DNN architectures [30, 31].

Recent work has called into question the effectiveness of post-hoc explainers in both adversarial [32, 33] and non-adversarial [34, 35, 36, 37] settings. Given these deficiencies and their widespread usage, systematic methods of comparing and evaluating post-hoc explanations have become increasingly needed.

'''Evaluating Explainers.''' There is no generally agreed method for comparing and evaluating post-hoc explainers. The majority of previous work has focused on feature attribution methods, proposing metrics that measure desirable qualities of the attribution method [38, 39, 40]. These metrics often rely on semi-synthetic datasets containing 'ground truth' explanations that correspond to the presence of known spurious signals [41, 42].
Metrics for other explainers remain limited, with few exceptions [43], and human trials are often still the only viable approach.

The closest work to ours is Adebayo et al. [12], which formulates a paradigm for evaluating DNN explainers for the purpose of identifying spurious correlations. Similarly to our work, they focus on analysing spurious correlations in settings where the spurious signal is not known ahead of time, by comparing explainers from spurious and non-spurious models. However, their framework does not allow for direct comparison between different types of explainers, as their proposed quantities have different units for different types of explainers. To the best of our knowledge, this work is the only presentation of a method for evaluating a post-hoc explainer's ability to detect spurious correlations that is comparable across all types of explainers while remaining focused on the context where the spurious signal is unknown.

===3. Explainer Divergence Score===

We motivate our approach by considering the setting where a user seeks to determine whether a given model depends on a spurious signal using a post-hoc explainer. They inspect an explanation generated from a model prediction and use it to predict whether the model is spurious or not. Similarly to Adebayo et al. [12], we expect a high-quality explainer to generate very different explanations from spurious models compared to non-spurious models.

This can be framed as a binary classification problem, where a classifier outputs a binary label corresponding to a prediction of a model's dependence on a spurious signal, based upon an explanation as input. The classifier under this formulation is a machine learning model that takes the place of the user, and is trained to distinguish between explanations generated by spurious models and explanations generated by non-spurious models. A visual summary of our approach can be found in Figure 1.

Critically, the classifier is trained to distinguish between explanations generated by all spurious and non-spurious models produced by a specified training strategy, instead of any individual pair. This allows the classifier to generalize to unseen models, much like a human user would be expected to. We detail how we accomplish this in Section 4.1.

EDS is defined as the performance of this binary classifier in predicting model spuriousness on explanations generated using unseen models, and can be interpreted as a measure of explainer quality.

We can view our trained binary classifier's loss as an estimate of the distance between the distributions of explanations from spurious and non-spurious models respectively. Assume we have a trained binary classifier $f_{\hat{\theta}}$ parameterized by $\theta \in \Theta$. We train this classifier by minimizing the loss $\ell$ consisting of the cross-entropy $H$ between the distribution represented by the output of the model and $Y|x$, that is, the distribution $Y$ conditioned on the random variable $x$, where $Y$ is the Bernoulli distribution of binary labels of a given explanation in our training set (whether the model that generated it is spurious or non-spurious) and $x$ is distributed according to the mixture distribution $X$ of equally weighted explanations from both the spurious and non-spurious models. We then have:

$$\ell(f_{\hat{\theta}}) = \min_{\theta \in \Theta} \mathbb{E}_{x \sim X}\left[ H\big(Y|x,\; f_{\theta}(x)\big) \right] \tag{1}$$
$$= 1 - D_{\mathrm{JS}}\big(X|y{=}0,\; X|y{=}1\big) \tag{2}$$
$$\quad + \min_{\theta \in \Theta} \mathbb{E}_{x \sim X}\left[ D_{\mathrm{KL}}\big(Y|x \,\|\, f_{\theta}(x)\big) \right] \tag{3}$$

where $D_{\mathrm{JS}}$ denotes the Jensen-Shannon divergence, $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, and all quantities are measured in bits of entropy. The full derivation of this expression is presented in the Supplementary Material (Section 6). Ideally, for a well-trained classifier $f_{\hat{\theta}}$ of sufficient expressiveness, we would expect the distribution represented by the output of our classifier $f_{\hat{\theta}}(x)$ to approximate the true distribution of $Y|x$, meaning the Kullback-Leibler divergence between them is close to 0:

$$\mathbb{E}_{x \sim X}\left[ D_{\mathrm{KL}}\big(Y|x \,\|\, f_{\hat{\theta}}(x)\big) \right] \simeq 0 \tag{4}$$

And thus:

$$\ell(f_{\hat{\theta}}) \simeq 1 - D_{\mathrm{JS}}\big(X|y{=}0,\; X|y{=}1\big) \tag{5}$$

In this case the loss of our trained model can be seen as approximating the Jensen-Shannon divergence between the distribution of explanations generated by spurious models and the distribution of explanations generated by non-spurious models. Moreover, as all quantities share the same unit (information), they are directly comparable across explainers.
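To make Equation (5) concrete, the short numerical sketch below (our own illustration, not code from the paper) checks the identity for a toy pair of discrete 'explanation' distributions: the expected cross-entropy of the Bayes-optimal spurious-vs-non-spurious classifier equals one minus the Jensen-Shannon divergence of the two explanation distributions, with everything measured in bits.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

# Toy distributions over a small discrete "explanation" alphabet:
# p_ns = explanations from non-spurious models (y = 0),
# p_s  = explanations from spurious models (y = 1).
p_ns = np.array([0.70, 0.20, 0.05, 0.05])
p_s = np.array([0.10, 0.30, 0.40, 0.20])

# Equally weighted mixture X and its Jensen-Shannon divergence in bits.
m = 0.5 * (p_ns + p_s)
d_js = 0.5 * kl_bits(p_ns, m) + 0.5 * kl_bits(p_s, m)

# The best possible classifier outputs the true posterior P(y = 1 | x),
# so its expected cross-entropy loss is the conditional entropy H(Y | X).
posterior = 0.5 * p_s / m
optimal_loss = sum(m_x * entropy_bits([1 - q, q]) for m_x, q in zip(m, posterior))

print(f"1 - D_JS     = {1 - d_js:.6f} bits")
print(f"optimal loss = {optimal_loss:.6f} bits")  # matches Equation (5)
```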
In practice, sufficient classifier accuracy for Equation (4) to hold appears to be uncommon, leading to an average loss that is unbounded above and difficult to estimate. Hence we define EDS as the classification accuracy of the binary classifier instead. This has the added advantage of providing an interpretable baseline for our metric: if the classifier cannot do better than random guessing (an EDS of 0.5), then it has failed to capture any information in the explanations useful for determining model spuriousness, and there is thus a very low likelihood that the explainer captures any information about the spurious signal.

===4. Experiments===

Using a similar setup to Adebayo et al. [12] and Yang and Chaudhuri [44], we investigated three different types of spurious artifacts:

* Square - a small square in the top left corner of the image
* Stripe - a vertical stripe 9 pixels from the left of the image
* Noise - uniform Gaussian noise applied to every pixel value of the image

Examples of each spurious artifact on both the dSprites and 3dshapes datasets are present in the Supplementary Material (Section 6); a schematic sketch of how such artifacts can be injected is shown below.

We experiment to determine the effect of the intensity of each spurious artifact on a model's spurious behaviour, and then train models to maximize this spuriousness. The details of this experiment and of the overall model training procedure can be found in the Supplementary Material (Section 6).
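For illustration only, the following minimal NumPy sketch shows how artifacts of these three kinds could be injected into a 64x64 greyscale image. The square size, stripe width, and noise scale are our own assumptions; the paper defers the exact artifact parameters and intensities to its Supplementary Material.

```python
import numpy as np

def add_square(img, size=4, value=1.0):
    """'Square' artifact: a small patch in the top-left corner (patch size assumed)."""
    out = img.copy()
    out[:size, :size] = value
    return out

def add_stripe(img, column=9, value=1.0):
    """'Stripe' artifact: a vertical stripe 9 pixels from the left (1-pixel width assumed)."""
    out = img.copy()
    out[:, column] = value
    return out

def add_noise(img, sigma=0.1, rng=None):
    """'Noise' artifact: Gaussian noise added to every pixel value (scale assumed)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

# Inject each artifact into a blank 64x64 dSprites-style image.
img = np.zeros((64, 64))
spurious_variants = {
    "square": add_square(img),
    "stripe": add_stripe(img),
    "noise": add_noise(img),
}
print({name: float(v.sum()) for name, v in spurious_variants.items()})
```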
====4.1. EDS Experimental Setup====

For all datasets and explainers, we evaluate the Explainer Divergence Score (EDS) as follows. We split the dataset into three partitions: an 80% partition used for model training, a 14% partition used for binary classifier training, and a 6% partition used for validation.

Recall that in Section 3 we defined EDS using a binary classifier trained to distinguish between explanations generated across all spurious and non-spurious models. Training a new model for every explanation is far too computationally intensive. To rectify this, for each spurious artifact we train 100 spurious and 100 non-spurious models on our model training dataset partition, using different weight initializations, and use this sample as an estimate of the complete distributions of trained spurious and non-spurious models respectively. We train models and ensure they are spurious or non-spurious using the procedure detailed in the Supplementary Material (Section 6).

We reserve 30 spurious and 30 non-spurious models for validation and use the remaining 70 of each set to generate training data for our binary classifier. Images from the respective dataset partition are combined with a randomly selected model to generate an explanation, as well as a binary class label corresponding to whether the model came from the spurious or non-spurious set. A classifier is then trained on this data to predict this class label from the explanations.

Finally, our remaining 30 spurious and 30 non-spurious models are combined with the validation dataset partition to generate explanations in the same fashion as in training. The label prediction accuracy of the binary classifier on this set is then our estimate of the Explainer Divergence Score of the given explainer for this spurious signal. Further experimental setup details are noted in the Supplementary Material (Section 6).
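To show the data flow of this pipeline end to end, here is a self-contained sketch under stated assumptions: the explanations are replaced by synthetic feature vectors and the binary classifier by logistic regression, since the real explainers and classifier architecture are specified in the paper's Supplementary Material. The 70/30 split of spurious and non-spurious models mirrors the setup above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EXPL_DIM, EXPLS_PER_MODEL = 32, 50   # explanation size and images per model (assumed)

def explanations_for_model(spurious):
    """Stand-in for running a real explainer on one trained model; spurious models
    get a small mean shift so that the two explanation pools differ slightly."""
    shift = 0.3 if spurious else 0.0
    return rng.normal(shift, 1.0, size=(EXPLS_PER_MODEL, EXPL_DIM))

def explanation_pool(spurious, n_models):
    X = np.vstack([explanations_for_model(spurious) for _ in range(n_models)])
    y = np.full(len(X), int(spurious))
    return X, y

# Explanations from 70 spurious + 70 non-spurious models train the classifier;
# 30 + 30 held-out models provide the validation explanations.
Xs_tr, ys_tr = explanation_pool(True, 70)
Xn_tr, yn_tr = explanation_pool(False, 70)
Xs_va, ys_va = explanation_pool(True, 30)
Xn_va, yn_va = explanation_pool(False, 30)

clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([Xs_tr, Xn_tr]), np.concatenate([ys_tr, yn_tr]))

# EDS = accuracy of the spuriousness classifier on explanations from unseen models.
eds = clf.score(np.vstack([Xs_va, Xn_va]), np.concatenate([ys_va, yn_va]))
print(f"EDS estimate: {eds:.3f} (0.5 = no usable information)")
```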
====4.2. Subclass Definitions====

For all EDS results we display accuracy not just over the entire dataset (noted as 'Overall' in figures), but also subdivided by the task label and the presence of the spurious artifact in the image. There are four subclasses in total:

* Images from the Spurious Class without the Spurious Artifact (abbreviated as 'S/NA' in figures)
* Images from Non-Spurious Classes without the Spurious Artifact ('NS/NA')
* Images from the Spurious Class with the Spurious Artifact ('S/A')
* Images from Non-Spurious Classes with the Spurious Artifact ('NS/A')

For example, say we had a class consisting of images of 'circles' and another of images of 'squares', and we trained a classification model between the two where we injected spurious Gaussian noise into the 'circles' class. 'S/NA' would correspond to images of circles without Gaussian noise, 'NS/NA' to images of squares without Gaussian noise, 'S/A' to images of circles with Gaussian noise, and 'NS/A' to images of squares with Gaussian noise.

This subdivision allows us to interpret the type of images the explainer can effectively use to determine model spuriousness. It is analogous to what is done in Adebayo et al. [12] via the 'Cause-for-Concern Metric' (CCM) and 'False Alarm Metric' (FAM), which split results by whether the spurious artifact is present in the image, but we present results in even finer detail with added class information.

====4.3. Synthetic Explainer Comparison====

We compare our EDS method to the approach in Adebayo et al. [12], starting with a simple example. We consider a toy classification task with two simple classes (the dSprites classes 'heart' and 'oval') with the 'stripe' spurious artifact injected. Instead of using a specific spurious detection method, we construct synthetic explainers that represent the expected behaviour of each method under ideal circumstances. We construct these 'ideal' explainers as follows:

* Heatmaps - for the spurious model, the explainer places all emphasis on the stripe for all images where it is present, and on the area where the stripe would be for images from the spurious class without the stripe. The explainer puts all emphasis on the shape in all cases for the non-spurious model, and for the spurious model on non-spurious classes without the stripe.
* Influential Examples - for the spurious model, the explainer selects influential examples of the spurious class with the stripe for all images, unless the image is from a non-spurious class without the stripe. For the non-spurious model, the explainer always selects examples of the correct class with the correct presence of the spurious artifact.
* Concept Extraction - we specify two binary concepts, one for the class label and one for the presence of the spurious artifact. We assume the spurious model can detect both perfectly, and thus extracts both accurately. The non-spurious model, on the other hand, is invariant to the spurious artifact in all circumstances, and thus always extracts that it is not present.

In addition to these ideal explainers, we also create 'noisy' variants where we inject noise across every explanation, as well as purely random variants where the corresponding explanations consist purely of noise. For heatmaps we inject uniform Gaussian noise into the heatmap, for influential examples we specify a chance (100% in the purely random variant) of randomly selecting a training image, and for concept extraction we specify a chance (100% in the purely random variant) of predicting a random concept label.

We evaluate both our EDS and the KSSD, CCM and FAM metrics [12] on these examples. For these metrics we specify similarity functions as follows: for heatmaps we use the SSIM similarity function as specified in [12]; for influential examples we use the Bhattacharyya coefficient [45] between the distributions of the class labels and the presence of a spurious artifact in the influential examples; and for concept extraction we use the negative of the L2 distance between concept labels as a 'similarity' function. The results are shown in Table 1.

Table 1: Results of evaluating EDS on the specified synthetic explainers, averaged over 5 runs. We observe the clear outperformance of heatmaps and influential examples over concept extraction, as well as the complete failure of the 'random' explainers, in the EDS results; however, these are not visible in the KSSD, CCM and FAM metric results [12]. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.03.

Explainer          | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Heatmaps (Ideal)   | 0.823       | 0.943    | 0.492     | 0.959   | 0.896    | 1.000  | 0.991  | 0.990
Heatmaps (Noisy)   | 0.702       | 0.771    | 0.459     | 0.805   | 0.771    | 0.970  | 0.993  | 0.993
Heatmaps (Random)  | 0.512       | 0.518    | 0.510     | 0.537   | 0.484    | 0.965  | 0.996  | 0.996
Influence (Ideal)  | 0.824       | 0.955    | 0.496     | 0.949   | 0.896    | 1.000  | 0.500  | 0.000
Influence (Noisy)  | 0.668       | 0.734    | 0.490     | 0.736   | 0.713    | 0.587  | 0.712  | 0.656
Influence (Random) | 0.515       | 0.510    | 0.516     | 0.520   | 0.514    | 0.000  | 0.915  | 0.927
Concept (Ideal)    | 0.750       | 0.500    | 0.500     | 1.000   | 1.000    | 0.000  | 0.000  | -0.500
Concept (Noisy)    | 0.617       | 0.488    | 0.535     | 0.723   | 0.721    | -0.284 | -0.412 | -0.514
Concept (Random)   | 0.491       | 0.490    | 0.475     | 0.527   | 0.473    | -0.494 | -0.481 | -0.509
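For reference, here is a minimal sketch of the two non-SSIM similarity functions named above, under our own assumption about how the (class label, artifact present) distribution of the selected influential examples is binned, which the paper does not spell out; for heatmaps the SSIM function from [12] is used instead.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient between two discrete distributions [45]:
    1.0 for identical distributions, 0.0 for distributions with disjoint support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

def concept_similarity(c1, c2):
    """Negative L2 distance between concept label vectors, used as a 'similarity'."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    return -float(np.linalg.norm(c1 - c2))

# Influential examples: distributions over the four (class, artifact) subclasses
# of the examples selected for a spurious model and for a non-spurious model.
selected_by_spurious = np.array([0.70, 0.10, 0.10, 0.10])
selected_by_clean = np.array([0.25, 0.25, 0.25, 0.25])
print(bhattacharyya_coefficient(selected_by_spurious, selected_by_clean))  # ~0.89

# Concept extraction: (class label, artifact present) predictions from two models.
print(concept_similarity([1, 1], [1, 0]))  # -1.0
```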
Synthetic 'ideal' explainers are useful because we can specify in advance exactly how our explainers should behave for the spurious and non-spurious models in each subclass. Cases where the explanations generated from spurious and non-spurious models are drawn from the same distribution should result in the worst possible metrics; conversely, cases where the explanations are always radically different should result in close to perfect metrics.

This is exactly what we observe with EDS. Our approach finds that the ideal heatmap and influential example explainers almost perfectly identify model spuriousness, failing only on explanations generated from images from a non-spurious class without the spurious artifact. The ideal concept extraction explainer additionally falls short on images from the spurious class without the spurious artifact, indicating that this specification is a worse explainer for detecting spurious correlations than the competing methods.

We observe that the KSSD, CCM and FAM metrics from Adebayo et al. [12] fall short in this type of analysis: different types of explainers use different similarity functions with different units that are not directly comparable. This is a major innovation of our method over the existing state of the art.

Our method comes to the expected conclusion that the ideal explainers capture more information about model spuriousness than the noisy explainers, while the random explainers completely fail to capture any information about model spuriousness. This declining performance can also be seen in the KSSD, CCM and FAM metrics, but the utter failure of the random explainers is not visible with these metrics. With EDS, if the trained classifier fails to achieve at least 50% accuracy, we can interpret the explainer as having no information about the model's spuriousness. This is not possible using the KSSD, CCM and FAM metrics without explicitly running a baseline for every type of explainer evaluated.

====4.4. Real Explainer Comparison====

To test EDS on real explainer methods, we conduct experiments on reduced versions of both the dSprites [17] and 3dshapes [18] datasets. We train models to perform a shape classification task and arbitrarily select one class to be the spurious class for each experiment.

We chose some commonly used methods as representatives for each explainer type of interest. We use Integrated Gradients [13] as our chosen feature attribution method. For influential examples we use the TracInCP method [15], and for concept extraction we use Concept Model Extraction (CME) [46]. Examples of each explanation on images from the dSprites dataset are present in the Supplementary Material (Section 6).
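As an illustration of how the feature attribution explanations could be produced for the EDS classifier, here is a sketch using Captum's Integrated Gradients on a stand-in CNN; the actual model architecture, IG settings, and preprocessing are described in the paper's Supplementary Material, so everything beyond the library call itself is an assumption.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients  # pip install captum

# Stand-in CNN for 64x64 single-channel dSprites-style images (architecture assumed).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
    nn.Flatten(), nn.Linear(8 * 16 * 16, 2),
).eval()

images = torch.rand(4, 1, 64, 64)  # a batch of (possibly artifact-carrying) images
ig = IntegratedGradients(model)
heatmaps = ig.attribute(images, target=0, n_steps=32)  # per-pixel attributions

# Flattened heatmaps become inputs to the EDS binary classifier, labelled by
# whether `model` was drawn from the spurious or the non-spurious pool.
explanations = heatmaps.flatten(start_dim=1).detach().numpy()
print(explanations.shape)  # (4, 4096)
```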
More detailed information about the configuration setup for each experiment is present in the Supplementary Material (Section 6).

We display results for dSprites in Table 2 and results for 3dshapes in Table 3. For comparison, we also evaluate the KSSD, CCM, and FAM metrics formulated in Adebayo et al. [12] on both dSprites and 3dshapes.

Table 2: Results of evaluating EDS on the dSprites dataset, averaged over 5 runs. We observe the outperformance of influential examples over heatmaps and concept extraction, visible in the EDS results but not in the comparative metrics. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.04.

Artifact | Explainer | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Square   | Heatmap   | 0.799       | 0.837    | 0.590     | 0.902   | 0.916    | 0.851  | 0.877  | 0.837
Square   | Influence | 0.887       | 0.937    | 0.860     | 0.891   | 0.887    | 0.562  | 0.991  | 0.989
Square   | Concept   | 0.715       | 0.645    | 0.578     | 0.827   | 0.831    | -0.062 | -0.074 | -0.076
Stripe   | Heatmap   | 0.831       | 0.901    | 0.689     | 0.958   | 0.870    | 0.877  | 0.880  | 0.878
Stripe   | Influence | 0.881       | 0.892    | 0.829     | 0.909   | 0.913    | 0.561  | 0.991  | 0.980
Stripe   | Concept   | 0.707       | 0.618    | 0.596     | 0.788   | 0.815    | -0.061 | -0.074 | -0.077
Noise    | Heatmap   | 0.717       | 0.857    | 0.610     | 0.872   | 0.682    | 0.728  | 0.877  | 0.804
Noise    | Influence | 0.795       | 0.884    | 0.707     | 0.890   | 0.796    | 0.566  | 0.970  | 0.966
Noise    | Concept   | 0.744       | 0.650    | 0.652     | 0.863   | 0.808    | -0.062 | -0.078 | -0.076

Table 3: Results of evaluating EDS on the 3dshapes dataset, averaged over 5 runs. We observe the extreme outperformance of heatmaps over the remaining explainers, visible in the EDS results but not in the comparative metrics. Standard deviation estimates are provided in the Supplementary Material (Section 6), with all results having estimated 95% confidence intervals within ±0.04.

Artifact | Explainer | EDS Overall | EDS S/NA | EDS NS/NA | EDS S/A | EDS NS/A | KSSD   | CCM    | FAM
Square   | Heatmap   | 1.00        | 1.00     | 1.00      | 1.00    | 1.00     | 0.680  | 0.828  | 0.826
Square   | Influence | 0.675       | 0.817    | 0.615     | 0.696   | 0.682    | 0.562  | 0.991  | 0.989
Square   | Concept   | 0.595       | 0.532    | 0.562     | 0.651   | 0.628    | -0.156 | -0.080 | -0.072
Stripe   | Heatmap   | 0.996       | 0.993    | 0.997     | 0.998   | 0.994    | 0.644  | 0.810  | 0.807
Stripe   | Influence | 0.867       | 0.922    | 0.844     | 0.903   | 0.861    | 0.561  | 0.991  | 0.980
Stripe   | Concept   | 0.720       | 0.653    | 0.636     | 0.783   | 0.795    | -0.152 | -0.075 | -0.070
Noise    | Heatmap   | 0.987       | 1.00     | 0.984     | 0.994   | 0.984    | 0.673  | 0.846  | 0.847
Noise    | Influence | 0.703       | 0.908    | 0.645     | 0.887   | 0.627    | 0.566  | 0.970  | 0.966
Noise    | Concept   | 0.569       | 0.600    | 0.566     | 0.587   | 0.555    | -0.150 | -0.074 | -0.074

We observe Explainer Divergence Scores significantly above the 0.5 theoretical baseline for all explainers and spurious artifacts in both datasets. This indicates that all of our explainers succeed in capturing information about the model's spuriousness in both tasks. The key advantage of EDS over previous work is that we can now directly compare the performance of explainers for detecting model spuriousness for the specified task and spurious artifact, and we interpret our results with this aim in mind.

We find the strongest performance for heatmaps and influential examples. EDS was highest for images in the spurious class without the spurious artifact, lowest for images in non-spurious classes without the spurious artifact, and somewhat high for images with the spurious artifact regardless of class. These findings appear consistent across all three of our chosen spurious artifacts, and in both datasets. We notice a sharp drop in performance for our Gaussian noise spurious artifact compared to the more localized spurious artifacts.

Concept extraction consistently performs worse than the other two explainers, operating well only on images with the explicit presence of the spurious artifact. This follows our expectations: we would expect concept extraction to more effectively identify the presence of the spurious signal concept from the activations of spurious models than from the activations of non-spurious models that have learned to become invariant to it. Moreover, the dimensionality of the concept predictions is much lower than that of the explanations from the other two explainers, limiting their expressiveness. Interestingly, while performance on images without the spurious artifact is poor, it is still above our 0.5 theoretical baseline, despite there being no obvious reason for concept predictions to shift between spurious and non-spurious models. This is further discussed in Section 4.5.

We notice significant differences in explainer performance between the dSprites and 3dshapes datasets. While in dSprites we find slightly higher performance for influential examples over heatmaps, in 3dshapes we find significantly stronger performance for heatmaps across all experiments. In 3dshapes, our EDS binary classifier often identifies the spuriousness of a given model from a heatmap with 100% accuracy. This is in sharp contrast to the other two explainers, which perform worse with 3dshapes, performing only comparably using the 'stripe' spurious artifact. Despite this diminished performance, both influential examples and concept extraction still perform above our 0.5 theoretical baseline for EDS.
====4.5. Discussion====

These results favour heatmaps and influential examples, which are very effective at detecting model spuriousness in both experiments with real explainer methods. Conversely, concept extraction consistently performed the worst, and is only useful on images in which the spurious artifact is present. As expected, performance is sensitive to the dataset and specified task.

We conduct further experiments to confirm that Explainer Divergence Scores are robust to our choice of optimization procedure and model architecture. These are expanded upon in the Supplementary Material (Section 6).

In both datasets, we observe EDS performance significantly above the 0.5 theoretical baseline for all explainers, spurious artifacts, and subclasses. Notably, this is seen even with images from unrelated, non-spurious classes without the presence of the spurious artifact.

This has interesting implications for the utility of post-hoc explainers in detecting model spuriousness. For example, heatmaps generated from 3dshapes images in non-spurious classes without the spurious artifact do not show any obvious signal that a human could use to identify that the respective model has some sort of spurious dependency. Yet a trained classifier with sufficient prior knowledge can diagnose whether the model depends upon a spurious signal with extremely high certainty. Information indicating spuriousness present in our explanations may not always be perceptible by a human observer, and identifying ways to extract or isolate this information may prove useful in designing more effective explainers.

This is particularly evident in the case of concept extraction, where there is no clear hypothesis for why spurious and non-spurious models would have differing information about the underlying concepts in images from a non-spurious class without the spurious artifact. This suggests that the presence of a spurious correlation can affect a model's ability to extract features in entirely unrelated image classes.

===5. Conclusion===

We present Explainer Divergence Scores, a novel method for evaluating post-hoc explainers for the purpose of detecting unknown spurious correlations.

Across three experiments we show EDS's superior capabilities over state-of-the-art post-hoc explainer evaluation methods. EDS provides an interpretable estimate of the amount of information an explainer can capture about a DNN's dependence on an unknown spurious signal. Moreover, EDS allows direct comparisons between different types of explainers, unlike previous methods, letting us quantitatively identify and evaluate the best explainer for a given dataset and spurious signal.

In contrast to previous work [12], our results reveal that commonly used post-hoc explainers contain a substantial amount of information about a model's dependence on unknown spurious signals. This information is often unidentifiable by human observers, yet it can be used by a well-trained classifier to detect dependencies even on images seemingly unrelated to the spurious signal. Our findings suggest that future research into post-hoc explanations should focus on identifying and utilizing this unseen information.
===6. Supplementary Material===

Additional information about our work, including a more detailed mathematical justification, ancillary experiments, and standard error estimates for all our results, is detailed in the Appendix available at this link.

===References===

[1] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163. doi:10.1089/big.2016.0047.
[2] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, 2017. arXiv:1702.08608.
[3] A. Datta, M. C. Tschantz, A. Datta, Automated experiments on ad privacy settings, Proc. Priv. Enhancing Technol. 2015 (2015) 92–112. doi:10.1515/popets-2015-0007.
[4] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on Fairness, Accountability and Transparency (FAT 2018), PMLR 81, 2018, pp. 77–91.
[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K. Müller, Unmasking Clever Hans predictors and assessing what machines really learn, CoRR abs/1902.10178 (2019). arXiv:1902.10178.
[6] R. Geirhos, J. Jacobsen, C. Michaelis, R. S. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nat. Mach. Intell. 2 (2020) 665–673. doi:10.1038/s42256-020-00257-z.
[7] S. Sagawa, A. Raghunathan, P. W. Koh, P. Liang, An investigation of why overparameterization exacerbates spurious correlations, in: ICML 2020, PMLR 119, 2020, pp. 8346–8356.
[8] U. Mahmood, R. Shrestha, D. Bates, L. Mannelli, G. Corrias, Y. Erdi, C. Kanan, Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems, Frontiers in Digital Health 3 (2021). doi:10.3389/fdgth.2021.671015.
[9] J. Adebayo, M. Muelly, I. Liccardi, B. Kim, Debugging tests for model explanations, in: NeurIPS 2020.
[10] X. Han, B. C. Wallace, Y. Tsvetkov, Explaining black box predictions and unveiling data artifacts through influence functions, in: ACL 2020, pp. 5553–5563. doi:10.18653/v1/2020.acl-main.492.
[11] B. Dimanov, Interpretable Deep Learning: Beyond Feature-Importance with Concept-based Explanations, Ph.D. thesis, University of Cambridge, 2021.
[12] J. Adebayo, M. Muelly, H. Abelson, B. Kim, Post hoc explanations may be ineffective for detecting unknown spurious correlation, in: ICLR 2022. https://openreview.net/forum?id=xNOVfCCvDpM.
[13] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: ICML 2017, PMLR 70, 2017, pp. 3319–3328.
[14] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: KDD 2016, ACM, 2016, pp. 1135–1144. doi:10.1145/2939672.2939778.
[15] G. Pruthi, F. Liu, S. Kale, M. Sundararajan, Estimating training data influence by tracing gradient descent, in: NeurIPS 2020.
[16] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, A. Weller, Now you see me (CME): Concept-based model extraction, in: CIKM 2020 Workshops, CEUR Workshop Proceedings 2699, 2020. http://ceur-ws.org/Vol-2699/paper02.pdf.
[17] L. Matthey, I. Higgins, D. Hassabis, A. Lerchner, dSprites: Disentanglement testing sprites dataset, https://github.com/deepmind/dsprites-dataset/, 2017.
[18] C. Burgess, H. Kim, 3D shapes dataset, https://github.com/deepmind/3dshapes-dataset/, 2018.
[19] C. Zhou, X. Ma, P. Michel, G. Neubig, Examining and combating spurious features under distribution shift, in: ICML 2021, PMLR 139, 2021, pp. 12857–12867.
[20] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks, in: ICLR 2020.
[21] M. Arjovsky, L. Bottou, I. Gulrajani, D. Lopez-Paz, Invariant risk minimization, CoRR abs/1907.02893 (2019). arXiv:1907.02893.
[22] L. Moneda, Spurious correlation machine learning and causality, blog post at lgmoneda.github.io, 2021.
[23] K. Leino, M. Fredrikson, Stolen memories: Leveraging model memorization for calibrated white-box membership inference, in: USENIX Security 2020, pp. 1605–1622.
[24] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2021) 115:1–115:35. doi:10.1145/3457607.
[25] X. Chen, C. Liu, B. Li, K. Lu, D. Song, Targeted backdoor attacks on deep learning systems using data poisoning, CoRR abs/1712.05526 (2017). arXiv:1712.05526.
[26] K. Y. Xiao, L. Engstrom, A. Ilyas, A. Madry, Noise or signal: The role of image backgrounds in object recognition, in: ICLR 2021.
[27] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, D. Batra, Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization, CoRR abs/1610.02391 (2016). arXiv:1610.02391.
[28] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML 2017, PMLR 70, 2017, pp. 1885–1894.
[29] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: ICML 2018, PMLR 80, 2018, pp. 2673–2682.
[30] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, MEME: Generating RNN model explanations via model extraction, CoRR abs/2012.06954 (2020). arXiv:2012.06954.
[31] L. C. Magister, D. Kazhdan, V. Singh, P. Liò, GCExplainer: Human-in-the-loop concept-based explanations for graph neural networks, CoRR abs/2107.11889 (2021). arXiv:2107.11889.
[32] P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, B. Kim, The (un)reliability of saliency methods, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, LNCS 11700, Springer, 2019, pp. 267–280. doi:10.1007/978-3-030-28954-6_14.
[33] B. Dimanov, U. Bhatt, M. Jamnik, A. Weller, You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods, in: SafeAI@AAAI 2020, CEUR Workshop Proceedings 2560, 2020, pp. 63–73. http://ceur-ws.org/Vol-2560/paper8.pdf.
[34] A. Ghorbani, A. Abid, J. Y. Zou, Interpretation of neural networks is fragile, CoRR abs/1710.10547 (2017). arXiv:1710.10547.
[35] S. Basu, P. Pope, S. Feizi, Influence functions in deep learning are fragile, in: ICLR 2021.
[36] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, O. Bachem, Challenging common assumptions in the unsupervised learning of disentangled representations, in: Reproducibility in Machine Learning Workshop, ICLR 2019.
[37] Y. Zhou, S. Booth, M. T. Ribeiro, J. Shah, Do feature attribution methods correctly attribute features?, in: AAAI 2022, pp. 9623–9633.
[38] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, CoRR abs/1806.08049 (2018). arXiv:1806.08049.
[39] J. Dai, S. Upadhyay, U. Aïvodji, S. H. Bach, H. Lakkaraju, Fairness via explanation quality: Evaluating disparities in the quality of post hoc explanations, in: AIES 2022, ACM, 2022, pp. 203–214. doi:10.1145/3514094.3534159.
[40] J. Zhou, A. H. Gandomi, F. Chen, A. Holzinger, Evaluating the quality of machine learning explanations: A survey on methods and metrics, Electronics 10 (2021). doi:10.3390/electronics10050593.
[41] C. Agarwal, E. Saxena, S. Krishna, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju, OpenXAI: Towards a transparent evaluation of model explanations, CoRR abs/2206.11104 (2022). arXiv:2206.11104.
[42] Y. Liu, S. Khandagale, C. White, W. Neiswanger, Synthetic benchmarks for scientific research in explainable machine learning, in: NeurIPS Datasets and Benchmarks Track, 2021.
[43] M. E. Zarlenga, P. Barbiero, Z. Shams, D. Kazhdan, U. Bhatt, M. Jamnik, On the quality assurance of concept-based representations, 2022. https://openreview.net/forum?id=Ehhk6jyas6v.
[44] Y. Yang, K. Chaudhuri, Understanding rare spurious correlations in neural networks, CoRR abs/2202.05189 (2022). arXiv:2202.05189.
[45] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Transactions on Communication Technology 15 (1967) 52–60. doi:10.1109/TCOM.1967.1089532.
[46] D. Kazhdan, B. Dimanov, M. Jamnik, P. Liò, A. Weller, Now you see me (CME): Concept-based model extraction, in: CIKM 2020 Workshops, CEUR Workshop Proceedings 2699, 2020. http://ceur-ws.org/Vol-2699/paper02.pdf.