=Paper=
{{Paper
|id=Vol-3318/short19
|storemode=property
|title=Shortcomings of Interpretability Taxonomies for Deep Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-3318/short19.pdf
|volume=Vol-3318
|authors=Anders Søgaard
|dblpUrl=https://dblp.org/rec/conf/cikm/Sogaard22
}}
==Shortcomings of Interpretability Taxonomies for Deep Neural Networks==
Anders Søgaard¹ ² ³

¹ Dpt. of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2200 Copenhagen
² Pioneer Centre for Artificial Intelligence, Lyngbyvej 2, DK-2100 Copenhagen
³ Dpt. of Philosophy, Karen Blixens Plads 8, DK-2300 Copenhagen

AIMLAI 2022: Advances in Interpretable Machine Learning and Artificial Intelligence, October 21, 2022, Atlanta, GA. soegaard@di.ku.dk (A. Søgaard), https://anderssoegaard.github.io/, ORCID 0000-0001-5250-4276. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract: Taxonomies are vehicles for thinking about what is possible, for identifying unconsidered options, as well as for establishing formal relations between entities. We identify several shortcomings in 10 existing taxonomies of interpretability methods for explainable artificial intelligence (XAI), focusing on methods for deep neural networks. The shortcomings include redundancies, incompleteness, and inconsistencies. We design a new taxonomy based on two orthogonal dimensions and show how it can be used to derive results about entire classes of interpretability methods for deep neural networks.

Keywords: interpretability, taxonomy

===1. Two Common Distinctions===

Biological taxonomies provide a basis for conservation and development and are used to generate interesting questions about missing species [1, 2]. Inconsistent taxonomies can, at the same time, hinder research or lead it in the wrong direction [3, 4, 5]. In engineering, taxonomies play additional roles: they are vehicles for thinking about what is possible, for identifying unconsidered options, as well as for establishing formal relations between methods.

Several taxonomies of interpretability methods already exist [6, 7, 8, 9, 10, 11, 12, 13]. These taxonomies provide us with technical terms for distinguishing approaches to interpretability and can be efficient tools for researchers to contextualize their work. They generate interesting research questions – e.g., if all methods in class A but no methods in class B happen to exhibit property X, is this by necessity, or can we design a method in B with property X? – and help us see relations between methods – e.g., that two methods in class A are mathematically equivalent. Unfortunately, the taxonomies that exist, without exception, have shortcomings and are either redundant, incomplete, or inconsistent. In §2, we show this, examining the above 10 taxonomies one by one, and also discussing between-taxonomy inconsistencies in how individual methods are classified. In §3, we present a consistent taxonomy and establish various observations and results that apply to entire classes of methods in our taxonomy.

Contributions: (a) We detect inadequacies in 10 interpretability taxonomies. (b) We establish a simple, yet superior, two-dimensional taxonomy. (c) We derive six non-trivial observations or results based on this taxonomy.

{| class="wikitable"
|+ Table 1: The 10 existing taxonomies, the number of dimensions each introduces, and the dimensions they add beyond local-global and intrinsic-posthoc. Most distinguish local from global methods, and intrinsic from posthoc methods. We argue the additional dimensions all lead to inconsistencies and/or redundancies, and that the intrinsic-posthoc distinction is itself problematic (the per-taxonomy shortcomings are detailed in §2).
! Survey !! Dimensions !! Additional dimensions
|-
| [6] || 4 || time, expertise
|-
| [7] || 3 || model-specific/model-agnostic
|-
| [8] || 4 || pre-model/in-model/post-model, results
|-
| [9] || 4 || spec./agn., results
|-
| [10] || 3 || types
|-
| [11] || 3 || technique
|-
| [12] || 3 || methodology
|-
| [14] || 1 || grad./pert./simpl.
|-
| [15] || 1 || att./rule/sum.
|-
| [13] || 2 || inst./approx./attr./counterf.
|}

The simplest taxonomies presented are one-dimensional, i.e., simple groupings [14, 15]. Other taxonomies introduce up to four dimensions and use these to cross-classify existing methods. The 10 taxonomies are at most a couple of years old (2019–2021) and are discussed in chronological order. We first discuss two common distinctions that are largely agreed upon: local-global and intrinsic-posthoc. One of these distinctions, local-global, will be useful, while the other is problematic in several respects. We define an interpretability method ℳ(w, 𝑆) as a complex function that takes a model w and a sample of token sequences 𝑆 ⊆ 𝒮 as input, and is composed of three types of functions:

Definition 1.1 (Forward functions). Let forward(w, 𝑆) return w(𝑓(𝑠)) for all inputs 𝑠 ∈ 𝑆, i.e., w₁(𝑠), …, wₙ(𝑠) for 𝑛 layers, with 𝑓 : 𝒮 ↦ 𝒮 a function from input to input, e.g., perturb, delete, identity.

Definition 1.2 (Backward functions). Let backward(w, 𝑆) return 𝑔(w⁻¹, forward(w, 𝑆)), where 𝑔(⋅, ⋅) is a function that defines a backward pass of gradients, relevance scores, etc., over the inverse model w⁻¹.

Definition 1.3 (Inductive functions). Let induce(w, 𝑆) return a set of parameters v fitted by minimizing an objective over w and 𝑆.

Examples of inductive functions include, for various loss functions ℓ(⋅, ⋅): (a) probing [16], in which the objective is of the form ℓ(v(𝑆), 𝑙(wⱼ(𝑆))), where wⱼ(𝑠) is the representation of 𝑠 at the 𝑗-th layer of w, and 𝑙 is the probe-specific re-labeling function of samples; or (b) linear approximation [17], in which the objective is of the form ℓ(v(𝑆), w(𝑆)), where v is a linear function.
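To make the three component-function types concrete, the following is a minimal illustrative sketch (ours, not from the paper), using numpy and a toy two-layer linear model: forward returns the layer outputs for (optionally perturbed) inputs, backward returns input gradients, and induce fits a linear approximation v by least squares, i.e., the linear-approximation objective ℓ(v(𝑆), w(𝑆)) mentioned above.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear model w: its layer outputs stand in for w_1(s), ..., w_n(s).
W1 = rng.normal(size=(4, 8))   # hypothetical layer-1 weights
W2 = rng.normal(size=(8, 1))   # hypothetical layer-2 weights

def forward(S, f=lambda s: s):
    """Definition 1.1: return the layer outputs w_1(f(s)), ..., w_n(f(s)) for each s in S.
    f is an input-to-input function, e.g. identity (default) or a perturbation."""
    return [(f(s) @ W1, (f(s) @ W1) @ W2) for s in S]

def backward(S):
    """Definition 1.2: a backward pass over the model, here returning the input
    gradient d w(s)/d s, which for this linear model is W1 @ W2 for every s."""
    return [(W1 @ W2).ravel() for _ in S]

def induce(S):
    """Definition 1.3: fit parameters v by minimizing an objective over w and S;
    here the linear-approximation objective ||v(S) - w(S)||^2 via least squares."""
    X = np.stack(S)
    y = np.array([out.item() for _, out in forward(S)])
    v, *_ = np.linalg.lstsq(X, y, rcond=None)
    return v

S = [rng.normal(size=4) for _ in range(32)]
print(backward(S)[0])  # a per-input attribution (constant here: the model is linear)
print(induce(S))       # induced surrogate parameters v (recovers W1 @ W2 exactly)
</syntaxhighlight>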
Local and global explanations. The distinction between local and global interpretability methods is shared across all the taxonomies discussed in this paper, and will also be one of the two dimensions in the taxonomy we propose below. The distinction is defined slightly differently by different authors, or not defined at all, e.g., [6], but here we present the definition that our taxonomy below relies on: a method ℳ is said to be global if and only if it includes at least one inductive function; otherwise ℳ is said to be local. Global methods typically require access to a representative sample of data, to minimize their objectives, whereas local methods are applicable to singleton samples.¹ As should be clear from the discussion below, this definition is not equivalent to definitions that take the reliance of local methods on specific instances as the distinguishing criterion; ours instead takes the reliance of global methods on samples. One argument against the definition in [11], for example, is that it is not entirely clear in what sense global methods such as concept activation vectors [18] are independent of any particular input: the function that provides us with the explanations is global, but its output of course depends on the input.

¹ Note that our definition does not refer to how the methods characterize the models, e.g., whether they describe individual inferences or derive aggregate statistics that quantify ways the models are biased. This is to avoid a common source of confusion: local methods can be used to derive aggregate statistics that characterize global properties of models. LIME [19], for example, is mostly classified as a local method ([12] classify it as both local and global), but in [19] the authors explicitly discuss how LIME can be used on i.i.d. samples to derive aggregate statistics that characterize model behavior on distributions (the same can be done for all local methods; see §3.6). Our definition makes it clear that such methods are local; local methods can be applied globally, whereas global methods cannot be applied locally. It is also clear from our definition that the two classes of interpretability methods are often motivated by different prototypical applications: local methods are often used to explain the motivation behind critical decisions, e.g., why a customer was assessed as high-risk, why a travel review was flagged as fraudulent, or why a newspaper article was flagged as misleading, whereas global methods are used to characterize biases in models and evaluate their robustness.
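The definition above yields a purely mechanical classification rule. A small sketch of it (ours; the COMPONENTS registry below is a hypothetical declaration of which component-function types each method composes, following the paper's own classifications):

<syntaxhighlight lang="python">
# Hypothetical registry: which component-function types each method composes.
COMPONENTS = {
    "gradient saliency":  {"forward", "backward"},   # no induction over a sample
    "attention weights":  {"forward"},
    "probing classifier": {"forward", "inductive"},  # fits probe parameters v on S
}

def is_global(method: str) -> bool:
    """A method is global iff it includes at least one inductive function."""
    return "inductive" in COMPONENTS[method]

for m in COMPONENTS:
    print(f"{m}: {'global' if is_global(m) else 'local'}")
# gradient saliency: local, attention weights: local, probing classifier: global
</syntaxhighlight>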
Challenge: When taxonomies have tried to classify interpretation methods into local and global ones, in practice some methods have seemed harder to classify than others. Concept activation approaches [18], for example, use joint global training to learn mappings of individual examples into local explanations. Contrastive interpretability methods [20] provide explanations in terms of pairs of examples. It may also seem unclear whether a challenge dataset provides a local or a global explanation. [10] discuss what they call semi-local approaches, and [8] introduce a category for interpretability methods that relate to groups of examples. Are there methods that are not easily categorized as global or local? Answer: Our definition of local-global focuses on the induction of explanations from samples. This focus enables unambiguous classification and leads us to classify concept activation methods as global, since the explanatory model component is induced from a sample (and relies on the representativity of this sample). Similarly, we classify contrastive and group methods as local methods, since they do not require induction or assume representative samples; and, finally, we classify challenge datasets as local methods, since challenge datasets also do not have to be representative.²

² Examples of local methods include gradients [21, 22, 23], LRP [24], deep Taylor decomposition [25], integrated gradients [26, 27], DeepLift [28], direct interpretation of gate/attention weights [29], attention roll-out and flow [30], word association norms and analogies [31], time step dynamics [32], challenge datasets [33, 34, 35, 36], local uptraining [19], and influence sketching and influence functions [37]. Examples of global methods include unstructured pruning, lottery tickets, dynamic sparse training, binary networks, sparse coding, gate and attention head pruning, correlation of representations [38], clustering [39, 40, 41], probing classifiers [16], concept activation [18], representer point selection [42], TracIn [43], and uptraining [44].

Intrinsic and post-hoc explanations. This distinction, also called active-passive in [10] and self-explaining vs. ad-hoc in [11], is between intrinsic methods that jointly output explanations, and methods that derive these explanations post-hoc using auxiliary techniques. While most taxonomies introduce this distinction, we argue that it is inherently problematic. Challenge: The distinction between intrinsic and posthoc methods can be hard to maintain.³ Moreover, for a method to be posthoc means different things for local and global methods. A post-hoc, local method is post-hoc relative to a class inference (in the case of classification); a post-hoc, global method is post-hoc relative to training, introducing a disjoint training phase for learning the interpretability functions. Strictly speaking, the fact that 'post-hoc' takes on two disjoint meanings for local and global methods, namely 'post-inference' and 'post-training', makes taxonomies that rely on both dimensions inconsistent.

³ Consider the difference between the two global interpretability methods concept activation vectors and probing classifiers: CAVs are trained jointly, probing classifiers sequentially. These are extremes of a (curriculum) continuum, which is hard to binarize: if a probing classifier is trained jointly with the last epoch of the model training, is the method then intrinsic or posthoc? For a real example, consider TracIn [43], in which influence functions are estimated across various training checkpoints. Again, is TracIn intrinsic or posthoc? That the binary distinction covers a continuum makes the distinction hard to apply in practice.

===2. Shortcomings of Taxonomies===

We now briefly assess the 10 taxonomies, pointing out the ways in which they are inconsistent, incomplete, or redundant.

Guidotti et al. (2018) [6] make the local-global distinction, as well as two distinctions that relate to how explanations are communicated (how much time the user is expected to have to understand the model decisions, and how much domain knowledge and technical experience the user is expected to have). In addition to the terms local and global, they also refer, synonymously, to outcome explanation and model explanation. Later in their survey, [6] make a fourth distinction that is very similar to intrinsic-posthoc, namely between transparent design (leading to intrinsically interpretable models) and (post-hoc) black-box inspection, but oddly, this is not seen as an orthogonal dimension, but as two additional classes on par with outcome and model explanation. Challenge: How do we classify methods that are both, say, local and post-hoc, i.e., do outcome explanation by black-box inspection? Examples would include gradients [21, 22, 23], layer-wise relevance propagation [24], deep Taylor decomposition [25], integrated gradients [26, 27], etc.

Adadi and Berrada (2019) [7] rely on the local-global and intrinsic-posthoc distinctions (referring to the latter as complexity), and, as a third dimension, they distinguish between model-agnostic and model-specific interpretability methods. Inconsistencies: We argue that the distinction between model-specific and model-agnostic methods is suboptimal in that state-of-the-art models are moving targets, and so is what counts as model-specific. This may lead to inconsistencies over time. Challenge: How do we classify a method that applies to all known models, but not to all possible models?

Carvalho et al. (2019) [8] introduce four dimensions in their taxonomy: (a) scope, which coincides with the local-global distinction; (b) intrinsic-posthoc; (c) pre-model, in-model, and post-model, with in-model corresponding to intrinsic methods and post-model corresponding to post-hoc methods, whereas pre-model comprises various approaches to data analysis; we argue below that (c) is both redundant and inconsistent. Finally, they introduce (d) a results dimension, which concerns the form of the explanations provided by the methods. Inconsistencies: In addition to the inconsistency of intrinsic-posthoc, including pre-model explanations leads to further taxonomic inconsistency in that pre-model approaches cannot be classified along the other dimensions, since they do not refer to models at all. For the same reason, one might argue they are not model interpretation methods in the first place. Redundancies: The redundancy of (c) follows from the observation that the distinction between in-model and post-model explanations is identical to the distinction made in (b), as well as the observation that pre-model explanations do not refer to models at all. Challenge: What is an intrinsic interpretability method that presents post-model explanations, or a post-hoc interpretability method that presents in-model explanations?

Molnar (2019) [9] distinguishes between local-global and intrinsic-posthoc, between different results, and between model-specific and model-agnostic methods, making their taxonomy very similar to [8]. Inconsistencies: See the discussion of [7]. Also, the results dimension is inconsistent in that explanations can, simultaneously, be intrinsically interpretable models and feature summary statistics. LIME [19], for example, presents local explanations as the coefficients of a linear fit, i.e., an intrinsically interpretable model that consists solely of feature summary statistics. Redundancies: The most important redundancy is that all model-agnostic interpretability methods are also post-hoc, since intrinsic methods require joint training, which in turn requires compatibility with model architectures. Moreover, model-agnostic interpretability methods are all grounded in input features and thus lead to explanations in terms of feature summary statistics or visualizations. Moreover, all explanations in terms of intrinsically interpretable models are, quite obviously, intrinsic. Challenge: What is a post-hoc interpretability method whose explanations are intrinsically interpretable models?

Zhang et al. (2020) [10] rely on these dimensions: (a) global-local; (b) intrinsic-posthoc (which they call active-passive); and (c) a distinction between four explanation types, namely examples, attribution, hidden semantics, and rules. Inconsistencies: The explanation type dimension in [10] conflates (a) the model components we are trying to explain, and (b) what the explanations look like. Hidden semantics, e.g., is a model component, whereas examples and rules refer to the (syntactic) form of the explanations. The distinction between hidden semantics and attribution is also problematic: hidden semantics can be used to derive attribution (a results type in [8] and [9]), e.g., in LSTMVis [45]; this is because hidden semantics is not a type of explanation, but a model component. Attribution, examples, and rules are types of explanations, but this list is not exhaustive, since explanations can also be in terms of concepts, free text, or visualizations, for example. Challenge: What is a passive interpretability method that does not provide local explanations?

Danilevsky et al. (2020) [11] only distinguish between global-local and intrinsic-posthoc (which they call self-explaining and ad-hoc) methods. Inconsistencies: [11] say most attribution methods are global and ad-hoc. We argue attribution methods are necessarily local, and while aggregate statistics can of course be computed across real or synthetic corpora, little is gained by blurring taxonomies to reflect that; all local methods can be used to compute summary statistics. Incompleteness: [11] admit their survey is biased toward local methods, and many global interpretability methods are left uncovered. Challenge: What is a local interpretability method that cannot be used to compute summary statistics?

Das et al. (2020) [12] distinguish between local and global methods, between gradient-based and perturbation-based methods (methodology), and between intrinsic and post-hoc methods (usage). Their taxonomy is both incomplete and redundant. Incompleteness: Several approaches are neither gradient-based nor perturbation-based. Redundancies: All gradient-based approaches are classified as post-hoc approaches in [12]; similarly, all intrinsic methods are classified as global methods. Of course these cells may be filled with methods that were not covered, but it seems, in particular, that gradient-based approaches are almost always post-hoc. Challenge: What is an intrinsic, gradient-based approach?

Atanasova et al. (2020) [14] distinguish between three classes of interpretability methods: gradient-based, perturbation-based, and simplification-based methods. Inconsistencies: The distinction between gradient-based and perturbation-based methods is similar to [12], but the two classifications are inconsistent, with [14] citing LIME [19] as a simplification-based method. It seems that the distinction between perturbation-based and simplification-based methods is in itself inconsistent in that both perturbations and gradients can be used to simplify models; similarly, perturbations can be used to baseline gradient-based approaches. Incompleteness: Clearly, not all interpretability methods are gradient-based, perturbation-based, or simplification-based: other methods are based on weight magnitudes, carefully designed example templates, or visualizing and quantifying attention weights or gating mechanisms. Challenge: How would you classify attention roll-out [30], for example?

Kotonya and Toni (2020) [15] distinguish between attention-based explanations, explanations as rule discovery, and explanations as summarization. Incompleteness: Using gating mechanisms to interpret models, e.g., does not fit any of the three categories. Inconsistencies: One class of methods is defined in terms of the model components being interpreted (attention-based), and another class in terms of the form of explanations they provide (rule discovery and summarization). Mixing orthogonal dimensions is inconsistent, i.e., methods can belong to several categories, e.g., attention head pruning [46] (attention-based and summarization), or when rules are induced from attention weights [47].

Chen et al. (2021) [13] introduce the global-local distinction, but not the intrinsic-posthoc distinction. In addition, they distinguish between interpretability methods that present explanations in terms of training instances, approximations, feature attribution, and counterfactuals. Inconsistencies: The second dimension again mixes orthogonal distinctions. Approximations, for example, can be used to attribute importance to features (LIME). Incompleteness: Concepts, attention weights, gate activations, rules, etc., are not covered by the second dimension. Redundancies: All methods that present explanations in terms of training instances are necessarily local. Challenge: What is a global interpretability method providing explanations in terms of training instances?⁴

⁴ Several of the above taxonomies include dimensions that pertain to the form of the output of interpretation methods. We argue such distinctions are orthogonal to the methods and should therefore not be included in taxonomies. To see this, note that most interpretability methods, e.g., LIME, can provide explanations of different forms: aggregate statistics, coefficients, rules, visualizations, etc.

Inconsistent Classifications: Table 2 (left) shows that taxonomies are not only internally inconsistent, but also inconsistent in how they classify methods. Six methods were mentioned by more than one survey, and 4/6 of them were classified differently.

{| class="wikitable"
|+ Table 2 (left): How individual methods are classified across the surveys that mention them; 4/6 methods are classified incoherently across taxonomies. Abbreviations: local (L), global (G), intrinsic (I), posthoc (H).
! Method !! Classifications reported across [10], [11], [12], [6], [7], [8], [9]
|-
| GradCAM || L; L-H; L-H
|-
| DeepLift || H; L-H; L-H
|-
| LRP || L/G-H; S-H; L-I/H; L/G-H
|-
| LIME || L; L-H; L-H; L; L-H; L-H; L/G-H
|-
| TCAV || G-I; G-H; G-H
|-
| IF || L/G; L-H
|}

{| class="wikitable"
|+ Table 2 (right): Our novel taxonomy.
!  !! Forward !! Backward
|-
! Local
| Attention, Attention roll-out, Attention flow, Time step dynamics, Local uptraining, Influence functions
| Gradients, Layer-wise relevance propagation, Deep Taylor decomposition, Integrated gradients, DeepLift
|-
! Global
| Weight pruning, Correlation of representations, Clustering, Probing classifiers, Uptraining
| Dynamic sparse training, Binary networks, Sparse coding, Concept activation, Gradient-based weight pruning
|}
===3. A Novel Taxonomy and Observations===

Our taxonomy is two-dimensional: one dimension is local-global, the other a distinction between explanations based on forward passes and explanations based on backward passes. Forward explanations correlate intermediate representations or continuous or discrete output representations to obtain explanations, whereas backward explanations concern training dynamics. We define forward-backward:

Definition 3.1 (Forward-backward). A method ℳ is said to be backward if it contains backward functions; otherwise, ℳ is said to be forward.
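Together with the local-global definition from §1, Definition 3.1 assigns every method to one of four cells. A minimal sketch (ours; the component sets are hypothetical declarations chosen to match the placements in Table 2):

<syntaxhighlight lang="python">
# Hypothetical component-function declarations for four methods from Table 2.
COMPONENTS = {
    "attention weights":             {"forward"},                           # local,  forward
    "gradient saliency":             {"forward", "backward"},               # local,  backward
    "probing classifier":            {"forward", "inductive"},              # global, forward
    "gradient-based weight pruning": {"forward", "backward", "inductive"},  # global, backward
}

def cell(method: str) -> tuple[str, str]:
    """Map a method to its (local/global, forward/backward) cell."""
    comps = COMPONENTS[method]
    scope = "global" if "inductive" in comps else "local"         # definition in §1
    direction = "backward" if "backward" in comps else "forward"  # Definition 3.1
    return scope, direction

for m in COMPONENTS:
    print(f"{m:31s} -> {cell(m)}")
</syntaxhighlight>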
Local backward methods include gradients [21, 22, 23], integrated gradients [26, 27], layer-wise relevance propagation [48], DeepLIFT [28], and deep Taylor decomposition [25], which all derive explanations for individual instances from what is normally used as training signals, typically based on derivatives of the loss function (gradients) evaluating ℎ on training data, e.g., 𝑑(ℓ(ℎ(x𝑖), 𝑦𝑖)). Global backward methods rely on such training signals to modify or extend the model parameters w associated with ℎ, typically extracting approximations, rules, or visualizations.⁵

⁵ Local forward methods either consider intermediate representations, e.g., gates [49], attention [29], attention flow [50], etc.; continuous output representations, e.g., using word association norms [51] or word analogies [52, 53]; or discrete output, such as when evaluating on challenge datasets [33, 34, 35, 36] or when approximating the model's output distribution [19, 54, 37]. In the same way, global forward methods can rely on intermediate representations in forward passes, e.g., in attention head pruning [46], attention factor analysis [55], syntactic decoding of attention heads [50], attention head manipulation [56], etc.; on continuous output in forward passes, including work using clustering in the vector space to manually analyze model representations [57, 58], probing classifiers [16], and concept activation strategies [18]; or on discrete output, e.g., in uptraining [44] and knowledge distillation [59].

Observation 3.1. Local backward methods are always attribution methods (presenting feature summary statistics).

Since local methods have to provide explanations in terms of input/output (as they do not modify weights), and since backward passes do not generate output distributions, they have to present explanations in terms of attribution of relevance or gradients to input features or input segments. §3.1 is empirical. It follows naturally, but not necessarily, from the fact that backward methods reverse the direction of connections, thus returning quantities that hold for the input nodes. Pre-input quantities are not interpretable.

Observation 3.2. Only global methods can be unfaithful.

§3.2 follows from the definition of faithfulness: ℳ is faithful if the inductive functions of ℳ have ℓ(v, 𝑃) = 0, with 𝑆 ∼ 𝑃̃. ℳ can only be unfaithful with respect to inductive component functions; local methods can therefore not be unfaithful.⁶

⁶ Local methods compute quantities based on forward or backward passes, but these quantities are not induced to simulate anything. Global methods induce parameters to simulate a distribution and can be more or less faithful to this distribution, but since local methods simply 'read off' their quantities, they cannot be unfaithful; the quantities can only be misinterpreted.

Observation 3.3. Global methods can at best be epsilon-faithful, and only on i.i.d. instances.

§3.3 follows from the fact that standard learning theory applies to the inductive component functions of global interpretability methods. Since their faithfulness is the inverse of the empirical risk of these inductions, it follows that global methods can at best be 𝜖-faithful, with 𝜖 the expected loss of these inductions. Note that when the explanation is a model approximation 𝜃′, 𝜖 = 𝔼[ℓ(𝜃(𝑥), 𝜃′(𝑥))].

Observation 3.4. Only forward methods are used for local layer-wise analysis.

Since local backward methods are attribution methods (§3.1), and layer-wise analysis concerns differences between layers, local backward methods cannot be used here, simply because they only output attributions at the input level. §3.4 thus follows from §3.1, making it, too, an empirical observation, not a formal derivation.

Observation 3.5. No equivalence relations can hold across the four categories of methods.

§3.5 follows from the disjointness of the three sets of component functions and from how the four classes are defined, i.e., that global functions cannot be local, and forward functions cannot be backward. Equivalences between methods have already been found [28, 26, 27, 60, 61], but consistent taxonomic classification effectively prunes the search space of possible equivalences.

Observation 3.6. Local methods can always characterize models globally on i.i.d. samples.

§3.6 states that any local method that derives quantities for an example can be used to aggregate corpus-level statistics over appropriately sampled data. See [19] for how to do this with LIME. It should be easy to see how this result generalizes to all other local methods.
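Observation 3.6 is easy to operationalize. Below is a minimal sketch (ours, reusing the toy linear model and gradient attribution from the §1 sketch) that aggregates per-instance local attributions over an i.i.d. sample into a corpus-level characterization, in the spirit of the aggregation discussed for LIME in [19]:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))  # toy model, as in the §1 sketch

def local_attribution(s):
    """A local backward method: gradient of the model output w.r.t. the input s
    (constant here because the toy model is linear)."""
    return (W1 @ W2).ravel()

def global_characterization(S):
    """Aggregate a local method over an i.i.d. sample S: mean absolute attribution
    per input feature, i.e., a corpus-level feature-importance profile."""
    return np.mean(np.abs([local_attribution(s) for s in S]), axis=0)

S = [rng.normal(size=4) for _ in range(100)]
print(global_characterization(S))
</syntaxhighlight>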
===4. Conclusion===

We examined 10 taxonomies of interpretability methods and found all of them to be inconsistent. We introduced a two-dimensional taxonomy and showed how it can be helpful in deriving general observations and results.

===References===

[1] S. Bacher, Still not enough taxonomists: reply to Joppa et al., Trends in Ecology & Evolution 27(2) (2012) 65–66; author reply 66.
[2] S. Thomson, R. Pyle, S. Ahyong, M. Alonso-Zarazaga, J. Ammirati, J.-F. Araya, J. Ascher, T. Audisio, V. Azevedo-Santos, N. Bailly, W. Baker, M. Balke, M. Barclay, R. Barrett, R. Benine, J. Bickerstaff, P. Bouchard, R. Bour, T. Bourgoin, H.-Z. Zhou, Taxonomy based on science is necessary for global conservation, PLoS Biology 16 (2018). doi:10.1371/journal.pbio.2005075.
[3] M. S. Brewer, P. Sierwald, J. Bond, Millipede taxonomy after 250 years: Classification and taxonomic practices in a mega-diverse yet understudied arthropod group, PLoS ONE 7 (2012).
[4] H. Fraser, G. Garrard, L. Rumpff, C. Hauser, M. McCarthy, Consequences of inconsistently classifying woodland birds, Frontiers in Ecology and Evolution 3 (2015) 83.
[5] B. Jones, A few bad scientists are threatening to topple taxonomy, Smithsonian Magazine (2017).
[6] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models (2018).
[7] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160.
[8] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey on methods and metrics, Electronics 8 (2019) 832.
[9] C. Molnar, Interpretable Machine Learning, 2019. https://christophm.github.io/interpretable-ml-book/.
[10] Y. Zhang, P. Tiňo, A. Leonardis, K. Tang, A survey on neural network interpretability, 2020. arXiv:2012.14261.
[11] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, in: AACL-IJCNLP, Suzhou, China, 2020.
[12] A. Das, P. Rad, Opportunities and challenges in explainable artificial intelligence (XAI): A survey, 2020. arXiv:2006.11371.
[13] V. Chen, J. Li, J. S. Kim, G. Plumb, A. Talwalkar, Towards connecting use cases and methods in interpretable machine learning, CoRR (2021). arXiv:2103.06254.
[14] P. Atanasova, J. G. Simonsen, C. Lioma, I. Augenstein, A diagnostic study of explainability techniques for text classification, in: EMNLP, Online, 2020.
[15] N. Kotonya, F. Toni, Explainable automated fact-checking: A survey, in: COLING, Barcelona, Spain (Online), 2020.
[16] Y. Belinkov, Probing classifiers: Promises, shortcomings, and advances, Computational Linguistics (2021) 1–13.
[17] J. Ba, R. Caruana, Do deep nets really need to be deep?, in: NeurIPS, 2014.
[18] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: ICML, volume 80, 2018.
[19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: KDD, New York, NY, USA, 2016.
[20] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, P. Das, Explanations based on the missing: Towards contrastive explanations with pertinent negatives, 2018. arXiv:1802.07623.
[21] P. Leray, P. Gallinari, Feature selection with neural networks, Behaviormetrika 26 (1998) 16–6.
[22] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. arXiv:1312.6034.
[23] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, N. de Freitas, Modelling, visualising and summarising documents with a single convolutional neural network, CoRR abs/1406.3830 (2014).
[24] L. Arras, F. Horn, G. Montavon, K.-R. Müller, W. Samek, Explaining predictions of non-linear classifiers in NLP, in: RepL4NLP, Berlin, Germany, 2016.
[25] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller, Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognition 65 (2017) 211–222.
[26] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, 2017. arXiv:1703.01365.
[27] P. K. Mudrakarta, A. Taly, M. Sundararajan, K. Dhamdhere, Did the model understand the question?, in: ACL, Melbourne, Australia, 2018.
[28] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: ICML, 2017.
[29] M. Rei, A. Søgaard, Zero-shot sequence labeling: Transferring knowledge from sentences to tokens, in: NAACL, New Orleans, Louisiana, 2018.
[30] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, in: ACL, Online, 2020.
[31] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting similarities among languages for machine translation, CoRR abs/1309.4168 (2013). arXiv:1309.4168.
[32] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, 2017. arXiv:1606.07461.
[33] M. Richardson, C. J. Burges, E. Renshaw, MCTest: A challenge dataset for the open-domain machine comprehension of text, in: EMNLP, 2013.
[34] J. Mullenbach, J. Gordon, N. Peng, J. May, Do nuclear submarines have nuclear captains? A challenge dataset for commonsense reasoning over adjectives and objects, in: EMNLP, Hong Kong, China, 2019.
[35] K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, C. Cardie, DREAM: A challenge data set and models for dialogue-based reading comprehension, Transactions of the Association for Computational Linguistics 7 (2019) 217–231.
[36] N. F. Liu, R. Schwartz, N. A. Smith, Inoculation by fine-tuning: A method for analyzing challenge datasets, in: NAACL, Minneapolis, Minnesota, 2019.
[37] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML, 2017.
[38] N. Kriegeskorte, M. Mur, P. Bandettini, Representational similarity analysis – connecting the branches of systems neuroscience, Frontiers in Systems Neuroscience 3 (2008).
[39] T. A. Trost, D. Klakow, Parameter free hierarchical graph-based clustering for analyzing continuous word embeddings, in: TextGraphs, Vancouver, Canada, 2017.
[40] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? A closer look at polysemous words, in: BlackboxNLP, Online, 2020.
[41] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: ACL, Online, 2020.
[42] C.-K. Yeh, J. S. Kim, I. E. H. Yen, P. Ravikumar, Representer point selection for explaining deep neural networks, 2018. arXiv:1811.09720.
[43] D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, Z. C. Lipton, Learning to deceive with attention-based explanations, in: ACL, Online, 2020.
[44] S. Petrov, P.-C. Chang, M. Ringgaard, H. Alshawi, Uptraining for accurate deterministic question parsing, in: EMNLP, 2010.
[45] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 667–676. doi:10.1109/TVCG.2017.2744158.
[46] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, in: ACL, Florence, Italy, 2019.
[47] T. Ruzsics, O. Sozinova, X. Gutierrez-Vasques, T. Samardzic, Interpretability for morphological inflection: from character-level predictions to subword-level rules, in: EACL, Online, 2021.
[48] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (2015).
[49] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, M. Baroni, The emergence of number and syntax units in LSTM language models, in: NAACL, Minneapolis, Minnesota, 2019.
[50] V. Ravishankar, A. Kulmizev, M. Abdou, A. Søgaard, J. Nivre, Attention can reflect syntactic structure (if you let it), in: EACL, Online, 2021.
[51] K. W. Church, P. Hanks, Word association norms, mutual information, and lexicography, in: ACL, Vancouver, British Columbia, Canada, 1989.
[52] T. Mikolov, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NeurIPS, 2013.
[53] N. Garneau, M. Hartmann, A. Sandholm, S. Ruder, I. Vulic, A. Søgaard, Analogy training multilingual encoders, in: AAAI, 2021.
[54] D. Alvarez-Melis, T. Jaakkola, A causal framework for explaining the predictions of black-box sequence-to-sequence models, in: EMNLP, Copenhagen, Denmark, 2017.
[55] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Attention is not only a weight: Analyzing transformers with vector norms, in: EMNLP, Online, 2020.
[56] S. Vashishth, S. Upadhyay, G. S. Tomar, M. Faruqui, Attention interpretability across NLP tasks, 2019. arXiv:1909.11218.
[57] K. Heylen, D. Speelman, D. Geeraerts, Looking at word meaning: An interactive visualization of semantic vector spaces for Dutch synsets, in: LINGVIS & UNCLH, Avignon, France, 2012.
[58] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, in: NeurIPS, volume 32, 2019.
[59] Y. Kim, A. M. Rush, Sequence-level knowledge distillation, in: EMNLP, Austin, Texas, 2016.
[60] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Towards better understanding of gradient-based attribution methods for deep neural networks, in: ICLR, 2018.
[61] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applications, Proceedings of the IEEE 109 (2021) 247–278. doi:10.1109/JPROC.2021.3060483.