<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>J P R O C .</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Shortcomings of Interpretability Taxonomies for Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anders Søgaard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpt. of Computer Science, University of Copenhagen</institution>
          ,
          <addr-line>Universitetsparken 1, DK-2200 Copenhagen</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dpt. of Philosophy</institution>
          ,
          <addr-line>Karen Blixens Plads 8, DK-2300 Copenhagen</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pioneer Centre for Artificial Intelligence</institution>
          ,
          <addr-line>Lyngbyvej 2, DK-2100 Copenhagen</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <issue>0</issue>
      <abstract>
        <p>Taxonomies are vehicles for thinking about what's possible, for identifying unconsidered options, as well as for establishing formal relations between entities. We identify several shortcomings in 10 existing taxonomies for interpretability methods for explainable artificial intelligence (XAI), focusing on methods for deep neural networks. The shortcomings include redundancies, incompleteness, and inconsistencies. We design a new taxonomy based on two orthogonal dimensions and show how it can be used to derive results about entire classes of interpretability methods for deep neural networks.</p>
      </abstract>
      <kwd-group>
        <kwd>interpretability</kwd>
        <kwd>taxonomy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Two Common Distinctions</title>
      <p>Biological taxonomies provide a basis for conservation
and development and are used to generate interesting
questions about missing species [1, 2]. Inconsistent
taxonomies can, at the same time, hinder research or lead in
the wrong direction [3, 4, 5]. In engineering, taxonomies
play additional roles: They are vehicles for thinking about
what’s possible, for identifying unconsidered options, as
well as for establishing formal relations between
methods.</p>
      <p>Several taxonomies of interpretability methods already
exist [6, 7, 8, 9, 10, 11, 12, 13]. These taxonomies provide
us with technical terms for distinguishing approaches to
interpretability and can be efficient tools for researchers
to contextualize their work. They generate interesting
research questions – e.g., if all methods in class A but no
methods in B happen to exhibit property X, is this by
necessity, or can we design a method in B with property X? – and
help us see relations between methods – e.g., two methods
in class A are mathematically equivalent. Unfortunately,
the taxonomies that exist, without exception, have
shortcomings and are either redundant, incomplete, or
inconsistent. In §2, we show this, examining the above 10
taxonomies, one by one, also discussing between-taxonomy
inconsistencies in how individual methods are classified.</p>
      <sec id="sec-1-1">
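        <p>In §3, we present a consistent taxonomy and establish various observations and results that apply to entire classes of methods in our taxonomy. Contributions: We detect shortcomings in 10 existing taxonomies (Table 1).</p>
        <p>https://anderssoegaard.github.io/ (A. Søgaard); ORCID 0000-0001-5250-4276 (A. Søgaard)</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>10 existing taxonomies and their shortcomings: Most distinguish local from global methods, and intrinsic from posthoc methods. We argue the additional dimensions all lead to inconsistencies and/or redundancies, and that the intrinsic-posthoc distinction is itself problematic.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Taxonomy</th><th>Dimensions</th><th>Additional dimensions</th></tr>
            </thead>
            <tbody>
              <tr><td>[6]</td><td>4</td><td>time, expertise</td></tr>
              <tr><td>[7]</td><td>3</td><td>model-specific/model-agnostic</td></tr>
              <tr><td>[8]</td><td>4</td><td>pre-model/in-model/post-model, results</td></tr>
              <tr><td>[9]</td><td>4</td><td>spec./agn., results</td></tr>
              <tr><td>[10]</td><td>3</td><td>types</td></tr>
              <tr><td>[11]</td><td>3</td><td>technique</td></tr>
              <tr><td>[12]</td><td>3</td><td>methodology</td></tr>
              <tr><td>[14]</td><td>1</td><td>grad./pert./simpl.</td></tr>
              <tr><td>[15]</td><td>1</td><td>att./rule/sum.</td></tr>
              <tr><td>[13]</td><td>2</td><td>inst./approx./attr./counterf.</td></tr>
            </tbody>
          </table>
        </table-wrap>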
        <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
        <p>The simplest taxonomies presented are one-dimensional, i.e., simple groupings [14, 15]. Other taxonomies introduce up to four dimensions and use these to cross-classify existing methods. The 10 taxonomies are at most a couple of years old (2019-2021) and discussed in chronological order. We first discuss two common distinctions that are largely agreed upon: local-global and intrinsic-posthoc. One of these distinctions, local-global, will be useful, while the other is problematic in several respects. We define an interpretability method ℳ(w, S) as a complex function that takes a model w and a sample of token sequences S ⊆ X as input, and is composed of three types of functions:</p>
        <p>Definition 1.1 (Forward functions). Let forward(w, S) return w(π(x)) for all inputs x ∈ S, i.e., w<sub>1</sub>(x), …, w<sub>n</sub>(x) for n layers, with π : X ↦ X a function from input to input, e.g., perturb, delete, identity.</p>
        <p>Definition 1.2 (Backward functions). Let backward(w, S) return ρ(w<sup>−1</sup>, forward(w, S)), where ρ(⋅, ⋅) is a function that defines a backward pass of gradients, relevance scores, etc., over the inverse model w<sup>−1</sup>.</p>
        <p>Definition 1.3 (Inductive functions). Let induce(w, S) return a set of parameters v fitted by minimizing an objective over w and S.</p>
        <p>Examples of inductive functions include, for various loss functions ℓ(⋅, ⋅): (a) probing [16], in which the objective is of the form ℓ(v(x), g(w<sub>i</sub>(x))), where w<sub>i</sub>(x) is the representation of x at the i-th layer of w, and g is the probe-specific re-labeling function of samples; or (c) linear approximation [17], in which the objective is of the form ℓ(v(x), w(x)), where v is a linear function.</p>
        <p>Local and global explanations: The distinction between local and global interpretability methods is shared across all the taxonomies discussed in this paper, and will also be one of the two dimensions in the taxonomy we propose below. The distinction is defined slightly differently by different authors, or not defined at all, e.g., [6], but here we present the definition that our taxonomy below relies on: A method ℳ is said to be global if and only if it includes at least one inductive function. Otherwise ℳ is said to be local. Global methods typically require access to a representative sample of data, to minimize their objectives, whereas local methods are applicable to singleton samples.<sup>1</sup></p>
        <p>As should be clear from the discussion below, the definition in [11] is not equivalent to our definition, which uses the reliance of global methods on samples, rather than the reliance of local methods on specific instances, as the distinguishing criterion. One argument against the definition in [11] is that it is not entirely clear in what sense global methods such as concept activation vectors [18], for example, are independent of any particular input. The function that provides us with explanations is global, but of course its output depends on the input.</p>
        <p><sup>1</sup>Note that our definition does not refer to how the methods characterize the models, e.g., whether they describe individual inferences, or derive aggregate statistics that quantify ways the models are biased. This is to avoid a common source of confusion: Local methods can be used to derive aggregate statistics that characterize global properties of models. LIME [19], for example, is mostly classified as a local method ([12] classify it as both local and global), but in [19], the authors explicitly discuss how LIME can be used on i.i.d. samples to derive aggregate statistics that characterize model behavior on distributions (the same can be done for all local methods; see §3.6). Our definition makes it clear that such methods are local; local methods can be applied globally, whereas global methods cannot be applied locally. It is also clear from our definition that the two classes of interpretability methods are often motivated by different prototypical applications: Local methods are often used to explain the motivation behind critical decisions, e.g., why a customer was assessed as high-risk, why a travel review was flagged as fraudulent, or why a newspaper article was flagged as misleading, whereas global methods are used to characterize biases in models and evaluate their robustness.</p>
        <p>Challenge: When taxonomies have tried to classify interpretation methods into local and global ones, in practice, some methods have seemed harder to classify than others. Concept activation approaches [18], for example, use joint global training to learn mappings of individual examples into local explanations. Contrastive interpretability methods [20] provide explanations in terms of pairs of examples. It may also seem unclear whether a challenge dataset provides a local or global explanation. [10] discuss what they call semi-local approaches, and [8] introduce a category for interpretability methods that relate to groups of examples. Are there methods that are not easily categorized as global or local? Answer: Our definition of local-global focuses on the induction of explanations from samples. This focus enables unambiguous classification and leads us to classify concept activation methods as global, since the explanatory model component is induced from a sample (and relies on the representativity of this sample). Similarly, we classify contrastive and group methods as local methods, since they do not require induction or assume representative samples; and, finally, we classify challenge datasets as local methods, since challenge datasets also do not have to be representative.<sup>2</sup></p>
        <p><sup>2</sup>Examples of local methods include gradients [21, 22, 23], LRP [24], deep Taylor decomposition [25], integrated gradients [26, 27], DeepLift [28], direct interpretation of gate/attention weights [29], attention roll-out and flow [30], word association norms and analogies [31], time step dynamics [32], challenge datasets [33, 34, 35, 36], local uptraining [19], and influence sketching and influence functions [37]; examples of global methods include unstructured pruning, lottery tickets, dynamic sparse training, binary networks, sparse coding, gate and attention head pruning, correlation of representations [38], clustering [39, 40, 41], probing classifiers [16], concept activation [18], representer point selection [42], TracIn [43], and uptraining [44].</p>
        <p>Intrinsic and post-hoc explanations: This distinction, also called active-passive in [10] and self-explaining-ad hoc in [11], is between intrinsic methods that jointly output explanations, and methods that derive these explanations post-hoc using auxiliary techniques. While most taxonomies introduce this distinction, we argue that it is inherently problematic. Challenge: The distinction between intrinsic and posthoc methods can be hard to maintain.<sup>3</sup> Moreover, for a method to be posthoc means different things to local and global methods. A post-hoc, local method is post-hoc relative to a class inference (in the case of classification); a post-hoc, global method is post-hoc relative to training, introducing a disjoint training phase for learning the interpretability functions. Strictly speaking, the fact that ’post-hoc’ takes on two disjoint meanings for local and global methods, namely ’post-inference’ and ’post-training’, makes taxonomies that rely on both dimensions inconsistent.</p>
        <p><sup>3</sup>Consider the difference between the two global interpretability methods, concept activation vectors and probing classifiers: CAVs are trained jointly, probing classifiers sequentially. These are extremes of a (curriculum) continuum, which is hard to binarize: If a probing classifier is trained jointly with the last epoch of the model training, is the method then intrinsic or posthoc? For a real example, consider TracIn [43], in which influence functions are estimated across various training check points. Again, is TracIn intrinsic or posthoc? That the binary distinction covers a continuum makes the distinction hard to apply in practice.</p>
        <p><bold>2. Shortcomings of Taxonomies</bold></p>
        <p>We now briefly assess the 10 taxonomies, pointing out the ways in which they are inconsistent, incomplete or redundant:</p>
        <p>Guidotti et al. (2018) [6] make the local-global distinction, as well as two that relate to how explanations are communicated (how much time the user is expected to have to understand the model decisions, and how much domain knowledge and technical experience the user is expected to have). In addition to the terms local and global, they also refer, synonymously, to outcome explanation and model explanation. Later in their survey, [6] make a fourth distinction that is very similar to intrinsic-posthoc, namely between transparent design (leading to intrinsically interpretable models) and (post-hoc) black box inspection, but oddly, this is not seen as an orthogonal dimension, but as two additional classes on par with outcome and model explanation. Challenge: How to classify methods that are both, say, local and post-hoc, i.e., do outcome explanation by black-box inspection? Examples would include gradients [21, 22, 23], layer-wise relevance propagation [24], deep Taylor decomposition [25], integrated gradients [26, 27], etc.</p>
        <p>Adadi and Berrada (2019) [7] rely on the local-global and intrinsic-posthoc distinctions (referring to the latter as complexity), and, as a third dimension, they distinguish between model-agnostic and model-specific interpretability methods. Inconsistencies: We argue that the distinction between model-specific and model-agnostic methods is suboptimal in that state-of-the-art models are moving targets, and so is what counts as model-specific. This may lead to inconsistencies over time. Challenge: How do we classify a method that applies to all known methods, but not to all possible methods?</p>
        <p>Carvalho et al. (2019) [8] introduce four dimensions in their taxonomy: (a) scope, which coincides with the local-global distinction; (b) intrinsic-posthoc; (c) pre-model, in-model, and post-model, with in-model corresponding to intrinsic methods, and post-model corresponding to post-hoc methods, whereas pre-model comprises various approaches to data analysis. We argue below that (c) is both redundant and inconsistent. Finally, they introduce (d) a results dimension, which concerns the form of the explanations provided by the methods. Inconsistencies: In addition to the inconsistency of intrinsic-posthoc, including pre-model explanations leads to further taxonomic inconsistency in that pre-model approaches cannot be classified along the other dimensions in that they do not refer to models at all. For the same reason, one might argue they are not model interpretation methods in the first place. Redundancies: The redundancy of (c) follows from the observation that the distinction between in-model and post-model explanations is identical to the distinction made in (b), as well as the observation that pre-model explanations do not refer to models at all. Challenge: What is an intrinsic interpretability method that presents post-model explanations, or a post-hoc interpretability method that presents in-model explanations?</p>
        <p>Molnar (2019) [9] distinguishes between local-global and intrinsic-posthoc, between different results, and between model-specific and model-agnostic methods, making their taxonomy very similar to [8]. Inconsistencies: See discussion of [7]. Also, the results dimension is inconsistent in that explanations can, simultaneously, be intrinsically interpretable models and feature summary statistics. LIME [19], for example, presents local explanations as the linear coefficients of a linear fit, i.e., an intrinsically interpretable model that consists solely of feature summary statistics. Redundancies: The most important redundancy is that all model-agnostic interpretability methods are also post-hoc, since intrinsic methods require joint training, which in turn requires compatibility with model architectures. Moreover, model-agnostic interpretability methods are all grounded in input features and thus lead to explanations in terms of feature summary statistics or visualizations. Moreover, all explanations in terms of intrinsically interpretable models are, quite obviously, intrinsic. Challenge: What is a post-hoc interpretability method whose explanations are intrinsically interpretable models?</p>
        <p>Zhang et al. (2020) [10] rely on these dimensions: (a) global-local; (b) intrinsic-posthoc (which they call active-passive); and (c) a distinction between four explanation types, namely examples, attribution, hidden semantics, and rules. Inconsistencies: The explanation type dimension in [10] conflates (a) the model components we are trying to explain, and (b) what the explanations look like. Hidden semantics, e.g., is a model component, whereas examples and rules refer to the (syntactic) form of the explanations. The distinction between hidden semantics and attribution is also apparent. Hidden semantics can be used to derive attribution (a results type in [8] and [9]), e.g., in LSTMVis [45]; this is because hidden semantics is not a type of explanation, but a model component. Attribution, examples, and rules are types of explanations, but this list is not exhaustive, since explanations can also be in terms of concepts, free texts, or visualizations, for example. Challenge: What is a passive interpretability method that does not provide local explanations?</p>
        <p>Danilevsky et al. (2020) [11] only distinguish between global-local and intrinsic-posthoc (which they call self-explaining and ad-hoc) methods. Inconsistencies: [11] say most attribution methods are global and ad-hoc. We argue attribution methods are necessarily local, and while aggregate statistics can of course be computed across real or synthetic corpora, little is gained by blurring taxonomies to reflect that. All local methods can be used to compute summary statistics. Incompleteness: [11] admit their survey is biased toward local methods, and many global interpretability methods are left uncovered. Challenge: What is a local interpretability method that cannot be used to compute summary statistics?</p>
        <p>Das et al. (2020) [12] distinguish between local and global methods, gradient-based and perturbation-based methods (methodology), and intrinsic and post-hoc methods (usage). Their taxonomy is both incomplete and redundant. Incompleteness: Several approaches are neither gradient-based nor perturbation-based. Redundancies: All gradient-based approaches are classified as post-hoc approaches in [12]; similarly, all intrinsic methods are classified as global methods. Of course these cells may be filled with methods that were not covered, but in particular, it seems that gradient-based approaches are, almost always, post-hoc? Challenge: What is an intrinsic, gradient-based approach?</p>
        <p>Atanasova et al. (2020) [14] distinguish between three classes of interpretability methods: gradient-based, perturbation-based, and simplification-based methods. Inconsistencies: The distinction between gradient-based and perturbation-based methods is similar to [12], but the two classifications are inconsistent, with [14] citing LIME [19] as a simplification-based method. It seems that the distinction between perturbation-based and simplification-based methods is in itself inconsistent in that both perturbations and gradients can be used to simplify models; similarly, perturbations can be used to baseline gradient-based approaches. Incompleteness: Clearly, not all interpretability methods are gradient-based, perturbation-based or simplification-based: Other methods are based on weight magnitudes, carefully designed example templates, visualizing and quantifying attention weights or gating mechanisms. Challenge: How would you classify attention roll-out [30], for example?</p>
        <p>Kotonya and Toni (2020) [15] distinguish between attention-based explanations, explanations as rule discovery, and explanations as summarization. Incompleteness: Using gating mechanisms to interpret models, e.g., does not fit any of the three categories. Inconsistencies: One class of methods is defined in terms of the model components being interpreted (attention-based), and another class in terms of the form of explanations they provide (rule discovery and summarization). Mixing orthogonal dimensions is inconsistent, i.e., methods can belong to several categories, e.g., attention head pruning [46] (attention-based and summarization), or when rules are induced from attention weights [47].</p>
        <p>Chen et al. (2021) [13] introduce the global-local distinction, but not the intrinsic-posthoc distinction. In addition they distinguish between interpretability methods that present explanations in terms of training instances, approximations, feature attribution and counterfactuals. Inconsistencies: The second dimension again makes orthogonal distinctions. Approximations, for example, can be used to attribute importance to features (LIME). Incompleteness: Concepts, attention weights, gate activations, rules, etc., are not covered by the second dimension. Redundancies: All methods that present explanations in terms of training instances are necessarily local. Challenge: What’s a global interpretability method providing explanations in terms of training instances?<sup>4</sup></p>
        <p><sup>4</sup>Several of the above taxonomies include dimensions that pertain to the form of the output of interpretation methods. We argue such distinctions are orthogonal to the methods and should therefore not be included in taxonomies. To see this, note that most interpretability methods, e.g., LIME, can provide explanations of different form: aggregate statistics, coefficients, rules, visualizations, etc.</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Left: 4/6 methods (bottom half) are classified incoherently across taxonomies. Explanation: local (L), global (G), intrinsic (I), and posthoc (H). Right: Our novel taxonomy.</p>
          </caption>
          <table>
            <thead>
              <tr><th></th><th>Forward</th><th>Backward</th></tr>
            </thead>
            <tbody>
              <tr><td>Local</td><td>Attention, Attention roll-out, Attention flow, Time step dynamics, Local uptraining, Influence functions</td><td>Gradients, Layer-wise relevance propagation, Deep Taylor decomposition, Integrated gradients, DeepLift</td></tr>
              <tr><td>Global</td><td>Weight pruning, Clustering, Probing classifiers, Uptraining, Correlation of representations</td><td>Dynamic sparse training, Binary networks, Gradient-based weight pruning, Sparse coding, Concept activation</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The left panel covers DeepLift, GradCAM, LRP, LIME, IF, and TCAV across taxonomies [6]-[12]; LIME, for instance, is classified as L in [6], L-H in [7] and [8], L in [9], L-H in [10] and [11], and L/G-H in [12].</p>
        <p>Inconsistent Classifications: Table 2 shows that taxonomies are not only internally inconsistent, but also inconsistent in how they classify methods. Six methods were mentioned by more than one survey, 4/6 of which were classified differently.
</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. A Novel Taxonomy and Observations</title>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <p>Our taxonomy is two-dimensional: One is local-global, the other a distinction between explanations based on forward passes, and explanations based on backward passes. The forward explanations correlate intermediate representations or continuous or discrete output representations to obtain explanations, whereas backward explanations concern training dynamics. We define forward-backward:</p>
        <p>Definition 3.1 (Forward-backward). A method ℳ is said to be backward if it contains backward functions; otherwise, ℳ is said to be forward.</p>
        <p>Local backward methods include gradients [21, 22, 23], integrated gradients [26, 27], layerwise relevance propagation [48], DeepLIFT [28], and deep Taylor decomposition [25], which all derive explanations for individual instances from what is normally used as training signals, typically based on derivatives of the loss function (gradients) evaluating h on training data, e.g., ∇(ℓ(h(x<sub>i</sub>), y<sub>i</sub>)). Global backward methods rely on such training signals to modify or extend the model parameters w associated with h, typically extracting approximations, rules or visualizations.<sup>5</sup></p>
        <p><sup>5</sup>Local forward methods either consider intermediate representations, e.g., gates [49], attention [29], attention flow [50], etc.; continuous output representations, e.g., using word association norms [51] or word analogies [52, 53]; or discrete output, such as when evaluating on challenge datasets [33, 34, 35, 36], or when approximating the model’s output distribution [19, 54, 37]. In the same way, global forward methods can rely on intermediate representations in forward passes, e.g., in attention head pruning [46], attention factor analysis [55], syntactic decoding of attention heads [50], attention head manipulation [56], etc.; continuous output in forward passes, including work using clustering in the vector space to manually analyze model representations [57, 58], probing classifiers [16], and concept activation strategies [18]; or on discrete output, e.g., in uptraining [44] and knowledge distillation [59].</p>
        <p>Observation 3.1. Local backward methods are always attribution methods (presenting feature summary statistics). Since local methods have to provide explanations in terms of input/output (as they do not modify weights), and since backward passes do not generate output distributions, they have to present explanations in terms of attribution of relevance or gradients to input features or input segments. §3.1 is empirical. It follows naturally, but not necessarily, from the fact that backward methods reverse the direction of connections, thus returning quantities that hold for the input nodes. Pre-input quantities are not interpretable.</p>
        <p>Observation 3.2. Only global methods can be unfaithful. §3.2 follows from the definition of faithfulness: ℳ is faithful if the inductive functions of ℳ have ℓ(v, S) = 0 for S drawn i.i.d. from the data distribution. ℳ can only be unfaithful with respect to inductive component functions; local methods can therefore not be unfaithful.<sup>6</sup></p>
        <p>Observation 3.3. Global methods can at best be ε-faithful and only on i.i.d. instances. §3.3 follows from the fact that standard learning theory applies to the inductive component functions of global interpretability methods. Since their faithfulness is the inverse of the empirical risk of these inductions, it follows that global methods can at best be ε-faithful, with ε the expected loss of these inductions. Note that when the explanation is a model approximation w′, ε = 𝔼[ℓ(w(x), w′(x))].</p>
        <p>Observation 3.4. Only forward methods are used for local layer-wise analysis. Since local backward methods are attribution methods (§3.1), and layer-wise analysis concerns differences between layers, local backward methods cannot be used here, simply because they only output attributions at the input level. §3.4 thus follows from §3.1, making it, too, an empirical observation, not a formal derivation.</p>
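        <p>The two definitions behind the taxonomy (a method is global if and only if it includes an inductive function, and backward if it contains backward functions) can be sketched in a few lines of Python. This is only an illustration: the names Component, Method, and classify are hypothetical, and the component sets assigned to the three example methods are assumptions made for the sketch, chosen to agree with how the text classifies those methods.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The three types of component functions (Definitions 1.1-1.3)."""
    FORWARD = "forward"
    BACKWARD = "backward"
    INDUCTIVE = "inductive"


@dataclass(frozen=True)
class Method:
    name: str
    components: frozenset  # Component members the method is composed of


def classify(method):
    """Return (scope, direction) for a method.

    global iff the method includes at least one inductive function;
    backward iff it contains backward functions.
    """
    scope = "global" if Component.INDUCTIVE in method.components else "local"
    direction = "backward" if Component.BACKWARD in method.components else "forward"
    return scope, direction


# Component sets below are illustrative assumptions, not taken from the paper.
METHODS = [
    Method("integrated gradients", frozenset({Component.FORWARD, Component.BACKWARD})),
    Method("attention weights", frozenset({Component.FORWARD})),
    Method("probing classifier", frozenset({Component.FORWARD, Component.INDUCTIVE})),
]

labels = {m.name: classify(m) for m in METHODS}
for name, label in labels.items():
    print(name, label)
```
]]></preformat>
        <p>Because each label is computed from the component set alone, every method lands in exactly one of the four cells, which is the property the observations in this section rely on.</p>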
        <p>Observation 3.5. No equivalence relations can hold
across the four categories of methods.
§3.5 follows from the disjointness of the three sets of
component functions, and how the four classes are defined,
i.e., that global functions cannot be local, and forward
functions cannot be backward. Equivalences between
methods have already been found [28, 26, 27, 60, 61], but
consistent taxonomic classification effectively prunes the
search space of possible equivalences.</p>
        <p>Observation 3.6. Local methods can always characterize
models globally on i.i.d. samples.
§3.6 states that any local method that derives quantities
for an example can be used to aggregate corpus-level
statistics for appropriate-level samples. See [19] for how
to do this with LIME. It should be easy to see how this
result generalizes to all other local methods.</p>
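        <p>As a minimal sketch of §3.6, the snippet below aggregates a local attribution over a sample. It assumes a toy linear scorer and uses gradient-times-input as the local method; the function names, weights, and sample values are all made up for illustration, and the mean absolute attribution is just one possible corpus-level statistic.</p>
        <preformat><![CDATA[
```python
def gradient_times_input(weights, x):
    """Local attribution for a linear scorer f(x) = sum_j w_j * x_j.

    The gradient of f with respect to x_j is w_j, so gradient * input
    gives one attribution per input feature for this single example.
    """
    return [w * v for w, v in zip(weights, x)]


def aggregate(weights, sample):
    """Corpus-level statistic: mean absolute attribution per feature."""
    totals = [0.0] * len(weights)
    for x in sample:
        for j, a in enumerate(gradient_times_input(weights, x)):
            totals[j] += abs(a)
    return [t / len(sample) for t in totals]


# Hypothetical model weights and a small sample, assumed to be i.i.d.
weights = [2.0, -1.0, 0.0]
sample = [
    [1.0, 1.0, 5.0],
    [0.0, 2.0, 3.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

local_explanation = gradient_times_input(weights, sample[0])
global_importance = aggregate(weights, sample)
print(local_explanation)   # [2.0, -1.0, 0.0]
print(global_importance)   # [2.0, 1.0, 0.0]
```
]]></preformat>
        <p>No parameters are fitted at any point, so under the definition above the method stays local even though its output is summarized over a sample.</p>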
      </sec>
      <sec id="sec-3-2">
        <p><sup>6</sup>Local methods compute quantities based on forward or backward passes, but these quantities are not induced to simulate anything. Global methods induce parameters to simulate a distribution and can be more or less faithful to this distribution, but since local methods simply ’read off’ their quantities, they cannot be unfaithful. Only, the quantities can be misinterpreted.</p>
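        <p>The reasoning behind Observations 3.2 and 3.3 can be made concrete with a small sketch under stated assumptions: a stand-in "model" (here simply the square function), a linear approximation induced by least squares, and squared error as the loss. None of these choices come from the paper; they only illustrate that an induced explanation has a non-zero expected loss, and that this loss is only controlled on samples drawn from the distribution it was fitted on.</p>
        <preformat><![CDATA[
```python
import random


def model(x):
    """Stand-in for the model w being explained (an assumption)."""
    return x * x


def induce_linear(xs, ys):
    """Fit v(x) = a * x + b by ordinary least squares (an inductive function)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def empirical_loss(a, b, xs):
    """Mean squared difference between the model and its approximation."""
    return sum((model(x) - (a * x + b)) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
train = [rng.uniform(0.0, 1.0) for _ in range(200)]
iid_test = [rng.uniform(0.0, 1.0) for _ in range(200)]   # same distribution
shifted = [rng.uniform(2.0, 3.0) for _ in range(200)]    # different distribution

a, b = induce_linear(train, [model(x) for x in train])
eps_iid = empirical_loss(a, b, iid_test)
eps_shifted = empirical_loss(a, b, shifted)
print(eps_iid, eps_shifted)
```
]]></preformat>
        <p>The loss on the i.i.d. test sample is small but not zero, and it grows sharply on the shifted sample; a local method has no analogous quantity because it fits nothing.</p>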
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>in: NAACL, New Orleans, Louisiana, 2018.</p>
      <p>[30] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, in: ACL, Online, 2020.</p>
      <p>[31] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting similarities among languages for machine translation, CoRR abs/1309.4168 (2013). arXiv:1309.4168.</p>
      <p>[32] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, 2017. arXiv:1606.07461.</p>
      <p>[33] M. Richardson, C. J. Burges, E. Renshaw, MCTest: A challenge dataset for the open-domain machine comprehension of text, in: EMNLP, 2013.</p>
      <p>[34] J. Mullenbach, J. Gordon, N. Peng, J. May, Do nuclear submarines have nuclear captains? a challenge dataset for commonsense reasoning over adjectives and objects, in: EMNLP, Hong Kong, China, 2019.</p>
      <p>[35] K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, C. Cardie, DREAM: A challenge data set and models for dialogue-based reading comprehension, Transactions of the Association for Computational Linguistics 7 (2019) 217–231.</p>
      <p>[36] N. F. Liu, R. Schwartz, N. A. Smith, Inoculation by fine-tuning: A method for analyzing challenge datasets, in: NAACL, Minneapolis, Minnesota, 2019.</p>
      <p>[37] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML, 2017.</p>
      <p>[38] N. Kriegeskorte, M. Mur, P. Bandettini, Representational similarity analysis – connecting the branches of systems neuroscience, Frontiers in Systems Neuroscience 3 (2008).</p>
      <p>[39] T. A. Trost, D. Klakow, Parameter free hierarchical graph-based clustering for analyzing continuous word embeddings, in: TextGraphs, Vancouver, Canada, 2017.</p>
      <p>[40] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? a closer look at polysemous words, in: BlackboxNLP, Online, 2020.</p>
      <p>[41] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: ACL, Online, 2020.</p>
      <p>[42] C.-K. Yeh, J. S. Kim, I. E. H. Yen, P. Ravikumar, Representer point selection for explaining deep neural networks, 2018. arXiv:1811.09720.</p>
      <p>[43] D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, Z. C. Lipton, Learning to deceive with attention-based explanations, in: ACL, Online, 2020.</p>
      <p>[44] S. Petrov, P.-C. Chang, M. Ringgaard, H. Alshawi, Uptraining for accurate deterministic question parsing, in: EMNLP, 2010.</p>
      <p>[45] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 667–676. doi:10.1109/TVCG.2017.2744158.</p>
      <p>[46] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, in: ACL, Florence, Italy, 2019.</p>
      <p>[47] T. Ruzsics, O. Sozinova, X. Gutierrez-Vasques, T. Samardzic, Interpretability for morphological inflection: from character-level predictions to subword-level rules, in: EACL, Online, 2021.</p>
      <p>[48] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, PLoS ONE (????).</p>
      <p>[49] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, M. Baroni, The emergence of number and syntax units in LSTM language models, in: NAACL, Minneapolis, Minnesota, 2019.</p>
      <p>[50] V. Ravishankar, A. Kulmizev, M. Abdou, A. Søgaard, J. Nivre, Attention can reflect syntactic structure (if you let it), in: EACL, Online, 2021.</p>
      <p>[51] K. W. Church, P. Hanks, Word association norms, mutual information, and lexicography, in: ACL, Vancouver, British Columbia, Canada, 1989.</p>
      <p>[52] T. Mikolov, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NeurIPS, 2013.</p>
      <p>[53] N. Garneau, M. Hartmann, A. Sandholm, S. Ruder, I. Vulic, A. Søgaard, Analogy training multilingual encoders, in: AAAI, 2021.</p>
      <p>[54] D. Alvarez-Melis, T. Jaakkola, A causal framework for explaining the predictions of black-box sequence-to-sequence models, in: EMNLP, Copenhagen, Denmark, 2017.</p>
      <p>[55] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Attention is not only a weight: Analyzing transformers with vector norms, in: EMNLP, Online, 2020.</p>
      <p>[56] S. Vashishth, S. Upadhyay, G. S. Tomar, M. Faruqui, Attention interpretability across NLP tasks, 2019. arXiv:1909.11218.</p>
      <p>[57] K. Heylen, D. Speelman, D. Geeraerts, Looking at word meaning. an interactive visualization of semantic vector spaces for Dutch synsets, in: LINGVIS &amp; UNCLH, Avignon, France, 2012.</p>
      <p>[58] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, in: NeurIPS, volume 32, 2019.</p>
      <p>[59] Y. Kim, A. M. Rush, Sequence-level knowledge distillation, in: EMNLP, Austin, Texas, 2016.</p>
      <p>[60] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Towards better understanding of gradient-based attribution methods for deep neural networks, in: ICLR, 2018.</p>
      <p>[61] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applica</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>