=Paper=
{{Paper
|id=Vol-3318/short19
|storemode=property
|title=Shortcomings of Interpretability Taxonomies for Deep Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-3318/short19.pdf
|volume=Vol-3318
|authors=Anders Søgaard
|dblpUrl=https://dblp.org/rec/conf/cikm/Sogaard22
}}
==Shortcomings of Interpretability Taxonomies for Deep Neural Networks==
Anders Søgaard¹ ² ³

¹ Dpt. of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2200 Copenhagen
² Pioneer Centre for Artificial Intelligence, Lyngbyvej 2, DK-2100 Copenhagen
³ Dpt. of Philosophy, Karen Blixens Plads 8, DK-2300 Copenhagen

AIMLAI 2022: Advances in Interpretable Machine Learning and Artificial Intelligence, October 21, 2022, Atlanta, GA. soegaard@di.ku.dk (A. Søgaard), https://anderssoegaard.github.io/, ORCID 0000-0001-5250-4276. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract: Taxonomies are vehicles for thinking about what is possible, for identifying unconsidered options, as well as for establishing formal relations between entities. We identify several shortcomings in 10 existing taxonomies of interpretability methods for explainable artificial intelligence (XAI), focusing on methods for deep neural networks. The shortcomings include redundancies, incompleteness, and inconsistencies. We design a new taxonomy based on two orthogonal dimensions and show how it can be used to derive results about entire classes of interpretability methods for deep neural networks.

Keywords: interpretability, taxonomy

===1. Two Common Distinctions===

Biological taxonomies provide a basis for conservation and development and are used to generate interesting questions about missing species [1, 2]. Inconsistent taxonomies can, at the same time, hinder research or lead it in the wrong direction [3, 4, 5]. In engineering, taxonomies play additional roles: they are vehicles for thinking about what is possible, for identifying unconsidered options, as well as for establishing formal relations between methods.

Several taxonomies of interpretability methods already exist [6, 7, 8, 9, 10, 11, 12, 13]. These taxonomies provide us with technical terms for distinguishing approaches to interpretability and can be efficient tools for researchers to contextualize their work. They generate interesting research questions – e.g., if all methods in class A but no methods in class B happen to exhibit property X, is this by necessity, or can we design a method in B with property X? – and help us see relations between methods – e.g., that two methods in class A are mathematically equivalent. Unfortunately, the taxonomies that exist, without exception, have shortcomings and are either redundant, incomplete, or inconsistent. In §2, we show this, examining the above 10 taxonomies one by one, and also discussing between-taxonomy inconsistencies in how individual methods are classified. In §3, we present a consistent taxonomy and establish various observations and results that apply to entire classes of methods in our taxonomy.

Contributions: (a) We detect inadequacies in 10 interpretability taxonomies. (b) We establish a simple, yet superior, two-dimensional taxonomy. (c) We derive six non-trivial observations or results based on this taxonomy.

{| class="wikitable"
|+ Table 1: The 10 existing taxonomies, the number of dimensions each introduces, and the dimensions they add beyond local-global and intrinsic-posthoc. Most distinguish local from global methods, and intrinsic from posthoc methods. We argue the additional dimensions all lead to inconsistencies and/or redundancies, and that the intrinsic-posthoc distinction is itself problematic (the per-taxonomy shortcomings are detailed in §2).
! Survey !! Dimensions !! Additional dimensions
|-
| [6] || 4 || time, expertise
|-
| [7] || 3 || model-specific/model-agnostic
|-
| [8] || 4 || pre-model/in-model/post-model, results
|-
| [9] || 4 || spec./agn., results
|-
| [10] || 3 || types
|-
| [11] || 3 || technique
|-
| [12] || 3 || methodology
|-
| [14] || 1 || grad./pert./simpl.
|-
| [15] || 1 || att./rule/sum.
|-
| [13] || 2 || inst./approx./attr./counterf.
|}

The simplest taxonomies presented are one-dimensional, i.e., simple groupings [14, 15]. Other taxonomies introduce up to four dimensions and use these to cross-classify existing methods. The 10 taxonomies are at most a couple of years old (2019–2021) and are discussed in chronological order. We first discuss two common distinctions that are largely agreed upon: local-global and intrinsic-posthoc. One of these distinctions, local-global, will be useful, while the other is problematic in several respects. We define an interpretability method ℳ(w, 𝑆) as a complex function that takes a model w and a sample of token sequences 𝑆 ⊆ 𝒮 as input, and is composed of three types of functions:

Definition 1.1 (Forward functions). Let forward(w, 𝑆) return w(𝑓(𝑠)) for all inputs 𝑠 ∈ 𝑆, i.e., w₁(𝑠), …, wₙ(𝑠) for 𝑛 layers, with 𝑓 : 𝒮 ↦ 𝒮 a function from input to input, e.g., perturb, delete, identity.

Definition 1.2 (Backward functions). Let backward(w, 𝑆) return 𝑔(w⁻¹, forward(w, 𝑆)), where 𝑔(⋅, ⋅) is a function that defines a backward pass of gradients, relevance scores, etc., over the inverse model w⁻¹.

Definition 1.3 (Inductive functions). Let induce(w, 𝑆) return a set of parameters v fitted by minimizing an objective over w and 𝑆.

Examples of inductive functions include, for various loss functions ℓ(⋅, ⋅): (a) probing [16], in which the objective is of the form ℓ(v(𝑆), 𝑙(wⱼ(𝑆))), where wⱼ(𝑠) is the representation of 𝑠 at the 𝑗-th layer of w, and 𝑙 is the probe-specific re-labeling function of samples; or (b) linear approximation [17], in which the objective is of the form ℓ(v(𝑆), w(𝑆)), where v is a linear function.
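To make the three component-function types concrete, the following is a minimal illustrative sketch (ours, not from the paper), using numpy and a toy two-layer linear model: forward returns the layer outputs for (optionally perturbed) inputs, backward returns input gradients, and induce fits a linear approximation v by least squares, i.e., the linear-approximation objective ℓ(v(𝑆), w(𝑆)) mentioned above.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear model w: its layer outputs stand in for w_1(s), ..., w_n(s).
W1 = rng.normal(size=(4, 8))   # hypothetical layer-1 weights
W2 = rng.normal(size=(8, 1))   # hypothetical layer-2 weights

def forward(S, f=lambda s: s):
    """Definition 1.1: return the layer outputs w_1(f(s)), ..., w_n(f(s)) for each s in S.
    f is an input-to-input function, e.g. identity (default) or a perturbation."""
    return [(f(s) @ W1, (f(s) @ W1) @ W2) for s in S]

def backward(S):
    """Definition 1.2: a backward pass over the model, here returning the input
    gradient d w(s)/d s, which for this linear model is W1 @ W2 for every s."""
    return [(W1 @ W2).ravel() for _ in S]

def induce(S):
    """Definition 1.3: fit parameters v by minimizing an objective over w and S;
    here the linear-approximation objective ||v(S) - w(S)||^2 via least squares."""
    X = np.stack(S)
    y = np.array([out.item() for _, out in forward(S)])
    v, *_ = np.linalg.lstsq(X, y, rcond=None)
    return v

S = [rng.normal(size=4) for _ in range(32)]
print(backward(S)[0])  # a per-input attribution (constant here: the model is linear)
print(induce(S))       # induced surrogate parameters v (recovers W1 @ W2 exactly)
</syntaxhighlight>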
Local and global explanations. The distinction between local and global interpretability methods is shared across all the taxonomies discussed in this paper, and will also be one of the two dimensions in the taxonomy we propose below. The distinction is defined slightly differently by different authors, or not defined at all, e.g., [6], but here we present the definition that our taxonomy below relies on: a method ℳ is said to be global if and only if it includes at least one inductive function; otherwise ℳ is said to be local. Global methods typically require access to a representative sample of data, to minimize their objectives, whereas local methods are applicable to singleton samples.¹ As should be clear from the discussion below, this definition is not equivalent to definitions that take the reliance of local methods on specific instances as the distinguishing criterion; ours instead takes the reliance of global methods on samples. One argument against the definition in [11], for example, is that it is not entirely clear in what sense global methods such as concept activation vectors [18] are independent of any particular input: the function that provides us with the explanations is global, but its output of course depends on the input.

¹ Note that our definition does not refer to how the methods characterize the models, e.g., whether they describe individual inferences or derive aggregate statistics that quantify ways the models are biased. This is to avoid a common source of confusion: local methods can be used to derive aggregate statistics that characterize global properties of models. LIME [19], for example, is mostly classified as a local method ([12] classify it as both local and global), but in [19] the authors explicitly discuss how LIME can be used on i.i.d. samples to derive aggregate statistics that characterize model behavior on distributions (the same can be done for all local methods; see §3.6). Our definition makes it clear that such methods are local; local methods can be applied globally, whereas global methods cannot be applied locally. It is also clear from our definition that the two classes of interpretability methods are often motivated by different prototypical applications: local methods are often used to explain the motivation behind critical decisions, e.g., why a customer was assessed as high-risk, why a travel review was flagged as fraudulent, or why a newspaper article was flagged as misleading, whereas global methods are used to characterize biases in models and evaluate their robustness.
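The definition above yields a purely mechanical classification rule. A small sketch of it (ours; the COMPONENTS registry below is a hypothetical declaration of which component-function types each method composes, following the paper's own classifications):

<syntaxhighlight lang="python">
# Hypothetical registry: which component-function types each method composes.
COMPONENTS = {
    "gradient saliency":  {"forward", "backward"},   # no induction over a sample
    "attention weights":  {"forward"},
    "probing classifier": {"forward", "inductive"},  # fits probe parameters v on S
}

def is_global(method: str) -> bool:
    """A method is global iff it includes at least one inductive function."""
    return "inductive" in COMPONENTS[method]

for m in COMPONENTS:
    print(f"{m}: {'global' if is_global(m) else 'local'}")
# gradient saliency: local, attention weights: local, probing classifier: global
</syntaxhighlight>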
Challenge: When taxonomies have tried to classify interpretation methods into local and global ones, in practice some methods have seemed harder to classify than others. Concept activation approaches [18], for example, use joint global training to learn mappings of individual examples into local explanations. Contrastive interpretability methods [20] provide explanations in terms of pairs of examples. It may also seem unclear whether a challenge dataset provides a local or a global explanation. [10] discuss what they call semi-local approaches, and [8] introduce a category for interpretability methods that relate to groups of examples. Are there methods that are not easily categorized as global or local? Answer: Our definition of local-global focuses on the induction of explanations from samples. This focus enables unambiguous classification and leads us to classify concept activation methods as global, since the explanatory model component is induced from a sample (and relies on the representativity of this sample). Similarly, we classify contrastive and group methods as local methods, since they do not require induction or assume representative samples; and, finally, we classify challenge datasets as local methods, since challenge datasets also do not have to be representative.²

² Examples of local methods include gradients [21, 22, 23], LRP [24], deep Taylor decomposition [25], integrated gradients [26, 27], DeepLift [28], direct interpretation of gate/attention weights [29], attention roll-out and flow [30], word association norms and analogies [31], time step dynamics [32], challenge datasets [33, 34, 35, 36], local uptraining [19], and influence sketching and influence functions [37]. Examples of global methods include unstructured pruning, lottery tickets, dynamic sparse training, binary networks, sparse coding, gate and attention head pruning, correlation of representations [38], clustering [39, 40, 41], probing classifiers [16], concept activation [18], representer point selection [42], TracIn [43], and uptraining [44].

Intrinsic and post-hoc explanations. This distinction, also called active-passive in [10] and self-explaining vs. ad-hoc in [11], is between intrinsic methods that jointly output explanations, and methods that derive these explanations post-hoc using auxiliary techniques. While most taxonomies introduce this distinction, we argue that it is inherently problematic. Challenge: The distinction between intrinsic and posthoc methods can be hard to maintain.³ Moreover, for a method to be posthoc means different things for local and global methods. A post-hoc, local method is post-hoc relative to a class inference (in the case of classification); a post-hoc, global method is post-hoc relative to training, introducing a disjoint training phase for learning the interpretability functions. Strictly speaking, the fact that 'post-hoc' takes on two disjoint meanings for local and global methods, namely 'post-inference' and 'post-training', makes taxonomies that rely on both dimensions inconsistent.

³ Consider the difference between the two global interpretability methods concept activation vectors and probing classifiers: CAVs are trained jointly, probing classifiers sequentially. These are extremes of a (curriculum) continuum, which is hard to binarize: if a probing classifier is trained jointly with the last epoch of the model training, is the method then intrinsic or posthoc? For a real example, consider TracIn [43], in which influence functions are estimated across various training checkpoints. Again, is TracIn intrinsic or posthoc? That the binary distinction covers a continuum makes the distinction hard to apply in practice.

===2. Shortcomings of Taxonomies===

We now briefly assess the 10 taxonomies, pointing out the ways in which they are inconsistent, incomplete, or redundant.

Guidotti et al. (2018) [6] make the local-global distinction, as well as two distinctions that relate to how explanations are communicated (how much time the user is expected to have to understand the model decisions, and how much domain knowledge and technical experience the user is expected to have). In addition to the terms local and global, they also refer, synonymously, to outcome explanation and model explanation. Later in their survey, [6] make a fourth distinction that is very similar to intrinsic-posthoc, namely between transparent design (leading to intrinsically interpretable models) and (post-hoc) black-box inspection, but oddly, this is not seen as an orthogonal dimension, but as two additional classes on par with outcome and model explanation. Challenge: How do we classify methods that are both, say, local and post-hoc, i.e., do outcome explanation by black-box inspection? Examples would include gradients [21, 22, 23], layer-wise relevance propagation [24], deep Taylor decomposition [25], integrated gradients [26, 27], etc.

Adadi and Berrada (2019) [7] rely on the local-global and intrinsic-posthoc distinctions (referring to the latter as complexity), and, as a third dimension, they distinguish between model-agnostic and model-specific interpretability methods. Inconsistencies: We argue that the distinction between model-specific and model-agnostic methods is suboptimal in that state-of-the-art models are moving targets, and so is what counts as model-specific. This may lead to inconsistencies over time. Challenge: How do we classify a method that applies to all known models, but not to all possible models?

Carvalho et al. (2019) [8] introduce four dimensions in their taxonomy: (a) scope, which coincides with the local-global distinction; (b) intrinsic-posthoc; (c) pre-model, in-model, and post-model, with in-model corresponding to intrinsic methods and post-model corresponding to post-hoc methods, whereas pre-model comprises various approaches to data analysis; we argue below that (c) is both redundant and inconsistent. Finally, they introduce (d) a results dimension, which concerns the form of the explanations provided by the methods. Inconsistencies: In addition to the inconsistency of intrinsic-posthoc, including pre-model explanations leads to further taxonomic inconsistency in that pre-model approaches cannot be classified along the other dimensions, since they do not refer to models at all. For the same reason, one might argue they are not model interpretation methods in the first place. Redundancies: The redundancy of (c) follows from the observation that the distinction between in-model and post-model explanations is identical to the distinction made in (b), as well as the observation that pre-model explanations do not refer to models at all. Challenge: What is an intrinsic interpretability method that presents post-model explanations, or a post-hoc interpretability method that presents in-model explanations?

Molnar (2019) [9] distinguishes between local-global and intrinsic-posthoc, between different results, and between model-specific and model-agnostic methods, making their taxonomy very similar to [8]. Inconsistencies: See the discussion of [7]. Also, the results dimension is inconsistent in that explanations can, simultaneously, be intrinsically interpretable models and feature summary statistics. LIME [19], for example, presents local explanations as the coefficients of a linear fit, i.e., an intrinsically interpretable model that consists solely of feature summary statistics. Redundancies: The most important redundancy is that all model-agnostic interpretability methods are also post-hoc, since intrinsic methods require joint training, which in turn requires compatibility with model architectures. Moreover, model-agnostic interpretability methods are all grounded in input features and thus lead to explanations in terms of feature summary statistics or visualizations. Moreover, all explanations in terms of intrinsically interpretable models are, quite obviously, intrinsic. Challenge: What is a post-hoc interpretability method whose explanations are intrinsically interpretable models?

Zhang et al. (2020) [10] rely on these dimensions: (a) global-local; (b) intrinsic-posthoc (which they call active-passive); and (c) a distinction between four explanation types, namely examples, attribution, hidden semantics, and rules. Inconsistencies: The explanation type dimension in [10] conflates (a) the model components we are trying to explain, and (b) what the explanations look like. Hidden semantics, e.g., is a model component, whereas examples and rules refer to the (syntactic) form of the explanations. The distinction between hidden semantics and attribution is also problematic: hidden semantics can be used to derive attribution (a results type in [8] and [9]), e.g., in LSTMVis [45]; this is because hidden semantics is not a type of explanation, but a model component. Attribution, examples, and rules are types of explanations, but this list is not exhaustive, since explanations can also be in terms of concepts, free text, or visualizations, for example. Challenge: What is a passive interpretability method that does not provide local explanations?

Danilevsky et al. (2020) [11] only distinguish between global-local and intrinsic-posthoc (which they call self-explaining and ad-hoc) methods. Inconsistencies: [11] say most attribution methods are global and ad-hoc. We argue attribution methods are necessarily local, and while aggregate statistics can of course be computed across real or synthetic corpora, little is gained by blurring taxonomies to reflect that; all local methods can be used to compute summary statistics. Incompleteness: [11] admit their survey is biased toward local methods, and many global interpretability methods are left uncovered. Challenge: What is a local interpretability method that cannot be used to compute summary statistics?

Das et al. (2020) [12] distinguish between local and global methods, between gradient-based and perturbation-based methods (methodology), and between intrinsic and post-hoc methods (usage). Their taxonomy is both incomplete and redundant. Incompleteness: Several approaches are neither gradient-based nor perturbation-based. Redundancies: All gradient-based approaches are classified as post-hoc approaches in [12]; similarly, all intrinsic methods are classified as global methods. Of course these cells may be filled with methods that were not covered, but it seems, in particular, that gradient-based approaches are almost always post-hoc. Challenge: What is an intrinsic, gradient-based approach?

Atanasova et al. (2020) [14] distinguish between three classes of interpretability methods: gradient-based, perturbation-based, and simplification-based methods. Inconsistencies: The distinction between gradient-based and perturbation-based methods is similar to [12], but the two classifications are inconsistent, with [14] citing LIME [19] as a simplification-based method. It seems that the distinction between perturbation-based and simplification-based methods is in itself inconsistent in that both perturbations and gradients can be used to simplify models; similarly, perturbations can be used to baseline gradient-based approaches. Incompleteness: Clearly, not all interpretability methods are gradient-based, perturbation-based, or simplification-based: other methods are based on weight magnitudes, carefully designed example templates, or visualizing and quantifying attention weights or gating mechanisms. Challenge: How would you classify attention roll-out [30], for example?

Kotonya and Toni (2020) [15] distinguish between attention-based explanations, explanations as rule discovery, and explanations as summarization. Incompleteness: Using gating mechanisms to interpret models, e.g., does not fit any of the three categories. Inconsistencies: One class of methods is defined in terms of the model components being interpreted (attention-based), and another class in terms of the form of explanations they provide (rule discovery and summarization). Mixing orthogonal dimensions is inconsistent, i.e., methods can belong to several categories, e.g., attention head pruning [46] (attention-based and summarization), or when rules are induced from attention weights [47].

Chen et al. (2021) [13] introduce the global-local distinction, but not the intrinsic-posthoc distinction. In addition, they distinguish between interpretability methods that present explanations in terms of training instances, approximations, feature attribution, and counterfactuals. Inconsistencies: The second dimension again mixes orthogonal distinctions. Approximations, for example, can be used to attribute importance to features (LIME). Incompleteness: Concepts, attention weights, gate activations, rules, etc., are not covered by the second dimension. Redundancies: All methods that present explanations in terms of training instances are necessarily local. Challenge: What is a global interpretability method providing explanations in terms of training instances?⁴

⁴ Several of the above taxonomies include dimensions that pertain to the form of the output of interpretation methods. We argue such distinctions are orthogonal to the methods and should therefore not be included in taxonomies. To see this, note that most interpretability methods, e.g., LIME, can provide explanations of different forms: aggregate statistics, coefficients, rules, visualizations, etc.

Inconsistent Classifications: Table 2 (left) shows that taxonomies are not only internally inconsistent, but also inconsistent in how they classify methods. Six methods were mentioned by more than one survey, and 4/6 of them were classified differently.

{| class="wikitable"
|+ Table 2 (left): How individual methods are classified across the surveys that mention them; 4/6 methods are classified incoherently across taxonomies. Abbreviations: local (L), global (G), intrinsic (I), posthoc (H).
! Method !! Classifications reported across [10], [11], [12], [6], [7], [8], [9]
|-
| GradCAM || L; L-H; L-H
|-
| DeepLift || H; L-H; L-H
|-
| LRP || L/G-H; S-H; L-I/H; L/G-H
|-
| LIME || L; L-H; L-H; L; L-H; L-H; L/G-H
|-
| TCAV || G-I; G-H; G-H
|-
| IF || L/G; L-H
|}

{| class="wikitable"
|+ Table 2 (right): Our novel taxonomy.
!  !! Forward !! Backward
|-
! Local
| Attention, Attention roll-out, Attention flow, Time step dynamics, Local uptraining, Influence functions
| Gradients, Layer-wise relevance propagation, Deep Taylor decomposition, Integrated gradients, DeepLift
|-
! Global
| Weight pruning, Correlation of representations, Clustering, Probing classifiers, Uptraining
| Dynamic sparse training, Binary networks, Sparse coding, Concept activation, Gradient-based weight pruning
|}
===3. A Novel Taxonomy and Observations===

Our taxonomy is two-dimensional: one dimension is local-global, the other a distinction between explanations based on forward passes and explanations based on backward passes. Forward explanations correlate intermediate representations or continuous or discrete output representations to obtain explanations, whereas backward explanations concern training dynamics. We define forward-backward:

Definition 3.1 (Forward-backward). A method ℳ is said to be backward if it contains backward functions; otherwise, ℳ is said to be forward.
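Together with the local-global definition from §1, Definition 3.1 assigns every method to one of four cells. A minimal sketch (ours; the component sets are hypothetical declarations chosen to match the placements in Table 2):

<syntaxhighlight lang="python">
# Hypothetical component-function declarations for four methods from Table 2.
COMPONENTS = {
    "attention weights":             {"forward"},                           # local,  forward
    "gradient saliency":             {"forward", "backward"},               # local,  backward
    "probing classifier":            {"forward", "inductive"},              # global, forward
    "gradient-based weight pruning": {"forward", "backward", "inductive"},  # global, backward
}

def cell(method: str) -> tuple[str, str]:
    """Map a method to its (local/global, forward/backward) cell."""
    comps = COMPONENTS[method]
    scope = "global" if "inductive" in comps else "local"         # definition in §1
    direction = "backward" if "backward" in comps else "forward"  # Definition 3.1
    return scope, direction

for m in COMPONENTS:
    print(f"{m:31s} -> {cell(m)}")
</syntaxhighlight>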
Local backward methods include gradients [21, 22, 23], integrated gradients [26, 27], layer-wise relevance propagation [48], DeepLIFT [28], and deep Taylor decomposition [25], which all derive explanations for individual instances from what is normally used as training signals, typically based on derivatives of the loss function (gradients) evaluating ℎ on training data, e.g., 𝑑(ℓ(ℎ(x𝑖), 𝑦𝑖)). Global backward methods rely on such training signals to modify or extend the model parameters w associated with ℎ, typically extracting approximations, rules, or visualizations.⁵

⁵ Local forward methods either consider intermediate representations, e.g., gates [49], attention [29], attention flow [50], etc.; continuous output representations, e.g., using word association norms [51] or word analogies [52, 53]; or discrete output, such as when evaluating on challenge datasets [33, 34, 35, 36] or when approximating the model's output distribution [19, 54, 37]. In the same way, global forward methods can rely on intermediate representations in forward passes, e.g., in attention head pruning [46], attention factor analysis [55], syntactic decoding of attention heads [50], attention head manipulation [56], etc.; on continuous output in forward passes, including work using clustering in the vector space to manually analyze model representations [57, 58], probing classifiers [16], and concept activation strategies [18]; or on discrete output, e.g., in uptraining [44] and knowledge distillation [59].

Observation 3.1. Local backward methods are always attribution methods (presenting feature summary statistics).

Since local methods have to provide explanations in terms of input/output (as they do not modify weights), and since backward passes do not generate output distributions, they have to present explanations in terms of attribution of relevance or gradients to input features or input segments. §3.1 is empirical. It follows naturally, but not necessarily, from the fact that backward methods reverse the direction of connections, thus returning quantities that hold for the input nodes. Pre-input quantities are not interpretable.

Observation 3.2. Only global methods can be unfaithful.

§3.2 follows from the definition of faithfulness: ℳ is faithful if the inductive functions of ℳ have ℓ(v, 𝑃) = 0, with 𝑆 ∼ 𝑃̃. ℳ can only be unfaithful with respect to inductive component functions; local methods can therefore not be unfaithful.⁶

⁶ Local methods compute quantities based on forward or backward passes, but these quantities are not induced to simulate anything. Global methods induce parameters to simulate a distribution and can be more or less faithful to this distribution, but since local methods simply 'read off' their quantities, they cannot be unfaithful; the quantities can only be misinterpreted.

Observation 3.3. Global methods can at best be epsilon-faithful, and only on i.i.d. instances.

§3.3 follows from the fact that standard learning theory applies to the inductive component functions of global interpretability methods. Since their faithfulness is the inverse of the empirical risk of these inductions, it follows that global methods can at best be 𝜖-faithful, with 𝜖 the expected loss of these inductions. Note that when the explanation is a model approximation 𝜃′, 𝜖 = 𝔼[ℓ(𝜃(𝑥), 𝜃′(𝑥))].

Observation 3.4. Only forward methods are used for local layer-wise analysis.

Since local backward methods are attribution methods (§3.1), and layer-wise analysis concerns differences between layers, local backward methods cannot be used here, simply because they only output attributions at the input level. §3.4 thus follows from §3.1, making it, too, an empirical observation, not a formal derivation.

Observation 3.5. No equivalence relations can hold across the four categories of methods.

§3.5 follows from the disjointness of the three sets of component functions and from how the four classes are defined, i.e., that global functions cannot be local, and forward functions cannot be backward. Equivalences between methods have already been found [28, 26, 27, 60, 61], but consistent taxonomic classification effectively prunes the search space of possible equivalences.

Observation 3.6. Local methods can always characterize models globally on i.i.d. samples.

§3.6 states that any local method that derives quantities for an example can be used to aggregate corpus-level statistics over appropriately sampled data. See [19] for how to do this with LIME. It should be easy to see how this result generalizes to all other local methods.
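Observation 3.6 is easy to operationalize. Below is a minimal sketch (ours, reusing the toy linear model and gradient attribution from the §1 sketch) that aggregates per-instance local attributions over an i.i.d. sample into a corpus-level characterization, in the spirit of the aggregation discussed for LIME in [19]:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))  # toy model, as in the §1 sketch

def local_attribution(s):
    """A local backward method: gradient of the model output w.r.t. the input s
    (constant here because the toy model is linear)."""
    return (W1 @ W2).ravel()

def global_characterization(S):
    """Aggregate a local method over an i.i.d. sample S: mean absolute attribution
    per input feature, i.e., a corpus-level feature-importance profile."""
    return np.mean(np.abs([local_attribution(s) for s in S]), axis=0)

S = [rng.normal(size=4) for _ in range(100)]
print(global_characterization(S))
</syntaxhighlight>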
===4. Conclusion===

We examined 10 taxonomies of interpretability methods and found all of them to be inconsistent. We introduced a two-dimensional taxonomy and showed how it can be helpful in deriving general observations and results.

===References===

[1] S. Bacher, Still not enough taxonomists: reply to Joppa et al., Trends in Ecology & Evolution 27(2) (2012) 65–66; author reply 66.
[2] S. Thomson, R. Pyle, S. Ahyong, M. Alonso-Zarazaga, J. Ammirati, J.-F. Araya, J. Ascher, T. Audisio, V. Azevedo-Santos, N. Bailly, W. Baker, M. Balke, M. Barclay, R. Barrett, R. Benine, J. Bickerstaff, P. Bouchard, R. Bour, T. Bourgoin, H.-Z. Zhou, Taxonomy based on science is necessary for global conservation, PLoS Biology 16 (2018). doi:10.1371/journal.pbio.2005075.
[3] M. S. Brewer, P. Sierwald, J. Bond, Millipede taxonomy after 250 years: Classification and taxonomic practices in a mega-diverse yet understudied arthropod group, PLoS ONE 7 (2012).
[4] H. Fraser, G. Garrard, L. Rumpff, C. Hauser, M. McCarthy, Consequences of inconsistently classifying woodland birds, Frontiers in Ecology and Evolution 3 (2015) 83.
[5] B. Jones, A few bad scientists are threatening to topple taxonomy, Smithsonian Magazine (2017).
[6] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models (2018).
[7] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160.
[8] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey on methods and metrics, Electronics 8 (2019) 832.
[9] C. Molnar, Interpretable Machine Learning, 2019. https://christophm.github.io/interpretable-ml-book/.
[10] Y. Zhang, P. Tiňo, A. Leonardis, K. Tang, A survey on neural network interpretability, 2020. arXiv:2012.14261.
[11] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, in: AACL-IJCNLP, Suzhou, China, 2020.
[12] A. Das, P. Rad, Opportunities and challenges in explainable artificial intelligence (XAI): A survey, 2020. arXiv:2006.11371.
[13] V. Chen, J. Li, J. S. Kim, G. Plumb, A. Talwalkar, Towards connecting use cases and methods in interpretable machine learning, CoRR (2021). arXiv:2103.06254.
[14] P. Atanasova, J. G. Simonsen, C. Lioma, I. Augenstein, A diagnostic study of explainability techniques for text classification, in: EMNLP, Online, 2020.
[15] N. Kotonya, F. Toni, Explainable automated fact-checking: A survey, in: COLING, Barcelona, Spain (Online), 2020.
[16] Y. Belinkov, Probing classifiers: Promises, shortcomings, and advances, Computational Linguistics (2021) 1–13.
[17] J. Ba, R. Caruana, Do deep nets really need to be deep?, in: NeurIPS, 2014.
[18] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: ICML, volume 80, 2018.
[19] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: KDD, New York, NY, USA, 2016.
[20] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, P. Das, Explanations based on the missing: Towards contrastive explanations with pertinent negatives, 2018. arXiv:1802.07623.
[21] P. Leray, P. Gallinari, Feature selection with neural networks, Behaviormetrika 26 (1998) 16–6.
[22] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. arXiv:1312.6034.
[23] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, N. de Freitas, Modelling, visualising and summarising documents with a single convolutional neural network, CoRR abs/1406.3830 (2014).
[24] L. Arras, F. Horn, G. Montavon, K.-R. Müller, W. Samek, Explaining predictions of non-linear classifiers in NLP, in: RepL4NLP, Berlin, Germany, 2016.
[25] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller, Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognition 65 (2017) 211–222.
[26] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, 2017. arXiv:1703.01365.
[27] P. K. Mudrakarta, A. Taly, M. Sundararajan, K. Dhamdhere, Did the model understand the question?, in: ACL, Melbourne, Australia, 2018.
[28] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: ICML, 2017.
[29] M. Rei, A. Søgaard, Zero-shot sequence labeling: Transferring knowledge from sentences to tokens, in: NAACL, New Orleans, Louisiana, 2018.
[30] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, in: ACL, Online, 2020.
[31] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting similarities among languages for machine translation, CoRR abs/1309.4168 (2013). arXiv:1309.4168.
[32] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, 2017. arXiv:1606.07461.
[33] M. Richardson, C. J. Burges, E. Renshaw, MCTest: A challenge dataset for the open-domain machine comprehension of text, in: EMNLP, 2013.
[34] J. Mullenbach, J. Gordon, N. Peng, J. May, Do nuclear submarines have nuclear captains? A challenge dataset for commonsense reasoning over adjectives and objects, in: EMNLP, Hong Kong, China, 2019.
[35] K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, C. Cardie, DREAM: A challenge data set and models for dialogue-based reading comprehension, Transactions of the Association for Computational Linguistics 7 (2019) 217–231.
[36] N. F. Liu, R. Schwartz, N. A. Smith, Inoculation by fine-tuning: A method for analyzing challenge datasets, in: NAACL, Minneapolis, Minnesota, 2019.
[37] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML, 2017.
[38] N. Kriegeskorte, M. Mur, P. Bandettini, Representational similarity analysis – connecting the branches of systems neuroscience, Frontiers in Systems Neuroscience 3 (2008).
[39] T. A. Trost, D. Klakow, Parameter free hierarchical graph-based clustering for analyzing continuous word embeddings, in: TextGraphs, Vancouver, Canada, 2017.
[40] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? A closer look at polysemous words, in: BlackboxNLP, Online, 2020.
[41] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: ACL, Online, 2020.
[42] C.-K. Yeh, J. S. Kim, I. E. H. Yen, P. Ravikumar, Representer point selection for explaining deep neural networks, 2018. arXiv:1811.09720.
[43] D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, Z. C. Lipton, Learning to deceive with attention-based explanations, in: ACL, Online, 2020.
[44] S. Petrov, P.-C. Chang, M. Ringgaard, H. Alshawi, Uptraining for accurate deterministic question parsing, in: EMNLP, 2010.
[45] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 667–676. doi:10.1109/TVCG.2017.2744158.
[46] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, in: ACL, Florence, Italy, 2019.
[47] T. Ruzsics, O. Sozinova, X. Gutierrez-Vasques, T. Samardzic, Interpretability for morphological inflection: from character-level predictions to subword-level rules, in: EACL, Online, 2021.
[48] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (2015).
[49] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, M. Baroni, The emergence of number and syntax units in LSTM language models, in: NAACL, Minneapolis, Minnesota, 2019.
[50] V. Ravishankar, A. Kulmizev, M. Abdou, A. Søgaard, J. Nivre, Attention can reflect syntactic structure (if you let it), in: EACL, Online, 2021.
[51] K. W. Church, P. Hanks, Word association norms, mutual information, and lexicography, in: ACL, Vancouver, British Columbia, Canada, 1989.
[52] T. Mikolov, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NeurIPS, 2013.
[53] N. Garneau, M. Hartmann, A. Sandholm, S. Ruder, I. Vulic, A. Søgaard, Analogy training multilingual encoders, in: AAAI, 2021.
[54] D. Alvarez-Melis, T. Jaakkola, A causal framework for explaining the predictions of black-box sequence-to-sequence models, in: EMNLP, Copenhagen, Denmark, 2017.
[55] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Attention is not only a weight: Analyzing transformers with vector norms, in: EMNLP, Online, 2020.
[56] S. Vashishth, S. Upadhyay, G. S. Tomar, M. Faruqui, Attention interpretability across NLP tasks, 2019. arXiv:1909.11218.
[57] K. Heylen, D. Speelman, D. Geeraerts, Looking at word meaning: An interactive visualization of semantic vector spaces for Dutch synsets, in: LINGVIS & UNCLH, Avignon, France, 2012.
[58] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, in: NeurIPS, volume 32, 2019.
[59] Y. Kim, A. M. Rush, Sequence-level knowledge distillation, in: EMNLP, Austin, Texas, 2016.
[60] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Towards better understanding of gradient-based attribution methods for deep neural networks, in: ICLR, 2018.
[61] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applications, Proceedings of the IEEE 109 (2021) 247–278. doi:10.1109/JPROC.2021.3060483.