1. Two Common Distinctions


Shortcomings of Interpretability Taxonomies for Deep Neural Networks

Anders Søgaard

0 Dpt. of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2200 Copenhagen
1 Dpt. of Philosophy, Karen Blixens Plads 8, DK-2300 Copenhagen
2 Pioneer Centre for Artificial Intelligence, Lyngbyvej 2, DK-2100 Copenhagen

2021


Taxonomies are vehicles for thinking about what's possible, for identifying unconsidered options, as well as for establishing formal relations between entities. We identify several shortcomings in 10 existing taxonomies for interpretability methods for explainable artificial intelligence (XAI), focusing on methods for deep neural networks. The shortcomings include redundancies, incompleteness, and inconsistencies. We design a new taxonomy based on two orthogonal dimensions and show how it can be used to derive results about entire classes of interpretability methods for deep neural networks.

interpretability taxonomy

1. Two Common Distinctions

Biological taxonomies provide a basis for conservation and development and are used to generate interesting questions about missing species [1, 2]. Inconsistent taxonomies can, at the same time, hinder research or lead in the wrong direction [3, 4, 5]. In engineering, taxonomies play additional roles: They are vehicles for thinking about what’s possible, for identifying unconsidered options, as well as for establishing formal relations between methods.

Several taxonomies of interpretability methods already exist [6, 7, 8, 9, 10, 11, 12, 13]. These taxonomies provide us with technical terms for distinguishing approaches to interpretability and can be efficient tools for researchers to contextualize their work. They generate interesting research questions (e.g., if all methods in class A but no methods in B happen to exhibit property X, is this by necessity, or can we design a method in B with property X?) and help us see relations between methods (e.g., two methods in class A are mathematically equivalent). Unfortunately, the taxonomies that exist, without exception, have shortcomings and are either redundant, incomplete, or inconsistent. In §2, we show this, examining the above 10 taxonomies, one by one, also discussing between-taxonomy inconsistencies in how individual methods are classified.

In §3, we present a consistent taxonomy and establish various observations and results that apply to entire classes of methods in our taxonomy.

Contributions (a) We detect […]

https://anderssoegaard.github.io/ (A. Søgaard)
0000-0001-5250-4276 (A. Søgaard)

Table 1
10 existing taxonomies and their shortcomings: Most distinguish local from global methods, and intrinsic from posthoc methods. We argue the additional dimensions all lead to inconsistencies and/or redundancies, and that the intrinsic-posthoc distinction is itself problematic.

Taxonomy   Dimensions   Additional dimensions
[6]        4            time, expertise
[7]        3            model-specific/model-agnostic
[8]        4            pre-model/in-model/post-model, results
[9]        4            spec./agn., results
[10]       3            types
[11]       3            technique
[12]       3            methodology
[14]       1            grad./pert./simpl.
[15]       1            att./rule/sum.
[13]       2            inst./approx./attr./counterf.

The simplest taxonomies presented are one-dimensional, i.e., simple groupings [14, 15].

Other taxonomies introduce up to four dimensions and use these to cross-classify existing methods. The 10 taxonomies are at most a couple of years old (2019-2021) and discussed in chronological order.

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License

We first discuss two common distinctions that are largely agreed upon: local-global and intrinsic-posthoc. One of these distinctions, local-global, will be useful, while the other is problematic in several respects. We define an interpretability method $\mathcal{M}(w, S)$ as a complex function that takes a model $w$ and a sample of token sequences $S \subseteq \mathcal{X}$ as input, and is composed of three types of functions:

Definition 1.1 (Forward functions). Let forward$(w, S)$ return $w(\pi(s))$ for all inputs $s \in S$, i.e., $w_1(\cdot), \ldots, w_n(\cdot)$ for $n$ layers, with $\pi : \mathcal{X} \mapsto \mathcal{X}$ a function from input to input, e.g., perturb, delete, identity.

Definition 1.2 (Backward functions). Let backward$(w, S)$ return $g(w^{-1}, \mathrm{forward}(w, S))$, where $g(\cdot, \cdot)$ is a function that defines a backward pass of gradients, relevance scores, etc., over the inverse model $w^{-1}$.

Definition 1.3 (Inductive functions). Let induce$(w, S)$ return a set of parameters $v$ fitted by minimizing an objective over $w$ and $S$.

Examples of inductive functions include, for various loss functions $\ell(\cdot, \cdot)$: (a) probing [16], in which the objective is of the form $\ell(v(s), \rho(w_i(s)))$, where $w_i(s)$ is the representation of $s$ at the $i$th layer of $w$, and $\rho$ is the probe-specific re-labeling function of samples; or (c) linear approximation [17], in which the objective is of the form $\ell(v(s), w(s))$, where $v$ is a linear function.

Local and global explanations: The distinction between local and global interpretability methods is shared across all the taxonomies discussed in this paper, and will also be one of the two dimensions in the taxonomy we propose below. The distinction is defined slightly differently by different authors, or not defined at all, e.g., [6], but here we present the definition that our taxonomy below relies on: A method $\mathcal{M}$ is said to be global if and only if it includes at least one inductive function. Otherwise $\mathcal{M}$ is said to be local. Global methods typically require access to a representative sample of data, to minimize their objectives, whereas local methods are applicable to singleton samples.1

As should be clear from the discussion below, the definition in [11] is not equivalent to our definition, which uses the reliance of global methods on samples, rather than the reliance of local methods on specific instances, as the distinguishing criterion. One argument against the definition in [11] is that it is not entirely clear in what sense global methods such as concept activation vectors [18], for example, are independent of any particular input. The function that provides us with explanations is global, but of course its output depends on the input.

1 Note that our definition does not refer to how the methods characterize the models, e.g., whether they describe individual inferences, or derive aggregate statistics that quantify ways the models are biased. This is to avoid a common source of confusion: Local methods can be used to derive aggregate statistics that characterize global properties of models. LIME [19], for example, is mostly classified as a local method ([12] classify it as both local and global), but in [19], the authors explicitly discuss how LIME can be used on i.i.d. samples to derive aggregate statistics that characterize model behavior on distributions (the same can be done for all local methods; see §3.6). Our definition makes it clear that such methods are local; local methods can be applied globally, whereas global methods cannot be applied locally. It is also clear from our definition that the two classes of interpretability methods are often motivated by different prototypical applications: Local methods are often used to explain the motivation behind critical decisions, e.g., why a customer was assessed as high-risk, why a travel review was flagged as fraudulent, or why a newspaper article was flagged as misleading, whereas global methods are used to characterize biases in models and evaluate their robustness.

Challenge: When taxonomies have tried to classify interpretation methods into local and global ones, in practice, some methods have seemed harder to classify than others. Concept activation approaches [18], for example, use joint global training to learn mappings of individual examples into local explanations. Contrastive interpretability methods [20] provide explanations in terms of pairs of examples. It may also seem unclear whether a challenge dataset provides a local or global explanation. [10] discuss what they call semi-local approaches, and [8] introduce a category for interpretability methods that relate to groups of examples. Are there methods that are not easily categorized as global or local? Answer: Our definition of local-global focuses on the induction of explanations from samples. This focus enables unambiguous classification and leads us to classify concept activation methods as global, since the explanatory model component is induced from a sample (and relies on the representativity of this sample). Similarly, we classify contrastive and group methods as local methods, since they do not require induction or assume representative samples; and, finally, we classify challenge datasets as local methods, since challenge datasets also do not have to be representative.2

2 Examples of local methods include gradients [21, 22, 23], LRP [24], deep Taylor decomposition [25], integrated gradients [26, 27], DeepLift [28], direct interpretation of gate/attention weights [29], attention roll-out and flow [30], word association norms and analogies [31], time step dynamics [32], challenge datasets [33, 34, 35, 36], local uptraining [19], and influence sketching and influence functions [37]; examples of global methods include unstructured pruning, lottery tickets, dynamic sparse training, binary networks, sparse coding, gate and attention head pruning, correlation of representations [38], clustering [39, 40, 41], probing classifiers [16], concept activation [18], representer point selection [42], TracIn [43], and uptraining [44].

Intrinsic and post-hoc explanations: This distinction, also called active-passive in [10] and self-explaining-ad hoc in [11], is between intrinsic methods that jointly output explanations, and methods that derive these explanations post-hoc using auxiliary techniques. While most taxonomies introduce this distinction, we argue that it is inherently problematic. Challenge: The distinction between intrinsic and posthoc methods can be hard to maintain.3 Moreover, for a method to be posthoc means different things to local and global methods. A post-hoc, local method is post-hoc relative to a class inference (in the case of classification); a post-hoc, global method is post-hoc relative to training, introducing a disjoint training phase for learning the interpretability functions. Strictly speaking, the fact that 'post-hoc' takes on two disjoint meanings for local and global methods, namely 'post-inference' and 'post-training', makes taxonomies that rely on both dimensions inconsistent.

3 Consider the difference between the two global interpretability methods, concept activation vectors and probing classifiers: CAVs are trained jointly, probing classifiers sequentially. These are extremes of a (curriculum) continuum, which is hard to binarize: If a probing classifier is trained jointly with the last epoch of the model training, is the method then intrinsic or posthoc? For a real example, consider TracIn [43], in which influence functions are estimated across various training checkpoints. Again, is TracIn intrinsic or posthoc? That the binary distinction covers a continuum makes the distinction hard to apply in practice.

2. Shortcomings of Taxonomies

We now briefly assess the 10 taxonomies, pointing out the ways in which they are inconsistent, incomplete or redundant:

Guidotti et al. (2018) [6] make the local-global distinction, as well as two that relate to how explanations are communicated (how much time the user is expected to have to understand the model decisions, and how much domain knowledge and technical experience the user is expected to have). In addition to the terms local and global, they also refer, synonymously, to outcome explanation and model explanation. Later in their survey, [6] make a fourth distinction that is very similar to intrinsic-posthoc, namely between transparent design (leading to intrinsically interpretable models) and (post-hoc) black box inspection, but oddly, this is not seen as an orthogonal dimension, but as two additional classes on par with outcome and model explanation. Challenge: How to classify methods that are both, say, local and post-hoc, i.e., do outcome explanation by black-box inspection? Examples would include gradients [21, 22, 23], layer-wise relevance propagation [24], deep Taylor decomposition [25], integrated gradients [26, 27], etc.

Adadi and Berrada (2019) [7] rely on the local-global and intrinsic-posthoc distinctions (referring to the latter as complexity), and, as a third dimension, they distinguish between model-agnostic and model-specific interpretability methods. Inconsistencies: We argue that the distinction between model-specific and model-agnostic methods is suboptimal in that state-of-the-art models are moving targets, and so is what counts as model-specific. This may lead to inconsistencies over time. Challenge: How do we classify a method that applies to all known models, but not to all possible models?

Carvalho et al. (2019) [8] introduce four dimensions in their taxonomy: (a) scope, which coincides with the local-global distinction; (b) intrinsic-posthoc; (c) pre-model, in-model, and post-model, with in-model corresponding to intrinsic methods, and post-model corresponding to post-hoc methods, whereas pre-model comprises various approaches to data analysis. We argue below that (c) is both redundant and inconsistent. Finally, they introduce (d) a results dimension, which concerns the form of the explanations provided by the methods. Inconsistencies: In addition to the inconsistency of intrinsic-posthoc, including pre-model explanations leads to further taxonomic inconsistency in that pre-model approaches cannot be classified along the other dimensions, since they do not refer to models at all. For the same reason, one might argue they are not model interpretation methods in the first place. Redundancies: The redundancy of (c) follows from the observation that the distinction between in-model and post-model explanations is identical to the distinction made in (b), as well as the observation that pre-model explanations do not refer to models at all. Challenge: What is an intrinsic interpretability method that presents post-model explanations, or a post-hoc interpretability method that presents in-model explanations?

Molnar (2019) [9] distinguishes between local-global and intrinsic-posthoc, between different results, and between model-specific and model-agnostic methods, making their taxonomy very similar to [8]. Inconsistencies: See discussion of [7]. The results dimension is also inconsistent in that explanations can, simultaneously, be intrinsically interpretable models and feature summary statistics. LIME [19], for example, presents local explanations as the linear coefficients of a linear fit, i.e., an intrinsically interpretable model that consists solely of feature summary statistics. Redundancies: The most important redundancy is that all model-agnostic interpretability methods are also post-hoc, since intrinsic methods require joint training, which in turn requires compatibility with model architectures. Moreover, model-agnostic interpretability methods are all grounded in input features and thus lead to explanations in terms of feature summary statistics or visualizations. Finally, all explanations in terms of intrinsically interpretable models are, quite obviously, intrinsic. Challenge: What is a post-hoc interpretability method whose explanations are intrinsically interpretable models?

Zhang et al. (2020) [10] rely on these dimensions: (a) global-local; (b) intrinsic-posthoc (which they call active-passive); and (c) a distinction between four explanation types, namely examples, attribution, hidden semantics, and rules. Inconsistencies: The explanation type dimension in [10] conflates (a) the model components we are trying to explain, and (b) what the explanations look like. Hidden semantics, e.g., is a model component, whereas examples and rules refer to the (syntactic) form of the explanations. The distinction between hidden semantics and attribution is also merely apparent. Hidden semantics can be used to derive attribution (a results type in [8] and [9]), e.g., in LSTMVis [45]; this is because hidden semantics is not a type of explanation, but a model component. Attribution, examples, and rules are types of explanations, but this list is not exhaustive, since explanations can also be in terms of concepts, free texts, or visualizations, for example. Challenge: What is a passive interpretability method that does not provide local explanations?

Danilevsky et al. (2020) [11] only distinguish between global-local and intrinsic-posthoc (which they call self-explaining and ad-hoc methods). Inconsistency: [11] say most attribution methods are global and ad-hoc. We argue attribution methods are necessarily local, and while aggregate statistics can of course be computed across real or synthetic corpora, little is gained by blurring taxonomies to reflect that. All local methods can be used to compute summary statistics. Incompleteness: [11] admit their survey is biased toward local methods, and many global interpretability methods are left uncovered. Challenge: What is a local interpretability method that cannot be used to compute summary statistics?

Das et al. (2020) [12] distinguish between local and global methods, gradient-based and perturbation-based methods (methodology), and intrinsic and post-hoc methods (usage). Their taxonomy is both incomplete and redundant: Incompleteness: Several approaches are neither gradient-based nor perturbation-based. Redundancies: All gradient-based approaches are classified as post-hoc approaches in [12]; similarly, all intrinsic methods are classified as global methods. Of course these cells may be filled with methods that were not covered, but in particular, it seems that gradient-based approaches are, almost always, post-hoc. Challenge: What is an intrinsic, gradient-based approach?

Atanasova et al. (2020) [14] distinguish between three classes of interpretability methods: gradient-based, perturbation-based, and simplification-based methods. Inconsistencies: The distinction between gradient-based and perturbation-based methods is similar to [12], but the two classifications are inconsistent, with [14] citing LIME [19] as a simplification-based method. It seems that the distinction between perturbation-based and simplification-based methods is in itself inconsistent in that both perturbations and gradients can be used to simplify models; similarly, perturbations can be used to baseline gradient-based approaches. Incompleteness: Clearly, not all interpretability methods are gradient-based, perturbation-based or simplification-based: Other methods are based on weight magnitudes, carefully designed example templates, visualizing and quantifying attention weights or gating mechanisms. Challenge: How would you classify attention roll-out [30], for example?

Kotonya and Toni (2020) [15] distinguish between attention-based explanations, explanations as rule discovery, and explanations as summarization. Incompleteness: Using gating mechanisms to interpret models, e.g., does not fit any of the three categories. Inconsistencies: One class of methods is defined in terms of the model components being interpreted (attention-based), and another class in terms of the form of explanations they provide (rule discovery and summarization). Mixing orthogonal dimensions is inconsistent, i.e., methods can belong to several categories, e.g., attention head pruning [46] (attention-based and summarization), or when rules are induced from attention weights [47].

Chen et al. (2021) [13] introduce the global-local distinction, but not the intrinsic-posthoc distinction. In addition they distinguish between interpretability methods that present explanations in terms of training instances, approximations, feature attribution and counterfactuals. Inconsistencies: The second dimension again makes orthogonal distinctions. Approximations, for example, can be used to attribute importance to features (LIME). Incompleteness: Concepts, attention weights, gate activations, rules, etc., are not covered by the second dimension. Redundancies: All methods that present explanations in terms of training instances are necessarily local. Challenge: What's a global interpretability method providing explanations in terms of training instances?4

4 Several of the above taxonomies include dimensions that pertain to the form of the output of interpretation methods. We argue such distinctions are orthogonal to the methods and should therefore not be included in taxonomies. To see this, note that most interpretability methods, e.g., LIME, can provide explanations of different form: aggregate statistics, coefficients, rules, visualizations, etc.

Inconsistent Classifications: Table 2 shows that taxonomies are not only internally inconsistent, but also inconsistent in how they classify methods. Six methods were mentioned by more than one survey, 4/6 of which were classified differently.

Table 2
Left: 4/6 methods (bottom half) are classified incoherently across taxonomies. Explanation: local (L), global (G), intrinsic (I), and posthoc (H). Right: Our novel taxonomy.

Left (rows: DeepLift, GradCAM, LRP, LIME, IF, TCAV; columns: [6], [7], [8], [9], [10], [11], [12]). Only the LIME row can be recovered in full:
LIME     [6] L   [7] L-H   [8] L-H   [9] L   [10] L-H   [11] L-H   [12] L/G-H

Right:
Local, Forward: Attention, Attention roll-out, Attention flow, Time step dynamics, Local uptraining, Influence functions
Local, Backward: Gradients, Layer-wise relevance propagation, Deep Taylor decomposition, Integrated gradients, DeepLift
Global, Forward: Weight pruning, Clustering, Probing classifiers, Uptraining, Correlation of representations
Global, Backward: Dynamic sparse training, Binary networks, Sparse coding, Concept activation, Gradient-based weight pruning

3. A Novel Taxonomy and Observations

Our taxonomy is two-dimensional: One is local-global, the other a distinction between explanations based on forward passes, and explanations based on backward passes. The forward explanations correlate intermediate representations or continuous or discrete output representations to obtain explanations, whereas backward explanations concern training dynamics. We define forward-backward:

Definition 3.1 (Forward-backward). A method $\mathcal{M}$ is said to be backward if it contains backward functions; otherwise, $\mathcal{M}$ is said to be forward.

Local backward methods include gradients [21, 22, 23], integrated gradients [26, 27], layerwise relevance propagation [48], DeepLIFT [28], and deep Taylor decomposition [25], which all derive explanations for individual instances from what is normally used as training signals, typically based on derivatives of the loss function (gradients) evaluating $h$ on training data, e.g., $\nabla \ell(h(\mathbf{x}_i), y_i)$. Global backward methods rely on such training signals to modify or extend the model parameters $w$ associated with $h$, typically extracting approximations, rules or visualizations.5

5 Local forward methods either consider intermediate representations, e.g., gates [49], attention [29], attention flow [50], etc.; continuous output representations, e.g., using word association norms [51] or word analogies [52, 53]; or discrete output, such as when evaluating on challenge datasets [33, 34, 35, 36], or when approximating the model's output distribution [19, 54, 37]. In the same way, global forward methods can rely on intermediate representations in forward passes, e.g., in attention head pruning [46], attention factor analysis [55], syntactic decoding of attention heads [50], attention head manipulation [56], etc.; continuous output in forward passes, including work using clustering in the vector space to manually analyze model representations [57, 58], probing classifiers [16], and concept activation strategies [18]; or on discrete output, e.g., in uptraining [44] and knowledge distillation [59].

Observation 3.1. Local backward methods are always attribution methods (presenting feature summary statistics).

Since local methods have to provide explanations in terms of input/output (as they do not modify weights), and since backward passes do not generate output distributions, they have to present explanations in terms of attribution of relevance or gradients to input features or input segments. §3.1 is empirical. It follows naturally, but not necessarily, from the fact that backward methods reverse the direction of connections, thus returning quantities that hold for the input nodes. Pre-input quantities are not interpretable.

Observation 3.2. Only global methods can be unfaithful.

Observation 3.3. Global methods can at best be $\epsilon$-faithful and only on i.i.d. instances.

§3.2 follows from the definition of faithfulness: $\mathcal{M}$ is faithful if the inductive functions of $\mathcal{M}$ have $\ell(v, \cdot) = 0$ and […]. $\mathcal{M}$ can only be unfaithful with respect to inductive component functions; local methods can therefore not be unfaithful.6 §3.3 follows from the fact that standard learning theory applies to the inductive component functions of global interpretability methods. Since their faithfulness is the inverse of the empirical risk of these inductions, it follows that global methods can at best be $\epsilon$-faithful, with $\epsilon$ the expected loss of these inductions. Note that when the explanation is a model approximation $w'$, $\epsilon = \mathbb{E}[\ell(w(x), w'(x))]$.

Observation 3.4. Only forward methods are used for local layer-wise analysis.

Since local backward methods are attribution methods (§3.1), and layer-wise analysis concerns differences between layers, local backward methods cannot be used here, simply because they only output attributions at the input level. §3.4 thus follows from §3.1, making it, too, an empirical observation, not a formal derivation.
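The three component function types of Definitions 1.1-1.3 can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the paper's implementation: the model is a toy ReLU network with a scalar output stored as a list of weight matrices, the backward function g is taken to be a plain gradient pass, and the inductive function fits a linear surrogate by stochastic gradient descent on squared loss. The sketch also shows the contrast behind Observations 3.1 and 3.4: `forward` returns one activation vector per layer, whereas `backward` returns one value per input feature.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # one weight matrix per layer (rows = output units)


def identity(s: Vector) -> Vector:
    """pi: a function from input to input (here: no perturbation)."""
    return list(s)


def _layer(W: Matrix, h: Vector, is_last: bool) -> Vector:
    z = [sum(w_rc * h_c for w_rc, h_c in zip(row, h)) for row in W]
    return z if is_last else [max(0.0, z_r) for z_r in z]  # ReLU on hidden layers


def forward(w: Sequence[Matrix], S: Sequence[Vector],
            pi: Callable[[Vector], Vector] = identity) -> List[List[Vector]]:
    """Definition 1.1: layer-wise outputs w_1(.), ..., w_n(.) of pi(s), for each s in S."""
    traces = []
    for s in S:
        h, acts = pi(s), []
        for i, W in enumerate(w):
            h = _layer(W, h, is_last=(i == len(w) - 1))
            acts.append(h)
        traces.append(acts)
    return traces


def backward(w: Sequence[Matrix], S: Sequence[Vector]) -> List[Vector]:
    """Definition 1.2 with g assumed to be a plain gradient pass:
    d(scalar output)/d(input), one value per input feature."""
    grads = []
    for acts in forward(w, S):
        g = [1.0]  # gradient of the scalar output w.r.t. itself
        for i in range(len(w) - 1, -1, -1):
            if i < len(w) - 1:  # pass the gradient through the ReLU of hidden layer i
                g = [g_r if a_r > 0.0 else 0.0 for g_r, a_r in zip(g, acts[i])]
            W = w[i]
            g = [sum(W[r][c] * g[r] for r in range(len(W))) for c in range(len(W[0]))]
        grads.append(g)
    return grads


def induce(w: Sequence[Matrix], S: Sequence[Vector],
           steps: int = 2000, lr: float = 0.05) -> Vector:
    """Definition 1.3: fit linear parameters v by minimizing the squared loss
    between v(s) and the model output w(s) over the sample S."""
    targets = [acts[-1][0] for acts in forward(w, S)]
    v = [0.0] * len(S[0])
    for _ in range(steps):
        for s, t in zip(S, targets):
            err = sum(v_c * s_c for v_c, s_c in zip(v, s)) - t
            v = [v_c - lr * err * s_c for v_c, s_c in zip(v, s)]
    return v


# Hypothetical two-layer model and two-example sample, for illustration only.
toy_w = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 1.0]]]
toy_S = [[1.0, 0.0], [0.0, 1.0]]
```

Under these assumptions, a method that only calls `forward` and `backward` works on a single example, while any method that calls `induce` depends on how representative `toy_S` is, which is the property the local-global definition turns on.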

Observation 3.5. No equivalence relations can hold across the four categories of methods.

§3.5 follows from the disjointness of the three sets of component functions, and how the four classes are defined, i.e., that global functions cannot be local, and forward functions cannot be backward. Equivalences between methods have already been found [28, 26, 27, 60, 61], but consistent taxonomic classification effectively prunes the search space of possible equivalences.
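The two definitions that span the taxonomy (a method is global iff it includes an inductive function, and backward iff it contains backward functions) can be sketched as a small classifier. The `Method` class, the component labels, and the component sets assigned to the example methods are hypothetical and only mirror how the text classifies these methods.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

FORWARD, BACKWARD, INDUCTIVE = "forward", "backward", "inductive"


@dataclass(frozen=True)
class Method:
    name: str
    components: FrozenSet[str]  # which component function types the method uses

    def cell(self) -> Tuple[str, str]:
        """Return the (scope, direction) cell of the two-dimensional taxonomy."""
        scope = "global" if INDUCTIVE in self.components else "local"
        direction = "backward" if BACKWARD in self.components else "forward"
        return scope, direction


def may_be_equivalent(a: Method, b: Method) -> bool:
    """Observation 3.5: only methods in the same cell are candidates for equivalence."""
    return a.cell() == b.cell()


# Component sets below are assumptions made for illustration.
gradients = Method("gradients", frozenset({FORWARD, BACKWARD}))
integrated_gradients = Method("integrated gradients", frozenset({FORWARD, BACKWARD}))
attention_rollout = Method("attention roll-out", frozenset({FORWARD}))
probing = Method("probing classifiers", frozenset({FORWARD, INDUCTIVE}))
```

With such a classification in hand, a search for equivalent methods only needs to compare pairs for which `may_be_equivalent` holds, which is the pruning effect described above.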

Observation 3.6. Local methods can always characterize models globally on i.i.d. samples.

§3.6 states that any local method that derives quantities for an example can be used to aggregate corpus-level statistics for appropriate-level samples. See [19] for how to do this with LIME. It should be easy to see how this result generalizes to all other local methods.
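A minimal sketch of this aggregation step, under stated assumptions: the local method is a hypothetical gradient-times-input attribution for a linear model, all examples have the same number of features, and the corpus-level statistic is the mean absolute attribution per feature. None of these choices come from the paper or from [19]; any per-example quantity and any aggregate could be substituted.

```python
from statistics import mean
from typing import Callable, List, Sequence

Vector = List[float]


def gradient_times_input(coefficients: Vector) -> Callable[[Vector], Vector]:
    """Hypothetical local method for a linear model: attribution_j = c_j * x_j."""
    def explain(x: Vector) -> Vector:
        return [c_j * x_j for c_j, x_j in zip(coefficients, x)]
    return explain


def aggregate(local_method: Callable[[Vector], Vector],
              sample: Sequence[Vector]) -> Vector:
    """Apply a local method to every example and average absolute attributions
    per feature, giving a corpus-level summary of the model's behavior."""
    per_example = [local_method(x) for x in sample]
    n_features = len(per_example[0])
    return [mean(abs(a[j]) for a in per_example) for j in range(n_features)]
```

The summary is only informative about the model's behavior on a distribution if the sample is drawn i.i.d. from it; the local method itself remains local, since it involves no induction.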

6 Local methods compute quantities based on forward or backward passes, but these quantities are not induced to simulate anything. Global methods induce parameters to simulate a distribution and can be more or less faithful to this distribution, but since local methods simply 'read off' their quantities, they cannot be unfaithful. Only, the quantities can be misinterpreted.
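The notion of faithfulness as the expected loss of an induced component (Observation 3.3) can be estimated empirically. The sketch below assumes the explanation is a model approximation and that an i.i.d. sample is available; the function names and the squared loss are hypothetical choices, and the sample average is only an estimate of the expected loss, not the quantity itself.

```python
from typing import Callable, Sequence


def squared_loss(a: float, b: float) -> float:
    return (a - b) ** 2


def empirical_epsilon(model: Callable[[float], float],
                      approximation: Callable[[float], float],
                      sample: Sequence[float],
                      loss: Callable[[float, float], float] = squared_loss) -> float:
    """Average loss between the model and its approximation over a sample,
    used as an estimate of the expected loss of the induced explanation."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(loss(model(x), approximation(x)) for x in sample) / len(sample)
```

An estimate of zero on one sample does not establish faithfulness on other distributions, which is why the observation restricts the guarantee to i.i.d. instances.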

4. Conclusion

in: NAACL, New Orleans, Louisiana, 2018.
[30] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, in: ACL, Online, 2020.
[31] T. Mikolov, Q. V. Le, I. Sutskever, Exploiting similarities among languages for machine translation, CoRR abs/1309.4168 (2013). arXiv:1309.4168.
[32] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, 2017. arXiv:1606.07461.
[33] M. Richardson, C. J. Burges, E. Renshaw, MCTest: A challenge dataset for the open-domain machine comprehension of text, in: EMNLP, 2013.
[34] J. Mullenbach, J. Gordon, N. Peng, J. May, Do nuclear submarines have nuclear captains? a challenge dataset for commonsense reasoning over adjectives and objects, in: EMNLP, Hong Kong, China, 2019.
[35] K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, C. Cardie, DREAM: A challenge data set and models for dialogue-based reading comprehension, Transactions of the Association for Computational Linguistics 7 (2019) 217–231.
[36] N. F. Liu, R. Schwartz, N. A. Smith, Inoculation by fine-tuning: A method for analyzing challenge datasets, in: NAACL, Minneapolis, Minnesota, 2019.
[37] P. W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: ICML, 2017.
[38] N. Kriegeskorte, M. Mur, P. Bandettini, Representational similarity analysis – connecting the branches of systems neuroscience, Frontiers in Systems Neuroscience 3 (2008).
[39] T. A. Trost, D. Klakow, Parameter free hierarchical graph-based clustering for analyzing continuous word embeddings, in: TextGraphs, Vancouver, Canada, 2017.
[40] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? a closer look at polysemous words, in: BlackboxNLP, Online, 2020.
[41] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: ACL, Online, 2020.
[42] C.-K. Yeh, J. S. Kim, I. E. H. Yen, P. Ravikumar, Representer point selection for explaining deep neural networks, 2018. arXiv:1811.09720.
[43] D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, Z. C. Lipton, Learning to deceive with attention-based explanations, in: ACL, Online, 2020.
[44] S. Petrov, P.-C. Chang, M. Ringgaard, H. Alshawi, Uptraining for accurate deterministic question parsing, in: EMNLP, 2010.
[45] H. Strobelt, S. Gehrmann, H. Pfister, A. M. Rush, Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks, IEEE Transactions on Visualization and Computer Graphics 24 (2018) 667–676. doi:10.1109/TVCG.2017.2744158.
[46] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, in: ACL, Florence, Italy, 2019.
[47] T. Ruzsics, O. Sozinova, X. Gutierrez-Vasques, T. Samardzic, Interpretability for morphological inflection: from character-level predictions to subword-level rules, in: EACL, Online, 2021.
[48] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, PLoS ONE (????).
[49] Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, M. Baroni, The emergence of number and syntax units in LSTM language models, in: NAACL, Minneapolis, Minnesota, 2019.
[50] V. Ravishankar, A. Kulmizev, M. Abdou, A. Søgaard, J. Nivre, Attention can reflect syntactic structure (if you let it), in: EACL, Online, 2021.
[51] K. W. Church, P. Hanks, Word association norms, mutual information, and lexicography, in: ACL, Vancouver, British Columbia, Canada, 1989.
[52] T. Mikolov, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NeurIPS, 2013.
[53] N. Garneau, M. Hartmann, A. Sandholm, S. Ruder, I. Vulic, A. Søgaard, Analogy training multilingual encoders, in: AAAI, 2021.
[54] D. Alvarez-Melis, T. Jaakkola, A causal framework for explaining the predictions of black-box sequence-to-sequence models, in: EMNLP, Copenhagen, Denmark, 2017.
[55] G. Kobayashi, T. Kuribayashi, S. Yokoi, K. Inui, Attention is not only a weight: Analyzing transformers with vector norms, in: EMNLP, Online, 2020.
[56] S. Vashishth, S. Upadhyay, G. S. Tomar, M. Faruqui, Attention interpretability across NLP tasks, 2019. arXiv:1909.11218.
[57] K. Heylen, D. Speelman, D. Geeraerts, Looking at word meaning. an interactive visualization of semantic vector spaces for Dutch synsets, in: LINGVIS & UNCLH, Avignon, France, 2012.
[58] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, in: NeurIPS, volume 32, 2019.
[59] Y. Kim, A. M. Rush, Sequence-level knowledge distillation, in: EMNLP, Austin, Texas, 2016.
[60] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Towards better understanding of gradient-based attribution methods for deep neural networks, in: ICLR, 2018.
[61] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applica