Interpretability in Activation Space Analysis of Transformers: A Focused Survey

Soniya Vijayakumar (soniya.vijayakumar@dfki.de)
German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Saarland, Germany

Abstract
The field of natural language processing has reached breakthroughs with the advent of transformers. They have remained state-of-the-art since then, and there has also been much research in analyzing, interpreting, and evaluating the attention layers and the underlying embedding space. In addition to the self-attention layers, the feed-forward layers are a prominent architectural component of the transformer, yet their role remains under-explored. We focus on the latent space, known as the Activation Space, that consists of the neuron activations from these feed-forward layers. In this survey paper, we review interpretability methods that examine the learnings that occur in this activation space. Since there exists only limited research in this direction, we conduct a detailed examination of each work and point out potential future directions of research. We hope our work provides a step towards strengthening activation space analysis.

Keywords: explainability, interpretability, machine learning, activation space analysis, linguistic information, transformers, feed-forward layers

1. Introduction

There is ample evidence that transformers have established themselves as the state of the art in various Natural Language Processing (NLP) tasks since their conception and realization in 2017 [1]. BERT, the most well-known transformer language model [9], consists of two major architectural components: self-attention layers and feed-forward layers. Much work has been done in analyzing the functions of self-attention layers [2, 3, 4]. In our survey, we focus on the interpretability of the feed-forward layers. Each layer in the encoder and decoder contains a fully connected position-wise feed-forward network, which consists of two linear transformations with a rectified linear activation function in between. Even though existing works highlight the importance of these feed-forward layers in transformers [5, 6, 7], to date their role remains under-explored [8]. Our review focuses on research that uses interpretability methods to understand the learnings in these feed-forward layers. We define the latent space that comprises the activations extracted from these layers as the Activation Space. Many methods already exist for aggregating these representations, including the default Hugging Face pipeline (https://huggingface.co/) used in the original BERT paper [9].

Several methods for explaining and interpreting deep neural networks have been devised, and we observe that much of the focus is in the domain of image processing [10]. A persistent challenge is the gap between the low-level features that neural networks compute and the high-level concepts that are human-understandable. Furthermore, we observe that relatively few research methods have been applied to understanding the internal learnings of networks in comparison to analyzing the functions of self-attention layers.

The core focus of our review is directed towards methods that unfold the learnings in the internal representations of the neural network, i.e., we look at those methods that answer the question: "What does the model learn?" We further refine our focus to understanding specifically the feed-forward layers in transformer models. The motivation for this study is two-fold:

• The inputs undergo a non-linear transformation when passing through the activation functions in the feed-forward layers of deep neural networks [11].
• The parameters in the position-wise feed-forward layers of the transformer account for two-thirds of the total model's parameters (8d² per layer, where d is the model's hidden dimension). This also implies that a considerable computational budget is involved in training these parameters to achieve the state-of-the-art performance they deliver today [12].

From recent research, the methods that focus on understanding the feed-forward layers show substantial evidence that the feed-forward layer activation space embeds useful information (see Section 5). We find that the learnings in the feed-forward layers remain under-explored. With our methodological survey, our objective is to understand the internal mechanisms of transformers by exploring the activation space of the feed-forward network. Further, we consider this paper a focused starting point for facilitating future research in activation space analysis. Finally, we also conduct a comparative study of these methods and their evaluation techniques and report our observations, understandings, and potential future directions (see Section 7). Table 1 summarizes the methods and attributes that we have explored.

Table 1: Major attributes of the methods explored in the activation space analysis methods.

Linguistic Phenomena [13, 14, 15, 16]
  Properties: Word Morphology, Lexical Semantics, Sentence Length, Parts-of-Speech
  NLP Tasks: Parts-of-Speech, Semantic and Syntax Tagging and Prediction, Syntactic Chunking
  Quantitative Evaluation: Sensitivity, Prediction Accuracy, Selectivity Score
  Qualitative Evaluation: Human-expert visual inspection of selected neurons

Neural Memory Cells [12, 8]
  Properties: Vocabulary Distribution, Human-Interpretable Patterns, Factual Knowledge
  NLP Tasks: Next Sequence Prediction, Fill-in-the-blank Cloze Task
  Quantitative Evaluation: Agreement Rate, Prediction Probability, Attribution Score, Perplexity, Change and Success Rate
  Qualitative Evaluation: Pattern search by human experts

Knowledge Illusion [17]
  Properties: Lexical and Geometric Properties (Local Semantic Coherence)
  NLP Tasks: Next Sequence Prediction
  Quantitative Evaluation: Projection Score, Activation Quantile, Word Frequency Correlation
  Qualitative Evaluation: Human annotations for patterns using visualization
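To make the object of study concrete, the following is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, of the position-wise feed-forward block described above and of one way to collect the activation space, i.e., the intermediate neuron activations of every encoder layer, via forward hooks. The model name, the `intermediate` sub-module layout, and the tensor shapes reflect the standard BERT implementation and are meant as an illustration rather than the exact setup used by the reviewed works.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Position-wise feed-forward block of a transformer layer:
# FFN(x) = f(x W1 + b1) W2 + b2, with W1 in R^{d x 4d} and W2 in R^{4d x d},
# i.e. roughly 8d^2 parameters per layer.
class FeedForward(torch.nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_hidden)
        self.w2 = torch.nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU as in the original transformer; BERT-style models use GELU instead.
        return self.w2(torch.relu(self.w1(x)))

# Collect the "activation space" of a pre-trained model with forward hooks.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

activations = {}  # layer index -> tensor of shape (batch, seq_len, 4d)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        activations[layer_idx] = output.detach()
    return hook

# In the Hugging Face BERT implementation, each encoder layer exposes its
# feed-forward expansion (Linear + GELU) as the `intermediate` sub-module.
for i, layer in enumerate(model.encoder.layer):
    layer.intermediate.register_forward_hook(make_hook(i))

inputs = tokenizer("The feed-forward layers form the activation space.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print({i: tuple(a.shape) for i, a in activations.items()})  # e.g. (1, seq_len, 3072) per layer
```

The per-token vectors gathered this way are the raw material that the methods surveyed below probe, rank, and annotate.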
2. Related Surveys

As interest in the Explainable Artificial Intelligence (XAI) field grows, various survey articles have been published, trying to consolidate and categorize the approaches. We segregate the reviews into two categories: surveys that give a general overview of existing explainability methods [18, 19, 20, 21, 22] and surveys that focus on explainability methods in the NLP domain. We narrow our scope to the NLP domain as this is the core focus of this review paper.

A survey that acts as a prior to ours is from Belinkov and Glass [23], where the authors review the various analysis methods used to conduct novel and fine-grained neural network interpretation and evaluation. The primary question that has been relevant while formulating these interpretation methods is: what linguistic information is captured in neural networks? The authors emphasize three aspects of language-specific analysis, namely, the methods used for conducting the analysis, the linguistic information sought, and the neural network parts investigated. They also identify several gaps and limitations in the surveyed work.

Danilevsky et al. [24] present a broader overview of the state of XAI over a span of seven years (until 2020), with a focus on the NLP domain. This work concentrates on outcome explanation problems, which help end users understand the model's operation and thereby build trust in NLP-based AI systems. Along with the high-level classification of explanations, the work introduces two additional aspects: techniques that derive the explanation and techniques that present it to the end user. The explainability techniques are categorized into feature importance, surrogate models, example-driven, provenance-based, and declarative induction. A set of operations, such as first-derivative salience, layer-wise relevance propagation, input perturbations, the attention mechanism, Long Short-Term Memory (LSTM) gating signals, and explainability-aware architectures, enables explainability. An interesting observation is the consideration of adding attention layers to neural network architectures as a strategy to enable explanations.

The closest survey related to our work is from Sajjad et al. [25], which surveys fine-grained neuron analysis. While two previous surveys cover Concept Analysis [26] and Attribution Analysis [24], Sajjad et al. focus on analyzing individual neurons to better understand the inner workings of neural networks. They refer to this as Neuron Analysis and categorize the reviewed methods into visualization, corpus-based, neuron-probing, and unsupervised methods. The work further discusses findings and applications of neuron interpretation and summarizes open issues.

We observe that, across the various existing surveys, there are different dimensions to be considered. We narrow our survey down to the following dimensions:

• Analysis methods that focus on the internal interpretation of the activation space.
• Linguistic information, such as parts-of-speech, syntactic, and semantic properties, and non-linguistic information, such as sentence length, factual knowledge, and geometric properties.
• The neural network objects, neurons and their activations, viewed as the Activation Space in the transformer language model.

We believe that interpretability alone is not sufficient for understanding the inner workings of transformers; we also need explainability to summarize the reasons for the model's behaviour in a human-comprehensible manner. One has to keep in mind that explainability and interpretability have distinguishable meanings [27]; our review focuses only on interpretability methods because the research works reviewed focus on the same.
3. Survey Methodology

Our survey aims to cover the advances in NLP XAI research focusing on neuron interpretation. As defined earlier, we refer to this latent dimension as the Activation Space and consider the reviewed techniques as Activation Space Analysis methods. We filter to those methods that work at the feed-forward neuron level, whether individual or global, within the transformer model. We identified relevant papers published in NLP and AI conferences (AAAI, ACL, IJCNLP, EMNLP) between 2018 and 2022. With the limited scope of neuron-level analysis, we arrived at seven contemporary papers. Given the limited number of works in this direction, we decided to take a deeper look into each of these methods, analyze their benefits, limitations, and gaps, and present this study as our review paper. We are aware that this is an ongoing and relatively new research field and that our focus is extremely limited; we acknowledge that we might have omitted certain papers. We also assume that if the authors have focused on explainability, they are more likely to cover the relevant related taxonomies, categories, and methods. Another common observation is that explanations are generated in an NLP task-oriented setting and remain relevant to the task context. Even though we summarize the tasks on which these works are based, the task definitions are not relevant to our review process of understanding the activation space.

4. Taxonomies and Categorization

There still exists a reasonably vague understanding of, and a lack of concrete mathematical definitions for, the two commonly used terms explainability and interpretability. Interpretability has been defined as "the degree to which a human can understand the cause of a decision" [28] or the degree to which a human can consistently predict the model's result [29]. A broader definition exists for the term interpretable machine learning as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in the data or learned by the model. This definition focuses on understanding what the model learns, either from an input-output mapping perspective or in terms of what the model itself learns. Explainability, on the other hand, directs the focus back to human understanding by examining the relationship between input features and model predictions in a human-understandable format [21].

After reviewing numerous relevant existing works, we observed that explainability techniques broadly fall into three major classes. The first differentiates between understanding a model's individual prediction process and understanding its prediction process as a whole [24]. A second differentiation is made between self-explaining and post-hoc methods, where the former generates explanations along with the model's prediction process, whereas the latter requires post-processing of elements extracted during the model's prediction process. The third major distinction is between methods that are model-specific and those that are model-agnostic in nature. We also observed the existence of various other categorizations, such as outcome-based explanations, visual explanation methods, operations, and conceptual vs. attribution methods. Visualization methods play a salient role in further understanding any interpretation method [30, 31, 32, 33]. These methods are inherent to interpretability and have been widely reviewed; we leave it to the reader to explore the relevant literature.
5. Activation Space Analysis Methods

Two types of interpretability analysis are carried out in the related research work: 1) analyzing individual neurons and 2) analyzing the entire set of neurons of the feed-forward layer. We look into both approaches from four perspectives: categorization, linguistic knowledge sought, methodology, and evaluation, and conduct a comparative analysis of these methods.

Linguistic Phenomena: Investigating the linguistic phenomena that occur within the activations of pre-trained models, when trained for a specific task set, using various interpretability analysis methods is a common way to interpret the features learned by these models. Linguistic phenomena here refer to the presence of various linguistic features, such as word morphology, lexical semantics, and syntax, or of linguistic knowledge, such as parts-of-speech, grammar, coreference, and lemmas. Linguistic Correlation Analysis (LCA) is one such method; it focuses on understanding what the model learned about linguistic features and on determining the neurons that explicitly focus on such phenomena. A toolkit with three major methods, Individual Model Analysis, Cross-model Analysis, and LCA, to identify salient neurons within the model or related to a task under consideration, is presented by Dalvi et al. [13].

Probing with diagnostic classifiers to understand the knowledge captured in neural representations is another common method for associating model components with linguistic properties [34, 35, 36]. It involves extracting feature representations from the network and training an auxiliary classifier to predict the linguistic property. Layer-wise and neuron-level diagnostic classifiers, which respectively probe representations from individual layers w.r.t. linguistic properties and find neurons that capture salient features, are used to analyze the pre-trained models BERT, RoBERTa, and XLNet [14]. A task of predicting a certain linguistic property is defined, and a diagnostic classifier (logistic regression) is trained on the generated activations, for both layer-wise and neuron-wise probes, to predict the existence of this linguistic property. An LCA is conducted to generate a neuron ranking based on the weight distribution. Additionally, the elastic-net regularization is tuned using grid search to balance between focused and distributed neurons. The top N salient neurons extracted from this ranked list are used to retrain the classifier until an Oracle accuracy is achieved.
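As an illustration of this probe-and-rank recipe, the following sketch trains a logistic-regression probe with elastic-net regularization on pre-extracted activations and ranks neurons by the weight mass assigned to them; it is a simplified stand-in for the LCA-style ranking described above, not the reviewed toolkits' implementation. The arrays X and y are random placeholders for activations and linguistic labels, and the top-N retraining step only mimics the oracle-accuracy loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one row of feed-forward activations per token; y: its linguistic label (e.g. a POS tag).
# Random placeholders stand in for activations extracted as in the sketch of Section 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3072))
y = rng.integers(0, 5, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Elastic-net regularized probe: l1_ratio trades off focused (sparse) vs distributed neurons.
probe = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=500)
probe.fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Rank neurons by the absolute weight mass they receive across classes, then retrain
# on the top-N neurons only and compare against the full-probe ("oracle") accuracy.
saliency = np.abs(probe.coef_).sum(axis=0)
top_n = np.argsort(saliency)[::-1][:100]
probe_top = LogisticRegression(max_iter=500).fit(X_tr[:, top_n], y_tr)
print("top-100-neuron accuracy:", probe_top.score(X_te[:, top_n], y_te))
```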
Durrani et al. [15] and Alammar [16] conduct similar experiments, where the entire set of neuron activations from the feed-forward layers is used to train an external classifier. Durrani et al. [15] use a probing classifier (logistic regression) with the additional elastic-net regularization to conduct a fine-grained neuron-level analysis of the pre-trained models ELMo, T-ELMo, BERT, and XLNet. This variety of models covers different modeling choices of the building blocks, optimization objectives, and model architectures. The case study conducted by Alammar [16] probes the feed-forward neuron activations for Parts-of-Speech (POS) information. A control task is created in which each token is assigned a random POS tag and a separate probe is trained on this control set. This allows measuring the difference in prediction accuracy between the actual and control datasets, the selectivity score, and thereby concluding whether the probe really extracts POS information. The author collects existing methods that examine input saliency, hidden state evolution, neuron activations, and non-negative matrix factorization of neuron activations, along with dimensionality reduction methods to extract patterns, into an open-source library known as Ecco [16]. These methods can be directly employed on pre-trained models such as GPT-2, BERT, and RoBERTa.

Neural Memory Cells: In the context of a neural network with a recurrent attention model, Sukhbaatar et al. [37] introduced input and output memory representations. A recent work extends this neural memory concept and shows that the feed-forward layers in transformer models operate as key-value memories, where keys correlate with specific human-interpretable input pattern sets while, simultaneously, values induce a distribution over the output vocabulary [12]. The work analyzes these memories present in the feed-forward layers and further explores the function of these layers in transformer-based language models.

A neural memory is defined as a key-value pair, where each key and value is a d-dimensional vector. The emulation, i.e., the mathematical similarity between feed-forward layers and key-value neural memories, allows the hidden dimension to be viewed as the number of memories in each layer and the activations as vectors of unnormalized, non-negative memory coefficients. Using this similarity, the study posits that the key vectors act as pattern detectors. This hypothesis is tested by looking for the highest memory coefficient associated with the input text, retrieving the corresponding input examples, and conducting human evaluations to identify patterns. The study further explores intra-layer memory composition and inter-layer prediction refinement.

The concept of knowledge neurons, neurons that express a fact, is introduced by Dai et al. [8]. The authors propose a method to find the neurons that express facts and to analyze how their activations correlate with expressing these facts. Evaluations of pre-trained models on fill-in-the-blank cloze tasks show that these models are able to recall factual knowledge even without fine-tuning. The work considers feed-forward layers as key-value memories, hypothesizes that these key-value memories store factual knowledge, and proposes a knowledge attribution method. The knowledge attribution method, based on integrated gradients, evaluates the contribution of each neuron in the BERT-base-cased transformer to knowledge predictions by assigning it an attribution score. The neurons with higher gradients, i.e., attribution scores, are identified as those contributing to factual expressions. These neurons are further refined under the hypothesis that the same fact can share the same set of true-positive knowledge neurons; this refinement retains only those knowledge neurons that are shared by a certain percentage of input prompts.
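The key-value reading that both of these works build on is a re-arrangement of the ordinary feed-forward computation: each hidden dimension contributes its non-negative memory coefficient times one value vector. The following numerical check, with random weights standing in for a trained layer and the output bias omitted, is only meant to illustrate that equivalence.

```python
import torch

torch.manual_seed(0)
d_model, d_mem = 8, 32                 # token dimension and number of "memories" (FFN width)
W_K = torch.randn(d_mem, d_model)      # keys: rows of the first linear map
b_K = torch.randn(d_mem)
W_V = torch.randn(d_mem, d_model)      # values: rows of the second linear map
x = torch.randn(d_model)               # one token representation

# Standard feed-forward computation: f(x W1 + b1) W2  (output bias omitted).
coeffs = torch.relu(x @ W_K.T + b_K)   # unnormalized, non-negative memory coefficients
ffn_out = coeffs @ W_V

# Key-value memory reading: accumulate each value vector weighted by its coefficient.
memory_out = sum(c * v for c, v in zip(coeffs, W_V))

print(torch.allclose(ffn_out, memory_out, atol=1e-5))  # True: the two views coincide

# The largest coefficients point to the keys (pattern detectors) most activated by this input.
print(coeffs.topk(3).indices.tolist())
```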
The same exper- Google, that inspects trained models based on predic- iment is repeated, by keeping a set of target neurons tion and Seq2Seq-Vis [40], that can trace back prediction constant, on various datasets to reveal the illusion as decisions in Neural Machine Translation input models described by the authors. The work further explores the [13]. causes of this illusion by investigating local, global and Neural Memory Cells: Relating the patterns identi- dataset-level concepts. fied by human experts (NLP graduate students) to human understanding, the patterns are classified as shallow or semantic and are associated with lower layers and up- 6. Evaluations per layers of a 16-layer transformer model, respectively [8]. Further analysis of the corresponding values from Linguistic Phenomena: A layer-wise probing is con- the key-value memories complements the patterns ob- ducted to understand the redistribution of linguistic served in the respective keys. The agreement rate, the knowledge (syntactic chunking, POS, and semantic tag- fraction of memory cells that match the corresponding ging) when fine-tuned for downstream tasks [14]. Us- keys and values, is seen to increase in higher layers. The ing this probing across three fine-tuned models BERT, authors suggest that the memory cells in the higher lay- RoBERTa, and XLnet, on GLUE tasks and architectures ers contribute to the output whereas the lower layers reveal the following observations: The morpho-syntactic do not show such a clear key-value correlation to con- linguistic phenomenon that is preserved, post fine-tuning, tribute toward the output distribution of the next word. in the higher layers is dependent on the task; Different A qualitative analysis, by manually analyzing a few ran- architectures preserve linguistic information differently dom cases, is conducted on the layer-wise distribution of post fine-tuning. The neuron-wise probing further re- memory cells and how the model refines its prediction fines to the fine-grained neuron level, where the most from layer to layer using residual connections. The work salient neurons are extracted and their distribution across is an extension of Sukhbaatar et al. [37], which suggests architecture and variations in downstream tasks are stud- a theoretical similarity between feed-forward layers and ied. An alignment of findings is found with Merchant key-value memories. Additionally their observations, of et al. [38], where the fine-tuning affects only the top layer. shallow feature encoding, confirms with recent findings In comparison with Mosbach et al. [39], which is focused from Peters et al. [41], Jawahar et al. [42], Liu et al. [43]. on sentence level probing, Durrani et al. [14] studies core- The BERT-base-cased model is experimented with the linguistic phenomena. Additionally, their findings from knowledge attribution, where activation value is consid- fine-grained neuron analysis extend the core-linguistic ered as the attribution score for a neuron, to measure task layer-wise analysis, along with fine-tuning effects neuron sensitivity towards input. Similar observations on these neurons. Another interesting observation made to Geva et al. [12] and Tenney et al. [44] are identified: is the different patterns that are entailed when these net- fact-related neurons are distributed in the higher layers works are pruned from top or bottom. of the transformer. Further, the authors investigate how An ablation study conducted by Durrani et al. 
An ablation study conducted by Durrani et al. [15] on the top salient neurons, from the four pre-trained models ELMo, T-ELMo, BERT, and XLNet, indicates a higher distribution of linguistic information across the network when the underlying task is more complex (CCG supertagging), revealing information redundancy. A further refined study, considering only a minimal set of neurons in order to identify the network parts that predominantly capture the linguistic information and to understand the localization or distribution of this information, indicates that the number of neurons required to achieve the Oracle accuracy varies and depends on the complexity of the task. By employing a selectivity score next to the prediction accuracy score, and training separate POS probes on the actual dataset and on a control task, Alammar [16] observes that the activation space encodes POS information at levels comparable to BERT's hidden states. The non-negative matrix factorization method helps in identifying those patterns in neuron activations that correspond to syntactic and semantic properties of the input text. The NeuroX toolkit is compared with the What-If Tool from Google, which inspects trained models based on their predictions, and with Seq2Seq-Vis [40], which can trace prediction decisions back to the input in Neural Machine Translation models [13].

Neural Memory Cells: Relating the patterns identified by human experts (NLP graduate students) to human understanding, the patterns are classified as shallow or semantic and are associated with the lower and upper layers of a 16-layer transformer model, respectively [12]. A further analysis of the corresponding values from the key-value memories complements the patterns observed in the respective keys. The agreement rate, the fraction of memory cells whose value's top prediction matches the key's top trigger examples, is seen to increase in the higher layers. The authors suggest that the memory cells in the higher layers contribute to the output, whereas the lower layers do not show such a clear key-value correlation contributing to the output distribution of the next word. A qualitative analysis, by manually analyzing a few random cases, is conducted on the layer-wise distribution of memory cells and on how the model refines its prediction from layer to layer using residual connections. The work is an extension of Sukhbaatar et al. [37], which suggests a theoretical similarity between feed-forward layers and key-value memories. Additionally, the observation of shallow feature encoding in the lower layers agrees with recent findings from Peters et al. [41], Jawahar et al. [42], and Liu et al. [43].

The BERT-base-cased model is experimented with using the knowledge attribution method, where the activation value is considered as the attribution score for a neuron, to measure neuron sensitivity towards the input. Observations similar to Geva et al. [12] and Tenney et al. [44] are made: fact-related neurons are distributed in the higher layers of the transformer. Further, the authors investigate how these neurons contribute to expressing the knowledge by suppressing or amplifying their activations. Two additional use cases, updating facts and erasing relations, are presented, where the authors demonstrate the potential application of the identified knowledge neurons. Two evaluation metrics are used: the change and success rates for measuring fact updating, and inter/intra-relation perplexity for measuring the influence on other knowledge. These evaluations indicate that changes in very few neurons of the transformer can affect certain facts. Erasing facts is also measured using perplexity; after the fact-erasing operation, i.e., setting the knowledge neurons to zero vectors, the perplexity of the erased knowledge increases. The knowledge attribution method, built on integrated gradients, is inspired by Hao et al. [45] and Sundararajan et al. [46].
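For reference, the attribution scores discussed above are integrated gradients taken with respect to a layer's activation vector [46]. Below is a minimal, model-agnostic sketch of the Riemann approximation involved; the callable g stands in for "the rest of the network" mapping an activation vector to the probability of the correct answer and is an assumption for illustration, not the authors' implementation.

```python
import torch

def integrated_gradients_neurons(g, activation, steps=50):
    """Attribution_i ≈ activation_i * mean_k dg((k/steps) * activation)/da_i,
    i.e. integrated gradients along the straight path from an all-zeros baseline."""
    total_grad = torch.zeros_like(activation)
    for k in range(1, steps + 1):
        scaled = (k / steps) * activation.detach()
        scaled.requires_grad_(True)
        out = g(scaled)                      # scalar, e.g. probability of the correct answer
        grad, = torch.autograd.grad(out, scaled)
        total_grad += grad
    return activation * total_grad / steps

# Sanity check: for a linear g(a) = w . a the attribution reduces to w_i * a_i.
w = torch.tensor([0.5, -2.0, 1.0])
a = torch.tensor([1.0, 2.0, 3.0])
print(integrated_gradients_neurons(lambda z: (w * z).sum(), a))  # tensor([ 0.5000, -4.0000,  3.0000])
```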
Knowledge Illusion: A qualitative evaluation is conducted by annotating three sets of sentences for each neuron under consideration: 1) the top ten activating sentences for the neuron, 2) the top ten activating sentences in a random direction, and 3) ten random sentences [17]. The objective of this annotation is to find patterns, where a pattern is defined as a property shared by a set of sentences and is considered a proxy for a concept learned by the model. For each neuron under consideration, an average of 2.5 distinct patterns across the four datasets is observed. This illusion is further explored by studying the regions of the activation space that the input data occupies, the influence of the top activating sentences on patterns from both local semantic coherence and global directions, and annotation error. Qualitative analysis is conducted through visualization (UMAP dimensionality reduction), and it is observed that sentences cluster in accordance with their datasets. Additionally, the high accuracy with which a Support Vector Machine classifier distinguishes between these datasets provides quantitative evidence for this observation. This indicates that the information encoded within neurons depends on the idiosyncrasies of the natural language datasets, even when the neurons have similar activation values. The analysis of global directions in BERT's activation space using activation quantiles helps in understanding the correlation between word frequency change and its monotonicity in each combination of datasets. This correlation indicates that, despite BERT's illusionary effect, there still exist meaningful global directions in its activation space. The observed illusions are in alignment with previous work by Aharoni and Goldberg [47], who demonstrate the usage of BERT representations to disambiguate datasets. This explains the existence of dataset-specific patterns; further experiments are conducted to understand the cause of such patterns.
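The dataset-separability check used in this evaluation can be sketched in a few lines: project per-sentence activations with UMAP for visual inspection and train a linear SVM to quantify how well the datasets can be told apart. The sketch below assumes the umap-learn and scikit-learn packages and uses random placeholder activations and dataset labels in place of real BERT embeddings.

```python
import numpy as np
import umap
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder per-sentence activation vectors and the dataset each sentence came from.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, size=(200, 768)) for i in range(4)])
datasets = np.repeat(["QQP", "QNLI", "Wiki", "Books"], 200)

# 2-D UMAP projection for visual inspection of dataset clusters.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print("UMAP embedding shape:", embedding.shape)  # (800, 2)

# Quantitative check: a linear SVM that separates the datasets well indicates that the
# activations encode dataset idiosyncrasies, mirroring the SVM evidence described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, datasets, test_size=0.25, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("dataset-classification accuracy:", clf.score(X_te, y_te))
```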
We observe that all the methods reviewed so far fall under local interpretability methods and limit themselves to the top N salient neurons (see Table 1). From reviewing these studies, we observe that dimensionality reduction is required to understand the properties under consideration; dimensionality reduction is associated with information loss, and this loss is not accounted for in these studies. Another observation is that the focus of these studies alternates between identifying the neurons that capture the relevant linguistic information and identifying those subsets of neurons that affect the prediction accuracy. Moreover, some interpretability methods are evaluated through user studies (where users subjectively evaluate the explanations), whereas others are evaluated in terms of how well they satisfy certain properties, either quantitatively or qualitatively, without real users' evaluations. In the next section, we further discuss our observations and present our insights and future directions.

7. Insights and Future Directions

A common observation from the contemporary general surveys and from our focused review is the lack of both theoretical foundations and empirical considerations in evaluations [25, 23, 24]. Even though each method has quantitative measures for evaluation, there is no standard set of metrics for comparing the various observations, which confines the scope of the respective interpretability results to specific model architectures or task-related domains. Studies have proposed various desiderata for interpretable concepts, such as Fidelity, Diversity, and Grounding [48], for qualitative consistency. Additionally, a few studies employ human experts for qualitative analysis, such as pattern annotation and identification, but again lack a standard framework for a comparative study and consistent explanations. Moreover, the subjective nature of interpretability and the lack of ground truth in qualitative analysis make it even more challenging to evaluate these methods.

By reviewing the above works that focus on the activation space, we observe the following from the model perspective: for a fixed model architecture, when a fixed set of neurons is examined, each set of neurons encodes different information depending on the input dataset; on the contrary, when a wider set of model architectures is considered, the same set of neurons encodes similar information at lower and higher layers across these architectures, but the information encoded depends on the underlying task. These observations emphasize that interpreting the linguistic information encoded in the activation space depends on the input data and the underlying task.

The experiments conducted align with the definitions of interpretability and explainability in understanding the rationale behind a model's decisions, but they lack human-understandable explanations. In the context of explainability, we observe a gap between human-understandable linguistic concepts and the linguistic features captured in the network. We make a clear distinction between linguistic features and concepts: features consist of linguistic properties such as parts-of-speech, syntactic and semantic properties, and word morphology, whereas linguistic concepts, from a human-understandable perspective, encode general human knowledge and how it is expressed in natural language. Various contemporary methods, such as Concept Relevance Propagation [49], Testing with Concept Activation Vectors [50], and Integrated Conceptual Sensitivity [51], that are based on human-understandable local and global concept-based explanations exist. These methods are applied and evaluated in the image processing domain and are yet to be explored for understanding linguistic concepts. It is evident that exploring the activation space is a promising research direction, and we propose a potential future direction: extending these interpretability techniques from the image processing domain to the natural language processing domain through transfer learning.

Acknowledgments

The author would like to thank the anonymous reviewers for their helpful feedback. The work was partially funded by the German Federal Ministry of Education and Research (BMBF) through the project XAINES (01IW20005).

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). arXiv:1706.03762.
[2] M. Nasr, R. Shokri, A. Houmansadr, Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning, in: 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 739-753. doi:10.1109/SP.2019.00065.
[3] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, CoRR abs/1906.04341 (2019). arXiv:1906.04341.
[4] J. Vig, Y. Belinkov, Analyzing the structure of attention in a transformer language model, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 2019, pp. 63-76. doi:10.18653/v1/W19-4808.
[5] O. Press, N. A. Smith, O. Levy, Improving transformer models by reordering their sublayers, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 2996-3005. doi:10.18653/v1/2020.acl-main.270.
[6] B. Pulugundla, Y. Gao, B. King, G. Keskin, H. Mallidi, M. Wu, J. Droppo, R. Maas, Attention-based neural beamforming layers for multi-channel speech recognition, 2021. arXiv:2105.05920.
[7] H. Xu, Q. Liu, D. Xiong, J. van Genabith, Transformer with depth-wise LSTM, CoRR abs/2007.06257 (2020). arXiv:2007.06257.
[8] D. Dai, L. Dong, Y. Hao, Z. Sui, F. Wei, Knowledge neurons in pretrained transformers, CoRR abs/2104.08696 (2021). arXiv:2104.08696.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[10] A. Das, P. Rad, Opportunities and challenges in explainable artificial intelligence (XAI): A survey, CoRR abs/2006.11371 (2020). arXiv:2006.11371.
[11] S. Zhao, D. Pascual, G. Brunner, R. Wattenhofer, Of non-linearity and commutativity in BERT, CoRR abs/2101.04547 (2021). arXiv:2101.04547.
[12] M. Geva, R. Schuster, J. Berant, O. Levy, Transformer feed-forward layers are key-value memories, CoRR abs/2012.14913 (2020). arXiv:2012.14913.
[13] F. Dalvi, A. Nortonsmith, D. A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, J. Glass, NeuroX: A toolkit for analyzing individual neurons in neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
[14] N. Durrani, H. Sajjad, F. Dalvi, How transfer learning impacts linguistic knowledge in deep NLP models?, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 2021, pp. 4947-4957. doi:10.18653/v1/2021.findings-acl.438.
[15] N. Durrani, H. Sajjad, F. Dalvi, Y. Belinkov, Analyzing individual neurons in pre-trained language models, CoRR abs/2010.02695 (2020). arXiv:2010.02695.
[16] J. Alammar, Ecco: An open source library for the explainability of transformer language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 2021, pp. 249-257. doi:10.18653/v1/2021.acl-demo.30.
[17] T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. B. Viégas, M. Wattenberg, An interpretability illusion for BERT, CoRR abs/2104.07143 (2021). arXiv:2104.07143.
[18] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138-52160.
[19] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys 51 (2018). doi:10.1145/3236009.
[20] V. Arya, R. K. E. Bellamy, P. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilovic, S. Mourad, P. Pedemonte, R. Raghavendra, J. T. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, Y. Zhang, One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques, CoRR abs/1909.03012 (2019). arXiv:1909.03012.
[21] P. Linardatos, V. Papastefanopoulos, S. Kotsiantis, Explainable AI: A review of machine learning interpretability methods, Entropy 23 (2021). doi:10.3390/e23010018.
[22] A. Krajna, M. Kovac, M. Brcic, A. Šarčević, Explainable artificial intelligence: An updated perspective, in: 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), 2022, pp. 859-864. doi:10.23919/MIPRO55190.2022.9803681.
[23] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49-72. doi:10.1162/tacl_a_00254.
[24] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 2020, pp. 447-459.
[25] H. Sajjad, N. Durrani, F. Dalvi, Neuron-level interpretation of deep NLP models: A survey, CoRR abs/2108.13138 (2021). arXiv:2108.13138.
[26] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, J. R. Glass, On the linguistic representational power of neural machine translation models, CoRR abs/1911.00317 (2019). arXiv:1911.00317.
[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. A. Specter, L. Kagal, Explaining explanations: An approach to evaluating interpretability of machine learning, CoRR abs/1806.00069 (2018). arXiv:1806.00069.
[28] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, CoRR abs/1706.07269 (2017). arXiv:1706.07269.
[29] B. Kim, R. Khanna, O. O. Koyejo, Examples are not enough, learn to criticize! Criticism for interpretability, in: Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016.
[30] P. Pezeshkpour, Y. Tian, S. Singh, Investigating robustness and interpretability of link prediction via adversarial modifications, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 3336-3347. doi:10.18653/v1/N19-1337.
[31] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 1101-1111. doi:10.18653/v1/N18-1100.
[32] D. Croce, D. Rossini, R. Basili, Auditing deep learning processes through kernel-based explanatory models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 4037-4046. doi:10.18653/v1/D19-1415.
[33] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014. arXiv:1409.0473.
[34] D. Hupkes, S. Veldhoen, W. Zuidema, Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure, 2017. arXiv:1711.10203.
[35] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single vector: Probing sentence embeddings for linguistic properties, CoRR abs/1805.01070 (2018). arXiv:1805.01070.
[36] Y. Belinkov, J. R. Glass, Analysis methods in neural language processing: A survey, CoRR abs/1812.08951 (2018). arXiv:1812.08951.
[37] S. Sukhbaatar, E. Grave, G. Lample, H. Jégou, A. Joulin, Augmenting self-attention with persistent memory, CoRR abs/1907.01470 (2019). arXiv:1907.01470.
[38] A. Merchant, E. Rahimtoroghi, E. Pavlick, I. Tenney, What happens to BERT embeddings during fine-tuning?, in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Online, 2020, pp. 33-44. doi:10.18653/v1/2020.blackboxnlp-1.4.
[39] M. Mosbach, A. Khokhlova, M. A. Hedderich, D. Klakow, On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020, pp. 2502-2516. doi:10.18653/v1/2020.findings-emnlp.227.
[40] H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, A. M. Rush, Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models, CoRR abs/1804.09299 (2018). arXiv:1804.09299.
[41] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 2227-2237. doi:10.18653/v1/N18-1202.
[42] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651-3657. doi:10.18653/v1/P19-1356.
[43] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, N. A. Smith, Linguistic knowledge and transferability of contextual representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 1073-1094. doi:10.18653/v1/N19-1112.
[44] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593-4601. doi:10.18653/v1/P19-1452.
[45] Y. Hao, L. Dong, F. Wei, K. Xu, Self-attention attribution: Interpreting information interactions inside transformer, 2020. arXiv:2004.11207.
[46] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 3319-3328.
[47] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 7747-7763. doi:10.18653/v1/2020.acl-main.692.
[48] D. Alvarez Melis, T. Jaakkola, Towards robust interpretability with self-explaining neural networks, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
[49] R. Achtibat, M. Dreyer, I. Eisenbraun, S. Bosse, T. Wiegand, W. Samek, S. Lapuschkin, From "where" to "what": Towards human-understandable explanations through concept relevance propagation, 2022. arXiv:2206.03208.
[50] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), 2017. arXiv:1711.11279.
[51] J. Schrouff, S. Baur, S. Hou, D. Mincu, E. Loreaux, R. Blanes, J. Wexler, A. Karthikesalingam, B. Kim, Best of both worlds: Local and global explanations with human-understandable concepts, CoRR abs/2106.08641 (2021). arXiv:2106.08641.

A. Evaluation Metrics Definitions

The definitions below summarize the evaluation metrics used across Section 6; an illustrative sketch of three of them follows the list.

• Selectivity: The difference between the linguistic task accuracy and the control task accuracy.
• Prediction Accuracy: The performance measure of the model on a given task.
• Agreement Rate: The fraction of memory cells (dimensions) where the value's top prediction matches the key's top trigger example.
• Value Probability: The probability of the value's top prediction.
• Projection Score: The dot product between a sentence embedding and a direction.
• Activation Quantile: An equally sized smaller subsection of the activation space.
• Word Frequency Correlation: The correlation between directions and words in the embedding space.
• Attribution Score: A measure of the contribution of a neuron to factual expressions.
• Perplexity: A measure of how well a probability model predicts a sample, i.e., the degree of 'uncertainty' the model has in its predictions.
• Change Rate: The ratio at which the original prediction is modified to another prediction.
• Success Rate: The ratio at which the updated (learned) prediction becomes the top prediction.
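As a purely illustrative companion to these definitions, the small sketch below computes three of the metrics (selectivity, change rate, and success rate) from hypothetical probe and model predictions; the function names and toy values are not taken from the reviewed works.

```python
from typing import Sequence

def selectivity(task_accuracy: float, control_accuracy: float) -> float:
    # Linguistic-task accuracy minus control-task accuracy.
    return task_accuracy - control_accuracy

def change_rate(original_preds: Sequence[str], edited_preds: Sequence[str]) -> float:
    # Fraction of prompts whose prediction changed after editing the knowledge neurons.
    return sum(o != e for o, e in zip(original_preds, edited_preds)) / len(original_preds)

def success_rate(edited_preds: Sequence[str], target_answers: Sequence[str]) -> float:
    # Fraction of prompts where the intended (updated) answer became the top prediction.
    return sum(e == t for e, t in zip(edited_preds, target_answers)) / len(edited_preds)

# Toy values only, for illustration.
print(selectivity(0.91, 0.62))                            # ≈ 0.29
print(change_rate(["Paris", "Rome"], ["Lyon", "Rome"]))   # 0.5
print(success_rate(["Lyon", "Rome"], ["Lyon", "Milan"]))  # 0.5
```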