Interpretability in Activation Space Analysis of Transformers: A Focused Survey

Soniya Vijayakumar (soniya.vijayakumar@dfki.de)
German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Saarland, Germany

Abstract
The field of natural language processing has reached breakthroughs with the advent of transformers. They have remained state-of-the-art since then, and there has also been much research in analyzing, interpreting, and evaluating the attention layers and the underlying embedding space. In addition to the self-attention layers, the feed-forward layers are a prominent architectural component of the transformer, yet their role remains under-explored. We focus on the latent space, known as the Activation Space, that consists of the neuron activations from these feed-forward layers. In this survey paper, we review interpretability methods that examine the learnings that occur in this activation space. Since there exists only limited research in this direction, we conduct a detailed examination of each work and point out potential future directions of research. We hope our work provides a step towards strengthening activation space analysis.

Keywords: explainability, interpretability, machine learning, activation space analysis, linguistic information, transformers, feed-forward layers

1. Introduction

There is ample evidence that transformers have established themselves as the state of the art in various Natural Language Processing (NLP) tasks since their conception and realization in 2017 [1]. BERT, the most well-known transformer language model [9], consists of two major architectural components: self-attention layers and feed-forward layers. Much work has been done in analyzing the functions of self-attention layers [2, 3, 4]. In our survey, we focus on the interpretability of the feed-forward layers. Each layer in the encoder and decoder contains a fully connected position-wise feed-forward network, which consists of two linear transformations with a rectified linear activation function in between. Even though existing works highlight the importance of these feed-forward layers in transformers [5, 6, 7], to date their role remains under-explored [8]. Our review focuses on research that uses interpretability methods to understand the learnings in these feed-forward layers. We define the latent space that comprises the activations extracted from these layers as the Activation Space. Many methods already exist for aggregating these representations, including the default Hugging Face pipeline (https://huggingface.co/) used in the original BERT paper [9].

Several methods for explaining and interpreting deep neural networks have been devised, and we observe that much of the focus is in the domain of image processing [10]. A persistent challenge is the gap between the low-level features that neural networks compute and the high-level concepts that are human-understandable. Furthermore, we observe that relatively few research methods have been applied to understanding the internal learnings of networks in comparison to analyzing the functions of self-attention layers.

The core focus of our review is directed towards methods that unfold the learnings in the internal representations of the neural network, i.e., we look at those methods that answer the question: "What does the model learn?" We further refine our focus to understanding specifically the feed-forward layers in transformer models. The motivation for this study is two-fold:

• The inputs undergo a non-linear transformation when passing through the activation functions in the feed-forward layers of deep neural networks [11].
• The parameters in the position-wise feed-forward layers of the transformer account for two-thirds of the total model's parameters (8d² per layer, where d is the model's hidden dimension). This also implies that a considerable computational budget is involved in training these parameters to achieve the state-of-the-art performance they deliver today [12].

From recent research, the methods that focus on understanding the feed-forward layers show substantial evidence that the feed-forward layer activation space embeds useful information (see Section 5). We find that the learnings in the feed-forward layers remain under-explored. With our methodological survey, our objective is to understand the internal mechanisms of transformers by exploring the activation space of the feed-forward network. Further, we consider this paper a focused starting point for facilitating future research in activation space analysis. Finally, we also conduct a comparative study of these methods and their evaluation techniques and report our observations, understandings, and potential future directions (see Section 7). Table 1 summarizes the methods and attributes that we have explored.

Table 1: Major attributes of the methods explored in the activation space analysis methods.

Linguistic Phenomena [13, 14, 15, 16]
  Properties: Word Morphology, Lexical Semantics, Sentence Length, Parts-of-Speech
  NLP Tasks: Parts-of-Speech, Semantic and Syntax Tagging and Prediction, Syntactic Chunking
  Quantitative Evaluation: Sensitivity, Prediction Accuracy, Selectivity Score
  Qualitative Evaluation: Human-expert visual inspection of selected neurons

Neural Memory Cells [12, 8]
  Properties: Vocabulary Distribution, Human-Interpretable Patterns, Factual Knowledge
  NLP Tasks: Next Sequence Prediction, Fill-in-the-blank Cloze Task
  Quantitative Evaluation: Agreement Rate, Prediction Probability, Attribution Score, Perplexity, Change and Success Rate
  Qualitative Evaluation: Pattern search by human experts

Knowledge Illusion [17]
  Properties: Lexical and Geometric Properties (Local Semantic Coherence)
  NLP Tasks: Next Sequence Prediction
  Quantitative Evaluation: Projection Score, Activation Quantile, Word Frequency Correlation
  Qualitative Evaluation: Human annotations for patterns using visualization
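To make the object of study concrete, the following is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, of the position-wise feed-forward block described above and of one way to collect the activation space, i.e., the intermediate neuron activations of every encoder layer, via forward hooks. The model name, the `intermediate` sub-module layout, and the tensor shapes reflect the standard BERT implementation and are meant as an illustration rather than the exact setup used by the reviewed works.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Position-wise feed-forward block of a transformer layer:
# FFN(x) = f(x W1 + b1) W2 + b2, with W1 in R^{d x 4d} and W2 in R^{4d x d},
# i.e. roughly 8d^2 parameters per layer.
class FeedForward(torch.nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_hidden)
        self.w2 = torch.nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU as in the original transformer; BERT-style models use GELU instead.
        return self.w2(torch.relu(self.w1(x)))

# Collect the "activation space" of a pre-trained model with forward hooks.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

activations = {}  # layer index -> tensor of shape (batch, seq_len, 4d)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        activations[layer_idx] = output.detach()
    return hook

# In the Hugging Face BERT implementation, each encoder layer exposes its
# feed-forward expansion (Linear + GELU) as the `intermediate` sub-module.
for i, layer in enumerate(model.encoder.layer):
    layer.intermediate.register_forward_hook(make_hook(i))

inputs = tokenizer("The feed-forward layers form the activation space.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print({i: tuple(a.shape) for i, a in activations.items()})  # e.g. (1, seq_len, 3072) per layer
```

The per-token vectors gathered this way are the raw material that the methods surveyed below probe, rank, and annotate.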
2. Related Surveys

As interest in the Explainable Artificial Intelligence (XAI) field grows, various survey articles have been published, trying to consolidate and categorize the approaches. We segregate the reviews into two categories: surveys that give a general overview of existing explainability methods [18, 19, 20, 21, 22] and surveys that focus on explainability methods in the NLP domain. We narrow our scope to the NLP domain as this is the core focus of this review paper.

A survey that acts as a prior to ours is from Belinkov and Glass [23], where the authors review the various analysis methods used to conduct novel and fine-grained neural network interpretation and evaluation. The primary question that has been relevant while formulating these interpretation methods is: what linguistic information is captured in neural networks? The authors emphasize three aspects of language-specific analysis, namely, the methods used for conducting the analysis, the linguistic information sought, and the neural network parts investigated. They also identify several gaps and limitations in the surveyed work.

Danilevsky et al. [24] present a broader overview of the state of XAI over a span of seven years (until 2020), with a focus on the NLP domain. This work concentrates on outcome explanation problems, which help end users understand the model's operation and thereby build trust in NLP-based AI systems. Along with the high-level classification of explanations, the work introduces two additional aspects: techniques that derive the explanation and techniques that present it to the end user. The explainability techniques are categorized into feature importance, surrogate models, example-driven, provenance-based, and declarative induction. A set of operations, such as first-derivative salience, layer-wise relevance propagation, input perturbations, the attention mechanism, Long Short-Term Memory (LSTM) gating signals, and explainability-aware architectures, enables explainability. An interesting observation is the consideration of adding attention layers to neural network architectures as a strategy to enable explanations.

The closest survey related to our work is from Sajjad et al. [25], which surveys fine-grained neuron analysis. While two previous surveys cover Concept Analysis [26] and Attribution Analysis [24], Sajjad et al. focus on analyzing individual neurons to better understand the inner workings of neural networks. They refer to this as Neuron Analysis and categorize the reviewed methods into visualization, corpus-based, neuron-probing, and unsupervised methods. The work further discusses findings and applications of neuron interpretation and summarizes open issues.

We observe that, across the various existing surveys, there are different dimensions to be considered. We narrow our survey down to the following dimensions:

• Analysis methods that focus on the internal interpretation of the activation space.
• Linguistic information, such as parts-of-speech, syntactic, and semantic properties, and non-linguistic information, such as sentence length, factual knowledge, and geometric properties.
• The neural network objects, neurons and their activations, viewed as the Activation Space in the transformer language model.

We believe that interpretability alone is not sufficient for understanding the inner workings of transformers; we also need explainability to summarize the reasons for the model's behaviour in a human-comprehensible manner. One has to keep in mind that explainability and interpretability have distinguishable meanings [27]; our review focuses only on interpretability methods because the research works reviewed focus on the same.
3. Survey Methodology

Our survey aims to cover the advances in NLP XAI research focusing on neuron interpretation. As defined earlier, we refer to this latent dimension as the Activation Space and consider the reviewed techniques as Activation Space Analysis methods. We filter to those methods that work at the feed-forward neuron level, whether individual or global, within the transformer model. We identified relevant papers published in NLP and AI conferences (AAAI, ACL, IJCNLP, EMNLP) between 2018 and 2022. With the limited scope of neuron-level analysis, we arrived at seven contemporary papers. Given the limited number of works in this direction, we decided to take a deeper look into each of these methods, analyze their benefits, limitations, and gaps, and present this study as our review paper. We are aware that this is an ongoing and relatively new research field and that our focus is extremely limited; we acknowledge that we might have omitted certain papers. We also assume that if the authors have focused on explainability, they are more likely to cover the relevant related taxonomies, categories, and methods. Another common observation is that explanations are generated in an NLP task-oriented setting and remain relevant to the task context. Even though we summarize the tasks on which these works are based, the task definitions are not relevant to our review process of understanding the activation space.

4. Taxonomies and Categorization

There still exists a reasonably vague understanding of, and a lack of concrete mathematical definitions for, the two commonly used terms explainability and interpretability. Interpretability has been defined as "the degree to which a human can understand the cause of a decision" [28] or the degree to which a human can consistently predict the model's result [29]. A broader definition exists for the term interpretable machine learning as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in the data or learned by the model. This definition focuses on understanding what the model learns, either from an input-output mapping perspective or in terms of what the model itself learns. Explainability, on the other hand, directs the focus back to human understanding by examining the relationship between input features and model predictions in a human-understandable format [21].

After reviewing numerous relevant existing works, we observed that explainability techniques broadly fall into three major classes. The first differentiates between understanding a model's individual prediction process and understanding its prediction process as a whole [24]. A second differentiation is made between self-explaining and post-hoc methods, where the former generates explanations along with the model's prediction process, whereas the latter requires post-processing of elements extracted during the model's prediction process. The third major distinction is between methods that are model-specific and those that are model-agnostic in nature. We also observed the existence of various other categorizations, such as outcome-based explanations, visual explanation methods, operations, and conceptual vs. attribution methods. Visualization methods play a salient role in further understanding any interpretation method [30, 31, 32, 33]. These methods are inherent to interpretability and have been widely reviewed; we leave it to the reader to explore the relevant literature.
5. Activation Space Analysis Methods

Two types of interpretability analysis are carried out in the related research work: 1) analyzing individual neurons and 2) analyzing the entire set of neurons of the feed-forward layer. We look into both approaches from four perspectives: categorization, linguistic knowledge sought, methodology, and evaluation, and conduct a comparative analysis of these methods.

Linguistic Phenomena: Investigating the linguistic phenomena that occur within the activations of pre-trained models, when trained for a specific task set, using various interpretability analysis methods is a common way to interpret the features learned by these models. Linguistic phenomena here refer to the presence of various linguistic features, such as word morphology, lexical semantics, and syntax, or of linguistic knowledge, such as parts-of-speech, grammar, coreference, and lemmas. Linguistic Correlation Analysis (LCA) is one such method; it focuses on understanding what the model learned about linguistic features and on determining the neurons that explicitly focus on such phenomena. A toolkit with three major methods, Individual Model Analysis, Cross-model Analysis, and LCA, to identify salient neurons within the model or related to a task under consideration, is presented by Dalvi et al. [13].

Probing with diagnostic classifiers to understand the knowledge captured in neural representations is another common method for associating model components with linguistic properties [34, 35, 36]. It involves extracting feature representations from the network and training an auxiliary classifier to predict the linguistic property. Layer-wise and neuron-level diagnostic classifiers, which respectively probe representations from individual layers w.r.t. linguistic properties and find neurons that capture salient features, are used to analyze the pre-trained models BERT, RoBERTa, and XLNet [14]. A task of predicting a certain linguistic property is defined, and a diagnostic classifier (logistic regression) is trained on the generated activations, for both layer-wise and neuron-wise probes, to predict the existence of this linguistic property. An LCA is conducted to generate a neuron ranking based on the weight distribution. Additionally, the elastic-net regularization is tuned using grid search to balance between focused and distributed neurons. The top N salient neurons extracted from this ranked list are used to retrain the classifier until an Oracle accuracy is achieved.
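As an illustration of this probe-and-rank recipe, the following sketch trains a logistic-regression probe with elastic-net regularization on pre-extracted activations and ranks neurons by the weight mass assigned to them; it is a simplified stand-in for the LCA-style ranking described above, not the reviewed toolkits' implementation. The arrays X and y are random placeholders for activations and linguistic labels, and the top-N retraining step only mimics the oracle-accuracy loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one row of feed-forward activations per token; y: its linguistic label (e.g. a POS tag).
# Random placeholders stand in for activations extracted as in the sketch of Section 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3072))
y = rng.integers(0, 5, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Elastic-net regularized probe: l1_ratio trades off focused (sparse) vs distributed neurons.
probe = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=500)
probe.fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Rank neurons by the absolute weight mass they receive across classes, then retrain
# on the top-N neurons only and compare against the full-probe ("oracle") accuracy.
saliency = np.abs(probe.coef_).sum(axis=0)
top_n = np.argsort(saliency)[::-1][:100]
probe_top = LogisticRegression(max_iter=500).fit(X_tr[:, top_n], y_tr)
print("top-100-neuron accuracy:", probe_top.score(X_te[:, top_n], y_te))
```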
Durrani et al. [15] and Alammar [16] conduct similar experiments, where the entire set of neuron activations from the feed-forward layers is used to train an external classifier. Durrani et al. [15] use a probing classifier (logistic regression) with the additional elastic-net regularization to conduct a fine-grained neuron-level analysis of the pre-trained models ELMo, T-ELMo, BERT, and XLNet. This variety of models covers different modeling choices of the building blocks, optimization objectives, and model architectures. The case study conducted by Alammar [16] probes the feed-forward neuron activations for Parts-of-Speech (POS) information. A control task is created in which each token is assigned a random POS tag and a separate probe is trained on this control set. This allows measuring the difference in prediction accuracy between the actual and control datasets, the selectivity score, and thereby concluding whether the probe really extracts POS information. The author collects existing methods that examine input saliency, hidden state evolution, neuron activations, and non-negative matrix factorization of neuron activations, along with dimensionality reduction methods to extract patterns, into an open-source library known as Ecco [16]. These methods can be directly employed on pre-trained models such as GPT-2, BERT, and RoBERTa.

Neural Memory Cells: In the context of a neural network with a recurrent attention model, Sukhbaatar et al. [37] introduced input and output memory representations. A recent work extends this neural memory concept and shows that the feed-forward layers in transformer models operate as key-value memories, where keys correlate with specific human-interpretable input pattern sets while, simultaneously, values induce a distribution over the output vocabulary [12]. The work analyzes these memories present in the feed-forward layers and further explores the function of these layers in transformer-based language models.

A neural memory is defined as a key-value pair, where each key and value is a d-dimensional vector. The emulation, i.e., the mathematical similarity between feed-forward layers and key-value neural memories, allows the hidden dimension to be viewed as the number of memories in each layer and the activations as vectors of unnormalized, non-negative memory coefficients. Using this similarity, the study posits that the key vectors act as pattern detectors. This hypothesis is tested by looking for the highest memory coefficient associated with the input text, retrieving the corresponding input examples, and conducting human evaluations to identify patterns. The study further explores intra-layer memory composition and inter-layer prediction refinement.

The concept of knowledge neurons, neurons that express a fact, is introduced by Dai et al. [8]. The authors propose a method to find the neurons that express facts and to analyze how their activations correlate with expressing these facts. Evaluations of pre-trained models on fill-in-the-blank cloze tasks show that these models are able to recall factual knowledge even without fine-tuning. The work considers feed-forward layers as key-value memories, hypothesizes that these key-value memories store factual knowledge, and proposes a knowledge attribution method. The knowledge attribution method, based on integrated gradients, evaluates the contribution of each neuron in the BERT-base-cased transformer to knowledge predictions by assigning it an attribution score. The neurons with higher gradients, i.e., attribution scores, are identified as those contributing to factual expressions. These neurons are further refined under the hypothesis that the same fact can share the same set of true-positive knowledge neurons; this refinement retains only those knowledge neurons that are shared by a certain percentage of input prompts.
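The key-value reading that both of these works build on is a re-arrangement of the ordinary feed-forward computation: each hidden dimension contributes its non-negative memory coefficient times one value vector. The following numerical check, with random weights standing in for a trained layer and the output bias omitted, is only meant to illustrate that equivalence.

```python
import torch

torch.manual_seed(0)
d_model, d_mem = 8, 32                 # token dimension and number of "memories" (FFN width)
W_K = torch.randn(d_mem, d_model)      # keys: rows of the first linear map
b_K = torch.randn(d_mem)
W_V = torch.randn(d_mem, d_model)      # values: rows of the second linear map
x = torch.randn(d_model)               # one token representation

# Standard feed-forward computation: f(x W1 + b1) W2  (output bias omitted).
coeffs = torch.relu(x @ W_K.T + b_K)   # unnormalized, non-negative memory coefficients
ffn_out = coeffs @ W_V

# Key-value memory reading: accumulate each value vector weighted by its coefficient.
memory_out = sum(c * v for c, v in zip(coeffs, W_V))

print(torch.allclose(ffn_out, memory_out, atol=1e-5))  # True: the two views coincide

# The largest coefficients point to the keys (pattern detectors) most activated by this input.
print(coeffs.topk(3).indices.tolist())
```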
The same exper- Google, that inspects trained models based on predic- iment is repeated, by keeping a set of target neurons tion and Seq2Seq-Vis [40], that can trace back prediction constant, on various datasets to reveal the illusion as decisions in Neural Machine Translation input models described by the authors. The work further explores the [13]. causes of this illusion by investigating local, global and Neural Memory Cells: Relating the patterns identi- dataset-level concepts. fied by human experts (NLP graduate students) to human understanding, the patterns are classified as shallow or semantic and are associated with lower layers and up- 6. Evaluations per layers of a 16-layer transformer model, respectively [8]. Further analysis of the corresponding values from Linguistic Phenomena: A layer-wise probing is con- the key-value memories complements the patterns ob- ducted to understand the redistribution of linguistic served in the respective keys. The agreement rate, the knowledge (syntactic chunking, POS, and semantic tag- fraction of memory cells that match the corresponding ging) when fine-tuned for downstream tasks [14]. Us- keys and values, is seen to increase in higher layers. The ing this probing across three fine-tuned models BERT, authors suggest that the memory cells in the higher lay- RoBERTa, and XLnet, on GLUE tasks and architectures ers contribute to the output whereas the lower layers reveal the following observations: The morpho-syntactic do not show such a clear key-value correlation to con- linguistic phenomenon that is preserved, post fine-tuning, tribute toward the output distribution of the next word. in the higher layers is dependent on the task; Different A qualitative analysis, by manually analyzing a few ran- architectures preserve linguistic information differently dom cases, is conducted on the layer-wise distribution of post fine-tuning. The neuron-wise probing further re- memory cells and how the model refines its prediction fines to the fine-grained neuron level, where the most from layer to layer using residual connections. The work salient neurons are extracted and their distribution across is an extension of Sukhbaatar et al. [37], which suggests architecture and variations in downstream tasks are stud- a theoretical similarity between feed-forward layers and ied. An alignment of findings is found with Merchant key-value memories. Additionally their observations, of et al. [38], where the fine-tuning affects only the top layer. shallow feature encoding, confirms with recent findings In comparison with Mosbach et al. [39], which is focused from Peters et al. [41], Jawahar et al. [42], Liu et al. [43]. on sentence level probing, Durrani et al. [14] studies core- The BERT-base-cased model is experimented with the linguistic phenomena. Additionally, their findings from knowledge attribution, where activation value is consid- fine-grained neuron analysis extend the core-linguistic ered as the attribution score for a neuron, to measure task layer-wise analysis, along with fine-tuning effects neuron sensitivity towards input. Similar observations on these neurons. Another interesting observation made to Geva et al. [12] and Tenney et al. [44] are identified: is the different patterns that are entailed when these net- fact-related neurons are distributed in the higher layers works are pruned from top or bottom. of the transformer. Further, the authors investigate how An ablation study conducted by Durrani et al. 
An ablation study conducted by Durrani et al. [15] on the top salient neurons, from the four pre-trained models ELMo, T-ELMo, BERT, and XLNet, indicates a higher distribution of linguistic information across the network when the underlying task is more complex (CCG supertagging), revealing information redundancy. A further refined study, considering only a minimal set of neurons in order to identify the network parts that predominantly capture the linguistic information and to understand the localization or distribution of this information, indicates that the number of neurons required to achieve the Oracle accuracy varies and depends on the complexity of the task. By employing a selectivity score next to the prediction accuracy score, and training separate POS probes on the actual dataset and on a control task, Alammar [16] observes that the activation space encodes POS information at levels comparable to BERT's hidden states. The non-negative matrix factorization method helps in identifying those patterns in neuron activations that correspond to syntactic and semantic properties of the input text. The NeuroX toolkit is compared with the What-If Tool from Google, which inspects trained models based on their predictions, and with Seq2Seq-Vis [40], which can trace prediction decisions back to the input in Neural Machine Translation models [13].

Neural Memory Cells: Relating the patterns identified by human experts (NLP graduate students) to human understanding, the patterns are classified as shallow or semantic and are associated with the lower and upper layers of a 16-layer transformer model, respectively [12]. A further analysis of the corresponding values from the key-value memories complements the patterns observed in the respective keys. The agreement rate, the fraction of memory cells whose value's top prediction matches the key's top trigger examples, is seen to increase in the higher layers. The authors suggest that the memory cells in the higher layers contribute to the output, whereas the lower layers do not show such a clear key-value correlation contributing to the output distribution of the next word. A qualitative analysis, by manually analyzing a few random cases, is conducted on the layer-wise distribution of memory cells and on how the model refines its prediction from layer to layer using residual connections. The work is an extension of Sukhbaatar et al. [37], which suggests a theoretical similarity between feed-forward layers and key-value memories. Additionally, the observation of shallow feature encoding in the lower layers agrees with recent findings from Peters et al. [41], Jawahar et al. [42], and Liu et al. [43].

The BERT-base-cased model is experimented with using the knowledge attribution method, where the activation value is considered as the attribution score for a neuron, to measure neuron sensitivity towards the input. Observations similar to Geva et al. [12] and Tenney et al. [44] are made: fact-related neurons are distributed in the higher layers of the transformer. Further, the authors investigate how these neurons contribute to expressing the knowledge by suppressing or amplifying their activations. Two additional use cases, updating facts and erasing relations, are presented, where the authors demonstrate the potential application of the identified knowledge neurons. Two evaluation metrics are used: the change and success rates for measuring fact updating, and inter/intra-relation perplexity for measuring the influence on other knowledge. These evaluations indicate that changes in very few neurons of the transformer can affect certain facts. Erasing facts is also measured using perplexity; after the fact-erasing operation, i.e., setting the knowledge neurons to zero vectors, the perplexity of the erased knowledge increases. The knowledge attribution method, built on integrated gradients, is inspired by Hao et al. [45] and Sundararajan et al. [46].
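For reference, the attribution scores discussed above are integrated gradients taken with respect to a layer's activation vector [46]. Below is a minimal, model-agnostic sketch of the Riemann approximation involved; the callable g stands in for "the rest of the network" mapping an activation vector to the probability of the correct answer and is an assumption for illustration, not the authors' implementation.

```python
import torch

def integrated_gradients_neurons(g, activation, steps=50):
    """Attribution_i ≈ activation_i * mean_k dg((k/steps) * activation)/da_i,
    i.e. integrated gradients along the straight path from an all-zeros baseline."""
    total_grad = torch.zeros_like(activation)
    for k in range(1, steps + 1):
        scaled = (k / steps) * activation.detach()
        scaled.requires_grad_(True)
        out = g(scaled)                      # scalar, e.g. probability of the correct answer
        grad, = torch.autograd.grad(out, scaled)
        total_grad += grad
    return activation * total_grad / steps

# Sanity check: for a linear g(a) = w . a the attribution reduces to w_i * a_i.
w = torch.tensor([0.5, -2.0, 1.0])
a = torch.tensor([1.0, 2.0, 3.0])
print(integrated_gradients_neurons(lambda z: (w * z).sum(), a))  # tensor([ 0.5000, -4.0000,  3.0000])
```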
Knowledge Illusion: A qualitative evaluation is conducted by annotating three sets of sentences for each neuron under consideration: 1) the top ten activating sentences for the neuron, 2) the top ten activating sentences in a random direction, and 3) ten random sentences [17]. The objective of this annotation is to find patterns, where a pattern is defined as a property shared by a set of sentences and is considered a proxy for a concept learned by the model. For each neuron under consideration, an average of 2.5 distinct patterns across the four datasets is observed. This illusion is further explored by studying the regions of the activation space that the input data occupies, the influence of the top activating sentences on patterns from both local semantic coherence and global directions, and annotation error. Qualitative analysis is conducted through visualization (UMAP dimensionality reduction), and it is observed that sentences cluster in accordance with their datasets. Additionally, the high accuracy with which a Support Vector Machine classifier distinguishes between these datasets provides quantitative evidence for this observation. This indicates that the information encoded within neurons depends on the idiosyncrasies of the natural language datasets, even when the neurons have similar activation values. The analysis of global directions in BERT's activation space using activation quantiles helps in understanding the correlation between word frequency change and its monotonicity in each combination of datasets. This correlation indicates that, despite BERT's illusionary effect, there still exist meaningful global directions in its activation space. The observed illusions are in alignment with previous work by Aharoni and Goldberg [47], who demonstrate the usage of BERT representations to disambiguate datasets. This explains the existence of dataset-specific patterns; further experiments are conducted to understand the cause of such patterns.
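The dataset-separability check used in this evaluation can be sketched in a few lines: project per-sentence activations with UMAP for visual inspection and train a linear SVM to quantify how well the datasets can be told apart. The sketch below assumes the umap-learn and scikit-learn packages and uses random placeholder activations and dataset labels in place of real BERT embeddings.

```python
import numpy as np
import umap
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder per-sentence activation vectors and the dataset each sentence came from.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, size=(200, 768)) for i in range(4)])
datasets = np.repeat(["QQP", "QNLI", "Wiki", "Books"], 200)

# 2-D UMAP projection for visual inspection of dataset clusters.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print("UMAP embedding shape:", embedding.shape)  # (800, 2)

# Quantitative check: a linear SVM that separates the datasets well indicates that the
# activations encode dataset idiosyncrasies, mirroring the SVM evidence described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, datasets, test_size=0.25, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("dataset-classification accuracy:", clf.score(X_te, y_te))
```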
We observe that all the methods reviewed so far fall under local interpretability methods and limit themselves to the top N salient neurons (see Table 1). From reviewing these studies, we observe that dimensionality reduction is required to understand the properties under consideration; dimensionality reduction is associated with information loss, and this loss is not accounted for in these studies. Another observation is that the focus of these studies alternates between identifying the neurons that capture the relevant linguistic information and identifying those subsets of neurons that affect the prediction accuracy. Moreover, some interpretability methods are evaluated through user studies (where users subjectively evaluate the explanations), whereas others are evaluated in terms of how well they satisfy certain properties, either quantitatively or qualitatively, without real users' evaluations. In the next section, we further discuss our observations and present our insights and future directions.

7. Insights and Future Directions

A common observation from the contemporary general surveys and from our focused review is the lack of both theoretical foundations and empirical considerations in evaluations [25, 23, 24]. Even though each method has quantitative measures for evaluation, there is no standard set of metrics for comparing the various observations, which confines the scope of the respective interpretability results to specific model architectures or task-related domains. Studies have proposed various desiderata for interpretable concepts, such as Fidelity, Diversity, and Grounding [48], for qualitative consistency. Additionally, a few studies employ human experts for qualitative analysis, such as pattern annotation and identification, but again lack a standard framework for a comparative study and consistent explanations. Moreover, the subjective nature of interpretability and the lack of ground truth in qualitative analysis make it even more challenging to evaluate these methods.

By reviewing the above works that focus on the activation space, we observe the following from the model perspective: for a fixed model architecture, when a fixed set of neurons is examined, each set of neurons encodes different information depending on the input dataset; on the contrary, when a wider set of model architectures is considered, the same set of neurons encodes similar information at lower and higher layers across these architectures, but the information encoded depends on the underlying task. These observations emphasize that interpreting the linguistic information encoded in the activation space depends on the input data and the underlying task.

The experiments conducted align with the definitions of interpretability and explainability in understanding the rationale behind a model's decisions, but they lack human-understandable explanations. In the context of explainability, we observe a gap between human-understandable linguistic concepts and the linguistic features captured in the network. We make a clear distinction between linguistic features and concepts: features consist of linguistic properties such as parts-of-speech, syntactic and semantic properties, and word morphology, whereas linguistic concepts, from a human-understandable perspective, encode general human knowledge and how it is expressed in natural language. Various contemporary methods, such as Concept Relevance Propagation [49], Testing with Concept Activation Vectors [50], and Integrated Conceptual Sensitivity [51], that are based on human-understandable local and global concept-based explanations exist. These methods are applied and evaluated in the image processing domain and are yet to be explored for understanding linguistic concepts. It is evident that exploring the activation space is a promising research direction, and we propose a potential future direction: extending these interpretability techniques from the image processing domain to the natural language processing domain through transfer learning.

Acknowledgments

The author would like to thank the anonymous reviewers for their helpful feedback. The work was partially funded by the German Federal Ministry of Education and Research (BMBF) through the project XAINES (01IW20005).

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). arXiv:1706.03762.
[2] M. Nasr, R. Shokri, A. Houmansadr, Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning, in: 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 739-753. doi:10.1109/SP.2019.00065.
[3] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, CoRR abs/1906.04341 (2019). arXiv:1906.04341.
[4] J. Vig, Y. Belinkov, Analyzing the structure of attention in a transformer language model, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 2019, pp. 63-76. doi:10.18653/v1/W19-4808.
[5] O. Press, N. A. Smith, O. Levy, Improving transformer models by reordering their sublayers, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 2996-3005. doi:10.18653/v1/2020.acl-main.270.
[6] B. Pulugundla, Y. Gao, B. King, G. Keskin, H. Mallidi, M. Wu, J. Droppo, R. Maas, Attention-based neural beamforming layers for multi-channel speech recognition, 2021. arXiv:2105.05920.
[7] H. Xu, Q. Liu, D. Xiong, J. van Genabith, Transformer with depth-wise LSTM, CoRR abs/2007.06257 (2020). arXiv:2007.06257.
[8] D. Dai, L. Dong, Y. Hao, Z. Sui, F. Wei, Knowledge neurons in pretrained transformers, CoRR abs/2104.08696 (2021). arXiv:2104.08696.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[10] A. Das, P. Rad, Opportunities and challenges in explainable artificial intelligence (XAI): A survey, CoRR abs/2006.11371 (2020). arXiv:2006.11371.
[11] S. Zhao, D. Pascual, G. Brunner, R. Wattenhofer, Of non-linearity and commutativity in BERT, CoRR abs/2101.04547 (2021). arXiv:2101.04547.
[12] M. Geva, R. Schuster, J. Berant, O. Levy, Transformer feed-forward layers are key-value memories, CoRR abs/2012.14913 (2020). arXiv:2012.14913.
[13] F. Dalvi, A. Nortonsmith, D. A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, J. Glass, NeuroX: A toolkit for analyzing individual neurons in neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
[14] N. Durrani, H. Sajjad, F. Dalvi, How transfer learning impacts linguistic knowledge in deep NLP models?, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 2021, pp. 4947-4957. doi:10.18653/v1/2021.findings-acl.438.
[15] N. Durrani, H. Sajjad, F. Dalvi, Y. Belinkov, Analyzing individual neurons in pre-trained language models, CoRR abs/2010.02695 (2020). arXiv:2010.02695.
[16] J. Alammar, Ecco: An open source library for the explainability of transformer language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 2021, pp. 249-257. doi:10.18653/v1/2021.acl-demo.30.
[17] T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. B. Viégas, M. Wattenberg, An interpretability illusion for BERT, CoRR abs/2104.07143 (2021). arXiv:2104.07143.
[18] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138-52160.
[19] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys 51 (2018). doi:10.1145/3236009.
[20] V. Arya, R. K. E. Bellamy, P. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilovic, S. Mourad, P. Pedemonte, R. Raghavendra, J. T. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, Y. Zhang, One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques, CoRR abs/1909.03012 (2019). arXiv:1909.03012.
[21] P. Linardatos, V. Papastefanopoulos, S. Kotsiantis, Explainable AI: A review of machine learning interpretability methods, Entropy 23 (2021). doi:10.3390/e23010018.
[22] A. Krajna, M. Kovac, M. Brcic, A. Šarčević, Explainable artificial intelligence: An updated perspective, in: 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), 2022, pp. 859-864. doi:10.23919/MIPRO55190.2022.9803681.
[23] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49-72. doi:10.1162/tacl_a_00254.
[24] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, P. Sen, A survey of the state of explainable AI for natural language processing, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 2020, pp. 447-459.
[25] H. Sajjad, N. Durrani, F. Dalvi, Neuron-level interpretation of deep NLP models: A survey, CoRR abs/2108.13138 (2021). arXiv:2108.13138.
[26] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, J. R. Glass, On the linguistic representational power of neural machine translation models, CoRR abs/1911.00317 (2019). arXiv:1911.00317.
[27] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. A. Specter, L. Kagal, Explaining explanations: An approach to evaluating interpretability of machine learning, CoRR abs/1806.00069 (2018). arXiv:1806.00069.
[28] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, CoRR abs/1706.07269 (2017). arXiv:1706.07269.
[29] B. Kim, R. Khanna, O. O. Koyejo, Examples are not enough, learn to criticize! Criticism for interpretability, in: Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016.
[30] P. Pezeshkpour, Y. Tian, S. Singh, Investigating robustness and interpretability of link prediction via adversarial modifications, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 3336-3347. doi:10.18653/v1/N19-1337.
[31] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 1101-1111. doi:10.18653/v1/N18-1100.
[32] D. Croce, D. Rossini, R. Basili, Auditing deep learning processes through kernel-based explanatory models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 4037-4046. doi:10.18653/v1/D19-1415.
[33] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014. arXiv:1409.0473.
[34] D. Hupkes, S. Veldhoen, W. Zuidema, Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure, 2017. arXiv:1711.10203.
[35] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single vector: Probing sentence embeddings for linguistic properties, CoRR abs/1805.01070 (2018). arXiv:1805.01070.
[36] Y. Belinkov, J. R. Glass, Analysis methods in neural language processing: A survey, CoRR abs/1812.08951 (2018). arXiv:1812.08951.
[37] S. Sukhbaatar, E. Grave, G. Lample, H. Jégou, A. Joulin, Augmenting self-attention with persistent memory, CoRR abs/1907.01470 (2019). arXiv:1907.01470.
[38] A. Merchant, E. Rahimtoroghi, E. Pavlick, I. Tenney, What happens to BERT embeddings during fine-tuning?, in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Online, 2020, pp. 33-44. doi:10.18653/v1/2020.blackboxnlp-1.4.
[39] M. Mosbach, A. Khokhlova, M. A. Hedderich, D. Klakow, On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020, pp. 2502-2516. doi:10.18653/v1/2020.findings-emnlp.227.
[40] H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, A. M. Rush, Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models, CoRR abs/1804.09299 (2018). arXiv:1804.09299.
[41] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 2227-2237. doi:10.18653/v1/N18-1202.
[42] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 3651-3657. doi:10.18653/v1/P19-1356.
[43] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, N. A. Smith, Linguistic knowledge and transferability of contextual representations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 1073-1094. doi:10.18653/v1/N19-1112.
[44] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4593-4601. doi:10.18653/v1/P19-1452.
[45] Y. Hao, L. Dong, F. Wei, K. Xu, Self-attention attribution: Interpreting information interactions inside transformer, 2020. arXiv:2004.11207.
[46] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 3319-3328.
[47] R. Aharoni, Y. Goldberg, Unsupervised domain clusters in pretrained language models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 7747-7763. doi:10.18653/v1/2020.acl-main.692.
[48] D. Alvarez Melis, T. Jaakkola, Towards robust interpretability with self-explaining neural networks, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
[49] R. Achtibat, M. Dreyer, I. Eisenbraun, S. Bosse, T. Wiegand, W. Samek, S. Lapuschkin, From "where" to "what": Towards human-understandable explanations through concept relevance propagation, 2022. arXiv:2206.03208.
[50] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), 2017. arXiv:1711.11279.
[51] J. Schrouff, S. Baur, S. Hou, D. Mincu, E. Loreaux, R. Blanes, J. Wexler, A. Karthikesalingam, B. Kim, Best of both worlds: Local and global explanations with human-understandable concepts, CoRR abs/2106.08641 (2021). arXiv:2106.08641.

A. Evaluation Metrics Definitions

The definitions below summarize the evaluation metrics used across Section 6; an illustrative sketch of three of them follows the list.

• Selectivity: The difference between the linguistic task accuracy and the control task accuracy.
• Prediction Accuracy: The performance measure of the model on a given task.
• Agreement Rate: The fraction of memory cells (dimensions) where the value's top prediction matches the key's top trigger example.
• Value Probability: The probability of the value's top prediction.
• Projection Score: The dot product between a sentence embedding and a direction.
• Activation Quantile: An equally sized smaller subsection of the activation space.
• Word Frequency Correlation: The correlation between directions and words in the embedding space.
• Attribution Score: A measure of the contribution of a neuron to factual expressions.
• Perplexity: A measure of how well a probability model predicts a sample, i.e., the degree of 'uncertainty' the model has in its predictions.
• Change Rate: The ratio at which the original prediction is modified to another prediction.
• Success Rate: The ratio at which the updated (learned) prediction becomes the top prediction.
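As a purely illustrative companion to these definitions, the small sketch below computes three of the metrics (selectivity, change rate, and success rate) from hypothetical probe and model predictions; the function names and toy values are not taken from the reviewed works.

```python
from typing import Sequence

def selectivity(task_accuracy: float, control_accuracy: float) -> float:
    # Linguistic-task accuracy minus control-task accuracy.
    return task_accuracy - control_accuracy

def change_rate(original_preds: Sequence[str], edited_preds: Sequence[str]) -> float:
    # Fraction of prompts whose prediction changed after editing the knowledge neurons.
    return sum(o != e for o, e in zip(original_preds, edited_preds)) / len(original_preds)

def success_rate(edited_preds: Sequence[str], target_answers: Sequence[str]) -> float:
    # Fraction of prompts where the intended (updated) answer became the top prediction.
    return sum(e == t for e, t in zip(edited_preds, target_answers)) / len(edited_preds)

# Toy values only, for illustration.
print(selectivity(0.91, 0.62))                            # ≈ 0.29
print(change_rate(["Paris", "Rome"], ["Lyon", "Rome"]))   # 0.5
print(success_rate(["Lyon", "Rome"], ["Lyon", "Milan"]))  # 0.5
```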