Causal Mediation Analysis for Interpreting Large Language Models

Elisabetta Rocchetti*, Alfio Ferrara
Università degli Studi di Milano, Department of Computer Science
elisabetta.rocchetti@unimi.it (E. Rocchetti); alfio.ferrara@unimi.it (A. Ferrara)
ORCID: 0009-0000-5617-7612 (E. Rocchetti); 0000-0002-4991-4984 (A. Ferrara)
* Corresponding author. Both authors contributed equally.
SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy

Abstract
Being able to understand the inner workings of Large Language Models (LLMs) is crucial for ensuring safer development practices and fostering trust in their predictions, particularly in sensitive applications. Causal Mediation Analysis (CMA) is a causality framework well suited to this scenario, providing a mechanistic interpretation of the behaviour of LLM components and assessing a specific type of knowledge in the model (e.g. the presence of gender bias). This study discusses the challenges and potential pathways in applying CMA to open LLMs' black boxes. Through three exemplary case studies from the literature, we show the unique insights CMA can provide, and we elaborate on the inherent challenges and opportunities this approach presents. The challenges range from the influence of model architecture on prompt viability to the complexities of ensuring metric comparability across studies. Conversely, the opportunities lie in the dissection of LLMs' knowledge through the extraction of the specific domains of knowledge activated during processing. Our discussion aims to provide a comprehensive insight into CMA, focusing on the essential aspects needed to equip researchers with the knowledge necessary for crafting effective CMA experiments tailored towards interpretability objectives.

Keywords
LLM, interpretability, causality, causal mediation analysis

1. Introduction

Large Language Models (LLMs) have achieved great success and have become ubiquitous in many research and application areas. Understanding their behaviour is of central interest for correcting them at inference time [1] and for guaranteeing safer development [2]. In the area of XAI, mechanistic interpretability techniques involve deconstructing the computational processes of a model into its elements, with the aim of uncovering, understanding, and confirming the algorithms (referred to as circuits in some studies) that are executed by the model's weights [3]. Among these techniques, Causal Mediation Analysis (CMA) provides a causal approach which aims at extracting reliable cause-effect relations between inputs and outputs, in contrast to XAI techniques that rely merely on simple correlations. The architecture of a LLM can be interpreted as a structural causal model: within this framework, CMA enables the isolation of independent contributions from individual neural network components.

In this study, we aim to elucidate the CMA technique and its application in probing the inner mechanisms of LLMs. Through a series of case studies drawn from the existing literature, we highlight the potential benefits and opportunities that CMA offers. Furthermore, we delve into the primary challenges encountered when applying CMA, as well as the pressing issues that must be addressed to enhance its robustness and flexibility.
This paper is structured as follows: Section 2 introduces the CMA formulation; Section 3 reviews works from the literature applying CMA to three different case studies; Section 4 shows how to apply interventions for CMA, including an illustrative example; Section 5 discusses the limitations and issues of CMA, alongside its potential and challenges; Section 6 concludes.

2. Causal Mediation Analysis

Consider a causal model including three variables, X, Y and Z, representing an intervention, an outcome and a mediator respectively. The intervention X affects the outcome variable Y, and Z is placed between these two, modifying some intermediate process between X and Y. How can we measure the separate effects of X and Z on Y? Linear regression paradigms [4] rely on the "no interaction" property, so they cannot work in nonlinear systems where editing Z could change the effect of X on Y. Causal mediation analysis [5] aims at measuring the effects of an intervention X on an outcome variable Y when an intermediate variable Z stands between the two, modifying some intermediate process between X and Y. This method can answer our question, since it removes these nonlinear barriers using causal assumptions (one assumption is still made: the error terms must be mutually independent).

Let our system be the one depicted in Figure 1, where x = F_1(ε_1), z = F_2(x, ε_2), y = F_3(x, z, ε_3); X, Y, Z are discrete or continuous random variables, F_1, F_2, F_3 are arbitrary functions, and ε_1, ε_2, ε_3 represent omitted factors which are assumed to be mutually independent yet arbitrarily distributed [5].

Figure 1: A causal model having an intervention variable X affecting an outcome variable Y, and a mediator Z standing between the two and altering some of the signal from X. X depends only on the error term ε_1, Z is a function of x ∈ X and the error ε_2, and Y depends on x ∈ X, z ∈ Z and the error term ε_3. This figure is taken from [5].

This technique allows the separation of the total effect of X on Y into direct and indirect effects. The total effect (TE) measures the change in Y produced by a change in X, for example from X = x to X = x' [5]. We can express TE at the population level as

    \mathrm{TE}_{x,x'} = \mathbb{E}(Y \mid X = x') - \mathbb{E}(Y \mid X = x)    (1)

The notion of direct effect (DE) we introduce here refers to what is technically called "natural direct effects" (natural refers to the fact that we observe the change in Y after a change in X while holding Z at a constant value, and the level at which this constant is set can vary based on the individual under consideration). As defined in [5], DE is the expected change in Y induced by changing X from x to x' while keeping all mediating factors constant at whatever value they would have obtained under X = x, before the transition from x to x'. Estimating DE from population data is formalised as

    \mathrm{DE}_{x,x'}(Y) = \sum_{z} \left[ \mathbb{E}(Y \mid x', z) - \mathbb{E}(Y \mid x, z) \right] P(z \mid x)    (2)

where the conditional probabilities use short-hand notation for X = x, X = x', and Z = z. In contrast to the DE, the indirect effect (IE) is defined as the expected change in Y while keeping X constant and changing Z to the value it would have attained had X been set to X = x' (according to each individual) [5]. This requires a counterfactual representation, estimated by

    \mathrm{IE}_{x,x'}(Y) = \sum_{z} \mathbb{E}(Y \mid x, z) \left[ P(z \mid x') - P(z \mid x) \right]    (3)

Equation 3 is a general formula for estimating mediating effects, and it can also be applied to any nonlinear system.
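To make Equations (1)-(3) concrete, the following is a minimal numerical sketch in Python. The conditional distributions and outcome expectations used below are illustrative assumptions chosen for this example only; they do not come from the paper or from any experiment.

```python
# Minimal numerical sketch of Equations (1)-(3) for a toy system with a
# binary intervention X, a binary mediator Z and a scalar outcome Y.
# All numbers below are illustrative assumptions.

# P(Z = z | X = x): mediator distribution under each setting of X
P_z_given_x = {
    0: {0: 0.8, 1: 0.2},   # under X = x
    1: {0: 0.3, 1: 0.7},   # under X = x'
}

# E[Y | X = x, Z = z]: outcome expectation for each (x, z) pair
E_y = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 0): 1.5, (1, 1): 4.0,   # note the X-Z interaction (nonlinear system)
}

def expected_y(x):
    """E[Y | X = x], marginalising over the mediator: sum_z E[Y|x,z] P(z|x)."""
    return sum(E_y[(x, z)] * P_z_given_x[x][z] for z in (0, 1))

# Equation (1): total effect of switching X from x=0 to x'=1
TE = expected_y(1) - expected_y(0)

# Equation (2): (natural) direct effect -- change X, hold Z at its X=0 distribution
DE = sum((E_y[(1, z)] - E_y[(0, z)]) * P_z_given_x[0][z] for z in (0, 1))

# Equation (3): indirect effect -- hold X at 0, move Z to its X=1 distribution
IE = sum(E_y[(0, z)] * (P_z_given_x[1][z] - P_z_given_x[0][z]) for z in (0, 1))

print(f"TE = {TE:.3f}, DE = {DE:.3f}, IE = {IE:.3f}")
```

Running the sketch makes the nonlinearity visible: because of the X-Z interaction term, the total effect does not equal the simple sum of the direct and indirect effects, which is exactly the situation linear regression approaches cannot handle.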
3. Literature Review

CMA has recently been applied in the field of XAI for LLMs [2, 6, 7, 8, 9, 10, 11, 12]. Indeed, the neural network architecture of a LLM can be viewed as a structural causal model. We can view a subset of a language model's internal components as an instance of the mediator variable Z. Suppose we select a specific neuron to be our z: then z's output is influenced by the model's input, and it affects the model output [7]. To show the effectiveness of CMA, we cover three different case studies from the literature: gender bias detection [6, 7, 12], syntactic agreement [8], and arithmetic reasoning [2].

3.1. Gender bias detection

The first application of the causal mediation formula to language models is presented in [6] and extended in [7]. These studies introduce a novel approach for probing both the structural dynamics and predictive behaviours of LLMs, with a particular focus on uncovering and quantifying gender bias. The methodology centres around feeding specifically crafted prompts, such as "The nurse said that [blank]", into LLMs to observe the predictive preference between the gendered pronouns "he" and "she" filling the blank. This setup allows for an examination of bias: a model consistently showing a higher likelihood for "she" in contexts traditionally stereotyped towards women is flagged as exhibiting gender bias. To quantify this bias, the authors define a grammatical gender bias measure that compares the prediction probabilities of anti-stereotypical and stereotypical pronouns. Through designed interventions, which manipulate the input sentence by replacing profession nouns with their anti-stereotypical counterparts, the authors calculate TE, DE, and IE. These metrics illuminate the separate and combined influences of the intervention and the mediator variable (a specific neuron or set of neurons within the LLM) on the model's output.

This methodology has yielded insightful findings: larger models are disproportionately affected by gender bias, and the manifestation of bias varies significantly across different datasets. Moreover, certain biases were found to align with crowdsourced gender perceptions. Importantly, the study also pinpoints the localisation of gender bias within the model, identifying middle network layers and specific attention heads as primary contributors. These findings not only enhance our understanding of gender bias within LLMs but also guide targeted interventions for mitigating such biases, thereby paving the way for more equitable AI systems [12].
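As an illustration of this probing setup, the sketch below reads out the next-token probabilities of the two candidate pronouns given a templated prompt, using GPT-2 through the Hugging Face transformers library. The model choice, the prompt, and the candidate_probs helper are illustrative assumptions for this sketch; the full experimental protocol is described in [6, 7].

```python
# Minimal sketch of the prompt-based probing setup described in [6, 7]:
# read out P("he" | prompt) and P("she" | prompt) as next-token candidates.
# Model choice (GPT-2) and helper names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def candidate_probs(prompt: str, candidates=(" he", " she")):
    """Return P(candidate | prompt) for each single-token candidate."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # shape: (1, seq_len, vocab)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    out = {}
    for cand in candidates:
        ids = tokenizer.encode(cand)
        assert len(ids) == 1, "candidates are assumed to be single tokens"
        out[cand.strip()] = next_token_probs[ids[0]].item()
    return out

print(candidate_probs("The nurse said that"))
# e.g. {'he': ..., 'she': ...}; the actual values depend on the model
```

A consistent imbalance between the two probabilities across stereotyped templates is what the bias measure of the next sections quantifies.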
3.2. Syntactic agreement

The study by [8] explores the application of CMA to probe models' sensitivity to syntactic agreement, assessing how different syntactic structures influence a model's preference for verb inflections. The evaluated structures range from simple agreements to complex scenarios involving object relative clauses and distractors, aiming to understand the model's grammatical preferences. The swap-number intervention is introduced to challenge the model with counterfactual prompts, altering the number feature of subjects to examine the model's inflection choice (e.g. "The friend (that) the lawyers *likes/like" becomes "The friend (that) the lawyer likes/*like", with the asterisk denoting the erroneous inflection). This approach helps identify whether the model favours the correct grammatical form, with expectations set for the model's preference metrics in response to these interventions.

Their findings reveal nuanced insights into model behaviour: contrary to the previous gender bias studies, model size does not linearly correlate with the magnitude of syntactic preference. The presence of adverbial distractors increases total effects, suggesting improved accuracy, while attractors decrease accuracy. The study also highlights the distribution of syntactic knowledge across model layers and the impact of structural separations on subject-verb agreement, offering a comprehensive view of how LLMs process syntactic information.

3.3. Arithmetic reasoning

In exploring arithmetic reasoning within LLMs, the study by [2] applies CMA to dissect the internal mechanics of LLMs as they process mathematical concepts. The authors hypothesise a network subset specialised in arithmetic reasoning, tested through task-specific prompts that blend operands and operations into arithmetic problems of varying complexity. The interventions alter operands and operations to gauge the model's computational accuracy and the mediator contribution. This involves generating problems like "How much is n1 plus n2?" and assessing outcomes against counterfactual scenarios, quantified through an IE formulation. Key activation sites identified include the Multi-Layer Perceptron (MLP) modules at initial layers for operand tokens, intermediate attention blocks for sequence ends, and later-layer MLP modules for final-token processing [2]. This suggests that attention mechanisms channel the necessary information for MLPs to execute computations. Further analysis on number retrieval and factual knowledge, using randomised templates, indicates that the last-token MLPs play a broad role in information processing, not strictly limited to arithmetic. This contrasts with early MLP involvement in factual retrieval, highlighting the arithmetic specificity of late MLP activations.

4. Applying the Causal Mediation Formula in LLMs

Manipulating the internal representations of LLMs enables the generation of genuine counterfactual outputs. This is achieved by transferring internal representations between model executions that use the original and the modified utterances. Here, we detail how this process is used to compute total, direct, and indirect effects, leveraging the gender bias case study from [7].

Given the problem of bias detection, we need to engineer prompts that induce the model to express its bias, if it has any. For instance, the authors in [7] feed a LLM with prompts u like "The accountant said that [blank]", where the profession "accountant" is interpreted as stereotypically female, as a result of a crowdsourced stereotypicality metric [7]. The evaluation then consists in testing which of the tokens "he" and "she" has the highest probability of being predicted in place of the [blank] space. In this example, if the model consistently shows a higher likelihood p_θ(she | u) for the stereotypical pronoun "she" than for the anti-stereotypical pronoun "he", the LM is said to be biased (θ are the model parameters). The bias measure is

    y(u) = \frac{p_\theta(\text{anti-stereotypical} \mid u)}{p_\theta(\text{stereotypical} \mid u)}    (4)

If y(u) < 1, the prediction is stereotypical; if y(u) > 1, the prediction is anti-stereotypical; and if y(u) = 1, the prediction is unbiased [7].
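Equation (4) reduces to a simple ratio of candidate probabilities once those probabilities have been read out of the model. The sketch below is a minimal, self-contained illustration; the probability values are assumptions made for this example (they mirror the illustrative numbers used in the worked example that follows), not measurements from a real model.

```python
# Sketch of Equation (4): the bias measure y(u) as a ratio of the candidate
# probabilities. The probability values below are illustrative assumptions.
def bias_measure(p_anti_stereotypical: float, p_stereotypical: float) -> float:
    """y(u) = p_theta(anti-stereotypical | u) / p_theta(stereotypical | u)."""
    return p_anti_stereotypical / p_stereotypical

# "accountant" is treated as stereotypically female in [7], so for
# u = "The accountant said that" the stereotypical candidate is "she"
# and the anti-stereotypical candidate is "he". Assume p(he|u) = 0.05
# and p(she|u) = 0.15.
y_u = bias_measure(p_anti_stereotypical=0.05, p_stereotypical=0.15)
print(y_u)   # 0.333... < 1, i.e. a stereotypical prediction
```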
We illustrate a set-gender-male neuron intervention as described by [6, 7]. Under the null intervention, u = "The accountant said that". The set-gender-male intervention exchanges the word "accountant" with its anti-stereotypical counterpart "man". The two variants are processed by the same network and the probabilities of the two candidates "she" and "he" are evaluated. Figure 2 depicts the procedure for obtaining the candidates' probabilities. TE can thus be computed as (the probability values in these examples are illustrative and do not come from a real experiment)

    \mathrm{TE}(\text{set-gender}, \text{null}; y, u) = \frac{y_{\text{set-gender}}(u)}{y_{\text{null}}(u)} - 1 = \frac{0.5/0.2}{0.05/0.15} - 1 = 7.5 - 1 = 6.5

Figure 2: The diagram showcases the calculation of the total effect in a Transformer-based LLM, analysing "The accountant said that" and its modified version "The man said that" to extract the probabilities of "he" and "she". This picture is inspired by a figure from [2].

Calculating DE and IE requires capturing intermediate representations from the mediator z, potentially involving multiple MLPs or attention layers. In our example, z is an MLP. To compute the DE, we take the "accountant" representation produced by z when it processes the original sentence; this representation then replaces the one produced by z when it processes "man" in the alternate sentence. Figure 3 shows this procedure. DE is computed as

    \mathrm{DE}(\text{set-gender}, \text{null}; y, u) = \frac{y_{\text{set-gender}, z_{\text{null}}}(u)}{y_{\text{null}}(u)} - 1 = \frac{0.35/0.35}{0.05/0.15} - 1 = 3 - 1 = 2

Figure 3: The diagram shows how to compute the direct effect (DE) in a Transformer-based LLM. It begins with processing the original sentence, extracting the "accountant" token's representation from the MLP mediator. After applying an intervention to change "accountant" to "man" in the input, the model's computation for z uses the previously extracted representation. This picture is inspired by a figure from [2].

Concerning the IE computation, we extract the "man" representation produced by z when it processes the alternate sentence, and this then replaces the "accountant" representation from z when it processes the original sentence. Figure 4 shows this procedure. IE can be computed as

    \mathrm{IE}(\text{set-gender}, \text{null}; y, u) = \frac{y_{\text{null}, z_{\text{set-gender}}}(u)}{y_{\text{null}}(u)} - 1 = \frac{0.5/0.2}{0.25/0.12} - 1 = 1.2 - 1 = 0.2

Figure 4: The diagram demonstrates calculating indirect effect probabilities in a Transformer-based model. Initially, the model processes the sentence with the intervention applied, extracting the "man" token's representation from an MLP mediator. Then, it processes the original sentence with "accountant", but replaces its MLP output with the "man" representation from the first step. This picture is inspired by a figure from [2].

Results obtained with this analysis can then be verified by employing a LLM initialised with random weights and comparing the results with those coming from the original execution.
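The representation-swapping steps above can be sketched with PyTorch forward hooks. The snippet below is a minimal, illustrative implementation assuming a GPT-2 style model from the transformers library whose blocks expose an mlp submodule; the layer index, token position, and helper names are assumptions made for this example and do not reproduce the exact setup of [6, 7].

```python
# Illustrative activation-patching sketch for the TE/DE/IE computations above,
# using forward hooks on one MLP block of GPT-2 as the assumed mediator z.
# Layer index and token position are assumptions made for this example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 5                                   # assumed mediator layer
mlp = model.transformer.h[LAYER].mlp        # the MLP playing the role of z

def run_with_capture(prompt, position):
    """Run the model and capture the MLP output at one token position."""
    captured = {}
    def hook(module, inputs, output):
        captured["act"] = output[0, position].detach().clone()
    handle = mlp.register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits
    handle.remove()
    return logits, captured["act"]

def run_with_patch(prompt, position, replacement):
    """Run the model, overwriting the MLP output at one position with `replacement`."""
    def hook(module, inputs, output):
        output = output.clone()
        output[0, position] = replacement
        return output                        # returned value replaces the MLP output
    handle = mlp.register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits
    handle.remove()
    return logits

def y_ratio(logits):
    """Bias measure of Eq. (4) from the next-token distribution: p(he)/p(she)."""
    probs = torch.softmax(logits[0, -1], dim=-1)
    he = tokenizer.encode(" he")[0]
    she = tokenizer.encode(" she")[0]        # "she" is the stereotypical candidate here
    return (probs[he] / probs[she]).item()

original, intervened = "The accountant said that", "The man said that"
POS = 1                                      # assumed position of "accountant"/"man"

logits_null, z_null = run_with_capture(original, POS)    # null run, capture z
logits_set, z_set = run_with_capture(intervened, POS)    # plain run on the intervened prompt
logits_de = run_with_patch(intervened, POS, z_null)      # intervened input, z held at its null value
logits_ie = run_with_patch(original, POS, z_set)         # original input, z set to its intervened value

print("TE =", y_ratio(logits_set) / y_ratio(logits_null) - 1)
print("DE =", y_ratio(logits_de) / y_ratio(logits_null) - 1)
print("IE =", y_ratio(logits_ie) / y_ratio(logits_null) - 1)
```

In this sketch the TE comes from comparing the two unpatched runs, while DE and IE come from the two patched runs, mirroring Figures 2-4.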
5. Discussion

As demonstrated in the previous sections, implementing CMA is relatively straightforward. However, there are critical points to address in order to design an effective CMA experiment. In this section, we discuss important limitations and issues to consider when applying this technique. In particular, we cover challenges related to intervention, prompt, and metric engineering.

Intervention engineering-related challenges. Designing interventions prudently is a key factor for experimental success. The aim is to determine which syntax is most likely to trigger the targeted LLM knowledge. For example, if gender bias is the object of inspection, words with a strong stereotypical connotation are better suited for replacing neutral expressions. One could also design different interventions to trigger the model at different levels, and then compare the results of these experiments. Another challenge is producing the alternative sentences from which to extract alternate representations for interventions. These sentences may vary syntactically from the originals, yet their semantics must convey a concept diametrically opposed to that of the original sentences, contingent upon the chosen intervention. For instance, take the concept of "leadership" and consider how we might intervene on a sentence to shift the perception from a traditional to a more inclusive understanding, while ensuring grammatical correctness. If the original sentence was "The successful leader commanded his team with firmness and ensured compliance through strict policies.", the intervention would include multiple modifications, for example using gender-neutral pronouns, a softer tone, and democratic policies. The intervened sentence could be: "The successful leader guided their team with understanding and fostered collaboration through flexible policies.".

Prompt engineering-related challenges. Model selection is pivotal in CMA prompt generation, demanding tailored strategies based on the chosen model. The prompt "The nurse said that [blank]" exemplifies how decoder-only, auto-regressive models handle candidate probabilities differently from masked models that employ a [MASK] token. For prompts where the evaluation point is not at the end of the sentence, such as "[blank] dream is to become a doctor", modifications are necessary to maintain evaluation consistency across models; indeed, a decoder-only model cannot be tested directly with such prompts. A trivial solution is to rephrase the prompt so that the candidates are placed last, but in that case we cannot guarantee the comparability of the magnitudes of the computed causal effects without deeper investigation. Does the model behave differently when choosing different formulations of utterances? Do the estimated probabilities shift due to the varying structures? Does the model produce more uncertain results after the modification? This highlights the intricacies of prompt engineering in CMA, requiring not only specific adaptations, such as rephrasing to fit model requirements, but also a deeper linguistic analysis to isolate the intended effect from potential confounders. For instance, addressing issues like coreference and complex sentence structures ensures the reliability of results by minimising the influence of unrelated variables. Relevant linguistic features to inspect prior to CMA experiments must be selected according to which types of linguistic analysis have been found to be relevant for LLMs.
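To illustrate this architectural difference, the sketch below scores the two candidates with a masked model for the non-end-of-sentence slot and with a causal model for an end-of-sentence rephrasing. The specific models (BERT and GPT-2) and the rephrased prompt are illustrative assumptions for this sketch, not prescriptions from the works discussed above.

```python
# Illustrative comparison of candidate scoring for a masked LM (the slot can
# sit anywhere in the sentence) versus a causal LM (candidates must be scored
# as continuations). Models and the rephrased prompt are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

def masked_candidate_probs(template, candidates):
    """Masked LM: fill a [MASK] slot wherever it occurs in the sentence."""
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
    ids = tok(template.replace("[blank]", tok.mask_token), return_tensors="pt")
    mask_pos = (ids["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        probs = torch.softmax(mlm(**ids).logits[0, mask_pos], dim=-1)
    return {c: probs[tok.convert_tokens_to_ids(c)].item() for c in candidates}

def causal_candidate_probs(prompt, candidates):
    """Causal LM: candidates are scored as the next token after the prompt."""
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(lm(**ids).logits[0, -1], dim=-1)
    return {c: probs[tok.encode(" " + c)[0]].item() for c in candidates}

# Non-end-of-sentence slot: usable directly with the masked model ...
print(masked_candidate_probs("[blank] dream is to become a doctor", ["his", "her"]))
# ... while the causal model needs a rephrasing that places the candidates last.
print(causal_candidate_probs("The dream of becoming a doctor belongs to", ["him", "her"]))
```

As the rephrased prompt shows, moving the candidates to the end changes the sentence structure itself, which is exactly why the comparability of effect magnitudes across such reformulations needs further investigation.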
Metrics engineering-related challenges. Different works in the literature use diverse metrics to compute the desired effect, and these metrics are usually employed to perform comparisons across models and datasets. However, these metrics lack an absolute scale indicating how to interpret the different magnitudes within the same case study and across different case studies [2, 7]. We argue that this limitation restricts the analysis to merely ranking effects rather than quantifying them relative to each other, complicating comparisons such as determining whether BERT exhibits more bias than GPT-2, or identifying which templates trigger the most bias. Many variables affect the magnitude of the output probabilities employed in the computation of these metrics, including the presence of a more difficult computation in the process (e.g. coreference resolution in Winograd-style datasets), or even the relative position of the candidate tokens in the sentence. It is imperative that these metrics are formulated to ensure comparability in scale and consistency across different experimental designs. For instance, future research could focus on developing a normalised bias index designed to measure and compare biases across models, taking into account factors such as coreference difficulty and token positioning.

6. Conclusions

We have shown which types of insights CMA can extract from Transformer-based LLMs through three exemplar case studies from the literature. Moreover, we have detailed the application of this analysis to equip readers with a practical understanding of the technique, thereby enabling them to engage more effectively with both the challenges and opportunities CMA presents. The challenges include the impact of model architecture on prompt viability and the intricacies of ensuring metric comparability across studies. The opportunities include the ability to dissect the knowledge within LLMs, offering insights into the knowledge domains activated during processing.

All the case studies presented in this work share something we argue to be rather important: they all sought to uncover which knowledge a Transformer has learned during its training process. CMA gives us the capability to examine the activations within a LLM's neural architecture, thereby discerning the specific domains of knowledge engaged during processing. For example, it could be feasible to delineate the global and human values encapsulated within the documents constituting the training dataset. This concept is particularly compelling as it affords an objective representation of contemporary societal values, the educational paradigms imparted to a generation, or the extraction of characteristic human values from historical contexts. The nature of the insights gleaned is inherently dependent on the composition of the training data supplied to the model.

Acknowledgements

This work was supported in part by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.
References

[1] K. Li, O. Patel, F. Viégas, H. Pfister, M. Wattenberg, Inference-time intervention: Eliciting truthful answers from a language model, Advances in Neural Information Processing Systems 36 (2024).
[2] A. Stolfo, Y. Belinkov, M. Sachan, A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 7035-7052. doi:10.18653/v1/2023.emnlp-main.435.
[3] T. Räuker, A. Ho, S. Casper, D. Hadfield-Menell, Toward transparent AI: A survey on interpreting the inner structures of deep neural networks, in: 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), IEEE, 2023, pp. 464-483.
[4] R. M. Baron, D. A. Kenny, The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations, Journal of Personality and Social Psychology 51 (1986) 1173.
[5] J. Pearl, The Causal Mediation Formula: A Guide to the Assessment of Pathways and Mechanisms, Prevention Science 13 (2012) 426-436. doi:10.1007/s11121-011-0270-1.
[6] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, S. Shieber, Investigating Gender Bias in Language Models Using Causal Mediation Analysis, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 12388-12401.
[7] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, S. Sakenis, J. Huang, Y. Singer, S. Shieber, Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias, 2020. doi:10.48550/arXiv.2004.12265. arXiv:2004.12265.
[8] M. Finlayson, A. Mueller, S. Gehrmann, S. Shieber, T. Linzen, Y. Belinkov, Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1828-1843. doi:10.18653/v1/2021.acl-long.144.
[9] A. Geiger, H. Lu, T. Icard, C. Potts, Causal Abstractions of Neural Networks, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 9574-9586.
[10] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, J. Steinhardt, Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small, in: NeurIPS ML Safety Workshop, 2022.
[11] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and Editing Factual Associations in GPT, 2023. doi:10.48550/arXiv.2202.05262. arXiv:2202.05262.
[12] Y. Da, M. N. Bossa, A. D. Berenguer, H. Sahli, Reducing Bias in Sentiment Analysis Models Through Causal Mediation Analysis and Targeted Counterfactual Training, IEEE Access (2024) 1-1. doi:10.1109/ACCESS.2024.3353056.