<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katharina Simbeck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariam Mahran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HTW Berlin University of Applied Sciences</institution>
          ,
          <addr-line>Treskowallee 8, 10318 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>sparse autoencoders</kwd>
        <kwd>bias</kwd>
        <kwd>interpretability</kwd>
        <kwd>religion</kwd>
        <kwd>mechanistic interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The rapid rise of large language models (LLMs) has transformed natural language processing but also
raised serious concerns about embedded biases [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Trained on extensive datasets often sourced
from the internet, LLMs tend to mirror (and even amplify) existing stereotypes in the data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. If left
unchecked, these biases can lead to discriminatory outcomes, especially in sensitive domains such
as education, recruitment, and information dissemination [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Mechanistic interpretability ofers a
powerful lens for uncovering hidden conceptual structures within LLMs [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ]. Sparse Autoencoders
(SAEs), in particular, have been introduced as a method for mechanistically interpreting LLMs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. By
enforcing sparsity in the feature space, SAEs enable the identification of individual activations linked to
meaningful concepts. This approach not only reveals internal patterns of bias, but also opens the door
for targeted mitigation through adjustments to latent feature weights.
      </p>
      <p>
        Mainstream LLM bias research has prioritized aspects like gender and race, with religion receiving
less attention [
        <xref ref-type="bibr" rid="ref1 ref10">1, 10</xref>
        ]. Prior work shows that religion is a sensitive axis for stereotyping [
        <xref ref-type="bibr" rid="ref2">2, 11, 12</xref>
        ], yet
few studies have explored how such biases are internalized in LLMs’ latent spaces.
Using the Neuronpedia API [<xref ref-type="bibr" rid="ref13">13</xref>], we analyzed latent feature activations across five models using
two forms of analysis: (1) intra-group overlap: how consistently religion-related prompts activate
shared features within each model (RQ1); and (2) inter-group overlap: the extent to which
religion-related features overlap with violence-related features (RQ2). We also performed semantic probing on the
most-activated features to identify patterns tied to violence (RQ2) and geographic associations (RQ3).
Differences across models and datasets are used to address RQ4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work: Explainability and Interpretability of LLMs</title>
      <sec id="sec-2-1">
        <title>2.1. Early vs. New Interpretability Techniques</title>
        <p>
          Biases in LLMs can be broadly categorized into intrinsic biases, embedded in internal representations,
and extrinsic biases, which manifest in generated outputs [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Due to the scale of modern models,
intrinsic biases are more difficult to study. Most existing research focuses on output-level evaluations,
using word association tests or probability-based templates [<xref ref-type="bibr" rid="ref14 ref15 ref16 ref17 ref18">14, 15, 16, 17, 18</xref>].
        </p>
        <p>Moving beyond output-based analysis, researchers have explored how models represent
information internally. Two common early techniques include attention head visualization and neuron-level
interpretation. The former highlights token relationships across layers, but often fails to reflect the true
basis of model decisions [<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>]. The latter examines whether individual neurons encode meaningful
concepts, although interpretations often lack consistency across datasets [<xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>].</p>
        <p>
          Given the limitations of earlier methods, more recent work has shifted toward mechanistic
interpretability as a way to gain deeper insight into model internals. Rather than treating models as black
boxes, this approach aims to make sense of the finite, yet massive, number of model parameters [
          <xref ref-type="bibr" rid="ref8">8, 23</xref>
          ].
This technique offers a more structured way to investigate how concepts such as identity, belief, or bias
are internally encoded. One major challenge for mechanistic interpretability is neuron polysemanticity,
where individual neurons are activated by multiple, often unrelated, concepts across different contexts
[<xref ref-type="bibr" rid="ref24">24</xref>]. Sparse autoencoders have been proposed to address this issue by learning new, disentangled
features from model activations [
          <xref ref-type="bibr" rid="ref6 ref8 ref24">6, 8, 24</xref>
          ]. The key idea is that, by enforcing sparsity, only a few features
are active for a given input, making it easier to isolate individual concepts and interpret them.
        </p>
        <p>
          SAEs are shallow neural networks with a single hidden layer, trained to reconstruct LLM activations
under a sparsity constraint. The encoder identifies active features for a given input, and the decoder
maps the features back to the original state [<xref ref-type="bibr" rid="ref25">25</xref>]. The resulting sparse features (latent features) are
hypothesized to correspond to real semantically meaningful concepts embedded within the model
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. SAEs are trained on LLM activations recorded at specific points in the transformer block, using
tokenized input text. These activations form the basis for learning a sparse feature dictionary [<xref ref-type="bibr" rid="ref26">26</xref>].
        </p>
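        <p>For illustration, the following minimal sketch shows the shape of such an SAE: a single hidden layer trained to reconstruct activations under an L1 sparsity penalty. The dimensions, the ReLU encoder, and the penalty coefficient are illustrative assumptions, not the configurations of the pre-trained SAEs used in this study.</p>
        <preformat>
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer SAE over LLM activations (illustrative sketch)."""

    def __init__(self, d_model: int = 768, d_features: int = 24576):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> latent features
        self.decoder = nn.Linear(d_features, d_model)  # latent features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # sparse, non-negative features
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
        </preformat>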
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Religious Bias in LLMs</title>
        <p>
          Bias in LLMs has been widely studied in dimensions such as gender, race, and political ideology, with
many tools developed to assess and mitigate these issues [
          <xref ref-type="bibr" rid="ref1 ref10">1, 10</xref>
          ]. However, religious bias remains
underexplored despite its potential to reinforce stereotypes and marginalize vulnerable communities.
Some studies have shown that models associate “Muslim” with “terrorist” or “Jewish” with “money” [<xref ref-type="bibr" rid="ref11">11</xref>],
and that Islam is more frequently linked to violence than Christianity in both language and text-to-image
models [<xref ref-type="bibr" rid="ref12">12</xref>]. Others found that Western religions are represented with more nuance, while Eastern ones
are reduced to oversimplified representations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. These findings show that religion is a sensitive axis
of model bias but one that has received less attention in mainstream research. Addressing this gap is
essential for developing AI systems that respect and fairly represent diverse religious identities.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data Collection</title>
      <p>
        This analysis relies on the Neuronpedia API to access sparse, interpretable feature representations
extracted from a variety of LLMs using SAEs. The platform provides pre-trained SAEs across multiple
model families, allowing for detailed analysis of internal model representations. While we do not
revalidate the underlying SAEs here, they are drawn from prior peer-reviewed work [
        <xref ref-type="bibr" rid="ref7">7, 13, 25, 27</xref>
        ]. We
treat them as a reasonable foundation for exploring internal associations.
      </p>
      <p>
        The API returns the top activating latent features for each query, along with metadata such as feature
ID, activation layer, and highly activating example texts. Table 1 summarizes the SAEs used in this
study, which include variations of GPT2-small, Gemma-2, and Llama3.1-8B. These were trained at both
attention (-att) and residual stream (-res) positions [
        <xref ref-type="bibr" rid="ref7">7, 25, 27</xref>
        ].
      </p>
      <p>To collect data, we constructed a set of minimal, controlled natural language prompts targeting
religious and violence-related concepts (code and data are available at https://github.com/iug-htw/SAE_fairness). Each was submitted to the Neuronpedia API, and the top 20
activating feature IDs were stored per model, along with the most activating texts for later semantic
analysis. We focus on five major world religions: Christianity, Islam, Judaism, Hinduism, and Buddhism.
For each, we curated representative keywords (e.g., sacred texts, places of worship) and embedded them
in simple declarative sentences with consistent structure like “This is the Quran” or “This is a church.”
Table 4 (Appendix A) lists all religious terms used. We also created a smaller set of prompts related
to violence and criminality (e.g., “terrorist,” “extremist”). These were used to assess overlap between
religion- and violence-activated features.</p>
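      <p>The query workflow can be sketched as follows. The endpoint, payload fields, and response shape shown below are assumptions for illustration only; the exact routes are documented by Neuronpedia, and the scripts actually used are in the repository linked above.</p>
      <preformat>
import requests

# Hypothetical endpoint and payload: the field names below are assumptions
# for illustration; consult the Neuronpedia API documentation for exact routes.
SEARCH_URL = "https://www.neuronpedia.org/api/search-all"

def top_features(prompt: str, model_id: str, source_set: str, k: int = 20):
    """Return the k most strongly activating latent features for one prompt."""
    resp = requests.post(SEARCH_URL, json={
        "modelId": model_id,      # e.g., "gpt2-small" (assumed identifier)
        "sourceSet": source_set,  # e.g., a residual-stream SAE release
        "text": prompt,
    })
    resp.raise_for_status()
    results = resp.json()["results"]  # assumed response shape
    # Keep the top-k features by maximum activation, together with metadata
    # such as feature ID, layer, and highly activating example texts.
    return sorted(results, key=lambda r: r["maxValue"], reverse=True)[:k]
      </preformat>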
    </sec>
    <sec id="sec-4">
      <title>4. Data Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Latent Feature Overlap Analysis</title>
        <p>We conducted a two-part analysis that focused on latent feature overlaps. We explore whether certain
religions activate coherent internal structures within a model (intra-group overlap) and whether these
structures intersect with violence-related concepts (inter-group overlap). The main goal is to assess
representation consistency, as well as potential internalized associations with harmful stereotypes.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Intra-group Overlap</title>
          <p>To test how consistently LLMs encode each religion as a coherent concept (RQ1), we measured how
often prompts for the same religion activated the same latent features. Higher overlap means that the
model tends to treat the religion as one clear idea, instead of spreading the activations across a wide
feature space. We initially explored cosine similarity between binary feature vectors, but the results
were too noisy due to extreme sparsity, with some prompts activating fewer than 20 out of over 100,000
features. Discrete counts provided more interpretable and stable results and were used instead.</p>
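          <p>For illustration, a minimal sketch of this counting procedure is shown below. The reading of the overlap measure (a feature counts as shared if it appears in the top-20 sets of at least two prompts of the same religion) and the helper for the combined unique-feature count discussed next are plausible reconstructions, not verbatim excerpts from our released code.</p>
          <preformat>
from collections import Counter

def intra_group_overlap(prompt_features: list) -> int:
    """Count latent features activated by more than one prompt of a group.

    prompt_features holds, for each prompt of one religion, the set of
    top-20 feature IDs returned by the API.
    """
    counts = Counter(f for feats in prompt_features for f in feats)
    return sum(1 for c in counts.values() if c >= 2)

def unique_features(prompt_features: list) -> int:
    """Total distinct features activated by all prompts combined."""
    return len(set().union(*prompt_features))
          </preformat>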
          <p>As shown in Table 2, all five religions show similar intra-group overlap across models. For example,
in GPT2-small, Buddhism and Hinduism each share about 60 features across their respective prompts.
In Gemma-2-9b-IT, overlap ranges from 36 (Islam) to 43 (Buddhism), suggesting comparable internal
consistency across religious categories.</p>
          <p>We also computed the total number of unique features activated by all religious prompts combined
(regardless of group), to assess the overall compactness of religious representations. The results ranged
from 18 to 145 across models. Gemma-2-9b had the most compact representation, indicating it
encodes religion using a small, overlapping set of features. This potentially reflects efficient abstraction,
but also less differentiation between religions.</p>
          <p>While these results do not reveal strong religious bias, they show that all five religions are represented
with similar structural cohesion (RQ1). This serves as a useful baseline to ensure that differences observed
in later bias analyses are not simply side effects of representation inconsistency.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Inter-group Overlap</title>
          <p>The second part of our analysis examined whether internal representations of religion intersect with
harmful concepts like “terrorism” (RQ2). We compiled all unique latent features activated per religion
group, then compared them to the set of features activated by a group of crime-related prompts.
The overlap count reflects how many latent features are shared between the two groups. To enable
meaningful comparisons between models, we calculated the Violence Association Index (VAI). This index
normalizes the raw overlap by dividing it by the mean overlap across all five religions within the same
model, then multiplying by 100. A VAI above 100 indicates stronger-than-average association with
violence-related features within that model, while below 100 indicates weaker association.</p>
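          <p>Concretely, the index can be computed as in the sketch below; the overlap counts in the example are hypothetical and chosen only to illustrate the normalization.</p>
          <preformat>
def violence_association_index(overlaps: dict) -> dict:
    """Normalize raw religion-violence feature overlaps within one model.

    overlaps maps each religion to the number of latent features its
    prompts share with the crime-related prompt group.
    """
    mean_overlap = sum(overlaps.values()) / len(overlaps)
    return {r: round(100 * n / mean_overlap) for r, n in overlaps.items()}

# Hypothetical overlap counts: a VAI above 100 marks a stronger-than-average
# association with violence-related features within this model.
print(violence_association_index(
    {"Islam": 14, "Christianity": 11, "Judaism": 12, "Hinduism": 12, "Buddhism": 11}))
          </preformat>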
          <p>As shown in Table 2, Islam consistently registers the highest VAI across all five models. In
Gemma-2-2b, for example, Islam scores 117, while other religions range from 94 to 96. A similar pattern appears
in GPT2-small (Islam: 113; others range from 94 to 102). Although the raw overlap values for some
religions are close, the VAI highlights a consistent relative skew across models.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Semantic Activation Analysis</title>
        <p>We examined top activation texts linked to religion-related features to explore whether certain themes
appear more often in the internal representations of specific religions. We focus on two semantic
categories: crime and geographical representations. Crime-related associations can suggest whether
models implicitly link religious identity with violence or extremism. Geographic associations help
reveal whether representations align with real-world religious distributions or reflect biased mappings.</p>
        <p>The Neuronpedia API returns top activating texts for each feature, offering insight into their semantic
content. For each religion and model, we gathered texts related to the top 20 features per query. We then
applied a keyword-based search to these texts using predefined keyword lists for crime and geographic
terms (Appendix B).</p>
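        <p>A minimal sketch of this keyword matching is given below. Treating the score as the share of activation texts containing at least one keyword is one plausible reading of the normalization described in the next subsection; the crime list mirrors Appendix B.</p>
        <preformat>
CRIME = ["terrorism", "terrorist", "crime", "criminal", "violence", "extremist",
         "extremism", "attack", "radical", "assault", "shooting", "bomb"]

def keyword_share(texts: list, keywords: list) -> float:
    """Percentage of activation texts containing at least one keyword
    (one plausible reading of the normalization used in Section 4.2.1)."""
    if not texts:
        return 0.0
    hits = sum(any(kw in text.lower() for kw in keywords) for text in texts)
    return 100 * hits / len(texts)

# Per religion and model: gather the top texts of the top-20 features,
# then score them against each predefined list (crime or one region).
score = keyword_share(["Example activation text mentioning violence."], CRIME)
        </preformat>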
        <sec id="sec-4-2-1">
          <title>4.2.1. Crime Analysis</title>
          <p>To further address RQ2, we conducted a semantic analysis of activation texts related to each religion,
searching for twelve crime-related keywords (e.g., “terrorism,” “extremist,” “crime,” “shooting,” “violence”)
within the top texts returned by the Neuronpedia API.</p>
          <p>Keyword matches were normalized as a percentage of all activation texts per religion and model.
As shown in Table 3, Islam consistently had the highest proportion of crime-related terms in most
models. For example, Islam scored 3.46% in Gemma-2-9b-IT, compared to 2.44% for Christianity and
2.20% for Judaism. However, in GPT2-small and Llama3.1-8B, Hinduism showed unexpectedly higher
rates than Islam. These differences may reflect training data co-occurrence patterns shaped by regional
sociopolitical discourse. Such variation across models hints that associations between religion and
violence depend on architecture and dataset composition (RQ4).</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Geographic Analysis</title>
          <p>To address RQ3, we analyzed how LLMs associate religions with geographic regions by scanning
activation texts for curated keywords representing seven areas: Africa, Asia, Australia, Europe, the
Middle East, North America, and South America. Mentions were counted per religion–region pair.
Figure 1 presents a grouped bar chart of geographic mention shares by religion, illustrating how often
different regions appear in the activation texts associated with each religious concept. Europe and
North America emerge as the most frequently mentioned regions, with relatively balanced associations
among the five religions. Asia and the Middle East also show strong representation, though with
more variation. Hinduism and Buddhism dominate the Asian context, while Islam is most prominent
in the Middle East. In contrast, Africa and South America exhibit lower overall mention rates and
greater disparity between religions, indicating weaker and less consistent associations. Australia has
the sparsest coverage, appearing minimally across all religious groups.</p>
          <p>Christianity and Islam appear across nearly all regions, reflecting a broad conceptual reach. Hinduism
and Buddhism are more localized but still surface meaningfully outside Asia, particularly in the Americas.
Judaism, by comparison, has a markedly narrower geographic distribution.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Our analysis shows that religion in LLMs is not only internally coherent, but also systematically
entangled with broader cultural narratives. The implications are twofold: (1) LLMs reliably abstract
religion into stable latent categories, and (2) those categories often co-activate with features tied to
violence or geography, embedding cultural frames into the models’ conceptual space.</p>
      <p>All five religions activated compact feature sets (RQ1), showing that the models treat religion as
stable rather than diffuse. But stability is not neutrality. Islam’s stronger overlap with violence-related
features (RQ2) shows how coherent categories can still encode stereotypes. This highlights a core risk:
stable representations make linked stereotypes more deeply embedded and harder to remove.</p>
      <p>Interestingly, some models, such as GPT2-small and Llama3.1-8B, deviated from this trend, with
Hinduism showing the strongest crime associations. This variation reflects training data influences,
particularly in regions where religion is intertwined with political conflict. It shows that LLMs do not
merely learn language but also reproduce the narratives and frames prevalent in their training corpora.</p>
      <p>The geographic analysis (RQ3) revealed both expected mappings (e.g., Hinduism–Asia,
Christianity–Europe) and some distortions, with Europe and North America strongly represented across all
religions, while Australia and South America were largely absent. This points to a Western-focused
lens, where media visibility rather than demographics shapes associations. Judaism’s narrow footprint
and Islam’s global spread further reflect how internal representations mirror cultural salience more
than statistical reality, with potential downstream impacts on tasks like summarization or retrieval.</p>
      <p>The differences across models (RQ4) remind us that interpretability findings cannot be generalized
across architectures. Smaller models like GPT2-small revealed noisier and more exaggerated associations
(e.g., Hinduism–crime), while larger models like Gemma-2-9b encoded more compact and abstract
representations. This suggests that bias is shaped not just by data but by model scale and structure.
Model audits therefore require granularity: what looks like a universal association may in fact be
contingent on model class.</p>
      <p>The key contribution of this work is methodological. By moving beyond outputs, we show how SAEs
uncover the “conceptual geography” inside LLMs. This matters because downstream harms do not
always stem from what models say, but from how they internally prioritize and structure information.
A recommender system, for example, could inherit latent religion–violence associations without ever
producing an explicitly biased sentence. More broadly, our findings illustrate how interpretability tools
can map the transition point where generalization slips into stereotype.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study investigated how LLMs internally represent religious concepts, focusing on structural and
semantic associations with violence and geography. Using Sparse Autoencoders, we analyzed latent
feature overlap and activation contexts across models and religions. Results showed consistent internal
cohesion but notable asymmetries (especially Islam’s stronger links to violence-related features) and
revealed how regional narratives are embedded in model representations.</p>
      <p>Future work could extend this analysis to more religions, multilingual models, or other identity-linked
concepts. As interpretability tools advance, they will be essential for understanding not just model
outputs, but the internal structures that shape them.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This paper presents work carried out in the context of the KIWI project (16DHBKI071), which is
generously funded by the Federal Ministry of Research, Technology and Space (BMFTR).</p>
      <p>We also gratefully acknowledge the Neuronpedia API, which provided access to SAE activations and
feature explanations. Their open infrastructure was essential for the experiments conducted in this
study.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI’s GPT-4.1 to check grammar and spelling
and to enhance the writing style. After using these tools/services, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix: List of Representative Keywords Used in the Latent Feature Overlap Analysis</title>
      <p>To generate activations for the latent feature overlap analysis, each keyword listed in this appendix was
wrapped in a simple declarative sentence using a fixed template structure (e.g., “This is a church,” “This
is a mosque,” “This is a temple”). This approach was adopted after initial experiments using standalone
keywords yielded less stable and interpretable features. By embedding each keyword in an identical
sentence frame, we aimed to maintain consistency across prompts while minimizing concept leakage.
The use of a neutral and uniform structure helps ensure that any background noise introduced by the
sentence is equally distributed across all categories and can be averaged out during analysis.</p>
      <p>Potential Bias: terrorist, militant, radical, extremist, attack, bombing, bomb, gun, weapon, terror attack, massacre, shooting</p>
    </sec>
    <sec id="sec-11">
      <title>B. Appendix: Predefined Keyword Lists for Semantic Analysis of Crime- and Geographic-related Terms</title>
      <p>Crime: terrorism, terrorist, crime, criminal, violence, extremist, extremism, attack, radical, assault, shooting, bomb</p>
      <p>Europe: europe, european, eurozone, schengen, western europe, eastern europe, northern europe, southern europe, scandinavia, balkans, benelux, iberian peninsula, baltic states, central europe, united kingdom, britain, england, scotland, wales, northern ireland, france, germany, italy, spain, portugal, netherlands, belgium, switzerland, austria, norway, sweden, denmark, finland, iceland, poland, czech republic, slovakia, hungary, romania, bulgaria, greece, croatia, serbia, bosnia, slovenia, montenegro, albania, macedonia, ukraine, belarus, moldova, russia, georgia, armenia, azerbaijan, ireland, luxembourg, liechtenstein, andorra, monaco, san marino, vatican, london, paris, berlin, rome, madrid, vienna, amsterdam, brussels, oslo, stockholm, copenhagen, helsinki, lisbon, warsaw, prague, budapest, bucharest, sofia, athens, zagreb, belgrade, sarajevo, ljubljana, tirana, skopje, kiev, minsk, chisinau, moscow, tbilisi, yerevan, baku, dublin, luxembourg city, vaduz, andorra la vella, eifel tower, big ben, colosseum, berlin wall, vatican city, acropolis, buckingham palace, louvre, reichstag, sagrada familia</p>
      <p>Asia: asia, asian, east asia, south asia, southeast asia, central asia, west asia, indian subcontinent, asian continent, china, india, japan, south korea, north korea, taiwan, mongolia, pakistan, bangladesh, nepal, bhutan, sri lanka, maldives, indonesia, philippines, vietnam, thailand, myanmar, burma, malaysia, singapore, cambodia, laos, brunei, east timor, kazakhstan, uzbekistan, turkmenistan, kyrgyzstan, tajikistan, iran, afghanistan, georgia, armenia, azerbaijan, beijing, shanghai, hong kong, tokyo, osaka, seoul, busan, delhi, mumbai, kolkata, chennai, karachi, lahore, islamabad, dhaka, kathmandu, thimphu, colombo, bangkok, hanoi, ho chi minh, jakarta, manila, kuala lumpur, singapore, phnom penh, vientiane, yangon, tehran, bishkek, dushanbe, ashgabat, almaty, yerevan, tbilisi, baku, great wall, taj mahal, angkor wat, borobudur, meiji shrine, forbidden city, mount fuji, petronas towers, ganges, yangtze, mekong</p>
      <p>Middle East: middle east, middle eastern, arab, arabic, gulf, levant, arab world, iran, iraq, syria, lebanon, jordan, palestine, gaza, west bank, egypt, saudi arabia, saudi, yemen, oman, emirates, qatar, bahrain, kuwait, turkey, cyprus, sudan, south sudan, libya, mauritania, israel, tel aviv, tehran, baghdad, damascus, aleppo, beirut, amman, jerusalem, gaza city, ramallah, riyadh, mecca, medina, jeddah, sanaa, muscat, doha, dubai, abu dhabi, manama, kuwait city, cairo, giza, alexandria, khartoum, tripoli, ankara, istanbul, nicosia, khartoum, al-azhar, al-aqsa, petra, cedars of lebanon, pyramids, temple mount, jerusalem, sinai</p>
      <p>Africa: africa, african, sub-saharan africa, north africa, west africa, east africa, central africa, southern africa, saharan, sahel, horn of africa, maghreb, morocco, algeria, tunisia, nigeria, ethiopia, egypt, south africa, kenya, algeria, morocco, sudan, angola, ghana, mozambique, madagascar, cameroon, côte d’ivoire, ivory coast, niger, burkina faso, mali, malawi, zambia, zimbabwe, tanzania, rwanda, uganda, chad, senegal, tunisia, libya, botswana, namibia, lesotho, eswatini, gabon, congo, democratic republic of congo, drc, somalia, south sudan, sierra leone, liberia, benin, togo, djibouti, equatorial guinea, central african republic, gambia, guinea, guinea-bissau, mauritania, seychelles, comoros, cape verde, lagos, abuja, nairobi, addis ababa, johannesburg, cape town, durban, tunis, algiers, accra, dakar, kampala, harare, lusaka, windhoek, gaborone, maputo, antananarivo, kinshasa, brazzaville, bamako, ouagadougou, freetown, bujumbura, mogadishu, djibouti city, sahara desert, kilimanjaro, victoria falls, ngorongoro crater, nile river, zambezi</p>
      <p>North America: north america, north american, american, the us, usa, united states, u.s., u.s.a., the states, canada, canadian, mexico, mexican, washington, new york, new york city, los angeles, chicago, san francisco, boston, miami, dallas, houston, atlanta, seattle, philadelphia, detroit, phoenix, denver, las vegas, orlando, san diego, austin, toronto, vancouver, montreal, ottawa, calgary, edmonton, quebec city, winnipeg, halifax, mexico city, guadalajara, monterrey, tijuana, cancun, puebla, merida, white house, statue of liberty, niagara falls, grand canyon, empire state building, times square, hollywood, pentagon, capitol hill, liberty bell, alamo, mount rushmore, silicon valley</p>
      <p>South America: south america, south american, latin america, latinoamerica, andes, argentina, brazil, chile, peru, colombia, venezuela, ecuador, bolivia, paraguay, uruguay, guyana, suriname, french guiana, buenos aires, rosario, cordoba, rio de janeiro, são paulo, brasilia, salvador, recife, santiago, valparaíso, lima, cusco, bogotá, medellín, caracas, quito, la paz, sucre, asunción, montevideo, georgetown, paramaribo, cayenne, amazon rainforest, andes mountains, machu picchu, iguazu falls, cristo redentor, atacama desert, pampas, patagonia, galápagos islands</p>
      <p>Australia: australia, australian, oceania, australasia, new zealand, kiwi, aotearoa, papua new guinea, fiji, samoa, tonga, vanuatu, solomon islands, micronesia, palau, marshall islands, nauru, tuvalu, kiribati, sydney, melbourne, canberra, brisbane, perth, adelaide, hobart, darwin, auckland, wellington, christchurch, hamilton, dunedin, suva, port moresby, nuku’alofa, great barrier reef, uluru, ayers rock, outback, tasmania, kangaroo island, aboriginal, maori, dreamtime, coral sea, southern ocean, tasman sea, anzac, trans-tasman</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and Fairness in Large Language Models: A Survey, Computational Linguistics 50 (2024) 1097–1179. doi:10.1162/coli_a_00524.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] F. M. Plaza-del-Arco, A. C. Curry, S. Paoli, A. Cercas Curry, D. Hovy, Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 4346–4366. doi:10.18653/v1/2024.findings-emnlp.251.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. F. Oketunji, M. Anas, D. Saina, Large Language Model (LLM) Bias Index-LLMBI, Data &amp; Policy (2023). arXiv:2010.00133.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] K. Simbeck, They shall be fair, transparent, and robust: auditing learning analytics systems, AI and Ethics 4 (2024) 555–571. doi:10.1007/s43681-023-00292-7.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, A. Garriga-Alonso, Towards automated circuit discovery for mechanistic interpretability, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 16318–16352. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Paper-Conference.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Galichin, A. Dontsov, P. Druzhinina, A. Razzhigaev, O. Y. Rogov, E. Tutubalina, I. Oseledets, I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders, 2025. arXiv:2503.18878.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V. Varma, J. Kramár, A. Dragan, R. Shah, N. Nanda, Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, in: Y. Belinkov, N. Kim, J. Jumelet, H. Mohebbi, A. Mueller, H. Chen (Eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Miami, Florida, US, 2024, pp. 278–300. doi:10.18653/v1/2024.blackboxnlp-1.19.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] D. Shu, X. Wu, H. Zhao, D. Rai, Z. Yao, N. Liu, M. Du, A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models, 2025. arXiv:2503.05613.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] B. Chughtai, L. Chan, N. Nanda, A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 6243–6267. URL: https://proceedings.mlr.press/v202/chughtai23a.html.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Li, M. Du, R. Song, X. Wang, Y. Wang, A Survey on Fairness in Large Language Models, 2024. arXiv:2308.10149.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Abid, M. Farooqi, J. Zou, Persistent Anti-Muslim Bias in Large Language Models, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 298–306. doi:10.1145/3461702.3462624.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Abrar, N. T. Oeshy, M. Kabir, S. Ananiadou, Religious Bias Landscape in Language and Text-to-Image Models: Analysis, Detection, and Debiasing Strategies, 2025. arXiv:2501.08441.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Lin, Neuronpedia: Interactive Reference and Tooling for Analyzing Neural Networks, 2023. URL: https://www.neuronpedia.org. Software available from neuronpedia.org.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science 356 (2017) 183–186. doi:10.1126/science.aal4230.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. May, A. Wang, S. Bordia, S. R. Bowman, R. Rudinger, On Measuring Social Biases in Sentence Encoders, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 622–628. doi:10.18653/v1/N19-1063.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Kaneko, D. Bollegala, Unmasking the Mask – Evaluating Social Biases in Masked Language Models, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 11954–11962. doi:10.1609/aaai.v36i11.21453.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Kurita, N. Vyas, A. Pareek, A. W. Black, Y. Tsvetkov, Measuring Bias in Contextualized Word Representations, in: M. R. Costa-jussà, C. Hardmeier, W. Radford, K. Webster (Eds.), Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics, Florence, Italy, 2019, pp. 166–172. doi:10.18653/v1/W19-3823.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Nadeem, A. Bethke, S. Reddy, StereoSet: Measuring stereotypical bias in pretrained language models, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 5356–5371. doi:10.18653/v1/2021.acl-long.416.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. Vig, A Multiscale Visualization of Attention in the Transformer Model, in: M. R. Costa-jussà, E. Alfonseca (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 37–42. doi:10.18653/v1/P19-3007.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] S. Jain, B. C. Wallace, Attention is not Explanation, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3543–3556. doi:10.18653/v1/N19-1357.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, A. Bau, J. Glass, What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 6309–6317. doi:10.1609/aaai.v33i01.33016309.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. Viégas, M. Wattenberg, An Interpretability Illusion for BERT, 2021. arXiv:2104.07143.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] C. Olah, Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases, Transformer Circuits Thread, 2022. URL: https://www.transformer-circuits.pub/2022/mech-interp-essay. Informal note published June 27, 2022.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] H. Cunningham, A. Ewart, L. Riggs, R. Huben, L. Sharkey, Sparse Autoencoders Find Highly Interpretable Features in Language Models, 2023. arXiv:2309.08600.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] Z. He, W. Shu, X. Ge, L. Chen, J. Wang, Y. Zhou, F. Liu, Q. Guo, X. Huang, Z. Wu, Y.-G. Jiang, X. Qiu, Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders, 2024. arXiv:2410.20526.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, N. Nanda, Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders, 2024. arXiv:2407.14435.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] J. Bloom, Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2 Small, 2024. URL: https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>