Detecting Stereotyped Representations of Words within Language Models Embedding Space

Michele Dusi
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Brescia, Via Branze 38, Brescia, Italy
michele.dusi@unibs.it, michele.dusi@uniroma1.it

Doctoral Consortium at the 23rd International Conference of the Italian Association for Artificial Intelligence, Bolzano, Italy, November 25-28, 2024. Michele Dusi was enrolled in the Italian National PhD Program in Artificial Intelligence conducted by Sapienza, University of Rome, with the University of Brescia.

Abstract
Today's widespread use of Natural Language Processing techniques raises the need for control mechanisms that prevent harmful behaviors in terms of safety and ethics. Many language models have been shown to learn a distorted representation of words and concepts, absorbing such prejudiced information from the stereotypes present in their training datasets. In this paper, a new method is presented to detect whether a language model exhibits internal bias. The proposed method is based on Cramér's V metric [1], which measures the correlation between two categorical variables, and it operates directly on the model's internal representation by analyzing its word embeddings. Empirical results on gender and religion biases suggest that a cardinality of 50 words per class is sufficient to obtain stable values, although even a dozen words per class can provide an acceptable estimate of the measurement.

1. Introduction

Scientific literature on AI fairness has grown in recent years, as fairness began to be considered a requirement in system development and various methodologies have been developed to ensure its presence in AI models. Fairness has been defined in several ways, but to grasp the general idea, fairness is the “absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics” [2].

In this paper, we address the problem of fairness in the field of Natural Language Processing (NLP): our analysis focuses on the words and texts fed to models [3], with the purpose of understanding whether their processing can be seen as fair or unfair. More specifically, we rely on carefully selected datasets of words that clearly designate a human attribute, such as gender or religion. Our aim is to detect whether the representation of these attributes is somehow biased within the language model's inner embedding space, i.e., whether the embeddings of these words suggest an unwanted correlation with other attributes, such as job salary or criminality.

Approaching the study of bias by analyzing the relationship between two attributes is a standard procedure in the literature [3]; stereotypes are often defined as an undesirable association between human properties. For example, a stereotype could suggest an association between women (gender property) and a lower salary, or an association between Muslim people (religion property) and a stronger tendency to criminal behavior.

Our computational approach diverges from those described in the existing literature [4, 5]. Specifically, we exploit access to a white-box model to characterize the distribution of the embeddings associated with a primary attribute, namely the protected property, and we subsequently compare this distribution with that of the embeddings corresponding to a secondary attribute, namely the stereotyped property. The association is quantified by calculating a score within the interval [0, 1] using Cramér's V metric [1].
The methodology and preliminary findings of this ongoing research are outlined in this short paper; additional details are available in the full paper [6]. This work draws some of its initial insights from a prior study on bias visualization [7].

2. Background and Related Works

The seminal work that first highlighted fairness issues in natural language processing (NLP) was published in 2016 [8]. This study focused on evaluating and mitigating gender bias in early word embedding models, demonstrating the significant drawbacks of training these models on large text corpora without critical oversight. The observation of biases in language models prompted a series of studies examining the geometry of the embedding space to assess whether the embeddings exhibit any undesired distributions [4, 9]. At first, the models considered were based on static word embeddings, such as Word2vec or GloVe. Over the following few years, the same approach was applied to contextual models [10], such as Transformer-based [11] and BERT-based [12, 13] models. Our study focuses on the same contextual Transformer-based models, which are among the most widely studied open-source language models in the scientific literature. Compared to other techniques, our method requires fewer words to define the analyzed properties, thereby making it more practical and easier to apply.

For the definition of bias, we refer to a survey paper that presents a structured framework: in [3], the authors outline an ontology-based approach by defining bias at the semantic level. This approach characterizes bias as the (undesirable) correlation between two human properties, and it serves as the foundation for our bias detection technique.

3. Methodology

In this section, we outline our methodology for measuring the bias of a language model. As briefly mentioned earlier, the procedure involves accessing the inner embedding space of the model to examine and analyze the distribution of word vectors.

Encoding the properties. We start from the two properties involved in the bias we want to detect (e.g., the gender–jobs bias or the religion–criminality bias). These are the two properties that need to be analyzed within the language model. The first step is to define a word list for each value of the property, with the goal of collecting the terms used in the language to describe a specific value of a given property. For instance, the male class of the gender property can be represented by terms such as “he”, “him”, “father”, and “king”, while the female class of the same property can be represented by terms such as “she”, “her”, “mother”, and “queen”. Each term is then converted into a vector, referred to as its word embedding, by the language model. Since we work with Transformer-based models, the word embedding is context-dependent, meaning that it varies based on the entire sentence in which the word is used. Therefore, each term appears in multiple sentences, and its final embedding is computed by averaging the embeddings obtained from these sentences. The result of this pre-processing step is a list of vectors within the model's embedding space, each encoding an average representation of the corresponding term.
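To make this step concrete, the following sketch computes an averaged contextual embedding for each term with the Hugging Face transformers library and bert-base-uncased (one of the models evaluated in Section 4). The template sentences, the use of the last hidden layer, and the averaging over sub-tokens are illustrative assumptions rather than the exact setup of the full paper [6].

```python
# Minimal sketch of the "Encoding the properties" step (illustrative setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical template sentences; a real setup would adapt them to each term's part of speech.
TEMPLATES = [
    "Yesterday {} arrived early at the station.",
    "Everyone in the room listened to {} carefully.",
    "The story was about {} and nothing else.",
]

def average_embedding(term: str) -> torch.Tensor:
    """Average the contextual embeddings of `term` over the template sentences."""
    term_ids = tokenizer(term, add_special_tokens=False)["input_ids"]
    vectors = []
    for template in TEMPLATES:
        encoding = tokenizer(template.format(term), return_tensors="pt")
        with torch.no_grad():
            hidden = model(**encoding).last_hidden_state[0]  # (sequence_length, hidden_size)
        ids = encoding["input_ids"][0].tolist()
        # Locate the (possibly multi-token) span of the term and average its sub-token vectors.
        for start in range(len(ids) - len(term_ids) + 1):
            if ids[start:start + len(term_ids)] == term_ids:
                vectors.append(hidden[start:start + len(term_ids)].mean(dim=0))
                break
    return torch.stack(vectors).mean(dim=0)  # one vector per term, averaged over sentences

protected_words = {"male": ["he", "him", "father", "king"],
                   "female": ["she", "her", "mother", "queen"]}
protected_embeddings = {cls: [average_embedding(w) for w in words]
                        for cls, words in protected_words.items()}
```

The same function can be reused to embed the terms of the stereotyped property in the later steps.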
Learning the protected property. In the next step, the protected embeddings, that is, the embeddings of the terms associated with the protected property, are used to train an auxiliary classifier to distinguish between the different values of the protected property. This step aims to identify how the language model encodes the protected classes (e.g., male and female), and it addresses questions such as: which vector components are most relevant for encoding gender or religion? How are the protected classes distributed within the model's embedding space?

Evaluating the stereotyped property. The stereotyped embeddings, that is, the embeddings of the terms of the stereotyped property, are then used to test the auxiliary classifier trained in the previous step. Each stereotyped embedding corresponds to a single value of the stereotyped property, but it is also classified as one of the values of the protected property. As a result, each embedding in this test set is identified by a pair of values. The expected outcome is a random classification by the auxiliary classifier: the stereotyped embeddings should not contain any encoding of the protected property, so they could be classified into any of the protected values. However, if the model exhibits bias and the word embeddings are not neutral, we should observe a statistical shift in the classifier's predictions. For example, in the case of religion bias, the terms “criminal” and “peaceful” should ideally be labeled as either Muslim or Christian independently, as they do not inherently carry any religious connotation. However, if bias is present, we may observe that crime-related terms are classified as Muslim and good-related terms as Christian.

Table 1
Example of a contingency matrix showing the average distribution (over 10 test cases) of the predicted religion labels (protected property) for the actual adjective words (stereotyped property).

  Actual \ Predicted     Christian    Muslim        Σ
  positive                    59.2      60.8      120
  negative                    46.0      74.0      120
  Σ                          105.2     134.8      240
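The paper does not prescribe a specific auxiliary classifier, so the sketch below uses a scikit-learn logistic regression as one possible choice; the religion and adjective word lists are likewise illustrative, and average_embedding is the helper defined in the previous sketch. The result is a contingency table with the same structure as Table 1 (raw counts rather than averages over test cases).

```python
# Illustrative sketch: train the auxiliary classifier on the protected embeddings,
# then aggregate its predictions on the stereotyped embeddings into a contingency table.
# The classifier choice and all word lists are assumptions made for this example.
import numpy as np
from sklearn.linear_model import LogisticRegression

protected_words = {"Christian": ["church", "priest", "gospel", "baptism"],
                   "Muslim": ["mosque", "imam", "quran", "ramadan"]}
stereotyped_words = {"positive": ["peaceful", "honest", "generous"],
                     "negative": ["criminal", "violent", "aggressive"]}

# The protected embeddings train the classifier (average_embedding comes from the previous sketch).
X_prot = np.stack([average_embedding(w).numpy()
                   for words in protected_words.values() for w in words])
y_prot = [cls for cls, words in protected_words.items() for _ in words]
classifier = LogisticRegression(max_iter=1000).fit(X_prot, y_prot)

# Contingency table: rows are stereotyped classes, columns are predicted protected classes.
contingency = {s_cls: {p_cls: 0 for p_cls in classifier.classes_}
               for s_cls in stereotyped_words}
for s_cls, words in stereotyped_words.items():
    X_ster = np.stack([average_embedding(w).numpy() for w in words])
    for prediction in classifier.predict(X_ster):
        contingency[s_cls][prediction] += 1

print(contingency)  # e.g. {'positive': {'Christian': 2, 'Muslim': 1}, 'negative': {...}}
```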
Measuring the bias. The final step aims to measure the distortion in the classification and to express it as a quantifiable metric. The predicted labels are collected and counted by class, resulting in an aggregate structure known as the contingency table; an example can be seen in Table 1. A contingency table is a matrix where each row corresponds to a stereotyped class (e.g., “positive adjectives” and “negative adjectives” for criminal behavior), and each column corresponds to a protected class (e.g., “Christian” and “Muslim” for religion). The value in each cell represents the (average) number of terms belonging to the row-associated stereotyped class and labeled with the column-associated protected class. For instance, in Table 1, the 120 negative adjectives are split between “Christian” (46) and “Muslim” (74). As stated before, an unbalanced distribution may suggest a biased representation of these concepts; in this example, Muslim people are more likely than Christian people to be associated with negative terms, and Christian people more likely to be associated with positive terms.

To compute the strength of this association, we use Cramér's V metric [1], which measures the correlation between two categorical variables. In our context, these variables correspond to the protected and the stereotyped properties, and their values are the classes into which the words are grouped. Cramér's V is normalized between 0 and 1, where 0 represents a situation of no correlation (i.e., the properties are independent and unrelated), whereas 1 represents a situation of maximal association between the properties (i.e., all the words of one stereotyped class are assigned the same protected class).

To compute the score, we first calculate the chi-squared statistic χ², i.e., the sum of the squared differences between the observed distribution and the expected distribution, each normalized by the expected count; the observed distribution is simply the contingency matrix gathered from the auxiliary classifier, whereas the expected distribution is the one obtained by assuming that the two variables are independent. The χ² value is then used to compute the Cramér's V score:

    V = √( χ² / ( n · min(|S| − 1, |P| − 1) ) )        (1)

which normalizes the result into the interval [0, 1]. More specifically, the χ² statistic is divided by the total number of samples n and by the minimum between the degrees of freedom of the rows (the number of stereotyped classes |S| minus 1) and the degrees of freedom of the columns (the number of protected classes |P| minus 1), and the square root of this ratio is taken. We consider the resulting score as a measure of bias, quantifying the prejudiced correlation between two human categories. It is important to note that, thanks to the mathematical properties of Cramér's V, the outcome is an easily interpretable value that is unaffected by the size of the initial datasets and is applicable to multi-class properties [1].
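As a worked example of Eq. (1), the sketch below recomputes Cramér's V for the contingency counts reported in Table 1, using plain NumPy (scipy.stats.chi2_contingency with correction=False yields the same χ²). The helper is generic and applies unchanged to contingency tables with more than two classes per property.

```python
# Worked example of Eq. (1): Cramér's V for the contingency counts of Table 1.
import numpy as np

def cramers_v(observed: np.ndarray) -> float:
    """Cramér's V of a contingency table (rows: stereotyped classes, columns: protected classes)."""
    n = observed.sum()
    # Expected counts under the independence assumption (outer product of the marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = min(observed.shape[0] - 1, observed.shape[1] - 1)
    return float(np.sqrt(chi2 / (n * dof)))

# Average counts from Table 1 (rows: positive, negative; columns: Christian, Muslim).
table_1 = np.array([[59.2, 60.8],
                    [46.0, 74.0]])
print(f"Cramér's V = {cramers_v(table_1):.3f}")  # prints 0.111 for these counts
```

For a 2 × 2 table, as in this example, Cramér's V coincides with the absolute value of the φ coefficient; the normalization by min(|S| − 1, |P| − 1) is what makes the score comparable across properties with different numbers of classes.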
4. Experimental Results

In this section, we present the results of our experiments on measuring the bias of language models. We evaluated the following two Transformer-based models:

• BERT [12] in its base implementation (bert-base-uncased), available on Hugging Face (https://huggingface.co/bert-base-uncased);
• RoBERTa [13], a more robust version of BERT, also available on Hugging Face (https://huggingface.co/docs/transformers/model_doc/roberta).

The considered models are trained mainly on the English language. We tested two different kinds of social bias: the gender bias, with respect to the stereotyped professions, and the religion bias, with respect to a positive or negative behavior (expressed by adjectives such as “peaceful” or “aggressive”).

Table 2
Values of the Cramér's V metric over 100 test cases. Each row represents an experiment over two properties, a protected one (p_prot) and a stereotyped one (p_ster).

  p_prot      p_ster        BERT      RoBERTa
  gender      profession    33.5 %    39.2 %
  religion    adjectives    13.9 %     2.8 %

The results are heterogeneous across the different models, indicating that the language models exhibit varying amounts of bias. Table 2 summarizes the key findings of this experimental phase. We observe the highest bias values for the gender property (BERT 33.5 %, RoBERTa 39.2 %), while the religion property shows a relatively lighter presence of bias (BERT 13.9 %, RoBERTa 2.8 %). When comparing the results from the model perspective, RoBERTa [13] shows a marked difference between the scores of the two domains. This suggests that gender is strongly encoded and recognizable within RoBERTa's word embeddings, while the model does not exhibit any notable religion bias. In contrast, BERT [12] shows less disparity between its scores, implying that while biased behavior may emerge from the internal representation, its effects on the model's functions are relatively lighter.

Finally, we observe from the series of results that a cardinality of 50 words per class is sufficient to obtain stable values. However, even as few as a dozen words per class can provide an acceptable estimate of the measurement.

5. Conclusion and Future Works

In this paper, we presented a novel automatic method for detecting and measuring social bias within language models. The method requires minimal initial data, as it only necessitates the definition of two datasets corresponding to the properties being analyzed. Furthermore, our method operates directly on the model's internal representation by analyzing its word embeddings. This, however, also constitutes a limitation, as it requires access to a white-box model; such an approach would not be feasible for other large language models (LLMs), which are often black-box models.

In the future, it would be valuable to expand this type of analysis in several directions. For example, new properties and additional classes could be considered. A common limitation in the technical literature on bias detection is the reduction of gender to only two classes (male and female); incorporating more possibilities, as supported by current psychological studies [14], could help address this issue. Our method is already capable of handling a larger number of classes; the challenge, in this case, would be to identify a sufficient number of terms that uniquely represent the newly introduced classes.

As noted earlier, this article is written in English, which is the standard language in the scientific literature. However, different languages may carry distinct stereotypes and biases. For example, in some languages gender is embedded in grammatical structures, which can affect the interpretation of certain terms. To fully address the nuances of language-specific biases, further research is required to adapt this method to other languages in a way that accounts for their unique characteristics.

References

[1] H. Cramér, Mathematical Methods of Statistics, Princeton University Press, Princeton, 1946.
[2] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2021).
[3] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, L. A. Ureña-López, A survey on bias in deep NLP, Applied Sciences 11 (2021). URL: https://www.mdpi.com/2076-3417/11/7/3184. doi:10.3390/app11073184.
[4] A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science 356 (2017) 183–186.
[5] C. May, A. Wang, S. Bordia, S. R. Bowman, R. Rudinger, On measuring social biases in sentence encoders, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 622–628.
[6] M. Dusi, N. Arici, A. E. Gerevini, L. Putelli, I. Serina, Discrimination bias detection through categorical association in pre-trained language models, IEEE Access 12 (2024) 162651–162667. doi:10.1109/ACCESS.2024.3482010.
[7] M. Dusi, N. Arici, A. E. Gerevini, L. Putelli, I. Serina, Graphical identification of gender bias in BERT with a weakly supervised approach, in: NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, CEUR-WS, 2022. URL: http://sag.art.uniroma2.it/NL4AI/wp-content/uploads/2022/11/paper16.pdf.
[8] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, in: D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 4349–4357.
[9] W. Guo, A. Caliskan, Detecting emergent intersectional biases: Contextualized word embeddings contain a distribution of human-like biases, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 122–133. URL: https://doi.org/10.1145/3461702.3462536. doi:10.1145/3461702.3462536.
[10] J. Zhao, T. Wang, M. Yatskar, R. Cotterell, V. Ordonez, K.-W. Chang, Gender bias in contextualized word embeddings, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 629–634. URL: https://aclanthology.org/N19-1064. doi:10.18653/v1/N19-1064.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[12] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[14] C. Richards, W. P. Bouman, L. Seal, M. J. Barker, T. O. Nieder, G. T'Sjoen, Non-binary or genderqueer genders, International Review of Psychiatry 28 (2016) 95–102.