<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>On the Relationship of Social Gender Equality and Grammatical Gender in Pre-trained Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Magdalena Biesialska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Solans</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Luque</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Segura</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>D Research</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TALP Research Center, Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Telefónica I</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Large Language Models pre-trained on vast amounts of text have demonstrated remarkable capabilities in modeling and generating human language, finding applications across a wide range of Natural Language Processing tasks. However, recent studies have unveiled the presence of biases in these models, inherited from social biases reflected in their training data. In this research article, we delve into the examination of grammatical gender's influence on four distinct languages, exploring how the gender prejudices exhibited by the LLMs relate to their capacity to characterise social realities. We show that the prevalence of gender biases differs not just in relation to the architecture and training data of the LLMs, as previously documented, but also varies with respect to the language and the level of grammatical gender marking present in the language under study. Different LLM systems and languages are examined, ranging from a major grammatical gender language, such as Polish, to English, which lacks most gender inflection, and through gendered languages such as German and Spanish.</p>
      </abstract>
      <kwd-group>
        <kwd>gender bias</kwd>
        <kwd>large language models</kwd>
        <kwd>bias quantification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) are neural network systems that have been trained on massive amounts of text data by using deep learning techniques [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. These models seem capable of generating and comprehending human-like text, e.g., having reported remarkable performance across the majority of Natural Language Processing (NLP) benchmarks and tasks [<xref ref-type="bibr" rid="ref3">3, 4</xref>]. Pre-trained LLMs are often adapted or fine-tuned to a specific NLP task (often referred to as downstream tasks), aiming at reducing the computationally expensive and time-consuming training stage. Downstream tasks can include a range of NLP tasks such as machine translation, question answering, semantic parsing, natural language inference or paraphrasing, among others [5], and often rely on word embeddings [6] extracted from pre-trained LLMs, e.g., in sentiment and gender bias towards politicians [7]. However, these models are not immune to the biases that exist in society, often reflected in their training corpora, such as gender bias or other social clichés [8]. As LLMs are trained on data that is not well balanced in terms of gender or other attributes, they reflect societal stereotypes in many shapes, forms and times [9, 10]. The biases present in the massive amounts of linguistic data used to train LLMs are often incorporated by them, as in the case of Virtual Assistants [11]. This could have a long-lasting effect on societal conditions, leading to discriminatory responses and decisions about race, age, religion, geographical origins, or the specific case of gender [7, 12, 13, 14], thus perpetuating mechanisms that create and maintain male dominance.</p>
      <p>As a result of this, LLMs could fail to correlate female terms, e.g., with engineering professions, being prone to not promote female candidates for engineering positions even when they are equally qualified [15, 9] as their male counterparts. A biased LLM may perpetuate harmful stereotypes and reinforce both bad preconceptions and prejudices, which would limit chances and increase inequalities, thereby limiting opportunities for some groups [16]. Furthermore, online data is gathered from the specific group of the population that uses online resources, which has particular characteristics, resulting in biased training samples that fail to effectively reflect the needs of marginalised social groups [<xref ref-type="bibr" rid="ref2">17, 18, 2</xref>]. Detecting and characterizing biases becomes a crucial task, especially when such models are used in high-risk domains (see the European Commission's regulatory framework proposal on artificial intelligence, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai), where NLP applications can easily limit human potential, e.g., by inducing biases against women in authority [19], hamper economic growth and, definitively, reinforce social inequity [20]. In the labour market domain, efforts to address gender biases include promoting diversity and inclusion in hiring and promotion processes, raising awareness of unconscious bias, and providing support to women and other underrepresented groups [21].</p>
      <sec id="sec-1-1">
        <title>1European Commission, Regulatory framework</title>
        <p>proposal on artificial intelligence.
https://digitalstrategy.ec.europa.eu/en/policies/regulatory-framework-ai
But the problem does not only apply to the data. NLP perception proposing that language influences the way
systems are prone to amplify the gender bias exhibited in we perceive and think about the world, a concept known
text corpora. Hence, the problem becomes multi-faceted as the Sapir-Whorf hypothesis or linguistic relativity
hyand may be present at various stages of the development pothesis. Whorf argues that diferent languages may lead
of NLP systems, including training data, resources, pre- to diferent ways of thinking and perceiving the world,
trained models, and algorithms [22]. Further propagation suggesting that language not only reflects our thoughts
of gender bias from NLP models to downstream applica- but also shapes and constrains them, having the key
artions is likely to reinforce harmful stereotypes and may gument that language afects our perception of time.
result in, for example, discrimination of female candi- In contrary, some studies argued that the influence of
dates on the labour market. language on thought is limited and that there are
uni</p>
        <p>Presence of LLMs’ gender biases in the labour mar- versal cognitive processes that are independent of
lanket domain has been previously investigated at the level guage [36]. This pattern of perception, which is predicted
of professions, assessing the correlation between labour by the asymmetry between space and time in linguistic
census and LLMs’ association scores for a subset of pro- metaphors, was reported also in [37] by tasks that do not
fessions across genders [8, 23]. However, we argue that involve any linguistic stimuli or responses, arguing that
this form of bias evaluation does not consider relation- our mental representations and conceptualization of time
ships between professions, such as the economic sectors are built upon our experiences with space and motion
in which their activity is developed. and not necessarily involving the way we talk about time,</p>
        <p>For this reason, we provide an alternative perspective e.g., by using the spatial language from an idiom.
that relies on the evaluation of biases in LLMs at the level Nonetheless, recent research has provided evidence
of economic sectors. Using a higher level of granularity supporting the influence of the language we speak on our
allows us to detect patterns that could not be observed cognitive framework of the world we perceive. The work
before. The findings of this study have important im- of Tan et al. [38] uses functional magnetic resonance
plications for the development and use of cross-lingual imaging (fMRI) reporting that brain regions involved
language models. By quantifying gender bias, these mod- in language processing are also activated during
percepels can be improved to provide more fair and unbiased tual decision-making tasks, which suggests that language
representations of language. This research contributes and perception are closely intertwined. Finally, the study
to the broader goal of promoting gender equality and from Banaji and Hardin [39] supports the claim that
genreducing bias in NLP applications. der information conveyed by word and sentences can
automatically influence judgement, creating a form of
automatic stereotyping in persons.
2. Related Work One of the primary objectives of this research is to
investigate diferences in gender equality among
countries, across various economic sectors, and with regard to
LLMs. This research will explore the correlation between
gender-marked languages and gender equality and
evaluate whether LLMs represent the world depending on the
language they were trained on. Additionally, previous
work in LLMs and labour sector, from computational
linguistics [8, 15] has focused on a small fraction (∼ 15.6%)
of the complete list of professions available in the U.S.
census to assess biases where diferences in gender
prevalence according to the census are maximised (e.g.,
female professions) or minimised (e.g., neutral professions).</p>
        <p>However, the previous analysis does not shed light on
patterns that might be dependent on the economic sector
where the full set of professions are located.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Gender bias is understood as the systematic preference or</title>
        <p>prejudice toward one gender over the other [17, 24, 25].
Previous work has studied the issue of quantifying
social biases in language [26], NLP [27, 28], and
specifically, gender biases elicited by LLMs or carried on by
implicit associations in their word embeddings, for
human work-related activities [15]. However, while the
proposed methods work well for English-based LLMs,
they fail to capture bias for languages with a rich
morphology or gender-marking, such as German, Polish or
Spanish [29]. Countries where gendered languages are
spoken often evidence less gender equality compared to
countries with other grammatical gender systems [30].
While previous work has centered on the English
language, recent studies have explored bias in multilingual
contexts and languages other than English [31, 32, 33, 34]
.</p>
      </sec>
      <sec id="sec-1-3">
        <title>There are weak evidences that language shapes the</title>
        <p>way of thinking. Previous argument is mentioned in the
work of Whorf and Carroll [35] and such ideas have been
the subject of debate and criticism. Whorf’s work
explores the idea that language shapes our cognition and</p>
        <sec id="sec-1-3-1">
          <title>2.1. Contributions</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>This work makes the following contributions: (i) We ex</title>
        <p>tend previous definitions of gender biases in pre-trained
LLMs to work with two diferent types: stereotyping bias
and representation bias and characterise multiple items
in the trade-of between them. (ii) We evaluate such
biases in pre-trained LLMs across multiple languages, ranging from languages without grammatical gender (e.g., English) to rich morphological or gender-marking languages, which we name gendered languages (e.g., Spanish). (iii) For each language, we perform an evaluation on multiple pre-trained LLMs such as BERT [40] and RoBERTa [41]. (iv) Whereas the state of the art has focused on studying biases in the labour market at the level of professions, we change the lenses and analyze them at the level of economic sectors, comparing results with gender statistics for the labour market. (v) We will release our code, including the templates.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. Methodology</title>
        <p>This section describes the methodology employed to measure gender bias in LLMs, with a focus on labour market stereotypes. In our work, we leverage pre-trained LLMs to quantify gender bias using a template-based approach to measure association scores between a token and a masked target or attribute.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Pre-trained LLMs have been successfully employed to</title>
        <p>diferent tasks and numerous applications in NLP in
recent years. Significant performance gains have led to
the development of various architectures. One of the
most prominent LLMs is BERT [40]. Later RoBERTa
3.2. Grammatical and Natural Gender approach this task from two diferent perspectives, what
Languages we name Stereotyping Bias () and Representation Bias
(). The former quantifies how a given LLM is far from
In the field of linguistics, a grammatical gender system gender neutrality given a context. The latter takes into
represents a distinct form of a noun class system, wherein account the LLM bias with respect to what is observed in
nouns are categorised based on gender attributes. In society. For instance, in figure 1d where professions are
languages featuring a grammatical gender system, the supposed to be balanced among genders [15], we would
majority of the nouns inherently bear one value of the expect that a BERT model with no bias will produce
assogrammatical category known as gender. ciation scores around zero (see section 3.3.2 for more
de</p>
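        <p>As an illustration of the model grid described above, the following sketch loads a few cased and uncased checkpoints with the Hugging Face transformers library. The checkpoint identifiers are publicly available models from the cited model families and are used here as assumptions for illustration; they are not necessarily the exact checkpoints evaluated in this work (see table 4 in the annex).</p>
        <preformat>
# Illustrative only: plausible public masked-LM checkpoints per language.
# The exact models, sizes and corpora used in this work are listed in table 4.
from transformers import AutoModelForMaskedLM, AutoTokenizer

CHECKPOINTS = {
    ("English", "uncased"): "bert-base-uncased",
    ("English", "cased"): "bert-base-cased",
    ("German", "cased"): "bert-base-german-cased",
    ("Spanish", "cased"): "dccuchile/bert-base-spanish-wwm-cased",
    ("Polish", "cased"): "dkleczek/bert-base-polish-cased-v1",
}

models = {}
for (language, casing), name in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    model.eval()  # inference only: the bias probes involve no fine-tuning
    models[(language, casing)] = (tokenizer, model)
        </preformat>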
      </sec>
      <sec id="sec-1-5">
        <title>3.2. Grammatical and Natural Gender Languages</title>
        <p>In the field of linguistics, a grammatical gender system represents a distinct form of a noun class system, wherein nouns are categorised based on gender attributes. In languages featuring a grammatical gender system, the majority of the nouns inherently bear one value of the grammatical category known as gender.</p>
        <p>The Spanish language is a Romance language that falls within the grammatical gender category, as do German and Polish. In Spanish, there are two genders, masculine and feminine, and both the noun and adjective systems exhibit these two genders [43]. In addition, articles and some pronouns and determiners have a neuter gender in their singular form. German is also an inflected language [44] with three genders: masculine, feminine and neuter. In Polish, the only Slavic language in this study, nouns belong to one of three genders: masculine, feminine and neuter. In this West Slavic language, the masculine gender is further divided into subgenders: animate/inanimate in the singular, and human/non-human in the plural. Furthermore, adjectives agree with nouns in terms of gender, and conjugated verb forms agree with their subject's gender in the past tense and in subjunctive/conditional forms.</p>
        <p>Nevertheless, English is considered a natural gender language and most of its nouns, with some exceptions, are considered genderless [44]. English has three gendered pronouns, but no longer has grammatical gender in the sense of noun class distinctions or inflections. Instead, gender is characterised through the language's pronouns [30], that is, the distinction between "he", "she", other personal pronouns and "it".</p>
      </sec>
      <sec id="sec-1-6">
        <title>3.3. Bias Quantification</title>
        <p>To quantify biases in a particular context, it is important to first establish a clear definition of what a bias-free system would look like. This requires a thoughtful reflection on the desired behavior of the analysed model and the impact that potential biases might have. In our work, we approach this task from two different perspectives, which we name Stereotyping Bias (𝒮) and Representation Bias (ℛ). The former quantifies how far a given LLM is from gender neutrality given a context. The latter takes into account the LLM bias with respect to what is observed in society. For instance, in figure 1d, where professions are supposed to be balanced among genders [15], we would expect that a BERT model with no bias will produce association scores around zero (see section 3.3.2 for more details on the association scores). Looking at figure 1a, any deviation from the observed perfect overlapping would account for stereotyping bias; see sections 3.3.3 and 3.3.4 for further details on the two perspectives on bias quantification. The applicability and preference for one notion over the other depends on the context of usage of the LLM at hand [45]. Existing studies typically quantify gender bias in pre-trained LLMs using tailored sets of synthetically generated sentences and implicit associations between word embeddings [46]. In the work of Kurita et al. [47], gender bias in BERT models is measured using a probability-based metric [25] and template sentences. Specifically, the LLM is directly queried for a particular token in a template sentence by sequentially masking either the target or the attribute token; see table 1, where [TARGET] and [ATTRIBUTE] stand for the target and the attribute words, respectively. In our analysis, the mask [TARGET] is replaced by gendered nouns and pronouns (e.g., he/she/my sister) and the mask [ATTRIBUTE] is replaced with terms related to specific economic sectors (e.g., fishing/services/secondary). As contextualised embeddings of a given token are dependent on its context, a relative measure of bias for the attribute word can be evaluated by substituting target classes (e.g., male and female). In [47], the authors compare their evaluation method with a baseline cosine similarity measure among word embeddings.</p>
        <p>However, Kurita's methodology confronts different challenges when applied to grammatically gendered languages such as Spanish or German. Previous work by Bartl et al. [15] demonstrated that the original association scores proposed in [47] were not effective for the German language due to its inherent gender suffixes in attributes. In English, a few gendered words exist (e.g., king/queen, waiter/waitress, actor/actress), and measuring the association score for sentences with those words, e.g., "[TARGET] is the waitress", with male or female options would yield misleading results when using word embedding projection methods [29], thus showing a gender bias against men instead of women. This phenomenon carries over into gendered languages, where different words are used for each gender. For instance, if we compare the distributions from figures 1a to 1c, corresponding to the distributions of association scores for the Spanish language, we notice that Kurita's method obtains overlapping distributions for the three groups of professions in Spanish. This result is also confirmed by a drastic reduction of the p-values obtained with a Wilcoxon test statistic, compared to the English distributions. It is worth mentioning that the same effect occurs for German, as previously noticed by Bartl et al. [15], and for Polish. These results motivate us to develop a new set of templates, aiming to avoid the effects of gendered attributes in the quantification of bias in this work.</p>
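        <p>As a concrete illustration of the distribution comparison mentioned above, the sketch below contrasts association-score distributions for female and male targets with a Wilcoxon test from SciPy. The score values are made-up placeholders, not the distributions of figure 1, and since the exact test variant is not specified here, the paired signed-rank form is an assumption.</p>
        <preformat>
# Minimal sketch of the significance check described above, assuming SciPy.
# The association scores below are placeholders for paired template scores.
from scipy.stats import wilcoxon

female_scores = [0.12, -0.05, 0.30, 0.08, -0.11]  # female targets, per template
male_scores = [0.10, -0.02, 0.28, 0.11, -0.09]    # male targets, same templates

statistic, p_value = wilcoxon(female_scores, male_scores)
print(f"Wilcoxon statistic = {statistic:.3f}, p-value = {p_value:.3f}")
        </preformat>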
      </sec>
      <sec id="sec-1-7">
        <title>3.3.1. Templates</title>
        <p>We adapt the idea of using templates to quantify and measure gender bias [47, 15]. Bartl et al. [15] used association scores to analyze gender biases across professions, releasing the BEC-Pro dataset for the English and German languages. We follow a similar approach, but we extend the analysis to two additional languages, Spanish and Polish. More importantly, we shift the focus from individual professions to entire economic sectors.</p>
        <p>Note that, as discussed in section 3.3, the relation between the grammatical gender of the person word and the profession does influence the association scores in gender-marking languages. In response to this, the novel approach of measuring biases across economic sectors instead of using a list of occupation words allows us to minimize potential complications stemming from grammatical gender inflections and pronouns. Additionally, by examining economic sectors, our investigation encompasses an aggregated view instead of limiting the analysis to a specific list of professions, which, for instance, facilitates relating the results to macroeconomic statistics.</p>
        <p>Our templates are designed to assess gender bias in LLMs concerning economic sectors. To achieve that, we take into account changes in sentence structure (e.g., articles) depending on the female or male person word.</p>
        <p>These templates follow a standard structure, where a sentence contains an economic sector reference as the attribute and a specific gendered term as the target; an example is sketched below.</p>
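        <p>The following is a minimal sketch of how such templates can be instantiated, crossing gendered person words (targets) with economic-sector terms (attributes). The template strings, person words and sector terms are illustrative stand-ins only and are not the released templates; the real templates additionally adjust articles and inflections to the gender of the person word.</p>
        <preformat>
# Illustrative template expansion; strings are examples, not the actual dataset.
TEMPLATES = {
    "en": "{target} works in the {attribute} sector.",
    "es": "{target} trabaja en el sector {attribute}.",
}
TARGETS = {
    "en": {"female": ["she", "my sister"], "male": ["he", "my brother"]},
    "es": {"female": ["ella", "mi hermana"], "male": ["él", "mi hermano"]},
}
ATTRIBUTES = {
    "en": ["primary", "secondary", "services"],
    "es": ["primario", "secundario", "servicios"],
}

def generate_sentences(lang):
    """Yield (gender, sector, sentence) triples for one language."""
    for gender, person_words in TARGETS[lang].items():
        for target in person_words:
            for attribute in ATTRIBUTES[lang]:
                yield gender, attribute, TEMPLATES[lang].format(
                    target=target, attribute=attribute
                )

for gender, sector, sentence in generate_sentences("en"):
    print(gender, sector, sentence)
        </preformat>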
      </sec>
      <sec id="sec-1-6">
        <title>The association scores methodology proposed by Kurita</title>
        <p>et al. [47] is employed to measure the likelihood of a
masked word being associated with a specific gender.
These scores quantify the gender bias present in the LLMs
by evaluating the probability that the masked token is
classified as male or female. Higher scores for a particular
gender indicate a stronger bias towards that gender in
the predictions of the evaluated LLM.</p>
        <p>The aim of this method is to estimate the implicit
association between specific targets and attributes using
BERT’s MLM objective. For example, using the
template sentence "she works in the construction sector",
the method can quantify the association between the
target female (given by the pronoun "she") and the attribute
construction. The distribution scores drawn by figure 1
are obtained in the same manner.</p>
        <p>The main steps of the method are as follows:</p>
      </sec>
      <sec id="sec-1-7">
        <title>1. Prepare a template sentence</title>
        <p>e.g. "[TARGET] works in the [ATTRIBUTE]
sector".</p>
        <p>For example this may be "she works in the
construction sector".
2. Mask the [TARGET] word and compute the target
probability  which corresponds to the
likelihood of the target word given an unmasked
attribute.</p>
        <p>For the updated example, the sentence becomes
"[MASK] works in the construction sector" and
 measures how likely the LLM is to predict
"she" as the missing word.</p>
      </sec>
      <sec id="sec-1-8">
        <title>3. Compute the prior probability , which is</title>
        <p>the likelihood of the target word when the
attribute is also masked.</p>
        <p>The example sentence would be "[MASK] works
in the [MASK] sector", and  is the
probability of predicting "she" without the influence of
the attribute.
4. Compute the association between target and
attribute as  = log  .</p>
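        <p>The steps above can be made concrete with the following sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the template, target and attribute strings are illustrative, and the checkpoint is an assumption rather than one of the exact models evaluated here.</p>
        <preformat>
# Minimal sketch of the association score a = log(p_target / p_prior).
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_prob(sentence, word):
    """Probability of `word` at the first [MASK] position of `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

def association_score(target, attribute):
    # Step 2: target probability with the attribute visible.
    p_target = masked_prob(f"[MASK] works in the {attribute} sector.", target)
    # Step 3: prior probability with the attribute masked as well.
    p_prior = masked_prob("[MASK] works in the [MASK] sector.", target)
    # Step 4: log ratio of the two probabilities.
    return math.log(p_target / p_prior)

# Female-minus-male association for a single template and attribute.
bias = association_score("she", "construction") - association_score("he", "construction")
print(f"female-minus-male association: {bias:+.3f}")
        </preformat>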
      </sec>
      <sec id="sec-1-9">
        <title>This logarithmic ratio is the association score, . To</title>
        <p>measure gender bias, we compute the gender bias by
comparing these scores for diferent targets, such as "he" and
"she", averaging for all templates and taking the
diference between female and male association score averages.</p>
        <p>This method, as evidenced in the original paper,
outperforms traditional cosine-based measures like WEAT [8]
in detecting gender biases.
3.3.3. Stereotyping Bias
Stereotyping bias () quantifies the extent to which a
given LLM is far away from gender neutrality in a given
context given by a specific language (ℒ) and a LLM model comparing the percentages of females and males in
sec(ℳ). To do so, it quantifies the disparities in average tor  using data from the Global Gender Gap Index across
association score across genders for each of the economic diferent countries.
sectors: primary, secondary and tertiary. A balanced model (ℛ = 0) perfectly reproduces
so</p>
        <p>To calculate the overall  across sectors we first cal- cietal gender distributions. Negative values indicate a
culate the stereotyping bias for a specific sector  as model preference for males compared to the real
prevathe inner disparity ℐ(ℒ, ℳ) by first computing the lence, while positive values indicate a preference for
model’s average diference between association scores  females. This metric helps language modelers ensure
between females and males. This diference is calculated accurate societal representations in their models.
for each i-th sentence generated for females () and
males () between the total number  of male and 3.4. Labour Market Data
female oriented sentences generated for ℒ and .</p>
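        <p>A minimal numerical sketch of equations (1) and (2) follows, assuming the per-sentence association scores have already been computed as in section 3.3.2; the arrays are made-up placeholders.</p>
        <preformat>
# Inner disparity per sector (eq. 1) and overall stereotyping bias (eq. 2).
import numpy as np

# scores[sector][gender][i]: association score of the i-th template sentence.
scores = {
    "primary": {"F": np.array([0.10, 0.05]), "M": np.array([0.20, 0.15])},
    "secondary": {"F": np.array([-0.05, 0.00]), "M": np.array([0.25, 0.30])},
    "tertiary": {"F": np.array([0.30, 0.35]), "M": np.array([0.10, 0.05])},
}

def inner_disparity(sector):
    """Eq. (1): mean female-minus-male association score for one sector."""
    return float(np.mean(scores[sector]["F"] - scores[sector]["M"]))

def stereotyping_bias():
    """Eq. (2): average inner disparity over the three economic sectors."""
    return float(np.mean([inner_disparity(s) for s in scores]))

print({s: round(inner_disparity(s), 3) for s in scores})
print("S =", round(stereotyping_bias(), 3))
        </preformat>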
      </sec>
      <sec id="sec-1-10">
        <title>We evaluate gender bias by investigating the relation be</title>
        <p>tween gender-denoting target words and sectors names
ℐ(ℒ, ℳ) = 1 ∑=︁0 (︀ () − ()︀) (1) ibnefEonreg,liosuhr, Sppraonpiosshe,dPomliesthhoadndavGoeirdmsaunsi.nAgspmroefnetsisoinoends
names, gendered or not, to keep the attribute unchanged</p>
        <p>The overall stereotyping bias  across sectors for ℳ among languages, thus making the results comparable
and ℒ is computed as the average inner disparity across among diferent languages under study. The observed
all three economic sectors: bias is compared to real-world through the gender
statis(ℒ, ℳ) = 13 ∑=3︁1 ℐ(ℳ, ℒ) (2) (ftbseiacemsseeaatdlaceborsolneascss2rp)ao,ecswctsiihficveiiccctooyhnusdoneemtcsrtciiocerrissbseaepcnttrdhooervtship.derCeierdovrmabelylepanattcrhivieeseooWlnfasmonragalrudleeasBdgaoaennnsdke
A model ℳ trained for language ℒ without stereotyp- (e.g., Spanish models are compared with statistics from
ing bias (ℒ, ℳ) = 0, would produce equal average Spain) so that we are able to specifically compare each
association scores for male and female targets in eco- model’s outputs with its social reality, regarding the level
nomic sectors. Negative values indicate bias favoring of gender equality represented at the workforce statistics
males, while positive values indicate bias favoring fe- and per each economic sector. We also report the Global
males. Stereotyping bias is specific to each model and Gender Gap Index2 by the World Economic Forum (WEF).
the language in which it was trained. As seen in table 2, the bigger gender gap among sectors
is found for the secondary sector, a common trend in the
3.3.4. Representation Bias four countries, where is mostly occupied by male
workers. The contrary case is found in the tertiary sector, in
where the relative diference favours the female gender.</p>
        <p>For the primary sector, a similar statistic is found except
for Poland where gender is almost balanced.</p>
        <p>With a the broader view, representation bias in a given
domain generally refers to the underrepresentation or
overrepresentation of certain groups (such as genders
or ethnicities) as compared to their prevalence in the
overall target population. However, in the context of our 4. Results
research, we adopt a definition of representation bias (ℛ),
particularly tailored to the context of our analysis. Here, The results obtained for LLMs trained in English, German,
ℛ is understood as the divergence of a model’s internal Spanish and Polish languages reveal intricate patterns
representation of genders from the actual societal gender of the two notions (, )of gender bias that emerge
distributions in the workforce. and fluctuate across languages, economic sectors, and</p>
        <p>We define the overall representation bias across eco- model types. The analysis based on results depicted in
nomic sectors for a given LLM ℳ trained for language figure 2, representing gender bias, separated by primary,
ℒ as: secondary, and tertiary sectors, exhibits diverse trends
across languages and sectors. As can be observed, the
uncased BERT models generally exhibit less stereotyping
3
ℛ(ℒ, ℳ) = 31 ∑︁ (︀
=1
ℐ(ℒ, ℳ) −  (ℒ))︀</p>
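        <p>Equation (3) can be sketched in the same style, assuming the per-sector inner disparities from section 3.3.3 and observed per-sector gender ratios; all numeric values below are placeholders rather than the statistics of table 2.</p>
        <preformat>
# Eq. (3): average deviation of the model's inner disparity from the observed
# gender ratio per sector. Values are illustrative placeholders.
inner_disparity = {"primary": -0.10, "secondary": -0.20, "tertiary": 0.15}
gender_ratio = {"primary": -0.15, "secondary": -0.30, "tertiary": 0.20}

representation_bias = sum(
    inner_disparity[s] - gender_ratio[s] for s in inner_disparity
) / len(inner_disparity)
print(f"R = {representation_bias:+.3f}")
        </preformat>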
      </sec>
      <sec id="sec-1-11">
        <title>3.4. Labour Market Data</title>
        <p>We evaluate gender bias by investigating the relation between gender-denoting target words and sector names in English, Spanish, Polish and German. As mentioned before, our proposed method avoids using profession names, gendered or not, to keep the attribute unchanged among languages, thus making the results comparable among the different languages under study. The observed bias is compared to the real world through gender statistics (see table 2) that describe the percentages of females and males active in each economic sector, provided by the World Bank (https://genderdata.worldbank.org/indicators/sl-empl-zs/). Comparisons are based on the country statistics corresponding to each language (e.g., Spanish models are compared with statistics from Spain), so that we are able to specifically compare each model's outputs with its social reality, regarding the level of gender equality represented in the workforce statistics and per economic sector. We also report the Global Gender Gap (GGG) index by the World Economic Forum (WEF), which "assesses countries on how well they are dividing their resources and opportunities among their male and female populations, regardless of the overall levels of these resources and opportunities" (https://www3.weforum.org/docs/WEF_GGGR_2022.pdf).</p>
        <p>As seen in table 2, the biggest gender gap among sectors is found for the secondary sector, a common trend in the four countries, where it is mostly occupied by male workers. The contrary case is found in the tertiary sector, where the relative difference favours the female gender. For the primary sector, a similar statistic is found, except for Poland, where gender is almost balanced.</p>
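        <p>As a small worked example of the normalisation used in table 2, the relative difference can be computed from the female and male workforce shares; dividing by 100 is an assumption that is consistent with the values reported in the table.</p>
        <preformat>
# Rel. dif. = (female share - male share) / 100, normalised to [-1, 1].
def relative_difference(female_pct, male_pct):
    return round((female_pct - male_pct) / 100, 2)

assert relative_difference(37.64, 62.36) == -0.25  # German, primary sector
assert relative_difference(61.80, 38.21) == 0.24   # German, tertiary sector
        </preformat>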
      </sec>
      <sec id="sec-1-12">
        <title>4. Results</title>
        <p>The results obtained for LLMs trained in English, German, Spanish and Polish reveal intricate patterns of the two notions of gender bias (𝒮, ℛ), which emerge and fluctuate across languages, economic sectors and model types. The analysis based on the results depicted in figure 2, representing gender bias separated by primary, secondary and tertiary sectors, exhibits diverse trends across languages and sectors. As can be observed, the uncased BERT models generally exhibit less stereotyping bias than their cased counterparts in all languages.</p>
        <p>Within the range of languages analysed, Polish demonstrates the lowest stereotyping bias, followed by Spanish, with German and English showing similar average values, but with the English models demonstrating a broader range of stereotyping bias.</p>
        <p>When analyzing the results both aggregated and across sectors, an interesting pattern emerges. In the 2D scatterplot, all models for a given language align along a specific trajectory, revealing a clear trade-off between how accurately a language model represents the social reality (representation bias) and the stereotyping bias that the language model exhibits.</p>
        <p>Specifically, we are interested in the specific domains for 𝒮 in which the grammatical gendering of the LLM's language might be a proxy for predicting gender inequality in the country of the spoken language. As reported by Prewitt-Freilino et al. [30], countries predominated by a natural gender language, like English, evidence greater gender equality than countries with other grammatical gender systems. Albeit, as seen in table 2, that is not the case for the English GGG index (only accounting for England), which is lower than the GGG indexes reported for Germany and Spain, both gendered languages.</p>
        <p>Table 2. The Female/Male columns refer to the percentage of the workforce in the sector. The Rel. dif. column stands for the relative difference between both genders, normalised to the range [-1, 1], where negative values indicate a real-world bias towards the male gender. The Global Gender Gap (GGG) index is reported by the World Economic Forum 2022, and the employment-by-sector statistics come from the World Bank (2019). The values of the GGG index range from 0 to 1, with higher values indicating greater gender equality.</p>
        <preformat>
Language  GGG    Sector     Female  Male   Rel. dif.
German    0.801  Primary     37.64  62.36   -0.25
                 Secondary   28.90  71.10   -0.42
                 Tertiary    61.80  38.21    0.24
Spanish   0.788  Primary     28.13  71.87   -0.44
                 Secondary   26.23  73.77   -0.48
                 Tertiary    60.40  39.60    0.21
Polish    0.709  Primary     49.00  51.05   -0.02
                 Secondary   32.10  67.94   -0.36
                 Tertiary    65.65  34.35    0.31
English   0.780  Primary     31.19  68.81   -0.38
                 Secondary   24.22  75.78   -0.52
                 Tertiary    59.34  40.66    0.19
        </preformat>
        <p>For the case of the English LLMs (see figure 2a), the BERT base uncased and RoBERTa base cased models report the lowest |ℛ| and |𝒮|, thus being both a good proxy for real-world data and models with low stereotyping. Note that values of 𝒮 can be understood as the LLM's perception of the world once the LLM is trained in a specific language, whereas ℛ describes its capacity for predicting the real-world data or gender gap. Similar results are observed for Spanish, where the Base models outperform the Large models in both bias metrics, ℛ and 𝒮. Note that the main difference between Large and Base models resides in the number of parameters employed in the architecture, the number of tokens in the training corpus being the same for both LLM systems. Table 4 in the annex summarizes the training data used and the number of parameters for each LLM.</p>
        <p>Regarding the Spanish model BERTIN, it corresponds to a RoBERTa base model trained with 100B tokens more than the RoBERTa-BSC model. In the figures, both models are denoted with an orange circle. BERTIN portrays the lowest error in terms of bias, skewed toward negative values of 𝒮 for the sector-aggregation graph 2a. Note that the graph for all sectors is computed as a weighted average of the three sector results, using WEF data proportions, see table 2.</p>
        <p>Overall, the gendered languages exhibit a smaller variance around (𝒮, ℛ) = (0, 0) than the English LLMs. Nonetheless, by removing the Large models from the analysis, we can see that the English LLMs, trained on a natural gender language, are on average closer to the non-bias point (0, 0) than the rest of the languages, except for Polish. This result is diluted depending on the specific economic sector we look at, where the English 𝒮 in the primary and secondary sectors is compensated by the tertiary sector, the latter biased towards the female direction.</p>
        <p>We also observe interesting results in the case of the Polish language, which belongs to the family of Slavic languages. Polish, as a West Slavic language, is gendered; however, there is a very limited number of studies investigating bias even in the broader Slavic language family [48]. Hence, the lack of such analysis is addressed in our work. The results for Polish LLMs, as shown in figure 2, in general demonstrate lower representation bias scores than the other languages.</p>
        <p>We hypothesize that one of the reasons might be attributed to the gender-sensitive grammar structures in Polish. Unlike many other languages, Polish modifies not only pronouns but also verb forms to correspond with gender. For example, in certain sentence templates, the conditional and past tenses of verbs, as well as relative pronouns, alter according to the gender of the person being referred to. This linguistic feature may potentially impact how LLMs learn and represent gender-related concepts in Polish, thus influencing the extent of bias observed in these models. Given the gendered nature of Polish and how the observed patterns reflect it, we conclude that the representativeness (as indicated by the representation bias) of Polish LLMs is slightly better than that of their English, Spanish and German counterparts, due to the necessary agreement of verbs and pronouns with the gendered subject.</p>
      </sec>
      <sec id="sec-1-13">
        <title>5. Conclusion</title>
        <p>In this work, we have used the idea of association scores [47] to quantify gender biases in LLMs in the labour market at the level of economic sectors. We distinguish between two different notions of bias: (i) representation bias and (ii) stereotyping bias. The first quantifies the extent to which the model is able to learn patterns that can be observed in society, whereas the latter studies how far from gender-neutral its internal representation is.</p>
        <p>By conducting this cross-linguistic analysis, we contribute to the understanding of biases in LLMs, highlighting the nuanced interplay between language structure, training data, and the biases exhibited by these models. Our study underscores the importance of comprehending how biases are captured or amplified within LLMs, paving the way for future efforts to mitigate and address these biases.</p>
        <p>We use these definitions of bias to characterise multiple state-of-the-art pre-trained LLMs, comparing results among different languages, from languages with no grammatical gender, or natural gender, to gendered languages.</p>
        <p>Among other results, the conducted analysis reveals interesting and consistent trends where biases vary across languages and economic sectors, with Polish being the language whose models systematically showcase the least bias, and the tertiary sector the unique case for which the models exhibit a biased preference towards the female gender.</p>
        <p>Additionally, we observed a quasi-linear relation between both types of bias, with most of the models exhibiting representation bias, stereotyping bias or a combination of both, and with Large models reporting higher biases.</p>
        <p>We expect these results to contribute to building a better understanding of the presence of systematic gender biases in LLMs.</p>
        <sec id="sec-1-10-1">
          <title>5.1. Limitations and Future Work</title>
          <p>This study uses a multi-language dataset, synthetically created with equivalent examples across the studied languages. However, as with other datasets used in related work, it is still limited in the sense that few templates are used to generate it. Additionally, it is important to note that cultural biases might affect the understanding of the templates translated into each language, leading to differences that could be reflected in the obtained results. This raises the possibility that unintended biases may be present in the results derived from the data.</p>
          <p>Furthermore, the dataset is composed of a tailored collection of terms that are descriptors of economic sectors, for which the results are then aggregated. Although aggregation is a powerful tool to observe patterns, it also has the drawback of restricting the visibility of interesting patterns that occur at a lower granularity level, for instance, when using professions related to each economic sector. This means that using solely the results reported in this work might not be sufficient to understand all the possible types of biases in LLMs in the domain of the labour market; they correspond to another set of information to be accounted for.</p>
          <p>Moreover, we are comparing individual censuses with results for languages that are spoken in multiple countries. As future work, it could be interesting to compare results across multiple countries that use those languages. Additionally, more research could be done on the effects of other demographic factors or covariates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Ethics Statement</title>
      <sec id="sec-2-1">
        <title>This research provides a deepened insight into the in</title>
        <p>lfuence of grammatical gender on gender biases within
LLMs across multiple languages. The broader societal
impact of understanding and quantifying these biases is
significant for several reasons:
1. Enhancing Awareness: By bringing attention
to the variances in gender biases across languages,
we can enhance the broader community’s
awareness of potential pitfalls when deploying LLMs
in diverse linguistic settings. This awareness is
crucial for developers, policymakers, and users
to make informed decisions about the application
and potential limitations of LLMs in diferent
linguistic contexts.
2. Informed Deployment: Knowledge about
the biases inherent to these models can guide
decision-making processes for institutions and
industries that utilize LLMs. By being aware of
the biases, stakeholders can make better decisions
regarding where and how to deploy these
models, especially in applications that may have
realworld implications for individuals or groups.
3. Influence on Future Research : Our study can
pave the way for future research into the
mitigation of gender biases in LLMs. By understanding
the nuanced interplay between language
structure, training data, and model bias, the
community can work towards developing techniques and
best practices to address and reduce such biases.</p>
        <sec id="sec-2-1-1">
          <title>6.1. Ethical Considerations</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>1. Dataset Limitations: While our study utilizes</title>
        <p>a multi-language dataset, it is synthetically
created with equivalent examples across languages.
As with any synthetic dataset, there’s a risk of
unintended biases, potentially afecting the
results. We recognize and caution that translating
templates across languages can introduce cultural
biases, which might inadvertently influence the
outcomes.
2. Scope of Findings: It is important to understand
that our findings, while indicative of trends, may
not extrapolate seamlessly to other new LLMs or
to every application scenario. For example, we
are only reporting results for a narrow set of
languages and modeled by non-causal LLMs, that
is, no autoregressive models, as GPT-like, have
been evaluated. Biases are intricately linked to
specific training data, model architecture, and
application context. Our study should be viewed as
a piece in the larger puzzle of understanding and
addressing biases in LLMs, rather than a
conclusive assessment of all possible instances of gender
bias in every LLM.
3. Aggregation of Results: Our use of aggregation,
while powerful in discerning patterns, might also
mask more granular biases present in LLMs,
particularly within specific economic sectors or
professions. Users and developers should be aware</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>Funded by the European Union’s Horizon 2020.</title>
        <p>Views and opinions expressed are however those
of the author(s) only and do not necessarily
relfect those of the European Union or European
Commission-EU. Neither the European Union nor the
granting authority can be held responsible for them.</p>
        <p>of this and consider more detailed analyses when
appropriate.
4. Comparative Analyses: Our study compares
census data with results from languages spoken
across diferent countries. The cultural, economic,
and social dynamics of each country can vary
widely, even if the same language is spoken.
Future work may benefit from a more localized
approach, considering the multifaceted nature of
biases in each country.
5. Potential Misuse: Recognizing that biased
systems can perpetuate stereotypes or reinforce
societal prejudices, it is ethically imperative for
developers and users to ensure that LLMs are not
misused, especially in critical domains where
biases can lead to tangible harms or injustices.</p>
        <p>3581017. 1038/d41586-018-05707-8.
[12] C. O’Neil, Weapons of Math Destruction: How Big [22] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief,
Data Increases Inequality and Threatens Democ- J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y.
racy, Crown, 2016. URL: https://books.google.es/ Wang, Mitigating gender bias in natural language
books?id=NgEwCwAAQBAJ. processing: Literature review, in: Proceedings of
[13] M. R. Costa-jussà, An analysis of gender bias stud- the 57th Annual Meeting of the Association for
ies in natural language processing, Nature Ma- Computational Linguistics, Association for
Comchine Intelligence 1 (2019) 495–496. URL: https:// putational Linguistics, Florence, Italy, 2019, pp.
doi.org/10.1038/s42256-019-0105-5. doi:10.1038/ 1630–1640. URL: https://aclanthology.org/P19-1159.
s42256-019-0105-5. doi:10.18653/v1/P19-1159.
[14] D. Nozza, F. Bianchi, D. Hovy, HONEST: Mea- [23] A. Konnikov, N. Denier, Y. Hu, K. D. Hughes, J.
Alsuring hurtful sentence completion in language shehabi Al-Ani, L. Ding, I. Rets, M. Tarafdar, et al.,
models, in: Proceedings of the 2021 Confer- Bias word inventory for work and employment
dience of the North American Chapter of the As- versity,(in) equality and inclusivity (version 1.0),
sociation for Computational Linguistics: Human SocArXiv (2022).</p>
        <p>Language Technologies, Association for Compu- [24] M. E. Heilman, Gender stereotypes and workplace
tational Linguistics, Online, 2021, pp. 2398–2406. bias, Research in Organizational Behavior 32 (2012)
URL: https://aclanthology.org/2021.naacl-main.191. 113–135. URL: https://www.sciencedirect.com/
doi:10.18653/v1/2021.naacl-main.191. science/article/pii/S0191308512000093. doi:https:
[15] M. Bartl, M. Nissim, A. Gatt, Unmasking Contex- //doi.org/10.1016/j.riob.2012.11.003.
tual Stereotypes: Measuring and Mitigating BERT’s [25] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M.
TanGender Bias, in: M. R. Costa-jussà, C. Hardmeier, jim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K.
W. Radford, K. Webster (Eds.), Proceedings of the Ahmed, Bias and fairness in large language models:
Second Workshop on Gender Bias in Natural Lan- A survey, 2024. arXiv:2309.00770.
guage Processing, Association for Computational [26] K. M. White Smolinski, Gender Bias in Natural
GenLinguistics, Barcelona, Spain (Online), 2020, pp. 1– der Language and Grammatical Gender Language
16. URL: https://aclanthology.org/2020.gebnlp-1.1. within Children’s Literature, PhD dissertation,
Lib[16] P. Kahn, Rising tide: Gender equality and erty University, 2024. URL: https://digitalcommons.
cultural change around the world, Perspec- liberty.edu/doctoral/5294.
tives on Politics 2 (2004) 407–409. doi:10.1017/ [27] D. Cirillo, H. Gonen, E. Santus, A. Valencia, M. R.</p>
        <p>S1537592704770978. Costa-jussà, M. Villegas, Sex and gender bias
[17] C. A. Moss-Racusin, J. F. Dovidio, V. L. Brescoll, in natural language processing, in: D. Cirillo,
M. J. Graham, J. Handelsman, Science fac- S. Catuara-Solarz, E. Guney (Eds.), Sex and
ulty’s subtle gender biases favor male students, Gender Bias in Technology and Artificial
IntelliProceedings of the National Academy of Sci- gence, Academic Press, 2022, pp. 113–132. URL:
ences 109 (2012) 16474–16479. URL: https://www. https://www.sciencedirect.com/science/article/pii/
pnas.org/doi/abs/10.1073/pnas.1211286109. doi:10. B9780128213926000091. doi:https://doi.org/
1073/pnas.1211286109. 10.1016/B978-0-12-821392-6.00009-1.
[18] M. McKinnon, C. O’Connell, Perceptions of stereo- [28] P. Czarnowska, Y. Vyas, K. Shah, Quantifying Social
types applied to women who publicly commu- Biases in NLP: A Generalization and Empirical
Comnicate their stem work, Humanities and Social parison of Extrinsic Fairness Metrics, Transactions
Sciences Communications (2020). doi:10.1057/ of the Association for Computational Linguistics
s41599-020-00654-0. 9 (2021) 1249–1267. URL: https://doi.org/10.1162/
[19] S. Marjanovic, K. Stańczak, I. Augenstein, Quan- tacl_a_00425. doi:10.1162/tacl_a_00425.
tifying gender biases towards politicians on red- [29] P. Zhou, W. Shi, J. Zhao, K.-H. Huang, M. Chen,
dit, PLOS ONE 17 (2022) 1–36. URL: https://doi. R. Cotterell, K.-W. Chang, Examining Gender
org/10.1371/journal.pone.0274317. doi:10.1371/ Bias in Languages with Grammatical Gender, in:
journal.pone.0274317. K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
Proceed[20] A. H. Bailey, A. Williams, A. Cimpian, Based on bil- ings of the 2019 Conference on Empirical
Methlions of words on the internet, people=men, Science ods in Natural Language Processing and the 9th
Advances 8 (2022) 2463. URL: https://www.science. International Joint Conference on Natural
Lanorg/doi/abs/10.1126/sciadv.abm2463. doi:10.1126/ guage Processing (EMNLP-IJCNLP), Association
sciadv.abm2463. for Computational Linguistics, Hong Kong, China,
[21] J. Zou, L. Schiebinger, Ai can be sexist and racist 2019, pp. 5276–5284. URL: https://aclanthology.org/
— it’s time to make it fair, Nature (2018). doi:10. D19-1531. doi:10.18653/v1/D19-1531.
Slavic Natural Language Processing 2023
(SlavicNLP 2023), Association for Computational
Linguistics, Dubrovnik, Croatia, 2023, pp. 146–154. URL:
https://aclanthology.org/2023.bsnlp-1.17.
[49] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R.
Urtasun, A. Torralba, S. Fidler, Aligning books and
movies: Towards story-like visual explanations by
watching movies and reading books, 2015 IEEE
International Conference on Computer Vision (ICCV)
(2015) 19–27.
[50] B. Staatsbibliothek, German bert, 2023. URL: https:</p>
        <p>//github.com/dbmdz/berts.
[51] B. Minixhofer, F. Paischer, N. Rekabsaz,
WECH</p>
        <p>SEL: Efective initialization of subword embeddings
for cross-lingual transfer of monolingual language
models, in: Proceedings of the 2022 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Association for Computational
Linguistics, Seattle, United States, 2022, pp. 3992–4006.</p>
        <p>URL: https://aclanthology.org/2022.naacl-main.293.</p>
        <p>doi:10.18653/v1/2022.naacl-main.293.
[52] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,</p>
        <p>H. Kang, J. Pérez, Spanish pre-trained bert model
and evaluation data, in: PML4DC at ICLR 2020,
2020.
[53] J. de la Rosa, E. G. Ponferrada, M. Romero,</p>
        <p>P. Villegas, P. G. de Prado Salas, M. Grandury,
Bertin: Eficient pre-training of a spanish
language model using perplexity sampling,
Procesamiento del Lenguaje Natural 68 (2022) 13–23.</p>
        <p>URL: http://journal.sepln.org/sepln/ojs/ojs/index.</p>
        <p>php/pln/article/view/6403.
[54] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao,</p>
        <p>J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R.</p>
        <p>Penagos, A. G. Agirre, M. Villegas, Maria: Spanish
language models, Procesamiento del Lenguaje
Natural 68 (2022). URL: https://upcommons.upc.edu/
handle/2117/367156#.YyMTB4X9A-0.mendeley.</p>
        <p>doi:10.26342/2022-68-3.
[55] D. Kłeczek, Polbert: Attacking polish nlp tasks
with transformers, in: M. Ogrodniczuk, Łukasz
Kobyliński (Eds.), Proceedings of the PolEval 2020
Workshop, Institute of Computer Science, Polish</p>
        <p>Academy of Sciences, 2020.
[56] S. Dadas, M. Perełkiewicz, R. Poświata, Pre-training
polish transformer-based language models at scale,
in: L. Rutkowski, R. Scherer, M. Korytkowski,
W. Pedrycz, R. Tadeusiewicz, J. M. Zurada (Eds.),
Artificial Intelligence and Soft Computing, Springer</p>
        <p>International Publishing, Cham, 2020, pp. 301–314.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Employment by Sector with Respect to Gender</title>
      <p>Table 4 lists the evaluated pre-trained models, BERT [40], RoBERTa [41], Spanish BERT [52] and Spanish RoBERTa [53, 54], together with their training corpora, which include BooksCorpus (0.8B words) [49], English Wikipedia (2.5B words; excluding lists, tables and headers), CC-News (September 2016–February 2019), OpenWebText and Stories (161GB in total), as well as Wikipedia, EU Bookshop, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl (16GB, 2.35B tokens).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21, Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . URL: https://doi.org/10.1145/3442188.3445922. doi:10.1145/3442188.3445922.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of llms in practice: A survey on chatgpt and beyond</article-title>
          ,
          <source>ArXiv abs/2304</source>
          .13712 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          , D. Hersh…
        </mixed-citation>
      </ref>
      <ref id="ref12"><mixed-citation>[12] C. O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Crown, 2016. URL: https://books.google.es/books?id=NgEwCwAAQBAJ.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine Intelligence 1 (2019) 495–496. URL: https://doi.org/10.1038/s42256-019-0105-5. doi:10.1038/s42256-019-0105-5.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] D. Nozza, F. Bianchi, D. Hovy, HONEST: Measuring hurtful sentence completion in language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2398–2406. URL: https://aclanthology.org/2021.naacl-main.191. doi:10.18653/v1/2021.naacl-main.191.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] M. Bartl, M. Nissim, A. Gatt, Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias, in: M. R. Costa-jussà, C. Hardmeier, W. Radford, K. Webster (Eds.), Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 1–16. URL: https://aclanthology.org/2020.gebnlp-1.1.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] P. Kahn, Rising tide: Gender equality and cultural change around the world, Perspectives on Politics 2 (2004) 407–409. doi:10.1017/S1537592704770978.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] C. A. Moss-Racusin, J. F. Dovidio, V. L. Brescoll, M. J. Graham, J. Handelsman, Science faculty's subtle gender biases favor male students, Proceedings of the National Academy of Sciences 109 (2012) 16474–16479. URL: https://www.pnas.org/doi/abs/10.1073/pnas.1211286109. doi:10.1073/pnas.1211286109.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] M. McKinnon, C. O'Connell, Perceptions of stereotypes applied to women who publicly communicate their stem work, Humanities and Social Sciences Communications (2020). doi:10.1057/s41599-020-00654-0.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] S. Marjanovic, K. Stańczak, I. Augenstein, Quantifying gender biases towards politicians on reddit, PLOS ONE 17 (2022) 1–36. URL: https://doi.org/10.1371/journal.pone.0274317. doi:10.1371/journal.pone.0274317.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] A. H. Bailey, A. Williams, A. Cimpian, Based on billions of words on the internet, people=men, Science Advances 8 (2022) 2463. URL: https://www.science.org/doi/abs/10.1126/sciadv.abm2463. doi:10.1126/sciadv.abm2463.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] J. Zou, L. Schiebinger, Ai can be sexist and racist — it's time to make it fair, Nature (2018). doi:10.1038/d41586-018-05707-8.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y. Wang, Mitigating gender bias in natural language processing: Literature review, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1630–1640. URL: https://aclanthology.org/P19-1159. doi:10.18653/v1/P19-1159.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] A. Konnikov, N. Denier, Y. Hu, K. D. Hughes, J. Alshehabi Al-Ani, L. Ding, I. Rets, M. Tarafdar, et al., Bias word inventory for work and employment diversity, (in)equality and inclusivity (version 1.0), SocArXiv (2022).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. E. Heilman, Gender stereotypes and workplace bias, Research in Organizational Behavior 32 (2012) 113–135. URL: https://www.sciencedirect.com/science/article/pii/S0191308512000093. doi:10.1016/j.riob.2012.11.003.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and fairness in large language models: A survey, 2024. arXiv:2309.00770.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] K. M. White Smolinski, Gender Bias in Natural Gender Language and Grammatical Gender Language within Children's Literature, PhD dissertation, Liberty University, 2024. URL: https://digitalcommons.liberty.edu/doctoral/5294.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] D. Cirillo, H. Gonen, E. Santus, A. Valencia, M. R. Costa-jussà, M. Villegas, Sex and gender bias in natural language processing, in: D. Cirillo, S. Catuara-Solarz, E. Guney (Eds.), Sex and Gender Bias in Technology and Artificial Intelligence, Academic Press, 2022, pp. 113–132. URL: https://www.sciencedirect.com/science/article/pii/B9780128213926000091. doi:10.1016/B978-0-12-821392-6.00009-1.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] P. Czarnowska, Y. Vyas, K. Shah, Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics, Transactions of the Association for Computational Linguistics 9 (2021) 1249–1267. URL: https://doi.org/10.1162/tacl_a_00425. doi:10.1162/tacl_a_00425.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] P. Zhou, W. Shi, J. Zhao, K.-H. Huang, M. Chen, R. Cotterell, K.-W. Chang, Examining Gender Bias in Languages with Grammatical Gender, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5276–5284. URL: https://aclanthology.org/D19-1531. doi:10.18653/v1/D19-1531.</mixed-citation></ref>
      <ref id="ref48"><mixed-citation>[48] … Slavic Natural Language Processing 2023 (SlavicNLP 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 146–154. URL: https://aclanthology.org/2023.bsnlp-1.17.</mixed-citation></ref>
      <ref id="ref49"><mixed-citation>[49] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015 IEEE International Conference on Computer Vision (ICCV) (2015) 19–27.</mixed-citation></ref>
      <ref id="ref50"><mixed-citation>[50] B. Staatsbibliothek, German bert, 2023. URL: https://github.com/dbmdz/berts.</mixed-citation></ref>
      <ref id="ref51"><mixed-citation>[51] B. Minixhofer, F. Paischer, N. Rekabsaz, WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3992–4006. URL: https://aclanthology.org/2022.naacl-main.293. doi:10.18653/v1/2022.naacl-main.293.</mixed-citation></ref>
      <ref id="ref52"><mixed-citation>[52] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and evaluation data, in: PML4DC at ICLR 2020, 2020.</mixed-citation></ref>
      <ref id="ref53"><mixed-citation>[53] J. de la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. G. de Prado Salas, M. Grandury, Bertin: Efficient pre-training of a spanish language model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.</mixed-citation></ref>
      <ref id="ref54"><mixed-citation>[54] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, Maria: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.</mixed-citation></ref>
      <ref id="ref55"><mixed-citation>[55] D. Kłeczek, Polbert: Attacking polish nlp tasks with transformers, in: M. Ogrodniczuk, Łukasz Kobyliński (Eds.), Proceedings of the PolEval 2020 Workshop, Institute of Computer Science, Polish Academy of Sciences, 2020.</mixed-citation></ref>
      <ref id="ref56"><mixed-citation>[56] S. Dadas, M. Perełkiewicz, R. Poświata, Pre-training polish transformer-based language models at scale, in: L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, J. M. Zurada (Eds.), Artificial Intelligence and Soft Computing, Springer International Publishing, Cham, 2020, pp. 301–314.</mixed-citation></ref>
    </ref-list>
  </back>
</article>