<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>On the Relationship of Social Gender Equality and Grammatical Gender in Pre-trained Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Magdalena Biesialska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Solans</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Luque</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Segura</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>D Research</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TALP Research Center, Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Telefónica I</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Large Language Models pre-trained on vast amounts of text have demonstrated remarkable capabilities in modeling and generating human language, finding applications across a wide range of Natural Language Processing tasks. However, recent studies have unveiled the presence of biases in these models, inherited from social biases reflected in their training data. In this research article, we delve into the examination of grammatical gender's influence on four distinct languages, exploring how the gender prejudices exhibited by the LLMs relate to their capacity to characterise social realities. We show that the prevalence of gender biases differs not just in relation to the architecture and training data of the LLMs, as previously documented, but also varies with respect to the language and the level of grammatical gender marking present in the language under study. Different LLM systems and languages are examined, ranging from a major grammatical gender language, such as Polish, to English, which lacks most gender inflection, and through gendered languages such as German and Spanish.</p>
      </abstract>
      <kwd-group>
        <kwd>gender bias</kwd>
        <kwd>large language models</kwd>
        <kwd>bias quantification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) are neural network systems that have been trained on massive amounts of text data by using deep learning techniques [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. These models seem capable of generating and comprehending human-like text, e.g., having reported remarkable performance across the majority of Natural Language Processing (NLP) benchmarks and tasks [<xref ref-type="bibr" rid="ref3">3, 4</xref>]. Pre-trained LLMs are often adapted or fine-tuned to a specific NLP task (often referred to as downstream tasks), aiming at reducing the computationally expensive and time-consuming training stage. Downstream tasks can include a range of NLP tasks such as machine translation, question answering, semantic parsing, natural language inference or paraphrasing, among others [5], and often rely on word embeddings [6] extracted from pre-trained LLMs, e.g., in sentiment and gender bias towards politicians [7]. However, these models are not immune to the biases that exist in society, often reflected in their training corpora, such as gender bias or other social clichés [8]. As LLMs are trained on data that is not well balanced in terms of gender or other attributes, they reflect societal stereotypes in many shapes, forms and times [9, 10]. The biases present in the massive amounts of linguistic data used to train LLMs are often incorporated by them, as in the case of Virtual Assistants [11]. This could have a long-lasting effect on societal conditions, leading to discriminatory responses and decisions about race, age, religion, geographical origins, or the specific case of gender [7, 12, 13, 14], thus perpetuating mechanisms that create and maintain male dominance.</p>
      <p>As a result of this, LLMs could fail to correlate female terms, e.g., with engineering professions, being prone to not promote female candidates for engineering positions even when they are equally qualified [15, 9] as their male counterparts. A biased LLM may perpetuate harmful stereotypes and reinforce both bad preconceptions and prejudices, which would limit chances and increase inequalities, thereby limiting opportunities for some groups [16]. Furthermore, online data is gathered from the specific group of the population that uses online resources, which has particular characteristics, resulting in biased training samples that fail to effectively reflect the needs of marginalised social groups [<xref ref-type="bibr" rid="ref2">17, 18, 2</xref>]. Detecting and characterizing biases becomes a crucial task, especially when such models are used in high-risk domains (see the European Commission's regulatory framework proposal on artificial intelligence, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai), where NLP applications can easily limit human potential, e.g., by inducing biases against women in authority [19], hamper economic growth and, definitively, reinforce social inequity [20]. In the labour market domain, efforts to address gender biases include promoting diversity and inclusion in hiring and promotion processes, raising awareness of unconscious bias, and providing support to women and other underrepresented groups [21].</p>
      <sec id="sec-1-1">
        <title>1European Commission, Regulatory framework</title>
        <p>proposal on artificial intelligence.
https://digitalstrategy.ec.europa.eu/en/policies/regulatory-framework-ai
But the problem does not only apply to the data. NLP perception proposing that language influences the way
systems are prone to amplify the gender bias exhibited in we perceive and think about the world, a concept known
text corpora. Hence, the problem becomes multi-faceted as the Sapir-Whorf hypothesis or linguistic relativity
hyand may be present at various stages of the development pothesis. Whorf argues that diferent languages may lead
of NLP systems, including training data, resources, pre- to diferent ways of thinking and perceiving the world,
trained models, and algorithms [22]. Further propagation suggesting that language not only reflects our thoughts
of gender bias from NLP models to downstream applica- but also shapes and constrains them, having the key
artions is likely to reinforce harmful stereotypes and may gument that language afects our perception of time.
result in, for example, discrimination of female candi- In contrary, some studies argued that the influence of
dates on the labour market. language on thought is limited and that there are
uni</p>
        <p>Presence of LLMs’ gender biases in the labour mar- versal cognitive processes that are independent of
lanket domain has been previously investigated at the level guage [36]. This pattern of perception, which is predicted
of professions, assessing the correlation between labour by the asymmetry between space and time in linguistic
census and LLMs’ association scores for a subset of pro- metaphors, was reported also in [37] by tasks that do not
fessions across genders [8, 23]. However, we argue that involve any linguistic stimuli or responses, arguing that
this form of bias evaluation does not consider relation- our mental representations and conceptualization of time
ships between professions, such as the economic sectors are built upon our experiences with space and motion
in which their activity is developed. and not necessarily involving the way we talk about time,</p>
        <p>For this reason, we provide an alternative perspective e.g., by using the spatial language from an idiom.
that relies on the evaluation of biases in LLMs at the level Nonetheless, recent research has provided evidence
of economic sectors. Using a higher level of granularity supporting the influence of the language we speak on our
allows us to detect patterns that could not be observed cognitive framework of the world we perceive. The work
before. The findings of this study have important im- of Tan et al. [38] uses functional magnetic resonance
plications for the development and use of cross-lingual imaging (fMRI) reporting that brain regions involved
language models. By quantifying gender bias, these mod- in language processing are also activated during
percepels can be improved to provide more fair and unbiased tual decision-making tasks, which suggests that language
representations of language. This research contributes and perception are closely intertwined. Finally, the study
to the broader goal of promoting gender equality and from Banaji and Hardin [39] supports the claim that
genreducing bias in NLP applications. der information conveyed by word and sentences can
automatically influence judgement, creating a form of
automatic stereotyping in persons.
2. Related Work One of the primary objectives of this research is to
investigate diferences in gender equality among
countries, across various economic sectors, and with regard to
LLMs. This research will explore the correlation between
gender-marked languages and gender equality and
evaluate whether LLMs represent the world depending on the
language they were trained on. Additionally, previous
work in LLMs and labour sector, from computational
linguistics [8, 15] has focused on a small fraction (∼ 15.6%)
of the complete list of professions available in the U.S.
census to assess biases where diferences in gender
prevalence according to the census are maximised (e.g.,
female professions) or minimised (e.g., neutral professions).</p>
        <p>However, the previous analysis does not shed light on
patterns that might be dependent on the economic sector
where the full set of professions are located.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Gender bias is understood as the systematic preference or</title>
        <p>prejudice toward one gender over the other [17, 24, 25].
Previous work has studied the issue of quantifying
social biases in language [26], NLP [27, 28], and
specifically, gender biases elicited by LLMs or carried on by
implicit associations in their word embeddings, for
human work-related activities [15]. However, while the
proposed methods work well for English-based LLMs,
they fail to capture bias for languages with a rich
morphology or gender-marking, such as German, Polish or
Spanish [29]. Countries where gendered languages are
spoken often evidence less gender equality compared to
countries with other grammatical gender systems [30].
While previous work has centered on the English
language, recent studies have explored bias in multilingual
contexts and languages other than English [31, 32, 33, 34]
.</p>
      </sec>
      <sec id="sec-1-3">
        <title>There are weak evidences that language shapes the</title>
        <p>way of thinking. Previous argument is mentioned in the
work of Whorf and Carroll [35] and such ideas have been
the subject of debate and criticism. Whorf’s work
explores the idea that language shapes our cognition and</p>
        <sec id="sec-1-3-1">
          <title>2.1. Contributions</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>This work makes the following contributions: (i) We ex</title>
        <p>tend previous definitions of gender biases in pre-trained
LLMs to work with two diferent types: stereotyping bias
and representation bias and characterise multiple items
in the trade-of between them. (ii) We evaluate such
biases in pre-trained LLMs across multiple languages, ranging from languages without grammatical gender (e.g., English) to rich morphological or gender-marking languages, which we name gendered languages (e.g., Spanish). (iii) For each language, we perform an evaluation on multiple pre-trained LLMs such as BERT [40] and RoBERTa [41]. (iv) Whereas the state of the art has focused on studying biases in the labour market at the level of professions, we change the lenses and analyze them at the level of economic sectors, comparing results with gender statistics for the labour market. (v) We will release our code, including the templates.</p>
      </sec>
      <sec id="sec-1-3">
        <title>3. Methodology</title>
        <p>This section describes the methodology employed to measure gender bias in LLMs, with a focus on labour market stereotypes. In our work, we leverage pre-trained LLMs to quantify gender bias using a template-based approach to measure association scores between a token and a masked target or attribute.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Pre-trained LLMs have been successfully employed to</title>
        <p>diferent tasks and numerous applications in NLP in
recent years. Significant performance gains have led to
the development of various architectures. One of the
most prominent LLMs is BERT [40]. Later RoBERTa
3.2. Grammatical and Natural Gender approach this task from two diferent perspectives, what
Languages we name Stereotyping Bias () and Representation Bias
(). The former quantifies how a given LLM is far from
In the field of linguistics, a grammatical gender system gender neutrality given a context. The latter takes into
represents a distinct form of a noun class system, wherein account the LLM bias with respect to what is observed in
nouns are categorised based on gender attributes. In society. For instance, in figure 1d where professions are
languages featuring a grammatical gender system, the supposed to be balanced among genders [15], we would
majority of the nouns inherently bear one value of the expect that a BERT model with no bias will produce
assogrammatical category known as gender. ciation scores around zero (see section 3.3.2 for more
de</p>
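        <p>As an illustration of the model grid described above, the following sketch loads a few cased and uncased checkpoints with the Hugging Face transformers library. The checkpoint identifiers are publicly available models from the cited model families and are used here as assumptions for illustration; they are not necessarily the exact checkpoints evaluated in this work (see table 4 in the annex).</p>
        <preformat>
# Illustrative only: plausible public masked-LM checkpoints per language.
# The exact models, sizes and corpora used in this work are listed in table 4.
from transformers import AutoModelForMaskedLM, AutoTokenizer

CHECKPOINTS = {
    ("English", "uncased"): "bert-base-uncased",
    ("English", "cased"): "bert-base-cased",
    ("German", "cased"): "bert-base-german-cased",
    ("Spanish", "cased"): "dccuchile/bert-base-spanish-wwm-cased",
    ("Polish", "cased"): "dkleczek/bert-base-polish-cased-v1",
}

models = {}
for (language, casing), name in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    model.eval()  # inference only: the bias probes involve no fine-tuning
    models[(language, casing)] = (tokenizer, model)
        </preformat>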
      </sec>
      <sec id="sec-1-5">
        <title>3.2. Grammatical and Natural Gender Languages</title>
        <p>In the field of linguistics, a grammatical gender system represents a distinct form of a noun class system, wherein nouns are categorised based on gender attributes. In languages featuring a grammatical gender system, the majority of the nouns inherently bear one value of the grammatical category known as gender.</p>
        <p>The Spanish language is a Romance language that falls within the grammatical gender category, as do German and Polish. In Spanish, there are two genders, masculine and feminine, and both the noun and adjective systems exhibit these two genders [43]. In addition, articles and some pronouns and determiners have a neuter gender in their singular form. German is also an inflected language [44] with three genders: masculine, feminine and neuter. In Polish, the only Slavic language in this study, nouns belong to one of three genders: masculine, feminine and neuter. In this West Slavic language, the masculine gender is further divided into subgenders: animate/inanimate in the singular, and human/non-human in the plural. Furthermore, adjectives agree with nouns in terms of gender, and conjugated verb forms agree with their subject's gender in the past tense and in subjunctive/conditional forms.</p>
        <p>Nevertheless, English is considered a natural gender language and most of its nouns, with some exceptions, are considered genderless [44]. English has three gendered pronouns, but no longer has grammatical gender in the sense of noun class distinctions or inflections. Instead, gender is characterised through the language's pronouns [30], that is, the distinction between "he", "she", other personal pronouns and "it".</p>
      </sec>
      <sec id="sec-1-6">
        <title>3.3. Bias Quantification</title>
        <p>To quantify biases in a particular context, it is important to first establish a clear definition of what a bias-free system would look like. This requires a thoughtful reflection on the desired behavior of the analysed model and the impact that potential biases might have. In our work, we approach this task from two different perspectives, which we name Stereotyping Bias (𝒮) and Representation Bias (ℛ). The former quantifies how far a given LLM is from gender neutrality given a context. The latter takes into account the LLM bias with respect to what is observed in society. For instance, in figure 1d, where professions are supposed to be balanced among genders [15], we would expect that a BERT model with no bias will produce association scores around zero (see section 3.3.2 for more details on the association scores). Looking at figure 1a, any deviation from the observed perfect overlapping would account for stereotyping bias; see sections 3.3.3 and 3.3.4 for further details on the two perspectives on bias quantification. The applicability and preference for one notion over the other depends on the context of usage of the LLM at hand [45]. Existing studies typically quantify gender bias in pre-trained LLMs using tailored sets of synthetically generated sentences and implicit associations between word embeddings [46]. In the work of Kurita et al. [47], gender bias in BERT models is measured using a probability-based metric [25] and template sentences. Specifically, the LLM is directly queried for a particular token in a template sentence by sequentially masking either the target or the attribute token; see table 1, where [TARGET] and [ATTRIBUTE] stand for the target and the attribute words, respectively. In our analysis, the mask [TARGET] is replaced by gendered nouns and pronouns (e.g., he/she/my sister) and the mask [ATTRIBUTE] is replaced with terms related to specific economic sectors (e.g., fishing/services/secondary). As contextualised embeddings of a given token are dependent on its context, a relative measure of bias for the attribute word can be evaluated by substituting target classes (e.g., male and female). In [47], the authors compare their evaluation method with a baseline cosine similarity measure among word embeddings.</p>
        <p>However, Kurita's methodology confronts different challenges when applied to grammatically gendered languages such as Spanish or German. Previous work by Bartl et al. [15] demonstrated that the original association scores proposed in [47] were not effective for the German language due to its inherent gender suffixes in attributes. In English, a few gendered words exist (e.g., king/queen, waiter/waitress, actor/actress), and measuring the association score for sentences with those words, e.g., "[TARGET] is the waitress", with male or female options would yield misleading results when using word embedding projection methods [29], thus showing a gender bias against men instead of women. This phenomenon carries over into gendered languages, where different words are used for each gender. For instance, if we compare the distributions from figures 1a to 1c, corresponding to the distributions of association scores for the Spanish language, we notice that Kurita's method obtains overlapping distributions for the three groups of professions in Spanish. This result is also confirmed by a drastic reduction of the p-values obtained with a Wilcoxon test statistic, compared to the English distributions. It is worth mentioning that the same effect occurs for German, as previously noticed by Bartl et al. [15], and for Polish. These results motivate us to develop a new set of templates, aiming to avoid the effects of gendered attributes in the quantification of bias in this work.</p>
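        <p>As a concrete illustration of the distribution comparison mentioned above, the sketch below contrasts association-score distributions for female and male targets with a Wilcoxon test from SciPy. The score values are made-up placeholders, not the distributions of figure 1, and since the exact test variant is not specified here, the paired signed-rank form is an assumption.</p>
        <preformat>
# Minimal sketch of the significance check described above, assuming SciPy.
# The association scores below are placeholders for paired template scores.
from scipy.stats import wilcoxon

female_scores = [0.12, -0.05, 0.30, 0.08, -0.11]  # female targets, per template
male_scores = [0.10, -0.02, 0.28, 0.11, -0.09]    # male targets, same templates

statistic, p_value = wilcoxon(female_scores, male_scores)
print(f"Wilcoxon statistic = {statistic:.3f}, p-value = {p_value:.3f}")
        </preformat>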
      </sec>
      <sec id="sec-1-7">
        <title>3.3.1. Templates</title>
        <p>We adapt the idea of using templates to quantify and measure gender bias [47, 15]. Bartl et al. [15] used association scores to analyze gender biases across professions, releasing the BEC-Pro dataset for the English and German languages. We follow a similar approach, but we extend the analysis to two additional languages, Spanish and Polish. More importantly, we shift the focus from individual professions to entire economic sectors.</p>
        <p>Note that, as discussed in section 3.3, the relation between the grammatical gender of the person word and the profession does influence the association scores in gender-marking languages. In response to this, the novel approach of measuring biases across economic sectors instead of using a list of occupation words allows us to minimize potential complications stemming from grammatical gender inflections and pronouns. Additionally, by examining economic sectors, our investigation encompasses an aggregated view instead of limiting the analysis to a specific list of professions, which, for instance, facilitates relating the results to macroeconomic statistics.</p>
        <p>Our templates are designed to assess gender bias in LLMs concerning economic sectors. To achieve that, we take into account changes in sentence structure (e.g., articles) depending on the female or male person word.</p>
        <p>These templates follow a standard structure, where a sentence contains an economic sector reference as the attribute and a specific gendered term as the target; an example is sketched below.</p>
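        <p>The following is a minimal sketch of how such templates can be instantiated, crossing gendered person words (targets) with economic-sector terms (attributes). The template strings, person words and sector terms are illustrative stand-ins only and are not the released templates; the real templates additionally adjust articles and inflections to the gender of the person word.</p>
        <preformat>
# Illustrative template expansion; strings are examples, not the actual dataset.
TEMPLATES = {
    "en": "{target} works in the {attribute} sector.",
    "es": "{target} trabaja en el sector {attribute}.",
}
TARGETS = {
    "en": {"female": ["she", "my sister"], "male": ["he", "my brother"]},
    "es": {"female": ["ella", "mi hermana"], "male": ["él", "mi hermano"]},
}
ATTRIBUTES = {
    "en": ["primary", "secondary", "services"],
    "es": ["primario", "secundario", "servicios"],
}

def generate_sentences(lang):
    """Yield (gender, sector, sentence) triples for one language."""
    for gender, person_words in TARGETS[lang].items():
        for target in person_words:
            for attribute in ATTRIBUTES[lang]:
                yield gender, attribute, TEMPLATES[lang].format(
                    target=target, attribute=attribute
                )

for gender, sector, sentence in generate_sentences("en"):
    print(gender, sector, sentence)
        </preformat>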
      </sec>
      <sec id="sec-1-6">
        <title>The association scores methodology proposed by Kurita</title>
        <p>et al. [47] is employed to measure the likelihood of a
masked word being associated with a specific gender.
These scores quantify the gender bias present in the LLMs
by evaluating the probability that the masked token is
classified as male or female. Higher scores for a particular
gender indicate a stronger bias towards that gender in
the predictions of the evaluated LLM.</p>
        <p>The aim of this method is to estimate the implicit
association between specific targets and attributes using
BERT’s MLM objective. For example, using the
template sentence "she works in the construction sector",
the method can quantify the association between the
target female (given by the pronoun "she") and the attribute
construction. The distribution scores drawn by figure 1
are obtained in the same manner.</p>
        <p>The main steps of the method are as follows:</p>
      </sec>
      <sec id="sec-1-7">
        <title>1. Prepare a template sentence</title>
        <p>e.g. "[TARGET] works in the [ATTRIBUTE]
sector".</p>
        <p>For example this may be "she works in the
construction sector".
2. Mask the [TARGET] word and compute the target
probability  which corresponds to the
likelihood of the target word given an unmasked
attribute.</p>
        <p>For the updated example, the sentence becomes
"[MASK] works in the construction sector" and
 measures how likely the LLM is to predict
"she" as the missing word.</p>
      </sec>
      <sec id="sec-1-8">
        <title>3. Compute the prior probability , which is</title>
        <p>the likelihood of the target word when the
attribute is also masked.</p>
        <p>The example sentence would be "[MASK] works
in the [MASK] sector", and  is the
probability of predicting "she" without the influence of
the attribute.
4. Compute the association between target and
attribute as  = log  .</p>
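        <p>The steps above can be made concrete with the following sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the template, target and attribute strings are illustrative, and the checkpoint is an assumption rather than one of the exact models evaluated here.</p>
        <preformat>
# Minimal sketch of the association score a = log(p_target / p_prior).
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_prob(sentence, word):
    """Probability of `word` at the first [MASK] position of `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

def association_score(target, attribute):
    # Step 2: target probability with the attribute visible.
    p_target = masked_prob(f"[MASK] works in the {attribute} sector.", target)
    # Step 3: prior probability with the attribute masked as well.
    p_prior = masked_prob("[MASK] works in the [MASK] sector.", target)
    # Step 4: log ratio of the two probabilities.
    return math.log(p_target / p_prior)

# Female-minus-male association for a single template and attribute.
bias = association_score("she", "construction") - association_score("he", "construction")
print(f"female-minus-male association: {bias:+.3f}")
        </preformat>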
      </sec>
      <sec id="sec-1-9">
        <title>This logarithmic ratio is the association score, . To</title>
        <p>measure gender bias, we compute the gender bias by
comparing these scores for diferent targets, such as "he" and
"she", averaging for all templates and taking the
diference between female and male association score averages.</p>
        <p>This method, as evidenced in the original paper,
outperforms traditional cosine-based measures like WEAT [8]
in detecting gender biases.
3.3.3. Stereotyping Bias
Stereotyping bias () quantifies the extent to which a
given LLM is far away from gender neutrality in a given
context given by a specific language (ℒ) and a LLM model comparing the percentages of females and males in
sec(ℳ). To do so, it quantifies the disparities in average tor  using data from the Global Gender Gap Index across
association score across genders for each of the economic diferent countries.
sectors: primary, secondary and tertiary. A balanced model (ℛ = 0) perfectly reproduces
so</p>
        <p>To calculate the overall  across sectors we first cal- cietal gender distributions. Negative values indicate a
culate the stereotyping bias for a specific sector  as model preference for males compared to the real
prevathe inner disparity ℐ(ℒ, ℳ) by first computing the lence, while positive values indicate a preference for
model’s average diference between association scores  females. This metric helps language modelers ensure
between females and males. This diference is calculated accurate societal representations in their models.
for each i-th sentence generated for females () and
males () between the total number  of male and 3.4. Labour Market Data
female oriented sentences generated for ℒ and .</p>
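        <p>A minimal numerical sketch of equations (1) and (2) follows, assuming the per-sentence association scores have already been computed as in section 3.3.2; the arrays are made-up placeholders.</p>
        <preformat>
# Inner disparity per sector (eq. 1) and overall stereotyping bias (eq. 2).
import numpy as np

# scores[sector][gender][i]: association score of the i-th template sentence.
scores = {
    "primary": {"F": np.array([0.10, 0.05]), "M": np.array([0.20, 0.15])},
    "secondary": {"F": np.array([-0.05, 0.00]), "M": np.array([0.25, 0.30])},
    "tertiary": {"F": np.array([0.30, 0.35]), "M": np.array([0.10, 0.05])},
}

def inner_disparity(sector):
    """Eq. (1): mean female-minus-male association score for one sector."""
    return float(np.mean(scores[sector]["F"] - scores[sector]["M"]))

def stereotyping_bias():
    """Eq. (2): average inner disparity over the three economic sectors."""
    return float(np.mean([inner_disparity(s) for s in scores]))

print({s: round(inner_disparity(s), 3) for s in scores})
print("S =", round(stereotyping_bias(), 3))
        </preformat>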
      </sec>
      <sec id="sec-1-10">
        <title>We evaluate gender bias by investigating the relation be</title>
        <p>tween gender-denoting target words and sectors names
ℐ(ℒ, ℳ) = 1 ∑=︁0 (︀ () − ()︀) (1) ibnefEonreg,liosuhr, Sppraonpiosshe,dPomliesthhoadndavGoeirdmsaunsi.nAgspmroefnetsisoinoends
names, gendered or not, to keep the attribute unchanged</p>
        <p>The overall stereotyping bias  across sectors for ℳ among languages, thus making the results comparable
and ℒ is computed as the average inner disparity across among diferent languages under study. The observed
all three economic sectors: bias is compared to real-world through the gender
statis(ℒ, ℳ) = 13 ∑=3︁1 ℐ(ℳ, ℒ) (2) (ftbseiacemsseeaatdlaceborsolneascss2rp)ao,ecswctsiihficveiiccctooyhnusdoneemtcsrtciiocerrissbseaepcnttrdhooervtship.derCeierdovrmabelylepanattcrhivieeseooWlnfasmonragalrudleeasBdgaoaennnsdke
A model ℳ trained for language ℒ without stereotyp- (e.g., Spanish models are compared with statistics from
ing bias (ℒ, ℳ) = 0, would produce equal average Spain) so that we are able to specifically compare each
association scores for male and female targets in eco- model’s outputs with its social reality, regarding the level
nomic sectors. Negative values indicate bias favoring of gender equality represented at the workforce statistics
males, while positive values indicate bias favoring fe- and per each economic sector. We also report the Global
males. Stereotyping bias is specific to each model and Gender Gap Index2 by the World Economic Forum (WEF).
the language in which it was trained. As seen in table 2, the bigger gender gap among sectors
is found for the secondary sector, a common trend in the
3.3.4. Representation Bias four countries, where is mostly occupied by male
workers. The contrary case is found in the tertiary sector, in
where the relative diference favours the female gender.</p>
        <p>For the primary sector, a similar statistic is found except
for Poland where gender is almost balanced.</p>
        <p>With a the broader view, representation bias in a given
domain generally refers to the underrepresentation or
overrepresentation of certain groups (such as genders
or ethnicities) as compared to their prevalence in the
overall target population. However, in the context of our 4. Results
research, we adopt a definition of representation bias (ℛ),
particularly tailored to the context of our analysis. Here, The results obtained for LLMs trained in English, German,
ℛ is understood as the divergence of a model’s internal Spanish and Polish languages reveal intricate patterns
representation of genders from the actual societal gender of the two notions (, )of gender bias that emerge
distributions in the workforce. and fluctuate across languages, economic sectors, and</p>
        <p>We define the overall representation bias across eco- model types. The analysis based on results depicted in
nomic sectors for a given LLM ℳ trained for language figure 2, representing gender bias, separated by primary,
ℒ as: secondary, and tertiary sectors, exhibits diverse trends
across languages and sectors. As can be observed, the
uncased BERT models generally exhibit less stereotyping
3
ℛ(ℒ, ℳ) = 31 ∑︁ (︀
=1
ℐ(ℒ, ℳ) −  (ℒ))︀</p>
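        <p>Equation (3) can be sketched in the same style, assuming the per-sector inner disparities from section 3.3.3 and observed per-sector gender ratios; all numeric values below are placeholders rather than the statistics of table 2.</p>
        <preformat>
# Eq. (3): average deviation of the model's inner disparity from the observed
# gender ratio per sector. Values are illustrative placeholders.
inner_disparity = {"primary": -0.10, "secondary": -0.20, "tertiary": 0.15}
gender_ratio = {"primary": -0.15, "secondary": -0.30, "tertiary": 0.20}

representation_bias = sum(
    inner_disparity[s] - gender_ratio[s] for s in inner_disparity
) / len(inner_disparity)
print(f"R = {representation_bias:+.3f}")
        </preformat>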
      </sec>
      <sec id="sec-1-11">
        <title>3.4. Labour Market Data</title>
        <p>We evaluate gender bias by investigating the relation between gender-denoting target words and sector names in English, Spanish, Polish and German. As mentioned before, our proposed method avoids using profession names, gendered or not, to keep the attribute unchanged among languages, thus making the results comparable among the different languages under study. The observed bias is compared to the real world through gender statistics (see table 2) that describe the percentages of females and males active in each economic sector, provided by the World Bank (https://genderdata.worldbank.org/indicators/sl-empl-zs/). Comparisons are based on the country statistics corresponding to each language (e.g., Spanish models are compared with statistics from Spain), so that we are able to specifically compare each model's outputs with its social reality, regarding the level of gender equality represented in the workforce statistics and per economic sector. We also report the Global Gender Gap (GGG) index by the World Economic Forum (WEF), which "assesses countries on how well they are dividing their resources and opportunities among their male and female populations, regardless of the overall levels of these resources and opportunities" (https://www3.weforum.org/docs/WEF_GGGR_2022.pdf).</p>
        <p>As seen in table 2, the biggest gender gap among sectors is found for the secondary sector, a common trend in the four countries, where it is mostly occupied by male workers. The contrary case is found in the tertiary sector, where the relative difference favours the female gender. For the primary sector, a similar statistic is found, except for Poland, where gender is almost balanced.</p>
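        <p>As a small worked example of the normalisation used in table 2, the relative difference can be computed from the female and male workforce shares; dividing by 100 is an assumption that is consistent with the values reported in the table.</p>
        <preformat>
# Rel. dif. = (female share - male share) / 100, normalised to [-1, 1].
def relative_difference(female_pct, male_pct):
    return round((female_pct - male_pct) / 100, 2)

assert relative_difference(37.64, 62.36) == -0.25  # German, primary sector
assert relative_difference(61.80, 38.21) == 0.24   # German, tertiary sector
        </preformat>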
      </sec>
      <sec id="sec-1-12">
        <title>4. Results</title>
        <p>The results obtained for LLMs trained in English, German, Spanish and Polish reveal intricate patterns of the two notions of gender bias (𝒮, ℛ), which emerge and fluctuate across languages, economic sectors and model types. The analysis based on the results depicted in figure 2, representing gender bias separated by primary, secondary and tertiary sectors, exhibits diverse trends across languages and sectors. As can be observed, the uncased BERT models generally exhibit less stereotyping bias than their cased counterparts in all languages.</p>
        <p>Within the range of languages analysed, Polish demonstrates the lowest stereotyping bias, followed by Spanish, with German and English showing similar average values, but with the English models demonstrating a broader range of stereotyping bias.</p>
        <p>When analyzing the results both aggregated and across sectors, an interesting pattern emerges. In the 2D scatterplot, all models for a given language align along a specific trajectory, revealing a clear trade-off between how accurately a language model represents the social reality (representation bias) and the stereotyping bias that the language model exhibits.</p>
        <p>Specifically, we are interested in the specific domains for 𝒮 in which the grammatical gendering of the LLM's language might be a proxy for predicting gender inequality in the country of the spoken language. As reported by Prewitt-Freilino et al. [30], countries predominated by a natural gender language, like English, evidence greater gender equality than countries with other grammatical gender systems. Albeit, as seen in table 2, that is not the case for the English GGG index (only accounting for England), which is lower than the GGG indexes reported for Germany and Spain, both gendered languages.</p>
        <p>Table 2. The Female/Male columns refer to the percentage of the workforce in the sector. The Rel. dif. column stands for the relative difference between both genders, normalised to the range [-1, 1], where negative values indicate a real-world bias towards the male gender. The Global Gender Gap (GGG) index is reported by the World Economic Forum 2022, and the employment-by-sector statistics come from the World Bank (2019). The values of the GGG index range from 0 to 1, with higher values indicating greater gender equality.</p>
        <preformat>
Language  GGG    Sector     Female  Male   Rel. dif.
German    0.801  Primary     37.64  62.36   -0.25
                 Secondary   28.90  71.10   -0.42
                 Tertiary    61.80  38.21    0.24
Spanish   0.788  Primary     28.13  71.87   -0.44
                 Secondary   26.23  73.77   -0.48
                 Tertiary    60.40  39.60    0.21
Polish    0.709  Primary     49.00  51.05   -0.02
                 Secondary   32.10  67.94   -0.36
                 Tertiary    65.65  34.35    0.31
English   0.780  Primary     31.19  68.81   -0.38
                 Secondary   24.22  75.78   -0.52
                 Tertiary    59.34  40.66    0.19
        </preformat>
        <p>For the case of the English LLMs (see figure 2a), the BERT base uncased and RoBERTa base cased models report the lowest |ℛ| and |𝒮|, thus being both a good proxy for real-world data and models with low stereotyping. Note that values of 𝒮 can be understood as the LLM's perception of the world once the LLM is trained in a specific language, whereas ℛ describes its capacity for predicting the real-world data or gender gap. Similar results are observed for Spanish, where the Base models outperform the Large models in both bias metrics, ℛ and 𝒮. Note that the main difference between Large and Base models resides in the number of parameters employed in the architecture, the number of tokens in the training corpus being the same for both LLM systems. Table 4 in the annex summarizes the training data used and the number of parameters for each LLM.</p>
        <p>Regarding the Spanish model BERTIN, it corresponds to a RoBERTa base model trained with 100B tokens more than the RoBERTa-BSC model. In the figures, both models are denoted with an orange circle. BERTIN portrays the lowest error in terms of bias, skewed toward negative values of 𝒮 for the sector-aggregation graph 2a. Note that the graph for all sectors is computed as a weighted average of the three sector results, using WEF data proportions, see table 2.</p>
        <p>Overall, the gendered languages exhibit a smaller variance around (𝒮, ℛ) = (0, 0) than the English LLMs. Nonetheless, by removing the Large models from the analysis, we can see that the English LLMs, trained on a natural gender language, are on average closer to the non-bias point (0, 0) than the rest of the languages, except for Polish. This result is diluted depending on the specific economic sector we look at, where the English 𝒮 in the primary and secondary sectors is compensated by the tertiary sector, the latter biased towards the female direction.</p>
        <p>We also observe interesting results in the case of the Polish language, which belongs to the family of Slavic languages. Polish, as a West Slavic language, is gendered; however, there is a very limited number of studies investigating bias even in the broader Slavic language family [48]. Hence, the lack of such analysis is addressed in our work. The results for Polish LLMs, as shown in figure 2, in general demonstrate lower representation bias scores than the other languages.</p>
        <p>We hypothesize that one of the reasons might be attributed to the gender-sensitive grammar structures in Polish. Unlike many other languages, Polish modifies not only pronouns but also verb forms to correspond with gender. For example, in certain sentence templates, the conditional and past tenses of verbs, as well as relative pronouns, alter according to the gender of the person being referred to. This linguistic feature may potentially impact how LLMs learn and represent gender-related concepts in Polish, thus influencing the extent of bias observed in these models. Given the gendered nature of Polish and how the observed patterns reflect it, we conclude that the representativeness (as indicated by the representation bias) of Polish LLMs is slightly better than that of their English, Spanish and German counterparts, due to the necessary agreement of verbs and pronouns with the gendered subject.</p>
      </sec>
      <sec id="sec-1-13">
        <title>5. Conclusion</title>
        <p>In this work, we have used the idea of association scores [47] to quantify gender biases in LLMs in the labour market at the level of economic sectors. We distinguish between two different notions of bias: (i) representation bias and (ii) stereotyping bias. The first quantifies the extent to which the model is able to learn patterns that can be observed in society, whereas the latter studies how far from gender-neutral its internal representation is.</p>
        <p>By conducting this cross-linguistic analysis, we contribute to the understanding of biases in LLMs, highlighting the nuanced interplay between language structure, training data, and the biases exhibited by these models. Our study underscores the importance of comprehending how biases are captured or amplified within LLMs, paving the way for future efforts to mitigate and address these biases.</p>
        <p>We use these definitions of bias to characterise multiple state-of-the-art pre-trained LLMs, comparing results among different languages, from languages with no grammatical gender, or natural gender, to gendered languages.</p>
        <p>Among other results, the conducted analysis reveals interesting and consistent trends where biases vary across languages and economic sectors, with Polish being the language whose models systematically showcase the least bias, and the tertiary sector the unique case for which the models exhibit a biased preference towards the female gender.</p>
        <p>Additionally, we observed a quasi-linear relation between both types of bias, with most of the models exhibiting representation bias, stereotyping bias or a combination of both, and with Large models reporting higher biases.</p>
        <p>We expect these results to contribute to building a better understanding of the presence of systematic gender biases in LLMs.</p>
        <sec id="sec-1-10-1">
          <title>5.1. Limitations and Future Work</title>
          <p>This study uses a multi-language dataset, synthetically created with equivalent examples across the studied languages. However, as with other datasets used in related work, it is still limited in the sense that few templates are used to generate it. Additionally, it is important to note that cultural biases might affect the understanding of the templates translated into each language, leading to differences that could be reflected in the obtained results. This raises the possibility that unintended biases may be present in the results derived from the data.</p>
          <p>Furthermore, the dataset is composed of a tailored collection of terms that are descriptors of economic sectors, for which the results are then aggregated. Although aggregation is a powerful tool to observe patterns, it also has the drawback of restricting the visibility of interesting patterns that occur at a lower granularity level, for instance, when using professions related to each economic sector. This means that using solely the results reported in this work might not be sufficient to understand all the possible types of biases in LLMs in the domain of the labour market; they correspond to another set of information to be accounted for.</p>
          <p>Moreover, we are comparing individual censuses with results for languages that are spoken in multiple countries. As future work, it could be interesting to compare results across multiple countries that use those languages. Additionally, more research could be done on the effects of other demographic factors or covariates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Ethics Statement</title>
      <sec id="sec-2-1">
        <title>This research provides a deepened insight into the in</title>
        <p>lfuence of grammatical gender on gender biases within
LLMs across multiple languages. The broader societal
impact of understanding and quantifying these biases is
significant for several reasons:
1. Enhancing Awareness: By bringing attention
to the variances in gender biases across languages,
we can enhance the broader community’s
awareness of potential pitfalls when deploying LLMs
in diverse linguistic settings. This awareness is
crucial for developers, policymakers, and users
to make informed decisions about the application
and potential limitations of LLMs in diferent
linguistic contexts.
2. Informed Deployment: Knowledge about
the biases inherent to these models can guide
decision-making processes for institutions and
industries that utilize LLMs. By being aware of
the biases, stakeholders can make better decisions
regarding where and how to deploy these
models, especially in applications that may have
realworld implications for individuals or groups.
3. Influence on Future Research : Our study can
pave the way for future research into the
mitigation of gender biases in LLMs. By understanding
the nuanced interplay between language
structure, training data, and model bias, the
community can work towards developing techniques and
best practices to address and reduce such biases.</p>
        <sec id="sec-2-1-1">
          <title>6.1. Ethical Considerations</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>1. Dataset Limitations: While our study utilizes</title>
        <p>a multi-language dataset, it is synthetically
created with equivalent examples across languages.
As with any synthetic dataset, there’s a risk of
unintended biases, potentially afecting the
results. We recognize and caution that translating
templates across languages can introduce cultural
biases, which might inadvertently influence the
outcomes.
2. Scope of Findings: It is important to understand
that our findings, while indicative of trends, may
not extrapolate seamlessly to other new LLMs or
to every application scenario. For example, we
are only reporting results for a narrow set of
languages and modeled by non-causal LLMs, that
is, no autoregressive models, as GPT-like, have
been evaluated. Biases are intricately linked to
specific training data, model architecture, and
application context. Our study should be viewed as
a piece in the larger puzzle of understanding and
addressing biases in LLMs, rather than a
conclusive assessment of all possible instances of gender
bias in every LLM.
3. Aggregation of Results: Our use of aggregation,
while powerful in discerning patterns, might also
mask more granular biases present in LLMs,
particularly within specific economic sectors or
professions. Users and developers should be aware</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>Funded by the European Union’s Horizon 2020.</title>
        <p>Views and opinions expressed are however those
of the author(s) only and do not necessarily
relfect those of the European Union or European
Commission-EU. Neither the European Union nor the
granting authority can be held responsible for them.</p>
        <p>of this and consider more detailed analyses when
appropriate.
4. Comparative Analyses: Our study compares
census data with results from languages spoken
across diferent countries. The cultural, economic,
and social dynamics of each country can vary
widely, even if the same language is spoken.
Future work may benefit from a more localized
approach, considering the multifaceted nature of
biases in each country.
5. Potential Misuse: Recognizing that biased
systems can perpetuate stereotypes or reinforce
societal prejudices, it is ethically imperative for
developers and users to ensure that LLMs are not
misused, especially in critical domains where
biases can lead to tangible harms or injustices.</p>
        <p>3581017. 1038/d41586-018-05707-8.
[12] C. O’Neil, Weapons of Math Destruction: How Big [22] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief,
Data Increases Inequality and Threatens Democ- J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y.
racy, Crown, 2016. URL: https://books.google.es/ Wang, Mitigating gender bias in natural language
books?id=NgEwCwAAQBAJ. processing: Literature review, in: Proceedings of
[13] M. R. Costa-jussà, An analysis of gender bias stud- the 57th Annual Meeting of the Association for
ies in natural language processing, Nature Ma- Computational Linguistics, Association for
Comchine Intelligence 1 (2019) 495–496. URL: https:// putational Linguistics, Florence, Italy, 2019, pp.
doi.org/10.1038/s42256-019-0105-5. doi:10.1038/ 1630–1640. URL: https://aclanthology.org/P19-1159.
s42256-019-0105-5. doi:10.18653/v1/P19-1159.
[14] D. Nozza, F. Bianchi, D. Hovy, HONEST: Mea- [23] A. Konnikov, N. Denier, Y. Hu, K. D. Hughes, J.
Alsuring hurtful sentence completion in language shehabi Al-Ani, L. Ding, I. Rets, M. Tarafdar, et al.,
models, in: Proceedings of the 2021 Confer- Bias word inventory for work and employment
dience of the North American Chapter of the As- versity,(in) equality and inclusivity (version 1.0),
sociation for Computational Linguistics: Human SocArXiv (2022).</p>
        <p>Language Technologies, Association for Compu- [24] M. E. Heilman, Gender stereotypes and workplace
tational Linguistics, Online, 2021, pp. 2398–2406. bias, Research in Organizational Behavior 32 (2012)
URL: https://aclanthology.org/2021.naacl-main.191. 113–135. URL: https://www.sciencedirect.com/
doi:10.18653/v1/2021.naacl-main.191. science/article/pii/S0191308512000093. doi:https:
[15] M. Bartl, M. Nissim, A. Gatt, Unmasking Contex- //doi.org/10.1016/j.riob.2012.11.003.
tual Stereotypes: Measuring and Mitigating BERT’s [25] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M.
TanGender Bias, in: M. R. Costa-jussà, C. Hardmeier, jim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K.
W. Radford, K. Webster (Eds.), Proceedings of the Ahmed, Bias and fairness in large language models:
Second Workshop on Gender Bias in Natural Lan- A survey, 2024. arXiv:2309.00770.
guage Processing, Association for Computational [26] K. M. White Smolinski, Gender Bias in Natural
GenLinguistics, Barcelona, Spain (Online), 2020, pp. 1– der Language and Grammatical Gender Language
16. URL: https://aclanthology.org/2020.gebnlp-1.1. within Children’s Literature, PhD dissertation,
Lib[16] P. Kahn, Rising tide: Gender equality and erty University, 2024. URL: https://digitalcommons.
cultural change around the world, Perspec- liberty.edu/doctoral/5294.
tives on Politics 2 (2004) 407–409. doi:10.1017/ [27] D. Cirillo, H. Gonen, E. Santus, A. Valencia, M. R.</p>
        <p>S1537592704770978. Costa-jussà, M. Villegas, Sex and gender bias
[17] C. A. Moss-Racusin, J. F. Dovidio, V. L. Brescoll, in natural language processing, in: D. Cirillo,
M. J. Graham, J. Handelsman, Science fac- S. Catuara-Solarz, E. Guney (Eds.), Sex and
ulty’s subtle gender biases favor male students, Gender Bias in Technology and Artificial
IntelliProceedings of the National Academy of Sci- gence, Academic Press, 2022, pp. 113–132. URL:
ences 109 (2012) 16474–16479. URL: https://www. https://www.sciencedirect.com/science/article/pii/
pnas.org/doi/abs/10.1073/pnas.1211286109. doi:10. B9780128213926000091. doi:https://doi.org/
1073/pnas.1211286109. 10.1016/B978-0-12-821392-6.00009-1.
[18] M. McKinnon, C. O’Connell, Perceptions of stereo- [28] P. Czarnowska, Y. Vyas, K. Shah, Quantifying Social
types applied to women who publicly commu- Biases in NLP: A Generalization and Empirical
Comnicate their stem work, Humanities and Social parison of Extrinsic Fairness Metrics, Transactions
Sciences Communications (2020). doi:10.1057/ of the Association for Computational Linguistics
s41599-020-00654-0. 9 (2021) 1249–1267. URL: https://doi.org/10.1162/
[19] S. Marjanovic, K. Stańczak, I. Augenstein, Quan- tacl_a_00425. doi:10.1162/tacl_a_00425.
tifying gender biases towards politicians on red- [29] P. Zhou, W. Shi, J. Zhao, K.-H. Huang, M. Chen,
dit, PLOS ONE 17 (2022) 1–36. URL: https://doi. R. Cotterell, K.-W. Chang, Examining Gender
org/10.1371/journal.pone.0274317. doi:10.1371/ Bias in Languages with Grammatical Gender, in:
journal.pone.0274317. K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
Proceed[20] A. H. Bailey, A. Williams, A. Cimpian, Based on bil- ings of the 2019 Conference on Empirical
Methlions of words on the internet, people=men, Science ods in Natural Language Processing and the 9th
Advances 8 (2022) 2463. URL: https://www.science. International Joint Conference on Natural
Lanorg/doi/abs/10.1126/sciadv.abm2463. doi:10.1126/ guage Processing (EMNLP-IJCNLP), Association
sciadv.abm2463. for Computational Linguistics, Hong Kong, China,
[21] J. Zou, L. Schiebinger, Ai can be sexist and racist 2019, pp. 5276–5284. URL: https://aclanthology.org/
— it’s time to make it fair, Nature (2018). doi:10. D19-1531. doi:10.18653/v1/D19-1531.
Slavic Natural Language Processing 2023
(SlavicNLP 2023), Association for Computational
Linguistics, Dubrovnik, Croatia, 2023, pp. 146–154. URL:
https://aclanthology.org/2023.bsnlp-1.17.
[49] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R.
Urtasun, A. Torralba, S. Fidler, Aligning books and
movies: Towards story-like visual explanations by
watching movies and reading books, 2015 IEEE
International Conference on Computer Vision (ICCV)
(2015) 19–27.
[50] B. Staatsbibliothek, German bert, 2023. URL: https:</p>
        <p>//github.com/dbmdz/berts.
[51] B. Minixhofer, F. Paischer, N. Rekabsaz,
WECH</p>
        <p>SEL: Efective initialization of subword embeddings
for cross-lingual transfer of monolingual language
models, in: Proceedings of the 2022 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Association for Computational
Linguistics, Seattle, United States, 2022, pp. 3992–4006.</p>
        <p>URL: https://aclanthology.org/2022.naacl-main.293.</p>
        <p>doi:10.18653/v1/2022.naacl-main.293.
[52] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,</p>
        <p>H. Kang, J. Pérez, Spanish pre-trained bert model
and evaluation data, in: PML4DC at ICLR 2020,
2020.
[53] J. de la Rosa, E. G. Ponferrada, M. Romero,</p>
        <p>P. Villegas, P. G. de Prado Salas, M. Grandury,
Bertin: Eficient pre-training of a spanish
language model using perplexity sampling,
Procesamiento del Lenguaje Natural 68 (2022) 13–23.</p>
        <p>URL: http://journal.sepln.org/sepln/ojs/ojs/index.</p>
        <p>php/pln/article/view/6403.
[54] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao,</p>
        <p>J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R.</p>
        <p>Penagos, A. G. Agirre, M. Villegas, Maria: Spanish
language models, Procesamiento del Lenguaje
Natural 68 (2022). URL: https://upcommons.upc.edu/
handle/2117/367156#.YyMTB4X9A-0.mendeley.</p>
        <p>doi:10.26342/2022-68-3.
[55] D. Kłeczek, Polbert: Attacking polish nlp tasks
with transformers, in: M. Ogrodniczuk, Łukasz
Kobyliński (Eds.), Proceedings of the PolEval 2020
Workshop, Institute of Computer Science, Polish</p>
        <p>Academy of Sciences, 2020.
[56] S. Dadas, M. Perełkiewicz, R. Poświata, Pre-training
polish transformer-based language models at scale,
in: L. Rutkowski, R. Scherer, M. Korytkowski,
W. Pedrycz, R. Tadeusiewicz, J. M. Zurada (Eds.),
Artificial Intelligence and Soft Computing, Springer</p>
        <p>International Publishing, Cham, 2020, pp. 301–314.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Employment by Sector with Respect to Gender</title>
      <p>Table 4 lists the evaluated pre-trained models, BERT [40], RoBERTa [41], Spanish BERT [52] and Spanish RoBERTa [53, 54], together with their training corpora, which include BooksCorpus (0.8B words) [49], English Wikipedia (2.5B words; excluding lists, tables and headers), CC-News (September 2016–February 2019), OpenWebText and Stories (161GB in total), as well as Wikipedia, EU Bookshop, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl (16GB, 2.35B tokens).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21, Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . URL: https://doi.org/10.1145/3442188.3445922. doi:10.1145/3442188.3445922.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of llms in practice: A survey on chatgpt and beyond</article-title>
          ,
          <source>ArXiv abs/2304</source>
          .13712 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          , D. Hersh…
        </mixed-citation>
      </ref>
      <ref id="ref12"><mixed-citation>[12] C. O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Crown, 2016. URL: https://books.google.es/books?id=NgEwCwAAQBAJ.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine Intelligence 1 (2019) 495–496. URL: https://doi.org/10.1038/s42256-019-0105-5. doi:10.1038/s42256-019-0105-5.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] D. Nozza, F. Bianchi, D. Hovy, HONEST: Measuring hurtful sentence completion in language models, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2398–2406. URL: https://aclanthology.org/2021.naacl-main.191. doi:10.18653/v1/2021.naacl-main.191.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] M. Bartl, M. Nissim, A. Gatt, Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias, in: M. R. Costa-jussà, C. Hardmeier, W. Radford, K. Webster (Eds.), Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 1–16. URL: https://aclanthology.org/2020.gebnlp-1.1.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] P. Kahn, Rising tide: Gender equality and cultural change around the world, Perspectives on Politics 2 (2004) 407–409. doi:10.1017/S1537592704770978.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] C. A. Moss-Racusin, J. F. Dovidio, V. L. Brescoll, M. J. Graham, J. Handelsman, Science faculty's subtle gender biases favor male students, Proceedings of the National Academy of Sciences 109 (2012) 16474–16479. URL: https://www.pnas.org/doi/abs/10.1073/pnas.1211286109. doi:10.1073/pnas.1211286109.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] M. McKinnon, C. O'Connell, Perceptions of stereotypes applied to women who publicly communicate their stem work, Humanities and Social Sciences Communications (2020). doi:10.1057/s41599-020-00654-0.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] S. Marjanovic, K. Stańczak, I. Augenstein, Quantifying gender biases towards politicians on reddit, PLOS ONE 17 (2022) 1–36. URL: https://doi.org/10.1371/journal.pone.0274317. doi:10.1371/journal.pone.0274317.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] A. H. Bailey, A. Williams, A. Cimpian, Based on billions of words on the internet, people=men, Science Advances 8 (2022) 2463. URL: https://www.science.org/doi/abs/10.1126/sciadv.abm2463. doi:10.1126/sciadv.abm2463.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] J. Zou, L. Schiebinger, Ai can be sexist and racist — it's time to make it fair, Nature (2018). doi:10.1038/d41586-018-05707-8.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y. Wang, Mitigating gender bias in natural language processing: Literature review, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1630–1640. URL: https://aclanthology.org/P19-1159. doi:10.18653/v1/P19-1159.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] A. Konnikov, N. Denier, Y. Hu, K. D. Hughes, J. Alshehabi Al-Ani, L. Ding, I. Rets, M. Tarafdar, et al., Bias word inventory for work and employment diversity, (in)equality and inclusivity (version 1.0), SocArXiv (2022).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. E. Heilman, Gender stereotypes and workplace bias, Research in Organizational Behavior 32 (2012) 113–135. URL: https://www.sciencedirect.com/science/article/pii/S0191308512000093. doi:10.1016/j.riob.2012.11.003.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and fairness in large language models: A survey, 2024. arXiv:2309.00770.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] K. M. White Smolinski, Gender Bias in Natural Gender Language and Grammatical Gender Language within Children's Literature, PhD dissertation, Liberty University, 2024. URL: https://digitalcommons.liberty.edu/doctoral/5294.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] D. Cirillo, H. Gonen, E. Santus, A. Valencia, M. R. Costa-jussà, M. Villegas, Sex and gender bias in natural language processing, in: D. Cirillo, S. Catuara-Solarz, E. Guney (Eds.), Sex and Gender Bias in Technology and Artificial Intelligence, Academic Press, 2022, pp. 113–132. URL: https://www.sciencedirect.com/science/article/pii/B9780128213926000091. doi:10.1016/B978-0-12-821392-6.00009-1.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] P. Czarnowska, Y. Vyas, K. Shah, Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics, Transactions of the Association for Computational Linguistics 9 (2021) 1249–1267. URL: https://doi.org/10.1162/tacl_a_00425. doi:10.1162/tacl_a_00425.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] P. Zhou, W. Shi, J. Zhao, K.-H. Huang, M. Chen, R. Cotterell, K.-W. Chang, Examining Gender Bias in Languages with Grammatical Gender, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5276–5284. URL: https://aclanthology.org/D19-1531. doi:10.18653/v1/D19-1531.</mixed-citation></ref>
      <ref id="ref48"><mixed-citation>[48] … Slavic Natural Language Processing 2023 (SlavicNLP 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 146–154. URL: https://aclanthology.org/2023.bsnlp-1.17.</mixed-citation></ref>
      <ref id="ref49"><mixed-citation>[49] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015 IEEE International Conference on Computer Vision (ICCV) (2015) 19–27.</mixed-citation></ref>
      <ref id="ref50"><mixed-citation>[50] B. Staatsbibliothek, German bert, 2023. URL: https://github.com/dbmdz/berts.</mixed-citation></ref>
      <ref id="ref51"><mixed-citation>[51] B. Minixhofer, F. Paischer, N. Rekabsaz, WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3992–4006. URL: https://aclanthology.org/2022.naacl-main.293. doi:10.18653/v1/2022.naacl-main.293.</mixed-citation></ref>
      <ref id="ref52"><mixed-citation>[52] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and evaluation data, in: PML4DC at ICLR 2020, 2020.</mixed-citation></ref>
      <ref id="ref53"><mixed-citation>[53] J. de la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. G. de Prado Salas, M. Grandury, Bertin: Efficient pre-training of a spanish language model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.</mixed-citation></ref>
      <ref id="ref54"><mixed-citation>[54] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, Maria: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.</mixed-citation></ref>
      <ref id="ref55"><mixed-citation>[55] D. Kłeczek, Polbert: Attacking polish nlp tasks with transformers, in: M. Ogrodniczuk, Łukasz Kobyliński (Eds.), Proceedings of the PolEval 2020 Workshop, Institute of Computer Science, Polish Academy of Sciences, 2020.</mixed-citation></ref>
      <ref id="ref56"><mixed-citation>[56] S. Dadas, M. Perełkiewicz, R. Poświata, Pre-training polish transformer-based language models at scale, in: L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, J. M. Zurada (Eds.), Artificial Intelligence and Soft Computing, Springer International Publishing, Cham, 2020, pp. 301–314.</mixed-citation></ref>
    </ref-list>
  </back>
</article>