Measuring Gender Stereotype Reinforcement in Information Retrieval Systems*
Discussion paper

Alessandro Fabris¹, Alberto Purpura¹, Gianmaria Silvello¹ and Gian Antonio Susto¹
¹ Department of Information Engineering, University of Padua, Padua, Italy

Abstract
Can we measure the tendency of an Information Retrieval (IR) system to reinforce gender stereotypes in its users? In this abstract, we define the construct of Gender Stereotype Reinforcement (GSR) in the context of IR and propose a measure for it based on Word Embeddings. We briefly discuss the validity of our measure and summarize our experiments on different families of IR systems.

Keywords
Fairness, Gender Stereotypes, Information Retrieval, Search Engines, Word Embeddings

1. Introduction
Search Engines (SEs) increasingly act as the gatekeepers of information. Their role in information access is undisputed, with a user base exceeding 90% of all people connected to the internet [1]. SEs inevitably influence users, helping them map concepts and link entities across queries and documents. For this reason, they can play an important role in countering or reinforcing stereotypical associations [2].

Stereotypes are generalised beliefs about groups of individuals, held widely in a population of interest. They arise from a co-occurrence of features, such as membership in a group and display of certain traits and roles. The extent to which an individual believes a stereotypical trait to be common in a given group is often measured through an association test between groups and traits [3].

Male and female are highly salient categories in human cognition, available from an early age for stereotypical associations.² As a result, Western societies maintain a wide range of gender stereotypes, relating e.g. to professions, career, competence, care, and predisposition for science and mathematics. The same stereotypes are also found in the artifacts and technology produced by these societies. For example, the search results of popular image SEs were found to contain gender-stereotypical associations [5] and to influence users' cognition accordingly [6]. Only recently have novel approaches to measure gender bias in text-based SEs been proposed [4, 7].

IIR 2021 – 11th Italian Information Retrieval Workshop, September 13–15, 2021, Bari, Italy
fabrisal@dei.unipd.it (A. Fabris); purpuraa@dei.unipd.it (A. Purpura); silvello@dei.unipd.it (G. Silvello); gianantonio.susto@unipd.it (G. A. Susto)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

* Extended abstract of Fabris et al. [4].
² The present binary framing of gender is a consequence of this fact and a clear limitation of our work. This is a common weakness for work in this space; addressing it is far from trivial.

[Figure 1 panels: (a) Gender direction, showing the projection g(w) of a sample word w = beauty and of the pairs him/her, man/woman, jobs_m/jobs_f, career/family, science/arts, agency/communion along the x axis; (b) "Female" queries; (c) "Male" queries.]

Figure 1: Adapted from [4]. Left.
Gender projection 𝑔(𝑤) for different words and concepts related to known gender stereotypes. Four stereotypical dichotomies are depicted in Figure 1a, which relate to gender concentrations in professions (jobs_m vs jobs_f), work-related choices (career vs family), predisposition for subjects (science vs arts), and competence and warmth (agency vs communion). For each of these concepts, Figure 1a depicts the average projection of a set of words that have been proposed as representative of the concept. The projection of stereotypically female concepts is always (statistically significantly) lower than the projection of their male counterpart. Middle and right. Gendered queries from the Robust04 collection [8] according to gender projection 𝑔(·). The text is printed with a color-coded gradient where strongly "masculine" words are orange, strongly "feminine" words are purple, and neutral words are white. Indeed, the words in these queries relate to gender either intrinsically (women), biologically (menopause) or stereotypically (e.g. child relates to family, quilts to stereotypical occupations, heroic to agency, dangerous is contrary to communion).

In this work we provide an overview of the Gender Stereotype Reinforcement (GSR) construct and measure of Fabris et al. [4].

2. GSR: Construct, Measure and Validity
In the context of IR, we define GSR as a SE's tendency to reinforce (or counter) gender-stereotypical associations in its users. Direct measurement of this construct would require impractical longitudinal user studies of counterfactual nature. Fabris et al. [4] propose a computational approach, based on Word Embeddings (WEs). Indeed, WEs have been found to reliably encode several gender stereotypes [9, 10], typically along a single direction of the embedded space. More precisely, Bolukbasi et al. [9] show how to isolate a problematic direction, called the gender subspace, where gender-related concepts are clustered in accordance with gender stereotypes. To illustrate this concept, Figure 1a depicts the gender direction 𝑤𝑔 of Word2vec embeddings [11] along the 𝑥 axis. A sample word, 𝑤 = beauty, is projected onto the gender direction, where it is closer to intrinsically female words (her, woman) and to stereotypically female concepts than to their male counterparts.

Let us indicate by 𝑔(𝑤) = (𝑤 · 𝑤𝑔)/(|𝑤||𝑤𝑔|) the function associating a word 𝑤 with its normalized scalar projection on the gender direction 𝑤𝑔. By extension, 𝑔(·) maps a query 𝑞𝑖 into the average projection of its words, 𝑔(𝑞𝑖). Let us call 𝑔(𝑥) the genderedness of 𝑥. Moreover, we apply the function 𝑔(·) to the ranked list of documents ℒ𝑖 returned by an IR system 𝑠 in response to query 𝑞𝑖. We indicate it by 𝑔(ℒ𝑖) and define it as the average projection of words in the ranked documents 𝑑𝑘, weighted according to the rank of each document in ℒ𝑖. In symbols, $g(\mathcal{L}_i) = \sum_{d_k \in \mathcal{L}_i} w_k \cdot g(d_k)$, with $w_k$ computed according to a DCG-like logarithmic discount [12].³

Given a set of 𝑁 queries 𝒬 and a collection of documents 𝒟 available for retrieval, we define the GSR of an IR system 𝑠 over (𝒬, 𝒟) in terms of the correlation between the genderedness 𝑔(𝑞𝑖) of queries 𝑞𝑖 ∈ 𝒬 and the genderedness 𝑔(ℒ𝑖) of the ranked lists of documents produced in response. More precisely, GSR is defined as

$$ m_s(\mathcal{Q}, \mathcal{D}) = \frac{1}{\sigma^2_{g(q)}} \, \frac{1}{N} \sum_{i=1}^{N} \big(g(q_i) - \mu_q\big)\big(g(\mathcal{L}_i) - \mu_{\mathcal{L}}\big), \qquad (1) $$

where 𝜇𝑞 and 𝜇ℒ represent the average genderedness of queries and ranked lists, while $\sigma^2_{g(q)}$ is a scaling factor to go from correlation to slope coefficient.
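To make the measure concrete, the following is a minimal Python sketch of 𝑔(·) and of Equation 1. It is an illustration under stated assumptions, not the authors' reference implementation: we assume pre-trained embeddings stored as a dict of NumPy vectors, a gender direction 𝑤𝑔 obtained e.g. with the method of Bolukbasi et al. [9], and tokenized queries and documents; skipping out-of-vocabulary words and normalizing the rank discounts to sum to one are our illustrative choices, not prescribed by this abstract.

```python
# Sketch of the GSR measure (Eq. 1); assumptions noted above and in comments.
import numpy as np


def genderedness_word(w_vec, w_g):
    """g(w): normalized scalar projection of a word vector onto the gender direction w_g."""
    return float(np.dot(w_vec, w_g) / (np.linalg.norm(w_vec) * np.linalg.norm(w_g)))


def genderedness_text(tokens, emb, w_g):
    """g(q) or g(d): average projection over the words of a tokenized text.
    Out-of-vocabulary words are skipped (our assumption)."""
    proj = [genderedness_word(emb[t], w_g) for t in tokens if t in emb]
    return float(np.mean(proj)) if proj else 0.0


def genderedness_ranked_list(ranked_docs, emb, w_g):
    """g(L_i): rank-weighted average of document projections, using a DCG-like
    logarithmic discount [12]; normalizing weights to sum to one is our assumption."""
    discounts = np.array([1.0 / np.log2(k + 2) for k in range(len(ranked_docs))])
    weights = discounts / discounts.sum()
    doc_proj = np.array([genderedness_text(d, emb, w_g) for d in ranked_docs])
    return float(np.dot(weights, doc_proj))


def gsr(queries, ranked_lists, emb, w_g):
    """m_s(Q, D) from Eq. (1): covariance of query and ranked-list genderedness,
    rescaled by the variance of query genderedness (a regression slope)."""
    g_q = np.array([genderedness_text(q, emb, w_g) for q in queries])
    g_l = np.array([genderedness_ranked_list(L, emb, w_g) for L in ranked_lists])
    cov = np.mean((g_q - g_q.mean()) * (g_l - g_l.mean()))
    return float(cov / g_q.var())
```

Read this way, positive values of 𝑚𝑠 indicate that more "masculine" queries retrieve more "masculine" ranked lists (stereotype reinforcement), while values close to zero or negative indicate neutral or counter-stereotypical behaviour.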
Informally, Equation 1 captures the agreement between the language of queries and documents along stereotypically gendered lines, as induced by an IR system 𝑠. A thorough assessment of the suitability of this equation to measure the GSR construct is an important and complex endeavour, undertaken in [4]. Here we show the precision of the projection function 𝑔(·) in finding interesting queries for the study of gender stereotypes in SEs. Figure 1b (1c) shows the ten queries with the lowest (highest) genderedness 𝑔(𝑞) in the Robust04 collection [8], i.e. those most associated with women (men) according to 𝑔(𝑞). Indeed, these are gendered queries, ranging from intrinsically gendered (mentioning women), to biologically gendered (mentioning menopause), to stereotypically gendered (with quilts and child among the words in stereotypically female queries, and dangerous and heroic in stereotypically male queries).

3. Experiments and Discussion
Our experiments on the Robust04 collection [8], omitted here for brevity, compare IR ranking algorithms from different families. We consider lexical models (e.g. BM25 [13]), semantic models (e.g. w2v add [14]) and neural architectures (e.g. MatchPyramid [15]). We find that semantic models, based on biased WEs, are the most prone to reinforcing gender stereotypes, while neural systems based on the same word representations can mitigate this effect. Indeed, neural models exhibit low GSR, comparable to that of lexical systems such as BM25. Moreover, we test the reliability of these conclusions by measuring GSR according to two different sets of WEs (Word2Vec [11] and fastText [16]), finding strong agreement between the two. Finally, we assess the impact of debiasing WEs [9] on downstream IR tasks. By measuring system performance and GSR both before and after debiasing, we find this approach to be superficial and insufficient to reduce the tendency of an IR system to reinforce gender stereotypes.

³ To be precise, 𝑔(ℒ𝑖) should be query-dependent [4]. Here we neglect this aspect.

Acknowledgments
Part of this work was supported by MIUR (Italian Ministry of Education, University and Research) under the initiative "Departments of Excellence" (Law 232/2016).

References
[1] K. Purcell, J. Brenner, L. Rainie, Search engine use 2012, 2012. URL: https://www.pewresearch.org/internet/wp-content/uploads/sites/9/media/Files/Reports/2012/PIP_Search_Engine_Use_2012.pdf.
[2] S. U. Noble, Algorithms of oppression: How search engines reinforce racism, NYU Press, 2018.
[3] A. G. Greenwald, D. E. McGhee, J. L. Schwartz, Measuring individual differences in implicit cognition: the implicit association test, Journal of Personality and Social Psychology 74 (1998) 1464–1480.
[4] A. Fabris, A. Purpura, G. Silvello, G. A. Susto, Gender stereotype reinforcement: Measuring the gender bias conveyed by ranking algorithms, Information Processing & Management 57 (2020) 102377.
[5] J. Otterbacher, J. Bates, P. Clough, Competent men and warm women: Gender stereotypes and backlash in image search results, in: Proc. of CHI 2017, ACM, 2017, pp. 6620–6631.
[6] M. Kay, C. Matuszek, S. A. Munson, Unequal representation and gender stereotypes in image search results for occupations, in: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, ACM, 2015, pp. 3819–3828.
[7] N. Rekabsaz, M. Schedl, Do neural ranking models intensify gender bias?, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2065–2068.
[8] D. Harman, The DARPA TIPSTER project, SIGIR Forum 26 (1992) 26–28.
[9] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, in: Advances in Neural Information Processing Systems, 2016, pp. 4349–4357.
[10] A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science 356 (2017) 183–186.
[11] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[12] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20 (2002) 422–446.
[13] S. E. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends in Information Retrieval 3 (2009) 333–389.
[14] I. Vulić, M.-F. Moens, Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings, in: Proc. of SIGIR 2015, ACM, 2015, pp. 363–372.
[15] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image recognition, in: Proc. of AAAI 2016, AAAI Press, 2016, pp. 2793–2799.
[16] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, ACL, 2017, pp. 427–431.