<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>stereotypes in textual documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvana Badaloni</string-name>
          <email>silvana.badaloni@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Rodà</string-name>
          <email>antonio.roda@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martino Scagnet</string-name>
          <email>martino.scagnet@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering</institution>
          ,
          <addr-line>via Gradenigo, 6, 35131 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Elena Cornaro Center on Gender Studies, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The presence of stereotypes associated with historically disadvantaged groups constitutes a strong limitation for the justice and welfare of society. Gender stereotypes are among the most deep-rooted and have, over time, given rise to conventions that permeate various aspects of social life, creating unfairness and sometimes discrimination. This study focuses on the possibility of identifying gender stereotypes in textual documents using Machine Learning and Natural Language Processing tools. To this end, a corpus of Italian-language texts was collected and 107 participants were asked to evaluate text sections by assigning a score that would reveal the presence of gender stereotypes (female or male). The collected data allowed the labelling of the text sections of the corpus, by assigning a “gender score” to each one. The dataset thus developed can be used to foster the development and/or evaluation of automatic tools for detecting gender stereotypes, facilitating the writing of more inclusive texts.</p>
      </abstract>
      <kwd-group>
        <kwd>gender bias</kwd>
        <kwd>gendered innovation</kwd>
        <kwd>fairness</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The problem of biased and unfair outcomes of AI-based systems is becoming increasingly clear.
One of the main causes is that Machine Learning algorithms are, by their nature, trained on
examples: they learn from data and can therefore absorb the stereotypes that run through the
data about people sharing a characteristic, such as gender identity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. If used to make automatic decisions, these potentially biased systems
could produce unfair or incorrect decisions that discriminate for or against some groups of
users. Moreover, the triangular relationship between algorithm, human and data, which becomes
increasingly relevant as collaboration between humans and AI increases, risks continually
feeding the spread of biases.
      </p>
      <p>
        While the concept of bias is very broad, gender-related biases are considered an essential
aspect of fairness [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, we believe that in the European socio-cultural context,
gender biases represent a particularly interesting case study for the Artificial Intelligence
community, for the several reasons listed below.
      </p>
      <p>First of all, numerous studies have shown that gender biases are deeply rooted in our society.
Therefore, the risk that the datasets used for many applications with great social impact
(autonomous driving vehicles, recommendation systems, personnel selection systems, etc.) contain
biases linked directly or indirectly to gender is very high.</p>
      <p>Secondly, gender biases affect roughly half of the population, so their presence has an
impact on a large number of people.</p>
      <p>Thirdly, given the wide spread of gender bias in our societies, it is relatively easy to find
datasets on which to experiment with analysis and debiasing techniques.</p>
      <p>Fourthly, in comparison with other types of bias (racial, social, etc.), it is easier to define the
categories subject to possible discrimination. Gender studies, while recognising the multiplicity
of gender identities, validate the existence of two well-defined polarities, male and female. The
existence of two prevailing categories facilitates the definition of experimental protocols for the
validation of analysis and debiasing techniques.</p>
      <p>Fifthly, following the usual practice of bringing our research experiences back into teaching,
promoting studies on gender biases in AI can facilitate the introduction of gender issues into our
computer science courses, with a twofold advantage: a) increasing the degree of involvement
of our female students, and b) making our male students aware of stereotypes and biases that
risk discriminating against their female counterparts, making their university and professional
careers more difficult.</p>
      <p>
        In this paper, we focus on this kind of bias and, in particular, on gender bias
as an open issue for applications based on Natural Language Processing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. How Word Embeddings
learn stereotypes has been the focus of much research on gender bias and artificial intelligence
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Since Word Embeddings are used as a knowledge base in many applications, biases in these
models can propagate into many NLP applications. In general, gender biases diffused in the
textual corpora used to train Word Embeddings are absorbed by the model: for example, words
related to traditionally male professions are found closer to inherently gendered words, such as
he or man, and vice versa. Techniques to reduce these biases have recently been studied [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
but the problem is still open, in particular for languages that are more grammatically
gendered, such as Italian [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Language has a profound impact on how we understand gender roles.
Gender-inclusive language is, therefore, a key tool for achieving gender
equality. Consequently, having tools to identify gender biases in texts is crucial to mitigating
their propagation. However, there is still a shortage of gender bias datasets for automating gender
bias detection with machine learning (ML) and natural language processing (NLP) techniques
(see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). In particular, as far as we know, there is no specific dataset in the Italian language.
      </p>
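      <p>As a purely illustrative sketch of the proximity effect described above, the following Python snippet measures how much closer a word lies to one gendered anchor than to another, using cosine similarity. The three-dimensional vectors and the function names are invented for this example and do not come from any trained embedding model.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gender_lean(word_vec, male_vec, female_vec):
    """Positive if the word is closer to the male anchor, negative if closer to the female one."""
    return cosine(word_vec, male_vec) - cosine(word_vec, female_vec)


# Toy vectors, made up for illustration only (not taken from a real model).
toy = {
    "he": [1.0, 0.0, 0.2],
    "she": [0.0, 1.0, 0.2],
    "engineer": [0.8, 0.3, 0.5],
}
lean = gender_lean(toy["engineer"], toy["he"], toy["she"])
```
</preformat>
      <p>With these toy values the lean is positive, mimicking a profession word that sits nearer to the male anchor. In a real setting the vectors would be read from a trained embedding model.</p>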
      <p>The present study focuses on the possibility of automatically identifying gender stereotypes
in textual documents. To this aim, we developed a corpus of texts in Italian, labelled according
to the gender (understood in a conventional way) of the reader to whom the text is addressed.
The labelled dataset is available at https://doi.org/10.5281/zenodo.10027951.
Texts were collected from various sources, assuming the presence of gender stereotypes
in some and gender neutrality in others. Then, voluntary participants were asked to rate the
gender of the reader at whom each text fragment was aimed. In the following sections, we present the
methodology used to collect the corpus and the participants’ annotations. Finally, we
provide a statistical analysis and discussion of the results.</p>
    </sec>
    <sec id="sec-3">
      <title>2. The dataset development</title>
      <sec id="sec-3-1">
        <title>2.1. Materials</title>
        <p>The first step for dataset building was the collection of an initial corpus of texts. To ensure the
presence of sentences with different degrees of gender bias (both feminine and masculine), a
number of articles were selected from magazines explicitly targeting either a female or a male
audience. Such a choice stems from the assumption that these magazines tend to have content
stereotyped for the gender they target in an attempt to maximize the number of interested
readers. Indeed, it is well known that men and women tend to conform to
gender stereotypes in order to align with social expectations. Magazines that dispense
advice on fashion, body care and physical training, and managing family or love
relationships, all of which are historically gendered in our society, proved useful for this purpose.</p>
        <p>In addition, to have more gender-neutral content, a number of articles were selected from
the website of the University of Padua (www.unipd.it), an institution that has a code of conduct
to limit gender stereotypes and to make its communications more inclusive.</p>
        <p>A total of 92 articles were collected. Each article was then divided into sections of 30 to 70
words, usually containing 2 or 3 sentences, so as to include the context necessary for
understanding the text. Table 1 gives details on the composition of the initial text corpus.</p>
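        <p>The text does not state how the segmentation was carried out. As one possible sketch, a greedy procedure could accumulate whole sentences until a section reaches the 30-word minimum without exceeding the 70-word maximum. The function name, the naive regular-expression sentence splitter and the decision to drop leftovers that are too short are all assumptions of this example.</p>
        <preformat>
```python
import re

MIN_WORDS = 30  # lower bound on section length, from the text
MAX_WORDS = 70  # upper bound on section length, from the text


def split_into_sections(article):
    """Greedily group whole sentences into sections of MIN_WORDS to MAX_WORDS words."""
    # Naive splitter: a sentence is a run of text ending in ., ! or ?
    # (abbreviations and decimal numbers would need a proper tokenizer).
    pieces = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", article)
    sentences = [p.strip() for p in pieces if p.strip()]

    sections, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if count + n > MAX_WORDS:
            # Adding this sentence would overshoot: drop the too-short buffer.
            current, count = [], 0
            if n > MAX_WORDS:
                continue  # a single over-long sentence cannot form a valid section
        current.append(sentence)
        count += n
        if count >= MIN_WORDS:
            # The section is long enough: close it and start a new one.
            sections.append(" ".join(current))
            current, count = [], 0
    return sections
```
</preformat>
        <p>Every section returned by this sketch has between 30 and 70 words; trailing text shorter than the minimum is discarded rather than merged, which is a design choice of the example rather than a documented step of the study.</p>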
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Participants</title>
        <p>Each section of the corpus was assessed and labelled through a questionnaire. A total of 107
participants responded to the invitation sent by email. Of these, 31 dropped out before
completing the questionnaire. An additional 5 participants were excluded because they completed
the task in less than 4 minutes, a time considered insufficient to provide reliable answers. Of the
remaining 71 participants (mean age 45.95), 57 identified as female, 13 as male, and 1 preferred
not to specify their gender.</p>
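        <p>The exclusion criteria can be expressed as a simple filter. The sketch below is only illustrative: the record type and its field names (completed, minutes) are hypothetical, while the threshold and the counts are those reported in the text.</p>
        <preformat>
```python
from dataclasses import dataclass

MIN_MINUTES = 4.0  # completion-time threshold stated in the text


@dataclass
class Participant:
    completed: bool  # False if the participant dropped out before the end
    minutes: float   # total time spent on the questionnaire


def retained(participants):
    """Keep participants who finished the questionnaire in at least MIN_MINUTES."""
    return [p for p in participants if p.completed and p.minutes >= MIN_MINUTES]


# Counts reported in the text: 107 respondents, 31 drop-outs, 5 too fast.
RESPONDENTS, DROPPED_OUT, TOO_FAST = 107, 31, 5
REMAINING = RESPONDENTS - DROPPED_OUT - TOO_FAST  # 71 = 57 + 13 + 1
```
</preformat>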
        <p>This imbalance underlines the need to broaden the pool of participants so that the
sample better reflects social reality.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Procedure</title>
        <p>
          The evaluation was carried out using an online questionnaire developed with the
PsyToolkit framework [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. Each participant is presented with 20 items, each showing, on a
single webpage, a section of text and an assessment scale. Of these, 18 are text sections
randomly selected from the initial corpus and 2 are control questions, used to discard
participants who answer randomly or inattentively. The control questions also consist of a section
of text, but at some point the text makes explicit that the item is a control question and that the
participant must give a certain answer regardless of the text. The questionnaire takes each
participant about 10 minutes.
        </p>
        <p>For each proposed section of text, the question posed to participants is: “You are asked to
assess the gender of the reader you think the text is aimed at.” The aim is for each participant,
based on their own experience and culture, to report the presence of gender stereotypes in the
texts.</p>
        <p>Gender rating is indicated on the following 5-point Likert scale (in parentheses the numerical
value assigned to each response, hidden from the participant): Completely female (-2), More
female than male (-1), Neutral (0), More male than female (+1), Completely male (+2).</p>
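        <p>For downstream processing, the scale can be encoded as a lookup table. This is a minimal sketch: the English labels are those given above (the wording shown to participants may have differed), and the helper name is hypothetical.</p>
        <preformat>
```python
# Numerical value assigned to each response on the 5-point scale.
LIKERT_VALUES = {
    "Completely female": -2,
    "More female than male": -1,
    "Neutral": 0,
    "More male than female": 1,
    "Completely male": 2,
}


def encode(label):
    """Return the numerical value of a Likert label; raises KeyError if unknown."""
    return LIKERT_VALUES[label]
```
</preformat>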
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Results</title>
        <p>To increase the reliability of the dataset labels, items answered by fewer than 5 participants
were discarded. The number of responses varies across items because of the random
assignment made by the survey system and because participants could refuse to answer
some items, presumably those with unclear or ambiguous sentences.</p>
        <p>In the end, of the initial text corpus of Table 1, only the 156 text sections that received
statistically consistent scores were included. Of these sections, 55 came from magazines addressed
to women, 49 from magazines addressed to men, and 52 from texts extracted from www.unipd.it. For each
item, the mean of the responses received was computed. Given how the Likert scale was defined,
a negative value indicates a text judged to be aimed at female readers, while a positive value
indicates a bias toward male readers.</p>
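        <p>The scoring step can be sketched as follows. The snippet implements only the two operations stated explicitly in the text (discarding items with fewer than 5 responses and averaging the rest); the further consistency criterion is not specified and is therefore not modelled. The function name and the example responses are invented for illustration.</p>
        <preformat>
```python
from statistics import mean

MIN_RESPONSES = 5  # items with fewer answers are discarded, as stated in the text


def gender_scores(responses_by_item):
    """Map each item id to the mean of its responses, skipping sparsely rated items."""
    return {
        item: mean(values)
        for item, values in responses_by_item.items()
        if len(values) >= MIN_RESPONSES
    }


# Hypothetical responses on the -2..+2 scale, invented for illustration.
example = {
    "item_a": [-2, -1, -1, 0, -2, -1],
    "item_b": [1, 2, 0, 1, 1],
    "item_c": [0, 1, -1],  # only 3 answers: discarded
}
scores = gender_scores(example)
```
</preformat>
        <p>Here item_a receives a negative score (judged as aimed at female readers), item_b a positive one, and item_c is dropped because it has too few answers.</p>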
        <p>Figure 1 shows the distribution of the scores assigned to the items. The average score
over all items is close to zero, with a slight tendency toward the female end of the range. This
is probably due both to the slightly higher number of texts from women’s magazines and to
the higher scores (in absolute value) obtained by the sections judged as female.</p>
        <p>In any case, the fact that the mean value approaches 0 indicates a sufficient balance
in the dataset between negative (female) and positive (male) scores.</p>
        <p>We also analysed the scores according to the origin of the texts in the initial
corpus.</p>
        <p>Table 2 shows a statistical description of the scores obtained for the three types of sources.
The scores confirm the assumption made about the sources: sections from women’s magazines
had a negative average score (-0.70), those from men’s magazines a positive average score (0.51),
and those from www.unipd.it a slightly negative score, but still very close to 0 (-0.17).</p>
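        <p>As a rough consistency check, and assuming that the three averages are plain means over the sections of each source, the overall mean can be approximated as a weighted average using the section counts reported earlier (55, 49 and 52). The result is only approximate, because the per-source averages are rounded to two decimals.</p>
        <preformat>
```python
# Section counts and mean scores per source, as reported in the text.
sources = {
    "women's magazines": (55, -0.70),
    "men's magazines": (49, 0.51),
    "www.unipd.it": (52, -0.17),
}

total_sections = sum(n for n, _ in sources.values())
overall_mean = sum(n * m for n, m in sources.values()) / total_sections
print(total_sections, round(overall_mean, 2))
```
</preformat>
        <p>The weighted average comes out at about -0.14 over 156 sections, in line with the slight tendency toward the female end of the range noted above.</p>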
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Conclusions</title>
      <p>
        As stated in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], some biases are inevitable in large language models, since these models learn
from vast amounts of text data and are exposed to the biases present within human language
and culture in their different forms of expression. First, there are inherent biases in language due
to the fact that language is the expression of culture. Second, cultural norms and values vary
significantly across communities and regions. Third, there are many definitions of fairness,
as it is a subjective concept. Last, language and culture are constantly evolving, with new
expressions, norms and biases emerging over time. Therefore, it is important that developers,
researchers and stakeholders continue to work on reducing biases by developing strategies for
identifying and mitigating them.
      </p>
      <p>This paper presented a novel labelled dataset to foster the development and/or
evaluation of automatic tools for detecting gender stereotypes in Italian texts. The analysis of the
results, and in particular the comparison between the participants’ scores and the expectations
deriving from the sources of the texts, supports the effectiveness of the adopted methodology,
based on an online questionnaire.</p>
      <p>
        It is worth noting some limitations. The current dataset release includes 156 labelled text
sections, a quantity that is certainly insufficient for use as a training dataset for machine
learning models. The dataset is therefore better suited as a test set for already trained models
or for algorithms aimed at estimating the gender score of texts, such as the one proposed
by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another aspect to note is that far more women than men took part in the dataset
annotation. Although a more balanced gender distribution is generally preferable, in this case
we do not believe that the imbalance makes the annotation less reliable.
Indeed, movements to denounce and raise awareness of discrimination against women in
our society have made women more alert to stereotypes in texts. In addition, the
imbalance in the gender distribution of our participants is due to the fact that many more men
dropped out of the annotation task before the end, and were thus excluded, which suggests a lower
awareness and interest among the men who received the invitation to participate.
      </p>
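      <p>One way to use the dataset as a test set, sketched here under stated assumptions, is to compare the scores predicted by a model with the human gender scores, for instance through the Pearson correlation and the mean absolute error. The choice of metrics, the function names and the example values are illustrative and are not part of the study.</p>
      <preformat>
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical values: human gender scores and the output of some model.
human = [-1.2, -0.4, 0.0, 0.6, 1.1]
model = [-0.9, -0.5, 0.2, 0.4, 1.3]
r = pearson(human, model)
mae = mean_absolute_error(human, model)
```
</preformat>
      <p>A high correlation would indicate that the model ranks texts along the female-male axis in a way similar to the annotators, while the mean absolute error shows how far the predicted scores are from the human ones on the -2 to +2 scale.</p>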
      <p>To the best of our knowledge, this is the first dataset of this kind for Italian texts. We plan to
continue this work by significantly increasing the size of the dataset, so that it will also be suitable
for training tasks, and by trying to avoid possible gender imbalances among the
participants.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the project “Creative Recommendations to avoid Unfair
Bottlenecks” of the Dept of Information Engineering of the University of Padova.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Badaloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodà</surname>
          </string-name>
          , et al.,
          <article-title>Gender knowledge and artificial intelligence</article-title>
          ,
          <source>in: Proceedings of the 1st Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming, BEWARE-22, co-located with AIxIA</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Leavy</surname>
          </string-name>
          , U. C. Dublin,
          <article-title>Gender Bias in Artificial Intelligence: The Need for Diversity and Gender Theory in Machine Learning</article-title>
          ,
          <source>in: Proc. of the ACM/IEEE 1st International Workshop on Gender Equality in Software Engineering</source>
          , Gothenburg, Sweden,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Doughman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Khreich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>El Gharib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Berjawi</surname>
          </string-name>
          ,
          <article-title>Gender bias in text: Origin, taxonomy, and implications</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolukbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Saligrama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Kalai</surname>
          </string-name>
          ,
          <article-title>Man is to computer programmer as woman is to homemaker? debiasing word embeddings</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Blodgett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé III</surname>
          </string-name>
          , H. Wallach,
          <article-title>Language (technology) is power: A critical survey of “bias” in NLP</article-title>
          ,
          <source>arXiv preprint arXiv:2005.14050</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Biasion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fabris</surname>
          </string-name>
          , G. Silvello,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Gender bias in italian word embeddings</article-title>
          , in: CLiC-it,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Doughman</surname>
          </string-name>
          , W. Khreich,
          <article-title>Gender bias in text: Labeled datasets and lexicons</article-title>
          ,
          <source>arXiv preprint arXiv:2201.08675</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Stoet</surname>
          </string-name>
          ,
          <article-title>Psytoolkit - a software package for programming psychological experiments using linux</article-title>
          ,
          <source>Behavior Research Methods</source>
          <volume>4</volume>
          (
          <year>2010</year>
          )
          <fpage>1096</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Stoet</surname>
          </string-name>
          ,
          <article-title>Psytoolkit: A novel web-based method for running online questionnaires and reaction-time experiments</article-title>
          ,
          <source>Teaching of Psychology</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <article-title>Should ChatGPT be biased? Challenges and risks of bias in large language models</article-title>
          , submitted to Machine Learning with Applications.
          <source>Preprint on arXiv:2304.03738</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fabris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          , G. Silvello,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Susto</surname>
          </string-name>
          ,
          <article-title>Gender stereotype reinforcement: Measuring the gender bias conveyed by ranking algorithms</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <elocation-id>102377</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.ipm.2020.102377</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>