An Italian dataset for the analysis of gender stereotypes in textual documents⋆

Silvana Badaloni1,2,†, Antonio Rodà1,∗,† and Martino Scagnet1,†

1 Department of Information Engineering, via Gradenigo 6, 35131 Padova, Italy
2 Elena Cornaro Center on Gender Studies, University of Padova, Italy

Abstract
The presence of stereotypes associated with historically disadvantaged groups constitutes a strong limitation to the justice and welfare of society. Gender stereotypes are among the most deep-rooted ones and have, over time, given rise to real conventions that permeate various aspects of social life, creating unfairness and sometimes discrimination. This study focuses on the possibility of identifying gender stereotypes in textual documents using Machine Learning and Natural Language Processing tools. To this end, a corpus of Italian-language texts was collected, and 107 participants were asked to evaluate its text sections by assigning a score revealing the presence of gender stereotypes (female or male). The collected data allowed the labelling of the text sections of the corpus by assigning a “gender score” to each one. The resulting dataset can be used to foster the development and/or evaluation of automatic tools for detecting gender stereotypes, facilitating the writing of more inclusive texts.

Keywords
gender bias, gendered innovation, fairness, artificial intelligence, machine learning

2nd Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming (BEWARE-23), co-located with AIxIA 2023, Rome, Italy, 6–9 November 2023
∗ Corresponding author.
† These authors contributed equally.
silvana.badaloni@unipd.it (S. Badaloni); antonio.roda@unipd.it (A. Rodà); martino.scagnet@studenti.unipd.it (M. Scagnet)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The problem of biased and unfair outcomes of AI-based systems is becoming increasingly clear. One of the main causes is that Machine Learning algorithms are, by their intrinsic nature, trained on examples: they learn from data and can therefore subsume and capture the stereotypes that run through those data and concern people sharing a characteristic, for example a gender identity [1]. If used to make automatic decisions, such potentially biased systems can produce unfair or incorrect outcomes that discriminate for or against some groups, putting certain categories of users at risk. Moreover, the triangular relationship between algorithms, humans, and data, which becomes increasingly relevant as collaboration between humans and AI grows, risks continually feeding the spread of biases.

While the concept of bias is very broad, gender-related biases are considered an essential aspect of fairness [2]. In particular, we believe that in the European social and cultural context, gender bias represents a particularly interesting case study for the Artificial Intelligence community, for several reasons listed below.

First of all, numerous studies have shown that gender biases are deeply rooted in our society.
Therefore, the risk that the datasets used for many applications with great social impact (autonomous driving vehicles, recommendation systems, personnel selection systems, etc.) contain biases linked directly or indirectly to gender is very high. Secondly, gender biases affect roughly half of the population, so their presence has an impact on a large number of people. Thirdly, given how widespread gender bias is in our societies, it is relatively easy to find datasets on which to experiment with analysis and debiasing techniques. Fourthly, in comparison with other types of bias (racial, social, etc.), it is easier to define the categories subject to possible discrimination: gender studies, while recognising the multiplicity of gender identities, validate the existence of two well-defined polarities, male and female, and the existence of two prevailing categories facilitates the definition of experimental protocols for the validation of analysis and debiasing techniques. Fifthly, following the usual practice of bringing our research experience back into teaching, promoting studies on gender bias in AI can facilitate the introduction of gender issues into our computer science courses, with a twofold advantage: a) increasing the degree of involvement of our female students, and b) making our male students aware of stereotypes and biases that risk discriminating against their female counterparts and making their university and professional careers more difficult.

In this paper, we will focus on this kind of bias and, in particular, we will deal with gender bias as an open issue for applications based on Natural Language Processing [3]. How Word Embeddings learn stereotypes has been the focus of much research on gender bias and artificial intelligence [4]. Since Word Embeddings are used as a knowledge base in many applications, biases in these models can propagate into many NLP applications. In general, gender biases diffused in the textual corpora used to train Word Embeddings are subsumed by the model: for example, words related to traditionally male professions are found closer to inherently gendered words, such as he or man, and vice versa. Techniques to reduce these biases have recently been studied [5], but the problem is still open, in particular for languages that are more grammatically gendered, such as Italian [6].

Language has a profound impact on how we understand gender roles. Gender-inclusive language is, therefore, a key tool for achieving gender equality. Consequently, having tools to identify gender biases in texts is crucial to mitigating their propagation. However, there is still a shortage of gender bias datasets with which to automate gender bias detection using machine learning (ML) and natural language processing (NLP) techniques (see [7]). In particular, as far as we know, there is no specific dataset for the Italian language.

The present study will focus on the possibility of automatically identifying gender stereotypes in textual documents. To this aim, a corpus of texts in Italian, labelled according to the gender (understood in a conventional, binary way) of the readership each text is addressed to, has been developed.¹ Texts have been collected from various sources, assuming the presence of gender stereotypes in some and gender neutrality in others. Then, voluntary participants were asked to rate the gender of the reader each text fragment was aimed at. In the following sections, we will present the methodology used to collect the corpus and the participants’ annotations. Finally, we will provide a statistical analysis and discussion of the results.

¹ The labeled dataset is available at https://doi.org/10.5281/zenodo.10027951
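To make the association effect mentioned above concrete, the following is a minimal illustrative sketch (not part of this paper's methodology): it loads a pretrained English embedding model through gensim and measures whether a few profession words sit closer to prototypically male or female words. The model name and the word lists are assumptions chosen only for illustration.

```python
# Illustrative sketch: measuring how strongly some profession words associate
# with gendered words in a pretrained embedding space. The model and the word
# lists below are examples, not resources used in this paper.
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # any pretrained KeyedVectors works

def gender_lean(word, male=("he", "man"), female=("she", "woman")):
    """Positive values: closer to male words; negative: closer to female words."""
    m = sum(kv.similarity(word, g) for g in male) / len(male)
    f = sum(kv.similarity(word, g) for g in female) / len(female)
    return m - f

for w in ["engineer", "programmer", "nurse", "homemaker"]:
    print(f"{w:12s} {gender_lean(w):+.3f}")
```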
2. The dataset development

2.1. Materials

The first step in building the dataset was the collection of an initial corpus of texts. To ensure the presence of sentences with different degrees of gender bias (both feminine and masculine), a number of articles were selected from magazines explicitly targeting either a female or a male audience. This choice stems from the assumption that such magazines tend to carry content stereotyped for the gender they target, in an attempt to maximize the number of interested readers. Indeed, it is a well-known phenomenon that men and women tend to conform to gender stereotypes in order to align with social expectations. For this purpose, magazines that dispense advice on fashion, on body care and physical training, and on managing family or love relationships (all topics that are historically gendered in our society) proved to be useful. In addition, to obtain more gender-neutral content, a number of articles were selected from the website of the University of Padua (www.unipd.it), an institution that has a code of conduct to limit gender stereotypes and to make its communications more inclusive.

A total of 92 articles were collected. Each article was then divided into sections 30 to 70 words long, usually containing 2 or 3 sentences, so as to include some of the context necessary for understanding the text. Table 1 gives details on the composition of the initial text corpus.

Table 1: Initial textual corpus.

Source            # Articles   # Sections   # Words   Avg. words per article
www.unipd.it      30           533          25602     853
Female magazine   32           548          25066     783
Male magazine     30           542          28523     950
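The paper does not specify the tooling used to segment the articles; as a minimal sketch, the 30-to-70-word segmentation described above could be implemented as follows, assuming a simple regular-expression sentence splitter (the function name and parameters are hypothetical).

```python
# Sketch of the 30-70 word segmentation described in Section 2.1 (hypothetical
# helper, not the authors' actual tooling): sentences are grouped greedily
# until the running word count reaches the target range.
import re

def split_into_sections(article: str, min_words: int = 30, max_words: int = 70):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    sections, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            # adding this sentence would overshoot: close the current section
            sections.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
        if count >= min_words:
            sections.append(" ".join(current))
            current, count = [], 0
    if current:  # keep any trailing sentences, even if shorter than min_words
        sections.append(" ".join(current))
    return sections
```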
2.2. Participants

Each section of the corpus was assessed and labeled using a questionnaire involving 107 participants, who responded to an invitation sent by email. Of these, 31 dropped out before completing the questionnaire. An additional 5 participants were excluded because they completed the task in less than 4 minutes, a time considered insufficient to provide reliable answers. Of the remaining 71 participants (mean age 45.95), 57 declared themselves female, 13 male, and 1 preferred not to specify their gender. These data underline the need to broaden the pool of participants, so as to obtain a sample that better reflects social reality.

2.3. Procedure

The evaluation activity was carried out using an online questionnaire developed within the PsyToolkit framework [8, 9]. Each participant is presented with 20 items that include, in a single webpage, a section of text and an assessment scale. 18 of the proposed text sections are randomly selected from the initial corpus; 2 items are control questions used to discard participants who answer randomly or inattentively. These control items also consist of a section of text, but at some point it is made explicit that the item is a control question and that the participant must give a certain answer regardless of the text. The time required for each participant is about 10 minutes.

For each proposed section of text, the question posed to participants is: “You are asked to assess the gender of the reader you think the text is aimed at.” The aim is that each participant, based on their own experience and culture, reports the presence of gender stereotypes in the texts. The gender rating is given on the following 5-point Likert scale (in parentheses, the numerical value assigned to each response, hidden from the participant): Completely female (-2), More female than male (-1), Neutral (0), More male than female (+1), Completely male (+2).

2.4. Results

To increase the reliability of the dataset labels, items answered by fewer than 5 participants were discarded. The uneven distribution of responses across items is due to the random assignment made by the survey system and to the possibility for participants to decline to answer some items, presumably those with unclear or ambiguous sentences. In the end, only the 156 text sections of the initial corpus (Table 1) that received statistically consistent scores were included. Of these sections, 55 came from magazines addressed to women, 49 from magazines addressed to men, and 52 from texts extracted from www.unipd.it.

For each item, the mean of the responses received was computed. Given how the Likert scale was defined, a negative value indicates a text judged to be aimed at female readers, while a positive value indicates a bias toward male readers. Figure 1 shows the distribution of scores assigned to each item. The average score over all items is close to zero, with a slight tendency toward the female end of the range. This is probably due both to the slightly higher number of texts from the female-magazine category and to the higher scores (in absolute value) obtained by the sections judged as female. In any case, the fact that the mean value approaches 0 is an indicator of a sufficient balance in the dataset between negative (female) and positive (male) scores.

Figure 1: Histogram of the mean scores assigned to the 156 text sections in the dataset. The evaluation scale goes from -2 (Completely female) to +2 (Completely male). The black line marks 0 (Neutral), the blue line the average value of all the scores.

Another interesting analysis concerns the scores obtained by the texts according to their origin in the initial corpus. Table 2 shows a statistical description of the scores obtained from the three types of sources.

Table 2: Scores assigned to the text sections, based on their origin in the initial corpus. M = mean, SD = standard deviation.

Source            # Sections   # Answers per item (M)   Score (M)   Score (SD)   min      max
www.unipd.it      52           6.29                     -0.1774     0.4847       -1.5     +0.8
Female magazine   55           6.62                     -0.7001     0.6703       -1.857   +1
Male magazine     49           6.02                     +0.5109     0.8084       -1.6     +2

The scores confirm the assumption made about the sources: sections from women's magazines had a negative average score (-0.70), those from men's magazines a positive average score (+0.51), and those from www.unipd.it a slightly negative average score, still very close to 0 (-0.17).
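The aggregation just described is simple to reproduce; the following is a minimal sketch, assuming a hypothetical long-format table of raw annotations (one row per participant-item answer). The released dataset may use a different schema.

```python
# Sketch of the label aggregation described in Section 2.4. The input format
# (columns 'section_id' and 'response') is an assumption for illustration.
import pandas as pd

LIKERT = {
    "Completely female": -2, "More female than male": -1, "Neutral": 0,
    "More male than female": 1, "Completely male": 2,
}

def aggregate(raw: pd.DataFrame, min_answers: int = 5) -> pd.DataFrame:
    """Map Likert labels to numeric scores and average them per text section,
    discarding sections answered by fewer than `min_answers` participants."""
    scored = raw.assign(score=raw["response"].map(LIKERT))
    stats = scored.groupby("section_id")["score"].agg(["count", "mean", "std"])
    kept = stats[stats["count"] >= min_answers]
    return kept.rename(columns={"mean": "gender_score"})
```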
3. Conclusions

As stated in [10], some biases are inevitable in large language models, since these models learn from vast amounts of text data and are thus exposed, in its different forms of expression, to the biases present within human language and culture. First, there are biases inherent in language, due to the fact that language is the expression of culture. Second, cultural norms and values vary significantly across communities and regions. Third, there are many definitions of fairness, as it is a subjective concept. Last, language and culture are constantly evolving, with new expressions, norms and biases emerging over time. Therefore, it is important that developers, researchers and stakeholders continue to work on reducing biases by developing strategies for identifying and mitigating them.

The present paper presented a novel labelled dataset to foster the development and/or evaluation of automatic tools for detecting gender stereotypes in Italian texts. The analysis of the results, and in particular the comparison between the participants' scores and the expectations deriving from the sources of the texts, supports the effectiveness of the methodology followed, based on an online questionnaire.

It is worth noting some limitations. The current dataset release includes 156 labeled text sections, a quantity that is certainly insufficient for use as a training set for machine learning models. The dataset is therefore better suited as a test set for already trained models or for algorithms aimed at estimating the gender score of texts, such as the one proposed in [11]. Another aspect to bear in mind is that far more of the annotators were female than male. Although a more balanced gender distribution is generally desirable, in this case we do not believe that the imbalance makes the annotation less reliable. Indeed, movements to denounce and raise awareness of discrimination against women in our society have made women more alert and aware of stereotypes in texts. In addition, the imbalance in the gender distribution of our participants is due to the fact that many more men than women dropped out of the annotation task before the end, thus being excluded, confirming the lower awareness of and interest in the topic among the men who received the invitation to participate.

To the best of our knowledge, this is the first dataset of its kind for Italian texts. We plan to continue this work by significantly increasing the size of the dataset, so that it will also be suitable for training tasks, and by trying to avoid gender imbalances among the participants.
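Since the dataset is proposed above as a test set for models that estimate a gender score, a minimal evaluation sketch could look as follows. The CSV file name, the column names, and the predict_score callable are assumptions made for illustration; see the Zenodo record for the actual file layout.

```python
# Sketch of evaluating a gender-score predictor against the labelled sections.
# File and column names below are hypothetical, not the released schema.
import pandas as pd
from scipy.stats import spearmanr

def evaluate(predict_score, dataset_csv: str = "gender_stereotypes_it.csv"):
    """predict_score maps an Italian text section to a float in [-2, +2]
    (negative: aimed at female readers, positive: aimed at male readers)."""
    data = pd.read_csv(dataset_csv)  # expected columns: text, gender_score
    predicted = data["text"].map(predict_score)
    rho, p_value = spearmanr(predicted, data["gender_score"])
    return {"spearman_rho": rho, "p_value": p_value}
```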
Acknowledgments

This work is partially supported by the project “Creative Recommendations to avoid Unfair Bottlenecks” of the Department of Information Engineering of the University of Padova.

References

[1] S. Badaloni, A. Rodà, et al., Gender knowledge and artificial intelligence, in: Proceedings of the 1st Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming, BEWARE-22, co-located with AIxIA, 2022.
[2] S. Leavy, Gender bias in artificial intelligence: The need for diversity and gender theory in machine learning, in: Proceedings of the ACM/IEEE 1st International Workshop on Gender Equality in Software Engineering, Gothenburg, Sweden, 2018.
[3] J. Doughman, W. Khreich, M. El Gharib, M. Wiss, Z. Berjawi, Gender bias in text: Origin, taxonomy, and implications, in: Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing, 2021, pp. 34–44.
[4] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, Advances in Neural Information Processing Systems 29 (2016).
[5] S. L. Blodgett, S. Barocas, H. Daumé III, H. Wallach, Language (technology) is power: A critical survey of “bias” in NLP, arXiv preprint arXiv:2005.14050 (2020).
[6] D. Biasion, A. Fabris, G. Silvello, G. A. Susto, Gender bias in Italian word embeddings, in: CLiC-it, 2020.
[7] J. Doughman, W. Khreich, Gender bias in text: Labeled datasets and lexicons, arXiv preprint arXiv:2201.08675 (2022).
[8] G. Stoet, PsyToolkit: A software package for programming psychological experiments using Linux, Behavior Research Methods 42 (2010) 1096–1104.
[9] G. Stoet, PsyToolkit: A novel web-based method for running online questionnaires and reaction-time experiments, Teaching of Psychology (2017).
[10] E. Ferrara, Should ChatGPT be biased? Challenges and risks of bias in large language models, submitted to Machine Learning with Applications, preprint arXiv:2304.03738 (2023).
[11] A. Fabris, A. Purpura, G. Silvello, G. A. Susto, Gender stereotype reinforcement: Measuring the gender bias conveyed by ranking algorithms, Information Processing & Management 57 (2020) 102377. doi:10.1016/j.ipm.2020.102377.