Mitigating Biases in Deep Learning Models: A Path Towards Fairness and Inclusivity

Ismael Garrido-Muñoz
Universidad de Jaén, Campus Las Lagunillas s/n, 23071 Jaén, Spain

Abstract
The emergence of large language models (LLMs) has revolutionized the field of natural language processing, facilitating remarkable progress across various domains. However, the inherent opaqueness of these models, which function as black boxes, presents significant challenges. The lack of transparency obstructs our comprehension of their internal mechanisms and decision-making processes, raising concerns about their reliability and fairness. Various forms of bias have already been identified within these models. It is crucial to identify where and how these biases are encoded within LLMs to enable the modifications necessary to ensure their safe and equitable application, free of social biases, in all kinds of areas. Given the extensive deployment of LLMs in real-world applications, their impact on individuals' lives is magnified. Thus, the subsequent phase of this thesis will focus on effectively mitigating biases in deep learning models.

Keywords
bias, deep learning, nlp, fairness, mitigation

1. Introduction
The advent of GPT-3 [1] has sparked massive adoption of this model, with predictions of its profound impact on the labor market, as outlined in [2]. This remarkable influence stems from the diverse range of capabilities that these models possess, including question answering, text generation, translation, summarization, information retrieval, acting as a conversational agent, programming assistance, educational support, storytelling, and more. However, despite the tremendous utility of LLMs, they also pose an emerging challenge: their tendency to operate as black boxes. While they exhibit impressive performance, their internal mechanisms and decision-making processes often remain opaque, making them difficult to comprehend and explain.
This lack of transparency gives rise to concerns regarding their trustworthiness, fairness, and the potential biases that may be embedded within them. The concept of a black box refers to a system or model whose inputs and outputs are known, but whose inner mechanisms and algorithms remain concealed or poorly understood. LLMs, with their complex neural networks and millions, or even billions, of parameters, are intricate black boxes that often surpass human comprehension. This opaqueness hampers our ability to fully grasp the decision-making processes of these models, making it difficult to tackle biases, recognize potential vulnerabilities, and guarantee ethical and responsible use. Consequently, there is a pressing need to enhance transparency and develop techniques that shed light on the inner workings of LLMs.
In recent years, artificial intelligence has made significant advances, and a substantial portion of this progress can be attributed to neural network models. These models, trained on extensive datasets, have showcased remarkable capabilities in capturing various aspects of reality. However, while their ability to capture reality with precision is commendable, it can also have negative implications. One such concern arises from their propensity to inadvertently perpetuate and replicate undesirable stereotypes.

Doctoral Symposium on Natural Language Processing from the Proyecto ILENIA, 28 September 2023, Jaén, Spain.
igmunoz@ujaen.es (I. Garrido-Muñoz), https://ismael.codes/, ORCID 0000-0001-6656-9679
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
These models are already being used in multiple production systems, such as medical systems [3], legal systems [4], hiring [5], content moderation [6], CRM [7], marketing [8], virtual assistants, harmful content detection [9], chatbots, etc. These systems are used in products despite having proven to be unsafe, and it is well known that such black boxes sometimes cause unintended harm. One example is the COMPAS recidivism-prediction system used in the US justice system, which assigned inaccurate recidivism scores to both white and black defendants: for white individuals the assigned score was lower than the actual risk, while for black individuals it was higher [10]. Another example is the medical system Optum [11], which systematically allocated fewer resources to the treatment of black patients than to white patients with the same level of need. These cases raise concerns about the fairness and potential harm that may arise from applying non-explainable models in certain situations. For instance, Amazon discontinued the use of a recruitment tool [12] when it was discovered to be biased against women. These examples highlight the presence of biases not only in language models but also in systems employing computer vision [13], audio processing [14], and linguistic corpora [15], [16]. It is crucial to address these biases, as they can perpetuate inequality and have real-world consequences; understanding and mitigating them is a pressing concern.
In the case of GPT-3 [1] (or its frontends such as ChatGPT or Bing Chat) or Google's alternative, Bard [17], studying these models directly is not feasible because they are provided as services through APIs or web interfaces. However, models with similar numbers of parameters and comparable capabilities have been released openly. For instance, Llama [18], Vicuna [19], Bloom [20], OPT [21], XGLM [22], and the recent Falcon [23] do provide access to the trained model weights.
This access enables us to review, correct, or mitigate any biases present in them. This will be the next step of the thesis. In the following section, we provide a brief overview of evaluation techniques, followed by a collection of the most relevant techniques for bias mitigation. A broader summary of the state of the art in studying bias in language models can be found in previous work [24].

2. Bias in NLP with deep learning
When we talk about bias in language models, we can approach it as a representational problem [25]. This refers to the bias that certain demographic groups face in terms of misrepresentation, including negative associations or even their absence in the data and, consequently, in the model. On the other hand, we can approach it as an allocation problem, which refers to issues in the distribution of opportunities or resources for individuals belonging to specific demographic groups.

2.1. Bias evaluation
There is extensive work on evaluating language models for bias, starting with the work of Bolukbasi et al. [26] on simple word embeddings. Later studies approached the bias issue from the perspective of coreference resolution, such as [27] with GloVe embeddings. Bias is also examined by measuring the association between concepts and protected attributes. Caliskan et al. [28] created the Word Embedding Association Test (WEAT) for this purpose. This test was extended by Dev et al. [29] and Manzini et al. [30], and further by Lauscher et al. [31], who added more protected attributes and applied it to languages other than English.
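To make the WEAT idea concrete, the following is a minimal sketch of its association and effect-size statistics. All word vectors here are invented two-dimensional toys for illustration; a real evaluation would use pretrained embeddings (e.g., GloVe) and the permutation test of Caliskan et al. [28] for significance:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity of w to attribute set A
    # minus its mean similarity to attribute set B
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Difference of mean associations of the two target sets,
    # normalised by the pooled standard deviation (range roughly [-2, 2])
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Toy embeddings: target sets X, Y and attribute sets A, B
X = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]  # e.g. career words
Y = [np.array([0.1, 1.0]), np.array([0.2, 0.9])]  # e.g. family words
A = [np.array([1.0, 0.0])]                        # e.g. male terms
B = [np.array([0.0, 1.0])]                        # e.g. female terms

print(weat_effect_size(X, Y, A, B))  # large positive: X aligns with A, Y with B
```

A large positive effect size indicates that the first target set is more strongly associated with the first attribute set, which is how WEAT quantifies stereotypical associations in an embedding space.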
WEAT was later adapted to more complex contextual models like BERT under the name SEAT by May et al. [32], with further work by Tan and Celis [33]. SEAT tests the protected attribute against a sentence instead of a single word, making it suitable for contextual models; this line of work was subsequently extended to consider the full context rather than just the sentence level, and the latest evaluation methods are applied to models such as GPT-2, BERT, and ELMo. Other approaches exist for these more complex models: Vig [34] introduced visualization tools to understand where such models capture unwanted biases by examining their attention.
More complex models also make serious errors. A compendium of errors discovered in ChatGPT is presented in the work of Borji [35]. The paper explains that this model is unable to successfully complete tasks that require spatial, temporal, or physical reasoning unless it has been specifically trained for those tasks.

2.2. Bias correction
The main approaches to addressing bias in language models are the following: fine-tuning the model [36], data augmentation to balance categories and avoid distortions towards one category [27], protecting the attribute during model training to prevent bias capture [37], or correcting the vector space of the model, as presented in the works of Manzini et al. [30], Zhou et al. [38], and Dev and Phillips [37]. Among these techniques, fine-tuning and model editing are considered the most realistic, especially for large-scale models, since retraining a model from scratch would be very costly in terms of time, hardware resources, money, and the effort required to pre-process and tune the training data.
One of the most promising techniques for model editing involves identifying how the model encodes certain knowledge and then making edits accordingly. The proposal of Meng et al.
[39] focuses on editing factual knowledge and serves as a foundation for further adaptation. This technique first identifies, via causal mediation analysis, the parts of the model that contribute the most weight to choosing the last token. From there, the model's weights are edited to guide it towards the desired token. For example, if the model answers "Obama" to the question "What is the surname of the U.S. president?", the relevant weights can be located and corrected so that the model selects the desired token "Biden", since this would be the updated and accurate answer. Similarly, this method can be generalized to make broader corrections: in a subsequent work [40], the authors adapt it to perform mass corrections across the model weights. They then evaluate whether the model only edits knowledge for the specific context given in the prompt or whether it generalizes, by asking about the same fact using different questions or contexts.
These techniques hold great potential for tackling bias and for enhancing the accuracy and reliability of language models. By facilitating targeted edits that align with desired outcomes, they enable the mitigation of unwanted biases in the models' responses, contributing to fairer, more reliable, and less biased outputs.

3. Relevance of the problem
Every day, these enormous models are increasingly integrated into various products and production systems. However, this integration comes with its own set of challenges. From an economic standpoint, deploying a biased system can lead to significant disadvantages, as it may not function effectively for all users. At the same time, the impact of these models on people's lives cannot be overlooked. There are specific contexts, such as systems for resource distribution, employment, or bank credit, where it is crucial to avoid using models that may contain any form of bias.
Therefore, it is imperative to thoroughly study bias in these models and understand its underlying causes. This knowledge will enable us either to avoid deploying biased models altogether or to develop strategies for mitigating harmful biases when they arise.
Furthermore, when a language model is identified as not performing adequately in a production system, such as a commercial product, companies face important decisions. Given the immense size and cost associated with training these models, some proposed solutions may be difficult to justify from an economic perspective. For instance, training the model from scratch with revised, filtered, or corrected training data would entail significant expenses. Another option, albeit costly, could involve discontinuing the use of the model, as a poorly performing model is unsuitable for deployment in production systems. This proposition gains relevance considering the potential non-compliance of such models with new European AI regulations [41]. Alternatively, more practical approaches could involve retraining the model or leveraging state-of-the-art bias mitigation techniques to address the identified issues. The choice of approach will depend on various factors, including the severity of the bias, the feasibility of retraining or editing the model, and the legal and ethical obligations that must be met. Regardless of the chosen course of action, it is essential to proactively address and rectify bias issues to ensure the responsible and fair deployment of language models in real-world applications. By doing so, we can foster inclusivity, promote equitable outcomes, and uphold the principles of fairness and ethical AI.

4. Hypotheses and objectives
The following hypothesis is assumed: given a language model based on deep learning, it is possible to discern whether it contains biases, and to characterize, measure, and mitigate them.
The following objectives are established:
• Conduct an intensive study of the state of the art regarding the detection, evaluation, and mitigation of biases in deep learning models.
• Analyze and characterize biases present in existing models.
• Develop techniques and algorithms for the unsupervised or semi-supervised detection and characterization of bias in existing models.
• Develop techniques and algorithms for the mitigation or correction of bias in existing models.
At this phase of the thesis, our primary focus is on the last point.

5. Methodology and the proposed experiments
As we move forward with the use of large language models, our next step will involve adapting and evaluating our previous work [42] in the context of LLMs. Specifically, that study shed light on how models tend to perceive women based on their physical appearance, while men are assessed primarily based on their behavior; this pattern was observed across the majority of the models investigated. To proceed, we will replicate the aforementioned experiment using LLMs and analyze to what extent increasing the model size affects bias, whether exacerbating or reducing it. Once this evaluation is completed, our focus will shift towards bias mitigation strategies.
To mitigate bias, we will construct a corpus of prompts that elicit biased responses from the models. This corpus will serve as a foundation for our work in two main areas. First, we will develop methods to detect and identify biased terms produced by the model in its responses. Second, we will explore the previously discussed fact-editing techniques to edit the behavior of the model for the detected biases in order to reduce or eliminate them. This will require adapting the causal mediation analysis mechanism to our problem, since editing a specific fact is not the same as making an edit that involves a trade-off between different classes of a protected attribute.
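The core update behind this kind of fact editing can be illustrated with a minimal sketch: given a key vector k (an internal representation associated with the queried subject) and a desired value vector v (the representation that produces the corrected answer), the smallest change to a weight matrix W that maps k to v is a rank-one update. This toy numpy version only shows that core step; the actual method of Meng et al. [39] additionally locates the layer to edit via causal tracing and uses a covariance-weighted variant of this formula:

```python
import numpy as np

def rank_one_edit(W, k, v):
    # Smallest-Frobenius-norm update delta such that (W + delta) @ k == v.
    residual = v - W @ k                      # gap between current and desired output
    delta = np.outer(residual, k) / (k @ k)   # rank-one correction along k
    return W + delta

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # toy stand-in for an MLP projection matrix
k = rng.normal(size=3)        # key: representation triggering the fact
v = rng.normal(size=4)        # value: representation yielding the new answer

W_edited = rank_one_edit(W, k, v)
print(np.allclose(W_edited @ k, v))  # True: the edited matrix maps k to v
```

Adapting this to bias mitigation is precisely the open problem noted above: instead of redirecting one key to one new value, the edit must rebalance the model's behavior across the classes of a protected attribute.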
After the editing process, we will evaluate the model on the same set of prompts to check the effectiveness of the mitigation method. By undertaking these steps, we aim to gain insight into the behavior of large language models regarding bias and to work towards developing effective strategies for bias mitigation.

References
[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, CoRR abs/2005.14165 (2020). URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[2] T. Eloundou, S. Manning, P. Mishkin, D. Rock, GPTs are GPTs: An early look at the labor market impact potential of large language models, ArXiv abs/2303.10130 (2023).
[3] S. Velupillai, H. Suominen, M. Liakata, A. Roberts, A. D. Shah, K. Morley, D. Osborn, J. Hayes, R. Stewart, J. Downs, W. Chapman, R. Dutta, Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances, J Biomed Inform 88 (2018) 11–19.
[4] R. Dale, Law and word order: NLP in legal tech, Natural Language Engineering 25 (2019) 211–217. doi:10.1017/S1351324918000475.
[5] M. Bogen, A. Rieke, Help wanted: an examination of hiring algorithms, equity, and bias, 2018.
[6] T. Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media, 2018. doi:10.12987/9780300235029.
[7] Salesforce, Salesforce Announces AI Cloud – Bringing Trusted Generative AI to the Enterprise, https://investor.salesforce.com/press-releases/press-release-details/2023/Salesforce-Announces-AI-Cloud--Bringing-Trusted-Generative-AI-to-the-Enterprise/default.aspx, 2023. [Accessed 18-Jun-2023].
[8] Adobe, Adobe Announces New Sensei GenAI Services to Reimagine End-to-End Marketing Workflows, https://news.adobe.com/news/news-details/2023/Adobe-Announces-New-Sensei-GenAI-Services-to-Reimagine-End-to-End-Marketing-Workflows/default.aspx, 2023. [Accessed 18-Jun-2023].
[9] S. Tabahriti, Twitter is now relying more on AI to identify harmful content, says its new trust and safety chief, https://www.businessinsider.com/twitter-now-relying-more-ai-identify-harmful-content-2022-12, 2022. [Accessed 18-Jun-2023].
[10] J. Angwin, J. Larson, Machine bias – there's software used across the country to predict future criminals. And it's biased against blacks., 2016. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
[11] Z. Obermeyer, S. Mullainathan, Dissecting racial bias in an algorithm that guides health decisions for 70 million people, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019. URL: https://dl.acm.org/doi/10.1145/3287560.3287593.
[12] J. Dastin, Amazon scraps secret AI recruiting tool that showed bias against women, 2018. URL: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G.
[13] A. Howard, J. Borenstein, Trust and bias in robots, 2019. URL: https://www.americanscientist.org/article/trust-and-bias-in-robots.
[14] J. Rodger, P. Pendharkar, A field study of the impact of gender and user's technical experience on the performance of voice-activated medical tracking application, Int. J. Hum.-Comput. Stud.
60 (2004) 529–544. doi:10.1016/j.ijhcs.2003.09.005.
[15] J. A. Bullinaria, J. P. Levy, Extracting semantic representations from word co-occurrence statistics: A computational study, Behavior Research Methods 39 (2007) 510–526. URL: https://doi.org/10.3758/BF03193020. doi:10.3758/BF03193020.
[16] M. Barlow, Michael Stubbs. Text and corpus analysis: Computer-assisted studies of language and culture, International Journal of Corpus Linguistics 3 (1998) 319–327.
[17] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. T. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., PaLM 2 technical report, ArXiv abs/2305.10403 (2023).
[18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, ArXiv abs/2302.13971 (2023).
[19] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[20] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., BLOOM: A 176B-parameter open-access multilingual language model, ArXiv abs/2211.05100 (2022).
[21] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer, OPT: Open pre-trained transformer language models, ArXiv abs/2205.01068 (2022).
[22] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O'Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. T. Diab, V. Stoyanov, X. Li, Few-shot learning with multilingual language models, CoRR abs/2112.10668 (2021).
URL: https://arxiv.org/abs/2112.10668. arXiv:2112.10668.
[23] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Heslow, J. Launay, Q. Malartic, B. Noune, B. Pannier, G. Penedo, Falcon-40B: an open large language model with state-of-the-art performance (2023).
[24] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, L. A. Ureña-López, A survey on bias in deep NLP, Applied Sciences 11 (2021). URL: https://www.mdpi.com/2076-3417/11/7/3184. doi:10.3390/app11073184.
[25] K. Ramesh, S. Sitaram, M. Choudhury, Fairness in language models beyond English: Gaps and challenges, in: Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 2106–2119. URL: https://aclanthology.org/2023.findings-eacl.157.
[26] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, A. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, CoRR abs/1607.06520 (2016). URL: http://arxiv.org/abs/1607.06520. arXiv:1607.06520.
[27] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias in coreference resolution: Evaluation and debiasing methods, arXiv e-prints (2018) arXiv:1804.06876.
[28] A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science 356 (2017) 183–186. URL: https://www.science.org/doi/abs/10.1126/science.aal4230. doi:10.1126/science.aal4230.
[29] S. Dev, T. Li, J. M. Phillips, V. Srikumar, OSCaR: Orthogonal subspace correction and rectification of biases in word embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5034–5050. URL: https://aclanthology.org/2021.emnlp-main.411.
doi:10.18653/v1/2021.emnlp-main.411.
[30] T. Manzini, L. Yao Chong, A. W. Black, Y. Tsvetkov, Black is to criminal as Caucasian is to police: Detecting and removing multiclass bias in word embeddings, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 615–621. URL: https://aclanthology.org/N19-1062. doi:10.18653/v1/N19-1062.
[31] A. Lauscher, G. Glavas, S. P. Ponzetto, I. Vulic, A general framework for implicit and explicit debiasing of distributional word vector spaces, in: AAAI, 2020.
[32] C. May, A. Wang, S. Bordia, S. R. Bowman, R. Rudinger, On measuring social biases in sentence encoders, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 622–628. URL: https://aclanthology.org/N19-1063. doi:10.18653/v1/N19-1063.
[33] Y. C. Tan, L. E. Celis, Assessing social and intersectional biases in contextualized word representations, in: NeurIPS, 2019.
[34] J. Vig, A multiscale visualization of attention in the transformer model, 2019, pp. 37–42. doi:10.18653/v1/P19-3007.
[35] A. Borji, A categorical archive of ChatGPT failures, 2023. doi:10.21203/rs.3.rs-2895792/v1.
[36] R. H. Maudslay, H. Gonen, R. Cotterell, S. Teufel, It's all in the name: Mitigating gender bias with name-based counterfactual data substitution, CoRR abs/1909.00871 (2019). URL: http://arxiv.org/abs/1909.00871. arXiv:1909.00871.
[37] S. Dev, J. M. Phillips, Attenuating bias in word vectors, CoRR abs/1901.07656 (2019). URL: http://arxiv.org/abs/1901.07656. arXiv:1901.07656.
[38] P. Zhou, W. Shi, J. Zhao, K.-H. Huang, M. Chen, K.-W.
Chang, Analyzing and mitigating gender bias in languages with grammatical gender and bilingual word embeddings, in: ACL 2019, 2019.
[39] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual associations in GPT, Advances in Neural Information Processing Systems 35 (2022).
[40] K. Meng, A. Sen Sharma, A. Andonian, Y. Belinkov, D. Bau, Mass-editing memory in a transformer, arXiv preprint arXiv:2210.07229 (2022).
[41] H. Ziady, Europe is leading the race to regulate AI. Here's what you need to know, CNN Business, https://edition.cnn.com/2023/06/15/tech/ai-act-europe-key-takeaways/index.html, 2023. [Accessed 18-Jun-2023].
[42] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, MarIA and BETO are sexist: Evaluating gender bias in large language models for Spanish, Language Resources and Evaluation (2022).