A Methodology based on Rebalancing Techniques to measure and improve Fairness in Artificial Intelligence algorithms

Ana Lavalle, Lucentia Research (DLSI), University of Alicante (Spain), alavalle@dlsi.ua.es
Alejandro Maté, Lucentia Research (DLSI), University of Alicante (Spain), amate@dlsi.ua.es
Juan Trujillo, Lucentia Research (DLSI), University of Alicante (Spain), jtrujillo@dlsi.ua.es
Jorge García, Lucentia Research (DLSI), University of Alicante (Spain), jorge.g@ua.es

© Copyright 2022 for this paper by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Artificial Intelligence (AI) has become one of the key drivers for the next decade. As important decisions are increasingly supported or directly made by AI systems, concerns regarding the rationale and fairness of their outputs are becoming more and more prominent. Following the recent interest in fairer predictions, several metrics for measuring fairness have been proposed, leading to different objectives that may need to be addressed in different ways. In this paper, we propose (i) a methodology for analyzing and improving fairness in AI predictions by selecting sensitive attributes that should be protected; (ii) we analyze how the most common rebalancing approaches affect the fairness of AI predictions and how they compare to the alternatives of removing the protected attribute or creating separate classifiers for each group within it; and (iii) our methodology generates a set of tables that can be easily computed for choosing the best alternative in each particular case. The main advantage of our methodology is that it allows AI practitioners to measure and improve fairness in AI algorithms in a systematic way. In order to validate our proposal, we have applied it to the COMPAS dataset, which has been widely demonstrated to be biased by several previous studies.

1 INTRODUCTION
The use of Artificial Intelligence (AI) systems is rapidly spreading across many different sectors and organizations. More and more important decisions are being made with the support of AI algorithms. Therefore, it is essential to ensure that these decisions do not reflect discriminatory behavior towards certain groups. However, given the lack of an adequate methodology, creating fair AI systems has proven to be a complex and challenging task [16].

As AI becomes more widely used, big companies and governments are delegating responsibilities to AI systems that have not been thoroughly evaluated. In turn, some of the decisions taken have been biased and unfair (e.g., the AI system used by Amazon to qualify job applicants [22] or the granting of credit for the Apple credit card [20]). One of the most notorious cases where AI tools have acted in a biased and unfair way is COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). This software has been used by judges in order to decide whether to grant parole to criminals or keep them in prison. The output is provided by an algorithm that evaluates the probability of a criminal defendant becoming a recidivist.

Unfortunately, several studies have shown that the recidivism prediction scores are biased [1, 2]. The algorithm showed discriminatory behavior towards African-American inmates, who were almost three times more likely to be classified as high-risk inmates than Caucasian inmates [1].

As a result of this trend, AI research communities have recently increased their attention towards the issue of AI algorithms' fairness. The IEEE Standards Association pays attention to the meaning and impact of algorithmic transparency [18]. Moreover, these issues are also aligned with the ethical guidelines for trustworthy AI presented by the European Commission [8]. Therefore, it is essential to ensure that the decisions made by AI solutions do not reflect discriminatory behavior.

Nevertheless, to the best of our knowledge, most existing approaches are mainly focused on improving the prediction accuracy of algorithms, while the fairness of the output is relegated to a second-class metric [5, 11, 14]. Thus, there has not been any proposal or methodology that guides AI practitioners in choosing the best features to avoid unfair and discriminatory outputs from AI algorithms.

In this paper, we propose a methodology that considers fairness as a first-class citizen. Our methodology measures and evaluates the impact of dataset rebalancing techniques on AI fairness. The novelty of our methodology is that it introduces new steps with respect to the traditional AI development process, namely: (i) a bias analysis, (ii) a fairness definition and (iii) a fairness evaluation.
Moreover, another novelty of our methodology is that it helps to improve fairness by applying rebalancing approaches that consider not only the target variable(s), but also sensitive attributes in the dataset that should be protected from discrimination. In order to both exemplify our approach and test the impact of each rebalancing alternative, we implement a classifier over the COMPAS dataset, calculating the degree of fairness obtained according to three different fairness definitions.

2 RELATED WORK
Bias can appear in many forms. [16] groups and lists the different types of bias that can affect AI solutions according to where they appear: from Data to Algorithm, when AI algorithms are trained with biased data, the output of these algorithms might also be biased; from Algorithm to User, when bias arising as a result of an algorithm's output affects users' behavior; or from User to Data, when the data sources used for training AI algorithms are generated by users, so that historical socio-cultural issues can be introduced into the data even when perfect sampling and feature selection are carried out.

To tackle these situations, researchers have proposed different techniques that can be grouped into the following perspectives. The Data Perspective, where the class distribution is artificially rebalanced by sampling the data. This rebalancing can be done by Oversampling [14], creating more data in the minority classes; Undersampling [11], eliminating data from the majority classes; or others like SMOTE [5], where minority classes are oversampled by interpolating between neighboring data points. However, these techniques must be used with great care, as they can lead to the loss of certain characteristics of the data. An alternative perspective is the Algorithmic Perspective, whose solutions adjust the hyperparameters of the learning algorithms. Finally, the Ensemble Approach mixes aspects from both the data and algorithmic perspectives.

Most of these approaches mainly focus on improving the prediction accuracy of algorithms, while the fairness of the output is relegated to a second-class metric. As [21] states, accuracy is no longer the only concern when developing models. Fairness must be taken into account as well in order to avoid more cases like those presented in the introduction.

Moreover, as [9] argues, modifying data sources or restricting models in order to improve fairness can harm the predictive accuracy. The fairness of predictions should be evaluated in the context of the data. Unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model [6].

Thus, differently from the above-presented proposals, we propose a novel methodology that considers fairness as a first-class citizen from the very beginning of the AI process. We drive the whole process considering protected attributes during the rebalancing step and leading the AI practitioner to a conscious decision on the trade-off (if necessary) between accuracy and fairness.
3 IMPROVING FAIRNESS IN ARTIFICIAL INTELLIGENCE
Tackling AI challenges requires awareness of the context where algorithms will not only be trained, but also where they will generate outputs. Biases and errors that go unnoticed lead to wrong or unfair decisions. Moreover, since training AI algorithms is a time-consuming task (several days or weeks), developing them without a clear direction may result in a considerable waste of resources. For this reason, we propose the methodology shown in Fig. 1. By following this methodology, AI practitioners will be able to analyze and improve fairness in AI predictions.

The first step in our methodology (Fig. 1) starts with the definition of the Target Variable by AI practitioners. Then, during the Bias Analysis step, the algorithm proposed in [15] is executed in order to detect existing biases in the dataset. This algorithm's output provides an overview of how biased the attributes of the dataset are. Moreover, this information helps practitioners to select the Protected Attribute(s), such as race, gender, or any other attribute that requires special attention to ensure fair treatment. If protected attributes have been detected in the dataset, a Definition of Fairness is established in order to allow practitioners to measure whether the AI system is really being fair. Then, Data Rebalancing (when necessary) is accomplished and AI practitioners proceed to the Algorithm Training. Finally, we propose a set of tables and visualizations in order to interpret the Algorithm Results.

In the following, we further describe all the steps of our methodology by applying it in a real case study.

Figure 1: Methodology to mitigate bias in AI algorithms (Target Variable Definition → Bias Analysis → Protected attributes? → Fairness Definition → Data Rebalancing → Algorithm Training → Algorithm Results Interpretation).

3.1 Dataset Description
The dataset chosen in order to apply our methodology in a real case study is the ProPublica COMPAS dataset, available at [17]. This dataset includes information about criminal defendants who were evaluated with COMPAS scores in the Broward County Sheriff's Office in Florida during 2013 and 2014.

For each accused person (case), the dataset contains demographic information (race, gender, etc.), criminal history and administrative information. Finally, the dataset also contains information about whether the accused actually recidivated within the next two years. This dataset is highly imbalanced: the representation of the different races is heavily skewed. In the following, we apply our methodology step by step.

3.2 Target Variable Definition
The first step of our proposed methodology is to define the target variable. In this case, the target variable is "v_score_text", which uses three levels (Low, Medium, High) to classify the risk of recidivism. For the sake of simplicity, we binarize the target variable by mapping the Low class to Non-Recidivist, and the Medium and High classes to the Recidivist class, thereby facilitating the analysis presented in the following. Therefore, the target variable is defined as Risk of recidivism (0 Non-recidivist, 1 Recidivist).
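As an illustration, this binarization is a one-line operation once the dataset has been loaded. The following is a minimal sketch, assuming the public compas-scores-two-years.csv export and the column name given above; it is not the exact code from our repository.

```python
import pandas as pd

# Load the ProPublica COMPAS export (the file name and column names may
# differ slightly between versions of the dataset).
df = pd.read_csv("compas-scores-two-years.csv")

# Binarize the target: Low -> 0 (Non-recidivist), Medium/High -> 1 (Recidivist).
df["target"] = (df["v_score_text"] != "Low").astype(int)

print(df["target"].value_counts(normalize=True))
```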
consuming task (several days or weeks), developing them without In this case of study, the bias ratios were (Race: 9.95, Sex: a clear direction may result in considerable waste of resources. 7.60 and Age category: 6.28), the most biased attribute was race For this reason, we propose the methodology shown in Fig. 1. and it was selected as a protected attribute. The main reason By following this methodology, AI practitioners will be able to is that the race of the accused should never be a characteristic analyze and improve fairness in AI predictions. that influences the classification of risk of recidivism (the target The first step in our methodology (Fig. 1) starts with the defini- variable). Therefore the Protected attribute is defined as Race. tion of the Target Variable by AI practitioners. Then, during the Furthermore, a visualization (Fig. 2) that groups the predicted Bias Analysis step, the algorithm proposed in [15] is executed target variable (risk of recidivism) by the attributes selected as in order to detect existing biases in the dataset. This algorithm protected (race) is created. As cleary observed, there is a high output will provide an overview of how biased the attributes of risk in accused of African-American race than in the rest of races. the dataset are. Moreover, this information will help practition- Once the dataset has been analyzed and the bias has been ers to select the Protected Attribute/s such as race, gender, or located, AI practitioners will have more detailed knowledge in any other that requires special attention to ensure fair treatment. order to detect the types of bias that might arise. Among the types Whether protected attributes have been detected in the dataset, of bias which can appear, those relevant for our methodology are a Definition of Fairness will be launched in order to allow categorized in Data to Algorithm bias as described by [16]: practitioners to measure whether the AI system is really being • Measurement Bias: Arises when we choose and mea- fair. Then, a Data Rebalancing (whether necesary) will be ac- sure features of interest. If a group is monitored more complished and AI practitioners will proceed to the Algorithm frequently, more errors will be observed in that group. Training. Finally, we propose a set of tables and visualizations • Omitted Variable Bias: When important variables are in order to interpret the Algorithm Results. left out of the model. In the following, we will further describe all the steps of our • Representation Bias: Arises in the data collection pro- methodology by applying it in a real case study. cess when data does not represent the real population. Target Variable Bias Analysis Definition AI Practitioners Protected Yes Fairness attributes? Definition No Data Algorithm Rebalancing Training Algorithm Results Interpretation Figure 1: Methodology to mitigate bias in AI algorithms • Aggregation Bias: When false conclusions are drawn In our case study, the race attribute was considerably biased. about individuals from observing the entire population. As this is a protected attribute, it is important to define one or Data from several groups (i.e. cities, races, age groups, etc.) more metrics that quantify the fairness of the results. can be correlated differently across classes. 
3.4 Fairness Definition
The presence of bias can eventually derive into unfair results, especially when the bias is present in protected attributes. Thus, analyzing which biases might be present in the current problem is essential to determine which fairness metrics are more important. In our case study, the race attribute was considerably biased. As this is a protected attribute, it is important to define one or more metrics that quantify the fairness of the results.

As [19] argues, in general terms, fairness can be defined as the absence of any prejudice or favoritism towards an individual or a group. However, although fairness is a quality highly desired by society, it can be surprisingly difficult to achieve in practice [16].

Therefore, with the aim of defining, limiting and being able to measure whether fairness is being achieved, our proposed method makes AI practitioners reflect on the type of justice that they want to achieve. Among the types of justice we can find:
• Individual Fairness: Give similar predictions to similar individuals [10], i.e., points that are closer to each other in the feature space should have similar predictions.
• Group Fairness: Treat different groups equally [10].
• Subgroup Fairness: Try to obtain the best properties of the group and individual notions of fairness. It picks a statistical fairness constraint (like equalizing false positive rates across protected groups) and asks whether this constraint holds over a large collection of subgroups [13].

In this case study, Group Fairness has been selected, since the race attribute has been chosen as protected and fairness is sought between the different racial groups. Specifically, the following definitions of Group Fairness have been followed:
• Equalized Odds: Groups within protected attributes must have the same ratios of true and false positives [12]. As equality of odds can be really difficult to achieve, it can be decomposed into two more relaxed versions:
• Equal Opportunity: Groups within protected attributes must have equal true positive rates [12].
• Predictive Equality: False positive rates must be equal across all groups of the protected attribute [4].

Depending on the problem, one definition could be more important than another. For example, when building a model to predict whether a subject is eligible for a grant, it is relevant for the true positive rates of both sexes to be equal, i.e., equal opportunity should be achieved.
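These group fairness definitions can be quantified as absolute gaps between the per-group rates, where lower values are fairer. The sketch below, with hypothetical helper names, computes them for two groups; applied to the per-group rates reported later in Table 1 it reproduces the gaps of Table 2 up to rounding (e.g., |0.714 − 0.356| = 0.358 for Equal Opportunity on the original data), and the Equalized Odds column appears to be the sum of the two gaps, so the sketch computes it that way.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group True Positive Rate and False Positive Rate (as in Table 1)."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        rates[g] = {
            "TPR": ((yp == 1) & (yt == 1)).sum() / max((yt == 1).sum(), 1),
            "FPR": ((yp == 1) & (yt == 0)).sum() / max((yt == 0).sum(), 1),
        }
    return rates

def fairness_gaps(rates, g1="Caucasian", g2="African-American"):
    """Group fairness expressed as gaps between two groups (as in Table 2)."""
    eq_opp = abs(rates[g1]["TPR"] - rates[g2]["TPR"])   # Equal Opportunity
    pred_eq = abs(rates[g1]["FPR"] - rates[g2]["FPR"])  # Predictive Equality
    return {"Eq. Opportunity": eq_opp,
            "Pred. Equality": pred_eq,
            "Eq. Odds": eq_opp + pred_eq}               # Equalized Odds
```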
On the other hand, a risk assessment model should focus on having the same false positive rates across protected groups, as misclassifying an individual as high risk can be really harmful, hence the importance of prioritizing predictive equality.

3.5 Data Rebalancing
Having chosen and studied which fairness metrics are most suitable for the current problem, AI practitioners are now able to focus on applying several techniques and evaluating their impact based on these fairness definitions.

In this case, different data rebalancing techniques are used to modify the dataset distribution in terms of race and recidivism rate, in order to assess their impact in terms of fairness. Usually, data rebalancing techniques are used in imbalanced classification problems, where the target variable to be predicted has a majority and a minority class.

In this case study, the dataset could be rebalanced to be composed of 50% non-recidivists and 50% recidivists, which is the target variable. However, this approach does not take into account the different groups where fairness has to be assessed and preserved. Therefore, as an alternative view on the problem, we propose to treat the bias and unfairness in the protected attributes as a rebalancing problem. In this sense, we extend the rebalancing methods to consider the protected attribute in addition to the associated target variable, thus allowing us to control the proportion of each group in the sample.

In other words, by extending the rebalancing techniques, the dataset of this case study can be modified as follows: 25% African-American non-recidivists, 25% African-American recidivists, 25% Caucasian non-recidivists and 25% Caucasian recidivists.

As there are several techniques for rebalancing, in this case study we focus on three different data rebalancing techniques: Undersampling [11], Oversampling [14], and SMOTE [5].
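One simple way to implement this extension is to rebalance on a composite label formed by the protected attribute and the target, so that every (race, recidivism) combination is treated as its own class. The sketch below uses the imbalanced-learn samplers as an assumed implementation and illustrative variable names; it is not necessarily how our repository code is written, and SMOTE would additionally require numerically encoded features.

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# As in the case study, keep the two largest racial groups.
data = df[df["race"].isin(["African-American", "Caucasian"])]

# Composite class: one label per (race, recidivism) combination.
joint = data["race"].astype(str) + "_" + data["target"].astype(str)
X = data.drop(columns=["target"])

# Undersampling: shrink every combination to the size of the smallest one,
# yielding the 25/25/25/25 split described above.
X_under, joint_under = RandomUnderSampler(random_state=0).fit_resample(X, joint)

# Oversampling: replicate rows until every combination matches the largest one.
X_over, joint_over = RandomOverSampler(random_state=0).fit_resample(X, joint)

# Race and target can be recovered by splitting the composite label again.
y_under = pd.Series(joint_under).str.split("_").str[-1].astype(int)
```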
3.6 Algorithm Training
In this case, the XGBoost [7] classifier has been used with the default hyperparameters. In order to complement the experiments related to rebalancing with the previous techniques, three extra experiments have been carried out to provide further insights:
• Baseline: It is important to evaluate the model obtained without applying any rebalancing, so that it acts as a baseline model against which to compare the results.
• Split by race: Two separate classifiers are trained, one for each of the races studied.
• Remove race attribute: The same experiment as the baseline, but omitting the race attribute.

Regarding the experiments, the whole training process can be described as follows: (1) split the dataset into training and test sets, (2) rebalance the dataset using the aforementioned techniques (depending on the experiment, either the training set or both sets are rebalanced), (3) train the classifier to predict the risk of recidivism given variables such as sex, age, race and prior criminal history of the subject, and (4) once the classifier is trained, evaluate it on the test set by computing the above-mentioned metrics.

In total, nine experiments are performed: the baseline, training one separate model for each race, completely omitting the race variable, and six related to rebalancing either the training set or both the training and test sets with each of the rebalancing techniques presented: undersampling, oversampling and SMOTE. The code of the experiments is publicly available at https://gitlab.com/lucentia/DOLAP2022.
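A condensed sketch of steps (1)–(4) for a single experiment is shown below. The feature subset is illustrative, the data frame is the filtered one from the previous sketch, and the helpers group_rates and fairness_gaps are the ones sketched in Section 3.4; the exact code of the repository may differ.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

features = ["sex", "age", "race", "priors_count"]     # illustrative subset
X = pd.get_dummies(data[features], drop_first=True)   # encode categoricals
y = data["target"]

# (1) Train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# (2) Optionally rebalance X_train/y_train (and X_test/y_test) here; for the
#     "Remove race attribute" experiment, drop the race columns instead.

# (3) Train XGBoost with default hyperparameters.
model = xgb.XGBClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)

# (4) Evaluate fairness on the test set, grouped by the protected attribute.
race_test = data.loc[X_test.index, "race"].to_numpy()
print(fairness_gaps(group_rates(y_test.to_numpy(), y_pred, race_test)))
```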
3.7 Algorithm Results Interpretation
Finally, in order to compare the output of the XGBoost classifier and to be able to measure whether it has been fair, we have created Table 1 and Table 2. It should be noted that these tables can be easily replicated in any Artificial Intelligence challenge.

First, Table 1 presents the True Positive Rates (TPR) and False Positive Rates (FPR) for the Caucasian and African-American groups. In this specific case, the False Positive Rate was the most sensitive measure, since classifying non-recidivists as a high risk of recidivism can bring them negative consequences. As we can see in Table 1, the techniques that achieve the best FPR for the Caucasian group are Original Train - Original Test and Remove race attribute, with a 0.172 rate. Meanwhile, Remove race attribute obtains the best FPR for the African-American group, with a 0.347 rate. It is remarkable that the Caucasian group obtains the best results when the data is original, while the African-American group obtains the best results when the race attribute is removed. However, even though these techniques yield better False Positive Rates, the difference between 17.2% of Caucasian defendants and 34.7% of African-American defendants being wrongly classified as recidivists would still be considered highly unfair.

Table 1: True Positive and False Positive Rates for African-American and Caucasian races

Technique                        TPR Cauc.  TPR Afr.  FPR Cauc.  FPR Afr.
Original Train - Original Test   0.356      0.714     0.172      0.381
SMOTE Train - Original Test      0.340      0.716     0.178      0.384
SMOTE Train - SMOTE Test         0.294      0.716     0.188      0.391
Over Train - Original Test       0.397      0.701     0.215      0.370
Over Train - Over Test           0.372      0.701     0.206      0.380
Under Train - Original Test      0.371      0.721     0.188      0.418
Under Train - Under Test         0.371      0.722     0.227      0.454
Split by race                    0.371      0.703     0.198      0.372
Remove race attribute            0.407      0.674     0.172      0.347

Additionally, our methodology generates Table 2, which calculates and compares the fairness definitions chosen in Section 3.4. Using Table 2, it is possible to know, depending on the type of fairness pursued, which technique will bring better results. We have marked the best (green) and worst (red) techniques for each definition of fairness and for the overall accuracy. We should clarify that a lower fairness number represents less difference between the protected groups, i.e., it is fairer. The accuracy, however, is better when its value is higher, since it means that there have been fewer errors in the classification.

Table 2: Fairness rates comparison

Technique                        Eq. Opportunity  Pred. Equality  Eq. Odds  Accuracy
Original Train - Original Test   0.358            0.209           0.567     0.659
SMOTE Train - Original Test      0.376            0.206           0.582     0.654
SMOTE Train - SMOTE Test         0.422            0.203           0.625     0.608
Over Train - Original Test       0.304            0.155           0.459     0.654
Over Train - Over Test           0.328            0.174           0.503     0.622
Under Train - Original Test      0.350            0.230           0.580     0.649
Under Train - Under Test         0.351            0.227           0.577     0.603
Split by race                    0.332            0.174           0.506     0.654
Remove race attribute            0.267            0.175           0.442     0.664

As we can observe, the technique that obtains the best score in terms of Equal Opportunity, Equalized Odds and Accuracy is removing the protected attribute, in this case the race attribute. However, other widely used techniques such as SMOTE obtain the worst results in terms of Equal Opportunity and Equalized Odds.
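Both tables can be assembled programmatically from the per-experiment predictions. A minimal sketch, assuming the outputs of the nine runs have been collected in a hypothetical results dictionary and reusing the helpers sketched in Section 3.4:

```python
import pandas as pd

# results: {technique name -> (y_true, y_pred, race)} from the nine experiments.
rows = []
for name, (y_true, y_pred, race) in results.items():
    rates = group_rates(y_true, y_pred, race)
    rows.append({"Technique": name,
                 "TPR Cauc.": rates["Caucasian"]["TPR"],
                 "TPR Afr.": rates["African-American"]["TPR"],
                 "FPR Cauc.": rates["Caucasian"]["FPR"],
                 "FPR Afr.": rates["African-American"]["FPR"],
                 **fairness_gaps(rates),
                 "Accuracy": (y_true == y_pred).mean()})

report = pd.DataFrame(rows).round(3)
print(report.to_string(index=False))
```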
4 CONCLUSIONS AND FUTURE WORK
The use of Artificial Intelligence (AI) systems is rapidly spreading across different sectors and organizations. More and more important decisions are being made with the support of AI systems that have not been thoroughly evaluated. It is essential to ensure that these decisions do not reflect discriminatory behavior towards certain groups. Nevertheless, most existing approaches mainly focus on improving the prediction accuracy of algorithms without considering fairness in their development.

Thus, in this paper we have presented a methodology that allows AI practitioners to measure and improve fairness in AI algorithms in a systematic way. Our novel methodology considers fairness as a first-class citizen and introduces new steps with respect to the traditional AI development process, namely: (i) a bias analysis, (ii) a fairness definition and (iii) a fairness evaluation. We have also analyzed how the most common data rebalancing approaches affect the fairness of AI predictions, taking into account both (i) the target variable and (ii) the protected attributes. Furthermore, our methodology generates a set of tables for choosing the best rebalancing alternative for each particular definition of fairness. Both our methodology and the interpretation of the algorithm results (tables and visualizations) can be easily replicated with any AI algorithm.

In order to both exemplify our approach and test the impact of each rebalancing alternative, we have applied it in a real case study. We have implemented a classifier over the COMPAS dataset, calculating the degree of fairness obtained according to three different fairness definitions.

Given the obtained results, we consider that by following our proposed methodology we can avoid falling into the usual pitfalls that lead to controversial outputs when the input datasets include biased protected attributes. In addition, it allows us to discover which are the most appropriate data rebalancing techniques to try to maximize different definitions of fairness.

Regarding the limitations of our proposal, we should take into account that it has achieved successful results when the protected attribute is individual and binary. However, as the number of protected attributes increases, rebalancing becomes more difficult. Future work is needed in order to study the best approach to carry out rebalancing techniques in cases where several protected attributes are defined and the classes contain a large number of different attribute groups.

ACKNOWLEDGMENTS
This work has been co-funded by the AETHER-UA project (PID2020-112540RB-C43), funded by the Spanish Ministry of Science and Innovation, and the BALLADEER (PROMETEO/2021/088) project, funded by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana).

REFERENCES
[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias - There's software used across the country to predict future criminals. And it's biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
[2] Matias Barenstein. 2019. ProPublica's COMPAS Data Revisited. CoRR abs/1906.04711 (2019). arXiv:1906.04711
[3] The U.S. Census Bureau. 2010. Population percent change. https://www.census.gov/quickfacts/FL.
[4] Alessandro Castelnovo, Riccardo Crupi, Greta Greco, and Daniele Regoli. 2021. The zoo of Fairness metrics in Machine Learning. CoRR abs/2106.00467 (2021).
[5] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[6] Irene Chen, Fredrik D. Johansson, and David Sontag. 2018. Why Is My Classifier Discriminatory?. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, Inc.
[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Aug 2016).
[8] European Commission. 2021. Ethics guidelines for trustworthy AI. https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai.
[9] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic Decision Making and the Cost of Fairness. (2017), 797–806.
[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. Association for Computing Machinery, 214–226.
[11] Salvador García and Francisco Herrera. 2009. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation 17, 3 (2009), 275–306.
[12] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016), 3315–3323.
[13] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80. PMLR, 2564–2572.
[14] György Kovács. 2019. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Applied Soft Computing 83 (2019), 105662.
[15] Ana Lavalle, Alejandro Maté, and Juan Trujillo. 2020. An Approach to Automatically Detect and Visualize Bias in Data Analytics. In Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, DOLAP@EDBT/ICDT 2020, Vol. 2572. CEUR-WS.org, 84–88. http://ceur-ws.org/Vol-2572/short11.pdf
[16] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
[17] Broward County Clerk's Office, Broward County Sheriff's Office, Florida Department of Corrections, and ProPublica. 2021. COMPAS Recidivism Risk Score Data and Analysis. https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis.
[18] The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. 2017. Ethically Aligned Design: A Vision for Prioritizing Human Well-being with Autonomous and Intelligent Systems, Version 2. https://standards.ieee.org/content/dam/ieee-standards/standards/web/documents/other/ead_v2.pdf.
[19] Nripsuta Ani Saxena, Karen Huang, Evan DeFilippis, Goran Radanovic, David C. Parkes, and Yang Liu. 2019. How do fairness definitions fare? Examining public attitudes towards algorithmic definitions of fairness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 99–106.
[20] Neil Vigdor. 2019. Apple Card Investigated After Gender Discrimination Complaints. https://www.nytimes.com/2019/11/10/business/Apple-credit-card-investigation.html.
[21] Christina Wadsworth, Francesca Vera, and Chris Piech. 2018. Achieving fairness through adversarial learning: an application to recidivism prediction. CoRR abs/1807.00199 (2018). arXiv:1807.00199
[22] Jordan Weissmann. 2018. Amazon Created a Hiring Tool Using A.I. It Immediately Started Discriminating Against Women. https://slate.com/business/2018/10/amazon-artificial-intelligence-hiring-discrimination-women.html.