<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing Fairness in Classification Parity of Machine Learning Models in Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming Yuan</string-name>
          <email>mirandayuan09@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikas Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Aurangzeb Ahmad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ankur Teredesai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Washington - Bothell</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Washington - Tacoma</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>KenSci Inc</institution>
          , Seattle,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fairness in AI and machine learning systems has become a fundamental problem in the accountability of AI systems. While the need for accountability of AI models is near ubiquitous, healthcare in particular is a challenging field where the accountability of such systems takes on additional importance, as decisions in healthcare can have life-altering consequences. In this paper we present preliminary results on fairness in the context of classification parity in healthcare. We also present exploratory methods for improving fairness and for choosing appropriate classification algorithms in the context of healthcare.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Although machine learning has been around for just over
sixty years, it is only in the last decade or so that its
influence on society at large is being felt, as systems powered by
machine learning are now impacting the lives of billions of
people. For instance, recommendation systems that suggest
items to people by inferring their preferences play a pivotal
role in most e-commerce sites such as Amazon, Netflix,
and Alibaba. Other examples include predicting crimes for
active policing, predicting risk of re-offence to facilitate
sentencing, financial decision making, and decision making in
healthcare. Given that many applications of machine
learning have potential life changing implications, fairness in
machine learning has thus become a critical issue.</p>
      <p>Additionally, the quest for fairness in machine learning is
motivated in many domains by the desire to adhere to
national and international legislation, for example the GDPR
in the European Union, the Universal Declaration of
Human Rights (Assembly 1948) in the context of the digital
age (Zliobaite 2015) etc. The quest for fairness in machine
learning is part of the larger enterprise of creating
responsible machine learning systems that engender trust and ensure
transparency of the machine learning methods being used
(Zliobaite 2015). This involves explainability of the machine
learning model and often requires guarantees about how the
algorithms will behave when making decisions that impact
people's lives. Fairness in machine learning is especially
critical for minority or vulnerable populations, who are more
likely to be affected by decisions made by automated
machine learning systems. Thus, creating machine learning
systems that are fair is pivotal to upholding the social
contract. While there is wide agreement on the need for fairness
in machine learning, there is no single notion of fairness that
can be applied to all use cases. The reason for this is
that fairness can refer to disparate but related concepts in
different contexts.</p>
      <p>
        Though it is universally acknowledged that fairness is
critical in most domains, in certain applications in the
judicial system or in healthcare its importance and impact is
paramount. This is because the algorithms used in the field
of healthcare can be both widespread and specialized, which
may require additional constraints to consider to build a fair
system
        <xref ref-type="bibr" rid="ref2">(Ahmad, Eckert, and Teredesai 2018)</xref>
        . To illustrate
how the usage of machine learning models can produce
biased outcomes in critical domains, we focus in this paper
on healthcare as a domain where fairness in machine
learning can have life-changing consequences. While there are
multiple notions of fairness in machine learning, we focus
on classification parity with respect to protected features
such as age, gender, and race. Additionally, we measure and
address fairness in the performance of machine learning
algorithms on various classification tasks over several
healthcare datasets. Specifically, we address the following
problems:
• Measure fairness as classification parity in the context of
the predictive performance of machine learning methods
• Determine how the ablation of protected features affects
the performance and fairness of machine learning
methods, in general and also via sampling
• Determine a fairness threshold, i.e., a threshold for machine
learning models at which the models are relatively fair
and predictive performance is sufficiently good
We use three publicly available datasets to explore the
questions of fairness outlined here. The healthcare datasets that
are used are limited in terms of their small size. Because of
this limitation, the differences between the predictive
performance of certain prediction models with different
classification thresholds may not be statistically significant. In this
paper, however, our goal is to show the feasibility of the
techniques employed. We plan to address the limitations of
publicly available datasets in the future by deploying the
framework outlined in this paper in a large Midwestern
hospital system in a real-world clinical setting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Bias is inevitable in any sufficiently complex dataset. The
data collection and capture process tends to capture only
certain aspects of the phenomenon of interest, and hidden
assumptions may lie in data collection, processing, and analysis. It
is thus unavoidable that when machine learning algorithms
learn from data, intentional or unintentional discrimination
can result (Barocas and Selbst 2016). There is extensive
literature on issues related to fairness in different application
domains within machine learning e.g., natural language
processing, image classification, target advertising, and judicial
sentencing. Studies from these various domains show that
data bias is inherent in many disciplines, which in turn
creates disparate treatment effects across categories. To address
these limitations, many commercial organizations, e.g.,
Google, Amazon, and IBM, have released software to address
fairness in machine learning models. More recent
developments include the FairMLHealth python package that
focuses on algorithms and metrics for fairness in healthcare
machine learning (Allen et al. 2020).</p>
      <p>Additionally, organizations like Google and Amazon have
set up AI ethics boards, emphasizing the importance of
fairness in AI. Given the extensive nature of the literature, it is
not possible to list all the relevant papers on fairness in
machine learning here; we give a brief overview of the papers
that are most relevant to the current manuscript. Researchers
have proposed a number of theories and methods to detect
and measure fairness based on different definitions of fairness
in machine learning. Zliobaite (Zliobaite 2015) lists
several statistical methods and comparison functions to detect
or measure different notions of fairness in machine learning
based on prediction or classification results, such as the
normalized difference (Zliobaite 2015), a normalized mean
difference for binary classification used to quantify the
difference between groups of people.</p>
      <p>
Martinez et al. (Martinez, Bertran, and Sapiro 2019) explore
fairness in healthcare in the context of risk disparity among
subpopulations. Tramer et al. (Tramer et al. 2017)
introduced the unwarranted associations (UA) framework to
detect fairness issues in data-driven applications by
investigating associations between application outcomes and
sensitive user attributes. There are also detection
methods developed for specific problems or algorithms, such as
detection methods for ranking algorithms. Corbett-Davies and Goel
give a comprehensive survey of fairness in machine
learning (Corbett-Davies and Goel 2018). Lastly, Ahmad et al.
survey the field of healthcare AI within the context of
fairness and describe the limitations and challenges of fairness
in the healthcare domain
        <xref ref-type="bibr" rid="ref1">(Ahmad et al. 2020)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dimensions of Fairness in Machine Learning</title>
      <p>
        Even with the presence of competing notions of fairness
in machine learning, it is still possible to describe the
various definitions of fairness in terms of the machine learning
pipeline. At a high level one can talk about three orthogonal
dimensions of data, algorithms and metrics. In this paper, we
focus our experiments on each of these dimensions. Each of
these can be described as follows:
• Fairness in dataset: Problems in fairness may be related
to the attributes of the dataset; for instance, when one or
several categories are under-represented, when the dataset
is outdated or incomplete
        <xref ref-type="bibr" rid="ref1">(Ahmad et al. 2020)</xref>
(Crawford 2013), or when the dataset inherits the unintentional
perpetuation and promotion of historical biases, the
machine learning methods may learn these disparities and
end up operationalizing unfair outcomes. Unbiasing in this
case requires techniques that appropriately handle the data
while leaving the rest of the machine learning pipeline intact.
In this paper we address unfairness in the dataset by applying
oversampling to the protected features and then determining
how that changes the results of the predictive models.
• Fairness in model/algorithm: Problems related to the
design or even the choice of the machine learning model
or algorithm. Poorly designed matching systems,
inappropriate choices of features, and assumptions such as that
correlation necessarily implies causation can all give rise to
issues related to unfairness. In certain contexts, one could
even state that just as data is not neutral, algorithms are also
not neutral. We explore this dimension in this paper by
considering how the performance and fairness of different
algorithms vary for different classification tasks in
healthcare.
• Fairness in metrics/results: Problems related to the choice
of metrics for measuring fairness and their effect, as well as
problems related to unbiasing the results of machine
learning models. The main approaches in this dimension focus
on post-processing solutions that rectify results after the
fact. In this paper we focus on determining thresholds for
models which can then be used to create more equitable
outcomes for the protected features described below.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Protected Features</title>
      <p>Protected or sensitive features are the features which can
potentially be used to discriminate against populations e.g.,
race, gender, religion, ethnicity, caste, sexual orientation.
The use of such features may lead to improvements in the
predictive performance of machine learning models but could
also likely lead to discrimination. Thus, the non-use of
protected features in machine learning models is recommended.
Protected features can be divided into two broad categories:
• Known protected features: Attributes that are already
protected by law, such as age, gender, disability, and race,
e.g., under the Equality Act of 2010 (Zliobaite 2015).
• Unknown protected features: These are potentially
protected features which are non-obvious. Some data
analysis or prior experience may be required to determine what
constitutes such features. Consider the example of inferring
race from a person's last name and zip code, which may not
appear to be protected features at first. It has been
demonstrated that it is possible to build machine learning models
that can infer race based on these characteristics
(Zliobaite 2015). Such features are also sometimes referred to
as proxy features.</p>
      <p>It is not always straightforward to determine if a feature
is sufficiently correlated with protected features and if we
should include it in training. As mentioned in the
introduction, there are multiple notions of fairness in machine
learning. The notion used in this paper is classification parity,
which is the performance parity of machine learning models
with respect to the protected features. Classification parity for
protected features can be defined as follows. Protected Feature
Classification Parity: Given a dataset $D$ with attributes $V$
and a subset of protected features $V_p \subseteq V$, for any
given categorical attribute $v_i \in V_p$ with classes $C$, if the
performance of a predictive model for algorithm $A_q$ is
$\phi(A_q, D, v_{ij})$ for class $c_j \in C$, then the following
condition should hold for performance parity.</p>
      <p>$|\phi(A_q, D, v_{ij}) - \phi(A_q, D, v_{ik})| &lt; \epsilon, \quad \forall\, j \neq k$ (1)
where $\epsilon$ is a threshold corresponding to a relatively small
difference in the performance of the various classes on the
performance metric. The performance metric can be any
classification metric, for example precision, recall, AUC,
F-score, or Brier score. In other words, each class of the
protected feature should have similar performance on the
quantitative assessment metrics of relevance.
consider the case of models that are used to assess a criminal
defendant’s probability of becoming a recidivist i.e., a
reoffending criminal. Such algorithms have been increasingly
used across the nation by probation officers, judges, and
parole officers, but studies have shown that these algorithms
in general have very different predictive performance across
different racial and ethnic groups.</p>
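      <p>To make Eq. (1) concrete, the parity gap for a protected feature can be computed directly from model predictions. The following is a minimal sketch (our own illustrative code, not part of the original framework), assuming scikit-learn, a binary classification task, and precision as the metric:</p>
      <preformat>
import numpy as np
from sklearn.metrics import precision_score

def classification_parity_gap(y_true, y_pred, group, metric=precision_score):
    # Per-class performance and the largest pairwise difference (Eq. 1);
    # parity holds when the gap is below the chosen threshold epsilon.
    scores = {g: metric(y_true[group == g], y_pred[group == g])
              for g in np.unique(group)}
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

# Toy usage with synthetic labels and a binary protected feature
rng = np.random.default_rng(0)
gender = rng.choice(["female", "male"], size=500)
y_true = rng.integers(0, 2, size=500)
y_pred = rng.integers(0, 2, size=500)
scores, gap = classification_parity_gap(y_true, y_pred, gender)
print(scores, "parity at epsilon 0.05:", gap &lt; 0.05)
      </preformat>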
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <sec id="sec-5-1">
        <title>Dataset</title>
        <p>We employ data from three publicly available healthcare
related datasets to study the effect of thresholds on fairness.
• MIMIC: An ICU-related dataset which has been
extensively used in the literature to study a number of
prediction problems in healthcare (Johnson et al. 2016). The data
considered consisted of 46,630 rows and 212 features. In this
paper we focus on the problem of predicting length of stay
at the time of admission to a medical facility.
• Thyroid Disease Dataset: Thyroid disease records supplied
by the Garavan Institute in Australia (Dheeru 2017). The
data consisted of 3,772 rows and 29 features. The problem
of predicting the presence or absence of thyroid disease is
addressed.
• Pima Indians Diabetes Dataset (PIMA): A dataset from the
National Institute of Diabetes and Digestive and Kidney
Diseases which mainly consists of diagnostic
measurements (Smith et al. 1988). The dataset consisted of 768
rows and 9 features. We focus on the problem of
predicting the presence or absence of diabetes in patients.
The target variables in all these cases are nominal, and thus
we pose these problems as classification problems. We note
that while the experiments outlined here were performed on
all three datasets, because of space limitations we report
results for only a subset of the experiments.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Model Assessment</title>
      <p>A standard machine learning pipeline consists of data
collection, data pre-processing, feature selection, algorithm
selection, model training, model selection and model
evaluation. In this paper, we focus on the data pre-processing,
algorithm selection and model evaluation part of fairness in
machine learning. We propose a general way to assess
classification parity for fairness in machine learning and
determine the optimal thresholds for creating fair and unbiased
machine learning models. Specifically, for each protected
feature, we measured fairness by comparing the predictive
performance of each category of the feature. Afterwards, we
compared the optimal thresholds for predictive performance
chosen based on the performance of each category. Next,
we determined a fairness threshold which corresponded to
optimal outcome across classes of the protected features.
The fairness threshold allows us to create classification
models with fair outcomes without significant drop in predictive
performance. We measure predictive performance in terms
of standard classification metrics like AUC, precision, recall
and F1 score.</p>
      <p>The optimal threshold was found based on Youden’s index
(Youden 1950):</p>
      <p>$J = \text{sensitivity} + \text{specificity} - 1$ (2)
Youden's index has been used as a measure of diagnostic
effectiveness. It can also be used to select the optimal
cut-point on the ROC curve (Schisterman et al. 2005), which
corresponds to the optimal threshold we desire.
Additionally, we measured the performance of the predictive
models at the level of data and features, i.e., training without
the protected features, training without the important features
related to the protected features to help reduce unfairness,
and sampling to balance the size of each category in the
training dataset. Due to space limitations we limit the analysis
of algorithmic performance to the following three popular,
well-studied and well-applied algorithms in the healthcare
domain: Logistic Regression, Random Forest and XGBoost
(eXtreme Gradient Boosting).</p>
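      <p>Since $J = \text{TPR} + \text{TNR} - 1 = \text{TPR} - \text{FPR}$, the optimal cut-point can be read directly off the ROC curve. A minimal sketch, assuming scikit-learn (the function name is ours):</p>
      <preformat>
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    # Optimal cut-point on the ROC curve by Youden's index (Eq. 2):
    # J = sensitivity + specificity - 1 = TPR - FPR.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
      </preformat>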
    </sec>
    <sec id="sec-7">
      <title>Classification Parity and Fairness</title>
      <p>Definition: Given a classification task $T$, a dataset $D$ whose
protected feature has $m$ classes $C_1, C_2, C_3, \ldots, C_m$, and an
evaluation metric $\phi$, the fairness threshold is defined as follows:
$\operatorname{argmin}_t\, \sigma^2\big(\phi_t(C_i), \phi_t(C_j), \phi_t(D)\big), \quad \forall\, C_i, C_j \in C$ (3)
In other words, for a given classification task and protected
feature with $m$ classes, the fairness threshold is the threshold
at which the variance in the predictive performance of an
algorithm is minimized with respect to each of the $m$ classes
as well as the overall dataset. To illustrate this concept,
consider Figure 1 which shows the performance of a Random
Forest model for the Length of Stay prediction task. Here the
protected variable is gender and performance is measured
in terms of precision. The performance is given for the two
classes of gender (i.e., male and female) for this dataset as
well as the overall performance. The x-axis shows the class
or formula with respect to which precision is maximized,
and the y-axis shows the precision of the model.</p>
      <p>Consider the threshold that are used to maximize
performance for female, the threshold would also correspond to
some non-optimal performance for male and also for the
overall population. Similarly, consider the threshold for
fairness when the performance of the predictive model is most
fair for males. The two other values in the graph are the
values obtained if the fairness threshold for the male
population is used for the female population as well as the
overall population. Now consider the thresholds given on
the right in the figure; these are the thresholds
computed as aggregates (average, min, max, etc.) of the
thresholds for the protected classes as well as the overall
population. We computed the performance when the
minimum, maximum, average, and median thresholds
are used. From Figure 1 it is clear
that the best results are obtained when the average or the
minimum threshold is used. Table 1 shows the results of
prediction for the length of stay prediction problem along
with the variance of the performance of all the categories,
which are ’male’ and ’female’ in this example. The
variance in this case quantifies the difference in performance.
The smaller the variance, the smaller the difference in
predictive performance, which also implies that the models
are more fair. To find the threshold for fairness, we set the
optimal threshold for the entire dataset, the optimal thresholds
for each category, and the average, median, maximum, and
minimum values of the thresholds of all the categories as the
candidates. We then computed the difference in
performance corresponding to each threshold, including
precision, recall, and F1 score, across each category, and chose
the one with the minimum difference as the fairness threshold.
To help analyze the effect of choosing different thresholds
on performance, we plotted the performance boundary,
including the precision, recall, and F1 score boundaries, for
each fairness threshold candidate. The performance
boundary shows that the best performance does not imply fairness.
For instance, the threshold based on males, which is also the
maximum threshold in Figure 1, has the highest precision,
but it also has the maximum difference, which means the
performance corresponding to this threshold is the least fair.
Furthermore, a model with fair outcomes can have
comparable performance. For instance, the fairness threshold in
Table 1 performs even a bit better than the optimal threshold
for the entire dataset, which is normally the one we
care about.</p>
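      <p>The selection procedure above can be sketched as follows; this is illustrative code under our assumptions (a single metric, predicted scores, and a candidate list built from the per-category Youden optima and their aggregates), not the exact implementation used in the experiments:</p>
      <preformat>
import numpy as np
from sklearn.metrics import precision_score

def fairness_threshold(y_true, y_score, group, candidates):
    # Pick the candidate threshold that minimizes the spread (variance)
    # of the metric across the protected categories and the full dataset.
    best_t, best_spread = None, np.inf
    for t in candidates:
        y_pred = (y_score >= t).astype(int)
        scores = [precision_score(y_true[group == g], y_pred[group == g],
                                  zero_division=0)
                  for g in np.unique(group)]
        scores.append(precision_score(y_true, y_pred, zero_division=0))
        if np.var(scores) &lt; best_spread:
            best_t, best_spread = t, np.var(scores)
    return best_t

# Candidates: the overall optimum, the per-category optima, and their
# average, median, maximum and minimum (per_cat is a list of thresholds):
# candidates = [t_all, *per_cat, np.mean(per_cat), np.median(per_cat),
#               max(per_cat), min(per_cat)]
      </preformat>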
    </sec>
    <sec id="sec-8">
      <title>Effect of Removal of Protected Features</title>
      <p>One apparent choice for making the performance of each
category of a protected feature similar is to remove the
protected feature so that it cannot influence the performance
directly. Tables 5, 6, and 7 in the appendix give the
performance of methods trained with and without protected
features on the LOS dataset for gender, age, and race
respectively. Increases in performance are marked in red, and
decreases in the variance of the performance across all the
categories are marked in blue.</p>
      <p>One thing to note is that the model trained without the
protected feature, gender, has more fair performance than
the one trained with all the features. This also implies that
the underlying model was most likely using gender in its
predictions. One can also find the optimal threshold for the
entire dataset and the ones for all the categories. We note that in
this particular example the difference in the variance of the
models is not statistically significant. We found similar
results for the other two models. We however emphasize that
the current models are for demonstrating the feasibility of
the proposed methods and we plan to explore this further as
described in the future work section below.</p>
      <p>With the above disclaimer, one can still observe that there
are certain differences in the performance of the algorithms
which can be used to design experiments and analysis in the
future. In summary, the results show that (see the sketch
after this list):
• The variance of the performance is likely to decrease by
10%-30%, and sometimes by over 60%, which means that
removal of protected features can help the performance
become more fair;
• Although it may appear that the performance is more
likely to improve with the inclusion of the protected
feature, it mostly increases by under 5%, which can be
interpreted as an insignificant change in performance.</p>
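      <p>A minimal sketch of this ablation experiment follows. It assumes pandas and scikit-learn, uses an in-sample evaluation for brevity (a held-out split would be used in practice), and all function names are ours:</p>
      <preformat>
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

def ablation_variance(X: pd.DataFrame, y: pd.Series, protected: str):
    # Train the same model with and without the protected feature and
    # compare the variance of per-category precision (lower = more fair).
    groups, out = X[protected], {}
    for label, feats in [("with", X), ("without", X.drop(columns=[protected]))]:
        enc = pd.get_dummies(feats)  # one-hot encode categorical columns
        y_pred = RandomForestClassifier(random_state=0).fit(enc, y).predict(enc)
        per_cat = [precision_score(y[groups == g], y_pred[groups == g],
                                   zero_division=0)
                   for g in groups.unique()]
        out[label] = np.var(per_cat)
    return out  # e.g. {"with": 0.0042, "without": 0.0011} (hypothetical)
      </preformat>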
    </sec>
    <sec id="sec-9">
      <title>Effect of Removal of Proxy Features</title>
      <p>A further idea that we explore is the effect of the removal of
proxy features, especially ones with high importance scores
with respect to the predictive performance. This is to ensure
that the indirect influence of the protected features on the
outcomes is not factored into the model and that the model in
general is fair. We define important features as follows: given $n$
features, the important features are the top $k$ features rank-sorted
by their importance scores. The importance of a feature is
calculated by how much the performance measure improves
on each attribute split, weighted by the number of
observations the node is responsible for, while training with the
Gradient Boosting method (Dash and Liu 1997) (Xu et al.
2014). Since we are more interested in the proxy features, we
only consider the important features that are highly
correlated with the protected features. Table 3 and Table 2 give
the performance of models trained on the Thyroid dataset
with and without the important features related to the
protected features, for gender and age respectively. For the
important features, we consider $k = 5$. The main takeaways can
be summarized as follows:
• The variance of the performance does not decrease if the
important features are removed.
• As expected, the performance of the models decreases
when the important features are removed.</p>
      <p>These observations imply that removing related important
features is not helpful for either improvement in predictive
performance or for fairness in general.</p>
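      <p>The proxy-feature screening described above can be sketched as follows; this is our illustrative code, assuming numeric features, scikit-learn's gradient boosting importances, and a hypothetical correlation cutoff of 0.3:</p>
      <preformat>
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def proxy_features(X: pd.DataFrame, y, protected: str, k=5, cutoff=0.3):
    # Top-k features by gradient-boosting importance that are also highly
    # correlated with the protected feature, i.e., proxy candidates.
    feats = X.drop(columns=[protected])
    model = GradientBoostingClassifier(random_state=0).fit(feats, y)
    top_k = feats.columns[np.argsort(model.feature_importances_)[::-1][:k]]
    prot = pd.factorize(X[protected])[0]  # numeric codes for the categories
    return [f for f in top_k
            if abs(np.corrcoef(feats[f], prot)[0, 1]) >= cutoff]

# Training without the proxies (illustrative):
# X_reduced = X.drop(columns=proxy_features(X, y, "gender"))
      </preformat>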
    </sec>
    <sec id="sec-10">
      <title>Effect of Sampling</title>
      <p>The distribution of classes in most protected features is
imbalanced. Consequently, problems related to class
imbalance in supervised learning are also prominent
in this domain. One way to mitigate the problem of class
imbalance in supervised learning is to over-sample the
underrepresented classes. We considered the distribution of the
protected feature ’age’ and another protected feature ’race’
in the LOS dataset as examples. We observe that minority
populations like African Americans and Asians are
underrepresented in the race feature. Similarly, pediatric patients
are under-represented in the age feature. To reduce the
influence of the under-representation of these categories, we
oversampled the training dataset to balance the size of each
category. We observe that the value of the optimal threshold
changes once sampling is done.</p>
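      <p>A minimal sketch of this balancing step, assuming pandas and scikit-learn's resample utility (the function name is ours):</p>
      <preformat>
import pandas as pd
from sklearn.utils import resample

def oversample_protected(train: pd.DataFrame, protected: str, seed=0):
    # Upsample every category of the protected feature (e.g. 'race' or
    # an 'age' group) to the size of the largest category.
    largest = train[protected].value_counts().max()
    parts = [resample(part, replace=True, n_samples=largest, random_state=seed)
             if len(part) &lt; largest else part
             for _, part in train.groupby(protected)]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # reshuffle
      </preformat>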
    </sec>
    <sec id="sec-11">
      <title>Fairness in Methods</title>
      <p>By comparing the AUC scores and the variance of the AUC
of the three methods, we can see that the ranking of the three
methods by AUC score is:
Logistic Regression &lt; Random Forest &lt; XGBoost
and the ranking by the variance in AUC is:
Logistic Regression &lt; Random Forest &lt; XGBoost (Race)
This leads us to the conclusion that an accurate
model does not necessarily imply fair outcomes. The
summary results for the experiments on the Diabetes dataset
are given in Table 4.</p>
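      <p>This comparison can be reproduced with a helper of the following form (a sketch under our assumptions; XGBoost is accessed via the xgboost package, and the helper name is ours):</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def auc_and_group_variance(model, X_tr, y_tr, X_te, y_te, groups):
    # Overall AUC and the variance of per-group AUC for one model;
    # a higher overall AUC need not come with a lower group variance.
    s = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    per_group = [roc_auc_score(y_te[groups == g], s[groups == g])
                 for g in np.unique(groups)]
    return roc_auc_score(y_te, s), np.var(per_group)

models = {"LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(random_state=0),
          "XGB": XGBClassifier()}
      </preformat>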
    </sec>
    <sec id="sec-12">
      <title>Conclusion</title>
      <p>Identifying unfairness is a challenging task. In this work, we
first compared the predictive performance across protected
features and used the variance of the performance as a
criterion to measure fairness. Second, we determined the optimal
thresholds chosen based on each category. We also explored
several ways to address unfairness. First, we trained
models without the protected features, which could help
reduce unfairness without causing a drop in performance.
The second method focused on the data dimension, which is
critical in the machine learning process because characteristics
like under-representation can be inherited or even
exacerbated by it. When the size of the dataset is small, or one or
several categories are under-represented, one can sample the
training dataset first to balance the size of each category.
Finally, when one has detected unfairness and prefers to
address it without retraining the model, one can find the
fairness threshold, which makes the performance more fair
while remaining comparable.</p>
      <p>Furthermore, the comparison among the AUC scores of different
models leads us to the conclusion that an accurate model does
not necessarily imply fairness; models with higher accuracy
may have less fair outcomes. We note that the fairness
measurements obtained for the best results for each prediction
problem do not necessarily correspond to the best possible
theoretical results. This is work in progress, and in follow-up
to this work we plan to explore theoretical aspects of the
fairness threshold. The methods and experiments described in
this paper are part of a proof of concept to test the
efficacy of methods to detect unfairness in machine learning
in healthcare use cases. Our plan is to incorporate insights
gleaned from this exploratory analysis into a
production-deployed system at a large-scale healthcare system in the
United States.</p>
    </sec>
    <sec id="sec-13">
      <title>References</title>
      <p>Allen, C.; Ahmad, M. A.; Eckert, C.; Hu, J.; and Kumar, V. 2020.
fairMLHealth: Tools and tutorials for evaluation of fairness and
bias in healthcare applications of machine learning models.
https://github.com/KenSciResearch/fairMLHealth.</p>
      <p>Assembly, U. G. 1948. Universal declaration of human
rights. UN General Assembly 302(2).</p>
      <p>Barocas, S., and Selbst, A. D. 2016. Big data’s disparate
impact. Calif. L. Rev. 104:671.</p>
      <p>Corbett-Davies, S., and Goel, S. 2018. The measure and
mismeasure of fairness: A critical review of fair machine
learning. arXiv preprint arXiv:1808.00023.</p>
      <p>Crawford, K. 2013. The hidden biases in big data. Harvard
business review 1(1):814.</p>
      <p>Dash, M., and Liu, H. 1997. Feature selection for
classification. Intelligent data analysis 1(3):131–156.</p>
      <p>Dheeru, D., and Karra Taniskidou, E. 2017. UCI Machine
Learning Repository.</p>
      <p>Johnson, A. E.; Pollard, T. J.; Shen, L.; Li-Wei, H. L.; Feng,
M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and
Mark, R. G. 2016. MIMIC-III, a freely accessible critical care
database. Scientific data 3(1):1–9.</p>
      <p>Martinez, N.; Bertran, M.; and Sapiro, G. 2019. Fairness
with minimal harm: A pareto-optimal approach for
healthcare. arXiv preprint arXiv:1911.06935.</p>
      <p>Schisterman, E. F.; Perkins, N. J.; Liu, A.; and Bondell, H.
2005. Optimal cut-point and its corresponding Youden
index to discriminate individuals using pooled blood samples.
Epidemiology 73–81.</p>
      <p>Smith, J. W.; Everhart, J.; Dickson, W.; Knowler, W.; and
Johannes, R. 1988. Using the adap learning algorithm to
forecast the onset of diabetes mellitus. In Proceedings of
the Annual Symposium on Computer Application in Medical
Care, 261. American Medical Informatics Association.</p>
      <p>Tramer, F.; Atlidakis, V.; Geambasu, R.; Hsu, D.; Hubaux,
J.-P.; Humbert, M.; Juels, A.; and Lin, H. 2017. Fairtest:
Discovering unwarranted associations in data-driven
applications. In 2017 IEEE European Symposium on Security
and Privacy (EuroS&amp;P), 401–416. IEEE.</p>
      <p>Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X.
2014. Gradient boosted feature selection. In Proceedings of
the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, 522–531.</p>
      <p>Youden, W. J. 1950. Index for rating diagnostic tests.
Cancer 3(1):32–35.</p>
      <p>Zliobaite, I. 2015. A survey on measuring
indirect discrimination in machine learning. arXiv preprint
arXiv:1511.00148.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Teredesai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fairness in machine learning for healthcare</article-title>
          .
          <source>In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <fpage>3529</fpage>
          -
          <lpage>3530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Teredesai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Interpretable machine learning in healthcare</article-title>
          .
          <source>In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics</source>
          ,
          <fpage>559</fpage>
          -
          <lpage>560</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>