<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing Fairness in Classification Parity of Machine Learning Models in Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming Yuan</string-name>
          <email>mirandayuan09@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikas Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Aurangzeb Ahmad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ankur Teredesai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Washington - Bothell</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Washington - Tacoma</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>KenSci Inc</institution>
          , Seattle,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fairness in AI and machine learning systems has become a fundamental problem in the accountability of AI systems. While the need for accountability of AI models is near ubiquitous, healthcare in particular is a challenging field where the accountability of such systems takes on additional importance, as decisions in healthcare can have life-altering consequences. In this paper we present preliminary results on fairness in the context of classification parity in healthcare. We also present exploratory methods for improving fairness and for choosing appropriate classification algorithms in the context of healthcare.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Although machine learning has been around for just over
sixty years, it is only in the last decade or so that its
influence on society at large is being felt, as systems powered by
machine learning are now impacting the lives of billions of
people. For instance, recommendation systems that suggest
items to people by inferring their preferences play a pivotal
role in most e-commerce sites such as Amazon, Netflix,
and Alibaba. Other examples include predicting crimes for
active policing, predicting risk of re-offence to facilitate
sentencing, financial decision making, and decision making in
healthcare. Given that many applications of machine
learning have potential life changing implications, fairness in
machine learning has thus become a critical issue.</p>
      <p>Additionally, the quest for fairness in machine learning is
motivated in many domains by the desire to adhere to
national and international legislation, for example the GDPR
in the European Union, the Universal Declaration of
Human Rights (Assembly 1948) in the context of the digital
age (Zliobaite 2015) etc. The quest for fairness in machine
learning is part of the larger enterprise of creating
responsible machine learning systems that engender trust and ensure
transparency of the machine learning methods being used
(Zliobaite 2015). This involves explainability of the machine
learning model and often requires guarantees about how the
algorithms will behave when making decisions that impact
people's lives. Fairness in machine learning is especially
critical for minority or vulnerable populations, who are more
likely to be affected by decisions made by automated
machine learning systems. Thus, creating machine learning
systems that are fair is pivotal to upholding the social
contract. While there is wide agreement on the need for fairness
in machine learning, there is no single notion of fairness that
can be applied to all use cases. The reason for this is
that fairness can refer to disparate but related concepts in
different contexts.</p>
      <p>
        Though it is universally acknowledged that fairness is
critical in most domains, in certain applications in the
judicial system or in healthcare its importance and impact is
paramount. This is because the algorithms used in the field
of healthcare can be both widespread and specialized, which
may require additional constraints to consider to build a fair
system
        <xref ref-type="bibr" rid="ref2">(Ahmad, Eckert, and Teredesai 2018)</xref>
        . To illustrate
how the usage of machine learning models can produce
biased outcomes in critical domains, we focus in this paper
on healthcare as a domain where fairness in machine
learning can have life-changing consequences. While there are
multiple notions of fairness in machine learning, we focus
on classification parity with respect to protected features
such as age, gender, and race. Additionally, we measure and
address fairness in the performance of machine learning
algorithms on various classification tasks over several
healthcare datasets. Specifically, we address the following
problems:
• Measure fairness as classification parity in the context of
the predictive performance of machine learning methods
• Determine how the ablation of protected features affects
the performance and fairness of machine learning
methods, in general and also via sampling
• Determine a fairness threshold, i.e., a threshold for machine
learning models at which the models are relatively fair
and predictive performance is sufficiently good
We use three publicly available datasets to explore the
questions of fairness outlined here. The healthcare datasets that
are used are limited in terms of their small size. Because of
this limitation, the differences between the predictive
performance of certain prediction models with different
classification thresholds may not be statistically significant. In this
paper, however, our goal is to show the feasibility of the
techniques employed. We plan to address the limitations of
publicly available datasets in the future by deploying the
framework outlined in this paper in a large Midwestern
hospital system in a real-world clinical setting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Bias is inevitable in any sufficiently complex dataset. The
data collection and capture process tends to capture only
certain aspects of the phenomenon of interest, and hidden
assumptions may lie in data collection, processing, and analysis. It
is thus unavoidable that when machine learning algorithms
learn from data, intentional or unintentional discrimination
can result (Barocas and Selbst 2016). There is extensive
literature on issues related to fairness in different application
domains within machine learning e.g., natural language
processing, image classification, target advertising, and judicial
sentencing. Studies from these various domains show that
data bias is inherent in many disciplines, which in turn
creates disparate treatment effects across categories. To address
these limitations, many commercial organizations, e.g.,
Google, Amazon, and IBM, have released software to address
fairness in machine learning models. More recent
developments include the FairMLHealth python package that
focuses on algorithms and metrics for fairness in healthcare
machine learning (Allen et al. 2020).</p>
      <p>Additionally, organizations like Google and Amazon have
set up AI ethics boards, emphasizing the importance of
fairness in AI. Given the extensive nature of the literature, it is
not possible to list all the relevant papers on fairness in
machine learning here; we give a brief overview of the papers
that are most relevant to the current manuscript. Researchers
have proposed a number of theories and methods to detect
and measure fairness based on different definitions of fairness
in machine learning. Zliobaite (Zliobaite 2015) lists
several statistical methods and comparison functions to detect
or measure different notions of fairness in machine learning
based on prediction or classification results, such as the
normalized difference (Zliobaite 2015), a normalized mean
difference for binary classification used to quantify the
difference between groups of people.</p>
      <p>
Martinez et al. (Martinez, Bertran, and Sapiro 2019) explore
fairness in healthcare in the context of risk disparity among
subpopulations. Tramer et al. (Tramer et al. 2017)
introduced the unwarranted associations (UA) framework to
detect fairness issues in data-driven applications by
investigating associations between application outcomes and
sensitive user attributes. There are also detection
methods developed for specific problems or algorithms, such as
detection methods for ranking algorithms. Corbett-Davies and Goel
give a comprehensive survey of fairness in machine
learning (Corbett-Davies and Goel 2018). Lastly, Ahmad et al.
survey the field of healthcare AI within the context of
fairness and describe the limitations and challenges of fairness
in the healthcare domain
        <xref ref-type="bibr" rid="ref1">(Ahmad et al. 2020)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dimensions of Fairness in Machine Learning</title>
      <p>
        Even with the presence of competing notions of fairness
in machine learning, it is still possible to describe the
various definitions of fairness in terms of the machine learning
pipeline. At a high level one can talk about three orthogonal
dimensions of data, algorithms and metrics. In this paper, we
focus our experiments on each of these dimensions. Each of
these can be described as follows:
• Fairness in dataset: Problems in fairness may be related
to the attributes of the dataset; for instance, when one or
several categories are under-represented, when the dataset
is outdated or incomplete
        <xref ref-type="bibr" rid="ref1">(Ahmad et al. 2020)</xref>
(Crawford 2013), or when the dataset inherits the unintentional
perpetuation and promotion of historical biases, the
machine learning methods may learn these disparities and
end up operationalizing unfair outcomes. Unbiasing in this
case requires techniques that appropriately handle the data
while leaving the rest of the machine learning pipeline intact.
In this paper we address unfairness in the dataset by applying
oversampling to the protected features and then determining
how that changes the results of the predictive models.
• Fairness in model/algorithm: Problems related to the
design or even the choice of the machine learning model
or algorithm. Poorly designed matching systems,
inappropriate choices of features, and assumptions such as that
correlation necessarily implies causation can all give rise to
issues related to unfairness. In certain contexts, one could
even state that just as data is not neutral, algorithms are also
not neutral. We explore this dimension in this paper by
considering how the performance and fairness of different
algorithms vary for different classification tasks in
healthcare.
• Fairness in metrics/results: Problems related to the choice
of metrics for measuring fairness and their effect, as well as
problems related to unbiasing the results of machine
learning models. The main approaches in this dimension focus
on post-processing solutions that rectify results after the
fact. In this paper we focus on determining thresholds for
models which can then be used to create more equitable
outcomes for the protected features described below.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Protected Features</title>
      <p>Protected or sensitive features are the features which can
potentially be used to discriminate against populations e.g.,
race, gender, religion, ethnicity, caste, sexual orientation.
The use of such features may lead to improvements in the
predictive performance of machine learning models but could
also likely lead to discrimination. Thus, the non-use of
protected features in machine learning models is recommended.
Protected features can be divided into two broad categories:
• Known protected features: Attributes that are already
protected by law, such as age, gender, disability, and race,
e.g., under the Equality Act of 2010 (Zliobaite 2015).
• Unknown protected features: These are potentially
protected features which are non-obvious. Some data
analysis or prior experience may be required to determine what
constitutes such features. Consider the example of inferring
race from a person's last name and zip code, which may not
appear to be protected features at first. It has been
demonstrated that it is possible to build machine learning models
that can infer race based on these characteristics
(Zliobaite 2015). Such features are also sometimes referred to
as proxy features.</p>
      <p>It is not always straightforward to determine if a feature
is sufficiently correlated with protected features and if we
should include it in training. As mentioned in the
introduction, there are multiple notions of fairness in machine
learning. The notion used in this paper is classification parity,
which is the performance parity of machine learning models
with respect to the protected features. Classification parity for
protected features can be defined as follows. Protected Feature
Classification Parity: Given a dataset $D$ with attributes $V$
and a subset of protected features $V_p \subseteq V$, for any
given categorical attribute $v_i \in V_p$ with classes $C$, if the
performance of a predictive model for algorithm $A_q$ is
$\phi(A_q, D, v_{ij})$ for class $c_j \in C$, then the following
condition should hold for performance parity.</p>
      <p>$|\phi(A_q, D, v_{ij}) - \phi(A_q, D, v_{ik})| &lt; \epsilon, \quad \forall\, j \neq k$ (1)
where $\epsilon$ is a threshold corresponding to a relatively small
difference in the performance of the various classes on the
performance metric. The performance metric can be any
classification metric, for example precision, recall, AUC,
F-score, or Brier score. In other words, each class of the
protected feature should have similar performance on the
quantitative assessment metrics of relevance.
consider the case of models that are used to assess a criminal
defendant’s probability of becoming a recidivist i.e., a
reoffending criminal. Such algorithms have been increasingly
used across the nation by probation officers, judges, and
parole officers, but studies have shown that these algorithms
in general have very different predictive performance across
different racial and ethnic groups.</p>
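      <p>To make Eq. (1) concrete, the parity gap for a protected feature can be computed directly from model predictions. The following is a minimal sketch (our own illustrative code, not part of the original framework), assuming scikit-learn, a binary classification task, and precision as the metric:</p>
      <preformat>
import numpy as np
from sklearn.metrics import precision_score

def classification_parity_gap(y_true, y_pred, group, metric=precision_score):
    # Per-class performance and the largest pairwise difference (Eq. 1);
    # parity holds when the gap is below the chosen threshold epsilon.
    scores = {g: metric(y_true[group == g], y_pred[group == g])
              for g in np.unique(group)}
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

# Toy usage with synthetic labels and a binary protected feature
rng = np.random.default_rng(0)
gender = rng.choice(["female", "male"], size=500)
y_true = rng.integers(0, 2, size=500)
y_pred = rng.integers(0, 2, size=500)
scores, gap = classification_parity_gap(y_true, y_pred, gender)
print(scores, "parity at epsilon 0.05:", gap &lt; 0.05)
      </preformat>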
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <sec id="sec-5-1">
        <title>Dataset</title>
        <p>We employ data from three publicly available healthcare
related datasets to study the effect of thresholds on fairness.
• MIMIC: An ICU-related dataset which has been
extensively used in the literature to study a number of
prediction problems in healthcare (Johnson et al. 2016). The data
considered consisted of 46,630 rows and 212 features. In this
paper we focus on the problem of predicting length of stay
at the time of admission to a medical facility.
• Thyroid Disease Dataset: Thyroid disease records supplied
by the Garavan Institute in Australia (Dheeru 2017). The
data consisted of 3,772 rows and 29 features. The problem
of predicting the presence or absence of thyroid disease is
addressed.
• Pima Indians Diabetes Dataset (PIMA): A dataset from the
National Institute of Diabetes and Digestive and Kidney
Diseases which mainly consists of diagnostic
measurements (Smith et al. 1988). The dataset consisted of 768
rows and 9 features. We focus on the problem of
predicting the presence or absence of diabetes in patients.
The target variables in all these cases are nominal, and thus
we pose these problems as classification problems. We note
that while the experiments outlined here were performed on
all three datasets, because of space limitations we report
results for only a subset of the experiments.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Model Assessment</title>
      <p>A standard machine learning pipeline consists of data
collection, data pre-processing, feature selection, algorithm
selection, model training, model selection and model
evaluation. In this paper, we focus on the data pre-processing,
algorithm selection and model evaluation part of fairness in
machine learning. We propose a general way to assess
classification parity for fairness in machine learning and
determine the optimal thresholds for creating fair and unbiased
machine learning models. Specifically, for each protected
feature, we measured fairness by comparing the predictive
performance of each category of the feature. Afterwards, we
compared the optimal thresholds for predictive performance
chosen based on the performance of each category. Next,
we determined a fairness threshold which corresponded to
optimal outcome across classes of the protected features.
The fairness threshold allows us to create classification
models with fair outcomes without significant drop in predictive
performance. We measure predictive performance in terms
of standard classification metrics like AUC, precision, recall
and F1 score.</p>
      <p>The optimal threshold was found based on Youden’s index
(Youden 1950):</p>
      <p>$J = \text{sensitivity} + \text{specificity} - 1$ (2)
Youden's index has been used as a measure of diagnostic
effectiveness. It can also be used to select the optimal
cut-point on the ROC curve (Schisterman et al. 2005), which
corresponds to the optimal threshold we desire.
Additionally, we measured the performance of the predictive
models at the level of data and features, i.e., training without
the protected features, training without the important features
related to the protected features to help reduce unfairness,
and sampling to balance the size of each category in the
training dataset. Due to space limitations we limit the analysis
of algorithmic performance to the following three popular,
well-studied and well-applied algorithms in the healthcare
domain: Logistic Regression, Random Forest and XGBoost
(eXtreme Gradient Boosting).</p>
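      <p>Since $J = \text{TPR} + \text{TNR} - 1 = \text{TPR} - \text{FPR}$, the optimal cut-point can be read directly off the ROC curve. A minimal sketch, assuming scikit-learn (the function name is ours):</p>
      <preformat>
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    # Optimal cut-point on the ROC curve by Youden's index (Eq. 2):
    # J = sensitivity + specificity - 1 = TPR - FPR.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
      </preformat>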
    </sec>
    <sec id="sec-7">
      <title>Classification Parity and Fairness</title>
      <p>Definition: Given a classification task $T$, a dataset $D$ whose
protected feature has $m$ classes $C_1, C_2, C_3, \ldots, C_m$, and an
evaluation metric $\phi$, the fairness threshold is defined as follows:
$\operatorname{argmin}_t\, \sigma^2\big(\phi_t(C_i), \phi_t(C_j), \phi_t(D)\big), \quad \forall\, C_i, C_j \in C$ (3)
In other words, for a given classification task and protected
feature with $m$ classes, the fairness threshold is the threshold
at which the variance in the predictive performance of an
algorithm is minimized with respect to each of the $m$ classes
as well as the overall dataset. To illustrate this concept,
consider Figure 1 which shows the performance of a Random
Forest model for the Length of Stay prediction task. Here the
protected variable is gender and performance is measured
in terms of precision. The performance is given for the two
classes of gender (i.e., male and female) for this dataset as
well as the overall performance. The x-axis shows the class
or formula with respect to which precision is maximized,
and the y-axis shows the precision of the model.</p>
      <p>Consider the threshold that are used to maximize
performance for female, the threshold would also correspond to
some non-optimal performance for male and also for the
overall population. Similarly, consider the threshold for
fairness when the performance of the predictive model is most
fair for males. The two other values in the graph are the
values obtained if the fairness threshold for the male
population is used for the female population as well as the
overall population. Now consider the thresholds given on
the right in the figure; these are the thresholds
computed as aggregates (average, min, max, etc.) of the
thresholds for the protected classes as well as the overall
population. We computed the performance when the
minimum, maximum, average, and median thresholds
are used. From Figure 1 it is clear
that the best results are obtained when the average or the
minimum threshold is used. Table 1 shows the results of
prediction for the length of stay prediction problem along
with the variance of the performance of all the categories,
which are ’male’ and ’female’ in this example. The
variance in this case quantifies the difference in performance.
The smaller the variance, the smaller the difference in
predictive performance, which also implies that the models
are more fair. To find the threshold for fairness, we set the
optimal threshold for the entire dataset, the optimal thresholds
for each category, and the average, median, maximum, and
minimum values of the thresholds of all the categories as the
candidates. We then computed the difference in
performance corresponding to each threshold, including
precision, recall, and F1 score, across each category, and chose
the one with the minimum difference as the fairness threshold.
To help analyze the effect of choosing different thresholds
on performance, we plotted the performance boundary,
including the precision, recall, and F1 score boundaries, for
each fairness threshold candidate. The performance
boundary shows that the best performance does not imply fairness.
For instance, the threshold based on males, which is also the
maximum threshold in Figure 1, has the highest precision,
but it also has the maximum difference, which means the
performance corresponding to this threshold is the least fair.
Furthermore, a model with fair outcomes can have
comparable performance. For instance, the fairness threshold in
Table 1 performs even a bit better than the optimal threshold
for the entire dataset, which is normally the one we
care about.</p>
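      <p>The selection procedure above can be sketched as follows; this is illustrative code under our assumptions (a single metric, predicted scores, and a candidate list built from the per-category Youden optima and their aggregates), not the exact implementation used in the experiments:</p>
      <preformat>
import numpy as np
from sklearn.metrics import precision_score

def fairness_threshold(y_true, y_score, group, candidates):
    # Pick the candidate threshold that minimizes the spread (variance)
    # of the metric across the protected categories and the full dataset.
    best_t, best_spread = None, np.inf
    for t in candidates:
        y_pred = (y_score >= t).astype(int)
        scores = [precision_score(y_true[group == g], y_pred[group == g],
                                  zero_division=0)
                  for g in np.unique(group)]
        scores.append(precision_score(y_true, y_pred, zero_division=0))
        if np.var(scores) &lt; best_spread:
            best_t, best_spread = t, np.var(scores)
    return best_t

# Candidates: the overall optimum, the per-category optima, and their
# average, median, maximum and minimum (per_cat is a list of thresholds):
# candidates = [t_all, *per_cat, np.mean(per_cat), np.median(per_cat),
#               max(per_cat), min(per_cat)]
      </preformat>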
    </sec>
    <sec id="sec-8">
      <title>Effect of Removal of Protected Features</title>
      <p>One apparent choice for making the performance of each
category of a protected feature similar is to remove the
protected feature so that it cannot influence the performance
directly. Tables 5, 6, and 7 in the appendix give the
performance of methods trained with and without protected
features on the LOS dataset for gender, age, and race
respectively. Increases in performance are marked in red, and
decreases in the variance of the performance across all the
categories are marked in blue.</p>
      <p>One thing to note is that the model trained without the
protected feature, gender, has more fair performance than
the one trained with all the features. This also implies that
the underlying model was most likely using gender in its
predictions. One can also find the optimal threshold for the
entire dataset and the ones for all the categories. We note that in
this particular example the difference in the variance of the
models is not statistically significant. We found similar
results for the other two models. We however emphasize that
the current models are for demonstrating the feasibility of
the proposed methods and we plan to explore this further as
described in the future work section below.</p>
      <p>With the above disclaimer, one can still observe that there
are certain differences in the performance of the algorithms
which can be used to design experiments and analysis in the
future. In summary, the results show that (see the sketch
after this list):
• The variance of the performance is likely to decrease by
10%-30%, and sometimes by over 60%, which means that
removal of protected features can help the performance
become more fair;
• Although it may appear that the performance is more
likely to improve with the inclusion of the protected
feature, it mostly increases by under 5%, which can be
interpreted as an insignificant change in performance.</p>
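      <p>A minimal sketch of this ablation experiment follows. It assumes pandas and scikit-learn, uses an in-sample evaluation for brevity (a held-out split would be used in practice), and all function names are ours:</p>
      <preformat>
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

def ablation_variance(X: pd.DataFrame, y: pd.Series, protected: str):
    # Train the same model with and without the protected feature and
    # compare the variance of per-category precision (lower = more fair).
    groups, out = X[protected], {}
    for label, feats in [("with", X), ("without", X.drop(columns=[protected]))]:
        enc = pd.get_dummies(feats)  # one-hot encode categorical columns
        y_pred = RandomForestClassifier(random_state=0).fit(enc, y).predict(enc)
        per_cat = [precision_score(y[groups == g], y_pred[groups == g],
                                   zero_division=0)
                   for g in groups.unique()]
        out[label] = np.var(per_cat)
    return out  # e.g. {"with": 0.0042, "without": 0.0011} (hypothetical)
      </preformat>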
    </sec>
    <sec id="sec-9">
      <title>Effect of Removal of Proxy Features</title>
      <p>A further idea that we explore is the effect of the removal of
proxy features, especially ones with high importance scores
with respect to the predictive performance. This is to ensure
that the indirect influence of the protected features on the
outcomes is not factored into the model and that the model in
general is fair. We define important features as follows: given $n$
features, the important features are the top $k$ features rank-sorted
by their importance scores. The importance of a feature is
calculated by how much the performance measure improves
on each attribute split, weighted by the number of
observations the node is responsible for, while training with the
Gradient Boosting method (Dash and Liu 1997) (Xu et al.
2014). Since we are more interested in the proxy features, we
only consider the important features that are highly
correlated with the protected features. Table 3 and Table 2 give
the performance of models trained on the Thyroid dataset
with and without the important features related to the
protected features, for gender and age respectively. For the
important features, we consider $k = 5$. The main takeaways can
be summarized as follows:
• The variance of the performance does not decrease if the
important features are removed.
• As expected, the performance of the models decreases
when the important features are removed.</p>
      <p>These observations imply that removing related important
features is not helpful for either improvement in predictive
performance or for fairness in general.</p>
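      <p>The proxy-feature screening described above can be sketched as follows; this is our illustrative code, assuming numeric features, scikit-learn's gradient boosting importances, and a hypothetical correlation cutoff of 0.3:</p>
      <preformat>
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def proxy_features(X: pd.DataFrame, y, protected: str, k=5, cutoff=0.3):
    # Top-k features by gradient-boosting importance that are also highly
    # correlated with the protected feature, i.e., proxy candidates.
    feats = X.drop(columns=[protected])
    model = GradientBoostingClassifier(random_state=0).fit(feats, y)
    top_k = feats.columns[np.argsort(model.feature_importances_)[::-1][:k]]
    prot = pd.factorize(X[protected])[0]  # numeric codes for the categories
    return [f for f in top_k
            if abs(np.corrcoef(feats[f], prot)[0, 1]) >= cutoff]

# Training without the proxies (illustrative):
# X_reduced = X.drop(columns=proxy_features(X, y, "gender"))
      </preformat>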
    </sec>
    <sec id="sec-10">
      <title>Effect of Sampling</title>
      <p>The distribution of classes in most protected features is
imbalanced. Consequently, problems related to class
imbalance in supervised learning are also prominent
in this domain. One way to mitigate the problem of class
imbalance in supervised learning is to over-sample the
underrepresented classes. We considered the distribution of the
protected feature ’age’ and another protected feature ’race’
in the LOS dataset as examples. We observe that minority
populations like African Americans and Asians are
underrepresented in the race feature. Similarly, pediatric patients
are under-represented in the age feature. To reduce the
influence of the under-representation of these categories, we
oversampled the training dataset to balance the size of each
category. We observe that the value of the optimal threshold
changes once sampling is done.</p>
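      <p>A minimal sketch of this balancing step, assuming pandas and scikit-learn's resample utility (the function name is ours):</p>
      <preformat>
import pandas as pd
from sklearn.utils import resample

def oversample_protected(train: pd.DataFrame, protected: str, seed=0):
    # Upsample every category of the protected feature (e.g. 'race' or
    # an 'age' group) to the size of the largest category.
    largest = train[protected].value_counts().max()
    parts = [resample(part, replace=True, n_samples=largest, random_state=seed)
             if len(part) &lt; largest else part
             for _, part in train.groupby(protected)]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # reshuffle
      </preformat>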
    </sec>
    <sec id="sec-11">
      <title>Fairness in Methods</title>
      <p>By comparing the AUC scores and the variance of the AUC
of the three methods, we can see that the ranking of the three
methods by AUC score is:
Logistic Regression &lt; Random Forest &lt; XGBoost
and the ranking by the variance in AUC is:
Logistic Regression &lt; Random Forest &lt; XGBoost (Race)
This leads us to the conclusion that an accurate
model does not necessarily imply fair outcomes. The
summary results for the experiments on the Diabetes dataset
are given in Table 4.</p>
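      <p>This comparison can be reproduced with a helper of the following form (a sketch under our assumptions; XGBoost is accessed via the xgboost package, and the helper name is ours):</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def auc_and_group_variance(model, X_tr, y_tr, X_te, y_te, groups):
    # Overall AUC and the variance of per-group AUC for one model;
    # a higher overall AUC need not come with a lower group variance.
    s = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    per_group = [roc_auc_score(y_te[groups == g], s[groups == g])
                 for g in np.unique(groups)]
    return roc_auc_score(y_te, s), np.var(per_group)

models = {"LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(random_state=0),
          "XGB": XGBClassifier()}
      </preformat>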
    </sec>
    <sec id="sec-12">
      <title>Conclusion</title>
      <p>Identifying unfairness is a challenging task. In this work, we
first compared the predictive performance across protected
features and used the variance of the performance as a
criterion to measure fairness. Second, we determined the optimal
thresholds chosen based on each category. We also explored
several ways to address unfairness. First, we trained
models without the protected features, which could help
reduce unfairness without causing a drop in performance.
The second method focused on the data dimension, which is
critical in the machine learning process because characteristics
like under-representation can be inherited or even
exacerbated by it. When the size of the dataset is small, or one or
several categories are under-represented, one can sample the
training dataset first to balance the size of each category.
Finally, when one has detected unfairness and prefers to
address it without retraining the model, one can find the
fairness threshold, which makes the performance more fair
while remaining comparable.</p>
      <p>Furthermore, the comparison among the AUC scores of different
models leads us to the conclusion that an accurate model does
not necessarily imply fairness; models with higher accuracy
may have less fair outcomes. We note that the fairness
measurements obtained for the best results for each prediction
problem do not necessarily correspond to the best possible
theoretical results. This is work in progress, and in follow-up
to this work we plan to explore theoretical aspects of the
fairness threshold. The methods and experiments described in
this paper are part of a proof of concept to test the
efficacy of methods to detect unfairness in machine learning
in healthcare use cases. Our plan is to incorporate insights
gleaned from this exploratory analysis into a
production-deployed system at a large-scale healthcare system in the
United States.</p>
    </sec>
    <sec id="sec-13">
      <title>References</title>
      <p>Allen, C.; Ahmad, M. A.; Eckert, C.; Hu, J.; and Kumar, V. 2020.
fairMLHealth: Tools and tutorials for evaluation of fairness and
bias in healthcare applications of machine learning models.
https://github.com/KenSciResearch/fairMLHealth.</p>
      <p>Assembly, U. G. 1948. Universal declaration of human
rights. UN General Assembly 302(2).</p>
      <p>Barocas, S., and Selbst, A. D. 2016. Big data’s disparate
impact. Calif. L. Rev. 104:671.</p>
      <p>Corbett-Davies, S., and Goel, S. 2018. The measure and
mismeasure of fairness: A critical review of fair machine
learning. arXiv preprint arXiv:1808.00023.</p>
      <p>Crawford, K. 2013. The hidden biases in big data. Harvard
business review 1(1):814.</p>
      <p>Dash, M., and Liu, H. 1997. Feature selection for
classification. Intelligent data analysis 1(3):131–156.</p>
      <p>Dheeru, D., and Karra Taniskidou, E. 2017. UCI Machine
Learning Repository.</p>
      <p>Johnson, A. E.; Pollard, T. J.; Shen, L.; Li-Wei, H. L.; Feng,
M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and
Mark, R. G. 2016. MIMIC-III, a freely accessible critical care
database. Scientific data 3(1):1–9.</p>
      <p>Martinez, N.; Bertran, M.; and Sapiro, G. 2019. Fairness
with minimal harm: A pareto-optimal approach for
healthcare. arXiv preprint arXiv:1911.06935.</p>
      <p>Schisterman, E. F.; Perkins, N. J.; Liu, A.; and Bondell, H.
2005. Optimal cut-point and its corresponding Youden
index to discriminate individuals using pooled blood samples.
Epidemiology 73–81.</p>
      <p>Smith, J. W.; Everhart, J.; Dickson, W.; Knowler, W.; and
Johannes, R. 1988. Using the adap learning algorithm to
forecast the onset of diabetes mellitus. In Proceedings of
the Annual Symposium on Computer Application in Medical
Care, 261. American Medical Informatics Association.</p>
      <p>Tramer, F.; Atlidakis, V.; Geambasu, R.; Hsu, D.; Hubaux,
J.-P.; Humbert, M.; Juels, A.; and Lin, H. 2017. Fairtest:
Discovering unwarranted associations in data-driven
applications. In 2017 IEEE European Symposium on Security
and Privacy (EuroS&amp;P), 401–416. IEEE.</p>
      <p>Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X.
2014. Gradient boosted feature selection. In Proceedings of
the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, 522–531.</p>
      <p>Youden, W. J. 1950. Index for rating diagnostic tests.
Cancer 3(1):32–35.</p>
      <p>Zliobaite, I. 2015. A survey on measuring
indirect discrimination in machine learning. arXiv preprint
arXiv:1511.00148.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Teredesai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fairness in machine learning for healthcare</article-title>
          .
          <source>In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <fpage>3529</fpage>
          -
          <lpage>3530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Teredesai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Interpretable machine learning in healthcare</article-title>
          .
          <source>In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics</source>
          ,
          <fpage>559</fpage>
          -
          <lpage>560</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>