=Paper=
{{Paper
|id=Vol-3087/paper_36
|storemode=property
|title=Blackbox Post-Processing for Multiclass Fairness
|pdfUrl=https://ceur-ws.org/Vol-3087/paper_36.pdf
|volume=Vol-3087
|authors=Preston Putzel,Scott Lee
|dblpUrl=https://dblp.org/rec/conf/aaai/PutzelL22
}}
==Blackbox Post-Processing for Multiclass Fairness==
Preston Putzel¹* and Scott Lee²*

¹ Department of Computer Science, University of California, Irvine, CA 92697, USA
² Centers for Disease Control and Prevention, 1600 Clifton Rd., Atlanta, GA, USA

* These authors contributed equally.
Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Applying standard machine learning approaches for classification can produce unequal results across different demographic groups. When such models are then used in real-world settings, these inequities can have negative societal impacts. This has motivated the development of various approaches to fair classification with machine learning models in recent years. In this paper, we consider the problem of modifying the predictions of a blackbox machine learning classifier in order to achieve fairness in a multiclass setting. To accomplish this, we extend the 'post-processing' approach in Hardt, Price, and Srebro (2016), which focuses on fairness for binary classification, to the setting of fair multiclass classification. We explore when our approach produces both fair and accurate predictions through systematic synthetic experiments and also evaluate discrimination-fairness tradeoffs on several publicly available real-world application datasets. We find that, overall, our approach produces minor drops in accuracy and enforces fairness when the number of individuals in the dataset is high relative to the number of classes and protected groups.

===Introduction===
As machine learning begins moving into sensitive prediction tasks, it becomes critical to ensure the fair performance of prediction models. Naively trained machine learning systems can replicate biases present in their training data, resulting in unfair outcomes that can accentuate societal inequities. For example, machine learning systems have been discovered to be unfair in predicting time to criminal recidivism (Dieterich, Mendoza, and Brennan 2016), ranking applications to nursing school (Romano, Bates, and Candès 2020), and recognizing faces (Buolamwini and Gebru 2018).

Most prior work in this area has focused on ensuring fairness for binary outcomes. However, there are many important real-world applications with multiclass outcomes instead. For example, a self-driving car will need to distinguish clearly between humans, non-human animals (such as dogs), and non-sentient objects while nonetheless maintaining fair performance for both wheelchair users and non-wheelchair users. Most work has also been done under the assumption that model parameters are accessible to the algorithm, but there is increasing availability of powerful blackbox models whose internal parameters are either inaccessible or too costly to retrain. In this paper, we address the case where outcomes are multiclass and the user has received a pre-trained blackbox model. The main contributions of our work are as follows:
• We show how to extend Hardt, Price, and Srebro (2016) to multiclass outcomes.
• We demonstrate in what data regimes multiclass post-processing is likely to produce fair, useful, and accurate results via a set of rigorous synthetic experiments.
• We demonstrate the results of our post-processing algorithm on publicly available real-world applications.

'''Code and Dataset Availability.''' All of the code used to produce our experimental results, as well as the synthetic and real-world datasets, can be found on our github page: https://github.com/scotthlee/fairness/tree/aaai

===Technical Approach===
As in Hardt, Price, and Srebro (2016), we consider the problem of enforcing fairness on a blackbox classifier without changing its internal parameters. This means that our approach only has access to the predicted labels ŷ_i from the blackbox classifier, the true labels y_i, and the protected attributes a_i for i ∈ {1, ..., N}, where N is the number of individuals. The goal of our approach is to produce a new set of updated and fair 'adjusted' predictions y_i^adj that satisfy a desired fairness criterion.
For each of ŷ_i, y_i, and a_i, we define corresponding random variables Ŷ, Y, and A. Then, following Hardt, Price, and Srebro (2016), we define the random variable for the adjusted predictions, Y^adj, to be a randomized function of Ŷ and A. We extend the approach in Hardt, Price, and Srebro (2016) by allowing multiclass outcomes, such that the sample spaces of Ŷ, Y, and Y^adj are a collection of discrete and mutually exclusive outcomes C = {1, 2, ..., |C|}. We in principle allow the sample space of the protected group A, denoted 𝒜, to contain any number of discrete values as well: 𝒜 = {1, 2, ..., |𝒜|}.

'''Linear Program.''' Our approach involves the construction of a linear program over the conditional probabilities of the adjusted predictor, Pr(Y^adj = y^adj | Ŷ = ŷ, A = a), such that a desired fairness criterion is satisfied by those probabilities. In order to construct the linear program, both the loss and the fairness criteria must be linear in terms of the protected attribute conditional probability matrices P_a = Pr(Y^adj | Ŷ, A = a), which have dimensions |C| × |C|.

'''Types of Objective Functions.''' We consider objective functions which are linear in the group conditional adjusted probabilities P_a. More specifically, we consider minimizing expected losses of the form:

E[l(Y^adj, Y)] = Σ_{a ∈ 𝒜} Σ_{i=1}^{|C|} Σ_{j ≠ i} Pr(Y^adj = i, Y = j, A = a) · l(i, j, a)
             = Σ_{a ∈ 𝒜} Σ_{i=1}^{|C|} Σ_{j ≠ i} W^a_{ij} · Pr(A = a, Y = j) · l(i, j, a)

where W^a_{ij} = Pr(Y^adj = i | Y = j, A = a) are the protected attribute conditional confusion matrices. Under the independence assumption Y^adj ⊥ Y | A, Ŷ, we can write W_a = P_a Z_a, where Z_a = Pr(Ŷ | Y, A = a) are the class conditional confusion matrices of the original blackbox classifier's predictions. The matrices Z_a are estimated empirically from the training data (y_i and a_i) and the blackbox predictions of the model (ŷ_i). Therefore, this formulation of the objective function remains linear in the protected attribute conditional probability matrices P_a, as is necessary for the linear program. This formulation is similar to Hardt, Price, and Srebro (2016), except that we let the loss l(i, j, a) also be a function of the protected attribute instead of just the true and adjusted labels, which allows controlling the strictness of penalties for errors made for specific protected groups and classes. The most straightforward version of this loss is letting l(y^adj, y, a) be the zero-one loss (ignoring the protected attributes), which results in minimizing the sum of the joint probabilities of mismatch between Y^adj and Y. We refer to this approach as the unweighted loss. Another approach is to set l(y^adj, y, a) equal to one over the joint probability of the true label and protected attribute, 1/Pr(Y = y, A = a) (estimated empirically), which we refer to as the weighted loss. Intuitively, this option reweights the loss to give rarer protected group and label combinations equal importance in the optimization, which could improve fairness when very low membership minority protected groups exist in the dataset. This option for the objective function can be equivalently minimized by maximizing the diagonals (true detection rates) of the group conditional confusion matrices W_a.

'''Types of Fairness.''' We consider several versions of multiclass fairness criteria, all of which can be written as a collection of |𝒜| − 1 pairwise equalities setting a fairness criterion of interest equal across all groups. Moreover, each of the terms in these equalities can be written as some |C| × |C| matrix M_a times the adjusted probability matrix P_a, and therefore is linear in the adjusted probabilities, as needed for the linear program (see Appendix A for the exact form M_a takes for the different fairness criteria).

The first definition requires strictly equal performance across protected groups.

Definition 1 (Term-by-Term Multiclass Equality of Odds). A multiclass predictor satisfies term-by-term equality of odds if the protected group conditional confusion matrices W_a are equal across all protected groups:

W_1 = W_2 = · · · = W_|𝒜|     (1)

where W_a = Pr(Y^adj | Y, A = a).

This is a straightforward extension to the multiclass case of equality of odds as defined in Hardt, Price, and Srebro (2016). Notice that since this definition requires equality of each off-diagonal term of W_a across all groups, it enforces not only that errors are made at the same overall rate across groups, but also that the rates of specific types of errors are equal. For some practical applications, term-by-term equality of odds is important, such as predicting criminal recidivism times binned into three years, two years, one year, and "never recommits". In this case, making the error of predicting 3 years until recidivism when the actual time is 1 year is much worse than predicting 3 years when the actual time is 2. Therefore, it is critical for fairness in this application that the rates of specific types of errors are strictly equal across groups.

Instead of requiring strict equality of the off-diagonal terms of W_a, we can instead enforce equality across the classwise overall false detection rates (FDR), which leads to the next fairness definition.

Definition 2 (Classwise Multiclass Equality of Odds). A multiclass predictor satisfies classwise multiclass equality of odds if the diagonals of the protected group conditional confusion matrices and the protected attribute conditional vectors of false detection rates are equal across all protected groups:

diag(W_1) = diag(W_2) = · · · = diag(W_|𝒜|)
FDR_1 = FDR_2 = · · · = FDR_|𝒜|     (2)

where FDR_a = Pr(Y^adj | Y^adj ≠ Y, A = a).

This version of fairness can 'trade' better performance for a specific protected group on one off-diagonal term in W_a (i.e., lower error probability for that term) for poorer performance of the same group on a different off-diagonal term (i.e., higher error probability for another term). Individually, each class label has its true detection rate and overall false detection rate set equal across groups. Thus, this type of fairness is 'classwise'.

For some problems it is sufficient to maintain fair true detection rates across classes and allow false detection rates to differ across groups. This is even less restrictive than Definition 2. This may be desirable when, for example, deciding whether an accepted college application should be admitted into an honors program, accepted with a scholarship, or regularly accepted. Since all the outcomes are positive, unfairness across false detection rates may not be critical, as long as the true detection rates are fair across groups. This motivates the following fairness criterion.

Definition 3 (Multiclass Equality of Opportunity). A multiclass predictor satisfies equality of opportunity if the diagonals of the protected group conditional confusion matrices W_a are equal across all groups:

diag(W_1) = diag(W_2) = · · · = diag(W_|𝒜|)     (3)

where W_a = Pr(Y^adj | Y, A = a).

A common and even more relaxed version of fairness called demographic parity only requires the rates of class predictions across different groups to be equal (Calders, Kamiran, and Pechenizkiy 2009).

Definition 4 (Multiclass Demographic Parity). A multiclass predictor satisfies demographic parity if the protected group conditional class probabilities are equal across groups:

Pr(Y^adj | A = 1) = Pr(Y^adj | A = 2) = · · · = Pr(Y^adj | A = |𝒜|)     (4)

Enforcing this version of fairness for certain datasets may produce effectively unfair outcomes (Dwork et al. 2012). Moreover, in synthetically produced data, this definition has been shown to reduce the reputation of disadvantaged protected groups when repeatedly applied over a long period of time to sensitive decision-making tasks such as hiring (Hu and Chen 2018).

Note that while the learned adjusted probabilities P_a are guaranteed to be fair after running the linear program, taking the maximum value over the learned probabilities when predicting at the individual level will not maintain fairness. In fact, taking the maximum over the adjusted probabilities can simply result in predictions identical to those made by the original blackbox classifier. Instead, when predicting the class of an individual, the corresponding learned adjusted probabilities must be sampled from in order to maintain the fairness guarantee.
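To make the construction above concrete, here is a minimal sketch of how the post-processing step could be implemented with off-the-shelf tools. It is not the code released with the paper: it assumes integer-coded numpy arrays for the labels, predictions, and protected attribute, uses the convex-optimization library cvxpy to express the linear program, enforces term-by-term equality of odds (Definition 1) with the unweighted loss, and all function names (empirical_tables, fit_adjustment, sample_adjusted) are ours.

```python
import numpy as np
import cvxpy as cp


def empirical_tables(y, y_hat, a, n_classes, n_groups):
    """Estimate Z_a[k, j] = Pr(Yhat = k | Y = j, A = a) and
    p_a[j] = Pr(Y = j, A = a) from integer-coded arrays y, y_hat, a."""
    Z = np.zeros((n_groups, n_classes, n_classes))
    p = np.zeros((n_groups, n_classes))
    n = len(y)
    for g in range(n_groups):
        for j in range(n_classes):
            mask = (a == g) & (y == j)
            p[g, j] = mask.sum() / n
            if mask.any():
                for k in range(n_classes):
                    Z[g, k, j] = np.mean(y_hat[mask] == k)
    return Z, p


def fit_adjustment(y, y_hat, a, n_classes, n_groups):
    """Solve the LP for P_a[i, k] = Pr(Y_adj = i | Yhat = k, A = a) under
    term-by-term equality of odds (Definition 1), unweighted loss."""
    Z, p = empirical_tables(y, y_hat, a, n_classes, n_groups)
    P = [cp.Variable((n_classes, n_classes), nonneg=True) for _ in range(n_groups)]
    W = [P[g] @ Z[g] for g in range(n_groups)]                # W_a = P_a Z_a
    constraints = [cp.sum(P[g], axis=0) == 1 for g in range(n_groups)]  # columns are distributions
    constraints += [W[g] == W[0] for g in range(1, n_groups)]           # Definition 1
    # Unweighted (zero-one) loss: sum of off-diagonal W_a[i, j] weighted by Pr(Y = j, A = a).
    off_diag = 1.0 - np.eye(n_classes)
    loss = sum(cp.sum(cp.multiply(off_diag * p[g], W[g])) for g in range(n_groups))
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return [P[g].value for g in range(n_groups)]


def sample_adjusted(P, y_hat, a, seed=0):
    """Randomized prediction: sample Y_adj from column y_hat of the group's P_a."""
    rng = np.random.default_rng(seed)
    out = np.empty(len(y_hat), dtype=int)
    for idx, (k, g) in enumerate(zip(y_hat, a)):
        col = np.clip(P[g][:, k], 0, None)
        out[idx] = rng.choice(len(col), p=col / col.sum())
    return out
```

Under these assumptions, the weighted loss corresponds to dropping the Pr(Y = j, A = a) factor from the objective weights (equivalently, maximizing the diagonals of W_a), and the other fairness criteria replace the W constraints with equalities on diagonals, false detection rates, or the group-conditional prediction probabilities.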
===Related Work===
Most prior work on post-processing based fairness approaches focuses on binary prediction tasks. Wei, Ramamurthy, and Calmon (2019) create a post-processing algorithm that modifies the raw scores of a binary classifier (instead of thresholded hard predictions) in order to achieve desired fairness constraints expressed as linear combinations of the per-group expected raw scores. Ye and Xie (2020) develop a general in-processing fairness framework which alternates between selecting a subset of the training data and fitting a classifier to that data.

Several adversarial approaches to multiclass fairness have been investigated recently, although these are not blackbox post-processing algorithms. Zhang, Lemoine, and Mitchell (2018) first present the idea of adversarial debiasing, while Romano, Bates, and Candès (2020) present a multiclass approach for in-process training based on adversarial learning, with the discriminator distinguishing between the distribution of the model's current predictions, the true labels, and artificial protected attributes resampled to be fair, and the true distribution of the predictions, true labels, and true protected attributes.

Multiclass blackbox post-processing techniques are less studied, although there have been a few new approaches recently. Notably, Denis et al. (2021) derive an optimally fair classifier from a pre-trained model and show several nice theoretical guarantees, including the asymptotic fairness of their proposed plug-in estimator. We see three key differences between their approach and the extension to Hardt, Price, and Srebro (2016) that we propose: they only consider binary protected attributes (|𝒜| = 2), while we allow categorical protected attributes (|𝒜| > 2) that can take on any number of unique values, at least theoretically; their method requires fitting a new estimator to the test data, whereas ours only requires computing probabilities and solving a linear program, which is relatively efficient; and, perhaps most importantly, their approach is limited to the demographic parity fairness constraint, whereas our approach applies to any constraint that is linear in P_a.

In broader terms, Hossain, Mladenovic, and Shah (2020) unify many of the published methods for learning fair classifiers by showing that equalized odds, equal opportunity, and other common measures of fairness in the binary setting are subsumed by their proposed generalizations of the economic notions of envy-freeness and equitability. They show that these generalizations of fairness apply to the multiclass setting, but that post-processing techniques are incapable of achieving them. We show here that this notion is not entirely correct, at least in a narrow sense, and that fairness can be achieved with post-processing techniques in the multiclass setting, so long as the joint distribution Pr(Y, Ŷ, A) is either fully known or can be reasonably approximated by a large-enough sample of training data.

===Synthetic Data Experiments===
'''Synthetic Data.''' To explore the effect of different data regimes and optimization goals on post-adjustment discrimination, we conducted thorough (though by no means exhaustive) synthetic experiments for a 3-class outcome. We constructed synthetic datasets with N = 1,000 observations for each unique combination of the following data-generating hyperparameters (a minimal sketch of how a single dataset of this kind might be generated is shown after the list):
• The number of unique values for the protected attribute, |𝒜|. We explored setting |𝒜| = 2 or |𝒜| = 3 (see results with |𝒜| = 2 in our github repository).
• The amount of class imbalance for the labels Y. For simplicity, we did not allow this to vary across protected groups.
• Group balance, or the number and relative size of minority groups compared to majority groups. This varied according to the number of groups but was generally either none, weak, or strong.
• Predictive bias, defined as the difference in mean true detection rate (TDR) between the groups. We vary this from mild predictive bias (a 10 percent difference) to severe bias, with the minority group TDR being near chance. The predictive bias is set to always favor the majority group.
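For illustration, the sketch below generates one synthetic dataset under our own simplifying assumptions: labels are drawn from a chosen class distribution, groups from a chosen group distribution, and the simulated blackbox predictions are corrupted at group-specific rates so that minority groups receive a lower true detection rate. The paper's actual generator may differ; all names and parameter values here are illustrative only.

```python
import numpy as np


def make_synthetic(n=1000, class_probs=(0.34, 0.33, 0.33),
                   group_probs=(0.5, 0.3, 0.2), group_tdr=(0.8, 0.7, 0.5),
                   seed=0):
    """Generate (y, y_hat, a) with group-dependent predictive bias.

    group_tdr[g] is the probability that the simulated blackbox classifier
    predicts the true class for members of group g; otherwise it predicts
    one of the remaining classes uniformly at random.
    """
    rng = np.random.default_rng(seed)
    n_classes, n_groups = len(class_probs), len(group_probs)
    y = rng.choice(n_classes, size=n, p=class_probs)     # class imbalance
    a = rng.choice(n_groups, size=n, p=group_probs)      # group balance
    y_hat = y.copy()
    for i in range(n):
        if rng.random() > group_tdr[a[i]]:                # group-specific error rate
            wrong = [c for c in range(n_classes) if c != y[i]]
            y_hat[i] = rng.choice(wrong)
    return y, y_hat, a
```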
This process yielded 117 datasets. For each one, we ran the linear program to adjust the (synthetic) biased blackbox predictions 8 times, once for each unique combination of the objective function and type of fairness, yielding a total of 936 adjustments. After each adjustment, we recorded two broad measures of the fair predictor's performance:
• Triviality, or whether any of the columns in W_a = Pr(Y^adj | Y, A = a) contained all zeroes (i.e., whether any levels of the outcome were no longer predicted).
• Discrimination, or the percent change in loss for the adjusted predictor relative to that of the original predictor. For this measure, we examined two specific metrics: global accuracy and the mean of the group-wise TDRs. These are equivalent to 1 minus the post-adjustment loss under the two versions of the objective function we present above.

To quantify the average effect of each hyperparameter on discrimination, we fit two multivariable linear regression models to the resulting dataset, one for each discrimination metric. Before fitting the models, we converted the categorical hyperparameters (so all but loss) to one-hot variables, and then we set a reference level for each, removing the corresponding column from the design matrix. We then fit the models separately using ordinary least squares (OLS) and calculated confidence intervals (CIs) for the resulting coefficients.
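A regression of this form can be fit with standard tooling. The sketch below uses the statsmodels formula interface, whose treatment coding of categorical variables performs the one-hot expansion with a reference level automatically; the file and column names are hypothetical stand-ins for the experiment log, not the paper's variable names.

```python
import pandas as pd
import statsmodels.formula.api as smf

# results: one row per adjustment, with the hyperparameter settings and the
# observed change in accuracy (a second model uses change in mean TDR).
results = pd.read_csv("synthetic_results.csv")   # hypothetical experiment log

# C(...) expands each categorical hyperparameter into one-hot columns,
# dropping a reference level; the intercept absorbs the reference cell.
model = smf.ols(
    "change_in_acc ~ C(loss) + C(goal) + C(group_balance)"
    " + C(class_balance) + C(pred_bias)",
    data=results,
).fit()

print(model.params)            # coefficients, as reported in Table 1
print(model.conf_int(0.05))    # 95% confidence intervals
```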
{| class="wikitable"
|+ Table 1: Predicted change and 95% confidence intervals for accuracy and mean TDR as a function of the experimental hyperparameters in our synthetic datasets with a three-level protected attribute. All datasets had a 3-class outcome.
! Hyperparameter !! Level !! Change in Acc (CI) !! Change in TDR (CI)
|-
| Intercept || – || -0.13 (-0.17, -0.09) || -0.18 (-0.21, -0.15)
|-
| Loss || Unweighted || – || –
|-
| || Weighted || -0.11 (-0.13, -0.09) || 0.12 (0.10, 0.13)
|-
| Goal || Equalized Odds || – || –
|-
| || Demographic Parity || 0.24 (0.22, 0.27) || 0.21 (0.18, 0.23)
|-
| || Equal Opportunity || 0.08 (0.05, 0.11) || 0.03 (0.01, 0.05)
|-
| || Term-by-Term || 0.08 (0.05, 0.11) || 0.02 (-0.01, 0.04)
|-
| Group Balance || No Minority || – || –
|-
| || One Slight Minority || -0.03 (-0.06, 0.00) || -0.02 (-0.04, 0.01)
|-
| || One Strong Minority || -0.04 (-0.07, -0.00) || -0.01 (-0.03, 0.02)
|-
| || Two Slight Minorities || -0.05 (-0.08, -0.02) || -0.02 (-0.04, 0.01)
|-
| || Two Strong Minorities || -0.07 (-0.11, -0.04) || -0.01 (-0.04, 0.01)
|-
| Class Balance || Balanced || – || –
|-
| || One Rare || 0.02 (-0.00, 0.04) || -0.04 (-0.06, -0.02)
|-
| || Two Rare || 0.07 (0.04, 0.09) || -0.18 (-0.20, -0.17)
|-
| Pred Bias || Low One || – || –
|-
| || Low Two || 0.00 (-0.03, 0.04) || -0.00 (-0.03, 0.02)
|-
| || Medium One || -0.06 (-0.09, -0.02) || -0.06 (-0.08, -0.03)
|-
| || Medium Two || -0.04 (-0.07, -0.00) || -0.06 (-0.08, -0.03)
|-
| || High One || -0.18 (-0.22, -0.15) || -0.16 (-0.19, -0.14)
|-
| || High Two || -0.15 (-0.19, -0.12) || -0.13 (-0.16, -0.11)
|}

'''Results.''' Table 1 shows coefficients and 95% confidence intervals for the regression models with |𝒜| = 3. The results highlight several important points:
• Predictive bias and class imbalance are the two main drivers of decreases in post-adjustment discrimination, for both accuracy and TDR.
• High group imbalance for the protected attributes lowers post-adjustment discrimination, but only from the perspective of global accuracy: even with two strong minorities (3-group scenario), mean TDR only drops by 1.1%.
• Relative to the weighted objective, the unweighted objective leads to higher scores for global accuracy but lower scores for mean TDR. This is perhaps unsurprising, but it is worth noting nonetheless.
• Despite finding better accuracy solutions, we also found that the unweighted objective leads to trivial solutions far more frequently (30% of the time it was used) than the weighted version of the loss (0.2% of the time it was used). This trend will likely worsen with increasing dimension of either the number of classes or the number of protected groups.
• Fairness is generally harder to achieve with 3 protected groups than with 2, since the intercepts are lower for both accuracy and mean TDR. We believe this to be a general consequence of forcing fairness across more groups and expect this trend to continue as the number of groups increases.

===Experiments with Real-World Data===
'''Dataset Descriptions.''' To further examine the performance characteristics of our algorithm, we ran it on several real-world datasets described below.
1. Drug Usage (Fehrman et al. 2017). This dataset has inherently multiclass outcomes, with the target being a 7-level categorical variable indicating recentness of use for a variety of drugs. We focus on predicting cannabis usage, where we collapsed the 7-level usage indicator into 3 broader categories: never used, used but not in the past year, and used in the past year. Predictors included demographic variables like age, gender, and level of education, as well as a variety of measures of personality traits hypothesized to affect usage habits.
2. Obesity (Palechor and de la Hoz Manotas 2019). This dataset has inherently multiclass outcomes, with the target being a 7-level categorical variable indicating weight category; the protected attribute is gender (Male/Female). Because some of the observations are synthetic in order to protect privacy, not all of the gender/weight categories had sufficient numbers for modeling, and so we omitted observations from the 2 most extreme weight categories, Obesity Type-II and Obesity Type-III, leaving a 5-level target for prediction. Predictors included age, gender, family medical history, and several measures of physical activity and behavioral health.
3. LSAC Bar Passage (Wightman 1998). This dataset has inherently multiclass outcomes, with the target being a 3-level variable indicating bar exam passage status (passed first time, passed second time, or did not pass). The protected attribute is race, which we collapsed from its original 8 levels to 2 (white and non-white). Predictors included mostly measures of educational achievement, like undergraduate GPA, law school GPA, and LSAT score.
4. Parkinson's Telemonitoring (Tsanas et al. 2009). This dataset does not have inherently multiclass outcomes, with the target for prediction being the Unified Parkinson's Disease Rating Scale (UPDRS), a continuous score that increases with the severity of impairment. We used Otsu's method to bin the continuous score into 3 categories (low, moderate, and high impairment), which we took as the new class labels. The protected attribute is a 2-level variable for gender (Male/Female). Predictors included mostly biomedical measurements from the voice recordings of patients with Parkinson's Disease.

For each of these datasets, we obtained a potentially biased predictor Ŷ by training a random forest on all available informative features (including the protected attribute) to predict the multiclass outcome, and then taking the categories corresponding to the row-wise maxima of the out-of-bag decision scores as the set of predicted labels. We then adjusted the predictions with the weighted objective and term-by-term equality of odds fairness constraint and recorded the relative changes in global accuracy and mean TDR as the outcome measures of interest, as with our synthetic experiments.

'''Exploring the Effect of Finite Sampling.''' Hardt, Price, and Srebro (2016) note that their method will not be affected by finite sample variability as long as the joint distribution Pr(Y, Ŷ, A) is known, or at least well-approximated by a large sample. In practical applications, however, the sample at hand may not be large enough to approximate the joint distribution with precision. This problem is exacerbated when the number of observations N is small relative to the number of probabilities learned by the algorithm, of which there are |C| × |C| × |𝒜| in total. This difficulty is therefore more severe for our extension in this work, where |C| > 2.

In these cases, the adjusted predictor Y^adj may have worse classification performance and higher disparity when applied to unseen, out-of-sample data. As a preliminary exploration of this effect, we used 5-fold cross-validation to generate out-of-sample predictions for each of the observations in our real-world datasets. Keeping Y, Ŷ, and A fixed, we solved the linear program on 80% of the data and then used the adjusted probabilities P_a to obtain class predictions for the observations in the remaining 20%. As with the predictions obtained from solving the linear program on the full dataset, we measured the changes in accuracy and mean TDR for the cross-validated predictions. Because fairness is not guaranteed when the joint distribution assumption is violated, we also measured post-adjustment fairness.
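The post-adjustment disparity reported below is the element-wise mean difference across groups of the confusion matrices W_a, estimated from the held-out predictions. A minimal sketch of that computation is shown here; summarizing it as the mean pairwise absolute difference is our reading of the reported metric, and the function names are ours rather than the paper's.

```python
import numpy as np
from itertools import combinations


def confusion_by_group(y, y_adj, a, n_classes, n_groups):
    """Empirical W_a[i, j] = Pr(Y_adj = i | Y = j, A = a)."""
    W = np.zeros((n_groups, n_classes, n_classes))
    for g in range(n_groups):
        for j in range(n_classes):
            mask = (a == g) & (y == j)
            if mask.any():
                for i in range(n_classes):
                    W[g, i, j] = np.mean(y_adj[mask] == i)
    return W


def post_adjustment_disparity(y, y_adj, a, n_classes, n_groups):
    """Element-wise mean absolute difference of W_a over all pairs of groups."""
    W = confusion_by_group(y, y_adj, a, n_classes, n_groups)
    diffs = [np.abs(W[g1] - W[g2]).mean()
             for g1, g2 in combinations(range(n_groups), 2)]
    return float(np.mean(diffs))
```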
'''Exploring the Fairness-Discrimination Tradeoff.''' When there are large gaps in a predictor's performance across groups, i.e., when predictive bias is high, strict fairness may not always be possible or desirable to achieve because of the large amount of randomization required to balance the blackbox classifier's predictions. To explore the tradeoff between fairness and discrimination, we ran the linear program on each of the real-world datasets once for each of the four kinds of fairness. For each combination of dataset and fairness type, we varied the equality constraints of the linear program (the maximum percent difference allowed between any pairwise comparison of fairness measures between groups) from 0.0 to 1.0 in increments of 0.01, and then plotted the value of the weighted objective at each point as a function of the global measure of fairness corresponding to the fairness type under consideration. To obtain these global measures, we took the maximum of the mean differences across pairs of groups of the following metrics:
• W, the matrix of probabilities Pr(Y^adj | Y), for term-by-term equality of odds;
• Youden's J index, TDR + (1 − FDR) − 1, for classwise equality of odds;
• TDR for equal opportunity;
• Pr(Y^adj) for demographic parity.

We note here that taking the maximum of the maxima of the pairwise differences would also be a valid and sensible global measure. So that the plots show performance under optimal conditions, we do not use cross-validation to obtain Y^adj; i.e., we obtain it by solving the linear program on the entire dataset.
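Relaxing the equality constraints in this way amounts to replacing each exact equality in the linear program with a pair of inequalities bounded by a slack parameter. A hedged sketch of that change, in the same cvxpy style as the earlier sketch, is shown below; note that it uses an absolute slack for simplicity, whereas the paper describes a maximum percent difference, and the helper names are ours.

```python
import cvxpy as cp


def relaxed_equality(lhs, rhs, eps):
    """Replace an exact fairness equality with |lhs - rhs| <= eps, element-wise."""
    return [lhs - rhs <= eps, rhs - lhs <= eps]


def tradeoff_constraints(W, eps):
    """Term-by-term equality of odds with slack eps, given the expressions
    W = [P[g] @ Z[g] for each group g] from the earlier sketch."""
    constraints = []
    for g in range(1, len(W)):
        constraints += relaxed_equality(W[g], W[0], eps)
    return constraints
```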
'''Results.''' Table 2 shows changes in global accuracy and mean TDR after adjustment with the weighted objective and term-by-term equality of odds fairness constraint for our four datasets, using cross-validation as described above to capture some of the variability that comes with finite sampling. Overall, adjustment lowered both accuracy and mean TDR, although for the bar passage, drug usage, and Parkinson's datasets the drops were moderate, with average relative changes in the two metrics coming in at around 12% and 15%, respectively (without cross-validation, the drops were much smaller, at 3% and 4%). For the obesity dataset, the drops are much larger, at 47% and 46%, respectively, which are indeed substantial and would likely make the predictor unusable in practical settings. On in-sample data, these drops were both only around 7%, and so we suspect that characteristics of the data, like large class imbalance or small overall sample size, are responsible for the poor performance. Perhaps most importantly, the post-adjustment disparity for all datasets is non-zero, and for three of the datasets it actually increases. The bar passage dataset was the only example where the out-of-sample post-adjustment disparity decreased to near zero, likely due to it being the largest dataset. This starkly points out the sensitivity of the method to estimating the joint probabilities Pr(Y, Ŷ, A), and shows that the approach is unlikely to work in smaller dataset regimes which have a larger combination of classes and protected attributes. Note that for in-sample results, post-adjustment disparity drops completely to 0.0 for all datasets in Table 2, since it is strictly enforced by the linear program.

{| class="wikitable"
|+ In-Sample Results
! Dataset (N) !! # Terms !! Old Acc !! New Acc (% change) !! Old TDR !! New TDR (% change) !! Pre Disparity !! Post-Adj Disparity (% change)
|-
| Bar (N=22,406) || 18 || 88% || 88% (-1%) || 36% || 34% (-7%) || 0.11 || 0.00 (-100%)
|-
| Cannabis (N=1,885) || 18 || 74% || 71% (-4%) || 67% || 63% (-6%) || 0.07 || 0.00 (-100%)
|-
| Obesity (N=1,490) || 50 || 78% || 73% (-7%) || 78% || 73% (-7%) || 0.05 || 0.00 (-100%)
|-
| Parkinsons (N=5,875) || 18 || 93% || 91% (-2%) || 92% || 89% (-3%) || 0.04 || 0.00 (-100%)
|}

{| class="wikitable"
|+ Out-of-Sample Results
! Dataset (N) !! # Terms !! Old Acc !! New Acc (% change) !! Old TDR !! New TDR (% change) !! Pre Disparity !! Post-Adj Disparity (% change)
|-
| Bar (N=22,406) || 18 || 88% || 83% (-6%) || 36% || 33% (-8%) || 0.11 || 0.01 (-95%)
|-
| Cannabis (N=1,885) || 18 || 74% || 61% (-18%) || 67% || 52% (-22%) || 0.07 || 0.16 (+124%)
|-
| Obesity (N=1,490) || 50 || 78% || 41% (-47%) || 78% || 42% (-46%) || 0.05 || 0.07 (+45%)
|-
| Parkinsons (N=5,875) || 18 || 93% || 82% (-12%) || 92% || 78% (-15%) || 0.04 || 0.05 (+33%)
|}

Table 2: Results of applying the linear program to adjust the blackbox predictions and produce Y^adj for four real-world datasets. The top table is without any splitting; results in the bottom table are cross-validated across five 80/20 splits of each dataset. Accuracy and TDR are shown before and after adjustment, with TDR being the mean across all classes. Percent changes, shown in parentheses, are the relative percent changes in accuracy, mean TDR, and disparity. Post-adjustment disparity is the element-wise mean difference across all groups of W_a.

Figure 1 (not reproduced here): Fairness-discrimination plots for our post-processing algorithm on our 4 real-world datasets, created by systematically relaxing the fairness equality constraints of the linear program. The plots show Brier score as a function of the maximum average difference between groups of the corresponding fairness criterion. Performance of the original, unadjusted predictor is marked by an X.

Figure 1 shows fairness-discrimination plots for our 4 datasets with the weighted objective and each of the 4 fairness constraints. Under strict fairness, with the inequality set to 0, equalized odds is the hardest to satisfy, showing the largest increase in Brier score. For the drug usage, obesity, and Parkinson's datasets, discrimination improves approximately linearly as fairness worsens; for the bar passage dataset, discrimination improves to a point, but then worsens as fairness approaches the value for the original, unadjusted predictor Ŷ. For all datasets, the total loss of discrimination under strict fairness is relatively small (the biggest drop is around 7.5 percentage points in Brier score), but the random forests' predictions were only mildly biased to begin with, so we expect this gap to increase for less-fair predictors.

===Discussion===
Generally, our post-processing approach to achieving fairness in multiclass settings seems both feasible and efficient given a large enough dataset. We have shown above that the linear programming technique proposed by Hardt, Price, and Srebro (2016) can be extended to accommodate a theoretically arbitrarily large number of discrete outcomes and levels of a protected attribute. Nonetheless, our synthetic experiments and analyses of real-world datasets show that there are a few important considerations for using the approach in practice.

In many cases, the effect of finite sampling may be non-negligible, especially when the number of observations N is small relative to the number of outcomes |C| or the number of protected groups |𝒜|. For example, the obesity dataset, with |C| = 5 and N = 1,490, saw a large relative drop of 46% in mean TDR after adjustment under cross-validation. We also saw this effect extend to fairness, which was not reduced completely to zero on out-of-sample data for any of the real-world datasets. In fact, for the drug usage dataset we found that post-adjustment disparity doubled on out-of-sample data. This last observation raises a concerning point: for some classification problems, the post-adjustment predictions on out-of-sample data may increase disparity rather than lowering it. For the largest of the datasets, the bar passage dataset with N = 22,406, neither of these issues was a concern. Even under cross-validation, the relative change in TDR was only -8%, and the disparity dropped to near 0 (a -95% decrease). Given this, we expect that with a large enough dataset, our approach will be far more reliable on out-of-sample data. Future work more precisely quantifying the number of training examples needed for reliable out-of-sample fair performance with our approach is needed.

More generally, even when finite sampling variability is not an issue, not all datasets will lend themselves well to this kind of post-processing approach. In our synthetic experiments, we showed that severe class imbalance and severe predictive bias (predicting at nearly the level of chance for minority protected groups) lead to large drops in post-adjustment performance on average. In many of the single experimental runs for synthetic datasets with these settings, the resulting derived predictor was effectively useless, either producing trivial results or lowering predictive performance to near chance (for all groups) for one or more class outcomes. In these circumstances, it may be more sensible to enforce fairness through a combination of pre-processing, in-processing, and post-processing methods, rather than through a post-processing method alone. Indeed, Woodworth et al. (2017) make this point generally, albeit for the binary setting, by showing that unless the biased predictor Ŷ is very close to being Bayes optimal, the derived predictor Y^adj proposed by Hardt, Price, and Srebro (2016) can underperform relative to other methods, sometimes substantially. Under less extreme circumstances, however, we found that our approach produces good results, especially given the time-efficiency of solving the linear program relative to other methods.

===Acknowledgments===
This work was supported in part by the HPI Research Center in Machine Learning and Data Science at UC Irvine (P. Putzel), as well as in part by an appointment to the Research Participation Program at the Centers for Disease Control and Prevention, administered by the Oak Ridge Institute for Science and Education (P. Putzel). We would also like to thank Chad Heilig and Padhraic Smyth for their helpful comments on the approach and paper.
arXiv ready shown to be linear above, and also enforcing the over- preprint arXiv:2006.04292. all false detection rates to be equal across protected groups. Tsanas, A.; Little, M.; McSharry, P.; and Ramig, L. 2009. So in order for classwise multiclass equality of odds to be Accurate telemonitoring of Parkinson’s disease progression linear, the false detection rates must be linear in Pa , shown by non-invasive speech tests. Nature Precedings, 1–1. below: Wei, D.; Ramamurthy, K. N.; and Calmon, F. d. P. 2019. F DRca =P r(Y adj = c|Y 6= c, A = a) Optimized score transformation for fair classification. arXiv preprint arXiv:1906.00066. P r(Y adj = c, Y 6= c, A = a) = Wightman, L. F. 1998. LSAC National Longitudinal Bar P r(Y 6= c, A = a) Passage Study. LSAC Research Report Series. X X P r(Y adj = c, Y = c0 , Ŷ = j, A = a) = Woodworth, B.; Gunasekar, S.; Ohannessian, M. I.; and Sre- P r(Y 6= c, A = a) j c0 6=c bro, N. 2017. Learning non-discriminatory predictors. In Conference on Learning Theory, 1920–1953. PMLR. a a X X Pcj Zjc0 P r(Y = c0 , A = a) = Ye, Q.; and Xie, W. 2020. Unbiased Subdata Selection for P r(Y 6= c, A = a) j c0 6=c Fair Classification: A Unified Framework and Scalable Al- a X Zjc 0 X 0 P r(Y = c , A = a) gorithms. arXiv preprint arXiv:2012.12356. = a Pcj Zhang, B. H.; Lemoine, B.; and Mitchell, M. 2018. Mitigat- P r(Y 6= c, A = a) j c0 6=c ing unwanted biases with adversarial learning. In Proceed- X a a ings of the 2018 AAAI/ACM Conference on AI, Ethics, and = Pcj Vjc j Society, 335–340. a 0 a P Zjc 0 P r(Y =c ,A=a) where Vjc = 0 c 6=c P r(Y 6=c,A=a) . This allows us to Appendix A write the protected attribute conditional false detection rates Derivation of Linearity of Fairness Constraints: In or- as FDRa = diag(Pa Va ). As before, Va can be computed der to obtain linearity in the protected attribute conditional from the empirical estimates of Za , and P r(Y = i, A = j). probability matrices Pa we must find an expression of the For multiclass demographic parity we can write: form Wa = Pa Ma : Da =P r(Y adj |A = a) Wija =P r(Y adj = i|Y = j, A = a) 1 X = P r(Y adj , A = a, Ŷ = k) P r(A = a) X = P r(Y adj = j, Ŷ = k|Y = j, A = a) k k X P r(Ŷ = k, A = a) X = P r(Y adj |Ŷ = k, A = a) = P r(Y adj = i|Y = j, A = a, Ŷ = k) P r(A = a) k k X × P r(Ŷ = k|Y = j, A = a) = P r(Y adj |Ŷ = k, A = a)P r(Ŷ = k|A = a) X k = P r(Y adj = i|Ŷ = k, A = a) a =P P r(Ŷ |A = a) k × P r(Ŷ = k|Y = j, A = a) which is again linear in Pa , and the conditional probability X vector P r(Ŷ |A = a) can be computed emprically. 
a a = Pik Zkj k Synthetic Experiment Results with |A | = 2 Experiments with |A | = 2 Hyperparameter Level Change in Acc (CI) Change in TDR (CI) Intercept – -0.08 (-0.13, -0.03) -0.14 (-0.18, -0.10) Loss Unweighted – – Weighted -0.09 (-0.12, -0.06) 0.10 (0.08, 0.13) Goal Equalized Odds – – Demographic Parity 0.20 (0.15, 0.24) 0.17 (0.14, 0.21) Equal Opportunity 0.02 (-0.02, 0.07) 0.02 (-0.02, 0.05) Strict 0.021 (-0.02, 0.07) 0.01 (-0.03, 0.04) Group Balance No Minority – – Slight Minority -0.05 (-0.09, -0.01) 0.01 (-0.02, 0.04) Strong Minority -0.07 (-0.11, -0.03) 0.00 (-0.03, 0.04) Class Balance Balanced – – One Rare -0.005 (-0.04, 0.03) -0.05 (-0.08, -0.01) Two Rare 0.08 (0.04, 0.11) -0.14 (-0.17, -0.11) Predictive Bias Low – – Medium -0.06 (-0.10, -0.03) -0.09 (-0.12, -0.05) High -0.20 (-0.24, -0.16) -0.18 (-0.22, -0.15) Table 3: Regression coefficients and 95% confidence intervals for accuracy and mean T DR as a function of the experimental hyperparameters for the synthetic datasets with two protected attributes and three possible outcomes.