=Paper=
{{Paper
|id=Vol-3087/paper_36
|storemode=property
|title=Blackbox Post-Processing for Multiclass Fairness
|pdfUrl=https://ceur-ws.org/Vol-3087/paper_36.pdf
|volume=Vol-3087
|authors=Preston Putzel,Scott Lee
|dblpUrl=https://dblp.org/rec/conf/aaai/PutzelL22
}}
==Blackbox Post-Processing for Multiclass Fairness==
Preston Putzel¹* and Scott Lee²*

¹ Department of Computer Science, University of California, Irvine, CA 92697, USA
² Centers for Disease Control and Prevention, 1600 Clifton Rd., Atlanta, GA, USA

* These authors contributed equally.
Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Applying standard machine learning approaches for classification can produce unequal results across different demographic groups. When such models are then used in real-world settings, these inequities can have negative societal impacts. This has motivated the development of various approaches to fair classification with machine learning models in recent years. In this paper, we consider the problem of modifying the predictions of a blackbox machine learning classifier in order to achieve fairness in a multiclass setting. To accomplish this, we extend the 'post-processing' approach in Hardt, Price, and Srebro (2016), which focuses on fairness for binary classification, to the setting of fair multiclass classification. We explore when our approach produces both fair and accurate predictions through systematic synthetic experiments and also evaluate discrimination-fairness tradeoffs on several publicly available real-world application datasets. We find that, overall, our approach produces minor drops in accuracy and enforces fairness when the number of individuals in the dataset is high relative to the number of classes and protected groups.

===Introduction===
As machine learning begins moving into sensitive prediction tasks, it becomes critical to ensure the fair performance of prediction models. Naively trained machine learning systems can replicate biases present in their training data, resulting in unfair outcomes that can accentuate societal inequities. For example, machine learning systems have been discovered to be unfair in predicting time to criminal recidivism (Dieterich, Mendoza, and Brennan 2016), ranking applications to nursing school (Romano, Bates, and Candès 2020), and recognizing faces (Buolamwini and Gebru 2018).

Most prior work in this area has focused on ensuring fairness for binary outcomes. However, there are many important real-world applications with multiclass outcomes instead. For example, a self-driving car will need to distinguish clearly between humans, non-human animals (such as dogs), and non-sentient objects while nonetheless maintaining fair performance for both wheelchair users and non-wheelchair users. Most work has also been done under the assumption that model parameters are accessible to the algorithm, but there is increasing availability of powerful blackbox models whose internal parameters are either inaccessible or too costly to retrain. In this paper, we address the case where outcomes are multiclass and the user has received a pre-trained blackbox model. The main contributions of our work are as follows:
• We show how to extend Hardt, Price, and Srebro (2016) to multiclass outcomes.
• We demonstrate in what data regimes multiclass post-processing is likely to produce fair, useful, and accurate results via a set of rigorous synthetic experiments.
• We demonstrate the results of our post-processing algorithm on publicly available real-world applications.

'''Code and Dataset Availability.''' All of the code used to produce our experimental results, as well as the synthetic and real-world datasets, can be found on our github page: https://github.com/scotthlee/fairness/tree/aaai

===Technical Approach===
As in Hardt, Price, and Srebro (2016), we consider the problem of enforcing fairness on a blackbox classifier without changing its internal parameters. This means that our approach only has access to the predicted labels ŷ_i from the blackbox classifier, the true labels y_i, and the protected attributes a_i for i ∈ {1, ..., N}, where N is the number of individuals. The goal of our approach is to produce a new set of updated and fair 'adjusted' predictions y_i^adj that satisfy a desired fairness criterion.
For each of ŷ_i, y_i, and a_i, we define corresponding random variables Ŷ, Y, and A. Then, following Hardt, Price, and Srebro (2016), we define the random variable for the adjusted predictions, Y^adj, to be a randomized function of Ŷ and A. We extend the approach in Hardt, Price, and Srebro (2016) by allowing multiclass outcomes, such that the sample spaces of Ŷ, Y, and Y^adj are a collection of discrete and mutually exclusive outcomes C = {1, 2, ..., |C|}. We in principle allow the sample space of the protected group A, denoted 𝒜, to contain any number of discrete values as well: 𝒜 = {1, 2, ..., |𝒜|}.

'''Linear Program.''' Our approach involves the construction of a linear program over the conditional probabilities of the adjusted predictor, Pr(Y^adj = y^adj | Ŷ = ŷ, A = a), such that a desired fairness criterion is satisfied by those probabilities. In order to construct the linear program, both the loss and the fairness criteria must be linear in terms of the protected attribute conditional probability matrices P_a = Pr(Y^adj | Ŷ, A = a), which have dimensions |C| × |C|.

'''Types of Objective Functions.''' We consider objective functions which are linear in the group conditional adjusted probabilities P_a. More specifically, we consider minimizing expected losses of the form:

E[l(Y^adj, Y)] = Σ_{a ∈ 𝒜} Σ_{i=1}^{|C|} Σ_{j ≠ i} Pr(Y^adj = i, Y = j, A = a) · l(i, j, a)
             = Σ_{a ∈ 𝒜} Σ_{i=1}^{|C|} Σ_{j ≠ i} W^a_{ij} · Pr(A = a, Y = j) · l(i, j, a)

where W^a_{ij} = Pr(Y^adj = i | Y = j, A = a) are the protected attribute conditional confusion matrices. Under the independence assumption Y^adj ⊥ Y | A, Ŷ, we can write W_a = P_a Z_a, where Z_a = Pr(Ŷ | Y, A = a) are the class conditional confusion matrices of the original blackbox classifier's predictions. The matrices Z_a are estimated empirically from the training data (y_i and a_i) and the blackbox predictions of the model (ŷ_i). Therefore, this formulation of the objective function remains linear in the protected attribute conditional probability matrices P_a, as is necessary for the linear program. This formulation is similar to Hardt, Price, and Srebro (2016), except that we let the loss l(i, j, a) also be a function of the protected attribute instead of just the true and adjusted labels, which allows controlling the strictness of penalties for errors made for specific protected groups and classes. The most straightforward version of this loss is letting l(y^adj, y, a) be the zero-one loss (ignoring the protected attributes), which results in minimizing the sum of the joint probabilities of mismatch between Y^adj and Y. We refer to this approach as the unweighted loss. Another approach is to set l(y^adj, y, a) equal to one over the joint probability of the true label and protected attribute, 1/Pr(Y = y, A = a) (estimated empirically), which we refer to as the weighted loss. Intuitively, this option reweights the loss to give rarer protected group and label combinations equal importance in the optimization, which could improve fairness when very low membership minority protected groups exist in the dataset. This option for the objective function can be equivalently minimized by maximizing the diagonals (true detection rates) of the group conditional confusion matrices W_a.

'''Types of Fairness.''' We consider several versions of multiclass fairness criteria, all of which can be written as a collection of |𝒜| − 1 pairwise equalities setting a fairness criterion of interest equal across all groups. Moreover, each of the terms in these equalities can be written as some |C| × |C| matrix M_a times the adjusted probability matrix P_a, and therefore is linear in the adjusted probabilities, as needed for the linear program (see Appendix A for the exact form M_a takes for the different fairness criteria).

The first definition requires strictly equal performance across protected groups.

Definition 1 (Term-by-Term Multiclass Equality of Odds). A multiclass predictor satisfies term-by-term equality of odds if the protected group conditional confusion matrices W_a are equal across all protected groups:

W_1 = W_2 = · · · = W_|𝒜|     (1)

where W_a = Pr(Y^adj | Y, A = a).

This is a straightforward extension to the multiclass case of equality of odds as defined in Hardt, Price, and Srebro (2016). Notice that since this definition requires equality of each off-diagonal term of W_a across all groups, it enforces not only that errors are made at the same overall rate across groups, but also that the rates of specific types of errors are equal. For some practical applications, term-by-term equality of odds is important, such as predicting criminal recidivism times binned into three years, two years, one year, and "never recommits". In this case, making the error of predicting 3 years until recidivism when the actual time is 1 year is much worse than predicting 3 years when the actual time is 2. Therefore, it is critical for fairness in this application that the rates of specific types of errors are strictly equal across groups.

Instead of requiring strict equality of the off-diagonal terms of W_a, we can instead enforce equality across the classwise overall false detection rates (FDR), which leads to the next fairness definition.

Definition 2 (Classwise Multiclass Equality of Odds). A multiclass predictor satisfies classwise multiclass equality of odds if the diagonals of the protected group conditional confusion matrices and the protected attribute conditional vectors of false detection rates are equal across all protected groups:

diag(W_1) = diag(W_2) = · · · = diag(W_|𝒜|)
FDR_1 = FDR_2 = · · · = FDR_|𝒜|     (2)

where FDR_a = Pr(Y^adj | Y^adj ≠ Y, A = a).

This version of fairness can 'trade' better performance for a specific protected group on one off-diagonal term in W_a (i.e., lower error probability for that term) for poorer performance of the same group on a different off-diagonal term (i.e., higher error probability for another term). Individually, each class label has its true detection rate and overall false detection rate set equal across groups. Thus, this type of fairness is 'classwise'.

For some problems it is sufficient to maintain fair true detection rates across classes and allow false detection rates to differ across groups. This is even less restrictive than Definition 2. This may be desirable when, for example, deciding whether an accepted college application should be admitted into an honors program, accepted with a scholarship, or regularly accepted. Since all the outcomes are positive, unfairness across false detection rates may not be critical, as long as the true detection rates are fair across groups. This motivates the following fairness criterion.

Definition 3 (Multiclass Equality of Opportunity). A multiclass predictor satisfies equality of opportunity if the diagonals of the protected group conditional confusion matrices W_a are equal across all groups:

diag(W_1) = diag(W_2) = · · · = diag(W_|𝒜|)     (3)

where W_a = Pr(Y^adj | Y, A = a).

A common and even more relaxed version of fairness called demographic parity only requires the rates of class predictions across different groups to be equal (Calders, Kamiran, and Pechenizkiy 2009).

Definition 4 (Multiclass Demographic Parity). A multiclass predictor satisfies demographic parity if the protected group conditional class probabilities are equal across groups:

Pr(Y^adj | A = 1) = Pr(Y^adj | A = 2) = · · · = Pr(Y^adj | A = |𝒜|)     (4)

Enforcing this version of fairness for certain datasets may produce effectively unfair outcomes (Dwork et al. 2012). Moreover, in synthetically produced data, this definition has been shown to reduce the reputation of disadvantaged protected groups when repeatedly applied over a long period of time to sensitive decision-making tasks such as hiring (Hu and Chen 2018).

Note that while the learned adjusted probabilities P_a are guaranteed to be fair after running the linear program, taking the maximum value over the learned probabilities when predicting at the individual level will not maintain fairness. In fact, taking the maximum over the adjusted probabilities can simply result in predictions identical to those made by the original blackbox classifier. Instead, when predicting the class of an individual, the corresponding learned adjusted probabilities must be sampled from in order to maintain the fairness guarantee.
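To make the construction above concrete, here is a minimal sketch of how the post-processing step could be implemented with off-the-shelf tools. It is not the code released with the paper: it assumes integer-coded numpy arrays for the labels, predictions, and protected attribute, uses the convex-optimization library cvxpy to express the linear program, enforces term-by-term equality of odds (Definition 1) with the unweighted loss, and all function names (empirical_tables, fit_adjustment, sample_adjusted) are ours.

```python
import numpy as np
import cvxpy as cp


def empirical_tables(y, y_hat, a, n_classes, n_groups):
    """Estimate Z_a[k, j] = Pr(Yhat = k | Y = j, A = a) and
    p_a[j] = Pr(Y = j, A = a) from integer-coded arrays y, y_hat, a."""
    Z = np.zeros((n_groups, n_classes, n_classes))
    p = np.zeros((n_groups, n_classes))
    n = len(y)
    for g in range(n_groups):
        for j in range(n_classes):
            mask = (a == g) & (y == j)
            p[g, j] = mask.sum() / n
            if mask.any():
                for k in range(n_classes):
                    Z[g, k, j] = np.mean(y_hat[mask] == k)
    return Z, p


def fit_adjustment(y, y_hat, a, n_classes, n_groups):
    """Solve the LP for P_a[i, k] = Pr(Y_adj = i | Yhat = k, A = a) under
    term-by-term equality of odds (Definition 1), unweighted loss."""
    Z, p = empirical_tables(y, y_hat, a, n_classes, n_groups)
    P = [cp.Variable((n_classes, n_classes), nonneg=True) for _ in range(n_groups)]
    W = [P[g] @ Z[g] for g in range(n_groups)]                # W_a = P_a Z_a
    constraints = [cp.sum(P[g], axis=0) == 1 for g in range(n_groups)]  # columns are distributions
    constraints += [W[g] == W[0] for g in range(1, n_groups)]           # Definition 1
    # Unweighted (zero-one) loss: sum of off-diagonal W_a[i, j] weighted by Pr(Y = j, A = a).
    off_diag = 1.0 - np.eye(n_classes)
    loss = sum(cp.sum(cp.multiply(off_diag * p[g], W[g])) for g in range(n_groups))
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return [P[g].value for g in range(n_groups)]


def sample_adjusted(P, y_hat, a, seed=0):
    """Randomized prediction: sample Y_adj from column y_hat of the group's P_a."""
    rng = np.random.default_rng(seed)
    out = np.empty(len(y_hat), dtype=int)
    for idx, (k, g) in enumerate(zip(y_hat, a)):
        col = np.clip(P[g][:, k], 0, None)
        out[idx] = rng.choice(len(col), p=col / col.sum())
    return out
```

Under these assumptions, the weighted loss corresponds to dropping the Pr(Y = j, A = a) factor from the objective weights (equivalently, maximizing the diagonals of W_a), and the other fairness criteria replace the W constraints with equalities on diagonals, false detection rates, or the group-conditional prediction probabilities.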
===Related Work===
Most prior work on post-processing based fairness approaches focuses on binary prediction tasks. Wei, Ramamurthy, and Calmon (2019) create a post-processing algorithm that modifies the raw scores of a binary classifier (instead of thresholded hard predictions) in order to achieve desired fairness constraints expressed as linear combinations of the per-group expected raw scores. Ye and Xie (2020) develop a general in-processing fairness framework which alternates between selecting a subset of the training data and fitting a classifier to that data.

Several adversarial approaches to multiclass fairness have been investigated recently, although these are not blackbox post-processing algorithms. Zhang, Lemoine, and Mitchell (2018) first present the idea of adversarial debiasing, while Romano, Bates, and Candès (2020) present a multiclass approach for in-process training based on adversarial learning, with the discriminator distinguishing between the distribution of the model's current predictions, the true labels, and artificial protected attributes resampled to be fair, and the true distribution of the predictions, true labels, and true protected attributes.

Multiclass blackbox post-processing techniques are less studied, although there have been a few new approaches recently. Notably, Denis et al. (2021) derive an optimally fair classifier from a pre-trained model and show several nice theoretical guarantees, including the asymptotic fairness of their proposed plug-in estimator. We see three key differences between their approach and the extension to Hardt, Price, and Srebro (2016) that we propose: they only consider binary protected attributes (|𝒜| = 2), while we allow categorical protected attributes (|𝒜| > 2) that can take on any number of unique values, at least theoretically; their method requires fitting a new estimator to the test data, whereas ours only requires computing probabilities and solving a linear program, which is relatively efficient; and, perhaps most importantly, their approach is limited to the demographic parity fairness constraint, whereas our approach applies to any constraint that is linear in P_a.

In broader terms, Hossain, Mladenovic, and Shah (2020) unify many of the published methods for learning fair classifiers by showing that equalized odds, equal opportunity, and other common measures of fairness in the binary setting are subsumed by their proposed generalizations of the economic notions of envy-freeness and equitability. They show that these generalizations of fairness apply to the multiclass setting, but that post-processing techniques are incapable of achieving them. We show here that this notion is not entirely correct, at least in a narrow sense, and that fairness can be achieved with post-processing techniques in the multiclass setting, so long as the joint distribution Pr(Y, Ŷ, A) is either fully known or can be reasonably approximated by a large-enough sample of training data.

===Synthetic Data Experiments===
'''Synthetic Data.''' To explore the effect of different data regimes and optimization goals on post-adjustment discrimination, we conducted thorough (though by no means exhaustive) synthetic experiments for a 3-class outcome. We constructed synthetic datasets with N = 1,000 observations for each unique combination of the following data-generating hyperparameters (a minimal sketch of how a single dataset of this kind might be generated is shown after the list):
• The number of unique values for the protected attribute, |𝒜|. We explored setting |𝒜| = 2 or |𝒜| = 3 (see results with |𝒜| = 2 in our github repository).
• The amount of class imbalance for the labels Y. For simplicity, we did not allow this to vary across protected groups.
• Group balance, or the number and relative size of minority groups compared to majority groups. This varied according to the number of groups but was generally either none, weak, or strong.
• Predictive bias, defined as the difference in mean true detection rate (TDR) between the groups. We vary this from mild predictive bias (a 10 percent difference) to severe bias, with the minority group TDR being near chance. The predictive bias is set to always favor the majority group.
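For illustration, the sketch below generates one synthetic dataset under our own simplifying assumptions: labels are drawn from a chosen class distribution, groups from a chosen group distribution, and the simulated blackbox predictions are corrupted at group-specific rates so that minority groups receive a lower true detection rate. The paper's actual generator may differ; all names and parameter values here are illustrative only.

```python
import numpy as np


def make_synthetic(n=1000, class_probs=(0.34, 0.33, 0.33),
                   group_probs=(0.5, 0.3, 0.2), group_tdr=(0.8, 0.7, 0.5),
                   seed=0):
    """Generate (y, y_hat, a) with group-dependent predictive bias.

    group_tdr[g] is the probability that the simulated blackbox classifier
    predicts the true class for members of group g; otherwise it predicts
    one of the remaining classes uniformly at random.
    """
    rng = np.random.default_rng(seed)
    n_classes, n_groups = len(class_probs), len(group_probs)
    y = rng.choice(n_classes, size=n, p=class_probs)     # class imbalance
    a = rng.choice(n_groups, size=n, p=group_probs)      # group balance
    y_hat = y.copy()
    for i in range(n):
        if rng.random() > group_tdr[a[i]]:                # group-specific error rate
            wrong = [c for c in range(n_classes) if c != y[i]]
            y_hat[i] = rng.choice(wrong)
    return y, y_hat, a
```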
This process yielded 117 datasets. For each one, we ran the linear program to adjust the (synthetic) biased blackbox predictions 8 times, once for each unique combination of the objective function and type of fairness, yielding a total of 936 adjustments. After each adjustment, we recorded two broad measures of the fair predictor's performance:
• Triviality, or whether any of the columns in W_a = Pr(Y^adj | Y, A = a) contained all zeroes (i.e., whether any levels of the outcome were no longer predicted).
• Discrimination, or the percent change in loss for the adjusted predictor relative to that of the original predictor. For this measure, we examined two specific metrics: global accuracy and the mean of the group-wise TDRs. These are equivalent to 1 minus the post-adjustment loss under the two versions of the objective function we present above.

To quantify the average effect of each hyperparameter on discrimination, we fit two multivariable linear regression models to the resulting dataset, one for each discrimination metric. Before fitting the models, we converted the categorical hyperparameters (so all but loss) to one-hot variables, and then we set a reference level for each, removing the corresponding column from the design matrix. We then fit the models separately using ordinary least squares (OLS) and calculated confidence intervals (CIs) for the resulting coefficients.
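A regression of this form can be fit with standard tooling. The sketch below uses the statsmodels formula interface, whose treatment coding of categorical variables performs the one-hot expansion with a reference level automatically; the file and column names are hypothetical stand-ins for the experiment log, not the paper's variable names.

```python
import pandas as pd
import statsmodels.formula.api as smf

# results: one row per adjustment, with the hyperparameter settings and the
# observed change in accuracy (a second model uses change in mean TDR).
results = pd.read_csv("synthetic_results.csv")   # hypothetical experiment log

# C(...) expands each categorical hyperparameter into one-hot columns,
# dropping a reference level; the intercept absorbs the reference cell.
model = smf.ols(
    "change_in_acc ~ C(loss) + C(goal) + C(group_balance)"
    " + C(class_balance) + C(pred_bias)",
    data=results,
).fit()

print(model.params)            # coefficients, as reported in Table 1
print(model.conf_int(0.05))    # 95% confidence intervals
```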
{| class="wikitable"
|+ Table 1: Predicted change and 95% confidence intervals for accuracy and mean TDR as a function of the experimental hyperparameters in our synthetic datasets with a three-level protected attribute. All datasets had a 3-class outcome.
! Hyperparameter !! Level !! Change in Acc (CI) !! Change in TDR (CI)
|-
| Intercept || – || -0.13 (-0.17, -0.09) || -0.18 (-0.21, -0.15)
|-
| Loss || Unweighted || – || –
|-
| || Weighted || -0.11 (-0.13, -0.09) || 0.12 (0.10, 0.13)
|-
| Goal || Equalized Odds || – || –
|-
| || Demographic Parity || 0.24 (0.22, 0.27) || 0.21 (0.18, 0.23)
|-
| || Equal Opportunity || 0.08 (0.05, 0.11) || 0.03 (0.01, 0.05)
|-
| || Term-by-Term || 0.08 (0.05, 0.11) || 0.02 (-0.01, 0.04)
|-
| Group Balance || No Minority || – || –
|-
| || One Slight Minority || -0.03 (-0.06, 0.00) || -0.02 (-0.04, 0.01)
|-
| || One Strong Minority || -0.04 (-0.07, -0.00) || -0.01 (-0.03, 0.02)
|-
| || Two Slight Minorities || -0.05 (-0.08, -0.02) || -0.02 (-0.04, 0.01)
|-
| || Two Strong Minorities || -0.07 (-0.11, -0.04) || -0.01 (-0.04, 0.01)
|-
| Class Balance || Balanced || – || –
|-
| || One Rare || 0.02 (-0.00, 0.04) || -0.04 (-0.06, -0.02)
|-
| || Two Rare || 0.07 (0.04, 0.09) || -0.18 (-0.20, -0.17)
|-
| Pred Bias || Low One || – || –
|-
| || Low Two || 0.00 (-0.03, 0.04) || -0.00 (-0.03, 0.02)
|-
| || Medium One || -0.06 (-0.09, -0.02) || -0.06 (-0.08, -0.03)
|-
| || Medium Two || -0.04 (-0.07, -0.00) || -0.06 (-0.08, -0.03)
|-
| || High One || -0.18 (-0.22, -0.15) || -0.16 (-0.19, -0.14)
|-
| || High Two || -0.15 (-0.19, -0.12) || -0.13 (-0.16, -0.11)
|}

'''Results.''' Table 1 shows coefficients and 95% confidence intervals for the regression models with |𝒜| = 3. The results highlight several important points:
• Predictive bias and class imbalance are the two main drivers of decreases in post-adjustment discrimination, for both accuracy and TDR.
• High group imbalance for the protected attributes lowers post-adjustment discrimination, but only from the perspective of global accuracy: even with two strong minorities (3-group scenario), mean TDR only drops by 1.1%.
• Relative to the weighted objective, the unweighted objective leads to higher scores for global accuracy but lower scores for mean TDR. This is perhaps unsurprising, but it is worth noting nonetheless.
• Despite finding better accuracy solutions, we also found that the unweighted objective leads to trivial solutions far more frequently (30% of the time it was used) than the weighted version of the loss (0.2% of the time it was used). This trend will likely worsen with increasing dimension of either the number of classes or the number of protected groups.
• Fairness is generally harder to achieve with 3 protected groups than with 2, since the intercepts are lower for both accuracy and mean TDR. We believe this to be a general consequence of forcing fairness across more groups and expect this trend to continue as the number of groups increases.

===Experiments with Real-World Data===
'''Dataset Descriptions.''' To further examine the performance characteristics of our algorithm, we ran it on several real-world datasets described below.
1. Drug Usage (Fehrman et al. 2017). This dataset has inherently multiclass outcomes, with the target being a 7-level categorical variable indicating recentness of use for a variety of drugs. We focus on predicting cannabis usage, where we collapsed the 7-level usage indicator into 3 broader categories: never used, used but not in the past year, and used in the past year. Predictors included demographic variables like age, gender, and level of education, as well as a variety of measures of personality traits hypothesized to affect usage habits.
2. Obesity (Palechor and de la Hoz Manotas 2019). This dataset has inherently multiclass outcomes, with the target being a 7-level categorical variable indicating weight category; the protected attribute is gender (Male/Female). Because some of the observations are synthetic in order to protect privacy, not all of the gender/weight categories had sufficient numbers for modeling, and so we omitted observations from the 2 most extreme weight categories, Obesity Type-II and Obesity Type-III, leaving a 5-level target for prediction. Predictors included age, gender, family medical history, and several measures of physical activity and behavioral health.
3. LSAC Bar Passage (Wightman 1998). This dataset has inherently multiclass outcomes, with the target being a 3-level variable indicating bar exam passage status (passed first time, passed second time, or did not pass). The protected attribute is race, which we collapsed from its original 8 levels to 2 (white and non-white). Predictors included mostly measures of educational achievement, like undergraduate GPA, law school GPA, and LSAT score.
4. Parkinson's Telemonitoring (Tsanas et al. 2009). This dataset does not have inherently multiclass outcomes, with the target for prediction being the Unified Parkinson's Disease Rating Scale (UPDRS), a continuous score that increases with the severity of impairment. We used Otsu's method to bin the continuous score into 3 categories (low, moderate, and high impairment), which we took as the new class labels. The protected attribute is a 2-level variable for gender (Male/Female). Predictors included mostly biomedical measurements from the voice recordings of patients with Parkinson's Disease.

For each of these datasets, we obtained a potentially biased predictor Ŷ by training a random forest on all available informative features (including the protected attribute) to predict the multiclass outcome, and then taking the categories corresponding to the row-wise maxima of the out-of-bag decision scores as the set of predicted labels. We then adjusted the predictions with the weighted objective and term-by-term equality of odds fairness constraint and recorded the relative changes in global accuracy and mean TDR as the outcome measures of interest, as with our synthetic experiments.

'''Exploring the Effect of Finite Sampling.''' Hardt, Price, and Srebro (2016) note that their method will not be affected by finite sample variability as long as the joint distribution Pr(Y, Ŷ, A) is known, or at least well-approximated by a large sample. In practical applications, however, the sample at hand may not be large enough to approximate the joint distribution with precision. This problem is exacerbated when the number of observations N is small relative to the number of probabilities learned by the algorithm, of which there are |C| × |C| × |𝒜| in total. This difficulty is therefore more severe for our extension in this work, where |C| > 2.

In these cases, the adjusted predictor Y^adj may have worse classification performance and higher disparity when applied to unseen, out-of-sample data. As a preliminary exploration of this effect, we used 5-fold cross-validation to generate out-of-sample predictions for each of the observations in our real-world datasets. Keeping Y, Ŷ, and A fixed, we solved the linear program on 80% of the data and then used the adjusted probabilities P_a to obtain class predictions for the observations in the remaining 20%. As with the predictions obtained from solving the linear program on the full dataset, we measured the changes in accuracy and mean TDR for the cross-validated predictions. Because fairness is not guaranteed when the joint distribution assumption is violated, we also measured post-adjustment fairness.
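The post-adjustment disparity reported below is the element-wise mean difference across groups of the confusion matrices W_a, estimated from the held-out predictions. A minimal sketch of that computation is shown here; summarizing it as the mean pairwise absolute difference is our reading of the reported metric, and the function names are ours rather than the paper's.

```python
import numpy as np
from itertools import combinations


def confusion_by_group(y, y_adj, a, n_classes, n_groups):
    """Empirical W_a[i, j] = Pr(Y_adj = i | Y = j, A = a)."""
    W = np.zeros((n_groups, n_classes, n_classes))
    for g in range(n_groups):
        for j in range(n_classes):
            mask = (a == g) & (y == j)
            if mask.any():
                for i in range(n_classes):
                    W[g, i, j] = np.mean(y_adj[mask] == i)
    return W


def post_adjustment_disparity(y, y_adj, a, n_classes, n_groups):
    """Element-wise mean absolute difference of W_a over all pairs of groups."""
    W = confusion_by_group(y, y_adj, a, n_classes, n_groups)
    diffs = [np.abs(W[g1] - W[g2]).mean()
             for g1, g2 in combinations(range(n_groups), 2)]
    return float(np.mean(diffs))
```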
'''Exploring the Fairness-Discrimination Tradeoff.''' When there are large gaps in a predictor's performance across groups, i.e., when predictive bias is high, strict fairness may not always be possible or desirable to achieve because of the large amount of randomization required to balance the blackbox classifier's predictions. To explore the tradeoff between fairness and discrimination, we ran the linear program on each of the real-world datasets once for each of the four kinds of fairness. For each combination of dataset and fairness type, we varied the equality constraints of the linear program (the maximum percent difference allowed between any pairwise comparison of fairness measures between groups) from 0.0 to 1.0 in increments of 0.01, and then plotted the value of the weighted objective at each point as a function of the global measure of fairness corresponding to the fairness type under consideration. To obtain these global measures, we took the maximum of the mean differences across pairs of groups of the following metrics:
• W, the matrix of probabilities Pr(Y^adj | Y), for term-by-term equality of odds;
• Youden's J index, TDR + (1 − FDR) − 1, for classwise equality of odds;
• TDR for equal opportunity;
• Pr(Y^adj) for demographic parity.

We note here that taking the maximum of the maxima of the pairwise differences would also be a valid and sensible global measure. So that the plots show performance under optimal conditions, we do not use cross-validation to obtain Y^adj; i.e., we obtain it by solving the linear program on the entire dataset.
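Relaxing the equality constraints in this way amounts to replacing each exact equality in the linear program with a pair of inequalities bounded by a slack parameter. A hedged sketch of that change, in the same cvxpy style as the earlier sketch, is shown below; note that it uses an absolute slack for simplicity, whereas the paper describes a maximum percent difference, and the helper names are ours.

```python
import cvxpy as cp


def relaxed_equality(lhs, rhs, eps):
    """Replace an exact fairness equality with |lhs - rhs| <= eps, element-wise."""
    return [lhs - rhs <= eps, rhs - lhs <= eps]


def tradeoff_constraints(W, eps):
    """Term-by-term equality of odds with slack eps, given the expressions
    W = [P[g] @ Z[g] for each group g] from the earlier sketch."""
    constraints = []
    for g in range(1, len(W)):
        constraints += relaxed_equality(W[g], W[0], eps)
    return constraints
```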
'''Results.''' Table 2 shows changes in global accuracy and mean TDR after adjustment with the weighted objective and term-by-term equality of odds fairness constraint for our four datasets, using cross-validation as described above to capture some of the variability that comes with finite sampling. Overall, adjustment lowered both accuracy and mean TDR, although for the bar passage, drug usage, and Parkinson's datasets the drops were moderate, with average relative changes in the two metrics coming in at around 12% and 15%, respectively (without cross-validation, the drops were much smaller, at 3% and 4%). For the obesity dataset, the drops are much larger, at 47% and 46%, respectively, which are indeed substantial and would likely make the predictor unusable in practical settings. On in-sample data, these drops were both only around 7%, and so we suspect that characteristics of the data, like large class imbalance or small overall sample size, are responsible for the poor performance. Perhaps most importantly, the post-adjustment disparity for all datasets is non-zero, and for three of the datasets it actually increases. The bar passage dataset was the only example where the out-of-sample post-adjustment disparity decreased to near zero, likely due to it being the largest dataset. This starkly points out the sensitivity of the method to estimating the joint probabilities Pr(Y, Ŷ, A), and shows that the approach is unlikely to work in smaller dataset regimes which have a larger combination of classes and protected attributes. Note that for in-sample results, post-adjustment disparity drops completely to 0.0 for all datasets in Table 2, since it is strictly enforced by the linear program.

{| class="wikitable"
|+ In-Sample Results
! Dataset (N) !! # Terms !! Old Acc !! New Acc (% change) !! Old TDR !! New TDR (% change) !! Pre Disparity !! Post-Adj Disparity (% change)
|-
| Bar (N=22,406) || 18 || 88% || 88% (-1%) || 36% || 34% (-7%) || 0.11 || 0.00 (-100%)
|-
| Cannabis (N=1,885) || 18 || 74% || 71% (-4%) || 67% || 63% (-6%) || 0.07 || 0.00 (-100%)
|-
| Obesity (N=1,490) || 50 || 78% || 73% (-7%) || 78% || 73% (-7%) || 0.05 || 0.00 (-100%)
|-
| Parkinsons (N=5,875) || 18 || 93% || 91% (-2%) || 92% || 89% (-3%) || 0.04 || 0.00 (-100%)
|}

{| class="wikitable"
|+ Out-of-Sample Results
! Dataset (N) !! # Terms !! Old Acc !! New Acc (% change) !! Old TDR !! New TDR (% change) !! Pre Disparity !! Post-Adj Disparity (% change)
|-
| Bar (N=22,406) || 18 || 88% || 83% (-6%) || 36% || 33% (-8%) || 0.11 || 0.01 (-95%)
|-
| Cannabis (N=1,885) || 18 || 74% || 61% (-18%) || 67% || 52% (-22%) || 0.07 || 0.16 (+124%)
|-
| Obesity (N=1,490) || 50 || 78% || 41% (-47%) || 78% || 42% (-46%) || 0.05 || 0.07 (+45%)
|-
| Parkinsons (N=5,875) || 18 || 93% || 82% (-12%) || 92% || 78% (-15%) || 0.04 || 0.05 (+33%)
|}

Table 2: Results of applying the linear program to adjust the blackbox predictions and produce Y^adj for four real-world datasets. The top table is without any splitting; results in the bottom table are cross-validated across five 80/20 splits of each dataset. Accuracy and TDR are shown before and after adjustment, with TDR being the mean across all classes. Percent changes, shown in parentheses, are the relative percent changes in accuracy, mean TDR, and disparity. Post-adjustment disparity is the element-wise mean difference across all groups of W_a.

Figure 1 (not reproduced here): Fairness-discrimination plots for our post-processing algorithm on our 4 real-world datasets, created by systematically relaxing the fairness equality constraints of the linear program. The plots show Brier score as a function of the maximum average difference between groups of the corresponding fairness criterion. Performance of the original, unadjusted predictor is marked by an X.

Figure 1 shows fairness-discrimination plots for our 4 datasets with the weighted objective and each of the 4 fairness constraints. Under strict fairness, with the inequality set to 0, equalized odds is the hardest to satisfy, showing the largest increase in Brier score. For the drug usage, obesity, and Parkinson's datasets, discrimination improves approximately linearly as fairness worsens; for the bar passage dataset, discrimination improves to a point, but then worsens as fairness approaches the value for the original, unadjusted predictor Ŷ. For all datasets, the total loss of discrimination under strict fairness is relatively small (the biggest drop is around 7.5 percentage points in Brier score), but the random forests' predictions were only mildly biased to begin with, so we expect this gap to increase for less-fair predictors.

===Discussion===
Generally, our post-processing approach to achieving fairness in multiclass settings seems both feasible and efficient given a large enough dataset. We have shown above that the linear programming technique proposed by Hardt, Price, and Srebro (2016) can be extended to accommodate a theoretically arbitrarily large number of discrete outcomes and levels of a protected attribute. Nonetheless, our synthetic experiments and analyses of real-world datasets show that there are a few important considerations for using the approach in practice.

In many cases, the effect of finite sampling may be non-negligible, especially when the number of observations N is small relative to the number of outcomes |C| or the number of protected groups |𝒜|. For example, the obesity dataset, with |C| = 5 and N = 1,490, saw a large relative drop of 46% in mean TDR after adjustment under cross-validation. We also saw this effect extend to fairness, which was not reduced completely to zero on out-of-sample data for any of the real-world datasets. In fact, for the drug usage dataset we found that post-adjustment disparity doubled on out-of-sample data. This last observation raises a concerning point: for some classification problems, the post-adjustment predictions on out-of-sample data may increase disparity rather than lowering it. For the largest of the datasets, the bar passage dataset with N = 22,406, neither of these issues was a concern. Even under cross-validation, the relative change in TDR was only -8%, and the disparity dropped to near 0 (a -95% decrease). Given this, we expect that with a large enough dataset, our approach will be far more reliable on out-of-sample data. Future work more precisely quantifying the number of training examples needed for reliable out-of-sample fair performance with our approach is needed.

More generally, even when finite sampling variability is not an issue, not all datasets will lend themselves well to this kind of post-processing approach. In our synthetic experiments, we showed that severe class imbalance and severe predictive bias (predicting at nearly the level of chance for minority protected groups) lead to large drops in post-adjustment performance on average. In many of the single experimental runs for synthetic datasets with these settings, the resulting derived predictor was effectively useless, either producing trivial results or lowering predictive performance to near chance (for all groups) for one or more class outcomes. In these circumstances, it may be more sensible to enforce fairness through a combination of pre-processing, in-processing, and post-processing methods, rather than through a post-processing method alone. Indeed, Woodworth et al. (2017) make this point generally, albeit for the binary setting, by showing that unless the biased predictor Ŷ is very close to being Bayes optimal, the derived predictor Y^adj proposed by Hardt, Price, and Srebro (2016) can underperform relative to other methods, sometimes substantially. Under less extreme circumstances, however, we found that our approach produces good results, especially given the time-efficiency of solving the linear program relative to other methods.

===Acknowledgments===
This work was supported in part by the HPI Research Center in Machine Learning and Data Science at UC Irvine (P. Putzel), as well as in part by an appointment to the Research Participation Program at the Centers for Disease Control and Prevention, administered by the Oak Ridge Institute for Science and Education (P. Putzel). We would also like to thank Chad Heilig and Padhraic Smyth for their helpful comments on the approach and paper.
arXiv ready shown to be linear above, and also enforcing the over- preprint arXiv:2006.04292. all false detection rates to be equal across protected groups. Tsanas, A.; Little, M.; McSharry, P.; and Ramig, L. 2009. So in order for classwise multiclass equality of odds to be Accurate telemonitoring of Parkinson’s disease progression linear, the false detection rates must be linear in Pa , shown by non-invasive speech tests. Nature Precedings, 1–1. below: Wei, D.; Ramamurthy, K. N.; and Calmon, F. d. P. 2019. F DRca =P r(Y adj = c|Y 6= c, A = a) Optimized score transformation for fair classification. arXiv preprint arXiv:1906.00066. P r(Y adj = c, Y 6= c, A = a) = Wightman, L. F. 1998. LSAC National Longitudinal Bar P r(Y 6= c, A = a) Passage Study. LSAC Research Report Series. X X P r(Y adj = c, Y = c0 , Ŷ = j, A = a) = Woodworth, B.; Gunasekar, S.; Ohannessian, M. I.; and Sre- P r(Y 6= c, A = a) j c0 6=c bro, N. 2017. Learning non-discriminatory predictors. In Conference on Learning Theory, 1920–1953. PMLR. a a X X Pcj Zjc0 P r(Y = c0 , A = a) = Ye, Q.; and Xie, W. 2020. Unbiased Subdata Selection for P r(Y 6= c, A = a) j c0 6=c Fair Classification: A Unified Framework and Scalable Al- a X Zjc 0 X 0 P r(Y = c , A = a) gorithms. arXiv preprint arXiv:2012.12356. = a Pcj Zhang, B. H.; Lemoine, B.; and Mitchell, M. 2018. Mitigat- P r(Y 6= c, A = a) j c0 6=c ing unwanted biases with adversarial learning. In Proceed- X a a ings of the 2018 AAAI/ACM Conference on AI, Ethics, and = Pcj Vjc j Society, 335–340. a 0 a P Zjc 0 P r(Y =c ,A=a) where Vjc = 0 c 6=c P r(Y 6=c,A=a) . This allows us to Appendix A write the protected attribute conditional false detection rates Derivation of Linearity of Fairness Constraints: In or- as FDRa = diag(Pa Va ). As before, Va can be computed der to obtain linearity in the protected attribute conditional from the empirical estimates of Za , and P r(Y = i, A = j). probability matrices Pa we must find an expression of the For multiclass demographic parity we can write: form Wa = Pa Ma : Da =P r(Y adj |A = a) Wija =P r(Y adj = i|Y = j, A = a) 1 X = P r(Y adj , A = a, Ŷ = k) P r(A = a) X = P r(Y adj = j, Ŷ = k|Y = j, A = a) k k X P r(Ŷ = k, A = a) X = P r(Y adj |Ŷ = k, A = a) = P r(Y adj = i|Y = j, A = a, Ŷ = k) P r(A = a) k k X × P r(Ŷ = k|Y = j, A = a) = P r(Y adj |Ŷ = k, A = a)P r(Ŷ = k|A = a) X k = P r(Y adj = i|Ŷ = k, A = a) a =P P r(Ŷ |A = a) k × P r(Ŷ = k|Y = j, A = a) which is again linear in Pa , and the conditional probability X vector P r(Ŷ |A = a) can be computed emprically. 
a a = Pik Zkj k Synthetic Experiment Results with |A | = 2 Experiments with |A | = 2 Hyperparameter Level Change in Acc (CI) Change in TDR (CI) Intercept – -0.08 (-0.13, -0.03) -0.14 (-0.18, -0.10) Loss Unweighted – – Weighted -0.09 (-0.12, -0.06) 0.10 (0.08, 0.13) Goal Equalized Odds – – Demographic Parity 0.20 (0.15, 0.24) 0.17 (0.14, 0.21) Equal Opportunity 0.02 (-0.02, 0.07) 0.02 (-0.02, 0.05) Strict 0.021 (-0.02, 0.07) 0.01 (-0.03, 0.04) Group Balance No Minority – – Slight Minority -0.05 (-0.09, -0.01) 0.01 (-0.02, 0.04) Strong Minority -0.07 (-0.11, -0.03) 0.00 (-0.03, 0.04) Class Balance Balanced – – One Rare -0.005 (-0.04, 0.03) -0.05 (-0.08, -0.01) Two Rare 0.08 (0.04, 0.11) -0.14 (-0.17, -0.11) Predictive Bias Low – – Medium -0.06 (-0.10, -0.03) -0.09 (-0.12, -0.05) High -0.20 (-0.24, -0.16) -0.18 (-0.22, -0.15) Table 3: Regression coefficients and 95% confidence intervals for accuracy and mean T DR as a function of the experimental hyperparameters for the synthetic datasets with two protected attributes and three possible outcomes.