
1613-0073

ProxiMix: Enhancing Fairness with Proximity Samples in Subgroups

Jingyu Hu

Jun Hong

jun.hong@uwe.ac.uk 2

Mengnan Du

mengnan.du@njit.edu 0

Weiru Liu

Group Fairness

Bias Mitigations

Mixup

Data Augmentation

0 New Jersey Institute of Technology, 323 Dr Martin Luther King Jr Blvd, Newark, USA; 1 University of Bristol, Beacon House, Queens Rd, Bristol, UK; 2 University of the West of England, Coldharbour Ln, Stoke Gifford, Bristol, UK

Many bias mitigation methods have been developed for addressing fairness issues in machine learning. We have found that using linear mixup, a data augmentation technique, alone for bias mitigation can still retain biases present in dataset labels. The research presented in this paper aims to address this issue by proposing a novel pre-processing strategy in which both an existing mixup method and our new bias mitigation algorithm can be utilized to improve the generation of labels of augmented samples, making them proximity-aware. Specifically, we propose ProxiMix, which keeps both pairwise and proximity relationships for fairer data augmentation. We have conducted thorough experiments with three datasets, three ML models, and different hyperparameter settings. Our experimental results show the effectiveness of ProxiMix from both fairness of predictions and fairness of recourse perspectives.

CEUR ceur-ws.org

1. Introduction

To bridge the research gap, in this work, we propose ProxiMix to address the issue of biased labels in pre-processing for bias mitigation. Motivated by the relabeling the discrimination method [12], which assigns labels to instances based on their K-nearest neighbors to ensure that similar individuals have similar labels, our proposed approach adds proximity samples for re-auditing mixed labels to mitigate potential bias in mixup. The intuition is that compared with focusing on pairwise labels, considering the labels of proximity samples as latent label relationships can reduce the probability of generating biased labels. We have conducted experiments to compare the existing pairwise mixup with the proposed proximity-aware mixup on multiple models and datasets. The results show that our ProxiMix achieves higher fairness, particularly when the original labels in the dataset are highly biased.

Our main contributions can be summarised as follows: (1) We propose a new bias mitigation algorithm to address the label bias retention issue in the current mixup method; (2) Subgroup preference analysis: we explore how different subgroups perform during the sampling process; (3) Trade-off analysis: we explore the trade-off between using our proximity-based strategy and the traditional mixup; (4) Validation: we validate the effectiveness of our method using prediction-based metrics and the cost of counterfactual explanations from an XAI perspective.

2. Related Work

The fairness problem can be divided into individual and group levels. Individual fairness measures the bias by checking if similar predictions can be made for similar individuals. Group fairness compares the treatments of fairness in unprivileged and privileged groups. Fairness is achieved when the treatments are equal between groups. Prediction-based fairness and recourse-based fairness are two perspectives for evaluating model fairness. In this paper, we focus on group fairness in machine learning.

Fairness of Prediction Outcomes. Most fairness metrics are based on predicted outcomes. Demographic Parity (DP) [13] based metrics use predicted outcomes to assess whether different demographic groups are equally favored by the model. DP aims at having equal proportions of positive outcomes across subgroups. The DP difference between groups is called Statistical Parity Difference (SP), and the DP ratio between groups is called Disparate Impact (DI). In addition to metrics that depend on predictions only, there are some fairness metrics [14] that consider both predicted and actual outcomes. Equality of Opportunity (EO) compares the True Positive Rate (TPR) of subgroups. Equalized odds (Eodds) compares both the True Positive Rate (TPR) and the False Positive Rate (FPR) of each group.
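The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, the group with sensitive value 0 is taken as unprivileged, and summarising Eodds as the larger of the TPR and FPR gaps is only one common convention, so the scaling may differ from the DP% and Eodds% figures reported later.

```python
from typing import Sequence


def rate(values: Sequence[int]) -> float:
    """Fraction of 1s in a 0/1 sequence (0.0 for an empty sequence)."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred, z, group):
    """Return (positive rate, TPR, FPR) for samples whose sensitive value is `group`."""
    idx = [i for i, g in enumerate(z) if g == group]
    positive_rate = rate([y_pred[i] for i in idx])
    tpr = rate([y_pred[i] for i in idx if y_true[i] == 1])
    fpr = rate([y_pred[i] for i in idx if y_true[i] == 0])
    return positive_rate, tpr, fpr


def fairness_report(y_true, y_pred, z):
    """Compare the unprivileged group (z=0) with the privileged group (z=1)."""
    pr0, tpr0, fpr0 = group_rates(y_true, y_pred, z, 0)
    pr1, tpr1, fpr1 = group_rates(y_true, y_pred, z, 1)
    return {
        "SP": pr0 - pr1,                                   # Statistical Parity Difference
        "DI": pr0 / pr1 if pr1 else float("nan"),          # Disparate Impact (ratio)
        "EO": tpr0 - tpr1,                                 # Equality of Opportunity gap
        "Eodds": max(abs(tpr0 - tpr1), abs(fpr0 - fpr1)),  # assumed Eodds summary
    }


# Hypothetical toy predictions: four unprivileged and four privileged samples.
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
z = [0, 0, 0, 0, 1, 1, 1, 1]
report = fairness_report(y_true, y_pred, z)
```

A value of SP or EO near 0, or DI near 1, indicates that the two groups are treated similarly under the corresponding criterion.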

Fairness of Recourse. Another recent research trend is to apply Explainable Artificial Intelligence (XAI) methods to address fairness issues. One of the key components in this area is counterfactual explanation (CE), sometimes also called algorithmic recourse. CE focuses on explaining why a particular outcome occurred instead of an alternative plausible outcome [15, 16]. Recourse refers to identifying the closest counterfactuals that could alter the result with minimal feature changes. Several algorithms have been developed to generate such counterfactual explanations for machine learning models [17, 18, 19]. The concept of fairness of recourse is proposed by [20] and defined as the disparity of the mean cost to achieve the desirable recourse among the unprivileged subgroups. [6, 21] propose metrics based on the cost of counterfactual explanation to measure fairness performance across subgroups. Predictive Counterfactual Fairness (PreCoF) [22] utilises CEs to detect the underlying patterns for the discrimination in the model.

Bias Mitigation Methods. Bias mitigation methods can be categorized into three stages: pre-processing, in-processing, and post-processing [8, 23]. Pre-processing mitigations aim to reduce bias by modifying the training data and creating a fairer training dataset [24, 25, 26]. In-processing mitigation occurs during training by adding regularization terms and constraints to models [11, 27]. Mitigations in the post-processing stage, like calibration, are applied after a model has been successfully trained [21, 28]. Both pre-processing and post-processing-based methods are model-agnostic as they occur before and after the model training.

Over-sampling in the pre-processing stage refers to changing the distribution of the training dataset by adding more samples. Duplicating instances of the unprivileged group is one straightforward strategy [29, 30]. [31, 32] generate synthetic samples around the unprivileged group to mitigate bias. MixSG [10] takes both the privileged and unprivileged groups into consideration when synthesizing new data using mixup, but the potential bias in generated labels has not been discussed yet.

3. Preliminaries and Problem Statement

Notations. Given the dataset $D = \{(x_i, y_i, z_i)\}_{i=1}^{n}$ with $n$ samples, where $X$ is the feature space and each feature in $X$ takes values from its own domain, each sample has a label $y \in Y := \{0, 1\}$ and a sensitive attribute $z \in Z := \{0, 1\}$. The dataset is divided into a training set $D_{train}$ and a test set $D_{test}$. We use $D_{train}$ to fit a classifier model $f: X \to Y$ and $D_{test}$ to assess the model's prediction and fairness performance. Fairness is measured by the difference in the model's performance between the subgroups identified by $Z$. We define the unprivileged/minority group as $Z=0$, and $Z=1$ is the privileged/majority group.

Mixup Strategy in Fairness. Mixup [9] is a data augmentation technique that involves blending pairs of samples to create new synthetic training examples. The premise of mixup is that linear combinations of features will result in the same linear combinations of target labels. Thus, mixup applies stochastic linear combinations to samples $(x_0, y_0)$ and $(x_1, y_1)$ to generate a new sample $(\tilde{x}, \tilde{y})$, with a random parameter $\lambda$ drawn from a Beta distribution:

$\tilde{x} = \lambda * x_0 + (1 - \lambda) * x_1$, where $x_0, x_1$ are input vectors    (1)

$\tilde{y} = \lambda * y_0 + (1 - \lambda) * y_1$, where $y_0, y_1$ are target labels    (2)

To address fairness concerns, previous research has explored the practice of sampling $x_0$ and $x_1$ from different subgroups, applying this step both in the pre-processing stage, as in mixSG [10], and in the in-processing stage, as in fairMixup [11], as bias mitigation methods.
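The linear mixup of two samples can be sketched as follows. This is a minimal illustration rather than the mixSG or fairMixup implementation: the function name, the list-based feature vectors, and the default shape parameter `alpha=1.0` are assumptions made for the example.

```python
import random


def mixup(x0, y0, x1, y1, alpha=1.0, rng=None):
    """Blend two samples with a mixing ratio lam drawn from Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    # The same ratio is applied to the features and to the labels.
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x0, x1)]
    y_mix = lam * y0 + (1 - lam) * y1
    return x_mix, y_mix, lam


# Hypothetical pair: one sample from each subgroup, with labels 0 and 1.
x_mix, y_mix, lam = mixup([1.0, 0.0], 0, [3.0, 4.0], 1, rng=random.Random(42))
```

Because the label is interpolated with the same ratio as the features, the mixed label depends only on the two original labels and the ratio, not on any other sample in the dataset.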

Bias Persists After Mixup

The premise of mixup lies in the linear relationship between features and labels. The challenge here is if the original labels in the dataset are biased, the labels of mixed samples can retain this bias. The newly generated biased samples can impact the fairness of the trained model.

Gender is considered as the sensitive attribute, dividing the data into subgroups. Here, we consider the female subgroup as unprivileged.

The table shows that the individual features of the male samples (M1 and M2) and the female sample (F2) are remarkably similar (Officer with similar capital gain and age), but they have different income labels. This shows an initial bias: the female and male groups are treated unequally.

We follow the mixSG method to select one sample from one subgroup and another from the other subgroup to generate $(\tilde{x}, \tilde{y})$. Assume we have chosen one sample F2 from the female subgroup; F2 will be randomly paired with either M1 or M2 from the male subgroup. If the mixture ratio of the female sample F2 is over 50%, we say the mixed sample $\tilde{x}$ is female. Otherwise, $\tilde{x}$ is male.

When the random $\lambda = 0.8$, $\tilde{x}$ will be a female sample, and the label $\tilde{y}$ of the mixed female sample will primarily depend on the label of the female sample F2. This means that both combinations of F2 with M1 or M2 will have a high probability of low income (≤50K). Though the individual features of the high-income men (M1 and M2) and the low-income woman (F2) are remarkably similar (Officer with similar capital gain and age), the mixed label still indicates a tendency toward lower income for the female sample. If $\lambda = 0.2$, the mixed label will mostly depend on the label of the male sample, and the generated sample becomes male with high income. The labels of mixed samples are therefore heavily influenced by gender. Considering the initial bias in the dataset, new samples generated by mixup can deepen gender bias against unprivileged groups, causing the model to be more likely to predict male samples as high-income and female samples as low-income under similar conditions.
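As a worked instance of the two mixing ratios discussed above, assume the low-income label of F2 is encoded as 0 and the high-income label of M1 as 1 (this 0/1 encoding is an assumption made here for illustration), and write $\lambda$ for the share of the female sample:

```latex
\begin{align*}
\lambda = 0.8:\quad \tilde{y} &= 0.8 \cdot 0 + (1 - 0.8) \cdot 1 = 0.2
  \quad \text{(female sample, closer to low income)} \\
\lambda = 0.2:\quad \tilde{y} &= 0.2 \cdot 0 + (1 - 0.2) \cdot 1 = 0.8
  \quad \text{(male sample, closer to high income)}
\end{align*}
```

In both cases the mixed label follows the label of the group that dominates the mix, regardless of how similar the features are.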

4. Methodology and Experiment Design

To address the issue of possibly biased labels in mixup, we propose a method called ProxiMix, which considers both pairwise and proximity samples, to reduce dataset bias. It synthesizes a new set $D_{new}$ of samples $(\tilde{x}, \tilde{y}, \tilde{z})$ from $D_{train}$ for improvements. Fitting the model with the fairer dataset $D'_{train} = D_{train} \cup D_{new}$ is expected to improve its fairness performance.

[Figure 1: Three cases (Case 1, Case 2, Case 3) of mixing samples $s_0$ and $s_1$, each comparing the label of the sample from Mixup with the label of the sample from ProxiMix.]

4.1. ProxiMix Algorithm

The Importance of Proximity Awareness. Given a sample $s_0$ from group $D_{train}(Z = 0)$ and another sample $s_1$ from $D_{train}(Z = 1)$, the set of proximity samples is defined as $ProxiSet = \{p_0, p_1, ..., p_k\}$. The label value of each sample can be either 0 or 1. We illustrate three cases when mixing up two samples $s_0$ and $s_1$: (1) Case 1: Labels of $s_0$, $s_1$ and all of their proximity samples are the same. (2) Case 2: Labels of $s_0$ and $s_1$ are the same, but there exist different labels among the proximity samples. (3) Case 3: Labels of $s_0$ and $s_1$ are different. Figure 1 presents these three cases.

In Case 1, linear mixing and proximity yield the same results because there are no impurities between the two samples. In Case 2, both samples $s_0$ and $s_1$ have the same label. This implies that direct mixing will result in all labels becoming 0 regardless of the mixing ratio. This approach ignores the samples with label 1 in between and can potentially introduce bias when predicting subgroups with the label 1. In Case 3, the mixed label depends on the mixing rate when using mixup directly. Specifically, the mixed label becomes 1 when the mixing rate exceeds 0.5. However, we can see in the example that the majority of the proximity samples between $s_0$ and $s_1$ belong to 0. This suggests that the probability of being classified as 0 should be higher. Considering the proportion of proximity labels can enhance the probability of being classified as 0.

ProxiMix Algorithm Design. ProxiMix consists of two parts: we first introduce the proximity-based mixed label $y_p$ and then combine it with $y_\lambda$ from the existing mixup [10] using a balancing degree $d$.

As discussed above, the current mixup approach does not account for potential biases in labels. Our proposal aims to determine the mixed label by considering the proportions of labels in proximity samples. Specifically, when mixing two samples, $s_0$ and $s_1$, we calculate their Euclidean distance with their one-hot encoded features¹, denoted as $dist = ||x_0 - x_1||$, to measure their proximity. Then, we select all the samples that are within the distance $dist$ from $x_0$ to form a potential proximity sample set ProxiSet. The final mixed label for $s_0$ and $s_1$ is assigned based on the label with the larger proportion within $s_0 \cup ProxiSet$.

Let's look back at the toy example, where $s_0 \cup ProxiSet = \{F2, M1, M2\}$ when we want to mix F2 with either M1 or M2. Two-thirds of the labels in this set are high income, so the proximity-based mixed label $y_p$ is high income.

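We combine our proximity-based $y_p$ with $y_\lambda$ from the current mixup to form the new definition of the mixed label $\tilde{y}$, achieved by calculating $d * y_\lambda + (1 - d) * y_p$, where $d$ is a balancing degree between 0 and 1. The algorithm pseudocode is described in Algorithm 1.

Algorithm 1 ProxiMix Algorithm

Input: $s_0(x_0, y_0, z_0) \sim D_{train}(Z = 0)$, $s_1(x_1, y_1, z_1) \sim D_{train}(Z = 1)$
procedure ProxiMix($s_0$, $s_1$, $D_{train}$, $d$)
    procedure Proximity-Based-Mixed($s_0$, $s_1$, $D_{train}$)
        ProxiSet = []
        $dist = ||x_0 - x_1||$
        for each sample $s(x, y, z)$ in $D_{train}(Z = 1)$ do
            $dist_s = ||x - x_0||$
            if $dist_s \le dist$ then
                Add $s$ to ProxiSet
            end if
        end for
        ProxiSet = $s_0 \cup$ ProxiSet
        $y_p$ = label with the larger proportion among the labels in ProxiSet
    end procedure
    procedure Lambda-Based-Mix($s_0$, $s_1$)
        $\lambda = Beta(\alpha, \alpha)$
        $y_\lambda = \lambda * y_0 + (1 - \lambda) * y_1$
    end procedure
    $\tilde{y} = d * y_\lambda + (1 - d) * y_p$, $d \in [0, 1]$
    Return $\tilde{y}$
end procedure

¹ scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html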
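The label computation of Algorithm 1 can be sketched in plain Python as below. This is an illustrative reading of the pseudocode, not the authors' implementation: samples are reduced to hypothetical `(features, label)` pairs, the toy coordinates for F2, M1 and M2 are invented, the mixing ratio is passed in rather than sampled, and ties in the proximity vote are broken toward label 1 because the source does not state a tie rule.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two (already encoded) feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def proximity_label(s0, s1, other_group):
    """Majority label among s0 and the other-group samples no farther from s0 than s1."""
    x0, y0 = s0
    x1, _ = s1
    radius = euclidean(x0, x1)
    proxi_labels = [y for x, y in other_group if euclidean(x, x0) <= radius]
    counts = Counter([y0] + proxi_labels)
    return 1 if counts[1] >= counts[0] else 0  # assumed tie rule


def proximix_label(s0, s1, other_group, d, lam):
    """Blend the lambda-based label and the proximity-based label with degree d."""
    y_lambda = lam * s0[1] + (1 - lam) * s1[1]
    y_prox = proximity_label(s0, s1, other_group)
    return d * y_lambda + (1 - d) * y_prox


# Hypothetical 2-D encoding of the toy example: 0 = low income, 1 = high income.
F2 = ([0.0, 0.0], 0)
M1 = ([1.0, 0.0], 1)
M2 = ([0.0, 1.0], 1)
males = [M1, M2]

label_mixup_only = proximix_label(F2, M1, males, d=1.0, lam=0.8)
label_proximity_only = proximix_label(F2, M1, males, d=0.0, lam=0.8)
label_balanced = proximix_label(F2, M1, males, d=0.5, lam=0.8)
```

With these toy values, the purely lambda-based label stays close to the female sample's low-income label, the purely proximity-based label follows the high-income majority of the neighbourhood, and intermediate degrees interpolate between the two.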


Fig. 2 shows an example of how ProxiMix works. Samples are categorized into two subgroups, green and blue, based on their colors. The shape of each sample represents its label: circles for label 0, and plus-signs for label 1. Specifically, the green circle ($s_0$) and the blue plus-sign ($s_1$) are two samples selected for ProxiMix. The new label of the mixed samples changes with different values of the balancing parameter $d$. The varying shades of blue samples represent the impact degree of $y_p$, while the thickness of the red lines between $s_0$ and $s_1$ represents the strength of $y_\lambda$. The black line indicates no consideration for $y_\lambda$. For $d = 1$, it employs the original mixup label $y_\lambda$; for $d = 0$, it utilizes our proximity-based $y_p$ exclusively; and it combines the two for values in between. We will discuss how different values of $d$ impact the model performance in Section 5.2.

Accelerating Calculation of ProxiMix in Practice. Our core idea is to introduce the label set of the proximity samples as a reference when performing label mixup. To enhance computational efficiency, we find ProxiSet first in practice. Our implementation is as follows: (1) Given a randomly selected sample $s_0$ from $D_{train}(Z = z)$, we first find its ProxiSet from $D_{train}(Z = \neg z)$. ProxiSet contains $k$ samples that are proximal to $s_0$; (2) Then, we treat each sample in ProxiSet as $s_1$ and sequentially mix it with $s_0$, following the 'furthest-first' rule. This means the mixing begins with the sample in ProxiSet that is furthest from $s_0$. After each mix, we remove the used sample from ProxiSet; (3) Repeat this process $N/k$ times until the desired $N$ new samples are generated. The generated samples are merged into $D_{train}$ as training samples for the classification model.
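The three steps above can be sketched as follows. This is one possible reading, under the assumption that sorting the proximity set once by distance to the selected sample lets each furthest-first partner reuse the remaining closer samples as its neighbourhood; the function names and the `(features, label)` representation are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def build_proxiset(x0, other_group, k):
    """Step 1: the k other-group samples closest to x0."""
    return sorted(other_group, key=lambda s: euclidean(s[0], x0))[:k]


def furthest_first_pairs(x0, proxiset):
    """Steps 2 and 3: yield (partner, neighbours), starting with the furthest partner."""
    remaining = sorted(proxiset, key=lambda s: euclidean(s[0], x0))
    while remaining:
        neighbours = list(remaining)  # samples at most as far from x0 as the partner
        partner = remaining.pop()     # furthest remaining sample plays the role of s1
        yield partner, neighbours


# Hypothetical 1-D data: (features, label) pairs from the opposite subgroup.
x0 = [0.0]
other_group = [([3.0], 1), ([1.0], 0), ([2.0], 1), ([10.0], 0)]
proxiset = build_proxiset(x0, other_group, k=3)
pairs = list(furthest_first_pairs(x0, proxiset))
```

Each yielded pair can then be passed to a label-mixing routine, with `neighbours` supplying the labels for the proximity vote, so distances to the selected sample are computed once per proximity set rather than once per mixed pair.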

4.2. Experiment Setting

Fig. 3 presents the overall workflow of our experiment. The balancing degree $d$ in our mixup algorithm is tested with values ranging from 0 to 1, in increments of 0.1. The number of proximity samples for each round is set to 25. We consider proximity only when there are at least 5 neighbors, to ensure credibility. The mixing ratio $\lambda$ is randomly generated from the Beta(1, 1) distribution.
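These settings can be written down as a small configuration sketch. The numeric values come from the paragraph above; the constant and function names are hypothetical, and the seed passed to the generator in the example is arbitrary.

```python
import random

# Values stated in the experiment setting; the names are illustrative only.
BALANCING_DEGREES = [round(0.1 * i, 1) for i in range(11)]  # d = 0.0, 0.1, ..., 1.0
PROXIMITY_SAMPLES_PER_ROUND = 25
MIN_NEIGHBOURS = 5


def use_proximity(num_neighbours: int) -> bool:
    """Proximity is only considered when enough neighbours are available."""
    return num_neighbours >= MIN_NEIGHBOURS


def draw_mixing_ratio(rng: random.Random) -> float:
    """Beta(1, 1) is the uniform distribution on [0, 1]."""
    return rng.betavariate(1, 1)


ratio = draw_mixing_ratio(random.Random(0))
```

Since Beta(1, 1) is uniform, every mixing ratio between 0 and 1 is equally likely, so neither sample of a pair is favoured on average.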

Datasets. The experiment is conducted on three datasets for classification problems: (1) Adult income dataset [33]: predicting whether a person's annual income exceeds 50K (high/low income); (2) Law school dataset [34]: predicting whether a person in law school will fail/pass the exam; (3) Credit default dataset [35]: predicting whether a person's credit payment will be on-time/overdue.

Models. Three models, including logistic regression (LogReg), decision trees (DT) and multi-layer perceptron (MLP), are tested. All implementations are based on scikit-learn. The maximum depth is 7 in the decision tree. We use a three-layer MLP with 128 neurons in the $i$th hidden layer, 'relu' as the activation function, and a maximum of 1500 iterations. The random seed is set to 42 for reproducible results.

[Figure 3: Overall experiment workflow. Each dataset (Adult/Credit/Law) is split into Train $D_{train}$ and Test $D_{test}$; (1) ProxiMix turns $D_{train}$ into $D'_{train}$; (2) the models (LogReg, DTree, MLP) are trained; (3) predictions are made on $D_{test}$; (4) evaluation uses prediction, fairness, and CF cost metrics.]

Metrics. Prediction performance metrics are based on the True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) counts in the confusion matrix. The equations of Precision, Recall, and F1-score are as follows. Recall is also called True Positive Rate.

$Precision = \frac{TP}{TP + FP}$, $Recall = \frac{TP}{TP + FN}$, $F1 = \frac{2 * Precision * Recall}{Precision + Recall}$

The cost of a counterfactual explanation is the distance between a sample and its counterfactual. In this way, we can compute the counterfactual cost for each sample in the dataset. The average costs of counterfactuals across different groups can be considered as a measure of fairness: as the cost gap between groups (e.g., females and males) increases, the model's unfairness also grows. Our evaluation follows the implementation of the counterfactual explanation cost package⁴, and specifically, we opt for the counterfactual explanation cost without constraints as the metric.
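The prediction metrics can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, Recall (True Positive Rate) and F1-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```

Computing these per subgroup, rather than over the whole test set, gives the absolute per-group values that complement the relative fairness metrics.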

5. Results

In Section 5.1, we fix the balancing degree of ProxiMix and examine the impact of different sampling modes for subgroups on the outcomes. In Section 5.2, we fix the sampling mode and explore the impact of different balancing degrees on the results. To ensure the consistency of findings, Section 5.3 assesses the effectiveness of ProxiMix from the counterfactual cost perspective.

5.1. Sampling Mode Preferences in ProxiMix with Fixed Balancing Degree

ProxiMix is built on the mixup concept, which involves continuously selecting and mixing two samples to generate new data. To identify which combinations of samples have a more positive impact on the model's performance, we divide the dataset into different subgroups and sample from them.

There are four subgroups with considerations on both labels and values of a single sensitive feature in $D_{train}$. The first sample selected from each group $D_{train}(Y = y, Z = z)$ is notated as $S_1$, $S_2$, $S_3$, $S_4$, and the second sample selected from the subgroup $D_{train}(Z = \bar{z})$, which has the opposite sensitive label, is notated as $S_1'$, $S_2'$, $S_3'$, $S_4'$, respectively. In Table 2, $S_1$ is sampled from the <female, low income> subgroup in the Adult dataset, from the <female, failed> subgroup in the Law dataset, and from the <female, on-time> subgroup in the Credit dataset, respectively. $S_1'$ refers to the sample selected from the male group in the Adult, Law and Credit datasets. All sampling combinations are listed in Table 2. We denote the sample derived from ProxiMix with different sampling combination modes as $S_i \odot S_j$, where $i \in \{1, 2, 3, 4\}$, $j \in \{1', 3'\}$.
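The subgroup sampling described above can be sketched as follows. This is an illustration only: samples are hypothetical `(features, label, sensitive value)` triples, the function names are invented, and the second sample is drawn from the whole opposite sensitive group regardless of its label, which is one reading of the description.

```python
import random
from collections import defaultdict


def split_subgroups(samples):
    """Group (x, y, z) samples by their (label, sensitive value) pair."""
    groups = defaultdict(list)
    for x, y, z in samples:
        groups[(y, z)].append((x, y, z))
    return groups


def sample_pair(groups, y, z, rng):
    """Draw one sample from subgroup (y, z) and one from the opposite sensitive group."""
    first = rng.choice(groups[(y, z)])
    opposite = [s for (_, gz), members in groups.items() if gz != z for s in members]
    second = rng.choice(opposite)
    return first, second


# Hypothetical data: one sample per subgroup.
samples = [
    ([0.1], 0, 0), ([0.2], 1, 0),  # unprivileged group, labels 0 and 1
    ([0.3], 0, 1), ([0.4], 1, 1),  # privileged group, labels 0 and 1
]
groups = split_subgroups(samples)
first, second = sample_pair(groups, y=1, z=0, rng=random.Random(0))
```

Fixing `(y, z)` for the first sample corresponds to choosing which subgroup is augmented, for example the positively labelled part of the unprivileged group.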

Table 3 presents model performance using ProxiMix under four sampling combinations $S_i \odot S_j$ and compares it with the performance without any augmentation (baseline).

⁴ github.com/HammerLabML/ModelAgnosticGroupFairnessCounterfactuals/

In the Adult dataset, we found that different subgroup sampling combinations have different impacts on ProxiMix performance. The $S_2 \odot S_1'$ combination (augmenting high-income female) significantly improves the fairness performance of both decision tree and logistic regression models. In contrast, $S_1 \odot S_1'$ (augmenting low-income female) degrades the fairness of both models, suggesting it introduces extra bias to the underrepresented group. This implies that focusing on underrepresented labels in the unprivileged group when generating samples (such as high income) can greatly improve fairness performance.

In the Law dataset, nearly all mixup methods enhance model prediction performance, but only marginally improve fairness. This is because the fairness performance (DP%) without any augmentation already exceeds 90%, indicating minimal bias in the model. Therefore, the potential for improvement is limited.

Overall, ProxiMix enhances fairness when a model displays significant bias. Also, the choice of the subgroup for sampling during mixup is important: some enhance fairness, while others can even worsen it.

5.2. The Impact of Balancing Degree in ProxiMix

In the above section we have discussed the different sampling strategies with a balanced mixup ($d = 0.5$). This section explores how different values of $d$ in ProxiMix can impact model performance. Here, we fix the strategy $S_i \odot S_j$ while changing the balancing degree $d$.

Fig. 4 illustrates the impact of data augmentation on model fairness in the Credit dataset, under the $S_1 \odot S_1'$ and $S_3 \odot S_3'$ strategies, with different degrees $d$. The trend shows that most combinations positively affect model fairness, with an optimal $d$ that maximizes fairness improvements. The best performance is achieved at $d=0.7$ for the $S_1 \odot S_1'$ strategy, while for $S_3 \odot S_3'$, the optimal performance is reached at $d=0.2$.

Similar patterns are observed in the Adult dataset (Fig. 5): the impact of different values of $d$ on the model also shows a trend. Specifically, data generated with the $S_2 \odot S_2'$ strategy shows the better improvement in model fairness when $d$ ranges from 0.2 to 0.5.

We noticed that the best fairness DP% and Eodds% occur at $d=1$ under $S_4 \odot S_3'$. However, the TPR of both the female and male groups declines when $d$ exceeds 0.5. [36] mentions a similar scenario and suggests considering both relative and absolute values in fairness performance. To further investigate their performance in absolute values, Table 4 presents the model's performance across different subgroups. We can see that the model trained with data augmentation with $d$ in the 0 to 0.5 range, although having lower fairness metrics compared to $d=1$, shows an absolute improvement in model performance. Therefore, we conclude that the optimal balancing degree for the $S_4 \odot S_3'$ strategy is 0.2.

[Figure 4 panels: Fairness Performance on Credit Dataset (C1 ⊙ C1′); Fairness Performance on Credit Dataset (C3 ⊙ C3′)]

[Table caption: Counterfactual explanations cost comparison on the Adult dataset with Decision Tree across female (F) and male (M) subgroups with different balancing degrees $d = [0, 0.5, 1]$.]

5.3. Counterfactual Cost across Diferent Groups

We now evaluate the effectiveness of our algorithm from the XAI perspective, and the results are consistent with the above observations. First, we calculate the average (avg) and standard deviation (std) of the counterfactual cost across female (F) and male (M) subgroups. Then, we compare the cost gaps between the two groups. A smaller gap indicates fairer counterfactual explanations across different groups. In the Adult dataset, $S_2 \odot S_1'$ continues to show more significant bias mitigation performance. In the Law school dataset, as we have discussed above, the improvement is limited because the bias in the original dataset is not significant.
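The group-level comparison described above can be sketched as follows. The cost values are hypothetical and stand in for per-sample counterfactual costs produced by an external tool; the function names are invented, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
import statistics


def cost_summary(costs):
    """Average and standard deviation of counterfactual costs within one group."""
    return statistics.mean(costs), statistics.pstdev(costs)


def cost_gap(costs_f, costs_m):
    """Absolute difference between the average costs of the two groups."""
    return abs(statistics.mean(costs_f) - statistics.mean(costs_m))


# Hypothetical per-sample counterfactual costs for each subgroup.
costs_female = [2.0, 3.0, 4.0]
costs_male = [1.0, 2.0, 3.0]

avg_f, std_f = cost_summary(costs_female)
avg_m, std_m = cost_summary(costs_male)
gap = cost_gap(costs_female, costs_male)
```

Comparing the gap before and after augmentation, for the same model and dataset, shows whether the augmented training set brings the recourse costs of the two groups closer together.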

6. Conclusion

This paper proposes a new debiasing algorithm called ProxiMix. It extends the mixup technique by considering labels from proximity samples in the subgroup to mitigate potential bias in the pre-processing stage. Our experiments evaluated the performance of ProxiMix with different sampling combinations and balancing degrees. The results indicate that adding proximity-based labels improves fairness performance, and that there exists an optimal balancing degree for achieving the most significant enhancement. These observations were further supported by the experimental results on the cost comparison of counterfactual explanations. In future work, we plan to extend ProxiMix to multi-class tasks and consider intersectional fairness.

Acknowledgments

This work is funded by Doctoral Training Partnership Studentship of Engineering and Physical Sciences Research Council (EPSRC-DTP, EP/W524414/1/2894964).

References

[1] O. A. Osoba, W. Welser IV, An intelligence in our image: The risks of bias and errors in artificial intelligence, Rand Corporation, 2017.
[2] Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317.
[3] A. B. Arrieta, Díaz-Rodríguez, J. Del Ser, Bennetot, Tabik, Barbado, García, Gil-López, Molina, Benjamins, et al., Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, Information fusion 58 (2020) 82–115.
[4] Ciatto, Sabbatini, Agiollo, Magnini, Omicini, Symbolic knowledge extraction and injection with sub-symbolic predictors: A systematic literature review, ACM Computing Surveys 56 (2024) 1–35.
[5] I. D. Raji, Buolamwini, Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial ai products, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 429–435.
[6] Kavouras, Tsopelas, Giannopoulos, Sacharidis, Psaroudaki, Theologitis, Rontogiannis, Fotakis, I. Emiris, Fairness aware counterfactuals for subgroups, in: Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[7] Gohar, L. Cheng, A survey on intersectional fairness in machine learning: Notions, mitigation, and challenges, arXiv preprint arXiv:2305.06969 (2023).
[8] Hort, Chen, J. M. Zhang, Sarro, Harman, Bias mitigation for machine learning classifiers: A comprehensive survey, arXiv preprint arXiv:2207.07068 (2022).
[9] Zhang, Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412 (2017).
[10] Navarro, Little, G. I. Allen, Segarra, Data augmentation via subgroup mixup for improving fairness, arXiv preprint arXiv:2309.07110 (2023).
[11] C.-Y. Chuang, Y. Mroueh, Fair mixup: Fairness via interpolation, arXiv preprint arXiv:2103.06503 (2021).
[12] B. T. Luong, Ruggieri, Turini, K-nn as an implementation of situation testing for discrimination discovery and prevention, in: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 502–510.
[13] M. B. Zafar, I. Valera, M. Gomez Rodriguez, K. P. Gummadi, Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment, in: Proceedings of the 26th international conference on world wide web, 2017, pp. 1171–1180.
[14] Hardt, Price, Srebro, Equality of opportunity in supervised learning, Advances in neural information processing systems 29 (2016).
[15] B. G. Buchanan, E. H. Shortliffe, Rule based expert systems: the mycin experiments of the stanford heuristic programming project (the Addison-Wesley series in artificial intelligence), Addison-Wesley Longman Publishing Co., Inc., 1984.
[16] S. Gregor, I. Benbasat, Explanations from intelligent systems: Theoretical foundations and implications for practice, MIS quarterly (1999) 497–530.
[17] R. K. Mothilal, A. Sharma, C. Tan, Explaining machine learning classifiers through diverse counterfactual explanations, in: Proceedings of the 2020 conference on fairness, accountability, and transparency, 2020, pp. 607–617.
[18] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual explanations without opening the black box: Automated decisions and the gdpr, Harv. JL Tech. 31 (2017) 841.
[19] D. Brughmans, P. Leyman, D. Martens, Nice: an algorithm for nearest instance counterfactual explanations, Data Mining and Knowledge Discovery (2023) 1–39.
[20] V. Gupta, P. Nokhiz, C. D. Roy, S. Venkatasubramanian, Equalizing recourse across groups, arXiv preprint arXiv:1909.03166 (2019).
[21] A. Artelt, B. Hammer, Explain it in the same way!–model-agnostic group fairness of counterfactual explanations, arXiv preprint arXiv:2211.14858 (2022).
[22] S. Goethals, D. Martens, T. Calders, Precof: counterfactual explanations for fairness, Machine Learning (2023) 1–32.
[23] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, D. Roth, A comparative study of fairness-enhancing interventions in machine learning, in: Proceedings of the conference on fairness, accountability, and transparency, 2019, pp. 329–338.
[24] F. Kamiran, T. Calders, Classifying without discriminating, in: 2009 2nd international conference on computer, control and communication, IEEE, 2009, pp. 1–6.
[25] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, S. Venkatasubramanian, Certifying and removing disparate impact, in: proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 259–268.
[26] H. Sun, K. Wu, T. Wang, W. H. Wang, Towards fair and robust classification, in: 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), IEEE, 2022, pp. 356–376.
[27] F. Kamiran, T. Calders, M. Pechenizkiy, Discrimination aware decision tree learning, in: 2010 IEEE international conference on data mining, IEEE, 2010, pp. 869–874.
[28] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, K. Q. Weinberger, On fairness and calibration, Advances in neural information processing systems 30 (2017).
[29] J. J. Amend, S. Spurlock, Improving machine learning fairness with sampling and adversarial learning, J. Comput. Sci. Coll 36 (2021) 14–23.
[30] A. Morano, Bias mitigation for automated decision making systems, Politecnico di Torino (2020).
[31] D. Dablain, B. Krawczyk, N. Chawla, Towards a holistic view of bias in machine learning: Bridging algorithmic fairness and imbalanced learning, arXiv preprint arXiv:2207.06084 (2022).
[32] J. Chakraborty, S. Majumder, T. Menzies, Bias in machine learning software: Why? how? what to do?, CoRR (2021).
[33] R. Kohavi, et al., Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid., in: Kdd, volume 96, 1996, pp. 202–207.
[34] K. Xivuri, H. Twinomurinzi, A systematic review of fairness in artificial intelligence algorithms, in: Responsible AI and Analytics for an Ethical and Inclusive Digitized Society: 20th IFIP WG 6.11 Conference on e-Business, e-Services and e-Society, I3E 2021, Galway, Ireland, September 1–3, 2021, Proceedings 20, Springer, 2021, pp. 271–284.
[35] I.-C. Yeh, C.-h. Lien, The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert systems with applications 36 (2009) 2473–2480.
[36] G. Maheshwari, A. Bellet, P. Denis, M. Keller, Fair without leveling down: A new intersectional fairness definition, in: EMNLP 2023-The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[37] T. Le Quy, A. Roy, V. Iosifidis, W. Zhang, E. Ntoutsi, A survey on datasets for fairness-aware machine learning, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12 (2022) e1452.

A. Appendices: Dataset Description

A.1. Adult Income Dataset

The Adult Income dataset is also known as the Census Income dataset. Its documentation⁵ provides a detailed description of 14 features in the dataset. We omitted some features, such as ‘fnlwgt’, and the final features we used after data cleaning are as follows.

⁵ https://www.cs.toronto.edu/~delve/data/adult/adultDetail.html

A.2. Law School Dataset

The Law School dataset contains admission records for law schools. We followed the description provided in [37] and the data cleaning pipeline in [21], extracting the following features for the experiment.

A.3. Credit Default Dataset

The Credit Default dataset, also known as the credit card clients dataset, explores default payments on credit cards. The features and their descriptions are as follows.

B. Appendices: Results

B.1. ProxiMix in Credit Default Dataset with MLP model

B.2. ProxiMix in Adult Income Dataset with MLP model