=Paper=
{{Paper
|id=Vol-3908/paper_2
|storemode=property
|title=Can Generative AI-based Data Balancing Mitigate Unfairness Issues in Machine Learning?
|pdfUrl=https://ceur-ws.org/Vol-3908/paper_2.pdf
|volume=Vol-3908
|authors=Benoît Ronval,Siegfried Nijssen,Ludwig Bothmann
|dblpUrl=https://dblp.org/rec/conf/ewaf/RonvalNB24
}}
==Can Generative AI-based Data Balancing Mitigate Unfairness Issues in Machine Learning?==
Benoît Ronval¹,*, Siegfried Nijssen¹ and Ludwig Bothmann²,³
¹ ICTEAM, UCLouvain, Belgium
² Department of Statistics, LMU Munich, Germany
³ Munich Center for Machine Learning (MCML)
EWAF’24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
Contact: benoit.ronval@uclouvain.be (B. Ronval); siegfried.nijssen@uclouvain.be (S. Nijssen); ludwig.bothmann@lmu.de (L. Bothmann)
https://www.slds.stat.uni-muenchen.de/people/bothmann/ (L. Bothmann)
ORCID: 0000-0002-1471-6582 (L. Bothmann)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract
Data imbalance in the protected attributes can lead to machine learning models that perform better on
the majority than on the minority group, giving rise to unfairness issues. While a baseline method like
SMOTE can balance datasets, we investigate how methods of generative artificial intelligence compare
with respect to classical fairness metrics. Using generated fake data, we propose different balancing methods
and investigate the behavior of classification models in thorough benchmark studies using German credit
and Berkeley admission data. While our experiments suggest that such methods may improve fairness
metrics, further investigations are necessary to derive clear practical recommendations.
Keywords
Fairness, Generative AI, Imbalanced Data, Large Language Models, Machine Learning
1. Imbalanced data and fairness
Fairness issues in automated decision-making (ADM) systems may be attributed to different
sources: the data, the machine learning (ML) algorithm (learner) and the user interactions [1].
We focus on algorithmic bias [2] where an ML model introduces a bias to previously unbiased
data and tackle the subproblem of data imbalance in the protected attribute (PA). Imbalance in
the PAs can give the learner a wrong incentive [3, 4]: Let us assume that a target 𝑌 shall be
predicted based on features 𝑋 and that the PA 𝐴 is a binary feature Gender (an extension to multi-categorical Gender is straightforward; we focus on the binary version for simplicity). A learner that
uses empirical risk minimization could gain more from fitting the majority group very closely
than from spending model complexity on the minority group. This would lead to a model that
has substantially better performance on the majority group, giving rise to unfairness issues [5].
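To make this incentive explicit, the empirical risk can be decomposed into group-wise risks weighted by the group proportions (a standard decomposition, added here for illustration; 𝑛_maj and 𝑛_min denote the group sizes):

```latex
\[
R_{\mathrm{emp}}(\hat f)
  = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, \hat f(x_i)\bigr)
  = \underbrace{\frac{n_{\mathrm{maj}}}{n}}_{\text{large}}\, R_{\mathrm{maj}}(\hat f)
  + \underbrace{\frac{n_{\mathrm{min}}}{n}}_{\text{small}}\, R_{\mathrm{min}}(\hat f),
\qquad
R_{g}(\hat f) = \frac{1}{n_{g}} \sum_{i:\, A_i = g} L\bigl(y_i, \hat f(x_i)\bigr).
\]
```

If 𝑛_maj is much larger than 𝑛_min, a given reduction of the majority-group risk lowers the empirical risk far more than the same reduction on the minority group, so a learner with limited capacity is rewarded for fitting the majority group closely.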
A natural countermeasure would be to obtain more data from the minority group. In most
applications, however, it is not possible to sample new data from the data-generating process
(DGP), which calls for artificial data augmentation. The challenge of imbalanced data is not
new to the ML literature, and many proposals have been made to counteract it; however, these
mostly focus on imbalance in the target variable 𝑌 rather than in the PA 𝐴 [see, e.g., 5, 6, 7].
These proposals include baseline methods such as random under/oversampling, and more
sophisticated methods such as SMOTE [8]. In this work, we investigate how generative artificial
intelligence (genAI) could mitigate unfairness issues related to ML models in ADM systems and
compare such methods with SMOTE.
2. GenAI models
GenAI models learn the distribution of the original data to create fake (or synthetic) observations
from that distribution. The Generative Adversarial Network (GAN) [9] for creating images
comprises a generator and a discriminator, which compete against each other during training.
An alternative for tabular data is the Conditional Tabular GAN (CTGAN) [10], where tabular
data is represented with a mode-specific normalization that the GAN can use.
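As a minimal usage sketch (not part of the paper), fitting a CTGAN and sampling fake observations could look as follows, assuming the open-source ctgan Python package that accompanies [10]; the file and column names are hypothetical.

```python
# Minimal sketch: fit CTGAN on a tabular dataset and sample fake observations.
# Assumes the open-source `ctgan` package (reference implementation of [10]);
# the file name and column names below are hypothetical.
import pandas as pd
from ctgan import CTGAN

df = pd.read_csv("german_credit.csv")
discrete_columns = ["Gender", "Purpose", "Risk"]  # categorical features

model = CTGAN(epochs=200)        # 200 epochs, matching the training setup in Section 3
model.fit(df, discrete_columns)

fake = model.sample(1000)        # draw 1000 fake rows from the learned distribution
# Some versions also allow conditional sampling, e.g. on the protected attribute:
# fake_female = model.sample(500, condition_column="Gender", condition_value="female")
```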
Large Language Models (LLMs) are an important class of genAI models [11]. With adequate processing of the inputs and outputs, using, e.g., a format similar to "[feature] is [feature value]", LLMs can be fine-tuned to generate tabular data, as demonstrated by the model GReaT [12].
Even with a relatively old LLM such as GPT-2 (or DistilGPT-2) [13, 14], GReaT can generate fake data that are closer to the original distribution than those produced by other approaches.
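To illustrate this textual interface, the sketch below (plain Python, not the GReaT implementation itself) serializes a row into the "[feature] is [feature value]" format, with a random feature permutation as used by GReaT, and parses a generated sentence back; the feature names are hypothetical.

```python
# Illustration of the "[feature] is [feature value]" textual encoding used by
# LLM-based tabular generators such as GReaT. Plain Python, not the GReaT code;
# the feature names below are hypothetical.
import random

def encode_row(row: dict) -> str:
    """Serialize a row as comma-separated '[feature] is [value]' clauses,
    in random feature order (the permutation makes the LLM order-invariant)."""
    items = list(row.items())
    random.shuffle(items)
    return ", ".join(f"{feature} is {value}" for feature, value in items)

def decode_row(text: str) -> dict:
    """Parse a generated sentence back into a feature/value dictionary
    (all values come back as strings in this simple sketch)."""
    row = {}
    for clause in text.split(", "):
        feature, _, value = clause.partition(" is ")
        row[feature.strip()] = value.strip()
    return row

row = {"Gender": "female", "Age": 29, "Credit amount": 3200, "Risk": "good"}
text = encode_row(row)   # e.g. "Age is 29, Risk is good, Gender is female, ..."
print(text)
print(decode_row(text))
```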
There exist genAI models that try to produce fair data [15, 16]. Their focus is to ensure that
fake data are representative of the populations. In particular, TabFairGAN [17] is specialized
for tabular data. We intend to use it in future work. Note that we use genAI to learn better
classification models from balanced data; we do not aim to improve genAI itself.
3. Dataset corrections
We first need to train the genAI models. CTGAN was trained for 200 epochs; GReaT, using DistilGPT-2, was fine-tuned for 200 epochs with a batch size of 16 and a learning rate of 0.00005. We trained the genAI models in two different setups: (1) using all original data and (2) using only 80% of it, leaving the remaining 20% as an unseen test set.
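The following sketch illustrates this setup, assuming the open-source be_great package that accompanies GReaT [12] and scikit-learn; the file name is hypothetical, and the learning-rate argument is assumed to be forwarded to the underlying Hugging Face trainer.

```python
# Sketch of the two training setups described above, assuming the open-source
# `be_great` package (GReaT, [12]) and scikit-learn; the file name is hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from be_great import GReaT

df = pd.read_csv("german_credit.csv")

# Setup (2): hold out 20% of the original data as an unseen test set.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Fine-tune DistilGPT-2 via GReaT with the hyperparameters stated in the text
# (200 epochs, batch size 16, learning rate 5e-5; the learning-rate kwarg is
# assumed to be passed through to the Hugging Face training arguments).
great = GReaT(llm="distilgpt2", epochs=200, batch_size=16, learning_rate=5e-5)
great.fit(train_df)                          # setup (1) would instead use `df`
fake = great.sample(n_samples=len(train_df))
```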
We consider different dataset correction methods that use the generated observations to balance the number of male and female observations. The baseline method, called real-only, randomly downsamples the male group: given 𝑚 and 𝑓, the original numbers of male and female observations, this correction reduces 𝑚 to 𝑓, resulting in 2𝑓 original observations. The fake-only datasets also contain 2𝑓 observations, with equal representation of male and female, but use only fake data. The mixture-full and mixture corrections are intermediate approaches: for mixture-full, we add 𝑚 − 𝑓 fake female observations to the original dataset, reaching 2𝑚 observations; the mixture approach downsamples this augmented dataset to 2𝑓 observations.
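The four corrections can be expressed compactly in pandas. The sketch below is an illustrative reimplementation, assuming real and fake data come as data frames with a binary Gender column, enough fake rows per group, and that the mixture correction keeps the groups balanced while downsampling; it is not the authors' code.

```python
# Illustrative reimplementation of the dataset corrections (not the authors' code).
# `real` and `fake` are pandas DataFrames with a binary "Gender" column; m > f,
# and `fake` is assumed to contain enough rows per gender.
import pandas as pd

def _by_gender(df):
    return df[df["Gender"] == "male"], df[df["Gender"] == "female"]

def corrections(real: pd.DataFrame, fake: pd.DataFrame, seed: int = 0) -> dict:
    real_m, real_f = _by_gender(real)
    fake_m, fake_f = _by_gender(fake)
    m, f = len(real_m), len(real_f)

    # real-only: randomly downsample the male group, keeping 2f original rows.
    real_only = pd.concat([real_m.sample(f, random_state=seed), real_f])

    # fake-only: 2f rows with equal gender representation, using only fake data.
    fake_only = pd.concat([fake_m.sample(f, random_state=seed),
                           fake_f.sample(f, random_state=seed)])

    # mixture-full: original data plus (m - f) fake female rows, i.e. 2m rows.
    mixture_full = pd.concat([real, fake_f.sample(m - f, random_state=seed)])

    # mixture: downsample the augmented dataset back to 2f rows; sampling f rows
    # per gender keeps it balanced (one plausible reading of the description).
    mix_m, mix_f = _by_gender(mixture_full)
    mixture = pd.concat([mix_m.sample(f, random_state=seed),
                         mix_f.sample(f, random_state=seed)])

    return {"real-only": real_only, "fake-only": fake_only,
            "mixture-full": mixture_full, "mixture": mixture}
```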
To compare the generation methods (GReaT, CTGAN, SMOTE), the corrections are applied with each method, with two exceptions: (1) the real-only correction is done only once, as it is independent of the generation method; (2) the combination of fake-only and SMOTE is not possible, as SMOTE creates data only for the minority group. In the end, we have 9 corrected versions for a given dataset: 3 with CTGAN, 3 with GReaT, 2 with SMOTE, and 1 real-only. We checked that the fake data follow the distribution of the original data via a visual comparison (not shown due to the page limit). We use the German credit dataset [18] and the Berkeley admission dataset [19] and plan to add other datasets in future work.
4. Experiments
We use the 9 corrected datasets generated above to carry out thorough experiments comparing 3 different learners: logistic regression, classification trees, and random forests (RF). We tune their hyperparameters with 100 iterations of a random search using 3-fold cross-validation (CV).
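The paper performs this tuning with mlr3 in R; purely as an illustration (and not the authors' setup), a comparable random search with 3-fold CV for the random forest could be written in Python as follows, with a hypothetical search space.

```python
# Rough Python analogue of the tuning setup (the paper uses mlr3 in R):
# 100 random-search iterations with 3-fold cross-validation for a random forest.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {              # hypothetical search space
    "n_estimators": randint(100, 1000),
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=100,                      # 100 iterations of random search
    cv=3,                            # 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
# search.fit(X_train, y_train)       # X_train, y_train from a corrected dataset
```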
We evaluate model performance with accuracy (ACC) and area under the curve (AUC). For
a description of performance differences in the subgroups of the PA Gender, we use three
confusion matrix-based metrics. (There has been some criticism of using these “classical fairness
metrics” due to a lack of philosophical justification [see 20] – we therefore use these rather as
descriptive tools than for proving or disproving fairness.) For assessing demographic parity,
we compute the difference of the ratios of positive predictions for males and females (DP [21]).
For assessing equalized odds, we compute the mean of the absolute values of (i) the difference
between the true positive rates for males and females and (ii) the difference of the true negative
rates for males and females (EO [22]). Similarly, we condition on the predicted classes and
compute the mean of the absolute differences between males and females regarding positive
and negative predictive values, aiming at conditional use accuracy equality (CUAE [23]).
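For concreteness, the three descriptive metrics can be computed from predictions and group membership as sketched below; this follows the verbal definitions above and is not the mlr3fairness implementation.

```python
# Illustrative computation of DP, EO and CUAE from predictions, following the
# verbal definitions above (not the mlr3fairness implementation). Assumes both
# classes and both Gender groups occur in the evaluated predictions.
import numpy as np

def _rates(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "pos_rate": (tp + fp) / len(y_true),  # share of positive predictions
        "tpr": tp / (tp + fn),                # true positive rate
        "tnr": tn / (tn + fp),                # true negative rate
        "ppv": tp / (tp + fp),                # positive predictive value
        "npv": tn / (tn + fn),                # negative predictive value
    }

def fairness_metrics(y_true, y_pred, group):
    """DP, EO, CUAE as (absolute) differences between the two Gender groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    r_m = _rates(y_true[group == "male"], y_pred[group == "male"])
    r_f = _rates(y_true[group == "female"], y_pred[group == "female"])
    dp = abs(r_m["pos_rate"] - r_f["pos_rate"])
    eo = 0.5 * (abs(r_m["tpr"] - r_f["tpr"]) + abs(r_m["tnr"] - r_f["tnr"]))
    cuae = 0.5 * (abs(r_m["ppv"] - r_f["ppv"]) + abs(r_m["npv"] - r_f["npv"]))
    return {"DP": dp, "EO": eo, "CUAE": cuae}
```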
The experiments have two settings: (1) A resampling study carries out – on each corrected
dataset – nested resampling with 3-fold CV as inner resampling and 100 iterations of subsampling
with a train-test ratio of 80/20 as outer resampling. The goal is to analyze the behavior of the
tuned learners on unseen test data from the same distribution, including estimates of the
standard deviation for statistical significance. (2) Since a generation method that generates data
that are far from the original data distribution but rather simple to classify would also obtain
good results in the resampling, we additionally test all methods on the same 20% test sample
of the original data. In this setting, the genAI models are trained on the remaining 80% of the
original data, and the learners are tuned with 3-fold CV on the resulting datasets (which are
hence smaller than in the benchmark study). We use the R packages mlr3 [24] for performing
experiments and mlr3fairness [25] for computing fairness metrics. Results are presented in
Table 1 and Table 2, respectively.
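The outer loop of setting (1) can be outlined as follows (a simplified Python schematic of the protocol, which the paper actually runs via mlr3); fit_tuned_learner is a hypothetical helper that performs the inner 3-fold CV tuning, and fairness_metrics refers to the sketch above.

```python
# Simplified outline of the outer resampling of setting (1): 100 subsampling
# iterations with an 80/20 train-test split, aggregating mean and standard
# deviation per metric. The paper runs this via mlr3; `fit_tuned_learner` is a
# hypothetical helper doing the inner 3-fold CV tuning, and `fairness_metrics`
# is the function from the earlier sketch.
import numpy as np
from sklearn.model_selection import train_test_split

def outer_resampling(X, y, group, fit_tuned_learner, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_iter):
        X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
            X, y, group, test_size=0.2, random_state=rng.randint(2**31 - 1))
        model = fit_tuned_learner(X_tr, y_tr)      # inner tuning happens here
        y_hat = model.predict(X_te)
        acc = float(np.mean(np.asarray(y_hat) == np.asarray(y_te)))
        scores.append({"ACC": acc, **fairness_metrics(y_te, y_hat, g_te)})
    return {k: (np.mean([s[k] for s in scores]), np.std([s[k] for s in scores]))
            for k in scores[0]}
```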
Resampling. Due to space constraints, we selected some of the 27 combinations of learner and correction method: since RF performed consistently best regarding mean ACC, we limited the presentation of the results to RF. For the generation methods, we excluded the mixture method, since its results lie between those of fake-only and mixture-full (mf). For the German credit data,
the generation methods outperform the results obtained by real-only (see Table 1), however,
most differences are not statistically significant. Between the generation methods, a clear
winner cannot be distinguished, since SMOTE leads to better predictive performances, whereas
CTGAN and GReaT have better values in the fairness metrics. In line with prior findings [26],
fake-only versions are outperformed by mixture-full versions. As one example, we included
GReaT-fake, which later (see Table 2) suffers from a drastic drop in performance when evaluated on original data (the same holds for CTGAN-fake). For the Berkeley data (fake versions omitted due to space constraints), the generation methods also perform better, but again, the differences are not statistically significant. Notably, SMOTE performs rather badly in the fairness metrics. In total, it
appears that balancing the number of males and females in the datasets tends to improve the
fairness metrics of the models learned on these data.
Table 1
Results of resampling. Standard deviations in parentheses. Best value per column and dataset in bold.
Data Correction ACC ↑ AUC ↑ DP ↓ EO ↓ CUAE ↓
German real-only 0.737 (0.038) 0.766 (0.039) 0.060 (0.049) 0.095 (0.057) 0.115 (0.067)
German ctgan_mf 0.739 (0.024) 0.702 (0.027) 0.043 (0.031) 0.075 (0.044) 0.095 (0.054)
German great_mf 0.747 (0.021) 0.777 (0.025) 0.056 (0.039) 0.074 (0.039) 0.085 (0.052)
German great_fake 0.772 (0.036) 0.799 (0.041) 0.099 (0.053) 0.116 (0.065) 0.155 (0.092)
German smote_mf 0.779 (0.024) 0.818 (0.025) 0.062 (0.040) 0.102 (0.047) 0.105 (0.056)
Berkeley real-only 0.653 (0.010) 0.628 (0.013) 0.177 (0.015) 0.179 (0.016) 0.079 (0.036)
Berkeley ctgan_mf 0.658 (0.008) 0.630 (0.010) 0.159 (0.017) 0.160 (0.020) 0.055 (0.020)
Berkeley great_mf 0.665 (0.007) 0.632 (0.009) 0.177 (0.009) 0.179 (0.010) 0.083 (0.028)
Berkeley smote_mf 0.615 (0.033) 0.670 (0.008) 0.398 (0.151) 0.395 (0.141) 0.107 (0.023)
Table 2
Results on 20% real test data. Best value per column in bold.
Data Correction ACC ↑ AUC ↑ DP ↓ EO ↓ CUAE ↓
German real-only 0.730 0.761 0.013 0.084 0.163
German ctgan_mf 0.760 0.771 0.036 0.054 0.134
German great_mf 0.760 0.766 0.072 0.045 0.048
German great_fake 0.695 0.732 0.026 0.033 0.133
German smote_mf 0.775 0.770 0.005 0.039 0.137
Evaluation on real test data. Table 2 summarizes the results on the 20% test sample of the
original data for the same combinations. Again, the generation methods have better values in
all metrics, though a clear winner cannot be identified. Consistently, SMOTE leads to a well-
performing model and is competitive in the fairness metrics. As mentioned above, GReaT-fake
suffers from a drop in predictive performance. For Berkeley, the results of the different methods
are comparable to those of the resampling (omitted in Table 2).
5. Discussion and Conclusion
The presented results indicate that generative methods can help to improve fairness metrics when facing data imbalance in the PA, although fake-only corrections do not seem to generalize well. There is, however, no clear sign that more complex, resource-intensive genAI methods like CTGAN and GReaT outperform more basic methods such as SMOTE, even though the distribution of their fake data is closer to the original data than that obtained with SMOTE. Further
experiments are necessary to investigate this: beyond experiments on other real-world datasets,
we plan to do thorough simulation studies with differing degrees of imbalance and complexity
of the DGP. Other generative models such as TabFairGAN [17], specialized in the generation of
fair tabular data, will also be studied in this context.
Acknowledgments
Computational resources have been provided by the Consortium des Équipements de Calcul
Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS)
under Grant No. 2.5020.11 and by the Walloon Region.
References
[1] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A Survey on Bias and Fairness
in Machine Learning, ACM Computing Surveys 54 (2021) 1–35. doi:10.1145/3457607.
[2] D. Danks, A. J. London, Algorithmic Bias in Autonomous Systems, in: Proceedings of
the Twenty-Sixth International Joint Conference on Artificial Intelligence, International
Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia, 2017, pp.
4691–4697. doi:10.24963/ijcai.2017/654.
[3] A. Roy, V. Iosifidis, E. Ntoutsi, Multi-fairness Under Class-Imbalance, in: P. Pascal, D. Ienco
(Eds.), Discovery Science, Lecture Notes in Computer Science, Springer Nature Switzerland,
Cham, 2022, pp. 286–301. doi:10.1007/978-3-031-18840-4_21.
[4] V. Iosifidis, B. Fetahu, E. Ntoutsi, FAE: A Fairness-Aware Ensemble Framework, in: 2019
IEEE International Conference on Big Data (Big Data), IEEE, Los Angeles, CA, USA, 2019,
pp. 1375–1380. doi:10.1109/BigData47090.2019.9006487.
[5] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, On the class overlap problem in imbalanced data classification, Knowledge-Based Systems 212 (2021) 106631. doi:10.1016/j.knosys.2020.106631.
[6] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with
imbalanced data: Empirical results and current trends on using data intrinsic characteristics,
Information Sciences 250 (2013) 113–141. doi:10.1016/j.ins.2013.07.007.
[7] N. Japkowicz, The class imbalance problem: Significance and strategies, in: Proceedings
of the International Conference on Artificial Intelligence, volume 56, 2000, pp. 111–117.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic Minority
Over-sampling Technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
doi:10.1613/jair.953.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020)
139–144. doi:10.1145/3422622.
[10] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling Tabular data
using Conditional GAN, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc,
E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32,
Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper_files/paper/2019/
file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf.
[11] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong,
Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R.
Wen, A Survey of Large Language Models, 2023. doi:10.48550/arXiv.2303.18223.
[12] V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, G. Kasneci, Language Models are Realistic
Tabular Data Generators, 2023. doi:10.48550/arXiv.2210.06280.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[14] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: NeurIPS EMC2 Workshop, 2019. doi:10.48550/arXiv.1910.01108.
[15] D. Xu, S. Yuan, L. Zhang, X. Wu, FairGAN: Fairness-aware Generative Adversarial Net-
works, in: 2018 IEEE International Conference on Big Data (Big Data), IEEE, Seattle, WA,
USA, 2018, pp. 570–575. doi:10.1109/BigData.2018.8622525.
[16] F. Friedrich, M. Brack, L. Struppek, D. Hintersdorf, P. Schramowski, S. Luccioni, K. Kersting, Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness, 2023. doi:10.48550/arXiv.2302.10893.
[17] A. Rajabi, O. O. Garibay, TabFairGAN: Fair Tabular Data Generation with Generative
Adversarial Networks, Machine Learning and Knowledge Extraction 4 (2022) 488–501.
doi:10.3390/make4020022.
[18] H. Hofmann, Statlog (German Credit Data), 1994. doi:10.24432/C5NC77.
[19] P. J. Bickel, E. A. Hammel, J. W. O’Connell, Sex Bias in Graduate Admissions: Data from Berkeley: Measuring bias is harder than is usually assumed, and the evidence is sometimes contrary to expectation, Science 187 (1975) 398–404. doi:10.1126/science.187.4175.398.
[20] L. Bothmann, K. Peters, B. Bischl, What Is Fairness? On the Role of Protected Attributes
and Fictitious Worlds, arXiv, 2024. doi:10.48550/arXiv.2205.09622.
[21] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness through awareness, in: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Association for Computing Machinery, New York, NY, USA, 2012, pp. 214–226. doi:10.1145/2090236.2090255.
[22] M. Hardt, E. Price, N. Srebro, Equality of Opportunity in Supervised Learning, in: Advances
in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016.
URL: https://papers.nips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.
html.
[23] R. Berk, H. Heidari, S. Jabbari, M. Kearns, A. Roth, Fairness in Criminal Justice Risk
Assessments: The State of the Art, Sociological Methods & Research 50 (2021) 3–44.
doi:10.1177/0049124118782533.
[24] M. Lang, M. Binder, J. Richter, P. Schratz, F. Pfisterer, S. Coors, Q. Au, G. Casalicchio,
L. Kotthoff, B. Bischl, mlr3: A modern object-oriented machine learning framework in R,
Journal of Open Source Software (2019). doi:10.21105/joss.01903.
[25] F. Pfisterer, W. Siyi, M. Lang, mlr3fairness: Fairness Auditing and Debiasing for ’mlr3’,
2024. URL: https://mlr3fairness.mlr-org.com.
[26] D. Manousakas, S. Aydöre, On the Usefulness of Synthetic Tabular Data Generation, 2023.
doi:10.48550/arXiv.2306.15636.