End-to-End Bias Mitigation in Candidate Recommender Systems with Fairness Gates

Adam Mehdi Arafan1,†, David Graus2, Fernando P. Santos3 and Emma Beauxis-Aussalet4

1 University of Amsterdam, Amsterdam, The Netherlands
2 Randstad Groep Nederland, Diemen, The Netherlands
3 University of Amsterdam, Amsterdam, The Netherlands
4 Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

Abstract
Recommender Systems (RS) have proven successful in a wide variety of domains, and the human resources (HR) domain is no exception. RS have proved valuable for recommending candidates for a position, although the ethical implications have recently been identified as high-risk by the European Commission. In this study, we apply RS to match candidates with job requests. The RS pipeline includes two fairness gates at two different steps: pre-processing (using GAN-based synthetic candidate generation) and post-processing (with greedily searched candidate re-ranking). While prior research studied fairness at the pre- and post-processing steps separately, our approach combines them both in the same pipeline applicable to the HR domain. We show that the combination of gender-balanced synthetic training data with pair re-ranking increased fairness with satisfactory levels of ranking utility. Our findings show that using only the gender-balanced synthetic data for bias mitigation is fairer by a negligible margin when compared to using real data. However, when implemented together with the pair re-ranker, candidate recommendation fairness improved considerably, while maintaining a satisfactory utility score. In contrast, using only the pair re-ranker achieved a similar fairness level, but had a consistently lower utility.

Keywords
Fair Artificial Intelligence, Generative Modelling, Information Retrieval, Recommender Systems

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
† Work done while on internship at Randstad Groep Nederland.
adammehdiarafan@gmail.com (A. M. Arafan); david.graus@randstadgroep.nl (D. Graus); f.p.santos@uva.nl (F. P. Santos); e.m.a.l.beauxisaussalet@vu.nl (E. Beauxis-Aussalet)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Machine learning (ML) applications have proven to be useful in many domains over recent years. However, despite the many benefits of ML-enabled tools, biases can occur and be amplified through the highly scalable nature of ML-enabled systems. Algorithms used in applications such as recidivism prediction, predictive policing, or facial recognition have revealed bias towards race, gender, or both [1, 2]. These biases can also be expressed through proxy (unobservable) correlations with sensitive attributes such as gender, and through poorly defined decision boundaries [3, 4].

We focus on fairness issues in candidate recommender systems (CRS). The goal of such a system is to recommend the best candidates for a specific job, often computing ranked lists of candidates in descending order of relevance. A variety of fairness issues may arise from the large and diverse pools of candidates and job offers.

In the case of the HR industry, bias in recommendations comes with a high risk of harm, as candidates can perpetually face discrimination in finding employment. The risk of harm is especially great considering the scalable nature of recommender systems. Here we focus on a CRS that supports a recruiter in finding the best matching candidates for a client job request (e.g., a factory requesting 20 technicians).

As most ML algorithms perform predictions in a discriminative fashion using historical data, it is not trivial to guarantee that discrimination is not (unfairly) influenced by proxies that might be correlated with protected characteristics. The fairness-in-ML problem has been approached by many researchers, such as Rajabi and Garibay [5], who tackled the problem by synthesizing data, Li et al. [6], who constrained recommendations, and Geyik et al. [7], who re-ranked recommendations.
These researchers produced state-of-the-art (SOTA) algorithms tackling specific fairness techniques, from which we distinguish two: pre-processing (enforcing fairness at the data level) and post-processing (enforcing fairness after predictions were made).

These two approaches have traditionally been researched separately in the RS and fairness literature, ignoring potential synergistic effects of applying fairness mechanisms at different stages of the ML pipeline. To the best of our knowledge, we found no prior work experimenting with more than one processing technique in a single pipeline. We aim to close this gap by testing SOTA bias mitigation methods in both pre- and post-processing, and observing the impact on the fairness of candidate ranking. We propose a pipeline for a CRS that integrates two bias mitigation mechanisms (called Fairness Gates, FG) at the pre- and post-processing steps. By FG, we refer to the enforcement of bias mitigation techniques within the pipeline. The FGs are a synthetic data generator and a greedy re-ranker.

The synthetic data generator enforces gender balance in the sampling size, while the greedy re-ranker optimizes for both utility (the quality or usefulness of candidate recommendations) and gender balance in candidate ranking. In this paper, we explore the fairness-utility trade-offs among re-ranked CRS outputs trained using synthetic data or only real data. Therefore, we focus on exploring the impacts and trade-offs between utility and fairness that arise from combining synthetic data generation at the pre-processing step with greedy pair re-ranking at the post-processing step.

Our experimental results show that the best compromise between fairness and utility is achieved when combining the two FGs rather than using just one.

2. Background and Related Work

Before presenting the experiments conducted within our novel candidate recommendation pipeline, essential terminology needs to be defined alongside the state of the art in the (sub)tasks at hand. More specifically, we first introduce synthetic candidate generation, which serves as our first FG, before introducing fairness and specifying the relevant techniques used in the CRS pipeline. Finally, we conclude with the research gap and a summary of how the discussed techniques fit in our CRS.

2.1. Data Synthesis

Originally proposed by Rubin in 1993, synthetic data was initially intended to overcome confidentiality concerns in surveys [8]. Although confidentiality issues have become more important with new, stricter European regulations such as the General Data Protection Regulation (GDPR), current applications of synthetic data have also shown their strength in generating fair and private synthetic data. In fact, synthetic data applications extend far beyond survey data synthesis; use cases range from missing data imputation and data augmentation in semi-supervised learning to media applications such as image-to-image translation and image super-resolution [9].

Data synthesis has evolved from Bayesian bootstrapping methods and predictive posterior distributions to deeper techniques such as Autoencoders (AE), Variational Autoencoders (VAEs), autoregressive models, Boltzmann machines, deep belief networks, and generative adversarial networks (GANs) after the advent of deep learning [10]. These deeper models, more specifically GANs, afforded the synthesis of more complex unstructured data such as images and videos. In this project, GANs are used to generate tabular (structured) synthetic candidate data.

Despite their popularity, GANs are mainly used for unstructured data synthesis tasks such as image and video synthesis. The generation of synthetic tabular data such as job candidates is uncommon not only from a domain perspective but also from a technical perspective. This is caused by the difficulty of learning discrete features with potentially imbalanced classes, a challenge for which Xu et al. [11] found a solution by integrating a Gumbel-Softmax (GS) activation function in their CTGAN. The GS is based on the Gumbel-Max trick, a common method for discrete approximation [12].

With the ability to generate categorical features, other issues can still hinder the tabular candidate synthesis process. Issues such as input datasets with mixed distributions (as is the case for our input data) can severely affect generative performance. For these problems, Xu et al. propose two solutions: mode-specific normalization for continuous columns and conditional sampling to enforce class balancing, both addressing known problems in generative modelling. Therefore, CTGAN is an ideal generator for the task at hand, as it can balance imbalanced datasets and handle mixtures of data types. Before outlining the fairness-related work, we relate CTGAN to our CRS pipeline and discuss its contribution to both the academic and the domain gap.

Candidate synthesis is uncommon: although fairness research has shown successful use of tabular GANs to generate fair data [5], and more domain-relevant research has shown the use of Gaussian copulas for synthetic candidate generation [13], considerations of CTGANs to support downstream tasks are rare if not unavailable. In the synthetic candidate generation domain, van Els et al. [13] is the only example in a comparably high-risk-of-harm task. Therefore, the use of GANs, more specifically CTGANs, to generate candidates will greatly improve the fairness of our CRS pipeline.

In fact, as outlined by Xu et al., conditional sampling allows us to synthesize balanced training data with ease, which can be used downstream as a fair, balanced basis to train candidate-scoring algorithms and mitigate bias. The use of conditional sampling alongside rejection sampling (introduced in the methodology section) is how we link candidate synthesis with fairness and ultimately bias mitigation in our end-to-end CRS pipeline. Therefore, the use of CTGANs is novel in the candidate recommendation domain. With the synthetic pre-processing techniques outlined, we now provide an outline of the fairness literature, focusing more specifically on post-processing methods.
2.2. Fairness

With the relevant background and related work on candidate synthesis introduced, we now proceed further down our CRS pipeline towards the second FG, which mitigates bias at the post-processing level, i.e., after the models are trained on synthetic data to score real candidates. The scored candidates are then evaluated according to a relevant fairness metric and re-ranked using a relevant post-processing technique.

Currently, multiple fairness metrics exist, each with their respective strengths and weaknesses. In our case, we only consider demographic parity, which was defined by Kusner et al. [14] as:

• Demographic Parity: "A predictor Ŷ satisfies demographic parity if P(Ŷ | A = 0) = P(Ŷ | A = 1)", for A representing a sensitive attribute with a levels.

Many other fairness techniques exist, notably the removal of any sensitive attributes. We stress that simply removing sensitive attributes is not guaranteed to remove bias. This process of simply removing protected attributes is known as fairness through unawareness and was shown to perpetuate unfairness [14]. In fact, in our CRS pipeline, we use the opposite logic to achieve fairness through awareness, by explicitly using gender to re-rank candidates in the post-processing step.
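To make the criterion concrete, demographic parity can be checked empirically by comparing positive prediction rates across the levels of the sensitive attribute. The following is a minimal sketch of such a check (not part of the paper's pipeline); the column names y_hat and gender are illustrative assumptions.

```python
import pandas as pd

def demographic_parity_gap(df: pd.DataFrame,
                           pred_col: str = "y_hat",
                           attr_col: str = "gender") -> float:
    """Absolute difference in positive prediction rates between the levels
    of the sensitive attribute (0 indicates demographic parity)."""
    rates = df.groupby(attr_col)[pred_col].mean()  # P(Y_hat = 1 | A = a)
    return abs(rates.max() - rates.min())

# Illustrative usage with hypothetical predictions:
preds = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "male", "female"],
    "y_hat":  [1, 1, 0, 1, 0, 1],
})
print(demographic_parity_gap(preds))  # 0.0 indicates parity in this toy example
```

A gap of 0 indicates demographic parity; in practice a small tolerance is usually allowed.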
2.2.1. Fairness in Rankings

While demographic parity is useful for quantifying fairness, the enforcement of such rules has yet to be defined. Fairness can be enforced either through a data cleaning process checking for class imbalances and the existence of sensitive (proxy) variables (pre-processing), or by modifying model output post-training with approaches such as re-ranking (post-processing) [7]. Although we consider the two approaches in this project, the evaluation of our model follows the SOTA post-processing techniques presented below.

For our CRS pipeline we use Geyik et al.'s approach, considering it is already used in the HR domain (the task at hand was the recommendation of candidates on LinkedIn). Additionally, Geyik et al. achieved SOTA performance with more than a 4-fold reduction in unfairness and a reduction in utility of only 6%. From a research gap perspective, candidate re-ranking is widely used in industry and researched in the Information Retrieval literature. However, despite not being novel in this sub-task, our CRS pipeline fills the research gap by performing the re-ranking of candidates on synthetically trained scoring models.

This is where our end-to-end CRS pipeline contributes to both the domain and the relevant literature, by testing how candidate synthesis for scoring-model training combines with re-ranking methods for a better end-to-end bias mitigation process. This combination is novel in both the HR domain and the literature on fairness and generative modelling.

2.3. Summary and Research Gap

The above mini-literature review outlined the different key areas of (candidate) synthesis and fairness processing techniques. As shown, the combination of multiple processing techniques within one CRS pipeline has never been attempted. Therefore, our pipeline is presented as a combination of the presented related work, and it will be evaluated based on the output of the candidate rankings. For the evaluation, we will not be comparing our CRS pipeline's CTGAN to Xu et al., nor will we be comparing our re-ranker to Geyik et al., as we are using drastically different datasets. Instead, we develop our own evaluation framework for the candidate data at hand, which we outline in section 3.

The goal of this section was to provide a high-level overview of the literature and techniques used, while exposing the academic gap where our pipeline resides. In the following section, we use the provided background to introduce our experiments in in-depth technical detail and apply the SOTA related work to the candidate recommendation problem with our novel CRS pipeline.

3. Methodology

Our CRS follows a point-wise learning to rank approach, where for a given job j we fetch and rank candidates i, much like the goal of ranking documents given a query in the traditional document retrieval scenario. In other words, our recommender system predicts relevance scores ŷ_{i,j} given the candidate and job features X_{i,j}.

We use real data from an international HR company. For training purposes, the candidate features X_i are associated with a ground truth label y_{i,j}, where y_{i,j} = 1 if the candidate i has been recruited or shortlisted for a job j, and 0 otherwise.

The data used for training is of a structured nature, spanning real-valued, categorical, and binary features. Features correspond to candidate features (e.g., job seekers' preferences such as minimum salary, preferred working hours, or maximum travel distance, in addition to data related to their work experience or level of education), job features (e.g., industry of the company, company size, geographical location), and finally candidate-job features that represent their overlap (e.g., geographical distance between candidate and job, or a binary feature indicating whether the candidate has worked in the job's industry before), much in the same vein that query, document, and query-document features are designed in a traditional learning-to-rank scenario for information retrieval.
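The paper does not specify which model produces the relevance scores ŷ_{i,j}, so the sketch below only illustrates the point-wise setup under assumed feature and column names: a binary classifier is trained on <candidate, job> feature rows with the recruited/shortlisted label as target, and the candidates fetched for a job are then ranked by predicted probability.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# train_df holds one row per <candidate, job> pair with a 0/1 label
# 'recruited'; feature_cols mixes candidate, job, and candidate-job features.
feature_cols = ["min_salary", "travel_distance_km", "company_size",
                "worked_in_industry_before"]  # illustrative names

def train_pointwise_scorer(train_df: pd.DataFrame) -> GradientBoostingClassifier:
    model = GradientBoostingClassifier()
    model.fit(train_df[feature_cols], train_df["recruited"])
    return model

def rank_candidates_for_job(model, job_candidates: pd.DataFrame) -> pd.DataFrame:
    """Score all candidates fetched for one job j and sort by relevance."""
    scored = job_candidates.copy()
    scored["y_hat"] = model.predict_proba(scored[feature_cols])[:, 1]
    return scored.sort_values("y_hat", ascending=False)
```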
3.1. Gender balance and synthetic data

Imbalanced data is very common in CRSs, and we focus on gender imbalance for our case, which is common in the job market. To effectively study the issue of imbalance, we construct various explicitly (im)balanced scenarios through a rejection sampling algorithm based on John von Neumann's technique [15]. We first sampled re-balanced subsets of the original training data, considering gender as the sensitive attribute a. We only considered 2 genders (female, male), as unfortunately our dataset does not contain enough samples of non-binary genders.

To construct our (im)balanced subsets, we randomly sampled job candidates from each job request j with a constrained proportion of candidates from each gender. We generated two datasets with heavy imbalance (one with 20% of female candidates, one with 20% of males), two datasets with minor imbalance (one with 45% of female candidates, one with 45% of males), and a balanced dataset (with 50% of male and female candidates). For each training dataset, 10% of the data points were kept as a held-out test set. To avoid data leakage, all job requests j were unique to the test set. The test dataset sizes in number of unique <j, i>-pairs after rejection sampling are shown in Table 1.

Table 1: Test set sizes after rejection sampling.

    Test Data                        Sample Size
    heavy imbalance (20% males)      38 701
    heavy imbalance (20% females)    40 975
    minor imbalance (45% males)      48 195
    minor imbalance (45% females)    41 972
    balanced                         48 178
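As a rough illustration of the subset construction described above (a simplified stand-in for the authors' rejection sampling, with assumed column names job_id and gender), the sketch below keeps, per job request, the largest random subset whose gender proportions match a target share and rejects the surplus rows.

```python
import pandas as pd

def sample_with_gender_ratio(df: pd.DataFrame, female_share: float,
                             seed: int = 42) -> pd.DataFrame:
    """Per job request, keep a random subset whose share of female
    candidates approximates `female_share`; surplus rows are rejected."""
    kept = []
    for _, group in df.groupby("job_id"):
        females = group[group["gender"] == "female"].sample(frac=1, random_state=seed)
        males = group[group["gender"] == "male"].sample(frac=1, random_state=seed)
        # Largest subset size achievable under the target proportion.
        n = min(int(len(females) / female_share) if female_share > 0 else len(males),
                int(len(males) / (1 - female_share)) if female_share < 1 else len(females))
        n_female = round(n * female_share)
        kept.append(females.head(n_female))
        kept.append(males.head(n - n_female))
    return pd.concat(kept, ignore_index=True)

# e.g., a heavily imbalanced subset with 20% female candidates:
# heavy_imbalance = sample_with_gender_ratio(train_df, female_share=0.20)
```

Calling it with female shares of 0.20, 0.45, 0.50 (and the mirrored shares 0.80 and 0.55) would yield the five (im)balance scenarios summarised in Table 1.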
We trained 5 synthetic data models, using each re-balanced dataset as training data for the CTGAN algorithm [11]. We were able to generate balanced synthetic data using the models' conditional sampling parameters. We generated balanced synthetic data where each gender represents 50% of the dataset, for both positive (y_{i,j} = 1) and negative (y_{i,j} = 0) examples.

The synthetic data generation is our first fairness gate (FG) in the CRS pipeline. This FG aims to improve the fairness of candidate scoring ŷ_{i,j} by training the CRS on balanced data. The full overview of the experimental pipeline is shown in Figure 1.
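A minimal sketch of this first fairness gate, assuming the open-source ctgan package that accompanies Xu et al. [11] and its fit(data, discrete_columns) and sample(n, condition_column, condition_value) interface; the column names, epoch count, and sample sizes are illustrative assumptions rather than the paper's settings.

```python
import pandas as pd
from ctgan import CTGAN

def train_balanced_generator(rebalanced_train_df: pd.DataFrame,
                             n_per_gender: int = 25_000) -> pd.DataFrame:
    """Fit one CTGAN on a re-balanced training subset (Section 3.1) and draw a
    gender-balanced synthetic training set via conditional sampling."""
    discrete_columns = ["gender", "education_level", "recruited"]  # assumed names
    generator = CTGAN(epochs=300)
    generator.fit(rebalanced_train_df, discrete_columns)
    return pd.concat([
        generator.sample(n_per_gender, condition_column="gender", condition_value="female"),
        generator.sample(n_per_gender, condition_column="gender", condition_value="male"),
    ], ignore_index=True)
```

Balancing the label y_{i,j} as well, as described above, could be handled analogously, e.g., by conditioning on a combined gender-label column or sampling each stratum separately.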
3.2. Candidate scoring and re-ranking

We trained CRS models to score candidates i by estimating their relevance score ŷ_{i,j} for the jobs j. We trained a total of 10 CRS models, using real or synthetic job candidates as training data (5 datasets each, respectively). The jobs for which candidates are scored remain those of the real data, more specifically, the real holdout test data.

We tested the CRS models with their respective holdout test sets, comprising real data with the same gender balance. For each test set, we scored candidates using either the CRS trained with synthetic data or with real data (of the same gender balance), i.e., we use 2 CRS models per each of the 5 test sets, and thus obtain a total of 10 sets of scores. After scoring candidates, we rank candidates by descending order of relevance scores, and obtain 10 sets of rankings.

After the candidates are scored and ranked, we introduce our second Fairness Gate (FG) at the post-processing level of the CRS pipeline. This FG aims to improve the fairness of candidate ranking by using a re-ranking algorithm that interleaves males and females equally at the top ranks (e.g., Figure 2). For our experimental CRS pipeline, we reused the re-ranking algorithm from Geyik et al. [7], and obtained 10 sets of re-rankings (Figure 1).
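The sketch below is a simplified greedy re-ranker in the spirit of Geyik et al. [7], not their exact implementation: at each rank it places a candidate from whichever gender is currently furthest below its target share (ties broken by score), which with equal targets produces the interleaving at the top ranks described above.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (candidate_id, gender, score)

def greedy_rerank(ranked: List[Candidate],
                  target: Dict[str, float] = {"female": 0.5, "male": 0.5}) -> List[Candidate]:
    """ranked: candidates sorted by score descending.
    Rebuilds the list so every top-k prefix tracks the target gender shares."""
    remaining = {g: [c for c in ranked if c[1] == g] for g in target}
    placed = {g: 0 for g in target}
    output = []
    for k in range(1, len(ranked) + 1):
        available = [g for g in target if remaining[g]]
        # Largest deficit w.r.t. the target share in the top-k; ties go to the
        # group whose best remaining candidate has the higher score.
        g_pick = max(available,
                     key=lambda g: (target[g] * k - placed[g], remaining[g][0][2]))
        output.append(remaining[g_pick].pop(0))
        placed[g_pick] += 1
    return output

# Example: a male-dominated top of the list becomes interleaved.
ranking = [("c1", "male", 0.95), ("c2", "male", 0.91), ("c3", "male", 0.88),
           ("c4", "female", 0.80), ("c5", "female", 0.77)]
print([c[0] for c in greedy_rerank(ranking)])  # ['c1', 'c4', 'c2', 'c5', 'c3']
```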
3.3. Metrics and Evaluation

The impact of the re-ranking is evaluated in terms of utility using Normalised Discounted Cumulative Gain (NDCG), a common ranking metric to maximise [16]. To measure the impact of the re-ranking, we compared the NDCG scores before re-ranking (by considering the initial ranking as the ideal ranking) and after re-ranking. A lower NDCG score means re-ranking had a negative impact on the original rankings; a higher NDCG score means re-ranking had less impact. As we are considering the impact of the re-ranking, the NDCG score was calculated after re-ranking, hence the appearance of only one score. Therefore, we used the NDCG as a single impact metric. The original predicted ranks were used as ground truth (ideal ranking), against which the re-ranked candidates were measured. To ensure the ideal ranks are valid, we used common classification metrics such as F1 and AUC.

In terms of fairness, we used NDKL (normalized discounted cumulative Kullback-Leibler divergence), a distance metric comparing distribution dissimilarity, such as rank distributions [7]. Here, NDKL calculates the dissimilarity between the distributions of males and females, especially at the top ranks. We consider that demographic parity is achieved when the rank distributions of males and females are similar (i.e., NDKL = 0).
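For reference, Geyik et al. [7] define the NDKL of a ranked list as a position-discounted average of the KL divergence between the attribute distribution of each top-i prefix and a desired distribution. A minimal sketch of that computation, assuming a list of gender labels in rank order and a small smoothing constant to keep the divergence finite:

```python
import math
from collections import Counter

def ndkl(genders_by_rank, desired, eps=1e-12):
    """NDKL of a ranked list, following Geyik et al. [7]:
    (1/Z) * sum_i [ 1/log2(i+1) * KL(top-i gender distribution || desired) ]."""
    total, z = 0.0, 0.0
    counts = Counter()
    for i, g in enumerate(genders_by_rank, start=1):
        counts[g] += 1
        kl = sum((counts[a] / i) * math.log(((counts[a] / i) + eps) / (desired[a] + eps))
                 for a in desired)
        total += kl / math.log2(i + 1)
        z += 1 / math.log2(i + 1)
    return total / z

# A list that only surfaces female candidates after rank 3, versus a 50/50 target:
print(ndkl(["male", "male", "male", "female", "female"],
           {"female": 0.5, "male": 0.5}))  # > 0: top ranks deviate from parity
```

Using a 50/50 target as the desired distribution matches the demographic-parity reading above: NDKL = 0 when male and female candidates are distributed similarly across ranks.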
Figure 1: Experimental CRS pipeline including bias mitigation techniques at pre-processing and post-processing steps.

4. Results and Analysis

We present the results of the CRS that include one, two, or none of our Fairness Gates (FG): re-balancing the training set with synthetic data (1st FG), and re-ranking the job candidates (2nd FG). We consider 3 levels of data imbalance, and summarise the NDCG and NDKL for each level in Table 2.

The NDCG difference is noticeable between CRS models trained with real or synthetic datasets (i.e., between pairs of rows in Table 2). For the heavy imbalance case, the increase in utility is almost two-fold (+45%). The NDKL difference is very small between CRS models trained with real or synthetic datasets, and shows a negligible improvement of fairness. These results show that using balanced synthetic data to train CRS models (1st FG) considerably improved utility (NDCG) while maintaining the same level of fairness (NDKL).

The NDKL decreases from before to after re-ranking (i.e., the last two columns in Table 2), showing that the rank distributions of male and female candidates are more similar after re-ranking. The decrease is of similar magnitude for each level of data imbalance, i.e., whether the CRS model is trained with real or synthetic data. These results show that using re-ranking at post-processing (2nd FG) equally improved fairness (NDKL) whether or not synthetic data was used to train the CRS models (1st FG).

We also explored the score distributions for male and female candidates. Those attributed by CRS models trained with real data are unevenly skewed toward the left, even in cases where the real data is balanced (balanced dataset). However, for CRS models trained with synthetic data, the score distributions of both genders shift more to the right, creating a more normally-shaped score distribution across both studied genders.

Figure 2: Plot displaying the rankings of the top 10 candidates before re-ranking and after re-ranking. The ranks of the candidates are on the x-axis. Female candidates are blue bars, and male candidates are orange bars. Ranking A is from a CRS trained on heavily imbalanced data, and A1 represents the re-ranked candidates from A. Similarly, B and B1 are the initial and re-ranked rankings for a CRS trained on the balanced dataset.

Table 2: Average NDCG and NDKL for ranked lists obtained at each level of data imbalance, using CRS trained with real or synthetic data (1st FG), with or without re-ranking (2nd FG).

    Ranked Lists                                       NDCG           NDKL before    NDKL after
                                                                      re-ranking     re-ranking
    Heavy imbalance: CRS trained w. real data          0.384          0.366          0.200
    Heavy imbalance: CRS trained w. synthetic data     0.693 (+45%)   0.358          0.197
    Minor imbalance: CRS trained w. real data          0.403          0.217          0.126
    Minor imbalance: CRS trained w. synthetic data     0.647 (+38%)   0.213          0.126
    No imbalance: CRS trained w. real data             0.403          0.213          0.124
    No imbalance: CRS trained w. synthetic data        0.633 (+36%)   0.206          0.124

Figure 3: Score distribution for male and female candidates. The score assigned to the candidates is on the x-axis; female candidates are in blue while male candidates are in orange. A represents a CRS model trained with heavily imbalanced real data, and A1 a CRS trained with synthetic data learned from a generator trained on heavily imbalanced data. B and B1 are the corresponding models for the balanced dataset.

5. Discussion

Despite the promising results shown in section 4, our CRS pipeline has shown some pitfalls. More specifically, the computation of NDCG using the ranked candidates as ground truth, and only evaluating the re-ranked performance, can come with additional validity issues. However, it should be noted that these validity issues can be easily averted by adding another NDCG calculation evaluating also the non-re-ranked candidates against a ground truth constructed from another holdout set, for example.

Additionally, supplementary validation methods could have been considered. For instance, it could have been beneficial to use future job requests j, not included in the data, in further evaluations. Statistical tests could also have been conducted, while other user-based approaches, such as an evaluation with recruiters, could have contributed to reinforcing the validity of this project. These extra validation steps should be implemented before deploying the fairness mechanisms proposed here.

Furthermore, some findings were unexplainable with the current analysis. For instance, the NDKL scores for CRSs trained on real minor-imbalance datasets are lower than those trained on real balanced datasets, which also applies after re-ranking. Although the scores vary by a small margin, such behaviour is difficult to explain considering the complexity of our pipeline, rendering de-bugging tasks equally complex.

Additional unexplainable results are also visible in the synthetic-to-real comparison, with CRSs trained on some synthetic datasets, such as heavy imbalance, showing more unfairness by a small margin when compared to their real-trained counterparts. These unexplainable findings between real and synthetic subsets are even more puzzling considering that Figure 3 shows more balanced scoring for all synthetically-trained CRSs, which should result in a lower NDKL score before re-ranking.

Finally, the implementation of demographic parity to enforce equal proportions between genders oversimplifies the complexity of the candidate hiring landscape. This oversimplification can be resolved in future research with a lesser degree of generalizability. Future research can be more specific by adjusting fairness rules to the domain of the job request j. For instance, certain jobs, such as security personnel, can show real-world skewness towards a certain gender. A future CRS pipeline needs to adjust its fairness rules at the j level.

Despite these limitations and suggestions for future work, overall, our research successfully showed that the combination of synthetic data and re-ranking contributed to both fairness and utility, even when compared to CRSs trained on real balanced data such as the balanced dataset. Therefore, as expected, a combination of pre-processing and post-processing FGs proved to be useful.
6. Conclusion

The goal of our CRS pipeline was never to produce SOTA synthetic candidates and recommendations, despite our satisfactory results. The goal was to build a recommendation pipeline using both real and synthetic data, to be able to experiment with fair processing techniques and, as a result, mitigate bias in candidate recommendations. From this perspective, the double fair-gated CRS pipeline was successfully built, and the generation of synthetic candidates was successful, valid and accurate throughout the pipeline.

The generated data has shown to be accurate at all (im)balance levels, validating the expectations on mode-specific normalization and conditional sampling in CTGANs, while also demonstrating the benefits of rejection sampling methods in re-balancing imbalanced data and of using the synthetic candidates generated from it to score real (im)balanced test subsets fairly. From a fairness perspective, it was also shown how scorers trained on synthetic candidates outperform scorers trained on balanced real data from a utilitarian perspective.

Although the issues outlined in section 5 concerning the lack of measurement of pre-re-ranked utility raise some minor validity concerns, the evidence shows how synthetically-trained CRSs provide fair, useful candidate recommendations when integrated in such a pipeline.

7. Future Work

In future work, the recommendations shared in the discussion can be considered, more specifically the use of additional evaluation methods with human-in-the-loop evaluation using recruiters, or the use of future requests to test the CRS pipeline.

Additionally, future researchers should also consider the use of less data-greedy rejection sampling techniques, as we lost more than 80% of the holdout information we had at the start of the pipeline. This can be resolved with more elegant rejection sampling constraints, the use of larger datasets, or data-augmentation techniques through synthetic data, for instance. The latter could have been considered in this project if it had been within the scope of our research.

Finally, with the data scarcity problem solved, future researchers can consider the discussed domain-adjustable fairness rules for more specific fairness constraints to overcome real-world skewness.

8. Acknowledgements

We acknowledge the University of Amsterdam - Master programme Information Studies for creating the conditions to perform this research and for financially supporting this publication.

References

[1] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163. URL: https://doi.org/10.1089/big.2016.0047. doi:10.1089/big.2016.0047. PMID: 28632438.
[2] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: S. A. Friedler, C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 77–91. URL: https://proceedings.mlr.press/v81/buolamwini18a.html.
[3] S. Hajian, J. Domingo-Ferrer, A methodology for direct and indirect discrimination prevention in data mining, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 1445–1459. doi:10.1109/TKDE.2012.72.
[4] A. Prince, D. Schwarcz, Proxy discrimination in the age of artificial intelligence and big data, Iowa Law Review 105 (2020) 1257–1318.
[5] A. Rajabi, O. O. Garibay, TabFairGAN: Fair tabular data generation with generative adversarial networks, arXiv preprint arXiv:2109.00666 (2021).
[6] Y. Li, H. Chen, S. Xu, Y. Ge, Y. Zhang, Towards personalized fairness based on causal notion, CoRR abs/2105.09829 (2021). URL: https://arxiv.org/abs/2105.09829. arXiv:2105.09829.
[7] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search, 2019. URL: https://doi.org/10.1145/3292500.3330691. doi:10.1145/3292500.3330691.
[8] D. B. Rubin, Discussion: Statistical disclosure limitation, Journal of Official Statistics 9 (1993) 461–468.
[9] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, 2017. URL: https://arxiv.org/abs/1701.00160. doi:10.48550/ARXIV.1701.00160.
[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 1st ed., MIT Press, Cambridge, Massachusetts, United States, 2016.
[11] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional GAN, 2019. URL: https://arxiv.org/abs/1907.00503. doi:10.48550/ARXIV.1907.00503.
[12] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, 2016. URL: https://arxiv.org/abs/1611.01144. doi:10.48550/ARXIV.1611.01144.
[13] S.-J. van Els, D. Graus, E. Beauxis-Aussalet, Improving fairness assessments with synthetic data: a practical use case with a recommender system for human resources, 2022.
[14] M. J. Kusner, J. R. Loftus, C. Russell, R. Silva, Counterfactual fairness, 2018. arXiv:1703.06856.
[15] J. von Neumann, Various techniques used in connection with random digits, National Bureau of Standards, Applied Math Series 12 (1951) 768–770.
[16] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446.