Mitigation in Candidate Recommender Systems with Fairness Gates

Adam Mehdi Arafan (1), adammehdiarafan@gmail.com

David Graus (0), david.graus@randstadgroep.nl

Fernando P. Santos (1), f.p.santos@uva.nl

Emma Beauxis-Aussalet (2)

(0) Randstad Groep Nederland, Diemen, The Netherlands

(1) University of Amsterdam, Amsterdam, The Netherlands

(2) Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

September 18–23, 2022

Recommender Systems (RS) have proven successful in a wide variety of domains, and the human resources (HR) domain is no exception. RS proved valuable for recommending candidates for a position, although the ethical implications have recently been identified as high-risk by the European Commission. In this study, we apply RS to match candidates with job requests. The RS pipeline includes two fairness gates at two different steps: pre-processing (using GAN-based synthetic candidate generation) and post-processing (with greedily searched candidate re-ranking). While prior research studied fairness at pre- and post-processing steps separately, our approach combines them both in the same pipeline applicable to the HR domain. We show that the combination of gender-balanced synthetic training data with pair re-ranking increased fairness with satisfactory levels of ranking utility. Our findings show that using only the gender-balanced synthetic data for bias mitigation is fairer by a negligible margin when compared to using real data. However, when implemented together with the pair re-ranker, candidate recommendation fairness improved considerably, while maintaining a satisfactory utility score. In contrast, using only the pair re-ranker achieved a similar fairness level, but had a consistently lower utility.

1. Introduction

Despite the many benefits of ML-enabled tools, biases can occur and be amplified through the highly scalable nature of ML-enabled systems. Algorithms used in applications such as recidivism prediction, predictive policing, or facial recognition have revealed bias towards either race, gender or both [1, 2]. These biases can also be expressed through proxy (unobservable) correlations expressed via sensitive attributes such as gender and poorly defined decision boundaries [3, 4].

[...] candidate recommender systems (CRS). The goal of such a system is to recommend the best candidates for a specific job, often computing ranked lists of candidates in descending order of relevance. A variety of fairness issues may arise from the large and diverse pools of candidates and job offers.

In the case of the HR industry, bias in recommendations comes with a high risk of harm as candidates can

RecSys in HR’22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems.

†Work done while on internship at Randstad Groep Nederland.

[...] pipeline. We aim to close this gap by testing SOTA bias mitigation methods in both pre- and post-processing, and observing the impact on the fairness of candidate ranking. We propose a pipeline for a CRS that integrates two bias mitigation mechanisms (called Fairness Gates, FG) at the pre- and post-processing steps. By FG, we refer to the enforcement of bias mitigation techniques within the pipeline. The FGs are a synthetic data generator and a greedy re-ranker.

The synthetic data generator enforces gender balance in the sampling size while the greedy re-ranker optimizes for both utility (the quality or usefulness of candidate recommendations) and gender balance in candidate ranking. In this paper, we explore the fairness-utility trade-offs among re-ranked CRS outputs trained using synthetic data or only real data. Therefore, we focus on exploring the impacts and trade-offs between utility and fairness that arise from combining synthetic data generation at pre-processing and greedy pair re-ranking at a post-processing level.

Our experimental results show that the best compromise between fairness and utility is achieved when combining the two FGs rather than using just one.

2. Background and Related Work

Before presenting the experiments conducted within our novel candidate recommendation pipeline, essential terminology needs to be defined alongside the state of the art in the (sub)task(s) at hand. More specifically, we will first introduce candidate synthesis, which serves as our first FG, before introducing fairness and specifying the relevant techniques used in the CRS pipeline. Finally, we will conclude with the research gap and a summary of how the discussed techniques fit in our CRS.

2.1. Data Synthesis

Originally proposed by Rubin in 1993, the synthetic data solution was initially tasked to overcome confidentiality concerns during surveys [8]. Although confidentiality issues have become more important with new stricter European regulations such as the General Data Protection Regulation (GDPR), the current applications of synthetic data have also shown their strength in generating fair and private synthetic data. In fact, synthetic data applications extend far beyond survey data synthesis: use cases range from missing data imputation and data augmentation in semi-supervised learning to media applications such as image-to-image translation and image super-resolution [9].

Data synthesis has evolved from Bayesian bootstrapping methods and predictive posterior distributions to deeper techniques such as Autoencoders (AE), Variational Autoencoders (VAEs), autoregressive models, Boltzmann machines, deep belief networks, and generative adversarial networks (GANs) after the advent of deep learning [10]. These deeper models, more specifically GANs, afforded the synthesis of more complex unstructured data such as images and videos. In the context of this thesis project, GANs will be used to generate tabular (structured) synthetic candidate data.

Despite their popularity, GANs are mainly used for unstructured data synthesis tasks such as image and video synthesis; the generation of synthetic tabular data such as job candidates is uncommon not only from a domain perspective but also from a technical perspective. This is caused by the difficulty of learning discrete features with potentially imbalanced classes, a challenge for which Xu et al. found a solution by integrating a Gumbel Softmax (GS) activation function in their CTGAN. The GS is based on the Gumbel-Max trick, a common method for discrete approximation [12].

With the ability to generate categorical features, other issues can hinder the tabular candidate synthesis process. Issues such as input datasets with mixed distributions (as is the case for our input data) can severely affect generative performance. For these problems, Xu et al. propose two solutions: mode-specific normalization for continuous column normalization and conditional sampling to enforce class balancing; both target known problems in discriminatory generative modelling. Therefore, CTGAN is an ideal generator for the task at hand as it can balance imbalanced datasets and handle mixtures of data types. Before outlining the fairness-related work, we relate candidate synthesis to our CRS pipeline and discuss its contribution to both the academic and domain gaps.

Candidate synthesis is uncommon. Although fairness research showed successful use of tabular GANs to generate fair data, and more domain-relevant research showed the use of Gaussian copulas for synthetic candidate generation, considerations using CTGANs to support downstream tasks are rare if not unavailable [5, 13]. In the synthetic candidate generation domain, van Els et al. is the unique example in our high risk of harm task. Therefore, the use of GANs, more specifically CTGANs, to generate candidates will greatly improve the fairness of our CRS pipeline.

In fact, as outlined by Xu et al., conditional sampling will allow us to synthesize balanced training data with ease, which can be used downstream as a fair balanced basis to train candidate-scoring algorithms and mitigate bias; the use of conditional sampling alongside rejection sampling (to be introduced in the methodology section) is how we link candidate synthesis with fairness and ultimately bias mitigation in our end-to-end CRS pipeline. Therefore, the use of CTGANs is novel in the candidate recommendation domain. With the synthetic pre-processing techniques outlined, we will provide an outline of the fairness literature, focusing more specifically on post-processing methods.
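To make the Gumbel-Softmax relaxation discussed in this section more concrete, the following is a minimal sketch of how a relaxed categorical sample can be drawn. It is not the CTGAN implementation; the function name `gumbel_softmax_sample`, the example probabilities, and the temperature value are illustrative assumptions.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a categorical distribution.

    Hypothetical helper for illustration only; not taken from CTGAN.
    """
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Assumed example: a categorical feature with three levels.
logits = [math.log(p) for p in (0.7, 0.2, 0.1)]
soft = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
hard = soft.index(max(soft))  # index of the discrete category (Gumbel-Max)
```

Taking the arg max of the perturbed logits recovers the Gumbel-Max trick, while the softmax keeps the sample differentiable; lower temperatures push the relaxed sample closer to a one-hot vector.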

2.2. Fairness

With the relevant background and related work on candidate synthesis introduced, we now proceed further down our CRS pipeline towards the second FG, which mitigates bias at the post-processing level, that is, after the models are trained on synthetic data to score real candidates. The scored candidates are then evaluated according to a relevant fairness metric and re-ranked using a relevant post-processing technique.

Currently, multiple fairness metrics exist, each with their respective strengths and weaknesses. In our case, we only consider demographic parity, which was defined by Kusner et al. as:

• Demographic Parity: "A predictor $\hat{Y}$ satisfies demographic parity if $P(\hat{Y} \mid A = 0) = P(\hat{Y} \mid A = 1)$," for $A$ representing a sensitive attribute with two levels.
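As a small illustration of the definition above, the sketch below measures how far a set of binary predictions is from demographic parity. The helper names and the toy data are assumptions made for this example; a gap of 0 corresponds to exact parity between the two levels of the sensitive attribute.

```python
def positive_rate(y_pred, sensitive, level):
    """Share of positive predictions among items whose attribute equals `level`."""
    selected = [y for y, a in zip(y_pred, sensitive) if a == level]
    return sum(selected) / len(selected)  # assumes the group is not empty


def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference between P(Y_hat = 1 | A = 0) and P(Y_hat = 1 | A = 1)."""
    return abs(positive_rate(y_pred, sensitive, 0) - positive_rate(y_pred, sensitive, 1))


# Toy example (made-up data): both groups receive positive predictions at rate 0.5.
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
sensitive = [0, 0, 0, 0, 1, 1, 1, 1]
gap = demographic_parity_gap(y_pred, sensitive)
```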

Many other fairness techniques exist, namely the removal of any sensitive attributes. We stress that simply removing sensitive attributes is not guaranteed to remove bias. This process of simply removing protected attributes is known as fairness through unawareness and was shown to perpetuate unfairness [14]. In fact, in our CRS pipeline, we use the opposite logic to achieve fairness through awareness by explicitly using gender to re-rank candidates in the post-processing step.

2.2.1. Fairness in Rankings

While demographic parity is useful for quantifying fairness, the enforcement of such rules has yet to be defined. Fairness can be enforced either through a data cleaning process that checks for class imbalances and the existence of sensitive (proxy) variables (pre-processing) or by modifying model output post-training with approaches such as re-ranking (post-processing) [7]. Although we consider the two approaches in this project, the evaluation of our model will follow the SOTA post-processing techniques which are presented below.

For our CRS pipeline we will use Geyik et al.’s approach considering it is already used in the HR domain (the task at hand was the recommendation of candidates in LinkedIn). Additionally, Geyik et al. achieved SOTA performance with more than a 4-fold reduction in unfairness and a reduction in utility of only 6%. From a research gap perspective, candidate re-ranking is widely used in the industry and researched in Information Retrieval literature. However, despite not being novel in this sub-task, our CRS pipeline fills the research gap by performing the re-ranking of candidates on synthetically trained scoring models.

This is where our end-to-end CRS pipeline contributes to both the domain and the relevant literature: it tests how candidate synthesis for scoring-model training combines with re-ranking methods to improve end-to-end bias mitigation. This combination is novel in both the HR domain and in the literature for fairness and generative modelling.
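To illustrate the kind of greedy re-ranking discussed in this subsection, the sketch below implements a simplified deterministic greedy re-ranker that keeps every top-k prefix close to target group proportions while preferring higher scores. It is written in the spirit of the approach in [7] but is not that paper's code nor the implementation used in our pipeline; the function name, the tuple layout, and the toy candidates are assumptions.

```python
import math
from collections import deque


def greedy_rerank(candidates, targets):
    """Re-rank (candidate_id, group, score) tuples toward target group shares.

    `targets` maps each group to its desired share of every top-k prefix.
    Every candidate's group is assumed to appear in `targets`.
    """
    # One queue per group, best score first.
    queues = {group: deque() for group in targets}
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        queues[cand[1]].append(cand)

    counts = {group: 0 for group in targets}
    reranked = []
    for k in range(1, len(candidates) + 1):
        available = [g for g in targets if queues[g]]
        # Groups that are below their minimum count floor(share * k) at rank k.
        below_min = [g for g in available if counts[g] < math.floor(targets[g] * k)]
        # Groups that can still be picked without exceeding ceil(share * k).
        below_max = [g for g in available if counts[g] < math.ceil(targets[g] * k)]
        pool = below_min or below_max or available
        # Among eligible groups, take the one whose next candidate scores highest.
        group = max(pool, key=lambda g: queues[g][0][2])
        reranked.append(queues[group].popleft())
        counts[group] += 1
    return reranked


# Toy example (made-up scores): all male candidates initially outrank all female ones.
candidates = [("a", "M", 0.9), ("b", "M", 0.8), ("c", "M", 0.7),
              ("d", "F", 0.6), ("e", "F", 0.5), ("f", "F", 0.4)]
reranked = greedy_rerank(candidates, {"M": 0.5, "F": 0.5})
reranked_ids = [cand[0] for cand in reranked]
```

With equal target shares, the two groups end up interleaved at the top ranks, and the within-group order by score is preserved.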

2.3. Summary and Research Gap

The above mini-literature review outlined the different key areas of (candidate) synthesis and fairness processing techniques. As shown, the combination of multiple processing techniques within one CRS pipeline has never been attempted. Therefore, our pipeline is presented as a combination of the presented related work and it will be evaluated based on the output of the candidate rankings. For the evaluation, we will not be comparing our CRS pipeline’s CTGAN to Xu et al., nor will we be comparing our re-ranker to Geyik et al., as we are using drastically different datasets. Instead, we will be developing our own evaluation framework for the candidate data at hand, which we will outline in section 3.

The goal of this section was to provide a high-level overview of the literature and techniques used all while exposing the academic gap where our pipeline resides. In the following section, we use the provided background to introduce our experiments with in-depth technical detail and apply the SOTA related work to the candidate recommendation problem with our novel CRS pipeline.

3. Methodology

Our CRS follows a point-wise learning to rank approach, where for a given job $j$, we fetch and rank candidates $c$, much like, given a query, the goal is to rank documents in the traditional document retrieval scenario. In other words, our recommender system predicts relevance scores $\hat{y}_{c,j}$ given the candidate and job features $x_c, x_j$.

We use real data from an international HR company. For training purposes, the candidate features are associated with a ground truth label $y_{c,j}$, where $y_{c,j} = 1$ if the candidate $c$ has been recruited or shortlisted for a job $j$, and 0 otherwise.

The data used for training is of a structured nature, spanning real-valued, categorical, and binary features. Features comprise candidate features (e.g., job seekers’ preferences such as minimum salary, preferred working hours, or maximum travel distance, in addition to data related to their work experience or level of education), job features (e.g., industry of the company, company size, geographical location), and finally candidate-job features that represent their overlap (e.g., geographical distance between candidate and job, or a binary feature indicating whether the candidate has worked in the job’s industry before). This is much in the same vein as query, document, and query-document features are designed in a traditional learning to rank for information retrieval scenario.

3.1. Gender balance and synthetic data

Imbalanced data is very common in CRSs, and we focus on gender imbalance for our case, which is common in the job market. To effectively study the issue of imbalance, we construct various explicitly (im)balanced scenarios through a rejection sampling algorithm based on John von Neumann’s technique [15]. We first sampled re-balanced subsets of the original training data, considering gender as the sensitive attribute $A$. We only considered 2 genders (female, male) as unfortunately our dataset does not contain enough samples of non-binary genders.

To construct our (im)balanced subsets, we randomly sampled job candidates from each job request with a constrained proportion of candidates from each gender. We generated two datasets with heavy imbalance (one with 20% of female candidates, one with 20% of males); two datasets with minor imbalance (one with 45% of female candidates, one with 45% of males); and a balanced dataset (with 50% of male and female candidates). For each training dataset, 10% of the data points were kept as a held-out test set. To avoid data leakage, all job requests were unique to the test set.

The test dataset sizes in number of unique $\langle c, j \rangle$-pairs after rejection sampling are shown in Table 1.

Table 1: Test data scenarios: heavy imbalance (20% males), heavy imbalance (20% females), minor imbalance (45% males), minor imbalance (45% females), balanced.

We trained 5 synthetic data models, using each re-balanced dataset as training data for the CTGAN algorithm [11]. We were able to generate balanced synthetic data using the models’ conditional sampling parameters. We generated balanced synthetic data where each gender represents 50% of the dataset, for both positive ($y_{c,j} = 1$) and negative ($y_{c,j} = 0$) examples.

The synthetic data generation is our first fairness gate (FG) in the CRS pipeline. This FG aims to improve the fairness of candidate scoring $\hat{y}_{c,j}$ by training the CRS on balanced data. The full overview of the experimental pipeline is shown in Figure 1.

3.2. Candidate scoring and re-ranking

We trained CRS models to score candidates by estimating their relevance score $\hat{y}_{c,j}$ for the jobs $j$. We trained a total of 10 CRS models, using real or synthetic job candidates as training data (5 datasets each respectively). The jobs for which candidates are scored remain those of the real data, more specifically, the real holdout test data.

We tested the CRS models with their respective holdout test sets, comprising real data with the same gender balance. For each test set, we scored candidates using either the CRS trained with synthetic data or with real data (of the same gender balance), i.e., we use 2 CRS models for each of the 5 test sets, and thus obtain a total of 10 sets of scores. After scoring candidates, we rank candidates by descending order of relevance scores, and obtain 10 sets of rankings.

After the candidates are scored and ranked, we introduce our second Fairness Gate (FG) at the post-processing level of the CRS pipeline. This FG aims to improve the fairness of candidate ranking by using a re-ranking algorithm that interleaves males and females equally at the top ranks (e.g., Figure 2). For our experimental CRS pipeline, we reused the re-ranking algorithm from Geyik et al. [7], and obtained 10 sets of re-rankings (Figure 1).

3.3. Metrics and Evaluation

The impact of the re-ranking is evaluated in terms of utility using Normalised Discounted Cumulative Gain (NDCG), a common ranking metric to maximise [16]. To measure the impact of the re-ranking, we compared the scores before re-ranking (by considering the initial ranking as the ideal ranking) and after re-ranking. A lower NDCG score means re-ranking had a negative impact on the original rankings. A higher NDCG score means re-ranking had less impact. As we are considering the impact of the ranking, the NDCG score was calculated after ranking, hence the appearance of only one NDCG score. Therefore, we used the NDCG as a single impact metric. The original predicted ranks were used as ground truth (ideal ranking), which was measured against the re-ranked candidates. To ensure the ideal ranks are valid, we have used common classification metrics such as F1 and AUC.
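The impact measurement described above can be sketched as follows, under the assumption that the predicted relevance scores serve as gains and that the original score-sorted ranking is the ideal ranking. The function names and example scores are illustrative and not taken from our experimental code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def rerank_ndcg(original_scores, reranked_scores):
    """NDCG of a re-ranked list, taking the original score-sorted list as ideal."""
    ideal = dcg(sorted(original_scores, reverse=True))
    return dcg(reranked_scores) / ideal if ideal > 0 else 0.0


# Assumed example: predicted scores before and after interleaving two groups.
original = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
reranked = [0.9, 0.6, 0.8, 0.5, 0.7, 0.4]
impact = rerank_ndcg(original, reranked)
```

A value of 1 means the re-ranker left the order unchanged; the further the value drops below 1, the more the re-ranking departed from the original score order.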

In terms of fairness, we used NDKL (normalized discounted cumulative Kullback-Leibler divergence), a distance metric comparing distribution dissimilarity, such as rank distributions [7].

Here, NDKL calculates the dissimilarity between the distributions of males and females, especially at the top ranks. We consider that demographic parity is achieved when the rank distributions of males and females are similar (i.e., NDKL = 0).
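As a rough sketch of how such a metric can be computed, the code below follows the usual formulation of NDKL: for every top-i prefix of the ranking, the KL divergence between the observed group shares and the desired shares is computed, discounted by 1 / log2(i + 1), and the weighted sum is normalised by the sum of the discounts. The function name and the toy rankings are assumptions; the desired shares are assumed to be strictly positive.

```python
import math


def ndkl(ranked_groups, target):
    """Normalized discounted cumulative KL divergence of a ranking.

    `ranked_groups` lists the group of each ranked candidate, best rank first.
    `target` maps each group to its desired (strictly positive) share.
    """
    counts = {group: 0 for group in target}
    weighted_kl, normaliser = 0.0, 0.0
    for i, group in enumerate(ranked_groups, start=1):
        counts[group] += 1
        kl = 0.0
        for g, share in target.items():
            observed = counts[g] / i
            if observed > 0:  # 0 * log(0) is treated as 0
                kl += observed * math.log(observed / share)
        weight = 1.0 / math.log2(i + 1)
        weighted_kl += weight * kl
        normaliser += weight
    return weighted_kl / normaliser


target = {"M": 0.5, "F": 0.5}
interleaved = ndkl(["M", "F", "M", "F", "M", "F"], target)
segregated = ndkl(["M", "M", "M", "F", "F", "F"], target)
```

The interleaved ranking yields a lower value than the segregated one, because its prefixes stay closer to the desired 50/50 split, particularly at the heavily weighted top ranks.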

4. Results and Analysis

We present the results of the CRS that include one, two, or none of our Fairness Gates (FG): re-balancing the training set with synthetic data (1st FG), and re-ranking the job candidates (2nd FG). We consider 3 levels of data imbalance, and summarise the NDCG and NDKL for each level in Table 2.

The NDCG difference is noticeable between CRS models trained with real or synthetic datasets (i.e., between pairs of rows in Table 2). For the heavy imbalance case, the increase in utility is almost two-fold (+45%).

The NDKL difference is very small between CRS models trained with real or synthetic datasets, and shows a negligible improvement of fairness. These results show that using balanced synthetic data to train CRS models (1st FG) considerably improved utility (NDCG) while maintaining the same level of fairness (NDKL).

The NDKL decreases between before and after re-ranking (i.e., last two columns in Table 2), showing that the rank distributions of male and female candidates are more similar after re-ranking. The NDKL decrease is of similar magnitude for each level of data imbalance, i.e., whether the CRS model is trained with real or synthetic data.

We also explored the score distributions for male and female candidates. Those attributed by CRS models trained with real data are unevenly skewed toward the left, even in cases where the real data is balanced (balanced dataset). However, for CRS models trained with synthetic data, the score distributions of both genders shift more to the right, creating a more normally-shaped score distribution across both studied genders.

5. Discussion

Despite the promising results shown in section 4, our CRS pipeline has shown some pitfalls. More specifically, the computation of NDCG using ranked candidates as ground truth and only evaluating the re-ranked performance can come with additional validity issues. However, it should be noted that these validity issues can be easily averted by adding another NDCG calculation that also evaluates non-re-ranked candidates against a ground truth constructed from another holdout set, for example.

Additionally, supplementary validation methods could have been considered. For instance, it could have been beneficial to use future job requests $j$, not included in the data, in further evaluations. Statistical tests could have also been conducted, while other user-based approaches, such as an evaluation with recruiters, could have contributed to reinforce the validity of this project. These extra validation steps should be implemented before deploying the fairness mechanisms proposed.

Furthermore, some findings were unexplainable with the current analysis. For instance, the NDKL scores for CRSs trained on real minor imbalanced datasets are lower than those trained on real balanced datasets, which also applies after re-ranking. Although the scores vary by a small margin, such behaviour is difficult to explain considering the complexity of our pipeline, rendering debugging tasks equally complex.

Additional unexplainable results are also visible in the synthetic to real comparison, with CRSs trained on synthetic datasets such as heavy imbalance showing more unfairness by a small margin when compared to real-trained counterparts. These unexplainable findings between real and synthetic subsets are even more puzzling considering that figure 3 shows more balanced scoring for all synthetically-trained CRSs, which should result in a lower NDKL score before re-ranking.

Finally, the implementation of demographic parity to enforce equal proportions between genders oversimplifies the complexity of the candidate hiring landscape. This oversimplification can be resolved in future research with a lesser degree of generalizability. Future research can be more specific by adjusting fairness rules to the domain of the job request $j$. For instance, certain jobs such as security personnel can show real-world skewness towards a certain gender. A future CRS pipeline needs to adjust its fairness rules at $j$ level.

Despite these limitations and suggestions for future work, overall, our research successfully showed that the combination of synthetic data and re-ranking contributed to both fairness and utility, even when compared to CRSs trained on real balanced data such as the balanced dataset. Therefore, as expected, a combination of pre-processing and post-processing FGs proved to be useful.

6. Conclusion

The goal of our CRS pipeline was never to produce SOTA synthetic candidates and recommendations, despite our satisfactory results. The goal was to build a recommendation pipeline using both real and synthetic data to be able to experiment with fair processing techniques and, as a result, mitigate bias in candidate recommendations. From this perspective, the double fair-gated CRS pipeline was successfully built and the generation of synthetic candidates was successful, valid and accurate throughout the pipeline.

The generated data has shown to be accurate on all (im)balance levels, validating the expectations on mode-specific normalization and conditional sampling in CTGANs, while also demonstrating the benefits of rejection sampling methods in re-balancing imbalanced data and using the synthetic candidates generated from it to score real (im)balanced test subsets fairly. From a fairness perspective, it was also shown how scorers trained on synthetic candidates outperform scorers trained on balanced real data from a utilitarian perspective.

Although the issues outlined in section 5 concerning the lack of measurement of pre-re-ranked utility raise some minor validity concerns, the evidence shows how synthetically-trained CRSs provide fair, useful candidate recommendations when integrated in such a pipeline.

7. Future Work

In future work, the recommendations shared in the discussion can be considered. More specifically, the use of additional evaluation methods with human-in-the-loop evaluation using recruiters, or the use of future requests to test the CRS pipeline.

Additionally, future researchers should also consider the use of less data-greedy rejection sampling techniques, as we have lost more than 80% of the holdout information we had at the start of the pipeline. This can be resolved with more elegant rejection sampling constraints, the use of larger datasets, or data-augmentation techniques through synthetic data, for instance. The latter could have been considered in this project if it was within the scope of our research.

Finally, with a solved data scarcity problem, future researchers can consider the discussed domain-adjustable fairness rules for more specific fairness constraints to overcome real-world skewness.

8. Acknowledgements

We acknowledge the University of Amsterdam - Master programme Information Studies for creating the conditions to perform this research and for financially supporting this publication.

References

[1] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163. URL: https://doi.org/10.1089/big.2016.0047. doi:10.1089/big.2016.0047. PMID: 28632438.

[2] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: S. A. Friedler, C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 77–91. URL: https://proceedings.mlr.press/v81/buolamwini18a.html.

[3] S. Hajian, J. Domingo-Ferrer, A methodology for direct and indirect discrimination prevention in data mining, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 1445–1459. doi:10.1109/TKDE.2012.72.

[4] A. Prince, D. Schwarcz, Proxy discrimination in the age of artificial intelligence and big data, Iowa Law Review 105 (2020) 1257–1318.

[5] A. Rajabi, O. O. Garibay, Tabfairgan: Fair tabular data generation with generative adversarial networks, arXiv preprint arXiv:2109.00666 (2021).

[6] Y. Li, H. Chen, S. Xu, Y. Ge, Y. Zhang, Towards personalized fairness based on causal notion, CoRR abs/2105.09829 (2021). URL: https://arxiv.org/abs/2105.09829. arXiv:2105.09829.

[7] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-aware ranking in search & recommendation systems with application to linkedin talent search, 2019. URL: https://doi.org/10.1145/3292500.3330691. doi:10.1145/3292500.3330691.

[8] D. B. Rubin, Discussion statistical disclosure limitation, Journal of Official Statistics 9 (1993) 461–468.

[9] I. Goodfellow, Nips 2016 tutorial: Generative adversarial networks, 2017. URL: https://arxiv.org/abs/1701.00160. doi:10.48550/ARXIV.1701.00160.

[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 1st ed., MIT Press, Cambridge, Massachusetts, United States, 2016.

[11] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional gan, 2019. URL: https://arxiv.org/abs/1907.00503. doi:10.48550/ARXIV.1907.00503.

[12] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, 2016. URL: https://arxiv.org/abs/1611.01144. doi:10.48550/ARXIV.1611.01144.

[13] S.-J. van Els, D. Graus, E. Beauxis-Aussalet, Improving fairness assessments with synthetic data: a practical use case with a recommender system for human resources, 2022.

[14] M. J. Kusner, J. R. Loftus, C. Russell, R. Silva, Counterfactual fairness, 2018. arXiv:1703.06856.

[15] J. von Neumann, Various techniques used in connection with random digits, National Bureau of Standards, Applied Math Series 12 (1951) 768–770.

[16] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of ir techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446.