End-to-End Bias Mitigation in Candidate Recommender Systems with Fairness Gates

Adam Mehdi Arafan1,†, David Graus2, Fernando P. Santos3 and Emma Beauxis-Aussalet4

1 University of Amsterdam, Amsterdam, The Netherlands
2 Randstad Groep Nederland, Diemen, The Netherlands
3 University of Amsterdam, Amsterdam, The Netherlands
4 Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

Abstract
Recommender Systems (RS) have proven successful in a wide variety of domains, and the human resources (HR) domain is no exception. RS have proved valuable for recommending candidates for a position, although the ethical implications have recently been identified as high-risk by the European Commission. In this study, we apply RS to match candidates with job requests. The RS pipeline includes two fairness gates at two different steps: pre-processing (using GAN-based synthetic candidate generation) and post-processing (with greedily searched candidate re-ranking). While prior research studied fairness at the pre- and post-processing steps separately, our approach combines them both in the same pipeline applicable to the HR domain. We show that the combination of gender-balanced synthetic training data with pair re-ranking increased fairness with satisfactory levels of ranking utility. Our findings show that using only the gender-balanced synthetic data for bias mitigation is fairer by a negligible margin when compared to using real data. However, when implemented together with the pair re-ranker, candidate recommendation fairness improved considerably, while maintaining a satisfactory utility score. In contrast, using only the pair re-ranker achieved a similar fairness level, but had a consistently lower utility.

Keywords
Fair Artificial Intelligence, Generative Modelling, Information Retrieval, Recommender Systems

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
† Work done while on internship at Randstad Groep Nederland.
adammehdiarafan@gmail.com (A. M. Arafan); david.graus@randstadgroep.nl (D. Graus); f.p.santos@uva.nl (F. P. Santos); e.m.a.l.beauxisaussalet@vu.nl (E. Beauxis-Aussalet)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Machine learning (ML) applications have proven to be useful in many domains over recent years. However, despite the many benefits of ML-enabled tools, biases can occur and be amplified through the highly scalable nature of ML-enabled systems. Algorithms used in applications such as recidivism prediction, predictive policing, or facial recognition have revealed bias towards race, gender, or both [1, 2]. These biases can also be expressed through proxy (unobservable) correlations with sensitive attributes such as gender, and through poorly defined decision boundaries [3, 4].

We focus on fairness issues in candidate recommender systems (CRS). The goal of such a system is to recommend the best candidates for a specific job, often computing ranked lists of candidates in descending order of relevance. A variety of fairness issues may arise from the large and diverse pools of candidates and job offers.

In the case of the HR industry, bias in recommendations comes with a high risk of harm, as candidates can perpetually face discrimination in finding employment. The risk of harm is especially great considering the scalable nature of recommender systems. Here we focus on a CRS that supports a recruiter in finding the best matching candidates for a client job request (e.g., a factory requesting 20 technicians).

As most ML algorithms perform predictions in a discriminative fashion using historical data, it is not trivial to guarantee that discrimination is not (unfairly) influenced by proxies that might be correlated with protected characteristics. The fairness-in-ML problem has been approached by many researchers, such as Rajabi and Garibay [5], who tackled the problem by synthesizing data, Li et al. [6], who constrained recommendations, and Geyik et al. [7], who re-ranked recommendations.
These researchers produced state-of-the-art (SOTA) algorithms tackling specific fairness techniques, from which we distinguish two: pre-processing (enforcing fairness at the data level) and post-processing (enforcing fairness after predictions were made).

These two approaches have traditionally been researched separately in the RS and fairness literature, ignoring potential synergistic effects of applying fairness mechanisms at different stages of the ML pipeline. To the best of our knowledge, we found no prior work experimenting with more than one processing technique in a single pipeline. We aim to close this gap by testing SOTA bias mitigation methods in both pre- and post-processing, and observing the impact on the fairness of candidate ranking. We propose a pipeline for a CRS that integrates two bias mitigation mechanisms (called Fairness Gates, FG) at the pre- and post-processing steps. By FG, we refer to the enforcement of bias mitigation techniques within the pipeline. The FGs are a synthetic data generator and a greedy re-ranker.

The synthetic data generator enforces gender balance in the sampling size, while the greedy re-ranker optimizes for both utility (the quality or usefulness of candidate recommendations) and gender balance in candidate ranking. In this paper, we explore the fairness-utility trade-offs among re-ranked CRS outputs trained using synthetic data or only real data. Therefore, we focus on exploring the impacts and trade-offs between utility and fairness that arise from combining synthetic data generation at the pre-processing step with greedy pair re-ranking at the post-processing step.

Our experimental results show that the best compromise between fairness and utility is achieved when combining the two FGs rather than using just one.

2. Background and Related Work

Before presenting the experiments conducted within our novel candidate recommendation pipeline, essential terminology needs to be defined alongside the state of the art in the (sub)tasks at hand. More specifically, we first introduce synthetic candidate generation, which serves as our first FG, before introducing fairness and specifying the relevant techniques used in the CRS pipeline. Finally, we conclude with the research gap and a summary of how the discussed techniques fit in our CRS.

2.1. Data Synthesis

Originally proposed by Rubin in 1993, synthetic data was initially intended to overcome confidentiality concerns in surveys [8]. Although confidentiality issues have become more important with new, stricter European regulations such as the General Data Protection Regulation (GDPR), current applications of synthetic data have also shown their strength in generating fair and private synthetic data. In fact, synthetic data applications extend far beyond survey data synthesis; use cases range from missing data imputation and data augmentation in semi-supervised learning to media applications such as image-to-image translation and image super-resolution [9].

Data synthesis has evolved from Bayesian bootstrapping methods and predictive posterior distributions to deeper techniques such as Autoencoders (AE), Variational Autoencoders (VAEs), autoregressive models, Boltzmann machines, deep belief networks, and generative adversarial networks (GANs) after the advent of deep learning [10]. These deeper models, more specifically GANs, afforded the synthesis of more complex unstructured data such as images and videos. In this project, GANs are used to generate tabular (structured) synthetic candidate data.

Despite their popularity, GANs are mainly used for unstructured data synthesis tasks such as image and video synthesis. The generation of synthetic tabular data such as job candidates is uncommon not only from a domain perspective but also from a technical perspective. This is caused by the difficulty of learning discrete features with potentially imbalanced classes, a challenge for which Xu et al. [11] found a solution by integrating a Gumbel-Softmax (GS) activation function in their CTGAN. The GS is based on the Gumbel-Max trick, a common method for discrete approximation [12].

With the ability to generate categorical features, other issues can still hinder the tabular candidate synthesis process. Issues such as input datasets with mixed distributions (as is the case for our input data) can severely affect generative performance. For these problems, Xu et al. propose two solutions: mode-specific normalization for continuous columns and conditional sampling to enforce class balancing, both addressing known problems in generative modelling. Therefore, CTGAN is an ideal generator for the task at hand, as it can balance imbalanced datasets and handle mixtures of data types. Before outlining the fairness-related work, we relate CTGAN to our CRS pipeline and discuss its contribution to both the academic and the domain gap.

Candidate synthesis is uncommon: although fairness research has shown successful use of tabular GANs to generate fair data [5], and more domain-relevant research has shown the use of Gaussian copulas for synthetic candidate generation [13], considerations of CTGANs to support downstream tasks are rare if not unavailable. In the synthetic candidate generation domain, van Els et al. [13] is the only example in a comparably high-risk-of-harm task. Therefore, the use of GANs, more specifically CTGANs, to generate candidates will greatly improve the fairness of our CRS pipeline.

In fact, as outlined by Xu et al., conditional sampling allows us to synthesize balanced training data with ease, which can be used downstream as a fair, balanced basis to train candidate-scoring algorithms and mitigate bias. The use of conditional sampling alongside rejection sampling (introduced in the methodology section) is how we link candidate synthesis with fairness and ultimately bias mitigation in our end-to-end CRS pipeline. Therefore, the use of CTGANs is novel in the candidate recommendation domain. With the synthetic pre-processing techniques outlined, we now provide an outline of the fairness literature, focusing more specifically on post-processing methods.
2.2. Fairness

With the relevant background and related work on candidate synthesis introduced, we now proceed further down our CRS pipeline towards the second FG, which mitigates bias at the post-processing level, i.e., after the models are trained on synthetic data to score real candidates. The scored candidates are then evaluated according to a relevant fairness metric and re-ranked using a relevant post-processing technique.

Currently, multiple fairness metrics exist, each with their respective strengths and weaknesses. In our case, we only consider demographic parity, which was defined by Kusner et al. [14] as:

• Demographic Parity: "A predictor Ŷ satisfies demographic parity if P(Ŷ | A = 0) = P(Ŷ | A = 1)", for A representing a sensitive attribute with a levels.

Many other fairness techniques exist, notably the removal of any sensitive attributes. We stress that simply removing sensitive attributes is not guaranteed to remove bias. This process of simply removing protected attributes is known as fairness through unawareness and was shown to perpetuate unfairness [14]. In fact, in our CRS pipeline, we use the opposite logic to achieve fairness through awareness, by explicitly using gender to re-rank candidates in the post-processing step.
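To make the criterion concrete, demographic parity can be checked empirically by comparing positive prediction rates across the levels of the sensitive attribute. The following is a minimal sketch of such a check (not part of the paper's pipeline); the column names y_hat and gender are illustrative assumptions.

```python
import pandas as pd

def demographic_parity_gap(df: pd.DataFrame,
                           pred_col: str = "y_hat",
                           attr_col: str = "gender") -> float:
    """Absolute difference in positive prediction rates between the levels
    of the sensitive attribute (0 indicates demographic parity)."""
    rates = df.groupby(attr_col)[pred_col].mean()  # P(Y_hat = 1 | A = a)
    return abs(rates.max() - rates.min())

# Illustrative usage with hypothetical predictions:
preds = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "male", "female"],
    "y_hat":  [1, 1, 0, 1, 0, 1],
})
print(demographic_parity_gap(preds))  # 0.0 indicates parity in this toy example
```

A gap of 0 indicates demographic parity; in practice a small tolerance is usually allowed.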
2.2.1. Fairness in Rankings

While demographic parity is useful for quantifying fairness, the enforcement of such rules has yet to be defined. Fairness can be enforced either through a data cleaning process checking for class imbalances and the existence of sensitive (proxy) variables (pre-processing), or by modifying model output post-training with approaches such as re-ranking (post-processing) [7]. Although we consider the two approaches in this project, the evaluation of our model follows the SOTA post-processing techniques presented below.

For our CRS pipeline we use Geyik et al.'s approach, considering it is already used in the HR domain (the task at hand was the recommendation of candidates on LinkedIn). Additionally, Geyik et al. achieved SOTA performance with more than a 4-fold reduction in unfairness and a reduction in utility of only 6%. From a research gap perspective, candidate re-ranking is widely used in industry and researched in the Information Retrieval literature. However, despite not being novel in this sub-task, our CRS pipeline fills the research gap by performing the re-ranking of candidates on synthetically trained scoring models.

This is where our end-to-end CRS pipeline contributes to both the domain and the relevant literature, by testing how candidate synthesis for scoring-model training combines with re-ranking methods for a better end-to-end bias mitigation process. This combination is novel in both the HR domain and the literature on fairness and generative modelling.

2.3. Summary and Research Gap

The above mini-literature review outlined the different key areas of (candidate) synthesis and fairness processing techniques. As shown, the combination of multiple processing techniques within one CRS pipeline has never been attempted. Therefore, our pipeline is presented as a combination of the presented related work, and it will be evaluated based on the output of the candidate rankings. For the evaluation, we will not be comparing our CRS pipeline's CTGAN to Xu et al., nor will we be comparing our re-ranker to Geyik et al., as we are using drastically different datasets. Instead, we develop our own evaluation framework for the candidate data at hand, which we outline in section 3.

The goal of this section was to provide a high-level overview of the literature and techniques used, while exposing the academic gap where our pipeline resides. In the following section, we use the provided background to introduce our experiments in in-depth technical detail and apply the SOTA related work to the candidate recommendation problem with our novel CRS pipeline.

3. Methodology

Our CRS follows a point-wise learning to rank approach, where for a given job j we fetch and rank candidates i, much like the goal of ranking documents given a query in the traditional document retrieval scenario. In other words, our recommender system predicts relevance scores ŷ_{i,j} given the candidate and job features X_{i,j}.

We use real data from an international HR company. For training purposes, the candidate features X_i are associated with a ground truth label y_{i,j}, where y_{i,j} = 1 if the candidate i has been recruited or shortlisted for a job j, and 0 otherwise.

The data used for training is of a structured nature, spanning real-valued, categorical, and binary features. Features correspond to candidate features (e.g., job seekers' preferences such as minimum salary, preferred working hours, or maximum travel distance, in addition to data related to their work experience or level of education), job features (e.g., industry of the company, company size, geographical location), and finally candidate-job features that represent their overlap (e.g., geographical distance between candidate and job, or a binary feature indicating whether the candidate has worked in the job's industry before), much in the same vein that query, document, and query-document features are designed in a traditional learning-to-rank scenario for information retrieval.
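The paper does not specify which model produces the relevance scores ŷ_{i,j}, so the sketch below only illustrates the point-wise setup under assumed feature and column names: a binary classifier is trained on <candidate, job> feature rows with the recruited/shortlisted label as target, and the candidates fetched for a job are then ranked by predicted probability.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# train_df holds one row per <candidate, job> pair with a 0/1 label
# 'recruited'; feature_cols mixes candidate, job, and candidate-job features.
feature_cols = ["min_salary", "travel_distance_km", "company_size",
                "worked_in_industry_before"]  # illustrative names

def train_pointwise_scorer(train_df: pd.DataFrame) -> GradientBoostingClassifier:
    model = GradientBoostingClassifier()
    model.fit(train_df[feature_cols], train_df["recruited"])
    return model

def rank_candidates_for_job(model, job_candidates: pd.DataFrame) -> pd.DataFrame:
    """Score all candidates fetched for one job j and sort by relevance."""
    scored = job_candidates.copy()
    scored["y_hat"] = model.predict_proba(scored[feature_cols])[:, 1]
    return scored.sort_values("y_hat", ascending=False)
```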
3.1. Gender balance and synthetic data

Imbalanced data is very common in CRSs, and we focus on gender imbalance for our case, which is common in the job market. To effectively study the issue of imbalance, we construct various explicitly (im)balanced scenarios through a rejection sampling algorithm based on John von Neumann's technique [15]. We first sampled re-balanced subsets of the original training data, considering gender as the sensitive attribute a. We only considered 2 genders (female, male), as unfortunately our dataset does not contain enough samples of non-binary genders.

To construct our (im)balanced subsets, we randomly sampled job candidates from each job request j with a constrained proportion of candidates from each gender. We generated two datasets with heavy imbalance (one with 20% of female candidates, one with 20% of males), two datasets with minor imbalance (one with 45% of female candidates, one with 45% of males), and a balanced dataset (with 50% of male and female candidates). For each training dataset, 10% of the data points were kept as a held-out test set. To avoid data leakage, all job requests j were unique to the test set. The test dataset sizes in number of unique <j, i>-pairs after rejection sampling are shown in Table 1.

Table 1: Test set sizes after rejection sampling.

    Test Data                        Sample Size
    heavy imbalance (20% males)      38 701
    heavy imbalance (20% females)    40 975
    minor imbalance (45% males)      48 195
    minor imbalance (45% females)    41 972
    balanced                         48 178
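As a rough illustration of the subset construction described above (a simplified stand-in for the authors' rejection sampling, with assumed column names job_id and gender), the sketch below keeps, per job request, the largest random subset whose gender proportions match a target share and rejects the surplus rows.

```python
import pandas as pd

def sample_with_gender_ratio(df: pd.DataFrame, female_share: float,
                             seed: int = 42) -> pd.DataFrame:
    """Per job request, keep a random subset whose share of female
    candidates approximates `female_share`; surplus rows are rejected."""
    kept = []
    for _, group in df.groupby("job_id"):
        females = group[group["gender"] == "female"].sample(frac=1, random_state=seed)
        males = group[group["gender"] == "male"].sample(frac=1, random_state=seed)
        # Largest subset size achievable under the target proportion.
        n = min(int(len(females) / female_share) if female_share > 0 else len(males),
                int(len(males) / (1 - female_share)) if female_share < 1 else len(females))
        n_female = round(n * female_share)
        kept.append(females.head(n_female))
        kept.append(males.head(n - n_female))
    return pd.concat(kept, ignore_index=True)

# e.g., a heavily imbalanced subset with 20% female candidates:
# heavy_imbalance = sample_with_gender_ratio(train_df, female_share=0.20)
```

Calling it with female shares of 0.20, 0.45, 0.50 (and the mirrored shares 0.80 and 0.55) would yield the five (im)balance scenarios summarised in Table 1.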
We trained 5 synthetic data models, using each re-balanced dataset as training data for the CTGAN algorithm [11]. We were able to generate balanced synthetic data using the models' conditional sampling parameters. We generated balanced synthetic data where each gender represents 50% of the dataset, for both positive (y_{i,j} = 1) and negative (y_{i,j} = 0) examples.

The synthetic data generation is our first fairness gate (FG) in the CRS pipeline. This FG aims to improve the fairness of candidate scoring ŷ_{i,j} by training the CRS on balanced data. The full overview of the experimental pipeline is shown in Figure 1.
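A minimal sketch of this first fairness gate, assuming the open-source ctgan package that accompanies Xu et al. [11] and its fit(data, discrete_columns) and sample(n, condition_column, condition_value) interface; the column names, epoch count, and sample sizes are illustrative assumptions rather than the paper's settings.

```python
import pandas as pd
from ctgan import CTGAN

def train_balanced_generator(rebalanced_train_df: pd.DataFrame,
                             n_per_gender: int = 25_000) -> pd.DataFrame:
    """Fit one CTGAN on a re-balanced training subset (Section 3.1) and draw a
    gender-balanced synthetic training set via conditional sampling."""
    discrete_columns = ["gender", "education_level", "recruited"]  # assumed names
    generator = CTGAN(epochs=300)
    generator.fit(rebalanced_train_df, discrete_columns)
    return pd.concat([
        generator.sample(n_per_gender, condition_column="gender", condition_value="female"),
        generator.sample(n_per_gender, condition_column="gender", condition_value="male"),
    ], ignore_index=True)
```

Balancing the label y_{i,j} as well, as described above, could be handled analogously, e.g., by conditioning on a combined gender-label column or sampling each stratum separately.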
3.2. Candidate scoring and re-ranking

We trained CRS models to score candidates i by estimating their relevance score ŷ_{i,j} for the jobs j. We trained a total of 10 CRS models, using real or synthetic job candidates as training data (5 datasets each, respectively). The jobs for which candidates are scored remain those of the real data, more specifically, the real holdout test data.

We tested the CRS models with their respective holdout test sets, comprising real data with the same gender balance. For each test set, we scored candidates using either the CRS trained with synthetic data or with real data (of the same gender balance), i.e., we use 2 CRS models per each of the 5 test sets, and thus obtain a total of 10 sets of scores. After scoring candidates, we rank candidates by descending order of relevance scores, and obtain 10 sets of rankings.

After the candidates are scored and ranked, we introduce our second Fairness Gate (FG) at the post-processing level of the CRS pipeline. This FG aims to improve the fairness of candidate ranking by using a re-ranking algorithm that interleaves males and females equally at the top ranks (e.g., Figure 2). For our experimental CRS pipeline, we reused the re-ranking algorithm from Geyik et al. [7], and obtained 10 sets of re-rankings (Figure 1).
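The sketch below is a simplified greedy re-ranker in the spirit of Geyik et al. [7], not their exact implementation: at each rank it places a candidate from whichever gender is currently furthest below its target share (ties broken by score), which with equal targets produces the interleaving at the top ranks described above.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (candidate_id, gender, score)

def greedy_rerank(ranked: List[Candidate],
                  target: Dict[str, float] = {"female": 0.5, "male": 0.5}) -> List[Candidate]:
    """ranked: candidates sorted by score descending.
    Rebuilds the list so every top-k prefix tracks the target gender shares."""
    remaining = {g: [c for c in ranked if c[1] == g] for g in target}
    placed = {g: 0 for g in target}
    output = []
    for k in range(1, len(ranked) + 1):
        available = [g for g in target if remaining[g]]
        # Largest deficit w.r.t. the target share in the top-k; ties go to the
        # group whose best remaining candidate has the higher score.
        g_pick = max(available,
                     key=lambda g: (target[g] * k - placed[g], remaining[g][0][2]))
        output.append(remaining[g_pick].pop(0))
        placed[g_pick] += 1
    return output

# Example: a male-dominated top of the list becomes interleaved.
ranking = [("c1", "male", 0.95), ("c2", "male", 0.91), ("c3", "male", 0.88),
           ("c4", "female", 0.80), ("c5", "female", 0.77)]
print([c[0] for c in greedy_rerank(ranking)])  # ['c1', 'c4', 'c2', 'c5', 'c3']
```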
3.3. Metrics and Evaluation

The impact of the re-ranking is evaluated in terms of utility using Normalised Discounted Cumulative Gain (NDCG), a common ranking metric to maximise [16]. To measure the impact of the re-ranking, we compared the NDCG scores before re-ranking (by considering the initial ranking as the ideal ranking) and after re-ranking. A lower NDCG score means re-ranking had a negative impact on the original rankings; a higher NDCG score means re-ranking had less impact. As we are considering the impact of the re-ranking, the NDCG score was calculated after re-ranking, hence the appearance of only one score. Therefore, we used the NDCG as a single impact metric. The original predicted ranks were used as ground truth (ideal ranking), against which the re-ranked candidates were measured. To ensure the ideal ranks are valid, we used common classification metrics such as F1 and AUC.

In terms of fairness, we used NDKL (normalized discounted cumulative Kullback-Leibler divergence), a distance metric comparing distribution dissimilarity, such as rank distributions [7]. Here, NDKL calculates the dissimilarity between the distributions of males and females, especially at the top ranks. We consider that demographic parity is achieved when the rank distributions of males and females are similar (i.e., NDKL = 0).
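For reference, Geyik et al. [7] define the NDKL of a ranked list as a position-discounted average of the KL divergence between the attribute distribution of each top-i prefix and a desired distribution. A minimal sketch of that computation, assuming a list of gender labels in rank order and a small smoothing constant to keep the divergence finite:

```python
import math
from collections import Counter

def ndkl(genders_by_rank, desired, eps=1e-12):
    """NDKL of a ranked list, following Geyik et al. [7]:
    (1/Z) * sum_i [ 1/log2(i+1) * KL(top-i gender distribution || desired) ]."""
    total, z = 0.0, 0.0
    counts = Counter()
    for i, g in enumerate(genders_by_rank, start=1):
        counts[g] += 1
        kl = sum((counts[a] / i) * math.log(((counts[a] / i) + eps) / (desired[a] + eps))
                 for a in desired)
        total += kl / math.log2(i + 1)
        z += 1 / math.log2(i + 1)
    return total / z

# A list that only surfaces female candidates after rank 3, versus a 50/50 target:
print(ndkl(["male", "male", "male", "female", "female"],
           {"female": 0.5, "male": 0.5}))  # > 0: top ranks deviate from parity
```

Using a 50/50 target as the desired distribution matches the demographic-parity reading above: NDKL = 0 when male and female candidates are distributed similarly across ranks.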
Figure 1: Experimental CRS pipeline including bias mitigation techniques at pre-processing and post-processing steps.

4. Results and Analysis

We present the results of the CRS that include one, two, or none of our Fairness Gates (FG): re-balancing the training set with synthetic data (1st FG), and re-ranking the job candidates (2nd FG). We consider 3 levels of data imbalance, and summarise the NDCG and NDKL for each level in Table 2.

The NDCG difference is noticeable between CRS models trained with real or synthetic datasets (i.e., between pairs of rows in Table 2). For the heavy imbalance case, the increase in utility is almost two-fold (+45%). The NDKL difference is very small between CRS models trained with real or synthetic datasets, and shows a negligible improvement of fairness. These results show that using balanced synthetic data to train CRS models (1st FG) considerably improved utility (NDCG) while maintaining the same level of fairness (NDKL).

The NDKL decreases from before to after re-ranking (i.e., the last two columns in Table 2), showing that the rank distributions of male and female candidates are more similar after re-ranking. The decrease is of similar magnitude for each level of data imbalance, i.e., whether the CRS model is trained with real or synthetic data. These results show that using re-ranking at post-processing (2nd FG) equally improved fairness (NDKL) whether or not synthetic data was used to train the CRS models (1st FG).

We also explored the score distributions for male and female candidates. Those attributed by CRS models trained with real data are unevenly skewed toward the left, even in cases where the real data is balanced (balanced dataset). However, for CRS models trained with synthetic data, the score distributions of both genders shift more to the right, creating a more normally-shaped score distribution across both studied genders.

Figure 2: Plot displaying the rankings of the top 10 candidates before re-ranking and after re-ranking. The ranks of the candidates are on the x-axis. Female candidates are blue bars, and male candidates are orange bars. Ranking A is from a CRS trained on heavily imbalanced data, and A1 represents the re-ranked candidates from A. Similarly, B and B1 are the initial and re-ranked rankings for a CRS trained on the balanced dataset.

Table 2: Average NDCG and NDKL for ranked lists obtained at each level of data imbalance, using CRS trained with real or synthetic data (1st FG), with or without re-ranking (2nd FG).

    Ranked Lists                                       NDCG           NDKL before    NDKL after
                                                                      re-ranking     re-ranking
    Heavy imbalance: CRS trained w. real data          0.384          0.366          0.200
    Heavy imbalance: CRS trained w. synthetic data     0.693 (+45%)   0.358          0.197
    Minor imbalance: CRS trained w. real data          0.403          0.217          0.126
    Minor imbalance: CRS trained w. synthetic data     0.647 (+38%)   0.213          0.126
    No imbalance: CRS trained w. real data             0.403          0.213          0.124
    No imbalance: CRS trained w. synthetic data        0.633 (+36%)   0.206          0.124

Figure 3: Score distribution for male and female candidates. The score assigned to the candidates is on the x-axis; female candidates are in blue while male candidates are in orange. A represents a CRS model trained with heavily imbalanced real data, and A1 a CRS trained with synthetic data learned from a generator trained on heavily imbalanced data. B and B1 are the corresponding models for the balanced dataset.

5. Discussion

Despite the promising results shown in section 4, our CRS pipeline has shown some pitfalls. More specifically, the computation of NDCG using the ranked candidates as ground truth, and only evaluating the re-ranked performance, can come with additional validity issues. However, it should be noted that these validity issues can be easily averted by adding another NDCG calculation evaluating also the non-re-ranked candidates against a ground truth constructed from another holdout set, for example.

Additionally, supplementary validation methods could have been considered. For instance, it could have been beneficial to use future job requests j, not included in the data, in further evaluations. Statistical tests could also have been conducted, while other user-based approaches, such as an evaluation with recruiters, could have contributed to reinforcing the validity of this project. These extra validation steps should be implemented before deploying the fairness mechanisms proposed here.

Furthermore, some findings were unexplainable with the current analysis. For instance, the NDKL scores for CRSs trained on real minor-imbalance datasets are lower than those trained on real balanced datasets, which also applies after re-ranking. Although the scores vary by a small margin, such behaviour is difficult to explain considering the complexity of our pipeline, rendering de-bugging tasks equally complex.

Additional unexplainable results are also visible in the synthetic-to-real comparison, with CRSs trained on some synthetic datasets, such as heavy imbalance, showing more unfairness by a small margin when compared to their real-trained counterparts. These unexplainable findings between real and synthetic subsets are even more puzzling considering that Figure 3 shows more balanced scoring for all synthetically-trained CRSs, which should result in a lower NDKL score before re-ranking.

Finally, the implementation of demographic parity to enforce equal proportions between genders oversimplifies the complexity of the candidate hiring landscape. This oversimplification can be resolved in future research with a lesser degree of generalizability. Future research can be more specific by adjusting fairness rules to the domain of the job request j. For instance, certain jobs, such as security personnel, can show real-world skewness towards a certain gender. A future CRS pipeline needs to adjust its fairness rules at the j level.

Despite these limitations and suggestions for future work, overall, our research successfully showed that the combination of synthetic data and re-ranking contributed to both fairness and utility, even when compared to CRSs trained on real balanced data such as the balanced dataset. Therefore, as expected, a combination of pre-processing and post-processing FGs proved to be useful.
6. Conclusion

The goal of our CRS pipeline was never to produce SOTA synthetic candidates and recommendations, despite our satisfactory results. The goal was to build a recommendation pipeline using both real and synthetic data, to be able to experiment with fair processing techniques and, as a result, mitigate bias in candidate recommendations. From this perspective, the double fair-gated CRS pipeline was successfully built, and the generation of synthetic candidates was successful, valid and accurate throughout the pipeline.

The generated data has shown to be accurate at all (im)balance levels, validating the expectations on mode-specific normalization and conditional sampling in CTGANs, while also demonstrating the benefits of rejection sampling methods in re-balancing imbalanced data and of using the synthetic candidates generated from it to score real (im)balanced test subsets fairly. From a fairness perspective, it was also shown how scorers trained on synthetic candidates outperform scorers trained on balanced real data from a utilitarian perspective.

Although the issues outlined in section 5 concerning the lack of measurement of pre-re-ranked utility raise some minor validity concerns, the evidence shows how synthetically-trained CRSs provide fair, useful candidate recommendations when integrated in such a pipeline.

7. Future Work

In future work, the recommendations shared in the discussion can be considered, more specifically the use of additional evaluation methods with human-in-the-loop evaluation using recruiters, or the use of future requests to test the CRS pipeline.

Additionally, future researchers should also consider the use of less data-greedy rejection sampling techniques, as we lost more than 80% of the holdout information we had at the start of the pipeline. This can be resolved with more elegant rejection sampling constraints, the use of larger datasets, or data-augmentation techniques through synthetic data, for instance. The latter could have been considered in this project if it had been within the scope of our research.

Finally, with the data scarcity problem solved, future researchers can consider the discussed domain-adjustable fairness rules for more specific fairness constraints to overcome real-world skewness.

8. Acknowledgements

We acknowledge the University of Amsterdam - Master programme Information Studies for creating the conditions to perform this research and for financially supporting this publication.

References

[1] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163. URL: https://doi.org/10.1089/big.2016.0047. doi:10.1089/big.2016.0047. PMID: 28632438.
[2] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: S. A. Friedler, C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 77–91. URL: https://proceedings.mlr.press/v81/buolamwini18a.html.
[3] S. Hajian, J. Domingo-Ferrer, A methodology for direct and indirect discrimination prevention in data mining, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 1445–1459. doi:10.1109/TKDE.2012.72.
[4] A. Prince, D. Schwarcz, Proxy discrimination in the age of artificial intelligence and big data, Iowa Law Review 105 (2020) 1257–1318.
[5] A. Rajabi, O. O. Garibay, TabFairGAN: Fair tabular data generation with generative adversarial networks, arXiv preprint arXiv:2109.00666 (2021).
[6] Y. Li, H. Chen, S. Xu, Y. Ge, Y. Zhang, Towards personalized fairness based on causal notion, CoRR abs/2105.09829 (2021). URL: https://arxiv.org/abs/2105.09829. arXiv:2105.09829.
[7] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search, 2019. URL: https://doi.org/10.1145/3292500.3330691. doi:10.1145/3292500.3330691.
[8] D. B. Rubin, Discussion: Statistical disclosure limitation, Journal of Official Statistics 9 (1993) 461–468.
[9] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, 2017. URL: https://arxiv.org/abs/1701.00160. doi:10.48550/ARXIV.1701.00160.
[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 1st ed., MIT Press, Cambridge, Massachusetts, United States, 2016.
[11] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional GAN, 2019. URL: https://arxiv.org/abs/1907.00503. doi:10.48550/ARXIV.1907.00503.
[12] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, 2016. URL: https://arxiv.org/abs/1611.01144. doi:10.48550/ARXIV.1611.01144.
[13] S.-J. van Els, D. Graus, E. Beauxis-Aussalet, Improving fairness assessments with synthetic data: a practical use case with a recommender system for human resources, 2022.
[14] M. J. Kusner, J. R. Loftus, C. Russell, R. Silva, Counterfactual fairness, 2018. arXiv:1703.06856.
[15] J. von Neumann, Various techniques used in connection with random digits, National Bureau of Standards, Applied Math Series 12 (1951) 768–770.
[16] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446.