
Fairness in job recommendations: estimating, explaining, and reducing gender gaps⋆

Guillaume Bied

0 1

Christophe Gaillac

Morgane Hofmann

Philippe Caillou

Bruno Crépon

Solal Nathan

Michèle Sebag

0 Centre de Recherche en Economie et Statistique (CREST), Palaiseau, France
1 Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Orsay, France
2 Nuffield College and Oxford University, Oxford, United Kingdom
3 Pôle emploi, Paris, France

Algorithmic recommendations of job ads have the potential to reduce frictional unemployment, but raise concerns about fairness due to biases in past data. Our research investigates the issue of algorithmic fairness with a specific focus on gender in a hybrid job recommendation system developed in partnership with the French Public Employment Service (PES), which is trained on past hires. First, by viewing job ads as a set of characteristics (such as wage and contract type), we document how the algorithm treats job seekers differently based on gender, both unconditionally and conditionally on their search parameters and qualifications. Second, we discuss the notion(s) of algorithmic fairness applicable in this context and the trade-offs involved. We show that the considered system reflects some existing differences in hiring or applications but does not exacerbate them. Finally, we consider an adversarial de-biasing technique as a practical tool to demonstrate the trade-offs between recall and reduced differentiated treatment.

Keywords: Fairness, Job recommender systems, Adversarial de-biasing, Gender gaps, Human resources

1. Introduction

At the core of e-business, recommender systems leverage past data to help users locate relevant items among large amounts of possible ones that would be costly to explore otherwise. Since an important part of unemployment can be explained by informational frictions [ 1 ], including the costs of acquiring information and cognitive limitations, recommender systems could improve matching on the labor market. As labor market outcomes shape livelihoods, social positions and individual identities, helping job seekers find the right jobs matters.

Yet job recommender systems are also a textbook case of fairness issues in machine learning [ 2 ]. Algorithms trained on real-world data, which involve human biases and discriminatory practices, may reproduce, or even increase, past undesirable behavior such as gender stereotypes, and widen labor market inequalities. Ensuring this does not happen is a major concern for the scientific community, Public Employment Services as well as for all citizens.

This paper investigates the issue of gender fairness within the context of the audit of a recommender system called MUlti-head Sparse E-recruitment (MUSE hereafter) [ 3 ], developed in partnership with the French Public Employment Service (PES). MUSE leverages extensive data about job seekers’ and job ads’ characteristics and learns from past hiring patterns. Our contributions are threefold. Firstly, we discuss the appropriate notion of algorithmic fairness that should be adopted in the PES setting. Gender disparities in hirings, viewed in terms of job characteristics such as occupation, distance, wage, and full-time or part-time status, can arise from differentiated application choices reflecting job seekers’ preferences. The algorithm’s replication of this behavior appears justified in maximizing users’ welfare (see the related individual, envy-freeness, and preference-based notions of fairness respectively in [ 4, 5, 6 ]). However, these gaps can also arise from recruiters valuing job seekers’ inherent characteristics differently based on gender, which can be seen as discriminatory or unfair. Secondly, we propose to disentangle the impact of job search fundamentals (search parameters and qualifications) from that of other job seekers’ characteristics in explaining observed gaps by using double machine learning [ 7 ]. We analyse job ad recommendations and document gender disparities both unconditionally and conditionally on job search fundamentals, showing that these fundamentals alone do not fully account for the observed gender gaps. Nevertheless, the system does not exacerbate existing differences in hiring or applications. This discussion brings forth a tension between a PES’s missions and values: providing optimal person-dependent recommendations regarding access to employment while ensuring fair treatment between women and men. Finally, we illustrate this trade-off by developing an adversarial de-biasing approach [ 8 ] aiming at making recommendations gender-blind. Although this approach reduces differential treatment, it also leads to an overall performance loss and a reduction in access to employment, which is more pronounced for women.

The rest of the paper is structured as follows. Section 2 describes the data and the MUSE algorithm. Section 3 proposes to leverage the Double Machine Learning method (DML hereafter) [ 7, 9, 10 ] to make inference on the effect of gender on the recommendations, while controlling for the channel of the job search fundamentals. Section 4 audits the algorithm in terms of recommendation performance, provides evidence of differentiated treatment, and compares these differences to those found in hiring and application behavior. Section 5 introduces adversarial techniques to reduce recommendation reliance on gender, and documents their impact on performance metrics and differentiated treatment. Section 6 concludes and provides perspectives for further work. Appendix D contains a simple model explaining the different potential sources of differential treatment and relating them to gender inequalities in observed applications and hires.

Related work. Fairness in the context of recommender systems draws an increasing amount of work, surveyed by [ 11, 12, 13 ]. Depending on the application domain, fairness issues may arise w.r.t. items (sharing users’ attention in an equitable way), w.r.t. users (presenting a fair selection of items to the users), or both [ 14, 15, 16 ]. In the present work, we focus on user fairness.

Some approaches to user fairness question whether recommendations are equally relevant for different groups of users in terms of standard metrics such as recall or NDCG. [ 17 ] audits search engines for differential satisfaction between demographics. [ 18 ] extends this investigation to several public recommendation datasets, discussing whether different groups of users (in terms of age or gender) retrieve the same utility from recommendations based on standard metrics. Such differences may be due to class imbalance, which may lead a recommender system to better capture the interaction patterns of a majority group in a collaborative filtering setting [ 19 ]. [ 20 ] measure fairness both in terms of differentiated values of predicted ratings conditionally on characteristics and in terms of prediction errors between genders.

Other works emphasize the trade-off between recommendation performance and other fairness measures. Among them, [ 21 ] approach the problem of collaborative filtering through the lens of a notion of neutrality akin to demographic parity: recommendations should not vary according to a user-specified viewpoint such as gender. However, with labor market applications in mind, [ 20 ] argue that such metrics possibly ignore some legitimate links between gender and preferences. In a labor market context, [ 22 ] is concerned with occupation recommendation while reducing the gender wage gap. [ 23 ] conduct a correspondence study of several Chinese job boards, demonstrating that some profiles are recommended different job ads depending on whether they are labelled as women or men, thus showing a significant causal impact of gender.

Finally, several approaches exist to prevent fairness issues: pre-processing, in-processing and post-processing. Adversarial in-processing methods, initially proposed in the classification setting [ 8, 24, 25 ], attempt to decorrelate neural representations from gender. The approach has been proposed for neural recommenders in a labor market setting [ 22, 20, 26 ] with different motivations and notions of fairness in mind.

2. Experimental setting

Overview of the data. The proprietary dataset provided by the French PES contains characteristics of job ads and registered job seekers, as well as their interactions, from 2019 to mid-2022 in the Auvergne-Rhône-Alpes region.

The $i$-th job seeker’s characteristics, represented as a vector $x_i \in \mathbb{R}^{483}$ after pre-processing, include job search criteria, labor market profile information, and administrative data (see Appendix B for more details). Within $x_i$, job seekers’ search fundamentals (search criteria and qualifications, denoted $z_i$) include desired wage, occupation, geographic location and accepted mobility, search for a full-time or part-time job, qualification level of the desired position, and accepted working hours.

Overall, the labor market profile information in $x_i$ includes experience, hard and soft skills provided in the PES’s ontology, possession of a driver’s license, educational achievements, textual data (CV, description of past work experience), and administrative data (number of past unemployment spells, reasons for registration, and the type of follow-up provided by the PES). Skills and textual descriptions are each reduced by singular value decomposition [ 27 ]. Note that job seekers’ gender is available as a binary variable, although it is not provided to the recommender system.

Similarly, the $j$-th job ad is represented by a vector $y_j \in \mathbb{R}^{469}$ after pre-processing. Available features include lower and upper bounds for the offered wage, workplace postcode, desired skills, requirements in terms of education, contract type, working hours, and textual descriptions of the firm and position. Textual information and skills are also reduced by singular value decomposition. We also observe whether a job seeker $i$ applied to a job ad $j$, and whether he or she was hired for that position. The train and test sets cover 1.2 million job seekers and 2.2 million job ads. The 285,992 observed hires are split between train and test on a weekly basis: 85% of weeks are assigned to the train set (representing 241,715 hires), and the rest to the test set (44,277 hires).
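The week-level split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the tuple layout of a hire, and the assumption that weeks are assigned to the train set at random (rather than, say, chronologically) are all hypothetical.

```python
import random

def split_weeks(weeks, train_share=0.85, seed=0):
    """Assign whole weeks to train or test; returns (train_weeks, test_weeks).

    Assumption: weeks are drawn at random; the paper only states that
    85% of weeks go to the train set.
    """
    unique = sorted(set(weeks))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = round(train_share * len(unique))
    return set(unique[:n_train]), set(unique[n_train:])

def split_hires(hires, train_share=0.85, seed=0):
    """hires: list of (job_seeker_id, job_ad_id, week) tuples (hypothetical layout)."""
    train_weeks, test_weeks = split_weeks([h[2] for h in hires], train_share, seed)
    train = [h for h in hires if h[2] in train_weeks]
    test = [h for h in hires if h[2] in test_weeks]
    return train, test
```

Splitting by week rather than by individual hire keeps all hires of a given week on the same side, so the test set mimics recommending on weeks never seen during training.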

Datasets used for the analysis. The algorithm’s recommendations will be studied using several distinct datasets.

To study gender gaps conditional on job seeker search fundamentals (more in section 3), we restrict the analysis to men and women that cannot be perfectly distinguished on the basis of their characteristics, following the overlap / common support assumption [ 28 ]. More precisely, if individuals’ gender could be accurately predicted on the basis of characteristics, one could hardly disentangle the impact of such characteristics and that of gender on the recommendations.

The population with common support is selected as follows. Gender is predicted with a random forest [ 29 ] using selected features including education, desired wage, experience, geographic location, desired contract type, occupation, level of qualification, search for a part-time job, and accepted mobility. The learned classifier, whose output is referred to as the propensity score, reaches an accuracy of circa 88%; it is used to select the job seekers in the common support, retaining individuals with a propensity score in [0.05, 0.95].
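The trimming step can be sketched as below, assuming the propensity scores have already been produced by some gender classifier (the random forest itself is not reproduced here). The function name and the list-based input are hypothetical.

```python
def common_support(scores, low=0.05, high=0.95):
    """Return the indices of individuals whose propensity score lies in [low, high].

    scores: predicted probabilities P(G = 1 | features), one per job seeker
    (assumed to come from a previously fitted classifier).
    """
    return [i for i, p in enumerate(scores) if low <= p <= high]
```

Job seekers whose gender is almost perfectly predictable from their characteristics (scores outside the interval) are dropped, since no comparable individual of the other gender exists for them.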

To study recommendations issued to all job seekers at a given point in time, we consider all job seekers registered during a randomly chosen week of the test set (the fourteenth ISO week of 2022). In order to measure recommendation performance, and to contrast differentiated treatment by the algorithm with differences observed in hiring behavior, we also consider recommendations to all job seekers who are hired during the test weeks. To study application behavior, we consider the average characteristics (all weeks pooled together) of the applications of job seekers for whom hires are observed in the test set (we observe 169,325 such applications after restriction to the common support).

The sizes, compositions in terms of gender, and size after restriction to job seekers in the common support, of the datasets of interest are reported in Table 4 in Appendix.

2.1. Algorithm

The algorithm MUSE is briefly described for the sake of self-containedness, referring the reader to [ 3 ] for a more comprehensive presentation.

Architecture. MUSE is a two-tier hybrid job recommender system, designed to address the sparsity and cold start issues inherent to the job recommendation setting, and to meet computational requirements. It is trained on hiring data. Hires, rather than other types of interactions, are chosen as training labels since they indicate strong mutual interest of the job seeker and recruiter.

The first tier of the algorithm aims at efficiently retrieving a subset of 1,000 job ads (to be re-ranked by the second tier). It is a two-tower model, trained with a triplet margin loss, which constructs embeddings for job seekers and job ads based on their contextual information $x_i$ and $y_j$. It correctly keeps 82.25% of matches in the test set among its top-1,000 selection. In the following, we take this first stage operation as given, discarding all job ads but those ranked among the top 1,000 for each job seeker.
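For readers unfamiliar with the triplet margin loss, a minimal sketch follows. The Euclidean distance and the margin value of 1.0 are assumptions chosen for illustration; the text does not state which distance or margin the first tier uses.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    anchor: job seeker embedding; positive: embedding of a job ad on which
    the job seeker was hired; negative: embedding of a non-matching job ad.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the matching job ad is closer to the job seeker than the non-matching one by at least the margin, which is what makes nearest-neighbor retrieval of the top 1,000 ads meaningful.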

The second tier of the algorithm takes as input: the job seeker’s description $x_i$; the job ad’s description w.r.t. the $i$-th job seeker, noted $y_{i,j}$, formed of the job ad description $y_j$ concatenated with the score and rank associated to job ad $j$ for $i$ by the first tier of the algorithm, and the distance in kilometers between $i$ and $j$. Two embeddings, respectively denoted $\phi$ and $\psi$, are learned on top of $x_i$ and $y_{i,j}$, with $h_{i,j}$ formed as the concatenation of these embeddings and their element-wise product:

$$h_{i,j} = [\phi(x_i),\ \psi(y_{i,j}),\ \phi(x_i) \odot \psi(y_{i,j})].$$

The recommendation score $\hat{s}_{i,j}$ is learned as a standard neural net on top of $h_{i,j}$:

$$\hat{s}_{i,j} = f_\theta(h_{i,j}),$$

where $f_\theta$ is a one-hidden-layer feedforward neural network parameterized by $\theta$. Model parameters are learned end-to-end with a cross-entropy loss:

$$\min_{\theta, \phi, \psi} \mathcal{L}_{rec} := -\sum_{i,j} \left[ m_{i,j} \log(\hat{s}_{i,j}) + (1 - m_{i,j}) \log(1 - \hat{s}_{i,j}) \right],$$

where $m_{i,j}$ is 1 if hired and 0 otherwise. In practice, negative examples (pairs which are not matches) are sampled uniformly at random within the first tier’s top-1,000 selection. To issue recommendations, job ads are ranked by decreasing $\hat{s}_{i,j}$.
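The second-tier computation can be sketched in a few lines of plain Python. This is an illustrative forward pass only: the function names, the ReLU activation, the sigmoid output, and the list-of-lists weight layout are assumptions, and the embeddings are taken as given rather than learned.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def interaction_features(phi_x, psi_y):
    """Concatenate both embeddings and their element-wise product."""
    return phi_x + psi_y + [a * b for a, b in zip(phi_x, psi_y)]

def score(h, W1, b1, w2, b2):
    """One-hidden-layer feedforward net (ReLU assumed) with a sigmoid output."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, h)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * v for w, v in zip(w2, hidden)) + b2)

def cross_entropy(labels, scores):
    """Negative log-likelihood summed over (job seeker, job ad) pairs.

    labels: 1 for a hire, 0 for a sampled negative pair.
    """
    return -sum(m * math.log(s) + (1 - m) * math.log(1 - s)
                for m, s in zip(labels, scores))
```

The element-wise product gives the network direct access to pairwise interactions between the job seeker and job ad embeddings, which a plain concatenation would only capture through the hidden layer.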

3. Measuring the effect of gender on recommendations controlling for the preferences

3.1. Measures of interest

We seek to measure how the algorithm’s recommendation performance varies between men and women, but also how the characteristics of the job ads depend on gender, unconditionally and conditionally on job search fundamentals.

Recommendation performance will be measured by the recall@$k$, defined as the share of hires correctly ranked among the algorithm’s top $k$ recommendations in the test set.

Characteristics of recommended jobs. We study gendered differences in terms of the following characteristics of the top recommended job ad: 1) The logarithm of the ad’s wage; 2) The distance in kilometers of the job’s workplace to the job seeker’s zip code; 3) Whether the job ad corresponds to an executive position in the company; 4) Whether the contract is defined for an indefinite duration or not; 5) The number of hours worked per week; 6) Whether the share of women among job seekers searching for a job in the occupation is less than 20%.
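The recall@$k$ defined above can be computed as in the following sketch. The data structures (a dictionary of ranked lists and a list of observed hires) and the function name are hypothetical choices made for illustration.

```python
def recall_at_k(rankings, hires, k):
    """Share of observed hires found among the top-k recommendations.

    rankings: dict mapping a job seeker to a list of job ads, best first.
    hires: list of (job_seeker, job_ad) pairs observed in the test set.
    """
    hits = sum(1 for seeker, ad in hires if ad in rankings.get(seeker, [])[:k])
    return hits / len(hires)
```

Computing the same quantity separately on the hires of women and of men gives the gender-specific recalls compared in Section 4.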

We also consider an aggregate indicator of the fit between the job seeker’s search criteria and the recommended job,¹ defined as an average of five binary indicators describing the fit w.r.t. the job seeker’s i) accepted geographic mobility; ii) desired type of occupation; iii) desired wage; iv) desired type of contract; v) desired working hours.
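As a small worked example, the aggregate fit indicator is simply the mean of the five binary indicators. How each individual indicator is derived from the job seeker's criteria is not specified here, so the sketch below takes them as given; the function and argument names are hypothetical.

```python
def fit_index(mobility_ok, occupation_ok, wage_ok, contract_ok, hours_ok):
    """Average of five binary fit indicators; the result lies between 0 and 1."""
    indicators = [mobility_ok, occupation_ok, wage_ok, contract_ok, hours_ok]
    return sum(1 for flag in indicators if flag) / len(indicators)
```

For instance, a recommended job that matches the accepted mobility, the desired occupation and the desired contract type, but neither the desired wage nor the desired working hours, has a fit of 3/5.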

3.2. Methodology

Parameters of interest. We seek to document whether different jobs are recommended to women and men on average and conditionally on their job search fundamentals (search parameters and qualifications). Previous studies in economics [ 30 ] have documented gendered preferences for commuting time, contract type and wage. These preferences are reflected in jobseekers’ job search parameters and may partially explain observed gender dissimilarities in recommendations. However, the different job search parameters might not be the only ones contributing to the differences in recommendations: the study will try to identify whether other gendered features have an impact on recommendations. Our method disentangles disparities due to different job search fundamentals from those due to other characteristics and their valuation by the algorithm.

If disparities due to preferences are potentially justifiable from the users’ perspective, the other ones could be considered as a sign of unfair algorithmic treatment.

In the following, the covariate $X$ stands for the whole set of variables describing the job seekers and used by the recommendation system; it includes information on past employment history, demographics (e.g., number of children), and self-description in the text of the resume. The control $Z$ stands for the variables describing the job search fundamentals (job search parameters and qualifications, detailed in Appendix B; $Z \subset X$). The outcome $Y$ of the recommendation system includes the set of variables describing the recommended job ad (job type, wage, whether the job is part-time or full-time) and cross-features (distance between the locations of the job seeker and the job ad, fit w.r.t. the job seeker’s aggregated search criteria).

The question of gender-related bias arises when men and women with the same search fundamentals $Z$ are recommended substantially different job ads (different outcomes $Y$): even though the system has no direct access to the gender $G$, it might value the characteristics in $X - Z$ in a gender-biased way.

To assess this potential differential treatment, we focus on two quantities separately. First, we consider the naive difference in the average characteristics of the recommended offers:

$$\delta = E[Y \mid G = 1] - E[Y \mid G = 0],$$

$G = 1$ and $G = 0$ denoting respectively women and men hereafter. This parameter can simply be estimated by taking a difference in means. Following our discussion, it is questionable whether it is the role of a (fair) recommender system to directly disregard job search fundamentals $Z$. Our parameter of interest is thus the gender-related gap $\gamma$ in a recommended job ad’s characteristic $Y$, while controlling for the effects of $Z$. Taking inspiration from [ 31, 9 ] regarding the estimation of the gender wage gap, we thus consider the following model:²

$$Y = \mu_0(Z) + \gamma G + \varepsilon, \qquad E(\varepsilon \mid Z, G) = 0, \tag{1}$$

where $\mu_0(z) := E(Y \mid G = 0, Z = z)$ are the expected characteristics of jobs for men with preferences $z$, and $\varepsilon$ is a noise variable. To be able to identify $\gamma$, we also make the standard assumption of common support, stating that there exist both men and women sharing all types of search parameters $Z$, i.e., for all $z$, there exists $\eta > 0$ such that $p(z) := P(G = 1 \mid Z = z) \in [\eta, 1 - \eta]$.

Let us give more intuition about the interpretation of the effect $\gamma$ in our context of the impact of gender on recommendations. Consider a linear specification of the effect of the different job search parameters on the recommendations in (1), i.e., $\mu_0(Z) = Z'\beta_0$. Here, denoting by $\beta_1$ and $\beta_0$ the coefficients of the regression of $Y$ on $Z$ for women and men respectively, we obtain the Oaxaca decomposition, used in the literature on the gender wage gap [ 31, 9 ], of the average effect:

$$\delta = \underbrace{\beta_0'\,\big(E(Z \mid G = 1) - E(Z \mid G = 0)\big)}_{\text{explained effect by } Z} + \underbrace{(\beta_1 - \beta_0)'\,E(Z \mid G = 1)}_{=\,\gamma,\ \text{unexplained effect}},$$

where $\gamma$ is the residual of the average gender difference that cannot be explained by $Z$.

¹ This indicator is inspired by the proprietary PES indicator used for querying job ads.
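The Oaxaca decomposition can be illustrated with a toy sketch, assuming a single scalar control plus an intercept and ordinary least squares fitted separately for each gender. The function names and the restriction to one covariate are simplifications made here for illustration; they are not the estimation procedure used in the paper.

```python
def ols(z, y):
    """Least-squares fit of y = a + b * z; returns (a, b)."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    b = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
         / sum((zi - mz) ** 2 for zi in z))
    return my - b * mz, b

def oaxaca(z_women, y_women, z_men, y_men):
    """Return (total gap, part explained by z, unexplained part)."""
    a1, b1 = ols(z_women, y_women)  # coefficients for women (G = 1)
    a0, b0 = ols(z_men, y_men)      # coefficients for men (G = 0)
    mz1 = sum(z_women) / len(z_women)
    mz0 = sum(z_men) / len(z_men)
    # Men's coefficients applied to the difference in average controls.
    explained = b0 * (mz1 - mz0)
    # Difference in coefficients (intercept included) at women's average controls.
    unexplained = (a1 - a0) + (b1 - b0) * mz1
    total = sum(y_women) / len(y_women) - sum(y_men) / len(y_men)
    return total, explained, unexplained
```

Because each least-squares line passes through its group means, the explained and unexplained parts sum exactly to the raw difference in means.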

Estimation of the gender gap $\gamma$ is performed using the double machine learning method (DML) [see, e.g., 7, 10]. This method provides an estimator of $\gamma$ which is asymptotically normal and robust to the preliminary estimation of the nuisance parameters $\ell(Z) := E(Y \mid Z)$ and the propensity score $p(Z) := P(G = 1 \mid Z)$ using different machine learning estimators. Details are given in Appendix C.
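To make the idea concrete, here is a minimal sketch of a cross-fitted partialling-out estimator of the conditional gap. It deliberately swaps the machine learning estimators for within-cell averages, which only works when the control is discrete, and it omits standard errors; the function names and the interleaved fold assignment are assumptions made for illustration.

```python
from collections import defaultdict

def cell_means(z, v):
    """Mean of v within each value of a discrete control z."""
    total, count = defaultdict(float), defaultdict(int)
    for zi, vi in zip(z, v):
        total[zi] += vi
        count[zi] += 1
    return {key: total[key] / count[key] for key in total}

def dml_gap(y, g, z, n_folds=2):
    """Cross-fitted partialling-out estimate of the gap conditional on z.

    Nuisance functions E(Y | Z) and P(G = 1 | Z) are fitted on the other
    folds, then used to residualize outcome and gender on the held-out fold.
    Raises KeyError if a held-out value of z is absent from the training folds.
    """
    n = len(y)
    num = den = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        ell = cell_means([z[i] for i in train], [y[i] for i in train])
        prop = cell_means([z[i] for i in train], [g[i] for i in train])
        for i in held_out:
            res_y = y[i] - ell[z[i]]   # outcome residual
            res_g = g[i] - prop[z[i]]  # gender residual
            num += res_y * res_g
            den += res_g ** 2
    return num / den
```

The estimate is the slope of the regression of outcome residuals on gender residuals; fitting the nuisance functions on separate folds is what limits the impact of their estimation error on the final estimate.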

4. Results

In all the tables presented hereafter in this section, the column “p-value” presents the p-value indicating the significance of the measure reported in the adjacent left column. Results presented in this section use random forest estimators for functions $\ell$ and $p$. However, our results are not sensitive to the choice of the estimator, as shown in Appendix E.

4.1. Recommendation performance is higher for women

We first report the recall@$k$ for all hires in the test set, as well as for male and female job seekers separately. For instance, the algorithm correctly ranks within its top 20 (resp. top 50) recommendations the job ad on which a job seeker was hired in 35% (resp. 49%) of cases. This success rate is 33.3% for men (resp. 47.5%), and 36.6% for women (resp. 50%), with a statistically significant difference (more in Table 5, Appendix). More generally, we find the recall@$k$ to be higher for women than for men at all values of $k$ considered. While the magnitude of the difference is limited, it is statistically significant. The observed higher performance of the algorithm for women could be explained by the importance given by the model to the distance criterion. Women assign greater value to proximity when searching for a job (see Table 2 on hires and applications), which could make their job choices easier to predict.

² The presented methodology follows the CATE identification procedure [see, 28], being granted that the gender cannot be considered a treatment.

4.2. Characteristics of job ads recommended to men and women are different

Table 1 provides conditional and unconditional estimates for gender differences in recommended offer characteristics for all registered job seekers and the selected sub-population (section 2).

The first and third columns show that, whatever the restrictions on the population, women are on average recommended different jobs than men. On the full population, their recommended job ads are paid 2.3% less than men’s; are half a kilometer closer to home; are shorter in terms of weekly working hours (by 2.9 hours); and are less often of indefinite duration (4 percentage points less often) or of executive status (0.4 percentage points less often). Recommended jobs are also less often in male-dominated occupations (41 percentage points less often). Women’s recommended jobs also have a lesser degree of fit with their own search criteria (a loss of 0.028 points in the aggregate fit measure between 0 and 1). All of these differences are statistically significant.

However, the results using the DML estimation (Table 1, column Cond. $\gamma$) show that restricting the analysis to the population of job seekers with common support and conditioning on job seekers’ search fundamentals leads to a reduced gender gap in all discussed job ad characteristics. Nevertheless, after conditioning on $Z$, women’s recommended jobs still fit less with their search parameters (by 0.011 points), and remain significantly different in all discussed dimensions. For instance, 17% of the wage gender gap is left unexplained by job search characteristics and qualifications of job seekers.

Table 1.

| Characteristic | Uncond. (full pop.) | p-value | Uncond. (common support) | p-value | Cond. $\gamma$ (DML) | p-value |
|---|---|---|---|---|---|---|
| Wage (log) | -0.023 | 0.0 | -0.016 | 0.0 | -0.004 | 0.000 |
| Distance (km) | -0.474 | 0.0 | -0.231 | 0.0 | 0.400 | 0.000 |
| Executive | -0.004 | 0.0 | -0.009 | 0.0 | -0.002 | 0.032 |
| Long term contract | -0.040 | 0.0 | -0.034 | 0.0 | -0.014 | 0.000 |
| %Women < 20 | -0.411 | 0.0 | -0.219 | 0.0 | -0.033 | 0.000 |
| Hours worked per week | -2.934 | 0.0 | -1.957 | 0.0 | -0.381 | 0.000 |
| Fit to job search parameters | -0.028 | 0.0 | -0.019 | 0.0 | -0.011 | 0.000 |

Notes: The first column reports the gender gap in terms of job characteristics on average. The third column reports the gender gap on the population of job seekers with a propensity score between 0.05 and 0.95. The fifth column reports, on the population of job seekers with sufficiently comparable characteristics, the estimates for the gender gap controlling for search parameters using DML. Results are given using random forests as estimators for the functions $\ell$ and $p$ and are robust to this choice, as shown in Appendix E.

4.3. Inequalities in recommendations against the ones observed in hiring and applications

Inequalities in recommendations are comparable to or smaller than the observed ones in hirings. We turn to the comparison of the characteristics of the recommended job ads to those observed in real-world hires ($\gamma_{\text{Hire}}$). We focus on the job seekers in the test set for whom we observe hires.

The first column of the upper section of Table 2 shows that, for the population with common support and conditionally on job seekers’ search criteria, there exist differences in hiring behavior $\gamma_{\text{Hire}}$ between women and men. Women are hired on job ads that have a lower aggregate fit (by 0.019) with their search criteria than men. They are hired less often in male-dominated occupations (14.1pp), are less often hired on indefinite duration contracts (3.4pp), and work fewer hours (1.11 hours). All of these differences are statistically significant. Moreover, they are hired on jobs that are paid less,³ and are less often hired in executive positions.

On the other hand, the third column of Table 2, which reports estimates for $\gamma$ in recommendations for the subsample of hired job seekers, illustrates that the patterns are similar to those established on the whole population in the fifth column of Table 1. However, the gap between the characteristics of hires and the characteristics of recommended job ads after conditioning on $Z$ ($\gamma_{\text{DifH}}$), presented in the fifth column of Table 2, shows that they are broadly comparable. Indeed, the algorithm has little impact on the fit between job seekers’ search criteria and the job ads, and does not increase the gap in wages, executive status or long term contracts. Surprisingly, the algorithm seems to recommend job ads in occupations where men are over-represented less often, and recommends positions with more working hours, thus slightly reducing gender gaps.

Ultimately, although the algorithm recommends different types of offers to men and women, there is no evidence that it increases the inequalities already observed on the labor market once we condition on job seekers’ job search fundamentals.

Observed differences largely replicate those observed in application behavior. Differential treatment in hires may originate from job seekers’ application behavior and from recruiters’ discriminatory behavior (see, e.g., the formal model in appendix D). In the present section, we wish to compare the magnitude of the gender gap in the algorithm’s recommendations to the magnitude of the gender gap found in job seekers’ applications $\gamma_{\text{App}}$. As applications can also be seen as a noisy proxy for job seekers’ utility, especially if application costs are low (see appendix D), if the differences $\gamma_{\text{DifA}}$ were large, this would indicate that the algorithm’s learned recommendations reflect job seekers’ preferences but also recruiter biases.

Due to different data sources, we study the sub-population of job seekers with hires in the test weeks for whom we observe applications (all weeks pooled together).

The first column of the second panel of Table 2 reports estimates for gender gaps $\gamma_{\text{App}}$ in applications conditionally on $Z$. Indeed, the conditional estimates for the gender gaps are significant in application behavior, in terms of fit to search criteria (a significant difference of 0.029 points in the aggregate index), wages, long term contracts, full time jobs, weekly working hours, and occupations where men are over-represented.

Crucially, based on the results in the fifth column of the second panel of Table 2, the conditional estimates for the difference between applications’ characteristics and the algorithm’s recommendations $\gamma_{\text{DifA}}$ are not statistically significantly different from zero with respect to fit to search criteria and to all objective job characteristics aside from occupations where men are over-represented and number of hours worked. In the two latter cases, the difference in conditional gender gaps is reduced in the algorithm’s recommendations.

³ An estimate of 1% for the gender wage gap on the job offers, conditional on search criteria, might be surprising considering the larger magnitudes generally discussed in the economics literature. It should be noted that we have a large set of stated preferences and that the analysis focuses on registered job seekers (rather than on the working population as a whole), with jobs closer to the national minimum wage than those in the national population.

Table 2. Column groups: “Differences between women and men” and “Difference of differences”. Upper panel columns: $\gamma_{\text{Hire}}$, p-value, $\gamma$ (MUSE), p-value, $\gamma_{\text{DifH}}$ (MUSE), p-value. Lower panel columns: $\gamma_{\text{App}}$ (Observed), p-value, $\gamma$ (MUSE), p-value, $\gamma_{\text{DifA}}$ (MUSE).

Altogether, gender gaps exist in the algorithm’s recommendations even after conditioning on job seekers’ search fundamentals, but those gaps are not larger than those found in hires or in job seekers’ application behavior. These results suggest that the recall, the relevance w.r.t. job seekers’ search fundamentals, and the reduction of the gender-related gaps in recommendation might be antagonistic.

This conjecture will be investigated empirically using adversarial techniques in the next section.

5. Limiting differential treatment with adversarial methods

The goal of this section is to investigate the consequences of de-correlating the latent representations from the gender $G$, using an adversarial method [ 24, 8, 22, 20 ], in terms of gender gaps and recall.

5.1. Methodology: gender-blind recommendation through adversarial learning

In the following, we take the pre-selection of 1,000 job ads by the first tier of the algorithm as given (considering job ads ranked beyond 1,000 to be irrelevant), and incorporate the adversarial setup into the second tier of the recommender system. Recall that in the usual setting, the algorithm minimizes:

$$\min_{\theta, W} \mathcal{L}_{rec} := -\sum_{i,j} \left[ m_{i,j} \log(\hat{s}_{i,j}) + (1 - m_{i,j}) \log(1 - \hat{s}_{i,j}) \right],$$

where $W$ corresponds to the weights parameterizing the latent representation $h_{i,j}$ of job seeker $i$ and job ad $j$ (viewed with respect to its relation to $i$). The adversary is instantiated as a three-hidden-layer feedforward neural network $a_\omega$ predicting gender from the latent. Denoting its prediction for gender by $\hat{g}_{i,j} = a_\omega(h_{i,j})$, the adversary then tries to solve:

$$\min_{\omega} \mathcal{L}_{adv} := -\sum_{i,j} \left[ g_i \log(\hat{g}_{i,j}) + (1 - g_i) \log(1 - \hat{g}_{i,j}) \right],$$

whereas the recommender system incurs a penalty if the adversary’s predictions perform well, leading to the program:

$$\min_{\theta, W} \mathcal{L}_{rec} - \lambda \mathcal{L}_{adv},$$

where $\lambda > 0$ is a hyperparameter prioritizing amongst the two objectives. In practice, we alternate between stochastic gradient updates of the two sets of parameters $\{\omega\}$ and $\{\theta, W\}$.
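The penalized objective can be illustrated numerically with the sketch below, which evaluates both cross-entropy terms from given predictions. It does not perform the alternating gradient updates; the function names and argument layout are hypothetical, and both losses are written as negative log-likelihoods, which is an assumption about the sign convention.

```python
import math

def bce(labels, probs):
    """Binary cross-entropy (negative log-likelihood) summed over examples."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(labels, probs))

def recommender_objective(matches, scores, genders, gender_probs, lam):
    """Recommendation loss minus lam times the adversary's loss.

    The value is low when hires are well predicted while the adversary
    predicts gender poorly from the latent representation.
    """
    return bce(matches, scores) - lam * bce(genders, gender_probs)
```

With a penalty weight of zero the objective reduces to the usual recommendation loss; larger weights increasingly reward representations from which gender cannot be predicted, at the expense of fitting hires.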

5.2. Results

[…] recommendations obtained using the adversarial strategy, letting $\lambda$ range over $\{0.001, 0.01, 0.1, 1\}$.

Adopting the adversarial penalization strategy leads to a slight loss in recall@20: a difference of 0.016 points between $\lambda = 0$ and $\lambda = 1$. While recall remains higher for women than for men, women bear most of the loss (0.018 points, against 0.013 for men) due to adversarial de-biasing (see a theory about this risk in [32]). As $\lambda$ increases, the gender predictions made by the adversary become less accurate (the accuracy drops from 85% when $\lambda = 0.001$ to a near-random accuracy of 53% when $\lambda = 1$).

In terms of unconditional gaps, adopting the adversarial strategy, at least for these levels of penalization, does not reduce the gender gaps to zero for all characteristics, as would perhaps have been expected. Indeed, statistically significant differences in terms of contract type, occupations, hours worked and fit to search criteria remain. Yet, for all values of $\lambda$, all unconditional gender gaps are considerably reduced. For instance, the log wage gap is divided by 12 (when comparing $\lambda = 0$ to $\lambda = 1$). All conditional gender gaps are also decreased.

Altogether, the use of adversarial de-biasing techniques, aiming at making recommendation gender-blind, entails a slight loss in recommendation performance. Moreover, it reduces unconditional and conditional gender gaps, without suppressing them.

Note that the presented adversarial strategy decorrelates the latent from gender, regardless of whether it represents features from job search fundamentals $Z$ or from $X \setminus Z$. An adaptation of the strategy aiming to only target gaps conditional on $Z$ is left for further work.

Notes: Results are presented on the subsample of hired job seekers, for different weights $\lambda$ given to the adversarial term in the loss function. Column $\lambda = 0$ restates the standard algorithm’s performances for convenience in comparisons. Recall and adversary accuracy are computed on the test set (all hired job seekers). Unconditional and conditional gaps are computed on the population of hired job seekers with common support. Unconditional gaps correspond to a difference in means between men and women. Conditional gaps are obtained by DML, using random forests to estimate $\ell$ and $p$.

6. Conclusion / perspectives

Our main contribution is an audit of the gender fairness of the MUSE recommender system, trained on real-world hiring data. First, we find recall to be slightly higher for women than for men. Second, we provide evidence of differentiated treatment of men and women by the algorithm in terms of recommended job characteristics, even conditionally on job seekers' search criteria. In the latter case, we find female job seekers to be recommended jobs that fit their own search criteria less often. Third, we find that the recommendations do not increase the gendered gaps observed in hirings or applications, and even decrease them in the cases of occupation type and working hours. A comparison of recommended job ads to application behavior leads to similar conclusions. Finally, we investigate the trade-offs between recommendation performance and gender gaps entailed by the use of adversarial de-biasing techniques. The use of such techniques entails a slight loss in terms of recall, but narrows some of the conditional and unconditional gender gaps without eliminating them.

Ultimately, the merits of de-biased algorithms attempting to reduce gender gaps in recommendations hinge on the acceptability of the proposed job ads in terms of job seekers' (possibly gendered) preferences. An algorithm straying too far from job seekers' search behavior might lead to a deadweight loss: a loss in recommendation quality without any effect on labor market inequalities if recommendations are simply discarded as irrelevant. Answering whether a suitable equilibrium can be found requires interacting with job seekers.

Acknowledgments

We warmly thank C. Vessereau, S. Robidou and P. Beurnier from Pôle emploi for making this research possible and granting access to the proprietary data. The first author was funded by a grant from the DataIA Institute, Saclay.

[31] N. Fortin, T. Lemieux, S. Firpo, Decomposition methods in economics, in: Handbook of Labor Economics, volume 4, Elsevier, 2011, pp. 1–102.

[32] M. P. Kim, A. Korolova, G. N. Rothblum, G. Yona, Preference-informed fairness, arXiv preprint arXiv:1904.01793 (2019).

[33] P. M. Robinson, Root-n-consistent semiparametric regression, Econometrica: Journal of the Econometric Society (1988) 931–954.

A. Additional tables

Table (sample sizes). Columns: Sample size, Number men, Number women, % men. Rows: Full week; Full week (overlap); Hires; Hires (overlap); Hires & Applications (overlap).

Notes: The first column presents the total sample size for the different datasets used in the analysis. "Full week" and "Full week (overlap)" present the sample size for a week in the test set before and after restriction to job seekers satisfying the overlap condition required by the Double Machine Learning method of Section 3. "Hires", "Hires (overlap)" and "Hires & Applications (overlap)" present, respectively, the sample sizes for the subsamples of job seekers in the test set who have been hired, who have been hired and for whom the overlap condition holds, and the subset of the latter for whom we also observe applications.

Notes: Recall@k is the recall over the whole population for the top k recommendations. Columns "Men" and "Women" present the same recall@k separately for men and women. The last column performs a test of equality between columns 2 and 3.
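The recall measure described in these notes can be sketched in a few lines of Python. The sketch assumes that each hired job seeker has exactly one job ad on which they were hired, and that recall counts the share of job seekers whose actual hire appears in their top recommendations; the function names and the toy data are hypothetical.

```python
def recall_at_k(ranked_ads, hired_ad, k):
    """Share of job seekers whose actual hire is among their top-k recommendations."""
    hits = sum(1 for seeker, ads in ranked_ads.items() if hired_ad[seeker] in ads[:k])
    return hits / len(ranked_ads)


def recall_at_k_by_group(ranked_ads, hired_ad, group, k):
    """Recall@k computed separately for each group label (e.g. 'men', 'women')."""
    result = {}
    for label in set(group.values()):
        members = {s: ads for s, ads in ranked_ads.items() if group[s] == label}
        result[label] = recall_at_k(members, hired_ad, k)
    return result


# Hypothetical toy data: ranked recommendations and the ad each job seeker was hired on.
ranked = {
    "a": ["j1", "j2", "j3"],
    "b": ["j4", "j5", "j6"],
    "c": ["j2", "j1", "j7"],
    "d": ["j8", "j9", "j3"],
}
hired = {"a": "j2", "b": "j9", "c": "j7", "d": "j8"}
group = {"a": "women", "b": "women", "c": "men", "d": "men"}

print(recall_at_k(ranked, hired, k=3))
print(recall_at_k_by_group(ranked, hired, group, k=3))
```

Comparing the per-group values is what the equality test in the last column of the table formalizes.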

B. Details on variables used

Footnote 4: QPV refers to poor urban areas in need of public intervention, particularly in terms of urban renewal.

C. Details on the estimation of the heterogeneous effect of gender on the recommendations using the double machine learning method (DML).

To perform the estimation of the gender gap $\beta$, we use the double machine learning method (DML) [see, e.g., 7, 10]. This method is based on a rewriting of (1), following the intuition of [33], as

$$Y - m(Z) = \beta\,(G - e(Z)) + \varepsilon, \qquad (2)$$

where $m(Z) := E(Y \mid Z)$ is a regression function and $e(Z) := P(G = 1 \mid Z)$ is the propensity score, i.e. the probability of being a woman ($G = 1$) given the observed preferences and qualifications $Z$. The latter are nuisance parameters, which have to be estimated in a first step, but the reformulation (2) allows the estimation of $\beta$ to be doubly robust to this first-stage estimation error. This means that we can obtain an estimator of $\beta$ which is asymptotically normal under theoretical conditions that are satisfied by many machine learning methods. Estimation thus consists of 1) estimating $m$ and $e$ using machine learning estimators $\hat m$ and $\hat e$; 2) estimating the gender gap $\beta$ via minimization of the mean squared error associated with (2), using plug-in leave-one-out versions of $\hat m$ and $\hat e$, i.e. predicting without using the $i$-th example [see 10].
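The two estimation steps can be illustrated with a minimal Python sketch. It is not the estimator used in the paper: leave-one-out cell means over a discrete covariate stand in for the machine learning estimators of the regression function and the propensity score (random forests in the main specification), and the names `loo_cell_means` and `dml_gender_gap`, as well as the toy data, are hypothetical.

```python
from collections import defaultdict


def loo_cell_means(values, cells):
    """Leave-one-out mean of `values` within each cell of a discrete covariate."""
    total, count = defaultdict(float), defaultdict(int)
    for v, z in zip(values, cells):
        total[z] += v
        count[z] += 1
    # Observation i is excluded from its own prediction; cells need >= 2 observations.
    return [(total[z] - v) / (count[z] - 1) for v, z in zip(values, cells)]


def dml_gender_gap(y, g, z):
    """Residual-on-residual least-squares estimate of the gender gap."""
    m_hat = loo_cell_means(y, z)  # stand-in for the regression function E(Y | Z)
    e_hat = loo_cell_means(g, z)  # stand-in for the propensity score P(G = 1 | Z)
    res_y = [yi - mi for yi, mi in zip(y, m_hat)]
    res_g = [gi - ei for gi, ei in zip(g, e_hat)]
    # Minimizing sum((res_y - beta * res_g) ** 2) over beta gives this ratio.
    num = sum(ry * rg for ry, rg in zip(res_y, res_g))
    den = sum(rg * rg for rg in res_g)
    return num / den


# Hypothetical noise-free toy data: the outcome shifts by -0.5 for women (g = 1).
z = [0, 0, 0, 1, 1, 1, 1]
g = [1, 0, 1, 0, 1, 0, 0]
y = [2.0 * zi - 0.5 * gi for zi, gi in zip(z, g)]
print(dml_gender_gap(y, g, z))
```

With flexible learners on continuous covariates, cross-fitting over folds is a common substitute for exact leave-one-out predictions.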

D. Gendered recommendations, applications, and hires: a simple formal model

To discuss the different sources of bias which can appear in the recommendations, and how they compare to those appearing in realized job applications and hires, we consider the following simple model of the decision to apply for a job and of hiring.

For job seekers and job ads having respective types $x$ and $y$, we denote the chances that the interview yields a hiring by $p(x, y)$. The job seekers may not be rational and may have expectations $\tilde p(x, y)$ about their opportunities which differ from the objective ones, $p(x, y) \neq \tilde p(x, y)$. We assume that job seekers expect to have a utility $U(x, y) + \varepsilon - c$ if hired, where $\varepsilon$ is an unobserved random part and $c$ is the cost of application. Otherwise, they expect to have their baseline utility $U_0(x)$ minus the cost $c$. In this model, job seekers decide to apply for a job of type $y$ if their expected utility when applying is greater than their utility when not applying, namely

(Decision to apply)
$$\underbrace{\tilde p(x, y)\,\bigl(U(x, y) + \varepsilon\bigr) + \bigl(1 - \tilde p(x, y)\bigr)\,U_0(x) - c}_{\text{Expected utility when applying}} \;\geq\; \underbrace{U_0(x)}_{\text{Utility without applying}}.$$

In this model, the probability of observing an application of $x$ on a job ad of type $y$ is

(Probability of observing an application)
$$P_A(x, y) = F_{-\varepsilon}\!\left(U(x, y) - U_0(x) - \frac{c}{\tilde p(x, y)}\right),$$

where $F_{-\varepsilon}$ denotes the cdf of $-\varepsilon$. We note that, when the cost of application is zero, $c = 0$, we find the intuitive idea that only the utility matters in the job seekers' decisions. Otherwise, their expected chances $\tilde p(x, y)$ of a positive outcome weight their utility gains and might censor their decision to apply, hence the observed data. This simply underlines that realized applications are not a pure expression of preferences, but are also mixed with possibly wrong expectations. It is finally of interest to consider the form taken by the probability of observing a hiring, which is simply the product of the probability of application and the objective probability of a positive outcome after the interview: $P_H(x, y) = P_A(x, y)\,p(x, y)$.
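A small numerical sketch may help to see how expectations enter the application decision. It assumes a standard logistic distribution for the negative of the random utility part, which the model leaves unspecified; the function names and parameter values are hypothetical.

```python
from math import exp


def logistic_cdf(t):
    """CDF of a standard logistic variable (assumed distribution of minus the random part)."""
    return 1.0 / (1.0 + exp(-t))


def application_probability(utility, baseline, cost, expected_chance):
    """CDF evaluated at utility - baseline - cost / expected_chance."""
    return logistic_cdf(utility - baseline - cost / expected_chance)


def hiring_probability(utility, baseline, cost, expected_chance, true_chance):
    """Probability of application times the objective probability of being hired."""
    return application_probability(utility, baseline, cost, expected_chance) * true_chance


# Two hypothetical job seekers with identical utilities and objective chances (0.4),
# but different expectations about their chances of being hired.
for expected in (0.4, 0.2):
    p_apply = application_probability(1.0, 0.5, 0.1, expected)
    p_hire = hiring_probability(1.0, 0.5, 0.1, expected, 0.4)
    print(f"expected chance {expected}: apply {p_apply:.3f}, hire {p_hire:.3f}")
```

In this toy setting the job seeker with the more pessimistic expectation applies less often, so hiring data would show a gap even though utilities and objective chances are identical; with a zero application cost the two application probabilities coincide.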

This model helps us discuss several mechanisms that could yield a differential treatment along the lines of gender. First, preferences might be gender-specific [30], e.g. women tend to value commuting time relative to wages differently from men. An algorithm learning from past hires or applications could reproduce these differences in preferences. If the latter are the product of social norms or other constraints, a policymaker might find it unfair that the algorithm conveys these differences, which would justify imposing parity along these lines. Second, even if job seekers are rational, there might be gendered differences in the objective hiring chances, e.g. taste-based or statistical discrimination against a gender by recruiters. The algorithm could also reproduce these differences. A final pitfall is that the hiring expectations might also be gendered: there might be differences in the perceptions and the representations of the chances of being hired, leading to differences in self-censorship or overconfidence. In our model, this could directly create or exacerbate the differences which might already be present in the objective chances, thereby impacting the training data of the algorithm.

E. Robustness checks

To ensure that the estimates of the gender gap obtained by Double Machine Learning are robust to the choice of machine learning technique used to approximate the regression function and the propensity score, we report alternative estimates obtained using XGBoost and Lasso estimators, together with the associated p-values, in Table 8, columns 3-6. Results are consistent with what we find with a random forest estimator (columns 1-2).

Wage (log) Distance (km) Executive Long term contract %Women < 20 Hours worked per week Fit to job search parameters

Cond. Random Forest -0.004 0.400 -0.002 -0.014 -0.033 -0.381 -0.011

Notes: Columns 1, 3 and 5 report, on the population of job seekers with sufficiently comparable characteristics, the estimates of the gender gap controlling for search parameters using DML with, respectively, a random forest, XGBoost and Lasso estimator for the regression function and the propensity score.

Our main specification (equation 1) focuses on average gender gaps (after controlling for job search fundamentals $Z$). However, gender gaps are likely to be heterogeneous, at least for a subset $Z_0$ of $Z$. For instance, the gender gaps in recommendations may be greater for women looking for high wages than for those seeking low wages. Accordingly, we propose to study gender gaps conditional on $Z_0$, in line with the estimation of so-called Conditional Average Treatment Effects in the causal estimation literature [28]. More precisely, to provide insights about this potential heterogeneity, we assume

$$Y - m(Z) = (G - e(Z))\,\beta(Z_0) + \varepsilon, \qquad (3)$$

with $\beta(Z_0)$ a linear function, $Y$ the outcome, $G$ the indicator for being a woman, $m$ the regression function and $e$ the propensity score. In the following, we consider $Z_0$ as an expansion of a single feature of interest, job seekers' monthly reservation wage in euros, on a basis of B-splines to increase the specification's flexibility. To reduce sensitivity to outliers, we top-code the reservation wage at the 90th percentile and bottom-code it at the 10th percentile.

Figure 1 shows the conditional gender wage gap (solid line) in the characteristics of recommendations according to reservation wage, together with 95% confidence intervals.
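As a rough illustration of the heterogeneity analysis, the sketch below bottom- and top-codes a feature at its 10th and 90th percentiles and then estimates one gap per bin of that feature from DML residuals. Several simplifications are assumptions of this sketch rather than the paper's procedure: a piecewise-constant (indicator) basis replaces the B-spline basis, percentiles use a simple lower-rank rule, and the function names and toy data are hypothetical.

```python
def winsorize(values, lower_pct=10, upper_pct=90):
    """Bottom-code and top-code values at the given empirical percentiles (lower-rank rule)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[(lower_pct * (n - 1)) // 100]
    hi = ordered[(upper_pct * (n - 1)) // 100]
    return [min(max(v, lo), hi) for v in values]


def binned_gaps(res_y, res_g, z0, edges):
    """Least-squares gap per bin of z0, from outcome and gender residuals."""
    num = [0.0] * (len(edges) + 1)
    den = [0.0] * (len(edges) + 1)
    for ry, rg, v in zip(res_y, res_g, z0):
        b = sum(v >= e for e in edges)  # index of the bin containing v
        num[b] += ry * rg
        den[b] += rg * rg
    # With an indicator basis, least squares separates into one ratio per bin.
    return [n / d if d > 0 else None for n, d in zip(num, den)]


# Hypothetical residuals and reservation wages (euros), split at 2000 euros.
res_g = [0.5, -0.5, 0.5, -0.5]
res_y = [-0.05, 0.05, -0.15, 0.15]
wage = winsorize([1000, 1200, 2500, 3000])
print(binned_gaps(res_y, res_g, wage, edges=[2000]))
```

A spline basis would replace the bin indicators with smooth basis functions and solve a single least-squares problem across all observations.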

[1] Belot, Kircher, Muller, Providing advice to jobseekers at low cost: An experimental study on online advice, The Review of Economic Studies 86 (2019) 1411–1447.

[2] Barocas, Hardt, Narayanan, Fairness and machine learning, fairmlbook.org, 2019.

[3] Bied, Nathan, E. Perennes, Hofmann, Caillou, Crépon, Gaillac, Sebag, Toward job recommendation for all, Working paper (2023).

[4] Dwork, Hardt, Pitassi, Reingold, Zemel, Fairness through awareness, in: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 2012, pp. 214–226.

[5] Varian, Efficiency, equity and envy, Journal of Economic Theory 9 (1974) 63–91.

[6] M. B. Zafar, I. Valera, Rodriguez, Gummadi, Weller, From parity to preference-based notions of fairness in classification, Advances in Neural Information Processing Systems 30 (2017).

[7] Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, W. Newey, Double/debiased/Neyman machine learning of treatment effects, American Economic Review 107 (2017) 261–265.

[8] Edwards, Storkey, Censoring representations with an adversary, 2015. URL: https://arxiv.org/abs/1511.05897. doi:10.48550/ARXIV.1511.05897.

[9] Bach, Chernozhukov, M. Spindler, Closing the US gender wage gap requires understanding its heterogeneity, arXiv preprint arXiv:1812.04345 (2018).

[10] Nie, Wager, Quasi-oracle estimation of heterogeneous treatment effects, Biometrika 108 (2021) 299–319.

[11] M. D. Ekstrand, A. Das, R. Burke, F. Diaz, et al., Fairness in information access systems, Foundations and Trends® in Information Retrieval 16 (2022) 1–177.

[12] Wang, W. Ma, M. Zhang, Y. Liu, Ma, A survey on the fairness of recommender systems, ACM Journal of the ACM (JACM) (2022).

[13] Li, Chen, Xu, Ge, Tan, S. Liu, Zhang, Fairness in recommendation: A survey, 2022. URL: https://arxiv.org/abs/2205.13619. doi:10.48550/ARXIV.2205.13619.

[14] Singh, Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 2219–2228. URL: https://doi.org/10.1145/3219819.3220088. doi:10.1145/3219819.3220088.

[15] Singh, Joachims, Policy learning for fairness in ranking, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/9e82757e9a1c12cb710ad680db11f6f1-Paper.pdf.

[16] S. C. Geyik, Ambler, Kenthapadi, Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019. URL: https://doi.org/10.1145%2F3292500.3330691. doi:10.1145/3292500.3330691.

[17] Mehrotra, Anderson, Diaz, Sharma, Wallach, E. Yilmaz, Auditing search engines for differential satisfaction across demographics, in: Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion, ACM Press, 2017. URL: https://doi.org/10.1145%2F3041021.3054197. doi:10.1145/3041021.3054197.

[18] M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera, All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 172–186.

[19] A. B. Melchiorre, N. Rekabsaz, E. Parada-Cabaleiro, S. Brandl, O. Lesota, M. Schedl, Investigating gender fairness of recommendation algorithms in the music domain, Information Processing & Management 58 (2021) 102666.

[20] Islam, K. N. Keya, Zeng, Pan, Foulds, Debiasing career recommendations with neural fair collaborative filtering, in: Proceedings of the Web Conference 2021, 2021, pp. 3779–3790.

[21] Kamishima, Akaho, Asoh, Sakuma, Enhancement of the neutrality in recommendation, in: Decisions@RecSys, 2012, pp. 8–14.

[22] Rus, Luppes, Oosterhuis, G. H. Schoenmacker, Closing the gender wage gap: Adversarial fairness in job recommendation (2022). URL: https://arxiv.org/abs/2209.09592. doi:10.48550/ARXIV.2209.09592.

[23] Zhang, P. Kuhn, Understanding algorithmic bias in job recommender systems: An audit study approach (2022).

[24] Wadsworth, Vera, Piech, Achieving fairness through adversarial learning: an application to recidivism prediction, 2018. URL: https://arxiv.org/abs/1807.00199. doi:10.48550/ARXIV.1807.00199.

[25] Beutel, Chen, Zhao, E. H. Chi, Data decisions and theoretical implications when adversarially learning fair representations, 2017. URL: https://arxiv.org/abs/1707.00075. doi:10.48550/ARXIV.1707.00075.

[26] Li, Chen, Xu, Ge, Y. Zhang, Towards personalized fairness based on causal notion, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2021. URL: https://doi.org/10.1145%2F3404835.3462966. doi:10.1145/3404835.3462966.

[27] Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (1990) 391–407.

[28] G. W. Imbens, D. B. Rubin, Causal inference in statistics, social, and biomedical sciences, Cambridge University Press, 2015.

[29] Breiman, Random forests, Machine Learning 45 (2001) 5–32.

[30] Le Barbanchon, Rathelot, Roulet, Gender differences in job search: Trading off commute against wage, The Quarterly Journal of Economics 136 (2021) 381–426.