=Paper=
{{Paper
|id=Vol-3744/paper4
|storemode=property
|title=Analyzing bias and discrimination in an algorithmic hiring use case
|pdfUrl=https://ceur-ws.org/Vol-3744/paper4.pdf
|volume=Vol-3744
|authors=David Quer,Anna Via,Marc Serra-Vidal,Laia Nadal,Didac Fortuny
|dblpUrl=https://dblp.org/rec/conf/aimmes/QuerVSNF24
}}
==Analyzing bias and discrimination in an algorithmic hiring use case==
David Quer, Anna Via, Marc Serra-Vidal, Laia Nadal and Didac Fortuny

Abstract

Algorithmic hiring, powered by AI, has become prevalent in recruitment processes. This paper investigates bias and discrimination in a specific Machine Learning use case at InfoJobs, the leading recruitment platform in Spain. The motivation stems from ethical, legal, and reputational considerations, emphasizing the importance of building responsible and fair AI systems in recruitment. The study presents a comprehensive analysis, employing fairness metrics and, additionally, a novel Granular Bias Measure (GBM) introduced to assess biases at the individual Job Title level. The experiments involve real candidate experiences, and the results indicate no significant negative discrimination towards female experiences. However, opportunities to improve model results for specific age groups, such as <25, are identified. The paper concludes by highlighting the need for further analysis to understand and address biases effectively. While bias detection is achievable, addressing root causes and ensuring fairness requires ongoing monitoring and improvements in algorithmic hiring systems.

1. Motivation

1.1. Company context

Algorithmic hiring refers to the use of algorithms, often powered by artificial intelligence (AI) and machine learning (ML) technologies, in recruitment and hiring processes. The end goal of leveraging these technologies is to optimize and simplify processes within job board platforms. According to [1], around 99% of Fortune 500 companies use talent-sifting software in some part of the recruitment and hiring process. Because of the use of ML models and the potential automation of decisions, the risks related to bias and discrimination in algorithmic hiring are huge ([2], [3]).

InfoJobs (https://www.infojobs.net/) is the leading recruitment platform in Spain and part of the Adevinta group (https://adevinta.com/). The platform dealt with over 2.7M offers and more than 125M applications in 2022. This volume and scale make it necessary to find ways to optimize efficiency and value for candidates and recruiters. For that reason, the platform incorporates Machine Learning systems in several touchpoints of the user and recruiter experience, such as offer recommendations for candidates (to help candidates find relevant job offers), automatic CV parsing from PDF (to help candidates update the InfoJobs profile with information available in a PDF), and offer-CV matching computation (to help candidates and recruiters assess the fit between a CV and an offer).

As these ML systems can influence users' and recruiters' decisions around employment, many risks arise with algorithmic hiring. This makes it mandatory to find ways to ensure these systems are built responsibly, avoiding biases and discrimination, and bringing a positive impact to society:

• Ethically, it is essential to prevent the perpetuation of discrimination and injustice in recruitment.
• From a legal standpoint, adherence to fairness principles is crucial to avoid legal consequences and comply with regulations (especially the upcoming AI Act, where recruitment is considered high risk).
• From a company reputation point of view, creating fair and transparent AI systems builds user trust and contributes to a positive corporate reputation.

1.2. Project context

As explained in [4], Natural Language Processing (NLP) techniques are widely used in algorithmic hiring to standardize candidate experiences and job offers and align them on mutual relevancy. In this paper, we present a bias and discrimination analysis for a specific use case at InfoJobs: Job Title normalization. Job Title normalization is core to the platform, as it allows a better understanding and standardization of offers and experiences, and it feeds many functionalities such as offer recommendations to candidates, offer search, alerts, and offer-CV matching.

The normalization model is implemented using the BERT language model [5], which, once fine-tuned with InfoJobs data, classifies offers and candidate experiences into a common Job Title taxonomy based on the ESCO occupations taxonomy (https://esco.ec.europa.eu/en/classification). This taxonomy provides descriptions of 3,008 occupations and 13,890 skills linked to these occupations, translated into 28 languages.

Figure 1: Example of the experience "Programmer" and the offer "Software Developer" being normalized into the same "Back End Developer" Job Title.

The main modifications to the ESCO taxonomy consisted of:

• Ensuring all gender forms of a job title appear (e.g. "camarero", "camarera", "camarero/a", "camarero/camarera") and that the primary form is inclusive ("camarero/a").
• Gathering feedback from users and manually revising the taxonomy to improve it.

Due to the importance of this model within the platform, an analysis was prioritized to measure, detect, and mitigate potential sources of bias in the normalization of experiences. Since, due to GDPR, the only sensitive attributes the company can store are gender and age, the analysis was limited to these two attributes and their intersection.
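For illustration, the sketch below shows how such a normalization step could look: a fine-tuned BERT text classifier returns the most likely taxonomy Job Title together with a confidence score, and a threshold decides whether the prediction is accepted. The checkpoint path, threshold value and function name are illustrative assumptions, not the production implementation.

<pre>
# Minimal sketch of the normalization step, assuming a fine-tuned BERT
# text-classification checkpoint whose labels are the taxonomy Job Titles.
from transformers import pipeline

normalizer = pipeline(
    "text-classification",
    model="path/to/finetuned-job-title-bert",  # hypothetical fine-tuned checkpoint
)

THRESHOLD = 0.5  # illustrative acceptance threshold


def normalize_title(raw_title: str):
    """Return (normalized_job_title, score), or (None, score) if below threshold."""
    best = normalizer(raw_title)[0]  # top predicted class with its softmax score
    label, score = best["label"], best["score"]
    return (label, score) if score >= THRESHOLD else (None, score)


# e.g. normalize_title("Programmer") might return ("Back End Developer", 0.93)
</pre>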
2. Studies, State-of-the-art

Fairness measures can target different properties. According to the survey on Fairness and Bias in Algorithmic Hiring by Fabris et al. [6], fairness is a broad concept that can be divided into Procedural fairness, Outcome fairness, Accuracy fairness, Impact fairness and Representational fairness. Impact fairness and Procedural fairness are related to the screening and selection processes, which are out of the scope of our work. Representational fairness, which focuses on the wording of the offers, is also not dealt with in this paper. In this work, in order to measure biases for our model, the Job Title classification algorithm, we focus on Outcome fairness, which looks at predictions from the perspective of candidates, and Accuracy fairness, which requires the equalization of accuracy-related properties between groups from the decision-making perspective.

The survey [6] describes several fairness metrics for evaluating algorithmic systems, primarily focused on ranking and classification tasks. Since our AI system is a classification problem, this narrows down the metrics we can use. For Outcome fairness, the survey [6] describes the following suitable metrics: Disparate Impact (DI), Demographic Disparity (DD), Representation in Positive Predicted (RPP) and True Positive Rate Difference (TPRD).

That said, DI requires a notion of the best and worst-off groups to be interpretable, only considers two groups, does not account for non-binary attributes, and focuses on a ratio rather than absolute rates, while DD is prone to a masking effect when acceptance rates are very small; both are therefore unsuitable for this analysis. As for RPP, it is used in cases where the overall population of interest for an algorithm is unknown, which is not our case. Because of these limitations, we chose the True Positive Rate Difference (TPRD), which measures disparities in true positive rates (also known as recall) between sensitive groups and is closely related to equal opportunity and separation. To get a better understanding of the biases our algorithm might have, we also consider explaining away with conditional probabilities, the True Positive Rate and the True Negative Rate, before calculating the TPRD.

With respect to Accuracy fairness metrics, we picked the Balanced Classification Rate Difference (BCRD), a measure of the disparity in classification accuracy between groups, because the other metric proposed in the survey [6], Mean Absolute Error (MAE), targets only regression problems. Finally, given that the Job Title classification algorithm is a multiclass classification model, we believed a deeper understanding was needed for every Job Title prediction, and we built the Granular Bias Measure (GBM), based on prevalence [7], to do so.

3. Experiments

As the goal was to assess the real, production fairness of the normalization predictions, we used real candidate experiences and their Job Title normalization predictions. We focused our efforts on the predictions (output), leaving aside a proper analysis of the imbalance of the training data (input).

3.1. Data input

Nevertheless, we performed some sanity checks on the proportions to make sure the data is not skewed. Considering Male vs Female genders, the split is 55% vs 45%, and when intersecting gender with the age groups <25, 25-45 and >45 years old, no group represents less than 5% of the data, with Male and <25 being the least represented group (9%).

3.2. Score analysis

The Job Title normalization output assigns a score to each class that represents the model's confidence in predicting that class; these probabilities are generated by the softmax layer of the neural network. The distribution of these scores across different instances in the dataset can provide insights into the model's behavior and reliability.

As a sanity check, before analyzing the classifier, we obtain the predicted scores and construct the cumulative distribution functions of the predicted scores for each sensitive variable we want to compare: Gender and Age. Finally, we apply a Kolmogorov-Smirnov test [8] to statistically assess whether the predicted scores for different classes of the sensitive variable follow the same underlying distribution.

Table 1: Kolmogorov-Smirnov test results for sensitive variables

Sensitive variable   Comparison       Statistic   p-value    Accept H0 (=)
Gender               Male vs Female   0.41        6.16E-05   FALSE
Age                  >45 vs 25-45     0.02        1.06E-01   TRUE
Age                  >45 vs <25       0.11        5.45E-08   FALSE
Age                  25-45 vs <25     0.09        5.79E-06   FALSE

Indeed, we conclude from these results that there are differences in the distributions between classes.
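A minimal sketch of one of these pairwise comparisons is given below; the DataFrame and its column names ("gender", "score") are assumptions made for illustration.

<pre>
# Two-sample Kolmogorov-Smirnov test on the predicted scores of two groups
# of a sensitive variable, as in the distribution check described above.
import pandas as pd
from scipy.stats import ks_2samp


def ks_compare(df: pd.DataFrame, group_col: str, group_a: str, group_b: str,
               score_col: str = "score", alpha: float = 0.05):
    """Return (statistic, p-value, accept_H0) for the two groups' score distributions."""
    scores_a = df.loc[df[group_col] == group_a, score_col]
    scores_b = df.loc[df[group_col] == group_b, score_col]
    statistic, p_value = ks_2samp(scores_a, scores_b)
    return statistic, p_value, p_value >= alpha  # True -> cannot reject H0 (same distribution)


# e.g. ks_compare(predictions_df, "gender", "Male", "Female")
</pre>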
Although the Kolmogorov-Smirnov test is a valuable tool for comparing the performance and calibration of models in a statistical manner, it does not tell us anything about the magnitude of the differences or about which classes are most affected. On top of that, in a classifier the threshold selected to decide the class can render these differences insignificant. That is why we moved to proper classification metrics in order to further evaluate the fairness of our algorithm.

3.3. Classifier metrics

The Job Title normalization model is a multiclass classification model that works as follows: the class with the highest predicted probability is obtained and then, as in a binary classification model, a threshold is applied to the predicted score to consider the prediction valid or invalid. Thanks to a labeling technique, detailed below, we are able to assess the correctness of the predictions, thus obtaining metrics such as an "estimated accuracy" across all classes.

To be able to evaluate the model as a binary classifier and calculate the True Positive Rate Difference (TPRD), the Balanced Classification Rate Difference (BCRD) and the explaining away with conditional probabilities, we require ground truth labels. The labeled ground truth for validation is obtained by filtering the experiences whose titles are exactly the same as one of our modified ESCO taxonomy job title groups or their synonyms. This poses the issue that we can only evaluate the model on the data points for which this ESCO ID is known, but when we weigh the cost of obtaining manually labeled data against using ESCO to validate our predictions, the latter option makes it much easier to evaluate the model. Even so, with this technique we are able to label over 50% of the data points.

3.4. Explaining away with conditional probabilities

Explaining away in the context of a confusion table involves understanding how the occurrence or absence of one event can influence the probability of another event. In a confusion table, which is commonly used to assess the performance of a classification model, explaining away refers to the impact of one class prediction on the probability of another. For our sensitive variables, we built these confusion tables to do the explaining away based on the following question: what is the probability of getting a right or wrong prediction given that the candidate belongs to a class (e.g. Male or Female)? This allows us to see differences between classes and also to assess whether the model does better when it gets a prediction right.
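A sketch of how such per-group tables could be derived is shown below, assuming an evaluation DataFrame with one row per labeled experience and boolean columns "is_correct" (prediction matches the ESCO-derived ground truth) and "is_valid" (score above the acceptance threshold); these column names are assumptions made for illustration.

<pre>
# Per-group "explaining away" tables: cross-tabulate actual correct/incorrect
# against predicted valid/invalid for each level of a sensitive attribute,
# and derive the True Positive Rate and True Negative Rate of each group.
import pandas as pd


def explaining_away(df: pd.DataFrame, group_col: str) -> dict:
    """Return, per group, the 2x2 table plus its TPR and TNR."""
    results = {}
    for group, sub in df.groupby(group_col):
        tp = (sub["is_correct"] & sub["is_valid"]).sum()
        fn = (sub["is_correct"] & ~sub["is_valid"]).sum()
        fp = (~sub["is_correct"] & sub["is_valid"]).sum()
        tn = (~sub["is_correct"] & ~sub["is_valid"]).sum()
        results[group] = {
            "table": pd.crosstab(sub["is_correct"], sub["is_valid"]),
            "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),
            "TNR": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    return results


# e.g. explaining_away(eval_df, "gender") or explaining_away(eval_df, "age_group")
</pre>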
For the variable Gender (Male or Female) we get the following results:

Table 2: Confusion matrices for the variable Gender

Male               predicted valid   predicted invalid
actual correct     12230             3
actual incorrect   61                22

Female             predicted valid   predicted invalid
actual correct     14422             5
actual incorrect   249               32

In terms of True Positive Rate and True Negative Rate by Gender group we have:

Table 3: True Positive Rate and True Negative Rate by Gender

Gender   TPR       TNR
Male     0.99975   0.26506
Female   0.99965   0.11387

For the variable Age (25-45, <25, >45) we get the following results:

Table 4: Confusion matrices for the variable Age

25-45              predicted valid   predicted invalid
actual correct     15612             5
actual incorrect   129               25

<25                predicted valid   predicted invalid
actual correct     4368              1
actual incorrect   21                2

>45                predicted valid   predicted invalid
actual correct     6672              3
actual incorrect   160               27

In terms of True Positive Rate and True Negative Rate by Age group we have:

Table 5: True Positive Rate and True Negative Rate by Age

Age     TPR       TNR
25-45   0.99967   0.16233
<25     0.99977   0.08695
>45     0.99955   0.14438

Strictly speaking, TPRD and BCRD compare two groups. This works fine for Gender but, for Age, it implies doing pairwise comparisons between groups. We therefore did the explaining away interpretation for both sensitive variables and calculated TPRD and BCRD only for Gender.

3.5. True Positive Rate Difference (TPRD)

For Outcome fairness we use the True Positive Rate Difference (TPRD) as a metric to summarize the prediction capacity of our model and evaluate possible biases. By definition, this metric is calculated from the True Positive Rate of each group as follows:

TPR_g = \Pr(\hat{y} = 1 \mid y = 1, s = g)    (1)

TPRD = TPR_g - TPR_{g^c}    (2)

Using the previously calculated TPR by Gender, the True Positive Rate Difference between Gender groups is 0.0001:

TPRD = TPR_{male} - TPR_{female} = 0.99975 - 0.99965 = 0.0001    (3)

3.6. Balanced Classification Rate Difference (BCRD)

For Accuracy fairness we use the Balanced Classification Rate Difference (BCRD), which checks for disparate accuracy between groups and is defined as follows:

BCR_g = \frac{TPR_g + TNR_g}{2}    (4)

BCRD = BCR_g - BCR_{g^c}    (5)

Using the previously calculated TPR and TNR by Gender, we calculate the Balanced Classification Rate of each Gender group:

BCR_{male} = \frac{0.99975 + 0.26506}{2} = 0.63240    (6)

BCR_{female} = \frac{0.99965 + 0.11387}{2} = 0.55676    (7)

Taking the difference between Gender groups, we get a Balanced Classification Rate Difference of 0.07564:

BCRD = BCR_{male} - BCR_{female} = 0.63240 - 0.55676 = 0.07564    (8)
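A small worked example of these two summaries, using the per-group rates reported in Table 3, could look as follows.

<pre>
# TPRD and BCRD computed from the per-group rates of Table 3 (Gender groups).
TPR = {"Male": 0.99975, "Female": 0.99965}
TNR = {"Male": 0.26506, "Female": 0.11387}


def tprd(group: str, complement: str) -> float:
    """True Positive Rate Difference between a group and its complement, eq. (2)."""
    return TPR[group] - TPR[complement]


def bcr(group: str) -> float:
    """Balanced Classification Rate of a group, eq. (4)."""
    return (TPR[group] + TNR[group]) / 2


def bcrd(group: str, complement: str) -> float:
    """Balanced Classification Rate Difference, eq. (5)."""
    return bcr(group) - bcr(complement)


print(f"TPRD = {tprd('Male', 'Female'):.5f}")  # approx. 0.0001
print(f"BCRD = {bcrd('Male', 'Female'):.5f}")  # approx. 0.0756
</pre>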
Given this difference in accuracy by Gender group, we decided to take a closer look by introducing the Granular Bias Measure (GBM).

3.7. Calculating a Bias Measure for every Job Title classified

Up until this point we have been using well-known statistical techniques and algorithmic hiring metrics for the bias calculation. These are aggregated measures of outcome and accuracy biases, but we are interested in understanding whether a potential source for this bias could be the unequal gender representation in each Job Title (and the fact that some Job Titles will be easier to predict than others). To that end, we need to look at every Job Title separately and check whether the bias effect is present for each particular one. For this reason we built the Granular Bias Measure (GBM).

To build the Granular Bias Measure (GBM), which summarizes how well a Job Title is classified, we start by calculating, for each level of the sensitive variable, a measure inspired by prevalence, i.e. the proportion of a particular population found to be affected by an outcome. This gives us P+, the positive prevalence, when the Job Title is well classified. Note that P+ is expressed as a proportion of the group's total population (TP = True Positives):

P^{+}_{male} = \frac{TP_{male}}{Total_{male}}    (9)

P^{+}_{female} = \frac{TP_{female}}{Total_{female}}    (10)

Then we calculate the GBM by subtracting the prevalence of one group from that of the other:

GBM = P^{+}_{male} - P^{+}_{female}    (11)

Finally, to flag a possible bias, we check whether the absolute value of the difference, abs(GBM), is greater than 0.0. Let's look at some examples:

Table 6: Sample results of the GBM calculation on normalized Job Titles

Job Title Normalized     Total (M)   Total (F)   P+ (M)     P+ (F)     GBM
Abogado/a                54          45          1.00000    1.00000    0.00000
Agente comercial         57          51          0.96491    0.88235    0.08256
Jefe/a de compras        53          54          0.94340    0.94444    -0.00105
Administrativo/a         118         516         0.822034   0.945736   -0.123703
Cajero/Cajera            61          424         0.934426   0.985849   -0.051423
Director/a de recursos   2           1           0.50000    1.00000    -0.50000

As a result, we are able to calculate the GBM for each normalized Job Title and, by looking at the larger differences in the score, extract more detailed conclusions.
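A sketch of this per-Job-Title computation is shown below, following equations (9)-(11); the DataFrame columns ("job_title_normalized", "gender" with values "M"/"F", "is_correct") are assumptions made for illustration.

<pre>
# Granular Bias Measure per normalized Job Title: positive prevalence of each
# gender group (eqs. 9-10) and their difference (eq. 11).
import pandas as pd


def gbm_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return Total, P+ and GBM per normalized Job Title (columns as in Table 6)."""
    rows = []
    for title, sub in df.groupby("job_title_normalized"):
        male = sub[sub["gender"] == "M"]
        female = sub[sub["gender"] == "F"]
        p_plus_m = male["is_correct"].mean() if len(male) else float("nan")
        p_plus_f = female["is_correct"].mean() if len(female) else float("nan")
        rows.append({
            "job_title": title,
            "total_m": len(male), "total_f": len(female),
            "p_plus_m": p_plus_m, "p_plus_f": p_plus_f,
            "gbm": p_plus_m - p_plus_f,
        })
    return pd.DataFrame(rows)


# Job Titles with abs(gbm) > 0 and enough observations in both groups are
# candidates for a closer look.
</pre>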
4. Discussion

The KS test results on the model scores in Table 1 show that the score distributions between gender and age groups are mostly different (except for the comparison between the age groups 25-45 and >45). Also, when intersecting gender and age, we can see how these differences are maintained but do not grow through intersectionality.

We continue digging for more insightful differences with the explaining away conditional probabilities, which help unveil the interdependence between different prediction outcomes. Considering the Gender sensitive variable outcomes in Table 3, the True Positive Rate barely changes between groups when the prediction is correct. On the other hand, when evaluating incorrect predictions (True Negative Rate), Male candidates seem to have a higher probability that the model correctly discards a wrong normalized Job Title, although the volume is very small and affects very few candidates. In the same way, when considering the Age sensitive variable outcomes in Table 4, although the rates do not change much between groups, the <25 group seems slightly different from 25-45 and >45 in terms of TPR and TNR, which aligns with the previous conclusions from Table 1.

Now we are able to calculate the previously selected metrics. As for Outcome fairness, TPRD, we can see in equation (3) that the difference between Gender groups is very small, almost non-significant. As for Accuracy fairness, BCRD, where we try to measure differences in accuracy between Gender groups, although it is hard to interpret, it does not seem to be affecting the model predictions.

Finally, validating our proposed metric, the Granular Bias Measure, equation (11), based on the positive prevalence of each group, equations (9) and (10), some interesting results are: almost all model predictions do not present any biases, and the GBM metric is distributed in the interval [-0.25, 0.25], showing small differences between Gender classes. The Job Titles that are biased can, upon closer examination, be explained by various sociological reasons. In more detail:

• When the number of observations per Gender class is well balanced, the GBM metric captures the differences between classes (see "Abogado/a", "Agente comercial" or "Jefe/a de compras" in Table 6).
• GBM also works well when there is an imbalance, up to a certain point, provided there is a sufficient volume of observations (see "Administrativo/a" or "Cajero/Cajera" in Table 6).
• For many classes there are not enough observations to get a significant GBM, and no conclusions can be extracted from them (see "Director/a de recursos" in Table 6).

More work has to be done in this sense to arrive at a broader metric.

5. Conclusions

To conclude, we do not observe discrimination by gender in our analysis. We believe this is in part thanks to the work on ensuring that all gender forms appear in the taxonomy and have the same weight (contrary to the default male naming typical of the Spanish language). Most candidates seem to get a correct prediction regardless of their gender. While a small bias towards correctly classifying male Job Titles is possible, it affects very few candidates and does not seem a threat to overall fairness.

As for age, there seems to be an opportunity to improve the model results for the <25 group, which behaves differently from the 25-45 and >45 groups. We believe an explanation may be that experiences of young people can differ from the typical experiences in the ESCO taxonomy (internships, junior roles, leisure, private tutoring, ...). Nothing indicates that the intersection of gender and age amplifies the observed differences in any of the analyses.

Based on the TPRD metric, we can conclude that there is no significant difference between Gender groups and no disparity in true positive rates. Since this measure is closely related to equal opportunity and separation, we can also say that our model gives equal opportunities independently of gender. In terms of accuracy, we can conclude that, based on the BCRD metric, there is no significant difference between Gender groups, although a closer examination will be useful to detect particular cases and infer their root causes.

Detecting or monitoring a bias (i.e. a correlation) does not seem to be the hardest task; assessing the root cause of this bias (i.e. causality) and correcting it is much harder and requires further analysis. As for the new Granular Bias Measure (GBM), more work has to be done to build a consistent and reliable summary of the bias. When the classes are unbalanced, the GBM is very reliant on the population sample used for evaluation; it works for multiclass classification and performs better with large samples. Tuning the threshold used to decide whether a GBM score indicates a true bias could yield better results.

References

[1] Institute for the Future of Work, Algorithmic hiring systems: what are they and what are the risks?, 2022. URL: https://www.ifow.org/news-articles/algorithmic-hiring-systems.
[2] Oleeo, Inclusive diversity in hiring guide, 2023. URL: https://www.oleeo.com/inclusive-diversity-in-hiring-guide.
[3] P. M. Kline, E. K. Rose, C. R. Walters, Systemic discrimination among large U.S. employers, 2022. URL: https://www.nber.org/system/files/working_papers/w29053/w29053.pdf.
[4] H. Kavas, M. Serra-Vidal, L. Wanner, Job offer and applicant CV classification using rich information from a labour market taxonomy, 2023. URL: http://dx.doi.org/10.2139/ssrn.4519766.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805.
[6] A. Fabris, N. Baranowska, M. J. Dennis, P. Hacker, J. Saldivar, F. Z. Borgesius, A. J. Biega, Fairness and bias in algorithmic hiring, 2023. URL: https://arxiv.org/abs/2309.13933.
[7] N. P. Jewell, Statistics for Epidemiology, Chapman & Hall, London, 2003.
[8] V. W. Berger, Y. Zhou, Kolmogorov–Smirnov Test: Overview, John Wiley & Sons, Ltd., Hoboken, New Jersey, 2014.