Fairness Beyond Binary Decisions: A Case Study on German Credit

Deborah D. Kanubala1,*, Isabel Valera1,2 and Kavya Gupta1
1 Saarland School of Informatics, Germany
2 MPI for Software Systems

EWAF’24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
* Corresponding author: dkanubala@aimsammi.org (D. D. Kanubala)

Abstract
Data-driven approaches are increasingly used to (partially) automate decision-making in credit scoring by predicting whether an applicant is β€œcreditworthy or not” based on a set of features about the applicant, such as age and income, along with what we refer to here as treatment decisions, e.g., loan amount and duration. Existing data-driven approaches for automating and evaluating the accuracy and fairness of such credit decisions ignore that treatment decisions (here, loan terms) are part of the decision and may thus be subject to discrimination. This discrimination can propagate to the final outcome (repaid or not) of positive decisions (granted loans). In this extended abstract, we rely on causal reasoning and a broadly studied fair machine-learning dataset, German Credit, to i) show that current fair data-driven approaches neglect discrimination in treatment decisions (i.e., loan terms) and its downstream consequences on the decision outcome (i.e., ability to repay); and ii) argue for the need to move beyond binary decisions in fair data-driven decision-making in consequential settings like credit scoring.

Keywords
algorithmic fairness, credit scoring, discrimination, path-specific counterfactual fairness

1. Motivation

In many areas such as hiring [1, 2, 3], law [4, 5, 6, 7], and finance [8, 9, 10, 11, 12], data-driven solutions are used in consequential decisions by predicting outcomes from historical data [13, 14]. The main assumption of these data-driven approaches for auditing or automating decision-making processes is access to historical data π’Ÿ = {𝑠_𝑖, xΜƒ_𝑖, 𝑦_𝑖}_{𝑖=1}^𝑁. In the context of loan approval [15, 16, 17], the available dataset is often assumed to contain a representative sampleΒΉ of the random variables corresponding to i) the sensitive attribute of applicants 𝑆; ii) the observed outcome after a positive decision π‘Œ, which is used as a ground-truth label indicating the β€œcreditworthiness” of applicants [19, 20, 21, 22]; and iii) the features 𝑋̃ = {𝑋, 𝑍}, which account for both the applicant characteristics 𝑋, such as income and educational level, and the treatment decisions 𝑍, which in our case correspond to the loan terms, such as duration and loan amount, under which a historical positive decision (i.e., a granted loan) was given.

ΒΉ For simplicity, we here assume that the data corresponds to independent and identically distributed samples drawn from an underlying data-generating process. However, we refer the reader to the literature on algorithmic decision-making under selective labels for relaxations of this assumption [18].

[Figure 1 (graphs omitted): (a) Traditional formulation, with nodes 𝑆, 𝑋̃, π‘Œ; (b) Proposed reformulation, with nodes 𝑆, 𝑋, 𝑍, π‘Œ.]
Figure 1: Causal graphs for the two scenarios. a) Without distinction between 𝑋 and 𝑍; b) with distinction between 𝑋 and 𝑍. Green: paths that are considered unfair and are under the control of the decision-maker (i.e., the bank).
Red: considered unfair as a result of direct discrimination and should not be accepted. Blue: downstream effect of the discrimination from 𝑍.

This setting, however, neglects that past decisions are not binary but also involve the treatment 𝑍, which may, in turn, have a causal effect on the observed outcome π‘Œ. Studies have shown that treatment decisions 𝑍 can be discriminatory; e.g., the authors of [23] found that, while there were no significant differences in loan approval rates between binary gender identities, there was a substantial disparity in loan amounts between men and women. Similar studies [24, 25, 26, 27] also highlight differences in loan terms between demographic groups. These results show that decisions are not binary and that discrimination against demographic groups may be overlooked when only binary decisions are considered. That is, while binary decisions may appear fair across groups (e.g., in terms of acceptance rate [23] or true positive rate [25]), some demographic groups may still be subject to discrimination in the treatment they receive, i.e., in the loan terms [23, 28, 29, 30]. While these works explored discrimination in treatment decisions, to the best of our knowledge none of them has holistically analyzed the downstream effects of treatment decisions on the outcome.

In this work, we go one step further and propose a setting to systematically answer the following questions about historical data of the form π’Ÿ = {𝑠_𝑖, π‘₯_𝑖, 𝑧_𝑖, 𝑦_𝑖}_{𝑖=1}^𝑁, where we now make explicit the distinction between the applicant features 𝑋 and the treatment 𝑍:

β€’ Research Question (RQ) 1: Does discrimination exist in the assignment of treatment 𝑍 (i.e., loan terms) across demographic groups?
β€’ Research Question (RQ) 2: If so, what are the downstream effects of such treatment discrimination on the decision outcome π‘Œ (i.e., the repayment probability)?

Implications: The common assumption that π‘Œ is sufficient for making lending decisions is inadequate. In other words, discrimination in treatment decisions propagates to the outcomes; e.g., demographic groups perceived as higher risk may be offered loan terms that negatively affect their repayment probability, thus reinforcing the negative perception of their risk. In summary, our study aims to provide a compelling case for rethinking the existing fair machine-learning pipeline. Next, we detail how we can answer the above questions using a causality-based approach.

2. Discrimination in Treatment and its Effect on Outcome

We employ a causal reasoning framework [31, 32, 33] for our work. More specifically, we contrast the causal graph in Figure 1a, which represents the current fair decision-making framework, with the one in Figure 1b, where we consider the treatment decisions as part of the decisions made by the decision-maker. In our revisited data-generation process in Figure 1b, we make the following assumptions (a minimal simulation sketch of this process follows the list):

1. The treatment decision 𝑍 is a causal child of both 𝑋 and (potentially) 𝑆, and we assume the causal mechanism that generates 𝑍 to be potentially discriminatory due to a direct (solid red line in Figure 1b) or indirect (dashed green) causal path. We refer to the direct causal effect of 𝑆 on 𝑍 as Direct Treatment Discrimination (DTD), to the indirect effect of 𝑆 on 𝑍 (mediated by 𝑋) as Indirect Treatment Discrimination (ITD), and to the combination of the two as Treatment Discrimination (TD).

2. The treatment decision 𝑍 is a causal parent of the outcome variable π‘Œ, and thus TD may propagate to π‘Œ through the causal path(s) from 𝑆 to π‘Œ mediated by 𝑍 (dashed blue line in Figure 1b).
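To make these assumptions concrete, the following is a minimal simulation sketch of the data-generating process in Figure 1b. It assumes linear additive-noise causal mechanisms (in line with the modeling choice of Section 3); all coefficient values and variable meanings are illustrative placeholders rather than quantities estimated from the German Credit data.

```python
# Minimal sketch of the data-generating process in Figure 1b, assuming linear
# additive-noise causal mechanisms. All coefficients below are illustrative
# placeholders, not estimates from the German Credit data.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Sensitive attribute S (e.g., 0 = female, 1 = male).
s = rng.integers(0, 2, size=n)

# Applicant features X (e.g., income), causally affected by S.
x = 1.0 * s + rng.normal(0.0, 1.0, size=n)            # S -> X

# Treatment Z (e.g., loan amount), affected by X (green path) and,
# potentially discriminatorily, by S directly (red path).
z = 0.8 * x + 0.5 * s + rng.normal(0.0, 1.0, size=n)  # X -> Z and S -> Z

# Outcome Y (repayment log-odds), affected by X and by Z, so that any
# discrimination in Z propagates downstream (blue path).
y_logit = 0.6 * x - 0.4 * z + rng.normal(0.0, 0.5, size=n)
p_repay = 1.0 / (1.0 + np.exp(-y_logit))              # P(Y = 1 | s, x, z)
y = rng.binomial(1, p_repay)                          # observed repayment label
```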
Based on these assumptions, we answer RQ 1 and RQ 2 as follows.

RQ 1, Discrimination in treatment: We assume that, for each factual applicant in the dataset, which we denote by (𝑠^𝐹, π‘₯^𝐹, 𝑧^𝐹, 𝑦^𝐹), we can rely on the abduction-action-prediction steps of Pearl [34] to estimate its sensitive counterfactual (𝑠^𝐢𝐹, π‘₯^𝐢𝐹, 𝑧^𝐢𝐹, 𝑦^𝐢𝐹). Then, we define the (total) discrimination in treatment as the difference between the treatment of the factual individual, 𝑧^𝐹, and that of its sensitive counterfactual, 𝑧^𝐢𝐹, i.e.,

𝑇𝐷 = 𝑧^𝐢𝐹 βˆ’ 𝑧^𝐹,  where 𝑧^𝐢𝐹 = 𝑍(𝑠^𝐢𝐹, 𝑋(𝑠^𝐢𝐹)).  (1)

Here, 𝑍(𝑠, π‘₯) and 𝑋(𝑠) denote the causal functionsΒ² that define the values of the random variables 𝑍 and 𝑋, respectively, given their causal parents according to the causal graph in Figure 1b. Equation 1 quantifies the total discrimination. To disentangle the effect of the sensitive attribute through the different pathways, we follow the path-specific approach in [32] and rewrite it as

𝑇𝐷 = 𝐷𝑇𝐷 + 𝐼𝑇𝐷,  where 𝐷𝑇𝐷 = 𝑧^𝐢𝐹 βˆ’ 𝑍(𝑠^𝐹, 𝑋(𝑠^𝐢𝐹)) and 𝐼𝑇𝐷 = 𝑍(𝑠^𝐹, 𝑋(𝑠^𝐢𝐹)) βˆ’ 𝑧^𝐹.  (2)

Β² We make implicit the dependence of the causal functions on the exogenous variables. In addition, we assume the absence of hidden confounders or, equivalently, causal sufficiency. For more details on structural causal models, we refer to [35].

RQ 2, Effect of discrimination in treatment (𝑍) on the outcome (π‘Œ): This is simply the ratio between the repayment oddsΒ³ of a factual applicant (𝑠^𝐹, π‘₯^𝐹, 𝑧^𝐹) had it received the treatment of its sensitive counterfactual (e.g., giving the treatment decision 𝑍 of the male counterfactual to the factual female) and its repayment odds under the factual treatment. We refer to this as the Treatment Discrimination Effect (TDE), which we compute as

𝑇𝐷𝐸 = odds(𝑝^𝐢𝐹) / odds(𝑝^𝐹) = exp[π‘Œ(𝑠^𝐹, π‘₯^𝐹, 𝑧^𝐢𝐹) βˆ’ π‘Œ(𝑠^𝐹, π‘₯^𝐹, 𝑧^𝐹)],  (3)

where the repayment probability of an individual (𝑠, π‘₯, 𝑧) is given by 𝑝 = 𝑃(π‘Œ = 1 | 𝑠, π‘₯, 𝑧) = 𝜎(π‘Œ(𝑠, π‘₯, 𝑧)), with 𝜎(Β·) denoting the logistic function.

Β³ The repayment odds refer to the likelihood that a borrower will successfully repay a loan [36] and are computed as odds(𝑝) = 𝑝/(1 βˆ’ 𝑝).

Importantly, we can interpret the TDE values as follows: a) if TDE ≀ 1, then the odds of 𝑠^𝐹 repaying the credit are equal or lower, meaning there is no negative downstream effect of the treatment on the outcome; b) otherwise, if TDE > 1, 𝑠^𝐹 is more likely to repay the credit than its sensitive counterfactual 𝑠^𝐢𝐹, and in this case we consider 𝑠^𝐢𝐹 to have been subject to discrimination.
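Before turning to the case study, the following is a minimal sketch of how Equations (1)–(3) can be computed via the abduction-action-prediction steps, assuming, purely for illustration, the same linear additive-noise mechanisms as in the simulation sketch above. The helper counterfactual_quantities and the coefficients a–e are hypothetical; in practice the coefficients would be estimated from data (e.g., by least squares), and this is a sketch under those assumptions, not the authors' implementation.

```python
# Sketch of the abduction-action-prediction computation behind Equations
# (1)-(3), assuming the illustrative linear additive-noise model
#   X = a*S + U_x,  Z = b*X + c*S + U_z,  Y = d*X + e*Z + U_y,
# with Y on the log-odds scale. Coefficients are hypothetical placeholders.
import numpy as np

a, b, c, d, e = 1.0, 0.8, 0.5, 0.6, -0.4

def counterfactual_quantities(s_f, x_f, z_f):
    """Return TD, DTD, ITD and TDE for one factual applicant (s^F, x^F, z^F)."""
    s_cf = 1 - s_f                        # sensitive counterfactual of a binary S

    # Abduction: recover the exogenous noise of X and Z from the factual values.
    u_x = x_f - a * s_f
    u_z = z_f - b * x_f - c * s_f

    # Action and prediction: set S := s^CF and propagate it downstream.
    x_cf = a * s_cf + u_x                 # X(s^CF)
    z_cf = b * x_cf + c * s_cf + u_z      # z^CF = Z(s^CF, X(s^CF))
    z_mix = b * x_cf + c * s_f + u_z      # Z(s^F, X(s^CF)), used to split TD

    td = z_cf - z_f                       # Eq. (1): total treatment discrimination
    dtd = z_cf - z_mix                    # Eq. (2): direct treatment discrimination
    itd = z_mix - z_f                     # Eq. (2): indirect treatment discrimination

    # Eq. (3): the exogenous noise of Y cancels in the difference of log-odds.
    y_factual = d * x_f + e * z_f         # Y(s^F, x^F, z^F), up to U_y
    y_cf_treat = d * x_f + e * z_cf       # Y(s^F, x^F, z^CF), up to U_y
    tde = np.exp(y_cf_treat - y_factual)  # TDE = odds(p^CF) / odds(p^F)
    return td, dtd, itd, tde

# Example: a factual applicant with s^F = 0 (e.g., female), x^F = 0.2, z^F = 0.5.
print(counterfactual_quantities(0, 0.2, 0.5))
```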
3. A Case Study Using German Credit

We analyze the German Credit dataset [37], using loan amount and duration as treatment decisions. Assuming additive-noise linear causal functions, we learn the parameters of our causal model. For RQ 1, we measure the discrimination in treatment along the different pathways. Our results align with the existing literature [23, 25, 27] and reveal that discrimination exists in treatment: Table 1 shows that males receive, on average, an increase of 10% and 20% in duration and credit amount, respectively. While this may allow extending the loan over a longer period, it also means paying interest over the life of the loan and poses a higher risk of defaulting [38]. On the other hand, females receive, on average, a significantly lower credit amount and a shorter duration.

Our second research question (RQ 2) asks how discrimination in treatment propagates to the outcome. Following our analysis, hypothetically treating a female applicant (factual) as her male counterfactual would have been treated decreases the repayment odds by 9% on average. Conversely, hypothetically treating a male applicant (factual) as his female counterfactual would have been treated increases the repayment odds by 10% on average. Thus, we conclude that even though males receive preferential treatment, with higher loan amounts and longer durations, this treatment has a negative downstream effect on their ability to pay back the loan. As such, the disparity in treatment across groups puts male borrowers in a higher-risk situation.

Table 1
German Credit: treatment discrimination and its downstream effects. We report both the discrimination measures and the transformed values of their odds ratios obtained by setting 𝑠^𝐹 to its 𝑠^𝐢𝐹 (mean: πœ‡, standard deviation: 𝜎).

Measure | Path            | Duration: πœ‡(𝜎) | Amount: πœ‡(𝜎)   | Repayment Odds: πœ‡(𝜎)
DTD     | 𝑠 β†’ 𝑧 β†’ 𝑦       | 0.068 (0.015)  | 0.1737 (0.028) | βˆ’0.041 (0.006)
ITD     | 𝑠 β†’ π‘₯ β†’ 𝑧 β†’ 𝑦   | 0.0941 (0.024) | 0.059 (0.035)  | βˆ’0.053 (0.014)
TD      | both            | 0.162 (0.022)  | 0.232 (0.057)  | βˆ’0.094 (0.008)
TDE     | Male β†’ Female   | ↓ 9%           | ↓ 16%          | ↑ 10%
TDE     | Female β†’ Male   | ↑ 10%          | ↑ 20%          | ↓ 9%
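As a quick sanity check on how Table 1 relates to the reported percentages, and assuming (our reading, not stated explicitly in the table) that the Repayment Odds column reports mean differences in the log-odds π‘Œ of Equation (3), exponentiating the mean TD value recovers the TDE percentages for the repayment odds:

```python
# Relating the mean TD effect on the repayment log-odds in Table 1 to the TDE
# percentages, assuming the "Repayment Odds" column reports mean differences
# in Y (log-odds), as in Equation (3). This reading is our assumption.
import math

td_repayment_log_odds = -0.094                      # TD row, Repayment Odds column

female_to_male = math.exp(td_repayment_log_odds)    # ~0.91 -> odds decrease by ~9%
male_to_female = math.exp(-td_repayment_log_odds)   # ~1.10 -> odds increase by ~10%

print(f"Female -> Male: {100 * (female_to_male - 1):+.0f}%")   # about -9%
print(f"Male -> Female: {100 * (male_to_female - 1):+.0f}%")   # about +10%
```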
4. Open questions

Our analysis has shown that there is discrimination in treatment and that it propagates to the predicted outcome. Our study provides a compelling argument for rethinking the entire fair machine-learning pipeline. Although we restricted our analysis to the German Credit dataset, the implications extend to various other domains such as criminal justice, hiring, and education. Furthermore, ensuring fairness in the observed outcomes π‘Œ may be inadequate to mitigate bias, as π‘Œ is still a function of biased treatment. This could also lead to a never-ending feedback loop and, in the worst case, worsen the financial situation of discriminated groups. These results prompt us to question the current framework and raise several open questions: Is there a need to develop new notions of fairness, considering that π‘Œ remains a composition of an unfair 𝑍? What does designing a fair policy for 𝑍 entail? What types of datasets are necessary to ensure fairness in non-binary decision-making processes?

5. Acknowledgements

This work has been funded by the European Union (ERC-2021-STG, SAML, 101040177). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] E. Faliagka, K. Ramantas, A. Tsakalidis, G. Tzimas, Application of machine learning algorithms to an online recruitment system, in: International Conference on Internet and Web Applications and Services, 2012.
[2] J. Silas, P. Udhan, P. Dahiphale, V. Parkale, P. Lambhate, Automation of candidate hiring system using machine learning, International Journal of Innovative Science and Research Technology (2023).
[3] A. A. Mahmoud, T. A. Shawabkeh, W. A. Salameh, I. Al Amro, Performance predicting in hiring process and performance appraisals using machine learning, in: 10th International Conference on Information and Communication Systems, 2019.
[4] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine bias, in: Ethics of Data and Analytics, Auerbach Publications, 2022.
[5] W. Dieterich, C. Mendoza, T. Brennan, COMPAS risk scales: Demonstrating accuracy equity and predictive parity, Northpointe Inc. (2016).
[6] M. Hamilton, The sexist algorithm, Behavioral Sciences & the Law (2019).
[7] R. Berk, H. Heidari, S. Jabbari, M. Kearns, A. Roth, Fairness in criminal justice risk assessments: The state of the art, Sociological Methods & Research (2021).
[8] A. S. Almheiri, Automated loan approval system for banks, 2023.
[9] S. Lessmann, B. Baesens, H.-V. Seow, L. C. Thomas, Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research, European Journal of Operational Research (2015).
[10] D. Tripathi, D. R. Edla, A. Bablani, A. K. Shukla, B. R. Reddy, Experimental analysis of machine learning methods for credit score classification, Progress in Artificial Intelligence (2021).
[11] V. Moscato, A. Picariello, G. SperlΓ­, A benchmark of machine learning approaches for credit score prediction, Expert Systems with Applications (2021).
[12] J. Sirignano, A. Sadhwani, K. Giesecke, Deep learning for mortgage risk, arXiv preprint arXiv:1607.02470 (2016).
[13] T. Scantamburlo, A. Charlesworth, N. Cristianini, Machine decisions and human consequences, arXiv preprint arXiv:1811.06747 (2018).
[14] A. Coston, Principled Machine Learning for Societally Consequential Decision Making, Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, 2023.
[15] C. Hurlin, C. PΓ©rignon, S. Saurin, The fairness of credit scoring models, arXiv preprint arXiv:2205.10200 (2022).
[16] M. Rajesh, A. Lakshmanarao, C. Gupta, An efficient machine learning classification model for credit approval, in: Third International Conference on Artificial Intelligence and Smart Energy, 2023.
[17] K. Bhatt, P. Sharma, M. Verma, K. Agarwal, Loan status prediction in the banking sector using machine learning, in: International Conference on Computational Intelligence, Communication Technology and Networking, 2023.
[18] H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, S. Mullainathan, The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[19] R. Dobbe, S. Dean, T. Gilbert, N. Kohli, A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics, arXiv preprint arXiv:1807.00553 (2018).
[20] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, S. Mullainathan, Human decisions and machine predictions, The Quarterly Journal of Economics (2018).
[21] S. Mitchell, E. Potash, S. Barocas, A. D’Amour, K. Lum, Algorithmic fairness: Choices, assumptions, and definitions, Annual Review of Statistics and Its Application (2021).
[22] B. Green, L. Hu, The myth in the methodology: Towards a recontextualization of fairness in machine learning, in: Proceedings of the Machine Learning: The Debates workshop, 2018.
[23] I. Agier, A. Szafarz, Microfinance and gender: Is there a glass ceiling on loan size?, World Development (2013).
[24] C. L. Escalante, A. Osinubi, C. Dodson, C. E. Taylor, Looking beyond farm loan approval decisions: Loan pricing and nonpricing terms for socially disadvantaged farm borrowers, Journal of Agricultural and Applied Economics (2018).
[25] A. F. Alesina, F. Lotti, P. E. Mistrulli, Do women pay more for credit? Evidence from Italy, Journal of the European Economic Association (2013).
[26] D. Aristei, M. Gallo, Are female-led firms disadvantaged in accessing bank credit? Evidence from transition economies, International Journal of Emerging Markets (2022).
[27] Y. Li, Gender differences in car loan access: An empirical analysis, in: Proceedings of the 12th International Conference on E-business, Management and Economics, 2021.
[28] A. Cozarenco, A. Szafarz, Women’s access to credit in France: How microfinance institutions import disparate treatment from banks, Available at SSRN 2387573 (2014).
[29] I. Agier, A. Szafarz, Credit to women entrepreneurs: The curse of the trustworthier sex, Available at SSRN 1718574 (2010).
[30] A. Fuster, P. Goldsmith-Pinkham, T. Ramadorai, A. Walther, Predictably unequal? The effects of machine learning on credit markets, The Journal of Finance (2022).
[31] D. Plecko, E. Bareinboim, Causal fairness analysis, arXiv preprint arXiv:2207.11385 (2022).
[32] S. Chiappa, Path-specific counterfactual fairness, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[33] M. J. Kusner, J. Loftus, C. Russell, R. Silva, Counterfactual fairness, Advances in Neural Information Processing Systems (2017).
[34] J. Pearl, Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge, UK, 2000.
[35] J. Pearl, M. Glymour, N. P. Jewell, Causal Inference in Statistics: A Primer, John Wiley & Sons, 2016.
[36] M. Szumilas, Explaining odds ratios, Journal of the Canadian Academy of Child and Adolescent Psychiatry (2010).
[37] H. Hofmann, Statlog (German Credit Data) data set, UCI Repository of Machine Learning Databases (1994).
[38] Z. Guo, Y. Zhang, X. S. Zhao, Risks of long-term auto loans, Journal of Credit Risk (2022).