A benchmark of credit score prediction using Machine Learning

Vincenzo Moscato, Giancarlo Sperlì

Department of Electrical Engineering and Information Technology (DIETI), University of Naples "Federico II", Via Claudio 21, 80125, Naples

vincenzo.moscato@unina.it

Keywords: Credit Score Prediction, Benchmark, Machine Learning, Explainable Artificial Intelligence

One of the most relevant financial services is credit risk assessment, whose aim is to support financial institutions in defining their policies and strategies. In recent years, traditional credit risk services have been disrupted by the rise of Social Lending Platforms. This paper reports an experimental analysis relying on different machine learning models to deal with credit risk in social lending platforms. To this end, we use a real-world dataset, composed of 877,956 samples, to compare our results with state-of-the-art baselines and benchmarks, also evaluating the explainability of the three best models using different well-known XAI tools. Hence, the proposed study aims to design both effective and explainable credit risk models.

Ital-IA 2023: 3rd National Conference on Artificial Intelligence. CEUR Workshop Proceedings.
∗Corresponding author.

1. Introduction

In recent years, the pervasive use of Artificial Intelligence (AI) models has improved effectiveness in several application domains, including the financial sector. Several financial services have benefited from the introduction of artificial intelligence-based models, defining a new generation of financial technology (FinTech)-based systems, which have enabled a range of services such as lending, payment, risk and regulatory management [1, 2]. One of the main challenges is therefore the large amount of data produced by digital financial services; in fact, the financial transactions processed per day have reached a value of 14 trillion, generating a 12% increase in global payments revenue over the last two years, which reached 1.9 trillion dollars in 2018 [3].

In particular, researchers and practitioners have been increasingly interested in defining AI-based methodologies that jointly increase revenues and minimize the associated risks, leading to new opportunities and challenges, as discussed in [4]. The Basel Committee on Banking Supervision (BCBS) has classified banking risks into three categories, namely credit, market, and operational risks. According to [5], credit risks account for approximately 60% of banks' risks. [...] which is mainly due to the rise of Social Lending Platforms. [...] although they do not properly cover non-linear effects among different variables.

This paper is an extended abstract of our previous study [14], where we designed a benchmark of machine learning models for credit scoring prediction, whose results have been compared with the state-of-the-art ones. In particular, the credit scoring task has been designed as a binary problem corresponding to the decision whether to grant a loan or not on Social Lending platforms. The results have been investigated using several sampling strategies for dealing with the class imbalance in these datasets and different measures, also using eXplainable Artificial Intelligence (XAI) tools to explain the predictions of the analyzed machine learning models.
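As a minimal illustration of the sampling strategies mentioned above, the following sketch shows random under-sampling and random over-sampling on a toy binary loan data set. The function names, the label encoding (1 for a repaid loan, 0 for a defaulted one) and the toy data are assumptions made for this example; it uses only the Python standard library and is not the implementation used in [14].

```python
import random
from collections import Counter


def _group_by_label(rows, labels):
    """Collect rows per class label, preserving insertion order."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows so that every class shrinks to the minority size."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows so that every class grows to the majority size."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 77 repaid loans (label 1) and 23 defaulted loans (label 0).
rows = [[float(i)] for i in range(100)]
labels = [1] * 77 + [0] * 23

under_rows, under_labels = random_under_sample(rows, labels)
over_rows, over_labels = random_over_sample(rows, labels)
print(Counter(under_labels))  # class counts after under-sampling
print(Counter(over_labels))   # class counts after over-sampling
```

Under-sampling discards information from the majority class, while over-sampling repeats minority samples; both should be applied to the training split only, so that the test data keeps its original class distribution.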

2. Methodology

The proposed benchmark is designed to deal with the credit risk prediction task, with the aim to support investors in evaluating potential borrowers on social lending platforms. In particular, members registered on these platforms complete a detailed application regarding their financial history and the reason for seeking a loan, without the involvement of financial intermediaries. Lenders can earn higher returns than what is typically offered through banks' savings and investment products, while borrowers can access funds at lower interest rates.

Figure 1 shows the three main components in the benchmark testbed: ingestion, classification and explanation.

Figure 1: Benchmark Testbed

The ingestion module is responsible for crawling data from social lending platforms, and for performing data cleaning and feature selection on the basis of the chosen classifier. First, data is cleaned by removing features with a significant number of missing or null values, as well as zero-variance attributes. Next, several transformations are applied to the dataset, such as converting categorical features into numeric ones and changing date attributes into numerical values. Additionally, a correlation analysis is conducted with respect to the loan status to gain a better understanding of the data and their attribute trends.

The second component is responsible for credit prediction for a given user, which is affected by the class imbalance problem, a typical issue in Social Lending platforms. This imbalance arises from the high number of rejected loans compared to those that are requested.

For the classification stage, we selected three of the most effective and commonly used classifiers for credit score prediction [15, 16, 17, 18, 19].

Furthermore, we train the chosen machine learning models with different sampling strategies to address data imbalance issues: random under-sampling and over-sampling, that respectively [...], and smoothing.

The third module compares different XAI techniques in order to explain the prediction outcomes and highlight how decisions are made. In particular, we compared five different XAI tools: LIME [20], Anchors [21], SHapley Additive exPlanations (SHAP) [22], Balanced English Explanations of Forecasts (BEEF) [23] and Local Rule-Based Explanations (LORE) [24].

3. Experimental Evaluation

In this section, we describe the analysis carried out to evaluate the effectiveness of different classification models on the basis of several sampling strategies and evaluation metrics.

In particular, we used a dataset from a real-world Social Lending platform, named Lending Club, including 877,956 samples and 151 features, with the target class for our problem being the loan status. As suggested by previous research [15, 19], we used the values of the loan status, which are presented in Table 1.

Table 1: Data-set characterization

Loan Status            Samples number
Current                395,901
Fully Paid             354,994
Charged Off            107,384
Late (31-120 days)     12,550
In-grace period        4,703
Late (16-30 days)      2,393
Default                31
Total                  877,956

We only included the "Fully Paid" and "Charged Off" labels because we are interested in predicting whether a loan would be paid back or not. Under this assumption, we obtain an imbalanced dataset, in which 77% and 23% of samples are fully paid and charged off, respectively. Furthermore, we perform a 10-fold cross-validation, in which we split the dataset according to a 75:25 ratio for each fold, computing the mean and standard deviation for each classifier during the training process.

The best results have been compared with the ones in [19, 25] on the basis of several metrics (Precision, FP-Rate, Area Under Curve (AUC), accuracy (ACC), Sensitivity (TPR), Specificity (TNR), and G-mean).

The analysis was run on Google Colab (https://colab.research.google.com/), a Platform-as-a-Service (PaaS) providing 12 GB of RAM and a Tesla K80 with 2496 CUDA cores, with a software stack composed of Python 3.6 and scikit-learn 0.23.1 (https://scikit-learn.org/stable/index.html).

4. Results

In this section, we compare our best results, shown in Table 2, with the best ones in [19] and [25].

Our RF-RUS configuration, shown in Table 2, achieves lower accuracy than the best outcome in [19] (Table 3), while its AUC (0.717) and Specificity (0.68) values are higher than the best results in [19]. Furthermore, our aim is to reduce the number of false positives because their misclassification cost is higher than that of misclassifying good loans [26]. On the other hand, the results in Table 4 show higher specificity than ours, but lower sensitivity.

Furthermore, we investigate the explanation of individual predictions by randomly selecting a group of features (25% of the total) that were considered "untrustworthy", being unrecognized by users. An oracle has been designed for each combination of the chosen features to label the test set, classifying a prediction as "untrustworthy" if it changed when the untrustworthy features were removed from the instance (simulating human discounting), and as trustworthy otherwise. Finally, we evaluate the test set predictions through different explanation methods, whose results are compared with the trustworthiness oracles (see Table 5) over 10 random samplings from the dataset.

It is worth noting in Table 5 that LORE achieves the highest outcomes among the compared tools by combining local predictions and counterfactual explanations, providing user-friendly explanations of which features affect changes in predictions. In turn, LIME achieves higher coverage because it describes each prediction as a weighted sum, while SHAP provides more reliable outcomes through the use of SHAP values, whose high computational cost can be addressed by using several heuristics. Finally, BEEF and Anchors suffer from limited expressive power, being based on rules.
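The trustworthiness oracle described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the toy linear classifier, the helper names (`predict`, `pick_untrustworthy`, `oracle_label`) and the choice of "removing" a feature by replacing it with a neutral value of 0.0 are all hypothetical, since the text does not specify how features are removed or which model interface is used.

```python
import random


def predict(instance, weights, bias=0.0):
    """Toy linear classifier (assumption): returns 1 if the score is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, instance))
    return 1 if score > 0 else 0


def pick_untrustworthy(n_features, fraction=0.25, seed=0):
    """Randomly mark a fraction of the feature indices as untrustworthy."""
    rng = random.Random(seed)
    k = max(1, round(n_features * fraction))
    return set(rng.sample(range(n_features), k))


def oracle_label(instance, untrusted, weights, neutral=0.0):
    """Label a prediction 'untrustworthy' if it changes once the
    untrustworthy features are masked with a neutral value."""
    masked = [neutral if i in untrusted else x for i, x in enumerate(instance)]
    changed = predict(instance, weights) != predict(masked, weights)
    return "untrustworthy" if changed else "trustworthy"


# Hypothetical weights and instances, with feature 0 fixed as untrustworthy
# so that the outcome of the example is deterministic.
weights = [2.0, -1.0, 0.5, 1.5]
untrusted = {0}
label_a = oracle_label([1.0, 1.0, 0.0, 0.0], untrusted, weights)
label_b = oracle_label([0.0, 0.0, 1.0, 1.0], untrusted, weights)
print(label_a, label_b)

# Random selection of 25% of eight features, as in the protocol above.
sampled = pick_untrustworthy(n_features=8)
print(sorted(sampled))
```

The first instance depends on the masked feature for its positive prediction, so the oracle flags it; the second does not use that feature at all. An explanation method can then be scored by checking whether the features it highlights agree with these oracle labels.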

5. Conclusion

Predicting credit risk is a relevant challenge in the finance industry, particularly in Social Lending platforms, where high dimensionality and imbalanced data present unique challenges. This study proposes a benchmark for evaluating the effectiveness of machine learning techniques for credit risk prediction in real-world social lending platforms, with a focus on managing imbalanced data sets and ensuring explainability.

Future work will focus on considering additional Social Lending platforms, and on designing novel techniques, such as deep learning and ensemble strategies, that may offer improved performance (see [27]) although they are less explainable.

Table residue (table number and column headers not recoverable): TNR 0.582, 0.650; FP-Rate 0.420; G-Mean 0.65; Accuracy 0.6920.
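For reference, the threshold-based metrics used in the benchmark (Precision, FP-Rate, Accuracy, Sensitivity, Specificity and G-mean) all derive from the four counts of a binary confusion matrix. The sketch below shows these standard definitions; the function name and the counts are made up for illustration and are not taken from the reported tables. AUC is omitted because it requires ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (TPR)
    specificity = tn / (tn + fp)  # true negative rate (TNR)
    return {
        "precision": tp / (tp + fp),
        "fp_rate": fp / (fp + tn),  # equals 1 - specificity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=80, fp=10, tn=40, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because G-mean is the geometric mean of sensitivity and specificity, it drops sharply when either class is poorly recognized, which makes it more informative than accuracy on imbalanced data.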

[5] K. Buehler, A. Freeman, R. Hulme, The new arsenal of risk management, Harvard Business Review 86 (2008) 93–100.

[6] A. B. Hens, M. K. Tiwari, Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sam

[1] Murinde, E. Rizopoulos, Zachariadis, The impact of the fintech revolution on the future of banking: Opportunities and risks, International Review of Financial Analysis 81 (2022) 102103. doi:https://doi.org/10.1016/j.irfa.2022.102103.

[2] Luo, Sun, Yang, G. Zhou, Does fintech innovation promote enterprise transformation? Evidence from China, Technology in Society 68 (2022) 101821. doi:https://doi.org/10.1016/j.techsoc.2021.101821.

[3] McKinsey, Global Payments Report 2019, https://www.mckinsey.com/~/media/mckinsey/industries/financial%20services/our%20insights/tracking%20the%20sources%20of%20robust%20payments%20growth%20mckinsey%20global%20payments%20map/global-payments-report-2019-amid-sustaine