=Paper=
{{Paper
|id=Vol-2029/paper23
|storemode=property
|title=Analysis and Comparison of Machine Learning Classification Models Applied to Credit Approval
|pdfUrl=https://ceur-ws.org/Vol-2029/op2.pdf
|volume=Vol-2029
|authors=Jorge Alarcón Flores,Jiam Lopez Malca,Luiz Ruiz Saldarriaga,Christian Sarmiento Román
|dblpUrl=https://dblp.org/rec/conf/simbig/FloresMSR17
}}
==Analysis and Comparison of Machine Learning Classification Models Applied to Credit Approval==
Jorge Alarcón Flores, Jiam Lopez Malca, Luiz Ruiz Saldarriaga and Christian Sarmiento Román

Maestría en Informática, Pontificia Universidad Católica del Perú, Lima, Perú

brian.alarcon@pucp.edu.pe, jiam.lopez@pucp.edu.pe, lruizs@pucp.pe, cwsarmiento@pucp.edu.pe

Abstract

The credit granting decision is one of the most important processes of a financial institution, and on its accuracy rests the performance of several critical business KPIs, such as loan volume, credit recovery levels and non-performing loan ratios. In the last decade, the development of technologies such as AI and machine learning has made it possible to automate this process. The main goal of this paper is the analysis of credit granting predictions, contributing to current knowledge on this issue by giving an objective explanation of the results and suggesting further research to improve the results obtained with existing mathematical algorithms. The experimentation determined that the best model was Gradient Boosting, with an accuracy of 83.71%.

1 Introduction

Credit institutions must establish efficient schemes for the management and control of the credit risk to which they are exposed in their business, in accordance with their own risk profile and market segmentation, and according to the characteristics of the markets in which they operate and the products they offer. To do this, it is necessary to adopt a research and analysis procedure, reflected in a credit scoring or credit risk assessment. This is considered one of the most important processes of any financial institution: it is the customer's first step in the credit admission process and can give the institution a competitive advantage within its sector.

The problem, therefore, is to build a mathematical model based on machine learning that accurately predicts the qualification of a client (good or bad), providing an additional element for the final decision of granting or not granting a credit.

2 Experimentation Design

2.1 Dataset Description

The original dataset consists of 1000 observations and 20 attributes, in addition to the class label. The training sample contains 750 observations and the test sample 250. The target takes two values: 1 if the credit is approved and 2 if the credit is rejected. The dataset is clearly unbalanced: there are 700 instances with class 1 (credit accepted) and 300 with class 2 (credit rejected).

2.2 Data Preparation Strategy

One of the first problems to be faced is preparing the data for training and validating the different learning models. The following points must be resolved:

Dataset balancing strategy: After the first experiments, the SMOTE oversampling technique was adopted as a more efficient alternative. Its advantage is that it generates new observations according to the distribution of the characteristics of the dataset. In this way the dataset was balanced to 1400 instances (700 of each class).

Analysis and categorization of existing characteristics: For the numerical variables, only two were converted. For the categorical variables, one-hot encoding was applied to transform non-ordinal data into binary numerical data. This multiplies the number of characteristics from the 20 of the original dataset to 62.

Algorithms to be used: The models selected for the tests are: Logistic Regression, Neural Networks, Support Vector Machine, Random Forest and Gradient Boosting.

Selection and justification of the quality measure: Two quality measures were selected to compare the validity and effectiveness of the different models. The first quality indicator is the accuracy of each model; the second is the ROC curve.

Selection of the most relevant characteristics for learning the model: Once the best classification model was selected, an analysis of variable importance was performed in order to find the characteristics with the greatest contribution to the prediction of credit risk.
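The paper does not publish its implementation, so the following is a minimal sketch of the preparation and evaluation pipeline described in Section 2.2, assuming scikit-learn and imbalanced-learn. The file name credit_data.csv, the target column name and all hyperparameter values are illustrative placeholders, not the authors' actual settings.

```python
# Minimal sketch of the Section 2.2 pipeline (assumed implementation).
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("credit_data.csv")                  # hypothetical file name
y = df["target"]                                     # 1 = credit approved, 2 = credit rejected
X = pd.get_dummies(df.drop(columns="target"))        # one-hot encoding of categorical attributes

# 750 / 250 split as reported in Section 2.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=750, stratify=y, random_state=0)

# SMOTE synthesises minority-class observations from the feature distribution.
# The paper reports a balanced set of 700 instances per class; here SMOTE is
# applied to the training portion only, keeping the test set untouched.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# The five classification models listed in Section 2.2 (illustrative settings).
models = {
    "Logistic Regression":   LogisticRegression(max_iter=1000),
    "Neural Network":        MLPClassifier(max_iter=1000),
    "SVM (Gaussian kernel)": SVC(kernel="rbf", probability=True),
    "Random Forest":         RandomForestClassifier(n_estimators=200),
    "Gradient Boosting":     GradientBoostingClassifier(n_estimators=200),
}

# The two quality measures: accuracy and the area under the ROC curve.
for name, model in models.items():
    model.fit(X_bal, y_bal)
    pred = model.predict(X_test)
    score = model.predict_proba(X_test)[:, 1]        # P(class 2), the label roc_auc_score treats as positive
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.4f}, "
          f"ROC AUC={roc_auc_score(y_test, score):.4f}")
```

Applying SMOTE only to the training split is a design choice of this sketch: it keeps synthetic observations out of the evaluation, whereas the paper itself reports a balanced set of 1400 instances (700 per class).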
3 Experimentation and Results

This section serves as a baseline and reference point for the presentation of the results of the present study. The results and the comparison of the selected mathematical models are shown in Table 1.

Table 1: Comparison of the results of the models applied.

4 Results Discussion

In the experimentation, the most accurate model was the SVM with a Gaussian kernel. However, there are doubts about accepting this result, because the number of observations is limited compared to the number of characteristics of the training dataset, and it is therefore believed that there is a risk of overfitting. The Gradient Boosting model is the one that showed the best regularity during the experiments. Figure 1 shows its learning curve, which reaches its smallest error at 200 nodes.

Figure 1: Learning curve with Gradient Boosting.

5 Conclusions

The model that behaved most stably across the different experimental scenarios was Gradient Boosting, with an accuracy of 83.71% and a ROC value of 0.834. The most important characteristics for the Gradient Boosting model were the account balance, the duration of the credit, the credit amount, the length of current employment and the age. All of these variables coincide with those that experts in the financial sector consider relevant.
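As an illustration of the variable-importance analysis and the learning-curve inspection mentioned above, the following sketch continues from the preparation code in Section 2.2 (it reuses the fitted models dictionary, X_test and y_test); it is an assumed scikit-learn workflow, not the authors' code.

```python
# Continues from the Section 2.2 sketch: reuses models, X_test and y_test.
import numpy as np

gb = models["Gradient Boosting"]

# Top contributing characteristics; the Conclusions report account balance,
# credit duration, credit amount, length of current employment and age.
importances = sorted(zip(X_test.columns, gb.feature_importances_),
                     key=lambda item: item[1], reverse=True)
for feature, weight in importances[:5]:
    print(f"{feature}: {weight:.3f}")

# Test error after each boosting stage, a proxy for the learning curve of Figure 1.
stage_error = [np.mean(pred != y_test) for pred in gb.staged_predict(X_test)]
best_stage = int(np.argmin(stage_error)) + 1
print(f"smallest test error {min(stage_error):.4f} at stage {best_stage}")
```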