=Paper=
{{Paper
|id=Vol-2029/paper23
|storemode=property
|title=Analysis and Comparison of Machine Learning Classification Models Applied to Credit Approval
|pdfUrl=https://ceur-ws.org/Vol-2029/op2.pdf
|volume=Vol-2029
|authors=Jorge Alarcón Flores,Jiam Lopez Malca,Luiz Ruiz Saldarriaga,Christian Sarmiento Román
|dblpUrl=https://dblp.org/rec/conf/simbig/FloresMSR17
}}
==Analysis and Comparison of Machine Learning Classification Models Applied to Credit Approval==
Jorge Alarcón Flores, Jiam Lopez Malca, Luiz Ruiz Saldarriaga and Christian Sarmiento Román

Maestría en Informática, Pontificia Universidad Católica del Perú, Lima, Perú

brian.alarcon@pucp.edu.pe, jiam.lopez@pucp.edu.pe, lruizs@pucp.pe, cwsarmiento@pucp.edu.pe

Abstract

The credit granting decision is one of the most important processes of a financial institution, and on its accuracy rests the performance of several critical business KPIs, such as loan volume, credit recovery levels and non-performing loan ratios. In the last decade, the development of technologies such as AI and machine learning has made it possible to automate this process. The main goal of this paper is the analysis of credit granting predictions, contributing to current knowledge on this issue by giving an objective explanation of the results and suggesting further research to improve the results obtained with existing mathematical algorithms. The experimentation determined that the best model was Gradient Boosting, with an accuracy of 83.71%.

1 Introduction

Credit institutions must establish efficient schemes for the management and control of the credit risk to which they are exposed in their business, in accordance with their own risk profile and market segmentation, and according to the characteristics of the markets in which they operate and the products they offer. To do this, it is necessary to adopt a research and analysis procedure, reflected in a credit scoring or credit risk assessment. This is considered one of the most important processes of any financial institution: it is the customer's first step in the credit admission process and can give the institution a competitive advantage within its sector.

The problem, therefore, is to build a mathematical model based on machine learning that accurately predicts the qualification of a client (good or bad), providing an additional element for the final decision of granting or not granting a credit.

2 Experimentation Design

2.1 Dataset Description

The original dataset consists of 1000 observations and 20 attributes, in addition to the class label. The training sample contains 750 observations and the test sample 250. The target takes two values: 1 if the credit is approved and 2 if the credit is rejected. The dataset is clearly unbalanced: there are 700 instances with class 1 (credit accepted) and 300 with class 2 (credit rejected).

2.2 Data Preparation Strategy

One of the first problems to be faced is preparing the data for training and validating the different learning models. The following points must be resolved:

Dataset balancing strategy: After the first experiments, the SMOTE oversampling technique was adopted as a more efficient alternative. Its advantage is that it generates new observations according to the distribution of the characteristics of the dataset. In this way the dataset was balanced to 1400 instances (700 of each class).

Analysis and categorization of existing characteristics: For the numerical variables, only two were converted. For the categorical variables, one-hot encoding was applied to transform non-ordinal data into binary numerical data. This multiplies the number of characteristics from the 20 of the original dataset to 62.

Algorithms to be used: The models selected for the tests are: Logistic Regression, Neural Networks, Support Vector Machine, Random Forest and Gradient Boosting.

Selection and justification of the quality measure: Two quality measures were selected to compare the validity and effectiveness of the different models. The first quality indicator is the accuracy of each model; the second is the ROC curve.

Selection of the most relevant characteristics for learning the model: Once the best classification model was selected, an analysis of variable importance was performed in order to find the characteristics with the greatest contribution to the prediction of credit risk.
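The paper does not publish its implementation, so the following is a minimal sketch of the preparation and evaluation pipeline described in Section 2.2, assuming scikit-learn and imbalanced-learn. The file name credit_data.csv, the target column name and all hyperparameter values are illustrative placeholders, not the authors' actual settings.

```python
# Minimal sketch of the Section 2.2 pipeline (assumed implementation).
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("credit_data.csv")                  # hypothetical file name
y = df["target"]                                     # 1 = credit approved, 2 = credit rejected
X = pd.get_dummies(df.drop(columns="target"))        # one-hot encoding of categorical attributes

# 750 / 250 split as reported in Section 2.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=750, stratify=y, random_state=0)

# SMOTE synthesises minority-class observations from the feature distribution.
# The paper reports a balanced set of 700 instances per class; here SMOTE is
# applied to the training portion only, keeping the test set untouched.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# The five classification models listed in Section 2.2 (illustrative settings).
models = {
    "Logistic Regression":   LogisticRegression(max_iter=1000),
    "Neural Network":        MLPClassifier(max_iter=1000),
    "SVM (Gaussian kernel)": SVC(kernel="rbf", probability=True),
    "Random Forest":         RandomForestClassifier(n_estimators=200),
    "Gradient Boosting":     GradientBoostingClassifier(n_estimators=200),
}

# The two quality measures: accuracy and the area under the ROC curve.
for name, model in models.items():
    model.fit(X_bal, y_bal)
    pred = model.predict(X_test)
    score = model.predict_proba(X_test)[:, 1]        # P(class 2), the label roc_auc_score treats as positive
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.4f}, "
          f"ROC AUC={roc_auc_score(y_test, score):.4f}")
```

Applying SMOTE only to the training split is a design choice of this sketch: it keeps synthetic observations out of the evaluation, whereas the paper itself reports a balanced set of 1400 instances (700 per class).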
3 Experimentation and Results

This section serves as a baseline and reference point for the presentation of the results of the present study. The results and the comparison of the selected mathematical models are shown in Table 1.

Table 1: Comparison of the results of the models applied.

4 Results Discussion

In the experimentation, the most accurate model was the SVM with a Gaussian kernel. However, there are doubts about accepting this result, because the number of observations is limited compared to the number of characteristics of the training dataset, and it is therefore believed that there is a risk of overfitting. The Gradient Boosting model is the one that showed the best regularity during the experiments. Figure 1 shows its learning curve, which reaches its smallest error at 200 nodes.

Figure 1: Learning curve with Gradient Boosting.

5 Conclusions

The model that behaved most stably across the different experimental scenarios was Gradient Boosting, with an accuracy of 83.71% and a ROC value of 0.834. The most important characteristics for the Gradient Boosting model were the account balance, the duration of the credit, the credit amount, the length of current employment and the age. All of these variables coincide with those that experts in the financial sector consider relevant.
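As an illustration of the variable-importance analysis and the learning-curve inspection mentioned above, the following sketch continues from the preparation code in Section 2.2 (it reuses the fitted models dictionary, X_test and y_test); it is an assumed scikit-learn workflow, not the authors' code.

```python
# Continues from the Section 2.2 sketch: reuses models, X_test and y_test.
import numpy as np

gb = models["Gradient Boosting"]

# Top contributing characteristics; the Conclusions report account balance,
# credit duration, credit amount, length of current employment and age.
importances = sorted(zip(X_test.columns, gb.feature_importances_),
                     key=lambda item: item[1], reverse=True)
for feature, weight in importances[:5]:
    print(f"{feature}: {weight:.3f}")

# Test error after each boosting stage, a proxy for the learning curve of Figure 1.
stage_error = [np.mean(pred != y_test) for pred in gb.staged_predict(X_test)]
best_stage = int(np.argmin(stage_error)) + 1
print(f"smallest test error {min(stage_error):.4f} at stage {best_stage}")
```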