ML-based Approach for Credit Risk Assessment Using
Parallel Calculations
Lesia Hentosh1, Yevhen Tsikalo2, Natalya Kustra3 and Hakan Kutucu4
1Department of Artificial Intelligence, Lviv Polytechnic National University, Lviv, 79013, Ukraine; lesia.i.mochurad@lpnu.ua
2Department of Accounting and Audit, Ivan Franko National University of Lviv, Lviv, 79008, Ukraine; yevhen.tsikalo@lnu.edu.ua
3Department of Publishing Information Technologies, Lviv Polytechnic National University, Lviv, 79013, Ukraine; Nataliia.O.Kustra@lpnu.ua
4Department of Software Engineering, Karabuk University, Karabuk, 78050, Turkey; hakankutucu@karabuk.edu.tr


            Abstract
            In banks and other credit organizations, the task of credit scoring often arises when making decisions
            on granting loans. It consists of making a reasoned decision, based on information about the
            applicant, on whether the applicant should be granted a loan and, if so, on what terms. This paper
            proposes parallelizing the Random forest algorithm for the credit scoring task.
            This approach made it possible to reduce the time of model training and dataset processing significantly.
            As expected, with less data the resulting acceleration and efficiency worsen. With only 2500
            entries, the execution time of the sequential algorithm is less than that of the parallel algorithm. The developed
            software was tested on three different processors: 4-core, 8-core, and 12-core, to evaluate the
            parallelization quality of data pre-processing. The classification algorithm is computationally complex
            and time-consuming, so we obtained practically the same acceleration for processing 5000 and 10000
            records. With this amount of data, the 12-core processor gave the biggest gain in time when working
            with 12 threads. As a result, it is possible to have an acceleration of more than 6. This efficiency indicator
            of the proposed parallel algorithm can be significantly improved by varying the number of threads and
            considering the current trends in developing the multi-core architecture of computing systems. Also,
            using data without pre-processing, the following evaluation metrics were obtained: AUC=0.9 and
            Precision=0.845, and using data after pre-processing, these metrics were: AUC=0.86, Precision=0.89.

            Keywords
            Credit scoring, Random forest, classification task, parallel algorithm, acceleration.


CITRisk’2022: 3rd International Workshop on Computational & Information Technologies for Risk-Informed Systems, January 12,
2023, Neubiberg, Germany
EMAIL: lesia.i.mochurad@lpnu.ua (L.Hentosh), Nataliia.O.Kustra@lpnu.ua (N.Kustra), yevhen.tsikalo@lnu.edu.ua (Ye.Tsikalo),
hakankutucu@karabuk.edu.tr (H.Kutucu)
ORCID: 0000-0002-4957-1512 (L.Hentosh), 0000-0002-3562-2032 (N.Kustra), 0000-0001-8051-9299 (Ye.Tsikalo),
0000-0001-7144-7246 (H.Kutucu)
             © 2022 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
1. Introduction
To date, a significant problem for many banking institutions is the non-repayment of loans by
60-65% of borrowers [1, 2]. As a result, credit risk also increases, that is, the risk to income and
capital arising from the inability of a party that has assumed obligations to fulfill the terms of any
financial agreement with the bank or otherwise meet those obligations. There are many
methods for credit risk assessment: credit assessment methods, decision tree methods, rating
methods, the Monte Carlo method, scoring, taxonomic analysis, stress testing, etc. [3, 4]. The last is
widely used in Ukraine.
    In this work, we consider the credit scoring method. We treat credit scoring as a classification
problem: a machine learning model determines whether a loan should be granted to a borrower [5].
Banks already use their own scoring systems and analyze the risks of their credit portfolios with
them. Many experts use neural networks and support vector machines to build such systems. The
solution proposed in this work is based on the Random forest algorithm. This method builds a large
number (ensemble) of decision trees, each on a sample drawn from the original training sample by
bootstrap (sampling with replacement) [1, 6]. Objects are classified by "voting": each tree assigns
the object to one of the classes, and the class for which the most trees voted wins. An individual
decision tree does not give very high classification accuracy, but the ensemble result is accurate
owing to the large number of trees.
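The bootstrap and voting steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the one-feature toy data, the threshold "stump" used in place of a full decision tree, and the names `bootstrap_sample`, `train_stump`, and `forest_predict` are all assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(sample):
    """Hypothetical weak learner: split one feature at the sample mean."""
    threshold = sum(x for x, _ in sample) / len(sample)
    above = [y for x, y in sample if x > threshold]
    below = [y for x, y in sample if x <= threshold]
    # Each side predicts its majority label (0 if that side is empty).
    label_above = Counter(above).most_common(1)[0][0] if above else 0
    label_below = Counter(below).most_common(1)[0][0] if below else 0
    return lambda x: label_above if x > threshold else label_below


def forest_predict(trees, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Toy data (assumption): one feature, label 1 when the feature exceeds 50.
data = [(x, int(x > 50)) for x in range(0, 101, 5)]
forest = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(forest_predict(forest, 90), forest_predict(forest, 10))
```

Each stump sees a different resample of the data, so individual thresholds vary, while the vote over 25 of them is stable.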
    A common problem of classification algorithms, including the Random forest algorithm, is
wasting time due to processing large volumes of data. The solution proposed in this work is the
use of parallel computing. This solution will be used for data pre-processing and training the model
itself. Parallel computing makes programs run faster as more processes or threads are used [7].
Various studies have already demonstrated the implementation of parallel calculations for the
Random Forest algorithm [8].
    The purpose of this work is the parallelization of the Random Forest algorithm for the analysis
of risks in lending and the evaluation of the obtained efficiency coefficients.
    With the development of technology, the problem of qualitative and effective classification
arises since data volume grows yearly. Current datasets can contain millions, if not billions, of
records. Moreover, these data have a lot of noise, and some attributes do not carry any useful
information about the object at all.
    So, for high-quality training of models, the data should first be processed so that they carry
more information relevant to the task, and their main features should then be extracted. A Random
forest was presented as a classifier in [9]. The author outlines an approach to improving credit score
modeling using Random forests and compares Random forests with logistic regression. However,
that work did not analyze the problems of using this classifier on large volumes of data.
    Paper [10] proposes a credit scoring method that uses information from decision trees to
improve the performance of logistic regression. However, this classifier gained the most popularity
primarily due to its application in tracking [11].
    In our work, we use this classifier for recognition/classification tasks. The advantages of
randomized trees are that they are much faster to train and test than traditional classifiers (such as
SVM) and that, compared with a single conventional decision tree, they reduce variance and
increase accuracy by averaging many individually overfitted trees.
    Paper [12] develops a decision tree ensemble model using the differential sampling rate,
Synthetic Minority Oversampling Technique, and AdaBoost which is a prediction framework
integrating supply chain information to predict enterprise credit risk. The advantage of the
classification approach considered in this work over the method in [12] is that the model's
performance primarily depends on the model itself and the amount of data. Since we are working
with a limited dataset, we will use a Random forest instead of a fully connected network to
maximize the performance and training speed of the model. However, the most crucial advantage
of the proposed approach is that the data can be pre-processed with the help of the
VGG-16 neural network. This makes it possible to speed up training significantly and to increase
the classifier's resistance to overfitting. As a result, we obtain a classifier with a high learning
speed, which can be used, for example, on the tensor cores of modern Nvidia video cards for
processing large datasets (more than 10^10 records), and with relatively high resistance to outliers
and overfitting.


2. Proposed methodology
       2.1.       Statement of the task and description of the dataset
As input, the pre-trained model should classify credit risks with a value from a binary set. The
dataset [13] consists of 600 records with 10 features about customers of a German bank (see
Figure 1). There are 2 types of data: object, int64. Attributes: Age, Job, Credit Amount, Duration
– int64; Sex, Housing, Savings, Checking accounts, Purpose, Risk – object.


Figure 1: Initial dataset before pre-processing


       2.2.       Data pre-processing
In order to improve the quality of predictions, the data can be processed in a certain way so that
the necessary features are highlighted. Let us start by considering the features of this dataset and
the quantitative distribution of good and bad borrowers (see Figure 2).
Figure 2: Quantitative ratio of good/bad bank customers


Figure 3 shows the age distribution of borrowers who repaid and who did not repay their loans. The
age distribution can be summarized by a measure of central tendency; in both cases the distribution
deviates from normal.


Figure 3: Distribution of borrowers by age


If we check the normality of this distribution on the QQ-Plot, we get Figure 4. We can also see
deviations in the 2nd and 4th quartiles.
Figure 4: Deviation of the distribution of borrowers by age from normal


We visualize the distribution of credit amounts by the housing status of the bank's client. The
distributions in Figure 5 show that borrowers take out larger loans if they do not own a home.


Figure 5: Distribution of loan amounts based on the availability of housing for an honest/dishonest client


The next step before classification is feature extraction: the data pass through the convolutional,
activation, and pooling layers of the pre-trained VGG-16 neural network, and the resulting features
are fed to the input of the classifier. To clarify, the Random forest classifier receives the features of
each object together with the corresponding class label, "risky loan" or "risk-free loan".


       2.3.       Sequential and proposed parallel algorithm with computational complexity analysis
Sequential Random forest algorithm [14].
   Input: training set S := (x_1, y_1), …, (x_n, y_n), attributes F, and the number of trees in the
forest B
   function RandomForest(S, F)
        H ← ∅
        for i ∈ 1, …, B do
              S^(i) ← a bootstrap sample of objects from S
              h_i ← RandomizedTreeLearn(S^(i), F)
              H ← H ∪ {h_i}
        end for
        return H
   end function

   function RandomizedTreeLearn(S, F)
        for each tree node do
              f ← a very small subset of the attributes in F
              split on the best attribute in f (Gini or entropy criterion)
        end for
        return the learned tree
   end function

   Parallel Random forest algorithm
   omp_set_num_threads(NUM_THREADS);
   function RandomForest(S, F)
        H ← ∅
        #pragma omp parallel private(i, S^(i), h_i) shared(H)
        {
              #pragma omp for
              for i ∈ 1, …, B do
                    S^(i) ← a bootstrap sample of objects from S
                    h_i ← RandomizedTreeLearn(S^(i), F)
                    #pragma omp critical
                    H ← H ∪ {h_i}
              end for
        }
        return H
   end function
   For classification tasks, it is advisable to take the size of the subset f equal to √d, and for
regression problems equal to d/3, where d is the number of features. It is recommended to grow each
tree until every leaf contains no more than n_min = 1 examples for classification and n_min = 5
examples for regression.
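The split criteria named in the pseudocode (Gini, entropy) and the √d rule for the size of the attribute subset can be sketched as follows. The helper names and the toy label list are hypothetical; d = 9 assumes the nine predictor attributes of the dataset, that is, the ten listed attributes without the Risk label.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def features_per_split(d):
    """Attributes examined at each node for classification: about sqrt(d)."""
    return max(1, round(math.sqrt(d)))


labels = ["good", "good", "good", "bad"]  # toy node contents (assumption)
print(gini(labels))           # 0.375
print(entropy(labels))        # about 0.811
print(features_per_split(9))  # 3
```

A node is split on the attribute in f whose partition gives the largest drop in the chosen impurity measure.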

   Analysis of the complexity of the sequential and parallel algorithms:
   Sequential: Complexity_sequential = $O(T \cdot n^2 \cdot \sqrt{p})$
   Parallel: Complexity_parallel = $O\left(\frac{T \cdot n^2 \cdot \sqrt{p}}{N_{threads}}\right)$, where:

   1) $T$ is the number of trees to be built;
   2) $\sqrt{p}$ is the number of features that are taken into account at each node of the tree;
   3) $n^2$ is the number of tree nodes multiplied by the number of partitions of the value of the variable;
   4) $N_{threads}$ is the number of threads allocated for building trees.
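Taking the ratio of the two estimates above gives the ideal acceleration and efficiency. This is an upper bound under the assumption that thread creation, synchronization, and memory contention cost nothing; the symbols $S_{ideal}$ and $E_{ideal}$ are introduced here only for illustration.

```latex
S_{ideal}
  = \frac{T \cdot n^{2} \cdot \sqrt{p}}
         {\dfrac{T \cdot n^{2} \cdot \sqrt{p}}{N_{threads}}}
  = N_{threads},
\qquad
E_{ideal} = \frac{S_{ideal}}{N_{threads}} = 1 .
```

Measured accelerations below $N_{threads}$ therefore indicate overheads that the complexity estimate does not capture.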


3. Research results
These are the results of the implementation of the proposed parallel algorithm, obtained on
3 different processors: 4-core, 8-core, and 12-core. The Python language and the Joblib
library [15] were chosen for the algorithm implementation. Joblib is optimized to be fast and
robust on large data. By default, Joblib uses the 'loky' backend [16] to run separate Python
worker processes that execute tasks concurrently on different CPUs. If the Python Global
Interpreter Lock (GIL) is released for most of the computation, multithreading can be used
instead, which can significantly increase execution speed. Tables 1-3 present the execution time
of the proposed parallel algorithm on 4-, 8-, and 12-core processors.
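The idea of training trees concurrently can be illustrated with a minimal sketch. It uses the standard-library `concurrent.futures.ThreadPoolExecutor` as a stand-in for Joblib and is not the authors' implementation; `train_tree` and `build_forest` are hypothetical names, and each "tree" is reduced to a single threshold for brevity.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_tree(seed, data):
    """Hypothetical stand-in for RandomizedTreeLearn: returns the split
    threshold of a stump fitted on a bootstrap sample with its own seed."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    return sum(x for x, _ in sample) / len(sample)


def build_forest(data, n_trees, n_workers):
    """Train n_trees independent trees on n_workers worker threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in task order, so the workers never mutate
        # a shared collection and no lock is needed.
        return list(pool.map(lambda s: train_tree(s, data), range(n_trees)))


# Toy data (assumption): one feature, label 1 when the feature exceeds 50.
data = [(x, int(x > 50)) for x in range(0, 101, 5)]
sequential = [train_tree(s, data) for s in range(8)]
parallel = build_forest(data, n_trees=8, n_workers=4)
print(parallel == sequential)  # True: per-tree seeds make runs reproducible
```

Because pure-Python code holds the GIL, the threads in this sketch show the structure of the computation rather than a speed-up; process-based workers, as in Joblib's default backend described above, are what would run CPU-bound training in parallel.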

Table 1
Execution time of the parallel algorithm on a 4-core processor, ms
 Number of                        Number of threads
 records, n         1             2             4             8
    2500          663.35        424.5         301.0         308.2
    5000         1326.7         849.1         602.0         616.3
   10000         2669.4        1714.1        1220.0        1248.6

Table 2
Execution time of the parallel algorithm on an 8-core processor, ms
 Number of                        Number of threads
 records, n         1             2             4             8
    2500          572.6         350.8         241.7         190.1
    5000         1123.2         679.5         461.4         358.2
   10000         2230.4        1343.0         906.8         700.3

Table 3
Execution time of the parallel algorithm on a 12-core processor, ms
 Number of                        Number of threads
 records, n         1             3             6            12
    2500          591.5         217.7         127.3          92.6
    5000         1180.2         432.4         251.7         182.1
   10000         2365.9         869.1         508.0         369.7

It can be seen from Tables 1-3 that the parallel algorithm gives the best result when the number
of threads equals the number of cores of the multi-core computer system, which is the expected
behaviour and supports the reliability of the obtained results. Furthermore, with 12 threads
on a 12-core processor, it was possible to reduce the calculation time by more than 6 times (see
Table 4).

Table 4
Indicators of acceleration and efficiency of the software implementation of the parallel algorithm
on a 12-core processor
 Number of          3 threads                  6 threads                  12 threads
 records, n   Acceleration  Efficiency   Acceleration  Efficiency   Acceleration  Efficiency
    2500          2.71         0.90          4.64         0.77          6.38         0.53
    5000          2.73         0.91          4.68         0.78          6.48         0.54
   10000          2.72         0.90          4.65         0.77          6.39         0.53

There are various ways to improve efficiency [17]. As seen from Table 4, acceleration grows with the
number of threads, while efficiency, taken as acceleration divided by the number of threads, is
closest to unity with 3 threads (about 0.90) and decreases to about 0.53 with 12 threads. The gain
in time reflects the use of the multithreading and multi-core capabilities of modern
personal computers.
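The acceleration values can be recomputed from the execution times in Table 3. The sketch below assumes the usual definitions, acceleration S_p = T_1 / T_p and efficiency E_p = S_p / p; the function names are hypothetical.

```python
# Execution times in ms from Table 3 (12-core processor),
# keyed by number of records and then by number of threads.
times_ms = {
    2500: {1: 591.5, 3: 217.7, 6: 127.3, 12: 92.6},
    5000: {1: 1180.2, 3: 432.4, 6: 251.7, 12: 182.1},
    10000: {1: 2365.9, 3: 869.1, 6: 508.0, 12: 369.7},
}


def acceleration(t_seq, t_par):
    """Acceleration S_p = T_1 / T_p."""
    return t_seq / t_par


def efficiency(t_seq, t_par, threads):
    """Efficiency E_p = S_p / p (the usual definition, assumed here)."""
    return acceleration(t_seq, t_par) / threads


for n, row in times_ms.items():
    for p in (3, 6, 12):
        s = acceleration(row[1], row[p])
        e = efficiency(row[1], row[p], p)
        print(f"n={n:>5}  p={p:>2}  S={s:.2f}  E={e:.2f}")
```

Values printed with two decimals may differ from the table in the last digit, depending on whether the table entries were rounded or truncated.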
   Next, in Figures 6-7, we will present data classification results using the trained Random Forest
model.
Figure 6: Error matrix for data without pre-processing


Figure 7: Error matrix for data after pre-processing


To illustrate how the parallel classifier works (see Figure 8), consider two examples. For a
24-year-old skilled worker who owns a home, has a moderate checking account, and wants to buy a
radio/TV, the probability of good faith is Pr(Y = 1 | X; W) = 0.824. For a 42-year-old skilled
worker without housing, with a small checking account, who wants to buy a car, the probability is
Pr(Y = 1 | X; W) = 0.45.
Figure 8: Test examples of the classifier program
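One common way for a forest to produce such a probability is the share of trees that vote for the positive class. The sketch below assumes this scheme; the paper does not state how its probabilities were computed, and the 500-tree vote counts are hypothetical, chosen only to reproduce the first example.

```python
def positive_probability(votes):
    """Share of trees voting for class 1 (a borrower in good faith)."""
    return sum(votes) / len(votes)


# Hypothetical votes of a 500-tree forest: 412 trees vote 1, 88 vote 0.
votes = [1] * 412 + [0] * 88
print(positive_probability(votes))  # 0.824
```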


Let us introduce a few more indicators that will allow us to evaluate our model better (see Figure
9).
    Precision is the ratio precision = tp / (tp + fp), where tp is the number of true positive
elements, and fp is the number of false positives. Precision is intuitively the ability of a classifier
not to flag a sample as positive if it is negative.
    The recall is the ratio recall = tp / (tp + fn), where tp is the number of true positives and fn
is the number of false negatives. The recall is intuitively the ability of the classifier to find all
positive samples.
    The F-beta score can be interpreted as the weighted harmonic mean of precision and recall; it
reaches its best value at 1 and its worst at 0. The beta coefficient determines how much more
weight recall receives than precision; beta = 1.0 means that recall and precision are equally
important.
    Support is the number of cases of each class in y_test.
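The definitions above translate directly into code. The following is a minimal sketch; the counts tp, fp, and fn are hypothetical and are not taken from the error matrices in Figures 6-7.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the classifier finds."""
    return tp / (tp + fn)


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


tp, fp, fn = 80, 20, 10  # hypothetical counts
print(precision(tp, fp))           # 0.8
print(round(recall(tp, fn), 3))    # 0.889
print(round(f_beta(tp, fp, fn), 3))  # 0.842
```

With beta greater than 1 the score moves toward recall, and with beta less than 1 toward precision.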


Figure 9: Performance indicators of the model


The time of parallel training on data after pre-processing on a 4-core processor is presented in
Figure 10. Table 5 shows the acceleration of the model training result using a parallel algorithm
on a 4-core processor.
[Bar chart: training time in seconds (vertical axis from 0 to 0.8) for n = 250, n = 500, and n = 1000; series for 1, 2, 4, 8, and 16 threads]

Figure 10: Parallel training on data after pre-processing, sec


Table 5
Acceleration of parallel training on data after pre-processing on a 4-core processor
 Number of
 records, n      2 threads     4 threads     8 threads     16 threads
    2500           0.76          0.81          0.76          0.68
    5000           1.39          1.58          1.48          1.39
   10000           1.57          1.89          1.68          1.57

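So, parallelization of Random forest training on a 4-core processor showed good results
for larger samples: the best acceleration, 1.89, was obtained with 4 threads on 10000 records.
As expected, acceleration degraded with 8 and 16 threads, which exceed the number of cores, and on
2500 records the parallel version was slower than the sequential one (acceleration below 1).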


4. Conclusion
The work demonstrated the use of parallel computing in a typical machine learning algorithm,
Random forest. This approach made it possible to reduce the time of model training and dataset
processing significantly. The numerical results show that the performance gain
strongly depends on the architecture of the specific computer on which the code is executed.
Given current trends in the development of multi-core computer architectures, this research is
highly relevant: an acceleration of more than 6 times was obtained on a 12-core processor.
    The effectiveness of the Random forest classification algorithm in the credit scoring problem
was also demonstrated. The area under the ROC curve (AUC) was 0.80 for data
before pre-processing and 0.9 for data after pre-processing.
   The algorithm for classifying features of objects proposed in the work made it possible to
significantly improve the accuracy of calculations compared to the work [9].
   Using the data without pre-processing, we obtained the following indicators:
   •   𝑇𝑇𝑇𝑇 = 0.80;
   •   𝑇𝑇𝑇𝑇 = 0.89;
   •   AUC = 0.9;
   •   Precision = 0.845.
   Using data after pre-processing, these indicators were:
   •   𝑇𝑇𝑇𝑇 = 0.94;
   •   𝑇𝑇𝑇𝑇 = 0.84;
   •   AUC = 0.86;
   •   Precision = 0.89.
Therefore, it can be concluded that the proposed algorithms have successfully coped with the task.
The work also demonstrates how one of the essential programming languages in machine learning,
Python, can be combined with parallel computing in the Joblib library for significant optimization
of sequential algorithms.


References
[1] O.Artemieva, A.Bestiuk, The influence of the credit risks on the banking system of Ukraine
     in conditions of transformation processes, Scientific Bulletin of KSU, Series «Economic
     Sciences», 37, 2020. DOI:10.32999/ksu2307-8030/2020-37-11
[2] R.Flunger, A.Mladenow, C.Strauss, Game Analytics—Business Impact, Methods and Tools,
     Studies in Systems, Decision and Control, 377, 2022, pp. 601–617
[3] Qiang Cai and Qian Qian, Summary of Credit Risk Assessment Methods, Advances in
     Intelligent Systems Research, 148, 2018, pp. 89-93
[4] S.Moradi, F.Mokhatab Rafiei, A dynamic credit risk assessment model with data mining
     techniques: evidence from Iranian banks. Financ Innov 5, 15, 2019.
     DOI:10.1186/s40854-019-0121-9
[5] M.Sipper, J.H.Moore, Conservation machine learning: a case study of random forests, Sci
     Rep 11, 3629, 2021. DOI:10.1038/s41598-021-83247-4
[6] R.Tkachenko, Z.Duriagina, I.Lemishka, I.Izonin, Trostianchyn, Development of machine
     learning method of titanium alloy properties identification in additive technologies,
     Eastern-European Journal of Enterprise Technologies, 3(12 (93)), 2018
[7] L.Mochurad, A.Ilkiv, A novel method of medical classification using
     parallelization algorithms. International scientific journal «Computer systems
     and information technologies», 1, 2022. DOI:10.31891/CSIT-2022-1-3
[8] N.Azizah, L.S.Riza, Y.Wihardi, Implementation of random forest algorithm with
     parallel computing in R. Journal of Physics: Conference Series, Volume 1280, Issue
     2, 2019. DOI:10.1088/1742-6596/1280/2/022028
[9] Sh.Dhruv, Improving the Art, Craft and Science of Economic Credit Risk Scorecards Using
     Random Forests: Why Credit Scorers and Economists Should Use Random Forests (June 9,
     2011). DOI:10.2139/ssrn.1861535
[10] E.Dumitrescu, S.Hué, Ch.Hurlin, S.Tokpavi, Machine Learning or Econometrics for Credit
     Scoring: Let’s Get the Best of Both Worlds, 2020.
     URL: https://hal.archives-ouvertes.fr/hal-02507499v2
[11] D.David, Random Forest Classifier Tutorial: How to Use Tree-Based Algorithms for
     Machine Learning. August 6, 2020.
     URL: https://www.freecodecamp.org/news/how-to-use-the-tree-based-algorithm-for-machine-learning/
[12] X.Hu, T.Zhou, Y.Zhang, Enterprise credit risk prediction using supply chain information: A
     decision tree ensemble model based on the differential sampling rate, Synthetic Minority
     Oversampling Technique and AdaBoost, Expert Systems, 39, 6. DOI:10.1111/exsy.12953
[13] Analysis of German Credit Data.
     URL: https://online.stat.psu.edu/stat508/resource/analysis/gcd
[14] F.Neaz, H.Tawqir, A.Marium, T.Granthi, SH.Dev Sana, Personnel security system of
     nuclear power plants using machine learning for psychological, behavioral and social media
     activity analysis, Bachelor of science in computer science and engineering, 2018, 43 p.
[15] X.Li, Yu Wang, S.Basu, K.Kumbier, B.Yu, A Debiased MDI Feature Importance
     Measure for Random Forests, 33rd Conference on Neural Information Processing Systems
     (NeurIPS 2019), Vancouver, Canada. DOI:10.48550/arXiv.1906.10845
[16] A.Malakhov, D.Liu, A.Gorshkov, T.Wilmarth, Composable Multi-Threading and
     Multi-Processing for Numeric Libraries, 2018, pp. 18-24. DOI:10.25080/Majora-4af1f417-003
[17] Embarrassingly parallel for loops. URL: https://joblib.readthedocs.io/en/latest/parallel.html
[18] N.Boyko, K.Shakhovska, L.Mochurad, J.Campos, Information system of catering
     selection by using clustering analysis, CEUR Workshop Proceedings, 2533, 2019,
     pp. 94–106