Loan Default Prediction Using Spark Machine Learning Algorithms

Aiman Muhammad Uwais [0000-0002-6306-420X] and Hamidreza Khaleghzadeh [0000-0003-4070-7468]
School of Computing, University of Portsmouth, Portsmouth, PO1 3HE, United Kingdom
aiman.uwais@myport.ac.uk, hamidreza.khaleghzadeh@port.ac.uk

Abstract. Loan lending has been an important business activity for both individuals and financial institutions. The profit and loss of financial lenders depend to an extent on loan repayment. Though loan lending is beneficial for both lenders and borrowers, it carries a great risk that the loan receiver will be unable to repay the loan. This inability is termed loan default. Loan default prediction is a crucial process that financial lenders should carry out to help them determine whether a loan is likely to default. Successful loan default prediction can help financial institutions decrease the number of bad loans issued and eventually increase profit. The aim of this paper is to use data mining techniques to bring out insights from data and then build a loan default prediction model using machine learning algorithms on the Spark Big Data platform. Six supervised machine learning classification algorithms are applied to predict loan default: Logistic Regression, Decision Tree, Random Forest, Gradient-Boosted Trees (GBTs), Factorization Machines (FM) and Linear Support Vector Machine (LSVM). Accuracy, precision, recall, ROC curve and F measure are used to evaluate the models and the results are compared. We achieve the highest accuracy of 99.62% using the Decision Tree and Random Forest models.

Keywords: Loan default · Prediction · Machine learning · Big Data · Spark.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

With increasing competition in the financial world and due to severe financial constraints, taking a loan has become commonplace. Individuals and organizations rely on loans for reasons such as overcoming financial limits to achieve their personal goals or for the basic purpose of managing their affairs in times of financial constraint [7]. Though loan lending is quite beneficial for both the lenders and the receivers and is considered an essential part of financial transactions, it does carry some great risks [1]. This risk is termed credit risk or loan default.

Murray defines loan default as occurring when a borrower does not make the required payments or does not comply with the terms of a loan [9]. The profit or loss of a financial lender depends to a large extent on loan repayments, that is, whether customers are paying back their loans or defaulting. Therefore, when loans default, financial institutions lose money, and it might even lead to the bankruptcy and collapse of the institution. By evaluating the ability of borrowers to deliver on their obligation of loan repayment, i.e. loan default prediction, financial institutions (lenders) can reduce credit risk, prevent loan default and increase profit [9]. The process of forecasting whether a loan will default was initially done manually or semi-manually. With the advancement of statistical computing packages, several machine learning algorithms have been used to predict loan default by evaluating an individual's historical data. But with an ever-increasing amount of data for loan default prediction, there is a need to use Big Data applications.
In this paper, we address this problem by building a high-performing machine learning classifier using Apache Spark machine learning libraries to predict loan default. This paper aims to demonstrate the application of Big Data and machine learning in the finance industry. First, exploratory data analysis using data mining techniques is carried out to bring out insights from the dataset. Second, we employ Apache Spark machine learning libraries to make accurate loan default predictions. Six supervised machine learning classification algorithms are applied to predict loan default, and we achieve the highest accuracy of 99.62% using the Decision Tree and Random Forest models.

The structure of the paper is as follows. Section 2 presents related work. Section 3 describes the research dataset, some exploratory data analysis and data preparation. Modelling is illustrated in Section 4. Section 5 evaluates and compares the presented models. Finally, Section 6 concludes the paper.

2 Related Work

In this section, we review different types of mechanisms that have been employed for loan default prediction on different platforms and architectures.

Wang et al. present a study that uses 4000 samples and 21 attributes to build and evaluate a classifier predictive model. Four algorithms are used: classic SVM, backpropagation neural network, C4.5 and R-SVM. The results show that the overall predictive accuracy of R-SVM is better than that of the other methods [16]. Reddy and Kavitha [12] use neural networks with attribute relevance analysis to predict the defaulter class. Hassan and Abraham [4] use a bank dataset of 1000 cases, each with 24 numerical attributes, to develop and compare models produced by three training algorithms: scaled conjugate gradient backpropagation (SCG), the Levenberg-Marquardt algorithm (LM) and one-step secant backpropagation (OSS). The study shows that the slowest algorithm is OSS and the best is LM, as it has the largest R value, although this result only holds for that particular dataset.

Hamid and Ahmed [3] propose a model for classifying loan applications into good and bad loans using three algorithms: J48, Bayesian network and the Naive Bayes classifier. They use the Weka application for the implementation and testing, and show that J48 has the best accuracy of 78.378%. Turkson et al. [15] apply 15 different machine learning algorithms to predict customers' creditworthiness. The experiment shows that, apart from Nearest Centroid and Gaussian Naive Bayes, the algorithms performed well in terms of accuracy and other performance evaluation metrics, each achieving an accuracy rate between 76% and over 80%.

Odegua proposes the use of the Extreme Gradient Boosting algorithm, XGBoost, for loan default prediction. The prediction is based on loan data from a bank, with a dataset containing 4368 samples and 10 attributes drawn from both the loan applications and the demographics of the applicants. Location and age of customers are the two most important features affecting loan default. The XGBoost model achieves an accuracy of 79%, precision of 97%, recall of 79% and an F1 score of 87%. The paper concludes that predictive modelling provides an effective basis for loan credit approval by identifying risky customers among many loan applications [10].
Lai classifies and predicts loan default using a real-world dataset of 132,029 instances from an international bank and the AdaBoost, XGBoost, random forest, multi-layer perceptron and k-nearest neighbours algorithms. The experiment shows that the boosting algorithms perform better, with the AdaBoost method achieving 100% prediction accuracy and outperforming the others. ROC and AUC metrics are used for model evaluation. Based on the outcomes obtained, it is concluded that the application of machine learning techniques is promising in the financial industry [6].

Sheikh et al. [14] present a study on loan prediction by building a logistic regression model with a sigmoid function and analysing the problem of predicting loan defaulters. Logistic regression models are built and different performance measures are computed; the models are compared based on the sensitivity and specificity measures. The best-case accuracy obtained is 81.1%. The researchers conclude that the logistic regression method efficiently detects the right customers to be targeted for granting loans.

Patel et al. use various data mining algorithms to predict likely defaulters from a dataset containing information about home loan applications, thereby helping banks to make better decisions in the future. The dataset used has 640,000 instances and 14 attributes. Optimum results are obtained using Logistic Regression, Random Forest, Gradient Boosting and the CatBoost classifier. The CatBoost classifier and Gradient Boosting provide almost equal accuracy, with Gradient Boosting giving the better result of 84.035%. The researchers conclude that these models can be used to make better decisions on loan applicants when predicting loan default and save financial institutions from huge losses [11].

Meer uses a dataset consisting of 5,960 records. Two models are built using tuned Logistic Regression algorithms, one model using a tuned Random Forest classifier and one model using a tuned Gradient Boosting Tree algorithm [8].

3 Research Dataset

3.1 Dataset Characteristics

The dataset has been obtained from the Kaggle website and was created for pedagogic purposes for a common loan default prediction task. The data is generated in such a way that default prediction machine learning models are likely to be biased against women and minorities [5]. The dataset contains 640,000 instances and 14 features, with the default attribute as the target feature. The other features include minority, sex, ZIP, loan size, payment timing, year, rent, education, income, job stability and occupation. The default attribute has two class labels: true and false. We use 70% of the dataset for training and the remaining 30% for testing.

Figure 1 displays a scatter plot of income against education, produced with the pandas and matplotlib Python libraries available on Spark. The figure shows that as the educational level increases, the income of the applicants increases; people with a higher level of education tend to have higher incomes, which indicates a positive correlation.

Fig. 1. Scatter plot of income vs education.

Figure 2 shows the histogram distribution of the loan sizes taken. For the first 1000 rows of the dataset, loan sizes of around 5000 are issued most frequently.

Fig. 2. Distribution of loan size for the first 1000 rows of the dataset.
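The paper does not include the code used to produce these plots. The following is a minimal sketch of how such an exploratory plot could be generated on Spark, assuming the Kaggle CSV has been loaded and that the relevant columns are named education and income; the file path and column names are assumptions, not taken from the authors' implementation.

```python
# Minimal EDA sketch (assumed file path and column names, not the authors' code).
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loan-default-eda").getOrCreate()

# Load the Kaggle loan default dataset; the path is hypothetical.
df = spark.read.csv("loan_default.csv", header=True, inferSchema=True)

# Bring a small sample of the distributed DataFrame back to pandas for plotting.
sample_pd = df.select("education", "income").sample(fraction=0.01, seed=42).toPandas()

sample_pd.plot.scatter(x="education", y="income")
plt.title("Income vs education")
plt.show()
```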
Figure 3 displays the default class for the two gender types. Non-defaults (false, shown as the blue bars) are the highest for both gender types, meaning that more people were able to repay their loans than defaulted.

Fig. 3. Default class for sex attribute categories.

Figure 4 displays the number of minority applicants who repaid their loan (false) and those who defaulted (true). At a glance, most of the people who defaulted belong to minority ethnic groups, while most of those who did not default are non-minorities.

Fig. 4. Minority default count.

3.2 Exploratory Data Analysis

In this section, exploratory data analysis using data mining techniques is carried out on the data to bring out some insights. Figure 5 shows the correlation heatmap of the dataset. A correlation heatmap displays a 2D correlation matrix between two discrete dimensions, using coloured cells, typically on a monochromatic scale. The rows show the values of the first dimension, while the columns show the values of the second dimension. The colour of each cell reflects the strength of the correlation between the corresponding pair of attributes. Based on the heatmap in Figure 5, one can conclude that the target variable (default) is most positively correlated with features such as rent and negatively correlated with job stability.

Fig. 5. Correlation heatmap of the dataset.

3.3 Data Preparation

Data preparation is the process of preparing the raw dataset so that it is suitable for the machine learning algorithms. The initial pre-processing task entails null value removal and adjustment of attribute data types. The data preparation steps are listed below; a sketch of how these steps can be expressed with Spark is given after the list.

1. Feature selection: the attributes that influence the prediction are selected based on the correlation heatmap and other social factors that influence loan default. The attributes selected and used for model building are: minority, sex, rent, education, age, income, loan size, payment timing, job stability, year and default.
2. Addressing the class imbalance problem.
3. Converting the categorical attributes into a numerical format.
4. Splitting the dataset into training (70%) and test (30%) sets.
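A minimal sketch of these four steps with Spark MLlib is shown below, reusing the DataFrame df loaded in the earlier sketch. Column names such as loan_size and payment_timing, the assumption that sex is stored as a string, and the choice of majority-class downsampling as the rebalancing strategy are all illustrative; the paper does not state which rebalancing technique it used.

```python
# Sketch of the data preparation steps in Section 3.3 (assumed column names;
# the rebalancing strategy shown is one common option, not necessarily the authors').
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import functions as F

numeric_cols = ["minority", "rent", "education", "age", "income",
                "loan_size", "payment_timing", "job_stability", "year"]
categorical_cols = ["sex"]  # assumed string-typed; skip indexing if already numeric

# Step 3: convert categorical attributes into numerical indices.
prepared = df
for c in categorical_cols:
    prepared = StringIndexer(inputCol=c, outputCol=c + "_idx").fit(prepared).transform(prepared)

# Step 1: assemble the selected attributes into a single feature vector.
assembler = VectorAssembler(
    inputCols=numeric_cols + [c + "_idx" for c in categorical_cols],
    outputCol="features")
prepared = assembler.transform(prepared)

# Target label as 0/1 (assumes the default column is boolean or already 0/1).
prepared = prepared.withColumn("label", F.col("default").cast("double"))

# Step 2: one simple way to address class imbalance is to downsample the majority class.
counts = {row["label"]: row["count"] for row in prepared.groupBy("label").count().collect()}
fraction = counts[1.0] / counts[0.0]
balanced = prepared.filter("label = 1.0").union(
    prepared.filter("label = 0.0").sample(fraction=fraction, seed=42))

# Step 4: 70/30 train-test split.
train, test = balanced.randomSplit([0.7, 0.3], seed=42)
```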
4 Modelling

4.1 Implementation Platform

With financial institutions generating ever-increasing amounts of data, in gigabytes and terabytes, for evaluating loan default, there is a need for Big Data applications that can efficiently and accurately predict loan default regardless of the quantity of the data. Apache Spark contains machine learning libraries and is the most suitable Big Data application to carry out this task. Apache Spark is a unified computing engine and a set of libraries (framework) for parallel Big Data processing. It supports widely used programming languages (Python, R, etc.) and libraries (SQL, streaming, machine learning, etc.) and can run on anything from laptops to server clusters. Spark provides a unified platform for developing Big Data applications [2]. It also has a machine learning library, known as MLlib, for performing a variety of machine learning tasks.

4.2 Modelling Algorithms

As explained earlier, loan default is the inability of a borrower to pay back a loan. So, when loan default is true, it means the borrower has defaulted and cannot pay back the loan or meet its terms. However, if loan default is false, it implies the borrower can meet their obligation and pay back the loan. Since the problem we are trying to solve aims to classify instances into two categories, true and false, it is a binary classification problem. Six supervised machine learning classification algorithms available in Spark MLlib, namely Logistic Regression (LR), Decision Trees (DT), Random Forests (RF), Gradient-Boosted Trees (GBTs), Factorization Machines (FM) and Linear Support Vector Machines (LSVM), are applied, with the training data used to train the models and the testing data used to evaluate them. Table 1 presents a brief description of the algorithms. We have used the Spark MLlib API on the Databricks environment to implement the models.

Table 1. Machine learning algorithms used.

Logistic Regression (LR): LR is a supervised classification algorithm that predicts a categorical response by estimating the likelihood of outcomes. It is a generalized linear model. Logistic regression is available in spark.ml and can be used to predict a binary or multiclass outcome using binomial or multinomial logistic regression respectively.

Decision Tree (DT): DT is one of the most common supervised learning techniques used to solve classification problems. It has a tree structure in which nodes and leaves represent features and class labels respectively. It is easy to understand, requires little data cleaning and is not affected by non-linearity, but it may suffer from overfitting.

Random Forest (RF): RF is a flexible, easy-to-apply supervised ML algorithm that produces a good result most of the time, even without hyper-parameter tuning. It can be used for both classification and regression tasks.

Gradient-Boosted Trees (GBTs): The GBTs classifier is a supervised classification algorithm. It trains an ensemble of "weak" decision trees and uses boosting to combine their predictions.

Factorization Machines (FM): FM is a supervised machine learning algorithm used to solve classification problems. Interactions between features are estimated even in problems with huge sparsity. The spark.ml implementation supports factorization machines for binary classification and regression.

Linear Support Vector Machine (LSVM): The LSVM classifier is a supervised machine learning algorithm used to solve classification problems. SVM builds a hyperplane or set of hyperplanes in a high- or infinite-dimensional space that can be used for classification, regression or other problems. Intuitively, a good separation is achieved by the hyperplane that has the greatest distance to the closest training data points of any class.
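As a companion to Table 1, the sketch below shows how the six classifiers could be instantiated and fitted with the Spark MLlib API, using the train and test DataFrames from the data preparation sketch. Hyperparameters are left at Spark's defaults, since the paper does not report the settings used.

```python
# Fitting the six Spark MLlib classifiers (default hyperparameters;
# FMClassifier requires Spark 3.0 or later).
from pyspark.ml.classification import (LogisticRegression, DecisionTreeClassifier,
                                       RandomForestClassifier, GBTClassifier,
                                       FMClassifier, LinearSVC)

classifiers = {
    "LR":   LogisticRegression(featuresCol="features", labelCol="label"),
    "DT":   DecisionTreeClassifier(featuresCol="features", labelCol="label"),
    "RF":   RandomForestClassifier(featuresCol="features", labelCol="label"),
    "GBTs": GBTClassifier(featuresCol="features", labelCol="label"),
    "FM":   FMClassifier(featuresCol="features", labelCol="label"),
    "LSVM": LinearSVC(featuresCol="features", labelCol="label"),
}

# Train each model on the 70% training split and score the 30% test split.
predictions = {}
for name, clf in classifiers.items():
    model = clf.fit(train)
    predictions[name] = model.transform(test)
```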
5 Model Evaluation and Result

This section shows the model performance evaluation, where 30% of the whole dataset is used for model testing and evaluation. Model evaluation is an important component of model development in which the performance of the developed predictive models is assessed. Therefore, we carry out a comparison of the performance of the proposed models. For model evaluation, we consider ROC curves, accuracy, recall, precision and F-score derived from confusion matrices. A ROC curve is a graphical method that shows the sensitivity and specificity of a classifier model. A confusion matrix is a fundamental two-dimensional matrix that contains information about the actual and predicted categories of the classifier. Accuracy, recall, precision and F-score are then obtained from the confusion matrix parameters: true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn).

5.1 Result Comparison

Table 2 shows the overall evaluation metrics for the six developed machine learning classification models for loan default prediction. From the table, the accuracy values for DT, RF and GBTs are almost identical and are the highest (99%), but on closer examination the RF classifier model has the highest accuracy of 99.619%.

Table 2. Performance evaluation metrics for the Spark MLlib based models.

Metric        LR       DT       RF       GBTs     FM       LSVM
AreaUnderROC  0.960    0.996    0.9967   0.990    0.882    0.954
tp            72336    76521    76528    75537    73039    71514
fp            1990     122      123      49       20371    2128
fn            4784     599      592      1583     4081     5606
tn            108879   110747   110746   110820   90498    108741
Accuracy      0.9646   0.996    0.996    0.991    0.870    0.959
Precision     0.973    0.998    0.998    0.999    0.782    0.971
Recall        0.938    0.992    0.992    0.979    0.947    0.927
F-score       0.955    0.995    0.995    0.989    0.857    0.949

We cannot measure the performance of machine learning models based on their accuracy alone. In this research, other evaluation metrics such as precision, recall and F measure are also considered. As shown in Figure 6, the DT and RF classifiers have the highest ROC curve value of 99.56%, precision of 99.8%, recall of 99.2% and F-score of 99.5%. This implies that these two models perform better in loan default classification than the remaining models. The main reason for this observation is that both DT and RF work well with categorical and numerical values, and missing values do not affect their performance. Following these two models, we have the GBTs, LR, LSVM and FM classifiers with F-scores of 98%, 95%, 94% and 85%, respectively.

Fig. 6. Performance evaluation.

In this research, the Factorization Machines model is outperformed by the other models. This might be due to the poor performance of FM on dense data, as FMs perform best on data with high sparsity [13]. As summarised in Table 2, the Decision Tree and Random Forest models show the best performance (highest ROC scores) with a value of 99.5%, while the Factorization Machines model gives the weakest performance (lowest ROC score) with 88.16%. This result is confirmed by the ROC plots displayed in Figure 7.

Fig. 7. ROC plots for the DT, RF and FM algorithms.

Figure 7 shows the ROC plots for the DT, RF and FM algorithms proposed in this paper. The area under a ROC curve (AUC) is the expectation that a model will give a higher score to a randomly selected positive-class data point than to a randomly selected negative-class data point. A ROC curve that traces a diagonal line signifies a poor classification algorithm that randomly guesses whether a loan will default. In addition, a well-performing model has an AUC close to 1, while the AUC is closer to 0.5 for poorly performing algorithms [8]. Therefore, as shown in Figure 7, the ROC curves for DT and RF tend towards 1, showing better model performance compared with the ROC curve of FM.
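For reference, metrics of the kind reported in Table 2 can be derived with Spark's built-in evaluators together with simple confusion-matrix counts. The following is a sketch for a single model, reusing the predictions dictionary from the modelling sketch; it is illustrative rather than the authors' evaluation code.

```python
# Sketch: deriving AUC, accuracy and confusion-matrix based metrics for one model.
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)
from pyspark.sql import functions as F

pred = predictions["RF"]  # e.g. the Random Forest predictions

auc = BinaryClassificationEvaluator(labelCol="label",
                                    metricName="areaUnderROC").evaluate(pred)
accuracy = MulticlassClassificationEvaluator(labelCol="label",
                                             metricName="accuracy").evaluate(pred)

# Confusion-matrix counts, from which precision, recall and F-score follow.
tp = pred.filter((F.col("label") == 1.0) & (F.col("prediction") == 1.0)).count()
fp = pred.filter((F.col("label") == 0.0) & (F.col("prediction") == 1.0)).count()
fn = pred.filter((F.col("label") == 1.0) & (F.col("prediction") == 0.0)).count()
tn = pred.filter((F.col("label") == 0.0) & (F.col("prediction") == 0.0)).count()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print(f"AUC={auc:.3f} accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F-score={f_score:.3f}")
```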
6 Conclusion

In this paper, the application of data mining and Big Data techniques in building loan default predictors is studied. Six models were developed using the Spark MLlib API, allowing us to identify the best-performing model for loan default prediction. Based on the model evaluation, the Random Forest model presented the highest accuracy (99.619%). The Random Forest and Decision Tree models also have the best performance in terms of ROC curve (99.56%), precision (99.8%), recall (99.2%) and F-score (99.5%). Therefore, it can be concluded that the Decision Tree and Random Forest classifier models are the most efficient and accurate in predicting the binary categories of loan default.

Based on the results obtained, Spark machine learning library-based models have shown promising results in the prediction of loan default in this research. They allow financial institutions (lenders) to be informed of defaults on issued loans beforehand, which will help them reduce financial loss and the costs associated with loan recovery, and in turn increase profits.

References

1. Adewusi, A.O., Oyedokun, T.B., Bello, M.O.: Application of artificial neural network to loan recovery prediction. International Journal of Housing Markets and Analysis (2016)
2. Chambers, B., Zaharia, M.: Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media, Inc. (2018)
3. Hamid, A.J., Ahmed, T.M.: Developing prediction model of loan risk in banks using data mining. Machine Learning and Applications: An International Journal (MLAIJ) 3(1) (2016)
4. Hassan, A.K.I., Abraham, A.: Modeling consumer loan default prediction using neural netware. In: 2013 International Conference on Computing, Electrical and Electronic Engineering (ICCEEE). pp. 239–243. IEEE (2013)
5. Klaas, J.: Loan default model trap. https://www.kaggle.com/jannesklaas/model-trap (Accessed on 13/10/2021)
6. Lai, L.: Loan default prediction with machine learning techniques. In: 2020 International Conference on Computer Communication and Network Security (CCNS). pp. 5–9. IEEE (2020)
7. Marqués Marzal, A.I., García Jiménez, V., Sánchez Garreta, J.S.: Exploring the behaviour of base classifiers in credit scoring ensembles (2012)
8. Meer, K.: Machine learning models for mortgage default prediction in Pakistan. In: 2021 International Conference on Artificial Intelligence (ICAI). pp. 164–169. IEEE (2021)
9. Murray, J.: Default on a loan. United States Business Law and Taxes Guide; National Credit Act (2005), Act No. 34 of 2005, Republic of South Africa (2011)
10. Odegua, R.: Predicting bank loan default with extreme gradient boosting. arXiv preprint arXiv:2002.02011 (2020)
11. Patel, B., Patil, H., Hembram, J., Jaswal, S.: Loan default forecasting using data mining. In: 2020 International Conference for Emerging Technology (INCET). pp. 1–4. IEEE (2020)
12. Reddy, M.J., Kavitha, B.: Neural networks for prediction of loan default using attribute relevance analysis. In: 2010 International Conference on Signal Acquisition and Processing. pp. 274–277. IEEE (2010)
13. Rendle, S.: Factorization machines. In: 2010 IEEE International Conference on Data Mining. pp. 995–1000. IEEE (2010)
14. Sheikh, M.A., Goel, A.K., Kumar, T.: An approach for prediction of loan approval using machine learning algorithm. In: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC). pp. 490–494. IEEE (2020)
15. Turkson, R.E., Baagyere, E.Y., Wenya, G.E.: A machine learning approach for predicting bank credit worthiness. In: 2016 Third International Conference on Artificial Intelligence and Pattern Recognition (AIPR). pp. 1–7. IEEE (2016)
16. Wang, B., Liu, Y., Hao, Y., Liu, S.: Defaults assessment of mortgage loan with rough set and SVM. In: 2007 International Conference on Computational Intelligence and Security (CIS 2007). pp. 981–985. IEEE (2007)