Machine Learning based model for Loan Amount Prediction and
Distribution

Hitesh K. Sharma 1, Tanupriya Choudhury 2, Prashant Ahlawat 3, Sachi N. Mohanty 4, Sarika Jain 5

1,2 University of Petroleum & Energy Studies (UPES), Dehradun 248007, India.*
3 Dept. of Computer Science, Chandigarh University, Punjab 140413, India.
4 !Department of Computer Science, Singidunum University, Serbia and School of Computer Science &
Engineering, VIT-AP University, Amaravati, Andhra Pradesh, India.
5 Dept. of Computer Application, National Institute of Technology, Kurukshetra 136119, Haryana, India.

! These authors contributed equally and are all first authors.
* Corresponding authors: Tanupriya Choudhury and Hitesh Kumar Sharma.


                 Abstract
                 With advances in technology and the expansion of businesses and ideas, more and more
                 people are applying for loans, for both personal and business use. Because a bank's
                 assets are limited, it cannot grant a loan to every applicant. Identifying the right
                 applicants is a tedious and time-consuming process. Banks want to lend to individuals
                 who can repay the loan on time and who offer the bank the most profit. There is
                 therefore a need for a system that can perform this analysis and save the bank time and
                 resources. This can be done using machine learning. The objective of this paper is to
                 create a more accurate loan prediction model using machine learning, to reduce the risk
                 involved in selecting appropriate people for a loan. To do this we mine the previous
                 records of the people to whom the bank has granted loans. Using the variables in these
                 records together with the bank's loan rules, we train a machine learning model that
                 predicts whether or not a person is eligible for a loan. We use sklearn for our models
                 and train_test_split to split the data set into a train dataset and a test dataset.
                 Because the problem is a supervised classification problem, we compare several models,
                 namely Logistic Regression, Decision Tree (DT) and Random Forest (RF), to obtain more
                 accurate results.


                 Keywords
                 Machine Learning, Decision Tree, Logistic Regression, Random Forest, Loan, Mine, Train,
                 Prediction.

1. Introduction
   Most of a bank's profit comes from lending to people, so the distribution of loans is a very
important activity for every bank. A loan needs to go to the right person; otherwise the bank can face
financial trouble and loss of profit. Banks aim to place their assets in safe hands where they will
earn the most interest. Banks investigate various factors before lending a loan,

ACI’22: Workshop on Advances in Computation Intelligence, its Concepts & Applications at ISIC 2022, May 17-19, Savannah, United States
EMAIL: hkshitesh@gmail.com (A.1); tanupriya1986@gmail.com* (A.2); sachinandan09@gmail.com (A.3);
anilgrcse@gmail.com (A.4); rohinaruna@gmail.com (A.5)
ORCID: 0000-0001-6816-0324 (A.1); 0000-0002-9826-2759 (A.2); 0000-0002-4939-0797 (A.3);
0000-0002-4256-5873 (A.4); 0000-0001-5809-7317 (A.5)
              ©️ 2020 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)


such as whether the applicant will be able to repay the loan, the applicant's financial condition, and
the purpose of the loan. Many banks follow rigorous processes to choose the right applicant, yet there
is still no assurance that the selected applicant is the most deserving of all the candidates. This
costs both the bank and its customers time. With the proposed system we can predict, to a measurable
level of accuracy, whether it is safe to lend to an applicant using machine learning. Because the whole
process is automated, the results cannot easily be altered by hand, and paperwork that would otherwise
take a few days can be completed quickly, saving time for bank employees and customers alike. The
system bases its prediction on several factors, although in real life a single strong factor can
sometimes be enough to grant a loan.
   The records of previous customers are mined to train our model. Data mining is the process of
extracting useful information from a large dataset. Classification, clustering and association are the
types of data mining, and classification is the type relevant to this work. Classification techniques
include decision trees, neural networks, support vector machines, logistic regression and k-nearest
neighbours. In machine learning we need two datasets: a training dataset used to fit the model and a
test dataset used to evaluate it. Both are taken from one original dataset. In this paper we use the
train_test_split function of sklearn's model_selection module to split the data into a training
dataset and a test dataset.
   A decision tree solves the problem with a tree representation in which every leaf node represents a
class label and the internal nodes represent attributes. Any boolean function on discrete attributes
can be represented using a decision tree. Random forest also belongs to the supervised learning
category and can solve both classification and regression problems. It uses ensemble learning, the
process of combining multiple classifiers to solve a complex problem and improve performance. It
builds a set of decision trees, each from a random subset of the training set, and then gathers the
votes of the individual trees to make the final prediction. Logistic regression is a supervised
classification algorithm. It predicts a categorical dependent variable, so its outcome is a categorical
or discrete value. It is similar to linear regression, except that it models the probability of a
class rather than a continuous value.
   The core objective of this paper is to create a less complex loan prediction system. The model has
been implemented in Python using Google Colab, the pandas library for data manipulation and the
seaborn library for data visualization.
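The voting idea behind random forest can be illustrated with a minimal sketch. It assumes synthetic data from make_classification in place of the loan records, and the number of trees (15) and the seeds are arbitrary choices made for illustration, not values from this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the paper's loan data is not reproduced here).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.default_rng(0)
trees = []
for seed in range(15):  # an odd number of trees avoids tied votes
    # Bootstrap: sample rows with replacement from the training set.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

# Each tree votes 0 or 1; the majority class is the ensemble prediction.
votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("agreement with labels:", (ensemble_pred == y).mean())
```

Note that sklearn's own RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch follows the description in the text rather than the library's internals.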


2. Literature Review
   We studied various research papers on loan prediction models. G. Arutjothi and Dr. C. Senthamaria
[1] explained that predicting credit defaulters is a complex task, so a machine learning model is
needed to save time and resources. Using the R software, they proposed that combining Min-Max
normalization with a K-Nearest Neighbor (K-NN) classifier predicts loan approvals accurately.
Aboobyda Jafar Hamid and Tarig Mohammed Ahmed [2] from the University of Khartoum, Sudan used the j48,
bayesNet and naiveBayes algorithms for loan prediction and concluded that j48 was the best for the
accurate prediction of credit approvals. They used the Weka application to implement and test the
model and compared the results of all three algorithms. Kumar Arun, Garg Ishan and Kaur Sanmeet [3]
explained the use of various classification models such as Random Forest, SVM, LM, Nnet and ADB for
the prediction of loan approvals. Glorfeld and Hardgrave [4] proposed a high-performance model with an
optimized neural network design for estimating the creditworthiness of loan applications. Their model
correctly predicted 75% of the loan applicants. Andy Liaw and Matthew Wiener [5] described
classification and regression with Random Forest. Stephan Dreiseitl and Lucila Ohno-Machado [6]
reviewed artificial neural network classification and logistic regression models and showed how
logistic regression is useful in building prediction systems. Machine learning predictive models are
also a good choice for loan prediction in the current market scenario [7][8][9].


3. Proposed Model
3.1.1. Data analysis
    First we perform exploratory data analysis, then preprocessing, and finally we test different
machine learning models on the data. The data set that we use consists of the columns shown in Fig. 1.


Figure 1: Data set columns

   We import (Fig. 2) the libraries needed in the code, such as seaborn for visualization and pandas
for data manipulation, and then load the dataset. Using the head function (Fig. 3, 4) we can view the
first few rows of our dataset.


Figure 2: Importing important libraries


Figure 3: Head view of the train data set


Figure 4: Head view of the test data set
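Since the screenshots are not reproduced here, the loading step can be sketched as follows. The column names and the rows are assumptions made for illustration (the paper's column list is in Fig. 1), and an in-memory CSV string stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the loan CSV; column names and values are assumptions.
csv_text = """Loan_ID,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Credit_History,Loan_Status
A001,Graduate,5849,0,,1,Y
A002,Graduate,4583,1508,128,1,N
A003,Not Graduate,2583,2358,120,1,Y
A004,Graduate,6000,0,141,,Y
"""

# With a real file this would be pd.read_csv("train.csv") (file name assumed).
train = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows for a quick look at the data.
preview = train.head(3)
print(preview)
```

Empty CSV fields are read as missing values (NaN), which is why the missing-value handling in the later steps is needed.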


Let’s see the data types of all the columns of our train dataset (Fig. 5).


Figure 5: Data types of all the variables in the data

As shown above there are 3 data types:
1. object: This means that the variables are categorical
2. int64: This shows the integer variables
3. float64: This data type shows variables have decimal values
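A short sketch of how these data types can be inspected with pandas is given below. The three columns are hypothetical examples, one for each kind listed above, and are not taken from the paper's dataset.

```python
import pandas as pd

# Toy frame with assumed column names, one per kind of variable listed above.
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],   # categorical text
    "ApplicantIncome": [5849, 4583, 2583],                   # int64
    "LoanAmount": [128.0, None, 120.0],                      # float64 (has a gap)
})

# dtypes lists the data type of every column.
print(df.dtypes)

# Group the column names by kind for later preprocessing.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
numeric_cols = df.select_dtypes(include="number").columns.tolist()
```

Separating the categorical and numerical column names early makes the later steps (mode or median filling, label encoding) easier to apply to the right columns.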
  Now let's study the distribution of the numerical variables, starting with the applicant income and
the loan amount. We do this with the help of seaborn's visualisation functions.


Figure 6: Graph for Applicant Income

    From the graph (Fig. 6) we can see that the distribution changes abruptly and has a few large
deviations. This may be due to the missing values in the dataset. We can therefore drop those missing
values with the dropna function and then plot the graph for the loan amount (Fig. 7).


Figure 7: Graph for the Loan Amount variable
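The dropna step can be sketched as follows. The column name and values are made up, and the skewness is printed as a numerical stand-in for the plot so that the example does not depend on a plotting backend.

```python
import numpy as np
import pandas as pd

# Assumed column name; values are made up, with two missing entries.
loan_amount = pd.Series([128.0, np.nan, 120.0, 141.0, np.nan, 650.0], name="LoanAmount")

# dropna() returns a new Series without the missing entries.
clean = loan_amount.dropna()

# A positive skew means a long right tail, i.e. a few very large loans.
print(len(loan_amount), "->", len(clean), "rows; skew =", round(clean.skew(), 2))

# With seaborn installed, the cleaned values could then be plotted, for example
# with sns.histplot(clean, kde=True).
```

Dropping rows is only used here for plotting; for modelling, the paper fills the missing values instead (Section 3.1.2).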

   Now let's look at the co-applicant income distribution (Fig. 8).


Figure 8: Graph for the Co Applicant Income

   It looks similar to the applicant income distribution. We would expect people with more education
to have higher incomes than those with less, so let's plot income against education level (Fig. 9).


Figure 9: Graph of Applicant Income by Education
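A numerical counterpart to this graph can be sketched with a pandas groupby. The column names and the five rows are assumptions chosen only to illustrate the comparison.

```python
import pandas as pd

# Made-up rows; "Education" and "ApplicantIncome" are assumed column names.
df = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [5849, 4583, 20000, 2583, 3200],
})

# Median, spread and maximum of income within each education level.
summary = df.groupby("Education")["ApplicantIncome"].agg(["median", "std", "max"])
print(summary)
```

A larger standard deviation and maximum in one group correspond to the wider spread and the extreme points that a box plot of the same data would show.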


   From the graph (Fig. 9) we can see that graduates show more deviation in income, which suggests
that the applicants with the highest incomes are well educated. Loan history is another interesting
variable that can affect the loan prediction. By converting loan status to 0 or 1, we can calculate the
mean loan status for each value of loan history. Values closer to 1 indicate high loan success rates.


Figure 10: Mean for Loan History

  The results indicate that the loan history variable will play an important role in the loan
prediction of our model (Fig. 10).
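The mean-by-loan-history calculation described above can be sketched as follows. The column names Credit_History and Loan_Status, the Y/N coding and the six rows are all assumptions made for illustration.

```python
import pandas as pd

# Made-up rows; column names and the Y/N coding are assumptions.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# Turn the Y/N status into 1/0 so that its mean is an approval rate.
df["approved"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# Mean approval per loan-history value; values near 1 mean most loans succeeded.
approval_rate = df.groupby("Credit_History")["approved"].mean()
print(approval_rate)
```

In this toy data, applicants with a loan history value of 1 are approved 75% of the time and those with 0 are never approved, which is the kind of gap that marks a variable as a strong predictor.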


3.1.2. Missing data values & processing the data:
   A dataset may or may not have missing values, but the graphs above suggest that ours does. Let's
check how many values are missing for each variable in our dataset (Fig. 11).


Figure 11: Checking for missing values in dataset
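Counting the missing values per variable can be sketched in pandas as follows. The columns and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps; column names are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 141.0],
    "ApplicantIncome": [5849, 4583, 2583, 6000],
})

# isnull() marks each missing cell; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```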

    For categorical variables, one solution is to fill the missing values (Fig. 12, 13) with the mode,
that is, the value with the highest frequency. For numerical variables we can use either the mean or
the median, but since we have seen above that there are outliers, the median is the more appropriate
choice for filling all the missing numerical values (Fig. 14).


Figure 12: Fixing the missing values in train dataset


Figure 13: Fixing the missing values in test dataset
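The mode and median filling described above can be sketched as follows. The two columns and their values are assumptions, and the value 650 is included to show why the median is preferred when outliers are present.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],        # categorical
    "LoanAmount": [128.0, np.nan, 120.0, 650.0],       # numerical with an outlier
})

# Categorical: fill with the most frequent value (mode() returns a Series).
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# Numerical: fill with the median (128 here); the mean (about 299) would be
# pulled upwards by the outlier 650.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
print(df)
```

One common practice, not confirmed by the paper, is to compute the fill values on the train dataset only and reuse them for the test dataset, so that no information from the test data influences preprocessing.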

   Now let's work on the deviations (outliers). A log transformation does not remove them, but it
compresses large values and so reduces their effect. We also combine applicant income and co-applicant
income into a total income column.


Figure 14: Fixing the outliers for train dataset


Figure 15: Fixing the outliers for test dataset
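The total income column and its log transformation can be sketched as follows. The column names, including TotalIncome and TotalIncome_log, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last applicant is an income outlier.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 2583, 81000],
    "CoapplicantIncome": [0.0, 1508.0, 2358.0, 0.0],
})

# Combine both incomes, then compress the long right tail with a natural log.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["TotalIncome_log"] = np.log(df["TotalIncome"])

print(df[["TotalIncome", "TotalIncome_log"]])
```

On the raw scale the largest total income is about 16 times the smallest, while on the log scale the two differ by less than 3 units. np.log requires strictly positive inputs; if a total could be zero, np.log1p would be a safer choice.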

   The graph in Fig. 15 is a histogram of the log of total income, and it is much closer to a normal
distribution. We then divide our dataset into independent and dependent variables: X holds the
independent variables and y holds the dependent variable (Fig. 16).


Figure 16: Dividing the train dataset into dependent and independent variables.

   Now we will split our dataset into train dataset and test dataset using the train_test_split (Fig. 17).


Figure 17: Splitting the train dataset
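The two steps above (separating X and y, then splitting) can be sketched as follows. The column names, the 80/20 ratio and the random seed are assumptions, since the paper's settings are only visible in the figures.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already numeric rows; column names are assumptions.
df = pd.DataFrame({
    "TotalIncome_log": [8.67, 8.71, 8.51, 9.17, 8.70, 8.60, 9.05, 8.40, 8.95, 8.80],
    "Credit_History": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
    "Loan_Status": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["Loan_Status"])   # independent variables
y = df["Loan_Status"]                  # dependent variable

# Hold out 20% of the rows for testing; the ratio and seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so that accuracy values from different models are compared on the same test rows.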


3.1.3. Creating the model & Testing:
   We use sklearn to create our models. Before creating them, we need to convert all the categorical
variables to numbers. We can do this easily with LabelEncoder, which is available in sklearn.


Figure 18: Changing all the categorical variables to numbers
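A minimal sketch of the encoding step with LabelEncoder is given below. The column names and category values are assumptions, and one encoder is kept per column so that the mapping can be reused or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; column names and categories are assumptions.
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 2583],
})

# Encode each text column separately; classes are numbered in sorted order.
encoders = {}
for col in ["Education", "Property_Area"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

The sklearn documentation describes LabelEncoder as intended for target labels; for input features, OrdinalEncoder or OneHotEncoder are the usual alternatives, because integer codes imply an ordering that tree models tolerate but linear models may misread.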

    All the categorical variables are now turned into numbers (Fig. 18). Next we scale our data, since
features on comparable scales can improve the predictions of some models. We then test different
classification models for accuracy. The first model we test is the Decision Tree (Fig. 19).


Figure 19: Decision tree accuracy test
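Scaling followed by a decision tree accuracy test can be sketched as follows. Synthetic data stands in for the encoded loan data, and StandardScaler is an assumed choice because the paper does not name the scaler it uses.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan data (an assumption, not the paper's data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Train the tree and measure the fraction of test rows it classifies correctly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train_s, y_train)
tree_accuracy = accuracy_score(y_test, tree.predict(X_test_s))
print(f"Decision tree accuracy: {tree_accuracy:.3f}")
```

Decision trees split on thresholds and are largely insensitive to feature scaling; the scaling step matters more for models such as logistic regression.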


   Now let’s see the accuracy for Random Forest (Fig. 20).


   Figure 20: Random Forest accuracy test

   This model has a greater accuracy than the decision tree. Now let’s test for Logistic Regression model
(Fig. 21).


   Figure 21: Logistic Regression accuracy test

    The accuracy of logistic regression is higher than that of both the DT and RF algorithms, so it is
a good model to use for loan prediction on this dataset.
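The three-way comparison can be sketched in a single loop. The data is synthetic and the hyperparameters (100 trees, max_iter=1000, the seeds) are assumptions, so the accuracy values and their ranking here say nothing about the results reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed loan data (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every model on the same training rows and score it on the same test rows.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy[name]:.3f}")
```

Because a single train/test split can favour one model by chance, a cross-validated comparison (for example with sklearn's cross_val_score) would give a more stable ranking.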


4. Conclusion
   In this paper three models, namely Logistic Regression, Decision Tree and Random Forest, were
applied to build loan prediction models that forecast the loan approval status of applicants as Yes or
No. After training and testing all three models on the train and test datasets, we compared their
accuracy values and concluded that the Logistic Regression algorithm is the most suitable for
predicting loan approval, as it had the highest accuracy. Using this algorithm we will be able to
identify the right applicants for loan approval with good accuracy.


5. References

  [1] G. Arutjothi and Dr. C. Senthamaria. “Prediction of Loan Status in Commercial Bank using
      Machine Learning Classifier”, International Conference on Intelligent Sustainable
      Systems (ICISS 2017). doi:10.1109/iss1.2017.8389442.
  [2] Aboobyda Jafar Hamid and Tarig Mohammed Ahmed. “DEVELOPING PREDICTION
      MODEL OF LOAN RISK IN BANKS USING DATA MINING”, Machine Learning and
      Applications: An International Journal (MLAIJ) Vol.3, No.1, March 2016.
  [3] Kumar Arun, Garg Ishan, Kaur Sanmeet. “Loan Approval Prediction based on Machine
      Learning Approach”, National Conference on Recent Trends in Computer Science and
      Information Technology (NCERT CSIT-2016).
  [4] Louis W. Glorfeld and Bill C. Hardgrave. “An improved method for developing neural
      networks: The case of evaluating commercial loan creditworthiness”, Computers & Operations
      Research, Volume 23, Issue 10, October 1996.
  [5] Andy Liaw and Matthew Wiener. “Classification and Regression by randomForest”, R News, ISSN
      1609-3631, Vol. 2/3, December 2002.
  [6] Stephan Dreiseitl and Lucila Ohno-Machado. “Logistic regression and artificial neural network
      classification models: a methodology review”, Journal of Biomedical Informatics, Volume 35,
      Issues 5–6, October 2002.
  [7] T. Choudhury, G. Dangi, T. P. Singh, A. Chauhan and A. Aggarwal, "An Efficient Way to
      Detect Credit Card Fraud Using Machine Learning Methodologies," 2018 Second International
      Conference on Green Computing and Internet of Things (ICGCIoT), 2018, pp. 591-597, doi:
      10.1109/ICGCIoT.2018.8753077.
  [8] S. Taneja, D. Garg, M. V. Tarun Kumar and T. Choudhury, "The Machine Predicted Market,"
      2018 International Conference on Computational Techniques, Electronics and Mechanical
      Systems (CTEMS), 2018, pp. 256-260, doi: 10.1109/CTEMS.2018.8769306.
  [9] Sharma, H.K., Choudhury, T., Toe, T.T. (2022). Machine Learning Based Predictive Analytics:
      A Use Case in Insurance Sector. In: Jeyanthi, P.M., Choudhury, T., Hack-Polay, D., Singh,
      T.P., Abujar, S. (eds) Decision Intelligence Analytics and the Implementation of Strategic
      Business Management. EAI/Springer Innovations in Communication and Computing.
      Springer, Cham. https://doi.org/10.1007/978-3-030-82763-2_14.

