<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A User-Centered Approach to Create Realistic Datasets for AI. Case Study: Creditworthiness in the Banking Sector</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Zampino</string-name>
          <email>francesca.zampino@studenti.unisalento.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonella Longo</string-name>
          <email>antonella.longo@unisalento.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Zappatore</string-name>
          <email>marcosalvatore.zappatore@unisalento.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salento</institution>
          ,
          <addr-line>via Lecce-Monteroni, 73100 Lecce (LE)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Businesses are evolving as new digital tools make their information systems more efficient. Decision-making and strategic processes can benefit from innovation opportunities such as Machine Learning. The main issue encountered in Artificial Intelligence applications is that data may be unavailable or unsuitable for the case under study. This paper proposes a solution to this problem: generating simulated data for AI. The case study is creditworthiness in the banking sector; loans are considered the main source of income for the banking sector, as well as its main source of risk. Consequently, the evaluation of creditworthiness is a key activity both for banks and for customers. To address this need, we propose a solution that helps lenders evaluate credit applications and helps customers become aware of behaviors that can reduce their credit score. The approach proposed in this paper, named IDEA, aims at creating realistic datasets for Artificial Intelligence that meet specific business needs and respect users' requests. We analyze the current literature and the methods for the evolution of conceptual models, using pre-existing datasets. The proposed approach draws from and extends such literature. The intended application is the banking sector, to assess the creditworthiness of customers who have entered into financial relationships. The envisaged use case is therefore to forecast the probability of borrowers going bankrupt. The paper defines the approach and applies it to specific financial datasets for the use case. Moreover, the datasets are validated by means of the Data Quality Index before IDEA is applied to predict credit solvency.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>realistic creditworthiness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The loan is a core business for the banking sector, as well as the main source of financial risk for
banks. European data show that the loan is the most widely used financing instrument for small and
medium-sized enterprises. The situation in which an asset carries high risk because a
borrower may be unable to repay the loan within the agreed time is called "credit risk" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A borrower’s
creditworthiness is based on a numerical value named the "credit score". In general, this value
helps authorities calculate the likelihood that a borrower will repay the loan within the designated time.
Creditworthiness means the ability of a debtor (in this case, a financial intermediary) to repay its debts
at maturity, based on credit history or payment history.
      </p>
      <p>
        Recently, researchers and banks have chosen to train classifiers based on various machine learning
and deep learning algorithms to automatically predict an applicant’s credit score from their credit
history and other historical data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For example, we can estimate the future value of the credit score
or the probability of default before issuing a loan. To reach our goals, the process is divided into
the steps explained below.
      </p>
      <p>First, we searched for and selected scientific papers targeting the same goal as
this study: this allowed us to identify the financial variables of interest and the corresponding
datasets.</p>
      <p>Understanding the elements that identify the context is useful for the implementation of a
consistent database model. The analysis process starts from the choice of models and related
variables, considered useful to describe the banking context. In general, the literature is
characterized by datasets that identify a loan as a reference entity, with attention to the credit
history of the applicants.</p>
      <p>Secondly, it is important to verify the applicability and value of the model applied to different
bank cases.</p>
      <p>The model must be validated to evaluate the quality of datasets.</p>
      <p>The paper is organized as follows:</p>
      <p>Section 2 presents the literature analysis we performed to ground the proposed approach,
which is described in detail in Section 3. Section 4 discusses dataset validation, based on a
dedicated Data Quality Index, along with the achieved results. Conclusions are drawn in
Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art</title>
      <p>This section presents an analysis of the scientific papers chosen as the standard reference on which
the proposed IDEA approach is grounded. This analysis was carried out to evaluate the models most widely
used in traditional and innovative banking realities. The literature review is therefore the starting
point for our research. We focused on two main topics:
● Commercial banking
● Peer to Peer lending</p>
      <p>
        The first category of identified papers covers traditional bank loans [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The second category refers instead to a financial innovation, Peer to Peer lending: a loan between
individuals, granted without traditional financial intermediation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The analyzed papers present
an analysis of the banking sector and apply SEMMA as a data mining design model. For our study, the
SEMMA method is more useful than the alternative model, CRISP-DM, because SEMMA pays attention
to the user requests that our study addresses.
      </p>
      <p>
        SEMMA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the multi-stage method applied by the papers analyzed.
      </p>
      <p>1. SAMPLE: First, the goal is to identify a representative model for the population. Collecting
data from the whole population is a very difficult task, so SEMMA offers the
opportunity of using a sample of population data for the development of the model.
2. EXPLORE: The next step of the SEMMA methodology is data review.
3. MODIFY: The main tasks related to data modification are the conversion of data types and the
management of missing values.
4. MODEL: In this step of SEMMA, several algorithms and mining techniques are applied to
develop the proposed model. The purpose of this step is to identify the hidden and meaningful
information in the pre-processed data set. The algorithms used include Decision Tree,
Logistic Regression and Neural Network.
5. ASSESS: Once the model has been implemented and validated with all the
proposed techniques, the test data is fed into each model for the loan approval
prediction.</p>
      <p>
        The SEMMA methodology is widely used in the literature and can be compared with two other tools
for machine learning models: CRISP-DM and KDD [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. SEMMA and CRISP-DM are an evolution of
KDD (1996).
      </p>
      <p>The CRISP-DM standard was published in its first version in 1999 in Brussels and it is composed
of 6 main stages, which can be added at the end. The steps are:
1. Business Understanding: understanding the business problems.
2. Data Understanding: understanding how data and analysis
can solve the problem identified in the previous phase.
3. Data Preparation: the data cleaning and review phase.
4. Modeling: the choice of the algorithm suitable for the use case.
5. Evaluation: an evaluation of the outputs; part of the data, the test
set, can be used to compare the results of the model with the real ones.
6. Deployment: the model goes into production, that is, it is used on a
large scale.</p>
      <p>The SEMMA method allows the development of an application domain for end-user goals; in the
SEMMA Sample phase, data cannot be sampled unless there is an understanding of all business needs.</p>
      <p>The approach developed in this study stems from the so-called "business goals": business purposes
to be defined upstream, as a reference point for the model to be developed. It is better represented by
the SEMMA method, for the reasons explained above. The aim is to create a suitable model to meet the needs
of users. During this study, the data samples were selected from the Kaggle repository.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The proposed approach: IDEA</title>
      <p>In this Section, we discuss IDEA (realIstic DatasEt for Artificial intelligence), a systematic
modeling approach designed to respond to business needs, so that the expectations of target users can be
addressed properly. IDEA is an extension of the SEMMA model discussed in Section 2. We
identify the requirements of companies and users in order to carry out an in-depth search of existing
data repositories.</p>
      <p>On the one hand, we will evaluate different data sources to develop an optimal combination of
variables, since our purpose is to maximize strategic business benefits.</p>
      <p>On the other hand, the research and choice of datasets will be followed by the development of a
conceptual model as a graphic representation of the context. A conceptual model can explain the main
entities and relationships between them. It will be populated by considering datasets suitable to
represent the analysis scenario.</p>
      <p>IDEA aims to model borrowing activities, essentially in the form of loans, whose
borrowers are individuals aged 25 and over from Northern, Central and Southern Italian regions.</p>
      <p>In order to apply the proposed approach, an open dataset for commercial banking was selected,
because granting loans to private individuals is the core business of a commercial bank. The dataset
choice is motivated by the purpose to provide a basic model for banking institutions which can also be
turned into a more complex model, such as P2P lending. At the same time, it is also useful for small
traditional banking companies.</p>
      <p>Our aim is to demonstrate the applicability of the proposed IDEA approach to all banking realities,
small and medium-sized enterprises or companies with high turnover. This means that the model can
be used for traditional and innovative banks, because IDEA is presented as a standard model with
traditional features, but it can be changed thanks to innovative variables, for example P2P lending ones.
This model does not focus on the number of variables, but it is characterized by functional attributes to
identify entities and context. A limited number of attributes is due to the choice of creating a model for
small banks which can be developed through other variables to become a large bank model. The process
can be applied from micro to macro realities.</p>
      <p>The model is provided as a general application guideline that can be adapted to the banking reality
that decides to apply it. The underlying entities of any financial relationship are customer, loan, and
bank.</p>
      <p>Moreover, Figure 1 shows a real estate entity, linked to the loan through an optional
relationship: the asset can be a guarantee for the financial relationship. Although the model represents
financial deals, we chose to specify the optional loan guarantee to support the assessment of the
solvency of borrowers and of the Probability of Default. Our dataset differs from existing ones because it does not cause
the operational problems of data normalization related to null values or duplicated data. Our model also
makes it possible to identify the relationships between customer, loan and bank. IDEA defines each key
attribute, whereas for Kaggle datasets the key data must be integrated by generating key attributes
randomly.</p>
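      <p>As a minimal sketch of how key attributes could be generated randomly for records that lack them, the following Python fragment assigns distinct random integer keys. The function name <monospace>assign_random_keys</monospace>, the field names and the sample records are hypothetical; they are not taken from the Kaggle datasets or from the IDEA implementation.</p>
      <preformat><![CDATA[
```python
import random


def assign_random_keys(records, key_name, seed=None):
    """Return copies of the records, each extended with a distinct random integer key."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no two records share a key.
    keys = rng.sample(range(1, 10 * len(records) + 1), len(records))
    return [dict(record, **{key_name: key}) for record, key in zip(records, keys)]


# Hypothetical loan records without an identifier.
loans = [
    {"amount": 12000, "duration_months": 48},
    {"amount": 5000, "duration_months": 24},
    {"amount": 30000, "duration_months": 120},
]
keyed_loans = assign_random_keys(loans, "loan_id", seed=42)
```
]]></preformat>
      <p>Sampling without replacement is used so that the generated values can serve as primary keys; the original records are left unchanged.</p>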
      <p>In fact, the use case of the model is forecasting the evolution of a credit portfolio, in terms of its
financial reliability. IDEA can be, therefore, critical to define a bank strategy.</p>
      <p>The search for relevant banking datasets also involved an evaluation of different
online sources and data repositories currently available. Among these, Kaggle<sup>2</sup>, Dataport<sup>3</sup>, World
Bank<sup>4</sup>, World Economic Forum<sup>5</sup>, Towards Data Science<sup>6</sup> and data.world<sup>7</sup> were considered. These
datasets are open, but it is important to carry out an initial verification and selection of the attributes, which must be
understandable and, at the same time, consistent with the model. After a careful analysis, Kaggle was
considered the most consistent repository, because all the variables are available. As explained
previously, IDEA focuses on a limited number of attributes that identify a small, medium, or large bank.
The three main variables we considered are loans, customers and guarantees, which were found in
the Kaggle datasets deemed suitable for the model. This source is the best one to represent a
development from micro to macro realities.</p>
      <p>The proposed methodology aims at identifying the main entities (clients, loans, real estate and
guarantees) and the corresponding attributes, in order to develop a suitable Entity-Relationship conceptual model
and then build a physical relational database. It is important to define the loan as the reference entity,
identified by specific attributes properly related to other entities such as the borrower (client). The literature
analysis allowed us to identify the main IDEA attributes. The elements found in most of the articles are the
following:
● Id - loan
● Id - borrower
● Sex
● Personal data
● Education
● Income of the borrower (main borrower)
● Income of the second debtor
● Amount of financing
● Duration of financing in months
● Credit history
● State of financing
● Interest rate
● Spread on interest rate
● Installment
● Date of issue of the loan
● Purpose
● Default of the loan (1 = defaulting borrower; 0 = fulfilling borrower)
● Card code (YES / NO)
● Credit score
● Year of birth
● Level of credit
● Age
● Up front charges
In addition, the following variables relate to the possibility that a loan is secured by real estate:
● Type of guarantee
● Type of building
● Amount of property evaluation</p>
      <p>2 https://www.kaggle.com/
3 https://ieee-dataport.org/
4 https://www.worldbank.org/en/home
5 https://www.weforum.org/
6 https://towardsdatascience.com/
7 https://data.world/</p>
      <p>These elements were used to build a database, whose conceptual model is partially shown in Figure
1. Since we created the model to estimate solvency, we proceeded with its
preliminary validation before applying it to a case study. The next Section discusses the validation
of IDEA.</p>
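      <p>One possible way to turn the entities and attributes listed above into a physical relational database is sketched below with the sqlite3 module from the Python standard library. Table names, column names and sample values are assumptions chosen for illustration, and only a subset of the attributes is shown; the model in Figure 1 may differ. The optional relationship between loan and real estate is expressed as a nullable foreign key.</p>
      <preformat><![CDATA[
```python
import sqlite3

# Hypothetical schema: names and types are illustrative, not the authors' schema.
SCHEMA = """
CREATE TABLE bank (
    bank_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE customer (
    borrower_id INTEGER PRIMARY KEY,
    sex TEXT,
    education TEXT,
    year_of_birth INTEGER,
    income REAL,
    credit_score INTEGER
);
CREATE TABLE real_estate (
    real_estate_id INTEGER PRIMARY KEY,
    guarantee_type TEXT,
    building_type TEXT,
    property_value REAL
);
CREATE TABLE loan (
    loan_id INTEGER PRIMARY KEY,
    borrower_id INTEGER NOT NULL REFERENCES customer(borrower_id),
    bank_id INTEGER NOT NULL REFERENCES bank(bank_id),
    real_estate_id INTEGER REFERENCES real_estate(real_estate_id),  -- optional guarantee
    amount REAL NOT NULL,
    duration_months INTEGER NOT NULL,
    interest_rate REAL,
    issue_date TEXT,
    purpose TEXT,
    is_default INTEGER NOT NULL CHECK (is_default IN (0, 1))
);
"""


def build_database():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = build_database()
conn.execute("INSERT INTO bank VALUES (1, 'Example Bank')")
conn.execute("INSERT INTO customer VALUES (1, 'F', 'Degree', 1985, 32000.0, 710)")
# A loan without real estate: the guarantee column is simply left NULL.
conn.execute(
    "INSERT INTO loan VALUES (1, 1, 1, NULL, 15000.0, 60, 4.5, '2021-03-01', 'car', 0)"
)
unsecured = conn.execute(
    "SELECT COUNT(*) FROM loan WHERE real_estate_id IS NULL"
).fetchone()[0]
```
]]></preformat>
      <p>Keeping the loan as the reference entity, with foreign keys to customer and bank, mirrors the choice described in the text; further attributes can be added as columns without changing the relationships.</p>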
    </sec>
    <sec id="sec-4">
      <title>4. Data set validation: the Data Quality Index (DQI)</title>
      <p>
        This section addresses the dataset quality for the IDEA approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We define an assessment
parameter (the Data Quality Index), characterized by the following weighted metrics (each weight
indicates the importance we attribute to its impact on data quality):
● Accuracy (20%)
● Completeness (20%)
● Consistency (20%)
● Uniqueness (20%)
● Validity (10%)
● Integrity (5%)
● Timeliness (5%).
      </p>
      <p>These parameters can be estimated by analyzing dataset features. This analysis can be done using
Python tools. A dedicated tool, Pandas Profiling (a profiling package built on top of Pandas, the widely used Python
data management library), was used for the evaluation of these metrics. After analyzing the dataset
attributes, each metric is scored from 1 to 5. These scores are then combined into the
Data Quality Index, which is a weighted sum of the metric scores. Table 1 and Table 2 explain
the values from 1 to 5.</p>
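      <p>The weighted sum described above can be sketched as follows. The weights come from the list in this section; the normalization to a percentage (dividing by the maximum score of 5) and the example scores are assumptions made for illustration, since the text reports DQI values as percentages without stating the formula.</p>
      <preformat><![CDATA[
```python
# Metric weights in percent, as listed in this section (they sum to 100).
WEIGHTS = {
    "accuracy": 20,
    "completeness": 20,
    "consistency": 20,
    "uniqueness": 20,
    "validity": 10,
    "integrity": 5,
    "timeliness": 5,
}

MAX_SCORE = 5  # each metric is scored from 1 to 5


def data_quality_index(scores):
    """Return the DQI as a percentage, assuming normalization by the maximum score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= MAX_SCORE:
            raise ValueError(f"score for {name} must be between 1 and {MAX_SCORE}")
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / MAX_SCORE


# Hypothetical scores, not the values obtained in the paper.
example_scores = {
    "accuracy": 3,
    "completeness": 4,
    "consistency": 5,
    "uniqueness": 5,
    "validity": 3,
    "integrity": 5,
    "timeliness": 3,
}
example_dqi = data_quality_index(example_scores)  # 410 / 5 = 82.0
```
]]></preformat>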
      <p>[Table 1 and Table 2. Score levels: 1, 2, 3, 4, 5. Percentage levels: 0%, 40%, 60%, 80%, 100%. Measures: percentage of data with no misspellings; percentage of missing cells; correlation between attributes; percentage of duplications; percentage of empty database fields; whether the data is updated; amount of data that makes the dataset representative of reality.]</p>
      <p>The evaluation is based on these questions about datasets:
● Accuracy:
1. Are there any spelling errors in the data names?
2. Do data accurately represent the "real world" values they are supposed to detect?
● Completeness:</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Validation results</title>
      <p>The validation process is applied to four datasets. One loan dataset and one customer dataset were chosen
from Kaggle to be compared with two datasets created by us. The Kaggle datasets can be considered the
benchmark.</p>
      <p>Specifically, a loan dataset from Kaggle<sup>8</sup> is evaluated as the best one for quality. A dataset preview
is presented in Figure 2. The validation process is divided into the following steps:
● Step 1: Pandas Profiling was applied to the dataset. Figure 3 gives an overview of the Python
analysis of this dataset.
● Step 2: after this analysis, the metrics were calculated. Table 4 presents the results
for each metric that contributes to scoring the DQI questions.
● Step 3: finally, the DQI can be scored, based on the metrics.</p>
      <p>8 https://www.kaggle.com/datasets/yasserh/loan-default-dataset</p>
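      <p>Quantities used in Steps 1 and 2, such as the percentage of missing cells and the percentage of duplications, are reported by the profiling tool. As a rough, dependency-free illustration of what these two figures mean, the sketch below computes them in plain Python. It is a stand-in for the Pandas Profiling report, not a reproduction of it, and the sample rows and function names are invented.</p>
      <preformat><![CDATA[
```python
def missing_cell_percentage(rows, columns):
    """Percentage of cells that are empty (None or empty string)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    missing = sum(
        1 for row in rows for col in columns if row.get(col) in (None, "")
    )
    return 100.0 * missing / total


def duplicate_row_percentage(rows, columns):
    """Percentage of rows that repeat an earlier row on all the given columns."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)


# Invented sample: one missing cell out of twelve, one repeated row out of four.
columns = ["loan_id", "amount", "purpose"]
rows = [
    {"loan_id": 1, "amount": 12000, "purpose": "car"},
    {"loan_id": 2, "amount": None, "purpose": "home"},
    {"loan_id": 3, "amount": 8000, "purpose": "study"},
    {"loan_id": 3, "amount": 8000, "purpose": "study"},
]
missing_pct = missing_cell_percentage(rows, columns)
duplicate_pct = duplicate_row_percentage(rows, columns)
```
]]></preformat>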
      <p>According to the Data Quality Index, the best dataset is the loan dataset from Kaggle (91%).
Its metrics are very positive because the dataset has the greatest number of observations, so it is
more complete and valid, and it has no null or duplicate values, which guarantees its uniqueness.</p>
      <p>The second dataset, the customer one from Kaggle<sup>9</sup>, scored a DQI of 85%, lower than the first because of
the presence of about 6% of null values and a lower total number of values. The first two datasets
have a better Timeliness because they are public and as such more current and updated than the
others.</p>
      <p>Finally, these results can be compared with the two datasets (loan and customer) that we created.
As we expected, our datasets are smaller and less updated than the other ones, but they are more correct.
They have a DQI of 78% and 80%: a high quality, slightly lower than the previous ones because the
datasets contain a limited number of observations (1000) and therefore have a lower level
of completeness. The datasets are valid, with no null or duplicate values, which ensures greater uniqueness.
IDEA gives us several opportunities because it minimizes data cleaning and normalization problems,
but the DQI comparison also shows the limits of our model.</p>
      <p>The lower quality of our loan dataset is due mainly to accuracy, which is about 60%.
Moreover, validity is 60%, since the amount of data that makes the dataset representative of reality is
1000 records, and Timeliness is 50%, due to a slow data update frequency.</p>
      <p>Our model could be improved by generating a higher number of records, which should be updated
every year. All the operations were performed by hand, and this causes the limits of the
model. Our aim is to make the model more efficient thanks to Entity resolution tools, which can automate
manual operations, reducing manual errors and optimizing working time.</p>
      <p>9 https://www.kaggle.com/</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this paper, we presented IDEA, an approach for creating realistic datasets for Artificial Intelligence.
The approach aims at addressing business needs and user requests. Its main feature
is an analysis that leads to a model applicable to the context of interest.</p>
      <p>By examining the literature, we found that the main method applied was SEMMA, and we designed
IDEA as an extension of the SEMMA method. IDEA can be useful in any context; the use case chosen
is creditworthiness in banking. The paper explained the use case and its validation. Specifically, this
model can be used by any bank or financial institution to forecast the solvency of a customer portfolio.
As explained, our purpose was to develop a user-centered approach to understand business requests. A
business needs an efficient strategy, which can be improved thanks to IDEA because our approach gives
a representation of a business reality from two points of view: customers and businesses.</p>
    </sec>
    <sec id="sec-7">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>Talha</given-names>
          </string-name>
          <string-name>
            <surname>Mahboob</surname>
          </string-name>
          , Kamran Shaukat, Ibrahim A.
          <string-name>
            <surname>Hameed</surname>
            , Suhuai Luo, Muhammad Umer Sarwar, Shakir Shabbir,
            <given-names>Jiaming</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>and Matloob</given-names>
          </string-name>
          <string-name>
            <surname>Khushi</surname>
          </string-name>
          . “
          <article-title>An Investigation of Credit Card Default Prediction in the Imbalanced Datasets.” IEEE Access 8 (</article-title>
          <year>2020</year>
          ):
          <fpage>201173</fpage>
          -
          <lpage>98</lpage>
          . https://doi.org/10.1109/ACCESS.2020.3033784.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Azevedo</surname>
          </string-name>
          , Ana, and Manuel Filipe Santos. “KDD,
          <article-title>Semma and CRISP-DM: A Parallel Overview</article-title>
          .”
          <source>MCCSIS'08 - IADIS Multi Conference on Computer Science and Information Systems; Proceedings of Informatics 2008 and Data Mining</source>
          <year>2008</year>
          , no.
          <source>January</source>
          <year>2008</year>
          (
          <year>2008</year>
          ):
          <fpage>182</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Madaan</surname>
          </string-name>
          , Mehul, Aniket Kumar, Chirag Keshri, Rachna Jain, and Preeti Nagrath. “
          <article-title>Loan Default Prediction Using Decision Trees and Random Forest: A Comparative Study</article-title>
          .
          <source>” IOP Conference Series: Materials Science and Engineering</source>
          <volume>1022</volume>
          , no.
          <issue>1</issue>
          (
          <year>2021</year>
          ). https://doi.org/10.1088/1757-899X/1022/1/012042.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Sheikh</surname>
            ,
            <given-names>Mohammad</given-names>
          </string-name>
          <string-name>
            <surname>Ahmad</surname>
          </string-name>
          , Amit Kumar Goel, and Tapas Kumar.
          <article-title>“An Approach for Prediction of Loan Approval Using Machine Learning Algorithm</article-title>
          .
          <source>” Proceedings of the International Conference on Electronics and Sustainable Communication Systems, ICESC</source>
          <year>2020</year>
          , no.
          <source>Icesc</source>
          (
          <year>2020</year>
          ):
          <fpage>490</fpage>
          -
          <lpage>94</lpage>
          . https://doi.org/10.1109/ICESC48915.2020.9155614.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Sivasree</surname>
            <given-names>M S</given-names>
          </string-name>
          , and Rekha Sunny T. “
          <article-title>Loan Credibility Prediction System Based on Decision Tree Algorithm</article-title>
          .”
          <source>International Journal of Engineering Research And V4</source>
          , no.
          <volume>09</volume>
          (
          <year>2015</year>
          ):
          <fpage>825</fpage>
          -
          <lpage>30</lpage>
          . https://doi.org/10.17577/ijertv4is090708.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Tariq</surname>
            ,
            <given-names>Hafiz</given-names>
          </string-name>
          <string-name>
            <surname>Ilyas</surname>
          </string-name>
          , Asim Sohail, Uzair Aslam, and Nowshath Kadhar Batcha. “
          <string-name>
            <surname>Loan Default Prediction Model Using Sample</surname>
          </string-name>
          , Explore, Modify, Model, and
          <string-name>
            <surname>Assess</surname>
          </string-name>
          (Semma).
          <source>” Journal of Computational and Theoretical Nanoscience</source>
          <volume>16</volume>
          , no.
          <issue>8</issue>
          (
          <year>2019</year>
          ):
          <fpage>3489</fpage>
          -
          <lpage>3503</lpage>
          . https://doi.org/10.1166/jctn.2019.8313.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tejaswini</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>T Mohana Kavya</surname>
          </string-name>
          ,
          <string-name>
            <surname>R Devi</surname>
          </string-name>
          , Naga Ramya,
          <string-name>
            <given-names>P Sai</given-names>
            <surname>Triveni</surname>
          </string-name>
          , and Venkata Rao Maddumala. “
          <source>ACCURATE LOAN APPROVAL PREDICTION BASED ON MACHINE LEARNING APPROACH” 11</source>
          (
          <year>2020</year>
          ). www.jespublication.com.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Turiel</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Aste</surname>
          </string-name>
          . “
          <article-title>Peer-to-Peer Loan Acceptance and Default Prediction with Artificial Intelligence: P2P Default Prediction with AI</article-title>
          .
          <source>” Royal Society Open Science</source>
          <volume>7</volume>
          , no.
          <issue>6</issue>
          (
          <year>2020</year>
          ). https://doi.org/10.1098/rsos.191649.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , Lin. “
          <source>A Study on Predicting Loan Default Based on the Random Forest Algorithm.” Procedia Computer Science</source>
          <volume>162</volume>
          , no.
          <source>Itqm</source>
          <year>2019</year>
          (
          <year>2020</year>
          ):
          <fpage>503</fpage>
          -
          <lpage>13</lpage>
          . https://doi.org/10.1016/j.procs.2019.12.017.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sidi</surname>
          </string-name>
          , Fatimah, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A.
          <string-name>
            <surname>Jabar</surname>
          </string-name>
          , Hamidah Ibrahim, and Aida Mustapha.
          <article-title>“Data Quality: A Survey of Data Quality Dimensions</article-title>
          .
          <source>” Proceedings - 2012 International Conference on Information Retrieval and Knowledge Management</source>
          , CAMP'
          <volume>12</volume>
          , no.
          <source>August</source>
          (
          <year>2012</year>
          ):
          <fpage>300</fpage>
          -
          <lpage>304</lpage>
          . https://doi.org/10.1109/InfRKM.2012.6204995.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>