A User-Centered Approach to Create Realistic Datasets for AI. Case Study: Creditworthiness in the Banking Sector
Francesca Zampino1, Antonella Longo1 and Marco Zappatore1
1 University of Salento, via Lecce-Monteroni, 73100 Lecce (LE), Italy

Abstract
Businesses are evolving as new digital tools improve the efficiency of their information systems, and decision-making and strategic processes can benefit from innovations such as Machine Learning. The main issue encountered in Artificial Intelligence applications is that data may be unavailable or unsuitable for the case study at hand. This paper proposes a solution to this problem by generating simulated data for AI. The case study is creditworthiness in the banking sector: loans are the main source of income for banks, as well as their main source of risk. Consequently, the evaluation of creditworthiness is a key activity both for banks and for customers. To address this need, we propose a solution that helps lenders evaluate credit applications and makes customers aware of behaviors that can reduce their credit score. The approach proposed in this paper, named IDEA, aims at building realistic datasets for Artificial Intelligence that meet specific business needs and respect users' requests. We analyze the current literature and methods for the evolution of conceptual models based on pre-existing datasets; the proposed approach draws from and extends that literature. The intended application is the banking sector, where the creditworthiness of customers who have entered into financial relationships must be assessed; the envisaged use case is forecasting the probability that borrowers default. The paper defines the approach as applied to specific financial datasets for this use case. Moreover, the datasets are validated through a dedicated Data Quality Index before IDEA is applied to predict credit solvency.

Keywords
Artificial Intelligence, realistic datasets, user-centered approach, prediction, creditworthiness

ITADATA2022: The 1st Italian Conference on Big Data and Data Science, September 20-21, 2022, Milan, Italy
EMAIL: francesca.zampino@studenti.unisalento.it (A. 1); antonella.longo@unisalento.it (A. 2); marcosalvatore.zappatore@unisalento.it (A. 3)
ORCID: 0000-0002-6902-0160 (A. 2); 0000-0002-8277-9390 (A. 3)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction
Loans are a core business for the banking sector, as well as the main source of financial risk for banks. European data show that loans are the most widely used financing instrument for small and medium-sized enterprises. The situation in which an asset exposes the lender to high risk because a borrower may be unable to return the loan within the agreed time is called "credit risk" [3]. A borrower's creditworthiness is summarized by a numerical value, the "credit score". In general, this value helps lenders estimate the likelihood that a borrower will repay the loan within the designated time. Creditworthiness is the ability of a debtor to repay its debts at maturity, assessed on the basis of credit or payment history. Recently, researchers and banks have trained classifiers based on various machine learning and deep learning algorithms to automatically predict an applicant's credit score from their credit history and other historical data [3]. For example, the future credit score or the probability of default can be estimated before a loan is issued.
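As a purely illustrative sketch of this idea (not the pipeline used in this paper), a probability of default can be obtained from a binary classifier trained on historical loan records; the file name and the column names (income, loan_amount, credit_history_length, default) below are assumptions:

# Illustrative only: estimating a probability of default with a classifier.
# File and column names are assumed, not taken from the datasets of this study.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.read_csv("historical_loans.csv")   # hypothetical historical data
X = loans[["income", "loan_amount", "credit_history_length"]]
y = loans["default"]                          # 1 = defaulted, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of default for a new applicant (values are illustrative).
applicant = pd.DataFrame([{"income": 32000, "loan_amount": 15000, "credit_history_length": 6}])
pd_estimate = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {pd_estimate:.2%}")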
To reach our goals, the process is divided into the following steps:
● First, we searched for and selected scientific papers targeting the same goal as this study: this allowed us to identify the financial variables of interest and the corresponding datasets. Understanding the elements that identify the context is useful for the implementation of a consistent database model. The analysis starts from the choice of models and related variables considered useful to describe the banking context. In general, the literature is characterized by datasets that identify the loan as the reference entity, with attention to the credit history of the applicants.
● Secondly, it is important to verify the applicability and value of the model when applied to different banking cases.
● Finally, the model must be validated to evaluate the quality of the datasets.
The paper is organized as follows: Section 2 presents the literature analysis we performed to ground the proposed approach, which is described in detail in Section 3. Section 4 discusses the dataset validation, based on a dedicated Data Quality Index, along with the achieved results. Conclusions are drawn in Section 5.

2. State of art
This section presents an analysis of the scientific papers chosen as the reference base on which the proposed IDEA approach is grounded. The analysis was carried out to evaluate the models most widely adopted in traditional and innovative banking contexts; the literature review is therefore the starting point for our research. We focused on two main topics:
● Commercial banking
● Peer-to-Peer lending
The first category of identified papers deals with traditional bank loans [1][2][3][4]. The second category refers to a form of financial innovation, Peer-to-Peer lending, i.e., loans between individuals granted without traditional financial intermediation [5][6][7]. The analyzed papers study the banking sector and apply SEMMA as the data mining design model. For our study, SEMMA is more suitable than the alternative model, CRISP-DM, because SEMMA pays closer attention to the user requests our study addresses. SEMMA [8] is the multi-stage method applied by the analyzed papers (a minimal, purely illustrative sketch of such a pipeline is given after this list):
1. SAMPLE: the goal is to identify a representative model of the population. Collecting data from the whole population is a very difficult task, so SEMMA offers the opportunity to use a sample of the population data for the development of the model.
2. EXPLORE: the next step of the SEMMA methodology is data review.
3. MODIFY: the main tasks related to data modification are the conversion of data types and the management of missing values.
4. MODEL: in this step, several algorithms and mining techniques are applied to develop the proposed model. The purpose is to identify hidden and meaningful information in the pre-processed dataset. Among the algorithms used are Decision Trees, Logistic Regression and Neural Networks.
5. ASSESS: once the model has been implemented and validated with all the proposed techniques, the test data is fed into each model for loan approval prediction.
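As a hedged illustration only, the following Python sketch maps the five SEMMA stages onto a hypothetical loan table; the file name loans.csv and the columns loan_amount, income and default are assumptions, not the data of the reviewed papers:

# Illustrative SEMMA-style pipeline; file and column names are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# SAMPLE: draw a representative subset instead of the whole population.
loans = pd.read_csv("loans.csv").sample(frac=0.2, random_state=42)

# EXPLORE: review distributions and missing values.
print(loans.describe(include="all"))
print(loans.isna().mean())

# MODIFY: convert data types and handle missing values.
loans["loan_amount"] = pd.to_numeric(loans["loan_amount"], errors="coerce")
loans["income"] = loans["income"].fillna(loans["income"].median())
loans = loans.dropna(subset=["loan_amount", "default"])

# MODEL: fit a classifier (here a decision tree) on a training split.
X = loans[["loan_amount", "income"]]
y = loans["default"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# ASSESS: evaluate the fitted model on the held-out test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))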
The SEMMA methodology is widely used in the literature and can be compared with two other frameworks for machine learning projects: CRISP-DM and KDD [9]. SEMMA and CRISP-DM are both evolutions of KDD (1996). The CRISP-DM standard, first published in 1999 in Brussels, is composed of 6 main stages:
1. Business Understanding: understanding the business problem.
2. Data Understanding: understanding the data is fundamental to grasp how data and analysis can solve the problem identified in the previous phase.
3. Data Preparation: the data cleaning and review phase.
4. Modeling: the choice of the algorithm suitable for the use case.
5. Evaluation: an evaluation of the outputs; part of the data (the test set) can be used to compare the model results with the real ones.
6. Deployment: the model goes into production, that is, it is used at scale.
The SEMMA method allows the application domain to be developed around end-user goals; in the SEMMA Sample phase, data cannot be sampled unless all business needs are understood. The approach developed in this study stems from so-called "business goals", i.e., business purposes defined upstream as the reference point for the model to be developed; for the reasons explained above, this perspective is better represented by the SEMMA method. The aim is to create a model that suits users' needs. During this study, the data samples were selected from the Kaggle repository.

3. The proposed approach: IDEA
In this Section we discuss IDEA (realIstic DatasEt for Artificial intelligence), a systematic modeling approach designed to respond to business needs so that the expectations of target users can be addressed properly. IDEA is an extension of the SEMMA model discussed in Section 2. We identify the requirements of companies and users in order to carry out an in-depth search of existing data repositories. On the one hand, we evaluate different data sources to derive an optimal combination of variables, since our purpose is to maximize strategic business benefits. On the other hand, the research and choice of datasets is followed by the development of a conceptual model as a graphical representation of the context. A conceptual model makes explicit the main entities and the relationships between them, and it is populated by considering datasets suitable to represent the analysis scenario. IDEA aims at modeling the process of borrowing activities, basically in the form of a loan, whose borrowers are individuals aged 25 and over from Northern, Central and Southern Italian regions. In order to apply the proposed approach, an open dataset for commercial banking was selected, because granting loans to private individuals is the core business of a commercial bank. The dataset choice is motivated by the purpose of providing a basic model for banking institutions that can also be turned into a more complex one, such as a P2P lending model; at the same time, it remains useful for small traditional banks. Our aim is to demonstrate the applicability of the proposed IDEA approach to all banking realities, from small and medium-sized enterprises to companies with high turnover. This means that the model can be used for traditional and innovative banks: IDEA is presented as a standard model with traditional features, but it can be extended with innovative variables, for example those of P2P lending.
This model does not focus on the number of variables; rather, it is characterized by functional attributes that identify entities and context. The limited number of attributes is due to the choice of creating a model for small banks, which can then be extended with further variables to become a large-bank model: the process can be applied from micro to macro realities. The model is provided as a general application guideline that can be adapted by the banking institution that decides to apply it. The underlying entities of any financial relationship are customer, loan, and bank. Moreover, Figure 1 shows a real estate entity, linked to the loan through a non-compulsory relationship: the asset can act as a guarantee for the financial relationship. Although the model represents financial deals in general, we chose to specify the optional loan guarantee to support the assessment of the creditors' solvency and of the Probability of Default. Our dataset differs from the public ones because it does not suffer from the operational problems of data normalization, such as null values or duplicated data. Our model also makes it possible to identify the relationships between customer, loan and bank. IDEA defines each key attribute, whereas for Kaggle datasets we have to integrate the key data by generating key attributes randomly. The use case of the model is forecasting the evolution of a credit portfolio in terms of its financial reliability; IDEA can therefore be critical to define a bank's strategy.
The research on relevant banking datasets also involved an evaluation of the online sources and data repositories currently available. Among these, Kaggle2, IEEE DataPort3, World Bank4, World Economic Forum5, Towards Data Science6 and Data.World7 were considered.
2 https://www.kaggle.com/
3 https://ieee-dataport.org/
4 https://www.worldbank.org/en/home
5 https://www.weforum.org/
6 https://towardsdatascience.com/
7 https://data.world/
These datasets are open, but an initial verification and selection of the attributes is needed, so that they are understandable and, at the same time, consistent with the model. After careful analysis, Kaggle was considered the most consistent repository, thanks to the availability of all the variables. As explained previously, IDEA focuses on a limited number of attributes that identify a small, medium, or large bank. The three main variables we considered are loans, customers and guarantees, all of which were found in the Kaggle datasets deemed suitable for the model; this source is thus the best one to represent a development from micro to macro realities.
The proposed methodology aims at identifying the main entities (clients, loans, real estate and guarantees) and the corresponding attributes, so as to develop a suitable Entity-Relationship conceptual model and then build a physical relational database (a minimal, illustrative relational sketch is given after the attribute list below). It is important to define the loan as the reference entity, identified by specific attributes and properly related to other entities such as the borrower (client). The literature analysis allowed us to identify the main IDEA attributes. The elements recurring in most of the articles are the following:
● Id - loan
● Id - borrower
● Sex
● Personal data
● Education
● Income of the borrower (main borrower)
● Income of the second debtor
● Amount of financing
● Duration of financing in months
● Credit history
● State of financing
● Interest rate
● Spread on interest rate
● Installment
● Date of issue of the loan
● Purpose
● Default of the loan (1 = defaulting borrower; 0 = fulfilling borrower)
● Card code (YES / NO)
● Credit score
● Year of birth
● Level of credit
● Age
● Up front charges
In addition, the following variables are related to the possibility that a loan is secured by real estate:
● Type of warranty
● Type of building
● Amount of property evaluation
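As a minimal sketch only, assuming illustrative table and column names that cover a small subset of the attributes above, the conceptual model of Figure 1 could be translated into a physical relational schema as follows:

# Minimal illustrative relational schema for the IDEA entities (names are assumed).
import sqlite3

conn = sqlite3.connect("idea_bank.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS bank (
    bank_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS customer (
    customer_id   INTEGER PRIMARY KEY,
    sex           TEXT,
    year_of_birth INTEGER,
    education     TEXT,
    income        REAL,
    credit_score  INTEGER
);
CREATE TABLE IF NOT EXISTS loan (
    loan_id         INTEGER PRIMARY KEY,
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    bank_id         INTEGER NOT NULL REFERENCES bank(bank_id),
    amount          REAL,
    duration_months INTEGER,
    interest_rate   REAL,
    purpose         TEXT,
    is_default      INTEGER CHECK (is_default IN (0, 1))
);
-- Optional guarantee: a loan may or may not be secured by a real estate asset.
CREATE TABLE IF NOT EXISTS real_estate (
    property_id   INTEGER PRIMARY KEY,
    loan_id       INTEGER REFERENCES loan(loan_id),
    building_type TEXT,
    valuation     REAL
);
""")
conn.commit()
conn.close()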
These elements were used to build a database, whose conceptual model is partially shown in Figure 1. Since the model was created to estimate solvency, we proceeded with its preliminary validation before applying it to a case study. The validation of IDEA is discussed in the next Section.
Figure 1: Entity-Relationship diagram

4. Data set validation: the Data Quality Index (DQI)
This section addresses the quality of the datasets used in the IDEA approach [10]. We define an assessment parameter, the Data Quality Index (DQI), as a weighted combination of the following metrics (each weight indicates the importance we attribute to its impact on data quality):
● Accuracy (20%)
● Completeness (20%)
● Consistency (20%)
● Uniqueness (20%)
● Validity (10%)
● Integrity (5%)
● Timeliness (5%)
These metrics can be estimated by analyzing dataset features with Python tools. In particular, the pandas-profiling package, which generates exploratory reports on top of pandas (the widely used Python data management library), was used to support their evaluation. After analyzing the dataset attributes, each metric is scored from 1 to 5 on two questions; these scores then contribute to the Data Quality Index, which is computed as the weighted sum of the metric values. Table 1 and Table 2 explain the meaning of the scores from 1 to 5 for the first and second question of each metric, respectively.

Table 1: Metric values (first question of each metric); each score from 1 to 5 corresponds to the listed threshold.
Accuracy, 1. Percentage of data with no misspellings: 1 = 0%, 2 = 40%, 3 = 60%, 4 = 80%, 5 = 100%
Completeness, 1. Percentage of missing cells: 1 = >20%, 2 = >10%, 3 = >5%, 4 = >2.5%, 5 = 0%
Consistency, 1. Correlation between attributes: 1 = 0.2, 2 = 0.4, 3 = 0.6, 4 = 0.8, 5 = 1
Uniqueness, 1. Percentage of duplications: 1 = >20%, 2 = >10%, 3 = >5%, 4 = >2.5%, 5 = 0%
Validity, 1. Amount of data that makes the dataset representative of reality: 1 = 250, 2 = 500, 3 = 1000, 4 = >100000, 5 = >200000
Integrity, 1. Percentage of empty database fields: 1 = >20%, 2 = >10%, 3 = >5%, 4 = >2.5%, 5 = 0%
Timeliness, 1. Is data updated?: 1 = <1990, 2 = >1990, 3 = >2000, 4 = >2010, 5 = >2020

Table 2: Metric values (second question of each metric); each score from 1 to 5 corresponds to the listed threshold.
Accuracy, 2. Source reliability: 1 = private source, 2 = chargeable source, 3 = auto-realization source, 4 = public-private source, 5 = public source
Completeness, 2. Percentage of missing values for each field: 1 = >20%, 2 = >10%, 3 = >5%, 4 = >2.5%, 5 = 0%
Consistency, 2. Correlation between fields: 1 = 0.2, 2 = 0.4, 3 = 0.6, 4 = 0.8, 5 = 1
Uniqueness, 2. Percentage of duplications for each field: 1 = >20%, 2 = >10%, 3 = >5%, 4 = >2.5%, 5 = 0%
Validity, 2. Amount of data that makes the dataset reliable: 1 = 250, 2 = 500, 3 = 1000, 4 = >100000, 5 = >200000
Integrity, 2. Percentage of correct values: 1 = 100%, 2 = >80%, 3 = >60%, 4 = >40%, 5 = 0%
Timeliness, 2. Data update frequency: 1 = 0, 2 = 20 years, 3 = 10 years, 4 = 5 years, 5 = <5 years

The evaluation is based on the following questions about the datasets:
● Accuracy: 1. Are there any spelling errors in the data names? 2. Do the data accurately represent the "real world" values they are supposed to capture?
● Completeness: 1. Are there null values over the entire dataset? 2. Are there null values per field?
● Consistency: 1. Are data presented in a similar or compatible format? 2. Do distinct occurrences of the same data instances provide conflicting information, or are the data equivalent?
● Uniqueness: 1. Are data duplicated, or are they unique within a field? 2. Are data duplicated by mistake, or are they unique within the dataset?
● Validity: 1. Do the data correctly represent reality? 2. Are the data reliable?
● Integrity: 1. Does the dataset measure existence, validity, structure and content for the model? 2. Are the data correct?
● Timeliness: 1. Are the data updated? 2. Do the data change with a high frequency?
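As a minimal sketch of how the raw indicators behind these questions can be extracted in practice (the file name and the use of pandas-profiling are assumptions about tooling, not the exact scripts used in this study):

# Illustrative extraction of raw quality indicators with pandas and pandas-profiling.
import pandas as pd
from pandas_profiling import ProfileReport   # pip install pandas-profiling

df = pd.read_csv("loan_dataset.csv")         # hypothetical dataset under evaluation

# Full exploratory report (statistics, missing values, correlations, duplicates).
ProfileReport(df, title="Loan dataset profile").to_file("loan_profile.html")

# Raw indicators behind some of the DQI questions.
missing_cells_pct = df.isna().mean().mean() * 100    # Completeness, question 1
missing_per_field = df.isna().mean() * 100           # Completeness, question 2
duplicate_rows_pct = df.duplicated().mean() * 100    # Uniqueness, question 2
n_records = len(df)                                  # Validity (dataset size)
numeric_corr = df.select_dtypes(include="number").corr()  # Consistency (correlations)

print(f"Missing cells: {missing_cells_pct:.2f}%  Duplicates: {duplicate_rows_pct:.2f}%  Records: {n_records}")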
4.1 Validation results
The validation process was applied to four datasets: a loan dataset and a customer dataset chosen from Kaggle, compared with two datasets created by us. The Kaggle datasets can be considered the benchmarks. Specifically, a loan dataset from Kaggle8 was evaluated as the best one in terms of quality; a preview of this dataset is presented in Figure 2.
8 https://www.kaggle.com/datasets/yasserh/loan-default-dataset
The validation process is divided into the following steps:
● Step 1: Pandas Profiling was applied to the dataset; Figure 3 gives an overview of the resulting Python analysis.
● Step 2: after this analysis, the metrics were calculated. Tables 3 and 4 present, for each metric, the results that contribute to the DQI score.
● Step 3: finally, the DQI is scored on the basis of the metrics.
Figure 2: Loan dataset overview
Figure 3: Dataset statistics (Kaggle dataset)

Table 3: Metric results (Kaggle loan dataset, benchmark). Each row reports the overall metric percentage and, for each of the two questions, the score (1-5) with its equivalent percentage.
Accuracy (90%): question 1 = 4 (80%); question 2 = 5 (100%)
Completeness (90%): question 1 = 5 (100%); question 2 = 4 (80%)
Consistency (90%): question 1 = 5 (100%); question 2 = 4 (80%)
Uniqueness (100%): question 1 = 5 (100%); question 2 = 5 (100%)
Validity (80%): question 1 = 4 (80%); question 2 = 4 (80%)
Integrity (70%): question 1 = 3 (60%); question 2 = 4 (80%)
Timeliness (80%): question 1 = 4 (80%); question 2 = 4 (80%)

Figure 4: Data Quality Index (benchmark dataset)

Table 4: Metric results (our loan dataset). Same layout as Table 3.
Accuracy (60%): question 1 = 3 (60%); question 2 = 3 (60%)
Completeness (90%): question 1 = 5 (100%); question 2 = 4 (80%)
Consistency (80%): question 1 = 4 (80%); question 2 = 4 (80%)
Uniqueness (100%): question 1 = 5 (100%); question 2 = 5 (100%)
Validity (60%): question 1 = 3 (60%); question 2 = 3 (60%)
Integrity (70%): question 1 = 3 (60%); question 2 = 4 (80%)
Timeliness (50%): question 1 = 3 (60%); question 2 = 2 (40%)

Figure 5: Data Quality Index (our dataset)

From the analysis of the Data Quality Index, the best dataset is the loan one from Kaggle (91%). Its metrics are very positive because the dataset has the greatest number of observations and is therefore more complete and valid; it has no null or duplicate values, which guarantees its uniqueness. The second dataset, the customer one from Kaggle9, scored a DQI of 85%, lower than the first, due to the presence of about 6% null values and a smaller total number of values. These first two datasets also show a better Timeliness because they are public and, as such, more current and updated than the others.
9 https://www.kaggle.com/
Finally, these results can be compared with the two datasets (loan and customer) we created ourselves. As expected, our datasets are smaller and less up to date than the others, but they are more correct. They reach DQIs of 78% and 80%, a high quality only slightly lower than the previous ones, because the datasets contain a limited number of observations (1000) and therefore have a lower level of completeness. They are valid, with no null or duplicate values, ensuring greater uniqueness. IDEA offers several advantages because it minimizes data cleaning and normalization problems, but the DQI comparison also shows the limits of our model. As the tables show, the lower quality of our loan dataset is mainly due to accuracy, at about 60%; validity also scores 60%, since the amount of data that makes the dataset representative of reality is only 1000 records, and timeliness scores 50% because of a slow data-update frequency.
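As a minimal sketch of the aggregation step, assuming the weights defined in Section 4 and the per-metric percentages of Table 4, the DQI can be computed as a weighted sum:

# Minimal sketch: DQI as a weighted sum of metric percentages (weights from Section 4).
weights = {
    "accuracy": 0.20, "completeness": 0.20, "consistency": 0.20,
    "uniqueness": 0.20, "validity": 0.10, "integrity": 0.05, "timeliness": 0.05,
}

# Metric percentages for our loan dataset (Table 4); each is the mean of its two
# question scores converted from the 1-5 scale (score * 20%).
metrics = {
    "accuracy": 60, "completeness": 90, "consistency": 80,
    "uniqueness": 100, "validity": 60, "integrity": 70, "timeliness": 50,
}

dqi = sum(weights[m] * metrics[m] for m in weights)
print(f"DQI = {dqi:.0f}%")   # 78%, matching the value reported above for our loan dataset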
Our model could be improved by generating a larger number of records and updating them every year. All the operations described here were carried out by hand, and this causes some of the model's limits; our aim is to make the model more efficient through entity resolution tools, which can automate manual operations, reduce manual errors and optimize working time.

5. Conclusion
In this paper, an approach to the realization of realistic datasets for Artificial Intelligence, named IDEA, was presented. The approach is aimed at meeting business needs and user requests. Its main feature is an analysis that produces a model applicable to the target context. By examining the literature, we found that the main method applied is SEMMA, and we designed IDEA as an extension of the SEMMA method. IDEA can be useful in many business contexts; the use case chosen here is banking creditworthiness. The paper explained the use case and its validation. Specifically, the model can be used by any bank or financial institution to forecast the solvency of a customer portfolio. As explained, our purpose was to develop a user-centered approach that captures business requests: a business needs an efficient strategy, and IDEA can improve it because our approach represents the business reality from two points of view, that of the customers and that of the business.

6. References
[1] Alam, Talha Mahboob, Kamran Shaukat, Ibrahim A. Hameed, Suhuai Luo, Muhammad Umer Sarwar, Shakir Shabbir, Jiaming Li, and Matloob Khushi. "An Investigation of Credit Card Default Prediction in the Imbalanced Datasets." IEEE Access 8 (2020): 201173–98. https://doi.org/10.1109/ACCESS.2020.3033784.
[2] Azevedo, Ana, and Manuel Filipe Santos. "KDD, SEMMA and CRISP-DM: A Parallel Overview." MCCSIS'08 - IADIS Multi Conference on Computer Science and Information Systems; Proceedings of Informatics 2008 and Data Mining 2008 (2008): 182–85.
[3] Madaan, Mehul, Aniket Kumar, Chirag Keshri, Rachna Jain, and Preeti Nagrath. "Loan Default Prediction Using Decision Trees and Random Forest: A Comparative Study." IOP Conference Series: Materials Science and Engineering 1022, no. 1 (2021). https://doi.org/10.1088/1757-899X/1022/1/012042.
[4] Sheikh, Mohammad Ahmad, Amit Kumar Goel, and Tapas Kumar. "An Approach for Prediction of Loan Approval Using Machine Learning Algorithm." Proceedings of the International Conference on Electronics and Sustainable Communication Systems, ICESC 2020 (2020): 490–94. https://doi.org/10.1109/ICESC48915.2020.9155614.
[5] Sivasree M S, and Rekha Sunny T. "Loan Credibility Prediction System Based on Decision Tree Algorithm." International Journal of Engineering Research & Technology 4, no. 09 (2015): 825–30. https://doi.org/10.17577/ijertv4is090708.
[6] Tariq, Hafiz Ilyas, Asim Sohail, Uzair Aslam, and Nowshath Kadhar Batcha. "Loan Default Prediction Model Using Sample, Explore, Modify, Model, and Assess (SEMMA)." Journal of Computational and Theoretical Nanoscience 16, no. 8 (2019): 3489–3503. https://doi.org/10.1166/jctn.2019.8313.
[7] Tejaswini, J, T Mohana Kavya, R Devi, Naga Ramya, P Sai Triveni, and Venkata Rao Maddumala. "Accurate Loan Approval Prediction Based on Machine Learning Approach" 11 (2020). www.jespublication.com.
[8] Turiel, J. D., and T. Aste. "Peer-to-Peer Loan Acceptance and Default Prediction with Artificial Intelligence: P2P Default Prediction with AI." Royal Society Open Science 7, no. 6 (2020). https://doi.org/10.1098/rsos.191649.
[9] Zhu, Lin. "A Study on Predicting Loan Default Based on the Random Forest Algorithm." Procedia Computer Science 162 (2020): 503–13. https://doi.org/10.1016/j.procs.2019.12.017.
[10] Sidi, Fatimah, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A. Jabar, Hamidah Ibrahim, and Aida Mustapha. "Data Quality: A Survey of Data Quality Dimensions." Proceedings - 2012 International Conference on Information Retrieval and Knowledge Management, CAMP'12 (2012): 300–304. https://doi.org/10.1109/InfRKM.2012.6204995.