Models of Fraud Detection and Analysis of Payment Transactions Using Machine Learning

Viktor Shpyrko[0000-0003-4955-2685] and Bohdan Koval

Taras Shevchenko National University of Kyiv, 60, Volodymyrska Str., Kyiv, 01033, Ukraine
viksh@bigmir.net, thekovalbohdan@gmail.com

Abstract. The aim of this work is to research a set of selected mathematical models and algorithms that examine the data of a single payment transaction in order to classify it as fraudulent or verified. The described models are implemented as computer code and can therefore be executed in real time. The main objective is to apply different machine learning methods and find the most accurate one, in other words, the one whose cross-validation score is maximal. Thus, the main problem to solve is the creation of a model that can instantly detect and block a fraudulent transaction in order to provide better security and user experience. First, we state the classification problem: which initial data we have and how we can interpret it to find a solution. The next part is dedicated to the methods for solving the classification problem. In particular, we describe such approaches as Logistic Regression, the Support Vector Machine (SVM), K-Nearest Neighbours, the Decision Tree Classifier and Artificial Neural Networks, and explain how these methods operate on the data and yield a result. Finally, we apply these methods to the provided data using the Python programming language and analyze the results.

Keywords: machine learning, classification problem, fraud, payment systems, neural networks.

1 The Notion of Classification Problem and its Characteristics

1.1 The Definition of Classification Problem

The classification problem is a formalized task which contains a set of objects (situations) divided in a certain way into classes. A finite set of objects is specified for which we know to which classes they belong. This set is called a sample. There is no information about the other objects, so we do not know to which classes they belong. The aim is to create an algorithm that will be able to classify an arbitrary object from the initial set. To classify an object means to indicate the number (or name) of the class to which this object belongs. The classification of an object is the class number or name issued by the classification algorithm as a result of applying it to this particular object [1].

In mathematical statistics, classification problems are also called problems of discriminant analysis. In machine learning, classification problems can be solved with the help of artificial neural network methods, particularly by staging an experiment in the form of supervised learning (training with a teacher).

Let X be a set of object descriptions and Y a finite set of numbers (or names) of classes. There is an unknown target dependence, the mapping (1), whose values are known only for the elements of a finite learning sample (2). The aim is to construct an algorithm (3) capable of classifying an arbitrary object x ∈ X [2].

y* : X → Y (1)

Xm = {(x1, y1), …, (xm, ym)} (2)

a : X → Y (3)

The probabilistic statement of the problem is more general. It assumes that the set of “object–class” pairs X × Y is a probability space with an unknown probability measure P. There is a finite learning sample of observations (2) generated according to the probability measure P. The aim is again to construct an algorithm (3) capable of classifying an arbitrary object x ∈ X.
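To make this formal statement concrete, the following minimal sketch (ours, not part of the paper's experiments) constructs a classifier a : X → Y from a finite learning sample in Python with scikit-learn; the synthetic data and the choice of logistic regression are placeholder assumptions.

```python
# Minimal sketch of the classification setting: learn a mapping a: X -> Y
# from a finite sample Xm = {(x1, y1), ..., (xm, ym)} and apply it to a new
# object. The data are synthetic; any classifier could stand in for "a".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # object descriptions (feature vectors)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # class labels, Y = {0, 1}

a = LogisticRegression().fit(X, y)         # construct the algorithm a: X -> Y
x_new = rng.normal(size=(1, 4))            # an arbitrary object x from X
print(a.predict(x_new))                    # the class assigned to x
```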
1.2 The Concept of Characteristics in the Tasks of Classification

A characteristic (feature) is a mapping (4), where Df is the set of permissible values of the characteristic.

f : X → Df (4)

If the characteristics f1, …, fn are given, then the vector (5) is called the characteristic description of the object x ∈ X.

x = (f1(x), …, fn(x)) (5)

Characteristic descriptions may be identified with the objects themselves. In this case, the set (6) is called the space of characteristics (feature space).

X = Df1 × … × Dfn (6)

Depending on the set Df, characteristics are divided into the following types:

─ Binary characteristics: Df = {0, 1};
─ Nominal characteristics: Df is a finite set;
─ Ordinal characteristics: Df is a finite ordered set;
─ Quantitative characteristics: Df is the set of real numbers.

Classification problems, in turn, are divided into the following types:

─ Two-class classification, which is technically the easiest case and serves as the basis for solving more complex tasks;
─ Multiclass classification. When the number of classes reaches thousands (for example, when recognizing hieroglyphs or continuous speech), the classification task becomes significantly more difficult;
─ Non-overlapping classes;
─ Overlapping classes. An object may belong to several classes at a time;
─ Fuzzy classes. It is necessary to determine the degree to which the object belongs to each of the classes, usually a real number from 0 to 1 [2].

In our case, we are interested in a binary target characteristic and a two-class classification; a short illustration of the characteristic types is given below.
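As an illustration of the characteristic types (ours; the transaction fields and values below are invented, not taken from the dataset used later), an object description x = (f1(x), …, fn(x)) could be encoded in Python as follows.

```python
# Hypothetical object description combining the four characteristic types.
# All field names and values are invented for illustration only.
from dataclasses import dataclass

@dataclass
class TransactionExample:
    is_foreign: int    # binary characteristic: Df = {0, 1}
    card_type: str     # nominal characteristic: finite unordered set
    risk_level: int    # ordinal characteristic: finite ordered set, e.g. 0 < 1 < 2
    amount: float      # quantitative characteristic: real numbers

x = TransactionExample(is_foreign=1, card_type="debit", risk_level=2, amount=149.90)
features = (x.is_foreign, x.card_type, x.risk_level, x.amount)   # description of x
print(features)
```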
1.3 Publications Dedicated to the Fraud Detection Problem

Bertrand Lebichot and Yann-Ael Le Borgne have researched the problem in the publication “Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection” [3]. They worked on the design of automatic Fraud Detection Systems (FDS) able to detect fraudulent transactions with high precision and to deal with the heterogeneous nature of fraudster behavior. Indeed, the nature of fraud behavior may strongly differ according to the payment system (e.g. e-commerce or shop terminal), the country and the population segment.

Another publication is “Improving Card Fraud Detection Through Suspicious Pattern Discovery” by Olivier Caelen and Evgueni N. Smirnov [4]. They proposed a new approach to detecting credit card fraud based on suspicious payment patterns. According to their hypothesis, fraudsters use stolen credit card data at specific, recurring sets of shops. They exploited this behavior to identify fraudulent transactions.

The problem was also addressed in the article “Calibrating Probability with Undersampling for Unbalanced Classification” by Andrea Dal Pozzolo, Olivier Caelen and Gianluca Bontempi [5]. In this paper, they study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. They formalize the problem of undersampling and explore the relationship between the conditional probability in the presence and in the absence of undersampling. They use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling.

2 Methods of Solving the Classification Problem

2.1 Regression Methods in Solving Classification Problems

Logistic regression is suitable for solving the classification problem. It is a statistical regression method used when the dependent variable is categorical, so that it can take only two values (or, more generally, a finite set of values) [6]. Let some set Y have only two values, usually denoted by the numbers 0 and 1. Let this value depend on a set of explanatory variables (7).

x = (1, x1, x2, …, xk) (7)

The dependence of Y on x1, x2, …, xk can be described by introducing an additional latent variable y*, defined in (8).

y* = θ0 + θ1x1 + ⋯ + θkxk + u (8)

Then the observed class is given by (9):

y = 0 if y* ≤ 0, and y = 1 if y* > 0 (9)

The next tool is the support vector method, a data analysis method for classification and regression that uses supervised learning models with associated learning algorithms called support vector machines (SVM). For a given set of training samples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new samples to one category or the other, making it a non-probabilistic binary linear classifier. The SVM model is a representation of the samples as points in space, mapped in such a way that the samples of the separate categories are divided by a gap that is as wide as possible. New samples are then mapped into the same space, and their predicted category is based on which side of the gap they fall. In addition to performing linear classification, the SVM can efficiently perform nonlinear classification by applying the so-called kernel trick, implicitly mapping its inputs into high-dimensional feature spaces.

Formally, the support vector machine builds a hyperplane, or a set of hyperplanes, in a space of high or infinite dimensionality that can be used for classification, regression and other tasks. Intuitively, good separation is achieved by the hyperplane that has the greatest distance to the nearest training data points of any of the classes (the so-called functional margin) [7].
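The following sketch (ours; synthetic data, default hyperparameters) shows how the two models of this subsection are usually constructed with scikit-learn: a logistic regression whose output is thresholded as in (9), and a support vector machine with a linear kernel.

```python
# Sketch: logistic regression and a support vector machine on synthetic data.
# The dataset and all parameters are placeholders, not the transaction data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

log_reg = LogisticRegression().fit(X_train, y_train)
# Decision rule (9): class 1 if theta0 + theta1*x1 + ... + thetak*xk > 0.
print("logistic regression accuracy:", log_reg.score(X_test, y_test))

svm = SVC(kernel="linear").fit(X_train, y_train)   # kernel="rbf" would use the kernel trick
print("SVM accuracy:", svm.score(X_test, y_test))
```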
2.2 Discrete Methods in Solving Classification Problems

The next method for solving classification problems uses a slightly different approach. The method of k-nearest neighbours is a simple nonparametric classification method in which the distances (usually Euclidean) from the object to all other objects in the feature space are computed and used for classification. The k objects with the smallest distances are selected. The basic principle of the nearest neighbours method is that the object is assigned to the class that is the most common among its neighbours. The neighbours are taken from a set of objects whose classes are already known, and, based on the key parameter of the method, the value k, it is determined which class is the most numerous among them. Each object has a finite number of attributes (dimensions), and it is assumed that there is a certain set of objects with an already known classification [7].

The next method for solving classification tasks is the decision tree, which is used in statistics and data analysis for predictive models. The tree structure contains “leaves” and “branches”. On the edges (“branches”) of the decision tree, the attributes on which the target function depends are written; in the “leaves” the values of the target function are written, while the other nodes contain the attributes that distinguish the cases. To classify a new case, we must go down the tree to a leaf and return the corresponding value. Such decision trees are widely used in intelligent data analysis. The goal is to create a model that predicts the value of the target variable based on multiple input variables [7].

Each leaf represents a value of the target variable, given the values of the input variables on the path from the root to that leaf. Each internal node corresponds to one of the input variables. A tree can also be “learned” by splitting the source set of examples into subsets based on tests of attribute values. This process is repeated on each of the resulting subsets. The recursion ends when all examples in a node’s subset have the same value of the target variable, so that further splitting adds nothing to the predictions. This top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most widespread strategy for learning decision trees from data, although not the only possible one. The decision trees used in Data Mining are of two main types:

─ Classification tree analysis, when the predicted result is the class to which the data belongs;
─ Regression tree analysis, when the predicted result can be considered a real number (e.g. the price of a house or the length of a patient’s stay in a hospital) [8].

In the context of the current task, we are interested in the first type of decision tree for solving classification issues.

2.3 Artificial Neural Networks in Solving Classification Problems

Artificial neural networks can also be used to solve classification problems. An artificial neural network is a network of simple elements called neurons that receive input, change their internal state (activation) according to this input, and produce an output that depends on the input and on the activation. The network is formed by connecting the outputs of certain neurons to the inputs of other neurons, forming a directed weighted graph. The weights, as well as the functions that compute the activation, can be changed by a process called learning, which is governed by a learning rule [7].

Components of the artificial neural network (a small numerical sketch of these components is given after the list):

1. Neurons. The neuron with label j, which receives input pj(t) from its predecessor neurons, consists of the following components:

─ An activation aj(t), which depends on the discrete time parameter;
─ A threshold θj (for a binary neuron), which remains unchanged unless it is modified by the learning function;
─ An activation function f, which computes the new activation at time t+1 from aj(t), θj and the network input pj(t), giving relation (10). The function is applied to all layers except the last one (where the output function is applied); each intermediate connection has its own activation function.

aj(t+1) = f(aj(t), pj(t), θj) (10)

─ An output function fout, which computes the output from the activation (11).

oj(t) = fout(aj(t)) (11)

The output function is often simply the identity function. An input neuron has no predecessors and serves as the input interface for the whole network. Similarly, an output neuron has no successors and serves as the output interface for the whole network.

2. Connections and weights. The network consists of connections, each of which transmits the output of a neuron i to the input of a neuron j. In other words, i is the predecessor (parent) of j, and j is the successor (child) of i. Each such connection is assigned a weight wij.

3. Propagation function. The propagation function computes the input pj(t) to neuron j from the outputs oi(t) of its predecessor neurons and typically has the form (12).

pj(t) = Σi oi(t) wij (12)

4. Learning rule. The learning rule is a rule or algorithm that changes the parameters of the neural network so that a given input to the network produces a suitable output. This learning process usually involves changing the weights and thresholds of the network [7].
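To make the notation of (10)–(12) concrete, here is a small numerical sketch (ours; the weights, inputs and the choice of a simple threshold activation are arbitrary) of one neuron j receiving input from three predecessor neurons.

```python
# Sketch of the components from section 2.3 for a single neuron j:
# propagation function (12), activation function (10) as a binary threshold
# unit, and the identity output function (11). All numbers are arbitrary.
import numpy as np

o_i = np.array([0.2, 0.7, 0.1])    # outputs o_i(t) of the predecessor neurons
w_ij = np.array([0.5, -1.0, 2.0])  # connection weights w_ij
theta_j = 0.1                      # threshold of neuron j

p_j = np.sum(o_i * w_ij)           # propagation function (12): p_j(t) = sum_i o_i(t) * w_ij

def f(p, theta):
    """Activation function (10): here a simple binary threshold unit."""
    return 1.0 if p > theta else 0.0

a_j = f(p_j, theta_j)              # new activation a_j(t+1)
o_j = a_j                          # output function (11): the identity
print(p_j, a_j, o_j)
```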
There are three main learning paradigms, each of which corresponds to a particular learning objective: supervised (guided) learning, unsupervised learning and reinforcement learning [7]. We are interested in the first paradigm, because it is the one used to solve classification problems. Supervised learning uses a set of example pairs (x, y), x ∈ X, y ∈ Y, and its purpose is to find a function (13) in a permitted class of functions that agrees with these examples.

f : X → Y (13)

In other words, we want to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data, and it implicitly contains a priori knowledge of the subject domain. The tasks that fit into the supervised learning paradigm are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (for example, handwriting, speech and gesture recognition). It can be seen as learning with a “teacher”, in the form of a function that provides continuous feedback on the quality of the solutions obtained so far.

3 Practical Example of the Transaction Analysis and Fraud Detection Using Machine Learning

3.1 Overview and Description of the Transaction Database

To investigate this problem and find a solution, a database [9] of a payment system with transactional accounts was obtained. The database reflects transactions executed within two days and contains 284,807 transactions in total, of which 492 are fraudulent (0.172%). The dataset was gathered by Worldline and ULB (Université Libre de Bruxelles) and prepared by them using various approaches: their private software algorithms, manual testing and customers’ feedback, which resulted in the merged dataset.

The database consists only of numerical data. For confidentiality, the fields of the database are anonymized. Because of this, it is not possible to say which real attribute a given field corresponds to, or to give a more precise description of the data from an economic point of view. All 28 parameters (V1, V2, ..., V28) were obtained using principal component analysis (PCA), a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.

The only two fields that have not been transformed are “time” and “amount”. The “time” value shows the number of seconds that passed between the given transaction and the first transaction. The “amount” field shows the amount of money that went through the transaction. All other fields have no marks or legend for security and privacy reasons; the bank decided not to share what exactly these fields represent, providing only their transformed numerical values.
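For illustration only (ours; a random matrix stands in for the confidential raw attributes, which are not published), the kind of transformation that produced the fields V1, …, V28 can be sketched with scikit-learn's PCA.

```python
# Sketch of a PCA transformation like the one applied to the raw transaction
# attributes. The input matrix is random, since the real raw attributes are
# confidential and only the transformed columns V1..V28 are published.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
raw = rng.normal(size=(1000, 28))     # placeholder for raw, possibly correlated attributes

pca = PCA(n_components=28)
components = pca.fit_transform(raw)   # linearly uncorrelated principal components

# The first component carries the largest possible variance, and each next one
# the largest variance subject to being orthogonal to the preceding components.
print(pca.explained_variance_ratio_[:5])
```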
The dataset is very unbalanced, since the target class, fraudulent transactions, makes up only 0.17% of all transactions (Figure 1). If we use it as is to construct models, we will probably get a lot of false classifications due to overfitting of the model: the resulting model will assume that a transaction is most likely a verified one, since almost all of the dataset consists of such transactions.

3.2 Initial Analysis of the Transaction Database

We need to create a balanced subset of the data with the same frequency of fraudulent and verified transactions, which will help the subsequent algorithms to show more accurate results. What will this subset look like? In our case, it will be a dataset with a 50/50 ratio of verified and fraudulent operations, i.e. the numbers of fraudulent and normal operations will be the same.

Fig. 1. Distribution of the initial transactions database by classes.

Why create a subset of the data? We found that the initial dataset is very unbalanced. Using it directly can create the following problems:

─ Overfitting. Since almost all records are verified, our model will tend to label almost every transaction as non-fraudulent.
─ Wrong correlations. Although we do not know what exactly each “V” field corresponds to, it is still useful to understand how each of them affects the target variable. With an unbalanced dataset, the correlation matrix will be blurred and shifted toward non-fraudulent transactions [8].

Before applying random sub-sampling to the training data, we must divide the initial dataset into a training set and a test set. Data balancing techniques (over-sampling or sub-sampling) should be applied only to the training set in order to build a model, while model testing should be done on the initial dataset. In the next step, we apply the technique of random sub-sampling, which removes entries belonging to the class whose count is larger. Thus, we achieve a 50/50 ratio by excluding verified transactions (Figure 2).

Correlation matrices are the basis for understanding the data. It is interesting for us to understand which variables significantly affect the classification of a transaction. Particularly indicative is the comparison of the matrices for the balanced and unbalanced datasets (Figure 3). Correlation matrix analysis:

─ negative correlation: V10, V12, V14, V17. The smaller the value of these variables, the more likely the transaction is fraudulent;
─ positive correlation: V2, V4, V11, V19. The larger the value of the variable, the more likely the operation is fraudulent [8].

Fig. 2. Histogram of equally distributed classes after sub-sampling.

Fig. 3. Correlation matrices of unbalanced (top) and balanced (bottom) data.
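A compact sketch of this preparation step (ours; the file name, random seeds and split ratio are assumptions, and the column name "Class" follows the public version of the dataset) could look as follows: split first, then sub-sample only the training part and inspect the correlations of the balanced subset.

```python
# Sketch: train/test split, random sub-sampling of the training set to a
# 50/50 ratio, and a correlation matrix of the balanced subset.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")      # assumed file; "Class" = 1 for fraud, 0 for verified
train, test = train_test_split(df, test_size=0.2, stratify=df["Class"], random_state=0)

fraud = train[train["Class"] == 1]
verified = train[train["Class"] == 0].sample(n=len(fraud), random_state=0)
balanced = pd.concat([fraud, verified]).sample(frac=1, random_state=0)   # shuffled 50/50 subset

sns.heatmap(balanced.corr(), cmap="coolwarm")   # correlation matrix of the balanced data
plt.show()
```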
3.3 Creation and Evaluation of the Fraud Detection Classifiers

Before we begin, we need to divide our data into training and test subsets. Of course, processing large volumes of data, deriving the results and, most importantly, doing so at high speed requires the use of computers. In practice, there are many tools and technologies for data processing, but the most popular are Python and R. Which language to use is entirely up to the user: the mathematical and statistical methods described above are implemented in both environments. In this work we will use Python [10], but the same techniques and methods are available in R. We will use the following libraries [11]:

─ Pandas – for easier data processing;
─ Matplotlib – for visualization;
─ NumPy and SciPy – for scientific calculations;
─ Seaborn – for visualization of statistical data;
─ Sklearn – machine learning library;
─ Tensorflow – machine learning library.

For each classifier, we build a model and find its accuracy [12]. Then we analyze and compare the learning curves for all four models (Figures 4–7):

Fig. 4. Logistic regression learning curve.

Fig. 5. Support vectors learning curve.

Fig. 6. K-nearest neighbours learning curve.

Fig. 7. Decision tree classifier learning curve.

Logistic regression showed the best accuracy with an estimate of 94%. This is a training result, obtained by assessing how precisely the model detects fraud in the training sample. For a more accurate result, we check the resulting models on the test sample (remember that this is still a balanced sample, so the result is still only indicative). As we see from the obtained results, the logistic regression method performed best, with 94% on the training sample and 93.52% on the test sample (the best result was chosen as the maximum arithmetic mean of the two indicators [13]). The k-nearest neighbours method and the support vector method also showed fairly precise results, and the support vector method even outperformed logistic regression on the test sample with 93.78%.

For a more detailed demonstration of the results, we output a confusion matrix [14] for the logistic regression method. In the upper left and lower right squares (yellow) the correct results are placed; in the other squares (black) the wrong results are placed. As we see from Figure 8, this method correctly detected 96 + 89 = 185 transactions. The other 8 transactions fell into the wrong groups, so they were not predicted correctly. Remember that the above results were obtained on the sub-sampled test dataset.

Fig. 8. Logistic regression results’ confusion matrix.

3.4 Fraud Detection Using Neural Networks

To create the neural network, the same Python stack, based on TensorFlow, was used. The structure of the neural network is a simple model that consists of one input layer, one hidden layer of 32 nodes, and one output layer that can take one of two possible values: 0 or 1.

We will train the neural network twice: first by means of sub-sampling, and then by means of over-sampling. In the first case, we narrow our data to a 50/50 ratio, so we randomly drop a significant portion of the verified transactions. During over-sampling, we expand our data by adding new records of fraudulent transactions generated on the basis of the existing fraudulent records. To train the neural network, 20 iterations were performed on the corresponding dataset. After training the neural network, we evaluate it on the original dataset and compare the results between the neural networks themselves and the best classifiers.

As we see from Figure 9, the neural network trained on the sub-sampled data classified a significant part of the verified transactions (Y axis) as fraudulent, but let only one fraudulent transaction through. Overall, the score of this neural network was 93.1%. Over-sampling (data expansion) showed the best result (Figure 10) among both neural networks and all models in general, demonstrating 99.9% correct classifications. However, it should be noted that 24 fraudulent transactions got through, and therefore the percentage of blocked fraudulent transactions is lower.
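The pipeline of sections 3.3 and 3.4 could be reproduced roughly as follows (a sketch, not the authors' exact code: the random stand-in data, hyperparameters, optimizer and loss are our assumptions; in practice the balanced training data and the original test data from section 3.2 would be used).

```python
# Sketch of sections 3.3-3.4: cross-validate the four classifiers, inspect a
# confusion matrix, and train a small neural network (one hidden layer of 32
# nodes, binary output, 20 training iterations). The arrays below are random
# stand-ins for the balanced training data and the held-out test data.
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_bal, y_bal = rng.normal(size=(400, 30)), rng.integers(0, 2, 400)    # stand-in training data
X_test, y_test = rng.normal(size=(200, 30)), rng.integers(0, 2, 200)  # stand-in test data

classifiers = {
    "Logistic regression": LogisticRegression(),
    "Support vectors": SVC(),
    "K-nearest neighbours": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    print(name, cross_val_score(clf, X_bal, y_bal, cv=5).mean())      # cross-validation score

best = classifiers["Logistic regression"].fit(X_bal, y_bal)
print(confusion_matrix(y_test, best.predict(X_test)))                 # cf. Figure 8

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_bal.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),     # one hidden layer of 32 nodes
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_bal, y_bal, epochs=20, verbose=0)                         # 20 training iterations
nn_pred = (model.predict(X_test).ravel() > 0.5).astype(int)
print(confusion_matrix(y_test, nn_pred))                              # cf. Figures 9-10
```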
Fig. 9. Confusion matrix for the neural network trained on sub-sampled data.

Fig. 10. Confusion matrix for the neural network trained on over-sampled data.

4 Conclusion

Logistic regression reaches up to 94% correct classifications, while the neural network trained on the sub-sampled data shows a result of 93.1%, and the one trained on the over-sampled data as much as 99.9%, although the latter misses a significant number of fraudulent operations. On the one hand, the accuracy of the neural network trained on over-sampled data is higher; on the other hand, it misses more of the fraudulent operations, although it classifies the verified ones better. Logistic regression showed average accuracy, but also missed a significant part of the fraudulent transactions. Although the neural network trained on the sub-sampled data showed the worst overall result of 93.1%, it blocked the largest number of fraudulent transactions. In general, the choice of model depends on the specific situation: whether clients are prepared to have a transaction occasionally declined in exchange for confidence that their funds will not be obtained by fraud, or whether they are more interested in ease of use and security is not as important.

References

1. Ajvazyan, S.A., Buxshtaber, V.M., Enyukov, Y.S., Meshalkyn, L.D.: Prykladnaya statystyka – klassyfykacyya i snyzhenye razmernosty (Applied Statistics – Classification and Reduction of Dimensionality). Finansy i statistika, Moscow (1989)
2. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2009)
3. Lebichot, B., Le Borgne, Y.-A.: Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds.) Recent Advances in Big Data and Deep Learning, pp. 78–88. Springer, New York (2019)
4. Caelen, O., Smirnov, E.N.: Improving Card Fraud Detection Through Suspicious Pattern Discovery. In: Benferhat, S., Tabia, K., Ali, M. (eds.) Advances in Artificial Intelligence: From Theory to Practice, pp. 181–190. Springer, New York (2017)
5. Pozzolo, A.D., Caelen, O., Bontempi, G., Johnson, R.A.: Calibrating Probability with Undersampling for Unbalanced Classification. Paper presented at the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015
6. Chernyak, O.I., Komashko, O.V., Stavyckyj, A.V., Bazhenova, O.V.: Ekonometryka (Econometrics). Vydavnycho-polihrafichnyi tsentr “Kyivskyi universytet”, Kyiv (2010)
7. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
8. Shlezynher, M., Hlavach, V.: Desyat lekcyj po statystycheskomu y strukturnomu raspoznavanyyu (Ten Lectures on Statistical and Structural Recognition). Naukova Dumka, Kyiv (2004)
9. Transactions database. ULB, Belgium. http://mlg.ulb.ac.be/ (2016). Accessed 2 June 2018
10. McKinney, W.: Python for Data Analysis. O’Reilly Media, Sebastopol (2016)
11. Idris, I.: Python Data Analysis Cookbook. Packt Publishing, Birmingham (2016)
12. Michie, D., Spiegelhalter, D.J.: Machine Learning, Neural and Statistical Classification. University of Leeds, Leeds (1994)
13. Nilsson, N.J.: Introduction to Machine Learning. Stanford University, Stanford (1997)
14. Miller, J.D.: Big Data Visualization. Packt Publishing, Birmingham (2017)
15. Shpyrko, V., Koval, B.: Fraud detection models and payment transactions analysis using machine learning.
SHS Web of Conferences. 65, 02002 (2019). doi:10.1051/shsconf/20196502002