A Cost-Sensitive Cosine Similarity K-Nearest Neighbor for Credit Card Fraud Detection

Sara Makki*§, Rafiqul Haque†, Yehia Taher‡, Zainab Assaghir§, Mohand-Saïd Hacid* and Hassan Zeineddine§
* Laboratoire LIRIS, Université de Lyon, Villeurbanne, France
† Cognitus Centre for Big Data Science R&D, Paris, France
‡ Laboratoire DAVID, Université de Versailles, Versailles, France
§ Lebanese University, Beirut, Lebanon

Abstract—Credit card fraud commonly happens in financial institutions such as banks. Fraud results in huge financial damage that may reach billions of dollars every year. Detecting and preventing credit card fraud manually is a labor-intensive and relatively ineffective approach. Therefore, a significant effort was made to develop automated solutions for fraud detection. Researchers dedicated their work to designing and developing models and systems, in particular fraud analysis systems that detect different types of fraud in different sectors including insurance, telecommunication, financial audit, financial markets, money laundering, credit card, etc. However, some problems remain unsolved. Of all, the most prevalent one is the extreme class imbalance. In this paper, we aimed at addressing this problem. We focused on the K-Nearest Neighbor (KNN) classifier and investigated the cost-sensitive approaches used for KNN. We also present a novel cost-sensitive KNN approach that we developed using Cosine Similarity (CoS). We compared our model with the other methods to verify its efficiency, and we showed using several performance measures that it is a better approach than other KNN algorithms.

Index Terms—Fraud detection, K-Nearest Neighbor, Imbalanced classification, Cosine similarity, Cost-sensitive learning.

This work partially benefited from the support of the Lebanese National Council for Scientific Research (CNRS-L).

I. INTRODUCTION

Credit card fraud has always been of huge interest to financial institutions and credit card users due to its detrimental effects. Credit card fraud is defined as an unauthorized use of a credit card account.
A fraudster uses credit card information, often remotely, without the knowledge of the card owner or the card issuer. There are different types of fraud. Simple theft (offline fraud) occurs if the card is stolen; application fraud occurs when credit card applicants obtain new credit cards using a fake identity or information; bankruptcy fraud is sometimes a result of application fraud; finally, counterfeit fraud occurs when only the details of a legitimate card are stolen (skimming or shoulder surfing) and used remotely (mobile sales, online, etc.).

Detecting and preventing these frauds is challenging for financial institutions. Human-driven fraud detection approaches are labor intensive. Over the years, researchers have been working on developing automated fraud detection systems to reduce and prevent fraud, specifically credit card fraud (e.g., [1], [2], [3]). However, the major challenge is the skewed data distribution, also known as class imbalance. In this problem, the dataset is extremely imbalanced and highly skewed [4]: the training set contains far more examples of legitimate transactions than of fraudulent ones. This phenomenon makes balanced training very difficult, and the detection of fraud becomes more challenging due to inadequate training examples. In the case of credit card fraud, the imbalance is often extreme (the fraudulent transactions in the training set are less than 10%). Several machine learning methods were investigated for credit card fraud detection and for the class imbalance problem, and a significant body of research has addressed this problem (e.g., [5], [6], [7]). However, an efficient solution is still missing in the state of the art. The available imbalanced classification approaches often increase the detection of the minority group at the cost of generating false predictions for the other class, which leads to an overall decrease of the model's accuracy.

Our Contribution: In this paper, we focus on the K-Nearest Neighbor (KNN) classifier. We investigate the cost-sensitive approaches used for KNN and present a new cost-sensitive KNN approach that we developed using Cosine Similarity (CoS). We compared our model with the other methods to verify its efficiency, and we showed using four performance measures (accuracy, sensitivity, PR curve and the F1 score) that it is a better approach than other KNN algorithms.

The remainder of the paper is organized as follows. In Section II, we review the related work on credit card fraud detection using KNN and on imbalanced classification, especially the cost-sensitive approaches. In Section III, we briefly describe the KNN classifier and the methods compared; in Section IV, we describe the approach we developed. In Section V, we describe the data and present the results, followed by a conclusion and future work.

II. RELATED WORK

Many studies investigated the skewed data distribution in classification and introduced ways to tackle this problem. There are three different ways to address the class imbalance issue [8]. The first one is at the data level, as a preprocessing step to balance the classes; usually over- or under-sampling is done before applying a machine learning algorithm. The second one is an algorithm-level approach, like cost-sensitive learning or one-class classification, which alters the original algorithm to focus on the minority group. The basic idea behind cost-sensitive models is to give higher weights to the small class, which is equivalent to assigning higher costs to false negatives. In one-class classification, on the other hand, the training is done using only one class, usually the minority group. The third approach consists of a combination of the two previous approaches.

As for credit card fraud detection, researchers used different machine learning algorithms to classify credit card transactions as normal or fraudulent. A survey of these methods and their application to credit card datasets is presented by Tripathi and Pavaskar [9] and Sorournejad et al. [10].

KNN was rarely investigated for credit card fraud detection. Ganji and Mannem [11] proposed a credit card fraud detection system using a data stream outlier detection algorithm based on reverse k-nearest neighbors. Whitrow et al. [12] considered transaction aggregation and showed that it is effective; they compared multiple methods including K-nearest neighbors, support vector machines and logistic regression, and concluded that random forest outperforms all of them.

Credit card fraud detection is a common example of an imbalanced data classification problem. Sahin et al. [13] and Bahnsen et al. [14] proposed a cost-sensitive decision tree approach that splits the data by minimizing the sum of misclassification costs. The authors compared their approach to well-known methods using a real credit card fraud dataset. Kamaruddin and Ravi [15] proposed a one-class classification approach to solve the imbalance problem: a hybrid system of Particle Swarm Optimization and Auto-Associative Neural Network (PSOAANN), implemented in a Spark computational framework. Lu and Ju [16] proposed an Imbalance Class Weighted Support Vector Machine (ICW-SVM) to detect credit card fraud, according to the imbalance characteristics of the data, after reducing its dimension using Principal Component Analysis (PCA). They applied their model and proved its efficiency on a real bank dataset.

III. METHODS

In this section, we describe the KNN classifier algorithm using simple voting or distance-weighted voting, and a cost-sensitive KNN approach that was introduced by Qin et al. [17]. We also compare these methods with a cost-sensitive decision tree algorithm and a one-class classification support vector machine.

A. K-Nearest Neighbor (KNN)

KNN is a data mining method widely used for classification and regression. It is a simple algorithm that consists of using the k nearest points to the one we aim to predict [18]. A specific norm is used to measure the distance between points. The norm commonly used to measure the distance between two observations p and q ∈ R^n is the Euclidean distance:

    d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

The KNN algorithm also depends on a distance rule for the classification using the k nearest neighbors of a new sample.

1) Simple Voting: The most straightforward rule used for the classification is simple voting: the new observation is assigned to the class of the majority of the k nearest points. The classification is done according to the following formula:

    \hat{y} = \arg\max_{c \in Y} \sum_{k \in K} \delta(c, class(k))

where \hat{y} represents the predicted value of a new observation y, c represents the possible classification categories, and K is the subset of the chosen nearest neighbors of y. The function \delta is the indicator function:

    \delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases}

The argmax returns the category, 0 or 1, at which the maximum is reached, i.e., the category of the majority of the neighbors.
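For illustration, here is a minimal sketch of the simple-voting rule with the Euclidean distance, using a brute-force neighbor search; the function and variable names are ours, as the paper does not publish code.

```python
import numpy as np

def knn_simple_vote(X_train, y_train, x_new, k=10):
    """Assign x_new to the majority class among its k
    Euclidean nearest neighbors (simple voting)."""
    # Euclidean distance from x_new to every training observation.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k nearest points; labels are assumed to be
    # non-negative integers (here 0 = legitimate, 1 = fraud).
    neighbors = y_train[np.argsort(dists)[:k]]
    # bincount realizes the sum of delta(c, class(k)) per category c;
    # argmax then returns the majority class.
    return int(np.argmax(np.bincount(neighbors)))
```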
2) Distance Weighted: Another method is distance-weighted voting. The idea behind this approach is to take into account the distances of the neighbors and to assign a higher weight to the closest ones. The prediction is done as follows:

    \hat{y} = \arg\max_{c \in Y} \sum_{k \in K} w_k \, \delta(c, class(k)),  with  w_k = \frac{1}{d_k^2}

where w_k is the assigned weight and d_k is the Euclidean distance between y and the considered neighbor.

B. Cost-Sensitive KNN

In [17], the authors introduced two cost-sensitive approaches for KNN. The first one, called Direct Cost-Sensitive KNN (DirectCS-KNN), is based on calculating class probabilities from the original KNN algorithm using the following formula:

    P_i = \frac{k_i}{k}

where P_i is the probability that the class of a new observation y is i, and k_i is the number of neighbors of class i.

The other approach is distance weighted and is called Distance-CS-KNN. Considering two categories 1 and 0, the idea consists of calculating costs C_1 and C_0 for both classes as follows:

    C_1 = fp \times P_0  and  C_0 = fn \times P_1

where fp and fn represent, respectively, the costs of false positives and false negatives. The new sample is assigned to the class with the lower C_i. In this case, the P_i are calculated in a cost-sensitive manner, different from the one in DirectCS-KNN:

    P_1 = \frac{w_1}{w_1 + w_0}  and  P_0 = \frac{w_0}{w_1 + w_0}

where the w_i are the cost-sensitive class weights, obtained by summing the weights of the k_i neighbors belonging to class i:

    w_i = \sum_{j=1}^{k_i} w_j
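The following sketch shows the two-class Distance-CS-KNN decision rule as described above. It assumes the neighbor weights w_j are the distance weights 1/d^2 from Section III-A; the names and the example cost values are ours and purely illustrative.

```python
import numpy as np

def distance_cs_knn(X_train, y_train, x_new, k=10, fp=1.0, fn=10.0):
    """Two-class Distance-CS-KNN: assign the label with the lower
    expected misclassification cost, using distance-weighted
    class probabilities from the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    idx = np.argsort(dists)[:k]
    # Distance weights w_k = 1 / d_k^2 (epsilon guards zero distances).
    w = 1.0 / (dists[idx] ** 2 + 1e-12)
    labels = y_train[idx]
    # w_i: summed weights of the neighbors belonging to class i.
    w1, w0 = w[labels == 1].sum(), w[labels == 0].sum()
    p1, p0 = w1 / (w1 + w0), w0 / (w1 + w0)
    c1 = fp * p0      # expected cost of predicting fraud (1)
    c0 = fn * p1      # expected cost of predicting legitimate (0)
    return 1 if c1 < c0 else 0
```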
C. Cost-Sensitive Decision Tree (C5.0)

C5.0 is one of the most common decision tree algorithms, using cross entropy and information gain to create the partitions and splits of the tree. The costs are applied to the decision boundaries, not in the training algorithm [19], so the new decision boundary considered for classifying an observation into class 1 is:

    \frac{p_1}{p_0} > \frac{C_{1/0} \, \pi_0}{C_{0/1} \, \pi_1}

where \pi_i is the prior probability of an observation being in class i, and p_i and C_{j/i} are, respectively, the estimated probability and the cost of wrongly classifying an observation of class i as j.

D. One-Class Classification Support Vector Machine

Using only the minority cases, a one-class classification SVM is applied with the aim of learning only the characteristics of this class [20]. The purpose of this method is to find a "small" region that groups most of the minority training observations. This requires defining a function f that returns 1 if a point belongs to this region and -1 elsewhere [21]. This function is found by solving the following optimization problem:

    \min_{w, \zeta, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu l}\sum_{i=1}^{l} \zeta_i - \rho
    \text{subject to} \quad w^T \Phi(x_i) \ge \rho - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, l

where \Phi : X \to F is a mapping function. This problem is solved by separating the points from the origin with maximum margin \rho / \|w\| using a hyperplane of equation w^T \Phi(x) = \rho.
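This one-class formulation is available off the shelf; a minimal sketch with scikit-learn follows. The arrays X_train, y_train, X_test and the ν value are illustrative assumptions, not settings taken from the paper.

```python
from sklearn.svm import OneClassSVM

# Train only on the minority (fraud) cases, as described above;
# X_train, y_train and X_test are assumed to be numpy arrays.
X_fraud = X_train[y_train == 1]

# nu upper-bounds the fraction of training points treated as
# outliers; the value here is illustrative, not the paper's.
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_fraud)

# predict() returns +1 inside the learned region (resembles fraud)
# and -1 elsewhere.
y_pred = (occ.predict(X_test) == 1).astype(int)
```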
IV. COST-SENSITIVE KNN APPROACH (CoSKNN)

We developed a cost-sensitive KNN approach. Our aim was to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a score for the classification. In order to improve the method's performance in terms of imbalance, we also studied the choice of the score's threshold and the number of neighbors to consider. These steps are described in the following.

Step 1: The use of Cosine Similarity

The first step of our new approach is the change of the distance metric used. This metric consists of calculating the cosine of the angle between two vectors p and q representing two observations:

    CoS(p, q) = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \; \sqrt{\sum_{i=1}^{n} q_i^2}}

We replaced the Euclidean distance used in KNN with this metric when calculating the distance between observations in order to find the nearest neighbors. Note that this metric ranges between -1 and 1. If CoS is close to 1, the angle between the two observations is close to zero and therefore they are similar (neighbors).

The advantage of using CoS instead of the Euclidean distance is highlighted in Section V, where we compared KNN (with simple voting and with weighted distance) under both metrics and found that CoS is better in terms of sensitivity.

Step 2: Introducing the score S_y

The second step of our approach, after finding the neighbors, is to introduce an imbalanced classification approach while using CoS. The idea is to evaluate the similarity of an observation to its neighbors of the minority class, taking into account the other class as well. This is done by calculating the following score S_y for each sample y:

    S_y = \frac{\sum_{i=1}^{k} C_i \cdot CoS_i}{\sum_{i=1}^{k} CoS_i}

where:
• k is the number of neighbors considered for y;
• C is a vector of length k, with C_i ∈ {0, 1} representing the classes of the neighbors (0 denotes the majority group, 1 the minority group);
• CoS is a vector representing the cosine similarity between y and each of its neighbors.

This score ranges from 0 to 1. It behaves like a probability that describes the likelihood of an observation belonging to the minority group. When it is close (or equal) to zero, the neighbors are mostly (or all) of the majority group, which leads to a majority group classification. However, when at least one of the neighbors of y is of the minority class, this score will be higher than zero, and the closer it is to one, the more the neighbors belong to the minority group.

A. The classification and the score threshold

The classification is done according to a certain threshold α ∈ [0, 1]:

    \hat{y} = \begin{cases} 0 & \text{if } S_y \le \alpha \\ 1 & \text{if } S_y > \alpha \end{cases}

The choice of α is not straightforward. A very low value will lead to a large number of false positives (observations of the majority group classified as minority). However, a high threshold value will lead to a very low sensitivity rate. Therefore, α should have a slightly low value. Taking into account the imbalance ratio of ≈ 5%, α was chosen according to the 95th percentile of S_y, and optimized by comparing the sensitivity across values of α.

B. The choice of k

The choice of k affects many aspects of the approach. The number of neighbors should be large enough to be informative about the sample's neighborhood. On the other hand, due to the imbalance, a high number of neighbors makes the classification biased towards the majority group, and time consuming. After trying several possible values of k, we found that considering 10 neighbors is the best for our case.
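Putting the two steps, the threshold rule and the choice of k together, the following is a minimal sketch of the CoSKNN classifier as we read it. The helper names and the small epsilon guard are ours, and we assume the selected neighbors have positive similarity, so that S_y falls in [0, 1] as described above.

```python
import numpy as np

def cos_similarity(p, q, eps=1e-12):
    """CoS(p, q): cosine of the angle between two observations."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps)

def cosknn_score(X_train, y_train, x_new, k=10):
    """Steps 1 and 2: find the k most similar neighbors by CoS,
    then compute S_y, the CoS-weighted share of minority neighbors."""
    cos = np.array([cos_similarity(x, x_new) for x in X_train])
    idx = np.argsort(cos)[-k:]          # largest CoS = most similar
    c, s = y_train[idx], cos[idx]       # neighbor classes and similarities
    return float(np.sum(c * s) / np.sum(s))

def cosknn_predict(X_train, y_train, x_new, alpha, k=10):
    """Classify as minority/fraud (1) when S_y exceeds alpha."""
    return int(cosknn_score(X_train, y_train, x_new, k) > alpha)
```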
V. EXPERIMENT AND RESULTS

To prove the efficiency of our approach, we compared it to the original KNN classifier using both the Euclidean distance and cosine similarity, with simple voting and with the distance-weighted approach. We also compared it with the cost-sensitive C5.0, the one-class classification SVM and Distance-CS-KNN [17].

A. Data Description

The data we used is extracted from a labeled credit card fraud dataset (available at http://packages.revolutionanalytics.com/datasets/). The data contains 10 million observations (in our case, the observations are financial transactions) and 8 variables. The explanatory variables are: gender, representing the client's gender; state, a categorical variable denoting the state where the client lives; cardholder, representing the number of cards that the client has; balance, a continuous variable indicating the balance on the credit card; the numbers of transactions and international transactions made to date, denoted respectively by numTrans and numIntTrans; and creditLine, the credit limit of the client. The variable to predict, fraudRisk, indicates whether each observation is fraudulent (denoted 1) or legitimate (denoted 0). 596,014 observations (5.96%) are fraud cases and 9,403,986 (94.04%) are legitimate.

In this paper, we extracted a part of the original data because of the time the KNN method consumes when calculating the distance to all observations in the training set. The new data consists of 6,000 credit card transactions (observations), of which 5,657 are legitimate (0) and 343 are fraud (1). This dataset takes into account the imbalance ratio and has the same characteristics as the original one. These proportions exemplify the extreme imbalance: the fraud cases represent 5.7% of the dataset. Fraud detection in this case is very challenging. In fact, an accuracy rate below 94.3% is not acceptable, because an algorithm that simply classifies all data points as legitimate would achieve this high accuracy. The choice of other performance measures will be discussed later. The data is divided between a training set (3,999 transactions) and a test set (2,001 transactions) with a similar imbalance ratio. For all KNN methods, 10 nearest neighbors are considered to classify the new samples, and the data is first normalized using the mean and standard deviation of the variables, to avoid bias towards variables with large ranges [18], because the explanatory variables are on widely different scales.

B. Results and Discussion

In this section, we show the results of the comparison of our approach with the other existing methods in the state of the art. The compared methods are given in Table I.

TABLE I
COMPARED METHODS AND THEIR DESCRIPTIONS

Abbreviation      Method's Description
EucKNN            Simple voting KNN using the Euclidean distance
CKNN              Simple voting KNN using cosine similarity
DEucKNN           Distance-weighted KNN using the Euclidean distance
DCKNN             Distance-weighted KNN using cosine similarity
Distance-CS-KNN   The distance-based cost-sensitive approach introduced by Qin et al. [17]
CS-C5.0           Cost-sensitive C5.0 decision tree
OCC-SVM           One-class classification Support Vector Machine
CoSKNN            Cost-sensitive cosine-similarity-based KNN

The use of accuracy alone as a performance measure is misleading. Other measures should be considered as well, such as the sensitivity, also known as recall. The challenge is that most imbalanced classification methods focus on increasing the sensitivity, which leads to a slight decrease in accuracy. This decrease, which can sometimes be just 1%, may seem insignificant, but in fact it can hide a high number of false alarms. Thus, we also need to rely on measures that find a trade-off between high accuracy and sensitivity; the Area Under the Precision-Recall Curve (AUPRC) is used for that purpose. However, since we may not always be able to plot this curve, as in the case of CS-C5.0 and OCC-SVM, we also use the F1 score, also known as the F-measure. To evaluate these measures, a confusion matrix (Table II) is calculated using the test set.

TABLE II
FORM OF THE CONFUSION MATRIX

                              Actual
Predicted            Legitimate (0)        Fraud (1)
Legitimate (0)       True Negative (TN)    False Negative (FN)
Fraud (1)            False Positive (FP)   True Positive (TP)

    Accuracy = \frac{TP + TN}{TN + FN + TP + FP} \times 100

The accuracy shows the percentage of legitimate and fraud transactions correctly predicted.

    Recall = \frac{TP}{TP + FN}  and  Precision = \frac{TP}{TP + FP}

The recall, or sensitivity, focuses only on fraud detection and the true positive rate, while the precision measures the fraction of examples classified as positive that are truly positive. The Precision-Recall (PR) curve is obtained by plotting the precision against the recall over different class probability thresholds. The closer the curve is to the upper-right-hand corner, the better the model. It is not always straightforward to obtain class probabilities, so we also use the F1 score; the higher this score, the better.

    F1 \; score = 2 \times \frac{Precision \times Recall}{Precision + Recall}
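As an illustration of how these measures and the percentile-based threshold from Section IV-A can be computed, a short sketch with scikit-learn follows. The array names (scores holding the test-set S_y values, y_test holding the true labels) are placeholders of ours.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score,
                             f1_score, average_precision_score)

# scores holds S_y for every test transaction, y_test the true
# labels; both names are illustrative placeholders.
alpha = np.percentile(scores, 95)    # threshold at the 95th percentile,
                                     # matching the ~5% fraud ratio
y_pred = (scores > alpha).astype(int)

print("accuracy   :", accuracy_score(y_test, y_pred))
print("sensitivity:", recall_score(y_test, y_pred))
print("F1 score   :", f1_score(y_test, y_pred))
# AUPRC is computed from the raw scores, not the thresholded labels.
print("AUPRC      :", average_precision_score(y_test, scores))
```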
as banks are losing billions of dollars every year due to fraudulent activities committed by the fraudsters. Over the years, a number of solutions has been been proposed in TABLE III large bodies of literature. However, some problems are still TABLE SUMMARIZING THE PERFORMANCE MEASURES open. The class imbalance problem is one of the most critical problems that is yet to be solved. Method Accuracy Sensitivity AUPRC F1 score EucKNN 95.8% 0.30 0.39 0.45 In this paper, we aimed at addressing this problem. We CKNN 95.8% 0.42 0.53 0.53 investigated the use of KNN in fraud detection. We proposed a DEucKNN 95.6% 0.32 0.22 0.46 cost-sensitive KNN approach to tackle the imbalance problem. DCKNN 95.4% 0.34 0.54 0.46 Distance-CS-KNN 94.5% 0.53 - 0.52 We provided a comprehensive detail of our new approach. In CS - C5.0 93.3% 0.65 - 0.53 our approach we used cosine similarity instead of the euclidean OCC - SVM 88.9% 0.22 - 0.18 distance to find the neighbors, and then calculated a certain CoSKNN 95.5% 0.51 0.54 0.56 score to evaluate the probability of fraud risk. We also presented a comparative study in this paper. In our study, we compared the performance of simple voting KNN and distance weighted KNN using both euclidean distance This table shows that the accuracy is higher than 94.3% and cosine similarity, with another cost sensitive KNN, de- for all models except the CS - C5.0 and OCC - SVM. cision tree approach, one class classification SVM and our The slight differences of the accuracy between all the other new approach. The comparison was done by applying these methods shows how much information this measure hides methods to a credit card fraud dataset with imbalance, using when the imbalance is extreme. We can conclude from the multiple performance measures, mostly relying on AUPRC table when comparing the performance measures of EuCKNN and F1 score. This experiment shows that our approach is with CKNN and DEuCKNN with DCKNN that the use of outperforming all the other methods. cosine similarity is improving the classification according We encountered several challenges in our study. The most to the sensitivity, AUPRC and F1 score, with a reasonable prevalent one is the elapsed time that restricted us the use of decrease in accuracy. number of observations for the training set which could not Our approach CoSKNN is outperforming all the methods exceed 3999. This is an obvious limitation of our experiment. according to the AUPRC and F1 score. It is considerably Several works have been lined up to extend our current improving the sensitivity when compared to the simple KNN. work. We planned to work on implementing our approach in The other cost-sensitive models are performing better in terms a big data environment in order to use the massive amount of sensitivity, but at the cost of raising false alarms and of data. Another interesting task which we planned is finding decreasing the accuracy sometimes to a less than acceptable an optimized threshold α that can be selected automatically value, like the case of CS-C5.0 and OCC-SVM. instead of letting user to investigate and find it. 46 R EFERENCES [1] P. Richhariya and P. K. Singh, “Evaluating and emerging payment card fraud challenges and resolution,” International Journal of Computer Applications, vol. 107, no. 14, 2014. [2] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland, “Data mining for credit card fraud: A comparative study,” Decision Support Systems, vol. 50, no. 3, pp. 602–613, 2011. [3] A. Dal Pozzolo, O. Caelen, Y.-A. 
[3] A. Dal Pozzolo, O. Caelen, Y.-A. Le Borgne, S. Waterschoot, and G. Bontempi, "Learned lessons in credit card fraud detection from a practitioner perspective," Expert Systems with Applications, vol. 41, no. 10, pp. 4915–4928, 2014.
[4] C. Phua, D. Alahakoon, and V. Lee, "Minority report in fraud detection: classification of skewed data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 50–59, 2004.
[5] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[6] S. Ertekin, J. Huang, and C. L. Giles, "Active learning for class imbalance problem," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2007, pp. 823–824.
[7] M. Wasikowski and X.-w. Chen, "Combating the small sample class imbalance problem using feature selection," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1388–1400, 2010.
[8] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.
[9] K. K. Tripathi and M. A. Pavaskar, "Survey on credit card fraud detection methods," International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 11, pp. 721–726, 2012.
[10] S. Sorournejad, Z. Zojaji, R. E. Atani, and A. H. Monadjemi, "A survey of credit card fraud detection techniques: Data and technique oriented perspective," CoRR, 2016.
[11] V. R. Ganji and S. N. P. Mannem, "Credit card fraud detection using anti-k nearest neighbor algorithm," International Journal on Computer Science and Engineering (IJCSE), vol. 4, no. 6, pp. 1035–1039, 2012.
[12] C. Whitrow, D. Hand, J. Juszczak, D. Weston, and N. M. Adams, "Transaction aggregation as a strategy for credit card fraud detection," Data Mining and Knowledge Discovery, pp. 30–55, 2009.
[13] Y. Sahin, S. Bulkan, and E. Duman, "A cost-sensitive decision tree approach for fraud detection," Expert Systems with Applications, vol. 40, pp. 5916–5918, 2013.
[14] A. C. Bahnsen, A. Stojanovic, D. Aouada, and B. Ottersten, "Cost sensitive credit card fraud detection using Bayes minimum risk," in Proceedings of the 2013 12th International Conference on Machine Learning and Applications, vol. 1, 2013, pp. 333–338.
[15] S. Kamaruddin and V. Ravi, "Credit card fraud detection using big data analytics: Use of PSOAANN based one-class classification," in Proceedings of the International Conference on Informatics and Analytics, 2016, pp. 33:1–33:8.
[16] Q. Lu and C. Ju, "Research on credit card fraud detection model based on class weighted support vector machine," Journal of Convergence Information Technology, vol. 6, pp. 62–68, 2011.
[17] Z. Qin, A. T. Wang, C. Zhang, and S. Zhang, "Cost-sensitive classification with k-nearest neighbors," in Knowledge Science, Engineering and Management. Springer Berlin Heidelberg, 2013, pp. 112–131.
[18] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2003.
[19] M. Kuhn and K. Johnson, Applied Predictive Modeling. Springer, 2013.
[20] A. Nicholas and R. Daniel, "One-class support vector machines: Methods and applications," 2008.
[21] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, Jul. 2001.