=Paper=
{{Paper
|id=Vol-3611/paper18
|storemode=property
|title=Can merged datasets help in training money laundering detection models?
|pdfUrl=https://ceur-ws.org/Vol-3611/paper18.pdf
|volume=Vol-3611
|authors=Paulius Savickas,Dovilė Kuizinienė,Mantas Bugarevičius,Žilvinas Kybartas,Tomas Krilavičius
|dblpUrl=https://dblp.org/rec/conf/ivus/SavickasKBKK22
}}
==Can merged datasets help in training money laundering detection models?==
Paulius Savickas (1,3), Dovilė Kuizinienė (2,3), Mantas Bugarevičius (4), Žilvinas Kybartas (4) and Tomas Krilavičius (2,3)

(1) Vytautas Magnus University, Faculty of Economics and Management, K. Donelaičio street 52, LT-44244 Kaunas, Lithuania
(2) Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
(3) Centre for Applied Research and Development, Lithuania
(4) UAB “Inventi”, Vilnius, Lithuania

Abstract

Money laundering identification and prevention is one of the most important topics in the financial industry and in financial crime investigation. However, it is difficult due to the high volume of transactions, personal data protection, and highly skilled white-collar criminals. Artificial intelligence and machine learning are already successfully used in different fintech applications, as well as in crime prevention. Unfortunately, due to confidentiality and privacy regulations, AML cases and related data are hard to obtain, and different datasets embody very different AML models. Most research uses synthetically generated datasets built on their own assumptions, which may not reflect reality. For this reason, in this research we try to improve AML models by merging different datasets with different features. We experiment with three publicly available, synthetically generated money transaction datasets and five different ML approaches: Random Forest, Generalized Linear Regression, XGBoost, Isolation Forest, and an ensemble of these methods. We use SMOTE for dataset balancing. The best model achieved 95.98% accuracy, recognizing 95.6% of legal payments and 84.4% of money laundering cases; this was achieved using the ensemble approach.

Keywords: AML, money laundering, machine learning algorithms, Random Forest, Isolation Forest, XGBoost, Generalized Linear Regression, merged data

1. Introduction

Money laundering is regulated by both the government authorities for financial crimes and the banks, since it involves much larger amounts of money than fraudulent payments do. Every year, between 2% and 5% of global GDP is laundered, amounting to between 715 billion and 1.87 trillion euros [1]. In 1989, the Group of Seven (G-7) established the Financial Action Task Force (FATF) as an international group to combat money laundering on a global scale. Its mandate was broadened in the early 2000s to include countering terrorism financing [2]. Money laundering poses a greater threat to society as a whole, yet it is rarely studied by researchers due to the high level of data confidentiality involved. Therefore, researchers use synthetically generated datasets for further development of the topic. These datasets are created on different assumptions, so training on one set and testing on another does not achieve good results in recognizing money laundering cases. Hence, a merged dataset is created and used for testing machine learning algorithms, to ensure better money laundering prevention.

While synthetic data has numerous advantages, it can be difficult to use appropriately. It is extremely challenging to ensure that it is as reliable as real-world data. When dealing with complex datasets containing a significant number of variables, it is possible to create a synthetic dataset that does not accurately represent real-world scenarios. This can lead to inaccurate decision-making due to incorrect insight development [3].

2. Literature review

Money laundering prevention is a matter for both government and financial institutions. Data in this area are highly sensitive and often difficult to access; for that reason, this problem is not widely discussed in the literature.
IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
paulius.savickas@card-ai.eu (P. Savickas); dovile.kuiziniene@vdu.lt (D. Kuizinienė); mantas.bugarevicius@inventi.lt (M. Bugarevičius); zilvinas@inventi.lt (Ž. Kybartas); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

All of the reviewed studies used cryptocurrency transactions or synthetically generated datasets. The following methods were analyzed: Random Forest, Logistic Regression, Decision Tree, XGBoost, Support Vector Machine, and deep learning methods.

In the vast majority of the studies reviewed, Random Forest showed the best performance compared to other methods, with accuracies of 98.06%, 99%, 90.40%, and 97.53%, and F1-scores ranging from 0.76 to 0.83 [4][5][6][7][8].

The analysis suggests that Random Forest is the most appropriate method to address the problem of money laundering; however, due to a lack of research, its effectiveness has not been fully validated. Furthermore, since it is unclear what assumptions the datasets were based on and whether they correspond to reality, merging several datasets allows for the most comprehensive coverage of those assumptions.

3. Methods

3.1. Machine learning methods

The selection of methods in this study was based on the literature review and best practices for money laundering prevention. Four supervised methods, namely Random Forest, Generalized Linear Regression, Support Vector Machine, and XGBoost, and one unsupervised method, namely Isolation Forest, were used in the study.

3.1.1. Generalized Linear Model

The term "linear model" usually encompasses both systematic and random components in a statistical model; however, for the purposes of this project the term was restricted to include only the systematic components:

Y = Σ_{i=1}^{m} β_i x_i,    (1)

where x_i are independent variables with known values, and β_i are parameters whose values may be fixed (known) or unknown, requiring estimation. An independent variable can be quantitative, producing a single x-variate in the model; qualitative, producing a set of x-variates with values between 0 and 1; or mixed, producing a set of x-variates with values between 0 and 1.

3.1.2. Random Forest

Random Forest is a machine learning algorithm that constructs a multitude of decision trees during training. The main principle of constructing a random forest is that a classifier is formed by combining solutions from binary decision trees built using diverse subsets of the original dataset and subsets containing randomly selected features from the feature set [9]. Constructing small decision trees that have only a few features takes up little processing time, hence the solutions of a majority of such trees can be combined into a single strong classifier.

3.1.3. XGBoost

XGBoost is a machine learning algorithm that implements frameworks based on Gradient Boosted Decision Trees [10]. XGBoost surpasses other machine learning algorithms by solving many data science problems faster and more accurately than its counterparts. The algorithm also has additional protection against overfitting.

3.1.4. Support Vector Machine

The aim of applying a Support Vector Machine is to find the maximum separating line (in the two-dimensional case), separating plane (in the three-dimensional case), or separating hyperplane (in the n-dimensional case, n > 3) that has the maximum distance to the nearest training data objects. For a hyperplane (which could be a line or a plane) to be considered the best, it needs to have the minimum classification error on previously unseen objects [11].

3.1.5. Isolation Forest

Let T be a node of an isolation tree. T is either an external node with no children, or an internal node with one test and exactly two daughter nodes (T_l, T_r). A test consists of an attribute q and a split value p such that the test q < p divides data points into T_l and T_r [12]. Given a sample of data X = {x_1, ..., x_n} of n instances from a d-variate distribution, to build an isolation tree (iTree) we recursively divide X by randomly selecting an attribute q and a split value p, until either: (i) the tree reaches a height limit, (ii) |X| = 1, or (iii) all data in X have the same values. An iTree is a proper binary tree, where each node has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is n and the number of internal nodes is n − 1; the total number of nodes of an iTree is 2n − 1, and thus the memory requirement is bounded and grows only linearly with n.

The goal of anomaly detection is to generate a ranking that indicates the degree of anomaly. As a result, sorting data points according to their path lengths or anomaly scores is one technique to find anomalies; anomalies are the points at the top of the list. The definitions of path length and anomaly score are given in [12].

3.2. Class balancing

Unbalanced classes lead to classification issues for machine learning algorithms. These issues are characterized by an unequal proportion of cases presented for each class of the problem. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known algorithm for dealing with this problem; its strategy is to artificially generate additional examples of the minority class by using the cases' closest neighbors. In addition, the majority class examples are under-sampled, resulting in a more balanced collection [13].

3.3. Data normalization method

We use Z-score normalization to normalize each column in the dataset separately, so that the mean of the entire column becomes 0 and the standard deviation becomes 1. The normalization formula is the following [14]:

x′ = (x − µ) / σ,    (2)

where µ is the population mean, and σ is the population standard deviation.

3.4. Models evaluation

In this research, we use accuracy, sensitivity, specificity, F1 score, and AUC metrics to correctly evaluate the results of the models, so that money laundering and legal payments can be clearly identified and compared with the results of other studies [15]. To compute them, a confusion matrix is needed. These metrics are calculated as follows [16][17]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

Sensitivity = TP / (TP + FN)    (4)

Specificity = TN / (TN + FP)    (5)

F1 = 2 × (Sensitivity × Specificity) / (Sensitivity + Specificity)    (6)

AUC = (1/2) Σ_{i=1}^{m−1} (x_{i+1} − x_i) × (y_{i+1} + y_i)    (7)

4. Data

4.1. Datasets

For money laundering prevention, three different synthetically generated datasets are used; they are compared in Table 1. Compared with the other datasets, Paysim has many records, which can bias machine learning algorithms towards recognizing only its data. For this reason, the Paysim dataset is reduced fivefold by selecting every fifth row of the dataset.

The Mahootika synthetically generated dataset [18] covers five months (February 20, 2019 – July 20, 2019) of 2,340 transactions, 60% of which are money laundering. This dataset has seven attributes. The simulation is based on three processes of money laundering in financial transactions: 1) money placement, 2) money layering, 3) money integration.

The AMLSim dataset [19] consists of 1,048,575 transactions, of which 0.16% are money laundering cases. All of these transactions are made from 9,999 sending accounts to 9,999 receiving accounts. This dataset consists of eight attributes. It is synthetically generated using the AMLSim simulator.

The Paysim dataset [20] consists of 6,362,620 transactions, but we narrow it down to 1,272,524 transactions due to computing resources. 0.13% of these transactions are money laundering cases. All of these transactions are made from 1,272,159 sending accounts to 777,582 receiving accounts. This dataset consists of 11 attributes. It is synthetically generated using the Paysim simulator.

Due to the different assumptions made in the generated datasets, models trained on one dataset perform poorly when evaluated on the others. It is difficult to know which assumptions are closest to real-case scenarios. Therefore, it is sensible to train machine learning models on all the datasets merged together.

4.2. Additional attributes

There are limited overlapping attributes across the three datasets; for this reason, additional attributes are created from time, action, and amount. In this way, 11 additional overlapping attributes are created (Table 2). These additional attributes, together with the transaction amount and the money laundering status, are used in further research.

Table 2: Additional attributes used in modeling

| #  | Additional attribute                       | Mahootika | AMLSim | Paysim |
| 1  | Action count                               | ✓ | ✓ | ✓ |
| 2  | Minimum amount                             | ✓ | ✓ | ✓ |
| 3  | Maximum amount                             | ✓ | ✓ | ✓ |
| 4  | Mean amount                                | ✓ | ✓ | ✓ |
| 5  | Median amount                              | ✓ | ✓ | ✓ |
| 6  | Coefficient of variation, amount           | ✓ | ✓ | ✓ |
| 7  | Previous transactions average              | ✓ | ✓ | ✓ |
| 8  | Same amount count                          | ✓ | ✓ | ✓ |
| 9  | Time difference                            | ✓ | ✓ | ✓ |
| 10 | Coefficient of variation, time difference  | ✓ | ✓ | ✓ |
| 11 | Same amount time difference                | ✓ | ✓ | ✓ |
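Several of the additional attributes in Table 2 are simple per-account aggregates of the transaction amount. The paper does not publish its feature-engineering code, so the following is only a minimal sketch of how attributes 1–6 might be derived with pandas, assuming a transaction table with hypothetical columns `account` and `amount`:

```python
import pandas as pd

def add_account_features(df: pd.DataFrame) -> pd.DataFrame:
    """Attach per-account aggregates in the spirit of Table 2, rows 1-6 (illustrative only)."""
    feats = (
        df.groupby("account")["amount"]
        .agg(
            action_count="count",    # row 1: transactions per account
            min_amount="min",        # row 2
            max_amount="max",        # row 3
            mean_amount="mean",      # row 4
            median_amount="median",  # row 5
            std_amount="std",        # helper for row 6
        )
        .reset_index()
    )
    # row 6: coefficient of variation of the amount (std / mean)
    feats["cv_amount"] = feats["std_amount"] / feats["mean_amount"]
    return df.merge(feats.drop(columns="std_amount"), on="account", how="left")
```

The time-based attributes (rows 7–11) would follow the same groupby pattern over a timestamp column.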
Table 1: Money laundering datasets

| Name              | Transactions | Fraud % | Mean transaction | Median transaction |
| Mahootika [18]    | 2,340        | 60%     | €2,508,583       | €1,162,354         |
| AMLSim [19]       | 1,323,234    | 0.16%   | €115,988         | €157               |
| Paysim (1/5) [20] | 1,272,524    | 0.13%   | €179,953         | €74,898            |

Table 3: Testing results based on merged training datasets. All models are trained on the balanced 70% training part of all merged datasets (unbalanced for the IF model) and tested on the 30% testing parts listed below.

Test: 30% Mahootika
| Metric      | LR    | RF    | XGBoost | IF    | Ensemble (all) | Ensemble (w/o LR) |
| Accuracy    | 62.1% | 51.7% | 67.7%   | 76.4% | 85.2%          | 62.1%             |
| Sensitivity | 0.7%  | 72%   | 15.3%   | 76.5% | 65.7%          | 0.8%              |
| Specificity | 100%  | 39.2% | 100%    | 76.3% | 97.2%          | 100%              |
| F1          | 1.5%  | 53.2% | 26.5%   | 71.2% | 77.2%          | 1.5%              |
| AUC         | 50.4% | 55.6% | 57.7%   | 76.4% | 81.5%          | 50.4%             |

Test: 30% AMLSim
| Metric      | LR    | RF    | XGBoost | IF    | Ensemble (all) | Ensemble (w/o LR) |
| Accuracy    | 97.3% | 58.1% | 99.9%   | 75.5% | 99.1%          | 98.8%             |
| Sensitivity | 97.3% | 58.1% | 99.9%   | 75.6% | 99.2%          | 98.8%             |
| Specificity | 100%  | 83.4% | 100%    | 12.1% | 84%            | 100%              |
| F1          | 98.6% | 73.5% | 100%    | 86%   | 99.6%          | 99.8%             |
| AUC         | 98.6% | 70.7% | 100%    | 43.8% | 91.4%          | 99.4%             |

Test: 30% Paysim
| Metric      | LR    | RF    | XGBoost | IF    | Ensemble (all) | Ensemble (w/o LR) |
| Accuracy    | 92.6% | 94.2% | 92.6%   | 99.2% | 94.3%          | 99.9%             |
| Sensitivity | 92.6% | 94.3% | 92.6%   | 99.3% | 94.4%          | 93.2%             |
| Specificity | 52.6% | 48%   | 52%     | 19.8% | 48%            | 51.9%             |
| F1          | 96.1% | 97%   | 96.1%   | 99.6% | 97.1%          | 96.4%             |
| AUC         | 72.6% | 71%   | 72.3%   | 59.5% | 71.2%          | 93%               |

Test: 30% all datasets
| Metric      | LR    | RF    | XGBoost | IF    | Ensemble (all) | Ensemble (w/o LR) |
| Accuracy    | 94.9% | 75.8% | 96.3%   | 87.1% | 96.7%          | 96%               |
| Sensitivity | 95%   | 75.8% | 96.3%   | 87.2% | 96.8%          | 95.6%             |
| Specificity | 84.6% | 59.8% | 84.4%   | 32.1% | 75.8%          | 84.4%             |
| F1          | 97.4% | 86.2% | 98.1%   | 93.1% | 98.3%          | 97.9%             |
| AUC         | 89.8% | 68%   | 90.4%   | 59.7% | 86.3%          | 90%               |

4.3. Data pre-processing

All three datasets were generated synthetically using different assumptions; therefore, it is difficult to determine which are closest to real-world scenarios. All attributes are normalized separately using Z-score normalization to maintain the uniqueness of the datasets and to cover their assumptions. All datasets are divided into two parts, 70% training and 30% testing, and then all training and testing parts are merged separately into two datasets. The SMOTE approach is used to balance the training part, except for the Isolation Forest model. Only then is machine learning model testing performed.

5. Results

To cover as many assumptions as possible, all datasets are merged into one, so that the models could be applied to real data with the best possible results. The RF, LR, XGBoost, and IF models are trained with the balanced merged training dataset. These trained models are tested with the merged test data of all datasets. To improve the research results, the Ensemble is created. Two variants of the ensemble are calculated:

1. Ensemble with all models.
2. Ensemble without the LR model.

All the results obtained for the individual and combined datasets are shown in Table 3.

5.1. Mahootika

After testing the trained models on the Mahootika dataset, the best results were obtained using the Ensemble method over all the models. The accuracy of this Ensemble is 85.2% and the AUC is 81.5%.

5.2. AMLSim

Testing on the AMLSim dataset has shown even better results. XGBoost achieved 99.9% accuracy and 100% AUC. The Ensemble method without the LR model showed very similar results: the accuracy of this Ensemble method is 98.8% and the AUC is 99.4%.

5.3. Paysim

After testing the models on the Paysim testing part, the best method is the Ensemble without the LR model, which has an accuracy of 99.87% and detected 51.9% of all money laundering cases.

5.4. Merged dataset

Finally, the models were applied to the merged test parts of all datasets. The highest accuracy, 96.7%, is achieved with the Ensemble method of all models, while XGBoost reached 96.3% accuracy and 90.4% AUC.

In conclusion, combining the datasets and using the Ensemble approach over all models is suitable due to better model performance. The Ensemble without the LR model achieved 95.98% accuracy, with 84.4% of all money laundering cases correctly identified and only 4.4% of all legal payments wrongly identified. The XGBoost model's findings were even better: its accuracy is 96.3%, legitimate payments were accurately recognized at 96.3%, and money laundering cases were detected at the same rate as with the Ensemble method without the LR model.

6. Conclusion

Money laundering is not commonly addressed in the literature, since it is strictly regulated by government entities and financial institutions. We use three publicly available synthetically generated datasets in this study, each with a different set of assumptions, ranging from 2,340 to 1,323,234 transactions with 0.13% to 60% of money laundering cases. A total of 11 additional attributes are generated for each dataset for further research. Each attribute of the datasets is normalized by the Z-score method. Then all three datasets are combined into one. The combined dataset is divided into training and testing parts and, where necessary, the data are balanced using the SMOTE method. Finally, the results of all the models are combined into the Ensemble method, where the majority of the models makes the decision about each instance.

After testing these models, we obtained an AUC of 72.6% to 100%, and both money laundering and legal payments have been well identified. To improve the models, we employ the Ensemble method, in which each method's vote is weighted equally and the class is determined by a majority of the classifiers' votes for the instance. The accuracy of this method ranged from 85.2% to 99.1%, and the AUC from 71.2% to 91.4%.

Moreover, one modification of the Ensemble method has been made, namely the exclusion of the LR model. In this case, we achieved 95.98% accuracy, and the model recognized 95.6% of legal payments and 84.4% of money laundering cases.

Machine-learning-based methods have been adapted to address the problem of money laundering prevention. The results show that these methods detect potential money laundering cases and reduce the number of payments reviewed. However, it is not known which dataset generation assumptions would be closest to our market; for this reason, it is recommended that data verification by experts be performed at the implementation stage.

Further research should include deep learning methods and other class balancing techniques.

References

[1] Europol, Crime area: money laundering, 2021. URL: https://www.europol.europa.eu/crime-areas-and-statistics/crime-areas/economic-crime/money-laundering.
[2] Aniruddha, Financial Action Task Force (FATF), 1989. URL: https://www.fatf-gafi.org/about/historyofthefatf/.
[3] I. Kot, The pros and cons of synthetic data, 2021. URL: https://www.dataversity.net/the-pros-and-cons-of-synthetic-data/#.
[4] I. Alarab, S. Prakoonwit, M. I. Nacer, Comparative analysis using supervised learning methods for anti-money laundering in bitcoin, in: Proceedings of the 2020 5th International Conference on Machine Learning Technologies, ACM, 2020, pp. 11–17. URL: https://dl.acm.org/doi/10.1145/3409073.3409078. doi:10.1145/3409073.3409078.
[5] O. Raiter, Applying supervised machine learning algorithms for fraud detection in anti-money laundering, 2021.
[6] E. Badal-Valero, J. A. Alvarez-Jareño, J. M. Pavía, Combining Benford's law and machine learning to detect money laundering. An actual Spanish court case, 282 (2018) 24–34. URL: https://linkinghub.elsevier.com/retrieve/pii/S0379073817304644. doi:10.1016/j.forsciint.2017.11.008.
[7] V. Huyen, Machine learning in money laundering detection, 2020.
[8] J. Lorenz, M. I. Silva, D. Aparício, J. T. Ascensão, P. Bizarro, Machine learning methods to detect money laundering in the bitcoin blockchain in the presence of label scarcity, 2021. URL: http://arxiv.org/abs/2005.14635. arXiv:2005.14635.
[9] J. Le, Decision trees in R, DataCamp, 2018. URL: https://www.datacamp.com/community/tutorials/decision-trees-R.
[10] xgboost developers, xgboost Release 1.2.0-SNAPSHOT, 2020. URL: https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf.
[11] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems), 2nd ed., Morgan Kaufmann, 2006.
[12] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413–422. URL: http://ieeexplore.ieee.org/document/4781136/. doi:10.1109/ICDM.2008.17.
[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, 16 (2002) 321–357. URL: https://www.jair.org/index.php/jair/article/view/10302. doi:10.1613/jair.953.
[14] S. K. Patro, K. K. Sahu, Normalization: A preprocessing stage, 2015, pp. 20–22. URL: http://www.iarjset.com/upload/2015/march-15/IARJSET%205.pdf. doi:10.17148/IARJSET.2015.2305.
[15] J. T. Townsend, Theoretical analysis of an alphabetic confusion matrix, 9 (1971) 40–50. URL: http://link.springer.com/10.3758/BF03213026. doi:10.3758/BF03213026.
[16] R. Kohavi, F. Provost, Glossary of terms, 30 (1998) 271–274. URL: http://link.springer.com/10.1023/A:1017181826899. doi:10.1023/A:1017181826899.
[17] Y. Sasaki, The truth of the F-measure, 2007.
[18] M. Mahootika, Money laundering data, dataset. URL: https://www.kaggle.com/maryam1212/money-laundering-data/metadata. Accessed: February 27, 2022.
[19] Money laundering data, AMLSim dataset. URL: https://www.kaggle.com/anshankul/ibm-amlsim-example-dataset. Accessed: February 27, 2022.
[20] AML detection, PaySim dataset. URL: https://www.kaggle.com/x09072993/aml-detection. Accessed: February 27, 2022.
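The equal-weight majority vote used by the Ensemble method, as described in the conclusion, can be sketched as follows. This is a minimal illustration rather than the authors' code; the 0/1 label encoding (1 = money laundering) and the tie-handling rule are assumptions, since the paper does not specify them:

```python
from collections import Counter

def majority_vote(predictions: list[list[int]]) -> list[int]:
    """Combine per-model 0/1 predictions by an equal-weight majority vote.

    predictions: one inner list of class labels per model, all of equal length.
    Ties are resolved in favor of the positive (money laundering) class,
    which is an assumption; the paper does not state its tie handling.
    """
    result = []
    for votes in zip(*predictions):  # one tuple of votes per instance
        counts = Counter(votes)
        # flag as laundering (1) when at least half of the models agree
        result.append(1 if counts[1] >= counts[0] else 0)
    return result
```

The paper's second ensemble variant (without the LR model) then amounts to simply excluding that model's prediction list before voting.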