                                Can merged datasets help in training money laundering
                                detection models?
                                Paulius Savickas1,3, Dovilė Kuizinienė2,3, Mantas Bugarevičius4, Žilvinas Kybartas4 and
                                Tomas Krilavičius2,3
                                1
                                  Vytautas Magnus University, Faculty of Economics and Management, K. Donelaičio street 52, LT-44244 Kaunas, Lithuania
                                2
                                  Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
                                3
                                  Centre for Applied Research and Development, Lithuania
                                4
                                  UAB “Inventi”, Vilnius, Lithuania


Abstract
Money laundering identification and prevention is one of the most important topics in the financial industry and financial crimes investigation. However, it is complicated by the high volume of transactions, personal data protection, and highly skilled white-collar criminals. Artificial intelligence and machine learning are already successfully used in different fintech applications, as well as in crime prevention. Unfortunately, due to confidentiality and privacy regulations, AML cases and related data are hard to obtain, and different datasets assume very different AML models. Most research uses synthetically generated datasets built on their own assumptions, which may not reflect reality. For this reason, in this research we try to improve AML models by merging different datasets with different features. We experiment with three publicly available, synthetically generated money transaction datasets and five different ML approaches: Random Forest, Generalized Linear Regression, XGBoost, Isolation Forest, and an ensemble of these methods. We use SMOTE for dataset balancing. The best model achieved 95.98% accuracy, recognizing 95.6% of legal payments and 84.4% of money laundering cases. This was achieved using an ensemble of the methods without the Generalized Linear Regression model.

                                                                       Keywords
                                                                       AML, money laundering, machine learning algorithms, Random Forest, Isolation Forest, XGBoost, Generalized Linear
                                                                       Regression, merged data


1. Introduction

Money laundering is regulated by both the government authorities of financial crimes and the banks, since it involves much larger amounts of money than fraudulent payments. Every year, between 2% and 5% of global GDP is laundered, amounting to between 715 billion and 1.87 trillion euros [1]. In 1989, the Group of Seven (G-7) established the Financial Action Task Force (FATF) as an international group to combat money laundering on a global scale. Its mandate was broadened in the early 2000s to include countering terrorism financing [2].

Money laundering poses a great threat to society as a whole, yet it is rarely studied by researchers due to the high level of data confidentiality involved. Researchers therefore rely on synthetically generated datasets. These datasets are created on different assumptions, so training on one dataset and testing on another does not achieve good results in recognizing money laundering cases. Hence, we create a merged dataset and use it to test machine learning algorithms for better money laundering prevention.

While synthetic data has numerous advantages, it can be difficult to use appropriately. It is extremely challenging to ensure that it is as reliable as real-world data. When dealing with complex datasets containing a significant number of variables, it is possible to create a synthetic dataset that does not accurately represent real-world scenarios. This can lead to inaccurate decision-making due to incorrect insight development [3].

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
paulius.savickas@card-ai.eu (P. Savickas); dovile.kuiziniene@vdu.lt (D. Kuizinienė); mantas.bugarevicius@inventi.lt (M. Bugarevičius); zilvinas@inventi.lt (Ž. Kybartas); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Literature review

Money laundering prevention is a matter for both government and financial institutions. Data in this area are highly sensitive and often difficult to access; for that reason, this problem is not widely discussed in the literature. All studies reviewed used cryptocurrency transaction data or synthetically generated datasets. The following methods were analyzed: Random Forest, Logistic Regression, Decision Tree, XGBoost, Support Vector Machine, and deep learning methods.

In the vast majority of studies reviewed, Random Forest showed the best performance compared to other methods, with accuracies of 98.06%, 99%, 90.40%, and 97.53%, and F1-scores ranging from 0.76 to 0.83 [4][5][6][7][8].

The analysis suggests that Random Forest is the most appropriate method to address the problem of money laundering; however, due to a lack of research, its effectiveness has not been validated. Furthermore, since it is unclear what assumptions the datasets were based on and whether they correspond to reality, merging several datasets allows for the most comprehensive coverage of those assumptions.

3. Methods

3.1. Machine learning methods

The selection of methods in this study was based on the literature review and best practices for money laundering prevention. Four supervised methods, namely Random Forest, Generalized Linear Regression, Support Vector Machine, and XGBoost, and one unsupervised method, namely Isolation Forest, were used in the study.

3.1.1. Generalized Linear Model

The term "linear model" usually encompasses both systematic and random components in a statistical model; however, for the purposes of this project the term was restricted to include only the systematic component:

    Y = ∑_{i=1}^{m} β_i x_i,    (1)

where the x_i are independent variables with known values, and the β_i are parameters whose values may be fixed (known) or unknown, requiring estimation. An independent variable can be quantitative, producing a single x-variate in the model; qualitative, producing a set of x-variates with values between 0 and 1; or mixed, producing a set of x-variates with values between 0 and 1.

3.1.2. Random Forest

Random Forest is a machine learning algorithm that constructs a multitude of decision trees during training. The main principle of constructing a random forest is that a classifier is formed by combining solutions from binary decision trees built using diverse subsets of the original dataset and subsets of randomly selected features from the feature set [9]. Constructing small decision trees that have only a few features takes little processing time, hence the solutions of many such trees can be combined into a single strong classifier.

3.1.3. XGBoost

XGBoost is a machine learning algorithm that implements frameworks based on Gradient Boosted Decision Trees [10]. XGBoost surpasses other machine learning algorithms by solving many data science problems faster and more accurately than its counterparts. This algorithm also has additional protection against overfitting.

3.1.4. Support Vector Machine

The aim of applying a Support Vector Machine is to find the maximum separating line (in the two-dimensional case), separating plane (in the three-dimensional case), or separating hyperplane (in the n-dimensional case, n > 3) that has the maximum distance to the nearest training data objects. For a hyperplane (which could be a line or a plane) to be considered the best, it needs to have the minimum classification error on previously unseen objects [11].

3.1.5. Isolation Forest

Let T be a node of an isolation tree. T is either an external node with no children, or an internal node with one test and exactly two daughter nodes (T_l, T_r). A test consists of an attribute q and a split value p such that the test q < p divides data points into T_l and T_r [12].

Given a sample of data X = {x_1, ..., x_n} of n instances from a d-variate distribution, to build an isolation tree (iTree) we recursively divide X by randomly selecting an attribute q and a split value p, until either (i) the tree reaches a height limit, (ii) |X| = 1, or (iii) all data in X have the same values. An iTree is a proper binary tree, where each node has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is n and the number of internal nodes is n − 1; the total number of nodes of an iTree is 2n − 1, and thus the memory requirement is bounded and grows only linearly with n.

The goal of anomaly detection is to produce a ranking that indicates the degree of anomaly. As a result, sorting data points according to their path lengths or anomaly scores is one technique to find anomalies; anomalies are the points at the top of the list.

3.2. Class balancing

Unbalanced classes lead to classification issues for machine learning algorithms. These issues are characterized by an unequal proportion of cases presented for each class of the problem. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known algorithm for dealing with this problem; its strategy is to artificially generate additional examples of the minority class using the cases' nearest neighbors. In addition, the majority-class examples are under-sampled, resulting in a more balanced collection [13].

3.3. Data normalization method

We use Z-score normalization to normalize each column in the dataset separately, so that the mean of the entire column becomes 0 and the standard deviation becomes 1. The normalization formula is [14]:

    x′ = (x − µ) / σ,    (2)

where µ is the population mean and σ is the population standard deviation.

3.4. Models evaluation

In this research, we use accuracy, sensitivity, specificity, F1-score, and AUC metrics to evaluate the results of the models, so that money laundering and legal payments can be clearly identified and the results compared with those of other studies [15]. Computing them requires a confusion matrix. These metrics are calculated as follows [16][17]:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)
    Sensitivity = TP / (TP + FN)    (4)
    Specificity = TN / (TN + FP)    (5)
    F1 = 2 × (sensitivity × specificity) / (sensitivity + specificity)    (6)
    AUC = (1/2) ∑_{i=1}^{m−1} (x_{i+1} − x_i)(y_{i+1} + y_i)    (7)

4. Data

4.1. Datasets

For money laundering prevention, three different synthetically generated datasets are used; they are compared in Table 1. Compared with the other datasets, Paysim has many more records, which could bias machine learning algorithms toward recognizing only its data. For this reason, the Paysim dataset is reduced five-fold by selecting every fifth row.

The Mahootika synthetically generated dataset [18] covers five months (February 20, 2019 – July 20, 2019) of 2,340 transactions, 60% of which are money laundering. This dataset has seven attributes. The simulation is based on the three stages of money laundering in financial transactions: 1) money placement, 2) money layering, 3) money integration.

The AMLSim dataset [19] consists of 1,048,575 transactions, of which 0.16% are money laundering cases. All of these transactions are made from 9,999 sender accounts to 9,999 receiver accounts. This dataset consists of eight attributes. It is synthetically generated using the AMLSim simulator.

The Paysim dataset [20] consists of 6,362,620 transactions, but we narrow it down to 1,272,524 transactions due to limited computing resources. Of these transactions, 0.13% are money laundering cases. All of these transactions are made from 1,272,159 sender accounts to 777,582 receiver accounts. This dataset consists of 11 attributes. It is synthetically generated using the Paysim simulator.

Due to the different assumptions made in the generated datasets, models trained on one dataset perform poorly when evaluated on the others. It is difficult to know which assumptions are closest to real-case scenarios. Therefore, it is reasonable to train machine learning models on all the datasets merged together.

Table 1
Money laundering datasets

    Name               Transactions   Fraud %   Mean transaction   Median transaction
    Mahootika [18]     2,340          60%       €2,508,583         €1,162,354
    AMLSim [19]        1,323,234      0.16%     €115,988           €157
    Paysim (1/5) [20]  1,272,524      0.13%     €179,953           €74,898

4.2. Additional attributes

There are few overlapping attributes across the three datasets; for this reason, additional attributes are created from time, action, and amount. In this way, 11 additional overlapping attributes are created (Table 2). These additional attributes, together with the transaction amount and money laundering status, are used in further research.

Table 2
Additional attributes used in modeling

    #    Additional attribute                        Mahootika   AMLSim   Paysim
    1    Action count                                ✓           ✓        ✓
    2    Minimum amount                              ✓           ✓        ✓
    3    Maximum amount                              ✓           ✓        ✓
    4    Mean amount                                 ✓           ✓        ✓
    5    Median amount                               ✓           ✓        ✓
    6    Coefficient of variation, amount            ✓           ✓        ✓
    7    Previous transactions average               ✓           ✓        ✓
    8    Same amount count                           ✓           ✓        ✓
    9    Time difference                             ✓           ✓        ✓
    10   Coefficient of variation, time difference   ✓           ✓        ✓
    11   Same amount time difference                 ✓           ✓        ✓
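Attributes of the kind listed in Table 2 can be derived once each transaction log is reduced to a common schema. The following is a minimal pandas sketch, assuming a hypothetical common schema with columns `account`, `timestamp`, `action`, and `amount` (the real datasets use different column names), and computing per-account aggregates for a subset of the attributes:

```python
import pandas as pd

def account_attributes(tx: pd.DataFrame) -> pd.DataFrame:
    """Derive per-account attributes similar to Table 2 from a
    transaction log with columns: account, timestamp, action, amount."""
    tx = tx.sort_values(["account", "timestamp"])
    # Seconds between consecutive transactions of the same account.
    tx["time_diff"] = tx.groupby("account")["timestamp"].diff().dt.total_seconds()
    grouped = tx.groupby("account")
    out = pd.DataFrame({
        "action_count": grouped["action"].count(),
        "min_amount": grouped["amount"].min(),
        "max_amount": grouped["amount"].max(),
        "mean_amount": grouped["amount"].mean(),
        "median_amount": grouped["amount"].median(),
        # Coefficient of variation = standard deviation / mean.
        "cv_amount": grouped["amount"].std() / grouped["amount"].mean(),
        # How often the account's most frequent amount repeats.
        "same_amount_count": grouped["amount"].agg(
            lambda s: s.value_counts().iloc[0]
        ),
        "mean_time_diff": grouped["time_diff"].mean(),
        "cv_time_diff": grouped["time_diff"].std() / grouped["time_diff"].mean(),
    })
    return out.reset_index()

example = pd.DataFrame({
    "account": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2019-02-20 10:00", "2019-02-20 11:00", "2019-02-21 10:00",
         "2019-03-01 09:00", "2019-03-02 09:00"]),
    "action": ["transfer"] * 5,
    "amount": [100.0, 100.0, 250.0, 40.0, 60.0],
})
features = account_attributes(example)
```

This is only a sketch under the stated schema assumption; the per-transaction variants of the attributes (e.g. previous transactions average at the moment of each payment) would use expanding or rolling windows instead of whole-account aggregates.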
Table 3
Testing results based on merged training datasets. All models are trained on the balanced merged 70% training split of all datasets (unbalanced for the IF model) and tested on the 30% test splits.

                                      Single methods                      Ensemble
    Test              Metric        RF      LR      XGBoost   IF      ALL     LR out
    30% Mahootika     Accuracy      62.1%   51.7%   67.7%     76.4%   85.2%   62.1%
                      Sensitivity   0.7%    72%     15.3%     76.5%   65.7%   0.8%
                      Specificity   100%    39.2%   100%      76.3%   97.2%   100%
                      F1            1.5%    53.2%   26.5%     71.2%   77.2%   1.5%
                      AUC           50.4%   55.6%   57.7%     76.4%   81.5%   50.4%
    30% AMLSim        Accuracy      97.3%   58.1%   99.9%     75.5%   99.1%   98.8%
                      Sensitivity   97.3%   58.1%   99.9%     75.6%   99.2%   98.8%
                      Specificity   100%    83.4%   100%      12.1%   84%     100%
                      F1            98.6%   73.5%   100%      86%     99.6%   99.8%
                      AUC           98.6%   70.7%   100%      43.8%   91.4%   99.4%
    30% Paysim        Accuracy      92.6%   94.2%   92.6%     99.2%   94.3%   99.9%
                      Sensitivity   92.6%   94.3%   92.6%     99.3%   94.4%   93.2%
                      Specificity   52.6%   48%     52%       19.8%   48%     51.9%
                      F1            96.1%   97%     96.1%     99.6%   97.1%   96.4%
                      AUC           72.6%   71%     72.3%     59.5%   71.2%   93%
    30% all datasets  Accuracy      94.9%   75.8%   96.3%     87.1%   96.7%   96%
                      Sensitivity   95%     75.8%   96.3%     87.2%   96.8%   95.6%
                      Specificity   84.6%   59.8%   84.4%     32.1%   75.8%   84.4%
                      F1            97.4%   86.2%   98.1%     93.1%   98.3%   97.9%
                      AUC           89.8%   68%     90.4%     59.7%   86.3%   90%
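The metrics reported in Table 3 follow the definitions of Section 3.4, Eqs. (3)–(7). A minimal sketch of computing them from confusion-matrix counts (note that Eq. (6) is the harmonic mean of sensitivity and specificity, not the usual precision/recall F1):

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity, and the harmonic-mean F1
    as defined in Eqs. (3)-(6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

def trapezoid_auc(xs: list, ys: list) -> float:
    """Eq. (7): trapezoidal area under a curve given as (x, y) points,
    e.g. an ROC curve with xs = false positive rate, ys = true positive rate."""
    return 0.5 * sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) for i in range(len(xs) - 1)
    )

m = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
auc = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 0.9, 1.0])
```

The trapezoid sum reproduces the usual numeric AUC when the (x, y) points trace an ROC curve sorted by increasing false positive rate.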


4.3. Data pre-processing

All three datasets were generated synthetically using different assumptions, so it is difficult to determine which are closest to real-world scenarios. All attributes are normalized separately using Z-score normalization, to maintain the uniqueness of the datasets and to cover their assumptions. Each dataset is divided into two parts, 70% for training and 30% for testing, and then all training parts and all testing parts are merged separately into two datasets. The SMOTE approach is used to balance the training part, except for the Isolation Forest model. Only then is machine learning method testing performed.

5. Results

To cover as many assumptions as possible, all datasets are merged into one, so that the models can be applied to real data with the best possible results.

The RF, LR, XGBoost, and IF models are trained on the balanced merged training dataset. These trained models are tested on the merged test data of all datasets. To improve the results, an Ensemble is created. Two variants of the ensemble are calculated:
    1. Ensemble with all models.
    2. Ensemble without the LR model.

All the results obtained for the individual and combined datasets are shown in Table 3.

5.1. Mahootika

After testing the trained models on the Mahootika dataset, the best results were obtained using the Ensemble of all models, with an accuracy of 85.2% and an AUC of 81.5%.

5.2. AMLSim

Testing on the AMLSim dataset has shown even better results. XGBoost achieved 99.9% accuracy and 100% AUC. The Ensemble without the LR model showed very similar results, with an accuracy of 98.8% and an AUC of 99.4%.

5.3. Paysim

After testing the models on the Paysim test part, the best result is achieved by the Ensemble without the LR model, with an accuracy of 99.87%; it detected 51.9% of all money laundering cases.

5.4. Merged dataset

Finally, the models were applied to the merged test parts of all datasets. The highest accuracy, 96.7%, is achieved by the Ensemble of all models, while XGBoost achieved 96.3% accuracy and an AUC of 90.4%.

In conclusion, combining the datasets and using the Ensemble approach is suitable due to better model performance. The Ensemble without the LR model achieved an accuracy of 95.98%, with 84.4% of all money laundering cases correctly identified and only 4.4% of all legal payments wrongly identified. The XGBoost model's results were even better: accuracy of 96.3%, with 96.3% of legitimate payments accurately recognized and the same money laundering detection rate as the Ensemble without the LR model.

6. Conclusion

Money laundering is not commonly addressed in the literature, since it is strictly regulated by government entities and financial institutions. We use three publicly available, synthetically generated datasets in this study, each with a different set of assumptions, ranging from 2,340 to 1,323,234 transactions with 0.13% to 60% money laundering cases. A total of 11 additional attributes are generated for each dataset for further research. Each attribute of the datasets is normalized by the Z-score method. Then all three datasets are combined into one. The combined dataset is divided into training and testing parts and, where necessary, the data are balanced using the SMOTE method. Finally, the results of all the models are combined in the Ensemble method, in which the majority of models makes the decision for each instance.

After testing these models, we obtained AUC values from 72.6% to 100%, and both money laundering and legal payments were well identified. To improve the models, we employ the Ensemble method, in which each method's vote is weighted equally and the class of an instance is determined by a majority of the classifiers' votes. The accuracy of this method ranged from 85.2% to 99.1%, and the AUC from 71.2% to 91.4%.

Moreover, one modification of the Ensemble method was made by excluding the LR model. In this case, we achieved 95.98% accuracy, and the model recognized 95.6% of legal payments and 84.4% of money laundering cases.

Machine-learning-based methods have been adapted to address the problem of money laundering prevention. The results show that these methods detect potential money laundering cases and reduce the number of payments to be reviewed. However, it is not known which dataset generation assumptions would be closest to our market; for this reason, it is recommended that the data be verified by experts at the implementation stage.

Further research should include deep learning methods and other class balancing techniques.

References

[1] Europol, Crime area: money laundering, 2021. URL: https://www.europol.europa.eu/crime-areas-and-statistics/crime-areas/economic-crime/money-laundering.
[2] Aniruddha, Financial Action Task Force (FATF), 1989. URL: https://www.fatf-gafi.org/about/historyofthefatf/.
[3] I. Kot, The pros and cons of synthetic data, 2021. URL: https://www.dataversity.net/the-pros-and-cons-of-synthetic-data/#.
[4] I. Alarab, S. Prakoonwit, M. I. Nacer, Comparative analysis using supervised learning methods for anti-money laundering in bitcoin, in: Proceedings of the 2020 5th International Conference on Machine Learning Technologies, ACM, 2020, pp. 11–17. URL: https://dl.acm.org/doi/10.1145/3409073.3409078. doi:10.1145/3409073.3409078.
[5] O. Raiter, Applying supervised machine learning algorithms for fraud detection in anti-money laundering (2021) 13.
[6] E. Badal-Valero, J. A. Alvarez-Jareño, J. M. Pavía, Combining Benford's law and machine learning to detect money laundering. An actual Spanish court case 282 (2018) 24–34. URL: https://linkinghub.elsevier.com/retrieve/pii/S0379073817304644. doi:10.1016/j.forsciint.2017.11.008.
[7] V. Huyen, Machine learning in money laundering detection (2020).
[8] J. Lorenz, M. I. Silva, D. Aparício, J. T. Ascensão, P. Bizarro, Machine learning methods to detect money laundering in the bitcoin blockchain in the presence of label scarcity (2021). URL: http://arxiv.org/abs/2005.14635. arXiv:2005.14635.
[9] J. Le, Decision trees in R, DataCamp (2018). URL: https://www.datacamp.com/community/tutorials/decision-trees-R.
[10] xgboost developers, xgboost Release 1.2.0-SNAPSHOT, 2020. URL: https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf.
[11] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
[12] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413–422. URL: http://ieeexplore.ieee.org/document/4781136/. doi:10.1109/ICDM.2008.17.
[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique 16 (2002) 321–357. URL: https://www.jair.org/index.php/jair/article/view/10302. doi:10.1613/jair.953.
[14] S. K. Patro, K. K. Sahu, Normalization: A preprocessing stage (2015) 20–22. URL: http://www.iarjset.com/upload/2015/march-15/IARJSET%205.pdf. doi:10.17148/IARJSET.2015.2305.
[15] J. T. Townsend, Theoretical analysis of an alphabetic confusion matrix 9 (1971) 40–50. URL: http://link.springer.com/10.3758/BF03213026. doi:10.3758/BF03213026.
[16] R. Kohavi, F. Provost, Glossary of terms 30 (1998) 271–274. URL: http://link.springer.com/10.1023/A:1017181826899. doi:10.1023/A:1017181826899.
[17] Y. Sasaki, The truth of the F-measure (2007) 5.
[18] M. Mahootika, Money laundering data, Kaggle dataset. URL: https://www.kaggle.com/maryam1212/money-laundering-data/metadata. Accessed: February 27, 2022.
[19] Money laundering data, AMLSim dataset, Kaggle. URL: https://www.kaggle.com/anshankul/ibm-amlsim-example-dataset. Accessed: February 27, 2022.
[20] AML detection, PaySim dataset, Kaggle. URL: https://www.kaggle.com/x09072993/aml-detection. Accessed: February 27, 2022.