                                Can merged datasets help in training money laundering
                                detection models?
                                Paulius Savickas1,3, Dovilė Kuizinienė2,3, Mantas Bugarevičius4, Žilvinas Kybartas4 and
                                Tomas Krilavičius2,3
                                1
                                  Vytautas Magnus University, Faculty of Economics and Management, K. Donelaičio street 52, LT-44244 Kaunas, Lithuania
                                2
                                  Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
                                3
                                  Centre for Applied Research and Development, Lithuania
                                4
                                  UAB “Inventi”, Vilnius, Lithuania


Abstract
Money laundering identification and prevention is one of the most important topics in the financial industry and financial crimes investigation. However, it is complicated by the high volume of transactions, personal data protection, and highly skilled white-collar criminals. Artificial intelligence and machine learning are already successfully used in different fintech applications, as well as in crime prevention. Unfortunately, due to confidentiality and privacy regulations, AML cases and related data are hard to obtain, and different datasets assume very different AML models. Most research uses synthetically generated datasets built on their own assumptions, which may not reflect reality. For this reason, in this research we try to improve AML models by merging different datasets with different features. We experiment with three publicly available, synthetically generated money transaction datasets and five different ML approaches: Random Forest, Generalized Linear Regression, XGBoost, Isolation Forest, and an ensemble of these methods. We use SMOTE for dataset balancing. The best model achieved 95.98% accuracy, recognizing 95.6% of legal payments and 84.4% of money laundering cases. This was achieved using an ensemble of the methods without the Generalized Linear Regression model.

                                                                       Keywords
                                                                       AML, money laundering, machine learning algorithms, Random Forest, Isolation Forest, XGBoost, Generalized Linear
                                                                       Regression, merged data


1. Introduction

Money laundering is regulated by both the government authorities of financial crimes and the banks, since it involves much larger amounts of money than fraudulent payments. Every year, between 2% and 5% of global GDP is laundered, amounting to between 715 billion and 1.87 trillion euros [1]. In 1989, the Group of Seven (G-7) established the Financial Action Task Force (FATF) as an international group to combat money laundering on a global scale. Its mandate was broadened in the early 2000s to include countering terrorism financing [2].

Money laundering poses a great threat to society as a whole, yet it is rarely studied by researchers due to the high level of data confidentiality involved. Researchers therefore rely on synthetically generated datasets. These datasets are created on different assumptions, so training on one dataset and testing on another does not achieve good results in recognizing money laundering cases. Hence, we create a merged dataset and use it to test machine learning algorithms for better money laundering prevention.

While synthetic data has numerous advantages, it can be difficult to use appropriately. It is extremely challenging to ensure that it is as reliable as real-world data. When dealing with complex datasets containing a significant number of variables, it is possible to create a synthetic dataset that does not accurately represent real-world scenarios. This can lead to inaccurate decision-making due to incorrect insight development [3].

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
paulius.savickas@card-ai.eu (P. Savickas); dovile.kuiziniene@vdu.lt (D. Kuizinienė); mantas.bugarevicius@inventi.lt (M. Bugarevičius); zilvinas@inventi.lt (Ž. Kybartas); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Literature review

Money laundering prevention is a matter for both government and financial institutions. Data in this area are highly sensitive and often difficult to access; for that reason, this problem is not widely discussed in the literature. All studies reviewed used cryptocurrency transaction data or synthetically generated datasets. The following methods were analyzed: Random Forest, Logistic Regression, Decision Tree, XGBoost, Support Vector Machine, and deep learning methods.

In the vast majority of studies reviewed, Random Forest showed the best performance compared to other methods, with accuracies of 98.06%, 99%, 90.40%, and 97.53%, and F1-scores ranging from 0.76 to 0.83 [4][5][6][7][8].

The analysis suggests that Random Forest is the most appropriate method to address the problem of money laundering; however, due to a lack of research, its effectiveness has not been validated. Furthermore, since it is unclear what assumptions the datasets were based on and whether they correspond to reality, merging several datasets allows for the most comprehensive coverage of those assumptions.

3. Methods

3.1. Machine learning methods

The selection of methods in this study was based on the literature review and best practices for money laundering prevention. Four supervised methods, namely Random Forest, Generalized Linear Regression, Support Vector Machine, and XGBoost, and one unsupervised method, namely Isolation Forest, were used in the study.

3.1.1. Generalized Linear Model

The term "linear model" usually encompasses both systematic and random components in a statistical model; however, for the purposes of this project the term was restricted to include only the systematic component:

    Y = ∑_{i=1}^{m} β_i x_i,    (1)

where the x_i are independent variables with known values, and the β_i are parameters whose values may be fixed (known) or unknown, requiring estimation. An independent variable can be quantitative, producing a single x-variate in the model; qualitative, producing a set of x-variates with values between 0 and 1; or mixed, producing a set of x-variates with values between 0 and 1.

3.1.2. Random Forest

Random Forest is a machine learning algorithm that constructs a multitude of decision trees during training. The main principle of constructing a random forest is that a classifier is formed by combining solutions from binary decision trees built using diverse subsets of the original dataset and subsets of randomly selected features from the feature set [9]. Constructing small decision trees that have only a few features takes little processing time, hence the solutions of many such trees can be combined into a single strong classifier.

3.1.3. XGBoost

XGBoost is a machine learning algorithm that implements frameworks based on Gradient Boosted Decision Trees [10]. XGBoost surpasses other machine learning algorithms by solving many data science problems faster and more accurately than its counterparts. This algorithm also has additional protection against overfitting.

3.1.4. Support Vector Machine

The aim of applying a Support Vector Machine is to find the maximum separating line (in the two-dimensional case), separating plane (in the three-dimensional case), or separating hyperplane (in the n-dimensional case, n > 3) that has the maximum distance to the nearest training data objects. For a hyperplane (which could be a line or a plane) to be considered the best, it needs to have the minimum classification error on previously unseen objects [11].

3.1.5. Isolation Forest

Let T be a node of an isolation tree. T is either an external node with no children, or an internal node with one test and exactly two daughter nodes (T_l, T_r). A test consists of an attribute q and a split value p such that the test q < p divides data points into T_l and T_r [12].

Given a sample of data X = {x_1, ..., x_n} of n instances from a d-variate distribution, to build an isolation tree (iTree) we recursively divide X by randomly selecting an attribute q and a split value p, until either (i) the tree reaches a height limit, (ii) |X| = 1, or (iii) all data in X have the same values. An iTree is a proper binary tree, where each node has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is n and the number of internal nodes is n − 1; the total number of nodes of an iTree is 2n − 1, and thus the memory requirement is bounded and grows only linearly with n.

The goal of anomaly detection is to produce a ranking that indicates the degree of anomaly. As a result, sorting data points according to their path lengths or anomaly scores is one technique to find anomalies; anomalies are the points at the top of the list.

3.2. Class balancing

Unbalanced classes lead to classification issues for machine learning algorithms. These issues are characterized by an unequal proportion of cases presented for each class of the problem. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known algorithm for dealing with this problem; its strategy is to artificially generate additional examples of the minority class using the cases' nearest neighbors. In addition, the majority-class examples are under-sampled, resulting in a more balanced collection [13].

3.3. Data normalization method

We use Z-score normalization to normalize each column in the dataset separately, so that the mean of the entire column becomes 0 and the standard deviation becomes 1. The normalization formula is [14]:

    x′ = (x − µ) / σ,    (2)

where µ is the population mean and σ is the population standard deviation.

3.4. Models evaluation

In this research, we use accuracy, sensitivity, specificity, F1-score, and AUC metrics to evaluate the results of the models, so that money laundering and legal payments can be clearly identified and the results compared with those of other studies [15]. Computing them requires a confusion matrix. These metrics are calculated as follows [16][17]:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)
    Sensitivity = TP / (TP + FN)    (4)
    Specificity = TN / (TN + FP)    (5)
    F1 = 2 × (sensitivity × specificity) / (sensitivity + specificity)    (6)
    AUC = (1/2) ∑_{i=1}^{m−1} (x_{i+1} − x_i)(y_{i+1} + y_i)    (7)

4. Data

4.1. Datasets

For money laundering prevention, three different synthetically generated datasets are used; they are compared in Table 1. Compared with the other datasets, Paysim has many more records, which could bias machine learning algorithms toward recognizing only its data. For this reason, the Paysim dataset is reduced five-fold by selecting every fifth row.

The Mahootika synthetically generated dataset [18] covers five months (February 20, 2019 – July 20, 2019) of 2,340 transactions, 60% of which are money laundering. This dataset has seven attributes. The simulation is based on the three stages of money laundering in financial transactions: 1) money placement, 2) money layering, 3) money integration.

The AMLSim dataset [19] consists of 1,048,575 transactions, of which 0.16% are money laundering cases. All of these transactions are made from 9,999 sender accounts to 9,999 receiver accounts. This dataset consists of eight attributes. It is synthetically generated using the AMLSim simulator.

The Paysim dataset [20] consists of 6,362,620 transactions, but we narrow it down to 1,272,524 transactions due to limited computing resources. Of these transactions, 0.13% are money laundering cases. All of these transactions are made from 1,272,159 sender accounts to 777,582 receiver accounts. This dataset consists of 11 attributes. It is synthetically generated using the Paysim simulator.

Due to the different assumptions made in the generated datasets, models trained on one dataset perform poorly when evaluated on the others. It is difficult to know which assumptions are closest to real-case scenarios. Therefore, it is reasonable to train machine learning models on all the datasets merged together.

Table 1
Money laundering datasets

    Name               Transactions   Fraud %   Mean transaction   Median transaction
    Mahootika [18]     2,340          60%       €2,508,583         €1,162,354
    AMLSim [19]        1,323,234      0.16%     €115,988           €157
    Paysim (1/5) [20]  1,272,524      0.13%     €179,953           €74,898

4.2. Additional attributes

There are few overlapping attributes across the three datasets; for this reason, additional attributes are created from time, action, and amount. In this way, 11 additional overlapping attributes are created (Table 2). These additional attributes, together with the transaction amount and money laundering status, are used in further research.

Table 2
Additional attributes used in modeling

    #    Additional attribute                        Mahootika   AMLSim   Paysim
    1    Action count                                ✓           ✓        ✓
    2    Minimum amount                              ✓           ✓        ✓
    3    Maximum amount                              ✓           ✓        ✓
    4    Mean amount                                 ✓           ✓        ✓
    5    Median amount                               ✓           ✓        ✓
    6    Coefficient of variation, amount            ✓           ✓        ✓
    7    Previous transactions average               ✓           ✓        ✓
    8    Same amount count                           ✓           ✓        ✓
    9    Time difference                             ✓           ✓        ✓
    10   Coefficient of variation, time difference   ✓           ✓        ✓
    11   Same amount time difference                 ✓           ✓        ✓
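Attributes of the kind listed in Table 2 can be derived once each transaction log is reduced to a common schema. The following is a minimal pandas sketch, assuming a hypothetical common schema with columns `account`, `timestamp`, `action`, and `amount` (the real datasets use different column names), and computing per-account aggregates for a subset of the attributes:

```python
import pandas as pd

def account_attributes(tx: pd.DataFrame) -> pd.DataFrame:
    """Derive per-account attributes similar to Table 2 from a
    transaction log with columns: account, timestamp, action, amount."""
    tx = tx.sort_values(["account", "timestamp"])
    # Seconds between consecutive transactions of the same account.
    tx["time_diff"] = tx.groupby("account")["timestamp"].diff().dt.total_seconds()
    grouped = tx.groupby("account")
    out = pd.DataFrame({
        "action_count": grouped["action"].count(),
        "min_amount": grouped["amount"].min(),
        "max_amount": grouped["amount"].max(),
        "mean_amount": grouped["amount"].mean(),
        "median_amount": grouped["amount"].median(),
        # Coefficient of variation = standard deviation / mean.
        "cv_amount": grouped["amount"].std() / grouped["amount"].mean(),
        # How often the account's most frequent amount repeats.
        "same_amount_count": grouped["amount"].agg(
            lambda s: s.value_counts().iloc[0]
        ),
        "mean_time_diff": grouped["time_diff"].mean(),
        "cv_time_diff": grouped["time_diff"].std() / grouped["time_diff"].mean(),
    })
    return out.reset_index()

example = pd.DataFrame({
    "account": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2019-02-20 10:00", "2019-02-20 11:00", "2019-02-21 10:00",
         "2019-03-01 09:00", "2019-03-02 09:00"]),
    "action": ["transfer"] * 5,
    "amount": [100.0, 100.0, 250.0, 40.0, 60.0],
})
features = account_attributes(example)
```

This is only a sketch under the stated schema assumption; the per-transaction variants of the attributes (e.g. previous transactions average at the moment of each payment) would use expanding or rolling windows instead of whole-account aggregates.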
Table 3
Testing results based on merged training datasets. All models are trained on the balanced merged 70% training split of all datasets (unbalanced for the IF model) and tested on the 30% test splits.

                                      Single methods                      Ensemble
    Test              Metric        RF      LR      XGBoost   IF      ALL     LR out
    30% Mahootika     Accuracy      62.1%   51.7%   67.7%     76.4%   85.2%   62.1%
                      Sensitivity   0.7%    72%     15.3%     76.5%   65.7%   0.8%
                      Specificity   100%    39.2%   100%      76.3%   97.2%   100%
                      F1            1.5%    53.2%   26.5%     71.2%   77.2%   1.5%
                      AUC           50.4%   55.6%   57.7%     76.4%   81.5%   50.4%
    30% AMLSim        Accuracy      97.3%   58.1%   99.9%     75.5%   99.1%   98.8%
                      Sensitivity   97.3%   58.1%   99.9%     75.6%   99.2%   98.8%
                      Specificity   100%    83.4%   100%      12.1%   84%     100%
                      F1            98.6%   73.5%   100%      86%     99.6%   99.8%
                      AUC           98.6%   70.7%   100%      43.8%   91.4%   99.4%
    30% Paysim        Accuracy      92.6%   94.2%   92.6%     99.2%   94.3%   99.9%
                      Sensitivity   92.6%   94.3%   92.6%     99.3%   94.4%   93.2%
                      Specificity   52.6%   48%     52%       19.8%   48%     51.9%
                      F1            96.1%   97%     96.1%     99.6%   97.1%   96.4%
                      AUC           72.6%   71%     72.3%     59.5%   71.2%   93%
    30% all datasets  Accuracy      94.9%   75.8%   96.3%     87.1%   96.7%   96%
                      Sensitivity   95%     75.8%   96.3%     87.2%   96.8%   95.6%
                      Specificity   84.6%   59.8%   84.4%     32.1%   75.8%   84.4%
                      F1            97.4%   86.2%   98.1%     93.1%   98.3%   97.9%
                      AUC           89.8%   68%     90.4%     59.7%   86.3%   90%
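The metrics reported in Table 3 follow the definitions of Section 3.4, Eqs. (3)–(7). A minimal sketch of computing them from confusion-matrix counts (note that Eq. (6) is the harmonic mean of sensitivity and specificity, not the usual precision/recall F1):

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity, and the harmonic-mean F1
    as defined in Eqs. (3)-(6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

def trapezoid_auc(xs: list, ys: list) -> float:
    """Eq. (7): trapezoidal area under a curve given as (x, y) points,
    e.g. an ROC curve with xs = false positive rate, ys = true positive rate."""
    return 0.5 * sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) for i in range(len(xs) - 1)
    )

m = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
auc = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 0.9, 1.0])
```

The trapezoid sum reproduces the usual numeric AUC when the (x, y) points trace an ROC curve sorted by increasing false positive rate.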


4.3. Data pre-processing

All three datasets were generated synthetically using different assumptions, so it is difficult to determine which are closest to real-world scenarios. All attributes are normalized separately using Z-score normalization, to maintain the uniqueness of the datasets and to cover their assumptions. Each dataset is divided into two parts, 70% for training and 30% for testing, and then all training parts and all testing parts are merged separately into two datasets. The SMOTE approach is used to balance the training part, except for the Isolation Forest model. Only then is machine learning method testing performed.

5. Results

To cover as many assumptions as possible, all datasets are merged into one, so that the models can be applied to real data with the best possible results.

The RF, LR, XGBoost, and IF models are trained on the balanced merged training dataset. These trained models are tested on the merged test data of all datasets. To improve the results, an Ensemble is created. Two variants of the ensemble are calculated:
    1. Ensemble with all models.
    2. Ensemble without the LR model.

All the results obtained for the individual and combined datasets are shown in Table 3.

5.1. Mahootika

After testing the trained models on the Mahootika dataset, the best results were obtained using the Ensemble of all models, with an accuracy of 85.2% and an AUC of 81.5%.

5.2. AMLSim

Testing on the AMLSim dataset has shown even better results. XGBoost achieved 99.9% accuracy and 100% AUC. The Ensemble without the LR model showed very similar results, with an accuracy of 98.8% and an AUC of 99.4%.

5.3. Paysim

After testing the models on the Paysim test part, the best result is achieved by the Ensemble without the LR model, with an accuracy of 99.87%; it detected 51.9% of all money laundering cases.

5.4. Merged dataset

Finally, the models were applied to the merged test parts of all datasets. The highest accuracy, 96.7%, is achieved by the Ensemble of all models, while XGBoost achieved 96.3% accuracy and an AUC of 90.4%.

In conclusion, combining the datasets and using the Ensemble approach is suitable due to better model performance. The Ensemble without the LR model achieved an accuracy of 95.98%, with 84.4% of all money laundering cases correctly identified and only 4.4% of all legal payments wrongly identified. The XGBoost model's results were even better: accuracy of 96.3%, with 96.3% of legitimate payments accurately recognized and the same money laundering detection rate as the Ensemble without the LR model.

6. Conclusion

Money laundering is not commonly addressed in the literature, since it is strictly regulated by government entities and financial institutions. We use three publicly available, synthetically generated datasets in this study, each with a different set of assumptions, ranging from 2,340 to 1,323,234 transactions with 0.13% to 60% money laundering cases. A total of 11 additional attributes are generated for each dataset for further research. Each attribute of the datasets is normalized by the Z-score method. Then all three datasets are combined into one. The combined dataset is divided into training and testing parts and, where necessary, the data are balanced using the SMOTE method. Finally, the results of all the models are combined in the Ensemble method, in which the majority of models makes the decision for each instance.

After testing these models, we obtained AUC values from 72.6% to 100%, and both money laundering and legal payments were well identified. To improve the models, we employ the Ensemble method, in which each method's vote is weighted equally and the class of an instance is determined by a majority of the classifiers' votes. The accuracy of this method ranged from 85.2% to 99.1%, and the AUC from 71.2% to 91.4%.

Moreover, one modification of the Ensemble method was made by excluding the LR model. In this case, we achieved 95.98% accuracy, and the model recognized 95.6% of legal payments and 84.4% of money laundering cases.

Machine-learning-based methods have been adapted to address the problem of money laundering prevention. The results show that these methods detect potential money laundering cases and reduce the number of payments to be reviewed. However, it is not known which dataset generation assumptions would be closest to our market; for this reason, it is recommended that the data be verified by experts at the implementation stage.

Further research should include deep learning methods and other class balancing techniques.

References

[1] Europol, Crime area: money laundering, 2021. URL: https://www.europol.europa.eu/crime-areas-and-statistics/crime-areas/economic-crime/money-laundering.
[2] Aniruddha, Financial Action Task Force (FATF), 1989. URL: https://www.fatf-gafi.org/about/historyofthefatf/.
[3] I. Kot, The pros and cons of synthetic data, 2021. URL: https://www.dataversity.net/the-pros-and-cons-of-synthetic-data/#.
[4] I. Alarab, S. Prakoonwit, M. I. Nacer, Comparative analysis using supervised learning methods for anti-money laundering in bitcoin, in: Proceedings of the 2020 5th International Conference on Machine Learning Technologies, ACM, 2020, pp. 11–17. URL: https://dl.acm.org/doi/10.1145/3409073.3409078. doi:10.1145/3409073.3409078.
[5] O. Raiter, Applying supervised machine learning algorithms for fraud detection in anti-money laundering (2021) 13.
[6] E. Badal-Valero, J. A. Alvarez-Jareño, J. M. Pavía, Combining Benford's law and machine learning to detect money laundering. An actual Spanish court case 282 (2018) 24–34. URL: https://linkinghub.elsevier.com/retrieve/pii/S0379073817304644. doi:10.1016/j.forsciint.2017.11.008.
[7] V. Huyen, Machine learning in money laundering detection (2020).
[8] J. Lorenz, M. I. Silva, D. Aparício, J. T. Ascensão, P. Bizarro, Machine learning methods to detect money laundering in the bitcoin blockchain in the presence of label scarcity (2021). URL: http://arxiv.org/abs/2005.14635. arXiv:2005.14635.
[9] J. Le, Decision trees in R, DataCamp (2018). URL: https://www.datacamp.com/community/tutorials/decision-trees-R.
[10] xgboost developers, xgboost Release 1.2.0-SNAPSHOT, 2020. URL: https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf.
[11] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
[12] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413–422. URL: http://ieeexplore.ieee.org/document/4781136/. doi:10.1109/ICDM.2008.17.
[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique 16 (2002) 321–357. URL: https://www.jair.org/index.php/jair/article/view/10302. doi:10.1613/jair.953.
[14] S. K. Patro, K. K. Sahu, Normalization: A preprocessing stage (2015) 20–22. URL: http://www.iarjset.com/upload/2015/march-15/IARJSET%205.pdf. doi:10.17148/IARJSET.2015.2305.
[15] J. T. Townsend, Theoretical analysis of an alphabetic confusion matrix 9 (1971) 40–50. URL: http://link.springer.com/10.3758/BF03213026. doi:10.3758/BF03213026.
[16] R. Kohavi, F. Provost, Glossary of terms 30 (1998) 271–274. URL: http://link.springer.com/10.1023/A:1017181826899. doi:10.1023/A:1017181826899.
[17] Y. Sasaki, The truth of the F-measure (2007) 5.
[18] M. Mahootika, Money laundering data, Kaggle dataset. URL: https://www.kaggle.com/maryam1212/money-laundering-data/metadata. Accessed: February 27, 2022.
[19] Money laundering data, AMLSim dataset, Kaggle. URL: https://www.kaggle.com/anshankul/ibm-amlsim-example-dataset. Accessed: February 27, 2022.
[20] AML detection, PaySim dataset, Kaggle. URL: https://www.kaggle.com/x09072993/aml-detection. Accessed: February 27, 2022.