Limitations and Applicability of GANs in Banking Domain

Anubha Pandey 1, Deepak Bhatt 2, and Tanmoy Bhowmik 3

1 Mastercard, India, email: Anubha.Pandey@mastercard.com
2 Mastercard, India, email: Deepak.Bhatt@mastercard.com
3 Mastercard, India, email: Tanmoy.Bhowmik@mastercard.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Threats due to payment-related frauds are a primary concern for financial institutions (FIs), often leading to huge losses and impacting consumer experience. To combat emerging frauds and improve the system's robustness, FIs need an efficient system to detect fraud while authorizing payments. The biggest challenge in developing a fraud detection system is the high degree of class imbalance between fraudulent and legitimate transactions. Recently, Generative Adversarial Networks (GANs) have been employed as an oversampling technique to augment the dataset with synthetic minority samples. In this paper, we present a systematic study of training GANs for synthetic fraud generation, demonstrating improved classifier performance in detecting fraud. GANs are trained in various settings, including the standard min-max objective with and without an auxiliary loss that discriminates synthetic and real fraud from non-fraud samples; the auxiliary loss is obtained using either a contrastive loss or a triplet loss. The quality of a trained GAN is estimated by the lift in classifier performance when the training set is augmented with synthetic fraud. Further, we study the effect of Discriminator Rejection Sampling (DRS) on the selection of the synthetic samples used for data augmentation. The settings proposed in this study are compared on a publicly available credit-card dataset and show an absolute improvement of up to 6% in Recall and 3% in Precision. We hope this paper advances the applicability of GANs with practical insight into the research done on this topic so far and opens doors to interesting future research directions.
1 INTRODUCTION

The credit card has become a ubiquitous method of online payment. Consequently, the rise of increasingly sophisticated fraudulent transactions is alarming. Fraud affects both users and businesses, resulting in financial loss and eroded customer trust. Banks and fintech companies need an efficient system to monitor the massive volume of transaction logs and detect frauds [6, 27]. However, such a system should not decline legitimate transactions, as this degrades the consumer experience.

The commonly used pipeline for a fraud detection system employs a binary classifier to distinguish fraudulent transactions from legitimate ones [26, 12, 36]. Fraudulent transactions are rare; they represent a tiny fraction of activity within an organization, resulting in class imbalance. The class imbalance biases binary classifiers towards the majority class and hence makes fraud detection a challenging problem [23, 22, 7]. A similarly high degree of class imbalance is observed in a variety of real-world applications such as medical diagnosis, information retrieval systems, and bioinformatics [31, 1, 34, 17, 21, 38, 39].

Several techniques exist for class-imbalance learning [35, 24, 15, 36]. [30] presents a comparative study of several supervised and unsupervised machine learning algorithms for handling class imbalance in credit card fraud detection. One solution to the class imbalance problem is to re-balance the training set used by the binary classifier [2, 11, 21]. Several oversampling techniques have proved effective in handling class imbalance; the most commonly used methods are variants of SMOTE (Synthetic Minority Oversampling Technique) [9, 20, 8]. SMOTE generates samples along the line between two samples of the minority class. However, such methods interpolate between existing samples in the dataset and fail to capture the minority class distribution; hence they cannot help detect novel fraudulent transactions.

Recently, Generative Adversarial Networks (GANs) [16, 29] have received a lot of attention from the credit card fraud detection research community. Several works [14, 33, 5, 40] have shown the efficacy of GANs for augmenting the dataset with synthetic minority (fraud) samples. However, mode collapse is a common phenomenon with GANs: the generator produces only a limited variety of samples and hence fails to capture the whole data distribution. To overcome mode collapse, researchers [33, 5] have used different GAN architectures such as WGAN [3], Least Squares GAN [28], and Relaxed WGAN [19] to augment the dataset and have shown an improvement in the classifier's performance. On the other hand, [40] trains a GAN-based architecture to generate complementary samples of the majority class (legitimate transactions), combining two WGANs and two autoencoders in a three-phase training process for fraud detection.

In this paper, we conduct a comprehensive study of several existing techniques for training GANs in the fraud detection scenario and highlight their merits and demerits. We report experiments on a conditional WGAN-GP that generates fraudulent data conditioned either on class labels for fraud samples obtained from k-means clustering or on non-fraud samples from the training set. We observe that using GANs alone may lead to boundary distortion and hence to a performance drop on the majority class (legitimate transactions). We propose an auxiliary loss, computed with either a Triplet Network or a Siamese Network on top of the WGAN-GP model, to learn more discriminative fraud samples. Further, we study the quality of the synthetic samples when the WGAN-GP network is trained end-to-end together with a neural-network-based classifier, and find this useful for dealing with the boundary distortion problem. Compared to [40], all our models have simple architectures with few parameters and are trained end-to-end for the generation of fraudulent data. We further show the applicability of Discriminator Rejection Sampling [4] for improving the quality of the synthetic fraud samples used for data augmentation. Finally, we highlight an open problem in data augmentation: how to decide the number of synthetic fraud samples to add.

The paper is organized as follows. Section 2 describes the configurations used to train the WGAN-GP model for improved data augmentation. Section 3 provides the structural details of these configurations, the dataset description, and the other experimental settings. Section 4 compares the performance of all the models, visualizes the synthetic samples obtained for data augmentation, and discusses the effect of increasing the number of synthetic samples in the augmented set on the classifier's performance. Finally, Section 5 concludes the article and outlines possible future research directions.
2 METHODOLOGY

2.1 Fraud detection framework

Fraud detection is formulated as a binary classification problem. For each transaction record in the dataset, we have a feature vector and a corresponding class label (fraud or non-fraud). The commonly used pipeline for credit card fraud detection using generative models [13, 33, 14, 5] is:

1. Train a GAN to generate fraudulent samples from the training set.
2. Augment the training set with the synthesized fraud samples.
3. Train a classifier on the original and the augmented training set separately and compare the performances.

2.2 Data augmentation using different configurations of WGAN-GP

2.2.1 WGAN-GP

We use a WGAN-GP [18] architecture to oversample the fraudulent (minority) class. It has a generator module $G : Z \to X$ parameterized by $\theta_G$ and a discriminator module $D : X \to [0, 1]$ parameterized by $\theta_D$, where $Z$ is a set of random noise vectors sampled from the unit Gaussian distribution $\mathcal{N}(0, 1)$ and $X$ is the set of feature vectors of the fraud samples. The loss functions used to train the discriminator (D) and generator (G) modules of WGAN-GP are:

$$L_D = \frac{1}{n}\sum_{i=1}^{n}\Big( D_{\theta_D}(\hat{x}_{f_i}) - D_{\theta_D}(x_{f_i}) + \lambda\big(\lVert \nabla_{\tilde{x}} D_{\theta_D}(\tilde{x}_{f_i}) \rVert_2 - 1\big)^2 \Big) \tag{1}$$

where $\tilde{x}_{f_i} = t\,\hat{x}_{f_i} + (1-t)\,x_{f_i}$ with $0 \le t \le 1$, and

$$L_G = \frac{1}{n}\sum_{i=1}^{n}\big( -D_{\theta_D}(G_{\theta_G}(z_i)) \big) \tag{2}$$

where $\hat{x}_{f_i}$ and $x_{f_i}$ are the generated and real fraud samples respectively, and $z_i$ is a random noise sample.
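To make the objective concrete, the sketch below reconstructs Equations 1 and 2 in PyTorch. This is illustrative rather than the authors' code: the helper names (gradient_penalty, d_loss, g_loss) are ours, and D and G stand for the MLP modules described later in Section 3.2.

```python
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    # Gradient-penalty term of Equation 1, evaluated on the
    # interpolates x_tilde = t * x_fake + (1 - t) * x_real.
    t = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_tilde = (t * x_fake + (1 - t) * x_real).detach().requires_grad_(True)
    d_out = D(x_tilde)
    grads = torch.autograd.grad(outputs=d_out, inputs=x_tilde,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def d_loss(D, x_real, x_fake, lam=10.0):
    # Equation 1: Wasserstein critic loss plus the gradient penalty.
    # x_fake should be detached from the generator for this update.
    return (D(x_fake).mean() - D(x_real).mean()
            + gradient_penalty(D, x_real, x_fake, lam))

def g_loss(D, G, z):
    # Equation 2: the generator maximizes the critic score of its samples.
    return -D(G(z)).mean()
```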
2.2.2 Conditional WGAN-GP

We add conditions to the WGAN-GP [5], as shown in Figure 1, extending the input space of the model:

$$G : Z \times Y \to X, \qquad D : X \times Y \to [0, 1]$$

where $Y$ is the set of conditions corresponding to the features in $X$. We conduct two separate experiments with different conditioning variables: one with class labels of the fraud samples obtained using k-means clustering, and a second with the non-fraud samples in the training set. The loss functions for the discriminator (D) and generator (G) modules of the conditional WGAN-GP are:

$$L_D = \frac{1}{n}\sum_{i=1}^{n}\Big( D_{\theta_D}(\hat{x}_{f_i}, y_{f_i}) - D_{\theta_D}(x_{f_i}, y_{f_i}) + \lambda\big(\lVert \nabla_{\tilde{x}} D_{\theta_D}(\tilde{x}_{f_i}, y_{f_i}) \rVert_2 - 1\big)^2 \Big) \tag{3}$$

$$L_G = \frac{1}{n}\sum_{i=1}^{n}\big( -D_{\theta_D}(G_{\theta_G}(z_i, y_{f_i}), y_{f_i}) \big) \tag{4}$$

Figure 1. Conditional GAN.

2.2.3 WGAN-GP with Siamese Network

A Siamese Network [25] uses a contrastive divergence loss to minimize the distance between positive pairs and maximize the distance between negative pairs. We use it on top of the underlying WGAN-GP model, as shown in Figure 2, to ensure that the distribution learned by the generator for the fraud samples does not overlap with the non-fraud samples; both networks are trained in an end-to-end fashion. The Siamese Network consists of two neural networks with shared weights that map the fraud (real and generated) and non-fraud samples into a shared space in which the distance between them is preserved. We pass pairs of generated and real fraud samples as positive pairs, i.e., $(\hat{x}_f, x_f, l = 1)$, and pairs of generated fraud and real non-fraud samples as negative pairs, i.e., $(\hat{x}_f, x_{nf}, l = 0)$, to the Siamese Network $S$ parameterized by $\theta_S$, and train the generator and the Siamese Network on the contrastive divergence loss:

$$L_S = \frac{1}{n}\sum_{i=1}^{n}\Big( l_i\,\frac{1}{2}\, d\big(S_{\theta_S}(\hat{x}_{f_i}), S_{\theta_S}(x_{f_i})\big)^2 + (1-l_i)\,\frac{1}{2}\,\big\{\max\big(0,\, m - d\big(S_{\theta_S}(\hat{x}_{f_i}), S_{\theta_S}(x_{nf_i})\big)\big)\big\}^2 \Big) \tag{5}$$

where $d$ is the Euclidean distance and $m$ is the margin hyperparameter.

2.2.4 WGAN-GP with Triplet Network

The Triplet Network consists of three neural networks with shared weights that map the fraud (real and generated) and non-fraud samples into a shared space in which the distance between them is preserved through the triplet loss. The objective of the triplet loss [32] is to minimize the distance between generated and real fraud samples while simultaneously maximizing the distance between generated fraud samples and real non-fraud samples; hence it is a max-margin framework. We pass triplets of generated fraud, real fraud, and real non-fraud samples, i.e., $(\hat{x}_f, x_f^{+}, x_{nf}^{-})$, to the Triplet Network $T$ parameterized by $\theta_T$, and train the generator and the Triplet Network on the triplet loss:

$$L_T = \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\; m + d\big(T_{\theta_T}(\hat{x}_{f_i}), T_{\theta_T}(x_{f_i})\big) - d\big(T_{\theta_T}(\hat{x}_{f_i}), T_{\theta_T}(x_{nf_i})\big)\big) \tag{6}$$

where $d$ is the Euclidean distance and $m$ is the margin hyperparameter.
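Both auxiliary losses reduce to a few lines of code. Below is an illustrative PyTorch sketch of Equations 5 and 6; the embedding arguments (emb_a, emb_b, emb_anchor, etc.) are assumed to be outputs of the shared-weight networks S and T, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    # Equation 5: label = 1 for (generated fraud, real fraud) pairs,
    # label = 0 for (generated fraud, real non-fraud) pairs.
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = label * 0.5 * dist.pow(2)
    neg = (1 - label) * 0.5 * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

def triplet_loss(emb_anchor, emb_pos, emb_neg, margin=1.0):
    # Equation 6: anchor = generated fraud, positive = real fraud,
    # negative = real non-fraud; max-margin on the distance gap.
    d_pos = F.pairwise_distance(emb_anchor, emb_pos)
    d_neg = F.pairwise_distance(emb_anchor, emb_neg)
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()
```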
Figure 2. Different configurations used to train the WGAN-GP architecture.

2.2.5 WGAN-GP with Classifier

We train the WGAN-GP model together with a binary classifier, as shown in Figure 2. We pass the generated fraud samples along with real non-fraud samples into a classifier $C$ parameterized by $\theta_C$ and train the generator on the classification loss. In this configuration, there are two different classifiers in the network: $C$ tries to distinguish samples of the fraudulent (minority) class from the non-fraudulent (majority) class, while the other classifier, the discriminator (critic) $D$, measures how far the learned distribution is from the true distribution. This combination ensures that the generated fraud samples do not overlap with the real non-fraud samples and, simultaneously, follow the distribution of the minority (fraud) class. We use the binary cross-entropy loss to train the classifier and the generator module:

$$L = \sum_{i=1}^{n}\big( -\log(C_{\theta_C}(\hat{x}_{f_i})) - \log(C_{\theta_C}(x_{nf_i})) \big) \tag{7}$$

2.2.6 WGAN-GP with Discriminator Rejection Sampling

With a standard GAN, it is common practice to discard the discriminator after training and to use only the generator for synthetic data generation, on the belief that the trained generator perfectly captures the underlying data distribution. However, recent studies [4, 37] have shown that GANs do not converge to the true data distribution, and the trained generator still produces samples that the discriminator can easily distinguish from real ones. These studies also show that the discriminator captures the data distribution more closely than the generator. Hence, the distributions defined by both the generator and the discriminator should be considered to obtain better-quality samples. We use the Discriminator Rejection Sampling (DRS) method [4] to sample from the distribution learned by the discriminator, $p_d(x)$. DRS is applied as a post-processing step in which the trained discriminator $D^{*}$ filters the synthetic fraud samples produced by the trained generator $G^{*}$.
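The DRS acceptance rule can be sketched as follows. This is a simplified NumPy reconstruction of the scheme in [4], not the authors' implementation; d_logits denotes the trained discriminator's logits on a batch of generated samples, and gamma is the acceptance-rate shift from [4].

```python
import numpy as np

def drs_filter(d_logits, gamma=0.0, eps=1e-6):
    # F(x) = D(x) - D_max - log(1 - exp(D(x) - D_max - eps)) - gamma;
    # a sample is accepted with probability sigmoid(F(x)). Samples scoring
    # near the batch maximum are almost always kept, low scorers dropped.
    d_max = d_logits.max()
    f = d_logits - d_max - np.log(1 - np.exp(d_logits - d_max - eps)) - gamma
    accept_prob = 1.0 / (1.0 + np.exp(-f))
    return np.random.uniform(size=d_logits.shape) < accept_prob

# Usage: mask = drs_filter(logits_of_generated_batch); kept = x_fake[mask]
```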
3 EXPERIMENTAL SETUP

3.1 Dataset description

All experiments use the publicly available Credit-Card dataset [11]. The dataset contains two days of transactions made in September 2013 by European cardholders. There are 284,807 transactions, of which 492 are fraudulent, i.e., frauds account for 0.172% of the total. Each transaction is represented by 31 features, namely 'Amount', 'Time', 'Class', and 28 other numerical features obtained from PCA (V1, V2, ..., V28). The 'Time' feature holds the time elapsed since the first transaction, and 'Class' is 1 for fraudulent transactions and 0 otherwise. There are no missing values in the dataset. We apply a log transform to the 'Amount' values to give a more normal distribution and normalize all features between 0 and 1. We split the dataset such that the training set holds 70% of the transactions (199,364) and the test set holds 30% (85,443); the 344 and 148 fraud samples account for 0.173% of the transactions in the training and test sets, respectively.

3.2 Architecture details

The lift in the performance of an XGBoost classifier [10] is used as a metric to quantify the quality of the generated synthetic samples when used for augmenting the training set. To evaluate the various settings of WGAN-GP, we train it with different loss functions, aiming to generate more realistic fraud data. The architectural details of all configurations are described below; an illustrative sketch of the generator/discriminator pair follows the list.

1. WGAN-GP. The generator module has four fully connected layers with 30, 128, 256, and 512 neurons, respectively, with ReLU activations after every layer except the last; it accepts random noise of dimension 30 as input. The discriminator has a series of five fully connected layers with 30, 512, 256, 128, and 1 neurons, respectively, with ReLU activations in all layers except the last, which uses a Sigmoid activation. The discriminator and generator are trained together on the loss functions defined in Equations 1 and 2, with λ = 10.0 in the gradient-penalty term and five discriminator updates per generator update. We train the model with the Adam optimizer for 10,000 epochs with mini-batches of size 64 and a learning rate of 2e-4, and observe convergence at around epoch 4,000.

2. Conditional WGAN-GP. We use the k-means clustering algorithm to divide the fraud samples into 2 clusters and label them 0 or 1 accordingly. The fraud-sample labels are passed to the WGAN-GP as conditions along with the random noise input, using the same architecture as defined above. To form pairs of fraud and non-fraud samples, we randomly pick samples from the respective classes and pair them. The model is trained on the loss functions defined in Equations 3 and 4 and converges at around epoch 4,000.

3. WGAN-GP with Triplet Network. We form triplets of the synthetic samples obtained from the generator with real fraud samples and real non-fraud samples and pass them to the Triplet Network. The network consists of three neural networks with shared weights, each with three fully connected layers of 30, 30, and 2 neurons, respectively, and ReLU activations in every layer except the last. The triplet loss (Equation 6) simultaneously ensures that the positive pair of generated and real fraud samples is close and that the negative pair of generated fraud and real non-fraud samples is separated by some margin; we set the margin hyperparameter to 1. The Triplet Network and the WGAN-GP model are trained end-to-end with the Adam optimizer for 5,000 epochs, converging at around epoch 3,500.

4. WGAN-GP with Siamese Network. We use the same WGAN-GP architecture as mentioned above. Synthetic samples from the generator are paired with real fraud samples (positive pairs) and real non-fraud samples (negative pairs) and passed to the Siamese Network, which consists of two neural networks with shared weights, each with three fully connected layers of 30, 30, and 2 neurons and ReLU activations in every layer except the last. The contrastive divergence loss (Equation 5) ensures that the positive pairs are close and that a margin separates the negative pairs; we set the margin hyperparameter to 1 and eps to 1e-9. The entire network is trained end-to-end with the Adam optimizer for 5,000 epochs and saturates at around epoch 3,000.

5. WGAN-GP with Classifier. We add a binary classifier module on top of the WGAN-GP model. Generated fraud samples from the generator are passed to the classifier along with real non-fraud samples from the training set, and the classifier distinguishes between them. The classifier has three fully connected layers with 30, 30, and 2 neurons; all layers have ReLU activations except the last, which uses Softmax. The classifier and generator parameters are trained on the loss defined in Equation 7 with the Adam optimizer at a learning rate of 0.001. Initially, we train only the WGAN-GP model for 1,000 epochs; we then train the entire network end-to-end for 5,000 epochs, observing saturation at around epoch 2,500.
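For concreteness, one plausible PyTorch reading of the generator/discriminator widths in item 1 is sketched below. The paper lists neuron counts per layer but not the generator's output width, which we take to be the 30-dimensional feature vector, so treat this as our interpretation.

```python
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(30, 128), nn.ReLU(),    # input: 30-dim noise
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 30),               # output: synthetic transaction features
)

discriminator = nn.Sequential(
    nn.Linear(30, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),  # Sigmoid per the paper's description
)
```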
4 RESULTS

4.1 Performance metrics

In credit card fraud detection, the class of interest is the fraud (minority) class, and the costs of false positives and false negatives are not equal. An ideal system should precisely identify fraud samples while keeping the number of false positives low. Accuracy, the ratio of correctly classified samples, i.e., (TP+TN)/N, is not an appropriate measure of a classifier's performance on an imbalanced dataset; what matters is the categorical prediction ability. Hence we report Precision, Recall (sensitivity), and F1-Score. Precision is the fraction of predicted positives that are relevant, i.e., TP/(TP+FP). Recall is the fraction of all relevant samples that the algorithm correctly classifies, i.e., TP/(TP+FN). The F1-Score combines the two as the harmonic mean of Precision and Recall.
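These metrics correspond directly to scikit-learn's implementations; a small self-contained sketch (with dummy labels of our own) is:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_test = np.array([0, 0, 1, 1, 0, 1])   # ground truth (1 = fraud)
y_pred = np.array([0, 1, 1, 1, 0, 0])   # classifier output

precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                # harmonic mean of the two
print(precision, recall, f1)
```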
Augmentation Method                          Precision  Recall  F1-Score
Without Augmentation                           0.90      0.76     0.83
WGAN-GP                                        0.88      0.81     0.84
Conditional WGAN-GP (labels from k-means)      0.88      0.81     0.84
Conditional WGAN-GP (non-fraud samples)        0.86      0.81     0.82
WGAN-GP + Triplet Network                      0.89      0.82     0.85
WGAN-GP + Siamese Network                      0.88      0.82     0.85
WGAN-GP + Classifier                           0.92      0.78     0.84
WGAN-GP + DRS                                  0.90      0.82     0.86
WGAN-GP + Classifier + DRS                     0.93      0.79     0.85

Table 1. Performance of the XGBoost classifier trained on augmented sets obtained from different configurations of the WGAN-GP model.

Table 1 reports the results of all the WGAN-GP configurations employed for credit card fraud detection. First, we train an XGBoost classifier on the training set's transactions and test its performance on the test set. Next, we use a WGAN-GP model to learn the distribution of the fraud samples, use its trained generator to oversample the minority (fraud) class, and augment the training set; an XGBoost classifier trained on this augmented set shows an absolute improvement of 5% in Recall compared to one trained on the original dataset.

We also use the conditional WGAN-GP model to generate fraud samples based on conditions such as class labels or non-fraud samples. Fraud samples are clustered into k classes using k-means clustering (2 clusters in our experiments), and the corresponding cluster IDs are assigned as labels and passed to the conditional WGAN-GP as conditions. From Table 1, the classifier's performance remains the same when trained on the augmented dataset obtained from the WGAN-GP conditioned on labels from k-means clustering. The other setting, in which the conditional WGAN-GP is trained to learn a transformation of non-fraud samples to fraud, did not perform better, leading to an absolute drop of 2% in Precision. Further investigation is required to explain this drop, as it does not conform to our hypothesis that generating fraud from non-fraud samples should perform better.

This study also proposes training GANs with auxiliary loss functions, using the triplet loss or the Siamese network loss, for more effective synthetic data generation. Both loss functions improve Recall by 1%, and yield absolute improvements in Precision of 2% (Triplet) and 1% (Siamese) compared to the plain WGAN-GP model. This confirms the benefit of incorporating an auxiliary loss into the WGAN-GP training.

In WGAN-GP with classifier, the generative module is trained on two loss functions: a classification loss from the classifier that distinguishes fraud from non-fraud samples, and the discriminator loss that distinguishes real from generated fraud samples. Together, these two modules help the generator synthesize well-discriminated fraud samples that follow the fraud class distribution. Table 1 shows an absolute improvement of 3% in Precision and a reduction of 3% in Recall compared to the plain WGAN-GP model; compared to the XGBoost classifier trained on the original dataset, there is an improvement of 2% in both Recall and Precision.

The performance of the XGBoost classifier trained on the augmented dataset depends on the quality of the generated fraud samples: to improve Recall, the generated fraud samples should be well discriminated from the non-fraud samples. Recent studies [4, 37] have shown that samples from a trained generator often differ from the real class samples and would be easily rejected by the discriminator. We therefore employ the discriminator rejection sampling method proposed in [4]: the trained discriminator filters out poor-quality samples from the generator as a post-processing step before they are used for training set augmentation. Table 1 shows an absolute improvement of 2% in Precision and 1% in Recall using DRS with WGAN-GP over the plain WGAN-GP model; compared to the XGBoost classifier trained on the original dataset, Precision is similar while Recall improves by 6% absolute.

A reduction in Precision causes legitimate transactions to be misclassified as fraudulent, penalizing banks in terms of customer trust and comfort. Since adding a classifier module to the WGAN-GP model improves Precision, and to further improve the quality of the samples injected into the augmented set, we applied DRS to all the WGAN-GP configurations discussed above. For the WGAN-GP+Classifier model, DRS yields an absolute improvement of 1% in both Precision and Recall; for the WGAN-GP+Triplet Network and WGAN-GP+Siamese Network models, however, no improvement was observed with DRS.

4.2 Comparison of samples generated by different models

Figure 3 visualizes the distribution of fraud transactions learned by the plain WGAN-GP model and by the WGAN-GP+Classifier model. For this comparison, 10,000 synthetic fraud samples were drawn from each trained model and plotted against the real fraud samples and 10,000 real non-fraud samples from the training set. Figure 3 illustrates that the WGAN-GP model learns a class boundary from the fraud samples and draws synthetic fraud data from within that boundary; these samples are not uniformly distributed but come from the high-density region. In the case of the WGAN-GP+Classifier model, the generated fraud samples are uniformly distributed and more spread out than those of the plain WGAN-GP model.

Figure 3. Samples generated from (a) the WGAN-GP and (b) the WGAN-GP+Classifier model.

4.3 Effect of increasing the number of synthetic samples on the classifier's performance

There are 344 fraud samples in the training set; let us denote this number by N_f. We generate fraud samples in multiples (1/4, 1/2, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512) of N_f to augment the dataset and study the effect on the classifier's performance. Table 2 and Figure 4 show the effect of increasing the number of generated fraud samples (N) in the augmented set. We report the metric values at epoch 4,100 for the WGAN-GP model and epoch 2,000 for the WGAN-GP+Classifier model. From Table 2, the WGAN-GP model performs best when N = N_f, i.e., when the number of generated samples equals the number of real fraud samples, while the WGAN-GP+Classifier model performs best when N = 2N_f. Figure 4 also shows that as the number of generated fraud samples increases, the Recall of the WGAN-GP model increases while its Precision and F1-Score drop; for the WGAN-GP+Classifier model, Precision and Recall drop after 4N_f and N_f, respectively.

Figure 4. Effect of increasing the number of generated fraud samples in the augmented set.

             WGAN-GP                        WGAN-GP+Classifier
N         Precision  Recall  F1-Score    Precision  Recall  F1-Score
86          0.895    0.804    0.847        0.913    0.777    0.839
172         0.892    0.784    0.834        0.914    0.784    0.844
344         0.873    0.791    0.830        0.92     0.777    0.842
688         0.851    0.811    0.830        0.921    0.784    0.847
1376        0.811    0.811    0.811        0.926    0.764    0.837
2752        0.781    0.818    0.799        0.933    0.757    0.836
5504        0.753    0.824    0.787        0.925    0.75     0.828
11008       0.668    0.818    0.736        0.836    0.723    0.775
22016       0.541    0.845    0.660        0.886    0.736    0.804
44032       0.313    0.858    0.458        0.886    0.736    0.804
88064       0.193    0.865    0.316        0.886    0.736    0.804
176128      0.123    0.878    0.215        0.908    0.736    0.813

Table 2. Performance of the XGBoost classifier as the number of generated samples (N) in the augmented set is varied, for the WGAN-GP and WGAN-GP+Classifier models.
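The sweep behind Table 2 has roughly the following shape. This is an illustrative sketch, assuming a trained `generator` as in Section 3.2 and the train/test split of Section 3.1 (`X_train`, `y_train`, `X_test`, `y_test` as NumPy arrays); the helper `sample_frauds` is our own.

```python
import numpy as np
import torch
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support

N_F = 344  # number of real fraud samples in the training set

def sample_frauds(generator, n, noise_dim=30):
    # Draw n synthetic fraud samples from the trained generator.
    with torch.no_grad():
        return generator(torch.randn(n, noise_dim)).numpy()

for mult in [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    n = int(mult * N_F)
    X_syn = sample_frauds(generator, n)
    X_aug = np.vstack([X_train, X_syn])
    y_aug = np.concatenate([y_train, np.ones(n)])  # synthetic samples labeled fraud
    clf = XGBClassifier().fit(X_aug, y_aug)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="binary")
    print(f"N={n}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```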
5 CONCLUSION AND FUTURE WORK

This paper presented a detailed study of the applicability and effectiveness of GANs in the banking domain. Various GAN variants, along with the ones proposed in this study, were compared to evaluate the efficacy of data augmentation for the downstream classification task. Among the different training procedures, WGAN-GP trained with a classifier in an end-to-end fashion performed best, improving both Precision and Recall of the XGBoost-based fraud classifier. We further found that the Discriminator Rejection Sampling technique, when applied to the selection of synthetic samples generated by WGAN-GP with classifier, provided an incremental lift. We also demonstrated the effect on the overall performance of the fraud classifier of increasing the number of synthetic samples used for training data augmentation. We believe the outcomes presented in this study will help readers quickly identify the right GAN settings for the fraud space.

A promising future research direction is to experiment with Reinforcement-Learning-based algorithms that automatically identify the quality and count of the samples to be used for augmenting the training dataset, leading to improved performance.

REFERENCES

[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz, 'Applying support vector machines to imbalanced datasets', in European Conference on Machine Learning, pp. 39–50. Springer, (2004).
[2] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz, 'Applying support vector machines to imbalanced datasets', in European Conference on Machine Learning, pp. 39–50. Springer, (2004).
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou, 'Wasserstein GAN', arXiv preprint arXiv:1701.07875, (2017).
[4] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena, 'Discriminator rejection sampling', arXiv preprint arXiv:1810.06758, (2018).
[5] Hung Ba, 'Improving detection of credit card fraudulent transactions using generative adversarial networks', arXiv preprint arXiv:1907.03355, (2019).
[6] Barry G. Becker, 'Using MineSet for knowledge discovery', IEEE Computer Graphics and Applications, 17(4), 75–78, (1997).
[7] Siddhartha Bhattacharyya, Sanjeev Jha, Kurian Tharakunnel, and J. Christopher Westland, 'Data mining for credit card fraud: A comparative study', Decision Support Systems, 50(3), 602–613, (2011).
[8] Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap, 'DBSMOTE: density-based synthetic minority over-sampling technique', Applied Intelligence, 36(3), 664–684, (2012).
[9] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, 'SMOTE: synthetic minority over-sampling technique', Journal of Artificial Intelligence Research, 16, 321–357, (2002).
[10] Tianqi Chen and Carlos Guestrin, 'XGBoost: A scalable tree boosting system', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, (2016).
[11] Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi, 'Calibrating probability with undersampling for unbalanced classification', in 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166. IEEE, (2015).
[12] Pedro Domingos, 'MetaCost: A general method for making classifiers cost-sensitive', in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164, (1999).
gan’, arXiv preprint arXiv:1701.07875, (2017). [26] Victoria López, Alberto Fernández, Salvador Garcı́a, Vasile Palade, and [4] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Francisco Herrera, ‘An insight into classification with imbalanced data: Augustus Odena, ‘Discriminator rejection sampling’, arXiv preprint Empirical results and current trends on using data intrinsic characteris- arXiv:1810.06758, (2018). tics’, Information sciences, 250, 113–141, (2013). [5] Hung Ba, ‘Improving detection of credit card fraudulent trans- [27] Sam Maes, Karl Tuyls, Bram Vanschoenwinkel, and Bernard Mander- actions using generative adversarial networks’, arXiv preprint ick, ‘Credit card fraud detection using bayesian and neural networks’, arXiv:1907.03355, (2019). in Proceedings of the 1st international naiso congress on neuro fuzzy [6] Barry G Becker, ‘Using mineset for knowledge discovery’, IEEE Com- technologies, pp. 261–270, (2002). puter Graphics and Applications, 17(4), 75–78, (1997). [28] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, [7] Siddhartha Bhattacharyya, Sanjeev Jha, Kurian Tharakunnel, and and Stephen Paul Smolley, ‘Least squares generative adversarial net- J Christopher Westland, ‘Data mining for credit card fraud: A com- works’, in Proceedings of the IEEE International Conference on Com- parative study’, Decision Support Systems, 50(3), 602–613, (2011). puter Vision, pp. 2794–2802, (2017). [8] Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok [29] Mehdi Mirza and Simon Osindero, ‘Conditional generative adversarial Lursinsap, ‘Dbsmote: density-based synthetic minority over-sampling nets’, arXiv preprint arXiv:1411.1784, (2014). technique’, Applied Intelligence, 36(3), 664–684, (2012). [30] Xuetong Niu, Li Wang, and Xulei Yang, ‘A comparison study of credit [9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip card fraud detection: Supervised versus unsupervised’, arXiv preprint Kegelmeyer, ‘Smote: synthetic minority over-sampling technique’, arXiv:1904.10604, (2019). Journal of artificial intelligence research, 16, 321–357, (2002). [31] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi [10] Tianqi Chen and Carlos Guestrin, ‘Xgboost: A scalable tree boosting Pinkas, ‘Discovery of fraud rules for telecommunications—challenges system’, in Proceedings of the 22nd acm sigkdd international confer- and solutions’, in Proceedings of the fifth ACM SIGKDD international ence on knowledge discovery and data mining, pp. 785–794, (2016). conference on Knowledge discovery and data mining, pp. 409–413, [11] Andrea Dal Pozzolo, Olivier Caelen, Reid A Johnson, and Gianluca (1999). Bontempi, ‘Calibrating probability with undersampling for unbalanced [32] Florian Schroff, Dmitry Kalenichenko, and James Philbin, ‘Facenet: A classification’, in 2015 IEEE Symposium Series on Computational In- unified embedding for face recognition and clustering’, in Proceedings telligence, pp. 159–166. IEEE, (2015). of the IEEE conference on computer vision and pattern recognition, pp. [12] Pedro Domingos, ‘Metacost: A general method for making classifiers 815–823, (2015). cost-sensitive’, in Proceedings of the fifth ACM SIGKDD international [33] Akhil Sethia, Raj Patel, and Purva Raut, ‘Data augmentation using gen- conference on Knowledge discovery and data mining, pp. 155–164, erative models for credit card fraud detection’, in 2018 4th International (1999). 
[29] Mehdi Mirza and Simon Osindero, 'Conditional generative adversarial nets', arXiv preprint arXiv:1411.1784, (2014).
[30] Xuetong Niu, Li Wang, and Xulei Yang, 'A comparison study of credit card fraud detection: Supervised versus unsupervised', arXiv preprint arXiv:1904.10604, (2019).
[31] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas, 'Discovery of fraud rules for telecommunications — challenges and solutions', in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 409–413, (1999).
[32] Florian Schroff, Dmitry Kalenichenko, and James Philbin, 'FaceNet: A unified embedding for face recognition and clustering', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, (2015).
[33] Akhil Sethia, Raj Patel, and Purva Raut, 'Data augmentation using generative models for credit card fraud detection', in 2018 4th International Conference on Computing Communication and Automation (ICCCA), pp. 1–6. IEEE, (2018).
[34] Hua Shao, Hong Zhao, and Gui-Ran Chang, 'Applying data mining to detect fraud behavior in customs declaration', in Proceedings of the International Conference on Machine Learning and Cybernetics, volume 3, pp. 1241–1244. IEEE, (2002).
[35] Erik Sherman, 'Fighting web fraud', Newsweek, 139(23), 32B, (2002).
[36] Kai Ming Ting, 'An instance-weighting method to induce cost-sensitive trees', IEEE Transactions on Knowledge and Data Engineering, 14(3), 659–665, (2002).
[37] Ryan Turner, Jane Hung, Eric Frank, Yunus Saatci, and Jason Yosinski, 'Metropolis-Hastings generative adversarial networks', arXiv preprint arXiv:1811.11357, (2018).
[38] Wouter Verbeke, Karel Dejaeger, David Martens, Joon Hur, and Bart Baesens, 'New insights into churn prediction in the telecommunication sector: A profit driven data mining approach', European Journal of Operational Research, 218(1), 211–229, (2012).
[39] Xing-Ming Zhao, Xin Li, Luonan Chen, and Kazuyuki Aihara, 'Protein classification with imbalanced data', Proteins: Structure, Function, and Bioinformatics, 70(4), 1125–1132, (2008).
[40] Panpan Zheng, Shuhan Yuan, Xintao Wu, Jun Li, and Aidong Lu, 'One-class adversarial nets for fraud detection', in Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1286–1293, (2019).