A Natural Language Processing Approach
for Financial Fraud Detection
Javier Fernández Rodríguez1,2 , Michele Papale1 , Michele Carminati1,* and
Stefano Zanero1
1
    Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Milan, Italy
2
    Universidad Politécnica de Madrid Madrid, Spain


                                         Abstract
                                         Due to the proliferation of online banking, people are more exposed than ever to attacks. Moreover,
                                         frauds are becoming more sophisticated, bypassing the security measures put in place by the financial
                                         institutions. In this paper, we propose a novel approach to fraud detection based on Natural Language
                                         Processing models. We model the user’s spending profile and detect frauds as deviations from it. To do
                                         so, we employ the attention mechanism that allows us to model and fully exploit past transactions. Our
                                         evaluation on real-world data shows that our model achieves a good balance between precision and
                                         recall, outperforming traditional approaches in different scenarios.

                                         Keywords
                                         Fraud Detection, Natural Language Processing, Transformer Model


1. Introduction
Frauds are becoming more sophisticated as time goes along, bypassing the protection mecha-
nisms put in place. The Fraud Detection and Prevention market is valued at 19.5 billion dollars
and raising [1]. Consequently, financial institutions are demanding up-to-date solutions. Finan-
cial fraud detection is challenging due to various reasons that make usual techniques ineffective.
In fact, frauds are difficult to detect due to the lack of data, the concept drift of spending
profiles, and the temporal dimension of data. Financial datasets are hard to obtain due to
privacy concerns. However, thanks to the collaboration with a large Italian bank, we were able
to work on a real-world dataset. Additionally, frauds are rare by definition and mixed with
legitimate transactions. The imbalance between classes seriously hurts the performance of
detection models. Users, as well as fraudsters, behave differently as time goes by. Consequently,
the temporal dimension is crucial for achieving good performance.
   Natural Language Processing (NLP) studies the interactions between machines and human
language. This field has seen a dramatic change in recent years thanks to models based on
the Transformer [2]. Moreover, Transformer-based models have been proven to be universal
approximators of any sequence-to-sequence functions [3]. The objective of this work is to

ITASEC’22: Italian Conference on Cybersecurity, June 20–23, 2022, Rome, Italy
*
 Corresponding author.
$ javier.fernandez@mail.polimi.it (J. F. Rodríguez); michele.papale@mail.polimi.it (M. Papale);
michele.carminati@polimi.it (M. Carminati); stefano.zanero@polimi.it (S. Zanero)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
exploit the advances made in the Natural Language Processing (NLP) field to fully exploit user’s
past transactions to build the user spending profile. This model directly tackles the domain’s
challenges and has shown outstanding performance in many fields [4, 5, 6] that share similarities
with the fraud detection one.
   In this paper, we present a novel approach for fraud detection based on the Transformer
model [2], a state-of-the-art technique in the Natural Language Processing field to model
the user’s spending profile and detect frauds as deviations from it. We compute a risk score
by comparing the predicted transactions with the actual transactions. To do so, we employ
attention-based mechanisms [2] and unsupervised multitasks learners [7], which allow us to
model and fully exploit past transactions. In particular, we exploit the transformer model and
the attention mechanism to tackle the domain’s challenges directly. The transformed model
is designed to deal with sequences of data and time-series, while the attention mechanism
over the user’s past transactions has been proven to be robust against concept drift [8]. Our
approach hinders explainability but enables the model to optimize the feature representation of
the input generated by neural networks, which usually achieve better performance [8, 9, 10].
The evaluation on a real-world dataset shows that our model outperforms state-of-the-art
approaches in different scenarios, going from realistic to adversarial attacks. Finally, we also
demonstrate the better performance of the proposed generative solution with respect to NLP-
based discriminative approaches [11].
   In summary, we make the following contributions:

    • We present, to the best of our knowledge, the first study on the application of the Trans-
      former model to the fraud detection task. By doing this, we take into account the time
      dimension and fully exploit the users’ past spending patterns, tackling the concept drift
      of the user’s profile and the data scarcity issues.
    • We evaluate our approach on a real-world dataset, showing that it outperforms state-of-
      the-art methods in different fraudulent scenarios.
    • We compare the performance of NLP-based generative solutions against NLP-based
      discriminative approaches deployed in the fraud detection domain.


2. Background and Motivation
Most fraud detection approaches define fraud as a deviation from the normal spending pattern.
However, frauds are difficult to detect due to the lack of data, the concept drift of spending
profiles, and the time dimension. In fact, financial datasets are hard to obtain due to privacy
concerns. However, thanks to the collaboration with a large Italian bank, we were able to work
on a real-world dataset. Additionally, frauds are rare by definition and mixed with legitimate
transactions. The imbalance between classes seriously hurts the performance of detection
models. Users, as well as fraudsters, behave differently as time goes by. Consequently, the
temporal dimension is crucial for achieving good performance.
Related Work. Existing approaches can be categorized based on how they model normal user’s
behavior [12]: local models are user-centric and aggregate features by users; global models are
system-centric and try to model the behavior of the system.
   Local models usually rely on neural networks designed to deal with sequences of data.
Among them, LSTM and GRU based solutions are the more popular options [13, 14, 15]. The
main shortcoming of these solutions is the need for a fixed-length input and the LSTM cells’
bottleneck. This problem hinders the processing of both users with very few transactions and
the ones with a large amount of them. Instead, our solution can process users who perform
from 1 transaction up to 1024 thanks to the attention-based mechanism. Fraudmemory [8] adds
attention on top of the LSTM output. The model outperforms state-of-the-art solutions in terms
of Precision, Recall, and AUC. Our solution uses attention-based mechanisms, but it applies
them directly to the input to mitigate the the bottleneck for the flow of information inside the
neural networks [2]. Zamini et al. [16] use an autoencoder to model legitimates credit card
transactions, and it leverages the reconstruction error to detect frauds. In Veeramachaneni et
al. [17], the authors propose an ensemble of unsupervised methods, including a Density-based
model, a Matrix Decomposition-based model, and a Replicator Neural Network. By combining
the anomaly scores computed by the three models, their system ranks the instances based on
the most anomalous ones and then presents them to the subject matter expert for review; the
feedback collected is used to train a Random Forest model. One shortcoming of this kind of
solution is the impossibility of training the model as a whole. Hence, they fail to extract rich
differential features to detect outliers [12].
   Global models usually consist of clustering and are based on the probability distribution of
the data, which is later used to spot anomalies. Among the most known techniques, there are
k-NN [18, 19]. Although simple, k-NN performs well compared to more complex approaches,
according to Campos et al. [20]. OC-SVM [21] is a SVM modified to find a separation plane that
englobes all the legitimate samples. Frauds will fall outside this plane, and therefore, they will
be detected. Another promising unsupervised technique is OC-NN [22, 23], which partially
solves the Autoencoders’ problem and can extract a rich representation of the input optimized
for fraud detection. Although these models show promising results [12], the training times
grow exponentially with the input dimension. Thanks to the use of the attention mechanism,
our model does not suffer from this issue.
   Banksealer [24, 25, 26] is a semi-supervised model that builds a local, global, and temporal
profile using methods with a well-known statistical meaning, which adds explainability to the
final result. In a subsequent works [27, 28], the temporal profile is improved with the application
of signal processing techniques to exploit the end user’s recurrent vs. non-recurrent spending
pattern, and the authors exploit analyst feedback to self-tune and improve Banksealer’s detection
performance using a multi-objective genetic algorithm.
Research Goal. The objective of this work is to exploit the advances made in the NLP field to
fully exploit user’s past transactions to build the user spending profile. To do so, we exploit
the transformer model and the attention mechanism to tackle the domain’s challenges directly.
The transformed model is designed to deal with sequences of data and time-series, while the
attention mechanism over the user’s past transactions has been proven to be robust against
concept drift [8].
     Historical
                                                                                                                                             Transactions
   Transactions
         (n,m)                                                                                                                                      (1,m)


                                                                                      (1,h)                               (1,2)
                          Multihead                     Multihead                                  Day Predictor                                                             Day Score

                          Attention                     Attention                                                                              Gather
                                                                                      (1,h)                               (1,2)
                                  (n,h)        (n,h)           (n,h)
                                                                                                   Hour Predictor                            probabilities                  Hour Score
   Embedding
                                                                                                                                                                                                      New IBAN
                                                                                                                          (1,8)
                  (n,h)     Rel. U                        Rel. U                                Weekday Predictor                    (1,p)                              Weekday Score
                                                                                                                                                    (1,m)    (1,m)                                               (1,1)
          (n,e)                                                        (n,h)   Last position
                            Dense                         Dense                                                                                                                                  >b
                                                                                of attention
                                                                                                                          (1,25)
                                                                                                 Amount Classifier                                                       Amount Score
                                 (n,f)                        (n,f)                   (1,h)
   Projection
                            Linear                        Linear                                                          (1,1376)              -log()
                                                                                                   ASN Classifier                                                           ASN Score
                                                                                      (1,h)
                            Dense                         Dense
                                                                                                                          (1,2)
                                                                                               International Classifier                                              International Score
                          Position-wise                Position-wise                  (1,h)

                              FFN                          FFN
   Embedding                                                                                                                                 Dissimilarity                          Classifier
                                          Attention                                               Predictor
                                                                                                                                                Score


Figure 1: Overall architecture


3. Approach
Our approach is based on the Transformer model [2], which is a sequence-to-sequence model
where both input and output are sequences. It exploits an encoder and a decoder: The encoder
builds a representation of the input and passes it to the decoder; the decoder takes the information
from the encoder and the output generated so far, and predicts the next item in the output
sequence. The Transformer model was originally engineered to perform language translation,
and, therefore, it excels at modeling long sequences of interrelated data.
   In this work, we exploit the similarities between human language and transactions, focusing
on their temporal dependency. In particular, we exploit this architecture’s modeling capacities
to model users’ behavior in terms of transactions: Instead of words and sentences, we have
transactions and user records. Consequently, instead of predicting the next word, we exploit
the transformer model to predict the next transaction in the sequence belonging to the user’s
spending pattern. Significant changes have been made to adapt the Transformer model to the
fraud detection domain’s particularities. More formally, we train the Transformer model to
predict the next transaction given the user’s past transactions. The model takes the last 1024
(𝑡0 , . . . , 𝑡1023 ) transactions and will output a representation of the next transaction 𝑡′ . We select
1024 as input size since it is the value that in our experimental evaluation obtained the best
trade-off between performance and computation requirements. This parameter can be adjusted
as needed, but it must be a power of two to allow efficient computation [2]. Then, we compare
𝑡′ with the actual transaction, and if their difference is higher than a threshold 𝑝𝑡ℎ , the actual
transaction is marked as fraud.
   In Figure 1 we present an overview of the proposed solution. The model consists of 47 layers
arranged in different blocks depending on their function. All the neural network layers sum a
total of 1961358 weights that are optimized with the Adam optimizer. The main building blocks
are the Embedding, the Attention mechanism (implemented with Multihead attetention and
Position-wise feed-forward network), the Predictor, the Dissimilarity Score, and the Classifier.
3.1. Embedding
Natural language models rely on an embedding layer to translate words to numbers, usually
called Word2Vec. The objective is to map the one-hot encoded-word, a sparse high dimensional
space, to a reduced dense space. The Embedding allows knowing if two words are similar or
not. To do so, Transformer models are trained on large vocabularies, which are preprocessed.
There are several algorithms like BPE [29], WordPiece [30] or SentencePiece [30]. The idea is to
break down words into smaller pieces. In the case of transactions, we build the vocabulary by
considering each transaction’s feature as a word of a different vocabulary. We avoid building a
vocabulary for each transaction since it would have made the Embedding not scalable. We use
the concept of Embedding introduced by Tomás Mikolov [31]. These layers transform the sparse
one-hot encoded transactions into a dense, fewer dimensional space. The resulting vectors share
an interesting property: similar elements in terms of meaning are mapped closer. This property
helps the model to focus on the relevant elements. For instance, if the model is interested in
transactions with amounts in the 5th bin, the search will also return transactions belonging to
bins 4th and 6th as they are "close in meaning". There are different types of inputs. Therefore, a
different embedding is used for each one. The last layer consists of a dense layer that projects
the input in a higher space. The concept is similar to the one used in the Electra model [11].
Another issue to keep into consideration is Positional Encoding. The Transformer models need
the original position of the encoded input, which is lost during the attention mechanism’s
application. Existing techniques involve coding a cosine and sine signal from the input, marking
the position of each word in the sentence. This technique assumes that all the input elements
are equidistant, as in words in a sentence, but transactions are not. Instead of using positions,
timestamps are used. Similar to the usual Positional Encoding, cosines and sines are used to
encode the information.

3.2. Attention mechanism
The Attention mechanisms are the core of our approach. They were introduced by Bahdanau [32]
and Luong [33]. They exploit the similarity extracted from the embeddings to compute the
dot product. If the vectors of two similar words are similar, the dot product between them
will be higher. The attention mechanism allows the model to search the input for useful
information. Thanks to attention, the model can deal with long sequences of inputs. The
implementation is similar to the one presented in the original Transformer paper [2]. However,
while the original approach is a sequence-to-sequence model conceived for language translation,
predicting the next transaction is a sequence-to-one problem. Therefore, we adapt our approach
taking inspiration from GPT-2 [7], which is engineered for prediction. Therefore, the proposed
architecture is similar to the one used by GPT-2.
Multihead Attention Mechanism. The matrix calculated in Equation 1 indicates which
positions of the input 𝑉 are more relevant given the query 𝑄.
                                                              𝑄𝐾 𝑇
                                                           (︂      )︂
                            𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄, 𝐾) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥 √                                    (1)
                                                                𝑑𝑘
Attention is later multiplied by the value 𝑉 , as shown in Equation 2. The result 𝑉 ′ contains the
more relevant input given 𝑉 , 𝐾, and 𝑄. It worth noticing that 𝑉 ′ will have the same dimensions
                                                          (𝑛, ℎ)

                                                  Head join

                                                          (𝑛, 𝑔, ℎ/𝑔)

                                                  Attention
                               (𝑛, 𝑔, ℎ/𝑔)                               (𝑛, 𝑔, ℎ/𝑔)

                                                  (𝑛, 𝑔, ℎ/𝑔)
                                    Value dense                    Query dense


                                                  Head split

                                                          (𝑛, ℎ)


Figure 2: Multihead attention architecture


of 𝑄. In fact, in Transformer-alike models, 𝐾 = 𝑉 .

                                    𝑉 ′ = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄, 𝐾) * 𝑉                                   (2)

   There are different types of attention depending on how 𝑄 and 𝑉 are calculated. In this case,
all the blocks use self-attention except for the last one. Self-attention is the case in which 𝑄 and
𝑉 are obtained, applying a linear transformation to the input. It is called self-attention because
each input position pays attention to the other positions of the input. In this way, each input
is enriched with the context around them. The output has the same dimensions as the input.
The last layer of the model consists of cross-attention. Instead of getting 𝑄 from all the input,
only the last position is used. This forces 𝑉 ′ to have the same dimension of 𝑄, which is one
transaction’s dimension.
   Figure 2 shows the attention layer. This layer is called Multihead attention. Instead of using
only one head, several are used. Each head performs the Equations 1 and 2 on different parts of
the input. Therefore, the model can pay attention to several positions at once with the same
computational cost.
Position-wise Feed-Forward Network. All the dense layers used in the model are linear
except the ones used in this layer. The Point-wise feed-forward network adds non-linearity to
the model. This operation is Position-wise since it is a non-linear transformation applied to
each input position with the same parameters.

3.3. From Prediction to Fraud Detection
Some adjustments are needed to convert a predictive model to a generative one. The model
outputs the probability vector for each feature. With that, it is easy to get the probability
of a given transaction. Then, the anomaly score is obtained by applying the −𝑙𝑜𝑔(·) to the
probability [4]. Lastly, a meta-learner is trained to convert the anomaly scores of the features to
the probability of fraud, by following the stacking ensembling technique [34].
Predictor. The first part of the model has the task of predicting the next transaction. The
predictor consists of several dense layers. Each layer performs a linear transformation of the
hidden state and is trained to predict the next transaction feature. A transaction is composed
of different features with different loss functions. Each feature contributes to the final loss
function in an equal manner: 𝐿𝑡𝑜𝑡𝑎𝑙 = 𝐿𝑑𝑎𝑦 + 𝐿ℎ𝑜𝑢𝑟 + 𝐿𝑎𝑚𝑜𝑢𝑛𝑡 + 𝐿𝑤𝑒𝑒𝑘 + 𝐿𝑎𝑠𝑛 + 𝐿𝑖𝑛𝑡 . In the
case of the continuous variables (e.g., day or month), the MSE is used as the loss function. For
the discrete variables instead, the layer is trained using Sparse Cross-entropy. The magnitudes
of the losses are different as they account for different problems. For example, the ASN loss
is larger than the day loss because it is a prediction between 1375 different options whilst the
day MSE usually ranges between 0 and 1. This issue is mitigated by choosing Adam optimizer,
which scales the loss according to the learning rate [35].
Dissimilarity Score. This layer receives in input the prediction from the model and the
current transaction and generates as output the probability of the current transaction given the
prediction. The prediction given by the model consists of a vector of probabilities for each feature.
Lastly, −𝑙𝑜𝑔(·) is computed for each feature, yielding a dissimilarity score. This approach is
similar to the one taken by Brown et al. [4]. The output of the model is the probability of fraud
of the current transaction.
Classifier. The classifier can be seen as a meta-learner. Its function is to weigh the dissimilarity
scores of each feature to get a proper classification of frauds. It is a classification problem with
two classes. Thus, Binary Crossentronpy is used.


4. Experimental Evaluation
We demonstrate the effectiveness of the model in different scenarios and against state-of-the-art
approaches. The experiments are divided depending on the attacker’s knowledge [36, 37].
Dataset. The dataset used to validate our model contains the transactions of a large Italian
banking group. The transactions belong to two periods. One goes from December 2012 to
September 2013, and the other goes from October 2014 to February 2015. This accounts for
a total of 1043478, comprising 6195 different users. A detailed analysis of the data can be
found in [27, 25]. Each transaction has 31 features, of which only 9 can be used due to the
anonymization process. The proposed model takes as input 6 of them: the amount, the hour,
the day of the month, the telecommunications operator from which the transaction is issued,
and information about the destination IBAN.
Experimental Setup. First, we preprocess the dataset. Then, we split the dataset into the
training set to train the model, the validation test to assess the model’s performance at each set
of the training, and the test set to get the final results of the model. The test set is the only one
used to create the different scenarios and injected with synthetic frauds. We refer the reader to
Appendix A for the description of the metrics used.

4.1. Black Box Attacks
In the black box attacks scenarios the attacker does not have prior information of the system
he or she tries to attack. We consider four different attacking strategies that model real-life
situations: Stealing, hijacking, persistent, and Mix. The Stealing scenario simulates a phishing
attack in which the user’s credentials are stolen. The amount transferred is very high. The
connection can be originated from a national or foreign IP. The Hijacking scenario simulates
Table 1
Black Box attacks scenario results
     Scenario     Model          Precision   Recall   AUROC   AUPR    MCC       F1     FPR
     Persistent   Proposed           0.751   0.499    0.834   0.652   0.559   0.600    0.030
                  Random F.          0.737   0.387    0.657   0.507   0.525   0.507    0.014
                  Isolation F.       0.234   0.135    0.528   0.099   0.123   0.171    0.032
                  OC-SVM             0.169   0.103    0.506   0.052   0.015   0.128    0.041
                  k-NN               0.181   0.103    0.510   0.099   0.072   0.033    0.033
     Stealing     Proposed           0.869   0.986    0.997   0.987   0.910   0.924    0.001
                  Random F.          0.926   0.978    0.982   0.804   0.943   0.951    0.014
                  Isolation F.       0.695   0.788    0.961   0.152   0.801   0.739    0.031
                  OC-SVM             0.667   0.809    0.956   0.166   0.779   0.731    0.041
                  k-NN               0.684   0.801    0.959   0.158   0.793   0.738    0.034
     Hijacking    Proposed           0.864   0.968    0.978   0.922   0.897   0.913    0.007
                  Random F.          0.927   0.954    0.970   0.794   0.929   0.940    0.013
                  Isolation F.       0.696   0.921    0.952   0.687   0.791   0.793    0.032
                  OC-SVM             0.407   0.336    0.624   0.270   0.269   0.368    0.041
                  k-NN               0.668   0.918    0.918   0.647   0.739   0.773    0.034
     Mixed        Proposed           0.828   0.833    0.936   0.871   0.801   0.830    0.006
                  Random F.          0.902   0.756    0.871   0.703   0.800   0.823    0.015
                  Isolation F.       0.602   0.707    0.814   0.532   0.589   0.650    0.033
                  OC-SVM             0.467   0.474    0.691   0.365   0.382   0.470    0.043
                  k-NN               0.576   0.669    0.793   0.503   0.552   0.619    0.035


a Man-in-the-browser attack. The connection details are legitimate. The amount transfer is
high. The transfer happens no later than ten minutes from a legitimate one. The Persistent
scenario simulates the infection of a banking Trojan [38]. The frauds have a low amount, and
the connection details are legitimate. The Mix scenario combines all previous scenarios.
   Table 1 compares the proposed model against the baselines algorithms. The proposed model
is better in almost all scenarios. The lower performance of baseline algorithms is due to concept
drift. The overall low FPs, demonstrate how the proposed approach can correctly model (i.e., it
does not negatively impact) the user’s "spending pattern". Traditional algorithms have a hard
time detecting frauds that have not been seen before. The baseline algorithms’ performance is
higher in the scenarios that are more similar to the training set and lower in the most complex
scenarios as the Persistent or the Mix scenario.
   Figure 3 shows the ROC curves of each model. All the models have been trained in identical
conditions, i.e., using the same dataset. Regarding the persistent (a) and mix (b) scenarios, the
proposed model is better and presents an overall lower FPR. The curve is similar in both cases
due to the persistent frauds. Also, it is possible to distinguish two groups of frauds in these
scenarios. The ones easy to detect, which correspond to the first ramp, and the hard ones belong
to the second ramp. In the stealing (c) and hijacking (d) scenarios, almost all the models perform
similarly. The proposed model works better than baseline algorithms for very low values of
FPR. Because of that, the proposed model has a higher AUROC.
           1.0                                                                1.0


           0.8                                                                0.8


           0.6                                                                0.6
     TPR


                                                                        TPR
                                                   model                                                              model
           0.4                                     Isolation Forest           0.4                                     Isolation Forest
                                                   KNN                                                                KNN
           0.2                                     OCSVM                      0.2                                     OCSVM
                                                   Proposed                                                           Proposed
                                                   Random Forest                                                      Random Forest
           0.0                                                                0.0
                 0.0   0.2       0.4         0.6    0.8           1.0               0.0   0.2       0.4         0.6    0.8           1.0
                                       FPR                                                                FPR


                             (a) Persistent                                                       (b) Mix
           1.0                                                                1.0


           0.8                                                                0.8


           0.6                                                                0.6
     TPR


                                                                        TPR
                                                   model                                                              model
           0.4                                     Isolation Forest           0.4                                     Isolation Forest
                                                   KNN                                                                KNN
           0.2                                     OCSVM                      0.2                                     OCSVM
                                                   Proposed                                                           Proposed
                                                   Random Forest                                                      Random Forest
           0.0                                                                0.0
                 0.0   0.2       0.4         0.6    0.8          1.0                0.0   0.2       0.4         0.6    0.8          1.0
                                       FPR                                                                FPR


                              (c) Stealing                                                      (d) Hijacking

Figure 3: ROC curves in different scenarios


4.2. Grey Box Attacks
Grey box attacks consist of attacks performed by an attacker with information about the target
system, as described in [36]. The attacker has acquired (e.g., thanks to a malware infection or
leak) all the transactions from December 2012 to September 2013. The attacker uses the available
data to train a Random Forest classifier that is treated as an oracle. The attacker will try the
fraud against the oracle before injecting it into our model. If the oracle flags the transaction as
fraud, the attacker will change the transaction and will try again. Table 2 summarizes the result
obtained for each model in the grey box scenarios. The proposed model outperforms by far the
baseline algorithms. As expected, the Random Forest is the worse since it is used as an oracle
by the attacker. The results also show the benefits of exploiting different approaches to fraud
detection. The proposed model suffers less from the grey box attack because it is based on user
modeling, while the oracle and baseline algorithms rely on finding a separation plane between
frauds and legitimate transactions.

4.3. Generative Versus Discriminative Approach
The proposed model is generative: It generates the expected transaction probabilities, which
are then used to detect fraud. However, there are also discriminative models: they are trained
to discriminate between frauds and legitimate transactions. Random Forest or OC-SVM are
examples of discriminative models. Following the architecture of the generator, a discriminative
model is proposed here. The differences between the two models are in the last layers of
Table 2
Grey-box attacks scenario results
           Model          Precision   Recall    AUROC        AUPR      MCC        F1    FPR
           Proposed        0.769      0.556       0.867      0.691     0.606    0.645   0.020
           Random F.       0.548      0.096       0.541      0.237      0.182   0.163   0.026
           Isolation F.    0.094      0.044       0.483        0       -0.047   0.060   0.033
           OC-SVM          0.146      0.084       0.497        0       -0.007   0.107   0.042
           k-NN            0.141      0.075       0.496        0       -0.011   0.098   0.035


Table 3
Comparison between Discriminator and Generator approaches
        Scenario     Model            Precision     Recall    AUROC       AUPR     MCC        F1
        Persistent   Generator          0.751       0.499      0.834      0.652    0.559   0.600
                     Discriminator      0.549       0.509      0.886      0.578    0.446   0.528
        Stealing     Generator          0.869       0.986      0.997      0.987    0.910   0.924
                     Discriminator      0.674       0.869      0.972      0.896    0.799   0.759
        Hijacking    Generator          0.864       0.968      0.978      0.922    0.897   0.913
                     Discriminator      0.672       0.862      0.966      0.882    0.779   0.755
        Mixed        Generator          0.828       0.833      0.936      0.871    0.801   0.830
                     Discriminator      0.622       0.743      0.938      0.800    0.690   0.677


the model and the training. The discriminative model does not have the Predictor nor the
Dissimilarity score shown in Figure 1. Regarding training, the generator is trained for prediction,
while the discriminator is trained for classification. The discriminative model uses cross-
attention instead of self-attention and uses the incoming transaction as a query to search in
the past transactions. The results are reported in Table 3. Overall, the generative approach
performs better than the discriminative one.

4.4. LSTM-based Model Comparison
Banking information is not public due to obvious reasons. Due to the lack of standard databases
and benchmarks, it is not easy to compare different models. Therefore, we have developed an
ensemble model composed by LSTM, Random Forest, and XGBoost based on the model proposed
by Jurgovsky et al. [13]. To combine the three models’ output, we compute the Cumulative
Distribution Function of the exponential distribution applied to each model output y, which
gives the probability 𝑘 of the input sample. Then, the ensemble gives the final decision to the
model that has the maximum distance between 𝑘 and the mean CDF value seen in training. The
ensemble improves the performance of each of the individual models. The improvement over
the LSTM model is 22% in terms of AUROC. We used the same dataset to train both models.
To provide an accurate comparison, the Random Forest’s performance is set as a baseline, and
the models are compared in terms of improvement. Table 4 summarizes the results for the mix
scenarios, where all types of frauds are considered. The improvements of both models over
the baseline are close. Overall, the proposed model performs slightly better than the ensemble
Table 4
Comparison between LSTM-based approach and the proposed model.
           Model             Precision   AUROC     AUPR     MCC        F1       FPR
           LSTM Ensemble       0.422     0.975    0.0.872    0.588    0.583     0.145
           Baseline            0.505     0.961     0.803     0.634    0.649     0.098
           Improvement        -0.083     0.014     0.069    -0.046   - 0.066    0.046
           Proposed model      0.828     0.936     0.871     0.801    0.830     0.006
           Baseline            0.902     0.871     0.703     0.800    0.823     0.015
           Improvement        -0.074     0.065    0.168     0.001     0.007    -0.009


one. It is also important to notice that the proposed model can deal with users with very few
transactions, while the LSTM-based ensemble needs at least 50 transactions per user.


5. Conclusions
This paper presented a novel approach for fraud detection based on the Transformer model,
a state-of-the-art technique in the Natural Language Processing field. Besides demonstrating
the feasibility of applying NLP techniques to model the user’s spending behavior, we showed
how the proposed model overcame the domain’s limitations and achieved better performance
than state-of-the-art algorithms in complex scenarios and against adversarial attacks. Future
works will work in the direction of providing explainability to the proposed framework, which
is of paramount importance in the fraud detection domain. In light of the results obtained, we
deem that the future of banking fraud detection is standardization and transfer learning. The
field needs standard benchmarks like ImageNet [39] for image classification or GLUE [40] for
NLP models. It is a daunting task and will require open datasets, which are difficult to obtain
because of the data’s sensitivity. Withal, it would bring enormous advantages. Updated models
could be fine-tuned for specific tasks in a matter of days, allowing financial institutions and
researchers to reach new frontiers.


References
 [1] Markets, Markets, Market research report, https://www.researchandmarkets.com/, 2018.
 [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
     I. Polosukhin, Attention Is All You Need, arXiv e-prints (2017) arXiv:1706.03762.
     arXiv:1706.03762.
 [3] C. Yun, S. Bhojanapalli, A. Rawat, S. Reddi, S. Kumar, Are transformers universal approxi-
     mators ofsequence-to-sequence functions? (2019).
 [4] A. Brown, A. Tuor, B. Hutchinson, N. Nichols, Recurrent Neural Network Attention
     Mechanisms for Interpretable System Log Anomaly Detection, arXiv e-prints (2018)
     arXiv:1803.04967. arXiv:1803.04967.
 [5] G. Branwen, Gpt-2 folk music, 2020. URL: https://www.gwern.net/GPT-2-music.
 [6] S. Alexander, https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/, 2020.
     URL: https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/.
 [7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
     unsupervised multitask learners (2019).
 [8] Y. Kunlin, A memory-enhanced framework for financial fraud detection, in: 2018 17th
     IEEE International Conference on Machine Learning and Applications (ICMLA), 2018, pp.
     871–874. doi:10.1109/ICMLA.2018.00140.
 [9] L. Nanni, S. Ghidoni, S. Brahnam, Handcrafted vs. non-handcrafted features for
     computer vision classification, Pattern Recognition 71 (2017) 158 – 172. URL: http:
     //www.sciencedirect.com/science/article/pii/S0031320317302224. doi:https://doi.org/
     10.1016/j.patcog.2017.05.025.
[10] L. Cai, J. Zhu, H. Zeng, J. Chen, C. Cai, Deep-learned and hand-crafted features fusion
     network for pedestrian gender recognition, in: J. Cao, E. Cambria, A. Lendasse, Y. Miche,
     C. M. Vong (Eds.), Proceedings of ELM-2016, Springer International Publishing, Cham,
     2018, pp. 207–215.
[11] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text En-
     coders as Discriminators Rather Than Generators, arXiv e-prints (2020) arXiv:2003.10555.
     arXiv:2003.10555.
[12] R. Chalapathy, S. Chawla, Deep Learning for Anomaly Detection: A Survey, arXiv e-prints
     (2019) arXiv:1901.03407. arXiv:1901.03407.
[13] J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P.-E. Portier, L. He-Guelton, O. Cae-
     len, Sequence classification for credit-card fraud detection, Expert Systems with Ap-
     plications 100 (2018) 234 – 245. URL: http://www.sciencedirect.com/science/article/pii/
     S0957417418300435. doi:https://doi.org/10.1016/j.eswa.2018.01.037.
[14] A. Roy, J. Sun, R. Mahoney, L. Alonzi, S. Adams, P. Beling, Deep learning detecting fraud in
     credit card transactions, in: 2018 Systems and Information Engineering Design Symposium
     (SIEDS), 2018, pp. 129–134. doi:10.1109/SIEDS.2018.8374722.
[15] B. Wiese, C. Omlin, Credit Card Transactions, Fraud Detection, and Machine Learning:
     Modelling Time with LSTM Recurrent Neural Networks, Springer Berlin Heidelberg,
     Berlin, Heidelberg, 2009, pp. 231–268. URL: https://doi.org/10.1007/978-3-642-04003-0_10.
     doi:10.1007/978-3-642-04003-0_10.
[16] M. Zamini, G. Montazer, Credit card fraud detection using autoencoder based clustering,
     in: 2018 9th International Symposium on Telecommunications (IST), 2018, pp. 486–491.
     doi:10.1109/ISTEL.2018.8661129.
[17] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, K. Li, Aiˆ 2: training a big data
     machine to defend, in: 2016 IEEE 2nd International Conference on Big Data Security on
     Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart
     Computing (HPSC), and IEEE International Conference on Intelligent Data and Security
     (IDS), IEEE, 2016, pp. 49–54.
[18] S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large
     data sets, SIGMOD Rec. 29 (2000). URL: https://doi.org/10.1145/335191.335437. doi:10.
     1145/335191.335437.
[19] F. Angiulli, C. Pizzuti, Fast outlier detection in high dimensional spaces, Proceedings of the
     Sixth European Conference on the Principles of Data Mining and Knowledge Discovery
     2431 (2002) 15–26. doi:10.1007/3-540-45681-3_2.
[20] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert,
     I. Assent, M. E. Houle, On the evaluation of unsupervised outlier detection: measures,
     datasets, and an empirical study, 2016. URL: https://doi.org/10.1007/s10618-015-0444-8.
     doi:10.1007/s10618-015-0444-8.
[21] B. Lamrini, A. Gjini, S. Daudin, F. Armando, P. Pratmarty, L. Travé-Massuyès, Anomaly
     detection using similarity-based one-class svm for network traffic characterization, 2018.
[22] R. Chalapathy, A. K. Menon, S. Chawla, Anomaly detection using one-class neural networks,
     2019. arXiv:1802.06360.
[23] L. Ruff, R. Vandermeulen, N. Görnitz, L. Deecke, S. Siddiqui, A. Binder, E. Müller, M. Kloft,
     Deep one-class classification, 2018.
[24] M. Carminati, R. Caron, F. Maggi, I. Epifani, S. Zanero, Banksealer: A decision support
     system for online banking fraud analysis and investigation, Computers & Security 53 (2015)
     175 – 186. URL: http://www.sciencedirect.com/science/article/pii/S0167404815000437.
     doi:https://doi.org/10.1016/j.cose.2015.04.002.
[25] M. Carminati, R. Caron, F. Maggi, I. Epifani, S. Zanero, Banksealer: An online banking fraud
     analysis and decision support system, in: N. Cuppens-Boulahia, F. Cuppens, S. Jajodia,
     A. Abou El Kalam, T. Sans (Eds.), ICT Systems Security and Privacy Protection, Springer
     Berlin Heidelberg, Berlin, Heidelberg, 2014, pp. 380–394.
[26] M. Carminati, M. Polino, A. Continella, A. Lanzi, F. Maggi, S. Zanero, Security evaluation
     of a banking fraud analysis system, ACM Transactions on Privacy and Security (TOPS) 21
     (2018) 1–31.
[27] M. Carminati, A. Baggio, F. Maggi, U. Spagnolini, S. Zanero, Fraudbuster: temporal analysis
     and detection of advanced financial frauds, in: International Conference on Detection of
     Intrusions and Malware, and Vulnerability Assessment, Springer, 2018, pp. 211–233.
[28] M. Carminati, L. Valentini, S. Zanero, A supervised auto-tuning approach for a banking
     fraud detection system, in: Cyber Security Cryptography and Machine Learning, CSCML
     2017, Springer International Publishing, 2017, pp. 215–233.
[29] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword
     units, 2015. arXiv:1508.07909.
[30] M. Schuster, K. Nakajima, Japanese and korean voice search, 2012.
[31] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
     vector space, 2013. arXiv:1301.3781.
[32] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align
     and translate, 2014. arXiv:1409.0473.
[33] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural
     machine translation, 2015. arXiv:1508.04025.
[34] J. Rocca, Ensemble methods: bagging, boosting and stacking, https://towardsdatascience.
     com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205, 2019.
[35] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014. arXiv:1412.6980.
[36] M. Carminati, L. Santini, M. Polino, S. Zanero, Evasion attacks against banking fraud
     detection systems, in: 23rd International Symposium on Research in Attacks, Intrusions
     and Defenses ({RAID} 2020), 2020, pp. 285–300.
[37] A. Erba, R. Taormina, S. Galelli, M. Pogliani, M. Carminati, S. Zanero, N. O. Tippenhauer,
     Constrained concealment attacks against reconstruction-based anomaly detectors in indus-
     trial control systems, in: ACSAC ’20: Annual Computer Security Applications Conference,
     Virtual Event / Austin, TX, USA, 7-11 December, 2020, ACM, 2020, pp. 480–495. URL:
     https://doi.org/10.1145/3427228.3427660. doi:10.1145/3427228.3427660.
[38] A. Continella, M. Carminati, M. Polino, A. Lanzi, S. Zanero, F. Maggi, Prometheus:
     Analyzing webinject-based information stealers, Journal of Computer Security 25 (2017)
     117–137.
[39] S. V. Lab, Imagenet, http://www.image-net.org/, 2020.
[40] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, Glue: A multi-task benchmark
     and analysis platform for natural language understanding, Association for Computational
     Linguistics, Brussels, Belgium, 2018, pp. 353–355. URL: https://www.aclweb.org/anthology/
     W18-5446. doi:10.18653/v1/W18-5446.
[41] D. Chicco, G. Jurman, The advantages of the matthews correlation coefficient (mcc)
     over f1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020).
     doi:10.1186/s12864-019-6413-7.


A. Metrics
Fraud Detection is a classification problem in which the classes are very unbalanced. Therefore,
the usual classification metrics, such as accuracy, could lead to wrong conclusions. For example,
a model predicting that all the transactions are legitimate in a dataset containing 1% of frauds
have a 99% of accuracy but will not detect frauds. Hence, we propose the use of the following
metrics, more appropriate to assess the quality of a model in an unbalance scenario as fraud
detection

A.0.1. Precision and Recall
Precision measures the number of predicted frauds that are actually frauds. It is similar to the
accuracy of the positive class.
                                                      𝑡𝑝
                                    𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =                                              (3)
                                                   𝑡𝑝 + 𝑓 𝑝
  Recall measures how many frauds are detected from the total number of frauds. Provides an
indication of missed frauds.
                                                    𝑡𝑝
                                      𝑅𝑒𝑐𝑎𝑙𝑙 =                                                (4)
                                                 𝑡𝑝 + 𝑓 𝑛
  Both metrics are related. Often, increases in one metric imply decreasing the other.

A.0.2. Curves
There are two curves:
    • Receiver operating characteristic, or ROC, shows the behaviour of the system for different
      thresholds.
    • Precision-Recall, or PR, shows the tradeoff between precision and recall.

   ROC curve relates True Positive Rate with False Positive Rate. Given a desired FPR, the model
is better as higher the TPR is. A common metric to measure the quality of the curves is the area
under the curve or AUC. Higher values indicate better models.

A.0.3. Matthews correlation coefficient
Also called phi coefficient, Matthews correlation coefficient measures the correlation between
the observed and predicted classification.
  As shown in Equation A.0.3, MCC takes into account all the metrics given by the confusion
matrix. It is regarded as one of the most reliable statistics for imbalance problems [41].


                                    𝑇𝑃 * 𝑇𝑁 − 𝐹𝑃 * 𝐹𝑁
               𝑀 𝐶𝐶 = √︀                                                                        (5)
                        (𝑇 𝑃 + 𝐹 𝑃 )(𝑇 𝑃 + 𝐹 𝑁 )(𝑇 𝑁 + 𝐹 𝑃 )(𝑇 𝑁 + 𝐹 𝑁 )


A.0.4. F-score
F-score is a statistical accuracy measure. It is calculated from the Precision and Recall, as shown
in Equation 6.

                                                  𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 * 𝑟𝑒𝑐𝑎𝑙𝑙
                          𝐹𝛽 = (1 + 𝛽 2 ) *                                                     (6)
                                              (𝛽 2 * 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛) + 𝑟𝑒𝑐𝑎𝑙𝑙

𝛽 is a weighting parameter. In our case, we use the 𝐹1 score, which is the same as the harmonic
median between Precision and Recall.

A.0.5. False Positive Rate
The False Positive Rate, also known as fall-out, is the ratio between the number of misclassified
negative samples and all negatives samples. Equation 7 shows its calculation, FP is the number
of False Positive, and TN the number of True Negatives.

                                                   𝐹𝑃
                                      𝐹𝑃𝑅 =                                                     (7)
                                                 𝐹𝑃 + 𝑇𝑁