Regular Pattern and Anomaly Detection on Corporate Transaction Time Series

Francesca Soro, Marco Mellia
Politecnico di Torino, Torino, Italy
francesca.soro@polito.it, marco.mellia@polito.it

Nicolò Russo
CEE CIB Innovation, UniCredit S.p.A., Vienna, Austria
Nicolo.Russo2@unicredit.eu

ABSTRACT

Business applications make extensive usage of time series analysis for the most diverse tasks. By analyzing the development of a phenomenon over time we can gain useful insights for stock market forecasting, analyze the risk related to investments, understand the behavior of a company on the market, and so on. More specifically, in a corporate investment banking environment, analyzing the transaction history of a customer over the years is crucial to establish a fruitful relationship and adapt to its behavioural changes. In this environment we recognize three macro-categories of phenomena of interest: cyclic events, sudden and significant changes in trend, and isolated anomalous points. In this paper we present a framework to automatically spot these behaviors by means of simple - yet effective - machine learning techniques. We observe that cyclic behaviors and sudden changes can be easily targeted by means of adaptive threshold algorithms, while unsupervised machine learning techniques are the most reliable in detecting isolated anomalies. We design and test our algorithms on actual transactions collected in the past two years from more than 2,000 customers of UniCredit Bank, showing the efficiency of our solution. This work is intended to serve as a decision aid tool for corporate investment banking employees, to facilitate the inspection of years of transactions and ease the visualization of interesting events in the customer history.

This work has been supported by the SmartData@PoliTO center on Big Data and Data Science and was developed in collaboration with UniCredit S.p.A. Zweigniederlassung, Wien.
Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Time series are commonly defined as points collected at regularly spaced instants in time. Such a representation makes them particularly suitable to describe a large number of daily-life and real-world phenomena that vary over time. In the business field, for instance, they are extensively used to, among many other applications, evaluate risk, forecast future behaviors, predict stock price changes, or detect anomalies in different types of transactions. In this work we focus on the latter application: we exploit, fine tune, evaluate and integrate into a comprehensive framework some well-known anomaly detection techniques to spot both typical and unusual behaviors in corporate banking transaction data [21]. Apart from raising awareness in the presence of possible fraudulent events, detecting anomalies in this environment may draw attention to a customer undergoing changes in ownership or management, choosing to operate in new markets, supplying new customers or adopting new suppliers, moving parts of its business relationships to a new bank, reducing or enlarging the number of employees, etc. Bearing in mind these possibilities, we need to distinguish among three kinds of behaviors of interest:

• Cyclic phenomena: e.g., repeated peaks in the number of payments, representing salary or supplier payments;
• Continuously increasing or decreasing trends in the incoming/outgoing amounts and counterparts, hinting at underlying policy changes;
• Single isolated anomalies that need further investigation.

In this paper we present different techniques to target each of these cases, spanning from simple heuristics to a combination of Machine Learning (ML) models, to serve as a tool to compare the algorithms' outcomes and advertise the most meaningful cases. The final aim of this work is to compare existing algorithms and techniques to detect anomalies in banking transaction data, and to evaluate their performance, balancing the simplicity of the solution with its reliability. It should be clear that the final objective of the framework is not to substitute human supervision, but to serve as an instrument aiding the relationship managers in judging corporate clients' behaviour and raising attention on unusual movements. We apply these techniques to a very large dataset of actual banking data. Given the impossibility of disclosing actual banking data, it is not easy, to the best of our knowledge, to find a comprehensive work that provides insight into the effectiveness of ML techniques in this environment and on anomalies showing different characteristics. As such, we are among the first to provide a practical solution in this area.

This paper is organized as follows: in Section 2 we discuss similar use cases and applications already present in the literature. In Section 3 we describe the raw dataset and the preprocessing steps applied before feeding the data to the algorithms. In Section 4 we provide a theoretical overview of the basic concepts underlying the applied methodologies. Section 5 contains a discussion and evaluation of the resulting outputs. Section 6 concludes the paper.

2 RELATED WORK

Financial modelling and business use cases make extensive use of time series analysis techniques. The authors of [22] enumerate some of the most relevant applications: interest rates, growth rate of the gross domestic product, inflation, index of consumer confidence, unemployment rate, trade imbalance, corporate earnings, book-to-market ratio, etc. The authors of [6] present a visualization tool that aggregates transactions recorded by Bank of America, to serve as an aid in spotting the first signs of money laundering activities. [11] reports a survey of the machine learning techniques most used in stock forecasting, providing some outlines on how to build an extensive set of input features. Another common field of application is the evaluation of investment risk. This aspect is addressed in [24], where the authors focus on noise prediction to give information about its influence on trading behavior. The authors of [21] propose the usage of a regression classifier operating on the tensor representation of High-Frequency Trading data.
The analysis of such transactions and the modelling of their waiting time is also addressed by the usage of random variables in [18]. Time series can be enriched with other quantitative data: the work in [27] exploits a wider set of company-related financial indices, cash flow information and industry-specific variables to perform bankruptcy prediction. The above-mentioned solutions are developed to model single specific cases, and often fail in capturing the similarities among phenomena. Moreover, detecting anomalies by using time series prediction outcomes implies a very strong assumption, i.e., that the data used to train the model do not contain anomalies. Since this is not always the case in a real-world scenario, unsupervised algorithms can represent a suitable solution to group together and spot similar patterns, isolating the most interesting anomalous ones. The work in [13] provides a useful partition of the time series clustering approaches: raw-data-based, feature-based and model-based (i.e., respectively giving in input to the model the raw dataset, some features extracted from it, or modelling the coefficients or the residuals). The same work summarizes some of the most common clustering evaluation metrics. The authors of [12] provide an example of model-based clustering of time series, which models residuals by means of a Gaussian distribution. Many of the mentioned solutions are very specifically tailored to the problem they target, and are hence difficult to generalize to a wider set of use cases. In this paper we propose a comparative analysis of common off-the-shelf algorithms for anomaly detection applied to a large dataset of actual corporate investment banking transactions. The final aim is to provide techniques to tackle such a problem efficiently and from different perspectives, according to the operator's needs, sticking to easy-to-implement and easy-to-understand algorithms commonly known by data science practitioners. The output of our analysis aims at facilitating the business relationship between the bank and the customer, allowing the relationship managers, who are not ML experts, to know in advance and adapt to eventual changes in the customer's activities.
3 DATASET

The case study reported in this paper takes advantage of a large dataset recording years of corporate customers' payment transactions. We define a transaction as a single payment directed to or operated by a given customer entity. The original dataset consists of more than 50,000,000 transactions, characterized by 35 different fields. All the transactions are reported from the point of view of the customer: it is either a beneficiary, i.e., a payment recipient, or a source, i.e., it is carrying the payment out. For this, we take into account the direction of the transaction and separate "incoming" from "outgoing" transactions. In total, we count about 500,000 customers, many of which recorded only a small number of transactions, and are hence not so relevant. To filter them, we keep only those exceeding a monthly threshold T = 100 of minimum incoming or outgoing payments, depending on the targeted application. Given the transactions involving each client, we group them by time interval, computing (i) the sum of the amounts, and (ii) the number of unique counterparts. We set a different time granularity (i.e., daily, weekly, monthly, quarterly) according to the required task and the needs of the analysts.

To provide an overview of the data, we report an example of the periodic payment phenomena we are looking for in Figure 1. In this case we are able to distinguish the payments of salaries from those directed to suppliers thanks to an algorithm internally developed by the company, called Company or Physical Person, or CoPP. By means of a Random Forest algorithm, it labels the transaction recipients based on a set of input features and flags such as the presence of company-related stop words (e.g., GmbH, S.r.l., S.p.a., etc.), the number of characters in the name, the volume of the transactions or the presence of a numerical value in the name of the beneficiary. In output, it classifies the payment transactions into those directed to suppliers and those directed to employees, respectively in blue and green in Figure 1, with an average accuracy of 96.3%. As visible in the plot, salaries constitute the largest number of transactions, and look periodical. Supplier payment transactions tend to be more spread over the considered time period and generally lower in number.

[Figure 1: Output of CoPP algorithm classification. Number of distinct beneficiaries per day, salaries vs. suppliers, October 2017 to July 2019.]

If we take a look at the distributions of the outgoing amounts per month, reported in Figure 2 for the same customer, we can see that, despite being fewer in number, the transactions to suppliers (top plot) constitute the largest part of the overall outgoing amount, exceeding the salary payments (bottom plot) by two orders of magnitude. Moreover, from the spikes in Figure 2(b), we could infer that the customer in this example may pay a 13th salary to its employees.

[Figure 2: Amount distribution towards physical persons and companies. (a) Company amount distribution; (b) Physical person amount distribution. Monthly outgoing amounts, August 2017 to July 2019.]

Figure 3 reports, on the other hand, an example of a client showing a significantly decreasing trend. Such behavior should raise the attention of the relationship manager and lead to the investigation of the business relationships and events that originated this outcome.

[Figure 3: Example of client showing decreasing trend in incoming counterparts count. Monthly counterpart counts, August 2017 to July 2019.]

To serve as input to the proposed supervised and unsupervised models, we further enrich the time series data to build a complete set of features useful for inspection. We report all the features and their brief descriptions in Table 1. Please note that for confidentiality and privacy reasons we omit some of the details regarding the original dataset, and we cannot report complete examples.

Table 1: Lags dataset

    Field               Value
    y                   Target variable
    date                Date, daily granularity
    y_t-D               Value of y at day d-D (D={1,..,6})
    monthly_avg         Average value of y per month
    weekly_avg          Average value of y per week
    is_<dayofweek>      Flag, 1 if date is <dayofweek>
    prev_W_weeks        Value of y at week w-W (W={1,..,3})
    prev_M_months       Value of y at month m-M (M={1,..,3})
    quarter_avg         Average value in quarter
    neigh_4_days        Average value of y at +/-2 days
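As a reference for how such a dataset can be assembled, the following sketch derives both the per-customer daily aggregation described above and the lag features of Table 1 with Pandas (the library we use throughout, see Section 5). It is a minimal illustration rather than our production pipeline, and the raw column names (date, amount, counterpart_id, direction) are hypothetical, since we cannot disclose the actual schema.

import pandas as pd

def daily_series(tx: pd.DataFrame) -> pd.DataFrame:
    """Aggregate one customer's outgoing transactions per day:
    (i) sum of the amounts, (ii) number of unique counterparts."""
    out = tx[tx["direction"] == "outgoing"]
    daily = out.groupby(pd.Grouper(key="date", freq="D")).agg(
        amount_sum=("amount", "sum"),
        n_counterparts=("counterpart_id", "nunique"),
    )
    return daily.fillna(0)

def lag_features(daily: pd.DataFrame, target: str = "n_counterparts") -> pd.DataFrame:
    """Build the lag dataset of Table 1 for the chosen target variable y."""
    df = pd.DataFrame({"y": daily[target]})
    for d in range(1, 7):                       # y_t-D, D = 1..6
        df[f"y_t-{d}"] = df["y"].shift(d)
    df["monthly_avg"] = df["y"].groupby(df.index.to_period("M")).transform("mean")
    df["weekly_avg"] = df["y"].groupby(df.index.to_period("W")).transform("mean")
    df["quarter_avg"] = df["y"].groupby(df.index.to_period("Q")).transform("mean")
    for dow in range(7):                        # is_<dayofweek> flags
        df[f"is_{dow}"] = (df.index.dayofweek == dow).astype(int)
    for w in range(1, 4):                       # value one..three weeks earlier
        df[f"prev_{w}_weeks"] = df["y"].shift(7 * w)
    for m in range(1, 4):                       # value one..three months earlier (~30 days)
        df[f"prev_{m}_months"] = df["y"].shift(30 * m)
    # neigh_4_days: average of y at +/-2 days, excluding the day itself
    df["neigh_4_days"] = (df["y"].shift(2) + df["y"].shift(1)
                          + df["y"].shift(-1) + df["y"].shift(-2)) / 4
    return df.dropna()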
4 OBJECTIVES AND METHODOLOGIES

In this section we provide a brief description of the adopted algorithms. The heuristic techniques in Section 4.1 allow us to recognize customers showing periodically repeated phenomena or sudden trend changes. Later, we compare a set of supervised and unsupervised techniques to spot isolated single-point anomalies. We assume the reader is familiar with ML algorithms.

4.1 Periodicity and trend detection

We use heuristic techniques to detect two types of clients: the ones who periodically pay salaries or suppliers, and the ones who show a steep trend (either ascending or descending). As previously said, spotting these kinds of customers is relevant, as it allows the bank to highlight changes, e.g., in the company business operations or in the personnel composition. For these applications, we use the per-customer time series as input.

4.1.1 Salaries and suppliers payments detection. For this case study, we focus only on the number of unique counterparts towards which the customer performs transactions. An (almost) regular pattern in the count of such operations is a clear sign that the customer is paying employee salaries, or suppliers. Normally such transactions appear as visible periodical spikes whose height does not show significant changes over time. This characteristic makes such behavior easy to spot by means of a simple adaptive threshold algorithm. Given the existing time series, we compute the maximum number of counterparts registered every year, namely max_y_counterpart. We then define the threshold τ = 0.8 · max_y_counterpart, and test every row against this threshold. We output 1 if the number of counterparts for that day exceeds the threshold, 0 otherwise. To decide that a customer shows a cyclic behavior, we check that the threshold is exceeded at least once per month for at least 85% of the considered months (i.e., 20 months out of 24). Although not reported here due to space limitations, the choice of parameters proved robust, and the suggested values have been selected as best candidates.
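A compact sketch of this heuristic follows, assuming the input is a Pandas Series of daily counterpart counts indexed by date; the function name and signature are ours, for illustration only.

import pandas as pd

def is_cyclic(counterparts: pd.Series, ratio: float = 0.8,
              min_fraction: float = 0.85) -> bool:
    """Flag a customer as cyclic if, in at least `min_fraction` of the
    observed months, the daily counterpart count exceeds the adaptive
    threshold (0.8 * yearly maximum) at least once."""
    # Adaptive threshold tau, recomputed per calendar year.
    tau = counterparts.groupby(counterparts.index.year).transform("max") * ratio
    over = counterparts > tau
    # At least one over-threshold day per month...
    monthly_hit = over.groupby(counterparts.index.to_period("M")).any()
    # ...for at least 85% of the considered months (e.g., 20 out of 24).
    return monthly_hit.mean() >= min_fraction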
4.1.2 Trend detection. We run the trend detection procedure on the count of both outgoing and incoming counterparts. Since a daily aggregation is not suited to detect a trend change, we aggregate the data considering their average per quarter of the year, as typically done in economic fields. We then compute the variations that interested the same quarters across all the available years (namely δQ1, δQ2, δQ3, δQ4). As a following step, we further aggregate the counterpart counts by month, and we then fit to such time series a simple univariate linear regression model:

    y_i = α + β · x_i + ϵ_i    (1)

We take into account the value of β and the p-value. The former gives information on the slope of the fitted line, the latter on the correlation of the target with the given regressor. We only consider as statistically significant those customers having p-value ≤ 0.05. We then define two subsets of customers: those having a relevant increase in the trend for at least a quarter (i.e., δQi ≥ 30%), and those showing a relevant decrease (i.e., δQi ≤ −30%). This combination of thresholds set on the δQi and on the p-value allows us to filter customers with clear trends. At the end of this stage, we notify the relationship manager with a list of customers to carefully monitor.
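The sketch below summarizes this step under stated assumptions: SciPy's linregress performs the univariate fit of Equation (1) and directly returns the slope β and its p-value, while the quarterly and monthly means are computed with Pandas. The way the δQi filter and the regression output are combined is simplified with respect to the full procedure.

import numpy as np
import pandas as pd
from scipy.stats import linregress

def trend_label(counterparts: pd.Series, delta: float = 0.30,
                alpha: float = 0.05) -> str:
    """Return 'increasing', 'decreasing' or 'none' for a customer."""
    # Quarterly averages, and variation of the same quarter across years.
    quarterly = counterparts.groupby(counterparts.index.to_period("Q")).mean()
    by_quarter = quarterly.groupby(quarterly.index.quarter).pct_change()
    # Monthly aggregation for the linear fit y_i = alpha + beta * x_i + eps_i.
    monthly = counterparts.groupby(counterparts.index.to_period("M")).mean()
    fit = linregress(np.arange(len(monthly)), monthly.values)
    if fit.pvalue > alpha:               # not statistically significant
        return "none"
    if (by_quarter >= delta).any() and fit.slope > 0:
        return "increasing"
    if (by_quarter <= -delta).any() and fit.slope < 0:
        return "decreasing"
    return "none"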
4.2 Isolated anomalous points detection

Heuristics and simple linear regression models do not target isolated anomalies equally well. For this purpose we realize a framework including standard algorithms for time series forecasting, plus supervised and unsupervised techniques. We describe the specifics of our implementation below. A detailed discussion of the techniques is out of the scope of this work.

4.2.1 ARIMA models. Time series forecasting is an extensively used technique to tackle the anomaly detection problem ([17], [7], [23], [9]). Its final objective is to provide a prediction of the future values of a time series based on its past values. By defining a reasonable confidence interval for the predicted values, we identify as anomalies the points that fall outside such an interval. Here we focus on Auto Regressive Integrated Moving Average (ARIMA) models. Every ARIMA model requires as input a stationary time series and three fundamental parameters: p, the number of autoregressive terms, d, the order of the differencing term, and q, the number of moving average terms. As a general best practice, the values of p, d and q are usually kept below 3. We hence run a grid search with parameters ranging from 0 to 3. We instantiate an ARIMA model for each combination and choose the best model by calculating the Mean Squared Error between the actual values and the predictions yielded by every (p, d, q) triplet. Given the best model, we compute the upper and lower boundaries of the confidence interval for each prediction, and we flag as anomalous every point falling outside such boundaries.
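A sketch of this grid search is shown below. The text does not name a forecasting library, so statsmodels is an assumption here; the train/test split and the interval width are also illustrative.

import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def arima_anomalies(y: pd.Series, n_test: int, alpha: float = 0.05) -> pd.Series:
    """Grid-search (p, d, q) in 0..3 by MSE on a held-out tail, then flag
    test points falling outside the forecast confidence interval."""
    train, test = y.iloc[:-n_test], y.iloc[-n_test:]
    best_mse, best_order = np.inf, (1, 0, 0)
    for order in itertools.product(range(4), repeat=3):   # p, d, q in 0..3
        try:
            pred = ARIMA(train, order=order).fit().forecast(steps=n_test)
        except Exception:                                  # skip invalid combos
            continue
        mse = np.mean((np.asarray(pred) - test.values) ** 2)
        if mse < best_mse:
            best_mse, best_order = mse, order
    # Refit the best model and compute the per-step confidence interval.
    forecast = ARIMA(train, order=best_order).fit().get_forecast(steps=n_test)
    ci = forecast.conf_int(alpha=alpha)                    # lower/upper bounds
    outside = (test.values < ci.iloc[:, 0].values) | (test.values > ci.iloc[:, 1].values)
    return pd.Series(outside, index=test.index)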
4.2.2 Supervised techniques. All the following techniques use as input the dataset described in Table 1, properly standardized. Similarly as before, we consider well-accepted ML algorithms that we train to predict the next value ŷ. We run hyperparameter selection, and compare the prediction ŷ with the actual value y. Differently from ARIMA models, we do not have a standard way to compute confidence intervals, thus we rely on domain-knowledge-driven heuristics to flag outliers. In detail, we define a set of threshold-based control criteria to label a point as anomalous: the anomaly is advertised if at least one criterion is triggered. We list all the criteria below:

• Criterion 1 (multiplicative): KO if (y > τ · ŷ) ∧ (|y − ŷ| ≥ σ_min), OK otherwise
• Criterion 2 (additive): KO if (y − ŷ) > k · σ_rolling, OK otherwise
• Criterion 3 (multiplicative positive): KO if (y > ŷ) ∧ Criterion 1, OK otherwise
• Criterion 4 (additive absolute): KO if |y − ŷ| ≥ k · σ_rolling, OK otherwise

where τ and k are multiplicative thresholds we manually tune, σ_min is the minimum monthly standard deviation of the label variable, and σ_rolling is the rolling standard deviation of the label computed month by month.
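For clarity, the four criteria translate directly into a boolean mask, as in the NumPy sketch below; the τ and k defaults are placeholders, not the tuned values, which we do not report.

import numpy as np

def flag_anomalies(y, y_hat, sigma_min, sigma_rolling, tau=2.0, k=3.0):
    """Return a boolean mask: True where at least one criterion triggers.
    y, y_hat and sigma_rolling are aligned arrays; sigma_min is a scalar.
    (tau and k here are placeholder values, not the tuned ones.)"""
    c1 = (y > tau * y_hat) & (np.abs(y - y_hat) >= sigma_min)   # multiplicative
    c2 = (y - y_hat) > k * sigma_rolling                        # additive
    c3 = (y > y_hat) & c1                                       # multiplicative positive
    c4 = np.abs(y - y_hat) >= k * sigma_rolling                 # additive absolute
    return c1 | c2 | c3 | c4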
For the regression, we consider state-of-the-art algorithms. We briefly report the chosen configurations below. We manually tune all the algorithms' parameters, and we hereby report only the resulting best configurations for the sake of space. For the Support Vector Regressor ([25], [4]) we exploit three different kernel functions: linear, polynomial and RBF. All kernels use a regularization parameter C = 100 and ϵ = 0.1, while the polynomial kernel also takes degree = 3. The Stochastic Gradient Descent Regressor [10] exploits the standard concept of stochastic gradient descent to fit linear regression models. We choose to fix the maximum number of iterations to I = 100,000,000, and a stopping criterion on the validation score improvement tol = 1e-10. We then exploit a set of Decision-Tree-based regressors: we first instantiate a simple Decision Tree, which we then use as a building block for an AdaBoost regressor [20] and a Random Forest regressor [1]. For the former we specify that we want to terminate the boosting at n_estimators = 300; for the latter we require a number of trees N = 100 and we set max_depth = 20.
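The configurations above map to scikit-learn estimators as follows; this is a sketch of the reported settings only, with the upstream standardization of the Table 1 features omitted.

from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor

regressors = {
    "SVR-linear": SVR(kernel="linear", C=100, epsilon=0.1),
    "SVR-poly": SVR(kernel="poly", degree=3, C=100, epsilon=0.1),
    "SVR-rbf": SVR(kernel="rbf", C=100, epsilon=0.1),
    "SGDR": SGDRegressor(max_iter=100_000_000, tol=1e-10),
    "DT": DecisionTreeRegressor(),
    "AdaBoost": AdaBoostRegressor(DecisionTreeRegressor(), n_estimators=300),
    "RF": RandomForestRegressor(n_estimators=100, max_depth=20),
}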
4.2.3 Unsupervised techniques. All the implemented unsupervised techniques require the specification of different parameters and outlier identification criteria. The former are algorithm-specific: we combine three cluster quality measures, the Silhouette [19], the Davies-Bouldin index [3] and the Calinski-Harabasz index [2], to choose the best clustering configuration, with more details for each algorithm below. The latter is generic and based on the definition of a threshold p that defines the maximum number of points a cluster should contain to be labeled as anomalous. We found p = 5 to provide the best results for our case study.

We consider four clustering algorithms: k-Means [15], DBScan [5], Hierarchical clustering [16] and Isolation Forest [14]. More in detail, for k-Means we are required to specify the parameter k, whose evaluation is commonly pointed out as a critical aspect of the algorithm itself ([8], [26]). We target this problem by restricting the possible range of k to a reasonable set of values defined according to our domain knowledge: we let k range from 2 to 7. For each k we compute the number of anomalous clusters (i.e., the clusters containing N_j ≤ p points), and among those, we identify the most common number of anomalous clusters by taking the mode of such a column. We choose the best k according to each score. If the algorithm never identifies any anomalous cluster, it returns 0. The advantage of DBScan is that it does not require defining the number of clusters a priori, and it isolates noise points without using the aforementioned threshold p. We automatize the choice of the parameters ϵ and minPoints by first evaluating the distribution of the nearest neighbour distances. Once we define the average most common value of the distances as our ϵ, we calculate the distribution of the number of points falling within this ϵ-neighbourhood, and we choose our minPoints value in the same way. We finally run the algorithm with the selected parameters, and we consider anomalous all the points marked as noise. When using Hierarchical Clustering we again face the problem of defining the correct number of clusters. This depends on the setting of the cutoff level on the obtained output dendrogram. We limit the set of reasonable cutoff levels up to C = 5, and we evaluate the best C by considering the number of anomalous clusters according to the threshold p combined with the best scores. If we are unable to identify small anomalous clusters, we yield 0 as a result. The Isolation Forest algorithm shows as its main strength the fact that it does not require an a-priori definition of the number of clusters, but only the selection of a reasonable number of base trees, as required by some of the previously described supervised algorithms. The main underlying idea of this technique is that anomalies are often isolated points whose identification requires just a few partitions of the feature space to separate them from the more concentrated sets of points. We do not rely on the output of a single tree, but on the average output generated by a set of N = 100 trees.
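As an illustration, the sketch below shows the small-cluster rule applied to k-Means and the nearest-neighbour heuristic we describe for the DBScan parameters; the score-based selection of k, the dendrogram cutoff search and the Isolation Forest wrapper are omitted for brevity. Taking the median as the "average most common value" of the distance distribution is our interpretative choice here.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import NearestNeighbors

P = 5  # clusters with at most P points are labeled anomalous

def kmeans_anomalies(X, k):
    labels = KMeans(n_clusters=k).fit_predict(X)
    sizes = np.bincount(labels)
    return np.isin(labels, np.where(sizes <= P)[0])   # points in tiny clusters

def dbscan_anomalies(X):
    # eps from the distribution of nearest-neighbour distances.
    dist, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    eps = np.median(dist[:, 1])
    # min_samples from the typical eps-neighbourhood size, derived the same way.
    counts = NearestNeighbors(radius=eps).fit(X).radius_neighbors(
        X, return_distance=False)
    min_samples = int(np.median([len(c) for c in counts]))
    labels = DBSCAN(eps=eps, min_samples=max(min_samples, 2)).fit_predict(X)
    return labels == -1                                # noise points are anomalies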
5 CASE STUDY AND RESULTS

In this section we proceed with the description and discussion of the results obtained with the different automatic detection techniques illustrated in Section 4. Please note that, for data confidentiality reasons, we report an indicative range for all the numeric results. We use Python 3.7 (https://www.python.org/downloads/release/python-370/) as a programming language, together with the Pandas (https://pandas.pydata.org/pandas-docs/stable/index.html) and scikit-learn (https://scikit-learn.org/stable/) libraries. All the examples use as input the transactions of a subset of 2,000 customers.

5.1 Periodicity and trend detection

5.1.1 Salaries and suppliers payment detection. As already reported in Section 4.1.1, we define as salaries and supplier payments those outgoing transactions showing regular spikes over time. The height of the spikes is generally almost constant, but it may be subject to changes from time to time. The output of the adaptive threshold heuristic points out that slightly less than 25% of the customers under analysis perform periodical payment transactions. We manually verified about 100 cases and found no evident sign of misclassification. For instance, we report the output for two customers, namely CLI1 and CLI2, in Figure 4. In the two figures we can clearly see how a large number of transactions are concentrated in certain days of the month, and repeated periodically. CLI1, in Figure 4(a), shows an increase in the number of distinct transactions after the beginning of 2019: this may suggest, for instance, a change in the relationships with the suppliers, or in the composition of the workforce with newly hired employees. Our algorithm automatically adapts the threshold τ, helping the relationship manager to detect the change. For instance, we can suppose that the company is growing or trying to enlarge its business. On the other hand, CLI2 in Figure 4(b) shows an almost regular pattern over the analyzed time period, with a variation of ±3 counterparts over the years. The identified peaks are consistent with personal payments as identified by the CoPP algorithm. The output of the salaries detection procedure is meant to be read together with CoPP, to allow the relationship manager to have a clear overview of the customer's business choices and structures. Notice that our solution does not require a labelled dataset, whose construction is an often time-consuming process.

[Figure 4: Output of the salary detection process. (a) Counterparts, CLI1; (b) Counterparts, CLI2. Daily number of distinct beneficiaries against the adaptive threshold, October 2017 to July 2019.]

5.1.2 Trend detection. For the sake of space, we discuss the output of the heuristic reported in Section 4.1.2 for the first and last quarter variations, namely δQ1 and δQ4. We consider incoming and outgoing counterparts separately. Out of the 2,000 clients originally in scope, 20% show an increasing trend in the incoming counterparts and 7.5% show a decreasing trend. Considering outgoing counterparts, 20% of customers show an increasing trend, 6% a decreasing trend. Figure 5 reports some examples of the automatically detected behaviors for 3 different customers. As visible, all of them show a clear decreasing trend, correctly identified by the algorithm. Also in this case, we manually verified the results, finding no errors in the classification.

[Figure 5: Customers with relevant changes in trend for outgoing counterparts count. Monthly counterpart counts for CLI1, CLI2 and CLI3, October 2017 to July 2019.]

In the case of trend detection, we also present to the relationship manager a set of aggregated statistics on the quarter variations. Figure 6 shows, for instance, the δQ1 (in red) and δQ4 (in blue) distributions per turnover bucket. In the figure we observe a positive growth from one year to the following in most of the buckets. The exception to be highlighted is the case of small companies (i.e., the ones having a turnover between 0 and 500,000 RON), which show a very significant growth in Q1. The Relationship Manager should pay attention to the bucket 500k - 1MLN, showing a significant decrease.

[Figure 6: Delta Q1 and Q4 distribution for outgoing counterparts per turnover bucket. Buckets range from 0-500K up to >10MLN; y axis: delta distribution (%).]
Table 2: ARIMA and supervised algorithms scores

                 ARIMA   ADA    DT     RF     SGDR   Lin    Poly   RBF
    Accuracy     0.75    0.88   0.89   0.73   0.51   0.61   0.28   0.54
    Precision    0.06    0.19   0.18   0.14   0.08   0.11   0.01   0.03
    Recall       1       0.89   0.89   0.89   0.4    0.58   0.79   0.88

Table 3: Unsupervised algorithms scores

                 DBScan   Agglomerative   Isolation   K-means
    Accuracy     0.99     0.98            0.97        0.99
    Precision    0.52     0.51            0.19        0.77
    Recall       0.63     0.62            0.48        0.8
5.2    Isolated anomalous points                                                 dealing with corporate customers.

For the sake of space, in this Section we discuss the results for
a subset of 30 clients whose time series passed the stationarity
test. We run all of the considered algorithms using 80% of the
original dataset for the training phase, and the remaining 20%
for testing (i.e., roughly the last three months of transactions).
Since we do not know whether anomalies are actually present, we
evaluate the reliability of the predictors through the insertion of
artificial anomalies. In particular, we add anomalies to the testing
set by extracting a random subsample of L = 10 events from such
set (i.e., about 10% of the instances of the testing set). We iterate
over the whole time series, adding one anomaly at a time. We want
the anomaly to be clearly out of the standard range of the time
series; therefore, we randomly choose an entry e and modify it as:

                   e* = e · N(7, 2.5) + k · max(ts)                  (2)

where N is a normal distribution with mean µ = 7 and standard
deviation σ = 2.5, and k is a multiplicative coefficient that we set
equal to 2.
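For concreteness, a minimal numpy sketch of this injection step (the variable names are ours, not part of the original pipeline) is the following:

    import numpy as np

    def inject_anomaly(ts, idx, rng, mu=7.0, sigma=2.5, k=2.0):
        # Implements Equation (2): e* = e * N(mu, sigma) + k * max(ts).
        perturbed = ts.copy()
        perturbed[idx] = ts[idx] * rng.normal(mu, sigma) + k * ts.max()
        return perturbed

    rng = np.random.default_rng(42)
    ts = np.abs(rng.normal(1000, 200, size=365))       # toy daily amounts
    test_idx = np.arange(int(0.8 * len(ts)), len(ts))  # last 20% as test set
    for idx in rng.choice(test_idx, size=10, replace=False):  # L = 10 events
        perturbed = inject_anomaly(ts, idx, rng)       # one anomaly at a time
        # ... run each detector on `perturbed` and check whether idx is flagged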
    We now evaluate the obtained outputs. Recall that a point is
considered anomalous by ARIMA if it falls outside the confidence
interval boundaries; by the supervised models, if it falls outside the
defined threshold boundaries; and by the unsupervised models, if
it belongs to a small cluster or is recognized as a noise point. We
should further point out that we retrain every model from scratch
for each client.
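As an example of the first criterion, the sketch below flags test points falling outside a 95% confidence band (statsmodels is an assumption on our side, and the ARIMA order is an illustrative choice, not the one used in our pipeline); the supervised criterion is analogous, with the band replaced by the prediction plus or minus a fixed threshold:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(1)
    series = np.abs(rng.normal(1000, 200, size=365))
    split = int(0.8 * len(series))                 # 80/20 train/test split
    train, test = series[:split], series[split:]

    # Fit on the training part and forecast the test part together
    # with its 95% confidence interval.
    result = ARIMA(train, order=(1, 0, 1)).fit()
    forecast = result.get_forecast(steps=len(test))
    lower, upper = forecast.conf_int(alpha=0.05).T

    # A test point is anomalous when it falls outside the band.
    flags = (test < lower) | (test > upper)
    print("flagged positions:", np.where(flags)[0] + split)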
Tables 2 and 3 report the performance metrics for all the algorithms.
These metrics allow us to safely discard the Support Vector-based
algorithms and the Stochastic Gradient Descent Regressor: they
either miss most of the true anomalies (small recall, as for the
SGDR) or mark a very large share of points as anomalous (near-zero
precision, as for the polynomial and RBF kernels). This second
problem is common to all the supervised approaches, since all of
them present a very low precision. In practice, the "noisy" time
series does not allow the regressor to correctly predict the next
value, which too often results in an outlier and raises false alarms.
The unsupervised models show instead a better average behavior,
which stems from the fact that they advertise anomalies only when
their classification is very confident. This makes them more precise:
a simple k-means identifies 80% of the anomalies (recall of 0.8)
with a precision of 0.77. These results are consistent also for
different values of (µ, σ, k), not reported here for the sake of
brevity.
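To illustrate the unsupervised criterion together with the metric computation, the following sketch flags DBSCAN noise points and very small clusters as anomalies (the scaling, eps and cluster-size cut-off are illustrative assumptions, not the tuned values of our experiments; for simplicity it also injects all ten anomalies at once, whereas our procedure adds one at a time):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import precision_score, recall_score

    def unsupervised_flags(values, eps=0.5, min_samples=5, small=3):
        # Anomalous = DBSCAN noise (label -1) or member of a tiny cluster.
        x = ((values - values.mean()) / values.std()).reshape(-1, 1)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(x)
        sizes = {c: int(np.sum(labels == c)) for c in set(labels)}
        return np.array([c == -1 or sizes[c] < small for c in labels])

    rng = np.random.default_rng(7)
    series = np.abs(rng.normal(1000, 200, size=300))   # toy daily amounts
    truth = np.zeros(len(series), dtype=bool)
    injected = rng.choice(len(series), size=10, replace=False)
    # Perturb the chosen entries following Equation (2).
    series[injected] = (series[injected] * rng.normal(7, 2.5, size=10)
                        + 2 * series.max())
    truth[injected] = True

    flags = unsupervised_flags(series)
    print("precision:", precision_score(truth, flags))
    print("recall:", recall_score(truth, flags))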
6    CONCLUSIONS
In this paper we presented a case study on anomaly detection in
corporate investment banking transaction data. The anomalies have
been divided into three categories according to their general
characteristics, and targeted with the most appropriate set of
techniques, spanning from simple adaptive threshold heuristics to
several types of machine learning algorithms. We demonstrated that
phenomena such as salaries and periodic supplier payments can be
reliably spotted by means of an adaptive threshold algorithm, while
a standard linear regression comes in handy when major changes in
trend need to be detected. We further provided a comparative
analysis of the performance of well-known machine learning
algorithms in spotting isolated anomalies, whose results make us
lean towards the usage of unsupervised algorithms. All the results
are presented in a way that lets them serve as a decision-aid tool
for the bank employees who need easy-to-read, easy-to-understand
outputs when dealing with corporate customers.