      Transfer learning for time series anomaly
                      detection

             Vincent Vercruyssen, Wannes Meert, and Jesse Davis

                 Dept. of Computer Science, KU Leuven, Belgium
                      firstname.lastname@cs.kuleuven.be



      Abstract. Currently, time series anomaly detection is attracting sig-
      nificant interest. This is especially true in industry, where companies
      continuously monitor all aspects of production processes using various
      sensors. In this context, methods that automatically detect anomalous
      behavior in the collected data could have a large impact. Unfortunately,
      for a variety of reasons, it is often difficult to collect large labeled data
      sets for anomaly detection problems. Typically, only a few data sets will
      contain labeled data, and each of these will only have a very small number
      of labeled examples. This makes it difficult to treat anomaly detection
      as a supervised learning problem. In this paper, we explore using trans-
      fer learning in a time-series anomaly detection setting. Our algorithm
      attempts to transfer labeled examples from a source domain to a target
      domain where no labels are available. The approach leverages the insight
      that anomalies are infrequent and unexpected to decide whether or not
      to transfer a labeled instance to the target domain. Once the transfer is
      complete, we construct a nearest-neighbor classifier in the target domain,
      with dynamic time warping as the similarity measure. An experimental
      evaluation on a number of real-world data sets shows that the overall
      approach is promising, and that it outperforms unsupervised anomaly
      detection in the target domain.

      Keywords: transfer learning; anomaly detection; time series


1   Introduction

Time series data frequently arise in many different scientific and industrial con-
texts. For instance, companies use a variety of sensors to continuously monitor
equipment and natural resources. One relevant use case is developing algorithms
that can automatically identify time series that show anomalous behavior. Ide-
ally, anomaly detection could be posed as a supervised learning problem. How-
ever, these algorithms require large amounts of labeled training data. Unfor-
tunately, such data is often not available as obtaining expert labels is time-
consuming and expensive. Typically, only a small number of labels are known
for a limited number of data sets. For example, if a company monitors several
similar machines, they may only label events (e.g., shutdown, maintenance...)
for a small subset of them.



   Transfer learning is an area of research focused on methods that extract
information (e.g., labels or knowledge) from one data set and reapply it in
another, different data set. Specifically, the goal of transfer learning is
to improve performance on the target domain by leveraging information from
a related data set called the source domain [10]. In this paper, we adopt the
paradigm of transfer learning for anomaly detection. In our setting, we assume
that labeled examples are only available in the source domains, and that there
are no labeled examples in the target domain. In the machine-monitoring example
above, we would use the label information available for machine A to help
construct an anomaly detector for machine B, for which no labeled points are
available.
   In this paper we study transfer learning in the context of time-series anomaly
detection, which has received less attention in transfer learning [1, 6, 10]. Our
approach attempts to transfer instances from the source domain to the target
domain. It is based on two important and common insights about anomalous
data points, namely that they are infrequent and unexpected. We leverage these
insights to propose two different ways to identify which source instances should
be transferred to the target domain. Finally, we make predictions in the target
domain using a 1-nearest-neighbor classifier for which the transferred instances
are the only labeled data points in the target domain. We experimentally evaluate
our approach on a collection of data sets derived from a real-world data set and
find that it outperforms an unsupervised approach.


2    Problem statement
We can formally define the task we address in this paper as follows:
Given: One or multiple source domains DS with source domain data {XS , YS },
   and a target domain DT with target domain data {XT , YT }, where the in-
   stances x ∈ X are time series and the labels y ∈ Y are ∈ {anomaly, normal}.
   Additionally, only partial label information is available in the source do-
   mains, and no label information in the target domain.
Do: Learn a model for anomaly detection fT (·) in the target domain DT using
   the knowledge in DS , where DS ≠ DT .
Both the source and target domain instances are time series. Thus each instance
$x = \{(t_1, v_1), \ldots, (t_n, v_n)\}$, where $t_i$ is a time stamp and $v_i$ is a single
measurement of the variable of interest $v$ at time $t_i$. The problem has the
following characteristics:
 – The joint distributions of source and target domain data, denoted by pS (X, Y )
   and pT (X, Y ), are not necessarily equal.
 – No labels are known for the target domain, thus YT = ∅. In the source
   domain, (partial) label information is available.
 – The same variable v is monitored in the source and target domain, under
   possibly different conditions (e.g., the same machine in different factories).
 – The number of samples in DS and DT are denoted by nS = |XS | and nT = |XT |,
   respectively; no restrictions are imposed on them.



 – Each time series in DS or DT has the same length d.
 – The source and target domain instances are randomly sampled from the true
   underlying distribution.

3   Context and related work
Several flavors of transfer learning distinguish themselves in the way knowledge is
transferred between source and target domain. In this paper we employ instance-
based transfer learning. The idea is to transfer specific (labeled) instances from
the source domain to the target domain in order to improve learning a tar-
get predictive function fT (·) [6]. In the case of anomaly detection, the target
function is a classifier that aims to distinguish normal instances from anoma-
lous instances. However, care needs to be taken when selecting which instances
to transfer, because transferring all instances could result in degraded perfor-
mance in the target domain (i.e., negative transfer) [8]. A popular solution is
to define a weight for each transferred instance based on the similarity of the
source and target domain. The latter is characterized by the similarity of the
marginal probability distributions pS (X) and pT (X) and/or the similarity of the
conditional probability distributions pS (Y |X) and pT (Y |X). Various ways of
calculating these weights have been proposed [3, 6, 10]. However, the problem
outlined in this paper states that YT = ∅, which is a realistic assumption given
that in practice labeling is expensive. Hence, we cannot easily calculate pT (Y |X).
Furthermore, even if the marginal distributions are different, it can still be ben-
eficial to transfer specific instances. Consider the following: since the target task
is anomaly detection, one wants a classifier that robustly characterizes normal
behavior. Adding a diverse set of anomalies to the training data of the classifier
restricts the learned decision surfaces, decreasing type II errors (i.e., missed
anomalies) when detecting anomalies in new, unseen data.
   The subject of instance-based transfer learning for time series has received
less attention in the literature. Spiegel recently proposed a mechanism for
learning a target classifier using sets of unlabeled time series in various source
domains, without assuming that the source and target domains follow the same
generative distribution or even share the same class labels [7]. However, that
approach requires a limited set of labels in the target domain, whereas we have
YT = ∅.

4   Methodology
In order to learn the model for anomaly detection fT (·) in the target domain, we
transfer labeled instances from different source domains. To avoid situations of
negative transfer (e.g., transferring an instance with the label anomaly that maps
to a normal instance in the target domain), a decision function decides whether
to transfer an instance or not. First, we outline the intuitions behind the decision
function based on two commonly known characteristics of anomalous instances
(Sec. 4.1). Then, we propose two distinct decision functions (Sec. 4.2 and 4.3).
Finally, we describe a method for supervised anomaly detection in the target
domain based on the transferred instances (Sec. 4.4).



4.1   Instance-based transfer learning for anomaly detection

The literature frequently makes two important observations about anomalous
data:

Observation 1 Anomalies occur infrequently [2].

Observation 2 If a model of normal behavior is learned, then anomalies consti-
tute all unexpected behavior that falls outside the boundaries of normal behavior.
This implies that it is impossible to predefine every type of anomaly.

From the first observation we derive the following property:

Property 1 Given a labeled instance (xS , yS ) ∈ DS and yS = normal. If the
probability of the instance under the true target domain distribution pT (xS ) is
high (i.e., the instance is likely to be sampled from the target domain), then the
probability that the true label of the instance in the target domain is normal,
pT (yS = normal|xS ) is also high.

The second observation allows us to derive the reverse property:

Property 2 Given a labeled instance (xS , yS ) ∈ DS and yS = anomaly. If the
probability of the instance under the true target domain distribution pT (xS ) is
low, then the probability that the true label of the instance in the target domain
is anomaly, pT (yS = anomaly|xS ) is high.

Notice that in the latter property the time series xS can have any form, while this
is not true for the first property, where the form is restricted by the distribution
of the target domain data. Given a labeled instance (xS , yS ) ∈ DS that we want
to transfer to the target domain, Property 1 and Property 2 allow us to decide
whether or not to transfer. We can formally define a weight associated with xS
that is high when the transfer makes sense, and low when it would likely cause
negative transfer:
$$w_S = \begin{cases} p_T(x_S) & \text{if } y_S = \text{normal} \\ 1 - p_T(x_S) & \text{if } y_S = \text{anomaly} \end{cases} \qquad (1)$$

However, since each time series xS can be considered as a vector of length d in
$\mathbb{R}^d$ (i.e., it consists of a series of numeric values for the continuous variable v),
the probability of observing exactly xS under the target domain distribution must
be 0. Instead, we calculate the probability of observing a small interval around
$x_S$, such that:

$$p_T(x_S) = \lim_{\Delta I \to 0} \int_{\Delta I} p_T(x)\,dx \qquad (2)$$

where ∆I is an infinitesimally small region around xS in the target domain. This
probability is equal to the true density function over the target domain fT (xS ).
Given that the true target domain density is unknown, we need to estimate it
from the data $X_T$. It has been shown [4] that this estimate $\hat{f}_T(x_S)$ can be
calculated as follows:

$$\hat{f}_T(x_S) = \frac{1}{n_T\,(h_{n_T})^d} \sum_{i=1}^{n_T} K\!\left(\frac{x_S - x_i}{h_{n_T}}\right) \qquad (3)$$
where $K(x)$ is the window function or kernel in the d-dimensional space, with
$\int_{\mathbb{R}^d} K(x)\,dx = 1$. The parameter $h_{n_T} > 0$ is the bandwidth,
corresponding to the width of the kernel, and depends on the number of
observations $n_T$. The estimate $\hat{f}_T(x_S)$ converges to the true density
$f_T(x_S)$ as the number of observations grows, $n_T \to \infty$, under the
assumption that the data $X_T$ are randomly sampled from the true underlying
distribution.
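To make this concrete, the following sketch combines the weight of Eq. 1 with a
Parzen-window estimate of Eq. 3. The uniform box kernel and all function names
are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def parzen_density(x_s, X_T, h):
    """Parzen-window estimate of Eq. 3 with a uniform box kernel:
    K(u) = 1 if every |u_j| <= 1/2, else 0 (integrates to 1 over R^d)."""
    x_s = np.asarray(x_s, dtype=float)
    X_T = np.asarray(X_T, dtype=float)
    n_t, d = X_T.shape
    inside = np.all(np.abs((x_s - X_T) / h) <= 0.5, axis=1)
    return inside.sum() / (n_t * h ** d)

def transfer_weight(x_s, y_s, X_T, h):
    """Weight of Eq. 1; the 1 - p branch assumes the density estimate
    has been rescaled to [0, 1] (the paper does so via Eq. 5)."""
    p = parzen_density(x_s, X_T, h)
    return p if y_s == "normal" else 1.0 - p
```

Note that the 1 − p branch is only meaningful once the density estimate is
normalized to [0, 1], which the paper achieves with the min-max normalization
of Eq. 5 below.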

4.2   Density-based transfer decision function
To guarantee convergence of $\hat{f}_T(x_S)$ to the true density function, the sample
size must increase exponentially with the length d of the time series data. The
reasoning is clear: high-dimensional spaces are sparsely populated by the available
data, making it hard to produce accurate estimates. Gathering that much data is
often infeasible in practice, and for longer time series d is automatically high if
we treat each series as a vector in $\mathbb{R}^d$. As a practical solution, we propose to
reduce the length d of the time series $x_S$ by dividing it into l equal-length
subsequences, each of length m < d. For every subsequence s in $x_S$, the density
is estimated using Eq. 3 with a Gaussian kernel:
$$\hat{f}_{T,m}(s) = \frac{1}{n_T\,(h_{n_T}\sqrt{2\pi})^m} \sum_{i=1}^{n_T} \exp\!\left(-\frac{1}{2}\left\|\frac{s - s_i}{h_{n_T}}\right\|^2\right) \qquad (4)$$

where $h_{n_T}$ is the standard deviation of the Gaussian, and the $s_i$ are the
subsequences of the instances in $X_T$. The Gaussian kernel ensures that instead
of simply counting similar subsequences, each subsequence $s_i$ contributes a
weight based on its kernelized distance to s.
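As an illustration, a minimal sketch of Eq. 4 could look as follows; the
non-overlapping splitting scheme and the helper names are assumptions on our part:

```python
import numpy as np

def split_subsequences(x, m):
    """Divide a length-d series into l = d // m non-overlapping
    subsequences of length m (the splitting scheme is our assumption)."""
    x = np.asarray(x, dtype=float)
    return [x[i:i + m] for i in range(0, len(x) - m + 1, m)]

def subseq_density(s, target_subseqs, h):
    """Gaussian kernel density estimate of Eq. 4 for one subsequence s,
    given all length-m subsequences s_i of the target data and bandwidth h."""
    s = np.asarray(s, dtype=float)
    m, n_t = len(s), len(target_subseqs)
    norm = 1.0 / (n_t * (h * np.sqrt(2.0 * np.pi)) ** m)
    total = sum(np.exp(-0.5 * np.sum(((s - si) / h) ** 2))
                for si in target_subseqs)
    return norm * total
```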
   Estimating the densities for the subsequences yields more accurate estimates
given the reduced dimensionality, but simultaneously results in l = d/m estimates
for each time series $x_S$. Hence, we have to adjust Eq. 1 to reflect this new
situation. We only show the case in which the label yS = normal, as the reverse
case is straightforward:
$$w_S = \frac{1}{Z_{\max} - Z_{\min}} \left( \sum_{i=1}^{l} \hat{f}_{T,m}(s_i) - Z_{\min} \right) \qquad (5)$$

$$Z_{\max} = \max_{x_T \in \{X_T \cup\, x_S\}} \sum_{s_j \in x_T} \hat{f}_{T,m}(s_j) \qquad (6)$$

The sum of the density estimates over the subsequences is normalized using
min-max normalization, such that $w_S \in [0, 1]$. $Z_{\min}$ is calculated analogously
to $Z_{\max}$ in Eq. 6, taking the minimum instead of the maximum. By setting a
threshold on the final weights, we decide whether or not to transfer.
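Continuing the previous sketch, the weight of Eqs. 5 and 6 with min-max
normalization might be implemented as below; treating the anomaly case as
1 − wS mirrors Eq. 1 and is our reading of the "reverse case":

```python
def density_weight(x_s, y_s, target_series, m, h):
    """Density-based transfer weight (Eqs. 5-6), reusing split_subsequences
    and subseq_density from the previous sketch."""
    target_subseqs = [s for x in target_series
                      for s in split_subsequences(x, m)]

    def summed_density(x):
        return sum(subseq_density(s, target_subseqs, h)
                   for s in split_subsequences(x, m))

    source_sum = summed_density(x_s)
    sums = [summed_density(x) for x in target_series] + [source_sum]
    z_min, z_max = min(sums), max(sums)
    w = (source_sum - z_min) / (z_max - z_min)  # assumes z_max > z_min
    return w if y_s == "normal" else 1.0 - w

# Transfer (x_s, y_s) only if density_weight(...) exceeds the threshold
# (0.5 in the experiments of Sec. 5.2).
```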



4.3   Cluster-based transfer decision function

Our second proposed decision function is also based on the intuitions outlined in
Sec. 4.1. First, the target domain data XT are clustered using k-means clustering.
Second, the resulting set of clusters C over XT is divided into a set of large
clusters, and a set of small clusters according to the following definition [5]:

Definition 1. Given a dataset XT with nT instances, a set of ordered clusters
C = {C1 , ..., Ck } such that |C1 | ≥ |C2 | ≥ ... ≥ |Ck |, and two numeric parameters
α and β, the boundary b between large and small clusters is defined such that
either of the following conditions holds:

$$\sum_{i=1}^{b} |C_i| \geq n_T \times \alpha \qquad (7)$$

$$\frac{|C_b|}{|C_{b+1}|} \geq \beta \qquad (8)$$

LC = {Ci |i ≤ b} and SC = {Ci |i > b} are respectively the set of large and small
clusters, and LC ∪ SC = C.
Furthermore, we define the radius of a cluster as $r_i = \max_{x_j \in C_i} \|x_j - c_i\|^2$,
where $c_i$ denotes the center of cluster $C_i$.
Lastly, a decision is made whether or not to transfer a labeled instance xS
from the source domain. Intuitively, and in line with Observations 1 and 2,
anomalies in XT should fall in small clusters, while large clusters contain the
normal instances. Transferred labeled instances from the source domain should
adhere to the same intuitions. Each candidate instance is assigned to the cluster
$C_i \in C$ for which $\|x_S - c_i\|^2$ is minimized. An instance is then transferred
in only two cases. First, if the instance has label normal, is assigned to a cluster
$C_i \in LC$, and its distance to the cluster center is at most the radius of the
cluster. Second, if the instance has label anomaly and either it is assigned to a
cluster $C_i \notin LC$, or it is assigned to a cluster $C_i \in LC$ but its distance to
the cluster center exceeds the radius. In all other cases there is no transfer.
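A sketch of the full cluster-based decision function follows, assuming
scikit-learn's KMeans and using the parameter values of Sec. 5.2 as defaults;
the boundary search implements Definition 1:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_transfer_decision(x_s, y_s, X_T, k=10, alpha=0.95, beta=4.0):
    """Cluster-based transfer decision (Sec. 4.3). Returns True if the
    labeled source instance (x_s, y_s) should be transferred."""
    X_T = np.asarray(X_T, dtype=float)
    km = KMeans(n_clusters=k).fit(X_T)
    centers, labels = km.cluster_centers_, km.labels_

    # Order clusters by decreasing size and find the boundary b (Def. 1).
    sizes = np.bincount(labels, minlength=k)
    order = np.argsort(-sizes)
    b, cum = k, 0  # default: every cluster is "large"
    for idx, c in enumerate(order):
        cum += sizes[c]
        nxt = sizes[order[idx + 1]] if idx + 1 < k else 0
        if cum >= alpha * len(X_T) or (nxt > 0 and sizes[c] / nxt >= beta):
            b = idx + 1
            break
    large = set(order[:b].tolist())

    # Squared-distance radius r_i of each cluster (Sec. 4.3).
    radius = np.array([
        np.max(np.sum((X_T[labels == c] - centers[c]) ** 2, axis=1))
        if np.any(labels == c) else 0.0
        for c in range(k)])

    # Assign x_s to the cluster with the closest center.
    d2 = np.sum((centers - np.asarray(x_s, dtype=float)) ** 2, axis=1)
    c = int(np.argmin(d2))

    if y_s == "normal":
        return c in large and d2[c] <= radius[c]
    return c not in large or d2[c] > radius[c]
```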


4.4   Supervised anomaly detection in a set of time series

After transferring instances from one or multiple source domains to the target
domain using the decision functions in Sec. 4.2 and 4.3, we can construct a
classifier in the target domain to detect anomalies. Ignoring the unlabeled target
domain data, we only use the set of labeled data $L = \{(x_i, y_i)\}_{i=1}^{n_A}$, with
$n_A$ the number of instances transferred. It has been shown that a one-nearest-
neighbor (1NN) classifier with dynamic time warping (DTW) or Euclidean distance
is a strong candidate for time series classification [9]. To that end, we construct
a 1NN-DTW classifier on top of L to predict the labels of unseen instances.
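For illustration, a self-contained 1NN-DTW sketch is given below; the textbook
O(n·m) DTW recursion is used, and since the paper does not specify a warping
window, none is applied:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance via the classic dynamic program,
    with no warping-window constraint."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def predict_1nn_dtw(x, labeled):
    """Label of the DTW-nearest instance in the transferred set L,
    where labeled is a list of (series, label) pairs (Sec. 4.4)."""
    nearest_series, nearest_label = min(
        labeled, key=lambda pair: dtw_distance(x, pair[0]))
    return nearest_label
```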



5     Experimental evaluation
In this section we aim to answer the following research question:
 – Do the proposed decision functions for instance-based time series transfer
   succeed in transferring useful knowledge between source and target domain?
First, we introduce the unsupervised baseline method to which we will compare
the 1NN-DTW method with instance transfer (Sec. 5.1). Then, we discuss the
data, the experimental setup, and the results (Sec. 5.2).

5.1   Unsupervised anomaly detection in a set of time series
Without instance transfer, the target domain consists of a set of unlabeled time
series $U = \{x_i\}_{i=1}^{n_T}$. Based on the anomaly detection approach outlined in
Kha et al. [5], we introduce a straightforward unsupervised algorithm for anomaly
detection that will serve as a baseline. The algorithm calculates the cluster-based
local outlier factor (CBLOF) for each series in U.
Definition 2. Given a set of large clusters LC and small clusters SC defined over
U (as per Definition 1), the CBLOF of an instance $x_i \in U$, belonging to cluster
$C_i$, is calculated as:

$$\mathit{CBLOF}(x_i) = \begin{cases} |C_i| \times D(x_i, c_i) & \text{if } C_i \in LC \\ |C_i| \times \min_{c_j \in LC} D(x_i, c_j) & \text{if } C_i \in SC \end{cases} \qquad (9)$$

where $D(x_i, c_j)$ denotes the distance between instance $x_i$ and cluster center
$c_j$. Anomalies are then characterized by a high CBLOF.
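A minimal sketch of Eq. 9, assuming Euclidean distance for D and that labels,
centers, and the set of large clusters come from the clustering step of
Definition 1:

```python
import numpy as np

def cblof_scores(X, labels, centers, large):
    """CBLOF of Eq. 9 for every instance in X. labels[i] is the cluster of
    X[i], centers the cluster centers, large the indices of large clusters.
    D is taken to be the Euclidean distance (our assumption)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sizes = np.bincount(labels, minlength=len(centers))
    large_centers = centers[sorted(large)]
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        c = labels[i]
        if c in large:
            d = np.linalg.norm(x - centers[c])  # distance to own center
        else:
            d = np.min(np.linalg.norm(large_centers - x, axis=1))  # nearest large center
        scores[i] = sizes[c] * d
    return scores  # higher CBLOF = more anomalous
```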

5.2   Experiments
Data. Due to the lack of readily available benchmarks for the problem outlined
in Sec. 2, we experimentally evaluate on a real-world data set obtained from
a large company. The provided data detail resource usage continuously tracked
over a period of approximately two years. Since the usage is highly dependent
on the time of day, we can generate 24 (hourly) data sets by grouping the usage
data by hour. Each data set contains about 850 different time series. For a
limited number of these series in each set we possess expert labels indicating
either normal or anomaly.

Experimental setup. In turn, we treat each of the 24 data sets as the target do-
main and the remaining data sets as source domains. We consider transferring
from a single source or multiple sources. Any labeled examples in the target
domain are set aside and serve as the test set. First, the proposed decision
functions are used to transfer instances from either a single source domain or
multiple source domains combined to the target domain. Then, we train both
the unsupervised CBLOF (Sec. 5.1), and supervised 1NN-DTW anomaly de-
tection model that uses the labeled instances transferred to the target domain
(Sec. 4.4). Finally, both models predict the labels of the test set, and we report
   classification accuracy. For the density-based approach, we set the threshold on
the final weights to 0.5. For the cluster-based approach we selected α = 0.95,
β = 4, and k = 10 clusters.
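A hypothetical driver for this protocol is sketched below; the data-set interface
(.labeled, .unlabeled) and all names are our assumptions, and predict_1nn_dtw
refers to the earlier 1NN-DTW sketch:

```python
import numpy as np

def run_experiment(datasets, decide_transfer):
    """Leave-one-out over the 24 hourly data sets (Sec. 5.2), a sketch.
    datasets: dict hour -> object with .labeled [(x, y)] and .unlabeled [x]."""
    accuracy = {}
    for hour, target in datasets.items():
        test = target.labeled  # held-out labeled series in the target
        sources = [d for h, d in datasets.items() if h != hour]
        # Transfer labeled instances from all sources to the target.
        L = [(x, y) for src in sources for x, y in src.labeled
             if decide_transfer(x, y, target.unlabeled)]
        preds = [predict_1nn_dtw(x, L) for x, _ in test]
        accuracy[hour] = float(np.mean(
            [p == y for p, (_, y) in zip(preds, test)]))
    return accuracy
```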


[Figure 1: mean classification accuracy (y-axis, 0.0 to 1.0) per hourly target
data set (x-axis, 00:00 through 22:00), with one curve each for cluster-based,
density-based, and CBLOF.]

   Fig. 1: The graph plots the mean classification accuracy and the standard deviation
   for each of the 24 (hourly) data sets. These statistics are calculated after considering
   7 randomly chosen data sets as source domains, and performing the analysis for each
   combination of source and target. The plot indicates that both transfer approaches
   with 1NN-DTW perform quite similarly, while outperforming the unsupervised method in
   21 of the 24 data sets.

   Evaluation. A limited excerpt of the experimental results is reported in Table
   1. Figure 1 plots the full experimental results in a condensed manner. From the
   results we derive the following observations. First, instance transfer with 1NN-
   DTW outperforms the unsupervised CBLOF algorithm in 21 of the 24 data sets.
Clearly, this indicates that the instances transferred by both decision functions
are useful for detecting anomalies. Second, the transfer works both between
similar and dissimilar domains. To see this, one must know that in our real-world
data set resource usage during the night is very different from usage during
the day. As a result, the data sets at 00:00 and 01:00, for example, are fairly
similar, while the data sets at 21:00 and 15:00 are highly different. From Table
   1 it is clear that this distinction has little impact on the performance of the
   1NN-DTW model. Third, the cluster-based decision function performs at least
   as well as the density-based variant. This is apparent from Figure 1.


Table 1: A limited excerpt of the experimental evaluation. The number of transferred
instances is denoted by nA . Density-based is the density-based decision function with
1NN-DTW anomaly detection. Cluster-based is the cluster-based decision function with
1NN-DTW. CBLOF is the unsupervised anomaly detection. All reported numbers are
classification accuracies on a hold-out test set in the target domain, rounded off. Combo
is the combination of 7 separate, randomly chosen source domains.

                                Cluster-based       Density-based      CBLOF
        Source      Target      nA      Result      nA      Result     Result
        01:00       00:00       14      89%         13      89%         58%
        03:00       00:00       11      79%         11      74%         52%
        21:00       00:00       10      79%          9      58%         52%
        combo       00:00       60      90%         46      85%         63%
        03:00       06:00        6      52%          5      52%         39%
        11:00       06:00       15      56%          8      56%         35%
        21:00       06:00        7      52%          8      48%         39%
        combo       06:00       79      57%         54      44%         35%
        03:00       15:00        6      58%          5      58%         23%
        11:00       15:00       19      65%          9      58%         30%
        21:00       15:00        7      58%          7      58%         19%
        combo       15:00       85      67%         54      54%         27%
        03:00       19:00        6      52%          5      52%         44%
        11:00       19:00       16      60%          8      48%         40%
        21:00       19:00        7      52%          8      44%         40%
        combo       19:00       81      56%         50      48%         44%


6   Conclusion

In this paper we introduced two decision functions to guide instance-based
transfer learning when the instances are time series and the task at hand is
anomaly detection. Both functions are based on two commonly known insights
about anomalies: they are infrequent and unexpected. We experimentally evaluated
the proposed decision functions in combination with a 1NN-DTW classifier by
comparing it to an unsupervised anomaly detection algorithm on a real-world
data set. The experiments showed that the transfer-based approach outperforms
the unsupervised approach in 21 of the 24 data sets. Additionally, both decision
functions lead to similar results.

References
 1. Andrews, J.T., Tanay, T., Morton, E., Griffin, L.: Transfer representation-learning
    for anomaly detection. ICML (2016)
 2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Com-
    puting Surveys (CSUR) 41(3), 1–72 (2009)
 3. Chattopadhyay, R., Sun, Q., Fan, W., Davidson, I., Panchanathan, S., Ye, J.:
    Multisource domain adaptation and its application to early detection of fatigue.
    ACM Transactions on Knowledge Discovery from Data (TKDD) 6(4), 18 (2012)
 4. Fukunaga, K.: Introduction to statistical pattern recognition. Academic press
    (2013)
 5. Kha, N.H., Anh, D.T.: From cluster-based outlier detection to time series discord
    discovery. In: Revised Selected Papers of the PAKDD 2015 Workshops on Trends
    and Applications in Knowledge Discovery and Data Mining-Volume 9441. pp. 16–
    28. Springer-Verlag New York, Inc. (2015)



 6. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowl-
    edge and Data Engineering 22(10), 1345–1359 (2010)
 7. Spiegel, S.: Transfer learning for time series classification in dissimilarity spaces.
    In: Proceedings of AALTD 2016: Second ECML/PKDD International Workshop
    on Advanced Analytics and Learning on Temporal Data. p. 78 (2016)
 8. Torrey, L., Shavlik, J.: Transfer learning. Handbook of Research on Machine Learn-
    ing Applications and Trends: Algorithms, Methods, and Techniques 1, 242 (2009)
 9. Wei, L., Keogh, E.: Semi-supervised time series classification. In: Proceedings of
    the 12th ACM SIGKDD international conference on Knowledge discovery and data
    mining. pp. 748–753. ACM (2006)
10. Weiss, K., Khoshgoftaar, T.M., Wang, D.: A survey of transfer learning. Journal
    of Big Data 3(1), 9 (2016)



