Anomaly Detection Using Similarity-based One-Class SVM for Network Traffic Characterization

Bouchra Lamrini1, Augustin Gjini1, Simon Daudin1, François Armando1, Pascal Pratmarty1 and Louise Travé-Massuyès2
1 LivingObjects, Toulouse, France
e-mail: {bouchra.lamrini,augustin.gjini,simon.daudin,françois.armando,pascal.pratmarty}@livingobjects.com
2 LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
e-mail: louise@laas.fr

Abstract

In this paper, we investigate an unsupervised machine learning method based on one-class Support Vector Machines for anomaly detection in network traffic. In the absence of any prior expert knowledge on anomalous data, we propose the use of a similarity measure for Multivariate Time Series to evaluate the output results and select the best model. A set of Key Performance Indicators, oriented towards network and traffic monitoring, has been used to demonstrate the promising performance of the unsupervised learning approach.

Index Terms: Anomaly Detection, Support Vector Machines (SVMs), One-Class SVMs, Unsupervised Learning, Model Selection, Similarity Measure, Multivariate Time Series (MTS).

1 Introduction

Anomaly detection aims at identifying unusual patterns in data that do not conform to expected behavior [1]. Depending on the application field, these non-conforming patterns are referred to as anomalies, aberrations, discordant observations, exceptions, novelties, outliers, peculiarities, contaminants, surprises, or strangeness. Applications range from intrusion detection, e.g. identifying strange patterns in network traffic that could signal a hack [2], to system health monitoring, e.g. spotting a malignant tumor in an MRI image scan [3], and from fraud detection in credit card transactions [4] to fault detection in operating environments [5]. In this paper we are interested in anomaly detection in network traffic.

Support Vector Machines (SVMs) have been one of the most successful machine learning techniques and can be used in a variety of classification applications. SVMs perform at least as well as other methods in terms of the generalization error [6]. Many factors contribute to the high popularity of SVMs today. First of all, their theoretical foundations have been deeply investigated and they come with a convex optimization procedure ensuring that the global optimum will be reached. Moreover, the solution is sparse, making it very efficient in comparison to other kernel-based approaches [7]. In addition, they may use a non-linear transformation in the form of a kernel, which even allows SVMs to be considered as a dimensionality reduction technique [8].

One-Class SVMs [9] have been devised for cases in which only one class is known and the problem is to detect anything outside this class. This is known as novelty detection and refers to the automatic identification of unforeseen or abnormal phenomena [1; 10; 11], i.e. outliers, embedded in a large amount of normal data. In contrast to traditional SVMs, One-Class SVMs (OC-SVM) learn a decision boundary that achieves maximum separation between the samples of the known class and the origin [12]. Only a small fraction of data points are allowed to lie on the other side of the decision boundary: those data points are considered as outliers.

Anomaly detection is particularly important for network traffic. The observed growth rate of informational and economic damage caused by intentional or unintentional attacks, faults, and anomalies has been a driving force behind efforts to ensure that network monitoring systems are able to detect abnormal activity and characterize it.
The limitations of computing and storage resources require competence and ingenuity to effectively characterize ever-changing network traffic trends.

Non-availability of labelled data, the high cost of constituting labelled training data, and the need to identify anomalous and novel observations without necessarily having seen an example of that behaviour in the past are the main challenges tackled in this work. A central issue is model selection, i.e. the choice of the optimal hyper-parameters that define the OC-SVM learning configuration. This requires a method to evaluate the results.

This paper contributes to this problem by evaluating the results of the trained model through a comparison of the samples predicted normal with the samples of the training set. Because samples are composed of a set of signals over a temporal window, we propose to use a similarity index for Multivariate Time Series (MTS) called Eros (Extended Frobenius Norm). The results of the model are evaluated iteratively for different hyper-parameters of OC-SVM and the model that evaluates best is selected. We show that OC-SVM in combination with the Eros index (see Section 4.1) can create automatically tuned, reliable classifiers at a reasonable computation cost.

The remainder of this paper is organized as follows. Section 2 presents our case study. Section 3 is devoted to an overview of SVMs and One-Class SVMs. In Section 4, we present the Eros similarity measure used to search for the best training model for anomaly detection. Experimental results and related discussions are provided in Section 5 to demonstrate the performance of the approach. Finally, Section 6 concludes the paper.

2 Case Study

Processing network traffic involves dealing with an immense amount of data that is quickly and constantly varying. Considering the enormous amount of data involved, it is very easy for malicious activities to go undetected, especially without any a priori knowledge about the nature of the traffic, as is often the case in the network domain.

In this study, data was collected from a real-time monitoring platform dedicated to ensuring key application performance. 51 sites using applications of the same kind, with roughly the same uses at the same time, were chosen. For each application, we carefully selected four relevant Key Performance Indicators (KPIs):

1. Total Incoming Traffic.
2. Total Outgoing Traffic.
3. Server Delay, i.e. the connection time to the server, which sets the expiration time for sending a request.
4. Network Delay, which specifies how long it takes for a bit of data to travel across the network from one node to another.

A history of two months (3408 samples) of data generated at a 5-minute rate was collected for each of the four KPIs above and for each site. Data was segmented into time-windows wi, each of 48 points. Figure 1 shows the data contained in the training set. Each time-window wi ∈ {w0, ..., wl}, l = 70, is identified between vertical lines, and there are 70 samples.

Figure 1: Training Data Set. The y-axis represents the KPI signals. From top to bottom: Total Incoming Traffic, Total Outgoing Traffic, Server Delay, and Network Delay.

Let us notice that the data samples provided to the OC-SVM classification algorithm are multivariate, each composed of four KPI segments over the same time-window. The idea is to detect insidious problems that require an analysis of the signals' underlying interactions. We therefore want to detect “abnormal windows”.
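As a concrete illustration of this segmentation step, the sketch below splits a two-month KPI history into consecutive 48-point windows. It assumes the history of one site is available as a pandas DataFrame with one column per KPI; the column names, the function name and the use of pandas are our own assumptions, not details given by the authors.

```python
import pandas as pd

# Hypothetical column names; the paper only names the four indicators.
KPI_COLUMNS = ["incoming_traffic", "outgoing_traffic", "server_delay", "network_delay"]
WINDOW_SIZE = 48  # 48 points at a 5-minute rate = 4 hours


def segment_into_windows(kpi_history: pd.DataFrame, window_size: int = WINDOW_SIZE):
    """Split a chronologically ordered KPI history into consecutive,
    non-overlapping windows of `window_size` samples each.
    An incomplete trailing window is dropped so that every sample given
    to the classifier covers exactly the same duration."""
    n_windows = len(kpi_history) // window_size
    return [
        kpi_history.iloc[i * window_size:(i + 1) * window_size][KPI_COLUMNS]
        for i in range(n_windows)
    ]


# Example: the 3408 samples of one site yield the seventy-odd 4-hour
# windows w_i described above.
# windows = segment_into_windows(site_history)
```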
Each segment is characterized by seven statistical attributes: minimum (MIN), maximum (MAX), mean (MEAN), median (MED), standard deviation (STD), number of average crossings (nbpMean), and squared mean error (SME) computed between the raw data and the linear fit. These attributes are used by OC-SVM after normalization. Figure 2 shows the attributes built for the 70 segments of the four KPIs mentioned above. The seven attributes MEAN, MED, MAX, MIN, STD, nbpMean, and SME are illustrated in the sub-figures from left to right. We can already notice some overruns that will make a considerable contribution to the classifier profile defined by the decision function.

Figure 2: Attributes built for training data KPIs. Each point is calculated on a segment of 48 points.
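The attribute computation can be sketched as follows. This is a non-authoritative illustration: the authors do not give formal definitions of nbpMean and SME, so counting crossings of the segment mean and taking the mean squared residual of a first-order least-squares fit are our assumptions.

```python
import numpy as np


def segment_attributes(segment: np.ndarray) -> np.ndarray:
    """Compute the seven statistical attributes of one KPI segment
    (a 1-D array of 48 raw values)."""
    mean = segment.mean()
    # Number of times the signal crosses its own mean (assumed meaning of nbpMean).
    crossings = np.count_nonzero(np.diff(np.sign(segment - mean)))
    # Squared mean error between the raw data and a first-order (linear) fit.
    t = np.arange(len(segment))
    slope, intercept = np.polyfit(t, segment, deg=1)
    sme = np.mean((segment - (slope * t + intercept)) ** 2)
    return np.array([
        segment.min(),        # MIN
        segment.max(),        # MAX
        mean,                 # MEAN
        np.median(segment),   # MED
        segment.std(),        # STD
        crossings,            # nbpMean
        sme,                  # SME
    ])


def window_feature_vector(window) -> np.ndarray:
    """window: one 48-point DataFrame with one column per KPI, as produced
    by the segmentation sketch above. The 7 attributes of the 4 KPI segments
    are concatenated into a single 28-dimensional feature vector; min-max
    normalization (Section 5) is applied afterwards on the whole training
    matrix, e.g. with sklearn.preprocessing.MinMaxScaler."""
    return np.concatenate([segment_attributes(window[c].to_numpy())
                           for c in window.columns])
```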
3 An Unsupervised Similarity-based Method for Anomaly Detection

Support Vector Machines (SVMs) have always been of interest for anomaly detection because of their ability to provide non-linear classification through a kernel function. Via this short overview, we show that SVMs are theoretically well founded. We briefly introduce the basic concepts of SVMs and then focus on the OC-SVM adopted in this study. A more detailed presentation can be found in [13], and a good example using the "LibSVM" library for Matlab is available online.

3.1 Support Vector Machines

Let us consider traditional two-class support vector machines in which we are given a set of n training instances S = {(x1, y1), (x2, y2), ..., (xn, yn)}, xi ∈ R^d, where yi is the class label of instance xi and yi ∈ {−1, +1}. The linear SVM classifier recovers an optimal separating hyperplane maximizing the "margin" of the classifier, with equation w^T x + b = 0, where w ∈ F and b ∈ R are two parameters which determine the position of the decision hyperplane in the feature space F (its orientation is tuned by w and its displacement by b). The decision function can thus be generally written as:

f(x; w, b) = sign(w^T x + b) ∈ {−1, +1}    (1)

where sign(w^T x + b) = +1 if (w^T x + b) ≥ 0, and −1 otherwise.

The concept of SVMs is to find (w, b) such that the hyperplane is positioned at maximum distance from the nearest training samples of the two classes, in order to reduce the generalization error. This distance defines the "margin". SVMs were first proposed for linearly separable classification tasks, but they have been extended to non-linearly separable classification problems. Some samples are allowed to violate the margin (soft-margin SVMs) and a non-linear decision boundary can be obtained by projecting the data into a higher-dimensional space thanks to a non-linear function Φ(x). Data points may not be linearly separable in their original space but they are "lifted" into a feature space F where a hyperplane can separate them. When that hyperplane is projected back into the input space, it has a non-linear shape. To prevent the SVM classifier from over-fitting noisy data, slack variables ξi are introduced to allow some data points to lie within the margin, and the parameter C > 0 in (Eq. 2) tunes the trade-off between the classification error on the training data and margin maximization. The objective function of SVM classifiers has the following minimization formulation:

min_{w, b, ξ}  ‖w‖²/2 + C Σ_{i=1}^{n} ξi    (2)

subject to: yi (w^T Φ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., n.

The minimization problem is solved using Lagrange multipliers αi, i = 1, ..., n. The new decision rule for a data point x is:

f(x) = sign( Σ_{i=1}^{n} αi yi K(x, xi) + b )    (3)

Every αi > 0 is weighted in the decision function and thus "supports the machine". Since SVMs are considered to be sparse, there are relatively few Lagrange multipliers with a non-zero value.

The function K(x, xi) = Φ(x)^T Φ(xi) is known as the kernel function. Since the outcome of the decision function only relies on the dot-products of the vectors in the feature space F (i.e. all the pairwise distances between the vectors), it is not necessary to perform an explicit projection. As long as a function K provides the same results, it can be used instead. This is known as the kernel trick. Popular choices for the kernel function are linear, polynomial, and sigmoidal. In this study, we used the Gaussian Radial Basis Function:

K(x, xi) = exp( −‖x − xi‖² / (2σ²) )    (4)

where σ ∈ R is a kernel parameter and ‖x − xi‖ is the dissimilarity measure. With this set of formulas and concepts we are able to classify a set of data points into two classes with a non-linear decision function. The power of the method comes from using kernel functions, which enable it to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, simply by computing the inner products between the images of all pairs of data points in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates.

Figure 3 illustrates a non-linearly separable data set classified by SVMs with two different kernel functions: linear and radial basis. The observations are plotted in blue or magenta depending on the class, and the background is darker as the distance from the hyperplane increases. Scores are given in the bottom-right corners and show a significant increase for the non-linear kernel.

Figure 3: SVM results with two kernel functions.
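For readers who wish to reproduce an experiment in the spirit of Figure 3, the sketch below trains soft-margin SVMs with a linear and an RBF kernel on a non-linearly separable toy data set. It uses scikit-learn as a convenient substitute for the Matlab LibSVM example mentioned above; this choice, and the toy data set, are our own.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric classes: not separable by a hyperplane in the input space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")  # gamma is ignored by the linear kernel
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))

# The RBF kernel implicitly lifts the data into a feature space where a
# separating hyperplane exists, hence its much higher score.
```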
3.2 One-Class Support Vector Machines

One-Class SVMs (OC-SVMs) are used to separate the data of one specific class, the target class, from other data. They are trained with positive examples only, i.e. data points from the target class. There are two different approaches: the approach of Schölkopf [13], which is presented in the next paragraph, and that of Tax and Duin [14].

In the feature space F, the OC-SVM method basically separates all the data points from the origin by a hyperplane and maximizes the distance of this hyperplane to the origin. This results in a binary function which captures the region of the input space where the training data lives: the function returns +1 in a "small" region (capturing the training data points) and −1 elsewhere. The quadratic programming minimization function is slightly different from the original one stated in (Eq. 2) and (Eq. 3):

min_{w, ξ, ρ}  ‖w‖²/2 + (1/(ηn)) Σ_{i=1}^{n} ξi − ρ    (5)

subject to: w · Φ(xi) ≥ ρ − ξi,  ξi ≥ 0,  i = 1, ..., n.

Schölkopf et al. [13] reformulated SVMs to take the new regularization parameter η instead of C in the original formulation (Eq. 2 and Eq. 3). The range of C is from zero to infinity, but η always lies in [0, 1]. η characterizes the solution in a nicely interpretable way: (1) it sets an upper bound on the fraction of outliers, i.e. the training examples regarded as out-of-class, and (2) it sets a lower bound on the fraction of training examples used as support vectors.

Again, by using Lagrange techniques and a kernel function for the dot-product calculations, the decision function becomes:

f(x) = sign( w · Φ(x) − ρ ) = sign( Σ_{i=1}^{n} αi K(x, xi) − ρ )    (6)

OC-SVMs thus create a hyperplane, characterized by w and ρ, which has maximal distance from the origin in the feature space F, hence separating all the data points from the origin.
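In practice this formulation is available off the shelf. A minimal sketch with scikit-learn's OneClassSVM is given below (the tooling is our assumption, the paper does not name its implementation); the parameter nu plays the role of the regularization parameter η and gamma that of the RBF kernel parameter.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import OneClassSVM


def fit_ocsvm(X_train: np.ndarray, gamma: float, nu: float):
    """X_train: (n_windows, 28) attribute matrix built as in Section 2,
    assumed to contain (mostly) normal behaviour only."""
    scaler = MinMaxScaler().fit(X_train)
    model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu)
    model.fit(scaler.transform(X_train))
    return scaler, model


def predict_windows(scaler, model, X: np.ndarray) -> np.ndarray:
    """Return +1 for windows inside the learned support region, -1 for outliers."""
    return model.predict(scaler.transform(X))
```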
4 Similarity-based Performance Evaluation for Model Selection

In this section, we address the problem of fitting the hyper-parameters of OC-SVM automatically, that is, the problem of automatic model selection. In the case of OC-SVM, this amounts to choosing the kernel parameter γ and the regularization parameter η. A pair (γi, ηj) is defined as a learning configuration.

For this purpose, we propose to run OC-SVM for several learning configurations and select the best configuration by evaluating the similarity between the KPI signals of the windows tagged normal by OC-SVM and the KPI windows of the training data, which are assumed to be normal examples. Since a sample window is composed of several KPI signals, we need a multidimensional similarity index for Multivariate Time Series (MTS).

4.1 The Similarity Index Eros

Multidimensional similarity measures aim to indicate the level of similarity between several datasets (databases, data clusters, etc.) simultaneously. Unlike other methods [15; 16; 17] that assess the level of similarity between two variables while omitting the existing correlation between the set of variables, a multidimensional method takes into account the contribution of each variable in defining a global similarity measure. One of the methods designed for MTS is Eros (Extended Frobenius Norm) [18]. The interest of this method lies in its ability to assess the similarity of MTS composed of a different number of data points. It indeed uses the eigenvalues and eigenvectors of the covariance matrix, which has size n × n, n being the number of time series composing the MTS. In doing so, it also performs dimension reduction, because the number of observations is generally higher than the number of variables.

We briefly describe the similarity index Eros, based on the Frobenius norm, below. The definitions and notations used in this paper are taken from [19]. We first formally define the similarity index Eros. Next, we present the algorithm describing the similarity measure procedure and the approach proposed for model selection.

Definition 1. Eros (Extended Frobenius Norm). Let A and B be two MTS items of size mA × n and mB × n respectively. Let VA and VB be the two right eigenvector matrices obtained by applying Singular Value Decomposition (SVD) to their covariance matrices MA and MB, respectively. Let VA = [a1, ..., an] and VB = [b1, ..., bn], where ai and bi are column-orthonormal vectors of size n. The Eros similarity of A and B is then defined as:

Eros(A, B, w) = Σ_{i=1}^{n} wi |⟨ai, bi⟩| = Σ_{i=1}^{n} wi |cos θi|    (7)

where ⟨ai, bi⟩ is the inner product of ai and bi, w is a weight vector based on the eigenvalues of the MTS dataset with Σ_{i=1}^{n} wi = 1, and θi is the angle between ai and bi. The range of Eros is between 0 and 1, with 1 indicating the most similar.

Definition 2. Singular Value Decomposition. Let A be a general real m × n matrix. The singular value decomposition (SVD) of A is the factorization:

A = U Σ V^T    (8)

where U is a column-orthonormal m × r matrix, r is the rank of the matrix A, Σ is a diagonal r × r matrix of the eigenvalues γi of A with γ1 ≥ ... ≥ γr ≥ 0, and V is a column-orthonormal n × r matrix. The eigenvalues and the corresponding eigenvectors are sorted in non-increasing order. V is called the right eigenvector matrix, and U the left eigenvector matrix.

Yang and Shahabi [18] describe the similarity index algorithm with the following steps:

1. Compute the covariance matrix of each MTS.
2. Use SVD to decompose each covariance matrix.
3. Recover the eigenvalues and eigenvectors.
4. Compute the weight vector w by normalizing the eigenvalues [18].
5. Compute the similarity between MTS.

4.2 Automatic Model Selection

The first task is to define the learning configurations that will be tested with OC-SVM. We follow the steps below:

1. Define the hyper-parameter space and a procedure to explore this space. In our case, we set a min-max range and a variation step to constitute a grid of (β × β) value pairs, i.e. β values for each hyper-parameter.
2. Explore the hyper-parameter space and set OC-SVM accordingly: for each pair of values, one OC-SVM classifier is obtained after the learning step. The best configuration is retained by using the Eros similarity index on the validation data (25% of all data) and the training data (50% of all data). The corresponding OC-SVM classifier is taken as the best model.
3. Once the best model is found, anomaly detection is performed on new data to evaluate how well the model behaves.

The similarity between the windows tagged normal by OC-SVM, denoted MTS^normal_k, k = 0, ..., p, and the data windows of the training data (considered normal), denoted MTS^learn_l, l = 1, ..., q, is obtained as follows. For every learning configuration (γi, ηj) (Figure 4):

1. Compute Eros for every window pair (MTS^normal_k, MTS^learn_l), k = 0, ..., p, and l = 1, ..., q.
2. Compute the average similarity "Erosmean" over all the window pairs.

The best learning configuration is taken as the one leading to the maximal "Erosmean" value over all considered learning configurations (γi, ηj).

Figure 4: Diagram showing the model selection process.
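To make the procedure concrete, here is a minimal sketch of the Eros computation and of the selection loop, under stated assumptions rather than as the authors' implementation: the weight vector is taken as the normalized mean eigenvalues of the training windows (one of the weighting schemes discussed in [18]), and fit/predict stand for callables wrapping the attribute-extraction and OC-SVM steps sketched earlier.

```python
import numpy as np


def right_eigen(mts: np.ndarray):
    """mts: (m, n) multivariate time series (m observations, n KPIs).
    Returns the right eigenvector matrix V and the eigenvalues of the
    covariance matrix of mts, sorted in non-increasing order."""
    cov = np.cov(mts, rowvar=False)
    _, eigvals, vt = np.linalg.svd(cov)
    return vt.T, eigvals


def eros(mts_a: np.ndarray, mts_b: np.ndarray, w: np.ndarray) -> float:
    """Extended Frobenius Norm similarity (Eq. 7); result lies in [0, 1]."""
    va, _ = right_eigen(mts_a)
    vb, _ = right_eigen(mts_b)
    cosines = np.abs(np.sum(va * vb, axis=0))  # |<a_i, b_i>| column by column
    return float(np.sum(w * cosines))


def select_best_configuration(train_windows, valid_windows, gammas, nus, fit, predict):
    """Grid search over (gamma, nu): keep the configuration whose validation
    windows predicted normal are, on average, the most Eros-similar to the
    training windows (Section 4.2)."""
    # Weight vector from the eigenvalues of the training windows (assumption).
    mean_eigvals = np.mean([right_eigen(w_)[1] for w_ in train_windows], axis=0)
    weights = mean_eigvals / mean_eigvals.sum()

    best, best_score = None, -np.inf
    for gamma in gammas:
        for nu in nus:
            model = fit(train_windows, gamma, nu)
            labels = predict(model, valid_windows)
            normal = [w_ for w_, lab in zip(valid_windows, labels) if lab == +1]
            scores = [eros(a, b, weights) for a in normal for b in train_windows]
            eros_mean = np.mean(scores) if scores else -np.inf
            if eros_mean > best_score:
                best, best_score = (gamma, nu, model), eros_mean
    return best, best_score
```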
5 Experiments on the Case Study

Our detection approach was applied to the case study presented in Section 2. A history of two months of data generated every 5 minutes for four KPIs was collected over 151 sites. The window segmentation [20] was performed after analyzing two points that can significantly impact the detection stage:

• the choice of the time-window length, i.e. the number of hours and samples to take into account in a window,
• the definition of a reliable methodology to normalize the training and testing datasets.

As mentioned in Section 2, the time-window length was chosen as 4 hours, i.e. 48 samples. Since access to the web applications is typically established over a few hours, a window of four hours is considered a significant period for traffic analysis. As noted above, each time-window is characterized by seven statistical attributes: minimum (MIN), maximum (MAX), mean (MEAN), median (MED), standard deviation (STD), number of average crossings (nbpMean), and squared mean error (SME) computed between the raw data and their linear fit. The attributes are computed for each time-window in order to obtain a multidimensional scatter plot in which each point represents a time-window. One of the major interests of segmentation and feature computation is to synthesize the information contained in a time-window. This allows the detection not only of singular points, but also of an atypical set of points, even if each point taken individually appears normal.

The acquired raw data provide KPIs with different ranges, so the features (attributes) themselves do not have homogeneous ranges. In order to guarantee good performance of the anomaly detection approach, we chose to normalize these attributes with respect to their maximal and minimal values, with a tolerance set by a threshold s ∈ [0, 1]. This standard preprocessing ensures that all the attributes contribute equally to the decision process, independently of the parameters responsible for the KPI dynamics.

To automatically select the best model, the hyper-parameter space was discretized with a 10 × 10 grid, i.e. β = 10. 100 learning configurations were therefore evaluated to select the best model. This off-line task was performed for each application site and proved computationally feasible.

Figure 5 shows some of the test results (25% of all data). Out of 24 time-windows (wi ∈ {w0, ..., wm}, m = 23), 4 anomalies were detected, represented by the 4 yellow-colored time-windows w3, w4, w10 and w12. The results were confirmed with the Parallel Coordinates plots given in Figure 6.

Figure 5: Anomaly detection (yellow windows are detected abnormal). From top to bottom, KPIs appear in this order on the y-axis: Total Incoming Traffic, Total Outgoing Traffic, Server Delay, and Network Delay.

In a Parallel Coordinates Plot, each attribute is given its own axis and all the axes are placed parallel to each other. Values are plotted as a series of line segments that connect across all the axes, so that each line corresponds to one data window for which we have 7 × 4 attributes (7 features for every KPI). The order in which the axes are arranged can impact the way the reader understands the data. One reason for this is that the relationships between adjacent variables are easier to perceive than those between non-adjacent variables, so re-ordering the axes can help in discovering patterns or correlations across variables. We clearly see that the four time-windows drawn as pink lines exhibit a strange behavior compared to the normal windows drawn as green lines.

Figure 6: Parallel Coordinates Plot illustrating the four abnormal windows. From top to bottom, the curves labeled in pink show successively the time-windows w3, w4, w10 and w12.
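A Figure 6-style view can be reproduced with pandas' built-in parallel-coordinates helper, sketched below under the assumption that the per-window attributes are stored in a DataFrame with one row per window; the DataFrame layout and the label values are ours.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates


def plot_window_attributes(features: pd.DataFrame, labels) -> None:
    """features: one row per time-window, one column per attribute (7 x 4 = 28).
    labels: iterable of "normal"/"abnormal" predictions, one per window."""
    df = features.copy()
    df["prediction"] = list(labels)
    # Colours are assigned to classes in their order of first appearance.
    parallel_coordinates(df, class_column="prediction",
                         color=("green", "deeppink"), alpha=0.6)
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()
```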
Presenting this type of detection can help network administrators adopt another line of reasoning to characterize the nature of the traffic circulating on the network (normal, abnormal, critical, ...). It may help them identify the different forms of anomaly present in their network. Data analysis must give meaning to the data, with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. The value of the data lies in the story it tells.

6 Conclusion

In this work, we applied the OC-SVM method to detect anomalies in real network traffic, contributing an automatic method based on the similarity index Eros [19] for setting the hyper-parameters which define the learning configuration. It provided very satisfactory results.

The advantages of novelty detection for complex processes like network traffic are multiple. In particular, there is no need for faulty data: a wide variety of anomaly cases exist and it would be impossible to characterize them all or to gather the corresponding data.

Challenges for future work relate to the fact that data comes in a stream, and dealing with the data in real time is quite tedious. The amount of data leads to cases where resources are limited. Novelty detection in a distributed framework also remains to be investigated.

Acknowledgement

The authors thank Bertrand Le Marec and David Maisonneuve, leading team of LivingObjects, for their support and valuable comments about the application.

References

[1] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):15:1–15:58, 2009.
[2] V. Kumar. Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online, 6(10):1–9, 2005.
[3] C. Spence, L. Parra, and P. Sajda. Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA'01), pages 3–, 2001.
[4] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr), pages 220–226, 1997.
[5] R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD'05), pages 401–410, New York, NY, USA, 2005. ACM.
[6] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[7] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[8] W. Wang, Z. Xu, W. Lu, and X. Zhang. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing, 55(3-4):643–663, 2003.
[9] M.A.F. Pimentel, D.A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[10] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas from immunology. In Proceedings of the 5th International Conference on Intelligent Systems, Reno, Nevada, 1996.
[11] E. Keogh, S. Lonardi, and W. Chiu. Finding surprising patterns in a time series database in linear time and space. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'02), pages 550–556, New York, NY, USA, 2002. ACM.
[12] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[13] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 582–588, 1999.
[14] D.M.J. Tax and R.P.W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
[15] G.E.A.P.A. Batista, W. Wang, and E.J. Keogh. A complexity-invariant distance measure for time series. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2011.
[16] C.A. Ratanamahatana and E.J. Keogh. Making time-series classification more accurate using learned constraints. In Proceedings of the SIAM International Conference on Data Mining (SDM'04), pages 11–22, 2004.
[17] S. Park, W.W. Chu, J. Yoon, and C. Hsu. Efficient searches for similar subsequences of different lengths in sequence databases. In 16th International Conference on Data Engineering, pages 23–32, 2000.
[18] K. Yang and C. Shahabi. A multilevel distance-based index structure for multivariate time series. In 12th International Symposium on Temporal Representation and Reasoning, 2005.
[19] K. Yang and C. Shahabi. A PCA-based similarity measure for multivariate time series. In Proceedings of the Second ACM International Workshop on Multimedia Databases, 2004.
[20] S. Fuertes, G. Picart, J.-Y. Tourneret, L. Chaari, A. Ferrari, and C. Richard. Improving spacecraft health monitoring with automatic anomaly detection techniques. In 14th International Conference on Space Operations, page 2430, 2016.