Anomaly Detection in Electrocardiogram Readings with Stacked LSTM Networks

Markus Thill¹, Sina Däubener, Wolfgang Konen¹, and Thomas Bäck²

¹ TH Köln – Cologne University of Applied Sciences, 51643 Gummersbach, Germany
  {markus.thill, wolfgang.konen}@th-koeln.de
² Leiden University, LIACS, 2333 CA Leiden, The Netherlands
  t.h.w.baeck@liacs.leidenuniv.nl

Abstract: Real-world anomaly detection for time series is still a challenging task. This is especially true for periodic or quasi-periodic time series, since automated approaches have to learn long-term correlations before they are able to detect anomalies. Electrocardiography (ECG) time series, a prominent real-world example of quasi-periodic signals, are investigated in this work. Anomaly detection algorithms often have the additional goal to identify anomalies in an unsupervised manner.
   In this paper we present an unsupervised time series anomaly detection algorithm. It learns with recurrent Long Short-Term Memory (LSTM) networks to predict the normal time series behavior. The prediction error on several prediction horizons is used to build a statistical model of normal behavior. We propose new methods that are essential for a successful model-building process and for a high signal-to-noise ratio. We apply our method to the well-known MIT-BIH ECG data set and present first results. We obtain a good recall of anomalies while having a very low false alarm rate (FPR) in a fully unsupervised procedure. We also compare with other state-of-the-art anomaly detectors (NuPic, ADVec).

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Anomaly detection in time series is of increasing importance in many application areas, e.g. health care [6, 4], sensor networks [20, 14] or predictive maintenance [7]. Anomaly detection is a very active research area (for an overview see [3]), yet it is not easy to come up with a general definition of an anomaly. The notion of an anomaly greatly depends on the application area and on characteristics of the time series in question. To learn the characteristics of nominal and anomalous behavior from data, it is often necessary to apply sophisticated methods from natural computing (evolutionary neural networks [11] and immune systems [26]). While some anomalies are simple to detect (e.g. a sudden spike in a relatively constant signal may be detected by simple threshold heuristics), other anomalies are subtle and more complex to detect. This is especially the case for periodic or quasi-periodic time series, where the anomaly may be a time-shifted peak, a peak with a different form or other patterns which only emerge from a long-range analysis of the signal.
   Electrocardiography (ECG) time series constitute a prominent real-world example of quasi-periodic signals. Anomaly detection in ECG readings plays an important role in health care and medical diagnosis. There exist well-maintained databases, e.g. the MIT-BIH database [9], where a large body of data is annotated with numerous types of anomalies which are characterized and verified by medical experts. Automated anomaly detection in such ECG data is still a challenging topic, because the deviations from nominal behavior are often subtle and require long-range analysis. Furthermore, there are considerable signal variations from patient to patient or even within an ECG time series.
   Long short-term memory (LSTM) networks [13], which are a special form of recurrent neural networks (RNN) and thus belong to the class of deep learning methods, have proven to be particularly useful in learning sequences with long-range dependencies. They avoid the vanishing gradient problem [12] and are more stable and better scalable [10] than other RNN architectures. LSTMs have successfully advanced the state-of-the-art in many application areas like language modeling and translation, acoustic modeling of speech, analysis of audio data, handwriting recognition and others [10]. We will use stacked LSTMs as the building block for our ECG time series prediction.
   It is the purpose of the present paper to investigate whether anomaly detection in quasi-periodic ECG time series can be trained in an unsupervised manner, hence without the usage of the anomaly class labels. We will describe in Sec. 2 an LSTM prediction model which is trained to predict over multiple horizons and is applied to time series containing nominal and also rare anomalous data. We observe multidimensional error vectors (one vector for each point in time) and fit a multivariate Gaussian distribution to them. Based on the Mahalanobis distance we can assign to each point in time a probability of being anomalous. Sec. 3 describes our experimental setup and the MIT-BIH Arrhythmia Database used in our experiments. Sec. 4 presents and discusses our results, while Sec. 5 concludes.
1.1   Related Work

Anomaly detection in general has been done with methods from machine learning [3] and more precisely from natural computing: Han and Cho [11] and other works cited therein use evolutionary approaches in optimizing neural networks for the task of intrusion detection. Kieu et al. [15] use deep learning (LSTM, autoencoder) for anomaly detection. [27] uses a multi-resolution wavelet-based approach for unsupervised anomaly detection. Stibor et al. [26] describe an immune-system approach to anomaly detection: They tackle a problem prevalent in anomaly detection (and relevant also for the ECG case): Often only nominal data are available during training, nevertheless the model should later detect anomalies as well. This task, known as one-class classification or negative selection, is solved in [26] with an immune-system approach. In our case we have an unknown, small number of anomalies embedded in nominal data. We describe in Sec. 2.5 a statistical test to find anomalies in an unsupervised, threshold-free manner.
   Much work is devoted to anomaly detection in ECG readings: Several authors use multi-resolution wavelet-based techniques [23, 27]. A novelty-search approach on ECG data is taken in [18] in order to perform unsupervised anomaly classification. Sivaraks et al. [25] use motif discovery for robust anomaly detection.
   The works of Malhotra [19] and Chauhan [4] are closest to our present approach. They describe nicely the general idea of LSTM-based prediction and its application to ECG, motor sensor or power-consumption time series. But the big drawback of [19, 4] is that they need a manual and supervised separation into up to six data sets: training, validation and test sets, which are further subdivided into nominal & anomalous subsets. This means that for a real-world application the ECG data for a new person would need to undergo an expert anomaly classification prior to any training. This will be highly impractical in most application scenarios. Our method instead aims at using the whole body of data for a person to train the LSTMs, without the necessity to have supervised anomaly information.

2   Methods

2.1   LSTM for Time Series Prediction

The learning task is formulated as a time series forecasting problem. Hence, we attempt to train a model which is able to predict future values in the time series. The intuition behind this approach is that the usual quasi-periodic patterns in the ECG time series should be predictable with only small errors, while abnormal behavior should lead to large deviations in the predictions. Although the presented methodology is only applied to ECG data in this paper, it is sufficiently general to be applied to other (predictable) time series as well.

Data Preparation.  Consider a d-dimensional time series of length T. In a first step, it is often advisable to scale or normalize the individual dimensions of the time series. In our setup, each dimension of each ECG signal is scaled into the range [−1, 1]. The training and test samples are generated by extracting sub-sequences of suitable length from the original time series. This is done by sliding a window of length W with a lag of 1 over the time series and collecting the windowed data in a tensor D ∈ R^{T′×W×d_in}, where T′ is the number of sub-sequences. Usually, one would select all d dimensions of the time series, so that d_in = d. Then, the first 80% of the samples of D are selected to form the training data X_train. The remaining 20% are used as a test set X_test to later compute an unbiased error estimate. While the inputs are d_in-dimensional, the output targets for each time step have the dimension m, since one can select multiple (m) prediction horizons for one time series. Technically, it is also possible to predict several time series dimensions with d_out ≤ d simultaneously in one model; however, we found in our experiments that the results do not improve in this case (for the investigated ECG time series data). The targets ~y_t ∈ R^m are future values of the selected signal at times t + h_i for i ∈ {1, ..., m}, where the horizons are specified in H = (h_1, h_2, . . . , h_m). Since we follow a many-to-many time series prediction approach, where the algorithm performs a prediction at each instance of time t, the tensor containing the target signals has the shape R^{T′×W×m} with T′ = T − W − max(H) + 1. T′ is the same for the input and output tensor. As before, the first 80% of the targets are used for training (Y_train) and the remaining targets for the test set (Y_test).
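For illustration, the windowing step just described can be sketched in NumPy as follows. This is not the original implementation of the paper; the function names (scale_minmax, build_windows) and the exact array layout are assumptions that merely follow the shapes D ∈ R^{T′×W×d_in} and R^{T′×W×m} defined above.

```python
import numpy as np

def scale_minmax(x):
    """Scale each dimension of the time series into [-1, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def build_windows(x, y, W, H):
    """Create the input tensor and the target tensor by sliding a window of
    length W with a lag of 1 over the series.

    x: (T, d_in) input series (e.g. MLII and V5), y: (T,) target channel (MLII),
    H: tuple of prediction horizons (h_1, ..., h_m).
    Returns X with shape (T', W, d_in) and Y with shape (T', W, m),
    where T' = T - W - max(H) + 1.
    """
    T, m = x.shape[0], len(H)
    Tp = T - W - max(H) + 1
    X = np.stack([x[i:i + W] for i in range(Tp)])
    # many-to-many targets: at window position i and in-window step j,
    # the k-th target is y[i + j + h_k]
    Y = np.stack([np.stack([y[i + h:i + h + W] for h in H], axis=-1)
                  for i in range(Tp)])
    return X, Y

# 80/20 split into training and test data, e.g.:
# n_train = int(0.8 * len(X)); X_train, X_test = X[:n_train], X[n_train:]
```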
Model Architecture and Training.  A stacked LSTM architecture [13] with L = 2 layers is used to learn the prediction task. Each layer consists of u = 64 units. A dense output layer with m units and a linear activation generates the predictions for the specified prediction horizons in H. The net is trained with the sub-sequences of length W taken in mini-batches of size B from the training inputs X_train and targets Y_train. 10% of the training data are held out for the validation set. The LSTM model is trained by using the Adam optimizer [16] to minimize the mean-squared-error (MSE) loss. Other loss functions, such as log-cosh (logarithm of the hyperbolic cosine) and MAE (mean absolute error), were tested as well and produced similar results for our data. Early stopping is applied to prevent overfitting of the model and to reduce the overall time required for the training process. For this purpose the MSE on the validation set is tracked. For most of the investigated time series, 10-20 epochs are sufficient to reach a minimum of the validation error.
only small errors, while abnormal behavior should lead to
large deviations in the predictions. Although the presented
methodology is only applied to ECG data in this paper, it       2.2   Modelling the Residuals with a multivariate
is sufficiently general to be applied to other (predictable)          Gaussian Distribution
time series as well.
[Figure 1: two stacked density plots, panels "Uncorrected" and "Corrected"; x-axis: error, y-axis: density.]

Figure 1: Gaussian fit without and with the removal of outliers in the tails of the error distribution. Exemplarily, this is shown for one dimension of the overall error distribution. The blue curve shows the empirical error distribution. The red curve depicts the estimated Gaussian.

Computing the Prediction Errors for the full Time Series.  After the LSTM prediction model is trained, the whole time series X ∈ R^{T×d_in} of length T is passed through the model and a tensor Ŷ of shape R^{T×m} is predicted. Then, the prediction errors E = Y − Ŷ are calculated, where Y contains the true target values for the horizons in H. Now, each row i in the matrix E represents an error vector ~e_i ∈ R^m.

Removing Outliers from the Prediction Errors.  We noticed in our initial experiments that it was not possible to find good Gaussian fits for the individual dimensions of the prediction errors in E. An example for this is shown in Figure 1, upper part. This was due to the fact that the tails of the error distributions contained many outliers, which significantly distorted the estimated fit. Hence, we decided to remove the outliers in the tails of each dimension (only during the Gaussian modeling phase). We could find good solutions by discarding the upper and lower 3% quantile in each dimension. In our experiments we observed that this approach usually removes slightly more than 20% of the data records.

Estimating a Multivariate Gaussian Distribution.  After removing the outliers from the prediction errors E, the errors are roughly Gaussian distributed and the parameters of a multivariate Gaussian distribution can be estimated. For this purpose the covariance matrix Σ and mean vector ~µ are computed for the cleaned matrix E. Then, the squared Mahalanobis distance

   M(~e_i) = (~e_i − ~µ)^T Σ^{−1} (~e_i − ~µ)                    (1)

to the mean vector ~µ is determined for each error vector ~e_i in E.
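The trimming and the Gaussian fit can be sketched as follows; this is an illustrative NumPy implementation of the steps described above and of Eq. (1), not the original code, and the function names are assumptions.

```python
import numpy as np

def fit_error_model(E, q=0.03):
    """Fit mu and Sigma on the trimmed prediction errors (sketch).

    E: (T, m) matrix of error vectors ~e_i. Rows with a value outside the
    [q, 1-q] quantile range of any dimension are discarded for the fit only
    (this typically removes slightly more than 20% of the rows).
    """
    lo = np.quantile(E, q, axis=0)
    hi = np.quantile(E, 1.0 - q, axis=0)
    keep = np.all((E >= lo) & (E <= hi), axis=1)
    mu = E[keep].mean(axis=0)
    Sigma = np.cov(E[keep], rowvar=False)
    return mu, Sigma

def mahalanobis_sq(E, mu, Sigma):
    """Squared Mahalanobis distance of Eq. (1) for every error vector in E."""
    diff = E - mu
    Sinv = np.linalg.inv(Sigma)
    return np.einsum("ij,jk,ik->i", diff, Sinv, diff)
```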
                                                                      output dimension k ∈ {1, . . . , m} the target values yt,k are
                                                                                                                       (win)
                                                                      compared to the neighbored predictions ŷ ∈ Ŷt,k and the
Estimating a multivariate Gaussian Distribution After                 prediction with the smallest absolute error is effectively
removing the outliers from the prediction errors E, the er-           taken:
rors are roughly Gaussian distributed and the parameters
of a multivariate Gaussian distribution can be estimated.                            ŷt,k ← arg min |yt,k − ŷ|                                (2)
                                                                                                      (win)
For this purpose the covariance matrix Σ and mean vec-                                          ŷ∈Ŷt,k
tor ~µ are computed for the cleaned matrix E. Then, the                            (win)
squared Mahalanobis distance                                                     Ŷt,k     = [ŷt−ck ,k , . . . , ŷt,k , . . . , ŷt+ck ,k ]   (3)
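In code, this flagging step reduces to a single comparison. The helper below is only an illustrative sketch (the name flag_anomalies is ours; suitable threshold values are discussed in Sec. 3.2 and Sec. 4):

```python
import numpy as np

def flag_anomalies(M, threshold):
    """Indices of time steps whose squared Mahalanobis distance M exceeds the
    anomaly threshold; raising the threshold trades recall for precision."""
    return np.flatnonzero(np.asarray(M) > threshold)
```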

2.4   Window-Based Error Correction

Initially, we could not obtain very good results when running our algorithm on several example ECG time series. The Mahalanobis distance signal was rather noisy and could not be used to distinguish between nominal and abnormal patterns in the data. In Figure 2 (top) this problem is visualized for one example time series. Further investigation showed that the predictions of the LSTM network were good in general, but not sufficiently accurate near the heart beat peaks, where the prediction had been slightly shifted (up to 10 time steps) forwards or backwards compared to the real signal. We could identify that the quasi-periodic character of most ECG signals (small but ubiquitous frequency changes in the heart beat) is the main source of this problem. The following solution for this problem is proposed: In order to address the variability in the frequency of the signal, small corrections in the predictions of the individual horizons will be permitted. For each output dimension k ∈ {1, . . . , m} the target value y_{t,k} is compared to the neighboring predictions ŷ ∈ Ŷ_{t,k}^{(win)} and the prediction with the smallest absolute error is effectively taken:

   ŷ_{t,k} ← argmin_{ŷ ∈ Ŷ_{t,k}^{(win)}} |y_{t,k} − ŷ|                          (2)

   Ŷ_{t,k}^{(win)} = [ŷ_{t−c_k,k}, . . . , ŷ_{t,k}, . . . , ŷ_{t+c_k,k}]         (3)

We found that reasonable results can be achieved with window parameters c_k up to a length of 10, depending on the prediction horizon h_k:

   c_k = min(h_k, 10).                                                           (4)
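A direct (deliberately unvectorized) NumPy sketch of Eqs. (2)-(4) is shown below; it illustrates the correction rule only and is not the original implementation.

```python
import numpy as np

def window_correction(Y, Y_hat, H, c_max=10):
    """Window-based error correction of Eqs. (2)-(4) (illustrative sketch).

    Y, Y_hat: (T, m) arrays of targets and predictions, one column per
    horizon h_k. For every time step t and output dimension k, the prediction
    within the window [t - c_k, t + c_k] that is closest to the target y_{t,k}
    replaces y_hat_{t,k}, with c_k = min(h_k, c_max).
    """
    T, m = Y.shape
    Y_corr = Y_hat.copy()
    for k in range(m):
        c = min(H[k], c_max)
        for t in range(T):
            lo, hi = max(0, t - c), min(T, t + c + 1)
            window = Y_hat[lo:hi, k]
            Y_corr[t, k] = window[np.argmin(np.abs(window - Y[t, k]))]
    return Y_corr

# the corrected predictions are then used to compute the errors E = Y - Y_corr
```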
Figure 2: Mahalanobis distance over an ECG time series, before and after the window-based error correction method is
applied.


   The window-based error correction is applied right after the LSTM prediction of Ŷ and before the prediction errors E are computed in Sec. 2.2.
   Although this approach corrects the predictions of the LSTM network explicitly with the true target values Y of the time series, it is not supervised in the sense that no anomaly labels are presented to the algorithm at any time of the training and correction process. As will be shown in Sec. 4, we could significantly improve the performance of LSTM-AD utilizing this correction step.

2.5   Threshold-free Anomaly Detection

For determining which points are anomalous, we extend Rosner's outlier test [24] to a multivariate scenario. This means that we want to test the hypothesis:

   H0: There are no anomalies in the data.
   Ha: There are up to ν percent of anomalies in the data.

We assume that the residuals of each output k with k ∈ {1, ..., m} are approximately normally distributed. Based on these residuals we iteratively calculate a multivariate normal distribution with mean ~µ and covariance matrix Σ. In the next step we calculate the difference of an observed vector to this underlying multivariate normal distribution. We do so by calculating the squared Mahalanobis distance according to Eq. (1), which can be assumed to be χ²-distributed with m degrees of freedom.
   The proposed algorithm selects the highest value of the Mahalanobis distance, M_j = max_i M(~e_i), and evaluates whether this value is above the 1 − α quantile of the χ²_m-distribution. We refer to this quantile for ease of notation as χ²_m(1 − α). If so, the observation is considered to be an anomaly. A row window around the anomaly index is then deleted from the data set and the algorithm proceeds with calculating the multivariate normal distribution based on the reduced data set. The procedure stops when at most ν% of the data were labeled as an anomaly or if M_j ≤ χ²_m(1 − α).
   The procedure is summarized in Algorithm 1.

Algorithm 1 Extended Rosner test
 1: procedure TEST(ν, α, D)    ▷ ν: maximal percentage of anomalies; D ∈ R^{n×m}: set of residuals; 1 − α: confidence level
 2:     N ← ⌈ν · n⌉
 3:     for j = 1, 2, . . . , N do
 4:         ~µ ← MEAN(D), Σ ← COV(D)
 5:         M_j ← max_i M(~e_i)    ▷ M acc. to Eq. (1) for all ~e_i ∈ D
 6:         if M_j ≥ χ²_m(1 − α) then
 7:             D ← D \ {a window around the index of M_j}
 8:         end if
 9:     end for
10:     Set q to the last index where M_j > χ²_m(1 − α)
11:     return {M_1, M_2, . . . , M_q} and the corresponding row indices
12: end procedure
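A Python sketch of Algorithm 1 is given below. The half-width win of the window that is removed around a flagged index is an assumption (the paper only states that a window around the index is deleted), as are the default values of ν and α.

```python
import numpy as np
from scipy.stats import chi2

def extended_rosner_test(D, nu=0.01, alpha=0.01, win=300):
    """Sketch of Algorithm 1 (extended Rosner test) on the residual matrix D."""
    n, m = D.shape
    crit = chi2.ppf(1.0 - alpha, df=m)        # chi^2_m(1 - alpha)
    N = int(np.ceil(nu * n))
    idx = np.arange(n)                        # original row indices still in D
    M_vals, rows = [], []
    for _ in range(N):
        mu = D.mean(axis=0)
        Sinv = np.linalg.inv(np.cov(D, rowvar=False))
        diff = D - mu
        M = np.einsum("ij,jk,ik->i", diff, Sinv, diff)
        i_max = int(np.argmax(M))
        M_vals.append(M[i_max])
        rows.append(int(idx[i_max]))
        if M[i_max] >= crit:
            # delete a window around the flagged (original) index
            keep = np.abs(idx - idx[i_max]) > win
            D, idx = D[keep], idx[keep]
    # q: last iteration whose maximum still exceeded the critical value
    exceeded = [j for j, v in enumerate(M_vals) if v > crit]
    q = exceeded[-1] + 1 if exceeded else 0
    return M_vals[:q], rows[:q]
```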

3   Experimental Setup

3.1   The MIT-BIH Arrhythmia Database

For our experiments we use the MIT-BIH Arrhythmia database [9, 21, 22], which contains two-channel electrocardiogram (ECG) signals of 48 patients of the Beth Israel Hospital (BIH) Arrhythmia Laboratory. Each recording has a length of approximately half an hour, which corresponds to 650 000 data points each. The two channels recorded are the modified limb lead II (MLII) and a modified lower lead V5. For our LSTM-AD algorithm, both the MLII and the V5 signal will be used as inputs of the LSTM model and only the MLII signal is predicted for the specified horizons.¹ The individual signals have a quasi-periodic behavior, with differing heart-beat patterns for each subject.

¹ We also performed experiments where both signals (MLII and V5) are predicted, but could not observe any noticeable improvement.
   There are a multitude of different events in all ECG time series, which were labelled by human experts. The whole list of heart beat annotations can be viewed at [21]. For this initial investigation, which presents first results of our unsupervised approach, we decided to limit ourselves to all time series with 50 or fewer events. Overall, 13 time series will be considered for our experiments. The selected time series contain 130 anomalous events from 6 anomaly classes, which are listed in Table 1. Since only one point is labelled for each anomaly, we place an anomaly window of length 600 around each event, which roughly corresponds to the length of one heart beat before and after the labeled point. A more detailed database description can be found in [22].

Table 1: Anomaly types in the 13 ECG signals considered for the experiments. The descriptions are taken from [21]. The second column shows the overall number of the various anomaly types for the 13 considered ECG signals.

  Code    #    Description
  A      44    Atrial premature beat
  V      53    Premature ventricular contraction
  |      17    Isolated QRS-like artifact
  a       6    Aberrated atrial premature beat
  F       2    Fusion of ventricular and normal beat
  x       8    Non-conducted P-wave (blocked APC)

3.2   Parameterization of the Algorithms

All algorithms compared in this work require a set of parameters, which are – if not mentioned otherwise – fixed for all experiments. For each algorithm an anomaly threshold can be set, which specifies the sensitivity of the algorithm towards anomalies and which trades off false detections (false positives) and missed anomalies (false negatives). This threshold is usually set according to the requirements of the anomaly detection task – allowing either a higher precision or a higher recall.
   LSTM-AD: We implemented our proposed algorithm using the Keras framework [5] with a TensorFlow [1] backend. The parameters of the algorithm are summarized in Table 2. Most of the parameters are related to the stacked LSTM network. We did not systematically tune the parameters.

Table 2: Summary of the parameters used for the LSTM anomaly detector.

  Parameter    value                    Parameter    value
  H            (1, 3, . . . , 47, 49)   loss         MSE
  L            2                        u            (64, 64)
  B            2048                     W            80
  optimizer    ADAM                     α_init       0.001
  d_in         2                        d_out        1 (MLII)

   ADVec: Twitter's ADVec algorithm [28] is a time series anomaly detection algorithm which is based on the generalized ESD test and other robust statistical approaches. There are mainly two parameters which have to be provided: The first parameter, α, represents the level of statistical significance with which to accept or reject anomalies. Although we did not tune this parameter extensively, we found α = 0.05 to deliver the best results. The second parameter, max_anoms, specifies the maximum number of anomalies that the algorithm will detect as a percentage of the data. This parameter is used as anomaly threshold.
   NuPic: Numenta's anomaly detection algorithm [8] has a large set of parameters which have to be set. Although the parameters can be tuned with an internal swarming tool [2], we decided to use the standard parameter settings recommended in [17], since the time-expensive tuning process is not feasible for the data considered in this work. NuPic outputs an anomaly likelihood for each time series point in the interval [0, 1], which is suitably thresholded to control the sensitivity of the algorithm.

3.3   Algorithm Evaluation

In order to evaluate and compare the performance of our proposed LSTM-AD algorithm and the two other algorithms, several common performance quantities are used in this paper. Similarly to ordinary classification tasks, a confusion matrix can be constructed for each time series, containing the number of true-positives (TP), false-positives (FP), false-negatives (FN) and true-negatives (TN). TP indicates the number of correctly identified anomalies, whereby only one detection within the anomaly window (Sec. 3.1) is counted. All false detections outside any anomaly window are considered as false-positives (FP) and each anomaly window missed by an algorithm will be counted as a false-negative (FN). All other points are considered as true-negatives (TN). From the confusion matrix, the well-known metrics precision p and recall r (true-positive rate, TPR) can be derived, as well as the false-positive rate (FPR), the positive likelihood ratio (PLR), and the F1-score, which are defined as:

   FPR = FP / (FP + TN),    PLR = TPR / FPR,    F1 = 2 p · r / (p + r).      (5)
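The window-based scoring and the quantities of Eq. (5) can be computed roughly as follows. This is an illustrative sketch (function name and interface are ours); the TN count is simplified to all remaining points of the series.

```python
import numpy as np

def evaluate(detections, anomaly_idx, T, half_win=300):
    """Score a detector against labeled anomalies (sketch of Sec. 3.3).

    detections:  time indices flagged as anomalous
    anomaly_idx: labeled anomaly points; each spans a window of length
                 2 * half_win (600 points) around the label
    T:           length of the time series
    """
    detections = np.asarray(detections)
    windows = [(a - half_win, a + half_win) for a in anomaly_idx]
    in_any = np.zeros(len(detections), dtype=bool)
    hits = 0
    for lo, hi in windows:
        inside = (detections >= lo) & (detections <= hi)
        hits += int(inside.any())             # at most one TP per window
        in_any |= inside
    TP, FN = hits, len(windows) - hits
    FP = int(np.sum(~in_any))                 # detections outside every window
    TN = T - TP - FN - FP                     # simplified TN count
    p = TP / max(TP + FP, 1)
    r = TP / max(TP + FN, 1)                  # recall = TPR
    FPR = FP / max(FP + TN, 1)
    PLR = r / FPR if FPR > 0 else float("inf")
    F1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return dict(TP=TP, FN=FN, FP=FP, precision=p, recall=r,
                FPR=FPR, PLR=PLR, F1=F1)
```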
4   Results & Analysis

Firstly, we confirm that our LSTM models do not overfit, since the training and test set errors are nearly the same for all time series. The median of all such training and test errors is 2.91 · 10⁻³ and 3.43 · 10⁻³, respectively. The first 80% of each time series is used as training set and the remaining 20% is used subsequently for the test set.
   The anomaly results for the ECG readings are summarized in Table 3 and Table 4.
In Table 3, the anomaly threshold for the Mahalanobis distance was tuned individually to maximize the F1-score for each ECG signal. As seen in the table, the Mahalanobis distance is generally a good indicator for separating nominal from anomalous behavior in the heartbeat signals, if a suitable threshold is known. For all time series a recall value of 0.5 or larger can be observed and, with one exception, also the F1-score exceeds the value 0.5. On average, an F1-score of approximately 0.81 can be achieved for all time series. Note that all FPRs are smaller than 3 · 10⁻⁵.

Table 3: Results for all ECG time series with less than 50 anomalies (in total 13). For these results, the anomaly threshold is chosen for each time series individually, so that the F1-score is maximized.

   No.   threshold   TP   FN   FP   Prec    Rec     F1   FPR ∗10⁵   PLR /10⁵
    1       49.31    17   17   14   0.55   0.50   0.52       2.15       0.23
    2       71.31     6    0    4   0.60   1.00   0.75       0.62       1.62
    4       11.25     1    1    0   1.00   0.50   0.67       0.00        Inf
    9       27.12    15   13    3   0.83   0.54   0.65       0.46       1.16
   10       14.96    33    7    6   0.85   0.82   0.84       0.92       0.89
   11       60.75     1    0    0   1.00   1.00   1.00       0.00        Inf
   12       59.18     2    0    0   1.00   1.00   1.00       0.00        Inf
   13      116.93     6    0    0   1.00   1.00   1.00       0.00        Inf
   15       99.74     5    0    5   0.50   1.00   0.67       0.77       1.30
   17       40.02     1    0    0   1.00   1.00   1.00       0.00        Inf
   20       75.40     1    0    0   1.00   1.00   1.00       0.00        Inf
   21       30.30     1    0    0   1.00   1.00   1.00       0.00        Inf
   22      121.42     3    0    7   0.30   1.00   0.46       1.08       0.93
  mean          –     7    2    3   0.82   0.87   0.81       0.46        Inf
    Σ           –    92   38   39   0.70   0.71   0.70       0.46       1.53

   Since the anomaly threshold is used to trade off false-positives and false-negatives (precision and recall), one can vary the threshold in a certain range and collect the results for different values. This is done in Figure 3, which also shows the results for different thresholds for ADVec and NuPic. It has to be noted that ADVec accounts only for fixed-length seasonalities; it is not built for quasi-periodic signals as they occur in ECG readings, so it is quite understandable that it has only low performance here.

[Figure 3: Precision over Recall for the three algorithms LSTM-AD, NuPic and ADVec.]

Figure 3: Precision-Recall plot for LSTM-AD, NuPic and ADVec. Precision and recall are computed over the sum of TP, FP, FN of the 13 ECG time series. For all three algorithms, each point is generated by scaling the individual best thresholds up and down by a common factor. For LSTM-AD, the best thresholds are reported in Table 3.

   Table 4 shows the results which are obtained when no prior knowledge about the anomaly threshold is assumed. Instead, the threshold is calculated automatically and incrementally by our approach from Sec. 2.5, which is inspired by Rosner's ESD test. The average precision and F1 are lower than in Table 3 since the false positives (FP) are higher.

Table 4: Results for the considered 13 ECG signals for the threshold-free method based on Rosner's ESD test.

  ECG No.   threshold   TP   FN   FP   Prec    Rec     F1   FPR ∗10⁵   PLR /10⁵
      1        11.30     7   27    3   0.70   0.21   0.32       0.46       0.45
      2        11.30     5    1    4   0.56   0.83   0.67       0.62       1.35
      4        11.30     2    0    4   0.33   1.00   0.50       0.62       1.62
      9        11.30    10   18    0   1.00   0.36   0.53       0.00        Inf
     10        11.30     6   34    4   0.60   0.15   0.24       0.62       0.24
     11        11.30     1    0    8   0.11   1.00   0.20       1.23       0.81
     12        11.30     2    0    5   0.29   1.00   0.44       0.77       1.30
     13        11.30     6    0    4   0.60   1.00   0.75       0.62       1.62
     15        11.30     4    1    5   0.44   0.80   0.57       0.77       1.04
     17        11.30     1    0    7   0.12   1.00   0.22       1.08       0.93
     20        11.30     1    0    5   0.17   1.00   0.29       0.77       1.30
     21        11.30     1    0    8   0.11   1.00   0.20       1.23       0.81
     22        11.30     3    0    7   0.30   1.00   0.46       1.08       0.93
   mean        11.30     3    6    4   0.41   0.80   0.41       0.76        Inf
      Σ       146.92    49   81   64   0.43   0.38   0.40       0.76       0.50

   In Figure 4, two excerpts of time series No. 13 (left) and No. 10 (right) with the detections of our LSTM-AD algorithm, NuPic and ADVec are exemplarily shown. In both examples it can be seen that LSTM-AD detects all indicated anomalies, while NuPic and ADVec only detect two and one anomaly, respectively. Additionally, the other two algorithms produce several false positives.
[Figure 4: two excerpts of the MLII signal ("Example 1" and "Example 2") with the true anomaly windows and the detections of the three algorithms marked.]

Figure 4: Subsets of two example time series taken from the MIT-ECG data with the anomalies detected by the algorithms LSTM-AD, NuPic and ADVec. The red rectangles in the plot indicate the true anomaly windows. True-positives are indicated by green colors while false-positives are colored red.
   The importance of our proposed window-based error correction method (Sec. 2.4) is illustrated in Figure 2 for ECG signal No. 13: If no window-based error correction is applied, the obtained Mahalanobis distance cannot be suitably used to distinguish between nominal and anomalous patterns in the displayed ECG data. Only after applying our approach, a better signal-to-noise ratio is established which allows to perfectly separate the anomalies from the nominal points. For most of the investigated ECG readings we found that the window-based error correction significantly improves the signal-to-noise ratio in the Mahalanobis distance. Table 5 shows: the average F1-score increases from F1 = 0.50, when no window-based error correction is applied, to F1 = 0.81.

Table 5: Comparison of the results for all considered time series, with and without the window-based error correction, as described in Sec. 2.4. The last column F1(Corr) is copied from Table 3. The remaining columns depict the quantities which are obtained if no window-based error correction is applied (thresholds chosen such that F1 is maximized).

  ECG No.   threshold   TP   FN    FP   Prec    Rec     F1   F1(Corr)
      1        20.60    12   22    23   0.34   0.35   0.35       0.52
      2         5.83     2    4     2   0.50   0.33   0.40       0.75
      4         7.03     1    1     0   1.00   0.50   0.67       0.67
      9        16.83    13   15    10   0.57   0.46   0.51       0.65
     10        16.21    28   12     2   0.93   0.70   0.80       0.84
     11        40.60     1    0     0   1.00   1.00   1.00       1.00
     12        28.48     1    1     1   0.50   0.50   0.50       1.00
     13        93.83     5    1   130   0.04   0.83   0.07       1.00
     15        87.97     2    3     5   0.29   0.40   0.33       0.67
     17        35.90     1    0    38   0.03   1.00   0.05       1.00
     20        25.10     1    0     1   0.50   1.00   0.67       1.00
     21        32.31     1    0     0   1.00   1.00   1.00       1.00
     22        77.33     2    1    37   0.05   0.67   0.10       0.46
   mean            –     5    4    19   0.52   0.67   0.50       0.81
      Σ            –    70   60   249   0.22   0.54   0.31       0.70

   In Table 6, various measures are listed for the individual anomaly classes. The anomaly types a, F and x can all be detected by LSTM-AD. Also for the anomaly class V a high recall can be achieved. However, the two remaining types appear to be hard to detect for our algorithm.

Table 6: Various metrics for the six different anomaly classes. The threshold was tuned individually for each time series by attempting to maximize the F1-score.

  Code   TP   FN   Prec    Rec     F1   FPR ∗10⁵   PLR /10⁵
  A      23   21   0.65   0.52   0.58       0.14       3.64
  V      44    9   0.73   0.83   0.78       0.19       4.36
  |       9    8   0.63   0.53   0.58       0.06       8.49
  a       6    0   0.79   1.00   0.88       0.02      53.44
  F       2    0   0.65   1.00   0.79       0.01      80.16
  x       8    0   0.73   1.00   0.85       0.03      29.15

5   Conclusion & Future Work

We have presented a fully unsupervised method to detect anomalies in ECG readings. This method relies on an accurate LSTM predictor to learn the nominal behavior of the ECG for several prediction horizons. By learning the error distribution between predicted and perceived values, a multivariate normal error model for the nominal data is built. When applying the model, anomalous events have a high probability of being detected through an unusually high Mahalanobis distance.
   Our method is unsupervised in the sense that no anomaly class labels are needed for training the algorithm. In fact, it is not even necessary that anomalous events are present at all in the training data, i.e. our algorithm can operate as a one-class classifier. We checked this by repeating the experiment leading to Table 3, but this time removing all data around anomalies during LSTM training. When using the trained model as anomaly detector on all data, it worked as accurately as in Table 3, the mean F1-score now being F1 = 0.83.
   We achieve these high precision, recall and F1 values for the ECG readings (on average higher than 80%, see Table 3) if we tune the final threshold for the Mahalanobis distance such that F1 is maximized. Admittedly, this last step is not unsupervised, since we calculate the confusion matrix based on the true anomaly labels.
   The alternative unsupervised case based on Rosner's test (Sec. 2.5 and Table 4) is weaker in terms of precision and recall. This may be due to the fact that the current error data do not fulfill the assumption of being normally distributed and therefore also the assumption of a χ²-distribution is violated. This results in the χ²-criterion giving no useful thresholds.
   It has to be noticed that the measure 'Mahalanobis distance' has the same discriminative power in both cases. It is only that the final threshold is not adjusted optimally for the individual time series in the alternative case. Viewed from the practitioner's perspective, it may be acceptable to start with a non-optimal threshold and adjust it in a human-in-the-loop approach. However, a fully unsupervised high-quality method would be nicer.
   We have shown that the window-based error correction is essential to achieve a Mahalanobis distance graph where the anomaly cases clearly stand out (Fig. 2 and Table 5).
   Our LSTM-AD algorithm outperformed two state-of-the-art anomaly detection algorithms (NuPic and ADVec) on the investigated ECG readings, achieving a higher precision and recall over a large range of anomaly thresholds.
   In this work we have presented first results of an unsupervised anomaly detector suitable for ECG readings or other quasi-periodic signals. The results are encouraging, but there is still room for improvement. Possible future works include: 1) Improving the modeling step such that the nominal error distribution comes closer to a Gaussian shape and hence the nominal Mahalanobis distance closer to a χ²-distribution. Then the unsupervised extended Rosner test can be expected to work better. 2) To do so, one has to address the problem of non-stationarity in the ECG readings, e.g. by applying suitable preprocessing steps to reduce the effect of signal quality changes. 3) Enrich the model by multi-resolution approaches to span larger prediction horizons on a coarser scale.
References

[1] M. Abadi et al. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
[2] S. Ahmad. Running swarms. http://nupic.docs.numenta.org/0.6.0/guide-swarming.html, May 2017.
[3] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[4] S. Chauhan and L. Vig. Anomaly detection in ECG time signals via deep long short-term memory networks. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–7. IEEE, 2015.
[5] F. Chollet et al. Keras. https://keras.io, 2015.
[6] M. C. Chuah and F. Fu. ECG anomaly detection via time series analysis. In International Symposium on Parallel and Distributed Processing and Applications, pages 123–135. Springer, 2007.
[7] M. C. Garcia, M. A. Sanz-Bobi, and J. del Pico. SIMAP: Intelligent system for predictive maintenance: Application to the health condition monitoring of a windturbine gearbox. Computers in Industry, 57(6):552–568, 2006.
[8] D. George and J. Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10):e1000532, 2009.
[9] A. L. Goldberger et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
[10] K. Greff et al. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
[11] S.-J. Han and S.-B. Cho. Evolutionary neural networks for anomaly detection based on the behavior of a program. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(3):559–570, June 2005.
[12] S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] R. U. Islam, M. S. Hossain, and K. Andersson. A novel anomaly detection algorithm for sensor data under uncertainty. Soft Computing, 22(5):1623–1639, 2018.
[15] T. Kieu, B. Yang, and C. S. Jensen. Outlier detection for multidimensional time series using deep neural networks. In 2018 19th IEEE International Conference on Mobile Data Management (MDM), pages 125–134. IEEE, 2018.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] A. Lavin and S. Ahmad. Evaluating real-time anomaly detection algorithms – the Numenta anomaly benchmark. In IEEE Conference on Machine Learning and Applications (ICMLA 2015), 2015.
[18] A. P. Lemos, C. Tierra-Criollo, and W. Caminhas. ECG anomalies identification using a time series novelty detection technique. In IV Latin American Congress on Biomedical Engineering 2007, Bioengineering Solutions for Latin America Health, pages 65–68. Springer, 2007.
[19] P. Malhotra et al. Long short term memory networks for anomaly detection in time series. In 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 89–94, 2015.
[20] P. Malhotra et al. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148, 2016.
[21] G. B. Moody and R. G. Mark. PhysioNet: The MIT-BIH Arrhythmia Database. https://www.physionet.org/physiobank/database/mitdb/, 1992.
[22] G. B. Moody and R. G. Mark. The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50, 2001.
[23] H. M. Rai, A. Trivedi, and S. Shukla. ECG signal processing for abnormalities detection using multi-resolution wavelet transform and artificial neural network classifier. Measurement, 46(9):3238–3246, 2013.
[24] B. Rosner. Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25(2):165–172, 1983.
[25] H. Sivaraks and C. A. Ratanamahatana. Robust and accurate anomaly detection in ECG artifacts using time series motif discovery. Computational and Mathematical Methods in Medicine, 2015.
[26] T. Stibor et al. Is negative selection appropriate for anomaly detection? In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pages 321–328. ACM, 2005.
[27] M. Thill, W. Konen, and T. Bäck. Time series anomaly detection with discrete wavelet transforms and maximum likelihood estimation. In O. Valenzuela, I. Rojas, et al., editors, International Conference on Time Series (ITISE), 2017.
[28] O. Vallis, J. Hochenbaum, and A. Kejariwal. A novel technique for long-term anomaly detection in the cloud. In 6th USENIX Workshop on Hot Topics in Cloud Computing, Philadelphia, PA, 2014.