=Paper=
{{Paper
|id=Vol-2473/paper10
|storemode=property
|title=Anomaly Detection in Electrocardiogram Readings with Stacked LSTM Networks
|pdfUrl=https://ceur-ws.org/Vol-2473/paper10.pdf
|volume=Vol-2473
|authors=Markus Thill,Sina Däubener,Wolfgang Konen,Thomas Bäck
|dblpUrl=https://dblp.org/rec/conf/itat/ThillDKB19
}}
==Anomaly Detection in Electrocardiogram Readings with Stacked LSTM Networks==
Markus Thill (1), Sina Däubener, Wolfgang Konen (1), and Thomas Bäck (2)

(1) TH Köln – Cologne University of Applied Sciences, 51643 Gummersbach, Germany, {markus.thill, wolfgang.konen}@th-koeln.de
(2) Leiden University, LIACS, 2333 CA Leiden, The Netherlands, t.h.w.baeck@liacs.leidenuniv.nl

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Real-world anomaly detection for time series is still a challenging task. This is especially true for periodic or quasi-periodic time series, since automated approaches have to learn long-term correlations before they are able to detect anomalies. Electrocardiography (ECG) time series, a prominent real-world example of quasi-periodic signals, are investigated in this work. Anomaly detection algorithms often have the additional goal to identify anomalies in an unsupervised manner. In this paper we present an unsupervised time series anomaly detection algorithm. It learns with recurrent Long Short-Term Memory (LSTM) networks to predict the normal time series behavior. The prediction error on several prediction horizons is used to build a statistical model of normal behavior. We propose new methods that are essential for a successful model-building process and for a high signal-to-noise ratio. We apply our method to the well-known MIT-BIH ECG data set and present first results. We obtain a good recall of anomalies while having a very low false alarm rate (FPR) in a fully unsupervised procedure. We also compare with other state-of-the-art anomaly detectors (NuPic, ADVec).

1 Introduction

Anomaly detection in time series is of increasing importance in many application areas, e.g. health care [6, 4], sensor networks [20, 14] or predictive maintenance [7]. Anomaly detection is a very active research area (for an overview see [3]), yet it is not easy to come up with a general definition of an anomaly. The notion of an anomaly greatly depends on the application area and on the characteristics of the time series in question. To learn the characteristics of nominal and anomalous behavior from data, it is often necessary to apply sophisticated methods from natural computing (evolutionary neural networks [11] and immune systems [26]). While some anomalies are simple to detect (e.g. a sudden spike in a relatively constant signal may be detected by simple threshold heuristics), other anomalies are subtle and more complex to detect. This is especially the case for periodic or quasi-periodic time series, where the anomaly may be a time-shifted peak, a peak with a different form, or other patterns which only emerge from a long-range analysis of the signal.

Electrocardiography (ECG) time series constitute a prominent real-world example of quasi-periodic signals. Anomaly detection in ECG readings plays an important role in health care and medical diagnosis. There exist well-maintained databases, e.g. the MIT-BIH database [9], where a large body of data is annotated with numerous types of anomalies which are characterized and verified by medical experts. Automated anomaly detection in such ECG data is still a challenging topic, because the deviations from nominal behavior are often subtle and require long-range analysis. Furthermore, there are considerable signal variations from patient to patient or even within an ECG time series.

Long short-term memory (LSTM) networks [13], which are a special form of recurrent neural networks (RNN) and thus belong to the class of deep learning methods, have proven to be particularly useful in learning sequences with long-range dependencies. They avoid the vanishing gradient problem [12] and are more stable and better scalable [10] than other RNN architectures. LSTMs have successfully advanced the state of the art in many application areas like language modeling and translation, acoustic modeling of speech, analysis of audio data, handwriting recognition, and others [10]. We will use stacked LSTMs as the building block for our ECG time series prediction.

It is the purpose of the present paper to investigate whether anomaly detection in quasi-periodic ECG time series can be trained in an unsupervised manner, hence without the usage of the anomaly class labels. We describe in Sec. 2 an LSTM prediction model which is trained to predict over multiple horizons and is applied to time series containing nominal and also rare anomalous data. We observe multidimensional error vectors (one vector for each point in time) and fit a multivariate Gaussian distribution to them. Based on the Mahalanobis distance we can assign to each point in time a probability of being anomalous. Sec. 3 describes our experimental setup and the MIT-BIH Arrhythmia Database used in our experiments. Sec. 4 presents and discusses our results, while Sec. 5 concludes.
1.1 Related Work

Anomaly detection in general has been done with methods from machine learning [3] and, more precisely, from natural computing: Han and Cho [11] and other works cited therein use evolutionary approaches to optimize neural networks for the task of intrusion detection. Kieu et al. [15] use deep learning (LSTM, autoencoder) for anomaly detection. [27] uses a multi-resolution wavelet-based approach for unsupervised anomaly detection. Stibor et al. [26] describe an immune-system approach to anomaly detection: They tackle a problem prevalent in anomaly detection (and relevant also for the ECG case): Often only nominal data are available during training, nevertheless the model should later detect anomalies as well. This task, known as one-class classification or negative selection, is solved in [26] with an immune-system approach. In our case we have an unknown, small number of anomalies embedded in nominal data. We describe in Sec. 2.5 a statistical test to find anomalies in an unsupervised, threshold-free manner.

Much work is devoted to anomaly detection in ECG readings: Several authors use multi-resolution wavelet-based techniques [23, 27]. A novelty-search approach on ECG data is taken in [18] in order to perform unsupervised anomaly classification. Sivaraks et al. [25] use motif discovery for robust anomaly detection. The works of Malhotra [19] and Chauhan [4] are closest to our present approach. They describe nicely the general idea of LSTM-based prediction and its application to ECG, motor sensor, or power-consumption time series. But the big drawback of [19, 4] is that they need a manual and supervised separation into up to six data sets: training, validation and test sets, which are further subdivided into nominal and anomalous subsets. This means that for a real-world application the ECG data for a new person would need to undergo an expert anomaly classification prior to any training, which will be highly impractical in most application scenarios. Our method instead aims at using the whole body of data for a person to train the LSTMs, without the necessity to have supervised anomaly information.

2 Methods

2.1 LSTM for Time Series Prediction

The learning task is formulated as a time series forecasting problem. Hence, we attempt to train a model which is able to predict future values in the time series. The intuition behind this approach is that the usual quasi-periodic patterns in the ECG time series should be predictable with only small errors, while abnormal behavior should lead to large deviations in the predictions. Although the presented methodology is only applied to ECG data in this paper, it is sufficiently general to be applied to other (predictable) time series as well.

Data Preparation. Consider a d-dimensional time series of length T. In a first step, it is often recommendable to scale or normalize the individual dimensions of the time series. In our setup, each dimension of each ECG signal is scaled into the range [−1, 1]. The training and test samples are generated by extracting sub-sequences of suitable length from the original time series. This is done by sliding a window of length W with a lag of 1 over the time series and collecting the windowed data in a tensor D ∈ R^{T'×W×d_in}, where T' is the number of sub-sequences. Usually, one would select all d dimensions of the time series, so that d_in = d. Then, the first 80% of the samples of D are selected to form the training data X_train. The remaining 20% are used as a test set X_test to later compute an unbiased error estimate. While the inputs are d_in-dimensional, the output targets for each time step have dimension m, since one can select multiple (m) prediction horizons for one time series. Technically, it is also possible to predict several time series dimensions with d_out ≤ d simultaneously in one model; however, we found in our experiments that the results do not improve in this case (for the investigated ECG time series data). The targets y_t ∈ R^m are future values of the selected signal at times t + h_i for i ∈ {1, ..., m}, where the horizons are specified in H = (h_1, h_2, ..., h_m). Since we follow a many-to-many time series prediction approach, where the algorithm performs a prediction at each instance of time t, the tensor containing the target signals has the shape R^{T'×W×m} with T' = T − W − max(H) + 1. T' is the same for the input and output tensors. As before, the first 80% of the targets are used for training (Y_train) and the remaining targets for the test set (Y_test).
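The sliding-window construction described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under our reading of the text, not the authors' code; names such as build_windows and the synthetic example signal are ours.

<pre>
import numpy as np

def scale_to_range(x, lo=-1.0, hi=1.0):
    """Scale each dimension of the series into [lo, hi] (data preparation step)."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)

def build_windows(x, target, W, horizons):
    """Slide a window of length W with lag 1 over the series:
    inputs  X of shape (T', W, d_in),
    targets Y of shape (T', W, m) with Y[i, j, k] = target[i + j + horizons[k]],
    where T' = T - W - max(horizons) + 1."""
    T, m = len(x), len(horizons)
    T_prime = T - W - max(horizons) + 1
    X = np.stack([x[i:i + W] for i in range(T_prime)])
    Y = np.empty((T_prime, W, m))
    for k, h in enumerate(horizons):
        for j in range(W):
            Y[:, j, k] = target[j + h: j + h + T_prime]
    return X, Y

# Example with a synthetic 2-channel signal standing in for MLII and V5:
T, W = 5000, 80
horizons = list(range(1, 50, 2))                    # H = (1, 3, ..., 49), i.e. m = 25
ecg = scale_to_range(np.random.randn(T, 2).cumsum(axis=0))
X, Y = build_windows(ecg, target=ecg[:, 0], W=W, horizons=horizons)

split = int(0.8 * len(X))                           # first 80% training, last 20% test
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
</pre>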
Model Architecture and Training. A stacked LSTM architecture [13] with L = 2 LSTM layers is used to learn the prediction task. Each layer consists of u = 64 units. A dense output layer with m units and a linear activation generates the predictions for the specified prediction horizons in H. The net is trained with the sub-sequences of length W, taken in mini-batches of size B from the training inputs X_train and targets Y_train. 10% of the training data are held out as validation set. The LSTM model is trained with the Adam optimizer [16] to minimize the mean squared error (MSE) loss. Other loss functions, such as log-cosh (logarithm of the hyperbolic cosine) and MAE (mean absolute error), were tested as well and produced similar results for our data. Early stopping is applied to prevent overfitting of the model and to reduce the overall time required for the training process. For this purpose the MSE on the validation set is tracked. For most of the investigated time series, 10-20 epochs are sufficient to reach a minimum of the validation error.
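In Keras (the framework named in Sec. 3.2), the described architecture can be expressed roughly as follows. This is a minimal sketch under the stated hyperparameters (two stacked LSTM layers with 64 units, a linear dense output, Adam, MSE loss, early stopping); callback settings such as the patience value are our assumptions, and the code is not the authors' original implementation.

<pre>
from tensorflow import keras

def build_lstm_predictor(W=80, d_in=2, m=25, units=64):
    """Stacked LSTM: two LSTM layers with return_sequences=True (many-to-many),
    followed by a dense layer with m linear outputs, one per prediction horizon."""
    model = keras.Sequential([
        keras.layers.LSTM(units, return_sequences=True, input_shape=(W, d_in)),
        keras.layers.LSTM(units, return_sequences=True),
        keras.layers.Dense(m, activation="linear"),   # applied per time step
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model

model = build_lstm_predictor()
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
# X_train / Y_train as produced by the windowing step above
model.fit(X_train, Y_train, batch_size=2048, epochs=20,
          validation_split=0.1, callbacks=[early_stop])
</pre>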
2.2 Modelling the Residuals with a Multivariate Gaussian Distribution

Computing the Prediction Errors for the Full Time Series. After the LSTM prediction model is trained, the whole time series X ∈ R^{T×d_in} of length T is passed through the model and a tensor Ŷ of shape R^{T×m} is predicted. Then, the prediction errors E = Y − Ŷ are calculated, where Y contains the true target values for the horizons in H. Now, each row i in the matrix E represents an error vector e_i ∈ R^m.

Removing Outliers from the Prediction Errors. We noticed in our initial experiments that it was not possible to find good Gaussian fits for the individual dimensions of the prediction errors in E. An example for this is shown in Figure 1, upper part. This was due to the fact that the tails of the error distributions contained many outliers, which significantly distorted the estimated fit. Hence, we decided to remove the outliers in the tails of each dimension (only during the Gaussian modeling phase). We could find good solutions by discarding the upper and lower 3% quantile in each dimension. In our experiments we observed that this approach usually removes slightly more than 20% of the data records.

Figure 1 (density over error, panels "Uncorrected" and "Corrected"): Gaussian fit without and with the removal of outliers in the tails of the error distribution. Exemplarily, this is shown for one dimension of the overall error distribution. The blue curve shows the empirical error distribution; the red curve depicts the estimated Gaussian.

Estimating a Multivariate Gaussian Distribution. After removing the outliers from the prediction errors E, the errors are roughly Gaussian distributed and the parameters of a multivariate Gaussian distribution can be estimated. For this purpose the covariance matrix Σ and mean vector µ are computed for the cleaned matrix E. Then, the squared Mahalanobis distance

M(e_i) = (e_i − µ)^T Σ^{−1} (e_i − µ)    (1)

to the mean vector µ is determined for each error vector e_i in E.
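The outlier trimming, the Gaussian fit and the Mahalanobis distance of Eq. (1) can be sketched with NumPy as follows. This is our illustrative reading of the procedure (discard rows outside the 3%/97% quantiles per dimension, then estimate µ and Σ on the cleaned errors); the function names are hypothetical.

<pre>
import numpy as np

def trim_tails(E, q=0.03):
    """Keep only error vectors whose entries lie inside the [q, 1-q] quantile
    range in every dimension (used only during the Gaussian modeling phase)."""
    lo, hi = np.quantile(E, q, axis=0), np.quantile(E, 1.0 - q, axis=0)
    mask = np.all((E >= lo) & (E <= hi), axis=1)
    return E[mask]

def fit_error_model(E, q=0.03):
    """Estimate mean vector and covariance matrix of the trimmed prediction errors."""
    E_clean = trim_tails(E, q)
    return E_clean.mean(axis=0), np.cov(E_clean, rowvar=False)

def mahalanobis_sq(E, mu, cov):
    """Squared Mahalanobis distance M(e_i) = (e_i - mu)^T Sigma^{-1} (e_i - mu), Eq. (1)."""
    cov_inv = np.linalg.inv(cov)
    diff = E - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# E = Y_true - Y_pred (one m-dimensional error vector per time step):
# mu, cov = fit_error_model(E)
# M = mahalanobis_sq(E, mu, cov)      # anomaly score per time step
</pre>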
2.3 Anomaly Detection

For most data points in the time series the corresponding Mahalanobis distance will be comparably small, since they are located close to the mean of the distribution. On the other hand, unusual patterns in the error vectors e_i – such as large errors in one or more dimensions – will result in large values of the Mahalanobis distance. Therefore, the Mahalanobis distance can be used as an indicator for anomalous behavior in the time series signal. In our LSTM-AD algorithm, points with a Mahalanobis distance larger than a specified anomaly threshold will be flagged as anomalous. Figure 2 shows exemplarily the Mahalanobis distance over time for a selected ECG signal. Depending on the choice of threshold, more or fewer points will be classified as anomalous. If the threshold is set too small, the algorithm will likely produce many false detections. If the threshold is chosen too large, many anomalies might be missed. Ideally, a threshold can be found which allows to identify all anomalies without any false detections. However, in practice, one usually has to trade off true and false detections and select the threshold according to one's own requirements.

2.4 Window-Based Error Correction

Initially, we could not obtain very good results when running our algorithm on several example ECG time series. The Mahalanobis distance signal was rather noisy and could not be used to distinguish between nominal and abnormal patterns in the data. In Figure 2 (top) this problem is visualized for one example time series. Further investigation showed that the predictions of the LSTM network were good in general, but not sufficiently accurate near the heartbeat peaks, where the prediction had been slightly shifted (up to 10 time steps) forwards or backwards compared to the real signal. We could identify that the quasi-periodic character of most ECG signals (small but ubiquitous frequency changes in the heartbeat) is the main source of this problem. The following solution for this problem is proposed: In order to address the variability in the frequency of the signal, small corrections in the predictions of the individual horizons will be permitted. For each output dimension k ∈ {1, ..., m}, the target value y_{t,k} is compared to the neighbored predictions ŷ ∈ Ŷ^{(win)}_{t,k}, and the prediction with the smallest absolute error is effectively taken:

ŷ_{t,k} ← argmin_{ŷ ∈ Ŷ^{(win)}_{t,k}} |y_{t,k} − ŷ|,    (2)

Ŷ^{(win)}_{t,k} = [ŷ_{t−c_k,k}, ..., ŷ_{t,k}, ..., ŷ_{t+c_k,k}].    (3)

We found that reasonable results can be achieved with window parameters c_k up to a length of 10, depending on the prediction horizon h_k:

c_k = min(h_k, 10).    (4)

The window-based error correction is applied right after the LSTM prediction of Ŷ and before the prediction errors E are computed in Sec. 2.2. Although this approach corrects the predictions of the LSTM network explicitly with the true target values Y of the time series, it is not supervised in the sense that no anomaly labels are presented to the algorithm at any time of the training and correction process. As will be shown in Sec. 4, we could significantly improve the performance of LSTM-AD utilizing this correction step.

Figure 2: Mahalanobis distance over an ECG time series, before and after the window-based error correction method is applied.
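A direct (unoptimized) implementation of Eqs. (2)-(4) could look like the sketch below; it is our reading of the correction step, not the authors' code, and it assumes the predictions have already been collapsed to a matrix of shape (T, m).

<pre>
import numpy as np

def window_error_correction(Y_true, Y_pred, horizons, max_c=10):
    """For each output dimension k, replace the prediction at time t by the
    neighbored prediction within +/- c_k steps (c_k = min(h_k, max_c)) that is
    closest to the true target value y_{t,k}, cf. Eqs. (2)-(4)."""
    T, m = Y_pred.shape
    Y_corr = Y_pred.copy()
    for k, h in enumerate(horizons):
        c = min(h, max_c)
        for t in range(T):
            lo, hi = max(0, t - c), min(T, t + c + 1)
            window = Y_pred[lo:hi, k]                       # Y^(win)_{t,k}
            Y_corr[t, k] = window[np.argmin(np.abs(Y_true[t, k] - window))]
    return Y_corr

# Applied right after prediction and before computing the errors E = Y_true - Y_corr.
</pre>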
2.5 Threshold-free Anomaly Detection

For determining which points are anomalous, we extend Rosner's outlier test [24] to a multivariate scenario. This means that we want to test the hypotheses:

H0: There are no anomalies in the data.
Ha: There are up to ν percent of anomalies in the data.

We assume that the residuals of each output k with k ∈ {1, ..., m} are approximately normally distributed. Based on these residuals we iteratively estimate a multivariate normal distribution with mean µ and covariance matrix Σ. In a next step we calculate the deviation of an observed vector from this underlying multivariate normal distribution. We do so by calculating the squared Mahalanobis distance according to Eq. (1), which can be assumed to be χ²-distributed with m degrees of freedom.

The proposed algorithm selects the highest value of the Mahalanobis distance, M_j = max_i M(e_i), and evaluates whether this value is above the 1 − α quantile of the χ²_m-distribution. We refer to this quantile, for ease of notation, as χ²_m(1 − α). If so, the observation is considered to be an anomaly. A row window around the anomaly index is then deleted from the data set and the algorithm proceeds with calculating the multivariate normal distribution based on the reduced data set. The procedure stops when at most ν% of the data were labeled as an anomaly or if M_j ≤ χ²_m(1 − α). The procedure is summarized in Algorithm 1.

Algorithm 1: Extended Rosner test
 1: procedure TEST(ν, α, D)            ▷ ν: maximal percentage of anomalies; D ∈ R^{n×m}: set of residuals; 1 − α: confidence level
 2:   N ← ⌈ν · n⌉
 3:   for j = 1, 2, ..., N do
 4:     µ ← MEAN(D), Σ ← COV(D)
 5:     M_j ← max_i M(e_i)             ▷ M according to Eq. (1), for all e_i ∈ D
 6:     if M_j ≥ χ²_m(1 − α) then
 7:       D ← D \ {a window around the index of M_j}
 8:     end if
 9:   end for
10:   Set q to the last index where M_j > χ²_m(1 − α)
11:   return {M_1, M_2, ..., M_q} and corresponding row indices
12: end procedure
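A possible implementation of Algorithm 1 with SciPy is sketched below. It follows our reading of the procedure; ν is passed as a fraction, the size of the deleted window around each detection is an assumption (we reuse the 600-point anomaly window of Sec. 3.1), and the sketch stops at the first non-significant maximum, which slightly simplifies the stopping rule of lines 10-11.

<pre>
import numpy as np
from scipy.stats import chi2

def extended_rosner_test(E, nu=0.01, alpha=0.01, del_window=600):
    """Iteratively flag the residual with the largest squared Mahalanobis distance
    as anomalous while it exceeds the chi^2_m(1 - alpha) quantile, removing a
    window of rows around each detection before re-estimating mu and Sigma."""
    n, m = E.shape
    crit = chi2.ppf(1.0 - alpha, df=m)
    D, idx = E.copy(), np.arange(n)           # remaining residuals and their original indices
    anomalies, scores = [], []
    for _ in range(int(np.ceil(nu * n))):     # at most nu * n anomalies
        mu = D.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(D, rowvar=False))
        diff = D - mu
        M = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        j = int(np.argmax(M))
        if M[j] < crit:                       # largest distance no longer significant
            break
        anomalies.append(int(idx[j]))
        scores.append(float(M[j]))
        keep = np.abs(np.arange(len(D)) - j) > del_window // 2
        D, idx = D[keep], idx[keep]
    return np.array(anomalies), np.array(scores)
</pre>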
3 Experimental Setup

3.1 The MIT-BIH Arrhythmia Database

For our experiments we use the MIT-BIH Arrhythmia Database [9, 21, 22], which contains two-channel electrocardiogram (ECG) signals of 48 patients of the Beth Israel Hospital (BIH) Arrhythmia Laboratory. Each recording has a length of approximately half an hour, which corresponds to 650 000 data points. The two channels recorded are the modified limb lead II (MLII) and a modified lower lead V5. For our LSTM-AD algorithm, both the MLII and the V5 signal are used as inputs of the LSTM model and only the MLII signal is predicted for the specified horizons. (We also performed experiments where both signals, MLII and V5, are predicted, but could not observe any noticeable improvement.) The individual signals have a quasi-periodic behavior, with differing heartbeat patterns for each subject.

There is a multitude of different events in all ECG time series, which were labelled by human experts. The whole list of heartbeat annotations can be viewed at [21]. For this initial investigation, which presents first results of our unsupervised approach, we decided to limit ourselves to all time series with 50 or fewer events. Overall, 13 time series are considered for our experiments. The selected time series contain 130 anomalous events from 6 anomaly classes, which are listed in Table 1. Since only one point is labelled for each anomaly, we place an anomaly window of length 600 around each event, which roughly corresponds to the length of one heartbeat before and after the labeled point. A more detailed database description can be found in [22].

Table 1: Anomaly types in the 13 ECG signals considered for the experiments. The descriptions are taken from [21]. The second column shows the overall number of the various anomaly types for the 13 considered ECG signals.
  Code  #   Description
  A     44  Atrial premature beat
  V     53  Premature ventricular contraction
  |     17  Isolated QRS-like artifact
  a      6  Aberrated atrial premature beat
  F      2  Fusion of ventricular and normal beat
  x      8  Non-conducted P-wave (blocked APC)

3.2 Parameterization of the Algorithms

All algorithms compared in this work require a set of parameters, which are – if not mentioned otherwise – fixed for all experiments. For each algorithm an anomaly threshold can be set, which specifies the sensitivity of the algorithm towards anomalies and which trades off false detections (false positives) and missed anomalies (false negatives). This threshold is usually set according to the requirements of the anomaly detection task, allowing either a higher precision or a higher recall.

LSTM-AD: We implemented our proposed algorithm using the Keras framework [5] with a TensorFlow [1] backend. The parameters of the algorithm are summarized in Table 2. Most of the parameters are related to the stacked LSTM network. We did not systematically tune the parameters.

Table 2: Summary of the parameters used for the LSTM anomaly detector.
  H = (1, 3, ..., 47, 49)     loss = MSE
  L = 3                       u = (64, 64)
  B = 2048                    W = 80
  optimizer = Adam            α_init = 0.001
  d_in = 2                    d_out = 1 (MLII)

ADVec: Twitter's ADVec algorithm [28] is a time series anomaly detection algorithm which is based on the generalized ESD test and other robust statistical approaches. There are mainly two parameters which have to be provided: The first parameter, α, represents the level of statistical significance with which to accept or reject anomalies. Although we did not tune this parameter extensively, we found α = 0.05 to deliver the best results. The second parameter, max_anoms, specifies the maximum number of anomalies that the algorithm will detect, as a percentage of the data. This parameter is used as anomaly threshold.

NuPic: Numenta's anomaly detection algorithm [8] has a large set of parameters which have to be set. Although the parameters can be tuned with an internal swarming tool [2], we decided to use the standard parameter settings recommended in [17], since the time-expensive tuning process is not feasible for the data considered in this work. NuPic outputs an anomaly likelihood for each time series point in the interval [0, 1], which is suitably thresholded to control the sensitivity of the algorithm.

3.3 Algorithm Evaluation

In order to evaluate and compare the performance of our proposed LSTM-AD algorithm and the two other algorithms, several common performance quantities are used in this paper. Similarly to ordinary classification tasks, a confusion matrix can be constructed for each time series, containing the number of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). TP indicates the number of correctly identified anomalies, whereby only one detection within the anomaly window (Sec. 3.1) is counted. All false detections outside any anomaly window are considered as false positives (FP) and each anomaly window missed by an algorithm is counted as a false negative (FN). All other points are considered as true negatives (TN). From the confusion matrix, the well-known metrics precision p and recall r (true-positive rate, TPR) can be derived, as well as the false-positive rate (FPR), the positive likelihood ratio (PLR), and the F1-score, which are defined as:

FPR = FP / (FP + TN),    PLR = TPR / FPR,    F1 = 2·p·r / (p + r).    (5)
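The window-based counting and the quantities of Eq. (5) can be computed as in the following sketch. The anomaly-window length of 600 points follows Sec. 3.1; the exact counting conventions (at most one TP per window, every remaining point a TN) are our reading of the text, and the helper names are ours.

<pre>
import numpy as np

def count_detections(det_idx, anomaly_centers, series_len, win=600):
    """Window-based counting: a detection inside the window around a labeled anomaly
    counts once as a TP, windows without any detection are FNs, detections outside
    all windows are FPs, and all remaining points are TNs."""
    det_idx = np.asarray(det_idx)
    half = win // 2
    tp = fn = 0
    covered = np.zeros(len(det_idx), dtype=bool)
    for c in anomaly_centers:
        inside = (det_idx >= c - half) & (det_idx <= c + half)
        if inside.any():
            tp += 1
            covered |= inside
        else:
            fn += 1
    fp = int((~covered).sum())
    tn = series_len - tp - fn - fp
    return tp, fn, fp, tn

def detection_metrics(tp, fn, fp, tn):
    """Precision, recall (TPR), FPR, positive likelihood ratio and F1-score, cf. Eq. (5)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    plr = r / fpr if fpr else float("inf")
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "FPR": fpr, "PLR": plr, "F1": f1}
</pre>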
4 Results & Analysis

Firstly, we confirm that our LSTM models do not overfit, since the training and test set errors are nearly the same for all time series. The medians of all such training and test errors are 2.91 · 10^−3 and 3.43 · 10^−3, respectively. The first 80% of each time series is used as training set and the remaining 20% is used subsequently as the test set.

The anomaly detection results for the ECG readings are summarized in Table 3 and Table 4. In Table 3, the anomaly threshold for the Mahalanobis distance was tuned individually to maximize the F1-score for each ECG signal. As seen in the table, the Mahalanobis distance is generally a good indicator for separating nominal from anomalous behavior in the heartbeat signals, if a suitable threshold is known. For all time series a recall value of 0.5 or larger can be observed and, with one exception, the F1-score also exceeds the value 0.5. On average, an F1-score of approximately 0.81 can be achieved over all time series. Note that all FPRs are smaller than 3 · 10^−5.

Table 3: Results for all ECG time series with less than 50 anomalies (in total 13). For these results, the anomaly threshold is chosen for each time series individually, so that the F1-score is maximized.
  No.   threshold  TP  FN  FP  Prec  Rec   F1    FPR·10^5  PLR/10^5
  1     49.31      17  17  14  0.55  0.50  0.52  2.15      0.23
  2     71.31       6   0   4  0.60  1.00  0.75  0.62      1.62
  4     11.25       1   1   0  1.00  0.50  0.67  0.00      Inf
  9     27.12      15  13   3  0.83  0.54  0.65  0.46      1.16
  10    14.96      33   7   6  0.85  0.82  0.84  0.92      0.89
  11    60.75       1   0   0  1.00  1.00  1.00  0.00      Inf
  12    59.18       2   0   0  1.00  1.00  1.00  0.00      Inf
  13    116.93      6   0   0  1.00  1.00  1.00  0.00      Inf
  15    99.74       5   0   5  0.50  1.00  0.67  0.77      1.30
  17    40.02       1   0   0  1.00  1.00  1.00  0.00      Inf
  20    75.40       1   0   0  1.00  1.00  1.00  0.00      Inf
  21    30.30       1   0   0  1.00  1.00  1.00  0.00      Inf
  22    121.42      3   0   7  0.30  1.00  0.46  1.08      0.93
  mean  –           7   2   3  0.82  0.87  0.81  0.46      Inf
  Σ     –          92  38  39  0.70  0.71  0.70  0.46      1.53

Since the anomaly threshold is used to trade off false positives and false negatives (precision and recall), one can vary the threshold in a certain range and collect the results for different values. This is done in Figure 3, which also shows the results for different thresholds for ADVec and NuPic. It has to be noted that ADVec accounts only for fixed-length seasonalities; it is not built for quasi-periodic signals as they occur in ECG readings, so it is quite understandable that it shows only low performance here.

Figure 3: Precision-recall plot for LSTM-AD, NuPic and ADVec. Precision and recall are computed over the sum of TP, FP, FN of the 13 ECG time series. For all three algorithms, each point is generated by scaling the individual best thresholds up and down by a common factor. For LSTM-AD, the best thresholds are reported in Table 3.
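The sweep behind Figure 3 (scaling each series' best threshold up and down by a common factor and aggregating TP, FP, FN over all 13 series) could be reproduced roughly as follows. This sketch assumes a user-supplied function count_fn(i, thr) that returns the (tp, fp, fn) counts of series i at threshold thr; both this function and the factor grid are illustrative, not taken from the paper.

<pre>
import numpy as np

def precision_recall_sweep(best_thresholds, count_fn, factors=np.logspace(-1, 1, 21)):
    """For each common scaling factor, aggregate TP/FP/FN over all series and
    compute one (recall, precision) point, as in the construction of Figure 3."""
    curve = []
    for f in factors:
        tp = fp = fn = 0
        for i, thr in enumerate(best_thresholds):
            tpi, fpi, fni = count_fn(i, f * thr)
            tp, fp, fn = tp + tpi, fp + fpi, fn + fni
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((recall, precision))
    return curve
</pre>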
Table 4 shows the results which are obtained when no prior knowledge about the anomaly threshold is assumed. Instead, the threshold is calculated automatically and incrementally by our approach from Sec. 2.5, which is inspired by Rosner's ESD test. The average precision and F1 are lower than in Table 3, since the false positives (FP) are higher.

Table 4: Results for the considered 13 ECG signals for the threshold-free method based on Rosner's ESD test.
  ECG No.  threshold  TP  FN  FP  Prec  Rec   F1    FPR·10^5  PLR/10^5
  1        11.30       7  27   3  0.70  0.21  0.32  0.46      0.45
  2        11.30       5   1   4  0.56  0.83  0.67  0.62      1.35
  4        11.30       2   0   4  0.33  1.00  0.50  0.62      1.62
  9        11.30      10  18   0  1.00  0.36  0.53  0.00      Inf
  10       11.30       6  34   4  0.60  0.15  0.24  0.62      0.24
  11       11.30       1   0   8  0.11  1.00  0.20  1.23      0.81
  12       11.30       2   0   5  0.29  1.00  0.44  0.77      1.30
  13       11.30       6   0   4  0.60  1.00  0.75  0.62      1.62
  15       11.30       4   1   5  0.44  0.80  0.57  0.77      1.04
  17       11.30       1   0   7  0.12  1.00  0.22  1.08      0.93
  20       11.30       1   0   5  0.17  1.00  0.29  0.77      1.30
  21       11.30       1   0   8  0.11  1.00  0.20  1.23      0.81
  22       11.30       3   0   7  0.30  1.00  0.46  1.08      0.93
  mean     11.30       3   6   4  0.41  0.80  0.41  0.76      Inf
  Σ        146.92     49  81  64  0.43  0.38  0.40  0.76      0.50

In Figure 4, two excerpts of time series No. 13 (left) and No. 10 (right) with the detections of our LSTM-AD algorithm, NuPic and ADVec are exemplarily shown. In both examples it can be seen that LSTM-AD detects all indicated anomalies, while NuPic and ADVec only detect two and one anomaly, respectively. Additionally, the other two algorithms produce several false positives.

Figure 4 (MLII signal over time, panels "Example 1" and "Example 2"): Subsets of two example time series taken from the MIT-ECG data with the anomalies detected by the algorithms LSTM-AD, NuPic and ADVec. The red rectangles in the plot indicate the true anomaly windows. True positives are indicated by green colors while false positives are colored red.

The importance of our proposed window-based error correction method (Sec. 2.4) is illustrated in Figure 2 for ECG signal No. 13: If no window-based error correction is applied, the obtained Mahalanobis distance cannot be suitably used to distinguish between nominal and anomalous patterns in the displayed ECG data. Only after applying our approach, a better signal-to-noise ratio is established which allows to perfectly separate the anomalies from the nominal points. For most of the investigated ECG readings we found that the window-based error correction significantly improves the signal-to-noise ratio in the Mahalanobis distance. Table 5 shows: the average F1-score increases from F1 = 0.50 when no window-based error correction is applied to F1 = 0.81.

Table 5: Comparison of the results for all considered time series, with and without the window-based error correction, as described in Sec. 2.4. The last column F1(Corr) is copied from Table 3. The remaining columns depict the quantities which are obtained if no window-based error correction is applied (thresholds chosen such that F1 is maximized).
  ECG No.  threshold  TP  FN   FP  Prec  Rec   F1    F1(Corr)
  1        20.60      12  22   23  0.34  0.35  0.35  0.52
  2        5.83        2   4    2  0.50  0.33  0.40  0.75
  4        7.03        1   1    0  1.00  0.50  0.67  0.67
  9        16.83      13  15   10  0.57  0.46  0.51  0.65
  10       16.21      28  12    2  0.93  0.70  0.80  0.84
  11       40.60       1   0    0  1.00  1.00  1.00  1.00
  12       28.48       1   1    1  0.50  0.50  0.50  1.00
  13       93.83       5   1  130  0.04  0.83  0.07  1.00
  15       87.97       2   3    5  0.29  0.40  0.33  0.67
  17       35.90       1   0   38  0.03  1.00  0.05  1.00
  20       25.10       1   0    1  0.50  1.00  0.67  1.00
  21       32.31       1   0    0  1.00  1.00  1.00  1.00
  22       77.33       2   1   37  0.05  0.67  0.10  0.46
  mean     –           5   4   19  0.52  0.67  0.50  0.81
  Σ        –          70  60  249  0.22  0.54  0.31  0.70

In Table 6, various measures are listed for the individual anomaly classes. The anomaly types a, F and x can all be detected by LSTM-AD. Also for the anomaly class V a high recall can be achieved. However, the two remaining types appear to be hard to detect for our algorithm.

Table 6: Various metrics for the 6 different anomaly classes. The threshold was tuned individually for each time series by attempting to maximize the F1-score.
  Type  TP  FN  Prec  Rec   F1    FPR·10^5  PLR/10^5
  A     23  21  0.65  0.52  0.58  0.14       3.64
  V     44   9  0.73  0.83  0.78  0.19       4.36
  |      9   8  0.63  0.53  0.58  0.06       8.49
  a      6   0  0.79  1.00  0.88  0.02      53.44
  F      2   0  0.65  1.00  0.79  0.01      80.16
  x      8   0  0.73  1.00  0.85  0.03      29.15

5 Conclusion & Future Work

We have presented a fully unsupervised method to detect anomalies in ECG readings. This method relies on an accurate LSTM predictor to learn the nominal behavior of the ECG for several prediction horizons. By learning the error distribution between predicted and perceived values, a multivariate normal error model for the nominal data is built. When applying the model, anomalous events have a high probability of being detected through an unusually high Mahalanobis distance.

Our method is unsupervised in the sense that no anomaly class labels are needed for training the algorithm. In fact, it is not even necessary that anomalous events are present at all in the training data, i.e. our algorithm can operate as a one-class classifier. We checked this by repeating the experiment leading to Table 3, but this time removing all data around anomalies during LSTM training. When using the trained model as anomaly detector on all data, it worked as accurately as in Table 3, the mean F1-score being now F1 = 0.83.
We achieve these high precision, recall and F1 values for the ECG readings (on average higher than 80%, see Table 3) if we tune the final threshold for the Mahalanobis distance such that F1 is maximized. Admittedly, this last step is not unsupervised, since we calculate the confusion matrix based on the true anomaly labels.

The alternative unsupervised case based on Rosner's test (Sec. 2.5 and Table 4) is weaker in terms of precision and recall. This may be due to the fact that the current error data do not fulfill the assumption of being normally distributed, and therefore the assumption of a χ²-distribution is also violated. This results in the χ²-criterion giving no useful thresholds.

It has to be noticed that the measure 'Mahalanobis distance' has the same discriminative power in both cases. It is only that the final threshold is not adjusted optimally for the individual time series in the alternative case. Viewed from the practitioner's perspective, it may be acceptable to start with a non-optimal threshold and adjust it in a human-in-the-loop approach. However, a fully unsupervised high-quality method would be nicer.

We have shown that the window-based error correction is essential to achieve a Mahalanobis distance graph where the anomaly cases clearly stand out (Fig. 2 and Table 5). Our LSTM-AD algorithm outperformed two state-of-the-art anomaly detection algorithms (NuPic and ADVec) on the investigated ECG readings, achieving a higher precision and recall over a large range of anomaly thresholds.

In this work we have presented first results of an unsupervised anomaly detector suitable for ECG readings or other quasi-periodic signals. The results are encouraging, but there is still room for improvement. Possible future works include: 1) Improving the modeling step such that the nominal error distribution comes closer to a Gaussian shape and hence the nominal Mahalanobis distance closer to a χ²-distribution; then the unsupervised extended Rosner test can be expected to work better. 2) To do so, one has to address the problem of non-stationarity in the ECG readings, e.g. by applying suitable preprocessing steps to reduce the effect of signal quality changes. 3) Enriching the model by multi-resolution approaches to span larger prediction horizons on a coarser scale.

References

[1] M. Abadi et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), pages 265-283, Berkeley, CA, USA, 2016. USENIX Association.
[2] S. Ahmad. Running swarms. http://nupic.docs.numenta.org/0.6.0/guide-swarming.html, May 2017.
[3] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[4] S. Chauhan and L. Vig. Anomaly detection in ECG time signals via deep long short-term memory networks. In IEEE International Conference on Data Science and Advanced Analytics (DSAA 2015), pages 1-7. IEEE, 2015.
[5] F. Chollet et al. Keras. https://keras.io, 2015.
[6] M. C. Chuah and F. Fu. ECG anomaly detection via time series analysis. In International Symposium on Parallel and Distributed Processing and Applications, pages 123-135. Springer, 2007.
[7] M. C. Garcia, M. A. Sanz-Bobi, and J. del Pico. SIMAP: Intelligent system for predictive maintenance: Application to the health condition monitoring of a windturbine gearbox. Computers in Industry, 57(6):552-568, 2006.
[8] D. George and J. Hawkins. Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10):e1000532, 2009.
[9] A. L. Goldberger et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215-e220, 2000.
[10] K. Greff et al. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222-2232, 2017.
[11] S.-J. Han and S.-B. Cho. Evolutionary neural networks for anomaly detection based on the behavior of a program. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(3):559-570, June 2005.
[12] S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107-116, 1998.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[14] R. U. Islam, M. S. Hossain, and K. Andersson. A novel anomaly detection algorithm for sensor data under uncertainty. Soft Computing, 22(5):1623-1639, 2018.
[15] T. Kieu, B. Yang, and C. S. Jensen. Outlier detection for multidimensional time series using deep neural networks. In 2018 19th IEEE International Conference on Mobile Data Management (MDM), pages 125-134. IEEE, 2018.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] A. Lavin and S. Ahmad. Evaluating real-time anomaly detection algorithms - the Numenta anomaly benchmark. In IEEE Conference on Machine Learning and Applications (ICMLA 2015), 2015.
[18] A. P. Lemos, C. Tierra-Criollo, and W. Caminhas. ECG anomalies identification using a time series novelty detection technique. In IV Latin American Congress on Biomedical Engineering 2007, Bioengineering Solutions for Latin America Health, pages 65-68. Springer, 2007.
[19] P. Malhotra et al. Long short term memory networks for anomaly detection in time series. In 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 89-94, 2015.
[20] P. Malhotra et al. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148, 2016.
[21] G. B. Moody and R. G. Mark. PhysioNet: The MIT-BIH Arrhythmia Database. https://www.physionet.org/physiobank/database/mitdb/, 1992.
[22] G. B. Moody and R. G. Mark. The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45-50, 2001.
[23] H. M. Rai, A. Trivedi, and S. Shukla. ECG signal processing for abnormalities detection using multi-resolution wavelet transform and artificial neural network classifier. Measurement, 46(9):3238-3246, 2013.
[24] B. Rosner. Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25(2):165-172, 1983.
[25] H. Sivaraks and C. A. Ratanamahatana. Robust and accurate anomaly detection in ECG artifacts using time series motif discovery. Computational and Mathematical Methods in Medicine, 2015.
[26] T. Stibor et al. Is negative selection appropriate for anomaly detection? In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pages 321-328. ACM, 2005.
[27] M. Thill, W. Konen, and T. Bäck. Time series anomaly detection with discrete wavelet transforms and maximum likelihood estimation. In O. Valenzuela, I. Rojas, et al., editors, International Conference on Time Series (ITISE), 2017.
[28] O. Vallis, J. Hochenbaum, and A. Kejariwal. A novel technique for long-term anomaly detection in the cloud. In 6th USENIX Workshop on Hot Topics in Cloud Computing, Philadelphia, PA, 2014.