<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anomaly Detection in Electrocardiogram Readings with Stacked LSTM Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Thill</string-name>
          <email>markus.thill@th-koeln.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sina Däubener</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Konen</string-name>
          <email>wolfgang.konen@th-koeln.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Bäck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden University</institution>
          ,
          <addr-line>LIACS, 2333 CA Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TH Köln - Cologne University of Applied Sciences</institution>
          ,
          <addr-line>51643 Gummersbach</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Real-world anomaly detection for time series is still a challenging task. This is especially true for periodic or quasi-periodic time series, since automated approaches have to learn long-term correlations before they are able to detect anomalies. Electrocardiography (ECG) time series, a prominent real-world example of quasi-periodic signals, are investigated in this work. Anomaly detection algorithms often have the additional goal of identifying anomalies in an unsupervised manner. In this paper we present an unsupervised time series anomaly detection algorithm. It learns with recurrent Long Short-Term Memory (LSTM) networks to predict the normal time series behavior. The prediction error on several prediction horizons is used to build a statistical model of normal behavior. We propose new methods that are essential for a successful model-building process and for a high signal-to-noise ratio. We apply our method to the well-known MIT-BIH ECG data set and present first results. We obtain a good recall of anomalies while having a very low false-positive rate (FPR) in a fully unsupervised procedure. We also compare with state-of-the-art anomaly detectors (NuPic, ADVec).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Anomaly detection in time series is of increasing
importance in many application areas, e.g. health care [
        <xref ref-type="bibr" rid="ref4 ref6">6, 4</xref>
        ],
sensor networks [
        <xref ref-type="bibr" rid="ref14 ref20">20, 14</xref>
        ] or predictive maintenance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Anomaly detection is a very active research area (for an
overview see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), yet it is not easy to come up with a
general definition of an anomaly. The notion of an anomaly
greatly depends on the application area and on
characteristics of the time series in question. To learn the
characteristics of nominal and anomalous behavior from data
it is often necessary to apply sophisticated methods from
natural computing (evolutionary neural networks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
immune systems [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]). While some anomalies are simple
to detect (e.g. a sudden spike in a relatively constant
signal may be detected by simple threshold heuristics), other
anomalies are subtle and more complex to detect. This is
especially the case for periodic or quasi-periodic time
series where the anomaly may be a time-shifted peak, a peak
with a different form or other patterns which only emerge
from a long-range analysis of the signal.
      </p>
      <p>
        Electrocardiography (ECG) time series constitute a
prominent real-world example of quasi-periodic signals.
Anomaly detection in ECG readings plays an important
role in health care and medical diagnosis. There exist
well-maintained databases, e.g. the MIT-BIH database [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
where a large body of data is annotated with numerous
types of anomalies which are characterized and verified
by medical experts. Automated anomaly detection in such
ECG data is still a challenging topic, because the
deviations from nominal behavior are often subtle and require
long-range analysis. Furthermore, there are considerable
signal variations from patient to patient or even within an
ECG time series.
      </p>
      <p>
        Long short-term memory (LSTM) networks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which
are a special form of recurrent neural networks (RNN) and
thus belong to the class of deep learning methods, have
proven to be particularly useful in learning sequences with
long-range dependencies. They avoid the vanishing
gradient problem [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and are more stable and better
scalable [
        <xref ref-type="bibr" rid="ref10">10</xref>
          ] than other RNN architectures. LSTMs have
successfully advanced the state-of-the-art in many
application areas such as language modeling and translation, acoustic
modeling of speech, analysis of audio data, handwriting
recognition and others [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We will use stacked LSTMs
as the building block for our ECG time series prediction.
      </p>
      <p>
        It is the purpose of the present paper to investigate
whether anomaly detection in quasi-periodic ECG time
series can be trained in an unsupervised manner, hence,
without the usage of the anomaly class labels. We will
describe in Sec. 2 an LSTM prediction model which is
trained to predict over multiple horizons and is applied to
time series containing nominal and also rare anomalous
data. We observe multidimensional error vectors (one
vector for each point in time) and fit a multivariate Gaussian
distribution to them. Based on the Mahalanobis distance
we can assign to each point in time a probability of
being anomalous. Sec. 3 describes our experimental setup
and the MIT-BIH Arrhythmia Database used in our
experiments. Sec. 4 presents and discusses our results, while
Sec. 5 concludes.</p>
        <p>Anomaly detection in general has been done with
methods from machine learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and more precisely from
natural computing: Han and Cho [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and other works
cited therein use evolutionary approaches in optimizing
neural networks for the task of intrusion detection. Kieu
et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] use deep learning (LSTM, autoencoder) for
anomaly detection. [
        <xref ref-type="bibr" rid="ref27">27</xref>
          ] uses a multi-resolution wavelet-based
approach for unsupervised anomaly detection.
Stibor et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] describe an immune-system approach to
anomaly detection: They tackle a problem prevalent in
anomaly detection (and relevant also for the ECG case):
Often only nominal data are available during training,
nevertheless the model should later detect anomalies as well.
This task, known as one-class classification or negative
selection, is solved in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] with an immune-system
approach. In our case we have an unknown, small number
of anomalies embedded in nominal data. We describe in
Sec. 2.5 a statistical test to find anomalies in an
unsupervised, threshold-free manner.
      </p>
      <p>
        Much work is devoted to anomaly detection in ECG
readings: Several authors use multi-resolution wavelet-based
techniques [
        <xref ref-type="bibr" rid="ref23 ref27">23, 27</xref>
        ]. A novelty-search approach on
ECG data is taken in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] in order to perform unsupervised
anomaly classification. Sivaraks et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] use motif
discovery for robust anomaly detection.
      </p>
      <p>
        The works of Malhotra [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Chauhan [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are
closest to our present approach. They describe nicely the
general idea of LSTM-based prediction and its application
to ECG, motor sensor or power-consumption time series.
But the big drawback of [
        <xref ref-type="bibr" rid="ref19 ref4">19, 4</xref>
        ] is that they need a manual
and supervised separation into up to six data sets: training,
validation and test sets, which are further subdivided into
nominal &amp; anomalous subsets. This means that for a real-world
application the ECG data for a new person would need
to undergo an expert anomaly classification prior to any
training. This will be highly impractical in most
application scenarios. Our method instead aims at using the whole
body of data for a person to train the LSTMs, without the
necessity of supervised anomaly information.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>LSTM for Time Series Prediction</title>
        <p>The learning task is formulated as a time series
forecasting problem. Hence, we attempt to train a model which is
able to predict future values in the time series. The
intuition behind this approach is that the usual quasi-periodic
patterns in the ECG time series should be predictable with
only small errors, while abnormal behavior should lead to
large deviations in the predictions. Although the presented
methodology is only applied to ECG data in this paper, it
is sufficiently general to be applied to other (predictable)
time series as well.</p>
        <p>Data Preparation Consider a d-dimensional time series
of length T. In a first step, it is often recommendable to
scale or normalize the individual dimensions of the time
series. In our setup, each dimension of each ECG
signal is scaled into the range [−1, 1]. The training and test
samples are generated by extracting sub-sequences of
suitable length from the original time series. This is done by
sliding a window of length W with a lag of 1 over the
time series and collecting the windowed data in a tensor
D ∈ R^(T′×W×d_in), where T′ is the number of sub-sequences.
Usually, one would select all d dimensions of the time
series, so that d_in = d. Then, the first 80% of the samples
of D are selected to form the training data X_train. The
remaining 20% are used as a test set X_test to later
compute an unbiased error estimate. While the inputs are
d_in-dimensional, the output targets for each time step have
dimension m, since one can select multiple (m) prediction
horizons for one time series. Technically, it is also
possible to predict several time series dimensions with d_out ≤ d
simultaneously in one model; however, we found in our
experiments that the results do not improve in this case
(for the investigated ECG time series data). The targets
y_t ∈ R^m are future values of the selected signal at times
t + h_i for i ∈ {1, …, m}, where the horizons are specified
in H = (h_1, h_2, …, h_m). Since we follow a many-to-many
time series prediction approach, where the algorithm
performs a prediction at each instance of time t, the tensor
containing the target signals has the shape R^(T′×W×m) with
T′ = T − W − max(H) + 1. T′ is the same for the
input and output tensor. As before, the first 80% of the targets
are used for training (Y_train) and the remaining targets for
the test set (Y_test).</p>
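        <p>The windowing scheme just described can be sketched in a few lines (a minimal NumPy illustration under our own naming, not the paper’s implementation; only the first channel is used as the predicted signal, matching the MLII-only setup of Sec. 3.1):</p>

```python
import numpy as np

def make_windows(x, W, H):
    """Slide a window of length W (lag 1) over a (T, d) time series x.

    Returns the input tensor D of shape (T', W, d) and the target tensor
    Y of shape (T', W, m): at window position i and in-window step j, the
    m targets are the future values x[i + j + h, 0] for the horizons h in H
    (only the first channel is predicted)."""
    T = x.shape[0]
    Tp = T - W - max(H) + 1                     # T' = T - W - max(H) + 1
    D = np.stack([x[i:i + W] for i in range(Tp)])
    Y = np.stack([[[x[i + j + h, 0] for h in H]
                   for j in range(W)] for i in range(Tp)])
    return D, Y

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))                   # toy 2-channel signal
D, Y = make_windows(x, W=32, H=(1, 3, 5))
print(D.shape, Y.shape)                         # (164, 32, 2) (164, 32, 3)
```

        <p>The first 80% of the T′ windows would then form X_train/Y_train, the rest X_test/Y_test.</p>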
        <p>
          Model Architecture and Training A stacked LSTM
architecture [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] with L = 2 layers is used to learn the
prediction task. Each layer consists of u = 64 units. A dense
output layer with m units and a linear activation generates the
predictions for the specified prediction horizons in H. The
net is trained with the sub-sequences of length W taken in
mini-batches of size B from the training inputs Xtrain and
targets Ytrain. 10% of the training data are held out for the
validation set. The LSTM model is trained by using the
Adam optimizer [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to minimize the mean-squared-error
(MSE) loss. Other loss functions, such as log-cosh
(logarithm of the hyperbolic cosine) and MAE (mean absolute
error) were tested as well and produced similar results for
our data. Early stopping is applied to prevent overfitting
of the model and to reduce the overall time required for
the training process. For this purpose the MSE on the
validation set is tracked. For most of the investigated time
series, 10-20 epochs are sufficient to reach a minimum of
the validation error.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Modelling the Residuals with a multivariate Gaussian Distribution</title>
        <p>[Figure 1: density of the prediction errors in one dimension over the
error range −0.10 to 0.10, uncorrected (top) and corrected (bottom) for
outliers.]</p>
      </sec>
      <sec id="sec-2-3">
        <title>Computing the Prediction Errors for the Full Time Series</title>
        <p>After the LSTM prediction model is trained, the
whole time series X ∈ R^(T×d_in) of length T is passed through
the model and a tensor Ŷ of shape R^(T×m) is predicted.
Then, the prediction errors E = Y − Ŷ are calculated,
where Y contains the true target values for the horizons
in H. Now, each row i of the matrix E represents an error
vector e_i ∈ R^m.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Removing Outliers from the Prediction Errors</title>
        <p>We noticed in our initial experiments that it was not possible to
find good Gaussian fits for the individual dimensions of
the prediction errors in E. An example for this is shown in
Figure 1, upper part. This was due to the fact that the tails
of the error distributions contained many outliers, which
significantly distorted the estimated fit. Hence, we decided
to remove the outliers in the tails of each dimension (only
during the Gaussian modeling phase). We could find good
solutions by discarding the upper and lower 3% quantile
in each dimension. In our experiments we observed that
this approach usually removes slightly more than 20% of
the data records.</p>
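        <p>The trimming step can be sketched as follows (an illustrative NumPy snippet; the 3% quantile is the value from the text, the synthetic error matrix is our own):</p>

```python
import numpy as np

def trim_tails(E, q=0.03):
    """Drop every row of the error matrix E (shape (n, m)) that lies below
    the q-quantile or above the (1 - q)-quantile in ANY dimension.
    This cleaning is used only while fitting the Gaussian model."""
    lo = np.quantile(E, q, axis=0)
    hi = np.quantile(E, 1.0 - q, axis=0)
    keep = np.all((E >= lo) & (E <= hi), axis=1)
    return E[keep]

rng = np.random.default_rng(1)
E = rng.normal(size=(10000, 4))
E_clean = trim_tails(E)
# With m = 4 roughly independent dimensions, clearly more than 2 * 3% of
# the rows are removed, in line with the "slightly more than 20%" above.
print(len(E_clean) / len(E))
```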
      </sec>
      <sec id="sec-2-5">
        <title>Estimating a multivariate Gaussian Distribution</title>
        <p>After removing the outliers from the prediction errors E, the
errors are roughly Gaussian distributed and the parameters
of a multivariate Gaussian distribution can be estimated.
For this purpose the covariance matrix S and mean
vector m̄ are computed for the cleaned matrix E. Then, the
squared Mahalanobis distance</p>
        <p>M(e_i) = (e_i − m̄)ᵀ S⁻¹ (e_i − m̄)   (1)</p>
        <p>to the mean vector m̄ is determined for each error vector e_i
in E.</p>
        <p>For most data points in the time series the corresponding
Mahalanobis distance will be comparably small, since they
are located close to the mean of the distribution. On the
other hand, unusual patterns in the error vectors e_i – such as
large errors in one or more dimensions – will result in large
values of the Mahalanobis distance. Therefore, the
Mahalanobis distance can be used as an indicator of anomalous
behavior in the time series signal. In our LSTM-AD
algorithm, points with a Mahalanobis distance larger than
a specified anomaly threshold will be flagged as
anomalous. Figure 2 shows exemplarily the Mahalanobis
distance over time for a selected ECG signal. Depending on
the choice of threshold, more or fewer points will be
classified as anomalous. If the threshold is set too small, the
algorithm will likely produce many false detections. If the
threshold is chosen too large, many anomalies might be
missed. Ideally, a threshold can be found which allows
identifying all anomalies without any false detections.
However, in practice, one usually has to trade off true and false
detections and select the threshold according to one’s own
requirements.</p>
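        <p>The distance of Eq. (1) can be sketched as follows (an illustrative NumPy snippet on a synthetic error matrix; the function name and the planted outlier are our own):</p>

```python
import numpy as np

def mahalanobis_sq(E, E_fit):
    """Squared Mahalanobis distance, Eq. (1), of every row of E with
    respect to the Gaussian (mean, covariance S) fitted on E_fit
    (the outlier-cleaned errors)."""
    m_vec = E_fit.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(E_fit, rowvar=False))
    diff = E - m_vec
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

rng = np.random.default_rng(2)
E = rng.normal(size=(5000, 3))
E[42] += 8.0                       # plant one unusual error vector
M = mahalanobis_sq(E, E)
# the planted point attains by far the largest distance
print(int(M.argmax()))
```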
      </sec>
      <sec id="sec-2-6">
        <title>Window-Based Error Correction</title>
        <p>Initially, we could not obtain very good results when
running our algorithm on several example ECG time series.
The Mahalanobis distance signal was rather noisy and
could not be used to distinguish between nominal and
abnormal patterns in the data. In Figure 2 (top) this problem
is visualized for one example time series. Further
investigation showed that the predictions of the LSTM network
were good in general, but not sufficiently accurate near the
heart beat peaks, where the prediction had been slightly
shifted (up to 10 time steps) forwards or backwards
compared to the real signal. We could identify the
quasi-periodic character of most ECG signals (small but
ubiquitous frequency changes in the heart beat) as the main
source of this problem. The following solution
is proposed: In order to address the variability in
the frequency of the signal, small corrections in the
predictions of the individual horizons are permitted. For each
output dimension k ∈ {1, …, m} the target value y_t,k is
compared to the neighboring predictions ŷ ∈ Ŷ^(win)_t,k, and the
prediction with the smallest absolute error is effectively
taken:</p>
        <p>ŷ_t,k ← argmin_{ŷ ∈ Ŷ^(win)_t,k} |y_t,k − ŷ|   (2)</p>
        <p>Ŷ^(win)_t,k = [ŷ_{t−c_k, k}, …, ŷ_{t, k}, …, ŷ_{t+c_k, k}]   (3)</p>
        <p>We found that reasonable results can be achieved with
window parameters c_k up to a length of 10, depending on
the prediction horizon h_k:
c_k = min(h_k, 10).</p>
        <p>The window-based error correction is applied right after
the LSTM prediction of Ŷ and before the prediction errors
E are computed in Sec. 2.2.</p>
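        <p>The correction step can be sketched as follows (an illustrative NumPy snippet on a synthetic, phase-shifted signal; not the paper’s implementation):</p>

```python
import numpy as np

def window_correct(Y, Y_hat, H, c_max=10):
    """Window-based error correction: for each horizon k, replace the
    prediction at time t by the neighboring prediction within +/- c_k
    steps (c_k = min(h_k, c_max)) that is closest to the target y_{t,k}.
    No anomaly labels are involved at any point."""
    T, m = Y.shape
    Y_corr = Y_hat.copy()
    for k in range(m):
        c = min(H[k], c_max)
        for t in range(T):
            win = Y_hat[max(0, t - c):min(T, t + c + 1), k]
            Y_corr[t, k] = win[np.argmin(np.abs(Y[t, k] - win))]
    return Y_corr

t = np.arange(200.0)
Y = np.sin(0.3 * t)[:, None]                 # "true" quasi-periodic target
Y_hat = np.sin(0.3 * (t - 2.0))[:, None]     # prediction shifted by 2 steps
Y_corr = window_correct(Y, Y_hat, H=(5,))
print(np.abs(Y - Y_hat).mean() > 0.2, np.abs(Y - Y_corr).mean() < 0.05)
```

        <p>On this toy signal the mean absolute error collapses nearly to zero after the correction, mimicking the effect of small frequency shifts near the heart beat peaks.</p>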
        <p>Although this approach corrects the predictions of the
LSTM network explicitly with the true target values Y of
the time series, it is not supervised in the sense that no
anomaly labels are presented to the algorithm at any time
of the training and correction process. As will be shown in
Sec. 4, we could significantly improve the performance of
LSTM-AD utilizing this correction step.
</p>
      </sec>
      <sec id="sec-2-7">
        <title>Threshold-free Anomaly Detection</title>
        <p>
          For determining which points are anomalous, we extend
Rosner’s outlier test [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] to a multivariate scenario. This
means that we want to test the hypothesis:
        </p>
        <p>H0 : There are no anomalies in the data.</p>
        <p>Ha : There are up to n percent of anomalies in the data.</p>
        <p>We assume that the residuals of each output k with k ∈
{1, …, m} are approximately normally distributed. Based on
these residuals we iteratively calculate a multivariate
normal distribution with mean m̄ and covariance matrix S. In
a next step we calculate the deviation of an observed
vector from this underlying multivariate normal
distribution. We do so by calculating the squared Mahalanobis
distance according to Eq. (1), which can be assumed to be
χ²-distributed with m degrees of freedom.</p>
        <p>The proposed algorithm selects the largest value of
the Mahalanobis distance, M_j = max_i M(e_i), and evaluates
whether this value is above the 1 − a quantile of the χ²_m
distribution. We refer to this quantile for ease of
notation as χ²_m(1 − a). If so, the observation is considered to
be an anomaly. A row window around the anomaly index
is then deleted from the data set and the algorithm
proceeds with calculating the multivariate normal distribution
based on the reduced data set. The procedure stops when
at most n% of the data were labeled as anomalies or if
M_j ≤ χ²_m(1 − a).</p>
        <p>The procedure is summarized in Algorithm 1.</p>
        <p>Algorithm 1 Extended Rosner test
1: procedure TEST(n, a, D)   ▷ n: maximal percentage of anomalies;
D ∈ R^(n×m): set of residuals; 1 − a: confidence level
2:   N ← ⌈n · n⌉
3:   for j = 1, 2, …, N do
4:     m̄ ← MEAN(D); S ← COV(D)
5:     M_j ← max_i M(e_i)   ▷ M acc. to Eq. (1) for all e_i ∈ D
6:     if M_j ≥ χ²_m(1 − a) then
7:       D ← D \ {a window around the index of M_j}
8:     end if
9:   end for
10:  Set q to the last index where M_j &gt; χ²_m(1 − a)
11:  return {M_1, M_2, …, M_q} and corresp. row indices
12: end procedure</p>
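        <p>Algorithm 1 can be sketched as follows (an illustrative snippet, assuming SciPy for the χ² quantile; the window size and the planted anomalies are our own toy choices, not values from the experiments):</p>

```python
import numpy as np
from scipy.stats import chi2   # chi-squared quantile (assumes SciPy)

def extended_rosner(D, n_frac=0.05, alpha=0.01, win=50):
    """Sketch of the multivariate extension of Rosner's outlier test.

    D: (n, m) residual matrix; n_frac: maximal fraction of anomalies;
    alpha: significance level; win: number of rows deleted around each
    detection. Returns (row index, distance) pairs of detections."""
    m = D.shape[1]
    thresh = chi2.ppf(1.0 - alpha, df=m)     # chi^2_m(1 - alpha)
    idx = np.arange(len(D))
    D = D.copy()
    detections = []
    for _ in range(int(np.ceil(n_frac * len(D)))):
        if len(D) <= m + 1:
            break
        mu = D.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(D, rowvar=False))
        diff = D - mu
        M = np.einsum('ij,jk,ik->i', diff, S_inv, diff)
        j = int(M.argmax())
        if M[j] <= thresh:                   # largest value is inconspicuous
            break
        detections.append((int(idx[j]), float(M[j])))
        keep = np.abs(np.arange(len(D)) - j) > win // 2
        D, idx = D[keep], idx[keep]
    return detections

rng = np.random.default_rng(3)
D = rng.normal(size=(3000, 3))
D[500] += 10.0                               # plant two anomalies
D[2500] += 9.0
found = [i for i, _ in extended_rosner(D)]
print(found[:2])                             # the planted rows come first
```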
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <sec id="sec-3-1">
        <title>The MIT-BIH Arrhythmia Database</title>
        <p>
          For our experiments we use the MIT-BIH Arrhythmia
database [
          <xref ref-type="bibr" rid="ref21 ref22 ref9">9, 21, 22</xref>
          ], which contains two-channel
electrocardiogram (ECG) signals of 48 patients of the Beth
Israel Hospital (BIH) Arrhythmia Laboratory. Each
recording has a length of approximately half an hour, which
corresponds to 650 000 data points each. The two
channels recorded are the modified limb lead II (MLII) and a
modified lower lead V5. For our LSTM-AD algorithm,
both the MLII and V5 signal will be used as inputs of
the LSTM model and only the MLII signal is predicted
for the specified horizons.¹ The individual signals have a
quasi-periodic behavior, with differing heart-beat patterns
for each subject. (¹We also performed experiments where both signals,
MLII and V5, are predicted, but could not observe any noticeable
improvement.)
        </p>
        <p>
          There are a multitude of different events in all ECG time
series, which were labelled by human experts. The whole
list of heart beat annotations can be viewed at [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. For
this initial investigation, which presents first results of our
unsupervised approach, we decided to limit ourselves to all
time series with 50 or fewer events. Overall, 13 time
series are considered for our experiments. The selected
time series contain 130 anomalous events from 6 anomaly
classes, which are listed in Table 1. Since only one point
is labelled for each anomaly, we place an anomaly
window of length 600 around each event, which roughly
corresponds to the length of one heart beat before and after
the labeled point. A more detailed database description
can be found in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Parameterization of the Algorithms</title>
        <p>All algorithms compared in this work require a set of
parameters, which are – if not mentioned otherwise – fixed
for all experiments. For each algorithm an anomaly
threshold can be set, which specifies the sensitivity of the
algorithm towards anomalies and which trades off false
detections (false positives) and missed anomalies (false
negatives). This threshold is usually set according to the
requirements of the anomaly detection task – allowing either
a higher precision or a higher recall.</p>
        <p>
          LSTM-AD We implemented our proposed algorithm
using the Keras framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] with a TensorFlow [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
backend. The parameters of the algorithm are
summarized in Table 2. Most of the parameters are related to the
stacked LSTM network. We did not systematically tune
the parameters.
        </p>
        <p>
          ADVec Twitter’s ADVec algorithm [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] is a time series
anomaly detection algorithm which is based on the
generalized ESD test and other robust statistical approaches.
There are mainly two parameters which have to be
provided: The first parameter a represents the level of
statistical significance with which to accept or reject anomalies.
Although we did not tune this parameter extensively, we
found a = 0.05 to deliver the best results. The
second parameter maxanoms specifies the maximum number
of anomalies that the algorithm will detect, as a percentage
of the data. This parameter is used as the anomaly threshold.
        </p>
        <p>[Table 2: parameter settings of the LSTM-AD algorithm, including the
prediction horizons H = (1, 3, …, 47, 49), the mini-batch size B = 2048,
the ADAM optimizer and the input dimension d_in = 2.]</p>
        <p>
          NuPic Numenta’s anomaly detection algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] has
a large set of parameters which have to be set. Although
the parameters can be tuned with an internal swarming
tool [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we decided to use the standard parameter
settings recommended in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], since the time-expensive
tuning process is not feasible for the data considered in this
work. NuPic outputs an anomaly likelihood in the interval [0, 1] for each
time series point, which is suitably
thresholded to control the sensitivity of the algorithm.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Performance Measures and Results</title>
        <p>In order to evaluate and compare the performance of our
proposed LSTM-AD algorithm and the two other
algorithms, several common performance quantities are used
in this paper. Similarly to ordinary classification tasks,
a confusion matrix can be constructed for each time
series, containing the numbers of true positives (TP),
false positives (FP), false negatives (FN) and true negatives (TN).
TP indicates the number of correctly identified anomalies,
whereby only one detection within the anomaly window
(Sec. 3.1) is counted. All false detections outside any
anomaly window are considered as false positives (FP),
and each anomaly window missed by an algorithm is
counted as a false negative (FN). All other points are
considered as true negatives (TN). From the confusion
matrix, the well-known metrics precision p = TP/(TP + FP) and
recall r = TP/(TP + FN) (the true-positive rate, TPR) can be derived,
as well as the false-positive rate FPR = FP/(FP + TN), the positive
likelihood ratio PLR = TPR/FPR, and the F1-score F1 = 2pr/(p + r).</p>
        <p>Firstly, we confirm that our LSTM models do not overfit,
since the training and test set errors are nearly the same for all
time series. The medians of all training and test
errors are 2.91·10⁻³ and 3.43·10⁻³, respectively. The first
80% of each time series is used as training set and the
remaining 20% subsequently as test set.</p>
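        <p>The measures above can be written down directly (a sketch with the standard formulas; the window-based counting of TP/FP from Sec. 3.1 is not reproduced here, and the example numbers are invented):</p>

```python
def metrics(TP, FP, FN, TN):
    """Precision, recall (TPR), FPR, positive likelihood ratio and
    F1-score from the entries of a confusion matrix."""
    p = TP / (TP + FP)
    r = TP / (TP + FN)           # recall = true-positive rate (TPR)
    fpr = FP / (FP + TN)         # false-positive rate
    plr = r / fpr                # positive likelihood ratio = TPR / FPR
    f1 = 2 * p * r / (p + r)
    return p, r, fpr, plr, f1

# e.g. 8 of 10 anomaly windows found, 2 false detections on ~10^5 points:
p, r, fpr, plr, f1 = metrics(TP=8, FP=2, FN=2, TN=99988)
print(round(p, 2), round(r, 2), round(f1, 2))   # 0.8 0.8 0.8
```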
        <p>The anomaly results for the ECG readings are
summarized in Table 3 and Table 4. In Table 3, the anomaly
threshold for the Mahalanobis distance was tuned
individually to maximize the F1-score for each ECG signal. As
seen in the table, the Mahalanobis distance is generally
a good indicator for separating nominal from anomalous
behavior in the heartbeat signals if a suitable threshold is
known. For all time series a recall value of 0.5 or larger
can be observed, and with one exception, the F1-score also
exceeds the value 0.5. On average, an F1-score of
approximately 0.81 can be achieved for all time series. Note that
all FPRs are smaller than 3·10⁻⁵.</p>
        <p>Since the anomaly threshold is used to trade off
false positives and false negatives (precision and recall), one can
vary the threshold in a certain range and collect the results
for different values. This is done in Figure 3, which also
shows the results for different thresholds for ADVec and
NuPic. It has to be noted that ADVec accounts only for
fixed-length seasonalities; it is not built for quasi-periodic
signals as they occur in ECG readings, so it is quite
understandable that it shows only low performance here.</p>
        <p>Table 4 shows the results which are obtained when no
prior knowledge about the anomaly threshold is assumed.
Instead, the threshold is calculated automatically and
incrementally by our approach from Sec. 2.5 which is
inspired by Rosner’s ESD test. The average precision and
F1 are lower than in Table 3 since the false positives (FP)
are higher.</p>
        <p>In Figure 4, two excerpts of time series No. 13 (left)
and No. 10 (right) are shown with the detections of our LSTM-AD
algorithm, NuPic and ADVec. In
both examples it can be seen that LSTM-AD detects all
indicated anomalies, while NuPic and ADVec only detect
two and one anomaly, respectively. Additionally, the other
two algorithms produce several false positives.</p>
        <p>The importance of our proposed window-based error
correction method (Sec. 2.4) is illustrated in Figure 2 for
ECG signal No. 13: If no window-based error correction
is applied, the obtained Mahalanobis distance cannot be
suitably used to distinguish between nominal and
anomalous patterns in the displayed ECG data. Only after
applying our approach is a better signal-to-noise ratio
established, which allows the anomalies to be perfectly
separated from the nominal points. For most of the investigated ECG
readings we found that the window-based error correction
significantly improves the signal-to-noise ratio in the
Mahalanobis distance. Table 5 shows that the average F1-score
increases from F1 = 0.50 without window-based error
correction to F1 = 0.81 with it.</p>
        <p>In Table 6, various measures are listed for the individual
anomaly classes. The anomaly types a, F and x can all be
detected by LSTM-AD. Also for the anomaly class V a
high recall can be achieved. However, the two remaining
types appear to be hard to detect for our algorithm.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion &amp; Future Work</title>
      <p>We have presented a fully unsupervised method to detect
anomalies in ECG readings. This method relies on an
accurate LSTM predictor to learn the nominal behavior of
the ECG for several prediction horizons. By learning the
error distribution between predicted and perceived values,
a multivariate normal error model for the nominal data is
built. When applying the model, anomalous events have a
high probability of being detected through an unusually high
Mahalanobis distance.</p>
      <p>Our method is unsupervised in the sense that no
anomaly class labels are needed for training the algorithm.
In fact, it is not even necessary that anomalous events are
present at all in the training data, i.e. our algorithm can
operate as a one-class classifier. We checked this by
repeating the experiment leading to Table 3, but this time
removing all data around anomalies during LSTM
training. When using the trained model as anomaly detector
on all data, it worked as accurately as in Table 3, with the mean
F1-score now being F1 = 0.83.</p>
      <p>We achieve these high precision,
recall and F1 values for the ECG readings (on average higher than 80%, see
Table 3) if we tune the final threshold for the Mahalanobis
distance such that F1 is maximized. Admittedly, this last
step is not unsupervised, since we calculate the confusion
matrix based on the true anomaly labels.</p>
      <p>The alternative unsupervised case based on Rosner’s
test (Sec. 2.5 and Table 4) is weaker in terms of precision
and recall. This may be due to the fact that the current
error data do not fulfill the assumption of being normally
distributed, so the assumption of a χ²-distribution is
violated as well. As a result, the χ²-criterion yields no
useful thresholds.</p>
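      <p>For reference, if the nominal squared Mahalanobis distances did follow a χ²-distribution with d degrees of freedom (d being the number of prediction horizons), an unsupervised threshold could be read off a χ² quantile. A minimal sketch with illustrative parameters (d and the false-alarm rate alpha are our assumptions, not values from the paper):</p>

```python
import numpy as np
from scipy.stats import chi2

d = 4          # number of prediction horizons (illustrative choice)
alpha = 1e-4   # target nominal false-alarm rate (our assumption)

# Quantile of the chi^2(d) distribution: an unsupervised threshold
# that a fraction alpha of truly chi^2-distributed scores exceeds.
threshold = chi2.ppf(1.0 - alpha, df=d)

# Sanity check on synthetic chi^2(d) "nominal" scores.
rng = np.random.default_rng(1)
nominal_scores = rng.chisquare(d, size=100_000)
print(np.mean(nominal_scores > threshold))  # close to alpha
```

      <p>When the real error data deviate from normality, as observed above, the empirical exceedance rate of such a threshold no longer matches alpha, which is exactly the failure mode described for the χ²-criterion.</p>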
      <p>Note that the measure ‘Mahalanobis distance’ has the
same discriminative power in both cases; it is only that the
final threshold is not adjusted optimally for the individual
time series in the alternative case. From the practitioner’s
perspective, it may be acceptable to start with a non-optimal
threshold and adjust it in a human-in-the-loop approach.
However, a fully unsupervised high-quality method would be
preferable.</p>
      <p>We have shown that the window-based error correction
is essential to achieve a Mahalanobis distance graph where
the anomaly cases clearly stand out (Fig. 2 and Table 5).</p>
      <p>Our LSTM-AD algorithm outperformed two
state-of-the-art anomaly detection algorithms (NuPic and
ADVec) on the investigated ECG readings, achieving higher
precision and recall over a large range of anomaly
thresholds.</p>
      <p>In this work we have presented first results of an
unsupervised anomaly detector suitable for ECG readings or
other quasi-periodic signals. The results are encouraging,
but there is still room for improvement. Possible future
work includes: 1) improving the modeling step such that the
nominal error distribution comes closer to a Gaussian shape
and hence the nominal Mahalanobis distance closer to a
χ²-distribution, so that the unsupervised extended Rosner
test can be expected to work better; 2) to this end,
addressing the problem of non-stationarity in the ECG
readings, e.g. by applying suitable preprocessing steps to
reduce the effect of signal quality changes; 3) enriching the
model with multi-resolution approaches to span larger
prediction horizons on a coarser scale.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          et al.
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation</source>
          , OSDI'
          <volume>16</volume>
          , pages
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          , Berkeley, CA, USA,
          <year>2016</year>
          . USENIX Association.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          .
          <article-title>Running swarms</article-title>
          . http://nupic.docs.numenta.org/0.6.0/guide-swarming.html, May
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Anomaly detection: A survey</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ):
          <fpage>15</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Vig</surname>
          </string-name>
          .
          <article-title>Anomaly detection in ECG time signals via deep long short-term memory networks</article-title>
          .
          <source>In Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference on</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          et al. Keras. https://keras.io,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Chuah</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Fu</surname>
          </string-name>
          .
          <article-title>ECG anomaly detection via time series analysis</article-title>
          .
          <source>In International Symposium on Parallel and Distributed Processing and Applications</source>
          , pages
          <fpage>123</fpage>
          -
          <lpage>135</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sanz-Bobi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>del Pico</surname>
          </string-name>
          .
          <article-title>SIMAP: Intelligent system for predictive maintenance: Application to the health condition monitoring of a windturbine gearbox</article-title>
          .
          <source>Computers in Industry</source>
          ,
          <volume>57</volume>
          (
          <issue>6</issue>
          ):
          <fpage>552</fpage>
          -
          <lpage>568</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>George</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          .
          <article-title>Towards a mathematical theory of cortical micro-circuits</article-title>
          .
          <source>PLoS Comput Biol</source>
          ,
          <volume>5</volume>
          (
          <issue>10</issue>
          ):
          <fpage>e1000532</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Goldberger</surname>
          </string-name>
          et al.
          <article-title>Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals</article-title>
          .
          <source>Circulation</source>
          ,
          <volume>101</volume>
          (
          <issue>23</issue>
          ):
          <fpage>e215</fpage>
          -
          <lpage>e220</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          et al.
          <article-title>LSTM: A search space odyssey</article-title>
          .
          <source>IEEE transactions on neural networks and learning systems</source>
          ,
          <volume>28</volume>
          (
          <issue>10</issue>
          ):
          <fpage>2222</fpage>
          -
          <lpage>2232</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.-B.</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <article-title>Evolutionary neural networks for anomaly detection based on the behavior of a program</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>559</fpage>
          -
          <lpage>570</lpage>
          ,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          .
          <article-title>The vanishing gradient problem during learning recurrent neural nets and problem solutions</article-title>
          .
          <source>International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems</source>
          ,
          <volume>6</volume>
          (
          <issue>02</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>116</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. U.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hossain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Andersson</surname>
          </string-name>
          .
          <article-title>A novel anomaly detection algorithm for sensor data under uncertainty</article-title>
          .
          <source>Soft Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1623</fpage>
          -
          <lpage>1639</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          .
          <article-title>Outlier detection for multidimensional time series using deep neural networks</article-title>
          .
          <source>In 2018 19th IEEE International Conference on Mobile Data Management (MDM)</source>
          , pages
          <fpage>125</fpage>
          -
          <lpage>134</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          .
          <article-title>Evaluating real-time anomaly detection algorithms - the Numenta anomaly benchmark</article-title>
          .
          <source>In IEEE Conference on Machine Learning and Applications (ICMLA2015)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Lemos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tierra-Criollo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Caminhas</surname>
          </string-name>
          .
          <article-title>ECG anomalies identification using a time series novelty detection technique</article-title>
          .
          <source>In IV Latin American Congress on Biomedical Engineering</source>
          <year>2007</year>
          ,
          <article-title>Bioengineering Solutions for Latin America Health</article-title>
          , pages
          <fpage>65</fpage>
          -
          <lpage>68</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          et al.
          <article-title>Long short term memory networks for anomaly detection in time series</article-title>
          .
          <source>In 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          et al.
          <article-title>LSTM-based encoder-decoder for multi-sensor anomaly detection</article-title>
          .
          <source>arXiv preprint arXiv:1607.00148</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Moody</surname>
          </string-name>
          and R. G. Mark.
          <article-title>PhysioNet: The MIT-BIH Arrhythmia Database</article-title>
          . https://www.physionet.org/physiobank/database/mitdb/,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Moody</surname>
          </string-name>
          and R. G. Mark.
          <article-title>The impact of the MIT-BIH Arrhythmia Database</article-title>
          .
          <source>IEEE Engineering in Medicine and Biology Magazine</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Rai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shukla</surname>
          </string-name>
          .
          <article-title>ECG signal processing for abnormalities detection using multi-resolution wavelet transform and artificial neural network classifier</article-title>
          .
          <source>Measurement</source>
          ,
          <volume>46</volume>
          (
          <issue>9</issue>
          ):
          <fpage>3238</fpage>
          -
          <lpage>3246</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Rosner</surname>
          </string-name>
          .
          <article-title>Percentage points for a generalized ESD many-outlier procedure</article-title>
          .
          <source>Technometrics</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ):
          <fpage>165</fpage>
          -
          <lpage>172</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sivaraks</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ratanamahatana</surname>
          </string-name>
          .
          <article-title>Robust and accurate anomaly detection in ECG artifacts using time series motif discovery</article-title>
          .
          <source>Computational and mathematical methods in medicine</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Stibor</surname>
          </string-name>
          et al.
          <article-title>Is negative selection appropriate for anomaly detection?</article-title>
          <source>In Proceedings of the 7th annual conference on Genetic and evolutionary computation</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>328</lpage>
          . ACM,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Thill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Konen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Bäck</surname>
          </string-name>
          .
          <article-title>Time series anomaly detection with discrete wavelet transforms and maximum likelihood estimation</article-title>
          .
          In O. Valenzuela, I. Rojas, et al., editors,
          <source>International Conference on Time Series (ITISE)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hochenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kejariwal</surname>
          </string-name>
          .
          <article-title>A novel technique for long-term anomaly detection in the cloud</article-title>
          .
          <source>In 6th USENIX Workshop on Hot Topics in Cloud Computing</source>
          , Philadelphia, PA,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>